
Bruno Cuconato Claro

A computational grammar for Portuguese

Rio de Janeiro 2019

Bruno Cuconato Claro

A computational grammar for Portuguese

Dissertação submetida à Escola de Matemática Aplicada como requisito parcial para a obtenção do grau de Mestre em Modelagem Matemática

Fundação Getulio Vargas Escola de Matemática Aplicada Mestrado em Modelagem Matemática Ênfase em Modelagem e Análise da Informação

Supervisor: Alexandre Rademaker

Rio de Janeiro 2019 Dados Internacionais de Catalogação na Publicação (CIP) Ficha catalográfica elaborada pelo Sistema de Bibliotecas/FGV

Claro, Bruno Cuconato A computational grammar for Portuguese / Bruno Cuconato Claro. – 2019.

112 f.

Dissertação (mestrado) – Fundação Getulio Vargas, Escola de Matemática Aplicada. Orientador: Alexandre Rademaker. Inclui bibliografia.

1. Linguística - Processamento de dados. 2. Teoria dos tipos. 3. Processamento da linguagem natural (Computação). I. Rademaker, Alexandre. II. Fundação Getulio Vargas. III. Escola de Matemática Aplicada. IV. Título.

CDD – 006.35

Elaborada por Maria do Socorro Almeida – CRB-7/4254

Acknowledgements

I thank my significant other for the love, patience, and help. You know well how much you helped me through this.

I thank my family for the love and support. The choices you made for me in the past allowed me to choose this path now.

I thank my advisor Alexandre Rademaker for the many ideas, discussions, and support. I have learned a multitude of things under your guidance, only some of which appear here.

I thank professor Flávio Coelho for introducing me to the Unix tradition. There hasn't been a day where I don't use what I learned with/through the book you lent me.

I thank professor Paulo Carvalho for the help, advice, and teachings – your introduction to mathematics still echoes in everything I've done since then.

I thank Cirlei de Oliveira, Elisângela Santana, Conceição Lima, Cristiane Guimarães, and Mônica Souza for the help, support, and conversations.

I thank my friends and colleagues for the companionship and the conversations (which more than once led to interesting ideas, even if we were just talking nonsense). Especially I'd like to mention Laura Sant'Anna, Kátia Nishiyama, Henrique Muniz, Guilherme Passos, Pedro Delfino, Harllos Arthur, Alessandra Cid, Alexandre Tessarollo, João Carabetta, Fernanda Scovino.

Finally, I'd like to say I'd have given up on this dissertation's idea if not for the extreme patience and kindness of a former stranger who is now a dear friend – thank you, Inari Listenmaa.

“Look at the raw PGF”

Abstract

In this work we present a freely-available type-theoretical computational grammar for Portuguese, implemented in the Grammatical Framework (GF) multilingual formalism. Such a grammar can be used for both syntactic parsing and natural language generation. We first describe the formalism itself, discussing its logico-mathematical foundations; we then present our grammar. We evaluate our grammar’s productions with respect to syntactical correctness, show possible applications, and discuss future work.

Keywords: Type theory. Natural Language Processing. Computational linguistics. Natural Language Generation. Grammar engineering.

Resumo

Este trabalho descreve a criação de uma gramática computacional do Português imple- mentada no formalismo Grammatical Framework. Nele apresentamos o formalismo e a nossa gramática. Avaliamos nossa gramática com respeito à corretude sintática de suas produções, demonstramos possíveis aplicações, e discutimos trabalhos futuros.

Palavras-chave: Teoria de tipos. Processamento de Linguagem Natural. Linguística computacional. Geração de Linguagem Natural. Engenharia de gramáticas.

List of Figures

Figure 1 – Constituent analysis of a sentence
Figure 2 – A functor implementation of Haskell and Lisp list syntaxes
Figure 3 – The PGF grammar for Haskell lists
Figure 4 – A context-free approximation for Haskell lists without an intermediary category for empty lists
Figure 5 – Parse for the string [x , x] using CFG approximation with on-the-fly specialization
Figure 6 – Parsing deduction rules
Figure 7 – Parse deduction for the string [x , x]
Figure 8 – Failed parse deduction for the string [x , x]: unexpected token
Figure 9 – Failed parse deduction for the string [x , x]: unexpected token
Figure 10 – German Prime
Figure 11 – RGL module structure (condensed)
Figure 12 – GF RGL category system
Figure 13 – Module structure of the Portuguese mini resource grammar
Figure 14 – Abstract trees for John is from the city; the hole in the diamond-shaped node can be filled by both UseComp and UseComp_estar
Figure 15 – Abstract syntax tree for what did is important
Figure 16 – Two analyses for the sentence John is not a doctor
Figure 17 – Abstract syntax tree for John saw no animals
Figure 18 – Screenshot of the DG demo app
Figure 19 – Different kinds of trees
Figure 20 – Converting decorated parse tree to a dependency tree
Figure 21 – Dependency trees converted from GF trees

List of Tables

Table 1 – Inflection table for the noun gramática
Table 2 – RGL and Romance tenses, and how they inflect in a Portuguese sentence

Listings

Listing 1.1 – Example GF code listing
Listing 2.1 – Abstract grammar Foods
Listing 2.2 – Concrete English grammar for Foods
Listing 2.3 – Abstract syntax for linked lists in GF
Listing 2.4 – GF shell session with abstract module only
Listing 2.5 – Lisp list linearization types
Listing 2.6 – Lisp list linearization rules
Listing 2.7 – GF shell session with single concrete module
Listing 2.8 – Haskell List linearization types
Listing 2.9 – Haskell List linearization rules
Listing 2.10 – Cons linearization rule using GF tables
Listing 2.11 – Translation between List concrete syntaxes
Listing 2.12 – Cons using a consWith oper
Listing 2.13 – List syntax interface module
Listing 2.14 – List functor module
Listing 2.15 – Haskell list syntax instantiation
Listing 2.16 – Haskell list functor instantiation
Listing 2.17 – Lisp list syntax instantiation
Listing 2.18 – Lisp list functor instantiation
Listing 3.1 – A hand-written rule
Listing 3.2 – Using the RGL constructors directly
Listing 3.3 – Using the RGL API
Listing 3.4 – mkCl overloaded function in the RGL API
Listing 3.5 – Prime predicate in Portuguese and English
Listing 3.6 – prime_A in German, Portuguese, and English
Listing 3.7 – Portuguese constructors
Listing 3.8 – Concrete representation of and how they are built in the Portuguese mini-resource
Listing 3.9 – Portuguese parameter definitions in the Portuguese mini-resource
Listing 3.10 – Concrete representation of and noun in the Portuguese mini-resource
Listing 3.11 – Linearization of the UsePron constructor in the Portuguese mini-resource
Listing 3.12 – GF parameter representing forms in the Portuguese mini-resource
Listing 3.13 – GF verbal concrete representations in the Portuguese mini-resource
Listing 3.14 – GF verbal complementation in the Portuguese mini-resource
Listing 3.15 – GF building in the Portuguese mini-resource
Listing 3.16 – The Portuguese lexical constructor for noun gramática
Listing 3.17 – Naive smart paradigm for verbs
Listing 3.18 – Portuguese verb smart paradigm
Listing 3.19 – The GenRP constructor from Extend
Listing 3.20 – The AdjAsCN constructor from Extend
Listing 3.21 – The AdvImp constructor from ParseExtend
Listing 3.22 – A function mapping temporal order and anteriority to verb forms
Listing 4.1 – Dependency configurations for a few RGL functions
Listing 4.2 – Raw output of the GF to UD conversion for the sentence "há uma vaca na floresta"
Listing 4.3 – Partially corrected output of the GF to UD conversion for the sentence "há uma vaca na floresta"

Contents

1 INTRODUCTION
1.1 Motivation
1.2 Scope and Contributions
1.3 Structure
1.4 Typesetting Conventions

2 GRAMMATICAL FRAMEWORK
2.1 A GF tutorial
2.1.1 A simple GF grammar
2.1.2 Refactoring
2.2 GF, mathematically
2.2.1 Definitions
2.2.2 From GF to PGF
2.2.3 Parsing
2.2.3.1 The example grammar
2.2.3.2 The algorithm
2.2.3.3 Parsing as deduction
2.2.3.3.1 Production items
2.2.3.3.2 Active items
2.2.3.3.3 Passive items
2.2.3.3.4 Initial predict
2.2.3.3.5 Predict
2.2.3.3.6 Scan
2.2.3.3.7 Complete
2.2.3.3.8 Combine
2.2.3.4 An example parse
2.2.4 Linearization

3 A COMPUTATIONAL GRAMMAR FOR PORTUGUESE
3.1 The GF resource grammar library
3.1.1 Motivation
3.1.2 Usage
3.1.3 Structure
3.2 The Portuguese resource grammar
3.2.1 A Portuguese miniresource grammar
3.2.2
3.2.3 Extra modules
3.2.3.1 Morphological modules
3.2.3.2 Grammar extensions
3.3 Evaluation
3.3.1 UD examples corpus
3.3.1.0.1 Copular verbs and
3.3.1.0.2 type in of copula
3.3.1.0.3 Incorrect trees
3.3.1.0.4 Romance clause
3.3.1.0.5 Romance pronouns
3.3.1.0.6 The no quantifier
3.3.2 Matrix MRS test suite
3.3.2.1 Discussion
3.3.2.1.1 Mismatched tenses
3.3.2.1.2 Whose as
3.3.2.1.3 nouns
3.3.2.1.4 placement
3.3.2.1.5 Incorrect tense choices
3.3.2.1.6 Tag
3.3.2.1.7 Date and time units
3.3.2.1.8 Incomplete sentences
3.3.2.1.9 Iberic negative imperatives

4 APPLICATIONS
4.1 Health-domain application grammar
4.2 Attempto Controlled English and GF
4.2.1 ACE
4.2.2 ACE-in-GF
4.2.3 Implementation and example usage
4.3 GF to UD
4.3.1 Trees
4.3.1.0.1 Abstract syntax trees
4.3.1.0.2 Parse trees
4.3.1.0.3 Dependency trees
4.3.1.0.4 Abstract dependency trees
4.3.2 The algorithm
4.3.2.0.1 Example
4.3.2.0.2 Corpus
4.3.3 Discussion
5 CONCLUSION
5.1 Future Work
5.1.1 Known issues with the Portuguese RG
5.1.1.0.1 Preposition + contractions
5.1.1.0.2 Enclitic and mesoclitic pronoun contractions
5.1.1.0.3 Preposition + pronouns contractions
5.1.1.0.4 Reflexive two-place adjectives
5.1.2 Possible applications
5.1.2.0.1 WordNet gloss corpus
5.1.2.0.2 Data as text

BIBLIOGRAPHY

1 Introduction

This thesis presents a computational grammar of Portuguese developed under the Grammatical Framework (GF) formalism. GF is a programming language designed for grammar writing, whose aim is to diminish the cost of creating grammars: it features a robust type system to catch errors during compile time (versus having these errors show up at run time), an elegant functional programming approach to describing grammars in a generative way [30], a standard library to share and encapsulate common aspects of all natural language grammars [32], and a common, shared ecosystem including the GF compiler, two runtime systems, and a testing tool able to generate minimal and representative test cases for grammatical constructions of choice [23]. Natural language processing tools such as grammars are often faced with a coverage versus correctness tradeoff: tools like Google Translate will produce output for any given input, but their results are often bad (although improving by the day); tools like GF grammars, on the other hand, will not be able to produce output for every input, but when they do they are often correct, or at least easily fixed in most cases (we give several examples of errors that were corrected in section 3.3). Because of GF's lack of coverage, its main goals are those of domain-specific natural language processing – where the vocabulary and linguistic phenomena are more restricted – and of natural language generation (NLG). Its ideal application in this respect is that of generating grammatically-correct natural language from data such as ontologies or database entries. Our GF grammar has the same characteristics as GF as a whole: it is unsuitable for general purpose natural language processing, and thus aims to

1. generate grammatically-correct Portuguese;

2. provide an application-programmer interface capable of supporting domain-specific grammars and applications.

1.1 Motivation

A grammar is a set of rules that govern a language. It tells us how to combine and compose sentences from its constituents. A computational grammar is an encoding of such rules in a way that allows a computer to analyse sentences into their constituents (an analysis which is often represented as a tree like the one in figure 1), or to generate sentences according to these rules (example (1) shows a randomly generated sentence from our grammar). The act of analysing a sentence string to produce analysis trees is called parsing, while transforming an analysis tree into a sentence string is called linearization (or generation). Note that parsing may successfully analyse the sentence string in more than one way or not at all, while the linearization of a well-formed tree will produce at least one sentence string.

Figure 1 – Constituent analysis of a sentence

(1) deixe a maior parte de distância ser quebrada, poucos taças de ninguém, distâncias, distâncias a garrafa de muitos copos de ninguém e copos de todo a maior parte de algo falado diretos que não teria sido doente se tais que teria tido fome, que não teriam sido assustados então que serão prontos fáceis a si tais que teria havido francês e se que não tinham estado errados ou que não teriam estado certos, tais que tudo não se chamaria nada, que não serão prontos então que não teriam sido, ou tais que alguém não se teria chamado todos ou tais que ninguém não terá sido ou ou que terão sido prontos, que tinham estado certos ou que estão certos, tais que que todos não terão sido casados com nada porque todos teriam tido sede se tinha tornado mais velho que este esquerda de ninguém e tanto que têm sede eles mesmos e tais que não estava certo quanto tais que teria feito tanto casado

Grammars are useful because the text analyses they produce – be they syntactic or semantic – are a possible way of understanding texts automatically. Of course, grammars are not the only way of producing such analyses (statistical methods of doing so are actually more popular), nor are they the only way of achieving this ultimate goal of artificial intelligence. Grammars are useful for their introspectability (it is relatively simple to see where mistakes stem from and how to fix them, compared to statistical methods) and for their portability (they encode a lot of information in relatively little space). Grammars do have their problems in incompleteness (they cannot parse any given text), cost (writing a grammar is a lot of work) and performance (parsing text can be an expensive operation time- and space-wise). However, as Ranta puts it [33, section 1.3], even though grammars (as abstractions of natural language) leak, all we need to justify them is that they are useful. And their usefulness comes from a very common-sense principle (the title of [39]): don't guess if you know. This means that there is no need for us to try to determine statistically (i. e., guess) the behaviour of regular, well-known patterns of language (like subject-verb agreement, lemmatization, etc.). Let us employ these methods for phenomena which are highly irregular (or which are simply out of our grammar's scope) instead.

1.2 Scope and Contributions

In this work we present a Portuguese grammar in the GF formalism whose linearizations are tested against two GF treebanks. The intended minimal coverage of this grammar is that of all GF Resource Grammar Library constructors (see section 3.1 for details), only including its extensions where they might be needed to generate our test treebanks of choice correctly. A list of the linguistic constructors covered by the RGL can be found at its synopsis1.

Our foremost contribution is the freely-available GF Portuguese grammar (a preview of which appeared in [7]). While unsuitable for general domain parsing, this grammar was able to generate grammatical sentences from the syntactic trees in our test treebanks in more than 90% of the cases (see section 3.3 for details). In the process of developing it, we have made contributions to the Romance grammar it is based on, which in turn meant contributions to languages such as Spanish and Catalan. We have also made non-academic contributions to the GF community, providing documentation patches and maintenance work (we set up continuous integration testing for the GF repositories and provided documentation fixes, for example).

A complete Portuguese grammar involves not only syntactic coverage but also lexical coverage. To be able to provide a GF dictionary we needed a Portuguese lexicon. Our work in developing MorphoBr [8], a full-form lexicon of Portuguese open class words, was inspired by this need.

Because of our work [6, 12] involving the Portuguese WordNet, we were naturally inclined to contribute to the gf-wordnet project. We have produced a Portuguese instantiation of this project, based on our GF Portuguese grammar and on the Portuguese WordNet [11]. Although the project (and its Portuguese version) is still ongoing, we were also able to contribute corrections to the Portuguese WordNet.

Finally, we have also made three existing GF applications support the Portuguese language by virtue of our Portuguese grammar. Although most of them are more prototypes than industrial-grade applications, they should give an idea of what GF is capable of. One of these applications is the GF to Universal Dependencies conversion developed by Kolachina & Ranta [20]. This conversion project has created a GF treebank which is used to test both the project itself and the GF grammar library. On top of adding support for Portuguese in this project, we have also revised its treebank, contributing tree corrections and removing duplicate trees.

1 http://www.grammaticalframework.org/lib/doc/synopsis/index.html

1.3 Structure

In chapter 2 we aim to acquaint the reader with GF. We first present the ideas behind GF; then we provide an example-based tutorial to familiarize the reader with GF syntax and semantics. We then describe what GF functors are, and how they can be used to generalize grammatical notions, promoting sharing between different concrete implementations. We conclude with a more in-depth view of GF, where we follow Angelov [1] in defining GF mathematically, including how its parsing and linearization work formally. Understanding this last part is not required to understand the remainder of this thesis.

Chapter 3 is the backbone of this work. It presents the GF Resource Grammar Library (RGL), discussing the motivation behind its existence, its uses, and its structure. We then describe our Portuguese implementation of a resource grammar, which is based on the RGL Romance functor. We fully describe a miniature version of a resource grammar for Portuguese, and then discuss the main points of the full implementation, including morphology and extension modules. Finally, we evaluate our work with respect to syntactically correct language generation by linearizing two treebanks from well-known computational linguistics projects, analysing errors and discussing potential solutions.

Chapter 4 presents three pre-existing GF applications which have been extended to use our Portuguese grammar implementation. The first application is a demonstration of a health-domain translation chat app developed by a company run by the GF core development team. The second application is a mapping from the Attempto Controlled English controlled natural language to GF, which allows one to use GF and the RGL to both parse and linearize ACE-like text in languages other than English itself. The last application is a converter of GF abstract syntax trees to Universal Dependencies (UD) trees [20], which allows GF to be used as a rule-based dependency parser, or as a bootstrapper of UD treebanks, among other uses. Our focus is on showing how simple it is to add a new language to a GF multilingual app, given a resource grammar for the language is available.

Finally, in chapter 5 we summarize our work and discuss future projects.

1.4 Typesetting Conventions

Throughout this thesis we discuss linguistic examples. Single examples are usually typeset inline in italic font, as are linguistic terms. More elaborate examples are typeset as two or three pieces of text like in example (2). If there are just two text fragments, the first is an English version, and the second a Portuguese one – whether it is produced from our grammar or is simply an example should be clear from the context. When there are three fragments, the first is the English version, and the following two fragments are Portuguese. It should be clear from the context whether the first Portuguese fragment is the final grammar output – in which case the second fragment is either the ideal output or another possible version – or whether the first Portuguese fragment was the original grammar output before a correction was made, which then produced the second Portuguese fragment.

(2) a. I always thought that there was something fundamentally wrong with the universe

b. eu sempre pensei que havia algo de fundamentalmente errado com o universo

When an example is ungrammatical, we prefix it with an asterisk (*); when the example is grammatical but is not the ideal one (where the sense of ‘ideal’ is given by the surrounding context), we prefix it by an exclamation mark (!). Finally, when an example is semantically incorrect but otherwise grammatical, we prefix it by a hash sign (#). We also typeset inline GF elements differently – including judgement names, modules, constructors, categories, and functions. GF code is shown as in listing 1.1.

-- this is a comment
fun Hello x = {s = "Hello" ++ x.s} ;
Listing 1.1 – Example GF code listing

2 Grammatical Framework

We begin this chapter with an overview of GF. We proceed to a short tutorial on GF, whose purpose is to make the reader familiar with GF syntax and semantics. We conclude the chapter by giving a mathematical definition of an important subset of GF, and showing formally how parsing and linearization are done in it.

Grammatical Framework (GF) is a domain-specific programming language for grammar writing. It is a functional programming language, with syntax inspired by the Haskell programming language [25]; it draws from intuitionistic type theory [26] for its type system. A GF program is called a grammar, and it defines parsing, generation and translation from the same declarative source. GF's forte lies in multilingual processing. It applies to natural languages the distinction made for programming languages: that of an abstract and a concrete syntax. Separating them allows GF to specify a single abstract grammar for several concrete languages. Translation between two natural languages, for instance, becomes parsing of concrete syntax to its abstract representation, and then further linearization to the target language.

Foods> parse -lang=Eng "this fish is fresh"
Pred (This Fish) Fresh
Foods> parse -lang=Por "este peixe é fresco"
Pred (This Fish) Fresh
Foods> linearize -all Pred (This Fish) Fresh
this fish is fresh
este peixe é fresco

The idea is that both sentences above carry the same (abstract) information, which is represented in each language by the respective string. A GF grammar demands a description of its abstract syntax and how this syntax translates into the concrete syntaxes, i.e., the strings. The abstract syntax encompasses both trees (such as [Pred (This Fish) Fresh]) and their construction rules (see listing 2.1 for an example of an abstract GF grammar). The concrete syntaxes, on the other hand, must specify rules for the translation of trees into strings of the desired language. Although natural languages are the main focus, GF can and has been used to generate any kinds of strings, e.g. LaTeX code. Listing 2.2 shows an English concrete grammar for the abstract grammar of listing 2.1; both are a version of the Foods grammar appearing in GF documentation and literature, for example in [33, chapter 3].

abstract Foods = {
  cat
    Comment ; Item ; Kind ; Quality ;
  fun
    Pred : Item → Quality → Comment ;
    This : Kind → Item ;
    Fish : Kind ;
    Fresh : Quality ;
} ;
Listing 2.1 – Abstract grammar Foods.

concrete FoodsEng of Foods = {
  lincat
    Comment, Quality = {s : Str} ;
    Kind = {s : Number ⇒ Str} ;
    Item = {s : Str ; n : Number} ;
  lin
    Pred item quality = {s = item.s ++ copula ! item.n ++ quality.s} ;
    This = det Sg "this" ;
    Fish = noun "fish" "fish" ;
    Fresh = adj "fresh" ;
  param
    Number = Sg | Pl ;
  oper
    det : Number → Str → {s : Number ⇒ Str} → {s : Str ; n : Number} =
      λn,det,noun → {s = det ++ noun.s ! n ; n = n} ;
    noun : Str → Str → {s : Number ⇒ Str} =
      λman,men → {s = table {Sg ⇒ man ; Pl ⇒ men}} ;
    regNoun : Str → {s : Number ⇒ Str} =
      λcar → noun car (car + "s") ;
    adj : Str → {s : Str} = λcold → {s = cold} ;
    copula : Number ⇒ Str = table {Sg ⇒ "is" ; Pl ⇒ "are"} ;
} ;
Listing 2.2 – Concrete English grammar for Foods.

2.1 A GF tutorial

In this short tutorial we give an overview of GF as a programming language. We refrained from using an example in the natural language domain, because any such example would need to be much simplified for our current purpose of presenting GF syntax and semantics. Therefore we opt for a complete example which has the added benefit of being familiar to most programmers. The linguistics-inclined reader will have to wait until chapter 3 for linguistically-motivated GF code. By the end of this section we expect the reader to be able to read and understand GF with reasonable fluency, even if writing it is another matter. Despite this stated goal, we still offer information which might be useful to the reader who intends to learn GF, e. g., how to use the GF shell. To this kind of reader we recommend following the examples on your computer, with GF installed. The definitive resource to learn GF is the GF book [33]; the freely-available online reference1, from where we take many of the definitions in this section, is also useful to look up syntactic or semantic details; finally, Inari Listenmaa's blog post2 is a great resource to avoid common GF pitfalls.

2.1.1 A simple GF grammar

In this section we present a grammar for linked lists, capable of translating between two different syntaxes for them. The Lisp syntax uses () for the empty list and represents a non-empty list as a series of elements separated by spaces, delimited by parentheses on both ends, like so: (0 1 1 2 3 5). The Haskell syntax uses [] for the empty list, while non-empty lists are elements separated by commas (with optional spaces), delimited by square brackets on both ends, as in [0,1,1,2,3,5]. Linked lists are a data structure which is defined recursively. A list is either:

(1) the empty list (also called Nil); or

(2) a list with an element added at the front.

Because adding an element to the front of a list is how you construct all lists except for the empty one, this process is called Cons. Given this definition, Cons X (Cons X (Cons X Nil)) is an abstract tree representing the three-element list which in Lisp syntax is (x x x). A GF grammar is composed of an abstract syntax and any number of concrete syntaxes. While the abstract syntax declares categories and how they can be combined as in a logical framework, the concrete syntaxes define the linearization of the created abstract syntax trees to strings.

1 http://www.grammaticalframework.org/doc/gf-refman.html
2 https://inariksit.github.io/gf/2018/08/28/gf-gotchas.html

Each GF module is composed of a header3 and a body, which is simply a set of judgements. The GF compiler demands that the programmer specify what form of judgement she is making by using a judgement keyword, so, e.g., we precede category declarations with cat. We may specify a judgement keyword for each judgement, or (more conveniently) we may omit subsequent judgement keywords of the same form. Each judgement is separated by a semicolon (;). Listing 2.3 shows an abstract syntax for linked lists in GF syntax. Note how it follows the recursive definition of lists. We use the cat judgement to declare categories; it has the form cat C.4 In this example, we define the List category that we are interested in, and a dummy element category Elem. We also add the S (for sentence) category, which is special in GF in that it is the default start category for a grammar. The start category is the category we start parsing from. The fun judgement declares how we can construct a member of a category from members of other categories; it has the form fun f : T, where T is a type built from basic types (i. e., declared categories and built-in categories like Str) using the type constructor →. We declare two zero-argument constructors (which are constants) and the Cons constructor, which is the same as in the recursive definition of lists. To be able to parse lists from the start category S, we add the LS constructor, which lifts lists to the sentence category.

abstract List = {

cat Elem ; List ; S ;

fun X : Elem ; Nil : List ; Cons : Elem → List → List ; LS : List → S;

} ;
Listing 2.3 – Abstract syntax for linked lists in GF.

We can already import the file List.gf5 in the GF shell, and play with the generate_trees and generate_random commands. Note that all GF shell commands have shorter aliases. To know more about GF shell commands, call help for a summary of all available commands, or help [command], for help on the command [command].

3 Module headers specify what kind of module it is, imports, and other things; see the GF reference at [33] or http://www.grammaticalframework.org/doc/gf-refman.html for more about headers.
4 This form is actually a simplification; for the dependently-typed subset of GF we use more elaborate category declarations.
5 Note that GF files must have the same name as the module they contain.

> i List.gf
List> gt -number=3
Cons X (Cons X (Cons X (Cons X Nil)))
Cons X (Cons X (Cons X Nil))
Cons X (Cons X Nil)
Listing 2.4 – GF shell session with abstract module only.
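The grammar can also be exercised with random generation. The short session below is our own illustration and not one of the thesis listings: gr is the GF shell alias for generate_random, and since the trees are chosen at random the exact output will differ from run to run.

List> gr -cat=List -number=2
Cons X (Cons X Nil)
Cons X Nil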

In order to implement a concrete syntax for this abstract syntax, we must ask ourselves how we want to linearize each constructor defined in the abstract syntax. If we take Lisp lists as an example, we want to have

Nil <=> ()
Cons X (Cons X (Cons X Nil)) <=> (x x x)

In order to achieve this behaviour, we first have to define the linearization types of the categories declared in the abstract module, using the lincat judgement keyword, as in listing 2.5.

concrete ListLisp of List = {

  lincat
    Elem, S = {s : Str} ;
    List = {bp, b : Str} ;
Listing 2.5 – Lisp list linearization types.

lincat judgements are composed of category names, an equal sign (=), and a valid linearization type. A linearization type can be either a basic type, a parameter type (user- or library-defined), a table and/or records of those, but not a function type (we will explain these shortly). We decide on the Elem and S types being a record type with one field of Str type, and on the List type being a record with two string fields, bp to store the beginning parenthesis, and b to store the body of the list (see listing 2.5).6 If more than one category share a linearization type, we can define them in a single judgement by separating the category identifiers by commas.7 We may also omit the linearization type of a category, which gives it the default {s : Str} linearization type.

Listing 2.6 contains the actual linearizations. The dummy X element is linearized to "x", Nil is linearized to a record containing the strings for (), and Cons updates (∗∗) its second argument (the list argument) with a new body field consisting of the concatenation (++) of the string field of the element argument (x) and of the body field of the list argument (xs). The record extension operator ∗∗ is used to both extend and update records; it essentially performs the union of the records' fields. When the two records' field labels are not disjoint, the second record's fields get precedence, essentially updating the values of the first record's fields. Our use of ∗∗ could have been written equivalently as Cons x xs = {bp = xs.bp ; b = x.s ++ xs.b}. We finally define LS to simply be the concatenation of the beginning parenthesis with the body of the input list.

6 We could make Elem and S be simple Str types; however, it is a GF convention to use record types for their extensibility – it is much easier to add a field to an existing record than to change something into a record and then add a field to it.
7 Something analogous works in abstract modules for the declaration of several constructors of the same type.

  lin
    X = {s = "x"} ;
    Nil = {bp = "(" ; b = ")"} ;
    Cons x xs = xs ∗∗ {b = x.s ++ xs.b} ;
    LS xs = {s = xs.bp ++ xs.b} ;
Listing 2.6 – Lisp list linearization rules.

> i ListLisp.gf
List> parse -lang=Lisp "( x x x x )"
LS (Cons X (Cons X (Cons X (Cons X Nil))))
Listing 2.7 – GF shell session with single concrete module.

Now let’s define another concrete syntax, this time for Haskell lists. We want to have:

Nil <=> []
Cons X (Cons X (Cons X Nil)) <=> [x,x,x]

There is an issue here with the commas. We only want to insert a comma when there is an existing element in the input list (i. e., the input list is not the empty list). If we try to define Cons as something similar to the Lisp definition in listing 2.6, but with a comma in-between the concatenation of the elements (as in x.s ++ "," ++ xs.b), we will get the wrong linearization when the input list is empty. To solve this problem, we give lists a more complex linearization type, as can be seen in listing 2.8.8 For this, we introduce the param judgement, which is used to declare a new parameter type, together with its constructors. A parameter type is either introduced by a param judgement or is a record type whose fields are all parameter types. Parameter types must be finite (we may not have a parameter constructor take an infinite type such as a string as an argument) and thus non-recursive.

8 We could have solved this problem by creating a separate category of empty lists, but this has two downsides: it makes the Lisp concrete more complicated than it has to be, and it does not follow precisely the recursive definition of linked lists. An additional downside is that it prevents the introduction of useful GF concepts to the reader.

param Boolean = T | F ;

lincat
  Elem, S = {s : Str} ;
  List = {s : Str ; null : Boolean} ;
Listing 2.8 – Haskell List linearization types.

We define a boolean parameter whose values might be T and F (for true and false), and make it the type of a field in the linearization type of lists. This will allow us to give different linearizations to different values of the null field. The reader may have noticed that the bp field of Lisp lists was always the same; to remove this redundancy in the Haskell concrete we remove the bp field, and rename the b field to s. We use the LS rule to add the beginning square brackets of the lists. The linearization rules of Haskell linked lists can be seen in listing 2.9.

lin
  X = {s = "x"} ;
  Nil = {s = "]" ; null = T} ;
  Cons x xs = case xs.null of {
    T ⇒ {s = x.s ++ xs.s ; null = F} ;
    _ ⇒ xs ∗∗ {s = x.s ++ "," ++ xs.s}
  } ;
  LS xs = {s = "[" ++ xs.s} ;
Listing 2.9 – Haskell List linearization rules.

Cons's linearization rule introduces GF case expressions, which allow us to pattern-match on parameter values. GF case expressions are simply syntactic sugar for table selections. GF tables are finite functions of type P ⇒ T, where P is a parameter type and T is any type. Tables can be written with the table keyword like so: table { V1 ⇒ t1 ; ... ; Vn ⇒ tn }, where the V's are parameters and the t's are expressions of the output type of the table. Tables are finite functions because parameter types are themselves finite, so it is possible to enumerate all input-output pairs. You can apply a table to an expression that evaluates to an instance of its input parameter type using the table selection operator !.
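As a standalone illustration of the table syntax – our own sketch, not one of the thesis listings, and with a made-up module name CopulaDemo – the resource module below mirrors the copula oper from listing 2.2 and selects from it with ! :

resource CopulaDemo = {
  param
    Number = Sg | Pl ;
  oper
    -- a finite function from Number to Str, written as a table
    copula : Number ⇒ Str = table { Sg ⇒ "is" ; Pl ⇒ "are" } ;
    -- selecting from the table with the ! operator
    isOrAre : Number → Str = λn → copula ! n ;
} ;
-- e.g. isOrAre Pl evaluates to "are"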

Because case is just syntactic sugar for tables, the linearization of Cons in listing 2.9 is equivalent to the one in listing 2.10 (note the use of the record selection operator). The underscore (_) is called a wildcard pattern and matches any value.

Cons x xs = table {
    T ⇒ xs ∗∗ {s = x.s ++ xs.s ; null = F} ;
    _ ⇒ xs ∗∗ {s = x.s ++ "," ++ xs.s}
  } ! xs.null
Listing 2.10 – Cons linearization rule using GF tables.

In order to linearize Cons, then, we check if its list argument is the empty list by checking its null field; if it is, we update its s field with the concatenation of the element's string field and its current s field (which is just the closing square bracket). Its null field is also updated, to the value of F, indicating that the list is not empty anymore. If the list argument is not an empty list, we simply update its s field to be the concatenation of the string field of the element argument, a comma, and its previous s field. The other linearization rules are what one would expect. Now that we have two concrete syntaxes, we can translate between them by parsing from one concrete syntax to the abstract syntax and then linearizing the abstract tree to the target concrete syntax:

> i ListLisp.gf ListHask.gf
List> p -tr -lang=Lisp "( x x x )" | l -lang=Hask
Cons X (Cons X (Cons X Nil))
[ x , x , x ]
Listing 2.11 – Translation between List concrete syntaxes.

The -tr option of a GF shell command traces its output (i.e., prints it), while the pipe command (|) uses the output of the previous command as the input to the next one.

2.1.2 Refactoring

If we analyse the two concrete grammars we developed in section 2.1.1, we can see that they share many definitions, like that of the lincat of Elem or that of the LS constructor. As a functional programming language, GF has the means to avoid most code boilerplate and repetition. Two of the most prominent ways of doing so are by using functors and the oper judgement. In this section we refactor our implementation of the List grammar using both of these. Note that the gains of using these GF constructs in such a small grammar are negligible, but that they are very useful in larger grammars.

An oper is akin to a function definition in most programming languages. It is of the form oper h : T = t, where T is a type and t is a term of type T. The type can be omitted if the compiler is able to infer it, and type and definition can also be given by two separate oper judgements. Note that in many cases the term definition t of an oper will be an anonymous function (or lambda abstraction) of the form λx → t, which is a one-argument function whose application is computed by substituting its argument for x in t. Listing 2.12 shows the declaration and definition of an oper consWith that can be used to avoid the repetition in the two branches of the case expression in the previous definition of the Cons constructor, as is done in listing 2.14. It is also possible to define an oper using a shorthand like the one used by lin judgements.

oper consWith : Str → Elem → List → List =
  λsep,x,xs → xs ∗∗ {s = x.s ++ sep ++ xs.s ; null = False} ;
-- could also be defined as two separate judgements:
-- consWith : Str → Elem → List → List ;
-- consWith sep x xs = xs ∗∗ {s = x.s ++ sep ++ xs.s ; null = False} ;
Listing 2.12 – Cons using a consWith oper

In larger GF grammars it is customary to define parameters and functions in a resource module, which can then be imported by several concrete modules for use. The other element used in our refactoring of the List grammar is the functor. GF functor modules are inspired by the parametrized modules found in functional programming languages like ML.9 A GF functor module is any module that opens (imports) an interface module. An interface module is a module that only declares oper judgements, possibly defining them. The header syntax for a functor module is like below:

incomplete concrete ListI of List = open ListSyntax in

The incomplete keyword highlights the fact that a functor without its instantiation is not a valid (concrete) grammar. The ListSyntax module is the interface module that is opened by the functor. What allows wider use of functors is the possibility of using abstract syntax modules as interface modules, and concrete syntax modules as their instances, performing the following mapping:

cat C ↔ oper C : Type
fun f : A ↔ oper f : A
lincat C = T ↔ oper C : Type = T
lin f = t ↔ oper f : A = t

9 If you know the functional programming language Haskell, be warned: its functors are a different matter, not concerning modules at all.

To refactor our list grammar into a functor, we try to generalize our two original implementations, abstracting what they do not have in common into the interface module. In the case of lists, we know that the Haskell and Lisp syntaxes differ only in what kind of separator they use and what kind of boundary characters they use, so we end up with the ListI functor module in listing 2.14 and the ListSyntax interface module in listing 2.13. In order to build ListI, we took the most general implementation out of the two (that of Haskell), and changed a few strings to parametrized opers, which are declared in ListSyntax. To complete our implementation, we only need to instantiate both modules for each concrete we want, which we do in figure 2. Because the type signatures of the opers are already given in the interface module, we do not need to give them in its instantiations.

We can think of the List functor module as a function at the module level, with a type signature like:

ListI : instance of ListSyntax → concrete of List

Functor instantiation then would resemble function application.

interface ListSyntax = {
  oper
    elemSep : Str ;
    leftBound : Str ;
    rightBound : Str ;
} ;
Listing 2.13 – List syntax interface module.
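To illustrate how cheap a further concrete syntax becomes once the functor is in place, here is a hypothetical third instantiation of our own (the module names ListSyntaxCurly and ListFCurly are made up and do not appear in the thesis); it renders lists with curly braces and semicolons, e.g. { x ; x ; x }, by reusing ListI unchanged:

-- each module lives in its own file (ListSyntaxCurly.gf, ListFCurly.gf)
instance ListSyntaxCurly of ListSyntax = {
  oper
    elemSep = ";" ;
    leftBound = "{" ;
    rightBound = "}" ;
} ;

concrete ListFCurly of List = ListI with (ListSyntax = ListSyntaxCurly) ∗∗ {
} ;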

2.2 GF, mathematically

GF grammars can be compiled to a binary format called portable grammar format (PGF) [2]. This format is ideal for a formal description of GF since it abstracts away syntactic details of GF as a programming language. In this section we present the mathematical definition of this format, discuss at a high level how GF code is translated into it, and show how parsing and linearization are done by a PGF interpreter. The main source for this section is Krasimir Angelov's PhD thesis [1], which we follow closely in text and notation; we have contributed nothing to his work.

instance ListSyntaxHask of ListSyntax = {
  oper
    elemSep = "," ;
    leftBound = "[" ;
    rightBound = "]" ;
} ;
Listing 2.15 – Haskell list syntax instantiation.

concrete ListFHask of List = ListI with (ListSyntax = ListSyntaxHask) ∗∗ {
} ;
Listing 2.16 – Haskell list functor instantiation.

instance ListSyntaxLisp of ListSyntax = {
  oper
    elemSep = "" ; -- no element separator but for spacing
    leftBound = "(" ;
    rightBound = ")" ;
} ;
Listing 2.17 – Lisp list syntax instantiation.

concrete ListFLisp of List = ListI with (ListSyntax = ListSyntaxLisp) ∗∗ {
} ;
Listing 2.18 – Lisp list functor instantiation.

Figure 2 – A functor implementation of Haskell and Lisp list syntaxes.

incomplete concrete ListI of List = open ListSyntax, Prelude in {

  lincat
    Elem, S = {s : Str} ;
    List = {
      s : Str ;
      -- Bool is actually a pre-defined parameter type, so we don't
      -- need to redefine it
      null : Bool
    } ;

  lin
    X = {s = "x"} ;
    Nil = {s = rightBound ; null = True} ;
    Cons x xs = case xs.null of {
      True ⇒ consWith "" x xs ;
      _ ⇒ consWith elemSep x xs
    } ;
    -- we can use lambda abstractions for lin definitions too!
    LS = λxs → {s = leftBound ++ xs.s} ;

  oper
    consWith : Str → Elem → List → List =
      λsep,x,xs → xs ∗∗ {s = x.s ++ sep ++ xs.s ; null = False} ;
    -- could also be defined as two separate judgements:
    -- consWith : Str → Elem → List → List ;
    -- consWith sep x xs = xs ∗∗ {s = x.s ++ sep ++ xs.s ; null = False} ;

} ;
Listing 2.14 – List functor module.

We only present a subset of GF in that we don’t discuss dependently-typed GF code.

2.2.1 Definitions

Definition 1. A PGF grammar G is a pair of an abstract grammar A and a finite set of concrete syntaxes C₁, …, Cₙ:

G = ⟨A, {C₁, …, Cₙ}⟩

Definition 2. An abstract syntax is a triple of a set of abstract categories,10 a set of abstract constructors with their type signatures, and a start category:

A = ⟨N^A, F^A, S⟩

10 By ‘category’ we mean GF categories, not category in the Category Theory sense.

• N^A is a finite set of abstract categories.

• F^A is a finite set of abstract constructors. Every element in the set is of the form f : τ where f is a constructor symbol and τ is its type. The type is either a category C ∈ N^A or a constructor type τ₁ → τ₂ where τ₁ and τ₂ are also types.

• S ∈ N^A is the start category.

Definition 3. A concrete syntax C is a parallel multiple context-free grammar (PMCFG) complemented by a mapping from its categories and constructors to the abstract syntax:

C = ⟨G, ψ_N, ψ_F, d⟩

• G is a PMCFG: an extension of a context-free grammar (CFG) where a syntactic category is defined not as a set of symbols but as a set of tuples of strings. We apply constructors over tuples of input categories to obtain a tuple in the result category. A formal definition of PMCFG is given in definition 4, but a PMCFG is mainly composed of a set of production rules which define how to construct a given category by applying some constructor. An example using a constructor f and categories A, B, and C is the production rule:

A → f[B,C]

• ψ_N is a mapping from the concrete categories in G to the set of abstract categories N^A.

• ψ_F is a mapping from the concrete constructors in G to the set of abstract constructors F^A. A concrete constructor f^C has the same arity as its corresponding abstract constructor:

a(f^C) = a(ψ_F(f^C))

where a is a mathematical function which takes a GF constructor and returns its arity. The notation for the definitions of concrete functions is tailored to simplify the deduction rules we will write later; an example is:

f := (⟨0; 0⟩ b, ⟨1; 0⟩⟨0; 1⟩)

where f is the constructor name. The notation ⟨d; r⟩ stands for constituent number r of argument d, so f creates a tuple of two strings, where the first one (⟨0; 0⟩ b) is built from the first constituent of the first argument by adding the terminal b at the end. The second one (⟨1; 0⟩⟨0; 1⟩) concatenates the first constituent of the second argument with the second constituent of the first argument (a small worked illustration is given after this definition).

• d assigns a positive integer d(A), called dimension, to every abstract category A ∈ N^A. A given category may have different dimensions in the different concrete syntaxes.
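As a small worked illustration (our own, not from Angelov's thesis), suppose the first argument of the example constructor f above has two constituents and linearizes to the pair ("a", "c"), while the second argument linearizes to the one-string tuple ("d"). Then

f (("a", "c"), ("d")) = ("a b", "d c")

since ⟨0; 0⟩ b appends the terminal b to the first constituent of argument 0, and ⟨1; 0⟩⟨0; 1⟩ concatenates the first constituent of argument 1 with the second constituent of argument 0.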

Definition 4. A parallel multiple context-free grammar (PMCFG) [36, 37] is a 5-tuple:

G = ⟨N^C, T, F^C, P, L⟩

• N^C is a finite set of concrete categories. For every A ∈ N^C, the equation d(A) = d(ψ_N(A)) defines the dimension of every concrete category as equal to the dimension in the current concrete syntax of the corresponding abstract category.

• T is a finite set of terminal symbols.

• F^C is a finite set of concrete constructors. For every f ∈ F^C, the dimensions r(f) (the number of constituents in the output of f) and dᵢ(f) (the dimension of the i-th argument category, with 1 ≤ i ≤ a(f)) are given. For every positive integer d, (T*)^d denotes the set of all d-tuples of strings over T. So if T = {a, b}, then (T*)^3 contains, among infinitely many other tuples, all eight triples of one-symbol strings: (a, a, a), (a, a, b), (a, b, a), (b, a, a), (b, b, a), (b, a, b), (a, b, b), (b, b, b). Each constructor f ∈ F^C is a total mapping from (T*)^{d₁(f)} × (T*)^{d₂(f)} × ⋯ × (T*)^{d_{a(f)}(f)} to (T*)^{r(f)}, and is defined as

f := (α₁, α₂, …, α_{r(f)})

Here αᵢ is a sequence of terminals and ⟨k; l⟩ pairs, where 0 ≤ k ≤ a(f) is called the argument index and 0 ≤ l ≤ dₖ(f) is called the constituent index. We also use the notation rhs(f, l) to refer to constituent α_l of the constructor f.

• P is a finite set of productions of the form:

A → f[A₁, A₂, …, A_{a(f)}]

where A ∈ N^C is called the result category, A₁, A₂, …, A_{a(f)} ∈ N^C are called the argument categories, and f ∈ F^C is a constructor symbol. For the production to be well formed, the conditions dᵢ(f) = d(Aᵢ) (with 1 ≤ i ≤ a(f)) and r(f) = d(A) must hold.

• A default linearization function L of a category C is a function which produces an element of the linearization type of C when applied to a string. L ⊂ N^C × F^C is a set which defines the default linearization functions for those concrete categories that have default linearizations. If the pair (A, f) is in L then f is a default linearization function for A. We will also use the notation:

lindef(A) = {f | (A, f) ∈ L}

to denote the set of all default linearization functions for A. For every f ∈ lindef(A) it must hold that r(f) = d(A), a(f) = 1, and d₁(f) = 1.

The abstract syntax defines a grammar of typed lambda terms; similarly, the concrete syntaxes allow us to construct concrete syntax trees:

Definition 5. (f t₁ … t_{a(f)}) is a concrete tree of category A if tᵢ is a concrete tree of category Bᵢ and there is a production:

A → f[B₁ … B_{a(f)}]

The notation to say that t is a tree of category A is t : A. When a(f) = 0, the tree does not have children and the node is called a leaf.

A concrete syntax tree can be bottom-up linearized to a tuple of strings in the following way: leaves (trees with no arguments) are already tuples of strings; to linearize a tree with one or more arguments, one linearizes the arguments first, and then combines the linearizations into a tuple of strings. To define a linearization procedure, we employ a helper constructor K which produces a string from a sequence σ of tuples of strings (the linearized arguments) and a sequence αᵢ of terminals and ⟨k; l⟩ pairs. The output string is produced by the substitution of each ⟨k; l⟩ with the string for constituent l from argument k:

K σ (β₁⟨k₁; l₁⟩ β₂⟨k₂; l₂⟩ … βₙ) = β₁ σ_{k₁l₁} β₂ σ_{k₂l₂} … βₙ

where βᵢ ∈ T* and σ is the sequence of linearized arguments. With K, we can define L, the actual linearization constructor:

L (f t₁ t₂ … t_{a(f)}) = (x₁, x₂, …, x_{r(f)})        (2.1)

where

xᵢ = K (L(t₁), L(t₂), …, L(t_{a(f)})) αᵢ

and

f := (α₁, α₂, …, α_{r(f)}) ∈ F^C

2.2.2 From GF to PGF

In this section we discuss on a high level how GF code is transformed into PGF. As shown in section 2.2.1, the PGF format is very simple – its production rules simply generate tuples of strings from other tuples of strings. So it is natural to ask how the GF compiler can encode the relatively rich language of GF in this simpler system. The first thing to notice here is that although GF enjoys pattern-matching and function definitions (opers), these are compiled away when GF is transformed into canonical GF [24, 30], which is a variant of GF without syntactic sugar and where partial evaluation has been applied. To partially evaluate GF code means to inline all function definitions, and evaluate all expressions as far as possible, until they depend on runtime variables (as constructors do). This is possible because GF functions may not be recursive nor co-recursive. We can also compile away pattern-matching because GF parameter types (on which we pattern match) are finite, and can thus be enumerated. Canonical GF then only has category and constructor declarations and definitions, containing only argument variables, strings, records, tables, and parameters, plus the concatenation, record projection and table selection operators. Thus we only have to worry about converting these to PGF. Records and tables can be straightforwardly flattened to tuples of strings. However it is not so clear how we can encode record fields whose values are parameters, nor how the appropriate elements of tables are selected using parameters. Let us take the Haskell concrete of the List grammar from section 2.1.1 as an example grammar being converted to PGF. The Elem category is straightforwardly represented as a one-element string tuple. For instance, X is just "x"; if we had other Elems, they'd be simple strings too. The way to represent a record where one field is a parameter (like in the case of the List category) is to enumerate all possible values of the parameter (which is always possible since parameters must be finite), and create a new concrete category for each such value. Therefore what was just one category in GF (List) becomes two concrete categories in the PGF representation: List_empty and List_nonempty.11 Both concrete categories are then represented as one-element tuples of strings; the value of their null fields is encoded by the concrete category they belong to. For example, Nil is then represented as just ⟨"]"⟩.

11 These names are just for mnemonics; they are actually integers in the actual PGF format.

Now that we know that categories can get split in the GF to PGF conversion, it is easier to see how the linearization definition of the Cons constructor in the Haskell concrete of the List grammar (see listing 2.9) is handled. It contains a case expression whose value depends on the two-valued Bool parameter from the null field of its input argument of List. But there is no List category anymore, so we split the Cons constructor into two concrete constructors: one which takes a List_empty and another that takes a List_nonempty. There is no case expression in either of them (and there could not be, nor is there a need to be). For example, the Cons_nonempty concrete constructor takes a non-empty List as input, and produces a non-empty List; we know that non-empty Lists are just a one-string tuple; using the notation from definition 3, we have:

Cons_nonempty := (⟨0; 0⟩ "," ⟨1; 0⟩)

That is, we produce the first constituent (corresponding to the s field) of the one-element tuple by concatenating the first constituent of the first argument (which is an Elem), a comma, and the first constituent of the second argument (which is a List_nonempty). We can obtain a human-readable PGF from any GF grammar by using the print_grammar command of the GF shell. Its output is very close in notation to the one presented here. In summary, although PGF preserves the abstract categories and constructors defined in GF, it replaces linearization categories and rules by concrete categories and constructors. In the general case there is more than one concrete category and constructor for each linearization category and rule (and thus for each abstract category and constructor). We use ψ_N and ψ_F from definition 3 to preserve the mapping from concrete categories to abstract categories and from concrete constructors to abstract constructors, respectively.

2.2.3 Parsing

In this section we show how GF parsing from a PGF representation works using a running example. We first explain the idea behind the parse algorithm, show how we can represent it as a series of deduction steps, and explain the deduction rules we use. We then present the example grammar, and show a complete parse. More details and proofs of soundness (i. e., it is impossible to parse a string not accepted by the grammar) and completeness (i. e., it is possible to parse every string produced by the grammar) of the parsing algorithm can be found in Krasimir Angelov's thesis [1].

2.2.3.1 The example grammar

Our running example is still that of Haskell lists from section 2.1.1. Figure 3 shows the PGF version of the Haskell list grammar from section 2.1.1.

There are two construction rules for the top-level category S (whose dimension is one), one which takes one argument of category N (for empty lists, formerly Listempty), and another which takes one argument of category L (for non-empty lists, formerly List).Both rules use the same constructor ls, which concatenates the beginning square bracket with its first argument’s first constituent.12 As discussed in section 2.2.2, the Cons becomes 12 The actual PGF has the ls constructor accept a single category of L*, and adds two coercions from empty lists and non-empty lists to L*. 46 Chapter 2. Grammatical Framework

S → ls[N]
S → ls[L]
L → ce[E,N]
L → co[E,L]
N → nl[]
E → e[]

ls := ("[" ⟨0,0⟩)
ce := (⟨0,0⟩ ⟨1,0⟩)
co := (⟨0,0⟩ ", " ⟨1,0⟩)
nl := ("]")
e := ("x")

Figure 3 – The PGF grammar for Haskell lists.

S → "[" L
L → "]" | "x" L | "x" ", " L

Figure 4 – A context-free approximation for Haskell lists without an intermediary category for empty lists

two concrete constructors: one that takes empty lists (ce) and another that takes non-empty lists (co); the former simply concatenates the first constituent of the first argument (the element) with the first constituent of the second argument (the list); the latter adds a comma between these two constituents. The element category has only one nullary constructor, e, which is a single tuple; the PGF-only empty list category also has a single nullary constructor, nl.

The syntax tree for the string [x, x] is (ls (co e (ce e nl))), while that of the empty list is simply (ls nl).
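To see that this tree indeed produces the intended string, we can unfold its linearization by hand using the concrete constructors of figure 3 (writing L for the linearization function of section 2.2.1 and, for readability, ignoring that linearizations are formally one-element tuples):

L(ls (co e (ce e nl)))
  = "[" ++ L(co e (ce e nl))            −− ls := ("[" ⟨0,0⟩)
  = "[" ++ "x" ++ ", " ++ L(ce e nl)    −− co := (⟨0,0⟩ ", " ⟨1,0⟩), with L(e) = "x"
  = "[" ++ "x" ++ ", " ++ "x" ++ "]"    −− ce := (⟨0,0⟩ ⟨1,0⟩), with L(nl) = "]"
  = "[x, x]"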

2.2.3.2 The algorithm

GF’s parsing algorithm is a generalization of a context-free parsing algorithm. Figure 4 shows a context-free approximation of the Haskell list grammar that does not use an intermediate category for the empty list.¹³ To restrict this grammar so that it only produces syntactically correct Haskell lists, we can create new production rules and categories at every rule application. This

¹³ It is possible to define the grammar exactly using a CFG by adding a category for empty lists; if we did so, however, we would not be able to show how GF parses context-sensitive grammars.

S → "[" L
L → "]" | "x" L | "x, " L

1   S
2   "[" L
3   "[" "x, " L1        L1 → "x" | "x, " L2
4   "[" "x, " "x" L2    L2 → "]"
5   "[" "x, " "x" "]"

Figure 5 – Parse for the string [x , x] using the CFG approximation with on-the-fly specialization.

on-the-fly specialization of the parser prevents it from accepting strings which are not in our example grammar.

Figure 5 shows how parsing the list string [x, x] would work. We begin the parse from the start category S, which has only one branch, so we parse the beginning square bracket, and soon find the L category, which might be formed in three ways; we continue with the only one matching our input in step 3. Now, given that we have parsed an element-with-a-comma, we want to have only two valid continuations: we can either parse another element-with-a-comma, or we can parse a normal element; this is enforced by

the on-the-fly creation of a new category L1 – a specialization of L that does not have the ending square bracket branch. We continue with parsing the normal element branch (because it matches our input), which means there is no following element – the list must end now or the parse will fail – so we want to have only one valid continuation where we

parse the ending square bracket – this is reflected by the new specialized rule L2. Notice that if we had a malformed input with another element at this point, the parse would fail; if we were using the pure CFG approximation, such a malformed input would still be parsed. We successfully parse an ending square bracket because it matches our input, and have now fully consumed the input with a successful parse. A parser for this grammar would look like a parser for a context-free language, except for the creation of new categories and rules on the fly.

2.2.3.3 Parsing as deduction

We present the parsing algorithm as a deduction system, following Shieber et al. [38]. Each rule application derives new items:

    X₁ . . . Xₙ
    ───────────   ⟨side conditions on X₁ . . . Xₙ⟩
         Y

where the premises Xi are items, and the derivation Y is also an item. We take the input

string to be a sequence w₁ . . . wₙ of tokens. We first explain the kinds of items derived by the parser. After presenting the formal notation, we show concrete examples to help make the notation clearer. If these examples are not enough, their use in section 2.2.3.4 should explain their significance. We finally present and explain the parsing deduction rules; their use in section 2.2.3.4 should help clarify their semantics. The parsing deduction system generates active, passive, and production items.

2.2.3.3.1 Production items

In Shieber et al. deduction systems, the grammars are constant. Because in the case of GF parsing the grammars are extended at runtime (as intuited by section 2.2.3.2), the set of productions must be a part of the deduction set, and the productions from the original grammar are considered axioms included in the initial deduction set.

2.2.3.3.2 Active items

Active items represent the parsing state at a given point of the deduction:

[^k_j A → f[B⃗]; l : α • β],   j ≤ k

This notation encodes a constructor f with the following production:

A → f[B⃗]

f := (γ₁, . . . , γ_{l−1}, αβ, . . . , γ_{r(f)})

such that the tree (f t₁ . . . t_{a(f)}) will produce the substring w_{j+1} . . . w_k as prefix in constituent l for any sequence of arguments tᵢ : Bᵢ. The sequence α is the part that produced the substring:

K(L(t₁), L(t₂), . . . , L(t_{a(f)})) α = w_{j+1} . . . w_k

and β is the part that is not processed yet. Take the following example active item from a parse using the Haskell list grammar in figure 3 (page 46):

[^1_1 L → ce[E,N]; 0 : • ⟨0,0⟩ ⟨1,0⟩]

It denotes the ongoing parse of a non-empty list (category L), starting at index 1 of the input (the subscript 1) and currently at index 1 of the input (the superscript 1 – it has not successfully parsed anything yet). Note that all indices start from zero. This non-empty list is being built from the ce constructor, which takes an element (category E) and an empty list (category N). It is currently parsing the first (thus 0) and only constituent of the result category; its point is just before two argument-constituent pairs, corresponding to the first constituent of the first argument and to the first constituent of the second argument. The components after the constituent index correspond exactly to the ones in the ce rule of figure 3.

2.2.3.3.3 Passive items

Passive items are written as:

[^k_j A; l; N],   j ≤ k

and are proof of at least one production

A → f[B⃗]

f := (γ₁, γ₂, . . . , γ_{r(f)})

and a tree (f t₁ . . . t_{a(f)}) : A such that the constituent with index l in the linearization

of the tree is equal to w_{j+1} . . . w_k. Contrary to active items, in passive items the whole constituent is matched:

K(L(t₁), L(t₂), . . . , L(t_{a(f)})) γ_l = w_{j+1} . . . w_k

Every time there is a completion of an active item, a passive item is derived along with

a new category N which accumulates all productions for A that produce the w_{j+1} . . . w_k

substring from constituent l. All trees of category N must produce w_{j+1} . . . w_k in the constituent l. Observe the following passive item from a parse using the grammar in figure 3 (page 46):

[^2_1 E; 0; C0]

It witnesses the parsing of the first (and only) constituent of a member of category E. It spans one token of the input, starting at index 1. It has also produced a specialized

category C0 with the production rule C0 → e[], where e is the zero-argument constructor from figure 3. Its corresponding tree e : E is such that the constituent with index 0 in the linearization of this tree ("x") is equal to the token at index 1 of the input. The deduction rules of the parsing system are shown in figure 6.

2.2.3.3.4 Initial predict

Derive an item spanning the 0–0 range for each production whose result category is mapped to the start category in the abstract syntax.

2.2.3.3.5 Predict

Given an active item with a dot before an argument-constituent pair ⟨d; r⟩ and a matching production rule, derive an active item where the dot is at the beginning of constituent r of the constructor function.

Initial Predict

    A → f[B⃗]
    ──────────────────────────   ψ_N(A) = S, S the start category in A, α = rhs(f, 1)
    [^0_0 A → f[B⃗]; 1 : • α]

Predict

    B_d → g[C⃗]        [^k_j A → f[B⃗]; l : α • ⟨d; r⟩ β]
    ─────────────────────────────────────────────────────   γ = rhs(g, r)
    [^k_k B_d → g[C⃗]; r : • γ]

Scan

    [^k_j A → f[B⃗]; l : α • s β]
    ──────────────────────────────────   s = w_{k+1}
    [^{k+1}_j A → f[B⃗]; l : α s • β]

Complete

    [^k_j A → f[B⃗]; l : α •]
    ──────────────────────────────────   N = (A, l, j, k)
    N → f[B⃗]        [^k_j A; l; N]

Combine

    [^u_j A → f[B⃗]; l : α • ⟨d; r⟩ β]        [^k_u B_d; r; N]
    ───────────────────────────────────────────────────────────
    [^k_j A → f[B⃗{d := N}]; l : α ⟨d; r⟩ • β]

Figure 6 – Parsing deduction rules

2.2.3.3.6 Scan

Given an active item with a dot before a terminal s, where s matches the next token w_{k+1} of the input string, derive a new active item where the dot is moved to the next position.

2.2.3.3.7 Complete

Derive a passive item from an active item whose dot is at the end. The resulting category N in the passive item is a fresh category. Also derive a new production for N which has the same constructor and arguments as the category in the active item.

2.2.3.3.8 Combine

Derive an active item from matching active and passive items. For the purposes of Combine, an active item matches a passive one if its dot is before the parsing of the r-th constituent of the d-th argument of its constructor function, and the passive item derives just such a constituent, with matching indices.

The item in the premise of Complete was at some point predicted by a Predict application on some other item. The Combine rule then replaces the occurrence of the original category A (in the premise of Predict) with its specialization N. A parse is successful if we have derived a passive item in the start category of the grammar which spans the whole length of the input text.

2.2.3.4 An example parse

In this section we show the derivation of the parse of the string [x , x] using the Haskell list PGF grammar in figure 3. To save space, figure 7 shows the derivation omitting steps that lead to an unsuccessful parse; figures 8 and 9 show two branches of the complete derivation that led to an unsuccessful parse, but other such branches exist. The GF parser has no way of knowing which branches will lead to successful parses without trying them, of course, so it will explore all branches – unless told to stop after a certain number of successful parses. Instead of the usual deduction tree, we present the steps as a table where the first column is an integer identifier for the items derived in that step, the second column shows the derived items themselves, and the last column is the name of the rule applied in that step, along with the identifiers of the items the rule was applied to. For instance, step 18 from figure 7 shows the item derived from the application of the Predict rule to the items derived in steps 5 and 17. Note that the integer identifiers have no meaning and follow no particular order. The first few lines of the derivation are the productions from the grammar in figure 3. Steps 0 and 1 tell us that the start category may be formed using the ls constructor taking either an empty list or a non-empty list. Figure 8 shows that following the empty list branch leads us to a successful parse of the opening square bracket token, but then to a dead-end where we have the closing square bracket at point but it does not match the next token in the input stream. Because this branch leads us to a parse failure, we continue at step 9 by applying the Initial Predict rule, which yields us an item with point (•) before two components. The first component (the opening square bracket) matches the next token of our input string, so we apply the Scan rule and advance point by one token. We are trying to parse a sentence built from a non-empty list; recall the Predict rule from figure 6: we may predict only the category matching the argument index of the argument-constituent pair in front of the point; this category is that of non-empty lists and we have two production rules for those. The one at step 2 would lead us to try to parse [x], because it would force the parse of an empty list following the first element. So

0   S → ls[N]
1   S → ls[L]
2   L → ce[E,N]
3   L → co[E,L]
4   N → nl[]
5   E → e[]
9   [^0_0 S → ls[L]; 0 : • "[" ⟨0,0⟩]              Initial Predict 1
10  [^1_0 S → ls[L]; 0 : "[" • ⟨0,0⟩]              Scan 9
17  [^1_1 L → co[E,L]; 0 : • ⟨0,0⟩ ", " ⟨1,0⟩]     Predict 3 10
18  [^1_1 E → e[]; 0 : • "x"]                      Predict 5 17
19  [^2_1 E → e[]; 0 : "x" •]                      Scan 18
20  C1 → e[]    [^2_1 E; 0; C1]                    Complete 19
21  [^2_1 L → co[C1,L]; 0 : ⟨0,0⟩ • ", " ⟨1,0⟩]    Combine 17 20
22  [^3_1 L → co[C1,L]; 0 : ⟨0,0⟩ ", " • ⟨1,0⟩]    Scan 21
28  [^3_3 L → ce[E,N]; 0 : • ⟨0,0⟩ ⟨1,0⟩]          Predict 2 22
29  [^3_3 E → e[]; 0 : • "x"]                      Predict 5 28
30  [^4_3 E → e[]; 0 : "x" •]                      Scan 29
31  C3 → e[]    [^4_3 E; 0; C3]                    Complete 30
32  [^4_3 L → ce[C3,N]; 0 : ⟨0,0⟩ • ⟨1,0⟩]         Combine 31 28
33  [^4_4 N → nl[]; 0 : • "]"]                     Predict 4 32
34  [^5_4 N → nl[]; 0 : "]" •]                     Scan 33
35  C4 → nl[]   [^5_4 N; 0; C4]                    Complete 34
36  [^5_3 L → ce[C3,C4]; 0 : ⟨0,0⟩ ⟨1,0⟩ •]        Combine 35 32
37  C5 → ce[C3,C4]    [^5_3 L; 0; C5]              Complete 36
38  [^5_1 L → co[C1,C5]; 0 : ⟨0,0⟩ ", " ⟨1,0⟩ •]   Combine 37 22
39  C6 → co[C1,C5]    [^5_1 L; 0; C6]              Complete 38
40  [^5_0 S → ls[C6]; 0 : "[" ⟨0,0⟩ •]             Combine 39 10
41  C7 → ls[C6]    [^5_0 S; 0; C7]                 Complete 40

Figure 7 – Parse deduction for the string [x , x]

we continue by applying Predict to the rule at step 3. We are now trying to parse a non-empty list built with the co constructor. We have already parsed one token, and are now to parse the first constituent of the first argument of the co constructor, which is an element. The Predict rule tells us that we are to parse an element (E), which amounts to parsing a "x" token; because it matches the next token in the input stream, we Scan it. We have successfully parsed an element, so we apply the Complete rule to produce a passive item witnessing this. At step 21, we use this passive item to advance our parse of the non-empty list from step 17. Point is now in front of a comma token, which matches the next token in the input stream, so we may Scan it. Point then comes to the first constituent of the second argument of the list constructor co; we are trying to parse a non-empty list. We again Predict two branches to follow, corresponding to the two constructors of non-empty lists. If we follow the co constructor, we will be trying to parse a list with at least 3 elements, since we have already parsed one element, and co takes an element and a non-empty list,

6   [^0_0 S → ls[N]; 0 : • "[" ⟨0,0⟩]    Initial Predict 0
7   [^1_0 S → ls[N]; 0 : "[" • ⟨0,0⟩]    Scan 6
8   [^1_1 N → nl[]; 0 : • "]"]           Predict 4 7

Figure 8 – Failed parse deduction for the string [x , x]: unexpected token.

23  [^3_3 L → co[E,L]; 0 : • ⟨0,0⟩ ", " ⟨1,0⟩]     Predict 3 22
24  [^3_3 E → e[]; 0 : • "x"]                      Predict 5 23
25  [^4_3 E → e[]; 0 : "x" •]                      Scan 24
26  C2 → e[]    [^4_3 E; 0; C2]                    Complete 25
27  [^4_3 L → co[C2,L]; 0 : ⟨0,0⟩ • ", " ⟨1,0⟩]    Combine 26 23

Figure 9 – Failed parse deduction for the string [x , x]: unexpected token.

which itself has at least one element. Figure 9 follows this branch up until a parse failure due to an unexpected token.

Therefore we continue at step 28 with the ce constructor. The component at point is the first constituent of the first argument, which is an element (E). We apply the Predict, Scan and Complete rules in succession to successfully parse another element, and Combine the resulting passive item at step 32 with our prior active item trying to parse a non-empty list built by the ce constructor, advancing point to the next argument-constituent pair. The next argument-constituent pair corresponds to an empty list, which can only be built by one concrete constructor (this application of the Predict rule has no sibling branch). We successfully parse the empty list because the closing square bracket token matches the next token in our input stream. This allows us to Complete the parsing of the empty list, which, Combined with step 32, Completes the parsing of the non-empty list component of the first non-empty list we were trying to parse. That list is itself Completed at step 39, which lets us finish parsing our input sentence. Our final passive item at step 41 witnesses the parsing of the first (and only) constituent of an item in the sentence category spanning the whole 5-token input.

2.2.4 Linearization

Linearization in GF is the process that maps an abstract syntax tree to a tuple of strings (usually one string at the topmost level). We have already discussed how concrete syntax trees are linearized to tuples of strings in section 2.2.1, definition 5 and equation (2.1); here we show how we can map abstract syntax trees to concrete syntax trees, so that we can also linearize the former. For simplicity, we do not handle the case where the abstract trees are incomplete, but the interested reader can refer (again) to Krasimir Angelov’s thesis [1]. Before we describe the transformation of an abstract syntax tree to a concrete

syntax tree, we define P_{f^A,B⃗} ⊆ P, the subset of productions whose argument categories are

B⃗ and whose concrete constructors map to the same abstract constructor f^A:

P_{f^A,B⃗} = { A^C → f^C[B⃗] | A^C → f^C[B⃗] ∈ P, ψ_F(f^C) = f^A }

We can write a transformation of an abstract syntax tree t^A to a concrete syntax tree t^C of category B as t^A ⇓ t^C : B, and describe it as a deduction rule:

    t₁^A ⇓ t₁^C : B₁   . . .   tₙ^A ⇓ tₙ^C : Bₙ
    ─────────────────────────────────────────────   A → f^C[B⃗] ∈ P,  n = a(f^A) = a(f^C)
    f^A t₁^A . . . tₙ^A ⇓ f^C t₁^C . . . tₙ^C : A

To transform a tree where a constructor is applied to more than zero arguments, we first transform the arguments. The vector B⃗ comes from the linearization of the arguments, and f^A is the abstract constructor of the input tree. To complete the linearization after having computed t^A ⇓ t^C : B, we just calculate L(t^C) using equation (2.1).

3 A computational grammar for Portuguese

In this section we describe the GF resource grammar library [32][33, chapters 5,9,10] and our implementation of a Portuguese resource grammar [7]. We also evaluate our grammar with respect to syntactical correctness.

3.1 The GF resource grammar library

The GF resource grammar library (RGL) is a collection of natural language grammars (each one individually called a resource grammar, or RG) that intend to provide a usable and syntactically-correct interface to the languages they describe, so that they can be reused by other, domain-specific grammars [32, section 7.1]. Their usability comes in part from the fact that the grammars share a common framework – they use the same categories and provide the same constructors. As of January 2019, the RGL comprises grammars for more than 40 languages at different levels of completeness.

3.1.1 Motivation

When writing domain-specific grammars naively, one often needs to reimplement linguistic constructions that show up in other grammars. This is a consequence of the fact that even small fragments of natural language may contain non-trivial linguistic phenomena that are difficult to capture properly. Handling these at each grammar instance leads to code repetition and unmaintainability. In the GF book [33, section 1.7], it is shown how a simple sentence such as the number 2 is prime in a multilingual application may need an implementation of mood variations, number agreement, and different noun cases. This kind of general syntactic knowledge is better encapsulated in such a way that it can be reused, and in GF this encapsulation takes the form of a resource grammar. As GF supports grammar composition (as in defining a grammar in terms of another grammar) out of the box, domain-specific grammar writers can benefit from a standard library that encloses specialized linguistic knowledge. This library can then be employed by someone with domain-specific knowledge but less linguistic knowledge, promoting a division of labour of sorts.

The resemblance of an RG to a software library has not gone unnoticed [31]. An RG can be seen as a software library, and its use has the same advantages as the use of software libraries: besides the aforementioned avoidance of code repetition and the increased ease of writing (needing less specialized authors means more people can write grammars/programs), the library centralizes development and bug reporting in one place, saving resources and gathering improvements faster than if each user had to develop their own library.

lin Prime x = \\ord,mod ⇒
  let ist = case ⟨mod,x.n⟩ of {
        ⟨Ind,Sg⟩  ⇒ "ist" ;
        ⟨Ind,Pl⟩  ⇒ "sind" ;
        ⟨Conj,Sg⟩ ⇒ "sei" ;
        ⟨Conj,Pl⟩ ⇒ "seien" }
  in case ord of {
       Main ⇒ x.s ! Nom ++ ist ++ "unteilbar" ;
       Sub  ⇒ x.s ! Nom ++ "unteilbar" ++ ist ;
       Inv  ⇒ ist ++ x.s ! Nom ++ "unteilbar" }

Listing 3.1 – A hand-written rule.

lin Prime x = UseCl (TTAnt TPres ASimul) PPos
  (PredVP x (UseComp (CompAP (PositA unteilbar_A))))

Listing 3.2 – Using the RGL constructors directly.

lin Prime x = mkS (mkCl x unteilbar_A)

Listing 3.3 – Using the RGL API.

Figure 10 – German predicate Prime

3.1.2 Usage

Again similarly to a software library, the GF RGL makes available an application programming interface (API). The syntactic part of the API is language-independent, which makes it simpler to write multilingual grammars because a user does not need to learn an API for each language. Figure 10, adapted from [33, section 1.7], shows three different approaches to defining the rule for the predicate Prime (as in the number 2 is prime) in German. As shown in listing 3.3, the RGL API is mainly composed of functions of the form mkX, where X is the result category of the application. These are overloaded functions, so that the user does not need to memorise too many function names – if they want to create a clause, they can browse the RGL API¹ and see 31 ways of composing one, some of which can be seen in listing 3.4.

¹ http://www.grammaticalframework.org/lib/doc/synopsis/index.html

mkCl : NP → V → Cl ;             −− she sleeps
mkCl : NP → V2 → NP → Cl ;       −− she loves him
mkCl : NP → V3 → NP → NP → Cl ;  −− she sends it to him
mkCl : NP → VV → VP → Cl ;       −− she wants to sleep
mkCl : NP → A → Cl ;             −− she is old
mkCl : NP → A2 → NP → Cl ;       −− she is married to him
mkCl : NP → Adv → Cl ;           −− she is here
mkCl : SC → VP → Cl ;            −− that she sleeps is good
mkCl : N → Cl ;                  −− there is a house
mkCl : V → Cl ;                  −− it rains

Listing 3.4 – mkCl overloaded function in the RGL API.
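For instance, the first of these overloads can be combined with the mkS of listing 3.3 as follows (a small illustrative snippet; we assume the she_NP and sleep_V constants that the RGL API and its test lexicon provide):

oper exampleCl : Cl = mkCl she_NP sleep_V ;  −− "she sleeps"
oper exampleS  : S  = mkS exampleCl ;        −− wrap the clause into a declarative sentence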

These clause-building functions handle things like agreement and tenses – so that the application grammar writer does not have to implement them herself. Because the API is multilingual, the rule shown in listing 3.3 for German is almost the same for Portuguese and English – see listing 3.5.

lin Prime x = mkS (mkCl x primo_A)
lin Prime x = mkS (mkCl x prime_A)

Listing 3.5 – Prime predicate in Portuguese and English.

Note that only content words need to be changed. There is an API for content words too, but it is language-dependent. Despite this, the morphological APIs are still very uniform in that the function names are mostly of the same mkX form. These lexical paradigms can be browsed online at the RGL synopsis,² and are defined and documented in each language’s Paradigms module. The adjective prime can then be defined in German, Portuguese, and English as in listing 3.6.

lin unteilbar_A = mkA "unteilbar"
lin primo_A = mkA "primo"
lin prime_A = mkA "prime"

Listing 3.6 – prime_A in German, Portuguese, and English.

These lexical constructors are also overloaded; the ones that take fewer forms as input are deemed smart, and try to guess the remaining forms from the ones they have been provided. An extreme example is the Finnish noun constructor mkN: it has a definition that receives only one form, and produces 30+ forms, with 87% accuracy against the KOTUS lexicon gold standard [13]. For the remaining 13%, Finnish mkN also has a "worst-case" definition that takes 10 forms, and three other definitions taking 2, 3, and 4 arguments each.

² http://www.grammaticalframework.org/lib/doc/synopsis/index.html

As an example of the uniformity of lexical constructors, we note that Portuguese and English also have noun constructors called mkN; they simply differ in which arguments they take. While Finnish has five different constructors, all of them taking different numbers of input strings, Portuguese has four: the main smart one, which guesses gender and plural form from the singular form; the worst-case one, which takes singular and plural forms and a gender; and two intermediate ones that use the smart one but force either the gender or the plural form (their type signatures are shown in listing 3.7).

mkN : (revolução : Str) → N ;           −− predictable nouns
mkN : (alemão, alemães : Str) → N ;     −− force noun plural, guess gender
mkN : (mapa : Str) → Gender → N ;       −− force gender, guess plural
mkN : (mão, mãos : Str) → Gender → N ;  −− worst-case

Listing 3.7 – Portuguese noun constructors.
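To illustrate, an application lexicon could use these constructors along the following lines (the lexical entry names are ours, and we assume the masculine and feminine Gender constants exported by the Portuguese Paradigms module):

lin revolution_N = mkN "revolução" ;            −− smart: guesses feminine gender and plural "revoluções"
lin german_N     = mkN "alemão" "alemães" ;     −− force the plural, guess the gender
lin map_N        = mkN "mapa" masculine ;       −− force the gender, guess the plural
lin hand_N       = mkN "mão" "mãos" feminine ;  −− worst-case: give everything explicitly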

3.1.3 Structure

In this section we give an overview of the structure of the RGL, describing its main modules at a high level. The dependency tree in figure 11 (taken from the GF book [33]) shows the overall structure of the RGL. The modules with solid contours are visible to the end-user, while those with dashed contours are internal; rectangular modules are pairs of abstract and concrete grammars, while ellipses are resource or instance modules, and diamond-shaped modules are interfaces. If a module has a name in brackets, it is derived mechanically from other modules.

The Cat and Common modules define the backbone of the RGL category system, which can be seen in figure 12 (from the RGL synopsis). Categories in rectangular boxes are open lexical categories. The categories declared in Common usually share the same concrete implementation.3 For the categories declared in Cat no attempt is made at a standard implementation.

The syntactic API we describe in section 3.1.2 is declared by Syntax (and its children), while the morphological API is declared by Paradigms, so that accessing them both from an application grammar is done by opening the language modules that implement them, i. e., SyntaxX and ParadigmsX, where X is the language of choice. The definitions in SyntaxX are not written explicitly, but derived using a functor (explained in section 2.1.2) from other RGL constructors. This means that all languages that implement the necessary constructors get the Syntax API for free. Listings 3.2 and 3.3 give an idea of how this is

³ Text is always a string, but Adv, which is usually just a string, needs more structure in the case of languages like Chinese or Sindhi.

Figure 11 – RGL module structure (condensed)

done: knowing that both definition trees are equivalent, one can presume that mkS : Cl → S can be defined as the anonymous function λ cl → UseCl (TTAnt TPres ASimul) PPos cl, even if one does not know exactly what each constructor is. Indeed this is how the functor defines this overloaded instance of the mkS constructor, and for that it only needs the definition of UseCl (as TTAnt, TPres, ASimul, and PPos are all defined in Common for all RGL languages).
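As a rough sketch of what this looks like in the shared Syntax implementation (simplified; the real functor lists many more overloads and its exact module layout differs):

oper mkS = overload {
  mkS : Cl → S       = λcl → UseCl (TTAnt TPres ASimul) PPos cl ;
  mkS : Pol → Cl → S = λpol,cl → UseCl (TTAnt TPres ASimul) pol cl
} ;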

Although not part of the user-facing API, the module MorphoX is customary, and is used to implement ParadigmsX for each language. Not being part of the API, its use by other grammars is strongly discouraged, as it is not committed to backwards-compatibility as user-facing modules are.

Another module which is not part of the user-facing API but is customary is ResX. It is where parameters and functions useful to several modules are placed, so that they can be reused across the RG. ResEng, for instance, defines the Case and VForm (verb form) parameters (among others), and implements the toAgr (which builds an agreement parameter from a number, a person, and a gender) and predV (which builds a verb phrase from a verb) functions (among others).

Figure 12 – GF RGL category system

Other modules that are useful to end-users are IrregX and DictX. While the LexiconX module contains a small lexicon of 300-odd words, IrregX and DictX offer larger lexicons bootstrapped from morphological resources. Trustworthy lexicons are desirable because application programmers of multilingual grammars do not need to worry about getting word inflections correct using the appropriate paradigms when they are not native speakers (or speakers at all) of the language they are implementing. These modules are not available for all languages, however.

The modules ExtraX and ExtendX (not pictured in figure 11) are extensions of the RGL grammatical core, offering more categories and constructors. The latter is intended to be an improved version of the former, but is still under development.4 A significant difference between them is that ExtraX is language-specific: there are common constructors defined in the abstract Extra, but every language implements its own constructors, which makes it impossible to translate between them directly, since they don’t share a common abstract syntax.5 Unlike ExtraX, ExtendX has a common abstract syntax; because of this it must define constructors that may not be relevant to language X; as an example, the English RG ends up defining youFem_Pron, even though this is mostly6 relevant for languages

⁴ https://groups.google.com/d/msg/gf-dev/YWajYB5CcEg/Q6MJXExvDgAJ
⁵ Users may still open ExtraX to help define constructions in an application grammar without sacrificing interlingual translation.
⁶ I say mostly because although another version of you_Pron introduces redundancy, it also allows one to parse French “tu est intelligente” and get a feminine abstract syntax tree instead of a gender-neutral one.

where the second person pronouns are gendered. This disadvantage of ExtendX is mitigated by a functor that it instantiates, which includes default linearizations for constructors such as youFem_Pron, which are only relevant to a subset of the RGL languages. The abstract module Grammar defines the syntactic core of the RGL. It is composed of modules with evocative names such as Noun (which mostly declares constructors whose results are noun phrases and common nouns) and Question (which declares constructors relating to interrogative constructions), plus Idiom for idiomatic constructs (such as progressive verb phrases like estar dormindo) and Structural for structural words like alguém.

To work on or test a resource grammar in the GF shell, the LangX or AllX modules may be imported. These are top-level modules that also contain the LexiconX module, and can thus be used for testing. (If GrammarX is imported directly, there will be no content words and most trees will not be able to be generated.) If using just one language, either LangX or AllX will do (and the latter might even be preferred because of its additional imports), but for more languages LangX is the correct choice, as the AllX modules do not share an abstract syntax, by virtue of their import of ExtraX.⁷
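In practice, a cross-language testing session in the GF shell might look roughly like the following (a hedged sketch of common shell usage; the exact file names and language flags depend on the modules you import):

$ gf
> import LangEng.gf
> import LangPor.gf
> parse -lang=LangEng "the house is small" | linearize -lang=LangPor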

3.2 The Portuguese resource grammar

In this section we present our implementation of a Portuguese resource grammar. We begin by presenting a miniature version of it, explaining its main constructs and thereby giving a high-level view of the syntax of Portuguese. We continue by showing our work on Portuguese morphology and explaining how morphology is encoded in a GF grammar and how it is exposed to the end-user of the grammar. We then give a brief overview of a few modules of the Portuguese grammar which are not part of the core RGL modules, and are thus not provided by the Romance functor. Finally, we perform two evaluation experiments using GF treebanks,⁸ and discuss the grammatical correctness of our Portuguese linearizations. We must note that our implementation mostly follows Brazilian Portuguese, although we sometimes try to offer compatibility with European Portuguese too. Our implementation of a Portuguese RG was an instantiation of the Romance functor. The Romance functor generalizes the grammar for the following languages: French, Italian, Catalan, Spanish, and Portuguese. Although technically a Romance language, Romanian has a separate implementation because of considerable differences from the other languages in the functor (see [15]).

⁷ This might change when Extend replaces Extra in All.
⁸ By GF treebank we simply mean a treebank where the trees are GF abstract syntax trees.

The functor implements most of the syntax of the Romance languages in a common way, with the differences between languages encapsulated in the DiffRomance interface module (see section 2.1.2), to be instantiated by each language. The Romance functor delegates to its instantiations the language’s morphology, its idiomatic constructs, structural words, and its lexicon. The syntactic parts of our Portuguese implementation are thus restricted to the DiffPor, StructuralPor, and IdiomPor modules, plus the extra syntactic modules discussed in section 3.2.3.2 (page 73). The Romance functor follows the RGL module structure presented in section 3.1.3, adding a CommonRomance and a ResRomance module. Parameters and operations that can be reused across Romance languages are defined in the former; these include the Gender and AForm (adjective form) parameters, and the numForms (for building a table from numbers to strings) and Noun (defining a noun type) functions. The ResRomance module is a normal resource module, except that it is an interface that gets extended by each Romance instantiation (refer back to section 2.1.2 (page 36) for the definition of these terms).

3.2.1 A Portuguese miniresource grammar

The GF book [33] recommends starting a new RG from a miniature resource grammar. A miniature RG is a separate grammar which is intended to be a simplified version of a full RG, comprising fewer categories and constructors than those available from the RGL. A mini-resource grammar is useful both as a training exercise in preparation for the implementation of a full resource grammar and as a better didactic instrument, which is how we use it here. We eventually produced three versions of a mini-resource grammar.⁹ The last one was used as an example grammar in the GF summer school 2018, and is the one we will present here. Our presentation will focus on syntax, since we will describe the morphological implementation of Portuguese word classes for the full resource. The structure of the Portuguese mini-resource can be seen in figure 13. Rectangular boxes are abstract modules, solid ellipses are concrete modules, and dashed ellipses are resource modules. The mini-resource grammar is a simplified version of the RGL that tries to follow its conventions, so its modules should be familiar from section 3.1. The MiniGrammar module lumps together category definitions and grammatical constructors, whereas in the RGL it is an extension of several modules that implement these separately. It uses the same category and constructor names as the RGL, removing some of them for simplicity. In the

⁹ The versions are available at the following links: https://github.com/GrammaticalFramework/gf-contrib/tree/master/mini, https://github.com/GrammaticalFramework/gf-contrib/tree/master/mini/newmini, and https://github.com/GrammaticalFramework/gf-summerschool-2018/tree/master/resource.

Figure 13 – Module structure of the Portuguese mini resource grammar

mini-resource, for example, polarity is represented in much the same way, while the tense system is greatly simplified to have only a simultaneous and an anterior tense (compared to 8 RGL tenses). Instead of the hundreds of RGL constructors, the mini-resource has only 49 constructors (excluding the lexical ones) and 22 categories. It is a significant fragment of the full RGL, though: it includes imperatives (não ande com ela) and question sentences (vocês andam com o cachorro ?) and has noun and verb phrases, pronouns and conjunctions. We begin by describing the linearization of the mini-resource categories for Portuguese, then explain how some of the main constructor linearizations are implemented. Listing 3.8 shows how pronouns are created and represented, along with an example of a pronoun definition. We define a Pron type for pronouns, which is shorthand for a record with two fields: the main s field is a table from Case to strings, while the a field represents the agreement of the pronoun. Case, Agreement, Gender, Number, and Person are all parameters defined in listing 3.9. Listing 3.10 shows how nouns are represented in the mini-resource grammar for Portuguese. A noun is simply a record with two fields: one that stores the noun’s inherent gender, another that stores a table from the noun’s variable number to strings. (Common nouns, although not shown, get the same representation.) Noun phrases are slightly more

−− Pron
oper Pron : Type = {s : Case ⇒ Str ; a : Agreement} ;

mkPron : (_,_,_ : Str) → Gender → Number → Person → Pron ;
mkPron eu me mim g n p = {
  s = table {Nom ⇒ eu ; Acc ⇒ me ; Dat ⇒ mim} ;
  a = Agr g n p
} ;

Listing 3.8 – Concrete representation of pronouns and how they are built in the Portuguese mini-resource.

param
  Gender = Masc     −− aluno
         | Fem ;    −− aluna

  Number = Sg       −− alunos
         | Pl ;     −− alunas

  Case = Nom        −− as subject, e.g. ele
       | Acc        −− as object, e.g. o
       | Dat ;      −− as prepositional object, e.g. lhe

  Person = Per1     −− eu, nós
         | Per2     −− tu, você
         | Per3 ;   −− ele, ela, eles, elas

  Agreement = Agr Gender Number Person ;

Listing 3.9 – Portuguese parameter definitions in the Portuguese mini-resource.

involved, although they also have two fields: one stores their inherent agreement (which is a parameter representing gender, number and person), and another stores a table from a case (either nominative, accusative, or dative) to another record, which has two fields, one of which is a string (the object), and the other (the clitic) is a record of a string and a boolean value (which encodes whether the optional clitic is actually present or not). To make the representation of noun phrases clearer, we show in listing 3.11 examples of constructors that create NPs. It includes the UsePron constructor, which lifts pronouns to noun phrases. This code includes a syntactic novelty: the double slash (\\) shorthand for one-branch tables. Because the table in the s field of the DetCN constructor would have the same definition modulo the case parameter it receives, it can be abbreviated in a fashion similar to lambda abstractions: \\p ⇒ t is equivalent to table {p ⇒ t}, where p is either a variable or a wildcard (_).
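For instance, the two oper definitions below are equivalent (a toy illustration using the Case parameter of listing 3.9):

oper s1 : Case ⇒ Str = \\_ ⇒ "casa" ;
oper s2 : Case ⇒ Str = table { _ ⇒ "casa" } ;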

As its name suggests, DetCN builds an NP from a determiner and a common noun.

oper Noun : Type = {s : Number ⇒Str ; g : Gender} ;

NP : Type = {
  s : Case ⇒ { clit : Clit ; obj : Str } ;
  a : Agreement
} ;

Listing 3.10 – Concrete representation of nouns and noun phrases in the Portuguese mini-resource.

lin DetCN det cn = {
  s = \\c ⇒ {
    clit = emptyClit ;
    obj = det.s ! cn.g ! c ++ cn.s ! det.n } ;
  a = Agr cn.g det.n Per3
} ;

UsePN pn = {
  s = \\_ ⇒ { clit = emptyClit ; obj = pn.s } ;
  a = Agr pn.g Sg Per3
} ;

UsePron p = {
  s = table {
    Nom ⇒ { clit = emptyClit ; obj = p.s ! Nom } ;
    Acc ⇒ { clit = {s = p.s ! Acc ; hasClit = True} ; obj = [] } ;
    Dat ⇒ { clit = emptyClit ; obj = p.s ! Dat }
  } ;
  a = p.a
} ;

Listing 3.11 – Linearization of the UsePron constructor in the Portuguese mini-resource.

Its agreement field inherits from both, taking the gender from the CN and the number from the determiner, while its string field has no clitic and its object is the concatenation of the determiner’s table selected by the noun’s gender and the noun phrase’s case, and the CN’s string selected by the determiner’s number.¹⁰ Proper nouns are simply nouns which do not vary in number; their lifting to NPs ignores case and puts the PN’s string field as object, and their agreement features are standard, except for the gender.
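Although we did not show the determiner category (footnote 10 leaves it as an exercise), a minimal sketch that is consistent with the DetCN code above could look like the following; this is our guess, not the actual mini-resource code, and the lexical entry name is invented:

oper Det : Type = { s : Gender ⇒ Case ⇒ Str ; n : Number } ;

lin the_sg_Det = {
  s = table { Masc ⇒ \\_ ⇒ "o" ; Fem ⇒ \\_ ⇒ "a" } ;  −− Portuguese definite article, invariant in case
  n = Sg
} ;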

Regardless of case, the NP inherits the pronoun’s agreement features, and then different values of clitic and object are chosen depending on case. (The emptyClit constant is just a record of an empty string with a false value on the field that indicates if the clitic is present). Verbs are represented as simple tables from verb forms to strings. Because the tense system is simplified, verb forms are much less complicated than in the full Portuguese RG: there are only the infinitive, present, past, and imperative forms (see listing 3.12).

VForm = VInf                  −− amar
       | VPres Number Person  −− amo, amam
       | VPast Number Person  −− amei, amou
       | VImp ImpNumPer ;     −− ame

Listing 3.12 – GF parameter representing verb forms in the Portuguese mini-resource.

On the other hand, verb phrases are more involved. They have four fields: one stores the main verb, another an optional clitic, another the clitic agreement features, and the final one is a table from agreement features to a string, representing the complement (see listing 3.13).

Verb : Type = {s : VForm ⇒Str } ;

VP = {
  verb : Verb ;
  clit : Clit ;
  clitAgr : ClitAgr ;
  compl : Agreement ⇒ Str
} ;

Listing 3.13 – GF verbal concrete representations in the Portuguese mini-resource.

A simple verb phrase (VP) can be built from a verb by adding the verb to the appropriate field, and filling in empty versions of the other fields, as is done by the

¹⁰ Although we did not show the linearization of the determiner category, it is not difficult to see what it is – try it!

UseV constructor. A more interesting version of VP formation is given by the ComplV2 constructor, which takes a verb that expects a complement and a noun phrase. A verb that expects a complement (simply called a V2) is like a verb, but has an additional field for case. The main verb of the resulting VP is of course the input V2, while the VP’s clitic and its agreement are inherited from the NP’s clitic (if it exists). Finally, the VP’s complement is the NP’s object field. The code in listing 3.14 has two syntactic novelties: the use of let ... in for intermediary definitions, and GF tuples, which are enclosed in angle brackets (⟨ ⟩).

lin UseV v = {
  verb = v ;
  clit = emptyClit ;
  clitAgr = CAgrNo ;
  compl = \\_ ⇒ []
} ;

ComplV2 v2 np =
  let nps = np.s ! v2.c
  in {
    verb = {s = v2.s} ;
    clit = nps.clit ;
    clitAgr = case ⟨nps.clit.hasClit, v2.c⟩ of {
      ⟨True,Acc⟩ ⇒ CAgr np.a ;
      _          ⇒ CAgrNo } ;
    compl = \\_ ⇒ nps.obj
  } ;

Listing 3.14 – GF verbal complementation in the Portuguese mini-resource.

The mini-resource includes other constructors for building both VPs and NPs (see section 3.2.1), whose names and types are sufficient to tell what they do.

−− VP formation
fun AdvVP : VP → Adv → VP ;
fun ComplV2 : V2 → NP → VP ;
fun UseAP : AP → VP ;
fun UseAdv : Adv → VP ;
fun UseNP : NP → VP ;
fun UseV : V → VP ;

−− NP formation
fun DetCN : Det → CN → NP ;
fun MassNP : CN → NP ;
fun UsePN : PN → NP ;
fun UsePron : Pron → NP ;

We now describe how simple NP-VP clauses are built by the PredVP constructor (see listing 3.15). The clause (and question clause) representation is rather simple: it is a table from two boolean values to a string; one boolean value encodes the clause’s polarity (i. e., whether it is positive or negative), and the other encodes the tense (a simple choice between past and present). The clause resulting from PredVP is the concatenation of the subject, a possible “não” string if the clause selected is negative, the clitic, the verb (its form depends on tense), and the (optional) object. The subject is the (nominative) object string of the NP, while the object is the VP’s complement selected by the NP’s agreement. The clitic is inherited from the VP, and the verb form (modulo tense) is chosen using the NP’s features (number and person) by the agrV function. Building a sentence from a clause is pretty straightforward: it is just a matter of selecting the appropriate tense and polarity.

PredVP np vp = {
  s = \\isPos, isPres ⇒
        subj ++ neg isPos ++ clit ++ verb ! isPres ++ obj
  } where {   −− where is syntactic sugar for the let ... in expression
      subj = (np.s ! Nom).obj ;
      obj = vp.compl ! np.a ;
      clit = vp.clit.s ;
      verb = agrV vp.verb np.a
  } ;

Listing 3.15 – GF clause building in the Portuguese mini-resource.
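The sentence-forming constructor itself is then little more than a table selection; a minimal sketch consistent with the clause type above might be the following (illustrative only: the field names of the tense and polarity records are our own, and the actual mini-resource code may differ):

oper Cl : Type = { s : Bool ⇒ Bool ⇒ Str } ;   −− polarity, then tense, as described above

lin UseCl temp pol cl = { s = cl.s ! pol.isPos ! temp.isPres } ;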

3.2.2 Morphology

As discussed in section 3.1.3, API access to a resource grammar is implemented in the ParadigmsX module, (usually) with support modules named ResX and MorphoX. The API (with documentation) for Portuguese morphological paradigms can be found in the RGL synopsis.¹¹ In this section we discuss its implementation and presentation to the user. Inflection is the process which modifies a word according to a grammatical feature such as case, agreement or tense. For example, in Portuguese nouns are inflected by number, verbs by tense, and adjectives by both gender and number, while adverbs are invariant – i. e., have no inflection. An inflection paradigm (or a paradigm for short) is a regular inflection pattern of a language. As an example, the most common paradigm for the pluralization of Portuguese nouns is the one which adds an s suffix to the noun; however, another paradigm is the one which applies to some words ending in ão and pluralizes them by changing this suffix to ões. An inflection table for a word maps grammatical features to the correct inflections of the word given those features. Listing 3.16 features a GF inflection table on number in the string field of the lexical constructor for the word gramática, while table 1 shows a regular inflection table for the same word as one might find in an online dictionary.

¹¹ http://www.grammaticalframework.org/lib/doc/synopsis/index.html

grammar_N = {
  s = table {
    Sg ⇒ "gramática" ;
    Pl ⇒ "gramáticas" } ;
  g = feminine
} ;
−− same as grammar_N = mkN "gramática" ;

Listing 3.16 – The Portuguese lexical constructor for noun gramática

           Form
Singular   gramática
Plural     gramáticas

Table 1 – Inflection table for the word gramática

Getting inflection tables right is a large part of making a grammar produce correct output. Because word inflections in a language are repetitive, it is good GF practice to encapsulate these in morphological functions called paradigms, and make them available to the end-user of the grammar. The implementation of morphological paradigms in GF follows a pattern: a worst-case function taking all possible forms is devised, and then "smarter" functions (which take fewer input forms and try to guess the same output) are defined using the worst-case function. For readability, it is customary to define single paradigms too, i. e., paradigms that do no guessing despite taking fewer forms, and follow the inflection patterns specified by their names. An example is the peneirar_Besch function defined in the BeschPor¹² module: it expects a string corresponding to a verb infinitive form, and creates the full inflection table (63 forms) for a verb, assuming the inflection of the input verb matches that of the verb peneirar. Listing 3.17 shows a naive example implementation of a smart paradigm for Portuguese verbs: it takes the infinitive form of the desired verb, and tries to match its ending to one of the endings characteristic of the three biggest Portuguese conjugation groups; if, say, the verb form ends in -er, it applies the comer paradigm to the infinitive form, resulting in an inflection table that follows that of this verb. In case the infinitive form does not match any of the cases, a fall-through case simply builds a verb with a constant inflection table using the worst-case paradigm. GF always demands a value for each parameter of an inflection table, so the GF grammarian has two main choices when a word does not have a form for a given inflection: they can either use the special nonExist token, or they generate a ‘theoretical’ form, i. e., one that follows a regular inflection pattern. Because the former may cause runtime errors,

¹² Besch comes from the famous Bescherelle French books, which list verb inflection tables for French and other languages.

the latter is often preferred.

smartVerb : Str → Verb = λinf → case inf of {
  −− 1st conj group, ending in -ar
  am + "ar"   ⇒ verbAmar inf ;
  −− 2nd conj group, ending in -er
  com + "er"  ⇒ verbComer inf ;
  −− 3rd conj group, ending in -ir
  part + "ir" ⇒ verbPartir inf ;
  −− only pôr, supor and derivatives
  _           ⇒ verbPor inf
} ;

Listing 3.17 – Naive smart paradigm for verbs

The actual implementation of a smart paradigm for Portuguese verbs is more involved, but follows the same pattern. In the BeschPor module we define several verb paradigms, following the ones found in Portuguese verb conjugation tables [16,35]. Then we define a smart paradigm that takes the infinitive form of a verb and tries to guess which paradigm to apply to it, as can be seen in listing 3.18. In the case of Portuguese verbs, it is impractical for the end-user to use the worst-case paradigm, since that would require inputting more than 60 verb forms. Because of this, we do not even make it available in the end-user API – but it is nevertheless useful for defining the other paradigms in terms of it. It is customary for a resource grammar API to offer intermediate paradigms, which take fewer forms than the worst-case paradigm but more than the smart paradigm; in the case of Portuguese verbs, we found no heuristic that would result in a practical paradigm (i. e., one with not too many forms that performed significantly better than regV). Therefore we encourage end-users to simply look up the correct paradigm for a verb in case the smart paradigm guesses it wrongly, or just import the verb (if available) from the IrregPor module (see section 3.2.3.1 at page 72). For Portuguese nouns, we define paradigms for nouns that follow several inflection patterns (like those of the nouns vinho-vinhos, areia-areia, alemão-alemães, falcão-falcões, cidadão-cidadãos, nuvem-nuvens, rapaz-rapazes, canal-canais, réptil-répteis, etc.), and then create a smart paradigm that tries to guess the correct paradigm that yields the noun’s plural form and its gender from its singular form (a naive sketch in the same spirit is given below, after listing 3.18). Intermediate paradigms use the smart paradigm under the hood, simply forcing it to choose a certain gender, or a certain plural form. The adjective smart paradigm actually uses the smart paradigm for nouns under the hood; its only work is guessing the feminine singular form from the masculine singular form. After that both forms are passed separately to the smart paradigm for nouns, which guesses their plural forms, and these are combined into the full adjective.

regV : Str → V; regV s = case s of { chamar +"−se" ⇒ reflV (regV’ chamar) ; _ ⇒ regV’ s };

regV’ : Str → V; regV’ v = let xr = Predef.dp 2 v ; −−− ar z = Predef.dp 1 (Predef.tk 2 v) ; −− i in −iar paradigm = case xr of { " ir " ⇒ case z of { "g" ⇒ redigir_Besch ; "a" ⇒ sair_Besch ; "u" ⇒ distribuir_Besch ; _ ⇒ garantir_Besch }; "er" ⇒ case z of { "c" ⇒ aquecer_Besch ; "g" ⇒ proteger_Besch ; "o" ⇒ moer_Besch ; _ ⇒ vender_Besch }; "ar" ⇒ case z of { "c" ⇒ ficar_Besch ; "ç" ⇒ começar_Besch ; "e" ⇒ recear_Besch ; "g" ⇒ chegar_Besch ; "i" ⇒ anunciar_Besch ; "j" ⇒ viajar_Besch ; "o" ⇒ perdoar_Besch ; "u" ⇒ suar_Besch ; _ ⇒ comprar_Besch }; "or" | "ôr" ⇒ pôr_Besch ; _ ⇒ comprar_Besch −− fallback } in lin V (verboV (paradigm v)) ; Listing 3.18 – Portuguese verb smart paradigm 72 Chapter 3. A computational grammar for Portuguese

3.2.3 Extra modules

In this section we briefly describe our implementation of modules which are not part of the core RGL modules.

3.2.3.1 Morphological modules

Although not part of the core RGL because they are language-specific, big lexical modules are included in the RGs of the most complete languages. These modules help application programmers by allowing them to just import the correct lexical entry instead of having to decide which paradigm to use. This is even more important in the recurring case where the application programmer is not a native speaker (or a speaker at all) of the language they are implementing, in which case they would have to enlist a language speaker’s help in order to check their choices. If a dictionary module is available, the application programmer only needs to know the correct entry to import, and can rest assured that its inflected forms are all correct.¹³

Unwritten conventions of the RGL allow for two big lexical modules: DictX and IrregX. While the latter contains only irregular verbs, the former usually contains verbs, nouns, adjectives, and adverbs. To help define morphological resources for our Portuguese RG, we have participated in the creation of the MorphoBr resource [8], a large-scale full-form lexicon of Portuguese that consolidates existing lexical resources for Portuguese, corrects some of their gaps and mistakes, and adds new entries. This work has been fundamental to the definition of the DictPor module, providing us the data for the correct inflection of adjectives and nouns. On the other hand, we have defined the verbs in IrregPor by applying the appropriate paradigms from BeschPor. We have discarded MorphoBr data on verbs because it includes no verb valency information and thus adds little to the IrregPor and BeschPor modules. Adverb information was also discarded, since adverbs in the Portuguese RG are records of a single string field, and by knowing their lemmas they can be safely constructed – there is no need to include them in the dictionary module.

The IrregPor and DictPor modules are used in at least one application: the in-development wide-coverage grammar of the gf-wordnet project,¹⁴ which also uses wordnet data from the original Princeton wordnet [27] and its Portuguese version, OpenWordNet-PT [11].

¹³ The importance of large monolingual dictionaries has been highlighted by Aarne Ranta in a recent email to the GF list (https://groups.google.com/d/msg/gf-dev/YWajYB5CcEg/Q6MJXExvDgAJ), which finally pushed me to include one for Portuguese.
¹⁴ https://github.com/grammaticalframework/gf-wordnet

3.2.3.2 Grammar extensions

The RGL grammatical core is expressive, even though it is limited. Its stated goal [32] is not syntactic coverage, but semantic coverage: “The user must be able to find some way to express any content she wants to express”. Despite this non-goal, it is sometimes useful to extend the RGL grammatical backbone to add coverage; section 3.1.3 (page 58) explains how the ExtraX and ExtendX modules fulfill this role. The Portuguese RG has implementations of both modules. ExtraPor is partially inherited from the Romance functor, while ExtendPor was started from scratch – Portuguese is the second Romance language to have such a module (since Extend is a new module, started in late 2017), and it has the most complete Romance implementation at the moment. When a Spanish Extend module was implemented, we decided to refactor all existing Romance Extend modules into a functor, and to create Extend modules for the Romance languages that did not have one.

(3) a. (has a car been through here?) a green one has

b. (um carro passou por aqui?) um verde passou

(4) a. the man whose car has died sleeps

b. o homem cujo carro morreu dorme

The Extend module mostly has syntactic extensions like the GenRP constructor in listing 3.19, which adds a genitive relative pronoun corresponding to Portuguese cujo, or syntactic coercions like the AdjAsCN constructor in listing 3.20, which coerces an adjectival phrase to a common noun, allowing the grammar to recognize sentences like the ones in glosses 3 and 4.

GenRP nu cn = {
  s = \\_b,_aagr,_c ⇒ cujo ! g ! n ++ num ++ cn.s ! n ;
  a = aagr g n ;
  hasAgr = True
  } where {
      cujo = genNumForms "cujo" "cuja" "cujos" "cujas" ;
      g = cn.g ;
      n = nu.n ;
      num = if_then_Str nu.isNum (nu.s ! g) []
  } ;

Listing 3.19 – The GenRP constructor from Extend

AdjAsCN ap = {
  s = \\n ⇒ ap.s ! (genNum2Aform Masc n) ;
  g = Masc
} ;

Listing 3.20 – The AdjAsCN constructor from Extend

Another extension that has been partially implemented for Portuguese is the ParseExtend module of the gf-wordnet project. When complete, gf-wordnet is to be a wide-coverage translation grammar based on the GF RGL and WordNet. It needs grammatical extensions to the RGL so that it can attain its goal of syntactic coverage. Even though the project is ongoing, we have contributed a dictionary module based on both OWN-PT and MorphoBr, and the grammar extension module.

(5) a. he is intelligent enough to love the dog

b. ele é inteligente o suficiente para amar o cão

(6) a. I wanted not to live

b. eu queria não viver

(7) a. I wanted to live

b. eu queria viver

(8) a. lower your hands slowly

b. abaixe suas mãos lentamente

(9) a. slowly lower your hands

b. lentamente abaixe suas mãos

The ParseExtend module declares several new constructors. Many of them add linguistic constructions like the special modification of adjectives made by enough – see gloss 5, and notice that in English most adverbs modifying adjectives come before the adjective, as in he is completely stupid or in she is very smart.

Other functions generalize RGL constructors, like the ParseExtend version of ComplVV, which takes a verb with a verb phrase complement and a verb phrase like the one from the RGL, but additionally takes an anteriority and a polarity, allowing us to generate the sentence in gloss 6 (whereas the RGL version only permits gloss 7). Another example is that of the AdvImp constructor: it allows us to add an adverb modifying an imperative in front of the verb, like in gloss 9, whereas the most idiomatic way in English would be to put it after the verb (as in the RGL-supported gloss 8). However, if the goal is to have syntactic coverage we must be able to parse such constructions, even if they are relatively rare.

−− AdvImp : Adv → Imp → Imp ;
AdvImp adv imp = {
  s = \\pol,impform,g ⇒ adv.s ++ imp.s ! pol ! impform ! g
  } ;
Listing 3.21 – The AdvImp constructor from ParseExtend
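For comparison, the two ComplVV variants discussed above can be summarized by their type signatures (the RGL signature is the standard one; the argument order shown for the ParseExtend version merely illustrates the description above and may differ in the actual module):

ComplVV : VV → VP → VP ;               −− RGL version: "I wanted to live" (gloss 7)
ComplVV : VV → Ant → Pol → VP → VP ;   −− ParseExtend version, adding anteriority and polarity:
                                       −− "I wanted not to live" (gloss 6)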

3.3 Evaluation

In this section we discuss two evaluations of the Portuguese resource grammar. In the first evaluation experiment we use a previously available GF treebank to assess the grammatical correctness of the Portuguese linearizations produced by our grammar. In the second experiment we derive a GF treebank from an existing parallel treebank. We not only discuss the grammatical correctness of Portuguese linearizations, but also their appropriateness as translations of the English versions. In the latter experiment we additionally test and discuss the RGL coverage of the syntactic phenomena present in the original treebank.

3.3.1 UD examples corpus

The Universal Dependencies (UD) project is a collection of multilingual treebanks and a set of common guidelines to annotate them [9]. UD is heavily influenced by the Stanford dependencies project [10], which focused on English syntax. The latest UD release (2.3) counts 129 treebanks spanning 76 languages. As a byproduct of their work on converting GF trees to UD trees [20, section 6.1] (of which we will say more in section 4.3), Ranta & Kolachina constructed a GF treebank from sentences found in the English-language documentation of the UD project. The sentences were parsed and manually disambiguated (only the most appropriate analysis tree of a sentence was chosen, even if it had more available analyses). This treebank became a part of the testing treebanks of the RGL,15 and we used it to test the linearizations provided by the Portuguese RG.

15 https://github.com/GrammaticalFramework/gf-rgl/blob/master/treebanks/ud-rgl-trees.txt

Before linearizing the UD examples treebank, we revised its trees (there were at least three wrong trees) and removed some duplicated entries. A native Brazilian Portuguese speaker with no knowledge of GF judged that out of the 63 trees, 8 produced ungrammatical output. If we disregard trees that are wrong (we discuss below why we consider some trees wrong) and use a grammar extension, this number is reduced to 3 trees, one of which actually produces unidiomatic Brazilian Portuguese, but perfectly acceptable European Portuguese (see gloss 20). The numbers above refer to the latest version of the grammar; some of the issues we describe below have been fixed. We also discuss whether these issues are due to our implementation, the underlying Romance functor, the RGL syntactic coverage, or the treebank itself.

3.3.1.0.1 Copular verbs and adjectives

(10) a. John is handsome

b. João é bonito

c. João está bonito

(11) a. he was ready when I saw him

b. # ele era pronto quando eu o via

c. ele estava pronto quando eu o via

Copulas are words that join a sentence’s subject to its complement, as does is in the first sentence of gloss 10. In many languages copulas are verbs, but some languages do not even have copular words. While the English RG only has one copular verb (the verb to be), the Portuguese RG currently declares three copular verbs – the verbs ser, estar, and ficar. In many cases where an English sentence will use the to be copula, the appropriate Portuguese version of the sentence might use any of the Portuguese copulas (as is the case of gloss 10), even though there are cases (like gloss 11) where there is a clear choice. The difference between these cases is most of the time semantic and not syntactic, so there is little the resource grammarian can do to produce the appropriate tree for every case. One thing they can do, however, is to allow the user to specify which copula they want to use. We have thus modified the linearization type of adjectives in the Romance functor to do exactly this. The concrete representation of adjectives in the Romance functor was

A = {
  s : Degree   −− e.g., comparative
    ⇒ AForm   −− e.g., masculine singular
    ⇒ Str ;
  isPre : Bool −− positioned before or after the noun?
  } ;

encoding the fact that degree and agreement forms are variable but an adjective’s positioning is inherent, e.g., the use of espertas (and not, say, esperto) in elas são espertas depends on something (in this case, the sentence’s subject), while the fact that we sometimes say o bom velhinho instead of o velhinho bom, as we usually do for other adjectives (e.g., o velhinho teimoso), is inherent to the adjective, not depending syntactically on anything else in the sentence. Our modification of the Romance functor consisted in adding another inherent parameter defining which kind of copula an adjective uses, as in

A = {s : Degree ⇒ AForm ⇒ Str ; isPre : Bool ; copTyp : CopulaType} ;

Of course, the same adjective might use different copular verbs in different sentences, but nothing prevents the user from declaring two adjectives which differ only in their choice of copular verb. This promotes flexibility at the cost of increasing ambiguity in translations from languages that do not have such a distinction – although it also allows more possible translations, as in having both ele é arrumado and ele está arrumado be translations of he is tidy. This implementation decision is in line with GF’s preference for language generation, as it makes it easier for an application programmer to choose what kind of copular verb they want to use, especially when coupled with a new overloaded instance of the mkA function:

mkA : A → CopulaType → A ;  −− force copula type
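As a sketch of how an application grammarian can use this overload, one could define two lexical entries that force different copula types (the entry names and the CopulaType constructor names SerCop and EstarCop are hypothetical; the actual constructors in the grammar may be named differently):

lin tidy_ser_A   = mkA (mkA "arrumado") SerCop ;    −− linearized with ser: "ele é arrumado"
lin tidy_estar_A = mkA (mkA "arrumado") EstarCop ;  −− linearized with estar: "ele está arrumado"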

3.3.1.0.2 Copula type in complement of copula

(12) a. John is from the city

b. # João está da cidade

c. João é da cidade

The solution to the copula selection problem we have previously discussed is restricted to adjectival phrases being used as complements to the copula. Gloss 12 shows an example where the default copula choice (in this case the estar verb) is the wrong one, yielding wrong Portuguese linearizations. This is due to the definition of the CompAdv constructor, which builds a complement to a copula from an adverb and simply selects the estar copula regardless of its input adverb.16

A possible solution to this problem would be akin to that of the copula selection in adjectival phrase complements: we could simply add a parameter to adverbs saying which copula they take. We abstain from following this path, however. Adverbs are less frequently complements of copulas than adjectives, and we know of no semantic or syntactic motivation for them to take one copula or another. We would still like the end-user to be able to choose which copula they want, so we offer a general solution for all copula complements: we have taken a coercion constructor that forces the use of the estar copula in copula complements from a language-specific module of the Spanish RG, and added it to the RGL-wide Extend module. This solution introduces ambiguity in languages which do not have the ser/estar distinction: if before this change we would get one analysis tree for a sentence with a copula, now we get two trees, one whose linearization corresponds to the estar copula in the languages that have it, and another which corresponds to the usual copula (see figure 14).
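To make figure 14 concrete, the two analyses can be sketched as the following abstract syntax trees (built from standard RGL constructors plus the UseComp_estar coercion from Extend; the trees actually produced by the parser may contain additional wrapping functions):

−− linearized as "João é da cidade"
PredVP (UsePN john_PN)
  (UseComp (CompAdv (PrepNP from_Prep (DetCN (DetQuant DefArt NumSg) (UseN city_N)))))

−− linearized as "João está da cidade"
PredVP (UsePN john_PN)
  (UseComp_estar (CompAdv (PrepNP from_Prep (DetCN (DetQuant DefArt NumSg) (UseN city_N)))))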

3.3.1.0.3 Incorrect trees

(13) a. what she did is important

b. * que ela fazia é importante

Gloss 13 would seem to show a wrong Portuguese linearization produced by our grammar. This is not really the case, because the analysis tree itself is incorrect – it just happens to produce coherent English text. Figure 15 shows the abstract syntax tree for the sentences in gloss 13. We can see that the tree corresponds to an incorrect syntactic interpretation where the main clause is composed of an embedded question sentence corresponding to what she did; but this constituent is not a question in this case, so the linearization is off. We were not able to produce the correct analysis of the sentence in gloss 13 using the RGL or its extensions.

(14) a. John is not a doctor

b. * João é nem um médico

A similar situation is displayed by gloss 14. The English linearization seems fine, while the Portuguese one is wrong. If we analyze the original analysis tree from the treebank, though, we find that it is incorrect and only produces a proper English sentence by chance. Figure 16 shows both the incorrect analysis (a.) and the correct analysis (b.) for the sentence from gloss 14. Analysis (a.) takes not to be a predeterminer (as in not a single dog barked) rather than a negation, which is why the Portuguese linearization produced is incorrect. Unlike the case of gloss 13, for this sentence we were able to produce a correct abstract syntax tree, which gives us the correct linearizations in gloss 15.

16 Note that the GF RGL treats prepositional phrases like from the city as adverbs, hence our use of the term here.

Figure 14 – Abstract syntax trees for John is from the city; the hole in the diamond-shaped node can be filled by both UseComp and UseComp_estar.

(15) a. John isn’t a doctor

b. João não é um médico

3.3.1.0.4 Romance clause inversion

Romance clause inversion was implemented with French in mind, and therefore was performing an inversion which is not idiomatic in Portuguese, as in gloss 16.

Figure 15 – Abstract syntax tree for what she did is important.

Figure 16 – Two analyses for the sentence John is not a doctor

(16) a. which book do you like

b. de qual livro você gosta?

c. ! de qual livro gosta você ?

Our first solution was simply to not invert clauses in any questions, but that is not correct either, for it gives us unnatural-sounding sentences (even if they are grammatical), as in gloss 17.

(17) a. what is love ?

b. que é o amor ?

c. ! que o amor é ?

Portuguese inverts word order in question clauses when a question starts with an interrogative word (like onde or que) and the verb is copular. We therefore edited the Romance functor to allow a finer-grained choice of when inversion is performed, and changed the Portuguese (and Spanish) grammars to invert copular questions but not to perform the inversion in other kinds of questions. This gives us the intended linearizations. If the subject of a copular question is a pronoun or a proper name and the question starts with an interrogative, the inversion is optional; however, the Portuguese RG will always perform it, for predictability. Predictability is important for natural language generation because it avoids surprising the user, and it makes for a consistent (and thus easier to use) API. Another instance of word inversion in Portuguese questions occurs in very formal writing; gloss 18 shows an example taken from [41]. The RGL does not implement formality in a general way: it is not the case that every abstract syntax tree has a formal/informal version. What the RGL does have is constructors that express formality in a few grammatical constructions: there is a constructor for formal and informal imperatives, for example.

(18) a. Why did the president refuse to sign the decree?

b. Por que se recusou o presidente a assinar o decreto?

3.3.1.0.5 Romance clitic pronouns

(19) a. I wanted to eat it

b. * eu queria comer o

The default implementation of clitic pronouns in the Romance functor fails for Portuguese, as can be seen in gloss 19. We have not yet changed it to correctly implement Portuguese enclitic and mesoclitic pronouns because we do not see a way of doing it without sacrificing the grammar’s performance.

3.3.1.0.6 The no quantifier

(20) a. John saw no animals

b. * João não via nenhum animais

c. João não via nenhuns animais

The correct Portuguese linearization in gloss 20 is not idiomatic in Brazilian Portuguese (even though it is in European Portuguese). Despite this consideration, it is nevertheless a good linearization for the corresponding syntax tree, which is displayed in figure 17. Both the noun (animal_N) and its quantifier (no_Quant) are variable in number, and must share the same number when it is determined; this means that if we were to force an invariable no_Quant by defining its plural form to be nenhum, we would still have a plural animal_N, yielding the incorrect Portuguese linearization at gloss 20. This was in fact the linearization we originally had, before realizing our mistake of trying to force no_Quant to be invariable.

(21) a. John saw no animal

b. João não via nenhum animal

The idiomatic Brazilian Portuguese linearization is still available, although mapped to a different English linearization, as shown by gloss 21.

3.3.2 Matrix MRS test suite

The development of a grammar requires a way of testing its results. In the initial phase of grammar writing, testing the grammar in the controlled environment that is a corpus is ideal, because one can test basic grammatical constructs without worrying too much about robustness. As the RGL is a multilingual grammar, a multilingual corpus is best suited, because it would allow us to not only test the translations produced by our grammar, but also to test the coverage of grammatical phenomena in the RGL (using another language’s grammar, in our case that of English). The Matrix MRS corpus17 was created to be “a short but representative set of sentences”, and is used as a test suite for the English Resource Grammar (ERG) under the Head-driven Phrase Structure Grammar (HPSG) formalism. The fact that this corpus is successfully used by an industrial-strength grammar and is limited in both vocabulary and size made it an ideal choice for an initial experiment. Composed of 107 sentences, the corpus is available in English and Portuguese versions, besides several other languages.

17 http://moin.delph-in.net/MatrixMrsTestSuite

Figure 17 – Abstract syntax tree for John saw no animals.

In order to test the coverage of the RGL, we employed its English grammar to parse the Matrix sentences. Out of the 107 sentences, 77 were parsed with only minor lexical additions. Using an extension module of the RGL allowed us to parse an additional 21 sentences, and two small grammatical constructs were added to parse another 2 sentences. The remaining 6 sentences demand more significant grammatical additions in order to be parsed; we discuss some of them in section 3.3.2.1.

The procedure we followed amounted to preprocessing the input sentences (downcasing sentence-leading letters and adding spaces between tokens), then parsing them with the English GF resource grammar (our additions to it are collected in the MatrixEng module18). This yielded GF abstract syntax trees, most of the time several of them for each sentence. We then manually removed spuriously ambiguous trees using a primitive (tool-less) form of discriminant-based disambiguation, inspired by the approach in [29]. Take the sentence Some bark: it has readings where bark is taken as a noun and readings where bark is taken as a verb; by deciding whether bark is a noun or a verb in a given sentence, we may remove all trees where it is taken to be something else. We can generalize this approach from lexical units to subtrees, which gives us even more discrimination power. Even in highly ambiguous sentences producing thousands of trees, one can find the plausible tree(s) in relatively few disambiguation steps. Finally, we picked the disambiguated trees and linearized them in Portuguese using the Portuguese RG.

18 https://github.com/odanoburu/gf-matrix

The Portuguese linearizations were then compared to their corresponding sentences in the test suite, and analyzed with respect to grammatical correctness. We do not test the translated sentences for equivalence because translation equivalence is not a goal of the RGL [32]. A native speaker of Brazilian Portuguese with no knowledge of GF judged that out of the 101 trees we produced, 11 did not have a grammatically correct linearization. Our own judgement is that the wrong linearizations number only 7.
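As a rough sketch of this workflow in the GF shell, the parsing and linearization steps look roughly as follows (the compiled grammar and concrete module names here are illustrative, not the actual files of our experiment; the shell commands import, parse, and linearize and the -lang flag are standard GF shell features):

> import Matrix.pgf                                      −− hypothetical compiled grammar
> parse -lang=MatrixEng "some bark"                      −− returns all abstract trees for the sentence
> parse -lang=MatrixEng "some bark" | linearize -lang=MatrixPor

In practice the disambiguation happens between parsing and linearization: we inspect the trees returned by parse and keep only the tree(s) compatible with our discriminant decisions before linearizing them in Portuguese.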

3.3.2.1 Discussion

3.3.2.1.1 Mismatched tenses

Gloss 22 shows an unidiomatic translation of the English sentence. Although both Portuguese translations are syntactically correct, the former uses the auxiliary verb in the present and the main verb in the participle form, while the latter simply uses the Portuguese pretérito perfeito tense. Even though the former linearization is not a bug (since it is syntactically correct), we do consider the fact that it was impossible to obtain the latter linearization from the core RGL (i.e., without using grammar extension modules) a bug.

(22) a. the dog has barked

b. o cachorro tem ladrado

c. o cachorro ladrou

To understand why that was not possible, we must first understand how the RGL tense system works. An RGL clause like the one in gloss 22 is built using the PredVP constructor, which takes a noun phrase and a verb phrase and builds a clause. Clauses have no fixed tense; it is when we transform them into sentences using the UseCl constructor that we specify what tense we want. The RGL tense API is a combination of anteriority (simultaneous or anterior) and temporal order (past, present, future, or conditional), for a total of 8 different tenses [33, section 5.17]. The RGL does not enforce a common tense model, but it does offer a common API for all languages: the UseCl constructor will only take these 8 tenses as arguments, although language-specific constructors taking more tenses may exist in a particular RG. The Romance functor’s tense API adds a second form of past temporal order (corresponding to the Portuguese pretérito perfeito) to the RGL’s, for a total of 10 tenses.
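A minimal sketch of this division of labour for gloss 22, using standard RGL constructors (dog_N comes from the RGL lexicon; bark_V is assumed to be available as a lexical entry):

−− the tenseless clause "the dog bark(s)"
PredVP (DetCN (DetQuant DefArt NumSg) (UseN dog_N)) (UseV bark_V)

−− "the dog has barked": present temporal order, anterior
UseCl (TTAnt TPres AAnter) PPos
  (PredVP (DetCN (DetQuant DefArt NumSg) (UseN dog_N)) (UseV bark_V))

−− "the dog barked": past temporal order, simultaneous
UseCl (TTAnt TPast ASimul) PPos
  (PredVP (DetCN (DetQuant DefArt NumSg) (UseN dog_N)) (UseV bark_V))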

Anteriority   Temporal Order   Romance   Example
ASimul        TPres            RPres     o cachorro ladra
              TPast            RPast     o cachorro ladrava
              –                RPasse    o cachorro ladrou
              TFut             RFut      o cachorro ladrará
              TCond            RCond     o cachorro ladraria
AAnter        TPres            RPres     o cachorro tem ladrado
              TPast            RPast     o cachorro tinha ladrado
              –                RPasse    o cachorro teve ladrado
              TFut             RFut      o cachorro terá ladrado
              TCond            RCond     o cachorro teria ladrado

Table 2 – RGL and Romance tenses, and how they inflect verbs in a Portuguese sentence

Table 2 shows how RGL tenses are mapped to Romance tenses and how those were mapped to actual verb forms, by giving the example in gloss 22 with varying tenses.19 When mapping RGL tenses to Romance tenses, the Romance functor made the choice of using the auxiliary verb in the present and the participle form of the main verb in the case of an RGL present anterior tense, as in French le chien a aboyé – which is perfectly idiomatic French – and in Portuguese o cachorro tem ladrado (see table 2). It was due to this choice that we could not get o cachorro ladrou as the translation of the dog has barked, since RPasse – the temporal order producing Portuguese pretérito perfeito – is a Romance-only construct. This does not simply affect RGL translation: application programmers were also forced to use RGL extensions to achieve o cachorro ladrou. Because we considered that only offering such a crucial aspect of the language in an extension was a bad user interface, we decided to make the Romance tense mapping to verb forms language-dependent. We did this by including a parameter function in the DiffRomance module with a standard implementation that is overridden by Portuguese, as in listing 3.22. We are then able to have o cachorro ladrou and le chien a aboyé as translations of the dog has barked; the Portuguese pretérito perfeito is available in general as the RGL present anterior tense (the AAnter x TPres combination in table 2).

(23) a. Abrams handed Browne the cigarette

b. ! Atlas passava o cigarro a Bobi

c. Atlas passou o cigarro a Bobi

19 Note that the form corresponding to the combination AAnter x RPasse is ungrammatical; however, GF demands a value for each parameter combination and this was the choice that made the most sense, so that we had forms corresponding to the auxiliary verb in all possible tenses as we do for the main verb.

oper TAtoVF : RTense → Anteriority → (VF ⇒ Str) → (VF ⇒ Str) →
              Number → Person → Mood → Str → Str * Str ;

TAtoVF t a verb vaux n p m part = case ⟨t,a⟩ of {
  ⟨RPast,Simul⟩  ⇒ ⟨verb ! VFin (VImperf m) n p, []⟩ ;
  ⟨RPast,Anter⟩  ⇒ ⟨vaux ! VFin (VImperf m) n p, part⟩ ;
  ⟨RFut,Simul⟩   ⇒ ⟨verb ! VFin VFut n p, []⟩ ;
  ⟨RFut,Anter⟩   ⇒ ⟨vaux ! VFin VFut n p, part⟩ ;
  ⟨RCond,Simul⟩  ⇒ ⟨verb ! VFin VCondit n p, []⟩ ;
  ⟨RCond,Anter⟩  ⇒ ⟨vaux ! VFin VCondit n p, part⟩ ;
  ⟨RPasse,Simul⟩ ⇒ ⟨verb ! VFin VPasse n p, []⟩ ;
  ⟨RPasse,Anter⟩ ⇒ ⟨vaux ! VFin VPasse n p, part⟩ ;
  −− default Romance: ⟨RPres,Anter⟩ ⇒ ⟨vaux ! VFin (VPres m) n p, part⟩ ;
  ⟨RPres,Anter⟩  ⇒ ⟨verb ! VFin VPasse n p, []⟩ ;
  ⟨RPres,Simul⟩  ⇒ ⟨verb ! VFin (VPres m) n p, []⟩
  } ;
Listing 3.22 – A function mapping temporal order and anteriority to verb forms

(24) a. Abrams knew that it rained

b. Atlas sabia que chovia

Gloss 23 highlights another problem with tenses: the Portuguese RG assigns the Portuguese pretérito imperfeito to the RGL tense to which the English RG assigns the English simple past, which ends up creating unidiomatic sentences. However, it is not always the case that the pretérito imperfeito is an unidiomatic translation of the English simple past, as gloss 24 shows. There is no general solution to this translation problem at the RGL level (which is syntactic); this is an example of why the RGL does not have translation equivalence as a goal.

3.3.2.1.2 Whose as interrogative

(25) a. whose dog barked ?

b. * quem cachorro ladrava ?

Gloss 25 shows a trickier example than it seems. Google Translate (as of 2018-10-02) will translate it wrongly into Portuguese, Spanish, and French.20 Appropriate Portuguese translations such as de quem é o cachorro que ladrou? or o cachorro de quem ladrou? are difficult to represent in GF because they are not parallel to other questions which share the same underlying abstract tree. Gloss 26 shows a sentence whose abstract syntax tree is almost the same as the one of gloss 25; in English their structure is almost the same, but the correct Portuguese translations have different structures (one introduces a copula, the other inverts the constituent order). To have the QuestVP constructor, which builds both sentences, behave in such different ways, we would need to add a new parameter to the linearization type of interrogative pronouns and build the resulting question clause according to it. That would be a valid solution if linguistic theory supported such a classification of interrogative pronouns. We are not aware that it does, so we argue that a better solution is to add a constructor for whose-sentences in the Construction module of the RGL, which already contains constructors for other kinds of questions that are exceptional, like for example how old are you? (whose appropriate Portuguese translation again has a different structure).

20 As cujo cachorro latiu?, de quien perro ladró?, and quel chien a aboyé?, respectively. Interestingly, it makes a different kind of mistake for each language.

(26) a. which dog barked ?

b. qual cachorro ladrava ?
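Returning to the proposed solution, a constructor for whose-questions in the Construction module might be declared along the following lines (entirely hypothetical – neither the name nor the signature exists in the RGL at the time of writing):

fun WhoseQCl : CN → VP → QCl ;   −− "whose dog barked ?" built from the CN dog and the VP barked,
                                 −− letting Portuguese linearize it as "o cachorro de quem ladrou ?"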

3.3.2.1.3 Compound nouns

(27) a. the garden dog barked

b. o cachorro do jardim ladrava

Gloss 27 needed the definition of the CompoundN construction rule (that creates compound nouns) from the Extend grammar extension to produce a non-empty Portuguese linearization. After defining it, we were able to perform the linearization correctly.

3.3.2.1.4 Adverb placement

(28) a. the dog probably barked

b. # o cachorro ladrava provavelmente

c. o cachorro provavelmente ladrava

(29) a. the dog barked softly

b. o cachorro ladrava suavemente

c. o cachorro suavemente ladrava

GF has two adverb categories: AdV for adverbs that are directly attached to the verb (like always), and Adv for regular adverbs. The Romance functor currently inserts the Adv in a VP correctly, but fails to do so for AdV. For instance, for an Adv like softly, both Portuguese linearizations in gloss 29 work, although the first is more idiomatic. We have produced a patch for the Romance functor in order to fix this.
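For reference, the two categories enter a verb phrase through different RGL constructors (signatures from the RGL abstract syntax; the Portuguese fragments in the comments illustrate the word order discussed above):

AdvVP : VP → Adv → VP ;   −− regular adverb; in Portuguese: "ladrava suavemente"
AdVVP : AdV → VP → VP ;   −− verb-attached adverb; in Portuguese: "provavelmente ladrava"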

3.3.2.1.5 Incorrect tense choices

(30) a. Abrams intended Browne to bark

b. * Atlas pretendia que Bobi ladrar

c. Atlas pretendia que Bobi ladrasse

The verb intend in gloss 30 takes two complements, one of which is a verb. As we can see, the current Portuguese linearization employs the indicative mood, whereas the ideal linearization would use the verb complement in the subjunctive. There is currently no way of specifying which mood the complement verb should use in the RGL (most likely because it is not necessary for most of its languages), hence the wrong linearization.

3.3.2.1.6 Tag questions

The RGL does not support tag questions; therefore, we could not parse the sentence the dog barked, didn’t it. This means that the sentence is not recognized by the English RG, and so it has no GF abstract syntax tree.

3.3.2.1.7 Date and time units

(31) a. it took Abrams ten minutes to arrive

b. levava dez minutos para Atlas chegar

To parse gloss 31 we needed to add a new verb category (V3V, for verbs taking two noun phrase complements and a verb phrase complement), plus a construction rule to create noun phrases from time units.
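A sketch of the additions described above (the names and exact signatures here are our illustration; the actual definitions live in our MatrixEng additions and may differ):

cat V3V ;   −− verbs like take in "it took Abrams ten minutes to arrive"

fun ComplV3V : V3V → NP → NP → VP → VP ;   −− combine the verb with its two NP complements
                                           −− and its VP complement
fun TimeUnitNP : Card → CN → NP ;          −− build an NP from a number and a time unit,
                                           −− e.g. "ten minutes"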

(32) a. June third arrived

b. o três de junho chegava

(33) a. Abrams arrived at three twenty

b. # Atlas chegava às vinte de três

c. Atlas chegava às três e vinte

GF has construction rules for dates, which allow us to correctly parse gloss 32. Without these rules, we could have parsed June third as a compound noun, which in this case would yield the same Portuguese linearization. Consider gloss 33, however. Its second sentence is the Portuguese linearization of the tree analyzing three twenty as a compound, while the third sentence is the linearization of the tree analyzing it using a special construction rule for clock times. We generally do not want to define special construction rules if we can avoid it, but in this case they are permissible, since linguistic constructions for expressing time are indeed special cases in most languages.

3.3.2.1.8 Incomplete sentences

(34) Abrams could

(35) Browne tried to

Glosses 34 and 35 are incomplete sentences. Due to the way GF is structured, it could parse a sentence missing an NP complement, or a VP missing a complement, but not a sentence missing a verb complement. That is the case because there are intermediate categories for the former cases, but not for sentences missing a verb complement. A way to solve this problem is to create coercions; if we had a coercion from a verb that takes a verb complement (like could) to a verb that takes no complements, we would be able to parse gloss 34 successfully (see the sketch below). The use of coercions introduces ambiguity, however, so coercions are usually declared in the grammar extension modules and not in the core RGL.
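A coercion of the kind described might look as follows (hypothetical – no such function exists in the RGL or its extensions, precisely because of the ambiguity it would introduce):

fun UseVVasV : VV → V ;   −− treat a complement-taking verb like could as a plain verb,
                          −− so that "Abrams could" parses as a complete sentence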

3.3.2.1.9 Iberian negative imperatives

(36) a. don’t bark

b. * não ladra !

c. não ladres !

Gloss 36 corresponds to an abstract syntax tree specifying a formal imperative in the singular. Because the imperative is in the negative, its conjugation differs from that of the positive imperative, which would be ladra !; it should follow the conjugation of the present subjunctive tense instead. Portuguese shares this feature with Spanish and Catalan, whose original implementations also produced wrong linearizations where the imperative behaved as if in the affirmative. Another problem with Iberian negative imperatives (although it did not show up in our evaluations) is showcased by gloss 37 – reflexive pronouns were being placed after the verb (não vista-se agora*), but this has also been fixed.

(37) a. don’t love yourselves

b. * não amem-se

c. não se amem

4 Applications

In this section we present three multilingual applications of the GF resource grammar library, which now include Portuguese because of our implementation. The intent of this section is to show how easy it is to extend an existing multilingual application with a new language, given a GF implementation of a resource grammar.

4.1 Health-domain application grammar

Digital Grammars (DG1) is a company founded by the GF core team that uses GF as a production tool.2 One of DG’s products is a healthcare-domain translation chat app, a demo of which is available from https://www.digitalgrammars.com/demo/chat/ (a screenshot of the app demonstration is in figure 18). The app features a small controlled natural language (CNL) that talks about general health questions, and is intended as a demonstration of a much larger industrial app meant for healthcare providers in multicultural societies. (We employ the term CNL here meaning a subset of English, but the definition of CNLs is far from straightforward [22].) Among the languages the app translates is Portuguese, supported by a concrete grammar written by us during the GF summer school 2018.3 A first version of this grammar was supported by the miniresource grammar described in section 3.2.1. A later, alternative version is supported by the Portuguese RG that is part of the RGL, and was developed as the instantiation of a functor. Both versions are under 150 lines of GF code (the functor version being even shorter) because they are supported by an underlying Portuguese grammar which does the heavy lifting.
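The functor version essentially consists of an instantiation along the following lines (module names are illustrative; SyntaxPor is the RGL API instance for Portuguese, while the application modules Health, HealthI, and LexHealthPor are hypothetical stand-ins for the actual application grammar):

concrete HealthPor of Health = HealthI with
  (Syntax = SyntaxPor),
  (LexHealth = LexHealthPor) ;

The functor HealthI would contain all the language-independent rules, so the Portuguese module only needs to plug in the RGL syntax instance and a small domain lexicon.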

4.2 Attempto Controlled English and GF

Attempto Controlled English (ACE4) is a controlled natural language (CNL) comprising an unambiguous subset of the English natural language, which attempts to provide precision in textual descriptions while maintaining their naturalness, allowing untrained readers to understand them because of their similarity to English. ACE-in-GF is a project aiming to implement ACE’s syntax using GF, so that ACE can become multilingual. The current coverage is almost complete for the OWL-compatible5 subset of ACE. In this subsection we describe the ACE-in-GF project, show how we ported it to Portuguese with support from the Portuguese RG, and discuss a few of its capabilities.

1 https://www.digitalgrammars.com/
2 There are a few other companies who do so, textual (https://textual.ai) among them.
3 Available from https://github.com/GrammaticalFramework/gf-summerschool-2018/tree/master/application
4 For more information see http://attempto.ifi.uzh.ch/
5 https://www.w3.org/TR/owl2-overview/

Figure 18 – Screenshot of the DG demo app

4.2.1 ACE

ACE is a CNL because although it looks like the natural language English, it is in fact a formal language, i.e., a language completely described by a grammar. Despite the ill-definition of the concept of a CNL [22], in the case of ACE it means a subset of English – with a restricted grammar and a domain-specific vocabulary – that can be unambiguously translated to a notational variant of first-order logic [18], that of discourse representation structures. For an introduction to discourse representation theory, see Blackburn & Bos [4]; for a concrete specification of discourse representation structures in ACE, see [17]. To be able to unequivocally translate ACE text to first-order logic – something that cannot be done for English – two main techniques are employed: removing dispensable ambiguous constructions from the language (favoring the use of unambiguous ones instead) and interpreting the remaining ambiguous constructions using deterministic rules. To reflect their intent in a text, a user might need to rephrase the input. This is aided by two tools: a predictive editor helps the user stay within the grammar’s and lexicon’s domain, and the ACE parser will produce paraphrases from an accepted input text to help the user ascertain whether the ACE interpretation of the text is the one they meant. In the following example, taken from the ACE documentation,6 the first sentence is the input, and the following sentences are the paraphrases produced by the ACE parser.

(38) a. A customer inserts a card that is valid and opens an account.

b. There is a customer X1. There is a card X2. The card X2 is valid. The customer X1 inserts the card X2. The customer X1 opens an account.

If the user intended the interpretation where the card opens the account, they must rephrase their input to A customer inserts a card that is valid and that opens an account, which will produce the appropriate paraphrase. ACE construction rules summarizing its grammar can be found in the online documentation,7 as can the interpretation rules followed by the ACE parser.8 A more complete description of the ACE grammar is also available online.9

4.2.2 ACE-in-GF

ACE-in-GF10 is a project intending to implement the syntax of ACE in GF, allowing it to cover more languages more easily. The project does not reimplement the ACE parser; it simply translates between the other languages and ACE, which means that the ACE parser is still necessary in most use cases. Implementing ACE in GF lends it not only multilinguality, but also support for GF technologies like lookahead writing prediction (which shows possible completions to the sentence being written), embeddable grammars, and conversion to speech-recognition grammar formats [5].

Lookahead editing, for example, is very important for CNLs in general, ACE among them. Because CNLs are similar to their host languages they are very easy for readers to understand, but they are also very difficult to write, since the user has to learn their rules to be able to differentiate them from their host languages. Lookahead editing helps fix this problem by showing possible completions of an unfinished sentence. In the particular case of ACE, though, GF lookahead editing is not sufficient out of the box, because GF has no built-in notion of anaphoras, which are a core part of ACE [21]. Kuhn [21] exemplifies the problem with the partial sentence every man protects a house from every enemy and does not destroy.... Possible anaphoric completions to this sentence would refer to both man and house, but not to enemy, because enemy would be out of scope, since in ACE verb phrases close at their end all scopes that are opened within them. In a GF lookahead editor, however, all these anaphoric completions would be available, since there is no way to enforce an anaphoric constraint in GF [21].11 This is not to say that it is impossible to implement such a lookahead editor with GF – indeed, one can be implemented in a general-purpose programming language with bindings to GF. But it does mean that there is no way of encoding declaratively the anaphora-related grammatical rules of ACE in GF. It is because of this that Kuhn implemented a new grammar notation capable of encoding such rules [21]. For more on the expressivity of GF, refer to [24].

6 http://attempto.ifi.uzh.ch/site/docs/ace_nutshell.html
7 http://attempto.ifi.uzh.ch/site/docs/ace_constructionrules.html
8 http://attempto.ifi.uzh.ch/site/docs/ace_interpretationrules.html
9 http://attempto.ifi.uzh.ch/site/docs/syntax_report.html
10 https://github.com/Attempto/ACE-in-GF

4.2.3 Implementation and example usage

The Portuguese concrete grammar of ACE-in-GF is built on top of the functor serving all languages of the project, with support from the Portuguese RG. A full evaluation of the ACE-in-GF project is available from [5], which excludes our later implementation. We have not repeated this evaluation including Portuguese, since our intent here is to show how easily one can add another language to a GF multilingual application given a resource grammar, not to ascertain how well an instance of such an application performs.

To exemplify how ACE-in-GF could be used, let us take a real-world example of a CNL whose use has been discontinued because of problems which ACE-in-GF might solve. Caterpillar Fundamental English (CFE) was developed by the Caterpillar company to facilitate communication with its non-English speaking staff [40] (as cited in [22]).12 It was mainly used in service manuals that had to be read by employees whose native languages numbered more than 50. Instead of having the manuals translated – which was costly and might introduce errors – Caterpillar had its employees complete a 30-lesson course on CFE that made them capable of understanding the language. CFE ended up discontinued due to Caterpillar’s inability at the time to enforce the language’s rules [19] (as cited in [22]), which often made the language degenerate into full English. This kind of enforcement is nowadays simply solved by a syntax-driven interactive editor providing lookahead editing and syntax checking, but no such tool was available at the time of CFE’s introduction. Some years later Caterpillar introduced another CNL whose aims were to have its rules enforceable and to diminish translation costs – as opposed to eliminating them. This later language has a use case similar to that of ACE-in-GF: a technical writer would use ACE’s or GF’s interactive editor to write a text in the CNL, and ACE-in-GF would then produce translations at reasonably low cost and with reasonable quality. Gloss 39 shows an example of an ACE sentence translated to Portuguese, taken from the ACE-in-GF documentation.

11 We suspect that these constraints could be enforced using the dependently-typed fragment of GF; however, this implementation choice would incur performance and maintainability costs.
12 We do not claim that Caterpillar could or should use ACE-in-GF; CFE dates from 1971 and a lot has changed since then.

(39) a. it is false that X is read by nothing but computers that Y doesn’t see

b. é falso que X é lido só por computadores que Y não vê

4.3 GF to UD

We have given an overview of the Universal Dependencies (UD) project in section 3.3.1. In this section we first present the kinds of trees we will discuss, and then show the algorithm developed by Kolachina & Ranta [20] that allows the transformation of GF RGL abstract syntax trees into UD trees. Further work from Kolachina & Ranta on the conversion of UD trees to GF trees is also available [34], but we will not discuss it here. Converting GF trees to UD trees has a few use cases; some of the ones mentioned in [20] include:

1. it allows GF to be used as a rule-based dependency parser;

2. it enables bootstrapping of UD treebanks from GF treebanks (this is especially interesting for languages that have little to no UD data);

3. it becomes a way of checking manually annotated UD trees (at least for those that can be generated by GF).

This conversion is enhanced by a small language-specific13 configuration, which is what we created for Portuguese to allow the conversion of syntax trees to Portuguese UD dependency trees. This and the Portuguese RG itself are our only contributions to this section. Finally, we discuss aspects of the conversion, focusing on how the results deviate from the UD standard, following [20].

4.3.1 Trees

In this section we briefly explain the different kinds of trees involved in the GF to UD conversion. Figure 19 shows examples of these trees.

4.3.1.0.1 Abstract syntax trees

Are trees whose nodes and leaves are abstract syntax functions (see figure 19a).

13 Actually it is concrete grammar-specific, since the conversion could work for any GF grammar, not only the RGL.

4.3.1.0.2 Parse trees

Are trees whose nodes are categories and the leaves are words (strings) (see figures 19b and 19c).

4.3.1.0.3 Dependency trees

Are trees whose nodes are words and whose edges have dependency labels; word order is significant (see figures 19d and 19e).

4.3.1.0.4 Abstract dependency trees

Are trees whose nodes are constant functions and whose edges have dependency labels; word order is insignificant (see figure 19f).

4.3.2 The algorithm

We can transform an abstract syntax tree into a dependency tree by picking one of its arguments as the head, and giving the other arguments appropriate labels. For example, given an abstract syntax tree created by a function such as

Pred : V → Subj → Obj → S ;

we can derive a dependency tree where the first argument is the head, the second is a dependent with label Subj, and the third is another dependent with label Obj. To have more control over the conversion process, we can specify a dependency configuration, which specifies which of the arguments is the head, and what the other arguments must be labelled. In the GF to UD conversion, this is done by a text file such as the one in listing 4.1.

PredVP      nsubj head    −− NP → VP → Cl
QuestIAdv   advmod head   −− IAdv → Cl → QCl
ComplSlash  head dobj     −− VPSlash → NP → VP
SlashVV     head acl      −− VV → VPSlash → VPSlash
Listing 4.1 – Dependency configurations for a few RGL functions

This handles one level of the tree; given an abstract syntax tree T and a word sequence S, we can derive a dependency tree by generalizing this idea to a recursive algorithm:

1. For each word w in S, find the function fw forming its smallest spanning subtree in T.

Figure 19 – Different kinds of trees

2. Link each word w in S with either

a) the head argument of fw, if w is not the head;

b) or the head of the whole fw, if w is itself the head

(The smallest spanning subtree of a word is the subtree whose top node is the function whose linearization generates that word.) There is an alternative way of deriving dependency trees from abstract syntax trees, which has the benefit of being simpler to follow. To perform this method, we need to produce a parse tree and decorate it with the labels obtained from the dependency configurations. To obtain a parse tree from an abstract syntax tree T:

1. Linearize it to a word sequence S;

2. Link each word in S to its smallest spanning subtree in T ;

3. Replace each function in the nodes of T by its value category.

After labelling the parse tree with the abstract syntax functions and dependency labels, we can assign a head and a label to each word by the following procedure:

1. Follow the edges up from the word until a label is reached: this is the label of the word;

2. From the dominating node, follow the path of unlabelled edges down to another word: this is the head of the word;

3. If no label is encountered on the way upwards, the word itself is the head of the sentence.

4.3.2.0.1 Example

Figure 20 shows the partial conversion of the sentence o gato nos vê to a dependency tree using the default dependency configuration for GF to UD. The final dependency tree would be the one in figure 19e.

4.3.2.0.2 Corpus

After revising the UD examples treebank described in section 3.3.1, we performed its conversion to a UD treebank, which is available at https://github.com/odanoburu/gfss2018-presentation/blob/master/por-ud.conllu.

Figure 20 – Converting a decorated parse tree to a dependency tree

Figure 21 – Dependency trees converted from GF trees

4.3.3 Discussion

Comparing the GF and the UD [28] approaches to the idea of a universal grammar is one of the goals of Kolachina & Ranta [20]. During their work, they found a lot of commonalities, but also points where the two approaches differ. These points raise bigger issues about how linguistic knowledge should be encoded, but they also cause problems in the GF to UD conversion.

A first case of mismatch between the UD and the GF approaches is in the handling of syncategorematic words (see [20, section 2.4]). Syncategorematic words are words with no abstract syntax category attached to them, a very common example being copulas. While the UD approach is that all words in a sentence must have labels connecting them to a head word, GF chooses not to consider copulas as deserving a category of their own, because languages like Russian, for example, do not have them in the present tense. The UD approach leads to less homogeneity between trees across languages, while the GF approach – even though it is more general – fails to specify exactly what the relation between the copulas and their heads is when converted to dependency trees. In their conversion to UD trees, all syncategorematic words in GF get a default label of dep, which is the unspecified label defined by the UD project.14 Although this problem in the conversion could be solved by changing the GF RGL to have a category for copulas, Kolachina & Ranta decided to create a language-specific configuration for the conversion process that allows one to specify how words appearing in a certain function are to be labelled, and what words should be their heads. It is this configuration I have contributed for Portuguese, reducing the number of unspecified dependency relations in the GF UD treebank from 115 to just 3. Figure 21 shows a dependency tree before and after the configuration file was specified.

Not all UD-GF discrepancies can be solved by this language-dependent configuration, though. Take the case of passive constructions: in UD their subject must have the label nsubj:pass; however, as in GF they are constructed at verb-phrase level, they are seen by the clause-making functions like PredVP as any other VP, and thus have their subjects labelled as nsubj. This conversion problem is solved in [20] by another type of configuration, that of non-local abstract rules. We refrain from reproducing their explanation here since they have not been published at the public GF repository yet.

Two issues not discussed in [20] are lemmatization (the conversion output includes the GF constructor names as lemmas instead of the actual lemmas) and UD meta-tokens (i.e., multi-word tokens and empty nodes), which do not seem to be handled by the current conversion tool – it only ever outputs UD word tokens. Listing 4.2 shows the raw output of the GF to UD conversion, while listing 4.3 shows a version where the lemmas and the tokenization were corrected to conform to UD standards. Because of this and the other issues we mentioned, GF can be successfully used as a bootstrapper of UD treebanks, but its output needs revision to conform to the UD standards. We must note that much of this revision can be automated, so we believe that the GF to UD conversion is practical.

14 See https://universaldependencies.org/u/dep/dep.html

# text = há uma vaca na floresta
1   há        ExistNPAdv   AUX    Cl      Cl-0       3   aux    _   _
2   uma       IndefArt     DET    Quant   Quant-7    3   det    _   _
3   vaca      cow_N        NOUN   N       N-0        0   root   _   _
4   na        DefArt       DET    Quant   Quant-10   5   det    _   _
5   floresta  forest_N     NOUN   N       N-0        3   nmod   _   _

Listing 4.2 – Raw output of the GF to UD conversion for the sentence “há uma vaca na floresta”

# text = há uma vaca na floresta
1     há        haver     AUX    Cl      Cl-0       3   aux    _   _
2     uma       um        DET    Quant   Quant-7    3   det    _   _
3     vaca      vaca      NOUN   N       N-0        0   root   _   _
4-5   na        _         _      _       _          _   _      _   _
4     em        em        DET    Quant   Quant-10   6   det    _   _
5     a         o         DET    Quant   Quant-10   6   det    _   _
6     floresta  floresta  NOUN   N       N-0        3   nmod   _   _

Listing 4.3 – Partially corrected output of the GF to UD conversion for the sentence “há uma vaca na floresta”

5 Conclusion

In this dissertation we have presented a freely-available computational grammar of Portuguese under the GF grammar formalism. As part of the GF resource grammar library, its focus is on generating syntactically correct natural language – even though it can be used for restricted-domain parsing – and on providing a library of grammatical constructions that can be reused by other grammars (dubbed application grammars). Our evaluations have shown that the resource grammar is capable of producing grammatical Portuguese across a rich range of syntactic phenomena (producing grammatical sentences for over 90% of the trees from our test treebanks, see section 3.3), even though a few issues still remain to be worked on. We have also shown extensions of existing application grammars to Portuguese, exemplifying how easy it is to create new grammars on top of our work.

5.1 Future Work

We may classify future work related to the Portuguese RG into two kinds: that related to its own improvement and that related to its applications. Going further than a resource grammar under the RGL, there is also future work on creating a Portuguese-only or Romance-only GF grammar with larger coverage of syntactic phenomena. This work could then be used to define the Portuguese RGL grammar (to avoid work duplication) and might be better suited to open-domain parsing – in addition to providing more flexible, though monolingual, NLG. This idea could of course apply to any language, and it depends on improvements being made to the GF ecosystem to support the compilation of even bigger grammars and the disambiguation of their parse trees.

5.1.1 Known issues with the Portuguese RG

In this section we describe a few outstanding issues with our Portuguese implementation of a resource grammar.

5.1.1.0.1 Preposition + personal pronoun contractions

(40) a. he is here with me

b. * ele está aqui com mim

c. ele está aqui comigo

We have not implemented contractions such as those between prepositions com, em, de and personal pronouns, as can be seen in gloss 40.

(41) a. this is his dog

b. este é seu cachorro

c. este é o cachorro dele

However, this does not always mean that grammatical mistakes occur, for we might phrase things differently, as in gloss 41.

5.1.1.0.2 Enclitic and mesoclitic pronoun contractions

(42) a. to love her isn’t easy

b. * amar a não é fácil

c. amá-la não é fácil

There is currently no support for enclitic pronoun contractions, as shown by gloss 42. We are unsure about whether there should ever be support for mesoclitic pronouns since they are archaic and very rarely used in written or spoken Portuguese.

5.1.1.0.3 Preposition + demonstrative pronouns contractions

(43) a. I don’t like this dog

b. * eu não gosto de este cachorro

c. eu não gosto deste cachorro

As exemplified by gloss 43, we have not implemented contractions between prepositions such as a, de, em and demonstrative pronouns.

5.1.1.0.4 Reflexive two-place adjectives

(44) a. the number 2 is divisible by itself

b. o número 2 é divisível por si

The RGL contains a constructor for reflexive two-place adjectives, whose use is exemplified by gloss 44. Its Romance implementation hardcodes a reference to the third person (causing the error in gloss 45), since its representation of adjectival phrases is not variable in the person parameter. We are unsure whether this implementation should change, since changing the representation of adjectival phrases for the sake of a single constructor has little advantage and incurs a non-negligible performance cost.

(45) a. I am divisible by myself

b. * eu sou divisível por si

5.1.2 Possible applications

In this section we describe ideas for applications using GF and our Portuguese RG.

5.1.2.0.1 WordNet gloss corpus

The gf-wordnet project uses WordNet senses to improve GF machine translation. However, it is also the case that GF machine translation can improve WordNet. If gf-wordnet makes GF robust enough to parse all WordNet glosses, it can be used to translate them too. This work could be coupled with the WordNet gloss corpus annotation project,1 which would guarantee correct translations at the lexical level. This joint work could also be used to establish that WordNet is closed under the definition property, i.e., that each word-sense pair used to define a sense is itself defined in WordNet. If it is the case that such a pair does not have this property, then either a word has to be added to a sense, or a new sense should be added to WordNet, or both. The translation process should also help to catch bugs in other languages’ wordnets and in the disambiguation of the glosses. While working on the Portuguese modules for gf-wordnet we found several cases of word translations that seemed wrong, and they were caused by errors in OpenWordNet-PT, the Portuguese WordNet. We believe it is also the case that a bad translation might point out a bug in the disambiguation; take gloss 8 (page 74) as an example: if we wrongly disambiguate hands as a rotating pointer on the face of a timepiece instead of the body part, the error is not immediately apparent to someone seeing the data; however, if we present the gloss together with its Portuguese translation, as in gloss 46, then it is obvious that something is wrong.

(46) a. lower your hands slowly

b. abaixe seus ponteiros lentamente

1 http://wordnetcode.princeton.edu/glosstag.shtml

5.1.2.0.2 Data as text

Angelov & Enache describe the encoding of the Suggested Upper Merged Ontology (SUMO) in GF [3], affording it consistency checking through GF’s logical framework. In addition to consistency checking, the GF encoding generated improved natural language descriptions of SUMO’s concepts; gloss 47 (taken from [14]) shows first the concept description offered by SUMO’s own NLG tool, and second a GF-generated description of the same concept.

(47) a. for all unique list ?LIST holds for all ?NUMBER1, ?NUMBER2 holds if “?NUMBER1th element of ?LIST” is equal to “?NUMBER2th element of ?LIST”, then ?NUMBER1 is equal to ?NUMBER2

b. for every unique list LIST, every positive integer NUMBER2 and every positive integer NUMBER1 we have that if the element with number NUMBER1 in LIST is equal to the element with number NUMBER2 in LIST then NUMBER1 is equal to NUMBER2

The existence of a Portuguese RG allows the extension of this work to include Portuguese, but more interestingly it allows syntactically correct and reasonably idiomatic natural language generation from ontologies – and, even more generally, from any kind of data that might need natural language descriptions to make it understandable to non-experts. Another use for natural language descriptions of ontologies or data is to help domain experts find errors in them without needing to learn a formal language or a data model. For Portuguese NLG in particular, we believe there is a use case in Brazilian official journals (the Diário Oficial), whose current format has only now started to provide machine-readable data, of unreliable quality, despite the fact that no person can keep up with its long contents. Making the Diário Oficial machine-readable first, with GF NLG providing a human interface to the data, would be easier and more correct than the current solution of trying to extract structured information from its natural language texts.

Bibliography

[1] Krasimir Angelov. The mechanics of Grammatical Framework. PhD thesis, Chalmers University of Technology and University of Gothenburg, 2011.

[2] Krasimir Angelov, Björn Bringert, and Aarne Ranta. PGF: A portable run-time format for type-theoretical grammars. Journal of Logic, Language and Information, 19(2):201–228, Apr 2010.

[3] Krasimir Angelov and Ramona Enache. Typeful ontologies with direct multilingual verbalization. In Michael Rosner and Norbert E. Fuchs, editors, Controlled Natural Language, pages 1–20, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg.

[4] Patrick Blackburn and Johan Bos. Representation and inference for natural language – volume II: Working with discourse representation structures. Department of Computational Linguistics, University of Saarland, Germany, 1999.

[5] John J Camilleri, Norbert E Fuchs, and Kaarel Kaljurand. Deliverable D11.1: ACE grammar library. Technical report, MOLTO project, June 2012. http://www.molto-project.eu.

[6] Alessandra Cid, Alexandre Rademaker, Bruno Cuconato, and Valeria de Paiva. Linguistic legal concept extraction in Portuguese. In Monica Palmirani, editor, Legal Knowledge and Information Systems, volume 313 of Frontiers in Artificial Intelligence and Applications. 2018. The 31st International Conference on Legal Knowledge and Information Systems (JURIX 2018). Expanded version at https://arxiv.org/abs/1810.09379.

[7] Bruno Cuconato and Alexandre Rademaker. A computational grammar for Portuguese. In Valeria de Paiva, Rodrigo Wilkens, and Fernando Batista, editors, PROPOR 2018 demonstration session. September 2018.

[8] Leonel Figueiredo de Alencar, Bruno Cuconato, and Alexandre Rademaker. MorphoBr: an open source large-coverage full-form lexicon for morphological analysis of Portuguese. Texto Livre: Linguagem e Tecnologia, 11(3):1–25, 2018.

[9] Marie-Catherine De Marneffe, Timothy Dozat, Natalia Silveira, Katri Haverinen, Filip Ginter, Joakim Nivre, and Christopher D Manning. Universal Stanford dependencies: A cross-linguistic typology. In LREC, volume 14, pages 4585–4592, 2014.

[10] Marie-Catherine De Marneffe and Christopher D Manning. Stanford typed dependencies manual. Technical report, Stanford University, 2008.

[11] Valeria de Paiva, Alexandre Rademaker, and Gerard de Melo. OpenWordNet-PT: an open Brazilian WordNet for reasoning. In Proceedings of COLING 2012: Demonstration Papers, pages 353–360, Mumbai, India, December 2012. The COLING 2012 Organizing Committee. Also published as technical report, http://hdl.handle.net/10438/10274.

[12] Pedro Delfino, Bruno Cuconato, Guilherme Paulino Passos, Gerson Zaverucha, and Alexandre Rademaker. Using OpenWordnet-PT for question answering on legal domain. In Global Wordnet Conference 2018, Singapore, January 2018.

[13] Grégoire Détrez and Aarne Ranta. Smart paradigms and the predictability and complexity of inflectional morphology. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, EACL ’12, pages 645–653, Stroudsburg, PA, USA, 2012. Association for Computational Linguistics.

[14] Ramona Enache. Reasoning and language generation in the SUMO ontology. Master's thesis, Chalmers University of Technology and University of Gothenburg, 2010.

[15] Ramona Enache, Aarne Ranta, and Krasimir Angelov. An open-source computational grammar for Romanian. In International Conference on Intelligent Text Processing and Computational Linguistics, pages 163–174. Springer, 2010.

[16] N Anido Freire. Les verbes portugais: formes et emplois.

[17] Norbert E Fuchs, Kaarel Kaljurand, and Tobias Kuhn. Discourse representation structures for ACE 6.7. Technical report, Department of Informatics & Institute of Computational Linguistics, University of Zurich, 2013.

[18] Norbert E Fuchs, Uta Schwertel, and Rolf Schwitter. Attempto Controlled English – not just another logic specification language. In International Workshop on Logic Programming Synthesis and Transformation, pages 1–20. Springer, 1998.

[19] Christine Kamprath, Eric Adolphson, Teruko Mitamura, and Eric Nyberg. Controlled language for multilingual document production: Experience with Caterpillar Technical English. In Proceedings of the Second International Workshop on Controlled Language Applications, volume 146, 1998.

[20] Prasanth Kolachina and Aarne Ranta. From abstract syntax to universal dependencies. LiLT (Linguistic Issues in Language Technology), 13, 2016.

[21] Tobias Kuhn. Codeco: A grammar notation for controlled natural language in predictive editors. arXiv preprint arXiv:1103.5676, 2011.

[22] Tobias Kuhn. A survey and classification of controlled natural languages. Computational Linguistics, 40(1):121–170, 2014.

[23] Inari Listenmaa and Koen Claessen. Automatic test suite generation for PMCFG. Proceedings of the Fifth Workshop on Natural Language and Computer Science, 2018.

[24] Peter Ljunglöf. Expressivity and Complexity of the Grammatical Framework. PhD thesis, Göteborg University & Chalmers University of Technology, 2004.

[25] Simon Marlow et al. Haskell 2010 language report. Available online at https://www.haskell.org/onlinereport/haskell2010/ (October 2017), 2010.

[26] Per Martin-Löf and Giovanni Sambin. Intuitionistic type theory, volume 9. Bibliopolis Napoli, 1984.

[27] George Miller. WordNet: An electronic lexical database. MIT Press, 1998.

[28] Joakim Nivre. Towards a universal grammar for natural language processing. In Alexander Gelbukh, editor, Computational Linguistics and Intelligent Text Processing, pages 3–16, Cham, 2015. Springer International Publishing.

[29] Stephan Oepen and Jan Tore Lønning. Discriminant-based MRS banking. In Proceedings of the 5th International Conference on Language Resources and Evaluation, pages 1250–1255, 2006.

[30] Aarne Ranta. Grammatical framework: a type-theoretical grammar formalism. Journal of Functional Programming, 14(2):145–189, 2004.

[31] Aarne Ranta. Grammars as software libraries. In From Semantics to Computer Science: Essays in Honour of Gilles Kahn. Cambridge University Press, 2009.

[32] Aarne Ranta. The GF resource grammar library. Linguistic Issues in Language Technology, 2(2):1–63, 2009.

[33] Aarne Ranta. Grammatical Framework: Programming with Multilingual Grammars. CSLI Publications, Stanford, 2011. ISBN-10: 1-57586-626-9 (Paper), 1-57586-627-7 (Cloth).

[34] Aarne Ranta and Prasanth Kolachina. From universal dependencies to abstract syntax. In Proceedings of the NoDaLiDa 2017 Workshop on Universal Dependencies (UDW 2017), pages 107–116, 2017.

[35] Maria Aparecida Florence Cerquera Ryan. Conjugação dos verbos em português: prático e eficiente. Ática, 1999.

[36] Hiroyuki Seki, Takashi Matsumura, Mamoru Fujii, and Tadao Kasami. On multiple context-free grammars. Theoretical Computer Science, 88(2):191–229, 1991.

[37] Hiroyuki Seki, Ryuichi Nakanishi, Yuichi Kaji, Sachiko Ando, and Tadao Kasami. Parallel multiple context-free grammars, finite-state translation systems, and polynomial-time recognizable subclasses of lexical-functional grammars. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pages 130–139. Association for Computational Linguistics, 1993.

[38] Stuart M Shieber, Yves Schabes, and Fernando CN Pereira. Principles and implementation of deductive parsing. The Journal of Logic Programming, 24(1-2):3–36, 1995.

[39] Pasi Tapanainen and Atro Voutilainen. Tagging accurately: Don’t guess if you know. In Proceedings of the Fourth Conference on Applied Natural Language Processing, ANLC ’94, pages 47–52, Stroudsburg, PA, USA, 1994. Association for Computational Linguistics.

[40] Charles A Verbeke. Caterpillar Fundamental English. Training and Development Journal, 27(2):36–40, 1973.

[41] John Whitlam. Modern Brazilian Portuguese Grammar. Routledge, 2010.