Simplenlg-IT: Adapting Simplenlg to Italian
Total Page:16
File Type:pdf, Size:1020Kb
SimpleNLG-IT: adapting SimpleNLG to Italian Alessandro Mazzei and Cristina Battaglino and Cristina Bosco Dipartimento di Informatica Universita` degli Studi di Torino Corso Svizzera 185, 10149 Torino [mazzei,battagli,bosco]@di.unito.it Abstract guage specific features, like word order, inflection and selection of functional words. This paper describes the SimpleNLG-IT re- Surface realisers can be classified on the basis of aliser, i.e. the main features of the porting of the SimpleNLG API system (Gatt and Reiter, their input. Fully fledged realisers accept as input an 2009) to Italian. The paper gives some details unordered and uninflected proto-syntactic structure about the grammar and the lexicon employed enriched with semantic and pragmatic features that by the system and reports some results about are used to produce the most plausible output string. a first evaluation based on a dependency tree- OpenCCG is a member of this category of realisers bank for Italian. A comparison is developed (White, 2006). Indeed, OpenCCG accepts as input with the previous projects developed for this a semantic graph representing a set of hybrid logic task for English and French, which is based on formulas. The hybrid logic elements are indeed the morpho-syntactical differences and simi- larities between Italian and these languages. the semantic specification of syntactic CCG struc- tures defined in the grammar realiser. The semantic graph under-specifies morpho-syntactic information 1 Introduction and delegates to the realiser many lexical and syn- tactic choices (e.g. function words). Chart based Natural Language Generation (NLG) involves a algorithms and statistical models are used to resolve number of elementary tasks that can be addressed by the ambiguity arising from under-specification. using different approaches and architectures. A well defined standard architecture is the pipeline pro- In contrast, realisation engines are simpler sys- posed by Reiter and Dale (Reiter and Dale, 2000). tems which perform just linearisation and mor- In this approach, three steps transform raw data phological inflections of the proto-syntactic input. into natural language text, that are: document As a consequence, realization engine presumes a planning, sentence-planning and surface realization. more detailed morpho-syntactic information as in- Each one of these modules triggers the next one ad- put. A member of this category is SimpleNLG dressing a distinct issue as follows. In document (Gatt and Reiter, 2009). It assumes a complete planning the user decides the information content of syntactic specification, but unordered and unin- the text to be generated (what to say). In sentence- flected, of the sentence in the form of a mixed con- planning, the focus is instead on the design of a stituency/dependency structure. Content and func- number of features that are related to the informa- tion words are chosen in input as well as modifiers tion contents as well as to the specific language, as order. The greatest advantage of this system is its the choice of the words. Finally, in surface realisa- simplicity, which allows to pay more efforts in the tion, sentences are generated according to the deci- previous stages of the NLG pipeline. sions taken in the previous stages and by fulfilling SimpleNLG was originally designed for English the morpho-syntactic constraints related to the lan- but it has been successively adapted to German, 184 Proceedings of The 9th International Natural Language Generation conference, pages 184–192, Edinburgh, UK, September 5-8 2016. c 2016 Association for Computational Linguistics French, Brazilian-Portuguese and Telugu (Boll- describe the grammar defined by the system, that mann, 2011; Vaudry and Lapalme, 2013; de Oliveira has been developed starting from the SimpleNLG- and Sripada, 2014; Dokkara et al., 2015). The first EnFr1.1. grammar; in Section 3 we describe the contribution of this paper is the adaptation of Sim- lexicon adopted, that has been built starting from pleNLG for Italian1. The most challenging issues three lexical resources available for Italian; Sec- under this respect of this project (see Sections 2 and tion 4 describes the evaluation of the system, which 3) are: (1) the Italian verb conjugation system, that is based on examples from both grammar books (Pa- cannot be easily mapped to the English system and tota, 2006) and an Italian treebank (Nivre et al., shows many idiosyncrasies; (2) the high complex- 2016); finally, Section 5 closes the paper with some ity of the Italian morphological inflections; (3) the final considerations and pointing to future works. lack of a publicly available computational lexicon suitable for generation. Nevertheless, the contribu- 2 From French to Italian grammar tion of this paper goes beyond the adaptation of the existing implementation to a novel language. We In this Section we will focus on the generation of applied indeed a treebank-based methodology (see constituents and on their order within the sentence the monolingual and multilingual resources cited be- in Italian. In this achievement, the main reference low) for both evaluating our results (see sec. 4), and for Italian grammar is (Patota, 2006). In general, describing in a comparative perspective the features it must be observed that Italian, like French, is fea- of the implemented grammar, referring to the dif- tured by a rich inflection that is clearly attested by ferences between Italian, French and English. This Verbs, but also by the behavior of other grammati- makes the work more linguistically sound and data- cal categories whose morphosyntactic features (e.g. driven. We started our work from SimpleNLG- gender, number and case) are crucial for determin- EnFr1.1, that is an adaptation to French (Vaudry and ing their syntactical order in the phrases to be gener- Lapalme, 2013) of the model developed for English ated. As stated above, our approach is based on that in (Gatt and Reiter, 2009). A property of our project adopted for French in (Vaudry and Lapalme, 2013), is multilingualism: by using the same architecture which has been in turn inspired by that used for En- of SimpleNLG-EnFr1.1 we are able to multilingual glish (Gatt and Reiter, 2009). First of all, we devel- documents with sentences in English, French and oped therefore a comparison among Italian and these Italian. other two languages in order to detect the main novel features to be taken into account in the development In porting SimpleNLG-EnFr1.1 to SimpleNLG- 2 IT, we created 10 new packages and modified 28 of SimpleNLG-IT. The parallel treebank ParTUT existing classes. The morphology and morphonol- developed for Italian/French/English helped us in ogy processors needed to be written from scratch this comparison. because of the features that differentiate Italian with In the rest of this Section we organize these fea- respect to French and English. The Syntax proces- tures in the main classes which did drive the pro- sor needed to be adapted, especially for the man- cesses we implemented: morphology and syntax, agement of noun and verb phrases and for clauses. which are strictly interrelated because of the concor- However, at this stage, we used the same orthog- dance phenomena, and morphonology. raphy processor of French. We needed to extend the system with 33 new lexical features, necessary 2.1 Morphology and syntax for accounting verb irregularities (subjunctive, con- 2.1.1 Verb conjugation ditional, remote past, etc.) and for processing the Italian is featured by a complexity of inflection superlative irregular form of the adjectives. which is typical of morphologically rich languages In the next Sections we survey the main features and its richness, in this perspective, positively com- of SimpleNLG-IT, in particular: in Section 2 we pares with that of French. Nevertheless, in order to 1SimpleNLG-IT: https://github.com/ 2http://www.di.unito.it/˜tutreeb/ alexmazzei/SimpleNLG-IT treebanks.html 185 develop a suitable model for Italian verbs, we dif- Italian conjugation Tense PE PR ferentiate the implementation of SimpleNLG-IT un- indicativo presente present F F imperfetto past F F der this respect with that exploited in SimpleNLG- futuro semplice future F F FrEn1.1. futuro anteriore future T F The main traits that we have assumed in this phase passato prossimo past T F of the project for modeling verbs are tense, progres- passato remoto remote-past T F sive and perfect, as can it be seen in the Table 1. The trapassato prossimo plus-past T F trapassato remoto plus-remote-past T F opposition between the different features, i.e. per- passato remoto remote-past T F fect and imperfect, can be expressed by using dif- presente progressivo present F T ferent means in different languages. While in En- passato progressivo past F T glish aspect is especially relevant and strictly inter- futuro progressivo future F T related with mood and tense, in Italian and other Ro- condizionale presente present F F condizionale passato past F F mance languages derived from Latin several means congiuntivo presente present F F are available for expressing it, which vary from in- congiuntivo imperfetto past F F flection, to lexical selection, to syntactic choice of congiuntivo passato past T F periphrastic forms and a system of moods richer congiuntivo trapassato plus-past T F than that of English. On the one hand, the imper- Table 1: Relation between verb tenses and traits in Italian: fect forms for present (I’m writing) and past (I was TENSE is a multi-value feature; PErfect and PRogressive are writing) and the perfect forms for present (I have two boolean features. written) and past (I had written) exploited in English cannot always find a unique correspondence in Ital- ian forms. On the other hand, while the progressive form Io sto scrivendo surely corresponds to I’m writ- ifiers, while English nouns often occur without de- ing, the form Io scrivo can be translated with I write terminers.