Inflection Generation for Spanish using Supervised Learning

Cristina Barros, Department of Software and Computing Systems, University of Alicante, Apdo. de Correos 99, E-03080, Alicante, Spain. [email protected]
Dimitra Gkatzia, School of Computing, Edinburgh Napier University, Edinburgh, EH10 5DT, UK. [email protected]
Elena Lloret, Department of Software and Computing Systems, University of Alicante, Apdo. de Correos 99, E-03080, Alicante, Spain. [email protected]

Proceedings of the First Workshop on Subword and Character Level Models in NLP, pages 136–141, Copenhagen, Denmark, September 7, 2017. © 2017 Association for Computational Linguistics.

Abstract

We present a novel supervised approach for inflection generation for verbs in Spanish. Our system takes as input the verb's lemma form and the desired features such as person, number and tense, and is able to predict the appropriate grammatical conjugation. Even though our approach learns from fewer examples compared to previous work, it is able to deal with all the Spanish moods (indicative, subjunctive and imperative), in contrast to previous work, which only focuses on the indicative and subjunctive moods. We show that in an intrinsic evaluation our system achieves 99% accuracy, outperforming (although not significantly) two competitive state-of-the-art systems. The successful results obtained clearly indicate that our approach could be integrated into wider approaches related to text generation in Spanish.

1 Introduction

Existing Natural Language Generation (NLG) approaches are usually applied to non-morphologically rich languages, such as English, where the morphological inflection of a word during the generation process can be addressed using simple handwritten rules or existing libraries such as SimpleNLG (Gatt and Reiter, 2009). In contrast, when it comes to morphologically rich languages, such as Spanish, the use of rules can lead to the incorrect inflection of a word, thus generating ungrammatical or meaningless texts. Our ultimate goal is to implement a morphological inflection approach for Spanish sentences within an NLG system based on the use of lexicons. However, lexicons lack some verb information, specifically regarding grammatical moods (i.e., the grammatical features of verbs used for denoting modality: statements of fact, desires, commands, etc.). Creating lexicons for all the inflections and moods would be a very time-consuming and costly task, so in this context the use of machine learning approaches can benefit the inflection of unseen verb forms.

Based on this, the research challenge we tackle is defined as follows: given a Spanish verb in its base form (i.e., its lemma), we want to automatically generate all the inflections for that verb. This is very useful for tasks involving natural language generation (e.g., text generation, machine translation), since the generated texts would sound more natural and grammatically correct.

Our contributions to the field are as follows: we present a novel and efficient method for tackling the challenge of inflection generation for Spanish verbs using an ensemble of algorithms; we provide a high-quality dataset which includes inflection rules for all the grammatical moods (i.e., indicative, subjunctive and imperative, the last of which is not tackled by current approaches); our models are trained with fewer resources than the state-of-the-art methods; and finally, our method outperforms the state-of-the-art methods, achieving a 2% higher accuracy.

The rest of the paper is organised as follows: in the next section (Section 2) we review related work on inflection generation. In Section 3, we describe the overall methodology and the dataset used to train our model. In Section 4, we present a comparison to state-of-the-art inflection generation approaches, and in Section 5, we discuss the results. Finally, in Section 6, directions for future work are discussed.

2 Related Work

Morphological inflection has been addressed from different perspectives within the area of Computational Linguistics, commonly for morphologically rich languages, such as German, Spanish or Finnish, as well as less morphologically rich languages such as English.

Previous work has used supervised or semi-supervised learning (Durrett and DeNero, 2013; Ahlberg et al., 2014; Nicolai et al., 2015; Faruqui et al., 2016) to learn morphological rules from large datasets of word forms in order to apply them to inflect the desired words. Other approaches have relied on linguistic information, such as morphemes and phonology (Cotterell et al., 2016); morphosyntactic disambiguation rules (Suárez et al., 2005); and graphical models (Dreyer and Eisner, 2009).

Recently, morphological inflection has also been addressed at the SIGMORPHON 2016 Shared Task (Cotterell et al., 2016), where, given a lemma with its part-of-speech, a target inflected form had to be generated (task 1). This task was addressed through several approaches, including align-and-transduce (Alegria and Etxeberria, 2016; Nicolai et al., 2016; Liu and Mao, 2016); recurrent neural networks (Kann and Schütze, 2016; Aharoni et al., 2016; Östling, 2016); and linguistically inspired heuristic approaches (Taji et al., 2016; Sorokin, 2016). Overall, the recurrent neural network approaches performed better, with Kann and Schütze (2016) being the best performing system in the shared task, obtaining around 98% accuracy.

Furthermore, the work described here differs from existing statistical surface realisation methods which use phrase-based learning (e.g., Konstas and Lapata (2012)), since they do not usually include morphological inflection. In this respect, our work is more similar to Dušek and Jurčíček (2013), where inflected word forms are learnt through multi-class logistic regression by predicting edit scripts. The aforementioned data-driven methods achieve high accuracy in predicting the appropriate inflection by learning from huge datasets. For example, Durrett and DeNero (2013) use 11400 instances (i.e., the total number of instances or rules used to predict the inflections of a verb). In contrast, we use almost half of that to train our system (4556 instances), and we achieve comparable or better results for Spanish. Finally, the work presented here relies on ensembles of classifiers, which have proved successful for content selection in data-to-text systems (Gkatzia et al., 2014).

3 Methodology

In order to perform the inflection task, we first created a dataset to be used for training machine learning algorithms to inflect verbs in Spanish. As part of this submission we will make our dataset freely available [1]. Then, we trained a model capable of predicting the appropriate inflection of a verb automatically, given a verb base form. Next, each of the stages of the proposed approach is described in more detail.

3.1 Dataset Creation

For the purposes of this research, we created a parallel dataset of Spanish base forms and their corresponding inflected forms. Spanish verbs can be divided into regular and irregular verbs, where all the regular verbs share the same inflection patterns, whereas the irregular ones do not and can vary completely from one verb to another, as shown in Figure 1.

Figure 1: Differences between regular and irregular verbs in Spanish, for the first person singular of the present tense.

Therefore, we constructed a dataset containing the necessary examples of inflection for all the tenses, by consulting the Real Academia Española [2] and the Enciclopedia Libre Universal en Español [3]. We further considered that a verb can be divided into three parts: (1) ending, (2) ending stem, and (3) penSyl. An example is shown in Figure 2. This information is later used as features within the dataset. In Spanish, verbs can be classified depending on their ending; specifically, the verbs ending in "-ar", "-er" and "-ir" belong to the first, second and third conjugation, respectively. Moreover, for the penSyl feature, the syllable preceding the ending (either the whole syllable or its dominant vowel) is extracted. Finally, the ending stem is the consonant closest to the ending.

[1] Our dataset for Spanish verb inflection is available here: https://github.com/cbarrosua/infDataset
[2] http://www.rae.es/diccionario-panhispanico-de-dudas/apendices/modelos-de-conjugacion-verbal
[3] http://enciclopedia.us.es/index.php/Categoría:Verbos
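To make the three-way division concrete, here is a rough pure-Python sketch. It is our own simplification, not the authors' implementation: penSyl is reduced to its dominant vowel, and no true syllabification is performed.

```python
VOWELS = "aeiouáéíóú"

def split_infinitive(verb):
    """Crudely split a Spanish infinitive into the three feature parts:
    ending, ending stem and penSyl (here reduced to its dominant vowel).
    Illustrative only; real syllabification is more involved."""
    ending = verb[-2:]            # "-ar", "-er" or "-ir"
    stem = verb[:-2]
    # Ending stem: the trailing consonant(s) of the stem, i.e. the
    # consonants closest to the ending.
    i = len(stem)
    while i > 0 and stem[i - 1] not in VOWELS:
        i -= 1
    ending_stem = stem[i:]
    # penSyl's dominant vowel: the vowel run just before the ending stem.
    j = i
    while j > 0 and stem[j - 1] in VOWELS:
        j -= 1
    pen_syl = stem[j:i]
    return ending, ending_stem, pen_syl

print(split_infinitive("regar"))   # → ('ar', 'g', 'e')
```

For "regar" (to water) this yields the ("ar", "g", "e") triple that appears in the feature columns of Table 2.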

Besides the previous features obtained from the verb, other features, such as suff1, suff2 and stemC1, were included because in Spanish some verbs have several variants of an inflection for the same tense, person and number. Therefore, our dataset is finally composed of the following features: (1) ending, (2) ending stem, (3) penSyl, (4) person, (5) number, (6) tense, (7) mood, (8) suff1, (9) suff2, (10) stemC1, (11) stemC2, (12) stemC3. In particular, suff1 and suff2 are the inflections predicted for the suffix of the verb form; and stemC1, stemC2 and stemC3 refer to the inflections predicted for the penSyl of the verb form. An example of an entry of the dataset is shown in Table 1. Overall, there are 4556 possible inflections. An example of a verb and several of its inflections is shown in Table 2.

Figure 2: Division of the Spanish verb (to begin) and its inflection for the first person singular of the present tense in the subjunctive mood.

3.2 Obtaining the Model and Reconstructing the Verb

As mentioned earlier, our learning task is formed as follows: given a set of 7 features, select the inflection which is most appropriate for the verb. The set of 7 features is as follows: (1) ending, (2) ending stem, (3) penSyl, (4) person, (5) number, (6) tense, (7) mood. Using these features, we trained a group of individual models, one for each of the predicted features described in Section 3.1, each of which represents a potential inflection value. We used the WEKA (Frank et al., 2016) implementation of the Random Forest algorithm to train the models for the stemC3 and stemC2 features, and the Random Tree algorithm to train the models for the suff1, suff2 and stemC1 features.

Once the models were trained, we predicted all the possible inflections given a verb in its base form, i.e., all the tenses for each mood in Spanish. To accomplish this task, we first analysed the base form to extract the necessary features for the inflection. In this manner, the base form was divided into syllables, taking the penultimate one to obtain the penSyl feature. Since all verbs in Spanish end in "-ar", "-er" or "-ir", as described in the previous section, we split the last syllable into the ending and ending stem features. Then, with each model we predicted its potential inflection using these extracted features combined with the ones related to the verb tense, i.e., the number, person, etc. Finally, the predicted inflections were employed to replace the features previously identified in the base form, leading to the reconstruction of the base form into the desired inflection, as can be seen in Figure 3.

Figure 3: Reconstruction of the verb elegir (to choose) with the features predicted by the models.

4 Experiments

We compared our system (RandFT) with two very competitive baselines, described below, by measuring the accuracy of their output for Spanish verb inflections. The baselines are as follows:

• Durret13: This system automatically extracts the orthographic transformation rules of the morphology from labeled examples, and then learns which of those transformations to apply in different contexts by using a semi-Markov conditional random field (CRF) model.

• Ahlberg14: This system uses a semi-supervised approach to generalise inflection paradigms from inflection tables by using a finite-state construction.

We reproduced the experiments presented in Durrett and DeNero (2013) and in Ahlberg et al. (2014). In order to compare our system with both baselines, we employed the test set of examples (200 different verbs) which was made available

ending | endingstem | penSyl | person | number | tense | mood | suff1 | suff2 | stemC1 | stemC2 | stemC3 | pattern verb
ar     | ANY        | ANY    | 1      | 0      | 1     | 1    | ara   | ase   | ANY    | ANY    | ANY    | amar
ar     | ANY        | ANY    | 2      | 0      | 1     | 1    | aras  | ases  | ANY    | ANY    | ANY    | amar
ar     | ANY        | ANY    | 3      | 0      | 1     | 1    | ara   | ase   | ANY    | ANY    | ANY    | amar
er     | ANY        | yac    | 1      | 0      | 0     | 0    | o     | ANY   | yazc   | yazg   | yag    | yacer
er     | ANY        | yac    | 2      | 0      | 0     | 0    | es    | ANY   | yac    | ANY    | ANY    | yacer
er     | ANY        | yac    | 3      | 0      | 0     | 0    | e     | ANY   | yac    | ANY    | ANY    | yacer

Table 1: Example of the 1st, 2nd and 3rd person singular of the subjunctive past tense of "amar" (to love), and of the 1st, 2nd and 3rd person singular of the present tense in the indicative mood of "yacer" (to lie). We assigned the term ANY to indicate that the value of a feature does not need to change during the inflection with respect to its value in the base form.
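Rows such as those in Table 1 feed one classifier per predicted feature (suff1, suff2, stemC1, stemC2, stemC3). The paper trains these with WEKA's Random Forest and Random Tree; the following pure-Python stand-in is our own toy nearest-match voter, not the authors' code and not a real Random Forest, and only illustrates the per-feature prediction setup:

```python
from collections import Counter

class FeatureModel:
    """Toy stand-in for one per-feature classifier: predicts the most
    common label among the training rows that share the most
    input-feature values with the query."""

    def fit(self, rows, labels):
        self.examples = list(zip(rows, labels))
        return self

    def predict(self, row):
        def overlap(r):
            return sum(a == b for a, b in zip(r, row))
        best = max(overlap(r) for r, _ in self.examples)
        votes = Counter(lab for r, lab in self.examples
                        if overlap(r) == best)
        return votes.most_common(1)[0][0]

# Inputs: (ending, endingstem, penSyl, person, number, tense, mood);
# targets: suff1, taken from the "amar" rows of Table 1.
rows = [("ar", "ANY", "ANY", 1, 0, 1, 1),
        ("ar", "ANY", "ANY", 2, 0, 1, 1),
        ("ar", "ANY", "ANY", 3, 0, 1, 1)]
suff1 = FeatureModel().fit(rows, ["ara", "aras", "ara"])
print(suff1.predict(("ar", "ANY", "ANY", 2, 0, 1, 1)))  # → aras
```

In the real system, one such model exists per predicted column, and the suffix/stem predictions of all models are combined to rebuild the inflected form.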

Verb: regar (to water)
Features                        | Inflection
ar, g, e, 1P, Sing, Pres., Ind  | riego
ar, g, e, 2P, Sing, Pres., Sub  | riegues
ar, g, e, 2P, Plural, Cond., Ind | regaríais
ar, g, e, 3P, Sing, Past I., Sub | regara, regase

Approach  | Correctly predicted verb tables | Correctly predicted verb forms
RandFT    | 99%                             | 99.98%
Durret13  | 97%                             | 99.76%
Ahlberg14 | 96%                             | 99.52%

Table 2: Example of some possible inflections of the verb "regar" (to water) (Pres. = present; Cond. = conditional; Past I. = imperfect past; Ind = indicative; Sub = subjunctive).

Table 3: Accuracy of predicting the inflection of verb tables and individual verb forms given only the base form, evaluated on an unseen test set of 200 verbs. For the imperative mood, our system achieves 100% accuracy; however, the baselines do not predict the imperative form.

by Durrett and DeNero (2013), since this test set included verbs with both uncommon and regular forms. This test set does not include any entry that appeared in the training data. For the experiments, we generated all the verb inflections for the 200 base forms. Furthermore, the aforementioned baselines do not predict all the grammatical moods that exist in the Spanish language (both baselines are only able to predict the indicative and subjunctive moods, but not the imperative one, which is not easy, especially for irregular forms). Therefore, we used an additional test set to evaluate this mood. We created the additional test set by employing information from Freeling's lexicon (Padró and Stanilovsky, 2012) for the imperative forms of these 200 verbs.

5 Results and Discussion

The results obtained, together with the results of Durrett and DeNero (2013) and Ahlberg et al. (2014), are shown in Table 3, where we compare the inflection of the same verb tenses as Durrett and Ahlberg using the test set described in the previous section. Our group of classifiers (RandFT), trained with our generalised dataset for Spanish, obtained a higher overall accuracy (though not significantly) with respect to the state-of-the-art baseline systems.

In addition, our model can correctly perform the inflection of the imperative mood, which was not included in the baseline systems. This grammatical mood, which forms commands or requests, contains unique imperative forms among the irregular Spanish verbs, as shown in Table 4. For this experiment, our system achieves 100% accuracy when evaluated on the additional test set. Furthermore, our model contributes to improving the naturalness and expressivity of NLG (Barros et al., 2017).

Base form–Inflected form: contar–cuenta; errar–yerra; haber–he; hacer–haz; oler–huele; ir–ve; oír–oye; decir–di

Table 4: Variability of inflection in the imperative mood for the 2nd person singular of the present tense.

Error Analysis: Although our system obtains almost 100% accuracy, it fails on the inflection of the participles of extremely rare irregular verbs (e.g., verb: ejabrir → generated: ejabrido → correct: ejabierto). These errors could be corrected by adding specific rules for these cases.
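To illustrate the reconstruction step of Section 3.2 (Figure 3), here is a minimal sketch. It is our own simplification, not the authors' implementation: we assume the predicted stem change replaces the penSyl vowel plus the ending stem, and that the value "ANY" keeps the corresponding base-form piece unchanged.

```python
def reconstruct(verb, ending, ending_stem, pen_syl, pred_suffix, pred_stem):
    """Rebuild an inflected form by substituting the predicted pieces
    back into the base form. Illustrative simplification only."""
    stem = verb[: -len(ending)]
    if pred_stem != "ANY":
        # Replace the penSyl vowel plus the ending stem with the
        # predicted stem change (e.g. "eg" -> "ieg" for "regar").
        core = pen_syl + ending_stem
        stem = stem[: len(stem) - len(core)] + pred_stem
    # Replace the infinitive ending with the predicted suffix.
    suffix = ending if pred_suffix == "ANY" else pred_suffix
    return stem + suffix

# regar, 1st person singular, present indicative
print(reconstruct("regar", "ar", "g", "e", "o", "ieg"))       # → riego
# regar, 2nd person plural, conditional (stem unchanged)
print(reconstruct("regar", "ar", "g", "e", "aríais", "ANY"))  # → regaríais
```

Both calls reproduce rows of Table 2: riego and regaríais.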

6 Conclusion and Future Work

This paper presented a robust, light-weight supervised approach to obtain the inflected forms of any Spanish verb for any of its moods (indicative, subjunctive and imperative). This approach uses an ensemble of supervised learning algorithms to learn how verbs are composed, in order to obtain the inflection of a verb given its base form. Our method obtained accuracy close to 100%, outperforming existing state-of-the-art approaches. In addition, our method is able to further predict the inflection of the imperative mood, which was not tackled by previous work. In future, we plan to test our inflection approach on other languages, as well as on other types of words (not only verbs). Furthermore, we also plan to compare this approach with the ones obtaining the best results (i.e., the ones employing recurrent neural networks) in the reinflection task of the SIGMORPHON 2016 Shared Task. Our short-term goal would be to integrate it within a surface realisation method, which will allow us to inflect whole sentences in different ways and tenses, thus improving the generation capabilities of current NLG systems.

Acknowledgments

This research work has been partially funded by the Generalitat Valenciana through the project "DIIM2.0: Desarrollo de técnicas Inteligentes e Interactivas de Minería y generación de información sobre la web 2.0" (PROMETEOII/2014/001); and partially funded by the Spanish Government through projects TIN2015-65100-R and TIN2015-65136-C2-2-R, as well as by the project "Análisis de Sentimientos Aplicado a la Prevención del Suicidio en las Redes Sociales (ASAP)", funded by Ayudas Fundación BBVA a equipos de investigación científica.

References

Roee Aharoni, Yoav Goldberg, and Yonatan Belinkov. 2016. Improving sequence to sequence learning for morphological inflection generation: The BIU-MIT systems for the SIGMORPHON 2016 shared task for morphological reinflection. In Proceedings of the 14th Annual SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology. Association for Computational Linguistics, pages 41–48.

Malin Ahlberg, Markus Forsberg, and Mans Hulden. 2014. Semi-supervised learning of morphological paradigms and lexicons. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 569–578.

Iñaki Alegria and Izaskun Etxeberria. 2016. EHU at the SIGMORPHON 2016 shared task. A simple proposal: Grapheme-to-phoneme for inflection. In Proceedings of the 14th Annual SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology. Association for Computational Linguistics, pages 27–30.

Cristina Barros, Dimitra Gkatzia, and Elena Lloret. 2017. Improving the naturalness and expressivity of language generation for Spanish. In Proceedings of the 10th International Conference on Natural Language Generation (INLG).

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, David Yarowsky, Jason Eisner, and Mans Hulden. 2016. The SIGMORPHON 2016 shared task—morphological reinflection. In Proceedings of the 2016 Meeting of SIGMORPHON. Association for Computational Linguistics.

Markus Dreyer and Jason Eisner. 2009. Graphical models over multiple strings. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1. Association for Computational Linguistics, pages 101–110.

Greg Durrett and John DeNero. 2013. Supervised learning of complete morphological paradigms. In Proceedings of the North American Chapter of the Association for Computational Linguistics, pages 1185–1195.

Ondřej Dušek and Filip Jurčíček. 2013. Robust multilingual statistical morphological generation models. In 51st Annual Meeting of the Association for Computational Linguistics: Proceedings of the Student Research Workshop. Association for Computational Linguistics, pages 158–164.

Manaal Faruqui, Yulia Tsvetkov, Graham Neubig, and Chris Dyer. 2016. Morphological inflection generation using character sequence to sequence learning. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 634–643.

Eibe Frank, Mark A. Hall, and Ian H. Witten. 2016. The WEKA Workbench. Online Appendix for "Data Mining: Practical Machine Learning Tools and Techniques". Morgan Kaufmann, 4th edition.

Albert Gatt and Ehud Reiter. 2009. SimpleNLG: A realisation engine for practical applications. In Proceedings of the 12th European Workshop on Natural Language Generation (ENLG).

Dimitra Gkatzia, Helen Hastie, and Oliver Lemon. 2014. Comparing multi-label classification with

reinforcement learning for summarisation of time-series data. In 52nd Annual Meeting of the Association for Computational Linguistics (ACL).

Katharina Kann and Hinrich Schütze. 2016. MED: The LMU system for the SIGMORPHON 2016 shared task on morphological reinflection. In Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology. Association for Computational Linguistics, pages 62–70.

Ioannis Konstas and Mirella Lapata. 2012. Concept-to-text generation via discriminative reranking. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pages 369–378.

Ling Liu and Lingshuang Jack Mao. 2016. Morphological reinflection with conditional random fields and unsupervised features. In Proceedings of the 14th Annual SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology. Association for Computational Linguistics, pages 36–40.

Garrett Nicolai, Colin Cherry, and Grzegorz Kondrak. 2015. Inflection generation as discriminative string transduction. In NAACL HLT 2015, The 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 922–931.

Garrett Nicolai, Bradley Hauer, Adam St. Arnaud, and Grzegorz Kondrak. 2016. Morphological reinflection via discriminative string transduction. In Proceedings of the 14th Annual SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology. Association for Computational Linguistics, pages 31–35.

Robert Östling. 2016. Morphological reinflection with convolutional neural networks. In Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology. Association for Computational Linguistics, pages 23–26.

Lluís Padró and Evgeny Stanilovsky. 2012. FreeLing 3.0: Towards wider multilinguality. In Proceedings of the Eighth International Conference on Language Resources and Evaluation.

Alexey Sorokin. 2016. Using longest common subsequence and character models to predict word forms. In Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology. Association for Computational Linguistics, pages 54–61.

Octavio Santana Suárez, José Rafael Pérez Aguiar, Luis Javier Losada García, and Francisco Javier Carreras Riudavets. 2005. Spanish morphosyntactic disambiguator. In Proceedings of the Association for Computers and the Humanities and the Association for Literary and Linguistic Computing, pages 201–204.

Dima Taji, Ramy Eskander, Nizar Habash, and Owen Rambow. 2016. The Columbia University - New York University Abu Dhabi SIGMORPHON 2016 morphological reinflection shared task submission. In Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology. Association for Computational Linguistics, pages 71–75.
