
Procesamiento del Lenguaje Natural, Revista nº 65, septiembre de 2020

ISSN: 1135-5948

Articles

Grammatical error correction for Basque through a seq2seq neural architecture and synthetic examples. Zuhaitz Beloki, Xabier Saralegi, Klara Ceberio, Ander Corral ...... 13
Un estudio de los métodos de reducción del frente de Pareto a una única solución aplicado al problema de resumen extractivo multi-documento. Jesús M. Sánchez-Gómez, Miguel A. Vega-Rodríguez, Carlos J. Pérez ...... 21
Generación de frases literarias: un experimento preliminar. Luis-Gil Moreno-Jiménez, Juan-Manuel Torres-Moreno, Roseli S. Wedmann ...... 29
Cross-lingual training for multiple-choice question answering. Guillermo Echegoyen, Álvaro Rodrigo, Anselmo Peñas ...... 37
ContextMEL: Classifying contextual modifiers in clinical text. Paula Chocron, Álvaro Abella, Gabriel de Maeztu ...... 45
Limitations of neural networks-based NER for resume data extraction. Juan F. Pinzón, Stanislav Krainikovsky, Roman S. Samarev ...... 53
Minería de argumentación en el referéndum del 1 de octubre de 2017. Marcos Esteve Casademunt, Paolo Rosso, Francisco Casacuberta ...... 59
Edición de un corpus digital de inventarios de bienes. Pilar Arrabal Rodríguez ...... 67
Relevant content selection through positional language models: an exploratory analysis. Marta Vicente, Elena Lloret ...... 75
Rantanplan, fast and accurate syllabification and scansion of Spanish poetry. Javier de la Rosa, Álvaro Pérez, Laura Hernández, Salvador Ros, Elena González-Blanco ...... 83

Projects

COST action “European network for web-centred linguistic data science”. Thierry Declerck, Jorge Gracia, John P. McCrae ...... 93
MINTZAI: Sistemas de aprendizaje profundo E2E para traducción automática del habla. Thierry Etchegoyhen, Haritz Arzelus, Harritxu Gete, Aitor Álvarez, Inma Hernaez, Eva Navas, Ander González, Jaime Osácar, Edson Benites, Igor Ellakuria, Eusebi Calonge, Maite Martín ...... 97
MISMIS: Misinformation and Miscommunication in social media: aggregating information and analysing language. Paolo Rosso, Francisco Casacuberta, Julio Gonzalo, Laura Plaza, J. Carrillo-Albornoz, E. Amigó, M. Felisa Verdejo, Mariona Taulé, Maria Salamó, M. Antònia Martí ...... 101
AMALEU: a machine-learned universal language representation. Marta R. Costa-jussà ...... 105
Transcripción, indexación y análisis automático de declaraciones judiciales a partir de representaciones fonéticas y técnicas de lingüística forense. Antonio García-Díaz, Ángela Almela, Fernando Molina, Juan Salvador Castejón-Garrido, Rafael Valencia-García ...... 109

© 2020 Sociedad Española para el Procesamiento del Lenguaje Natural


Editorial Committee

Editorial board
L. Alfonso Ureña López, Universidad de Jaén, [email protected] (Director)
Patricio Martínez Barco, Universidad de Alicante, [email protected] (Secretary)
Manuel Palomar Sanz, Universidad de Alicante, [email protected]
Felisa Verdejo Maíllo, UNED, [email protected]

ISSN: 1135-5948
Electronic ISSN: 1989-7553
Depósito Legal: B:3941-91
Published at: Universidad de Jaén
Year of publication: 2020
Editors:
Eugenio Martínez Cámara, Universidad de Granada, [email protected]
Álvaro Rodrigo Yuste, UNED, [email protected]
Paloma Martínez Fernández, Universidad Carlos III, [email protected]

Published by: Sociedad Española para el Procesamiento del Lenguaje Natural. Departamento de Informática, Universidad de Jaén. Campus Las Lagunillas, Edificio A3, Despacho 127, 23071 Jaén. [email protected]

Advisory board
Manuel de Buenaga, Universidad de Alcalá (Spain)
Sylviane Cardey-Greenfield, Centre de recherche en linguistique et traitement automatique des langues (France)
Irene Castellón Masalles, Universidad de Barcelona (Spain)
Arantza Díaz de Ilarraza, Universidad del País Vasco (Spain)
Antonio Ferrández Rodríguez, Universidad de Alicante (Spain)
Alexander Gelbukh, Instituto Politécnico Nacional (Mexico)
Koldo Gojenola Galletebeitia, Universidad del País Vasco (Spain)
Xavier Gómez Guinovart, Universidad de Vigo (Spain)
José Miguel Goñi Menoyo, Universidad Politécnica de Madrid (Spain)
Ramón López-Cozar Delgado, Universidad de Granada (Spain)
Bernardo Magnini, Fondazione Bruno Kessler (Italy)
Nuno J. Mamede, Instituto de Engenharia de Sistemas e Computadores (Portugal)
M. Antònia Martí Antonín, Universidad de Barcelona (Spain)
M. Teresa Martín Valdivia, Universidad de Jaén (Spain)
Patricio Martínez-Barco, Universidad de Alicante (Spain)
Eugenio Martínez Cámara, Universidad de Granada (Spain)
Paloma Martínez Fernández, Universidad Carlos III (Spain)

Raquel Martínez Unanue, Universidad Nacional de Educación a Distancia (Spain)
Ruslan Mitkov, University of Wolverhampton (United Kingdom)
Manuel Montes y Gómez, Instituto Nacional de Astrofísica, Óptica y Electrónica (Mexico)
Lluís Padró Cirera, Universidad Politécnica de Cataluña (Spain)
Manuel Palomar Sanz, Universidad de Alicante (Spain)
Ferrán Pla Santamaría, Universidad Politécnica de Valencia (Spain)
German Rigau Claramunt, Universidad del País Vasco (Spain)
Horacio Rodríguez Hontoria, Universidad Politécnica de Cataluña (Spain)
Paolo Rosso, Universidad Politécnica de Valencia (Spain)
Leonel Ruiz Miyares, Centro de Lingüística Aplicada de Santiago de Cuba (Cuba)
Emilio Sanchís Arnal, Universidad Politécnica de Valencia (Spain)
Kepa Sarasola Gabiola, Universidad del País Vasco (Spain)
Encarna Segarra Soriano, Universidad Politécnica de Valencia (Spain)
Thamar Solorio, University of Houston (United States of America)
Maite Taboada, Simon Fraser University (Canada)
Mariona Taulé Delor, Universidad de Barcelona (Spain)
Juan-Manuel Torres-Moreno, Laboratoire Informatique d'Avignon / Université d'Avignon (France)
José Antonio Troyano Jiménez, Universidad de Sevilla (Spain)
L. Alfonso Ureña López, Universidad de Jaén (Spain)
Rafael Valencia García, Universidad de Murcia (Spain)
René Venegas Velásquez, Pontificia Universidad Católica de Valparaíso (Chile)
Felisa Verdejo Maíllo, Universidad Nacional de Educación a Distancia (Spain)
Manuel Vilares Ferro, Universidad de la Coruña (Spain)
Luis Villaseñor-Pineda, Instituto Nacional de Astrofísica, Óptica y Electrónica (Mexico)

Additional reviewers

Mario Almagro, Universidad Nacional de Educación a Distancia (Spain)
Laura Alonso Alemany, Universidad Nacional de Córdoba (Argentina)
Marco Casavantes, Universidad Autónoma de Chihuahua (Mexico)
Víctor Manuel Darriba Bilbao, Universidad de Vigo (Spain)
Luis Espinosa-Anke, University of Cardiff (United Kingdom)
Juan Luis García Mendoza, Instituto Nacional de Astrofísica, Óptica y Electrónica (Mexico)
Horacio Jarquín, Instituto Nacional de Astrofísica, Óptica y Electrónica (Mexico)
Pilar López Úbeda, Universidad de Jaén (Spain)
Soto Montalvo, Universidad Rey Juan Carlos (Spain)
Arturo Montejo Ráez, Universidad de Jaén (Spain)
Flor Miriam Plaza del Arco, Universidad de Jaén (Spain)
Francisco J. Ribadas-Pena, Universidad de Vigo (Spain)
Juan Javier Sánchez Junquera, Universidad Politécnica de Valencia (Spain)




Preamble

The Natural Language Processing journal aims to be a forum for the publication of high-quality unpublished scientific and technical papers on Natural Language Processing (NLP), both for the national and international scientific community and for companies in the sector. Furthermore, we want to strengthen the development of the different areas related to NLP, improve the dissemination of the research carried out, identify the future directions of basic research, and demonstrate the real possibilities of application in this field. Every year, the Spanish Society for Natural Language Processing (SEPLN) publishes two issues of the journal, which include original articles, presentations of ongoing projects, book reviews and summaries of doctoral theses. The journal is distributed free of charge to all members, and, in order to reach a wider audience and facilitate access to the publication, its contents are freely available online.

The subject areas addressed are the following:

● Linguistic, mathematical and psycholinguistic models of language
● Corpus linguistics
● Development of linguistic resources and tools
● Grammars and formalisms for morphological and syntactic analysis
● Semantics, pragmatics and discourse
● Computational lexicography and terminology
● Word sense disambiguation
● Machine learning in NLP
● Monolingual and multilingual text generation
● Machine translation
● Speech recognition and synthesis
● Monolingual, multilingual and multimodal information extraction and retrieval
● Question answering
● Automatic analysis of textual content
● Automatic summarization
● NLP for the generation of educational resources
● NLP for languages with limited resources
● Industrial applications of NLP
● Dialogue systems
● Sentiment analysis and opinion mining
● Text mining
● Evaluation of NLP systems
● Textual entailment and paraphrasing

The 65th issue of the Procesamiento del Lenguaje Natural journal contains works from three different sections: scientific papers, summaries of research projects and descriptions of software applications (demonstrations). All of them were accepted through the journal's usual peer-review process. We would like to thank the members of the Advisory Committee and the additional reviewers for their work.

Thirty-seven submissions were received for this issue, of which twenty-two were scientific papers and fifteen were summaries of research projects or tool descriptions. Of these twenty-two papers, ten (45.4%) were finally selected for publication.

The Advisory Committee of the journal has reviewed the papers in a double-blind process: the identities of the authors and of the reviewers are hidden from each other. In the first step, each paper was reviewed blindly by three reviewers. In the second step, for those papers whose individual scores differed by three or more points out of seven, the three reviewers reconsidered their evaluation jointly. Finally, the evaluation of those papers that were very close to the acceptance threshold was supervised by additional members of the editorial board. The cut-off criterion adopted was the mean of the three scores, provided it was equal to or greater than 5 out of 7.

September 2020
The editorial board


Articles
Grammatical error correction for Basque through a seq2seq neural architecture and synthetic examples. Zuhaitz Beloki, Xabier Saralegi, Klara Ceberio, Ander Corral ...... 13
Un estudio de los métodos de reducción del frente de Pareto a una única solución aplicado al problema de resumen extractivo multi-documento. Jesús M. Sánchez-Gómez, Miguel A. Vega-Rodríguez, Carlos J. Pérez ...... 21
Generación de frases literarias: un experimento preliminar. Luis-Gil Moreno-Jiménez, Juan-Manuel Torres-Moreno, Roseli S. Wedmann ...... 29
Cross-lingual training for multiple-choice question answering. Guillermo Echegoyen, Álvaro Rodrigo, Anselmo Peñas ...... 37
ContextMEL: Classifying contextual modifiers in clinical text. Paula Chocron, Álvaro Abella, Gabriel de Maeztu ...... 45
Limitations of neural networks-based NER for resume data extraction. Juan F. Pinzón, Stanislav Krainikovsky, Roman S. Samarev ...... 53
Minería de argumentación en el referéndum del 1 de octubre de 2017. Marcos Esteve Casademunt, Paolo Rosso, Francisco Casacuberta ...... 59
Edición de un corpus digital de inventarios de bienes. Pilar Arrabal Rodríguez ...... 67
Relevant content selection through positional language models: an exploratory analysis. Marta Vicente, Elena Lloret ...... 75
Rantanplan, fast and accurate syllabification and scansion of Spanish poetry. Javier de la Rosa, Álvaro Pérez, Laura Hernández, Salvador Ros, Elena González-Blanco ...... 83

Projects
COST action “European network for web-centred linguistic data science”. Thierry Declerck, Jorge Gracia, John P. McCrae ...... 93
MINTZAI: Sistemas de aprendizaje profundo E2E para traducción automática del habla. Thierry Etchegoyhen, Haritz Arzelus, Harritxu Gete, Aitor Álvarez, Inma Hernaez, Eva Navas, Ander González, Jaime Osácar, Edson Benites, Igor Ellakuria, Eusebi Calonge, Maite Martín ...... 97
MISMIS: Misinformation and Miscommunication in social media: aggregating information and analysing language. Paolo Rosso, Francisco Casacuberta, Julio Gonzalo, Laura Plaza, J. Carrillo-Albornoz, E. Amigó, M. Felisa Verdejo, Mariona Taulé, Maria Salamó, M. Antònia Martí ...... 101
AMALEU: a machine-learned universal language representation. Marta R. Costa-jussà ...... 105
Transcripción, indexación y análisis automático de declaraciones judiciales a partir de representaciones fonéticas y técnicas de lingüística forense. Antonio García-Díaz, Ángela Almela, Fernando Molina, Juan Salvador Castejón-Garrido, Rafael Valencia-García ...... 109

Demonstrations
EmoCon: Analizador de emociones en el congreso de los diputados. Salud María Jiménez-Zafra, Miguel Ángel García-Cumbreras, María Teresa Martín-Valdivia, Andrea López-Fernández ...... 115
Nalytics: natural speech and text analytics. Ander González-Docasal, Naiara Pérez, Aitor Álvarez, Manex Serras, Laura García-Sardiña, Haritz Arzelus, Aitor García-Pablos, Montse Cuadros, Paz Delgado, Ane Lazpiur, Blanca Romero ...... 119
RESIVOZ: dialogue system for voice-based information registration in eldercare. Laura García-Sardiña, Manex Serras, Arantza del Pozo, Mikel D. Fernández-Bhogal ...... 123

The text complexity library. Rocío López Anguita, Jaime Collado Montañez, Arturo Montejo Ráez ...... 127
The impact of coronavirus on our mental health. Jiawen Wu, Francisco Rangel, Juan Carlos Martínez ...... 131
AREVA: augmented reality voice assistant for industrial maintenance. Manex Serras, Laura García-Sardiña, Bruno Simões, Hugo Álvarez, Jon Arambarri ...... 135
UMUCorpusClassifier: compilation and evaluation of linguistic corpus for natural language processing tasks. José Antonio García-Díaz, Ángela Almela, Gema Alcaraz-Mármol, Rafael Valencia-García ...... 139

General Information
Information for authors ...... 145
Additional information ...... 147


Articles

Procesamiento del Lenguaje Natural, Revista nº 65, septiembre de 2020, pp. 13-20. Received 02-04-2020, revised 01-06-2020, accepted 01-06-2020. ISSN 1135-5948. DOI 10.26342/2020-65-1.

Grammatical Error Correction for Basque through a seq2seq neural architecture and synthetic examples

Corrección gramatical para euskera mediante una arquitectura neuronal seq2seq y ejemplos sintéticos

Zuhaitz Beloki, Xabier Saralegi, Klara Ceberio, Ander Corral
Elhuyar Foundation
{z.beloki, x.saralegi, k.ceberio, a.corral}@elhuyar.eus

Abstract: Sequence-to-sequence neural architectures are the state of the art for addressing the task of correcting grammatical errors. However, large training datasets are required for this task. This paper studies the use of sequence-to-sequence neural models for the correction of grammatical errors in Basque. As there is no training data for this language, we have developed a rule-based method to generate grammatically incorrect sentences from a collection of correct sentences extracted from a corpus of 500,000 news articles in Basque. We have built different training datasets according to different strategies to combine the synthetic examples. From these datasets, different models based on the Transformer architecture have been trained and evaluated according to precision, recall and F0.5 score. The results obtained with the best model reach an F0.5 score of 0.87.
Keywords: GEC, Seq2seq architectures, Basque, Less-resourced languages

1 Introduction

The task of Grammatical Error Correction (GEC) consists in detecting and correcting grammatical errors in a sentence, resulting in a grammatically correct sentence (e.g. "I give her a book yesterday" → "I gave her a book yesterday"). This is a task that arouses great interest within Natural Language Processing, and proof of this is the large number of works on GEC that can be found in the literature.

Initially, this task was addressed with relative success through symbolic approaches based on linguistic rules. Later, different authors introduced methods based on machine learning, which can be divided into two groups: a) methods based on supervised classification, and b) methods based on machine translation. The methods based on supervised classification consist of building a classifier for each type of grammatical error from annotated corpora (Izumi et al., 2003; Gamon, 2010).

Methods based on machine translation, on the other hand, address the task of grammatical correction as a problem of machine translation, or sequence-to-sequence (seq2seq) learning, where the source sentences correspond to grammatically incorrect sentences and the target sentences to grammatically correct ones. These methods are more effective than methods based on supervised classification, as their ability to deal with errors that follow complex patterns is greater (Rozovskaya and Roth, 2016).

The first methods based on machine translation used statistical machine translation models (Brockett et al., 2006). Lately, they have been replaced by superior neural models (Chollampatt and Ng, 2018; Grundkiewicz and Junczys-Dowmunt, 2018), given the great capacity of the latter to generalize patterns. However, neural translation models require even more training data than statistical translation models. This is a major problem because, even for widely used languages such as English, there are no large training corpora available for the GEC task. For this reason, several authors (Zhao et al., 2019; Lichtarge et al., 2019) propose to address grammatical error correction as a low-resource Neural Machine Translation task, where the automatic generation of synthetic training data is essential. Different techniques have been proposed for generating synthetic data, such as techniques based on back-translation (Rei and Yannakoudakis, 2017; Ge et al., 2018).

The problem of the lack of training data is even greater in languages with limited digital resources, where there are not even initial annotated corpora that can serve as a basis for generating synthetic training examples. This is the case of the Basque language, which is the case of study that we propose in this paper.

Until now, the correction of grammatical errors in Basque texts has been tackled through rule-based strategies (Oronoz, 2009), with the limitations that this approach entails. In this paper we propose to address the problem by using seq2seq neural models. To this end, we have evaluated seq2seq neural models trained from different synthetic training corpora. Various training corpora were generated by introducing grammatical errors, derived from several rules, into correct sentences. Four strategies were used to combine the various types of incorrect examples generated. Correct sentences were extracted from a large collection of news gathered from digital newspapers. The main contributions of this work are the following:

● To the best of our knowledge, this is the first work that studies the use of neural seq2seq models in the task of correcting grammatical errors in Basque.
● We propose a new rule-based method for generating synthetic training corpora oriented to the correction of grammatical errors in Basque.
● We provide a new benchmark for the GEC task in Basque, by making the synthetic training and evaluation datasets publicly available (https://hizkuntzateknologiak.elhuyar.eus/assets/files/elh-gec-eu.tgz).

The article is structured as follows. In the following section we explain the methodology followed to select the grammatical errors included in the study. In section 3 we describe in detail the methods proposed for the generation of synthetic training datasets from grammatically correct texts. In section 4 we present the seq2seq neural architecture used to implement the grammatical correction system. Section 5 discusses the results obtained with the systems trained on the different synthetic training datasets. We conclude by presenting the main conclusions drawn from the work and the future work planned.

2 Selection of grammatical errors

This work focuses on the most common grammatical errors in Basque texts. As we do not have a corpus of Basque texts with annotated grammatical errors, we have chosen to consult a professional translator/corrector and a professional lexicographer. Each of them was asked to select ten errors from the list of 33 grammatical errors proposed for Basque by Oronoz (2009).


Specifically, they had to select those errors which in their opinion are most common in Basque texts, regardless of register and domain. In addition to their own judgment, these experts used as additional material the list of most common errors in the exams for the EGA title (Euskararen Gaitasun Agiria, or in English: Certificate of Proficiency in Basque). From the intersection of the two sets of ten errors selected by both experts, six errors resulted (Table 1), of which two (erroneous use of suffixes in dates and times) were discarded because they are easily solved by rule-based techniques (Oronoz, 2009). Thus, the four errors selected for this paper are the following:

● E1: Wrong use of the verb tense or aspect. For example, the use of the verbal form of the present in a future context.
● E2: Misuse of the verbal paradigm. The verbal system in Basque consists of four paradigms: nor (monovalent intransitive), nor-nork (bivalent transitive), nor-nori (bivalent intransitive), nor-nori-nork (trivalent transitive). Each verb is conjugated according to its corresponding paradigm(s). It is a very common error to use the wrong paradigm.
● E3: Lack of concordance between the verb and the subject. Confusion of the declension suffix in the subject.
● E4: Misuse of the verbal suffix. Completive sentences in Basque are formed by adding the suffix -(e)la to the verb of the subordinate sentence. If the sentence is negative, the suffix should be -(e)nik.

E1: Verb tense
    Ziur bihar jakiten (→jakingo) dugula. (I'm sure we'll find out tomorrow)
    Gustura egingo nuen (→nuke) orain. (I'd love to do it now)
    Gauza bat faltatzen (→falta) zait esateko. (There's one more thing I have to say)
E2: Verbal paradigm
    Afaltzera gonbidatu zidan (→ninduen). (He/She invited me to dinner)
    Atzo kalean ikusi nizun (→zintudan). (I saw you yesterday on the street)
    Utzi behar dugu (→diogu) negar egitea (→egiteari). (We have to stop crying)
E3: Concordance verb-subject
    Jon (→Jonek) ez daki ezer. (Jon doesn't know anything)
    Bidaiak (→Bidaiek) atsedena hartzeko balio dute. (Travelling is good to rest)
    Jende askok uste dute (→du). (A lot of people think)
E4: Completive sentences
    Ez dut uste hori egia dela (→denik). (I don't think that's true)
    Nire ustez, hori horrela dela (→da). (I think it's like that)
    Badago beste kutsadura bat dela (→dena) nuklearra. (There's another contamination which is nuclear contamination)

Table 1: Selected errors and examples. In brackets, the English translation of the corrected example.

3 Generation of synthetic datasets

3.1 Generation of synthetic errors

The size of the different training datasets available for GEC is insufficient for training seq2seq neural models, so different techniques for the generation of additional synthetic training data have been proposed in the literature. Some authors (Grundkiewicz and Junczys-Dowmunt, 2014; Lichtarge et al., 2019) have proposed to extract training data from Wikipedia revision histories, a source from which a large number of examples can be extracted, especially for English. Other authors (Rei and Yannakoudakis, 2017; Ge et al., 2018) generate synthetic training data following the back-translation strategy proposed for machine translation systems: an intermediate model is trained from the initial training corpus and applied to a corpus of correct sentences, and thus sentences with grammatical errors are automatically generated. Another alternative proposed in the literature to generate synthetic training corpora is to introduce "noise" into a corpus of correct texts.

The "noise", i.e. the grammatical errors, is introduced by means of linguistic rules (Yuan and Felice, 2013), or by more generic operations of token replacement, elimination, insertion or reordering (Zhao et al., 2019).

In our case, to generate the synthetic training corpus we follow an approach based on linguistic rules, in line with the strategy adopted by Yuan and Felice (2013). However, unlike Yuan and Felice (2013), we do not use an initial annotated corpus as a reference. The rules are designed to generate specific grammatical errors in grammatically correct sentences. That way, we can generate pairs of incorrect and correct sentences useful for compiling a training dataset.

The implemented rules (Table 2) generate errors of the types E1, E2, E3 and E4 (Table 1) described in section 2. For each type of error, a set of rules has been implemented so that most of the possible cases are covered. In some cases the application of the rule is bi-directional, depending on whether the error generated in that way is also common.

Error   Rules
E1      R1.1: ko/go → ten/tzen
E2      R2.1: nor-nork → nor-nori-nork
        R2.2: nor-nori-nork ↔ nori-nor
        R2.3: nor-nork ↔ nor
        R2.4: nor-nork ↔ nori-nor
E3      R3.1: subj_erg → subj_abs
E4      R4.1: v_aux(-nik) → v_aux(-la)
        R4.2: v_aux(-na) → v_aux(-la)
        R4.3: v_aux(-laren) ↔ v_aux(-lako)

Table 2: Rules for the errors associated with the selected grammatical errors.

The changes executed by the implemented rules consist of replacing specific words depending on the type of error (examples are shown in Table 3). These replacements are made according to certain grammatical information and specific tokens that we obtain through Eustagger, the morphosyntactic analyzer for the Basque language (Ezeiza et al., 1998):

● Rule R1.1, associated with error type E1, is applied to sentences where the verb tense is future (suffixes "ko" and "go"), and the verb inflection is modified to transform it into present tense (suffixes "ten" and "tzen").
● Rules R2.1, R2.2, R2.3 and R2.4, associated with error type E2, modify the auxiliary verb to simulate the most common verbal paradigm confusions.
● Rule R3.1, associated with error type E3, modifies the subject's grammatical case (its declension) to transform ergative cases into absolutive ones.
● Rules R4.1, R4.2 and R4.3, associated with error type E4, modify the auxiliary verb to simulate suffix errors in completive sentences.

R1.1  Arratsaldean ikusiko (→ikusten) gara. (See you in the afternoon)
R2.1  Atzo hondartzan ikusi zintudan (→nizun). (I saw you on the beach yesterday)
R2.2  Aholku kontrajarriak ematen ari zaigu (→digu). (He/She is giving us contradictory advice)
R2.3  Azkenaldian asko argaldu du (→da). (He's lost a lot of weight lately)
R2.4  Paisaia asko gustatzen zait (→nau). (I really like the landscape)
R3.1  Nik (→ni) ez dut nahi. (I don't want it)
      Langileek (→langileak) lan handia egin dute. (The workers have worked very hard)
R4.1  Ez dut uste etorriko denik (→dela). (I don't think he/she's coming)
R4.2  Badago beste arazo bat zehaztasuna dena (→dela). (There's another problem that is precision)
R4.3  Laster zabaldu da denok gaixotuko garelaren (→garelako) albistea. (The news that we're all going to get sick has spread fast)

Table 3: Examples of errors generated by the implemented rules. In brackets, the English translation of the correct example.


3.2 Strategies for building datasets

Each example in the training (and evaluation) datasets we create is composed of a sentence pair that includes a sentence containing grammatical errors and its corresponding corrected version. In order to generate those pairs, we apply the rules described in the previous subsection over grammatically correct sentences. We also add pairs composed of the original unmodified sentences, so that trained models also take those cases into account. Those grammatically correct sentences are extracted from a news corpus compiled from several Basque news websites: Berria.eus, Argia.eus and the various proximity media of the Tokikom.eus network. The collected corpus consists of 500,015 news items, from which we extract 4,927,748 correct sentences Oc = {oci}, comprising 66 million words.

To generate the training datasets (Tables 4 and 5) we have analyzed different strategies to apply the rules on a subset of Oc of 4,921,748 sentences:

● Baseline (Dt0 dataset): We apply to each correct sentence oci a variation of the substitution rule proposed by Zhao et al. (2019), since seq2seq models trained on data generated using the original method performed very poorly. The original method consists in replacing, with a 10% probability, each word with another randomly selected word from the corpus. To avoid highly artificial sentences, our variation adds the constraint that the bigram (w1, w2) exists in the corpus, where w2 is the randomly selected word and w1 is the word preceding the original replaced word. For each correct sentence we generate the pair (z(oci), oci), composed of the incorrect sentence z(oci) and the correct oci, in addition to the unmodified pair (oci, oci).
● Strategy 1 (Dt1 dataset): From the rules that can be applied to the correct sentence oci, a single rule Ri is randomly selected and the pair (Ri(oci), oci) is generated, in addition to the unmodified pair (oci, oci). If no rule can be applied, only the pair (oci, oci) is generated.
● Strategy 2 (Dt2 dataset): For each rule Ri that can be applied to the correct sentence oci, the pair (Ri(oci), oci) is generated, in addition to the unmodified pair (oci, oci). If no rule can be applied, only the pair (oci, oci) is generated.
● Strategy 3 (Dt3 dataset): We apply a set of defined rules to each correct sentence oci. From the rules that can be applied, a set {Rj} of n rules is selected at random and applied sequentially to generate the pairs (R1(oci) ∘ … ∘ Rn(oci), oci) and (oci, oci). If no rule can be applied, only the pair (oci, oci) is generated.

In the different training datasets created we differentiate three types of example pairs according to the number of rules applied for their generation: a) None: they are not the result of any rule; b) Single: they are the result of applying one rule; c) Multi: they are the result of applying several rules (Table 5).

        Pairs     R1       R2       R3       R4
Dt0     8.49M     -        -        -        -
Dt1     9.33M     0.37M    3.43M    0.54M    0.07M
Dt2     17.38M    1.04M    9.66M    1.60M    0.19M
Dt3     9.33M     0.59M    3.84M    0.89M    0.10M
Dea     6,000     662      2,924    871      1,291
Dem     672       49       307      92       143

Table 4: Number of total pairs (None included) and pairs generated by each rule in the training (Dt0, Dt1, Dt2, Dt3) and evaluation (Dea, Dem) datasets.

        Pairs     None     Single    Multi
Dt0     8.49M     -        -         -
Dt1     9.33M     4.92M    4.41M     0
Dt2     17.38M    4.91M    12.47M    0
Dt3     9.33M     4.92M    3.17M     1.24M
Dea     6,000     2,000    2,000     2,000
Dem     672       250      221       201

Table 5: Total number of pairs (Pairs) and pairs according to typology (None, Single, Multi) in the training (Dt0, Dt1, Dt2, Dt3) and evaluation (Dea, Dem) datasets.

To create the evaluation datasets we use the same strategies as those used to create the training datasets, but on a different subset (6,000 correct sentences) of Oc. We guarantee a balance between the pair types None, Single and Multi. In this way, we generate a first evaluation dataset, Dea, fully automatically. Taking into account that, in some cases, the application of the rules can generate grammatically correct sentences, we also built another evaluation dataset, Dem, consisting of a subset of 750 Dea pairs reviewed manually: pairs generated by the rules that do not really include grammatical errors are eliminated. 78 pairs were discarded in the manual review process, leaving a subset of 672 pairs (see Tables 4 and 5).

4 Seq2seq architecture for GEC

Seq2seq neural architectures are being used successfully to address the GEC task. Unlike statistical translation models, seq2seq neural architectures can model dependencies between words (or similar word sets) that are critical in correcting grammatical errors (Sakaguchi et al., 2016).

In the literature, we can distinguish three main sequence-to-sequence architectures proposed for the correction of grammatical errors: architectures based on recurrent neural networks (Ge et al., 2018), architectures based on convolutional networks (Chollampatt and Ng, 2018), and architectures based on self-attention (Junczys-Dowmunt et al., 2018). The last ones are those we use in this work, since they provide better results in this task according to Zhao et al. (2019).

For training the grammatical correction models we have chosen the Transformer architecture (Vaswani et al., 2017). Specifically, we have used the implementation of the OpenNMT-py library. The Transformer architecture is based on an encoder-decoder system with an attention mechanism. Both the encoder and the decoder are composed of 6 layers, each composed in turn of a recurrent neural network and an attention mechanism. We have used the default values of the architecture, without any optimization of the parameters. The size of the recurrent neural network of each layer is 512; thus, embeddings of size 512 have been used for both incorrect and corrected sentences. The Adam optimizer has been used during training, with a learning rate of 2 and a warm-up phase of 8,000 steps. The dropout ratio is 0.1, the batch size is 4,096 sentences, and the models have been trained until the results on the development set showed no further improvement. For the development set, 5,000 sentence pairs were selected randomly from the training data.

To avoid the open-vocabulary issue and for a better translation of unknown words, BPE tokenization (Sennrich et al., 2016) has been applied to source and target sequences. Rare or unseen words are represented as a sequence of subword units. In the case of Basque, this encoding is particularly useful, as declensions generate a larger vocabulary.

5 Results

We present results for four GEC systems. All of them are based on the Transformer model introduced in the previous section and trained on the synthetic datasets presented in section 3. Those systems were evaluated according to the standard metrics used in GEC: precision, recall and F0.5 with respect to the set of edits needed to correct the incorrect sentences. The upper bound would be an oracle system that makes only the necessary edits to correct the errors included in the incorrect sentences. The following systems were built and evaluated:

● Dt0+tr system: Transformer model trained on the Dt0 dataset (synthetic examples by random word replacement).
● Dt1+tr system: Transformer model trained on the Dt1 dataset (synthetic examples by application of one rule per sentence).
● Dt2+tr system: Transformer model trained on the Dt2 dataset (synthetic examples by application of n rules per sentence).
● Dt3+tr system: Transformer model trained on the Dt3 dataset (synthetic examples by simultaneous application of n rules per sentence).

Tables 6, 7, 8 and 9 show the results obtained for each of the systems with respect to the evaluation datasets, both the automatic Dea and the manually reviewed Dem.

The best results are obtained by the Dt3+tr system, on both evaluation datasets and also on the Single (sentences with one error) and Multi (sentences with more than one error) subsets, as well as on the four types of errors (E1, E2, E3 and E4). The Transformer model seems able to better learn the task from examples that can combine more than one error, which is the configuration of the Dt3 training dataset. Error analysis revealed that Dt3+tr works well with sentences with more than one error, but tends to make incorrect fixes in sentences with no errors or containing a single error.
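For reference, F0.5 combines the precision and recall above with beta = 0.5, weighting precision twice as heavily as recall: F0.5 = (1 + 0.5²) · P · R / (0.5² · P + R). Below is a sketch of this edit-based scoring over gold and predicted edit sets; the edit encoding is illustrative, as the paper does not specify one.

```python
def edit_scores(gold_edits, pred_edits, beta=0.5):
    """Precision, recall and F_beta over sets of edits; beta=0.5 makes
    precision twice as important as recall."""
    tp = len(gold_edits & pred_edits)                # edits both made and needed
    p = tp / len(pred_edits) if pred_edits else 1.0  # no edits proposed: perfect precision
    r = tp / len(gold_edits) if gold_edits else 1.0  # no edits needed: perfect recall
    b2 = beta ** 2
    f = (1 + b2) * p * r / (b2 * p + r) if p + r else 0.0
    return p, r, f

# Hypothetical edits encoded as (sentence id, token position, replacement):
gold = {(0, 3, "zintudan"), (0, 5, "diogu"), (1, 1, "Jonek"), (1, 4, "du")}
pred = {(0, 3, "zintudan"), (0, 5, "diogu"), (1, 2, "daki")}
print(edit_scores(gold, pred))  # (0.666..., 0.5, 0.625)
```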


          Dea                  Dem
          P     R     F0.5     P     R     F0.5
Dt0+tr    0.26  0.02  0.07     0.26  0.02  0.07
Dt1+tr    0.86  0.57  0.78     0.88  0.58  0.80
Dt2+tr    0.83  0.56  0.75     0.80  0.60  0.75
Dt3+tr    0.88  0.73  0.85     0.90  0.76  0.87

Table 6: Precision (P), recall (R) and F0.5 of the systems with respect to the automatic (Dea) and manually reviewed (Dem) datasets.

          Dea                  Dem
          S     M     S+M      S     M     S+M
Dt0+tr    0.08  0.06  0.07     0.06  0.09  0.07
Dt1+tr    0.81  0.79  0.80     0.84  0.81  0.82
Dt2+tr    0.82  0.77  0.79     0.83  0.78  0.80
Dt3+tr    0.83  0.88  0.86     0.86  0.90  0.89

Table 7: F0.5 results of the systems with respect to the Single (S) and Multi (M) subsets of the evaluation datasets.

          E1    E2    E3    E4
Dt0+tr    0.07  0.07  0.04  0.06
Dt1+tr    0.81  0.81  0.74  0.77
Dt2+tr    0.79  0.79  0.72  0.78
Dt3+tr    0.88  0.87  0.86  0.85

Table 8: F0.5 results of the systems on the automatic evaluation dataset Dea, depending on the grammatical error to be corrected.

          E1    E2    E3    E4
Dt0+tr    0.07  0.07  0.01  0.06
Dt1+tr    0.88  0.83  0.76  0.79
Dt2+tr    0.85  0.79  0.72  0.82
Dt3+tr    0.94  0.90  0.90  0.87

Table 9: F0.5 results of the systems on the manually revised dataset Dem, depending on the grammatical error to be corrected.

The Dt0+tr system, trained from the baseline dataset, has a very low performance, which points out that the generic replacement rules are not adequate to generate synthetic training datasets, at least in this case study. The model does not have enough information to perform the necessary fixes: it does not create new mistakes, but neither does it correct them.

The results of the Dt1+tr and Dt2+tr systems differ slightly from each other, the former being better. But the results of both are notably lower than those of Dt3+tr, especially in terms of recall. This difference in performance with respect to Dt3+tr is especially accentuated (see Table 7) when dealing with sentences with more than one error (Multi); these systems rarely solve more than one error in the same sentence. With regard to the different types of errors, there are no major differences, and in general better results are obtained for types E1 and E2 (see Tables 8 and 9).

6 Conclusions

In this work we have been able to prove that it is possible to implement a grammatical error correction system based on seq2seq neural models for a less-resourced language, represented by Basque in this case study. For this type of language, where no training data is available for the GEC task, we have found that a strategy based on building synthetic training datasets from monolingual corpora is feasible. The proposed method, based on combining different linguistic rules to generate grammatical errors, allows the creation of large valid datasets to train high-performance seq2seq neural models. In the future, we plan to extend the repertoire of linguistic rules for generating synthetic errors, and also to study other methods of synthetic data generation, in order to include more types of grammatical errors in the seq2seq model.

References

Brockett, C., W. B. Dolan and M. Gamon. 2006. Correcting ESL errors using phrasal SMT techniques. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics (pp. 249-256).

Chollampatt, S. and H. T. Ng. 2018. A multilayer convolutional encoder-decoder neural network for grammatical error correction. In Thirty-Second AAAI Conference on Artificial Intelligence.

Ezeiza, N., I. Alegria, J. M. Arriola, R. Urizar and I. Aduriz. 1998. Combining stochastic and rule-based methods for disambiguation in agglutinative languages. In Proceedings of the 17th International Conference on Computational Linguistics, Volume 1 (pp. 380-384).


Gamon, M. 2010. Using mostly native data to correct errors in learners' writing: a meta-classifier approach. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (pp. 163-171).

Ge, T., F. Wei and M. Zhou. 2018. Fluency boost learning and inference for neural grammatical error correction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1) (pp. 1055-1065).

Grundkiewicz, R. and M. Junczys-Dowmunt. 2014. The WikEd error corpus: A corpus of corrective Wikipedia edits and its application to grammatical error correction. In International Conference on Natural Language Processing (pp. 478-490).

Izumi, E., K. Uchimoto, T. Saiga, T. Supnithi and H. Isahara. 2003. Automatic error detection in the Japanese learners' English spoken data. In The Companion Volume to the Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (pp. 145-148).

Junczys-Dowmunt, M., R. Grundkiewicz, S. Guha and K. Heafield. 2018. Approaching neural grammatical error correction as a low-resource machine translation task. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) (pp. 595-606).

Lichtarge, J., C. Alberti, S. Kumar, N. Shazeer, N. Parmar and S. Tong. 2019. Corpora generation for grammatical error correction. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 3291-3301).

Oronoz, M. 2009. Euskarazko errore sintaktikoak detektatzeko eta zuzentzeko baliabideen garapena: datak, postposizio-lokuzioak eta komunztadura. Doctoral dissertation, Universidad del País Vasco / Euskal Herriko Unibertsitatea.

Rei, M. and H. Yannakoudakis. 2017. Auxiliary objectives for neural error detection models. In Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications (pp. 33-43).

Rozovskaya, A. and D. Roth. 2016. Grammatical error correction: Machine translation and classifiers. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 2205-2215).

Sakaguchi, K., C. Napoles, M. Post and J. Tetreault. 2016. Reassessing the goals of grammatical error correction: Fluency instead of grammaticality. Transactions of the Association for Computational Linguistics, 4, 169-182.

Sennrich, R., B. Haddow and A. Birch. 2016. Edinburgh neural machine translation systems for WMT 16. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task (pp. 371-376).

Vaswani, A., N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez and I. Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems (pp. 5998-6008).

Yuan, Z. and M. Felice. 2013. Constrained grammatical error correction using statistical machine translation. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning: Shared Task (pp. 52-61).

Zhao, W., L. Wang, K. Shen, R. Jia and J. Liu. 2019. Improving grammatical error correction via pre-training a copy-augmented architecture with unlabeled data. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (pp. 156-165).

Procesamiento del Lenguaje Natural, Revista nº 65, septiembre de 2020, pp. 21-28. Received 06-03-2020, revised 14-06-2020, accepted 14-06-2020. ISSN 1135-5948. DOI 10.26342/2020-65-2.

Un estudio de los métodos de reducción del frente de Pareto a una única solución aplicado al problema de resumen extractivo multi-documento

A study of methods for reducing the Pareto front to a single solution applied to the extractive multi-document summarization problem

Jesús M. Sánchez-Gómez (1), Miguel A. Vega-Rodríguez (1), Carlos J. Pérez (2)
(1) Dpto. de Tecnología de Computadores y Comunicaciones, Universidad de Extremadura
(2) Dpto. de Matemáticas, Universidad de Extremadura
[email protected], [email protected], [email protected]

Abstract: Automatic summarization methods are currently needed in many different contexts. The extractive multi-document summarization problem tries to cover the main content of a document collection and to reduce the redundant information. The best way to address this task is through a multi-objective optimization approach. The result of this approach is a set of non-dominated solutions, or Pareto set. However, since only one summary is needed, the Pareto front must be reduced to a single solution. For this, several methods have been considered, such as the largest hypervolume, the consensus solution, the shortest distance to the ideal point, and the shortest distance to all points. The methods have been tested using datasets from DUC, and they have been evaluated with ROUGE metrics. The results show that the consensus solution achieves the best average values.
Keywords: Extractive multi-document summarization, Multi-objective optimization, Pareto front, Single solution

1 Introduction

Nowadays, the information on the Internet grows exponentially, and obtaining the most important information on a specific topic is of interest in many areas. Extracting the most relevant information is possible by means of text mining tools, which are able to generate an automatic summary from textual information (Hashimi, Hafez, and Mathkour, 2015).

There are several types of summary. Depending on where the information is obtained from, a summary can be mono-document or multi-document (Zajic, Dorr, and Lin, 2008): mono-document summaries reduce the information of a single document, and multi-document ones select the information of a collection. A summary can also be abstractive, where the words and sentences that form it may not exist in the original text, or extractive, where its sentences do exist in the original source (Wan, 2008).

The extractive multi-document summarization problem can be formulated either as a single-objective or as a multi-objective optimization problem. In single-objective optimization only one objective function is optimized, which includes all the criteria in a weighted form (Alguliev, Aliguliyev, and Mehdiyev, 2011). On the other hand, in multi-objective optimization all the objective functions are optimized simultaneously. Moreover, multi-objective optimization has achieved better results than the single-objective approach (Saleh, Kadhim, and Attea, 2015). The objective functions used in these works were content coverage and redundancy reduction.

In multi-objective optimization the generated solution is not unique, but a set of non-dominated solutions called the Pareto set (Sudeng and Wattanapongsakorn, 2015). Pareto front reduction methods can be classified into three groups. First, preference-based methods, which need the user to assign preferences in advance (Antipova et al., 2015). Second, clustering methods, which select one or several subsets of solutions from the front by using similarity-based techniques, as in Taboada and Coit (2007). And third, distance-based methods, which select a single solution from the front based on the distance to the ideal point (Padhye and Deb, 2011).

In this work, several automatic methods for reducing the Pareto front to a single solution have been studied and compared, applied to the extractive multi-document summarization problem. These methods are based on the concepts of hypervolume, consensus solution and several types of distances. The experimentation has been carried out with the DUC (Document Understanding Conferences) datasets, and the results have been evaluated with the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metrics. In addition, a statistical analysis of the results obtained has also been performed.

2 Related work

The most widely used methods for reducing the Pareto front are analysed below.

First, methods based on user preferences need users to assign their preferences a priori in order to obtain a reduced set of solutions. These preferences are usually related to the weights of the objective functions. The weighted stress function method proposed by Ferreira, Fonseca, and Gaspar-Cunha (2007) integrates the user's preferences to find the best region of the Pareto front according to those preferences. In the same way, the discriminative additive utilities method (Soylu and Ulusoy, 2011) requires the user to assign some references reflecting their preferences. Another method is the one based on the Pareto filter, developed by Antipova et al. (2015), used to reduce and facilitate post-optimal analysis, with the idea of classifying the solutions according to a global efficiency and selecting those showing a better balance between the objectives.

Second, clustering methods divide the Pareto front into several groups. These methods start with a specific number of groups, iteratively computing the centroid for each of them and grouping the points of the Pareto front with the most similar centroid. The k-means partitioning algorithm (Taboada and Coit, 2007) provides smaller sets of optimal intermediate solutions, computing the centroid of each group and assigning each solution to the group with the closest centroid. Aguirre and Taboada (2011) used the self-organized dynamic growing tree approach to intelligently reduce the size of the set, thus obtaining representative solutions. In addition, the hierarchical clustering technique consists of forming subsets according to design decisions (Veerappa and Letier, 2011).

Finally, distance-based methods select the point of the Pareto front that has the shortest distance to the ideal point, which represents a solution in which the values of the objective functions are optimal. Padhye and Deb (2011) used the L2-metric method to select the solution with the shortest Euclidean distance to the ideal point. Similarly, Siwale (2013) presented the Chebyshev compromise solution, which uses the criterion of closeness to an ideal point based on this distance.
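As a concrete illustration of the L2-metric selection just described, the sketch below builds the ideal point from the best value of each objective over the front and keeps the nearest solution. The front values are made up for the example, and in practice the objectives are usually normalized before distances are compared.

```python
import math

def closest_to_ideal(front):
    """Select the Pareto-front solution with the shortest Euclidean (L2)
    distance to the ideal point, i.e. the vector formed by the best
    (here: maximum) value of each objective over the whole front."""
    ideal = [max(sol[k] for sol in front) for k in range(len(front[0]))]
    return min(front, key=lambda sol: math.dist(sol, ideal))

# Illustrative front of (content coverage, redundancy reduction) values:
front = [(0.9, 0.2), (0.7, 0.6), (0.4, 0.8)]
print(closest_to_ideal(front))  # (0.7, 0.6), closest to the ideal point (0.9, 0.8)
```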

In this work, other types of distances have been evaluated, such as the Manhattan, the Mahalanobis, and the Levenshtein distances, in addition to other methods, such as the largest hypervolume and the consensus solution.

3 Problem definition

The extractive multi-document summarization problem is formulated below as a multi-objective optimization problem. The most widely used methods in this field are those based on word vectors. In them, a sentence is represented as a vector of words, and a particular criterion is used to measure the similarity between sentences, cosine similarity being the most common.

3.1 Cosine similarity

Let T = {t_1, t_2, ..., t_m} be the set containing the m distinct terms existing in the document collection D. Each sentence s_i ∈ D is represented as s_i = (w_{i1}, w_{i2}, ..., w_{im}), i = 1, 2, ..., n, where n is the number of sentences in D. Each component w_{ik} represents the weight of term t_k in sentence s_i, computed with the tf-isf (term-frequency inverse-sentence-frequency) scheme as follows:

    w_{ik} = tf_{ik} · log(n / n_k)    (1)

where tf_{ik} counts how many times term t_k appears in sentence s_i, and log(n / n_k) is the isf factor, n_k being the number of sentences that contain term t_k.

The main content of D can be represented as the mean of the weights of the m terms in T by means of a mean vector O = (o_1, o_2, ..., o_m), whose components are computed as:

    o_k = (1/n) · Σ_{i=1}^{n} w_{ik}    (2)

Cosine similarity is based on the weights defined above. Specifically, it measures the resemblance between the pair of sentences s_i = (w_{i1}, ..., w_{im}) and s_j = (w_{j1}, ..., w_{jm}) as follows:

    sim(s_i, s_j) = ( Σ_{k=1}^{m} w_{ik} w_{jk} ) / ( sqrt(Σ_{k=1}^{m} w_{ik}^2) · sqrt(Σ_{k=1}^{m} w_{jk}^2) )    (3)

3.2 Multi-objective optimization

The document collection D = {d_1, d_2, ..., d_N}, containing N documents, can also be represented as a set of n sentences: D = {s_1, s_2, ..., s_n}. The goal is to generate a summary R ⊂ D that takes into account the following three aspects:

- Content coverage. The summary must cover the main content of the document collection.
- Redundancy reduction. The summary must not contain sentences from the collection that are similar to each other.
- Length. The summary must have a predefined length L.

The extractive multi-document summarization problem involves the simultaneous optimization of content coverage and redundancy reduction. However, these criteria conflict with each other. Therefore, the best way to solve this problem is through a multi-objective optimization approach.

First, it is necessary to define the representation of a solution, X = (x_1, x_2, ..., x_n), where x_i is a binary variable that accounts for the presence or absence (x_i = 1 or x_i = 0) of sentence s_i in the summary R.

The first objective to optimize, Φ_CC(X), concerns the content coverage criterion. Given the sentences s_i ∈ R, this objective is expressed as the similarity between the sentences s_i and all the sentences in the collection D, represented by the mean vector O. Therefore, the following function must be maximized:

    Φ_CC(X) = Σ_{i=1}^{n} sim(s_i, O) · x_i    (4)

The second objective to optimize, Φ_RR(X), refers to redundancy reduction. In this case, another binary variable y_{ij} is needed, which encodes the simultaneous presence or absence (y_{ij} = 1 or y_{ij} = 0) of the pair of sentences s_i and s_j in the summary R. For each pair of sentences s_i, s_j ∈ R, their similarity sim(s_i, s_j) must be minimized, which is equivalent to maximizing the following function:

    Φ_RR(X) = 1 / [ ( Σ_{i=1}^{n-1} Σ_{j=i+1}^{n} sim(s_i, s_j) · y_{ij} ) · Σ_{i=1}^{n} x_i ]    (5)

After defining the objectives, the multi-objective optimization problem can be formulated:

    max Φ(X) = {Φ_CC(X), Φ_RR(X)}    (6)

subject to

    L − ε ≤ Σ_{i=1}^{n} l_i · x_i ≤ L + ε    (7)

where l_i is the length of sentence s_i and ε is the tolerance on the summary length:

    ε = max_{i=1,2,...,n} l_i − min_{i=1,2,...,n} l_i    (8)
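As a concrete illustration of this formulation, the following sketch (our own, not the authors' implementation; the tokenization and all function and variable names are assumptions) builds the tf-isf vectors of Equation (1) and evaluates both objectives for a candidate summary encoded as the binary vector X:

```python
import math

def tf_isf_vectors(sentences):
    """Build tf-isf sentence vectors (Eq. 1) from tokenized sentences."""
    n = len(sentences)
    terms = sorted({t for s in sentences for t in s})
    n_k = {t: sum(1 for s in sentences if t in s) for t in terms}
    return [[s.count(t) * math.log(n / n_k[t]) for t in terms]
            for s in sentences]

def cosine(u, v):
    """Cosine similarity between two weight vectors (Eq. 3)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def objectives(X, vectors):
    """Content coverage (Eq. 4) and redundancy reduction (Eq. 5)."""
    n, m = len(vectors), len(vectors[0])
    O = [sum(v[k] for v in vectors) / n for k in range(m)]  # mean vector, Eq. 2
    phi_cc = sum(cosine(vectors[i], O) * X[i] for i in range(n))
    # y_ij = 1 exactly when both x_i and x_j are 1
    redundancy = sum(cosine(vectors[i], vectors[j])
                     for i in range(n - 1) for j in range(i + 1, n)
                     if X[i] and X[j])
    size = sum(X)
    # guard: an empty or fully non-redundant summary has no finite Eq. 5 value
    phi_rr = 1.0 / (redundancy * size) if redundancy and size else float("inf")
    return phi_cc, phi_rr

sents = [["the", "cat", "sat"], ["a", "dog", "ran"], ["the", "cat", "ran"]]
print(objectives([1, 0, 1], tf_isf_vectors(sents)))
```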

4 Methods for reducing the Pareto front to a single solution

This section presents the different methods studied to reduce the Pareto front to a single solution.

4.1 Largest hypervolume

The hypervolume measures the space covered by the objective function values corresponding to each solution of the Pareto front (Beume et al., 2009). Since this is a two-dimensional problem, the hypervolume is the area covered by each point (solution). This method selects the solution associated with the point with the largest hypervolume (MH) (see Figure 1).

[Figure 1: Largest hypervolume. Axes: content coverage (x) vs. redundancy reduction (y), both normalized to [0, 1].]

4.2 Consensus solution

The consensus solution (SC) is the one generated from all the solutions of the Pareto front (Pérez et al., 2017). In this problem it is formed with the sentences of the associated summaries, selecting the most frequently used sentences until reaching the length constraint L ± ε given in Equation (7). Since it is a summary generated from other summaries, its associated solution may not exist on the front (see Figure 2).

[Figure 2: Consensus solution. Same axes as Figure 1.]

4.3 Shortest distance to the ideal point

The ideal point is the one in which the objective values are the best possible. In a normalized objective space, with range [0, 1], where the objective functions must be maximized, the ideal point is located at (1, 1). Therefore, this method (DCPI) measures the distance between each point of the front and the ideal point, and selects the one with the shortest distance (see Figure 3, which is based on the Euclidean distance). Given two points P1 = (p_{11}, p_{12}) and P2 = (p_{21}, p_{22}), the distance d between P1 and P2 is denoted d(P1, P2). Besides the Euclidean (DCPIE) and Chebyshev (DCPIC) distances, the Manhattan (DCPIM) and Mahalanobis (DCPIB) distances have also been studied.

[Figure 3: Shortest distance to the ideal point. Same axes as Figure 1.]

First, the Euclidean distance is the ordinary distance between two points in a Euclidean space (Padhye and Deb, 2011):

    d_E(P1, P2) = sqrt( Σ_{i=1}^{2} (p_{1i} − p_{2i})^2 )    (9)

Second, the Manhattan distance is based on the grid-like street geography of the borough of Manhattan (Wu et al., 2015). It is defined as the sum of the lengths of the projections onto the coordinate axes of the line segment between two points:

    d_M(P1, P2) = ‖P1, P2‖ = Σ_{i=1}^{2} |p_{1i} − p_{2i}|    (10)

Third, the Chebyshev distance is the largest of the lengths of the projections onto the coordinate axes of the line segment between two points (Siwale, 2013):

    d_C(P1, P2) = max_{i=1,2} |p_{1i} − p_{2i}|    (11)

And fourth, the Mahalanobis distance determines the similarity between two multi-dimensional random variables, taking their correlations into account (Zhao et al., 2017). In this context it is defined as:

    d_B(P1, P2) = sqrt( Σ_{i=1}^{2} ( (p_{1i} − p_{2i}) / σ_i )^2 )    (12)

where σ_i is the uncorrected standard deviation, computed by measuring the dispersion of the values of objective function i across the G points of the Pareto front:

    σ_i = sqrt( (1/G) · Σ_{g=1}^{G} (p_{gi} − p̄_i)^2 )    (13)

with p̄_i = (1/G) · Σ_{g=1}^{G} p_{gi}.

4.4 Shortest distance to all points

This method (DCTP) evaluates the sum of the distances between a point and all the remaining ones, and selects the point of the front with the smallest distance (see Figure 4). For this method, in addition to the four distances defined above (Euclidean DCTPE, Chebyshev DCTPC, Manhattan DCTPM, and Mahalanobis DCTPB), the Levenshtein distance (DCTPL) has also been considered (Ristad and Yianilos, 1998). In this context it is defined as the number of sentence insertions and deletions needed to transform one summary into another. Given two summaries (or sets of sentences) R and S, the Levenshtein distance between them is computed as:

    d_L(R, S) = |R| + |S| − 2 · |R ∩ S|    (14)

where |·| is the cardinality, or number of elements, of a set.

[Figure 4: Shortest distance to all points. Same axes as Figure 1.]

5 Results

5.1 Datasets

The experiments have been run on the DUC datasets, a benchmark for the evaluation of automatic summaries. In this study, 10 topics (from d061j to d070f) of the DUC2002 dataset have been used (NIST, 2014). The summary length established in these datasets is 200 words.

5.2 Evaluation metrics

The performance of the methods has been evaluated with the ROUGE metrics, which are the ones used in DUC (Lin, 2004). ROUGE measures the similarity between an automatically generated summary and a summary produced by a human expert. The ROUGE scores employed have been ROUGE-2 (R-2) and ROUGE-L (R-L), and the arithmetic mean has been used to measure the central tendency for each topic.

In addition, a statistical analysis of the ROUGE scores has been carried out by means of appropriate hypothesis tests. P-values below 0.05 have been considered statistically significant.

5.3 Configuration

Sanchez-Gomez, Vega-Rodríguez, and Pérez (2018) proposed and applied a multi-objective optimization approach based on the artificial bee colony algorithm to the extractive multi-document summarization problem. The following parameters produced good results: population size = 50; number of cycles = 1000; and mutation probability = 0.1.
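Before turning to the results, the following sketch (our own illustration, not the authors' code; the tuple representation of front points and all names are assumptions) ties together the selection rules of Section 4 for a normalized two-objective front:

```python
import math

IDEAL = (1.0, 1.0)  # ideal point in the normalized (coverage, redundancy) space

def euclidean(p, q):            # Eq. 9
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):            # Eq. 10
    return sum(abs(a - b) for a, b in zip(p, q))

def chebyshev(p, q):            # Eq. 11
    return max(abs(a - b) for a, b in zip(p, q))

def mahalanobis(p, q, sigma):   # Eq. 12, with per-objective deviations sigma
    return math.sqrt(sum(((a - b) / s) ** 2 for a, b, s in zip(p, q, sigma)))

def levenshtein_sets(r, s):     # Eq. 14, summaries as sets of sentence ids
    return len(r) + len(s) - 2 * len(r & s)

def dcpi(front, dist):
    """Shortest distance to the ideal point: argmin over the front."""
    return min(front, key=lambda p: dist(p, IDEAL))

def dctp(front, dist):
    """Shortest distance to all points: minimize the sum of distances."""
    return min(front, key=lambda p: sum(dist(p, q) for q in front))

front = [(0.9, 0.3), (0.7, 0.6), (0.4, 0.8)]
print(dcpi(front, euclidean))   # -> (0.7, 0.6)
print(dctp(front, manhattan))   # -> (0.7, 0.6)
```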

Method   d061j  d062j  d063j  d064j  d065j  d066j  d067f  d068f  d069f  d070f  Mean
MH       0.297  0.186  0.212  0.201  0.112  0.189  0.273  0.242  0.131  0.099  0.194
SC       0.268  0.400  0.199  0.203  0.167  0.195  0.273  0.277  0.254  0.172  0.241
DCPIE    0.297  0.191  0.205  0.168  0.115  0.059  0.231  0.242  0.233  0.099  0.184
DCPIM    0.248  0.295  0.256  0.187  0.168  0.063  0.172  0.208  0.195  0.258  0.205
DCPIC    0.222  0.140  0.183  0.197  0.124  0.189  0.353  0.260  0.145  0.101  0.191
DCPIB    0.297  0.190  0.153  0.170  0.143  0.059  0.231  0.318  0.111  0.098  0.177
DCTPE    0.219  0.245  0.192  0.175  0.163  0.203  0.272  0.291  0.119  0.111  0.199
DCTPM    0.232  0.270  0.185  0.169  0.164  0.221  0.268  0.286  0.122  0.112  0.203
DCTPC    0.222  0.242  0.196  0.169  0.164  0.202  0.284  0.282  0.118  0.117  0.200
DCTPB    0.219  0.245  0.191  0.175  0.166  0.203  0.274  0.292  0.117  0.111  0.199
DCTPL    0.272  0.232  0.175  0.167  0.177  0.196  0.268  0.261  0.148  0.130  0.203

Table 1: Results for ROUGE-2 (best values in bold)

Method   d061j  d062j  d063j  d064j  d065j  d066j  d067f  d068f  d069f  d070f  Mean
MH       0.541  0.342  0.469  0.404  0.366  0.428  0.508  0.450  0.321  0.402  0.423
SC       0.509  0.571  0.458  0.390  0.407  0.461  0.517  0.506  0.517  0.480  0.482
DCPIE    0.540  0.347  0.459  0.347  0.375  0.278  0.417  0.450  0.536  0.402  0.414
DCPIM    0.508  0.483  0.489  0.373  0.393  0.281  0.413  0.450  0.461  0.481  0.433
DCPIC    0.494  0.291  0.414  0.400  0.375  0.425  0.558  0.465  0.380  0.406  0.421
DCPIB    0.540  0.345  0.357  0.312  0.341  0.279  0.417  0.531  0.256  0.402  0.378
DCTPE    0.481  0.463  0.440  0.368  0.390  0.458  0.525  0.518  0.441  0.435  0.452
DCTPM    0.492  0.477  0.430  0.365  0.390  0.468  0.518  0.512  0.449  0.440  0.454
DCTPC    0.438  0.462  0.445  0.364  0.391  0.460  0.532  0.507  0.443  0.440  0.453
DCTPB    0.481  0.463  0.438  0.368  0.392  0.458  0.525  0.518  0.443  0.435  0.452
DCTPL    0.518  0.406  0.384  0.349  0.395  0.456  0.488  0.485  0.322  0.431  0.423

Table 2: Results for ROUGE-L (best values in bold)

5.4 Results

The methods have been evaluated on the Pareto fronts resulting from the 20 repetitions executed for each topic. Tables 1 and 2 contain the results for R-2 and R-L respectively, showing that the consensus solution (SC) achieved the best mean value for both ROUGE metrics. It is also the method that obtained the best value in the largest number of topics (3 in R-2 and 2 in R-L). The shortest-distance methods to the ideal point and to all points with the Manhattan distance (DCPIM and DCTPM) achieved the second-best mean value (DCPIM in R-2 and DCTPM in R-L). In addition, DCPIM and DCTPM obtained the best value in 2 topics and in 1 topic, respectively, for both ROUGE metrics. The largest hypervolume method (MH) obtained the best result in 1 topic in R-2 and in 2 topics in R-L. Regarding the distance-based methods, and aggregating all distances, DCPI obtained the best result in 5 topics for both ROUGE metrics, whereas DCTP did so in 2 topics for R-2 and in 1 topic for R-L.

Once the ROUGE scores have been analyzed, the statistical analysis is presented. Statistical hypothesis tests have been applied to determine whether there are statistically significant differences among the methods. The normality and homoscedasticity conditions for the one-way ANOVA cannot be assumed for any topic, so the Kruskal-Wallis test has been used. The p-values are below 0.001 for all topics, so at least one pair of methods is significantly different. Pairwise comparisons between the consensus solution method and the rest of the methods have been carried out with the Conover test with Bonferroni correction. Tables 3 and 4 show the statistically significant differences, from which it is concluded that the consensus solution (SC) yields statistically significant differences with respect to the rest. Specifically, SC obtained statistically significant differences with 80% of the methods in R-2 in at least half of the topics. As for R-L, SC obtained statistically significant differences with all the methods in at least half of the topics.
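This testing sequence can be reproduced along the following lines (a sketch under the assumption that the 20 per-run ROUGE scores of each method are available for a topic; scipy and the scikit-posthocs package are one possible toolkit, not necessarily the one used by the authors, and the toy numbers are ours):

```python
import pandas as pd
from scipy.stats import kruskal
import scikit_posthocs as sp

# One list of 20 ROUGE values per method for a given topic (toy numbers)
scores = {
    "SC":    [0.24, 0.25, 0.23] * 6 + [0.24, 0.25],
    "MH":    [0.19, 0.20, 0.18] * 6 + [0.19, 0.20],
    "DCPIM": [0.21, 0.20, 0.22] * 6 + [0.21, 0.20],
}

# Kruskal-Wallis: is at least one method significantly different?
h, p = kruskal(*scores.values())
print(f"Kruskal-Wallis p-value: {p:.4g}")

# Conover pairwise tests with Bonferroni correction
df = pd.DataFrame([(m, v) for m, vals in scores.items() for v in vals],
                  columns=["method", "rouge"])
pvals = sp.posthoc_conover(df, val_col="rouge", group_col="method",
                           p_adjust="bonferroni")
print(pvals.round(4))
```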

Method   d061j   d062j   d063j   d064j   d065j   d066j   d067f   d068f   d069f   d070f   Total
MH       <0.001  <0.001  0.056   0.077   <0.001  <0.001  1.000   <0.001  <0.001  <0.001  7
DCPIE    <0.001  <0.001  1.000   <0.001  <0.001  <0.001  <0.001  <0.001  1.000   <0.001  8
DCPIM    1.000   <0.001  0.002   1.000   1.000   <0.001  <0.001  <0.001  0.002   1.000   6
DCPIC    0.057   <0.001  0.005   0.429   <0.001  0.006   <0.001  0.008   0.025   <0.001  8
DCPIB    <0.001  <0.001  <0.001  <0.001  0.683   <0.001  <0.001  1.000   <0.001  <0.001  8
DCTPE    <0.001  <0.001  1.000   0.202   1.000   0.415   1.000   1.000   <0.001  <0.001  4
DCTPM    0.007   <0.001  0.017   <0.001  1.000   <0.001  1.000   1.000   <0.001  <0.001  7
DCTPC    0.002   <0.001  1.000   <0.001  1.000   0.197   1.000   1.000   <0.001  <0.001  5
DCTPB    <0.001  <0.001  1.000   0.202   1.000   0.619   1.000   1.000   <0.001  <0.001  4
DCTPL    1.000   <0.001  <0.001  <0.001  1.000   1.000   1.000   0.442   <0.001  0.003   5

Table 3: p-values from the pairwise comparison between the consensus solution method and the rest of the methods for ROUGE-2 (statistically significant differences in bold)

Method   d061j   d062j   d063j   d064j   d065j   d066j   d067f   d068f   d069f   d070f   Total
MH       <0.001  <0.001  1.000   0.112   <0.001  <0.001  1.000   <0.001  <0.001  <0.001  7
DCPIE    <0.001  <0.001  1.000   <0.001  <0.001  <0.001  <0.001  <0.001  1.000   <0.001  8
DCPIM    0.457   <0.001  0.024   1.000   0.027   <0.001  <0.001  <0.001  1.000   1.000   6
DCPIC    1.000   <0.001  <0.001  0.900   <0.001  <0.001  <0.001  <0.001  <0.001  <0.001  8
DCPIB    <0.001  <0.001  <0.001  <0.001  <0.001  <0.001  <0.001  1.000   <0.001  <0.001  9
DCTPE    <0.001  <0.001  0.419   0.011   0.275   1.000   1.000   1.000   <0.001  <0.001  5
DCTPM    0.134   <0.001  0.003   0.004   0.205   1.000   1.000   1.000   <0.001  <0.001  5
DCTPC    0.005   <0.001  1.000   <0.001  0.303   1.000   0.635   1.000   <0.001  <0.001  5
DCTPB    <0.001  <0.001  0.206   0.027   0.599   1.000   1.000   1.000   <0.001  <0.001  5
DCTPL    1.000   <0.001  <0.001  <0.001  1.000   1.000   1.000   0.048   <0.001  <0.001  6

Table 4: p-values from the pairwise comparison between the consensus solution method and the rest of the methods for ROUGE-L (statistically significant differences in bold)

Metric    MH     DCPIE  DCPIM  DCPIC  DCPIB  DCTPE  DCTPM  DCTPC  DCTPB  DCTPL  Mean
ROUGE-2   24.23  30.98  17.56  26.18  36.16  21.11  18.72  20.50  21.11  18.72  23.53
ROUGE-L   13.95  16.43  11.32  14.49  27.51  6.64   6.17   6.40   6.64   13.95  12.35

Table 5: Improvement percentages (%) obtained by the consensus solution method
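These percentages follow directly from the mean scores of Tables 1 and 2: each entry is the relative gain of the SC mean over the corresponding method's mean. For DCPIM in ROUGE-2, for instance:

    (0.241 − 0.205) / 0.205 × 100 ≈ 17.56 %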

Finally, Table 5 shows the improvement percentages obtained by the consensus solution with respect to the rest of the methods. They range from 17.56% to 36.16% for R-2 and from 6.17% to 27.51% for R-L. Moreover, the overall improvement percentages are 23.53% and 12.35% for R-2 and R-L, respectively.

6 Conclusions

The extractive multi-document summarization problem has been solved by applying multi-objective optimization. Since the result is a set of non-dominated solutions, or Pareto front, the goal of this work has been to carry out an analysis in order to select a single summary from the front. To this end, different methods have been implemented, evaluated, and compared, based on the largest hypervolume, the consensus solution, the shortest distance to the ideal point, and the shortest distance to all points, evaluating several types of distances for the last two. The results indicate that the consensus solution is the method that achieved the best results. The overall improvement percentages reached are 23.53% for ROUGE-2 and 12.35% for ROUGE-L.

Acknowledgements

This research has been supported by the Agencia Estatal de Investigación (projects PID2019-107299GB-I00 and MTM2017-86875-C3-2-R), by the Junta de Extremadura (projects GR18090 and GR18108), and by the European Union (European Regional Development Fund). Jesus M. Sanchez-Gomez is supported by a predoctoral contract funded by the Junta de Extremadura (contract PD18057) and the European Union (European Social Fund).

References

Aguirre, O. and H. Taboada. 2011. A Clustering Method Based on Dynamic Self Organizing Trees for Post-Pareto Optimality Analysis. Procedia Computer Science, 6:195-200.

Alguliev, R. M., R. M. Aliguliyev, and C. A. Mehdiyev. 2011. Sentence selection for generic document summarization using an adaptive differential evolution algorithm. Swarm Evol. Comput., 1(4):213-222.

Antipova, E., C. Pozo, G. Guillén-Gosálbez, D. Boer, L. F. Cabeza, and L. Jiménez. 2015. On the use of filters to facilitate the post-optimal analysis of the Pareto solutions in multi-objective optimization. Comput. Chem. Eng., 74:48-58.

Beume, N., C. M. Fonseca, M. López-Ibáñez, L. Paquete, and J. Vahrenhold. 2009. On the Complexity of Computing the Hypervolume Indicator. IEEE Trans. Evol. Comput., 13(5):1075-1082.

Ferreira, J. C., C. M. Fonseca, and A. Gaspar-Cunha. 2007. Methodology to Select Solutions from the Pareto-Optimal Set: A Comparative Study. In Proceedings of the 9th Annual Conference on Genetic and Evolutionary Computation, pages 789-796. ACM.

Hashimi, H., A. Hafez, and H. Mathkour. 2015. Selection criteria for text mining approaches. Comput. Hum. Behav., 51:729-733.

Lin, C.-Y. 2004. ROUGE: A package for automatic evaluation of summaries. In Proceedings of the ACL-04 Workshop, volume 8, pages 74-81. ACL.

NIST. 2014. Document Understanding Conferences. http://duc.nist.gov. Last accessed: 11 August 2020.

Padhye, N. and K. Deb. 2011. Multi-objective optimisation and multi-criteria decision making in SLS using evolutionary approaches. Rapid Prototyping Journal, 17(6):458-478.

Pérez, C. J., M. A. Vega-Rodríguez, K. Reder, and M. Flörke. 2017. A Multi-Objective Artificial Bee Colony-based optimization approach to design water quality monitoring networks in river basins. Journal of Cleaner Production, 166:579-589.

Ristad, E. S. and P. N. Yianilos. 1998. Learning string-edit distance. IEEE Trans. Pattern Anal. Mach. Intell., 20(5):522-532.

Saleh, H. H., N. J. Kadhim, and B. A. Attea. 2015. A genetic based optimization model for extractive multi-document text summarization. Iraqi Journal of Science, 56(2):1489-1498.

Sanchez-Gomez, J. M., M. A. Vega-Rodríguez, and C. J. Pérez. 2018. Extractive multi-document text summarization using a multi-objective artificial bee colony optimization approach. Knowledge-Based Syst., 159:1-8.

Siwale, I. 2013. Practical Multi-Objective Programming. Technical Report RD-14-2013. APEX Research.

Soylu, B. and S. K. Ulusoy. 2011. A preference ordered classification for a multi-objective max-min redundancy allocation problem. Comput. Oper. Res., 38(12):1855-1866.

Sudeng, S. and N. Wattanapongsakorn. 2015. Post Pareto-optimal pruning algorithm for multiple objective optimization using specific extended angle dominance. Eng. Appl. Artif. Intell., 38:221-236.

Taboada, H. A. and D. W. Coit. 2007. Data Clustering of Solutions for Multiple Objective System Reliability Optimization Problems. Qual. Technol. Quant. Manag., 4(2):191-210.

Veerappa, V. and E. Letier. 2011. Understanding Clusters of Optimal Solutions in Multi-Objective Decision Problems. In 19th Requirements Engineering Conference, pages 89-98. IEEE.

Wan, X. 2008. An exploration of document impact on graph-based multi-document summarization. In Proceedings of the Conference on Empirical Methods in NLP, pages 755-762. ACL.

Wu, L., X. Xu, X. Ye, and X. Zhu. 2015. Repeat and near-repeat burglaries and offender involvement in a large Chinese city. Cartogr. Geogr. Inf. Sci., 42(2):178-189.

Zajic, D. M., B. J. Dorr, and J. Lin. 2008. Single-document and multi-document summarization techniques for email threads using sentence compression. Inf. Process. Manage., 44(4):1600-1610.

Zhao, L., Z. Lu, W. Yun, and W. Wang. 2017. Validation metric based on Mahalanobis distance for models with multiple correlated responses. Reliab. Eng. Syst. Saf., 159:80-89.

Procesamiento del Lenguaje Natural, Revista nº 65, septiembre de 2020, pp. 29-36. Received 30-04-2020, revised 26-05-2020, accepted 26-05-2020.

Generación de frases literarias: un experimento preliminar

Generation of Literary Sentences: a Preliminary Approach

Luis-Gil Moreno-Jiménez(1,4), Juan-Manuel Torres-Moreno(1,3), Roseli S. Wedemann(2)
(1) Université d'Avignon/LIA  (2) Universidade do Estado do Rio de Janeiro
(3) Polytechnique Montréal  (4) Universidad Tecnológica de la Selva
[email protected], [email protected], [email protected]

Resumen: En este trabajo abordamos la generación automática de frases literarias en español. Proponemos un modelo de generación textual basado en algoritmos estadísticos y análisis sintáctico superficial. Presentamos resultados preliminares que son bastante alentadores.
Palabras clave: Generación de texto, Modelos de lenguaje, Word embedding
Abstract: In this paper we address the automatic generation of literary phrases in Spanish. We propose a model for text generation based on statistical algorithms and shallow parsing. We present preliminary results that are quite encouraging.
Keywords: Natural Language Generation, Language Models, Word Embedding

1 Introduction

Automatic Text Generation (ATG) is an area of Natural Language Processing (NLP) that has achieved important advances in recent years, driven by the goal of developing computational models capable of simulating how people manipulate language. Most of this work pursues the automation of certain processes to increase productivity in industrial, academic, or technological settings (Szymanski and Ciota, 2002; Sridhara et al., 2010; Fu et al., 2014).

However, one approach within ATG has received little attention: the automatic generation of literary text. For some time, several researchers have worked on generative models of this kind, most of them devoted directly to poems, poetry, or narrative (Lebret, Grangier, and Auli, 2016; Brantley et al., 2019). We consider that literature, as a "Creative Process" (Boden, 2004), is the result that people develop as part of a complex cognitive process, and that the best way to approach it is to start from broader concepts such as emotions, association of concepts (semantics), and writing styles. We believe that, suitably combined, these concepts could be used to computationally model part of the creative process that a person follows to generate literature.

The difficulty of automating literary generation lies in the analysis of literary texts, which has systematically been set aside for several reasons. First, the level of literary discourse is more complex than that of other genres. Second, literary documents often refer to imaginary or allegorical worlds and situations, unlike genres such as journalistic or encyclopedic text, which mostly describe factual situations or events. These and other characteristics of literary texts make their automatic analysis an extremely complex task. In this work we propose to use literary corpora in order to generate literary realizations (new sentences) not present in those corpora.

This automated process gives rise to a research field called Computational Creativity (CC) (Pérez y Pérez, 2015), whose goal is to model the "creative process" in a language that is interpretable and reproducible within the domain of the computable. A wide variety of AI models have been adapted, and even improved, to simulate the creative process computationally (Colton, 2012). It should be noted that the main objective of CC is not to solve specific problems in industry or academia, but to propose new paradigms for the creation of artistic works (Colton, Wiggins, and others, 2012).

Another problem is the lack of a universal definition of literature, so that several definitions can be found. This complicates the evaluation task since, to achieve a homogeneous perception of what is literary, everyone should start from the same definition of literature. In this work we introduce a pragmatic definition of literary sentence, which will serve for our models and experiments.

Definition. A literary sentence is characterized by having elements (nouns, verbs, adjectives, and adverbs) that are perceived as elegant and less colloquial than their general-language equivalents.

For example, the general-language sentence:

"Me detuve a descansar luego de haber caminado mucho." ("I stopped to rest after having walked a lot.")

can be slightly rewritten to generate a literary sentence according to our definition:

"Tomé unos instantes para respirar y oxigenar mis pulmones... pues mi andar se había prolongado largo tiempo." ("I took a few moments to breathe and oxygenate my lungs... for my walk had gone on for a long time.")

In particular, we propose to artificially create literary sentences using generative models and semantic approximations based on corpora of literary language. Through the combination of these elements we seek homosyntax, that is, the production of new text from the discourse forms of various authors. The homosyntax does not keep the same semantic content, nor the same words, but it preserves the same syntactic structure. In this paper we study the problem of literary text generation in the form of isolated sentences, not at the paragraph level; paragraph generation may be the subject of future work. A protocol for evaluating the quality of the generated sentences will be presented.

This paper is structured as follows. Section 2 presents the state of the art in computational creativity. Section 3 describes the corpora used in our experiments. The models employed are described in Section 4. The results are given in Section 5, before concluding in Section 6 with some ideas for future work.

2 State of the art

Within ATG there are works with a variety of objectives. Szymanski and Ciota (2002) developed a model based on Markov chains for text generation in Polish. The process starts with a term given by the user (the initial state), and a probabilistic process computes the following states, each represented by n-grams of letters or words. The method performed best when generating words of up to 4 or 5 letters, which is the average word length in Polish.

Sridhara et al. (2010) use a different approach. They present a generative algorithm for descriptive comments applied to code blocks in the Java language. Some linguistic variables are considered, such as the names of methods, functions, and instances; these elements are processed heuristically to generate descriptive text.

There are also less extensive but more precise works. Huang et al. (2012) propose a model based on neural networks for the generation of multi-word subsets. The same objective is considered in (Fu et al., 2014), which aims to establish and detect the hypernym-hyponym relation using a word embedding model (also based on neural networks (Mikolov, Yih, and Zweig, 2013)). The authors report a precision of 0.70 when evaluating on a manually labeled corpus.

Regarding work related to ATG, there is a noticeable difference between approaches oriented to literary generation and those aimed at generating general-language text. For example, (Zhang and Lapata, 2014) proposes a model for poem generation based on two basic questions: what to say, and how to say it. The model considers a list of keywords to select a set of sentences, which are processed with a neural network (Mikolov and Zweig, 2012) to build new combinations and formulate new contexts. The model was evaluated manually by 30 experts on a scale from 1 to 5, analyzing readability, coherence, and meaningfulness in 5-word sentences, obtaining a precision of 0.75. However, the coherence between sentences turned out to be very poor.

Other works, such as Oliveira (2012) and Oliveira and Cardoso (2015), propose models for poem generation based on linguistic templates. Lists of keywords are used to control the context under which the poems are generated. These works use the PEN tool (http://code.google.com/p/pen) to obtain the grammatical information of words and create new templates. The advantage of template-based methods is that they help to maintain the coherence and grammaticality of the generated texts.

Models focused on poetry generation can be found in Oliveira (2017) and Agirrezabal et al. (2013). The latter presents a stochastic model in which the probability of occurrence of POS (Part-of-Speech) tags is computed from the occurrences of these tags extracted from several corpora. New sequences are then generated, and the POS tags of nouns and adjectives are subsequently substituted.

3 Corpora used

This section describes the corpora used in our experiments: the 5KL corpus and the 8KF corpus, both in Spanish.

3.1 The 5KL corpus

This corpus was built from approximately 5,000 documents (mostly books) in Spanish. The texts mostly correspond to literary genres: narrative, poetry, theater, essays, etc. (Given the size of this corpus, it was not possible to quantify the genres manually; an automatic approximation may be carried out in the future.) The original documents, in very heterogeneous formats (pdf, txt, html, doc, docx, odt, etc.), were processed to create a single document encoded in UTF-8. Given this heterogeneity, the corpus contains a large number of errors: split or merged words, symbols, numbers, unconventional paragraph layouts, etc., all of which complicates the analysis task. These are, however, the conditions of a real literary corpus.

Classical tools such as FreeLing have great difficulty processing this kind of document. We therefore built an ad hoc sentence segmenter for this noisy corpus: sentences were segmented automatically, using a PERL 5.0 program and regular expressions, to obtain one sentence per line.

The characteristics of the 5KL corpus are given in Table 1 (M = 10^6 and K = 10^3). This corpus is used to train the Word2vec model (Section 4).

Corpus             Sentences   Tokens
5KL                9 M         149 M
Mean per document  2.4 K       37.3 K

Table 1: 5KL corpus: literary works

The 5KL literary corpus has the advantage of being very large and well suited for machine learning. It has, however, the drawback that not all of its sentences are necessarily literary: many are general-language sentences, which often lend fluency to the reading and provide the necessary links between ideas.

3.2 The 8KF corpus

We decided to create a small controlled corpus, exclusively composed of "literary sentences", to be used to generate the templates or grammatical structures (Section 4.1). This corpus of almost 8,000 literary sentences was built manually from poems, speeches, quotations, short stories, and other works.

General-language sentences were carefully avoided, as well as sentences that were too short (N ≤ 3 words) or too long (N ≥ 30 words). Some of the criteria used to manually select the "literary" sentences were: a complex and aesthetic vocabulary, rarely used in everyday language, and the presence of certain literary figures such as rhyme, anaphora, and metaphor. The characteristics of the 8KF corpus are shown in Table 2.

Corpus             Sentences   Tokens
8KF                7 679       114 K
Mean per sentence  -           15

Table 2: 8KF corpus: literary sentences

4 Text generation model

The two phases that make up our model are described next. The first consists of generating a grammatical structure that we call a partially empty grammatical structure (EGP, from the Spanish estructura gramatical parcialmente vacía), composed of function words (connectors, auxiliary verbs, articles, etc.) and POS (Part-of-Speech) tags, i.e., tags carrying the grammatical information of each word (http://blade10.cs.upc.edu/freeling-old/doc/tagsets/tagset-es.html). The POS tags are obtained through a morphosyntactic analysis performed with FreeLing (Padró and Stanilovsky, 2012), developed at the TALP center (Universitat Politècnica de Catalunya) and available at http://nlp.lsi.upc.edu/freeling.

The second phase consists of replacing the POS tags with words semantically related to a context (a query, Q) requested from the user. For this, a Word2vec model (Mikolov, Yih, and Zweig, 2013) with geometric interpretations is employed.

4.1 EGP generation

This technique, known as canned text (Molins and Lapalme, 2015), has the advantage of speeding up the syntactic analysis and allowing us to focus directly on the vocabulary to be used (McRoy, Channarukul, and Ali, 2003; van Deemter, Theune, and Krahmer, 2005). The EGP is generated from a sentence of the 8KF corpus (Section 3), in which only verbs, nouns, and adjectives {V, S, A} are replaced by their respective POS tags. The other linguistic entities, in particular function words, are preserved.

The process starts with the random selection of an original sentence fo ∈ 8KF of length |fo| = N; one fo is selected for each sentence to be generated. fo is analyzed with FreeLing to identify its syntagms, and the elements {V, S, A} of the syntagms of fo are replaced by their respective POS tags. These linguistic elements are the ones that carry the most information in a text, regardless of its genre (Bracewell, Ren, and Kuriowa, 2005). According to our hypothesis, by substituting them we achieve the generation of new sentences by homosyntax: different semantics, same structure (unlike paraphrase, which seeks to preserve the semantics while altering the syntactic structure).

The architecture of the model is illustrated in Figure 1, where filled boxes represent function words and empty boxes POS tags to be replaced.

[Figure 1: Canned-text-based generative model. Pipeline: 8KF corpus → preprocessing → FreeLing → syntagm detection → length N → EGP.]

4.2 POS tag replacement

In this phase we employ the Word2vec semantic approximation model, which uses an algorithm based on neural networks. The goal is to obtain the vector representations of the words (embeddings) of the 5KL corpus in an n-dimensional space, and to be able to compute the distances between them. (Word2vec belongs to a broad field of research within NLP known as Representation Learning (Bengio, Courville, and Vincent, 2013).) The training of the Word2vec model is described below.

4.2.1 Word2vec model

For the training and implementation of Word2vec we use the Gensim library (available at https://pypi.org/project/gensim/), a Python implementation of Word2vec (https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa). The 5KL corpus is preprocessed to normalize the format of the text, removing characters that are not important for NLP analyses, such as punctuation and numbers (Torres-Moreno, 2014).

Words with more than 5 occurrences in the corpus were considered, a contextual window of 10 words was defined, and 60-dimensional vectors were used for the representations. The Word2vec model employed in this work is the continuous skip-gram model (Skip-gram) (Mikolov et al., 2013).

With the trained model it is possible to obtain a set of words, or embeddings, associated with an input defined by a query Q. Word2vec receives a term Q and returns a lexicon L(Q) = (w_1, w_2, ..., w_m), a set of m = 10 words semantically close to Q. The value of m was set in this way because it was observed that, as the number of words close to Q grows, they lose their relation to Q. Formally, we represent Word2vec as Q → L(Q).
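This training step can be reproduced roughly as follows (a minimal sketch using Gensim's current 4.x API; the file name and the exact preprocessing are our assumptions):

```python
from gensim.models import Word2Vec

# 5KL assumed available as one plain-text sentence per line,
# already stripped of punctuation and numbers.
with open("5KL.txt", encoding="utf-8") as f:
    sentences = [line.split() for line in f]

model = Word2Vec(
    sentences,
    vector_size=60,   # 60-dimensional embeddings (Section 4.2.1)
    window=10,        # contextual window of 10 words
    min_count=6,      # keep words with more than 5 occurrences
    sg=1,             # continuous skip-gram
)

# L(Q): the m = 10 words semantically closest to the query
lexicon = [w for w, _ in model.wv.most_similar("amor", topn=10)]
print(lexicon)
```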
4.2.2 Geometric interpretation

In this part we determine the most suitable words to substitute the POS tags of an EGP and thus generate a new sentence. For each tag POS_k, k = 1, 2, ..., of the EGP that is to be substituted, we use the algorithm described below.

A morphosyntactic analysis of the 5KL corpus is performed with FreeLing, and the POS tags are used to build sets of words sharing the same grammatical information (identical POS tags). An Associative Table (TA) is generated as a result of this process. The TA consists of entries pairing a tag POS_k with a list of associated words; formally, POS_k → V_k = {v_{k,1}, v_{k,2}, ..., v_{k,i}}.

A vector is then built for each of the following three words:

- o: the word of the original sentence fo corresponding to the tag POS_k. This word allows us to recreate a context from which the new sentence must move away, avoiding the production of a paraphrase.
- Q: the word defining the query provided by the user.
- w: the candidate word that could replace POS_k. This word belongs to a vocabulary V_k of size |V_k| = m words, w ∈ V_k, retrieved from the TA.

The 10 words o_i closest to o, the 10 words Q_i closest to Q, and the 10 words w_i closest to w (in this order, and obtained with Word2vec) are concatenated and represented in a symbolic 30-dimensional vector U. The number of dimensions was set to 30 empirically, as a reasonable compromise between lexical diversity and processing time.

The vector U can be written as

    U = (u_1, ..., u_10, u_11, ..., u_20, u_21, ..., u_30)

where each element u_j, j = 1, ..., 10, represents a word close to o; u_j, j = 11, ..., 20, a word close to Q; and u_j, j = 21, ..., 30, a word close to w. U can thus be rewritten as

    U = (o_1, ..., o_10, Q_11, ..., Q_20, w_21, ..., w_30)

From o, Q, and w, three 30-dimensional numeric vectors are generated:

    o : X = (x_1, ..., x_30)
    Q : Q = (q_1, ..., q_30)
    w : W = (w_1, ..., w_30)

where the values of X are obtained by taking the distance between the word o and each word u_j ∈ U, j = 1, ..., 30. The distance x_j = dist(o, u_j) is retrieved from Word2vec, with x_j ∈ [0, 1]. Obviously, the word o will be closer to the first 10 words u_j than to the remaining ones.

A similar process yields the values of Q and W from Q and w, respectively. In these cases, the query Q will be closest to the words u_j at positions j = 11, ..., 20, and the candidate word w will be closest to the words u_j at positions j = 21, ..., 30.

Next, the cosine similarities between Q and W (Equation 1) and between X and W (Equation 2) are computed:

    θ = cos(Q, W) = (Q · W) / (|Q| |W|)    (1)

    β = cos(X, W) = (X · W) / (|X| |W|)    (2)

These values of θ and β are normalized in [0, 1]. The process is iterated for all the words of the lexicon w ∈ V_k; this generates another set of vectors X, Q, and W for which the similarities must be computed again. In the end, m similarity values θ_i and β_i, i = 1, ..., m, are obtained, and the averages ⟨θ⟩ and ⟨β⟩ are computed.

The normalized quotient (θ_i / ⟨θ⟩) indicates how large the similarity θ_i is with respect to the average ⟨θ⟩ (a maximization-type interpretation); that is, how close the candidate word w is to the query Q. The normalized quotient (⟨β⟩ / β_i) indicates how reduced the similarity β_i is with respect to ⟨β⟩ (a minimization-type interpretation); that is, how far the candidate word w is from the word o of fo.

These quotients are obtained for each pair (θ_i, β_i) and combined (minimization-maximization) to compute a score S_i, according to the equation

    S_i = (θ_i / ⟨θ⟩) · (⟨β⟩ / β_i)    (3)

The higher the value of S_i, the better it serves our objectives: getting closer to the query and away from the original semantics.

Finally, the list of S_i values is sorted in decreasing order, and the candidate word w that will replace the POS tag in question is chosen at random among the top 3. The result is a new sentence f3(Q, N) that does not exist in the corpora used to build the model (see Figure 2).

[Figure 2: Semantic approximation based on the min-max geometric interpretation. The TA provides the lexicon V_k for each POS tag of the EGP; Word2vec provides the neighbor lexicons of o, Q, and w, from which U = [o] + [Q] + [w] and the vectors X = dist(o, U), Q = dist(Q, U), W = dist(w, U) are built; maximizing θ and minimizing β yields f3(Q, N).]

5 Results and evaluation

Given the characteristics of our model (language, corpora employed, generation inspired by homosyntax), a comparison with other models is not possible, since they focus on specific areas of literature and do not start from a general approach, as is the case in this work. However, prior to this model we carried out two experiments based on stochastic methods: one employing Markov chains, and one integrating canned text and Word2vec with a simplified approach. The results of those experiments are detailed in (Moreno-Jiménez, Torres-Moreno, and Wedemann, 2020), but broadly speaking the evaluated criteria, as well as the scale employed, were the same as those applied in this work. The Markov-based model obtained the following results: grammaticality = 0.55, coherence = 0.25, and context = 0.67. The experiment integrating canned text and Word2vec obtained: grammaticality = 0.74, coherence = 0.56, and context = 0.35.

The results of those experiments allowed us to detect some deficiencies in the initial models and to propose the current model, whose results are shown below.

5.1 Results

We present some examples of sentences generated by our model. For a query Q and a length of N words, the results are shown in the format f(Q, N) = generated sentence (the outputs are reproduced verbatim in Spanish).

1. f(AMOR,10) = En el aprecio está el cariño forzoso de una simpatía.
2. f(AMOR,10) = Los cariños no conocen de nada a un respeto loco.
3. f(AMOR,10) = No está la simpatía en las bondades de la envidia.
4. f(GUERRA,9) = Existe demasiada innovacion en torno a muy pocos sucesos.
5. f(GUERRA,9) = En la pelea todo debe motivo, menos la retirada.
6. f(GUERRA,10) = La codicia, siempre adversa, es terrible engendrada contra un desgraciado.
7. f(SOL,9) = Si tus dulces fueran amanecer, mis ojos marchitas fueran.
8. f(SOL,11) = Con rapidez, los monógamos impedimentos buscan para iluminar nos la luz.
9. f(SOL,10) = Incluso los luceros ingratos son comilones, y por tanto antiguos.
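A minimal sketch of this selection procedure (our own illustration: `neighbors` and `similarity` stand for Word2vec lookups, `candidates` is the lexicon V_k retrieved from the associative table, and all names are assumptions):

```python
import math
import random

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def pick_word(o, Q, candidates, neighbors, similarity, m=10):
    """Choose the replacement word for one POS slot (Section 4.2.2)."""
    theta, beta = {}, {}
    for w in candidates:
        U = neighbors(o, m) + neighbors(Q, m) + neighbors(w, m)  # 3m entries
        X  = [similarity(o, u) for u in U]   # profile of the original word o
        Qv = [similarity(Q, u) for u in U]   # profile of the query Q
        W  = [similarity(w, u) for u in U]   # profile of the candidate w
        theta[w] = cosine(Qv, W)             # Eq. (1)
        beta[w]  = cosine(X, W)              # Eq. (2)
    t_avg = sum(theta.values()) / len(theta)
    b_avg = sum(beta.values()) / len(beta)
    # Eq. (3): reward closeness to the query, penalize closeness to o
    scores = {w: (theta[w] / t_avg) * (b_avg / beta[w]) for w in candidates}
    top3 = sorted(candidates, key=scores.get, reverse=True)[:3]
    return random.choice(top3)
```

With the Gensim model of Section 4.2.1, `neighbors` could be `lambda x, m: [w for w, _ in model.wv.most_similar(x, topn=m)]` and `similarity` could be `model.wv.similarity`.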

5.2 Evaluation

In this section we present a manual evaluation protocol. The complete experiment consisted of the generation of 45 sentences. Three queries were considered, Q = {AMOR, GUERRA, SOL}, and 15 sentences were generated for each. The sentences were shuffled before being presented to the 7 selected evaluators. The evaluators were chosen for having university studies and being native Spanish speakers, as well as for a certain reading habit allowing a good comprehension of this kind of text. They were asked to evaluate, using the scale 0 = bad, 1 = acceptable, and 2 = correct, the following criteria:

- Grammaticality: spelling, correct conjugations, gender and number agreement.
- Coherence: readability, perception of an overall idea.
- Context: relation of the sentence to the query.

An adaptation of the Turing test was also carried out. The evaluators were led to believe that some sentences had been written by people and others by the algorithms, and they were asked to indicate which sentences they thought had been generated by people (0) and which by algorithms (1). Table 3 presents the arithmetic mean and standard deviation of the results, normalized to a 0-1 scale.

Criterion        Results
Grammaticality   0.77 ± 0.13
Coherence        0.60 ± 0.14
Context          0.53 ± 0.19
Turing           0.44 ± 0.15

Table 3: Evaluation of the system

We observe that the sentences are perceived as grammatical and coherent, with scores of 0.77 and 0.60 respectively. The context criterion obtained a lower result. This may be because the EGPs contain fixed elements (function words) that can sometimes alter the context or semantics of the inserted words. In spite of this, from a general perspective the results are quite encouraging.

In the Turing test, the evaluators perceived 44% of the sentences as generated by a person. Although this may seem a low score, compared with related work on this topic it turns out to be a good evaluation, considering that the goal is not to generate random text but text with literary characteristics.

6 Conclusions

In this work we have presented a first approximation of a generative model of literary text using neural word-embedding models. The model produces isolated sentences in Spanish with a certain literary content. Practically no user intervention is required (except for the query and the required sentence length). Paragraph generation and the use of rhymes will be the subject of future work (Medina-Urrea and Torres-Moreno, 2019). Given the structure of the proposed model, its extension to other languages (French and Portuguese) will also be considered (Charton and Torres-Moreno, 2011).

References

Agirrezabal, M., B. Arrieta, A. Astigarraga, and M. Hulden. 2013. POS-tag based poetry generation with WordNet. In 14th European Workshop on Natural Language Generation, pages 162-166. ACL.

Bengio, Y., A. Courville, and P. Vincent. 2013. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798-1828.

Boden, M. A. 2004. The Creative Mind: Myths and Mechanisms. Routledge.

Bracewell, D., F. Ren, and S. Kuriowa. 2005. Multilingual single document keyword extraction for information retrieval. In 2005 International Conference on Natural Language Processing and Knowledge Engineering, pages 517-522, Wuhan, China. IEEE.

Brantley, K., K. Cho, H. Daumé, and S. Welleck. 2019. Non-monotonic sequential text generation. In 2019 Workshop on Widening NLP, pages 57-59, Florence, Italy. ACL.

Charton, E. and J.-M. Torres-Moreno. 2011. Automatic modeling of logical connectors by statistical analysis of context. Canadian Journal of Information and Library Science, 35(3):287-306.

Colton, S. 2012. Automated theory formation in pure mathematics. Springer Science & Business Media.

Colton, S., G. A. Wiggins, and others. 2012. Computational creativity: The final frontier? In 20th European Conference on Artificial Intelligence, pages 21-26. ACL.

Fu, R., J. Guo, B. Qin, W. Che, H. Wang, and T. Liu. 2014. Learning semantic hierarchies via word embeddings. In 52nd Annual Meeting of the ACL, volume 1, pages 1199-1209, Baltimore, Maryland, USA. ACL.

Huang, E. H., R. Socher, C. D. Manning, and A. Y. Ng. 2012. Improving word representations via global context and multiple word prototypes. In 50th Annual Meeting of the ACL, volume 1, pages 873-882, USA. ACL.

Lebret, R., D. Grangier, and M. Auli. 2016. Neural text generation from structured data with application to the biography domain. arXiv preprint arXiv:1603.07771.

McRoy, S., S. Channarukul, and S. Ali. 2003. An augmented template-based approach to text realization. Natural Language Engineering, 9:381-420.

Medina-Urrea, A. and J.-M. Torres-Moreno. 2019. Rimax: Ranking semantic rhymes by calculating definition similarity.

Mikolov, T., I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111-3119. Curran Associates, Inc.

Mikolov, T., W.-t. Yih, and G. Zweig. 2013. Linguistic regularities in continuous space word representations. In NAACL: Human Language Technologies, pages 746-751, Atlanta, Georgia, USA. ACL.

Mikolov, T. and G. Zweig. 2012. Context dependent recurrent neural network language model. In 2012 IEEE Spoken Language Technology Workshop (SLT), pages 234-239, Miami, FL, USA. IEEE.

Molins, P. and G. Lapalme. 2015. JSrealB: A bilingual text realizer for web programming. In 15th ENLG, pages 109-111, Brighton, UK. ACL.

Moreno-Jiménez, L.-G., J.-M. Torres-Moreno, and R. S. Wedemann. 2020. Generación automática de frases literarias. Linguamatica. Accepted.

Oliveira, H. G. 2012. PoeTryMe: a versatile platform for poetry generation. In Computational Creativity, Concept Invention and General Intelligence, volume 1, Osnabrück, Germany. Institute of Cognitive Science.

Oliveira, H. G. 2017. A survey on intelligent poetry generation: Languages, features, techniques, reutilisation and evaluation. In 10th ICNLG, pages 11-20.

Oliveira, H. G. and A. Cardoso. 2015. Poetry generation with PoeTryMe. In Computational Creativity Research: Towards Creative Machines, volume 7, Paris, France. Atlantis Thinking Machines.

Padró, L. and E. Stanilovsky. 2012. FreeLing 3.0: Towards wider multilinguality. In 8th LREC, Istanbul, Turkey.

Pérez y Pérez, R. 2015. Creatividad Computacional. Larousse - Grupo Editorial Patria, México.

Sridhara, G., E. Hill, D. Muppaneni, L. Pollock, and K. Vijay-Shanker. 2010. Towards automatically generating summary comments for Java methods. In IEEE/ACM International Conference on Automated Software Engineering, pages 43-52, Antwerp, Belgium. ACM.

Szymanski, G. and Z. Ciota. 2002. Hidden Markov models suitable for text generation. In V. Kluev, N. Mastorakis, and D. Koruga, editors, WSEAS, pages 3081-3084, Athens, Greece. WSEAS Press.

Torres-Moreno, J.-M. 2014. Automatic Text Summarization. ISTE, Wiley, London, UK, Hoboken, USA.

van Deemter, K., M. Theune, and E. Krahmer. 2005. Real versus template-based natural language generation: A false opposition? Computational Linguistics, 31(1):15-24.

Zhang, X. and M. Lapata. 2014. Chinese poetry generation with recurrent neural networks. In 2014 EMNLP, pages 670-680, Doha, Qatar. ACL.

Procesamiento del Lenguaje Natural, Revista nº 65, septiembre de 2020, pp. 37-44. Received 30-04-2020, revised 04-06-2020, accepted 04-06-2020.

Cross-lingual Training for Multiple-Choice Question Answering

Entrenamiento Croslingüe para Búsqueda de Respuestas de Opción Múltiple

Guillermo Echegoyen, Álvaro Rodrigo, Anselmo Peñas
Universidad Nacional de Educación a Distancia (UNED)
{gblanco, alvarory, anselmo}@lsi.uned.es

Abstract: In this work we explore to what extent multilingual models can be trained for one language and applied to a different one for the task of Multiple Choice Question Answering. We employ the RACE dataset to fine-tune both a monolingual and a multilingual model and apply these models to other collections in different languages. The results show that both monolingual and multilingual models can be zero-shot transferred to a different dataset in the same language while maintaining their performance. Besides, the multilingual model still performs well when applied to a different target language. Additionally, we find that exams that are more difficult for humans are harder for machines too. Finally, we advance the state of the art for the QA4MRE Entrance Exams dataset in several languages.
Keywords: Question Answering; Multiple-Choice Reading Comprehension; Multilinguality
Resumen: En este trabajo exploramos en qué medida los modelos multilingües pueden ser entrenados para un solo idioma y aplicados a otro diferente para la tarea de respuesta a preguntas de opción múltiple. Empleamos el conjunto de datos RACE para ajustar tanto un modelo monolingüe como multilingüe y aplicamos estos modelos a otras colecciones en idiomas diferentes. Los resultados muestran que tanto los modelos monolingües como los multilingües pueden transferirse a un conjunto de datos diferente en el mismo idioma manteniendo su rendimiento. Además, el modelo multilingüe todavía funciona bien cuando se aplica a un idioma de destino diferente. Asimismo, hemos comprobado que los exámenes que son más difíciles para los humanos también son más difíciles para las máquinas. Finalmente, avanzamos el estado del arte para el conjunto de datos QA4MRE Entrance Exams en varios idiomas.
Palabras clave: Búsqueda de Respuestas; Opción múltiple; Multilingüismo

1 Introduction

Question Answering (QA) has manifold dimensions, depending on the source of the information (i.e. free text vs. knowledge bases) and on how the question is to be responded. In this work, we focus on Multiple-Choice QA, where systems have to select the correct answer from a set of candidates according to a given text. This format is usually applied to evaluate language understanding in humans.

To solve this task, the tendency has steered towards deep, attention-based language models, like BERT or XLNet (Devlin et al., 2019; Yang et al., 2019), based on the idea of transformers (Vaswani et al., 2017). These models have boosted all available NLP tasks, becoming the go-to technique in many cases, and QA is not an exception.

These models could be a handicap for languages that are underrepresented in terms of resources available to train them. In addition, the training of these models requires huge computation power. Hence, there is great interest from the research community in developing multilingual models trained once for many different languages. So far, monolingual models still seem to perform better than multilingual ones for some tasks (Martin et al., 2019; Agerri et al., 2020; Cañete et al., 2020).

In order to use these pre-trained models, systems must be fine-tuned to the target task. Unfortunately, the majority of languages do not have large enough datasets for several tasks; this is the case of Multiple-Choice QA. Given the size of the RACE dataset (Lai et al., 2017), it is possible to fine-tune an English model with it, but the same cannot be done for other languages, as the datasets are too small. Such is the case of Spanish or Italian. The creation of these datasets is usually very costly both in time and money. Thus, the majority of collections are either too small to train a deep model or are only available in English (Hsu, Liu, and Lee, 2019).

In this work, we study how to apply models trained with a dataset in one language to a different collection in another language. For this purpose, we use the RACE collection to train a model and we test the model using the Entrance Exams (EE) datasets, which are available in several languages (Rodrigo et al., 2018).

According to these observations, the objectives of this work are:

i) Compare the results obtained in EE and RACE. In principle, they target similar human language skills: middle and high school English level in the case of RACE and university admission in the case of EE.

ii) Determine if there is a correlation between the difficulty of the exercises for humans and for computers.

iii) Determine if the knowledge learnt by fine-tuning with a collection (RACE in English) can be transferred to perform with another collection (EE English and other languages).

iv) Test the performance of both monolingual and multilingual BERT models once they are trained in one language and evaluated in different ones.

v) Advance the state-of-the-art for the EE task in various languages.

These objectives are motivated by the following research questions:

RQ 1 How do monolingual and multilingual models perform for Multiple-Choice QA when they are trained for a specific language to work in a different one?

RQ 2 When monolingual and multilingual models are trained and tested for the same language, is their performance comparable?

RQ 3 Can multilingual models advance the current state-of-the-art for some languages where there is not enough training data?

2 Related Work

Question Answering (QA) is the task of returning a precise and short answer given a Natural Language question. QA can be approached from two main perspectives (Rogers et al., 2020): 1) Open QA, where systems collect evidence and answers across several sources such as Web pages and knowledge bases (Fader, Zettlemoyer, and Etzioni, 2013) and 2) Reading Comprehension (RC), where the answer is gathered from a single document.

RC systems can be oriented to: (1) extract spans of text with the answer (extractive QA), (2) select an answer from a set of candidates (multiple-choice QA) or (3) generate an answer (generative QA). Extractive QA has received a lot of attention, fostered by the availability of popular benchmarks such as SQuAD (Rajpurkar, Jia, and Liang, 2018). On the other hand, generative QA has received less attention given that it is difficult to perform an exact evaluation and there are few datasets (Kočiský et al., 2018).

In this work we focus on Multiple-Choice (MC) QA. Since MC is a common way to measure reading comprehension in humans, the task is very realistic. In fact, the datasets employed in this research (presented ahead) are based on real world exams. Besides, some researchers have pointed to the MC format as a better format to test language understanding of automatic systems (Rogers et al., 2020).

There exist several MC collections, mostly in English. In some cases their creation involves paying crowd-workers to gather documents and/or pose questions regarding those documents. MCTest (Richardson, Burges, and Renshaw, 2013), for example, asked workers to invent short, children-friendly, fictional stories and four questions with four answers each, including deliberately wrong answers. As a way to encourage a deeper understanding of texts, the QuAIL dataset includes unanswerable questions (Rogers, Kovaleva, and Rumshisky, 2020). Other datasets were created from real world exams. This is the case of the well-known MC dataset RACE (Lai et al., 2017), or the multilingual Entrance Exams (Rodrigo et al., 2018), described in more detail in the next Section.

When doing QA, there usually exists a constraint on the language and size of the datasets available. In this sense, many times there is not enough training data to fine-tune a model in a specific language. Asai et al. (2018) tried to solve this issue by translating the target collection to a language with enough training data and using a QA system trained in the second language. However, this approach relies too much on the quality of the translation.

A common practice to fill this gap is zero-shot learning, which aims to solve a task without receiving any example of that task at the training phase. That is, we fine-tune for task A and evaluate on task B, possibly in another language. Thus, we expect the knowledge to be transferred from one task to the other (in another language) with minimum overhead.

We have found similar efforts in the literature. Hsu, Liu, and Lee (2019), for example, study how to train Multi-BERT for extractive QA in one language to test it in another, and they obtain promising results. We differ from them in: (1) the languages employed: they test English, Korean and Chinese; (2) the task: they work on extractive QA; and (3) the type of collections: we use collections crafted for human evaluation, which allows us to study how difficulty for humans correlates with difficulty for automatic systems. More specifically, we zero-shot transfer a model from RACE (fine-tune) to Entrance Exams (no training data available) in multiple, unseen languages.

3 Datasets

In our experiments we use RACE and Entrance Exams. Both collections are derived from real human evaluations. The following subsections give further details of each collection.

3.1 RACE

RACE (Lai et al., 2017) was collected from the English exams for middle and high school Chinese students. There are two subsets depending on the level of the exams: RACE-M for middle school and RACE-H for high school. The authors proposed it to evaluate the reading comprehension task using the MC format. RACE consists of more than 100K questions generated by human experts (English instructors). Table 1 shows the details of RACE. We can see in the table that RACE-H contains more data than RACE-M.

In order to evaluate the difficulty of the collection, the authors employed Amazon Mechanical Turk [1] to annotate question types on a subset. The authors found a higher ratio of reasoning questions with respect to CNN (Hermann et al., 2015), SQuAD and NewsQA (Trischler et al., 2017), which justifies that RACE is more difficult than those datasets.

3.2 Entrance Exams

The Entrance Exams (EE) data was collected from standardized English examinations for university admission in Japan and used in the Entrance Exams task at CLEF in 2013, 2014 and 2015 (Rodrigo et al., 2018). Exams were created by the Japanese National Center for University Admission Tests. Only the exams with MC format were included in the dataset. We show in Table 2 the number of documents and questions released in each edition, as well as the numbers for the overall set.

The organizers of the task also proposed the same task in languages other than English by collecting parallel translations from volunteers at the Translations for Progress website [2]. EE data is also available in French, Italian, Spanish and Russian. Translations into German are only available for the 2015 dataset. Thus, EE allows testing QA systems in languages other than English, which is the language in which almost all QA datasets are available. However, EE received participants only for the English and French tasks. Therefore, this paper represents the first attempt to solve EE in the other languages.

[1] https://www.mturk.com/mturk/welcome
[2] http://www.translationsforprogress.org/

Dataset       RACE-M                  RACE-H                  RACE
Subset        Train   Dev    Test     Train   Dev    Test     Train   Dev    Test    All
# documents   6,409   368    362      18,728  1,021  1,045    25,137  1,389  1,407   27,933
# questions   25,421  1,436  1,436    62,445  3,451  3,498    87,866  4,887  4,934   97,687

Table 1: Details of RACE-M, RACE-H and RACE collections
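For readers who want to inspect RACE directly, the sketch below shows one way to load it. This is our illustration, not the data handling used by the authors (whose code lives in their repository); it assumes the publicly hosted `race` dataset on the Hugging Face hub, with its `middle`, `high` and `all` configurations.

```python
# Illustrative only: loading RACE with the Hugging Face "datasets" library.
# Dataset and field names are those published on the hub, not from this paper.
from datasets import load_dataset

race = load_dataset("race", "all")   # configs: "middle", "high", "all"
example = race["train"][0]
print(example["article"][:100])      # the reading passage
print(example["question"])           # the question stem
print(example["options"])            # the four candidate answers
print(example["answer"])             # gold label: "A", "B", "C" or "D"
```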

              2013   2014   2015   All
# documents   9      12     19     40
# questions   46     56     89     191

Table 2: Details of the Entrance Exams collection

We also want to remark that the EE organizers proposed two kinds of evaluation. The first one is based on traditional QA evaluation, measuring the overall performance of a system over the whole set of questions. The second approach measures the number of tests passed by a system. According to this approach, each test is made of a document and the questions about it. Then, a system passes a test if it manages to answer correctly 50% or more of the questions, similar to human evaluations.

4 BERT and Multilingual BERT

BERT and Multilingual BERT (M-BERT from now on) are transformer-based language representation models. They have been pre-trained from unlabeled text to do Masked Language Modeling and Next Sentence Prediction (Devlin et al., 2019). Afterwards, each model can be fine-tuned on specific tasks such as those at GLUE (Wang et al., 2018) or QA. Albeit both models share the same architecture (a twelve-layer transformer), they were trained with different corpora. BERT was trained with BooksCorpus (800M words) and M-BERT with the Wikipedia in 104 different languages. Even so, both use the same word piece vocabulary (and tokenizer) and have no information about the language during training.

The idea behind M-BERT is to learn all languages at once, delivering a single model capable of operating in multiple languages. There are several caveats with this approach:

1. Languages compete against one another for a fraction of the hyperparameters, affecting underrepresented languages. Although oversampling is applied, the issue is not completely solved (Artetxe, Ruder, and Yogatama, 2019).

2. No language-specific knowledge is used to improve any part of the model. Intuitively, one should be able to pre-train a BERT model in any language, applying language-specific knowledge, and improve over M-BERT (e.g. in (Agerri et al., 2020), the authors employ a Basque-specific word piece vocabulary to improve the Basque model).

In this paper, we compare both models and their ability to perform zero-shot cross-lingual transfer in multiple-choice QA. We describe the experiments in the next Section.

5 Experiments

To the best of our knowledge, there exists no Spanish MC collection big enough to fine-tune a model. So, we have fine-tuned a model on English RACE and tested it on other languages.

In our experiments, we use a simple BERT model and an M-BERT. Each model has been fine-tuned over RACE for three epochs, as recommended by the developers. We have employed the well-known transformers library from huggingface [3], following the hyperparameters stated by the first BERT [4] base model result on RACE's leaderboard [5]. Additionally, all the source code is available in a GitHub repository [6]. Every model was trained on Google Cloud Platform with a Tesla T4 for three epochs.

The experiments performed with each model are:

1. Fine-tune the model on the RACE train collection, with both high and middle splits.

2. Measure performance [7] on the RACE test collection.

3. Measure performance on EE in: English, Spanish, Italian, French, Russian and German.

[3] https://huggingface.co/transformers/
[4] https://github.com/NoviScl/BERT-RACE
[5] http://www.qizhexie.com/data/RACE_leaderboard.html
[6] https://github.com/m0n0l0c0/race-experiments
[7] We use accuracy.
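A minimal sketch of this kind of multiple-choice fine-tuning setup is given below. It is not the authors' released code (which is in the repository cited above); model names, sequence length and the single training example are illustrative, and the sketch only demonstrates the input layout the transformers library expects for multiple-choice models.

```python
# Minimal sketch of multiple-choice fine-tuning with the transformers library.
# Illustration only; see the authors' GitHub repository for their actual code.
import torch
from transformers import BertTokenizer, BertForMultipleChoice

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertForMultipleChoice.from_pretrained("bert-base-multilingual-cased")

article = "..."                         # the reading passage
question = "..."                        # the question stem
options = ["opt A", "opt B", "opt C", "opt D"]
label = torch.tensor([0])               # index of the correct option

# One (passage, question + option) pair per candidate answer.
enc = tokenizer([article] * len(options),
                [question + " " + o for o in options],
                truncation=True, padding="max_length",
                max_length=320, return_tensors="pt")
# BertForMultipleChoice expects tensors of shape (batch, num_choices, seq_len).
inputs = {k: v.unsqueeze(0) for k, v in enc.items()}
out = model(**inputs, labels=label)
out.loss.backward()                     # an optimizer step would follow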

Dataset       BERT     MultiBERT   Random   Longest
RACE Mid      0.5265   0.6114      0.2500   0.3078
RACE High     0.4774   0.5031      0.2500   0.3059
RACE All      0.4917   0.5347      0.2500   0.3059
EE English    0.4921   0.4974      0.2500   0.2304
EE Spanish    0.3665   0.4503      0.2500   0.2932
EE Italian    0.2880   0.4293      0.2500   0.2775
EE French     0.3037   0.4346      0.2500   0.2565
EE Russian    0.2618   0.3403      0.2500   0.2723
EE German**   0.3708   0.4494      0.2500   0.2584

Table 3: Accuracy of each model and baseline (BERT, M-BERT, Random and Longest) over each dataset: RACE (every split) and Entrance Exams (averaged over every year, per language). **The German figure is a single result; there is data only for 2015

We have also run two baselines to establish a lower bound every model should surpass. The first baseline chooses every answer at random. Since we have four candidates per question, this baseline achieves an accuracy score of 0.25. The second, following the work of (Rogers et al., 2020), just yields the longest answer.

6 Results

Table 3 shows the results, according to accuracy, obtained with the conducted experiments. We list the results obtained on each dataset (RACE and Entrance Exams) by each employed model (BERT, M-BERT, and the Random and Longest baselines). In all cases, we report the results for the test split. For Entrance Exams we show the results for all years averaged together.

BERT scores similarly in RACE and Entrance Exams, though it obtains its best scores in RACE middle. Even so, it is outperformed by M-BERT in all cases. The latter performs better in RACE than in Entrance Exams, which are harder. Both models are above the baselines, excluding BERT's Entrance Exams Russian result, which is the worst by far.

Both models' scores visibly decrease as the education grade rises. RACE middle, which corresponds to middle school exams, the easiest of the three collections for humans, is the highest scored; RACE high follows and, in last place, Entrance Exams, which are tests for university entrance. This tendency matches that of humans: when difficulty increases, students score lower.

Among the Entrance Exams collections, the English version obtains the highest score for both models. Taking into account that both models were fine-tuned with a single task in English, this was expected. On the other hand, EE Russian was the lowest scored collection. Our intuition is that this is due to its use of a different alphabet.

No model had any previous clue of the Entrance Exams task; it was zero-shot transferred from RACE, which is only available in English. This means that, in the case of BERT, there are languages that it has never seen before, because it is monolingual. Furthermore, its performance worsens markedly when exposed to unseen languages, especially Russian and German. This is the expected behavior, since languages with very different semantics are not tokenized nor understood correctly by the model (Artetxe, Ruder, and Yogatama, 2019; Agerri et al., 2020). This is the case of Russian.

M-BERT scores above 0.4 in all cases but Russian. Additionally, it is very close to passing the exam (answering correctly at least 50% of the questions) for English.

Tables 4, 5 and 6 show the results of EE divided by years (from 2013 to 2015). We give results according to accuracy and the proportion of tests (a document with its corresponding questions) passed (an accuracy of at least 0.5). These results correspond to the two evaluation perspectives applied in EE and described in Section 3.2. We include results in each language for each model, the two baselines and the best performing system at EE (in each edition). There are results from previous systems for English in all editions and for French in 2014 and 2015. Thus, our work is setting the results for these collections in several languages.
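The two evaluation perspectives just described, plus the Longest baseline, can be made precise with the following sketch. It is our own illustration (the data layout is hypothetical): each test is the list of questions for one document, each question carrying its candidate answers and a gold index.

```python
# Sketch of the two EE evaluation perspectives (Section 3.2) and the
# Longest baseline. Data layout is hypothetical, for illustration only.

def longest_baseline(options):
    """Pick the candidate answer with the most characters."""
    return max(range(len(options)), key=lambda i: len(options[i]))

def accuracy(predictions, golds):
    return sum(p == g for p, g in zip(predictions, golds)) / len(golds)

def proportion_passed(tests, predict):
    """A test is passed when at least 50% of its questions are correct."""
    passed = 0
    for questions in tests:               # one list of questions per document
        preds = [predict(q["options"]) for q in questions]
        golds = [q["gold"] for q in questions]
        if accuracy(preds, golds) >= 0.5:
            passed += 1
    return passed / len(tests)
```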

Entrance Exams 2013
           BERT          MultiBERT     Random        Longest       NIIJ-3*
           Acc    Tests  Acc    Tests  Acc    Tests  Acc    Tests  Acc    Tests
English    0.43   0.22   0.41   0.44   0.25   0      0.22   0.00   0.35   0.33
Spanish    0.37   0.33   0.43   0.22   0.25   0      0.22   0.11   -      -
Italian    0.28   0.11   0.33   0.22   0.25   0      0.28   0.22   -      -
French     0.35   0.11   0.43   0.33   0.25   0      0.17   0.00   -      -
Russian    0.22   0.11   0.26   0.00   0.25   0      0.17   0.00   -      -

Table 4: Accuracy of each model and proportion of passed tests; the 2013 edition had 9 tests in total. Results from all models and baselines (BERT, M-BERT, Random and Longest) over every language in Entrance Exams 2013. The best previous work (*), reported in (Rodrigo et al., 2018), is from the National Institute of Informatics of Japan (Li et al., 2013); it originally presented results only for English

Entrance Exams 2014
           BERT          MultiBERT     Random        Longest       Synapse*
           Acc    Tests  Acc    Tests  Acc    Tests  Acc    Tests  Acc    Tests
English    0.50   0.58   0.52   0.67   0.25   0      0.30   0.33   0.45   0.58
Spanish    0.38   0.33   0.45   0.50   0.25   0      0.34   0.25   -      -
Italian    0.32   0.33   0.43   0.33   0.25   0      0.29   0.17   -      -
French     0.30   0.17   0.48   0.50   0.25   0      0.30   0.25   0.59   0.75
Russian    0.30   0.17   0.32   0.17   0.25   0      0.34   0.17   -      -

Table 5: Accuracy of each model and proportion of passed tests; the 2014 edition had 12 tests in total. Results from all models and baselines (BERT, M-BERT, Random and Longest) over every language in Entrance Exams 2014. The best previous work (*), reported in (Rodrigo et al., 2018), is from Synapse (Laurent et al., 2014); it originally presented results only for French and English

Entrance Exams 2015
           BERT          MultiBERT     Random        Longest       Synapse*
           Acc    Tests  Acc    Tests  Acc    Tests  Acc    Tests  Acc    Tests
English    0.52   0.63   0.53   0.63   0.25   0      0.19   0.21   0.58   0.84
Spanish    0.36   0.32   0.46   0.53   0.25   0      0.30   0.26   -      -
Italian    0.27   0.26   0.48   0.53   0.25   0      0.27   0.16   -      -
French     0.28   0.11   0.40   0.47   0.25   0      0.27   0.32   0.56   0.84
Russian    0.26   0.11   0.39   0.47   0.25   0      0.28   0.32   -      -
German**   0.37   0.42   0.45   0.58   0.25   0      0.26   0.16   -      -

Table 6: Accuracy of each model and proportion of passed tests; the 2015 edition had 19 tests in total. Results from all models and baselines (BERT, M-BERT, Random and Longest) over every language in Entrance Exams 2015. The best previous work (*), reported in (Rodrigo et al., 2018), is from Synapse (Laurent et al., 2015); it originally presented results only for French and English. **This is the only year with German data

BERT shows its best result in English, 2013, but M-BERT passes a higher proportion of tests (almost twice as many). This means that BERT finds the correct answer for more questions than M-BERT, but distributed across just a few documents, thus obtaining higher grades on them. However, M-BERT scores better in general, passing more than 44% of the tests.

The rest of the 2013 results are dominated by the M-BERT model, which states a new best result in English, outperforming the previous systems. In the case of the second campaign, M-BERT lands second on the results table for French (where the best result is obtained by (Laurent et al., 2014)), and surpasses previous results in English. The M-BERT model also sets the new results in Spanish and Italian (the best result for Russian is achieved by the Longest baseline). From Table 6, the best results in English and French go to Synapse (Laurent et al., 2015), who developed a complex system including background knowledge and trained over the previous EE datasets. Thus, it seems that simple pre-trained models, transferred over a different dataset, come close to those results but not close enough. For the rest of the languages, the best results come from the M-BERT model, which sets the state-of-the-art without requiring a dataset in those languages.

Overall, we can observe a better performance of M-BERT with respect to BERT, which is focused on the English language. In fact, according to our experiments, we can fine-tune M-BERT in English for MC using one dataset and transfer that knowledge to datasets in other languages.

7 Conclusions and Future Work

The results obtained show that both monolingual and multilingual models can be fine-tuned for a task and transferred to another task in the same language. Furthermore, multilingual models are also transferable to different languages.

Also, we obtain evidence that system performance is hampered by exam difficulty in the same way human grades are.

In this work we established the state-of-the-art results over the Entrance Exams task in four more languages.

We would like to continue pursuing methods to cope with low-resource languages. To do so, we will continue exploring how fine-tuned transformer bodies can be transferred to reuse knowledge about specific tasks, following the lead of (Artetxe, Ruder, and Yogatama, 2019; Otegi et al., 2020).

Acknowledgments

This work has been funded by the Spanish Research Agency under the CHIST-ERA LIHLITH project (PCIN-2017-085/AEI) and deepReading (RTI2018-096846-B-C21 / MCIU/AEI/FEDER, UE).

References

Agerri, R., I. S. Vicente, J. A. Campos, A. Barrena, X. Saralegi, A. Soroa, and E. Agirre. 2020. Give your Text Representation Models some Love: the Case for Basque.

Artetxe, M., S. Ruder, and D. Yogatama. 2019. On the Cross-lingual Transferability of Monolingual Representations.

Asai, A., A. Eriguchi, K. Hashimoto, and Y. Tsuruoka. 2018. Multilingual extractive reading comprehension by runtime machine translation. CoRR, abs/1809.03275.

Cañete, J., G. Chaperon, R. Fuentes, and J. Pérez. 2020. Spanish Pre-Trained BERT Model and Evaluation Data. To appear in PML4DC at ICLR 2020.

Devlin, J., M.-W. Chang, K. Lee, and K. Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171-4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Fader, A., L. Zettlemoyer, and O. Etzioni. 2013. Paraphrase-driven learning for open question answering. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1608-1618, Sofia, Bulgaria. Association for Computational Linguistics.

Hermann, K. M., T. Kocisky, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pages 1693-1701.

Hsu, T.-Y., C.-L. Liu, and H.-y. Lee. 2019. Zero-shot Reading Comprehension by Cross-lingual Transfer Learning with Multi-lingual Language Representation Model. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5933-5940, Hong Kong, China. Association for Computational Linguistics.

Kočiský, T., J. Schwarz, P. Blunsom, C. Dyer, K. M. Hermann, G. Melis, and E. Grefenstette. 2018. The NarrativeQA reading comprehension challenge. Transactions of the Association for Computational Linguistics, 6:317-328.

Lai, G., Q. Xie, H. Liu, Y. Yang, and E. Hovy. 2017. RACE: Large-scale ReAding Comprehension Dataset From Examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP 2017), pages 785-794.

Laurent, D., B. Chardon, S. Nègre, C. Pradel, and P. Séguéla. 2015. Reading comprehension at Entrance Exams 2015. In L. Cappellato, N. Ferro, G. J. F. Jones, and E. SanJuan, editors, Working Notes of CLEF 2015 - Conference and Labs of the Evaluation Forum, Toulouse, France, September 8-11, 2015, volume 1391 of CEUR Workshop Proceedings. CEUR-WS.org.

Laurent, D., B. Chardon, S. Nègre, and P. Séguéla. 2014. French Run of Synapse Développement at Entrance Exams 2014. In CLEF (Working Notes), pages 1415-1426.

Li, X., R. Tian, N. L. T. Nguyen, Y. Miyao, and A. Aizawa. 2013. Question Answering System for Entrance Exams in QA4MRE. In CLEF (Working Notes). Citeseer.

Martin, L., B. Muller, P. J. O. Suárez, Y. Dupont, L. Romary, É. V. de la Clergerie, D. Seddah, and B. Sagot. 2019. CamemBERT: a Tasty French Language Model.

Otegi, A., A. Agirre, J. A. Campos, A. Soroa, and E. Agirre. 2020. Conversational Question Answering in Low Resource Scenarios: A Dataset and Case Study for Basque. In 12th International Conference on Language Resources and Evaluation.

Rajpurkar, P., R. Jia, and P. Liang. 2018. Know What You Don't Know: Unanswerable Questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784-789, Melbourne, Australia. Association for Computational Linguistics.

Richardson, M., C. J. C. Burges, and E. Renshaw. 2013. MCTest: A Challenge Dataset for the Open-Domain Machine Comprehension of Text. Technical report.

Rodrigo, A., A. Peñas, Y. Miyao, and N. Kando. 2018. Do systems pass university entrance exams? Information Processing & Management, 54(4):564-575.

Rogers, A., O. Kovaleva, M. Downey, and A. Rumshisky. 2020. Getting closer to AI-complete question answering: A set of prerequisite real tasks.

Rogers, A., O. Kovaleva, and A. Rumshisky. 2020. A Primer in BERTology: What we know about how BERT works.

Trischler, A., T. Wang, X. Yuan, J. Harris, A. Sordoni, P. Bachman, and K. Suleman. 2017. NewsQA: A machine comprehension dataset. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pages 191-200, Vancouver, Canada. Association for Computational Linguistics.

Vaswani, A., N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems 30, pages 5998-6008. Curran Associates, Inc.

Wang, A., A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman. 2018. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353-355, Brussels, Belgium. Association for Computational Linguistics.

Yang, Z., Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le. 2019. XLNet: Generalized Autoregressive Pretraining for Language Understanding. In Advances in Neural Information Processing Systems 32, pages 5754-5764. Curran Associates, Inc.

Procesamiento del Lenguaje Natural, Revista nº 65, septiembre de 2020, pp. 45-52. Recibido 30-04-2020; revisado 05-06-2020; aceptado 05-06-2020

ContextMEL: Classifying Contextual Modifiers in Clinical Text

ContextMEL: un Clasificador de Modificadores Contextuales en Texto Clínico

Paula Chocrón, Álvaro Abella, Gabriel de Maeztu
IOMED
{paula.chocron,alvaro.abella,gabriel.maeztu}@iomed.es

Abstract: Taking advantage of electronic health records in clinical research requires the development of natural language processing tools to extract data from unstructured text in different languages. A key task is the detection of contextual modifiers, such as understanding whether a concept is negated or if it belongs to the past. We present ContextMEL, a method to build classifiers for contextual modifiers that is independent of the specific task and the language, allowing for a fast model development cycle. ContextMEL uses annotation by experts to build a curated dataset, and state-of-the-art deep learning architectures to train models with it. We discuss the application of ContextMEL for three modifiers, namely Negation, Temporality and Certainty, on Spanish and Catalan medical text. The metrics we obtain show our models are suitable for industrial use, outperforming commonly used rule-based approaches such as the NegEx algorithm.
Keywords: Clinical text, Temporality, Negation, Certainty, deep learning, annotation
Resumen: Las historias clínicas electrónicas pueden traer grandes avances en la investigación médica, pero requieren el desarrollo de herramientas para procesar texto no estructurado en diferentes idiomas. Una tarea clave es la detección de distintos modificadores contextuales, como el aspecto temporal de un concepto, o si está negado. En este trabajo presentamos ContextMEL, un método para construir clasificadores para modificadores contextuales que es independiente tanto de la tarea específica como del lenguaje, permitiendo un ciclo de desarrollo dinámico. ContextMEL usa anotaciones de expertos para crear un dataset curado, y las últimas tecnologías en aprendizaje profundo. En este artículo discutimos la aplicación de ContextMEL para tres modificadores contextuales (temporalidad, negación, y certeza) en texto médico en castellano y catalán. Los resultados obtenidos muestran que nuestros modelos pueden utilizarse en un entorno industrial, y que son más precisos que conocidos métodos basados en reglas, como el algoritmo NegEx.
Palabras clave: Texto clínico, clasificación, aprendizaje profundo, anotación

1 Introduction

The growing adoption of electronic health records (EHR) in hospitals provides a unique opportunity to analyze clinical text data automatically. Firstly, being able to efficiently and accurately extract data from these unstructured text corpora would greatly accelerate the analysis process. Secondly, it would make it easier to comply with the privacy policies that such sensitive text requires. Driven by these considerations, Natural Language Processing (NLP) applications for the clinical domain have been a very active field of research in the past decades.

The main goal of our NLP system is to find the units of observation (i.e. data points) required by clinical research in the EHR of the patients. Therefore, extracting data points from clinical text usually means, at its core, understanding whether a given written observation is true or false for each case. For example, a study may require the identification of all the cases of anemia, to later analyze pathologies that are commonly shared with that diagnosis. This requires, as a first step, a tool to identify concepts in a text, commonly known as a concept extractor. These tools standardly take advantage of some of the large and curated knowledge bases that exist for medical terminology, such as SNOMED CT [1], performing matches between their entities and the ones that are found in the text. These matches can either use syntactic distances or rules (Abacha and Zweigenbaum, 2011), or machine learning (Torii, Wagholikar, and Liu, 2011).

To construct an appropriate dataset, however, it is also essential to be able to identify contextual modifiers that modify the concepts in question. For example, consider the following text: "Previous episodes of Anemia, the patient's Hgb levels are now normal." Although a good concept extractor would find a reference to anemia, it is as important to determine that it refers to the past, and that it is therefore not an active pathology anymore. Another typical case in which contextual modification is pivotal is the identification of whether a concept is negated or not. In the next section we discuss different approaches which have been proposed to address these kinds of problems. These tools are, in general, language-specific, and designed to identify one specific contextual modifier (such as Negation). However, the field of clinical research is constantly changing, and it is very likely that there will be studies that need to handle modifiers that have never been taken into account before.

In this paper we present ContextMEL, a system that allows training models to perform contextual concept analysis, irrespective of the source language and the particular task. At its core, ContextMEL proceeds by annotating data and then training models with these annotations. Here, we propose to use established state-of-the-art deep learning models that are particularly well-suited to solve the problem of assigning labels to a concept in the context of a longer text. These models are trained and evaluated on three different context-related tasks, all applied to clinical data, using both Spanish and Catalan sources. Since our method and the deep learning models we use are domain- and language-independent, our method can be straightforwardly applied to other domains or languages. Our technique shows promising results for these tasks, and it outperforms one of the best-known methods for Negation detection.

2 Related Work

The problem of identifying contextual concept modifiers is very broad, and it has been researched extensively. We restrict ourselves to discussing approaches developed for medical text. Most of the existing approaches tackle the detection of a particular contextual modifier, and for a specific language. We discuss solutions that identify Negation, Certainty and Temporality.

The development of tools to determine contextual concept modifiers for medical text has evolved side by side with general progress in NLP techniques. Early approaches relied on syntactic and rule-based methods. For example, the NegEx algorithm (Chapman et al., 2001) applies a rule-based system to perform Negation identification in English. NegEx has been extended to include other modifiers in systems such as ConText (Harkema et al., 2009), which can also identify the experiencer (i.e. the patient or a member of her family) and a concept's temporal modifier. NegEx has also been extended to other languages. Relevant examples for Spanish are (Costumero et al., 2014; Stricker, Iacobacci, and Cotik, 2015) and NegEx-MES, which we will discuss later.

More recent techniques incorporate the use of machine learning methods for the problem of contextual modifier identification, such as Conditional Random Fields (Agarwal and Yu, 2010). The work in (Cruz Diaz et al., 2012) uses machine learning techniques to identify Negation and Speculation (Certainty) in Spanish medical text. Finally, the latest approaches rely on deep learning techniques. The work in (Dalloux, Claveau, and Grabar, 2019) uses Long Short-Term Memory (LSTM) networks to solve Negation and Certainty in French corpora, while (Fancellu, Lopez, and Webber, 2016) applies them to English. The work in (Tourille et al., 2017) uses them to detect temporal modifiers in English. Finally, the latest work on modifier detection for clinical text uses modern language models based on Transformers such as BERT. Examples are (Khandelwal and Sawant, 2019) for Negation in English, (García-Pablos, Perez, and Cuadros, 2020) for Negation in Spanish, and (Sergeeva et al., 2019), which also tackles Speculation.

[1] http://www.snomed.org/

3 Context-MEL: a General System for Contextual Classification

ContextMEL is a system to tackle the problem of contextual modifier analysis. This problem can be defined formally as follows: let t be a text, and c be a concept formed by a subset of one or more words in t. Given a set of labels L, assign a label from L to c in the context of t. For example, in the problem of identifying negations, the set of labels is L = {Positive, Negative}. Given t = "The patient does not have anemia." and the concept c = anemia, the objective of the task is to assign one of the two labels to c in t. In theory, concepts could be any n-gram. In practice, we will use (and therefore develop methods for) only those that are recognized as medical concepts by a knowledge base.

ContextMEL provides a general framework to build classification models that can be used to solve the problem of contextual modifier analysis for different tasks, independently of the language of the text or its domain. Concretely, the system consists of a first phase, during which it uses manual expert annotators to build a labeled dataset for training. In a second phase we use this dataset to train a classification model that can assign labels to concept-text pairs. We use well-known model architectures that were originally developed for the problem of Aspect-Based Sentiment Analysis, and we show that we can apply them without changes to a variety of other tasks.

The models that are obtained with ContextMEL can be applied directly in a concept extraction pipeline. First, texts are preprocessed and analyzed by a concept extractor which links expressions to entities in a knowledge base. Then, these concepts together with the original text are fed to the contextual models built with ContextMEL, which assign them a label. The methods that conform our concept extractor are outside the scope of this paper, and therefore are not discussed in depth here.

ContextMEL assumes the following resources are available: 1. A large corpus of unlabeled text. 2. A relatively basic system that finds concepts, or the entities that are of interest and would require contextual labels. 3. A team of expert annotators.

The first and the second items are normally available when working with clinical data. Unlabeled text abounds, and there are reliable, well-curated knowledge bases that can be useful to identify concepts. In our case, we used a simple NER system built on the UMLS vocabulary, with most of the terms coming from the Spanish SNOMED CT (see (Torii, Wagholikar, and Liu, 2011) and (Soldaini and Goharian, 2016) for examples of other medical concept extraction systems). Annotators, on the other hand, are normally expensive. However, we will show that a relatively small amount of annotated data is necessary.

Our system aims to be as generalizable as possible, providing a procedure that can be adapted to different domains and contextual modifiers that need to be identified. In our particular case, we used the system to build models to classify concepts in medical text written in Spanish and Catalan. In many cases one same text contains bits of both languages. We used the framework to build models for three different contextual modifiers:

Temporality. The goal of the Temporality task is to understand whether a concept belongs to the current episode, to the clinical history, or to the future plan of a patient. The set of labels for this modifier is {Antecedent, Plan, Current Episode}. For example, in "Patient with kidney transplant in 2010. Has mild back pain. I prescribe Ibuprofen every 8 hours.", the concept kidney transplant is Antecedent, back pain is Current Episode, and Ibuprofen is Plan.

Certainty. This task is about identifying whether a concept is certain, or if it refers to a hypothesis or conditional that could be true or not. The set of labels for this modifier is {Certain, Uncertain}. For example, in "If there is high fever, take ibuprofen", the concept fever is a hypothesis, while ibuprofen is conditional (it depends on the fever). Both are, therefore, uncertain. It could be argued that no concept is totally certain, since the doctor could be wrong. It is important to keep in mind that we are not trying to classify whether a concept is true or not, but rather whether the doctor considers it a fact. Certainty is sometimes referred to as speculation in the literature.

Negation. Negation is perhaps the simplest of the tasks. It aims to identify whether a concept is negated or not in a text. The set of labels for this modifier is {Positive, Negative}. For example, in the sentence "Patient with anemia had no liver abnormality.", anemia is positive and liver abnormality is negative. We have already discussed related work for this task in the introduction.
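To make the task definition above concrete, the following sketch (our own illustration, not part of the ContextMEL code base) encodes a text-concept pair and its label as a small data structure:

```python
# Illustrative formalization of the contextual-modifier task defined above;
# not part of the ContextMEL implementation.
from dataclasses import dataclass

@dataclass
class ModifierExample:
    text: str    # the context t
    start: int   # first token position of the concept c in t
    end: int     # last token position (inclusive) of c in t
    label: str   # one element of the label set L

# Negation instance from the running example, with L = {"Positive", "Negative"}
ex = ModifierExample(
    text="The patient does not have anemia .",
    start=5, end=5,   # the token "anemia"
    label="Negative",
)
```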

4 Annotation: Building a training dataset for contextual modifier analysis

The first step of ContextMEL consists of producing a high-quality dataset that can be used for training a machine learning model. While the resulting datasets are specific to a particular contextual modifier, language, and domain, we provide a general framework to obtain annotations.

Suppose we are trying to solve a contextual modifier analysis problem with a set of labels L. The annotation phase requires employing a team of domain expert annotators, who will assign labels from L to text-concept pairs. To do so, annotators have access to an interface where they can read a text with a highlighted concept. The highlighted concepts are obtained automatically from the text, using the automatic concept matching system we mentioned in the previous section. Next to the text there is a table where they can select labels from L (whether it is possible to select more than one label depends on the specific semantics of the contextual modifier being analyzed). Finally, annotators can choose to finish the annotation of the concept by pressing "OK" (meaning that they submit their labels), "NO" (meaning that there is something wrong with the concept) or "PASS" (meaning that they are unsure of the annotation and prefer to jump to another concept). Once a concept is annotated, another one from the same text is highlighted, until the annotator labels them all and jumps to the next text.

In our particular case, we employed a team of five medical practitioners. To make annotation more efficient, and since we were developing models for the three modifiers (Temporality, Certainty, Negation) in parallel, we merged their labels to create one unique annotation task. We only included labels that were not the default one for each of the modifiers: History, Plan, Uncertain and Negative. Since we were mixing independent modifiers, we made labels non-exclusive, and annotators could choose more than one. For example, they could mark the word ibuprofen in "take ibuprofen if there is pain" as Plan, Uncertain. Figure 1 shows an example of a concept annotation.

Figure 1: Labels for the annotator's task

The full annotation cycle lasted 3 months. We organized a meeting in person at the beginning and one at midterm, to explain the task and clarify possible doubts.

We gave each concept-text pair to three different annotators, varying the composition of the groups. We considered an annotation as valid if at least two of the three annotators agreed on the label (see the sketch at the end of this section).

4.1 A more efficient annotation strategy: Confirming Entities First

We rapidly noticed a problem in our approach: it relied too much on the automatic concept matcher, which was not perfect. In particular, our matcher had many false positives that were considered concepts although they should not be. This meant that annotators had to spend valuable time deciding how to label something that was not really a concept worth labeling. To tackle this problem, we set up a preliminary task that consisted of confirming whether a sequence of words was a medical concept or not. Annotators were very fast on this task, since they only had to choose between two labels. Later, we used these results on the main task, assigning to it only those concepts that had been verified as correct.
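The 2-out-of-3 agreement rule can be stated as the following sketch. It is our own illustration with a hypothetical data layout, simplified to one label per annotator (the actual annotations could carry several labels at once, as described above):

```python
# Sketch of the 2-out-of-3 agreement rule used to validate annotations;
# illustration only, simplified to a single label per annotator.
from collections import Counter

def valid_annotation(labels):
    """labels: the labels given by the three annotators for one
    concept-text pair. Returns the agreed label, or None when no
    label reaches two votes (the annotation is discarded)."""
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes >= 2 else None

assert valid_annotation(["Negative", "Negative", "Positive"]) == "Negative"
assert valid_annotation(["Plan", "Uncertain", "Negative"]) is None
```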

5 Training: building classifiers for contextual modifier analysis

Once we obtained the annotated dataset, we proceeded to train a model that would perform the labeling automatically. We propose to consider the problem of contextual modifier analysis as a generalization of aspect-based sentiment analysis (ABSA). ABSA is a type of sentiment analysis that takes into account the fact that a text can express different sentiments about different things. For example, the text "The movie has a very interesting plot, but I found the main actor bland" is not positive or negative per se, but rather positive about the plot and negative about the actor. ABSA can be defined as classifying the polarity of a concept c (which they call aspect) in a context t. This is equivalent to our problem when the set of labels L expresses polarity (for example Positive, Negative). The community working on solving ABSA has proposed different architectures to perform this kind of targeted classification; a survey can be found in (Schouten and Frasincar, 2015). For ContextMEL we compared the performance of an approach based on LSTMs and one based on BERT.

          Temp        Certainty   Negation
Train     6286/1000   1743/282    4931/689
Val       1459/255    384/84      1162/218
Test      261/79      246/90      173/75

Table 1: Size of the datasets, with the number of Spanish/Catalan examples in each one

Before discussing the models in the following subsections, let us comment on the datasets we used for training. Our data was naturally highly imbalanced (for example, the Temporality task was skewed towards the "Current Episode" label). To minimize the effect of this bias, we used balanced datasets that contained 65% of the prevalent class and 35% of the other one for the datasets with two labels (Certainty and Negation), and a 40%-20%-20% split for the Temporality dataset. Table 1 shows the dimensions of the train and validation datasets that we used for training, together with a test set that we will discuss in the next section. As we can see, the datasets are mixed Spanish and Catalan, although Catalan is notably underrepresented. Note that the datasets are not very large. For Certainty, for example, training with only 2025 examples (of which just 709 are uncertain) is enough to achieve the results we report next.

5.1 Contextual analysis via LSTM models

The first alternative we explored was to use a method proposed in (Tang et al., 2016). Their approach consists essentially in solving ABSA with two unidirectional LSTM networks, one modeling the part of the context to the left of the concept, and the other one the part to the right. The LSTM that models the right part receives the text reversed, so that the input to both networks ends with the concept. More formally, if the context (the full text) has n tokens and the concept to analyze spans positions i to i + t, the left LSTM receives as input text[0 : i + t] and the right LSTM receives reverse(text[i : n]). Using a concrete example, if we want to classify the concept anemia in the text "60 years old woman. History of anemia. Normal values now.", the inputs for the left and right LSTMs would be:

Left: 60 years old woman . History of anemia
Right: . now values Normal . anemia

These models are sometimes referred to as TD-LSTM (Target-Dependent LSTM).

We trained an LSTM with 1 layer for each of our context dimensions. To vectorize the input, we used a specialized embedding that we trained on our large corpora of unlabeled Spanish and Catalan texts. This embedding had 200 dimensions and was trained using word2vec (Mikolov et al., 2013). We trained the LSTMs for 15 epochs on each dataset. We were able to train these models on a machine with only one CPU and 8 GB of RAM.
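The TD-LSTM input construction just described can be written down directly. The sketch below is our own illustration of that construction (it only builds the two token sequences, not the networks themselves):

```python
# Sketch of the TD-LSTM input construction: the left LSTM reads
# text[0 : i + t] and the right LSTM reads reverse(text[i : n]),
# so that both sequences end with the concept. Illustration only.

def td_lstm_inputs(tokens, i, t):
    """tokens: the tokenized context; the concept starts at position i
    and spans t tokens."""
    left = tokens[: i + t]
    right = list(reversed(tokens[i:]))
    return left, right

tokens = "60 years old woman . History of anemia . Normal values now .".split()
left, right = td_lstm_inputs(tokens, tokens.index("anemia"), 1)
# left  -> [..., 'History', 'of', 'anemia']
# right -> ['.', 'now', 'values', 'Normal', '.', 'anemia']
```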
5.2 Contextual analysis via BERT

The second approach we investigated was a BERT-based methodology. Language models such as BERT (Devlin et al., 2018) are neural networks based on the Transformer architecture (Vaswani et al., 2017) which have been trained on a huge dataset for a general task such as guessing a masked word. This setup makes them very flexible, which has proved very beneficial for transfer learning. In other words, the knowledge in the language model can be reused to solve different specific tasks, needing only a small task-specific dataset.

BERT architectures have been used in different ways to solve ABSA (Zeng et al., 2019). In our case we used a very simple approach that takes advantage of the type of input expected by BERT. In addition to the masked word task that we already mentioned, BERT is also trained for the next sentence prediction task. In short, the task consists of deciding whether two sentences are consecutive or not. To implement this task, the model uses a special token [SEP], which indicates the separation between two sentences. If the input sentences are a and b, the model input is then [[CLS] a [SEP] b], where [CLS] is a special token used at the beginning of the text. We used this mechanism to give, as input to the model, both the text and the concept, using [SEP] to indicate their boundaries. That is, for a pair of text t and concept c, our input was [[CLS] t [SEP] c]. Using the same example as before, to classify anemia in "60 years old woman. History of anemia. Normal values now.", the input would be [[CLS] 60 years old woman . History of anemia . Normal values now . [SEP] anemia [SEP]].

Since our text is in Spanish and Catalan, we used the Multilingual (bert-base-multilingual-cased) version of BERT as a base, which we fine-tuned with the same datasets we used for the LSTM version [2]. The models required only 3 epochs to achieve their best performance. To train the BERT models, we used a machine with 1 GPU and 30 GB RAM.

[2] While there are Spanish-only versions of BERT, we are not aware of any Catalan-Spanish one, or even any Catalan-only version. Catalan is, however, included in the multilingual version of BERT.
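The [CLS] t [SEP] c [SEP] input described above matches the sentence-pair mode of the transformers tokenizer, as in the sketch below. This is an illustration under that assumption; the exact preprocessing in ContextMEL may differ.

```python
# Sketch of building the [CLS] t [SEP] c [SEP] input with the transformers
# tokenizer's sentence-pair mode. Illustration only.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")

text = "60 years old woman . History of anemia . Normal values now ."
concept = "anemia"
enc = tokenizer(text, concept, return_tensors="pt")
# Pair mode yields: [CLS] <text tokens> [SEP] <concept tokens> [SEP]
print(tokenizer.decode(enc["input_ids"][0]))
```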
6 Results

In this section we present and discuss the results we obtained using both techniques (LSTM- and BERT-based) for the three tasks we described before. To evaluate our models as accurately as possible, we built a golden test set for each task which was manually revised by us, using the annotators' data as a basis. For each label l (including a label Blank as a common one for Positive, Current Episode, Certain), we selected 100 examples that at least 2 annotators had marked with l, and 100 examples where only one annotator had marked it as such (which often consist of harder or edge cases). Then we labeled the examples ourselves, and used the results as a golden test dataset.

To ensure a fair evaluation framework, we used in the test set only notes that had not been seen at training time. That is, we not only made sure the concept-text pair was previously unseen, but we also used completely new texts. Table 2 shows the results for the three different tasks. As we can see, we obtain good accuracy levels, which could be used industrially. Results are particularly good for the Negation task, which is unsurprising since it is notably simpler than the other two. Using BERT models improves the accuracy significantly for Temporality and Certainty, which is expected since BERT is a very large model. When evaluating the results, it is important to keep in mind that training BERT requires a GPU and more memory, while the LSTMs can be trained on a much simpler computer; the size of the model would also impact the inference time. Notably, the BERT models have better results even though they use the very generic Multilingual model as a base, while the LSTM ones use a vector space specifically designed for the task. Investigating whether further training the base model leads to better results is part of our planned future research.

         Temp     Certainty   Negation
LSTM     81.82%   85.71%      95.70%
BERT     84.41%   88.10%      95.16%

Table 2: Accuracies for the three tasks for each model

In Table 3 we also show the precision and recall values for each class. As we can see, the Temporality model is particularly good at identifying future plans, which are normally very clear in the text, while the difference between Current Episode and Antecedent is sometimes more subtle. The Negation model is, as expected, better at identifying positives (the prevalent class). In Certainty, we observe the certain case is slightly better.

           Precision   Recall
Certain    85.06%      87.06%
Uncertain  86.42%      84.34%
Positive   96.67%      95.47%
Negative   92.86%      94.71%
Past       77.93%      78.03%
Present    77.89%      77.93%
Future     89%         88.80%

Table 3: Precision and recall per class for each task (LSTM)

We consider the possibility of building models specifically designed for a particular language, task, and domain to be one of the main advantages of our system. However, this makes comparing our techniques against other tools challenging. To the best of our knowledge, there are no available tools to perform temporal analysis of clinical text in Spanish and Catalan. The task of identifying negations has been the most prolific of the three we worked on. In what follows we show how our model compares with the NegEx-MES [3] system, which applies the NegEx technique to Spanish medical texts. NegEx is, at this moment, the best-known method to identify negated concepts, and its implementations are commonly applied industrially.

Since NegEx-MES was developed for Spanish texts, we compared the systems only on that portion of our test set (73.59%, according to Table 1). These results can be seen in Table 4. As we can see, our system improves significantly on the accuracy of NegEx-MES.

NegEx    ContextMEL
83.14%   94.22%

Table 4: Results for the Spanish dataset

7 Conclusions and Future Work

We described ContextMEL, a system to build, in a general way, models to detect contextual modifiers. The framework includes an efficient way of annotating data by experts and state-of-the-art model architectures to harness the knowledge in the resulting dataset. Importantly, the development of the system is not bound to a particular language or a particular task, allowing for a fast, simple, and organized model development cycle.

We applied ContextMEL to the identification of three context modifiers: Temporality, Negation, and Certainty. We obtained promising results that are already, after only one development iteration, suitable for industrial use. Moreover, the results for Negation outperformed the ones obtained with a commonly used rule-based tool. In summary, we think ContextMEL harnesses the power of the latest deep learning architectures to make the best possible use of a dataset that is carefully built, but not huge. For example, for Certainty we obtained an accuracy of 88.1% with only 2025 (and 709 uncertain) examples.

We have two clear directions of development and research for this project. First, we plan to study whether using a BERT model specially trained on Spanish and Catalan clinical text as a base for our fine-tuning would improve the results we already have. This model does not exist yet, so we plan to build it ourselves with our corpora of unlabeled data. If results are better, it would mean another step towards building a general framework to make fine-tuning for specific tasks even more lightweight.

As another line of future work we plan to work on a Quality Assessment-based development cycle to refine models using annotations. Our objective is to define a workflow that allows us to build the training dataset iteratively and on demand. This will be achieved by systematically evaluating the models that are obtained with each iteration until they reach an industry-suitable accuracy threshold. This avoids annotating unnecessarily large amounts of data, which is yet another way of making model development faster.

[3] https://github.com/PlanTL-SANIDAD/NegEx-MES

References

Abacha, A. B. and P. Zweigenbaum. 2011. Automatic extraction of semantic relations between medical entities: a rule based approach. Journal of biomedical semantics, 2(S5):S4.

Agarwal, S. and H. Yu. 2010. Biomedical negation scope detection with conditional random fields. Journal of the American Medical Informatics Association, 17(6):696-701.

Chapman, W., W. Bridewell, P. Hanbury, G. Cooper, and B. Buchanan. 2001. A simple algorithm for identifying negated findings and diseases in discharge summaries. Journal of Biomedical Informatics, 34:301-310.

Costumero, R., F. Lopez, C. Gonzalo-Martín, M. Millan, and E. Menasalvas. 2014. An approach to detect negation on medical documents in Spanish. In Brain Informatics and Health, pages 366-375. Springer International Publishing.

Cruz Díaz, N., M. López, J. Mata Vázquez, and V. Pachón. 2012. A machine-learning approach to negation and speculation detection in clinical texts. Journal of the American Society for Information Science and Technology, 63:1398-1410.

Dalloux, C., V. Claveau, and N. Grabar. 2019. Speculation and negation detection in French biomedical corpora. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), pages 223-232, Varna, Bulgaria. INCOMA Ltd.

Devlin, J., M. Chang, K. Lee, and K. Toutanova. 2018. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.

Fancellu, F., A. Lopez, and B. Webber. 2016. Neural networks for negation scope detection. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 495-504.

García-Pablos, A., N. Perez, and M. Cuadros. 2020. Sensitive data detection and classification in Spanish clinical text: Experiments with BERT. arXiv preprint arXiv:2003.03106.

Harkema, H., J. N. Dowling, T. Thornblade, and W. W. Chapman. 2009. ConText: An algorithm for determining negation, experiencer, and temporal status from clinical reports. Journal of Biomedical Informatics, 42(5):839-851.

Khandelwal, A. and S. Sawant. 2019. NegBERT: A transfer learning approach for negation detection and scope resolution.

Mikolov, T., K. Chen, G. Corrado, and J. Dean. 2013. Efficient estimation of word representations in vector space.

Schouten, K. and F. Frasincar. 2015. Survey on aspect-level sentiment analysis. IEEE Transactions on Knowledge and Data Engineering, 28:1-1.

Sergeeva, E., H. Zhu, A. Tahmasebi, and P. Szolovits. 2019. Neural token representations and negation and speculation scope detection in biomedical and general domain text. In Proceedings of the Tenth International Workshop on Health Text Mining and Information Analysis (LOUHI 2019), pages 178-187.

Soldaini, L. and N. Goharian. 2016. QuickUMLS: a fast, unsupervised approach for medical concept extraction. In MedIR Workshop, SIGIR, pages 1-4.

Stricker, V., I. Iacobacci, and V. Cotik. 2015. Negated findings detection in radiology reports in Spanish: an adaptation of NegEx to Spanish.

Tang, D., B. Qin, X. Feng, and T. Liu. 2016. Effective LSTMs for target-dependent sentiment classification. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 3298-3307, Osaka, Japan. The COLING 2016 Organizing Committee.

Torii, M., K. Wagholikar, and H. Liu. 2011. Using machine learning for concept extraction on clinical documents from multiple data sources. Journal of the American Medical Informatics Association, 18(5):580-587.

Tourille, J., O. Ferret, A. Neveol, and X. Tannier. 2017. Neural architecture for temporal relation extraction: a bi-LSTM approach for detecting narrative containers. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 224-230.

Vaswani, A., N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. 2017. Attention is all you need.

Zeng, B., H. Yang, R. Xu, W. Zhou, and X. Han. 2019. LCF: A local context focus mechanism for aspect-based sentiment classification. Applied Sciences, 9:3389.

Procesamiento del Lenguaje Natural, Revista nº 65, septiembre de 2020, pp. 53-58. Recibido 19-03-2020; revisado 09-06-2020; aceptado 09-06-2020

Limitations of Neural Networks-based NER for Resume Data Extraction

Limitaciones en Reconocimiento de Entidades Nombradas (REN) basado en Redes Neuronales para la extracción de datos de Curriculum Vitae

Juan F. Pinzon, Stanislav Krainikovsky and Roman Samarev
dotin Inc.
California, USA
[email protected]

Abstract: We present research on the capabilities of neural network-based NER models, their quality, and their limitations in resolving entities of different complexity (emails, names, skills, etc.). We show that quality depends on the entity type and its complexity, and we estimate the "ceilings" that model quality can reach given a proper implementation and a well-labelled dataset.
Keywords: NER, Neural Networks, Curriculum Vitae, Resumes, NLP, Data Extraction

Resumen: Proporcionamos una investigación sobre las capacidades de los modelos REN basados en redes neuronales, su calidad y sus limitaciones para resolver entidades de diferentes tipos de dificultad (correos electrónicos, nombres, habilidades, etc.). Se ha demostrado que la calidad depende del tipo y la complejidad de la entidad, y se estiman los límites que la calidad del modelo puede alcanzar en el caso de una realización adecuada y un corpus bien etiquetado.
Palabras clave: REN, Redes Neuronales, Curriculum Vitae, PLN, Extracción de Datos

1 Introduction

The task of parsing information from resumes is of great importance and relevance to the NLP and computational linguistics communities. The Human Resources industry and recruiting companies stand to benefit greatly from its successful outcomes, not only by reducing the time and cost of processing CVs and their information, but also by producing better-quality matches between job postings and candidates. The impact and positive outcomes that these solutions can provide have been extensively discussed in research papers such as "The Impact of Semantic Web Technologies on Job Recruitment Processes" (Bizer et al., 2005).

Resume-parsing solutions usually rely on complex rules and statistical algorithms to correctly capture the desired information from resumes. There are many variations in writing style, vocabulary and syntax and, to make things worse, today's increasingly globalized world means it is also important to take into account cultural differences that affect, among other things, style and word choice.

We intend to analyze whether recent improvements in machine learning and AI can obtain better results for this task than rule-based approaches, which use a predefined set of rules to extract the content.

Our main goal is to produce a machine learning model able to extract the main entities, such as name, email, education/university, companies worked at, skills, etc., from semi-structured and unstructured resumes while obtaining good quality and performance. We believe that recent advances in NLP can help achieve this with smaller, but properly annotated, datasets, while relying as little as possible on pre- or post-processing, in contrast to the models presented in the "Resume Parser: Semi-structured Chinese Document Analysis" (Zhang et al., 2009) and "Study of Information Extraction in Resume" (Nguyen, Pham, and Vu, 2018) studies.

2 Methods

For this task, we used the named entity recognition (NER) approach, implemented with the FlairNLP framework (Akbik et al., 2019). This framework allowed us to easily implement the current state-of-the-art NER model, which uses the LSTM variant of bidirectional recurrent neural networks (BiLSTMs) and a conditional random field (CRF) decoding layer; the architecture is depicted in Figure 1.

[Figure 1: Neural Network NER Model Architecture]

The biggest advantage of this framework is being able to leverage the power of Contextual String Embeddings: these embeddings capture hidden syntactic-semantic information, going far beyond standard word embeddings. Their distinct properties are: "(a) they are trained without any explicit notion of words and thus fundamentally model words as sequences of characters, and (b) they are contextualized by their surrounding text, meaning that the same word will have different embeddings depending on its contextual use" (Akbik, Blythe, and Vollgraf, 2018). Additionally, we used the stacked-embeddings technique, which combines different types of embedding models by concatenating the individual embedding vectors to generate the final word vectors. Adding classic word embeddings has proven beneficial for enhancing latent word-level semantics; for our task, we stacked FastText embeddings (Mikolov et al., 2018), pre-trained over web crawls, with the Flair contextual string embeddings, which are pre-trained on a one-billion-word corpus.
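As an illustration of the stacking just described, a minimal sketch with the Flair library is shown below. The embedding identifiers ('en-crawl', 'news-forward', 'news-backward') are our assumption of the configuration the paper refers to, not a reproduction of the authors' code:

```python
from flair.data import Sentence
from flair.embeddings import WordEmbeddings, FlairEmbeddings, StackedEmbeddings

# Stack FastText word embeddings (trained over web crawls) with the
# forward/backward Flair contextual string embeddings; the per-token
# vectors are concatenated to form the final word vectors.
stacked_embeddings = StackedEmbeddings([
    WordEmbeddings('en-crawl'),       # classic FastText embeddings
    FlairEmbeddings('news-forward'),  # contextual, character-level LM
    FlairEmbeddings('news-backward'),
])

sentence = Sentence('Senior Python developer, graduated in 2008.')
stacked_embeddings.embed(sentence)
for token in sentence:
    print(token.text, token.embedding.shape)
```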


2.1 Data

Training the aforementioned NER model required obtaining considerable amounts of annotated resumes. Two main sources of resume datasets were found. The first contained 220 annotated resumes from Indeed (an employment website and search engine). These CVs are in semi-structured form and carry annotations for extracting 10 entity categories (Name, College Name, Degree, Graduation Year, Years of Experience, Companies worked at, Designation, Skills, Location, Email Address); they were annotated collaboratively by the DataTurks community with the DataTurks online annotation tool (Trilldata-Technologies, 2018), and the dataset is available in their repository (Narayanan and DataTurks, 2018). The second source was a Kaggle dataset (Palan, 2019) containing 1,200 non-annotated, unstructured resumes in a CSV file.

After analysing both sources, we discovered that the 220 annotated CVs from DataTurks were very poorly annotated, generating excessive noise in our initial NER training. For that reason, the annotations were removed and both sources were merged and re-annotated by us using the Doccano open-source text annotation tool (Nakayama et al., 2018), initially with the same 10 entities used by DataTurks (Trilldata-Technologies, 2018).

After training several NER models, it became evident that the Skills entity was too broad and general and hindered training. The Skills entity was therefore separated into Job-Specific Skills, Soft Skills and Tech Tools; the models were retrained and compared against the 10-entity annotations on the same resumes, improving the results.

A total of 507 resumes were annotated for each annotation scheme (10 and 12 entities), and a set-aside test set of 50 CVs (also with 10- and 12-entity annotations) was used for testing the best model. Note that, due to computational limitations, resumes with more than 2,000 tokens were discarded before training. The corpus used to obtain the highest score had the following composition of tokens and tags; see Table 1.

                        TRAIN      DEV     TEST
# of Documents            456       51       50
# of Tokens per Tag:
  No-tag              254,522   28,923   26,189
  Name                  1,479      124      170
  Email Address         2,010      227      222
  Designation           6,866      670      613
  Job Specific Skills  10,434    1,232    1,108
  Tech Tools            7,393      992      712
  Companies worked at   5,412      524      494
  Location              5,872      698      705
  Years of Experience   6,203      696      641
  College Name          3,440      345      311
  Degree                4,400      410      437
  Graduation Year       1,007       93      105
  Soft Skills           2,216      268      308

Table 1: Annotated corpus description

                      10 Annot. Ent.   12 Annot. Ent.
                          F1 (%)           F1 (%)
Micro Avg.                 72.1             78
Graduation Year            92.2             87.2
Name                       89.1             88.9
Email                      97.3             98.4
Location                   87.9             89
Degree                     81.5             85.5
Years Exp.                 88               90.9
Designation                77.4             81
College Name               85.3             87.2
Company                    67.6             76.1
Skills                     56.1             -
Job Specific Skills        -                58.4
Soft Skills                -                63.8
Tech Tools                 -                73.8

Table 2: 10 & 12 annotated entities comparison

2.2 Cross-validated Data Quantity Analysis

A 5-fold cross-validation analysis was performed with different amounts of resumes in order to understand the behaviour and performance of the models, both as a whole (micro average) and per entity.

A second purpose was to evaluate whether obtaining new annotated data would yield large enough improvements in the results. The amounts of CVs were 50, 100, 200, 300, 400 and 500, taken from the train and development sets of annotated CVs presented in Table 1.

2.3 Best Model

The best model performance was achieved using the corpus described in Table 1 (with the 12 annotated entities); the following hyperparameters and embeddings were provided to the Flair sequence tagger (NER) model:

- Embedding stack:
  - FastText word embeddings (Mikolov et al., 2018).
  - Flair contextual string embeddings ("news-forward" and "news-backward") (Akbik, Blythe, and Vollgraf, 2018).
- Initial learning rate: 0.5
- Dropout: p = 0.12
- RNN layers: 2 BiLSTM layers
- Hidden size: 128
- Anneal factor: 0.5
- Patience: 3
- Use CRF: True

2.4 Evaluation Metrics

For the evaluation of the NER models, the metrics presented in the "CoNLL-2003 shared task: language-independent Named Entity Recognition" (Tjong Kim Sang and De Meulder, 2003) were used. These are exact-match precision, recall and F1 score, computed from true positives (TP), false positives (FP) and false negatives (FN):

    Precision = TP / (TP + FP)    (1)
    Recall    = TP / (TP + FN)    (2)
    F1        = 2PR / (P + R)     (3)

Mainly the F1 metric (3) will be reported, since it is the harmonic mean of precision and recall.
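Assuming a Flair 0.4-style API and a hypothetical corpus path, the configuration listed in Section 2.3 translates into roughly the following training script; this is a sketch, not the authors' actual code:

```python
from flair.datasets import ColumnCorpus
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# Hypothetical folder with train/dev/test files in CoNLL column format,
# annotated with the 12 entity labels.
corpus = ColumnCorpus('resumes/', {0: 'text', 1: 'ner'})
tag_dictionary = corpus.make_tag_dictionary(tag_type='ner')

tagger = SequenceTagger(
    hidden_size=128,                 # hidden size of the BiLSTM
    embeddings=stacked_embeddings,   # the stack from Section 2
    tag_dictionary=tag_dictionary,
    tag_type='ner',
    use_crf=True,                    # CRF decoding layer
    rnn_layers=2,                    # 2 BiLSTM layers
    dropout=0.12,
)

trainer = ModelTrainer(tagger, corpus)
trainer.train(
    'models/resume-ner',
    learning_rate=0.5,
    anneal_factor=0.5,  # halve the learning rate when the dev score stalls
    patience=3,
)
```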

3 Results

3.1 10 & 12 Entities Comparison

The results presented in Table 2 demonstrate the success of dividing the Skills entity into three separate entities (Job-Specific Skills, Soft Skills and Tech Tools), which helped the model overall. The difficulty and implicit nature of Job-Specific Skills is thereby localized and contained, allowing the "easier" entities, Soft Skills and Tech Tools, to be trained more precisely and thus improving the model as a whole.

[Figure 2: Micro average F1 vs. data quantity]

[Figure 3: F1 Score vs. CV quantity for Email, Job-Specific Skills]

[Figure 4: F1 Score vs. CV quantity for Soft Skills & Tech Tools entities]

                     F1 (%)          Recall (%)   Precision (%)
Micro Avg.           78.71 ±0.48       77.69         79.75
College Name         85.72 ±1.51       87.58         83.93
Company              74.21 ±2.38       78.8          70.13
Degree               85.65 ±1.53       81.86         89.8
Designation          84.08 ±1.74       91.21         77.99
Email                98.5  ±1.64       97.04        100
Graduation Year      92.73 ±2.98       92.73         92.73
Location             89.76 ±0.84       91.05         88.51
Name                 89.43 ±2.28       90.16         88.71
Years Exp.           90.88 ±1.17       93.13         88.74
Job Spec. Skills     61.16 ±4.73       58.62         63.93
Soft Skills          65.27 ±2.03       60.56         70.78
Tech Tools           74.34 ±3.30       79.48         69.83

Table 3: Best model results: F1 score, precision and recall
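For reference, the exact-match scheme of Section 2.4 reduces to pooling TP/FP/FN counts; the small helper below illustrates the computation (ours, not the authors' evaluation code):

```python
def prf1(tp: int, fp: int, fn: int):
    """Exact-match precision, recall and F1 (CoNLL-2003 style)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def micro_f1(counts):
    """Micro average: pool the per-entity (tp, fp, fn) counts first."""
    tp, fp, fn = (sum(c[i] for c in counts) for i in range(3))
    return prf1(tp, fp, fn)[2]
```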

3.2 Cross-validated Data Quantity Results

Figures 2, 3 and 4 present a clear trend as more data is added for training, both for the micro average (Figure 2) and for individual entities (Figures 3 and 4). As the data quantity increases, the variability of the obtained F1 scores decreases. The performance of the model increases rapidly until it reaches a "ceiling", plateauing after 300 CVs. Furthermore, Figure 2 illustrates the inverse correlation between model variability and training data quantity.

3.3 Best Model Results

The best model scores obtained are presented in Table 3.

4 Discussion

Figures 2, 3 and 4 show that the model reached a plateau in its scores with respect to data quantity. Even if the labelled data were doubled, the increase in scores would be very limited; it might help reduce statistical variability across experiments, but without much gain in F1. Given this, we consider it not worth investing the time and resources in trying to increase the amount of well-labelled data.

It is important to note that entity "nature", or complexity, is not specific to the Job-Specific Skills entity alone. Figure 3 and Table 3 illustrate how varied the scores are among entities, with different "ceilings" among them: there is a 37.3% difference between the highest and lowest entity F1 scores (in the best-model scores). 25% of the entities (3) show a standard deviation greater than 2.5%, with Job-Specific Skills having the highest value (4.73%); 58.3% (7 entities) have a standard deviation between 1.5% and 2.5%; and 25% (3 entities) show a standard deviation lower than 1.5%.

The uneven results obtained across the different entities indicate that a neural network NER model alone is not enough to accomplish this task with results satisfactory for production deployment. A large amount of pre- and post-processing will be required to identify missed entities and resolve ambiguity among the predictions, as presented in the previous studies by Zhang et al. (2009) and Nguyen, Pham, and Vu (2018).

It can be observed that entities that are easily identifiable and are presented with the same patterns across different CVs, such as Email Address, Graduation Year, Years of Experience and Location, obtain higher scores. Email addresses will always have an "@" sign in the middle and will end with ".com", ".edu" or similar. Graduation Year and Years of Experience will always be a pair of start and finish dates or a duration period (e.g. 2004 - 2008, 4 Years, 9 Months). These easily identifiable, repeated patterns are understood by the neural network-based NER; therefore, good results are obtained for these entities.
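The kind of rule-based post-processing this points to can be sketched with a couple of regular expressions; the patterns and labels below are illustrative, not the authors' implementation:

```python
import re

EMAIL = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')
YEAR_RANGE = re.compile(r'\b(?:19|20)\d{2}\s*-\s*(?:19|20)\d{2}\b')

def backfill_entities(text, predicted):
    """Add pattern-bound entities (emails, year ranges) that the neural
    tagger may have missed; `predicted` is a set of (label, span) pairs."""
    found = set(predicted)
    found.update(('Email Address', m.group()) for m in EMAIL.finditer(text))
    found.update(('Graduation Year', m.group()) for m in YEAR_RANGE.finditer(text))
    return found
```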
On the other hand, entities that do not show any repeated pattern, or that are unique to a very small set of jobs and/or CVs, such as Job-Specific Skills, Soft Skills and Company, expose how this type of model fails to understand and generalize, owing to the high variability these entities pose.

Table 3 shows how the three skills entities exhibit high variation in their scores; this too is due to the "nature" of each skill type, with Job-Specific Skills clearly the most difficult for this type of model. This is caused by the large number of jobs found in a diverse corpus of CVs: each different job has its own set of skills, making it extremely difficult for the model to generalize. A solution would be to train separate models for specific domains, using domain-specific training corpora. By contrast, the results show that the Tech Tools entity has a very different "nature": tech tools tend to be repeated across CVs even when the job category or position differs greatly. Common tech tools like Microsoft Word, Excel, PowerPoint, Microsoft Windows... will be found more frequently, helping the model obtain a higher percentage of total relevant results correctly classified (recall) compared with the other skills entities.

33% of the entities obtain F1 scores below 75%, and these particular entities are critical for accurate data extraction from resumes: Company, Job-Specific Skills, Soft Skills and Tech Tools. Without at least a 90% F1 score for these entities, in our opinion the results cannot be considered successful for a standalone solution. We take 90% as the success criterion based on the scores obtained by the best current commercial CV parser, Rapidparser (RapidParser, 2020). This CV parser, among other commercial and open-source parsers, was compared by Neumer (2018) in his Master's thesis, where Rapidparser obtained F1 scores above 94% for the start-date, end-date, work-description and skills entities (Neumer, 2018).

NLP models have advanced greatly in recent years but still fall short when used on their own, especially for this task, given that the goal is to build a system that can successfully process unstructured and semi-structured CVs.

5 Conclusion

Our results demonstrate that advances in the NLP field, such as Contextual String Embeddings (Akbik, Blythe, and Vollgraf, 2018) and state-of-the-art NER architectures, fail on their own to obtain outcomes satisfactory enough to be considered useful for HR and recruiting industry applications. The results obtained show high variance among the different entities to be extracted. Good quality is obtained when entities have low complexity and are explicitly presented, such as Email, Name and Location; but for the more critical and complex entities, such as Job-Specific Skills and Soft Skills, this type of model alone cannot cope with the ambiguity and implicitness of the information in unstructured resumes.

In order to meet our sponsor's needs for this task, the model will be improved and enhanced with different approaches. More classical approaches, such as those discussed in the article "Applying named entity recognition and co-reference resolution for segmenting English texts" (Fragkou, 2017), combined with other rule-based techniques, will be tested to resolve ambiguity in the more complex entities while also trying to catch the entities missed by the model.

As a positive outcome of this project, a diverse dataset has been produced. It contains 550 CVs, 25% semi-structured and 75% unstructured, from a wide variety of industries, paired with carefully produced annotations. This provides a unique, good-quality corpus for this task. The corpus can be found in the following GitHub repository: https://github.com/dotin-inc/resume-dataset-NER-annotations.


References

Akbik, A., T. Bergmann, D. Blythe, K. Rasul, S. Schweter, and R. Vollgraf. 2019. FLAIR: An easy-to-use framework for state-of-the-art NLP. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 54-59, Minneapolis, Minnesota, June. Association for Computational Linguistics.

Akbik, A., D. Blythe, and R. Vollgraf. 2018. Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1638-1649, Santa Fe, New Mexico, USA, August. Association for Computational Linguistics.

Bizer, C., R. Heese, M. Mochol, R. Oldakowski, R. Tolksdorf, and R. Eckstein. 2005. The impact of semantic web technologies on job recruitment processes. In O. K. Ferstl, E. J. Sinz, S. Eckert, and T. Isselhorst, editors, Wirtschaftsinformatik 2005, pages 1367-1381, Heidelberg. Physica-Verlag HD.

Fragkou, P. 2017. Applying named entity recognition and co-reference resolution for segmenting english texts. Progress in Artificial Intelligence, 6, 05.

Mikolov, T., E. Grave, P. Bojanowski, C. Puhrsch, and A. Joulin. 2018. Advances in pre-training distributed word representations. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).

Nakayama, H., T. Kubo, J. Kamura, Y. Taniguchi, and X. Liang. 2018. doccano: Text annotation tool for human. Software available from https://github.com/doccano/doccano.

Narayanan, A. and DataTurks. 2018. traindata.json, Jun. https://github.com/DataTurks/Entity-Recognition-In-Resumes-SpaCy/blob/master/traindata.json.

Neumer, T. 2018. Efficient natural language processing for automated recruiting on the example of a software engineering talent-pool. Master's thesis, Technische Universität München.

Nguyen, V. V., V. L. Pham, and N. S. Vu. 2018. Study of information extraction in resume. Technical report, VNU-UET Repository.

Palan, M. 2019. resume dataset.csv. Kaggle, Mar. https://www.kaggle.com/maitrip/resumes.

RapidParser. 2020. CV parsing with RapidParser: Lightning-fast! www.rapidparser.com.

Tjong Kim Sang, E. F. and F. De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 142-147.

Trilldata-Technologies. 2018. DataTurks: best online annotation tool to build POS, NER, NLP datasets. https://github.com/DataTurks/Entity-Recognition-In-Resumes-SpaCy.

Zhang, C., M. Wu, C.-G. Li, B. Xiao, and Z. Lin. 2009. Resume parser: Semi-structured chinese document analysis. pages 12-16, 01.

Procesamiento del Lenguaje Natural, Revista nº 65, septiembre de 2020, pp. 59-66. Received 17-03-2020; revised 28-05-2020; accepted 28-05-2020.

Minería de argumentación en el Referéndum del 1 de Octubre de 2017

Argument Mining in the October 1, 2017 Referendum

Marcos Esteve, Francisco Casacuberta, Paolo Rosso
Universitat Politècnica de València
[email protected], {fcn, prosso}@prhlt.upv.es

Resumen: La minería de argumentación permite, mediante herramientas software, obtener cuáles son los argumentos que expresan los autores en un determinado texto. En este artículo se pretende realizar un análisis de la argumentación expresada por los usuarios en Twitter en relación al referéndum del 1 de octubre de 2017. Se utilizará para ello el dataset MultiStanceCat proporcionado en la tarea organizada en el IberEval 2018. Dado que las herramientas de minería de argumentación trabajan en su mayoría en inglés, será necesario construir un sistema de traducción neuronal con postedición que permita realizar una traducción de los tweets del español y el catalán al inglés. Los resultados al realizar la minería de argumentación sobre los tweets traducidos han demostrado obtener un porcentaje muy reducido de argumentación en todas las comunidades.
Palabras clave: Minería de argumentación, traducción automática, 1Oct2017

Abstract: Argument mining makes it possible, through software tools, to identify the arguments expressed by the authors of a given text. This article analyses the arguments expressed by users on Twitter in relation to the referendum of October 1, 2017, using the MultiStanceCat dataset provided in the shared task organized at IberEval 2018. Since argument mining tools mostly work in English, it was necessary to build a neural translation system with post-editing to translate the tweets from Spanish and Catalan into English. Running argument mining on the translated tweets yielded a very small percentage of argumentation in all the communities.
Keywords: Argument mining, machine translation, 1Oct2017

1 Introduction

One of the main problems in today's social networks, aggravated when the topics are controversial, as with the referendum on the independence of Catalonia, Brexit or the US elections, is the high polarization among their different communities (Lai, 2019; Lai et al., 2019). Community polarization implies intense communication among the users of the same community, while the messages exchanged with other communities tend to be toxic, with hate standing out. Communities of this kind exhibit behaviours characteristic of high polarization. On the one hand, within-community communication shows the isolation phenomenon, whereby users only perceive information belonging to their own community and thus see a single reality. Related to this, the echo-chamber phenomenon is also observed, whereby users' beliefs are reinforced as a result of their isolation.

The main objective of this article is to determine whether there is argumentation in the tweets of the MultiStanceCat task (Taulé et al., 2018) presented at IberEval 2018. This dataset consists of a set of tweets grouped according to their stance on the referendum of October 1, 2017: in favour, against, or neutral.

To carry out the analysis, argument mining tools are needed that can analyse these tweets and extract the arguments expressed by the users of the different communities. Since no argument mining technology exists for tweets in Spanish or Catalan, we opted for the tool developed by Iryna Gurevych's group at the Technical University of Darmstadt.¹ This tool only works in English or German, so the tweets must be translated into one of those languages. To this end, we propose a post-editing translation system based on neural networks that eases the human operator's task of translating the tweets.

The following sections review the state of the art of the different technologies and introduce the datasets used. We then describe the systems developed and the experiments carried out for neural translation. Finally, we detail the argument mining experiments run on the translated tweets.

¹ http://www.argumentsearch.com/

2 State of the art

2.1 Argument mining

Argumentation allows a writer to convince an audience of certain ideas. In this area, an argument is a structure consisting of several components which, as the authors of (Stab and Gurevych, 2014) note, is grounded on a claim that is supported or attacked by at least one premise. The claim is thus the central part of the argument and is followed by the premises, which try to support it and thereby persuade the reader. For example, as the authors explain and as can be seen in Figure 1, claim 1 is the central part of the argument and is followed by premises 2 and 3, which support it; finally, premise 4 in turn supports premise 3.

"(1) Museums and art galleries provide a better understanding about arts than Internet. (2) In most museums and art galleries, detailed descriptions in terms of the background, history and author are provided. (3) Seeing an artwork online is not the same as watching it with our own eyes, as (4) the picture online does not show the texture or three-dimensional structure of the art, which is important to study."

Figure 1: Example of an argumentative structure

Argument mining makes it possible to automatically extract argumentative structures from large volumes of text. As the authors of (Eger, Daxenberger, and Gurevych, 2017) indicate, research focuses on several subtasks:

- Separating argumentative from non-argumentative units (component segmentation)
- Classifying argumentative components as premise or claim
- Finding relations between argumentative structures
- Classifying the relations according to whether they support or attack the claim

An interesting aspect worth highlighting is the transition in the development of these technologies. While, historically, the search for argumentative structures was based on manually built grammars for a specific domain (Palau and Moens, 2009), current argument mining work focuses on end-to-end systems based on recurrent neural networks (Eger, Daxenberger, and Gurevych, 2017).

2.2 Machine translation

Machine translation tries to solve the problem of translating from a source language into a target language using automatic tools. This area has boomed thanks to the growth of parallel bilingual texts in which, for each sentence in a source language, its corresponding translation in a target language is available.

Historically, machine translation was based on linguistic knowledge until the highly successful irruption of so-called statistical translation (Brown et al., 1990; Koehn and Knight, 2000), in which every source sentence could be a translation of a target sentence; the objective was therefore to assign a high probability when two sentences were indeed translations of each other, and a low probability otherwise. Equation 1 gives the basic formula of statistical translation, where x is the sentence to translate and y a possible translation:

    ŷ = argmax_{I, y_1..y_I} ∏_{i=1}^{I} p(y_i | y_1^{i-1}, x)    (1)

Nowadays, translation efforts are focusing on neural translation, which has been shown to obtain better results than statistical translation (Koehn, 2009). This has been possible, in part, thanks to the inclusion of the distributional representations known as word embeddings (Bengio et al., 2003; Mikolov et al., 2013). This kind of modelling made it possible to represent words as dense vectors capturing their semantic and syntactic relations with other terms, which in turn allowed powerful neural network techniques to be applied to a large number of textual tasks with state-of-the-art results. In machine translation, the two architectures that set the current state of the art are, on the one hand, the encoder-decoder architecture with an attention model (Luong, Pham, and Manning, 2015), based on recurrent neural networks, and, more recently, models based on Transformers (Vaswani et al., 2017), built on feed-forward networks and self-attention among the input words, the output words, and between input and output words.

Moreover, since translation results still do not allow fully automatic translation, there are numerous efforts to build systems that let an expert operator translate text with as little effort as possible (Peris, Domingo, and Casacuberta, 2017). In this area, many post-editing techniques have been developed that achieve better results with less human effort than translating manually.

3 Datasets

Since there are no argument extraction tools that work in Spanish or Catalan, and the tool developed by the Technical University of Darmstadt only works in English or German, we decided to integrate a translation system to translate the tweets provided in the MultiStanceCat dataset from Spanish and Catalan into English. This requires parallel corpora from which translators for both Catalan and Spanish can be obtained.

3.1 The MultiStanceCat dataset

The MultiStanceCat corpus (Taulé et al., 2018) is a multimodal collection of tweets (text and image) gathered with the hashtags #1oct, #1O, #1oct2017 and #1octl6. After collecting 220,148 tweets in Spanish and Catalan, the authors manually annotated a subset of them according to their stance on the referendum (in favour, against, or neutral), obtaining a total of 5,853 tweets in Catalan and 5,545 in Spanish. In addition, to guarantee the quality of the annotations, the authors carried out two inter-annotator agreement processes.

Stance        Catalan   Spanish
In favour       5,106     2,099
Against           149     2,231
Neutral           598     1,215
Total           5,853     5,545

Table 1: Distribution of tweets by language and by stance (in favour, against, or neutral)

The amount of data collected is balanced in terms of the number of tweets between the two languages.
Moreover, as the table shows, while for Spanish the distribution of tweets over the classes is balanced, for Catalan there is a bias: most of the collected tweets were labelled with a positive stance.

The authors also conducted a study to determine how polarized the different communities were with respect to the referendum. The study found that the communities were disconnected from one another, since only 13.64% of users followed users from other communities, and the share of users who only followed users of a single polarity, positive or negative, was 51.44% and 25.03% respectively. Furthermore, only 9.89% of users expressed a neutral stance, so polarization was high.

3.2 Datasets for machine translation

Building the translators required parallel corpora of Spanish-English and Catalan-English sentences. The Europarl corpus was used for translating from Spanish into English, and OpenSubtitles for translating from Catalan into English.

3.2.1 Europarl

Europarl (Tiedemann, 2012) is a parallel corpus extracted from the website of the European Parliament by Philipp Koehn. It includes versions in 21 European languages. The portion of interest here consists of 1.9 million Spanish-English sentence pairs and around 50 million words for each of Spanish and English. These pairs must be preprocessed and tokenized.

3.2.2 OpenSubtitles

The OpenSubtitles corpus (Lison and Tiedemann, 2016) is a parallel corpus extracted from film subtitles published on http://www.opensubtitles.org/. It includes versions for 62 different languages. The Catalan-English version consists of 428,000 sentence pairs and around 3 million words. These Catalan-English pairs must be preprocessed and tokenized before they can be used by the translator.
4 Neural machine translation

4.1 Proposed solution

To build the translators, we propose a system based on neural networks, using the NMT-Keras toolkit (Peris and Casacuberta, 2018). Since the translators to be trained are based on corpora with rather formal language, while the final goal is to translate a corpus of tweets, which use colloquial language full of errors, abbreviations, etc., we propose a post-editing-based system that allows a human expert to translate efficiently and to improve the translators' performance iteratively (Domingo et al., 2019; Peris and Casacuberta, 2019). Figure 2 details the implementation of this system.

[Figure 2: Description of the system used]

As detailed in Figure 2, the system consists of several phases:

1. Training of the neural translators with the NMT-Keras toolkit, using the parallel corpora described in 3.2.1 and 3.2.2.
2. Extraction of a subset of tweets from the dataset described in 3.1 and translation by the trained system.
3. Review by the expert operator of the translations generated by the system, correcting them and thereby generating a parallel corpus of tweets.
4. Adjustment of the parameters of the neural translator using the parallel corpus of tweets generated by the operator.
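The four phases can be read as a simple loop; the following sketch is our schematic rendering of Figure 2, with the translation, review, and fine-tuning steps passed in as callables (all names hypothetical):

```python
from typing import Callable, List, Tuple

def post_edit_loop(
    translate: Callable[[str], str],
    review: Callable[[str, str], str],
    fine_tune: Callable[[List[Tuple[str, str]]], None],
    tweets: List[str],
    batch_size: int = 100,
) -> List[Tuple[str, str]]:
    """Iteratively translate, post-edit, and adapt on the growing corpus."""
    parallel: List[Tuple[str, str]] = []
    for i in range(0, len(tweets), batch_size):
        batch = tweets[i:i + batch_size]
        drafts = [translate(t) for t in batch]                 # phase 2
        fixed = [review(t, d) for t, d in zip(batch, drafts)]  # phase 3
        parallel.extend(zip(batch, fixed))
        fine_tune(parallel)                                    # phase 4
    return parallel
```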
decoder, se ha explorado la variaci´on del Por otra parte, se ha explorado la arqui- BLEU al modificar el tama˜nodel modelo ba- tectura de traducci´onbasada en Transfor- sado en la arquitectura Transformer, aunque mers, aunque las experimentaciones han re- los resultados no han conseguido mejorar al velado que los resultados obtenidos son sig- modelo basado en la arquitectura encoder- nificativamente inferiores a los obtenidos con decoder. la arquitectura encoder-decoder. De nuevo, una vez determinada la me- Por ´ultimo,una vez determinada la mejor jor arquitectura (encoder-decoder) y para- arquitectura (encoder-decoder) y parametri- metrizaci´on(tama˜node embeddings 128 y zaci´on(tama˜node embeddings 128 y unida- LSTM 256), se ha entrenado un traductor con des LSTM 128) se ha entrenado un traductor un mayor volumen de datos obteniendo un con un mayor volumen de datos, obtenien- BLEU de 27.02. Este ser´ael nuevo traductor do un BLEU de 33.91. Este ser´a,por tan- neuronal (catal´an-ingl´es),el cual se ir´ame- to, el traductor que traducir´alos tweets y se jorando de forma progresiva en la traducci´on ir´amejorando de forma iterativa utilizando de tweets empleando el sistema de postedi- el sistema de postedici´onpropuesto. ci´onpropuesto. 4.2.2 Resultados OpenSubtitles 4.2.3 Traducci´onde los tweets Se ha seguido un enfoque similar al expues- Una vez entrenados los traductores neurona- to anteriormente, fijando primeramente para les para la traducci´ondel espa˜noly el catal´an la arquitectura encoder-decoder el tama˜node al ingl´es, se ha seguido la propuesta comenta- la red LSTM a 64 unidades y explorando la da en el apartado 4.1 para traducir los tweets. variaci´ondel BLEU al variar el tama˜nodel Se observ´oun incremento de las prestaciones embedding. de los traductores neuronales al incrementar 63 Marcos Esteve Casademunt, Paolo Rosso, Francisco Casacuberta el n´umerode muestras posteditadas en el pro- sus ideas. ceso de adaptaci´on;obteniendo traducciones Algunos de los argumentos encontrados m´aseficientes a si los tweets se tuvieran que por la herramienta son: traducir de forma manual. • El de @cupnacional diu que tindrem m´es 5 Argument mining drets indepes. Potser si,BarcelonaWorld i escoles OPUS De moment recolzament Una vez traducido el corpus de tweets del da- 2 taset comentado en el apartado 3.1, se ha he- partit corrupte. #1octL6 (En contra) cho uso de la herramienta, desarrollada por • #1O Catal´adiu que especular amb la la Universidad T´ecnicade Darmstadt, Argu- detenci´ode Puigdemont t´epoc sentit i menText (Stab et al., 2018) y disponible en insisteix que ´esdecisi´odel jutge3 (Neu- el enlace http://www.argumentsearch.com/. tral) Se trata de una herramienta que permite bus- car argumentos escritos en ingl´eso alem´anen • #1octL6 que nivel de irresponsabilidad grandes colecciones de documentos, hacien- solo por interes electoral, de unos y do uso de aprendizaje autom´atico.M´ascon- otros. Politicos rompiendo la sociedad y cretamente hace uso de redes neuronales con un pais. Que pena...(En contra) modelos de atenci´onpara la clasificaci´onde • Dice el Gobierno espa˜nolq si renuncia- las oraciones seg´unsi poseen argumentos o no mos al refer´endum#1O nos dar´anm´as y, posteriormente redes recurrentes BILSTM, dinero, autonom´ıay reformar´anla Cons- para clasificar el posicionamiento a favor o en tit. . . (A favor) contra del tema que est´asiendo analizado. 
4.2.3 Translation of the tweets

Once the neural translators from Spanish and Catalan into English were trained, the proposal described in Section 4.1 was followed to translate the tweets. The performance of the neural translators was observed to increase as the number of post-edited samples used in the adaptation process grew, yielding more efficient translations than if the tweets had to be translated entirely by hand.

5 Argument mining

Once the corpus of tweets described in Section 3.1 was translated, we used ArgumenText (Stab et al., 2018), the tool developed by the Technical University of Darmstadt and available at http://www.argumentsearch.com/. It is a tool for searching for arguments written in English or German in large document collections using machine learning. More specifically, it uses neural networks with attention models to classify sentences according to whether they contain arguments and, subsequently, recurrent BiLSTM networks to classify the stance, for or against, towards the topic being analysed.

Since the tool needs a topic on which to run the analysis, the keyword "catalonia" was used, and the tweets were grouped according to their source language and their stance in the MultiStanceCat corpus.

[Figure 3: Percentage of argumentation by language and stance]

Figure 3 shows that the percentage of argumentation is very low, at most 5.44% in the case of the community that is against and speaks Spanish. Moreover, Spanish speakers in the communities for and against the referendum try to argue their position to a greater extent than Catalan speakers. In the neutral community, by contrast, Catalan speakers are the ones who argue their ideas the most.

Some of the arguments found by the tool are:

- El de @cupnacional diu que tindrem més drets indepes. Potser si, BarcelonaWorld i escoles OPUS De moment recolzament partit corrupte. #1octL6² (Against)
- #1O Català diu que especular amb la detenció de Puigdemont té poc sentit i insisteix que és decisió del jutge³ (Neutral)
- #1octL6 que nivel de irresponsabilidad solo por interes electoral, de unos y otros. Politicos rompiendo la sociedad y un pais. Que pena... (Against)
- Dice el Gobierno español q si renunciamos al referéndum #1O nos darán más dinero, autonomía y reformarán la Constit... (In favour)

² The one from @cupnacional says we will have more independentist rights. Maybe so; BarcelonaWorld and OPUS schools. For now, let's support the corrupt party. #1octL6
³ #1O Català says that speculating about Puigdemont's arrest makes little sense and insists it is the judge's decision

6 Conclusions and future work

The main objective of this article was to determine whether there was argumentation in the tweets of the MultiStanceCat dataset about the referendum of October 1, 2017 on the independence of Catalonia. Since no tools exist that perform argument mining on Catalan or Spanish, this work proposed a neural translation system, built with the NMT-Keras toolkit and based on post-editing, to translate the corpus of tweets of the MultiStanceCat task from Catalan and Spanish into English. This made it possible to adapt a formal domain, such as a translator trained on the Europarl corpus, to the translation of an informal, error-ridden domain such as tweets. This phase was especially arduous because of the informal jargon used on Twitter, where there are generally a large number of spelling errors, abbreviations, fillers and other features typical of spontaneous communication on social media. To solve this problem, a large number of post-edits were needed in order to adapt the initial translators to the translation of the tweets.

On the other hand, once the tweets were translated, the analysis of the argumentation in the MultiStanceCat corpus showed that most users do not try to argue their stance on the referendum held on October 1, 2017, and that there is a small increase in argumentation among Spanish-speaking users.

As future work, it may be interesting to study the origin of the errors, since errors may exist both in the translation of the tweets and in the extraction of the arguments.
It may also be interesting to study the percentage of argumentation in a generic corpus of tweets. This would allow better comparisons, making it possible to determine whether the use of argumentation is common on Twitter. In addition, there is a clear need to develop tools for argument mining in Spanish and the other languages of Spain, which would yield better results by removing the need for machine translation techniques.

Acknowledgements

This work was carried out within the Machine Translation course of the Master's degree in Artificial Intelligence, Pattern Recognition and Digital Imaging (MIARFID) of the Universitat Politècnica de València. The work of the last two authors was carried out within the project Misinformation and Miscommunication in social media: FAKE news and HATE speech (MISMIS-FAKEnHATE) of the Ministerio de Ciencia, Innovación y Universidades (PGC2018-096212-B-C31) and the project PROMETEO/2019/121 (DeepPattern) of the Generalitat Valenciana.

References

Bengio, Y., R. Ducharme, P. Vincent, and C. Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137-1155.

Brown, P. F., J. Cocke, S. A. Della Pietra, V. J. Della Pietra, F. Jelinek, J. Lafferty, R. L. Mercer, and P. S. Roossin. 1990. A statistical approach to machine translation. Computational Linguistics, 16(2):79-85.

Domingo, M., M. García-Martínez, Á. Peris, A. Helle, A. Estela, L. Bié, F. Casacuberta, and M. Herranz. 2019. Incremental adaptation of NMT for professional post-editors: A user study. arXiv preprint arXiv:1906.08996.

Eger, S., J. Daxenberger, and I. Gurevych. 2017. Neural end-to-end learning for computational argumentation mining. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

Koehn, P. 2009. Statistical Machine Translation. Cambridge University Press.

Koehn, P. and K. Knight. 2000. Estimating word translation probabilities from unrelated monolingual corpora using the EM algorithm. In AAAI/IAAI, pages 711-715.

Lai, M. 2019. On language and structure in polarized communities. Ph.D. thesis, Universitat Politècnica de València.

Lai, M., M. Tambuscio, V. Patti, G. Ruffo, and P. Rosso. 2019. Stance polarity in political debates: A diachronic perspective of network homophily and conversations on Twitter. Data & Knowledge Engineering, 124 (online).

Lison, P. and J. Tiedemann. 2016. OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles.

Luong, M.-T., H. Pham, and C. D. Manning. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.

Mikolov, T., I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111-3119.

Palau, R. M. and M.-F. Moens. 2009. Argumentation mining: the detection, classification and structure of arguments in text. In Proceedings of the 12th International Conference on Artificial Intelligence and Law, pages 98-107.

Peris, Á. and F. Casacuberta. 2018. NMT-Keras: a very flexible toolkit with a focus on interactive NMT and online learning. The Prague Bulletin of Mathematical Linguistics, 111:113-124.

Peris, Á. and F. Casacuberta. 2019. Online learning for effort reduction in interactive neural machine translation. Computer Speech & Language, 58:98-126.

Peris, Á., M. Domingo, and F. Casacuberta. 2017. Interactive neural machine translation. Computer Speech & Language, 45:201-220.

Stab, C., J. Daxenberger, C. Stahlhut, T. Miller, B. Schiller, C. Tauchmann, S. Eger, and I. Gurevych. 2018. ArgumenText: Searching for arguments in heterogeneous sources. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations.

Stab, C. and I. Gurevych. 2014. Identifying argumentative discourse structures in persuasive essays. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 46-56.

Taulé, M., F. M. R. Pardo, M. A. Martí, and P. Rosso. 2018. Overview of the task on multimodal stance detection in tweets on Catalan #1Oct referendum. In IberEval@SEPLN, pages 149-166.

Tiedemann, J. 2012. Parallel data, tools and interfaces in OPUS. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12).

Vaswani, A., N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998-6008.

Procesamiento del Lenguaje Natural, Revista nº 65, septiembre de 2020, pp. 67-74. Received 03-04-2020; revised 26-05-2020; accepted 09-06-2020.

Edición de un corpus digital de inventarios de bienes

Edition of a digital corpus of inventories of goods

Pilar Arrabal Rodríguez
Universidad de Granada
[email protected]

Resumen: En este trabajo se pretende dar a conocer el proceso de elaboración de un corpus diacrónico digital a partir de la selección y edición digital de inventarios de bienes de los siglos XVIII y XIX de las provincias de Madrid y Almería. Los recuentos de bienes, de estructura repetitiva y abundantes en distintos puntos de la geografía hispánica, facilitan la comparación regional y cronológica de los documentos. Este corpus forma a su vez parte de Oralia diacrónica del español (ODE), un corpus que toma como inspiración el modelo tecnológico empleado por el proyecto europeo P.S. Post Scriptum para ofrecer en línea un corpus anotado a partir de la herramienta TEITOK (Janssen, 2016). Palabras clave: Inventario de bienes, Lingüística histórica, Lingüística de corpus, español, XML, TEITOK

Abstract: The aim of this paper is to show the process of preparing a digital diachronic corpus based on the selection and digital edition of inventories of goods from the 18th and 19th centuries in the provinces of Madrid and Almería. Inventories of goods, which have a similar structure and are abundant in various areas of the Spanish geography, facilitate the comparison of the documents. This corpus is part of Oralia diacrónica del español (ODE), a corpus that takes as inspiration the technological model used by the European project P. S. Post Scriptum to offer an annotated corpus online using the TEITOK tool (Janssen, 2016). Keywords: Inventories of goods, Diachronic linguistics, Corpus Linguistics, Spanish, XML, TEITOK

1 Introduction

This paper describes the methodology being followed to build a digital diachronic corpus made up of inventories of goods from the 18th and 19th centuries from the provinces of Madrid and Almería.

Corpora based on archival documentation that exploit the advantages offered by this type of document are increasingly numerous. For Spanish, the CHARTA corpus¹ deserves special mention: it integrates numerous subcorpora of very different typologies, from the dawn of Castilian to the modern era. Among other advantages, archival documentation makes it possible to study diatopic and diachronic variation with ample reliability, allowing clear geographical differences in word usage to be established and the area and extension of terms over time to be delimited. In particular, inventories of goods also benefit from these favourable conditions for studying the dialectal phenomena reflected by local scribes.

The great abundance of this type of document throughout the Hispanic territory makes it possible to compare results both diatopically and chronologically, insofar as the use of words can be placed with certainty at a specific historical moment and geographical location. A clear example of this is the Corpus Léxico de Inventarios² (Morala Rodríguez, 2014).

¹ http://www.charta.es/
² http://web.frl.es/CORLEXIN.html


CorLexIn gathers inventories of goods from the 17th century from practically all the provinces of Spain and from other parts of America. In this way, a great amount of vocabulary is minutely recorded, referring to objects of very diverse nature that were in widespread use at the time and across all Hispanic regions, thus enabling interesting research.³

The abundant documentation preserved and the similar structure that the texts maintain, as well as their content, regardless of their origin, favour dialectal research. These characteristics make inventories an interesting documentary source that facilitates comparison.

Leaving linguistic aspects aside and turning to technological considerations, the digital edition of the documents that make up the corpus has closely followed the technological model proposed by the European project P.S. Post Scriptum.⁴ That corpus gathers a broad collection of private letters, in both Spanish and Portuguese, by authors of different social categories. It thereby prioritizes the study of linguistic varieties that do not usually appear in other corpora of a learned or literary character, since private correspondence tends to display a discourse close to orality through its treatment of everyday matters (Vaamonde, 2015).

Additionally, the main objective of that project has been to offer a corpus that is fully edited, lemmatized and linguistically annotated thanks to the TEITOK tool (Janssen, 2016).

At present, twenty-two projects already use TEITOK to host their corpora. Among them, P. S. Post Scriptum is the corpus taken as a reference for building the new inventory corpus, since it too is a historical corpus of Early Modern Spanish.

Besides the use of TEITOK, the use of international standards is essential to avoid admitting texts whose graphical presentation is heterogeneous because of differing criteria that, moreover, are unknown to the user consulting the corpus. For the preparation of this corpus, the standard proposed by the TEI consortium for the transcription of manuscripts has been adopted.

The corpus presented here justifies its creation on the principle of the reliability of the documents that compose it, and it aims to be an interdisciplinary tool facilitating research in diverse areas such as ethnography, history, culture and linguistics.

In short, the idea that motivated the creation of a corpus such as the one proposed rests on two key concepts: exploiting the inventory-of-goods typology as consolidated in CorLexIn, and employing a technological model to offer a corpus with the latest advances, such as the one proposed by P. S. Post Scriptum.

³ Morala Rodríguez (2014) has shown in numerous works how corpora of these characteristics can make useful contributions to the study of the lexicon and to historical lexicography.
⁴ http://ps.clul.ul.pt/es/index.php?

68 Edición de un corpus digital de inventarios de bienes orientales y más adelante de su totalidad). El los archivos, se han preferido los manuscritos segundo proyecto, como una de sus en mejor estado de conservación y solo se han aplicaciones tecnológicas más relevantes, rechazado aquellos inventarios que han persigue además documentar formas léxicas planteado alguna duda sobre su certera datación identitarias de todas las provincias andaluzas a o localización. partir de un cartografiado diacrónico. Los documentos seleccionados son, en su Se pretende contrastar la modalidad gran mayoría, cartas de dote y recuentos de lingüística de la región con otras zonas y bienes realizados con motivo de la muerte de épocas, por lo que se ha proyectado la creación algún familiar y tras los que se realiza su de un corpus de control con el que poder consecuente reparto entre los herederos; pero establecer diferencias o similitudes también se incluyen entre los documentos comparativas. Este corpus será representativo escogidos almonedas o embargos de bienes. En de la Comunidad de Madrid y servirá como todos ellos el escribano recoge listas referencia en tanto que reflejará una modalidad pormenorizadas de las pertenencias del de habla lo suficientemente alejada de la benefactor, entre los que se encuentran registrada en Almería. De aquí en adelante y utensilios domésticos variados, ropas, salvo que se especifique lo contrario, se hará mobiliario o animales. En la Tabla 1 se muestra referencia de manera conjunta a la metodología una tipología de los documentos pertenecientes empleada tanto en el corpus de estudio a cada provincia incluidos en el corpus: almeriense como en el madrileño. Almería Madrid 3 El corpus documental Cartas de dote 5 26 Particiones 44 18 La primera tarea para la confección del corpus Almonedas 2 1 se fundamentó en localizar, seleccionar y Embargos 1 0 digitalizar los manuscritos. Para ello se Total 52 45 consultaron los fondos notariales del Archivo Histórico de Protocolos de Almería y de Madrid. La labor de selección de los inventarios Tabla 1: Relación de documentos por provincia de bienes ocupó los primeros meses desde el inicio de elaboración del corpus y actualmente A la tarea de selección, le ha seguido la está concluida para ambas provincias. reproducción fotográfica del documento. Cada En el caso de los documentos almerienses inventario se ha identificado con una cabecera seleccionados, estos proceden de los once en la que se incluyen datos que registran su municipios ya contemplados en CorLexIn con datación y localización geográfica entre otras el fin de mantener lo más fielmente posible el características descriptivas propias del principio de la comparabilidad diatópica. A documento. estos once puntos se le han añadido ocho Un propósito inicial radicaba en alcanzar un nuevas localizaciones extendiendo así el corpus de aproximadamente 100.000 palabras espacio estudiado. transcritas. Actualmente, con los cerca de cien A su vez, el corpus de Madrid incluye documentos que componen el corpus, el inventarios de la capital y de otros trece material digitalizado sobrepasa este tamaño por municipios de la provincia. De esta manera se lo que no se descarta seguir ampliando las consigue un corpus dialectalmente más amplio dimensiones del corpus en fases posteriores. y representativo de la provincia en su totalidad de lo que podemos encontrar en CorLexIn5. 
4 Transcripción en XML-TEI La selección de los inventarios de bienes no ha seguido ningún tipo de restricción a Los criterios de transcripción escogidos han excepción de que estos se incluyan en el marco sido el resultado de un largo proceso de cronológico (XVIII-XIX) y la zona geográfica reflexión y de toma de decisiones con el fin de (Madrid y Almería) objetos de estudio. Ante la lograr una edición digital que asegure el rigor abundancia de este tipo de documentación en filológico de las transcripciones. Tan solo se han normalizado la puntuación, el uso de mayúsculas y minúsculas y la separación de 5 La Comunidad de Madrid en CorLexIn únicamente se ve representada por tres de sus palabras, todo ello conforme a las normas localidades, entre las que se incluye la capital. ortográficas actuales. Las grafías se han

69 Pilar Arrabal Rodríguez respetado conforme al original con el fin de para trabajar con los distintos esquemas permitir estudios de carácter gráfico o fonético. existentes de codificación, entre los que se En cuanto a cuestiones de carácter técnico, encuentran las directrices propuestas por el la edición del total de documentos que consorcio TEI. El uso de Oxygen ha facilitado componen el corpus se ha llevado a cabo en un sobremanera el etiquetado y ha agilizado el lenguaje informático y versátil como es XML proceso de transcripción, ya que permite validar adaptado al estándar TEI (Text Encoding automáticamente los documentos de acuerdo Initiative, 2007), un esquema de codificación con el esquema elegido y asiste al usuario en la muy especializado y de alcance internacional. codificación. A partir de este tipo de lenguaje, el La transcripción de los inventarios se ha consorcio TEI propone un estándar que ofrece centrado exclusivamente en aquellas partes una metodología de codificación para la edición consideradas de mayor interés léxico. Estas son de textos en el ámbito de las Humanidades, y las coincidentes con las listas o recuentos de más particularmente para la marcación de bienes donde se enumeran pormenorizadamente manuscritos, que es la que ha servido de base los objetos de los que consta el inventario. Se para la edición de este corpus. han excluido, por tanto, aquellas partes La transcripción en XML permite combinar referidas a las relaciones de parentesco de los aspectos de contenido y formato del documento herederos con el difunto o la relación de gracias al etiquetado de diferente información, parientes implicados en la tasación y reparto de tal y como se muestra en la Figura 1: bienes. En definitiva, se han obviado fragmentos donde abundan las fórmulas fraseológicas o los tratamientos protocolarios propios de textos de carácter notarial como son los documentos que nos ocupan. Estas partes resultan claves para la correcta identificación del inventario y por ello se han digitalizado y conservado internamente, pero no se suman al conjunto del material transcrito. De manera excepcional, sí se han transcrito cuando brevemente permiten la identificación del documento y aparecen inmediatamente antes de la relación de bienes.

5 Procesamiento lingüístico del corpus Figura 1: Transcripción en XML-TEI de un La edición digital del conjunto de datos no es la fragmento de inventario meta final. El tratamiento lingüístico del corpus supone uno de los grandes objetivos de carácter tecnológico con el que poder ofrecer un corpus En el fragmento transcrito se visualiza cómo anotado. El procesamiento del corpus consta es posible marcar características referidas a la principalmente de cuatro tareas que tienen que estructura principal del documento con las ver con la tokenización, normalización, etiquetas ,

o ; correspondientes lematización y etiquetado morfosintáctico. a los inicios de folio, párrafo o línea Todas estas tareas se realizan en línea respectivamente. También características físicas directamente desde el sistema web TEITOK. o visuales como la que refleja la etiqueta TEITOK está ideado para combinar la para las adiciones en el margen fuera de la caja edición filológica con los avances de la de escritura; o bien características conceptuales lingüística computacional y abarca dos recursos por medio de las etiquetas y , en uno. En primer lugar, es una plataforma de para el desarrollo de abreviaturas o conjeturas consulta en la que cualquier usuario externo editoriales respectivamente. puede visualizar los documentos y realizar La edición en XML se ha llevado a cabo búsquedas en el corpus ajustadas a sus utilizando el procesador de textos oXygen XML necesidades. En segundo lugar, es también Editor6 que incorpora multitud de plantillas donde se llevan a cabo por parte del equipo de trabajo las tareas de tratamiento lingüístico y 6 https://www.oxygenxml.com/

70 Edición de un corpus digital de inventarios de bienes edición del corpus que se desarrollan a La gran variedad ortográfica que presentan los continuación. manuscritos obstaculiza sobremanera la posibilidad de ofrecer un corpus anotado 5.1 Tokenización morfosintácticamente. La normalización ortográfica supera esta barrera, pero, además, La tokenización supone uno de los primeros ello hace accesible el corpus al público pasos para el tratamiento automático de los interesado que no posea conocimientos textos y alude al proceso de identificación de lingüísticos específicos. tokens. Esta tarea se lleva a cabo En esta segunda fase se añade la automáticamente en TEITOK una vez que los normalización de aquellas grafías que documentos han sido importados a la previamente se respetaron en la edición de plataforma. El proceso de tokenización consiste acuerdo con el manuscrito original por poseer en asignar a cada forma ortográfica, incluyendo interés filológico y lingüístico. En TEITOK es también los signos de puntuación, un número posible realizar automáticamente la único de identificación dentro de un elemento modernización ortográfica del corpus de que engloba la palabra y acuerdo con parámetros previos importados a la cualquier otra información que se añada en plataforma y que sirven de entrenamiento. Tras relación con ella. A partir de ahora, dentro de la normalización automática se añade a cada cada token se almacenará toda la información token un atributo @nform tal y como se ha relativa a los distintos niveles de edición de una visto en el ejemplo de la Figura 2. sola palabra a la que posteriormente se le Tras la normalización automática se realiza adjudicarán los atributos @form una revisión manual conforme a las normas (correspondiente a la forma original), @fform ortográficas actuales de aquellas palabras que (forma expandida) y @nform (forma no han sido procesadas correctamente. Esta normalizada), de manera que se respetan las revisión es de vital importancia, pues evita que múltiples formas gráficas que tiene una misma errores en este nivel se sucedan en las palabra. Véase el ejemplo de la Figura 2 siguientes fases de anotación. tomado del token “vezino”:

vezno La anotación del corpus se lleva a cabo a partir de la forma normalizada de cada palabra. El Figura 2: Ejemplo de token proceso de anotación automática consiste en la adición de dos nuevos atributos @pos y @lemma a cada token ya formado. A estos La transformación de palabras en tokens es atributos se incorpora la etiqueta también imprescindible para los procesos de morfosintáctica y el lema correspondiente. búsquedas del corpus con el sistema CQP El etiquetario utilizado para anotar el corpus (Corpus Query Processor). Gracias a la se ha basado en el ya manejado por el proyecto tokenización quedan vinculadas todas las P. S. Post Scriptum, aunque con ligeras formas de una misma palabra en sus distintos modificaciones que simplifican en algunos niveles de edición. En la búsqueda, el usuario casos las etiquetas utilizadas. Este, a su vez, puede decidir qué forma ortográfica desea sigue las directrices del estándar propuesto por buscar (paleográfica o normalizada) y el el grupo EAGLES para la anotación de sistema recuperará todas las soluciones que lexicones en lengua española. coinciden en su normalización La anotación morfosintáctica en TEITOK se independientemente de las múltiples variedades lleva a cabo con el anotador automático gráficas que estén presentes en el texto original. NeoTag (Janssen, 2012), un analizador Esto permite recuperar todas las variantes probabilístico del tipo HMM que adjudica a formales, incluyendo aquellas poco predecibles, cada palabra la etiqueta correspondiente según a partir de una sola búsqueda. la función gramatical que cumple en el texto. NeoTag funciona calculando la probabilidad de 5.2 Normalización la etiqueta lingüística POS (part-of-speech) para La normalización del corpus abarca cada forma ortográfica teniendo como base un exclusivamente el nivel ortográfico del texto. corpus de entrenamiento. NeoTag busca en él la

71 Pilar Arrabal Rodríguez frecuencia de aparición de cada palabra con su lema. El algoritmo para ello genera en el acto respectiva etiqueta POS y en base a ella un patrón de análisis morfológico con el que adjudicará una etiqueta u otra dependiendo de modificar la forma original para obtener el lema la probabilidad. (Janssen, 2012). Este sistema fallará por En aquellas palabras nuevas o desconocidas ejemplo en aquellos casos de verbos con flexión para las que no hay datos en el corpus de irregular o de formas como la que se acaba de entrenamiento, el analizador escogerá la mencionar7. etiqueta lingüística según otros factores, como Tal y como se deduce a partir de estos son la terminación de la palabra o contextos ejemplos, los casos con mayor probabilidad de similares de aparición. Sin embargo, hay otros error radican en aquellas palabras que pueden casos también propensos a errar en la anotación. presentar diferentes categorías gramaticales, Se trata de aquellas palabras que pueden casos de desambiguación, lo que exige una presentar ambigüedad léxica: palabras que, aun revisión manual de las mismas. estando presentes en el corpus, requieran una El porcentaje de acierto en la anotación anotación morfosintáctica distinta a la dependerá de varios factores, entre los que se registrada por hallarse en un contexto diferente. encuentran el tamaño y la calidad del corpus de Hasta alcanzar las 100.000 palabras, el entrenamiento en cuanto a lo que etiquetas corpus de entrenamiento que sirvió como base correctamente asignadas se refiere. Pero sin para la anotación ha sido el del proyecto P. S. duda, la precisión del analizador mejora cuando Post Scriptum, constituido por cartas de la Edad este utiliza el propio corpus como Moderna. No obstante, dadas las características entrenamiento y no otros parámetros individuales del corpus epistolar frente al de importados. Esto solo será posible cuando el inventarios, existía un alto porcentaje de error. tamaño del corpus lo permita (Janssen, 2016). Los siguientes ejemplos reales tomados del Al presente, debido al tamaño ya corpus servirán para ejemplificar cómo considerable del corpus, este se ha desligado funciona el analizador. En la oración “un perol recientemente de los parámetros de P. S. Post de cobre”, que se encuentra fácilmente en el Scriptum y el conjunto de inventarios ya corpus, cobre es etiquetado reiteradamente con incluidos en TEITOK conforman el propio la etiqueta VMSP1S0 y con el lema cobrar. Se corpus de entrenamiento, en constante trataría por tanto de un verbo. Esto se debe a actualización según aumenta su tamaño. Con que en el corpus de entrenamiento (heredado de los parámetros de los inventarios como base se P. S. Post Scriptum) es alta la frecuencia de ha aumentado considerablemente la tasa de cobre como una forma verbal y no como el aciertos tras una anotación automática. La elemento químico al que se hace referencia en cantidad de nuevos datos anotados no es aún nuestro corpus. suficiente para poder realizar una comparación De manera paralela a la anotación objetiva respecto a la efectividad de los morfosintáctica se lleva a cabo la lematización, parámetros previos. Se espera que se reduzcan también de forma automática con Neo Tag. las labores de revisión manual y que errores Para las palabras desconocidas, delimitar la como los mencionados más arriba queden terminación de las que no lo son es la estrategia solventados, aunque no se han realizado todavía usada para asignar el lema. 
Una vez que el estadísticas concluyentes que valoren los analizador detecta varias posibilidades, elegirá resultados de la anotación en esta nueva fase8. la más frecuente. Este caso se aplica a la Con todo, la tarea de la anotación lingüística palabra “céntimos”, anotada automáticamente demanda todavía una atención considerable y como un verbo y con el lema *centimar. Esta palabra no se encuentra en los parámetros y se 7 Aun así, los fallos en los lemas son muy anota conforme a su terminación limitados respecto a la anotación morfosintáctica. La correspondiente a la flexión verbal de otras lematización ha sido testada con un 95% de acierto palabras que en cambio sí localiza en los en corpus españoles (Janssen, 2012: 3). parámetros. En muchos casos como este una 8 Test de eficacia para la evaluación de NeoTag mala lematización está vinculada a una etiqueta en la anotación morfosintáctica se han realizado POS errónea, pero en otros, como el caso de sobre varios corpus del español actual con una veces, (correctamente anotado como sustantivo, precisión del 97% (vid. Janssen, 2012). En un corpus pero lematizado como *vec), se debe de los siglos XVIII y XIX como el que nos ocupa, y con un tamaño considerablemente menor, este únicamente al procedimiento de asignación del porcentaje se verá reducido.

72 Edición de un corpus digital de inventarios de bienes exige necesariamente una revisión manual, en edición paleográfica y la edición normalizada algunos casos semiautomática9. Hasta el de cada uno de los manuscritos. momento, esta revisión se ha llevado a cabo Todos los documentos ya se encuentran entre dos anotadores del equipo de trabajo y se normalizados y anotados morfosintácticamente. está elaborando una guía, todavía de uso interno En este momento, las tareas de elaboración del exclusivamente, en la que se contemplen los corpus están centradas en la revisión del criterios estipulados con el fin de garantizar la etiquetado lingüístico con el objetivo de consistencia de la anotación en la totalidad de constituir un corpus de entrenamiento documentos. De esta revisión depende la consistente y de calidad con el menor número calidad del corpus de entrenamiento para la de errores posibles y con el que se vea reducido anotación de futuros documentos conforme va el actual porcentaje de error del anotador creciendo el corpus. Asimismo, también en esta automático. fase radica la posibilidad de realizar búsquedas El corpus de inventarios ya se entrena con complejas mediante el sistema CQP que utiliza sus propios parámetros. De ahora en adelante, TEITOK, restringiendo las consultas y una vez desligado de los indicadores de P.S. ofreciendo también información relativa al lema Post Scriptum, se sigue probando la efectividad y a la etiqueta lingüística de palabras en automática de la normalización y del analizador posiciones específicas de la oración, lo que tras este cambio. TEITOK facilita la edición de favorece una potente sintaxis de búsqueda. la anotación lingüística a partir de búsquedas que permiten aplicar cambios en bloque. 6 Estado actual y trabajo posterior Actualmente se está trabajando en una revisión del etiquetado según este método, que, si bien Desde inicios de 2019 hasta ahora la tarea de no es del todo automático, agiliza sobremanera recopilación de los inventarios de bienes está el proceso sin la necesidad de hacerlo terminada tanto para el corpus de estudio como documento por documento. Se espera que esta para el de control. Actualmente, el tamaño del tarea se concluya al finalizar el presente año. que constan ambos corpus supera las 117.000 Producto de la metodología empleada, el palabras conforme a la distribución que se resultado es ofrecer un corpus donde el usuario muestra en la Tabla 2: puede realizar búsquedas cruzadas de los documentos que ya están, además de Almería Madrid normalizados, anotados lingüísticamente, y S. XVIII 28.826 30.325 descargar libremente los archivos en formato S. XIX 25.785 33.017 XML o TXT con la transcripción de cada Total 54.611 63.342 documento en la versión que desee. Entre las tareas más inmediatas que se Tabla 2: Número de palabras en el corpus realizarán próximamente se contempla la necesidad de ampliar el tamaño del corpus con Ya están disponibles para su consulta nuevo material transcrito una vez que se haya pública el volumen de textos anteriormente trabajado en la mejora del etiquetado recogidos en la Tabla 2. Es posible acceder a morfosintáctico de los textos anotados. Está ellos a través de la dirección web prevista también una proyección cartográfica http://corpora.ugr.es/ode/, donde se aloja el digital que refleje todas las localidades de las corpus en su totalidad. Junto a cada que se tienen datos y que conforman el corpus. 
transcripción, se puede consultar la información Otras tareas que merecen especial atención metatextual que acompaña a cada documento y están relacionadas con las búsquedas en el que conforma la cabecera de cada uno de ellos. corpus. Entre ellas, mejorar la interfaz para Además, está disponible la triple visualización hacerla intuitiva y accesible a cualquier usuario de los documentos. Esto es: la reproducción no especializado. Al mismo tiempo, se fotográfica del documento o facsímil digital, la proporcionarán otras opciones avanzadas que posibiliten la recuperación de la información por medio de búsquedas complejas que 9 Para una descripción de las distintas permitan explotar al máximo un corpus del que posibilidades que ofrece TEITOK a la hora de será posible explorar muy diversos aspectos agilizar las correcciones en el etiquetado pertenecientes a diferentes ámbitos de morfosintáctico consúltese Janssen, Ausensi y Fontana (2017). investigación.

73 Pilar Arrabal Rodríguez

7 Conclusiones UE) y Atlas Lingüístico y Etnográfico de Andalucía, s. XVIII. Patrimonio documental y En este trabajo se ha dado a conocer el proceso humanidades digitales (ALEA XVIII) de elaboración de un corpus de inventarios (proyectos I+D+i Junta de Andalucía - FEDER, siguiendo los últimos avances en edición digital P18-FR-695). y lingüística computacional para corpus históricos. Bibliografía TEITOK es una de las herramientas existentes que posibilita crear corpus a partir de Calderón-Campos, M. 2019. La edición de la combinación de las tareas de transcripción y corpus históricos en la plataforma TEITOK. lingüística computacional en una plataforma El caso de Oralia diacrónica del español. con diversas funcionalidades donde se permite Chimera, 6:21-36. al mismo tiempo el mantenimiento en línea del Calderón-Campos, M. y M. T. García-Godoy. corpus. Gracias a la tokenización se conserva 2010-2019. Oralia diacrónica del español toda la información relativa al texto y a su (ODE). En línea: http://corpora.ugr.es/ode/ grafía al igual que se incluye información lingüística. Las ediciones que se ofrecen de un CLUL (ed.). 2014. P. S. Post Scriptum. Archivo mismo texto a partir de una sola transcripción Digital de Escritura Cotidiana en Portugal y solo son posibles mediante el etiquetado previo España en la Edad Moderna. En línea: de los distintos niveles de edición. Partiendo de archivos codificados en el Consorcio TEI, (eds.). 2007. TEI P5: estándar TEI para lenguaje XML se ofrece un Directrices para la codificación e corpus enteramente digital, de libre acceso y ya intercambio electrónico de texto (Versión disponible en línea. 1.5.). En línea: corpus de inventarios de las provincias de Almería y Madrid de las características Janssen, M., J. Ausensi y J. M. Fontana. 2017. mencionadas es el de poder realizar estudios Improving POS tagging in Old Spanish lingüístico-estadísticos comparativos y using TEITOK. En Proceedings of the establecer diferencias significativas a partir de NoDaLiDa 2017 workshop on Processing los análisis contrastivos resultantes. Por ahora, Historical Language, páginas 2-6, el tamaño del corpus permite hacer una primera Gotemburgo, Suecia. cala en esta dirección, pero el propósito es continuar ampliando el corpus y abarcar otras Janssen, M. 2016. TEITOK: Text-Faithful zonas que continúen enriqueciendo los análisis. Annotated Corpora. En Proceedings of the Los inventarios de bienes permiten realizar Tenth International Conference on interesantes investigaciones en términos Language Resources and Evaluation, culturales y lingüísticos, ya que permiten páginas 4037-4043, Portoroz, Eslovenia. reflejar la modalidad dialectal de la zona. Esto Janssen, M. 2012. NeoTag: a POS tagger for se traduce en un corpus creado para satisfacer grammatical neologism detection. En no solo los intereses de los filólogos Proceedings of the 8th International investigadores, sino también los de lingüistas, Conference on Language Resources and historiadores, etnógrafos o público interesado Evaluation, LREC 2012, Estambul. sin conocimientos lingüísticos específicos. Este carácter interdisciplinar es posible a la Morala-Rodríguez, J. R. 2014. El CorLexIn, un proyección de un corpus con diferentes corpus para el estudio del léxico histórico y ediciones de entre las que el usuario puede dialectal del Siglo de Oro. Scriptum Digital, elegir la que desee y acceder a los textos a partir 3:5-28. de un potente motor de búsquedas. Vaamonde, G. 2015. P. S. 
Post Scriptum: dos corpus diacrónicos de escritura cotidiana. Agradecimientos Procesamiento del Lenguaje Natural, 55:57- Este trabajo se inscribe en el marco de los 64. proyectos de investigación Hispanae Testium Depositiones HISPATESD, de referencia FFI2017-83400-P (MINECO / AEI / FEDER,

74 Procesamiento del Lenguaje Natural, Revista nº 65, septiembre de 2020, pp. 75-82 recibido 28-04-2020 revisado 04-06-2020 aceptado 04-06-2020

Relevant Content Selection through Positional Language Models: An Exploratory Analysis

Selecci´onde Contenido Relevante mediante Modelos de Lenguaje Posicionales: Un An´alisisExperimental Marta Vicente, Elena Lloret Department of Software and Computing Systems University of Alicante, Spain {mvicente,elloret}@dlsi.ua.es

Abstract: Extractive Summarisation, like other areas in Natural Language Pro- cessing, has succumbed to the general trend marked by the success of neural ap- proaches. However, the required resources—computational, temporal, data—are not always available. We present an experimental study of a method based on statis- tical techniques that, exploiting the semantic information from the source and its structure, provides competitive results against the state of the art. We propose a Discourse-Informed approach for Cost-effective Extractive Summarisation (DICES). DICES is an unsupervised, lightweight and adaptable framework that requires neit- her training data nor high-performance computing resources to achieve promising results. Keywords: Summarisation, Positional Language Models, Discourse Semantics Resumen: Como muchas ´areasen el ´ambito del Procesamiento de Lenguaje Na- tural, la generaci´onextractiva de res´umenesha sucumbido a la tendencia general marcada por el ´exitode los enfoques de aprendizaje profundo y redes neuronales. Sin embargo, los recursos que tales aproximaciones requieren – computacionales, temporales, datos – no siempre est´andisponibles. En este trabajo exploramos un m´etodo alternativo basado en t´ecnicasestad´ısticasque, explotando la informaci´on sem´antica del documento original as´ıcomo su estructura, proporciona resultados competitivos. Presentamos DICES, un m´etodo no supervisado, econ´omicoy adap- table que no necesita recursos potentes ni grandes cantidades de datos para lograr resultados prometedores respecto al estado de la cuesti´on. Palabras clave: Res´umenes autom´aticos, Modelos de Lenguaje Posicionales, Sem´antica del Discurso 1 Introduction enabling transformation in industry and aca- demia. However, today shortcomings of DL The explosive growth of data that today’s so- technologies raise concerns at different le- ciety witness, puts in the front line the re- vels depicting a landscape of affected scena- search and development of suitable techno- rios: small companies that cannot access hu- logies that facilitate not only the access to ge amount of data, direct absence of such such data deluge, but its comprehension. To volume of data due to the specificity of the this effect, summarisation techniques become problem (e.g. some medical realms, organi- a crucial resource, aiming at facilitating infor- sational documentation, etc.) even the com- mation access and understanding by conden- plex processing involved in DL, which may be sing data with no loss of meaning (Nenkova costly not only in terms of computational re- and McKeown, 2011). sources, but in environmental impact (Stru- Deep Learning (DL) approaches have be- bell, Ganesh, and McCallum, 2019). Draw- come increasingly popular in most of the Na- backs of this technology, as it exists today, tural Language Processing (NLP) tasks, al- which highlight the need to explore alterna- so in text summarisation, with competiti- 1 tive methodologies and motivate our current ve results and promising developments for investigation.

1nlpprogress.com is a repository to track the Refocusing attention on statistical met- progress in NLP, for the most common tasks. hods, in this paper we present a Discourse- ISSN 1135-5948. DOI 10.26342/2020-65-9 © 2020 Sociedad Española para el Procesamiento del Lenguaje Natural Marta Vicente, Elena Lloret

Informed approach for Cost-effective Extrac- fectiveness of our approach. tive summarisation (DICES), and examine its performance in the task of single docu- 2 Related Work ment extractive summarisation (SDS). Sin- ce it is conceived as an unsupervised method As our approach is primarily designed for ex- that does not require human annotation or tractive summarisation, perspective that in- intervention neither copious amounts of data cludes salient sentences from the original text to provide good results, our approach repre- without modification, we focus specifically on sents a more digitally inclusive tool for public representative methods within this context, and private sector organisations that have for brevity. We also include related work that tighter budgets for accessing Information and underline the importance of discourse, as this Communication Technologies. In this sense, is a core aspect for DICES. organisations with small scale production of Neural approaches have become mains- digital data could benefit from this light- tream research in recent years, where summa- weight mechanism to obtain the desired sum- risation is tackled as a classification problem maries from their different types of informa- which establishes sentence appropriateness. tion. Our approach is feasible with less re- Reinforcement learning (Wu and Hu, 2018; sources than would be necessary for deep or Chen and Bansal, 2018) or encoder-decoder machine learning perspectives. architectures (Cheng and Lapata, 2016; Na- llapati et al., 2016) are common, and research Central to our approach is the fundamen- into new combinations grows constantly. As tal integration of a specific type of langua- opposed to DICES, these approaches still ge model known as Positional Language Mo- need huge amounts of training data, which del (PLM), which has proven to be useful is not always available. and cost-effective in different areas of NLP, Whereas a large part of existing research such as information retrieval (Boudin, Nie, relies on occurrence frequency, a few studies and Dawes, 2010) or language generation (Vi- have focused on including discourse and se- cente, Barros, and Lloret, 2018). Taking into mantics in their approaches. (Liu, Titov, and account the positional information of relevant Lapata, 2019) proposed a structured atten- elements within the document, we outline the tion architecture to induce trees while, in the significance of understanding and processing case of (Liu and Chen, 2019) and (Hirao et the text not as a simple set of words but al., 2013), they relied in Rhetorical Structu- as a succession of messages, whose full mea- ral Theory (Mann and Thompson, 1987). The ning must be accessed beyond the sentence difference with our approach is that these sys- level, at the discourse dimension. We shape tems include an expensive linguistic compo- our methodology to incorporate semantic in- nent that requires dependency parsing or rhe- formation gathered from the document, ins- torical analysis to obtain the relation between tead of merely using words. Thereby, rather the units of the document. DICES repre- than simply relate to word counts, we aim at sents the semantics and structure of the com- overcoming the limitations of the bag of words ponents from a statistical perspective which approaches by considering the original texts imply shallow features and resources. 
as structurally meaningful sources of infor- mation in a discourse-informed process, sho- wing that such strategy have a positive im- 3 DICES Approach pact on both the selection of content and the In this Section we first explain the statistical consequent generation of quality summaries. foundations of DICES to later describe how In short, our contributions are: 1) we de- a middle representation of the document is fine a discourse-informed statistical model built upon the PLMs, serving as basis for the which incorporates semantics from the ori- method to obtain the required summary. ginal text, revealing an improvement of the The fundamental assumption here is that resulting summary; 2) we implement an un- the better the understanding of the origi- supervised, lightweight and adaptable frame- nal text, the more informative the summary work, simple yet effective and 3) we conduct becomes. And only considering the text as a series of experiments over standard bench- a structured discourse, whose semantic ele- marks and, with no heavy load of compu- ments coherently relates to each other, can tational resources, empirically verify the ef- that understanding be leveraged. 76 Relevant content selection through positional language models: an exploratory analysis

3.1 Positional Language Models selected to obtain more effective results and, The basic idea behind this model is that for at a deeper and more comprehensive level, we every position i within a document D, it is could also consider more abstract forms, as possible to calculate a score for each element identifiers for synonyms or deeper semantic w that belongs to the document’s vocabu- or syntactic constructs. lary. This value displays the relevance of w In our current configuration, the vocabu- in a precise position, based on the element’s lary is composed of the synsets corresponding distance to other occurrences of the same ele- to nouns, verbs and adjectives, together with ment throughout the document. The closer the named entities (NE) that appear along the elements appear to the position being the text. This decision aligns with our se- evaluated, the higher the score obtained. This mantic goal, that in this case was to capture behavior allows the model to express the sig- the meanings, and the semantic information nificance of the elements considering the who- they convey. Freeling (Padr´oand Stanilovsky, le text as their context, rather than being li- 2012) is an open-source tool which provides mited to the scope of a single sentence. In this several layers of linguistic analysis. We use it manner, one PLM is computed for each and also to obtain synsets from WordNet (Kilga- every position of the document, which can be rriff and Fellbaum, 2000). formulated as follows: 3.4 Seed Creation P|D| c(w, j) × f(i, j) P (w | i) = j=1 (1) A seed is then created, which must contain P P|D| 0 w0∈V j=1 c(w , j) × f(i, j) elements that allow us to dislodge the irre- levant parts of the discourse. The process of where c(w,j) indicates the presence of w in creation can begin with a sentence or with a position j ; |D| refers to the length of the do- set of words, that need to be analyzed with cument D; V, the vocabulary and f(i,j) is the the same tools as the source text. A second propagation function rating the distance bet- vocabulary is then built from those. ween i and j. Let V denote the source vocabulary, with 3.2 PLM for summarisation elements {w1,..,w|V|}, and Vs the vocabulary Having explained the model foundations, and extracted from the seed, a filter vector F is thus the PLM module, a more detailed proce- generated with as many positions as elements dure needs to be designed to adapt the model V has. If the element wi from V belongs to to the task of summarisation. This procedure Vs, then F[i] = 1 ; F[i] = 0 otherwise. Now comprises three stages. First, we need the de- it is possible to obtain a Score Vector (SC ) finition of the vocabulary as a parameter with values for every position j : for the PLM module. From this stage, we ob- X SC[j] = P (w | j) × F (2) tain a representation of the text that involves w∈V both the vocabulary and the positions of its elements. Second, we create a seed, i.e., a The general vocabulary has been reduced set of words that can be relevant for the text to relevant vocabulary and, thanks to the and whose constitution depends on the cor- PLMs, for each position of the document we pus of origin itself. Finally, the processing of dispose of a reference value. SC would beco- the PLM against the seed allows us to esta- me a detector of important areas, with ma- blish scores for the text elements, which will ximum values when the accumulation of re- be transformed into a ranking of sentences levant elements given by PLMs is higher. 
from which the highest scored ones could be selected to produce the final summary up to 3.5 Ranking and Selection a specific length. From the SC, we are able to obtain the posi- 3.3 The Vocabulary Definition tions of interest within the document. From those, we retrieve as candidate sentences the First, it is necessary to define which type of sentences where these positions belong to, elements will constitute the vocabulary for and subsequently obtain a value Sscore for the PLM. A straightforward approach would each of those sentences: be to select the words as they appear in the text. However, this could lead to unnecessary X Sscore = SC[i] (3) repetitions. Alternatively, lemmas could be i∈S 77 Marta Vicente, Elena Lloret with S being the sentence to be scored, and and made the resulting dataset available3. i indicating the positions within the docu- One particularity of the corpus is that the ment for that sentence. Since it is possible documents are presented in an anonimized that the position belongs to a very short sen- mode (we call it M for mentions), so that the tence, and considering that the value in each entities appearing in the text are substituted position comes from its context, it would be by an identifier or mention. Then, along with pertinent to include the neighboring senten- the text, a list of the correspondent entities is ces. In this way, we introduce a parameter in provided. We processed the corpus to obtain the processing that allows us to select if we a non-anonimized version (we call it E for want to recover only the sentence that inclu- entities), with the entities in their place. In des the position, or also its neighbors. this manner, our evaluation is conducted on Typically, a parameter needs to be set in both versions, M and E. order to determine the length of the sum- Although the version processed by the mary required. We sequentially select the hig- authors contains more than 280K documents, hest scoring sentences, until the established we selected a portion of the test documents length, to gather the best set of sentences for to evaluate our system, since the strength the resultant summary. of DICES does not rely on the amount of examples. Originally, the test set is compri- 4 Datasets, Experiments and sed of 10,397 Daily Mail and 1,093 CNN Evaluation documents. We took the 1,093 documents from CNN and randomly selected the sa- In order to evaluate DICES, several experi- me amount from Daily Mail, thus creating ments were conducted over different corpus. a smaller, but balanced dataset of 2,186 do- Although the process is similar for all them, cuments. For this collection, we estimate an some aspects had to be adapted. Next, the average of 17 sentences per document labeled datasets are introduced together with some with 1 (following (Cheng and Lapata, 2016)’s details on the implementation. We finish with version), and 4 highlights per document, also some notes on the evaluation metrics. on average. 4.1 Datasets Description 4.2 Implementation details The datasets selected to evaluate DICES we- As introduced in Section 3, three stages were re chosen for their renown, to enable a quality required to create the final summary: the vo- comparison with previous systems. In this cabulary definition, the seed creation and se- section, we describe these datasets. lection of sentences after ranking them. The DUC 2002 Some of the most popu- constitution of the seed is specific for each lar datasets used to address summarisation corpus. 
DUC2002 provides one headline per tasks come from the Document Understan- article. In this case, the headline, the first ding Conferences2. (DUC). Among others, sentence or the combination of both would DUC2002 includes a task aimed at SDS, who- work as seed for the summaries. Regarding se goal was to built summaries from one En- CNNDM, headlines are not provided but en- glish news article of 100 words at most. tities or mentions are. Therefore, the seed could consist of the first sentence of the docu- CNN/DailyMail We selected a second ment, the entities/mentions provided or their corpus, the CNN/DailyMail (CNNDM) (Her- combination. For both corpora, we experi- mann et al., 2015), which is more recent than mentally determined that the best strategy DUC2002 and is widely used to evaluate DL was to select as seed the combined option, approaches from both extractive and abstrac- for which results are reported. tive perspectives. Apart from the abstractive gold standard in the form of highlights, some 4.3 Evaluation authors have created purely extractive mo- We adopt ROUGE (Lin, 2004) for perfor- del summaries. Particularly, the authors in ming the evaluation, a recall-oriented measu- (Cheng and Lapata, 2016) tagged with a la- re which has become one of the most common bel 1 the sentences from a document which metrics in summarisation. Summaries were should appear in a gold standard summary, evaluated taking into account ground truth

2duc.nist.gov 3github.com/cheng6076/NeuralSum 78 Relevant content selection through positional language models: an exploratory analysis models and their relation with the system CNNDM % R1 R2 RL summaries in terms of n-gram overlap. M R 83.18 74.54 81.03 Regarding the gold-standard summaries, F 72.00 63.96 70.01 E R 80.72 71.01 78.27 DUC2002 provided up to 4 model summaries F 71.17 61.93 68.86 for each single document and CNNDM do- cuments are paired with a set of abstractive Tabla 1: DICES evaluation against the pu- highlights, that serve as reference summary. re extractive gold summaries from CNNDM— Besides, we created a pure extractive gold anonimized (M) and non-anonimized (E) summary for CNNDM, taking into account sentences labeled as 1, provided by (Cheng outcomes obtain a much higher value than and Lapata, 2016). The gold summary was those achieved when an abstract summary is exclusively used to examine DICES capabli- used as reference. Therefore, in this case, we lity of efficiently retrieving important infor- do not compare them with the other systems. mation from a document. We discuss this is- sue in the next section. 5.2 System comparison In this section we compare our experimental 5 Results and Discussion results with state-of-the-art systems4. Addi- To explore and evaluate the effectiveness of tionally, some baselines are included to pro- our approach, the following analyses were vide evidence of our achievements. conducted: 1) we used the labeled CNNDM 5.2.1 DUC2002 Evaluation corpus to establish the system’s ability to re- We evaluated DICES against several summa- trieve relevant sentences and, 2) we applied risation approaches and report the results in our system to the SDS task. Table 2. The top part of the table exhibits We next present our results and compa- systems that reported recall. The best perfor- re them with several state-of-the-art models. ming system for the competition BestDuc02 We found out that ROUGE unigram (R1) and the Lead baseline the organisers provi- and bigram (R2) overlapping were usually ded are included. The Lead baseline, taking reported by other systems, but occasionally as summary the first 100 words, relies on the longest subsequence overlap (RL) was also assumption that in news genre, the relevant included. Furthermore, some works used F- information is located firstly. Although this score and some employed recall. Taking into baseline was only surpassed by 3 systems, it account this diversity, and in order to get the is highly genre-dependent and does not consi- clearest idea of our system’s performance, we der semantic knowledge, contrary to our ap- have included in the comparison all the signi- proach, which is easily adaptable to other ficant approaches, reporting on the measure genres and domains. required in each comparison. Additionally, taking into account that graph-based and statistical methods repre- 5.1 Relevant Sentence Retrieval sent a common ground within extractive The average number of sentences with label tasks, we included results from LexRank (Er- 1 on the selected subset of the CNNDM cor- kan and Radev, 2004), a popular graph-based pus was 17. In compliance with this restric- technique that uses the PageRank algorithm, tion, summaries limited to that length were and implemented two baselines based on fre- obtained using DICES. 
These summaries we- quency counts that constitute popular refe- re evaluated against the pure extractive gold rences among statistical approaches, perfor- summary to prove that our approach success- ming a bag of words strategy: Tfidf (term fully detected the sentences where the rele- and inverse document frequency involved), vant information in the article resided. R1, and and Tfisf, which is a variation of the R2 and RL are presented in Table 1, both for former, considering the inverse sentence fre- the anonimized (M) and non-anonimized (E) quency, instead idf. versions of the corpus. Systems reporting F-scores are placed at The results show our model’s success in the bottom of Table 2. To the best of our retrieving relevant information. The balance 4Results taken from the respective literature. Only between recall and F-score also indicates that for Pointer-Gen in CNNDM task, the code availa- the recovered elements are significant, in the ble in github.com/abisee/pointer-generator# different n-grams modalities. However, the #looking-for-pretrained-model was run. 79 Marta Vicente, Elena Lloret

Duc2002 R ( %) CNNDM - E Non-anonimized System R1 R2 RL System R1 R2 RL Tfisf 37.03 13.39 30.11 Tf-Isf 27.97 8.37 22.84 Tfidf 38.43 14.39 31.40 Tf-Idf 30.12 9.68 24.64 Lead 41.13 21.07 37.53 BL-4sent 35.56 13.89 31.53 BestDuc02 42.77 21.76 38.64 Pointer-Gen 35.64 15.08 32.52 LexRank 43.20 17.94 38.91 LiuLapata 19 43.85 20.34 39.90 DICES02 44.72 20.02 37.22 DICES-E 34.46 13.09 28.19 F-score ( %) CNNDM - M Anonimized System R1 R2 RL System R1 R2 RL Pointer-Gen 37.22 15.78 33.90 Tf-Isf 31.00 10.05 25.59 ChenBansal 39.46 17.34 36.72 Tf-Idf 32.77 11.25 27.09 DICES02 45.97 20.56 38.25 BL-4sent 38.02 15.71 33.67 WuHu 18 41.22 18.87 37.75 Tabla 2: Recall and F-score results on the single- DICES-M 36.78 14.62 30.27 document task of DUC2002 Tabla 3: F-score ( %) results for CNNDM-E and knowledge there were no neural approaches CNNDM-M, against highlights reporting recall measure. We only found ChenBansal (Chen and Bansal, 2018) sys- tem that presents the F-score for results. We have also included the scores from the They propose a reinforcement learning ap- state-of-the-art systems: LiuLapata (Liu and proach for abstractive summarisation and Lapata, 2019) for E and WuHu (Wu and Hu, test it on the DUC2002 task, and also present 2018) for M. Finally, we show the score for the results for the pointer-generator system the provided model of Pointer-Gen on our Pointer-Gen (See, Liu, and Manning, 2017). corpora. Although we do not beat its perfor- Although their system is abstractive and ours mance, our score is considerably close. is not, the model summaries against which all three systems are compared are the same. Compared to state-of-the-art systems, the significant differences between the results 5.2.2 CNN/DailyMail Evaluation may be caused 1) by the variation in the As previously stated, our main objective amount of data being processed in each case, when working with CNNDM was to evaluate 2) by the extractive condition of our approach DICES’ ability to retrieve salient information against the other abstractive systems. We al- (Section 5.1), given that the objective is emi- so conducted the test with the Tf-Idf and nently abstractive summarization. Neverthe- Tf-Isf set ups, also performing with them less, DUC2002 results showed DICES remar- extractive summarization. As in the other kable performance regarding traditional yet tasks, DICES performs better than those ap- competitive models, and moreover, its com- proaches consistently. parison against neural approaches revealed promising outcomes. We therefore decided to We conducted one last experiment in or- also conduct our summarisation experiments der to better understand the performance on this corpus, a task that has been usually and possibilities of DICES. As mentioned approached from neural frameworks. Specifi- above, our subset of 2,186 documents was cally, we carried out the experiments not over not strictly comparable with the state of the the whole dataset but over the subset of 2,186 art. Nevertheless, we found an experiment documents that we had previously studied. in (Cheng and Lapata, 2016) which evalua- These results may not be strictly comparable tes their extractive approach on 500 sam- due to this size factor, but they certainly give ples from CNNDM, with the highlights pai- us an idea of the potential of our approach. red to the documents as gold standard. We Table 3 summarises the results. 
randomly extracted the same number of ar- It appears in the Table a baseline that was ticles from our data and performed a similar created taking the first four sentences of each evaluation. The results (F-score) are repor- document, as this was the average length of ted in Table 4, and indicate a substantial im- the highlights for that subset. We built a ba- provement as least of 54 %. Nevertheless, we seline for each version of the subcorpus: the think it would be interesting to evaluate DI- anonimized (M) and non-anonimized (E) one. CES exactly in the same set of documents. 80 Relevant content selection through positional language models: an exploratory analysis

CNNDM F-score ( %) both the size of that vocabulary and its se- 500 docs R1 R2 RL mantic composition, thus having a negative ChengLapata 21.20 8.30 12.00 effect on the summaries generated. DICES-E 34.14 12.83 28.05 Finally, it is worth mentioning recent work (+61 %) (+54 %) (+133 %) on summarisation which outlines the benefits Tabla 4: F-score results computed on a random of individually dealing with content selection CNNDM subsample (improvement in brackets) and realisation (Gehrmann, Deng, and Rush, 2018; Cho et al., 2019). DICES is able to per- form these tasks separately due to its mo- 6 Discussion dular architecture. The PLM stage presents The results obtained both with DUC2002 a basic mechanism to detect salient content dataset, and in the evaluation of the sys- within a document by means of a conden- tem’s capacity to recover relevant senten- sed meaning representation. Although in this ces, demonstrate the effectiveness of DICES work we have exclusively tested its perfor- in achieving the objectives established, and mance for extractive summarisation, DICES reinforces our effort on enhancing the seman- modules could also be part, for example, of tic structure of the discourse as catalyst for an abstractive pipeline by adding a different progress in summarisation. realisation module. Additionally, DICES is However, the results DICES obtained in able to work at multiple granularity levels by some of the evaluation settings were lower focusing on the sentences as a whole, speci- than expected, for example, when compared fically on their semantic constituents or even to DL approaches. In this case, the different down to the token level. And this represents a sizes of data evaluated could well explain the crucial difference regarding common extrac- variation in the results or the disparity could tive approaches that usually relies in the sen- be attributed to the fact that some of tho- tence as their basic unit. se systems are trained on different huge da- tasets and just tested in smaller datasets— 7 Conclusion and Future work the ones we use to evaluate our approach, as This paper explores a methodology for sin- DUCs. In any case, to better understand the gle document extractive summarisation that lower performance of DICES in the summari- exploits positional and semantic information sation of CNNDM we plan to deeper analyse to improve the generation of summaries. A the impact of some factors. An aspect to be novel model based on statistical grounds is considered would be the system’s evaluation proposed. One of the motivations that led over the whole CNNDM dataset, to check if us to devise and test an approach like DI- in this case size is relevant. It is also worth CES was to provide a competitive alterna- noting that the corpus highlights are origi- tive against the general trend that exploits nally abstract summaries. This could imply neural networks, in contexts where, for one a disadvantage with respect to our approach reason or another, computational and tem- whose results may be better contextualised poral resources or data are less accessible. In performing a manual evaluation of the resul- general, DICES achieves satisfactory results tant summaries. A thorough study on how without the need for a large amount of data, the distinct constitution of the seed affects training or computational load, in contrast to the outcomes could also give us more insight more sophisticated DL approaches. 
on what is causing the discrepancies and a wider scenario to enrich the research.

Moreover, we carried out an analysis on the resulting summaries that allowed us to identify some errors originated in the preprocessing stage. We detected, for example, how punctuation marks, mainly quotes, harmed the language analysers. Besides, the inadequate disambiguation of terms from which the synsets proceed also had an impact on the creation of the vocabulary, either coming from the body or from the seed, and affecting […]

The experiments show the capability of the framework both in detecting relevant areas of the document and in retrieving the appropriate sentences to construct relevant summaries. Its performance was successfully evaluated in creating single document summaries in the news domain for English.

The DICES methodology could easily be adapted to other languages, whenever a linguistic analyser is available. Moreover, due to its unsupervised nature and the flexibility DICES exhibits, it can readily be applied also to different domains and summarisation modalities such as multi-document summarisation, query and user focused summarisation, or headline generation.

Acknowledgments

This research results from work partially funded by Generalitat Valenciana (SIIA PROMETEU/2018/089) and the Spanish Government (ModeLang RTI2018-094653-B-C22 and INTEGER RTI2018-094649-B-I00). It is also based upon work from COST Action Multi3Generation (CA18231).

References

Boudin, F., J. Y. Nie, and M. Dawes. 2010. Positional language models for clinical information retrieval. In Proc. of EMNLP, pages 108–115.

Chen, Y.-C. and M. Bansal. 2018. Fast abstractive summarization with reinforce-selected sentence rewriting. In Proc. of the ACL, Vol. 1, pages 675–686.

Cheng, J. and M. Lapata. 2016. Neural summarization by extracting sentences and words. In Proc. of ACL, pages 484–494.

Cho, S., L. Lebanoff, H. Foroosh, and F. Liu. 2019. Improving the similarity measure of determinantal point processes for extractive multi-document summarization. In Proc. of the ACL, Vol. 1, pages 1027–1038.

Erkan, G. and D. R. Radev. 2004. LexRank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research, 22:457–479.

Gehrmann, S., Y. Deng, and A. Rush. 2018. Bottom-up abstractive summarization. In Proc. of EMNLP, pages 4098–4109.

Hermann, K. M., T. Kocisky, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pages 1693–1701.

Hirao, T., Y. Yoshida, M. Nishino, N. Yasuda, and M. Nagata. 2013. Single-document summarization as a tree knapsack problem. In Proc. of EMNLP, pages 1515–1520.

Kilgarriff, A. and C. Fellbaum. 2000. WordNet: An Electronic Lexical Database. Language, 76(3):706.

Lin, C.-Y. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81.

Liu, Y. and M. Lapata. 2019. Text summarization with pretrained encoders. In Proc. of EMNLP-IJCNLP, pages 3721–3731.

Liu, Y., I. Titov, and M. Lapata. 2019. Single document summarization as tree induction. In Proc. of the NAACL, Vol. 1, pages 1745–1755.

Liu, Z. and N. Chen. 2019. Exploiting discourse-level segmentation for extractive summarization. In Proc. of the 2nd Workshop on New Frontiers in Summarization, pages 116–121.

Mann, W. C. and S. A. Thompson. 1987. Rhetorical Structure Theory: Description and construction of text structures. In Natural Language Generation. Springer Netherlands, pages 85–95.

Nallapati, R., B. Zhou, C. dos Santos, Ç. Gülçehre, and B. Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proc. of SIGNLL, pages 280–290.

Nenkova, A. and K. McKeown. 2011. Automatic summarization. Foundations and Trends in Information Retrieval, 5(2):103–233.

Padró, L. and E. Stanilovsky. 2012. FreeLing 3.0: Towards wider multilinguality. In Proc. of LREC.

See, A., P. J. Liu, and C. D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proc. of ACL, Vol. 1, pages 1073–1083.

Strubell, E., A. Ganesh, and A. McCallum. 2019. Energy and policy considerations for deep learning in NLP. In Proc. of the ACL, pages 3645–3650.

Vicente, M., C. Barros, and E. Lloret. 2018. Statistical language modelling for automatic story generation. Journal of Intelligent & Fuzzy Systems, 34(5):3069–3079.

Wu, Y. and B. Hu. 2018. Learning to extract coherent summary via deep reinforcement learning. In AAAI Conf. on Artificial Intelligence, pages 5602–5609.

Procesamiento del Lenguaje Natural, Revista nº 65, septiembre de 2020, pp. 83-90 recibido 02-04-2020 revisado 27-05-2020 aceptado 27-05-2020

Rantanplan, Fast and Accurate Syllabification and Scansion of Spanish Poetry

Rantanplan, silabación y escansión rápidas de poesía española

Javier de la Rosa1, Álvaro Pérez1, Laura Hernández1, Salvador Ros1, Elena González-Blanco2
1Digital Humanities Innovation Lab, UNED, Madrid, Spain
2School of Human Sciences and Technology, IE University, Madrid, Spain
{versae, alvaro.perez, laura.hernandez, sros}@scc.uned.es, [email protected]

Abstract: Automated analysis of Spanish poetry corpora lacks the richness of tools available for English. The existing options suffer from a number of issues: they are limited to fixed-metre hendecasyllabic verses, they are not publicly available, the syllabification procedure underneath is not thoroughly tested, and their speed is questionable. This paper introduces new methods to alleviate these concerns. For syllabification, we contribute our own method and a manually crafted corpus. For scansion, our approach is based on a heuristic for the application of rhetorical figures that alter metrical length. Experimental evaluation shows that both fixed-metre and mixed-metre poetry can be successfully analyzed, producing metrical patterns more accurately (increasing accuracy by 2% and 15%, respectively), and at a fraction of the time other methods need (running at least 100 times faster).
Keywords: stress, metrical patterns, scansion
Resumen: El análisis automatizado de corpus de poesía española carece de la riqueza de las herramientas disponibles para el inglés. Las opciones existentes adolecen de una serie de problemas: se limitan a versos endecasílabos de métrica fija, no están disponibles públicamente, el procedimiento de silabación no está probado a fondo, y su velocidad es mejorable. Este artículo presenta nuevos métodos para contrarrestar estos problemas. Para la silabación, contribuimos con nuestro propio método, así como un corpus elaborado manualmente. Para la escansión, nuestro enfoque se basa en una heurística para la aplicación de figuras retóricas que alteran la longitud métrica. La evaluación experimental demuestra que tanto la poesía de métrica fija como la de métrica mixta se analizan con éxito, obteniéndose patrones métricos con mayor precisión (mejoras de un 2% y un 15%, respectivamente), y en una fracción del tiempo que otros métodos necesitan (ejecutándose al menos 100 veces más rápido).
Palabras clave: acentuación, patrones métricos, escansión

ISSN 1135-5948. DOI 10.26342/2020-65-10 © 2020 Sociedad Española para el Procesamiento del Lenguaje Natural

1 Introduction

Although different in nature, syllabification and scansion are loosely coupled by the underlying functioning of the prosody of a language. Syllabification is the splitting of words into their constituent units, syllables. Unlike English, where there is a weak correspondence between sounds and letters, spoken syllables in Spanish are the basis of the orthographic units of its words. These building blocks shape the stress patterns and rhythm of a language, as well as the poetic metre of its poetry. Once a word is split into syllables, Spanish orthography establishes somewhat rigid rules to assign stress and classifies the words according to the position of the last stressed syllable. There is generally only one stressed syllable per word1, with few exceptions (RAE, 2010). Depending on the position of the stressed syllable, there are three categories of words:

1 In this work we use hyphens as the syllabic separator for representation purposes, marking in bold the stressed syllable (e.g., ‘a-mo-ro-so’).

• oxytone words, when the stressed syllable is the last syllable of the word: ‘tam-bor’.
• paroxytone words, when the stressed syllable is the one before the last syllable of the word: ‘plan-ta’.
• proparoxytone words, when the stressed syllable lies two syllables from the end of the word: ‘plá-ta-no’.

Some word functions, such as prepositions, […]

[…] Stressed syllables are marked with the plus symbol ‘+’ and unstressed ones use the minus ‘−’. An extra unstressed symbol is added to the metrical representation of a verse when its last word is an oxytone, removed if a proparoxytone, or left unchanged if a paroxytone. Example 1 shows a verse of 8 syllables and the resulting metrical pattern after applying the synalepha (denoted by ‘<’) and considering the stress of the last word.

(1) Cuando el alba me despierta
    Cuan-doel<-al-ba-me-des-pier-ta
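The orthographic rules above are compact enough to sketch in a few lines of Python. The snippet below is an illustrative simplification of ours, not code from the Rantanplan package: it covers only the written-accent and word-ending rules plus the final-word pattern adjustment just described, and ignores the part-of-speech exceptions discussed later in Section 3.3.

    def stressed_index(syllables):
        # A written accent always marks the stress; otherwise words ending
        # in a vowel, 'n' or 's' are paroxytone and the rest are oxytone.
        accented = set("áéíóúÁÉÍÓÚ")
        if len(syllables) == 1:
            return 0
        for i, syllable in enumerate(syllables):
            if any(char in accented for char in syllable):
                return i
        if syllables[-1][-1].lower() in "aeiouns":
            return len(syllables) - 2  # paroxytone by default
        return len(syllables) - 1      # oxytone by default

    def category(syllables):
        # Classify a syllabified word as in the list above.
        offset = len(syllables) - 1 - stressed_index(syllables)
        return ("oxytone", "paroxytone", "proparoxytone")[min(offset, 2)]

    def adjust_pattern(pattern, last_word_syllables):
        # Final-word rule: add an unstressed position for an oxytone
        # ending, drop one for a proparoxytone, keep paroxytones as is.
        ending = category(last_word_syllables)
        if ending == "oxytone":
            return pattern + ["-"]
        if ending == "proparoxytone":
            return pattern[:-1]
        return pattern

    print(category(["tam", "bor"]))       # oxytone
    print(category(["plan", "ta"]))       # paroxytone
    print(category(["plá", "ta", "no"]))  # proparoxytone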

[…] synereses and keeping the information about the stress positions (line 6). This sequence of phonological groups is translated directly into a metrical pattern (line 7), since each phonological group represents a prosodic unit of pronunciation. The only consideration to factor in is the stress of the ending word, so an extra symbol could be added or subtracted accordingly when necessary. From here, two situations can occur:

1. The expected metrical length is not known, in which case the calculated pattern is returned (line 14).

2. The expected metrical length is known and its value is greater than the length of the calculated pattern (lines 8-13). This means some of the applied synalephas and synereses must be undone until both lengths match. The metrical adjustment module will try every option iteratively, giving priority based on a heuristic. For each attempt, a new metrical pattern and its corresponding length is calculated and checked against the expected metrical length. If no match is found, the last pattern calculated is returned.

Algorithm 1: Scansion procedure
    Input: A sequence W of words ⟨w1, w2, ..., wn⟩
    Input: A value length for the metrical length expected (optional)
    Output: A sequence ⟨s1, s2, ..., sL⟩ of booleans expressing the metrical pattern
     1  for wi ∈ W do
     2      tagi ← pos(wi)
     3      syllablesi ← syllabify(wi)
     4      stressesi ← stress(syllablesi, tagi)
     5  end
     6  groups ← phonological(syllables, stresses)
     7  pattern ← transform(groups)
     8  if length then
     9      while |pattern| < length do
    10          g ← generate_phonological(W)
    11          pattern ← transform(g)
    12      end
    13  end
    14  return pattern
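Read as Python, Algorithm 1 amounts to the following sketch. The helper functions stand for the components described in Sections 3.1-3.4 and are placeholders of ours, not the package's actual API:

    def scan(words, length=None):
        # Lines 1-5: tag, syllabify and assign stress to each word.
        syllables, stresses = [], []
        for word in words:
            tag = pos(word)
            word_syllables = syllabify(word)
            syllables.append(word_syllables)
            stresses.append(stress(word_syllables, tag))
        groups = phonological(syllables, stresses)  # line 6
        pattern = transform(groups)                 # line 7
        if length is not None:                      # lines 8-13
            # Each call yields the next candidate grouping according to
            # the priority heuristic of the metrical adjustment module.
            candidates = generate_phonological(words)
            while len(pattern) < length:
                try:
                    pattern = transform(next(candidates))
                except StopIteration:
                    break  # no combination matches: keep the last pattern
        return pattern                              # line 14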
3.1 PoS tagger

We built Rantanplan on top of the industrial-strength NLP framework spaCy for speed (Honnibal and Montani, 2017). As mentioned previously, in Spanish some words are stressed depending on their function in the sentence, hence the need for a proper part of speech tagger. AnCora (Taulé, Martí, and Recasens, 2008), the gold standard corpus many modern statistical language models are trained on for PoS tagging of Spanish texts, splits most affixes, thus causing some failures in the tags it assigns on prediction. To circumvent this limitation and to ensure clitics3 were handled properly, we integrated FreeLing’s affixes rules via a custom built pipeline for spaCy. The resulting package, spacy-affixes4, splits words with affixes before assigning PoS, and can be plugged into a regular spaCy pipeline loading one of the statistical models for Spanish. In our approach, only suffixes on verbs are enabled in spacy-affixes to guarantee clitics are handled adequately by spaCy and PoS tags are assigned correctly.

3.2 Syllabification

Our method then follows a rule-based algorithm inspired by Ríos Mestre (1998), Caparrós (1993) and Navarro Tomás (1991) to split words into syllables. The procedure relies heavily on regular expressions to extract the letter groups that form the syllables. It is comprised of three steps:

1. Pre-syllabification rules are applied, which include the detection of consonant groups other than double ‘l’, as in ‘ais-lar’, and the handling of the prefixes ‘sin-’ and ‘des-’ when followed by consonants, as in ‘deshielo’.

2. Regular letter clusters are identified and separated from the rest.

3. Post-syllabification exceptions for consonant clusters and diphthongs are applied.

3 Syntactically independent but phonologically dependent units that appear together in a word, e.g., in ‘cógemelo’, both ‘me’ and ‘lo’ are pronouns written together with the verb ‘coge’.
4 See https://github.com/linhd-postdata/spacy-affixes/

Apart from the official rules for syllabification (RAE, 2010), there are cases with more than one correct way to proceed. The first of these cases was the ‘tl’ group. Let’s take the word ‘atlántico’ for example: its syllabification changes according to the territory5. We decided not to split the group ‘tl’, since most of the Spanish speakers around the world do not separate it. In the case of words of Nahuatl origin this separation should not be made either. Compound words and words with an ‘h’ in between were also challenging. As an example of the former, let’s take the word ‘reutilizar’. Although intuitively it may seem that the prefix ‘re-’ should be separated from the rest of the word, the Fundéu6 recommends not to do it this way, splitting instead as ‘reu-ti-li-zar’. Similarly, the ‘h’ in a middle position does not split diphthongs, so ‘desahijar’ would be syllabified as ‘de-sahi-jar’, which might feel odd at a first pass but actually agrees with the pronunciation of the word. Moreover, we also included possible diereses as part of our alternative syllabification exceptions. One such word is ‘hiato’7, which can be split either as ‘hia-to’ or ‘hi-a-to’. As noted by Navarro-Colorado (2017), another common case is the word ‘suave’, which poets tend to apply dieresis to, thus resulting in ‘sua-ve’ instead of the default split as ‘su-a-ve’. Therefore, our method relies on a list of words with alternative syllabifications compiled from Ríos Mestre (1998). These alternatives are only taken into account by the metrical adjustment module.

3.3 Stress assignment and phonological groups

Once syllables and part of speech of a word are extracted, stress can be assigned. The assignment of stress follows very closely the rules defined in RAE (2010), adding exceptions for certain parts of speech, consonant groups, and words that are usually stressed but are not for metrical reasons. The sequence of phonological groups that will be used to extract the metrical pattern is calculated by applying all possible synereses and synalephas to the list of syllables of words per line, and propagating the stress information when needed. For example, the words ‘me ama’ are split into the syllables ‘me-a-ma’, and after applying synalepha the resulting phonological groups, ‘mea<-ma’, keep the stress in place. Word ends are also marked, since they are needed to adjust the length of the metrical pattern according to the position of the stress of the last word. Phonological groups are then transformed into a metrical pattern representation and returned if the expected metrical length of the verse is not known beforehand.

3.4 Metrical adjustment

However, there are situations where the expected metrical length is known, such as processing a corpus of sonnets, which tend to be hendecasyllables. In cases like this, verses with applied synalephas or synereses but a metrical length lower than the expected one would trigger the adjustment module. In example 2, the expected metrical length is 11 but our system returns 9, thus triggering the metrical adjustment module.

(2) loor a mi autor, y al que leyere
    loor<-a-miau<-tor-yal<-que-le-ye-re
    + − − + − − − + −     9 < 11
    (Juan de Timoneda)

This means that 11 − 9 = 2 of the applied synalephas and synereses must be undone until both lengths match. The metrical adjustment module tries every possible metrical pattern combining synereses, synalephas, and alternative syllabifications. Priority is given to keeping the synalephas, since they are rarely broken, whereas synereses are usually undone. The same happens for the alternative syllabifications, which deal with diereses and add more combinations to check for. A special case adding possibilities to the search space is the handling of synalephas between words with an initial ‘h’ and vowel-ending words. Up to the 16th century, the initial ‘h’ in words was aspired instead of silent. This depends on the etymology of some words. For example, in the verse ‘cubra de nieve la hermosa cumbre’ (see example 3) there should not be a synalepha between ‘la’ and ‘hermosa’, since ‘hermosa’ evolved from the Latin ‘fermosa’ and as such a synalepha was not possible at all. To this day, this remains an option to the author, who can decide whether or not to apply a synalepha in such cases.

5 See https://www.fundeu.es/consulta/at-lan-ti-co-o-a-tlan-ti-co-12213/
6 The Fundéu is a foundation created from the Department of Urgent Spanish of the EFE Agency. See https://twitter.com/Fundeu/status/1182226555457724416
7 Several examples can be found at http://elies.rediris.es/elies4/Fon8.htm
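The combination search of the adjustment module can be pictured with the following toy sketch. It is our own simplification, assuming every applied join (synalepha, syneresis or alternative syllabification) is stored as its parts glued with ‘|’ and that the list of optional joins is already sorted by the priority heuristic described above:

    from itertools import combinations

    def undo(groups, joins_to_undo):
        # Split the selected joined groups ('a|b') back into syllables.
        result = []
        for group in groups:
            if group in joins_to_undo:
                result.extend(group.split("|"))
            else:
                result.append(group)
        return result

    def candidate_groupings(groups, optional_joins, expected_length):
        # Yield re-syllabified verses of the expected metrical length by
        # undoing applied joins, cheapest (most often undone) first.
        deficit = expected_length - len(groups)
        if deficit <= 0:
            return
        for chosen in combinations(optional_joins, deficit):
            yield undo(groups, set(chosen))

For example 2 above, with three applied joins and a deficit of 11 − 9 = 2, this search would test C(3, 2) = 3 candidate groupings, stopping at the first one whose pattern matches the expected length.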

(3) cubra de nieve la hermosa cumbre
    cu-bra-de-nie-ve-la-her-mo-sa-cum-bre
    + − − + − − − + − + −     11
    (Garcilaso de la Vega)

For each attempt, a new metrical pattern and length is calculated and checked against the expected metrical length. If no match is found, the last pattern calculated is returned. For the verse in example 2, the generated possible metrical patterns are shown in example 4. Pattern 4a, with no synereses and one synalepha between ‘y’ and ‘al’, would be generated first and returned afterwards. Since the metrical pattern has the correct length, it is returned as such and the generation stops, saving the time it takes to generate any other possible pattern. This is also a limitation of our approach, since more than one correct metrical pattern can be generated that matches the desired length.

(4) loor a mi autor, y al que leyere
    (a) lo-or-a-mi-au-tor-yal<-que-le-ye-re
        − + − − − + − − − + −     11
    (b) lo-or-a-mia […]

[…] the syllabification algorithm. We collected more than 100k words using a combination of online resources9 into a corpus we named EDFU, and are releasing it under a Creative Commons license10. All entries are manually reviewed for correction and compliance with Ríos Mestre and Fundéu recommendations. Table 1 shows the accuracy of the methods by Agirrezabal (2014), Navarro-Colorado (2016), and ours when run against EDFU. Our method performs almost perfectly, with more than one percentage point of gain over the others. No time comparison is made since all times are fairly similar.

Method               Accuracy
Navarro-Colorado     98.35
Agirrezabal          98.06
Rantanplan (ours)    99.99

Table 1: Scores on the EDFU syllabification corpus. Best score in bold.

4.2 Scansion

In his original work describing his scansion […]

[…] metrical length. This decision paid off well in terms of accuracy, since our method outperformed the rest in both fixed-metre and mixed-metre poetry.

We plan to continue improving Rantanplan and explore alternatives, especially using statistical language models to produce end-to-end metrical patterns, further improving speed. Moreover, the output produced by our method will eventually be machine readable, interoperable, and ready to be ingested into a triple store compliant with the POSTDATA Project network of ontologies.

Acknowledgements

This research was supported by the project Poetry Standardization and Linked Open Data (POSTDATA) (ERC-2015-STG-679528), obtained by Elena González-Blanco and funded by a European Research Council (https://erc.europa.eu) Starting Grant under the Horizon 2020 Program of the European Union.

References

Agirrezabal, M., I. Alegria, and M. Hulden. 2017. A comparison of feature-based and neural scansion of poetry. In Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017, pages 18–23.

Agirrezabal, M., A. Astigarraga, B. Arrieta, and M. Hulden. 2016. ZeuScansion: a tool for scansion of English poetry. Journal of Language Modelling, 4.

Agirrezabal, M., J. Heinz, M. Hulden, and B. Arrieta. 2014. Assigning stress to out-of-vocabulary words: three approaches. In International Conference on Artificial Intelligence, Las Vegas, NV, volume 27, pages 105–110.

Caparrós, J. D. 1993. Métrica española. Síntesis, Madrid.

Fernández-Carvajal, F. 2003. Antología de textos.

Gervás, P. 2000. A logic programming application for the analysis of Spanish verse. In International Conference on Computational Logic, pages 1330–1344. Springer.

Hartman, C. O. 2005. The Scandroid 1.1. [Online; accessed 20-July-2020].

Honnibal, M. and I. Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing.

Moretti, F. 2013. Distant reading. Verso Books.

Navarro-Colorado, B. 2017. A metrical scansion system for fixed-metre Spanish poetry. Digital Scholarship in the Humanities, 33(1):112–127.

Navarro-Colorado, B., M. R. Lafoz, and N. Sánchez. 2016. Metrical annotation of a large corpus of Spanish sonnets: representation, scansion and evaluation. In International Conference on Language Resources and Evaluation, pages 4360–4364.

Navarro Tomás, T. 1991. Métrica española. Reseña histórica y descriptiva, 50.

Padró, L. and E. Stanilovsky. 2012. FreeLing 3.0: Towards wider multilinguality. In International Conference on Language Resources and Evaluation.

Quilis, A. 1969. Métrica española. Alcalá, Madrid.

RAE, Real Academia Española. 2010. Ortografía de la lengua española. Espasa.

Ríos Mestre, A. 1998. La transcripción fonética automática del diccionario electrónico de formas simples flexivas del español: un estudio fonológico en el léxico. Ph.D. thesis, Universitat Autònoma de Barcelona.

Taulé, M., M. A. Martí, and M. Recasens. 2008. AnCora: Multilevel annotated corpora for Catalan and Spanish. In International Conference on Language Resources and Evaluation.

A Appendix: Availability

A demo of Rantanplan can be found online at http://postdata.uned.es/poetrylab/. All source code is available under an Apache License 2.0 in a public code repository (https://github.com/linhd-postdata/rantanplan/) and as a Python package in PyPI (https://pypi.org/project/rantanplan/).

B Appendix: Reproducibility

To reproduce the results in this paper, please refer to the following code repository: https://github.com/linhd-postdata/rantanplan-evaluation.
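As a pointer for reproduction, the syllabification scores in Table 1 reduce to a simple word-level accuracy computation. The sketch below assumes, for illustration only, a gold file with one word and its hyphen-separated syllabification per line; see the repository above for the actual data layout and scripts:

    def syllabification_accuracy(gold_path, syllabify):
        # Word-level accuracy of a syllabifier against a gold corpus
        # such as EDFU (file format assumed here, not guaranteed).
        total = correct = 0
        with open(gold_path, encoding="utf-8") as gold_file:
            for line in gold_file:
                word, gold = line.rstrip("\n").split("\t")
                total += 1
                correct += "-".join(syllabify(word)) == gold
        return correct / total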

Proyectos

Procesamiento del Lenguaje Natural, Revista nº 65, septiembre de 2020, pp. 93-96 recibido 30-04-2020 revisado 26-05-2020 aceptado 10-06-2020

COST Action “European network for Web-centred linguistic data science” (NexusLinguarum)

Acción COST “Red europea para la ciencia de datos lingüísticos centrada en la web” (NexusLinguarum)

Thierry Declerck1, Jorge Gracia2, John P. McCrae3 1DFKI GmbH, Multilinguality and Language Technology Lab 2Aragon Institute of Engineering Research, University of Zaragoza 3Insight SFI Research Centre for Data Analytics, National University of Ireland Galway [email protected], [email protected], [email protected]

Abstract: We present the current state of the large “European network for Web-centred linguistic data science”. In its first phase, the network has put in place several working groups to deal with specific topics. The network has also already implemented a first round of Short Term Scientific Missions (STSMs).
Keywords: linguistic data science, multilingualism, linguistic linked data, language resources

Resumen: Presentamos el estado actual de la “Red Europea para la ciencia de datos lingüísticos centrada en la Web”. En su primera fase, el proyecto ha establecido varios grupos de trabajo para tratar temas específicos. La red también implementó una primera ronda de Misiones Científicas de Corto Plazo (la sigla STSM en inglés, por Short Term Scientific Mission).
Palabras clave: ciencia de datos lingüísticos, multilingüismo, datos lingüísticos enlazados, recursos lingüísticos

1 Introduction

We report on the current state of development of the “European network for Web-centred linguistic data science” (NexusLinguarum), which is a recently started COST Action.1

The main aim of NexusLinguarum is to promote synergies between linguists, computer scientists, terminologists, and other stakeholders in industry and society, in order to investigate and extend the area of linguistic data science. The network understands linguistic data science as a subfield of the expanding “data science”, which focuses on the systematic analysis and study of the structure and properties of data at a large scale, along with methods and techniques to extract new knowledge and insights from it. Linguistic data science is a specific case, which is concerned with providing a formal basis to the analysis, representation, integration, and exploitation of language data (syntax, morphology, lexicon, etc.). NexusLinguarum thus brings the topic of linguistic data into this big data context.

In order to support the study of linguistic data science, the network is aiming at the construction at Web scale of a mature holistic ecosystem of multilingual and semantically interoperable linguistic data. Such an ecosystem is needed to foster the systematic cross-lingual discovery, exploration, exploitation, extension, curation, and quality control of linguistic data. For this, NexusLinguarum is investigating the combination of linked data (LD) technologies, natural language processing (NLP) techniques and multilingual language resources (LRs) (bilingual dictionaries, multilingual corpora, terminologies, etc.). This combination seems to offer the potential to enable such an ecosystem that will allow for transparent information flow across linguistic data sources in multiple languages, by addressing in a principled way the semantic interoperability problem.

1 See https://nexuslinguarum.eu/. The action started in October 2019, for a duration of 4 years. “COST” stands for “European Cooperation in Science and Technology”. See https://www.cost.eu/.

ISSN 1135-5948. DOI 10.26342/2020-65-11 © 2020 Sociedad Española para el Procesamiento del Lenguaje Natural

This combination builds on and further extends the recent development of the so-called Linguistic Linked Open Data cloud (LLOD).2 LLOD grounds on linked data to share and interlink linguistically relevant data sources. Specifically, linked data refers to the recommended best practices for exposing, sharing, and connecting structured data on the Web3, and builds on Semantic Web recommendations and World Wide Web Consortium (W3C) standards such as the Resource Description Framework (RDF), RDF Schema (RDFS), and the Web Ontology Language (OWL). In this, NexusLinguarum closely cooperates with on-going H2020 projects dealing with LLOD topics, Elexis4 and Prêt-à-LLOD.5

2 Structure of the Network

The network counts on a very large number of participants, with institutions from 37 so-called COST Member Countries, 3 institutions from Near Neighbour Countries (Belarus, Georgia, Kosovo), 2 International Partner Countries (United States, Singapore) and 1 Specific Organisation (the Translation Centre for the Bodies of the European Union).

NexusLinguarum is co-ordinated by the University of Zaragoza, supported by the Polytechnic University of Madrid for administrative and financial matters. The chair of the Action is Jorge Gracia (University of Zaragoza, Spain), the Vice Chair is John McCrae (National University of Ireland, Galway) and the Scientific Communication Manager is Thierry Declerck (German Research Center for Artificial Intelligence). The Grant Holder Scientific Representative for the network is Elena Montiel (Polytechnic University of Madrid). The co-ordinator for the Short Term Scientific Missions is Penny Labropoulou (Athena Research Center) and Vojtech Svatek (University of Economics, Prague) is the ITC conference manager.

At its kick-off meeting, the network structured itself in 5 working groups, which are briefly described in the following sections.

2.1 Working Group 1 - Linked data-based language resources (leaders: Milan Dojchinovski, Czech Technical University, Prague, and Julia Bosque-Gil, University of Zaragoza)

WG 1 is laying the foundations and is developing best practices for the evolution, creation, improvement, diagnosis, repair, and enrichment of LLOD resources and value chains.

In the first phase of the action, this WG follows the task of identifying the current state of the LLOD cloud, with its problems and challenges, as well as identifying the state of the art and challenges in the current linguistic data models, especially with respect to under-resourced languages and domains that could and should be integrated in the LLOD.

One of the outcomes of WG1 will be a matrix of datasets and languages, which can be shared with related work done in the Elexis project, which is aiming at developing a Matrix Dictionary in the field of eLexicography.

2.2 Working Group 2 - Linked data-aware NLP services (leader: Marieke van Erp, KNAW Humanities Cluster, Amsterdam)

This WG focuses on the application of linguistic data science methods, including linked data, to enrich NLP tasks in order to take advantage of the growing amount of linguistic (open) data available on the Web.

A main task in the first year of the action consists in identifying a list of existing standards, dataset annotations, NLP services and experts to connect to the COST action. Additional tasks to be supported by this WG are, for example, the detection and description of LLOD data sets that can support a series of NLP tasks, like Natural Language Generation or Machine Translation.

2 http://linguistic-lod.org/llod-cloud
3 https://www.w3.org/DesignIssues/LinkedData.html
4 https://elex.is/
5 https://www.pret-a-llod.eu/
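To give a concrete flavour of the linked data representation mentioned above, the following minimal sketch (our illustration, not an artefact of the network) builds a few RDF triples for a lexical entry using the OntoLex-lemon vocabulary and the Python rdflib library; the entry URIs are invented for the example:

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF

    ONTOLEX = Namespace("http://www.w3.org/ns/lemon/ontolex#")
    EX = Namespace("http://example.org/lexicon/")  # made-up namespace

    g = Graph()
    g.bind("ontolex", ONTOLEX)

    # A minimal lexical entry with its canonical form and an English
    # written representation, the kind of triples LLOD datasets expose.
    g.add((EX.bank, RDF.type, ONTOLEX.LexicalEntry))
    g.add((EX.bank, ONTOLEX.canonicalForm, EX.bank_form))
    g.add((EX.bank_form, RDF.type, ONTOLEX.Form))
    g.add((EX.bank_form, ONTOLEX.writtenRep, Literal("bank", lang="en")))

    print(g.serialize(format="turtle"))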
2.3 Working Group 3 - Support for linguistic data science (leaders: Dagmar Gromann, Centre for Translation Studies, Vienna, and Amaryllis Mavragani, University of Stirling, UK)

This working group focuses in the first year on fostering the study of linguistic data by following data analytic techniques at a large scale in combination with LLOD and linked data-aware NLP techniques. In the task on big data and linguistic information, big data sources and state-of-the-art statistical analysis will be studied in combination with LLOD in order to better understand language. Visual analytics will also be considered for this task. This will have an impact on all sub-domains of linguistics, from typology to syntax to comparative linguistics.

WG3 is touching issues related to big data and linguistic information, as well as deep learning and neural approaches for linguistic data, in a multilingual setting.

2.4 Working Group 4 - Use cases and applications (leaders: Sara Carvalho, University of Aveiro, Portugal, and Ilan Kernerman, K Dictionaries, Israel)

WG 4 is dealing with the identification of use cases that will demonstrate the possible deployment of LLOD related technologies. Use cases identified are in the legal domain, as well as in the humanities and social sciences, in order to show how linguistic data science can deeply influence studies in those fields.

At the first meeting of the WG, there was an agreement that setting up use cases that are broader in scope, particularly at this initial stage, could help get things in motion. Currently, five main topics are considered, although others will be added in later stages: legal domain, humanities and social sciences, linguistics, life science, and technology.

2.5 Working Group 5 - Management and dissemination (leader: Jorge Gracia, University of Zaragoza)

This WG deals with the day-to-day management and administrative coordination, as well as cross-WG and external communication and capacity building. Recently, a new task was added for scientific communication strategy. While this WG seems to be a classical management one, it has to deal with the fact that COST Actions fund networking activities and not directly research work. Therefore a main challenge consists in establishing cooperative relations with Research and Development programmes, and also in identifying the most relevant Short Term Scientific Missions that can contribute to this kind of cooperation.

3 Possible Impacts of the Network

NexusLinguarum contributes to knowledge creation and transfer in several aspects. Firstly, through training programs, scientific/industry events and datathons6, which serve to promote and teach linguistic data science and its related technologies to people from both academia and industry.

Secondly, NexusLinguarum is contributing to a series of W3C standardisation activities in the field of Language Resources.

Thirdly, the project is aiming at disseminating its topics to a broader public, including popular science publications, blog posts on the Web, and contributions to crowdsourced resources, like Wikipedia.

Finally, the Action is committed to the design of a common curriculum for a Europe-wide master degree in linguistic data science, as a means to ensure knowledge transfer to a new generation of researchers and practitioners coming from different disciplines and with different backgrounds, in the topics related to linguistic data science.

4 Conclusion

We briefly presented the current state of development of the “European network for Web-centred linguistic data science” (NexusLinguarum), which we think could greatly benefit from being discussed at the SEPLN conference, as the network could get impulses from the SEPLN communities, which are dealing with large varieties of languages.

Acknowledgments

Work presented here was supported in part by the COST Action CA18209 - NexusLinguarum “European network for Web-centred linguistic data science”, the project Prêt-à-LLOD, under grant agreement no. 825182, and the ELEXIS project, under grant agreement no. 731015.

6 NexusLinguarum will for example organise the next SD-LLOD Datathon and the next edition of the Language, Data and Knowledge (LDK 2021) conference. See http://2019.ldk-conf.org/sd-llod-2019/ for the past edition of those events.


References

Bellandi, A., E. Giovannetti, S. Piccini, and A. Weingart. 2017. Developing LexO: a collaborative editor of multilingual lexica and termino-ontological resources in the humanities. In Proceedings of the Language, Ontology, Terminology and Knowledge Structures Workshop (LOTKS 2017), Montpellier, France, September. Association for Computational Linguistics.

Berners-Lee, T. 2006. Design issues: Linked data. http://www.w3.org/DesignIssues/LinkedData.html (August 24, 2020).

Bosque-Gil, J., D. Lonke, J. Gracia, and I. Kernerman. 2019. Validating the OntoLex-lemon lexicography module with K Dictionaries’ multilingual data. In Electronic lexicography in the 21st century. Proceedings of the eLex 2019 conference, pages 726–746, Brno, Czech Republic, October. Lexical Computing CZ s.r.o.

Chiarcos, C., J. McCrae, P. Cimiano, and C. Fellbaum. 2013. Towards open data for linguistics: Linguistic linked data. In A. Oltramari et al., editors, New Trends of Research in Ontologies and Lexical Resources. Springer, Heidelberg, Germany.

Cimiano, P., C. Chiarcos, J. P. McCrae, and J. Gracia. 2020. Linguistic Linked Data - Representation, Generation and Applications. Springer.

McCrae, J., G. A. de Cea, P. Buitelaar, P. Cimiano, T. Declerck, A. Gómez-Pérez, J. Gracia, L. Hollink, E. Montiel-Ponsoda, D. Spohr, and T. Wunner. 2012. Interchanging lexical resources on the Semantic Web. Language Resources and Evaluation, 46(6):701–719.

Procesamiento del Lenguaje Natural, Revista nº 65, septiembre de 2020, pp. 97-100 recibido 30-04-2020 revisado 26-05-2020 aceptado 11-06-2020

MINTZAI: Sistemas de Aprendizaje Profundo E2E para Traducción Automática del Habla

MINTZAI: End-to-end Deep Learning for Speech Translation

Thierry Etchegoyhen1, Haritz Arzelus1, Harritxu Gete1, Aitor Álvarez1, Inma Hernaez2, Eva Navas2, Ander González-Docasal1, Jaime Osácar1, Edson Benites1, Igor Ellakuria3, Eusebi Calonge4, Maite Martín4
1Vicomtech Foundation, Basque Research and Technology Alliance (BRTA)
2HiTZ Center - Aholab, University of the Basque Country (UPV/EHU)
3ISEA, 4Ametzagaiña
{tetchegoyhen, harzelus, hgete, aalvarez, agonzalezd, josacar, eebenites}@vicomtech.org
{eva.navas, inma.hernaez}@ehu.eus, [email protected], [email protected], [email protected]

Resumen: La traducción automática del habla consiste en traducir el habla de un idioma origen en texto o habla de un idioma destino. Sistemas de este tipo tienen múltiples aplicaciones y son de especial interés en comunidades multilingües como la Unión Europea. El enfoque estándar en el ámbito se basa en componentes principales distintos que encadenan el reconocimiento del habla, la traducción automática y la síntesis del habla. Con los avances obtenidos mediante redes neuronales artificiales y aprendizaje profundo, la posibilidad de desarrollar sistemas de traducción del habla extremo a extremo (end-to-end), sin descomposición en etapas intermedias, está dando lugar a una fuerte actividad en investigación y desarrollo. En este artículo se hace un repaso del estado del arte en este área y se presenta el proyecto mintzai, que se está realizando en el ámbito.
Palabras clave: Traducción del Habla, Traducción Automática, Reconocimiento del Habla, Síntesis del Habla, Aprendizaje Profundo

Abstract: Speech Translation consists in translating speech in one language into text or speech in a different language. These systems have numerous applications, particularly in multilingual communities such as the European Union. The standard approach in the field involves the chaining of separate components for speech recognition, machine translation and speech synthesis. With the advances made possible by artificial neural networks and Deep Learning, training end-to-end speech translation systems has given rise to intense research and development activities in recent times. In this paper, we review the state of the art and describe project mintzai, which is being carried out in this field.
Keywords: Speech Translation, Machine Translation, Speech Recognition, Text to Speech, Deep Learning

1 Participantes y entidades financiadoras

mintzai es un proyecto de investigación subvencionado por el Gobierno Vasco a través de la convocatoria de ayudas elkartek 2019 de la Agencia Vasca de desarrollo empresarial Spri.1 Su principal objetivo es la investigación y el desarrollo de sistemas de traducción automática neuronal del habla, con particular énfasis en la traducción entre euskera y castellano.

El proyecto tiene una duración total de 21 meses, con comienzo el 1 de abril de 2019 y finalización el 31 de diciembre de 2020.

mintzai se está llevando a cabo por el siguiente consorcio: Vicomtech2, grupo Aholab de la UPV/EHU3, ISEA4 y Ametzagaiña5, siendo empresas adheridas Argia, EiTB y MondragonLingua. El proyecto tiene asignado el código KK-2019/00065 y el sitio web asociado es: http://mintzai.eus/

1 http://www.spri.eus
2 https://www.vicomtech.org
3 https://aholab.ehu.eus/
4 https://www.isea.eus/
5 https://www.ametza.com

ISSN 1135-5948. DOI 10.26342/2020-65-12 © 2020 Sociedad Española para el Procesamiento del Lenguaje Natural

2 Contexto y motivación

Los métodos de aprendizaje profundo (Deep Learning) se han impuesto como el nuevo paradigma en el campo de las tecnologías de la lengua, con mejoras significativas logradas por ejemplo en traducción automática, conversión de texto en habla, y reconocimiento automático del habla. Actualmente, tanto la investigación científica como la explotación comercial en estos ámbitos se basan mayoritariamente en variantes de redes neuronales artificiales profundas.

Una de las aportaciones importantes del enfoque basado en redes neuronales es la posibilidad de diseñar y entrenar sistemas extremo a extremo (E2E), i.e., sistemas que convierten información de entrada en información de salida mediante un sistema de aprendizaje neuronal único. Los sistemas de traducción automática neuronal, por ejemplo, modelan así de forma conjunta los procesos de alineamiento entre palabras y de traducción (Bahdanau, Cho, y Bengio, 2015); los sistemas E2E de reconocimiento del habla, a su vez, aprenden a asociar directamente señales de sonido con transcripciones para modelar el proceso completo de reconocimiento (Graves, Mohamed, y Hinton, 2013), y los sistemas E2E de conversión de texto a habla generan la señal partiendo de una representación fonética u ortográfica de la entrada (Wang et al., 2017).

Se puede contrastar este tipo de arquitectura con su alternativa estándar, donde distintos componentes se encadenan para convertir información de entrada en información de salida. En un sistema de traducción habla-habla estándar, por ejemplo, destacarían un módulo de reconocimiento del habla, cuya salida en forma de texto sería tratada por un módulo de traducción automática que produzca un texto en el idioma de destino, a partir del cual se pueda generar un contenido leído mediante una voz sintética.

Las diferencias de arquitectura se trasladan en diferencias importantes en cuanto a ventajas y deficiencias respectivas; a continuación se describen las principales diferencias, centradas en las ventajas obtenibles con sistemas E2E:

Reducción de errores: Los sistemas en cadena tienen como desventaja la acumulación de errores generados por los distintos módulos de la cadena. Estos errores se propagan en la cadena de procesamiento global debido a la independencia de los módulos de tratamiento de los distintos aspectos (reconocimiento, traducción o síntesis). Los sistemas E2E, en comparación, no sufren de esta limitación y eliminan por lo tanto esta clase de errores.

Optimización de modelado: Los sistemas E2E modelan la transformación de la información de entrada en información de salida mediante una red neuronal única. Esta característica permite modelar de forma conjunta los diferentes aspectos necesarios para encontrar una solución óptima; en contraste, los sistemas en cadena delegan la determinación de las representaciones a cada módulo de forma independiente, lo cual puede resultar en un modelado subóptimo del problema global a solucionar.

Desarrollo y despliegue: Al ofrecer una arquitectura reducida a una única red neuronal, los sistemas E2E permiten una simplificación significativa de la preparación de sistemas para distintos idiomas y dominios. En comparación, los sistemas en cadena requieren entrenamientos separados de módulos complejos, y un encadenamiento adecuado de los distintos módulos que requiere un esfuerzo específico de adaptación y desarrollo. Los sistemas E2E facilitan así el despliegue ágil que requieren los numerosos escenarios distintos que ocurren en la práctica.

Eficiencia: Por la simplificación de arquitectura, los sistemas E2E pueden ofrecer una mayor eficiencia computacional, en términos de espacio y tiempos de procesamiento. Aunque sea posible en teoría desarrollar redes para sistemas E2E similares en complejidad a la suma de los componentes de sistemas encadenados, no suele ser el caso en la práctica y la eliminación de los componentes intermedios suele proveer una mejora a nivel de eficiencia.

Estas ventajas potenciales de los sistemas E2E han impulsado su presencia cada vez mayor en aplicaciones de tecnologías del lenguaje. Los principales sistemas comerciales en reconocimiento del habla, así como alguno de los sistemas de conversión de texto en habla, son actualmente de tipo E2E, como pueden serlo también los sistemas de traducción automática adoptados en los ámbitos académicos y comerciales.

En el campo de la traducción del habla se han desarrollado muestras del potencial de la tecnología, como se describe en más detalle en la siguiente sección. En ciertas condiciones, los sistemas E2E iniciales pueden lograr resultados superiores a los obtenidos con sistemas estándar en cadena, y se considera una tecnología clave para el desarrollo de sistemas de traducción automática habla-habla y habla-texto, cuyas aplicaciones son múltiples y están en alta demanda. En comunidades multilingües como la Comunidad Autónoma Vasca o la Unión Europea, por ejemplo, las necesidades de comunicación multilingüe se extienden a toda la red socioeconómica, con una presencia impactante de barreras lingüísticas que frenan la presencia cultural, la igualdad de idiomas y los desarrollos económicos.

Pese a resultados preliminares de cierto éxito con sistemas E2E de traducción del habla, los sistemas en cadena suelen obtener actualmente mejores resultados que los sistemas E2E en cuanto a calidad de traducción habla-texto en la mayoría de los casos. Por lo cual, existe un reto importante en investigación y desarrollo de arquitecturas E2E que permitan alcanzar o superar de forma consistente los sistemas en cadena clásicos. Por otro lado, mientras los sistemas de traducción habla-texto dominan el campo de la traducción del habla, la traducción habla-habla neuronal directa constituye un campo de investigación poco explorado, con retos propios importantes. Por último, para ciertos pares de idiomas, la escasez de recursos lingüísticos y componentes tecnológicos son obstáculos significativos para el desarrollo de tecnología de traducción del habla de calidad.

El proyecto mintzai se propone responder a estos retos, con la investigación y el desarrollo de métodos avanzados en traducción neuronal del habla, y su validación en casos de uso centrados en el par de idiomas euskera-castellano.

3 Estado del arte

Los sistemas estándares de traducción automática del habla se basan en tres componentes principales encadenados: un sistema de reconocimiento del habla, un sistema de traducción automática, y un sistema de síntesis del habla. Estos tres componentes principales se entrenan por separado, y el procesamiento opera en cascada: la salida del reconocedor de habla alimenta el sistema de traducción automática, el cual produce una traducción en forma de texto que sirve de entrada al sistema de síntesis del habla.

Para optimizar el funcionamiento de los sistemas encadenados, la comunicación entre componentes se suele adaptar a la tarea, en particular explotando hipótesis múltiples (Ney, 1999; Matusov, Kanthak, y Ney, 2005). Otros enfoques clásicos se han centrado en métodos estadísticos y en autómatas de estados finitos, integrando los modelos acústicos y de traducción en transductores estocásticos (Vidal, 1997; Casacuberta et al., 2004).

Como fue mencionado previamente, los sistemas encadenados sufren de la acumulación de errores, y los avances a este nivel han consistido en mejorar los componentes individuales, como el reconocedor de habla, o en mejorar la conectividad entre componentes mediante rasgos específicos a la frontera entre componentes (Kumar et al., 2015).

Para resolver el problema de la propagación de errores y mejorar la calidad general de los sistemas, en trabajos recientes se ha explorado el enfoque extremo a extremo neuronal para la traducción habla-texto. Los primeros resultados obtenidos han sido prometedores con sistemas codificador-decodificador y mecanismos de atención, en particular en cuanto a la reducción de errores acumulados (Duong et al., 2016; Bérard et al., 2016; Weiss et al., 2017). En cuanto a la traducción habla-habla, los trabajos son escasos, con primeros trabajos exploratorios (Jia et al., 2019).

Pese a estos primeros resultados, donde sistemas E2E pueden superar a los sistemas encadenados en condiciones similares, en la mayoría de los casos los sistemas encadenados siguen obteniendo los mejores resultados, como muestran por ejemplo los resultados de las tareas compartidas internacionales en traducción del habla (Niehues et al., 2019). Una de las principales razones es la escasez de datos de entrenamiento paralelos para la tarea de traducción del habla, en comparación con los datos disponibles para entrenar componentes individuales. Otro factor relevante es la necesidad de mejorar las arquitecturas E2E para la traducción del habla en general.
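A modo de ilustración, y sin corresponder a código del proyecto, el contraste entre la arquitectura en cascada y la arquitectura E2E puede esbozarse así en Python, donde asr, mt, tts y modelo son interfaces hipotéticas:

    def traduccion_en_cascada(audio, asr, mt, tts):
        # Arquitectura encadenada: tres módulos entrenados por separado,
        # donde los errores de cada etapa se propagan a la siguiente.
        texto_origen = asr(audio)         # reconocimiento del habla
        texto_destino = mt(texto_origen)  # traducción automática
        return tts(texto_destino)         # síntesis del habla

    def traduccion_e2e(audio, modelo):
        # Arquitectura extremo a extremo: una única red neuronal convierte
        # la señal de entrada en la salida, sin texto intermedio explícito.
        return modelo(audio)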

4 MINTZAI

El proyecto mintzai se propone contribuir a los avances en la investigación de sistemas E2E tanto para la traducción automática habla-texto como para la traducción habla-habla. El proyecto pretende además avanzar en el estado del arte para la traducción del habla en el par de idiomas euskera-castellano, en las dos direcciones de traducción.

Tras su puesta en marcha en 2019, el proyecto ha logrado los primeros resultados resumidos a continuación:

• Creación de corpus paralelos habla-texto y habla-habla euskera-castellano y castellano-euskera a partir de los contenidos de las sesiones del Parlamento Vasco. Los corpus se compartirán con la comunidad científica bajo licencia Creative Commons by-nc.

• Investigación y desarrollo de sistemas de traducción del habla encadenados en euskera y castellano, con componentes neuronales E2E para reconocimiento del habla, traducción automática y síntesis del habla.

• Investigación y desarrollo de sistemas E2E para traducción habla-texto y habla-habla en euskera y castellano.

Los primeros resultados del proyecto son satisfactorios, en particular con la creación de un corpus relativamente amplio para la traducción del habla en un par de idiomas con bajo soporte a nivel de recursos, y de primeros sistemas encadenados y E2E de traducción del habla para este par de idiomas.

Durante el 2020, el esfuerzo se centrará en extender y mejorar los primeros sistemas desarrollados, y en validar los resultados obtenidos.

Bibliografía

Bahdanau, D., K. Cho, y Y. Bengio. 2015. Neural machine translation by jointly learning to align and translate. En Proc. of ICLR.

Bérard, A., O. Pietquin, C. Servan, y L. Besacier. 2016. Listen and translate: A proof of concept for end-to-end speech-to-text translation. En Proc. of NIPS.

Casacuberta, F., H. Ney, F. J. Och, E. Vidal, J. M. Vilar, S. Barrachina, I. García-Varea, D. Llorens, C. Martínez, S. Molau, F. Nevado, M. Pastor, D. Picó, A. Sanchis, y C. Tillmann. 2004. Some approaches to statistical and finite-state speech-to-speech translation. Comput. Speech Lang., 18(1):25–47.

Duong, L., A. Anastasopoulos, D. Chiang, S. Bird, y T. Cohn. 2016. An attentional model for speech translation without transcription. En Proc. of NAACL, páginas 949–959.

Graves, A., A.-r. Mohamed, y G. Hinton. 2013. Speech recognition with deep recurrent neural networks. En Proc. of ICASSP, páginas 6645–6649.

Jia, Y., R. J. Weiss, F. Biadsy, W. Macherey, M. Johnson, Z. Chen, y Y. Wu. 2019. Direct speech-to-speech translation with a sequence-to-sequence model. arXiv:1904.06037.

Kumar, G., G. Blackwood, J. Trmal, D. Povey, y S. Khudanpur. 2015. A coarse-grained model for optimal coupling of ASR and SMT systems for speech translation. En Proc. of EMNLP, páginas 1902–1907.

Matusov, E., S. Kanthak, y H. Ney. 2005. On the integration of speech recognition and statistical machine translation. En Proc. of Eurospeech 2005, páginas 467–474.

Ney, H. 1999. Speech translation: Coupling of recognition and translation. En Proc. of ICASSP 1999, páginas 517–520.

Niehues, J., R. Cattoni, S. Stüker, M. Negri, M. Turchi, E. Salesky, R. Sanabria, L. Barrault, L. Specia, y M. Federico. 2019. The IWSLT 2019 Evaluation Campaign. En Proc. of IWSLT.

Vidal, E. 1997. Finite-state speech-to-speech translation. En Proc. of ICASSP, páginas 111–114.

Wang, Y., R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. Le, Y. Agiomyrgiannakis, R. Clark, y R. A. Saurous. 2017. Tacotron: Towards end-to-end speech synthesis. arXiv:1703.10135.

Weiss, R. J., J. Chorowski, N. Jaitly, Y. Wu, y Z. Chen. 2017. Sequence-to-sequence models can directly translate foreign speech. arXiv:1703.08581.

Procesamiento del Lenguaje Natural, Revista nº 65, septiembre de 2020, pp. 101-104 recibido 02-04-2020 revisado 24-05-2020 aceptado 24-05-2020

MISMIS: Misinformation and Miscommunication in social media: aggregating information and analysing language

MISMIS: Desinformación y agresividad en los medios de comunicación social: agregando información y analizando el lenguaje

Paolo Rosso1, Francisco Casacuberta1, Julio Gonzalo2, Laura Plaza2, J. Carrillo2, E. Amigó2, M. Felisa Verdejo2, Mariona Taulé3, Maria Salamó3, M. Antònia Martí3 1 PRHLT-Universidad Politécnica de Valencia 2 NLP&IR- UNED 3 CLiC-Universitat de Barcelona {prosso,fcn}@dsic.upv.es; {julio, lplaza, jcalbornoz, enrique, felisa}@lsi.uned.es; {mtaule, maria.salamo, amarti}@ub.edu

Abstract: The general objectives of the project are to address and monitor misinformation (biased and fake news) and miscommunication (aggressive language and hate speech) in social media, as well as to establish a high quality methodological standard for the whole research community (i) by developing rich annotated datasets, a data repository and online evaluation services; (ii) by proposing suitable evaluation metrics; and (iii) by organizing evaluation campaigns to foster research on the above issues. Keywords: Fake news, biased information, hate speech, social media

Resumen: Los objetivos generales del proyecto son abordar y monitorizar la desinformación (noticias sesgadas y falsas) y la mala comunicación (lenguaje agresivo y mensajes de odio) en los medios de comunicación social, así como establecer un estándar metodológico de calidad para toda la comunidad investigadora mediante: i) el desarrollo de datasets anotados, un repositorio de datos y servicios de evaluación online; ii) la propuesta de métricas de evaluación adecuadas; y iii) la organización de campañas de evaluación para fomentar la investigación sobre las cuestiones mencionadas.
Palabras clave: Noticias falsas, información sesgada, mensajes de odio, medios de comunicación social

1 Introduction

Social media have become the default channel for people to access information and express ideas and opinions. The most notable and positive effect is the democratization of information and knowledge and the capacity of influence on the public opinion. Instead of a few media that control which information spreads and how (radio and TV channels, news media, etc.), any citizen can now share her views with the world and, potentially, influence the public state of opinion on any topic. But there are also undesired effects of this democratization of knowledge, and this is becoming increasingly important. One of them is that social networks foster information bubbles: every user may end up receiving only the information that matches her personal biases, beliefs, tastes and viewpoints. A perverse effect is that social networks are a breeding ground for the propagation of fake news: when a piece of news outrages us or matches our beliefs, we tend to share it without checking its veracity; and, on the other hand, content selection algorithms in social networks give credit to this type of popularity because of the click-based economy on which their businesses are based. Another harmful effect is that the relative anonymity of social networks facilitates the propagation of toxic, hate and exclusion messages. Therefore, social networks contribute, paradoxically, to misinformation and miscommunication. The main aim of this project is to alleviate these perverse effects via language analysis and the computational modelling of the propagation mechanisms for factual information, ideas and opinions in social networks.

ISSN 1135-5948. DOI 10.26342/2020-65-13 © 2020 Sociedad Española para el Procesamiento del Lenguaje Natural

1.1 Participants

The MISMIS project (PGC2018-096212-B) is coordinated by Paolo Rosso (Universidad Politécnica de Valencia, UPV) and funded by the Spanish Ministry of Science, Innovation and Universities (R+D Knowledge Generation program). It started on 1st January 2019 and it will run until 31st December 2021. These are the groups involved in this project:

• The PRHLT research center1 at UPV, which leads the MISMIS-FAKEnHATE (PGC2018-096212-B-C31) subproject. This is an experienced group in pattern recognition, machine learning, multimodality and NLP (e.g. author profiling, stance detection and automatic misogyny identification).

• The NLP&IR research group2 at Universidad Nacional de Educación a Distancia (UNED), which leads the MISMIS-BIAS (PGC2018-096212-B-C32) subproject. They contribute to the project with their experience in the area of evaluation metrics and methodologies for Information Retrieval and NLP.

• The CLiC research group3 at Universitat de Barcelona (UB), which leads the MISMIS-LANGUAGE (PGC2018-096212-B-C33) subproject. This group will contribute to this project with the development of theoretically founded linguistic models for the detection of misinformation and miscommunication, as well as with the creation of the datasets necessary for training and evaluating the systems that will be developed.

2 Hypotheses and Objectives

The two main hypotheses that will guide our research are:

H1. Misinformation and miscommunication should be linguistically modelled, not only to improve the performance of algorithms but also to provide a better understanding of these phenomena.

H2. Linguistic resources, algorithms and metrics properly developed to address misinformation and miscommunication are essential for a consistent advancement towards the general objectives of the project.

These objectives are the following:

• O1 - Misinformation. We will model and develop techniques to overcome the problem of misinformation from three angles: (i) by developing methods to extract unbiased knowledge from biased sources (what we call the wisdom of biased crowds); (ii) by developing techniques to identify fake news; and (iii) by designing summarization and visualization techniques for controversial topics, in which biases and manipulated information are exposed and the different points of view are identified and summarized contrastively.

• O2 - Miscommunication. We will develop techniques to detect and monitor aggressiveness and hate speech in social media, which contribute to creating increasingly polarized communities and to demoting argumentation in favour of pure confrontation. We will model hate speech from a linguistic point of view in different languages and we will also use multimodal signals (images and video) to complement textual evidence.

• O3 - Methodological Tools. One of the distinguishing aspects of our proposal is that we will make a substantial effort to improve research methodologies in the field, with the goal of establishing a high methodological standard for the research community. This is a horizontal objective for which we will: (i) develop rich annotated datasets for each of the relevant tasks; (ii) perform a formal study of suitable evaluation metrics that appropriately define each of the tasks and provide adequate analytical tools for evaluating system performance; (iii) develop a data repository and an online evaluation service in which metrics and datasets are centrally available for the research community; and (iv) organize evaluation campaigns using all the above tools to foster the sharing and reuse of knowledge in the research community.

3 Work progress

The problem of misinformation detection (O1) has been addressed from several perspectives: rumour detection (Ghanem et al., 2019a), fact checking (Ghanem et al., 2019b), fake news detection (Ghanem et al., 2019c) and credibility detection (Giachanou et al., 2019). False claims, and false information in general, are intentionally written to evoke emotions in the readers in an attempt to be believed and disseminated in social media.

1 https://www.prhlt.upv.es/wp/
2 https://sites.google.com/view/nlp-uned/home
3 http://clic.ub.edu/en
Linguistic resources, algorithms and has been addressed from several perspectives: metrics properly developed to address rumour detection (Ghanem et al., 2019a), fact misinformation and miscommunication are checking (Ghanem et al., 2019b), fake news essential for a consistent advancement related to detection (Ghanem et al., 2019c) and credibility the following general objectives of the project. detection (Giachanou et al., 2019). False claims, These objectives are the following: and false information in general, are • O1 - Misinformation. We will model and intentionally written to evoke emotions to the develop techniques to overcome the problem readers in an attempt to be believed and be disseminated in social media. To differentiate 1 https://www.prhlt.upv.es/wp/ between credible and non-credible claims in the 2 https://sites.google.com/view/nlp-uned/home above latter works, we incorporated emotional 3 http://clic.ub.edu/en

102 MISMIS: Misinformation and Miscommunication in social media: aggregating information and analysing language features into a Long Short Term Memory et al., 2020). The preliminary work carried out (LSTM) neural network. In order to address to detect fake news and hate speech was fake news from an author profiling perspective, covered also by the press6. currently we are organizing a shared task on Regarding the corpora developed (O2 and profiling fake news spreaders on Twitter4 (O3). O3), we have created the NewsCom corpus, Another way to overcome the problem of which contains 2,955 comments in Spanish misinformation is developing methods to posted in response to eighteen different news extract unbiased knowledge from biased articles from online newspapers. This corpus sources. Recent literature on recommender has been annotated with the focus of negation, systems has focused on the inherent biases scope and negation markers (Taulé et al., 2020). present in the data, which are amplified by Currently, we are annotating the comments of recommendation algorithms. We are currently this corpus indicating their degree of toxicity: working on real-world datasets in order to not toxic, midly toxic, toxic, very toxic. We are characterize the existence of a geographic bias enriching this annotation with new and more in the data. We are analysing the bias of current detailed criteria. This annotation allows us to state-of-the-art recommender systems and try to establish fine grained criteria for analysing and understand the visibility that is given to items. better defining what can be considered as a Next, we will address the mitigation of the bias comment with toxic, offensive, abusive or hate in the recommendation algorithms. We are also language. This corpus could be used for the analysing fairness in session-based development of algorithms for the detection of recommender systems (Ariza et al., 2020). We this kind of language. Another line of research study how the effectiveness of state-of-the-art is the creation of a corpus of fake news from algorithms (neural and non-neural approaches) data provided by the Newtral7 start-up with the is affected by specific dataset structure, due to aim to be used for training and testing different session lengths, in terms of accuracy algorithms for textual-based and multimodal and distribution of content provider exposure. fake news detection. These corpora could be Regarding the problem of used in future evaluation campaigns. miscommunication, and more concretely of hate Regarding our work on methodological speech detection, we organised the HatEval foundations of evaluation and textual similarity shared task at SemEval with Twitter corpora in (O3): (i) we have completed a study on the English and Spanish (O3) and immigrants and evaluation of ordinal classification tasks women as target of hate speech (Basile et al., (Amigó et al., 2020). Such tasks are very 2019). The twin problem of identifying relevant for NLP and for the project, but they offensive tweets (O2) in general (OffensEval are currently evaluated with metrics for shared task at SemEval) was addressed with a different problems (classification and deep-learning ensemble method (De La Peña regression). We have defined formal restrictions and Rosso, 2019). Special attention was given for the task and we have proposed a new metric, to the problem where targets of hate speech based on Information Theory and Measurement were women (Frenda et al., 2019a; 2019b). 
An Theory, which complies with all the formal extensive study of the different forms in that requisites and behaves better empirically than sexism attitudes and behaviours are found in available alternatives. (ii) We have also social media has been performed, and different completed the first stages of a formal study on methods for automatically detecting everyday similarity functions, which includes a specific sexism in online communications have been axiomatics for similarity in the context of NLP developed (Rodríguez et al., 2020). and a new similarity function, which is most Hate speech in social media contributes to robust than state of the art alternatives. (iii) We create increasingly polarized communities (Lai have addressed the task of semantic et al., 2019; 2020). At the moment, we are compositionality, comparing word2vec investigating the impact of hate speech and approaches with contextual embeddings and polarization (O2) on demoting argumentation5 when controversial issues are addressed (Esteve 6https://www.lavanguardia.com/tecnologia/2019 1207/472082286311/usan-la-ia-para-detectar- 4 https://pan.webis.de/clef20/pan20-web/author- noticias-falsas-y-mensajes-de-odio-en-redes- profiling.html sociales.html 5 http://www.argumentsearch.com/ 7 https://www.newtral.es/
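The emotional-feature approach for credibility detection mentioned above lends itself to a compact illustration. The following is a minimal sketch (not the authors' exact model): an LSTM encodes the claim text and its final state is concatenated with a per-emotion feature vector before the credibility decision; the class name, layer sizes and the shape of the emotion features are illustrative assumptions.

    import torch
    import torch.nn as nn

    class EmoCredibilityLSTM(nn.Module):
        """Sketch: LSTM text encoder + emotion features -> credible / not credible."""
        def __init__(self, vocab_size, emb_dim=100, hidden=64, n_emotions=6):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, emb_dim)
            self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
            # The classifier sees the last LSTM state plus one score per emotion.
            self.out = nn.Linear(hidden + n_emotions, 2)

        def forward(self, token_ids, emotion_feats):
            # token_ids: (batch, seq_len); emotion_feats: (batch, n_emotions)
            _, (h, _) = self.lstm(self.emb(token_ids))
            return self.out(torch.cat([h[-1], emotion_feats], dim=1))

    model = EmoCredibilityLSTM(vocab_size=20000)
    logits = model(torch.randint(0, 20000, (4, 50)), torch.rand(4, 6))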


Acknowledgments
The MISMIS project (PGC2018-096212-B) is funded by the Spanish Ministry of Science, Innovation and Universities.

References
Amigó, E., J. Gonzalo, S. Mizzaro and J. Carrillo. 2020. Effectiveness Metrics for Ordinal Classification: Formal Properties and Experimental Results. Proc. of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020).
Ariza, A., F. Fabbri, L. Boratto and M. Salamó. 2020. From The Beatles to Billie Eilish: Connecting Provider Representativeness and Exposure in Session-based Recommender Systems. Submitted to SIGIR 2020.
Basile, V., C. Bosco, E. Fersini, D. Nozza, V. Patti, F. Rangel, P. Rosso and M. Sanguinetti. 2019. SemEval-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter. Proc. of the 13th Int. Workshop on Semantic Evaluation (SemEval-2019), NAACL-HLT 2019, Minnesota, USA, 54-63.
De La Peña, G. P. and P. Rosso. 2019. DeepAnalyzer at SemEval-2019 Task 6: A Deep Learning-based Ensemble Method for Identifying Offensive Tweets. Proc. of the 13th Int. Workshop on Semantic Evaluation (SemEval-2019), NAACL-HLT 2019, Minnesota, USA, 582-586.
Esteve, M., F. Casacuberta and P. Rosso. 2020. Minería de argumentación en el Referéndum del 1 de Octubre de 2017. Procesamiento del Lenguaje Natural, 65. To appear.
Frenda, S., N. Kando, V. Patti and P. Rosso. 2019a. Stance or Insults? Proc. of the Ninth International Workshop on Evaluating Information Access (EVIA 2019), Satellite Workshop of the NTCIR-14 Conference, National Institute of Informatics, Tokyo, Japan.
Frenda, S., B. Ghanem, M. Montes-y-Gómez and P. Rosso. 2019b. Online Hate Speech against Women: Automatic Identification of Misogyny and Sexism on Twitter. Journal of Intelligent & Fuzzy Systems, 36(5): 4743-4752.
Ghanem, B., A. Cignarella, C. Bosco, P. Rosso and F. Rangel. 2019a. UPV-28-UNITO at SemEval-2019 Task 7: Exploiting Post's Nesting and Syntax Information for Rumor Stance Classification. Proc. of the 13th Int. Workshop on Semantic Evaluation (SemEval-2019), NAACL-HLT 2019, Minnesota, USA, 1125-1131.
Ghanem, B., G. Glavas, A. Giachanou, S. Ponzetto, P. Rosso and F. Rangel. 2019b. UPV-UMA at CheckThat! Lab: Verifying Arabic Claims using Crosslingual Approach. In L. Cappellato, N. Ferro, D. E. Losada and H. Müller (eds.), CLEF 2019 Labs and Workshops, Notebook Papers. CEUR-WS.org, vol. 2380.
Ghanem, B., P. Rosso and F. Rangel. 2019c. An Emotional Analysis of False Information in Social Media and News Articles. (arXiv:1908.09951). ACM Transactions on Internet Technology (TOIT). To appear.
Giachanou, A., P. Rosso and F. Crestani. 2019. Leveraging Emotional Signals for Credibility Detection. Proc. of the 42nd Int. ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '19), Paris, France.
Lai, M., M. Tambuscio, V. Patti, G. Ruffo and P. Rosso. 2019. Stance Polarity in Political Debates: a Diachronic Perspective of Network Homophily and Conversations on Twitter. Data & Knowledge Engineering, 124. https://doi.org/10.1016/j.datak.2019.101738
Lai, M., V. Patti, G. Ruffo and P. Rosso. 2020. #Brexit: Leave or Remain? The Role of User's Community and Diachronic Evolution on Stance Detection. Journal of Intelligent & Fuzzy Systems. To appear.
Rodríguez-Sánchez, F. 2019. Desarrollo de un sistema para la detección del machismo en redes sociales. Trabajo Final de Máster en Tecnologías del Lenguaje. UNED.
Taulé, M., M. Nofre, M. González and M. A. Martí. 2020. Focus of negation: its identification in Spanish. Natural Language Engineering, 1-22. https://doi.org/10.1017/S1351324920000388

Procesamiento del Lenguaje Natural, Revista nº 65, septiembre de 2020, pp. 105-108. Recibido 14-03-2020, revisado 30-05-2020, aceptado 30-05-2020.

AMALEU: Una Representación Universal del Lenguaje basada en Aprendizaje Automático

AMALEU: A Machine-Learned Universal Language Representation
Marta R. Costa-jussà
TALP Research Center, Universitat Politècnica de Catalunya, Barcelona
[email protected]

Resumen: The goal of the AMALEU project is to learn a common representation for different languages: one common representation for spoken language and one for written language. AMALEU, with a duration of two years, is funded by MINECO under the Europa Excelencia programme.
Palabras clave: Multilingual Machine Translation, Speech and Text Translation
Abstract: The objective of the AMALEU project is learning a multilingual common representation for speech and a different multilingual common representation for text. AMALEU is a two-year project funded by the MINECO.
Keywords: Multilingual Machine Translation, Speech and Text Translation

1 Project participants
The research group participating in the project belongs to the Speech group of the Department of Signal Theory and Communications of the Universitat Politècnica de Catalunya and to the TALP research centre. The principal investigator is the author of this article, and the researchers are Carlos Escolano, Gerard Gallego and Javier Ferrando.

2 Funding body
The project is funded by the Ministerio de Economía y Competitividad, and the project code is EUR2019-103819. AMALEU began on 1 January 2019 and has a duration of 24 months. The total funding is 74,850 euros.

3 Context and motivation
Why is machine translation between English and Portuguese considerably better than machine translation between Dutch and Spanish? Why do automatic speech recognisers work better for German than for Finnish? The main reason is the variation in the amount of labelled data available to train and model the systems. Although the world is multimodal and highly multilingual, speech and language technology is not able to absorb the high demand that exists across all languages. We need better learning algorithms that can exploit the progress of a few modalities and languages for the benefit of others.

This project focuses on the challenge of learning from few resources through an approach based on multilingual machine translation. AMALEU proposes to jointly train a multilingual and multimodal model that learns a universal language representation. This model will compensate for the lack of labelled data and will significantly improve the generalisation capacity of the training data by observing a variety of unlabelled resources. It will also reduce the quadratic number of translation systems to a linear one, which will have a great impact in a multilingual environment.

The challenge of this project lies in training a universal representation automatically and with deep learning. To this end, AMALEU will use an architecture based on the encoder-decoder structure. The encoder learns a representation of the input language through dimensionality reduction, which will be the universal language representation; from this abstraction, the decoder generates the output. The internal architecture of the encoder-decoder will be explicitly designed to learn the universal language abstraction, which will be integrated as an objective function of the architecture.

AMALEU will impact highly multidisciplinary communities, including computer science, mathematics, engineering and linguistics, which work together on natural language processing applications for speech and text.

4 The AMALEU project
The objective of AMALEU is to automatically learn a universal representation of language, whether speech or text, so that it can be exploited in artificial intelligence applications. Additionally, the project plans to use unlabelled information sources as well as linguistic information. Figure 1 shows the Gantt chart of the project and the main work packages. The work plan includes: management and dissemination, common multilingual representation, integration of linguistic resources, and integration and evaluation. Below, we briefly describe each of these four points of the work plan, as well as the degree of achievement of the objectives as of 1 June 2020 (month 17).

4.1 Management and Dissemination
This part of the work includes the proper management of the project, preparing the appropriate reports and adequately controlling the budget. In addition, an elaborate dissemination effort will be managed, including publications and participation in international events. The publications related to the project are cited when describing the achievement of the objectives of each task.
Achievement of the objectives (month 17). This objective is 75% complete; its results can be seen in dissemination actions such as the development of the web page of the research group and of the project (http://mt.cs.upc.edu), the participation in the organisation of international evaluations (Barrault et al., 2019), and dissemination articles in the area (Costa-jussà et al., 2020).

4.2 Common Representation of Multilingual Speech or Text
The objective here is to obtain a common representation of language in its textual form and in its spoken form, independently. The common representation of text or speech will be built on an encoder-decoder architecture. Given that multiple languages can be trained in a single model able to produce multilingual translations, we want to learn the multilingual language representation through the use of a common objective function.
Within this part of the work we propose two activities. The first is the development of the architecture for textual language. Here we will focus on investigating attention mechanisms and different intermediate representations of fixed or variable length. We will evaluate both the quality of the intermediate representation and the quality of the translation. The second activity will consist of building a common representation for spoken language. Here the research will focus on adapting the encoder so that it can accept spoken input, which is considerably longer. We will also identify attention mechanisms suitable for such lengths. The decoder part will be shared between both activities.
Achievement of the objectives (month 17). The common representation for text is considered complete, based on the following works: (Escolano, Costa-jussà, and Fonollosa, 2019; Escolano et al., 2019; Escolano et al., 2020a; Escolano et al., 2020b), where it is shown that the proposed architecture is able to bring together the representations of similar sentences in different languages. The speech part is in progress.

4.3 Linguistic Resources
The objective here is to use external linguistic information to help find the common language representation. For this, we will investigate which minimal units (words, subwords or characters) are most suitable for achieving a universal language representation. In addition, resources of different natures will be exploited, including monolingual databases.
Achievement of the objectives (month 17). We have explored the minimal units and linguistic information (Casas, Costa-jussà, and Fonollosa, 2020; Casas, Fonollosa, and Costa-jussà, 2020; Armengol-Estapé, Costa-jussà, and Escolano, 2020), as well as the use of monolingual data (Biesialska, Rafieian, and Costa-jussà, 2020). Their integration into the multilingual architecture is still pending.
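The following sketch illustrates the idea behind Sections 3 and 4.2: one encoder and one decoder per language meeting in a shared space, so that adding a language adds two modules instead of a new system per language pair. The toy pooled-embedding modules and the auxiliary loss are assumptions for illustration only, not the project's actual implementation.

    import torch
    import torch.nn as nn

    LANGS, V, DIM = ["en", "es", "de"], 10000, 256

    # One encoder and one decoder per language (2*N modules in total),
    # all reading from / writing into the same shared representation space.
    encoders = nn.ModuleDict({l: nn.EmbeddingBag(V, DIM) for l in LANGS})
    decoders = nn.ModuleDict({l: nn.Linear(DIM, V) for l in LANGS})

    def translate(src, tgt, token_ids):
        z = encoders[src](token_ids)   # universal, fixed-size representation
        return decoders[tgt](z)        # target-language logits from shared space

    # A common objective can pull parallel sentences together in that space:
    def interlingua_loss(src, tgt, src_ids, tgt_ids):
        z_s, z_t = encoders[src](src_ids), encoders[tgt](tgt_ids)
        return (1 - nn.functional.cosine_similarity(z_s, z_t)).mean()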

Figure 1: Gantt chart. Work plan

4.4 Integration/Evaluation
Regarding data, AMALEU will use multimodal data from existing and well-known databases, such as those coming from the WMT (http://www.statmt.org/wmt20/) and IWSLT (https://workshop2020.iwslt.org/) evaluations, as well as other resources available from the Linguistic Data Consortium (https://www.ldc.upenn.edu/).

Regarding evaluation, since there is no established method for evaluating the quality of the intermediate representation, we propose a new measure that uses the distance between the representations of similar sentences. This distance is combined with the evaluation of the retrieval of the original sentence from its representation to compute what we call the Similarity-with-Retrieval Measure. Likewise, we will verify that contradictory sentences have dissimilar representations (using the same distance between sentence representations), and we will combine this distance with the retrieval of the original sentence to compute what we call the Dissimilarity-with-Retrieval Measure. These measures are intended to check that similar sentences have similar representations and that contradictory sentences have distant representations. Translation quality will be evaluated with the standard BLEU method (Papineni et al., 2002).
Achievement of the objectives (month 17). The transversal nature of this task means that its achievement is measured mostly through the previous objectives. In addition, with our system we have participated in prestigious international evaluations (Casas et al., 2019), always obtaining competitive results.
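One plausible formalisation of the Similarity-with-Retrieval Measure just described is sketched below; the cosine distance, the retrieval criterion and the equal-weight combination are assumptions, since the exact formulation is not given in the text.

    import numpy as np

    def similarity_with_retrieval(pairs, candidates):
        """pairs[i] = (representation of sentence i, representation of a similar
        sentence); candidates[i] = representation of the original sentence i."""
        norm = np.linalg.norm
        sims, hits = [], 0
        for i, (z, z_sim) in enumerate(pairs):
            sims.append(z @ z_sim / (norm(z) * norm(z_sim)))   # similar pairs close?
            scores = candidates @ z / (norm(candidates, axis=1) * norm(z))
            hits += int(np.argmax(scores) == i)                # original retrieved?
        return 0.5 * (np.mean(sims) + hits / len(pairs))       # assumed combination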
5 Impact
The impact of AMALEU is reflected in an improvement in the quality and efficiency of current multilingual systems, beyond machine translators. The novel integration of multilingual and multimodal information into artificial intelligence applications proposed by AMALEU will improve their quality. It is key that systems can exploit multiple resources and can be trained from unlabelled data. As a consequence, zero-shot learning is achieved, which means that the system does not need to see examples of a specific task in order to learn it.

Extending training to multiple languages and modalities will contribute to greater generalisation, as will the exploitation of a wide variety of linguistic resources.

The universal language representation opens new horizons in tasks such as information retrieval or cross-lingual automatic summarisation. Likewise, the use of a common language representation will reduce the number of multilingual systems from N·(N-1) (where N is the number of languages) to 2·N. Moreover, our approach allows new languages to be added incrementally. This achievement leads to new, more efficient deep learning architectures, and benefits low-resource languages through languages with more resources.

Finally, the research carried out in AMALEU takes into account producing results that are fair and do not reproduce social biases (Costa-jussà, 2019; Costa-jussà, Lin, and España-Bonet, 2020; Costa-jussà et al., 2020).

Acknowledgements
This work is funded by the Ministerio de Economía y Competitividad through project EUR2019-103819 and the Ramón y Cajal programme, which includes funding from the European Regional Development Fund.

References
Armengol-Estapé, J., M. R. Costa-jussà, and C. Escolano. 2020. Enriching the transformer with linguistic and semantic factors for low-resource machine translation. ArXiv, abs/2004.08053.
Barrault, L., O. Bojar, M. R. Costa-jussà, C. Federmann, M. Fishel, Y. Graham, B. Haddow, M. Huck, P. Koehn, S. Malmasi, C. Monz, M. Müller, S. Pal, M. Post, and M. Zampieri. 2019. Findings of the 2019 conference on machine translation (WMT19). In Proceedings of the WMT, pages 1-61, Florence.
Biesialska, M., B. Rafieian, and M. R. Costa-jussà. 2020. Enhancing word embeddings with knowledge extracted from lexical resources. In ACL Student Research Workshop.
Casas, N., M. R. Costa-jussà, and J. A. R. Fonollosa. 2020. Combining subword representations into word-level representations in the transformer architecture. In ACL Student Research Workshop.
Casas, N., J. A. R. Fonollosa, and M. R. Costa-jussà. 2020. Syntax-driven iterative expansion language models for controllable text generation. ArXiv, abs/2004.02211.
Casas, N., J. A. R. Fonollosa, C. Escolano, C. Basta, and M. R. Costa-jussà. 2019. The TALP-UPC machine translation systems for WMT19 news translation task: Pivoting techniques for low resource MT. In Proceedings of the WMT, Florence, Italy.
Costa-jussà, M. R. 2019. An analysis of gender bias studies in natural language processing. Nature Machine Intelligence, 1(11):495-496.
Costa-jussà, M. R., R. Creus, O. S. Domingo, A. S. Dominguez, M. Escobar, C. I. López, M. Y. F. García, and M. Geleta. 2020. MT-adapted datasheets for datasets: Template and repository. ArXiv, abs/2005.13156.
Costa-jussà, M. R., C. España-Bonet, P. Fung, and N. A. Smith. 2020. Multilingual and interlingual semantic representations for natural language processing: A brief introduction. Computational Linguistics.
Costa-jussà, M. R., P. L. Lin, and C. España-Bonet. 2020. Gebiotoolkit: Automatic extraction of gender-balanced multilingual corpus of Wikipedia biographies. In Proc. of the LREC.
Escolano, C., M. R. Costa-jussà, and J. A. R. Fonollosa. 2019. From bilingual to multilingual neural machine translation by incremental training. In Proceedings of the ACL Student Research Workshop, pages 236-242, Florence.
Escolano, C., M. R. Costa-jussà, J. A. R. Fonollosa, and M. Artetxe. 2020a. Multilingual machine translation: Closing the gap between shared and language-specific encoder-decoders. ArXiv, abs/2004.06575.
Escolano, C., M. R. Costa-jussà, J. A. R. Fonollosa, and M. Artetxe. 2020b. Training multilingual machine translation by alternately freezing language-specific encoders-decoders. ArXiv, abs/2006.01594.
Escolano, C., M. R. Costa-jussà, E. Lacroux, and P.-P. Vázquez. 2019. Multilingual, multi-scale and multi-layer visualization of intermediate representations. In Proceedings of the EMNLP-IJCNLP: System Demonstrations, November.
Papineni, K., S. Roukos, T. Ward, and W.-J. Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of ACL, pages 311-318, Philadelphia, Pennsylvania, USA, July.

Procesamiento del Lenguaje Natural, Revista nº 65, septiembre de 2020, pp. 109-112. Recibido 01-04-2020, revisado 24-05-2020, aceptado 24-05-2020.

Transcripción, indexación y análisis automático de declaraciones judiciales a partir de representaciones fonéticas y técnicas de lingüística forense

Transcription, indexing and automatic analysis of judicial declarations from phonetic representations and techniques of forensic linguistics

Pedro José Vivancos-Vicente1, José Antonio García-Díaz2, Ángela Almela3, Fernando Molina1, Juan Salvador Castejón-Garrido1, Rafael Valencia-García2
1 Vócali Sistemas Inteligentes S.L.
2 Facultad de Informática, Universidad de Murcia
3 Facultad de Letras, Universidad de Murcia
{pedro.vivancos, fernando.molina, juans.castejon}@vocali.net
{joseantonio.garcia8, angelalm, valencia}@um.es

Resumen: Recent technological advances have made it possible to improve judicial processes for searching information in the court records associated with a case. However, when technicians and expert witnesses must review evidence stored in videos and audio fragments, they are forced to search the multimedia document manually to locate the part they want to review, which is a tedious and rather time-consuming task. To ease the work of these technicians, the present project consists of a system that allows the automatic transcription and indexing of multimedia content based on deep learning technologies, in noisy environments and with multiple speakers, as well as the possibility of performing forensic linguistics analyses on the data to help experts analyse testimonies so that evidence about their veracity is provided.
Palabras clave: Speech recognition, word spotting, forensic linguistics
Abstract: Recent technological advances have made it possible to improve the search for information in the judicial files of the Ministry of Justice associated with a trial. However, when judicial experts examine evidence in multimedia files, such as videos or audio fragments, they must manually search the document to locate the fragment at issue, which is a tedious and time-consuming task. In order to ease this task, we propose a system that allows automatic transcription and indexing of multimedia content based on deep-learning technologies in noisy environments and with multiple speakers, as well as the possibility of applying forensic linguistics techniques to enable the analysis of witness statements so that evidence on their veracity is provided.
Keywords: Speech recognition, word spotting, forensic linguistics

1 Introduction
The governments of advanced countries are carrying out a series of measures to improve and technologically renew their ministries of justice in order to improve judicial processes, thus reducing the possibility of errors, avoiding equipment obsolescence and improving the efficiency in the use of their assets (Ballesteros, 2011). Some of these measures are focused on improving the work of civil servants in public administrations by perfecting their internal communication services, modernising computer systems, adapting to teleworking and including new functionalities in their internal systems.

More concretely, there are technological systems capable of digitising on video the hearings that take place in courtrooms, such as those described in (Zalman, Rubino, and Smith, 2019). In addition, these systems provide tools for searching for information, both in the text and in the metadata and judicial files associated with a case. However, when technicians and judicial experts must review evidence stored in multimedia documents (such as videos or audio fragments), they are forced either to watch the whole video or to try playback from a certain point in order to locate the fragments considered relevant to the case. This task is tedious and consumes a considerable amount of time.

To ease the work of these technicians, it is necessary to classify and understand the content of multimedia resources efficiently. The latest technological advances make it possible to overcome these difficulties, with the consequent benefits of having an efficient indexing of multimedia content. This could shorten the already lengthy judicial proceedings, implying long-term savings and a benefit for citizens.

On the other hand, forensic linguistics, the branch of applied linguistics that studies the various meeting points between language and law, has proven useful for providing linguistic evidence in judicial processes. This discipline is becoming increasingly established in Spain and is much more widespread in Anglo-Saxon countries; it aims to identify a set of linguistic features capable of supporting the analysis of the content of statements in order to determine whether certain testimonies or declarations are likely to be truthful.

The objective of this project is the development of a system for the transcription, indexing and automatic analysis of judicial statements based on phonetic representations and forensic linguistics techniques, helping technicians and experts in their work, thus shortening the already lengthy judicial processes and improving the efficiency of the system. This document is structured as follows: Section 2 describes the functional architecture of the system as well as each of its modules, while Section 3 describes its current status and new lines of action.

2 System architecture
The system consists of three main modules: 1) a transcription system for judicial statement conversations based on deep learning (see Section 2.1), 2) a video indexing and search system (see Section 2.2), and 3) a forensic linguistics processing system based on psycholinguistic features and hesitation phenomena in discourse (see Section 2.3). In short, the system works as follows: from a set of videos and multimedia resources introduced manually, the system automatically transcribes their content into WebVTT files; secondly, these files are used to build an indexing system that makes it possible to search within the video for textual quotations or related words, or through phonetic search. Finally, experts are provided with a web interface where they can search for the exact fragment of the video while a forensic linguistics report is generated to help determine the veracity of, or doubt about, a given testimony. These subsystems are briefly described below.

2.1 Transcription system for judicial statements based on deep learning
The transcription system applies an automatic transcription process to the videos based on deep learning technologies, improving its performance in environments with ambient noise (Hannun et al., 2014), with several people speaking at once (Snyder et al., 2019), and without the need for speaker-specific training.

The final result of this first system is a set of files in WebVTT format (Pfeiffer and Hickson, 2013), a format for displaying timed text tracks used for films, streaming video, etc. This format allows metadata to be included to add information associated with the data. In this way, the transcriptions produced are enriched by adding information about the situational context, as well as which speaker produced each recorded utterance.
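As an illustration, a minimal WebVTT file with speaker-attributed cues could look like the listing below; the speakers, timings and the use of the standard <v> voice tag are illustrative assumptions, since the exact metadata fields stored by the project are not detailed here.

    WEBVTT

    NOTE Situational-context metadata can be carried in NOTE blocks
    NOTE alongside the timed, speaker-attributed transcription.

    00:00:01.200 --> 00:00:04.700
    <v Juez>Diga su nombre completo, por favor.

    00:00:05.100 --> 00:00:09.350
    <v Testigo>Hasta donde yo sé, esa tarde yo no estaba allí.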

Marker | Examples
Vagueness or exactness | Inexact quantities, dates, names, places, ...
Uncertainty or certainty | Expressions such as "a menudo", "hasta donde yo sé", ...
Distancing from the facts | Use of third-person personal pronouns, ...
Minimisation | Expressions such as "lo que pasó", "el incidente", ...
Credibility reinforcement | Adverbs such as "honestamente", "sinceramente", ...
Generalisations | References to collectives or the use of indefinite determiners, ...
Avoidant communication | Euphemisms, evasive answers, interruptions, ...
Cooperative attitude | Expressions such as "tienes razón", "es cierto", "así es"
Egocentrism | Use of first-person singular pronouns, ...
Repetition | Duplicated words or expressions such as "ambos dos"
Excessive formal language | Learned words, Latinisms, verbs in the subjunctive, ...

Table 1: Markers of deceptive discourse

Marker | Examples
Time and space | Countries, capitals, places, dates, time ranges, ...
Simplicity | Use of verbs in the simple indicative, short sentences, ...
Subjective experiences | Phenomena perceived through the senses
Order | Expressions such as "de una parte", "por un lado", "para terminar", ...
Details | Proper names, places, events, companies, professions, ...
Naturalness | Short answers

Table 2: Markers of truthful discourse

2.2 Video indexing and search system
The video indexing and search system is in charge of indexing and cataloguing each of the videos in repositories. This system makes it possible to perform precise video searches from keywords, phonemes and similar phrases, and to identify and load the video at the precise fragment. To do so, the system is able to search across different multimedia formats (AVI, MP4), to search within the information transcribed by the previous system, and also to search directly in the audio fragments through phoneme modelling.

2.3 Forensic linguistics processing system based on psycholinguistic features and hesitation phenomena in discourse
The forensic linguistics processing system based on psycholinguistic features and hesitation phenomena in discourse is in charge of the analysis and extraction of linguistically relevant features related to the analysis of the veracity of a testimony. For this purpose, a linguistic analysis system designed for Spanish has been developed, based on (Salas-Zárate et al., 2017), to evaluate different features in order to determine whether a given account is truthful or, on the contrary, deceptive, as in the study presented in (Almela, Valencia-García, and Cantos, 2012). To this end, the system analyses the discourse and offers the expert a report containing specific linguistic phenomena that reveal whether the facts are minimised, whether there are excessive negations, a collaborative attitude or slips of the tongue, or whether the discourse is coherent, consistent, simple, balanced and natural. This information is shown to the experts through an intuitive interface that highlights within the text the key words and phrases identified by the analysis.

Table 1 and Table 2 show some of the markers used to identify deceptive and truthful accounts, together with some examples of each marker. To extract information from the text based on these markers, entity recognisers, annotated corpora such as those analysed in (Jiménez-Zafra et al., 2020), and lexicons have been used.
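As a rough illustration of this marker-based extraction (not the project's actual implementation), the sketch below counts lexicon hits for one deceptive and one truthful marker from Tables 1 and 2; the mini-lexicons and the scoring are assumptions.

    import re

    # Illustrative mini-lexicons for two markers (cf. Tables 1 and 2).
    MARKERS = {
        "credibility_reinforcement": ["honestamente", "sinceramente"],  # deceptive cue
        "order": ["por un lado", "de una parte", "para terminar"],      # truthful cue
    }

    def marker_counts(statement):
        """Count occurrences of each marker's expressions in a statement."""
        text = statement.lower()
        return {marker: sum(len(re.findall(re.escape(e), text)) for e in exprs)
                for marker, exprs in MARKERS.items()}

    print(marker_counts("Honestamente, por un lado yo no estaba; para terminar..."))
    # {'credibility_reinforcement': 1, 'order': 2}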

3 Future work
We are currently in the final year of the project. The development of each of the subsystems is being completed, and different performance and reliability evaluations are being carried out.

The technologies that have given us the best results so far in conversation transcription are based on DNN-HMM models (hidden Markov models with deep neural networks) (Ravanelli and Omologo, 2017). The most effective types of neural networks are those that, by taking past elements into account, are better suited to the classification of time series, such as Time Delay Neural Networks (TDNN) (Sun et al., 2017) and Long Short-Term Memory (LSTM) networks (Zhang et al., 2016). Video indexing and search is based on keyword spotting technologies, which try to identify a series of keywords in acoustic signals. In addition, other forensic linguistics variables are being studied, taking into account the metadata of the conversation in order to measure the latency, pitch and speed of the speaker.

Finally, the modules will be integrated into a final prototype of the overall platform. In this regard, different field tests will be planned to verify the quality of the complete system.

Acknowledgements
This project has been funded by the Instituto de Fomento de la Región de Murcia with FEDER funds, within the project with reference 2018.08.ID+I.0025.

References
Almela, Á., R. Valencia-García, and P. Cantos. 2012. Detectando la mentira en lenguaje escrito. Procesamiento del Lenguaje Natural, 48:65-72.
Ballesteros, M. C. R. 2011. La necesaria modernización de la justicia: especial referencia al plan estratégico 2009-2012. Anuario Jurídico y Económico Escurialense, (44):173-186.
Hannun, A., C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, and others. 2014. Deep Speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567.
Jiménez-Zafra, S. M., R. Morante, M. Teresa Martín-Valdivia, and L. A. Ureña-López. 2020. Corpora annotated with negation: An overview. Computational Linguistics, 46(1):1-52.
Pfeiffer, S. and I. Hickson. 2013. WebVTT: The Web Video Text Tracks format. Draft Community Group Specification, W3C.
Ravanelli, M. and M. Omologo. 2017. Contaminated speech training methods for robust DNN-HMM distant speech recognition. arXiv preprint arXiv:1710.03538.
Salas-Zárate, M. P., M. A. Paredes-Valverde, M. Á. Rodríguez-García, R. Valencia-García, and G. Alor-Hernández. 2017. Automatic detection of satire in Twitter: A psycholinguistic-based approach. Knowledge-Based Systems, 128:20-33.
Snyder, D., D. Garcia-Romero, G. Sell, A. McCree, D. Povey, and S. Khudanpur. 2019. Speaker recognition for multi-speaker conversations using x-vectors. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5796-5800. IEEE.
Sun, M., D. Snyder, Y. Gao, V. K. Nagaraja, M. Rodehorst, S. Panchapagesan, N. Strom, S. Matsoukas, and S. Vitaladevuni. 2017. Compressed time delay neural network for small-footprint keyword spotting. In INTERSPEECH, pages 3607-3611.
Zalman, M., L. L. Rubino, and B. Smith. 2019. Beyond police compliance with electronic recording of interrogation legislation: Toward error reduction. Criminal Justice Policy Review, 30(4):627-655.
Zhang, Y., G. Chen, D. Yu, K. Yaco, S. Khudanpur, and J. Glass. 2016. Highway long short-term memory RNNs for distant speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5755-5759. IEEE.

Demostraciones

Procesamiento del Lenguaje Natural, Revista nº 65, septiembre de 2020, pp. 115-118. Recibido 01-04-2020, revisado 24-05-2020, aceptado 13-06-2020.

EmoCon: Analizador de Emociones en el Congreso de los Diputados

EmoCon: Emotions Analyzer in the Spanish Congress

Salud María Jiménez-Zafra, Miguel Ángel García-Cumbreras, María Teresa Martín-Valdivia, Andrea López-Fernández
SINAI, Centro de Estudios Avanzados en TIC (CEATIC), Universidad de Jaén
{sjzafra, magc, maite, alfernan}@ujaen.es

Resumen: EmoCon is a prototype of an emotion analyser for the Spanish Congress of Deputies. Its objective is to analyse the emotional profile at the level of each parliamentary session and of each deputy, based on the interventions made during the parliamentary sessions held in the Congress of Deputies. To this end, the demo has three main modules: i) automatic download of the session documents and extraction of the interventions made by each deputy, ii) analysis of the emotions expressed at the session level and at the deputy level, and iii) visualisation of the information in a web application.
Palabras clave: Emotion analysis, Spanish Congress of Deputies, politics
Abstract: EmoCon is a prototype of an emotion analyzer in the Spanish Congress. Its objective is to analyze the emotions expressed by the deputies in the interventions made during the parliamentary sessions that take place in the Spanish Congress. To this end, the demo has three main modules: i) web scraper for the session documents and processing, ii) emotion analyzer at the session level and at the deputy level, and iii) web application for visualization.
Keywords: Emotion analysis, Spanish Congress, politics

1 Introduction
Emotions are a very important aspect of our lives. What we do and what we say partly reflects the emotions we feel. The RAE dictionary defines emotion as "an intense and transitory alteration of mood, pleasant or painful, accompanied by a certain somatic commotion". From a psychological point of view, emotions are usually defined as a complex affective state, a subjective reaction that occurs as a result of physiological or psychological changes that influence thought and behaviour. These emotions can be analysed through facial expressions, voice, text, etc.

Emotion detection has been a topic of great interest in disciplines such as neuroscience and psychology, and it is currently attracting the attention of researchers in computer science. The area in charge of studying emotions at the computational level is known as Affective Computing (Cambria, 2016). In this work we focus on this area, specifically on the task of emotion recognition in texts, which consists of automatically identifying the emotions expressed in a text. In particular, we present a demo whose objective is to apply Human Language Technology techniques and tools to analyse and detect the emotions that politicians express in the session diaries of the Spanish Congress of Deputies as a result of their interventions. The purpose of this demo is to show the emotional profile of the deputies of the Congress based on their interventions.

There are several possible classifications of emotion categories (Ekman, 1992; Plutchik, 1986; Lövheim, 2012). EmoCon uses Paul Ekman's classification (Ekman, 1992), as it is one of the most widely used and has the most resources available. Ekman defines six basic emotions: anger, fear, disgust, surprise, joy and sadness.

2 Motivation
In the field of politics, emotion analysis has mainly been used to find out which attitudes and proposals of electoral candidates provoke better emotional responses in citizens, with the aim of carrying out political polling ahead of elections. Our approach is quite different and is centred on the analysis of the emotions of the parliamentarians themselves through their own content, which could in turn be related to the reaction of citizens.

3 System description
EmoCon extracts and analyses the session diaries of the deputies' interventions in the Spanish Congress of Deputies (political debate), detecting the emotions in them. It has three main modules:
1. Download of the session documents and processing
2. Automatic emotion analyser
3. Web application for the visualisation of the processed information
Figure 1 shows the general scheme of EmoCon.

Figure 1: General scheme of EmoCon

Each of these modules is detailed below.

3.1 Module 1: data download and processing
The system works with information from the interventions made in the parliamentary sessions held in the Congress of Deputies. These documents are accessible through the web portal of the Congress of Deputies (http://www.congreso.es/portal/page/portal/Congreso/Congreso/Publicaciones/DiaSes/Pleno). The documents are classified by legislature, year and month, which makes it easy to locate a specific session. In particular, the demo presented here analyses the XII legislature, i.e. the one running from 19 July 2016 to 5 March 2019, and documents from the current legislature continue to be downloaded and analysed. The system currently works with a total of 640 session documents from 384 deputies.

For the download of the information, a crawler has been implemented in Python, taking advantage of the fact that the URLs of these documents follow a pattern (http://www.congreso.es/public_oficiales/L12/CONG/DS/PL/DSCD-12-PL-[numsesion].PDF). Figure 2 shows an excerpt from one of the downloaded PDFs.

From each of the downloaded PDFs, the system extracts the deputies' interventions, text that is later analysed in Module 2 to determine the emotions expressed by each deputy. This is possible because, as can be seen in Figure 2, these documents follow a fixed pattern to introduce each intervention: "El señor"/"La señora" + name in bold capital letters + colon + the deputy's intervention.
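A minimal sketch of this download-and-extraction step could be the following; it assumes an underscore in public_oficiales (printed with a space in the URL above), a hypothetical pdf_to_text helper for the PDF conversion, and an illustrative regular expression for the speaker pattern.

    import re
    import urllib.request

    URL = "http://www.congreso.es/public_oficiales/L12/CONG/DS/PL/DSCD-12-PL-{}.PDF"

    def download_session(num):
        """Fetch the plenary session document for session number `num`."""
        with urllib.request.urlopen(URL.format(num)) as response:
            return response.read()

    # Interventions start with "El señor"/"La señora" + NAME IN CAPITALS + colon.
    SPEAKER = re.compile(r"(?:El señor|La señora) ([A-ZÁÉÍÓÚÜÑ .\-]+):")

    def split_interventions(text):
        """Pair each speaker with the text up to the next speaker marker."""
        parts = SPEAKER.split(text)
        return list(zip(parts[1::2], parts[2::2]))

    # pdf_to_text (PDF-to-plain-text conversion) is assumed here:
    # interventions = split_interventions(pdf_to_text(download_session(10)))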

Figure 2: Excerpt from a session PDF

3.2 Module 2: emotion analysis
This module is in charge of analysing the emotions present in the texts extracted from each of the parliamentary sessions. The system performs two types of analysis: one at the session level and another at the deputy level.

To analyse the emotions, it uses the six-emotion Ekman model discussed previously and an automatic lexicon-based classifier. The classifier makes use of the "Spanish Emotion Lexicon", known as SEL (Sidorov et al., 2012), which consists of 2,036 Spanish words associated with at least one basic emotion (anger, disgust, fear, joy, sadness and surprise) through a Probability Factor of Affective use (PFA), and it incorporates a heuristic for the treatment of negators and modifiers.

The session-level emotion analyser computes the general emotional profile of the session and, for each deputy, determines the percentage to which they expressed each of the six Ekman emotions. First, it preprocesses the text associated with each deputy: i) tokenisation using regular expressions, ii) stemming with the NLTK Snowball stemmer (http://www.nltk.org/_modules/nltk/stem/snowball.html), iii) stop-word removal, and iv) lowercasing. Secondly, it classifies the emotions present in the text as follows, where anterior is the previous token:

    diputado = {'ira': 0, 'asco': 0, 'miedo': 0,
                'alegría': 0, 'tristeza': 0, 'sorpresa': 0}
    anterior = None
    for palabra in texto:
        if palabra in lexicon:
            if anterior in negadores:          # negation cancels the word
                peso = 0
            elif anterior in incrementadores:  # intensifier boosts it
                peso = 1.75
            elif anterior in atenuantes:       # attenuator dampens it
                peso = 0.75
            else:
                peso = 1
            for emocion, fpa in lexicon[palabra].items():
                diputado[emocion] += peso * fpa
        anterior = palabra
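For context, SEL maps each word to one or more emotions with a PFA value, so the loop above can be exercised with a toy lexicon like the one below; the words, PFA values and modifier lists are made up for illustration, not taken from SEL.

    lexicon = {'esperanza': {'alegría': 0.6},
               'crisis': {'tristeza': 0.5, 'miedo': 0.4}}   # invented PFA values
    negadores, incrementadores, atenuantes = {'no'}, {'muy'}, {'poco'}
    texto = ['no', 'esperanza', 'muy', 'crisis']
    # After the loop: diputado['alegría'] == 0.0 (negated), while
    # diputado['tristeza'] == 0.875 and diputado['miedo'] == 0.7 (intensified).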
The deputy-level emotion analyser computes the deputy's emotional profile based on their interventions in all the sessions, and the percentage to which they have expressed each of the six emotions over time.

3.3 Module 3: web application for visualisation
Once the information has been extracted, processed and analysed, a web application was developed to display it. This application makes use of the Flask framework for Python (https://www.fullstackpython.com/flask.html), of the HTML, CSS and JavaScript languages, and of the Google Charts tool (https://developers.google.com/chart/).

Figure 3: EmoCon, main page

As mentioned for Module 2, the emotion analyser allows two types of analysis: one at the session level and another at the deputy level (Figure 3).

Through the session option, the web application allows filtering by session date, consulting the PDF of each session, and visualising the complete emotional profile of the session (Figure 4) and of a specific deputy on that day (Figure 5). The general profile of the session shows a pie chart indicating the percentage to which each emotion was present that day, together with six rankings, one per emotion, listing the five deputies who expressed each emotion the most. The profile of the selected deputy shows a pie chart with the percentages to which they expressed the different emotions in the session held. In addition, a bar chart allows comparing the deputy's emotions on that specific day with the emotions they usually show, i.e., with their habitual profile.

Through the deputies option it is possible to select a member of the parliamentary groups (Figure 6), accessing their habitual emotional profile by means of a pie chart and a history chart showing the evolution of this profile over time (Figure 7). Moreover, a specific day can be selected in the history chart to access the corresponding session.

4 Conclusions and future work
In this work we have presented EmoCon, a prototype for analysing the emotions expressed in the Spanish Congress of Deputies, with the objective of obtaining and showing the emotional profile at the level of each parliamentary session and of each intervening deputy.

Figure 6: EmoCon, analysis of a deputy: selection of the parliamentary group

Figure 4: EmoCon, analysis of a session: complete profile of the session

Figure 7: EmoCon, analysis of a deputy: general profile of a deputy

Figure 5: EmoCon, analysis of a session: profile of a deputy

As possible future work for this system, we plan to study other emotion analysis methods and to analyse the current legislature. It would also be interesting to add a topics section reflecting the subjects addressed in each session, so as to see which emotions are generated depending on what is being discussed. Likewise, we want to analyse the profiles of users and political parties to check whether the members of the same party have a similar profile, analysing their interventions in the Congress and what they publish on social media.

Acknowledgements
This work has been partially funded by the European Regional Development Fund (FEDER) and the LIVING-LANG project (RTI2018-094653-B-C21) of the Spanish Government.

References
Cambria, E. 2016. Affective computing and sentiment analysis. IEEE Intelligent Systems, 31(2):102-107.
Ekman, P. 1992. An argument for basic emotions. Cognition & Emotion, 6(3-4):169-200.
Lövheim, H. 2012. A new three-dimensional model for emotions and monoamine neurotransmitters. Medical Hypotheses, 78(2):341-348.
Plutchik, R. and H. Kellerman. 1986. Emotion: Theory, Research and Experience. New York: Academic Press.
Sidorov, G., S. Miranda-Jiménez, F. Viveros-Jiménez, A. Gelbukh, N. Castro-Sánchez, F. Velásquez, I. Díaz-Rangel, S. Suárez-Guerra, A. Treviño, and J. Gordon. 2012. Empirical study of machine learning based approach for opinion mining in tweets. In Mexican International Conference on Artificial Intelligence, pages 1-14. Springer.

Procesamiento del Lenguaje Natural, Revista nº 65, septiembre de 2020, pp. 119-122. Recibido 30-03-2020, revisado 24-05-2020, aceptado 12-06-2020.

Nalytics: Natural Speech and Text Analytics

Nalytics: Analíticas del Habla y Texto Naturales

Ander González-Docasal1, Naiara Pérez1, Aitor Álvarez1, Manex Serras1, Laura García-Sardiña1, Haritz Arzelus1, Aitor García-Pablos1, Montse Cuadros1, Paz Delgado2, Ane Lazpiur2, Blanca Romero2
1 Vicomtech Foundation, Basque Research and Technology Alliance (BRTA), Mikeletegi 57, 20009 Donostia-San Sebastián (Spain), +34 943 309 230
{agonzalezd, nperez, aalvarez, mserras, lgarcias, harzelus, agarciap, mcuadros}@vicomtech.org
2 Natural Vox S.A.U., Bulevar de Salburua Kalea 8, 01002 Vitoria-Gasteiz (Spain), +34 945 227 200
{pdelgado, ane, bromero}@naturalvox.com

Abstract: Call centres have long demanded technology for analysing the data they manage. In this context, we present Nalytics, a platform that integrates Speech and Text Analytics in Spanish in a modular design, capable of customising its models to the users' demands.
Keywords: Speech Analytics, NLP, Deep Learning, Call Centres
Resumen: Los centros de llamadas han demandado durante años soluciones tecnológicas para analizar los datos que gestionan. En este contexto, presentamos Nalytics, una plataforma que integra el análisis de voz y texto en español en un diseño modular capaz de personalizar sus modelos a demanda del cliente.
Palabras clave: Analítica del Habla, PLN, Aprendizaje Profundo, Centros de Llamadas

1 Introduction
Traditionally, call centres have addressed their conversational quality audits manually, by means of agents who manage to analyse 5-10% of the recorded material at best. Therefore, the market has long demanded Speech Analytics tools for the automatic analysis of conversational content.
Nowadays, companies which offer Speech Analytics services can be categorised into two groups: a) those who offer a black box flow that handles the client directly and comes with numerous functionalities and advanced technology, but through solutions in the cloud; and b) those who offer more adapted services but rely on more traditional search technology, such as word spotting, and work in batch mode.
Nalytics combines the best of both cases. It includes advanced technology based on machine and deep learning techniques, operates in both batch and online modes, can be deployed as a Software as a Service (SaaS) or on premise, and offers the possibility of adapting its technological modules to the application domain of the customers. Nalytics integrates technological modules to perform Speech Analytics, including speaker diarisation and rich transcription as its main components, and Text Analytics, including segmentation and normalisation, sentiment analysis, and text classification. Within its modular design, the solution allows different combinations of these modules considering the users' needs, resulting in a powerful platform for the automatic analysis of call centres' content in Spanish.
The rest of the paper is structured as follows: Section 2 introduces a general overview of the platform; Section 3 describes the front-end application developed for the use of the platform; Sections 4 and 5 explain the technological modules for speech and text analytics, respectively; Section 6 describes the process of including new models in the solution; and finally, Section 7 adds some conclusions to this work.

2 General architecture
Nalytics is a back-end service built in Python and hosted on a Linux system. It is integrated as a web service inspired by the REST architecture, in which the communication with the REST client is performed through HTTP requests and, if necessary, JSON objects sent to a specific address (i.e., IP and port). Once a request is received, the server responds with an informative JSON object and an HTTP status code. The REST platform is complemented by a physical repository where all the data, information, and results are stored, and a database with informative web services to retrieve previously sent information or the status of ongoing requests and processes. A minimal schematic diagram of the architecture of Nalytics is shown in Figure 1.
In order to service multiple customers that may have contracted different licences, all the HTTP requests are put inside a priority queue in Nalytics, where the management of the individual priorities is left to the REST client.
Additionally, Nalytics supports multiple request methods for data analysis, including synchronous and asynchronous methods over offline and online decoding. In synchronous requests, if the request is successful, the master server waits for the result and returns it directly to the REST client as part of the HTTP response; however, if a task takes more than a previously defined configurable time, it returns a timeout error. In asynchronous requests, the tasks are executed in the background depending on the available hardware resources, and the result may be retrieved later using an informative service. In offline decoding, the platform loads the corresponding model in memory and frees it once the result is returned. On the contrary, in online decoding, the model is previously loaded into an API, which is assigned a free port and remains attending the incoming requests; this allows it to operate in a real-time scenario with a very low latency.

Figure 1: Minimal schematic diagram of the architecture of Nalytics (User, front-end and REST client communicating via HTTP/JSON with the back-end REST server, model APIs, repository and database)
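To make the described request flow concrete, the following sketch shows how a REST client might submit an asynchronous task and later poll the informative service. The base URL, routes and payload fields are hypothetical: the paper does not document the actual Nalytics API.

    import requests

    # Hypothetical endpoint; only the flow (JSON over HTTP, priority queue,
    # asynchronous retrieval) follows the description above.
    BASE = "http://nalytics.example.com:8080"

    # Asynchronous request: the server answers immediately with an informative
    # JSON object, and the task is queued with a client-chosen priority.
    resp = requests.post(BASE + "/analyse",
                         json={"audio_id": "call-0001",
                               "modules": ["sad", "diarise", "asr"],
                               "mode": "asynchronous", "priority": 5})
    task = resp.json()

    # The informative web service can later be polled for the status or result.
    result = requests.get(BASE + "/status/" + task["task_id"]).json()
    print(result)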

Finally, the technological modules can be applied individually or in a pipeline, in what is known as a combo decoding. A combo decoding involves the execution of all the modules in chain, from the audio preprocessing and the speech transcription to the text analysis (segmentation and normalisation, sentiment analysis or classification). The accepted request module sequences are shown in Figure 2. The longest decodings are treated as asynchronous requests; if so, the client is able to retrieve the result from the repository once the request is completed.

Figure 2: Accepted request sequences in the submodule inference graph (SAD, Diarise, ASR, Cap., Punct., Segment, Normalise, Sentiment, Classify). Every module is a starting and an ending node in the graph.

3 Front-end application
Thanks to the RESTful API of Nalytics, it can be easily integrated with any front-end application. This application can serve as the main interface for the communication between the user and the back-end. In this project, the front-end application was developed by the client company Natural Vox, and it is currently being served as a SaaS (Software as a Service) solution to a number of user clients. A screenshot of the current front-end application is presented in Figure 3.

Figure 3: The Natural Vox front-end application for Nalytics

The current front-end solution from Natural Vox is a web application that the users can access through any web browser. The main objective is to provide a powerful tool to analyse speech and text data from call centres. The users can analyse audios by means of the rich transcription and speaker diarisation modules, whilst input texts or draft transcriptions can be classified or analysed at sentiment level. In addition, new categories for classification can be incorporated, adapting the modules to their domain and contents. It is worth mentioning that a rule-based anonymisation module is also included to deal with personal data. Finally, the results are generated in CSV format, which are later converted into formal reports using Qlik (https://www.qlik.com).

4 Speech Analytics
This section presents the speech analysis modules in more detail, grouped as Audio preprocessing and Speech transcription.
4.1 Audio preprocessing
Firstly, the incoming audio data from the telephonic domain (e.g., raw format compressed with A-law, µ-law or G.729 codecs) is transcoded to a known audio format for Nalytics (PCM WAV 16 kHz and 16 bits). Then, the audio can be optionally processed by the Speech Activity Detection (SAD) and Speaker Diarisation modules, with the aim of discarding non-speech segments and performing speaker segmentation and clustering, respectively. Both modules have been developed using the Kaldi toolkit (Povey et al., 2011).

4.2 Speech transcription
The core of any Speech Analytics solution relies on the Automatic Speech Recognition (ASR) engine. Two different ASR architectures have been integrated into Nalytics. The first was estimated with the Kaldi toolkit and is composed of a hybrid Long-Short Term Memory-Hidden Markov Model (LSTM-HMM) acoustic model, where unidirectional LSTMs were trained to provide posterior probability estimates for the HMM states. In addition, a modified Kneser-Ney smoothed 3-gram language model, trained with the KenLM toolkit (Heafield, 2011), is used for decoding. The second system corresponds to an E2E neural network based on Baidu's Deep Speech 2 architecture (Amodei et al., 2016), with a Kneser-Ney smoothed 5-gram used to rescore the hypotheses. Both acoustic and language models were estimated with the SAVAS corpus (del Pozo et al., 2014), transferred through land- and mobile-lines and augmented, as explained in (Bernath et al., 2018).
The automatic transcription can then be optionally sent through the capitalisation and punctuation modules, which make use of bidirectional Recurrent Neural Networks (RNNs) with an attention mechanism (Tilk and Alumäe, 2016) and a recasing model trained with the Moses toolkit (Koehn et al., 2007), respectively.
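As an aside on the n-gram rescoring step, the sketch below shows how KenLM's Python bindings can score competing ASR hypotheses; the model path and the candidate sentences are illustrative, not the paper's actual SAVAS data.

    # An ARPA model can be estimated beforehand with KenLM's lmplz tool, e.g.:
    #   lmplz -o 3 < corpus.txt > model.arpa
    import kenlm

    lm = kenlm.Model("model.arpa")
    hypotheses = ["buenos dias en que puedo ayudarle",
                  "buenos dios en que puedo ayudarle"]
    # score() returns the log10 probability of the full sentence; the
    # best-scoring ASR hypothesis is kept.
    best = max(hypotheses, key=lm.score)
    print(best)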
5 Text Analytics
This section introduces the different modules for text analysis available in Nalytics: Text preprocessing, Text classification and Sentiment analysis.

5.1 Text preprocessing
The text preprocessing module can be subdivided into two different tasks: normalising the text, that is, correcting misspelling errors; and segmenting, or dividing each sentence into spans with potentially different polarities (e.g., "I loved the service but my problem was not solved" starts with positive polarity but ends with a negative remark).
The normalisation module is built on Hunspell (https://hunspell.github.io). Through configurable dictionaries and morphotactic rules, it detects possible typographic and orthographic errors and suggests correction candidates for each. Then, the actual correction is chosen using heuristics that involve Levenshtein's distance and knowledge about frequent orthographic errors in Spanish.
With respect to segmentation, the module first locates in the input text adversative conjunctions and phrases that may indicate contrast (e.g., "but", "on the contrary", etc.). Then, it obtains the dependency tree of the text using spaCy (https://spacy.io/api/dependencyparser). Finally, the text is segmented via heuristics based on these two pieces of information: each segment is a phrase or clause headed by the token an adversative phrase depends on. This module's behaviour can also be configured by incorporating new segmentation rules through configuration files.
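A minimal sketch of the normalisation strategy just described, assuming the pyhunspell bindings and standard es_ES dictionary paths; the candidate-selection heuristic here only uses the edit distance, whereas Nalytics also exploits knowledge about frequent Spanish orthographic errors.

    import hunspell

    def levenshtein(a, b):
        # Classic dynamic-programming edit distance.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                               prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    spell = hunspell.HunSpell("/usr/share/hunspell/es_ES.dic",
                              "/usr/share/hunspell/es_ES.aff")

    def normalise(word):
        if spell.spell(word):
            return word
        candidates = spell.suggest(word)
        # Keep the suggestion closest to the observed misspelling.
        return min(candidates, key=lambda c: levenshtein(word, c)) if candidates else word

    print(normalise("telefono"))  # e.g. "teléfono"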
5.2 Text classification
Nalytics offers a diverse set of classification algorithms in an attempt to cover a wide range of application scenarios –read data availability, computational power, and so on–: a) Multinomial Naive Bayes (MNB) and Support Vector Machines (SVM), as implemented in scikit-learn (https://scikit-learn.org); b) spaCy's ensemble implementation of Convolutional Neural Networks (CNN) (https://spacy.io/api/textcategorizer#architectures); c) finally, a learner for sequence-labelling tasks based on Conditional Random Fields (CRF), from sklearn-crfsuite (https://sklearn-crfsuite.readthedocs.io).

5.3 Sentiment analysis
The sentiment analysis module is a specialisation of the Text classification module; that is, it offers the same text classification model learners and decoding functionalities, but oriented towards the sentiment analysis task. Specifically, the classifiers benefit from a rich set of features well known for improving sentiment classification results, such as Brown clusters and gazetteer lookup. In addition, this module takes special advantage of the Text preprocessing module, which divides clauses with potentially different polarity.
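To give a sense of the first classifier family named above, here is a minimal scikit-learn MNB pipeline; the tiny training set is invented for illustration only and says nothing about Nalytics' actual features or data.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    texts = ["me encanta el servicio", "el servicio es pésimo",
             "muy contento con la atención", "no resolvieron mi problema"]
    labels = ["positive", "negative", "positive", "negative"]

    # TF-IDF features feed a Multinomial Naive Bayes classifier.
    clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
    clf.fit(texts, labels)
    print(clf.predict(["la atención fue excelente"]))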
6 Adapted models acquisition
As previously stated, Nalytics includes mechanisms for incorporating new models into the internal database. In the scope of speech recognition, externally created models can be transferred by placing them in the repository and executing a single HTTP request that registers the model in the database. This is implemented both for Kaldi and E2E models, whilst it is not supported for the SAD and diarisation models yet. New rules for the text preprocessing segmentation submodule can be registered in a similar way.
In the case of Text Analytics, new models can be trained directly from Nalytics. If the training process is successful, the resulting model is automatically registered in the database and placed in the repository for future use. Nalytics accepts all the training parameters exposed by the aforementioned libraries (Section 5.2).
These interesting functionalities allow Natural Vox to offer customised models to their user clients, considering that all the technological modules included in Nalytics perform better if they are adapted to the application domain.

7 Conclusions
This work presents a new solution for Speech and Text Analytics. Nalytics takes advantage of the latest modelling paradigms to construct a back-end platform that can be easily integrated with any REST-based client. Its operation modes, the available request methods, its modularity, and its powerful functionalities, along with the variety of its technological modules, turn Nalytics into an attractive solution that is already being exploited in the market.

References
Amodei, D., S. Ananthanarayanan, R. Anubhai, et al. 2016. Deep Speech 2: End-to-End Speech Recognition in English and Mandarin. In Proceedings of ICML 2016, volume 48, pages 173–182.
Bernath, C., A. Álvarez, H. Arzelus, and C. D. Martínez. 2018. Exploring E2E speech recognition systems for new languages. In Proceedings of IberSPEECH 2018, pages 102–106.
del Pozo, A., C. Aliprandi, A. Álvarez, et al. 2014. SAVAS: Collecting, annotating and sharing audiovisual language resources for automatic subtitling. In Proceedings of LREC 2014, pages 432–436.
Heafield, K. 2011. KenLM: Faster and smaller language model queries. In Proceedings of WMT 2011, pages 187–197.
Koehn, P., H. Hoang, A. Birch, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL, pages 177–180.
Povey, D., A. Ghoshal, G. Boulianne, et al. 2011. The Kaldi Speech Recognition Toolkit. In Proceedings of ASRU 2011.
Tilk, O. and T. Alumäe. 2016. Bidirectional Recurrent Neural Network with Attention Mechanism for Punctuation Restoration. In Proceedings of INTERSPEECH 2016.


RESIVOZ: Dialogue System for Voice-based Information Registration in Eldercare

RESIVOZ: Sistema de Diálogo para el Registro de Información de Cuidado de Mayores mediante Voz

Laura García-Sardiña, Manex Serras, Arantza del Pozo, Mikel D. Fernández-Bhogal
Vicomtech Foundation, Basque Research and Technology Alliance (BRTA)
Mikeletegi 57, 20009 Donostia-San Sebastián (Spain), +[34] 943 30 92 30
{lgarcias, mserras, adelpozo, mfernandez}@vicomtech.org

Abstract: RESIVOZ is a spoken dialogue system aimed at helping geriatric nurses easily register resident caring information. Compared to the traditional use of computers installed at specific control points for information recording, RESIVOZ's hands-free and mobile nature allows nurses to enter their activities in a natural way, when and where needed. Besides the core spoken dialogue component, the presented prototype system also includes an administration panel and a mobile phone App designed to visualise and edit resident caring information.
Keywords: Gerontechnology, Spoken Dialogue Systems, Eldercare, Information Registration
Resumen: RESIVOZ es un sistema de diálogo orientado a ayudar a gerontólogos a registrar fácilmente información sobre sus cuidados a los residentes. En comparación con el uso tradicional de ordenadores instalados en puntos de control específicos para registrar la información, la naturaleza manos-libres y móvil de RESIVOZ permite al personal gerontológico registrar sus actividades de forma natural, donde y cuando se necesite. Además del principal componente de sistema de diálogo hablado, el prototipo de sistema también incluye un panel de administración y una aplicación móvil diseñada para visualizar y editar la información de cuidados a residentes.
Palabras clave: Gerontotecnología, Sistemas de Diálogo Hablado, Cuidado de Mayores, Registro de Información

1 Introduction
Spoken Dialogue Systems (SDS) are increasingly being integrated and deployed into a wide range of devices we can find in our daily lives. Many smart phones, speakers, and cars come with conversational assistants (e.g. Alexa, Siri, Google Assistant, Cortana) which can provide us with useful information (e.g. weather forecasts) and help us with various simple tasks (e.g. playing music). Because of their popularity for infotainment purposes, the application of SDS is also in demand and being explored in professional fields like healthcare (Laranjo et al., 2018). In particular, the use of conversational agents to support different kinds of patient populations (e.g. the elderly, the chronically ill) with health tasks is an emerging field of research with several projects underway, like EMPATHIC (www.empathic-project.eu), MENHIR (www.menhir-project.eu) or SHAPES (cordis.europa.eu/project/id/857159).
The use case presented in this paper is also linked to healthcare and aims to facilitate the registration of resident information by geriatric nurses throughout a work shift. The implemented SDS provides them with a hands-free, natural, and flexible interface to register resident information in a nursing home management system, allowing them to devote more time to practice people-based care.
The paper is structured as follows: Section 2 presents our application use case; Section 3 describes the system's components; Section 4 displays a sample dialogue showing different interaction situations; and Section 5 gives some conclusions and future work directions.

2 Use Case
The application scenario is that of an elderly care residence where geriatric nurses need to register information regarding the activities they carry out with each resident into a management system. Such activities include: oral hygiene, shaving, grooming, changing diapers, showering, checking overall appearance, cutting and checking fingernails, performing postural changes, checking bowel movements, and checking how much residents eat and drink in each meal. Sometimes, nurses also need to add observations to these activities and follow-up comments for each resident, so that other carers in following shifts are informed about particular incidents.
One of the issues with the traditional computer-based systems is that there is usually just one shared computer per floor or unit. Geriatric nurses usually do not have the time to stop by and register activities as they are completed, so they tend to wait until the end of their shift to enter all the activities carried out throughout their working day. This entails two main problems: on the one hand, queues are generated, since all geriatric nurses want to register all the information simultaneously at the end of their shift; on the other hand, the information introduced may not be entirely correct due to the difficulty of remembering all activities carried out with each resident throughout a shift.
The proposed solution integrates a mobile spoken dialogue system that allows geriatric nurses to introduce the information in an easy, hands-free way anywhere and anytime throughout their working day. Given the nature of the geriatric nurses' work, the solution's selected hardware needs to be: (i) ergonomic: it needs to be lightweight, wireless and wearable, since it needs to be carried throughout the whole shift; (ii) safe: some wearable devices like smartwatches can cause abrasion injuries on residents' skin and should be avoided; and (iii) discreet: audio capturing and playing should not be invasive nor interfere with social interactions.

3 System Architecture
Figure 1 shows a diagram of RESIVOZ's architecture. To comply with the identified requirements, the selected hardware devices include (i) a single-earphone plus short microphone Bluetooth headset, and (ii) a lightweight, small smartphone. The microphone has soft noise cancelling and a mute option, so geriatric nurses can disable audio recording when talking to someone other than the system. The smartphone allows mobile audio capturing, App usage, and response generation using its built-in Text-To-Speech (TTS) functionalities.
As for software, the developed prototype system consists of four main components: 1) a Spoken Dialogue System that allows user-system interaction, 2) a Mobile Phone App for information visualisation and edition, 3) a Control Panel Interface to manage users, shifts, and resident caring activities, and 4) a Domain Database where all the necessary information is stored.

3.1 Spoken Dialogue System
This is RESIVOZ's core component, acting as a voice-based interaction bridge between the final users and the rest of the system. The SDS receives the information to save and lets the users know whether it has been correctly logged or there has been some problem in the process: there is no identified resident, the information is incomplete or invalid for the current shift, there has been a database connection issue, or the system could not parse the user's input. The implemented SDS is a pipeline of four submodules:
(1) Automatic Speech Recognition (ASR) In this module, the audio stream captured by the headset goes to an energy-based Voice Activity Detector, which separates speech from non-speech segments and sends the former to a Speech-To-Text (STT) service. Here, Google Cloud's STT API service was integrated after introducing some custom domain phrases, words, and names as contexts for domain adaptation (cloud.google.com/speech-to-text/docs/context-strength; data sharing options were disabled in compliance with the GDPR).
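A minimal sketch of an energy-based voice activity detector like the one described above; the frame length and threshold are arbitrary assumptions, not the system's tuned values.

    import numpy as np

    def vad(samples, rate=16000, frame_ms=30, threshold=0.01):
        """Return (start, end) sample indices of frames whose energy exceeds the threshold."""
        frame = int(rate * frame_ms / 1000)
        segments = []
        for i in range(0, len(samples) - frame, frame):
            chunk = samples[i:i + frame]
            energy = float(np.mean(chunk.astype(np.float64) ** 2))
            if energy > threshold:
                segments.append((i, i + frame))
        return segments

    # Example with synthetic audio: one second of silence, then a louder burst.
    audio = np.concatenate([np.zeros(16000),
                            0.5 * np.sin(np.linspace(0, 3000, 16000))])
    print(len(vad(audio)))  # number of speech frames detected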
less and wearable, since it needs to be car- (2) Spoken Language Understanding ried throughout the whole shift; (ii) safe, (SLU) For this module, given that real do- some wearable devices like smartwatches can main data was not available, Phoenix gram- cause abrasion injuries on residents’ skin and mars (Ward, 1991) were developed to parse should be avoided; and (iii) discreet, audio the transcribed geriatric nurses’ input into se- capturing and playing should not be invasive mantic codifications of act-slot-value sets. A nor interfere with social interactions. hierarchical taxonomy was defined covering the domain. It includes four acts (identify, 3 System Architecture inform, deny, delete), 34 slots, and 28 fixed Figure 1 shows a diagram of RESIVOZ’s values (as named entities, resident names architecture. To comply with the identi- are considered variable values and not in- fied requirements, the selected hardware de- cluded in the taxonomy). Apart from parsing vices include (i) a single-earphone plus short 4cloud.google.com/speech-to-text/docs/ microphone Bluetooth headset, and (ii) a context-strength. Data sharing options were lightweight, small smartphone. The micro- disabled in compliance with the GDPR 124 RESIVOZ: dialogue system for voice-based information registration in eldercare

Figure 1: Architecture of the RESIVOZ system

(3) Dialogue Manager (DM) Decision making and interaction planning is based on Attributed Probabilistic Finite State Bi-Automata (Torres, 2013), which store the latest user input, system output, and attributes (i.e. dialogue memory). The latter store task-sensitive information such as whether a resident is identified, or if there is any ambiguity with the provided information (Serras, Torres, and del Pozo, 2019). The DM is also responsible for reading and writing information in the Domain Database and for modifying attributes of the dialogue state. To leverage the lack of data, the next action is selected using if-this-then-that rules.
(4) Natural Language Generation (NLG) Natural language text templates were generated for each possible system action so as to transmit an adequate response to users in a human-interpretable manner.

3.2 Mobile Phone App
An Android App was developed to serve as a front-end and access point for the geriatric nurses. This is meant to be used at sporadic times, since nurses need to have both hands free while on their caring duties. When they log in, the information on their shift is shown: their assigned resident group, daily tasks, previous follow-up comments on residents, and so on. Also, a TCP connection is established with the server, starting the audio stream. The application has three purposes: 1) to be a gateway to easily display and modify the information that is being logged into the database; 2) to handle the socket connection and send the audio stream to the server where the back-end is located; and 3) to play the text responses given by the SDS using the phone's built-in TTS.

3.3 Control Panel Interface
A control panel web interface was developed to allow managers to easily administer shift-related data. From this panel, shifts and nurse and resident profiles can be created and deleted in the database in a user-friendly way.

3.4 Domain Database
The domain database is where all the information is registered and read from: resident and user profiles, the shifts scheduled by the manager, and the caring activities, observations and follow-up comments introduced by the geriatric nurses for each resident. It also stores the users' sessions during their assigned shifts. A series of customised RESTful API calls were implemented to communicate the database with the SDS, the Mobile Phone App, and the Control Panel.

4 Sample Dialogue
This section presents a simulated dialogue between a geriatric nurse (U) and the system (S), showing some of the different situations that may happen during interaction. The names are made up and do not correspond to specific real residents; the dialogue is presented in English although RESIVOZ uses Spanish.
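The system turns in the examples below would be produced by the template-based NLG described in Section 3.1 (4). A minimal sketch of that mechanism follows; the template strings and action names are illustrative, not RESIVOZ's actual ones.

    # One text template per possible system action selected by the DM.
    TEMPLATES = {
        "resident_identified": "Resident {name} has been identified",
        "control_saved": "{activity} done for resident {name}",
        "no_resident": "There is no resident identified",
    }

    def generate(action, **slots):
        return TEMPLATES[action].format(**slots)

    print(generate("control_saved", activity="Oral hygiene", name="Ana Atxa Cruz"))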

(1) U: Mouth cleaning okay
S: There is no resident identified
U: Ana Atxa had her teeth brushed
S: Oral hygiene done for resident Ana Atxa Cruz

In the first interaction, presented in example 1, no resident is identified, so the system will not register the information until one is. Then, the system is able to identify a resident given their name and first surname.

(2) U: Resident Pilar García
S: There are several residents with that name
U: Pilar García Zubia
S: Resident Pilar García Zubia has been identified

In example 2, the user tries to identify a resident by name and first surname, but there are several residents with that name. By using the second surname, the target resident is correctly disambiguated.

(3) U: She had half her meal for lunch
S: Intake of solids at lunchtime 50%
U: Add observation to intake of solids at lunchtime
S: Dictate
U: Pilar said she was too full and refused to eat more, she didn't want dessert either, end of annotation
S: Observations to solid intake at lunchtime saved

Although the user is not saying a name explicitly, in example 3 the identified resident is known from the session memory. In this example, the user adds an observation, i.e. a free text comment, to a registered control. Dictation ends when a pre-established command is uttered: "end of annotation".

(4) U: All my residents but Ander Goñi have been showered
S: Shower registered for all your residents but Ander Goñi Uria

RESIVOZ incorporates shortcuts to allow the selection of several residents and/or controls at the same time. Example 4 shows the use of a shortcut where all the residents in the nurse's assigned group but one are selected.

(5) U: Delete shower for Ander Goñi
S: No completed control with that name has been found

In example 5, the user tries to delete a control for a resident, but this control was registered before for all residents in the group except for the one now identified. The system informs that it cannot delete a control that has not been saved as complete.

5 Conclusions and Future Work
We have developed and integrated a Spoken Dialogue System that helps geriatric nurses in an elderly care centre easily register resident caring information at any time and any place during their work shifts. Using RESIVOZ over the traditional system, only available in a few shared computers at specific control points, may lend geriatric nurses more quality time for their elderly caring activities. For future work, an evaluation by geriatric nurses is to be performed in their workplace, tackling different aspects such as points of improvement, difficulties found, adoption period, and other feedback regarding the system's usability and applicability in the real work environment.

Acknowledgements
This work was supported by the Adinberri Programme of the Provincial Council of Gipuzkoa, via the RESIVOZ project.

References
Laranjo, L., A. G. Dunn, H. L. Tong, A. B. Kocaballi, J. Chen, R. Bashir, D. Surian, B. Gallego, F. Magrabi, A. Y. Lau, and E. Coiera. 2018. Conversational agents in healthcare: a systematic review. Journal of the American Medical Informatics Association, 25(9):1248–1258.
Serras, M., M. I. Torres, and A. del Pozo. 2019. Goal-conditioned user modeling for dialogue systems using stochastic bi-automata. In Proceedings of the 8th International Conference on Pattern Recognition Applications and Methods, pages 128–134.
Torres, M. I. 2013. Stochastic bi-languages to model dialogs. In Proceedings of the 11th International Conference on Finite State Methods and Natural Language Processing, pages 9–17.
Ward, W. 1991. Understanding spontaneous speech: The Phoenix system. In Proceedings of the 1991 International Conference on Acoustics, Speech, and Signal Processing, pages 365–367.

The Text Complexity Library

Biblioteca de Complejidad Textual

Rocío López-Anguita, Jaime Collado-Montañez, Arturo Montejo-Ráez
Centre for Advanced Information and Communication Technologies (CEATIC)
University of Jaén, Spain
{rlanguit, jcollado, amontejo}@ujaen.es

Abstract: This paper introduces a new resource for computing textual complexity. It consists in a Python library for calculating different complexity metrics for several languages from plain texts. The resource has been made available to the research community and provides all needed instructions for its installation and use. To our knowledge, it is the first time a resource like this is published, so we expect many researchers can profit from it.
Keywords: Demonstration, linguistic resources, textual complexity, lexical analysis
Resumen: Este artículo presenta un nuevo recurso para el cálculo de la complejidad textual. Se trata de una biblioteca de programación en Python que facilita el cómputo de distintas métricas de complejidad para varios idiomas a partir de textos en lenguaje natural. El recurso se ha liberado para su uso por parte de la comunidad científica y proporciona todas las instrucciones necesarias para su instalación y aprovechamiento. Hasta donde sabemos, es la primera vez que un recurso así está disponible, por lo que esperamos sea de utilidad.
Palabras clave: Demostración, recursos lingüísticos, complejidad textual, análisis léxico

1 Introduction
Reading comprehension and reading competence are complex processes that are closely related, according to Pérez (2014), as is the concept of readability. Reading comprehension is tied to the reader's capabilities, while readability is an objective view of the complexity of the text.
Determining the readability of a text is not a simple task, as each reader has different skills or limitations (Cain, Oakhill, and Bryant, 2004). It is usually determined by linguistic features, which are usually grouped into those related to grammar (in other words, syntax) and those related to the lexicon (i.e. vocabulary) (Alliende González, 1994).
Currently, we consider that measures of complexity can be a convenient way to model natural language in certain applications, such as authorship detection, text selection for people with difficulties associated with language disorders (autism, cerebral palsy...), or early detection of cognitive impairments, such as Alzheimer's. Therefore, in this article we present a "demo" paper, which consists of 12 of the most widely used metrics for lexical and syntactic readability. These measures and their interpretation are presented below, as well as details on the use of the library.

2 Complexity metrics provided
In this section, we introduce the different complexity metrics offered in this Python library, proposed by different authors, for different languages (Spanish, English, French...).

Lexical complexity: The lexical complexity of a text, determined by the frequency of use and lexical density, was proposed by Anula (2008). It is based on the number of different content words per sentence (Lexical Complexity Index, LC) and on measuring the number of low frequency words per 100 content words (Index of Low Frequency Words, ILFW). Consequently, the higher the LC index, the greater the difficulty in reading comprehension.

Spaulding readability: Commonly known as the SSR Index, it was proposed by Spaulding (1956). It focuses on measuring vocabulary and sentence structure to predict the relative difficulty of a text's readability. Its formula is an empirically adjusted measure that tries to keep the score between 0 and 1.

Complexity of sentences: The Sentence Complexity Index (SCI) was proposed by Anula (2008) as a measure of the complexity of sentences in a literary text aimed at second language learners. This syntactic complexity measure focuses on measuring the number of words per sentence, thus obtaining the sentence length index (Average Sentence Length, ASL), and the number of complex sentences per sentence, from a complex sentence index (Complex Sentences, CS).

Automated Readability Index (ARI): Senter and Smith (1967) propose one of the most used indexes due to its ease of calculation, the Automated Readability Index, better known as ARI. This index measures the difficulty of a text from the average number of characters (letters and numbers) per word and the average number of words per sentence.

Dependency tree depth: This measure was proposed by Saggion et al. (2015). It is a very useful metric to capture syntactic complexity: long sentences can be syntactically complex or contain a large number of modifiers (adjectives, adverbs or adverbial phrases). It complements the ASL measure, as it captures syntactic complexity in terms of recursive or nested structures.

Punctuation Marks: This measure was also proposed by Saggion et al. (2015). In the complexity of a text, the average number of punctuation marks is used as one of the indicators of the simplicity of the text.

Readability of Fernández-Huerta: Blanco Pérez and Gutiérrez Couto (2002) and Ramírez-Puerta et al. (2013) propose this measure of complexity as an adaptation to Spanish of Flesch's readability test (Flesch, 1948).

Readability of Flesch-Szigriszt (IFSZ): The works of Barrio-Cantalejo et al. (2008) and Ramírez-Puerta et al. (2013) propose the Flesch-Szigriszt readability index as a modification of the Flesch formula (Flesch, 1948) adapted to Spanish by Szigriszt-Pazos in 1993. This index is currently considered a reference for the Spanish language. It focuses on measuring the number of syllables per word and the number of sentences per word in the text.

Comprehensibility of Gutiérrez de Polini: This metric, originally developed in 1972, is not an adaptation from English, but was created from the beginning for Spanish (Rodríguez, 1980). It focuses on measuring the average number of letters per word and the average number of words per sentence.

µ Readability: It is a formula to calculate the readability of a text. It provides an index between 0 and 100 and was developed by Muñoz (2006). This measure focuses on measuring the number of words, the average number of letters per word and their variance.

Minimum age to understand: In the work of García López (2001) we can find another formula to measure the age needed to understand a text. It is, again, an adaptation into Spanish of Flesch's original formula (Flesch, 1948) for English. It measures the average number of syllables per word and the average number of words per sentence to obtain the minimum age needed to understand a text.

SOL Readability: Contreras et al. (1999) propose the SOL metric as an adaptation to Spanish of the SMOG formula proposed by Mc Laughlin (1969). It measures the readability of a text by means of grade level, which is the number of years of schooling required to understand the text.

Years Crawford: This measure was proposed by Alan N. Crawford (Crawford, 1984). It is used to calculate the years of school required to understand a text. It measures the number of sentences per hundred words and the number of syllables per hundred words.
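As a concrete illustration of one of these metrics, the sketch below computes the standard ARI formula (4.71 x characters per word + 0.5 x words per sentence - 21.43); the library's own implementation may differ in tokenisation details, so this is only indicative.

    import re

    def ari(text):
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        words = re.findall(r"\w+", text)
        chars = sum(len(w) for w in words)
        return 4.71 * chars / len(words) + 0.5 * len(words) / len(sentences) - 21.43

    print(round(ari("This is a short sentence. Here is another one."), 2))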
3 How to obtain it
The library has been released under the General Public License (GPL v3.0, http://www.gnu.org/licenses/gpl.html) and can be downloaded or cloned from its public repository (https://gitlab.ujaen.es/amontejo/text-complexity). The library will be updated with new features in the future, and you can always get the latest version from that link.

4 Installation
In order to use this library, you first need to install some previous requirements:

• NumPy, SciPy, Pandas, Matplotlib and Openpyxl for python3 have to be installed in your system.

• The FreeLing (Padró and Stanilovsky, 2012) package, which is a library providing language analysis services that our library makes use of. In order to install it, you have to follow its installation manual under the project's GitHub page (https://github.com/TALP-UPC/FreeLing-User-Manual).

5 Usage examples
In order to test the library and teach how to use it, we have prepared some testing texts for Spanish under the ./texts folder (you can use your own texts by modifying these files or adding more to that location).
To compute complexity metrics on these text samples, modify the FREELINGDIR variable in the TextComplexityFreeling.py script (line 18) to your own FreeLing installation directory (/usr/local by default). Then, if you run the Jupyter notebook examples.ipynb you should get some tables with the metrics for each text provided. The script will also generate three MS Excel files containing the results in your project's folder.
To use it in your Python scripts, this is as simple as follows:

    import TextComplexityFreeling as TCF
    # Create the text complexity calculator
    tc = TCF.TextComplexityFreeling()
    # Load text to analyze
    text_processed = tc.textProcessing(text)
    # Compute different metrics
    pmarks = tc.punctuationMarks()
    lexcomplexity = tc.lexicalComplexity()
    ssreadability = tc.ssReadability()
    sencomplexity = tc.sentenceComplexity()
    autoreadability = tc.autoReadability()
    embeddingdepth = tc.embeddingDepth()
    readability = tc.readability()
    agereadability = tc.ageReadability()
    yearscrawford = tc.yearsCrawford()

For example, for a given sample text:

La última luna llena del año, que se observará completa este jueves en el cielo, será especial. Se produce estos días el fenómeno conocido como luna fría, una coincidencia astronómica que hace las delicias de quienes atribuyen al astro cualidades esotéricas. Sucede cuando la Tierra se encuentra ubicada exactamente entre el sol y la luna, de forma que la luna recibe directamente la luz. La luna llena será visible durante toda la noche, pero alcanzará su magnitud máxima cuando se encuentre a medio cielo, de forma que, al reflejar completamente la luz del sol que incide en la tierra, se verá especialmente grande y luminosa. Se llama luna fría porque marca la llegada del invierno en el hemisferio norte, aunque también se conoce como luna de las noches largas al ocurrir cerca del solsticio, informa National Geographic, que cita a un astrónomo de la NASA. La luna fría de 2019 coincide además con la lluvia de meteoros Gemínidas, visible entre el 7 y el 17 de diciembre pero que alcanzará su punto máximo de actividad entre el 11 y el 13. Es la lluvia de estrellas más masiva, lo que la hace mucho más brillante. El cielo augura todo un espectáculo esta noche.

This is the generated output:

    Number of words (N_w): 207
    Punctuation marks: 21
    Number of low freq. words (N_lfw): 67
    Number of content words (N_dcw): 71
    Number of sentences (N_s): 8
    Number of total content words (N_cw): 93
    Lexical Distribution Index (LDI): 8.875
    Index Low Frequency Words (ILFW): 0.72
    Lexical Complexity Index (LC): 4.7977
    Number of rare words (N_rw): 65
    Spaulding Spanish Readability (SSR): 167.82
    Average Sentence Length (ASL): 25.875
    Complex Sentences (CS): 23.75
    Sentence Complexity Index (SCI): 24.81
    Automated Readability Index (ARI): 13.66
    Average embeddings depth (MeanDEPTH): 7.25
    Huerta's Readability index: 80.86
    IFSZ Readability: 54.25
    Polini's Comprehensibility: 42.61
    Mu Readability: 53.19
    Minimum age to understand: 12.48
    SOL Readability: 11.66
    Years needed: 5.76

6 Conclusions and future versions
There are several studies that reflect the strong influence of the richness of the reader's vocabulary on reading comprehension, beyond symbols and grammar. However, we consider that complexity measures can be a convenient way to model natural language in certain applications, such as authorship detection, text selection for people with difficulties associated with language disorders (autism, cerebral palsy...), or early detection of cognitive impairments, such as Alzheimer's.
Another future line of work is to define complexity from computed language models (like RNNs or BERT models). We believe that information measures on the parameters of these models may capture the inherent complexity of the texts they were trained on.

7 Acknowledgements
This work has been partially supported by Fondo Europeo de Desarrollo Regional (FEDER), LIVING-LANG project (RTI2018-094653-B-C21) from the Spanish Government.

References
Alliende González, F. 1994. La legibilidad de los textos. Santiago de Chile: Andrés Bello, 24.
Anula, A. 2008. Lecturas adaptadas a la enseñanza del español como L2: variables lingüísticas para la determinación del nivel de legibilidad. La evaluación en el aprendizaje y la enseñanza del español como LE L, 2:162–170.
Barrio-Cantalejo, I. M., P. Simón-Lorda, M. Melguizo, I. Escalona, M. I. Marijuán, and P. Hernando. 2008. Validación de la Escala INFLESZ para evaluar la legibilidad de los textos dirigidos a pacientes. In Anales del Sistema Sanitario de Navarra, volume 31, pages 135–152. SciELO España.
Blanco Pérez, A. and U. Gutiérrez Couto. 2002. Legibilidad de las páginas web sobre salud dirigidas a pacientes y lectores de la población general. Revista Española de Salud Pública, 76(4):321–331.
Cain, K., J. Oakhill, and P. Bryant. 2004. Children's reading comprehension ability: Concurrent prediction by working memory, verbal ability, and component skills. Journal of Educational Psychology, 96(1):31.
Contreras, A., R. García-Alonso, M. Echenique, and F. Daye-Contreras. 1999. The SOL formulas for converting SMOG readability scores between health education materials written in Spanish, English, and French. Journal of Health Communication, 4(1):21–29.
Crawford, A. N. 1984. A Spanish language Fry-type readability procedure: Elementary level. Bilingual Education Paper Series, vol. 7, no. 8.
Flesch, R. 1948. A new readability yardstick. Journal of Applied Psychology, 32(3):221.
García López, J. 2001. Legibilidad de los folletos informativos. Pharmaceutical Care España, 3(1):49–56.
Mc Laughlin, G. H. 1969. SMOG grading: a new readability formula. Journal of Reading, 12(8):639–646.
Muñoz, M. 2006. Legibilidad y variabilidad de los textos. Boletín de Investigación Educacional, Pontificia Universidad Católica de Chile, 21, 2:13–26.
Padró, L. and E. Stanilovsky. 2012. FreeLing 3.0: Towards wider multilinguality. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 2473–2479.
Pérez, E. J. 2014. Comprensión lectora vs competencia lectora: qué son y qué relación existe entre ellas. Investigaciones Sobre Lectura, (1):65–74.
Ramírez-Puerta, M., R. Fernández-Fernández, J. Frías-Pareja, M. Yuste-Ossorio, S. Narbona-Galdó, and L. Peñas-Maldonado. 2013. Análisis de legibilidad de consentimientos informados en cuidados intensivos. Medicina Intensiva, 37(8):503–509.
Rodríguez, T. 1980. Determinación de la comprensibilidad de materiales de lectura por medio de variables lingüísticas. Lectura y Vida, 1(1):29–32.
Saggion, H., S. Štajner, S. Bott, S. Mille, L. Rello, and B. Drndarevic. 2015. Making it Simplext: Implementation and evaluation of a text simplification system for Spanish. ACM Transactions on Accessible Computing (TACCESS), 6(4):14.
Senter, R. and E. A. Smith. 1967. Automated Readability Index. Technical report, Cincinnati Univ., OH.
Spaulding, S. 1956. A Spanish readability formula. The Modern Language Journal, 40(8):433–441.

The Impact of Coronavirus on our Mental Health

El Impacto del Coronavirus en nuestra Salud Mental

Jiawen Wu1, Francisco Rangel1, Juan Carlos Martínez2
1Symanto, Germany
2Atribus, Spain
{jiawen.wu,francisco.rangel}@symanto.net, [email protected]

Abstract: The Coronavirus represents the greatest threat to physical health in modern times. Simultaneously, fear of the unknown and the fear of the very real repercussions of the virus is threatening to impact the mental health of many around the world. To provide insights on the impact of Coronavirus on our mental health, we are constantly monitoring millions of conversations on Twitter each day, and analysing this enormous amount of data by means of psychological models trained with artificial intelligence techniques and deep neural networks.
Keywords: Coronavirus, COVID19, mental health, psychological models, deep learning
Resumen: El Coronavirus representa la mayor amenaza para la salud física en tiempos modernos. A su vez, el miedo a lo desconocido y a las repercusiones reales del virus está amenazando con impactar en la salud mental de las personas alrededor de todo el mundo. Para analizar dicho impacto, estamos monitorizando millones de conversaciones en Twitter en tiempo real, y analizando esta gran cantidad de datos mediante modelos psicológicos entrenados con técnicas de inteligencia artificial y redes neuronales profundas.
Palabras clave: Coronavirus, COVID19, salud mental, modelos psicológicos, aprendizaje profundo

1 Introduction
Social media allow instant and borderless communication of what happens in people's daily lives, giving immediate access to knowledge that would otherwise be limited or biased at the discretion of a few. This is even more important when it comes to spreading information regarding threats. And we are living the greatest threat to physical health in modern times: the Coronavirus. Add into this the huge lifestyle changes we're all being asked to make and it's easy to see why our emotions and feelings are being affected. Our aim is at analysing the impact of the virus on the mental health of people around the world through our psychological models trained with artificial intelligence.
At Symanto (https://www.symanto.net/), we have created two live trackers based on English and German tweets with mentions of Coronavirus-related terms (https://www.symanto.net/live-insights/mental-health-coronavirus/). Furthermore, together with our partner Atribus (https://www.atribus.com/), we have created another tracker to monitor the mental health of people in Spanish speaking countries (https://www.atribus.com/covid19-esplatam/).
In this paper, we summarise some of our key findings in the development of mental health discussion in different countries, the triggering factors, the rise of digital psychotherapy and its perception, and much more.

2 Social Media Monitoring and Data Collection
We are constantly retrieving discussions about the Coronavirus on Twitter. We search for terms such as Covid-19, Coronavirus or SARS-Cov-2. We exclude spam (e.g., the same tweet changing only a user mention) and retweets. For this paper, we have analysed the data collected between March 10 and April 22. In Table 1, we present the total number of tweets and the total number of tweets referring to mental health issues. Due to space limitations, in the next sections we use the English set to present the insights.

Lang.     # Tweets    # Mental health
English   14M         811K
German    1.6M        75K
Spanish   11.2M       1.1M

Table 1: Number of tweets collected between March 10 and April 22
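A minimal sketch of the spam and retweet filtering described in Section 2; the tweet structure is a simplified stand-in for Twitter's actual payloads.

    import re

    def keep(tweet, seen):
        if tweet["is_retweet"]:
            return False
        # Spam such as "the same tweet changing only a user mention" is caught
        # by comparing the text with all @mentions stripped out.
        key = re.sub(r"@\w+", "", tweet["text"]).strip().lower()
        if key in seen:
            return False
        seen.add(key)
        return True

    seen = set()
    tweets = [{"text": "@ana Stay safe! #Covid19", "is_retweet": False},
              {"text": "@luis Stay safe! #Covid19", "is_retweet": False},
              {"text": "Totally new content", "is_retweet": True}]
    print([t["text"] for t in tweets if keep(t, seen)])  # keeps only the first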

3 Mental Health Issues amid the Coronavirus Pandemic
Within the conversations around the Coronavirus, a substantial volume of discussion is around mental health issues, which strongly indicates the negative impact of the pandemic on our mental health. We are closely monitoring both the volume of these conversations and the concrete issues in the discussion. Anxiety is by far the most dominant mental health issue, followed by Stress, Depression, OCD, and Suicide. Though less significant, there are also mentions of Eating Disorder, Personality Disorder, and Self-harm in relation to Covid-19 (see Figure 1 as an example for 18 April).

Figure 1: Mental Health Issues (18 Apr)

3.1 Why are People discussing Mental Health Illness on Social Media?
Based on Carl Gustav Jung's Psychological Types and the communication model of Schulz von Thun, we have developed the personality and communication style identification model, which allows us to identify the communication purpose.

Figure 2: Communication Purpose

According to Figure 2, most of the users share their own experience and struggles (self-revealing), followed by users seeking support and tips (information-seeking), participating in general discussions to obtain information and support (fact-oriented), and asking for attention and awareness campaigns (action-seeking).

3.2 A different perspective to look at Depression
Depression poses a greater potential threat to our mental health than it may appear. Of the general conversation around Covid-19, less than 1% mentions explicitly "depression" and related terms. However, by applying our models trained on conversations from people diagnosed with depression, we find that up to 10% of the conversations show linguistic signs of depression (note that the prediction reflects a tendency and is by no means diagnostic).

3.3 How is Covid-19 affecting young people's mental state?
Over 35% of tweets about mental health struggles in the Covid-19 discussion are from younger adults (18-24 yrs). Alarmingly, Suicide is mentioned more frequently (over 50%) by younger people than by the entire older age group. Young people are talking about Anxiety as frequently as the older generations. Stress and Depression, on the other hand, are significantly more prevailing among older age groups.

3.4 How is Covid-19 affecting different genders?
Around 60% of tweets about mental health struggles in the Covid-19 discussion come from women. Anxiety, Stress and Sleeping disorder are much more prevailing among women than men. Self-harm and Personality disorder are more frequently mentioned by men than women.

3.5 What factors are triggering mental health issues during the pandemic?
Isolation is the biggest trigger across different mental health issues. Financial issues and job insecurity are triggering suicidal thoughts more drastically than other triggers. Burnout, though triggered by various factors, is linked most to discussion around workload.
3.6 The Supporting of Mental Health
Appointment cancellations due to Covid-19 have led to a mounting mental health challenge. Using digital psychotherapy as an alternative is so far creating a mixed sentiment, especially around anxiety.
For example, a user says "My therapist wants to do the telehealth thing, but it actually gives me more anxiety to talk over video chat", while another says "I am SO thankful telemedicine is a thing. Being able to continue my weekly sessions with my therapist during the stay at home order has helped me keep my anxiety mostly under control over the past few weeks".
Though India and South Africa have a higher volume of general mental health discussion, they are lagging behind in the conversation around digital psychotherapy. In other developing countries such as the Philippines, Nigeria and Malaysia, there has not been much discussion around teletherapy.

4 Society Resilience
We are also monitoring the major topics of discussion to gain more insights on what is drawing our attention or keeping us occupied, in order to identify the main factors that are impacting mental health.

4.1 Coping in this Challenging Time
People of different cultures are engaged in various activities, some contributing to better mental health, and some less. Playing and watching TV are equally frequently mentioned across different countries. In the US and Canada, socialising makes up the most important activities in their daily life. Exercising has gained more conversations in countries such as the UK and Germany. Working is more frequently mentioned by Germans than in any other country.

4.2 How is the Mood in the Discussion around the Coronavirus?
We detect the emotions expressed in the tweets and calculate the Mood Index following Equation 1: the ratio of the number of tweets conveying positive emotions to the number of tweets conveying any emotion.

MoodIndex = 100 · #positive / (#positive + #negative)    (1)

We use the Mood Index to gauge the positivity and negativity of people. The more positive the mood, the higher the index, and vice versa when the mood is more pessimistic. In Figure 3, we show an example of the Mood Index evolution in the range of a week.

Figure 3: Mood Index (12 Apr-18 Apr)

As shown in Figure 4, African countries are showing a higher Mood Index than European and Asian countries.

Figure 4: Mood Index around the World

Figure 5 shows the Mood Index in some countries. You can see that negativity prevails all around the world; in these examples, the index ranges from less than 10 to around 20.

Figure 5: Mood Index in some Countries in the different Continents
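Equation 1 translates directly to code; the sketch below assumes each tweet already carries an emotion label from the (unpublished) emotion classifier.

    def mood_index(labels):
        positive = sum(1 for l in labels if l == "positive")
        negative = sum(1 for l in labels if l == "negative")
        # Tweets conveying no emotion are ignored by the index.
        return 100 * positive / (positive + negative)

    print(mood_index(["positive", "negative", "negative", "neutral"]))  # 33.33...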

4.3 Individualistic vs. Collectivistic attitudes
A collectivist attitude can be crucial in a pandemic, especially as conformity and self-sacrifice are key to the success of regulation and virus containment. It can also greatly impact our mental well-being, especially when isolation and loneliness are among the greatest challenges to our mental health.
Our preliminary results, although inconclusive, point out that individualistic cultures are more emotional, while collectivists are more information-oriented, as they want to take advantage of rational information for the sake of "we".

5 Insights
The vast social media conversations have opened up a unique opportunity to understand and monitor the impact of Covid-19 on our mental health on a global scale. Our live Mental Health Trackers have raised awareness around this topic and offered a first data-driven glimpse into the world of mood, emotions and struggles in this crisis time. Here, we summarise some of the key findings of this study:

• Nearly 6% of mental health issue discussion is related to Covid-19.
• Suicide is mentioned more frequently by younger people between 18-24 yrs than by the entire older population.
• Digital psychotherapy is still out of reach for most in developing countries.
• Less prevailing issues such as Eating Disorder, Personality Disorder, and Self-harm can also be related to Covid-19.
• African countries are showing a higher Mood Index than European and Asian countries.
• Depression poses a great potential threat to our mental health and is often understated.

6 Beyond Insights
The mental health of individuals and collectives is affected by physical threats such as the Coronavirus, as well as earthquakes or terrorist attacks. As an example, we can see in Figure 6 how the Mood Index correlated negatively (r=-0.831) with the rate of change in new Coronavirus cases in Spain for nine days.

Figure 6: Pearson correlation between the rate of change in new Coronavirus cases in Spain and the Mood Index in a 9 days time span

Furthermore, mental health can also affect our daily life. Fear of the unknown, such as losing your job, affects the economy. If politicians capitalise on fear, instead of working for the social good; if they seize the population's affiliation feeling by spreading fake news and misinformation, fostering hate; if they use technology to monitor and censor our opinions; they will only further polarise society. And this might raise civil conflicts, or even worse.
We plan to navigate through the pandemic and post-pandemic era, and we will also decode the mood development through the lens of psychology, aiming to provide an indicator of the social dynamics as a whole. A holistic approach should give us the proper instruments to fight against these threats and to work together for the social good.

Acknowledgements
We want to warmly thank all people at Symanto and Atribus for their hard work and for making this project possible.

AREVA: Augmented Reality Voice Assistant for Industrial Maintenance

AREVA: Asistente por Voz Dotado de Realidad Aumentada para el Mantenimiento Industrial

Manex Serras1, Laura García-Sardiña1, Bruno Simões1, Hugo Álvarez1, Jon Arambarri2
1Vicomtech Foundation, Basque Research and Technology Alliance (BRTA)
2VirtualWare Labs Foundation
{mserras, lgarcias, bsimoes, halvarez}@vicomtech.org, [email protected]

Abstract: Within the context of Industry 4.0, AREVA is presented: a Voice Assistant with Augmented Reality visualisations for the support and guidance of operators when carrying out tasks and processes in industrial environments. With the aim of validating its use for the training of new operators, first evaluations were performed by a group of non-expert users who were asked to carry out a maintenance task on a Universal Robot.
Keywords: Spoken Dialogue System, Augmented Reality, Industry 4.0

Resumen: Dentro del marco de la Industria 4.0 se presenta AREVA: un Asistente por Voz dotado con visualizaciones de Realidad Aumentada para el apoyo y guiado del operario a la hora de realizar tareas y procesos en entornos industriales. Con el objetivo de validar su uso para el entrenamiento y capacitación de nuevos operarios, se ha realizado una primera evaluación con un grupo de usuarios no-expertos a quienes se les pidió que realizasen una tarea de mantenimiento sobre un Robot Universal.
Palabras clave: Sistema de Diálogo Hablado, Realidad Aumentada, Industria 4.0

1 Introduction

Industry 4.0 is one of the main paradigms of the latest years, where industrial factories are augmented with wireless connectivity, sensors, and AI mechanisms that can help with different processes and tasks. The Industry 4.0 revolution has shown how technology can alter the way work is performed, the structure of organisations, and the role that workers play in the manufacturing process (Simões et al., 2019; Posada et al., 2018; Serras et al., 2020).

This paper describes an Augmented REality Voice Assistant (AREVA) to assist operators during the maintenance of Universal Robot's (UR) grippers. To this end, the system guides the operator combining voice interaction and real-time 3D visualisations. Augmented Reality (AR) glasses with a built-in microphone are used to provide hands-free interaction and as a solution to overcome practical limitations in tasks that rely on intense manual work.

This paper presents the AREVA system for industrial maintenance tasks. The presented prototype was tested by a set of 10 participants (Gender: 5 male, 4 female, 1 would rather not say; Age ranges: 50% 25-29, 30% 35-39, 10% 30-34, and 10% 18-24) who had little-to-no technical expertise in the field and no prior knowledge of similar maintenance tasks, using the presented assistant to support and guide them through the process. After the experimental sessions, participants filled in an assessment form to judge the system's validity.

Section 2 describes the architecture of the system and the use case, and Section 3 details the experimental framework and presents the evaluation results. Finally, Section 4 presents the conclusions and sets out the future work.

2 System Architecture

The system architecture follows the classical SDS schema, but incorporates communication channels with the necessary hardware devices to allow for both visual and aural communication with operators, as shown in Figure 1. The core components are presented in more detail below.

Figure 1: Architecture of the Augmented Reality enhanced Spoken Dialogue System

• AR Headset: this device is responsible for capturing the users' speech, playing AREVA's responses, and rendering AR animations. A head-mounted device was selected so that operators have their hands free to perform the task while interacting with the system. For the proposed system, Microsoft HoloLens were used.

• Speech to Text (STT): the component in charge of converting audio streams into text. First, an energy-based automaton is used as Voice Activity Detector (VAD) to segment audio streams into speech chunks, which are then sent to a speech transcription service to get the text utterance. Transcriptions were provided by the Google Speech API.

• Natural Language Understanding (NLU): once the voice segments have been decoded into text, the NLU extracts the communicative intention of the transcribed text so it can be interpreted by the system. For the AREVA system, the grammar-based Phoenix parser was used (Ward, 1991).

• Dialogue Manager (DM): the module that ensures the correct planning of and interaction through the industrial process. The DM implemented for AREVA consists of an Attributed Probabilistic Finite State Bi-Automaton (Torres, 2013; Serras, Torres, and Del Pozo, 2019), which uses a stack of expert rules and simulated interactions to guide the operator through the task.

• Response Generation: this module receives the response selected by the DM to carry on with the interaction. It performs two actions: 1) it selects an adequate text template for the response, which is then synthesised by a Text To Speech service (Hernaez et al., 2001), and 2) it checks whether the action has an associated AR rendering. Then, these audio and visual responses are sent to the AR Headset to be conveyed to the operator.
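The component list above can be read as a single processing loop per user turn. The following sketch shows one way to chain the four modules; the class and method names are our own illustration under that reading, not AREVA's actual code.

```python
from dataclasses import dataclass, field

@dataclass
class DialogueState:
    """Encodes the latest system response, user action and memory attributes."""
    last_response: str = ""
    last_user_action: str = ""
    memory: dict = field(default_factory=dict)

class SpokenDialoguePipeline:
    """One dialogue turn through the STT -> NLU -> DM -> Response Generation chain."""

    def __init__(self, stt, nlu, dm, generator, headset):
        self.stt, self.nlu, self.dm = stt, nlu, dm
        self.generator, self.headset = generator, headset
        self.state = DialogueState()

    def handle_turn(self, speech_chunk):
        text = self.stt.transcribe(speech_chunk)           # VAD-segmented audio -> text
        intent = self.nlu.parse(text)                      # text -> communicative intention
        action = self.dm.next_action(self.state, intent)   # rule-guided action selection
        audio, animation = self.generator.render(action)   # text template + optional AR asset
        self.headset.play(audio)                           # synthesised speech to the operator
        if animation is not None:
            self.headset.show(animation)                   # AR overlay on the glasses
        self.state.last_user_action = intent
        self.state.last_response = action
```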

2.1 Use Case

We considered as a use case the maintenance of the Universal Robot's gripper. The objective of the system is to assist untrained operators in real time with contextualised responses to the questions that may arise during the course of the task. AR is used to overlay animations that further enhance the experience.

The maintenance steps considered for the use case were retrieved from Subsections 7.1 and 7.3 of the official maintenance manual¹, which require the following actions:

1. Shut down the UR, unplug the connected wires, and dismount the gripper.
2. Open the gripper to clean it and inspect it both visually and by moving its fingers to detect any wear, damage or incorrect joint articulations.
3. Close the gripper's fingers, mount it back on the UR, and attach the wires.

Once these steps are completed, the monthly maintenance of the UR's gripper is finished. As the steps imply, heavy manual effort is required to perform the operation, so a hands-free system is critical to improve the User Experience (UX).

For the presented use case, the developed NLU module can understand a total of 10 different intents, 20 entities, and more than 20 entity-values in total, using more than 2,000 grammar rules to parse the operators' input into valid semantic actions.

The initial DM version was trained using 33 expert rules. These DM rules map the dialogue state –which encodes the latest system response, user action, and discrete memory attributes– to the possible actions to perform in the task, to physical objects (e.g. tools, wires, bolts), and to their properties, so that questions like "Which tool do I need?" and "Where are the bolts?" can be disambiguated according to the current step context. These rules also control dialogue navigation aspects such as the welcoming prompt, repetitions, flow constraints (e.g. not allowing the operator to go to the next step if the current one is not satisfied), and so on.

The Response Generation module had 48 possible response actions, amounting to 131 text templates (an action may have more than one associated template). A total of 8 AR animations were developed: 1) initial gripper position over the UR, 2) point the shutdown button, 3) point the wires location, 4) point the bolts in the base of the gripper, 5) point the bolts under the gripper's fingers, 6) gripper's fingers opening, 7) gripper's fingers movement to check, and 8) gripper's fingers closing.

¹ https://assets.robotiq.com/website-assets/support_documents/document/3-Finger_PDF_20190322.pdf?_ga=2.105997637.750885948.1563187207-1298495031.1563187207, last accessed 29/04/2020
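A minimal sketch of the step-contextualised rule lookup described above: the same question resolves differently depending on the current maintenance step, and a flow constraint blocks advancing until the step is satisfied. The rule keys, action names, and helper are hypothetical illustrations, not AREVA's actual rule set.

```python
# Hypothetical subset of expert rules: (current step, user intent) pairs map
# to a response action, possibly tied to the physical objects of that step.
EXPERT_RULES = {
    ("dismount_gripper", "ask_tool"):   "point_tool:hex_key",
    ("dismount_gripper", "ask_bolts"):  "point_bolts:gripper_base",
    ("inspect_fingers",  "ask_bolts"):  "point_bolts:under_fingers",
    ("any",              "ask_repeat"): "repeat_last_instruction",
}

def next_action(step, intent, step_satisfied):
    """Resolve an intent in the context of the current step (flow-constrained)."""
    # Flow constraint: refuse to advance while the current step is unfinished.
    if intent == "next_step" and not step_satisfied:
        return "explain_pending_requirements"
    # Try a step-specific rule first, then a step-independent fallback.
    return (EXPERT_RULES.get((step, intent))
            or EXPERT_RULES.get(("any", intent), "ask_clarification"))

# "Where are the bolts?" is disambiguated by the current step:
print(next_action("dismount_gripper", "ask_bolts", False))  # point_bolts:gripper_base
print(next_action("inspect_fingers",  "ask_bolts", False))  # point_bolts:under_fingers
```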

Augmented Reality & Spoken Dialogue System
Visualisations aside, how helpful has the voice interaction been in performing maintenance? Mean score: 4.3/5
Voice interaction aside, how useful have the visualisations been to you? Mean score: 3.7/5
Do you think you could have done the maintenance properly without the visualisations, just with the voice interaction? Yes: 1; Yes, but it would have taken longer: 4; No, AR was a key feature: 4; I don't know: 1
Do you think you could have done the maintenance properly without the voice interaction, just with the visualisations? Yes, but it would have taken longer: 4; No, voice interaction was a key feature: 6
If you had to say which technology was more relevant, which would it be? AR: 1; SDS: 3; Both: 6
Do you think it was useful to combine both technologies? Yes: 10; No: 0

Usability and perception
Has the system made maintenance easier? Mean score: 4/5
Have you felt frustrated at any point in the process? Never: 1; Almost never: 4; Sometimes: 5
"The system has increased my confidence that I was performing the task correctly" Mean agreement score: 3.8/5
How much has having your hands free made maintenance easier? Mean score: 4.5/5

Task Completion and Recommendation
Have you been able to complete the maintenance task? Yes: 9; No: 1
Would you recommend using the assistant to someone who has never done the maintenance before? Yes: 10; No: 0
Would you use the assistant to do maintenance a second time, the following month? Yes: 8; No: 2

Table 1: Summary of the evaluation form filled in by the participants

3 Experimental Framework

The objective of AREVA is to help and guide operators during maintenance tasks and make them more independent. Ten participants were recruited to validate the system and asked to perform a maintenance task in which they had no previous experience.

The robot arm was fixed in a pre-set position, with the gripper coupled and its fingers closed, as shown in Figure 2. The tools needed for the task were placed on the table. At the beginning of the session, while the participants were adjusting and getting familiarised with the AR glasses, a researcher would present the UR, the system, and the task at hand that they should complete, with no additional details.

Figure 2: An operator using the AREVA system to carry out the gripper's maintenance

The evaluation form had several questions regarding the technological components of the system, which were evaluated both separately and in conjunction. Besides, additional questions about usability, participants' perception, and task completion were included. The results for these questions are summarised in Table 1.

The results reported in Table 1 show that the use of AR and SDS technologies combined in AREVA is useful for training and assisting applications in industrial processes. The Spoken Dialogue System is perceived more positively by the participants, both in helpfulness and as a key feature, but, overall, the combination of both technologies is perceived as advantageous, and both are relevant to assist and guide the operator through the gripper's maintenance task. Regarding the scores achieved in the usability and perception evaluation, all participants agreed that the system made the maintenance easier, especially by allowing them to keep both hands free, and that it made them feel more confident that they were doing the task correctly. Still, there were times when some users felt frustrated using the system, mostly due to STT or Voice Activity Detection errors.

Finally, 9 out of 10 users were able to complete the maintenance –note that no participant had performed this process before and their knowledge of industrial processes was limited–. All of them would recommend the system to someone who has never done this task before, proving the usefulness of the system when training new operators. Moreover, 8 of them would use AREVA again for the following month's maintenance.

Additional observations made by the participants focused on improving the hardware ergonomics, as the headset would sometimes bother them while performing the maintenance. The capability of adapting to each user profile (beginner/intermediate/advanced) to adjust the response verbosity was also deemed a useful feature.

4 Conclusions and Future Work

This paper describes an Augmented Reality Voice Assistant for natural and hands-free guidance through a Universal Robot gripper's maintenance. The proposed solution facilitates the capture, distribution, and communication of domain-specific knowledge to operators in training and production phases. The hands-free, technology-combining multimodal system was well accepted by a group of non-expert operators unfamiliar with the task. As future work, a wider sample of participants will take part as testers. In addition, UX factors will be further improved and evaluated, such as improvements in the STT and NLU modules and the use of wake-up words to avoid ambient-noise activation of the VAD module. The extrapolation to other industrial processes will also be further investigated.
References

Hernaez, I., E. Navas, J. L. Murugarren, and B. Etxebarria. 2001. Description of the AhoTTS system for the Basque language. In 4th ISCA Tutorial and Research Workshop (ITRW) on Speech Synthesis.

Posada, J., M. Zorrilla, A. Dominguez, B. Simões, P. Eisert, D. Stricker, J. Rambach, J. Döllner, and M. Guevara. 2018. Graphics and media technologies for operators in Industry 4.0. IEEE Computer Graphics and Applications, 38(5):119–132.

Serras, M., L. García-Sardiña, B. Simões, H. Álvarez, and J. Arambarri. 2020. Dialogue enhanced extended reality: Interactive system for the Operator 4.0. Applied Sciences, 10(11):3960.

Serras, M., M. I. Torres, and A. Del Pozo. 2019. User-aware dialogue management policies over attributed bi-automata. Pattern Analysis and Applications, 22(4):1319–1330.

Simões, B., R. De Amicis, I. Barandiaran, and J. Posada. 2019. Cross reality to enhance worker cognition in industrial assembly operations. The International Journal of Advanced Manufacturing Technology, pages 1–14.

Torres, M. I. 2013. Stochastic bi-languages to model dialogs. In Proceedings of the 11th International Conference on Finite State Methods and Natural Language Processing, pages 9–17.

Ward, W. 1991. Understanding spontaneous speech: The Phoenix system. In [Proceedings] ICASSP 91: 1991 International Conference on Acoustics, Speech, and Signal Processing, pages 365–367. IEEE.

UMUCorpusClassifier: Compilation and evaluation of linguistic corpus for Natural Language Processing tasks

UMUCorpusClassifier: Recolección y evaluación de corpus lingüísticos para tareas de Procesamiento del Lenguaje Natural

José Antonio García-Díaz1, Ángela Almela2, Gema Alcaraz-Mármol3, Rafael Valencia-García1
1Facultad de Informática, Universidad de Murcia
2Facultad de Letras, Universidad de Murcia
3Departamento de Filología Moderna, Universidad de Castilla-La Mancha
{joseantonio.garcia8, angelalm, valencia}@um.es
[email protected]

Abstract: The development of an annotated corpus is a very time-consuming task. Although some researchers have proposed the automatic annotation of a corpus based on ad-hoc heuristics, valid hypotheses cannot always be made. Even when the annotation process is performed by human annotators, the quality of the corpus is heavily influenced by disagreements between annotators or within the same annotator. Therefore, a lack of supervision of the annotation process can lead to a poor-quality corpus. In this work, we propose a demonstration of UMUCorpusClassifier, an NLP tool that aids researchers in compiling corpora as well as in coordinating and supervising the annotation process. This tool eases the daily supervision process and makes it possible to detect deviations and inconsistencies during the early stages of the annotation process.
Keywords: Corpus compilation, Document classification

Resumen: La construcción de un corpus anotado es una tarea que consume mucho tiempo. Aunque algunos investigadores han propuesto la anotación automática basada en heurísticas, éstas no siempre son posibles. Además, incluso cuando la anotación es realizada por personas puede haber discrepancias entre los mismos anotadores o de un anotador consigo mismo que influyen en la calidad del corpus. Por tanto, la falta de supervisión sobre el proceso de anotación puede llevar a corpus con baja calidad. En este trabajo, proponemos una demostración de UMUCorpusClassifier, una herramienta PLN para ayudar a los investigadores a compilar corpus y también a coordinar y supervisar el proceso de anotación. Esta herramienta facilita la monitorización diaria y permite detectar inconsistencias durante etapas tempranas del proceso de anotación.
Palabras clave: Compilación de corpus, Clasificación de documentos

1 Introduction

Supervised learning is the machine learning task which consists of building a model capable of predicting the output for a specific problem based on prior observation of a previously labeled data set (Singh, Thakur, and Sharma, 2016). Supervised learning has applications in document classification tasks, which consist of matching a set of documents with a set of predefined labels. The main idea behind this is that supervised learning models can infer new knowledge by establishing associations between the examples provided and the expected tags. However, supervised learning requires a sufficient number of labeled examples to model the problem domain and, at the same time, the number of examples should be enough to split them into two subsets, one for model learning and another for evaluating its accuracy on samples not seen during the training stage.

The development of an annotated corpus is a very time-consuming process. To facilitate this task, some researchers have used distant supervision as a method of getting automatically annotated data (Go, Bhayani, and Huang, 2009). Distant supervision consists in the automatic tagging of the whole dataset based on certain assumptions or heuristics; however, valid hypotheses cannot always be made. Furthermore, automatically annotated data that have not been thoroughly reviewed can lead to noise in the data, which introduces information that does not follow the relationship with the rest of the elements of the dataset. These noisy documents require an even greater volume of data to minimize their effects. Moreover, even when the annotation process is carried out manually, the quality of the corpus cannot be guaranteed. In this sense, Mozetič et al. (Mozetič, Grčar, and Smailović, 2016) conducted an experiment to measure the quantity and quality of tagged Twitter datasets and to evaluate the impact on accuracy in sentiment-analysis classification experiments, and they found that not all manually annotated corpora had a strong degree of agreement among the annotators, which indicated poor quality. To address these drawbacks, we present UMUCorpusClassifier, a tool that assists researchers in compiling corpora and helps to coordinate the annotation process.

The remainder of this paper is organized as follows: Section 2 describes our proposal, and Section 3 contains information regarding studies based on corpora compiled with this tool, as well as a description of future lines of action.

2 System architecture

UMUCorpusClassifier is designed to work with the social network Twitter, which is widely used for compiling corpora in various branches of Natural Language Processing (NLP) (Pak and Paroubek, 2010). Twitter provides a developer API through which the social network can be queried. The free version of the Twitter API has some restrictions regarding the number of calls that can be made in a time interval. Therefore, to create an account on the platform, users are required to link a Twitter API key with their account. These credentials are generated through the Twitter API website. Once users have created and validated their account, they can create a corpus by entering a search string and, optionally, a geolocation. The search string has the same flexibility and limitations as the Twitter API; that is, it is possible to nest terms, do exact searches, or search for specific user accounts. A list of these operators can be found here¹. In addition, users can specify that only user-initiated messages are obtained by filtering out messages that are responses to other users' tweets. Once the search query is specified, the corpus is compiled automatically. From time to time, a scheduled cron job queries new tweets based on the keywords. UMUCorpusClassifier works effectively with timelines, making use of the since_id parameter to get new tweets and avoid duplicate results. However, this feature can be omitted in cases where the user deliberately changes the search string.
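A sketch of the scheduled compilation step described above. The fetch_tweets stub is a hypothetical stand-in for a real Twitter API client; only the since_id idea comes from the tool's description.

```python
import time

def fetch_tweets(query, since_id=None):
    """Hypothetical stand-in for a Twitter API search call; assumed to return
    a list of dicts with an 'id' field, containing only tweets newer than since_id."""
    raise NotImplementedError("wire in a real Twitter API client here")

def poll_corpus(query, interval_seconds=900):
    """Grow a corpus incrementally; since_id prevents re-fetching old tweets."""
    corpus, since_id = [], None
    while True:
        batch = fetch_tweets(query, since_id=since_id)
        if batch:
            corpus.extend(batch)
            since_id = max(tweet["id"] for tweet in batch)  # newest tweet seen so far
        time.sleep(interval_seconds)  # the tool schedules this step via a cron job
```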
Identifying duplicate tweets is a complex task. Although Twitter provides a mechanism called retweet that allows the content of messages written by other users to be disseminated, and the identification of these messages is trivial, many users of the social network use copy-and-paste mechanisms, so it is possible to find duplicate or virtually identical tweets. Also, hyperlinks in tweets are encoded differently in each new tweet due to Twitter's own hyperlink-shortening mechanism. For this reason, we have decided to replace URLs with a fixed token, which makes it easier to identify such tweets. In addition, we have added a mechanism to calculate the similarity of the texts. As this is an experimental technology, tweets are not removed, but administrators are given the opportunity to combine the responses of the tweets at their discretion.

The next step is to assign each corpus a set of independent labels. These can be taken from a set of predefined labels, such as out-of-domain, positive, negative, neutral, and do-not-know-do-not-answer, or a new set of tags can be defined for the corpus. Each label is identified by a color and a name.

The corpus labeling process can be carried out manually by the same user, or access to the platform can be granted to a set of annotators whose work is then supervised. Documents are annotated individually.

¹ https://developer.twitter.com/en/docs/tweets/rules-and-filtering/overview/standard-operators
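The URL replacement described above can be sketched as follows; the regular expression, token, and lowercasing step are our own illustration of the idea, not the tool's exact implementation.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")

def normalize(text):
    """Replace shortened hyperlinks with a fixed token so that copy-pasted
    tweets that differ only in their short URL compare equal."""
    return URL_PATTERN.sub("<URL>", text).strip().lower()

a = "Check this out https://t.co/AbC123"
b = "check this out https://t.co/XyZ789"  # same content, different short URL
print(normalize(a) == normalize(b))       # True -> candidate duplicates
```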

Figure 3: Screenshot of an annotator’s view

When an annotator enters the web application, a tweet that they have not previously classified is shown at random. Figure 1 shows the view of an annotator on the platform.

Figure 1: Screenshot of an annotator's view

To facilitate the monitoring of the labeling process, UMUCorpusClassifier provides a set of metrics and charts, such as (1) the evolution of the annotations over time, to check that the work is being done steadily; (2) the total rankings by tag; or (3) the average number of annotations as well as their standard deviation. Figure 2 shows the researcher's view of the platform.

Figure 2: Screenshot of a researcher's view

For each annotator, the degree of self-agreement is calculated, which measures how consistently the same annotator classifies semantically similar documents. To cluster similar documents, we obtain the sentence embeddings from FastText in its Spanish version (Grave et al., 2018) and then calculate the cosine distance among those vectors. Tweets that exceed a certain threshold distance are discarded.

For each tweet, one can see the degree of inter-agreement, which measures how the same tweet is classified by different annotators, using Krippendorff's alpha coefficient (Krippendorff, 2018). Moreover, inter-agreement has been proposed as an upper bound to estimate the expected accuracy when finding a classifier to solve the classification problem (Mozetič, Grčar, and Smailović, 2016). Figure 3 shows the overall inter-agreement of the corpus, as well as the inter-agreement for each pair of annotators.

Once the corpus has been compiled and annotated, it can be exported to text formats. The export process is flexible and allows choosing the number of classes to export. Combining classes is also allowed. This is useful, for example, when a classification has been made on a scale of very negative, negative, neutral, positive, and very positive values, but one may want to combine the results to return the corpus grouped into positive, neutral, and negative. Corpora can be exported balanced; that is to say, the system automatically finds the minority class and cuts instances from the other classes down to its size. These deleted instances are agreed by consensus. Finally, it is possible to export only the Twitter IDs in order to share them with the community, as recommended by Twitter's privacy policies².

Furthermore, the number of instances to export can be selected and set. One of the advantages of this approach is that corpora can be exported by consensus: since the same tweet can be classified by different annotators, the number of tweets to export can be limited to retrieve those tweets that have achieved strong consensus among annotators. Thus, subsets of the corpus comprising the documents with common agreement can be retrieved, and the rest of the documents can be analyzed. Furthermore, the software is easily extensible. In this respect, it is relatively easy to include new strategies to export the data or to extend the platform to include new data sources other than Twitter.

² https://developer.twitter.com/en/developer-terms/more-on-restricted-use-cases
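A sketch of the similarity machinery described above, assuming the official fasttext Python bindings and a pretrained Spanish model (cc.es.300.bin); the threshold value and helper names are ours.

```python
import numpy as np
import fasttext  # assumes the official bindings and a downloaded Spanish model

model = fasttext.load_model("cc.es.300.bin")  # pretrained Spanish vectors

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def similar_pairs(tweets, threshold=0.9):
    """Find semantically close tweet pairs for the self-agreement check."""
    # get_sentence_vector expects single-line input, hence the newline removal.
    vecs = [model.get_sentence_vector(t.replace("\n", " ")) for t in tweets]
    return [(i, j)
            for i in range(len(tweets))
            for j in range(i + 1, len(tweets))
            if cosine(vecs[i], vecs[j]) >= threshold]

# Inter-agreement could be computed analogously with Krippendorff's alpha,
# e.g. via the "krippendorff" package (an assumption, not the tool's code).
```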

3 Further work

In this study, we have presented UMUCorpusClassifier, an NLP tool that assists in the compilation and annotation of linguistic corpora. So far, we have used this application in several domains. Specifically, we have compiled tweets about different types of diseases to carry out infodemiology studies that involve measuring the population's perception of infectious diseases (Apolinardo-Arzube et al., 2019; García-Díaz, Cánovas-García, and Valencia-García, 2020; Medina-Moreira et al., 2018), diabetes (Medina-Moreira et al., 2019), and figurative language such as satire (Salas-Zárate et al., 2017).

In the current version of the platform it is only possible to assign one label to a document. We are working to enable multi-label classification. Another line of research is the addition of a contextual feature extraction module, enabling the analysis of the groups of Twitter accounts from which tweets are extracted. These features may include information on the time of publication, number of followers, etc. Lastly, with regard to semantic similarity, we are currently analyzing ways to distance the most different tweets from each other, so that we can export both the tweets with the strongest consensus and the most distant ones.

Acknowledgments

This demonstration has been supported by the Spanish National Research Agency (AEI) and the European Regional Development Fund (FEDER/ERDF) through projects KBS4FIA (TIN2016-76323-R) and LaTe4PSP (PID2019-107652RB-I00). In addition, José Antonio García-Díaz has been supported by Banco Santander and the University of Murcia through the Doctorado industrial programme.

References

Apolinardo-Arzube, O., J. A. García-Díaz, J. Medina-Moreira, H. Luna-Aveiga, and R. Valencia-García. 2019. Evaluating information-retrieval models and machine-learning classifiers for measuring the social perception towards infectious diseases. Applied Sciences, 9(14):2858.

García-Díaz, J. A., M. Cánovas-García, and R. Valencia-García. 2020. Ontology-driven aspect-based sentiment analysis classification: An infodemiological case study regarding infectious diseases in Latin America. Future Generation Computer Systems, 112:614–657.

Go, A., R. Bhayani, and L. Huang. 2009. Twitter sentiment classification using distant supervision. CS224N project report, Stanford, 1(12):2009.

Grave, E., P. Bojanowski, P. Gupta, A. Joulin, and T. Mikolov. 2018. Learning word vectors for 157 languages. arXiv preprint arXiv:1802.06893.

Krippendorff, K. 2018. Content analysis: An introduction to its methodology. Sage Publications.

Medina-Moreira, J., J. A. García-Díaz, O. Apolinardo-Arzube, H. Luna-Aveiga, and R. Valencia-García. 2019. Mining Twitter for measuring social perception towards diabetes and obesity in Central America. In International Conference on Technologies and Innovation, pages 81–94. Springer.

Medina-Moreira, J., J. O. Salavarria-Melo, K. Lagos-Ortiz, H. Luna-Aveiga, and R. Valencia-García. 2018. Opinion mining for measuring the social perception of infectious diseases. An infodemiology approach. In Proceedings of the Technologies and Innovation: 4th International Conference, CITI, page 229. Springer.

Mozetič, I., M. Grčar, and J. Smailović. 2016. Multilingual Twitter sentiment classification: The role of human annotators. PloS ONE, 11(5).

Pak, A. and P. Paroubek. 2010. Twitter as a corpus for sentiment analysis and opinion mining. In LREC, volume 10, pages 1320–1326.

Salas-Zárate, M. d. P., M. A. Paredes-Valverde, M. Á. Rodríguez-García, R. Valencia-García, and G. Alor-Hernández. 2017. Automatic detection of satire in Twitter: A psycholinguistic-based approach. Knowledge-Based Systems, 128:20–33.

Singh, A., N. Thakur, and A. Sharma. 2016. A review of supervised machine learning algorithms. In 2016 3rd International Conference on Computing for Sustainable Global Development (INDIACom), pages 1310–1315. IEEE.

Información General

Información para los Autores

Formato de los Trabajos
• La longitud máxima admitida para las contribuciones será de 8 páginas DIN A4 (210 x 297 mm), incluidas referencias y figuras.
• Los artículos pueden estar escritos en inglés o español. El título, resumen y palabras clave deben escribirse en ambas lenguas.
• El formato será en Word o LaTeX.

Envío de los Trabajos
• El envío de los trabajos se realizará electrónicamente a través de la página web de la Sociedad Española para el Procesamiento del Lenguaje Natural (http://www.sepln.org)
• Para los trabajos con formato LaTeX se mandará el archivo PDF junto a todos los fuentes necesarios para la compilación LaTeX.
• Para los trabajos con formato Word se mandará el archivo PDF junto al DOC o RTF.
• Para más información: http://www.sepln.org/la-revista/informacion-para-autores

Información Adicional

Funciones del Consejo de Redacción
Las funciones del Consejo de Redacción o Editorial de la revista SEPLN son las siguientes:
• Controlar la selección y tomar las decisiones en la publicación de los contenidos que han de conformar cada número de la revista
• Política editorial
• Preparación de cada número
• Relación con los evaluadores y autores
• Relación con el comité científico

El consejo de redacción está formado por los siguientes miembros:
L. Alfonso Ureña López (Director) Universidad de Jaén [email protected]
Patricio Martínez Barco (Secretario) Universidad de Alicante [email protected]
Manuel Palomar Sanz Universidad de Alicante [email protected]
Felisa Verdejo Maíllo UNED [email protected]

Funciones del Consejo Asesor
Las funciones del Consejo Asesor o Científico de la revista SEPLN son las siguientes:
• Marcar, orientar y redireccionar la política científica de la revista y las líneas de investigación a potenciar
• Representación
• Impulso a la difusión internacional
• Capacidad de atracción de autores
• Evaluación
• Composición
• Prestigio
• Alta especialización
• Internacionalidad

El Consejo Asesor está formado por los siguientes miembros:
Manuel de Buenaga Universidad de Alcalá (España)
Sylviane Cardey-Greenfield Centre de recherche en linguistique et traitement automatique des langues (Francia)
Irene Castellón Universidad de Barcelona (España)
Arantza Díaz de Ilarraza Universidad del País Vasco (España)
Antonio Ferrández Rodríguez Universidad de Alicante (España)
Alexander Gelbukh Instituto Politécnico Nacional (México)
Koldo Gojenola Universidad del País Vasco (España)
Xavier Gómez Guinovart Universidad de Vigo (España)
José Miguel Goñi Menoyo Universidad Politécnica de Madrid (España)
Ramón López-Cózar Delgado Universidad de Granada (España)
Bernardo Magnini Fondazione Bruno Kessler (Italia)
Nuno J. Mamede Instituto de Engenharia de Sistemas e Computadores (Portugal)
M. Antònia Martí Antonín Universidad de Barcelona (España)
M. Teresa Martín Valdivia Universidad de Jaén (España)
Patricio Martínez-Barco Universidad de Alicante (España)
Paloma Martínez Fernández Universidad Carlos III (España)
Eugenio Martínez Cámara Universidad de Granada (España)
Raquel Martínez Unanue Universidad Nacional de Educación a Distancia (España)
Ruslan Mitkov University of Wolverhampton (Reino Unido)
Manuel Montes y Gómez Instituto Nacional de Astrofísica, Óptica y Electrónica (México)
Lluís Padró Cirera Universidad Politécnica de Cataluña (España)
Manuel Palomar Sanz Universidad de Alicante (España)
Ferrán Pla Santamaría Universidad Politécnica de Valencia (España)
German Rigau Claramunt Universidad del País Vasco (España)
Horacio Rodríguez Hontoria Universidad Politécnica de Cataluña (España)
Paolo Rosso Universidad Politécnica de Valencia (España)
Leonel Ruiz Miyares Centro de Lingüística Aplicada de Santiago de Cuba (Cuba)
Emilio Sanchís Universidad Politécnica de Valencia (España)
Kepa Sarasola Gabiola Universidad del País Vasco (España)
Encarna Segarra Soriano Universidad Politécnica de Valencia (España)
Thamar Solorio University of Houston (Estados Unidos de América)
Maite Taboada Simon Fraser University (Canadá)
Mariona Taulé Delor Universidad de Barcelona (España)
Juan-Manuel Torres-Moreno Laboratoire Informatique d'Avignon / Université d'Avignon (Francia)
José Antonio Troyano Jiménez Universidad de Sevilla (España)
L. Alfonso Ureña López Universidad de Jaén (España)
Rafael Valencia García Universidad de Murcia (España)
René Venegas Velásquez Universidad Católica de Valparaíso (Chile)
Felisa Verdejo Maíllo Universidad Nacional de Educación a Distancia (España)
Manuel Vilares Ferro Universidad de la Coruña (España)
Luis Villaseñor-Pineda Instituto Nacional de Astrofísica, Óptica y Electrónica (México)

Cartas al director Sociedad Española para el Procesamiento del Lenguaje Natural Departamento de Informática. Universidad de Jaén Campus Las Lagunillas, Edificio A3. Despacho 127. 23071 Jaén [email protected]

Más información
Para más información sobre la Sociedad Española para el Procesamiento del Lenguaje Natural puede consultar la página web http://www.sepln.org.
Si desea inscribirse como socio de la Sociedad Española para el Procesamiento del Lenguaje Natural puede realizarlo a través del formulario web que se encuentra en esta dirección: http://www.sepln.org/sepln/la-sociedad
Los números anteriores de la revista se encuentran disponibles en la revista electrónica: http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/issue/archive
Las funciones del Consejo de Redacción están disponibles en Internet a través de http://www.sepln.org/la-revista/consejo-de-redaccion
Las funciones del Consejo Asesor están disponibles en Internet a través de la página http://www.sepln.org/la-revista/consejo-asesor
La inscripción como nuevo socio de la SEPLN se puede realizar a través de la página http://www.sepln.org/sepln/inscripcion-para-nuevos-socios

Demostraciones

EmoCon: Analizador de emociones en el congreso de los diputados
Salud María Jiménez-Zafra, Miguel Ángel García-Cumbreras, María Teresa Martín-Valdivia, Andrea López-Fernández ...... 115
Nalytics: natural speech and text analytics
Ander González-Docasal, Naiara Pérez, Aitor Álvarez, Manex Serras, Laura García-Sardiña, Haritz Arzelus, Aitor García-Pablos, Montse Cuadros, Paz Delgado, Ane Lazpiur, Blanca Romero ...... 119
RESIVOZ: dialogue system for voice-based information registration in eldercare
Laura García-Sardiña, Manex Serras, Arantza del Pozo, Mikel D. Fernández-Bhogal ...... 123
The text complexity library
Rocío López Anguita, Jaime Collado Montañez, Arturo Montejo Ráez ...... 127
The impact of coronavirus on our mental health
Jiawen Wu, Francisco Rangel, Juan Carlos Martínez ...... 131
AREVA: augmented reality voice assistant for industrial maintenance
Manex Serras, Laura García-Sardiña, Bruno Simões, Hugo Álvarez, Jon Arambarri ...... 135
UMUCorpusClassifier: compilation and evaluation of linguistic corpus for natural language processing tasks
José Antonio García-Díaz, Ángela Almela, Gema Alcaraz-Mármol, Rafael Valencia-García ...... 139

Información General
Información para los autores ...... 145
Información adicional ...... 147