

To cite this version:

Elena Cabrio. Artificial Intelligence to Extract, Analyze and Generate Knowledge and Arguments from Texts to Support Informed Interaction and Decision Making. Artificial Intelligence [cs.AI]. Université Côte d'Azur, 2020. tel-03084380

HAL Id: tel-03084380 https://hal.inria.fr/tel-03084380 Submitted on 21 Dec 2020


Université Côte d'Azur

HABILITATION THESIS
Habilitation à Diriger des Recherches (HDR)

Major: Computer Science

Elena CABRIO

Artificial Intelligence to Extract, Analyze and Generate Knowledge and Arguments from Texts to Support Informed Interaction and Decision Making

Jury:
Fabien Gandon, Research Director, INRIA (France) - President
Pietro Baroni, Full Professor, Università di Brescia (Italy) - Rapporteur
Marie-Francine Moens, Full Professor, KU Leuven (Belgium) - Rapporteur
Anne Vilnat, Full Professor, Université Paris-Sud (France) - Rapporteur
Chris Reed, Full Professor, University of Dundee (UK) - Examinateur

October 22, 2020

Contents

1 Introduction
  1.1 Information Extraction to generate structured knowledge
  1.2 Natural interaction with the Web of Data
  1.3 Mining argumentative structures from texts
  1.4 Cyberbullying and abusive language detection
  1.5 Structure of this report

2 IE to generate structured knowledge
  2.1 Towards Lifelong Object Learning
  2.2 Mining semantic knowledge from the Web
  2.3 Natural Language Processing of Song Lyrics
  2.4 Lyrics Segmentation via bimodal text-audio representation
  2.5 Song lyrics summarization inspired by audio thumbnailing
  2.6 Enriching the WASABI Song Corpus with lyrics annotations
  2.7 Events extraction from social media posts
  2.8 Building events timelines from microblog posts
  2.9 Conclusions

3 Natural language interaction with the Web of Data
  3.1 QAKiS
  3.2 RADAR 2.0: a framework for information reconciliation
  3.3 Multimedia answer visualization in QAKiS
  3.4 Related Work
  3.5 Conclusions

4 Mining argumentative structures from texts
  4.1 A natural language bipolar argumentation approach
  4.2 Argument Mining on Twitter: arguments, facts and sources
  4.3 Argumentation analysis for political speeches
  4.4 Mining arguments in 50 years of US presidential campaign debates
  4.5 Argument Mining for Healthcare applications
  4.6 Related work
  4.7 Conclusions

5 Cyberbullying and abusive language detection
  5.1 Introduction
  5.2 A Multilingual evaluation for online hate speech detection
  5.3 Cross-platform evaluation for Italian hate speech detection
  5.4 A system to monitor cyberbullying
  5.5 Related Work
  5.6 Conclusions


6 Conclusions and Perspectives

Acknowledgements

I would like to thank the main authors of the publications that are summarized here for their kind agreement to let these works contribute to this thesis.

Chapter 1

Introduction

This document relates and synthesizes my research and research management experience since I joined the EDELWEISS team led by Olivier Corby in 2011 for a postdoctoral position at Inria Sophia Antipolis (2 years funded by the Inria CORDIS postdoctoral program, followed by 1 year funded by the Labex UCN@SOPHIA and 1 year funded by the SMILK LabCom project). In 2011, the Inria EDELWEISS team and the I3S KEWI team merged to become WIMMICS (Web Instrumented Man-Machine Interactions, Communities and Semantics, https://team.inria.fr/wimmics/), a joint team between Inria, University of Nice Sophia Antipolis and CNRS, led by Fabien Gandon. In October 2015, I obtained an Assistant Professor position (Maître de Conférences) at the University of Nice Sophia Antipolis, now part of the Université Côte d'Azur. WIMMICS is a sub-group of the SPARKS team (Scalable and Pervasive softwARe and Knowledge Systems, https://sparks.i3s.unice.fr/) in I3S, which was structured into three themes in 2015. My research activity mainly contributes to the FORUM theme (FORmalising and Reasoning with Users and Models).

Throughout this 9-year period, I was involved in several research projects, and my role has progressively evolved from junior researcher to scientific leader. I initiated several research projects on my own, and I supervised several PhD theses. In the meantime, I was also involved in the scientific animation of my research community.

My research area is Natural Language Processing, and the majority of my works fall in the sub-areas of Argument(ation) Mining and Information Extraction. The long-term goal of my research (and of work in this research area) is to make computers/machines as intelligent as human beings in understanding and generating language, and thus able: to speak, to make deductions, to ground on common knowledge, to answer, to debate, to support humans in decision making, to explain, to persuade. Natural language understanding can come in many forms. In my research career so far I have put effort into investigating some of these forms, strongly connected to the actions I would like intelligent artificial systems to be able to perform.

In this direction, my first research topic was the study of semantic inferences in natural language texts, which I investigated in the context of my PhD in Information and Communication Technologies (International Doctoral School in Trento, Italy), supervised by Bernardo Magnini at the Fondazione Bruno Kessler, Human Language Technology research group (Trento, Italy). During my PhD, I focused on better understanding the semantic inference needs across NLP applications, and in particular on the Textual Entailment (TE) framework. Textual Entailment has been proposed as a generic framework to model language variability and capture major semantic inference needs across applications in Natural Language Processing. In the TE recognition (RTE) task, systems are asked to automatically judge whether the meaning of a portion of text (T) entails the meaning of another text (the Hypothesis, H); for instance, the text "Google acquired YouTube" entails the hypothesis "YouTube belongs to Google". TE comes at various levels of complexity and involves almost all linguistic phenomena of natural language.



Although several approaches have been experimented with, the performance of TE systems is still far from optimal. My work started from the belief that crucial progress may derive from decomposing the complexity of the TE task into basic phenomena and from studying their combination. More specifically, I analyzed how the common intuition of decomposing TE allows a better comprehension of the problem from both a linguistic and a computational viewpoint. I proposed a framework for component-based TE, where each component is in itself a complete TE system, able to address a TE task on a specific phenomenon in isolation. I investigated the following dimensions: i) the definition and implementation of a component-based TE architecture; ii) the linguistic analysis of the phenomena; iii) the automatic acquisition of knowledge to support entailment judgments; iv) the development of evaluation methodologies to assess TE systems' capabilities. During my PhD, I also spent 6 months at the Xerox Research Center in Grenoble (now NAVER Labs Europe), in the Parsing and Semantics research lab, where I continued my research on my PhD topic.

Following the long-term goal mentioned above, after defending my PhD my research topics gradually evolved from the study of semantic inferences between textual snippets to the investigation of methods to extract structured information from unstructured natural language text, in order to populate knowledge bases in different application scenarios, with the goal of making intelligent systems ground on common knowledge. To enhance users' interactions with such structured data, I then addressed the challenge of mapping natural language expressions (e.g., user queries) to concepts and relations in structured knowledge bases, implementing a question answering system. The following step was to focus on mining and analyzing argumentative structures (from heterogeneous sources, but mainly from the Web), as well as connecting them with datasets on the Web of Data for their interlinking and semantic enrichment. I am one of the very first initiators of the research topic, very popular nowadays, called Argument Mining.

The rationale behind my research work (as one of the core topics of the WIMMICS team) is to support structured argument exchange, informed decision making and improved fact-checking, also considering the role of emotions and persuasion. For this reason, argumentative structures and relevant factual information should be mined, understood and interlinked; the Web represents both an invaluable information source and the ideal place to publish the mined results, producing new types of links in the global vision of weaving a richer and denser Web. To preserve an open, safe and accessible Web, another aspect of increasing importance in this scenario is the development of robust systems to detect abuse in online user-generated content. In this context, we addressed the task of multilingual abusive language detection, taking advantage of both the network analysis and the content of the short text messages exchanged on online platforms to detect cyberbullying phenomena.

This chapter is organized as follows: Sections 1.1, 1.2, 1.3 and 1.4 provide an overview of the four main research areas my contributions deal with. More specifically, in each section I provide a survey of the research area and a description of the projects in which I am or was involved, presented by application domain. The scientific contributions themselves are detailed in the next chapters.

1.1 Information Extraction to generate structured knowledge

Information extraction (IE) is the process of extracting specific (pre-specified) information from textual sources. By gathering detailed structured data from texts (where there is a regular and predictable organization of entities and relationships), information extraction enables the automation of tasks such as smart content classification, integrated search, management and delivery, and data-driven activities such as mining for patterns and trends, uncovering hidden relationships, and so on. Broadly speaking, the goal of IE is to allow computation to be done on previously unstructured data. Typically, for structured information to be extracted from unstructured texts, the following main subtasks are involved: (i) pre-processing of the text, (ii) finding and classifying concepts (e.g., mentions of people, things, locations, events and other pre-specified types of concepts), (iii) connecting the concepts (i.e., identifying relationships between the extracted concepts), (iv) eliminating duplicate data, and (v) enriching a knowledge base for further use. A minimal sketch of such a pipeline is given at the end of this section.

Natural language remains the most natural means of interaction and externalization of information for humans. Manually managing and effectively making use of the growing amount of information available in unstructured form (e.g., news, articles, social media) is tedious and labor intensive. Information Extraction systems take natural language text as input and produce structured information specified by certain criteria, that is relevant to a particular application. I have conceived and applied Information Extraction methods in different application scenarios, and in the context of several projects.

I was the French project leader of the ALOOF project "Autonomous learning of the meaning of objects" (funded by CHIST-ERA, 2015-2018, https://project.inria.fr/aloof/2015/09/22/the-aloof-project/), whose goal was to enable robots and autonomous systems working with and for humans to exploit the vast amount of knowledge on the Web in order to learn about previously unseen objects involved in human activities, and to use this knowledge when acting in the real world. In particular, the goal of our research was to mine relevant knowledge from the Web on domestic settings (including objects' names, class properties, appearance and shape, storage locations, and typical usages/functions). We presented a framework for extracting such knowledge in the form of (binary) relations. It relied on a ranking measure that, given an object, ranks all entities that potentially stand in the relation in question to the given object. More precisely, we relied on a representational approach that exploits distributional spaces to embed entities into low-dimensional spaces in which the ranking measure can be evaluated. In the context of this project I supervised the work of Valerio Basile and Roque Lopez Condori, research engineers.

I was also involved in the WASABI project (funded by ANR, 2016-2019, http://wasabihome.i3s.unice.fr/), whose goal was the creation of a 2 million song knowledge base that combines metadata collected from music databases on the Web, metadata resulting from the analysis of song lyrics, and metadata resulting from audio analysis, and the development of semantic applications with high added value to exploit this semantic database. In the context of this project, I co-supervised the Ph.D. thesis of Michael Fell on Natural Language Processing of song lyrics.
Given that lyrics encode an important part of the semantics of a song, our research activity in this context focused on extracting relevant information from the lyrics, such as their structure segmentation, their topics, the explicitness of the lyrics content, the salient passages of a song and the emotions conveyed. Together with researchers at IRCAM (Paris), partners of the WASABI project, we investigated the coupling of text and audio to improve system performance for Music Information Retrieval.

The purpose of the LabCom SMILK "Social Media Intelligence and Linked Knowledge" (a common laboratory between WIMMICS and the company ViseoGroup in Grenoble, 2014-2017, project.inria.fr/smilk/) was to develop research and technologies in order to retrieve, analyze, and reason on textual data coming from Web sources, and to make use of Linked Open Data, social network structures and interactions in order to improve the analysis and understanding of textual resources. In the context of this project I supervised the work of Farhad Nooralahzadeh, research engineer (now PhD at the University of Oslo), on semantically enriching raw data by linking

the mentions of named entities in the text to the corresponding known entities in knowledge bases. In the context of the same project, we experimented with brand-related information retrieval in the field of cosmetics. We created the ProVoc ontology to describe products and brands, we automatically populated a knowledge base mainly based on ProVoc from heterogeneous textual resources, and we developed a browser plugin providing additional knowledge to users browsing the web, relying on this ontology.

In the Topic Detection and Tracking (TDT) community, the task of analyzing textual documents for detecting events is called Event Detection (ED). ED is generally defined as a discovery problem, i.e., mining a set of documents for new patterns [487]. It mainly consists in discovering new or tracking previously identified events. ED from texts was mainly focused on finding and following events using conventional media sources, such as a stream of broadcast news stories. In recent years, NLP researchers have shown growing interest in mining knowledge from social media data, especially Twitter, to detect and extract structured representations of newsworthy events and to summarize them. In this context, I have supervised the PhD thesis of Amosse Edouard on methods for detecting, classifying and tracking events on Twitter.
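As a purely illustrative companion to the generic subtasks (i)-(iii) listed at the beginning of this section, and not tied to any of the projects above, the following minimal sketch uses the off-the-shelf spaCy library to find and classify entity mentions and to connect entities that co-occur in the same sentence; steps (iv) and (v) would then deduplicate the extracted pairs and load them into a knowledge base.

```python
import spacy

# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

text = ("The WASABI project builds a knowledge base of two million songs. "
        "It is funded by the ANR.")
doc = nlp(text)

# (ii) find and classify concept mentions (named entities)
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)

# (iii) connect the concepts: naive relation candidates for entity pairs
#       that co-occur within the same sentence
relations = []
for sent in doc.sents:
    ents = list(sent.ents)
    for i, first in enumerate(ents):
        for second in ents[i + 1:]:
            relations.append((first.text, "related_to", second.text))
print(relations)
```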

1.2 Natural language interaction with the Web of Data

While more and more structured data is published on the web, the question of how typical web users can access this body of knowledge remains of crucial importance. Over the past years, there has been a growing amount of research on interaction paradigms that allow end users to profit from the expressive power of Semantic Web standards while at the same time hiding their complexity behind an intuitive and easy-to-use interface. Natural language interfaces in particular have received wide attention, as they allow users to express arbitrarily complex information needs in an intuitive fashion and, at least in principle, in their own language. Multilingualism is, in fact, an issue of major interest for the Semantic Web community, as both the number of actors creating and publishing data in languages other than English and the number of users that access this data and speak native languages other than English are growing substantially.

The key challenge is to translate the users' information needs into a form such that they can be evaluated using standard Semantic Web query processing and inferencing techniques. Over the past years, a range of approaches have been developed to address this challenge, showing significant advances towards answering natural language questions with respect to large, heterogeneous sets of structured data [449]. However, only few systems yet address the fact that the structured data available nowadays is distributed among a large collection of interconnected datasets, and that answers to questions can often only be provided if information from several sources is combined. In addition, a lot of information is still available only in textual form, both on the web and in the form of labels and abstracts in linked data sources. Therefore, approaches are needed that can not only deal with the specific character of structured data, but also find information in several sources, process both structured and unstructured information, and combine such gathered information into one answer [402, 451].

In this context, I have addressed the problem of enhancing interactions between non-expert users and data available on the Web of Data, focusing in particular on the automatic extraction of structured data from unstructured documents to populate RDF triple stores, and, in a Question Answering (QA) setting, on the mapping of natural language expressions (e.g., user queries) to concepts and relations in a structured knowledge base. In particular, I have designed and implemented the models and algorithms of a Question Answering system over Linked Data (QAKiS), based on DBpedia as the RDF dataset to be queried using a natural language interface. To reconcile information obtained from language-specific DBpedia chapters, I have integrated an argumentation-based module into the system to reason over inconsistent information sets, so as to provide a unique and justified answer to the user.
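To make the question-to-query mapping concrete, here is a minimal sketch (not QAKiS's actual query-generation patterns) of the kind of SPARQL query such a system could produce for the question "Where was Albert Einstein born?", executed against the public DBpedia endpoint with the SPARQLWrapper library.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Hypothetical output of the question-interpretation step for
# "Where was Albert Einstein born?": a single triple pattern over DBpedia.
query = """
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?place WHERE { dbr:Albert_Einstein dbo:birthPlace ?place }
"""

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery(query)
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

for binding in results["results"]["bindings"]:
    print(binding["place"]["value"])
```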

1.3 Mining argumentative structures from texts

In recent years, the growth of the Web and the daily increasing amount of textual data published there for different purposes have highlighted the need to process such data in order to identify, structure and summarize this huge amount of information. Online newspapers, blogs, online debate platforms and social networks, but also normative and technical documents, provide a heterogeneous flow of information in which natural language arguments can be identified and analyzed. The availability of such data, together with the advances in Natural Language Processing and Machine Learning, supported the rise of a new research area called Argument(ation) Mining (AM). AM is a young and emerging research area within computational linguistics. At its heart, AM involves the automatic identification of argumentative structures in free text, such as the conclusions, premises, and inference schemes of arguments, as well as their interrelations and counter-considerations [92]. Two main stages have to be considered in the typical argument mining pipeline, which leads from unstructured natural language documents towards structured (possibly machine-readable) data:

Arguments' extraction: The first stage of the pipeline is to detect arguments within the input natural language texts. Referring to standard argument graphs, the retrieved arguments will thus represent the nodes in the final argument graph returned by the system. This step may be further split into two stages: the extraction of arguments and the detection of their boundaries.

Relations' extraction: The second stage of the pipeline consists in constructing the argument graph to be returned as output of the system. The goal is to predict the relations holding between the arguments identified in the first stage. This is an extremely complex task, as it involves high-level knowledge representation and reasoning issues. The relations between the arguments may be of heterogeneous nature, like attack, support or entailment [24]. This stage is also in charge of predicting, in structured argumentation, the internal relations of the argument's components, such as the connection between the premises and the claim [11, 77]. Since this is an extremely challenging task, existing approaches adopt simplifying hypotheses, like the fact that evidence is always associated with a claim [1]. A minimal sketch of the first stage of such a pipeline is given below.
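The sketch below only illustrates the first stage (argument component detection), cast as sentence classification with a bag-of-words model; it is not one of the systems described in this thesis, and the training sentences and labels are invented for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy training data (invented): sentences labeled as claim, premise or other
sentences = [
    "Vaccination should be mandatory for all children.",
    "Smoking in public places must be banned.",
    "Studies show that second-hand smoke increases cancer risk.",
    "The trial enrolled 120 patients over two years.",
    "The weather was nice yesterday.",
    "I took the train to work this morning.",
]
labels = ["claim", "claim", "premise", "premise", "other", "other"]

classifier = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("model", LogisticRegression(max_iter=1000)),
])
classifier.fit(sentences, labels)

print(classifier.predict(["Exercise reduces the risk of heart disease."]))
```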

To date, researchers have investigated AM on genres such as legal documents [330, 18], product reviews [456, 481], news articles [178, 5], online debates [89, 461], user-generated web discourse [364, 204, 154], Wikipedia articles [274, 396], academic literature [265], bioscience literature [202], persuasive essays [425], and dialogues [74, 6]. Recently, also argument quality assessment came into focus [205, 162]. AM is also inherently tied to stance and sentiment analysis [25], since every argument carries a stance towards its topic, often expressed with sentiment. Argument mining gives rise to various practical applications of great importance. In particular, it provides methods that can find and visualize the main pro and con arguments in a text corpus (or even in an argument search on the web) towards a topic or query of interest [396, 423]. In instructional contexts, written and diagrammed arguments represent educational data that can be mined for conveying and assessing students' command of course material [352].

In this context, together with Serena Villata (CNRS), I proposed a combined framework of natural language processing and argumentation theory to support human users in their interactions. The framework combines a natural language processing module, which exploits the Textual Entailment (TE) approach (previously investigated during my PhD thesis) and detects the arguments in natural language debates and the relationships among them, and an argumentation module, which represents the debates as graphs and detects the accepted arguments. The argumentation module is grounded on bipolar argumentation. Moreover, I studied the relation between the notion of support in bipolar argumentation and the notion of TE in Natural Language Processing (NLP). This research line about natural models of argumentation resulted in the NoDE Benchmark (http://www-sop.inria.fr/NoDE/), a benchmark of natural arguments extracted from different kinds of textual sources. It is composed of three datasets of natural language arguments, released in two machine-readable formats, i.e., the standard XML format and the XML/RDF format adopting the (extended) SIOC-Argumentation vocabulary. Arguments are connected by two kinds of relations: a positive (i.e., support) relation and a negative (i.e., attack) relation, leading to the definition of bipolar argumentation graphs. Together with Serena Villata (CNRS) and Sara Tonelli (FBK Trento), I also started to investigate the mapping between argumentation schemes in argumentation theory and discourse in Natural Language Processing.

This research line continued by applying argument mining to political speeches. I also co-supervise (official supervisors are Serena Villata and Leendert van der Torre) the PhD thesis of Shohreh Haddadan (University of Luxembourg, 2017-2020) on "Argument Mining on Political Debates". Given that political debates offer a rare opportunity for citizens to compare the candidates' positions on the most controversial topics of the campaign, we carried out an empirical investigation of the typology of argument components in political debates by annotating a set of political debates from the last 50 years of US presidential campaigns, creating a new corpus of 29k argument components, labeled as premises and claims.
Along the same line, I also co-supervise (official supervisors are Serena Villata and Céline Poudat) the PhD thesis of Tobias Mayer on "Argument Mining on medical data"; in particular, so far we have focused on extracting argumentative structures and their relations from Randomized Clinical Trials (RCTs). Evidence-based decision making in the health-care domain aims at supporting clinicians in their deliberation process to establish the best course of action for the case under evaluation. We proposed a complete argument mining pipeline for RCTs, classifying argument components as evidence and claims, and predicting the relation, i.e., attack or support, holding between those argument components, on a newly created dataset of RCTs. We experimented with deep bidirectional transformers in combination with different neural networks, and outperformed current state-of-the-art end-to-end argument mining systems. This set of works is carried out in the context of the IADB project (Intégration et Analyse de Données Biomédicales, 2017-2020, http://www.i3s.unice.fr/~riveill/IADB/), funded by IDEX UCA Jedi. The goal of the project is to define methods to access, process and extract information from a huge quantity of biomedical data from heterogeneous sources and of different natures (e.g., texts, images).

Another scenario to which I have contributed with argument mining methods concerned the extraction of argumentative text from short social media messages, in the context of the Inria CARNOT project "Natural Language Argumentation on Twitter" (2014-2015) with the start-up Vigiglobe, based in Sophia Antipolis. Understanding and interpreting the flow of messages exchanged in real time on social platforms, like Twitter, raises several important issues. The large amount of information exchanged on these platforms represents a significant value for whoever is able to read and enrich this multitude of information. Users directly provide this information, and it is interesting to analyze such data both from the quantitative and the qualitative point of view, especially for what concerns reputation and marketing issues (regarding brands, institutions or public actors). Moreover, the automated treatment of this type of data and the constraints it presents (e.g., limited number of characters, tweets with a particular writing style, amount of data, real-time communication) offer a new and rich context for a challenging use of existing tools for argument mining. In the context of this project, I supervised the activity of Tom Bosc, a research engineer, now doing a PhD in Canada. Also in the context of this

project, I supervised the 3-month internship of Mihai Dusmanu (École Normale Supérieure, Paris) on Argument Detection on Twitter.

Still on argument mining, I am currently involved in the DGA RAPID CONFIRMA project "COntre argumeNtation contre les Fausses InfoRMAtions". Counter-argumentation is a process aiming to put forward counter-arguments in order to provide evidence against a certain argument previously proposed. In the case of fake news, in order to convince a person that the (fake) information is true, the author of the fake news uses different methods of persuasion via arguments. Thus, the goal of our research is to propose effective methods to identify these arguments and to attack them by proposing carefully constructed arguments from safe sources, as a way to fight this phenomenon and its spread along the social network. In the context of this project, I am co-supervising the postdoc of Jérôme Delobelle.

My research activity dealing with emotions in argumentation started with the SEEMPAD Associate Team project "Social Exchanges and Emotions in Mediated Polemics - Analysis and Data" (Joint Research Team with the Heron Lab, 2014-2016, https://project.inria.fr/seempad/). The goal of this project was to study the different dimensions of exchanges arising on online discussion platforms. More precisely, we concentrated on the study and analysis of the impact of emotions and mental states on the argumentation in online debates, with a particular attention to the application of argumentative persuasion strategies.

Recently, the ANSWER Inria-Qwant research project (https://project.inria.fr/answer/) was launched, whose goal is to develop the new version of the Qwant search engine by introducing radical innovations in terms of search criteria as well as indexed content and users' privacy. In this context, I am co-supervising the PhD thesis of Vorakit Vorakitphan (2018-2021) on extracting effective and scalable indicators of sentiment, emotions, and argumentative relations in order to offer the users additional means to filter the results selected by the search engine.

1.4 Cyberbullying and abusive language detection

The use of social media platforms such as Twitter, Facebook and Instagram has enormously increased the number of online social interactions, connecting billions of users, favouring the exchange of opinions and giving visibility to ideas that would otherwise be ignored by traditional media. However, this has also led to an increase of attacks targeting specific groups of users based on their religion, ethnicity or social status, and individuals often struggle to deal with the consequences of such offenses.

This problem affects not only the victims of online abuse, but also stakeholders such as governments and social media platforms. For example, Facebook, Twitter, YouTube and Microsoft have recently signed a code of conduct (http://ec.europa.eu/justice/fundamental-rights/files/hate_speech_code_of_conduct_en.pdf), proposed by the European Commission, pledging to review the majority of valid notifications for removal of illegal hate speech in less than 24 hours.

Within the Natural Language Processing (NLP) community, there have been several efforts to deal with the problem of online abusive language detection, since the computational analysis of language can be used to quickly identify offenses and ease the removal of abusive messages. Several workshops [471, 182] and evaluation campaigns [179, 70, 477, 38, 493] have been recently organised to discuss existing approaches to hate speech detection, propose shared tasks and foster the development of benchmarks for system evaluation. These have led to the creation of a number of datasets for hate speech detection in different languages, that have been shared within the NLP research community. Recent advances in deep learning approaches to text classification have then been applied also to deal with this task, achieving for some languages state-of-the-art

results [116, 185, 190]. These systems are usually tailored to deal with social media texts by applying pre-processing, using domain-specific embeddings, adding textual features, etc.

Related to this context, I am the French leader of the EIT CREEP project (Cyberbullying Effect Prevention, 2017-2019), funded by EIT Digital (http://creep-project.eu/en/). The purpose of CREEP is to provide a set of tools to support the detection and prevention of psychological/behavioral problems of teenage victims of cyberbullying. The objective is achieved by combining social media monitoring and motivational technologies (virtual coaches integrating chatbots). From February 2018 to October 2019, I supervised the activity of two research engineers, Pinar Arslan and Michele Corazza, working on a multilingual platform to monitor cyberbullying based on message classification and social network analysis. Together with them and Sara Tonelli (FBK, Trento, partner of the CREEP project), we proposed a deep learning architecture for abusive language detection that is rather stable and well-performing across different languages, and we evaluated the contribution of several components that are usually employed in the task, namely the type of embeddings, the use of additional features (text-based or emotion-based), the role of hashtag normalisation and that of emojis. We performed our comparative evaluation on English, Italian and German, focusing on freely available Twitter datasets for hate speech detection. Our goal is to identify a set of recommendations to develop hate speech detection systems, possibly going beyond language-specific differences. Moreover, we performed a comparative evaluation on freely available datasets for hate speech detection in Italian, extracted from four different social media platforms, i.e., Facebook, Twitter, Instagram and WhatsApp, to understand whether it would be advisable to combine such platform-dependent datasets to take advantage of training data developed for other platforms in low-resource scenarios. We have integrated the proposed classifiers for hate speech detection as part of the CREEP Social Media Analytics System for cyberbullying prevention.

Cyberbullying is "an aggressive, intentional act carried out by a group or an individual, using electronic forms of contact, repeatedly and over time against a victim who cannot easily defend him or herself". In online social media, each episode of online activity aimed at offending, menacing, harassing or stalking another person can be classified as a cyberbullying phenomenon. This is even connected with concrete public health issues, since recent studies show that victims are more likely to suffer from psycho-social difficulties and affective disorders. Given its societal impact, the implementation of cyberbullying detection systems, combining abusive language detection and social network analysis, has attracted a lot of attention in the last years. In this context, together with colleagues at Inria Rennes (partners of the CREEP project), we are investigating the potential of coupling text and images in the cyberbullying detection task, to also consider cases in which the bullying phenomenon starts from a picture posted on a social media platform, or in which an image is used to insult someone.
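Purely as an illustration of the social-media-specific pre-processing mentioned above (hashtag normalisation and emoji handling), and not of the actual CREEP components, a minimal sketch could look as follows; the emoji lexicon is a tiny invented placeholder.

```python
import re

# Tiny invented emoji lexicon used only for this example
EMOJI_MAP = {"😂": " laughing ", "😢": " crying ", "❤": " heart "}

def normalize_hashtag(match):
    """Split a camel-case hashtag such as #StopBullying into plain words."""
    tag = match.group(1)
    words = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", tag)
    return " ".join(words).lower() if words else tag.lower()

def preprocess(tweet):
    # Replace user mentions and URLs with generic placeholders
    tweet = re.sub(r"@\w+", "<user>", tweet)
    tweet = re.sub(r"https?://\S+", "<url>", tweet)
    # Normalise hashtags and map emojis to words
    tweet = re.sub(r"#(\w+)", normalize_hashtag, tweet)
    for emoji_char, word in EMOJI_MAP.items():
        tweet = tweet.replace(emoji_char, word)
    return re.sub(r"\s+", " ", tweet).strip().lower()

print(preprocess("@user You are pathetic 😂 #StopBullying http://t.co/xyz"))
# -> "<user> you are pathetic laughing stop bullying <url>"
```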

1.5 Structure of this report

This report is structured to describe my research contributions in each of the above-mentioned research areas. More precisely, in the following: Chapter 2 describes my research in information extraction to generate formal knowledge, according to different application scenarios; Chapter 3 presents my contribution in the conception and implementation of a Question Answering system over Linked Data; Chapter 4 provides my contribution in the area of Argument Mining; and, last but not least, Chapter 5 presents my contribution in the area of cyberbullying and abusive language detection. Conclusions summarize my research work, and discuss future perspectives.

Chapter 2

Information extraction to generate structured knowledge

This chapter summarizes my contributions related to the extraction of information and the acquisition of knowledge from text in different application scenarios (mainly robotics, music information retrieval and events extraction) to generate structured knowledge in a machine-readable and machine-interpretable format. The goal is to enable the automation of tasks such as smart content classification, integrated search, reasoning, and data-driven activities such as mining for patterns and trends, uncovering hidden relations and so on. These contributions fit the areas of Natural Language Processing and Semantic Web. My research contributions on this topic have been published in several journals and venues. I provide below a list of the main publications on the topic, divided by application scenario.

Mining common sense knowledge from the Web for robotics:

• Soufian Jebbara, Valerio Basile, Elena Cabrio, Philipp Cimiano (2019). Extracting common sense knowledge via triple ranking using supervised and unsupervised distributional models. Semantic Web 10(1): 139-158 [241]

• Jay Young, Lars Kunze, Valerio Basile, Elena Cabrio, Nick Hawes, Barbara Caputo (2017). Semantic web-mining and deep vision for lifelong object discovery. Proceedings of the International Conference on Robotics and Automation (ICRA-17): 2774-2779 [491]

• Jay Young, Valerio Basile, Lars Kunze, Elena Cabrio, Nick Hawes (2016). Towards Lifelong Object Learning by Integrating Situated Robot Perception and Semantic Web Mining. ECAI 2016: 1458-1466 [489]

• Valerio Basile, Soufian Jebbara, Elena Cabrio, Philipp Cimiano (2016). Populating a Knowledge Base with Object-Location Relations Using Distributional Semantics. Proceedings of the International Conference on Knowledge Engineering and Knowledge Management (EKAW-16): 34-50 [39]

Processing song lyrics:

• Michael Fell, Yaroslav Nechaev, Gabriel Meseguer Brocal, Elena Cabrio, Fabien Gandon, Geoffroy Peeters (2020). Lyrics Segmentation via Bimodal Text-audio Representation. To appear in Journal of Natural Language Engineering [174]


• Michael Fell, Elena Cabrio, Elmahdi Korfed, Michel Buffa, Fabien Gandon (2020). Love Me, Love Me, Say (and Write!) that You Love Me: Enriching the WASABI Song Corpus with Lyrics Annotations. Proceedings of the Language Resources and Evaluation Conference (LREC-2020): 2138-2147 [173]

• Michael Fell, Elena Cabrio, Fabien Gandon, Alain Giboin (2019). Song Lyrics Summarization Inspired by Audio Thumbnailing. Proceedings of the Recent Advances in Natural Language Processing conference (RANLP-19), 328-337 [172]

• Michael Fell, Elena Cabrio, Michele Corazza, Fabien Gandon (2019). Comparing Automated Methods to Detect Explicit Content in Song Lyrics. Proceedings of the Recent Advances in Natural Language Processing conference (RANLP-19), 338-344 [171]

• Michael Fell, Yaroslav Nechaev, Elena Cabrio, Fabien Gandon (2018). Lyrics Segmentation: Textual Macrostructure Detection using Convolutions. Proceedings of the International Conference on Computational Linguistics (COLING 2018), 2044-2054 [175]

Events extraction from social media:

• Amosse Edouard, Elena Cabrio, Sara Tonelli, Nhan Le Thanh (2017). Graph-based Event Extraction from Twitter. Proceedings of the Recent Advances in Natural Language Processing conference (RANLP-17), 222-230 [157]

• Amosse Edouard, Elena Cabrio, Sara Tonelli, Nhan Le Thanh (2017). You’ll Never Tweet Alone: Building Sports Match Timelines from Microblog Posts. Proceedings of the Recent Advances in Natural Language Processing conference (RANLP-17), 214-221 [159]

• Amosse Edouard, Elena Cabrio, Sara Tonelli and Nhan Le Thanh (2017). Semantic Linking for Event-Based Classification of Tweets. Proceedings of the 18th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing-17) [158]

The contributions reported in this chapter are the results of several collaborations in the context of the CHIST-ERA ALOOF project (for robotics), of the ANR WASABI project and the PhD of Michael Fell (for Music Information Retrieval), and of the Ph.D. thesis of Amosse Edouard (for events extraction).

The justification underlying the first line of work, in the robotics scenario, is that autonomous robots that are to assist humans in their daily lives are required, among other things, to recognize and understand the meaning of task-related objects. However, given an open-ended set of tasks, the set of everyday objects that robots will encounter during their lifetime is not foreseeable: robots have to learn and extend their knowledge about previously unknown objects on the job. We therefore propose multiple approaches to automatically acquire parts of this knowledge (e.g., the class of an object, its typical location and manipulation-relevant knowledge that can support robots' action planning) in the form of ranked hypotheses from the Semantic Web, using contextual information extracted from observations and experiences made by robots. Thus, by integrating situated robot perception and Semantic Web mining, robots can continuously extend their object knowledge beyond perceptual models, which allows them to reason about task-related objects; e.g., when searching for them, robots can infer the most likely object locations or their typical use. The RDF publishing of the extracted knowledge base was also key to support (instantaneous) knowledge-sharing among different robots and even beyond. The Semantic Web and linked data approach was chosen to make the extracted/learned knowledge available to more than just one robot.

The goal of the second line of work is to understand how the automatic processing of song lyrics can be exploited to enhance applications such as music recommendation systems and music information retrieval. While most of the existing work in this area focuses on audio and metadata, our work demonstrates that the information contained in song lyrics is another important factor that should be taken into account when designing these applications. As a result, we present the WASABI Song Corpus, a large corpus of songs enriched with metadata extracted from music databases on the Web, and resulting from the processing of song lyrics and from audio analysis. Given that lyrics encode an important part of the semantics of a song, we present multiple methods to extract relevant information from the lyrics, such as their structure segmentation, their topics, the explicitness of the lyrics content, the salient passages of a song and the emotions conveyed. The corpus contains 1.73M songs with lyrics (1.41M unique lyrics) annotated at different levels with the output of the above-mentioned methods. Such corpus labels and the provided methods can be exploited by music search engines and music professionals (e.g., journalists, radio presenters) to better handle large collections of lyrics, allowing an intelligent browsing, categorization and segmentation recommendation of songs.

The third line of work arises from the observation that the capability to understand and analyze the stream of messages exchanged on social media (e.g., Twitter) is an effective way to monitor what people think, which trending topics are emerging, and which main events are affecting people's lives. This is crucial for companies interested in social media monitoring, as well as for public administrations and policy makers, who monitor tweets in order to report or confirm recent events. In this direction, our work addresses the task of analyzing tweets to discover new or track previously identified events. First, to classify event-related tweets, we explore the impact of entity linking and of named entity generalization, and we apply a supervised classifier to separate event-related from non-event-related tweets, as well as to associate event-related tweets with the event categories defined by the Topic Detection and Tracking community. Second, we address the task of detecting which tweets describe a specific event and of clustering them. We propose a novel approach that exploits NE mentions in tweets and their local context to create a temporal event graph; we then process the event graphs to detect clusters of tweets describing the same event. Third, we propose an approach to build a timeline of the actions in a sports game based on tweets, combining information provided by external knowledge bases to enrich the content of the tweets, and applying graph theory to model the relations between actions and participants in a game.

This chapter is organized as follows: Sections 2.1 and 2.2 describe the work carried out to equip a robot with a database of object knowledge. Then, Sections 2.4, 2.5 and 2.6 present the methods we proposed to annotate relevant information in the song lyrics to enrich the WASABI Song Corpus with lyrics metadata. Sections 2.7 and 2.8 explain the events extraction methods to classify and cluster events, and to create timelines. Conclusions end the chapter.
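The temporal event-graph idea mentioned in the third line of work can be illustrated with a minimal sketch (not the actual system of Sections 2.7-2.8, and with invented toy tweets): tweets that share a linked named entity and fall within a short time window are connected, and connected components are read off as candidate event clusters.

```python
import networkx as nx
from datetime import datetime, timedelta

# Toy tweets: (id, timestamp, set of linked named entities) -- invented data
tweets = [
    ("t1", datetime(2017, 6, 3, 21, 0), {"Champions_League", "Juventus"}),
    ("t2", datetime(2017, 6, 3, 21, 5), {"Juventus", "Real_Madrid"}),
    ("t3", datetime(2017, 6, 4, 9, 0), {"Roland_Garros"}),
]

window = timedelta(hours=2)
g = nx.Graph()
g.add_nodes_from(t[0] for t in tweets)

# Connect tweets that share at least one entity and are close in time
for i, (id_a, ts_a, ents_a) in enumerate(tweets):
    for id_b, ts_b, ents_b in tweets[i + 1:]:
        if ents_a & ents_b and abs(ts_a - ts_b) <= window:
            g.add_edge(id_a, id_b)

# Each connected component is a candidate event cluster
clusters = [sorted(c) for c in nx.connected_components(g)]
print(clusters)  # e.g. [['t1', 't2'], ['t3']]
```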

2.1 Towards lifelong object learning by integrating situated robot perception and Semantic Web mining

It is crucial for autonomous robots working in human environments such as homes, offices or factories to have the ability to represent, reason about, and learn new information about the objects in their environment. Current robot perception systems must be provided with models of the objects in advance, and their extensibility is typically poor. This includes both perceptual models (used to recognize the object in the environment) and semantic models (describing what the object is, what it is used for, etc.). Equipping a robot a priori with a (necessarily closed) database of object knowledge is problematic because the system designer must predict which subset of all the different domain objects is required, and then build all of these models (a time-consuming task). If a new object appears in the environment, or an unmodelled object becomes important to a task, the robot will be unable to perceive, or reason about, it. The solution to this problem is for the robot to learn on-line about previously unknown objects. This allows robots to autonomously extend their knowledge of the environment, training new models from their own experiences and observations.

The online learning of perceptual and semantic object models is a major challenge for the integration of robotics and AI. In this section we address one problem from this larger challenge: given an observation of a scene containing an unknown object, can an autonomous system predict the semantic description of this object? This is an important problem because online-learnt object models [167] must be integrated into the robot's existing knowledge base, and a structured, semantic description of the object is crucial to this. Our solution combines semantic descriptions of perceived scenes containing unknown objects with a distributional semantic approach which allows us to fill gaps in the scene descriptions by mining knowledge from the Semantic Web. Our approach assumes that the knowledge on board the robot is a subset of some larger knowledge base, i.e., that the object is not unknown beyond the robot's pre-configured knowledge. To determine which concepts from this larger knowledge base might apply to the unknown object, our approach exploits the spatio-temporal context in which objects appear, e.g., a teacup is often found next to a teapot and sugar bowl. These spatio-temporal co-occurrences provide contextual clues to the properties and identity of otherwise unknown objects. This section makes the following contributions:

• a novel distributional semantics-based approach for predicting both the semantic identity of an unknown, everyday object based on its spatial context and its most likely location based on semantic relatedness;

• an extension to an existing semantic perception architecture to provide this spatial context; and

• an evaluation of these techniques on real-world scenes gathered from a long-term autonomous robot deployment.

In the following, in Section 2.1.1, we first state the problem of acquiring semantic descriptions for unknown objects and give an overview of our approach. In Subsection 2.1.2, we describe the underlying robot perception system and explain how it is integrated with a Semantic Web mining component. Subsection 2.1.3 describes how this component generates answers/hypotheses to web queries from the perception module. In Subsection 2.1.4, we describe the experimental setup, present the results, and provide a detailed discussion about our approach. (We also make our data set and software source code available at http://github.com/aloof-project/.)

2.1.1 Problem statement and methodology

Problem Statement. The problem we consider in this section can be summarized as follows: given the context of a perceived scene and the experience from previous observations, predict the class of an 'unknown' identified object. The context of a scene can include information about the types and locations of recognized small objects and furniture, and the type of the room where the observation has been made. In this section we use the following running example (Figure 2.1) to illustrate the problem and our approach:


Room: Kitchen
Surface: counter-top
Furniture: refrigerator, kitchen cabinet, sink
Small Objects: bowl, teabox, instant coffee, water boiler, mug

Figure 2.1: Perceived and interpreted kitchen scene, with various objects.

While operating 24/7 in an office environment, a robot routinely visits the kitchen and scans all surfaces for objects. On a kitchen counter it finds several household objects: a bowl, a teabox, a box of instant coffee, and a water boiler. However, one of the segmented objects, a mug, cannot be identified as one of the known object classes. The robot’s task is to identify the unknown object solely based on the context of the perceived scene and scenes that have been previously perceived and in which the respective object was identified.

The problem of predicting the class of an object purely based on the context can also be seen as top-down reasoning or top-down processing of information. This stands in contrast to data-driven bottom-up processing where, for example, a robot tries to recognize an object based on its sensor data. In top-down processing, an agent, or the robot, has some expectations of what it will perceive based on commonsense knowledge and its experiences. For example, if a robot sees a fork and a knife close to each other, and a flat unknown object with a square bounding-box next to them, it might deduce that the unknown object is probably a plate. In the following, we refer to this kind of processing, which combines top-down reasoning and bottom-up perception, as knowledge-enabled perception.

Systems such as the one described in this section are key components of integrated, situated AI systems intended for life-long learning and extensibility. We currently develop the system with two main use cases in mind, both stemming from the system's capability to suggest information about unknown objects based on the spatial context in which they appear. The first use case is as part of a crowd-sourcing platform, allowing humans that inhabit the robot's environment to help it label unknown objects. Here, the prediction system is used to narrow down the list of candidate labels and categories to be shown to users to select from, alongside images of unknown objects the robot has encountered. Our second use case will be to help form more informative queries for larger machine learning systems, in our case an image classification

system trained on extensive, though categorized, image data from websites like . Here, having some hints as to an object's identity, such as a distribution over a set of possible labels or categories it might belong to or be related to, could produce a significant speed boost by letting the classification system know which objects it does not have to test against. In this case, we aim to use the system to help a robot make smarter, more informed queries when asking external systems questions about the world.

Figure 2.2: System overview. The robot perception component identifies all object candidates within a scene. All object candidates that can be recognized are labeled according to their class; all remaining objects are labeled as 'unknown'. Furthermore, the component computes the spatial relations between all objects in the scene. Together with context information from a semantic environment map, the robot generates a query to a web service, which is processed by the Semantic Web mining component. Based on the semantic relatedness of objects, the component provides a ranked list of the potential classes for all unknown objects.

Our approach. We address the problem of predicting information about the class of an object based on the perceived scene context by mining the Semantic Web. The extracted scene context includes a list of recognized objects and their spatial relations among each other, plus additional information from a semantic environment map. This information is then used to mine potential object classes based on the semantic relatedness of concepts in the Web. In particular, we use DBpedia as a resource for object knowledge, and will later on use WordNet to investigate object taxonomies. The result of the web mining component is a ranked list of potential object classes, expressed as DBpedia entries, which allows us access to further information beyond just the class of an object, such as categorical knowledge. An overview of the entire developed system is given in Figure 2.2.

Overall, we see our context-based class prediction approach as a means to restrict the number of applicable classes for an object. The aim of our knowledge-enabled perception system is not to replace a bottom-up perception system but rather to complement it as an additional expert. For example, in the context of a crowdsourcing-based labeling platform, our system could generate label suggestions for users. Thereby, labeling tasks could be performed in less time and object labels would be more consistent across users. Hence, we believe that our system provides an essential functionality in the context of lifelong object learning. In the following, we briefly discuss various resources of object knowledge.

Resources for object knowledge. To provide a common format for object knowledge, and to access the wide variety of structured knowledge available on the Web, we link the observations made by the robot to DBpedia concepts. DBpedia [54] is a crowd-sourced community effort started by the Semantic Web community to extract structured information from Wikipedia and make this information available on the Web. DBpedia has a broad scope of entities covering different domains of human knowledge: it contains more than 4 million things classified in a consistent ontology and denoted by a URI-based reference of the form http://dbpedia.org/page/Teapot for the Teapot concept. DBpedia supports sophisticated queries (using an SQL-like query language for RDF called SPARQL) to mine relationships and properties associated with Wikipedia resources. We link the objects that the robot can encounter in natural environments to DBpedia concepts, thus exploiting this structured, ontological knowledge.

BabelNet [341] is both a multilingual encyclopedic dictionary and a semantic network which connects concepts and named entities in a very large network of semantic relations (about 14 million entries). BabelNet covers, and is obtained from the automatic integration of, several resources, such as WordNet [177], Wiktionary and Wikipedia. Each concept contained in BabelNet is represented as a vector in a high-dimensional geometric space in the NASARI resource, which we use to compute the semantic relatedness among objects.
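The way such concept vectors support class prediction can be illustrated with a minimal sketch: candidate classes for an unknown object are ranked by the average cosine relatedness of their vectors to the vectors of the objects observed around it. The three-dimensional vectors below are invented placeholders standing in for the real NASARI embeddings, and this is not the ranking measure of Section 2.2 itself.

```python
import numpy as np

# Toy vectors standing in for NASARI concept embeddings (invented values)
vectors = {
    "Teapot":      np.array([0.9, 0.1, 0.0]),
    "Mug":         np.array([0.8, 0.2, 0.1]),
    "Screwdriver": np.array([0.1, 0.9, 0.2]),
    "Sugar_bowl":  np.array([0.7, 0.1, 0.3]),
}

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def rank_candidates(scene_objects, candidates):
    """Rank candidate classes by average relatedness to the objects in the scene."""
    scores = {}
    for cand in candidates:
        sims = [cosine(vectors[cand], vectors[obj]) for obj in scene_objects]
        scores[cand] = sum(sims) / len(sims)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Known objects on the kitchen counter; which class best fits the unknown object?
print(rank_candidates(["Teapot", "Sugar_bowl"], ["Mug", "Screwdriver"]))
```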

2.1.2 Situated robot perception

The RoboSherlock framework. To be able to detect both known and unknown objects in its environment, a robot must have perceptual capabilities. Our perception pipeline is based on the RoboSherlock framework [43], an open-source framework for implementing perception systems for robots, geared towards interaction with objects in human environments. The use of RoboSherlock provides us with a suite of vision and perception algorithms. Following the paradigm of Unstructured Information Management (as used by the IBM Watson project), RoboSherlock approaches perception as a problem of content analysis, whereby sensor data is processed by a set of specialized information extraction and processing algorithms called annotators. The RoboSherlock perception pipeline is a sequence of annotators which include plane segmentation, RGB-D object segmentation, and object detection algorithms. The output of the pipeline includes 3D point clusters, bounding-boxes of segmented objects (as seen in Figure 2.2), and feature vectors (color, 3D shape and texture) describing each object. These feature vectors are important as they allow the robot to track unknown objects as it takes multiple views of the same scene. Since in this section we work with a collected and annotated dataset, we do not require the segmentation or 3D object recognition steps that RoboSherlock can provide via LINE-MOD-3D [223]; this component is, however, used in our full robot and simulated systems, where a range of perception algorithms are connected and used instead of dataset input. We make use of all other RoboSherlock capabilities in the pipeline to process the data, and provide a general architecture for our representation and extraction of historical spatial context, web query generation and the application of Qualitative Spatial Relations, which we will discuss in a following section.
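RoboSherlock's real annotators are components operating over a shared analysis structure; purely to illustrate the idea of a sequence of annotators enriching a common scene representation (and with entirely invented names and values), a toy version looks like this:

```python
# Toy illustration of an annotator pipeline: each annotator reads the shared
# scene structure and adds its own annotations to it (invented names/values).

def plane_segmentation(scene):
    scene["supporting_plane"] = "counter-top"
    return scene

def object_segmentation(scene):
    scene["clusters"] = [{"id": 0, "volume": 0.0008}, {"id": 1, "volume": 0.003}]
    return scene

def feature_extraction(scene):
    for cluster in scene["clusters"]:
        cluster["features"] = {"color_hist": [0.2, 0.5, 0.3]}  # placeholder features
    return scene

PIPELINE = [plane_segmentation, object_segmentation, feature_extraction]

scene = {"room": "kitchen"}
for annotator in PIPELINE:
    scene = annotator(scene)
print(scene)
```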

Scene perception. We assume here that the robot is tasked with observing objects in natural environments. Whilst this is not a service robot task in itself, it is a precursor to many other task-driven capabilities such as object search, manipulation, human-robot interaction etc. Similar to prior work (e.g., [407]) we assume that the robot already has a semantic map of its environment which provides it with at least annotations of supporting surfaces (desks, worktops, shelves etc.), plus the semantic category of the area in which the surface is located (office, kitchen, meeting room etc.). Surfaces and locations are linked to DBpedia entries just as object labels are, typically as entities under the categories Furniture and Room respectively. From here, we have access to object, surface and furniture labels described by the data, along with 3D bounding boxes via 3D point data. In the kitchen scene the robot may observe various objects typical of the room, such as a refrigerator, a cabinet, mugs, sugar bowls or coffee tins. Their positions in space relative to a global map frame are recorded, and we can then record the distance between objects, estimate their size (volume) and record information about their co-occurrences, and the surfaces upon which they were observed, by updating histograms attached to each object. In the following we assume that each scene only contains a single unknown object, but the approach generalizes to multiple unknown objects treated independently. Joint inference over multiple unknown objects is future work.

Spatial and semantic context extraction. In order to provide additional information to help subsequent components predict the unknown object, we augment the scene description with additional spatial and semantic context information, describing the relationships between the unknown object and the surrounding known objects and furniture. This context starts from the knowledge we already have in the semantic map: labels for the room and the surface the object is supported by. We make use of Qualitative Spatial Relations (QSRs) to represent information about objects [187]. QSRs discretise continuous spatial measurements, particularly relational information such as the distance and orientation between points, yielding symbolic representations of ranges of possible continuous values. In this work, we make use of a qualitative distance measure, often called a Ring calculus. When observing an object, we categorize its distance relationship with any other object in a scene with the following set of symbols: near0, near1, near2, near3, where near0 is the closest. This is accomplished by placing sets of thresholds on the distance function between objects, taken from the centroid of the 3D cluster. For example, this allows us to represent that the mug is closer to the spoon than to the kettle (near0(mug, spoon) vs. near2(mug, kettle)) without using floating-point distance values based on noisy and unreliable readings from the robot's sensors. The RoboSherlock framework provides a measure of the qualitative size of objects by thresholding the values associated with the volume of 3D bounding boxes around objects as they are observed. We categorize objects as small, medium, large in this way, allowing the robot to represent and compare object sizes. Whilst our symbolic abstractions are currently based on manual thresholds, approaches exist for learning parametrisations of QSRs through experience (e.g., [490]) and we will try this in the future. For now, we choose parameters for our qualitative calculi tuned by our own knowledge of objects in the world, and how they might relate. We use near0 for distances in cluster space lower than 0.5, near1 for distances between 0.5 and 1.0, near2 for distances between 1.0 and 3.5, and near3 for distances greater than 3.5. As the robot makes subsequent observations, it may re-identify the same unknown object in additional scenes. When this happens we store all the scene descriptions together, providing additional context descriptions for the same object. In Figure 2.3 we show part of the data structure describing the objects that co-occurred with a plate in a kitchen, and their most common qualitative spatial relations.

1 "co_occurrences":[ 2 ["Coffee",0.5, "near_0"], 3 ["Kitchen_side",1.0,"near_0"], 4 ["Kitchen_cabinet",1.0,"near_1"], 5 ["Fridge",0.625,"near_1"], 6 ["Teabox",0.625,"near_0"], 7 ["Waste_container",0.375,"near_2"], 8 ["Utensil_rack",0.625, "near_1"], 9 ["Sugar_bowl_(dishware)",0.625,"near_0"], 10 ["Electric_water_boiler",0.875,"near_1"], 11 ["Sink",0.625, "near_1"]], 12 "context_history":[ 13 ["Kitchen",1.0, "Kitchen_counter",1], 14 [ "Office",0.0,"Desk",0]], 15 "context_room_label": "Kitchen", 16 "context_surface_label": "Kitchen_counter",

Figure 2.3: An example data fragment taken from a series of observations of a Plate in a series of kitchen scenes, showing object, furniture, room and surface co-occurrence
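As an illustration of the distance discretization described above, the following is a minimal Python sketch (the function and variable names are ours and not part of the actual system) that maps a metric distance between two object centroids to the qualitative near relations, using the hand-tuned thresholds reported in the text.

    def qualitative_distance(distance):
        """Discretize a metric distance (in cluster space) into a near_k symbol."""
        if distance < 0.5:
            return "near_0"
        elif distance < 1.0:
            return "near_1"
        elif distance <= 3.5:
            return "near_2"
        else:
            return "near_3"

    # Example: the mug is closer to the spoon than to the kettle.
    print(qualitative_distance(0.3))   # near_0
    print(qualitative_distance(2.1))   # near_2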

2.1.3 Semantic Web mining

For an unknown object, our aim is to be able to provide a list of likely DBpedia concepts to describe it, and we will later consider and compare the merits and difficulties associated with providing object labels and object categories. As this knowledge is not available on the robot (the object is locally unknown), it must query an external data source to fill this knowledge gap. We therefore use the scene descriptions and spatial contexts for an unknown object to generate a query to a Web service. In return, this service provides a list of the possible DBpedia concepts which may describe the unknown object. We expect the robot to use this list in the future to either automatically label a new object model, or to use the list of possible concepts to guide a human through a restricted (rather than open-ended) learning interaction. The Web service provides access to object- and scene-relevant knowledge extracted from Web sources. It is queried using a JSON structure sent via an HTTP request (shown in Figure 2.2). This structure aggregates the spatial contexts collected over multiple observations of the unknown object. In our current work we focus on the co-occurrence structure. Each entry in this structure describes an object that was observed with the unknown object, the ratio of observations this object was in, and the spatial relation that most frequently held between the two. The room and surface fields describe where the observations were made. Upon receiving a query, the service computes the semantic relatedness between each object included in the co-occurrence structure and every object in a large set of candidate objects from which possible concepts are drawn (we discuss the nature of this set later on). This semantic relatedness is computed by leveraging the vectorial representation of the DBpedia concepts provided by the NASARI resource [96]. In NASARI, each concept contained in the multilingual resource BabelNet [341] is represented as a vector in a high-dimensional geometric space. The vector components are computed with the word2vec [320] tool, based on the co-occurrence of the mentions of each concept, in this case using Wikipedia as source corpus. Since the vectors are based on distributional semantic knowledge (following the distributional hypothesis: words that occur together often are likely semantically related), vectors that represent related entities end up close in the vector space. We are able to measure such relatedness by computing the inverse of the cosine distance between two vectors. For instance, the NASARI vectors for Pointing device and Mouse (computing) have relatedness 0.98 (on a continuous scale from 0 to 1), while Mousepad and Teabox are 0.26 related. The system computes the aggregate of the relatedness of a candidate object to each of the scene objects contained in the query. Using relatedness to score the likely descriptions of an unknown object follows from the intuition that related objects are more likely than unrelated objects to appear in a scene; e.g., to identify a Teapot it is more useful to know that there is a Teacup in the scene than a Desk. Formally, given n observed objects in the query q1, ..., qn, and m candidate objects in the universe under consideration o1, ..., om ∈ O, each oi is given a score that indicates its likelihood of being the unknown object by aggregating its relatedness across all observed objects.
The aggregation function can be as simple as the arithmetic mean of the relatedness scores, or a more complex function. For instance, if the aggregation function is the product, the likelihood of an object oi is given by:

likelihood(o_i) = \prod_{j=1}^{n} relatedness(o_i, q_j)

For the sake of this work, we experimented with the product as aggregating function. This way of aggregating similarity scores gives higher weight to highly related pairs, as opposed to the arithmetic mean, where each query object contributes equally to the final score. The idea behind this choice is that if an object is highly related to the target it should be regarded as more informative. The information carried by each query is richer than just a bare set of object labels. One piece of knowledge that can be exploited to obtain a more accurate prediction is the relative position of the observed objects with respect to the target unknown object. Since this information is represented as a discrete level of proximity (from near 0 to near 3), we can use it as a threshold to determine whether or not an object should be included in the relatedness calculation. In this work we discard any object related by near 3, based on the intuition that the further away an object is spatially, the less related it is. The Results section includes an empirical investigation into this approach. For clarity, here we present an example of execution of the algorithm described above on the query corresponding to the kitchen example seen throughout the section. The input to the Web module is a query containing a list of pairs (object, distance): (Refrigerator, 3), (Kitchen cabinet, 3), (Sink, 3), (Kitchen cabinet, 3), (Sugar bowl (dishware), 1), (Teabox, 1), (Instant coffee, 2), (Electric water boiler, 3). For the sake of readability, let us assume a set of candidate objects made only of three elements: Tea cosy, Pitcher (container) and Mug. Table 2.1 shows the full matrix of pairwise similarities. Among the three candidates, the one with the highest aggregated score is Tea cosy, followed by Mug and Pitcher (container). For reference, the ground truth in the example query is Mug, which ended up second in the final ranking returned by the algorithm. We can also alter the performance of the system using the frequency of the objects returned by the query. The notion of frequency, taken from [133], is a measure based on the number of incoming links in the Wikipedia page of an entity. Using this measure we can choose to filter uncommon objects from the results of the query, by thresholding with a given frequency value. In the example above, the frequency counts of Tea cosy, Pitcher (container) and Mug are respectively 25, 161 and 108. Setting a threshold anywhere between 25 and 100 would filter Tea cosy out of the result, moving up the ground truth to rank 1. Similarly, we can filter out objects that are too far from the target by imposing a limit on their observed distance. A threshold of 2 (inclusive) for the distance of the objects in the example would exclude Refrigerator, Kitchen cabinet, Sink and Electric water boiler from the computation.

                        Tea cosy   Pitcher (container)   Mug
Refrigerator            0.473      0.544                 0.522
Sink                    0.565      0.693                 0.621
Sugar bowl (dishware)   0.555      0.600                 0.627
Teabox                  0.781      0.466                 0.602
Instant coffee          0.821      0.575                 0.796
Electric water boiler   0.503      0.559                 0.488
product                 0.048      0.034                 0.047

Table 2.1: Object similarity of the three candidates Tea cosy, Pitcher (container) and Mug to the objects observed at the example kitchen scene. The last line shows the similarity scores aggregated by product.
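The aggregation and filtering steps above can be summarized in a few lines of code. The following sketch is illustrative rather than the system's actual implementation: the function and parameter names are ours, and the relatedness function is a placeholder for the NASARI-based cosine relatedness described in the text.

    def rank_candidates(query, candidates, relatedness, frequency,
                        max_distance=2, min_frequency=20):
        """query: list of (observed_object, qualitative_distance) pairs.
        candidates: candidate DBpedia concepts for the unknown object.
        relatedness(a, b): semantic relatedness score in [0, 1].
        frequency: dict mapping a concept to its Wikipedia incoming-link count."""
        # Discard observed objects that are too far from the unknown object.
        observed = [obj for obj, dist in query if dist <= max_distance]
        scores = {}
        for candidate in candidates:
            # Skip uncommon concepts (frequency-based filtering).
            if frequency.get(candidate, 0) < min_frequency:
                continue
            score = 1.0
            for obj in observed:
                score *= relatedness(candidate, obj)   # product aggregation
            scores[candidate] = score
        # Higher score = more likely description of the unknown object.
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)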

Other useful information available from the spatial context includes the label of the room, surface or furniture where the unknown object was observed. Unfortunately, in order to leverage such information, one needs a complete knowledge base containing these kinds of relations, and such a collection is unavailable at the moment. However, the room and the surface labels are included in the relatedness calculations along with the observed objects.

2.1.4 Experiments

In order to evaluate the effectiveness of the proposed method in predicting the labels of unknown objects, we performed a set of experiments. In this section we report on the experimental setup and the results we obtained, before discussing them in further detail.

Experimental Set-up. Our experimental evaluation is based on a collection of panoramic RGB-D scans taken by an autonomous mobile service robot deployed in a working office for a month. It took these scans at fixed locations according to a flexible schedule. After the deployment we annotated the objects and furniture items in these scans, providing each one with a DBpedia concept. This gives us 1329 real-world scenes (384 kitchen, 945 office) on which we can test our approach. From this data, our evaluation treats each labeled object in turn as an unknown object in a leave-one-out experiment, querying the Web service with the historical spatial context data for the unknown object, similar to that shown in Figure 2.3.

Figure 2.4: An example office scene as an RGB image from our real-world deployment. Our data contains 945 office scenes, and 384 kitchen scenes.

Figure 2.5: WUP similarity measure between WordNet synsets of ground truth and top-ranked result, with t = 50, p = 2 using the prod method. Values closer to 1 indicate similarity, and values closer to 0 indicate dissimilarity.

In all of the experiments we compare the ground truth (the known label in the data) to the DBpedia concepts predicted by our system. We measure performance based on two metrics. The first, WUP similarity, measures the semantic similarity between the ground truth and the concept predicted as most likely for the unknown object. The second measure is the rank of the ground truth in the list of suggested concepts. For the experiments, the set of candidate objects (O in Section 2.1.3) was created by adding all concepts from the DBpedia ontology connected to the room types in our data up to a depth of 3. For example, starting from office leads us to office equipment, computers, stationery etc. This resulted in a set of 1248 possible concepts. We set the frequency threshold to 20, meaning we ignored any suggested concept with a frequency value lower than this. This means uncommon concepts such as Chafing dish (frequency=13) would always be ignored if suggested, but more common ones such as Mouse (computing) (frequency=1106) would be kept.

Results. Figure 2.5 shows the result of calculating the WUP similarity [479] between the WordNet synsets of the ground truth and the top-ranked result from our semantic web-mining system. WUP measures semantic relatedness by considering the depth of two synsets in addition to the depth of their Lowest Common Subsumer (LCS). This means that large leaps between concepts will reduce the eventual similarity score more than small hops might. To do this we used readily available mappings to link the DBpedia concepts in our system to WordNet synsets, which are themselves organized as a hierarchy of is-a relations. This is in contrast to DBpedia, which is organized as a directed acyclic graph; while that means we could apply the WUP measure to DBpedia nodes directly, WordNet offers a more structured taxonomy of concepts that is better suited to this kind of work. This serves to highlight the importance of a multi-modal approach to the use of such ontologies. In the results, the system predicted Lightpen when the ground truth was Mouse, producing a WUP score of 0.73, with the LCS being the Device concept, with Mouse and Lightpen having depth 10 and 11 respectively, and Device having depth 8 measured from the root node of Entity. In this case, the system suggested an object that fell within 3 concepts of the ground truth, and this is true for the majority of the results in Figure 2.5.
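As an aside, the WUP score between two WordNet synsets can be computed directly with off-the-shelf tools. The minimal sketch below uses NLTK's WordNet interface; the specific synset identifiers are assumptions about the WordNet sense inventory and may need adjusting for a given WordNet version.

    from nltk.corpus import wordnet as wn

    # Compare the ground-truth concept (computer mouse) with a predicted concept (light pen).
    mouse = wn.synset('mouse.n.04')          # assumed sense: the computer pointing device
    light_pen = wn.synset('light_pen.n.01')  # assumed sense: the light-pen pointing device
    print(mouse.wup_similarity(light_pen))   # Wu-Palmer similarity, in (0, 1]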

Figure 2.6: Rank in result by object category, matching the highest ranked object with a category shared with the ground truth in the result set, with varying values of the parameter t, with p = 2 and the prod method. Ranks closer to 1 are better. Ranking is determined by the position in the result of the first object with an immediate category in common with the ground truth. 56% (9/16) achieve <= 10.

Figure 2.7: Rank in result by object label, matching the label of the ground truth in the result set, with varying values of the parameter t, with p = 2 and the prod method. Increasing values of t can cause some objects to be excluded from the result set entirely, such as the Teaware or Monitor at t = 50.

                Mean    Median   Std. Dev   Variance   Range
WUP             0.69    0.70     0.12       0.01       0.43
Category Rank   17.00   9.50     20.17      407.20     73.00
Object Rank     50.93   36.5     50.18      2518.32    141

Figure 2.8: Statistics on WUP and Rank-in-result data, both for t = 50, p = 2 using prod

However, in the case of refrigerator as the ground truth, the system suggests keypad as the highest ranked result, producing a WUP score of 0.52. Here, the LCS is at depth 6 with the concept Artifact, the ground truth refrigerator is at depth 13 and the prediction keypad is at depth 10. While in this case the node distance between the LCS and the prediction is 4, where in the previous example it was 3, the WUP score is much worse here (0.73 vs 0.52) as there are more large leaps across conceptual space. Our best result in this experiment is for Printer as the ground truth, for which the system suggests keypad again; however, the LCS here is the peripheral node at depth 10, where printer is at depth 11 and keypad is at depth 12. The system suggests a range of objects that are closely related to the unknown object, inferred only from its spatial context and knowledge of the objects and environment around it. This allows us to generate a list of candidate concepts which we can use in a second stage of refinement, such as presentation to a human-in-the-loop. Figure 2.6 shows how frequency thresholding affects the performance of the system. In this experiment we consider the position in the ranked result of the first object with an immediate parent DBpedia category in common with the ground truth. Doing so essentially maps the larger set of object labels to a smaller set of object categories. This is in contrast to considering the position in the result of the specific ground-truth label, as shown in Figure 2.7, and allows us to generate a ranking over categories of objects. To ensure categories remain relevant to the situated objects we are interested in, we prune a number of DBpedia categories, such as those listing objects invented in certain years or in certain countries. We regard these as being overly broad, providing a more abstract degree of semantic knowledge about objects than we are interested in. As such, we retrieve the rank-in-result of the first object that shares an immediate DBpedia category with the ground truth, which in the case of Electric water boiler turns out to be Samovar, a kind of Russian water boiler, as both share the immediate ancestor category Boilers (cookware). The Samovar, and thus the boiler category, appears at rank 12, whereas the specific label Electric water boiler appears near the end of the result set of 1248 objects, which covers 641 unique DBpedia categories. In our results, categories associated with 9 of the 16 objects (56%) appear within the top 10 entries of the result. As we filter out uncommon words by increasing the filter threshold t, we improve the position of the concept in the list. Whilst this allows us to remove very unlikely answers that appear related due to some quirk of the data, the more we filter, the more we also start to reduce the ability of the robot to learn about certain objects. This is discussed further in the Discussion paragraph below. Unlike WordNet synsets and concepts, DBpedia categories are more loosely defined and structured, being generated from Wikipedia, but this means they are typically richer in the kind of semantic detail and broad knowledge representation that may be more suitable for presentation to humans, or more easily mapped to human-authored domains.
While WordNet affords us access to a well-defined hierarchy of concepts, categories like device and container are fairly broad, whereas DBpedia categories such as Video game control methods or Kitchenware describe a smaller set of potential objects, but may be more semantically meaningful when presented to humans.

Discussion. Overall, whilst the results of our object category prediction system show that it is possible for this novel system to generate some good predictions, the performance is variable across objects. There are a number of factors that influence performance and lead to this variability. The first issue is that the current system does not rule out suggestions of things it already knows. For example, if the unknown object is a keyboard, the spatial context and relatedness may result in a top suggestion of a mouse, but as the system already knows about that, it is probably a less useful suggestion. However, it is possible that the unknown object could be a mouse that has not been recognized correctly. Perhaps the most fundamental issue in the challenge of predicting object concepts from limited information is how to limit the scope of suggestions. In our system we restricted ourselves to 1248 possible concepts, automatically selected from DBpedia by ontological connectivity. This is clearly a tiny fraction of all the possible objects in existence. On one hand this means that our autonomous robot will potentially be quite limited in what it can learn about. On the other hand, a large number of this restricted set of objects still make for highly unlikely suggestions. One reason for this is the corpus-based, automatically-extracted nature of DBpedia, which means that it includes interesting objects which may never be observed by a robot (e.g., Mangle (machine)). More interesting, though, is the effect that the structure of the ontology has on the nature of suggestions. In this work we have been using hierarchical knowledge to underpin our space of hypotheses (i.e. the wider world our robot is placed within), but have not addressed this within our system. This leads to a mismatch between our expectations and the performance of the system with respect to arbitrary precision. For example, if the robot sees a joystick as an unknown object, an appropriate DBpedia concept would seem (to us) to be Controller (computing) or Joystick. However, much more specific concepts such as Thrustmaster and Logitech Thunderpad Digital are also available to the system in its current form. When learning about an object for the first time, it seems much more useful for the robot to receive a suggestion of the former kind (allowing it to later refine its knowledge to locally observable instances) than the latter (which is unlikely to match the environment of the robot). Instead, returning the category of the ranked objects our system suggests allows us to go some way towards this, as shown in Figure 2.6, and still provides us with a range of possible candidate categories – though narrowed down from 641 possible categories to, in some cases, less than 5. As such, from here we can switch to a secondary level of labeling: that of a human-in-the-loop. We will next integrate the suggestion system with a crowd-sourcing platform, allowing humans that inhabit the robot's environment to help it label unknown objects, as shown in the next Section. The suggestion system will be used to narrow down the list of candidate categories that will be shown to users as they provide labels for images of objects the robot has seen and learned, but has not yet labeled.
While further work is necessary to improve on the current 56% of objects that have a category in the top-10 ranked result, we expect that the current results will be sufficient to allow a human to pick a good label when provided with a brief list of candidates and shown images of the unknown objects. Such systems are crucial for life-long situated learning for mobile robot platforms, and will allow robot systems to extend their world models over time, and learn new objects and patterns. The issue of how to select which set of possible objects to draw suggestions from is at the heart of the challenge of this work: make the set too large and it is hard to get good, accurate suggestions, but make it too small and you risk ruling out objects that your robot may need to know about. Whilst the use of frequency-based filtering improved our results by removing low-frequency outliers, more semantically-aware approaches may be necessary to improve things further. Further improvements can be made; for instance, we largely do not use current instance observations about the object, but prefer its historical context when forming queries. This may be the wrong thing to do in some cases; in fact it may be preferable to weight observations of object context based on their recency. The difference between the historical context and the context of an object in a particular instance may provide important contextual clues, and allow us to perform other tasks such as anomaly detection, or boost the speed of object search tasks. One issue we believe our work highlights is the need to integrate a multi-modal approach to the use of differing corpora and ontologies. For instance, the more formal WordNet hierarchy was used to calculate the semantic relatedness of our experiment results, rather than the less formal DBpedia ontology. However, we hold that the DBpedia category relationships are more useful in the human-facing component of our system. There exist other ontologies, such as YAGO, which integrates both WordNet and DBpedia along with its own category system, and which will certainly be of interest to us in the future as we seek to improve the performance of our system. One of our primary goals is to better exploit the hierarchical nature of these ontologies to provide a way of retrieving richer categorical information about objects. While reliably predicting the specific object label from spatial context alone is difficult, we can provide higher-level ancestor categories that could be used to spur further learning or improve previous results. As such, we view the prediction process as one of matching the characteristics of a series of increasingly more specific categories to the characteristics of an unknown object, rather than immediately attempting to match the specific lowest-level characteristics and produce the direct object label. This requires an ontology both formally defined enough to express a meaningful hierarchy of categories for each item, and broad enough to provide a mapping to a large set of common-sense categories and objects. It is not clear yet which combination of existing tools will provide the best route to accomplishing this.

2.2 Mining semantic knowledge from the Web

As discussed before, embodied intelligent systems such as robots require world knowledge to reason on top of their perception of the world in order to decide which actions to take. Consider now another example, that of a robot tasked with tidying up an apartment by storing all objects in their appropriate place. In order to perform this task, the robot would need to understand where the "correct", or at least the "prototypical", location for each object is, in order to come up with an overall plan of which actions to perform to reach the goal of having each object stored in its corresponding location. In general, when manipulating objects, robots might have questions such as the following:

• Where should a certain object typically be stored?

• What is this object typically used for?

• Do I need to manipulate a certain object with care?

The answers to these questions require common sense knowledge about objects, in particular prototypical knowledge about objects that, in absence of abnormal situations or specific contextual conditions or preferences, can be assumed to hold. In this work, we are concerned with extracting such common sense knowledge from a combination of unstructured and semi-structured data. We are in particular interested in extracting default knowledge, that is, prototypical knowledge comprising relations that typically hold in 'normal' conditions [306]. For example, given no other knowledge, in a normal situation, we could assume that milk is typically stored in the kitchen, or more specifically in the fridge. However, if a person is currently having breakfast and eating cornflakes at the table in the living room, then the milk might also be temporarily located in the living room. In this sense, inferences about the location of an object are to be regarded as non-monotonic inferences that can be retracted given some additional knowledge about the particular situation. We model such default, or prototypical, knowledge through a degree of prototypicality; that is, we do not claim that the kitchen is 'the prototypical location' for the milk, but instead we model that the degree of prototypicality for the kitchen being the default location for the milk is very high. This leads naturally to the attempt to computationally model this degree of prototypicality and to rank locations or uses for each object according to it. We attempt to do so following two approaches. On the one hand, we follow a distributional approach and approximate the degree of prototypicality by the cosine similarity measure in a space into which entities and locations are embedded. We experiment with different distributional spaces and show that both semantic vector spaces as considered within the NASARI approach, as well as embedded word representations computed on unstructured texts as produced by predictive language models such as skip-grams, already provide a reasonable performance on the task. A linear combination of both approaches has the potential to improve upon both approaches in isolation. We have presented this approach, including empirical results for the locatedAt relation mentioned above, in previous work [39]. As a second approach to approximate the degree of prototypicality, we use a machine learning approach trained on positive and negative examples using a binary classification scheme. The machine learning approach is trained to produce a score that measures the compatibility of a given pair of object and location/use in terms of their prototypicality. We compare these two approaches in this section, showing that the machine learning approach does not perform as well as expected. Contrary to our intuitions, the unsupervised approach relying on cosine similarity in embedding space represents a very strong baseline that is difficult to beat. The prototypical knowledge we use to train and evaluate the different methods is on the one hand based on a crowdsourcing experiment in which users had to explicitly rate the prototypicality of a certain location for a given object. On the other hand, we also use extracted relations from ConceptNet and the SUN database [483]. Objects as well as candidate locations, or candidate uses in the case of the instrumental relation, are taken from DBpedia.
While we apply our models to known objects, locations and uses, our model could also be applied to candidate objects, locations and uses extracted from raw text. We have different motivations for developing such an approach to extract common sense knowledge from unstructured and semi-structured data. First, from the point of view of cognitive robotics [292] and cognitive development, acquiring common sense knowledge requires many reproducible and similar experiences from which a system can learn how to manipulate a certain object. Some knowledge can arguably not even be acquired by self-experience, as relevant knowledge also comprises the mental properties that humans ascribe to certain objects. Such mental properties that are not intrinsic to the physical appearance of the object include, for instance, the intended use of an object. There are thus limits to what can be learned from self-guided experience with an object. In fact, several scholars have emphasized the importance of cultural learning, that is, of a more direct transmission of knowledge via communication rather than self-experience. With our approach we are simulating such a cultural transmission of knowledge by allowing cognitive systems, or machines in our case, to acquire knowledge by 'reading' texts. Work along these lines has, for instance, tried to derive plans on how to prepare a certain dish by machine reading descriptions of household tasks written for humans that are available on the Web [437]. Other work has addressed the acquisition of scripts from the Web [392]. Second, while there has been a lot of work in the field of information extraction on extracting relations, the considered relations differ from the ones we investigate in this work. Standard relations considered in relation extraction are: is-a, part-of, succession, reaction, production [360, 81] or relation, parent/, founders, directedBy, area served, containedBy, architect, etc. [395], or albumBy, bornInYear, currencyOf, headquarteredIn, locatedIn, productOf, teamOf [57]. The literature up to the time of this work had focused on relations that are of a factual nature and explicitly mentioned in the text. In contrast, we are concerned with relations that are i) typically not mentioned explicitly in text, and ii) not of a factual nature, but rather represent default or prototypical knowledge. These are thus quite different tasks. We present and compare different approaches to collect manipulation-relevant knowledge by leveraging textual corpora and semi-automatically extracted entity pairs. The extracted knowledge is of symbolic form and represented as a set of (Subject, Relation, Object) triples. While this knowledge is not physically grounded [213], this model can still help robots or other intelligent systems to decide how to act, support planning, and select the appropriate actions to manipulate a certain object.

2.2.1 Extraction of relations by a ranking approach based on distributional representations

This section presents our framework to extract relations between pairs of entities for the population of a knowledge base of manipulation-relevant data. We frame the task of relation extraction between entities as a ranking problem, as this gives us great flexibility in generating a knowledge base that balances coverage and confidence. Given a set of triples (s, r, o), where s is the subject entity, r the relation (or predicate) and o the object entity2, we want to obtain a ranking of these triples. The produced ranking of triples should reflect the degree of prototypicality of the objects with respect to the respective subjects and relations. Our general approach to produce these rankings is to design a scoring function f(s, r, o) that assigns a score to each triple, depending on s, r, and o. The scoring function is designed in such a way that prototypical triples are assigned a higher score than less prototypical triples. Sorting all triples by their respective scores produces the desired ranking. With a properly chosen function f(s, r, o), it is possible to extract relations between entities to populate a knowledge base. This is achieved by scoring candidate triples and inserting or rejecting them based on their respective scores, e.g., if the score is above a certain threshold. In this work, we present different scoring functions and evaluate them in the context of building a knowledge base of common sense triples. All of our proposed approaches rely on distributional representations of entities (and words). We investigate different vector representations and scoring functions, all with different strengths and weaknesses. Word space models (or distributional space models, or word vector spaces) are abstract representations of the meaning of words, encoded as vectors in a high-dimensional space. Traditionally, a word vector space is constructed by counting co-occurrences of pairs of words in a text corpus, building a large square n-by-n matrix where n is the size of the vocabulary and the cell i, j contains the number of times the word i has been observed in co-occurrence with the word j. The i-th row in a co-occurrence matrix is an n-dimensional vector that acts as a distributional representation of the i-th word in the vocabulary. The similarity between two words is geometrically measurable with a metric such as the cosine similarity, defined as the cosine of the angle between two vectors:

similarity(\vec{x}, \vec{y}) = \frac{\vec{x} \cdot \vec{y}}{\| \vec{x} \| \, \| \vec{y} \|}

This is the key point linking the vector representation to the idea of semantic relatedness, as the distributional hypothesis states that "words that occur in the same contexts tend to have similar meaning" [216]. Several techniques can be applied to reduce the dimensionality of the co-occurrence matrix. Latent Semantic Analysis [262], for instance, uses Singular Value Decomposition to prune the less informative elements while preserving most of the topology of the vector space, reducing the number of dimensions to 100-500.

2 Here we use the terminology subject and object from the Semantic Web literature instead of the terminology head and tail that is typically found in the relation extraction literature.

Recently, neural network based models have received increasing attention for their ability to compute dense, low-dimensional representations of words. To compute such representations, called word embeddings, several models rely on huge amounts of natural language text from which a vector representation for each word is learned by a neural network. Their representations of the words are therefore based on prediction as opposed to counting [30]. Vector spaces created from word distributional representations have been shown to successfully encode word similarity and relatedness relations [388, 394, 117], and word embeddings have proven to be a useful feature in many natural language processing tasks [121, 266, 149] in that they often encode semantically meaningful information about a word. We argue that it is possible to extract interaction-relevant relations between entities, e.g., (Object, locatedAt, Location), using appropriate entity vectors and the cosine similarity, since the domain and range of the considered relations are sufficiently narrow. In these cases, semantic relatedness might be a good indicator for a relation.

Ranking by cosine similarity and word embeddings. In the beginning of this section, we motivated the use of distributional representations for the extraction of relations in order to populate a database of common sense knowledge. As outlined, we frame the relation extraction task as a ranking problem over triples (s, r, o) and score them based on a corresponding set of vector representations V for subject and object entities. In this section, we propose a neural network-based word embedding model to obtain distributional representations of entities. By using the relation-agnostic cosine similarity3 as our scoring function f(s, r, o) = similarity_cos(~v_s, ~v_o), with ~v_s, ~v_o ∈ V, we can interpret the vector similarity as a measure of semantic relatedness and thus as an indicator for a relation between the two entities. Many word embedding methods encode useful semantic and syntactic properties [251, 326, 322] that we leverage for the extraction of prototypical knowledge. In this work, we restrict our experiments to the skip-gram method [320]. The objective of the skip-gram method is to learn word representations that are useful for predicting context words. As a result, the learned embeddings often display a desirable linear structure [326, 322]. In particular, word representations of the skip-gram model often produce meaningful results using simple vector addition [322]. For this work, we trained the skip-gram model on a corpus of roughly 83 million Amazon reviews [305]. Motivated by the compositionality of word vectors, we derive vector representations for the entities as follows: considering a DBpedia entity4 such as Public_toilet, we obtain the corresponding label, clean it by removing parts in parentheses (if any), convert it to lower case, and split it into its individual words. We retrieve the respective word vectors from our pretrained word embeddings and sum them to obtain a single vector, namely, the vector representation of the entity: ~v_Public_toilet = ~v_public + ~v_toilet. The generation of entity vectors is trivial for "single-word" entities, such as Cutlery or Kitchen, that are already contained in our word vector vocabulary. In this case, the entity vector is simply the corresponding word vector. By following this procedure for every entity in our dataset, we obtain a set of entity vectors V_sg, derived from the original skip-gram word embeddings. With this derived set of entity vector representations, we can compute a score between pairs of entities based on the chosen scoring function, the cosine vector similarity5. Using the example of locatedAt pairs, this score is an indicator of how typical the location is for the object. Given an object, we can create a ranking of locations with the most prototypical location candidates at the top of the list (see Table 2.2). We refer to this model henceforth as SkipGram/Cosine.

3 We also experimented with APSyn [404] as an alternative similarity measure which, unfortunately, did not work well in our scenario.
4 For simplicity, we only use the local parts of the entity URIs, neglecting the namespace http://dbpedia.org/resource/
5 For any entity vector that cannot be derived from the word embeddings due to missing vocabulary, we assume a similarity of -1 to every other entity.

Table 2.2: Locations for a sample object, extracted by computing cosine similarity on skip-gram-based vectors.

Object       Location   Cos. Similarity
Dishwasher   Kitchen    .636
                        .531
                        .525
                        .519
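The entity-vector construction and the SkipGram/Cosine ranking can be sketched as follows. This is an illustrative reimplementation rather than the original code: the gensim library, the model file name, and the candidate list are placeholders standing in for the skip-gram model trained on the Amazon review corpus.

    import numpy as np
    from gensim.models import KeyedVectors

    # Placeholder path: a pretrained skip-gram model in word2vec format.
    vectors = KeyedVectors.load_word2vec_format("skipgram_amazon_reviews.bin", binary=True)

    def entity_vector(dbpedia_label):
        """Derive an entity vector by cleaning the DBpedia label and summing its word vectors."""
        words = dbpedia_label.lower().replace("_", " ").split()
        word_vectors = [vectors[w] for w in words if w in vectors]
        if not word_vectors:
            return None   # not representable; treated as having similarity -1 to everything
        return np.sum(word_vectors, axis=0)

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # SkipGram/Cosine: rank candidate locations for an object by cosine similarity.
    obj = entity_vector("Dishwasher")
    candidates = ["Kitchen", "Pantry", "Bathroom", "Classroom"]
    ranking = sorted(
        (loc for loc in candidates if entity_vector(loc) is not None),
        key=lambda loc: cosine(obj, entity_vector(loc)),
        reverse=True)
    print(ranking)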

Ranking by cosine similarity and semantically-aware entity representations. Vector representations of words are attractive since they only require a sufficiently large text corpus with no manual annotation. However, the drawback of focusing on words is that a series of linguistic phenomena may affect the vector representation. For instance, a polysemous word such as rock (stone, musical genre, metaphorically strong person, etc.) is represented by a single vector where all the senses are conflated. NASARI [96], a resource containing vector representations of most DBpedia entities, solves this problem by building a vector space of concepts. The NASARI vectors are actually distributional representations of the entities in BabelNet [341], a large multilingual lexical resource linked to WordNet, DBpedia, Wiktionary and other resources. The NASARI approach collects co-occurrence information of concepts from Wikipedia and then applies a cluster-based dimensionality reduction. The context of a concept is based on the set of Wikipedia pages where a mention of it is found. As shown by Camacho-Collados et al. [96], the vector representations of entities encode some form of semantic relatedness, with tests on a sense clustering task showing positive results. Table 2.3 shows a sample of pairs of NASARI vectors together with their pairwise cosine similarity, ranging from -1 (opposite direction, i.e. unrelated) to 1 (same direction, i.e. related).

Table 2.3: Examples of cosine similarity computed on NASARI vectors.

             Cherry   Microsoft
Apple        .917     .325
Apple Inc.   .475     .778

Following the hypothesis put forward in the beginning of this section, we focus on the extraction of interaction-relevant relations by computing the cosine similarities of entities. We exploit the alignment of BabelNet with DBpedia, thus generating a similarity score for pairs of DBpedia entities. For example, the DBpedia entity Dishwasher has a cosine similarity of .803 to the entity Kitchen, but only .279 to Classroom, suggesting that the prototypical location for a generic dishwasher is the kitchen rather than a classroom. Since cosine similarity is a graded value on a scale from -1 to 1, we can generate, for a given object, a ranking of candidate locations, e.g., the rooms of a house. Table 2.4 shows a sample of object-location pairs of DBpedia labels, ordered by the cosine similarity of their respective vectors in NASARI. Prototypical locations for the objects show up at the top of the list as expected, indicating a relationship between the semantic relatedness expressed by the cosine similarity of vector representations and the actual locative relation of entities. We refer to this model as NASARI/Cosine.

Table 2.4: Locations for a sample object, extracted by computing cosine similarity on NASARI vectors.

Object        Location            Cos. Similarity
Dishwasher    Kitchen             .803
              Air shower (room)   .788
                                  .763
                                  .758
              room                .749
Paper towel   Air shower (room)   .671
                                  .634
              Bathroom            .632
              Mizuya              .597
              Kitchen             .589
Sump pump     Furnace room        .699
              Air shower (room)   .683
                                  .680
                                  .676

Ranking by a trained scoring function. In the previous sections, we presented models of semantic relatedness for the extraction of relations. The employed cosine similarity function of these models is relation-agnostic, that is, it only measures whether there is a relation between two entities, but not which relation in particular. The question that naturally arises is: instead of using a single model that is agnostic to the relation, can we train a separate model for each relation in order to improve the extraction performance? We try to answer this question by introducing a new model, based on supervised learning. To extend the proposed approach to any kind of relation, we modify the first model we presented (i.e., ranking by cosine similarity and word embeddings) by introducing a parameterized scoring function. This scoring function replaces the cosine similarity which was previously employed to score pairs of entities (e.g., Object-Location). By tuning the parameters of this new scoring function in a data-driven way, we are able to predict scores with respect to arbitrary relations. We define the new scoring function f(s, r, o) as a bilinear form:

f(s, r, o) = \tanh(v_s^{\top} M_r v_o + b_r)    (2.1)

where v_s, v_o ∈ V ⊆ R^d are the corresponding embedding vectors for the subject and object entities s and o, respectively, b_r is a bias term, and M_r ∈ R^{d×d} is the scoring matrix corresponding to the relation r. Our scoring function is closely related to the ones proposed by Jenatton et al. [242] as well as Yang et al. [485]; however, we make use of the tanh activation function to map the scores to the interval (−1, 1). In part, this relates to the Neural Tensor Network proposed by Socher et al. [417]. By initializing M_r as the identity matrix and b_r with 0, the inner term of the scoring function corresponds initially to the dot product of v_s and v_o, which is closely related to the originally employed cosine similarity. In order to learn the parameters M_r and b_r of the scoring function, we follow a procedure related to Noise Contrastive Estimation [329] and Negative Sampling [322], which is also used in the training of the skip-gram embeddings. This method uses "positive" and "negative" triples, T^+_train and T^-_train, to iteratively adapt the parameters. The positive triples T^+_train are triples that truly express the respective relation. In our case, these triples are obtained by crowdsourcing and leveraging other resources (see Section 2.2.2). Given these positive triples, the set of corrupted negative triples T^-_train is generated in the following way: we generate negative triples (s′, r, o) and (s, r, o′) for each positive triple (s, r, o) ∈ T^+_train by selecting negative subject and object entities s′ and o′ randomly from the set of all possible subjects and objects, respectively. The exact number of negative triples that we generate per positive triple is a hyper-parameter of the model, which we set to 10 triples6 for all our experiments. The training of the scoring function is framed as a classification task where we try to assign scores of 1 to all positive triples and scores of −1 to (randomly generated) negative triples. We employ the mean squared error (MSE) as the training objective:

L = \frac{1}{N} \Big[ \sum_{(s,r,o) \in T^{+}_{train}} \big(1 - f(s, r, o)\big)^2 + \sum_{(s,r,o) \in T^{-}_{train}} \big(-1 - f(s, r, o)\big)^2 \Big]    (2.2)

where N = |T^+_train| + |T^-_train| is the size of the complete training set. During training, we keep the embedding vectors V fixed and only consider M_r and b_r as trainable parameters to measure the effect of the scoring function in isolation. Presumably, this allows for a better generalization to previously unseen entities. Due to the moderate size of our training data, we regularize our model by applying Dropout [422] to the embedding vectors of the head and tail entity. We set the dropout fraction to 0.1, thus only dropping a small portion of the 100-dimensional input vectors. The supervised model differs from the unsupervised approaches in that the scoring function is tuned to a particular relation, e.g., the locatedAt relation from Section 2.2.2. In the following, we denote this model as SkipGram/Supervised.
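A compact sketch of this supervised scoring function is given below. It is an illustrative reimplementation in PyTorch, not the original code: it shows the bilinear form of Equation 2.1 with the identity/zero initialization, dropout on the fixed input embeddings, and the squared-error objective of Equation 2.2.

    import torch

    class BilinearScorer(torch.nn.Module):
        """f(s, r, o) = tanh(v_s^T M_r v_o + b_r) for a single relation r."""
        def __init__(self, dim=100, dropout=0.1):
            super().__init__()
            self.M = torch.nn.Parameter(torch.eye(dim))    # M_r initialized as the identity
            self.b = torch.nn.Parameter(torch.zeros(1))    # b_r initialized to 0
            self.dropout = torch.nn.Dropout(p=dropout)     # regularization on input vectors

        def forward(self, v_s, v_o):
            # v_s, v_o: batches of fixed (non-trainable) entity embeddings, shape (batch, dim)
            v_s, v_o = self.dropout(v_s), self.dropout(v_o)
            scores = ((v_s @ self.M) * v_o).sum(dim=-1) + self.b
            return torch.tanh(scores)

    def training_loss(scores_pos, scores_neg):
        """Mean squared error pushing positive triples towards 1 and negative ones towards -1."""
        n = scores_pos.numel() + scores_neg.numel()
        return (((1.0 - scores_pos) ** 2).sum() + ((-1.0 - scores_neg) ** 2).sum()) / n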

2.2.2 Datasets

This section introduces the datasets that we use in this work. We consider three types of datasets: i) a crowdsourced set of triples expressing the locatedAt relation with human judgments, ii) a semi-automatically extracted set of triples expressing the locatedAt relation, and iii) a semi-automatically extracted set of usedFor triples.

Crowdsourcing of Object-Location Rankings. In order to acquire valid pairs for the locatedAt relation we rely on a crowdsourcing approach. In particular, given a certain object, we used crowdsourcing to collect judgments about the likelihood of finding this object at a set of predefined locations. To select the objects and locations for this experiment, every DBpedia entity that falls under the category Domestic implements, or under one of the categories narrower than Domestic implements according to SKOS7, is considered an object. The SPARQL query is given as:

select distinct ?object where {
    # category membership via dct:subject; SKOS category hierarchy via skos:broader
    { ?object dct:subject dbc:Domestic_implements }
    UNION
    { ?object dct:subject ?category .
      ?category skos:broader dbc:Domestic_implements . }
}

6 5 triples (s′, r, o) where we corrupt the subject entity and 5 triples (s, r, o′) where the object entity is replaced.
7 Simple Knowledge Organization System: https://www.w3.org/2004/02/skos/

Every DBpedia entity that falls under the category Rooms is considered a location. The respective query is:

select distinct ?room where { ?room dct:subject dbc:Rooms }

These steps result in 336 objects and 199 locations (as of September 2016). To select suitable pairs expressing the locatedAt relation for the creation of the gold standard, we filter out odd or uncommon examples of objects or locations like Ghodiyu or Fainting room. We do this by ordering the objects by the number of incoming links to their respective Wikipedia page8 in descending order and selecting the 100 top-ranking objects for our gold standard. We proceed analogously for the locations, selecting 20 common locations, and thus obtain 2,000 object-location pairs in total. In order to collect the judgments, we set up a crowdsourcing experiment on the CrowdFlower platform9. For each of the 2,000 object-location pairs, contributors were asked to rate the likelihood of the object being in that location on a four-point scale:

• -2 (unexpected): finding the object in the room would cause surprise, e.g., it is unex- pected to find a bathtub in a cafeteria.

• -1 (unusual): finding the object in the room would be odd, the object feels out of place, e.g., it is unusual to find a mug in a .

• 1 (plausible): finding the object in the room would not cause any surprise, it is seen as a normal occurrence, e.g., it is plausible to find a funnel in a .

• 2 (usual): the room is the place where the object is typically found, e.g, the kitchen is the usual place to find a spoon.

Contributors were shown ten examples per page, instructions, a short description of the entities (the first sentence from the Wikipedia abstract), a picture (from Wikimedia Commons, when available10), and the list of possible answers as labeled radio buttons. After running the crowdsourcing experiment for a few hours, we collected 12,767 valid judgments, whereas 455 judgments were deemed "untrusted" by CrowdFlower's quality filtering system. The quality control was based on 57 test questions that we provided and a required minimum accuracy of 60% on these questions for a contributor to be considered trustworthy. In total, 440 contributors participated in the experiment. The pairs received on average 8.59 judgments each. Most of the pairs received at least 5 separate judgments, with some outliers collecting more than one hundred judgments each. The average agreement, i.e., the percentage of contributors that gave the most common answer for a given question, is 64.74%. The judgments are skewed towards the negative end of the spectrum, as expected, with 37% of pairs rated unexpected, 30% unusual, 24% plausible and 9% usual. The cost of the experiment was 86 USD.

8 We use the URI counts extracted from the parsing of Wikipedia with the DBpedia Spotlight tool for entity linking [133].
9 http://www.crowdflower.com/
10 Pictures were available for 94 out of 100 objects.

To use this manually labeled data in later experiments, we normalize, filter and rearrange the scored pairs and obtain three gold standard datasets. For the first gold standard dataset, we reduce the multiple human judgments for each object-location pair to a single score by assigning the average of the numeric values. For instance, if the pair (Wallet, Ballroom) has been rated -2 (unexpected) six times, -1 (unusual) three times, and never 1 (plausible) or 2 (usual), its score will be about -1.6, indicating that a Wallet is not very likely to be found in a Ballroom. For each object, we then produce a ranking of all 20 locations by ordering them by their averaged score for the given object. We refer to this dataset of human-labeled rankings as locatedAt-Human-rankings. The second and third gold standard datasets are produced as follows: the contributors' answers are aggregated using relative majority, that is, each object-location pair has exactly one judgment assigned to it, corresponding to the most popular judgment among all the contributors that answered that question. We extract two sets of relations from this dataset to be used as a gold standard for experimental tests: one list of the 156 pairs rated 2 (usual) by the majority of contributors, and a larger list of the 496 pairs rated either 1 (plausible) or 2 (usual). The aggregated judgments in the gold standard have a confidence score assigned to them by CrowdFlower, based on a measure of inter-rater agreement. Pairs that score low on this confidence measure (≤ 0.5) were filtered out, leaving 118 and 496 pairs, respectively. We refer to these two gold standard sets as locatedAt-usual and locatedAt-usual/plausible.
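The construction of the first gold standard can be illustrated with a short sketch; the judgment data and variable names below are invented for illustration, with the four-point labels mapped to the numeric values described above.

    from collections import defaultdict

    # Numeric values of the four-point scale.
    SCALE = {"unexpected": -2, "unusual": -1, "plausible": 1, "usual": 2}

    # judgments: (object, location, label) tuples collected from contributors (toy data).
    judgments = [("Wallet", "Ballroom", "unexpected"),
                 ("Wallet", "Ballroom", "unusual"),
                 ("Spoon", "Kitchen", "usual")]

    # Average the numeric judgments per object-location pair.
    scores = defaultdict(list)
    for obj, loc, label in judgments:
        scores[(obj, loc)].append(SCALE[label])
    averaged = {pair: sum(vals) / len(vals) for pair, vals in scores.items()}

    # For each object, rank the candidate locations by their averaged score.
    rankings = defaultdict(list)
    for (obj, loc), score in averaged.items():
        rankings[obj].append((loc, score))
    for obj in rankings:
        rankings[obj].sort(key=lambda pair: pair[1], reverse=True)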

Semi-supervised extraction of object-location triples. The SUN database [483] is a large-scale resource for computer vision and object recognition in images. It comprises 131,067 single images, each of them annotated with a label for the type of scene, and labels for each object identified in the scene. The images are annotated with 908 categories based on the type of scene (garden, airway, ...). Moreover, 313,884 objects were recognized and annotated with one out of 4,479 category labels. Despite its original goal of providing high-quality data for training computer vision models, the SUN project generated a wealth of semantic knowledge that is independent from the vision tasks. In particular, the labels are effectively semantic categories of entities such as objects and locations (scenes, using the lexical conventions of the SUN database). Objects are observed in particular scenes, and this relational information is retained in the database. In total, we extracted 31,407 object-scene pairs from SUN, together with the number of occurrences of each pair. The twenty most frequent pairs are shown in Table 2.5. According to its documentation, the labels of the SUN database are lemmas from WordNet. However, they are not disambiguated and thus could refer to any meaning of the lemma. Most importantly for our goals, the labels in their current state are not directly linked to any LOD resource. Faced with the problem of mapping the SUN database completely to a resource like DBpedia, we adopted a safe strategy for the sake of the gold standard creation. We took all the object and scene labels from the SUN pairs for which a resource in DBpedia with matching label exists. In order to limit the noise and obtain a dataset of "typical" location relations, we also removed those pairs that only occur once in the SUN database. This process resulted in 2,961 pairs of entities. We manually checked them and corrected 118 object labels and 44 location labels. In some cases the correct label was already present, so we eliminated the duplicates, resulting in a new dataset of 2,935 object-location pairs11. The collected triples are used as training data. We refer to this dataset as locatedAt-Extracted-triples.
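A minimal sketch of the pair extraction and filtering step is shown below; the annotation list is a stand-in for the actual SUN database records, which store the scene type and the recognized objects for each image.

    from collections import Counter

    # Stand-in for SUN annotations: one (scene_type, [object labels]) entry per image.
    images = [("kitchen", ["wall", "cabinet", "floor"]),
              ("bedroom", ["bed", "wall", "pillow"]),
              ("bedroom", ["bed", "night table"])]

    # Count how often each object-scene pair occurs across all images.
    pair_counts = Counter()
    for scene, objects in images:
        for obj in objects:
            pair_counts[(obj, scene)] += 1

    # Keep only pairs observed more than once, to retain "typical" location relations.
    typical_pairs = {pair: n for pair, n in pair_counts.items() if n > 1}
    print(typical_pairs)   # e.g. {('bed', 'bedroom'): 2}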

Semi-supervised extraction of object-action triples. While the methods we propose for relation extraction are by design independent of the particular relations they are applied to, we have focused most of our experimental effort on one kind of relation between objects and locations, namely the typical location where given objects are found. As a first step to assess the generalizability of our approaches to other kinds of relations, we created an alternative dataset revolving around a relation with the same domain as the location relation, i.e., objects, but a very different range, that is, actions. The relation under consideration will be referred to in the rest of the article as usedFor; for example, the predicate usedFor(soap, bath) states that the soap is used for (or during, or in the process of) taking a bath. We built a dataset of object-action pairs in a usedFor relation starting from ConceptNet 5 [282], a large semantic network of automatically collected commonsense facts (see also Section 2.2.5). From the entire ConceptNet, we extracted 46,522 links labeled usedFor. Although ConceptNet is partly linked to LOD resources, we found the coverage of such linking to be quite low, especially with respect to non-named entities such as objects. Therefore, we devised a strategy to link as many as possible of the labels involved in usedFor relations to DBpedia, without risking to compromise the accuracy of such linking. The strategy is quite simple and starts from the observation of the data: for the first argument of the relation, we search DBpedia for an entity whose label matches the ConceptNet label. For the second argument, we search DBpedia for an entity label that matches the gerund form of the ConceptNet label, e.g., Bath→Bathing. We perform this step because we noticed that actions are usually referred to with nouns in ConceptNet, but with verbs in the gerund form in DBpedia. We used the morphology generation tool for English morphg [323] to generate the correct gerund forms also for irregular verbs. The application of this linking strategy resulted in a dataset of 1,674 pairs of DBpedia entities. Table 2.6 shows a few examples of pairs in the dataset.

11Of all extracted triples, 24 objects and 12 locations were also among the objects and locations of the crowdsourced dataset.

Table 2.5: Most frequent pairs of object-scene in the SUN database.

Frequency   Object        Scene
1041                      b/bedroom
1011        bed           b/bedroom
949         floor         b/bedroom
663         desk lamp     b/bedroom
650         night table   b/bedroom
575                       b/bedroom
566                       b/bedroom
473         pillow        b/bedroom
463         wall          b/bathroom
460         curtain       b/bedroom
406         painting      b/bedroom
396         floor         b/bathroom
393         cushion       b/bedroom
380         wall          k/kitchen
370         wall          d/dining room
364         chair         d/dining room
355         table         d/dining room
351         floor         d/dining room
349         cabinet       k/kitchen
344         sky           s/skyscraper

Table 2.6: Examples of DBpedia entities in a usedFor relation, according to ConceptNet and our DBpedia linking strategy.

Object                  Action
Machine                 Drying
Dictionary              Looking
Ban                     Saving
Cake                    Jumping
Moon
Tourniquet              Saving
Dollar                  Saving
Rainbow                 Finding
Fast food restaurant    Meeting
Clipboard               Keeping

To use this data as training and test data for the proposed models, we randomly divide the complete set of positive (Object, usedFor, Action) triples into a training portion (527 triples) and a test portion (58 triples). We combine each object entity in the test portion with each action entity to generate a complete test set, comprised of positive and negative triples12. To account for variations in the performance due to this random partitioning, we repeat each experiment 100 times and report the averaged results in the experimental section. The average size of the test set is ≈ 2,059. We refer to this dataset as usedFor-Extracted-triples.
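The construction of the usedFor test set can be sketched as follows; the toy triples and the 2/1 split are placeholders for the actual 527/58 partition described above.

```python
import random

# Schematic version of the test-set construction; the triple list is a placeholder
# for the 1,674 linked (Object, usedFor, Action) pairs.
positives = [("Soap", "Bathing"), ("Wok", "Cooking"), ("Typewriter", "Typing")]

random.shuffle(positives)
train, test = positives[:2], positives[2:]

test_objects = {o for o, _ in test}
all_actions = {a for _, a in positives}
positive_set = set(positives)

# Cross product of test objects with all actions; pairs that are true positives
# anywhere in the data are removed instead of being falsely labeled negative.
test_set = []
for o in test_objects:
    for a in all_actions:
        if (o, a) in test:
            test_set.append((o, a, 1))
        elif (o, a) not in positive_set:
            test_set.append((o, a, 0))
```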

2.2.3 Evaluation

This section presents the evaluation of the proposed framework for relation extraction. We apply our models to the data described in Section 2.2.2, consisting of sets of (Object, locatedAt, Location) and (Object, usedFor, Action) triples. These experiments verify the feasibility of our approach for the population of a knowledge base of manipulation relevant data. We start our experiments by evaluating how well the produced rankings of (Object, locatedAt, Location) triples match the ground truth rankings obtained from human judgments. For this, we i) present the evaluations for the unsupervised methods SkipGram/Cosine and NASARI/Cosine, ii) show the performance of combinations thereof, and iii) evaluate the newly proposed SkipGram/Supervised method. The second part of our experiments evaluates how well each proposed method performs in extracting a knowledge base. The evaluation is performed for (Object, locatedAt, Location) and (Object, usedFor, Action) triples.

Ranking evaluation. With the proposed methods from the previous sections, we are able to produce a ranking of, e.g., locations for a given object that expresses how prototypical the location is for that object. To test the validity of our methods, we compare their output against the gold standard rankings locatedAt-Human-rankings that we obtained from the crowdsourced pairs. As a first evaluation, we investigate how well the unsupervised baseline methods perform in creating object-location rankings. Secondly, we show how to improve these results by combining different approaches. Thirdly, we evaluate the supervised model in comparison to our baselines.

Unsupervised object-location ranking evaluation. Apart from the NASARI-based method and the skip-gram-based method, we employ two simple baselines for comparison. For the location frequency baseline, the object-location pairs are ranked according to the frequency of the location.

12We filter out all generated triples that are falsely labeled as negative in this process.

Table 2.7: Average Precision@k for k = 1 and k = 3 and average NDCG of the produced rankings against the gold standard rankings.

Method                        NDCG   P@1    P@3
Location frequency baseline   .851   .000   .008
Link frequency baseline       .875   .280   .260
NASARI/Cosine                 .903   .390   .380
SkipGram/Cosine               .912   .350   .400

The ranking is thus the same for each object, since the score of a pair is only computed based on the location. This method makes sense in the absence of any further information on the object: e.g., a robot tasked to find an unknown object should inspect “common” rooms such as a kitchen or a first, rather than “uncommon” rooms such as a pantry. The second baseline, the link frequency, is based on counting how often every object appears on the Wikipedia page of every location and vice versa. A ranking is produced based on these counts. An issue with this baseline is that the collected counts can be sparse, i.e., most object-location pairs have a count of 0, thus sometimes producing no value for the ranking of an object. This is the case for rather “unusual” objects and locations. For each object in the dataset, we compare the location ranking produced by our algorithms to the crowdsourced gold standard ranking and compute two metrics: the Normalized Discounted Cumulative Gain (NDCG) and the Precision at k (Precision@k or P@k). The NDCG is a measure of rank correlation used in information retrieval that gives more weight to the results at the top of the list than at its bottom. It is defined as follows:

NDCG(R) = DCG(R) / DCG(R∗)

DCG(R) = R1 + Σ_{i=2}^{|R|} Ri / log2(i + 1)

where R is the produced ranking, Ri is the true relevance of the element at position i, and R∗ is the ideal ranking of the elements in R. R∗ can be obtained by sorting the elements by their true relevance scores. This choice of evaluation metric follows from the idea that it is more important to accurately predict which locations are likely for a given object than to decide which are unlikely candidates. While the NDCG measure gives a complete account of the quality of the produced rankings, it is not easy to interpret apart from comparisons of different outputs. To gain a better insight into our results, we provide an alternative evaluation, the Precision@k. The Precision@k measures the number of locations among the first k positions of the produced rankings that are also among the top-k locations in the gold standard ranking. It follows that, with k = 1, precision at 1 is 1 if the top returned location is the top location in the gold standard, and 0 otherwise. We compute the average of Precision@k for k = 1 and k = 3 across all the objects. Table 2.7 shows the average NDCG and Precision@k across all objects for the methods NASARI/Cosine and SkipGram/Cosine, plus the two baselines introduced above. Both our methods that are based on semantic relatedness outperform the simple baselines with respect to the gold standard rankings. The location frequency baseline performs very poorly, due to an idiosyncrasy in the frequency data, that is, the most “frequent” location in the dataset is Aisle. This behavior reflects the difficulty in evaluating this task using only automatic metrics, since automatically extracted scores and rankings may not correspond to common sense judgment.
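For reference, the two metrics can be computed as follows; `ndcg` expects the true relevance scores listed in the order produced by the ranking method, and `precision_at_k` follows the definition given above. This is an illustrative transcription of the formulas, not the evaluation code used in the experiments.

```python
import math

def dcg(relevances):
    # DCG(R) = R_1 + sum_{i=2..|R|} R_i / log2(i + 1), as defined above
    return relevances[0] + sum(r / math.log2(i + 1)
                               for i, r in enumerate(relevances[1:], start=2))

def ndcg(predicted_relevances):
    ideal = sorted(predicted_relevances, reverse=True)
    return dcg(predicted_relevances) / dcg(ideal)

def precision_at_k(predicted_ranking, gold_ranking, k):
    # fraction of the top-k predicted locations that are also in the gold top-k
    return len(set(predicted_ranking[:k]) & set(gold_ranking[:k])) / k
```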

Table 2.8: Rank correlation and precision at k for the methods based on the fallback strategy and on the linear combination.

Method                              NDCG   P@1    P@3
Fallback strategy (threshold=.4)    .907   .410   .393
Fallback strategy (threshold=.5)    .906   .400   .393
Fallback strategy (threshold=.6)    .908   .410   .406
Fallback strategy (threshold=.7)    .909   .370   .396
Fallback strategy (threshold=.8)    .911   .360   .403
Linear combination (α=.0)           .912   .350   .400
Linear combination (α=.2)           .911   .380   .407
Linear combination (α=.4)           .913   .400   .423
Linear combination (α=.6)           .911   .390   .417
Linear combination (α=.8)           .910   .390   .410
Linear combination (α=1.0)          .903   .390   .380

The NASARI-based similarities outperform the skip-gram-based method when it comes to guessing the most likely location for an object (Precision@1), as opposed to the better performance of SkipGram/Cosine in terms of Precision@3 and rank correlation. We explored the results and found that for 19 objects out of 100, NASARI/Cosine correctly guesses the top ranking location where SkipGram/Cosine fails, while the opposite happens 15 out of 100 times. We also found that the NASARI-based method has a lower coverage than the skip-gram-based method, due to the coverage of the original resource (NASARI), where not every entity in DBpedia is assigned a vector13. The skip-gram-based method also suffers from this problem, however only for very rare or uncommon objects and locations (such as Triclinium or Jamonera). These findings suggest that the two methods could have different strengths and weaknesses. In the following section we show two strategies to combine them.

Hybrid methods: fallback pipeline and linear combination. The results from the previous sections highlight that the performance of our two main methods may differ qualitatively. In an effort to overcome the coverage issue of NASARI/Cosine, and at the same time experiment with hybrid methods to extract location relations, we devised two simple ways of combining the SkipGram/Cosine and NASARI/Cosine methods. The first method is based on a fallback strategy: given an object, we consider the similarity of the object to its top ranking location according to NASARI/Cosine as a measure of confidence. If this similarity exceeds a certain threshold, we consider the ranking returned by NASARI/Cosine reliable. Otherwise, if the similarity is below the threshold, we deem the result unreliable and adopt the ranking returned by SkipGram/Cosine instead. The second method produces object-location similarity scores by a linear combination of the NASARI and skip-gram similarities. The similarity score for a generic pair s, o is thus given by:

simα(s, o) = α · simNASARI (s, o) + (1 − α) · simSkipGram(s, o), (2.3) where parameter α controls the weight of one method with respect to the other. Table 2.8 shows the obtained results, with varying values of the parameters threshold and α. While the NDCG is only moderately affected, both Precision@1 and Precision@3 show an increase in performance with Precision@3 showing the highest score of all investigated methods.
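Both hybrid strategies reduce to a few lines. In the sketch below, `sim_nasari` and `sim_skipgram` are placeholders for the two relatedness functions; the default threshold and α values are among those explored in Table 2.8.

```python
# Illustrative sketch of the two hybrid strategies, not the original implementation.
def fallback_ranking(obj, locations, sim_nasari, sim_skipgram, threshold=0.6):
    nasari = sorted(locations, key=lambda l: sim_nasari(obj, l), reverse=True)
    if sim_nasari(obj, nasari[0]) >= threshold:   # confident: keep the NASARI ranking
        return nasari
    return sorted(locations, key=lambda l: sim_skipgram(obj, l), reverse=True)

def linear_combination_ranking(obj, locations, sim_nasari, sim_skipgram, alpha=0.4):
    def sim(l):                                    # Eq. (2.3)
        return alpha * sim_nasari(obj, l) + (1 - alpha) * sim_skipgram(obj, l)
    return sorted(locations, key=sim, reverse=True)
```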

Supervised object-location ranking. In the previous experiments, we investigated how well our (unsupervised) baseline methods perform when extracting the locatedAt relation. In the following, we compare these earlier results to the performance of a scoring function trained in a supervised fashion.

13Objects like Backpack and Comb, and locations like are all missing.

Table 2.9: Average precision at k for k = 1 and k = 3 and average NDCG of the produced rankings against the crowdsourced gold standard rankings. SkipGram/Supervised denotes the supervised model based on skip-gram embeddings trained for the locatedAt relation.

Method                        NDCG   P@1    P@3
Location frequency baseline   .851   .000   .008
Link frequency baseline       .875   .280   .260
NASARI/Cosine                 .903   .390   .380
SkipGram/Cosine               .912   .350   .400
Linear combination (α=.4)     .913   .400   .423
SkipGram/Supervised           .908   .454   .387

For this experiment we train the scoring function in Eq. (2.1) to extract the locatedAt relation between objects and locations. The underlying embeddings V on which the scoring function computes its scores are fixed to the skip-gram embeddings Vsg. We train the supervised method on the semi-automatically extracted triples locatedAt-Extracted-triples previously described. These triples act as the positive triples T+train in the training procedure, from which we also generate the negative examples T−train following the procedure in Section 2.2.1. We train the model by generating 10 negative triples per positive triple and minimizing the mean squared error from Eq. (2.2). We initialize Mr with the identity matrix, br with 0, and train the model parameters using stochastic gradient descent (SGD) with a learning rate of 0.001. SGD is performed in mini batches of size 100 with 300 epochs of training. The training procedure is realized with Keras [112]. As before, we test the model on the human-rated set of objects and locations locatedAt-Human-rankings described in Section 2.2.2 and produce a ranking of locations for each object. Table 2.9 shows the performance of the extended model (SkipGram/Supervised) in comparison to the previous approaches. Overall, we can observe mixed results. All of our proposed models (supervised and unsupervised) improve upon the baseline methods with respect to all evaluation metrics. Compared to the SkipGram/Cosine model, the SkipGram/Supervised model decreases slightly in performance with respect to the NDCG and more so for the Precision@3 score. Most striking, however, is the increase in Precision@1 of SkipGram/Supervised, showing a relative improvement of 30% over the SkipGram/Cosine model and constituting the highest overall Precision@1 score by a large margin. However, the linear combination (α=.4) still scores higher with respect to Precision@3 and NDCG. While the presented results do not point to a clear preference for one particular model, in the next section we will investigate the above methods more closely in the context of the generation of a knowledge base.
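To make the training procedure concrete, the following NumPy sketch assumes that Eq. (2.1) is a bilinear score squashed into [-1, 1] and that Eq. (2.2) is a squared error against +1/-1 targets. Both are assumptions made for illustration, as the exact equations are defined earlier in the chapter, and the original training was implemented in Keras.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 50                                   # embedding size (placeholder)
# frozen skip-gram vectors for a handful of toy entities
V = {e: rng.normal(size=dim) for e in ("Wok", "Kitchen", "Ballroom")}

M_r = np.eye(dim)                          # relation matrix, initialised with the identity
b_r = 0.0                                  # relation bias, initialised with 0

def score(o, l):
    # assumed reading of Eq. (2.1): bilinear score squashed into [-1, 1]
    return np.tanh(V[o] @ M_r @ V[l] + b_r)

def sgd_step(o, l, target, lr=0.001):
    # one SGD step on the squared error (prediction - target)^2
    global M_r, b_r
    raw = V[o] @ M_r @ V[l] + b_r
    pred = np.tanh(raw)
    grad = 2.0 * (pred - target) * (1.0 - pred ** 2)   # d loss / d raw
    M_r -= lr * grad * np.outer(V[o], V[l])
    b_r -= lr * grad

# positive triples get target +1; each positive is paired with sampled negatives (target -1)
for epoch in range(300):
    sgd_step("Wok", "Kitchen", 1.0)
    for _ in range(10):                    # 10 negatives per positive, as in the text
        sgd_step("Wok", "Ballroom", -1.0)

print(round(float(score("Wok", "Kitchen")), 2), round(float(score("Wok", "Ballroom")), 2))
```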

Retrieval evaluation. In the previous section, we tested how the proposed methods perform in determining a ranking of locations given an object. For the purpose of evaluation, the tests have been conducted on a closed set of entities. In this section we return to the original motivation of this work, that is, to collect manipulation-relevant information about objects in an automated fashion in the form of a knowledge base. All the methods introduced in this work are based on some scoring function of triples expressed as a real number in the range [-1,1] and thus interpretable as a sort of confidence score relative to the target relation. Therefore, by imposing a threshold on the similarity scores and selecting only the object-location pairs that score above said threshold, we can extract a high-confidence set of object-location relations to build a new knowledge base from scratch. Moreover, by using different values for the threshold, we are able to control the quality and the coverage of the produced relations.

Figure 2.9: Evaluation on automatically created knowledge bases (“usual” locations). The three panels show (a) Precision, (b) Recall and (c) F-score as a function of the number k of retrieved triples, for SkipGram/Cosine, avrg. SkipGram/Supervised, the linear combination and NASARI/Cosine.

Figure 2.10: Evaluation on automatically created knowledge bases (“plausible” and “usual” locations). The three panels show (a) Precision, (b) Recall and (c) F-score as a function of the number k of retrieved triples, for the same four methods as in Figure 2.9.

We test this approach on:

• the locatedAt-usual and locatedAt-usual/plausible datasets for the locatedAt relation between objects and locations, and

• the usedFor-Extracted-triples dataset for the usedFor relation between objects and actions.

We introduce the usedFor relation in order to assess the generalizability of our supervised scoring function. In general, we extract a knowledge base of triples by scoring each possible candidate triple, thus producing an overall ranking. We then select the top k triples from the ranking, with k being a parameter. This gives us the triples that are considered the most prototypical. We evaluate the retrieved set in terms of Precision, Recall and F-score against the gold standard sets with varying values of k. Here, the precision is the fraction of correctly retrieved triples in the set of all retrieved triples, while the recall is the fraction of gold standard triples that are also retrieved. The F-score is the harmonic mean of precision and recall:

Precision = |G ∩ Rk| / |Rk|

Recall = |G ∩ Rk| / |G|

F1 = (2 · Precision · Recall) / (Precision + Recall)

with G denoting the set of gold standard triples and Rk the set of retrieved triples up to rank k.
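The retrieval evaluation is a direct transcription of these formulas. In the sketch below, `ranked_triples` is the list of candidate triples sorted by decreasing score and `gold` is the gold standard set; both names are placeholders.

```python
def retrieval_scores(ranked_triples, gold, k):
    # Precision, Recall and F1 of the top-k retrieved triples against the gold set
    retrieved = set(ranked_triples[:k])
    hits = len(retrieved & gold)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(gold)
    f1 = (2 * precision * recall / (precision + recall)) if hits else 0.0
    return precision, recall, f1
```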

Figure 2.11: Evaluation of knowledge base generation for the usedFor relation between objects and actions. Average Precision, Recall and F-score are given with respect to extracting the top k scored triples.

For the locatedAt relation, we also add to the comparison the results of the hybrid, linear combination method, with the best performing parameters in terms of Precision@1, namely the linear combination with α = 0.4. Figures 2.9 and 2.10 show the evaluation of the four methods against the two aggregated gold standard datasets for the locatedAt relation described above. Figures 2.9c and 2.10c, in particular, show F-score plots for a direct comparison of the performance. The SkipGram/Supervised model achieves the highest F-score on the locatedAt-usual dataset, peaking at k = 132 with an F-score of 0.415. The SkipGram/Cosine model and the linear combination outperform both the NASARI/Cosine and the SkipGram/Supervised models in terms of recall, especially for higher k. This also holds for the locatedAt-usual/plausible dataset. Here, the SkipGram/Supervised model stands out by achieving high precision values for small values of k. Overall, SkipGram/Supervised performs better for small k (50–400) whereas SkipGram/Cosine and the linear combination obtain better results with increasing k. This seems to be in line with the results from previous experiments in Table 2.9 that show a high Precision@1 for the SkipGram/Supervised model but higher scores for SkipGram/Cosine and the linear combination in terms of Precision@3.

Evaluation of object-action pairs extraction. One of the reasons to introduce a novel technique for relation extraction based on a supervised statistical method, as stated previously, is to be able to scale the extraction across different types of relations. To test the validity of this statement, we apply the same evaluation procedure introduced in the previous part of this section to the usedFor relation. For the training and evaluation sets we use the dataset usedFor-Extracted-triples, comprising semi-automatically extracted triples from ConceptNet. Figure 2.11 displays precision, recall and F-score for retrieving the top k results. The results are averaged scores over 100 experiments, to account for variations in performance due to the random partitioning into training and evaluation triples and the generation of negative samples. The standard deviation for precision, recall and F-score for all k is visualized along the mean scores. The supervised model achieves on average a maximum F-score of about 0.465 when extracting 70 triples. This is comparable to the F-scores achieved when training the scoring function for the locatedAt relation. To give an insight into the produced false positives, Table 2.10 shows the top 30 extracted triples for the usedFor relation of one trained instance of the supervised model.

Table 2.10: A list of the top 30 extracted triples for the usedFor relation. The gray highlighted rows mark the entity pairs that are part of the gold standard dataset (Section 2.2.2).

Score     Object                Action
1.00000   Snack                 Snacking
0.99896   Snack                 Eating
0.99831   Curry                 Seasoning
0.99773   Drink                 Drinking
0.98675   Garlic                Seasoning
0.98165   Oatmeal               Snacking
0.98120   Food                  Eating
0.96440   Pistol                Shooting
0.95218   Drink                 Snacking
0.94988   Bagel                 Snacking
0.94926   Wheat                 Snacking
0.93778   Laser                 Printing
0.92760   Food                  Snacking
0.91946   Typewriter            Typing
0.91932   Oatmeal               Eating
0.91310   Wok                   Cooking
0.89493   Camera                Shooting
0.85415   Coconut               Seasoning
0.85091   Stove                 Frying
0.85039   Oatmeal               Seasoning
0.84038   Bagel                 Eating
0.83405   Cash                  Gambling
0.81985   Oatmeal               Baking
0.80975   Lantern               Lighting
0.80129   Calculator            Typing
0.78279   Laser                 Shooting
0.77411   Camera                Recording
0.75712   Book                  Writing
0.72924   Stove                 Cooking
0.72280   Coconut               Snacking

2.2.4 Building a Knowledge Base of object locations

Given these results, we can aim for a high-confidence knowledge base by selecting the threshold on object-location similarity scores that produces a reasonably high precision knowledge base in the evaluation. For instance, the knowledge base made by the top 50 object-location pairs extracted with the linear combination method (α = 0.4) has 0.52 precision and 0.22 recall on the locatedAt-usual gold standard (0.70 and 0.07 respectively on the locatedAt-usual/plausible set, see Figures 2.9a and 2.10a). The similarity scores in this knowledge base range from 0.570 to 0.866. Following the same methodology that we used to construct the gold standard set of objects and locations (Section 2.2.2), we extract all the 336 Domestic implements and 199 Rooms from DBpedia, for a total of 66,864 object-location pairs. Selecting only the pairs whose similarity score is higher than 0.570, according to the linear combination method, yields 931 high confidence location relations. Of these, only 52 were in the gold standard set of pairs (45 were rated “usual” or “plausible” locations), while the remaining 879 are new, such as (Trivet, Kitchen), (Flight bag, Airport lounge) or (Soap dispenser, Unisex public toilet). The distribution of objects across locations has an arithmetic mean of 8.9 objects per location and standard deviation 11.0. Kitchen is the most represented location with 89 relations, while 15 out of 107 locations are associated with one single object.14 The knowledge base created with this method is the result of one among many possible configurations of a number of methods and parameters. In particular, the creator of a knowledge base involving the extraction of relations is given the choice to prefer precision over recall, or vice-versa. This is done, in our method, by adjusting the threshold on the similarity scores. Employing different algorithms for the computation of the actual similarities (word embeddings vs. entity vectors, supervised vs. unsupervised models) is also expected to result in different knowledge bases. A qualitative assessment of such impact is left for future work.

2.2.5 Related work on mining semantic knowledge from the Web for robotics

To obtain information about unknown objects from the Web, a robot can use perceptual or knowledge-based queries. Future systems will inevitably need to use both. In the previous sections we have focused on the knowledge-based approach, but this can be seen as complementary to systems which use image-based queries to search databases of labeled images for similarity, e.g., [378]. Although the online learning of new visual object models is currently a niche area in robotics, some approaches do exist [167, 181]. These approaches are capable of segmenting previously unknown objects in a scene and building models to support their future re-recognition. However, this work focuses purely on visual models (what objects look like), and does not address how the learnt objects are described semantically (what objects are). The RoboSherlock framework [43] (which we build upon) is one of the most prominent projects to add semantic descriptions to objects for everyday environments, but the framework must largely be configured a priori with knowledge of the objects in its environment. It does support more open-ended performance, e.g., through the use of Google Goggles, but does not use spatial or semantic context for its Web queries, only vision. The same research group pioneered Web and cloud robotics, where tools such as KnowRob [436] (also used in RoboSherlock) both formalized robot knowledge and capabilities, and used this formal structure to exploit the Web for remote data sources and knowledge sharing. In a more supervised setting, many approaches have used humans to train mobile robots about new objects in their environment [191, 419], and robots have also used Web knowledge sources to improve their performance in closed worlds, e.g., the use of object-room co-occurrence data for room categorization in [211].

The spatial organization of a robot’s environment has also been previously exploited to improve task performance. For example, [439, 259] present a system in which the previous experience of spatial arrangements of desktop objects is used to refine the results of a noisy object categorization system. This demonstrates the predictive power of spatial arrangements, which is something we also exploit in our work. However, this prior work matched between scenes in the same environment and input modality. In our work we connect spatial arrangements in the robot’s situated experience to structured knowledge on the Web. Our predictions for unknown objects rely on determining the semantic relatedness of terms. This is an important topic in several areas, including data mining, information retrieval and web recommendation. [407] applies ontology-based similarity measures in the robotics domain. Background knowledge about all the objects the robot could encounter is stored in an extended version of the KnowRob ontology. Then, WUP similarity [479] is applied to calculate relatedness of the concept types by considering the depth of the concepts and the depth of their lowest common super-concept in the ontology. [267] presents an approach for computing the semantic relatedness of terms using ontological information extracted from DBpedia for a given domain, using the results for music recommendations. Contrary to these approaches, we compute the semantic relatedness between objects by leveraging the vectorial representation of the DBpedia concepts provided by the NASARI resource [96]. This method links back to earlier distributional semantics work (e.g., Latent Semantic Analysis [262]), with the difference that here concepts are represented as vectors, rather than words. The work described in Section 2.2 relates to the four research lines discussed below, namely: i) machine reading, ii) supervised relation extraction, iii) encoding common sense knowledge in domain-independent ontologies and knowledge bases, and iv) grounding of knowledge from the perspective of cognitive linguistics.

14The full automatically created knowledge base and used resources are available at https://project.inria.fr/aloof/data/.

The machine reading paradigm. In the field of knowledge acquisition from the Web, there has been substantial work on extracting taxonomic (e.g., hypernym), part-of relations [194] and complete qualia structures describing an object [115]. Quite recently, there has been a focus on the development of systems that can extract knowledge from any text on any domain (the open information extraction paradigm [166]). The DARPA Machine Reading Program [29] aimed at endowing machines with capabilities for lifelong learning by automatically reading and understanding texts (e.g., [165]). While such approaches are able to quite robustly acquire knowledge from texts, these models are not sufficient to meet our objectives since: i) they lack visual and sensorimotor grounding, ii) they do not contain extensive object knowledge. While the knowledge extracted by our approach presented here is also not sensorimotorically grounded, we hope that it can support planning of tasks involving object manipulation. Thus, we need to develop additional approaches that can harvest the Web to learn about usages, appearance and functionality of common objects. While there has been some work on grounding symbolic knowledge in language [336], so far there has been no serious effort to compile a large and grounded object knowledge base that can support cognitive systems in understanding objects.

Supervised Relation Extraction. While machine reading attempts to acquire general knowledge by reading texts, other works attempt to extract specific relations using classifiers trained in a supervised approach using labeled data. A training corpus in which the relation of interest is annotated is typically assumed (e.g., [81]). Another possibility is to rely on the so called distant supervision approach and use an existing knowledge base to bootstrap the process by relying on triples or facts in the knowledge base to label examples in a corpus (e.g., [225, 226, 225, 432]). Some researchers have modeled relation extraction as a matrix decomposition problem [395]. Other researchers have attempted to train relation extraction approaches in a bootstrapping fashion, relying on knowledge available on the Web, e.g., [58]. Recently, scholars have tried to build models that can learn to extract generic relations from the data, rather than a set of pre-defined relations (see [277] and [63]). Related to these models are techniques to predict triples in knowledge graphs by relying on the embedding of entities (as vectors) and relations (as matrices) in the same distributional space (e.g., TransE [65] and TransH [469]). Similar ideas were tested in computational linguistics in the past years, where relations and modifiers are represented as tensors in the distributional space [31, 138].

Ontologies and KB of common sense knowledge. DBpedia15 [270] is a large-scale knowl- edge base automatically extracted from the infoboxes of Wikipedia. Besides its sheer size, it is attractive for the purpose of collecting general knowledge given the one-to-one mapping with Wikipedia (allowing us to exploit the textual and structural information contained in there) and its position as the central hub of the Linked Open Data cloud.

15http://dbpedia.org

YAGO [431] is an ontology automatically extracted from WordNet and Wikipedia. YAGO extracts facts from the category system and the infoboxes of Wikipedia, and combines these facts with taxonomic relations derived from WordNet. Despite its high coverage, for our goals, YAGO suffers from the same drawbacks as DBpedia, i.e., a lack of knowledge about common objects, that is, about their purpose, functionality, shape, prototypical location, etc. ConceptNet16 [282] is a semantic network containing lots of things computers should know about the world. However, we cannot integrate ConceptNet directly in our pipeline because of the low coverage of its mapping with DBpedia: of the 120 DBpedia entities in our gold standard (see Section 2.2.2), only 23 have a corresponding node in ConceptNet. NELL (Never Ending Language Learning) is the product of a continuously refined process of knowledge extraction from text [327]. Although NELL is a large-scale and quite fine-grained resource, there are some drawbacks that prevent it from being effectively used as a commonsense knowledge base. The inventory of predicates and relations is very sparse, and categories (including many objects) have no predicates. OpenCyc17 [271] attempts to assemble a comprehensive ontology and knowledge base of everyday common sense knowledge, with the goal of enabling AI applications to perform human-like reasoning. Several projects worldwide have attempted to develop knowledge bases for robots through which knowledge, e.g., about how to manipulate certain objects, can be shared among many robots. Examples of such platforms are the RoboEarth project [462], RoboBrain [405] or KnowRob [435], discussed before. While the above resources are without doubt very useful, we are interested in developing an approach that can extract new knowledge leveraging text corpora, complementing the knowledge contained in ontologies and knowledge bases such as the ones described above.

Grounded Knowledge and Cognitive Linguistics. Many scholars have argued that, from a cognitive perspective, knowledge needs to be grounded [213] as well as modality-specific to support simulation, a mental activity that is regarded as ubiquitous in cognitive intelligent systems [34]. Other seminal work has argued that cognition is categorical [214, 215] and that perceptual and cognitive reasoning rely on schematic knowledge. In particular, there has been substantial work on describing the schemas by which we perceive and understand spatial knowledge [434]. The knowledge we have gathered is neither grounded nor schematic, nor modality-specific in the above senses, but rather amodal and symbolic. This type of knowledge is arguably useful in high-level planning but clearly is not sufficient to support simulation or even action execution. Developing models by which natural language can be grounded in action has been the concern of other authors, e.g., Misra et al. [325] as well as Bollini et al. [64]. Some work has considered extracting spatial relations in natural language input [253]. Differently from the above mentioned works, we are neither interested in interpreting natural language with respect to grounded action representations nor in extracting spatial relations from a given sentence. Rather, our goal is to extract prototypical common sense background knowledge from large corpora.

2.3 Natural Language Processing of Song Lyrics

Given that lyrics encode an important part of the semantics of a song, this section is dedicated to the description of my research activity on extracting relevant information from the lyrics, such as their structure segmentation, their topics, the explicitness of the lyrics content, the salient passages of a song and the emotions conveyed. This research was done in the context of the WASABI project.

Understanding the structure of song lyrics (e.g., intro, verse, chorus) is an important task for music content analysis, since it allows to split a song into semantically meaningful segments, enabling a description of each section rather than a global description of the whole song. Carrying out this task by means of an automated system is a challenging but useful task, that would allow to enrich song lyrics with improved structural clues that can be used for instance by search engines handling real-world large song collections.

16http://conceptnet5.media.mit.edu/
17http://www.opencyc.org/ as RDF representations: http://sw.opencyc.org/

2.4 Lyrics Segmentation via bimodal text-audio representation

Understanding the structure of song lyrics (e.g., intro, verse, chorus) is an important task for music content analysis [106, 473], since it allows to split a song into semantically meaningful segments, enabling a description of each section rather than a global description of the whole song. The importance of this task arises also in Music Information Retrieval, where music structure detection is a research area aiming at automatically estimating the temporal structure of a music track by analyzing the characteristics of its audio signal over time. Given that lyrics contain rich information about the semantic structure of a song, relying on textual features could help in overcoming the existing difficulties associated with large acoustic variation in music. However, so far only a few works have addressed the task lyrics-wise [175, 296, 473, 27]. Carrying out structure detection by means of an automated system is therefore a challenging but useful task, that would allow to enrich song lyrics with improved structural clues that can be used for instance by search engines handling real-world large song collections. Going a step further, a complete music search engine should support search criteria exploiting both the audio and the textual dimensions of a song. Structure detection consists of two steps: a text segmentation stage that divides lyrics into segments, and a semantic labelling stage that labels each segment with a structure type (e.g., intro, verse, chorus). Given the variability in the set of structure types provided in the literature according to different genres [433, 71], rare attempts have been made to achieve the second step, i.e., semantic labelling. While addressing the first step is the core contribution of this section, we leave the task of semantic labelling for future work. In [175] we proposed a first neural approach for lyrics segmentation that relied on purely textual features. However, with this approach we fail to capture the structure of the song in case there is no clear structure in the lyrics, that is, when sentences are never repeated or, in the opposite case, when they are always repeated. In such cases, however, the structure may arise from the acoustic/audio content of the song, often from the melody representation. This section aims at extending the approach proposed in [175] by complementing the textual analysis with acoustic aspects. We perform lyrics segmentation on a synchronized text-audio representation of a song to benefit from both textual and audio features. In this direction, this work focuses on the following research question: given the text and audio of a song, can we learn to detect the lines delimiting segments in the song text? This question is broken down into two sub-questions: 1) given solely the song text, can we learn to detect the lines delimiting segments in the song? and 2) do audio features, in addition to the text, boost the model performance on the lyrics segmentation task?

To address these questions, this article contains the following contributions: 1. We introduce a convolutional neural network-based model that i) efficiently exploits the Self-Similarity Matrix representations (SSM) used in the state-of-the-art [473], and ii)

can utilize traditional features alongside the SSMs.

2. We experiment with novel features that aim at revealing different properties of a song text, such as its phonetics and syntax. We evaluate this unimodal (purely text-based) approach on two standard datasets of English lyrics, the Music Lyrics Database and the WASABI corpus. We show that our proposed method can effectively detect the boundaries of music segments outperforming the state of the art, and is portable across collections of song lyrics of heterogeneous musical genre.

3. We experiment with a bimodal lyrics representation that incorporates audio features into our model. For this, we use a novel bimodal corpus (DALI) in which each song text is time-aligned to its associated audio. Our bimodal lyrics segmentation performs significantly better than the unimodal approach. We investigate which text and audio features are the most relevant to detect lyrics segments and show that the text and audio modalities complement each other. We perform an ablation test to find out to what extent our method relies on the alignment quality of the lyrics-audio segment representations.

To better understand the rationale underlying the proposed approach, consider the segmentation of the Pop song depicted in Figure 2.12. The left side shows the lyrics and its segmentation into its structural parts: the horizontal green lines indicate the segment borders between the different lyrics segments. We can summarize the segmentation as follows: Verse1-Verse2-Bridge1-Chorus1-Verse3-Bridge2-Chorus2-Chorus3-Chorus4-Outro. The middle of Figure 2.12 shows the repetitive structure of the lyrics. The exact nature of this structure representation is introduced later and is not needed to understand this introductory example. The crucial point is that the segment borders in the song text (green lines) coincide with highlighted rectangles in the chorus (the Ci) of the lyrics structure (middle). We find that in the verses (the Vi) and bridges (the Bi) highlighted rectangles are only found in the melody structure (right). The reason is that these verses have different lyrics, but share the same melody (analogous for the bridges). While the repetitive structure of the lyrics is an effective representation for lyrics segmentation, we believe that an enriched segment representation that also takes into account the audio of a song can improve segmentation models. While previous approaches relied on purely textual features for lyrics segmentation, showing the discussed limitations, we propose to perform lyrics segmentation on a synchronized text-audio representation of a song to benefit from both textual and audio features.

Figure 2.12: Lyrics (left) of a Pop song, the repetitive structure of the lyrics (middle), and the repetitive structure of the song melody (right). Lyrics segment borders (green lines) coincide with highlighted rectangles in the lyrics structure and in the melody structure. (“Don‘t Break My Heart” by Den Harrow)

2.4.1 Modelling segments in song lyrics

Detecting the structure of a song text is a non-trivial task that requires diverse knowledge and consists of two steps: text segmentation followed by segment labelling. In this work we focus on the task of segmenting the lyrics. This first step is fundamental to segment labelling when segment borders are not known. Even when segment borders are “indicated” by line breaks in lyrics available online, those line breaks have usually been annotated by users, and neither are they necessarily identical to those intended by the artist, nor do users in general agree on where to put them. Thus, a method to automatically segment unsegmented song texts is needed to automate that first step. Many heuristics can be imagined to find the segment borders. In our example, separating the lyrics into segments of a constant length of four lines (Figure 2.12) gives the correct segmentation. However, in another example, the segments can be of different length. This is to say that enumerating heuristic rules is an open-ended task. We follow [473] by casting the lyrics segmentation task as binary classification. Let L = {a1, a2, ..., an} be the lyrics of a song composed of n lyrics lines and seg : L → B be a function that returns for each line ai ∈ L whether it is the end of a segment. The task is to learn a classifier that approximates seg. At the learning stage, the ground truth segment borders are observed from segmented text as double line breaks. At the testing stage the classifier has to predict the now hidden segment borders. Note that lyrics lines do not exist in isolation, as lyrics are texts that accompany music. Therefore, each lyrics line is naturally associated to a segment of audio. We define a bimodal lyrics line ai = (li, si) as a pair containing both the i-th text line li, and its associated audio segment si. In the case we only use the textual information, we model this as unimodal lyrics lines18, i.e., ai = (li). In order to infer the lyrics structure, we rely on our Convolutional Neural Network-based model that we introduced in [175]. Our model architecture is detailed later in this section. It detects segment boundaries by leveraging the repeated patterns in a song text that are conveyed by the Self-Similarity Matrices.

Self-Similarity Matrices. We produce Self-Similarity Matrices (SSMs) based on bimodal lyrics lines ai = (li, si) in order to capture repeated patterns in the text line li as well as in its associated audio segment si. SSMs have been previously used in the literature to estimate the structure of music [183, 118] and lyrics [473, 175]. Given a song consisting of bimodal lines {a1, a2, ..., an}, a Self-Similarity Matrix SSMM ∈ R^{n×n} is constructed, where each element is set by computing a similarity measure between the two corresponding elements: (SSMM)ij = simM(xi, xj). We choose xi, xj to be elements from the same modality, i.e., they are either both lyrics lines (li) or both audio segments (si) associated to lyrics lines. simM is a similarity measure that compares two elements of the same modality to each other. In our experiments, this is either a text-based or an audio-based similarity. As a result, SSMs constructed from a text-based similarity highlight distinct patterns of the text, revealing the underlying structure (see Figure 2.12, middle). Analogously, SSMs constructed from an audio-based similarity highlight distinct patterns of the audio (see Figure 2.12, right). In the unimodal case, we compute SSMs from only one modality: either text lines li or audio segments si. There are two common patterns that were investigated in the literature: diagonals and rectangles. Diagonals parallel to the main diagonal indicate sequences that repeat and are typically found in a chorus. Rectangles, on the other hand, indicate sequences in which all the lines are highly similar to one another. Both of these patterns were found to be indicators of segment borders.
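A minimal sketch of the SSM construction is given below; `sim` stands for any of the text-based or audio-based similarities defined later in this section, and the crude word-overlap similarity is used here only to make the example runnable.

```python
import numpy as np

def self_similarity_matrix(elements, sim):
    # (SSM)_ij = sim(x_i, x_j) over all pairs of elements of one modality
    n = len(elements)
    ssm = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ssm[i, j] = sim(elements[i], elements[j])
    return ssm

# Example with a toy text similarity (word overlap) on unimodal lyrics lines.
def word_overlap(a, b):
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

lines = ["I love you", "I love you", "Something else entirely"]
print(self_similarity_matrix(lines, word_overlap))
```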

18 This definition can be straightforwardly extended to more modalities, ai then becomes a tuple containing time-synchronized information.

Figure 2.13: Convolutional Neural Network-based model inferring lyrics segmentation.

Convolutional Neural Network-based Model. Lyrics segments manifest themselves in the form of distinct patterns in the SSM. In order to detect these patterns efficiently, we introduce the Convolutional Neural Network (CNN) architecture illustrated in Figure 2.13. The model predicts for each lyrics line whether it is a segment ending. For each of the n lines of a song text the model receives patches (see Figure 2.13, step A) extracted from the SSMs ∈ R^{n×n} and centered around the line: inputi = {Pi^1, Pi^2, ..., Pi^c} ∈ R^{2w×n×c}, where c is the number of SSMs or number of channels and w is the window size. To ensure the model captures the segment-indicating patterns regardless of their location and relative size, the input patches go through two convolutional layers (see Figure 2.13, step B) [195], using filter sizes of (w + 1) × (w + 1) and 1×w, respectively. By applying max pooling after both convolutions, each feature is down-sampled to a scalar. After the convolutions, the resulting feature vector is concatenated with the line-based features (see Figure 2.13, step C) and goes through a series of densely connected layers. Finally, the softmax is applied to produce probabilities for each class (border/not border) (see Figure 2.13, step D). The model is trained with supervision using binary cross-entropy loss between predicted and ground truth segment border labels (see Figure 2.13, step E). Note that while the patch extraction is a local process, the SSM representation captures global relationships, namely the similarity of a line to all other lines in the lyrics.
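One plausible Keras rendering of this architecture is sketched below. The number of filters, the dense layer size and the exact pooling configuration are assumptions (the text fixes only the two filter sizes, the concatenation with line-based features and the two-class softmax output), so this is an illustration rather than the authors' implementation.

```python
from tensorflow import keras
from tensorflow.keras import layers

# w = window size, n = number of lines, c = number of SSM channels,
# d = number of line-based features; all values here are placeholders.
w, n, c, d = 2, 40, 3, 10

patch_in = keras.Input(shape=(2 * w, n, c), name="ssm_patch")
line_in = keras.Input(shape=(d,), name="line_features")

x = layers.Conv2D(16, (w + 1, w + 1), padding="same", activation="relu")(patch_in)
x = layers.MaxPooling2D(pool_size=(2, 2))(x)
x = layers.Conv2D(32, (1, w), padding="same", activation="relu")(x)
x = layers.GlobalMaxPooling2D()(x)            # each feature map collapsed to a scalar

x = layers.Concatenate()([x, line_in])        # add the line-based features
x = layers.Dense(64, activation="relu")(x)
out = layers.Dense(2, activation="softmax")(x)  # border / not border

model = keras.Model([patch_in, line_in], out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```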

Bimodal Lyrics Lines. To perform lyrics segmentation on a bimodal text-audio representation of a song and benefit from both textual and audio features, we use a corpus where the annotated lyrics ground truth (segment borders) is synchronized with the audio. This bimodal dataset is described in the following subsection. We focus solely on the audio extracts that have singing voice, as only they are associated to the lyrics. For that, let ti be the time interval of the (singing event of) text line li in our synchronized text-audio corpus. Then, a bimodal lyrics line ai = (li, si) consists of both a text line li (the text line during ti) and its associated audio segment si (the audio segment during ti). As a result, we have the same number of text lines and audio segments. While the textual information li can be used directly to produce SSMs, the complexity of the raw audio signal prevents it from being used as direct input of our system. Instead, it is common to extract features from the audio that highlight some aspects of the signal that are correlated with the different musical dimensions. Therefore, we describe each audio segment si as a set of different time vectors. Each frame of a vector contains information of a precise and small time interval. The size of each audio frame depends on the configuration of each audio feature. We call an audio segment si featurized by a feature f if f is applied to all frames of si. For our bimodal segment representation we featurize each si with one of the following features:

• Mel-frequency cepstral coefficients (mfcc ∈ R^14): these coefficients [137] emphasize parts of the signal that are related to our understanding of the musical timbre. They have proven to be very efficient in a large range of audio applications.

• Chroma feature (chr ∈ R^12): this feature [188] describes the harmonic information of each frame by computing the “presence” of the twelve different notes.

2.4.2 Datasets

Song texts are available widely across the Web in the form of user-generated content. Unfortunately for research purposes, there is no comprehensive publicly available online resource that would allow a more standardized evaluation of research results. This is mostly attributable to copyright limitations and has been criticized before in [301]. Research therefore is usually undertaken on corpora that were created using standard web-crawling techniques by the respective researchers. Due to the user-generated nature of song texts on the Web, such crawled data is potentially noisy and heterogeneous, e.g., the way in which line repetitions are annotated can range from verbatim duplication to something like Chorus (4x) to indicate repeating the chorus four times. In the following we describe the lyrics corpora we used in our experiments. First, MLDB and WASABI are purely textual corpora. Complementarily, DALI is a corpus that contains bimodal lyrics representations in which text and audio are synchronized.

MLDB and WASABI. The Music Lyrics Database (MLDB) V.1.2.719 is a proprietary lyrics corpus of popular songs of diverse genres. We use this corpus in the same configuration as used before by the state of the art in order to facilitate a comparison with their work. Consequently, we only consider English song texts that have five or more segments, and we use the same training, development and test indices, which is a 60%-20%-20% split. In total we have 103k song texts with at least 5 segments. 92% of the remaining song texts count between 6 and 12 segments. The WASABI corpus20 [317] is a larger corpus of song texts, consisting of 744k English song texts with at least 5 segments, and for each song it provides the following information: its lyrics21, the synchronized lyrics when available22, DBpedia abstracts and categories the song belongs to, genre, label, writer, release date, awards, producers, artist and/or band members, the stereo audio track from Deezer, when available, the unmixed audio tracks of the song, its ISRC, bpm, and duration.

DALI. The DALI corpus23 [316] contains synchronized lyrics-audio representations on different levels of granularity: syllables, words, lines and segments/paragraphs. It was created by joining two datasets: (1) a corpus for karaoke singing (AMX), which contains alignments between lyrics and audio on the syllable level, and (2) a subset of WASABI lyrics that belong to the same songs as the lyrics in AMX. Note that corresponding lyrics in WASABI can differ from those in AMX to some extent. Also, in AMX there is no annotation of segments. DALI provides estimated segments for AMX lyrics, projected from the ground truth segments from WASABI. For example, Figure 2.14 shows on the left side the lyrics lines as given in AMX. The right side shows the lyrics lines given in WASABI as well as the ground truth lyrics segments.

19http://www.odditysoftware.com/page-datasales1.htm
20https://wasabi.i3s.unice.fr/
21Extracted from http://lyrics.wikia.com/
22From http://usdb.animux.de
23https://github.com/gabolsgabs/DALI

Table 2.11: The DALI dataset partitioned by alignment quality.

Corpus name    Alignment quality   Song count
Q+             high (90-100%)      1048
Q0             med (52-90%)        1868
Q−             low (0-52%)         1868
full dataset   -                   4784

The left side shows the estimated lyrics segments in AMX. Note how the lyrics in WASABI have one segment more, as the segment W3 has no counterpart in AMX. Based on the requirements for our task, we derive a measure to assess how well the estimated AMX segments correspond / align to the ground truth WASABI segments. Since we will use the WASABI segments as ground truth labels for supervised learning, we need to make sure that the AMX lines (and hence the audio information) actually belong to the aligned segment. As we have aligned audio features only for the AMX lyrics segments, and we want to consistently use audio features in our segment representations, we make sure that every AMX segment has a counterpart WASABI segment (see Figure 2.14, A0 ∼ W0, A1 ∼ W1, A2 ∼ W2, A3 ∼ W4). On the other hand, we allow WASABI segments to have no corresponding AMX segments (see Figure 2.14, W3). We further do not impose constraints on the order of appearance of segments in AMX segmentations vs. WASABI segmentations, to allow for possible rearrangements in the order of corresponding segments. With these considerations, we formulate a measure of alignment quality that is tailored to our task of bimodal lyrics segmentation. Let A, W be segmentations, where A = A0A1...Ax and the Ai are AMX segments, and W = W0W1...Wy with WASABI lyrics segments Wi. Then the alignment quality between the segmentations A, W is composed from the similarities of the best-matching segments. Using the string similarity simstr as defined in the following subsection, we define the alignment quality Qual as follows:

Qual(A, W) = Qual(A0A1...Ax, W0W1...Wy) = min_{0≤i≤x} max_{0≤j≤y} simstr(Ai, Wj)

In order to test the impact of Qual on the performance of our lyrics segmentation algorithm, we partition the DALI corpus into parts with different Qual. Initially, DALI consists of 5358 lyrics that are synchronized to their audio track. Like in previous publications [473, 175], we ensure that all song texts contain at least 5 segments. This constraint reduces the number of tracks used by us to 4784. We partition the 4784 tracks based on their Qual into high (Q+), med (Q0), and low (Q−) alignment quality datasets. Table 2.11 gives an overview of the resulting dataset partitions. The Q+ dataset consists of 50842 lines and 7985 segment borders and has the following language distribution: 72% English, 11% German, 4% French, 3% Spanish, 3% Dutch, 7% other languages.
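A small sketch of the Qual measure follows; `difflib`'s ratio is used here only as a stand-in for the normalized Levenshtein similarity simstr defined in the next subsection, and the two toy segmentations are illustrative.

```python
import difflib

def sim_str(a: str, b: str) -> float:
    # stand-in for the normalized Levenshtein string similarity
    return difflib.SequenceMatcher(None, a, b).ratio()

def qual(amx_segments, wasabi_segments):
    # every AMX segment must have a good WASABI counterpart:
    # min over AMX segments of the best match among WASABI segments
    return min(max(sim_str(a, w) for w in wasabi_segments) for a in amx_segments)

A = ["hello darkness my old friend", "i've come to talk with you again"]
W = ["hello darkness my old friend", "i've come to talk with you again", "extra segment"]
print(round(qual(A, W), 2))   # high quality: every AMX segment is matched
```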

2.4.3 Experimental setting

The central input features used by our Convolutional Neural Network-based model are the different SSMs. Therefore, the choice of similarities used to produce the SSMs is essential to the approach. We experiment with both text-based and audio-based similarities, described in the following paragraphs. We then detail the model configurations and hyperparameters. We describe separately the settings for the experiments using unimodal lyrics lines (text only) and bimodal lyrics lines (text and audio), respectively.

Figure 2.14: Lyrics lines and estimated lyrics segments in AMX (left). Lyrics lines and ground truth lyrics segments in WASABI (right) for the song "Don't Break My Heart" by Den Harrow.

Similarity Measures. We produce SSMs based on three line-based text similarity measures. We further add audio-based similarities - the crucial ingredient that makes our approach multimodal. In the following, we define the text-based and audio-based similarities used to compute the SSMs.

Text similarities: given the text lines of the lyrics, we compute different similarity measures, based on either their characters, their phonetics or their syntax.

• String similarity (simstr): a normalized Levenshtein string edit similarity between the characters of two lines of text [272]. This has been widely used - e.g., [473, 175].

• Phonetic similarity (simphon): a simplified phonetic representation of the lines com- puted using the “Double Metaphone Search Algorithm” [379]. When applied to “i love you very much” and “i’l off you vary match” it returns the same result: “ALFFRMX”. This algorithm was developed to capture the similarity of similar sounding words even with possibly very dissimilar orthography. We translate the text lines into this “phonetic language” and then compute simstr between them.

• Lexico-syntactical similarity (simlsyn): this measure, initially proposed in [170], combines lexical with syntactical similarity. It captures the similarity between text lines such as "Look into my eyes" and "I look into your eyes": these are partially similar on a lexical level and partially similar on a syntactical level. Given two lines x, y, the lexico-syntactical similarity is defined as:

sim_{lsyn}(x, y) = sim_{lex}^{2}(x, y) + (1 - sim_{lex}) \cdot sim_{syn}(\hat{x}, \hat{y}), where sim_{lex} is the overlap of the bigrams of words in x and y, and sim_{syn} is the overlap of the bigrams of POS tags in x̂, ŷ, the remaining tokens that did not overlap on the word level.
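As an illustration of how the SSM input to the model can be produced, the sketch below builds a line-level self-similarity matrix from a normalized Levenshtein similarity standing in for simstr; implementation details such as the library used are assumptions.

```python
# Illustrative sketch: build a line-level self-similarity matrix (SSM) from a
# normalized Levenshtein similarity standing in for sim_str. The library
# choice is an assumption.
import numpy as np
import Levenshtein

def sim_str(x, y):
    """Normalized string edit similarity between two text lines."""
    if not x and not y:
        return 1.0
    return 1.0 - Levenshtein.distance(x, y) / max(len(x), len(y))

def line_ssm(lines, sim=sim_str):
    """SSM M with M[i, j] = sim(line_i, line_j); symmetric by construction."""
    n = len(lines)
    ssm = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            ssm[i, j] = ssm[j, i] = sim(lines[i], lines[j])
    return ssm

lyrics = ["i love you very much", "look into my eyes",
          "i love you very much", "na na na"]
print(line_ssm(lyrics).round(2))
```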

Audio similarities: There are several alternatives to measure the similarity between two audio sequences (e.g., mfcc sequences) of possibly different lengths, among which Dynamic Time Warping T_d is the most popular one in the Music Information Retrieval community. Given bimodal lyrics lines a_u, a_v (as previously defined), we compare two audio segments s_u and s_v that are featurized by a particular audio feature (mfcc, chr) using T_d:

T_d(i, j) = d(s_u(i), s_v(j)) + \min \{ T_d(i-1, j),\ T_d(i-1, j-1),\ T_d(i, j-1) \}

Td must be parametrized by an inner distance d to measure the distance between the frame i of su and the frame j of sv. Depending on the particular audio feature su and sv are featurized with, we employ a different inner distance as defined below. Let m be the length of the vector su and n be the length of sv. Then, we compute the minimal distance between the two audio sequences as Td(m, n) and normalize this by the length r of the shortest alignment path between su and sv to obtain values in [0,1] that are comparable to each other. We finally apply λx.(1−x) to turn the distance Td into a similarity measure Sd:

S_d(s_u, s_v) = 1 - T_d(m, n) \cdot r^{-1}

Given bimodal lyrics lines ai, we now define similarity measures between audio segments si that are featurized by a particular audio feature presented previously (mfcc, chr) based on our similarity measure Sd:

• MFCC similarity (simmfcc): S_d between two audio segments featurized by the mfcc feature. As inner distance we use the cosine distance: d(x, y) = x \cdot y \cdot (\|x\| \cdot \|y\|)^{-1}

• Chroma similarity (simchr): Sd between two audio segments featurized by the chroma feature; using cosine distance as inner distance
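The following sketch illustrates the audio similarity computation described above: a plain O(mn) dynamic time warping with a cosine inner distance, normalized by the shortest possible alignment path (interpreted here as max(m, n)) and turned into a similarity. The feature extraction itself (MFCC or chroma frames) is assumed to be done elsewhere, and the standard cosine distance 1 − cos(x, y) is used as inner distance.

```python
# Illustrative sketch of the DTW-based audio similarity S_d. Inputs are frame
# sequences of shape (n_frames, n_dims), e.g. MFCC or chroma frames computed
# elsewhere; the standard cosine distance is used as inner distance d.
import numpy as np

def cosine_distance(x, y):
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-9)

def dtw_cost(su, sv, d=cosine_distance):
    """Accumulated DTW cost T_d(m, n) between two frame sequences."""
    m, n = len(su), len(sv)
    T = np.full((m + 1, n + 1), np.inf)
    T[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            T[i, j] = d(su[i - 1], sv[j - 1]) + min(
                T[i - 1, j], T[i - 1, j - 1], T[i, j - 1])
    return T[m, n]

def audio_similarity(su, sv):
    """S_d = 1 - T_d(m, n) / r, with r interpreted as the shortest possible
    alignment path length, max(m, n)."""
    r = max(len(su), len(sv))
    return 1.0 - dtw_cost(su, sv) / r

# toy example with random "MFCC-like" frames
a, b = np.random.rand(20, 13), np.random.rand(25, 13)
print(audio_similarity(a, b))
```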

Models and configurations. In our first experiment we represent song texts via unimodal lyrics lines and experiment on the MLDB and WASABI datasets. We compare to the state of the art [473] and successfully reproduce their best features to validate their approach. Two groups of features are used in the replication: repeated pattern features (RPF) extracted from SSMs and n-grams extracted from text lines. The RPF basically act as hand-crafted image filters that aim to detect the edges and the insides of diagonals and rectangles in the SSM.

Our own models are neural networks that use as features SSMs and two line-based features: the line length and n-grams. For the line length, we extracted the character count from each line, a simple proxy of the orthographic shape of the song text. Intuitively, segments that belong together tend to have similar shapes. Similarly to [473]'s term features, we extracted those n-grams from each line that are most indicative of segment borders: using the tf-idf weighting scheme, we extracted n-grams that are typically found left or right of the segment border, varied the n-gram lengths and also included indicative part-of-speech tag n-grams. This resulted in 240 term features in total. The most indicative words at the start of a segment were: {ok, lately, okay, yo, excuse, dear, well, hey}. As segment-initial phrases we found: {Been a long, I've been, There's a, Won't you, Na na na, Hey, hey}. Typical words ending a segment were: {..., .., !, ., yeah, ohh, woah, c'mon, wonderland}. And as segment-final phrases we found as most indicative: {yeah!, come on!, love you., !!!, to you., with you., check it out, at all., let's go, ...}.

In this experiment we consider only SSMs made from text-based similarities; we note this in the model name as CNNtext. We further name a CNN model by the set of SSMs that it uses as features. For example, the model CNNtext{str} uses as its only feature the SSM made from string similarity simstr, while the model CNNtext{str, phon, lsyn} uses three SSMs in parallel (as different input channels), one from each similarity.

For the convolutional layers we empirically set wsize = 2 and the number of features extracted after each convolution to 128. Dense layers have 512 hidden units. We also tuned the learning rate (negative powers of 10) and the dropout probability in increments of 0.1. The batch size was set from the beginning to 256 to better saturate our GPU. The CNN models were implemented using Tensorflow.

For comparison, we implement two baselines. The random baseline guesses for each line independently whether it is a segment border (with a probability of 50%) or not. The line length baseline uses as its only feature the line length in characters and is trained using a logistic regression classifier.

For comparison with the state of the art, we use as a first dataset the same one they used, MLDB (see Section 2.4.2). To test the system's portability to bigger and more heterogeneous data sources, we further evaluated our method on the WASABI corpus (see Section 2.4.2). In order to test the influence of genre on classification performance, we aligned MLDB to WASABI, as the latter provides genre information. Song texts that had the exact same title and artist names (ignoring case) in both datasets were aligned. This rather strict filter resulted in 58567 (57%) song texts with genre information in MLDB. Table 2.13 shows the distribution of the genres in MLDB song texts. We then tested our method on each genre separately, to test our hypothesis that classification is harder for genres in which almost no repeated patterns can be detected (such as Rap songs). To the best of our knowledge, previous work did not report genre-specific results.

In this work we did not normalize the lyrics, in order to rigorously compare our results to [473]. We estimate the proportion of lyrics containing tags such as Chorus to be marginal (0.1-0.5%) in the MLDB corpus. When applying our methods for lyrics segmentation to lyrics found online, an appropriate normalization method should be applied as a pre-processing step. For details on such a normalization procedure we refer the reader to [170], Section 2.1.
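For illustration, a minimal Keras sketch consistent with the hyperparameters quoted above (window size 2, 128 feature maps per convolution, 512 dense units, batch size 256) is given below; the input patch shape, the number of convolutional layers and the pooling scheme are assumptions, not the exact architecture used in this work.

```python
# Hedged Keras sketch of a CNN over SSM patches around a candidate segment
# border. Only the quoted hyperparameters (window size 2, 128 feature maps,
# 512 dense units, batch size 256) come from the text; the input patch shape,
# layer count and pooling are assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

PATCH_SIZE = 32   # assumed square SSM patch centred on the candidate border
N_CHANNELS = 3    # one channel per SSM, e.g. sim_str, sim_phon, sim_lsyn

def build_border_classifier(learning_rate=1e-3, dropout=0.1):
    model = models.Sequential([
        layers.Input(shape=(PATCH_SIZE, PATCH_SIZE, N_CHANNELS)),
        layers.Conv2D(128, kernel_size=2, activation="relu"),   # wsize = 2
        layers.MaxPooling2D(),
        layers.Conv2D(128, kernel_size=2, activation="relu"),
        layers.GlobalMaxPooling2D(),
        layers.Dense(512, activation="relu"),
        layers.Dropout(dropout),
        layers.Dense(1, activation="sigmoid"),  # P(line is a segment border)
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate),
                  loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.Precision(),
                           tf.keras.metrics.Recall()])
    return model

model = build_border_classifier()
# model.fit(x_train, y_train, batch_size=256, epochs=...)  # hypothetical data
```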

Evaluation metrics are Precision (P), Recall (R), and f-score (F1). Significance is tested with a permutation test [351], and the p-value is reported.
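A minimal sketch of such a permutation (approximate randomization) test over per-song predictions is shown below; the exact test statistic and permutation scheme used in our experiments may differ.

```python
# Sketch of a paired permutation (approximate randomization) test: per-song
# predictions of two systems are randomly swapped to estimate how often an
# F1 difference at least as large as the observed one arises by chance.
import numpy as np

def micro_f1(counts):
    tp, fp, fn = counts.sum(axis=0)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def permutation_test(counts_a, counts_b, n_perm=10000, seed=0):
    """counts_x: array of shape (n_songs, 3) holding per-song (tp, fp, fn)."""
    rng = np.random.default_rng(seed)
    observed = abs(micro_f1(counts_a) - micro_f1(counts_b))
    hits = 0
    for _ in range(n_perm):
        swap = rng.random(len(counts_a)) < 0.5          # swap songs at random
        a = np.where(swap[:, None], counts_b, counts_a)
        b = np.where(swap[:, None], counts_a, counts_b)
        if abs(micro_f1(a) - micro_f1(b)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)                    # smoothed p-value
```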

Then, in the second experiment, the song texts are represented via bimodal lyrics lines and experimentation is performed on the DALI corpus. In order to test our hypotheses about which text and audio features are most relevant to detect segment boundaries, and whether the text and audio modalities complement each other, we compare different types of models: baselines, text-based models, audio-based models, and finally bimodal models that use both text and audio features. We provide the following baselines: the random baseline guesses for each line independently whether it is a segment border (with a probability of 50%) or not. The line length baselines use as feature only the line length in characters (text-based model), in milliseconds (audio-based model), or both, respectively. These baselines are trained using a logistic regression classifier. All other models are CNNs using the architecture described previously and use as features SSMs made from different textual or audio similarities, as previously described. The CNN-based models that use purely textual features (str) are named CNNtext, while the CNN-based models using purely audio features (mfcc, chr) are named CNNaudio. Lastly, the CNNmult models are multimodal in the sense that they use combinations of textual and audio features.

We name a CNN model by its modality (text, audio, mult) as well as by the set of SSMs that it uses as features. For example, the model CNNmult{str, mfcc} uses as textual feature the SSM made from string similarity simstr and as audio feature the SSM made from mfcc similarity simmfcc. As dataset we use the Q+ part of the DALI dataset. We split the data randomly into training and test sets using the following scheme: considering that the DALI dataset is relatively small, we average over two different 5-fold cross-validations. We prefer this sampling strategy for our small dataset over a more common 10-fold cross-validation as it avoids the test set becoming too small.

2.4.4 Results and discussion

Table 2.12 shows the results of our experiments with unimodal lyrics lines on the MLDB dataset. We start by measuring the performance of our replication of [473]'s approach. This reimplementation exhibits 56.3% F1, similar to the results reported in the original paper (57.7%). The divergence could be attributed to a different choice of hyperparameters and feature extraction code. Much weaker baselines were explored as well. The random baseline resulted in 18.6% F1, while the usage of simple line-based features, such as the line length (character count), improves this to 25.4%.

The best CNN-based model, CNNtext{str, phon, lsyn} + n-grams, outperforms all our baselines, reaching 67.4% F1, 8.2pp better than the results reported in [473]. We perform a permutation test [351] of this model against all other models. In every case, the performance difference is statistically significant (p < .05).

Subsequent feature analysis revealed that the model CNNtext{str} is by far the most effective. The CNNtext{lsyn} model exhibits much lower performance, despite using a much more complex feature. We believe the lexico-syntactical similarity is much noisier as it relies on n-grams and PoS tags, and thus propagates errors from the tokenizers and PoS taggers. The CNNtext{phon} model exhibits a small but measurable performance decrease compared to CNNtext{str}, possibly due to phonetic features capturing similar regularities, while also depending on the quality of preprocessing tools and on the rule-based phonetic algorithm being relevant for our song-based dataset. The CNNtext{str, phon, lsyn} model that combines the different textual SSMs yields a performance comparable to CNNtext{str}. In addition, we test the performance of several line-based features on our dataset. Most notably, the n-grams feature provides a significant performance improvement, producing the best model. Note that adding the line length feature to any CNNtext model does not increase performance. To show the portability of our method to bigger and more heterogeneous datasets, we ran the CNN model on the WASABI dataset, obtaining results that are very close to the ones obtained for the MLDB dataset: 67.4% precision, 67.3% recall, and 67.4% F1 using the CNNtext{str} model. Results differ significantly based on genre. We split the MLDB dataset with genre annotations into training and test sets, trained on all genres, and tested on each genre separately. In Table 2.13 we report the performances of CNNtext{str} on lyrics of different genres. Songs belonging to genres such as Country, Rock or Pop contain recurrent structures with repeating patterns, which are more easily detectable by the CNNtext algorithm. Therefore, they show significantly better performance. On the other hand, the performance on genres such as Hip Hop or Rap is much worse.

The results of our experiments with multimodal lyrics lines on the DALI dataset are depicted in Table 2.14. The random baseline and the different line length baselines reach a performance of 15.5%-33.5% F1.

Model                 Features                       P     R     F1
Random baseline       n/a                            18.6  18.6  18.6
Line length baseline  text line length               16.7  52.8  25.4
Handcrafted filters   RPF (our replication)          48.2  67.8  56.3
                      RPF [473]                      56.1  59.4  57.7
                      RPF + n-grams                  57.4  61.2  59.2
CNNtext               {str}                          70.4  63.0  66.5
                      {phon}                         75.9  55.6  64.2
                      {lsyn}                         74.8  50.0  59.9
                      {str, phon, lsyn}              74.1  60.5  66.6
                      {str, phon, lsyn} + n-grams    72.1  63.3  67.4

Table 2.12: Results with unimodal lyrics lines on MLDB dataset in terms of Precision (P), Recall (R) and F1 in %.

Genre              Lyrics[#]  P     R     F1
Rock               6011       73.8  57.7  64.8
Hip Hop            5493       71.7  43.6  54.2
Pop                4764       73.1  61.5  66.6
RnB                4565       71.8  60.3  65.6
Alternative Rock   4325       76.8  60.9  67.9
Country            3780       74.5  66.4  70.2
Hard Rock          2286       76.2  61.4  67.7
Pop Rock           2150       73.3  59.6  65.8
Indie Rock         1568       80.6  55.5  65.6
Heavy Metal        1364       79.1  52.1  63.0
Southern Hip Hop   940        73.6  34.8  47.0
Punk Rock          939        80.7  63.2  70.9
Alternative Metal  872        77.3  61.3  68.5
Pop Punk           739        77.3  68.7  72.7
Soul               603        70.9  57.0  63.0
Gangsta Rap        435        73.6  35.2  47.7

Table 2.13: Results with unimodal lyrics lines. CNNtext{str} model performances across musical genres in the MLDB dataset in terms of Precision (P), Recall (R) and F1 in %. Underlined are the performances on genres with less repetitive text. Genres with highly repetitive structure are in bold.

Interestingly, the audio-based line length baseline (33.5% F1) is more indicative24 of the lyrics segmentation than the text-based line length (25.0% F1). The model CNNtext{str} performs with 70.8% F1 similarly to the CNNtext{str} model from the first experiment (66.5% F1). The models use the exact same SSMstr feature and hyperparameters, but a different lyrics corpus (DALI instead of MLDB). We believe that as DALI was assembled from karaoke singing instances, it likely contains more repetitive song texts that are easier to segment using the employed method. Note that the DALI dataset is too small to allow a genre-wise comparison as we did in the previous experiment using the MLDB dataset. The CNNaudio models perform similarly well to the CNNtext models. CNNaudio{mfcc} reaches 65.3% F1, while CNNaudio{chr} results in 63.9% F1. The model CNNaudio{mfcc, chr} performs with 70.4% F1 significantly (p < .001) better than the models that use only one of the features. As the mfcc feature models timbre and instrumentation, whilst the chroma feature models melody and harmony, they provide complementary information to the CNNaudio model, which increases its performance. Most importantly, the CNNmult models combining text-based with audio-based features consistently outperform the CNNtext and CNNaudio models. CNNmult{str, mfcc} and CNNmult{str, chr} achieve a performance of 73.8% F1 and 74.5% F1, respectively - this is significantly (p < .001) higher compared to the 70.8% (70.4%) F1 of the best CNNtext (CNNaudio) model. Finally, the overall best performing model is a combination of the best CNNtext and CNNaudio models and delivers 75.3% F1. CNNmult{str, mfcc, chr} is the only model to significantly (p < .05) outperform all other models in all three evaluation metrics: precision, recall, and F1. Note that all CNNmult models outperform all CNNtext and CNNaudio models significantly (p < .001) in recall.

We perform an ablation test on the alignment quality. For this, we train CNN-based models with those feature sets that performed best on the Q+ part of DALI. For each modality (text, audio, mult), i.e. CNNtext{str}, CNNaudio{mfcc, chr}, and CNNmult{str, mfcc, chr}, we train a model on each partition of DALI (Q+, Q0, Q−). We always test our models on the same alignment quality they were trained on. The alignment quality ablation results are depicted in Table 2.15. We find that, independent of the modality (text, audio, mult.), all models perform significantly (p < .001) better with higher alignment quality. The effect of modality on segmentation performance (F1) is as follows: on all datasets we find CNNmult{str, mfcc, chr} to significantly (p < .001) outperform both CNNtext{str} and CNNaudio{mfcc, chr}. Further, CNNtext{str} significantly (p < .001) outperforms CNNaudio{mfcc, chr} on the Q0 and Q− datasets, whereas this does not hold on the Q+ dataset (p ≥ .05).

2.4.5 Error analysis

An SSM for a Rap song is depicted in Figure 2.15. As texts in this genre are less repetitive, the SSM-based features are less reliable for determining a song's structure. Moreover, returning to the introductory example in Figure 2.12, we observe that verses (the Vi) and bridges (the Bi) are not detectable when looking at the text representation only (see Figure 2.12, middle). The reason is that these verses have different lyrics. However, as these parts share the same melody, highlighted rectangles are visible in the melody structure. Indeed, we found our bimodal segmentation model to produce significantly (p < .001) better segmentations (75.3% F1) compared to the purely text-based (70.8% F1) and audio-based models (70.4% F1). The increase in F1 stems from both increased precision and recall. The increase in precision is observed as CNNmult often produces fewer false positive segment borders, i.e. the model delivers less noisy results. We observe an increase in recall in two ways: first, CNNmult sometimes detects a combination of the borders detected by CNNtext and CNNaudio.

24 Note that adding line length features to any CNN-based model does not increase performance.

Model                  Features                     P     R     F1
Random baseline        n/a                          15.7  15.7  15.7
Line length baselines  text length                  16.6  51.8  25.0
                       audio length                 22.7  63.8  33.5
                       text length + audio length   22.6  63.0  33.2
CNNtext                {str}                        78.7  64.2  70.8
CNNaudio               {mfcc}                       79.3  55.9  65.3
                       {chr}                        76.8  54.7  63.9
                       {mfcc, chr}                  79.2  63.8  70.4
CNNmult                {str, mfcc}                  80.6  69.0  73.8
                       {str, chr}                   82.5  69.0  74.5
                       {str, mfcc, chr}             82.7  70.3  75.3

Table 2.14: Results with multimodal lyrics lines on the Q+ dataset in terms of Precision (P), Recall (R) and F1 in %. Note that the CNNtext{str} model is the same configuration as in Table 2.12, but trained on a different dataset.

Dataset  Model      Features           P     R     F1
Q+       CNNtext    {str}              78.7  64.2  70.8
         CNNaudio   {mfcc, chr}        79.2  63.8  70.4
         CNNmult.   {str, mfcc, chr}   82.7  70.3  75.3
Q0       CNNtext    {str}              73.6  54.5  62.8
         CNNaudio   {mfcc, chr}        74.9  48.9  59.5
         CNNmult.   {str, mfcc, chr}   75.8  59.4  66.5
Q−       CNNtext    {str}              67.5  30.9  41.9
         CNNaudio   {mfcc, chr}        66.1  24.7  36.1
         CNNmult.   {str, mfcc, chr}   68.0  35.8  46.7

Table 2.15: Results with multimodal lyrics lines for the alignment quality ablation test on the datasets Q+, Q0, Q− in terms of Precision (P), Recall (R) and F1 in %.

Figure 2.15: Example SSM computed from textual similarity simstr. As common for Rap song texts, there is no chorus (diagonal stripe parallel to main diagonal). However, there is a highly repetitive musical state from line 18 to 21 indicated by the corresponding rectangle in the SSM spanning from (18,18) to (21,21). (“Meet Your Fate” by Southpark Mexican, MLDB-ID: 125521)

Secondly, there are cases where CNNmult detects borders that are not recalled by either CNNtext or CNNaudio. Segmentation algorithms that are based on exploiting patterns in an SSM share a common limitation: non-repeated segments are hard to detect as they do not show up in the SSM. Note that such segments are still occasionally detected indirectly when they are surrounded by repeated segments. Furthermore, a consecutively repeated pattern such as C2-C3-C4 in Figure 2.12 is not easily segmentable as it could potentially also form one (C2C3C4) or two (C2-C3C4 or C2C3-C4) segments. Another problem is that of inconsistent classification inside a song: sometimes, patterns in the SSM that look the same to the human eye are classified differently. Note, however, that on the pixel level there is a difference, as the inference in the used CNN is deterministic. This is a phenomenon similar to adversarial examples in image classification (same intension, but different extension). We now analyze the predictions of our different models for the example song given in Figure 2.12. We compare the predictions of the following three different models: the text-based model CNNtext{str} (visualized in Figure 2.12 as the left SSM called "repetitive lyrics structure"), the audio-based model CNNaudio{chr} (visualized in Figure 2.12 as the right SSM called "repetitive melody structure"), and the bimodal model CNNmult{str, mfcc, chr}. Starting with the first chorus, C1, we find it to be segmented correctly by both CNNtext{str} and CNNaudio{chr}. As previously discussed, consecutively repeated patterns are hard to segment, and our text-based model indeed fails to correctly segment the repeated chorus (C2-C3-C4). The audio-based model CNNaudio{chr} overcomes this limitation and segments the repeated chorus correctly. Finally, we find that in this example both the text-based and the audio-based models fail to segment the verses (the Vi) and bridges (the Bi) correctly. The CNNmult{str, mfcc, chr} model manages to detect the bridges and verses in our example.

Note that adding more features to a model does not always increase its ability to detect segment borders. While in some examples, the CNNmult{str, mfcc, chr} model detects segment borders that were not detected in any of the models CNNtext{str} or CNNaudio{mfcc, chr}, there are also examples where the bimodal model does not detect a border that is detected by both the text-based and the audio-based models.

2.5 Song lyrics summarization inspired by audio thumbnailing

Automatic text summarization is the task of producing a concise and fluent summary while preserving key information content and overall meaning of a text [7]. Numerous approaches have been developed to address this task and have been applied widely in various domains, including news articles [107], scientific papers [309], web content such as blogs [231], customer reviews [369] and social media messages [218]. Just as we may need to summarize a story, we may also need to summarize song lyrics, for instance to produce adequate snippets for a search engine dedicated to an online song collection or for music digital libraries. From a linguistic point of view, however, lyrics are a very peculiar genre of document, and generic summarization methods may not be appropriate when the input for summarization comes from a specific domain or genre such as songs [342]. Compared to news documents, for instance, lyrics have a very different structure. Given the repeating forms, the peculiar structure (e.g., the segmentation into verse, chorus, etc.) and other unique characteristics of song lyrics, we need summarization algorithms to take advantage of these additional elements to more accurately identify relevant information in song lyrics. But just as such characteristics enable the exploration of new approaches, other characteristics make the application of summarization algorithms very challenging, such as the presence of repeated lines, the discourse structure that strongly depends on the interrelation of music and words in the melody composition, the heterogeneity of musical genres each featuring peculiar styles and wording [71], and simply the fact that not all songs tell a story. In this direction, this work focuses on the following research question: what is the impact of the context in summarizing song lyrics? This question is broken down into two sub-questions: 1) How do generic text summarization methods perform over lyrics? and 2) Can such peculiar context be leveraged to identify relevant sentences and improve song text summarization? To answer our research questions, we experiment with generic unsupervised state-of-the-art text summarization methods (i.e. TextRank and a topic distribution based method) to perform lyrics summarization, and show that adding contextual information helps such models to produce better summaries. Specifically, we enhance text summarization approaches with a method inspired by audio thumbnailing techniques, which leverages the repetitive structure of song texts to improve summaries. We show how summaries that take into account the audio nature of the lyrics outperform the generic methods according to both an automatic evaluation over 50k lyrics and judgments of 26 human subjects.

As introduced in the previous section, song texts are arranged in segments and lines. For instance, the song text depicted in Figure 2.16 consists of 8 segments and 38 lines. Given a song text S consisting of n lines of text, S = (x1, ..., xn), we define the task of extractive lyrics summarization as the task of producing a concise summary sum of the song text, consisting of a subset of the original text lines: sum(S) ⊆ S, where usually |sum(S)| << |S|. We define the goal of a summary as preserving key information and the overall meaning of a song text. To address this task, we apply the following methods from the literature: the popular graph-based summarizer TextRank and an adaptation of a topic-based method (TopSum). Moreover, we introduce a method inspired by audio thumbnailing (which we dub Lyrics Thumbnail), which aims at creating a summary from the most representative parts of the original song text. While for TextRank we rely on the off-the-shelf implementation of [33], in the following we describe the other two methods.

Figure 2.16: Song text of "Let's start a band" by Amy MacDonald along with two example summaries.

2.5.1 TopSum

We implement a simple topic-based summarization model that aims to construct a summary whose topic distribution is as similar as possible to that of the original text. Following [249], we train a topic model by factorizing a tf-idf-weighted term-document matrix of a song text corpus using non-negative matrix factorization into a term-topic and a topic-document matrix. Given the learnt term-topic matrix, we compute a topic vector t for each new document (song text). In order to treat t as a (pseudo-) probability distribution over latent topics t_i, we normalize t by applying λt. t / \sum_{t_i \in t} t_i to it. Given the distributions over latent topics for each song text, we then incrementally construct a summary by greedily adding one line from the original text at a time (same mechanism as in the KLSum algorithm in [210]): that line x* of the original text that minimizes the distance between the topic distribution t_S of the original text S and the topic distribution of the incremental summary sum(S):

x^* = \operatorname{argmin}_{x \in (S \setminus sum(S))} \; W(t_S, \; t_{sum(S)+x})

W is the Wasserstein distance [457] and is used to measure the distance between two probability distributions (an alternative to Jensen-Shannon divergence [291]).
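A minimal sketch of the TopSum greedy selection is given below, assuming a scikit-learn NMF topic model and using topic indices as the 1-D support of the Wasserstein distance; both choices are illustrative assumptions rather than the exact implementation.

```python
# Illustrative sketch of TopSum. The NMF topic model and the use of topic
# indices as the 1-D support of the Wasserstein distance are assumptions.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from scipy.stats import wasserstein_distance

# hypothetical fitting step on a corpus of full song texts:
# vectorizer = TfidfVectorizer().fit(corpus)
# nmf = NMF(n_components=30, beta_loss="kullback-leibler", solver="mu",
#           max_iter=500).fit(vectorizer.transform(corpus))

def topic_distribution(text, vectorizer, nmf):
    t = nmf.transform(vectorizer.transform([text]))[0]
    return t / t.sum() if t.sum() > 0 else np.full(len(t), 1.0 / len(t))

def topsum(lines, vectorizer, nmf, k=4):
    topics = np.arange(nmf.n_components_)
    t_song = topic_distribution(" ".join(lines), vectorizer, nmf)
    summary = []
    while len(summary) < k:
        candidates = [l for l in lines if l not in summary]
        if not candidates:
            break
        best, best_dist = None, np.inf
        for line in candidates:
            t_cand = topic_distribution(" ".join(summary + [line]),
                                        vectorizer, nmf)
            dist = wasserstein_distance(topics, topics, t_cand, t_song)
            if dist < best_dist:
                best, best_dist = line, dist
        summary.append(best)
    return summary
```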

2.5.2 Lyrics Thumbnail

Inspired by [243], we transfer their fitness measure for audio segments to compute the fitness of lyrics segments. Analogous to an audio thumbnail, we define a Lyrics Thumbnail as the most representative and repetitive part of the song text. Consequently, it usually consists of (a part of) the chorus. In our corpus the segments are annotated (as double line breaks in the lyrics), so unlike in audio thumbnailing, we do not have to induce segments, but rather measure their fitness. In the following, we describe the fitness measure for lyrics segments and how we use it to produce a summary of the lyrics.

Lyrics Fitness Given a segmented song text S = (S1, ..., Sm) consisting of text segments Si, where each Si consists of |Si| text lines, we cluster the Si into partitions of similar segments. For instance, the lyrics in Figure 2.16 consist of 8 segments and 38 lines, and the chorus cluster is {S5, S6, S7}. The fitness Fit of a segment cluster C ⊆ S is defined through the precision pr of the cluster and the coverage co of the cluster. pr describes how similar the segments in C are to each other, while co is the relative amount of lyrics lines covered by C:

pr(C) = \Big( \sum_{S_i, S_j \in C,\ i<j} 1 \Big)^{-1} \cdot \sum_{S_i, S_j \in C,\ i<j} sim(S_i, S_j)

co(C) = \Big( \sum_{S_i \in S} |S_i| \Big)^{-1} \cdot \sum_{S_i \in C} |S_i|

where sim is a normalized similarity measure between text segments. Fit is the harmonic mean between pr and co. The fitness of a segment S_i is defined as the fitness of the cluster to which S_i belongs:

\forall S_i \in C: \; Fit(S_i) = Fit(C) = 2 \cdot \frac{pr(C) \cdot co(C)}{pr(C) + co(C)}

For lyrics segments without repetition the fitness is defined as zero. Based on the fitness Fit for segments, we define a fitness measure for a text line x. This allows us to compute the fitness of arbitrary summaries (with no or unknown segmentation). If the text line x occurs f_i(x) times in text segment S_i, then its line fitness fit is defined as:

fit(x) = \Big( \sum_{S_i \in S} f_i(x) \Big)^{-1} \cdot \sum_{S_i \in S} f_i(x) \cdot Fit(S_i)
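The fitness computation can be sketched as follows, assuming the segments are given as lists of lines and the grouping of similar segments into clusters has already been computed; the similarity function sim is passed in as a parameter.

```python
# Illustrative sketch of the lyrics fitness. Segments are lists of lines; the
# grouping of similar segments into clusters is assumed to be given (e.g. by
# the clustering step described in the experimental setting).
from itertools import combinations

def precision(cluster, sim):
    """pr(C): average pairwise similarity of the segments in the cluster."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 0.0
    return sum(sim(a, b) for a, b in pairs) / len(pairs)

def coverage(cluster, all_segments):
    """co(C): share of all lyrics lines covered by the cluster."""
    total = sum(len(s) for s in all_segments)
    return sum(len(s) for s in cluster) / total

def cluster_fitness(cluster, all_segments, sim):
    if len(cluster) < 2:            # non-repeated segments have fitness zero
        return 0.0
    pr, co = precision(cluster, sim), coverage(cluster, all_segments)
    return 2 * pr * co / (pr + co) if pr + co else 0.0

def line_fitness(line, segments, segment_fitness):
    """fit(x): occurrence-weighted average of the fitness of the segments
    in which the line occurs."""
    occ = [seg.count(line) for seg in segments]
    total = sum(occ)
    if total == 0:
        return 0.0
    return sum(f * fit for f, fit in zip(occ, segment_fitness)) / total
```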

Fitness-Based Summary Analogous to [243]'s audio thumbnails, we create fitness-based summaries for a song text. A Lyrics Double Thumbnail consists of two segments: one from the fittest segment cluster (usually the chorus), and one from the second fittest segment cluster (usually the bridge).25 If the second fittest cluster has a fitness of 0, we generate a Lyrics Single Thumbnail solely from the fittest cluster (usually the chorus). If the generated thumbnail has a length of k lines and we want to produce a summary of p < k lines, we select the p lines in the middle of the thumbnail, following [103]'s "Section-transition Strategy", which they found more likely to capture the "hook" of the music.26

2.5.3 Experimental setting

Dataset. From the WASABI corpus [317] we select a subset of 190k unique song texts with available genre information. As the corpus has spurious genres (416 different ones), we focus on the 10 most frequent ones in order to evaluate our methods depending on the genre. We add 2 additional genres from the underrepresented Rap field (Southern Hip Hop and Gangsta Rap). The resulting dataset contains 95k song lyrics. To define the length of sum(S), we rely on [35], who recommend creating audio thumbnails of the median chorus length over the whole corpus. We therefore estimate the median chorus length on our corpus by computing a Lyrics Single Thumbnail on each text, and we find the median chorus length to be 4 lines. Hence, we decide to generate summaries of this length for all lyrics and all summarization models, to exclude a length bias in the methods comparison.27

25 We pick the first occurring representative of the segment cluster. Which segment to pick from the cluster is a potential question for future work.
26 They also experiment with other methods to create a thumbnail, such as section initial or section ending.

As the length of the lyrics thumbnail is lower-bounded by the length of the chorus in the song text, we keep only those lyrics with an estimated chorus length of at least 4. The final corpus of 12 genres consists of 50k lyrics with the following genre distribution: Rock: 8.4k, Country: 8.3k, Alternative Rock: 6.6k, Pop: 6.9k, R&B: 5.2k, Indie Rock: 4.4k, Hip Hop: 4.2k, Hard Rock: 2.4k, Punk Rock: 2k, Folk: 1.7k, Southern Hip Hop: 281, Gangsta Rap: 185.

Models and Configurations. We create summaries using the three summarization methods previously described, i.e. a graph-based method (TextRank), a topic-based method (TopSum), and a fitness-based method (Lyrics Thumbnail), plus two additional combined models (described below). While the Lyrics Thumbnail is generated from the full segment structure of the lyrics, including its duplicate lines, all other models are fed with unique text lines as input (i.e. redundant lines are deleted). This is done to produce less redundant summaries, given that, for instance, TextRank scores each duplicate line the same, hence it may create summaries with all identical lines. TopSum can suffer from a similar shortcoming: if there is a duplicate line close to the ideal topic distribution, adding that line again will let the incremental summary under construction stay close to the ideal topic distribution. All models were instructed to produce summaries of 4 lines, as this is the estimated median chorus length in our corpus. The summary lines were arranged in the same order they appear in the original text.28

We use the TextRank implementation29 of [33] without removing stop words (lyrics lines in input can be quite short, therefore we avoid losing all content of a line by removing stop words). The topic model for TopSum is built using non-negative matrix factorization with scikit-learn30 [370] for 30 topics on the full corpus of 190k lyrics.31 For the topical distance, we only consider the distance between the 3 most relevant topics in the original text, following the intuition that one song text usually covers only a small number of topics. The Lyrics Thumbnail is computed using a string-based distance between text segments to facilitate clustering; this similarity has been shown in [473] to successfully indicate segment borders. In our implementation, segments are clustered using the DBSCAN [164] algorithm.32 We also produce two summaries by combining TextRank + TopSum and TextRank + TopSum + Lyrics Thumbnail, to test if summaries can benefit from the complementary perspectives the three different summarization methods take.
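A minimal sketch of the segment clustering step is given below, using DBSCAN on a precomputed distance matrix derived from a normalized string similarity; eps and min_samples follow the values quoted in the footnotes, while the similarity helper itself is an assumption.

```python
# Illustrative sketch of the segment clustering behind the Lyrics Thumbnail:
# DBSCAN over a precomputed distance matrix (1 - string similarity). eps and
# min_samples follow the footnoted values; the similarity helper is assumed.
import numpy as np
import Levenshtein
from sklearn.cluster import DBSCAN

def segment_similarity(seg_a, seg_b):
    a, b = "\n".join(seg_a), "\n".join(seg_b)
    return 1.0 - Levenshtein.distance(a, b) / max(len(a), len(b), 1)

def cluster_segments(segments, eps=0.3, min_samples=2):
    n = len(segments)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = 1.0 - segment_similarity(segments[i], segments[j])
            dist[i, j] = dist[j, i] = d
    labels = DBSCAN(eps=eps, min_samples=min_samples,
                    metric="precomputed").fit_predict(dist)
    return labels   # label -1 marks non-repeated (noise) segments
```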

Model Combination. For any lyrics line, we can obtain a score from each of the applied methods. TextRank provides a score for each line, TopSum provides a distance between the topic distributions of an incremental summary and the original text, and fit provides the fitness of each line. We treat our summarization methods as black boxes and use a simple method to combine the scores the different methods provide for each line. Given the original text separated into lines S = (x1, ..., xn), a summary is constructed by greedily adding one line x* at a time to the incremental summary sum(S) ⊆ S such that the sum of normalized ranks of all scores is minimal:

x^* = \operatorname{argmin}_{x} \sum_{A} R_A(x)

Here x ∈ (S \ sum(S)) and A ∈ {TextRank, TopSum, fit}. The normalized rank RA(x) of the score that method A assigns to line x is computed as follows: first, the highest scores33 are assigned rank 0, the second highest scores get rank 1, and so forth. Then the ranks are linearly scaled to the [0,1] interval, so each sum of ranks \sum_{A} R_A(x) is in [0,3].

27 We leave the study of other measures to estimate the summary length to future work.
28 In case of repeated parts, the first position of each line was used as original position.
29 https://github.com/summanlp/textrank
30 https://scikit-learn.org
31 loss='kullback-leibler'
32 eps=0.3, min samples=2
33 In the case of topical distance, a "higher score" means a lower value.
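The rank-based combination can be sketched as follows; the per-method score dictionaries and the greedy loop around this helper are assumed to be provided by the caller, and this is illustrative glue code rather than the exact implementation.

```python
# Illustrative glue code for the rank-based combination: given per-line scores
# from each method for the current greedy step, pick the line with the lowest
# sum of normalized ranks. higher_is_better marks whether a large score is
# good (TextRank, fit) or bad (topical distance).
def normalized_ranks(scores, higher_is_better=True):
    order = sorted(scores, key=lambda l: scores[l], reverse=higher_is_better)
    denom = max(len(order) - 1, 1)
    return {line: i / denom for i, line in enumerate(order)}  # rank 0 = best

def pick_next_line(candidates, scores_per_method, higher_is_better):
    rank_sums = {line: 0.0 for line in candidates}
    for method, scores in scores_per_method.items():
        ranks = normalized_ranks({l: scores[l] for l in candidates},
                                 higher_is_better[method])
        for line in candidates:
            rank_sums[line] += ranks[line]
    return min(rank_sums, key=rank_sums.get)  # argmin of the summed ranks

# hypothetical usage for one greedy step:
# line = pick_next_line(remaining_lines,
#                       {"textrank": tr_scores, "topsum": topic_dists,
#                        "fit": fitness_scores},
#                       {"textrank": True, "topsum": False, "fit": True})
```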

Model Nomenclature For abbreviation, we call the TextRank model henceforth Mr, the TopSum model Ms, the fitness-based summarizer Mf , model combinations Mrs and Mrsf , respectively.

2.5.4 Evaluation

We evaluate the quality of the produced lyrics summaries both by soliciting human judgments on the goodness and utility of a given summary and through an automatic evaluation of the summarization methods, to provide a comprehensive evaluation.

Human Evaluation. We performed a human evaluation of the different summarization methods introduced before by asking participants to rate the different summaries presented to them, specifying their agreement / disagreement according to the following standard criteria [366]:
Informativeness: The summary contains the main points of the original song text.
Non-redundancy: The summary does not contain duplicate or redundant information.
Coherence: The summary is fluent to read and grammatically correct.
Plus one additional criterion coming from our definition of the lyrics summarization task:
Meaning: The summary preserves the meaning of the original song text.

An experimental psychologist expert in Human Computer Interaction advised us in defining the questionnaire and setting up the experiment. 26 participants - 12 nationalities, 18 men, 8 women, aged from 21 to 59 - took a questionnaire (Google Forms) consisting of rating 30 items with respect to the criteria defined before on a Likert scale from 1 (low) to 5 (high). Each participant was presented with 5 different summaries - each produced by one of the previously described summarization models - for 6 different song texts. Participants were given example ratings for the different criteria in order to familiarize them with the procedure. Then, for each song text, the original song text along with its 5 summaries was presented in random order and had to be rated according to the above criteria. For the criterion of Meaning, we asked participants to give a short explanation in free text for their score. The selected 6 song texts34 have a minimum and a median chorus length of 4 lines and come from different genres, i.e. Pop/Rock (4), Folk (1) and Rap (1), similar to our corpus genre distribution. Song texts were selected to cover different lengths (18-63 lines), singers (3 male, 3 female), topics (family, life, drugs, relationship, depression), and moods (depressive, angry, hopeful, optimistic, energetic). The artist name and song title were not shown to the participants.

Figure 2.17 shows the ratings obtained for each criterion. We examine the significant differences between the models' performances by performing a paired two-tailed t-test. The significance levels are: 0.05∗, 0.01∗∗, 0.001∗∗∗, and n.s. First, Informativeness and Meaning are rated higher∗∗ for the combined model Mrs compared to the single models Mr and Ms. Combining all three models improves the summaries further: both for Informativeness and Meaning the model Mrsf is rated higher∗∗∗ than Mrs. Further, summaries created by Mrsf are rated higher∗∗∗ in Coherence than summaries from any other model - except for Mf (n.s. difference). Summaries are rated on the same level (n.s. differences) for Non-redundancy in all but the Mr and Mf summaries, which are perceived as lower∗∗∗ in Non-redundancy than all others. Note how the model Mrsf is more stable than all others, exhibiting lower standard deviations in all criteria except Non-redundancy. The criteria Informativeness and Meaning are highly correlated (Pearson correlation coefficient 0.84). Correlations between other criteria range between 0.29 and 0.51.

34 “Pills N Potions” by , “Hurt” by Nine Inch Nails, “Real to me” by Brian McFadden, “Somebody That I Used To Know” by Gotye, “Receive” by Alanis Morissette, “Let’s Start A Band” by Amy MacDonald

Figure 2.17: Human ratings per summarization model in terms of average and standard deviation.

Evaluation criterion          Genre       Mr   Ms   Mrs  Mf   Mrsf  original text
Distributional Semantics [%]  Rock / Pop  92   100  97   90   93    n/a
                              Rap         94   100  99   86   92
                              Σ           92   100  98   90   93
Topical [%]                   Rock / Pop  44   100  76   41   64    n/a
                              Rap         58   100  80   48   66
                              Σ           46   100  77   42   64
Coherence [%]                 Rock / Pop  110  95   99   99   100   100
                              Rap         112  115  112  107  107
                              Σ           110  97   101  100  101
Lyrics fitness [%]            Rock / Pop  71   53   63   201  183   100
                              Rap         0    0    0    309  249
                              Σ           62   47   55   214  191

Table 2.16: Automatic evaluation results for the 5 summarization models and 2 genre clusters. Distributional Semantics and Topical are relative to the best model (=100%), Coherence and Fitness to the original text (=100%).

Overall, leveraging the Lyrics Fitness in a song text summary improves summary quality. Especially with respect to the criteria that, we believe, indicate summary quality the most - Informativeness and Meaning - the Mrsf method performs significantly better and is more consistent. Figure 2.16 shows an example song text and example summaries from the experiment. Summary 1 is generated by Mf and consists of the chorus. Summary 2 is made by the method Mrsf and contains relevant parts of the verses and the chorus, and was rated much higher in Informativeness and Meaning. We analyzed the free text written by the participants to comment on the Meaning criterion, but no relevant additional information was provided (the participants mainly summarized their ratings).

Automatic Evaluation. We computed four different indicators of summary quality on the dataset of 50k songs described before. Three of the criteria use the similarity between probability distributions P, Q, which means we compute the Wasserstein distance between P and Q (cf. Subsection 2.5.1) and apply λx. x^{-1} to it.35

The criteria are:
Distributional Semantics: similarity between the word distributions of original and summary, cf. [291]. We give results relative to the similarity of the best performing model (=100%).
Topical: similarity between the topic distributions of original and summary, restricted to the 3 most relevant topics of the original song text. We give results relative to the similarity of the best performing model (=100%).
Coherence: average similarity between word distributions in consecutive sentences of the summary, cf. [409]. We give results relative to the coherence of the original song text (=100%).
Lyrics fitness: average line-based fitness fit of the lines in the summary. We give results relative to the Lyrics fitness of the original song text (=100%).
When evaluating each of the 12 genres, we found two clusters of genres to behave very similarly. Therefore, we report the results for these two groups: the Rap genre cluster contains Hip Hop, Southern Hip Hop, and Gangsta Rap; the Rock / Pop cluster contains the 9 other genres. Results of the different automatic evaluation metrics are shown in Table 2.16. Distributional Semantics metrics have previously been shown [291, 409] to correlate highly with user responsiveness judgments. We would therefore expect correlations of this metric with the Informativeness or Meaning criteria, as those criteria are closest to responsiveness, but we have found no large differences between the different models for this criterion. The summaries of the Ms model have the highest similarity to the original text and the Mf summaries have the lowest similarity, at 90%. The differences between the highest and lowest values are small. For the Topical similarity, the results are mostly in the same order as the Distributional Semantics ones, but with much larger differences. While the Ms model reaches the highest similarity, this is a self-fulfilling prophecy, as summaries of Ms were generated with the objective of maximizing topical similarity. The other two models that incorporate Ms (Mrs and Mrsf) show a much higher topical similarity to the original text than Mr and Mf.

Coherence is rated best for Mr with 110%. All other models show a coherence close to that of the original text - between 97% and 101%. We believe that the increased coherence of Mr is not linguistically founded, but merely algorithmic. Mr produces summaries of the most central sentences in a text. The centrality relies on the concept of sentence similarity. Therefore, Mr implicitly optimizes for the automatic evaluation metric of coherence, which is based on similar consecutive sentences. Sentence similarity seems to be insufficient to predict human judgments of coherence in this case. As might be expected, methods explicitly incorporating the Lyrics fitness produce summaries with a fitness much higher than the original text - 214% for the Mf and 191% for the Mrsf model. The methods not incorporating fitness produce summaries with much lower fitness than the original - Mr 62%, Ms 47%, and Mrs 55%. In the Rap genre this fitness is even zero, i.e. summaries (in median) contain no part of the chorus. Overall, no single automatic evaluation criterion was able to explain the judgments of our human participants. However, considering Topical similarity and fitness together gives us a hint. The model Mf has high fitness (214%), but low Topical similarity (42%). The Ms model has the highest Topical similarity (100%), but low fitness (47%). Mrsf might be preferred by humans as it strikes a balance between Topical similarity (64%) and fitness (191%). Hence, Mrsf succeeds in capturing lines from the most relevant parts of the lyrics, such as the chorus, while jointly representing the important topics of the song text.

35 This works as we always deal with distances > 0.

2.6 Enriching the WASABI Song Corpus with lyrics annotations

Let us imagine the following scenario: following David Bowie's death, a journalist plans to prepare a radio show about the artist's musical career to acknowledge his qualities. To discuss the topic from different angles, she needs to have at her disposal the artist's biographical information to know the history of his career, the song lyrics to know what he was singing about, his musical style, the emotions his songs were conveying, live recordings and interviews. Similarly, streaming professionals such as Deezer, Spotify, Pandora or Apple Music aim at enriching music listening with artists' information, to offer suggestions for listening to other songs/albums from the same or similar artists, or to automatically determine the emotion felt when listening to a track in order to propose coherent playlists to the user. To support such scenarios, the need for rich and accurate musical knowledge bases and for tools to explore and exploit this knowledge becomes evident. For this reason, we integrate the results of all the methods for lyrics processing we developed into the WASABI Song Corpus, a large corpus of songs (2.10M songs, 1.73M with lyrics) enriched with metadata extracted from music databases on the Web, and resulting from the processing of song lyrics and from audio analysis. The corpus contains songs in 36 different languages, even though the vast majority are in English. As for the song genres, the most common ones are Rock, Pop, Country and Hip Hop. More specifically, while an overview of the goals of the WASABI project supporting the dataset creation and the description of a preliminary version of the dataset can be found in [317], in this section we focus on the description of the methods we proposed to annotate relevant information in the song lyrics. Given that lyrics encode an important part of the semantics of a song, we propose to label the WASABI dataset lyrics with their structure segmentation, the explicitness of the lyrics content, the salient passages of a song, the addressed topics and the conveyed emotions. An analysis of the correlations among the above mentioned annotation layers reveals interesting insights about the song corpus. For instance, we demonstrate the diachronic change in corpus annotations: we show that certain topics become more important over time while others diminish. We also analyze such changes in explicit lyrics content and expressed emotion.

2.6.1 The WASABI Song Corpus

In the context of the WASABI research project36 that started in 2017, a two million song database has been built, with metadata on 77k artists, 208k albums, and 2.10M songs [317]. The metadata has been i) aggregated, merged and curated from different data sources on the Web, and ii) enriched by pre-computed or on-demand analyses of the lyrics and audio data. We have performed various levels of analysis, and interactive Web Audio applications have been built on top of the output. For example, the TimeSide analysis and annotation framework has been linked [180] to make on-demand audio analysis possible. In connection with the FAST project37, an offline chord analysis of 442k songs has been performed, and both an online enhanced audio player [367] and a chord search engine [368] have been built around it. A rich set of Web Audio applications and plugins has been proposed [76, 77, 78], that allow, for example, songs to be played along with sounds similar to those used by the artists. All these metadata, computational analyses and Web Audio applications have now been gathered in one easy-to-use web interface, the WASABI Interactive Navigator38, illustrated39 in Figure 2.18.

36 http://wasabihome.i3s.unice.fr/
37 http://www.semanticaudio.ac.uk

Figure 2.18: The WASABI Interactive Navigator.

We have started building the WASABI Song Corpus by collecting for each artist the complete discography, band members with their instruments, time line, equipment they use, and so on. For each song we collected its lyrics from LyricWiki40, the synchronized lyrics when available41, the DBpedia abstracts and the categories the song belongs to (e.g., genre, label, writer, release date, awards, producers, artist and band members), the stereo audio track from Deezer, the unmixed audio tracks of the song, its ISRC, bpm and duration. We matched the song ids from the WASABI Song Corpus with the ids from MusicBrainz, iTunes, Discogs, Spotify, Amazon, AllMusic, GoHear, YouTube. Figure 2.19 illustrates42 all the data sources we have used to create the WASABI Song Corpus. We have also aligned the WASABI Song Corpus with the publicly available LastFM dataset43, resulting in 327k tracks in our corpus having a LastFM id. As of today, the corpus contains 1.73M songs with lyrics (1.41M unique lyrics). 73k songs have at least an abstract on DBpedia, and 11k have been identified as "classic songs" (they have been number one, got a Grammy award, or have lots of cover versions). About 2k songs have a multi-track audio version, and on-demand source separation using open-unmix [430] or Spleeter [220] is provided as a TimeSide plugin. Several Natural Language Processing methods have been applied to the lyrics of the songs included in the WASABI Song Corpus, and various analyses of the extracted information have been carried out. After providing some statistics on the WASABI corpus, the rest of this section describes the different annotations we added to the lyrics of the songs in the dataset.

38 http://wasabi.i3s.unice.fr/
39 Illustration taken from [79].
40 http://lyrics.wikia.com/
41 from http://usdb.animux.de/
42 Illustration taken from [80].
43 http://millionsongdataset.com/lastfm/

Figure 2.19: The datasources connected to the WASABI.

Based on the research we have conducted, the following lyrics annotations are added: lyrical structure (Section 2.6.2), summarization (Section 2.6.3), explicit lyrics (Section 2.6.4), emotion in lyrics (Section 2.6.5) and topics in lyrics (Section 2.6.6).

Statistics on the WASABI Song Corpus. This section summarizes key statistics on the corpus, such as the language and genre distributions, the songs coverage in terms of publication years, and then gives the technical details on its accessibility.

Language Distribution Figure 2.20a shows the distribution of the ten most frequent languages in our corpus.44 In total, the corpus contains songs in 36 different languages. The vast majority (76.1%) is English, followed by Spanish (6.3%) and by four languages in the 2-3% range (German, French, Italian, Portuguese). At the bottom end, Swahili and Latin amount to 0.1% (around 2k songs) each.

Genre distribution In Figure 2.20b we depict the distribution of the ten most frequent genres in the corpus.45 In total, 1.06M of the titles are tagged with a genre. It should be noted that the genres are very sparse with a total of 528 different ones. This high number is partially due to many subgenres such as Alternative Rock, Indie Rock, Pop Rock, etc. which we omitted in Figure 2.20b for clarity. The most common genres are Rock (9.7%), Pop (8.6%), Country (5.2%), Hip Hop (4.5%) and Folk (2.7%).

Publication year Figure 2.20c shows the number of songs published in our corpus, by decade.46 We find that over 50% of all songs in the WASABI Song Corpus are from the 2000s or later and only around 10% are from the seventies or earlier.

44 Based on language detection performed on the lyrics.
45 We take the genre of the album as ground truth since song-wise genres are much rarer.
46 We take the album publication date as a proxy since song-wise labels are too sparse.

Figure 2.20: Statistics on the WASABI Song Corpus: (a) language distribution (100% = 1.73M); (b) genre distribution (100% = 1.06M); (c) decade of publication distribution (100% = 1.70M).

Accessibility of the WASABI Song Corpus The WASABI Interactive Navigator relies on multiple database engines: it runs on a MongoDB server together with an indexation by Elasticsearch, and also on a Virtuoso triple store as an RDF graph database. It comes with a REST API47 and an upcoming SPARQL endpoint. All the database metadata is publicly available48 under a CC licence through the WASABI Interactive Navigator as well as programmatically through the WASABI REST API. We provide the files of the current version of the WASABI Song Corpus, the models we have built on it as well as updates here: https://github.com/micbuffa/WasabiDataset. Table 2.17 summarizes the most relevant annotations in our corpus.

2.6.2 Lyrics structure annotations

In Section 2.4 we presented a method to segment lyrics based on their repetitive structure in the form of a self-similarity matrix (SSM) [175]. In the WASABI Interactive Navigator, the line-based SSM of a song text can be visualized. It is toggled by clicking on the violet-blue square on top of the song text. For a subset of songs the color opacity indicates how repetitive and representative a segment is, based on the fitness metric that we proposed in Section 2.5 [172]. Note how in Figure 2.21 the segments 2, 4 and 7 are shaded more darkly than the surrounding ones. As highly fit (opaque) segments often coincide with a chorus, this is a first approximation of chorus detection. Given the variability in the set of structure types provided in the literature according to different genres [433, 71], few attempts have been made to achieve a more complete semantic labelling, i.e., labelling the lyrics segments as Intro, Verse, Bridge, Chorus etc.

47 https://wasabi.i3s.unice.fr/apidoc/
48 There is no public access to copyrighted data such as lyrics and full length audio files. Instructions on how to obtain lyrics are nevertheless provided, and audio extracts of 30s length are available for nearly all songs.

Annotation       Labels  Description
Lyrics           1.73M   segments of lines of text
Languages        1.73M   36 different ones
Genre            1.06M   528 different ones
Last FM id       326k    UID
Structure        1.73M   SSM ∈ R^{n×n} (n: length)
Social tags      276k    S = {rock, joyful, 90s, ...}
Emotion tags     87k     E ⊂ S = {joyful, tragic, ...}
Explicitness     715k    True (52k), False (663k)
Explicitness^p   455k    True (85k), False (370k)
Summary^p        50k     four lines of song text
Emotion          16k     (valence, arousal) ∈ R^2
Emotion^p        1.73M   (valence, arousal) ∈ R^2
Topics^p         1.05M   Prob. distrib. ∈ R^60
Total tracks     2.10M   diverse metadata

Table 2.17: Most relevant song-wise annotations in the WASABI Song Corpus. Annotations marked with (p) are predictions of our models.

For each song text we provide an SSM based on a normalized character-based edit distance49 on two levels of granularity, to enable other researchers to work with these structural representations: line-wise similarity and segment-wise similarity.
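To make this construction concrete, the sketch below computes a line-wise SSM from a normalized character-based edit distance. It is a minimal illustration of the idea, not the exact implementation used for the WASABI corpus; function and variable names are ours.

import numpy as np

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (single-row dynamic programming)."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

def line_ssm(lines):
    """Self-similarity matrix: similarity = 1 - normalized edit distance."""
    n = len(lines)
    ssm = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            d = edit_distance(lines[i], lines[j])
            sim = 1.0 - d / max(len(lines[i]), len(lines[j]), 1)
            ssm[i, j] = ssm[j, i] = sim
    return ssm

lyrics = ["hit me baby one more time", "my loneliness is killing me",
          "hit me baby one more time"]
print(line_ssm(lyrics))  # repeated lines get similarity 1.0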

2.6.3 Lyrics summary Given the repeating forms, peculiar structure and other unique characteristics of song lyrics, in Section 2.5 we introduced a method for extractive summarization of lyrics that takes advantage of these additional elements to more accurately identify relevant information in song lyrics. Figure 2.22 shows another example summary, four lines long, obtained with our proposed combined method. It is toggled in the WASABI Interactive Navigator by clicking on the green square on top of the song text. The four-line summaries of the 50k English lyrics used in our experiments are freely available within the WASABI Song Corpus; the Python code of the applied summarization methods is also available50.

2.6.4 Explicit language in lyrics On audio recordings, the Parental Advisory Label is placed in recognition of profanity, to warn parents of material potentially unsuitable for children. Nowadays, such labelling is carried out mainly manually on a voluntary basis, with the drawbacks of being time-consuming and therefore costly, error-prone and partly subjective. In [171] we tackled the task of automated explicit lyrics detection, based on the songs carrying such a label. We compared automated methods ranging from dictionary-based lookup to state-of-the-art deep neural networks to automatically detect explicit contents in English lyrics. More specifically, the dictionary-based methods rely on a swear-word dictionary Dn which is automatically created from example explicit and clean lyrics. Then, we use Dn to predict the class of an unseen song text in one of two ways.

49 In our segmentation experiments we found this simple metric to outperform more complex metrics that take into account the phonetics or the syntax.
50 https://github.com/TuringTrain/lyrics_thumbnailing

Figure 2.21: Structure of the lyrics of “Everytime” by Britney Spears as displayed in the WASABI Interactive Navigator.

Figure 2.22: Summary of the lyrics of “Everytime” by Britney Spears as displayed in the WASABI Interactive Navigator.

Model                   P     R     F1
Majority Class          45.0  50.0  47.4
Dictionary Lookup       78.3  76.4  77.3
Dictionary Regression   76.2  81.5  78.5
Tf-idf BOW Regression   75.6  81.2  78.0
TDS Deconvolution       81.2  78.2  79.6
BERT Language Model     84.4  73.7  77.7

Table 2.18: Performance comparison of our different models. Precision (P), Recall (R) and f-score (F1) in %.

(i) The Dictionary Lookup simply checks whether a song text contains words from Dn. (ii) The Dictionary Regression uses a BOW made from Dn as the feature set of a logistic regression classifier. In the Tf-idf BOW Regression, the BOW is expanded to the whole vocabulary of a training sample instead of only the explicit terms. Furthermore, the model TDS Deconvolution is a deconvolutional neural network [453] that estimates the importance of each word of the input for the classifier decision. In our experiments, we worked with 179k lyrics that carry gold labels provided by Deezer (17k tagged as explicit) and obtained the results shown in Table 2.18. We found the very simple Dictionary Lookup method to perform on par with much more complex models such as the BERT Language Model [143] used as a text classifier. Our analysis revealed that some genres are highly overrepresented among the explicit lyrics; inspecting the automatically induced dictionary of explicit words reflects this genre bias. The dictionary of 32 terms used for the Dictionary Lookup method consists of around 50% of terms specific to the Rap genre, such as glock, gat, clip (gun-related), thug, beef, gangsta, pimp, blunt (crime and drugs). Finally, the terms holla, homie, and rapper are obviously not swear words, but they are highly correlated with explicit lyrics. Our corpus contains 52k tracks labelled as explicit and 663k clean (not explicit) tracks51. We trained a classifier (77.3% f-score on the test set) on the 438k English lyrics which are labelled, and classified the remaining 455k previously untagged English tracks. We provide both the predicted labels in the WASABI Song Corpus and the trained classifier to apply it to unseen text.
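As an illustration of the two dictionary-based variants, the hedged sketch below assumes a small induced dictionary Dn and a handful of labelled lyrics; it mirrors the Lookup and Regression classifiers with scikit-learn without claiming to reproduce the exact configuration of [171].

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical induced dictionary Dn and toy training data (for illustration only).
Dn = ["glock", "gat", "thug", "pimp", "blunt"]
train_lyrics = ["thug life with a glock", "walking in the sunshine",
                "pimp and a blunt", "love is in the air"]
train_labels = [1, 0, 1, 0]  # 1 = explicit, 0 = clean

def dictionary_lookup(song_text: str) -> int:
    """(i) Lookup: explicit iff the text contains any dictionary term."""
    tokens = set(song_text.lower().split())
    return int(any(term in tokens for term in Dn))

# (ii) Regression: a bag-of-words restricted to Dn feeds a logistic regression.
vectorizer = CountVectorizer(vocabulary=Dn)
X = vectorizer.transform(train_lyrics)
clf = LogisticRegression().fit(X, train_labels)

test = "he pulled out a gat"
print(dictionary_lookup(test))                       # 1
print(clf.predict(vectorizer.transform([test]))[0])  # prediction of the regression variant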

2.6.5 Emotional description In sentiment analysis the task is to predict if a text has a positive or a negative emotional valence. In recent years, a transition from detecting sentiment (positive vs. negative valence) to more complex formulations of emotion detection (e.g., joy, fear, surprise) [332] has become more visible, even tackling the problem of emotion in context [105]. One family of emotion detection approaches is based on the valence-arousal model of emotion [399], locating every emotion in a two-dimensional plane based on its valence (positive vs. negative) and arousal (aroused vs. calm).52 Figure 2.23 illustrates the valence-arousal model of Russell and shows, as examples, where several emotions such as joyful, angry or calm are located in the plane. Manually labelling texts with multi-dimensional emotion descriptions is an inherently hard task. Therefore, researchers have resorted to distant supervision, obtaining gold labels from social tags from last.fm. These approaches [233, 97] define a list of social tags that are related to emotion, then project them into the valence-arousal space using an emotion lexicon [470, 331].
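To illustrate how such distant supervision works, the sketch below projects a song's last.fm-style social tags into the valence-arousal plane by averaging lexicon entries. The mini-lexicon and tag list are invented for the example and do not come from the cited resources.

# Hypothetical emotion lexicon: tag -> (valence, arousal), both in [-1, 1].
emotion_lexicon = {
    "happy":      (0.9, 0.4),
    "sad":        (-0.8, -0.4),
    "angry":      (-0.6, 0.8),
    "calm":       (0.4, -0.7),
    "melancholy": (-0.5, -0.3),
}

def project_tags(social_tags):
    """Average the (valence, arousal) of the tags found in the lexicon."""
    hits = [emotion_lexicon[t] for t in social_tags if t in emotion_lexicon]
    if not hits:
        return None  # no emotion-related tag: no distant label for this song
    valence = sum(v for v, _ in hits) / len(hits)
    arousal = sum(a for _, a in hits) / len(hits)
    return valence, arousal

print(project_tags(["sad", "calm", "90s"]))  # approx (-0.2, -0.55)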

51 Labels provided by Deezer. Furthermore, 625k songs have a different status such as unknown or censored version.
52 Sometimes, a third dimension of dominance is part of the model.

Figure 2.23: Emotion distribution in the corpus in the valence-arousal plane.

Recently, Deezer made valence-arousal annotations for 18,000 English tracks available53, derived with the aforementioned method [139]. We aligned Deezer's valence-arousal annotations to our songs. In Figure 2.23 the green dots visualize the emotion distribution of these songs.54 Based on their annotations, we train an emotion regression model using BERT, which obtains Pearson/Spearman correlations of 0.44/0.43 for valence and 0.33/0.31 for arousal on the test set. We integrated Deezer's labels into our corpus and also provide the valence-arousal predictions for the 1.73M tracks with lyrics. We also provide the last.fm social tags (276k) and emotion tags (87k entries) to help researchers build variants of emotion recognition models.

2.6.6 Topic Modelling We built a topic model on the lyrics of our corpus using Latent Dirichlet Allocation (LDA) [56]. We determined the hyperparameters α, η and the topic count such that the coherence was maximized on a subset of 200k lyrics. We then trained a topic model of 60 topics on the unique English lyrics (1.05M). We manually labelled a number of the more recognizable topics. Figures 2.24-2.29 illustrate these topics with word clouds55 of the most characteristic words per topic. For instance, the topic Money contains words from both the field of earning money (job, work, boss, sweat) and of spending it (pay, buy). The topic Family is both about the people of the family (mother, daughter, wife) and the land (sea, valley, tree). We provide the topic distribution of our LDA topic model for each song and make the trained topic model available to enable its application to unseen lyrics.
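The sketch below shows how such a model can be fitted with gensim and how coherence can guide the choice of the topic count. The toy corpus, the candidate topic counts and the hyperparameter settings are placeholders, not the values used for the WASABI corpus.

from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel

# Toy tokenized lyrics; in practice this would be a (sub)corpus of real lyrics.
texts = [["work", "boss", "pay", "money"], ["mother", "daughter", "home", "sea"],
         ["money", "buy", "job", "sweat"], ["wife", "family", "valley", "tree"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

best = None
for num_topics in (2, 3, 4):  # the thesis searches up to 60 topics
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics,
                   alpha="auto", eta="auto", passes=10, random_state=0)
    coherence = CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                               coherence="c_v").get_coherence()
    if best is None or coherence > best[0]:
        best = (coherence, lda)

coherence, lda = best
print(lda.get_document_topics(corpus[0]))  # per-song topic distribution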

2.6.7 Diachronic corpus analysis We examine the changes in the annotations over the course of time by grouping the corpus into decades of songs according to the distribution shown in Figure 2.20c.

53 https://github.com/deezer/deezer_mood_detection_dataset
54 Depiction without scatterplot taken from [362].
55 Made with https://www.wortwolken.com/

Figures 2.24-2.29: Word clouds for the topics War, Death, Love, Family, Money and Religion.

Changes in Topics The importance of certain topics has changed over the decades, as depicted in Figure 2.30a. Some topics have become more important, others have declined, or stayed relatively the same. We define the importance of a topic for a decade of songs as follows: first, the LDA topic model trained on the full corpus gives the probability of the topic for each song separately; we then average these song-wise probabilities over all songs of the decade. For each of the cases of growing, diminishing and constant importance, we display two topics. The topics War and Death have grown in importance over time. This is partially caused by the rise of Heavy Metal in the beginning of the 1970s, as the vocabulary of the Death topic is very typical for that genre (see for instance the “Metal top 100 words” in [176]). We measure a decline in the importance of the topics Love and Family. The topics Money and Religion seem to be evergreens, as their importance stayed rather constant over time.
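A minimal sketch of this aggregation, assuming each song already carries a decade label and an LDA topic distribution (names and numbers are illustrative):

from collections import defaultdict

# songs: list of dicts with a decade and a topic probability vector (from the LDA model).
songs = [
    {"decade": 1970, "topics": [0.1, 0.7, 0.2]},
    {"decade": 1970, "topics": [0.3, 0.5, 0.2]},
    {"decade": 2000, "topics": [0.6, 0.1, 0.3]},
]

def topic_importance_by_decade(songs):
    """Importance of topic k in a decade = mean of the songs' probabilities for k."""
    sums, counts = defaultdict(lambda: None), defaultdict(int)
    for s in songs:
        d = s["decade"]
        if sums[d] is None:
            sums[d] = [0.0] * len(s["topics"])
        sums[d] = [a + b for a, b in zip(sums[d], s["topics"])]
        counts[d] += 1
    return {d: [v / counts[d] for v in sums[d]] for d in sums}

print(topic_importance_by_decade(songs))
# {1970: [0.2, 0.6, 0.2], 2000: [0.6, 0.1, 0.3]} (up to float rounding)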

Changes in Explicitness We find that newer songs are more likely to be tagged as having explicit content. Figure 2.30b shows our estimate of explicitness per decade, i.e., the ratio of songs in the decade tagged as explicit to all songs of the decade. Note that the Parental Advisory Label was first distributed in 1985 and many older songs may not have been labelled retroactively. The depicted evolution of explicitness may therefore overestimate the “true explicitness” of newer music and underestimate it for music before 1985.

Changes in Emotion We estimate the emotion of songs in a decade as the average valence and arousal of songs of that decade. We find songs to decrease both in valence and arousal over time. This decrease in positivity (valence) is in line with the decline of positively connoted topics such as Love and Family and the rise of topics with a more negative connotation such as War and Death.

2.6.8 Related Work on lyrics processing Detection of lyrics structure. Besides the work of [473] that we have discussed in detail in Section 2.4, only a few papers in the literature have focused on the automated detection of the structure of lyrics. [296] report experiments on the use of standard NLP tools for the analysis of music lyrics. Among the tasks they address, for structure extraction they focus on lyrics having a clearly recognizable structure (which is not always the case) divided into segments. Such segments are weighted according to the results given by the descriptors used (such as full-length text, relative position of a segment in the song, segment similarity), and then tagged with a label describing them (e.g., chorus, verses). They test the segmentation algorithm on a small dataset of 30 lyrics, 6 for each language (English, French, German, Spanish and Italian), which had previously been manually segmented. More recently, [27] describe a semantics-driven approach to the automatic segmentation of song lyrics, mainly focusing on pop/rock music. Their goal is not to label a set of lines in a given way (e.g., verse, chorus), but rather to identify recurrent as well as non-recurrent groups of lines. They propose a rule-based method to estimate such structure labels of segmented lyrics, while in our approach we apply machine learning methods to unsegmented lyrics. [106] propose a new method for enhancing the accuracy of audio segmentation: they derive the semantic structure of songs by lyrics processing to improve the structure labelling of the estimated audio segments. With the goal of identifying repeated musical parts in music audio signals to estimate music structure boundaries (lyrics are not considered), [118] propose to feed Convolutional Neural Networks with the square sub-matrices centered on the main diagonals of several SSMs, each one representing a different audio descriptor, building their work on [183]. For a different task than ours, [318] use a corpus of 100 lyrics synchronized to an audio representation with information on musical key and note progression to detect emotion. Their classification results using both modalities, textual and audio features, are significantly improved compared to a single modality. As an alternative to our CNN-based approach, Recurrent Neural Networks can also be applied to lyrics segmentation, e.g., in the form of a sequence labeller [293] or a generic text segmentation model [275].

Figure 2.30: Evolution of different annotations during the decades: (a) topic importance; (b) explicit content lyrics; (c) emotion.

Text summarization. In the literature, there are two families of approaches to automatic text summarization: extraction and abstraction [7]. Extractive summarization methods identify the important elements of the text and reproduce them verbatim (they depend only on the extraction of sentences or words from the original text). In contrast, abstractive summarization methods interpret and examine the text to generate a new, shorter text that conveys the most critical information from the original text. Even though summaries created by humans are usually not extractive, most of the summarization research has focused on extractive methods. Purely extractive summaries often give better results [339], because abstractive methods have to cope with harder problems such as semantic representation, inference and natural language generation. Existing abstractive summarizers often rely on an extractive pre-processing component to produce the abstract of the text [49, 250]. Consequently, in this work we focus on extractive summarization methods, also given that lyrics i) make strong use of figurative language, which makes abstractive summarization even more challenging; and ii) the choice of words by the composer may also be important for capturing the style of the song.

In the following, we focus on unsupervised methods for text summarization, the ones targeted in our study (no gold standard of human-produced summaries of song texts exists). Most methods have in common the process for summary generation: given a text, the importance of each sentence of that text is determined; then, the sentences with the highest importance are selected to form a summary. The ways different summarizers determine the importance of each sentence may differ. Statistics-based summarizers extract indicator features from each sentence; e.g., [169] use, among others, the sentence position and length and named entities as features. Topic-based summarizers aim to represent each sentence by its underlying topics: for instance, [221] apply Probabilistic Latent Semantic Analysis, while Latent Dirichlet Allocation is used in [16] to model each sentence's distribution over latent topics. Another type of summarization method is graph-based summarizers; three of the most popular are TextRank [319], LexRank [163], and [365]. These methods work by constructing a graph whose nodes are sentences and whose edge weights are sentence similarities; the sentences that are central to the graph are then found by computing the PageRank [358]. Contrary to all previously described methods, systems using supervised machine learning form another type of summarizer. For instance, [168] treat extractive summarization as a binary classification task, where they extract indicator features from sentences of gold summaries and learn to detect the sentences that should be included in a summary.

If specific knowledge about the application scenario or the domain of the summarized text is available, generic summarization methods can be adapted to take the prior information into account. In query-based summarization [355, 467], the user's query is taken into account when generating a summary. Summarization of a scientific paper can be improved by considering its citations, as in [140]. However, to the best of our knowledge no summarization methods have been proposed for the domain of song texts.
In this work we present a summarization method that uses prior knowledge about the text it summarizes to help generic summarizers generate better summaries. Summaries should i) contain the most important information from the input documents, ii) not contain redundant information, and iii) be readable, hence grammatical and coherent [366]. While a multitude of methods to identify important sentences has been described above, several approaches aim to make summaries less redundant and more coherent. The simplest way to evaluate summaries is to let humans assess their quality, but this is extremely expensive. The factors that humans must consider when giving scores to each candidate summary are grammaticality, non-redundancy, integration of the most important pieces of information, structure and coherence [400]. The more common way is to let humans generate possibly multiple summaries for a text and then automatically assess how close a machine-made summary is to the human gold summaries by computing ROUGE scores [276], which boils down to measuring n-gram overlaps between the gold summaries and the automatic summary. More recently there have been attempts to rate summaries automatically without the need for gold summaries [342]. The key idea is that a summary should be similar to the original text with regard to characteristic criteria such as the word distribution. [294] find that topic words are a suitable metric to automatically evaluate microblog summaries.

Audio summarization. Lyrics are texts that accompany music. Therefore, it is worthwhile to see if methods from audio summarization can be transferred to lyrics summarization. In audio summarization the goal is to find the most representative parts of a song; in Pop songs those are usually the chorus and the bridge, in instrumental music the main theme. The task of creating short audio summaries is also known as audio thumbnailing [35, 103, 273], as the goal is to produce a short representation of the music that fits onto a thumbnail but still covers its most representative parts. In a recent approach to audio thumbnailing [243], the authors generate a Double Thumbnail from a musical piece by finding the two most representative parts in it. For this, they search for candidate musical segments in an a priori unsegmented song, where candidate musical segments are defined as sequences of music that more or less exactly repeat themselves. The representativeness of each candidate segment with respect to the whole piece is then estimated by their fitness metric. They define the fitness of a segment as a trade-off between how exactly a part is repeated and how much of the whole piece is covered by all repetitions of that segment. The audio segments along with their fitness then allow them to create an audio double thumbnail consisting of the two fittest audio segments.

Songs and lyrics databases. The Million Song Dataset (MSD) project56 [51] is a collection of audio features and metadata for a million contemporary popular music tracks. This dataset shares some similarities with WASABI with respect to the metadata extracted from Web resources (artist names, tags, years) and audio features, even if at a smaller scale. Given that it mainly focuses on audio data, a complementary dataset providing lyrics for the Million Song Dataset was released, called the musiXmatch dataset57. It consists of a collection of song lyrics in bag-of-words format (plus stemmed words), associated with MSD tracks. However, unlike in our work, no further processing of the lyrics is performed. MusicWeb and its successor MusicLynx [8] link music artists within a Web-based application for discovering connections between them, and provide a browsing experience based on extra-musical relations. The project shares some ideas with WASABI, but works at the artist level and does not analyze the audio and lyrics content itself; it reuses, for example, MIR metadata from AcousticBrainz. The WASABI project has been built with a broader scope than these projects and mixes a wider set of metadata, including ones derived from the audio and from natural language processing of the lyrics. In addition, as presented in this work, it comes with a large set of Web Audio enhanced applications (multitrack player, online virtual instruments and effects, on-demand audio processing, audio player based on extracted, synchronized chords, etc.). Companies such as Spotify, GraceNote, Pandora, or Apple Music have sophisticated private knowledge bases of songs and lyrics to feed their search and recommendation algorithms, but such data are not available (and mainly rely on audio features).

Explicit content detection. [50] consider a dataset of English lyrics to which they apply classical machine learning algorithms. The explicit labels are obtained from Soundtrack Your Brand58. They also experiment with adding lyrics metadata to the feature set, such as the artist name, the release year, the music energy level, and the valence/positiveness of a song. [110] apply explicit lyrics detection to Korean song texts. They also use a tf-idf weighted BOW as lyrics representation and aggregate multiple decision trees via boosting and bagging to classify the lyrics for explicit content. More recently, [248] proposed a neural network method to create explicit-words dictionaries automatically by weighting a vocabulary according to each word's frequency in the explicit class vs. the clean class. They work with a corpus of Korean lyrics.

Emotion recognition Recently, [139] address the task of multimodal music mood prediction based on the audio signal and the lyrics of a track. They propose a new deep-learning-based model outperforming traditional feature-engineering-based approaches. Performance is evaluated on their published dataset with associated valence and arousal values, which we introduced in Section 2.6.5. [482] model song texts in a low-dimensional vector space as bags of concepts, the “emotional units”; those are combinations of emotions, modifiers and negations. [486] leverage the music's emotion annotations from Allmusic, which they map to a lower-dimensional psychological model of emotion.

56 http://millionsongdataset.com
57 http://millionsongdataset.com/musixmatch/
58 https://www.soundtrackyourbrand.com

They train a lyrics emotion classifier and show, by qualitative interpretation of an ablated model (decision tree), that the deciding features leading to the classes are intuitively plausible. [234] aim to detect emotions in song texts based on Russell's model of mood, rendering emotions continuously in the two dimensions of arousal and valence (positive/negative). They analyze each sentence as a bag of “emotional units”; they reweight the sentences' emotions by both adverbial modifiers and tense, and even consider progressing and adversarial valence in consecutive sentences. Additionally, singing speed is taken into account. With the fully weighted sentences, they perform clustering in the 2D plane of valence and arousal. Although the method is unsupervised at runtime, many parameters are tuned manually by the authors in this work. [318] render emotion detection as a multi-label classification problem, where songs express intensities of six different basic emotions: anger, disgust, fear, joy, sadness, surprise. Their corpus (100 song texts) has time-aligned lyrics with information on musical key and note progression. Using Mechanical Turk, each line of song text is annotated with the six emotions. For emotion classification, they use bags of words and concepts, as well as the musical features key and notes. Their classification results using both modalities, textual and audio features, are significantly improved compared to a single modality.

Topic Modelling Among the works addressing this task for song lyrics, [296] define five ad hoc topics (Love, Violent, Antiwar, Christian, Drugs) into which they classify their corpus of 500 song texts using supervision. Relatedly, [176] also use supervision to find bags of genre-specific n-grams; adopting the view from the literature that BOWs define topics, the genre-specific terms can be seen as mixtures of genre-specific topics. [287] apply the unsupervised topic model Probabilistic LSA to their ca. 40k song texts. They learn latent topics for both the lyrics corpus and an NYT newspaper corpus (for control) and show that the domain-specific topics slightly improve the performance in their MIR task. While their MIR task performs much better when using acoustic features, they discover that both methods err differently. [249] apply Non-negative Matrix Factorization (NMF) to ca. 60k song texts and cluster them into 60 topics, showing the discovered topics to be intrinsically meaningful. [429] have worked on topic modelling of a large-scale lyrics corpus of 1M songs. They build models using Latent Dirichlet Allocation with topic counts between 60 and 240 and show that the 60-topics model gives a good trade-off between topic coverage and topic redundancy. Since popular topic models such as LDA represent topics as weighted bags of words, these topics are not immediately interpretable. This gives rise to the need for an automatic labelling of topics with shorter labels. A recent approach [53] relates the topical BOWs with titles of Wikipedia articles in a two-step procedure: first, candidates are generated, then ranked.

2.7 Events extraction from social media posts

Twitter has become a valuable source of timely information covering topics from every corner of the world. For this reason, NLP researchers have shown growing interest in mining knowledge from Twitter data. As a result, several approaches have been proposed to build applications over tweets, e.g., to extract structured representations/summaries of newsworthy events [307, 246], or to carry out sentiment analysis to study users' reactions [2, 254]. However, as we will see also in other chapters of this manuscript, processing tweets is a challenging task, since information in the Twitter stream is continuously changing in real time, while at the same time there might be a high volume of redundant messages referring to the same issue or event. In this section, we focus on event extraction from Twitter, consisting of the automated clustering of tweets related to the same event based on relevant information such as time and participants. Although there is no consensus in the NLP community on what an event is [421], our approach relies on the event definition by [150], i.e. “an occurrence causing change in the volume of text data that discusses the associated topic at a specific time. This occurrence is characterized by topic and time, and often associated with entities such as people and location”. At the time of this study, existing approaches to the task created clusters of tweets around event-related keywords [361] or NEs [307]. However, such approaches fail i) to capture events that do not generate spikes in the volume of tweets; and ii) to distinguish between events that involve the same NEs and keywords. Other approaches model the relationships between terms contained in the tweets relying on a graph representation [247], and retain the nodes with the highest number of edges as event candidates. However, the main drawbacks of these approaches are that i) they generate highly dense graphs, and ii) trending terms not related to events may be considered as event candidates. To address such limitations, in this work we propose an unsupervised approach to detect open-domain events on Twitter, where the stream of tweets is represented through temporal event graphs modeling the relations between NEs and the terms that surround their mentions in the tweets.

2.7.1 Approach description In this section, we describe our approach for detecting open-domain events on tweets. The pipeline consists of the following components: Tweet pre-processing, Named Entity recognition and linking, graph creation, graph partitioning, event detection and event merging. Each step is described in the following paragraphs.

Tweet Preprocessing. The workflow starts by collecting tweets published during a fixed time window, which can be set as an input parameter (e.g., 1 hour). Then, we apply common text preprocessing routines to clean the input tweets. We use TweetMotifs [350], a specific tokenizer for tweets, which treats hashtags, user mentions and emoticons as single tokens. Then, we remove the retweets, URLs, non-ASCII characters and emoticons. It is worth mentioning that at this stage we do not perform stop word removal, since stop words can be part of NEs (e.g., the “of” in “United States of America”). As for hashtags, we define a set of hand-crafted rules to segment them into terms, if possible. We also try to correct misspelled terms using SymSpell59, which matches misspelled tokens with Wordnet synsets [177].
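A rough regex-based approximation of these cleaning steps is sketched below; it stands in for the TweetMotifs/SymSpell pipeline described above (which it does not reproduce) and keeps hashtags and user mentions as single tokens.

import re

def clean_tweet(text: str):
    """Drop retweets, URLs, non-ASCII characters and emoticons; keep #tags and @mentions."""
    if text.lower().startswith("rt @"):        # treat retweets as duplicates and drop them
        return None
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = text.encode("ascii", errors="ignore").decode()  # strip non-ASCII (emoji, etc.)
    text = re.sub(r"[:;][-']?[)(DPp]", " ", text)           # a few common emoticons
    # Hashtags, mentions and words become single tokens; stop words are kept on purpose.
    return re.findall(r"[#@]?\w+", text.lower())

print(clean_tweet("Famine declared in Somalia http://t.co/xyz #HornOfAfrica @UN :("))
# ['famine', 'declared', 'in', 'somalia', '#hornofafrica', '@un']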

Named Entity Recognition and Linking. We use NERD-ML [452], a Twitter-specific Named Entity Recognition (NER) tool, to extract NE mentions in the tweets, since [141] showed that it is one of the best performing tools for NER on Twitter data. Besides, NERD-ML not only recognizes the most common entity types (i.e., Person, Organization and Location), but also tries to link any term listed in external knowledge bases such as DBpedia60 or Wikipedia. These are then associated with semantic classes in the NERD ontology.

Graph Generation. Previous works using graph-based methods to model relations between terms in a text considered all terms in the input document as nodes and used their position in the text to set edges [12, 484]. Such approaches may generate a dense graph, which is generally computationally expensive to process. In this work, we assume that the terms surrounding the mention of a NE in a tweet define its context [349]. Thus, we rely on the NE context to create event graphs, built as follows:

59 https://github.com/wolfgarbe/symspell
60 http://dbpedia.com/

• Nodes: We consider NEs and the k terms that precede and follow their mentions in a tweet as nodes, where k > 1 is the number of terms surrounding a NE considered while building the NE context.

• Edges: Nodes in the graph are connected by an edge if they co-occur in the context of a NE.

• Weight: The weight of the edges is the number of co-occurrences between terms in the NE context. In addition, each edge maintains as a property the list of tweets from which the relationship is observed.

Formally, let G(V, E) be a directed graph (or digraph) with a set of vertices V and edges E, such that E ⊂ V × V. For any Vi ∈ V, let In(Vi) be the set of vertices that point to Vi (i.e., predecessors), and Out(Vi) be the set of vertices that Vi points to (i.e., successors). Let Ei = (Vj, Vk) be an edge that connects node Vj to Vk; we define ωjk as the weight of Ei, i.e., the number of times the relationship between Vj and Vk is observed in tweets published during a time window. An example of the graph created on 2011-07-07 with tweets related to the famine in Somalia and the space shuttle to Mars is shown in Figure 2.31.
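A possible rendering of this construction with networkx is sketched below; the window size k, the toy tweet and the helper names are ours, and the edge attributes mirror the weight and tweet-list properties described above.

import networkx as nx

def add_tweet_to_graph(G: nx.DiGraph, tokens, entities, tweet_id, k=2):
    """Add the k-term context around each named entity mention to the event graph."""
    for i, tok in enumerate(tokens):
        if tok not in entities:
            continue
        context = tokens[max(0, i - k): i + k + 1]  # NE plus k preceding/following terms
        for u in context:
            for v in context:
                if u == v:
                    continue
                if G.has_edge(u, v):
                    G[u][v]["weight"] += 1
                    G[u][v]["tweets"].add(tweet_id)
                else:
                    G.add_edge(u, v, weight=1, tweets={tweet_id})
    return G

G = nx.DiGraph()
tokens = ["un", "declares", "famine", "in", "southern", "somalia"]
add_tweet_to_graph(G, tokens, entities={"somalia", "un"}, tweet_id="t1", k=3)
print(G["famine"]["somalia"])  # {'weight': 1, 'tweets': {'t1'}}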


Figure 2.31: Graph generated on day “2011-07-07” from a sample of tweets related to the events about the famine in Somalia and the space shuttle to Mars.

Graph Partitioning. At this stage, an event graph has been generated to model the relationships between terms in the NE contexts. We apply graph theory to partition the graph into sub-graphs, which will be considered as event candidates. Tweets related to the same events usually share a few common keywords [307]. In the event graphs, this phenomenon is expressed by stronger links between nodes related to the same event; in other words, the weights of edges that connect terms from tweets related to similar events are higher than those of edges between nodes that connect terms from tweets related to different events. The purpose of graph partitioning is to identify the edges that, if removed, will split the large graph G into sub-graphs. Let E = {(V1, W1), (V2, W2), ..., (Vn, Wn)} be a set of pairs of vertices in a strongly connected graph G. We define λ as the least number of edges whose deletion from G would split G into connected sub-graphs. Similarly, we define the edge-connectivity λ(G) of G as the least cardinality |S| of an edge set S ⊂ E such that G − S is no longer strongly connected. For instance, given the graph in Figure 2.31 as input, the deletion of the edges “mark/somalia” and “year/famine” will create two strongly connected sub-graphs, where the first one contains keywords related to “famine in Somalia” and the other contains keywords related to “the space shuttle to Mars”.
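A hedged sketch of this partitioning step with networkx is given below: it computes a global minimum edge cut on the undirected projection of the event graph, removes those edges and returns the resulting components. It is a simplification of the edge-connectivity machinery described above, with an invented toy graph.

import networkx as nx

def partition_event_graph(G: nx.DiGraph):
    """Split the event graph by removing a minimum edge cut (undirected projection)."""
    U = G.to_undirected()
    if nx.number_connected_components(U) > 1:
        return [set(c) for c in nx.connected_components(U)]
    cut = nx.minimum_edge_cut(U)   # smallest set of edges whose removal disconnects U
    U.remove_edges_from(cut)
    return [set(c) for c in nx.connected_components(U)]

# Toy graph with two weakly linked clusters (cf. the famine / space-shuttle example).
G = nx.DiGraph()
G.add_edges_from([("famine", "somalia"), ("somalia", "united_nations"),
                  ("united_nations", "famine"),
                  ("nasa", "space_shuttle"), ("space_shuttle", "florida"),
                  ("florida", "nasa"),
                  ("somalia", "mark"), ("mark", "nasa")])  # weak bridge between clusters
print(partition_event_graph(G))  # two sub-graphs, one per event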

Event Detection. We assume that events from different sub-graphs are not related to each other. Thus, in the event detection sub-module, each sub-graph is processed separately. In a study on local partitioning, [12] show that a good partition of a graph can be obtained by separating high-ranked vertices from low-ranked ones, if the nodes in the graph have distinguishable values. We use a PageRank-like algorithm [72] to rank the vertices in the event graph as follows:

S(V_i) = \Big( (1 - d) + d \sum_{V_j \in In(V_i)} \frac{\omega_{ji}}{\sum_{V_k \in Out(V_j)} \omega_{jk}} S(V_j) \Big) \cdot \epsilon_i \qquad (2.4)

where ω_ji is the weight of the edge connecting V_j to V_i, d a damping factor usually set to 0.85 [72], and ε_i a penalization parameter for node V_i. In previous approaches [319], the penalization parameter follows a uniform distribution; instead, we define the penalization parameter of a node according to its tf-idf score. Due to redundant information in tweets, the score of the nodes can be biased by the trending terms in different time windows. Thus, we use the tf-idf score to reduce the impact of trending terms in the collection of tweets. Before computing the score with Equation 2.4, we assign an initial value τ = 1/n to each vertex in the graph, where n is the total number of nodes in the graph. Then, for each node, the computation iterates until the desired degree of convergence is reached. The degree of convergence of a node can be obtained by computing the difference between its score at the current iteration and at the previous iteration, which we set to 0.0001 [72]. As shown in Algorithm 1, we start by splitting the vertex set into high-ranked and low-ranked vertices based on a cutting parameter α (Line 3). Next, we process the vertices in the high-ranked subset starting from the highest ones, and for each candidate we select the highest weighted predecessors and successors as keywords for event candidates (Lines 4-9). After removing the edges between the keywords from the graph, if it becomes disconnected, we also consider the disconnected nodes as keywords for the event candidate (Lines 10-13). Based on the semantic class provided by the NER tool (see Section 2.7.1), we divide the keywords related to an event into the following subsets: what (i.e., the type of the event), where (i.e., the location in which the event happens), who (i.e., the person or organization involved). As for the date, we select the oldest tweets that report the event. In the second stage of Algorithm 1, we further process the event candidates. First, we merge duplicate event candidates (Lines 22-35), i.e., those sharing common terms and having the same location or participants in the considered time window. A new event is thus built from the combination of terms and entities of the two event candidates. An event is considered as valid if at least one NE is involved, and if it occurs in a minimum number of tweets provided as an input parameter.
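The ranking of Equation 2.4 can be prototyped as below. The tf-idf penalization is passed in as a precomputed dictionary with invented values, the graph stores weights in a 'weight' attribute, and the stopping criterion follows the 0.0001 convergence threshold mentioned above; this is a sketch, not the exact implementation.

import networkx as nx

def rank_nodes(G: nx.DiGraph, penalty, d=0.85, tol=1e-4, max_iter=100):
    """PageRank-like score of Equation 2.4 with a per-node tf-idf penalization."""
    score = {v: 1.0 / G.number_of_nodes() for v in G}   # initial value tau = 1/n
    for _ in range(max_iter):
        new = {}
        for i in G:
            rank_sum = 0.0
            for j in G.predecessors(i):
                out_weight = sum(G[j][k]["weight"] for k in G.successors(j))
                rank_sum += G[j][i]["weight"] / out_weight * score[j]
            new[i] = ((1 - d) + d * rank_sum) * penalty.get(i, 1.0)
        if max(abs(new[v] - score[v]) for v in G) < tol:
            return new
        score = new
    return score

G = nx.DiGraph()
G.add_edge("famine", "somalia", weight=5)
G.add_edge("somalia", "famine", weight=5)
G.add_edge("somalia", "united_nations", weight=2)
tfidf = {"famine": 1.2, "somalia": 1.5, "united_nations": 0.8}  # hypothetical tf-idf penalties
print(rank_nodes(G, tfidf))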

Event Merging. We consider events in different time windows as duplicates if they contain the same keywords and entities (e.g., person, organization, location) within an interval of k days, where k is an input parameter. When a new event is found to be a duplicate, we merge it with the previously detected event.

Algorithm 1 Algorithm to process a given event graph to retrieve important sub-events.
1: function Graph Processing(G, α)
2:    E ← ∅
3:    H ← {vi ∈ vertex(G) if score(vi) ≥ α}                    ▷ Equation 2.4
4:    while H ≠ ∅ do
5:        G′ ← G.copy()
6:        vi ← H.pop()
7:        p ← highest-weighted predecessor of vi (arg max over vj ∈ In(vi))
8:        s ← highest-weighted successor of vi (arg max over vj ∈ Out(vi))
9:        keywords ← set(p, vi, s)
10:       G′.remove_edges((p, vi), (vi, s))
11:       if not G′.connected() then
12:           append(keywords, disconnected_vertices(G′))
13:       end if
14:       who ← person or organization ∈ keywords
15:       where ← location ∈ keywords
16:       what ← keywords − who − where
17:       tweets ← tweets_from(keywords)
18:       when ← oldest(tweets, date)
19:       event ← ⟨what, who, where, when⟩
20:       append(E, event)
21:   end while
22:   for e ∈ E do
23:       for e′ ∈ E do
24:           if what(e) ∩ what(e′) then
25:               if who(e) ∩ who(e′) then
26:                   merge(e, e′)
27:               end if
28:               if where(e) ∩ where(e′) then
29:                   merge(e, e′)
30:               end if
31:           end if
32:       end for
33:       if not who(e) or not where(e) then
34:           discard(E, e)
35:       end if
36:   end for
      return E
37: end function

2.7.2 Experiments

Given a set of tweets, our goal is to cluster such tweets so that each cluster corresponds to a fine-grained event such as “Death of Amy Winehouse” or “Presidential debate between Obama and Romney during the US presidential election”. We first describe the datasets, then we present the experimental setting. This section ends with a comparison of the obtained experimental results with state-of-the-art approaches.

Dataset. We test our approach on two gold standard corpora: the First Story Detection (FSD) corpus [377] and the EVENT2012 corpus [308]. FSD. The corpus was collected from the Twitter streaming API61 between 7th July and 12th September 2011. Human annotators annotated 3,035 tweets as related to 27 major events that occurred in that period. After removing tweets that are no longer available, we are left with 2,342 tweets related to one of the 27 events. To reproduce the same dataset used by other state-of-the-art approaches, we consider only those events mentioned in more than 15 tweets. Thus, the final dataset contains 2,295 tweets describing 20 events. EVENT2012. A corpus of 120 million tweets collected from October to November 2012 from the Twitter streaming API, of which 159,952 tweets were labeled as event-related. 506 event types were gathered from the Wikipedia Current Event Portal, and Amazon Mechanical Turk was used to annotate each tweet with one of such event types. After removing tweets that are no longer available, our final dataset contains ∼43 million tweets, of which 152,758 are related to events.

61 dev.twitter.com/streaming/overview

Experimental Setting. For each dataset, we compare our approach with state-of-the-art approaches. For the FSD dataset, we compare with the LEM Bayesian model [496] and the DPEMM Bayesian model enriched with word embeddings [497]. For the EVENT2012 dataset, we compare our results with the Named Entity-Based Event Detection approach (NEED) [307] and Event Detection Onset (EDO) [247]. In order to mimic a real scenario where tweets are continuously added to a stream, we simulate the Twitter stream with a client-server architecture which pushes tweets according to their creation date. We evaluate our approach in two different scenarios: in the first scenario, we consider the tweets from the FSD dataset that are related to events and classify them into fine-grained event clusters. In the second scenario, we adopt a more realistic approach in that we consider all the tweets from the EVENT2012 dataset (i.e., event-related and non-event-related ones), and we classify them into event clusters, discarding those that are not related to events. Our approach requires a few parameters to be provided as input. In the experiments reported here, we process the input stream with a fixed time window w = 1 hour. The minimum number of tweets for event candidates is set to n = 5. Finally, we empirically choose t = 3 days as the interval of validity for the detected events.

Results. Performance is evaluated both in terms of P/R/F1 and cluster purity. Results on the FSD dataset: In this scenario, we consider an event as correctly classified if all the tweets in that cluster belong to the same event in the gold standard; otherwise the event is considered as misclassified. Due to the low number of tweets, we set the cutting parameter α = 0.5 as the minimum score for nodes in the graph to be considered as useful for events. Table 2.19 shows the experimental results yielded by our approach in comparison to state-of-the-art approaches. Our approach outperforms the others, improving the F-score by 0.07 points w.r.t. DPEMM and by 0.13 w.r.t. LEM.

Approach       Precision  Recall  F-measure
LEM            0.792      0.850   0.820
DPEMM          0.862      0.900   0.880
Our Approach   0.950      0.950   0.950

Table 2.19: Evaluation results on the FSD dataset.

Furthermore, we evaluate the quality of the events, i.e. the clusters, in terms of purity, where the purity of an event cluster is based on the number of tweets correctly classified in it versus the number of misclassified ones. More specifically, purity is computed as P_e = n_e / n, where n_e is the number of tweets correctly classified and n the total number of tweets classified in that cluster. Figure 2.32 reports the purity of our approach compared to LEM and DPEMM, where each point (x, y) denotes the percentage y of events having purity less than x. It can be observed that 5% of the events detected by our approach, as well as by DPEMM, have purity less than 0.65, compared to 25% for LEM, while 95% of the events detected by our approach have purity higher than 0.95, compared to 75% for DPEMM and 55% for LEM.
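For reference, the purity of one detected event cluster can be computed as in the small sketch below, which takes the share of tweets assigned to the dominant gold event as a reading of the n_e / n definition above; variable names and numbers are ours.

from collections import Counter

def purity(cluster_gold_labels):
    """Purity of one detected event: share of its tweets belonging to the majority gold event."""
    counts = Counter(cluster_gold_labels)
    return counts.most_common(1)[0][1] / len(cluster_gold_labels)

# Gold event ids of the tweets grouped into one detected cluster (invented example).
print(purity(["amy_winehouse"] * 18 + ["space_shuttle"] * 2))  # 0.9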

Figure 2.32: Purity of the events detected by our approach, LEM and DPEMM on the FSD dataset. The y-axis denotes the percentage of events and the x-axis the purity of the events.

Results on the EVENT2012 dataset: We also evaluate our approach on the EVENT2012 dataset using a more realistic scenario in which all the tweets (i.e., event-related and non-event-related tweets) are considered. Compared to the FSD dataset, the EVENT2012 dataset has more events and tweets and thus a larger vocabulary. We set the cutting parameter α = 0.75 as the minimum score for nodes in the graph to be considered as important for events; we further detail the importance of the parameter α later in this section. Also, since we include both event-related and non-event-related tweets, we consider an event as correct if 80% of its tweets belong to the same event in the ground truth. Table 2.20 reports the experimental results compared to the NEED and EDO approaches. In general, our approach improves the f-score by 0.07 points w.r.t. EDO and 0.23 points w.r.t. NEED. After a manual check of the output, we noticed that some issues with precision may depend on the quality of the dataset, since some tweets related to events were not annotated as such in the gold standard. For example, we found that 9,010 tweets related to the “BET hip hop award” were not annotated. The same was found for tweets concerning large events such as the “Presidential debate between Obama and Romney” or the “shooting of Malala Yousafzai, the 14-year-old activist for human rights in Pakistan”. We also evaluate the purity of the events detected by our approach (Figure 2.33). We can observe that the quality of the detected events is lower than for the events detected on the FSD dataset: for instance, more than 20% of the detected events have purity lower than 0.7. As expected, event purity is mainly affected by the inclusion in the clusters of non-event-related tweets.

Approach       Precision  Recall  F-measure
NEED           0.636      0.383   0.478
EDO            0.754      0.512   0.638
Our Approach   0.750      0.668   0.710

Table 2.20: Evaluation results on the EVENT2012 dataset.

Effect of the Cutting Parameter. We further experiment on the impact of the cutting parameter on the output of our model. The cutting parameter α is used to separate the nodes of the event graph into high-ranked and low-ranked nodes, where the high-ranked nodes are used to extract keywords related to event candidates. We experiment with different values of α and evaluate their impact on the performance of our approach on both datasets. In Figure 2.34a we show the performance of our model for 0 < α ≤ 4 on the FSD dataset. We observe that higher values of α give higher precision while lowering the recall. More specifically, for α ≥ 3 we obtain 100% precision and recall lower than 50%. On the other hand, the best performance is obtained for α ≤ 0.5. Since the FSD dataset contains ∼6,000 unique words, at each time window the generated graph is strongly connected, thus the average minimum score of the nodes is higher than 0.5. For values higher than 0.5, important terms referring to events are ignored, mainly when they are related to events that do not generate a high volume of tweets. In our experiments, we also observe that higher values of α mostly affect the recognition of events with a low number of tweets. Figure 2.34b shows the performance of our model for different values of α on the EVENT2012 dataset. We observe that for different values of α, both precision and recall are affected; more specifically, the recall of the model tends to decrease for lower values of α. Without edge cutting (i.e., α = 0), the recall of our model is similar to EDO. Overall, the impact of α is bigger on the EVENT2012 dataset than on the FSD dataset: the variation of the precision and recall curves is smaller for consecutive values of α w.r.t. FSD. There are two main reasons for this: i) the EVENT2012 dataset has a richer vocabulary, and ii) many events in the EVENT2012 dataset are similar to each other.

Figure 2.33: Purity of the events detected by our approach on the EVENT2012 dataset. The y-axis denotes the percentage of events and the x-axis the purity of the events.

Figure 2.34: Effect of the cutting parameter (α) on the performance of our approach: (a) effect on FSD; (b) effect on EVENT2012.

2.8 Building events timelines from microblog posts

In the following we present another work on event extraction, aimed at building event timelines from microblog posts; we focus in particular on sports event timelines. Historically, sports fans have watched matches either at the stadium or on TV, or have listened to them on the radio. In recent years, however, social media platforms have become a new communication channel to share information and comment on sports events, thus creating online communities of sports fans. Microblogs are particularly suitable, thanks to their coverage and speed, making them a successful channel to follow and comment on events in real time. Sports teams and media have also benefited from these platforms to extend their contact networks, increase their popularity and exchange information with fans [193, 356]. The need to monitor and organize such information is particularly relevant during big events like the UEFA Euro or the FIFA World Cup: several matches take place in a limited time span, sometimes in parallel, and summaries are manually produced by journalists. A few approaches have recently tried to automate this task by recognizing actions in multimedia data [212, 416, 415]. In this work, we investigate whether the same task can be performed relying only on user-generated content from microblogs. In fact, opinions shared by fans during sports matches are usually reactions to what is happening in the game, implicitly conveying information on the ongoing events. Existing works aimed at building complete summaries of sports games from tweets [343, 484] rely on the observation of peaks in the tweets' volume. Even though such approaches effectively detect the most salient actions in games (e.g., goals), they fail to capture actions that are not reported by many users (e.g., shoots). Moreover, they focus only on specific information: for example, [286] and [9] are respectively interested in detecting goals, yellow and red cards in soccer games, and in detecting time and keywords, ignoring the players involved in the actions. In this work we perform a more complex task: we create a fine-grained, real-time summary of the sub-events occurring in sports games using tweets. We define a sub-event in a match as an action that involves one or many participants (e.g., a player, a team) at a given time, as proposed by [150]. More specifically, we want to address the following research questions: i) Is it possible to build detailed sports game summaries in an unsupervised fashion, relying only on a controlled vocabulary?, and ii) To what extent can Twitter be used to build a complete timeline of a game? Is the information retrieved via Twitter reliable and sufficient?

2.8.1 Proposed approach Although the approach we propose to detect sub-events in sports games and to build a timeline (Figure 2.35) is general-purpose, we take soccer games as an example, so that we can use a consistent terminology. The pipeline can be applied to any sport as long as it is represented in the Sports Markup Language [128]. First, a module for information extraction identifies actions (e.g., goals, penalties) and participants (e.g., players' names, teams) mentioned in tweets, setting relations between them (see examples in Table 2.21). Then, participants, actions and relations are modeled together in a temporal event graph, taking into account also the time of the tweet. This leads to the creation of a timeline where actions and participants are connected and temporally ordered. The modules of this pipeline are described in detail in the following sections.

Figure 2.35: Sub-events extraction pipeline.

Tweet: kick off.... #engwal #euro2016 #teamengland | Action: D1P | Participants: england, wales
Tweet: how has ramsey not got a yellow card yet every attempt to tackle has been a foul. | Action: CJA | Participants: ramsey, wales
Tweet: goaaaaaaaaaaal from bale woah #eng 0-1 #wal | Action: BUT | Participants: bale, wales

Table 2.21: Detected actions and participants in tweets (England-Wales, June 16, 2016. D1P: First period begins, CJA: Yellow card, BUT: Goal).

2.8.2 Information Extraction The first module retrieves participants and sub-events (or actions)62 from tweets, and sets relations between them. In the case of soccer, actions are defined by FIFA, e.g., goals, penalties, yellow/red cards, etc. Participants are the actors who induce the actions; for soccer games, they are players and teams. To extract information we use [130], because it includes a highly flexible Named Entity Recognition (NER) tool that allows the integration of custom gazetteers. To detect actions, we update its gazetteer based on the Sports Markup Language, a controlled vocabulary used to describe sports events. The SportsML core schema provides concepts allowing the description of events for 11 major sports including soccer, football, basketball and tennis. For soccer games, we extract actions such as goals, substitutions, yellow/red cards and penalties.

62 In this work we use interchangeably the terms actions and sub-events to refer to actions in a sports game.

Furthermore, we enrich the list of actions with synonyms extracted from Wordnet [177]. As for participants, we update the gazetteer using the football-data API63 that, given a soccer game as input, returns the names of the teams and of their players. We also apply some heuristics to associate different spelling variations with players' and teams' names. This is done by considering separately or by combining the different parts of the players' names (i.e., first name and last name). For instance, “giroud”, “oliviergiroud” or “olivier giroud” are all associated with “Olivier Giroud”, a player in the French national team. We first pre-process the data using GATE's in-built tweet normalizer, tokenizer and PoS-tagger. Then, we use the NER module integrating the two custom gazetteers we created. We also set links representing relations between actions and participants by means of JAPE (Java Annotation Pattern Engine) rules, a GATE-specific format to define the regular expressions needed for pattern matching. Since relations detected through JAPE rules tend to be very accurate, we assign a weight of 2 to edges extracted from such rules. If an action and a participant appear in the same tweet but are not matched through a JAPE rule, we set a link with weight 1, to account for a lower precision.
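The spelling-variation heuristics can be approximated as in the sketch below; the variant list and the two example players are illustrative, and the real gazetteer is built from the football-data API responses.

def name_variants(full_name: str):
    """Generate simple spelling variants of a player's name for gazetteer lookup."""
    parts = full_name.lower().split()
    first, last = parts[0], parts[-1]
    return {last, first + last, first + " " + last, full_name.lower()}

gazetteer = {}
for player, team in [("Olivier Giroud", "france"), ("Gareth Bale", "wales")]:
    for variant in name_variants(player):
        gazetteer[variant] = (player, team)

print(gazetteer["oliviergiroud"])  # ('Olivier Giroud', 'france')
print(gazetteer["bale"])           # ('Gareth Bale', 'wales')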

2.8.3 Timeline creation Modeling sub-events. The output of the information extraction module (Figure 2.35) is a list of tuples ⟨a, p, t, ω⟩, where a is a sports action, p the participant involved, t the timestamp of the tweet, and ω the weight of the edge connecting a and p. These tuples are used to build a temporal event graph (see Figure 2.36). To retain temporal information on the sub-events, we split the game into fixed time windows, and create an event graph that models the relationships between actions and participants for each time window. We refer to such graphs as temporal graphs [454] and we build them as follows:

• Nodes: Actions and participants are represented by nodes in the event-graph. First, we retrieve the nodes of the actions, and then we add the connected participants’ nodes;

• Edges: Nodes are connected by an edge if a relation can be set in the tweets published during the time-window. The occurrence of this relation is used to increase the weight of the edges. Relationships between participants are created for actions involving 2 or more participants (e.g., a substitution).

Fig. 2.36 shows a temporal graph at time window 22 of the game between England and Wales (Game #16 on June 16, 2016): we observe edges linking participants, e.g., connecting the nodes “Sterling” and “Vardy”, retrieved from tweets requesting the substitution of “Sterling” by “Vardy”. Both are also linked to the node “England”, i.e., their team.
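A compact sketch of how the extracted ⟨a, p, t, ω⟩ tuples can be turned into per-time-window event graphs is given below, again with networkx and with invented tuples; the window length and helper names are ours.

import networkx as nx
from collections import defaultdict

def build_temporal_graphs(tuples, window_minutes=1):
    """One weighted event graph per time window, linking actions and participants."""
    graphs = defaultdict(nx.Graph)
    for action, participant, minute, weight in tuples:
        window = minute // window_minutes
        G = graphs[window]
        if G.has_edge(action, participant):
            G[action][participant]["weight"] += weight
        else:
            G.add_edge(action, participant, weight=weight)
    return graphs

# Invented tuples from the information extraction step: (action, participant, minute, weight).
tuples = [("goal", "bale", 11, 2), ("goal", "wales", 11, 1),
          ("substitution", "vardy", 46, 2), ("substitution", "sterling", 46, 2)]
graphs = build_temporal_graphs(tuples)
print(graphs[11].edges(data=True))
# [('goal', 'bale', {'weight': 2}), ('goal', 'wales', {'weight': 1})]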

Processing the event-graphs. At this stage, the weighted relations between actions and participants are considered as sub-event candidates. We cannot automatically include them in the timeline because they could represent opinions or wishes of the fans, such as: “how has ramsey not got a yellow card yet every attempt to tackle has been a foul”. In general, we may assume that real sub-events in a game are reported by many users, while an action reported by only a few users is more likely to be a subjective post reflecting an opinion. Most of the existing works set an empirical threshold to measure the importance of the actions [9, 299]. However, we observe that the number of tweets generated for a given action is highly dependent on the game and on the team or players involved. Thus, we find it useful to tune the thresholds by taking into account both the type of the action and the popularity of the teams involved in the game.

63 http://api.football-data.org


Figure 2.36: Event-graph for England-Wales.

For each action belonging to a certain sport, we manually define an empirical threshold according to the importance of the action. For soccer, we can assume that a goal will trigger a higher number of tweets than a shoot. These empirical values can be defined by domain experts for each category of the sports we want to track. Based on the predefined thresholds, the interest of the game for people and the popularity of the opposing teams, we adjust the empirical thresholds using Kreyszig's standard score formula [256] as follows:

\varphi_{a,t} = \epsilon_a \cdot \frac{\eta_{g,t} - \bar{\eta}_g}{\sigma_g} \qquad (2.5)

where ϕ_{a,t} is the threshold for action a at time t of the game, ε_a the empirical threshold for a, η_{g,t} the count of tweets related to the game at time t, η̄_g the mean count, and σ_g the standard deviation of the tweet counts related to the game over the past time windows.

Ranking Sports Actions. Let $A = \langle a, p, t, \omega \rangle$ be a quadruple modeling an action $a$ at time $t$, involving participant $p$ and weighted by $\omega$ (i.e., the weight of the edge connecting $a$ and $p$ in the event graph). For each participant, we compute a standard score as follows:

$$z_{a,p,t} = \frac{\eta_\omega - \bar{\eta}_\omega}{\sigma_\omega} \qquad (2.6)$$

where $\eta_\omega$ is the weight of the edge in graph $G$ that connects nodes $a$ and $p$, $\bar{\eta}_\omega$ is the mean weight observed for actions of type $a$ involving $p$, and $\sigma_\omega$ is the standard deviation of the relationship between $a$ and $p$ over all past time windows. We then evaluate the action by taking the ratio between the standard score of each participant and the sum of the standard scores of all the participants:

$$z_{a,t} = \frac{z_{a,p_i,t}}{\sum_{p_i \in P} z_{a,p_i,t}} \qquad (2.7)$$

At a given time $t$, an action is added to the timeline iff there exists at least one participant $p$ such that $z_{a,t} \geq \varphi_{a,t}$. As shown in Algorithm 2, we first merge the current event graph and the graph from the previous time window (Line 2). Then, from the merged graph, we collect all vertices of type foot_action and, for each of them, we retrieve all connected nodes as participants of the action (Lines 4-6). We compute the adaptive threshold for each action and a standard score for each participant using Equations 2.5 and 2.6, respectively (Lines 7-9). Finally, sub-event candidates are created with the participants whose score is higher than the threshold of the action (Lines 10-16). For some actions, participants may not be required (e.g., beginning/end of periods in soccer); for such actions we consider both teams as participants, in order to comply with Equations 2.6 and 2.7. We remove from the event graph the actions and participants involved in sub-events, while nodes that were not related to sub-events are kept to be processed in the next time-window. However, if a node cannot be confirmed as related to a sub-event in two consecutive time windows, we consider it as noise and simply discard it. Before putting sub-events on the timeline, we perform a final check to verify whether they were already validated in the previous time window. If so, it means that the action overlaps two time-windows, and the timestamp of the event must be updated to match the time of the first occurrence. We consider two events identical if: i) they mention the same action and participants; and ii) the number of tweets reporting the more recent action is lower than the number of tweets on the old one.
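Before turning to the full procedure in Algorithm 2, the scores of Equations (2.6) and (2.7) can be sketched as follows (the data structures and the zero-division guards are our assumptions, not the actual implementation):

```python
import statistics

def participant_score(current_weight, past_weights):
    """Standard score of the edge between action a and participant p
    (Equation 2.6), given the weights observed in past time windows."""
    mean = statistics.mean(past_weights)
    std = statistics.pstdev(past_weights) or 1.0
    return (current_weight - mean) / std

def action_scores(current_weights, history):
    """Normalized score of an action for every participant (Equation 2.7).

    `current_weights` maps each participant to the current edge weight;
    `history` maps each participant to the list of past edge weights.
    """
    z = {p: participant_score(w, history[p]) for p, w in current_weights.items()}
    total = sum(z.values()) or 1.0       # guard against a zero denominator
    return {p: score / total for p, score in z.items()}
```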

Algorithm 2 Algorithm to process a given event-graph to retrieve important sub-events.

 1: function Graph Processing(Gt, Gt-1, t)      ▷ Gt: event graph at time t; Gt-1: event graph at t-1; t: current time
 2:     G ← merge(Gt, Gt-1)
 3:     E ← ∅
 4:     for vertex ∈ G.vertices() do
 5:         if vertex is a foot_action then
 6:             P ← G.neighbors(vertex)
 7:             a ← vertex.action
 8:             ϕa,t ← compute(a, t)             ▷ Equation 2.5
 9:             za,t ← compute(a, P, t)          ▷ Equation 2.7
10:             for z ∈ za,t do                  ▷ one score per participant p
11:                 if z ≥ ϕa,t then
12:                     event ← (a, p, t)
13:                     E.append(event)
14:                     G.delete(a, p)
15:                 end if
16:             end for
17:         end if
18:     end for
19: end function

2.8.4 Experiments

Dataset. We evaluate our framework on the Hackatal 2016 dataset64, collected during the EURO 2016 Championship. A set of keywords was manually defined, including hashtags (#euro, #euro2016, #football), the names of the teams involved in the competition (e.g., France) as well as their short names (e.g., #FRA), and hashtags related to the current games (e.g., #FRAROM for the game between France and Romania). For each game, tweets were collected for a two-hour time span, starting at the beginning of the game. For comparisons, and to limit the complexity of the processing pipeline, we limit our analysis to tweets in English. The dataset also contains the summary of the salient sub-events in each game, retrieved from journalistic reports (e.g., LeFigaro65). We consider these summaries as the ground truth when evaluating our approach. The summaries are defined as sets of triples ⟨time, action, participants⟩, where "time" is the time at which the sub-event occurs, "action" is the type of the sub-event, and "participants" are the players or teams involved in the action. The sub-events include: the beginning of the periods (D1P, D2P), the end of the periods (F1P, F2P), Shoot (TIR), Goal (BUT), Substitution (CGT), Red card (CRO) and Yellow card (CJA) (see Table 2.22).

64 http://hackatal.github.io/2016/
65 http://sport24.lefigaro.fr

Time    Action   Participants
15:02   D1P      –
15:09   TIR      Sterling
...     ...      ...
15:44   BUT      Bale
15:48   F1P      –
16:04   CGT      Sterling;Vardy
16:18   BUT      Vardy

Table 2.22: A few examples of the sub-events that occurred in the game between England and Wales.

           Loose                    Partial                  Complete
actions    Prec    Rec     F1       Prec    Rec     F1       Prec    Rec     F1
goal       0.745   0.512   0.549    0.670   0.456   0.493    0.623   0.405   0.444
card       0.758   0.560   0.622    0.693   0.506   0.568    0.600   0.433   0.516
subt       0.859   0.629   0.693    0.627   0.460   0.510    0.501   0.374   0.438
shoot      0.643   0.203   0.292    0.571   0.185   0.264    0.548   0.167   0.243
period     0.814   0.656   0.706    0.655   0.517   0.562    0.585   0.462   0.523

Table 2.23: Experimental results of our approach for 24 games in the first stage of the Euro 2016 dataset

Experimental setting. We simulate the Twitter stream by grouping the tweets related to a game into intervals of two minutes, which we refer to as time-windows. Thus, we collect all the tweets published in a time-window into a single document, which we give as input to our algorithm. In the pre-processing phase, we remove re-tweets if the original tweet is already in the collection, and we consider only one tweet per user in each time window. The input tweets are then analyzed with GATE. We use the JGraph library [340] to create the event-graph. At each time-window, we create a new graph to model the relations between the actions and participants detected in the tweets. We process the event-graph with Algorithm 2 to detect the real sub-events found in the tweets.
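The stream simulation can be sketched as follows (field names and the timestamp convention are assumptions, not the actual dataset schema):

```python
from collections import defaultdict

def group_into_windows(tweets, window_minutes=2):
    """Group tweets into fixed time windows and keep one tweet per user.

    Each tweet is assumed to be a dict with `timestamp` (seconds elapsed
    since the start of the collection), `user` and `text` fields. The
    tweets of a window are concatenated into a single document.
    """
    windows = defaultdict(dict)                  # window index -> {user: text}
    for tweet in tweets:
        index = int(tweet["timestamp"] // (window_minutes * 60))
        windows[index].setdefault(tweet["user"], tweet["text"])
    return {index: " ".join(per_user.values())
            for index, per_user in sorted(windows.items())}
```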

Evaluation strategies. We report on two different evaluation strategies. In the first one, we compare the output of our framework against the state-of-the-art approach [9], where sub-events are detected by identifying spikes in the Twitter stream. Since that approach does not detect participants, in this first comparison we also limit our evaluation to the action timeline, leaving out the additional information. We also compare the results with the gold standard timeline obtained from the summaries manually created by sports journalists. We show the results for three sample matches in Figures 2.38, 2.39 and 2.40. In the second evaluation strategy, we evaluate our approach against the gold standard data described above, this time including also the sub-event type, the time and the participant information. We consider three evaluation settings, namely complete matching, partial matching and loose matching. In the complete matching mode, we evaluate each sub-event detected by our system by taking into account the type of the sub-event, the participants and the time: a sub-event is considered correct if all three elements are correctly identified. In the partial mode, we consider the time and the type of the sub-event, and in the loose mode we only consider the type. We set the error margin to 2 minutes when comparing times, since this is the duration of the time-windows used to build the temporal graphs. Table 2.23 reports P/R/F1 for the same sample matches described above, as well as the average scores over 24 matches in the first stage of the competition.
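A simplified sketch of the three matching modes (the representation of sub-events as triples mirrors the gold standard format described above; the function and variable names are ours):

```python
def matches(predicted, gold, mode, margin_minutes=2):
    """Check whether a predicted sub-event matches a gold one.

    Both sub-events are (time_in_minutes, action, participants) triples;
    `mode` is one of "loose", "partial" or "complete".
    """
    p_time, p_action, p_participants = predicted
    g_time, g_action, g_participants = gold
    if p_action != g_action:
        return False                      # the action type is always required
    if mode == "loose":
        return True
    if abs(p_time - g_time) > margin_minutes:
        return False                      # partial and complete also check the time
    if mode == "partial":
        return True
    return set(p_participants) == set(g_participants)   # complete matching
```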

Figure 2.37: Precision Recall chart of the performances of our approach. X-axis is the average precision and Y-axis the average recall. Blue dots represent the loose matching, orange dots the partial matching and green dots the complete matching.

2.8.5 Results and discussion

The overall evaluation on the first 24 games of the EURO 2016 Championship (Table 2.23) shows that the approach is very accurate in some cases, while it suffers from low performance, especially in recall, in other settings. If we compare the different actions (left-most column in the table), we observe that the best performance is obtained when recognizing the start and the end of the match (last line in the table). For the other actions, the performance varies across the three evaluation modes. For example, when considering the participants in shoot actions, the approach often fails to identify the correct player, probably because other players are likely to be mentioned in the same tweet. Figure 2.37 provides an overview of the performances obtained with the different evaluation strategies. We further focus on three sample matches: we plot in Figures 2.38, 2.39 and 2.40 the sub-events detected by [9], those detected by our approach, as well as the gold standard ones, and we report in Tables 2.24, 2.25 and 2.26 the P/R/F1 measures for the loose, partial and complete evaluation strategies. The first game, England - Wales, gained particular attention on Twitter. Figure 2.38 shows the distribution of tweets during the game (in gray), distinguishing between tweets explicitly mentioning England (red line) and Wales (green). The blue dots correspond to the sub-events identified by the approach of [9], while those detected by our approach and the ground truth are represented with yellow and green dots, respectively. The graphical representation shows that there is a significant correspondence between the sub-events detected by our approach and the gold standard ones. We can also observe that [9] fails to detect sub-events that do not produce spikes in the volume of tweets. Table 2.24 shows the average performance of our approach on the same match. In this case, our performance is affected by problems in detecting actions of type substitution and shoot.

Matching    Prec    Rec     F-score
loose       0.852   0.958   0.902
partial     0.630   0.708   0.667
complete    0.444   0.500   0.470

Table 2.24: Performance on England-Wales.

Figure 2.38: Sub-events in England-Wales.

A second example is the match France - Romania (see Figure 2.39).

Although the game was quite debated on Twitter, only a few spikes were detected in the stream. In fact, during the first period the teams were barely mentioned, as indicated by the red and green curves on the graph; instead, other teams not directly involved in the game were mentioned. The second period was more interesting in terms of sub-events. Table 2.25 shows our performance. We obtain a precision of 91.3% in the loose mode, since we detect 23 out of 34 sub-events in the game (compared to the 9 identified by [9]), and 21 of the detected sub-events were associated with the correct actions. However, the latency between the sub-events detected by our approach and the ground truth decreases the performance of our approach in both the partial and the complete matching modes. For example, there is a huge peak at time 22:24, when Stancu equalizes for Romania, but we detect this action four minutes later, since most of the tweets in that time span discuss the penalty issue rather than the goal.

Figure 2.39: Sub-events in France-Romania.

Matching    Prec    Rec     F-score
loose       0.913   0.656   0.763
partial     0.696   0.500   0.582
complete    0.609   0.438   0.510

Table 2.25: Performance on France-Romania.

As a third example, we consider Belgium - Italy, which was less popular in terms of tweets than the previous ones. Only a few peaks are detected in the game (Figure 2.40). This negatively affects the number of sub-events found by [9], while our approach proves to have a better coverage, even if recall is on average lower than for the other matches. In most cases, we detect mentions of the actions, but we fail to detect the participants. Table 2.26 shows the overall performance of our approach. The ground truth contained only a few sub-events for this game, and ∼50% of them were shoots; our approach failed to identify them, lowering the recall. On the other hand, all the detected events were correct.

Matching    Prec    Rec     F-score
loose       1.000   0.448   0.619
partial     0.923   0.414   0.572
complete    0.846   0.379   0.523

Table 2.26: Performance on Belgium-Italy.

Figure 2.40: Sub-events in Belgium-Italy.

2.8.6 Related Work on events extraction

Existing approaches to extract events from tweets can be divided into two main categories, namely closed-domain and open-domain event detection systems [19]. In the closed-domain scenario, approaches mainly focus on extracting a particular type of event, for instance natural disasters [359]. Works in the closed-domain scenario are usually cast as supervised classification tasks that rely on keywords to extract event-related messages from Twitter [468], to recognize event patterns [383], or to define labels for training a classifier [11, 401]. The open-domain scenario is more challenging, since it is not limited to a specific type of event and usually relies on unsupervised models. Among the works applying an unsupervised approach to event detection on Twitter, [307] create event clusters from tweets using NE mentions as the central terms driving the clusters; tweets mentioning the same entities are thus grouped together in a single cluster. Experiments on a public dataset of tweets show that this strategy outperforms other approaches such as Locality Sensitive Hashing [376]. Similarly, [217] create clusters based on the cosine similarity among tweets. Neither work considers the temporal aspect of events, and both fail to capture terms or entities involved in different events at different time periods.

Very recently, [497] used a non-parametric Bayesian Mixture Model leveraging word embeddings to create event clusters from tweets. In this approach, events are modeled as 4-tuples ⟨y, l, k, d⟩ capturing non-location NEs, location NEs, event keywords and date. The work focused on detecting events given a set of event-related tweets, which is however not applicable to a real scenario, where the stream of tweets can also contain messages that are not event-related. This scenario is simulated in the second experiment presented in this section. Most of the works that analyze the content of tweets for tracking sports events are based on spike detection in the stream of messages to detect sub-events. To summarize event streams, [343] propose a method that identifies spikes in the Twitter feed and selects tweets from a sub-event by scoring each of them based on a phrase graph [410]. This method may produce unexpected summaries if most of the tweets published during the spike are not related to the sub-event. [258] generate live sports summaries by prioritizing tweets published by good reporters: they first identify spikes in the stream of an event as indicators of sub-events, and then the system tries to generate a summary by measuring how explanatory a tweet is, based on the presence of players' names, team names and terms related to the event. Similarly, when a spike is detected, [9] analyze the tweets published during that period to identify the most frequent terms, which they use to describe spikes in the tweets' histogram (spikes are considered as sub-events). To summarize tweets on football, [239] create event clusters of similar documents, which are then automatically classified as relevant to football actions. In the case of sports games, spikes do not necessarily characterize a sub-event. For example, when the crowd disagrees with the referees or a player, emotional tweets expressing disagreement are published. On the other hand, actions with low importance (e.g., a shoot) or actions produced by less popular teams or players (e.g., Albania) may not produce peaks in the volume of tweets. Thus, approaches solely based on spike detection are unable to capture those actions. In our approach, we rely on Named Entities (NEs) to identify whether or not a tweet is related to a sports event. Besides, we rely on an adaptive threshold, tuned according to the actions and the team (or player) of interest, to evaluate whether or not an action should be added to the timeline.

2.9 Conclusions

In this chapter, we have described the research contributions related to extracting information from text to generate structured knowledge in different application scenarios.

In the first part of this chapter, we focused on the mining of semantic knowledge from the Web for the robotics domain, to integrate such knowledge with situated robot vision and perception, and to allow robots to continuously extend their object knowledge beyond perceptual models. First, we presented an integrated system to suggest concept labels for unknown objects observed by a mobile robot. Our system stores the spatial contexts in which objects are observed and uses these to query a Web-based suggestion system to receive a list of possible concepts that could apply to the unknown object. These suggestions are based on the relatedness of the observed objects with the unknown object, and can be improved by filtering the results based on both frequency and spatial proximity. We evaluated our system on data from real office observations and demonstrated how various filter parameters changed the match of the results to the ground truth data. Moreover, we have presented a framework for extracting manipulation-relevant knowledge about objects in the form of (binary) relations. The framework relies on a ranking measure that, given an object, ranks all entities that potentially stand in the relation in question to the given object. We rely on a representational approach that exploits distributional spaces to embed entities into low-dimensional spaces in which the ranking measure can be evaluated. We have presented results on two relations: the relation between an object and its prototypical location (locatedAt), as well as the relation between an object and one of its intended uses (usedFor). As main contribution, we have presented a supervised approach based on a neural network that, instead of using the cosine similarity as a measure of semantic relatedness, uses positive and negative examples to train a scoring function in a supervised fashion. As an avenue for future work, the generalizability of the proposed methods to a wider set of relations can be considered. In the context of manipulation-relevant knowledge for a robotic system, other interesting properties of an object include its prototypical size, weight, texture, and fragility.

In the second part of this chapter, we have addressed the task of lyrics processing. First, we addressed the segmentation of synchronized text-audio representations of songs. For the songs in the DALI corpus, where the lyrics are aligned to the audio, we have derived a measure of alignment quality specific to our task of lyrics segmentation. Then, we have shown that exploiting both textual and audio-based features leads the employed Convolutional Neural Network-based model to significantly outperform the state-of-the-art system for lyrics segmentation, which relies on purely text-based features. Moreover, we have shown that the advantage of a bimodal segment representation persists even when the alignment is noisy. This indicates that a lyrics segmentation model can be improved in most situations by enriching the segment representation with another modality (such as audio). Second, we have defined and addressed the task of lyrics summarization. We have applied both generic unsupervised text summarization methods (TextRank and a topic-based method we called TopSum), and a method inspired by audio thumbnailing, on 50k lyrics from the WASABI corpus.
We have carried out an automatic evaluation of the produced summaries, computing standard metrics in text summarization, and a human evaluation with 26 participants, showing that, using a fitness measure transferred from the musicology literature, we can amend generic text summarization algorithms and produce better summaries. In future work, we plan to address the challenging task of abstractive summarization over song lyrics, with the goal of creating a summary of song texts in prose style - more similar to what humans would do, using their own words. Finally, we have described the WASABI dataset of songs, focusing in particular on the lyrics annotations resulting from the application of the methods we proposed to extract relevant information from the lyrics. So far, lyrics annotations concern their structure segmentation, their topic, the explicitness of the lyrics content, the summary of a song and the emotions conveyed. Some of these annotation layers are provided for all the 1.73M songs included in the WASABI corpus, while others apply to subsets of the corpus, due to the various constraints previously described. In [20] the authors have studied how song writers influence each other; we aim to learn a model that detects the border between heavy influence and plagiarism.

In the last part of this chapter, we focused on events extraction from Twitter messages, and we described a model for detecting open-domain events from tweets by modeling relationships between NE mentions and terms in a directed graph. The proposed approach is unsupervised and can automatically detect fine-grained events without prior knowledge of the number or type of events. Our experiments on two gold-standard datasets show that the approach yields state-of-the-art results. In the future, we plan to investigate whether linking terms to ontologies (e.g., DBpedia, YAGO) can help in detecting different mentions of the same entity, as preliminarily shown in [158]. This could be used to reduce the density of the event graph. Another possible improvement would be to enrich the content of the tweets with information from external web pages, by resolving the URLs in the tweets. As a second study on events extraction, we have described a framework to generate timelines of the salient sub-events in sports games, exploiting the information contained in tweets. Experiments on a set of tweets collected during EURO 2016 proved that our approach accurately detects sub-events in sports games when compared to news on the same events reported by sports media. While previous approaches focused only on detecting the type of the most important sub-events, we extract and model a richer set of information, including almost every type of sub-event and the participants involved in the actions. Possible improvements for future work include the extension of our approach to other sports (e.g., American football).

Chapter 3

Natural language interaction with the Web of Data

This Chapter is dedicated to my contributions addressing the challenge of enhancing users' interactions with the web of data by mapping natural language expressions (e.g. user queries) with concepts and relations in a structured knowledge base. These research contributions fit the areas of Natural Language Processing and Semantic Web, and have been published in several venues:

• Elena Cabrio, Serena Villata, Alessio Palmero Aprosio (2017). A RADAR for information reconciliation in Question Answering systems over Linked Data. Semantic Web 8(4): 601-617 [93].

• Elena Cabrio, Serena Villata, Julien Cojan, Fabien Gandon (2014). Classifying Inconsistencies in DBpedia Multilingual Chapters, in Proceedings of the Language Resources and Evaluation Conference (LREC-2014), pp. 1443-1450 [95].

• Elena Cabrio, Julien Cojan, Fabien Gandon (2014). Mind the cultural gap: bridging language specific DBpedia chapters for Question Answering, in Philipp Cimiano, Paul Buitelaar (eds.) Towards the Multilingual Semantic Web, pp. 137-154, Springer Verlag [84].

• Elena Cabrio, Vivek Sachidananda, Raphael Troncy (2014). Boosting QAKiS with multimedia answer visualization, Proceedings of the Extended Semantic Web Conference (ESWC 2014) - Demo/poster paper [88].

• Elena Cabrio, Alessio Palmero Aprosio, Serena Villata (2014). Reconciling Information in DBpedia through a Question Answering System, Proceedings of the 13th International Semantic Web Conference (ISWC 2014). Demo paper [87].

• Elena Cabrio, Julien Cojan, Fabien Gandon, and Amine Hallili (2013). Querying multilingual DBpedia with QAKiS, in The Semantic Web: ESWC 2013 Satellite Events, Lecture Notes in Computer Science, Volume 7955, pp. 194-198 [85].

• Julien Cojan, Elena Cabrio, Fabien Gandon (2013). Filling the Gaps Among DBpedia Multilingual Chapters for Question Answering. Proceedings of the 5th Annual ACM Web Science Conference, pp. 33-42 [119].

• Elena Cabrio, Julien Cojan, Alessio Palmero Aprosio, Fabien Gandon (2012). Natural Language Interaction with the Web of Data by Mining its Textual Side, Intelligenza Artificiale, 6 (2012), pp. 121-133. Special Issue on Natural Language Processing in the Web Era. IOS Press [82].


• Elena Cabrio, Julien Cojan, Alessio Palmero Aprosio, Bernardo Magnini, Alberto Lavelli, Fabien Gandon (2012). QAKiS: an Open Domain QA System based on Relational Patterns, Proceedings of the 11th International Semantic Web Conference (ISWC 2012). Demo paper [83].

To enhance users' interactions with the web of data, query interfaces providing a flexible mapping between natural language expressions and concepts and relations in structured knowledge bases (KB) are particularly relevant. In this direction, I have mainly focused on the automatic extraction of structured data from unstructured documents to populate RDF triple stores, and, in a Question Answering (QA) setting, on the mapping of natural language expressions (e.g. user queries) with concepts and relations in a structured KB. More specifically, I have proposed and implemented i) the WikiFramework, i.e. a methodology to collect relational patterns in several languages, and ii) QAKiS, a system for open domain QA over linked data [83]. QAKiS (Question Answering wiKiframework-based System) allows end users to submit a query to an RDF triple store in English and obtain the answer in the same language, hiding the complexity of the non-intuitive formal query languages involved in the resolution process, but exploiting the expressiveness of these standards to scale to the huge amounts of available semantic data. QAKiS addresses the task of QA over structured KBs (e.g. language-specific DBpedia chapters) where the relevant information is also expressed in unstructured form (e.g. Wikipedia pages). A crucial issue is the interpretation of the question, to convert it into a corresponding formal query. Differently from the approaches developed at the time of this work, QAKiS implements a relation-based match, where fragments of the question are matched to relational textual patterns automatically collected from Wikipedia (i.e. the WikiFramework repository). Such relation-based matching provides more precision with respect to matching on single tokens. To evaluate QAKiS in a comparable setting, we took part in the QALD-2 and QALD-3 evaluation campaigns in 2012 and 2013. The challenge (that was run until 2018) is aimed at any kind of QA system that mediates between a user, expressing his or her information need in natural language, and semantic data. A training set and a test set of 100 natural language questions each are provided by the organizers. The version of QAKiS that took part in such challenges targets only questions containing a Named Entity related to the answer through one property of the ontology. Nevertheless, QAKiS's performance is in line with the results obtained by the other participating systems. As introduced before, QAKiS is based on Wikipedia for pattern extraction. The English, French and German DBpedia chapters are the RDF datasets to be queried using a natural language interface [85]. When querying at the same time different heterogeneous interlinked datasets, the system may receive different results for the same query. The returned results can be related by a wide range of heterogeneous relations, e.g., one can be the specification of the other, an acronym of the other, etc. In other cases, such results can contain an inconsistent set of information about the same topic. A well-known example of such heterogeneous interlinked datasets are the language-specific DBpedia chapters, where the same information may be reported in different languages.
Given the growing importance of multilingualism in the Semantic Web community, and in Question Answering over Linked Data in particular, we chose to apply information reconciliation to this scenario. We have therefore addressed the issue of reconciling information obtained by querying the SPARQL endpoints of language-specific DBpedia chapters, integrating the proposed framework into QAKiS. Moreover, in [88] we extended QAKiS to exploit the structured data and metadata describing multimedia content on the linked data, to provide a richer and more complete answer to the user, combining textual information with other media content. A first step in this direction consisted in determining the best sources and media (image, audio, video, or a hybrid) to answer a query. Then, we have extended QAKiS output to include i) pictures from Wikipedia Infoboxes, for instance to visualize images of people or places, ii) OpenStreetMap, to visualize maps, and iii) YouTube, to visualize videos related to the answer.

This chapter is organized as follows: Section 3.1 describes the Question Answering system QAKiS, then Section 3.2 explains the method proposed to reconcile the inconsistent information that can be collected from multilingual data sources. Section 3.3 describes QAKiS's multimedia visualization, while Section 3.4 summarizes related work. Conclusions end the chapter.

3.1 QAKiS

To enhance users' interactions with the web of data, query interfaces providing a flexible mapping between natural language expressions and concepts and relations in structured knowledge bases are becoming particularly relevant. In this section, we present QAKiS (Question Answering wiKiframework-based System), which allows end users to submit a query to an RDF triple store in English and obtain the answer in the same language, hiding the complexity of the non-intuitive formal query languages involved in the resolution process. At the same time, the expressiveness of these standards is exploited to scale to the huge amounts of available semantic data. In its current implementation, QAKiS addresses the task of QA over structured Knowledge Bases (KBs) (e.g. DBpedia) where the relevant information is also expressed in unstructured form (e.g. Wikipedia pages). Its major novelty, with respect to the systems available at the time of this work, is to implement a relation-based match for question interpretation, to convert the user question into a query language (e.g. SPARQL). Most of the approaches up to then (for an overview, see [290]) based this conversion on some form of flexible matching between the words of the question and the concepts and relations of a triple store, disregarding the relevant context around a word, without which the match might be wrong. QAKiS instead tries first to establish a matching between fragments of the question and relational textual patterns automatically collected from Wikipedia. The underlying intuition is that a relation-based matching provides more precision with respect to matching on single tokens, as done by competitor systems. The QAKiS demo is based on Wikipedia for pattern extraction. DBpedia is the RDF dataset to be queried using a natural language interface.

Figure 3.1: QAKiS demo interface (v1). The user can either write a question or select one from a list of examples, and click on Get Answers!. QAKiS outputs: i) the user question (the recognized Named Entity (NE) is linked to its DBpedia page), ii) the generated typed question (see Section 3.1.1), iii) the pattern matched, iv) the SPARQL query sent to the DBpedia SPARQL endpoint, and v) the answer (below the green rectangle results).

QAKiS makes use of relational patterns (automatically extracted from Wikipedia and collected in the WikiFramework repository [297]) that capture different ways of expressing a certain relation in a given language. For instance, the relation crosses(Bridge,River) can be expressed in English, among others, by the following relational patterns: [Bridge crosses the River] and [Bridge spans over the River]. Assuming that there is a high probability that the information in the Infobox is also expressed in the same Wikipedia page, the WikiFramework establishes a 4-step methodology to collect relational patterns in several languages for the DBpedia ontology relations (similarly to [192],[478]): i) a DBpedia relation is mapped with all the Wikipedia pages in which such relation is reported in the Infobox; ii) in such pages we collect all the sentences containing both the domain and the range of the relation; iii) all sentences for a given relation are extracted and the domain and range are replaced by the corresponding DBpedia ontology classes; iv) the patterns for each relation are clustered according to the lemmas between the domain and the range, and sorted according to their frequency.
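A much simplified sketch of steps ii)-iv) of this methodology (sentence splitting and entity matching are reduced to naive string operations, and lemma-based clustering is replaced by simple frequency counting):

```python
import re
from collections import Counter

def collect_patterns(pages, domain_class, range_class):
    """Collect textual patterns for one DBpedia relation.

    `pages` is an iterable of (wiki_text, domain_label, range_label)
    triples, one per Wikipedia page whose infobox contains the relation.
    Sentences mentioning both labels are kept, the labels are replaced by
    the DBpedia classes of the domain and range, and the resulting
    patterns are returned sorted by frequency.
    """
    patterns = Counter()
    for text, domain_label, range_label in pages:
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            if domain_label in sentence and range_label in sentence:
                pattern = (sentence.replace(domain_label, f"[{domain_class}]")
                                   .replace(range_label, f"[{range_class}]"))
                patterns[pattern.strip()] += 1
    return patterns.most_common()

# e.g. collect_patterns(pages, "Bridge", "River") could yield
# "[Bridge] crosses the [River]" among the most frequent patterns.
```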

3.1.1 System architecture

QAKiS is composed of four main modules (Fig. 3.2): i) the query generator takes the user question as input, generates the typed questions, and then generates the SPARQL queries from the retrieved patterns; ii) the pattern matcher takes as input a typed question, and retrieves the patterns (among those in the repository) matching it with the highest similarity; iii) the sparql package handles the queries to DBpedia; and iv) a Named Entity (NE) Recognizer. The multimedia answer generator module was added later, and will be described in Section 3.3.

Figure 3.2: QAKiS workflow

The current version of QAKiS targets questions containing a NE related to the answer through one property of the ontology, as in Which river does the Brooklyn Bridge cross?. Each question matches a single pattern (i.e. one relation). Before running the pattern matcher component, the question target is identified by combining the output of the Stanford NE Recognizer with a set of strategies that compare it with the instance labels in the DBpedia ontology. Then a typed question is generated by replacing the question keywords (e.g. who, where) and the NE by their types and supertypes. A Word Overlap algorithm is then applied to match such typed questions with the patterns of each relation. A similarity score is provided for each match: the highest represents the most likely relation. A set of patterns is retrieved by the pattern matcher component for each typed question, and sorted by decreasing matching score. For each of them, a set of SPARQL queries is generated and then sent to an endpoint for answer retrieval.
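The Word Overlap matching step can be sketched as follows (a naive token-overlap score; the actual implementation may tokenize and weight terms differently):

```python
def word_overlap(typed_question, pattern):
    """Fraction of pattern tokens that also appear in the typed question."""
    question_tokens = set(typed_question.lower().split())
    pattern_tokens = set(pattern.lower().split())
    return len(question_tokens & pattern_tokens) / max(len(pattern_tokens), 1)

def best_relation(typed_question, patterns_by_relation):
    """Return the (relation, score) pair of the best-matching pattern.

    `patterns_by_relation` maps a DBpedia relation name to the list of
    textual patterns collected for it by the WikiFramework.
    """
    scored = ((relation, word_overlap(typed_question, pattern))
              for relation, patterns in patterns_by_relation.items()
              for pattern in patterns)
    return max(scored, key=lambda pair: pair[1])
```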

Multilingual DBpedia alignment. The multilingual DBpedia chapters1 have been created following the Wikipedia structure: each chapter therefore contains the data extracted from Wikipedia in the corresponding language, and thus reflects local specificities. Data from different DBpedia chapters are connected by several alignments: i) instances are aligned according to the interlanguage links, which are created by Wikipedia editors to relate articles about the same topic in different languages; ii) properties mostly come from template attributes, i.e. structured elements that can be included in Wikipedia pages so as to display structured information; these properties are aligned through mappings manually edited by the DBpedia community. Since QAKiS allows querying the English, French and German DBpedia chapters, Figure 3.3 shows the additional information coverage provided by these DBpedia chapters. The areas FR only, DE only and DE+FR only correspond to aligned data made available by the French and German DBpedia chapters.

Pairs (subject, property) with values in DBpedia chapters    Number
EN+FR+DE                                                     110 685
EN+DE only                                                    48 804
EN+FR only                                                   408 125
DE+FR only                                                   231 585
EN only                                                    1 790 972
DE only                                                    1 036 953
FR only                                                      411 782

Figure 3.3: Relative coverage of English, German and French DBpedia chapters (in 2013).

QAKiS extension to query DBpedia multilingual chapters. The QAKiS extension to query the ontology properties of the multilingual DBpedia chapters is integrated at the SPARQL package level. Instead of sending the query to the English DBpedia only, the Query manager reformulates the queries and sends them to multiple DBpedia chapters. As only the English chapter contains labels in English, this change has no impact on the NE Recognition. The main difference is in the query selection step. As in the monolingual setting, patterns are taken iteratively by decreasing matching score; the generated query is then evaluated and, if no results are found, the next pattern is considered, and so on. However, as queries are evaluated on several DBpedia chapters, it is more likely to get results, terminating the query selection with a higher matching score. Currently, the results of a SPARQL query are aggregated by set union. Other strategies could be considered, such as using a voting mechanism to select the most frequent answer, or enforcing a priority according to data provenance (e.g. the English chapter could be considered as more reliable for questions related to English culture).
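A minimal sketch of this multilingual querying step, assuming the public SPARQL endpoints of the English, French and German chapters and using the SPARQLWrapper Python library (the projected variable name ?answer is an assumption of the example):

```python
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINTS = [
    "http://dbpedia.org/sparql",      # English chapter
    "http://fr.dbpedia.org/sparql",   # French chapter
    "http://de.dbpedia.org/sparql",   # German chapter
]

def query_all_chapters(sparql_query):
    """Send the same SPARQL query to every chapter and union the answers."""
    answers = set()
    for endpoint in ENDPOINTS:
        client = SPARQLWrapper(endpoint)
        client.setQuery(sparql_query)
        client.setReturnFormat(JSON)
        results = client.query().convert()
        for binding in results["results"]["bindings"]:
            answers.add(binding["answer"]["value"])
    return answers

# The example assumes the query projects a variable named ?answer, e.g.
# "SELECT ?answer WHERE { <http://dbpedia.org/resource/Brooklyn_Bridge>
#                         <http://dbpedia.org/ontology/crosses> ?answer }"
```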

3.1.2 Experimental evaluation

QALD-2 dataset. Table 3.1 reports QAKiS's results on the QALD-2 data sets2 (DBpedia track). For the demo, we focused on code optimization, reducing QAKiS's average processing time per question from 15 to 2 seconds with respect to the version used for the challenge. Most of QAKiS's mistakes concern wrong relation assignment (i.e. wrong pattern matching). Another issue concerns question ambiguity, i.e. the same surface form can in fact refer to different relations in the DBpedia ontology. We plan to cluster relations sharing several patterns, to allow QAKiS to search among all the relations in the cluster.

1 http://wiki.dbpedia.org/Internationalization/Chapters
2 http://greententacle.techfak.uni-bielefeld.de/~cunger/qald/

        Precision   Recall   F-measure   # answered   # right answ.   # partially right
train   0.476       0.479    0.477       40/100       17/40           4/40
test    0.39        0.37     0.38        35/100       11/35           4/35

Table 3.1: QAKiS performances on DBpedia data sets (participation to QALD-2)

The partially correct answers concern questions involving more than one relation: the current version of the algorithm detects only one of them; we would therefore need to improve the system to address questions with multiple (often nested) relations.

QALD-2 dataset reduced. Since QAKiS currently targets only questions containing a NE related to the answer through one property of the ontology (e.g. In which military conflicts did Lawrence of Arabia participate?), we extracted from the QALD-2 test set the questions corresponding to such criterion (i.e. 32 questions). The discarded questions require either some form of reasoning (e.g. counting or ordering) on data, aggregation (from datasets different from DBpedia), involve n-ary relations, or are boolean questions. We ran QAKiS_EN (i.e. the system taking part in the challenge), as well as QAKiS_EN+FR and QAKiS_EN+DE (the versions enriched with the French and German DBpedia, respectively), on the reduced set of questions. Since the answers to QALD-2 questions can be retrieved in the English DBpedia, we do not expect multilingual QAKiS to improve its performance. On the contrary, we want to verify that QAKiS's performance does not decrease (due to the choice of a wrong relation triggered by a different pattern that finds an answer in the multilingual DBpedia). Even if the extended QAKiS often selects different patterns with respect to the original system, the selected relation is the same (except in one case), meaning that performance is generally not worsened by the addition of the multilingual DBpedia chapters.

Multilingual DBpedia. Since at the time of this work there existed no reference list of questions whose answers can be found only in the French or German DBpedia, we created our own list to evaluate the improvement in QAKiS's coverage as follows: i) we take the sample of 32 QALD-2 questions described before; ii) we extract the list of triples present in the French and German DBpedia only; iii) in each question we substitute the NE with another entity for which the asked relation can be found only in the French or German chapter, respectively. For instance, for the QALD-2 question How tall is Michael Jordan?, we substitute the NE Michael Jordan with the entity Margaret Simpson, for which we know that the relation height is missing in the English DBpedia, but is present in the French chapter. As a result, we obtain the question How tall is Margaret Simpson?, which we submit to QAKiS_EN+FR. Following the same procedure for German, in Who developed Skype? we substituted the NE Skype with the entity IronPython, obtaining the question Who developed IronPython?.3 For some properties (e.g. Governor, Battle), no additional links are provided by the multilingual chapters, so we discarded the questions asking for those relations. QAKiS's precision on the new set of questions over the French and German DBpedia is in line with that of QAKiS_EN on the English DBpedia (∼50%). The goal of this evaluation was not to show improved performance of the extended version of QAKiS with respect to its precision, but to show that the integration of the multilingual DBpedia chapters in the system is easily achievable, and that the expected improvements in its coverage are really promising and worth exploring (see Figure 3.3). As a double check, we ran the same set of questions on QAKiS_EN, and in no case was it able to detect the correct answer, as expected.

3 The obtained set of transformed questions is available online at http://dbpedia.inria.fr/qakis/.

3.2 RADAR 2.0: a framework for information reconciliation

In the Web of Data, it is possible to retrieve heterogeneous information items concerning a single real-world object coming from different data sources, e.g., the results of a single SPARQL query on different endpoints. It is not always the case that these results are identical: they may conflict with each other, or they may be linked by some other relation, like a specification. The automated detection of the kind of relationship holding between different instances of a single object, with the goal of reconciling them, is an open problem for consuming information in the Web of Data. In particular, this problem arises while querying the language-specific chapters of DBpedia [310]. Such chapters, well connected through Wikipedia instance interlinking, can in fact contain different information with respect to the English version. Assuming we wish to query a set of language-specific DBpedia SPARQL endpoints with the same query, the answers we collect can be identical, in some kind of specification relation, or contradictory. Consider for instance the following example: we query a set of language-specific DBpedia chapters about How tall is the soccer player Stefano Tacconi?, receiving the following information: 1.88 from the Italian and the German chapters, 1.93 from the French chapter, and 1.90 from the English one. How can I know what is the "correct" (or better, the most reliable) information, knowing that the height of a person is unique? Addressing this kind of issue is the goal of the present section. More precisely, in this section we answer the research question:

• How to reconcile information provided by the language-specific chapters of DBpedia?

This open issue is particularly relevant to Question Answering (QA) systems over DBpedia [290], where the user expects a unique (ideally correct) answer to her factual natural language question. A QA system querying different data sources needs to weight them in an appropriate way to evaluate the information items they provide accordingly. In this scenario, another open problem is how to explain and justify the answer the system provides to the user, in such a way that the overall QA system appears transparent and, as a consequence, more reliable. Thus, our research question breaks down into the following subquestions: i) How to automatically detect the relationships holding between information items returned by different language-specific chapters of DBpedia? ii) How to compute the reliability degree of such information items to provide a unique answer? iii) How to justify and explain the answer the QA system returns to the user?

First, we need to classify the relations connecting each piece of information to the others returned by the different data sources, i.e., the SPARQL endpoints of the language-specific DBpedia chapters. To this purpose, we propose a categorization of the relations existing between different information items retrieved with a unique SPARQL query [95] (described later on in this section). At the time of this work, such a categorization was the only one considering linguistic, fine-grained relations among the information items returned by language-specific DBpedia chapters, given a certain query. This categorization considers ten positive relations among heterogeneous information items (referring to widely accepted linguistic categories in the literature), and three negative relations meaning inconsistency.
Starting from this categorization, we propose the RADAR (ReconciliAtion of Dbpedia through ARgumentation) framework, which adopts a classification method to return the relation holding between two information items. This first step results in a graph-based representation of the result set, where each information item is a node and edges represent the identified relations. Second, we adopt argumentation theory [151], a suitable technique for reasoning about conflicting information, to assess the acceptability degree of the information items, depending on the relation holding between them and on the trustworthiness of their information source [131].

Roughly, an abstract argumentation framework is a directed labeled graph whose nodes are the arguments and the edges represent a conflict relation. Since positive relations among the arguments may hold as well, we rely on bipolar argumentation [102] that considers also a positive support relation. Third, the graph of the result set obtained after the classification step, together with the acceptability degree of each information item obtained after the argumentation step, is used to justify and explain the resulting information ranking (i.e., the order in which the answers are returned to the user). We evaluate our approach as standalone (i.e., over a set of heterogeneous values extracted from a set of language-specific DBpedia chapters), and through its integration in the QA system QAKiS, described in Section 3.1. The reconciliation module is embedded to provide a (possibly unique) answer whose acceptability degree is over a given threshold, and the graph structure linking the different answers highlights the underlying justification. Moreover, RADAR is applied to over 300 DBpedia properties in 15 languages, and the obtained resource of reconciled DBpedia language-specific chapters is released. Even if information reconciliation is a way to enhance Linked Data quality, we do not address the issue of Linked Data quality assessment and fusion [311, 73], nor ontology alignment. Finally, argumentation theory in this work is not exploited to find agreements over ontology alignments [148]. Note that our approach is intended to reconcile and explain the answer of the system to the user. Ontology alignment cannot be exploited to generate such a kind of explanations. This is why we decided to rely on argumentation theory that is a way to exchange and explain viewpoints. In this work, we have addressed the open problem of reconciling and explaining a result set from language-specific DBpedia chapters using well-known conflict detection and explanation techniques, i.e., argumentation theory.

3.2.1 Framework description

The RADAR 2.0 (ReconciliAtion of Dbpedia through ARgumentation) framework for information reconciliation is composed of three main modules (see Figure 3.4). It takes as input a collection of results from one SPARQL query raised against the SPARQL endpoints of the language-specific DBpedia chapters. Given such a result set, RADAR retrieves two kinds of information: (i) the sources proposing each particular element of the result set, and (ii) the elements of the result set themselves. The first module of RADAR (module A, Figure 3.4) takes each information source and, following two different heuristics, assigns a confidence degree to the source. Such confidence degree will affect the reconciliation, in particular with respect to possible inconsistencies: information proposed by the more reliable source will obtain a higher acceptability degree. The second module of RADAR (module B, Figure 3.4) instead starts from the result set, and matches every element with all the other returned elements, detecting the kind of relation holding between each pair of elements. The result of this module is a graph composed of the elements of the result set, connected with each other by the relations of our categorization. Both the sources associated with a confidence score and the result set in the form of a graph are then provided to the third module of RADAR, the argumentation one (module C, Figure 3.4). The aim of this module is to reconcile the result set. The module considers all positive relations as a support relation and all negative relations as an attack relation, building a bipolar argumentation graph where each element of the result set is seen as an argument. Finally, adopting a bipolar fuzzy labeling algorithm that relies on the confidence of the sources to decide the acceptability of the information, the module returns the acceptability degree of each argument, i.e., of each element of the result set. The output of the RADAR framework is twofold: first, it returns the acceptable elements (a threshold is adopted); second, the graph of the result set is provided, where each element is connected to the others by the identified relations (i.e., the explanation about the choice of the acceptable arguments returned).

Figure 3.4: RADAR 2.0 framework architecture.

Assigning a confidence score to the source

Language-specific DBpedia chapters can contain different information on particular topics, e.g. providing more information or more specific information. Moreover, the knowledge of certain instances and the conceptualization of certain relations can be culturally biased. For instance, we expect to have more precise (and possibly more reliable) information about the Italian actor Antonio Albanese in the Italian DBpedia than in the English or French ones. To trust and reward the data sources, we need to calculate the reliability of the source with respect to the contained information items. In [86], an a priori confidence score is assigned to the endpoints according to their dimensions and solidity in terms of maintenance (the English chapter is assumed to be more reliable than the others on all values, but this is not always the case). RADAR 2.0 assigns, instead, a confidence score to each DBpedia language-specific chapter depending on the queried entity, according to the following two criteria:

• Wikipedia page length. The chapter of the longest language-specific Wikipedia page describing the queried entity is considered as fully trustworthy (i.e., it is assigned a score of 1), while the others are considered less trustworthy (i.e., they are associated with a score < 1). In choosing such a heuristic, we followed [60], which demonstrates that the length of an article is a very good predictor of its precision. The length is calculated on the Wikipedia dump of the considered language (number of characters in the text, ignoring image tags and tables). Thus, the longest page is assigned a score equal to 1, and a proportional score is assigned to the other chapters.

• Entity geo-localization. The chapter of the language spoken in the places linked to the page of the entity is considered as fully trustworthy (i.e., it is assigned a score of 1), while the others are considered less trustworthy (i.e., they are associated with a score < 1). We assume that if an entity belongs to a certain place, or is frequently referred to it, it is more likely that the DBpedia chapter of that country contains updated and reliable information. All Wikipedia page hyperlinks are considered, and their presence in GeoNames4 is checked. If present, the prevalent language of the place (following the GeoNames country-language matching5) is extracted, and a score equal to 1 is assigned to the corresponding chapter.

4 http://www.geonames.org/
5 Such table connecting a country with its language can be found here: http://download.geonames.org/export/dump/countryInfo.txt

As for the page length, a proportional score is then assigned to the other chapters (e.g., if an entity has 10 links to places in Italy and 2 to places in Germany, the score assigned to the Italian DBpedia chapter is 1, while the score for the German chapter is 0.2).

These two metrics (the appropriateness of which for our purposes has been tested on the development set) are then summed and normalized into a score ranging from 0 to 1, where 0 corresponds to the least reliable chapter for a certain entity and 1 to the most reliable one. The obtained scores are then considered by the argumentation module for information reconciliation.
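A sketch of how the two criteria can be combined into a per-chapter confidence score (the exact normalization is our assumption; the text only fixes the [0, 1] range):

```python
def chapter_confidence(page_lengths, geo_link_counts):
    """Combine the two criteria into one confidence score per chapter.

    Both inputs map a language code (e.g. "it", "en", "de") to a raw
    value. Each criterion gives the best chapter a score of 1 and the
    others a proportional score; the two are summed and rescaled to
    [0, 1] (min-max rescaling is an assumption of this sketch).
    """
    chapters = set(page_lengths) | set(geo_link_counts)

    def proportional(raw):
        top = max(raw.values()) if raw else 1
        return {chapter: raw.get(chapter, 0) / top for chapter in chapters}

    length_score = proportional(page_lengths)
    geo_score = proportional(geo_link_counts)
    combined = {c: length_score[c] + geo_score[c] for c in chapters}
    low, high = min(combined.values()), max(combined.values())
    span = (high - low) or 1.0
    return {c: (s - low) / span for c, s in combined.items()}

# e.g. chapter_confidence({"it": 42000, "en": 21000}, {"it": 10, "de": 2})
```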

Relations classification

We propose a classification of the semantic relations holding among the different instances obtained by querying a set of language-specific DBpedia chapters with a certain query. More precisely, such categories correspond to the lexical and discourse relations holding among heterogeneous instances obtained by querying two DBpedia chapters at a time, given a subject and an ontological property. In the following, we list the positive relations between values resulting from the data-driven study we carried out and described in [95]. In parallel, we describe how RADAR 2.0 addresses the automatic classification of such relations.

Identity i.e., same value but in different languages (missing owl:sameAs link in DBpedia). E.g., Dairy product vs Produits laitiers

Acronym i.e., initial components in a phrase or a word. E.g., PSDB vs Partito della Social Democrazia Brasiliana

Disambiguated entity i.e., a value contains in the name the class of the entity. E.g., Michael Lewis (Author) vs Michael Lewis

Coreference i.e., an expression referring to another expression describing the same thing (in particular, non normalized expressions). E.g., William Burroughs vs William S. Burroughs

Given the high similarity among the relations belonging to these categories, we cluster them into a unique category called surface variants of the same entity. Given two entities, RADAR automatically detects the surface variants relation among them, if one of the following strategies is applicable: cross-lingual links6, text identity (i.e. string matching), Wiki redirection and disambiguation pages.

Geo-specification i.e., ontological geographical knowledge. E.g., Queensland vs Australia

Renaming i.e., reformulation of the same entity name in time. E.g., Edo, old name of

In [95], we defined renaming as referring only to geographical renaming. For this reason, we merge it into the category geo-specification. RADAR classifies a relation between two entities as falling inside this category when, in GeoNames, one entity is contained in the other (geo-specification is a directional relation between two entities). We also consider the alternative names gazetteer included in GeoNames, and geographical information extracted from a set of English Wikipedia infoboxes, such as Infobox former country7 or Infobox settlement.

6 Based on WikiData, a free knowledge base that can be read and edited by humans and machines alike, http://www.wikidata.org/, where data entered in any language is immediately available in all other languages. In WikiData, each entity has the same ID in all languages for which a Wikipedia page exists, allowing us to overcome the problem of missing owl:sameAs links in DBpedia (that was an issue in DBpedia versions prior to 3.9). Moreover, WikiData is constantly updated (we use the April 2014 release).

Meronymy i.e., a constituent part of, or a member of something. E.g., Justicialist Party is a part of Front for Victory

Hyponymy i.e., relation between a specific and a general word when the latter is implied by the former. E.g., alluminio vs metal

Metonymy i.e., the use of the name of one thing/concept for that of another thing/concept associated with it. E.g., Joseph Hanna vs Hanna-Barbera

Identity:stage name i.e., pen/stage names pointing to the same entity. E.g., Lemony Snicket vs Daniel Handler

We cluster such semantic relations into a category called inclusion.8 To detect this category of relations, RADAR exploits a set of features extracted from:

MusicBrainz9 to detect when a musician plays in a band, and when a label is owned by a bigger label.

BNCF (Biblioteca Nazionale Centrale di Firenze) Thesaurus10 for the broader term relation between common names.

DBpedia, in particular the datasets connecting Wikipedia, GeoNames and MusicBrainz through the owl:sameAs relation.

WikiData for the part of, subclass of and instance of relations. It contains links to GeoNames, BNCF and MusicBrainz, integrating DBpedia owl:sameAs.

Wikipedia, which contains hierarchical information in infoboxes (e.g. the property parent for companies, product for goods, alter ego for biographies), categories (e.g., Gibson guitars), “see also” sections, and links in the first sentence (e.g., Skype was acquired by [United States]-based [Microsoft Corporation]).

Inclusion is a directional relation between two entities (the rules we apply to detect meronymy, hyponymy and stage name allow us to track the direction of the relation, i.e. whether a → b or b → a). Moreover, in the classification proposed in [95], the following negative relations (i.e., value mismatches) among possibly inconsistent data are identified:

Text mismatch i.e. unrelated entity. E.g., Palermo vs Modene

Date mismatch i.e. different date for the same event. E.g., 1215-04-25 vs 1214-04-25

7 For instance, we extract the property “today” connecting historical entity names with the current ones (reconcilable with GeoNames). We used Wikipedia dumps.
8 Royo [397] defines both meronymy and hyponymy as relations of inclusion, although they differ in the kind of inclusion defined (hyponymy is a relation of the kind “B is a type of A”, while meronymy relates a whole with its different parts or members). Slightly extending Royo’s definition, we also added to this category the relation of metonymy, a figure of speech scarcely detectable by automatic systems due to its complexity (and stage name, which can be considered a particular case of metonymy, i.e., the name of the character for the person herself).
9 http://musicbrainz.org/
10 http://thes.bncf.firenze.sbn.it/

Figure 3.5: Example of (a) an AF, (b) a bipolar AF, and (c) the example provided in the introduction modeled as a bipolar AF, where single lines represent attacks and double lines represent support.

Numerical mismatch i.e. different numerical values. E.g., 1.91 vs 1.8

RADAR labels a relation between instances (i.e., URIs) as negative if every attempt to find one of the positive relations described above fails (i.e., negation as failure). For numerical values, a numerical mismatch identifies different values.11 The reader may argue that a machine learning approach could have been applied to this task, but a supervised approach would have required an annotated dataset to learn the features, and at the moment no such training set is available to the research community. Moreover, since our goal is to produce a resource as precise as possible for future reuse, a rule-based implementation allows us to tune RADAR to reward precision in our experiments.
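The overall classification strategy can be sketched as follows; the per-category detectors passed as arguments (is_surface_variant, is_geo_specification, is_inclusion) are hypothetical placeholders standing in for the rule-based checks described above.

```python
# Minimal sketch (hypothetical helpers): classify the relation between two values
# returned by different DBpedia chapters, using negation as failure for mismatches.
from datetime import date

def classify_relation(a, b, is_surface_variant, is_geo_specification, is_inclusion):
    # the positive relations are tried first
    if is_surface_variant(a, b):
        return "surface_variant"
    if is_geo_specification(a, b):
        return "geo_specification"
    if is_inclusion(a, b):
        return "inclusion"
    # negation as failure: no positive relation found, so label a mismatch,
    # typed according to the kind of the values
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return "numerical_mismatch"  # no tolerance is applied (see footnote 11)
    if isinstance(a, date) and isinstance(b, date):
        return "date_mismatch"
    return "text_mismatch"

# Hypothetical usage with detectors that always fail
never = lambda x, y: False
print(classify_relation(1.91, 1.8, never, never, never))  # -> numerical_mismatch
```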

Argumentation-based information reconciliation

This subsection begins with a brief overview of abstract argumentation theory, and then details the RADAR 2.0 argumentation module.

An abstract argumentation framework (AF) [151] aims at representing conflicts among elements called arguments, whose role is determined only by their relation with other arguments. An AF encodes, through the conflict (i.e., attack) relation, the existing conflicts within a set of arguments. It is then interesting to identify the conflict outcomes, which, roughly speaking, means determining which arguments should be accepted and which should be rejected, according to some reasonable criterion. The set of accepted arguments of an argumentation framework is a set of arguments that does not contain an argument conflicting with another argument in the set. Dung [151] presents several acceptability semantics that produce zero, one, or several consistent sets of accepted arguments. Roughly, an argument is accepted (i.e., labelled in) if all the arguments attacking it are rejected, and it is rejected (i.e., labelled out) if it has at least one accepted argument attacking it. Figure 3.5.a shows an example of an AF. The arguments are visualized as nodes of the argumentation graph, and the attack relation as edges. Gray arguments are the accepted ones. Using Dung’s admissibility-based semantics [151], the set of accepted arguments is {b, c}. For more details about acceptability semantics, we refer the reader to Baroni et al. [32].

However, associating a crisp label, i.e., in or out, with the arguments is limiting in a number of real-life situations where a numerical value expressing the acceptability degree of each argument is required [153, 131, 236]. In particular, da Costa Pereira et al. [131] have proposed a fuzzy labeling algorithm to account for the fact that arguments may originate from sources that are trusted only to a certain degree. They define the fuzzy labeling of an argument A as α(A) = min{A(A), 1 − max_{B: B→A} α(B)}, where A(A) is given by the trust degree of the most reliable source that offers argument A, and B is an argument attacking A.

We say that α(A) is the fuzzy label of argument A. Consider the example in Figure 3.5.a: if we have A(a) = A(b) = A(c) = 0.8, then the algorithm returns the following labeling: α(a) = 0.2 and α(b) = α(c) = 0.8.

Since we want to take into account the confidence associated with the information sources to compute the acceptability degree of arguments, we rely on the computation of fuzzy confidence-based degrees of acceptability. As the fuzzy labeling algorithm [131] assumes a scenario where the arguments are connected by an attack relation only, in Cabrio et al. [86] we proposed a bipolar version of this algorithm, which also considers a positive, i.e., support, relation among the arguments (bipolar AFs) when computing the fuzzy labels of the arguments. Let A be a fuzzy set of trustful arguments, and let A(A) be the membership degree of argument A in A. A(A) is given by the trust degree of the most reliable (i.e., trusted) source12 that offers argument A, and is defined as A(A) = max_{s ∈ src(A)} τ_s, where τ_s is the degree to which source s ∈ src(A) is evaluated as reliable. The starting confidence degree associated with the sources is provided by RADAR's first module.

The bipolar fuzzy labeling algorithm [86] assumes that the following two constraints hold: (i) an argument cannot attack and support another argument at the same time, and (ii) an argument cannot support an argument attacking it, and vice versa. These constraints underlie the construction of the bipolar AF itself. In the following, the attack relation is represented with →, and the support relation with ⇒.

11 At the moment no tolerance is admitted: if, e.g., the height of a person differs by a few millimeters in two DBpedia chapters, the relation is labeled as numerical mismatch. We plan to add such a tolerance for information reconciliation as future work.

Definition 1 Let ⟨A, →, ⇒⟩ be an abstract bipolar argumentation framework, where A is a fuzzy set of (trustful) arguments, and → ⊆ A × A and ⇒ ⊆ A × A are two binary relations called attack and support, respectively. A bipolar fuzzy labeling is a total function α : A → [0, 1].

Such an α may also be regarded as (the membership function of) the fuzzy set of acceptable arguments, where the label α(A) = 0 means that the argument is outright unacceptable, and α(A) = 1 that it is fully acceptable. All intermediate values express the degree of acceptability of the arguments, which may be considered accepted in the end if they exceed a certain threshold. A bipolar fuzzy labeling is defined as follows13, where B is an argument attacking A and C is an argument supporting A:

Definition 2 (Bipolar Fuzzy Labeling) A total function α : A → [0, 1] is a bipolar fuzzy labeling iff, for all arguments A, α(A) = avg{min{A(A), 1 − max_{B: B→A} α(B)}; max_{C: C⇒A} α(C)}.

When the argumentation module receives the elements of the result set linked by the appropriate relations and the confidence degree associated with each source, the bipolar fuzzy labeling algorithm is applied to the argumentation framework to obtain the acceptability degree of each argument. In case of cyclic graphs, the algorithm starts by assigning the trustworthiness degree of the source to each node, and the values then converge in a finite number of steps to the final labels. Note that when the argumentation framework is composed of a cycle only, all labels become equal to 0.5. Consider the example in Figure 3.5.b: if we have A(a) = A(d) = 1, A(b) = 0.4 and A(c) = 0.2, then the fuzzy labeling algorithm returns the following labels: α(a) = α(b) = 0.4, α(c) = 0.6, and α(d) = 1. The step-by-step computation of the labels is shown in Table 3.2.

12 We follow da Costa Pereira et al. [131] in choosing the max operator (“optimistic” assignment of the labels), but the min operator may be preferred for a pessimistic assignment.
13 For more details about the bipolar fuzzy labeling algorithm, see Cabrio et al. [86].

Table 3.2: BAF: a → b, b → c, c → a, d ⇒ c

t    αt(a)   αt(b)   αt(c)   αt(d)
0    1       0.4     0.2     1
1    0.9     0.2     0.6     ↓
2    0.65    0.15    ↓
3    0.52    0.25
4    0.46    0.36
5    0.43    0.4
6    0.41    ↓
7    0.4
8    ↓

Figure 3.5.c shows how the example provided in the introduction is modeled as a bipolar argumentation framework, where we expect the Italian DBpedia chapter to be the most reliable one, given that Stefano Tacconi is an Italian soccer player. The result returned by the bipolar argumentation framework is that the trusted answer is 1.88. A more precise instantiation of this example in the QA system is shown in the next section.

The fact that an argumentation framework can be used to provide an explanation and justify positions is witnessed by a number of applications in different contexts [45], for instance practical reasoning [464], legal reasoning [46, 52], and medical diagnosis [237]. This is the reason why we chose this formalism to reconcile information, compute the set of reliable information items, and finally justify this result. Other possible solutions would be (weighted) voting mechanisms, where the preferences of some voters, i.e., the most reliable information sources, carry more weight than the preferences of other voters. However, voting mechanisms do not consider the presence of (positive and negative) relations among the items within the list, and no justification beyond the basic trustworthiness of the sources is provided to motivate the ranking of the information items. Notice that argumentation is needed in our use case because we have to take into account the trustworthiness of the information sources, and it provides an explanation of the ranking, which is not possible with simple majority voting. Argumentation theory, used as a conflict detection technique, allows us to detect inconsistencies, consider the trustworthiness evaluation of the information sources, and propose a single answer to the users. As far as we know, RADAR integrated in QAKiS is the first example of a QA over Linked Data system coping with this problem and providing a solution. Simpler methods would not cover both aspects mentioned above. We use bipolar argumentation instead of non-bipolar argumentation because we have not only the negative conflict relation but also the positive support relation among the elements of the result set.
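To make the iteration concrete, the following minimal sketch reproduces this computation under the assumption of a damped update for attacked-only arguments (in the spirit of da Costa Pereira et al. [131]) and the avg-based rule of Definition 2 for supported arguments; the exact update schedule of the algorithm in Cabrio et al. [86] may differ in its details, so this is an illustration rather than the reference implementation.

```python
# Minimal sketch of an iterative bipolar fuzzy labeling (illustrative, assumed
# update schedule): damped update for attacked-only arguments, avg-based rule
# of Definition 2 when supporters are present.

def bipolar_fuzzy_labeling(trust, attacks, supports, eps=1e-3, max_iter=200):
    """trust: dict argument -> A(arg); attacks, supports: lists of (source, target)."""
    alpha = dict(trust)  # labels start from the trust degrees of the sources
    for _ in range(max_iter):
        new = {}
        for a in alpha:
            attackers = [s for (s, t) in attacks if t == a]
            supporters = [s for (s, t) in supports if t == a]
            base = min(trust[a], 1 - max((alpha[b] for b in attackers), default=0.0))
            if supporters:
                # Definition 2: average of the attack-based term and the best supporter
                new[a] = (base + max(alpha[c] for c in supporters)) / 2
            else:
                # damped update towards the attack-based term
                new[a] = (alpha[a] + base) / 2
        converged = max(abs(new[a] - alpha[a]) for a in alpha) < eps
        alpha = new
        if converged:
            break
    return alpha

# Example of Figure 3.5.b / Table 3.2: a -> b, b -> c, c -> a, d => c
labels = bipolar_fuzzy_labeling(
    trust={"a": 1.0, "b": 0.4, "c": 0.2, "d": 1.0},
    attacks=[("a", "b"), ("b", "c"), ("c", "a")],
    supports=[("d", "c")],
)
print({k: round(v, 2) for k, v in labels.items()})
# approximately {'a': 0.4, 'b': 0.4, 'c': 0.6, 'd': 1.0}
```

On the example of Figure 3.5.b, the sketch converges, up to rounding, to the labels reported above (α(a) = α(b) = 0.4, α(c) = 0.6, α(d) = 1); at the fixed point these labels indeed satisfy Definition 2, e.g. α(c) = avg{min{0.2, 1 − α(b)}; α(d)} = avg{0.2, 1} = 0.6, assuming the formula reduces to the attack-based term when an argument has no supporters.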

3.2.2 Experimental setting and evaluation

In the following, we describe the dataset on which we evaluate the RADAR framework, and we discuss the obtained results. Moreover, we describe the resource of reconciled DBpedia information we create and release.

Dataset

To evaluate the RADAR framework, we created an annotated dataset of possibly inconsistent information in DBpedia language-specific chapters (to our knowledge, the first dataset of this kind) [95]. It is composed of 400 annotated pairs of values (extracted from the English, French and Italian DBpedia chapters), a sample that is assumed to be representative of the linguistic relations holding between values in DBpedia chapters. Note that the size of the DBpedia chapter does not bias the type of relations identified among the values, nor their distribution, meaning that given a specific property, each DBpedia chapter deals with that property in the same way. We randomly divided this dataset into a development set (to tune RADAR) and a test set, keeping the proportions of the category distribution.14 Table 3.3 reports the dataset statistics, showing how many annotated relations belong to each of the categories described earlier in this section.

Table 3.3: Statistics on the dataset used for RADAR 2.0 evaluation

Dataset    # triples   # annotated positive relations              # annotated negative relations
                       Surface-form  Geo-specific.  Inclusion      Text mism.  Date mism.  Numerical mism.
Dev set    104         28            18             20             13          13          12
Test set   295         84            48             55             36          37          35
Total      399         112           66             75             49          50          47

3.2.3 Results and discussion

Table 3.4 shows the results obtained by RADAR on the relation classification task on the test set. As a baseline, we apply an algorithm exploiting only cross-lingual links (using WikiData) and exact string matching. Since we want to produce a resource as precise as possible for future reuse, RADAR has been tuned to reward precision (i.e., so that it does not generate false positives for a category) at the expense of recall (errors follow from the generation of false negatives for the positive classes). As expected, the highest recall is obtained on the surface form category (our baseline performs even better than RADAR on this category). The geo-specification category has the lowest recall, either due to missing alignments between DBpedia and GeoNames (e.g. Ixelles and Bruxelles are not connected in GeoNames), or to the complexity of the values in the renaming subcategory (e.g., Paris vs First French Empire, or Harburg (quarter) vs Hambourg). In general, the results obtained are quite satisfying, fostering future work in this direction.

Table 3.4: Results of the system on relation classification

System      Relation category    Precision   Recall   F1
RADAR 2.0   surface form         0.91        0.83     0.87
            geo-specification    0.94        0.60     0.73
            inclusion            0.86        0.69     0.77
            overall positive     1.00        0.74     0.85
            text mismatch        0.45        1        0.62
baseline    surface form         1.00        0.44     0.61
            geo-specification    0.00        0.00     0.00
            inclusion            0.00        0.00     0.00
            overall positive     1.00        0.21     0.35
            text mismatch        0.21        1        0.35

Since we consider text mismatch as a negative class (Section 3.2.1), it includes the cases in which RADAR fails to correctly classify a pair of values into one of the positive classes. For date and numerical mismatches, F1 = 1 (detecting them is actually a trivial task, and therefore they are not included in Table 3.4; see footnote 11). Overall positive means that RADAR correctly understands that the different answers to a certain query are all correct and not conflicting.

RADAR precision in this case is 1, and it is important to underline this aspect in the evaluation, since it confirms the reliability of the released reconciled DBpedia in this respect. The overall positive result is higher than the partial results because, when computing precision for the individual categories, a surface form relation wrongly labeled as geo-specification, for example, counts both as a false negative for surface form and as a false positive for geo-specification. This means that RADAR is very precise in assigning positive relations, but it may provide a less precise classification into finer-grained categories.

14 The dataset is available at http://www.airpedia.org/radar-1.0.nt.bz2. The original work is based on DBpedia 3.9, but we updated it to DBpedia 2014; we therefore deleted one pair, since the DBpedia page of one of the annotated entities no longer exists.

Reconciled DBpedia resource

We applied RADAR 2.0 to 300 DBpedia properties, the most frequent in terms of the number of chapters mapping them, corresponding to 47.8% of all properties in DBpedia. We considered ∼5M Wikipedia entities. The resulting resource, a sort of universal DBpedia, contains ∼50M reconciled triples from 15 DBpedia chapters: Bulgarian, Catalan, Czech, German, English, Spanish, French, Hungarian, Indonesian, Italian, Dutch, Polish, Portuguese, Slovenian, Turkish. Notice that we did not consider endpoint availability as a requirement when choosing the DBpedia chapters: the data are directly extracted from the resource.

For functional properties, the RADAR framework is applied as previously described. In contrast, the strategy to reconcile the values of non-functional properties is slightly different: when a list of values is admitted (e.g. for the properties child or instruments), RADAR merges the lists of elements provided by the DBpedia chapters, and ranks them with respect to the confidence assigned to their source, after reconciling positive relations only (for lists there is no way to tell whether an element is incorrect or just missing, e.g. in the list of the instruments played by John Lennon); a small sketch of this merge-and-rank strategy is given below. Since the distinction between functional and non-functional properties is not precise in DBpedia, we manually annotated the 300 properties with respect to this classification, to allow RADAR to apply the correct reconciliation strategy and to produce a reliable resource. In total, we reconciled 3.2 million functional property values, with an average accuracy that can be derived from the precision and recall reported in Table 3.4.

Moreover, we carried out a merge and a light-weight reconciliation of DBpedia classes, applying the strategy called “DBpedia CL” in [14], where “CL” stands for cross-language (e.g., Michael Jackson is classified as a Person in the Italian and German DBpedia chapters, an Artist in the English DBpedia, and a MusicalArtist in the Spanish DBpedia). As Person, Artist and MusicalArtist lie on the same path from the root of the DBpedia ontology, all of them are kept and used to classify Michael Jackson.
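The merge-and-rank strategy for non-functional properties can be sketched as follows; the grouping helper, the confidence values and the toy surface-variant test are hypothetical, and the snippet only illustrates the idea of merging surface variants and ranking by source confidence.

```python
# Minimal sketch (not RADAR's actual code): reconcile a non-functional property by
# merging values from several chapters and ranking them by source confidence.

def reconcile_list(values_by_chapter, confidence, same_entity):
    """values_by_chapter: dict chapter -> list of values;
    confidence: dict chapter -> score in [0, 1];
    same_entity: function deciding whether two values are surface variants."""
    merged = []  # list of (representative value, best confidence seen so far)
    for chapter, values in values_by_chapter.items():
        for value in values:
            for i, (repr_value, score) in enumerate(merged):
                if same_entity(value, repr_value):
                    merged[i] = (repr_value, max(score, confidence[chapter]))
                    break
            else:
                merged.append((value, confidence[chapter]))
    # rank the merged values by the confidence of their most reliable source
    return sorted(merged, key=lambda pair: pair[1], reverse=True)

# Hypothetical usage for the instruments played by John Lennon
print(reconcile_list(
    {"en": ["Guitar", "Piano"], "fr": ["Guitare", "Harmonica"]},
    confidence={"en": 0.9, "fr": 0.7},
    same_entity=lambda a, b: a.lower()[:5] == b.lower()[:5],  # toy surface-variant test
))
```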

3.2.4 Integrating RADAR in a QA system

We integrate RADAR into a QA system over language-specific DBpedia chapters, given the importance that information reconciliation has in this context. Indeed, a user expects a unique (and preferably correct) answer to her factual natural language question, and would not trust a system providing her with different and possibly inconsistent answers coming out of a black box. A QA system querying different data sources therefore needs to weight such sources appropriately in order to evaluate the information items they provide accordingly.

As the QA system we selected QAKiS, described in Section 3.1. As explained before, in QAKiS the SPARQL query created after the question interpretation phase is sent to the SPARQL endpoints of the language-specific DBpedia chapters (i.e., English, French, German and Italian) for answer retrieval. The set of answers retrieved from each endpoint is then sent to RADAR 2.0 for answer reconciliation. To test the integration of RADAR into QAKiS, the user can select the DBpedia chapter she wants to query besides English (which must be selected as it is needed for NE recognition), i.e., the French, German or Italian DBpedia. Then the user can either write a question or select from a list of examples.

Figure 3.6: QAKiS + RADAR demo (functional properties)

Figure 3.7: QAKiS + RADAR demo (non-functional properties)

Figure 3.8: Example for the question Who developed Skype?

Clicking on the tab Reconciliation, a graph with the answers provided by the different endpoints and the relations among them is shown to the user (as shown in Figures 3.6 and 3.7 for the questions How tall is Stefano Tacconi? and List the children of Margaret Thatcher, respectively). Each node has an associated confidence score, resulting from the fuzzy labeling algorithm. Moreover, each node is related to the others by a relation of support or attack, and a further specification of such relations according to the categories previously described is provided to the user as a justification of why the information items have been reconciled and ranked in this way.

Looking at these examples, the reader may argue that the former question can be answered by simple majority voting (Figure 3.6), and the latter by grouping based on surface forms (Figure 3.7), without the need to introduce the complexity of the argumentation machinery. However, if we consider the following example from our dataset, the advantage of using argumentation theory becomes clear. For the question Who developed Skype?, we retrieve three different answers, namely Microsoft (from the FR DBpedia), Microsoft Skype Division (from the FR DBpedia), and Skype Limited (from the EN DBpedia). The relations assigned by RADAR are visualized in Figure 3.8. The system returns, with the associated weights, first Microsoft (FR) with a confidence score of 0.74, and second Skype Limited (EN, FR) with a confidence score of 0.61. Note that this result cannot be achieved with simple majority voting nor with grouping based on surface forms.

QA experimental setting

To provide a quantitative evaluation of the integration of RADAR into QAKiS on a standard dataset of natural language questions, we consider the questions provided by the organizers of QALD-5 for the DBpedia track (the most recent edition of the challenge at the time of this work).15 More specifically, we collect the question sets of QALD-2 (i.e. 100 training questions and 100 test questions), the test set of QALD-4 (i.e. 50 questions), and the question sets of QALD-5 (50 additional training questions with respect to the previous years' training set, and 59 test questions).

These 359 questions correspond to all the questions released in the five years of the QALD challenge (given that the questions of QALD-1 are included in the question set of QALD-2, the question set of QALD-3 is the same as QALD-2 but translated into 6 languages, and the training sets of QALD-4 and 5 include all the questions of QALD-2). QALD-3 also provides natural language questions for the Spanish DBpedia, but since the current version of QAKiS cannot query the Spanish DBpedia, we could not use this question set. From this reference dataset of 359 questions, we extract the questions that the current version of QAKiS is built to address (i.e. questions containing a NE related to the answer through one property of the ontology), corresponding to 26 questions in the QALD-2 training set, 32 questions in the QALD-2 test set, 12 in the QALD-4 test set, 18 in the QALD-5 training set, and 11 in the QALD-5 test set. The discarded questions either require some form of aggregation (e.g., counting or ordering) or information from datasets other than DBpedia, involve n-ary relations, or are boolean questions. We consider these 99 questions as the QALD reference dataset for our experiments.

15 http://www.sc.cit-ec.uni-bielefeld.de/qald/

Results on QALD answers reconciliation

We ran the questions contained in our QALD reference dataset on the English, German, French and Italian chapters of DBpedia. Since the QALD questions were created to query the English chapter of DBpedia only, it turned out that only in 43/99 cases do at least two endpoints provide an answer (in all the other cases the answer is provided by the English chapter only, which is not useful for our purposes). For instance, given the question Who developed Skype?, the English DBpedia provides Skype Limited as the answer, while the French one outputs Microsoft and Microsoft Skype Division. Or, given the question How many employees does IBM have?, the English and German DBpedia chapters provide 426751 as the answer, while the French DBpedia provides 433362. Table 3.6 lists these 43 QALD questions, specifying which DBpedia chapters (among the English, German, French and Italian ones) contain at least one value for the queried relation. This list of questions is the reference question set for our evaluation.

We evaluated the ability of RADAR 2.0 to correctly classify the relations among the information items provided by the different language-specific SPARQL endpoints as answers to the same query, with respect to a manually annotated gold standard built following the methodology in Cabrio et al. [95]. More specifically, we evaluate RADAR with two sets of experiments: in the first case, we start from the answers provided by the different DBpedia endpoints to the 43 QALD questions, and we run RADAR on them. In the second case, we add QAKiS to the loop, meaning that the data used as input for the argumentation module are directly produced by the system; in this second case, the inputs are the 43 natural language questions. Table 3.5 reports the results we obtained for the two experiments.

As already noticed, the QALD dataset was created to query the English chapter of DBpedia only, and therefore this small dataset does not capture the variability of possibly inconsistent answers that can be found among DBpedia language-specific chapters. Only three categories of relations are present in this data (surface forms, geo-specification, and inclusion), and for this reason RADAR performs extremely well on it when applied to the correct mapping between the NL questions and the SPARQL queries. When QAKiS is added to the loop, its mistakes in interpreting the NL question and translating it into the correct SPARQL query are propagated to RADAR (which in those cases receives a wrong input), decreasing the overall performance. Notice that in some cases the question interpretation can be tricky, and can somehow bias the evaluation of the answers provided by the system. For instance, for the question Which pope succeeded John Paul II?, the English DBpedia provides Benedict XVI as the answer, while the Italian DBpedia also provides the names of other people who were successors of John Paul II in other roles, for instance as Archbishop of Krakow. But since in the gold standard this question is interpreted as asking for the successor of John Paul II in the role of Pope, only the entity Benedict XVI is accepted as the correct answer.

When integrated into QAKiS, RADAR 2.0 outperforms the results obtained by a preliminary version of the argumentation module, i.e.
RADAR 1.0 [86], for the positive relation classification (the results of the argumentation module alone cannot be strictly compared with the results obtained by RADAR 2.0, since i) in its previous version the relation categories are different and less fine-grained, and ii) in [86] only questions from QALD-2 were used in the evaluation), showing the increased precision and robustness of our framework. Note that this evaluation is not meant to show that QAKiS performance is improved by RADAR. Actually, RADAR does not affect the capacity of QAKiS to answer questions: RADAR is used to disambiguate among the multiple answers retrieved by QAKiS in order to provide the user with the most reliable (and hopefully correct) one. One of the reasons why RADAR is implemented as a framework that can be integrated on top of an existing QA system architecture (and is therefore system-independent) is that we would like it to be tested and exploited by potentially all QA systems querying more than one DBpedia chapter (at the time of this study, QAKiS was the only one, but given the potential increase in the coverage of a QA system querying multiple DBpedia language-specific chapters [84], we expected other systems to take advantage of these interconnected resources).

Table 3.5: Results on QALD relation classification

System                      Relation category    Precision   Recall   F1
RADAR 2.0 (only)            surface form         1.00        0.98     0.99
                            geo-specification    0.88        0.80     0.84
                            inclusion            0.80        1.00     0.88
                            overall positive     1.00        0.98     0.99
baseline                    surface form         1.00        0.97     0.98
                            geo-specification    0.00        0.00     0.00
                            inclusion            0.00        0.00     0.00
                            overall positive     1.00        0.86     0.92
QAKiS + RADAR 2.0           surface form         1.00        0.59     0.74
                            geo-specification    0.88        0.80     0.84
                            inclusion            0.80        1.00     0.88
                            overall positive     1.00        0.63     0.77
QAKiS + baseline            surface form         1.00        0.58     0.74
                            geo-specification    0.00        0.00     0.00
                            inclusion            0.00        0.00     0.00
                            overall positive     1.00        0.52     0.68
QAKiS + RADAR 1.0 [86]      overall positive     0.54        0.56     0.55
(on QALD-2 questions only)

Table 3.6: QALD questions used in the evaluation (in bold, the ones correctly answered by QAKiS; x means that the corresponding language-specific DBpedia chapter (EN, FR, DE, IT) contains at least one value for the queried relation; dbo means DBpedia ontology)

ID, question set | Question | DBpedia relation | EN FR DE IT
84, QALD-2 train | Give me all movies with Tom Cruise. | starring | x x x
10, QALD-2 train | In which country does the Nile start? | sourceCountry | x x
63, QALD-2 train | Give me all actors starring in Batman Begins. | starring | x x x x
43, QALD-2 train | Who is the mayor of ? | leaderName | x x x
54, QALD-2 train | Who was the wife of U.S. president Lincoln? | spouse | x x
6, QALD-2 train | Where did Abraham Lincoln die? | deathPlace | x x x
31, QALD-2 train | What is the currency of the ? | currency | x x x x
73, QALD-2 train | Who owns Aldi? | keyPerson | x x x
20, QALD-2 train | How many employees does IBM have? | numberOfEmployees | x x x x
33, QALD-2 train | What is the area code of Berlin? | areaCode | x
2, QALD-2 test | Who was the successor of John F. Kennedy? | successor | x x
4, QALD-2 test | How many students does the Free University in Amsterdam have? | numberOfStudents | x x x
14, QALD-2 test | Give me all members of Prodigy. | bandMember | x x
20, QALD-2 test | How tall is Michael Jordan? | height | x x x
21, QALD-2 test | What is the capital of Canada? | capital | x x x x
35, QALD-2 test | Who developed Skype? | product | x x
38, QALD-2 test | How many inhabitants does Maribor have? | populationTotal | x x
41, QALD-2 test | Who founded Intel? | foundedBy | x x x
65, QALD-2 test | Which instruments did John Lennon play? | instrument | x x
68, QALD-2 test | How many employees does Google have? | numberOfEmployees | x x x
74, QALD-2 test | When did Michael Jackson die? | deathDate | x x x
76, QALD-2 test | List the children of Margaret Thatcher. | child | x x
83, QALD-2 test | How high is the Mount Everest? | elevation | x x x
86, QALD-2 test | What is the largest city in Australia? | largestCity | x x
87, QALD-2 test | Who composed the music for Harold and Maude? | musicComposer | x x x
34, QALD-4 test | Who was the first to climb Mount Everest? | firstAscentPerson | x x
21, QALD-4 test | Where was Bach born? | birthPlace | x x x x
32, QALD-4 test | In which countries can you pay using the West African CFA franc? | currency | x x
12, QALD-4 test | How many pages does War and Peace have? | numberOfPages | x x
36, QALD-4 test | Which pope succeeded John Paul II? | successor | x x
30, QALD-4 test | When is Halloween? | date | x x
259, QALD-5 train | Who wrote The Hunger Games? | author | x x
280, QALD-5 train | What is the total population of Melbourne, Florida? | populationTotal | x x x
282, QALD-5 train | In which year was Rachel Stevens born? | birthYear | x x x x
283, QALD-5 train | Where was JFK assassinated? | deathPlace | x x x x
291, QALD-5 train | Who was influenced by Socrates? | influencedBy | x x
295, QALD-5 train | Who was married to president Chirac? | spouse | x x
298, QALD-5 train | Where did Hillel Slovak die? | deathPlace | x x x x
7, QALD-5 test | Which programming languages were influenced by Perl? | influencedBy | x x x x
18, QALD-5 test | Who is the manager of Real Madrid? | manager | x x
19, QALD-5 test | Give me the currency of . | country | x x
32, QALD-5 test | What does the abbreviation FIFA stand for? | name | x x x
47, QALD-5 test | Who were the parents of Queen Victoria? | parent | x x x

3.3 Multimedia answer visualization in QAKiS

Given that an increasingly huge amount of multimedia content is now available on the web on almost any topic, we judged it extremely interesting to consider such content in the QA scenario, in which the best answers may be a combination of text and other media [227]. For this reason, we proposed an additional extension of QAKiS that exploits the structured data and metadata describing multimedia content on the linked data to provide a richer and more complete answer to the user, combining textual information with other media content. A first step in this direction consists in determining the best sources and media (image, audio, video, or a hybrid) to answer a query. For this reason, we carried out an analysis of the questions provided by the QALD challenge, and categorized them according to the possible improved multimedia answer visualization (Table 3.7).

Multimedia        Example question
Picture           Give me all female Russian astronauts.
Picture + video   Give me all movies directed by Francis Ford Coppola.
Picture + map     In which country does the Nile start?
Map + barchart    Give me all world heritage sites designated within the past 5 years.
Statcharts        What is the total amount of men and women serving in the FDNY?
Timelines         When was Alberta admitted as province?

Table 3.7: QALD-3 questions improved answer visualization

We then extended QAKiS output to include i) pictures from Wikipedia infoboxes, for instance to visualize images of people or places (for questions such as Who is the President of the United States?); ii) OpenStreetMap, to visualize maps for questions asking about a place (e.g. What is the largest city in Australia?); and iii) YouTube, to visualize videos related to the answer (e.g. the trailer of a movie, for questions like Which films starring Clint Eastwood did he direct himself?).

While providing the textual answer to the user, the multimedia answer generator module (see Figure 3.2) queries DBpedia again to retrieve additional information about the entity contained in the answer. To display the images, it extracts the properties foaf:depiction and dbpedia-owl:thumbnail, and their value (i.e. the image) is shown as output. To display the maps (e.g. when the answer is a place), it retrieves the GPS coordinates from DBpedia (properties geo:geometry, geo:lat and geo:long), and injects them dynamically into OpenStreetMap16 to display the map. Given that DBpedia data can be inconsistent or incomplete, we define a set of heuristics to extract the coordinates: in case there are several values for the latitude and longitude, i) we give priority to negative values (indicating the southern hemisphere17), and ii) we take the value with the highest number of decimal digits, assuming it is the most precise. Finally, to embed YouTube18 videos, first the Freebase19 ID of the entity is retrieved through the DBpedia property owl:sameAs. Then, such ID is used via the YouTube search API (v3) (i.e. it is included in the embed code style