Applications of Natural Language Processing in Digital Humanities
Pablo Ruiz Fabo

Total pages: 16
File type: PDF, size: 1,020 KB

Concept-Based and Relation-Based Corpus Navigation: Applications of Natural Language Processing in Digital Humanities

To cite this version: Pablo Ruiz Fabo. Concept-based and relation-based corpus navigation: applications of natural language processing in digital humanities. Linguistics. Université Paris Sciences et Lettres, 2017. English. NNT: 2017PSLEE053.
HAL Id: tel-01575167v2, https://tel.archives-ouvertes.fr/tel-01575167v2, submitted on 2 Jul 2018.

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

Doctoral thesis of PSL Research University, prepared at the École normale supérieure (research unit: Laboratoire Lattice). École doctorale 540, Transdisciplinaire Lettres / Sciences; speciality: Language Sciences. Author: Pablo Ruiz Fabo. Supervisor: Thierry Poibeau. Defended on June 23, 2017.

Thesis committee:
• Valérie Beaudouin, Télécom ParisTech, rapporteur
• Caroline Sporleder, Universität Göttingen, rapporteur
• Jean-Gabriel Ganascia, Université Paris 6, examiner
• Elena González-Blanco, UNED Madrid, examiner
• Isabelle Tellier, Université Paris 3, examiner
• Melissa Terras, University College London, examiner

Abstract

Social sciences and humanities research is often based on large textual corpora that would be unfeasible to read in detail. Natural Language Processing (NLP) can identify important concepts and actors mentioned in a corpus, as well as the relations between them. Such information can provide an overview of the corpus useful for domain experts, and help identify corpus areas relevant to a given research question. To automatically annotate corpora relevant for Digital Humanities (DH), the NLP technologies we applied are, first, entity linking, to identify corpus actors and concepts; second, the relations between actors and concepts were determined based on an NLP pipeline which provides semantic role labeling and syntactic dependencies, among other information. Part I outlines the state of the art, paying attention to how these technologies have been applied in DH. Generic NLP tools were used.
As the efficacy of NLP methods depends on the corpus, some technological development was undertaken, described in Part II, in order to better adapt to the corpora in our case studies. Part II also presents an intrinsic evaluation of the technology developed, with satisfactory results.

The technologies were applied to three very different corpora, as described in Part III. First, the manuscripts of Jeremy Bentham, an 18th–19th century corpus in political philosophy. Second, the PoliInformatics corpus, with heterogeneous materials about the American financial crisis of 2007–2008. Finally, the Earth Negotiations Bulletin (ENB), which covers international climate summits since 1995, where treaties like the Kyoto Protocol or the Paris Agreement get negotiated.

For each corpus, navigation interfaces were developed. These user interfaces (UIs) combine networks, full-text search and structured search based on NLP annotations. As an example, in the ENB corpus interface, which covers climate policy negotiations, searches can be performed based on relational information identified in the corpus: the negotiation actors having discussed a given issue using verbs indicating support or opposition can be searched, as well as all statements where a given actor has expressed support or opposition (a code sketch of this kind of query follows the keyword list below). Relation information is employed, beyond simple co-occurrence of corpus terms.

The UIs were evaluated qualitatively with domain experts, to assess their potential usefulness for research in the experts' domains. First, we paid attention to whether the corpus representations we created correspond to experts' knowledge of the corpus, as an indication of the sanity of the outputs we produced. Second, we tried to determine whether experts could gain new insight on the corpus by using the applications, e.g. whether they found evidence previously unknown to them or new research ideas. Examples of insight gain were attested with the ENB interface; this constitutes a good validation of the work carried out in the thesis. Overall, the applications' strengths and weaknesses were pointed out, outlining possible improvements as future work.

Keywords: Entity Linking, Wikification, Relation Extraction, Proposition Extraction, Corpus Visualization, Natural Language Processing, Digital Humanities
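To make the relational search concrete, here is a minimal sketch (not the thesis's actual code) of querying stance-bearing statements of the kind the ENB interface exposes; the triple schema and the stance verb lists are illustrative assumptions:

```python
# Sketch: querying (actor, stance verb, issue) statements extracted
# from a corpus. Schema and verb lists are assumptions for illustration.
from dataclasses import dataclass

SUPPORT_VERBS = {"support", "back", "endorse", "favour"}  # assumed lexicon
OPPOSE_VERBS = {"oppose", "reject", "block", "resist"}    # assumed lexicon

@dataclass
class Statement:
    actor: str        # a negotiating party
    predicate: str    # lemma of the main verb (from SRL/dependencies)
    issue: str        # the concept the statement is about
    sentence: str     # source sentence, kept for display

def search(statements, actor=None, issue=None, stance=None):
    """Return statements matching actor/issue and, optionally, a
    'support' or 'oppose' stance signalled by the predicate verb."""
    verbs = {"support": SUPPORT_VERBS, "oppose": OPPOSE_VERBS}.get(stance)
    return [s for s in statements
            if (actor is None or s.actor == actor)
            and (issue is None or s.issue == issue)
            and (verbs is None or s.predicate in verbs)]

corpus = [
    Statement("EU", "support", "emissions trading",
              "The EU supported a broad emissions trading scheme."),
    Statement("Saudi Arabia", "oppose", "emissions trading",
              "Saudi Arabia opposed the proposed scheme."),
]
print(search(corpus, issue="emissions trading", stance="oppose"))
```

In the real interface the predicate and its arguments come from the semantic role labeling and dependency pipeline rather than from hand-built records.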
Résumé (French abstract, translated)

Note: the extended French summary begins on p. 263.

Research in the social sciences and humanities often relies on large masses of textual data that would be impossible to read in detail. Natural Language Processing (NLP) can identify important concepts and actors mentioned in a corpus, as well as the relations between them. This information can provide an overview of the corpus that is useful to domain experts and can help them identify the corpus areas relevant to their research questions. To automatically annotate corpora of interest for digital humanities, the NLP technologies we applied are, first, entity linking, to identify the corpus's actors and concepts; second, the relations between actors and concepts were determined on the basis of an NLP pipeline that performs semantic role labeling and syntactic dependency parsing, among other linguistic analyses. Part I of the thesis describes the state of the art on these technologies, while also highlighting their use in digital humanities.

Generic NLP tools were used. Since the effectiveness of NLP methods depends on the corpus they are applied to, some development work was carried out, described in Part II, to better adapt the analysis methods to the corpora in our case studies. Part II also presents an intrinsic evaluation of the technology developed, with satisfactory results.

The technologies were applied to three very different corpora, as described in Part III. First, the manuscripts of Jeremy Bentham, a corpus of 18th- and 19th-century political philosophy. Second, the PoliInformatics corpus, which contains heterogeneous materials about the American financial crisis of 2007–2008. Finally, the Earth Negotiations Bulletin (ENB in its English acronym), which has covered international climate policy summits since 1995, where treaties such as the Kyoto Protocol or the Paris Agreement were negotiated.

For each corpus, navigation interfaces were developed. These user interfaces combine networks, full-text search and structured search based on NLP annotations. As an example, in the interface for the ENB corpus, which covers climate policy negotiations, searches can be carried out on the basis of relational information identified in the corpus: the negotiation actors who addressed a given topic while expressing their support or opposition can be retrieved. The type of the relation between actors and concepts is exploited, beyond simple co-occurrence of corpus terms.

The interfaces were evaluated qualitatively with domain experts, in order to estimate their potential usefulness for research in their respective fields. First, we verified that the representations generated for the corpus contents agree with the experts' knowledge of the domain, in order to detect annotation errors. Next, we tried to determine whether the experts could gain a better understanding of the corpus through the use of the applications developed, for example, whether these allow them to renew their existing research questions. Examples were found where a gain in understanding of the corpus was observed thanks to the interface dedicated to the Earth Negotiations Bulletin, which constitutes a good validation of the work carried out in the thesis. In conclusion, the strengths and weaknesses of the applications developed were pointed out, indicating possible avenues for improvement as future work.

Keywords:
Recommended publications
  • Newcastle University ePrints
Knight, D., Adolphs, S. and Carter, R. CANELC: constructing an e-language corpus. Corpora 2014, 9(1), 29–56. The definitive version of this article, published by Edinburgh University Press, 2014, is available at http://dx.doi.org/10.3366/cor.2014.0050. Deposited in Newcastle University ePrints (http://eprint.ncl.ac.uk) on 23-07-2014 as the Author Accepted Manuscript; licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.

CANELC: constructing an e-language corpus
Dawn Knight, Svenja Adolphs and Ronald Carter

This paper reports on the construction of CANELC, the Cambridge and Nottingham e-language Corpus. CANELC is a one-million-word corpus of digital communication in English, taken from online discussion boards, blogs, tweets, emails and SMS messages. The paper outlines the approaches used when planning the corpus: obtaining consent, collecting the data and compiling the corpus database. This is followed by a detailed analysis of some of the patterns of language used in the corpus. The analysis includes a discussion of the key words and phrases used, as well as the common themes and semantic associations connected with the data. These discussions form the basis of an investigation of how e-language operates in ways both similar to and different from spoken and written records of communication (as evidenced by the BNC, the British National Corpus).

Keywords: Blogs, Tweets, SMS, Discussion Boards, e-language, Corpus Linguistics

1. Introduction
Communication in the digital age is a complex, many-faceted process involving the production and reception of linguistic stimuli across a multitude of platforms and media types (see Boyd and Heer, 2006: 1).
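The key-word analysis mentioned in the abstract presupposes a keyness statistic comparing CANELC frequencies against a reference corpus such as the BNC. The paper's exact computation is not reproduced here; the sketch below shows the standard log-likelihood (G²) keyness measure of Rayson and Garside (2000), with invented toy figures:

```python
import math

def log_likelihood(freq_target, size_target, freq_ref, size_ref):
    """G2 keyness of a word: how strongly its frequency in the target
    corpus deviates from the reference corpus (Rayson & Garside 2000)."""
    # expected frequencies under the null hypothesis of equal rates
    total = size_target + size_ref
    e1 = size_target * (freq_target + freq_ref) / total
    e2 = size_ref * (freq_target + freq_ref) / total
    ll = 0.0
    if freq_target:
        ll += freq_target * math.log(freq_target / e1)
    if freq_ref:
        ll += freq_ref * math.log(freq_ref / e2)
    return 2 * ll

# toy figures (invented): a token in a 1M-word e-language corpus
# versus a 100M-word reference corpus
print(round(log_likelihood(900, 1_000_000, 500, 100_000_000), 1))
```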
  • RASLAN 2015 Recent Advances in Slavonic Natural Language Processing
RASLAN 2015: Recent Advances in Slavonic Natural Language Processing. A. Horák, P. Rychlý, A. Rambousek (Eds.). Proceedings of the Ninth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2015, Karlova Studánka, Czech Republic, December 4–6, 2015. Tribun EU, 2015.

Proceedings editors: Aleš Horák, Pavel Rychlý and Adam Rambousek, Faculty of Informatics, Masaryk University, Department of Information Technologies, Botanická 68a, CZ-602 00 Brno, Czech Republic. Email: [email protected].

This work is subject to copyright; duplication of this publication or parts thereof is permitted only under the provisions of the Czech Copyright Law, and permission for use must be obtained from Tribun EU. Editors © Aleš Horák, Pavel Rychlý, Adam Rambousek, 2015. Typography © Adam Rambousek, 2015. Cover © Petr Sojka, 2010. This edition © Tribun EU, Brno, 2015. ISBN 978-80-263-0974-1. ISSN 2336-4289.

Preface
This volume contains the proceedings of the Ninth Workshop on Recent Advances in Slavonic Natural Language Processing (RASLAN 2015), held on December 4–6, 2015 in Karlova Studánka, Sporthotel Kurzovní, Jeseníky, Czech Republic.
  • A Decade in Digital Humanities
Journal of Siberian Federal University. Humanities & Social Sciences 7 (2016 9), 1637–1650. УДК 009:004.9

A Decade in Digital Humanities
Melissa Terras, University College London, London, UK
Received 15.02.2016, received in revised form 07.05.2016, accepted 09.06.2016

The paper reviews the meaning and development of digital humanities, giving examples of work published in various DH areas. The paper discusses what using these technologies means for the humanities, giving recommendations that can be useful across the sector.

Keywords: digital humanities, UCL Centre for Digital Humanities, Innovation Curve.
DOI: 10.17516/1997-1370-2016-9-7-1637-1650. Research area: culture studies.

I decided to call my paper "A Decade in Digital Humanities" for three reasons:
1. The term Digital Humanities has been commonly used to describe the application of computational methods in the arts and humanities for 10 years, since the publication, in 2004, of the Companion to Digital Humanities. "Digital Humanities" was quickly picked up by the academic community as a catch-all, big-tent name for a range of activities in computing, the arts, and culture.
2. ...paper gives me a rare chance to pause and look behind me to see what the body of work built up over this time represents.
3. You'll have to wait for later in the paper to see the third reason...

Who here would be comfortable defining what is meant by the term Digital Humanities? This paper is also related to the week of UCL Festival of the Arts, celebrating all things to do with the Arts and Humanities at my home
  • Book of Abstracts
The Association for Literary and Linguistic Computing, the Association for Computers and the Humanities, and the Society for Digital Humanities – Société pour l'étude des médias interactifs

Digital Humanities 2008: the 20th Joint International Conference of the Association for Literary and Linguistic Computing and the Association for Computers and the Humanities, and the 1st Joint International Conference of the Association for Literary and Linguistic Computing, the Association for Computers and the Humanities, and the Society for Digital Humanities – Société pour l'étude des médias interactifs. University of Oulu, Finland, 24–29 June, 2008. Conference Abstracts.

International Programme Committee:
• Espen Ore, National Library of Norway, Chair
• Jean Anderson, University of Glasgow, UK
• John Nerbonne, University of Groningen, The Netherlands
• Stephen Ramsay, University of Nebraska, USA
• Thomas Rommel, International Univ. Bremen, Germany
• Susan Schreibman, University of Maryland, USA
• Paul Spence, King's College London, UK
• Melissa Terras, University College London, UK
• Claire Warwick, University College London, UK, Vice Chair

Local organizers:
• Lisa Lena Opas-Hänninen, English Philology
• Riikka Mikkola, English Philology
• Mikko Jokelainen, English Philology
• Ilkka Juuso, Electrical and Information Engineering
• Toni Saranpää, English Philology
• Tapio Seppänen, Electrical and Information Engineering
• Raili Saarela, Congress Services

Edited by Lisa Lena Opas-Hänninen, Mikko Jokelainen, Ilkka Juuso and Tapio Seppänen. ISBN: 978-951-42-8838-8. Published by English Philology, University of Oulu. Cover design: Ilkka Juuso, University of Oulu. © 2008 University of Oulu and the authors.

Introduction
On behalf of the local organizers I am delighted to welcome you to the 25th Joint International Conference of the Association for Literary and Linguistic Computing (ALLC) and the Association for Computers and the Humanities (ACH) at the University of Oulu.
  • The Stuff We Forget: Digital Humanities, Digital Data, and the Academic Cycle
The Stuff we Forget: Digital Humanities, Digital Data, and the Academic Cycle
Professor Melissa Terras, Director, UCL Centre for Digital Humanities. [email protected], @melissaterras

Vindolanda texts:
• Roman fort on Hadrian's Wall, England; texts from AD 92 onwards.
• Two types: ink texts (carbon ink on wood; 300 texts survive) and stylus tablets (recessed centre filled with wax; 100 texts).

Close-up of Tablet 1563: complex incisions, woodgrain, surface discolouration, warping, cracking, a noisy image, a palimpsest; a long process.

[Slides show image-processing results: the original image, after illumination correction, and after woodgrain removal. With thanks to Dr Ségolène Tarte, eSAD project, OeRC.]

1996–2008: http://www.collective.co.uk/thrones/htm/index.htm

Jeremy Bentham (1748–1832):
• Jurist, philosopher, and legal and social reformer; leading theorist in Anglo-American philosophy of law.
• Influenced the development of welfarism; advocated utilitarianism; animal rights; work on the "panopticon".
• Not founder of UCL, but... 60,000 folios in UCL Special Collections, 40,000 of them untranscribed; the auto-icon.

Transcription example (JB/107/110/002), a costed recipe: "Baked apple pudding, 6½ per peck. Apples, 1 peck, 3d; peasemeal, 12 lb, ½d; malt dust, ½, ¾d; milk, 1 quart, 2d; water, do.; treacle, 1; 2 eggs, 1; labour, 1; total 9¼. Boil & mash the apples, stir in the malt dust & treacle, press the mass into a pan; boil the meal, milk & water together till thick, add the eggs and the remainder of"
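The illumination-correction slides show only before and after images. As a rough illustration of the general technique (a generic flat-field correction, not necessarily the eSAD project's actual algorithm), one can divide out a smooth lighting field estimated with a wide Gaussian blur:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def correct_illumination(img, sigma=50):
    """Divide out a smooth illumination field estimated with a wide
    Gaussian blur. A generic flat-field correction for illustration,
    not the eSAD project's published method."""
    img = img.astype(float)
    field = gaussian_filter(img, sigma)        # low-frequency lighting
    corrected = img / np.maximum(field, 1e-6)  # normalise local brightness
    corrected -= corrected.min()               # rescale to 0..255 for display
    return (255 * corrected / corrected.max()).astype(np.uint8)

# toy input standing in for a scanned tablet image
flat = correct_illumination(np.random.randint(0, 255, (200, 300)))
```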
  • Children Online: a Survey of Child Language and CMC Corpora
Children Online: a survey of child language and CMC corpora
Alistair Baron, Paul Rayson, Phil Greenwood, James Walkerdine and Awais Rashid, Lancaster University. Contact author: Paul Rayson. Email: [email protected]

The collection of representative corpus samples of both child language and online (CMC) language varieties is crucial for linguistic research that is motivated by applications to the protection of children online. In this paper, we present an extensive survey of corpora available for these two areas. Although a significant amount of research has been undertaken both on child language and on CMC language varieties, a much smaller number of datasets are made available as corpora. Especially lacking are corpora which match requirements for verifiable age and gender metadata, although some include self-reported information, which may be unreliable. Our survey highlights the lack of corpus data available for the intersecting area of child language in CMC environments. This lack of available corpus data is a significant drawback for those wishing to undertake replicable studies of child language and online language varieties.

Keywords: child language, CMC, survey, corpus linguistics

Introduction
This survey is part of a wider investigation into child language and computer-mediated communication (CMC) corpora. Its aim is to assess the availability of relevant corpora which can be used to build representative samples of the language of children online. The Isis project, of which the survey is a part, carries out research in the area of online child protection. Corpora of child and CMC language serve two main purposes in our research. First, they enable us to build age- and gender-based standard profiles of such language for comparison purposes.
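As an illustration of such comparison profiles, here is a minimal sketch; the record layout, age bands and toy messages are invented for the example, not the Isis project's actual schema:

```python
from collections import Counter, defaultdict

# Hypothetical records: (age_band, gender, tokenised message).
messages = [
    ("10-12", "f", ["lol", "see", "u", "later"]),
    ("10-12", "m", ["gg", "see", "u"]),
    ("13-15", "f", ["omg", "lol"]),
]

# aggregate token counts per (age, gender) group
profiles = defaultdict(Counter)
for age, gender, tokens in messages:
    profiles[(age, gender)].update(tokens)

# relative frequencies make profiles comparable across group sizes
for group, counts in profiles.items():
    n = sum(counts.values())
    print(group, {w: round(c / n, 2) for w, c in counts.most_common(3)})
```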
  • Email in the Australian National Corpus
Email in the Australian National Corpus
Andrew Lampert, CSIRO ICT Centre, North Ryde, Australia and Macquarie University, Australia

1. Introduction
Email is not only a distinctive and important text type but one that touches the lives of most Australians. In 2008, 79.4% of Australians used the internet, ahead of all Asia-Pacific countries except New Zealand (Organisation for Economic Co-operation and Development, 2008). A recent Nielsen study suggests that almost 98% of Australian internet users have sent or received email messages in the past 4 weeks (Australian Communications and Media Authority, 2008), making email the most used application on the internet by a significant margin. It seems logical to embrace a communication medium used by the vast majority of Australians when considering the text types and genres that should be included in the Australian National Corpus.

Existing corpora such as the British National Corpus (2007) and the American National Corpus (Macleod, Ide, & Grishman, 2000) provide many insights and lessons for the creation and curation of the Australian National Corpus. Like many existing corpora, the British National Corpus and the American National Corpus contain language data drawn from a wide variety of text types and genres, including telephone dialogue, novels, letters, transcribed face-to-face dialogue, technical books, newspapers, web logs, travel guides, magazines, reports, journals, and web data. Notably absent from this list are email messages. In many respects, email has replaced more traditional forms of communication such as letters and memoranda, yet this reality is not reflected in existing corpora. This lack of email text is a significant gap in existing corpus resources.
  • Domain Adaptation with Minimal Training (© 2014 Gourab Kundu)
© 2014 Gourab Kundu

DOMAIN ADAPTATION WITH MINIMAL TRAINING
BY GOURAB KUNDU

Dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science in the Graduate College of the University of Illinois at Urbana-Champaign, 2014. Urbana, Illinois.

Doctoral committee: Professor Dan Roth (Chair), Professor ChengXiang Zhai, Assistant Professor Julia Hockenmaier, Associate Professor Hal Daumé III (University of Maryland).

Abstract
Machine learning models trained on labeled data of one domain degrade severely in performance when tested on a different domain. Traditional approaches deal with this problem by training a new model for every new domain. In natural language processing, top-performing systems often use multiple interconnected models, and therefore training all of them for every new domain is computationally expensive. This thesis is a study of how to adapt to a new domain using a system trained on a different domain, avoiding the cost of retraining.

This thesis identifies two key ingredients for adaptation without training: broad-coverage resources and constraints. We show how resources like Wikipedia, VerbNet and WordNet, which contain comprehensive coverage of entities, semantic roles and words in English, can help a model adapt to a new domain. For the task of semantic role labeling, we show that in the decision phase we can replace a linguistic unit (e.g. a verb or word) with another equivalent linguistic unit residing in the same cluster defined in these resources (e.g. VerbNet, WordNet), such that after replacement the text becomes more like the text on which the model was trained. We show that the model's output is more accurate on the transformed text than on the original text.
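The replacement step can be sketched with WordNet synsets via NLTK. This is an illustrative reconstruction under assumed names (`adapt_word` and the toy vocabulary are invented), not the dissertation's code:

```python
# Swap a test-time word for a same-synset word the model saw in training.
# Requires the WordNet data: nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def adapt_word(word, training_vocab):
    """Return a training-vocabulary word from one of `word`'s synsets,
    or the word itself if no in-vocabulary equivalent exists."""
    for synset in wn.synsets(word):
        for lemma in synset.lemma_names():
            candidate = lemma.replace("_", " ")
            if candidate != word and candidate in training_vocab:
                return candidate
    return word

print(adapt_word("automobile", {"car", "house"}))  # -> 'car'
```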
  • 'Crowdsourcing Bentham: beyond the traditional boundaries of academic history' by Tim Causer and Melissa Terras
This is a pre-publication version of 'Crowdsourcing Bentham: beyond the traditional boundaries of academic history' by Tim Causer and Melissa Terras, published in April 2014 in vol. 8(1) of the International Journal of Humanities and Arts Computing (http://www.euppublishing.com/journal/ijhac); it appears here thanks to Edinburgh University Press.

Crowdsourcing Bentham: beyond the traditional boundaries of academic history
Tim Causer, Bentham Project, Faculty of Laws, University College London
Melissa Terras, Department of Information Studies, University College London

Abstract: The Bentham Papers Transcription Initiative (Transcribe Bentham for short) is an award-winning crowdsourced manuscript transcription initiative which engages students, researchers, and the general public with the thought and life of the philosopher and reformer Jeremy Bentham (1748–1832) by making available digital images of his manuscripts for anyone, anywhere in the world, to transcribe. Since its launch in September 2010, over 2.6 million words have been transcribed by volunteers. This paper will examine Transcribe Bentham's contribution to humanities research and the burgeoning field of digital humanities. It will then discuss the potential for the project's volunteers to make significant new discoveries among the vast Bentham Papers collection, and examine several examples of interesting material transcribed by volunteers thus far. We demonstrate here that a crowdsourced initiative such as Transcribe Bentham can open up activities that were traditionally viewed as academic endeavors to a wider audience interested in history, whilst uncovering new, important historical primary source material. In addition, we see this as a switch in focus for those involved in digital humanities, highlighting the possibilities of using online and social media technologies for user engagement and participation in cultural heritage.
  • CS224N Section 3 Corpora, Etc. Pi-Chuan Chang, Friday, April 25
CS224N Section 3: Corpora, etc. Pi-Chuan Chang, Friday, April 25, 2008. Some materials borrowed from Bill's notes in 2006: http://www.stanford.edu/class/cs224n/handouts/cs224n-section3-corpora.txt

The proposal for the final project is due two weeks later (Wednesday, 5/7). Looking for interesting topics?
→ Go through the syllabus: http://www.stanford.edu/class/cs224n/syllabus.html
→ Final projects from previous years: http://nlp.stanford.edu/courses/cs224n/
→ What data / tools are out there?
→ Collect your own dataset.

1. LDC (Linguistic Data Consortium): http://www.ldc.upenn.edu/Catalog/
2. Corpora@Stanford: http://www.stanford.edu/dept/linguistics/corpora/ (corpus TA: Anubha Kothari; inventory: http://www.stanford.edu/dept/linguistics/corpora/inventory.html; some are on AFS, some are available on DVD/CDs in the Linguistics department)
3. http://nlp.stanford.edu/links/statnlp.html
4. ACL Anthology: http://aclweb.org/anthology-new/
5. Various shared tasks:
   • CoNLL (Conference on Computational Natural Language Learning): 2006, multilingual dependency parsing; 2005 and 2004, semantic role labeling; 2003 and 2002, language-independent named entity recognition; 2001, clause identification; 2000, chunking; 1999, NP bracketing
   • Machine translation shared tasks: 2008, 2007, 2006, 2005
   • PASCAL challenges (RTE, etc.)
   • TREC (IR)
   • Senseval (word sense disambiguation)
   • ...

Parsing. Most widely used: the Penn Treebank.
1. English: (LDC99T42) Treebank-3 (see Bill's notes); a small sample also ships with NLTK, as sketched below.
2. Many other languages, e.g. Chinese (CTB 6.0) and Arabic.
Other parsed corpora:
1. Switchboard (spoken)
2. German: NEGRA, TIGER, TüBa-D/Z (there's an ACL workshop on German parsing this year...)
3.
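Before obtaining the full LDC release, one can experiment with the roughly 10% Penn Treebank sample that ships with NLTK (install the `treebank` data package first):

```python
# Requires the sample data: nltk.download('treebank')
from nltk.corpus import treebank

print(treebank.fileids()[:3])      # e.g. ['wsj_0001.mrg', ...]
tree = treebank.parsed_sents()[0]  # first parsed sentence
print(tree.leaves())               # its tokens
tree.pretty_print()                # ASCII rendering of the parse tree
```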
  • Machine Learning Approaches To
Slides adapted from Dan Jurafsky, Jim Martin and Chris Manning.

This week:
• Finish semantics
• Begin machine learning for NLP
• Review for midterm

Midterm: October 27th. Where: 1024 Mudd (here). When: class time, 2:40-4:00. It will cover everything through semantics; a sample midterm will be posted; it includes multiple choice, short answer and problem solving. October 29th: Bob Coyne and WordsEye: not to be missed! TBD: class outing to Where the Wild Things Are.

Word sense disambiguation (WSD):
• A subset of the WordNet sense representation is commonly used; WordNet provides many relations that capture meaning.
• To do WSD, we need a training corpus tagged with senses.
• Naïve Bayes approach to learning the correct sense: the probability of a specific sense given a set of features (collocational features, bag of words).

Decision lists (a case statement...):
• Restrict the lists to rules that test a single feature (1-decision-list rules).
• Evaluate each possible test and rank the tests based on how well they work.
• Glue the top-N tests together and call that your decision list.
• On a binary (homonymy) distinction, the following metric was used to rank the tests: Abs(Log(P(Sense1 | Feature) / P(Sense2 | Feature))). This gives about 95% on this test... (See the sketch below.)

Evaluation:
• In vivo versus in vitro evaluation; in vitro evaluation is most common now.
• Exact-match accuracy: the % of words tagged identically with manual sense tags; usually evaluated using held-out data from the same labeled corpus. Problems? Why do we do it anyhow?
• Baselines: most frequent sense; the Lesk algorithm. WordNet senses are ordered in frequency order, so "most
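A minimal sketch of the decision-list ranking just described, assuming add-alpha smoothing and toy data (neither is from the slides):

```python
import math
from collections import Counter

def rank_tests(training, alpha=0.1):
    """Rank 1-feature tests for a binary sense distinction by the
    magnitude of log P(sense1|f) / P(sense2|f), Yarowsky-style.
    `training` is a list of (sense, feature set) pairs; `alpha`
    is an assumed add-alpha smoothing constant."""
    counts = {1: Counter(), 2: Counter()}
    for sense, features in training:
        counts[sense].update(features)
    scored = []
    for f in set(counts[1]) | set(counts[2]):
        # P(s|f) is proportional to count(s, f), so the ratio of
        # smoothed counts equals the ratio of sense probabilities
        p1 = counts[1][f] + alpha
        p2 = counts[2][f] + alpha
        scored.append((abs(math.log(p1 / p2)), f, 1 if p1 > p2 else 2))
    return sorted(scored, reverse=True)  # top-N become the decision list

data = [(1, {"river", "water"}), (1, {"river"}), (2, {"money", "deposit"})]
for score, feat, sense in rank_tests(data)[:3]:
    print(f"{feat!r} -> sense {sense} (score {score:.2f})")
```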
  • Image and Interpretation: Using Artificial Intelligence to Read Ancient Roman Texts
Terras, M. and Robertson, P. (2005). "Image and Interpretation: Using Artificial Intelligence to Read Ancient Roman Texts". Human IT, Volume 7, Issue 3. http://www.hb.se/bhs/ith/3-7/mtpr.pdf

Keywords: Minimum Description Length, Ancient Documents, Artificial Intelligence, Vindolanda, Knowledge Elicitation, Agents, Information Fusion

Abstract
The ink and stylus tablets discovered at the Roman fort of Vindolanda are a unique resource for scholars of ancient history. However, the stylus tablets have proved particularly difficult to read. This paper describes a system that assists expert papyrologists in the interpretation of the Vindolanda writing tablets. A model-based approach is taken that relies on models of the written form of characters, and statistical modelling of language, to produce plausible interpretations of the documents. Fusion of the contributions from the language, character, and image feature models is achieved by utilizing the GRAVA agent architecture, which uses Minimum Description Length as the basis for information fusion across semantic levels. A system is developed that reads in image data and outputs plausible interpretations of the Vindolanda tablets.

1. Introduction
The ink and stylus texts from Vindolanda are an unparalleled source of information regarding the Roman Army and the Roman occupation of Britain for historians, linguists, palaeographers, and archaeologists. The visibility and legibility of the handwriting on the ink texts can be improved through the use of infrared photography. However, due to their physical state, the stylus tablets (one of the forms of official documentation of the Roman Army) have proved almost impossible to read.
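Minimum Description Length fusion can be illustrated with a toy sketch: each model assigns a probability to a candidate reading, the description length of an event is -log2 of its probability, and the reading minimizing the combined description length wins. The candidate words, probabilities and simple additive combination below are assumptions for illustration, not the GRAVA system's actual models:

```python
import math

def dl(p):
    """Description length in bits of an event with probability p."""
    return -math.log2(p)

# Toy candidate readings of a damaged word, each with a probability
# from a character-shape model and from a language model (invented).
candidates = {
    "militis": {"char_model": 0.30, "lang_model": 0.020},
    "milites": {"char_model": 0.25, "lang_model": 0.050},
}

# choose the reading with the shortest combined description length
best = min(candidates,
           key=lambda w: dl(candidates[w]["char_model"])
                       + dl(candidates[w]["lang_model"]))
print(best)  # -> 'milites' with these toy numbers
```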