Applications of Natural Language Processing in Digital Humanities
Pablo Ruiz Fabo

Total pages: 16
File type: PDF, size: 1,020 KB

Concept-Based and Relation-Based Corpus Navigation: Applications of Natural Language Processing in Digital Humanities

To cite this version: Pablo Ruiz Fabo. Concept-based and relation-based corpus navigation: applications of natural language processing in digital humanities. Linguistics. Université Paris Sciences et Lettres, 2017. English. NNT: 2017PSLEE053.
HAL Id: tel-01575167v2, https://tel.archives-ouvertes.fr/tel-01575167v2, submitted on 2 Jul 2018.

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

Doctoral thesis of PSL Research University, prepared at the École normale supérieure (research unit: Laboratoire Lattice). École doctorale 540, Transdisciplinaire Lettres / Sciences; speciality: Language Sciences. Author: Pablo Ruiz Fabo. Supervisor: Thierry Poibeau. Defended on June 23, 2017.

Thesis committee:
• Valérie Beaudouin, Télécom ParisTech, rapporteur
• Caroline Sporleder, Universität Göttingen, rapporteur
• Jean-Gabriel Ganascia, Université Paris 6, examiner
• Elena González-Blanco, UNED Madrid, examiner
• Isabelle Tellier, Université Paris 3, examiner
• Melissa Terras, University College London, examiner

Abstract

Social sciences and humanities research is often based on large textual corpora that would be unfeasible to read in detail. Natural Language Processing (NLP) can identify important concepts and actors mentioned in a corpus, as well as the relations between them. Such information can provide an overview of the corpus useful for domain experts, and help identify corpus areas relevant to a given research question. To automatically annotate corpora relevant for Digital Humanities (DH), the NLP technologies we applied are, first, entity linking, to identify corpus actors and concepts; second, the relations between actors and concepts were determined based on an NLP pipeline which provides semantic role labeling and syntactic dependencies, among other information. Part I outlines the state of the art, paying attention to how these technologies have been applied in DH. Generic NLP tools were used.
As the efficacy of NLP methods depends on the corpus, some technological development was undertaken, described in Part II, in order to better adapt to the corpora in our case studies. Part II also presents an intrinsic evaluation of the technology developed, with satisfactory results.

The technologies were applied to three very different corpora, as described in Part III. First, the manuscripts of Jeremy Bentham, an 18th–19th century corpus in political philosophy. Second, the PoliInformatics corpus, with heterogeneous materials about the American financial crisis of 2007–2008. Finally, the Earth Negotiations Bulletin (ENB), which covers international climate summits since 1995, where treaties like the Kyoto Protocol or the Paris Agreement get negotiated.

For each corpus, navigation interfaces were developed. These user interfaces (UIs) combine networks, full-text search and structured search based on NLP annotations. As an example, in the ENB corpus interface, which covers climate policy negotiations, searches can be performed based on relational information identified in the corpus: the negotiation actors having discussed a given issue using verbs indicating support or opposition can be searched, as well as all statements where a given actor has expressed support or opposition (a code sketch of this kind of query follows the keyword list below). Relation information is employed, beyond simple co-occurrence of corpus terms.

The UIs were evaluated qualitatively with domain experts, to assess their potential usefulness for research in the experts' domains. First, we paid attention to whether the corpus representations we created correspond to experts' knowledge of the corpus, as an indication of the sanity of the outputs we produced. Second, we tried to determine whether experts could gain new insight on the corpus by using the applications, e.g. whether they found evidence previously unknown to them or new research ideas. Examples of insight gain were attested with the ENB interface; this constitutes a good validation of the work carried out in the thesis. Overall, the applications' strengths and weaknesses were pointed out, outlining possible improvements as future work.

Keywords: Entity Linking, Wikification, Relation Extraction, Proposition Extraction, Corpus Visualization, Natural Language Processing, Digital Humanities
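To make the relational search concrete, here is a minimal sketch (not the thesis's actual code) of querying stance-bearing statements of the kind the ENB interface exposes; the triple schema and the stance verb lists are illustrative assumptions:

```python
# Sketch: querying (actor, stance verb, issue) statements extracted
# from a corpus. Schema and verb lists are assumptions for illustration.
from dataclasses import dataclass

SUPPORT_VERBS = {"support", "back", "endorse", "favour"}  # assumed lexicon
OPPOSE_VERBS = {"oppose", "reject", "block", "resist"}    # assumed lexicon

@dataclass
class Statement:
    actor: str        # a negotiating party
    predicate: str    # lemma of the main verb (from SRL/dependencies)
    issue: str        # the concept the statement is about
    sentence: str     # source sentence, kept for display

def search(statements, actor=None, issue=None, stance=None):
    """Return statements matching actor/issue and, optionally, a
    'support' or 'oppose' stance signalled by the predicate verb."""
    verbs = {"support": SUPPORT_VERBS, "oppose": OPPOSE_VERBS}.get(stance)
    return [s for s in statements
            if (actor is None or s.actor == actor)
            and (issue is None or s.issue == issue)
            and (verbs is None or s.predicate in verbs)]

corpus = [
    Statement("EU", "support", "emissions trading",
              "The EU supported a broad emissions trading scheme."),
    Statement("Saudi Arabia", "oppose", "emissions trading",
              "Saudi Arabia opposed the proposed scheme."),
]
print(search(corpus, issue="emissions trading", stance="oppose"))
```

In the real interface the predicate and its arguments come from the semantic role labeling and dependency pipeline rather than from hand-built records.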
Résumé (French abstract, translated)

Note: the extended French summary begins on p. 263.

Research in the social sciences and humanities often relies on large masses of textual data that would be impossible to read in detail. Natural Language Processing (NLP) can identify important concepts and actors mentioned in a corpus, as well as the relations between them. This information can provide an overview of the corpus that is useful to domain experts and can help them identify the corpus areas relevant to their research questions. To automatically annotate corpora of interest for digital humanities, the NLP technologies we applied are, first, entity linking, to identify the corpus's actors and concepts; second, the relations between actors and concepts were determined on the basis of an NLP pipeline that performs semantic role labeling and syntactic dependency parsing, among other linguistic analyses. Part I of the thesis describes the state of the art on these technologies, while also highlighting their use in digital humanities.

Generic NLP tools were used. Since the effectiveness of NLP methods depends on the corpus they are applied to, some development work was carried out, described in Part II, to better adapt the analysis methods to the corpora in our case studies. Part II also presents an intrinsic evaluation of the technology developed, with satisfactory results.

The technologies were applied to three very different corpora, as described in Part III. First, the manuscripts of Jeremy Bentham, a corpus of 18th- and 19th-century political philosophy. Second, the PoliInformatics corpus, which contains heterogeneous materials about the American financial crisis of 2007–2008. Finally, the Earth Negotiations Bulletin (ENB in its English acronym), which has covered international climate policy summits since 1995, where treaties such as the Kyoto Protocol or the Paris Agreement were negotiated.

For each corpus, navigation interfaces were developed. These user interfaces combine networks, full-text search and structured search based on NLP annotations. As an example, in the interface for the ENB corpus, which covers climate policy negotiations, searches can be carried out on the basis of relational information identified in the corpus: the negotiation actors who addressed a given topic while expressing their support or opposition can be retrieved. The type of the relation between actors and concepts is exploited, beyond simple co-occurrence of corpus terms.

The interfaces were evaluated qualitatively with domain experts, in order to estimate their potential usefulness for research in their respective fields. First, we verified that the representations generated for the corpus contents agree with the experts' knowledge of the domain, in order to detect annotation errors. Next, we tried to determine whether the experts could gain a better understanding of the corpus through the use of the applications developed, for example, whether these allow them to renew their existing research questions. Examples were found where a gain in understanding of the corpus was observed thanks to the interface dedicated to the Earth Negotiations Bulletin, which constitutes a good validation of the work carried out in the thesis. In conclusion, the strengths and weaknesses of the applications developed were pointed out, indicating possible avenues for improvement as future work.

Keywords:
Recommended publications
  • Newcastle University ePrints
Knight, D., Adolphs, S. and Carter, R. CANELC: constructing an e-language corpus. Corpora 2014, 9(1), 29–56. The definitive version of this article, published by Edinburgh University Press, 2014, is available at http://dx.doi.org/10.3366/cor.2014.0050. Deposited in Newcastle University ePrints (http://eprint.ncl.ac.uk) on 23-07-2014 as the Author Accepted Manuscript; licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.

CANELC: constructing an e-language corpus
Dawn Knight, Svenja Adolphs and Ronald Carter

This paper reports on the construction of CANELC, the Cambridge and Nottingham e-language Corpus. CANELC is a one-million-word corpus of digital communication in English, taken from online discussion boards, blogs, tweets, emails and SMS messages. The paper outlines the approaches used when planning the corpus: obtaining consent, collecting the data and compiling the corpus database. This is followed by a detailed analysis of some of the patterns of language used in the corpus. The analysis includes a discussion of the key words and phrases used, as well as the common themes and semantic associations connected with the data. These discussions form the basis of an investigation of how e-language operates in ways both similar to and different from spoken and written records of communication (as evidenced by the BNC, the British National Corpus).

Keywords: Blogs, Tweets, SMS, Discussion Boards, e-language, Corpus Linguistics

1. Introduction
Communication in the digital age is a complex, many-faceted process involving the production and reception of linguistic stimuli across a multitude of platforms and media types (see Boyd and Heer, 2006: 1).
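The key-word analysis mentioned in the abstract presupposes a keyness statistic comparing CANELC frequencies against a reference corpus such as the BNC. The paper's exact computation is not reproduced here; the sketch below shows the standard log-likelihood (G²) keyness measure of Rayson and Garside (2000), with invented toy figures:

```python
import math

def log_likelihood(freq_target, size_target, freq_ref, size_ref):
    """G2 keyness of a word: how strongly its frequency in the target
    corpus deviates from the reference corpus (Rayson & Garside 2000)."""
    # expected frequencies under the null hypothesis of equal rates
    total = size_target + size_ref
    e1 = size_target * (freq_target + freq_ref) / total
    e2 = size_ref * (freq_target + freq_ref) / total
    ll = 0.0
    if freq_target:
        ll += freq_target * math.log(freq_target / e1)
    if freq_ref:
        ll += freq_ref * math.log(freq_ref / e2)
    return 2 * ll

# toy figures (invented): a token in a 1M-word e-language corpus
# versus a 100M-word reference corpus
print(round(log_likelihood(900, 1_000_000, 500, 100_000_000), 1))
```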
  • RASLAN 2015 Recent Advances in Slavonic Natural Language Processing
RASLAN 2015: Recent Advances in Slavonic Natural Language Processing. A. Horák, P. Rychlý, A. Rambousek (Eds.). Proceedings of the Ninth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2015, Karlova Studánka, Czech Republic, December 4–6, 2015. Tribun EU, 2015.

Proceedings editors: Aleš Horák, Pavel Rychlý and Adam Rambousek, Faculty of Informatics, Masaryk University, Department of Information Technologies, Botanická 68a, CZ-602 00 Brno, Czech Republic. Email: [email protected].

This work is subject to copyright; duplication of this publication or parts thereof is permitted only under the provisions of the Czech Copyright Law, and permission for use must be obtained from Tribun EU. Editors © Aleš Horák, Pavel Rychlý, Adam Rambousek, 2015. Typography © Adam Rambousek, 2015. Cover © Petr Sojka, 2010. This edition © Tribun EU, Brno, 2015. ISBN 978-80-263-0974-1. ISSN 2336-4289.

Preface
This volume contains the proceedings of the Ninth Workshop on Recent Advances in Slavonic Natural Language Processing (RASLAN 2015), held on December 4–6, 2015 in Karlova Studánka, Sporthotel Kurzovní, Jeseníky, Czech Republic.
  • A Decade in Digital Humanities
Journal of Siberian Federal University. Humanities & Social Sciences 7 (2016 9), 1637–1650. УДК 009:004.9

A Decade in Digital Humanities
Melissa Terras, University College London, London, UK
Received 15.02.2016, received in revised form 07.05.2016, accepted 09.06.2016

The paper reviews the meaning and development of digital humanities, giving examples of work published in various DH areas. The paper discusses what using these technologies means for the humanities, giving recommendations that can be useful across the sector.

Keywords: digital humanities, UCL Centre for Digital Humanities, Innovation Curve.
DOI: 10.17516/1997-1370-2016-9-7-1637-1650. Research area: culture studies.

I decided to call my paper "A Decade in Digital Humanities" for three reasons:
1. The term Digital Humanities has been commonly used to describe the application of computational methods in the arts and humanities for 10 years, since the publication, in 2004, of the Companion to Digital Humanities. "Digital Humanities" was quickly picked up by the academic community as a catch-all, big-tent name for a range of activities in computing, the arts, and culture.
2. ...paper gives me a rare chance to pause and look behind me to see what the body of work built up over this time represents.
3. You'll have to wait for later in the paper to see the third reason...

Who here would be comfortable defining what is meant by the term Digital Humanities? This paper is also related to the week of UCL Festival of the Arts, celebrating all things to do with the Arts and Humanities at my home
  • Book of Abstracts
The Association for Literary and Linguistic Computing, the Association for Computers and the Humanities, and the Society for Digital Humanities – Société pour l'étude des médias interactifs

Digital Humanities 2008: the 20th Joint International Conference of the Association for Literary and Linguistic Computing and the Association for Computers and the Humanities, and the 1st Joint International Conference of the Association for Literary and Linguistic Computing, the Association for Computers and the Humanities, and the Society for Digital Humanities – Société pour l'étude des médias interactifs. University of Oulu, Finland, 24–29 June, 2008. Conference Abstracts.

International Programme Committee:
• Espen Ore, National Library of Norway, Chair
• Jean Anderson, University of Glasgow, UK
• John Nerbonne, University of Groningen, The Netherlands
• Stephen Ramsay, University of Nebraska, USA
• Thomas Rommel, International Univ. Bremen, Germany
• Susan Schreibman, University of Maryland, USA
• Paul Spence, King's College London, UK
• Melissa Terras, University College London, UK
• Claire Warwick, University College London, UK, Vice Chair

Local organizers:
• Lisa Lena Opas-Hänninen, English Philology
• Riikka Mikkola, English Philology
• Mikko Jokelainen, English Philology
• Ilkka Juuso, Electrical and Information Engineering
• Toni Saranpää, English Philology
• Tapio Seppänen, Electrical and Information Engineering
• Raili Saarela, Congress Services

Edited by Lisa Lena Opas-Hänninen, Mikko Jokelainen, Ilkka Juuso and Tapio Seppänen. ISBN: 978-951-42-8838-8. Published by English Philology, University of Oulu. Cover design: Ilkka Juuso, University of Oulu. © 2008 University of Oulu and the authors.

Introduction
On behalf of the local organizers I am delighted to welcome you to the 25th Joint International Conference of the Association for Literary and Linguistic Computing (ALLC) and the Association for Computers and the Humanities (ACH) at the University of Oulu.
  • The Stuff We Forget: Digital Humanities, Digital Data, and the Academic Cycle
The Stuff we Forget: Digital Humanities, Digital Data, and the Academic Cycle
Professor Melissa Terras, Director, UCL Centre for Digital Humanities. [email protected], @melissaterras

Vindolanda texts:
• Roman fort on Hadrian's Wall, England; texts from AD 92 onwards.
• Two types: ink texts (carbon ink on wood; 300 texts survive) and stylus tablets (recessed centre filled with wax; 100 texts).

Close-up of Tablet 1563: complex incisions, woodgrain, surface discolouration, warping, cracking, a noisy image, a palimpsest; a long process.

[Slides show image-processing results: the original image, after illumination correction, and after woodgrain removal. With thanks to Dr Ségolène Tarte, eSAD project, OeRC.]

1996–2008: http://www.collective.co.uk/thrones/htm/index.htm

Jeremy Bentham (1748–1832):
• Jurist, philosopher, and legal and social reformer; leading theorist in Anglo-American philosophy of law.
• Influenced the development of welfarism; advocated utilitarianism; animal rights; work on the "panopticon".
• Not founder of UCL, but... 60,000 folios in UCL Special Collections, 40,000 of them untranscribed; the auto-icon.

Transcription example (JB/107/110/002), a costed recipe: "Baked apple pudding, 6½ per peck. Apples, 1 peck, 3d; peasemeal, 12 lb, ½d; malt dust, ½, ¾d; milk, 1 quart, 2d; water, do.; treacle, 1; 2 eggs, 1; labour, 1; total 9¼. Boil & mash the apples, stir in the malt dust & treacle, press the mass into a pan; boil the meal, milk & water together till thick, add the eggs and the remainder of"
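The illumination-correction slides show only before and after images. As a rough illustration of the general technique (a generic flat-field correction, not necessarily the eSAD project's actual algorithm), one can divide out a smooth lighting field estimated with a wide Gaussian blur:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def correct_illumination(img, sigma=50):
    """Divide out a smooth illumination field estimated with a wide
    Gaussian blur. A generic flat-field correction for illustration,
    not the eSAD project's published method."""
    img = img.astype(float)
    field = gaussian_filter(img, sigma)        # low-frequency lighting
    corrected = img / np.maximum(field, 1e-6)  # normalise local brightness
    corrected -= corrected.min()               # rescale to 0..255 for display
    return (255 * corrected / corrected.max()).astype(np.uint8)

# toy input standing in for a scanned tablet image
flat = correct_illumination(np.random.randint(0, 255, (200, 300)))
```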
  • Children Online: a Survey of Child Language and CMC Corpora
Children Online: a survey of child language and CMC corpora
Alistair Baron, Paul Rayson, Phil Greenwood, James Walkerdine and Awais Rashid, Lancaster University. Contact author: Paul Rayson. Email: [email protected]

The collection of representative corpus samples of both child language and online (CMC) language varieties is crucial for linguistic research that is motivated by applications to the protection of children online. In this paper, we present an extensive survey of corpora available for these two areas. Although a significant amount of research has been undertaken both on child language and on CMC language varieties, a much smaller number of datasets are made available as corpora. Especially lacking are corpora which match requirements for verifiable age and gender metadata, although some include self-reported information, which may be unreliable. Our survey highlights the lack of corpus data available for the intersecting area of child language in CMC environments. This lack of available corpus data is a significant drawback for those wishing to undertake replicable studies of child language and online language varieties.

Keywords: child language, CMC, survey, corpus linguistics

Introduction
This survey is part of a wider investigation into child language and computer-mediated communication (CMC) corpora. Its aim is to assess the availability of relevant corpora which can be used to build representative samples of the language of children online. The Isis project, of which the survey is a part, carries out research in the area of online child protection. Corpora of child and CMC language serve two main purposes in our research. First, they enable us to build age- and gender-based standard profiles of such language for comparison purposes.
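As an illustration of such comparison profiles, here is a minimal sketch; the record layout, age bands and toy messages are invented for the example, not the Isis project's actual schema:

```python
from collections import Counter, defaultdict

# Hypothetical records: (age_band, gender, tokenised message).
messages = [
    ("10-12", "f", ["lol", "see", "u", "later"]),
    ("10-12", "m", ["gg", "see", "u"]),
    ("13-15", "f", ["omg", "lol"]),
]

# aggregate token counts per (age, gender) group
profiles = defaultdict(Counter)
for age, gender, tokens in messages:
    profiles[(age, gender)].update(tokens)

# relative frequencies make profiles comparable across group sizes
for group, counts in profiles.items():
    n = sum(counts.values())
    print(group, {w: round(c / n, 2) for w, c in counts.most_common(3)})
```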
  • Email in the Australian National Corpus
Email in the Australian National Corpus
Andrew Lampert, CSIRO ICT Centre, North Ryde, Australia and Macquarie University, Australia

1. Introduction
Email is not only a distinctive and important text type but one that touches the lives of most Australians. In 2008, 79.4% of Australians used the internet, ahead of all Asia-Pacific countries except New Zealand (Organisation for Economic Co-operation and Development, 2008). A recent Nielsen study suggests that almost 98% of Australian internet users have sent or received email messages in the past 4 weeks (Australian Communications and Media Authority, 2008), making email the most used application on the internet by a significant margin. It seems logical to embrace a communication medium used by the vast majority of Australians when considering the text types and genres that should be included in the Australian National Corpus.

Existing corpora such as the British National Corpus (2007) and the American National Corpus (Macleod, Ide, & Grishman, 2000) provide many insights and lessons for the creation and curation of the Australian National Corpus. Like many existing corpora, the British National Corpus and the American National Corpus contain language data drawn from a wide variety of text types and genres, including telephone dialogue, novels, letters, transcribed face-to-face dialogue, technical books, newspapers, web logs, travel guides, magazines, reports, journals, and web data. Notably absent from this list are email messages. In many respects, email has replaced more traditional forms of communication such as letters and memoranda, yet this reality is not reflected in existing corpora. This lack of email text is a significant gap in existing corpus resources.
  • Domain Adaptation with Minimal Training (© 2014 Gourab Kundu)
© 2014 Gourab Kundu

DOMAIN ADAPTATION WITH MINIMAL TRAINING
BY GOURAB KUNDU

Dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science in the Graduate College of the University of Illinois at Urbana-Champaign, 2014. Urbana, Illinois.

Doctoral committee: Professor Dan Roth (Chair), Professor ChengXiang Zhai, Assistant Professor Julia Hockenmaier, Associate Professor Hal Daumé III (University of Maryland).

Abstract
Machine learning models trained on labeled data of one domain degrade severely in performance when tested on a different domain. Traditional approaches deal with this problem by training a new model for every new domain. In natural language processing, top-performing systems often use multiple interconnected models, and therefore training all of them for every new domain is computationally expensive. This thesis is a study of how to adapt to a new domain using a system trained on a different domain, avoiding the cost of retraining.

This thesis identifies two key ingredients for adaptation without training: broad-coverage resources and constraints. We show how resources like Wikipedia, VerbNet and WordNet, which contain comprehensive coverage of entities, semantic roles and words in English, can help a model adapt to a new domain. For the task of semantic role labeling, we show that in the decision phase we can replace a linguistic unit (e.g. a verb or word) with another equivalent linguistic unit residing in the same cluster defined in these resources (e.g. VerbNet, WordNet), such that after replacement the text becomes more like the text on which the model was trained. We show that the model's output is more accurate on the transformed text than on the original text.
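The replacement step can be sketched with WordNet synsets via NLTK. This is an illustrative reconstruction under assumed names (`adapt_word` and the toy vocabulary are invented), not the dissertation's code:

```python
# Swap a test-time word for a same-synset word the model saw in training.
# Requires the WordNet data: nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def adapt_word(word, training_vocab):
    """Return a training-vocabulary word from one of `word`'s synsets,
    or the word itself if no in-vocabulary equivalent exists."""
    for synset in wn.synsets(word):
        for lemma in synset.lemma_names():
            candidate = lemma.replace("_", " ")
            if candidate != word and candidate in training_vocab:
                return candidate
    return word

print(adapt_word("automobile", {"car", "house"}))  # -> 'car'
```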
  • 'Crowdsourcing Bentham: beyond the traditional boundaries of academic history' by Tim Causer and Melissa Terras
This is a pre-publication version of 'Crowdsourcing Bentham: beyond the traditional boundaries of academic history' by Tim Causer and Melissa Terras, published in April 2014 in vol. 8(1) of the International Journal of Humanities and Arts Computing (http://www.euppublishing.com/journal/ijhac); it appears here thanks to Edinburgh University Press.

Crowdsourcing Bentham: beyond the traditional boundaries of academic history
Tim Causer, Bentham Project, Faculty of Laws, University College London
Melissa Terras, Department of Information Studies, University College London

Abstract: The Bentham Papers Transcription Initiative (Transcribe Bentham for short) is an award-winning crowdsourced manuscript transcription initiative which engages students, researchers, and the general public with the thought and life of the philosopher and reformer Jeremy Bentham (1748–1832) by making available digital images of his manuscripts for anyone, anywhere in the world, to transcribe. Since its launch in September 2010, over 2.6 million words have been transcribed by volunteers. This paper will examine Transcribe Bentham's contribution to humanities research and the burgeoning field of digital humanities. It will then discuss the potential for the project's volunteers to make significant new discoveries among the vast Bentham Papers collection, and examine several examples of interesting material transcribed by volunteers thus far. We demonstrate here that a crowdsourced initiative such as Transcribe Bentham can open up activities that were traditionally viewed as academic endeavors to a wider audience interested in history, whilst uncovering new, important historical primary source material. In addition, we see this as a switch in focus for those involved in digital humanities, highlighting the possibilities of using online and social media technologies for user engagement and participation in cultural heritage.
  • CS224N Section 3 Corpora, Etc. Pi-Chuan Chang, Friday, April 25
CS224N Section 3: Corpora, etc. Pi-Chuan Chang, Friday, April 25, 2008. Some materials borrowed from Bill's notes in 2006: http://www.stanford.edu/class/cs224n/handouts/cs224n-section3-corpora.txt

The proposal for the final project is due two weeks later (Wednesday, 5/7). Looking for interesting topics?
→ Go through the syllabus: http://www.stanford.edu/class/cs224n/syllabus.html
→ Final projects from previous years: http://nlp.stanford.edu/courses/cs224n/
→ What data / tools are out there?
→ Collect your own dataset.

1. LDC (Linguistic Data Consortium): http://www.ldc.upenn.edu/Catalog/
2. Corpora@Stanford: http://www.stanford.edu/dept/linguistics/corpora/ (corpus TA: Anubha Kothari; inventory: http://www.stanford.edu/dept/linguistics/corpora/inventory.html; some are on AFS, some are available on DVD/CDs in the Linguistics department)
3. http://nlp.stanford.edu/links/statnlp.html
4. ACL Anthology: http://aclweb.org/anthology-new/
5. Various shared tasks:
   • CoNLL (Conference on Computational Natural Language Learning): 2006, multilingual dependency parsing; 2005 and 2004, semantic role labeling; 2003 and 2002, language-independent named entity recognition; 2001, clause identification; 2000, chunking; 1999, NP bracketing
   • Machine translation shared tasks: 2008, 2007, 2006, 2005
   • PASCAL challenges (RTE, etc.)
   • TREC (IR)
   • Senseval (word sense disambiguation)
   • ...

Parsing. Most widely used: the Penn Treebank.
1. English: (LDC99T42) Treebank-3 (see Bill's notes); a small sample also ships with NLTK, as sketched below.
2. Many other languages, e.g. Chinese (CTB 6.0) and Arabic.
Other parsed corpora:
1. Switchboard (spoken)
2. German: NEGRA, TIGER, TüBa-D/Z (there's an ACL workshop on German parsing this year...)
3.
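Before obtaining the full LDC release, one can experiment with the roughly 10% Penn Treebank sample that ships with NLTK (install the `treebank` data package first):

```python
# Requires the sample data: nltk.download('treebank')
from nltk.corpus import treebank

print(treebank.fileids()[:3])      # e.g. ['wsj_0001.mrg', ...]
tree = treebank.parsed_sents()[0]  # first parsed sentence
print(tree.leaves())               # its tokens
tree.pretty_print()                # ASCII rendering of the parse tree
```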
  • Machine Learning Approaches To
Slides adapted from Dan Jurafsky, Jim Martin and Chris Manning.

This week:
• Finish semantics
• Begin machine learning for NLP
• Review for midterm

Midterm: October 27th. Where: 1024 Mudd (here). When: class time, 2:40-4:00. It will cover everything through semantics; a sample midterm will be posted; it includes multiple choice, short answer and problem solving. October 29th: Bob Coyne and WordsEye: not to be missed! TBD: class outing to Where the Wild Things Are.

Word sense disambiguation (WSD):
• A subset of the WordNet sense representation is commonly used; WordNet provides many relations that capture meaning.
• To do WSD, we need a training corpus tagged with senses.
• Naïve Bayes approach to learning the correct sense: the probability of a specific sense given a set of features (collocational features, bag of words).

Decision lists (a case statement...):
• Restrict the lists to rules that test a single feature (1-decision-list rules).
• Evaluate each possible test and rank the tests based on how well they work.
• Glue the top-N tests together and call that your decision list.
• On a binary (homonymy) distinction, the following metric was used to rank the tests: Abs(Log(P(Sense1 | Feature) / P(Sense2 | Feature))). This gives about 95% on this test... (See the sketch below.)

Evaluation:
• In vivo versus in vitro evaluation; in vitro evaluation is most common now.
• Exact-match accuracy: the % of words tagged identically with manual sense tags; usually evaluated using held-out data from the same labeled corpus. Problems? Why do we do it anyhow?
• Baselines: most frequent sense; the Lesk algorithm. WordNet senses are ordered in frequency order, so "most
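A minimal sketch of the decision-list ranking just described, assuming add-alpha smoothing and toy data (neither is from the slides):

```python
import math
from collections import Counter

def rank_tests(training, alpha=0.1):
    """Rank 1-feature tests for a binary sense distinction by the
    magnitude of log P(sense1|f) / P(sense2|f), Yarowsky-style.
    `training` is a list of (sense, feature set) pairs; `alpha`
    is an assumed add-alpha smoothing constant."""
    counts = {1: Counter(), 2: Counter()}
    for sense, features in training:
        counts[sense].update(features)
    scored = []
    for f in set(counts[1]) | set(counts[2]):
        # P(s|f) is proportional to count(s, f), so the ratio of
        # smoothed counts equals the ratio of sense probabilities
        p1 = counts[1][f] + alpha
        p2 = counts[2][f] + alpha
        scored.append((abs(math.log(p1 / p2)), f, 1 if p1 > p2 else 2))
    return sorted(scored, reverse=True)  # top-N become the decision list

data = [(1, {"river", "water"}), (1, {"river"}), (2, {"money", "deposit"})]
for score, feat, sense in rank_tests(data)[:3]:
    print(f"{feat!r} -> sense {sense} (score {score:.2f})")
```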
  • Image and Interpretation: Using Artificial Intelligence to Read Ancient Roman Texts
Terras, M. and Robertson, P. (2005). "Image and Interpretation: Using Artificial Intelligence to Read Ancient Roman Texts". Human IT, Volume 7, Issue 3. http://www.hb.se/bhs/ith/3-7/mtpr.pdf

Keywords: Minimum Description Length, Ancient Documents, Artificial Intelligence, Vindolanda, Knowledge Elicitation, Agents, Information Fusion

Abstract
The ink and stylus tablets discovered at the Roman fort of Vindolanda are a unique resource for scholars of ancient history. However, the stylus tablets have proved particularly difficult to read. This paper describes a system that assists expert papyrologists in the interpretation of the Vindolanda writing tablets. A model-based approach is taken that relies on models of the written form of characters, and statistical modelling of language, to produce plausible interpretations of the documents. Fusion of the contributions from the language, character, and image feature models is achieved by utilizing the GRAVA agent architecture, which uses Minimum Description Length as the basis for information fusion across semantic levels. A system is developed that reads in image data and outputs plausible interpretations of the Vindolanda tablets.

1. Introduction
The ink and stylus texts from Vindolanda are an unparalleled source of information regarding the Roman Army and the Roman occupation of Britain for historians, linguists, palaeographers, and archaeologists. The visibility and legibility of the handwriting on the ink texts can be improved through the use of infrared photography. However, due to their physical state, the stylus tablets (one of the forms of official documentation of the Roman Army) have proved almost impossible to read.
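Minimum Description Length fusion can be illustrated with a toy sketch: each model assigns a probability to a candidate reading, the description length of an event is -log2 of its probability, and the reading minimizing the combined description length wins. The candidate words, probabilities and simple additive combination below are assumptions for illustration, not the GRAVA system's actual models:

```python
import math

def dl(p):
    """Description length in bits of an event with probability p."""
    return -math.log2(p)

# Toy candidate readings of a damaged word, each with a probability
# from a character-shape model and from a language model (invented).
candidates = {
    "militis": {"char_model": 0.30, "lang_model": 0.020},
    "milites": {"char_model": 0.25, "lang_model": 0.050},
}

# choose the reading with the shortest combined description length
best = min(candidates,
           key=lambda w: dl(candidates[w]["char_model"])
                       + dl(candidates[w]["lang_model"]))
print(best)  # -> 'milites' with these toy numbers
```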