Ad Hoc and General-Purpose Corpus Construction from Web Sources
Adrien Barbaresi


To cite this version: Adrien Barbaresi. Ad hoc and general-purpose corpus construction from web sources. Linguistics. ENS Lyon, 2015. English. tel-01167309.

HAL Id: tel-01167309, https://tel.archives-ouvertes.fr/tel-01167309, submitted on 24 Jun 2015. HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers. Distributed under a Creative Commons Attribution 4.0 International License.

Thesis submitted for the degree of Doctor of the Université de Lyon, awarded by the École Normale Supérieure de Lyon. Discipline: Linguistics. ICAR lab, doctoral school Lettres, Langues, Linguistique, Arts (ED 484 3LA). Publicly presented and defended on 19 June 2015 by Adrien Barbaresi, under the French title "Construction de corpus généraux et spécialisés à partir du web".

Thesis committee:
Benoît Habert, ENS Lyon, advisor
Thomas Lebarbé, University of Grenoble 3, chair
Henning Lobin, University of Gießen, reviewer
Jean-Philippe Magué, ENS Lyon, co-advisor
Ludovic Tanguy, University of Toulouse 2, reviewer

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Linguistics.

Acknowledgements

Many thanks to: Benoît Habert and Jean-Philippe Magué; Henning Lobin; Serge Heiden, Matthieu Decorde, Alexei Lavrentev, Céline Guillot, and the people at ICAR; Alexander Geyken, Lothar Lemnitzer, Kay-Michael Würzner, Bryan Jurish, and the team at the Zentrum Sprache of the BBAW; my fellow former members of the board of ENthèSe; Felix Bildhauer and Roland Schäfer; the liberalities of my PhD grant and the working conditions I benefited from (PhD grant at the ENS Lyon, internal research grant at FU Berlin, DFG-funded research grants at the BBAW; I am grateful to the FU Berlin and the BBAW for the substantial computational resources they provided); the S.; my friends and family; Eva Brehm-Jurish for helping my English sound better; Brendan O'Connor for his inspirational LaTeX preamble.

Contents

1 Corpus design before and after the emergence of web data
1.1 Prologue: "A model of Universal Nature made private" – Garden design and text collections
1.2 Introductory definitions and typology: corpus, corpus design and text
1.2.1 What "computational" is to linguistics
1.2.2 What theory and field are to linguistics
1.2.3 Why use a corpus? How is it defined? How is it built?
1.3 Corpus design history: From copy typists to web data (1959-today)
1.3.1 Possible periodizations
1.3.2 Copy typists, first electronic corpora and establishment of a scientific methodology (1959-1980s)
1.3.3 Digitized text and emergence of corpus linguistics, between technological evolutions and cross-breeding with NLP (from the late 1980s onwards)
1.3.4 Enter web data: porosity of the notion of corpus, and opportunism (from the late 1990s onwards)
1.4 Web native corpora – Continuities and changes
1.4.1 Corpus linguistics methodology
1.4.2 From content oversight to suitable processing: Known problems
1.5 Intermediate conclusions
1.5.1 (Web) corpus linguistics: a discipline on the rise?
1.5.2 Changes in corpus design and construction
1.5.3 Challenges addressed in the following sections

2 Gauging and quality assessment of text collections: methodological insights on (web) text processing and classification
2.1 Introduction
2.2 Text quality assessment, an example of interdisciplinary research on web texts
2.2.1 Underestimated flaws and recent advances in discovering them
2.2.2 Tackling the problems: General state of the art
2.3 Text readability as an aggregate of salient text characteristics
2.3.1 From text quality to readability and back
2.3.2 Extent of the research field: readability, complexity, comprehensibility
2.3.3 State of the art of different widespread approaches
2.3.4 Common denominator of current methods
2.3.5 Lessons to learn and possible application scenarios
2.4 Text visualization, an example of corpus processing for digital humanists
2.4.1 Advantages of visualization techniques and examples of global and local visualizations
2.4.2 Visualization of corpora
2.5 Conclusions on current research trends
2.5.1 Summary
2.5.2 A gap to bridge between quantitative analysis and (corpus) linguistics

3 From the start URLs to the accessible corpus document: Web text collection and preprocessing
3.1 Introduction
3.1.1 Structure of the Web and definition of web crawling
3.1.2 "Offline" web corpora: overview of constraints and processing chain
3.2 Data collection: how is web document retrieval done and which challenges arise?
3.2.1 Introduction
3.2.2 Finding URL sources: Known advantages and difficulties
3.2.3 In the civilized world: "Restricted retrieval"
3.2.4 Into the wild: Web crawling
3.2.5 Finding a vaguely defined needle in a haystack: targeting small, partly unknown fractions of the web
3.3 An underestimated step: Preprocessing
3.3.1 Introduction
3.3.2 Description and discussion of salient preprocessing steps
3.3.3 Impact on research results
3.4 Conclusive remarks and open questions
3.4.1 Remarks on current research practice
3.4.2 Existing problems and solutions
3.4.3 (Open) questions

4 Web corpus sources, qualification, and exploitation
4.1 Introduction: qualification steps and challenges
4.1.1 Hypotheses
4.1.2 Concrete issues
4.2 Prequalification of web documents
4.2.1 Prequalification: Finding viable seed URLs for web corpora
4.2.2 Impact of prequalification on (focused) web crawling and web corpus sampling
4.2.3 Restricted retrieval
4.3 Qualification of web corpus documents and web corpus building
4.3.1 General-purpose corpora
4.3.2 Specialized corpora
4.4 Corpus exploitation: typology, analysis, and visualization
4.4.1 A few specific corpora constructed during this thesis
4.4.2 Interfaces and exploitation

5 Conclusion
5.0.1 The framework of web corpus linguistics
5.0.2 Put the texts back into focus
5.0.3 Envoi: Back to the garden

6 Annexes
6.1 Synoptic tables
6.2 Building a basic crawler
6.3 Concrete steps of URL-based content filtering
6.3.1 Content type filtering
6.3.2 Filtering adult content and spam
6.4 Examples of blogs
6.5 A surface parser for fast phrases and valency detection on large corpora
6.5.1 Introduction
6.5.2 Description
6.5.3 Implementation
6.5.4 Evaluation
6.5.5 Conclusion
6.6 Completing web pages on the fly with JavaScript
6.7 Newspaper article in XML TEI format
6.8 Résumé en français

References
Index

List of Tables
1.1 Synoptic typological comparison of specific and general-purpose corpora
3.1 Amount of documents lost in the cleanup steps of DECOW2012
4.1 URLs extracted from DMOZ and Wikipedia
4.2 Crawling experiments for Indonesian
4.3 Naive reference for crawling experiments concerning Indonesian
4.4 5 most frequent languages of URLs taken at random on FriendFeed
4.5 10 most frequent languages of spell-check-filtered URLs gathered on FriendFeed
4.6 10 most frequent languages of URLs gathered on identi.ca
4.7 10 most frequent languages of filtered URLs gathered on Reddit channels and on a combination of channels and user pages
4.8 5 most frequent languages of links seen on Reddit and rejected by the primary language filter
4.9 URLs extracted from search engine queries
4.10 URLs extracted from a blend of social network crawls
4.11 URLs extracted from DMOZ and Wikipedia
4.12 Comparison of the number of different domains in the page's outlinks for four different sources
4.13 Inter-annotator agreement of three coders for a text quality rating task
4.14 Results of logistic regression analysis for selected features and choices made by rater S
4.15 Evaluation of the model and comparison with the results of Schäfer et al. (2013)
4.16 Most frequent license types
4.17 Most frequent countries (ISO code)
4.18 Various properties of the examined corpora
4.19 Percentage distribution of selected PoS (super)tags on token and type level
6.1 Corpora created during this thesis
Recommended publications
  • Preparation and Exploitation of Bilingual Texts Dusko Vitas, Cvetana Krstev, Eric Laporte
Preparation and exploitation of bilingual texts. Dusko Vitas, Cvetana Krstev, Eric Laporte. Lux Coreana, 2006, 1, pp. 110-132. hal-00190958v2, https://hal.archives-ouvertes.fr/hal-00190958v2, submitted on 27 Nov 2007.

Duško Vitas, Faculty of Mathematics, Studentski trg 16, CS-11000 Belgrade, Serbia. Cvetana Krstev, Faculty of Philology, Studentski trg 3, CS-11000 Belgrade, Serbia. Éric Laporte, Institut Gaspard-Monge, Université de Marne-la-Vallée, 5, bd Descartes, 77454 Marne-la-Vallée CEDEX 2, France.

Introduction. A bitext is a merged document composed of two versions of a given text, usually in two different languages. An aligned bitext is produced by an alignment tool or aligner, which automatically aligns or matches the versions of the same text, generally sentence by sentence. A multilingual aligned corpus, or collection of aligned bitexts, when consulted with a search tool, can be extremely useful for translation, language teaching and the investigation of literary texts (Veronis, 2000).
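The aligner mentioned in this introduction can be illustrated with a toy length-based sentence aligner in the spirit of Gale & Church (1993). This is a minimal sketch under simplifying assumptions (pre-split sentence lists, character length as the only signal), not the tool discussed by the authors:

```python
# Toy sentence aligner for a bitext. Length-based alignment in the
# spirit of Gale & Church (1993); a sketch only, assuming sentences
# are already split and in document order.

def align(src, tgt, gap_cost=4.0):
    """Align two sentence lists with dynamic programming.

    Moves: 1:1 (match), 1:0 and 0:1 (one side unmatched).
    The 1:1 cost grows with the difference in sentence length,
    since translations tend to have proportional lengths.
    """
    n, m = len(src), len(tgt)
    INF = float("inf")
    # dp[i][j] = (cost, move) for aligning src[:i] with tgt[:j]
    dp = [[(INF, None)] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = (0.0, None)
    for i in range(n + 1):
        for j in range(m + 1):
            cost, _ = dp[i][j]
            if cost == INF:
                continue
            if i < n and j < m:  # 1:1 match
                match = cost + abs(len(src[i]) - len(tgt[j])) / 10.0
                if match < dp[i + 1][j + 1][0]:
                    dp[i + 1][j + 1] = (match, "match")
            if i < n and cost + gap_cost < dp[i + 1][j][0]:  # 1:0
                dp[i + 1][j] = (cost + gap_cost, "skip_src")
            if j < m and cost + gap_cost < dp[i][j + 1][0]:  # 0:1
                dp[i][j + 1] = (cost + gap_cost, "skip_tgt")
    # Backtrace into (source sentence, target sentence) pairs
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        move = dp[i][j][1]
        if move == "match":
            i, j = i - 1, j - 1
            pairs.append((src[i], tgt[j]))
        elif move == "skip_src":
            i -= 1
        else:
            j -= 1
    return list(reversed(pairs))

if __name__ == "__main__":
    fr = ["Le chat dort.", "Il pleut beaucoup aujourd'hui."]
    en = ["The cat is sleeping.", "It is raining a lot today."]
    for s, t in align(fr, en):
        print(s, "<->", t)
```

Real aligners refine the cost with length-ratio statistics and lexical anchors; the dynamic-programming skeleton is the same.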
  • Distant Reading, Computational Stylistics, and Corpus Linguistics: The Critical Theory of Digital Humanities for Literature Subject Librarians. David D. Oberhelman
Chapter Four. Distant Reading, Computational Stylistics, and Corpus Linguistics: The Critical Theory of Digital Humanities for Literature Subject Librarians. David D. Oberhelman.

For literature librarians who frequently assist or even collaborate with faculty, researchers, IT professionals, and students on digital humanities (DH) projects, understanding some of the tacit or explicit literary theoretical assumptions involved in the practice of DH can help them better serve their constituencies and also equip them to serve as a bridge between the DH community and the more traditional practitioners of literary criticism who may not fully embrace, or may even oppose, the scientific leanings of their DH colleagues. I will therefore provide a brief overview of the theory behind the technique of DH in the case of literature (the use of "distant reading" as opposed to "close reading" of literary texts, as well as the use of computational linguistics, stylistics, and corpora studies) to help literature subject librarians grasp some of the implications of DH for the literary critical tradition and learn how DH practitioners approach literary texts in ways that are fundamentally different from those employed by many other critics. Armed with this knowledge, subject librarians may be able to play a role in integrating DH into the traditional study of literature.

Footnote: Literary scholars' opposition to the use of science and technology is not new, as Helle Porsdam has observed by looking at the current debates over DH in the academy in terms of the academic debate in Great Britain in the 1950s and 1960s between the chemist and novelist C. P. Snow and the literary critic F. R. Leavis over the "two cultures" of science and the humanities (Helle Porsdam, "Too Much 'Digital,' Too Little 'Humanities'? An Attempt to Explain Why Humanities Scholars Are Reluctant Converts to Digital Humanities," Arcadia project report, 2011, DSpace@Cambridge, www.dspace.cam.ac.uk/handle/1810/244642).
  • KB-Driven Information Extraction to Enable Distant Reading of Museum Collections. Giovanni A. Cignoni, Enrico Meloni
KB-Driven Information Extraction to Enable Distant Reading of Museum Collections. Giovanni A. Cignoni, Progetto HMR, [email protected]; Enrico Meloni, Progetto HMR, [email protected].

Distant reading is based on the possibility to count data, to graph and map them, to visualize the relations which are inside the data, and to enable new reading perspectives through the use of technologies. By contrast, close reading accesses information by staying at a very fine level of detail. Sometimes, we are bound to close reading because technologies, while available, are not applied.

Collections preserved in museums represent a relevant part of our cultural heritage. Cataloguing is the traditional way used to document a collection. Traditionally, catalogues are lists of records: for each piece in the collection there is a record which holds information on the object. Thus, a catalogue is just a table, a very simple and flat data structure. Each collection has its own catalogue and, even if standards exist, they are often not applied. The actual availability and usability of information in collections is impaired by the need to access each catalogue individually and browse it as a list of records. This can be considered a situation of mandatory close reading.

The paper discusses how it will be possible to enable distant reading of collections. The proposed solution is based on a knowledge base and on KB-driven information extraction. As a case study, it refers to a particular domain of cultural heritage: the history of information science and technology (HIST).
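To make the contrast between a flat catalogue and a knowledge base concrete, here is a minimal Python sketch; the records, field names and query are invented for illustration and are not drawn from any real museum catalogue or cataloguing standard:

```python
# A traditional catalogue is a flat table of records; a knowledge base
# stores the same information as relations queryable across
# collections. All field names and objects below are hypothetical.

records = [  # one flat catalogue, one record per object
    {"id": "obj1", "name": "Olivetti Programma 101", "maker": "Olivetti", "year": 1965},
    {"id": "obj2", "name": "Apple II", "maker": "Apple", "year": 1977},
]

# Lift each flat record into subject-predicate-object triples
triples = [
    (rec["id"], field, value)
    for rec in records
    for field, value in rec.items()
    if field != "id"
]

# A KB-style query that per-collection catalogue browsing makes hard:
# "all objects made before 1970", regardless of source catalogue
early = {s for s, p, o in triples if p == "year" and o < 1970}
print(early)  # {'obj1'}
```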
  • The IAFOR European Conference Series 2014: ECE2014, ECLL2014, ECTC2014. Official Conference Proceedings. ISSN: 2188-1138
The IAFOR European Conference Series 2014: ECE2014, ECLL2014, ECTC2014. Official Conference Proceedings. ISSN: 2188-1138.

"To Open Minds, To Educate Intelligence, To Inform Decisions"

The International Academic Forum provides new perspectives to the thought-leaders and decision-makers of today and tomorrow by offering constructive environments for dialogue and interchange at the intersections of nation, culture, and discipline. Headquartered in Nagoya, Japan, and registered as a Non-Profit Organization (一般社団法人), IAFOR is an independent think tank committed to the deeper understanding of contemporary geo-political transformation, particularly in the Asia Pacific Region. International, Intercultural, Interdisciplinary. iafor

The Executive Council of the International Advisory Board. IAB Chair: Professor Stuart D.B. Picken. IAB Vice-Chair: Professor Jerry Platt.
Mr Mitsumasa Aoyama, Director, The Yufuku Gallery, Tokyo, Japan.
Professor David N Aspin, Professor Emeritus and Former Dean of the Faculty of Education, Monash University, Australia; Visiting Fellow, St Edmund's College, Cambridge University, UK.
Professor June Henton, Dean, College of Human Sciences, Auburn University, USA.
Professor Frank S. Ravitch, Professor of Law & Walter H. Stowers Chair in Law and Religion, Michigan State University College of Law.
Professor Richard Roth, Senior Associate Dean, Medill School of Journalism, Northwestern University, Qatar.
Professor Michael Hudson, President of The Institute for the Study of Long-Term Economic Trends (ISLET); Distinguished Research Professor of Economics, …
  • Kernerman Kdictionaries.Com/Kdn DICTIONARY News the European Network of E-Lexicography (Enel) Tanneke Schoonheim
Kernerman Dictionary News, Number 22, July 2014 (kdictionaries.com/kdn). The European Network of e-Lexicography (ENeL). Tanneke Schoonheim.

On October 11th 2013, the kick-off meeting of the European Network of e-Lexicography (ENeL) project took place in Brussels. This meeting was the outcome of an idea floated a year and a half earlier, in March 2012 in Berlin, at the European Workshop on Future Standards in Lexicography. The workshop participants then confirmed the imperative to coordinate and harmonise research in the field of (electronic) lexicography across Europe, namely to share expertise relating to standards, discuss new methodologies in lexicography that fully exploit the possibilities of the digital medium, reflect on the pan-European nature of the languages of Europe and attain a wider audience. A proposal was written by a team of researchers from …

… production and reception of dictionaries. The internet offers entirely new possibilities for developing and presenting dictionary information, such as the integration of sound, maps or video, and various novel ways of interacting with dictionary users. For editors of scholarly dictionaries the new medium is not only a source of inspiration; it also generates new and serious challenges that demand cooperation and standardization on various levels: a. Through the internet, scholarly dictionaries can potentially reach large audiences. However, at present scholarly dictionaries providing reliable information are often not easy to find and are hard to decode for a non-academic audience;
  • TANGO: Bilingual Collocational Concordancer
TANGO: Bilingual Collocational Concordancer. Jia-Yan Jian (Department of Computer Science, National Tsing Hua University), Yu-Chia Chang (Institute of Information Systems and Applications, National Tsing Hua University), Jason S. Chang (Department of Computer Science, National Tsing Hua University). 101, Kuangfu Road, Hsinchu, Taiwan. [email protected], [email protected], [email protected].

Abstract. In this paper, we describe TANGO as a collocational concordancer for looking up collocations. The system was designed to answer users' queries of bilingual collocational usage for nouns, verbs and adjectives. We first obtained collocations from the large monolingual British National Corpus (BNC). Subsequently, we identified collocation instances and translation counterparts in a bilingual corpus, the Sinorama Parallel Corpus (SPC), by exploiting the word-alignment technique. The main goal of the concordancer is to provide the user with a reference tool for correct collocation use, so …

… on elaborated statistical calculation. Moreover, log-likelihood ratios are regarded as a more effective method to identify collocations, especially when the occurrence count is very low (Dunning, 1993). Smadja's XTRACT is the pioneering work on extracting collocation types. XTRACT employed three different statistical measures related to how associated a pair must be to count as a collocation type. It is complicated to set different thresholds for each statistical measure. We decided to research and develop a new and simple method to extract monolingual collocations. We also provide a web-based user interface capable of searching those collocations and their usage. The concordancer supports language learners in acquiring the usage of collocations.
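The log-likelihood ratio cited above (Dunning, 1993) is computed from a 2x2 contingency table of bigram counts. A small self-contained sketch, with invented counts rather than real BNC data:

```python
# Log-likelihood ratio (Dunning, 1993) for a candidate collocation.
# Sketch only; the counts in the example are invented.
from math import log

def g2(c11, c12, c21, c22):
    """G^2 over a 2x2 contingency table of bigram counts.

    c11: w1 followed by w2;       c12: w1 followed by anything else
    c21: other words before w2;   c22: all remaining bigrams
    """
    n = c11 + c12 + c21 + c22
    total = 0.0
    for observed, row, col in (
        (c11, c11 + c12, c11 + c21),
        (c12, c11 + c12, c12 + c22),
        (c21, c21 + c22, c11 + c21),
        (c22, c21 + c22, c12 + c22),
    ):
        expected = row * col / n  # count expected under independence
        if observed > 0:
            total += observed * log(observed / expected)
    return 2.0 * total

# A rare but strongly associated pair scores high despite low counts
print(round(g2(8, 1_500, 450, 14_300_000), 1))
```

Higher G2 means stronger association; unlike raw frequency, it remains informative for low-count pairs, which is why the paper cites it.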
  • aConCorde: Towards a Proper Concordance for Arabic. Andrew Roberts, Latifa Al-Sulaiti, Eric Atwell
aConCorde: towards a proper concordance for Arabic. Andrew Roberts, Dr Latifa Al-Sulaiti and Eric Atwell. School of Computing, University of Leeds, LS2 9JT, United Kingdom. {andyr,latifa,eric}@comp.leeds.ac.uk. July 12, 2005.

Abstract. Arabic corpus linguistics is currently enjoying a surge in activity. As the growth in the number of available Arabic corpora continues, there is an increased need for robust tools that can process this data, whether it be for research or teaching. One such tool that is useful for both groups is the concordancer, a simple tool for displaying a specified target word in its context. However, obtaining one that can reliably cope with the Arabic language had proved extremely difficult. Therefore, aConCorde was created to provide such a tool to the community.

1 Introduction. A concordancer is a simple tool for summarising the contents of corpora based on words of interest to the user. Otherwise, manually navigating through (often very large) corpora would be a long and tedious task. Concordancers are therefore extremely useful as a time-saving tool, but also much more. By isolating keywords in their contexts, linguists can use concordance output to understand the behaviour of interesting words. Lexicographers can use the evidence to find if a word has multiple senses, and also towards defining their meaning. There is also much research and discussion about how concordance tools can be beneficial for data-driven language learning (Johns, 1990). A number of studies have demonstrated that providing access to corpora and concordancers benefited students learning a second language.
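At its core, concordance output is a key-word-in-context (KWIC) listing. The following is a generic miniature concordancer in Python, a sketch of the idea only, not aConCorde itself (which additionally handles Arabic script and right-to-left display):

```python
# A concordancer in miniature: key-word-in-context (KWIC) lines
# showing each occurrence of a target word with fixed-width context.
import re

def kwic(text, keyword, width=30):
    """Yield left-context [keyword] right-context lines for each hit."""
    for m in re.finditer(rf"\b{re.escape(keyword)}\b", text, re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        yield f"{left:>{width}} [{m.group()}] {right:<{width}}"

corpus = ("Corpus linguistics studies language through corpora. "
          "A corpus is a structured collection of texts, and every "
          "corpus raises sampling questions.")
for line in kwic(corpus, "corpus"):
    print(line)
```

Aligning hits in a fixed-width column is what lets a linguist scan hundreds of occurrences at a glance, which is the time-saving property the abstract describes.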
  • Details from a Distance? A Dutch Pipeline for Event Detection. Chantal van Son et al.
Details from a distance? A Dutch pipeline for event detection. CLIN26, Amsterdam. Chantal van Son, Marieke van Erp, Antske Fokkens, Paul Huygen, Ruben Izquierdo Bevia & Piek Vossen.

NewsReader & BiographyNet apply the detailed analyses typically associated with close (or at least non-distant) reading to large amounts of textual data.
✤ NewsReader: (financial) news data. Day-by-day processing of news, reconstructing a coherent story. Four languages: English, Spanish, Italian, Dutch. http://www.newsreader-project.eu/
✤ BiographyNet: biographical data. Answering historical questions with digitized data from the Biography Portal of the Netherlands (www.biografischportaal.nl) using computational methods. http://www.biographynet.nl/

Event extraction and representation:
1. Intra-document processing: a Dutch NLP pipeline generating NAF files. The pipeline includes tokenization, parsing, SRL, NERC, etc.; the NLP Annotation Format (NAF) is a multi-layered annotation format for representing linguistic annotations in complex NLP architectures.
2. Cross-document processing: event coreference to convert NAF to an RDF representation, as sketched below.
3. Storage of the NAF and RDF in a KnowledgeStore.
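Step 2 above converts per-document annotations into RDF. A minimal sketch of such triple generation with rdflib (assuming rdflib >= 6); the namespace and property names are illustrative stand-ins, not the actual NewsReader or NAF schema:

```python
# Toy conversion of a cross-document event instance into RDF triples.
# The "ex" namespace and the event fields are hypothetical examples.
from rdflib import Graph, Literal, Namespace, URIRef

EX = Namespace("http://example.org/events/")
g = Graph()
g.bind("ex", EX)

# One event instance with a predicate, an actor, and two mentions
# pointing back to character offsets in the source documents
event = URIRef(EX["ev1"])
g.add((event, EX.predicate, Literal("acquire")))
g.add((event, EX.actor, Literal("Company A")))
g.add((event, EX.mention, URIRef("http://example.org/doc1#char=120,135")))
g.add((event, EX.mention, URIRef("http://example.org/doc2#char=88,101")))

print(g.serialize(format="turtle"))
```

The key move is the same as in the slides: mentions stay anchored to documents, while the event instance itself becomes a graph node that queries can aggregate across the whole collection.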
  • Linguamática, Volume 12, Número 2 (2020). ISSN: 1647-0818
Linguamática, Volume 12, Número 2 (2020). ISSN: 1647-0818. Editors: Alberto Simões, José João Almeida, Xavier Gómez Guinovart.

Contents. Research articles:
Adaptação Lexical Automática em Textos Informativos do Português Brasileiro para o Ensino Fundamental (Automatic lexical adaptation in Brazilian Portuguese informative texts for primary school), Nathan Siegle Hartmann & Sandra Maria Aluísio
Avaliando entidades mencionadas na coleção ELTeC-por (Evaluating named entities in the ELTeC-por collection), Diana Santos, Eckhard Bick & Marcin Wlodek
Avaliação de recursos computacionais para o português (Evaluating computational resources for Portuguese), Matilde Gonçalves, Luísa Coheur, Jorge Baptista & Ana Mineiro

Projects, introduce yourselves!:
Aplicación de WordNet e de word embeddings no desenvolvemento de prototipos para a xeración automática da lingua (Application of WordNet and word embeddings in the development of prototypes for automatic language generation), María José Domínguez Vázquez

Editorial. It seems only a short while ago that we published the first issue of Linguamática, and suddenly here we are celebrating a dozen years. To all who have collaborated throughout these years, whether as readers, authors or reviewers, our deepest thanks. It is not easy to keep such a project going for so many years without any funding. All this success is due to the voluntary work of everyone involved, which allows us to publish quality articles that are freely accessible to the whole scientific community. Over these twelve years many people have accompanied us, notably on the scientific committee. Some of the first members, invited in 2008, still take an active part in this project, providing careful reviews. Others, owing to their academic and personal paths, no longer collaborate with us. As a token of gratitude we have kept their names in the publication.
  • PANACEA D6.1: Technologies and Tools for Lexical Acquisition
SEVENTH FRAMEWORK PROGRAMME, Theme 3: Information and Communication Technologies. PANACEA Project, Grant Agreement no. 248064: Platform for Automatic, Normalized Annotation and Cost-Effective Acquisition of Language Resources for Human Language Technologies.

D6.1 Technologies and Tools for Lexical Acquisition. Dissemination level: Public. Delivery date: July 16th 2010. Status – Version: Final.

Author(s) and affiliation: Laura Rimell (UCAM), Anna Korhonen (UCAM), Valeria Quochi (ILC-CNR), Núria Bel (UPF), Tommaso Caselli (ILC-CNR), Prokopis Prokopidis (ILSP), Maria Gavrilidou (ILSP), Thierry Poibeau (UCAM), Muntsa Padró (UPF), Eva Revilla (UPF), Monica Monachini (CNR-ILC), Maurizio Tesconi (CNR-IIT), Matteo Abrate (CNR-IIT) and Clara Bacciu (CNR-IIT).

This document is part of the technical documentation generated in the PANACEA Project, Platform for Automatic, Normalized Annotation and Cost-Effective Acquisition (Grant Agreement no. 248064). This document is licensed under a Creative Commons Attribution 3.0 Spain License. To view a copy of this license, visit http://creativecommons.org/licenses/by/3.0/es/. Please send feedback and questions on this document to: [email protected], TRL Group (Tecnologies dels Recursos Lingüístics), Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra (IULA-UPF).
  • Translate's Localization Guide
Translate's Localization Guide, Release 0.9.0. Translate, Jun 26, 2020.

Contents: 1. Localisation Guide; 2. Glossary; 3. Language Information.

Chapter 1: Localisation Guide. The general aim of this document is not to replace other well-written works but to draw them together. So, for instance, the section on projects contains information that should help you get started and point you to documents that are often hard to find. The section on translation should provide a general enough overview of common mistakes and pitfalls. We have found the localisation community very fragmented and hope that through this document we can bring people together and unify information that is out there but in many, many different places. The one section that we feel is unique is the guide for developers: they make assumptions about localisation without fully understanding the implications. We complain, but honestly there is not one place that can give a developer an overview of what is needed from them; we hope that the developer section goes a long way towards solving that issue.

1.1 Purpose. The purpose of this document is to provide one reference for localisers. You will find lots of information on localising and packaging on the web, but not a single resource that can guide you. Most of the information is also domain-specific, i.e. it addresses KDE, Mozilla, etc. We hope that this is more general. This document also goes beyond the technical aspects of localisation, which seem to be the domain of other localisation documents.
  • VIANA: Visual Interactive Annotation of Argumentation
VIANA: Visual Interactive Annotation of Argumentation. Fabian Sperrle, Rita Sevastjanova, Rebecca Kehlbeck, Mennatallah El-Assady.

Figure 1: VIANA is a system for interactive annotation of argumentation. It offers five different analysis layers, each represented by a different view and tailored to a specific task. The layers are connected with semantic transitions. With increasing progress of the analysis, users can abstract away from the text representation and seamlessly transition towards distant-reading interfaces.

Abstract. Argumentation mining addresses the challenging tasks of identifying boundaries of argumentative text fragments and extracting their relationships. Fully automated solutions do not reach satisfactory accuracy due to their insufficient incorporation of semantics and domain knowledge. Therefore, experts currently rely on time-consuming manual annotations. In this paper, we present a visual analytics system that augments the manual annotation process by automatically suggesting which text fragments to annotate next. The accuracy of those suggestions is improved over time by incorporating linguistic knowledge and language modeling to learn …

… are actively developing techniques for the extraction of argumentative fragments of text and the relations between them [41]. To develop and train these complex, tailored systems, experts rely on large corpora of annotated gold-standard training data. However, these training corpora are difficult and expensive to produce, as they extensively rely on the fine-grained manual annotation of argumentative structures. An additional barrier to unifying and streamlining this annotation process and, in turn, the generation of gold-standard corpora, is the subjectivity of the task. A reported agreement between human annotators with Cohen's κ [11] of 0.610 [64] is considered "substantial" [38] and is the state of the art in the field.
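The figure quoted above is Cohen's κ, the chance-corrected agreement between two annotators. A small sketch of the computation on toy argumentation labels (not data from the paper):

```python
# Cohen's kappa: agreement between two annotators, corrected for the
# agreement expected by chance. Labels below are invented examples.
from collections import Counter

def cohen_kappa(a, b):
    """Chance-corrected agreement between two label sequences."""
    assert len(a) == len(b)
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement if each annotator labeled at random
    # according to their own label distribution
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[l] * cb[l] for l in ca.keys() | cb.keys()) / n**2
    return (observed - expected) / (1 - expected)

r1 = ["claim", "premise", "none", "claim", "premise", "none"]
r2 = ["claim", "premise", "claim", "claim", "none", "none"]
print(round(cohen_kappa(r1, r2), 3))  # 0.5
```

Values near 0 mean agreement no better than chance; the 0.610 cited above falls in the band conventionally read as "substantial" agreement, which underlines how subjective argument annotation is.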