Proceedings of CLARIN Annual Conference 2020

CLARIN Annual Conference 2020

PROCEEDINGS

Edited by Costanza Navarretta, Maria Eskevich

05 – 07 October 2020
Virtual Edition

Please cite as: Proceedings of CLARIN Annual Conference 2020. Eds. C. Navarretta and M. Eskevich. Virtual Edition, 2020.

Programme Committee

Chair:
• Costanza Navarretta, University of Copenhagen, Denmark

Members:
• Lars Borin, University of Gothenburg (SE)
• António Branco, Universidade de Lisboa (PT)
• Tomaž Erjavec, Jožef Stefan Institute (SI)
• Eva Hajičová, Charles University Prague (CZ)
• Erhard Hinrichs, University of Tübingen (DE)
• Nicolas Larrousse, Huma-Num (FR)
• Krister Lindén, University of Helsinki (FI)
• Monica Monachini, Institute of Computational Linguistics "A. Zampolli" (IT)
• Karlheinz Mörth, Austrian Academy of Sciences (AT)
• Jan Odijk, Utrecht University (NL)
• Maciej Piasecki, Wrocław University of Science and Technology (PL)
• Stelios Piperidis, Institute for Language and Speech Processing (ILSP), Athena Research Center (GR)
• Eirikur Rögnvaldsson, University of Iceland (IS)
• Kiril Simov, IICT, Bulgarian Academy of Sciences (BG)
• Inguna Skadiņa, University of Latvia (LV)
• Koenraad De Smedt, University of Bergen (NO)
• Marko Tadić, University of Zagreb (HR)
• Jurgita Vaičenonienė, Vytautas Magnus University (LT)
• Tamás Váradi, Research Institute for Linguistics, Hungarian Academy of Sciences (HU)
• Kadri Vider, University of Tartu (EE)
• Martin Wynne, University of Oxford (UK)

Reviewers:
• Lars Borin, SE
• António Branco, PT
• Tomaž Erjavec, SI
• Eva Hajičová, CZ
• Martin Hennelly, ZA
• Erhard Hinrichs, DE
• Marinos Ioannides, CY
• Nicolas Larrousse, FR
• Krister Lindén, FI
• Monica Monachini, IT
• Karlheinz Mörth, AT
• Costanza Navarretta, DK
• Jan Odijk, NL
• Stelios Piperidis, GR
• Eirikur Rögnvaldsson, IS
• Kiril Simov, BG
• Inguna Skadiņa, LV
• Koenraad De Smedt, NO
• Marko Tadić, HR
• Jurgita Vaičenonienė, LT
• Tamás Váradi, HU
• Kadri Vider, EE
• Martin Wynne, UK

Subreviewers:
• Federico Boschetti, IT
• Daan Broeder, NL
• Francesca Frontini, FR
• Maria Gavriilidou, GR
• Riccardo Del Gratta, IT
• Neeme Kahusk, EE
• Aleksei Kelli, EE
• Fahad Khan, IT
• Daniël de Kok, DE
• Penny Labropoulou, GR
• Christophe Parisse, FR
• Niccolò Pretto, IT
• Thorsten Trippel, DE
• Valeria Quochi, IT

CLARIN 2020 submissions, review process and acceptance

• Call for abstracts: 9 January 2020, 24 February 2020
• Submission deadline: 28 April 2020
• In total 40 submissions were received and reviewed (three reviews per submission)
• Virtual PC meeting: 15-16 June 2020
• Notifications to authors: 22 June 2020
• 36 accepted submissions

More details on the paper selection procedure and the conference can be found at https://www.clarin.eu/event/2020/clarin-annual-conference-2020-virtual-form.

Table of Contents

Resources and Knowledge Centres for Language and AI Research

The CLARIN Resource and Tool Families
    Jakob Lenardič and Darja Fišer

An Internationally FAIR Mediated Digital Discourse Corpus: Towards Scientific and Pedagogical Reuse
    Rachel Panckhurst and Francesca Frontini

The First Dictionaries in Esperanto: Towards the Creation of a Parallel Corpus
    Denis Eckert and Francesca Frontini

Digital Neuropsychological Tests and Biomarkers: Resources for NLP and AI Exploration in the Neuropsychological Domain
    Dimitrios Kokkinakis and Kristina Lundholm Fors

CORLI: The French Knowledge-Centre
    Eva Soroli, Céline Poudat, Flora Badin, Antonio Balvet, Elisabeth Delais-Roussarie, Carole Etienne, Lydia-Mai Ho-Dac, Loïc Liégeois and Christophe Parisse

The CLASSLA Knowledge Centre for South Slavic Languages
    Nikola Ljubešić, Petya Osenova, Tomaž Erjavec and Kiril Simov

Annotation and Visualization Tools

Sticker2: A Neural Syntax Annotator for Dutch and German
    Daniël de Kok, Neele Falk and Tobias Pütz

Exploring and Visualizing Data with GermaNet Rover
    Marie Hinrichs, Richard Lawrence and Erhard Hinrichs

Named Entity Recognition for Distant Reading in ELTeC
    Francesca Frontini, Carmen Brando, Joanna Byszuk, Ioana Galleron, Diana Santos and Ranka Stanković

Towards Semi-Automatic Analysis of Spontaneous Language for Dutch
    Jan Odijk

A Neural Parsing Pipeline for Icelandic Using the Berkeley Neural Parser
    Þórunn Arnardóttir and Anton Karl Ingason

Research Cases

Annotating Risk Factor Mentions in the COVID-19 Open Research Dataset
    Maria Skeppstedt, Magnus Ahltorp, Gunnar Eriksson and Rickard Domeij

Contagious "Corona" Compounding in a CLARIN Newspaper Monitor Corpus
    Koenraad De Smedt

Trawling the Gulf of Bothnia of News: A Big Data Analysis of the Emergence of Terrorism in Swedish and Finnish Newspapers, 1780–1926
    Mats Fridlund, Daniel Brodén, Leif-Jöran Olsson, Lars Borin

Studying Emerging New Contexts for Museum Digitisations on Pinterest
    Bodil Axelsson, Daniel Holmer, Lars Ahrenberg and Arne Jönsson

Evaluation of a Two-OCR Engine Method: First Results on Digitized Swedish Newspapers Spanning over nearly 200 Years
    Dana Dannélls, Lars Björk, Torsten Johansson and Ove Dirdal

Stimulating Knowledge Exchange via Trans-National Access – the ELEXIS Travel Grants as a Lexicographical Use Case
    Sussi Olsen, Bolette S. Pedersen, Tanja Wissik, Anna Woldrich and Simon Krek

Repositories and Workflows

PoetryLab as Infrastructure for the Analysis of Spanish Poetry
    Javier De la Rosa, Álvaro Pérez, Laura Hernández, Aitor Díaz, Salvador Ros and Elena González-Blanco

Reproducible Annotation Services for WebLicht
    Daniël de Kok and Neele Falk

The CLARIN-DK Text Tonsorium
    Bart Jongejan

Integrating TEITOK and Kontext at LINDAT
    Marten Janssen

CLARINO+ Optimization of Wittgenstein Research Tools
    Alois Pichler

Using the FLAT Repository: Two Years In
    Paul Trilsbeek

Data Curation, Archives and Libraries

Building a Home for Italian Audio Archives
    Silvia Calamai, Niccolò Pretto, Monica Monachini, Maria Francesca Stamuli, Silvia Bianchi and Pierangelo Bonazzoli

Digitizing University Libraries – Evolving from Full Text Providers to CLARIN Contact Points on Campuses
    Manfred Nölte and Martin Mehlberg

"Tea for Two": The Archive of the Italian Latinity of the Middle Ages meets the CLARIN Infrastructure
    Federico Boschetti, Riccardo Del Gratta, Monica Monachini, Marina Buzzoni, Paolo Monella and Roberto Rosselli Del Turco

Use Cases of the ISO Standard for Transcription of Spoken Language in the Project INEL
    Anne Ferger and Daniel Jettka

Evaluating and Assuring Research Data Quality for Audiovisual Annotated Language Data
    Timofey Arkhangelskiy and Hanna Hedeland

Towards Comprehensive Definitions of Data Quality for Audiovisual Annotated Language Resources
    Hanna Hedeland

Towards an Interdisciplinary Annotation Framework: Combining NLP and Expertise in Humanities
    Laska Laskova, Petya Osenova and Kiril Simov

Metadata and Legal Aspects

Signposts for CLARIN
    Denis Arnold, Bernhard Fisseni and Thorsten Trippel

Extending the CMDI Universe: Metadata for Bioinformatics Data
    Olaf Brandt, Holger Gauza, Steve Kaminski, Mario Trojan and Thorsten Trippel

The CMDI Explorer
    Denis Arnold, Ben Campbell, Thomas Eckart, Bernhard Fisseni, Thorsten Trippel and Claus Zinn

Going to the ALPS: A Tool to Support Researchers and Help Legality Awareness Building
    Veronika Gründhammer, Vanessa Hannesschläger and Martina Trognitz

When Size Matters. Legal Perspective(s) on N-grams
    Pawel Kamocki

CLARIN Contractual Framework for Sharing Language Data: The Perspective of Personal Data Protection
    Aleksei Kelli, Krister Lindén, Kadri Vider, Pawel Kamocki, Ramūnas Birštonas, Gaabriel Tavits, Penny Labropoulou, Mari Keskküla and Arvi Tavast

Resources and Knowledge Centers for Language and AI Research

Extending the CLARIN Resource and Tool Families

Jakob Lenardič¹, Darja Fišer¹,²
¹ Dept. of Translation, Faculty of Arts, University of Ljubljana, Slovenia
² Dept. of Knowledge Technologies, Jožef Stefan Institute, Slovenia
[email protected] [email protected]

Abstract

This paper presents the current state of the CLARIN Resource and Tool families initiative, the aim of which is to provide user-friendly overviews of the available language resources and tools in the CLARIN infrastructure for researchers from digital humanities, social sciences and human language technologies. The initiative now consists of a total of 11 corpus families, 5 families of lexical resources, and 4 tool families, which together amount to 950 manually curated tools and resources as of 17 August 2020. We present the initiative from the perspective of missing metadata as well as problems related to the general accessibility of the tools and resources and their findability in the Virtual Language Observatory (VLO).

1 Introduction

This paper presents the current state of the CLARIN Resource and Tool families initiative,¹ the aim of which is to provide user-friendly overviews of the available language resources and tools in the CLARIN infrastructure for researchers from digital humanities, social sciences and human language technologies. When the initiative first made its debut at LREC in 2018, it consisted of early versions of only 4 corpus families: parliamentary, computer-mediated communication (CMC), parallel, and newspaper corpora (Fišer et al., 2018). Since then, the initiative has been expanded substantially and now consists

