Selected Papers from the CLARIN Annual Conference 2015
October 14–16, 2015, Wrocław, Poland
Edited by Koenraad De Smedt
Published by Linköping University Electronic Press, Sweden
Linköping Electronic Conference Proceedings, No. 123
NEALT Proceedings Series, Volume 28
eISSN 1650-3740 • ISSN 1650-3686 • ISBN 978-91-7685-765-6

Preface

These proceedings present the highlights of the CLARIN Annual Conference 2015 that took place in Wrocław, Poland. It is the second volume of what we hope is becoming a series of publications that constitutes a multiannual record of the experience and insights gained within CLARIN. The selected papers present results which were obtained in the projects conducted within and between the national consortia and which contribute to the realization of what CLARIN aims to be: a European Research Infrastructure providing sustainable access to language resources in all forms, analysis services for the processing of language materials, and a platform for the sharing of knowledge that can stimulate the use, reuse and repurposing of the available data.

As an international research infrastructure, CLARIN has the potential to lower the barriers for researchers to enter digital scholarship and cutting-edge research. With the growing number of participating countries and languages covered, CLARIN will help to establish a truly connected and multilingual European Research Area for digital research in the Humanities and Social Sciences, in which Europe’s multilinguality reinforces the basis for the comparative investigation of a wide range of intellectual and societal phenomena.

This volume illustrates the various ways in which CLARIN supports researchers and it shows that there are more and more scholarly domains in which language resources are a crucial data type, whether approached as big data or not so big data.
Language resources can play a multitude of roles, such as carrier of information, record of the past, means of literary expression, social signal, or object of linguistic study. As diverse as the roles of language data are in the various research domains, so diverse is the community of scholars collaborating within CLARIN. The Annual Conference is one of the instruments for organizing the communication between those who build and maintain the infrastructure, those who provide data and tools, and those who use the CLARIN infrastructure in their scholarly projects. For all these groups, the sharing of insights into problems, solutions, failures and successes is a prerequisite for taking the next steps. Even more importantly, the Annual Conference offers a platform for the sharing of inspiring examples of investigations which have become feasible because of the infrastructure.

These proceedings once again demonstrate the potential impact of CLARIN on scholarly work. With the increasing volume and diversity of services offered, the potential for impact on the research agendas addressing societal challenges is growing as well. Hopefully this volume will help to attract new categories of users, with ideas and requirements for use cases that can help us identify the directions and next steps to take in the further development of the CLARIN infrastructure.

Franciska de Jong
CLARIN ERIC Executive Director
Utrecht, April 4, 2016

Introduction

This volume contains a selection of papers presented at the CLARIN Annual Conference 2015 which was held in Wrocław, Poland, from the 14th to the 16th of October 2015. CLARIN has been organizing its Annual Conference since 2012. The aim of these conferences is to exchange ideas and experiences on the CLARIN infrastructure.
Topics include the infrastructure’s design, construction and operation, the data and services that it contains or should contain, its actual or potential use by researchers, its relation to other infrastructures and projects, and the CLARIN Knowledge Sharing Infrastructure.

Since 2014, CLARIN has changed the format of these events by launching an open call for contributions, subjecting submissions to peer review, and publishing Selected Papers after the event. The program of the conference is the responsibility of the CLARIN National Coordinators’ Forum (NCF). Each submission was reviewed by at least three members of the NCF or people delegated by NCF members.

In 2015 the program consisted of 22 presentations, accepted on the basis of extended abstracts which were published in a Book of Abstracts. Ten were accepted for oral presentation and twelve as posters. Some presentations were supplemented by demonstrations. The program also included an invited talk, "Interaction and dialogue with large-scale textual data: Parliamentary speeches about migration and speeches by migrants as a use case", by Andreas Blätte (University of Duisburg-Essen). This talk brought methodologies relevant to CLARIN into the context of current societal concerns.

After the conference, the authors of all presentations were asked to submit full papers. Ten full papers were received, each of which was evaluated by four reviewers. Nine papers were selected for publication, involving authors from Austria, the Czech Republic, Estonia, Finland, Germany, The Netherlands, Norway and Sweden.

A substantial range of topics is covered by the papers in this volume. These include the construction of the CLARIN infrastructure, the standards that underpin it (in particular as regards metadata and data concepts), and the enrichment of data and tools that populate it.
There are also use cases of its scientific exploitation in the fields of Dutch linguistics and rhetorical history, with appropriate methodological considerations. Research data workflows, data curation, data management plans and the regulatory and contractual framework governing the use of data are also addressed.

I thank all reviewers for their evaluation of the submissions. I also thank Peter Berkesand at Linköping University Electronic Press for the efficient digital publication of this volume.

Koenraad De Smedt
Program Committee Chair
Bergen, April 4, 2016

Program Committee

Lars Borin, António Branco, Koenraad De Smedt, Tomaz Erjavec, Eva Hajičová, Erhard Hinrichs, Bente Maegaard, Karlheinz Mörth, Jan Odijk, Rūta Petrauskaitė, Maciej Piasecki, Stelios Piperidis, Kiril Simov, Kadri Vider, Martin Wynne

Additional reviewers

Daniël de Kok, Maria Gavrilidou, Paweł Kamocki

Table of Contents

The CLARINO Bergen Centre: Development and Deployment
Koenraad De Smedt, Gunn Inger Lyse, Rune Kyrkjebø, Hemed Al Ruwehy, Øyvind Liland Gjesdal, Victoria Rosén and Paul Meurer

The Regulatory and Contractual Framework as an Integral Part of the CLARIN Infrastructure
Aleksei Kelli, Kadri Vider and Krister Lindén

Variability of the Facet Values in the VLO – a Case for Metadata Curation
Margaret King, Davor Ostojic, Matej Ďurčo and Go Sugimoto

A Use Case for Linguistic Research on Dutch with CLARIN
Jan Odijk

CLARIN Concept Registry: The New Semantic Registry
Ineke Schuurman, Menzo Windhouwer, Oddrun Ohren and Daniel Zeman

DMPTY – A Wizard For Generating Data Management Plans
Thorsten Trippel and Claus Zinn

How Can Big Data Help Us Study Rhetorical History?
Jon Viklund and Lars Borin

Research Data Workflows: From Research Data Lifecycle Models to Institutional Solutions
Tanja Wissik and Matej Ďurčo

Enriching a Grammatical Database with Intelligent Links to Linguistic Resources
Ton van der Wouden, Gosse Bouma, Matje van de Kamp, Marjo van Koppen, Frank Landsbergen and Jan Odijk

The CLARINO Bergen Centre: Development and Deployment

Koenraad De Smedt, Gunn Inger Lyse, Rune Kyrkjebø, Hemed Al Ruwehy, Øyvind Liland Gjesdal, and Victoria Rosén
University of Bergen
Bergen, Norway
{desmedt|gunn.lyse|rune.kyrkjebo|hemed.ruwehy|oyvind.gjesdal|victoria}@uib.no

Paul Meurer
Uni Research Computing
Bergen, Norway
[email protected]

Abstract

The CLARINO Bergen Centre (Norway) provides a language resource repository, corpus and treebank services and metadata management services. We explain the motivation for using the LINDAT repository software as a model and describe the cloning and adaptation of that software for the CLARINO Bergen Repository. We also describe how the other centre services addressing CLARIN goals have been integrated into the centre, focusing on the steps taken to adapt the INESS treebanking service to CLARIN standards.

1 Introduction

The CLARIN ERIC is a distributed research infrastructure, realized in the form of a network of centres which offer access to language data and tools and online services for search, analysis, visualization and other processing. The Norwegian research infrastructure project CLARINO is constructing a network of CLARIN centres in Norway. In this context, the CLARINO Bergen Centre was established through a cooperation between the University of Bergen (Norway) and Uni Research Computing (a research institute, also in Bergen). This centre was awarded CLARIN centre type B status in January 2016.1 Every CLARIN centre of this type is required to run a data repository in accordance with certain criteria for good practice and compatibility with the CLARIN infrastructure.2 The first section of this paper exemplifies the value of sharing technical solutions within the CLARIN community by describing how the repository software from another CLARIN member was cloned and adapted for the CLARINO Bergen Repository.
In addition to providing a repository, a CLARIN centre may provide services for language data management, analysis, visualization, etc. In the CLARINO Bergen Centre these services