Distributed Language Resources and Technology in a European Infrastructure
Total Page:16
File Type:pdf, Size:1020Kb
Proceedings of the 1st International Workshop on Language Technology Platforms (IWLTP 2020), pages 28–34 Language Resources and Evaluation Conference (LREC 2020), Marseille, 11–16 May 2020 c European Language Resources Association (ELRA), licensed under CC-BY-NC CLARIN: Distributed Language Resources and Technology in a European Infrastructure Maria Eskevich1, Franciska de Jong1, Alexander Konig¨ 1, Darja Fiserˇ 2, Dieter Van Uytvanck1, Tero Aalto3, Lars Borin4, Olga Gerassimenko5, Jan Hajic6, Henk van den Heuvel7, Neeme Kahusk5, Krista Liin5, Martin Matthiesen8, Stelios Piperidis9, and Kadri Vider5 1 CLARIN ERIC, Utrecht, The Netherlands ·2 University of Ljubljana and Jozefˇ Stefan, Ljubljana, Slovenia · 3 The Language Bank of Finland·4 Sprakbanken,˚ University of Gothenburg, Sweden·5 University of Tartu, Estonia· 6 Charles University, Prague, Czech Republic ·7 CLST, Radboud University, Nijmegen, The Netherlands · 8 CSC - IT Center for Science, Espoo, Finland ·9 ILSP/Athena RC, Greece [email protected] Abstract CLARIN is a European Research Infrastructure providing access to digital language resources and tools from across Europe and beyond to researchers in the humanities and social sciences. This paper focuses on CLARIN as a platform for the sharing of language resources. It zooms in on the service offer for the aggregation of language repositories and the value proposition for a number of communities that benefit from the enhanced visibility of their data and services as a result of integration in CLARIN. The enhanced findability of language resources is serving the social sciences and humanities (SSH) community at large and supports research communities that aim to collaborate based on virtual collections for a specific domain. The paper also addresses the wider landscape of service platforms based on language technologies which has the potential of becoming a powerful set of interoperable facilities to a variety of communities of use. Keywords: CLARIN, language resources, research infrastructure, repositories, interoperability 1. Introduction CLARIN1 is a European Research Infrastructure provid- ing access to language resources and tools. It focuses on the widely acknowledged role of language as cultural and social data and the increased potential for compara- tive research of cultural and societal phenomena across the boundaries of languages. Since its establishment as a Euro- pean Research Infrastructure Consortium (ERIC) in 2012, CLARIN has grown both in terms of number of members and observers (21 and 3 respectively in Spring 2020, see Figure 1.), and in terms of the variety of specific communi- ties served (diverse subfields within the humanities and so- cial science, such as literary studies, oral and social history, political studies, historical linguistics, developers of anal- ysis systems based on machine learning, etc.). A strong focus on interoperability between the wide variety of re- sources ensures the steady and reliable development of the infrastructure, which is also reinforced by the polices for Figure 1: Map of CLARIN members, observers, and par- research infrastructures that have been established in align- ticipating centres at the start of 2020. ment with the European Strategy Forum for Research In- frastructures (ESFRI)2. This paper is organised as follows: Section 2. describes the general principles that are to be followed to secure the 2. CLARIN as a FAIR platform interoperability for resources, and provides motivation for The FAIR Guiding Principles for Data Management and the CLARIN use case; and gives an overview of reposi- Stewardship (Wilkinson et al., 2016) provide a universal tory solutions that are being used within CLARIN; Section framework for data management, based on the idea that re- 3. provides a number of examples of data-driven communi- search data should be Findable, Accessible, Interoperable ties that are brought together through the access to language and Reusable. resources that can be explored using approaches and meth- ods of diverse academic fields; and Section 4. outlines the Overall, the FAIR principles are widely being promoted as overall landscape of technical solutions CLARIN works in. part of the Open Science paradigm and are supposed to con- tribute to the ease of discovery and access of research data by researchers and the general public. Reuse of data is fos- 1https://www.clarin.eu tered by promoting the use of widely accepted standards 2https://www.esfri.eu/about both for the data itself and for the metadata describing it. 28 CLARIN is committed to promoting the FAIR data As the next step towards interoperability, a tool has been de- paradigm (de Jong et al., 2018). With the Virtual Lan- veloped (Zinn, 2016) to provide guidance on which service guage Observatory (VLO)3 (see Section 2.2.) CLARIN is recommended for which data, known as the Language provides a search engine that helps exploring over a mil- Resource Switchboard 9. It acts as a simple forwarding ap- lion language resources from dozens of CLARIN Centres plication that, based on the URL of an input file and a few spread over all of Europe and beyond. Apart from a shared simple parameters (language, mimetype, task), allows the metadata paradigm that enables this kind of central discov- user to select relevant NLP web applications that can ana- ery, technical interoperability is ensured by the technical lyze the input provided. specifications for CLARIN Centres (see Section 2.3.) and the accessibility of the data is managed with the help of a 2.3. CLARIN landscape of repositories 4 SAML-based Federated Identity setup. This section contains an overview of repositories used 2.1. From FAIR to actionable throughout the CLARIN infrastructure, and internal tech- nical solutions to support the interoperability within the Both persistent identifiers (PIDs) and the FAIR guidelines network of CLARIN centres. The CLARIN infrastructure have been existing for quite a while. Recent efforts in the 5 backbone is a network of CLARIN Centres that provide ac- context of the Research Data Alliance have paved the way cess to language resources in a multitude of languages from to enhance the already existing Handle infrastructure into 6 European roots and beyond, in a variety of modalities and an ecosystem for FAIR Digital Objects (DOs) that fully formats. Most prominent is the role of the service provid- supports machine-actionability: the capacity of computa- ing centres, called B Centres, which offer services to the tional systems to find, access, interoperate, and reuse data CLARIN community, such as access to linguistic software with none or minimal human intervention. The principle or language data. There are also C Centres which allow the behind FAIR Digital Objects is to enrich Handles with a harvesting of metadata for the language resources and tools core of directly accessible metadata descriptions (the PID by the VLO, but do not offer any additional services. The kernel information, which can have community-specific ex- most important difference between the two types of centres tensions). These metadata elements can be unambiguously is that B centres have to follow precise technical specifica- interpreted with the help of a Data Type Registry, which tions10 and are regularly evaluated and certified. The cer- contains the definitions of the elements. An important dif- tification procedure is led by CLARIN Central Assessment ference with the more extensive metadata provided outside Committee.11. One of the assessment criteria is that an ap- the Digital Objects (as described in Section 2.2.) is the plication needs to be prepared for certification through the speed with which the information can be retrieved and the independent certification organisation CoreTrustSeal.12 cross-community standardization. 7 The CLARIN network currently consists of 23 B and 22 While FAIR DOs are not yet in a production-ready state it C Centres. While C Centre status does not come with the is clearly an initiative gaining a lot of traction (Hodson et expectation of running a research data repository, a lot of al., 2018), with the potential to bring significant progress in them actually do, resulting in a network of 41 centres with the field of language resource processing and beyond. a repository. While the technical specifications (for B Cen- 2.2. Discoverability through the VLO and tres, see above) have some requirements on what such a CMDI metadata repository has to be able to do and the services it has to offer, the individual centres are free in their choice of the Within the concept of FAIR research data, the aspect of actual software they run and this results in a quite varied Findability is the most important one, because data that “repository landscape” within CLARIN. cannot be discovered by interested parties cannot be reused, no matter how well-designed and interoperable the data it- Repository type Number of centres self is. CLARIN has put this aspect front and centre by DSpace 14 making it a hard requirement for their (B and C) centres to Fedora 10 provide metadata about their collections in a well-defined format that is shared within all of CLARIN. A CLARIN META-SHARE 4 centre has to provide its metadata in the CMDI-format Git 2 (Broeder et al., 2012) via the OAI-PMH protocol8. All of LAT 2 these OAI endpoints are regularly checked for updates. Any Dataverse 1 new metadata elements are harvested and fed into the VLO, Custom 8 a facet-based search portal, where the collected metadata TOTAL 41 can be searched by interested users. Table 1: Type of repositories used in CLARIN centers. This information is provided at registration stage. 3https://vlo.clarin.eu 4https://www.clarin.eu/node/3788 5https://www.rd-alliance.org/group/gede-group-european- data-experts-rda/wiki/gede-digital-object-topic-group 9https://switchboard.clarin.eu 6https://fairdo.org/ 10http://hdl.handle.net/11372/DOC-78 7https://pti.iu.edu/centers/d2i/initiatives/rpid.html for a testbed 11https://www.clarin.eu/governance/centre-assessment- implementation.