When Linguistics Meets Web Technologies. Recent Advances in Modelling Linguistic Linked Open Data

Total Page:16

File Type:pdf, Size:1020Kb

When Linguistics Meets Web Technologies. Recent Advances in Modelling Linguistic Linked Open Data Semantic Web 0 (0) 1 1 IOS Press 1 1 2 2 3 3 4 When Linguistics Meets Web Technologies. 4 5 5 6 6 7 Recent advances in Modelling Linguistic 7 8 8 9 Linked Open Data 9 10 10 11 a b c d 11 12 Anas Fahad Khan , Christian Chiarcos , Thierry Declerck , Daniela Gifu , 12 e f g h 13 Elena González-Blanco García , Jorge Gracia , Maxim Ionov , Penny Labropoulou , 13 i j k l 14 Francesco Mambrini , John P. McCrae , Émilie Pagé-Perron , Marco Passarotti , 14 m n 15 Salvador Ros Muñoz , Ciprian-Octavian Truica˘ 15 16 a Istituto di Linguistica Computazionale «A. Zampolli», Consiglio Nazionale delle Ricerche, Italy 16 17 E-mail: [email protected] 17 18 b Applied Computational Linguistics Lab, Goethe-Universität Frankfurt am Main, Germany 18 19 E-mail: [email protected] 19 20 c DFKI GmbH, Multilinguality and Language Technology, Saarbrücken, Germany 20 21 E-mail: [email protected] 21 22 d Faculty of Computer Science, Alexandru Ioan Cuza University of Iasi, Romania 22 23 E-mail: [email protected] 23 24 e Laboratory of Innovation on Digital Humanities, IE University, Spain 24 25 E-mail: [email protected] 25 26 f Aragon Institute of Engineering Research, University of Zaragoza, Spain 26 27 E-mail: [email protected] 27 28 g Applied Computational Linguistics Lab, Goethe-Universität Frankfurt am Main, Germany 28 29 E-mail: [email protected] 29 30 h Institute for Language and Speech Processing, Athena Research Center, Greece 30 31 E-mail: [email protected] 31 32 i CIRCSE Research Centre, Università Cattolica del Sacro Cuore, Milan, Italy 32 33 E-mail: [email protected] 33 34 j Insight SFI Research Centre for Data Analytics, Data Science Institute, National University of Ireland Galway, 34 35 Ireland 35 36 E-mail: [email protected] 36 37 k Wolfson College, University of Oxford, United Kingdom 37 38 E-mail: [email protected] 38 39 l CIRCSE Research Centre, Università Cattolica del Sacro Cuore, Milan, Italy 39 40 E-mail: [email protected] 40 41 m Laboratory of Innovation on Digital Humanities, National Distance Education University UNED, Spain 41 42 E-mail: [email protected] 42 43 n Department of Computer Science and Engineering, Faculty of Automatic Control and Computers, University 43 44 Politehnica of Bucharest, Romania 44 45 E-mail: [email protected] 45 46 46 47 47 48 48 49 49 50 Abstract. 50 51 51 1570-0844/0-1900/$35.00 © 0 – IOS Press and the authors. All rights reserved 2 AF. Khan et al. / When Linguistics Meets Web Technologies. Recent advances in Modelling Linguistic Linked Open Data 1 This article provides an up-to-date and comprehensive survey of models (including vocabularies, taxonomies and ontologies) 1 2 used for representing linguistic linked data (LLD). It focuses on the latest developments and both builds upon and complement 2 3 previous works covering similar territory. The article begins with an overview of recent trends which have had an impact on 3 4 linked data models and vocabularies, such as the growing influence of the FAIR guidelines, the funding of several major projects 4 5 in which LLD is a key component, and the increasing importance of the relationship of the Digital Humanities with LLD. Next, 5 we give an overview of some of the most well known vocabularies and models in LLD. After this we look at some of the latest 6 6 developments in community standards and initiatives such as OntoLex-lemon as well as recent work which has been in carried 7 7 out in corpora and annotation and LLD including a discussion of the LLD metadata vocabularies METASHARE and lime and 8 language identifiers. Following this we look at work which has been realised in a number of recent projects and which has a 8 9 significant impact on LLD vocabularies and models. 9 10 10 11 Keywords: linguistic linked data, FAIR, corpora, annotation, language resources, OntoLex-Lemon, Digital Humanities, metadata, 11 12 models 12 13 13 14 14 15 15 16 1. Introduction data Section 4.3. Finally there is a section discussing 16 17 projects, Section 5, and the conclusion, Section 6. 17 18 The growing popularity of linked data, and espe- 18 19 cially as linked open data (that is, linked data with an 19 20 open license) as a means of publishing language re- 2. Setting the Scene: An Overview of Relevant 20 21 sources (lexica, corpora, data categories, etc) has led Trends for LLD 21 22 to the need for a greater focus on models for linguistic 22 23 linked data (LLD) since these are key to what makes The trends we have decided to focus on in this 23 24 linked data resources so reusable and interoperable. overview are the FAIRification of data in Section 2.1, 24 25 The purpose of this article is to provide an up-to-date the importance of projects to LLD models in Section 25 26 and comprehensive survey of models (including vo- 2.2, and finally the increasing importance of Digital 26 27 cabularies, taxonomies and ontologies) used for repre- Humanities use cases in Section 2.3. 27 28 senting linguistic linked data. It will focus on the lat- 28 29 est developments and both build upon and complement 2.1. FAIR New World 29 30 previous works covering similar territory, avoiding too 30 31 much repetition and overlap with the latter. In the fol- With the growing importance of Open Science ini- 31 32 lowing Section 2, we give an overview of a number tiatives, and especially those promoting the FAIR 32 33 of trends from the last few years which have had, or guidelines (where FAIR stands for Findable, Accessi- 33 34 which are likely to have, a significant impact on the ble, Interoperable and Reusable) [1] – and the conse- 34 35 definition and/or use of LLD models. We relate these quent emphasis on the modelling, creation and publi- 35 36 trends to the rest of the article by highlighting rele- cation of language resources as FAIR digital resources 36 37 vant sections of the article (in bold). This overview – shared models and vocabularies have begun to take 37 38 of trends will help to locate the present work within on an increasingly prominent role. Although the lin- 38 39 a wider research context, something that is extremely guistic linked data community has been active in pro- 39 40 useful in an area as active as linguistic linked data, as moting shared RDF vocabularies and models for years, 40 41 well as assisting readers in navigating the rest of the ar- this new emphasis on FAIR is likely to have a con- 41 42 ticle. Next, in Section 2.4, we compare the present ar- siderable impact in several ways, not least in terms of 42 43 ticle with other related work, including an earlier sur- the necessity for these models to demonstrate a greater 43 44 vey of LLD models, in order to help clarify the top- coverage, and to be more interoperable one with an- 44 45 ics and approach of the present work. Section 3 gives other. We will look at one series of FAIR related rec- 45 46 an overview of the most widely used models in LLD. ommendations for models in Section 3 and see how 46 47 Then in Section 4, we look at recent developments they might be applied to the case of LLD. However in 47 48 in community standards and initiatives. These include the rest of the subsection we will take a closer look 48 49 the latest extensions of the OntoLex-Lemon model in at the FAIR principles themselves and show why their 49 50 Section 4.1, a discussion of relevant work in copora widespread adoption is likely to lead to a greater role 50 51 and annotations in Section 4.2, and a section on meta- for LLD models and vocabularies in the future. 51 AF. Khan et al. / When Linguistics Meets Web Technologies. Recent advances in Modelling Linguistic Linked Open Data 3 1 In The FAIR Guiding Principles for scientific data – F2. data are described with rich metadata. 1 2 management and stewardship [1], the article which – I1. (meta)data use a formal, accessible, shared, 2 3 first articulated the well known FAIR principles, the and broadly applicable language for knowledge 3 4 authors clearly state that the criteria proposed by these representation. 4 5 principles are intended both "for machines and peo- – I2. (meta)data use vocabularies that follow FAIR 5 6 ple" and that they provide "‘steps along a path’ to ma- principles. 6 7 chine actionability", where the latter is understood to 7 8 describe structured data that would allow a "computa- It is important to note that the emphasis placed 8 9 tional data explorer" to determine: on machine actionability in FAIR resources (that is, 9 10 with respect to allowing computational agents to take 10 – The type of a "digital research object" 11 "appropriate action" with respect to a dataset or re- 11 – Its usefulness with respect to tasks to be carried 12 source) gives Semantic Web vocabularies/registries a 12 out 13 substantial advantage over other (non-Semantic Web 13 – Its usability especially with respect to licensing 14 native) standards like the Text Encoding Initiative 14 issues, represented in a way that would allow the 2 15 (TEI) guidelines [2], the Lexical Markup Frame- 15 agent to take "appropriate action".
Recommended publications
  • Towards a Sense-Based Access to Related Online Lexical Resources
    Towards a Sense-based Access to Related Online Lexical Resources Thierry Declerck, Karlheinz Mörth DFKI GmbH, Language Technology Lab, Austrian Centre for Digital Humanities E-mail: [email protected], [email protected] Abstract We present an approach aiming at a method to support sense-based cross-dialectal – and cross-lexicon – access to dictionaries of spoken Arabic varieties. The original lexical data consists of three TEI P5 encoded dictionaries describing varieties spoken in Cairo, Damascus and Tunis. This data is included in the Vienna Corpus of Arabic Varieties (VICAV). We briefly present this data, before summarizing the TEI approach for encoding senses in lexical resources. We discuss certain issues related to the TEI representation when it comes to the possibility to provide for a sense-based access to the lexical data. We investigate the use of the recent W3C Ontology-Lexicon Community Group (OntoLex) modeling work and present first results of the mapping of the TEI encoded data onto its ontolex model and show how this new representation format can efficiently support cross-dictionary sense-based access. Keywords: Senses; ontolex; TEI; SKOS; Arabic Dialects 1 Introduction We present an approach for ensuring sense-based access to distinct but related online lexical resources. The basis of our work consists of three dictionaries of Arabic varieties developed at different times by different colleagues (Mörth et al. 2015; Procházka & Mörth 2013; Mörth et al. 2014) in the context of the Vienna Corpus of Arabic Varieties (VICAV) project1. This data has been encoded in TEI2, following the approach described in (Budin et al.
    [Show full text]
  • Conversion of the English-Xhosa Dictionary for Nurses to a Linguistic Linked Data Framework †
    information Article Conversion of the English-Xhosa Dictionary for Nurses to a Linguistic Linked Data Framework † Frances Gillis-Webber Library and Information Studies Centre, University of Cape Town, Woolsack Drive, Rondebosch, Cape Town 7701, South Africa; [email protected] † This paper is an extended version of the paper presented at the 6th Workshop on Linked Data in Linguistics (LDL-2018), 11th Edition of the Language Resources and Evaluation Conference (LREC 2018). Received: 15 September 2018; Accepted: 2 November 2018; Published: 6 November 2018 Abstract: The English-Xhosa Dictionary for Nurses (EXDN) is a bilingual, unidirectional printed dictionary in the public domain, with English and isiXhosa as the language pair. By extending the digitisation efforts of EXDN from a human-readable digital object to a machine-readable state, using Resource Description Framework (RDF) as the data model, semantically interoperable structured data can be created, thus enabling EXDN’s data to be reused, aggregated and integrated with other language resources, where it can serve as a potential aid in the development of future language resources for isiXhosa, an under-resourced language in South Africa. The methodological guidelines for the construction of a Linguistic Linked Data framework (LLDF) for a lexicographic resource, as applied to EXDN, are described, where an LLDF can be defined as a framework: (1) which describes data in RDF, (2) using a model designed for the representation of linguistic information, (3) which adheres to Linked Data principles, and (4) which supports versioning, allowing for change. The result is a bidirectional lexicographic resource, previously bounded and static, now unbounded and evolving, with the ability to extend to multilingualism.
    [Show full text]
  • Models to Represent Linguistic Linked Data∗
    View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by Repositorio Universidad de Zaragoza Natural Language Engineering: page 1 of 49. c Cambridge University Press 2018 1 doi:10.1017/S1351324918000347 Models to represent linguistic linked data∗ J. BOSQUE-GIL1, J. GRACIA1,2, E. MONTIEL-PONSODA1 and A. G OMEZ-P´ EREZ´ 1 1Ontology Engineering Group, Escuela T´ecnica Superior de Ingenieros Informaticos,´ Universidad Polit´ecnica de Madrid, Spain e-mails: [email protected], [email protected], [email protected] 2Aragon Institute of Engineering Research (I3A), University of Zaragoza, Spain e-mail: [email protected] (Received 13 June 2017; revised 6 August 2018; accepted 10 August 2018 ) Abstract As the interest of the Semantic Web and computational linguistics communities in linguistic linked data (LLD) keeps increasing and the number of contributions that dwell on LLD rapidly grows, scholars (and linguists in particular) interested in the development of LLD resources sometimes find it difficult to determine which mechanism is suitable for their needs and which challenges have already been addressed. This review seeks to present the state of the art on the models, ontologies and their extensions to represent language resources as LLD by focusing on the nature of the linguistic content they aim to encode. Four basic groups of models are distinguished in this work: models to represent the main elements of lexical resources (group 1), vocabularies developed as extensions to models in group 1 and ontologies that provide more granularity on specific levels of linguistic analysis (group 2), catalogues of linguistic data categories (group 3) and other models such as corpora models or service- oriented ones (group 4).
    [Show full text]
  • Towards a Sense-Based Access to Related Online Lexical
    Towards a Sense-based Access to Related Online Lexical Resources Thierry Declerck, Karlheinz Mörth DFKI GmbH, Language Technology Lab, Austrian Centre for Digital Humanities e-mail: [email protected], [email protected] Abstract We present an approach aiming at a method to support sense-based cross-dialectal – and cross-lexicon – access to dictionaries of spoken Arabic varieties. The original lexical data consists of three TEI P5 encoded dictionaries describing varieties spoken in Cairo, Damascus and Tunis. This data is included in the Vienna Corpus of Arabic Varieties (VICAV). We briefly present this data, before summarizing the TEI approach for encoding senses in lexical resources. We discuss certain issues related to the TEI representation when it comes to the possibility to provide for a sense-based access to the lexical data. We investigate the use of the recent W3C Ontology-Lexicon Community Group (OntoLex) modeling work and present first results of the mapping of the TEI encoded data onto its ontolex model and show how this new representation format can efficiently support cross-dictionary sense-based access. Keywords: Venses; ontolex; TEI; SKOS; Arabic Gialects 1 Introduction We present an approach for ensuring sense-based access to distinct but related online lexical resources. The basis of our work consists of three dictionaries of Arabic varieties developed at different times by different colleagues (Mörth et al. 2015; Procházka & Mörth 2013; Mörth et al. 2014) in the context of the Vienna Corpus of Arabic Varieties (VICAV) project1 This data has been encoded in TEI2 following the approach described in (Budin et al.
    [Show full text]
  • Towards Ontolex-Lemon Editing in Vocbench 3
    Towards OntoLex-Lemon editing in VocBench 3 MANUEL FIORELLI, ARMANDO STELLATO, TIZIANO LORENZETTI, ANDREA TURBATI+, PETER SCHMITZ, ENRICO FRANCESCONI, NAJEH HAJLAOUI, BRAHIM BATOUCHE* ABSTRACT: Released in early May 2016, the OntoLex-Lemon model is a suite of agreed-upon RDF vocabularies for the representation of ontology lexicons. The OntoLex-Lemon model, as well as its predecessor Monnet lemon, has also been used (actually “misused”, from a certain point of view) to represent lexical-semantic resources (e.g. wordnets), dictionaries and, more broadly, to serve as a cornerstone of the Linguistic Linked Open Data cloud. The number of users potentially interested in editing or consuming OntoLex-Lemon data is thus very large. However, common ontology and RDF editors are inconvenient because of the complex design patterns embodied in the OntoLex-Lemon model, which relies heavily on reification and indirection. In this paper, we discuss our ongoing work to extend the collaborative thesaurus and ontology editor VocBench 3 with facilities tailored to the OntoLex-Lemon model, while retaining its large feature set and the wide modelling spectrum offered by RDF. Keywords: Lemon, OntoLex, VocBench, Lexicon, RDF. 1. Introduction The W3C Community Group Ontology-Lexicon1 (OntoLex) published its final report2 in early May 2016, defining the OntoLex-Lemon3 model: a suite of RDF vocabularies (called modules) for the representation of lexicons for ontologies, in accordance with Semantic Web4 best practices. The modules of OntoLex-Lemon cover aspects such as morphology, syntax-semantics mapping, variation, translation, and linguistic metadata. This rich linguistic characterization of ontologies is unattainable with widely deployed models on the Semantic Web (e.g.
    [Show full text]