Towards Ontolex-Lemon Editing in Vocbench 3

Towards OntoLex-Lemon editing in VocBench 3 MANUEL FIORELLI, ARMANDO STELLATO, TIZIANO LORENZETTI, ANDREA TURBATI+, PETER SCHMITZ, ENRICO FRANCESCONI, NAJEH HAJLAOUI, BRAHIM BATOUCHE* ABSTRACT: Released in early May 2016, the OntoLex-Lemon model is a suite of agreed-upon RDF vocabularies for the representation of ontology lexicons. The OntoLex-Lemon model, as well as its predecessor Monnet lemon, has also been used (actually “misused”, from a certain point of view) to represent lexical-semantic resources (e.g. wordnets), dictionaries and, more broadly, to serve as a cornerstone of the Linguistic Linked Open Data cloud. The number of users potentially interested in editing or consuming OntoLex-Lemon data is thus very large. However, common ontology and RDF editors are inconvenient because of the complex design patterns embodied in the OntoLex-Lemon model, which relies heavily on reification and indirection. In this paper, we discuss our ongoing work to extend the collaborative thesaurus and ontology editor VocBench 3 with facilities tailored to the OntoLex-Lemon model, while retaining its large feature set and the wide modelling spectrum offered by RDF. Keywords: Lemon, OntoLex, VocBench, Lexicon, RDF. 1. Introduction The W3C Community Group Ontology-Lexicon1 (OntoLex) published its final report2 in early May 2016, defining the OntoLex-Lemon3 model: a suite of RDF vocabularies (called modules) for the representation of lexicons for ontologies, in accordance with Semantic Web4 best practices. The modules of OntoLex-Lemon cover aspects such as morphology, syntax-semantics mapping, variation, translation, and linguistic metadata. This rich linguistic characterization of ontologies is unattainable with widely deployed models on the Semantic Web (e.g. RDFS and SKOS-(XL) labels), and it enables a wide range of ontology-driven NLP applications (e.g. knowledge verbalization, semantic parsing, question answering…)5. Outside of its originally intended scope, OntoLex-Lemon (as well as its predecessors) has also been used to represent and interlink lexicons, lexical-semantic resources and, in general, language resources in the Linguistic Linked Open Data (LLOD) cloud6. There is thus a large and varied group of people potentially interested in consuming, editing or otherwise interacting with datasets that use the OntoLex-Lemon model. Unfortunately, common + Dept. of Enterprise Engineering, University of Rome Tor Vergata, Via del Politecnico 1, 00133 Rome, Italy. {fiorelli, turbati}@info.uniroma2.it, [email protected], [email protected] * Publications Office of the European Union, Luxembourg. {Peter.SCHMITZ, Enrico.FRANCESCONI}@publications.europa.eu, {Najeh.HAJLAOUI, Brahim.BATOUCHE}@ext.publications.europa.eu Institute of Legal Information Theory and Techniques (ITTIG), Consiglio Nazionale delle Ricerche (CNR), Via dei Barucci 20, 50127 Florence, Italy 1 Ontology-Lexica Community Group: <https://www.w3.org/community/ontolex/> (last consulted: 23/01/2018) 2 Lexicon Model for Ontologies: <https://www.w3.org/2016/05/ontolex/> (last consulted: 23/01/2018) 3 J. MCCRAE., J. BOSQUE-GIL, J. GRACIA, et al., The OntoLex-Lemon Model: Development and Applications, in "Proceedings of eLex 2017 conference", Leiden, The Netherlands, 19-21 September 2017, pp. 19-21 4 T. BERNERS-LEE, J.A. HENDLER, O. LASSILA, The Semantic Web: A new form of Web content that is meaningful to computers will unleash a revolution of new possibilities, in «Scientific American», vol. CCLXXXIV, n. 5, 2001, pp. 34-43, doi: https://doi.org/10.1038/scientificamerican0501-34 5 P. CIMIANO, C. UNGER, J. MCCRAE, Ontology-Based Interpretation of Natural Language, in Synthesis Lectures on Human Language Technologies, vol. VII, n. 2, 2014, pp. 1-178, doi: http://dx.doi.org/10.2200/S00561ED1V01Y201401HLT024 6 Aa.Vv., Linked Data in Linguistics, edited by C. Chiarcos, S. Nordhoff, S. Hellmann, Springer, Berlin, Heidelberg, 2012, doi: https://doi.org/10.1007/978-3-642-28249-2 ontology and RDF editors are highly inconvenient for those purposes, since they lack the facilities and the appropriate abstraction over the OntoLex-Lemon model that are required to easily create and understand the often complex modelling patterns demanded by OntoLex-Lemon. Conversely, systems tailored to (a specific application of) OntoLex-Lemon often lose the flexibility of the RDF model, which allows loading and mixing arbitrary vocabularies, as well as additional features of ontology/RDF editors (unless they are recreated in the purpose-built system). Searching for a solution to the problem above, we explored a third way in which an ontology/RDF editor is extended to better support the OntoLex-Lemon model, while remaining compatible with the underlying dynamics of the editor. Indeed, the exploitation of the existing ecosystem was advocated7 as an important benefit of Linked Data adoption in Linguistics. Specifically, we are working on extending the collaborative ontology and thesaurus editor VocBench 38. In this paper, we describe the work that has already been done, and lay down the design of the features that are still required to achieve our final goal. We first present an early customization effort9 that was actually aimed at validating the extensibility of VocBench (via the so-called custom forms) more than at creating a full-fledged OntoLex-Lemon editor. Subsequently, we report on the integration of a module for effective management of the metadata module (LIME). The lessons we learned with the custom form experiment and the identified gaps motivated the implementation of the support for OntoLex-Lemon as a first-class citizen inside VocBench 3, on par with other modelling vocabularies such as RDFS, OWL, SKOS and SKOS-XL. This work is still in the design phase, which will be presented in this paper. This effort is being done in the context of the Public Multilingual Knowledge Management Infrastructure (PMKI) action, launched by the European Commission (EC) to promote the Digital Single Market in the European Union (EU). PMKI aims to share maintainable and sustainable Language Resources making them interoperable in order to support language technology industry, and public administrations, with multilingual tools able to improve cross border accessibility of digital services The objective of PMKI is to implement a proof-of-concept infrastructure to expose and to harmonize internal (European Union institutional) and external multilingual lexicons aligning them in order to facilitate interoperability. Additionally, the project aims to create a governance structure to extend systematically the infrastructure by the integration of supplementary public multilingual taxonomies/terminologies. 2. State of the Art on Linguistic Resources and Language Representation This is a very broad field, as the following definition suggests: «The term linguistic resources refers to (usually large) sets of language data and descriptions in machine readable form, to be used in building, improving, or evaluating natural language (NL) and speech algorithms or systems»10. 7 C. CHIARCOS, J. MCCRAE, P. CIMIANO, C. FELLBAUM, Towards Open Data for Linguistics: Linguistic Linked Data, in New Trends of Research in Ontologies and Lexical Resources. Theory and Applications of Natural Language Processing, edited by A. Oltramari, P. Vossen, L. Qin, et al., Springer, Berlin, Heidelberg 2013, pp. 7-25, doi: https://doi.org/10.1007/978-3-642-31782-8_2 8 A. STELLATO, A. TURBATI, M. FIORELLI, et al., Towards VocBench 3: Pushing Collaborative Development of Thesauri and Ontologies Further Beyond, in "Proceedings of the 17th European Networked Knowledge Organization Systems (NKOS) Workshop", Thessaloniki, Greece, September 21st 2017, pp. 39-52 VocBench web site: <http://vocbench.uniroma2.it/> (last consulted: 25/01/2017) 9 M. FIORELLI, T. LORENZETTI, M.T. PAZIENZA, et al., Assessing VocBench Custom Forms in Supporting Editing of Lemon Datasets, in Language, Data, and Knowledge, edited by J. Gracia, F. Bond, J.P. McCrae, et al., Lecture Notes in Artificial Intelligence, vol (X)CCCXVIII, Springer, Cham, pp.237-252, doi: https://doi.org/10.1007%2F978-3-319-59888-8_21 10 Aa.Vv., Survey of the State of the Art in Human Language Technology, edited by R.A. Cole, J. Mariani, H. Uszkoreit, et al., Cambridge University Press, Cambridge, UK 1997 There have been multiple efforts in the past aimed at achieving consensus among different theoretical perspectives and design approaches. The Text Encoding Initiative (TEI)11 and the LRE- EAGLES (Expert Advisory Group on Linguistic Engineering Standards) project12 are just a few. They intended to promote the reuse of existing (partial) linguistic resources and the development of new ones for those languages and domains where linguistic resources were unavailable, and creating a cooperative infrastructure to collect, maintain, and disseminate linguistic resources. WordNet13 is lexico-semantic resource for American English, which defined a model for subsequent wordnets in other languages14. These wordnets clearly separate words, senses and glosses, and are characterized by diverse semantic relations like hyponymy and meronymy. Lexical Markup Framework15 (LMF) is a more recent effort, become an ISO standard (LMF; ISO 24613:2008), which supports the representation of monolingual, bilingual or multilingual lexical resources. It covers different aspects, such as morphology, syntax, semantics, and translation. With the advent of the Semantic Web and Linked Open Data, a number of models have been proposed to enrich

Towards Ontolex-Lemon Editing in Vocbench 3

Towards a Sense-Based Access to Related Online Lexical Resources

Conversion of the English-Xhosa Dictionary for Nurses to a Linguistic Linked Data Framework †

When Linguistics Meets Web Technologies. Recent Advances in Modelling Linguistic Linked Open Data

Models to Represent Linguistic Linked Data∗

Towards a Sense-Based Access to Related Online Lexical