<<

Semantic Web 0 (0) 1 1 IOS Press

1 1 2 2 3 3 4When Linguistics Meets Web Technologies. 4 5 5 6 6 7Recent advances in Modelling Linguistic 7 8 8 9Linked Open Data 9 10 10 11a b c d 11 12Anas Fahad Khan , Christian Chiarcos , Thierry Declerck , Daniela Gifu , 12 e f b h 13Elena González-Blanco García , Jorge Gracia , Maxim Ionov , Penny Labropoulou , 13 i j k i 14Francesco Mambrini , John P. McCrae , Émilie Pagé-Perron , Marco Passarotti , 14 l m 15Salvador Ros Muñoz , Ciprian-Octavian Truica˘ 15 16a Istituto di Linguistica Computazionale «A. Zampolli», Consiglio Nazionale delle Ricerche, Italy 16 17E-mail: [email protected] 17 18b Applied Computational Linguistics Lab, Goethe-Universität Frankfurt am Main, Germany 18 19E-mail: [email protected], 19 20E-mail: [email protected] 20 21c DFKI GmbH, Multilinguality and Language Technology, Saarbrücken, Germany 21 22E-mail: [email protected] 22 23d Faculty of Computer Science, Alexandru Ioan Cuza University of Iasi, Romania 23 24E-mail: [email protected] 24 25e Laboratory of Innovation on , IE University, Spain 25 26E-mail: [email protected] 26 27f Aragon Institute of Engineering Research, University of Zaragoza, Spain 27 28E-mail: [email protected] 28 29g Applied Computational Linguistics Lab, Goethe-Universität Frankfurt am Main, Germany 29 30E-mail: [email protected] 30 31h Institute for Language and Speech Processing, Athena Research Center, Greece 31 32E-mail: [email protected] 32 33i CIRCSE Research Centre, Università Cattolica del Sacro Cuore, Milan, Italy 33 34E-mail: [email protected], 34 35E-mail: [email protected] 35 36j Insight SFI Research Centre for Data Analytics, Data Science Institute, National University of Ireland Galway, 36 37Ireland 37 38E-mail: [email protected] 38 39k Wolfson College, University of Oxford, United Kingdom 39 40E-mail: [email protected] 40 41l Laboratory of Innovation on Digital Humanities, National Distance Education University UNED, Spain 41 42E-mail: [email protected] 42 43m Computer Science and Engineering Department, Faculty of Automatic Control and Computers, University 43 44Politehnica of Bucharest, Romania 44 45E-mail: [email protected] 45 46 46 47 47 48 48 49 49 50Abstract. This article provides an up-to-date and comprehensive survey of models (including vocabularies, taxonomies and 50 ontologies) used for representing linguistic linked data (LLD). It focuses on the latest developments in the area and both builds 51 51

1570-0844/0-1900/$35.00 © 0 – IOS Press and the authors. All rights reserved 2 AF. Khan et al. / When Linguistics Meets Web Technologies. Recent advances in Modelling Linguistic Linked Open Data

1upon and complements previous works covering similar territory. The article begins with an overview of recent trends which 1 2have had an impact on linked data models and vocabularies, such as the growing influence of the FAIR guidelines, the funding 2 3of several major projects in which LLD is a key component, and the increasing importance of the relationship of the digital 3 4humanities with LLD. Next, we give an overview of some of the most well known vocabularies and models in LLD. After this 4 5we look at some of the latest developments in community standards and initiatives such as OntoLex-Lemon as well as recent 5 work which has been in carried out in corpora and annotation and LLD including a discussion of the LLD metadata vocabularies 6 6 META-SHARE and lime and language identifiers. In the following part of the paper we look at work which has been realised in 7 7 a number of recent projects and which has a significant impact on LLD vocabularies and models. 8 8 9Keywords: linguistic linked data, FAIR, corpora, annotation, language resources, OntoLex-Lemon, Digital Humanities, metadata, 9 10models 10 11 11 12 12 13 13 141. Introduction tion discussing projects, Section 5, and the conclusion, 14 15Section 6. 15 16The growing popularity of linked data, and espe- 16 17cially of linked open data (that is, linked data with an 17 18open license), as a means of publishing language re- 2. Setting the Scene: An Overview of Relevant 18 19sources (lexica, corpora, data categories, etc.) necessi- Trends for LLD 19 20tates a greater emphasis on models for linguistic linked 20 21data (LLD) since these are key to what makes linked The trends we have decided to focus on in this 21 22data resources so reusable and so interoperable (at a overview are the FAIRification of data in Section 2.1, 22 23semantic level). The purpose of this article is to pro- the importance of projects to LLD models in Section 23 24vide a comprehensive and up-to-date survey of mod- 2.2, and finally the increasing importance of Digital 24 25els used for representing linguistic linked data. It will Humanities use cases in Section 2.3. 25 26focus on the latest developments and will both build 26 27upon as well as trying to complement previous works 2.1. FAIR New World 27 28covering similar territory by avoiding too much repeti- 28 29With the growing importance of Open Science ini- 29 tion and overlap with the latter. 30tiatives, and especially those promoting the FAIR 30 In the following section, Section 2, we give an 31guidelines (where FAIR stands for Findable, Accessi- 31 overview of a number of trends from the last few years 32ble, Interoperable and Reusable) [1] – and the conse- 32 which have had, or which are likely to have, a signifi- 33quent emphasis on the modelling, creation and publi- 33 cant impact on the definition and/or use of LLD mod- 34cation of language resources as FAIR digital resources 34 els. We relate these trends to the rest of the article by 35– shared models and vocabularies have begun to take 35 highlighting relevant sections of the article (in bold). 36on an increasingly prominent role. Although the lin- 36 This overview of trends will help to locate the present 37guistic linked data community has been active in pro- 37 work within a wider research context, something that 38moting shared RDF vocabularies and models for years, 38 39is extremely useful in an area as active as linguistic this new emphasis on FAIR is likely to have a con- 39 40linked data, as well as assisting readers in navigating siderable impact in several ways, not least in terms of 40 41the rest of the article. Next, in Section 2.4, we com- the necessity for these models to demonstrate a greater 41 42pare the present article with other related work, includ- coverage, and to be more interoperable one with an- 42 43ing an earlier survey of LLD models, in order to help other. We will look at one series of FAIR related rec- 43 44clarify the topics and approach of the present work. ommendations for models in Section 3 and see how 44 45Section 3 gives an overview of the most widely used they might be applied to the case of LLD. However in 45 46models in LLD. Then in Section 4, we look at recent the rest of the subsection we will take a closer look 46 47developments in community standards and initiatives. at the FAIR principles themselves and show why their 47 48These include the latest extensions of the OntoLex- widespread adoption is likely to lead to a greater role 48 49Lemon model in Section 4.1, a discussion of relevant for LLD models and vocabularies in the future. 49 50work in copora and annotations in Section 4.2, and a In The FAIR Guiding Principles for scientific data 50 51section on metadata Section 4.3. Finally there is a sec- management and stewardship [1], the article which 51 AF. Khan et al. / When Linguistics Meets Web Technologies. Recent advances in Modelling Linguistic Linked Open Data 3

1first articulated the well known FAIR principles, the – I1. (meta)data use a formal, accessible, shared, 1 2authors clearly state that the criteria proposed by these and broadly applicable language for knowledge 2 3principles are intended both "for machines and peo- representation. 3 4ple" and that they provide "‘steps along a path’ to ma- – I2. (meta)data use vocabularies that follow FAIR 4 5chine actionability", where the latter is understood to principles. 5 6describe structured data that would allow a "computa- 6 7tional data explorer" to determine: It is important to note that the emphasis placed 7 8on machine actionability in FAIR resources (that 8 9– The type of a "digital research object" is, on enabling computational agents to find rele- 9 10– Its usefulness with respect to tasks to be carried vant datasets and resources and to take "appropriate 10 11out action" when they find them) gives Semantic Web 11 12– Its usability especially with respect to licensing vocabularies/registries a substantial advantage over 12 13issues, represented in a way that would allow the other (non-Semantic Web native) standards in the 13 14agent to take "appropriate action". field of linguistics like the Text Encoding Initiative 14 3 15The current popularity of the FAIR principles and, (TEI) guidelines [2], the Lexical Markup Framework 15 16in particular, their promotion by governments and re- (LMF) [3] or the Morpho-syntactic Annotation Frame- 16 17search funding bodies, such as the European Commis- work (MAF) [4]. 17 18sion,1 through several national and international ini- For a start, none of these other standards possess 18 19tiatives reflects a wider recognition of the potential of a ‘native’, widely-used, widely technically supported 19 20structured and machine actionable data in changing knowledge representation language for describing the 20 21how research is carried out, and especially in helping semantics of vocabulary terms in a machine readable 21 22to support open science practices. The FAIR ideal, in way, or at least nothing as powerful as the Web Ontol- 22 23short, is to allow machines as much autonomy as pos- ogy Language (OWL)4 or the Semantic Web Rule Lan- 23 24sible in working with data, by the expedient of render- guage (SWRL)5. For instance., there is no standard- 24 25ing as much of the semantics of that data explicit (and ised way of describing the meanings of , 25 26machine actionable) as possible. lexemes, lemmas, etc. in TEI in a machine actionable 26 27Publishing data using a standardised data model like way. 27 28the Resource Description Framework2 (RDF) which The ability to give precise, axiomatic definitions of 28 29was specifically intended to facilitate interoperability terms in a formal knowledge representation (KR) lan- 29 30and interlinking between datasets – along with the guage (allied with already established conceptual mod- 30 31 31 other standards proposed in the Semantic Web stack elling techniques and ontology engineering best prac- 32 32 and the technical infrastructure which has been devel- tises) is especially helpful in humanistic disciplines 33 33 oped in order to support it – obviously goes a long way such as linguistics or literary scholarship, where there 34 34 towards facilitating the publication of datasets as FAIR can often be quite different definitions of the same 35 35 data. In addition, however, it is also vital that there ex- or similar core concepts, e.g., with respect to differ- 36 36 ist specialised vocabularies/terminologies/models and ent scholarly traditions or schools of thought. Using 37 37 data category registries in order to ensure a viable, a machine readable description in OWL, once again, 38 38 domain-wide level of interoperability and re-usability in conjunction with an ontology modelling methodol- 39 39 of data. These former resources serve to describe the 40ogy such as OntoClean [5], and together a more hu- 40 shared theoretical assumptions held by a community of 41man readable description given as documentation, can 41 experts with regard to the semantics of the terms used 42help to clarify (according to the expressive limitations 42 by that community, and do so in a form that is (to vary- 43of OWL) what we mean when we use a concept like 43 ing extents) machine readable to computational agents. 44‘Sense’ or ‘’ in a dataset. It also facilitates 44 The following FAIR principles are especially salient 45the machine readable description of the relationship 45 46here: between different definitions of concepts across lan- 46 47– F2. data are described with rich metadata. guages or traditions. 47 48 48 49 49 1https://ec.europa.eu/info/sites/info/files/turning_fair_into_ 3https://tei-c.org/guidelines/ 50reality_0.pdf 4https://www.w3.org/TR/2012/REC-owl2-overview-20121211/ 50 512https://www.w3.org/TR/rdf-primer/ 5https://www.w3.org/Submission/SWRL/ 51 4 AF. Khan et al. / When Linguistics Meets Web Technologies. Recent advances in Modelling Linguistic Linked Open Data

1Secondly, thanks to the use of a shared data model In order to ensure the continued effectiveness of 1 2and a powerful native linking mechanism, linguistic linked data and the Semantic Web in facilitating the 2 3linked data datasets can be easily (and in a standard creation of FAIR resources, it is vital that pre-existing 3 4way) integrated with/enriched by (linked data) datasets vocabularies/models/data registries be re-used when- 4 5belonging to other disciplines, for instance geograph- ever possible in the modelling of user data; this of 5 6ical and historical datasets or gazetters and authority course also means ensuring that these models have suf- 6 7lists. OWL, and vocabularies, such as PROV-O,6 also ficient coverage and defining extensions when this is 7 8allow us to add information pertaining to when some- not the case, as well as creating training materials suit- 8 9thing happened, or whether we are describing a hy- able for different groups of users. Part of the intention 9 10pothesis7 or not (in which case, also who made it and of this article, together with the foundational work car- 10 11when). Once again all of these things can be described ried out in [7], is to provide an overview of what exists 11 12in a way that makes the semantics of the information out there in terms of LLD-focused models, to look at 12 13(relatively) explicit and machine actionable through the areas which are receiving most attention in order to 13 14the use of pre-existing standards and technologies in- highlight those which are so far underrepresented. In 14 15cluding the Semantic Web query language SPARQL addition in Section 3 we look at the most well known 15 16Protocol and RDF Query Language (SPARQL) as well LLD models in the light of a recent series of recom- 16 17as freely available Semantic Web reasoning engines. mendations on the publication of models as FAIR re- 17 18Moreover the pursuit of the FAIR ideal has opened sources. 18 19the way to new means of publishing datasets which 19 20offer enhanced opportunities for the re-use of such 2.2. The Importance of Projects and Community 20 21data in an automatic or semi-automatic way. These in- Initiatives in LLD 21 22 22 clude for instance nanopublications, cardinal asser- 23One significant indicator of the success which LLD 23 tions and knowlets8. The potential of these new pub- 24has had in the last few years is the variety of new 24 lishing approaches for discovering new facts as well as 25funded projects which have included the publication 25 for comparing concepts and tracking how single con- 26of linguistic datasets as linked data as a core theme. 26 cepts change are well described in [6]. 27These include projects at a continental or transnational 27 The field of language resources offers us a rich array 28level, notably European H2020 projects, ERCs and 28 of highly structured kinds of datasets, structured ac- 29COST actions, as well as projects at the national and 29 cording to a series of widely shared conventions (this 30regional levels. Arguably, this recent success in obtain- 30 is what makes the definition of models and vocabular- 31ing project funding reflects a much wider recognition 31 ies for lexica, corpora, etc, so viable in the first place) – 32of the importance of linked data as a means of ensuring 32 something that would seem to lend itself well to mak- 33the interoperability and accessibility of language re- 33 ing such resources FAIR in the machine-oriented spirit 34sources both to the research community and to a wider 34 of the original description of those principles as well 35public. In addition, it also demonstrates the continuing 35 36as to the new data publication approaches previously maturation of the field as LLD continues to be applied 36 37mentioned. However, the better and more expressive to new domains and use cases, in many cases, within 37 38the underlying models are the more effective they will the context of the projects alluded to above. In addi- 38 39be. tion, these projects also offer us clear examples of the 39 40use of the LLD vocabularies and models we will look 40 416https://www.w3.org/TR/prov-o/ at ‘in the wild’ so to speak and demonstrating their ap- 41 427For which one could use the Semantic Web ontology CR- plication to a wide number of medium to large scale 42 43MInfhttp://www.cidoc-crm.org/crminf/. datasets. 43 8Nanopublications are defined as the "smallest possible machine 44 44 readable graph-like structure that represents a meaningful asser- We have therefore decided to dedicate a section of 45tion" [6] and consist of publishing a single subject-predicate-object the current article, Section 5 to a detailed discussion 45 46triple with full provenance information; a generalisation of this idea of the current situation as regards research projects and 46 47is that of the cardinal assertion where a single assertion is associ- LLD models and vocabularies. This includes a detailed 47 48ated with more than one provenance graph. A knowlet consists of a overview of the area, Section 5.1 along with an ex- 48 collection of multiple cardinal assertions, with the same subject con- 49 49 cept [6] and can be viewed as locating that concept in a rich ‘concep- tended descriptions of a number of projects which we 50tual space’. For instance, this could be a cloud of predicates centered regard as the most significant from the point of view of 50 51around a word or a sense. LLD models and vocabularies. These are (in order of 51 AF. Khan et al. / When Linguistics Meets Web Technologies. Recent advances in Modelling Linguistic Linked Open Data 5

1appearance): the Linked Open Dictionaries (LiODi) layout and overall visual appearance.9 In fact, as we 1 2project (Section 5.2.1); the Poetry Standardization discuss in our description of the OntoLex Lexicogra- 2 3and Linked Open Data (POSTDATA) project; the phy module in Section 4.1.1) even the structural divi- 3 4LiLa: Linking Latin ERC project (Section 5.2.4); the sion of lexicographic works into textual units such as 4 5Prêt-à-LLOD project (Section 5.2.5); the European entries and senses is not always isomorphic to the rep- 5 6network for Web-centred linguistic data science resentation of the lexical content of those units using 6 7(NexusLinguarum) COST action (Section 5.2.6). A OntoLex-Lemon classes such as LexicalEntry and Lex- 7 8list of all the projects described in Section 5 can be icalSense. 8 9found in Table 3. We may also wish to model different aspects of the 9 10Note, however, that although the projects which we history of the lexicographic work as physical text.10 10 11will discuss in Section 5 have, in many cases, set the All of this calls for a much richer provision of meta- 11 12agenda for the development of LLD models and vo- data categories than had previously been considered 12 13cabularies, much of the actual work on the definition of for LLD lexica, both at the level of the whole work as 13 14these resources was carried out – and is being carried well as at the level of the entry. It also requires the ca- 14 15out – within community groups, such as the W3C On- pacity to model salient aspects of the same artefact or 15 16toLex group. We therefore include an update on com- resource at different levels of description (something 16 17munity standards and initiatives in Section 4. These which is indeed offered by the OntoLex Lexicogra- 17 18include a subsection on the latest activities in the On- phy module, see Section 4.1.1). We discuss metadata 18 19toLex group (Section 4.1); a discussion of recent work challenges in humanities use cases in Section 4.3.A 19 20on LLD models for corpora and annotation (Section related topic is the relationship between notions such 20 214.2); and similarly for what concerns models and vo- as word from the lexical/linguistic and the philologi- 21 22cabularies for LLD resource metadata (Section 4.3). cal points of view and, more broadly speaking, the re- 22 23Section 5.1.2 features a discussion of the relation- lationship between linguistic and philological annota- 23 24ship between community initiatives and projects. tions of text is a topic which is just starting to gain at- 24 25tention within the context of LLD. It is being studied 25 262.3. The Relationship of LLD to the Digital both at the level of community initiatives (see Section 26 27Humanities 4.2) as well as in projects such as LiLa (see Section 27 285.2.4) as well as POSTDATA (Section 5.2.2). 28 29Several of the projects which we will discuss in this An additional series of challenges arises in the con- 29 30article are related to the area of Digital Humanities sideration of resources for classical and historical lan- 30 31(DH). This is the third major trend which we want guages, or indeed, historical stages of modern lan- 31 32to highlight here, since it represents a move away (or guages. For instance in the case of lexical resources 32 33rather a branching off) from LLD’s beginnings in com- for historical languages we often come up against the 33 34putational linguistics and natural language processing necessity of having to model attestations (something 34 35(although these latter two still perhaps represent the that is discussed in Section 4.1.3) which sometimes 35 36majority of applications of LLD), something that calls cite reconstructed texts, as well as the desirability of 36 37for a shift in emphases in the definition and coverage being able to represent different scholarly and philo- 37 38of LLD models. This overlap between LLD and DH is logical hypotheses for instance when it comes to mod- 38 39particularly apparent in the modelling of corpora an- elling etymologies. The LiLa project [9] (Section 5.2.4 39 40notation (Section 4.2) and in support for lexicographic for a more detailed description) provides a good exam- 40 41use cases (see Section 4.1.1 and Section 5.2.3). Indeed 41 42one obvious example of these shared concerns is the 42 9 43publication of retro-digitised dictionaries as LLD lex- Encompassing what the TEI dictionary chapter guidelines call 43 the typographical and editorial views. See https://www.tei-c.org/ 44 44 ica (a major theme of the ELEXIS project, see Sec- release/doc/tei-p5-doc/en/html/DI.html#DIMV 45tion 5.2.3). The latter use case confronts us with the 10For example, in the case of older resources, annotating instances 45 46challenge of formally modelling both the content of a where the content has been superseded by subsequent scholarly 46 47lexicographic work, that is the linguistic descriptions work. Or we might want to track the evolution of a historically sig- 47 48which it contains, as well as those aspects which per- nificant lexicographic work over the course of a number of editions, 48 in order to see, for example, how changes in entries reflected both 49 49 tain to it as a physical text to be represented in digital linguistic and wider, non-linguistic trends. This was in fact one of 50form. In the latter case, this includes the representation the motivations behind the Nénufar project [8], described in Section 50 51of (elements of) the form of the text, i.e., its structural 5.1.1. 51 6 AF. Khan et al. / When Linguistics Meets Web Technologies. Recent advances in Modelling Linguistic Linked Open Data

1ple of the challenges and opportunities of adopting the ing a basic familiarity with the Resource Description 1 2LLD model to represent linguistic (meta)data for both Framework (RDF), RDF Schema (RDFS) and the Web 2 3lexical and textual resources for a classical language Ontology Language (OWL) – and linguistic linked 3 4(Latin). data in particular. The recently published Linguistic 4 5One extremely important (non RDF-based) standard linked data: representation, generation and applica- 5 6for encoding documents in the Digital Humanities is tions [10] should however give the interested reader 6 7TEI/XML. In the current article we discuss the rela- who is missing this minimal background a comprehen- 7 8tionship between TEI and RDF-based annotation ap- sive introduction to and overview of the latter field, 8 9proaches in Section 4.2.1, and introduce the new lexi- focusing on more established models and vocabular- 9 10cographic TEI-based standard TEI Lex-0 and describe ies and their application rather than on recent devel- 10 11current work on a crosswalk between OntoLex-Lemon opments. Another important new book on the topic of 11 12and the latter in Section 5.2.3. LLD and which has relevance to the current work is 12 13Finally, see Section 5.1.1 for an overview of a num- the collected volume Development of linguistic linked 13 14ber of projects combining DH and LLD. open data resources for collaborative data-intensive 14 15research in the language sciences [11] which aims to 15 162.4. Related Work describe major developments since 2015. It consists 16 17mostly of position papers by researchers from the lin- 17 18The current work is intended, among other things, to guistics and the communities. 18 19both complement as well as to update a previous gen- 19 20eral survey on models for representing LLD, published 20 21by Bosque-Gil et al. in 2018 [7]. Although we are now 3. LLD Models: An Overview 21 22only two years on from the publication of that article, 22 23we feel that enough has happened in the intervening Summary The current section will give an overview 23 24time period to justify a new survey article. In addition of some of the most well known and/or widely used 24 25we believe that we cover a much wider range of topics models and vocabularies in LLD. A summary of the 25 26than the previous article and that our focus is also quite models discussed in the current section (and in the 26 27different. Broadly speaking, that previous work offered whole article) can be found in Tables 1 and 2 (with Ta- 27 28a classification of various different LLD vocabularies ble 1 dealing with published LLD models/vocabular- 28 29 29 according to the different levels of linguistic descrip- ies and 2 with models/vocabularies that are currently 30tion that they covered. The current paper however con- 30 unavailable or no longer updated). An account of some 31centrates more on the use of LLD vocabularies in prac- 31 the latest developments with regards to these models, 32tise and on their availability (this is very much how we 32 on the other hand, can be found in Section 4. 33have approached the survey in Section 3). Moreover, 33 We will classify each of the models described in this 34the present article includes a detailed discussion of re- 34 section according to the scheme given in the linguis- 35cent work in the use of LLD models and vocabularies 35 tic LOD cloud diagram11 (the cloud itself is described 36in corpora and annotation, Section 4.2, as well as an 36 in [12]), namely: 37extensive section on metadata, Section 4.3, neither of 37 38which were given the same detailed level of coverage – Corpora (and Linguistic Annotations)(Section 38 39in [7]. Additionally we also cover the following initia- 3.1) 39 40tives which were not discussed in [7] because they had – Lexicons and Dictionaries (Section 3.2) 40 41not yet gotten underway: – Terminologies, Thesauri and Knowledge Bases 41 42(Section 3.3) 42 – The development of new OntoLex-Lemon mod- 43– Linguistic Resource Metadata (Section 3.4) 43 ules for morphology Section 4.1.2 and frequency, 44– Linguistic Data Categories (Section 3.5) 44 attestations, and corpus Information, described in 45– Typological Databases (Section 3.6) 45 Section 4.1.3 46 46 – An important new initiative in aligning LLD vo- 47For each category we list the most prominent and/or 47 cabularies for corpora and annotation, described 48widely used LLD models/vocabularies that belong to 48 in Section 4.2.5. 49that category (the relevant section is given in paren- 49 50In what follows we will assume that the reader already 50 51has some grounding in linked data in general – includ- 11http://linguistic-lod.org/llod-cloud 51 AF. Khan et al. / When Linguistics Meets Web Technologies. Recent advances in Modelling Linguistic Linked Open Data 7

1theses after each category in the list above). These – (P-Rec 4) Publish semantic artefacts and their 1 2models were either originally designed to help encode contents in a semantic repository: in order to be 2 3that kind of dataset or have been widely appropri- able to exploit repository technologies for find- 3 4ated for that end; in the case of the category Linguis- ability and re-use of semantic artefacts ; 4 5tic Data Categories we list linked data linguistic data – (P-Rec 6) Retrievability through search engines ; 5 6categories. For instance, the OntoLex-Lemon model – (P-Rec 10) Use a foundational ontology to align 6 7falls under Lexicons and Dictionaries since it was ini- semantic artefacts (this enhances re-usability); 7 8tially conceived as a means of enriching ontologies – (P-Rec 13) Create documented crosswalks and 8 9with lexical information, that is, of lexicalising onto- bridges 9 10logical concepts, but subsequently gained popularity as – (P-Rec 16) Ensure clear licensing of semantic 10 11a means of encoding linked data lexica although it can artefacts. 11 12 12 also be used for modelling and publishing other kinds To start with the recommendations (P-Rec 2), (P-Rec 13of datasets. Tables 1 and 2 give a summary of the LLD 13 144), and (P-Rec 10) have been followed by none of 14 vocabularies and models covered in this paper (with the models/vocabularies which we look at below. Fol- 15the relevant sections of the article listed). 15 16lowing these three recommendations would, however, 16 Below we describe our methodology for the rest of 17greatly help to make these resources (and the datasets 17 the section. In Section 3.7 we discuss tools and plat- 18they help to encode) more FAIR and we regard their 18 forms for the publication of LLD. 19adoption as desirable future objectives for the mod- 19 20Our Approach to Classification els and vocabularies listed below, bringing them into 20 21This section is intended as an overview so we do not line with the latest thinking on making such kinds of 21 13 22give a detailed description of single models. Several of resource FAIR. In terms of the recommendation (P- 22 23these models are described in more detail in the rest of Rec 13) at the time of writing we can only mention 23 24the article, or in the Appendix in the case of OntoLex- ongoing efforts at developing a TEI Lex-0/OntoLex- 24 25Lemon. Others can be found in the previous survey pa- Lemon crosswalk described in Section 5.2.3. 25 26per, [7]. Instead we will describe here them on the ba- We will use (P-Rec 6) and (P-Rec 16) to help 26 27sis of a number of criteria many of which are related us to analyse the models and vocabularies to fol- 27 28low. For instance several of the models mentioned 28 to their status as FAIR resources, and in particular to 14 29their status as FAIR models and vocabulary. In partic- do exist on the Linked Open Vocabulary (LOV) 29 30search engine15 [15] and the DBpedia archivo on- 30 ular, we will refer to a recent draft survey on FAIR Se- 31tology archive.17 In cases where licensing informa- 31 mantics [13], the result of a dedicated brainstorming 32tion is available as machine actionable metadata, using 32 workshop of the FAIRsFAIR project.12 This report out- 33properties like DCT:license and URI’s such as https:// 33 lined a number of recommendations and best practices 34creativecommons.org/publicdomain/zero/1.0/ we will 34 for FAIR semantic artefacts where these are defined as 35point this out as it enhances the re-usability of those 35 "machine -actionable and -readable formalisation[s] of 36resources. 36 a conceptualisation enabling sharing and reuse by hu- 37In addition to the written descriptions of different 37 mans and machines" (the term includes: taxonomies, 38LLD models given below, we also give a tabular sum- 38 thesauri, ontologies). 39mary of the most significant/stable/widely available18 39 From all of the recommendations listed in [13] we 40of these models in Table 1. This also points, in rel- 40 have selected the following subset on the basis of 41evant cases, to other parts of the sections of the pa- 41 42their salience to the set of models and vocabularies 42 43under discussion (with justifications for recommenda- 43 13The adoption of foundational ontologies, for instance, might 44tions based on those given in [13]): 44 help to alleviate some of the problems raised by the proliferation of 45– (P-Rec 2) Ensure there is a separate URI for the independently developments as described in [7]. 45 14 46metadata and that they are published separately; https://lov.linkeddata.es/dataset/lov 46 4715Note that the LOV site provides a list of criteria for inclusion on 47 this helps in making the resource more findable 16 48their search engine [14] 48 and supports the extraction of this metadata. 17http://archivo.dbpedia.org/ 49 49 18Several of the models which are described in the rest of the sec- 50tion aren’t available, at least anymore, but may be interesting for 50 5112https://www.fairsfair.eu/ historical reasons. 51 8 AF. Khan et al. / When Linguistics Meets Web Technologies. Recent advances in Modelling Linguistic Linked Open Data

1per where a more in-depth description of said model 3.1. Vocabularies and Models for Corpora and 1 2is given. Every one of the models listed in the table Linguistic Annotations 2 3at is an OWL ontology.. We will also list the other 3 4vocabularies which they make use of (aside from the Linguistic annotation, e.g. for digital editions, cor- 4 5vocabularies OWL, RDF, and RDFS which are com- pora, and linking texts with external resources has 5 6mon to all of the vocabularies on the list). These in- long been a topic of interest in the context of RDF 6 7clude the well known ontologies/vocabularies: XML and linked data. Coexisting with relational databases, 7 8Schema Definition19 (XSD); the Friend of a Friend XML-based formats (most notably, TEI, see 4.2) 8 9Ontology20 (FOAF); the Simple Knowledge Organi- or simply text-based formats, RDF-based annotation 9 10 10 sation System21 (SKOS); Dublin Core22 (DC); Dublin models have been steadily undergoing development 1123 11 Core Metadata Initiative (DCMI) Metadata Terms; and are increasingly being used in research and in- 1224 12 the Data Catalog Vocabulary (DCAT), described also dustry. Currently, there are two primary RDF vocab- 1325 13 in Section 4.3; the PROV Ontology (PROV-O). ularies widely used for text annotations: NLP Inter- 14 14 In addition the table also mentions the following vo- change Format (NIF),32 used mostly in the language 15 15 cabularies. 33 16technology sector and Web Annotation, formerly 16 17– Activity Streams(AS): a vocabulary for activity known as Open Annotation (abbreviated here as OA), 17 18streams.26 used in digital humanities, life sciences and bioinfor- 18 19– GOLD: an ontology for describing linguistic data, matics. Both models have their advantages and short- 19 20which is described in Section 3.5. comings, and a number of proposals to extend these 20 21– MARL: a vocabulary for describing and annotat- have been proposed. Most importantly, there is a need 21 22ing subjective opinions.27 for synchronization between the two. Both are avail- 22 34 35 23– ITSRDF: an ontology used within the Internation- able in LOV and archivo (the NIF core in the case 23 36 24alization Tag Set.28 of NIF ). The Web Annotation model, although it is 24 25– The Creative Commons vocabulary29 (CC). covered by a W3C software and document notice and 25 26– VANN: a vocabulary for annotating vocabulary license, does not express this information in the form 26 27descriptions.30 of triples in the resource metadata; NIF on the other 27 28– SKOS-XL: an extension of SKOS with extra hand does express licensing information as machine 28 29support for “describing and linking lexical en- actionable metadata. 29 30tities”.31 SKOS and SKOS-XL are, along with More details about both models and their recent de- 30 31lemon and its successor OntoLex-Lemon, amongst velopments are described in Section 4.2. Other vo- 31 32the most well known ways of enriching linked cabularies described in that section include POWLA, 32 37 33data taxonomies and conceptual hierarchies with CoNLL-RDF and Ligt. The first of these, POWLA, 33 34 34 linguistic information. We will look at the use of is available on archivo,38 the only one of the three to 35 35 a SKOS-XL vocabulary in the context of a project be so available. CoNLL-RDF39 has version info as a 36 36 on the classification of folk tales in Section 5. string using the owl:versionInfo property and is cov- 37ered by a CC-BY 4.0 license as specified in the LI- 37 38CENSE.data page.40 38 39 39 19https://www.w3.org/TR/xmlschema-0/ 4020http://xmlns.com/foaf/spec/ 40 4121https://www.w3.org/2004/02/skos/ 32https://nif.readthedocs.io/en/latest/ 41 4222https://www.dublincore.org/specifications/dublin-core/ 33https://www.w3.org/TR/annotation-model/ 42 34 43dcmi-terms/ https://lov.linkeddata.es/dataset/lov/vocabs/nif and https://lov. 43 23https://www.dublincore.org/specifications/dublin-core/ linkeddata.es/dataset/lov/vocabs/oa 44 44 dcmi-terms/ 35http://archivo.dbpedia.org/info?o=http://www.w3.org/ns/oa 4524https://www.w3.org/TR/vocab-dcat-2/) 36http://archivo.dbpedia.org/info?o=http://persistence. 45 4625https://www.w3.org/TR/prov-o/ uni-leipzig.org/nlp2rdf/ontologies/nif-core 46 4726https://www.w3.org/TR/activitystreams-vocabulary/ 37http://purl.org/powla/powla.owl 47 27 38 48http://www.gsi.dit.upm.es/ontologies/marl/ https://archivo.dbpedia.org/info?o=http://purl.org/powla/ 48 28https://www.w3.org/TR/its20/ powla.owl 49 49 29https://creativecommons.org/ns 39http://purl.org/acoli/conll# 5030https://vocab.org/vann/ 40https://github.com/acoli-repo/conll-rdf/blob/master/LICENSE. 50 5131https://www.w3.org/TR/skos-reference/skos-xl.html data.txt 51 AF. Khan et al. / When Linguistics Meets Web Technologies. Recent advances in Modelling Linguistic Linked Open Data 9

1Summary 1 2Name Other Vocab- LLO Category Licenses Versions (at Extended Cov- 2 3ularies/Models time of writing erage in Current 3 Used 26/07/21) Article 4 4 5OntoLex-Lemon CC, DC, FOAF, Lexicons and CC0 1.0 Version 1.0, Section 3.2, 5 SKOS, XSD Dictionaries 2016 (but this is Section 4.3.3 and 6 6 closely based on Appendix A 7the prior lemon 7 8model [16]) 8 9Lexicog DC, LexInfo, Lexicons and CC0 Version 1.0, Section 3.2 and 9 10(OntoLex- SKOS, VOID, Dictionaries (2019-03-08) Section 4.1.1 10 11Lemon) XSD 11 12MMoOn DC, FOAF, Terminologies, CC-BY 4.0 Version 1.0, 2016 Section 3.3 12 GOLD, LexVo, Thesauri and KBs 13 13 OntoLex-Lemon, (Morphology) 14SKOS, XSD 14 15Web Annotation AS, FOAF, Corpora and W3C Software Version Section 3.1 and 15 16Data Model (OA) PROV, SKOS, Linguistic Anno- and Document "2016-11- Section 4.2.3 16 17XSD tations Notice and 12T21:28:11Z" 17 18License 18 19NLP Interchange DC, DCTERMS, Corpora and Apache 2.0 and Version 2.1.0 Section 3.1 and 19 Format (NIF ITSRDF, levont, Linguistic Anno- CC-BY 3.0 Section 4.2.2 20 20 Core) MARL, OA, tations 21PROV, SKOS, 21 22VANN, XSD 22 23POWLA FOAF, DC, DCT, Corpora and NA Last Updated Section 4.2 23 24Linguistic Anno- 2018-04-03 24 25tations 25 26CoNLL-RDF DC, NIF Core, Corpora and Apache 2.0 and Last Updated Section 4.2.4 26 27XSD Linguistic Anno- CC-BY 4.0 2020-05-26 27 tations 28 28 Ligt DC, NIF Core, Corpora and NA Version 0.2 Section 4.2.4 29 29 OA Linguistic Anno- (2020-05-26) 30tations 30 31META-SHARE CC, DC, DCAT, Linguistic Re- CC-BY 4.0 Version 2.0 (pre- Section 3.4 and 31 32FOAF, SKOS, source Metadata release) Section 4.3.2 32 33XSD 33 34OLiA DCT, FOAF, Linguistic Data CC-BY-SA 3.0 Version last up- Section 3.5 34 35SKOS Categories dated 27/02/20 35 36LexInfo CC, Ontolex, Linguistic Data CC-BY 4.0 Version 3.0, Section 3.5 36 37TERMS, VANN Categories 14/06/2014 37 38LexVo FOAF, SKOS, Typological CC-BY-SA3.0 Version 2013-02- Section 3.6 38 SKOSXL, XSD Databases 09 39 39 Table 1 40 40 Summary of published LLD vocabularies 41 41 42 42 433.2. Lexicons and Dictionaries the W3C ontolex working group which manages its 43 44ongoing development and further extension (see Ap- 44 45The most well known model for the creation and pendix A for an introduction to the model with exam- 45 46publication of lexica and dictionaries as linked data ples and Section 4.1 for extensions and further devel- 46 41 47is the OntoLex-Lemon model [17], an output of opments). It is based on a previous model, the LExi- 47 48con Model for ONtologies (lemon) [16]. Like its pre- 48 49decessor, OntoLex-Lemon was designed with the in- 49 41The URI for OntoLex-Lemon is: http://www.w3.org/ns/lemon/ 50ontolex and the OntoLex-Lemon guidelines can be found at https: tention of enriching ontologies with linguistic infor- 50 51//www.w3.org/2016/05/ontolex/. mation and not of modelling dictionaries and lexicons 51 10 AF. Khan et al. / When Linguistics Meets Web Technologies. Recent advances in Modelling Linguistic Linked Open Data

1Summary 1 2Name LLO Category Status (at time of Extended Cov- 2 3writing 26/07/21) erage in Current 3 Article 4 4 5OntoLex-Lemon: Lexicons and Under Develop- Section 4.1.3 5 FrAC Dictionaries ment 6 6 OntoLex-Lemon: Lexicons and Under Develop- Section 4.1.2 7 7 Morphology Dictionaries ment 8 8 PHOIBLE Terminologies, Unavailable Section 3.3 9Thesauri and 9 10KBs 10 11FRED Corpora and Project Specific Section 4.2 11 12Linguistic Anno- Vocabulary 12 13tations 13 14NAF Corpora and Project Specific Section 4.2 14 Linguistic Anno- Vocabulary 15 15 tations 16 16 GOLD Linguistic Data No Longer Up- Section 3.5 17 17 Categories dated 18Table 2 18 19Other LLD Vocabularies Discussed in this Paper 19 20 20 21 21 per se. Thanks to its popularity however, it has come synsem;52 the decomp module.53 Three of its modules 22 22 to take on the status of a de facto standard for the are available on archivo, the core:54 the lime metadata 23 23 55 56 24modelling and codification of lexical resources in RDF module and the Variation and Translation module. 24 25(including, for instance, retrodigitized dictionaries and All of the OntoLex modules have their licenses (CC0 25 26wordnets) in general. Resources which have been mod- 1.0) described with RDF triples using the CC vocabu- 26 57 27elled using OntoLex-Lemon include: the LLD version lary with a URI as an object. Version information is 27 42 28of the Princeton Wordnet, DBnary (the linked data described using owl:versionInfo. 28 29version of Wiktionary) [18], and the massive multilin- The OntoLex-Lemon Lexicography module58 (de- 29 30gual knowledge graph Babelnet [19]. The OntoLex- scribed in more detail in Section 4.1.1) was published 30 31Lemon model is modular and consists of a core mod- separately from OntoLex-Lemon. It is not available on 31 32ule along with modules for Syntax and Semantics,43 LOV yet, however it is available on archivo.59 The li- 32 33Decomposition,44 and Variation and Translation,45 as cense (CC-Zero) is described with RDF triples using 33 34well as a dedicated metadata module, lime46 (all of the CC vocabulary60 and DC61 with a URI as an object. 34 35 35 these modules are described in Appendix A, except for Version information is described using owl:versionInfo. 36 36 lime which is described in Section 4.3.3). 37 37 OntoLex-Lemon is available on LOV as is its prede- 38 38 cessor lemon.47 All of its separate modules are listed 52https://lov.linkeddata.es/dataset/lov/vocabs/synsem 39 39 separately however:48 the core;49 lime;50 vartrans;51 53https://lov.linkeddata.es/dataset/lov/vocabs/lexdcp 4054https://archivo.dbpedia.org/info?o=http://www.w3.org/ns/ 40 41lemon/ontolex 41 4255http://archivo.dbpedia.org/info?o=http://www.w3.org/ns/ 42 42 43http://wordnet-rdf.princeton.edu/about lemon/lime 43 43http://www.w3.org/ns/lemon/synsem 56http://archivo.dbpedia.org/info?o=http://www.w3.org/ns/ 44 44 44http://www.w3.org/ns/lemon/decomp lemon/vartrans 4545http://www.w3.org/ns/lemon/vartrans 57Using the cc:license property 45 4646http://www.w3.org/ns/lemon/lime 58The guidelines for the module can be found at https://www.w3. 46 4747https://lov.linkeddata.es/dataset/lov/vocabs/lemon org/2019/09/lexicog/, the URL for the module is at http://www.w3. 47 48 48See the Appendix for a description of each, aside from lime de- org/ns/lemon/lexicog# 48 scribed in Section 4.3.3 59https://archivo.dbpedia.org/info?o=http://www.w3.org/ns/ 49 49 49https://lov.linkeddata.es/dataset/lov/vocabs/ontolex lemon/lexicog 5050https://lov.linkeddata.es/dataset/lov/vocabs/lime 60Using the cc:license property 50 5151https://lov.linkeddata.es/dataset/lov/vocabs/vartrans 61using dc:rights 51 AF. Khan et al. / When Linguistics Meets Web Technologies. Recent advances in Modelling Linguistic Linked Open Data 11

13.3. Vocabularies for Terminologies, Thesauri and 3.4. Linguistic Resource Metadata 1 2Knowledge Bases 2 3Due to the importance of the topic we give a much 3 4 4 The Simple Knowledge Organisation System fuller overview in Section 4.3; here we will only look 5at accessibility issues for the two models for language 5 (SKOS) is a W3C recommendation for the creation of 6resource metadata which we mention there. These are 6 terminologies and thesauri, or more broadly speaking, 7the METASHARE ontology67 and lime. The latter has 7 knowledge organisation systems.62 We will not go into 8been previously introduced and is described in more 8 9any depth into it here since it is a general purpose vo- detail in Section 4.3.3. The former is currently in its 9 10cabulary which is applied well beyond the domain of pre-release version 2.0 (the last update being 2020-03- 10 11language resources. 20). Its license information (it has a CC-BY 4.0 li- 11 12In terms of specialised vocabularies or models for cense) is available as triples using dct:license with a 12 13the modelling of linguistic knowledge bases – and URI as an object. 13 14aside from linguistic data category registries which 14 15will be discussed in Section 3.5 – we can list two. 3.5. Linguistic Data Categories 15 16The first is MMoOn ontology63 which was designed 16 17for the creation of detailed morphological invento- History 17 18ries [20]. It does not currently seem to be available on As of 2010, two major repositories were in widespread 18 19any semantic repositories/archives/search engines but use by different communities for addressing the har- 19 20 20 it does have its own dedicated website64 which offers a monization and linking of linguistic resources via their 21 21 SPARQL endpoint (although this was down at the time data categories. In computational lexicography and 22language technology, the most widely applied termi- 22 of writing). Its license information (it has a CC-BY 4.0 23nology repository was ISOcat [22] which provided 23 license) is available as triples using dct:license with a 24human-readable and XML-encoded information about 24 25URI as an object. linguistic data categories that were relevant for linguis- 25 26PHOIBLE is an RDF model for creating phono- tic annotation, the encoding of electronic dictionaries 26 27logical inventories [7]. As of the time of writing, and language resource metadata via persistent URIs. 27 28PHOIBLE data was no longer available as a complete In the field of language documentation and typol- 28 29RDF graph, but only in its native (XML) format from ogy, the General Ontology of Linguistic Description 29 30which RDF fragments are dynamically generated. The (GOLD) emerged in the early 2000s [23], having been 30 31original data remains publicly available,65 but on the originally developed in the context of the project En- 31 32PHOIBLE website, it is only possible to browse and dangered Metadata for Endangered Languages Data 32 33export selected content into RDF/XML.66 Since it no (E-MELD, 2002-2007).68 GOLD stood out in partic- 33 34longer provides resolvable URIs for its components, ular because of its excellent coverage of low resource 34 35 35 PHOIBLE data does not fit within the narrower scope languages. In the RELISH project, a curated mirror of 36 36 of LLD vocabularies anymore. It does, however, main- GOLD-2010 was incorporated into ISOcat [24]. Un- 37 37 tain a non-standard way of linking, as it has been ab- fortunately, since then, GOLD development has stalled 38and, while the resource is still being maintained by the 38 sorbed into the Cross-Linguistic Linked Data infras- 39LinguistList (along with the data from related projects) 39 tructure [21, CLLD] (along with other resources from 40and still remains accessible,69 it has not been updated 40 the typology domain). CLLD datasets and their RDF 41since [25] (and for this reason we have not included it 41 42exports continue to be available as open data under in our summary table). In parts, its function seems to 42 43https://clld.org/, see below for additional details, Sec- have been taken over by ISOcat, but it is worth point- 43 44tion 3.6. ing out here that the ISOcat registry exists only as a 44 45static, archived resource, but no longer as an opera- 45 46tional system. 46 4762https://www.w3.org/2004/02/skos/ 47 63 48https://github.com/MMoOn-Project/MMoOn/blob/master/core. 48 ttl 67http://www.meta-share.org/ontologies/meta-share/ 49 49 64https://mmoon.org/ meta-share-ontology.owl/documentation/index-en.html 5065https://github.com/clld/phoible/tree/master/phoible/static/data 68http://emeld.org/ 50 5166See, for example, https://phoible.org/inventories/view/161. 69https://linguistlist.org/projects/gold.cfm 51 12 AF. Khan et al. / When Linguistics Meets Web Technologies. Recent advances in Modelling Linguistic Linked Open Data

1The Current Situation 3.6. Vocabularies for Typological Datasets 1 2The ‘official’ successor of ISOcat, the CLARIN 2 3Concept Registry is briefly discussed in Section 4.3 Relevant Resources and Initiatives 3 4below (it is not strictly speaking a linked data vo- Linguistic typology is commonly defined as the 4 5cabulary). Another one of its successors is the Lex- field of linguistics that studies and classifies languages 5 6Info ontology,70 the data category register used in based on their structural features [28]. The field of lin- 6 7OntoLex-Lemon and which re-appropriates many of guistic typology has natural ties with language docu- 7 8the concepts which were contained in ISOcat for use mentation, and accordingly, considerable work on lin- 8 9within the lexical domain (dictionaries, terminologies, guistic typology and linked data has been conducted in 9 10lexica). Currently in its third version, LexInfo can the context of the GOLD ontology (see above, Section 10 71 11be found both on the LOV search engine and on 3.5). We can identify the following relevant datasets. 11 72 12archivo, it appears both times however in its second One of the main contributors and advisors to the sci- 12 13version. Version 3.0 is under development since late entific study of typology is the Association for Lin- 13 73 142019 in a community-guided process via GitHub, guistic Typology (ALT).75 They facilitate the descrip- 14 15and is not registered with either service, yet. LexInfo tion of the typological patterns underlying datasets. 15 16has a (CC-BY 4.0) license. This is described with RDF One of the most well-known resources that ALT makes 16 17triples using the CC vocabulary and DCT with a URI available is the World Atlas of Language Structures 17 18as an object in both cases. Version information is de- (WALS)76 [29, 30] which is a large database of phono- 18 19scribed using owl:versionInfo. logical, grammatical, and lexical properties of lan- 19 20For linguistic data categories in linguistic annota- guages gathered from descriptive materials. This re- 20 21tion (of corpora and by NLP tools), a separate termi- source can both be used interactively online and can be 21 22nology repository exists with the Ontologies of Lin- 77 22 74 downloaded. The CLLD (Cross-Linguistic Linked 23guistic Annotation [26, OLiA]. OLiA has been de- Data) project integrates WALS, thus, offering a frame- 23 24veloped since 2005 in an effort to link community- work that structures this typological dataset using the 24 25maintained terminology repositories such as GOLD, 25 Linked Data principles. 26ISOcat or the CLARIN Concept Registry with an- 26 Another collection that provides web-based access 27notation schemes and domain- or community-specific 27 to a large collection of typological datasets is the Ty- 28models such as LexInfo or the Universal Dependen- 28 pological Database System (TDS) [31, 32]. The main cies specifications by means of an intermediate “Ref- 29goals of TDS are to offer users a linguistic knowledge 29 erence Model”. OLiA consists of a set of modular, in- 30base and content metadata. The knowledge base in- 30 terlinked ontologies and is designed as a native linked 31cludes a general ontology and dictionary of linguis- 31 data resource. Its primary contributions are to provide 32tic terminology, while the metadata describes the con- 32 machine-readable documentation of annotation guide- 33tent of the term ontology databases. TDS supports a 33 lines and a linking with and among other terminology 34unified querying across all the typological resources 34 35repositories. It has been suggested that such a collec- 35 hosted with the help of an integrated ontology. The 36tion of linking models, developed in an open source 36 Clarin Virtual Language Observatory (VLO)78 in- 37process via GitHub, may be capable of circumvent- 37 corporates TDS among its repositories. 38ing some of the pitfalls of earlier, monolithic solutions 38 Finally, another group of datasets relevant for typo- 39of the ISOcat era [27]. At the moment, OLiA cov- 39 logical research include large-scale collections of lex- 40ers annotation schemes for more than 100 languages, 40 ical data, as provided, for example by PanLex79 and 41for morphosyntax, syntax, discourse and aspects of se- 41 Starling.80 An early RDF edition of PanLex has been 42mantics and morphology. OLiA has a (CC-BY 4.0) li- 42 described by [33] and was incorporated in the initial 43cense; this is described using the Dublin Core property 43 version of the Linguistic Linked Open Data cloud dia- 44license with a URI as an object. 44 gram. At the time of writing, however, this early RDF 45 45 4670https://lexinfo.net/ 46 4771https://lov.linkeddata.es/dataset/lov/vocabs/lexinfo 75https://linguistic-typology.org/ 47 72 76 48http://archivo.dbpedia.org/info?o=http://www.lexinfo.net/ https://wals.info/ 48 ontology/2.0/lexinfo 77https://clld.org/ 49 49 73It will be the first version that is compliant with OntoLex- 78https://vlo.clarin.eu/ 50Lemon. 79http://panlex.org 50 5174http://purl.org/olia 80https://starling.rinet.ru/ 51 AF. Khan et al. / When Linguistics Meets Web Technologies. Recent advances in Modelling Linguistic Linked Open Data 13

1version does not seem to be accessible anymore. In- are yet highly motivated to create and/or make use of 1 2stead, CSV and JSON dumps are being provided from linked data resources. Such software is especially help- 2 3the PanLex website. On this basis [34] describe a fresh ful when it comes to the validation and post-editing 3 4OntoLex-Lemon edition of PanLex (and other) data as of language resources which have been generated au- 4 5part of the ACoLi Dictionary Graph.81 However, they tomatically or semi-automatically. They also help to 5 6currently do not provide resolvable URIs, but rather make the information in these resources accessible for 6 7redirect to the original PanLex page. The authors men- those who may not yet have learned SPARQL in order 7 8tion that linking would be a future direction, and in to browse or carry out (simple) queries on them. 8 9preparation for this, they provide a TIAD-TSV edi- In terms of existing tools or software which offer 9 10tion of the data along with the OntoLex edition, with dedicated provision for the models which we look at 10 11the goal to adapt techniques for lexical linking devel- in this article, we can mention VocBench and LexO 11 12oped in the context of, for example, the on-going se- for OntoLex-Lemon. Both of these are web-based 12 13ries of Shared Tasks on Translation Inference Across and allow for the collaborative development of com- 13 14Dictionaries.82 As for modelling requirements of lexi- putational resources. In the case of the VocBench 14 15cal datasets in linguistic typology, these are not funda- platform, currently in its third release [36], these re- 15 16mentally different from other forms of lexical data, but sources can be OWL ontologies and SKOS thesauri 16 17they adopt OntoLex, resp. its predecessor, lemon, see as well OntoLex-Lemon lexicons. LexO focuses on 17 18above. They do, however, require greater depth with lexical and ontological resources and was originally 18 19respect to identifying and distinguishing language va- developed in the context of DitMaO a project on 19 20rieties. This was one of the driving forces behind the the medico-botanical terminology of the Old Occitan 20 21development of Glottolog. language [37]. A first, generally available version of 21 22LexO, LexO-lite, is available at the https://github.com/ 22 23Vocabularies for Typological Datasets andreabellandi/LexO-lite. 23 24In terms of linked data vocabularies and mod- Finally, we should mention LLODifier85 a suite of 24 25els which are relevant for the creation of typologi- tools for creating and working with LLD which is cur- 25 83 26cal databases we can identify LexVo [35]. LexVo rently being developed by the Applied Computational 26 27bridges the gap between linguistic typology and the Linguistics Lab of the Goethe University Frankfurt. 27 28LOD community and brings together language re- These include vis for working with NIF and unimorph 28 29sources and the entity relationships provided through which works with CoNLL-RDF. 29 30the Linked Data Web and the Semantic Web. The 30 31project manages to link a large variety of resources 31 32on the Web, besides providing global IDs (URIs) 4. An Overview of Developments in LLD 32 33for language-related objects. LexVo is available on Community Initiatives and Standards 33 34archivo84 but is not yet available on LOV. Further dis- 34 35cussion of LexVo can be found in Section 4.3.4 Summary and Overview The current section com- 35 36prises an extensive overview of developments in vari- 36 373.7. Excursus: Tools and Platforms for the ous different LLD community initiatives and standards 37 38Publishing of LLD relating to LLD models and vocabularies. In particular, 38 39it focuses on the three areas that we believe have been 39 the most active in the last few years (the first two of the 40The availability of tools and platforms for the edit- 40 following) or that are starting to gain greater promi- 41ing, conversion and publication of LLD resources (on 41 nence (the third): lexical resources (Section 4.1), anno- 42the basis of the models which we discuss in this article) 42 tation and corpora (Section 4.2), and finally metadata 43is important for the adoption of those models amongst 43 (Section 4.3). We have referred to these as community 44a wider community of end users. It is especially im- 44 standards/initiatives because they have been pursued 45portant for users who are unfamiliar with the technical 45 or developed as community efforts rather than within 46details of linked data and the Semantic Web and who 46 47a single research group or project. 47 48Membership in these community initiatives is (of- 48 81Data available under https://github.com/acoli-repo/acoli-dicts. 49ten) open to all, rather than being limited to members 49 82https://tiad2021.unizar.es/ 5083http://lexvo.org/ 50 5184http://archivo.dbpedia.org/info?o=http://lexvo.org/ontology 85https://github.com/acoli-repo/LLODifier 51 14 AF. Khan et al. / When Linguistics Meets Web Technologies. Recent advances in Modelling Linguistic Linked Open Data

1of a project or experts nominated by a standards body. – the W3C Community Group Ontology-Lexica,89 1 2The intention being to allow for the participation of a originally working on ontology lexicalization, the 2 3wider range of stakeholders as well as encouraging the group extended their activities after the publica- 3 4collection of a wider variety of use-cases than might tion of the OntoLex vocabulary (May 2016) and 4 5otherwise be possible. now represents the main locus to discuss the mod- 5 6One of the most notable community efforts in the elling of lexical resources with web standards and 6 7context of LLOD (that is Linguistic Linked Open in LL(O)D. 7 8Data) is the Open Linguistics Working Group (OWLG) – the W3C Community Group Linked Data for 8 90 9of Open Knowledge International86 that introduced Language Technology, with a focus on lan- 9 10the vision of a Linguistic Linked Open Data (LLOD) guage resource metadata and linguistic annota- 10 11cloud in 2011 [38], and whose activities, most no- tion with W3C standards 11 12 12 tably the organization of the long-standing series of Most recently, these activities have converged in 13 13 international Workshops on Linked Data in Linguis- funded networks, especially, the Cost Action NexusLin- 14 14 tics (LDL, since 2012), as well as the publication guarum. 15 15 of the first collected volume on the topic of Linked Also, Linked Data plays a certain role in the context 16 16 Data in Linguistics [39], ultimately led to the imple- of older standardization initiatives, e.g., the TEI Con- 17 17 mentation of LLOD cloud in 2012, celebrated with sortium,91 or the ISO TC37/SC4 committee.92 18 18 a special issue of the Semantic Web Journal pub- We take the standards and initiatives proposed by 19 19 lished in 2015 [40]. The LLOD cloud, now hosted un- these communities as our basis of the topics in this sec- 20 20 der http://linguistic-lod.org/, was enthusiastically em- tion, but we will also look at significant developments 21 21 braced, the linguistics category became a top-level cat- respecting these standards and initiatives outside and 22 22 egory in the 2014 LOD cloud diagram, and since 2018, independent of these groups (see Section 4.1.4) in the 23 23 interests of completeness and to understand current 24it represented the first LOD domain sub-cloud. 24 trends. 25At the same time, a number of more specialized ini- 25 Note that our intention has been to go for complete- 26tiatives emerged, as mentioned below, for which the 26 ness in our description of these developments. At the 27Open Linguistics Working Group acted and continues 27 same time, however, as has been mentioned we have 28to act as an umbrella that facilitates information ex- 28 tried not to include too much material that was already 29change among them and between them and the broader 29 available in the sources cited in Section 2.4. 30circles of linguists interested in linked data technolo- 30 In addition, a discussion of the relationship between 31gies and knowledge engineers interested in language. 31 community initiatives and projets can be found in Sec- 32Currently, main activities of the OWLG are the orga- 32 tion 5.1.2 below. 33nization of workshops on Linked Data in Linguistics 33 34(LDL), the coordination of datathons such as Multi- 34 4.1. Lexical Resources: OntoLex-Lemon and its 35lingual Linked Open Data for Enterprises (MLODE 35 Extensions 362012, 2013) and the Summer Datathon in Linguis- 36 37tic Linked Open Data (SD-LLOD, 2015, 2017, 2019), 37 Summary In this section we will describe some of 38maintaining the Linguistic Linked Open Data (LLOD) 38 the most recent work that has been carried out on the 39cloud diagram87 and continued information exchange 39 OntoLex-Lemon model,93 both within the ambit of the 40via mailing list88 40 W3C Ontolex group as well as outside of it. With re- 41Over the years, however, the focus of discussion 41 gards to the former we will discuss three of the lat- 42moved from the OWLG to more specialized mailing 42 est extensions to the model (one of which has been 43lists and communities. At the time of writing, partic- 43 published and two which are still currently under de- 44ularly active community groups concerned with data 44 velopment) in Sections 4.1.1, 4.1.2, and 4.1.3. In Sec- 45modelling include 45 46tion 4.1.4 we look at a number of new extensions to 46 47 47 86 89 48https://linguistics.okfn.org/ https://www.w3.org/community/ontolex/ 48 87http://linguistic-lod.org/ 90https://www.w3.org/community/ld4lt 49 49 88Since early 2020, the mailing list operates via https://groups. 91https://tei-c.org/ 50google.com/g/open-linguistics. Earlier messages are archived under 92https://www.iso.org/committee/297592.html 50 51https://lists-archive.okfn.org/pipermail/open-linguistics/. 93An introduction to the model is given in Appendix A 51 AF. Khan et al. / When Linguistics Meets Web Technologies. Recent advances in Modelling Linguistic Linked Open Data 15

1OntoLex-Lemon which have emerged independently associated with the lime class Lexicon (see Section 1 2of the W3C Ontolex group over the last two years and 4.3.3). The former is described as representing “a col- 2 3which moreover have not been discussed in [7] (for lection of lexicographic entries[...]in accord with the 3 4an in-depth discussion of such developments prior to lexicographic criteria followed in the development of 4 52018 please refer to the latter paper). that resource”.96 These lexicographic entries are rep- 5 6Note that the use of OntoLex-Lemon in a number of resented in their turn by another new lexicog class, 6 7different projects is described in Section 5. namely, the class Entry, which is defined as being a 7 8“structural element that represents a lexicographic ar- 8 4.1.1. The OntoLex Lexicography Module (lexicog) 9ticle or record as it arranged in a source lexicographic 9 As mentioned previously, lemon and its successor 10resource”97 (emphasis ours). Furthermore an Entry is 10 OntoLex-Lemon have been widely adopted for the 11related to its source Lexicographic Resource via the 11 modelling and publishing of lexica and dictionaries as 12object property entry. The class Entry is a subclass 12 linked data. The core module has proven to be rea- 13of the more general class Lexicographic Component, 13 sonably effective in capturing some of the most typ- 14defined as "a structural element that represents the 14 ical kinds of lexical information contained in dictio- 15(sub-)structures of lexicographic articles providing in- 15 naries and lexical resources in general (e.g., [41–45]). 16formation about entries, senses or sub-entries", mem- 16 However, there are certain fairly common situations in 17bers of this class “can be arranged in a specific order 17 which the model falls short, most notably in the repre- 18and/or hierarchy”.98 That is, Lexicographic Component 18 sentation of certain elements of dictionaries and other 19allows for the representation of the ordering of senses 19 lexicographic datasets [46]. This, however, is not sur- 20in an entry (and even potentially entries if this is re- 20 prising given that lemon was initially conceived as a 21quired), the arrangement of senses and sub-senses in 21 model for grounding ontologies with linguistic infor- 22a hierarchy, etc. in a published lexicographic resource 22 23mation and not for modelling lexical resources per se. (by making use of the classes and properties we have 23 24In order to adapt OntoLex-Lemon to the modelling looked at above, along with the lexicog object property 24 25necessities and particularities of dictionaries and other subComponent), separately from the representation of 25 26lexicographic resources, the W3C Ontolex commu- the same resource as lexical content. Finally,99 we need 26 27nity group developed a new OntoLex Lexicography 27 94 some way of linking together these two levels of repre- 28Module (lexicog). This module was the result of sentation. This is provided by the lexicog object prop- 28 29collaborative work with contributions from lexicogra- erty describes which relates individuals of class Lexi- 29 30phers, computer scientists, dictionary industry practi- cographic Component, which belong to a specific lex- 30 31tioners, and other stakeholders and was first released icographic resource, “to an element that represents the 31 32in September 2019. As stated in the specification, the actual information provided by that component in the 32 33lexicog module “overcome[s] the limitations of lemon lexicographic resource”.100 See Figure 1. 33 34when modelling lexicographic information as linked As an example we can take the entry for the Ital- 34 35data in a way that is agnostic to the underlying lexico- ian word chiaro ‘clear’ in the popular Italian dictionary 35 36graphic view and minimises information loss”. Treccani.101 The latter lists the word as both an adjec- 36 37The idea is to keep purely lexical content separate tive and a masculine noun, as well as having an adverb 37 38from lexicographic (textual) content. For that purpose, as a subcomponent of the entry, in addition to listing 38 39new ontology elements have been added that reflect the related adverb chiaramente ‘clearly’ as a related 39 40the dictionary structure (e.g., sense ordering, entry hi- entry as well as the diminutive chiaretto. 40 41erarchies, etc.) and complement the OntoLex-Lemon The first two of the (four) subsenses of the entry are 41 42model. The lexicog module have been validated with classed as adjectives, the third as a noun, and the fourth 42 43real enterprise-level dictionary data [47] and can be 43 considered in a stable status right now. 4496 44 In lexicog the structural organisation of a lexico- https://www.w3.org/2019/09/lexicog/#lexicographic-resource 4597https://www.w3.org/2019/09/lexicog/#Entry 45 46graphic resource is now associated with the class Lex- 98https://www.w3.org/2019/09/lexicog/ 46 95 47icographic Resource (a subclass of the VoID class #lexicographic-component 47 99 48Dataset) whereas the lexical content is (as previously) Note that we do not cover all the classes and properties in the 48 module in this brief description, please see the guidelines for a com- 49 49 prehensive description with examples. 5094https://www.w3.org/2019/09/lexicog/ 100https://www.w3.org/2019/09/lexicog/#describes-0 50 5195https://www.w3.org/TR/void/ 101https://www.treccani.it/vocabolario/chiaro/ 51 16 AF. Khan et al. / When Linguistics Meets Web Technologies. Recent advances in Modelling Linguistic Linked Open Data

1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 11 11 12 12 13 13 14 14 15 15 16 16 17 17 18 18 19 19 20 20 21 21 22 22 23 23 24 24 25 25 26 26 27 27 28 28 29 29 30Fig. 1. The lexicog module (taken from the guidelines). 30 31 31 32 32 as an adverb. We will simplify this for the purposes of lime:language "it" ; 33 33 exposition by assuming that the first subsense is an ad- lime:entry :chiaro1_adj, :chiaro2_n, 34:chiaro3_adv . 34 jective, the second a noun, and the third an adverb. This 35 35 can be represented in the following way. First we rep- :chiaro1_adj a ontolex:LexicalEntry . 36:chiaro2_n a ontolex:LexicalEntry . 36 resent the encoding of the Treccani dictionary struc- :chiaro3_adv a ontolex:LexicalEntry . 37 37 ture itself, and the different sub-components of the en- 38 38 try for chiaro: Finally we bring the two resources together using 39 39 :treccaniRDF a lexicog:LexicographicResource; the describes property. 40 40 dc:language "it" ; 41lexicog:entry :chiaro_entry . :chiaro1_comp lexicog:describes :chiaro1_adj . 41 :chiaro2_comp lexicog:describes :chiaro2_n. 42 42 :chiaro_entry a lexicog:Entry ; :chiaro3_comp lexicog:describes :chiaro3_adv 43rdfs:member :chiaro_1_comp, 43 :chiaro_2_comp, 44 44 :chiaro_3_comp. 454.1.2. OntoLex Morphology Module 45 :chiaro1_comp a lexicog:LexicographicComponent; 46Since November 2018, the W3C Ontolex commu- 46 :chiaro2_comp a lexicog:LexicographicComponent; 47:chiaro3_comp a lexicog:LexicographicComponent. nity group has been developing another extension of 47 48the core model that would allow for better representa- 48 49Next we encode a lexicon which represents the con- tion of morphological data in lexical resources. 49 50tent of the resource in the last listing. Morphology plays an important role in many lan- 50 51:myItLexicon a lime:Lexicon; guages, and its description has also played an impor- 51 AF. Khan et al. / When Linguistics Meets Web Technologies. Recent advances in Modelling Linguistic Linked Open Data 17

1tant role in the work of lexicographers. The extent of 1 :lex_drive_v a ontolex:LexicalEntry . 2its presence in concrete resources can vary, ranging :lex_driver_n a ontolex:LexicalEntry ; 2 3from the sporadic indication of certain specific forms decomp:constituent :component_drive, 3 :component_er . 4in a dictionary (e.g. plural form for some nouns) to 4 5:component_drive a decomp:Component ; 5 electronic resources which provide tables with entire decomp:correspondsTo :lex_drive_v . 6inflectional paradigms for every word.102 :component_er a decomp:Component ; 6 7The core OntoLex-Lemon model, together with decomp:correspondsTo :suffix_er . 7 8LexInfo (see Section 3.5), provides the means of en- :suffix_er a morph:AffixMorph . 8 9coding basic morphological information: for lexical 9 10entries, morphosyntactic categories such as part of Inflection is modelled as follows: every instance of 10 11speech can be provided and basic inflection informa- Form has properties morph:consistsOf which point to 11 103 12tion (i.e. morphological relationship between a lexical instances of morph:Morph. These instances can have 12 13entry and forms) can be modelled by creating any addi- morphosyntactic properties expressed by linking to an 13 14 14 tional inflected forms with corresponding morphosyn- external vocabulary, e.g. LexInfo: 15 15 tactic features (e.g. case, number, etc.). However this :lex_drive_v a ontolex:LexicalEntry ; 16ontolex:otherForm :form_drives . 16 only covers a small portion of the potential morpho- 17 17 logical data present in many lexical resources. Neither :form_drives a ontolex:Form ; 18consistsOf :stem_drive_v, :suff_s . 18 derivation (i.e. morphological relationships between 19:suff_s a morph:AffixMorph ; 19 20lexical entries) nor additional inflectional information lexinfo:number lexinfo:Plural . 20 21(e.g. declension type for Latin nouns) can be prop- 21 104 22erly modelled with the core model. The new OntoLex The module has not yet been published and is 22 23Morphology module has been proposed to address still very much under development by the W3C group. 23 24these limitations. The scope of the module is threefold: At the time of writing, a consensus was reached on 24 the first two parts of the module, and their overview 25– Representing derivation: for a more sophisticated 25 has been published in [48]. The third part, which con- 26description of the decomposition of lexical en- 26 27cerns the automatic generation of forms is currently 27 tries; being discussed, and the next step will be validating 28– Representing inflection: introducing new ele- 28 29the model by creating resources using the module. 29 ments to represent paradigms and wordform- 30 30 building patterns; 4.1.3. OntoLex-FrAC: Frequency, Attestations, 31 31 – Providing means to create wordforms automati- Corpus Information 32 32 cally based on lexical entries, their paradigms and In parallel with the development of the Morphol- 33 33 inflection patterns. ogy Module, the OntoLex W3C group has also started 34developing a separate module that would allow for 34 35Figure 2 presents the diagram for the module. the enrichment of lexical resources with information 35 36The central class of the module, and which is used drawn from corpora and other language resources. 36 37in the representation of both derivation and inflection, Most notably, this includes the representation of at- 37 38is Morph with subclasses for different types of mor- testations (used for instance as illustrative exam- 38 39phemes. ples in a dictionary). These were originally discussed 39 40For derivation, elements from the decomp module within lexicog (See 4.1.1), but this discussion quickly 40 41are reused. A derived lexical entry has Components for outgrew the confines of computational lexicography 41 42each of the morphemes of which it consists. A stem alone. Furthermore, it was observed that OntoLex 42 43corresponds to a different lexical entry whereas mor- lacked any support for corpus-based statistics, a cor- 43 44 44 phemes which do not correspond to any headwords, nerstone not only of empirical lexicography, but also 45 45 correspond to an object of a Morph class. A derived of computational philology, and lan- 46 46 lexical entry has constituent properties pointing to ob- 47 47 jects of the Component class: 103 48One of the problems with this approach is that the order of 48 the affixes is undefined, there are several possible solutions for this, 49 49 e.g. a property next between two morphs, but currently there is no 50102For example, Wiktionary, https://en.wiktionary.org/wiki/ consensus in the community on how to model the order. 50 51Buch#Declension. 104https://www.w3.org/community/ontolex/wiki/Morphology 51 18 AF. Khan et al. / When Linguistics Meets Web Technologies. Recent advances in Modelling Linguistic Linked Open Data

1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 11 11 12 12 13 13 14 14 15 15 16 16 17 17 18 18 19Fig. 2. Preliminary diagram for the Morphology Module. 19 20 20 21 21 guage technology, and thus, again, beyond the scope should be defined. This instance must implement the 22 22 of the lexicog module. Finally, the community group properties corpus and rdf:value:105 23 23 epsd:kalag_strong_v a ontolex:LexicalEntry; 24felt the need to specifically address the requirements of 24 frac:frequency [ 25modern language technology by extending its expres- a frac:CorpusFrequency; 25 rdf:value "2398"^^xsd:int; 26sive power to corpus-based metrics and data structures 26 frac:corpus 27like word embeddings, collocations, similarity scores 27 ]. 28and clusters, etc. 28 29The development of the module has been use-case- 29 The usage recommendation is to define a subclass 30based, which has dictated the order and development 30 of CorpusFrequency for a specific corpus when rep- 31for various parts of the module. 31 resenting frequency information for many elements in 32So far, two papers have been published on parts 32 the same corpus. 33of the module in the course of its development, and 33 For attestations, i.e. corpus evidence in FrAC, it is 34these parts can thus considered to be relatively stable. 34 defined as “a special form of citation that provide ev- 35They include the representation of (absolute) frequen- 35 idence for the existence of a certain lexical phenom- 36cies and attestations, and, by analogy, any use case that 36 ena; they can elucidate meaning or illustrate various 37requires pointing from a into an anno- 37 38linguistic features”. As with frequency, there is a class 38 tated corpus or other forms of external empirical evi- Attestation, an instance of which should be an object 39dence [49]. 39 40of a property attestation. It should have two properties: 40 The central element which has been introduced in 41attestationGloss – text of the attestation and locus – 41 this module is frac:Observable. Since the type of ele- 42location where the attestation can be found: 42 ments for which corpus-based information can be pro- 43diamant:sense_1 a ontolex:LexicalSense; 43 vided is not limited to an entry, form, sense, or concept frac:attestation diamant:attestation_1 ; 44diamant:attestation_1 a frac:Attestation ; 44 45but can be any of these, Observable was introduced as cito:hasCitedEntity diamant:cited_document_1 ; 45 a superclass for all these classes. cito:hasCitingEntity diamant:sense_1; 46frac:locus diamant:locus_1 ; 46 47The module provides means to model only absolute frac:quotation "... dat men licht yemant de cat 47 aen het been kan werpen," . 48frequency, because “relative frequencies can be de- 48 49rived if absolute frequencies and totals are known” [49, 49 50p. 2]. To represent frequency, a property frequency 50 51with an instance of CorpusFrequency as an object 105Examples in this section are taken from [49]. 51 AF. Khan et al. / When Linguistics Meets Web Technologies. Recent advances in Modelling Linguistic Linked Open Data 19

1The FrAC module does not provide an exhaustive lishing topical thesauri, where the latter are defined 1 2vocabulary and instead promotes reuse of external vo- as lexical resources which are organised on the ba- 2 3cabularies, such as CITO [50] for a citation object and sis of meanings or topics.107 The lemon-tree model 3 4NIF or WebAnnotation (see 4.2) to define a locus. has already been used to publish the Thesaurus of 4 5Another, more recent paper focused on represent- Old English [53] and reveals the flexibility of the 5 6ing embeddings in lexical resources is [51]. It should OntoLex-Lemon/LLD approach in enabling the mod- 6 7be noted that the term embedding is used here in a elling of more specialised kind of linguistic informa- 7 8broader sense than is usual in the field of natural lan- tion. As does lemonEty [54] another ‘unofficial’ ex- 8 9guage processing, namely as a morphism Y ( f : X → tension of the OntoLex-Lemon model which has been 9 10Y 106 Embedding 10 ). Therefore, the class has subclasses proposed as a means of encoding etymological infor- 11for modelling bags of words and time series. 11 mation. Namely, both the kinds of etymological in- 12The main motivation to model embeddings as a part 12 formation contained in lexica and dictionaries as well 13of this module is to provide metadata as RDF for pre- 13 in other kinds of resources (such as articles or mono- 14computed embeddings, therefore a word vector itself 14 graphs). The lemonEty model does this by exploiting 15is stored as a string with an embedding vector: 15 the graph-based structure of RDF data and by render- 16:embedding a 16 17frac:FixedSizeVector; ing explicit the status of etymologies as prospective 17 dc:extent "300"^^xsd:int; hypotheses. 18rdf:value "0.145246 0.38873 ..."; 18 19In both of these cases, the RDF data model along 19 20As with modelling frequency, the recommendation with the various different standards and technologies 20 21is to define a subclass for the specific type of embed- which make up the Semantic Web stack as a whole 21 22ding concerned in order to make the RDF less verbose. permit us to structure such information in salient and 22 23Figure 3 presents a diagram of the latest version of ‘meaningful’ ways that help to enhance the machine 23 24the module. actionability of strongly heterogeneous linguistic data. 24 25At the times of writing, module development is fo- This is something which the author of [55] attempts to 25 26cused on collecting and modelling various use-cases. demonstrate by way of a proposal to extend OntoLex- 26 27Among the many use-cases that were proposed during Lemon with temporal information in a ontologically 27 28this phase, one stood out in particular and seemed to well motivated way (while being careful to remain 28 29be more challenging than the others: this was related to within the expressive limitations of RDF in order to ex- 29 30the modelling of sign language data. Given the nature ploit standard technologies for that framework includ- 30 31 31 of the data (video clips with signs and/or time series ing reasoning tools for OWL), and allowing the inte- 32 32 of key coordinates for preprocessed data), it was de- gration of lexical data with data relating to textual at- 33 33 cided that although the use-case was out of the scope testations and other historical information. 34of the FrAC module, it did indeed raise serious inter- 34 35est within the community, and therefore discussion on 35 36whether it will be developed as a separate module in 4.2. Annotation and Corpora 36 37the future, is now underway. The question of the scope 37 38of this new module and, more generally, its connection Summary In this section we give an overview of a 38 39to OntoLex-Lemon, is currently subject to discussion. number of LLD vocabularies for the annotation of 39 40texts. In Section 4.2.1 we give a detailed overview 40 4.1.4. Selected individual contributions 41of this topic. Then we look at the two most popular 41 ‘Unofficial’ OntoLex extensions pursued by indi- 42such vocabularies the NLP Interchange Format (Sec- 42 vidual research groups are manifold, and while these 43tion 4.2.2) and Web Annotation (Section 4.2.3). Next 43 44are not yet being pursued as candidates for future On- 44 toLex modules in the W3C Communitiy Group On- we look at two domain specific vocabularies, Ligt and 45CoNLL-RDF in Section 4.2.4. Finally in Section 4.2.5 45 46toLex, they may represent a nucleus and a cumulation 46 we look at the prospects of a convergence between the 47point for future directions. 47 vocabularies which we have discussed. 48Selected recent extensions include lemon-tree [52], 48 49an OntoLex-Lemon and SKOS based model for pub- 49 50107The lemon-tree specifications can be found here https://ssstolk. 50 51106An injective structure-preserving map. github.io/onto/lemon-tree/ 51 20 AF. Khan et al. / When Linguistics Meets Web Technologies. Recent advances in Modelling Linguistic Linked Open Data

1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 11 11 12 12 13 13 14 14 15 15 16 16 17 17 18 18 19 19 20 20 21 21 22 22 23 23 24 24 25Fig. 3. Preliminary diagram for the FrAC Module. 25 26 26 27 27 284.2.1. Introduction and Overview veloped over the years. Notable pre-RDF vocabularies 28 29Linguistic annotation by NLP tools and within include those developed by ISO TC37/SC4, in partic- 29 30corpora has long been a topic of discussion within ular, the Linguistic Annotation Framework (LAF) 30 31linked data and RDF circles, with different propos- that represents “universal” data structures shared by 31 32als grounded in traditions from natural language pro- the various, domain- and application specific ISO stan- 32 cessing [56], web technologies [57], knowledge ex- 33dards [63]. Following the earlier insight that a labelled 33 34traction [58], but also from linguistics [59], philol- 34 directed multigraph can represent any kind of linguis- 35ogy [60], and the development of corpus management 35 tic annotation LAF produces concepts and definitions 36systems [61, 62]. 36 37A practical introduction to the various different vo- for four main aspects of linguistic annotation: anchors 37 38cabularies used (by various different communities, for and regions elements in the primary data that annota- 38 39different purposes and according to different capabil- tions refer to, markables (nodes) elements that con- 39 40ities) for linguistic annotation in RDF today is given stitute and define the scope of the annotation by refer- 40 41over the course of several chapters in [10]. In brief ence to anchors and regions; values (labels) elements 41 42however the RDF vocabularies which are most widely that represent the content of a particular annotation. re- 42 43used for this purpose are the NLP Interchange For- lations (edges) links (directed relations) that hold be- 43 44mat (NIF, in language technology) and Web Annota- 44 tween two nodes and can be annotated in the same was 45tion (OA, in bioinformatics and digital humanities), as 45 as markables. 46well as customizations of these. We describe NIF in 46 Note that in relation to Web Annotation anchors 47Section 4.2.2 and Web Annotation in Section 4.2.3. In 47 48what follows we look at the relationship between RDF rougly correspond to Web Annotation selectors (or tar- 48 49and two other pre-RDF vocabularies, namely LAF and get URIs); markables roughly correspond to annota- 49 50TEI/XML. Next we look at some platform specific tion elements; values to the body objects of Web Anno- 50 51RDF vocabularies for annotations that have been de- tation. However in Web Annotation, relations as data 51 AF. Khan et al. / When Linguistics Meets Web Technologies. Recent advances in Modelling Linguistic Linked Open Data 21

1structures are not foreseen.108 As for NIF, its rela- Linked Data, even if they used and provide resolvable 1 2tion with LAF is more complex. Like Web Annota- URIs (TEI pointers).111 2 3tion, NIF does not provide a counterpart of LAF re- The annotation of rather than within TEI doc- 3 4lations, but more importantly, the roles of regions and uments, however, has been pursued by Pelagios/- 4 5markables is conflated in NIF: Every markable must Pleiades, a community interested in the annotation 5 6be a string (character span), and for every character of historical documents and maps with geographical 6 7span, there exists exactly one potential markable (URI, identifiers and other forms of geoinformation (though 7 8 8 or, a number of URIs with different schemes that are this does not yet run to linguistic annotations). One re- 9 9 owl:sameAs). sult of these efforts is the development of a specialised 10 10 At the moment, direct RDF serializations of the editor called Recogito, and its extension to TEI/XML. 11In this case the annotation is not part of the TEI doc- 11 12LAF do not seem to be widely used in an LLOD con- 12 text. The reason is certainly that the dominant RDF ument, but stored as standoff annotation in a JSON- 13LD format, and thus, is in compliance with established 13 vocabularies for annotations, despite their deficiencies, 14web standards and re-usable by external tools and ad- 14 cover the large majority of use cases. One notable 15dressable as Linked Data. However, this approach is 15 RDF serialisation of LAF however is POWLA [64], an 16restricted to cases in which the underlying TEI doc- 16 17OWL2/DL serialization of PAULA, a standoff-XML 17 ument is static and no longer changes.112 Therefore, 18format that implemented the LAF as originally de- 18 there is a need for encoding RDF triples directly in- 19scribed by [65]. POWLA complements LAF core data line in a TEI document. Happily, it has been demon- 19 20 20 structures with formal axioms and a slightly more re- strated that this can be done in a W3C- and XML- 21 21 fined data structures that support, for example, effec- compliant way by incorporating RDFa attributes into 22 22 tive navigation of tree annotations. On current applica- TEI [66, 67]. As a result and after more than a decade 23 23 tions of POWLA see the CoNLL-RDF Tree Extension of discussions, the TEI started in May 2020 to work on 24109 24 below. a customization that allowed the use of RDFa in TEI 25 25 It is also worth mentioning TEI/XML in the con- documents.113 26 26 text of this discussion. The standard, widely used in Over the years, several platforms, projects and tools 27 27 the digital humanities and in computational philology, have come up with their own approaches for modelling 28 28 only comes with partial support for RDF and does not 29annotations and corpora as linked data. Notable exam- 29 30represent a publication format for Linked Data. Tra- ples include the RDF output of machine reading and 30 31ditionally there has been an acknowledgement on the NLP systems such as FRED [68], NewsReader [69] or 31 32part of the TEI community of the value in being able the LAPPS Grid [70]. FRED provides output based 32 33to link from a digital edition (or another TEI/XML on NIF or EARMARK [71], with annotations par- 33 110 34document) to a knowledge graph. Linking between 34 35(elements of) electronic editions created with the TEI 111This may not considered to be drastic for electronic editions 35 36was addressed by means of specialised XML attributes of historical manuscripts which one could conceivably complement 36 37with narrowly defined semantics. Accordingly, elec- with information drawn from the LLOD cloud. The situation is, 37 38tronic editions in TEI/XML do not normally qualify as however, quite different for dictionaries whose content could easily 38 be made accessible and integrated with other lexical resources on 39 39 the LLOD cloud, e.g., for future linking. The situation has, however, 40begun to change over the last few years and long-standing efforts to 40 41108Although Web Annotation lacks any formal counterpart of develop technological bridges between both TEI and LOD are be- 41 42edges or relations as defined by LAF there have been attempts to de- ginning to yield concrete results. For instance, different tools for the 42 43fine a vocabulary that extends Web Annotation with LAF data cate- conversion of lexical resources in different TEI dialects to OntoLex- 43 gories [57], but this has apparently never been applied in practice. Lemon have been presented in the last years. Among others, this in- 44 44 109Others include [61] utilised an RDF graph, with a RDF vocab- cludes a converter for TEI Dict/FreeDict dialect, https://github.com/ 45ulary for nodes, labels and edges to express linguistic data structures acoli-repo/acoli-dicts/tree/master/stable/freedict [34]. For ELEXIS 45 46over a corpus backend natively based on an RDBMS; a prototypi- related developments TEI/RDF related developments, see Section 46 47cal extension of Web Annotation with an RDF interpretation of the 5.2.3. 47 112 48LAF described by [57], which and the LAPPS Interchange Format, Otherwise, the efforts for synchronization will by far outweigh 48 conceptually and historically an instance of LAF, which has see the any benefit that the use of W3C standards for encoding the annota- 49 49 discussion below on platform-specific vocabularies. tion brings 50110This is useful for instance for managing prosopographical, bib- 113For the current status of the discussion, cf. https://github.com/ 50 51liographical or geographical information TEIC/TEI/issues/311 and https://github.com/TEIC/TEI/issues/1860 51 22 AF. Khan et al. / When Linguistics Meets Web Technologies. Recent advances in Modelling Linguistic Linked Open Data

1tially grounded in DOLCE [72], but enriched with lexi- 4.2.2. NLP Interchange Format 1 2calised ad hoc properties for aspects of annotation cov- The NLP Interchange Format (NIF),117 developed at 2 3ered by these.114 The NewsReader Annotation For- AKSW Leipzig, was designed to facilitate the integra- 3 4mat (or NLP Annotation Format) NAF, is an XML- tion of NLP tools in knowledge extraction pipelines, 4 5standoff format for which a NIF-inspired RDF export as part of the building of a Semantic Web tool chain 5 6has been described [73], and LIF, the LAPPS Inter- and a technology stack for language technology on the 6 7change Format [74], a JSON-LD format used for NLP web [58]. NIF provides support for a broad range of 7 8 8 workflows by the LAPPS Grid Galaxy Workflow En- frequently occurring NLP tasks such as part of speech 9tagging, lemmatization, entity linking, coreference res- 9 gine [75]115. 10olution, , and, to a limited extent, 10 Both LIF and NAF-RDF are, however, not generic 11syntactic and semantic . In addition to provid- 11 formats for linguistic annotations but rather, provide 12ing a technology for integrating NLP tools in semantic 12 13(relatively rich) inventories of vocabulary items for web annotations, NIF also provides specifications for 13 116 14specific NLP tasks. Neither seem to have been used web services. 14 15as a format for data publication, and we are not aware A core feature of NIF is that it is grounded in a for- 15 16of their use independently from the software they have mal model of strings, and the obligatory use of String 16 17originally been created for or are being created by. URIs as fragment identifiers for anything annotatable 17 18Aside from software- or platform-specific formats, by NIF. Every element that can be annotated in NIF 18 19a number of vocabularies has been developed that ad- has to be a string.118 NIF does support different frag- 19 20dress specific problems or user communities. In Sec- ment identifier schemes, e.g., the offset-based scheme 20 21tion 4.2.4 we describe two such examples, the Ligt vo- defined by RFC 5147. [77] As a consequence, any two 21 22cabulary that addresses the gap that community stan- annotations that cover the same string are bound to the 22 23 23 dards such as NIF and Web Annotation have with re- same (or owl:sameAs) URI. While this has the advan- 24 24 spect to morphology and the annotation of morpholog- tage of being able to implicitly merge the output of 25 25 ically rich langugages, and CoNLL-RDF, a vocabulary different annotation tools, this limits the applicability 26 26 and an associated library that aims to mirror the struc- of NIF to linguistically annotated corpora. As an ex- 27ample, NIF does not allow us to distinguish multiple 27 ture and contents of popular NLP formats in RDF and 28syntactic phrases that cover the same token. Consider 28 to provide a round-tripping between these formats and 29the sentence “Stay, they said.”119 The Stanford PCFG 29 RDF graphs. 30parser120 analyzes Stay as a verb phrase contained 30 31Finally, the prospects for convergency between the in (and only constituent of) a sentence. In NIF, both 31 32solutions discussed in the whole of Section 4.2 are de- would be conflated. Likewise, zero elements in syntac- 32 33scribed in Section 4.2.5. (Note that in this section, we tic and semantic annotation cannot be expressed. An- 33 34only discuss vocabularies that define data structures other limitation of NIF is its insufficient support for 34 35for linguistic annotation by NLP tools and in anno- annotating the internal structure of words. It is thus 35 36tated corpora. Linguistic categories and grammatical largely inapplicable to the annotation of morphologi- 36 37features, as well as other information that represents cally rich languages. Overall, NIF fulfills its goals to 37 38the content of an annotation are assumed to be pro- provide RDF wrappers for off-the-shelf NLP tools, but 38 39 39 vided by a(ny) repository of linguistic data categories it is not sufficient for richer annotations as are fre- 40 40 (see above)). quently found in linguistically annotated corpora. Nev- 41 41 42 42 117 43https://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core/ 43 114For the rendering of discourse relations, for example, it pro- nif-core.html 44 44 duces properties such as fred:becauseOf (apparently extrapolated 118In particular, this includes the classes nif:Phrase and nif:Word. 45from the surface string, so, not ontologically defined). With the introduction of support for provenance annotations, NIF 2.0 45 46115A more recent development in this regard is that efforts have also introduced nif:Annotation which can be attached as a property 46 47been undertaken to establish a clear relation between LIF and pre- to a NIF string. However, it is to be noted that the linguistic data 47 48RDF formats currently used by CLARIN [76]. structures defined by NIF 2.0 are not subclasses of nif:Annotation, 48 116Historically, LIF is grounded in LAF concepts and has been but of nif:String. 49 49 developed by the same group of people, however, no attempt seems 119From Stephen Dunn (2009), ‘Dont Do That’, poem published 50to have been made to maintain the level of genericity of the LAF. in the New Yorker, June 8, 2009. 50 51Instead, application-specific aspects seem to have driven LIF design. 120http://nlp.stanford.edu:8080/parser/index.jsp 51 AF. Khan et al. / When Linguistics Meets Web Technologies. Recent advances in Modelling Linguistic Linked Open Data 23

1ertheless, NIF has been used as a publication format Source property that – optionally – provides a value 1 2for corpora with entity annotations.121 for the annotation, e.g., as a literal. The target can be 2 3NIF continues to be a popular component of the DB- a URI (IRI) or a selector, i.e., a resource that identifies 3 4pedia technology stack. At the same time, active devel- the annotated element in terms of its contextual prop- 4 5opment of NIF seems to have slowed down since the erties, formalised in RDF, e.g., its offset or character- 5 6mid-2010s, whereas limited progress on NIF standard- istics of the target format. By supporting user-defined 6 7ization has been achieved. A notable exception in this selectors and a broad pool of pre-defined selectors for 7 8regard is the development of the Internationalization several media types, Web Annotation is applicable to 8 9Tag Set [78, ITS] that aims to facilitate the integration any kind of media on the web. Targets can also be more 9 10of automated processing of human language into core compact String URIs, as introduced, for example, by 10 11Web technologies. A major contribution of ITS 2.0 has NIF. NIF data structures can thus be used to comple- 11 12been to add an RDF serialization into NIF as part of ment Web Annotation [58]. 12 13the standard. Web Annotation can be used for any labelling or 13 14More recent developments of NIF include exten- linking task, e.g., POS tagging, lemmatization, entity 14 15sions for provenance (NIF 2.1, 2016) and the develop- linking. It does, however, not support relational anno- 15 16ment of novel NIF-based infrastructures around DB- tations such as syntax and semantics, nor (like NIF) 16 17pedia and Wikidata [79]. In parallel to this, NIF has the annotation of empty elements. The addition of such 17 18been the basis for the development of more specialised elements from LAF has been suggested [57], but does 18 19vocabularies, e.g., CoNLL-RDF for linguistic annota- not seem to have been adopted, as labelling tasks dom- 19 20tions originally provided in tabular formats, see Sec- inate the current usage scenarios of Web Annotation. 20 21tion 4.2.4. Unlike NIF, Web Annotation is ideally suited for 21 22 22 4.2.3. Web Annotation the annotation of multimedia content or entities that 23 23 The Web Annotation Data Model is an RDF-based are manifested in different media simultaneously (e.g., 24 24 approach to standoff annotations (in which annotations in audio and transcript). As a result, it has become 25popular in the digital humanities, e.g., for the annota- 25 26and the material to be annotated are stored separately) 26 proposed by the Open Annotation community.122 It is a tion of geographical entities with tools such as Recog- 27ito [83], especially since support for creating standoff 27 28flexible means of representing standoff annotation for 28 any kind of document on the web. Although the most annotations for static TEI/XML documents was added 29(around March 2018 [84, p.247]). 29 30common use case of Web Annotation is the attaching 30 31of a piece of text to a single web resource, it is in- 4.2.4. Domain-specific solutions: Ligt and 31 32tended to be applicable across different media formats. CoNLL-RDF 32 33So far Web Annotation has been primarily applied to Interlinear glossed text (IGT) is a notation where ad- 33 34linguistic annotations in the biomedical domain, al- ditional annotations are placed, as the name suggests, 34 35though other notable apploications include NLP [57] between the lines of a text to be annotated. With the 35 36or Digital Humanities [82]. Web Annotation recom- purpose of helping readers to understand and inter- 36 37mends the use of JSON-LD to add a layer of standoff pret linguistic phenomena, the notation is frequently 37 38annotations to documents and other resources accessi- used in education and various language sciences such 38 39ble over the web, with primary data structures defined as language documentation, linguistic typology, and 39 40by the Web Annotation Data Model, formalised as an philological studies (for instance, it is commonly used 40 41OWL ontology. to gloss linguistic examples). Moreover, IGT data can 41 42The core data structure of the Web Annotation Data consist of different layers, including translation and 42 oa:Annota- 43Model is the annotation, i.e., instances of transliteration layers and usually contains layers for 43 tion oa:hasTarget 44that have an property that identifies ensuring morpheme-level alignment. This is not sup- 44 oa:has- 45the element that carries the annotation, and the ported by any established vocabularies for representing 45 46annotations on linguistic corpora. And although there 46 47121The most prominent example, the NIF edition of the Brown exist several specialised formats which are specifically 47 48corpus published in 2015, formerly available from http://brown. designed for the storage and exchange of IGT, these 48 nlp2rdf.org/, does not seem to be accessible anymore. Attempted to 49 49 access on Jan 23, 2021. formats are not re-used across different tools, limit- 50122The Web Annotation data model and vocabulary were pub- ing the reusability of annotated data. In order to help 50 51lished as W3C recommendations in 2017 [80, 81]. overcome this situation and improve data interoper- 51 24 AF. Khan et al. / When Linguistics Meets Web Technologies. Recent advances in Modelling Linguistic Linked Open Data

1ability, the RDF vocabulary Ligt [85] has been pro- Task of the SIGNLL Conference on Computational 1 2posed for representing IGT as linked data. Ligt is a Natural Language Learning (CoNLL-05): 2 3tool-agnostic representation model for IGT which in 3 # WORD POS PARSE 4addition to structural interoperability also enables the The DT (S (NP * 4 5use of LLD vocabularies and terminology repositories. spacecraft NN *) 5 ... 6The Ligt vocabulary was developed as a generalisa- 6 7tion over the data structures employed by established Here, the wordform is provided in the first column, 7 8tools for creating IGT annotations, most notably Tool- the second column provides part-of-speech tag. The 8 123 9box [86], FLEx [87] and Xigt [88]. Ligt is intended PARSE column contains a full parse in accordance 9 10to facilitate a pivot format that faithfully captures lin- with the Penn [91].The CoNLL-RDF library 10 11guistic information produced by these tools in a uni- reads such data as a continuous stream, every sequence 11 12form way for subsequent processing. Notably, since its of rows enclosed in empty lines will be processed as a 12 13publication Ligt has been adopted by third party users block, assigned a URI and the type nif:Sentence, every 13 14to model and annotate IGT from 280 endangered lan- row will be assigned a URI and the type nif:Word, and 14 15guages and their publication as Linked Open Data [89]. the annotation of every column will be stored as value 15 16Although Ligt was designed for very specific set of of a property in the conll: namespace that is generated 16 17domain requirements it can be considered a useful con- from the column label.126 Links between and among 17 18tribution to LLD vocabularies for textual annotation. sentences and words are encoded in accordance with 18 19This is because it provides data structures that are rel- NIF: 19 20evant for low-resource and morphologically rich lan- 20 :s1_1 a nif:Word; nif:nextWord :s1_2; 21 21 guages but which had been neglected by earlier RDF conll:WORD "The"; conll:POS "DT"; 22vocabularies for linguistic annotation on the web, in conll:PARSE "(S (NP *". 22 23particular, by NIF and Web Annotation.124 23 24Another domain specific RDF-based vocabulary Amongs other things, a CoNLL-RDF edition of the 24 127 25which aims to provide a serialisation-independent corpora is available in the 25 26way of dealing with textual annotations is CoNLL- LLOD cloud diagram. The corpora are linked with 26 27RDF [90]. This latter vocabulary is based on the so- the OLiA ontologies; further linking with additional 27 28called “CoNLL formats”, a family of a tab-separated LLOD resources, in particular, lexical resources, has 28 29values (TSV) based-formalisms used to represent lin- not been explored so far. CoNLL-RDF has also been 29 30guistically annotated natural language in fields such as applied to the linking of corpora to dictionaries [92] 30 31NLP,125 corpus linguistics, and more generally in the and knowledge graphs [93]. It has also formed the ba- 31 32language sciences. CoNLL-RDF [90] provides a data sis of work on the syntactic parsing of historical lan- 32 33model and a programming library that aim to facili- guages [94, 95], the consolidation of syntactic and 33 34tate the processing and transformation of such data re- semantic annotations [96], corpus querying [97], and 34 35gardless of the original order and number of columns, language contact studies [98] . In addition to the stor- 35 36whether the source format used fixed-size tables (as for ing of syntactic parses as plain strings a further exten- 36 37most CoNLL dialects) or variable size tables (such as sion of CoNLL-RDF adds native support for tree struc- 37 38all CoNLL formats that contain semantic role annota- ture [99], extending NIF/CoNLL-RDF data structures 38 39tions). Sentence after sentence is converted to an RDF with POWLA [100]. As a result, the phrase structure 39 40graph in accordance to the label information provided of the example above can now be decoded: 40 41by the user. The listing below provides a slightly sim- :s1_1 a nif:Word; nif:nextWord :s1_2; 41 42plified annotation from the 2005 edition of the Shared conll:WORD "The"; conll:POS "DT"; 42 powla:hasParent _:np. 43_:np a conll:PARSE; rdf:value "NP"; 43 44powla:next _:vp; 44 123One should note that these tools are currently incompatible 45with each other and information can only exchanged between them 45 46if manual corrections are applied. 126The columns HEAD (for dependency annotation) and PRED- 46 47124However it would be possible to encode Ligt information with ARGS (for semantic role annotations) are treated differently as they 47 48a generic LAF-based vocabulary such as POWLA produce object properties, i.e., links, rather than datatype proper- 48 125Indeed in NLP the CoNLL formats have become de-facto stan- ties. Similarly, the column ID receives special handling. If provided 49 49 dards for the most frequently used types of annotations having been as column label, as its value is used to overwrite the offsets that 50popularised in a long-standing series of shared tasks over the last CoNLL-RDF normally adopts for creating word (row) URIs. 50 51two decades 127https://universaldependencies.org/ 51 AF. Khan et al. / When Linguistics Meets Web Technologies. Recent advances in Modelling Linguistic Linked Open Data 25

1powla:hasParent _:s. consistent choice between, say, a oa:Annotation (from 1 2_:s a conll:PARSE; rdf:value "S". 2 ... Web Annotation), a powla:Node (from POWLA), a 3nif:String and a nif:Annotation, has to be made. 3 4The CoNLL-RDF tree extension uses a minimal frag- Generally speaking, this situation is intractable, and 4 5ment of POWLA, the properties powla:hasParent thus, the W3C Community Group Linked Data for 5 6(pointing to the parent node in a DAG) and powla:next Language Technology (LD4LT) is currently engaged 6 7(pointing to the following sibling in a tree). The in a process to develop a harmonization of these vocab- 7 8class powla:Node, implicit in the listing above, can be ularies. While this has been under development since 8 9RDFS-inferred from the use of these properties. about mid-2018, regular discussions via LD4LT be- 9 10gan in early 2020 only. Concrete results so far in- 10 114.2.5. Towards a Convergency clude a survey over requirements for any vocabulary 11 12The large number of vocabularies mentioned above for linguistic annotation on the web and on the degree 12 13already reveals something of a problem, that is, that to which NIF, Web Annotation and other vocabular- 13 14applications and data providers may choose from a ies support these at the moment.128 So far, 51 require- 14 15broad range of options, and depending on the expec- ments have been identified, clustered in 6 groups: 15 16tations and requirements of their users, they may even 16 1. LLOD compliancy (adherence to web standards, 17need to support multiple, different output formats, pro- 17 18tocols and service specifications that could potentially compatibility with community standards for lin- 18 19be mutually incompatible. So far, no clear consensus guistic annotation) 19 20has emerged, albeit NIF and Web Annotation appear to 2. expressiveness (necessary data structures to rep- 20 21enjoy relatively high popularity in their respective user resent and navigate linguistic annotations) 21 22communities. However, they are not compatible with 3. units of annotation (addressing primary data and 22 23each other and nor do they support linguistic annota- annotations attached to it) 23 24tion to the same or even a sufficient extent, thus moti- 4. sequential data structures (preserving and navi- 24 25vating the continuous development of novel, more spe- gating sequential order) 25 26cialised vocabularies. Synergies between Web Anno- 5. relations (annotated links between different units 26 27tation and NIF were explored relatively early on [58], of annotation) 27 28and Cimiano et al. [101, p.89-122] describe how they 6. support for/requirements from specific applica- 28 29can be used in combination with each other, more spe- tions and use cases (e.g., intertextual relations, 29 30cialised vocabularies such as CoNLL-RDF, and more linking with lexical resources, alignment, dialog 30 31general vocabularies such as POWLA to model data in annotation). 31 32a way that suits the following criteria: So far, this is still work in progress, but if indeed, 32 33– it is applicable to any kind of primary data, in- these challenges can be resolved at some point in the 33 34cluding non-textual data (via Web Annotation se- future, and a coherent vocabulary for linguistic anno- 34 35lectors); tations emerges, we expect a similar rise in popular- 35 36– it can also express reference to primary data in a ity for the adoption of the Linked Data paradigm for 36 37compact fashion (via NIF String URIs); encoding linguistic annotations as we have seen in the 37 38 38 – permits round-tripping between RDF graphs and last years for lexical resources. Also, this was largely 39 39 conventional formats (via CoNLL-RDF and the driven by the existence of a coherent and generic vo- 40 40 CoNLL-RDF library); cabulary, and indeed, the drift in application that the 41 41 – it supports generic linguistic data structures (via OntoLex model has recently faced very much reflects 42 42 POWLA, resp., the underlying LAF model). the need for consistent, generic data models. 43A question at this point may be what the general 43 44However, while the combination of these various com- benefit of modelling annotations as linked data may be 44 45ponents is possible and in principle operational, this in comparison to other conventional solutions, and dif- 45 46also means that a user or provider of data needs to un- ferent user communities may have different answers 46 47derstand and develop a coherent vision of at least five 47 48different data models: Web Annotation, NIF, CoNLL- 48 128The survey can be accessed via https://github.com/ld4lt/ 49 49 RDF, POWLA and the original or conventional struc- linguistic-annotation/blob/master/survey/required-features.md, 50ture of the data. Moreover, the data structures of these also compare the tabular view under https://github.com/ld4lt/ 50 51formats are parallel, in parts, and then, a principled and linguistic-annotation/blob/master/survey/required-features-tab.md. 51 26 AF. Khan et al. / When Linguistics Meets Web Technologies. Recent advances in Modelling Linguistic Linked Open Data

1to that. It does seem, though, that one potential killer Google dataset search service131. Metadata play an in- 1 2application can be seen in the capability to integrate, strumental role in the discovery, interoperability and 2 3use and re-use pieces of information from different hence (re-)use of digital objects, and indeed that they 3 4sources. In linguistic annotation, a still largely un- act as the intermediary between consumers (humans 4 5solved problem is how to efficiently process standoff and machines) and digital objects. For this reason, the 5 6annotation, and indeed, the application of RDF and/or FAIR principles [1] include specific recommendations 6 7Linked Data has long been suggested as a possible so- for metadata (see also section 1). Of particular rele- 7 8lution [56, 59, 61, 64], but only recently, have systems vance to this section is principle R1.3 which recom- 8 9that support RDF as an output format emerged [62]. mends that "(Meta)data meet domain-relevant com- 9 10While it is clear that standoff is a solution it is also munity standards". According to this principle, the 10 11true that, the different communities involved have not adoption of community standards or best practices for 11 12agreed on commonly used standards to encode and data archiving and sharing, including "documentation 12 13exchange their respective data. In DH and BioNLP, (metadata) following a common template and using 13 14Web Annotation and JSON-LD seems to dominate; in common vocabulary" facilitates the re-use of data. We 14 15knowledge extraction and language technology, NIF thus take a closer look at metadata models commonly 15 16(serialised in JSON-LD or Turtle) seem to be more used for language resources in the linguistics, digital 16 17popular; for digital humanities, the TEI is currently re- humanities and language technology communities. 17 18vising XML standoff specifications,129 and support for Although the focus of this section is on commu- 18 19RDF serializations (RDFa) or standoff (Web Annota- nity models, we cannot leave the most popular gen- 19 20 20 tion in JSON-LD) also seems to be growing, as men- eral purpose models for dataset description out of this 21 21 tioned above. overview. Language is an essential part of human cog- 22nition and expression and thus present in all types of 22 23data; research on language and language-mediated re- 23 4.3. Metadata 24search is carried out on data from all domains and hu- 24 25man activities. All of this obviously extends the search 25 Summary In the first section, Section 4.3.1, we give 26space for data to catalogues other than the purely lin- 26 an introduction and overview of metadata trends in 27guistic ones. The three models that currently dominate 27 28LLD and related areas. Next we give a detailed de- the description of datasets are DCAT132, schema.org133 28 29scription of two important metadata resources for and DataCite134. DCAT profiles are used in various 29 30LLD. These are META-SHARE, described in Sec- open data catalogues, such as the EU Open Data por- 30 31tion 4.3.2, and the OntoLex-Lemon lime module, de- tal135, while schema.org is used for the Google dataset 31 32scribed in Section 4.3.3. The latter section also fea- search engine; finally, DataCite, a leading provider of 32 33tures a discussion of future metadata challenges for persistent identifiers (namely DOIs), has developed a 33 34LLD language resources. Finally in Section 4.3.4 we schema with a small set of core properties which have 34 35address the important challenge of language identifi- been selected for the accurate and consistent identi- 35 36cation which is an essential part of the metadata of a fication of a resource for citation and retrieval pur- 36 37language resource. poses. There are various initiatives for the collection 37 38 38 4.3.1. Introduction of crosswalks of community-specific metadata models 39136 39 The rise of data-driven approaches that use Machine with these models , as well as recommendations for 40137 40 Learning, and in particular recent breakthroughs in the extensions for specific data types (e.g., CodeMeta 41and Bioschemas138 for source code software and life 41 field of Deep Learning, have secured a central place 42science resources respectively). Of course, these mod- 42 for data in all scientific and technological areas. Cross- 43 43 44disciplinary research has also boosted the sharing of 44 data arising across different communities. Thus, a huge 131https://toolbox.google.com/datasetsearch 45132 45 volume of data has become available through vari- https://www.w3.org/TR/vocab-dcat-2/ 46133https://schema.org/ 46 47ous repositories, but also via aggregating catalogues, 134https://schema.datacite.org/ 47 130 135 48such as the European Open Science Cloud and the https://data.europa.eu/euodp/en/data/ 48 136See for instance, https://rd-alliance.github.io/ 49 49 Research-Metadata-Schemas-WG/ 50129See https://github.com/TEIC/TEI/issues/1745 for pointers. 137https://codemeta.github.io/ 50 51130https://www.eosc-portal.eu 138https://bioschemas.org/ 51 AF. Khan et al. / When Linguistics Meets Web Technologies. Recent advances in Modelling Linguistic Linked Open Data 27

1els are not intended to capture all the specificities re- guage processing [108]. The first version of MS-OWL 1 2quired for the description of linguistic features and, was (semi-)automatically created from the META- 2 3thus, we do not go into further details for them in this SHARE XSD schema [108, 109] (which was designed 3 4paper. to support the META-SHARE infrastructure [110]) 4 5Among models for the description of language and discussed in the framework of the LD4LT group. 5 6resources in general (and not just LLD resources), The second version, which is described here, has 6 7the Component Metadata Infrastructure (CMDI) pro- evolved from it, taking into account advancements in 7 8files [102, 103], and the TEI guidelines (introduced the Language Technology domain and related meta- 8 9above) stand out. CMDI is a framework designed to data requirements (such as the necessity for the de- 9 10describe and re-use metadata; "profiles" can be con- scription of workflows, interoperability issues between 10 11structed on the basis of building blocks ("compo- language processing tools and processing resources, 11 12nents") that group semantically related metadata ele- etc.) as well as current trends in the overall metadata 12 13ments (e.g., address, identity, etc.) and can be used as landscape [107]. 13 14ready-made templates catering for specific use cases MS-OWL has been constructed according to three 14 15(e.g., for lexica, for linguistic corpora, for audio cor- key concepts: resource type, media type and distribu- 15 16pora, etc.). CMDI profiles are used by various hu- tion. These give rise to the following basic classes: 16 17manities and social sciences communities within the 17 LanguageResource 18CLARIN139 research infrastructure. The TEI standard – , with four subclasses derived 18 19specifies an encoding scheme for the representation of from the notion of resource type: 19 20texts in digital form, chiefly in the humanities, social * Corpus: for structured collections of pieces of 20 21sciences and linguistics; it includes specific elements language data, typically of considerable size 21 22for the description of texts at the collection and indi- and which have been selected according to cri- 22 23vidual text levels. Both models, however, are XSD- teria external to the data (e.g., size, language, 23 140 24based , and therefore not discussed further in this domain, etc.) with the aim of representing as 24 25section. comprehensively as possible a specific object 25 26We should also mention the CLARIN Concept Reg- of study; 26 27141 27 istry (CCR) , which is a collection of linguistic con- * LexicalConceptualResource: covering resources 28cepts [105, 106]. It is the successor to the ISOcat data such as term glossaries, word lists, semantic 28 29category registry (described in Section 3.5) and is cur- lexica, ontologies, etc., organised on the ba- 29 30rently maintained by CLARIN. The CCR is imple- sis of lexical or conceptual units (lexical items, 30 31mented in SKOS and includes a concept scheme for terms, concepts, phrases, etc.) along with sup- 31 32metadata, but is a structured list without ontological plementary information (e.g., grammatical, se- 32 33relations, either internally or externally to other vocab- mantic, statistical information, etc.); 33 34ularies. It mainly serves as the semantic interoperabil- 34 * LanguageDescription: for resources which are 35ity layer of CLARIN; this is achieved through linking intended to model a language or some aspect(s) 35 36metadata fields included in CMDI profiles to concepts of a language via a systematic documentation 36 37from the CCR. of linguistic structures; members of this class 37 38 38 4.3.2. Language Resource Metadata: The are typically statistical and machine learning- 39 39 META-SHARE ontology computed language models and computational 40 40 142 grammars; 41The META-SHARE (or MS-OWL in short) model [107], 41 ToolService: for any type of software that per- 42implemented as an OWL ontology, has been designed * 42 forms language processing and/or related op- 43specifically for language resources, including data re- 43 erations (e.g., annotation, , 44sources (structured or unstructured datasets, lexica, 44 , speech-to-text synthesis, 45language models, etc.) and technologies used for lan- 45 visualization of annotated datasets, training of 46 46 corpora, etc.); 47139https://www.clarin.eu 47 140 48The conversion of CMDI metadata records offered in CLARIN – MediaPart: this is a parent class for a number of 48 into RDF [104] should not be confused with the construction of an 49other subclasses, combining together the notions 49 RDF model for CMDI profiles 50141https://concepts.clarin.eu/ccr/browser/ of resource and media type; it is not meant to 50 51142http://w3id.org/meta-share/meta-share be used directly in the description of language 51 28 AF. Khan et al. / When Linguistics Meets Web Technologies. Recent advances in Modelling Linguistic Linked Open Data

1resources. The media type refers to the form/- tations), and information on corpus contents. It should 1 2physical medium of a data resource (i.e., mem- be noted that the language of the resource’s contents, 2 3ber of one of the first three subclasses above) a piece of metadata of particular relevance to all lan- 3 4and it can take the values text, audio, image, or guage resources, is encoded in the media part sub- 4 5video. To cater for multimedia/multimodal lan- classes rather than the top LanguageResouce class; 5 6guage resources (e.g. a corpus of videos and their this is in line with the principles adopted for the rep- 6 7subtitles, or corpus of audio recordings and their resentation of multimedia/multimodal resources con- 7 8transcripts), language resources are represented sisting of parts with different languages (e.g. a cor- 8 9as consisting of at least one media part: the me- pus of video recordings in one language, its subti- 9 10diaPart property is used to link an instance of tles in the same language and their translations in an- 10 11the class Corpus to instances of CorpusTextPart, other language). Finally, the two distribution classes 11 12CorpusAudioPart, and so on; similarly, Lexical- (DatasetDistribution and SoftwareDistribution) provide 12 13ConceptualResource is linked to LCRTextPart, information on how to access the resource (i.e., how 13 14LCRVideoPart, etc. and where it can be accessed), technical features of 14 15– DatasetDistribution and SoftwareDistribution: these the physical files (such as size, format, character en- 15 16are conceived as subclasses of dcat:Distribution, coding) and licensing terms and conditions. A ded- 16 17which represents the accessible form(s) of a re- icated module has been devised for the structured 17 18source. For instance, software resources may be representation of licenses commonly used for lan- 18 19distributed as web services, executable files or guage resources, reusing existing vocabularies and ex- 19 144 20source code files, while data resources as PDF, tending the Open Digital Rights Language core 20 21CSV or plain text files or through a user interface. model [111]. 21 22To better illustrate the structure of the MS-OWL, 22 23MS-OWL caters for the description of the full life- figure 4 depicts a subset of the mandatory and recom- 23 24cycle of language resources, from conception and mended properties for the description of a corpus. 24 25creation to integration in applications and usage in One of the additions made between the two ver- 25 26projects as well as recording relations with other re- sions of the MS ontology is the development of another 26 27sources (e.g., raw and annotated versions of corpora, vocabulary, again implemented as an OWL ontology, 27 28tools used for their processing, models integrated in OMTD-SHARE145 [112]. OMTD-SHARE can be con- 28 143 29tools, etc.) and related/satellite entities . sidered as complementary to MS-OWL. It covers func- 29 30The properties recommended for the description of tions (tasks performed by software components), an- 30 31language resources are assigned to the most relevant notation types (types of information extracted or an- 31 32class. Thus, the LanguageResource class groups prop- notated by such software), methods (classification of 32 33erties common to all resource/media types, such as the theoretical method used in the algorithm), and data 33 34those used for identification purposes (title, descrip- formats of the resources that can be processed by such 34 35tion, etc.), recording provenance (creation, publica- software. The ontology was begun within the frame- 35 36tion dates, creators, providers, etc.), contact points, etc. work of the OpenMinTeD project 146, which focused 36 37More technical features and classification elements, on Text and Data Mining resources, and has been en- 37 38that depend on resource/media types, as well as in- riched afterwards. The class Operation has been ex- 38 39stances of MediaPart and Distribution are attached to tended to cover Language Technology (LT) operations 39 40the respective LanguageResource subclasses. Thus, at large (now also referred to as "LT taxonomy"). Spe- 40 41properties for LexicalConceptualResource encode the cific properties of MS-OWL make reference to the 41 42subtype (e.g. computational lexicon, ontology, dictio- OMTD-SHARE classes. Operation is used for describ- 42 43nary, etc.), and the contents of the resource (unit of ing the function of tools/services as well as for appli- 43 44description, types of accompanying linguistic and ex- cations for which a data resource can be used or has 44 45tralinguistic information, etc.); properties for Corpus already been used. The annotationType for annotated 45 46include corpus subclass (raw, annotated corpus, anno- corpora takes values from the AnnotationType class; 46 47linguistic annotation types are linked to the OLIA on- 47 48 48 143The current work discusses only the core part of MS-OWL tar- 49 49 geting the description of language resources and leaves aside the 144https://www.w3.org/ns/odrl/2/ 50representation of satellite entities (persons, organizations, projects, 145http://w3id.org/meta-share/omtd-share/ 50 51etc.) 146https://www.openminted.eu 51 AF. Khan et al. / When Linguistics Meets Web Technologies. Recent advances in Modelling Linguistic Linked Open Data 29

1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 11 11 12 12 13 13 14 14 15 15 16 16 17 17 18 18 19 19 20 20 21 21 22 22 23 23 24 24 25 25 26 26 27 27 28 28 29 29 30 30 31 31 32 32 33 33 34Fig. 4. Simplified subset of the MS-OWL for corpora. 34 35 35 36 36 37tology (work in progress), while domain-specific an- catalogues [113, 114], while the second version, the 37 38notation types for neighbouring domains are also fore- one described here, is used in the European Language 38 39seen (e.g., for elements in the document structure of 39 Grid148, which is a platform for language resources 40publications, biomedical entities, etc.). 40 41Both the MS-OWL and OMTD-SHARE ontolo- with a focus on industry-relevant Language Technol- 41 42gies have been published and are currently under- ogy in Europe [115]. Among the immediate plans, 42 43going evaluation and improvements. They are de- crosswalks with DCAT and schema.org are a prior- 43 44 44 ployed in the description of language resources in ity, to ensure wider uptake and interoperability with 45catalogues of language resources. More specifically, 45 (meta)data from other communities. 46the first version of MS-OWL is used in LingHub147, 46 47a data portal aggregating metadata records for lan- 47 48guage resources hosted in various repositories and 48 49 49 50 50 51147http://linghub.org/ 148https://live.european-language-grid.eu/ 51 30 AF. Khan et al. / When Linguistics Meets Web Technologies. Recent advances in Modelling Linguistic Linked Open Data

14.3.3. Linguistic Metadata for Lexical Resources: taken from the W3C guidelines152. This can be seen in 1 2lime diagrammatic form in Figure 6. 2 3Another metadata model that has relevance here is Figure 6 corresponds to the following listing (again 3 4OntoLex-Lemon’s149 own dedicated metadata module based on that give in the W3C guidelines153). 4 5that, in keeping with the overall citric theme, is called :lexicon a lime:Lexicon; 5 6lime or the LInguistic MEtadata module [116]. lime:language "en"; 6 lime:entry :lex_high; 7The module is illustrated in Figure 5. Before we go lime:entry :lex_cat; 7 8onto describe the module in more detail, it is impor- lime:entry :lex_marry; 8 lime:entry :lex_intangible_assets. 9tant to point out that lime focuses on providing meta- ]. 9 10data descriptions at the "level of lexicon-ontology in- 10 11terface"150. That is, it concentrates on how ontological Here we can see how the lime properties and class 11 12concepts in a so-called reference dataset (e.g. defined introduced above can help us to describe some of 12 13as an ontology that describes "the semantics of the the most fundamental lexicon-specific metadata cat- 13 14domain") are lexicalised or given a linguistic ground- egories of a lexical resource. We can of course use 14 15ing in a lexicon (viewed as a collection of lexical en- Dublin Core properties such as description and cre- 15 16tries). The OntoLex-Lemon guidelines further refer to ator to further flesh out this metadata description. 16 17a concept set which is a set of individuals of class The lime:LexicalizationSet class (again a subclass of 17 18Lexical Concept (described as potentially "bearing a void:Dataset) represents a collection of lexicalizations, 18 19conceptual backbone to a lexicon"). The aim of the- defined as pairs consisting of a lexical entry and an 19 20lime module then is to provide quantitative and qual- associated entry in the reference dataset (this could 20 21itative (metadata) information about the relations be- be an OWL ontology but as the guidelines specify 21 22tween these (the links between them). In other words might be any "RDF dataset which contains references 22 23many, though not all as we will see below, of its classes to objects of a domain of discourse"). The metadata 23 24and properties will not apply in cases where OntoLex- properties on the lime:LexicalizationSet allow us to 24 25Lemon is only used to encode a lexicon, and where describe, among other things154 how many entities 25 26entries and their senses aren’t linked to either Lexical have been lexicalised (by at least one entry), how 26 27Concept individuals or to ontology entities (such as many pairs of entries and ontology elements there 27 28is true of an increasing number of lexicon-centric use are, as well as how many ontology elements have 28 29cases). been lexicalised on average. lime also defines the 29 30More generally useful classes and properties, how- class Lexical Linkset (lime:LexicalLinkSet, subclass of 30 31ever, include the lime:Lexicon class. This is defined as void:Dataset), individuals of which are links between 31 151 32a subclass of void:Dataset ), and represents a set of a set of lexical concepts (i.e., members of the class on- 32 33individuals of the class Lexical Entry which are linked tolex:LexicalConcept) and the reference dataset (ontol- 33 34to lime:Lexicon via the property lime:entry. The whole ogy). For this class, lime defines properties describ- 34 35lexicon, as well as individual entries, can be assigned ing, for example, the number of links between the 35 36to a certain language, as specified by the datatype prop- two resources in question. Last, the Conceptualiza- 36 37 37 erty lime:language(the guidelines also recommend the tion Set (lime:ConceptualizationSet) is analogous to 38 38 use of the Dublin Core property and the use of ei- the lime:LexicalizationSet but caters for the links be- 39ther lexVo or Library of Congress language tags, see tween the the lexicon and the concept set. 39 40Section 4.3.4 for an extended discussion of language 40 41tags). The property lime:linguisticCatalog) specifies the Future Metadata Challenges for Lexical Resources 41 42linguistic model, i.e. the catalogue of linguistic cate- The lime classes which we have just looked at, along 42 43gories used for the annotation of the lexical entries. with several others, enable users to give a detailed 43 44This could be, for instance LexInfo. metadata description of the lexicalisation relationship 44 45In order to show the use of these more general lime of a lexicon with a semantic resource. The META- 45 46classes in practise we will look at a simple example SHARE ontology offers a wide number of classes and 46 47properties for encoding common metadata properties 47 48 48 149This section assumes some familiarity with OntoLex-Lemon; 49 49 an introduction to the model is given in Appendix A 152https://www.w3.org/2016/05/ontolex/#metadata-lime 50150See https://www.w3.org/2016/05/ontolex/#metadata-lime 153https://www.w3.org/2016/05/ontolex/#metadata-lime 50 51151See https://www.w3.org/TR/void/ 154see https://www.w3.org/2016/05/ontolex/ for a full description. 51 AF. Khan et al. / When Linguistics Meets Web Technologies. Recent advances in Modelling Linguistic Linked Open Data 31

1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 11 11 12 12 13 13 14 14 15 15 16 16 17 17 18Fig. 5. The lime module. 18 19 19 20 20 21 21 22 22 23 23 24 24 25 25 26 26 27 27 28 28 29Fig. 6. lime Example (diagram taken from OntoLex-Lemon guidelines). 29 30 30 31 31 of all kinds of language resources. There are several There already exist vocabularies for doing a lot of 32 32 other linked data vocabularies which are relevant for these things (vocabularies not specialised to the lan- 33 33 34the description of the metadata of language resources. guage resources domain), however. These include the 34 35Aside from the well known Dublin Core vocabulary, Semantic Publishing and Referencing suite of ontolo- 35 155 36as well as the PROV ontology for provenance infor- gies for bibliographic information , which can be 36 37mation, there are more specialised vocabularies which used in conjunction with META-SHARE and lime and 37 38can be applied in specific use cases. others in creating metadata solutions, and potentially 38 39For instance in the case of the the publication of application profiles, for the cases which we have men- 39 40retrodigitised dictionaries and the modelling of histor- tioned. They also include the CIDOC-CRM family of 40 41ical and scholarly lexical resources as LLD, there is a ontologies. It does however require that whatever solu- 41 42need for extensive metadata provision at both the lex- tions are proposed are then made accessible to a wider 42 43 43 ical and at the individual entry level in order to reflect community of users via the use of guidelines contain- 44 44 bibliographic and historic informatio and the represen- ing typical examples as well as and/or design patterns 45 45 46tation of scientific-scholarly hypotheses. In the case of (for a discussion of the potential use of ontology de- 46 47retrodigitised LLD dictionaries, metadata for the re- sign patterns in the context of OntoLex-Lemon see 47 48sources in question should contain information on who Section 6.1.2). 48 49compiled the original ‘paper’ version of the dictionary 49 50and when, what edition of the dictionary the resource 50 51is based on, who digitised it, what tools were used, etc. 155http://www.sparontologies.net/ 51 32 AF. Khan et al. / When Linguistics Meets Web Technologies. Recent advances in Modelling Linguistic Linked Open Data

14.3.4. Language Identification or RDF data,159 provides an extensive set of two let- 1 2The reliable identification of languages and lan- ter codes for major languages that date back to the be- 2 3guage varieties is of the utmost importance for lan- ginning of the modern-age of computing, but long be- 3 4guage resources. For applications in linguistics and fore the emergence of the internet. ISO 639-1 codes are 4 5lexicography it defines the very scope of investigation composed of two lower-case letters with values from 5 6and the data provided by a language resource; for ap- a to z each. In theory, such a system is sufficient to 6 7plications in language technology and knowledge ex- identify up to 676 languages. 7 8traction, language identifiers define the suitability of Yet, with language technology developing into a 8 9training data or the applicability for a particular tool truly global phenomenon, it became clear that two- 9 10for the data at hand. letter codes were not sufficient to reflect the linguistic 10 11There are two different ways of encoding language diversity of the world both past and present – and in 11 12identification information currently being used in RDF. the present case this diversity is estimated to comprise 12 13The first is a URI-based mechanism that uses terminol- more than 6,000 language varieties. As a response to 13 14ogy repositories, the other is by attaching a language this, ISO 639-2 provides a set of three-letter codes for 14 15tag to a literal to indicate its language. In the latter (theoretically) up to 17,576 languages. Again, the Li- 15 16case the language tag is treated similarly to a data type. brary of Congress acts as maintainer and provides the 16 160 17Language information provided in this way does not data in human-readable form and as RDF . However, 17 18entail an additional RDF statement over the literal, al- it has to be recognised that the primary use case of 18 19 19 lowing a compact, readable and efficient identification ISO 639-2 was library-based and focused on languages 20 20 with minimal overhead on data modelling. Note that with an extensive literature, whereas the demands of 21 21 the RDF specifications [117] already include provision linguistics and lexicography, especially historical lin- 22 22 for the use of language identification via the attach- guistics and language documentation, exceed far be- 23 23 ment of language tags to strings. yond this. Indeed they comprise languages that are pri- 24 24 In the former case, there exist a number of RDF vo- marily spoken, not written, but for which field record- 25ings, text books, grammars or word lists must never- 25 cabularies which provide the means to mark the lan- 26theless be identifiable in order to be retrieved from 26 guage of a resource explicitly using RDF triples, i.e., 27metadata portals such as, as an example, the Open Lan- 27 using properties such as dc:language (for language 28guage Archives Community (OLAC)161. 28 URIs or string representations) or lime:language (for 29For applications in linguistics, SIL International acts 29 string representations). We elaborate on the differences 30as maintainer of ISO 639-3, another, and more ex- 30 in practice below. 31tensive set of three-letter codes. In distinction to ISO 31 RDF language codes are defined by BCP47156 and 32639-1 and 2 codes, which are meant to be stable and 32 the IANA157 registry on the basis of the ISO 639 stan- 33develop at a slow pace, if at all,162 ISO 639-3 codes 33 dard for language codes158. 34are actively maintained by the research community 34 35For application to RDF data, ISO provides three rel- and a continuous process of monitoring, approving 35 36evant subsets of language tags: ISO 639-1, maintained (or rejecting) updates, additions and deprecation re- 36 37by the Library of Congress and available as plain text quests is in place. At the moment, ISO 639-3 codes 37 38are published by means of human-readable code ta- 38 39163 39 156https://tools.ietf.org/rfc/bcp/bcp47.txt bles only, along with their history and associated 40157https://www.iana.org/ documentation, but not in any machine-readable form. 40 41158The need for the provision of machine readable identifiers for Within the LLOD community, it is a common prac- 41 42single languages or language varieties is clear from instances where 42 43a language has more than one name. For instance, the Manding 43 language Bamanakan (bm) which is also known as Bambara. It is 159https://id.loc.gov/vocabulary/iso639-1.html 44 44 also essential for dealing with cases where the same language name 160https://id.loc.gov/vocabulary/iso639-2 45is used to refer to what are quite different varieties. Take, for in- 161http://www.language-archives.org/ 45 46stance, the case of Saxon which as well as being an English heavy 162Changes in ISO 639-1 and 639-2 codes are very rare and oc- 46 47metal band has also been used to designate both Old English (Anglo- cur mostly as a result of political changes, e.g., after the split of Yu- 47 48Saxon, ISO 639-3 ang) as well as a number of varieties of Low goslavia, Serbian (sr, srp) and Croatian (hr, hrv) were to be con- 48 German, both historical and modern (Old Saxon, osx; Low Saxon, sidered independent languages (with two tags) whereas they were 49 49 nds), along with various different dialects of High German (Up- previously considered dialects of a single language, Serbo-Croatian 50per Saxon, sxu; Transylvanian Saxon [currently no ISO language (language tag sh, deprecated in 2000). 50 51code]). 163https://iso639-3.sil.org/ 51 AF. Khan et al. / When Linguistics Meets Web Technologies. Recent advances in Modelling Linguistic Linked Open Data 33

1tice to apply the ISO 639-3 codes provided as part of very notion of language tags has been criticised as be- 1 2LexVo[118] whenever language URIs are required and ing both too inflexible as well as unable to address the 2 3ISO 639-3 codes are sufficient. However, it is to be needs of linguistics, e.g., recently by [120, 121], and 3 4noted that, unlike SIL code tables, LexVo identifiers alternatives are being explored [122]. 4 5are not authoritative and may not be up to date with the URI-based language identification represents a nat- 5 6latest version of SIL. ural alternative in such cases, as these are not tied to 6 7But ISO 639-3 only represents the basis for lan- any single standardization body or maintainer, but al- 7 8guage tags as specified by BCP47 [119, Best Common low the marking of both the respective organization or 8 9Practices 47, also referred to as IETF language tags or maintainer of the resource (as part of the namespace) 9 10RFC 4646] that are incorporated into the RDF specifi- and the individual language (in the local name). As a 10 11cation: consequence they would naturally support to shift from 11 12BCP47 defines how ISO 639 language tags can be one provider to another, if this were required for a par- 12 13extended with information regarding geographical use, ticular task. 13 14script, among other variables as follows: Finally, another provider of language identifiers 14 15language(-script)(-region)(-variant)* which is relevant for the current discussion is Glot- 15 16(-extension)*(-x-privateuse) tolog [123],167. This is a repository of identifiers for 16 17language varieties with a specific focus on (although 17 where: 18by no means restricted to) low-resource languages and 18 19– language: this is an ISO 639-1 tag if this is avail- with an eye to applications in linguistic typology and 19 20able or an ISO 639-3 tag otherwise; language documentation. Glottolog maintains an inde- 20 21– Script (optional): an ISO 15924 4-letter code, for pendent set of language variety identifiers accessible 21 22instance the code for Latin is Latn; in human- and machine-readable (RDF) form via re- 22 23– region (optional): this is an ISO 3166 2-letter re- solvable URIs, along with additional metadata, an as- 23 24gion code or a UN M.49 3-number) code, for in- sociated bibliography and a view on the phylogenetic 24 25stance either US or 840 for the United States of structure of specific varieties168. In order to avoid any 25 26America of the unintended political connotations that inevitably 26 27– variant: zero or more registered variants taken arise from the use of the term ‘language’169, Glottolog 27 28from the current list of registered variants pro- uses the more neutral (though rather uglier) term lan- 28 29vided by IANA164. guoid, where this latter is defined as a language variety 29 30– extension: aero or more extensions in one or about which, or in which, there is exists some kind of 30 31more custom schemes literature. 31 32– private use (optional): for use for internal notes A Glottolog ID for a languoid, then, consists of a 32 33about about identification in a single application. 4-letter alphabetic code followed by a 4-character nu- 33 34 34 The W3C provides means for validating BCP 47 merical code; for instance the Glottolog ID for stan- 35 35 language tags, part of the specification is also that lan- dard English is stan1293. These are the basis 36 36 guage tags should be registered at the Internet As- of resolvable URIs, for instance http://glottolog.org/ 37 37 signed Numbers Authority. The IANA language sub- resource/languoid/id/stan1293, which once resolved 38 38 tag registry165 currently provides registered language provide links to other relevant resources such as ISO 39 39 tags in XML, HTML and plain text. As of 2020, dis- 639. Note that Glottolog maintains a certain bias to- 40 40 cussions around the provision of a machine-readable wards endangered modern languages and therefore re- 41 41 view in RDF and by means of resolvable URIs have mains rather sketchy for what concerns the histori- 42 42 undergoing and are expected to bear fruit in the coming 43 43 years. We expect that, by then, the IANA registry will 167https://glottolog.org/ 44 44 supersede LexVo as a default provider of ISO 639(-3) 168That is, Glottolog allows for the specification of the phy- 45 45 language URIs166. However, it should be noted that the logenetic relationships between different varieties, specifying En- 46glish, for instance, as a subconcept of the category ‘Macro-English’ 46 47(macr1271), which groups together Modern Standard English and 47 164 48https://www.iana.org/assignments/language-subtag-registry/ a number of English Pidgins: and relating it in its turn to narrower 48 language-subtag-registry (accessed 10-07-2019) subconcepts such as Indian English (indi1255) and New Zealand 49 49 165https://www.iana.org/assignments/lang-subtags-templates/ English (newz1240). 50lang-subtags-templates.xhtml 169Recall Max Weinreich’s famous observation that “a language 50 51166Cf. https://github.com/w3c/i18n-discuss/issues/13 . is a dialect with an army and a navy”. 51 34 AF. Khan et al. / When Linguistics Meets Web Technologies. Recent advances in Modelling Linguistic Linked Open Data

1cal dimension. Yet the popularity of Glottolog and well as through a study of the literature and by request- 1 2fact that it has already had a wide uptake beyond ing input from other participants of the NexusLin- 2 3the language documentation community, including in guarum COST action172. Our project survey also in- 3 4Wikipedia, would suggest that provision of identifiers cluded an analysis of influential survey publications 4 5for historical varieties is only a matter of time (pardon and anthologies in this area of linguistic linked data 5 6the pun). (such as [124, 125]) and an analysis of the programs 6 7of the major conferences in the sector of language 7 8resources173. This may of course have, inadvertently, 8 5. Projects 9led us towards a natural selection bias in the project 9 10 10 Summary In this section we will give an overview of overview, namely, towards projects that tended to pub- 11 11 a range of different projects that have had an impact lish their results at these venues. In addition it should 12 12 or which are currently having an impact on the use or also be noted that since our most important sources of 13 13 definition of LLD vocabularies. Table 3 gives a sum- project information were the CORDIS and OpenAIRE 14 14 mary of the projects discussed below. In Section 5.1 we project platforms, both of which have a severely lim- 15 15 give a detailed overview of this topic; this overview in- ited coverage of national and non-European projects, 16 16 cludes a subsection on recent projects which combine we were at a disadvantage with respect to information 17 17 LLD and DH (Section 5.1.1) and and introduction and with regards to the latter. We were however able to par- 18 18 description of an LLD project matrix given as Figure tially compensate for this by information retrieved via 19 19 7 (Section 5.1.2). Next we describe a series of selected the active consultation of our respective networks. 20 20 projects in detail. These are (in order of appearance): Based on this exploratory work we were able to 21 21 make a number of observations. Probably the most 22– LiODi Section 5.2.1) 22 important of these is that the effort towards the def- 23– POSTDATA (Section 5.2.2) 23 inition of common models for linguistic linked data 24– ELEXIS (Section 5.2.3) 24 has never been dependent on any single, large-scale 25– Prêt-à-LLOD (Section 5.2.5) 25 project, but was conducted within the confines of a 26– NexusLinguaram (Section 5.2.6) 26 27much broader community, one which overlapped with 27 a number of funded projects, often carried out in par- 285.1. An Overview 28 29allel. Over and above this, the community was also 29 30As we mentioned in the introduction, the funding of maintained by other kinds of networks and initiatives. 30 31an ever increasing number of projects in which LLD What also came through quite strongly, however, both 31 32plays a key role at the international, transnational (in- from the research carried out as part of the survey and 32 33cluding European), national, and regional levels, is ev- the personal experience of the individual authors of 33 34idence of the success of LLD as a means of publish- the current work, is that international (and especially 34 35ing language resources. These projects also provide European level) projects played a crucial role in sup- 35 36us with an important picture of the use which is be- porting and sustaining LLD models and vocabular- 36 37ing made of LLD models and vocabularies across dif- ies, once they had been proposed. 37 38ferent disciplines and use cases as well as indicating 38 39 39 where future challenges may lie. Therefore as part of a 172As part of the preparation for the survey, we set up a Wikipedia 40task currently being undertake in the NexusLinguarum page on OntoLex,(https://en.wikipedia.org/wiki/OntoLex) and ex- 40 41COST action (see Section 5.2.6) several of the authors tended another Wikipedia page on Linguistic Linked Open 41 42of the current article decided to carry out a survey of Data (https://en.wikipedia.org/wiki/Linguistic_Linked_Open_Data. 42 43research projects in which a significant part of each We also encouraged partners from our respective networks to con- 43 tribute and extend those pages, especially with respect to applica- 44project was dedicated to making language resources 44 tions of OntoLex and LLOD in general. Information retrieved as part 45available using linked data or which had LLD as one of this process was used to complement the survey described above. 45 46of its main themes. 173In particular, the Language Resource and Evaluation Confer- 46 47The survey has so far been carried out via queries on ence (LREC) series and associated workshops as well as domain- 47 48CORDIS170 and the OpenAIRE explorer site171, as specific events (workshops on Linked Data in Linguistics (LDL), 48 conferences on Language, Data and Knowledge (LDK), lexico- 49 49 graphic events such as EURALEX, ASIALEX, and GLOBALEX as 50170https://cordis.europa.eu/projects well as the eLex series of electronic lexicography conferences, and 50 51171https://explore.openaire.eu/ associated workshops. 51 AF. Khan et al. / When Linguistics Meets Web Technologies. Recent advances in Modelling Linguistic Linked Open Data 35

1Summary 1 2Project Name Duration Type Coverage in 2 3Current Article 3 4EAGLES 1993-1995 European Project (FP3) Section 5.1 4 5ISLE 2000-2002 European Project (FP5) Section 5.1 5 6E-MELD 2007-2012 American National Project (NSF) Section 5.1.2 6 7MONNET 2010-2013 European Project (FP7) Section 5.1 7 8SemaGrow 2012-2015 European Project (FP7) Section 5.1 8 9CLLD 2013-2016 German Project (Max Planck) Section 4.2 9 10LIDER 2013-2015 European Project (FP7) Section 5.1 10 11QTLeap 2013-2016 European Project (H2020) Section 5.1 11 12TDWM 2014-2029 German Regional Project Section 5.1.1 12 13FREME 2015-2017 European Project (H2020) Section 5.1 13 14LiODi 2015-2022 German Project Section 5.2.1 14 15Lynx 2017-2021 European Project (H2020) Section 5.1 15 16DiTMAO 2016-2019 German-Italian (funded by Deutsche Section 5.1.1 16 17Forschungsgemeinschaft (DFG)) 17 18POSTDATA 2016-2022 European Project (H2020-ERC) Section 5.2.2 18 19MTAAC 2017-2020 International (funding from DFG, SSHRC Section 5.1.1 19 20and NEH) 20 21Nénufar 2017- French Project (mixed funds) Section 5.1 21 22ELEXIS 2018-2022 European Project (H2020-ERC) Section 5.2.3 22 23LiLa 2018-2023 European Project (H2020-ERC) Section 5.2.4 23 24Prêt-à-LLOD 2019-2022 European Project (H2020-ERC) Section 5.2.5 24 25NexusLinguaram 2019-2023 EU Cost Action Section 5.2.6 25 26ItAnt 2020-2023 Italian National Project (PRIN) Section 5.1.1 26 27MORdigital 2021-2024 Portuguese National Project Section 5.1.1 27 28Table 3 28 29Projects Discussed in the Current Article 29 30 30 31This can be demonstrated by the development his- 2000-2002)176. LMF was subsequently further devel- 31 32tory of OntoLex-Lemon, probably the most popular oped within ISO TC37. And it continues to be so: the 32 33 33 and well known of the LLD specific models featured latest version of LMF is a multi-part standard of which 34 34 in this article. The origins of this model ultimately go the first four parts have been all been published at the 35 35 back to the Lexical Markup Framework (LMF) [3], a time of writing [126]. 36 36 conceptual Uniform Markup Language (UML)-based The project Multilingual Ontologies for Networked 37 37 174 Knowledge (MONNET, 2010-2013)177 subsequently 38model for representing NLP-lexicons and machine- 38 developed the original lemon model on the basis of 39readable dictionaries which was develeoped over the 39 LMF. In 2011, MONNET project members initiated 40course of a number of projects that covered lexical re- 40 the formation of W3C Community Group Ontology- 41sources in NLP and related use cases, most notably 41 42Expert Advisory Group on Language Engineering Lexica. OntoLex-Lemon was subsequently developed 42 175 43Standards (EAGLES, 1993-1995) , and Interna- within the ambit of this community group as a revision 43 44tional Standards for Language Engineering (ISLE, of the original lemon model for the specific application 44 45use case of ontology lexicalization. OntoLex-Lemon 45 46was further developed in the subsequent LIDER 46 178 47174LMF also had an official XML serialization was included as project (2013-2015) . LIDER contributed to the 47 48part of the standard. Attempts towards a RDF/OWL serialization 48 were made by Gil Francopoulo and can be found linked under 49 49 http://www.lexicalmarkupframework.org/, but have not been other- 176http://www.ilc.cnr.it/EAGLES96/isle/ISLE_Home_Page.htm 50wise published. 177https://cordis.europa.eu/project/id/248458 50 51175http://www.ilc.cnr.it/EAGLES/home.html 178http://lider-project.eu/lider-project.eu/index.html 51 36 AF. Khan et al. / When Linguistics Meets Web Technologies. Recent advances in Modelling Linguistic Linked Open Data

1formation of numerous W3C community groups as ans and archaeologists, as well as language learners. In 1 2a means of proiving a long-term perspective for the case of the classics (the case of LiLa in particular 2 3its activities. As far as lexical resources are con- Section 5.2.4), there is also a reliance on a strong and 3 4cerned, this included the a W3C Community Group on lengthy tradition of previous scholarship. 4 5Best Practices for Multilingual Linked Open Data However by making it easy to, for instance, empha- 5 6(BP-MLOD) which, among other contributions, de- sise different kinds of connections both within and be- 6 7veloped guidelines for the application of OntoLex- tween different past civilizations, their languages and 7 8Lemon for modelling lexical resources (dictionar- cultures LLD offers a powerful and effective solu- 8 9ies and terminologies) independently from ontolo- tion to the challenges of modelling heterogeneous hu- 9 10gies. This represents the basis for most modern uses manities data and making it findable and interopera- 10 11of OntoLex-Lemon, and its development towards a ble. In particular LLD is well placed to facilitate the 11 12general-purpose community standard for publishing integration of historical and geographical with lexi- 12 13lexical resources on the Semantic Web. cographic and linguistic information – using the ap- 13 14Monnet and LIDER were seminal in their impact propriately defined classes and properties that is. The 14 15on the development of LLD models and vocabular- use of linked data made in DH projects such as Pela- 15 16ies. Other important (European) projects in this re- gios [82], Mapping Manuscript Migrations [128] and 16 17gard include the FP7 project Eurosentiment179 which in the Finnish Sampo datasets [129], among others 17 18leveraged lemon to model language resources for sen- very clearly demonstrates this. 18 19timent analysis. They also include FREME180 which In the rest of this section, then, we will provide sum- 19 20explored the application of the NIF and lemon; and maries of a number of small and medium scale projects 20 21SemaGrow181 which, along with the LIDER project, at the regional, national and trans-national levels which 21 22helped to support the development of the lime meta- are notable for bringing together, or being at the over- 22 23data module). lap of, LLD and DH. 23 24Other projects with a significant recent impact on At a national level, we can list the French project 24 25the application of LLD vocabularies include the Hori- Nénufar, already mentioned above, which aims to- 25 26zon 2020 project Lynx: Building the Legal Knowl- wards the creation of successive early 20th century 26 27edge Graph for Smart Compliance Services in Mul- editions of the French language Le Petit Larousse Il- 27 28tilingual Europe (2017-2021) [127] which has con- lustré dictionary in both TEI/XML and in RDF us- 28 29tributed to data modelling in the area of machine- ing OntoLex-Lemon [130]182, along with the Ger- 29 30readable licensing, a topic that is much broader than man project Linked Open Dictionaries which is de- 30 31the area covered by our survey. They also include the scribed in detail in Section 5.2.1. In addition we can 31 32project Quality Translation by Deep Language En- also mention the Italian project (part of the Progetti 32 33gineering Approaches (QTLeap), 2013-2016 which di Rilevante Interesse Nazionale or PRIN program) 33 34has primarily focused on Natural Language Process- Languages and Cultures of Ancient Italy. Histori- 34 35ing. Due to their more recent impact on the definition cal Linguistics and Digital Models (ItAnt) (currently 35 36and use of LLD models and vocabularies we will dedi- ongoing) which aims to publish a linked data lexicon 36 183 37cate specific sections to the following European H2020 of the ancient Italic languages using the OntoLex- 37 38projects ELEXIS(Section 5.2.3), Prêt-à-LLOD (Sec- Lemon model and its extensions. Also relevant here 38 39tion 5.2.5), and the ERC projects LiLa (Section 5.2.4) is the Italo-German project DiTMAO, funded by 39 40and POSTDATA (Section 5.2.2) below. the DFG (Deutsche Forschungsgemeinschaft), (com- 40 41pleted) which produced a lexicon of Old Occitano 41 425.1.1. Recent Projects combining LLD and DH medical terminology and which also proposed an ex- 42 43As is typical of DH projects, those which we de- tension of lemon to deal with the specifics of this use- 43 44scribe in this section, along with ELEXIS, LiLa and case184 [131]. Another national project worth men- 44 45POSTDATA (described in their own sections below), tioning here is the recently initiated Portuguese na- 45 46aim to engage with a wide and diverse scholarly com- tional project MORdigital [132] which has the aim of 46 47munity, which includes linguists, philologists, histori- 47 48 48 182Despite the best of intentions however the RDF part isn’t cur- 49 49 179https://cordis.europa.eu/project/id/296277 rently very well developed. 50180https://cordis.europa.eu/project/id/644771 183https://www.prin-italia-antica.unifi.it/ 50 51181https://cordis.europa.eu/project/id/318497 184https://www.uni-goettingen.de/en/ditmao/487498.html 51 AF. Khan et al. / When Linguistics Meets Web Technologies. Recent advances in Modelling Linguistic Linked Open Data 37

1digitising the historically significant 18th century Por- clusion of new data during dictionary development. In 1 2tuguese language dictionary O Diccionario da Lingua the case of the Mayan the situation is even more dif- 2 3Portugueza by António de Morais Silva with the in- ficult due to the signs not yet having been fully deci- 3 4tention of producing digital editions of this important phered. In order then to deal with the challenges which 4 5lexicographic work both in TEI Lex-0 and OntoLex- arise from the existence of different sign catalogues 5 6Lemon. The MORdigital project will be an impor- (which might cluster different signs into meanings dif- 6 7tant test case for understanding both the coverage of ferently) and the necessity of linking with other cat- 7 8already existing LLD vocabularies when it comes to alogues which have been developed in the field, the 8 9retrodigitized dictionaries, and the advantages and dis- project’s sign catalogue has been formalised in SKOS 9 10advantages of using linked data as a means of publish- in addition to using properties and concepts from the 10 11ing such data. CIDOC-CRM vocabulary187 and GOLD (mentioned 11 12Many of the projects we have mentioned have used above in Section 3.5). The TDWM project is also de- 12 13OntoLex-Lemon or its predecessor lemon. However a veloping its own vocabulary for identifying signs, link- 13 14lightweight alternative to these vocabularies, and one ing them to different sign catalogues, possible read- 14 15which enables the multilingual annotation of concep- ings, graphical variants, etc. At the time of writing, nei- 15 16tual hierarchies, is SKOS-XL. For instance SKOS- ther the sign catalog nor any texts are publicly avail- 16 17XL has been used in a number of related projects at able, but Diehr et al. [137] provide a detailed descrip- 17 18the Computational Linguistics lab of the University of tion188. 18 19Saarland, part of a major effort towards the transforma- Finally, another recent project which used a range of 19 20tion of several influential classification schemes in the different LLD vocabularies is Machine Translation 20 21field of folk literature185 (including among others folk- and Automated Analysis of Cuneiform Languages 21 22tales, ballads, myths, fables) into Semantic Web repre- (MTAAC). This was a Data international funded 22 23sentation languages in order to support interoperabil- project which saw the collaboration of specialists of 23 24ity between those schemes; see [136]. The terms used cuneiform languages and computational linguists in 24 25in the original classification schemes were transformed the development of cutting edge tools for the annota- 25 26into (multilingual) SKOS-XL labels. These were used tion and distribution of linguistic data of the cuneiform 26 27for encoding folktale text sequences, extracted from corpus. Although the project’s overall objective was 27 28a manually annotated multilingual folktale corpus and to open the way to the development of tools and the 28 29which had been identified as representing motifs listed production of richer linguistic data for all cuneiform 29 30in [134]. The use of SKOS-XL meant that motifs could languages, its specific focus was on a group of unan- 30 31be annotated in different multilingual versions of tales. notated Sumerian texts issued from the bureaucratic 31 32Staying with the theme of the use of a range of apparatus of the Ur III period (21th century BC)189; 32 33different models/vocabularies for the encoding of lin- in addition, another corpus composed of royal inscrip- 33 34guistic data as LLD (and not necessarily the ones we tions in the Sumerian language [138], annotated with 34 35have focused on in this survey) another project worth morphology, was also employed. Amongst the several 35 36mentioning here is Text Database and Dictionary of 36 37186 37 Classic Mayan (TDWM) (2014-2029). This latter 187 38http://www.cidoc-crm.org/ 38 aims to develop a corpus-based dictionary of Mayan 188It is interesting to note that TDWM stands in a longer tra- 39 39 hieroglyphic writing alongside a near-exhaustive cor- dition of projects in the Digital Humanities that aim to comple- 40pus of Classic Mayan which would allow for the verifi- ment a TEI/XML edition with terminology management using an 40 41cation of different textual interpretations and aid in fin- ontology. Similar ideas have already been driving force behind the 41 42ishing the decipherment of Maya writing. The project project Sharing Ancient Wisdoms (SAWS, 2010-2013)(http://www. 42 43ancientwisdoms.ac.uk/), a joint project at King’s College London, 43 faces the problem, typical of ancient languages, of the UK, the Newman Institute in Uppsala, Sweden, and the University 44 44 need to represent multiple interpretations of characters of Vienna, Austria, funded in the context of the Humanities in the 45and texts, in part due to damaged sources in concomi- European Research Area (HERA) program to facilitate the study and 45 46tance with the need to update this data with the in- electronic edition of ancient wisdom literature. Both projects employ 46 47resolvable URIs, but the linking is expressed by means of narrowly 47 48defined TEI/XML attributes rather in terms of RDF semantics. In 48 185The classification schemes in question were those proposed by that regard, the data published in accordance with these guidelines 49 49 Vladimir Propp [133], Stith Thompson [134] and Anti Aarne, Stith does not qualify as Linked Data, but can still be converted to Linked 50Thompson and Hans-J. Uther [135]. Data with moderate effort. 50 51186Based at the University of Bonn, Germany. 189These texts were extracted from CDLI 51 38 AF. Khan et al. / When Linguistics Meets Web Technologies. Recent advances in Modelling Linguistic Linked Open Data

1objectives of the project [139] was the aim of formal- to be added in the future (Pleiades, perio.do, etc.). The 1 2ising the new data produced by the project by utilising model developed is currently being integrated into the 2 3Linked Open Data (LOD, including Linguistic LOD) CDLI platform. 3 4vocabularies, and fostering the practices of standard- 4 5.1.2. An LLD Project Matrix; The Relationship 5isation, open data and LOD as integral to projects in 5 between Projects and Community Initiatives 6digital humanities and computational philology. The 6 In Figure 7, we provide an overview in the form of a 7project, which ended in 2020 successfully achieved 7 matrix of a number of selected projects on the basis of 8these aims, including making new data in the form of 8 the kinds of contributions they have made to a number 9linguistic annotations and translations available under 9 of LLD vocabularies. We distinguish three varieties of 10open licenses190; this will soon be accessible through 10 contribution: namely, a project is said to have 11the new web platform of the Cuneiform Digital Library 11 12Initiative (CDLI https://cdli.ucla.edu) in many forms, Developed (deep green) a vocabulary if the develop- 12 13including (L)LOD. ment of that vocabulary was a designated project 13 14CoNLL was chosen as as a flexible and robust goal, to have 14 15format for storing the multi-layer annotations which Contributed (light green) to a standard if vocabulary 15 16were produced and worked on as part of the project development was not a designated project goal, 16 17191. However CoNLL-RDF was also employed in the but the project provided a use case or application 17 18project in order to ensure integration with LLOD, as that was discussed in the process of its develop- 18 19well as for easier querying and transformation and ment, or to have 19 20was used to link annotations, lexical information, and Used (yellow) a vocabulary if they applied an existing 20 21metadata. The ETCSRI morphological annotations192 vocabulary, worked with or produced data of that 21 22were mapped to Unimorph193 using Turtle-RDF194, type 22 23rendering Sumerian material accessible for cross- 23 Note that this survey, and indeed any survey which 24linguistic queries. SPARQL was leveraged through 24 focuses on projects, will provide a partial view only. In 25CoNLL-RDF for syntactic annotation which was mapped 25 particular, contributions by community groups (Open 26to Universal Dependencies for POS and dependency 26 Linguistics Working Group, W3C working groups, 27labels. Lexical data was linked to guide word entries 27 etc.) are not explicitly covered in this section (although 28through the employement of an OntoLex-Lemon com- 28 they are described in some depth in Section 4 and their 29pliant index. Metadata concerning the analysis of the 29 30contribution is also discussed in Sections 2.2 and 5.1). 30 medium of the text and other meta classifications of For instance the reader will notice that very few of the 31the texts were mapped to the CIDOC-CRM. Over- 31 32projects in Fig. 7 address the area of LLD for linguistic 32 all, MTAAC suceeded in preparing a (L)LOD edi- typology. In fact the interaction between linguistic ty- 33tion and linking of Sumerian language corpora. The 33 34pology and language technology operates primarily on 34 model can be extended in part to other cuneiform lan- informal contacts on mailing lists and via workshops 35guages. Various Assyriological resources had been in- 35 36and less in terms of large-scale infrastructural projects, 36 tegrated using (L)LOD [98]: The CDLI data, (CoNLL- and that, thus, the development of standard (computa- 37RDF plus CIDOC-CRM), ORACC:ETSCRI (by con- 37 38tional) models and vocabularies has only rarely a pri- 38 version; CoNLL-RDF), ePSD (by conversion and links ority in typological projects195. For such discussions, 39to HTML; lemon) and ModRef & BM (by federa- 39 40tion; CIDOC-CRM). Other vocabularies are planned 40 41195There are notable exceptions here the E-MELD project (http: 41 42//emeld.org/), for example, developed the GOLD ontology as part of 42 190 43https://gitlab.com/cdli/framework; https://github.com/cdli-gh an attempt to improve interoperability and sustainability of language 43 191A derivative internal format, called CDLI-CoNLL is employed documentation material. But while several typological projects de- 44 44 to store the data locally – this was an essential step to support the veloped ontologies and RDF vocabularies, and have been actively 45preservation of domain specific annotation which are richer than contributing to the community, esp., in the Open Linguistics work- 45 46their counterparts found in linguistic all-encompassing models. But ing group, we see a very limited degree of linking between such 46 47this can be exported in CoNLL-U format, as well in Brat Standalone resources. The Cross-Linguistic Linked Data project (CLLD, 47 48format, for better compatibility. https://clld.org/), for example, does provide an RDF view on their 48 192http://oracc.museum.upenn.edu/etcsri/parsing/index.html data, but linking is primarily internal, and neither complete data 49 49 193http://unimorph.org/. dumps nor a SPARQL end point or any form of an API is provided. 50194https://github.com/cdli-gh/mtaac_work/blob/master/lod/ Instead, their RDF data seems to be generated on the fly, without any 50 51annotations/um-link.ttl. links to external resources. We take this to reflect the fact that for 51 AF. Khan et al. / When Linguistics Meets Web Technologies. Recent advances in Modelling Linguistic Linked Open Data 39

1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 11 11 12 12 13 13 14 14 15 15 16 16 17 17 18 18 19Fig. 7. Usage of and contribution to major LLOD vocabularies by selected research projects 19 20 20 21more informal networks present great opportunities termined the success and then the subsequent main- 21 22(and act as a driver) for experts to participate in the tenance of the LLDs models and vocabularies. The 22 23Linguistic Linked Open Data movement, whereas the importance of funded projects is clear for the devel- 23 24chances for acquiring substantial funding directed to- opment of tools and hosting solutions for Linguistic 24 wards vocabulary development and community partic- 25Linked (Open) Data which are not yet in place; open 25 ipation are rather unreliable (if past experience is any- 26community initiatives have also proven themselves vi- 26 27thing to go). 27 tal for dissemination and wider community engage- 28Also note that in this section, we have concentrated 28 29on research projects with a specific focus on linguis- ment. With the increasing maturity of OntoLex-Lemon 29 30tic linked (open) data – many of them, indeed, with and the convergence between existing solutions in lin- 30 31industry partners involved – but which do not, for guistic annotation, the necessary requirements for de- 31 32the most part, directly target industrial applications. veloping large-scale Linguistic Linked (Open) Data in- 32 33More industry-focused LLD projects do exist, how- frastructures and their respective linking are in place, 33 34ever, and are the basis for businesses specialising in now. In Section 6.1.1 below we take a brief look at the 34 35text analytics [140], terminology and knowledge man- 35 prospects for the involvement of research infrastruc- 36agement [141] or lexicography [142]. But linked data 36 tures in the kinds of initiatives mentioned in this sec- 37in these contexts tends to be viewed as a technical facet 37 tion. 38that has an impact on interoperability, (re)usability and 38 39information aggregation rather than being fundamental In what follows we will give extended descriptions 39 40for any existing business model. With the increasing of six ongoing projects. We have chosen these projects 40 41maturity of the technology, however, this may change the basis of their impact and their scale, as well as 41 42over the longer term, especially in the area of establish- for their importance in the development of LLD mod- 42 43ing interoperability between AI platforms [143], their els and vocabularies and/or in their innovative use 43 44providers and users and data provided and exchanged of such. These are LiODi in Section 5.2.1; POST- 44 45between them [144]. 45 DATA, in Section 5.2.2; Prêt-à-LLOD in Section 46To conclude then, it has really been the combination 46 5.2.5; ELEXIS in Section 5.2.3 and finally NexusLin- 47of open community initiatives and projects that has de- 47 48guarum in Section 5.2.6. Please note that the length of 48 49the following project descriptions will vary on the ba- 49 this community, interoperability is a priority, but also, to maintain 50control over internal data and independence from external contribu- sis of their relevance to some of the models and vocab- 50 51tions. ularies discussed in the rest of this paper. 51 40 AF. Khan et al. / When Linguistics Meets Web Technologies. Recent advances in Modelling Linguistic Linked Open Data

15.2. Innovative Projects tilingual WordNets203, the Intercontinental Dictio- 1 2nary Series, XDXF204 and StarDict205 – the latter 2 35.2.1. LiODi (2015-2022) only to the extent that the copyright could be clarified 3 4The Linked Open Dictionaries project (LiODi)196 and an open license was confirmed). 4 5aims to develop LLOD-enabled methodologies and in- More significant than lexical resources and novel 5 6frastructures to facilitate language research for low- vocabularies, however, are the contributions of LiODi 6 7resource languages, validating these developments to the development of community standards for LLD 7 8mostly on the languages of the Caucasus. Within the vocabularies. This includes, among other aspects, sig- 8 9project, a set of loosely connected tools are being cre- nificant contributions to the emerging OntoLex Mor- 9 10ated with the aim of facilitating language contact stud- phology module (Section 4.1.2), initiating and moder- 10 11ies over lexical and corpus data. One of the primary ating the development of the OntoLex FrAC module 11 12development goals of the project is an environment (Section 4.1.3) and the LD4LT initiative on harmoniz- 12 13for detecting semantically and phonologically similar ing vocabularies for linguistic annotation on the web. 13 14words across different languages in order to facilitate Furthermore, LiODi has a strong dedication to 14 15the detection of possible cognates. Other tools include the dissemination and promotion of linked data ap- 15 16interfaces for converting, validating, and exploring lin- proaches for linguistics. The project co-organised two 16 17guistic data to aid in linguistic research both within summer schools, SD-LLOD 2017 and SD-LLOD 17 18and outside of the project. Tool development and lin- 2019; two conferences LDK 2017 and LDK 2019; 18 19guistic research are both integral parts of LiODi and three workshops LDL 2016, LDL 2018, and LDL 19 20the tools and pipelines implemented are tested on the 2020; and collaborated with international partners and 20 21data generated and used in the project [98, 145]. the Prêt-à-LLOD project (see Section 5.2.5) in the pub- 21 22The most important contributions of LiODi from a lication of the first monograph on the topic [124] along 22 23modelling perspective relate to the fact that its mem- with a number of edited volumes (not counting the five 23 24bers have developed, and are in the course of de- volumes of proceedings which resulted from the afore- 24 25veloping, LLD vocabularies for a wide-range of ap- mentioned events, including a collection on linked data 25 26plications in the language sciences: in particular, vo- for collaborative, data-intense research in the language 26 27 27 cabularies with an emphasis on the requirements of sciences [146]). 28 28 low-resource languages and especially morphologi- Outside of conjoined activities at summer schools 29 29 cally rich languages which are not well served by ex- and datathons, the project supports numerous external 30 30 isting formats. These vocabularies include individual, partners in expertise with data modelling and language 31 31 task-specific vocabularies such as Ligt and CoNLL- resource management. Indeed LiODi has close ties 32 32 RDF (see 4.2.4), but also an extension of OntoLex for with most of the projects listed here. To mention one 33 33 diachronic relations (cognate and loan relations) [44]. notable example here, a collaboration with the POST- 34 34 In addition to that, the LiODi project (along with DATA project (see next section) and the Academy of 35 35 Prêt-à-LLOD, see 5.2.5) is the main contributor to Sciences in Heidelberg, Germany, led to the first prac- 36 36 the ACoLi Dictionary Graph [34]197. To the best of tical applications of RDFa within TEI editions in the 37 37 our knowledge, the ACoLi Dictionary Graph currently Digital Humanities [67, 147], and ultimately to the de- 38 38 velopment of an official TEI+RDFa customization (see 39represents the most extensive collection of machine- 39 above). 40readable bilingual open source dictionaries available, 40 with currently more than 3000 substantial data sets for 415.2.2. POSTDATA (2016-2021) 41 more than 430 ISO 639-3 languages (including full 42The Poetry Standardization and Linked Open 42 OntoLex-Lemon editions of PanLex198, Apertium199, 43Data (POSTDATA) project206, seeks to bridge the 43 FreeDict200, MUSE201, Wikidata202, the Open Mul- 44digital gap between traditional cultural assets and the 44 45growing sophistication of data modelling and publica- 45 46196https://acoli-repo.github.io/liodi/ tion practices in the field of the Digital Humanities. 46 47197https://github.com/acoli-repo/acoli-dicts 47 198 48https://panlex.org/ 48 199https://www.apertium.org/ 203http://compling.hss.ntu.edu.sg/omw/ 49 49 200https://freedict.org 204https://sourceforge.net/projects/xdxf/ 50201https://github.com/facebookresearch/MUSE 205http://stardict.sourceforge.net/ 50 51202https://www.wikidata.org/ 206http://postdata.linhd.uned.es 51 AF. Khan et al. / When Linguistics Meets Web Technologies. Recent advances in Modelling Linguistic Linked Open Data 41

1It focuses on poetry analysis, bringing Semantic Web The second pillar of POSTDATA the use of NLP 1 2standards and technologies for enhanced interoperabil- tools, represented by PoetryLab, encompasses the sev- 2 3ity to bear on numerous different poetry-related re- eral different levels of poetry scholarship, from the 3 4sources. The project is founded upon two central pil- most formal analyses relating to scansion, to more 4 5lars: the use of linked open data and the implementa- cognitive levels which concern the understanding of 5 6tion and use of a set of dedicated Natural Language metaphor as well as others related to knowledge and 6 7Processing (NLP) tools, PoetryLAb. In particular one subjective perception involving AI techniques. POST- 7 8of the aims of the project is to share scholarly knowl- DATA has already implemented the first level of NLP 8 9edge about the domain of poetry and publish literary algorithms for poem analysis. These allow for the auto- 9 10works on the linked open data cloud. mated extraction of information from poems at differ- 10 11In order to fulfil this aim, POSTDATA is developing ent levels of description and include an Name-Entity 11 12a poetry ontology. This ontology is based on the anal- Recognition system (NER) for medieval place names 12 13ysis and comparison of different data structures and and organizations, [157] as well as automatic enjamb- 13 14metadata arising from eighteen projects and databases ment analysis and basic metrical scansion tools (which 14 15devoted to poetry in different languages at the Euro- allow for lexical syllabification and the recognition of 15 16pean level, [148–151]. The POSTDATA ontology is an stressed and unstressed syllables) testing different ap- 16 17encapsulated ontology model, where domain knowl- proaches. These latter range from traditional ruled- 17 18edge is implemented in 3 layers: Postdata-core, Post- based systems to the latest deep learning based tech- 18 19data metrical and literary analysis and Postdata- niques, [158–160]. The goal in this case is to use the 19 20transmission. This layered ontology is based on the results of these tools in order to build an RDF knowl- 20 21re-use of other ontologies relevant to the project’s do- edge graph that is compliant with Postdada ontology. 21 22main of interest and covers different levels of descrip- 22 5.2.3. ELEXIS (2018-2022) 23tion from the abstract concept of the poetry work to its 23 Following on from the European Network for e- 24bibliographic representation [152–156]. The model is 24 Lexicography COST Action207, the ELEXIS project 25intended to support tasks associated with the analysis 25 is currently undertaking the construction of a Euro- 26of poetry such as close reading, distant reading or crit- 26 pean infrastructure for electronic lexicography [161]. 27ical analysis. All of these ontologies are modelled in 27 LLD will play a key role in the ELEXIS infrastruc- 28OWL and will be exposed via SPARQL endpoints. 28 ture, as means of connecting dictionaries and other lex- 29The POSTDATA metrical layer represents knowl- 29 icographic resources both within and across language 30edge pertaining to the poetical structure and prosody of 30 boundaries. Indeed, the idea of ELEXIS is to eventu- 31a poem and contains salient (general) linguistic, pho- 31 ally construct an network of interlinked electronic lex- 32netic and metrical concepts. From the metrical point of 32 ica and other lexicographic and language resources in 33view, a poem is formed by stanzas that contain lines, 33 several different languages, a network that the project 34where the latter is understood as a list of words. The 34 calls a Matrix Dictionary. Another relevant aspect of 35concept, word, is present in OntoLex-Lemon and in 35 the project concerns the conversion of legacy lexico- 36NIF. In both cases, the definition of the concept is 36 graphic resources into structured data, and potentially, 37not sufficient to capture all the knowledge needed for 37 linked data in order to feed into this Matrix Dictionary. 38the analysis and description of a word from a metri- 38 The main models being used in the project are 39cal point of view. From this latter point of view the 39 OntoLex-Lemon and the TEI Lex-0 model mentioned 40concept word is associated both with more general lin- 40 above [162]. Here it will perhaps be useful to give a 41guistic information (such as its lemma or headword) 41 brief description of the latter. 42as well as more specific phonetic features such as syl- 42 43lable, foot, feet type onset or coda as well as other TEI Lex-0 and ELEXIS TEI Lex-0 is a custmomiza- 43 44types of metrical information. However, the intention tion of the TEI schema208 that is adapted to the encod- 44 45is to link the Word concept in the POSTDATA metri- ing of lexical resources. In particular it was designed 45 46cal ontology with the OntoLex-Lemon concept Word to enhance the interoperability of such datasets by lim- 46 47through the property wordsense, allowing us to cap- iting the range of encoding possibilities (offered by the 47 48ture the range of meanings of the concept. Moreover, 48 49 49 the POSTDATA Word class will also be linked to the 207https://www.elexicography.eu/ 50NIF Word class due to the shared relationship of both 208https://dariah-eric.github.io/lexicalresources/pages/TEILex0/ 50 51of them to NLP operations. TEILex0.html 51 42 AF. Khan et al. / When Linguistics Meets Web Technologies. Recent advances in Modelling Linguistic Linked Open Data

1current TEI guidelines) in the representation of lexi- some final set of guidelines, and its use or appropri- 1 2cal content (for instance TEI Lex-0 has deprecated el- ation (along with other pertinent models and vocabu- 2 3ements such as superEntry or entryFree). This laries) in a specialist domain or task can be bridged 3 4makes the possibility of a crosswalk from (at least a by such targeted training materials (and also by de- 4 5subset of) TEI Lex-0 to OntoLex-Lemon more feasi- sign patterns see Section 6.1.2. This is also one of the 5 6ble than, say, a crosswalk from say, a minimal cus- motivations behind the strong emphasis on training in 6 7tomisation of TEI based on the TEI dictionary guide- NexusLinguaram (see Section 5.2.6). 7 8lines to OntoLex-Lemon. TEI Lex-0 is being devel- Both the production of training materials and the 8 9oped by a special working group which (pre-Covid) push to promote OntoLex-Lemon as a common serial- 9 10organised regular in-person training schools with sup- isation format for a standard for e-lexicography seems 10 11port from ELEXIS too. Both OntoLex-Lemon and TEI to promise much in terms of the future use of linked 11 12Lex-0 have been previously used for smaller lexicog- data in the context of electronic lexicography. It is in- 12 13raphy projects, but never in a project with such wide evitable that the experiences of lexicographers and lin- 13 14coverage in terms of the languages and kinds of lexi- guists in using OntoLex-Lemon (and its lexicographic 14 15cographic resource under consideration. ELEXIS has extension, see Section 4.1.1) both within and outside 15 16provided support to the development of both OntoLex- of the ELEXIS project to create and edit lexicographic 16 17Lemon as well as TEI Lex-0 and a joint workshop was resources will have an important impact on the use of 17 18held between these projects at the 2019 edition of the the model and also, potentially, on future extensions 18 19e-lexicography convention eLex. Work is also under- and/or versions of OntoLex-Lemon. 19 20way on a crosswalk between TEI Lex-0 and OntoLex- 20 21Lemon. The latest version of a proposed TEI Lex-0 to 5.2.4. LiLa (2018-2023) 21 210 22OntoLex converter can be found at https://github.com/ The LiLa: Linking Latin ERC project aims to 22 23elexis-eu/tei2ontolex. connect language resources developed for the study of 23 24The project is also promoting the standardisation of Latin, bringing the worlds of textual corpora, digital 24 25OntoLex-Lemon and TEI Lex-0 through the OASIS libraries, lexica and tools for Natural Language Pro- 25 26working group on Lexicographic Infrastructure Data cessing together. To this end, LiLa makes use of the 26 27Model and API (LEXIDMA)209, which will lead to a LOD paradigm and of a set of ontologies from the LLD 27 28new unifying standard for lexicographic data that will cloud to build an interoperable network of resources. 28 29be serialised in both OntoLex-Lemon and TEI Lex-0. LiLa’s ambition is to create an infrastructure where 29 30researchers and students of Latin can find answers to 30 31The Impact of ELEXIS on the Use of OntoLex-Lemon complex questions that involve multiple layers of lin- 31 32ELEXIS aims to provide support for the creation and guistic annotations and knowledge, such as: what sub- 32 33editing of dictionary resources using both models. To jects are constructed with verbs formed with a certain 33 34this end extensive teaching materials are also being de- prefix [163]? What WordNet synsets do they belong to? 34 35veloped as part of the project with the aim of intro- As Latin is characterised by a very rich morphology 35 36ducing lexicographers to linked data and the OntoLex- (where, for instance, a single verb can potentially yield 36 37Lemon model. It should be noted that the availabil- more than 100 forms, excluding the nominal inflection 37 38ity of manuals and targeted teaching materials plays of participles), LiLa focuses on lemmatization as the 38 39an important factor in increasing the uptake of mod- key task that allows for a meaningful and functional 39 40els such as OntoLex-Lemon and technologies such as connection between the different layers of annotation 40 41linked data, (as of course is the case with new tech- and information involved in the project. Indeed, while 41 42nologies and new technological approaches in gen- lemmas are used by lexica to label entries, lemmatiza- 42 43eral), especially amongst users who haven’t had much tion is often performed in digital libraries of Latin texts 43 44previous exposure to linked data or conceptual mod- to index words and is included in most NLP pipelines 44 45elling. The original designers of such models are usu- (like e.g. UDPipe)211 as a preliminary step for more 45 46ally unable to take into consideration of every kind of advanced forms of analysis212. 46 47use-case for which the model might be used. The gap 47 48between a general purpose model as it is presented in 48 210http://lila-erc.eu 49 49 211https://ufal.mff.cuni.cz/udpipe. 50209https://www.oasis-open.org/committees/tc_home.php?wg_ 212For the state of the art in automatic lemmatization and PoS tag- 50 51abbrev=lexidma ging for Latin, see the results of the first edition of EvaLatin, a cam- 51 AF. Khan et al. / When Linguistics Meets Web Technologies. Recent advances in Modelling Linguistic Linked Open Data 43

1LLD standards such as OntoLex-Lemon (see Sec- The property lemma variant connects two lemmas that 1 2tion 4.1) provide an adequate framework to model the can be alternatively used to lemmatize forms of the 2 3relations between the different classes of resources same words. Hypolemma is a sub-class of the Lemma 3 4via lemmatization, while also offering a robust solu- that groups forms (e.g. participles) that can be either 4 5tion for modelling most lexica. The central compo- promoted to canonical or be lemmatised under a hy- 5 6nent in LiLa’s framework, the gateway between the perlemma (e.g. the main verb); hypolemmas are con- 6 7different projects, is the collection of canonical forms nected to their hyperlemma via the is hypolemma prop- 7 8that are used to lemmatize texts (called lemma bank). erty. 8 9This collection was created starting from the lexical Currently, the canonical forms in the LiLa lemma 9 10database of the morphological analyzer Lemlat [165], bank connect lexical entries of four lexical resources. 10 11and currently includes a set of about 190,000 forms Two lexica provide etymological information, which 11 12that can potentially be used as lemmas in corpora or was modelled using the OntoLex-Lemon extension 12 13lexica213. lemonEty [54], respectively on the lexicon inher- 13 14The forms in the lemma bank are described in an ited from proto-Indo-european214 [168] and loans 14 15OWL ontology that reuses several concepts from the from Greek215 [169]. The polarity lexicon LatinAf- 15 16LLD standards discussed in the previous sections. The fectus connects a polarity value (expressed using the 16 17canonical forms are instances of the class Lemma, Marl ontology216) to a general sense for 1,998 en- 17 18which is defined as a subclass of the Form from tries217 [170]. Finally, 1,421 verbs from the Latin 18 19the OntoLex-Lemon vocabulary. The part-of-speech WordNet have been manually revised and published as 19 20and morphological annotations in the Lemlat database LOD218 [171]. 20 21have been included in the ontology and linked to the In addition to lexica, two annotated corpora are 21 22OLiA reference model (see Section 3.5). For a selec- currently linked to the LiLa lemma bank. The Index 22 23tion of circa 36,000 lemmas, the lemma bank also in- Thomisticus Treebank219 provides morpho-syntactic 23 24cludes derivational information, listing the morphemes annotation for 375,000 tokens from the Latin works of 24 25(i.e. the prefixes, affixes and lexical bases) that can be Thomas Aquinas (13th century CE), while the Dante 25 26identified in each lemma [166]. Search corpus220 includes the lemmatised text of four 26 27The fact that OntoLex-Lemon forms are allowed to Latin works of Dante Alighieri (14th century), which 27 28have multiple written representations is a particularly are currently undergoing a process of syntactic anno- 28 29helpful feature for a language which has been attested tation following the Universal Dependencies annota- 29 30across circa 25 centuries and a wide spectrum of gen- tion style [172]221. The POWLA ontology was used to 30 31res, and which is, moreover, characterised by a sub- represent texts and annotations for both corpora. How- 31 32stantial amount of spelling variation. Harmonizing dif- ever, the link between a corpus token and a lemma 32 33ferent lemmatization solutions adopted by corpora and of the LiLa collection was expressed using a custom 33 34NLP tools, however, requires practitioners to deal with property has lemma defined in the LiLa ontology222, 34 35other kinds of variation as well [167]. In the case of which takes an instance of the Lemma class as its 35 36words with multiple inflectional paradigms or forms range, since no existing vocabulary provided a suitable 36 37which may be interpreted as either autonomous words way to express this relation. 37 38or inflected forms of a main lemma (such as partici- 38 5.2.5. Prêt-à-LLOD (2019-2022) 39ples, or adverbs built from adjectives: see e.g. English 39 The goal of the Prêt-à-LLOD project is to make 40“quickly” from “quick”), projects may vary consid- 40 linguistic linked open data ‘ready-to-use’ and part of 41erably in the adopted strategies. For this reasons, the 41 42LiLa ontology introduces one sub-class of the Lemma 42 214 43and two object properties that connect forms to forms. https://lila-erc.eu/data/lexicalResources/BrillEDL/Lexicon 43 215https://lila-erc.eu/data/lexicalResources/IGVLL/Lexicon 44 44 216http://www.gsi.upm.es:9080/ontologies/marl/ 45paign devoted to the evaluation of NLP tools for Latin [164]. The 217https://lila-erc.eu/data/lexicalResources/LatinAffectus/ 45 46first edition of EvaLatin focused on two shared tasks (i.e. lemma- Lexicon 46 47tization and PoS tagging), each featuring three sub-tasks (i.e. Clas- 218http://lila-erc.eu/data/lexicalResources/LatinWordNet/Lexi- 47 48sical, Cross-Genre, Cross-Time). These sub-tasks were specifically con 48 designed to measure the impact of genre variation and diachrony on 219http://lila-erc.eu/data/corpora/ITTB/id/corpus 49 49 NLP tool performances. 220http://lila-erc.eu/data/corpora/DanteSearch/id/corpus 50213The lemma bank can be queried using the lemmaBank 221https://universaldependencies.org/guidelines.html 50 51SPARQL endpoint of the project: https://lila-erc.eu/sparql/. 222https://lila-erc.eu/lodview/ontologies/lila/ 51 44 AF. Khan et al. / When Linguistics Meets Web Technologies. Recent advances in Modelling Linguistic Linked Open Data

1this mission is to contribute to the development of ware repository as well as the linking technolo- 1 2new vocabularies for linguistic linked data in appli- gies to provide a single authoritative source of in- 2 3cation scenarios that facilitate the development of a formation about language resources across a wide 3 4next-generation multilingual internet. Several aspects range of languages. 4 5of linked data technology are being pursued in this 5 6context. This includes, without being restricted to The key priority of Prêt-à-LLOD, however, is less to 6 7develop novel vocabularies, than to develop technical 7 8linking In its linking aspect, Prêt-à-LLOD explores solutions on that basis. Accordingly, Prêt-à-LLOD in- 8 9technologies to facilitate the linking between and volves four industry-led pilot projects that are designed 9 10among lexical, terminological and ontological re- to demonstrate the relevance, transferability and appli- 10 11sources. In this context, it has provided signifi- cability of the methods and techniques under develop- 11 12cant support to the development of the OntoLex- ment in the project to concrete problems in the lan- 12 13Lemon, including the development of a module guage technology industry. The pilots showcase po- 13 14for lexicography, a module for morphology, and tentials in the context of various sectors: technology 14 15corpus information (all of which are discussed in companies, open government services, pharmaceutical 15 16Section 4.1). Further extensions for terminologies industry, and finance, details of which are described 16 17and linking metadata (Fuzzy Lemon) have been in [174] As overarching challenges, all pilots are ad- 17 18proposed in the context of the project, as well. dressing facets of cross-language transfer or domain 18 19In addition, the project is contributing models for adaptation to varying degrees. Particularly relevant to 19 223 20dataset linking to the Naisc project that pro- LLOD, the project is developing tools that are help- 20 21vides a toolkit for generic dataset linking. ful to practical lexicographic applications including for 21 22transformation Prêt-à-LLOD provides a generic frame- the Oxford Dictionaries [175]. 22 23work for transforming, enriching and manipulat- Notable project results in the context of this pa- 23 24ing language resources by means of RDF tech- per are a Report on Vocabularies for Interoperable 24 25nology [173]. The idea here is to transform a lan- Language Resources and Services that gives a brief 25 26guage resource into an equivalent RDF represen- overview over standards for language resources as of 26 27tation, to manipulate and enrich it with SPARQL 2019225 and the publication of the first monograph on 27 28transformation and external knowledge, and to se- LLOD technologies [124]. Whereas the latter builds 28 29rialize the result in RDF or non-RDF formats. To on long-standing collaborations between its authors in 29 30the extent that different formats can be mapped previous projects and community groups, it was final- 30 31to or generated from the same RDF representa- ized with support from the Prêt-à-LLOD project. 31 32tion, they can be transformed one into another. 32 5.2.6. NexusLinguarum (2019-2023) 33For lexical data, the OntoLex model and its afore- 33 34mentioned extensions represent a de facto stan- The European network for Web-centred linguis- 34 226 35dard and are being used as such. For linguistic an- tic data science (NexusLinguarum) is a COST 35 36notations, several competing standards exist, and Action project that involves researchers from 42 coun- 36 37Prêt-à-LLOD contributes to on-going consolida- tries. The network started in October 2019 and will 37 38tion efforts within the W3C CG Linked Data for continue its activities for four years. The COST Ac- 38 39Language Technology with case studies on and tion promotes synergies across Europe between lin- 39 40support for CoNLL-RDF, NIF, Ligt, POWLA, guists, computer scientists, terminology experts, lan- 40 41and OLiA (see Sect. 4.2). guage professionals, and other stakeholders from both 41 42metadata Prêt-à-LLOD provides a workflow manage- industry and society, in order to investigate into and 42 43ment system, a metadata repository for language to extend the areas of applicability of linguistic data 43 44resources, and machine-readable license informa- science in a Web-centred context. Linguistic data sci- 44 45tion. In that regard, it also contributes to the devel- 45 46opment of metadata standards. This work is lead- 225Christian Chiarcos, Philipp Cimiano, Julia Bosque-Gil, 46 224 47ing to a new version of the Linghub site [113] , Thierry Declerck, Christian Fäth, Jorge Gracia, Maxim Ionov, John 47 48that is based around the DSpace open source soft- McCrae, Elena Montiel-Ponsoda, Maria Pia di Buono, Roser Saurí, 48 Fernando Bobillo, Mohammad Fazleh Elahi (2020), Report on 49 49 Vocabularies for Interoperable Language Resources and Services, 50223https://github.com/insight-centre/naisc available from https://cordis.europa.eu/project/id/825182/results 50 51224https://linghub.org 226https://nexuslinguarum.eu/ 51 AF. Khan et al. / When Linguistics Meets Web Technologies. Recent advances in Modelling Linguistic Linked Open Data 45

1ence is concerned with providing a formal basis for the data with the support of members of the COST action, 1 2analysis, representation, integration and exploitation of we will look at the corpora being produced in a Roma- 2 3linguistic data for language analysis (e.g. syntax, mor- nian language project. 3 4phology, terminology, etc.) and language applications The ReTeRom (Resources and Technologies for De- 4 5(e.g. machine translation, speech recognition, senti- veloping Human-Machine Interfaces in Romanian) 5 6ment analysis, etc.). NexusLinguarum seeks to iden- project229 is working towards adding the Romanian 6 7tify several key technologies to support such a study, language to the multilingual Linguistic Linked Open 7 8including language resources, data analysis, NLP, and Data cloud230. There are four different ReTeRom com- 8 9LLD. The latter is considered to be a cornerstone for ponents. These are CoBiLiRo, SINTERO231, TEPRO- 9 10the building of an ecosystem of multilingual and se- LIN232 and TADARAV233. We will focus on Co- 10 11mantically interoperable linguistic data technologies BiLiRo, 11 12and resources at a Web scale. Such an ecosystem is CoBiLiRo (Bimodal Corpus for Romanian Lan- 12 13 13 needed to foster the systematic cross-lingual discovery, guage), coordinated by the “Alexandru Ioan Cuza” 14 14 exploitation, extension, curation and quality control of University from Iai (UAIC), is working with a large 15 15 linguistic data. collection of parallel speech/text data [179]. This col- 16 16 One of the main research coordination objectives of lection is annotated on different levels on both the 17NexusLinguarum is to propose, agree upon and dis- 17 acoustic and the linguistic components [180], which 18seminate best practices and standards for linking data 18 19and services across languages. In that regard, an active 19 20collaboration has been established with W3C commu- 229https://www.racai.ro/p/reterom/index_en.html/ 20 21nity groups for the extension of existing standards such 230Note that several Romanian language resources (e.g. Ro- 21 22as OntoLex-Lemon as well as for the convergence of manian WordNet (RoWN), Romanian Reference Treebank 22 23standards in language annotation (see Section 4). Sev- (RoRefTrees or RRT), Corpus-driven linguistic data, etc.) are 23 currently in the process of conversion to LLD. The converter imple- 24 24 eral surveys of the state of the art are also being drafted mentation is open source (https://github.com/racai-ai/RoLLOD/) 25by the NexusLinguarum community covering differ- 231SINTERO (Technologies for the Realization of Human- 25 26ent salient aspects of the domain (e.g., multilingual Machine Interfaces for Text-to- with Expressivity), 26 27linking across different linguistic description levels). coordinated by Technical University of Cluj-Napoca (UTCN), pri- 27 28marily aims to implement a text-speech synthesis system in Roma- 28 A number of activities organised by NexusLinguarum nian that allows the modelling and control of prosody (intonation 29 29 have been planned with the aim of fostering collabo- in speech) in an appropriate way of natural speech. Secondly, SIN- 30ration and communication across communities. These TERO aims is to create as many voices synthesised in Romanian as 30 31include scientific conferences (e.g., LDK 2021227), and possible (in this project at least 10 voices), so that they too can be 31 32training schools (e.g., EuroLAN 2021228), where lin- used by an extended community, including in commercial applica- 32 33tions [176] 33 guistic linked data will take a central role. Finally, 232TEPROLIN (Technologies for Processing Natural Language 34 34 NexusLinguarum is also devoted to the collection and - Text) which is coordinated by the Research Institute for Artici- 35analysis of relevant use cases for linguistic data sci- fial Intelligence Mircea Drgnescu (ICIA), aims to create Roma- 35 36ence and to developing prototypes and demonstrators nian technologies that can be readily used by the 36 37that will address a selection of prototypical cases. In an other component-projects of ReTeRom. For instance, higher layers 37 38of annotation may be performed using TEPROLIN services: on the 38 initial phase, the definition of use cases will cover Hu- speech component - the prosodic annotation (e.g. decrease of the 39 39 manities and Social Sciences, Linguistics (Media and fundamental frequency) and on the textual component sub-syntactic 40Social Media, and Language Acquisition), Life Sci- (e.g. clauses) and syntactic annotation (e.g. parsing trees). TEPRO- 40 41ences, and Technology (Cybersecurity and FinTech). LIN works inside a major language processing and plat- 41 42form such as UIMA, GATE or TextFlows [177] 42 NexusLinguarum also places a strong emphasis on 233 43TADARAV (Technologies for automatic annotation of au- 43 lesser resourced languages. dio data and for the creation of automatic speech recognition in- 44 44 terfaces), coordinated by the University Politehnica of Bucharest 45A NexusLinguarum Use Case: ReTeRom (UPB), primarily aims to develop a set of advanced technologies for 45 46As an example of the kinds of complex, heteroge- generating transcripts aligned correctly with the voice signal from 46 47neous resources which have been proposed by consor- the body collected in the CoBiLiRo component project. Secondly, 47 48tium members as candidates for publication as linked TADARAV aims to increase the accuracy of the current SpeeD au- 48 tomatic speech recognition system [178] by requalifying its acoustic 49 49 model based on the entire body of speech collected and using more 50227http://2021.ldk-conf.org/ powerful language models generated in the TEPROLIN component 50 51228http://eurolan.info.uaic.ro/2021 project. 51 46 AF. Khan et al. / When Linguistics Meets Web Technologies. Recent advances in Modelling Linguistic Linked Open Data

1facilitates searching, editing and statistical analysis op- scription of the current state of affairs with respect to 1 2erations over it. Three types of formats pairing speech the use, definition and availability of models and vo- 2 3and text components were identified in the building cabularies for Linguistic Linked (Open) Data. We have 3 4of the CoBiLiRo repository: (1) PHS/LAB, a format also gone into some detail as to the role of these models 4 5which separates text, speech and alignment in dif- in various important initiatives, both past and present. 5 6ferent files; (2) MULTEXT/TEI, a format described Moreover, as we hope that the article has demon- 6 7initially in the MULTEXT project and later used by strated, this is an extremely active and dynamic area 7 8various language resource builders; (3) TEXTGRID, of research, with numerous projects and initiatives un- 8 9a format supported by a large community of Euro- derway, or due to commence in the short term, which 9 10pean developers and used in a large set of existing re- promise to bring further updates and improvements in 10 11sources. In order to share and distribute these bimodal expressivity and coverage in addition to the changes 11 12resources, a standard format for CoBiLiRo has been discussed here. For this reason, and in a vain attempt 12 13proposed, inspired by the TEI-P5.10 standard [181] to stave off the dangers of rapid obsolescence, we have 13 14and based on the idea of alignment between speech tried to situate our descriptions of recent advances in 14 15and the text components, taking into consideration sev- the field within a discussion of more general, ongo- 15 16eral annotation conventions proposed in 2007 by Li ing trends. This was indeed our specific intention with 16 17and Zhi-gang [182]. At present, the header of this for- Section 2 and in many other parts of the article. We 17 18mat includes the following metadata: source of the hope that this survey will give the reader a good idea 18 19object stored; speakers gender; speakers identity (if both of the future challenges which have yet to be fully 19 20she/he agreed to this); vocal type (spontaneous or in- confronted as well as the areas of immense opportunity 20 21reading); recording conditions; duration; speech file which currently remain untapped. 21 22type; speech-text alignment level, etc. Moreover, the In this rest of this section we will summarise the 22 23CoBiLiRo format allows for three types of segmen- future prospects/challenges described in this paper. In 23 24tation ("file - adequate for resources held in multiple the next and final subsection, Section 6.1, we focus on 24 25files, "startstop - adequate for resources that include two particular areas and suggest a possible future trend 25 26only one speech file, and "file-start-stop a combina- and a proposal for a further direction of research. 26 27tion of the two types described before) and speech-text In Section 3, we gave an overview of the most 27 28alignment, marked using tags. A tag in- well known and widely used models for linguistic 28 29cludes two child nodes: the that names the linked data, emphasising their FAIR-ness, and in par- 29 30file containing the speech component and the ticular their accessibility via ontology search engines, 30 31that points to the corresponding textual transcription whether and how licensing information is made avail- 31 32file. able, and how versioning is handled; we saw how, in 32 33As the preceeding example (one of many within many cases, there still remained work to be done in 33 34the project, falling within several different disciplines these areas. We classified these models into different 34 35or technical domain) demonstrates NexusLinguaram’s subject areas based on the LLOD cloud, which helped 35 36potential as a testing ground for many of the new vo- us to describe how different areas of linguistics were 36 37cabularies and modules mentioned above (as well as covered by the models with some areas clearly better 37 38for the potential of linked data as a paradigm for the served than others. We also briefly discussed the pro- 38 39modelling and publication of language data) through vision of dedicated tools for LLD models; again this is 39 40the analysis of complex multifaceted use cases involv- an area which is still very much under development. 40 41 41 ing several different types of language resources. Next, in Section 4 we looked at the latest develop- 42ments in LLD community standards. This section was 42 43divided into a subsection discussing OntoLex-Lemon 43 44related developments (Section 4.1), a section on the 44 6. Conclusions and Discussions of Future 45latest developments regarding LLD models for annota- 45 Challenges 46tion (Section 4.2), and a section on metadata (Section 46 474.3). Each of these sections features a detailed descrip- 47 We have attempted in the present article to give 48tion of different initiatives in their respective areas (in- 48 a comprehensive survey and a near-exhaustive234 de- 49cluding those still in progress), including in the case of 49 50Section 4.2 and Section 4.3 discussions of future trends 50 51234We were certainly exhausted after writing it. and prospects (Section 4.2.5 and Section 4.3.3 respec- 51 AF. Khan et al. / When Linguistics Meets Web Technologies. Recent advances in Modelling Linguistic Linked Open Data 47

1tively). The main challenge in the case of LLD vocab- ciplinary boundaries via training and teaching ini- 1 2ularies for annotation is to respond to the need for a tiatives. In other words, the Linguistic Linked Data 2 3convergence of vocabularies. In the case of metadata community could exploit both the technical and the 3 4vocabularies we looked at coverage issues, especially knowledge infrastructures provided by such Euro- 4 5with regard to language identification. pean Resource Infrastructure Consortia (ERICs) as 5 6Then in Section 5 we presented an overview of the CLARIN, DARIAH in order to further sustain the 6 7impact of projects on the definition and use of LLD work carried out in individual research projects and via 7 8models and vocabularies. We focused on a number of open community groups. 8 9ongoing projects and looked at their current and poten- A recent CLARIN event brought members of these 9 10tial future contributions to LLD models and vocabular- two communities together in order to initiate a dia- 10 11ies. In the rest of this concluding section we will look logue on future collaboration between the two236. The 11 12at one important potential future trend, the involve- event was well received, being a promising start for 12 13ment of research infrastructures alongside community future collaborations. 13 14groups and projects in the definition and ongoing de- Note that although we will not discuss it here (as it 14 15velopment of models and vocabularies (Section 6.1.1. would shift us too far into the realms of research pol- 15 16We will also make a proposal for handling the increas- icy), the role played by the European Open Science 16 17ing complexity of LLD vocabularies (especially in the Cloud237 will also be crucial here (at the very least for 17 18domain of language resources), namely, the recourse projects and initiatives taking place in Europe) and es- 18 19to ontology design patterns (Section 6.1.2). pecially for its consistent promotion of FAIR data. 19 20 20 216.1. Discussion of Future Trends and Challenges 6.1.2. A Proposal? The Use of Design Patterns 21 22The OntoLex-Lemon model has come to be used in 22 236.1.1. Linguistic Linked Data, Projects, and Research (or has at least been proposed for) a wide range of use 23 24Infrastructures cases pertaining to a range of different disciplines and 24 25In this article we have tried to underline the role of kind of resources. As we have seen, the original model 25 26research projects alongside that of community groups been extended to cover new kinds of use cases (or is 26 27such as the Open Linguistics Working Group or the planned to do so) by the W3C OntoLex group through 27 28W3C Ontology-Lexicon Community Group in driving the definition and publication of new extensions with 28 29the development of LLD vocabularies and models. Go- their own separate guidelines. In the long term, how- 29 30ing ahead however, the role of SSH research infrastruc- ever, this has the potential to become very complicated 30 31tures could begin to play an important role too in order, very quickly. 31 32that is, to ensure longer term hosting solutions and the Take for instance the creation of specialised lan- 32 33greater sustainability of resources and tools based on guage resources for such areas as morpho-syntax or 33 34these models. historical linguistics (in the former case these are dealt 34 35Research infrastructures could also help to give long with in in part in the original guidelines and in the new 35 36term support to the community groups which are de- morphology module). In both these cases, there are so 36 37veloping such models and vocabularies that is in ad- many different types (and sub-types) of resource as 37 38dition to and in a complimentary way to the sup- well as different theoretical approaches and schools of 38 39port received from projects and COST actions in the thought (not to mention language-specific modelling 39 40short term. That is, in a similar way to which the TEI requirements) that it would be difficult to produce 40 41Lex-0 initiative (described in Section 5.2.3) has been guidelines with detailed enough provision for any and 41 42supported both by a number of funded projects and all of the exigencies that might potentially arise. Or 42 43COST actions as well as by the DARIAH "Lexical Re- instead take the modelling of lexicographic resources 43 235 44sources" Working Group . (something which falls within the compass of the lex- 44 45At the same time research infrastructures could also icog extension, Section 4.1.1). This could encompass 45 46help in the dissemination of LLD vocabularies and numerous different kinds of sub-cases – e.g., etymo- 46 47models, making them more accessible to ever wider 47 48numbers of users and cutting across different dis- 48 236A textual summary of the virtual event and recordings of the 49 49 presentations and discussion can be found here https://www.clarin. 50235See https://dariah-eric.github.io/lexicalresources/pages/ eu/event/2021/clarin-cafe-linguistic-linked-data 50 51TEILex0/TEILex0.html. 237https://eosc-portal.eu/ 51 48 AF. Khan et al. / When Linguistics Meets Web Technologies. Recent advances in Modelling Linguistic Linked Open Data

1logical dictionaries, philological dictionaries, rhyming ODPs and would be based on competency questions, 1 2dictionaries – each of which brings its own specific va- e.g., potential SPARQL queries. 2 3rieties of modelling challenges. Furthermore, there can Any resulting OntoLex-Lemon ODPs could then ei- 3 4often also differing technical solutions to given mod- ther be hosted on the ontology design patterns site239, 4 5elling problems without a strong enough consensus on or a special repository, or both. ODPs would provide 5 6any single one of these. Such is the case with mod- a bridge between OntoLex guidelines and concrete ap- 6 7elling ordered sequences in RDF. plications; they would help to prevent those guide- 7 8One way of handling this potential modelling com- lines from becoming overly-complicated and unwieldy 8 9plexity and that avoids the drafting of ever more elab- and would keep the extensions themselves as simple 9 10orate guidelines and the definition of ever more spe- (and hopefully uncontroversial) as possible240. They 10 11cialised modules is by through the publication and would make models such as OntoLex-Lemon, and in- 11 12maintainence of a repository of ontology design pat- deed several of the other models featured in this arti- 12 13terns (ODP). ODP’s are modelling solutions for re- cle, more accessible. Furthermore, they would allow us 13 14curring problems in the field of ontology design and to recommend the re-use of other vocabularies without 14 15are intended as a means of enhancing resuability in having to include them ‘officially’ within the OntoLex- 15 16knowledge base design. In this they are based on pre- Lemon guidelines themselves, ensuring the decoupling 16 17vious work on patterns in software engineering. ODPs of the OntoLex-Lemon guidelines from these other vo- 17 18are arranged in six types [183]. These range from so cabularies. 18 19called Logical ODPs, i.e., patterns that deal with prob- 19 20lems in expressivity of formal knowledge engineering 20 21languages such as OWL (such as the representation of Acknowledgments 21 22n-ary relations), and Architectural ODPs which are 22 This article is based upon work from COST Action 23compositions of Logical ODPs, to Reasoning ODPs 23 NexusLinguarum (CA18209), supported by COST 24which propose procedures for automatic inference (for 24 (European Cooperation in Science and Technology). 25a full list see [183]). In our case the most relevant 25 The authors thank Milan Dojchinovski and Francesca 26of these types are the Content ODPs, which are de- 26 Frontini for several very helpful suggestions. 27scribed as solving domain specific problems. 27 28The idea would be to define, promote, and col- 28 29 29 lect OntoLex-Lemon specific design patterns (as well References 30as those pertaining to other vocabularies) within the 30 31 31 LLD community and beyond. This is not a com- [1] M.D. Wilkinson, M. Dumontier, I.J. Aalbersberg, G. Ap- 32pletely new idea and design patterns had been pro- pleton, M. Axton, A. Baak, N. Blomberg, J.-W. Boiten, 32 33 33 posed for OntoLex-Lemon’s predecessor lemon in the L.B. da Silva Santos, P.E. Bourne, J. Bouwman, A.J. Brookes, 34T. Clark, M. Crosas, I. Dillo, O. Dumon, S. Edmunds, 34 past. These are currently available on github238 and of- 35C.T. Evelo, R. Finkers, A. Gonzalez-Beltran, A.J.G. Gray, 35 fer templates for the creation of nominal, verbal and P. Groth, C. Goble, J.S. Grethe, J. Heringa, P.A.C. ’t Hoen, 36 36 adjectival lexical entries as well as more specific kinds R. Hooft, T. Kuhn, R. Kok, J. Kok, S.J. Lusher, M.E. Martone, 37 37 of the former categories such as Relational Nouns, A. Mons, A.L. Packer, B. Persson, P. Rocca-Serra, M. Roos, 38 38 State Verbs and Intersective Adjectives. These patterns R. van Schaik, S.-A. Sansone, E. Schultes, T. Sengstag, 39T. Slater, G. Strawn, M.A. Swertz, M. Thompson, J. van der 39 are fairly limited in scope and so our more specific pro- 40Lei, E. van Mulligen, J. Velterop, A. Waagmeester, P. Wit- 40 posal would be for patterns covering a variety of differ- tenburg, K. Wolstencroft, J. Zhao and B. Mons, The FAIR 41 41 ent areas/kinds of use cases. These patterns would ini- Guiding Principles for scientific data management and stew- 42 42 tially correspond to the different sections of the W3C ardship, Scientific Data 3(1) (2016), 160018. doi:10.1038/s- 43 43 Ontolex guidelines, such as for example the syntax and data.2016.18. https://www.nature.com/articles/sdata201618. 44[2] TEI Consortium, TEI P5: Guidelines for Electronic Text 44 semantics and the decomposition sections, as well as 45Encoding and Interchange, Zenodo, 2020. doi:10.5281/zen- 45 the lexicography module and the forthcoming Mor- odo.3992514. 46 46 phology and Frequency Attestation and Corpus (FrAC) 47 47 239 48modules. Each of these patterns would follow the set http://ontologydesignpatterns.org/wiki/Main_Page 48 240Although of course the original modules would still need 49of criteria proposed in for instance [183] for Content 49 to be revised and extended on the basis of new kinds of use- 50cases/modelling needs; ODPs would help to keep these to a mini- 50 51238https://github.com/jmccrae/lemon.patterns mal. 51 AF. Khan et al. / When Linguistics Meets Web Technologies. Recent advances in Modelling Linguistic Linked Open Data 49

1[3] G. Francopoulo, M. George, N. Calzolari, M. Monachini, [19] M. Ehrmann, F. Cecconi, D. Vannella, J.P. Mccrae, P. Cimi- 1 2N. Bel, M. Pet and C. Soria, Lexical markup framework ano and R. Navigli, Representing Multilingual Data as Linked 2 3(LMF), 2006. Data: the Case of BabelNet 2.0, in: Proceedings of the Ninth 3 [4] E. de la Clergerie and L. Clément, MAF: a morphosyntactic International Conference on Language Resources and Eval- 4 4 annotation framework, Actes de LTC (2005), 90–94. uation (LREC’14), N.C.C. Chair), K. Choukri, T. Declerck, 5[5] N. Guarino and C.A. Welty, An Overview of OntoClean, in: H. Loftsson, B. Maegaard, J. Mariani, A. Moreno, J. Odijk 5 6Handbook on Ontologies, Springer Berlin Heidelberg, 2004, and S. Piperidis, eds, European Language Resources As- 6 7pp. 151–171. sociation (ELRA), Reykjavik, Iceland, 2014. ISBN 978-2- 7 8[6] B. Mons, FAIR Science for Social Machines: Let’s Share 9517408-8-4. 8 [20] B. Klimek, N. Arndt, S. Krause and T. Arndt, Creating 9Metadata Knowlets in the Internet of FAIR Data and Services, 9 Data Intelligence 1(1) (2019), 22–42. Linked Data morphological language resources with MMoOn 10 10 [7] J. Bosque-Gil, J. Gracia, E. Montiel-Ponsoda and (2016). 11A. Gómez-Pérez, Models to represent linguistic linked [21] R. Forkel, The cross-linguistic linked data project, in: 3rd 11 12data, Natural Language Engineering 24(6) (2018), Workshop on Linked Data in Linguistics: Multilingual Knowl- 12 13811–859. doi:10.1017/S1351324918000347. https://www. edge Resources and Natural Language Processing, 2014, 13 p. 61. 14cambridge.org/core/journals/natural-language-engineering/ 14 article/models-to-represent-linguistic-linked-data/ [22] M. Kemps-Snijders, M. Windhouwer, P. Wittenburg and 15 15 805F3E46882414B9144E43E34E89457D. S.E. Wright, ISOcat: Corralling data categories in the wild, 16[8] H. Bohbot, F. Frontini, F. Khan, M. Khemakhem and L. Ro- in: 6th International Conference on Language Resources and 16 17mary, Nénufar: Modelling a Diachronic Collection of Dictio- Evaluation (LREC 2008), 2008. 17 18nary Editions as a Computational Lexical Resource, in: The [23] S. Farrar and D.T. Langendoen, A linguistic ontology for the 18 semantic web, GLOT international 7(3) (2003), 97–100. 19sixth biennial conference on electronic lexicography, eLex 19 [24] H. Aristar-Dry, S. Drude, M. Windhouwer, J. Gippert and 202019, 2019. 20 [9] M. Passarotti, F. Mambrini, G. Franzini, F.M. Cecchini, I. Nevskaya, Rendering endangered lexicons interoperable 21through standards harmonization: the relish project, in: LREC 21 E. Litta, G. Moretti, P. Ruffolo and R. Sprugnoli, Interlinking 2012: 8th International Conference on Language Resources 22through Lemmas. The Lexical Collection of the LiLa Knowl- 22 and Evaluation, European Language Resources Association 23edge Base of Linguistic Resources for Latin, Studi e Saggi 23 (ELRA), 2012, pp. 766–770. 24Linguistici 58(1) (2020), 177–212. 24 [25] D.T. Langendoen, Whither GOLD?, in: Development of Lin- [10] P. Cimiano, C. Chiarcos, J.P. McCrae and J. Gracia, Linguistic 25guistic Linked Open Data Resources for Collaborative Data- 25 Linked Data: Representation, Generation and Applications, 26Intensive Research in the Language Sciences, A. Pareja-Lora, 26 Springer International Publishing, Cham, 2020. B. Lust, M. Blume and C. Chiarcos, eds, MIT Press, 2019. 27[11] A. Pareja-Lora, M. Blume, B.C. Lust and C. Chiarcos (eds), 27 [26] C. Chiarcos and M. Sukhareva, Olia – ontologies of linguistic 28Development of linguistic linked open data resources for col- 28 annotation, Semantic Web 6(4) (2015), 379–386. 29laborative data-intensive research in the language sciences, 29 [27] C. Chiarcos, C. Fäth and F. Abromeit, Annotation Interop- MIT Press, Cambridge, 2019. ISBN 978-0-262-53625-7. 30erability for the Post-ISOCat Era, in: Proceedings of The 30 [12] C. Chiarcos, S. Hellmann and S. Nordhoff, Linking linguistic 3112th Language Resources and Evaluation Conference, 2020, 31 resources: Examples from the open linguistics working group, 32pp. 5668–5677. 32 in: Linked Data in Linguistics, Springer, 2012, pp. 201–216. 33[28] C.A. Ferguson, Diglossia, WORD 15(2) (1959), 325–340. 33 [13] Y. Le Franc, J. Parland-von Essen, L. Bonino, H. Lehväslaiho, doi:10.1080/00437956.1959.11659702. 34 34 G. Coen and C. Staiger, D2.2 FAIR Semantics: First recom- [29] D.G. Martin Haspelmath Matthew S. Dryer and B. Comrie, 35mendations (2020). doi:10.5281/ZENODO.3707985. https:// The world atlas of language structures, Oxford University 35 36zenodo.org/record/3707985. Press, 2005. 36 [14] P.-Y. Vandenbussche and B. Vatant, Metadata recommenda- 37[30] M.S. Dryer and M. Haspelmath (eds), WALS Online, Max 37 tions for linked open data vocabularies, Version 1 (2011), 38Planck Institute for Evolutionary Anthropology, Leipzig, 38 2011–12. 2013. https://wals.info/. 39 39 [15] P.-Y. Vandenbussche, G.A. Atemezing, M. Poveda-Villalón [31] P. Monachesi, A. Dimitriadis, R. Goedemans, A.-M. Mineur 40and B. Vatant, Linked Open Vocabularies (LOV): a gateway and M. Pinto, The typological database system, in: Proceed- 40 41to reusable semantic vocabularies on the Web, Semantic Web ings of the IRCS workshop on linguistic databases, 2001, 41 428(3) (2017), 437–452. pp. 181–186. 42 [16] J. McCrae, G. Aguado-de-Cea, P. Buitelaar, P. Cimi- 43[32] A. Dimitriadis, M. Windhouwer, A. Saulwick, R. Goedemans 43 ano, T. Declerck, A. Gómez-Pérez, J. Gracia, L. Hollink, and T. Bíró, How to integrate databases without starting a ty- 44 44 E. Montiel-Ponsoda and D. Spohr, Interchanging lexical re- pology war: The Typological Database System, The Use of 45sources on the semantic web, Language Resources and Eval- Databases in Cross-Linguistic Studies, Mouton de Gruyter, 45 46uation 46(4) (2012), 701–719, Publisher: Springer. Berlin (2009), 155–207. 46 47[17] P. Cimiano, J.P. McCrae and P. Buitelaar, Lexicon Model for [33] P. Westphal, C. Stadler and J. Pool, Countering language at- 47 48Ontologies: Community Report, W3C, 2016. https://www. trition with PanLex and the Web of Data, Semantic Web 6(4) 48 w3.org/2016/05/ontolex/. (2015), 347–353. 49 49 [18] G. Sérasset, DBnary: Wiktionary as a Lemon-based multi- [34] C. Chiarcos, C. Fäth and M. Ionov, The ACoLi dictionary 50lingual lexical resource in RDF, Semantic Web 6(4) (2015), graph, in: Proceedings of The 12th Language Resources and 50 51355–361, Publisher: IOS Press. Evaluation Conference, 2020, pp. 3281–3290. 51 50 AF. Khan et al. / When Linguistics Meets Web Technologies. Recent advances in Modelling Linguistic Linked Open Data

1[35] G. De Melo, Lexvo.org: Language-related information for [49] C. Chiarcos, M. Ionov, J. de Does, K. Depuydt, 1 2the linguistic linked data cloud, Semantic Web 6(4) (2015), A.F. Khan, S. Stolk, T. Declerck and J.P. McCrae, 2 3393–400. Modelling Frequency and Attestations for OntoLex- 3 [36] A. Stellato, M. Fiorelli, A. Turbati, T. Lorenzetti, W. van Lemon, in: Proceedings of the Globalex Work- 4 4 Gemert, D. Dechandon, C. Laaboudi-Spoiden, A. Gerencsér, shop on Linked Lexicography (@LREC 2020), 2020, 5A. Waniart, E. Costetchi and et al., VocBench 3: A collabo- pp. 1–9. https://lrec2020.lrec-conf.org/media/proceedings/ 5 6rative Semantic Web editor for ontologies, thesauri and lexi- Workshops/Books/GLOBALEX2020book.pdf#page=19. 6 7cons, Semantic Web 11(5) (2020), 855881–. doi:10.3233/SW- [50] S. Peroni and D. Shotton, FaBiO and CiTO: ontologies for 7 8200370. describing bibliographic resources and citations, Web Seman- 8 [37] A. Bellandi, E. Giovannetti and A. Weingart, Multilingual 9tics: Science, Services and Agents on the World Wide Web 17 9 and multiword phenomena in a lemon old occitan medico- (2012), 33–43. 10 10 botanical lexicon, Information 9(3) (2018), 52. [51] C. Chiarcos, T. Declerck and M. Ionov, Embeddings for the 11[38] C. Chiarcos, S. Hellmann and S. Nordhoff, Towards a Lin- Lexicon: Modelling and Representation, in: Proceedings of 11 12guistic Linked Open Data cloud: The Open LinguisticsWork- the 6th Workshop on Semantic Deep Learning (SemDeep- 12 13ing Group, TAL Traitement Automatique des Langues 52(3) 6), held virtually in January 2021, co-located with IJCAI- 13 (2011), 245–275. 14PRICAI 2020, Japan, 2021. 14 [39] C. Chiarcos, S. Nordhoff and S. Hellmann, Linked Data in [52] S. Stolk, lemon-tree: Representing Topical Thesauri on the 15 15 Linguistics, Springer, 2012. Semantic Web, in: 2nd Conference on Language, Data and 16[40] J.P. McCrae, S. Moran, S. Hellmann and M. Brümmer (eds), Knowledge (LDK 2019), Schloss Dagstuhl-Leibniz-Zentrum 16 17Semantic Web 6(4), Special Issue on Multilingual Linked fuer Informatik, 2019. 17 18Open Data, IOS Press, 2015, pp. 313–400. https://content. [53] S. Stolk, A Thesaurus of Old English as linguistic linked data: 18 iospress.com/journals/semantic-web/6/4. 19Using OntoLex, SKOS and lemon-tree to bring topical the- 19 [41] F. Khan, F. Boschetti and F. Frontini, Using lemon to model 20sauri to the Semantic Web, in: Proceedings of the eLex 2019 20 lexical semantic shift in diachronic lexical resources, in: Pro- conference, 2019, pp. 223–247. 21ceedings of the 3rd Workshop on Linked Data in Linguistics 21 [54] A.F. Khan, Towards the Representation of Etymological Data (LDL-2014): Multilingual Knowledge Resources and Natural 22on the Semantic Web, Information 9(12) (2018), 304. 22 Language Processing, 2014, pp. 50–54. 23[55] F. Khan, Representing Temporal Information in Lexical 23 [42] B. Klimek and M. Brümmer, Enhancing lexicography with 24Linked Data Resources, in: Proceedings of the 7th Work- 24 semantic language databases, Kernerman Dictionary News 23 shop on Linked Data in Linguistics (LDL-2020), European 25(2015), 5–10. 25 Language Resources Association, Marseille, France, 2020, 26[43] J. Bosque-Gil, J. Gracia, E. Montiel-Ponsoda and G. Aguado- 26 pp. 15–22. ISBN 979-10-95546-36-8. https://www.aclweb. de-Cea, Modelling multilingual lexicographic resources for 27org/anthology/2020.ldl-1.3. 27 the Web of Data: The K Dictionaries case, in: GLOBALEX 28[56] A. Burchardt, S. Padó, D. Spohr, A. Frank and U. Heid, For- 28 2016 Lexicographic Resources for Human Language Tech- 29malising Multi-layer Corpora in OWL/DL – Lexicon Mod- 29 nology Workshop Programme, 2016, p. 65. elling, Querying and Consistency Control, in: Proc. of the 30[44] F. Abromeit, C. Chiarcos, C. Fäth and M. Ionov, Linking the 30 3rd International Joint Conference on NLP (IJCNLP), Hyder- 31Tower of Babel: Modelling a Massive Set of Etymological 31 abad, India, 2008, pp. 389–396. 32Dictionaries as RDF, in: Proc. of the 5th Workshop on Linked 32 [57] K. Verspoor and K. Livingston, Towards Adaptation of Lin- 33Data in Linguistics: Managing, Building and Using Linked 33 Language Resources, 2016, p. 11. guistic Annotations to Scholarly Annotation Formalisms on 34 34 [45] J. Gracia, M. Villegas, A. Gómez-Pérez and N. Bel, The aper- the Semantic Web, in: Proc. of the 6th Linguistic Annotation 35tium bilingual dictionaries on the web of data, Semantic Web Workshop, Association for Computational Linguistics, Jeju, 35 369(2) (2018), 231–240. doi:10.3233/SW-170258. Republic of Korea, 2012, pp. 75–84. 36 [58] S. Hellmann, J. Lehmann, S. Auer and M. Brümmer, In- 37[46] J. Bosque-Gil, J. Gracia and E. Montiel-Ponsoda, Towards 37 tegrating NLP using Linked Data, in: Proc. 12th Interna- 38a Module for Lexicography in OntoLex, in: Proc. of the 38 LDK workshops: OntoLex, TIAD and Challenges for Word- tional Semantic Web Conference, 21-25 October 2013, Syd- 39 39 nets at 1st Language Data and Knowledge conference (LDK ney, Australia, 2013, also see http://persistence.uni-leipzig. 402017), Galway, Ireland, Vol. 1899, CEUR-WS, Galway (Ire- org/nlp2rdf/. 40 41land), 2017, pp. 74–84. ISSN 1613-0073. http://ceur-ws.org/ [59] N. Mazziotta, Building the syntactic reference corpus of me- 41 42Vol-1899/OntoLex{_}2017{_}paper{_}5.pdf. dieval French using notabene rdf annotation tool, in: Proc. of 42 the Fourth Linguistic Annotation Workshop, Association for 43[47] J. Bosque-Gil, D. Lonke, J. Gracia and I. Kernerman, Vali- 43 dating the OntoLex-lemon lexicography module with K Dic- Computational Linguistics, 2010, pp. 142–146. 44 44 tionaries’ multilingual data, in: Electronic lexicography in the [60] B. Almas, H. Cayless, T. Clérice, Z. Fletcher, V. Jolivet, P. Li- 4521st century. Proceedings of the eLex 2019 conference, 2019, uzzo, E. Morlock, J. Robie, M. Romanello, J. Tauber and 45 46pp. 726–746. J. Witt, Distributed Text Services (DTS). First Public Work- 46 47[48] B. Klimek, J.P. McCrae, M. Ionov, J.K. Tauber, C. Chiarcos, ing Draft, Technical Report, Github, 2019, version of May 23, 47 48J. Bosque-Gil and P. Buitelaar, Challenges for the Represen- 2019. 48 tations for Morphology in Ontology Lexicons, in: Proceed- [61] S. Cassidy, An RDF realisation of LAF in the DADA annota- 49 49 ings of Sixth Biennial Conference on Electronic Lexicogra- tion server, in: Proc. of the 5th Joint ISO-ACL/SIGSEM Work- 50phy, eLex 2019, 2019. https://elex.link/elex2019/wp-content/ shop on Interoperable Semantic Annotation (ISA-5), Hong 50 51uploads/2019/09/eLex_2019_33.pdf. Kong, 2010. 51 AF. Khan et al. / When Linguistics Meets Web Technologies. Recent advances in Modelling Linguistic Linked Open Data 51

1[62] N. Diewald, M. Hanl, E. Margaretha, J. Bingel, M. Kupietz, [75] N. Ide, K. Suderman, J. Pustejovsky, M. Verhagen and 1 2P. Banski´ and A. Witt, KorAP Architecture – Diving in the C. Cieri, The language application grid and galaxy, in: Pro- 2 3Deep Sea of Corpus Data, in: Proceedings of the Tenth Inter- ceedings of the Tenth International Conference on Language 3 national Conference on Language Resources and Evaluation Resources and Evaluation (LREC’16), 2016, pp. 457–462. 4 4 (LREC’16), 2016, pp. 3586–3591. [76] E. Hinrichs, N. Ide, J. Pustejovsky, J. Hajic, M. Hinrichs, 5[63] ISO, ISO 24612:2012. Language Resource Management - M.F. Elahi, K. Suderman, M. Verhagen, K. Rim, P. Stranák 5 6Linguistic Annotation Framework, Technical Report, ISO/TC et al., Bridging the LAPPS Grid and CLARIN, in: Proceed- 6 737/SC 4, Language resource management, 2012. https:// ings of the Eleventh International Conference on Language 7 8www.iso.org/standard/37326.html. Resources and Evaluation, 2018. 8 [77] E. Wilde and M. Duerst, RFC 5147 – URI Fragment Identi- 9[64] C. Chiarcos, POWLA: Modeling Linguistic Corpora in 9 OWL/DL, in: The Semantic Web: Research and Applications, fiers for the text/plain Media Type, Technical Report, Inter- 10 10 E. Simperl, P. Cimiano, A. Polleres, O. Corcho and V. Pre- net Engineering Task Force (IETF), Network Working Group, 11sutti, eds, Springer Berlin Heidelberg, Berlin, Heidelberg, 2008. 11 122012, pp. 225–239. ISBN 978-3-642-30284-8. [78] D. Filip, S. McCance, D. Lewis, C. Lieske, A. Lommel, 12 13[65] N. Ide and L. Romary, International Standard for a Lin- J. Kosek, F. Sasaki and Y. Savourel, Internationalization Tag 13 Set (ITS) Version 2.0, Technical Report, W3C Recommenda- 14guistic Annotation Framework, Natural language engineer- 14 ing 10(3–4) (2004), 211–225. tion 29 October 2013, 2013. 15 15 [66] S. Tittel, H. Bermúdez-Sabel and C. Chiarcos, Using RDFa to [79] J. Frey, M. Hofer, D. Obraczka, J. Lehmann and S. Hellmann, 16Link Text and Dictionary Data for Medieval French, in: Pro- DBpedia FlexiFusion the best of Wikipedia> Wikidata> your 16 17ceedings of the Eleventh International Conference on Lan- data, in: International Semantic Web Conference, Springer, 17 18guage Resources and Evaluation (LREC 2018), J.P. McCrae, 2019, pp. 96–112. 18 [80] R. Sanderson, P. Ciccarese and B. Young, Web Annotation 19C. Chiarcos, T. Declerck, J. Gracia and B. Klimek, eds, 19 Data Model, Technical Report, W3C Recommendation, 2017. 20European Language Resources Association (ELRA), Paris, 20 France, 2018. ISBN 979-10-95546-19-1. https://www.w3.org/TR/annotation-model/. 21[81] R. Sanderson, P. Ciccarese and B. Young, Web Annotation 21 [67] P. Ruiz Fabo, H. Bermúdez Sabel, C. Martínez Cantón and Vocabulary, Technical Report, W3C Recommendation, 2017. 22E. González-Blanco, The Diachronic Spanish Sonnet Cor- 22 https://www.w3.org/TR/annotation-vocab/. 23pus: TEI and linked open data encoding, data distribution, 23 [82] L. Isaksen, R. Simon, E.T. Barker and P. de Soto Cañamares, 24and metrical findings, Digital Scholarship in the Humanities 24 Pelagios and the emerging graph of ancient world data, in: (2020). 25Proceedings of the 2014 ACM conference on Web science, 25 [68] A. Gangemi, V. Presutti, D. Reforgiato Recupero, A.G. Nuz- 262014, pp. 197–201. 26 zolese, F. Draicchio and M. Mongiovì, Semantic Web Ma- [83] R. Simon, E. Barker, L. Isaksen and P. de Soto Cañamares, 27chine Reading with FRED, Semantic Web 8(6) (2017), 27 Linked Data Annotation Without the Pointy Brackets: Intro- 28873–893. 28 ducing Recogito 2, Journal of Map & Geography Libraries 29[69] P. Vossen, R. Agerri, I. Aldabe, A. Cybulska, M. van Erp, 29 13(1) (2017), 111–132. A. Fokkens, E. Laparra, A.-L. Minard, A.P. Aprosio, G. Rigau 30[84] P. Cimiano, C. Chiarcos, J.P. McCrae and J. Gracia, Linguis- 30 et al., Newsreader: Using knowledge resources in a cross- 31tic Linked Data in Digital Humanities, in: Linguistic Linked 31 lingual reading machine to generate more knowledge from 32Data, Springer, 2020, pp. 229–262. 32 massive streams of news, Knowledge-Based Systems 110 33[85] C. Chiarcos and M. Ionov, Ligt: An LLOD-Native Vocabu- 33 (2016), 60–85. lary for Representing Interlinear Glossed Text as RDF, in: 2nd 34 34 [70] N. Ide, J. Pustejovsky, C. Cieri, E. Nyberg, D. DiPersio, Conference on Language, Data and Knowledge (LDK 2019), 35C. Shi, K. Suderman, M. Verhagen, D. Wang and J. Wright, M. Eskevich, G. de Melo, C. Fäth, J.P. McCrae, P. Buite- 35 36The language application grid, in: International Workshop on laar, C. Chiarcos, B. Klimek and M. Dojchinovski, eds, Ope- 36 Worldwide Language Service Infrastructure, Springer, 2015, 37nAccess Series in Informatics (OASIcs), Vol. 70, Schloss 37 pp. 51–70. 38Dagstuhl–Leibniz-Zentrum fuer Informatik, Dagstuhl, Ger- 38 [71] S. Peroni, A. Gangemi and F. Vitali, Dealing with markup se- many, 2019, pp. 3:1–3:15. ISSN 2190-6807. ISBN 978-3- 39 39 mantics, in: Proceedings of the 7th International Conference 95977-105-4. doi:10.4230/OASIcs.LDK.2019.3. http://drops. 40on Semantic Systems, 2011, pp. 111–118. dagstuhl.de/opus/volltexte/2019/10367. 40 41[72] A. Gangemi, N. Guarino, C. Masolo, A. Oltramari and [86] S. Robinson, G. Aumann and S. Bird, Managing fieldwork 41 42L. Schneider, Sweetening ontologies with DOLCE, in: Inter- data with Toolbox and the , Lan- 42 national Conference on Knowledge Engineering and Knowl- 43guage Documentation & Conservation 1(1) (2007), 44–57. 43 edge Management, Springer, 2002, pp. 166–181. [87] L. Butler and H. Van Volkinburg, Fieldworks Language Ex- 44 44 [73] A. Fokkens, A. Soroa, Z. Beloki, N. Ockeloen, G. Rigau, plorer (FLEx), Technology Review 1(1) (2007), 1. 45W.R. Van Hage and P. Vossen, NAF and GAF: Linking [88] M.W. Goodman, J. Crowgey, F. Xia and E.M. Bender, Xigt: 45 46linguistic annotations, in: Proceedings 10th Joint ISO-ACL extensible interlinear glossed text for natural language pro- 46 47SIGSEM Workshop on Interoperable Semantic Annotation, cessing, Language Resources and Evaluation 49(2) (2015), 47 482014, pp. 9–16. 455–485. 48 [74] M. Verhagen, K. Suderman, D. Wang, N. Ide, C. Shi, [89] S. Nordhoff, Modelling and Annotating Interlinear Glossed 49 49 J. Wright and J. Pustejovsky, The LAPPS interchange format, Text from 280 Different Endangered Languages as Linked 50in: International Workshop on Worldwide Language Service Data with LIGT, in: Proceedings of the 14th Linguistic Anno- 50 51Infrastructure, Springer, 2015, pp. 33–47. tation Workshop, 2020, pp. 93–104. 51 52 AF. Khan et al. / When Linguistics Meets Web Technologies. Recent advances in Modelling Linguistic Linked Open Data

1[90] C. Chiarcos and C. Fäth, CoNLL-RDF: Linked Corpora Done on Language Resources and Evaluation (LREC’10), Euro- 1 2in an NLP-Friendly Way, in: Language, Data, and Knowl- pean Language Resources Association (ELRA), Malta, 2010, 2 3edge, J. Gracia, F. Bond, J.P. McCrae, P. Buitelaar, C. Chiar- pp. 43–47. http://www.lrec-conf.org/proceedings/lrec2010/ 3 cos and S. Hellmann, eds, Springer, Cham, Switzerland, 2017, pdf/163_Paper.pdf. 4 4 pp. 74–88. ISBN 978-3-319-59888-8. [103] D. Broeder, D. van Uytvanck, M. Gavrilidou, T. Trippel 5[91] M. Marcus, B. Santorini and M.A. Marcinkiewicz, Building and M. Windhouwer, Standardizing a Component Meta- 5 6a Large Annotated Corpus of English: The Penn Treebank, data Infrastructure, in: Proceedings of the Eight Interna- 6 7Computational Linguistics 19(2) (1993), 313–330. tional Conference on Language Resources and Evalua- 7 8[92] F. Mambrini and M. Passarotti, Linked Open . tion (LREC’12), N.C.C. Chair), K. Choukri, T. Declerck, 8 Interlinking Syntactically Annotated Corpora in the LiLa M.U. Dogan,˘ B. Maegaard, J. Mariani, A. Moreno, J. Odijk 9 9 Knowledge Base of Linguistic Resources for Latin, in: and S. Piperidis, eds, European Language Resources Associ- 10Proceedings of the 18th International Workshop on Tree- ation (ELRA), Istanbul, Turkey, 2012. ISBN 978-2-9517408- 10 11banks and Linguistic Theories (TLT, SyntaxFest 2019), 2019, 7-7. 11 12pp. 74–81. [104] M. Windhouwer, E. Indarto and D. Broeder, CMD2RDF: 12 13[93] M. Tamper, P. Leskinen, K. Apajalahti and E. Hyvönen, Us- Building a Bridge from CLARIN to Linked Open 13 ing biographical texts as linked data for prosopographical re- Data, Ubiquity Press (2017), Publisher: Ubiquity Press. 14 14 search and applications, in: Euro-Mediterranean Conference, doi:10.5334/bbi.8. 15Springer, 2018, pp. 125–137. [105] D. Broeder, I. Schuurman and M. Windhouwer, Experi- 15 16[94] C. Chiarcos, B. Kosmehl, C. Fäth and M. Sukhareva, ences with the ISOcat Data Category Registry, in: Pro- 16 17Analyzing Middle High German Syntax with RDF and ceedings of the Ninth International Conference on Lan- 17 18SPARQL, in: Proceedings of the Eleventh International Con- guage Resources and Evaluation (LREC’14), N.C.C. Chair), 18 ference on Language Resources and Evaluation (LREC- K. Choukri, T. Declerck, H. Loftsson, B. Maegaard, J. Mari- 19 19 2018), Miyazaki, Japan, 2018, pp. 4525–4534. ani, A. Moreno, J. Odijk and S. Piperidis, eds, European Lan- 20[95] C. Chiarcos, I. Khait, É. Pagé-Perron, N. Schenk, C. Fäth, guage Resources Association (ELRA), Reykjavik, Iceland, 20 21J. Steuer, W. Mcgrath, J. Wang et al., Annotating a low- 2014. ISBN 978-2-9517408-8-4. 21 22resource language with LLOD technology: Sumerian mor- [106] I. Schuurman, M. Windhouwer, O. Ohren and D. Zeman, 22 23phology and syntax, Information 9(11) (2018), 290. CLARIN Concept Registry: The New Semantic Registry, in: 23 [96] C. Chiarcos and C. Fäth, Graph-based annotation engineer- Selected Papers from the CLARIN Annual Conference 2015, 24 24 ing: towards a gold corpus for Role and Reference Gram- October 14–16, 2015, Wroclaw, Poland, Linköping Univer- 25mar, in: 2nd Conference on Language, Data and Knowledge sity Electronic Press, 2016, pp. 62–70. 25 26(LDK 2019), Schloss Dagstuhl-Leibniz-Zentrum fuer Infor- [107] P. Labropoulou, K. Gkirtzou, M. Gavriilidou, M. Deli- 26 27matik, 2019. giannis, D. Galanis, S. Piperidis, G. Rehm, M. Berger, 27 28[97] M. Ionov, F. Stein, S. Sehgal and C. Chiarcos, cqp4rdf: V. Mapelli, M. Rigault, V. Arranz, K. Choukri, G. Backfried, 28 Towards a Suite for RDF-Based Corpus Linguistics, J.M.G. Perez´ and A. Garcia-Silva, Making Metadata Fit for 29 29 in: European Semantic Web Conference, Springer, 2020, Next Generation Language Technology Platforms: The Meta- 30pp. 115–121. data Schema of the European Language Grid, in: Proceed- 30 31[98] C. Chiarcos, K. Donandt, H. Sargsian, M. Ionov and ings of the 12th Language Resources and Evaluation Confer- 31 32J.W. Schreur, Towards LLOD-based language contact stud- ence (LREC 2020), N. Calzolari, F. Bchet,´ P. Blache, C. Cieri, 32 33ies. A case study in interoperability, in: Proceedings of the 6th K. Choukri, T. Declerck, H. Isahara, B. Maegaard, J. Mar- 33 Workshop on Linked Data in Linguistics (LDL), 2018. iani, A. Moreno, J. Odijk and S. Piperidis, eds, European 34 34 [99] C. Chiarcos and L. Glaser, A Tree Extension for CoNLL- Language Resources Association (ELRA), Marseille, France, 35RDF, in: Proceedings of the Twelfth International Confer- 2020, pp. 3421–3430. 35 36ence on Language Resources and Evaluation (LREC-2020), [108] M. Gavrilidou, P. Labropoulou, E. Desipri, S. Piperidis, 36 37ELRA, Marseille, France, 2020, pp. 7161–7169. H. Papageorgiou, M. Monachini, F. Frontini, T. Declerck, 37 38[100] C. Chiarcos, POWLA: Modeling Linguistic Corpora in G. Francopoulo, V. Arranz and V. Mapelli, The META- 38 OWL/DL, in: The Semantic Web: Research and Applications, SHARE Metadata Schema for the Description of Lan- 39 39 Vol. 7295, D. Hutchison, T. Kanade, J. Kittler, J.M. Klein- guage Resources, in: Proceedings of the Eighth Interna- 40berg, F. Mattern, J.C. Mitchell, M. Naor, O. Nierstrasz, tional Conference on Language Resources and Evalua- 40 41C. Pandu Rangan, B. Steffen, M. Sudan, D. Terzopoulos, tion (LREC 2012), European Language Resources Associ- 41 42D. Tygar, M.Y. Vardi, G. Weikum, E. Simperl, P. Cimiano, ation (ELRA), 2012. http://www.lrec-conf.org/proceedings/ 42 43A. Polleres, O. Corcho and V. Presutti, eds, Springer Berlin lrec2012/pdf/998_Paper.pdf. 43 Heidelberg, Berlin, Heidelberg, 2012, pp. 225–239, Series Ti- [109] J.P. McCrae, P. Labropoulou, J. Gracia, M. Villegas, 44 44 tle: Lecture Notes in Computer Science. ISBN 978-3-642- V. Rodríguez-Doncel and P. Cimiano, One Ontology to 4530283-1 978-3-642-30284-8. Bind Them All: The META-SHARE OWL Ontology for 45 46[101] P. Cimiano, C. Chiarcos, J.P. McCrae and J. Gracia, Mod- the Interoperability of Linguistic Datasets on the Web, in: 46 47elling Linguistic Annotations, in: Linguistic Linked Data, The Semantic Web: ESWC 2015 Satellite Events, F. Gan- 47 48Springer, 2020, pp. 89–122. don, C. Guéret, S. Villata, J. Breslin, C. Faron-Zucker 48 [102] D. Broeder, M. Kemps-Snijders, D. Van Uytvanck, M. Wind- and A. Zimmermann, eds, Lecture Notes in Computer Sci- 49 49 houwer, P. Withers, P. Wittenburg and C. Zinn, A Data Cat- ence, Springer International Publishing, 2015, pp. 271–282. 50egory Registry- and Component-based Metadata Framework, ISBN 978-3-319-25639-9. https://link.springer.com/chapter/ 50 51in: Proceedings of the Seventh International Conference 10.1007/978-3-319-25639-9_42. 51 AF. Khan et al. / When Linguistics Meets Web Technologies. Recent advances in Modelling Linguistic Linked Open Data 53

1[110] S. Piperidis, The META-SHARE Language Resources Shar- [120] F. Gillis-Webber and S. Tittel, The shortcomings of language 1 2ing Infrastructure: Principles, Challenges, Solutions, in: Pro- tags for linked data when modeling lesser-known languages, 2 3ceedings of the Eight International Conference on Lan- in: 2nd Conference on Language, Data and Knowledge (LDK 3 guage Resources and Evaluation (LREC’12), N.C.C. Chair), 2019), Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 4 4 K. Choukri, T. Declerck, M.U. Dogan,˘ B. Maegaard, J. Mar- 2019. 5 5 iani, A. Moreno, J. Odijk and S. Piperidis, eds, European [121] S. Tittel and F. Gillis-Webber, Identification of Languages in 6Language Resources Association (ELRA), Istanbul, Turkey, Linked Data: A Diachronic-Diatopic Case Study of French, 6 72012. ISBN 978-2-9517408-7-7. in: Electronic lexicography in the 21st century. Proceedings 7 8[111] V. Rodriguez-Doncel and P. Labropoulou, Digital Represen- of the eLex 2019 conference, 2019, pp. 1–3. 8 tation of Licenses for Language Resources, in: Proceedings 9[122] F. Gillis-Webber and S. Tittel, A Framework for Shared 9 of the 4th Workshop on Linked Data in Linguistics: Resources Proceed- 10Agreement of Language Tags beyond ISO 639, in: 10 and Applications, Association for Computational Linguistics, ings of The 12th Language Resources and Evaluation Confer- 11 11 Beijing, China, 2015, pp. 49–58. doi:10.18653/v1/W15-4206. ence, 2020, pp. 3333–3339. http://aclweb.org/anthology/W15-4206. 12[123] S. Nordhoff, Linked data for linguistic diversity research: 12 [112] P. Labropoulou, D. Galanis, A. Lempesis, M. Greenwood, 13Glottolog/langdoc and asjp online, in: Linked Data in Lin- 13 P. Knoth, R. Eckart de Castilho, S. Sachtouris, B. Georgan- 14guistics, Springer, 2012, pp. 191–200. 14 topoulos, S. Martziou, L. Anastasiou, K. Gkirtzou, N. Manola [124] P. Cimiano, C. Chiarcos, J.P. McCrae and J. Gracia, Linguistic 15and S. Piperidis, OpenMinTeD: A Platform Facilitating Text 15 Linked Data, Springer, 2020. 16Mining of Scholarly Content, in: WOSP 2018 Workshop 16 [125] A. Pareja-Lora, M. Blume, B.C. Lust and C. Chiarcos, Devel- 17Proceedings, Eleventh International Conference on Lan- 17 opment of Linguistic Linked Open Data Resources for Collab- 18guage Resources and Evaluation (LREC 2018), European 18 Language Resources Association (ELRA), Miyazaki, Japan, orative Data-Intensive Research in the Language Sciences, 19 19 2018, pp. 7–12. ISBN 979-10-95546-20-7. http://lrec-conf. MIT Press, 2020. 20org/workshops/lrec2018/W24/pdf/13_W24.pdf. [126] L. Romary, M. Khemakhem, M. George, J. Bowers, F. Khan, 20 21[113] J.P. McCrae and P. Cimiano, Linghub: a Linked Data based M. Pet, S. Lewis, N. Calzolari and P. Banski, LMF Reloaded., 21 22portal supporting the discovery of language resources., SE- in: Asialex 2019, 2019. 22 [127] V.R. Doncel and E.M. Ponsoda, LYNX: Towards a Legal 23MANTiCS (Posters & Demos) 1481 (2015), 88–91. 23 [114] J.P. McCrae, P. Cimiano, V. Rodríguez Doncel, D. Vila- Knowledge Graph for Multilingual Europe, Law in Context. 24 24 Suero, J. Gracia, L. Matteis, R. Navigli, A. Abele, G. Vulcu A Socio-legal Journal 37(1) (2020), 1–4. 25and P. Buitelaar, Reconciling Heterogeneous Descriptions of [128] T. Burrows, E. Hyvönen, L. Ransom and H. Wijsman, 25 26Language Resources, in: Proceedings of the 4th Workshop Mapping Manuscript Migrations: Digging into Data for 26 27on Linked Data in Linguistics: Resources and Applications, the History and Provenance of Medieval and Renaissance 27 28Association for Computational Linguistics, Beijing, China, Manuscripts, Manuscript Studies: A Journal of the Schoen- 28 2015, pp. 39–48. doi:10.18653/v1/W15-4205. https://www. berg Institute for Manuscript Studies 3(1) (2018), 249–252. 29 29 aclweb.org/anthology/W15-4205. [129] E. Hyvönen, “Sampo” Model and Semantic Portals for Dig- 30 30 [115] G. Rehm, M. Berger, E. Elsholz, S. Hegele, F. Kintzel, ital Humanities on the Semantic Web., in: DHN, 2020, 31K. Marheinecke, S. Piperidis, M. Deligiannis, D. Gala- pp. 373–378. 31 32nis, K. Gkirtzou, P. Labropoulou, K. Bontcheva, D. Jones, [130] A. Khan, H. Bohbot, F. Frontini, M. Khemakhem and L. Ro- 32 33I. Roberts, J. Hajic,ˇ J. Hamrlová, L. Kacena,ˇ K. Choukri, mary, Historical Dictionaries as Digital Editions and Con- 33 V. Arranz, A. Vasiljevs, O. Anvari, A. Lagzdinš, J. Melnika, 34, , , , nected Graphs: the Example of Le Petit Larousse Illustré, in: 34 G. Backfried, E. Dikici, M. Janosik, K. Prinz, C. Prinz, 35Digital Humanities 2019, 2019. 35 S. Stampler, D. Thomas-Aniola, J.M. Gómez-Pérez, A. Gar- [131] A. Weingart and E. Giovannetti, A Lexicon for Old Occi- 36cia Silva, C. Berrío, U. Germann, S. Renals and O. Kle- 36 tan Medico-Botanical Terminology in Lemon., in: SWASH@ jch, European Language Grid: An Overview, in: Proceed- 37ESWC, 2016, pp. 25–36. 37 ings of the 12th Language Resources and Evaluation Con- 38[132] R. Costa, A. Salgado, A.F. Khan, S. Carvalho, L. Romary, 38 ference, European Language Resources Association, Mar- 39B. Almeida, M. Ramos, M. Khemakhem, R. Silva and 39 seille, France, 2020, pp. 3366–3380. ISBN 979-10-95546-34- T. Tasovac, MORDigital: The Advent of a New Lexicograph- 404. https://www.aclweb.org/anthology/2020.lrec-1.413. 40 ical Portuguese Project, in: eLex 2021 - Seventh biennial con- 41[116] M. Fiorelli, A. Stellato, J.P. Mccrae, P. Cimiano and 41 ference on electronic lexicography, Brno, Czech Republic, 42M.T. Pazienza, LIME: the metadata module for OntoLex, 42 2021. https://hal.inria.fr/hal-03195362. 43in: European Semantic Web Conference, Springer, 2015, 43 pp. 321–336. [133] V. Propp, Morphology of the folktale, Trans., Laurence Scott. 44 44 [117] R. Cyganiak, D. Wood and M. Lanthaler, RDF 1.1 Concepts 2nd ed., University of Texas Press, 1968. 45and Abstract Syntax, Technical Report, W3C Recommenda- [134] S. Thompson, Motif-index of folk-literature: A classification 45 46tion 25 February 2014, 2014. of narrative elements in folktales, ballads, myths, fables, me- 46 47[118] G. De Melo, Lexvo. org: Language-related information for dieval romances, exempla, fabliaux, jest-books, and local 47 legends, Revised and enlarged edition (1955–1958), Indiana 48the linguistic linked data cloud, Semantic Web 6(4) (2015), 48 393–400, Publisher: IOS Press. University Press, 1958. 49 49 [119] A. Phillips and M. Davis, BCP 47 – Tags for Identifying Lan- [135] H.-J. Uther, The Types of International Folktales: A Classifi- 50guages, Technical Report, Internet Engineering Task Force, cation and Bibliography. Based on the system of Antti Aarne 50 512006. http://www.rfc-editor.org/rfc/bcp/bcp47.txt. and Stith Thompson, Suomalainen Tiedeakatemia, 2004. 51 54 AF. Khan et al. / When Linguistics Meets Web Technologies. Recent advances in Modelling Linguistic Linked Open Data

1[136] T. Declerck, A. Kostová and L. Schäfer, Towards a Linked Linked Open Data, International Conference on Dublin Core 1 2Data Access to Folktales classified by Thompsons Motifs and and Metadata Applications 16 (2016), 19–20. 2 3Aarne-Thompson-Uthers Types, in: Proceedings of Digital [149] M. Curado Malta, P. Centenera and E. Gonzalez-Blanco, 3 Humanities 2017, ADHO, 2017. Using Reverse Engineering to Define a Domain Model: 4 4 [137] F. Diehr, M. Brodhun, S. Gronemeyer, K. Diederichs, The Case of the Development of a Metadata Applica- 5C. Prager, E. Wagner and N. Grube, Modellierung eines tion Profile for European Poetry, in: Developing Meta- 5 6digitalen Zeichenkatalogs für die Hieroglyphen des Klas- data Application Profiles, IGI Global, 2017, pp. 146–180. 6 7sischen Maya, in: 47. Jahrestagung der Gesellschaft für doi:10.4018/978-1-5225-2221-8. http://e-spacio.uned.es/fez/ 7 8Informatik, Digitale Kulturen, INFORMATIK 2017, Chem- view/bibliuned:365-Egonzalez9. 8 [150] M. Curado Malta, H. Bermúdez-Sabel, A.A. Baptista and 9nitz, Germany, September 25-29, 2017, M. Eibl and 9 M. Gaedke, eds, LNI, Vol. P-275, GI, 2017, pp. 1185–1196. E. Gonzalez-Blanco, Validation of a metadata application 10 10 doi:10.18420/in2017_120. profile domain model, International Conference on Dublin 11[138] G. Zólyomi, B. Tanos and S. Sövegjártó, The Electronic Text Core and Metadata Applications (2018), 65–75. 11 12Corpus of Sumerian Royal Inscriptions, 2008. [151] M. Curado Malta, Modelação de dados poéticos: Uma per- 12 13[139] É. Pagé-Perron, M. Sukhareva, I. Khait and C. Chiarcos, Ma- spectiva desde os dados abertos e ligados, in: Humanidades 13 Digitales. Miradas hacia la Edad Media, D. González 14chine translation and automated analysis of the sumerian lan- 14 guage, in: Proceedings of the Joint SIGHUM Workshop on and H. Bermudez Sabel, eds, De Gruyter, Berlin, 2019, 15 15 Computational Linguistics for Cultural Heritage, Social Sci- pp. 24–48. ISBN 978-3-11-058542-1. https://doi.org/10. 16ences, Humanities and Literature, 2017, pp. 10–16. 1515/9783110585421-004. 16 17[140] M. Hartung, M. Orlikowski and S. Veríssimo, Evaluating the [152] E. González-Blanco, S. Ros Muñoz, M.L. Díez Platas, 17 18Impact of Bilingual Lexical Resources on Cross-lingual Sen- J. De la Rosa, H. Bermúdez-Sabel, A. Pérez Pozo, L. Ay- 18 ciriex and B. Sartini, Towards an Ontology for Eu- 19timent Projection in the Pharmaceutical Domain, 2020. 19 ropean Poetry, DARIAH Annual Event 2019, Warsaw, 20[141] B.-P. Ivanschitz, T.J. Lampoltshammer, V. Mireles, 20 A. Revenko, S. Schlarb and L. Thurnay, A Semantic Poland, 2019. doi:10.5281/zenodo.3458772. https://zenodo. 21org/record/3458772#.Xhw_YOhKjIV. 21 Catalogue for the Data Market Austria., in: SEMANTICS [153] P.E. project, Network of ontologies - POSTDATA, Post- 22Posters&Demos, 2018. 22 data ERC project, [Online; accessed 2021-01-17]. http://http: 23[142] D. Lonke and J. Bosque Gil, Applying the OntoLex-lemon 23 //postdata.linhd.uned.es/results/. 24lexicography module to K Dictionaries’ multilingual data, K 24 [154] P.E. project, Postdata-core ontology, Postdata ERC project, Lexical News (KLN) (2019). https://kln.lexicala.com/kln28/ 25[Online; accessed 2021-01-17]. http://http://postdata.linhd. 25 lonke-bosque-gil-ontolex-lemon-lexicog/. 26uned.es/results/. 26 [143] G. Rehm, D. Galanis, P. Labropoulou, S. Piperidis, M. Welß, 27[155] P.E. project, Postdata-prosodic ontology, Postdata ERC 27 R. Usbeck, J. Köhler, M. Deligiannis, K. Gkirtzou, J. Fischer, project, [Online; accessed 2021-01-17]. http://http://postdata. 28C. Chiarcos, N. Feldhus, J. Moreno-Schneider, F. Kintzel, 28 linhd.uned.es/results/. 29E. Montiel, V. Rodríguez Doncel, J.P. McCrae, D. Laqua, 29 [156] P.E. project, Postdata-structural ontology, Postdata ERC I.P. Theile, C. Dittmar, K. Bontcheva, I. Roberts, A. Vasiljevs 30, project, [Online; 2021-01-17]. http://http://postdata.linhd. 30 and A. Lagzdinš, Towards an Interoperable Ecosystem of AI 31, uned.es/results. 31 and LT Platforms: A Roadmap for the Implementation of 32[157] M.L. Díez Platas, S. Ros Muñoz, E. González-Blanco, 32 Different Levels of Interoperability, in: Proceedings of the 33P. Ruiz Fabo and E. Álvarez Mellado, Medieval Span- 33 1st International Workshop on Language Technology Plat- ish (12th-15th centuries) Named Entity Recognition and 34 34 forms, European Language Resources Association, Marseille, Attribute Annotation System based on contextual in- 35France, 2020, pp. 96–107. ISBN 979-10-95546-64-1. https: formation, JASIST (Journal of the Association for In- 35 36//www.aclweb.org/anthology/2020.iwltp-1.15. formation Science and Technology) (2020). doi:https://- 36 [144] Slator, Slator 2021 Data-for-AI Market Report, Technical Re- 37doi.org/10.1002/asi.24399. 37 port, Slator, 2021. 38[158] J. De la Rosa, S. Ros Muñoz, E. González-Blanco, 38 [145] C. Chiarcos, M. Ionov, M. Rind-Pawlowski, C. Fäth, Á. Pérez Pozo, L. Hernández and A. Díaz Medina, Bertsifi- 39 39 J.W. Schreur and I. Nevskaya, LLODifying linguistic glosses, cation: Language modeling fine-tuning for Spanish scansion, 40in: Proceedings of Language, Data and Knowledge (LDK- 4th International Conference on Science and Literature (post- 40 412017), Galway, Ireland, 2017. poned due to COVID-19 crisis), Girona, 2020. 41 42[146] A. Pareja-Lora, B. Lust, M. Blume and C. Chiarcos, Develop- [159] J. De La Rosa, S. Ros Muñoz, E. González-Blanco and 42 ment of Linguistic Linked Open Data Resources for Collab- 43Á. Pérez Pozo, PoetryLab: An Open Source Toolkit for the 43 orative Data-Intensive Research in the Language Sciences, Analysis of Spanish Poetry Corpora, Carleton University and 44 44 The MIT Press, 2019. the University of Ottawa, Virtual Conference, 2020, DH2020. 45[147] H.B.-S. Sabine Tittel and C. Chiarcos, Using RDFa to Link doi:http://dx.doi.org/10.17613/rsd8-we57. https://hcommons. 45 46Text and Dictionary Data for Medieval French, in: Proc. of org/deposits/item/hc:31763/. 46 47the 6th Workshop on Linked Data in Linguistics (LDL-2018): [160] J. de la Rosa, S. Ros and E. González-Blanco, Predicting met- 47 Towards Linguistic Data Science 48, European Language Re- rical patterns in Spanish poetry with language models, arXiv 48 sources Association (ELRA), Paris, France, 2018. ISBN 979- preprint arXiv:2011.09567 (2020). 49 49 10-95546-19-1. [161] S. Krek, I. Kosem, J.P. McCrae, R. Navigli, B.S. Pedersen, 50[148] M. Curado Malta, P. Centenera and E. González-Blanco Gar- C. Tiberius and T. Wissik, European lexicographic infrastruc- 50 51cía, POSTDATA – Towards publishing European Poetry as ture (elexis), in: Proceedings of the XVIII EURALEX Interna- 51 AF. Khan et al. / When Linguistics Meets Web Technologies. Recent advances in Modelling Linguistic Linked Open Data 55

1tional Congress on Lexicography in Global Contexts, 2018, (CLiC-it 2019), R. Bernardi, R. Navigli and G. Semeraro, eds, 1 2pp. 881–892. CEUR-WS.org, Bari, Italy, 2019, pp. 1–8. 2 3[162] P. Banski,´ J. Bowers and T. Erjavec, TEI-Lex0 guidelines for [172] F.M. Cecchini, R. Sprugnoli, G. Moretti and M. Passarotti, 3 the encoding of dictionary information on written and spo- UDante: First Steps Towards the Universal Dependencies 4 4 ken forms, in: Electronic Lexicography in the 21st Century: Treebank of Dante’s Latin Works, in: Seventh Italian Con- 5Proceedings of ELex 2017 Conference, 2017. ference on Computational Linguistics, CEUR-WS. org, 2020, 5 6[163] F. Mambrini and M. Passarotti, Linked Open Treebanks. pp. 1–7. 6 7Interlinking Syntactically Annotated Corpora in the LiLa [173] C. Fäth, C. Chiarcos, B. Ebbrecht and M. Ionov, Fintan – 7 8Knowledge Base of Linguistic Resources for Latin, in: Pro- Flexible, Integrated Transformation and Annotation eNgi- 8 ceedings of the 18th International Workshop on Treebanks 9neering, in: Proceedings of the Twelfth International Confer- 9 and Linguistic Theories (TLT, SyntaxFest 2019), Associ- ence on Language Resources and Evaluation (LREC-2020), 10 10 ation for Computational Linguistics, Paris, France, 2019, ELRA, Marseille, France, 2020, pp. 7212–7221. 11pp. 74–81. doi:10.18653/v1/W19-7808. https://www.aclweb. [174] T. Declerck, J. McCrae, M. Hartung, J. Gracia, C. Chiar- 11 12org/anthology/W19-7808. cos, E. Montiel, P. Cimiano, A. Revenko, R. Sauri, D. Lee, 12 13[164] R. Sprugnoli, M. Passarotti, F.M. Cecchini and M. Pelle- S. Racioppa, J. Nasir, M. Orlikowski, M. Lanau-Coronas, 13 grini, Overview of the EvaLatin 2020 Evaluation Campaign, 14C. Fäth, M. Rico, M.F. Elahi, M. Khvalchik, M. Gonzalez and 14 in: Proceedings of LT4HALA 2020 - 1st Workshop on Lan- K. Cooney, Recent Developments for the Linguistic Linked 15 15 guage Technologies for Historical and Ancient Languages, Open Data Infrastructure, in: Proceedings of the Twelfth In- 16European Language Resources Association (ELRA), Mar- ternational Conference on Language Resources and Evalua- 16 17seille, France, 2020, pp. 105–110. ISBN 979-10-95546-53-5. tion (LREC 2020), N. Calzolari, F. Béchet, P. Blache, C. Cieri, 17 18https://www.aclweb.org/anthology/2020.lt4hala-1.16. K. Choukri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, 18 [165] M. Passarotti, M. Budassi, E. Litta and P. Ruffolo, The Lem- 19J. Mariani, H. Mazo, A. Moreno, J. Odijk and S. Piperidis, 19 lat 3.0 Package for Morphological Analysis of Latin, in: Pro- 20eds, ELRA, 2020, pp. 5660–5667, ELRA. 20 ceedings of the NoDaLiDa 2017 Workshop on Processing [175] R. Saurí, L. Mahon, I. Russo and M. Bitinis, Cross-Dictionary 21Historical Language 21 , Linköping University Electronic Press, Linking at Sense Level with a Double-Layer Classifier, in: 2017, pp. 24–31. 222nd Conference on Language, Data and Knowledge (LDK 22 [166] E. Litta, M. Passarotti and F. Mambrini, The Treatment of 232019), Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 23 Word Formation in the LiLa Knowledge Base of Linguis- 242019. 24 tic Resources for Latin, in: Proceedings of the Second In- [176] B. Lorincz, M. Nutu, A. Stan and G. Mircea, An Evaluation 25ternational Workshop on Resources and Tools for Deriva- 25 of Postfiltering for Deep Learning Based Speech Synthesis 26tional Morphology, Charles University, Faculty of Mathemat- 26 with Limited Data, in: IEEE 10th International Conference ics and Physics, Institute of Formal and Applied Linguistics, 27on Intelligent Systems (IS), 2020. 27 Prague, Czechia, 2019, pp. 35–43. https://www.aclweb.org/ 28[177] R. Ion, Teprolin: an Extensible, Online Text Preprocessing 28 anthology/W19-8505. 29Platform for Romanian, in: Proceedings of the ConsILR- 29 [167] F. Mambrini and M. Passarotti, Harmonizing Differ- 2018, 2018, pp. 69–76. 30ent Lemmatization Strategies for Building a Knowledge 30 31Base of Linguistic Resources for Latin, in: Proceed- [178] A.L. Georgescu, H. Cucu, A. Buzo and C. Burileanu, RSC: A 31 Romanian Read for Automatic Speech Recog- 32ings of the 13th Linguistic Annotation Workshop, Associ- 32 nition, in: Proceedings of the 12th Language Resources and 33ation for Computational Linguistics, Florence, Italy, 2019, 33 pp. 71–80. doi:10.18653/v1/W19-4009. https://www.aclweb. Evaluation Conference (LREC), 2020, pp. 6606–6612. 34 34 org/anthology/W19-4009. [179] D. Cristea, I. Pistol, S. Boghiu, A. Bibiri, D. Gifu, A. Scutel- 35[168] F. Mambrini and M. Passarotti, Representing Etymology in nicu, M. Onofrei, D. Trandabat and G. Bugeag, CoBiLiRo: 35 36the LiLa Knowledge Base of Linguistic Resources for Latin, a Research Platform for Bimodal Corpora, in: Proceedings 36 of the 1st International Workshop on Language Technology 37in: Proceedings of the 2020 Globalex Workshop on Linked 37 Platforms (IWLTP 2020), European Language Resources As- 38Lexicography, European Language Resources Association, 38 Marseille, France, 2020, pp. 20–28. ISBN 979-10-95546-46- sociation, 2020, pp. 22–27. 39 39 7. https://www.aclweb.org/anthology/2020.globalex-1.3. [180] D. Gifu, A. Moruz, C. Bolea, A. Bibiri and M. Mitrofan, The 40[169] G. Franzini, F. Zampedri, M. Passarotti, F. Mambrini and Methodology of Building CoRoLa, in: Revue Roumaine de 40 41G. Moretti, Græcissâre: Ancient Greek Loanwords in the Linquistique (Romanian Review of Linguistics)/ On design, 41 42LiLa Knowledge Base of Linguistic Resources for Latin., creation and use of of the Reference Corpus of Contemporary 42 Romanian and its analysis tools. CoRoLa, KorAP, DRuKoLA 43in: Seventh Italian Conference on Computational Linguistics, 43 J. Monti, F. Dell’Orletta and F. Tamburini, eds, CEUR-WS. and EuReCo / Conception, création et utilisation du Cor- 44 44 org, Bologna, 2020, pp. 1–6. http://ceur-ws.org/Vol-2769/ pus de Référence du Roumain Contemporain et de ses outils 45paper_06.pdf. d’analyse. CoRoLa, KorAP, DRuKoLA et EuReCo, Vol. 64, 45 46[170] A. Westerski and J.F. Sánchez-Rada, Marl Ontology Specifi- 2019, pp. 241–253. 46 47cation, V1.1 8 March 2016, 2016. http://www.gsi.dit.upm.es/ [181] C.M. Sperberg-McQueen and L. Burnard, Original editors, 47 48ontologies/marl/. revised and expanded under the supervision of the Technical 48 [171] G. Franzini, A. Peverelli, P. Ruffolo, M. Passarotti, H. Sanna, Council of the TEI Consortium. TEI P5: Guidelines for Elec- 49 49 E. Signoroni, V. Ventura and F. Zampedri, Nunc Est Aes- tronic Text Encoding and Interchange, 2018. 50timandum. Towards an evaluation of the Latin WordNet, [182] A. Li and Z. Yin, Standardization of Speech Corpus, in: Data 50 51in: Sixth Italian Conference on Computational Linguistics Science Journal, Vol. 6, 2007. 51 56 AF. Khan et al. / When Linguistics Meets Web Technologies. Recent advances in Modelling Linguistic Linked Open Data

1[183] V. Presutti, E. Blomqvist, E. Daga and A. Gangemi, Pattern- module which is described in Section A.2 . ontoLex- 1 2based ontology design, in: Ontology Engineering in a Net- ical Entry is related to Lexical Sense via the sense 2 3worked World, Springer, 2012, pp. 35–64. property (and its inverse isSenseOf. Members of the 3 4Lexical Sense class represent 4 5 5 the lexical meaning of a lexical entry when inter- 6 6 Appendix A. The OntoLex-Lemon Model preted as referring to the corresponding ontology 7 7 element...A link between a lexical entry and an on- 8 8 In order to make this paper as self-contained as pos- tology entity via a Lexical Sense object implies 9 9 sible we will introduce the OntoLex-Lemon model that the lexical entry can be used to refer to the 10 10 in the appendix. We will describe the core module ontology entity in question241. [Emphasis ours] 11with an example and briefly describe its different sub- 11 12modules. The full guidelines can be found at the fol- The Lexical Sense object relates to a corresponding 12 13lowing URL: https://www.w3.org/2016/05/ontolex/. ontology entity via the reference property; Lexical En- 13 14This appendix is stand-alone and can be skipped by try individuals can be related directly to these ontology 14 15those who are already familiar with the OntoLex- entities via the property denotes242. A second class 15 16Lemon model. which is used to describe the semantics of an entry 16 17is the Lexical Concept class. Members of the latter 17 18A.1. Introduction and the Core Module class represent "a mental abstraction, concept or unit 18 19of thought that can be lexicalized by a given collec- 19 20As has been mentioned several times in this arti- tion of senses"243; Lexical Concept is a subclass of the 20 21cle, OntoLex-Lemon, is an RDF-native model for the SKOS class Concept. 21 22modelling and publication of ontology-lexicons (that We will show how some of these classes and prop- 22 23is language resources that consist both of a lexical erties are used in practise with an example entry from 23 24and an ontological component) on the Semantic Web. Wiktionary, the open-source online multilingual dic- 24 25OntoLex-Lemon is an update of the original lemon tionary. We will take the example of the Urdu word 25 26model and as with the latter model it has the aim of zn (zuban) which means both ‘tongue’ and ‘lan- 26 27enriching or grounding linked data ontologies with lin- guage’244; a screenshot of the entry can be seen in Fig- 27 28guistic information. However it has increasingly come ure 9. 28 29to be used for the representation of linked data lexi- The entry contains some standard morphosyntactic 29 30cons without any ontological component (and indeed and semantic information about the word, quite a lot 30 31extensions of OntoLex-Lemon such as lexicog 4.1.1 of which can already be encoded using the OntoLex- 31 32and FrAC 4.1.3 are strongly motivated by lexical use- Lemon core, along with a select number of properties 32 33cases rather than ontological ones). from the LexInfo vocabulary (described above in Sec- 33 34OntoLex consists of a core module and four addi- tion 3.5). 34 35tional thematic modules three of which we will de- Figure 10 shows the encoding of zn as a OntoLex- 35 36scribe in the following sections (the Metadata module Lemon Word. Here we can note the use of the LexInfo 36 37is described in Section 4.3.3). In the current section we vocabulary to specify the part of speech and gender of 37 38will look at the core module, presented in Figure 8, the word and the use of the lime vocabulary to specify 38 39with the help of an example. As the figure shows, the the language of the entry. We can also see the relation 39 40core of OntoLex-Lemon is essentially based around of the entry to the different forms of the word via the 40 41the class Lexical Entry and the different properties and two object properties canonicalForm (linking it up to 41 42relationships that it can have. Lexical Entry has the sub- its headword form) and otherForm (linking the word 42 43classes Word, Multiword Expression and Affix. We can up to its different morphological variants). In addition 43 44associate information about the different forms that a 44 45Lexical Entry can have via the Form class. The Form 241Definition taken from https://www.w3.org/2016/05/ontolex/ 45 46class can be associated with written representations #lexical-sense-reference. 46 47via the representation properties and its sub-properties. 242The property denotes is equivalent to the property chain sense 47 48The semantics of a Lexical Entry can be described us- o reference. 48 243Definition taken from https://www.w3.org/2016/05/ontolex/ 49 49 ing the Lexical Sense and the Lexical Concept classes #lexical-concept. 50(a fuller ontological based description of the semantics 244The entry can be found here https://en.wiktionary.org/wiki/ 50 51of an entry can be found in the Syntax and Semantics %D8%B2%D8%A8%D8%A7%D9%86#Urdu. 51 AF. Khan et al. / When Linguistics Meets Web Technologies. Recent advances in Modelling Linguistic Linked Open Data 57

1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 11 11 12 12 13 13 14 14 15 15 16 16 17 17 18 18 19 19 20Fig. 8. The OntoLex-Lemon Core. 20 21 21 22 22 23 23 24 24 25 25 26 26 27 27 28 28 29 29 30 30 31 31 32 32 33 33 34 34 35 35 36 36 37 37 38 38 39 39 40 40 41 41 42 42 43 43 44 44 45 45 46 46 47 47 48 48 Fig. 9. (zuban) 49zn 49 50 50 51 51 58 AF. Khan et al. / When Linguistics Meets Web Technologies. Recent advances in Modelling Linguistic Linked Open Data

1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 11 11 12 12 13 13 14 14 15Fig. 10. The Word zn . 15 16 16 17 17 we see the link to the two senses from the Wiktionary the interested reader is invited to consult the OntoLex- 18 18 entry encoded as _sense1 and _sense2 respec- Lemon guidelines. 19zn zn 19 tively, both of which senses link to DBpedia pages and 20 20 the denotes property which links the entry straight to 21A.3. Decomposition 21 these two. We show a form, the direct plural form of 22 22 the noun, and a sense, the second sense of the word 23The OntoLex-Lemon Decomposition module (de- 23 meaning ‘language’. Both of these are presented in 24comp for short), represented in Figure 13, is designed 24 Figure 11. Here we can see the use of the writtenRep 25to indicate "which elements constitute a multiword or 25 property to associate both the roman and arabic script 247 26compound lexical entry" . It does this via the Com- 26 versions of the word’s orthography, and the use of ref- 27ponent class and the properties subterm, constituent, 27 erence to link the sense of the word to its ontological 28and correspondsTo. 28 reference. 29For instance, to return to our earlier example, take 29 30the derived term which was listed in the Wiktionary 30 A.2. Syntax and Semantics 31entry for zn , namely ±drzn madri zuban. This ex- 31 32pression literally means ‘mother tongue’ or ‘native lan- 32 The next module we will look at is the Syntax and 33guage’ and can be decomposed into two subterms us- 33 Semantics module of OntoLex-Lemon, shortened as 34ing the decomp property subterm. That is it can be de- 34 synsem. It contains a number of classes and proper- 35composed into the two entries zn and ±dr , the latter 35 ties for modelling the syntactic behaviour of words and 36meaning ‘maternal’, as in Figure 14 (note that we class 36 the relationship(s) between this syntactic behaviour 37madri zuban as a MultiWordExpression). 37 and the semantic properties of those words. Figure 38Other examples and details of the other properties 38 12 presents these classes and properties. The two 39can be found, as always, in the OntoLex-Lemon guide- 39 classes used to model syntactic behaviour are Syntac- 40lines. 40 tic Frame245 and Syntactic Argument246. These can be 41 41 related to corresponding ontological predicates via the 42A.4. Variation and Translation 42 OntoMap class and various object properties. To de- 43 43 scribe these mechanisms here would take us too far be- 44Finally in this appendix we will look at the Vari- 44 yond the purpose of this brief introduction. However 45ation and Translation module (also known as var- 45 46trans). This module is concerned with representing 46 47245This "represents the syntactic behavior of an open class word "relations between lexical entries and lexical senses 47 48in terms of the (syntactic) arguments it requires". See https://www. that are variants of each other." In this context these 48 w3.org/2016/05/ontolex/#syntactic-frames 49 49 246This "represents a slot that needs to be filled for a certain 50syntactic frame to be complete." See https://www.w3.org/2016/05/ 247https://www.w3.org/2016/05/ontolex/ 50 51ontolex/#syntactic-frames. #decomposition-decomp 51 AF. Khan et al. / When Linguistics Meets Web Technologies. Recent advances in Modelling Linguistic Linked Open Data 59

1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 Fig. 11. A form and a sense belonging to zn . 9 9 10 10 11 11 12 12 13 13 14 14 15 15 16 16 17 17 18 18 19 19 20 20 21 21 22 22 23 23 24 24 25 25 26 26 Fig. 12. The OntoLex-Lemon Synsem Module. 27 27 28 28 29 29 30 30 31 31 32 32 33 33 34 34 35 35 36 36 37 37 38 38 39 39 40 40 41 41 42 42 43 43 44Fig. 13. The OntoLex-Lemon Decomposition Module. 44 45 45 46 46 entries can be variants via lexico-semantic relations words, defining a class Lexico-semantic Relation with 47 47 48such as hyponymy and synonymy or because they are subclasses, Lexical Relation and Sense Relation . 48 49translations. The classes and properties of the vartrans We will illustrate the use of this module with two 49 50module are shown in Figure As can be seen the var- examples from our running zn example .In the first 50 51trans module reifies lexico-semantic relations between case we relate the word zn to one of its synonyms, 51 60 AF. Khan et al. / When Linguistics Meets Web Technologies. Recent advances in Modelling Linguistic Linked Open Data

1 1 2 2 3 3 4 4 5 5 6Fig. 14. Multi-word Expression Example. 6 7 7 8 8 9bAÚ bhasa ‘language’ – or rather we relate the sense tively. In Fig 16 on the other hand we represent a trans- 9 10of the word meaning ‘language’ to a sense of the word lation relation between and the first sense of the En- 10 11which means the same thing. This is represented in Fig glish word language. This is done in much the same 11 1215 using vartrans properties and classes. We specify way as in the previous example, except that we spec- 12 13the relation as a Sense Relation, use the LexInfo class ify that the relation is one of direct equivalence using 13 14synonym to categorise it and specify its source and tar- the property category and a term from a terminological 14 15get using the vartrans source and target objects respec- vocabulary. 15 16 16 17 17 18 18 19 19 20 20 21 21 22 22 23 23 24 24 25 25 26 26 27 27 28 28 29 29 30 30 31 31 32 32 33 33 34 34 35 35 36 36 37 37 38 38 39 39 40 40 41 41 42 42 43 43 44 44 45 45 46 46 47 47 48 48 49 49 50 50 51 51 AF. Khan et al. / When Linguistics Meets Web Technologies. Recent advances in Modelling Linguistic Linked Open Data 61

1 1 2 2 3 3 4 4 5Fig. 15. Example of the use of vartrans. 5 6 6 7 7 8 8 9 9 10 10 11 11 12 12 13 13 14 14 15 15 16 16 17 17 18 18 19 19 20Fig. 16. Example of the use of vartrans. 20 21 21 22 22 23 23 24 24 25 25 26 26 27 27 28 28 29 29 30 30 31 31 32 32 33 33 34 34 35 35 36 36 37 37 38 38 39 39 40 40 41 41 42 42 43 43 44 44 45 45 46 46 47 47 48 48 49 49 50Fig. 17. The OntoLex-Lemon Vartrans Module. 50 51 51