<<

Danish in lexemes

Finn Arup˚ Nielsen Cognitive Systems, DTU Compute, Technical University of Denmark Kongens Lyngby, Denmark

Abstract 2 Wikidata lexemes

Wikidata introduced support for lexico- Wikidata stores lexeme data on a new type of pages graphic data in 2018. Here we describe the prefixed with the letter ‘L’ and further identified lexicographic part of Wikidata as well as with an integer, e.g., the Danish lexeme gentagelse experiences with setting up lexemes for the (repetition) has the identifier “L117” and available . We note various possible for view and edit at https://www.wikidata.org/ annotations for lexemes as well as discuss /Lexeme:L117. On the same page, multiple various choices made. senses and forms for the lexeme may be defined. 1 Introduction They are identified by suffixes to the lexeme iden- tifier, e.g., the plural indefinite form gentagelser ’s structured sister Wikidata (Vrandeˇci´c would be identified as “L117-F3”, while the first and Kr¨otzsch, 2014) at https://www.wikidata. sense—if it was defined—would have been identi- org/ supports interlinking different language ver- fied as “L117-S1”. Forms and senses are defined sions of Wikipedia as well as several other Wikime- separately, so it is currently difficult to define a dia sites, such as and Wikimedia Com- specific sense for a specific form. Lexemes, forms mons. One wiki that has been missing from the list and senses may be associated with properties, and is Wiktionary, — the wiki. Wiktionary these properties are identified with integer prefixed has a structure different from the other as with the letter ‘P’. multiple different and concepts might be de- The Wikidata lexeme data maps to an RDF scribed on the same page, only connected through representation,2 and the RDF data is queryable the same orthographic representation. via the Wikidata Query Service SPARQL end- In 2018, Wikidata enabled support for lexico- point at https://query.wikidata.org/. The graphic data via special lexeme wiki pages. Com- mapping uses part of the Lexicon Model for On- pared to Wiktionary, Wikidata lexemes offer a so- tologies (LEMON) ontology (Cimiano et al., 2016; lution with directly machine-readable data: it is McCrae et al., 2012). The central OWL con- not necessary to write parsers to obtain the lex- cepts for the lexeme data are ontolex:LexicalEntry, eme data in a structured format. Wikidata lex- ontolex:Form and ontolex:LexicalSense for lex- emes also reduce the amount of redundant in- eme, form and sense respectively with the prefix put: In Wiktionary, each language edition sets http://www.w3.org/ns/lemon/ontolex#. Each up its own dictionary, and a described in of these three OWL concepts has associated ba- one Wiktionary is not directly available in another sic data in Wikidata. Apart from the identifier, Wiktionary. Further issues with Wiktionary are the lexeme has the lemma, language and lexical the linkage to the Wikidata concept ontology and category, the form has its orthographic representa- the linkage to external resources such as WordNet tion and grammatical features while the sense may (Miller, 1995). Neither of these links is non-trivial have multiple glosses. This basic data cannot be to set up, though matching lexical entries between associated with qualifiers and references like nor- Wiktionary and WordNet may be done with good mal Wikidata properties. accuracy (McCrae et al., 2012). Links from Wikidata lexemes (L-pages) to Q- Below we will describe how lexemes are sup- items (i.e., the ordinary Wikidata items) are of two 1 ported on Wikidata, and list some of the Danish kinds: Either for the description of the lexical and resources relevant for Wikidata lexemes. Then we grammatical “metadata” for the lexeme or for the will detail how Danish lexemes have been anno- description of the meaning of a sense. In the latter tated and discuss some of the choices made. case, the Q-items function as the notion 1There is an introduction to Wikidata lexemes on Wikidata itself at https://www.wikidata.org/wiki/ 2https://www.mediawiki.org/wiki/Extension: Wikidata_talk:Lexicographical_data. WikibaseLexeme/RDF_mapping Danish Total Description SPARQL query fragment 1268 43816 Number of lexemes [] a ontolex:LexicalEntry 4826 118742 Number of forms [] a ontolex:Form 617 11194 Number of sense [] a ontolex:LexicalSense 8594 218803 Number of grammatical feature links [] wikibase:grammaticalFeature []

Table 1: Statistics for lexeme data in Wikidata. See also the statistics displayed on the Ordia website at https://tools.wmflabs.org/ordia/statistics/. of synsets. Wikidata has specific properties to link the Danish part of the Europarl corpus (Koehn, Q-items to synsets in external lexical resources, 2005). Fairy tales by Hans Christian Andersen can including BabelNet (Navigli and Ponzetto, 2010) be found with modern spelling. (P2581) and the Collaborative Interlingual Index Of the lexicographical resources, the standard (P5063)(Bond et al., 2016). Alternatively, the Danish dictionary, Retskrivningsordbogen, has a re- more generic property for Linked Open data URIs strictive license. Another large Danish dictionary (P2888) can be used. Wikidata has linked some with over 300’000 entries and used, e.g., with the WordNet synsets used in ImageNet. The corre- computer program aspell, is under the GNU Gen- spondence between the resources is not necessarily eral Public License and is not compatible with straightforward to establish (Nielsen, 2018). Wikidata’s CC0. DanNet (Pedersen et al., 2009) Some statistics for the lexeme data in Wikidata has a WordNet-derived open license and a Wiki- are displayed in Table1. It displays, e.g., that the data property (P6140) for the DanNet words — number of forms is close to 120’000. In compari- corresponding to Wikidata lexemes—has been cre- son, the English Wiktionary has currently around ated in November 2018. DanNet is distributed as 5.9 million content pages, while the Danish Wik- OWL, so should fit well with Wikidata lexemes. tionary has around 38 thousand.3 The numbers NST Lexical database for Danish4 has Speech are not directly comparable as multiple forms may Assessment Methods Phonetic (SAMPA) be listed on one Wiktionary content page. A count pronunciation specification for over 235’000 Danish on the distinct number of (monolingual) form rep- words and stated to have the CC0 license. resentations in Wikidata gives 89’728 on 27 March As of June 2019, we have used 160 sentences 2019 based on the following SPARQL query: from the Danish part of the Europarl corpus,5 and 6 SELECT linked to 1258 DanNet 2.2 word identifiers, while (COUNT(DISTINCT(?representation)) the NST phonetic data has hardly been used. AS ? count ) { [] ontolex:representation 4 Annotating lexemes, forms and ?representation . } senses

3 Danish resources Wikidata has a continuously growing number of properties that can be used to annotate lexemes, There are some Danish resources relevant for Wiki- forms and senses. General properties—that are rel- data lexemes, e.g., corpora for language usage ex- evant for lexemes of most word classes—are usage amples. As Wikidata is distributed under the Cre- example (P5831), word stem (P5187) and derived ative Commons Zero (CC0) license, the resources from (P5191), where the latter may indicate et- incorporated into Wikidata need to be compatible ymological origin or origin of derivations. Com- with that license. pound parts may be linked with a property (P5238) Old out-of-copyright Danish works are typically and the order of the parts may be specified with a with an antiquated spelling, e.g., where the first property used as a qualifier (P1545). The Wikidata letter of nouns has a capital letter. Wikipedia and property for DanNet words (P6140) are linked to Wiktionary may not be used because their content version 2.2 of the resource. As of March 2019, 844 is under an attribution and share-alike license, not lexemes with associated information about DanNet compatible with the CC0 license. Modern Dan- words are linked.7 The data model of Wikidata al- ish sentences can be retrieved from, e.g., Danish 4 https://www.nb.no/sprakbanken/show?serial= law texts at https://www.retsinformation.dk/, sbr-26 Danish of international treaties and 5See Ordia’s statistics at https://tools.wmflabs. conventions, such as the Treaty of Lisbon, and org/ordia/reference/Q5412081 6The SPARQL query SELECT ?dannet { ?lexeme 3 https://en.wiktionary.org/wiki/Special: wdt:P6140 ?dannet } on the Wikidata Query Service. Statistics and https://da.wiktionary.org/wiki/ 7https://www.wikidata.org/wiki/Property_ Speciel:Statistik talk:P6140 displays the DanNet property statistics. lows for the specification of “no value”, thus it is 4.3 Images and audio possible to specify that a lexeme cannot be found Senses may be associated with images by referenc- in the DanNet 2.2 resource. For instance, adverbs ing filenames in the free media archive Wikimedia and rare nouns, such as lommevogn (L40687), are Commons. The link may help language learners not in DanNet and indicated as such. The usage and possibly be a resource for training natural lan- example property (P5831) can store a short free- guage processing machine learning models in the form text and the qualifier stated in (P248) can same way that ImageNet has used WordNet, see point to a Q-item with metadata about a work (Nielsen, 2018; Nielsen and Hansen, 2018) for ap- where the text appears. A related property is at- plications of the use of Wikidata. Typically the tested in (P5323) which also can point to a Q-item. senses of nouns may be associated with images, Lexemes may also be associated with classes via while it may be difficult to identify good images the instance of property (P31). Properties rele- to be associated with, e.g., adverbs. A few Danish vant for lexemes across word classes in Danish are, verbs have been associated with images, e.g., g˚a e.g., whether they loan words and/or compound (walk) and “visual” , such as rød (red), words. are also associated with images. Forms may be associated with hyphenation and Lexemes can be associated with images. Pho- pronunciation specification. Wikidata has proper- tos of written signs may exemplify how words are ties for X-SAMPA, IPA transcription and Kirshen- used in the environment, e.g., a photo of a street baum code. These pronunciation properties have sign reading “Cyklist vig for g˚aende”is used to il- been used on Wikidata’s Q-items, but so far not lustrate the usages of the lexeme cyklist (L43527, (or very limited) for Danish lexemes. cyclist). Senses can be associated with language style Audio files can be associated with the lexico- (P6191) and perhaps most importantly with item graphic data. For forms, the P443 property can for this sense property (P5137) which links the link to one of the currently around 130 pronuncia- senses of lexemes to the Q-items and thus with tion audio files for Danish words, while senses can the rest of the Wikidata . Syn- link to sound files with the P51 property, e.g., the onyms, antonyms, hypernyms and hyponyms may sense for bi (L37259, bee) links to a sound record- be inferred from the information in that part of the ing of bees buzzing and the sense for bil (L36385, Wikidata knowledge graph. car) is associated with an audio file of a starting Links between lexemes in different languages can and driving car. currently be made with a specific prop- erty (P5972) applicable for senses, or the connec- 5 Discussion tion between lexemes can be made through their senses and the P5137 property linking to Q-item Wikidata is entirely field-based and especially for that then binds lexemes from separate languages lexemes there are very few means to enter free-form together. information. While exceptions can be noted in standard such as Wiktionary, almost 4.1 Verbs every piece of information added for a lexeme in Verbs can be associated with conjugation class Wikidata must be associated with a property. The through the P5186 property. We have followed the explicitness of Wikidata complicates the modeling scheme of (Allan et al., 1995) where there are four of language. Below we discuss a few of the issues main Danish conjugation categories. The auxil- that have appeared for the Danish language. iary verb(s) for a verb can be specified with P5401. 5.1 Lexeme splitting Some verbs can be assigned to a class, e.g., motion verbs, auxiliary verb, transitive/intransitive verb The English lexeme they (L371) incorporates the or deponent verb. The valence (P5526) can also forms they, them, their, theirs, themselves and be specified. themself, while French vous (plural you) and votre are separate lexemes (L9289 and L9289). In the 4.2 Nouns Danish online dictionary Den Danske Ordbog, the Danish nouns may be characterized by grammati- corresponding forms for they are split into several cal gender and class. Classes of commons nouns dictionary entries, while the German Wikidata lex- may be countable or mass noun, singulare tan- emes ich (L7877, the personal pronoun I ) has cur- rently no other form than ich. The issue was the tum, plurale tantum, collective noun, ‘nexual’ 8 or ‘innexual’ noun or nomen agentis. The dis- subject of an inconclusive discussion on Wikidata. tinction between nexual and innexual is based As noted by one of the discussants, if the pronouns on (Hansen and Heltoft, 2019) where the former 8https://www.wikidata.org/wiki/Wikidata_ may refer to “actions and processes, activities and talk:Lexicographical_data/Archive/2018/11#How_ states,” and the later “objects or compounds”. to_split_or_merge_stedord_(Q36224). are split, a question is how to link between such 5.3 Linking compounds to parts different lexemes. A related issue for Danish ap- pears for some adverbs, which could easily be re- When orthographically similar words with the garded as separate lexemes, such as hjem, hjemad same are split across multiple lexemes it and hjemme (home, homeward, at home). Here the may be unclear which lexeme a compound derives words are distinguished by telicity and a dynam- from. For instance, the compound vaskemaskine ic/static feature, e.g., hjemad is atelic and dynamic (L42991, washing machine) could be analyzed as (Hansen and Heltoft, 2019, p. 216). Possibly new consisting of: 1) a verb stem (vask), an interfix specific properties could describe the relationships. (-e-) and a noun (maskine), or 2) a verb in its Wikidata’s choice of separating form and sense infinitive form (vaske) and a noun, or 3) a noun complicates modeling of some words. vand (wa- (vask), an interfix and a noun. During data entry ter), øl (beer) and tøj (cloths) are examples of one would need to make an explicit choice. The words that each are regarded as one lexeme but same choice may appear for affixations, such as for- where the specific forms are associated with spe- be-handle. While be- is arguably a prefix (L44579), cific semantics: The common gender version of øl for may be a prefix or an adverb, — or possibly an relates to a countable noun as in “one beer”, while preposition. the neuter version relates to a mass noun. Here we could split the lexeme into two Wikidata lex- 5.4 Genitive emes, e.g., vand with common gender and vand with neuter, but that would complicate their re- Danish genitive, where an -s suffix is added, has lations to other lexemes, e.g., in terms of com- traditionally been regarded as a case, but newer pounding and etymology. A related issue occurs words for Danish grammar challenge that notion for deponent verbs. For finde/findes (active/pas- and argues that it is a clitic and a derivation mak- sive; find/exist) the lexemes have been separated ing a nominal to non-nominal (Hansen and Heltoft, (L39637 and L44601) following the convention of 2019, p. 255). Originally, we began adding the DanNet. genitive -s forms for the Danish nouns, but has In case of, e.g., the lexemes mor and moder discontinued it after becoming aware of the issue. (mother) their singular forms are different but they The Swedish part of Wikidata lexeme continues to have the same meaning and their plural form, add the genitive -s forms for nouns. If we were to mødre, is the same. There is no way of merging add the genitive for Danish nouns, then one could the separate plural forms when mor and moder argue that genitive versions of other word classes are regarded as separate lexemes as the forms are should also be added, — as words from other word tied to separate lexeme pages. The creation of a classes can be used as nouns, e.g., de rødes valgsejr dedicated property could link such forms together. (literally, the reds’ election victory) where the ad- jective røde has the added genitive -s. The ad- 5.2 Compound splitting vantage of have the -s forms is that lookup, e.g., for spellchecking may be more convenient. Other Danish is a language rich in compounds. The Danish digital dictionaries record the -s form. compounds and affixes of a lexeme can be spec- ified with the P5238 property where other lex- emes can be linked. The currently longest Danish 5.5 Data quality lexeme in Wikidata, ejendomsadministrationsvirk- The structured format of the data and the query somhed (building administration business), could tools associated with Wikidata enable us to per- be split as ejendom-s-administration-s-virksomhed form some completeness and internal consistency with two s-interfixes and three words with a checks. For instance, we may formulate a SPARQL good semantic relation to the complete lexeme. query that returns Danish lexemes without any us- With a more granular level, the word could be age examples. We have used the Shape Expressions split into ejen-dom-s-ad-ministr-ation-s-virk-som- (ShEx) language (Baker and Prud’hommeaux, hed, where affixes have been split from the roots. 2017) to formalize such checks, and these ShEx Here, dom and virk has little semantic relation- definitions are available on separate pages on Wiki- ship to the compound lexeme. With the cur- data (Nielsen et al., 2019). As an example, the rent setup of the P5238 property and the struc- ShEx definition E659 checks Danish numerals re- ture of Wikidata, it is difficult to see how the garding data about language, lemma, word stem, two splits can coexist with the same lexeme. Cur- word class, DanNet, usage example, sense, form rently, we typically split on the highest level, e.g., and hyphenation. ejendomsadministration-s-virksomhed. The lex- eme pages for ejendomsadministration and virk- somhed can further split the compounds and de- 9https://www.wikidata.org/wiki/ rived words. EntitySchema:E65 5.6 Applications References What can Wikidata lexemes be used for? Wikidata [Allan et al.1995] Robin Allan, Philip Holmes, and itself has a dedicated page for application ideas.10 Tom Lundskær-Nielsen. 1995. Danish. Rout- For spellchecking the current number of lexemes in ledge. Wikidata can hardly compete with already estab- lished larger word lists, but in the long run using [Baker and Prud’hommeaux2017] Thomas Baker the lexeme forms for spellchecking might be of in- and Eric Prud’hommeaux. 2017. Shape terest. The advantage is that it is collaboratively Expressions (ShEx) Primer. July. extensible, likely able to quickly catch up on neol- [Bond et al.2016] Francis Bond, Piek Vossen, ogisms and evolving jargon in comparison to stan- John P. McCrae, and Christiane Fellbaum. dard dictionaries. It is less clear if Wikidata lex- 2016. CILI: the Collaborative Interlingual In- emes can be used for more advanced natural lan- dex. Proceedings of the Eighth Global WordNet guage processing, such as part-of-speech tagging Conference, pages 50–57, January. and grammar checking. The ability of the cur- rent Wikidata lexeme system has limited means [Cimiano et al.2016] Philipp Cimiano, John P. Mc- for specifying grammar. Crae, and Paul Buitelaar. 2016. Lexicon Model for Ontologies: Community Report, 10 May 6 Related Research 2016. May. Among related research, there are several studies [Hansen and Heltoft2019] Erik Hansen and Lars reporting the extraction of data from Wiktionary Heltoft. 2019. Grammatik over det Danske and using the structured data for linguistic tasks Sprog. University Press of Southern Denmark, or building a resource (Zesch et al., 2008; McCrae February. et al., 2012; S´erasset,2014; Pantaleo et al., 2017). [Koehn2005] Philipp Koehn. 2005. Europarl: A For instance, the Java- and database-based system Parallel Corpus for Statistical Machine Transla- by Zesch et al. (Zesch et al., 2008) for reading, stor- tion. The Tenth Machine Translation Summit: ing and querying lexical semantic knowledge from Proceedings of Conference, pages 79–86. Wikipedia and Wiktionary enables a user, e.g., to programmatically query for hyponyms of senses. [McCrae et al.2012] John P. McCrae, Elena The parser of the described system needs to be Montiel-Ponsoda, and Philipp Cimiano. 2012. adjusted for each language edition of Wiktionary Integrating WordNet and Wiktionary with as each edition may use different markup for the lemon. Linked Data in Linguistics. lexical semantic information. [Miller1995] George Armitage Miller. 1995. Word- The lexicographic part of Wikidata is still com- Net: a lexical database for English. Communi- parably small, but contrary to many other on- cations of the ACM, 38:39–41, November. line dictionaries with rich semantics, Wikidata users can add and edit the lexicographic informa- [Navigli and Ponzetto2010] Roberto Navigli and Si- tion and more or less immediately see it becom- mone Paolo Ponzetto. 2010. BabelNet: building ing available in the powerful query facility of the a very large multilingual . Pro- SPARQL-based Wikidata Query Service. Our Or- ceedings of the 48th Annual Meeting of the As- dia Web application at http://tools.wmflabs. sociation for Computational Linguistics, pages org/ordia takes advantage of this service (Nielsen, 216–225, July. 2019). [Nielsen and Hansen2018] Finn Arup˚ Nielsen and Lars Kai Hansen. 2018. Inferring visual seman- 7 Acknowledgment tic similarity with deep learning and Wikidata: We thank Bolette Sandford Pedersen, Sanni Nimb, Introducing imagesim-353. Proceedings of the First Workshop on Deep Learning for Knowl- Sabine Kirchmeier, Nicolai Hartvig Sørensen and edge Graphs and Semantic Technologies, pages Lars Kai Hansen for discussions and answering 56–61, April. questions, and the reviewers for suggestions for improvement of the manuscript. This work is [Nielsen et al.2019] Finn Arup˚ Nielsen, Katherine funded by the Innovation Fund Denmark through Thornton, and Jos´eEmilio Labra Gayo. 2019. the projects DAnish Center for Big Data Analytics Validating Danish Wikidata lexemes. June. driven Innovation (DABAI) and Teaching platform Submitted to SEMANTiCS 2019. for developing and automatically tracking early ˚ stage literacy skills (ATEL). [Nielsen2018] Finn Arup Nielsen. 2018. Link- ing ImageNet WordNet Synsets with Wikidata. WWW ’18 Companion: The 2018 Web Con- 10https://www.wikidata.org/wiki/Wikidata: ference Companion, April 23–27, 2018, Lyon, Lexicographical_data/Ideas_of_tools. France, pages 1809–1814, April. [Nielsen2019] Finn Arup˚ Nielsen. 2019. Ordia: A Web application for Wikidata lexemes. May. From ESWC 2019. [Pantaleo et al.2017] Ester Pantaleo, Vito Walter Anelli, Tommaso Di Noia, and Gilles S´erasset. 2017. Etytree: A Graphical and Interactive Et- ymology Dictionary based on Wiktionary. Pro- ceedings of the 26th International Conference on Companion - WWW ’17 Com- panion, pages 1635–1640. [Pedersen et al.2009] Bolette Sandford Pedersen, Sanni Nimb, Jørg Asmussen, Nicolai Hartvig Sørensen, Lars Trap-Jensen, and Henrik Lorentzen. 2009. DanNet: the challenge of compiling a wordnet for Danish by reusing a monolingual dictionary. Language Resources and Evaluation, 43:269–299, August. [S´erasset2014] Gilles S´erasset.2014. DBnary: Wik- tionary as a Lemon-Based Multilingual Lexical Resource in RDF. Semantic Web: interoperabil- ity, usability, applicability. [Vrandeˇci´cand Kr¨otzsch2014] Denny Vrandeˇci´c and Markus Kr¨otzsch. 2014. Wikidata: a free collaborative knowledgebase. Communications of the ACM, 57:78–85, October. [Zesch et al.2008] Torsten Zesch, Christof M¨uller, and Iryna Gurevych. 2008. Extracting Lexical Semantic Knowledge from Wikipedia and Wik- tionary. Proceedings of the Sixth International Conference on Language Resources and Evalua- tion (LREC’08), pages 1646–1652.