Towards a Formal Representation of Components of German Compounds

Thierry Declerck
DFKI GmbH, D-66123 Saarbrücken, Germany
& Austrian Centre for Digital Humanities, A-1010 Vienna, Austria

Piroska Lendvai
Dept. of Computational Linguistics, Saarland University, D-66123 Saarbrücken, Germany

Abstract

This paper presents an approach for the formal representation of components in German compounds. We assume that such a formal representation will support the segmentation and analysis of unseen compounds that feature components already seen in other compounds. An extensive language resource that explicitly codes components of compounds is GermaNet, a lexical semantic network for German. We summarize the GermaNet approach to the description of compounds, discussing some of its shortcomings. Our proposed extension of this representation builds on the lemon lexicon model for ontologies, established by the W3C Ontology Lexicon Community Group.

1 Introduction

The motivation for our study is the assumption that the availability of a formal representation of components in German compound words can help in the detection, processing and analysis of new, unseen compounds. Compounding in German is a productive process for creating (new) words, and these typically consist of lexical items partly seen in other compounds already. Our aim is to create a formal representation of such components in German, in order to facilitate the segmentation and possibly the generation of compound words.

Several German lexical resources feature designated elements to mark up entries as compounds, whereas they typically lack elements that would represent the components of compounds.

One of the most fully fledged resources for German compound nouns is GermaNet (Hamp and Feldweg, 1997; Kunze and Lemnitzer, 2002), a lexical semantic net for German following the principles of the Princeton WordNet (Fellbaum, 1998). The approach of GermaNet to the encoding of elements of compounds within the lexical-semantic net is described in (Henrich and Hinrichs, 2011). In addition to this representation of compounds, GermaNet offers a freely available list of 66,047 nominal compounds split into their modifier(s) and head, in a tab-delimited (tsv) format [1]. Table 1 shows a few examples taken from this list, while Example 1 shows the encoding of the compound Rotsperre ('red card suspension') within the XML representation of the GermaNet lexical semantic network.

[1] http://www.sfs.uni-tuebingen.de/GermaNet/documents/compounds/split_compounds_from_GermaNet11.0.txt

Based on the GermaNet description of compounds, several studies on annotating compounds and systems for processing them have been proposed (Hinrichs et al., 2013; Santos, 2014; Dima et al., 2014; Dima and Hinrichs, 2015).

    Compound         Modifier(s)   Head
    Rotschopf        rot           Schopf
    Rotschwanz       rot           Schwanz
    Rotschwingel     rot           Schwingel
    Rotspecht        rot           Specht
    Rotsperre        rot           Sperre
    Rotstich         rot|Rot       Stich
    Rotstift         rot           Stift
    Rotstiftaktion   Rotstift      Aktion

Table 1: Examples from the GermaNet list of nominal compounds.

The few examples listed in Table 1 show that GermaNet explicitly describes only the immediate constituents of compounds, but also reflects the recursive nature of compounds that have more than two constituent parts, as can be seen with the words Rotstift ('red pencil') and Rotstiftaktion ('cutback', 'reduce spending'). In this case a tool can easily split Rotstiftaktion into rot, Stift and Aktion, on the basis of the segmentation of Rotstift.

We also note that one compound can have more than one modifier, as in the case of Rotstich ('tinge of red'), where we have both an adjectival (rot) and a nominal (Rot) modifier. GermaNet marks the different part-of-speech (PoS) properties of the components in modifier position by using different cases: upper case marks a noun (as is the case for all the listed compounds), while lower case marks either a verb or an adjective.

We also observe that the modifier rot is often repeated (in fact much more often than in this slice taken from the list; there are also many compounds ending with the component rot).

In the following sections we first present the GermaNet formal representation of compounds in the full context of the lexical semantic net. Then we suggest our extensions to the GermaNet representation, utilizing modules of the lemon [2] approach to the encoding of lexical data.

[2] The lexicon model for ontologies (lemon) results from the work of the W3C Ontology Lexicon Community Group; https://www.w3.org/community/ontolex/wiki/Final_Model_Specification.

2 Representation of Compounds in the GermaNet lexical semantic net

The structure of a GermaNet entry containing the compound word Rotsperre ('red card suspension') is shown in Example 1. The relevant information is to be found in the XML elements rendered in bold face.

    <synset class="Geschehen" category="nomen" id="s21159">
      <lexUnit id="l29103" styleMarking="no" artificial="no"
               namedEntity="no" source="core" sense="1">
        <orthForm>Rotsperre</orthForm>
        <compound>
          <modifier category="Adjektiv">rot</modifier>
          <head>Sperre</head>
        </compound>
      </lexUnit>
      <paraphrase>beim Fussball</paraphrase>
    </synset>

Example 1: A compound lexical unit in GermaNet: Rotsperre ('red card suspension')

In this formal representation, the PoS of the modifier element of the compound (rot, 'red') is explicitly given, while this is not the case for the head, as the PoS of the head element of a compound is identical to the PoS of the whole compound. However, we argue that explicitly encoding the PoS information of the head component can be necessary, for example if a tool accesses only the repository of components. In this case, the tool would have to infer the PoS information of the head component from the compounds in which it occurs. This additional processing step can be avoided if the PoS of the head component is explicitly marked.

As already observed for the list of compounds in the tsv format, the GermaNet entry displays the adjective modifier in lower case. In this case we are losing the information about the original use of the word form. We suggest introducing an additional feature in which the original form of the component is preserved.

By observing the list of compounds provided by GermaNet, we noted that the modifier component of Rotsperre keeps recurring in other compounds. This certainly also holds for head components. For example, the component Sperre ('suspension', 'block', ...) is repeated in the related word Gelbsperre ('yellow card suspension'). It would be beneficial to encode such productively recurring components in a repository, so that they are included only once in a lexicon, possibly with links to the different components they can be combined with, depending on their related senses.

The use of a modifier in a compound can play a disambiguating role. While we can easily establish a relation between the reduced set of senses of the compound and the set of senses of the head of the compound, we have no immediate information on the synsets associated with the modifier of the compound. This is information we would also like to encode explicitly.

Further, we consider the encoding of the Fugenelement ('connecting element') that is often used in the building of compounds, e.g. the s in Führungstor ('goal which gives the lead'). GermaNet does not include this information in its XML representation.

Finally, we notice that the ordering of components is not explicitly encoded.

In order to remedy the above issues, we suggest adopting the recently published specifications of the lemon model. In the following section, we describe this model and our suggested representation of GermaNet compounds.

3 The lemon Model

The lemon model has been designed using the Semantic Web formal representation languages OWL, RDFS and RDF [3]. It also makes use of the SKOS vocabulary [4]. lemon is based on the ISO Lexical Markup Framework (LMF) [5], and the W3C Ontology Lexicon Community Group proposed an extension of the original lemon model [6], stressing its modular design.

[3] See respectively http://www.w3.org/TR/owl-semantics/, https://www.w3.org/TR/rdf-schema/, and https://www.w3.org/RDF/
[4] https://www.w3.org/2004/02/skos/
[5] See (Francopoulo et al., 2006) and http://www.lexicalmarkupframework.org/
[6] See (McCrae et al., 2012)

The core module of lemon, called ontolex, is displayed in Figure 1. In ontolex, each element of a lexicon entry is described independently, while typed relation markers, in the form of OWL, RDF or ontolex properties, interlink these elements.

In addition to the core module of lemon, we make use of its decomposition module, called decomp [7], which was designed for the representation of Multiword Expression lexical entries and which we use for the representation of compound words.

[7] http://www.w3.org/community/ontolex/wiki/Final_Model_Specification

[...] in the entries (1-3), which are instances of the class ontolex:Component.
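As an illustration of the ideas discussed so far, the following minimal Python sketch shows how the split list of Table 1 supports recursive segmentation and how a repository can store each component only once. It is not the GermaNet or lemon tooling: the rows are copied by hand from Table 1 instead of being read from the tsv file, the function names (build_split_table, split_recursively, build_repository, guess_pos) are hypothetical, and following only the first modifier alternative is a simplifying assumption.

```python
from collections import defaultdict

# Rows copied from Table 1: (compound, modifier alternatives, head).
# "rot|Rot" lists two alternative modifiers for the same compound.
ROWS = [
    ("Rotschopf", "rot", "Schopf"),
    ("Rotschwanz", "rot", "Schwanz"),
    ("Rotschwingel", "rot", "Schwingel"),
    ("Rotspecht", "rot", "Specht"),
    ("Rotsperre", "rot", "Sperre"),
    ("Rotstich", "rot|Rot", "Stich"),
    ("Rotstift", "rot", "Stift"),
    ("Rotstiftaktion", "Rotstift", "Aktion"),
]


def build_split_table(rows):
    """Map each compound to (list of modifier alternatives, head)."""
    return {compound: (mods.split("|"), head) for compound, mods, head in rows}


def split_recursively(word, table):
    """Expand a word into atomic components.

    Assumption of this sketch: when several modifier alternatives exist,
    only the first one is followed.
    """
    if word not in table:
        return [word]
    modifiers, head = table[word]
    return split_recursively(modifiers[0], table) + split_recursively(head, table)


def build_repository(table):
    """Store every component once, with the compounds it occurs in."""
    repo = defaultdict(lambda: {"modifier_in": [], "head_in": []})
    for compound, (modifiers, head) in table.items():
        for modifier in modifiers:
            repo[modifier]["modifier_in"].append(compound)
        repo[head]["head_in"].append(compound)
    return dict(repo)


def guess_pos(component):
    """Case convention of the list: upper case = noun, lower case = verb or adjective."""
    return "noun" if component[0].isupper() else "verb-or-adjective"


TABLE = build_split_table(ROWS)
REPO = build_repository(TABLE)

print(split_recursively("Rotstiftaktion", TABLE))  # ['rot', 'Stift', 'Aktion']
print(REPO["rot"]["modifier_in"])                  # compounds sharing the modifier rot
print(guess_pos("Rot"), guess_pos("rot"))
```

Note that such a repository only recovers the coarse PoS distinction from the case convention; an explicit PoS feature and the original surface form, as argued for above, would have to be stored as additional fields.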
This way we can encode the surface forms of the components, as they are used in compounds. The relation between the ontolex:Component instances (the surface forms of the components occurring in the compounds) and the ontolex:Word instances (the full lexical entries corresponding to the surface forms of the components) follows the schema for the relation between the two

Figure 1: ontolex, the core module of lemon. Figure created by John P.