Towards a Formal Representation of Components of German Compounds

Thierry Declerck
DFKI GmbH, D-66123 Saarbrücken, Germany
& Austrian Centre for Digital Humanities, A-1010 Vienna, Austria

Piroska Lendvai
Dept. of Computational Linguistics, Saarland University
D-66123 Saarbrücken, Germany

Abstract

This paper presents an approach for the formal representation of components in German compounds. We assume that such a formal representation will support the segmentation and analysis of unseen compounds that feature components already seen in other compounds. An extensive language resource that explicitly codes components of compounds is GermaNet, a lexical semantic network for German. We summarize the GermaNet approach to the description of compounds, discussing some of its shortcomings. Our proposed extension of this representation builds on the lemon lexicon model for ontologies, established by the W3C Ontology Lexicon Community Group.

1 Introduction

The motivation for our study is the assumption that the availability of a formal representation of components in German compound words can help in the detection, processing and analysis of new, unseen compounds. Compounding in German is a productive process to create (new) words, and these typically consist of lexical items partly seen in other compounds already. Our aim is to create a formal representation of such components in German, in order to facilitate the segmentation and possibly the generation of compound words.

Several German lexical resources feature designated elements to mark up entries as compounds, whereas they typically lack elements that would represent the components of compounds.

One of the most fully-fledged resources for German compound nouns is GermaNet (Hamp and Feldweg, 1997; Kunze and Lemnitzer, 2002), a lexical semantic net for German following the principles of the Princeton WordNet (Fellbaum, 1998). The approach of GermaNet to the encoding of elements of compounds within the lexical-semantic net is described in (Henrich and Hinrichs, 2011). In addition to this representation of compounds, GermaNet offers a freely available list of 66,047 nominal compounds split into their modifier(s) and head, in a tab-delimited (tsv) format.¹ Table 1 shows a few examples taken from this list, while Example 1 shows the encoding of the compound Rotsperre ('red card suspension') within the XML representation of the GermaNet lexical semantic network.

¹ http://www.sfs.uni-tuebingen.de/GermaNet/documents/compounds/split_compounds_from_GermaNet11.0.txt

Based on the GermaNet description of compounds, several studies on annotating compounds and systems for processing them have been proposed (Hinrichs et al., 2013; Santos, 2014; Dima et al., 2014; Dima and Hinrichs, 2015).

Compound        Modifier(s)  Head
Rotschopf       rot          Schopf
Rotschwanz      rot          Schwanz
Rotschwingel    rot          Schwingel
Rotspecht       rot          Specht
Rotsperre       rot          Sperre
Rotstich        rot|Rot      Stich
Rotstift        rot          Stift
Rotstiftaktion  Rotstift     Aktion

Table 1: Examples from the GermaNet list of nominal compounds.

The list also reflects the recursive nature of compounds, whose parts can themselves be compounds, as can be seen with the words Rotstift and Rotstiftaktion: one can easily split Rotstiftaktion into rot, Stift and Aktion, on the basis of the segmentation of Rotstift. An entry can list more than one modifier, as in the case of Rotstich ('tinge of red'), where we have both an adjectival (rot) and a nominal (Rot) modifier. GermaNet marks the different part-of-speech (PoS) properties of the components in the modifier position by using different cases: upper case marks a noun (as is the case for all the listed compounds), while lower case marks either a verb or an adjective.

We also observe that the modifier rot is often repeated (in fact much more often than in this slice taken from the list; there are also many compounds ending with the component rot).

In the following sections we first present the GermaNet formal representation of compounds in the full context of the lexical semantic net. Then we suggest our extensions to the GermaNet representation, utilizing modules of the lemon² approach to the encoding of lexical data.

² The lexicon model for ontologies (lemon) results from the work of the W3C Ontology Lexicon Community Group; https://www.w3.org/community/ontolex/wiki/Final_Model_Specification.

2 Representation of Compounds in the GermaNet lexical semantic net

The structure of a GermaNet entry containing the compound word Rotsperre ('red card suspension') is shown in Example 1. The relevant information is to be found in the XML elements rendered in bold face.

[XML listing; markup not recoverable. Surviving values: Rotsperre; modifier rot; head Sperre; "beim Fussball"]

Example 1: A compound lexical unit in GermaNet: Rotsperre ('red card suspension')

In this formal representation, the PoS of the modifier element of the compound (rot, 'red') is explicitly given, while this is not the case for the head, as the PoS of the head element of a compound is identical to the PoS of the whole compound. However, we argue that explicitly encoding the PoS information of the head component can be necessary, for example if a tool accesses only the repository of components. In this case, the tool would have to infer the PoS information of the head component from the compounds in which it occurs, which adds a processing step that can be avoided if the PoS of the head component is explicitly marked.

As already observed for the list of compounds in the tsv format, the GermaNet entry displays the adjective modifier in lower case. In this case we are losing the information about the original use of the word form. We suggest introducing an additional feature in which the original form of the component is preserved.

By observing the list of compounds provided by GermaNet, we noted that the modifier component of Rotsperre keeps recurring in other compounds. This is certainly also the case for the head components. For example, the component Sperre ('suspension', 'block', ...) is repeated in the related word Gelbsperre ('yellow card suspension'). It would be beneficial to encode such productively recurring components in a repository so that they are included only once in a lexicon, possibly with links to the different components they can be combined with, depending on their related senses.

The use of a modifier in a compound can play a disambiguation role. While we can easily establish a relation between the reduced set of senses of the compound and the set of senses of the head of the compound, we have no immediate information on the synsets associated with the modifier of the compound. This is information we would also like to encode explicitly.

Further, we consider the encoding of the Fugenelement ('connecting element') that is often used in the building of compounds, e.g. the s in Führungstor ('goal which gives the lead'). GermaNet does not include this information in its XML representation.

Finally, we notice that the ordering of components is not explicitly encoded.

In order to remedy the above issues, we suggest adopting the recently published specifications of the lemon model. In the following section, we describe this model and our suggested representation of GermaNet compounds.

3 The lemon Model

The lemon model has been designed using the Semantic Web formal representation languages OWL, RDFS and RDF.³ It also makes use of the SKOS vocabulary.⁴ lemon is based on the ISO Lexical Markup Framework (LMF).⁵ The W3C Ontology Lexicon Community Group proposed an extension of the original lemon model,⁶ stressing its modular design.

³ See respectively http://www.w3.org/TR/owl-/, https://www.w3.org/TR/rdf-schema/, and https://www.w3.org/RDF/
⁴ https://www.w3.org/2004/02/skos/
⁵ See (Francopoulo et al., 2006) and http://www.lexicalmarkupframework.org/
⁶ See (McCrae et al., 2012)

The core module of lemon, called ontolex, is displayed in Figure 1. In ontolex, each element of a lexicon entry is described independently, while typed relation markers, in the form of OWL, RDF or ontolex properties, interlink these elements.

Figure 1: ontolex, the core module of lemon. Figure created by John P. McCrae for the W3C Ontolex Community Group.

In addition to the core module of lemon, we make use of its decomposition module, called decomp,⁷ which is designed for the representation of Multiword Expression lexical entries and which we use for the representation of compound words.

⁷ http://www.w3.org/community/ontolex/wiki/Final_Model_Specification

The relation of decomp to the core module, and more particularly to the class ontolex:LexicalEntry, is displayed in Figure 2. Components of a compound (or a multiword) entry are pointed to by the property decomp:constituent. The range of this property is an instance of the class decomp:Component.

Figure 2: decomp, the decomposition module of lemon. Figure created by John P. McCrae for the W3C Ontolex Community Group.

Taking again Rotsperre ('red card suspension') as an example, which is built of two components, we use the decomp:constituent property twice, its values being :Rot_comp and :sperre_comp (see the corresponding RDF code given below in the entries (1-3)), which are instances of the class decomp:Component. This way we can encode the surface forms of the components, as they are used in compounds.

The relation between the decomp:Component instances (the surface forms of the components occurring in the compounds) and the ontolex:Word instances (the full lexical entries corresponding to the surface forms of the components) follows the schema for the relation between the two classes ontolex:LexicalEntry and decomp:Component, which is shown graphically in Figure 2. For our example Rotsperre, as shown in the entries (1-3) below, the elements Rot and sperre are instances of the class decomp:Component, and as such sperre can be linked to other compounds like Löschsperre ('deletion block') or to the (semantically more closely related) Gelbsperre ('yellow card suspension'). The property decomp:correspondsTo links the components to the lexical entries that encode all the lexical properties of those surface forms used in the compound word.

A simplified representation of the compound entry Rotsperre and of its components is displayed below, in (1-3). In the entry (1) we use rdf:_1 and rdf:_2⁸ for marking the order of the two components in the compound word. We assume that information on the position of the elements can be relevant for the interpretation of the compound.

⁸ As instances of the property rdfs:ContainerMembershipProperty, see http://www.w3.org/TR/rdf-schema/ for more details.

(1) :Rotsperre_lex
      rdf:type ontolex:LexicalEntry ;
      lexinfo:partOfSpeech lexinfo:noun ;
      rdf:_1 :Rot_comp ;
      rdf:_2 :sperre_comp ;
      decomp:constituent :Rot_comp ;
      decomp:constituent :sperre_comp ;
      decomp:subterm :Sperre_lex ;
      decomp:subterm :rot_lex ;
      ontolex:denotes <...> .

Entries (2) and (3) below show the encoding of the instances of the class decomp:Component:

(2) :Rot_comp
      rdf:type decomp:Component ;
      decomp:correspondsTo :rot_lex .

(3) :sperre_comp
      rdf:type decomp:Component ;
      decomp:correspondsTo :Sperre_lex .

The proposed approach to the representation of elements of compounds seems intuitive and economical, since one component can be linked to a large number of other components. Next to decomposition, it can also be used for the generation of compound words, taking into account the typical position such components take in known compounds.

Entry (1) also makes use of the property decomp:subterm. This property links the compound to the full lexical information associated with its components, including the senses of such components. The motivation of the lemon model is the determination of senses of lexical entries by reference to ontological entities outside of the lexicon proper. We can thus easily extend the representation of the compound word with sense information, by linking the components and the compound word to relevant resources in the Linked Open Data (LOD) cloud. The sense of :Rot_comp is given by a reference to http://de.dbpedia.org/page/Rot, where additional associations of red with political parties, sports clubs, etc. can be found. The same holds for :sperre_comp, which can be linked to the LOD resource http://de.dbpedia.org/page/Sperre. Additionally, for the sense of the compound [...] https://www.wikidata.org/wiki/Q1827, with the specific meaning of suspension from a sports game. The senses repository for Sperre can look as displayed in the lexicalSense entries (4) and (5).

(4) :sperre_sense1
      rdf:type ontolex:LexicalSense ;
      rdfs:label "A sense for the German word 'Sperre'"@en ;
      ontolex:isSenseOf :Sperre_lex ;
      ontolex:reference <...> .

(5) :sperre_sense2
      rdf:type ontolex:LexicalSense ;
      rdfs:label "A sense for the German word 'Sperre'"@en ;
      ontolex:isSenseOf :Sperre_lex ;
      ontolex:reference <...> .

Our current work includes associating GermaNet senses with instances of the ontolex:LexicalSense class.

We are also encoding connecting elements (Fugenelemente) with the help of the ontolex:Affix class.

4 Conclusion

We presented an approach for the formal representation of elements that occur in compound words.
Our motivation is to provide rules for computing compound words on the basis of their components.

Acknowledgments

Work presented in this paper has been supported by the PHEME FP7 project (grant No. 611233) and by the FREME H2020 project (grant No. 644771). The authors would like to thank the anonymous reviewers for their very helpful comments.

References

Corina Dima and Erhard Hinrichs. 2015. Automatic noun compound interpretation using deep neural networks and word embeddings. In Proceedings of the 11th International Conference on Computational Semantics, pages 173–183, London, UK, April. Association for Computational Linguistics.

Corina Dima, Verena Henrich, Erhard Hinrichs, and Christina Hoppermann. 2014. How to tell a Schneemann from a Milchmann: An annotation scheme for compound-internal relations. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), Reykjavik, Iceland, May. European Language Resources Association (ELRA).

Christiane Fellbaum. 1998. WordNet: An Electronic Lexical Database. Bradford Books.

Gil Francopoulo, Monte George, Nicoletta Calzolari, Monica Monachini, Nuria Bel, Y Pet, and Claudia Soria. 2006. Lexical markup framework (LMF). In Proceedings of LREC 2006.

Birgit Hamp and Helmut Feldweg. 1997. GermaNet - a lexical-semantic net for German. In Proceedings of ACL workshop Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications, pages 9–15.

Verena Henrich and Erhard W. Hinrichs. 2011. Determining immediate constituents of compounds in GermaNet. In Galia Angelova, Kalina Bontcheva, Ruslan Mitkov, and Nicolas Nicolov, editors, RANLP, pages 420–426. RANLP 2011 Organising Committee.

Erhard Hinrichs, Verena Henrich, and Reinhild Barkey. 2013. Using part-whole relations for automatic deduction of compound-internal relations in GermaNet. Language Resources and Evaluation, 47(3):839–858.

Claudia Kunze and Lothar Lemnitzer. 2002. GermaNet - representation, visualization, application. In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC-2002), Las Palmas, Canary Islands - Spain, May. European Language Resources Association (ELRA). ACL Anthology Identifier: L02-1073.

John P. McCrae, Guadalupe Aguado de Cea, Paul Buitelaar, Philipp Cimiano, Thierry Declerck, Asunción Gómez-Pérez, Jorge Gracia, Laura Hollink, Elena Montiel-Ponsoda, Dennis Spohr, and Tobias Wunner. 2012. Interchanging lexical resources on the semantic web. Language Resources and Evaluation, 46(4):701–719.

Pedro Bispo Santos. 2014. Using compound lists for German decompounding in a back-off scenario. In Workshop on Computational, Cognitive, and Linguistic Approaches to the Analysis of Complex Words and Collocations (CCLCC 2014), pages 51–55.
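
As a rough illustration of how the tab-delimited compound list behind Table 1 could support recursive segmentation and a repository of recurring components, the following Python sketch parses the eight rows of Table 1. It is only a sketch under stated assumptions: the function names (load_splits, flatten, component_index) are hypothetical, the rows are copied from Table 1 rather than read from the GermaNet file, and following only the first modifier alternative during recursion is a simplification made for brevity.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Rows copied from Table 1 (compound, modifier(s), head), tab-delimited.
# Alternative modifiers are separated by "|", as in "rot|Rot".
TSV = (
    "Rotschopf\trot\tSchopf\n"
    "Rotschwanz\trot\tSchwanz\n"
    "Rotschwingel\trot\tSchwingel\n"
    "Rotspecht\trot\tSpecht\n"
    "Rotsperre\trot\tSperre\n"
    "Rotstich\trot|Rot\tStich\n"
    "Rotstift\trot\tStift\n"
    "Rotstiftaktion\tRotstift\tAktion\n"
)

Splits = Dict[str, Tuple[List[str], str]]


def load_splits(tsv: str) -> Splits:
    """Map each compound to (list of modifier alternatives, head)."""
    splits: Splits = {}
    for line in tsv.splitlines():
        if not line.strip():
            continue  # skip empty lines
        compound, modifiers, head = line.split("\t")
        splits[compound] = (modifiers.split("|"), head)
    return splits


def flatten(word: str, splits: Splits) -> List[str]:
    """Recursively split a word into components that are not themselves
    listed as compounds. Assumption: only the first modifier alternative
    is followed."""
    if word not in splits:
        return [word]
    modifiers, head = splits[word]
    return flatten(modifiers[0], splits) + flatten(head, splits)


def component_index(splits: Splits) -> Dict[str, List[Tuple[str, str]]]:
    """Map each component to the compounds it occurs in, with its role."""
    index: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for compound, (modifiers, head) in splits.items():
        for modifier in modifiers:
            index[modifier].append((compound, "modifier"))
        index[head].append((compound, "head"))
    return index


SPLITS = load_splits(TSV)
INDEX = component_index(SPLITS)

print(flatten("Rotstiftaktion", SPLITS))
print(INDEX["rot"])
```

With these rows, flatten("Rotstiftaktion", SPLITS) yields rot, Stift and Aktion, and the index lists seven compounds in which rot occurs as a modifier. Keeping the role next to each compound mirrors the point made in Section 3 that the typical position of a component can matter for generation.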