PoS(ISGC 2011 & OGF 31)003 http://pos.sissa.it/ ¨ urzburg, Germany b and Werner Wegstein a ∗ ¨ urzburg, Am Hubland, D – 97074 W {dietmar.seipel,werner.wegstein}@uni-wuerzburg.de Speaker. In this paper, we investigateance the in problem a of language buildingtronic in a space ; metaDictionary and that for time. reflectstechniques. obtaining vari- This a is fine–grained done annotation,We by exploiting are we the use developing annotation declarativephemes a of parsing elec- of graphical an tool electronicof for basic . decomposing , The to and constructto goals annotating a obtain of network the complex the showing which entries/mor- the segmentation morphemes, morphemes metaDictionary are and can and to be to the combined obtainmorpheme analyze annotated a decompositions variance list becomes in decomposition, possible,properties space a into which network and account. can analysis time. take of We variance can Based and manage on grammatical and analyzeespecially grids of interesting dictionaries for and our hugethe text analyses, text corpora: since corpora dictionaries they are document containdictionary–based which morphological settled morphemes analysis morpheme has can cores, tocorpus be and be data combined tested best in lexicographically. accessible the via Our context grid of structures. huge network ∗ Copyright owned by the author(s) under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike Licence. Department of Computer Science Institute for German Philology c

The International Symposium onMarch Grids 19–25, and 2011 Clouds andAcademia the Sinica, Open Taipei, Taiwan Grid Forum E–mail: University of W Dietmar Seipel metaDictionary – Towards a Generic e–Infrastructure for Detecting Variance in Language by Exploiting Dictionary Information a b PoS(ISGC 2011 & OGF 31)003 9 9 3 3 4 6 7 8 3 5 7 2 10 Dietmar Seipel Tei 2 Tei ]. 12 The Morpheme Annotation Tool The metaDictionary Network Analysis of Morpheme Decompositions Techniques from Computer Science Encoding Dictionaries in Grammar–Based Parsing Annotated Morpheme Terms Annotation Rules Information and knowledge can be encoded by combining elementary basic components 4.3 2.1 2.2 2.3 3.1 3.2 4.1 4.2 . Conclusions . Annotation of Digitized Print Dictionaries in . Annotating Morpheme Decompositions . Variance in Language and Genome . Introduction according to special rules.structural From properties this point in of commonthe view, which genomes contribute one and substantially hand to language and codewith biological may evolution to a have on focus the on changeIn interdependencies bioinformatics, of research between language data science on onavailable. and genomes the On the have a other already humanities much been hand. smaller seemsin generated scale, promising. the and we Research are humanities try in publicly by towithin build collecting this up the empirical field a last data comparable 500 in research yearsthe the environment research and field into beyond, mutual of based relations language onand between language. change dictionary data structures in information Our interdisciplinary and and German project data toand combines properties bioinformatics, be corpus in informatics, linguistics used philology genomes and for hasand been Research since funded 2008 by the [ German Federal Ministry of Education 1. Introduction 5 3 4 2 Contents 1 A Generic e–Infrastructure Exploiting Dictionary Information PoS(ISGC 2011 & OGF 31)003 ]. ]): 4 11 [ orter- ¨ ¨ uller/Zarncke Dietmar Seipel . A metaDic- 1 ]. In this section, 1 using a declarative . The metaLemma [ orterbuchnetz ¨ 2 Tei Trierer W by Jacob and Wilhelm Grimm. ¨ uhler discussed the importance orterbuch¨ der deutschen Gegen- W alzisches¨ and Luxemburger W orterbuch ¨ Wahlverwandtschaften with the combinability of amino sequences 3 3 Deutsches W assischen¨ bzw. der deutsch–lothringischen Mundarten) and sup- , digitized by the Academy of Sciences of Berlin and Brandenburg [ orterbuch¨ der els Our project goals are the compilation of a metaDictionary – using a fine–grained The usage of in languages varies in space and time, cf. Figure The rest of this paper is organized as follows: In Section 2, we outline some project The project partner from bioinformatics is interested in comparing the combinability The starting point of the project is a collection of digitized dictionaries made publicly of these units for thean understanding estimate of on variety in the languagemorpheme dimensions in count his for of language German 30 theory – pages and of more gave Goethe’s than novel 2000 units –, setting out with a annotation of dictionariesfor – the and comparison the withlexical computation genomes units – of (base based network morphemes). on structures morphological As and analyses early properties of as entries 1934, into Karl basic B grammar formalism. Section 4 presentsdictionary a entries tool and for annotating analyzing their the parts. morphological Some structure of conclusions are2. given in Variance Section in 5. Language and Genome we will briefly sketch theare project using; more goals details and about the the techniques techniques from will computer2.1 be science The given that in metaDictionary we the subsequent sections. tionary summarizes different representations oftionaries the spread same in lexical space units, and taken time, from in the lists dic- of metaLemmas, cf. Figure porting materials of the diachronic objectives investigating variance in language andof genome. the project: Section 3 how explains digitized a print prerequisite dictionaries can be annotated in of (basic) morphemes in words, cf. Figure synchronic dictionaries like the Middle High Germanand Dictionaries (Benecke/M Lexer), early Hightion German of Dictionaries dictionaries on around regional 1800 dialectsbuch, (Adelung (Rheinisches, W Pf and Campe), a selec- For modern German, we canwartssprache use (WDG) Klappenbach/Steinitz, Based on the broad andand representative algorithms data for base, the detectingbase goal and structures is understanding to could variance. developstructures, then and and, Ideally, test be variance methods vice modelled in versa,oped using biological variance for in concepts structures language in detected bioinformatics. might by be analyzing described language using models devel- is based on dictionary entries infor present–day standard units German, that with are special no provision made longer used today. 2.2 Network Analysis of Morpheme Decompositions available by our project partners at the university of Trier ( A Generic e–Infrastructure Exploiting Dictionary Information PoS(ISGC 2011 & OGF 31)003 Dietmar Seipel Time Levels - Gafel Gabel Gawel 4 gabel(e) Variance in Space and Time. ahd mhd nhd gabala 6 Figure 1: A metaLemma Represents a List of Dictionary Entries. Dictionaries wdg lexer ahdwb luxemb lothrwb Figure 2: We want to develop a generic e–Infrastructure for analyzing the inhomogenious dic- tionary entries of 18th,information to 19th a and kind 20thformation of as baseline century possible, encoding lexicographers e.g., keeping and as part for much of of speech, transforming the gender, the valuable and dictionary inflectional in- detail, encoded in lots 2.3 Techniques from Computer Science in genomes. Preliminaryproperties. tests have This could shown be that dueof to the language the corresponding and fact that networks genome the have might generative be similar processes the behind the comparable. evolution A Generic e–Infrastructure Exploiting Dictionary Information PoS(ISGC 2011 & OGF 31)003 ] 9 Xml for data format Dietmar Seipel Xslt ) on the Adelung Xml ] – on the Campe Dic- Dcg 10 Tei for parsing, querying, and transform- 5 Prolog Morpheme Network. in for relational databases, XQuery and ]. There exists quite a variety of well–known declarative Sql Figure 3: 8 , 7 P5 Guidelines for Electronic Text Encoding and Interchange [ , 3 Tei for declarative programming, and rules for decision support systems declarative programming Prolog –based tool for a declarative, rule–based control and annotation of morpheme ] for a fine–grained lexicographic analysis. For compiling the metaDictionary, we 2 We use Researching morpheme structures requires a fine–grained analysis and detailed ac- Prolog and grammars. Their advantages are,less that they error–prone, are and compact, rapidlytion flexibly programmable, by clear, extensible. parsing dictionary We have entries tested with declarative definite knowledge clause grammars extrac- ( processing, Dictionary and – within the framework of the TextGrid–Project [ have split the from thea electronic dictionaries into morpheme terms. We have built segmentations, which we testalignment on methods, WDG which data can at yield present. a more We precise can alignment extend for our the approach3. metaDictionary. by Annotation of Digitized Print Dictionaries in counting of the basicas linguistic units the of basis themorphological helps German combination to . patterns, minimize Using irrespectivehistory. dictionary of influences entries the The of permanent dictionaries inflectional change weman of specialties selected language language fully and during in represent the the thedictionary) Middle and variance lexicographical New High of state High German of German, used Period the by (Lexer, Ger- educated and supported literary by circles from the around Grimm 1775 of different patterns.conformant The with annotated the dictionary datafor researching are into stored variance inavailable comparable for an to other genome researchers. structures, and they will be publicly languages for different areas: tionary [ ing the linguistic data [ A Generic e–Infrastructure Exploiting Dictionary Information PoS(ISGC 2011 & OGF 31)003 is 4 ¨ Alchen, Dietmar Seipel from Figure das Aal ]. 10 ]. And we can apply the 13 elements. E.g., the dictionary entry Tei data format for researching into variance . Parsing of dictionary entries is a complex Verkleinerungswort, 6 Tei Prolog Fine–Grained Structuring. Entry of an Electronic Dictionary. Tei : Mz. die –e, 5 format. E.g., the dictionary entry for ¨ uche, ... Figure 5: Figure 4: Tei des –es, d. Mz. w. d. Ez. 1) Ein langer, runder ... Fisch ... 2) Ein Backwerk aus Butterteig ... 3) Die fal=schen Br Der Aal, des –s, is annotated as follows: With our parsing approach, we can obtain a fine–grained annotation of electronic The obtained structures are represented using For automatically extracting the keyword () and some meta–data, such as Aal structured as shown in Figure for dictionaries in a valid task, since there is athen lot we of can structural flexibly variance;The and if parsed quickly the dictionary adapt desired data data thecomparable are could grammar to stored not genome rules in be structures. without a extracted, breaking other cases. 3.1 Encoding Dictionaries in class and inflexion, we usethe a declarative programming compact language grammar formalism, which we have developed using A Generic e–Infrastructure Exploiting Dictionary Information (Adelung, Campe) up toresults sixties of in our the dictionary analyses 20thtexts in century and a (WDG) early second [ new stepliterary High to texts text German – corpora texts available of soon – middle in starting High the with German TextGrid Luther digital and repository the [ mass of German PoS(ISGC 2011 & OGF 31)003 s ]. 7 [ Edcg Xml . We can Dietmar Seipel s is, that they in other tools, Xml ) of context free . 6 rules: Edcg Ebnf Ebnf is more appropriate for or Edcg Xml Prolog s in , is shown in Figure sequence(*, form:[type:determiner]) Dcg 7 Prolog elements. , an important extension of form Ebnf extension, called Extended Definite Clause Grammar element, which depends on the position of the entry in s and Dcg entry yields a higher precision compared to regular expressions and Dcg . When working with of the Xml grammars xml:id

Der
value="m"/>
Aal
, –notion is well–known from extended Backus–Naur form ( is a common data format for modelling, managing, and exchanging semi–structu- form:[type:], ..., sense. sequence(*, form:[type:determiner]), form:[type:]. ... ...
* ) rules, which is even more compact and directly, generically generates entry ===> form:[type:lemma] ===> sense ===> ... For annotating the large numbers of dictionary entries (which can exceed 100.000 The Parsing with Xml Edcg units), one needs linguistic knowledgemorphological and analyzer, suitable a tools suitable, fromfor compact computer science: knowledge automatic representation, a annotation inference reliable rules,annotation methods tool, and which a we graphical have built user using interface. The architecture of our grammars. Compared to generically generate this generation has to be coded explicitly and thus blows4. up Annotating the code Morpheme drastically. Decompositions offer meta–predicates such as sequence:generates the a call sequence of zero or more the dictionary (and on the headword), is determined in a further processing step. also gain performance compared tohandling relational complex databases, since structures. 3.2 Grammar–Based Parsing statistical parsers. We( use a E.g., for generating the entry elements, we can use the following The attribute red data. Powerful query, transformation and update languages exist for A Generic e–Infrastructure Exploiting Dictionary Information PoS(ISGC 2011 & OGF 31)003 , . ]. 6 Prolog Dietmar Seipel craftsmanship for gap element, ge Morfessor Visualization  of the English word 6 6 for basic morphem and 8 bm System Architecture. Term Notation Morpheme Decomposition. Annotation Rules Morphem Analyses and imported into standard tools such as Prot´eg´e[ , can be annotated as follows: Figure 6:  Owl 7 Figure 7: 6 ((craft + s) + man) + ship Owl Prot´eg´e ((craft*bm + s*ge) + man)*noun + ship The decomposition of complex morphemes into basic units is based on the Whole Word Using unique abbreviations, such as We use a compact knowledge representation of the annotated morpheme decomposi- Morphology. The initial morpheme decompositionssuch are as derived using Morfessor. well–established Then, tools theyelectronic dictionaries. are pre–annotated We use based dictionary on informationanalysis, the provided e.g., by fine–grained information the information on WDG word of to classes supportunits and the found conditions of can usage. later The beas basic corrected, morphological much refined of and the annotatedour with annotations parsing our as techniques, tool. and possible Wefor we try from have inferring to the developed further extract underlying a annotations. set electronicthe The of Web dictionaries final Ontology about using morpheme Language 50 decompositions generic can annotation be rules exported to the morpheme term which is visualized in Figure tions as suitable term structures that can be managed and extended elegantly in 4.1 Annotated Morpheme Terms A Generic e–Infrastructure Exploiting Dictionary Information PoS(ISGC 2011 & OGF 31)003 : side. In Dietmar Seipel . Prolog is recognized as a 8 9 (((craft + s) + man)*noun +*noun ship) rules is managed, which can automatically infer further ((craft + s) + man)*noun + ship , consists of the morpheme editor, a visualization of the Prolog Prolog The Graphical User Interface of the Morpheme Annotation Tool. Figure 8: mc(X, A, B), has_word_class(A, noun), text_form(B, [ship, ...]). has_word_class(X, noun) :- The tree structure of the annotated morpheme terms can be visualized for better For persistently storing the morpheme terms and for efficiently accessing the morpheme The processing of the morpheme terms is completely done on the The morpheme annotation tool supports the user in checking all pre–annotated mor- readability. The annotatedpossible morpheme to efficiently terms search for areterms individual stored annotated containing morpheme a in terms search or a string. for relational all morpheme database. It is morpheme terms as tree structures, and a database browser, cf. Figure terms based on theirit text contains form, further a information, relationalthe such database name as is of the used. the text user For form, who every the annotated morpheme the identifier term, 4.2 term from Annotation as the Rules well WDG, as and the timestamp of the annotation. pheme decompositions. The user canon annotate a parts set of of a annotationposition morpheme rules decomposition. is – Based further which can annotatedis be as extended also detailed dynamically implemented as – in the possible. morpheme decom- The graphical user interface, which A Generic e–Infrastructure Exploiting Dictionary Information the background, a set of annotations from the annotations and decompositionslogical given by annotation the user. rule, the Withnoun the term following and can be further annotated to 4.3 The Morpheme Annotation Tool PoS(ISGC 2011 & OGF 31)003 Dietmar Seipel . . . An Introduction to Declarative Parsing Jena, 1934, p. 34. Prolog . 5 Vol., 1807–1811. 52% of the English – ator,¨ Klaus: Proc. 17th Workshop on Logic . , that orterbuch¨ der deutschen Prolog W . Proc. 6th International Workshop on Natural 10 Culturomics , de Gruyter, Berlin / New York, 1990. http://www.sprache-und-genome.de . : Project funded by the German Ministry of Education –Documents in Natural Language Processing in ]. We expect to contribute with our e–Infrastructure in http://protege.stanford.edu/ ¨ ucher / Dictionaries / Dictionnaires. ... / An International Quantitative Analysis of Culture Using Millions of Digitized 5 orterbuch¨ der deutschen Sprache. [ Xml Addison–Wesley, 1989. W orterb ¨ W 6 Vol., Akademie–Verlag, Berlin, 1961–1977. http://www.textgrid.de Processing Sprachtheorie. Die Darstellungsfunktion der Sprache. 2009, orterbuchnetz¨ – Network of Electronic Dictionaries of the University of Trier. Science 14, January 2011, pp. 176–182. ¨ uhler, Karl: h rte´Otlg Editor. The Prot´eg´eOntology Gegenwartssprache. Books. Variance in Language and Genome Encyclopedia of / ... Vol. 2 http://www.woerterbuchnetz.de/ Computational Linguistics. and Annotation of Electronic Dictionaries. Language Processing and Cognitive Science (NLPCS), 2009. Programmierung (WLP), 2002. P5: Guidelines for Electronichttp://www.tei-c.org/release/doc/tei-p5-doc/en/html/index.html Text Encoding and Interchange. Textgrid: A Modular PlatformHumanities. for Collaborative Textual Editing – aTrierer W Community Grid for the and Research (BMBF) since 2008. The metaDictionary of the German language, based on the analyses of a network of B [1] [6] [5] Michel, Jean–Baptiste, et al.: [2] Campe, Joachim Heinrich: [3] Gazdar, Gerald; Mellish, Chris: [7] Schneiker, Christian; Seipel, Dietmar; Wegstein, Werner; Pr [8] Seipel, Dietmar: [9] [4] Klappenbach, Ruth; Steinitz, Wolfgang (Eds.): a Grid environment to theunderstanding variance development in of language test that ideally methods are and applicable to algorithms genome for structures detecting asReferences well. and dictionaries, forms the corebasic part morphemes of used in thewill the generic be German e–Infrastructure to language designed test withinered to the the by identify last dictionaries, data the 500 and usingtexts. years. second text to Here corpora: The we find next expect out first step well new all to as insights combinations qualitatively, check into because of for the dictionaries basic combinabilitycomplex of basic morphemes of the morpheme morphemes used basic German structures not language units, in normally quantitatively cov- representingpare do as semantically our not regular register results patterns. with the We observations will in com- A Generic e–Infrastructure Exploiting Dictionary Information 5. Conclusions the majority of themented words in used standard in references English books – consists of lexical dark matter undocu- [12] [10] [11] [13] Gouws, Rufus et. al.: