PoS(ISGC 2011 & OGF 31)003 http://pos.sissa.it/ ¨ urzburg, Germany b and Werner Wegstein a ∗ ¨ urzburg, Am Hubland, D – 97074 W {dietmar.seipel,werner.wegstein}@uni-wuerzburg.de Speaker. In this paper, we investigateance the in problem a of language buildingtronic in a space dictionaries; metaDictionary and that for time. reflectstechniques. obtaining vari- This a is fine–grained done annotation,We by exploiting are we the use developing annotation declarativephemes a of parsing elec- of graphical an tool electronicof for basic dictionary. decomposing morphemes, The to and constructto goals annotating a obtain of network the complex the showing which entries/mor- the segmentation morphemes, morphemes metaDictionary are and can and to be to the combined obtainmorpheme analyze annotated a decompositions variance morpheme list becomes in decomposition, possible,properties space a into which network and account. can analysis time. take of We variance can Based and manage on grammatical and analyzeespecially grids of interesting dictionaries for and our hugethe text analyses, text corpora: since corpora dictionaries they are document containdictionary–based which morphological settled morphemes analysis morpheme has can cores, tocorpus be and be data combined tested best in lexicographically. accessible the via Our context grid of structures. huge network ∗ Copyright owned by the author(s) under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike Licence. Department of Computer Science Institute for German Philology c
The International Symposium onMarch Grids 19–25, and 2011 Clouds andAcademia the Sinica, Open Taipei, Taiwan Grid Forum E–mail: University of W Dietmar Seipel metaDictionary – Towards a Generic e–Infrastructure for Detecting Variance in Language by Exploiting Dictionary Information a b PoS(ISGC 2011 & OGF 31)003 9 9 3 3 4 6 7 8 3 5 7 2 10 Dietmar Seipel Tei 2 Tei ]. 12 The Morpheme Annotation Tool The metaDictionary Network Analysis of Morpheme Decompositions Techniques from Computer Science Encoding Dictionaries in Grammar–Based Parsing Annotated Morpheme Terms Annotation Rules Information and knowledge can be encoded by combining elementary basic components 4.3 2.1 2.2 2.3 3.1 3.2 4.1 4.2 . Conclusions . Annotation of Digitized Print Dictionaries in . Annotating Morpheme Decompositions . Variance in Language and Genome . Introduction according to special rules.structural From properties this point in of commonthe view, which genomes contribute one and substantially hand to language and codewith biological may evolution to a have on focus the on changeIn interdependencies bioinformatics, of research between language data science on onavailable. and genomes the On the have a other already humanities much been hand. smaller seemsin generated scale, promising. the and we Research are humanities try in publicly by towithin build collecting this up the empirical field a last data comparable 500 in research yearsthe the environment research and field into beyond, mutual of based relations language onand between language. change dictionary data structures in information Our interdisciplinary and and German project data toand combines properties bioinformatics, be corpus in informatics, linguistics used philology genomes and for hasand been Research since funded 2008 by the [ German Federal Ministry of Education 1. Introduction 5 3 4 2 Contents 1 A Generic e–Infrastructure Exploiting Dictionary Information PoS(ISGC 2011 & OGF 31)003 ]. ]): 4 11 [ orter- ¨ ¨ uller/Zarncke Dietmar Seipel . A metaDic- 1 ]. In this section, 1 using a declarative . The metaLemma [ orterbuchnetz ¨ 2 Tei Trierer W by Jacob and Wilhelm Grimm. ¨ uhler discussed the importance orterbuch¨ der deutschen Gegen- W alzisches¨ and Luxemburger W orterbuch ¨ Wahlverwandtschaften with the combinability of amino sequences 3 3 Deutsches W assischen¨ bzw. der deutsch–lothringischen Mundarten) and sup- , digitized by the Academy of Sciences of Berlin and Brandenburg [ orterbuch¨ der els Our project goals are the compilation of a metaDictionary – using a fine–grained The usage of words in languages varies in space and time, cf. Figure The rest of this paper is organized as follows: In Section 2, we outline some project The project partner from bioinformatics is interested in comparing the combinability The starting point of the project is a collection of digitized dictionaries made publicly of these units for thean understanding estimate of on variety in the languagemorpheme dimensions in count his for of language German 30 theory – pages and of more gave Goethe’s than novel 2000 units –, setting out with a annotation of dictionariesfor – the and comparison the withlexical computation genomes units – of (base based network morphemes). on structures morphological As and analyses early properties of as entries 1934, into Karl basic B grammar formalism. Section 4 presentsdictionary a entries tool and for annotating analyzing their the parts. morphological Some structure of conclusions are2. given in Variance Section in 5. Language and Genome we will briefly sketch theare project using; more goals details and about the the techniques techniques from will computer2.1 be science The given that in metaDictionary we the subsequent sections. tionary summarizes different representations oftionaries the spread same in lexical space units, and taken time, from in the lists dic- of metaLemmas, cf. Figure porting materials of the diachronic objectives investigating variance in language andof genome. the project: Section 3 how explains digitized a print prerequisite dictionaries can be annotated in of (basic) morphemes in words, cf. Figure synchronic dictionaries like the Middle High Germanand Dictionaries (Benecke/M Lexer), early Hightion German of Dictionaries dictionaries on around regional 1800 dialectsbuch, (Adelung (Rheinisches, W Pf and Campe), a selec- For modern German, we canwartssprache use (WDG) Klappenbach/Steinitz, Based on the broad andand representative algorithms data for base, the detectingbase goal and structures is understanding to could variance. developstructures, then and and, Ideally, test be variance methods vice modelled in versa,oped using biological variance for in concepts structures language in detected bioinformatics. might by be analyzing described language using models devel- is based on dictionary entries infor present–day standard units German, that with are special no provision made longer used today. 2.2 Network Analysis of Morpheme Decompositions available by our project partners at the university of Trier ( A Generic e–Infrastructure Exploiting Dictionary Information PoS(ISGC 2011 & OGF 31)003 Dietmar Seipel Time Levels - Gafel Gabel Gawel 4 gabel(e) Variance in Space and Time. ahd mhd nhd gabala 6 Figure 1: A metaLemma Represents a List of Dictionary Entries. Dictionaries wdg lexer ahdwb luxemb lothrwb Figure 2: We want to develop a generic e–Infrastructure for analyzing the inhomogenious dic- tionary entries of 18th,information to 19th a and kind 20thformation of as baseline century possible, encoding lexicographers e.g., keeping and as part for much of of speech, transforming the gender, the valuable and dictionary inflectional in- detail, encoded in lots 2.3 Techniques from Computer Science in genomes. Preliminaryproperties. tests have This could shown be that dueof to the language the corresponding and fact that networks genome the have might generative be similar processes the behind the comparable. evolution A Generic e–Infrastructure Exploiting Dictionary Information PoS(ISGC 2011 & OGF 31)003 ] 9 Xml for data format Dietmar Seipel Xslt ) on the Adelung Xml ] – on the Campe Dic- Dcg 10 Tei for parsing, querying, and transform- 5 Prolog Morpheme Network. in for relational databases, XQuery and ]. There exists quite a variety of well–known declarative Sql Figure 3: 8 , 7 P5 Guidelines for Electronic Text Encoding and Interchange [ , 3 Tei for declarative programming, and rules for decision support systems declarative programming Prolog –based tool for a declarative, rule–based control and annotation of morpheme ] for a fine–grained lexicographic analysis. For compiling the metaDictionary, we 2 We use Researching morpheme structures requires a fine–grained analysis and detailed ac- Prolog and grammars. Their advantages are,less that they error–prone, are and compact, rapidlytion flexibly programmable, by clear, extensible. parsing dictionary We have entries tested with declarative definite knowledge clause grammars extrac- ( processing, Dictionary and – within the framework of the TextGrid–Project [ have split the lexemes from thea electronic dictionaries into morpheme terms. We have built segmentations, which we testalignment on methods, WDG which data can at yield present. a more We precise can alignment extend for our the approach3. metaDictionary. by Annotation of Digitized Print Dictionaries in counting of the basicas linguistic units the of basis themorphological helps German combination to vocabulary. patterns, minimize Using irrespectivehistory. dictionary of influences entries the The of permanent dictionaries inflectional change weman of specialties selected language language fully and during in represent the the thedictionary) Middle and variance lexicographical New High of state High German of German, used Period the by (Lexer, Ger- educated and supported literary by circles from the around Grimm 1775 of different patterns.conformant The with annotated the dictionary datafor researching are into stored variance inavailable comparable for an to other genome researchers. structures, and they will be publicly languages for different areas: tionary [ ing the linguistic data [ A Generic e–Infrastructure Exploiting Dictionary Information PoS(ISGC 2011 & OGF 31)003 is 4 ¨ Alchen, Dietmar Seipel from Figure das Aal ]. 10 ]. And we can apply the 13 elements. E.g., the dictionary entry Tei data format for researching into variance . Parsing of dictionary entries is a complex Verkleinerungswort, 6 Tei Prolog Fine–Grained Structuring. Entry of an Electronic Dictionary. Tei : Mz. die –e, 5 format. E.g., the dictionary entry for ¨ uche, ... Figure 5: Figure 4: Tei des –es, d. Mz. w. d. Ez. 1) Ein langer, runder ... Fisch ... 2) Ein Backwerk aus Butterteig ... 3) Die fal=schen Br Der Aal, des –s, is annotated as follows: