Nomen Omen. Enhancing the Latin Morphological Analyser Lemlat with an Onomasticon
Marco Budassi
Università di Pavia
Corso Strada Nuova, 65
27100 Pavia - Italy
[email protected]

Marco Passarotti
Università Cattolica del Sacro Cuore
Largo Gemelli, 1
20123 Milan - Italy
[email protected]

Abstract

Lemlat is a morphological analyser for Latin, which shows a remarkably wide coverage of the Latin lexicon. However, the performance of the tool is limited by the absence of proper names in its lexical basis. In this paper we present the extension of Lemlat with a large Onomasticon for Latin. First, we describe and motivate the automatic and manual procedures for including the proper names in Lemlat. Then, we compare the new version of Lemlat with the previous one, by evaluating their lexical coverage of four Latin texts of different eras and genres.

1 Introduction

Since the time of the Index Thomisticus by Father Roberto Busa (Busa, 1974-1980), which is usually mentioned among the first electronic (nowadays called “digital”) annotated corpora available, NLP tools for automatic morphological analysis and lemmatisation of a richly inflected language like Latin have been needed. Over the last decades, this need was fulfilled by a number of morphological analysers for Latin. Among the most widespread ones are Morpheus (Crane, 1991), Whitaker’s Words (http://archives.nd.edu/words.html) and Lemlat (Passarotti, 2004). Over the past ten years, such tools have become essential, in light of a number of projects aimed at developing advanced language resources for Latin, like treebanks.¹

The most recent advances in linguistic annotation of Latin treebanks are moving beyond the level of syntax, by performing semantic-based tasks like semantic role labelling and anaphora and ellipsis resolution (Passarotti, 2014). In particular, in the area of Digital Humanities there is growing interest in Named Entity Recognition (NER), especially for purposes of geographical-based analysis of texts.

NER is a sub-branch of Information Extraction, whose inception goes back to the Sixth Message Understanding Conference (MUC-6) (Grishman and Sundheim, 1996). NER aims at recognising and labelling (multi)words as names of people, things, places, etc. Since MUC-6, NER has largely expanded, with several applications to ancient languages as well (see, for example, Depauw and Van Beek, 2009).

Although Lemlat provides quite a large coverage of the Latin lexicon, its performance is limited by the absence of an Onomasticon in its lexical basis, which would be helpful for tasks like NER. Given that in Latin proper names undergo morphological inflection, in this paper we describe our work of enhancing Lemlat with an Onomasticon. The paper is organised as follows. Section 2 presents the basic features of Lemlat. Section 3 describes our method to enhance Lemlat with an Onomasticon, by detailing the rules for the automatic enhancement and discussing the most problematic kinds of words. Section 4 evaluates the rules and presents one experiment run on four Latin texts. Section 5 is a short conclusion and sketches future work.

2 Lemlat

The lexical basis of Lemlat results from the collation of three Latin dictionaries (Georges and Georges, 1913-1918; Glare, 1982; Gradenwitz, 1904). It counts 40,014 lexical entries and 43,432 lemmas, as more than one lemma can be included in the same lexical entry.

¹ Three dependency treebanks are currently available for Latin: the Latin Dependency Treebank (Bamman and Crane, 2006), the Index Thomisticus Treebank (Passarotti, 2009) and the Latin portion of the PROIEL corpus (Haug and Jøndal, 2008).

Proceedings of the 10th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH), pages 90–94, Berlin, Germany, August 11, 2016.
© 2016 Association for Computational Linguistics

Given an input wordform that is recognised by Lemlat, the tool outputs the corresponding lemma(s) and a number of tags conveying (a) the inflectional paradigm of the lemma(s) (e.g. first declension noun) and (b) the morphological features of the input wordform (e.g. singular nominative), as well as the identification number (ID) of the lemma(s) in the lexical basis of Lemlat. No contextual disambiguation is performed.

For instance, given the input wordform abamitae (“great-aunt”), Lemlat outputs the corresponding lemma (abamita, ID: A0019), the tags for its inflectional paradigm (N1: first declension noun) and those for the morphological features of the input wordform (feminine singular genitive and dative; feminine plural nominative and vocative).

The basic component of the lexical look-up table used by Lemlat to analyse input wordforms is the so-called LES (“LExical Segment”). The LES is defined as the invariable part of the inflected form (e.g. abamit for abamit-ae). In other words, the LES is the sequence (or one of the sequences) of characters that remains the same throughout the inflectional paradigm of a lemma (hence, the LES does not necessarily correspond to the word stem).

Lemlat includes a LES archive, in which each LES is assigned an ID and a number of inflectional features, among which are a tag for the gender of the lemma (for nouns only) and a code (called CODLES) for its inflectional category. The CODLES determines which endings the LES is compatible with, namely those of its inflectional paradigm. For instance, the CODLES for the LES abamit is N1 (first declension nouns) and its gender is F (feminine). The wordform abamitae is thus analysed as belonging to the LES abamit because the segment -ae is recognised as an ending compatible with a LES with CODLES N1.

3 Enhancing Lemlat. Method

The bedrock of our work is Busa’s (1988) Totius Latinitatis Lemmata, which contains the list of the lemmas (92,052) from the 5th edition of Lexicon Totius Latinitatis (Forcellini, 1940). In Busa (1988), three kinds of metadata are assigned to each lemma: (a) a code for the section of the dictionary in which the lemma occurs (e.g. ON: the lemma occurs in the Onomasticon), (b) a code for the inflectional paradigm the lemma belongs to and its gender (e.g. BM: second declension masculine nouns) and (c) the number of lines of the lexical entry for the lemma in Forcellini.

In order to enhance Lemlat with Forcellini’s Onomasticon, we first extracted from Busa (1988) the list of those lemmas that occur in the ON section. This list counts 28,178 lemmas. Then, we built a number of rules to automatically include the lemmas of the Onomasticon into the lexical basis of Lemlat.

3.1 Types of Rules

Including the Onomasticon of Forcellini into Lemlat means converting the list of proper names provided by Busa (1988) into the same format as the LES archive. In order to perform this task as automatically as possible, we built a number of rules to extract the relevant information for each lemma in the list, namely its LES, CODLES and gender. By exploiting the morphological tagging of Busa (1988), which groups sets of lemmas showing common inflectional features, our rules automatically treat such inflectionally regular groups. In total, we wrote 122 rules, which fall into four types.

The first type (60 rules) builds the LES by removing one or more characters from the right side of the lemma. Such a removal is constrained by the code for the inflectional paradigm of the lemma, which is then used to create both the CODLES and the tag for the gender. For instance, the lemma marcus (“Mark”) is assigned the inflectional paradigm BM in Busa (1988). One rule states that the LES for BM lemmas ending in -us is built by removing the last two characters from the lemma (marcus > marc). The inflectional code BM stands for second declension (B) masculine (M) nouns: this is converted into the CODLES of Lemlat for second declension nouns (B > N2) and into the tag for masculine gender (M > m).

The second type of rules (19) adds one or more characters on the right side of the lemma to build the LES. Again, this is done according both to the inflectional paradigm and to the ending of the lemma in Busa (1988). For instance, the LES for lemmas with inflectional code CM (third declension masculine nouns) and ending in -o is built by adding an -n after the last character. One example is the lemma bappo (“Bappo”), whose LES is bappon, as third declension imparisyllabic nouns are analysed by Lemlat by using the basis of their singular genitive (bappon-is).

The third type of rules (19) replaces one or more characters on the right side of the lemma with others. For instance, the LES of clemens (“Clement”, third declension masculine noun ending in -s, with singular genitive clement-is) is built by replacing the final -s with a -t (clement).

The last type of rules (24) deals with those lemmas that are equal to their LES (no change is needed). These are uninflected nouns (like hamilcar, “Hamilcar”), which can be easily retrieved because they are assigned a specific inflectional code in Busa (1988).

3.2 Problematic Cases

Not all inflectional paradigms are regular enough to allow for a fully automatic rule-based treatment.

4 Evaluating the Enhancement

We evaluated the enhancement of Lemlat with the Onomasticon of Forcellini in two steps. First, we focused on the accuracy of the rules for automatic enhancement. Then, we compared the new version of Lemlat with the previous one by the lexical coverage they provide for four Latin texts.

4.1 Rules

We evaluated the quality of the rules for auto-
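As an illustration of the four rule types of Section 3.1 and of the LES-based look-up of Section 2, the following is a minimal Python sketch and not the actual implementation of Lemlat. The names Rule, RULES, ENDINGS, build_les and analyse are hypothetical. The CODLES value "N3", the Busa code "ZZ" used for lemmas equal to their LES, the restriction of the replacement rule to lemmas ending in -ns, and the tiny table of endings are assumptions made only for the example; the codes BM, CM, N1 and N2 and the example lemmas are those given in the text.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Rule:
    busa_code: str  # inflectional code in Busa (1988), e.g. "BM"
    ending: str     # lemma ending the rule applies to ("" matches any lemma)
    strip: int      # number of characters removed from the right
    append: str     # characters added on the right
    codles: str     # resulting CODLES
    gender: str     # resulting gender tag


# Toy rule table, one rule per type. "N3" and "ZZ" are placeholders (assumptions).
RULES = [
    Rule("BM", "us", 2, "", "N2", "m"),   # type 1, removal: marcus > marc
    Rule("CM", "o", 0, "n", "N3", "m"),   # type 2, addition: bappo > bappon
    Rule("CM", "ns", 1, "t", "N3", "m"),  # type 3, replacement: clemens > clement
    Rule("ZZ", "", 0, "", "N3", "m"),     # type 4, lemma equal to LES: hamilcar
]


def build_les(lemma: str, busa_code: str) -> Optional[Tuple[str, str, str]]:
    """Return (LES, CODLES, gender) for a lemma, or None if no rule applies."""
    for rule in RULES:
        if rule.busa_code == busa_code and lemma.endswith(rule.ending):
            base = lemma[: len(lemma) - rule.strip]
            return base + rule.append, rule.codles, rule.gender
    return None


# Toy subset of the endings compatible with each CODLES (assumption).
ENDINGS = {"N1": {"ae"}, "N2": {"i"}, "N3": {"is"}}

Archive = Dict[str, List[Tuple[str, str]]]  # LES -> [(lemma, CODLES), ...]


def analyse(wordform: str, archive: Archive) -> List[Tuple[str, str, str, str]]:
    """Try every split LES + ending and keep those licensed by the CODLES."""
    hits = []
    for cut in range(1, len(wordform) + 1):
        les, ending = wordform[:cut], wordform[cut:]
        for lemma, codles in archive.get(les, []):
            if ending in ENDINGS.get(codles, set()):
                hits.append((lemma, les, ending, codles))
    return hits


# A one-entry "old" archive, extended with proper names through the rules.
archive: Archive = {"abamit": [("abamita", "N1")]}
for lemma, code in [("marcus", "BM"), ("bappo", "CM"), ("clemens", "CM")]:
    les, codles, _gender = build_les(lemma, code)
    archive.setdefault(les, []).append((lemma, codles))

print(analyse("abamitae", archive))
print(analyse("marci", archive))
```

In this sketch a rule is selected by the pair made of the Busa code and the ending of the lemma, which mirrors the way the rules are described above; a lemma matched by no rule returns None and would be left to manual treatment.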