Enhancing the Accuracy of Ancient Greek WordNet by Multilingual Distributional Semantics
Yuri Bizzoni, Riccardo Del Gratta, Federico Boschetti, Marianne Reboul

To cite this version:

Yuri Bizzoni, Riccardo del Gratta, Federico Boschetti, Marianne Reboul. Enhancing the Accuracy of Ancient Greek WordNet by Multilingual Distributional Semantics. Proceedings of the Second Italian Conference on Computational Linguistics, 2015, Trento, Italy. hal-03167983

HAL Id: hal-03167983 https://hal.archives-ouvertes.fr/hal-03167983 Submitted on 16 Mar 2021

Yuri Bizzoni (1), Riccardo Del Gratta (1), Federico Boschetti (1), Marianne Reboul (2)
(1) ILC-CNR, Pisa
(2) Université de Paris 4, Paris
{yuri.bizzoni,riccardo.delgratta}@gmail.com, [email protected], [email protected]

Abstract

English. We discuss a method to enhance the accuracy of a subset of the Ancient Greek WordNet based on the Homeric lexicon and the related conceptual network, by using multilingual semantic spaces built from aligned corpora.

Italiano. Esponiamo un metodo per migliorare l’accuratezza di un sottoinsieme dell’Ancient Greek WordNet, basato sul lessico Omerico e sulla relativa rete concettuale, attraverso l’uso di spazi semantici plurilingui costruiti su corpora paralleli allineati.

1 Introduction

The Ancient Greek WordNet (AGWN) represents the first attempt to build a WordNet for Ancient Greek (Bizzoni et al., 2014).

The AGWN synsets are aligned to Princeton WordNet (PWN) (Fellbaum, 1998); to Italian WordNet (IWN) (Roventini et al., 2003), developed at the Institute for Computational Linguistics “A. Zampolli” in Pisa; to the Italian section of MultiWordNet [1], developed at the Bruno Kessler Foundation; and to a Latin WordNet (LWN) created with the same criteria as AGWN and linked to Minozzi’s Latin WordNet (Minozzi, 2009) and (McGillivray, 2010), developed at the University of Verona. In this way the user can find the equivalents of a set of synonyms in different languages. The AGWN can be freely accessed through a Web interface [2], which allows authorized users to add or delete words in the synsets, adapt the glosses and validate the lexico-semantic relations [3].

We first created AGWN by bootstrapping Greek-English pairs from bilingual dictionaries and by assigning Greek words to the PWN synsets associated with the corresponding English translations. As a drawback of this method, a large number of synsets and lexico-semantic relations are spuriously over-generated by English homonymy and polysemy. As discussed in (Bizzoni et al., 2014), using PWN as a pivot resource [4] propagates the same drawback to the other connected WordNets in the CoPhiWordNet Platform. In order to improve the accuracy of a subset of AGWN synsets related to the Homeric lexicon and the related conceptual network, we automatically extracted word translations from Greek-Italian parallel texts by applying the distributional semantic strategies illustrated in the following sections, and verified how many of these translations were in CoPhiWn. According to the methodology explained in (Bond et al., 2008), trilingual resources (in our case the original Greek-English pairs extracted from dictionaries and the Greek-Italian pairs extracted from aligned translations) are useful to enhance the accuracy of bootstrapped WordNets.

[1] http://multiwordnet.fbk.eu
[2] GUI beta-version at http://www.languagelibrary.eu/new_ewnui
[3] In the following, when we use the term CoPhiWordNet Platform (CoPhiWn) we mean the three WordNets: AGWN, IWN and PWN.
[4] For example, PWN links AGWN to IWN through ILI (Vossen, 1998).

2 Translation Mining through Semantic Spaces

We present a way to automatically improve the accuracy of Ancient Greek word translations by applying the principles of distributional semantics to aligned corpora (Dumais et al., 1997) and (Bizzoni, 2015). We will first explain the rationale of this method and then show how it is useful to improve AGWN in several ways (see Section 2.7). Although Ancient Greek obviously does not have native speakers, we have at our disposal a great variety of translations of the same classical texts, written in several languages and in different historical periods. The study of large diachronic corpora of translations is both relevant to classical studies and a valuable source of information for building or improving the accuracy of multilingual lexico-semantic resources (see Section 3).

2.1 Aligning long and literary-biased translations to the original text

We applied a strategy to automatically align Greek-Italian parallel corpora in two main steps: in the first step we segmented the texts into small portions; in the second step we linked those portions together. The result is that each Ancient Greek segment is aligned to its translations. After the segment-to-segment alignment, we applied the distributional semantics method illustrated below, in order to identify word-to-word translations.

2.2 Distributional Semantics

It is argued by several linguists (Miller, 1971) and (Firth, 1975) that one of the best ways to define the meaning of a word is the study of its relations with the other words in the context. So it is possible to hypothesize that we learn the meaning of many new words thanks to the way they are linked to words we already know and, in general, that we learn the meaning of words by perceiving their verbal as well as non-verbal context. We can study semantic similarities between terms by quantifying their distribution: similar words will have similar contexts. In the same way, we can suppose that, in an aligned parallel corpus, a word and its translation will tend to appear in the same aligned segments. For this reason, the contextual segment of the original Greek word and the contextual aligned segment of the translation have the same identifier.

2.3 Semantic Spaces based on aligned corpora

There are several kinds of linguistic contexts that can be selected to study word similarity (Lenci, 2008):

• window-based collocates: two words co-occur if they appear in a given context window;

• text regions: two words co-occur if they appear in the same textual area, such as a document, a paragraph, and so on;

• syntactic collocates: two words co-occur if they appear in the same syntactic pattern, for example if they are the direct objects of a verb, etc.

Although the most typical approach to distributional semantics is the use of window-based collocates, this kind of context becomes useless in multilingual corpora, since words in different languages do not share a common context. We use the method based on text-region collocates, which considers every pair of aligned segments as the default textual area. Word vectors of 0s and 1s in both languages are constructed according to the absence/presence of the word in the aligned pair. Thus, Ancient Greek and Italian words are mingled together in the vector space [5].

[5] In our experiment the resulting vector has a dimension of ∼60k.

2.4 Words and their translations tend to be close neighbors

With a similar procedure, equivalent Ancient Greek and Italian words will tend to have similar vectors, since they will appear in the same aligned chunks. Consequently they will be close in the resulting semantic space. To compute the proximity of vectors we used the cosine similarity measure (Sahlgren, 2006).

2.5 Parts of Speech Translations

Performance on nouns is higher than performance on verbs, adjectives and adverbs, due to larger translational fluctuations for the latter parts of speech.
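The binary segment vectors of Section 2.3 and the cosine ranking of Section 2.4 can be illustrated with a minimal sketch. It is not the authors' implementation: the toy corpus, the language tags `grc`/`ita` and the function names `build_space`, `cosine` and `nearest_translations` are assumptions made for illustration. Each 0/1 vector is stored sparsely as the set of segment identifiers in which the word occurs, so the cosine of two vectors reduces to the size of the intersection divided by the square root of the product of the two set sizes.

```python
import math
from collections import defaultdict

# Toy aligned corpus (hypothetical data): each pair is one aligned segment,
# with a transliterated Greek side and a tokenized Italian side.
aligned_segments = [
    (["pólemos", "kakós"], ["guerra", "funesta"]),
    (["pólemos", "hémar"], ["guerra", "giorno"]),
    (["hémar", "thálassa"], ["giorno", "mare"]),
    (["thálassa"], ["mare"]),
]

def build_space(segments):
    """Map each (language, word) to the set of segment ids it occurs in.

    The set is a sparse encoding of the 0/1 presence vector: one dimension
    per aligned segment pair, shared by both languages.
    """
    space = defaultdict(set)
    for seg_id, (greek, italian) in enumerate(segments):
        for word in greek:
            space[("grc", word)].add(seg_id)
        for word in italian:
            space[("ita", word)].add(seg_id)
    return dict(space)

def cosine(a, b):
    """Cosine similarity of two binary vectors given as sets of active dimensions."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

def nearest_translations(space, greek_word, k=5):
    """Rank Italian words by cosine similarity to a Greek word and keep the top k."""
    source = space.get(("grc", greek_word), set())
    scored = [
        (cosine(source, vector), word)
        for (lang, word), vector in space.items()
        if lang == "ita"
    ]
    # Highest similarity first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(word, round(score, 3)) for score, word in scored[:k]]

space = build_space(aligned_segments)
print(nearest_translations(space, "pólemos", k=2))
# prints [('guerra', 1.0), ('funesta', 0.707)]
```

In this toy corpus guerra occurs in exactly the same segments as pólemos and therefore ranks first, while funesta shares only one of the two segments. Tagging words by language is only a convenience for filtering candidate translations; all words still live in the same space, as described above.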

Although verbs are more polysemous than nouns, we are apparently able to find relevant verb translations: uccidere - kteíno (to kill), morire - thnésko (to die), amare - philéo (to love) and even essere - eimí (to be). Adjectives are similarly difficult, but we found acceptable results in this category as well: bello - kalòs (beautiful), nobile - agauòs (noble). Interestingly, among color adjectives we were only able to retrieve the translations of black and white: nero - mélas (black), bianco - leukós (white). Color adjectives in Ancient Greek are naturally complex to analyze, since it is hard to retrieve their exact meaning in the absence of speakers; this indeterminacy apparently propagates to our outcomes. Finally, it is also relevant to observe that extremely polysemous categories like adverbs in some cases find a correct translation: ek - fuori (out), ou - non (not).

2.6 Data Presentation and Some Results

We extracted the five most similar items for 121 Ancient Greek words (randomly chosen from different frequency groups) from a semantic space built on five complete Iliad translations and four complete Odyssey translations in Italian, aligned to the original texts. The original data resulted in 605 rows (121 times 5 pairs). When it comes to verifying whether a Greek/Italian pair is mapped in CoPhiWn, we expect that modern polysemy, i.e. the polysemy induced by the English-to-Italian mapping, will increase the number of pairs. Indeed, we found that the 605 pairs correspond to 736 possible Greek-English-Italian triples. However, only 176 triples were actually found in CoPhiWn. A manual validation of the resulting set excluded 13 triples caused by modern polysemy, reducing the found triples to 164. Not surprisingly, the coverage of the triples in CoPhiWn (∼23%) is quite close to the coverage of AGWN (∼28%), cf. (Bizzoni et al., 2014).

2.7 AGWN: strengthening bilingual links

If an Ancient Greek word is linked to an Italian word in CoPhiWn and it is distributionally near to the same Italian word in a semantic space, the probability that this link is correct is high. For instance, the word pólemos, frequent in Homer, is linked in CoPhiWn to an Italian synset composed of the words guerra, battaglia, ostilità. The first two terms also appear to be the nearest Italian terms to the word pólemos (war, battle) in our semantic space. This match increases the probability that guerra and battaglia are sound translations of pólemos, and thus that the Italian and Greek synsets are correctly interlinked.

In CoPhiWn the word hémar (day) is linked to the synonyms giorno, giornata, whereas in our semantic space it appears very similar to the word giorno only. Even so, the distributional information from our semantic space reinforces the association between hémar and the overall Italian synset.

This way of retrieving crosslingual information from textual corpora is highly helpful for discovering errors due to polysemy in different languages. For instance, in CoPhiWn, the word astér (star) is linked both to the synset associated with the word stella, glossed as star in the sky, and to the synset associated with the word divo, glossed as star in show business, due to the intermediation of the English word star [6], while, as expected, astér is distributionally similar only to stella in our semantic space. The word dóru (spear and mast) is linked on the one hand to the synset asta, arma and on the other hand to prora, prua, glossed as parts of the boat, which are synecdochically related to the mast; in our semantic space, however, it appears near only to the words of the first group, allowing us to score only the first equivalence higher. It is important to remember that we can run into cases of stylistically biased translations and synonyms: árma (chariot) can be cocchio or carro in different translations.

[6] This is one effect of the modern polysemy described in Section 2.6.

Additional examples are the following: the most similar terms to Italian mare in our semantic space are thálassa, háls, póntos, three words indicating the concept of sea, clustered together by their common translation. Likewise, scudo (shield) is associated both with aspís and sakós, soffio (breath) leads to pnóe and ánemos, through popolo (people) we find láos and démos, and among the most similar words to dolore (pain) we find both pénthos and álgos.
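The check described in this section can be sketched as a comparison between the Italian synsets linked to a Greek word and that word's distributional neighbours. This is a minimal illustration under stated assumptions, not the procedure actually run on CoPhiWn: the data structures `candidate_links` and `neighbours` and the functions `score_links` and `split_links` are hypothetical, the entries are limited to the examples quoted above, and counting shared words is only one simple way of scoring a link.

```python
# Toy data limited to the examples quoted in this section; in the experiment
# each Greek word would come with its five most similar Italian words.
candidate_links = {
    "pólemos": [("guerra", "battaglia", "ostilità")],
    "hémar": [("giorno", "giornata")],
    "astér": [("stella",), ("divo",)],
    "dóru": [("asta", "arma"), ("prora", "prua")],
}
neighbours = {
    "pólemos": ["guerra", "battaglia"],
    "hémar": ["giorno"],
    "astér": ["stella"],
    "dóru": ["asta", "arma"],
}

def score_links(links, neighbours):
    """Count, for every linked Italian synset, how many of its members
    also appear among the Greek word's distributional neighbours."""
    scored = {}
    for greek, synsets in links.items():
        near = set(neighbours.get(greek, []))
        scored[greek] = [
            (synset, len(near.intersection(synset))) for synset in synsets
        ]
    return scored

def split_links(scored):
    """Separate synsets with at least one supporting neighbour from the rest."""
    supported, unsupported = {}, {}
    for greek, pairs in scored.items():
        supported[greek] = [synset for synset, hits in pairs if hits > 0]
        unsupported[greek] = [synset for synset, hits in pairs if hits == 0]
    return supported, unsupported

scored = score_links(candidate_links, neighbours)
supported, unsupported = split_links(scored)
print(unsupported["astér"])
# prints [('divo',)]
```

A single shared word is treated as enough support for a whole synset, which mirrors the hémar example: only giorno is a neighbour, yet the link to the synset giorno, giornata is reinforced. Unsupported synsets such as divo for astér are candidates for manual review rather than automatic deletion.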

With the same mechanisms that allow us to find word-to-word translations, we can also find some small sets of potential synonyms in the same language by looking at their distributional behavior: so aithér is near to oúranos and hétor is near to thumós.

2.8 CoPhiWn: supporting hypernym/hyponym relations

A system based on distributional semantics tends to cluster together not only bilingual synonyms and translations, but also hypernyms and hyponyms. They tend to have distributionally similar, although not identical, behaviors, and it can easily happen that a word is translated with a hypernym, or more rarely with a hyponym, in another language. Systems to discriminate between hypernyms and synonyms in semantic spaces could become very useful in this context. See for example (Benotto, 2013) and (Lenci et al., 2012).

3 Conclusions and Future Work

We have developed a system to enhance the accuracy of the Ancient Greek WordNet. This system appears to be useful for verifying the soundness of automatically generated links between the Ancient Greek WordNet and WordNets in other languages. The method aims at increasing the precision of the Greek-Italian pairs, since it removes modern polysemy and discards translations in CoPhiWn that are not supported by translations attested in actual texts.

References

Giulia Benotto. 2013. Modelli distribuzionali delle relazioni semantiche: il caso dell’iperonimia. Animali, Umani, Macchine. Atti del convegno 2012 del CODISCO.

Yuri Bizzoni. 2015. The Italian Homer - The Evolutions of Translation Patterns between the XVIII and the XXI century. Master’s thesis, University of Pisa.

Yuri Bizzoni, Federico Boschetti, Harry Diakoff, Riccardo Del Gratta, Monica Monachini, and Gregory Crane. 2014. The Making of Ancient Greek WordNet. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), Reykjavik, Iceland, May. European Language Resources Association (ELRA).

Francis Bond, Hitoshi Isahara, Kyoko Kanzaki, and Kiyotaka Uchimoto. 2008. Boot-strapping a wordnet using multiple existing wordnets. In Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, and Daniel Tapias, editors, Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08), Marrakech, Morocco, May. European Language Resources Association (ELRA). http://www.lrec-conf.org/proceedings/lrec2008/.

Susan T Dumais, Todd A Letsche, Michael L Littman, and Thomas K Landauer. 1997. Automatic cross-language retrieval using latent semantic indexing. In AAAI spring symposium on cross-language text and speech retrieval, volume 15, page 21.

Christiane Fellbaum, editor. 1998. WordNet: An Electronic Lexical Database (Language, Speech, and Communication). The MIT Press, Cambridge, MA, USA.

John Rupert Firth. 1975. Modes of meaning. College Division of Bobbs-Merrill Company.

Alessandro Lenci. 2008. Distributional semantics in linguistic and cognitive research. From context to meaning: Distributional models of the lexicon in linguistics and cognitive science, special issue of the Italian Journal of Linguistics, 20(1):1–31.

Barbara McGillivray. 2010. Automatic selectional preference acquisition for Latin verbs. In Proceedings of the ACL 2010 Student Research Workshop, ACLstudent ’10, pages 73–78. ACL.

George A Miller. 1971. Empirical methods in the study of semantics. Semantics, an interdisciplinary reader in philosophy, linguistics, and psychology, pages 569–585.

Stefano Minozzi. 2009. The Latin WordNet Project. In Peter Anreiter and Manfred Kienpointner, editors, Latin Linguistics Today. Akten des 15. Internationalen Kolloquiums zur Lateinischen Linguistik, volume 137 of Innsbrucker Beiträge zur Sprachwissenschaft, pages 707–716.

Adriana Roventini, Antonietta Alonge, Francesca Bertagna, Nicoletta Calzolari, Christian Girardi, Bernardo Magnini, Rita Marinelli, and Antonio Zampolli. 2003. Italwordnet: building a large semantic database for the automatic treatment of Italian. Computational Linguistics in Pisa, Special Issue, XVIII-XIX, Pisa-Roma, IEPI, 2:745–791.

Magnus Sahlgren. 2006. The word-space model: Using distributional analysis to represent syntagmatic and paradigmatic relations between words in high-dimensional vector spaces.

Piek Vossen, editor. 1998. EuroWordNet: A Multilingual Database with Lexical Semantic Networks. Kluwer Academic Publishers, Norwell, MA, USA.