Heteronym Sense Linking

Lenka Bajčetić¹, Thierry Declerck¹,², John P. McCrae³

¹Austrian Centre for Digital Humanities and Cultural Heritage
²German Research Center for Artificial Intelligence (DFKI)
³Data Science Institute, NUI Galway

Abstract

This paper presents ongoing work which aims to semi-automatically connect pronunciation information to lexical semantic resources which currently lack such information, with a focus on WordNet. This is particularly relevant for the case of heteronyms, i.e. homographs that have different meanings associated with different pronunciations, as this is a factor that implies a re-design and adaptation of the formal representation of the targeted lexical semantic resources: in the case of heteronyms it is not enough to just add a slot for pronunciation information to each WordNet entry. Moreover, numerous tools and resources rely on WordNet, so we hope that enriching WordNet with valuable pronunciation information will prove beneficial for many applications in the future.

Introduction

There are many types of ambiguity in language, and one interesting example are homographs. These are words that are spelled the same but have different pronunciations. Specifically, homographs that have different meanings associated with different pronunciations are called heteronyms [3]. Heteronyms can cause great challenges for speech-to-text and text-to-speech systems. They also provide an interesting use-case for our endeavour to enrich WordNet with pronunciation information.

Recently, the Global WordNet Association (GWA) updated its Global Wordnet Formats [4], which were introduced to enable wordnets to have a common representation. One of the updates performed by GWA concerns the possibility to add pronunciation information to the entries of wordnets. This update is a great step towards integrating pronunciation information in wordnets. In this work, we start with the task of supporting an automated linking between the senses of the heteronyms we extracted from Wiktionary and those included in the Open English WordNet [5]. Since the Open English WordNet is a manually curated gold-standard resource, this makes it possible both to evaluate the linking work for this specific type of phenomenon and to build a training set for an extension of the linking work.

Figure 1: Bass is an example of a heteronymous word
Figure 2: An abstraction of our approach

Results

In the table below we compare the results of the four classifier models: Naive Bayes, Decision Tree, and two versions of the Random Forest classifier. We can see that the results are quite promising, especially for the Random Forest classifier.

Classifier                  Accuracy  Precision  Recall  F1-score
Naive Bayes                 0.65      0.67       0.65    0.65
Decision Tree               0.71      0.70       0.71    0.71
Random Forest - STS         0.79      0.80       0.79    0.79
Random Forest - Paraphrase  0.84      0.85       0.84    0.84

Table 2: Results of the sense linking task

In Table 3 we can see the importance of the different features for the Decision Tree and Random Forest classifiers. Interestingly, the Decision Tree relies most on the TF-IDF score, while the S-BERT similarity score is the most useful for our Random Forest model.

Classifier     S-BERT  LASER  TF-IDF  POS1  POS2
Decision Tree  0.36    0.15   0.43    0.05  0.02
Random Forest  0.39    0.23   0.28    0.06  0.04

Table 3: Relevance of the different features for the classifiers
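The feature-based setup behind these results can be sketched as follows. This is a minimal illustration, not the authors' actual code: the glosses, labels, and the helper function are invented, and the S-BERT and LASER similarity scores, which the real system derives from sentence embeddings, are stood in by toy numbers. Only the TF-IDF similarity feature is computed concretely.

```python
# Illustrative sketch of the sense-linking classifier setup (hypothetical data).
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_similarity(wordnet_gloss: str, wiktionary_def: str) -> float:
    """Cosine similarity between the TF-IDF vectors of the two definitions."""
    tfidf = TfidfVectorizer().fit_transform([wordnet_gloss, wiktionary_def])
    return float(cosine_similarity(tfidf[0], tfidf[1])[0, 0])

# One row per (WordNet sense, Wiktionary entry) candidate pair:
# [s_bert_sim, laser_sim, tfidf_sim, wiktionary_pos_match, wordnet_pos_match]
X = [
    [0.91, 0.85, tfidf_similarity("the lowest adult male singing voice",
                                  "the lowest male singing voice"), 1, 1],
    [0.20, 0.25, tfidf_similarity("the lowest adult male singing voice",
                                  "a freshwater fish of the perch family"), 1, 1],
]
y = [1, 0]  # 1 = the pair denotes the same sense, 0 = it does not

# n_estimators and max_depth are the two hyperparameters tuned in the paper;
# the values here are arbitrary.
clf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)
clf.fit(X, y)

# Per-feature importances, as reported in Table 3, are read off the fitted model.
for name, weight in zip(["S-BERT", "LASER", "TF-IDF", "POS1", "POS2"],
                        clf.feature_importances_):
    print(f"{name}: {weight:.2f}")
```

The importances printed at the end are how figures such as those in Table 3 can be obtained from a fitted scikit-learn model.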

Conclusions and Future Work

Since this is ongoing work, there is quite some space for future work. One of the most important things to be done is to compile a bigger gold-standard dataset, also in a multilingual setting. Another possibility to increase our dataset is to explore ways to up-sample the data, or to generate artificial data to increase the size of our corpus. Since the size of the data can negatively affect generalization and create difficulty in reaching the global optimum, this is an important issue when creating supervised classifiers. A promising next step to increase the impact of our work is the handling of compounds or phrasal entries in which a component is a heteronym, for example "lead pencil". Ultimately, we hope this work will prove beneficial, with the help of enriched wordnets, for handling heteronyms in text-to-speech systems as well [2, 6].

Method

Our work consists of compiling a small gold-standard dataset of heteronymous words, which contains short documents created for each WordNet sense, in total 136 senses matched with their pronunciation from Wiktionary.

Word     Pronunciation 1  Pronunciation 2  N of senses
bass     bæs              beɪs             9
bow      baʊ              boʊ              14
desert   dɪˈzɜːt          ˈdɛzət           4
house    haʊs             haʊz             14
lead     lɛd              liːd             31
live     lɪv              laɪv             19
raven    ˈɹeɪvən          ˈɹævən           5
row      ɹaʊ              ɹəʊ              10
subject  ˈsʌb.dʒɛkt       səbˈdʒɛkt        15
wind     wɪnd             waɪnd            15

Table 1: The gold standard

For the task of matching WordNet senses with their corresponding Wiktionary entries, we train several supervised classifiers which rely on various similarity metrics, and we explore whether these metrics can serve as useful features, as well as the quality of the different classifiers tested on our dataset. We compared the Naive Bayes, Decision Tree, and Random Forest classifiers from the Sklearn library. Additionally, since the Random Forest classifier has proven to achieve the best results, we decided to fine-tune it by trying different parameter values for the number of trees in the forest (estimators) and the maximum number of levels in each decision tree (max depth). The classifiers rely on five features: Wiktionary POS, WordNet POS, S-BERT similarity score, LASER similarity score, and TF-IDF similarity score.

The final step of our work entails the connection to the Open English WordNet and the representation of heteronyms in OntoLex-Lemon. Currently, there are no distinctions between heteronyms in WordNet, so these would need to be introduced. Our previous work [1] discusses the addition of pronunciation information in wordnets, with a focus on heteronyms. We use the OntoLex-Lemon representation model, as it has proven to be well adapted for linking this conceptual type of resources, and we propose a way to use OntoLex-Lemon to represent the combination of wordnet entries and lexical entries, which are themselves pointing to form variants displaying the corresponding pronunciation information. The code and data are available here: https://github.com/acdh-oeaw/heteronym sl.

Acknowledgements

Contributions by the German Research Center for Artificial Intelligence (DFKI GmbH) were supported in part by the H2020 project Prêt-à-LLOD with Grant Agreement number 825182. Contributions by the Austrian Centre for Digital Humanities and Cultural Heritage at the Austrian Academy of Sciences were supported in part by the H2020 project "ELEXIS" with Grant Agreement number 731015. The work described in this paper was also pursued in part in the larger context of the COST Action CA18209 NexusLinguarum, "European network for Web-centred linguistic data science".

References

[1] Thierry Declerck and Lenka Bajčetić. Towards the addition of pronunciation information to lexical semantic resources. In Proceedings of the 11th Global Wordnet Conference, pages 284–291, University of South Africa (UNISA), January 2021. Global Wordnet Association.
[2] Caroline Henton and Devang Naik. Disambiguating heteronyms in speech synthesis, 2014.
[3] Maryanne Martin, Gregory Jones, Douglas Nelson, and Louise Nelson. Heteronyms and polyphones: Categories of words with multiple phonemic representations. Behavior Research Methods & Instrumentation, 13:299–307, 05 1981.
[4] John P. McCrae, Michael Wayne Goodman, Francis Bond, Alexandre Rademaker, Ewa Rudnicka, and Luis Morgado Da Costa. The GlobalWordNet formats: Updates for 2020. In Proceedings of the 11th Global Wordnet Conference, pages 91–99, University of South Africa (UNISA), January 2021. Global Wordnet Association.
[5] John Philip McCrae, Alexandre Rademaker, Ewa Rudnicka, and Francis Bond. English WordNet 2020: Improving and extending a wordnet for English using an open-source methodology. In Proceedings of the LREC 2020 Workshop on Multimodal Wordnets, MMW@LREC 2020, Marseille, France, May 2020, pages 14–19, 2020.
[6] Xi Wang, Xiaoyan Lou, and Jian Li. Speech synthesis with fuzzy heteronym prediction using decision trees, 2011.