Towards the Addition of Pronunciation Information to Lexical Semantic Resources

Thierry Declerck
German Research Center for AI, Multilinguality and Language Technology
Stuhsatzenhausweg 3, D-66123 Saarbrücken, Germany
[email protected]

Lenka Bajčetić
Austrian Centre for Digital Humanities and Cultural Heritage
Sonnenfelsgasse 19, Wien 1010, Austria
[email protected]

Abstract

This paper describes ongoing work aiming at adding pronunciation information to lexical semantic resources, with a focus on open wordnets. Our goal is not only to add a new modality to those semantic networks, but also to mark heteronyms listed in them with the pronunciation information associated with their different meanings. This work could contribute in the longer term to the disambiguation of multi-modal resources, which combine text and speech.

1 Introduction

The work described in this paper aims at enriching lexical semantic databases by adding the modality of pronunciation, primarily targeting in our current work the Open English WordNet (McCrae et al., 2019a, 2020).¹ Pronunciation information is typically not associated with WordNet, but can be particularly relevant within the vision of contributing directly or indirectly to integrated lexical resources and architectures, like the ELEXIS Dictionary Matrix (McCrae et al., 2019b) or BabelNet (Navigli and Ponzetto, 2010), as well as text-to-speech systems which use WordNet or WordNet-based lexical resources or tools.

In a number of cases, homographs with different meanings are also characterised by different pronunciations. This can be the case across syntactic categories, but also within one category, as for example for the noun “lead”,² which has a different pronunciation per sense. This is exemplified in the combination of the IPA³ code [/lɛd/] and the definition:

    (“A heavy, pliable, inelastic metal element, having a bright, bluish color, but easily tarnished; both malleable and ductile, though with little tenacity. It is easily fusible, forms alloys with other metals, and is an ingredient of solder and type metal. Atomic number 82, symbol Pb (from Latin plumbum).”)

and of the IPA code [/liːd/] and the definition:

    (“The act of leading or conducting; guidance; direction, course”).

This phenomenon is called “heteronymy”. Although they share the same spelling, heteronyms have two (or more) different pronunciations that are associated with different meanings (Martin et al., 1981). By definition, these words are homographs which are not homophones. They can be considered as the opposite of polyphones, which are words with different pronunciations that are not associated with different meanings. Typical heteronym examples in English include “tear”, “bow”, and “row”.

The frequency of heteronymy varies across languages. For example, as of today, Wiktionary counts 723 cases for English,⁴ while only 21 cases are listed for French.⁵ But the number of concerned entries increases considerably if we take into account all the derived terms (including compounds and phrasal expressions) in which a heteronym entry occurs. So, for the “metal” sense of the “lead” entry, Wiktionary lists 77 derived terms, 32 of them being currently included as an entry in the dictionary. Some of them carry pronunciation information (“leadsman”), and some do not (“lead pencil”). Similarly, for the “curved” sense of “bow” Wiktionary lists 19 derived terms, for example “longbow”, all included as an entry in the dictionary. Some of them also do not carry pronunciation information, for example “bow harp”. Hence, a much larger number of Wiktionary entries can be considered as instances of heteronymy, if one lexical item in a compound or in a phrasal entry is itself included in Wiktionary as a heteronym.

¹ See also https://github.com/globalwordnet/english-wordnet.
² The two pronunciation and definition pairs for the noun “lead” displayed here are taken from the XML dump of the English edition of Wiktionary. The human readable page can be consulted at https://en.wiktionary.org/wiki/lead#Noun.
³ IPA stands for “International Phonetic Alphabet”. See also https://www.internationalphoneticassociation.org/.
⁴ https://en.wiktionary.org/wiki/Category:English_heteronyms, [consulted: 2021.01.28]
⁵ https://en.wiktionary.org/wiki/Category:French_heteronyms, [consulted: 2021.01.28]

2 Targeted Lexical Databases

Although our current work is primarily aimed at enriching WordNet, ultimately we aim at adding disambiguated pronunciation information to a series of lexical databases. Once the phonetic transcriptions are correctly stored in WordNet, this information can be propagated to BabelNet (Navigli and Ponzetto, 2012)⁶ and all other lexical resources which make use of WordNet.

2.1 Wordnets

As each WordNet is a sense inventory, it is particularly relevant to associate pronunciation information with the heteronyms it lists. Recently we witnessed the development of a new WordNet for English (McCrae et al., 2020), which is based on the Princeton WordNet (PWN, see (Fellbaum, 1998)), but follows an open source development policy. This makes this version of WordNet a good candidate for testing in the near future the addition of pronunciation information in a collaborative manner, using the corresponding GitHub platform.⁷ The Open English WordNet (OEW) data can be downloaded in various formats, including XML, LMF⁸ and RDF.

2.2 BabelNet

While BabelNet already combines wordnets and wiktionaries, as well as many other resources, it does not yet provide the phonetic transcription that it has extracted from various language versions of Wiktionary. Although BabelNet provides sound files in its word entries, those pronunciations are given by an external library that does not read from IPA codes. This library seems to be connected to the text-to-speech modules of the browser accessing the server, and utilises them to add pronunciation to some textual information on the BabelNet pages, like the entry and its associated definition(s) and example sentence(s).

Experimenting with BabelNet, we discovered that in fact a single pronunciation is provided for homographs, which leads to a number of wrong pronunciation examples. This shows the importance of considering the IPA phonetic transcriptions for all senses of a heteronym. This way, the disambiguated IPA code of each sense could be used as input to the sound file generator of BabelNet. We hope that our work will prove beneficial in this endeavour.

2.3 ELEXIS – Dictionary Matrix

The Dictionary Matrix, under development within the ELEXIS project,⁹ is a collection of linked dictionaries. The goal of this matrix is to enhance interoperability across resources and languages. For this, ELEXIS provides services for linking resources semi-automatically across languages at various matching levels such as headword, sense and lexeme. We plan to add pronunciation information to WordNet resources that are included in this linking exercise, as this can help in the particularly challenging sense linking task.

⁶ See also https://babelnet.org/.
⁷ Open English WordNet is accessible at https://github.com/globalwordnet/english-wordnet. It is also accessible via a GUI: https://en-word.net/.
⁸ LMF stands for “Lexical Markup Framework”, an ISO standard (Francopoulo et al., 2006), which has also been employed for encoding WordNet, as described for example in (Henrich and Hinrichs, 2010).
⁹ https://elex.is/.

3 Our Approach

The first step of our work consisted in accessing the XML dump of the English Wiktionary resource,¹⁰ and extracting from there, with the help of customised Python scripts, the pronunciation information associated with nouns, verbs, adjectives, and adverbs. As we can see in Figure 1, we also extracted the corresponding senses and associated example sentences, as we need to keep the relation of the pronunciation information with the corresponding meaning and the associated example sentences, if any are provided.

While we can report good progress in this task, there are still a few issues to solve, mainly due to the sometimes idiosyncratic way of encoding information in Wiktionary. While the overall XML structure of the lexical entries in Wiktionary is quite consistent, the linguistic information itself is encoded by making use of the Wiki mark-up language, with a number of options left to the (volunteering) encoders of the entries, so that extra lines of code are necessary for dealing with those recurrent idiosyncratic cases. Still, we have extracted a large amount of lexical information that we have checked for validity. The numbers are given and discussed in the next section.

¹⁰ The XML dumps of recent versions of the English edition of Wiktionary are available at https://dumps.wikimedia.org/enwiktionary/.

3.1 Some Figures

In this section we give some quantitative details on our current extraction work from Wiktionary.¹¹ A Wiktionary page is selected for processing if it [...]

[...] elements are the ones we call “entries” in the list of figures displayed just above.

On average, there are only 1.07 entries per English section in the selected pages. Many Wiktionary pages are about morphological variants of a lemma form, and those typically do not include PoS ambiguities. Therefore, we do not observe a significant amount of such PoS ambiguities in the English section of the total amount of selected Wiktionary pages, but there are many more ambiguities to be seen if one concentrates on the Wiktionary pages that lead to the lemma forms.

We observe that 815,192 English entries are without pronunciation information. Inspecting those, we see that in many cases the entries are in fact dealing with morphological variations (e.g. plural) of the base form. In such cases we see the relatively straightforward possibility to automatically adapt the pronunciation information of the lemma to the derived form. Compound words, too, most often lack pronunciation information.
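As an illustration of how the pronunciation of a lemma might be adapted to a derived form, the following minimal Python sketch covers regular English plurals only. It is not taken from the scripts described in this paper: the function names `pluralise_ipa` and `fill_plural_pronunciations`, the input dictionaries, and the choice to apply the standard plural allomorphy rule (/ɪz/ after sibilants, /s/ after other voiceless consonants, /z/ elsewhere) are all assumptions made for the example. Irregular plurals would need a separate treatment.

```python
# Hypothetical helpers, not part of the extraction scripts described above.
# Assumption: only regular English plurals are handled.

# Final sounds that select the /ɪz/ allomorph (sibilants and affricates).
SIBILANTS = ("tʃ", "dʒ", "s", "z", "ʃ", "ʒ")
# Final voiceless non-sibilant consonants that select the /s/ allomorph.
VOICELESS = ("p", "t", "k", "f", "θ")


def pluralise_ipa(lemma_ipa: str) -> str:
    """Derive the IPA of a regular plural from the IPA of its lemma."""
    core = lemma_ipa.strip("/[]")  # drop the surrounding delimiters
    if core.endswith(SIBILANTS):
        suffix = "ɪz"
    elif core.endswith(VOICELESS):
        suffix = "s"
    else:
        suffix = "z"
    return f"/{core}{suffix}/"


def fill_plural_pronunciations(lemma_ipa, plural_of):
    """Map each plural form to the plural IPA of every lemma transcription.

    lemma_ipa: dict from lemma to a list of IPA strings (one per sense group).
    plural_of: dict from plural form to its lemma.
    Plural forms whose lemma has no known transcription are skipped.
    """
    return {
        plural: [pluralise_ipa(ipa) for ipa in lemma_ipa[lemma]]
        for plural, lemma in plural_of.items()
        if lemma in lemma_ipa
    }


if __name__ == "__main__":
    known = {"lead": ["/lɛd/", "/liːd/"], "bus": ["/bʌs/"]}
    missing = {"leads": "lead", "buses": "bus"}
    print(fill_plural_pronunciations(known, missing))
```

Keeping one transcription per sense group of the lemma means that a heteronym such as “lead” yields two candidate plural pronunciations, so the link between pronunciation and meaning is preserved in the derived form.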
