Towards the Addition of Pronunciation Information to Lexical Semantic Resources


Thierry Declerck
German Research Center for AI, Multilinguality and Language Technology
Stuhlsatzenhausweg 3, D-66123 Saarbrücken, Germany

Lenka Bajčetić
Austrian Centre for Digital Humanities and Cultural Heritage
Sonnenfelsgasse 19, Wien 1010, Austria

Abstract

This paper describes ongoing work aiming at adding pronunciation information to lexical semantic resources, with a focus on open wordnets. Our goal is not only to add a new modality to those semantic networks, but also to mark heteronyms listed in them with the pronunciation information associated with their different meanings. This work could contribute in the longer term to the disambiguation of multi-modal resources, which combine text and speech.

1 Introduction

The work described in this paper aims at enriching lexical semantic databases by adding the modality of pronunciation, primarily targeting in our current work the Open English WordNet (McCrae et al., 2019a, 2020).[1] Pronunciation information is typically not associated with WordNet, but it can be particularly relevant within the vision of contributing directly or indirectly to integrated lexical resources and architectures, like the ELEXIS Dictionary Matrix (McCrae et al., 2019b) or BabelNet (Navigli and Ponzetto, 2010), as well as to text-to-speech systems which use WordNet or WordNet-based lexical resources or tools.

In a number of cases, homographs with different meanings are also characterised by different pronunciations. This can be the case across syntactic categories, but also within one category, as for example for the noun "lead",[2] which has a different pronunciation per sense. This is exemplified in the combination of the IPA[3] code [/lɛd/] and the definition:

("A heavy, pliable, inelastic metal element, having a bright, bluish color, but easily tarnished; both malleable and ductile, though with little tenacity. It is easily fusible, forms alloys with other metals, and is an ingredient of solder and type metal. Atomic number 82, symbol Pb (from Latin plumbum).")

and of the IPA code [/liːd/] and the definition:

("The act of leading or conducting; guidance; direction, course").

This phenomenon is called "heteronymy". Although they share the same spelling, heteronyms have two (or more) different possible pronunciations that are associated with two (or more) different meanings (Martin et al., 1981). By definition, these words are homographs which are not homophones. They can be considered as the opposite of polyphones, which are words with different pronunciations that are not associated with different meanings. Typical heteronym examples in English include "tear", "bow", and "row".

The frequency of heteronymy varies across languages. For example, as of today, Wiktionary counts 723 cases for English,[4] while only 21 cases are listed for French.[5] But the number of concerned entries increases considerably if we take into account all the derived terms (including compounds and phrasal expressions) in which a heteronym entry occurs. So, for the "metal" sense of the "lead" entry, Wiktionary lists 77 derived terms, 32 of which are currently included as an entry in the dictionary. Some of them carry pronunciation information ("leadsman"), and some do not ("lead pencil"). Similarly, for the "curved" sense of "bow", Wiktionary lists 19 derived terms, for example "longbow", all included as an entry in the dictionary. Some of them also do not carry pronunciation information, for example "bow harp". Hence, a much larger number of Wiktionary entries can be considered as instances of heteronymy, if one lexical item in a compound or in a phrasal entry is itself included in Wiktionary as a heteronym.

2 Targeted Lexical Databases

Although our current work is primarily intended to enrich WordNet, ultimately we aim at adding disambiguated pronunciation information to a series of lexical databases. Once the phonetic transcriptions are correctly stored in WordNet, this information can be propagated to BabelNet (Navigli and Ponzetto, 2012)[6] and to all other lexical resources which make use of WordNet.

2.1 Wordnets

As each WordNet is a sense inventory, it is particularly relevant to associate pronunciation information with the heteronyms it lists. Recently we witnessed the development of a new WordNet for English (McCrae et al., 2020), which is based on the Princeton WordNet (PWN, see Fellbaum, 1998), but which aims at an open source development policy. This makes this version of WordNet a good candidate for testing in the near future the addition of pronunciation information in a collaborative manner, using the corresponding GitHub platform.[7] The Open English WordNet (OEW) data can be downloaded in various formats, including XML, LMF[8] and RDF.

2.2 BabelNet

While BabelNet already combines wordnets and wiktionaries, as well as many other resources, it does not yet provide the phonetic transcriptions that it has extracted from various language versions of Wiktionary. Although BabelNet provides sound files in its word entries, those pronunciations are given by an external library that does not read from IPA codes. This library seems to be connected to the text-to-speech modules of the browser accessing the server, and utilises them to add pronunciation to some textual information on the BabelNet pages, like the entry and its associated definition(s) and example sentence(s).

Experimenting with BabelNet, we discovered that in fact a single pronunciation is provided for homographs, which leads to a number of wrong pronunciation examples. This case shows the importance of considering the IPA phonetic transcriptions for all senses of a heteronym. This way, the disambiguated IPA code of each sense could be used as input to the sound file generator of BabelNet. We hope that our work will prove beneficial in this endeavour.

2.3 ELEXIS – Dictionary Matrix

The Dictionary Matrix, under development within the ELEXIS project,[9] is a collection of linked dictionaries. The goal of this matrix is to enhance interoperability across resources and languages. For this, ELEXIS provides services for linking resources semi-automatically across languages at various matching levels such as headword, sense and lexeme. We plan to add pronunciation information to WordNet resources that are included in this linking exercise, as this can help in the particularly challenging sense linking task.

3 Our Approach

The first step of our work consisted in accessing the XML dump of the English Wiktionary resource,[10] and extracting from there, with the help of customised Python scripts, the pronunciation information associated with nouns, verbs, adjectives, and adverbs. As we can see in Figure 1, we also extracted the corresponding senses and associated example sentences, as we need to keep the relation of the pronunciation information with the corresponding meaning and the associated example sentences, if any are provided.

While we can report good progress in this task, there are still a few issues to solve, mainly due to the sometimes idiosyncratic way of encoding information in Wiktionary. While the overall XML structure of the lexical entries in Wiktionary is quite consistent, the linguistic information itself is encoded by making use of the Wiki mark-up language, with a number of options left to the (volunteering) encoders of the entries, so that extra lines of code are necessary for dealing with those recurrent idiosyncratic cases. Still, we have extracted a large amount of lexical information that we have checked for validity. The numbers are given and discussed in the next section.

3.1 Some Figures

In this section we give some quantitative details on our current extraction work from Wiktionary. A Wiktionary page is selected for processing if it [...]

[...] elements are the ones we call "entries" in the list of figures displayed just above.

On average, there are only 1.07 entries per English section in the selected pages. Many Wiktionary pages are about morphological variants of a lemma form, and those typically do not include PoS ambiguities. Therefore, we do not observe a significant amount of such PoS ambiguities in the English section of the total amount of selected Wiktionary pages, but many more ambiguities can be seen if one concentrates on the Wiktionary pages that lead to the lemma forms.

We observe that 815,192 English entries are without pronunciation information. Inspecting those, we see that in many cases the entries are in fact dealing with morphological variations (e.g. plural) of the base form. In such cases we see the relatively straightforward possibility of automatically adapting the pronunciation information of the lemma to the derived form. Compound words, too, most often lack pronunciation information.

[1] See also https://github.com/globalwordnet/english-wordnet.
[2] The two pronunciation and definition pairs for the noun "lead" displayed here are taken from the XML dump of the English edition of Wiktionary. The human readable page can be consulted at https://en.wiktionary.org/wiki/lead#Noun.
[3] IPA stands for "International Phonetic Alphabet". See also https://www.internationalphoneticassociation.org/.
[4] https://en.wiktionary.org/wiki/Category:English_heteronyms [consulted: 2021.01.28]
[5] https://en.wiktionary.org/wiki/Category:French_heteronyms [consulted: 2021.01.28]
[6] See also https://babelnet.org/.
[7] Open English WordNet is accessible at https://github.com/globalwordnet/english-wordnet. It is also accessible via a GUI: https://en-word.net/.
[8] LMF stands for "Lexical Markup Framework", an ISO standard (Francopoulo et al., 2006), which has also been employed for encoding WordNet, as described for example in (Henrich and Hinrichs, 2010).
[9] https://elex.is/.
[10] The XML dumps of recent versions of the English edition of Wiktionary are available at https://dumps.wikimedia.org/enwiktionary/.
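The extraction step described in Section 3 can be illustrated with a minimal Python sketch. It is not the authors' script: the sample wikitext is a simplified, hand-written stand-in for the "lead" entry, and the regular expressions as well as the function names `pronunciations_by_etymology` and `is_heteronym` are assumptions made for illustration only. The sketch groups IPA transcriptions and sense definitions under each "Etymology" heading, so that a pronunciation stays attached to its meaning, and flags a page as a heteronym candidate when the groups carry different transcriptions.

```python
import re
from collections import defaultdict

# Assumed, simplified patterns: real Wiktionary entries use more template
# parameters and layout variants than these two expressions cover.
IPA_TEMPLATE = re.compile(r"\{\{IPA\|en\|([^{}]*)\}\}")
HEADING = re.compile(r"^=+\s*(.*?)\s*=+$")

# Hand-written, simplified wikitext (hypothetical, not copied from the dump).
SAMPLE_WIKITEXT = """\
==English==
===Etymology 1===
====Pronunciation====
* {{IPA|en|/lɛd/}}
====Noun====
# A heavy, pliable, inelastic metal element.
===Etymology 2===
====Pronunciation====
* {{IPA|en|/liːd/}}
====Noun====
# The act of leading or conducting.
"""


def pronunciations_by_etymology(wikitext):
    """Group IPA transcriptions and sense lines under each Etymology heading."""
    groups = defaultdict(lambda: {"ipa": [], "senses": []})
    current = None
    for raw_line in wikitext.splitlines():
        line = raw_line.strip()
        heading = HEADING.match(line)
        if heading:
            title = heading.group(1)
            if title.startswith("Etymology"):
                current = title
            continue
        if current is None:
            continue  # ignore anything before the first Etymology heading
        for arguments in IPA_TEMPLATE.findall(line):
            for argument in arguments.split("|"):
                # Keep only arguments that look like transcriptions.
                if argument.startswith("/") or argument.startswith("["):
                    groups[current]["ipa"].append(argument)
        if line.startswith("# "):
            groups[current]["senses"].append(line[2:])
    return dict(groups)


def is_heteronym(groups):
    """True if at least two etymology groups have different transcriptions."""
    distinct = {tuple(g["ipa"]) for g in groups.values() if g["ipa"]}
    return len(distinct) > 1


if __name__ == "__main__":
    entry = pronunciations_by_etymology(SAMPLE_WIKITEXT)
    for etymology, data in entry.items():
        print(etymology, data["ipa"], data["senses"])
    print("heteronym candidate:", is_heteronym(entry))
```

Keeping transcriptions keyed by etymology rather than in one flat list per page is what allows each sense to retain its own pronunciation. The idiosyncratic encodings mentioned in Section 3 would require additional handling that this sketch leaves out.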