ASCII Phonetic Symbols for the World's Languages: Worldbet
James L. Hieronymus
AT&T Bell Laboratories, Murray Hill, NJ 07974, USA

Abstract

A new ASCII encoding of the International Phonetic Alphabet (IPA) and additional symbols for speech database labeling has been designed for all languages. Many of the previous ASCII versions of the IPA were targeted at European languages and therefore left out many of the sounds of other languages, or used IPA symbols for non-European sounds, like clicks, for plosive bursts. When an attempt was made to label a large number of languages with phonemic and phonetic symbols, these were found to be inadequate. The present scheme draws on earlier work by George Allen, Ian Maddieson, John Wells, Laver et al. and Hieronymus et al. Wherever possible, the present scheme was made similar to the base IPA symbols, so that many of the symbols will seem to have obvious meanings. Many of the symbols are the same as in other schemes. The underlying principle is that any spectrally and temporally distinct speech sound (not including pitch) which is phonemic in some language should have a separate base symbol. In most cases the base symbol consists of a concatenation of an IPA symbol and diacritics. Thus it is easy to recognize the phonemic base symbols and compare the same broad phonetic sound across languages. Tone languages have diacritics applied to the vowel phoneme symbols to properly identify the phonemes in these languages. Allophonic variations due to contextual coarticulation and stress may be labelled by a diacritic attached to the base symbol. It is possible that some speech sounds which are phonemic in at least one of the world's languages are missing from the present version. It is hoped that any oversights will be corrected in subsequent versions of Worldbet, and a standard method for constructing new symbols is presented.
Introduction

Many systems have been developed for writing the sounds of the world's languages. Many of the early workers made their own systems because there was no agreed standard or indeed knowledge of the complete speech sound inventory. The International Phonetic Alphabet was developed in 1888 and revised several times into its present form. It represents 105 years of experience with putting a symbol to each sound in all of the known languages in the world. The issues of economy of representation and the distinction between allophonic variation and true baseform sound have been worked out for many more languages since the IPA was originally formulated. Therefore it is a good place to begin for any multilanguage speech database labelling effort.

There are some sounds which are not normally included in the IPA which have been found to be useful in labelling large speech corpora like TIMIT, SCRIBE, BDSONS, and PHONDAT. These modern attempts at a standard ASCII form of the IPA resulted in TIMITBET, MRPA, SAMPA, and SAMPA Extended, to name a few of them. These phonetic alphabets were restricted to English or to European languages, and thus were too restricted in scope to be used in other major language families. The issue is whether or not the ASCII representation is consistent, complete and logical for all of the IPA symbols.

Worldbet is an attempt to have a phonetic alphabet which covers all of the world's languages in a systematic fashion. It is an ASCII version of the IPA plus a number of symbols which were found useful in database labelling and which are not currently in the official IPA set. This list of extra symbols may grow with time until all of the important phenomena have a coherent symbol representation.

This paper is organized to first cover the general principles of Worldbet, discuss earlier labeling sets, give specific symbol assignments, and discuss labeling methods.
Appendix A gives an exhaustive list of the Worldbet symbols and their corresponding labels in a few other systems, namely TIMITBET, SAMPA and JBET (a phonetic alphabet used in speech synthesis). Appendix B is a table of place and manner of articulation vs. Worldbet symbols. Appendix C gives examples of Worldbet symbol inventories for several languages.

1 General Principles

Worldbet is an ASCII version of the International Phonetic Alphabet (IPA) with additional broad phonetic symbols not presently in the IPA. It is designed for a large set of languages including Indian, Asian, African and European languages. Consideration of the special sounds in each of these languages led to the principle that each base symbol will represent a speech sound with a spectrally distinct time sequence. Each type of /r/ will have its own IPA-like designation, rather than the more graphemic r used in some label sets. Allophones like aspirated plosives will have a separate base symbol from unaspirated plosives if they are phonemic within the language in question; otherwise they will be marked using the base symbol plus a diacritic. Distinct means so different spectrally or temporally as to be perceptually different when the components are heard in isolation.

Vowels are classed into nominal place positions. It is recognized that the detailed vowel color may vary between languages for the same nominal vowel, yet separate symbols will be assigned only when the differences are large enough to constitute different phonemes. In actual labeling experience it has been found that most of the differences in phonetic labels between trained phoneticians were due to disagreements on the detailed vowel color, rather than the actual broad vowel color. Therefore, Worldbet base symbols will represent phonemic distinctions in some language, as in the plosive example.
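The rule for aspirated plosives above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the symbol strings (`"p"`, `"ph"`), the diacritic `"h"`, the `"_"` separator, and the names `LanguageInventory` and `label` are placeholders invented for the example, not entries from the Worldbet tables.

```python
from dataclasses import dataclass, field

# Hypothetical separator between a base symbol and a diacritic;
# the actual Worldbet convention is not given in this section.
DIACRITIC_SEP = "_"


@dataclass
class LanguageInventory:
    """Illustrative phonemic inventory for one language."""
    name: str
    # Maps (plain base symbol, feature) to a dedicated base symbol,
    # for features that are phonemic in this language.
    phonemic: dict = field(default_factory=dict)


def label(base: str, feature: str, diacritic: str, inv: LanguageInventory) -> str:
    """Return a dedicated base symbol if the feature is phonemic in the
    language; otherwise return the base symbol plus a diacritic."""
    key = (base, feature)
    if key in inv.phonemic:
        return inv.phonemic[key]
    return f"{base}{DIACRITIC_SEP}{diacritic}"


# Language A contrasts aspirated and unaspirated /p/; language B does not.
lang_a = LanguageInventory("A", {("p", "aspirated"): "ph"})
lang_b = LanguageInventory("B")

print(label("p", "aspirated", "h", lang_a))  # dedicated base symbol
print(label("p", "aspirated", "h", lang_b))  # base symbol plus diacritic
```

Keeping the phonemic status in a per-language inventory means the same acoustic event can receive a base symbol in one language and a diacritic-marked allophone label in another, which is the distinction the principle draws.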
The base symbols are thus meant to be broad phonetic, but can be used as surface phonemic symbols within a given language, as stated in the original principles of the IPA [1]. Since the IPA has been in use for over 100 years and has been actively developed and evolved over this period [2], it should have all of the phonemic distinctions observed in the world's languages to date. Therefore it is the natural starting point for any attempt to construct a phoneme set which is sufficient to cover all of the languages in the world.

Diacritics are used in general to modify the base symbols to deal with allophones which are due to coarticulation effects (e.g., labialized /s/ in the environment of /w/) or phonological context effects. The diacritic allows the particular allophone to be marked, and its base character is the phonemically based broad phone from which the allophone originates. Of course it is not always easy to determine what is an allophonic variation and what is a change of broad phonetic category. Normally the number of symbols used to label a particular language will be limited, to keep from having an overly large label inventory.

The motivating factor for Worldbet is to label speech for corpus-driven speech research, phonological inventories, automatic language identification, multi-language speech recognition, and multi-language speech synthesis. It should also be useful in constructing multi-lingual dictionaries. In all of the above uses it is most convenient to have each sound labeled with a particular symbol closely resemble all other sounds with the same label, no matter in which language it is uttered.

Previous Label Sets

Past work on ASCII-to-IPA symbol sets was reviewed, including the Klatt phonbet of Allen et al.
[3]; the PHONASCII system of Allen [4]; Arpabet, used in the first ARPA Speech Understanding Project [5]; TIMITBET, used in labelling the DARPA Acoustic Phonetic Database, which was collected by Texas Instruments and labelled by MIT [6]; the Esprit Speech Assessment Methodology Phonetic Alphabet (SAMPA) [7]; the Edinburgh Machine Readable Phonetic Alphabet (MRPA) [8]; the Alvey Project phonetic alphabet [9]; and the SCRIBE project phonetic label set (SAMPA Extended) [10]. These were generally concerned with one or a few Indo-European languages, and thus are missing a number of the symbols needed for other languages. For SAMPA some simplifying assumptions were made because it was thought that the symbols would be used for transcription within one language, not across languages. This leads to the same symbol being used for quite different sounds, most notably for /r/.

An effort at worldwide ASCII-to-IPA coverage by Ian Maddieson [11] of UCLA was thought to be too complicated for the present application. It is a more detailed label set aimed at fine phonetic distinctions in all the world's languages. It does not distinguish between diacritics and baseforms. With the full set of diacritics in Worldbet it should be possible to have the same level of detail, with the proviso that phonemes with multiple places of articulation might have to have baseform symbols assembled using the Worldbet linking character .