A phoneme clustering algorithm based on the obligatory contour principle

Mans Hulden
Department of Linguistics
University of Colorado
[email protected]

Abstract

This paper explores a divisive hierarchical clustering algorithm based on the well-known Obligatory Contour Principle in phonology. The purpose is twofold: to see if such an algorithm could be used for unsupervised classification of phonemes or graphemes in corpora, and to investigate whether this purported universal constraint really holds for several classes of phonological distinctive features. The algorithm achieves very high accuracies in an unsupervised setting of inferring a consonant-vowel distinction, and also has a strong tendency to detect coronal phonemes in an unsupervised fashion. Remaining classes, however, do not correspond as neatly to phonological distinctive feature splits. While the results offer only mixed support for a universal Obligatory Contour Principle, the algorithm can be very useful for many NLP tasks due to the high accuracy in revealing consonant/vowel/coronal distinctions.

1 Introduction [1]

It has long been noted in phonology that there seems to be a universal cross-linguistic tendency to avoid redundancy or repetition of similar speech features within a word or morpheme, especially if the phonemes are adjacent to one another. Many different names are given to variants of this general phenomenon in the linguistic literature: "identity avoidance" (Yip, 1998), "similar place avoidance" (Pozdniakov and Segerer, 2007), "obligatory contour principle" (OCP) (Leben, 1973), and "dissimilation" (Hempl, 1893). Some special cases such as haplology (avoidance of adjacent identical syllables) also fall in this general category of avoiding repetition along some dimension. The general phenomenon itself is supported by robust, although inconsistent, evidence across a number of languages. An early example is the observation of Spitta-Bey (1880),[2] that the Arabic language tends to favor combinations of consonant segments (phonemes) in morphemes that have different places of articulation; this was also later pointed out by Greenberg (1950), and those Semitic root outliers that deviate from this pattern were analyzed in depth in Frajzyngier (1979). In Proto-Indo-European (PIE) roots, which are mostly structured CVC, stop-V-stop combinations have been found to be statistically underrepresented (Iverson and Salmons, 1992). That is, PIE seems to obey a cross-linguistic constraint that disfavors two similar consonants in a root. Another specific example comes from Japanese, where the phenomenon called Lyman's law—which effectively says that a morpheme may consist of maximally one voiced obstruent—can also be interpreted as avoidance (Itô and Mester, 1986).

In light of such evidence, proposals have been put forth to define the concept of phoneme by distributional properties alone as opposed to the prevalent distinctive feature systems which are largely based on articulatory features (Fischer-Jørgensen, 1952). Elsewhere, after finding a statistical tendency to avoid similar place of articulation in word-initial and word-medial consonants, Pozdniakov and Segerer (2007) offer the argument that this phenomenon of "Similar Place Avoidance" is a statistical universal.

[1] All code and data sets used are available at https://github.com/cvocp/cvocp
[2] Nun hat, wie schon längst bemerkt ist, die arabische Sprache die Neigung, solche Buchstaben in einem Worte zu vereinigen, deren Organe weit von einander entfernt liegen, wie Kehllaute und Dentale. Translation: Now, the Arabic language, as has long been noted, has the tendency to combine such letters in a word where the place of articulation is distant, such as gutturals and dentals (Spitta-Bey, 1880, p. 15).

This phenomenon is often filed under the generic heading "obligatory contour principle" (Leben, 1973; McCarthy, 1986; Yip, 1988; Odden, 1988; Meyers, 1997; Pierrehumbert, 1993; Rose, 2000; Frisch, 2004). Originally, the OCP was applied as a theoretical constraint only to tone languages, with the argument that adjacent identical tones in underlying forms were rare, and this reflected an obligatory contour principle. The usage has since spread, and is assumed to account for segmental features other than tone.

It is unclear why the phenomenon is so widespread and why it manifests itself in the diverse ways it does. Accounts range from information compression to a diachronically visible hypercorrection by listeners who misperceive the signal and make the assumption that repetition is unlikely (Ohala, 1981).

This paper explores the simplest incarnation of the idea of similarity avoidance; namely, that two adjacent segments are preferably different in some way and that this difference reveals itself globally. That is, it is not assumed that the constraint is absolute; rather, an algorithm is developed that induces a grouping of unknown phoneme symbols so as to maximize potential alternation of clusters in a sequence of symbols, i.e. a corpus. If the OCP holds for phonological or phonetic features—primarily places of articulation—such a clustering algorithm could group phonemes along the lines of distinctive features. While, as we shall see, the observations do not support the presence of a strong universal OCP effect, the top-level clusters discovered by the algorithm correspond nearly 100% to the distinction of consonants and vowels—or syllabic and non-syllabic elements if expressed in terms of features. Furthermore, a tier-based variant of the algorithm additionally groups consonants somewhat reliably into coronal/non-coronal places of articulation, and also often distinguishes front vowels from back vowels. This is true even if the algorithm is run on alphabetic representations. An evaluation of the ability to detect the C/V distinction against a data set of 503 Bible translations (Kim and Snyder, 2013) is included, improving upon earlier work that attempts to distinguish between consonants and vowels in an unsupervised fashion (Kim and Snyder, 2013; Goldsmith and Xanthos, 2009; Moler and Morrison, 1983; Sukhotin, 1962). The algorithm is also more robust than earlier algorithms that perform consonant-vowel separation and works with less data, something that is also briefly evaluated.

This paper is structured as follows: an overview of previous work is given in section 2, mostly related to the simpler task of grouping consonants and vowels without labeled data, rather than identifying distinctive features. Following that, the general algorithm is developed in section 3, after which the experiments on both phonemic and graphemic representations in section 4 are reported. Four experiments are evaluated. The first uses phonemic data from 9 languages for clustering and evaluates clustering along distinctive feature lines. The second is a graphemic experiment that uses a data set of Bible translations in 503 languages where the task is to distinguish the vowels from the consonants; here, results are compared to Kim and Snyder (2013) on the same data set. That data is slightly noisy, motivating the third experiment, which is also graphemic and evaluates consonant-vowel distinctions on vetted word lists from data taken from the ACL SIGMORPHON shared task on morphological reinflection (Cotterell et al., 2016). The ability of a tier-based variant of the algorithm to separate coronals from non-coronals is evaluated in a fourth experiment where Universal Dependencies corpora (Nivre et al., 2017) are used.

The main results are presented in section 5. Given the high accuracy of the algorithm in C/V distinction with very little data and its consequent potential applicability to decipherment tasks, a small practical example application is evaluated which analyzes a fragment of text, a manuscript of only 54 characters.

2 Related Work

The statistical experiments of Andrey Markov (1913) on Alexander Pushkin's poem Eugene Onegin constitute what is probably one of the earliest discoveries of the fact that significant latent structure can be found by examining immediate co-occurrence of graphemes in text. Examining a 20,000-letter sample of the poem, Markov found a strong statistical bias that favored alternation of consonants and vowels. A number of computational approaches have since been investigated that attempt to reveal phonological structure in corpora.

Often, orthography is used as a proxy for phonology since textual data is easier to come by. A spectral method was introduced by Moler and Morrison (1983) with the explicit purpose of distinguishing consonants from vowels by a dimensionality reduction on a segment co-occurrence matrix through singular value decomposition (SVD). An almost identical SVD-based approach was later applied to phonological data by Goldsmith and Xanthos (2009). Hidden Markov Models coupled with the EM algorithm have also been used to learn consonant-vowel distinctions (Knight et al., 2006) as well as other latent structure, such as vowel harmony (Goldsmith and Xanthos, 2009). Kim and Snyder (2013) use Bayesian inference supported by simultaneous language clustering to infer C/V-distinctions in a large number of scripts simultaneously. We compare our results against a data set published in conjunction with that work. More directly related to the current work are Mayer et al. (2010) and Mayer and Rohrdantz (2013), who work with models for visualizing consonant co-occurrence in a corpus.

2.1 Sukhotin's algorithm

Sukhotin's algorithm (Sukhotin, 1962, 1973) is a well-known algorithm for separating consonants from vowels in orthographic data; good descriptions of the algorithm are given in Guy (1991) and Sassoon (1992). The idea is to start with the assumption that all segments in a corpus are consonants, then repeatedly and greedily find the segment that co-occurs most with other segments, and declare that a vowel. This is performed until a stopping condition is reached. The algorithm is known to perform surprisingly well (Foster, 1992; Goldsmith and Xanthos, 2009), although it is limited to the task it was designed to do—inferring a C/V-distinction (with applications to decipherment) without attempting to reveal any further structure in the segments. All the syllabic/non-syllabic distinction results in the current work are compared with the performance of Sukhotin's algorithm.
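To make the greedy procedure concrete, the following is a minimal Python sketch of a Sukhotin-style consonant/vowel separation in the common textbook formulation (cf. Guy, 1991); the function name and the toy word list are illustrative assumptions, not the paper's released code.

```python
from collections import defaultdict

def sukhotin_cv(words):
    """Greedy C/V split in the spirit of Sukhotin's algorithm (textbook formulation).

    Assume every symbol is a consonant; repeatedly promote the symbol with the
    largest remaining co-occurrence sum to vowel status, discount its neighbours,
    and stop once no remaining sum is positive.
    """
    # Symmetric counts of adjacent, non-identical symbol pairs.
    cooc = defaultdict(lambda: defaultdict(int))
    for w in words:
        for a, b in zip(w, w[1:]):
            if a != b:
                cooc[a][b] += 1
                cooc[b][a] += 1
    sums = {s: sum(c.values()) for s, c in cooc.items()}
    vowels = set()
    while sums:
        best = max(sums, key=sums.get)
        if sums[best] <= 0:
            break
        vowels.add(best)
        del sums[best]
        for s in sums:                      # neighbours of a vowel look less vowel-like
            sums[s] -= 2 * cooc[s][best]
    consonants = set(cooc) - vowels
    return vowels, consonants

if __name__ == "__main__":
    print(sukhotin_cv(["banana", "patata", "tomate"]))   # toy corpus, not from the paper
```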
3 General OCP-based algorithm

At the core of the new clustering algorithm is the OCP observation alluded to above, already empirically established in (Markov, 1913, 2006), that there is a systematic bias toward alternating adjacent segments along some dimension. To reveal this alternation, one can assume that there is a natural grouping of all segments into two initial sets, called 0 and 1, in such a way that the total number of 0-1 or 1-0 alternations between adjacent segments in a corpus is maximized. For example, consider a corpus of a single string abc. This can be split into two nonempty subsets in six different ways: 0 = {a, b} and 1 = {c}; 0 = {a} and 1 = {b, c}; 0 = {a, c} and 1 = {b}; and their symmetric variants, which are produced by swapping 0 and 1. Out of these, the best assignment is 0 = {a, c} and 1 = {b}, since it reflects an alternation of sets where abc ↦ 010. The 'score' of this assignment is based on the number of adjacent alternations, in this case 2 (01 and 10).

Outside of such small examples, which split perfectly into alternating sets, once this optimal division of all segments into 0 and 1 is found, there may remain some residue of adjacent segments in the same class (0-0 and 1-1). The sets 0 and 1 can then be partitioned anew into subsets 00, 01 (from 0) and 10 and 11 (from 1). Again, there may be some residue, and the partitioning procedure can be applied recursively until no further splitting is possible, i.e. until all of the adjacent segments fall into different clusters in the hierarchy.

More formally, given a corpus of words w_1, ..., w_n, where each word is a sequence of symbols s_1, ..., s_m, the top-level objective function that we want to maximize can be expressed as

    \sum_{w} \sum_{i} \mathbf{1}\big(\mathrm{Group}(s_i) \neq \mathrm{Group}(s_{i+1})\big)    (1)

where Group(s) is the set that segment s is in.

Given a suggested split of all the segments in a corpus into, say, the top-level disjoint sets 0 and 1, we obviously do not need to examine the whole corpus to establish the score but can do so by simply examining bigram counts of the corpus.

Still, finding just the top-level split of segments into 0 and 1 is computationally expensive if done by brute force, trying all the possible assignments of segments into 0 and 1 and evaluating the score for each assignment. Since there are 2^n ways of partitioning a set of n segments into two subsets (ignoring the symmetry of 0 and 1), such an approach is feasible in reasonable time only for small alphabets (< 25 symbols, roughly).
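As a concrete illustration of the objective in Equation (1), the sketch below scores a candidate split from bigram counts and finds the best split of a tiny alphabet by brute force, reproducing the abc example above; the helper names are ours, and this is not the paper's released implementation.

```python
from collections import Counter
from itertools import combinations

def bigram_counts(words):
    """Count adjacent symbol pairs once; the corpus need not be re-read per split."""
    return Counter((a, b) for w in words for a, b in zip(w, w[1:]))

def alternation_score(bigrams, group0):
    """Equation (1): number of adjacent pairs whose members fall in different groups."""
    return sum(c for (a, b), c in bigrams.items() if (a in group0) != (b in group0))

def best_split_bruteforce(words):
    """Try every 2-way split of the alphabet; feasible only for small inventories."""
    alphabet = sorted({s for w in words for s in w})
    bigrams = bigram_counts(words)
    best, best_score = None, -1
    # Fix the first symbol in group 0 to ignore the 0/1 symmetry.
    rest = alphabet[1:]
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            group0 = {alphabet[0], *combo}
            score = alternation_score(bigrams, group0)
            if score > best_score:
                best, best_score = group0, score
    return best, best_score

if __name__ == "__main__":
    # The paper's toy example: corpus "abc" is best split as {a, c} vs {b}, score 2.
    print(best_split_bruteforce(["abc"]))
```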

To address the computational search space problem, the algorithm is implemented by a type of simulated annealing (Kirkpatrick et al., 1983; Černý, 1985) to quickly find the optimum. The algorithm for the top-level split proceeds as follows:

(1) Randomly divide the set S into S′ and S″
(2) Draw an integer p from Uniform(1...K), where K depends on the cooling schedule
(3) Swap p random segments between S′ and S″
(4) If score is higher after swap, keep swap, else discard swap. Go to (2).

The idea is to begin with an arbitrary partition of S into S′ and S″, then randomly trying successively smaller and smaller random swaps of segments between the two sets according to a cooling schedule, always keeping the swap if the score improves. The cooling schedule was tested against corpora that use smaller alphabets where the answer is known beforehand by a brute-force calculation. The cooling was made slow enough to give the correct answer in 100/100 tries on such development corpora. In practice, this yields an annealing schedule where early swaps (the size of K) are sometimes as large as |S|, ending in K equaling 1 for several iterations before termination. This splitting is repeated recursively to produce new sub-splits until no splitting is possible, i.e. the score cannot improve by splitting a set into two subsets.

3.1 A tier-based variant

Many identity avoidance effects have been documented that seem to operate not by strict adjacency, but over intervening material, such as consonants and vowels, as discussed in the introduction. For example, Rose (2000) argues that OCP effects apply to adjacent consonants across intervening vowels in Semitic languages. This motivates a tier-based variant of the algorithm. In this modification, instead of repeatedly splitting sets based on a residue of adjacent segments that belong to the same set, we instead modify the corpus, removing segments after each split. Each time we split a set S into S′ and S″ based on a corpus C, we also create new corpora C′ and C″, where segments in S″ are removed from C′ and segments in S′ are removed from C″. Splitting then resumes recursively for S′ and S″, where S′ uses the corpus C′ and S″ the corpus C″. Figure 1 shows an example of this. Here, the initial corpus is C = telaka, and the initial segment set S = {a, e, k, l, t} is split into S′ = {a, e} and S″ = {k, l, t} on a first iteration. Likewise, the corpus is now modified by removing the S′ and S″ segments from C″ and C′ respectively, yielding new corpora C′ = eaa and C″ = tlk, and splitting proceeds on these subcorpora. This way, if, say, consonants and vowels operate on different tiers and get split first into top-level sets, the remaining consonants will become adjacent to each other on the next iteration, as will the vowels.

[Figure 1: Illustration of the tier-based variant of the clustering algorithm. The left-hand side (a) shows the original corpus (the single word telaka), where each character is assigned a top-level grouping, after which the corpus is modified to remove characters in the respective sets 0 and 1. The algorithm is then applied recursively to the modified corpora. The resulting clustering is shown in (b).]
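A minimal sketch of how the annealing search and the tier-based recursion could look is given below. The swap sizes, the number of iterations per cooling step, and the interpretation of "swap p segments" are assumptions on our part (the paper does not spell these constants out), and the sketch recurses all the way down to singletons instead of using the paper's stopping criterion of no further score improvement.

```python
import random
from collections import Counter

def alternation_score(bigrams, group0):
    # Objective (1): number of adjacent pairs whose symbols fall in different groups.
    return sum(c for (a, b), c in bigrams.items() if (a in group0) != (b in group0))

def anneal_top_split(words, iters_per_size=200, seed=0):
    """Top-level split via random swaps of shrinking size, keeping improvements only."""
    rng = random.Random(seed)
    alphabet = sorted({s for w in words for s in w})
    if len(alphabet) < 2:
        return set(alphabet), set()
    bigrams = Counter((a, b) for w in words for a, b in zip(w, w[1:]))
    rng.shuffle(alphabet)
    s0, s1 = set(alphabet[: len(alphabet) // 2]), set(alphabet[len(alphabet) // 2:])
    best = alternation_score(bigrams, s0)
    # Cooling schedule: the maximum swap size K shrinks from |S| down to 1.
    for K in range(len(alphabet), 0, -1):
        for _ in range(iters_per_size):
            p = rng.randint(1, K)
            out0 = set(rng.sample(sorted(s0), min(p, len(s0) - 1)))
            out1 = set(rng.sample(sorted(s1), min(p, len(s1) - 1)))
            cand0 = (s0 - out0) | out1
            new = alternation_score(bigrams, cand0)
            if new > best:                  # keep the swap only if the score improves
                s0, s1, best = cand0, (s1 - out1) | out0, new
    return s0, s1

def tier_split(words):
    """Tier-based variant: split, erase the other group from the corpus, recurse."""
    alphabet = {s for w in words for s in w}
    if len(alphabet) < 2:
        return sorted(alphabet)
    s0, s1 = anneal_top_split(words)
    corpus0 = ["".join(c for c in w if c in s0) for w in words]
    corpus1 = ["".join(c for c in w if c in s1) for w in words]
    return [tier_split(corpus0), tier_split(corpus1)]

if __name__ == "__main__":
    # The paper's toy corpus: the first split of "telaka" should be {a, e} vs {k, l, t}.
    print(tier_split(["telaka"]))
```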
4 Experiments

Four experiments are evaluated; the first experiment performs a full hierarchical clustering on phonemic data in 9 typologically divergent languages. The clusters are evaluated according to the following simple criterion: counting the number of splits in the tree that correspond to a split that could be expressed through a single ± phonological feature. For example, if the top-level split in the tree produced corresponds to exactly the consonants and vowels, it is counted as a 1, since this corresponds to the partitioning that would be produced by the phonological feature [±syllabic]. If there is no way to express the split through a single distinctive feature, it is counted as a 0. A standard phonological feature set like that given in sources such as Hayes (2011) or PHOIBLE (Moran et al., 2014) is assumed.
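The single-feature criterion can be made concrete with a small helper. The feature table format below (one dict of '+'/'-' values per phoneme) is a simplifying assumption rather than the exact representation of Hayes (2011) or PHOIBLE:

```python
def single_feature_splits(group_a, group_b, features):
    """Return the features whose +/- values exactly reproduce the split A vs. B.

    `features` maps each phoneme to a dict such as {"syllabic": "+", "coronal": "-"}.
    A split counts toward the paper's criterion if the returned list is non-empty.
    """
    segments = set(group_a) | set(group_b)
    names = set().union(*(features[s].keys() for s in segments))
    hits = []
    for f in sorted(names):
        values_a = {features[s].get(f) for s in group_a}
        values_b = {features[s].get(f) for s in group_b}
        uniform = len(values_a) == 1 and len(values_b) == 1
        # One group must be uniformly "+" and the other uniformly "-" (or vice versa).
        if uniform and values_a != values_b and (values_a | values_b) <= {"+", "-"}:
            hits.append(f)
    return hits

if __name__ == "__main__":
    # Toy three-segment chart; the feature values are illustrative, not a full system.
    feats = {"a": {"syllabic": "+", "coronal": "-"},
             "t": {"syllabic": "-", "coronal": "+"},
             "k": {"syllabic": "-", "coronal": "-"}}
    print(single_feature_splits({"a"}, {"t", "k"}, feats))   # ['syllabic'] -> counts as 1
    print(single_feature_splits({"k"}, {"a", "t"}, feats))   # []           -> counts as 0
```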

Language     Source                             Sample
Arapaho      (Cowell and Moss Sr, 2008)         towohei hiiTetiP tohnookeP tootheiPeihoo ...
Basque       Wikipedia + g2p                    meSikoko iriburuko espetSe batean sartu zuten eta meSiko ...
English      (Brent and Cartwright, 1996)       ju want tu si D@ bUk lUk DErz @ bOI wID hIz hæt ...
Finnish      (Aho, 1884) + g2p                  vai oli eilen kolmekymmentæ kotoapæinkø se matti ajelee ...
Hawaiian     Wikipedia + g2p                    Po ka Polelo¯ hawaiPi ka Polelo¯ makuahine a ka poPe maoli ...
Hungarian    (Gervain and Erra, 2012)           idZ nintS j6j dE tSEtSE hol 6 montSik6 hol v6n 6 montSi itt 6 ...
Italian      Wikipedia + g2p                    tSitta eterna kon abitanti e il komune piu popoloso ditalia ...
Polish       (Boruta and Jastrzebska, 2012)     gdýie jest bartuC gdýie jest ñe ma xodý tu a kuku ţo xovaS ...
Spanish      (Taulé et al., 2008) + g2p         un akueRdo entRe la patRonal i los sindikatos fRanTeses sobRe ...

Table 1: The data used for the phonemic clustering experiment, with sources indicated and a sample.

As mentioned above, the hypothesis under examination is that if the OCP is a strong universal principle, some non-trivial number of subclusters coinciding with single phonological distinctive features should be found. Both the non-tier-based and the tier-based algorithm are evaluated.

In the second experiment, the capacity of the algorithm to distinguish between consonants and vowels is evaluated, this time with graphemic data. To separate consonants from vowels—the most significant dimension of alternation between adjacent segments—the algorithm is run only for the top-level split, and it is assumed that the top two subsets will represent the consonants and vowels. Here, the results are compared with those of Kim and Snyder (2013), who train a hierarchical Bayesian model to perform this distinction over all the 503 languages at the same time. Sukhotin's algorithm is also used as another baseline.

In the third experiment, the capacity to distinguish consonants and vowels in graphemic data in the form of word lists—i.e. where no frequency data is known—is evaluated, compared against Sukhotin's algorithm.

4.1 Phonemic splitting

Nine languages from a diverse set of sources were used for this experiment (see Table 1). Some of the language data were already represented as phonemes (English, Hungarian, and Polish), while for the others, which have close-to-phonemic writing systems, a number of grapheme-to-phoneme (g2p) rules were created manually to convert the data into an International Phonetic Alphabet (IPA) representation. The conversion was on the level of the phoneme—actual allophones (such as /n/ being velarized to [ŋ] before /k/ in most languages, or /d/ being pronounced [ð] intervocalically in Spanish) were not modeled. Table 1 summarizes the data and gives a sample of each corpus.

For this data, the clustering algorithm was run as described above and each split was annotated with information about whether the split could be defined in terms of a single distinctive feature. Figure 2 shows the output of such a tree produced by the algorithm, with manual feature annotations. The percentage of correctly identified top-level splits (which are syllabic/non-syllabic segments) is also given, together with the corresponding results from Sukhotin's C/V-inference algorithm and Moler & Morrison's SVD-based algorithm.

[Figure 2: Resulting Finnish clusters with manual annotation of the distinctive feature splits. The top split separates the vowels from the consonants ([-syllabic]); lower splits are annotated with single features such as [-high], [+round], [+back], [-coronal], and [+trill].]

4.2 C/V distinction in Bible translations

This experiment relies on word lists and frequency counts from Bible translations covering 503 distinct languages. Of these, 476 use a Latin alphabet, 26 a Cyrillic alphabet, and one uses Greek. The data covers a large number of language groups, and has been used before by Kim and Snyder (2013) to evaluate accuracy in unsupervised C/V-distinction.

The algorithms were evaluated in two different ways: one, on a task where each C and V set is inferred separately for each language, and two, on a task where all languages' consonants and vowels are learned at once, as if the corpus were one language, for clearer comparison with earlier work. Both token-level accuracy and type-level accuracy are given, again for comparability reasons. For this data set, Sukhotin's C/V-algorithm and Moler & Morrison's algorithm were used as baselines in addition to the results of Kim and Snyder (2013).

4.3 C/V-distinction with word lists

An additional experiment evaluates the algorithm's capacity to perform C/V-distinction against Sukhotin's algorithm on a data set of 10 morphologically complex languages where lists of inflected forms were taken from the ACL SIGMORPHON shared task data (Cotterell et al., 2016). In this case, we have no knowledge of the frequency of the forms given, but need to rely only on type information. The Arabic data was transliterated into a latinate alphabet (by DIN 31635), with vowels marked. For the other languages, the native alphabet was used. Per-type accuracy is reported.

5 Results

On the first task, which uses phonemic data, consonant/vowel distinction accuracy is 100% throughout (see Table 2). Sukhotin's algorithm also performs very well in all except two languages. English, in particular, is a surprising outlier, with Sukhotin's algorithm only classifying 21.62% correctly. This is probably due to there existing a proportionately large number of syllabic phonemes in English (13/37). Moler & Morrison's algorithm has less than perfect accuracy in three languages. There is great variation in the OCP algorithm's capacity to produce splits that coincide with phonological features in both the tier-based and non-tier variants. Roughly speaking, the larger the phoneme inventory, the less likely it is for the splits to align themselves in accordance with phonological features. Also, since the tier-based variant naturally leads to more splits, the figures appear higher, since splits in lower levels of the tree, which contain few phonemes, can almost always be done along distinctive feature lines. The depth of the induced tree also correlates with the variety of syllable types permitted in the language. An extreme example of this is Hawaiian (Figure 3), which only permits V and CV syllables, yielding a very shallow tree where no consonants are split beyond the first level. English and Polish lie at the other extreme, with 37 splits each. This circumstance may perhaps be further leveraged to infer syllable types from unknown scripts.

[Figure 3: Hawaiian clusters reveal a predominantly CV/V syllable type since the non-syllabic branch of the tree is shallow. The top split separates the vowels (a e i o u) from the consonants (h k l m n p w ʔ); only the vowel branch is split further.]

On the C/V inference task for 503 languages, the OCP algorithm outperforms Sukhotin's algorithm and Kim and Snyder (2013) (K&S) when each language is inspected individually (see Table 3). However, for the case where we learn all distinctions at once, the OCP algorithm produces an identical result to Sukhotin's. Here the token-level accuracy also exceeds K&S, with 99.89 vs. 98.55.

The already high accuracy rate of the OCP algorithm on the Bible translation data is probably in reality even higher, especially when all languages are inspected at the same time. Out of the 343 grapheme types, OCP and Sukhotin only misclassify 7, and upon closer manual inspection, it is found that only two of these are bona fide errors. Five are errors in the gold standard—all in the Cyrillic-based data (see Table 5 for an overview of the errors in the gold standard or the classifications). The first actual error, Cyrillic s, only occurs in five word types in the entire corpus, and is always surrounded by other consonants. The other error, ǒ, is more difficult to interpret—it occurs in three typologically different languages: Akoose (bss), Northern Grebo (gbo), and Peñoles Mixtec (mil).

On the third task, where only word lists are available for grapheme classification into C/V, the OCP algorithm performs equally to Sukhotin's algorithm, except for one language (Navajo), where the OCP algorithm misclassifies one symbol fewer (see Table 4).

Language     Splits (OCP)     Splits (OCP, tier)    C/V (OCP)    C/V (Sukh.)    C/V (M&M)    Inventory size
Arapaho      9/14 (62.29)     11/15 (73.34)         100.0        100.0          100.0        16
Basque       8/14 (57.14)     16/20 (80.00)         100.0        100.0          100.0        21
English      3/12 (25.00)     15/25 (60.00)         100.0        21.62          94.59        37
Finnish      14/16 (87.50)    17/19 (89.47)         100.0        100.0          100.0        20
Hawaiian     4/5 (80.00)      8/12 (66.67)          100.0        100.0          92.30        13
Hungarian    10/20 (50.00)    21/31 (67.74)         100.0        96.97          100.0        33
Italian      7/11 (63.64)     15/20 (75.00)         100.0        100.0          100.0        22
Polish       10/21 (47.61)    23/33 (69.70)         100.0        100.0          97.30        37
Spanish      10/15 (66.67)    16/21 (76.19)         100.0        100.0          100.0        22

Table 2: Phonemic data: fraction of cluster splits that go exactly along single distinctive features (the Splits columns, for the OCP and tier-based OCP algorithms), together with the corresponding percentage. Also given are C/V-distinction accuracy (per type) for the OCP algorithm (OCP), Sukhotin's algorithm (Sukh.), and Moler and Morrison's algorithm (M&M).

                       OCP      Sukhotin    M&M      K&S
Individual
    Type               95.10    92.50       94.15    -
    Token              96.55    93.65       95.59    95.99
All
    Type               96.43    96.43       89.79    -
    Token              99.89    99.89       99.79    98.55

Table 3: Results on the 503-language Bible translations on consonant-vowel distinction. Both type and token accuracy are included. The Individual rows show the macro-averaged results of running all languages individually, and the All rows show the results of running all data at once. Here, 'OCP' is the current algorithm, 'Sukhotin' is Sukhotin's algorithm, 'M&M' is the SVD method of Moler & Morrison (1983), and 'K&S' is the method given in Kim & Snyder (2013).

Language     OCP             Sukhotin        M&M
Arabic       1/40 (97.50)    1/40 (97.50)    1/40 (97.50)
Finnish      0/31 (100.0)    0/31 (100.0)    0/31 (100.0)
Georgian     1/33 (96.97)    1/33 (96.97)    0/33 (100.0)
German       1/30 (96.67)    1/30 (96.67)    2/30 (93.33)
Hungarian    1/33 (96.97)    1/33 (96.97)    1/33 (96.97)
Maltese      2/30 (93.33)    2/30 (93.33)    0/30 (100.0)
Navajo       2/30 (93.33)    3/30 (90.00)    1/30 (96.67)
Russian      0/34 (100.0)    0/34 (100.0)    2/34 (94.12)
Spanish      0/33 (100.0)    0/33 (100.0)    2/33 (93.94)
Turkish      0/34 (100.0)    0/34 (100.0)    0/34 (100.0)
Average      97.48           97.14           97.25

Table 4: Per-type accuracy on C/V-distinction on word lists. Listed are the number of misclassifications and the accuracy per type.

6 Application to text fragments: the arrow of the gods

Given that the algorithm performs very well on consonant-vowel distinctions and groups segments along distinctive features better with small alphabets, an additional experiment was performed on a small manuscript to get a glimpse of potential application to cryptography and the decipherment of substitution ciphers. In this experiment, the writing system is known to be alphabetic (in fact Cyrillic), and the purpose is to examine the clustering induced by so little available data.

The birch bark letter number 292, found in 1957 in excavations in Novgorod, Russia, is the oldest known document in a Finnic language (Karelian), stemming most likely from the early 13th century (Haavio, 1964). The document consists of only 54 symbols, written in Cyrillic.[3] The clustering method (see Figure 4) identifies the vowels and consonants, except for the grapheme y (/u/). This is probably because the short manuscript renders the word nuoli (Latinized form) 'arrow' inconsistently in three different ways, with Cyrillic y = /u/ occurring in different places, making the segment difficult for the algorithm. The high vowels /i/ and /u/ (left) are also separated from the non-high vowels (right) /a/, /o/, and /e/ (the Cyrillic soft sign also falls in this group). Sukhotin's algorithm, which only infers the consonants and vowels, makes one more mistake than the current algorithm.

[3] The exact translation of the contents is a matter of dispute; the first translation, given by Yuri Yeliseyev in 1959, reads as follows (Haavio, 1964): God's arrow ten [is] your name // This arrow is God's own // [The] God directs judgment.

Symbol    Class    Comments
s         V        Macedonian, only occurs four times.
ь         V        Cyrillic soft sign (neither vowel nor consonant).
ѳ         V        Cyrillic; error, should be CYRILLIC SMALL LETTER BARRED O, a vowel.
|         V        Halh Mongolian, incorrect words in corpus.
й         C        Cyrillic, corresponds to the palatal approximant /j/; incorrect in gold.
ї         C        Ukrainian iotated vowel, sounds /ji/; unclear if vowel or consonant.
ǒ         C        Bantu languages: marks high tone/long vowel.

Table 5: The only misclassified segments in the 503-Bible test. The column Class gives the 'incorrect' classification of the OCP algorithm. Most of these are errors in the data/gold standard. Only the Cyrillic s, which occurs four times in the data (always adjacent to other consonants), and the ǒ symbol are actually incorrect.

7 Identifying coronal segments with the tier-based variant

Although the only really robust pattern reliably discovered by the algorithm is the distinction between consonants and vowels, there are strong patterns within some of the clusters that appear to be cross-linguistically constant, specifically with the tier-based variant. The first is that, whenever a five-vowel system is present (such as in Basque, Spanish, and Italian), after the topmost split which divides up the vowels and the consonants, the first split within the vowel group is almost always {a, o, u} and {e, i}. A second pattern concerns coronal segments. The first split within the consonant group tends to divide the segments into coronal/non-coronal segments. This is not an absolute trend, but happens far above chance. This is also true when running the algorithm on graphemic data, where coronals can be identified. Table 6 gives an overview of how cross-linguistically coherent the resulting first consonant splits are. The data set is a selection of 14 languages from the Universal Dependencies 2.0 data (Nivre et al., 2017).

Language      Second consonant group             #C
Basque        (c) l n (ñ) r s x z                21
Catalan       l n r s x z                        22
Irish         d l n r s                          13
Dutch         h l n r x z                        19
Estonian      h l n r s                          16
Finnish       h l n r s (š) (x) (z)              21
German        j l n r s x z                      21
Indonesian    l n r s z                          20
Italian       h l n r s (y)                      21
Latin         d h l n r s                        16
Latvian       č j ķ l ļ n ņ r s z ž              24
Lithuanian    j l n r s š z ž                    19
Portuguese    ç j l n (ñ) r s x                  24
Slovak        c ď j l ľ n ň r s š z ž            26

Table 6: The second consonant grouping found using the tier-based OCP algorithm. This is the split below the top-level consonant/vowel split. The characters in this set largely correspond to coronal sounds. The data comes from 14 languages in the Universal Dependencies 2.0 data set. Shown in parentheses are symbols outside the native orthography of the language (most likely from named entities and borrowings found in the corpora). The rightmost column shows the total number of identified consonants in the language. In particular, l, n, and r are always in this set, while s is nearly always present.

8 Conclusion & future work

This paper has reported on a simple algorithm that rests on the assumption that languages tend to exhibit hierarchical alternation in adjacent phonemes.

[Figure 4: Clustering the graphemes in the 54-symbol birch bark letter 292 manuscript (a), with transcription given in (b), and the results of OCP clustering (c). Also given are the C/V classifications produced by the Moler and Morrison (1983) algorithm (d), Sukhotin's algorithm (e), and the OCP algorithm (f), with errors marked with red boxes. Transcription (b): юмолануолиїнимижи / ноулисѣханолиомобоу / юмоласоудьнииохови. Classifications: (d) M&M: C = б в ж л м н с у х ь, V = ѣ ю а д ї и о; (e) Sukhotin: C = б в ж л м н с у х, V = ѣ ь ю а д ї и о; (f) OCP: C = б в д ж л м н с у х, V = ѣ ь ю а ї и о.]

While such alternation does not always occur for any individual adjacent segment pair, on the corpus level this alternation largely holds and serves to reveal interesting structure in phonological organization. The top cluster discovered by the algorithm is also a highly reliable indicator of syllabic vs. non-syllabic segments, i.e. consonants and vowels, and improves upon the state-of-the-art in this unsupervised task. Interestingly, Sukhotin's C/V algorithm, which has similar performance (Sukhotin, 1962), can be interpreted as a greedy approximation of the first iteration in the current algorithm. A tier-based variant of the algorithm tends to detect front/back vowel contrasts and coronal/non-coronal contrasts as well, although this is more of a robust trend than an absolute rule.

Lower levels in the clustering approach are less reliable indicators of classical feature alternation, but can serve effectively to reveal aspects of syllable structure. For example, it is obvious from the Hawaiian clustering that the predominant syllable in the language is CV. One is led to conclude that the obligatory contour principle may be manifest in larger classes of segments (such as [±syllabic]), but not necessarily at the fine-grained level. Some resulting cluster splits, such as for example {m, p} vs. {b, f, t} (an example from Basque), are often not only inseparable by a single feature split, but are not separable by any combination of features. This lack of evidence for a strong OCP may be in line with the vigorous debate in the phonological literature on the universal role of the OCP (see e.g. McCarthy (1986); Odden (1988)). Some languages (such as Finnish and Hawaiian) yield splits that almost always coincide with a single phonological feature, whereas other languages do not. Smaller inventories typically yield more robust results, although this may be partly due to chance factors—there are more ways to split a small set according to distinctive features than large sets.

Of interest is the utility of the extracted clusters in various supervised and semi-supervised NLP applications. For example, in algorithms that learn to inflect words from annotated examples (Ahlberg et al., 2015; Cotterell et al., 2016), it is often useful to have a subdivision of the segments that alternate, since this allows one to generalize behavior of classes of segments or graphemes, similar to the way e.g. Brown clusters (Brown et al., 1992) generalize over classes of words. Labeling segments with the position in a clustering tree and using that as a feature, for instance, is a cheap and straightforward way to inject this kind of knowledge into supervised systems designed to operate over many languages.
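One cheap way to realize the clustering-tree feature suggested above is to label every segment with its 0/1 path from the root of the induced tree. The nested-list tree encoding here is an illustrative assumption, not a format defined by the paper:

```python
def path_labels(tree, prefix=""):
    """Map each segment to its 0/1 path in a nested binary clustering tree.

    Internal nodes are 2-element lists [left, right]; leaves are sets of segments.
    """
    if isinstance(tree, list):          # internal node
        left, right = tree
        labels = path_labels(left, prefix + "0")
        labels.update(path_labels(right, prefix + "1"))
        return labels
    return {segment: prefix for segment in tree}   # leaf: a set of segments

if __name__ == "__main__":
    # Toy tree: vowels vs. consonants, then one further split on each side.
    tree = [[{"a", "o", "u"}, {"e", "i"}], [{"l", "n", "r", "s"}, {"k", "p", "t"}]]
    labels = path_labels(tree)
    print(labels["a"], labels["e"], labels["l"], labels["k"])   # 00 01 10 11
```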

Acknowledgements

Thanks to Andy Cowell for help with the Arapaho and Hawaiian datasets, Mike Hammond and Miikka Silfverberg for comments on an earlier version of this paper, and Francis Tyers for sharing his knowledge of Cyrillic writing systems and comments regarding the error analysis. Thanks also to Zygmunt Frajzyngier and Sharon Rose for general OCP-related discussion and comments. Several anonymous reviewers raised helpful points. This work has been partly sponsored by DARPA I2O in the program Low Resource Languages for Emergent Incidents (LORELEI) issued by DARPA/I2O under Contract No. HR0011-15-C-0113.

References

Malin Ahlberg, Markus Forsberg, and Mans Hulden. 2015. Paradigm classification in supervised learning of morphology. In Proceedings of NAACL-HLT. Association for Computational Linguistics, Denver, Colorado, pages 1024–1029. http://www.aclweb.org/anthology/N15-1107.

Juhani Aho. 1884. Rautatie [The Railroad]. Werner Söderström, Porvoo, Finland.

Luc Boruta and Justyna Jastrzebska. 2012. A phonemic corpus of Polish child-directed speech. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC).

Michael R. Brent and Timothy A. Cartwright. 1996. Distributional regularity and phonotactic constraints are useful for segmentation. Cognition 61(1):93–125.

Peter F. Brown, Peter V. deSouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. 1992. Class-based n-gram models of natural language. Computational Linguistics 18(4):467–479.

Vladimír Černý. 1985. Thermodynamical approach to the traveling salesman problem: An efficient simulation algorithm. Journal of Optimization Theory and Applications 45(1):41–51.

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, David Yarowsky, Jason Eisner, and Mans Hulden. 2016. The SIGMORPHON 2016 shared task—morphological reinflection. In Proceedings of the 2016 Meeting of SIGMORPHON. Association for Computational Linguistics, Berlin, Germany.

Andrew Cowell and Alonzo Moss Sr. 2008. The Arapaho Language. University Press of Colorado.

Eli Fischer-Jørgensen. 1952. On the definition of phoneme categories on a distributional basis. Acta Linguistica 7(1–2):8–39.

Caxton C. Foster. 1992. A comparison of vowel identification methods. Cryptologia 16(3):282–286.

Zygmunt Frajzyngier. 1979. Notes on the R1R2R2 stems in Semitic. Journal of Semitic Studies 24(1):1–12.

Stefan A. Frisch. 2004. Language processing and segmental OCP effects. In Bruce Hayes, Robert Martin Kirchner, and Donca Steriade, editors, Phonetically Based Phonology, Cambridge University Press, pages 346–371.

Judit Gervain and Ramón Guevara Erra. 2012. The statistical signature of morphosyntax: A study of Hungarian and Italian infant-directed speech. Cognition 125(2):263–287.

John Goldsmith and Aris Xanthos. 2009. Learning phonological categories. Language 85(1):4–38.

Joseph H. Greenberg. 1950. The patterning of root morphemes in Semitic. Word 6(2):162–181.

Jacques B. M. Guy. 1991. Vowel identification: an old (but good) algorithm. Cryptologia 15(3):258–262.

Martti Haavio. 1964. The oldest source of Finnish mythology: Birchbark letter no. 292. Journal of the Folklore Institute 1(1/2):45–66.

Bruce Hayes. 2011. Introductory Phonology. John Wiley & Sons.

George Hempl. 1893. Loss of r in English through dissimilation. Dialect Notes (1):279–281.

Junko Itô and Ralf-Armin Mester. 1986. The phonology of voicing in Japanese: Theoretical consequences for morphological accessibility. Linguistic Inquiry pages 49–73.

Gregory K. Iverson and Joseph C. Salmons. 1992. The phonology of the Proto-Indo-European root structure constraints. Lingua 87(4):293–320.

Young-Bum Kim and Benjamin Snyder. 2013. Unsupervised consonant-vowel prediction over hundreds of languages. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Sofia, Bulgaria, pages 1527–1536.

S. Kirkpatrick, C. D. Gelatt, Jr., and M. P. Vecchi. 1983. Optimization by simulated annealing. Science 220(4598):671–680.

Kevin Knight, Anish Nair, Nishit Rathod, and Kenji Yamada. 2006. Unsupervised analysis for decipherment problems. In Proceedings of the COLING/ACL. Association for Computational Linguistics, pages 499–506.

William Ronald Leben. 1973. Suprasegmental Phonology. Ph.D. thesis, Massachusetts Institute of Technology.

A. A. Markov. 1913. Primer statisticheskogo issledovaniya nad tekstom "Evgeniya Onegina", illyustriruyuschij svyaz ispytanij v cep. Izvestiya Akademii Nauk Ser. 6(3):153–162.

A. A. Markov. 2006. An example of statistical investigation of the text "Eugene Onegin" concerning the connection of samples in chains. Science in Context 19(4):591–600.

Thomas Mayer and Christian Rohrdantz. 2013. PhonMatrix: Visualizing co-occurrence constraints of sounds. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations. Association for Computational Linguistics, Sofia, Bulgaria, pages 73–78.

Thomas Mayer, Christian Rohrdantz, Frans Plank, Peter Bak, Miriam Butt, and Daniel A. Keim. 2010. Consonant co-occurrence in stems across languages: Automatic analysis and visualization of a phonotactic constraint. In Proceedings of the 2010 Workshop on NLP and Linguistics: Finding the Common Ground. Association for Computational Linguistics, pages 70–78.

John J. McCarthy. 1986. OCP effects: Gemination and antigemination. Linguistic Inquiry 17(2):207–263.

Scott Meyers. 1997. OCP effects in optimality theory. Natural Language & Linguistic Theory 15(4):847–892.

Cleve Moler and Donald Morrison. 1983. Singular value analysis of cryptograms. American Mathematical Monthly pages 78–87.

Steven Moran, Daniel McCloy, and Richard Wright, editors. 2014. PHOIBLE Online. Max Planck Institute for Evolutionary Anthropology, Leipzig. http://phoible.org/.

Joakim Nivre, Željko Agić, Lars Ahrenberg, Maria Jesus Aranzabe, Masayuki Asahara, Aitziber Atutxa, et al. 2017. Universal Dependencies 2.0. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics, Charles University in Prague.

David Odden. 1988. Anti antigemination and the OCP. Linguistic Inquiry 19(3):451–475.

John Ohala. 1981. The listener as a source of sound change. In Carrie S. Masek, Roberta A. Hendrick, and Mary Frances Miller, editors, Papers from the Parasession on Language and Behavior, Chicago Linguistic Society, pages 178–203.

Janet Pierrehumbert. 1993. Dissimilarity in the Arabic verbal roots. In Proceedings of NELS, volume 23, pages 367–381.

Konstantin Pozdniakov and Guillaume Segerer. 2007. Similar place avoidance: A statistical universal. Linguistic Typology 11(2):307–348.

Sharon Rose. 2000. Rethinking geminates, long-distance geminates, and the OCP. Linguistic Inquiry 31(1):85–122.

George T. Sassoon. 1992. The application of Sukhotin's algorithm to certain non-English languages. Cryptologia 16(2):165–173.

Wilhelm Spitta-Bey. 1880. Grammatik des arabischen Vulgärdialectes von Aegypten. Hinrichs, Leipzig.

Boris V. Sukhotin. 1962. Eksperimental'noe vydelenie klassov bukv s pomoshch'ju EVM. Problemy strukturnoj lingvistiki pages 198–206.

Boris V. Sukhotin. 1973. Méthode de déchiffrage, outil de recherche en linguistique. T. A. Informations pages 1–43.

Mariona Taulé, Maria Antònia Martí, and Marta Recasens. 2008. AnCora: Multilevel annotated corpora for Catalan and Spanish. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC).

Moira Yip. 1988. The obligatory contour principle and phonological rules: A loss of identity. Linguistic Inquiry 19(1):65–100.

Moira Yip. 1998. Identity avoidance in phonology and morphology. In Steven Lapointe, Diane Brentari, and Patrick Farrell, editors, Morphology and its Relation to Phonology and Syntax, CSLI, Stanford, pages 216–246.
