English Orthography Is Not “Close to Optimal”

Garrett Nicolai and Grzegorz Kondrak
Department of Computing Science
University of Alberta
{nicolai,gkondrak}@ualberta.ca

Human Language Technologies: The 2015 Annual Conference of the North American Chapter of the ACL, pages 537–545, Denver, Colorado, May 31 – June 5, 2015. © 2015 Association for Computational Linguistics

Abstract

In spite of the apparent irregularity of the English spelling system, Chomsky and Halle (1968) characterize it as “near optimal”. We investigate this assertion using computational techniques and resources. We design an algorithm to generate word spellings that maximize both phonemic transparency and morphological consistency. Experimental results demonstrate that the constructed system is much closer to optimality than the traditional English orthography.

1 Introduction

English spelling is notorious for its irregularity. Kominek and Black (2006) estimate that it is about 3 times more complex than German, and 40 times more complex than Spanish. This is confirmed by lower accuracy of letter-to-phoneme systems on English (Bisani and Ney, 2008). A survey of English spelling (Carney, 1994) devotes 120 pages to describe phoneme-to-letter correspondences, and lists 226 letter-to-phoneme rules, almost all of which admit exceptions. Numerous proposals have been put forward for spelling reforms over the years, ranging from small changes affecting a limited set of words to complete overhauls based on novel writing scripts (Venezky, 1970).

In spite of the perceived irregularity of English spellings, Chomsky and Halle (1968) assert that they remarkably well reflect abstract underlying forms, from which the surface pronunciations are generated with “rules of great generality and wide applicability”. They postulate two principles of an optimal orthographic system: (1) it should have “one representation for each lexical entry” (consistency); and, (2) “phonetic variation is not indicated where it is predictable by a general rule” (predictability). They conclude that “conventional orthography is [...] a near optimal system for the lexical representation of English words” (page 49), which we refer to as the optimality claim.

Chomsky and Halle’s account of English orthography is not without its detractors. Steinberg (1973) argues against the idea that speakers store abstract underlying forms of separate morphemes and apply sequences of phonological rules during composition. Sampson (1985) cites the work of Yule (1978) in asserting that many common English word-forms provide counter-evidence to their vowel alternation observations. Derwing (1992) maintains that the observations only hold for five vowel alternations that can be predicted with simple spelling rules. According to Nunn (2006), the idea that spelling represents an abstract phonological level has been abandoned by most linguists. Sproat (2000) notes that few scholars of writing systems would agree with Chomsky and Halle, concluding that the evidence for a consistent morphological representation in English orthography is equivocal.

It is not our goal to formulate yet another proposal for reforming English orthography, nor even to argue that there is a need for such a reform. Furthermore, we refrain from taking into account other potential advantages of the traditional orthography, such as reflecting archaic pronunciation of native words, preserving the original spelling of loanwords, or maintaining orthographic similarity to cognates in other languages. Although these may be valid concerns, they are not considered as such by Chomsky and Halle. Instead, our primary objective is a deeper understanding of how the phonological and morphological characteristics of English are reflected in its traditional orthography, which is currently the dominant medium of information exchange in the world.

In this paper, we investigate the issue of orthographic optimality from the computational perspective. We define metrics to quantify the degree of optimality of a spelling system in terms of phonemic transparency and morphological consistency. We design an algorithm to generate an orthography that maximizes both types of optimality, and implement it using computational tools and resources. We show experimentally that the traditional orthography is much further from optimality than our constructed system, which contradicts the claim of Chomsky and Halle.

2 Optimality

In this section, we define the notions of phonemic and morphemic optimality, and our general approach to quantifying them. We propose two theoretical orthographies that are phonemically and morphologically optimal, respectively. We argue that no orthographic system for English can be simultaneously optimal according to both criteria.

2.1 Phonemic optimality

A purely phonemic system would have a perfect one-to-one relationship between graphemes and phonemes. Rogers (2005) states that no standard writing system completely satisfies this property, although Finnish orthography comes remarkably close. For our purposes, we assume the International Phonetic Alphabet (IPA) transcription to be such an ideal system. For example, the IPA transcription of the word viscosity is [vɪskɑsəti]. We obtain the transcriptions from a digital dictionary that represents the General American pronunciation of English.

Phonemic transparency can be considered in two directions: from letters to phonemes, and vice versa. The pronunciation of Spanish words is recoverable from the spelling by applying a limited set of rules (Kominek and Black, 2006). However, there is some ambiguity in the opposite direction; for example, the phoneme [b] can be expressed with either ‘b’ or ‘v’. As a result, it is not unusual for native Spanish speakers to make spelling mistakes. On the other hand, the orthography of Serbo-Croatian was originally created according to the rule “write as you speak”, so that the spelling can be unambiguously produced from pronunciation. This does not mean that the pronunciation is completely predictable from spelling; for example, lexical stress is not marked (Sproat, 2000).

In this paper, we measure phonemic transparency by computing average perplexity between graphemes and phonemes. Roughly speaking, phonemic perplexity indicates how many different graphemes on average correspond to a single phoneme, while graphemic perplexity reflects the corresponding ambiguity of graphemes. We provide a formal definition in Section 5.

2.2 Morphological optimality

A purely morphemic writing system would have a unique graphemic representation for each morpheme. Chinese is usually given as an example of a near-morphemic writing system. In this paper, we construct an abstract morphemic spelling system for English by selecting a single alphabetic form for each morpheme, and simply concatenating them to make up words. For example, the morphemic spelling of viscosity could be ‘viscous·ity’.¹

We define morphemic optimality to correspond to the consistency principle of Chomsky and Halle. The rationale is that a unique spelling for each morpheme should allow related words to be readily identified in the mental lexicon. Sproat (2000) distinguishes between morpheme-oriented “deep” orthographies, like Russian, and phoneme-oriented “shallow” orthographies, like Serbo-Croatian.

We propose to measure morphemic consistency by computing the average edit distance between morpheme representations in different word-forms. The less variation morpheme spellings exhibit in a writing system, the higher the corresponding value of the morphemic transparency will be. We define the measure in Section 5.

It is impossible to achieve complete phonemic and morphemic optimality within one system designed for English spelling. For example, the stem morpheme of verb forms hearing and heard is spelled identically but pronounced differently. If we changed the spellings to indicate the difference in pronunciation, we would move towards phonemic optimality, but away from morphemic optimality. Apart from purely phonographic or logographic variants, any English spelling system must be a compromise between phonemic and morphemic transparency. In this paper, we attempt to algorithmically create an orthography that simultaneously approaches the optimality along both dimensions.

¹ Non-traditional spellings are written within single quotes. Morphemes may be explicitly separated by the centered dot character.

3 Algorithm

In this section, we describe our algorithm for generating English spellings (Figure 1), which serves as a constructive proof that the traditional orthography is not optimal. Our objective is to find the best compromise between phonemic transparency and morphemic consistency. Section 3.1 explains how we derive a unique representation for each morpheme. Section 3.2 shows how the morpheme representations are combined into word spellings. Without loss of generality, the generated spellings are composed of IPA symbols.

    // Create word sets
     1: for each word w in lexicon L do
     2:     for each morpheme m in w do
     3:         add w to word set S_m
    // Generate morpheme representations
     4: for each word set S_m do
     5:     m_0 := longest representation of m
     6:     for each word w in S_m do
     7:         a_w := alignment of m_0 and w
     8:         add a_w to multi-alignment A
     9:     for each position i in A do
    10:         select representative phoneme r[i]
    11:     r_m := r[1..|m_0|]
    // Adopt a surface phoneme predictor
    12: Pronounce := Predictor(L)
    // Generate word representations
    13: for each word w = m_1 ... m_k do
    14:     r := r_{m_1} · ... · r_{m_k}
    15:     for each phoneme r[i] in r do
    16:         if Pronounce(r[i]) ≠ w[i] then
    17:             r[i] := w[i]
    18:     r_w := r[1..|w|]

Figure 1: Spelling generation algorithm. All representations consist of phonemes.

3.1 Morpheme representations

We start by identifying all morphemes in the lexicon, and associating each morpheme with sets of words that contain it (lines 1–3 in Figure 1). An example word set that corresponds to the morpheme atom is shown in Table 1.

[...] common representation for a morpheme. We extract the phonemic representation of each allomorph in the word set, and perform a multi-alignment of the rep-
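
The word-set construction in lines 1–3 of Figure 1 can be illustrated with a minimal Python sketch. It assumes that the lexicon is available as a mapping from each word to its list of morphemes; the function name `build_word_sets`, the toy lexicon, and its segmentations are hypothetical and serve only as an illustration, not as the resources used in the paper.

```python
from collections import defaultdict


def build_word_sets(lexicon):
    """Map each morpheme to the set of words containing it (Figure 1, lines 1-3)."""
    word_sets = defaultdict(set)
    for word, morphemes in lexicon.items():   # line 1: each word w in lexicon L
        for morpheme in morphemes:            # line 2: each morpheme m in w
            word_sets[morpheme].add(word)     # line 3: add w to word set S_m
    return dict(word_sets)


# Hypothetical toy lexicon: word -> morpheme segmentation (illustrative only).
toy_lexicon = {
    "atom": ["atom"],
    "atomic": ["atom", "ic"],
    "atomize": ["atom", "ize"],
    "heroic": ["hero", "ic"],
}

word_sets = build_word_sets(toy_lexicon)
print(sorted(word_sets["atom"]))  # ['atom', 'atomic', 'atomize']
print(sorted(word_sets["ic"]))    # ['atomic', 'heroic']
```

Using sets rather than lists means that a word is stored only once per morpheme, even if the morpheme occurs in it more than once.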
