
English orthography is not “close to optimal”

Garrett Nicolai and Grzegorz Kondrak
Department of Computing Science
University of Alberta
{nicolai,gkondrak}@ualberta.ca

Human Language Technologies: The 2015 Annual Conference of the North American Chapter of the ACL, pages 537–545, Denver, Colorado, May 31 – June 5, 2015. © 2015 Association for Computational Linguistics

Abstract

In spite of the apparent irregularity of the English spelling system, Chomsky and Halle (1968) characterize it as “near optimal”. We investigate this assertion using computational techniques and resources. We design an algorithm to generate spellings that maximize both phonemic transparency and morphological consistency. Experimental results demonstrate that the constructed system is much closer to optimality than the traditional English orthography.

1 Introduction

English spelling is notorious for its irregularity. Kominek and Black (2006) estimate that it is about 3 times more complex than German, and 40 times more complex than Spanish. This is confirmed by lower accuracy of letter-to-phoneme systems on English (Bisani and Ney, 2008). A survey of English spelling (Carney, 1994) devotes 120 pages to describing phoneme-to-letter correspondences, and lists 226 letter-to-phoneme rules, almost all of which admit exceptions. Numerous proposals have been put forward for spelling reforms over the years, ranging from small changes affecting a limited set of words to complete overhauls based on novel writing scripts (Venezky, 1970).

In spite of the perceived irregularity of English spellings, Chomsky and Halle (1968) assert that they remarkably well reflect abstract underlying forms, from which the surface pronunciations are generated with “rules of great generality and wide applicability”. They postulate two principles of an optimal orthographic system: (1) it should have “one representation for each lexical entry” (consistency); and, (2) “phonetic variation is not indicated where it is predictable by a general rule” (predictability). They conclude that “conventional orthography is [. . . ] a near optimal system for the lexical representation of English words” (page 49), which we refer to as the optimality claim.

Chomsky and Halle’s account of English orthography is not without its detractors. Steinberg (1973) argues against the idea that speakers store abstract underlying forms of separate morphemes and apply sequences of phonological rules during composition. Sampson (1985) cites the work of Yule (1978) in asserting that many common English word-forms provide counter-evidence to their alternation observations. Derwing (1992) maintains that the observations only hold for five vowel alternations that can be predicted with simple spelling rules. According to Nunn (2006), the idea that spelling represents an abstract phonological level has been abandoned by most linguists. Sproat (2000) notes that few scholars of writing systems would agree with Chomsky and Halle, concluding that the evidence for a consistent morphological representation in English orthography is equivocal.

It is not our goal to formulate yet another proposal for reforming English orthography, nor even to argue that there is a need for such a reform. Furthermore, we refrain from taking into account other potential advantages of the traditional orthography, such as reflecting archaic pronunciation of native words, preserving the original spelling of loanwords, or maintaining orthographic similarity to cognates in other languages. Although these may be valid concerns, they are not considered as such by Chomsky and Halle. Instead, our primary objective is a deeper understanding of how the phonological and morphological characteristics of English are reflected in its traditional orthography, which is currently the dominant medium of information exchange in the world.

In this paper, we investigate the issue of orthographic optimality from the computational perspective. We define metrics to quantify the degree of optimality of a spelling system in terms of phonemic transparency and morphological consistency. We design an algorithm to generate an orthography that maximizes both types of optimality, and implement it using computational tools and resources. We show experimentally that the traditional orthography is much further from optimality than our constructed system, which contradicts the claim of Chomsky and Halle.

2 Optimality

In this section, we define the notions of phonemic and morphemic optimality, and our general approach to quantifying them. We propose two theoretical orthographies that are phonemically and morphologically optimal, respectively. We argue that no orthographic system for English can be simultaneously optimal according to both criteria.

2.1 Phonemic optimality

A purely phonemic system would have a perfect one-to-one relationship between graphemes and phonemes. Rogers (2005) states that no standard writing system completely satisfies this property, although Finnish orthography comes remarkably close. For our purposes, we assume the International Phonetic Alphabet (IPA) transcription to be such an ideal system. For example, the IPA transcription of the word viscosity is [vIskAs@ti]. We obtain the transcriptions from a digital dictionary that represents the General American pronunciation of English.

Phonemic transparency can be considered in two directions: from letters to phonemes, and vice versa. The pronunciation of Spanish words is recoverable from the spelling by applying a limited set of rules (Kominek and Black, 2006). However, there is some ambiguity in the opposite direction; for example, the phoneme [b] can be expressed with either ‘b’ or ‘v’. As a result, it is not unusual for native Spanish speakers to make spelling mistakes. On the other hand, the orthography of Serbo-Croatian was originally created according to the rule “write as you speak”, so that the spelling can be unambiguously produced from pronunciation. This does not mean that the pronunciation is completely predictable from spelling; for example, lexical stress is not marked (Sproat, 2000).

In this paper, we measure phonemic transparency by computing average perplexity between graphemes and phonemes. Roughly speaking, phonemic perplexity indicates how many different graphemes on average correspond to a single phoneme, while graphemic perplexity reflects the corresponding ambiguity of graphemes. We provide a formal definition in Section 5.

2.2 Morphological optimality

A purely morphemic writing system would have a unique graphemic representation for each morpheme. Chinese is usually given as an example of a near-morphemic writing system. In this paper, we construct an abstract morphemic spelling system for English by selecting a single alphabetic form for each morpheme, and simply concatenating them to make up words. For example, the morphemic spelling of viscosity could be ‘viscous·ity’.1

1 Non-traditional spellings are written within single quotes. Morphemes may be explicitly separated by the centered dot character.

We define morphemic optimality to correspond to the consistency principle of Chomsky and Halle. The rationale is that a unique spelling for each morpheme should allow related words to be readily identified in the mental lexicon. Sproat (2000) distinguishes between morpheme-oriented “deep” orthographies, like Russian, and phoneme-oriented “shallow” orthographies, like Serbo-Croatian.

We propose to measure morphemic consistency by computing the average edit distance between morpheme representations in different word-forms. The less variation morpheme spellings exhibit in a writing system, the higher the corresponding value of morphemic consistency will be. We define the measure in Section 5.

It is impossible to achieve complete phonemic and morphemic optimality within one system designed for English spelling. For example, the stem morpheme of verb forms hearing and heard is
spelled identically but pronounced differently. If we changed the spellings to indicate the difference in pronunciation, we would move towards phonemic optimality, but away from morphemic optimality. Apart from purely phonographic or logographic variants, any English spelling system must be a compromise between phonemic and morphemic transparency. In this paper, we attempt to algorithmically create an orthography that simultaneously approaches optimality along both dimensions.

3 Algorithm

In this section, we describe our algorithm for generating English spellings (Figure 1), which serves as a constructive proof that the traditional orthography is not optimal. Our objective is to find the best compromise between phonemic transparency and morphemic consistency. Section 3.1 explains how we derive a unique representation for each morpheme. Section 3.2 shows how the morpheme representations are combined into word spellings. Without loss of generality, the generated spellings are composed of IPA symbols.

    // Create word sets
     1: for each word w in lexicon do
     2:     for each morpheme m in w do
     3:         add w to word set S_m
    // Generate morpheme representations
     4: for each word set S_m do
     5:     m_0 := longest representation of m
     6:     for each word w in S_m do
     7:         a_w := alignment of m_0 and w
     8:         add a_w to multi-alignment A
     9:     for each position i in A do
    10:         select representative phoneme r[i]
    11:     r_m := r[1..|m_0|]
    // Adopt a surface phoneme predictor
    12: Pronounce := Predictor(L)
    // Generate word representations
    13: for each word w = m_1 ... m_k do
    14:     r := r_m1 · ... · r_mk
    15:     for each phoneme r[i] in r do
    16:         if Pronounce(r[i]) ≠ w[i] then
    17:             r[i] := w[i]
    18:     r_w := r[1..|w|]

Figure 1: Spelling generation algorithm. All representations consist of phonemes.

3.1 Morpheme representations

We start by identifying all morphemes in the lexicon, and associating each morpheme with the set of words that contain it (lines 1–3 in Figure 1). An example word set that corresponds to the morpheme atom is shown in Table 1. Words may belong to more than one set. For example, the word atomic will also be included in the word set that corresponds to the morpheme -ic. We make no distinction between bound and free morphemes.

    Aligned phonemes        Word
          æ t @ m           atom
          æ t @ m           atoms
          @ t A m I k       atomic
          @ t A m I k l i   atomically
    s 2 b @ t A m I k       subatomic
          æ t A m           (common representation)

Table 1: Extracting the common morphemic representation.

As can be seen in Table 1, English morphemes often have multiple phonemic realizations. The objective of the second step (lines 4–11) is to follow the consistency principle by establishing a single representation of each morpheme. Chomsky and Halle suggest that orthographic representations should reflect the underlying forms of morphemes as much as possible. Unfortunately, underlying forms are not attested, and there is no commonly accepted algorithm to construct them. Instead, our algorithm attempts to establish a sequence of phonemes that is maximally similar to the attested surface allomorphs.

Table 1 shows an example of generating the common representation for a morpheme. We extract the phonemic representation of each allomorph in the word set, and perform a multi-alignment of the representations by pivoting on the longest representation of the morpheme (lines 5–8). For each position in the multi-alignment, we identify the set of phonemes corresponding to that position. If there is no variation within a position, we simply adopt the common phoneme. Otherwise, we choose the phoneme that is most preferred in a fixed hierarchy of phonemes. In this case, since [æ] and [A] are preferred to [@], the resulting morpheme representation is ‘ætAm’.

For selecting between variant phonemes, we follow a manually-constructed hierarchy of phonemes (Table 2), which roughly follows the principle of least effort. The assumption is that the phonemes requiring more articulatory effort to produce are more likely to represent the underlying phoneme. Within a single row, phonemes are listed in the order of preference. For example, alveolar fricatives like [s] are preferred to post-alveolar ones like [S], in order to account for palatalization. Since our representations are not intended to represent actual underlying forms, the choice of a particular phoneme hierarchy affects only the shape of the generated word spellings.

    Stops            b d p t k
    Affricates       dZ tS
    Fricatives       D v z Z T s S h
    Nasals           m n N
    Liquids          l r
    Glides           j w
    Diphthongs       aI OI aU
    Tense vowels     i e o u A
    Lax vowels       æ E O U 2
    Reduced vowels   I @
    deletion

Table 2: Hierarchy of phonemes.

3.2 Word representations

Ideally, polymorphemic words should be represented by a simple concatenation of the corresponding morpheme representations. However, for languages that are not purely concatenative, this approach may produce forms that are far from the phonemic realizations. For example, assuming that the words deceive and deception share a morpheme, a spelling ‘deceive·ion’ would fail to convey the actual pronunciation [d@sEpS@n]. The predictability principle of Chomsky and Halle implies that phonetic variation should only be indicated where it is not predictable by general rules. Unfortunately, the task of establishing such a set of general rules, which we discuss in Section 7, is not at all straightforward. Instead, we assume the existence of an oracle (line 12 in Figure 1) which predicts the surface pronunciation of each phoneme found in the concatenation of the morphemic forms.

In our algorithm (lines 13–18), the default spelling of the word is composed of the representations of its constituent morphemes conjoined with a separator character. If the predicted pronunciation matches the actual surface phoneme, the “underlying” phoneme is preserved; otherwise, it is substituted by the surface phoneme. This modification helps to keep the resulting word spellings reasonably close to the surface pronunciation.

For example, consider the word sincerity. Suppose that our algorithm derives the representations of the two underlying morphemes as ‘sInsir’ and ‘Iti’. If, given the input ‘sInsir·Iti’, the predictor correctly generates the surface pronunciation [sInsEr@ti], we adopt the input as our final spelling. However, if the prediction is [sInsir@ti] instead, our final spelling becomes ‘sInsEr·Iti’, in order to avoid a potentially misleading spelling. Since the second vowel was incorrectly predicted, we determine it to be unpredictable, and thus represent it with the surface phoneme, rather than the underlying one. The choice of the predictor affects only the details of the generated spellings.

4 Implementation

In this section, we describe the specific data and tools that we use in our implementation of the algorithm described in the previous section.

4.1 Data

For the implementation of our spelling generation algorithm, we require a lexicon that contains morphological segmentation of phonemic representations of words. Since we have been unsuccessful in finding such a lexicon, we extract the necessary information from two different resources: the CELEX lexical database (Baayen et al., 1995), which includes morphological analysis of words, and the Combilex speech lexicon (Richmond et al., 2009), which contains high-quality phonemic transcriptions. After intersecting the two lexicons, and pruning the result of proper nouns, function words, duplicate forms, and multi-word entries, we are left with approximately 51,000 word-forms that are annotated both morphologically and phonemically.
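The two core steps of Section 3 can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes that the multi-alignment has already been computed and restricted to the positions of the morpheme, that the rows of Table 2 can be flattened into a single preference list (with a hypothetical "-" symbol for deletion), and that the underlying, predicted, and surface sequences are aligned one-to-one. The function names `representative` and `respell` are illustrative only. The correction step also leaves a phoneme untouched when the surface phoneme is [@] or [I], as noted in Section 4.2.

```python
# Preference order flattened from Table 2 (earlier = more preferred).
# Flattening the rows into one list is an assumption of this sketch;
# "-" is a hypothetical symbol for a gap (deletion) in the alignment.
HIERARCHY = [
    "b", "d", "p", "t", "k",                  # stops
    "dZ", "tS",                               # affricates
    "D", "v", "z", "Z", "T", "s", "S", "h",   # fricatives
    "m", "n", "N",                            # nasals
    "l", "r",                                 # liquids
    "j", "w",                                 # glides
    "aI", "OI", "aU",                         # diphthongs
    "i", "e", "o", "u", "A",                  # tense vowels
    "æ", "E", "O", "U", "2",                  # lax vowels
    "I", "@",                                 # reduced vowels
    "-",                                      # deletion
]
RANK = {phoneme: rank for rank, phoneme in enumerate(HIERARCHY)}
REDUCED = {"@", "I"}


def representative(aligned_rows):
    """Lines 9-11 of Figure 1: pick the most preferred phoneme per position."""
    return [min(column, key=RANK.__getitem__) for column in zip(*aligned_rows)]


def respell(underlying, predicted, surface):
    """Lines 13-18 of Figure 1: keep the underlying phoneme where the
    prediction matches the surface form, otherwise adopt the surface phoneme.
    Reduced surface vowels never replace an underlying phoneme (Section 4.2)."""
    result = []
    for u, p, s in zip(underlying, predicted, surface):
        if p != s and s not in REDUCED:
            result.append(s)
        else:
            result.append(u)
    return result


# Realizations of the morpheme "atom" from Table 1, one row per word.
atom_rows = [
    ["æ", "t", "@", "m"],  # atom
    ["æ", "t", "@", "m"],  # atoms
    ["@", "t", "A", "m"],  # atomic
    ["@", "t", "A", "m"],  # atomically
    ["@", "t", "A", "m"],  # subatomic
]
print("".join(representative(atom_rows)))  # ætAm

# The sincerity example: the second vowel is mispredicted, so it is replaced.
print("".join(respell(list("sInsirIti"), list("sInsir@ti"), list("sInsEr@ti"))))  # sInsErIti
```

Morpheme separators are omitted from the output for brevity; re-inserting them at the known morpheme boundaries would give ‘sInsEr·Iti’.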

In order to segment phonemic representations into constituent morphemes, we apply a high-precision phonetic aligner (Kondrak, 2000) to link letters and phonemes using the procedure described by Dwyer and Kondrak (2009). In rare cases where the phonetic aligner fails to produce an alignment, we back off to an alignment generated with m2m-aligner (Jiampojamarn et al., 2007), an unsupervised EM-based algorithm. We found that this approach worked better for our purposes than relying on the alignments provided in Combilex. We use the same approach to align variant phonemic representations of morphemes as described in Section 3.1.

The morphological information contained in CELEX is incomplete for our purposes, and requires further processing. For example, the word amputate is listed as monomorphemic, but in fact contains the suffix -ate. However, amputee is analyzed as amputee = amputate − ate + ee. This allows us to identify the stem as amput, which in turn implies the segmentations amput·ee, amput·ate, and amput·at·ion.

Another issue that requires special handling in CELEX involves recovering reduced geminate consonants. For example, the word interrelate is pronounced with a single [r] phoneme at the morpheme boundary. However, when segmenting the phoneme sequence, we need to include [r] both at the end of inter- and at the beginning of relate.

4.2 Predictor

The role of the predictor mentioned in Section 3.2 is performed by DIRECTL+ (Jiampojamarn et al., 2010), a publicly available discriminative string transducer. It takes as input a sequence of common morpheme representations, determined using the method described above, and produces the predicted word pronunciation. Since DIRECTL+ tends to make mistakes related to the unstressed vowel reduction phenomenon in English, we refrain from replacing the “underlying” phonemes with either [@] or [I].

An example derivation is shown in Table 3, where the Underlying string represents the input to DIRECTL+, Predicted is its output, Surface is the actual pronunciation, and Respelling is the spelling generated according to the algorithm in Figure 1.

    Underlying:   foto + græf + @r + z
    Predicted:    fot@   græf   @r   z
    Surface:      f@tA   gr@f   @r   z
    Respelling:   fotA · græf · @r · z

Table 3: Deriving the spelling of the word photographers.

Since DIRECTL+ requires a training set, we split the lexicon into two equal-size parts with no morpheme overlap, and induce a separate model on each part. Then we apply each model as the predictor on the other half of the lexicon. This approach simulates the human ability to guess pronunciation from the spelling. Jiampojamarn et al. (2010) report that DIRECTL+ achieves approximately 90% word accuracy on the letter-to-phoneme conversion task on the CELEX data.

5 Evaluation measures

In this section, we define our measures of phonemic transparency and morphemic consistency.

5.1 Phonemic transparency

Kominek and Black (2006) measure the complexity of spelling systems by calculating the average perplexity of phoneme emissions for each letter. The total perplexity is the sum of each letter’s perplexity weighted by its unigram probability. Since their focus is on the task of inducing text-to-speech rules, they also incorporate letter context into this definition. Thus, a system that is completely explained by a set of rules has a perplexity of 1.

The way we compute perplexity differs in several aspects. Whereas Kominek and Black (2006) calculate the perplexity of single letters, we take as units substrings derived from many-to-many alignment, with the length limited to two characters. Some letter bigrams, such as ph, th, and ch, are typically pronounced as a single phoneme, while the letter x often corresponds to the phoneme bigram [ks]. By considering substrings we obtain a more realistic estimate of spelling perplexity.

We calculate the average orthographic perplexity using the standard formulation:

$$P_{ave} = \sum_{c} P_c \, e^{-\sum_{i} P_i \log P_i} \qquad (1)$$

where P_c is the probability of a grapheme substring in the dictionary, and P_i is the probability that the grapheme substring is pronounced as the phoneme substring i. Note that this formulation is not contingent on any set of rules.

In a similar way, we compute the phonemic perplexity in the opposite direction, from phonemes to letters. The orthographic and the phonemic perplexity values quantify the transparency of a spelling system with respect to reading and writing, respectively.

5.2 Morphemic consistency

Little (2001) proposes to calculate the morphemic optimality of English spellings by computing the average percentage of “undisturbed letters” in the polymorphemic words with respect to the base form. For example, four of five letters of the base form voice are present in voicing, which translates into 80% optimal. The examples given in the paper allow us to interpret this measure as a function of edit distance normalized by the length of the base form.

We make three modifications to the original method. First, we compute the average over all words in the lexicon rather than over word sets, which would give disproportionate weight to words in smaller word sets. Second, we normalize edit distance by the number of phonemes in a word, rather than by the number of letters in a spelling, in order to avoid penalizing systems that use shorter spellings. Finally, we consider edit operations to apply to substrings aligned to substrings of phonemes, rather than to individual symbols. In this way, the maximum number of edit operations is equal to the number of phonemes. The modified measure yields a score between 0 and 100%, with the latter value representing morphemic optimality.

As an example, consider the word set consisting of six word-forms: snip, snips, snipped, snipping, snippet, and snippets. The first two words, which represent the base morpheme as snip, receive a perfect score of 100% for morphemic consistency. The remaining four words, which have the morpheme as snipp, obtain the score of 75% because one of the four phonemes is spelled differently from the base form. For free morphemes, the base form is simply the spelling of the morpheme, but for bound morphemes, we take the majority spelling of the morpheme.

6 Quantitative comparison

We compare the traditional English orthography (T.O.) to three hypothetical systems: phonemic transcription (IPA), morpheme concatenation (M-CAT), and the orthography generated by the algorithm described in Section 3 (ALG). In addition, we consider two proposals submitted to the English Spelling Society: a minimalist spelling reform (SR) of Gibbs (1984), and the more comprehensive SoundSpel (SS) of Rondthaler and Edward (1986). Table 4 lists the spellings of the words viscous and viscosity in various orthographies.

    System   viscous   viscosity
    T.O.     viscous   viscosity
    IPA      vIsk@s    vIskAs@ti
    M-CAT    viscous   viscous·ity
    ALG      vIskAs    vIskAs·Iti
    SR       viscous   viscosity
    SS       viscus    viscosity

Table 4: Example spellings according to various systems.

Table 5 shows the values of orthographic and phonemic transparency, as well as morphemic consistency for the evaluated spelling systems. By definition, phonemic transcription obtains the optimal transparency scores of 1, while simple morphological concatenation receives a perfect 100% in terms of morphemic consistency.

    System   Orth   Phon   Morph
    T.O.     2.32   2.10    96.11
    IPA      1.00   1.00    93.94
    M-CAT    2.51   2.36   100.00
    ALG      1.33   1.72    98.90
    SR       2.27   2.15    96.57
    SS       1.60   1.72    94.72

Table 5: Orthographic, phonemic and morphemic optimality of spelling systems.

The results in Table 5 indicate that traditional orthography scores poorly according to all three measures. Its low orthographic and phonemic transparency is to be expected, but its low morphemic
consistency is striking. Traditional orthography is not only far from optimality, but overall seems no more optimal than any other of the evaluated systems.

Searching for the explanation of this surprising result, we find that much of the morphemic score deduction can be attributed to small changes like dropping of the silent e, as in ‘make’ + ‘ing’ = ‘making’. These types of inconsistencies counter-weigh the high marks that traditional orthography gets for maintaining consistent spelling in spite of unstressed vowel reductions.

The prevalence of silent e’s in traditional orthography undeniably diminishes its morphemic consistency. Nor is the device necessary to represent the pronunciation of the preceding vowel; for example, SoundSpel has those words as ‘maek’ and ‘maeking’. However, one can argue that such minor alterations should not be penalized because English speakers subconsciously take them into account while reading. In the next section, we describe an experiment in which we pre-process words with such orthographic rules, in order to determine how much they influence the optimality picture.

7 Spelling rules

Table 6 lists six common English spelling rules that affect letters at morpheme boundaries, of which the first five are included in the textbook account of Jurafsky and Martin (2009, page 63). We conducted an experiment to determine the applicability of these rules by computing how often they fired when triggered by the correct environment.2 We tested the rules in both directions, with respect to both writing and reading applicability. Writing rules are applied to morphemes when they are in the correct environment. For example, the k-insertion rule fires if the morpheme ends in a c and the next morpheme begins with e or i, as in panic·ing. On the other hand, reading may involve recovering the morphemes from the surface forms. For example, if the stem ends in a tt and the affix begins with an i, the consonant doubling rule implies that the free form of the morpheme ends in a single t, as in getting.

2 The conditioning environments of the rules were implemented according to the guidelines provided at http://www.phonicslessons.co.uk/englishspellingrules.html.

    Rule                 Input         Output
    e-deletion           voice·ing     voicing
    y-replacement        industry·al   industrial
    k-insertion          panic·ing     panicking
    e-insertion          church·s      churches
    consonant doubling   get·ing       getting
    f-voicing            knife·s       knives

Table 6: Common English spelling rules with examples.

    Rule                 Writing   Reading
    e-deletion              98.8      67.1
    y-replacement           93.5      95.8
    k-insertion            100.0       1.0
    e-insertion            100.0      98.7
    consonant doubling      96.3      36.3
    f-voicing               33.3      14.7

Table 7: Applicability of common spelling rules.

The results in Table 7 show that the rules, with the exception of the f-voicing rule, have high applicability in writing. Most rules, however, cannot be trusted to recover the morpheme spellings from the surface form. For example, following the consonant doubling rule would cause the reader to incorrectly infer from the word butted that the spelling of the verb is but. This is significant considering that Chomsky and Halle define orthography as a system for readers (page 49).

Notwithstanding the unreliability of the spelling rules, we incorporate them into the computation of the morphemic consistency of the traditional orthography. We apply the rules from a reading perspective, but assume some morphemic knowledge of a reader. Whereas we consider a rule to misfire if it does not apply in the correct environment when calculating applicability, as in Table 7, when calculating morphemic consistency, we allow the rules to be more flexible. We consider a morpheme to match the prototype if either the observed form or the form modified by the spelling rule matches the prototype.

8 Discussion

Figure 2 shows a two-dimensional plot of orthographic perplexity vs. morphemic consistency. The (unattainable) optimality is represented by the lower left corner of the plot. The effect of accommodating the spelling rules within the traditional orthography is illustrated by an arrow, which indicates an increase in morphemic consistency from 96.11 to 98.90.

[Figure 2 plot. Horizontal axis: orthographic perplexity (1 to 2.6). Vertical axis: morphemic optimality (93 to 100). Systems plotted: T.O., Alg, SS, SR, IPA, Morph, +Rules, Alg(L).]

Figure 2: Morphemic and orthographic optimality of various spelling systems.

The ALG(L) system represents a version of the ALG system in which the IPA symbols are respelled using combinations of the 26 letters of the Roman alphabet, with the morpheme boundary symbol removed. This change, which is intended to make the comparison with the traditional orthography more interpretable, increases the orthographic perplexity from 1.33 to 1.58. Furthermore, we ensure that ALG(L) contains no homographs (which constitute 2.6% of the lexicon in ALG) by reverting to a traditional spelling of a morpheme if necessary. Since the respelling applies to all instances of that morpheme, it has no effect on the morphemic consistency, but results in a small increase of the orthographic perplexity to 1.61.

The plot in Figure 2 shows that, even after accounting for the orthographic rules, traditional orthography does not surpass the level of morphemic consistency of ALG. With the same writing script and no homographs, ALG(L) is less than half as far from orthographic optimality as the traditional orthography. On the other hand, neither of the spelling reform proposals is substantially better overall than the traditional orthography.

Inspection of the spellings generated by our algorithm reveals that it generally maintains consistent spellings of morphemes. In fact, it only makes a change from the underlying form in 3660 cases, or 7.2% of the words in the dictionary. Consider the morpheme transcribe, which is traditionally spelled as ‘transcrip’ in transcription. Even if we disregard the final ‘e’ by invoking the e-deletion spelling rule, the morphemic consistency in the traditional orthography is still violated by the ‘b’/‘p’ alternation. Our predictor, however, considers this a predictable devoicing assimilation change, which occurs in a number of words, including subscription and absorption. Consequently, the spellings generated by the algorithm preserve the morpheme’s ‘b’ ending in all words that contain it. In addition, the algorithm avoids spurious idiosyncrasies such as four/forty, which abound in traditional orthography.

The spellings generated by the algorithm are also much more phonemically transparent, particularly for vowels. Phonemically, ALG(L) improves on the traditional orthography mostly by making the spelling more predictable. For example, ‘a’ represents the phoneme [æ] in 91.7% of the cases in the generated spellings, as opposed to only 36.5% in traditional orthography.

9 Conclusion

We have analyzed English orthography in terms of morphemic consistency and phonemic transparency. According to the strict interpretation of morphemic consistency, traditional orthography is closer to the level of a phonemic transcription than to that of a morphemic concatenation. Even if orthographic rules are assumed to operate cost-free as a pre-processing step, the orthographic perplexity of traditional orthography remains high.

While phonemic transparency and morphemic consistency are at odds with each other, we have provided a constructive proof that it is possible to create a spelling system for English that is substantially closer to theoretical optimality than the traditional orthography, even when it is constrained by the traditional character set. This contradicts the claim that English orthography is near optimal.

Acknowledgments

This research was supported by the Natural Sciences and Engineering Research Council of Canada, and the Alberta Innovates – Technology Futures.

References

Harald R. Baayen, Richard Piepenbrock, and Leon Gulikers. 1995. The CELEX Lexical Database. Release 2 (CD-ROM). Linguistic Data Consortium, University of Pennsylvania, Philadelphia, Pennsylvania.

Maximilian Bisani and Hermann Ney. 2008. Joint-sequence models for grapheme-to-phoneme conversion. Speech Communication, 50(5):434–451.

Edward Carney. 1994. A Survey of English Spelling. Routledge.

Noam Chomsky and Morris Halle. 1968. The sound pattern of English.

Bruce L Derwing. 1992. Orthographic aspects of linguistic competence. The linguistics of literacy, pages 193–210.

Kenneth Dwyer and Grzegorz Kondrak. 2009. Reducing the annotation effort for letter-to-phoneme conversion. In Proceedings of ACL-IJCNLP, pages 127–135.

Stanley Gibbs. 1984. The Simplified Spelling Society’s 1984 proposals. Journal of the Simplified Spelling Society, 2:32.

Sittichai Jiampojamarn, Grzegorz Kondrak, and Tarek Sherif. 2007. Applying many-to-many alignments and hidden markov models to letter-to-phoneme conversion. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pages 372–379, Rochester, New York, April. Association for Computational Linguistics.

Sittichai Jiampojamarn, Colin Cherry, and Grzegorz Kondrak. 2010. Integrating Joint n-gram Features into a Discriminative Training Framework. In Proceedings of NAACL-2010, Los Angeles, CA, June. Association for Computational Linguistics.

Dan Jurafsky and James H Martin. 2009. Speech & language processing. Pearson Education India, 2nd edition.

John Kominek and Alan W. Black. 2006. Learning pronunciation dictionaries: Language complexity and word selection strategies. In HLT-NAACL, pages 232–239.

Grzegorz Kondrak. 2000. A new algorithm for the alignment of phonetic sequences. In Proceedings of NAACL 2000: 1st Meeting of the North American Chapter of the Association for Computational Linguistics, pages 288–295.

Joseph R Little. 2001. The optimality of English spelling.

Anneke Marijke Nunn. 2006. Dutch orthography: A systematic investigation of the spelling of Dutch words. The Hague: Holland Academic Graphics.

Korin Richmond, Robert AJ Clark, and Susan Fitt. 2009. Robust LTS rules with the Combilex speech technology lexicon. pages 1295–1298, September.

Henry Rogers. 2005. Writing Systems. Blackwell.

Edward Rondthaler and J LIAS Edward. 1986. Dictionary of simplified American Spelling.

Geoffrey Sampson. 1985. Writing systems: A linguistic introduction. Stanford University Press.

Richard Sproat. 2000. A computational Theory of Writing Systems. Cambridge.

Danny D Steinberg. 1973. Phonology, reading, and Chomsky and Halle’s optimal orthography. Journal of Psycholinguistic Research, 2(3):239–258.

Richard L Venezky. 1970. The structure of English orthography, volume 82. Walter de Gruyter.

Valerie Yule. 1978. Is there evidence for Chomsky’s interpretation of English spelling? Spelling Progress Bulletin, 18(4):10–12.
