A Computational Approach to Deciphering Unknown Scripts

Kevin Knight and Kenji Yamada
USC/Information Sciences Institute
4676 Admiralty Way
Marina del Rey, CA 90292
[email protected] [email protected]

Abstract

We propose and evaluate computational techniques for deciphering unknown scripts. We focus on the case in which an unfamiliar script encodes a known language. The decipherment of a brief document or inscription is driven by data about the spoken language. We consider which scripts are easy or hard to decipher, how much data is required, and whether the techniques are robust against language change over time.

1 Introduction

With surprising frequency, archaeologists dig up documents that no modern person can read. Sometimes the written characters are familiar (say, the Phoenician alphabet), but the language is unknown. Other times, it is the reverse: the written script is unfamiliar but the language is known. Or, both script and language may be unknown.

Cryptanalysts also encounter unreadable documents, but they try to read them anyway. With patience, insight, and computer power, they often succeed. Archaeologists and linguists known as epigraphers apply analogous techniques to ancient documents. Their decipherment work can have many resources as input, not all of which will be present in a given case: (1) monolingual inscriptions, (2) accompanying pictures or diagrams, (3) bilingual inscriptions, (4) the historical record, (5) physical artifacts, (6) bilingual dictionaries, (7) informal grammars, etc.

In this paper, we investigate computational approaches to deciphering unknown scripts, and report experimental results. We concentrate on the following case:

• unfamiliar script
• known language
• minimal input (monolingual inscriptions only)

This situation has arisen in many famous cases of decipherment--for example, in the Linear B documents from Crete (which turned out to be a "non-Greek" script for ancient Greek) and in the Mayan documents from Mesoamerica. Both of these cases lay unsolved until the latter half of the 20th century (Chadwick, 1958; Coe, 1993).

In computational linguistic terms, this decipherment task is not really translation, but rather text-to-speech conversion. The goal of the decipherment is to "make the text speak," after which it can be interpreted, translated, etc. Of course, even after an ancient document is phonetically rendered, it will still contain many unknown words and strange constructions. Making the text speak is therefore only the beginning of the story, but it is a crucial step.

Unfortunately, current text-to-speech systems cannot be applied directly, because they require up front a clearly specified sound/writing connection. For example, a system designer may create a large pronunciation dictionary (for English or Chinese) or a set of manually constructed letter-based pronunciation rules (for Spanish or Italian). But in decipherment, this connection is unknown! It is exactly what we must discover through analysis. There are no rule books, and literate informants are long since dead.

2 Writing Systems

To decipher unknown scripts, it is useful to understand the nature of known scripts, both ancient and modern. Scholars often classify scripts into three categories: (1) alphabetic, (2) syllabic,

and (3) logographic (Sampson, 1985).

Alphabetic systems attempt to represent single sounds with single characters, though no system is "perfect." For example, Semitic alphabets have no characters for vowel sounds. And even highly regular writing systems like Spanish have plenty of spelling variation, as we shall see later.

Syllabic systems have characters for entire syllables, such as "ba" and "shu." Both Linear B and Mayan are primarily syllabic, as is Japanese kana. The Phaistos Disk from Crete (see Figure 1) is thought to be syllabic, because of the number of distinct characters present.

Figure 1: The Phaistos Disk (c. 1700BC). The disk is six inches wide, double-sided, and is the earliest known document printed with a form of movable type.

Finally, logographic systems have characters for entire words. Chinese is often cited as a typical example.

Unfortunately, actual scripts do not fall neatly into one category or another (DeFrancis, 1989; Sproat, forthcoming). Written Japanese will contain syllabic kana, alphabetic roomaji, and logographic kanji characters all in the same document. Chinese characters actually have a phonetic component, and words are often composed of more than one character. Irregular English writing is neither purely alphabetic nor purely logographic; it is sometimes called morphophonemic. Ancient writing is also mixed, and archaeologists frequently observe radical writing changes in a single language over time.

3 Experimental Framework

In this paper, we do not decipher any ancient scripts. Rather, we develop algorithms and apply them to the "decipherment" of known, modern scripts. We pretend to be ignorant of the connection between sound and writing. Once our algorithms have come up with a proposed phonetic decipherment of a given document, we route the sound sequence to a speech synthesizer. If a native speaker can understand the speech and make sense of it, then we consider the decipherment a success. (Note that the native speaker need not even be literate, theoretically.) We experiment with modern writing systems that span the categories described above. We are interested in the following questions:

• Can automatic techniques decipher an unknown script? If so, how accurately?
• What quantity of written text is needed for successful decipherment? (This may be quite limited by circumstances.)
• What knowledge of the spoken language is needed? Can it be extracted automatically from available resources? What quantity of resources?
• Are some writing systems easier to decipher than others? Are there systematic differences among alphabetic, syllabic, and logographic systems?
• Are word separators necessary or helpful?
• Can automatic techniques be robust against language evolution (e.g., modern versus ancient forms of a language)?
• Can automatic techniques identify the language behind a script as a precursor to deciphering it?

4 Alphabetic Writing (Spanish)

Five hundred years ago, Spaniards invaded Mayan lands, burning documents and effectively eliminating everyone who could read and write.
(Modern Spaniards will be quick to point out that most of the work along those lines had already been carried out by the Aztecs.) Mayan hieroglyphs remained uninterpreted for many centuries. We imagine that if the Mayans had invaded Spain, then 20th-century Mayan scholars might be deciphering ancient Spanish documents instead.

We begin with an analysis of Spanish writing. First, it is necessary to settle on the basic inventory of sounds and characters. Characters are easy; we simply tabulate the distinct ones observed in text. For sounds, we need something that will serve as reasonable input to a speech synthesizer. We use a Spanish-relevant subset of the International Phonetic Alphabet (IPA), which seeks to capture all sounds in all languages. Actually, we use an ASCII version of the IPA called SAMPA (Speech Assessment Methods Phonetic Alphabet), originally developed under ESPRIT project 1541. There is also a public-domain Castilian speech synthesizer (called Mbrola) for the Spanish SAMPA sound set. Figure 2 shows the sound and character inventories.

Sounds: B, D, G, J (ny as in canyon), L (y as in yarn), T (th as in thin), a, b, d, e, f, g, i, k, l, m, n, o, p, r, rr (trilled), s, t, tS (ch as in chin), u, x (h as in hat)

Characters: ñ, á, é, í, ó, ú, a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z

Figure 2: Inventories of Spanish sounds (with rough English equivalents in parentheses) and characters.

Now to spelling rules. Spanish writing is clearly not a one-for-one proposition:

• a single sound can produce a single character (a → a)
• a sound can produce two characters (tS → ch)
• two sounds can produce a single character (k s → x)
• silence can produce a character (h)

Moreover, there are ambiguities. The sound L (English y-sound) may be written as either ll or y. The sound i may also produce the character y, so the pronunciation of y varies according to context. The sound rr (trilled r) is written rr in the middle of a word and r at the beginning of a word. The task of decipherment will be to re-invent these rules and apply them to written documents in reverse.

B → b or v
D → d
G → g
J → ñ
L → ll or y
T (followed by a, o, or u) → z
T (followed by e or i) → c or z
T (otherwise) → c
a → a or á
b → b or v
d → d
e → e or é
f → f
g → g
i → i or í
k (followed by e or i) → qu
k (followed by s) → x
k (otherwise) → c
l → l
m → m
n → n
o → o or ó
p → p
r → r
rr (at beginning of word) → r
rr (otherwise) → rr
s (preceded by k) → nothing
s (otherwise) → s
t → t
tS → ch
u → u or ú
x → j
nothing → h

Figure 3: Spanish sound-to-character spelling rules. The left-hand side of each rule contains a Spanish sound (and possible conditions), while the right-hand side contains zero or more Spanish characters.
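As a concrete illustration, a few of the Figure 3 rules can be applied in the forward (spelling) direction with a short program. This sketch is ours, not the paper's finite-state transducer; it hard-codes a handful of the rules and collapses ambiguous alternatives like "a → a or á" to their first choice.

```python
# A minimal sketch (not the paper's transducer) of contextual
# sound-to-character spelling rules in the style of Figure 3.

def spell_sounds(sounds):
    """Spell a sound sequence with a handful of contextual rules."""
    out = []
    for i, s in enumerate(sounds):
        nxt = sounds[i + 1] if i + 1 < len(sounds) else None
        if s == "T":                              # th-sound
            out.append("z" if nxt in ("a", "o", "u") else "c")
        elif s == "k":
            if nxt in ("e", "i"):
                out.append("qu")
            elif nxt == "s":
                out.append("x")                   # k s -> x
            else:
                out.append("c")
        elif s == "s" and i > 0 and sounds[i - 1] == "k":
            pass                                  # the x above already covers s
        elif s == "rr":
            out.append("r" if i == 0 else "rr")   # trilled r
        elif s == "tS":
            out.append("ch")
        elif s == "x":
            out.append("j")                       # x sound (h as in hat) -> j
        else:
            out.append(s)                         # default: sound spells itself
    return "".join(out)

print(spell_sounds(["k", "i", "x", "o", "t", "e"]))  # -> quijote
print(spell_sounds(["rr", "o", "s", "a"]))           # -> rosa
```

Note that even this toy version is context-sensitive: the spelling of k, T, s, and rr depends on neighboring sounds or word position, which is exactly why a transducer rather than a character table is needed.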

Figure 3 shows a sample set of Spanish spelling rules. We formalized these rules computationally in a finite-state transducer (Pereira and Riley, 1997). The transducer is bidirectional. Given a specific sound sequence, we can extract all possible character sequences, and vice versa. It turns out that while there are many ways to write a given Spanish sound sequence with these rules, it is fairly clear how to pronounce a written sequence.

In our decipherment experiment, we blithely ignore many of the complications just described, and pretend that Spanish writing is, in fact, a one-for-one proposition. That is, to write down a sound sequence, one replaces each sound with a single character. We do allow ambiguity, however. A given sound may produce one character sometimes, and another character other times.

Decipherment is driven by knowledge about the spoken language. In the case of archaeological decipherment, this knowledge may include vocabulary, grammar, and meaning. We use simpler data. We collect frequencies of sound-triples in spoken Spanish. If we know that the triple "t l k" is less frequent than "a s t," then we should ultimately prefer a decipherment that contains the latter instead of the former, all other things being equal.

This leads naturally into a statistical approach to decipherment. Our goal is to settle on a sound-to-character scheme that somehow maximizes the probability of the observed written document. Like many linguistic problems, this one can be formalized in the noisy-channel framework. Our sound-triple frequencies can be turned into conditional probabilities such as P(t | a s). We can estimate the probability of a sound sequence as the product of such local probabilities:

P(s1 ... sn) ≈ P(s3 | s1 s2) · P(s4 | s2 s3) · P(s5 | s3 s4) · ...

A specific sound-to-character scheme can be represented as a set of conditional probabilities such as P(v | B). Read this as "the probability that Spanish sound B is written with character v." We can estimate the conditional probability of an entire character sequence given an entire sound sequence as a product of such probabilities:

P(c1 ... cn | s1 ... sn) ≈ P(c1 | s1) · P(c2 | s2) · P(c3 | s3) · ...

Armed with these basic probabilities, we can compute two things. First, the total probability of observing a particular written sequence of characters c1 ... cn:

P(c1 ... cn) = Σ over s1 ... sn of P(s1 ... sn) · P(c1 ... cn | s1 ... sn)

And second, we can compute the most probable phonetic decipherment s1 ... sn of a particular written sequence of characters c1 ... cn. This will be the one that maximizes P(s1 ... sn | c1 ... cn), or equivalently, maximizes P(s1 ... sn) · P(c1 ... cn | s1 ... sn). The trick is that the P(character | sound) probabilities are unknown to us. We want to assign values that maximize P(c1 ... cn). These same values can then be used to decipher.

We adapt the EM algorithm (Dempster et al., 1977) for decipherment, starting with a uniform probability over P(character | sound). That is, any sound will produce any character with probability 0.0333. The algorithm successively refines these probabilities, guaranteeing to increase P(c1 ... cn) at each iteration. EM requires us to consider an exponential number of decipherments at each iteration, but this can be done efficiently with a dynamic-programming implementation (Baum, 1972). The training scheme is illustrated in Figure 4.

In our experiment, we use the first page of the novel Don Quixote as our "ancient" Spanish document c1 ... cn. To get phonetic data, we might tape-record modern Spanish speakers and transcribe the recorded speech into the IPA alphabet. Or we might use documents written in an alternate, known script, if any existed. In this work, we take a short cut by reverse-engineering a set of medical Spanish documents, using the finite-state transducer described above, to obtain a long phonetic training sequence s1 ... sn.

At each EM iteration, we extract the most probable decipherment and synthesize it into audible form. At iteration 0, with uniform probabilities, the result is pure babble. At iteration 1, Spanish speakers report that "it sounds like someone speaking Spanish, but using words I don't know." At iteration 15, the decipherment can be readily understood. (Recordings can be accessed on the World Wide Web at http://www.isi.edu/natural-language/mt/decipher.html.)
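The two model components can be made concrete with a few lines of code. The sketch below is our own toy illustration of the noisy-channel scoring, not the paper's implementation, and every probability value in it is invented.

```python
import math

# Toy sketch of the two noisy-channel quantities: a sound-triple model
# for P(s1...sn) and a one-for-one spelling model for P(c1...cn | s1...sn).

def log_p_sounds(sounds, trigram, floor=1e-6):
    """log P(s1...sn), approximated by a product of sound-triple probabilities."""
    return sum(math.log(trigram.get(tuple(sounds[i - 2:i + 1]), floor))
               for i in range(2, len(sounds)))

def log_p_chars(chars, sounds, spelling, floor=1e-6):
    """log P(c1...cn | s1...sn) under the one-for-one spelling model."""
    return sum(math.log(spelling.get((s, c), floor))
               for s, c in zip(sounds, chars))

# Invented tables: the frequent triple "a s t" versus the rare "t l k".
trigram = {("a", "s", "t"): 0.02, ("t", "l", "k"): 0.0001}
spelling = {("a", "a"): 0.9, ("a", "á"): 0.1, ("s", "s"): 0.8, ("t", "t"): 0.9}

better = log_p_sounds(["a", "s", "t"], trigram)   # preferred decipherment
worse = log_p_sounds(["t", "l", "k"], trigram)    # dispreferred one
# channel score for writing the sounds "a s t" as the characters "a s t"
channel = log_p_chars(list("ast"), ["a", "s", "t"], spelling)
```

Working in log space, as here, is the usual way to keep the long products of small probabilities from underflowing.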

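The training loop just described can also be sketched end to end. The code below is our illustration, not the authors' system: it shrinks the sound model from triples to pairs, uses a two-sound, two-character toy alphabet with invented numbers, and omits the numerical scaling a real implementation needs for long documents. The structure, however, matches Figure 4: the sound model stays fixed, and EM re-estimates only P(character | sound) with forward-backward dynamic programming (Baum, 1972).

```python
from collections import defaultdict

# Fixed toy sound model (pretend it was estimated from spoken data).
SOUNDS = ["a", "b"]
TRANS = {"a": {"a": 0.1, "b": 0.9}, "b": {"a": 0.9, "b": 0.1}}  # P(s'|s)
INIT = {"a": 0.9, "b": 0.1}                                     # P(s1)

def em_train(chars, emit, iters=10):
    """Re-estimate P(c|s) in `emit` by forward-backward EM; sound model fixed."""
    n = len(chars)
    for _ in range(iters):
        # forward probabilities alpha[t][s]
        alpha = [{s: INIT[s] * emit[s][chars[0]] for s in SOUNDS}]
        for t in range(1, n):
            alpha.append({s: emit[s][chars[t]] *
                          sum(alpha[t - 1][r] * TRANS[r][s] for r in SOUNDS)
                          for s in SOUNDS})
        # backward probabilities beta[t][s]
        beta = [dict.fromkeys(SOUNDS, 1.0) for _ in range(n)]
        for t in range(n - 2, -1, -1):
            for s in SOUNDS:
                beta[t][s] = sum(TRANS[s][r] * emit[r][chars[t + 1]] * beta[t + 1][r]
                                 for r in SOUNDS)
        total = sum(alpha[n - 1][s] for s in SOUNDS)   # P(c1...cn)
        # E-step: expected (sound, character) counts; M-step: renormalize.
        counts = {s: defaultdict(float) for s in SOUNDS}
        for t in range(n):
            for s in SOUNDS:
                counts[s][chars[t]] += alpha[t][s] * beta[t][s] / total
        for s in SOUNDS:
            z = sum(counts[s].values())
            for c in emit[s]:
                emit[s][c] = counts[s][c] / z
    return emit

# Near-uniform starting spelling model, as in the experiment described above.
emit = {"a": {"x": 0.51, "y": 0.49}, "b": {"x": 0.49, "y": 0.51}}
em_train(list("xyxyxyxy"), emit)   # P(x|a) and P(y|b) climb toward 1
```

On this toy "document," the alternating sound model plus the tiny initial asymmetry is enough for EM to discover that sound a is written x and sound b is written y, just as the real system discovers Spanish spellings.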
Figure 4: Training scheme for decipherment. We first train a phonetic model on phonetic data. We then combine the phonetic model with a generic (uniform) spelling model to create a probabilistic generator of character sequences. Given a particular character sequence ("ancient document"), the EM algorithm searches for adjustments to the spelling model that will increase the probability of that character sequence. The phonetic model P(s1 ... sn) is trained on observable spoken data; only the sound-to-character parameter values are allowed to change during training.

Figure 5: Performance of Spanish decipherment. As we increase the number of EM iterations, we see an improvement in decipherment performance (measured in terms of correctly generated phonemes). The best result is obtained by weighting the learned spelling model more highly than the sound model, i.e., by choosing a phonetic decoding s1 ... sn for character sequence c1 ... cn that maximizes P(s1 ... sn) · P(c1 ... cn | s1 ... sn)^3.

If we reverse-engineer Don Quixote, we can obtain a gold-standard phonetic decipherment. Our automatic decipherment correctly identifies 96% of the sounds. Incorrect or dropped sounds are due to our naive one-for-one model, and not to weak algorithms or small corpora. For example, "de la Mancha" is deciphered as "d e l a m a n T i a" even though the characters ch really represent the single sound tS rather than the two sounds T i.

Figure 5 shows how performance changes at each EM iteration. It shows three curves. The worst-performing curve reflects the accuracy of the most probable decipherment using the formula above, i.e., the one that maximizes P(s1 ... sn) · P(c1 ... cn | s1 ... sn). We find that it is better to ignore the P(s1 ... sn) factor altogether, because while the learned sound-to-character probabilities are fairly good, they are still somewhat unsure, and this leaves room for the phonetic model to overrule them incorrectly. However, the P(s1 ... sn) model does have useful things to contribute. Our best performance, shown in the highest curve, is obtained by weighting the learned sound-to-character probabilities more highly, i.e., by maximizing P(s1 ... sn) · P(c1 ... cn | s1 ... sn)^3.

We performed some alternate experiments. Using phoneme pairs instead of triples is workable--it results in a drop from 96% accuracy to 92%. Our main experiment uses word separators; removing these degrades performance. For example, it becomes more difficult to tell whether the r character should be trilled or not. In our experiments with Japanese and Chinese, described next, we did not use word separators, as these languages are usually written without them.

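The weighted decoding rule behind the best curve in Figure 5 can also be sketched directly. The code below is our toy illustration, not the paper's decoder: it uses a sound-pair model instead of triples and invented probability tables, but it shows the Viterbi search for the sound sequence maximizing P(s1 ... sn) · P(c1 ... cn | s1 ... sn)^3 under the one-for-one assumption.

```python
import math

def decipher(chars, sounds, bigram, spelling, weight=3.0, floor=1e-9):
    """Viterbi search for the sound sequence maximizing
    P(s1...sn) * P(c1...cn | s1...sn)**weight (one sound per character)."""
    lg = lambda p: math.log(max(p, floor))
    # best log-score of a sound prefix ending in sound s (uniform start)
    delta = {s: weight * lg(spelling[s].get(chars[0], 0.0)) for s in sounds}
    backptr = []
    for c in chars[1:]:
        prev = delta
        delta, ptr = {}, {}
        for s in sounds:
            best = max(sounds, key=lambda r: prev[r] + lg(bigram[r][s]))
            delta[s] = (prev[best] + lg(bigram[best][s])
                        + weight * lg(spelling[s].get(c, 0.0)))
            ptr[s] = best
        backptr.append(ptr)
    # trace back the best path
    s = max(sounds, key=lambda r: delta[r])
    path = [s]
    for ptr in reversed(backptr):
        s = ptr[s]
        path.append(s)
    return path[::-1]

# Invented toy tables: uniform sound pairs, near-deterministic spelling.
sounds = ["k", "a", "s"]
bigram = {r: {s: 1.0 / 3 for s in sounds} for r in sounds}
spelling = {"k": {"c": 1.0}, "a": {"a": 1.0}, "s": {"s": 0.9, "c": 0.1}}
print(decipher(list("casa"), sounds, bigram, spelling))  # ['k', 'a', 's', 'a']
```

Raising the channel term to a power greater than one simply trusts the learned spelling probabilities more than the phonetic prior, which is exactly the trade-off discussed above.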
5 Syllabic Writing (Japanese Kana)

The phonology of Japanese adheres strongly to a consonant-vowel-consonant-vowel structure, which makes it quite amenable to syllabic writing. Indeed, the Japanese have devised a kana syllabary consisting of about 80 symbols. There is a symbol for ka, another for ko, etc. Thus, the writing of a sound like K depends on its phonetic context. Modern Japanese is not written in pure kana; it employs a mix of alphabetic, syllabic, and logographic writing devices. However, we will use pure kana as a stand-in for a wide range of syllabic writing systems, as Japanese data is readily available. We obtain kana text sequences from the Kyoto Treebank, and we obtain sound sequences by using the finite-state transducer described in (Knight and Graehl, 1998).

As with Spanish, we build a spoken language model based on sound-triple frequencies. The sound-to-kana model is more complex. We assume that each kana token is produced by a sequence of one, two, or three sounds. Using knowledge of syllabic writing in general--plus an analysis of Japanese sound patterns--we restrict those sequences to be (1) consonant-vowel, (2) vowel-only, (3) consonant-no-vowel, and (4) consonant-semivowel-vowel. For initial experiments, we mapped "small kana" onto their large versions, even though this leads to some incorrect learning targets, such as KIYO instead of KYO. We implement the sound and sound-to-kana models in a large finite-state machine and use EM to learn individual weights such as P(ka-kana | SH Y U). Unlike the Spanish case, we entertain phonetic hypotheses of various lengths for any given character sequence.

Deciphering 200 sentences of kana text yields 99% phoneme accuracy. We render the sounds imperfectly (yet inexpensively) through our public-domain Spanish synthesizer. The result is comprehensible to a Japanese speaker. We also experimented with deciphering smaller documents: 100 sentences yields 97.5% accuracy; 50 sentences yields 96.2% accuracy; 20 sentences yields 82.2% accuracy; five sentences yields 48.5% accuracy. If we were to give the sound sequence model some knowledge about words or grammar, the accuracy would likely not fall off as quickly.

6 "Logographic" Writing (Chinese)

As we mentioned in Section 2, Chinese characters have internal phonetic components, and Chinese does not really have a different character for every word; so, it is not really logographic. However, it is representative of writing systems whose distinct characters are measured in the thousands, as opposed to 20-50 for alphabets and 40-90 for syllabaries. This creates several difficulties for decipherment:

• computational complexity--our decipherment algorithm runs in time roughly cubic in the number of known sound triples.
• very rare characters--if we only see a character once, the context may not be rich enough for us to guess its sound.
• sparse sound-triple data--the decipherment of a written text is likely to include novel sound triples.

We created spoken language data for Chinese by automatically (if imperfectly) pronouncing Chinese text. We are indebted to Richard Sproat for running our documents through the text-to-speech system at Bell Labs. We created sound-pair frequencies over the resulting set of 1177 distinct syllables, represented in pinyin format, suitable for synthesizing speech. We attempted to decipher a written document of 1900 phrases and sentences, containing 2113 distinct characters and no word separators.

Our result was 22% accuracy, after 20 EM iterations. We may compare this to a baseline strategy of guessing the pinyin sound de0 (English "of") for every character, which yields 3.2% accuracy. This shows a considerable improvement, but the speech is not comprehensible. Due to computational limits, we had to (1) eliminate all pinyin pairs that occurred less than five times, and (2) prevent our decoder from proposing any novel pinyin pairs. Because our gold-standard decipherment contained

many rare sounds and novel pairs, these computational limits severely impaired accuracy.

7 Discussion

We have presented and tested a computational approach to phonetically deciphering written scripts. We cast decipherment as a special kind of text-to-speech conversion in which we have no rules or data that directly connect speech sounds with written characters. We set up general finite-state transducers for turning sounds into writing, and use the EM algorithm to estimate their parameters. The whole process is driven by knowledge about the spoken language, which may include frequency information about sounds, sound sequences, words, grammar, meaning, etc. An interesting result is that decipherment is possible using limited knowledge of the spoken language, e.g., sound-triple frequencies. This is encouraging, because it may provide robustness against language evolution, a fixture of archaeological deciphering.

However, our experiments have been targeted a bit narrowly. We were able to re-use the Spanish decoder on Chinese, but it could not work for Japanese kana. Even our Japanese decoder would fail on an alternative syllabic script for Japanese which employed a single symbol for the sound KAO, instead of separate kana symbols for KA and O. One ambitious line of research would be to examine writing systems in an effort to invent a single, generic "mother of all writing systems," whose specializations include a large fraction of actual ones. To cover Spanish and Japanese, for example, we could set up a scheme in which each sound produces zero or more characters, where the sound is potentially influenced by the two sounds immediately preceding and following it. This gets tricky: the "mother of all" has to be general, but it also has to be narrow enough to support decipherment through automatic training. (Sproat, forthcoming) suggests the class of finite-state transducers as one candidate. This narrows things down significantly from the class of Turing machines, but not far enough for the direct application of known training algorithms.

In the future, we would like to attack ancient scripts. We would start with scripts that have already been roughly deciphered by archaeologists. Computer decipherments could be checked by humans, and published human decipherments could be checked by computer. We would subsequently like to attack ancient scripts that have yet to be deciphered. High-speed computers are not very intelligent, but they display a patience that exceeds even the most thorough human linguist.

It will be important to consider text-layout questions when dealing with real scripts. For example, Mayan glyphs may run from top to bottom, right to left, or they may run differently. Furthermore, each glyph may contain sub-parts representing up to ten sounds, and these may be organized in a spiral pattern.

Another intriguing possibility is to do language identification at the same time as decipherment. Such identification would need to be driven by online sound sets and spoken corpora that span a very wide range of languages. Whether a document represents a given language could then be estimated quantitatively. In case language identification fails, we may be faced with a completely extinct language. Current computational techniques demonstrate that it is theoretically possible to figure out where the nouns, verbs, and adjectives are from raw text, but actual translation into English is another matter. Archaeologists have sometimes succeeded in such cases by leveraging bilingual documents and loan words from related languages. Only a truly optimistic cryptanalyst would believe that progress could be made even without these resources; but see (Al-Onaizan and Knight, 1999) for initial results on Arabic-English translation using only monolingual resources.

Finally, we note that the application of source-channel models to the text-to-speech problem is promising. This kind of statistical modeling is prevalent in speech recognition, but ours is one of the few applications in speech synthesis. It may be possible to use uncorrelated streams of speech and text data to learn mappings that go beyond character pronunciation, to pitch, duration, stress, and so on.

References

Y. Al-Onaizan and K. Knight. 1999. A ciphertext-only approach to translation. (In preparation.)

L. E. Baum. 1972. An inequality and associated maximization technique in statistical estimation of probabilistic functions of a Markov process. Inequalities, 3.

J. Chadwick. 1958. The Decipherment of Linear B. Cambridge University Press, Cambridge.

M. Coe. 1993. Breaking the Maya Code. Thames and Hudson, New York.

J. DeFrancis. 1989. Visible Speech: The Diverse Oneness of Writing Systems. University of Hawaii Press, Honolulu.

A. P. Dempster, N. M. Laird, and D. B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(B):1-38.

K. Knight and J. Graehl. 1998. Machine transliteration. Computational Linguistics, 24(4).

F. Pereira and M. Riley. 1997. Speech recognition by composition of weighted finite automata. In E. Roche and Y. Schabes, editors, Finite-State Language Processing. MIT Press.

G. Sampson. 1985. Writing Systems. Stanford University Press, Stanford.

R. Sproat. Forthcoming. A Computational Theory of Writing Systems.
