A Computational Approach to Deciphering Unknown Scripts
Total Page:16
File Type:pdf, Size:1020Kb
A Computational Approach to Deciphering Unknown Scripts Kevin Knight Kenji Yamada USC/Information Sciences Institute USC/Information Sciences Institute 4676 Admiralty Way 4676 Admiralty Way Marina del Rey, CA 90292 Marina del Rey, CA 90292 [email protected] [email protected] Abstract • minimal input (monolingual inscriptions We propose and evaluate computational techniques only) for deciphering unknown scripts. We focus on the case in which an unfamiliar script encodes a known This situation has arisen in many famous language. The decipherment of a brief document cases of decipherment--for example, in the Lin- or inscription is driven by data about the spoken ear B documents from Crete (which turned language. We consider which scripts are easy or hard out to be a "non-Greek" script for writing an- to decipher, how much data is required, and whether cient Greek) and in the Mayan documents from the techniques are robust against language change Mesoamerica. Both of these cases lay unsolved over time. until the latter half of the 20th century (Chad- 1 Introduction wick, 1958; Coe, 1993). In computational linguistic terms, this de- With surprising frequency, archaeologists dig up cipherment task is not really translation, but documents that no modern person can read. rather text-to-speech conversion. The goal of Sometimes the written characters are familiar the decipherment is to "make the text speak," (say, the Phoenician alphabet), but the lan- after which it can be interpreted, translated, guage is unknown. Other times, it is the reverse: etc. Of course, even after an ancient docu- the written script is unfamiliar but the language ment is phonetically rendered, it will still con- is known. Or, both script and language may be tain many unknown words and strange con- unknown. structions. Making the text speak is therefore Cryptanalysts also encounter unreadable doc- only the beginning of the story, but it is a cru- uments, but they try to read them anyway. cial step. With patience, insight, and computer power, Unfortunately, current text-to-speech sys- they often succeed. Archaeologists and lin- tems cannot be applied directly, because guists known as epigraphers apply analogous they require up front a clearly specified techniques to ancient documents. Their deci- sound/writing connection. For example, a sys- pherment work can have many resources as in- tem designer may create a large pronunciation put, not all of which will be present in a given dictionary (for English or Chinese) or a set of case: (1) monolingual inscriptions, (2) accom- manually constructed character-based pronun- panying pictures or diagrams, (3) bilingual in- ciation rules (for Spanish or Italian). But in scriptions, (4) the historical record, (5) physical decipherment, this connection is unknown! It is artifacts, (6) bilingual dictionaries, (7) informal exactly what we must discover through analysis. grammars, etc. There are no rule books, and literate informants In this paper, we investigate computational are long-since dead. approaches to deciphering unknown scripts, and report experimental results. We concentrate on 2 Writing Systems the following case: To decipher unknown scripts, is useful to under- stand the nature of known scripts, both ancient • unfamiliar script and modern. Scholars often classify scripts into • known language three categories: (1) alphabetic, (2) syllabic, 37 purely logographic; it is sometimes called mor- phophonemic. Ancient writing is also mixed, and archaeologists frequently observe radical writing changes in a single language over time. 3 Experimental Framework In this paper, we do not decipher any ancient scripts. Rather, we develop algorithms and ap- ply them to the "decipherment" of known, mod- ern scripts. We pretend to be ignorant of the connection between sound and writing. Once our algorithms have come up with a proposed phonetic decipherment of a given document, we route the sound sequence to a speech synthe- sizer. If a native speaker can understand the speech and make sense of it, then we consider the decipherment a success. (Note that the na- Figure 1: The Phaistos Disk (c. 1700BC). The tive speaker need not even be literate, theoreti- disk is six inches wide, double-sided, and is the cally). We experiment with modern writing sys- earliest known document printed with a form of tems that span the categories described above. movable type. We are interested in the following questions: .~ and (3) log6graphic (Sampson, 1985). • Can automatic techniques decipher an un- known script? If so, how accurately? Alphabetic systems attempt to repre- • What quantity of written text is needed sent single sounds with single characters, for successful decipherment? (this may be though no system is "perfect." For exam- • quite limited by circumstances) ple, Semitic alphabets have no characters for vowel sounds. And even highly regular • What knowledge of the spoken language is writing systems like Spanish have plenty of needed? Can it to be extracted automati- spelling variation, as we shall see later. cally from available resources? What quan- tity of resources? Syllabic systems have characters for entire syllables, such as "ba" and "shu." Both • Are some writing systems easier to decipher Linear B and Mayan are primarily syllabic, than others? Are there systematic differ- as is Japanese kana. The Phaistos Disk ences among alphabetic, syllabic, and lo- from Crete (see Figure 1) is thought to be gographic systems? syllabic, because of the number of distinct • Are word separators necessary or helpful? characters present. • Can automatic techniques be robust Finally, logographic systems have charac- against language evolution (e.g., modern ters for entire words. Chinese is often cited versus ancient forms of a language)? as a typical example. • Can automatic techniques identify the lan- Unfortunately, actual scripts do not fall guage behind a script as a precursor to de- neatly into one category or another (DeFrancis, ciphering it? 1989; Sproat, forthcoming). Written Japanese will contain syllabic kana, alphabetic roomaji, 4 Alphabetic Writing (Spanish) and logographic kanji characters all in the same Five hundred years ago, Spaniards invaded document. Chinese characters actually have a Mayan lands, burning documents and effec- phonetic component, and words are often com- tively eliminating everyone who could read and posed of more than one character. Irregular write. (Modern Spaniards will be quick to point English writing is neither purely alphabetic nor out that most of the work along those lines 38 Sounds: B-+borv B, D, G, J (ny as in canyon), L (y as D~d in yarn), T (th as in thin), a, b, d, e, G~g f, g, i, k, l, m, n, o, p, r, rr (trilled), s, J--+fi t, tS (ch as in chin), u, x (h as in hat) L--+ llory a .--~ a or £ Characters: b --.~. b or v d--~d fi, £, 6, i, o, u, a, b, c, d, e, f, g, h, i, j, e ---~ e or 6 k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, f~f Z gag i~ior~ Figure 2: Inventories of Spanish sounds (with l~l rough English equivalents in parentheses) and m-+m characters. n--+n o-moor6 p--~p had already been carried out by the Aztecs). r--Jr Mayan hieroglyphs remained uninterpreted for many centuries. We imagine that if the Mayans t--rt tS~ch have invaded Spain, then 20th century Mayan scholars might be deciphering ancient Spanish u--~ u or fi x-+j documents instead. We begin with an analysis of Spanish writing. nothing --+ h T (followed by a, o, or u) ~ z The task of decipherment will be to re-invent these rules and apply them to written docu- T (followed by e or i) --+ c or z T (otherwise) ~ c ments in reverse. First, is necessary to settle on the basic inventory of sounds and characters. k (followed by e or i) ~ q u k (followed by s) ---+ x Characters are easy; we simply tabulate the dis- tinct ones observed in text. For sounds, we need k (otherwise) ~ c something that will serve as reasonable input to rr (at beginning of word) ~ r rr (otherwise) ---,, rr a speech synthesizer. We use a Spanish-relevant subset of the International Phonetic Alphabet s (preceded by k) ~ nothing s (otherwise) --+ s (IPA), which seeks to capture all sounds in all languages. Actually, we use an ASCII version of the IPA called SAMPA (Speech Assessment Figure 3: Spanish sound-to-character spelling Methods Phonetic Alphabet), originally devel- rules. The left-hand side of each rule contains a oped under ESPRIT project 1541. There is also Spanish sound (and possible conditions), while a public-domain Castillian speech synthesizer theright-hand side contains zero or more Span- (called Mbrola) for the Spanish SAMPA sound ish characters. set. Figure 2 shows the sound and character inventories. • silence can produce a character (h) Now to spelling rules. Spanish writing is clearly not a one-for-one proposition: Moreover, there are ambiguities. The sound L • a single sound can produce a single charac- (English y-sound) may be written as either ll or ter (a-+ a) y. The sound i may also produce the character y, so the pronunciation of y varies according to * a sound can produce two characters (tS context. The sound rr (trilled r) is written rr in ch) the middle of a word and r at the beginning of • two sounds can produce a single character a word. (k s ---+ x) Figure 3 shows a sample set of Spanish 39 spelling rules. We formalized these rules com- Armed with these basic probabilities, we can putationally in a finite-state transducer (Pereira compute two things. First, the total probabil- and Riley, 1997). The transducer is bidirec- ity of observing a particular written sequence of tional. Given a specific sound sequence, we characters cl ..