A Probabilistic Model of Ancient Egyptian Writing
Mark-Jan Nederhof and Fahrurrozi Rahman
School of Computer Science
University of St Andrews
North Haugh, St Andrews, KY16 9SX, UK

Abstract

This article investigates a probabilistic model to describe how signs form words in Ancient Egyptian writing. This applies to both hieroglyphic and hieratic texts. The model uses an intermediate layer of sign functions. Experiments are concerned with finding the most likely sequence of sign functions that relates a given sequence of signs and a given sequence of phonemes.

1 Introduction

Ancient Egyptian writing, used in Pharaonic Egypt, existed in the form of hieroglyphs, often carved in stone or painted on walls, and sometimes written on papyrus (Allen, 2000). Hieroglyphs depict people, animals, plants and various kinds of objects and geographical features. A cursive form of Ancient Egyptian writing, called hieratic, was predominantly written on papyrus. Most hieratic symbols can be seen as simplified hieroglyphs, to such an extent that it is difficult for the modern untrained eye to tell what is depicted. Because hieratic handwriting varied considerably over time, with notable differences between regions and scribes, the creation of computer fonts for hieratic is problematic, and consequently scholars commonly resort to publishing hieratic texts in a normalized hieroglyphic font. Since Version 5.2, Unicode contains a selection of 1071 hieroglyphs. Henceforth we will use the term sign to refer to a hieroglyph or a hieratic symbol.

The Ancient Egyptian language belongs to the family of Afro-Asiatic languages, which includes the Semitic languages (Loprieno, 1995). As in scripts of several Semitic languages (e.g. Hebrew, Arabic, Phoenician), only consonants are written. Modern scholars use 24 or 25 letters to transliterate Egyptian texts in terms of these consonants. Most are written as Latin characters, some with diacritical marks, plus aleph ꜣ and ayin ꜥ. An equal sign is commonly used to precede suffix pronouns; thus sḏm means “to hear” and sḏm=f “he hears”. A dot can be used to separate other morphemes; for example, in sḏm.tw=f, “he is heard”, the morpheme .tw indicates passive.

The Ancient Egyptian writing system itself is a mixture of phonetic and semantic elements. The most important are phonograms, logograms, and determinatives. A phonogram is a sign that represents a sequence of one, two or three letters, without any semantic association. A logogram represents one particular word, or more generally the lemma of a word or a group of etymologically related words. A determinative is commonly written at the end of a word, following phonograms, to clarify the meaning of a word; in their most obvious use, determinatives disambiguate between homophones, or more precisely, different words consisting of the same consonants. In addition, there are typographical signs, for example, three strokes that indicate the plural form of a noun (also used for collective nouns). More classes of signs can be distinguished, such as the phonetic determinatives, which tend to be placed near the end of a word, next to normal determinatives, but their function is phonetic rather than semantic, i.e. they repeat letters already written by phonograms.

What makes automatic analysis of Ancient Egyptian writing so challenging is that there was no fixed way of writing a word, so that table-lookup is largely ineffective. Even within a single text, the same word can often be found written in three or more different ways. Moreover, one sign can often be used in different functions, e.g. as phonogram or as determinative. Some signs can be used as different phonograms with different sound values. Together with the absence of word boundary markers, this makes it hard even to segment a text into words.

Generalizing statements can be made about writings of words. Typically, either a word starts with a number of phonograms, covering all the letters of the stem, possibly some covered more than once, followed by one or more determinatives, or a word starts with a logogram, possibly followed by one or more phonograms especially for endings, possibly followed by one or more determinatives. More phonograms can follow the determinatives for certain suffixes. This coarse description is, however, inadequate to model the wide spectrum of writings of words, nor would it suffice to disambiguate between alternative analyses of one sequence of signs.

These factors motivate the search for an accurate and robust model that can be trained on data, and that becomes more accurate as more data becomes available. Ideally, the model should be amenable to unsupervised training. Whereas linguistic models should generally avoid unwarranted preconceptions, we see it as inevitable that our model has some knowledge about the writing system already built in, for two reasons. First, little training material is currently available, and second, the number of signs is quite large, so that the little training material is spread out over many parameters. The a priori knowledge in our model consists of a sign list that enumerates possible functions of signs and a formalization of how these functions produce words. This knowledge sufficiently reduces the search space, so that probabilistic parameters can be relatively easily estimated.

In our framework, a sign function is formally identified by the combination of (a) the one or more signs of its writing, (b) its class, which could be ‘phonogram’, ‘logogram’, ‘determinative’, etc., (c) zero, one or two values, depending on the class. One example is the phonogram function for sign 𓂋 with sound value r. There is a logogram function for the same sign, whose value is the lemma rꜣ, “mouth”. A typographical function for the three strokes may have a semantic value ‘plural’ and a phonetic value that is the masculine plural ending -w.

The problem we will address in the experiments is guessing the sign functions given the signs and the letters. This is related to the problem of automatically obtaining transliteration from hieroglyphic text. As far as we are aware, the earliest work to attempt this was Tsukamoto (1997). It relied on simple Unix applications such as ‘grep’ and ‘sed’. The same problem was addressed by Rosmorduc (2008), using manually produced rewrite rules. Further work along these lines by Barthélemy and Rosmorduc (2011) uses two approaches, namely cascades of binary transducers and intersections of multitape transducers, with the objective of comparing the sizes of the resulting automata.

A more modest task is to automatically align given hieroglyphic text and transliteration, as considered by Nederhof (2008), who used an automaton-based approach with configurations, similar to that in Section 4, except that manually determined penalties were used instead of probabilities.

Relating hieroglyphic texts and their Egyptological transliteration is an instance of relating two alternative orthographic representations of the same language. The problem of mechanizing this task is known as machine transliteration. For example, Knight and Graehl (1998) consider translation of names and technical terms between English and katakana, and Malik et al. (2008) consider transliteration between Hindi and Urdu. Another closely related problem is conversion between graphemes and phonemes, considered for example by Galescu and Allen (2002).

Typical approaches to solve these tasks involve finite-state transducers. This can be justified by the local dependencies between input and output, that is, ultimately the transliteration can be broken down into mappings from at most n to at most m symbols, for some small n and m. For Ancient Egyptian however, it is unclear what those bounds on n and m would be. In this sense, Ancient Egyptian may pose a challenge to the Regularity hypothesis from Sproat (2000). For this reason we do not exclusively rely on finite-state methods in this paper.

2 Sign list

Essential to the application of our model is an annotated sign list. We have created such a list in the form of a collection of XML files.¹ Apart from being machine-readable, these files can also be converted to human-readable web pages. Among other things, the files gather knowledge about the various functions of the 1071 signs from the Unicode repertoire, gathered from a number of sources, the foremost of which is Gardiner (1957). The annotated sign list is necessarily imperfect and incomplete, which is due to inadequacies of the Unicode set itself (Rosmorduc, 2002/3; Polis and Rosmorduc, 2013), as well as to the nature of Ancient Egyptian writing, which gave scribes considerable freedom to use existing signs in new ways and to invent new signs where existing signs seemed inadequate. We have furthermore ignored the origins of signs, and distinguish fewer nuances of sign use than e.g. Schenkel (1971).

¹ http://mjn.host.cs.st-andrews.ac.uk/egyptian/unicode/

Our functions are divided into logograms, determinatives, phonograms, phonetic determinatives and typographical signs. The typographical signs include for example the three strokes that indicate plurality or collectivity.

[...] transliterations. For example, the information that the word nmtt, “step”, denoted by the logogram 𓂻, is feminine can be used to infer that uses of the logogram in plural writings should be matched to nmtwt, “steps”, with the feminine plural ending -wt in place of the feminine singular ending -t. Similarly, the logogram 𓂙, for ẖnj, “to row”, is accompanied by information that its stem is ẖn, so we can identify the use in the writing of ẖn=f, “he rows”, without the weak consonant j, which disappears in most inflections.

3 Corpus

There is currently only one comprehensive corpus of Late Egyptian, which is still under development