Word Segmentation for Akkadian Cuneiform, an Ancient Writing System and a Language Used for About 3 Millennia in the Ancient Near East
Akkadian Word Segmentation

Timo Homburg M.Sc., Dr. Christian Chiarcos
Institute for Computer Science, Goethe University, Robert-Mayer-Str. 10, 60325 Frankfurt am Main, Germany
[email protected], [email protected]

Abstract

We present experiments on word segmentation for Akkadian cuneiform, an ancient writing system and a language used for about 3 millennia in the ancient Near East. To the best of our knowledge, this is the first study of this kind applied to either the Akkadian language or the cuneiform writing system. As a logosyllabic writing system, cuneiform structurally resembles East Asian writing systems, so we employ word segmentation algorithms originally developed for Chinese and Japanese. We describe results of rule-based algorithms, dictionary-based algorithms, and statistical and machine learning approaches. Our results may indicate promising steps in cuneiform word segmentation that can create and improve natural language processing in this area.

Keywords: Assyriology, Cuneiform, Akkadian, Chinese, Word Segmentation, Machine Learning

1. Introduction

Word segmentation is the most elementary task in natural language processing of written language. In most alphabetical writing systems, this task is commonly referred to as tokenization and can be easily solved through the interpretation of orthographical markers for word and sentence boundaries, e.g., white spaces. Where these are lacking, however, word segmentation is a challenging task, a classical (and successfully addressed) problem in logographic writing systems like Chinese and logosyllabic writing systems like Japanese.

Here, we describe experiments on cuneiform, a writing system developed in the 4th millennium BCE in Mesopotamia and subsequently applied to various Semitic, Indo-European and isolate languages in the region. As a logosyllabic writing system, it shares important structural characteristics with Chinese and Japanese (Ikeda, 2007), so we evaluate word segmentation methods successfully applied to these languages. However, these languages are unrelated to those of the Ancient Near East, so future research will focus on developing aspects specific to languages with cuneiform writing.

As a writing system, cuneiform poses a number of unique challenges:

• The same character can be read as a logograph or as a syllable, e.g., as the logograph GURU ‘young man’ or with its phonological reading as a syllabic sign.

• As a syllabic sign, a single character can have multiple different readings, e.g., grounded in the possible Sumerian pronunciation(s) of the logograph, or the pronunciation of their Akkadian translations: the character may be read as dan/tan (from Akk. dannu ‘strong, powerful’), kal (from Sum. kal ‘rare, valuable’ and kalag ‘strong’), rib (from Sum. rib ‘outstanding, strong’), etc. (Tinney and others, 2006; Lauffenburger, nd; Borger, 2004).

• CVC syllables (e.g., dan) can be written as a pair of CV-VC characters (da-an) or with a single CVC character (dan) (Gelb, 1957, p.8, Da-an--ri vs. Dan-r-ri).

We primarily consider the Akkadian language, the dominant language of the Ancient Near East from the 3rd to the 1st millennium BCE. Originally spoken in Mesopotamia, it became the lingua franca in the Near East during the 2nd millennium BCE, with an extensive body of material comparable only to corpus languages such as Classical Latin or Ancient Greek.

With a considerable amount of cuneiform clay tablets not yet deciphered, and new ones being continuously excavated, the automated processing of the Akkadian language is thus of tremendous importance. Previous research on automated digitization focused on producing 3D scans of tablets (Sect. 2.), with Optical Character Recognition (OCR) being a logical next step in the development. Successful cuneiform OCR, however, needs to be accompanied by knowledge-rich NLP methods for the contextual disambiguation of characters: one of the key characteristics of cuneiform is that a character can be read as a logograph, as a determinative, or as a syllabic sign (with different phonemic values). The reading of a character is thus heavily dependent on its context. Word segmentation approaches may thus be a key component of any approach to cuneiform OCR.

Akkadian is the oldest attested Semitic language, and has thus occasionally been considered in experiments on NLP for Semitic languages, but mostly focusing on (rule-based) morphological analysis. To the best of our knowledge, the present paper describes the first study of word segmentation in Akkadian cuneiform. It thus provides a primary point of orientation for any subsequent experiments on cuneiform word segmentation and will be of utmost importance to future experiments on cuneiform OCR and Akkadian NLP.

2. State Of the Art

We distinguish three types of word segmentation algorithms:

• rule-based: segmentation rules derived from grammar

• dictionary-based: segmentation by lookup in a (statically enhanced) dictionary

• statistical/machine learning: data-driven segmentation as learnt from segmented corpora

As shown in several SIGHAN BakeOffs in the last decade (Sproat and Emerson, 2003), in Chinese, machine learning and dictionary-based approaches like MaxMatch (Chen and Liu, 1992) produce reasonable results, while rule-based methods are commonly used as a baseline (Palmer and Burger, 1997). In Japanese, however, rule-based algorithms like Tango (Ando and Lee, 2000) proved to be more successful. This is partially due to the morphological richness of Japanese as compared to Chinese.

As a point of orientation for subsequent studies on cuneiform, we evaluate selected approaches from these classes in their performance on Akkadian. Neither the Akkadian language nor cuneiform as a writing system has been addressed in this respect before.

Along with other cuneiform languages, Akkadian has a considerable research history in NLP. For the greatest part, existing approaches are concerned with rule-based morphological analyzers, e.g., Kataja and Koskenniemi (1988), Barthélemy (1998), Macks (2002), Barthélemy (2009), Khait (accepted) for Akkadian, or Valentin Tablan Wim Peters (2006) for Sumerian. As for data-driven morphological tools, the state of the art in the field is represented by the Lemmatizer of the Open Richly Annotated Cuneiform Corpus (ORACC),1 which supports manual morphological annotation for Akkadian, Sumerian and (to a limited degree) Hittite with a lookup functionality in the annotated corpus. Such example-based approaches can be extended to automatically transfer morphological rules through phonological equivalences, as demonstrated by Snyder et al. (2010) for the projection of Hebrew morphology and lexicon to Ugaritic, another Semitic cuneiform language. As for higher levels of linguistic analysis, we are not aware of any tools for syntactic or semantic annotation for Akkadian; however, the latter has been considered for administrative texts from the Sumerian period, whose highly conventionalized structure can be exploited for concept classification (Jaworski, 2008).

computer graphics, the obvious gap between both lines of research lies in the absence of any studies concerned with the transition from the (identified) sign to its linguistic interpretation, a challenging task, as mentioned before.

With our paper, we describe the first experiments in this direction, with a specific focus on segmenting character sequences into words as a core component for future approaches to transliteration.

3. Experimental Setup

3.1. Corpus Data

We use corpora from three different periods and dialects, namely Old Babylonian, Middle Babylonian and Neo-Assyrian, from the Cuneiform Digital Library Initiative (CDLI),2 representing most of the available texts (clay tablets) of the given periods of time. The corpora were randomly split in an 80:20 ratio for training and testing purposes (on a per-tablet, not a per-line basis). For the experiments, we trained our segmentation algorithms on each of these language stages, and performed evaluations on each language stage respectively. For reasons of space, we only report results for the Middle Babylonian training corpus and evaluation against the Middle Babylonian test corpus in detail. Further experiments showing robust performance across different language stages will be presented graphically. Additionally, we will present results of classifications using corpus data of one epoch applied to other epochs of the same language to get an impression of the performance of the algorithms on related data.

The CDLI ATF format contains metadata, a (word-segmented) transliteration, and (optionally) a translation. CDLI data always represents cuneiform in lines as found on the clay tablets. To minimize ambiguity, Akkadian writers tried to avoid incomplete words at the end of a line, so the tablets themselves provide initial data on word segmentation. From the transliteration, we restored the original UTF-8 characters on the basis of a sign list that we compiled from various resources. Non-restorable characters were ignored and thus are not represented in the resulting texts. This data represents our gold standard. It should be noted that the mapping to UTF-8 cannot be trivially reversed because of the highly ambiguous phonological and ideographic meaning of characters.

After conversion and the removal of whitespaces, segmentation algorithms of three categories have been applied. Figure 1 shows the segmentation process.

3.2. Baseline

As our baseline we adopted the Character-As-Word