Toward Automatically Assembling Hittite-Language Cuneiform Tablet Fragments Into Larger Texts

Stephen Tyndall
University of Michigan
[email protected]

Abstract

This paper presents the problem within Hittite and Ancient Near Eastern studies of fragmented and damaged cuneiform texts, and proposes to use well-known text classification metrics, in combination with some facts about the structure of Hittite-language cuneiform texts, to help classify a number of fragments of clay cuneiform-script tablets into more complete texts. In particular, I propose using Sumerian and Akkadian ideogrammatic signs within Hittite texts to improve the performance of Naive Bayes and Maximum Entropy classifiers. Performance improves in some cases and clearly does not in others, suggesting that the variable frequency of these ideograms in individual fragments makes a considerable difference in the ideal choice of classification method. Further, complexities of the writing system and the digital availability of Hittite texts complicate the problem.

1 Introduction

The Hittite empire, in existence for about 600 years between 1800 and 1200 BCE, left numerous historical, political, and literary documents behind, written in cuneiform on clay tablets. There are a number of common problems that confront Hittite scholars interested in any subdiscipline of Hittitology, be it history, philology, or linguistics. Horst Klengel summarizes the issue most crucial to this paper:

"Some general problems, affecting both philologists and historians, are caused by the Hittite textual tradition itself. First, the bulk of the cuneiform material is fragmentary. The tablets, discovered in various depots in the Hittite capital and in some provincial centers, normally were of a larger size. When the archives were destroyed, the tablets for the most part broke into many pieces. Therefore, the joining of fragments became an important prerequisite for interpretation" (Klengel, 2002).

Most Hittite texts are broken, but a number exist in more than one fragmentary copy.

Figure 1 shows a photograph, taken from the University of Mainz Konkordanz der hethitischen Texte[1], of a typical Hittite cuneiform fragment.

Complete or partially complete texts are assembled from collections of fragments based on shape, writing size and style, and sentence similarity. Joins between fragments are not made systematically, but are usually discovered by scholars assembling large numbers of fragments that reference a specific subject, like some joins recently made in Hittite treaty documents (Beckman, 1997).

Joins are thus fairly rare compared to the rate at which new fragments are published. Such joins and the larger texts created therewith are catalogued according to a CTH (Catalogue des Textes Hittites[2]) number. Each individual text is composed of one or more cuneiform fragments belonging to one or more copies of a single original work.

[1] available at http://www.hethport.uni-wuerzburg.de/HPM/hethportlinks.html
[2] available at http://www.hethport.uni-wuerzburg.de/CTH/

Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pages 243–247, Jeju, Republic of Korea, 8-14 July 2012. © 2012 Association for Computational Linguistics

Figure 2 shows a published join in hand-copied cuneiform fragments. In this case, the fragments are not contiguous, and only the text on the two fragments was used to make the join.

The task then, for the purposes of this paper, is to connect unknown fragments of Hittite cuneiform tablets with larger texts. I treat this as a text classification task, where the larger, CTH-numbered texts are the categories, and small fragments are the bits of text to be assigned to these categories.
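The framing above, with CTH-numbered texts as categories and fragments as the items to be labeled, can be illustrated with a toy multinomial Naive Bayes classifier. This is only a minimal sketch under stated assumptions: the labels `CTH-A` and `CTH-B`, the token lists, and the function names `train` and `classify` are invented for illustration, and the add-one smoothing used here is a textbook choice, not a description of the toolkit used in the experiments.

```python
import math
from collections import Counter, defaultdict


def train(fragments):
    """fragments: iterable of (cth_label, tokens) pairs (hypothetical data layout)."""
    label_counts = Counter()             # number of training fragments per CTH text
    token_counts = defaultdict(Counter)  # token frequencies per CTH text
    vocab = set()
    for label, tokens in fragments:
        label_counts[label] += 1
        token_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, token_counts, vocab


def classify(tokens, model):
    """Return the CTH label with the highest add-one-smoothed log probability."""
    label_counts, token_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total)           # prior
        denom = sum(token_counts[label].values()) + len(vocab)  # smoothing
        for tok in tokens:
            if tok in vocab:  # tokens never seen in training are ignored
                score += math.log((token_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy data: two "texts", two training fragments each.
training = [
    ("CTH-A", ["A-NA", "KUR", "is-tar-ni"]),
    ("CTH-A", ["KUR", "e-es-du"]),
    ("CTH-B", ["DINGIR.MEŠ", "LUGAL"]),
    ("CTH-B", ["DINGIR.MEŠ", "nu"]),
]
model = train(training)
print(classify(["KUR", "A-NA"], model))   # CTH-A
print(classify(["DINGIR.MEŠ"], model))    # CTH-B
```

An unknown fragment is thus assigned to whichever catalogued text makes its tokens most probable; with real data there would be one label per CTH number rather than two.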
2 The Corpus of Hittite

Hittite cuneiform consists of a mix of syllabic writing for Hittite words and logographic writing, typically Sumerian ideograms, standing in for Hittite words. Most words are written out phonologically using syllabic signs, in structure mostly CV and VC, and a few CVC. Some common words are written with logograms from other Ancient Near Eastern languages, e.g. Hittite antuhša- 'man' is commonly written with the Sumerian-language logogram transcribed LÚ. Such writings are called Sumerograms or Akkadograms, depending on the language from which the ideogram is taken.

Figure 1: Photograph of a Hittite Tablet Fragment

The extant corpus of Hittite consists of more than 30,000 clay tablets and fragments excavated at sites in Turkey, Syria, and Egypt (Hoffner and Melchert, 2008, 2-3). Many of these fragments are assigned to one of the 835 texts catalogued in the CTH.

Figure 2: Published Fragment Join

3 Prior Work

A large number of prior studies on text classification have informed the progress of this study. Categorization of texts into genres is very well studied (Dewdney et al., 2001). Other related text classification studies have looked at classifying text by source, in contexts of speech, as in an attempt to classify some segments of speech into native and non-native speaker categories (Tomokiyo and Jones, 2001), and writing and authorship, as in the famous Federalist Papers study (Mosteller and Wallace, 1984), and context, as in a categorization of a set of articles according to which newspaper they appeared in (Argamon-Engelson et al., 1998).

Measures of similarity among sections of a single document bear a closer relation to this project than the works above. Previous studies have examined internal document similarity, using some vector-based metrics to judge whether documents maintain the same subject throughout (Nicholson, 2009).

Very little computational work on cuneiform languages or texts exists. The most notable example is a study that examined grapheme distribution as a way to understand Hurrian substratal interference in the orthography of Akkadian-language cuneiform texts written in the Hurrian-speaking town of Nuzi (Smith, 2007). Smith's work, though using different classifying methods and an enormously different corpus on a language with different characteristics, is the most similar to this study, since both are attempts to classify cuneiform fragments into categories: in Smith's case, into Hurrian-influenced Nuzi Akkadian and non-Nuzi standard Akkadian.

4 The Project Corpus

For this project, I use a corpus of neo-Hittite fragment transcriptions available from H. Craig Melchert (Melchert, ). The corpus is one large text file, divided into CTH-numbered sections, which themselves are divided into fragments labeled by their publication numbers, mostly KUB, which stands for Keilschrifturkunden aus Boghazköi, or KBo, Keilschrifttexte aus Boghazköi, the two major publications for Hittite text fragments.

I restricted the fragments used in this project to fragments belonging to texts known to exist in at least two copies, a choice that produces a larger number of fragments per text without requiring a judgment about what number of fragments in a text constitutes "fragmented enough" for a legitimate test of this task. This leaves 36 total CTH-numbered texts, consisting of 389 total fragments.

The fragments themselves are included as plain text, with restorations by the transcribers left intact and set off by brackets, in the manner typical of cuneiform transcription. In transcription, signs with phonemic value are written in lower case characters, while ideograms are represented in all caps. Sign boundaries are represented by a hyphen, indicating the next sign is part of the current word, by an equals sign, indicating the next sign is a clitic, or by a space, indicating that the next sign is part of a new word.

{KUB XXXI 25; DS 29}
x
[ ]A-NA KUR URUHa[t-ti?
[ i]s-tar-ni=sum-m[i
[ ]x nu=kn ki-x[
[ ] KUR URUMi-iz-ri=y[a
[is-tar-ni]=sum-mi e-es-du [
[ ] nu=kn A-NA KUR URUMi-iz-ri[
[A-NA EGI]R UDmi is-tar-ni=su[m-mi

This fragment, KUB XXI25, is very small and broken on both sides. The areas between brackets are sections of the text broken off or effaced by erosion of tablet surface material. Any text present between brackets has been inferred from context and from transcriber experience with usual phrasing in Hittite. In the last line, the sign EGIR, a Sumerian ideogram, was partially effaced but still recognizable to the transcriber, and so is split by a bracket.

5 Methods

For this project, I used both Naive Bayes and Maximum Entropy classifiers as implemented by the MAchine Learning for LanguagE Toolkit, MALLET (McCallum, 2002).

Two copies of the corpus were prepared. In one, anything in brackets or partially remaining after brackets was removed, leaving only characters actually preserved on the fragment. This copy is called Plain Cuneiform in the results section. The other has all bracket characters removed, leaving all actual characters and all characters suggested by the transcribers. This corpus is called Brackets Removed in the results section. By removing the brackets but leaving the suggested characters, I hoped to use the transcribers' intuitions about Hittite texts to further improve the performance of both classifiers.

The corpora were tokenized in two ways:

1. The tokens were defined only by spaces, capturing all words in the corpus.

2. The tokens were defined as a series of capital letters and punctuation marks, capturing only the Sumerian and Akkadian ideograms in the text, e.g. the very common Sumerian ideogram DINGIR.MEŠ, 'the gods'.

The training and tests were all performed using MALLET's standard algorithms, cross-validated, [...]

Table 1: Results for Plain Corpus

Tokenization      Naive Bayes   Max Ent
All Tokens        .55           .61
Ideograms Only    .44           .51

Table 2: Results for Tests on Corpus with Brackets Removed [...]

[...] perhaps unsurprisingly, that the accuracy of classification using Sumerograms and Akkadograms is heavily dependent on the structure of the fragments in question.

Maximum Entropy classification proved to be slightly better, in every instance, than Naive Bayes classification, a fact that will prove useful in future tests and applications.
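The two corpus preparations and the two tokenizations described in the Methods section can be sketched roughly as follows. This is an illustrative reconstruction, not the code used for the experiments: the function names are hypothetical, and several details are assumptions, namely that a sign cut by a bracket is dropped entirely in the Plain Cuneiform copy, that an unclosed bracket runs to the end of the line, and that periods and hyphens are the punctuation allowed inside an ideogram token.

```python
import re
from itertools import groupby

RESTORED = "\x00"       # placeholder for a character restored by the transcriber
SEPARATORS = "-= "      # sign boundaries named in the transcription conventions
IDEOGRAM_PUNCT = ".-"   # assumption: punctuation allowed inside an ideogram token


def brackets_removed(line):
    """'Brackets Removed' copy: drop the bracket characters, keep restored text."""
    return " ".join(line.replace("[", "").replace("]", "").split())


def plain_cuneiform(line):
    """'Plain Cuneiform' copy: keep only signs lying wholly outside brackets."""
    marked, inside = [], False
    for ch in line:
        if ch == "[":
            inside = True           # an unclosed "[" runs to the end of the line
        elif ch == "]":
            inside = False
        elif inside and ch not in SEPARATORS:
            marked.append(RESTORED)
        else:
            marked.append(ch)
    words = []
    for word in "".join(marked).split():
        parts = re.split(r"([-=])", word)   # sign, separator, sign, ...
        kept = ""
        for i in range(0, len(parts), 2):
            sign = parts[i]
            if not sign or RESTORED in sign:
                continue                    # restored or bracket-split sign
            # rejoin with the separator that preceded this sign
            kept += (parts[i - 1] if kept else "") + sign
        if kept:
            words.append(kept)
    return " ".join(words)


def word_tokens(text):
    """Tokenization 1: tokens are defined only by spaces."""
    return text.split()


def ideogram_tokens(text):
    """Tokenization 2: runs of capital letters and punctuation marks."""
    tokens = []
    for is_ideo, run in groupby(
        text, key=lambda ch: ch.isupper() or ch in IDEOGRAM_PUNCT
    ):
        token = "".join(run).strip(IDEOGRAM_PUNCT)
        if is_ideo and token:
            tokens.append(token)
    return tokens


line = "[A-NA EGI]R UDmi is-tar-ni=su[m-mi"
print(plain_cuneiform(line))                    # UDmi is-tar-ni
print(brackets_removed(line))                   # A-NA EGIR UDmi is-tar-ni=sum-mi
print(ideogram_tokens(brackets_removed(line)))  # ['A-NA', 'EGIR', 'UD']
```

On the last line of the sample fragment, the Plain Cuneiform copy loses the ideogram EGIR altogether, while the Brackets Removed copy keeps it. This gives a small, concrete picture of why ideogram-only tokenization can behave differently on the two copies. In flattened transcriptions like the sample, determinatives such as URU run together with the following word, so this sketch would emit a capital-letter run like `URUH` from `URUHat-ti`; a real pipeline would need to handle that separately.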
