<<

Downloaded by guest on September 26, 2021 ayoinheritage Babylonian Achaemenid from the texts Akkadian modeling ancient . by in period breaks scholars, the com- assisting automatically pleting and networks of neural inves- recurrent using We possibility language experts. these by the reconstructed Currently, of manually tigate information. Many is missing text tablets. to missing the leading damaged, clay are are tablets culture and Mesopotamian regarding information of sources main The 2020) 27, February review for Pag Emilie by Edited and Israel; 7610001, Rehovot Science, a Fetaya Ethan networks neural using recurrent texts Babylonian fragmentary of Restoration yshlr r fe ujcie utemr,teei oway no www.pnas.org/cgi/doi/10.1073/pnas.2003794117 is there proposed restorations Furthermore, the subjective. and often of texts, are knowledge of scholars and expert corpus by to lexical, requires and scientific, possible process genre literary, is this each in Still, it texts. tablets, par- of religious intact comparing lines and Akka- by many damaged and the restore on copies, Fortunately, both duplicate passages system. mastered in allel writing survive have out cuneiform texts who carried the many experts and process language of time-consuming entire dian handful a to a line is a by This in manually. sign tion single a from ranging the text 1). recover (Fig. of of fully sections patches loss results to This large a text. difficult or inscribed in it the small in renders recorded and originally clay, cracks information eroded of surfaces form or written the flaking the and excava- to in after brittle Damage tablets, stored (5). become of collections and may museum have conserved in they properly they tion elements, not once if the and deteriorate to condition, exposed fragmentary been in found quently various in (4). kept on world are the attested that Babylonia, around materials are collections other , words and monumental stone of million of on hundreds 10 inscriptions empires and tablets than clay Age inscribed more 600,000 Axial dur- all, some the prominence In Persia. its of and an retained rise used it the which but ing , system, lingua by writing a displaced alphabetic as BCE, gradually millennium served first was it the sec- During Akkadian Age, East. early Near Bronze entire the the the Late of for to franca the expansion rapidly In Amorite spread millennium. the it ond during and west Sargon, millennium and of third north a Empire the on the in attested under is BCE southern language in Akkadian scale The Semitic writ- limited the 3). documents to (2, belongs in family which Akkadian, recorded language of been of dialects various most have in across ancient East ten activity the Near human of ancient of y languages the 2,500 main Over the Akkadian. several write of is world, to phase one used earliest later including this was to languages, It analogy (1). good “spreadsheets” A modern the medium. clay Sumerian a the ini- in on and procedures cities accounting BCE daily Millennium record fourth to the used tially of end the at Mesopotamia M empire Achaemenid aut fEgneig a-lnUiest,RmtGn5902 Israel; 5290002, Ramat-Gan University, Bar-Ilan Engineering, of Faculty h urn rciei orcntuttemsiginforma- missing the reconstruct to is practice current The fre- are medium, durable rather a although tablets, Clay ytm nw.I a rbbyivne nsouthern in writing invented earliest probably the was of It one known. is systems cuneiform esopotamian a,1,2 -ern nvriyo aiona o nee,C,adacpe yEioilBadMme laM emn uy7 00(received 2020 7, July Redmond M. Elsa Member Board Editorial by accepted and CA, Angeles, Los California, of University e-Perron, ´ | | erlnetworks neural uefr script cuneiform oaa Lifshitz Yonatan , c aut fSca cecsadHmnte,DgtlHmnte re a,AilUiest,Ail470 Israel 40700, Ariel University, Ariel Lab, Ariel Humanities Digital Humanities, and Sciences Social of Faculty | aeBblna dialect Babylonian Late b ldAaron Elad , c n hiGordin Shai and , | b lh rga,Dvdo nttt fSineEuain ezanIsiueof Institute Weizmann Education, Science of Institute Davidson Program, Alpha doi:10.1073/pnas.2003794117/-/DCSupplemental at online information supporting contains article This 2 1 the under Published available data limited the is tool text-completion automated Corpora. an Akkadian Digital at Choosing available is code source Our ( community. GitHub scholarly wide a to is ulse etme ,2020. 1, September published First Editorial the by invited editor guest a is E.P. Submission. Direct PNAS Board. a is article This interest.y competing paper. no y the declare per- E.A., wrote authors Y.L., Sh.G. Sh.G. The E.F., and tools; and E.F. and reagents/analytic E.A., data; new Y.L., analyzed contributed E.F., Sh.G. and Sh.G. research; and designed E.F. research; Sh.G. formed and E.F. contributions: Author here. demonstrate Atrahasis will called tool, we online as ( an developing well, are we work texts— Furthermore, Babylonian should we Late models but administrative these texts, and economic, of syntax—such legal, structured types highly as with all genres in for avail- that restorations hypothesized texts methods plausible data-driven cuneiform such yield whether digitized uncertain would of is it number present, at limited able the (6). (RNNs) to learn- networks machine Due neural modern recurrent specifically on Late methods, relies ing in that passages texts an archival broken investigate Babylonian completing we article, automatically this the to In in approach texts. engaged damaged experts restoring human of aid task can that process automatic an examples intact on studying passages by genre. damaged one identified the of cases, conventions of such content of In the basis form. predicting the duplicate to in well- resort exist a must for not to deeds, do belong or but that contracts example, texts as is such damaged documents problem of genre-archival The case known the restoration. in each harder in even uncertainty the quantify to owo orsodnemyb drse.Eal ta.eaabua.lor [email protected] Email: addressed. be may correspondence [email protected] whom To work.y this to equally contributed Sh.G. and E.F. https://babylonian.herokuapp.com/ rr et.Teeoe hsi rtse o large-scale a for lit- step or first scientific heritage. a ancient as lost is a such damaged this of reconstruction genres, restore Therefore, to other texts. digitized trained of erary to be amount belonging can the model texts As the BCE). grows, centuries texts fourth empire to Persian eco- the (sixth from daily documents restore administrative by and training to nomic restored for algorithms texts be writ- digitized machine-learning must texts available advanced portions uses the missing paper in the This script. experts. gaps and them, cuneiform leaving on the damaged, ten in are inscribed tablets and tablets Most economic, clay of hundreds of political, constitute thousands Mesopotamia the ancient of for history social sources documentary The Significance n osbewyt mloaeteedfclisi odesign to is difficulties these ameliorate to way possible One y c,1,2 PNAS https://github.com/DHALab/Atrahasis | NSlicense.y PNAS etme 5 2020 15, September | . y o.117 vol. n hleg ndesigning in challenge One omk u okavailable work our make to ) https://www.pnas.org/lookup/suppl/ | o 37 no. ). | 22743–22751

COMPUTER SCIENCES ANTHROPOLOGY Fig. 1. Fragmentary obverse of the house sale contract YBC 7424 (Yale Oriental Series 17 3) from the third year of Babylonian king Nebuchadnezzar II, famous for burning the city of Jerusalem and exiling the Judean elite as described in the Book of Kings. Courtesy of the Yale Babylonian Collection. Image credit: Klaus Wagensonner (Yale University, New Haven, CT).

in digital form. For similar unsupervised language modeling limited amount of digital text is available to train the learning tasks in English, for example, one can collect practically endless algorithm. amounts of texts online, and the main limitation is the compu- Three temporally and geographically defined corpora in par- tational challenge of storing and processing large quantities of ticular are well represented in the digital transliterations avail- data (7). For cuneiform texts this is not the case. Automatic able today: the Old Babylonian (approximately 1900 to 1600 optical character recognition cannot be used to reliably identify BCE; see ARCHIBAB website: http://www.archibab.fr/), Neo- cuneiform signs, neither in their two-dimensional (2D) repre- Assyrian (approximately 1000 to 600 BCE; see State Archives sentations ( copies) nor in three-dimensional (3D) scans of Assyria online website: http://oracc.org/saao/), and Neo- of actual clay tablets (8–12). Various visual recognition algo- and Late Babylonian (approximately 650 BCE to 100 CE). rithms are being applied to cuneiform, but the results are yet Our choice of texts, however, was governed by a corpus- in their infancy (13–16). Therefore, one has to rely on a lim- based approach so that we might exercise greater control over ited corpus of manually transliterated texts. Although Akkadian the diversity of text genres and phrasing. Therefore, we have cuneiform texts span more than two millennia and the genres decided to gather 1,400 Late Babylonian transliterated texts from available for study are heterogeneous, for many periods, only a Achaemenid period Babylonia (539 to 331 BCE) in Hypertext

22744 | www.pnas.org/cgi/doi/10.1073/pnas.2003794117 Fetaya et al. Downloaded by guest on September 26, 2021 Downloaded by guest on September 26, 2021 eaae al. et Fetaya u ups.Framr ealdbekonit ettypes text into breakdown see S6. Dataset content, for detailed the and ideal more structure them their a makes and For which and purpose. model, read easy to our to relatively for algorithms are humans 2 that learning patterns for Fig. for many tedious usu- display (see they are structured, hypotaxis but texts complete, highly over These parataxis are example). prefer They an and the on. short, bureau- promissory so despite official ally receipts, and texts, are proceedings, contracts, tablets these legal these notes, on e.g., that we documents, well is reason data, cratic work of main to amount The small models (19–21). juridi- our economic, genres expect mil- to administrative belonging first and documents whole cal, archival the are over script, period this of Babylonia cuneiform describe continuity BCE. of the to the lennium dialect prefer of of recognition Akkadian We use in (17). the the Neo-Babylonian, CE of as in century Akka- corpus empire end first Neo-Babylonian of the the the dialect around until of Babylonian rise BCE the Late 627 from the attested termed dian, commonly is Corpus. Neo-Babylonian The website Achemenet (http://www.achemenet.com/). the from format (HTML) Language Markup 11. 7 Series 2. Fig. † * e-adLt ayoindaet osntlnusial xs 1)(e also (see (18) exist linguistically not does Methods). dialects and Babylonian Late and Neo- atre France). Nanterre, acse 7My22) ti diitrdb h itiee Arch et Histoire the by texts administered Empire. 2,709 is include Achaemenid it to the Cun grown 2020); has of May section art text 27 Babylonian and (accessed the study, texts, our culture, began we material Since history, the to dedicated ssonarayb tekadrcnl yHcl nata hr itnto between distinction sharp actual an Hackl, by recently and Streck by already shown As ntae yPer rato h Coll the of Briant Pierre by Initiated iom emo rni Joann Francis of team eiforme ´ From Left † to h ags ubro et nw rmthis from known texts of number largest The h rgnlcniomln r,tetasieain n h rnlto fteAheei eidBblna etYl Oriental Yale text Babylonian period Achaemenid the of translation the and transliteration, the art, line cuneiform original the Right, s(Unit es ` h opscoe switni what in written is chosen corpus The g eFac n20,ti est sentirely is website this 2000, in France de ege ` ∗ it eRceceASA 01 CNRS, 7041, ArScAn Recherche de Mixte e ´ aeil n Methods and Materials ooi el’Orient de eologie ´ Materials and oe.Tenga oe samdlta probability a assigns that model a is model n-gram The model. been have (22). (LSTM), gating proposed memory a short-term introduce long that as before. of such modifications steps mechanism, various many probability issue, seen this the information solve long on which To depends modeling in token vanish- in next situations to the difficulties i.e., due have dependencies, effective, RNNs produce time and simple to simple gradients log-likelihood While is ing model model. training parametric output the This the maximizing token. by next possible trained each for and bilities mappings, linear indicates where More step. next the to memory updated the formally: passes and token memory input hidden previous the Given that h memory. tokens hidden the a using of ted basis the pre- on to text sentence a full in it. a precede token modeling next of the problem dicting the reduce factorization can the model we use i.e., parametric model, a autoregressive p an find use to to is wish step we distribution the i.e., learns that sequences; discrete of such series tech- a for as to language detailed introduction view more can brief tokens We a 6). very (for ref. RNNs a see using account, give language modeling will for we niques section, this In Background Algorithmic t (x −2 sabsln oe o oprsnw rie nn-gram an trained we comparison for model baseline a As model autoregressive this RNNs, In 1 , h ewr rtudtstemmr ae ntenew the on based memory the updates first network the , . . . x t x x −1 t 1 , −1 , x n hnue h pae eoyt rdc h next the predict to memory updated the uses then and . . . T h saoehtrpeetto fteipttoken, input the of representation one-hot a is = ) t −1 , PNAS x prob Q T = n u oli ofi rbblsi model probabilistic a fit to is goal our and , t T tanh =1 | t = etme 5 2020 15, September p softmax (x (W t |x p hh 1 (x , h 1 . . . t −2 , (W . . . prob , + x ho , t −1 W x h t T t | ih htti en sthat is means this What ). −1 ) stevco fproba- of vector the is p x o.117 vol. rmsmls h first The samples. from (x t + −1 t | b x o + 1 ), , b . . . | h ), o 37 no. , x t −1 ) | sfit- is 22745 [2] [1] W

COMPUTER SCIENCES ANTHROPOLOGY Table 1. Loss and perplexity while training the model on pletions using our model and beam search and show the results Achemenet dataset in Table 3. Training loss Training perplexity Test loss Test perplexity It is clear from the results in Tables 2 and 3 that our algo- rithm can be of great help in completing a missing token, with an 2-Gram 3.26 9.55 3.54 11.60 almost 85% chance of completing the token correctly and a 94% LSTM 1.41 4.10 1.62 5.05 chance of including the correct token in the top 10 suggestions. However, as expected, the task becomes much harder and per- formance is degraded when more tokens are missing. We note to each token based on how frequently the sequence of the last that even with two or three missing tokens, however, the model n − 1 tokens in the training set ended in that token. The main is still useful as the correct completion is present in the top 1 limitation of n-gram models is that for small n, the context used (two missing) or 10 (three missing) completions almost half of for prediction is very small, while for large n, most test sequences the time. of that length are never seen in the training set. We used a 2-gram model, i.e., each word is predicted according to the frequency Designed Completion Test. We designed another experiment in with which it appeared after the previous one. order to evaluate our completion algorithm and understand its strengths and weaknesses. We generated a set of 52 multiple Results choice questions in which the model is presented with a sentence In order to generate our datasets, we collected transliterated missing one word and four possible completions, and the goal texts from the Achemenet website, based on data prepared was to select the correct one. Of the three wrong answers, the by F. Joann`es and coworkers in the framework of the first was designed to be wrong semantically, the second wrong Achemenet Program (National Center for Scientific Research syntactically, and the third both. This allowed us to track the [CNRS], Nanterre, France) (http://www.achemenet.com/fr/tree/ types of mistakes the algorithm makes. The assumption is that ?/sources-textuelles/textes-par-langues-et-ecritures/babylonien). the learning algorithm would be more likely than a human to We designed a tokenization method for Akkadian translitera- make semantic mistakes but should be better than a nonexpert tions, as detailed in Materials and Methods. We trained a LSTM in . If this is the case, then the effectiveness of our recurrent network and a n-gram baseline model on this dataset approach as a way to assist humans should rise, as the strengths (see Datasets S1–S3 for model and training details). of human and machine each other. Results for both models are in Table 1. Loss refers to mean When we used our model to rank four possible restorations for negative log-likelihood and perplexity is two to the power of the each of the missing words in the 52 random sentences, it achieved entropy (in both cases, lower is better). 88.5% accuracy in selecting the one with the highest likelihood As expected, the RNN greatly outperforms the n-gram base- (see Dataset S4 for the complete list of questions and answers). line, and despite the limitations of the dataset, it does not suffer Looking at the six failed completions—questions 18, 26, 32, 35, from severe over-fitting. 45, and 50—we see that four are semantically incorrect, one is syntactically incorrect, and one is both, which agrees with our Completing Random Missing Tokens. In order to evaluate our mod- hypothesis. els’ ability to complete missing tokens, we took random sentences from the test corpus, removed the middle token and tried to Discussion predict it using the rest of the sentence. Our model returns a Further study of the different restorations of the designed com- ranking of probable tokens and we report the mean recipro- pletion test, taking into account the full ranking of the answers, cal rank (MRR). The MRR is the average over the dataset of results in some interesting patterns. This qualitative analysis the reciprocal of the predicted rank of the correct token. It is a considers four categories for the answer ranking: 1) correct syn- very common and useful measure for information retrieval as it tax (i.e., sentence structure), 2) correct semantic identification, is highly biased toward the top ranks, which is what the user is 3) poor syntax, and 4) poor semantic identification (see mostly interested in. We also evaluate the “hit@k,” which mea- Dataset S5 for the full data analysis). sures the percentage of sentences where the correct completion The majority of the restorations, 44 cases, shows that the is in the top k suggestions. For evaluation, we used all test sen- algorithm best identifies correct sentence structure (category 1: tences 10 or more tokens in length that contain no breaks, which questions 1 to 23, 25 to 27, 29 to 35, 38 to 42, 46, and 48 to 52). Put yielded a total of 520 sentences. more accurately, this means correct syntactic sequences of parts We compared two variations of our model, one that finds of speech based on the statistical frequency of smaller syntag- the optimal completion based only on the tokens that precede matic structures. A total of 34 restorations show correct semantic the missing token, denoted “LSTM (start),” and one that takes identification of noun class as well as related verbs in the answer the full sentence into account, denoted “LSTM (full).” As the ranking (category 2: questions 1, 3, 6 to 13, 16 to 18, 22, 24 to 27, “LSTM (full)” model needs to run separately for each candidate 29, 31 to 33, 35 to 40, 42, 46 to 49, and 51). In fact, a large subset for the missing token, we first picked the top 100 candidates using of these cases, 30 questions, shares also identification of correct “LSTM (start).” We then generated 100 sentences, one for each sentence structure (i.e., both categories 1 and 2). possible completion, and reranked them based on the full sen- Good semantic identification probably derives from paradig- tence log-likelihood. If the right completion was not in the top matic relationships between certain classes of words. For exam- 100, we took the reciprocal rank to be zero. ple, five cases possibly correctly identified usage of verbal forms For comparison, we used two simple 2-gram baselines: one based on their context (e.g., in direct speech; questions 3, 6, that takes into account only the previous token, denoted “2- Gram (start),” and one that takes into account both the previous and the next token denoted “2-Gram (full).” While this is a Table 2. Completing missing fifth token in sentences relatively weak model, we found it to work surprisingly well, 5th index MRR Hit@1 Hit@5 Hit@10 although it was still significantly inferior to the LSTM model in 2-Gram (start) 0.64 52.0% 78.2% 83.6% the accuracy (Hit@1) metric. LSTM (start) 0.754 66.1% 86.9% 91.9% To further investigate our model’s ability to complete various 2-Gram (full) 0.80 74.8% 85.5% 90.6% numbers of missing tokens in various locations, we removed up LSTM (full) 0.89 85.4% 93.2% 94.6% to three tokens in random locations. We ranked possible com-

22746 | www.pnas.org/cgi/doi/10.1073/pnas.2003794117 Fetaya et al. Downloaded by guest on September 26, 2021 Downloaded by guest on September 26, 2021 ˇ ae nfeunyi pcfi otxs oevr ueysta- purely a Moreover, contexts. prepo- specific related in the frequency between on 42, based 7, 34, question 7, in (questions sitions choice pronouns The or 51). use, identification and correct particle so-called show prepositions, (the cases clause of Four legal 51). tem- formulaic (question question 46), contextual elat-clause; a and 41), even and 39, and and 36, 38, 31), 29, 12, 11, (questions 17, (questions designations 9, professions poral 1, of (questions names correct nouns 33), possible show countable questions of 14 class; identifications noun to regard with ent verb the of prop- (taaddinu person differentiate not grammatical does it the where made erly 32, mistakes question the in of model one the in by clearly more to emerges similarity inference its tical of recognition of model’s frequency the and statistical success the model’s the reflects however; probably form, of root verbal forms a of different understanding also two but verb between structure relationship the sentence the of of identification recognition correct a shows LOCATION s eaae al. et Fetaya square. red a by marked obverse of 3. Fig. a Hit@10 NAME 3: follows: as question answers example, for Take, Hit@5 26). and 25, 18, Hit@1 MRR Two One tokens tokens of Missing number various Completing 3. Table Three a ´ h ee ftemdlssmni nweg eoe appar- becomes knowledge semantic model’s the of level The NAME ieatadtasieaino camndpro ayointx aeOina eis75 rmteEnaaciei rk rgetr pe half upper Fragmentary . in archive Eanna the from 51 7 Series Oriental Yale text Babylonian period Achaemenid of transliteration and art Line ina qab vs. and ana inamdin). t pa. tde o eesrl eetan reflect necessarily not does It speak.” “to u, ˆ ae tcerta hs hie r again are choices these that clear it makes ana, AEl NAME h oe akdtefu possible four the ranked model The umma. iqbi; .54.%6.%70.0% 94.4% 62.2% 92.5% 47.8% 81.5% 0.55 0.86 .02.%3.%42.6% 37.2% 24.3% 0.30 liqbu u ´ q ´ ıipi a;b bar; u; ´ ´ ebabbarra ´ n h xml o only not example The an. nad u t ie pay” give, “to , ¯ iqbi liqbu AEl NAME nti context this in hsstatis- This u. ´ sanga u ´ ˇ s u ´ in ae ntesaitclfeunyo otxulsemantic contextual identifica- of false frequency of statistical number the the on reduce based greatly tions to morphol- order and In rules ogy). grammatical inference underlying statistical finding context-based than of (rather semantic basis it making the in However, on expected as structures. identifications we than experiment—is, sentence better this surprisingly out also by teasing was judged in be good can expected, as far model—as Our Conclusion 45). question example, (for “educated” mistake an makes a algorithm in verb. the resulting a so similar guess or and few on, very noun train are a to there either sentences when sentence, cases are the problematic in Especially by close and answer word the between another similarity can a confusion detects Further model 40). the “descen- when question occur or in “son” answers top meaning the have both are dumu, stud- dant,” and different a the when (e.g., identify in usage does similar interchangeability it since such corpus, Akk. of ied = examples da enough Sum. lacks Akk. e.g., = words, im.dub same Sum. or the of phonetic preferring restoration—by over correct igi the contex- of with interfere identification can for tual it method example, reliable For consistently test, completions. a choosing completion not designed is the inference in statistical that restorations up other comparing to ended by group clear, restoration is this it one However, 45). only (question erroneous 37)—with being (question sur- LOCATION achieves over after inference kurkur statistical preferring Such results—e.g., 47). good and 28, prisingly 45, 24, to (questions 43 syntax in 37, the poor to factor in 3: 36 identified decisive category best fitting a are those be which as restoration, to analysis of seems cases speech eight least of at parts of grasp tistical h oe osntse oietf lent oorpi and logographic alternate identify to seem not does model The ina ˇ su II PNAS eoeNM qeto 35). (question NAME before | etme 5 2020 15, September tuppi qetos1 n 9.I obviously It 19). and 14 (questions | o.117 vol. | o 37 no. | 22747 itti, ina

COMPUTER SCIENCES ANTHROPOLOGY relationships, much more training material will be needed. are now widely used in NLP tools, including those designed for Nevertheless, we have demonstrated that even without access to modern Indian languages (40). A bigram model was successfully large amounts of data, we can successfully train LSTM models applied to a hidden Markov model to restore missing or damaged and use them to complete missing words. In our completion test, sign sequences in the ancient undeciphered Indus script (41, 42). we show good results that, while not sufficient for fully automatic However, neural network LMs can perform better in developing completion, prove that the model can be an invaluable tool in meaningful patterns of representations of words and the contexts helping scholars with text restoration. around them. When these “embeddings” are learned from unsu- Our results with the Late Babylonian corpus are significant pervised large corpora, they can be transferred to various tasks, because most entry-level scholars or other interested historians retaining a boost in performance (43). and social scientists who focus on the large first-millennium BCE Specifically, RNN language models, like the one employed in Babylonian archives cannot acquire the very specific knowledge this study, have shown success in encoding both semantic and and expertise to understand underlying political, social, or his- orthographic data in languages of varying levels of morphological torical structures without reading through hundreds of texts. For complexity (44). The best results in solving the restoration task, this reason, we are in the process of incorporating our model so far, have been achieved in studies of machine-reading com- into an online tool, called Atrahasis. It will be of immense help prehension, specifically of the cloze-style, in which both the level to scholars in the historical sciences, allowing them to overcome of character and word are identified (45). In the field of ancient the high entry barrier to restoring fragmentary Akkadian texts. languages, we know of only one other study that used an algo- Initially, the model will achieve the most success with struc- rithm with neural network architecture to recover missing letters, tured archival documents, but as the dataset grows, one can train in the context of epigraphic inscriptions in (46). the model on more genres, such as scientific or literary texts. Their model, named PYTHIA, uses a sequence-to-sequence NN Both access to the primary sources in their original state and with LSTM and was trained on the Packard Humanities Insti- the ability to restore broken passages are equally necessary for tute (PHI) database, the largest digital dataset of ancient Greek understanding Akkadian corpora on a macroscale. inscriptions. It gives best predictions with 30.1% character error Related Work rate, compared with the 57.3% error rate of human epigraphists (based on testing the performance of two doctoral students on The task of text modeling and its applications in tasks such as the training material over 2 h). text restoration, lies in the intersection of Natural Language Pro- Lastly, joining fragments of texts is one of the major chal- cessing (NLP) and Computer . The Computational lenges of restoring cuneiform manuscripts as close as possible to Linguistics models are predominantly rule based, while cur- their original state. Matching fragments are usually only iden- rent NLP models are predominantly statistical and use machine tified by a handful of experts, and the fragments are often so learning. Currently, machine-learning methods achieve state of small as to retain only a few signs. An initial study on the Hit- the art performance on most NLP challenges. In most modern tite corpus employed matching classifiers, achieving best results languages, basic text modeling tools can identify spelling errors with Maximum Entropy Classifiers (47). The Electronic Babylo- such as missing, added, transposed, or wrong letters (23–26). nian Literature project aims to reconstruct the tens of thousands These tasks can become more challenging when the language in of fragments that make up the remnants of ancient Babylonian question has a richer , like Arabic, for example (27), and Assyrian literature. A digital corpus of largely inaccessi- or limited digital corpora (28), as in the case of ancient Near ble tablet fragments from museum collections (15,510 fragments Eastern languages (29, 30). as of June 2020) allows users to query these fragments with One approach is to first parse the original text and then use this sequence-alignment algorithms based on the word method Basic as an input for further tasks. To this end, many studies use rule- Local Alignment Search Tool algorithm. Initial results already based models derived from grammar (31), finite-state machines (32), or lookup in a machine readable dictionary (33) or employ show that one can identify new pieces of text as well as many statistical models such as clustering algorithms (k-nearest neigh- possible text joins. Advances in cost-effective high quality 3D bors or kMeans) (34). Most work on rule-based models has scanning allow exact measurements of inscribed objects that can been done in Akkadian (see literature cited in ref. 35). The lead to the joining of broken tablet fragments with a match- most recent study dedicated to Akkadian word segmentation ing algorithm in 3D space, as done for example by the Virtual used a combination of rule-based, dictionary-based, and statisti- Cuneiform Tablet Reconstruction Project (48, 49). cal algorithms, with best results in dictionary-based models (60 to 80%) (35). Because its algorithms were fitted for East Asian lan- Materials and Methods guages like Chinese and Japanese, we concluded that a specific Neo-Babylonian Archives. Babylonian archives from the end of the sixth to the NLP model for Akkadian should be designed. Sumerian, with fourth century BCE are one of the main sources for reconstructing the official its simpler syntax, is in the center of the Machine Translation and ephemeral heritage of the and its peoples and Automated Analysis of Cuneiform Languages project, which in Mesopotamia. These “archives” were not recovered in situ but are artifi- employed dictionary- and rule-based models for annotation of cial constructs imposed by modern scholars. Most Neo-Babylonian texts come from uncontrolled or poorly documented excavations, and the majority are Sumerian (36, 37). A similar study on Hittite designed rule- kept in large museum collections (see below). One cannot rely on physical based models derived from grammar (38). Statistical/machine- proximity between texts in a given find context to define an archive, since learning models achieve better results overall, if they are tailored such a context is frequently unavailable or was disturbed in antiquity. to the problem at hand. Our project, the Babylonian Engine, The organization of Neo-Babylonian archives by modern scholars is based recently achieved state-of-the-art results for Akkadian prediction mostly on an artificial division between private and institutional ownership of sign and word transliteration and segmentation using NLP (21). Further criteria employed to define an archive include prosopogra- and machine-learning models on Unicode cuneiform: up to 97% phy (i.e., grouping tablets that feature a common core of principal actors using a BiLSTM Neural Network algorithm, see the web-tool engaging in connected activities), document type and content or a common Akkademia (https://babylonian.herokuapp.com/). A similar task setting in a social or political institution, such as a business firm, temple, or palace. Several studies try to mitigate the lack of archaeological context of automatic phonological transcription of Akkadian (usually by employing museum-based archaeology to trace the acquisition history termed normalization) based on ORACC material has recently of related texts within a single collection or across different museums (50). achieved promising results (39). The end result, nevertheless, is that for the most part, groupings of tablets For the specific task of text restoration and prediction, statisti- according to any of the aforementioned criteria are artificial constructs, cal n-gram Language Models (LMs, e.g., bigrams, trigrams, etc.) with few exceptions.

22748 | www.pnas.org/cgi/doi/10.1073/pnas.2003794117 Fetaya et al. Downloaded by guest on September 26, 2021 Downloaded by guest on September 26, 2021 eaae al. et Fetaya Egibi room the or of archives building business a the to antiquity: N back in and traced deposited be were they each where can Babylonia Achaemenid in rs,tbesfo fca xaain nBblnadUu,frexample, for Uruk, con- and In ones. in broken excavations curators over official as tablets from selection, complete tablets active of trast, nearly excava- process clandestine or a or complete through illicit preferred acquired from came were late Many and the century. States tions during 20th United exploration of early the period and and initial 19th Europe the in in discovery collections their museum tablets following into archives best-preserved The way some archives). their (e.g., “dead” found or antiquity discarded in as on practiced regarded or were times processes recent selection in preservation archival and the excavation of method their on either (At Uruk and large every (B almost Sippar from of Babylonia groups on fraction in archival texts negligible city of Babylonian a representative period form Achaemenid are the they Achemenet Altogether, early case, dataset. the any to total content in our and type texts; in similar period are Achaemenid they because pri- dataset period our and Neo-Babylonian Neo-Babylonian in the institutional of texts the some larger included in nevertheless the We both especially archives.) vate attested period, are Achaemenid many here [early] fact, (In and mentioned BCE). the 539 archives to to the (556 date these of king, of Babylonian last Most lies the framework. of technically chronological reign which chosen Empire, our Neo-Babylonian of the outside of period the to belongs of Ebabbar bulk the the Egibi/N up and private Mura make Uruk the archives alongside from dataset, asso- institutional Achemenet archive archives two the multifile These Eanna Sippar. two the from the temples: the archive are city far illuminate by with and period largest ciated y the this However, 200 from (53). exile groups the nearly Babylonian textual in of in community “archive,” Judean period communities the Yahudu of a the minority fate during as foreign Empire known on centers Achaemenid information rural significant several in provide written texts (designated of archive 50). Kasr the ref. as B N6; known governor Kasr Babylon, Persian of the complex of palace archive the contemporary from closely the as well as § ‡ ipr r o nlddi hslist. this Mura in like not included above, are not mentioned archives are already Ur Nippur, Archives the collection. website, that Achemenet in the represented on yet description the in mentioned being oedtie icsino ahtx ru sfudi e.20. ref. in found is group text each of span. time discussion their detailed and more Babylonia A period Achaemenid from archives cuneiform of overview eintoso rhvsaelse nprnhssfloigec iynm.Despite name. city each following parentheses in listed are archives of Designations arhs nfc,amxdpiaeadisiuinlbcgon.Serf 1fran for 51 ref. See background. institutional and private mixed a fact, in has, Kasr otntl,tetrelrettx rusrcgie spiaearchives private as recognized groups text largest three the Fortunately, h edfrtx etrtosvre rmaciet rhv,depending archive, to archive from varies restorations text for need The Egibi/N the of Part umaterial. su ˇ ur-S ¯ el-r ¯ nfmle rmBblnado h Mura the of and Babylon from families ˆ ın mni Ea-epp emanni, ¯ u). ˆ § ‡ ayo (Ea-epp Babylon : h Mura The ettype Text rmsoynt o sesdfil et57 court/deposition in rent Statement field assessed for note Promissory Purchase Correspondence property immovable of Lease property movable of Lease land/orchard arable of Lease partnership Business transfer of Inventory/list/record archival into algorithm (see our categories train 17 to top used types; dataset text Achemenet administrative of and Breakdown 4. Table rgetr:lglcontract legal Fragmentary: note Promissory order Letter Summon/oath/injunction Receipt araeareet n or et 7 texts dowry and agreements Marriage aacdaccounts Balanced okcontract Work ur-S ¯ naciei h ceee est 8 texts) (80 website Achemenet the in archive ˆ ın e ¯ utxsepcal,aogwt nte cluster another with along especially, texts su ˇ s-il ˇ ,I A, ¯ ı e ¯ s ˇ s-il ˇ sar-tar ˇ ,Ghl Napp Gahal, ¯ ı, b,-r ¯ ıbi, u“r”fo Nippur, from “firm” su ˇ ur-S ¯ h) Ki ahu), ¯ mni R emanni, nacieadthe and archive ˆ ın ¯ (Epp s ˇ uniyAkda ewr()Reference keyword(s) Akkadian Quantity e’i-sis ¯ el ufrom su ˇ ¯ e ¯ 266 464 236 sunu s-il ˇ ˇ 45 75 14 19 16 19 33 35 40 18 7 7 e), ˆ ¯ ı), rhvladlglfrua,ms ettpsdsrbdi h al are table reference the primary in and described keywords types Akkadian relevant materials. text by Neverthe- most Neo-Babylonian accompanied therein). formulae, of also consistency references legal and extensive structure and witnesses, the the archival parties, exemplify and to of order 20 in study ref. less, prosopographical per- (see issuing the scribes the and and keywords, and institution, or document each clauses usu- or of legal are son This section specific which operative training. consideration typologies, main our into text the of take individual of results analysis on the on elaborate effected based to ally this place if con- the seen and other not be form is and to in documents, remains standardized It transfer less tent. considerably lists, the is hand, corpus. inventory other material of the the administrative On number of formulae. high documents, structured elements archival highly relatively legal thematic contain inclu- different which different of rather of percentage but the most higher granular, a reflect is be there to to Overall, order meant the in not in is sive, recorded genres juridical, content economic, administrative their of and subcategories of into summaries division The on database. Achemenet based admin- and types, archival respective text their to istrative according algorithm our train a to used have 61). not (19, do lists but witness no dated practically usually abbreviated and are signatures using They scribal parties appear keywords. and hand, specific objects other and involved the formulae detailing regnal on and form day, texts, list month, Administrative scribe in in king. mostly the given reigning scribal date of the precise a lineage and of 3) and issue year and name of 60); place the the and only also 58 (seldom not but refs. (58) includes tablet; witnesses latter or the of The object(s) on one list signature. seals a the of 2) their on up protagonists; by statement made relevant accompanied a section contract, the with by operative a Neo- question beginning an or in Each usually 1) note clauses, content. parts: formal promissory main and more three a form least as their at on such has based document, types archival text Babylonian into divided Content. are and S1). Types Fig. Text restoration Appendix, Neo-Babylonian proposed (SI a study with scholarly along and 3, parallels The known Fig. on in 57). based seen is (56, to be dating that can tablets, administration Cyrus, Eanna the temple of of reign one the of the after half by upper antiquity fragmentary needed the in of longer already obverse smashed no or were discarded effects they deliberately the were from I suffered Darius or excavation war. following of handling poor like by archives aged large Some texts. fragmentary Mura of percentage higher a contain ¶ a.Mn fte a led ufrdacetfiedmg uigo fe the after or during damage World First fire the ancient by suffered triggered already events (55). had of period them Achaemenid sequence of grim a Many War. survived partially texts Kasr h Mura The n nukuribb ana al hw h ueia radw fteNoBblna texts Neo-Babylonian the of breakdown numerical the shows 4 Table aae S6 Dataset uo ar(hc a led irfidfo nacetfie eedam- were fire) ancient an from vitrified already was (which Kasr or su ˇ mah ¶ ag ubro an alt rdcdbfr h eg of reign the before produced tablets Eanna of number large A utxswr aae uigtertasototo ipr(4,adthe and (54), Nippur of out transport their during damaged were texts su ˇ n id ana ¯ uinb ιru iksuep nikkassu e.g., n id ana et n muhhi ina . harr uti, ir ¯ imittu PNAS o ullist) full for dullu ¯ ι, e / dab ¯ maddatti mahir ¯ ι, anu (immovable) err ¯ zitti abu ¯ e | ¯ e s ¯ ˇ and uti, su ˇ etme 5 2020 15, September ¯ s uti ¯ h e-ayoinacia texts archival Neo-Babylonian The | 19 19 67 66 63 62 65 64 19 69 68 20 o.117 vol. | o 37 no. | 22749

COMPUTER SCIENCES ANTHROPOLOGY Data Scraping and Transcriptions of the Neo-Babylonian Dialect. We scraped 1,400 Late Babylonian transliterated texts from Achaemenid period Babylonia (539 to 331 BCE) in HTML format from the Achemenet website (http://www.achemenet.com/fr/tree/?/sources-textuelles/textes-par- langues-et-ecritures/babylonien). As the Achemenet website does not have an Application Programming Interface, we received written permission from the project head F. Joannes` to scrape the data. We built a scraping script in Python 2.7 to scrape the texts, preprocess and tokenize them. The script uses the “Beautiful Soup” library to remove all of the unnecessary HTML tags and take only the transliterated text itself from the site. When deciding how to present an Akkadian text to the algorithm, we Fig. 4. Mechanical bound transcription of Babylonian text Yale Oriental had to make a choice between (unbiased) transliterated texts and normal- Series 7 11. ized text. A rudimentary distinction may be drawn between the two: a transliterated Akkadian text is a sign-by-sign transcription of the cuneiform text in which the signs that make up each word are separated by hyphens. sign, but we kept the HTML start italics and stop italics <\i> symbols To mark the necessary contrast between phonetic and logographic writings, so that the use of the word as a syllable or can be inferred from they are represented in italic type and roman type, respectively (Fig. 2, mid- the context. Although using two tokens to represent the two values of a dle column). A normalized text eliminates the hyphens between signs and sign has some advantages, we found that doing so adds a large amount of attempts to represent noun and verb morphology correctly; it presents a noise to the preprocessing step and decided to use this method instead. phonetic approximation of how each word was pronounced in Akkadian. Furthermore, cuneiform uses , which signify (among There are some rules that govern the normalization of Neo-Babylonian. others) proper names, such as masculine names, god names, and female However, in general, it is avoided in most recent publications unless useful names. They are written in the transliterations in superscript as “I” or “Id,” for linguistic or pedagogic purposes (20). Neo-Babylonian is the longest con- “d,” and “f,” respectively. Since, for our purpose, proper names are of secutive language phase of Akkadian, covering the entire first millennium importance only in their syntactical function, we replace the names with BCE and ending sometime after the first century CE. The genres and writ- a tag, written in capitals, that identifies the particular type of proper name: ing conventions of this phase are characterized by their departure from the such as “NAME,” “GODNAME,” or “FEMALENAME” token. Locations, iden- standardized orthography practiced throughout the second millennium BCE. tified by the superscript “uru” before a toponym, or by the Many spellings are inconsistent with the actual phonemic renderings of words superscript “ki” after a toponym, were replaced by a “LOCATION” token. # and can vary to a considerable extent, especially on account of the intensive Month names, with the superscript determinative “iti” before the noun, language contact and interference between Akkadian and Aramaic (70, 71). were replaced by MONTH, and simple numbers were replaced by “NUM.” For this reason, we have chosen not to train the algorithm in any kind In order to simplify the tokenization of damaged parts of the text, each of normalization practices for the time being. In our training corpus, we damaged part was replaced with the token “,” and words that remained on the level of (unbiased) transliteration, by creating a mechan- appeared only two times or less as “,” since we do not have enough ical (unnormalized) bound transcription: Akkadian phonetic spellings and information to determine their meaning. logographic writings are taken at face value, by simply removing connecting The number of unique words in the vocabulary that was subsequently hyphens between syllables and between logograms. compiled is 1,549, and the total number of words is 220,926. The num- ber of words that appear only once is 3,175, and 932 words appear twice. Tokenization. Tokenization is an automatic process in which the text is split For comparison, the Penn treebank dataset, a standard and relatively small into words and each one is replaced by a numeric token. This is an important English dataset comprising texts from the Wall Street Journal, contains process that requires language-specific knowledge to prevent the loss of a 10,000 unique words and a total number of 1,036,580 words. While over- great deal of semantic content. A classic example in English is tokenizing a fitting is something to be aware of, given the scale of the data, the unique word like “aren’t.” If we do not break it into two tokens, then it is consid- nature of these texts comprised of well-structured bureaucratic information ered a word on its own and loses the connection to “are” and “not.” While it makes them well suited for machine-learning modeling. The resulting Akka- might be possible for the learning algorithm to learn that “aren’t” is equiv- dian texts used to train the algorithm look like the example in Fig. 4, which alent to “are not,” bad tokenization can complicate matters considerably by shows the same text as in Fig. 2 but in our mechanical bound transcription. creating a large number of unnecessary words in our dictionary. Open source Akkadian tokenizers use the ASCII Transliteration Format Data Availability. All study data are included in the article, SI Appendix, and (ATF) format that our dataset does not support. Therefore, we created an Datasets S1–S6. Atrahasis can be accessed on the Babylonian Engine Website alternative Akkadian tokenizer. We took into consideration some of the (https://babylonian.herokuapp.com/). Our source code is available at GitHub aspects of the current form of the transliterations. We retained the distinc- (https://github.com/DHALab/Atrahasis). tion between phonetic and logographic readings (i.e., italic type or roman type): during tokenization, we used the same token for both values of the ACKNOWLEDGMENTS. This research was supported by the Ministry of Sci- ence & Technology, Israel, Grant 89540 for the project “Human-Computer Collaboration for Studying Life and Environment in Babylonian Exile” of Sh.G. and Amos Azaria, as part of Sh.G.’s Babylonian Engine initiative. We thank Eugene McGarry for language editing, Avital Romach for her assis- #Take, for example, the form of a very common word in the Nippur Achaemenid-period tance with the final proofs of this paper, Klaus Wagensonner for tablet Murasuˇ archive hatru. As shown by Stolper (54), the different spellings of this term leave photographs, and Moshe Shtekel for designing the web-tool for Atraha- the quality of the middle, dental consonant uncertain: (lu)´ ha-ad/t/ t.-ru/ri, its variants sis. We especially thank the anonymous reviewers for their detailed remarks range from (lu)´ ha-d/ t.a-ri, (lu)´ ha-dar/tar/´ t.ar´ , and (lu)´ ha-d/ t.a-ad/t/ t.-ri. and corrections.

1. N. Veldhuis, “Cuneiform: Changes and developments” in The Shape of Script. How 7. A. Radford et al., Language models are unsupervised multitask learners. OpenAI Blog and Why Writing Systems Changes, S. D. Houston, Eds. (School for Advanced Research 1, 9 (2019). Press, 2012), pp. 3–24. 8. J. Cohen et al., “iClay: Digitizing cuneiform” in VAST 2004: The 5th International 2. J. Huehnergard, C. Woods, “Akkadian and eblaite” in The Cambridge Encyclopedia Symposium on Virtual Reality, Archaeology and Cultural Heritage, Y. Chrysanthou, of the World’s Ancient Languages, R. D. Woodard, Eds. (Cambridge University Press, K. Cain, N. Silberman, F. Niccolucci, Eds. (The Eurographics Association, 2004), pp. 2004), pp. 218–280. 135–143. 3. M. P. Streck, “Akkadian in general” in The Semitic Languages: An International Hand- 9. H. Mara, S. Kromker,¨ S. Jakob, B. Breuckmann, “Gigamesh and – 3D mul- book, S. Weninger, G. Kahn, M. P. Streck, J. C. E. Watson, Eds. (De Gruyter, 2011), pp tiscale integral invariant cuneiform character extraction” in VAST: International Sym- 330–339. posium on Virtual Reality, Archaeology and Intelligent Cultural Heritage, A. Artusi, 4. M. P. Streck, “Großes altorientalistik. Der umfang des keilschriftlichen textkorpus” in M. Joly, G. Lucet, D. Pitzalis, A. Ribes, Eds. (The Eurographics Assocation, 2010), pp. Mitteilungen der Deutschen Orient-Gesellschaft (Deutsche Orient Gesellschaft, 2010), 131–138. vol. 142, pp. 35–58. 10. G. Earl et al., “Reflectance transformation imaging systems for ancient documentary 5. C. Gutschow,¨ Methoden zur Restaurierung von ungebrannten und gebrannten artefacts” in Electronic Visualisation and the Arts (EVA 2011), S. Dunnand, J. P. Bowen, Keilschrifttafeln (PeWe-Verlag, 2012). K. C. Ng, Eds. (BCS: The Chartered Institute for IT, 2011), pp. 147–154. 6. I. Goodfellow, Y. Bengio, A. Courville, Deep Learning (MIT Press, 11. H. Hameeuw, G. Willems, New visualization techniques for cuneiform texts and 2016). sealings. Akkadica 132, 163–178 (2011).

22750 | www.pnas.org/cgi/doi/10.1073/pnas.2003794117 Fetaya et al. Downloaded by guest on September 26, 2021 Downloaded by guest on September 26, 2021 0 .Jursa, M. 20. usage, Structure, archives: institutional Neo-Babylonian in “Accounting Jursa, Ein M. v.Chr. 19. Jahrtausends ersten des Babylonien im Sprachsituation “Zur Hackl, J. 18. 3 .A kn .D kn ebrk noe oreNPfaeokfrTurkic survey. for A framework NLP: NLP in techniques source checking open Spell Mathur, an P. Zemberek, Gupta, N. Akın, 24. D. M. Akın, A. A. memory. short-term 23. Long Schmidhuber, J. Hochreiter, S. 22. Jursa, M. 21. in assyrian” and “Babylonian Streck, P. M. machine 17. and processing Image – annotations “Automatable Mara, H. Bogacz, B. 16. from cuneiform of transliteration “Automating Mara, H. Klingmann, M. Bogacz, B. in script” 15. cuneiform vectorized of retrieval “Character Mara, H. Gertz, N. Bogacz, B. 14. 5 .Ku,Rve nerrdtcinaderrcreto ehiusi NLP. in techniques correction error and detection error on Review Kaur, B. 25. M W. Gerfrid. G. Weichert, F. Fisseler, 3D D. using 13. artefact historical Malaysian of preservation “Digital Asyraf, M. Pauzi, M. 12. 0 .B uox .R asl,A iLudovico, Di A. Gansell, R. A. Juloux, B. V. 30. Tasks, Nirenburg, languages: S. minority 29. for NLP “Example-based Luca, De W. E. Streiter, survey. O. A NLP: 28. in techniques errors detection Spelling Altarawneh, R. methods correction 27. and detection error real-word of “Review Singh, S. Singh, S. 26. 9 .Shl,M ifebr,A rp,K Lind K. Arppe, A. Silfverberg, M. Sahala, A. 39. As¸uro T. Yesiltepe, Aktas¸, B. Z. A. 38. Chiarcos C. 37. Pag E. Khait, I. Sukhareva, M. 36. in cuneiform” k-means Akkadian for using segmentation “Word values Chiarcos, attribute C. Homburg, missing T. Predicting 35. Thanushkodi, G. K. Suguna, of N. implementation and 34. Design Tsalidis, C. Sgarbas, K. Panagiotakopoulos, C. Gakis, P. 33. Lind K. Arppe, A. Silfverberg, M. Sahala, A. 32. Singh P. S. 31. eaae al. et Fetaya n Archives and East 145–198. Near pp. Ancient 2004), Press, the (CDL in Accounting of Development in 2018), implications” and (Zaphon, Eds. Schretter, M. Lang, M. Fink, S. 209–238. pp. Esperanto, zum bis Orient j Alten des Sprachgeschichte zur Beitrag 359–395. pp. Handbook opt c.Sfwr Eng. Software Sci. Comput. languages. (1997). 1780 2010). (-Verlag, Busch, H. 4, 137–149. pp. Age 2017), Digital Demand, the on (Books in Eds. Palaeography Sahle, P. and Fischer, Codicology F. – in Gigamesh” 4 with Zeitalter 2D Digitalen and 3D in script for learning (ICDAR) 615–620. in pp. Recognition data” and Analysis sparse Document with lines parallel (ICDAR) 326–330. Recognition pp. and 2015), Analysis Society, Computer Document (IEEE on Conference International 13th 2015 165–172. pp. 2014), Association, Eurographics (The Eds. cultural Santos, of P. analysis to applied in graphics computer heritage” 3D of methods with research logical e.Cmu.Si otaeEng. Software Sci. Comput. Res. University, Multimedia dissertation, PhD (2017). Malaysia mask,” Selangor, Meri Cyberjaya, 63100 Mah Multimedia, of Persiaran study case A scanner: ER] 00,p.3528–3534. pp. 2020), 2020) [ELRA], (LREC Conference in Evaluation text” and cuneiform Akkadian of tion and Texts, Objects, Data, Archaeological on Archiving Studies Digital Case Regions: Neighboring and 2009). TALN 233–242. Conference pp. 10e 2003), Langues”, France, (Batz-sur-Mer, Petites des et Minoritaires Langues in tools” and resources Appl. (ICECA) 1076–1081. Technology Aerospace and munication in documents” text in n nweg-ae ytmapiainexamples. application system knowledge-based and syntax. and morphology Sumerian 10–16. pp. 2017), Linguistics, Anthology Computational for Linguistics (Association Computational Humanities Eds. for Sciences, Social Association Heritage, Literature, Cultural and for Linguistics Computational on in shop language” Sumerian the of analysis mated Evaluation 4067–4074. and pp. Calzolari 2016), Resources (ELRA), Language N. on 2016), Conference International (LREC Tenth the of ings clustering. Greek. modern for lexicon electronic an 3886–3894. pp. 2020) 2020), (LREC [ELRA], Evaluation Association and Resources Language in on Babylonian” ancient of model computational Techniques in ing” – (2017). 1–5 172, 06ItrainlCneec nEetia,Eetois n Optimization and Electronics, Electrical, on Conference International 2016 e-ayoinLgladAmnsrtv ouet.Tplg,Contents Typology, Documents. Administrative and Legal Neo-Babylonian set fteEooi itr fBblnai h is ilnimBCE Millennium First the in Babylonia of History Economic the of Aspects .Wnne,G an .P tek .C .Wto,Es D rye,2011), Gruyter, (De Eds. Watson, E. C. J. Streck, P. M. Kahn, G. Weninger, S. , .Cmu.Sci. Comput. J. Feunybsdselcekn n uebsdgamrcheck- grammar based rule and checking spell based “Frequency al., et Structure IEO,21) p 4435–4439. pp. 2016), (ICEEOT, noaigalwrsuc agaewt LDtechnology: LLOD with language low-resource a Annotating al., et Uai-elg 2005). (Ugarit-Verlag, .Klein, R. Heritage, Cultural and Graphics on Workshop Eurographics agaeEgneigfrLse-tde Languages Lesser-Studied for Engineering Language Bil 2018). (Brill, raigEooi re:Rcr-epn,Sadriain and Standardization, Record-keeping, Order: Economic Creating – (2007). 1–5 10, 08Scn nentoa ofrneo lcrnc,Com- Electronics, on Conference International Second 2018 rceig fteWrso Tatmn uoaiu Des Automatique “Traitement Workshop the of Proceedings 1–2 (2011). 216–224 7, tal. et 1–2 (2012). 217–221 2, -ern .Ciro,“ahn rnlto n auto- and translation “Machine Chiarcos, C. e-Perron, ´ l,Cmueie itt uefr inrecognition sign cuneiform Hittite Computerized glu, 5–5 (2014). 851–853 4, ˘ d.(uoenLnug eore Association Resources Language (European Eds. , Information nee kaice”in Akkadischen” ungeren ¨ 071t ARItrainlCneec on Conference International IAPR 14th 2017 rceig fTe1t agaeResources Language 12th The of Proceedings i.Ln.Comput. Ling. Lit. Erpa agaeRsucsAssociation Resources Language (European le,M amrsn,“xedn philo- “Extending Cammarosano, M. uller, ¨ n BbFT oad nt-tt based finite-state a Towards “BabyFST: en, ´ h eii agae:A International An Languages: Semitic The yeRsac nteAcetNa East Near Ancient the on CyberResearch n Atmtdpoooia transcrip- phonological “Automated en, ´ rceig fteJitSGU Work- SIGHUM Joint the of Proceedings IE optrScey 07,vl 1, vol. 2017), Society, Computer (IEEE 9 (2018). 290 9, rceig fte1t Conference 12th the of Proceedings IE optrScey 08,pp. 2018), Society, Computer (IEEE oiooi n Pal und Kodikologie u.Si J. Sci. Eur. Erpa agaeResources Language (European .Hdo,C ush Eds. Wunsch, C. Hudson, M. , 5–6 (2012). 155–169 27, erlComput. Neural erpahgetvom Mehrsprachigkeit 25 (2019). 32–54 15, n.J d.Res. Adv. J. Int. .Srie,Ed. Streiter, O. , .Alex B. , n.J Comput. J. Int. orpi im aographie ¨ IsPress, (Ios n.J Adv. J. Int. Proceed- 1735– 9, tal., et 0 .P tek rhgahe .Akdshi I n .Jt. I. und II. im Akkadisch B. Orthographie. Streck, P. M. 70. Roth, T. Schmidl, M. M. Jursa, 69. M. Hackl, J. 68. early and Neo-Babylonian the in forfeiture and pledge interest, “Debt, Wunsch, C. 67. Zawadzki, S. 66. MacGinnis, J. Per- 65. and Neo-Babylonian in oath assertory The Wunsch, C. Magdalene, R. F. Wells, B. 64. Holtz, S. 63. Stolper, W. in M. action 54. political Wunsch, C. and Pearce, E. Archives L. resistance: 53. of network “The Waerzeggers, C. 52. Peders O. 51. models” language neural “Character-aware Rush, M. A. Sontag, D. Jernite, Y. Kim, Y. 44. character-aware unsupervised “An Maggini, M. Melacci, S. Zugarini, A. Marra, G. 43. Yadav N. 42. Rao N. P. R. 41. approach independent language A “N-gram: Chaudhuri, B. B. Mitra, M. Majumder, P. 40. 1 .P tek Invtosi h e-ayoinlxcn in lexicon” neo-Babylonian the in “Innovations Streck, P. M. 71. Lanz, H. 62. Sp “Uruk: Gehlken, E. 61. Selection Ehrenberg, Seal E. of Study 60. A : Nippur, B.C. Century Fifth in Use “Seal Bregstein, B. L. 59. in documents” Neo-Babylonian in witnesses the “Introducing Dassow, von E. 58. Uruk, K. K. 57. Jursa, M. in Frahm, E. Berlin” in 56. not Excavated–but texts: “Kasr Stolper, W. M. 55. Jursa, M. 50. acqui- model of aspects “Computational Ch’ng, E. Gehlken, E. Woolley, S. Collins, T. 49. Collins T. 48. frag- tablet cuneiform Hittite-language assembling automatically “Toward Tyndall, S. 47. A learning: deep using text ancient “Restoring Prag, J. Sommerschield, T. Assael, Y. 46. embedding word character-augmented “Effective Zhao, H. Zhu, P. Huang, Y. Zhang, Z. 45. ece,1989). Bercker, in 221–255. archives” East pp. Near 2002), private Ancient from the in evidence Renewal The period: Achaemenid 2018). Agade, (Wydawnictwo Period Late-Babylonian the (2010). texts. administrative sian 2014). Press, (CDL Sofer David of Collection the in 89–134. pp. 2018), (Peeters, Eds. in Seire, M. BCE” Waerzeggers, C. 484 before Babylonia 1899-1917 Koldeweys in 126–136. pp. 2018), K (Springer, Eds. V. Maglogiannis, Networks, I. Iliadis, Neural L. Artificial in learning” on representation context ference and word to approach neural (2010). e9506 (2009). 13685–13690 in (2002). NLP” and IR to nin erEs:Poednso h 3 ecnr syilgqeInternationale, Assyriologique Rencontre 53e Kogan the E. of L. Proceedings East: Near Ancient (2003). 137–140 10, 5 Endberichte Uruk-Warka. in Ausgrabungen 1999). Zabern, 1993). Mura PA, the Philadelphia, Pennsylvania, in Practices 3–22. Sealing Levine, pp. A. and 1999), B. (Eisenbrauns, Levine, Eds. Schiffman, A. H. Baruch L. of Hallo, Honor W. W. in Chazan, Studies C. R. Judaic and Evidence, Biblical, Eastern, Cuneiform Near The 73–87 pp. Babylonia: 2018), (Peeters, and Eds. Seire, Xerxes M. Waerzeggers, in evidence” archaeological 2011). Press, (The University (Yale Eds. 243–283. Bechtolsheim, pp. von 2007), P. Chicago, Stolper, of W. University M. the Farber, of W. Institute Roth, Oriental T. M. Biggs, D. Robert 1985). (NINO, Babylonia in Rule Persian 1–6. pp. 2017), Society, cuneiform Computer Atrahasis (IEEE Eds. the Addison, of A. Goodman, reconstruction L. virtual the for in tablet” geometry join 70–77. and pp. 2014), sition Society, Computer (IEEE Eds. Shaw, J. pp. Kenderdine, S. 2012), Thwaites, H. Linguistics, Computational in for tablets” (Association Osborne, Eds. M. Park, Lin, C.-Y. C. Li, J. 243–247. H. Lee, 2, G. Papers-Volume G. Short Linguistics: Computational for in texts” larger into ments (Association Eds. 6368–6375. Huang, pp. R. 2019), Linguistics, Pad, Computational S. Conference for (EMNLP-IJCNLP), Joint International Processing 9th Language the Natural and on Processing International Language Natural (Springer in in Methods Eds. epigraphy” Greek Zan, on study H. case Li, S. Zhao, 27–39. D. pp. 2018), Ng, Cham, V. Publishing, Zhang, M. in Computing, comprehension” reading machine for 2741–2749. pp. 2016), Press, (AAAI Eds. .Shumn,M Wellman, M. Schuurmans, D. Intelligence, Artificial on Conference AAAI Thirtieth i ebboice Harr Neubablonischen Die e-ayoinCutProcedure Court Neo-Babylonian en, a rhvdsB des Archiv Das Cmue-sitdrcntuto fvrulfamne cuneiform fragmented virtual of reconstruction “Computer-assisted al., et ´ ttsia nlsso h nu cituign-grams. using script Indus the of analysis Statistical al., et 072r nentoa ofrneo ita ytmMliei (VSMM), Multimedia System Virtual on Conference International 23rd 2017 akvmdlo h nu script. Indus the of model markov A al., et 04ItrainlCneec nVrulSsesMliei (VSMM), Multimedia Systems Virtual on Conference International 2014 d.(ieban,21) p 647–660. pp. 2010), (Eisenbrauns, Eds. al., et Teft fteEnaacie h Gimil-Nan the archive, Eanna the of fate “The ayoinMrig gemns t-r etre BC Centuries 7th-3rd Agreements: Marriage Babylonian rhv n ilohkni ayo:DeTnaendrGaugRobert Grabung der Tontafeln Die Babylon: in Bibliotheken und Archive h etlHue nteNoBblna eid(IVCnuisBC) Centuries (VI-V Period Neo-Babylonian the in Houses Rental The etrOdr rmSpa n h diitaino h bbaain Ebabbara the of Administration the and Sippar from Orders Letter rk aeBblna elIpesoso Eanna-Tablets on Impressions Seal Babylonian Late Uruk: nrpeer n mie h Mura The Empire: and Entrepreneurs PNAS nentoa ofrneo nvra nweg n Language and Knowledge Universal on Conference International e-ayoinLtesadCnrcsfo h an Archive Eanna the from Contracts and Letters Neo-Babylonian SVSaarl (SDV tayoiceWrshfset u e an-rhv in Eanna-Archiv” dem aus Wirtschaftstexte atbabylonische ¨ | ouet fJda xlsadWs eie nBabylonia in Semites West and Exiles Judean of Documents el-r ¯ eu nentoaedsDot el‘Antiquit l de Droits des Internationale Revue rceig fte5t nulMeigo h Association the of Meeting Annual 50th the of Proceedings etme 5 2020 15, September Bnm,1995). (Bonami, emanni ¯ Sp niceDukriudVra,2005). Verlag, und Druckerei andische tayoicePrivatbriefe atbabylonische ¨ ¨ exsadBblna h uefr Evidence, Cuneiform The Babylonia: and Xerxes .Hdo,M a eMeop d.(D Press, (CDL Eds. Mieroop, De Van M. Hudson, M. , s anu-Gesch ˇ ˆ rceig fte21 ofrneo Empirical on Conference 2019 the of Proceedings rhv”(otrldsetto,Uiest of University dissertation, (Doctoral Archive” u ˆ NN,1999). (NINO, aua agaePoesn n Chinese and Processing Language Natural Bil 2009). (Brill, urkov aftsunternehmen ˚ ¨ PiipvnZbr,1990). Zabern, von (Philipp | ,Y aoools .Hammer, B. Manolopoulos, Y. a, ´ s ˇ rhv,teMura the Archive, u ˆ o.117 vol. rc al cd c.U.S.A. Sci. Acad. Natl. Proc. . eleio e Assyriologie der Reallexikon y rhv,adtheir and archive, B aya ¯ Uai-elg 2014). (Ugarit-Verlag, Shete,1976). (Schweitzer, tde rsne to Presented Studies | etadEconomic and Debt agae nthe in Languages nentoa Con- International o 37 no. LSOne PLOS s Piipvon (Philipp ˇ e ´ im and Firm, u Bto u. (Butzon ˆ 13–29 57, | Ancient 22751 106, 5,

COMPUTER SCIENCES ANTHROPOLOGY