dkrMorph: A Syriac Morphological Analyzer
Lars J. Lindgren
Uppsala University
Department of Informatics and Media
Bachelor’s thesis in Information Systems, 15 credits
7 June 2011
Supervisor: Beáta Megyesi, Dept. of Linguistics and Philology at Uppsala University

Abstract

This thesis proposes a method for automatic morphological analysis of Syriac, an under-resourced language for which no natural language processing tools, such as morphological analyzers, are readily available. The proposed method uses a data-driven approach with automatically generated and weighted regular expression rules and patterns to handle morphological attribute tagging as well as root and lexeme derivation for dictionary linkage. The method is compared against a baseline, which it outperforms on all tests, and significantly so for unknown words. When trained on all available training data, the analyzer achieves an accuracy of 95.53%.

Keywords: Automatic morphological analyzer, Natural Language Processing, Syriac, Semitic languages, Data-driven, Regular Expressions

Acknowledgements

I would like to express my sincerest gratitude to my supervisor Beáta Megyesi of the Department of Linguistics and Philology at Uppsala University for her generous guidance, kindness, enthusiasm and patience. I would also like to thank my wonderful wife Sonia and my lovely daughters Sarah and Shamiram, whose constant encouragement and love have strengthened me along the way.

Contents

Abstract
Acknowledgements
1 Introduction
2 Language Resources for Syriac
  2.1 The Syriac Language
    2.1.1 Syriac Morphology
  2.2 Syriac Resources
    2.2.1 SEDRA
    2.2.2 Annotated Peshitta New Testament
  2.3 Automatic Analysis of Syriac
    2.3.1 Finite-State Automata and ṣemḥe
    2.3.2 Syromorph
3 The dkrMorph Approach
  3.1 Data Requirements
  3.2 Morphological Analysis
    3.2.1 Dictionary Match
    3.2.2 Regular Expression Match
    3.2.3 Regular Expression with Generalized Prefix Match
    3.2.4 Greedy Prefix and Suffix Match
  3.3 Derivation of Root and Lexeme
4 Experiment
  4.1 Setup
    4.1.1 Baseline
  4.2 Results
    4.2.1 Baseline vs. dkrMorph
    4.2.2 Morphological Analysis
    4.2.3 Root and Lexeme Derivation
    4.2.4 Morphological Analysis and Root and Lexeme Derivation Combined
5 Discussion
  5.1 dkrMorph vs. Baseline Results
  5.2 dkrMorph Results
  5.3 dkrMorph vs. Syromorph
  5.4 Future Work
6 Conclusion
Appendices
  A Transliteration Table
  B Table of Morphological Attributes
  C SEDRA Corrections and Modifications
  D Ten-Fold Cross Validation Results and Experiment Results
Bibliography

List of Figures

3.1 A sample dictionary entry.
3.2 An example of a regular expression built from the pattern B#e#HuON.
3.3 Listing of algorithm for building root and lexeme pattern dictionaries.
D.1 Ten-Fold Cross Validation Results and Experiment Results for the SEDRA dataset.
D.2 Ten-Fold Cross Validation Results and Experiment Results for the PNT dataset.

List of Tables

3.1 Sample of training data fed into the system - tubayhon laylen dadken blebbhon dhennon neḥzon lalāhā "Blessed are the pure in heart: for they shall see God." (Matthew 5:8)
3.2 Morphological attribute tagset for the word BLeBHuON "in their hearts".
3.3 Examples of patterns generated from word-root pairs.
4.1 Number of words in the training and test data per dataset.
4.2 Morphology analysis accuracy per dataset.
4.3 Root accuracy per dataset.
4.4 Lexeme accuracy per dataset.
4.5 Accuracy of correct morphological analysis and root and lexeme derivation per dataset.
4.6 Distribution of known and unknown morphological analysis matches per dataset.
4.7 Breakdown of the distribution of morphological analysis matches per method and dataset.
4.8 Accuracy of morphological analysis methods per dataset.
4.9 Distribution of correct root matches per morphological analysis method and dataset.
4.10 Distribution of correct lexeme matches per morphological analysis method and dataset.
4.11 Accuracy of root and lexeme derivation per dataset.
4.12 Distribution of correct morphological analysis matches and root and lexeme matches per morphological analysis method and dataset.
4.13 Total accuracy of morphological analysis and root and lexeme derivation per dataset.
A.1 Transliterations used in this work.
B.1 Morphological attributes and their values and codes.
C.1 Corrections and Modifications made to the SEDRA dataset.

1 Introduction

Some languages are still classified as under-resourced in the field of computational linguistics despite their significance. One such language is Syriac, the prominent dialect of Aramaic. Syriac’s significance stems from its rich and vast literary heritage. It is under-resourced because typical natural language processing tools, such as morphological analyzers, stemmers and part-of-speech taggers, are not readily available. Some data are available, however, in the form of an annotated version of the Peshitta New Testament¹ with an accompanying database for linguistic computing in Syriac, both compiled and prepared by Kiraz (1994). The lack of language tools and data makes the wealth of Syriac literature more or less unavailable for systematic study. Developing such tools and resources therefore deserves attention.

The objective of this thesis is to develop a method for automatic morphological analysis of Syriac. The method is principally intended for use by Dukhrana Biblical Research² in an ongoing project to make an annotated electronic version of the Peshitta Old Testament freely available online for study. The desired annotation for the project includes morphological attributes and derivation of root and lexicon forms for possible dictionary linkage. The method proposed in this thesis has therefore been designed to accommodate the annotation requirements set out by the project. The analyzer uses a data-driven approach, with the annotated Peshitta New Testament and its accompanying database as the only resources to build on. The analyzer is designed to analyze isolated vocalized words without any prior segmentation. Once the project is completed, the goal is also to make the analyzer freely available online.

The work presented in this thesis, and its fruits, should be of interest and value to scholars and laymen alike who are interested not only in natural language processing tools for Semitic languages, but also in Syriac in general, biblical exegesis and Aramaic Christianity.

The remainder of the thesis is organized as follows. Chapter 2 gives an overview of Syriac and the available resources, together with a brief review of related work. Chapter 3 presents dkrMorph, the method for automatic morphological analysis proposed by this thesis. Chapter 4 describes how the method has been evaluated and presents performance results. Chapter 5 discusses the results and what remains to be done. Chapter 6 offers final remarks.

¹ The Syriac New Testament used by Aramaic-speaking churches.
² Dukhrana Biblical Research is a non-profit organization primarily dedicated to the study of the Peshitta, the Bible in Syriac. Its purpose is to make the study of the Peshitta more easily accessible by providing useful tools and resources via the website http://www.dukhrana.com.

2 Language Resources for Syriac

2.1 The Syriac Language

Syriac is an eastern dialect of Aramaic, with its home in Edessa¹, in the kingdom of Osroene, where it served as both a spoken and literary language, certainly long before the introduction of Christianity (Nöldeke and Euting, 1904, p.
XXXII). Then, in the first century, with the spread of Christianity in the region, Edessa increasingly became the center of Aramaic Christianity, and Syriac started to attain special importance. It soon became the language of the Christians in Syria and Mesopotamia, and the vehicle for spreading Christianity eastwards throughout the Near East and later also into Asia.

Syriac was a true spoken language up until the 8th century, when, with the rise of Islam, it started to give way to Arabic. Nevertheless, Syriac was upheld as a literary language for another four centuries. Through the centuries, many prolific authors have written in Syriac, and its literary heritage, chiefly Christian and religious, is rich and vast. Today it is preserved and alive primarily as a liturgical language, although new texts are still being composed in or translated into Syriac.

Syriac is written from right to left using a twenty-two letter alphabet. At its core it is an abjad² language, but over the centuries diacritical markings have been incorporated to distinguish homographs and to act as vowel signs that clarify pronunciation. For the most part, Syriac texts do not use many diacritical markings, and vowel signs are often absent. However, Bibles and texts for liturgical use today almost always include a full set of diacritical markings to aid the reader. Syriac utilizes three different scripts, Esṭrangelā, Madnḥāyā³ and Serṭo, each with its own set of diacritical markings. Which script and set of diacritical markings is used tends to depend on church denomination. This work uses the Madnḥāyā script when showing Syriac.

Two types of transliteration for Syriac are employed in this thesis. The transliteration used by Kiraz (1994) is used for internal representation by dkrMorph, and all output shown in examples retains this transliteration. The remainder of the thesis uses a simplified variant of the transliteration employed by Brockelmann (1906), as it is more pleasing to the eye and easier to read.