
dkrMorph

A Syriac Morphological Analyzer

Lars J. Lindgren

Uppsala University
Department of Informatics and Media
Bachelor’s thesis in Information Systems, 15 credits
7 June 2011

Supervisor: Beáta Megyesi, Dept. of Linguistics and Philology at Uppsala University

Abstract

This thesis proposes a method for automatic morphological analysis of Syriac - an under-resourced language for which there are no natural language processing tools such as morphological analyzers readily available. The proposed method uses a data-driven approach with automatically generated and weighted regular expression rules and patterns to cater for morphological attribute tagging and root- and lexeme derivation for dictionary linkage. The method is compared against a baseline, which it outperforms on all tests, and significantly outperforms for unknown words. When trained on all available training data, the analyzer achieves an accuracy of 95.53%.

Keywords: Automatic morphological analyzer, Natural Language Processing, Syriac, , Data-driven, Regular Expressions

Acknowledgements

I would like to express my sincerest gratitude to my supervisor Beáta Megyesi of the department of Linguistics and Philology at Uppsala University for her generous guidance, kindness, enthusiasm and patience. I would also like to thank my wonderful wife Sonia and my lovely daughters Sarah and Shamiram whose constant encouragement and love have strengthened me along the way.

Contents

Abstract
Acknowledgements
1 Introduction
2 Language Resources for Syriac
    2.1 The Syriac Language
        2.1.1 Syriac Morphology
    2.2 Syriac Resources
        2.2.1 SEDRA
        2.2.2 Annotated Peshitta New Testament
    2.3 Automatic Analysis of Syriac
        2.3.1 Finite-State Automata and ṣemḥe
        2.3.2 Syromorph
3 The dkrMorph Approach
    3.1 Data Requirements
    3.2 Morphological Analysis
        3.2.1 Dictionary Match
        3.2.2 Regular Expression Match
        3.2.3 Regular Expression with Generalized Prefix Match
        3.2.4 Greedy Prefix and Suffix Match
    3.3 Derivation of Root and Lexeme
4 Experiment
    4.1 Setup
        4.1.1 Baseline
    4.2 Results
        4.2.1 Baseline vs. dkrMorph
        4.2.2 Morphological Analysis
        4.2.3 Root and Lexeme Derivation
        4.2.4 Morphological Analysis and Root and Lexeme Derivation Combined
5 Discussion
    5.1 dkrMorph vs. Baseline Results
    5.2 dkrMorph Results
    5.3 dkrMorph vs. Syromorph
    5.4 Future Work
6 Conclusion
Appendices
    A Transliteration Table
    B Table of Morphological Attributes
    C SEDRA Corrections and Modifications
    D Ten-Fold Cross Validation Results and Experiment Results
Bibliography

List of Figures

3.1 A sample dictionary entry.
3.2 An example of a regular expression built from the pattern B#e#HuON.
3.3 Listing of algorithm for building root- and lexeme pattern dictionaries.
D.1 Ten-Fold Cross Validation Results and Experiment Results for the SEDRA dataset.
D.2 Ten-Fold Cross Validation Results and Experiment Results for the PNT dataset.

List of Tables

3.1 Sample of training data fed into the system - tubayhon laylen dadken blebbhon dhennon neḥzon lalāhā "Blessed are the pure in heart: for they shall see God." (Matthew 5:8)
3.2 Morphological attribute tagset for the word BLeBHuON "in their hearts".
3.3 Examples of patterns generated from word-root pairs.
4.1 Number of words in the training- and test data per dataset.
4.2 Morphology analysis accuracy per dataset.
4.3 Root accuracy per dataset.
4.4 Lexeme accuracy per dataset.
4.5 Accuracy of correct morphological analysis and root- and lexeme derivation per dataset.
4.6 Distribution of known and unknown morphological analysis matches per dataset.
4.7 Breakdown of the distribution of morphological analysis matches per method and dataset.
4.8 Accuracy of morphological analysis methods per dataset.
4.9 Distribution of correct root matches per morphological analysis method and dataset.
4.10 Distribution of correct lexeme matches per morphological analysis method and dataset.
4.11 Accuracy of root- and lexeme derivation per dataset.
4.12 Distribution of correct morphological analysis matches and root- and lexeme matches per morphological analysis method and dataset.
4.13 Total accuracy of morphological analysis and root- and lexeme derivation per dataset.
A.1 Transliterations used in this work.
B.1 Morphological attributes and their values and codes.
C.1 Corrections and Modifications made to the SEDRA dataset.

1 Introduction

There are still languages that, despite their significance, are classified as under-resourced languages in the field of computational linguistics. One such significant and under-resourced language is Syriac, the prominent of . Syriac’s significance stems from its rich and vast literary heritage. It is under-resourced since typical natural language processing tools such as morphological analyzers, stemmers and part-of-speech taggers are not readily available. There are, however, some data available in the form of an annotated version of the Peshitta New Testament¹ with an accompanying database for linguistic computing in Syriac, both compiled and prepared by Kiraz (1994). The lack of such language tools and data unfortunately makes the plethora of Syriac literature more or less unavailable for systematic study. Therefore, developing such tools and resources should be given due attention.

The objective of this thesis is to develop a method for automatic morphological analysis of Syriac. The method is principally to be used by Dukhrana Biblical Research² in an ongoing project to make an annotated electronic version of the Peshitta Old Testament freely available online for study. The desired annotation for the project includes morphological attributes and derivation of root and lexicon forms for possible dictionary linkage. Therefore, the method proposed in this thesis has been designed to accommodate the annotation requirements set out by the project. The analyzer uses a data-driven approach, with the annotated Peshitta New Testament and accompanying database as the only resources to build on. The analyzer is designed to analyze isolated vocalized words without any prior segmentation. The goal is also, once the project is completed, to make the analyzer freely available online.

The work presented in this thesis, and its fruits, should be of interest and value for scholars and laymen alike interested not only in natural language processing tools for Semitic languages, but also Syriac in general, biblical exegesis and Aramaic .

The remainder of the thesis is outlined as follows: Chapter 2 gives an overview of Syriac and available resources. There is also a brief review of related work. Chapter 3 presents dkrMorph, the method for automatic morphological analysis proposed by this thesis. Chapter 4 describes how this method has been evaluated and presents performance results. Chapter 5 discusses the results and what remains to be done. Chapter 6 offers final remarks.

¹ The Syriac New Testament used by Aramaic speaking churches.
² Dukhrana Biblical Research is a non-profit organization primarily dedicated to the study of the Peshitta, the Bible in Syriac. Its purpose is to make the study of the Peshitta more easily accessible by providing useful tools and resources via the website http://www.dukhrana.com.

2 Language Resources for Syriac

2.1 The Syriac Language

Syriac is an eastern dialect of Aramaic, with its home in Edessa¹, in the kingdom of , where it served as both a spoken and , certainly long before the introduction of Christianity (Nöldeke and Euting, 1904, p. XXXII). Then, in the first century, with the spreading of Christianity in the region, became more and more the center for Aramaic Christianity, and Syriac started to attain special importance. It soon became the language of the , in and , and the vehicle used for spreading Christianity eastwards throughout the and later also into .

Syriac was a true spoken language up until the 8th century, when it, with the rise of Islam, started to give way for . Nevertheless, Syriac was upheld as a literary language for another four centuries. Through the centuries, many prolific authors have written in Syriac, and its literary heritage, chiefly Christian and religious, is rich and vast. Today it is preserved and alive primarily as a liturgical language, although new texts are still being composed in or translated into Syriac.

Syriac is written from right to left using a twenty-two letter alphabet. At its core it is an abjad² language, but over the centuries diacritical markings have been incorporated and used to distinguish homographs and act as vowel signs to clarify pronunciation. For the most part, Syriac texts do not use many diacritical markings, and vowel signs are often absent. However, Bibles and texts for liturgical use today almost always include a full set of diacritical markings to aid the reader. Syriac utilizes three different scripts, Esṭrangelā, Madnḥāyā³ and Serṭo, each with its own set of diacritical markings. Which script and set of diacritical markings is used tends to depend on church denomination. This work uses the Madnḥāyā script when showing Syriac.

Two types of transliteration for Syriac are employed in this thesis. The transliteration used by Kiraz (1994) is used for internal representation by dkrMorph, and all output used in examples retains this transliteration. The remainder of the thesis uses a simplified variant of the transliteration employed by Brockelmann (1906), as it is more pleasing to the eye and easier to read.

¹ Modern Urfa, .
² An abjad language uses an all consonant writing-system and does not require vowels.
³ Sometimes also called Swādāyā or Nestorian.

2.1.1 Syriac Morphology

Syriac, being a dialect of Aramaic, belongs to the Semitic family of languages. As such it is an inflected language, with a non-concatenative root-and-pattern morphology, where formations of consonantal roots are interwoven with patterns to form words. These patterns inflect the root in various ways, for example by inserting vowels, lengthening vowels, doubling consonants and adding prefixes, suffixes and infixes. Following is a non-exhaustive list of morphological terms used in this thesis and their description:

• root - A root can be thought of as a morphemic abstraction, and as the cornerstone of Syriac morphology. The root consists of a formation of consonants, or radicals, which are almost always three in number, although there also exist roots with both fewer and more radicals. The abstract concept of roots is not a modern construction; grammarians as far back as the , if not earlier, were well aware of this phenomenon. Roots having any of ʾ, w or y as one of their radicals are considered weak, as the presence of these radicals often makes the root deviate from how it is typically inflected. Roots without any weak radicals are called strong or sound roots. Two examples of roots are ktb "notion of writing" and qwm "notion of rising/arising".

• stem - The resulting inflection after interweaving a root with a pattern. For example, by inflecting the root ktb we can derive words such as ktab⁴ "he wrote", tektob⁵ "you will write" and ktābā⁶ "a/the book".

• lexeme - A lexeme, or lemma, is the word form used when looking up a word in a dictionary. Typically verbs are listed in the THIRD PERSON MASCULINE SINGULAR, pʿal PERFECT form, like the word ktab above, while nouns are typically listed as MASCULINE or FEMININE SINGULAR EMPHATIC, as the word ktābā shown above.

• word - An arbitrary stem with or without attached prefixes and suffix. For example, waktāby "and my book", which has the prefix w- "and", as well as a possessive suffix, FIRST PERSON COMMON SINGULAR, -y "my", attached to the stem ktābā "a/the book". Please note that the prefix w- in this example takes the vowel a, and that the final ā is dropped when adding the -y suffix, which in turn is not pronounced.

• prefix - Prefixes are inseparable particles attached in front of a stem. A stem may have zero to three prefixes attached. Possible prefixes are, b- "in, with", d- "of", w- "and" and l- "to, for" (but also as an object marker).

• suffix - Suffixes comprise possessive suffixes and object suffixes, and are attached at the end of a stem. Suffixes vary in length, and a stem can only have one suffix attached.

⁴ VERB, THIRD PERSON MASCULINE SINGULAR, pʿal, PERFECT
⁵ VERB, SECOND PERSON MASCULINE SINGULAR, pʿal, IMPERFECT
⁶ NOUN, MASCULINE SINGULAR, EMPHATIC

• pattern - A pattern is a sequence of segments containing root consonants, represented by the letter C, intertwined with other non-root consonants and vowels. For example, the previously mentioned words ktab, tektob and ktābā, formed from the root ktb, have the following patterns: C₁C₂aC₃, teC₁C₂oC₃ and C₁C₂āC₃ā. The Syriac language has a vast set of these patterns, each holding its own set of morphological attributes. It is worth noting, though, that not all patterns are applicable to all roots.
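The relation between root, pattern and stem described in the list above can be illustrated with a minimal Python sketch. The function name apply_pattern and the ASCII spelling of the patterns (C1, C2, C3 for the slots, long vowels written as plain a) are assumptions made for this illustration only; they are not part of dkrMorph.

```python
import re

def apply_pattern(root, pattern):
    """Fill the numbered slots C1, C2, ... of `pattern` with the radicals of `root`."""
    def radical(match):
        # C1 refers to the first radical, C2 to the second, and so on.
        return root[int(match.group(1)) - 1]
    return re.sub(r"C(\d)", radical, pattern)

# The three example stems of the root ktb (vowel length ignored).
for pattern in ("C1C2aC3", "teC1C2oC3", "C1C2aC3a"):
    print(pattern, "->", apply_pattern("ktb", pattern))
```

The sketch only interleaves radicals and pattern material; it does not model which patterns are admissible for which roots.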

2.2 Syriac Resources

Resources for Syriac are scarce, but they are not nonexistent. Currently there exist two important datasets, namely the SEDRA database (Kiraz, 1994) and an annotated corpus of the Peshitta New Testament (interchangeably referred to as PNT from here on) (Kiraz, 1994).

2.2.1 SEDRA

The Syriac Electronic Data Retrieval Archive (SEDRA) is a database allowing systematic and empirical studies in Syriac. The first version of SEDRA was originally designed in 1989 by George A. Kiraz while preparing a digital and print concordance of the Peshitta New Testament. It made substantial use of the Syriac New Testament database of The Way International⁷, a Biblical research, teaching and fellowship ministry that spent 15 years annotating the Peshitta New Testament by hand. SEDRA has since evolved, and its data has been refined by George A. Kiraz and the Syriac community. The latest version of SEDRA⁸ is available for download at Beth Mardutho: The Syriac Institute⁹. SEDRA contains 2,050 roots, 3,559 lexemes and 29,699 words, but also 6,352 English translations (particular to the context of the New Testament).

2.2.2 Annotated Peshitta New Testament

The Peshitta New Testament text was originally annotated, by hand, by The Way International with morphological attributes, root- and lexeme forms, etymological data and English translations, but has later been corrected and redesigned by George A. Kiraz for use in his Peshitta New Testament concordance. Each and every word in the text is tagged with its corresponding SEDRA Word ID for easy lookup of its grammatical attributes and other annotated data. The Peshitta New Testament version used is The New Testament in Syriac by British and Foreign Bible Society (1920). The corpus contains 109,654 words.

2.3 Automatic Analysis of Syriac

The limited resources available and the lack of readily available natural language processing tools for Syriac are a clear sign that previous work on this language is far from exhaustive. Nevertheless, there are at least two important contributors in the field of computational tools and resources specifically for Syriac, namely Kiraz (2001) and McClanahan et al. (2010), both using unique approaches to tackle the problem of automatic analysis.

⁷ http://www.theway.org
⁸ SEDRA III
⁹ http://www.bethmardutho.org/support/sedra/download/

2.3.1 Finite-State Automata and ṣemḥe

George Anton Kiraz can be considered one of the pioneers of automatic morphological analysis of Syriac, and has undertaken much research in the , focusing on extending the Two-level morphology model of Koskenniemi (1983) to more elegantly handle languages whose morphology is non-concatenative, such as Syriac. This work is discussed in great detail in Kiraz (2001), which describes a new generalized regular rewrite rule system that uses multi-tape finite-state transducers to handle root-and-pattern morphology, infixation, circumfixation and other complex operations such as Arabic’s broken plural. Ṣemḥe, a system implementing this approach, is discussed in Kiraz (1996) and Kiraz (1998). However, a full working implementation, with an exhaustive grammar to handle the whole morphological analysis of the language, seems never to have been completed, and as such nothing is available for use today.

Kiraz has also done other tremendous work for Syriac in terms of language resources, particularly his work in preparing, and also making available, the annotated Peshitta New Testament corpus as well as the accompanying SEDRA database, the only Syriac resources currently readily available.

2.3.2 Syromorph

A recent contribution in the field of automatic morphological analysis of Syriac is the work done by McClanahan et al. (2010). In contrast to Kiraz’s multi-tape finite-state automata, they have created a data-driven probabilistic morphological analyzer for Syriac called Syromorph. It is a joint pipeline model consisting of a segmenter, a baseform (lexeme) linker, a root linker, a suffix tagger, and a stem tagger.

Similar to the analyzer proposed in this thesis, they also use Kiraz’s annotated Peshitta New Testament and accompanying SEDRA database as their only resource. The data used to train their system consists of non-vocalized words that are segmented into prefix, stem and suffix. Also, a word’s context is taken into account when analyzing. This is in contrast to the system proposed in this thesis, which takes as input vocalized words without any prior segmentation, and where words are analyzed in complete isolation. When trained on all available training data (approximately 100,000 words), their system achieves an accuracy of 86.47%.

Syromorph was developed to be used by the Comprehensive Syriac Corpus project¹⁰, which aims at creating a comprehensive, labeled corpus of classical Syriac. The project is a joint project between scholars at the Neal A. Maxwell Institute for Religious Scholarship at Brigham Young University and the Oriental Institute at Oxford University.

¹⁰ For more information see http://cpart.byu.edu/?page=112&sidebar.

3 The dkrMorph Approach

dkrMorph¹ is the name of the proposed method and prototype designed to automatically analyze the morphology of Syriac words. dkrMorph can be classified as a hybrid model that uses a data-driven approach to automatically generate weighted regular expression rules and patterns to deduce the most likely morphological attribute analysis as well as the most likely root and lexicon form for an isolated Syriac word.

Following is a short list of special terminology and definitions used by dkrMorph:

• greedy prefix - A greedy prefix is the full string of all characters from the beginning of a word up to the first root consonant.

• greedy suffix - Similar to a greedy prefix, except that a greedy suffix is the full string of characters after the last root consonant to the end of the word.

3.1 Data Requirements

The system is trained using a list of words, each with an accompanying root form, lexicon form and morphological attribute tag, as shown in Table 3.1. The words do not need to be segmented in any way, but they should be fully vocalized. The root and lexeme forms do not bear any special constraints and are retained in their consonantal form. Please note that Kiraz’s transliteration, unfortunately, represents certain consonants using vowel letters: A, O, Y, E and I represent ʾ, w, ṭ, ʿ and p. However, these can be easily distinguished by the fact that consonants are all in upper-case, with the addition of ; which represents the letter y, while vowels are represented in lower-case.

The morphological attribute tagset can contain any arbitrary attributes that are deemed sufficient for describing the word’s morphological information. The morphological tagset used here represents the minimum set of morphological attributes decided upon for Dukhrana Biblical Research’s ongoing project to produce an electronic annotated version of the Peshitta Old Testament. The tagset is represented as a list of individual morphological attributes. The system stores the morphological attributes as hexadecimal digits, but they can be converted to any other format when displayed. For example, the word BLeBHuON has the morphological tagset [4, 0, 2, 1, 00, 0, 3, 3, 2, 2] or 0x40210003322, of which a full breakdown can be seen in Table 3.2. Please see Table B.1 for a complete list of the attribute values employed by the system.

¹ The letters dkr in dkrMorph make up the Aramaic root letters dalath-- "notion of remembering". This is the same root from which the word dukhrana "remembrance, memorial" is derived. dkrMorph can therefore also go by the name Dukhrana Morph.
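The correspondence between the list form and the hexadecimal form of a tagset can be sketched as follows. This is an illustration only: the field widths are inferred from the single example above (every attribute one hexadecimal digit, the verb stem two), and the names FIELD_WIDTHS, pack_tagset and unpack_tagset are hypothetical rather than taken from dkrMorph.

```python
# Assumed width (in hex digits) of each attribute, in the order used by the tagset:
# category, person, gender, number, verb stem, verb tense, state,
# suffix person, suffix gender, suffix number.
FIELD_WIDTHS = [1, 1, 1, 1, 2, 1, 1, 1, 1, 1]

def pack_tagset(values):
    """Concatenate the attribute codes into one hexadecimal string."""
    digits = "".join(format(v, "0{}X".format(w)) for v, w in zip(values, FIELD_WIDTHS))
    return "0x" + digits

def unpack_tagset(code):
    """Split a packed hexadecimal string back into a list of attribute codes."""
    digits = code[2:]  # drop the leading "0x"
    values, pos = [], 0
    for width in FIELD_WIDTHS:
        values.append(int(digits[pos:pos + width], 16))
        pos += width
    return values

tagset = [4, 0, 2, 1, 0, 0, 3, 3, 2, 2]  # BLeBHuON; the verb stem 00 is written as 0
print(pack_tagset(tagset))
print(unpack_tagset(pack_tagset(tagset)))
```

Keeping the packed form as a string rather than an integer preserves the field boundaries when leading attributes are zero.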

Word          Root   Lexeme   Morphological Attribute Tag
YuOBa;HuON    YOB    YOBA     [4, 0, 2, 1, 00, 0, 3, 3, 2, 2]
LaA;Le;N      A;NA   A;NA     [7, 0, 1, 2, 00, 0, 0, 0, 0, 0]
DaDCe;N       DCA    DCA      [9, 0, 2, 2, 01, 6, 0, 0, 0, 0]
BLeBHuON      LB     LBA      [4, 0, 2, 1, 00, 0, 3, 3, 2, 2]
DHeNuON       HO     HO       [7, 3, 2, 2, 00, 0, 0, 0, 0, 0]
NeKZuON       KZA    KZA      [9, 3, 2, 2, 01, 2, 0, 0, 0, 0]
LaALoHoA      ALH    ALHA     [4, 0, 2, 1, 00, 0, 3, 0, 0, 0]

Table 3.1: Sample of training data fed into the system - tubayhon laylen dadken blebbhon dhennon neḥzon lalāhā "Blessed are the pure in heart: for they shall see God." (Matthew 5:8)

Morphological Attribute   Value (Code)
Grammatical Category      (4)
Person                    N/A (0)
Gender                    Masculine (2)
Number                    Singular (1)
Verb Stem                 N/A (00)
Verb Tense                N/A (0)
State                     Emphatic (3)
Suffix Person             Third (3)
Suffix Gender             Masculine (2)
Suffix Number             Plural (2)

Table 3.2: Morphological attribute tagset for the word BLeBHuON "in their hearts".

3.2 Morphological Analysis

The actual analysis of a word is divided into two parts. The first part deals with morphological analysis, while the second deals with deriving the root and lexicon form.

Currently there are four different methods employed when analyzing the morphological information of a word. These methods are applied in sequential order, where subsequent methods are only tried if the previous ones did not return any matches. The four methods are (1) Dictionary Match, (2) Regular Expression Match, (3) Regular Expression with Generalized Prefix Match and (4) Greedy Prefix and Suffix Match. Each one of these methods will be described in detail below.
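The sequential fallback between the four methods can be pictured with the following Python sketch. It only shows the control flow; the four lambda functions are toy stand-ins invented for this example, and the names analyze and methods are hypothetical.

```python
def analyze(word, methods):
    """Try each (name, function) pair in order and stop at the first non-empty result."""
    for name, method in methods:
        matches = method(word)
        if matches:
            return name, matches
    return None, []

# Toy stand-ins for the four methods (assumptions for illustration only).
known_words = {"BLeBHuON": ["0x40210003322"]}
methods = [
    ("dictionary", lambda w: known_words.get(w, [])),
    ("regular expression", lambda w: []),
    ("generalized prefix", lambda w: []),
    ("greedy prefix and suffix", lambda w: ["toy-tag"] if w.endswith("HuON") else []),
]

print(analyze("BLeBHuON", methods))  # answered by the first method
print(analyze("LeBHuON", methods))   # falls through to the last method
```

Because later methods are only consulted when earlier ones return nothing, the order of the list encodes the assumed reliability of each method.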

3.2.1 Dictionary Match

As training data is being loaded, the system constructs a dictionary of all unique words seen and their possible morphological attribute tags. The dictionary is indexed by word, and all possible morphological attribute tags for that word are stored as a list ordered by frequency, high to low. The word’s root and lexeme are also stored for later reference. The dictionary thus has the following structure:

dict["word"] = [ [frequency, [tagset], root, lexeme], [...] ]

Figure 3.1 shows what the sample word, BLeBHuON, from Table 3.1 looks like in a dictionary created from the Peshitta New Testament dataset.

dict["BLeBHuON"] = [ [11, [4,0,2,1,00,0,3,3,2,2], LB, LBA] ]

Figure 3.1: A sample dictionary entry.

If the word being analyzed is found in the dictionary, then the morphological attribute tag with the highest frequency for that word is assumed to be the correct analysis, and no further analysis is attempted. However, if the word is not found in the dictionary, then the next step is to try the Regular Expression Match method.

It is worth noting that a dictionary entry can sometimes contain more than one list of morphological attributes. This is due to the fact that ambiguous words exist. For example, the pʿal imperfect verb nektob can be correctly analyzed as either THIRD MASCULINE SINGULAR or FIRST COMMON PLURAL. Instead of letting the system return the morphological attribute tag with the highest frequency, it is also possible to tell the system to return, depending on user preference, any number of known possible morphological tags for a word.
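A frequency-ordered dictionary of this kind can be sketched in Python as follows. This is a minimal illustration under assumptions: the function names build_dictionary and lookup are hypothetical, the entry for nektob uses the plain transliteration as a key, and its tag labels and counts are invented for the example.

```python
from collections import Counter, defaultdict

def build_dictionary(training_data):
    """training_data: iterable of (word, root, lexeme, tagset) tuples."""
    counts = defaultdict(Counter)
    for word, root, lexeme, tagset in training_data:
        counts[word][(tuple(tagset), root, lexeme)] += 1
    dictionary = {}
    for word, counter in counts.items():
        # most_common() orders the analyses by frequency, high to low.
        dictionary[word] = [[freq, list(tags), root, lexeme]
                            for (tags, root, lexeme), freq in counter.most_common()]
    return dictionary

def lookup(dictionary, word, n=1):
    """Return the n most frequent analyses of a known word, or [] if unknown."""
    return dictionary.get(word, [])[:n]

training = [("BLeBHuON", "LB", "LBA", [4, 0, 2, 1, 0, 0, 3, 3, 2, 2])] * 11
training += [("nektob", "ktb", "ktab", ["3", "masc", "sg"])] * 2   # invented counts
training += [("nektob", "ktb", "ktab", ["1", "common", "pl"])]     # invented counts
dictionary = build_dictionary(training)
print(lookup(dictionary, "BLeBHuON"))
print(lookup(dictionary, "nektob", n=2))
```

Passing n greater than one corresponds to asking for several known analyses of an ambiguous word instead of only the most frequent one.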

3.2.2 Regular Expression Match

Besides constructing a dictionary, the system also constructs a data structure containing regular expressions generated from the training data. Generation of regular expressions is a two-step process. First a pattern is generated by masking out the root consonants from a word. The algorithm for doing this takes a word-root pair as input, and each root consonant’s counterpart in the word is masked out.

The algorithm has a few special constraints enforced due to the nature of Syriac’s morphology. Special care is taken not to mistake an initial prefix for a root letter. Strong and weak root consonants are also handled differently, as weak root consonants have a tendency to disappear in certain inflections. Strong root consonants are thus marked # and weak root consonants are marked @. See Table 3.3 for an illustration of the kind of patterns that are generated for the words in Table 3.1.

Word          Root   Pattern
YuOBa;HuON    YOB    #u@#a;HuON
LaA;Le;N      A;NA   La@@Le;#
DaDCe;N       DCA    Da##e;N
BLeBHuON      LB     B#e#HuON
DHeNuON       HO     D#eNu@N
NeKZuON       KZA    Ne##uON
LaALoHoA      ALH    La@#o#oA

Table 3.3: Examples of patterns generated from word-root pairs.
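The masking step can be approximated with the short Python sketch below. It is not the thesis's algorithm: the text does not spell out its prefix and weak-consonant rules, so the look-ahead heuristic used here (a leading b-, d-, w- or l- letter that equals the first root consonant and recurs later in the word is treated as a prefix) is an assumption, as are the names derive_pattern, WEAK and PREFIX_LETTERS. The sketch reproduces the rows of Table 3.3 but makes no claim beyond them.

```python
WEAK = set("AO;")            # weak consonants, masked with @
PREFIX_LETTERS = set("BDOL") # b-, d-, w-, l- in Kiraz's transliteration

def derive_pattern(word, root):
    """Mask the root consonants of `word` left to right: # for strong, @ for weak."""
    pattern = []
    ri = 0  # index of the next root consonant still to be located
    for i, ch in enumerate(word):
        if ri < len(root) and ch == root[ri]:
            # Assumed heuristic: before any root consonant has been found, a prefix
            # letter that occurs again later is taken to be a prefix, not a radical.
            looks_like_prefix = ri == 0 and ch in PREFIX_LETTERS and ch in word[i + 1:]
            if not looks_like_prefix:
                pattern.append("@" if ch in WEAK else "#")
                ri += 1
                continue
        pattern.append(ch)
    return "".join(pattern)

for word, root in [("BLeBHuON", "LB"), ("DaDCe;N", "DCA"), ("LaALoHoA", "ALH")]:
    print(word, root, derive_pattern(word, root))
```

Root consonants that never appear in the word, such as the final A of DCA in DaDCe;N, are simply left unmatched, which mirrors the remark that weak consonants may disappear in certain inflections.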

The second step is to build a regular expression using the pattern just generated. Here each # is replaced with a list of strong consonants [BGDHZKYCLMNSEI/XRWT], and each @ is replaced with a list of weak consonants [AO;]. Finally, an initial ^ and a final $ are added to make sure the resulting regular expression will be applied from the start of a word to the end of a word. See Figure 3.2 for an example of a regular expression built from the pattern B#e#HuON.

^B[BGDHZKYCLMNSEI/XRWT]{1}e[BGDHZKYCLMNSEI/XRWT]{1}HuON$

Figure 3.2: An example of a regular expression built from the pattern B#e#HuON.
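The construction of a regular expression from a pattern can be sketched in Python as follows. The character classes are those given in the text; the function name build_regexp matches the name used in Figure 3.3, but the body is an illustrative reconstruction, not the thesis's code.

```python
import re

STRONG = "[BGDHZKYCLMNSEI/XRWT]{1}"  # substituted for every #
WEAK = "[AO;]{1}"                    # substituted for every @

def build_regexp(pattern):
    """Turn a pattern such as 'B#e#HuON' into an anchored regular expression."""
    parts = []
    for ch in pattern:
        if ch == "#":
            parts.append(STRONG)
        elif ch == "@":
            parts.append(WEAK)
        else:
            parts.append(re.escape(ch))  # literal prefix, vowel or suffix character
    return "^" + "".join(parts) + "$"

regexp = build_regexp("B#e#HuON")
print(regexp)
print(bool(re.match(regexp, "BLeBHuON")))  # the word the pattern was built from
print(bool(re.match(regexp, "BMeLHuON")))  # an arbitrary string with the same shape
```

Because every strong slot accepts any strong consonant, one regular expression generalizes from a single training word to all words sharing its shape.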

After a regular expression has been built from each word in the training data, the list of regular expressions is compressed to only include unique regular expressions. Each unique regular expression is then associated with a list of all possible morphological attribute tags seen for that regular expression. These tags are assigned an initial weight based on frequency. The weight can then be adjusted depending on different factors, for example whether or not the morphological attribute tag contains certain attributes. At present, a tag’s weight is only adjusted if the tag contains any known greedy suffix, in which case the number of times the greedy suffix has been seen during training is added to the weight.

Each and every regular expression is then matched against the word being analyzed. If there is a match, then that regular expression is stored in a separate list. When all regular expressions have been tested, the list of matching regular expressions is ordered by the weight of their respective morphological attribute tags. The regular expression with the morphological attribute tag with the largest weight is then chosen as the most likely correct analysis of the word.

If no regular expressions matching the word are found, then the next method, Regular Expression with Generalized Prefix Match, is applied.
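The weighting and selection just described can be sketched as follows. This is a simplified illustration: the weight adjustment is modelled as an optional per-tag bonus (the parameter suffix_bonus), which is an assumption about how the greedy suffix count enters the weight, and the names build_rules and best_analysis as well as the tag labels are hypothetical.

```python
import re
from collections import Counter, defaultdict

def build_rules(examples, suffix_bonus=None):
    """examples: (regexp, tag) pairs from training; returns {regexp: Counter of tag weights}."""
    rules = defaultdict(Counter)
    for regexp, tag in examples:
        rules[regexp][tag] += 1  # initial weight = frequency
    for tags in rules.values():
        for tag in tags:
            tags[tag] += (suffix_bonus or {}).get(tag, 0)  # assumed adjustment
    return rules

def best_analysis(word, rules):
    """Return (weight, tag, regexp) of the heaviest tag among all matching rules."""
    candidates = []
    for regexp, tags in rules.items():
        if re.match(regexp, word):
            tag, weight = tags.most_common(1)[0]
            candidates.append((weight, tag, regexp))
    return max(candidates, key=lambda c: c[0]) if candidates else None

S = "[BGDHZKYCLMNSEI/XRWT]{1}"
noun_rx = "^B" + S + "e" + S + "HuON$"
verb_rx = "^" + S + "a" + S + S + "uOH;$"
rules = build_rules([(noun_rx, "noun")] * 3 + [(noun_rx, "other")] + [(verb_rx, "verb")] * 2)
print(best_analysis("BLeBHuON", rules))
```

Returning None when nothing matches is what lets a caller fall through to the next method.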

3.2.3 Regular Expression with Generalized Prefix Match

If none of the previously built regular expressions produce a match, then the system tries to see if more generalized variants of the regular expressions produce any matches. This process includes analyzing each regular expression to see if it begins with any known prefixes². If a prefix is found, then the regular expression is modified so that it will match with or without the prefix. For example, consider the following regular expression:

^B[BGDHZKYCLMNSEI/XRWT]{1}e[BGDHZKYCLMNSEI/XRWT]{1}HuON$

As we can see, this regular expression contains an initial prefix B which can be generalized into [BDOL]? and results in:

^[BDOL]?[BGDHZKYCLMNSEI/XRWT]{1}e[BGDHZKYCLMNSEI/XRWT]{1}HuON$

The modified regular expression is then matched against the word being analyzed in the same manner as described above. If none of the prefix-generalized regular expressions produce any matches, then the last and final method, Greedy Prefix and Suffix Match, is applied.
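The generalization step can be illustrated with the following Python sketch. It handles only the case shown in the example, a single literal prefix letter directly after the ^ anchor; how the thesis treats prefixes followed by a vowel or stacked prefixes is not stated, so nothing is assumed about those cases. The name generalize_prefix is hypothetical.

```python
import re

PREFIXES = "BDOL"  # b-, d-, w-, l- in Kiraz's transliteration

def generalize_prefix(regexp):
    """Replace a literal leading prefix letter with the optional class [BDOL]?."""
    if len(regexp) > 1 and regexp[0] == "^" and regexp[1] in PREFIXES:
        return "^[" + PREFIXES + "]?" + regexp[2:]
    return regexp  # no leading prefix letter: leave the expression unchanged

S = "[BGDHZKYCLMNSEI/XRWT]{1}"
original = "^B" + S + "e" + S + "HuON$"
generalized = generalize_prefix(original)
print(generalized)
for word in ("BLeBHuON", "DLeBHuON", "LeBHuON"):
    print(word, bool(re.match(original, word)), bool(re.match(generalized, word)))
```

The generalized expression accepts the same stem with any of the four prefixes or with none, which is what allows it to cover words that the original expression misses.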

3.2.4 Greedy Prefix and Suffix Match

This is the system’s last resort before giving up on a word. Initially, when the system starts, lists of all greedy prefixes and greedy suffixes seen in the training data are generated. The word being analyzed is then checked to see if it contains any of the known greedy prefixes or suffixes, and any matching greedy prefixes and greedy suffixes are saved in separate lists. These lists are then ordered by the lengths of the greedy prefixes and greedy suffixes, respectively. The longest greedy prefix and greedy suffix are then matched against the list of regular expressions built earlier. Those regular expressions that contain both the longest greedy prefix and the longest greedy suffix are considered possible candidates for holding the correct analysis attribute tag and are stored in a separate list. The regular expression whose morphological attribute tag has the largest weight is then chosen from this list as the most likely morphological analysis of the word. As mentioned in Section 3.2.1, the system is also capable of returning any number of entries from this list, depending on user preference.
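A possible reading of this last-resort step is sketched below in Python. Several simplifications are assumptions of the sketch, not statements about dkrMorph: it works on patterns (with # and @ slots) rather than on the compiled regular expressions, it takes "contains" to mean that a rule has exactly the longest matching greedy prefix and suffix, and the rule table, tag labels, weights and the unseen test string are all invented.

```python
def greedy_affixes(pattern):
    """Return (greedy prefix, greedy suffix) of a pattern such as 'B#e#HuON'."""
    slots = [i for i, ch in enumerate(pattern) if ch in "#@"]
    if not slots:
        return pattern, ""
    return pattern[:slots[0]], pattern[slots[-1] + 1:]

def greedy_candidates(word, rules):
    """rules: {pattern: (tag, weight)}. Return candidate (tag, weight) pairs, heaviest first."""
    affixes = {p: greedy_affixes(p) for p in rules}
    prefixes = [pre for pre, _ in affixes.values() if word.startswith(pre)]
    suffixes = [suf for _, suf in affixes.values() if word.endswith(suf)]
    if not prefixes or not suffixes:
        return []
    longest = (max(prefixes, key=len), max(suffixes, key=len))
    hits = [rules[p] for p, a in affixes.items() if a == longest]
    return sorted(hits, key=lambda tag_weight: tag_weight[1], reverse=True)

rules = {
    "B#e#HuON": ("noun+3mp", 3),
    "B##o#HuON": ("noun-pl+3mp", 1),
    "#a##uOH;": ("verb+3ms-object", 2),
}
print(greedy_affixes("B#e#HuON"))
print(greedy_candidates("BMaLCuOTHuON", rules))  # toy string matching no rule exactly
```

Taking the first entry of the returned list corresponds to choosing the heaviest tag, while returning the whole list corresponds to the user-configurable number of analyses.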

3.3 Derivation of Root and Lexeme

The underlying method used to derive the root and the lexeme for a word is the same. The method uses two different approaches: one for known words, and one for unknown words. The approach chosen depends on which sequential step in the morphological analysis produced a match. If there was a Dictionary Match in the morphological analysis, then the system is dealing with a known word, and the root and lexeme are fetched from the dictionary.

² These prefixes are language dependent, and a list of them is hardcoded into the system.

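from collections import Counter, defaultdict

# derive_pattern() and build_regexp() are the routines described in Section 3.2.2.

def most_frequent(patterns):
    # Return the pattern that occurs most often in the list.
    return Counter(patterns).most_common(1)[0][0]

def build_pattern_dicts(lexicon):
    # lexicon: iterable of (word, root, lexeme, morphological_attributes) entries
    root_dict = defaultdict(list)
    lexeme_dict = defaultdict(list)

    for word, root, lexeme, _attributes in lexicon:
        root_pattern = derive_pattern(word, root)
        lexeme_pattern = derive_pattern(word, lexeme)
        regexp = build_regexp(root_pattern)
        # Both dictionaries are indexed by the regular expression of the root pattern.
        root_dict[regexp].append(root_pattern)
        lexeme_dict[regexp].append(lexeme_pattern)

    # Keep only the most frequent pattern for each regular expression.
    root_dict = {key: most_frequent(value) for key, value in root_dict.items()}
    lexeme_dict = {key: most_frequent(value) for key, value in lexeme_dict.items()}
    return root_dict, lexeme_dict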

Figure 3.3: Listing of algorithm for building root- and lexeme pattern dictionaries.

If the match was in any of the other three sequential steps, Regular Expression Match, Regular Expression with Generalized Prefix Match or Greedy Prefix and Suffix Match, then the system is dealing with an unknown word. In order to derive a root and lexeme for an unknown word, the system uses root- and lexeme pattern dictionaries. These dictionaries are built when the training data is loaded, and the algorithm used for building the dictionaries is shown in Figure 3.3. The dictionaries contain root- and lexeme patterns which are used to mask out the root and lexeme from a word. The dictionaries are indexed by the corresponding regular expression built from the patterns. The reason the dictionaries are indexed by regular expressions is that the best matching regular expression picked by the morphological analysis is used to access the root and lexeme patterns.

For example, let us say the analyzer was about to analyze the word RaGMuOH;, "they stoned him", and that the word was not found in the dictionary of known words, but that the analyzer, in the Regular Expression Match step, found one regular expression, R,

    ^[BGDHZKYCLMNSEI/XRWT]{1}a[BGDHZKYCLMNSEI/XRWT]{1}[BGDHZKYCLMNSEI/XRWT]{1}uOH;$

that produced a match, and selected that regular expression's morphological attribute tag as the analysis of the word. This very same regular expression, R, is then used to access the root- and lexeme patterns needed to get the root and lexeme:

    root_dict[R] returns #a##uOH;
    lexeme_dict[R] returns #a##uOH;

Once the root- and lexeme patterns have been obtained, the root and lexeme are derived by applying the patterns to the word being analyzed and masking out the root and lexeme. Continuing the example above, masking the word RaGMuOH; yields the root RGM and the lexeme RGM:

    get_root("RaGMuOH;", "#a##uOH;") returns RGM
    get_lexeme("RaGMuOH;", "#a##uOH;") returns RGM
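The helper functions named in Figure 3.3 are not spelled out in the text. The following is a minimal Python sketch of how they might work, under the assumption that a pattern is derived by masking the root letters in the word from left to right, and that every masked position becomes the consonant class seen in the example regular expression. The function name apply_pattern is hypothetical and stands in for both get_root and get_lexeme.

```python
import re

# Consonant class copied from the example regular expression in the text.
CONSONANTS = "BGDHZKYCLMNSEI/XRWT"
SLOT = "[" + CONSONANTS + "]{1}"


def derive_pattern(word, root):
    """Mask the letters of `root` inside `word` with '#', left to right."""
    pattern, i = [], 0
    for ch in word:
        if i < len(root) and ch == root[i]:
            pattern.append("#")
            i += 1
        else:
            pattern.append(ch)
    if i != len(root):
        raise ValueError("root is not a subsequence of word")
    return "".join(pattern)


def build_regexp(pattern):
    """Turn a mask pattern into an anchored regular expression string."""
    body = "".join(SLOT if ch == "#" else re.escape(ch) for ch in pattern)
    return "^" + body + "$"


def apply_pattern(word, pattern):
    """Return the letters of `word` at the positions masked by '#'."""
    if len(word) != len(pattern):
        raise ValueError("word and pattern differ in length")
    return "".join(w for w, p in zip(word, pattern) if p == "#")


pattern = derive_pattern("RaGMuOH;", "RGM")
print(pattern)                              # the mask for the example word
print(apply_pattern("RaGMuOH;", pattern))   # the letters under the mask
```

In this sketch the same masking routine serves both root and lexeme, since in the example the two patterns coincide. The thesis implementation may derive lexeme patterns differently.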

It is also possible that the morphological analysis does not produce any matches and no regular expression can be used to access the root- and lexeme pattern dictionaries. In this case the root and lexeme are assumed to be the first three consonants in the word being analyzed.
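This fallback can be sketched in a few lines of Python. The consonant set below is an assumption taken from the character class of the example regular expression. Whether the analyzer treats A, O and ; as consonants in this step is not stated in the text.

```python
# Assumed consonant inventory (Kiraz transliteration), taken from the
# character class used in the example regular expression.
CONSONANTS = set("BGDHZKYCLMNSEI/XRWT")


def fallback_root(word, consonants=CONSONANTS):
    """Return the first three consonants of `word` (fewer if the word is short)."""
    return "".join(ch for ch in word if ch in consonants)[:3]


print(fallback_root("RaGMuOH;"))
```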

4 Experiment

This chapter describes how the testing and evaluation of dkrMorph was conducted, and presents how well the analyzer performs using datasets with varying characteristics.

4.1 Setup

Resources used for the experiment include the annotated Peshitta New Testament and the accompanying SEDRA database, both described in Section 2.2. This data is annotated with root and lexeme form, and labeled with morphological attributes. The morphological attributes contained in the SEDRA database have been manually streamlined and corrected in some places to fit the desired morphological attributes tagset set out by Dukhrana Biblical Research. For example, words in SEDRA tagged as denominatives, verbs derived from a noun or adjective, have been retagged as verbs. For a complete list of changes and corrections please see Table C.1.

Furthermore, the SEDRA database and the annotated Peshitta New Testament (PNT) are treated as two different datasets, due to their different characteristics and topology. The streamlined SEDRA dataset contains 29,699 words, of which 22,599 (76.09%) are unique words. The PNT dataset contains 109,654 words, of which 18,242 (16.64%) are unique words. From this it is discernible that the SEDRA data has a richer set of unique words, and contains 4,357 words not found in the PNT dataset. Training the system with the SEDRA data should yield more regular expression rules and a higher accuracy when dealing with unknown words, while the PNT data, having more words and a lower percentage of unique words, should give a higher total accuracy, as most words will be known words. As a final test, the SEDRA and PNT datasets are combined in order to get the best of both worlds, so to speak.

Ten-fold cross-validation has been employed for validation of the training of dkrMorph using the SEDRA and PNT datasets. Table 4.1 outlines the three datasets, SEDRA, PNT and SEDRA+PNT, used in the experiments, and how many words per dataset are allocated for training and for testing.
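For readers unfamiliar with ten-fold cross-validation, the sketch below shows one common way to partition a dataset into ten train/test splits. It is only an illustration: the text does not state how the folds were drawn or whether the data was shuffled first, and the function name k_fold_splits is hypothetical.

```python
def k_fold_splits(items, k=10):
    """Yield (train, test) pairs; each item ends up in exactly one test fold."""
    base, extra = divmod(len(items), k)
    start = 0
    for fold in range(k):
        # The first `extra` folds receive one additional item.
        size = base + (1 if fold < extra else 0)
        test = items[start:start + size]
        train = items[:start] + items[start + size:]
        yield train, test
        start += size


words = ["w%d" % i for i in range(23)]  # placeholder data, not thesis data
for train, test in k_fold_splits(words):
    print(len(train), len(test))
```

Each fold serves as test data once, while the remaining nine folds are used for training, which gives roughly the 90/10 proportion between training and test data seen in Table 4.1.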

4.1.1 Baseline

A naïve baseline has also been devised in order to get a feeling for just how well dkrMorph performs. The baseline has been implemented in the following manner to cater for morphological analysis and derivation of root and lexeme forms.

                 SEDRA      PNT        SEDRA+PNT
Training data    26,730     98,694     125,424
Test data        2,969      10,960     13,929
Total            29,699     109,654    139,353
Unique words     22,599     18,242     22,599

Table 4.1: Number of words in the training- and test data per dataset.

• Morphological Analysis - If the word being analyzed is a known word, then the most frequent morphological attribute tag for that word is chosen. This is the same approach used by dkrMorph. If it is an unknown word, then the analyzer chooses the most frequent overall morphological analysis tag for unknown words seen in the validation set.

• Derivation of Root and Lexeme - If the word has been seen during training, then the baseline uses the same approach as dkrMorph, and the root and lexeme are taken from the dictionary of known words. For unknown words the baseline simply uses the first three consonants from the word being analyzed to represent the root and lexeme forms.
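The two baseline rules above can be sketched as a small Python class. This is an illustration under stated assumptions, not the thesis implementation: the class name NaiveBaseline, the tag strings, and the consonant set (taken from the example regular expression in Section 3.3) are hypothetical, and the most frequent tag for unknown words is simply passed in as a parameter.

```python
from collections import Counter, defaultdict

# Assumed consonant inventory in Kiraz transliteration.
CONSONANTS = set("BGDHZKYCLMNSEI/XRWT")


class NaiveBaseline:
    def __init__(self, unknown_tag):
        # Tag assigned to every unknown word (most frequent tag among
        # unknown words, determined beforehand).
        self.unknown_tag = unknown_tag
        self.tags = defaultdict(Counter)   # word -> tag frequencies
        self.forms = {}                    # word -> (root, lexeme)

    def train(self, lexicon):
        """lexicon: iterable of (word, root, lexeme, tag) tuples."""
        for word, root, lexeme, tag in lexicon:
            self.tags[word][tag] += 1
            self.forms.setdefault(word, (root, lexeme))

    def analyze(self, word):
        if word in self.forms:
            # Known word: most frequent tag, root and lexeme from the dictionary.
            tag = self.tags[word].most_common(1)[0][0]
            root, lexeme = self.forms[word]
        else:
            # Unknown word: fixed tag, first three consonants as root and lexeme.
            tag = self.unknown_tag
            root = lexeme = "".join(c for c in word if c in CONSONANTS)[:3]
        return tag, root, lexeme
```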

4.2 Results

This section presents the results of the experiment. The results have been organized into four subsections. The first one, Baseline vs. dkrMorph, presents how dkrMorph compared against the naïve baseline. The remaining three subsections, Morphological Analysis, Root and Lexeme Derivation and Morphological Analysis and Root and Lexeme Derivation Combined, report detailed results, such as distribution of matches and accuracy, for dkrMorph's different methods. Furthermore, the three different methods for handling unknown words use the following abbreviations in the tables shown in the result section.

• Regexp 1 - Regular Expression Match

• Regexp 2 - Regular Expression with Generalized Prefix Match

• Greedy PreSuf - Greedy Prefix and Suffix Match

4.2.1 Baseline vs. dkrMorph

Presented here are the overall accuracy results for dkrMorph and the naïve baseline. All results shown are broken down by known and unknown words per dataset, in order to see how well each analyzer performed for known and unknown words, respectively.

Table 4.2 shows the accuracy of the morphological analysis per dataset. The results for known words are the same for both the baseline and dkrMorph, as they both use the same method to handle known words. The highest total accuracy is achieved with the SEDRA+PNT dataset (baseline 94.27%, dkrMorph 96.58%). The SEDRA dataset produces the highest accuracy for unknown words (baseline 10.18%, dkrMorph 63.89%), while for known words, the PNT dataset produces the highest accuracy (baseline and dkrMorph both 98.52%). PNT has a higher accuracy for known words than SEDRA+PNT because PNT contains fewer ambiguous words.

Analyzer    Dataset      Total     Known     Unknown
baseline    SEDRA        36.81%    86.74%    10.18%
baseline    PNT          89.42%    98.52%    7.06%
baseline    SEDRA+PNT    94.27%    98.11%    9.90%
dkrMorph    SEDRA        71.84%    86.74%    63.89%
dkrMorph    PNT          94.67%    98.52%    59.85%
dkrMorph    SEDRA+PNT    96.58%    98.11%    62.87%

Table 4.2: Morphology analysis accuracy per dataset.

Analyzer    Dataset      Total     Known     Unknown
baseline    SEDRA        43.68%    98.74%    14.31%
baseline    PNT          90.71%    99.86%    7.97%
baseline    SEDRA+PNT    96.37%    99.83%    20.46%
dkrMorph    SEDRA        66.45%    98.74%    49.64%
dkrMorph    PNT          94.72%    99.86%    48.30%
dkrMorph    SEDRA+PNT    97.95%    99.83%    56.93%

Table 4.3: Root accuracy per dataset.

In Table 4.3 the accuracy for root derivation is shown. The highest total accuracy (baseline 96.37%, dkrMorph 97.95%) as well as the highest accuracy for unknown words (baseline 20.46%, dkrMorph 56.93%) are both achieved with the SEDRA+PNT dataset. Highest accuracy for known words is achieved with the PNT dataset (baseline and dkrMorph both 99.86%).

The results for lexeme derivation accuracy shown in Table 4.4 are almost identical to the root derivation accuracy results, and follow the same pattern.

In Table 4.5 the combined accuracy for the morphological analysis and root- and lexeme derivation is shown, i.e. the share of words for which all three are correct. Highest total accuracy is achieved with the SEDRA+PNT dataset (baseline 93.80%, dkrMorph 95.53%). For known words the highest accuracy is achieved with the PNT dataset (baseline and dkrMorph both 98.49%). For unknown words, the baseline reaches its highest accuracy (0.33%) with the SEDRA+PNT dataset, while dkrMorph reaches its highest accuracy (54.24%) with the SEDRA dataset.

Analyzer    Dataset      Total     Known     Unknown
baseline    SEDRA        39.95%    97.68%    9.14%
baseline    PNT          90.34%    99.75%    5.22%
baseline    SEDRA+PNT    96.06%    99.69%    16.17%
dkrMorph    SEDRA        66.18%    97.68%    49.38%
dkrMorph    PNT          94.42%    99.75%    46.20%
dkrMorph    SEDRA+PNT    97.85%    99.69%    57.59%

Table 4.4: Lexeme accuracy per dataset.

Analyzer    Dataset      Total     Known     Unknown
baseline    SEDRA        30.14%    86.25%    0.21%
baseline    PNT          88.70%    98.49%    0.09%
baseline    SEDRA+PNT    93.80%    98.06%    0.33%
dkrMorph    SEDRA        52.61%    86.25%    54.24%
dkrMorph    PNT          91.92%    98.49%    33.36%
dkrMorph    SEDRA+PNT    95.53%    98.06%    41.61%

Table 4.5: Accuracy of correct morphological analysis and root- and lexeme derivation per dataset.

4.2.2 Morphological Analysis

In this section, detailed results for dkrMorph's morphological attribute analysis are presented.

           SEDRA     PNT       SEDRA+PNT
Known      34.79%    90.05%    95.65%
Unknown    65.21%    9.95%     4.35%

Table 4.6: Distribution of known and unknown morphological analysis matches per dataset.

Table 4.6 shows an aggregated view of the distribution of word matches between known and unknown words. SEDRA+PNT has the highest share of known matches (95.65%), and SEDRA has the highest share of matches for unknown words (65.21%).

Table 4.7 presents a detailed overview of how the word matches are distributed between dkrMorph's different methods. The dataset with the highest share of dictionary matches is SEDRA+PNT (95.65%), while the SEDRA dataset has the highest share of Regexp 1 matches (46.82%), Regexp 2 matches (9.03%) and Greedy PreSuf matches (6.70%), as well as the highest share of words with no matches (2.66%).

Method           SEDRA     PNT       SEDRA+PNT
Dictionary       34.79%    90.05%    95.65%
Regexp 1         46.82%    6.93%     3.47%
Regexp 2         9.03%     1.41%     0.42%
Greedy PreSuf    6.70%     1.32%     0.30%
No Matches       2.66%     0.30%     0.16%

Table 4.7: Breakdown of the distribution of morphological analysis matches per method and dataset.

Method           SEDRA     PNT       SEDRA+PNT
Dictionary       86.74%    98.52%    98.11%
Regexp 1         67.77%    64.03%    66.53%
Regexp 2         81.34%    77.27%    79.31%
Greedy PreSuf    38.69%    33.10%    30.95%
Total            71.84%    94.67%    96.58%

Table 4.8: Accuracy of morphological analysis methods per dataset.

Table 4.8 displays the accuracy per morphological attribute analysis method. The results show that the highest dictionary accuracy is achieved with PNT (98.52%), while the highest accuracies for the methods dealing with unknown words are all obtained using SEDRA - Regexp 1 (67.77%), Regexp 2 (81.34%) and Greedy PreSuf (38.69%). The highest total accuracy is attained with the SEDRA+PNT dataset (96.58%).
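The total accuracy can be read as a weighted average of the accuracy on known and unknown words, weighted by their shares in Table 4.6. The short Python sketch below illustrates this with the dkrMorph figures from Tables 4.2 and 4.6. It assumes that both tables describe the same test words, and the function name total_accuracy is hypothetical.

```python
# (share of known words, accuracy on known, accuracy on unknown), in percent.
# Values copied from Table 4.6 and the dkrMorph rows of Table 4.2.
DATASETS = {
    "SEDRA":     (34.79, 86.74, 63.89),
    "PNT":       (90.05, 98.52, 59.85),
    "SEDRA+PNT": (95.65, 98.11, 62.87),
}


def total_accuracy(known_share, acc_known, acc_unknown):
    """Weighted average of known- and unknown-word accuracy (percent)."""
    unknown_share = 100.0 - known_share
    return (known_share * acc_known + unknown_share * acc_unknown) / 100.0


for name, row in DATASETS.items():
    print(name, round(total_accuracy(*row), 2))
# prints: SEDRA 71.84, PNT 94.67, SEDRA+PNT 96.58 (one per line)
```

The values agree with the Total row of Table 4.8, which shows why a dataset with few unknown words reaches a high total accuracy even when its unknown-word accuracy is lower.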

4.2.3 Root and Lexeme Derivation

This section shows detailed results for the derivation of root and lexeme forms.

Table 4.9 displays the distribution of correct root derivations per morphological analysis method, i.e. which morphological analysis step produced the correct root derivations. The results shown here exhibit a similar pattern of distribution as the morphological analysis distribution results shown in Table 4.7. The highest share of correct root derivations stemming from the dictionary method is obtained with the SEDRA+PNT dataset (97.49%). The highest shares of correct root derivations stemming from methods associated with unknown words are all achieved with SEDRA - Regexp 1 (43.59%), Regexp 2 (4.51%), Greedy PreSuf (0.10%) and No Matches (0.10%).

Method           SEDRA     PNT       SEDRA+PNT
Dictionary       51.70%    94.93%    97.49%
Regexp 1         43.59%    4.36%     2.32%
Regexp 2         4.51%     0.68%     0.16%
Greedy PreSuf    0.10%     0.02%     0.03%
No Matches       0.10%     0.00%     0.01%

Table 4.9: Distribution of correct root matches per morphological analysis method and dataset.

Table 4.10 shows the distribution of correct lexeme derivations per morphological analysis method. These are almost identical to the distribution of correct root derivation results presented in Table 4.9. The highest dictionary share is obtained with the SEDRA+PNT dataset (97.45%), while the highest shares for methods associated with unknown words are all obtained with the SEDRA dataset - Regexp 1 (44.53%), Regexp 2 (3.87%), Greedy PreSuf (0.05%) and No Matches (0.20%).

Method           SEDRA     PNT       SEDRA+PNT
Dictionary       51.35%    95.13%    97.45%
Regexp 1         44.53%    4.26%     2.42%
Regexp 2         3.87%     0.58%     0.12%
Greedy PreSuf    0.05%     0.03%     0.01%
No Matches       0.20%     0.00%     0.00%

Table 4.10: Distribution of correct lexeme matches per morphological analysis method and dataset.

          SEDRA     PNT       SEDRA+PNT
Root      66.72%    94.73%    97.96%
Lexeme    66.18%    94.42%    97.86%

Table 4.11: Accuracy of root- and lexeme derivation per dataset.

In Table 4.11 the overall root- and lexeme derivation accuracy is shown per dataset. The highest accuracies for root- and lexeme derivation (97.96% and 97.86%) are both achieved by using the SEDRA+PNT dataset.

4.2.4 Morphological Analysis and Root and Lexeme Derivation Combined

This subsection presents the combined results for the morphological analysis and root- and lexeme derivation, i.e. when the morphological analysis is correct as well as both the root- and lexeme derivation.

Method           SEDRA     PNT       SEDRA+PNT
Dictionary       55.35%    96.49%    98.17%
Regexp 1         41.16%    3.02%     1.72%
Regexp 2         3.49%     0.49%     0.11%
Greedy PreSuf    0.00%     0.01%     0.00%
No Matches       N/A       N/A       N/A

Table 4.12: Distribution of correct morphological analysis matches and root- and lexeme matches per morphological analysis method and dataset.

Table 4.12 shows the distribution, per morphological analysis method, of correct morphological attribute analysis matches for which the root- and lexeme derivation were also correct. The highest share of dictionary matches is in the SEDRA+PNT dataset (98.17%), while the highest shares for Regexp 1 (41.16%) and Regexp 2 (3.49%) are obtained with the SEDRA dataset. Hardly any matches are achieved with Greedy PreSuf, but its highest share (0.01%) is achieved with PNT. Distribution of No Matches is not applicable.

Analysis     SEDRA     PNT       SEDRA+PNT
M            71.84%    94.67%    96.58%
R            66.72%    94.73%    97.96%
L            66.18%    94.42%    97.86%
R + L        60.93%    93.73%    97.39%
M + R        55.47%    92.38%    95.74%
M + L        56.01%    92.34%    95.77%
M + R + L    52.61%    91.92%    95.53%

Table 4.13: Total accuracy of morphological analysis and root- and lexeme derivation per dataset.

The final Table 4.13 presents the total accuracy per dataset for the morphological analysis (M), root derivation (R), lexeme derivation (L), R and L combined, M and R combined, M and L combined as well as the accuracy for M, R and L combined. The highest accuracy is achieved throughout with the SEDRA+PNT dataset.
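A combined accuracy counts a word as correct only when every component in the combination is correct, which is why each combined figure is never higher than its individual components. The Python sketch below shows how such figures could be computed from per-word results. The data structure and the function name combined_accuracies are assumptions made for illustration.

```python
from itertools import combinations


def combined_accuracies(results):
    """results: list of dicts with boolean keys 'M', 'R' and 'L', one per word.

    Returns the accuracy in percent for every combination of M, R and L.
    """
    out = {}
    for size in (1, 2, 3):
        for keys in combinations("MRL", size):
            # A word counts only if all selected components are correct.
            hits = sum(all(r[k] for k in keys) for r in results)
            out[" + ".join(keys)] = 100.0 * hits / len(results)
    return out


# Four made-up words, not thesis data.
sample = [
    {"M": True,  "R": True,  "L": True},
    {"M": True,  "R": False, "L": True},
    {"M": False, "R": True,  "L": True},
    {"M": True,  "R": True,  "L": False},
]
for name, value in combined_accuracies(sample).items():
    print(name, value)
```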

5 Discussion

The discussion is divided into four parts. The first part compares dkrMorph with the naïve baseline, and the second part discusses dkrMorph's internal performance in greater detail. The third part compares dkrMorph's results to Syromorph's results, and the fourth part discusses future work.

5.1 dkrMorph vs. Baseline Results

The results show that dkrMorph outperforms the naïve baseline in all tests. The total accuracies per dataset shown in Table 4.5 are for the baseline 30.14% (SEDRA), 88.70% (PNT) and 93.80% (SEDRA+PNT), while for dkrMorph they are 52.61% (SEDRA), 91.92% (PNT) and 95.53% (SEDRA+PNT). The results demonstrate that dkrMorph's results are slightly higher than the baseline's results when using the PNT and SEDRA+PNT datasets, and much higher when using the SEDRA dataset. Both the PNT and SEDRA+PNT datasets have a low percentage of unknown words, while the SEDRA dataset has a high percentage of unknown words. As dkrMorph and the baseline both use the same method to cater for known words, it is evident that the key factor behind dkrMorph's improved performance over the baseline is its ability to better handle unknown words. Therefore, the rest of the discussion will focus on the handling of unknown words.

dkrMorph's superior accuracy for unknown words is clearly seen throughout all three datasets used in the experiment. For example, for the SEDRA dataset the baseline has an accuracy of 10.18% for morphological analysis of unknown words, while dkrMorph has a 6.28 times higher accuracy with 63.89% (Table 4.2). Similar results are also seen for root- (Table 4.3) and lexeme derivation (Table 4.4) of unknown words, where the baseline has accuracies of 14.31% and 9.14% respectively, compared to dkrMorph's accuracies of 49.64% (3.47 times higher) and 49.38% (5.40 times higher). If taking into consideration that the whole label (morphological, root and lexeme) for unknown words must be correct, then the baseline only achieves an accuracy of 0.21% for unknown words, compared to dkrMorph with a 258.29 times higher accuracy of 54.24% (Table 4.5). Similar trends and results are also seen for the PNT and SEDRA+PNT datasets.
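The improvement factors quoted above are plain ratios between dkrMorph's accuracy and the baseline's accuracy on unknown words in the SEDRA dataset. A short Python check, with the figures copied from the Unknown columns of Tables 4.2 to 4.5 and a hypothetical helper name improvement_factor, reproduces them:

```python
# (baseline, dkrMorph) accuracy on unknown words, SEDRA dataset, in percent.
UNKNOWN_SEDRA = {
    "morphological analysis (Table 4.2)": (10.18, 63.89),
    "root derivation (Table 4.3)":        (14.31, 49.64),
    "lexeme derivation (Table 4.4)":      (9.14, 49.38),
    "combined label (Table 4.5)":         (0.21, 54.24),
}


def improvement_factor(baseline, system):
    """How many times higher the system's accuracy is than the baseline's."""
    return system / baseline


for task, (base, dkr) in UNKNOWN_SEDRA.items():
    print(f"{task}: {improvement_factor(base, dkr):.2f}x")
# prints factors 6.28x, 3.47x, 5.40x and 258.29x (one per line)
```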

5.2 dkrMorph Results

Results for how well each of the morphological attribute analysis methods performs are shown in Table 4.8. The accuracy for known words is as good as it can get considering that words are analyzed in complete isolation and disambiguation is not catered for. For unknown words the Regexp 1 and Regexp 2 methods perform reasonably well, with lowest accuracies of 64.03% and 77.27% respectively, both obtained with the PNT dataset, which is also the dataset with the fewest automatically generated regular expressions. Accuracy for these two methods should thus improve as more new words are added to the training data, and consequently more regular expressions are generated. This is supported by the improved results achieved by using the SEDRA and SEDRA+PNT datasets, which both have more unique words than the PNT dataset, and thus more generated regular expressions. The Greedy PreSuf method performs the worst of the methods, but since so few words are handled by this method, particularly when using datasets with a high percentage of known words, putting effort into improving it would only marginally improve the total accuracy. The highest accuracy for morphological analysis is obtained with the SEDRA+PNT dataset, which has an accuracy of 96.58%.

dkrMorph handles the derivation of root- and lexeme forms quite well. The highest accuracy for root- and lexeme derivation is achieved with the SEDRA+PNT dataset, with a root accuracy of 97.96% and a lexeme accuracy of 97.86%.

So far morphological analysis and root- and lexeme derivation have only been considered as isolated tasks. However, Table 4.13 contains a summary of both the isolated as well as the combined accuracies achieved by the morphological attribute analysis method, root derivation method and lexeme derivation method per dataset, and shows that with the SEDRA+PNT dataset dkrMorph performs almost equally well when treating the resulting annotation as a monolithic label. Thus, when the morphological analysis and the root- and lexeme derivation must all be correct for the analysis to be correct, dkrMorph achieves a highest accuracy of 95.53% with the SEDRA+PNT dataset.

5.3 dkrMorph vs. Syromorph

The only other data-driven system for automatic morphological analysis of Syriac that is similar enough for comparison with dkrMorph is Syromorph - a data-driven probabilistic morphological analyzer for Syriac. Despite sharing some similarities, the two systems have many differences that set them apart. The key difference is that dkrMorph uses vocalized and unsegmented words as input data, while Syromorph uses unvocalized segmented (prefix, stem and suffix) words. Furthermore, while dkrMorph analyzes words in complete isolation, Syromorph is designed to analyze words in their context. Syromorph's annotation also includes proper segmentation of the word into prefix, stem and suffix, while dkrMorph doesn't handle segmentation at all. With these key differences kept in mind, the two systems can be compared both by the highest accuracy achieved using similar datasets for training and by the highest overall accuracy achieved.

The dkrMorph dataset that best resembles the dataset used by Syromorph is the PNT dataset, which includes all of the words in the annotated Peshitta New Testament. The highest accuracy achieved by Syromorph is 86.47% (90.77% for known words, 40.93% for unknown words), while dkrMorph's best result for this dataset is 91.92% (98.49% for known words, 33.36% for unknown words). dkrMorph's total accuracy is higher, but Syromorph does a better job analyzing unknown words on the same corpus. dkrMorph's highest overall accuracy is achieved using the SEDRA+PNT dataset with 95.53% (98.06% for known words, 41.61% for unknown words). With this larger combined dataset, dkrMorph's accuracy for unknown words rises to 41.61%, slightly above Syromorph's 40.93%.

5.4 Future Work

The method for morphological analysis and root- and lexeme derivation of Syriac words, presented herein, has so far only been implemented as a prototype for proof-of-concept and evaluation. Future work includes implementing the analyzer as a proper application. Plans are to make it a network service, allowing users to supply words to be analyzed, as well as options, via a network protocol. Future work also includes possible handling of ambiguous words in order to increase the accuracy of correctly analyzing known words. For this, a word's context must be taken into consideration, which the current implementation does not do. It would also be of interest to see how well this method of analyzing words performs for other Semitic languages such as Arabic and Hebrew. However, one of the most important future tasks will be to actively use the analyzer to prepare the electronic annotated Peshitta Old Testament text to be used at Dukhrana Biblical Research's website. The Peshitta Old Testament will be analyzed in incremental steps, so previously analyzed and corrected words can be used to bootstrap the analyzer and make it perform better on subsequent steps.

6 Conclusion

This thesis has presented a data-driven approach to automatic morphological analysis of isolated vocalized Syriac words, with an accuracy of 95.53%. The analyzer uses automatically generated and weighted regular expression rules and patterns to cater for morphological attribute tagging as well as root- and lexeme form derivation used for dictionary linkage. It has also been shown that the proposed analyzer outperforms its corresponding baseline on all tests, and significantly outperforms it when it comes to handling of unknown words. dkrMorph also slightly outperforms Syromorph, the only other system for automatic morphological analysis of Syriac similar enough for comparison. The next step is to implement dkrMorph as a proper application and use it in Dukhrana Biblical Research's ongoing project to produce a free electronic annotated version of the Peshitta Old Testament.

A Transliteration Table

Letter name    Kiraz    Brockelmann
’alaph         A        -
beth           B        b
gamal          G        g
dalath         D        d
he             H        h
waw            O        w
zayn           Z        z
ḥeth           K        ḥ
ṭeth           Y        ṭ
yod            ;        y
kaph           C        k
lamadh         L        l
mim            M        m
nun            N        n
semkath        S        s
ʿayn           E        ʿ
pe             I        p
ṣade           /        ṣ
qoph           X        q
resh           R        r
šhin           W        š
taw            T        t

Vowel name     Kiraz    Brockelmann
ptāḥā          a        a
zkāpā          o        ā
rbāṣā          e        e / ē
ḥbāṣā          i        i
ʿṣāṣā          u        u / o

Table A.1: Transliterations used in this work.

B Table of Morphological Attributes

Morphological Attribute    Values (Code)
Grammatical Category       Adjective (1), Adverb (2), Idiom (3), Noun (4), Numeral (5),
                           Particle (6), Pronoun (7), Proper Noun (8), Verb (9)
Person                     N/A (0), First (1), Second (2), Third (3)
Gender                     N/A (0), Common (1), Masculine (2), Feminine (3)
Number                     N/A (0), Singular (1), Plural (2)
Verb Stem                  N/A (00), Peal (01), Ethpeal (02), Pael (03), Ethpael (04),
                           Aphel (05), Ettaphal (06), Shaphel (07), Eshtaphal (08),
                           Saphel (09), Estaphal (0A), Pauel (0B), Ethpaual (0C),
                           Paiel (0D), Ethpaial (0E), Palpal (0F), Ethpalpal (10),
                           Palpel (11), Ethpalpal (12), Pamel (13), Ethpamal (14),
                           Parel (15), EthParal (16), Pali (17), Ethpali (18),
                           Pahli (19), Ethpahli (1A), Taphel (1B), Ethaphal (1C)
Verb Tense                 N/A (0), Perfect (1), Imperfect (2), Imperative (3),
                           Infinitive (4), Active Participle (5), Passive Participle (6)
State                      N/A (0), Absolute (1), Construct (2), Emphatic (3)
Suffix Person              N/A (0), First (1), Second (2), Third (3)
Suffix Gender              N/A (0), Common (1), Masculine (2), Feminine (3)
Suffix Number              N/A (0), Singular (1), Plural (1)

Table B.1: Morphological attributes and their values and codes.

C SEDRA Corrections and Modifications

SEDRA      Old Attribute(s)                           New Attribute(s)
*          DENOMINATIVE                               VERB
*          ACTIVE_PARTICIPLE, FIRST|SECOND|THIRD      ACTIVE_PARTICIPLE
*          PASSIVE_PARTICIPLE, FIRST|SECOND|THIRD     PASSIVE_PARTICIPLE
*          , FIRST|SECOND|THIRD                       PASSIVE_PARTICIPLE
*          PARTICIPLE_ADJECTIVE                       ADJECTIVE
*          SUBSTANTIVE                                PARTICLE
2:804      CSTR.                                      EMPH.
2:814      CSTR.                                      EMPH.
2:828      CSTR.                                      EMPH.
2:836      CSTR.                                      EMPH.
2:850      -                                          MASC., SG., EMPH.
2:978      -                                          MASC., PL., EMPH.
2:3106     MASC., SG., EMPH.                          -
2:5992     -                                          MASC., PL., EMPH.
2:9015     FEM., SG.                                  -
2:11183    -                                          MASC., SG., EMPH.
2:11873    MASC., SG., EMPH.                          -
2:14309    -                                          ADD MISSING ROOT SM
2:16124    MASC., SG., EMPH.                          3RD, MASC., SG., PEAL, PERFECT
2:16225    MASC., SG., EMPH.                          3RD, MASC., SG., PEAL, PERFECT
2:16913    EMPH.                                      3RD
2:17087    MASC., SG., EMPH.                          -
2:18660    MASC., SG., EMPH.                          3RD, MASC., SG., PEAL, PERFECT
2:19232    -                                          MASC., SG., EMPH.
2:19234    -                                          EMPH.
2:19235    -                                          MASC., SG., EMPH., SUF_1ST, SUF_COM., SUF_SG.
2:21958    -                                          MASC., SG., EMPH.
2:22039    NOUN                                       IDIOM, SUF_3RD, SUF_MASC., SUF_SG.
2:22748    EMPH.                                      3RD
2:26079    PARTICIPLE                                 -

Table C.1: Corrections and Modifications made to the SEDRA dataset.

D Ten-Fold Cross Validation Results and Experiment Results

Figure D.1: Ten-Fold Cross Validation Results and Experiment Results for the SEDRA dataset.

Figure D.2: Ten-Fold Cross Validation Results and Experiment Results for the PNT dataset.

Bibliography

British and Foreign Bible Society (1920). The New Testament in Syriac. British and Foreign Bible Society, London.

Brockelmann, C. (1906). Semitische Sprachwissenschaft. Göschen’sche Verlagshandlung, Leipzig.

Kiraz, G. A. (1994). Automatic concordance generation of Syriac texts. In Lavenant, R., editor, VI Symposium Syriacum 1992: University of Cambridge, Faculty of Divinity 30 August - 2 September 1992, pages 461–475. Pontificio Istituto Orientale.

Kiraz, G. A. (1996). SEMHE: a generalised Two-level system. In Proceedings of the 34th annual meeting on Association for Computational Linguistics, ACL ’96, pages 159–166, Stroudsburg, PA, USA. Association for Computational Linguistics.

Kiraz, G. A. (1998). Syriac morphology: From a linguistic model to a computational implementation; Symposium Syriacum VII: Uppsala University, Department of Asian and African Languages 11 - 14 August 1996, volume 256 of Orientalia christiana analecta. Pontificio Istituto Orientale, Roma.

Kiraz, G. A. (2001). Computational nonlinear morphology with emphasis on Semitic languages. Cambridge University Press, Cambridge.

Koskenniemi, K. (1983). Two-level morphology : a general computational model for word-form recognition and production. Helsinki. Diss. Helsingfors : Univ., 1983.

McClanahan, P., Busby, G., Haertel, R., Heal, K., Lonsdale, D., Seppi, K., and Ringger, E. (2010). A probabilistic morphological analyzer for Syriac. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, EMNLP ’10, pages 810–820. Association for Computational Linguistics.

Nöldeke, T. and Euting, J. (1904). Compendious Syriac grammar. Williams & Norgate, London.
