Morpho-Syntactically Annotated Amharic Treebank
Seyoum, Binyam Ephrem (Addis Ababa University)
Miyao, Yusuke (National Institute of Informatics)
Mekonnen, Baye Yimam (Addis Ababa University)

Abstract

In this paper, we describe an ongoing project to develop a treebank for Amharic. The main objective of the treebank is to serve as input for the development of a parser. Morphologically rich languages such as Arabic, Amharic and other Semitic languages present challenges to the state of the art in parsing. In such languages, morphemes play important roles in both morphology and syntax. In addition to the high lexical variation due to the morphology, Amharic has a number of clitics which are not indicated by any special marker in the orthography. Considering the status of Amharic resources and the challenges to existing approaches to parsing, we propose to develop a treebank in which clitics are separated manually from content words and annotated semi-automatically for part-of-speech, morphological features and syntactic relations.

1. Introduction

A treebank can in general be viewed as a linguistically annotated corpus that includes grammatical analysis beyond the part-of-speech level (Nivre, 2008). The annotation may cover the word, phrase and sentence levels (Frank and Erhard, 2012). Such a language-specific annotated corpus is an input for the development of various natural language processing tools that use data-driven approaches. Though the design of a treebank should be motivated by its intended use, most treebank annotation schemes are organized into a number of layers (Nivre, 2008). For our purpose, we propose three layers of information: part-of-speech, morphological features and syntactic structure (dependency structures).

Parsing, on the other hand, is a process of recognizing input sentences and identifying units like subject, verb, and object. It is used to determine the interaction among these grammatical functions. Deep parsing, sometimes referred to as deep processing, is a process whereby rich linguistic resources are used to give a detailed (or deep) syntactic as well as semantic analysis of input sentences. Such systems are basic components and language-specific resources for any Natural Language Processing application that requires a deeper understanding of natural languages.

The structure of the paper is as follows. After a brief discussion of the motivation for developing the treebank in Section 2, the challenges Amharic poses to treebank development are discussed in Section 3. Section 4 deals with the existing resources and their limitations. Section 5 is devoted to the proposed solution, and the paper is concluded in Section 6.

2. Motivation

The main objective of the project is to develop a parser for Amharic sentences. There are two major approaches to parser development: grammar-driven (or rule-based) and data-driven (or statistical) (Nivre, 2008). In grammar-driven parsing, a formal grammar is used to define the possible parsing results for each string in the language. We define the grammar rules and the list of possible lexical items to which the rules can apply. The approach follows linguistic motivation to describe the grammar of a language precisely. Even though the rules are hand-written (or hand-crafted), they are capable of delivering a highly accurate in-depth analysis of complex natural language phenomena.

Motivated by the hypothesis that humans recognize patterns and phrases that have occurred in past experience (Bod et al., 2012), the task in data-driven approaches is to learn syntactic structures from an existing treebank or a large syntactically annotated corpus. In the data-driven approach, the development of a parsing system presupposes the availability of a treebank for the given language (Nivre et al., 2007). The knowledge of the grammar rules is learned from manually parsed training data.

Both methods have their own shortcomings and benefits. Grammar-driven methods are known to be linguistically precise, but have problems of robustness and ambiguity. Data-driven methods, on the other hand, are good for developing wide-coverage parsers rapidly. However, the accuracy of the parser depends on the size of the training data and the existence of accurate language-specific resources. Because data-driven methods depend on the size of the corpus, they are also subject to problems of robustness, that is, the ability of a system to give a certain analysis for a new or unseen input sentence (Nivre, 2006).

In choosing which approach to follow for the development of a parsing system, we also need to consider the status of the language. In general, languages can be highly resourced or less resourced. This distinction is based on the availability of tools and electronic data prepared in a language. For Amharic, there are initiatives to develop a large corpus. It has also been stated that language processing research for Amharic has shown some progress in recent years in both corpus and basic Natural Language Processing (NLP) tool development (Gamback, 2012). The existing corpora so far focus on web or news text. These sources are good for producing a large corpus quickly. However, for Amharic there are no text preparation tools such as spell checkers or grammar checkers. As a result, part of such a corpus is produced with some level of errors, and the efficiency of NLP tools trained on it may be questioned. Therefore, in terms of both tools and corpora, Amharic can be categorized as a less-resourced language. As a result, to develop a parser for Amharic we suggest first developing a treebank manually. In this paper, we address some of the challenges and solutions in developing the treebank for Amharic.

3. Challenges

In developing a treebank for languages which are morphologically rich and less resourced, there are a number of challenging issues that should be given due attention. In the following sections we provide a summary of the major challenges we have noticed so far.

3.1 Morphologically-rich languages

Amharic, being one of the morphologically rich languages, presents a challenge to NLP in general and to parsing in particular (Dehadari, et al., 2011). It is considered a morphologically rich language in the sense that grammatical relations like subject, object, etc., or word arrangements and syntactic information, are indicated morphologically or at the word level (Habash, 2010). When parsing models that have the highest performance for languages like English are adopted and implemented for morphologically rich languages like Amharic, they perform poorly. This is due to the complexity of the morphological structure (Tsarfaty, et al., 2010).

An orthographic word in such a language, which is delimited by white space, may be a combination of one or more function words and inflectional morphemes. For instance, an Amharic orthographic word like ከየቤቱና /kəjjəbetunna/ “from each house and” includes the preposition ከ- /kə-/ “from”, a reduced form of the distributive marker እየ- /ɨjjə/ “each”, ቤት /bet/ “house”, the definite marker -ኡ /-u/ “the” and the conjunction -ና /-nna/ “and”. Clitics like prepositions, conjunctions, auxiliaries, etc. have syntactic roles that indicate grammatical relations with the content words (Tsarfaty, 2013). In order to show the syntactic relation between clitics and content words, we need to consider how to represent and tag such elements.
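As an illustration only, the segmentation of ከየቤቱና given above could be represented along the following lines. This is a minimal Python sketch and not the annotation scheme of the treebank: the `Segment` class, its field names, the `kind` values and the tag labels (PREP, DIST, N, DEF, CONJ) are hypothetical, and the reduced transliteration /jjə/ of the distributive marker is inferred from the surface form /kəjjəbetunna/.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Segment:
    """One unit obtained by segmenting an orthographic word (hypothetical layout)."""
    form: str      # Ethiopic form as cited in the text
    translit: str  # transliteration of the form as it surfaces in the word
    gloss: str     # English gloss from the text
    tag: str       # illustrative label, NOT the treebank's actual tagset
    kind: str      # "function" or "content" (assumed two-way split)


# Segments of the orthographic word /kəjjəbetunna/ "from each house and".
SEGMENTS: List[Segment] = [
    Segment("ከ-", "kə", "from", "PREP", "function"),
    # Full form is /ɨjjə/; the reduced form /jjə/ is an inference from the surface form.
    Segment("እየ-", "jjə", "each", "DIST", "function"),
    Segment("ቤት", "bet", "house", "N", "content"),
    Segment("-ኡ", "u", "the", "DEF", "function"),
    Segment("-ና", "nna", "and", "CONJ", "function"),
]

SURFACE_TRANSLIT = "kəjjəbetunna"


def join_translit(segments: List[Segment]) -> str:
    """Concatenate segment transliterations to recover the surface transliteration."""
    return "".join(s.translit for s in segments)


def content_words(segments: List[Segment]) -> List[Segment]:
    """Return only the content-word segments, dropping function elements."""
    return [s for s in segments if s.kind == "content"]


if __name__ == "__main__":
    print(join_translit(SEGMENTS) == SURFACE_TRANSLIT)
    print([s.gloss for s in content_words(SEGMENTS)])
```

Keeping the transliteration of each segment makes it possible to check that a manual segmentation still adds up to the original orthographic word, while the tag and kind fields give each function element its own slot for later annotation.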

In addition, in morphologically rich languages a syntactic position may not give a clue about the syntactic relation. Rather, the language provides morphological features that determine the function of a phrase in a sentence. Thus, we should combine both structural and morphological features to predict syntactic dependencies.

The state of the art in parsing technology has been using data-driven systems. However, due to the high morphological variation in morphologically rich languages, it is impractical to observe all such variations in a given annotated data set. As a result, data-driven systems do not guarantee recognition of all morphological variations (Tsarfaty, 2013). Verbs in Amharic, for instance, can have thousands of word forms (Gasser, 2010). Capturing all these word forms in a corpus is unthinkable. Apart from the morphological variations, the written forms may cause further variation due to clitics that are attached to content words. Since there are a number of clitics in Amharic, the degree of variation increases. Thus, in order to decrease the variation arising from the attachment of function words, one needs to segment these forms. Segmenting clitics is not a simple task, as a clitic can be part of a content word or can be a word that is reduced due to phonological processes.

3.2 Writing system

The Amharic script is called Ge’ez or Ethiopic, in which a consonant and a vowel are represented by one symbol.

[...] handle this feature. In some cases, gemination can be used to convey grammatical information. Such cases are very common in the construction of relativized verbs.

B. Compound words - As can be observed in other languages, compound words are written in three ways: with a space between the combined words (ቡና ቤት /bunna bet/ “bar”፣ አየር መንገድ /ʔajər məngəd/ “airline”), separated by a hyphen (ስነ-ስርዓት /sɨnə-sɨrʔat/ “procedures”፣ ስነ-ጥበብ /sɨnə-t‘ɨbəb/ “art”) or written as a single word (ቤተክርስቲያን /bətəkɨrsɨtijan/ “church”፣ መስሪያቤቶች /məsrijabetoʧʧ/ “offices”).

C. Syntactic words - Words which are separated by white space (semicolon in old documents) may be coupled with function words like a preposition, conjunction or auxiliaries. Thus, an orthographic word may be a phrase (ለሰው /ləsəw/ “to human”), a clause (የሚገኙትና /jəmmigəɲɲutɨnna/ “and those that are found/available”), or even a sentence (አልመጣችም /ʔalmət‘t‘aʧʧɨm/ “She did not come.”).

All such features need to be addressed in processing Amharic texts. Some of the above problems call for standardization efforts to be made whereas others are due to the decision to