Nagaoka Tigrinya Corpus: Design and Development of Part-Of-Speech Tagged Corpus
Total Page:16
File Type:pdf, Size:1020Kb
言語処理学会 第22回年次大会 発表論文集 (2016年3月) Nagaoka Tigrinya Corpus: Design and Development of Part-of-speech Tagged Corpus Yemane Keleta Tedla Kazuhide Yamamoto Ashuboda Marasinghe [email protected] [email protected] [email protected] Nagaoka University of Technology 1603-1 Kamitomioka, Nagaoka Niigata, 940-2188 Japan 1 Introduction TEI1 text corpus encoding in XML2. A text corpus is more useful when augmented with A text corpus is a collection of text data that linguistic information. In light of this fact, represents an instance of the language in use. part-of-speech information has been added by It is a fundamental resource for the devel- manually tagging the data. Tagging guidelines opment of statistical and corpus-based Natu- used in the process of manual tagging were ral Language Processing (NLP) tasks. In gen- prepared following a study on Tigrinya gram- eral, with resource poor languages, the sup- mar. Accordingly, a tagset comprising of 73 port for working with electronic data is ei- tags was designed to encompass three levels ther insufficient or may be not available alto- of grammatical information: (1) the main POS gether [11]. We present the development of category, (2) its subcategory or type, and (3) the first Part-of-Speech (POS) tagged text cor- POS affixes (clitics). Finally, these POS labels, pus for a Semitic language, Tigrinya. Tigrinya were employed in tagging a corpus of 72,080 is spoken by estimated 7 million people in the tokens arranged into 4656 sentences. East African countries of Eritrea and North- ern Ethiopia. Other Semitic languages include Arabic, Hebrew, Amharic, Maltese and Tigre. 2 Tigrinya Language and its Hebrew and Arabic have a decent amount of morphology resources and research on NLP. However, be- cause of the absence of a Tigrinya corpus, Semitic languages are characterized by rich NLP research on the language have been very inflectional and derivational morphology that limited. Therefore, it is imperative to initi- generates numerous variations of word forms ate NLP research by developing corpora and [4]. Tigrinya (also Amharic and Tigre) uses the thereby language tools for the advancement of ancient Ge’ez script as its writing system. The information access in Tigrinya. Ge’ez script is one of the few native scripts The raw data used for producing the cor- that are still in active use in the African con- pus was collected from a daily newspaper in tinent. According to the consonant-vowel syl- Eritrea. In order to normalize the corpus, 60 lable, Tigrinya identifies seven vowel sounds common ways of using clitics were identified which are usually called ‘orders’. There are from the corpus. In addition, we cleaned the also five-ordered alphabets which are variants corpus and retained only a few basic punctu- of some of the basic 35 consonants. Alto- ation marks. Regarding formatting, the cor- 1TEI – Text Encoding Initiative pus is available in both plain text format and 2XML – Extended Markup Language ― 413 ― Copyright(C) 2016 The Association for Natural Language Processing. All Rights Reserved. gether 275 symbols constitute the Tigrinya at shabait.com/haddas-ertra. Next, data were alphabet chart known as ‘Fidel’ [9]. Like randomly selected from different columns of other Semitic languages, Tigrinya, is also a the newspaper as shown in table 1. highly inflected language. Tigrinya verbs are inflected for gender, person, number, case, Table 1: Distribution of topics tense, aspect, mood, etc. Tense-aspect-mood Topic/domain Articles is expressed by affixes that are prefixed, in- Agriculture 10 fixed or suffixed to the root word. The Business 5 verbs have a ‘root-template’ pattern which Culture 14 Health 13 predominantly consists of triliteral consonants. History 4 Subject-verb agreement is enforced by alter- Law 9 ation of verb suffixes and/or prefixes. More- Politics 7 over, negating Tigrinya verbs, nouns or adjec- Relationship 8 Sport 11 tives also require circumfixing the morpheme Social 12 ኣይ...ን /ayI...nI/3. Grammatical clitics that are General 7 attached to words further add to the complex- Total 100 ity of word morphology. These clitics are mostly proclitics and enclitics of prepositions, It is essential for a corpus to be represen- conjunctions, possessives and object pronouns tative in order to achieve findings that gener- [4]. alize to the language or the specific language domain the corpus represents (Sinclair, 2004). Topic and genre are important parameters that 3 The Tigrinya Corpus are used to measure how balanced or represen- tative a corpus is [1]. In this regard, although The task of developing a Tigrinya corpus is our corpus is a collection of News genre we quite challenging for at least two reasons. have attempted to gather data from different First, the amount of information available in topics or domains in order to include more Tigrinya on the Internet is limited in size and terminologies and structures of the language in variation. Second, random crawling of text that could appear in different topics. does not serve the purpose of this research as the text’s meta-data need proper documenta- 3.2 Preprocessing tion. In the following sub-sections, the pro- cess of compiling and annotating NTC is de- In the preprocessing phase the raw text was scribed. The corpus is publicly available4 on cleaned and formatted into plain text and XML the Internet for research purposes. Processing formats. First, unnecessary punctuation marks the corpus is explained in the subsections that and foreign scripts were removed. The Corpus follow. was then structured into XML following TEI corpus encoding standards. TEI conformant styles provide an extensive set of markups 3.1 Data Collection that support associating meta-information of The raw text was collected from Tigrinya pub- the entire corpus or an extract of the corpus lications of ‘Haddas Ertra’, a newspaper based [6]. Furthermore, in an effort to normalize the in Eritrea. The newspaper hosts a range of corpus, around 60 common methods of cliti- different topics or domains. News articles cizing Tigrinya words were identified from the published from March 2013 until December corpus. This tendency occurs because it is cus- 2013 were downloaded from the official site tomary to mask laryngeals such as እ ‘I’, ኣ ‘a’ or ኢ ‘i’ with an apostrophe. The normaliza- 3 Transliteration in this paper uses the uppercase ‘I’ tion process takes two forms. In most case, to exclusively mark the epenthetic vowel, traditionally known as ‘sads’ words joined by an apostrophe were separated 4http://eng.jnlp.org/yemane/ntigcorpus into full form of their constituent parts. For ― 414 ― Copyright(C) 2016 The Association for Natural Language Processing. All Rights Reserved. example, ክጽሕፍ’ዩ /kISIHIfI’yu/ ‘he will write’ Table 2: Reduced POS tags with category and is resolved into the two words ክጽሕፍ /kISI- type information HIfI/ and እዩ /Iyu/. On the other hand, there Category Type Label are cases were a combination of prepositions Noun N such as ኣብ /abI/ ‘in’ , ካብ /kabI/ ‘from’ and Verbal N_V pronouns such as እዚ /Izi/ ‘this’, እቲ /Iti/ ‘that’ Proper N_PRP are found almost fused together without the Pronoun PRO Verb V clitic marker. For instance, considering the Perfective V_PRF word ኣብቲ /abIti/ ‘over there’, while it rarely Imperfective V_IMF exists separated as ኣብ እቲ /abI Iti/, there are Imperative V_IMV still some instances of its cliticized form ኣብ’ቲ Gerundive V_GER Auxiliary V_AUX /abI’ti/ which are merged in the normaliza- Relative V_REL tion process to ኣብቲ. As such, by minimizing Adjective ADJ orthographic variations, normalization is ex- Adverb ADV pected to simplify the data sparseness problem Preposition PRE Conjunction CON during training session in various NLP studies Interjection INT such as POS tagging. Numeral NUM Punctuation PUN Foreign Word FW 3.3 Tagset Design Unclassified UNC The experiences of Brown corpus, Penn tree- bank and related Semitic languages were re- is given as: ‘category_type_preposition and/or viewed during the design of the corpus as well conjunction affix’; e.g. the label ‘V_IMF_PC’ as the tagsets for Tigrinya [2, 3, 7]. Many represents a Verb (V) of the IMperFective type Tigrinya linguists have classified Tigrinya (IMF) with both a Preposition and a Conjunc- parts-of-speech into eight major categories tion (PC) attached to it. [5, 9, 10]. These are verbs, nouns, pronouns, adjectives, adverbs, conjunctions, prepositions 3.4 Manual Tagging and interjections. On the other hand, an- We developed tagging guidelines based on the other study discussed Tigrinya POS by reduc- accounts of three Tigrinya Grammar books ing them into five categories [12] . Interest- [5, 9, 10]. A team of three taggers from ingly, there was a study that classified articles Eritrea carried out the manual tagging process of definiteness as entirely separate POS cate- using a tagset of 73 tags. The task was con- gories [8]. ducted in a period of four months, tagging We followed the most agreed classification, around 72k words. A simple POS annotation which considers eight major POS tags. The tool was developed to aid the manual tagging tagset was designed to encompass three lev- process. The tool was important in avoiding els of POS information. Level-1 represents the typographic errors and speeding up the tagging tags for the major categories, level-2 is a label process. In order to ensure tagging consis- for the type of the major tag and level-3 marks tency, inter-annotator agreement tests were re- if conjunction or preposition is affixed to the peated by evaluating the percentage of assign- main word. In this manner a total of 73 tags ments the taggers agreed on. Subsequently, were recognized. This design was inspired by we refined the tagging guidelines and the tag- previous POS research on Amharic [3]. How- gers were able to agree up to 95% of the time.