言語処理学会 第22回年次大会 発表論文集 (2016年3月)

Nagaoka Tigrinya Corpus: Design and Development of Part-of-speech Tagged Corpus

Yemane Keleta Tedla Kazuhide Yamamoto Ashuboda Marasinghe [email protected] [email protected] [email protected]

Nagaoka University of Technology 1603-1 Kamitomioka, Nagaoka Niigata, 940-2188 Japan

1 Introduction TEI1 text corpus encoding in XML2. A text corpus is more useful when augmented with A text corpus is a collection of text data that linguistic information. In light of this fact, represents an instance of the language in use. part-of-speech information has been added by It is a fundamental resource for the devel- manually tagging the data. Tagging guidelines opment of statistical and corpus-based Natu- used in the process of manual tagging were ral Language Processing (NLP) tasks. In gen- prepared following a study on Tigrinya gram- eral, with resource poor languages, the sup- mar. Accordingly, a tagset comprising of 73 port for working with electronic data is ei- tags was designed to encompass three levels ther insufficient or may be not available alto- of grammatical information: (1) the main POS gether [11]. We present the development of category, (2) its subcategory or type, and (3) the first Part-of-Speech (POS) tagged text cor- POS affixes (clitics). Finally, these POS labels, pus for a Semitic language, Tigrinya. Tigrinya were employed in tagging a corpus of 72,080 is spoken by estimated 7 million people in the tokens arranged into 4656 sentences. East African countries of and North- ern . Other include , Hebrew, , Maltese and Tigre. 2 and its Hebrew and Arabic have a decent amount of morphology resources and research on NLP. However, be- cause of the absence of a Tigrinya corpus, Semitic languages are characterized by rich NLP research on the language have been very inflectional and derivational morphology that limited. Therefore, it is imperative to initi- generates numerous variations of word forms ate NLP research by developing corpora and [4]. Tigrinya (also Amharic and Tigre) uses the thereby language tools for the advancement of ancient Ge’ez script as its . The information access in Tigrinya. Ge’ez script is one of the few native scripts The raw data used for producing the cor- that are still in active use in the African con- pus was collected from a daily newspaper in tinent. According to the - syl- Eritrea. In order to normalize the corpus, 60 lable, Tigrinya identifies seven vowel sounds common ways of using clitics were identified which are usually called ‘orders’. There are from the corpus. In addition, we cleaned the also five-ordered which are variants corpus and retained only a few basic punctu- of some of the basic 35 . Alto- ation marks. Regarding formatting, the cor- 1TEI – Text Encoding Initiative pus is available in both plain text format and 2XML – Extended Markup Language

― 413 ― Copyright(C) 2016 The Association for Natural Language Processing. All Rights Reserved. gether 275 symbols constitute the Tigrinya at shabait.com/haddas-ertra. Next, data were chart known as ‘Fidel’ [9]. Like randomly selected from different columns of other Semitic languages, Tigrinya, is also a the newspaper as shown in table 1. highly inflected language. Tigrinya are inflected for gender, person, number, case, Table 1: Distribution of topics tense, aspect, mood, etc. Tense-aspect-mood Topic/domain Articles is expressed by affixes that are prefixed, in- Agriculture 10 fixed or suffixed to the word. The Business 5 verbs have a ‘root-template’ pattern which Culture 14 Health 13 predominantly consists of triliteral consonants. History 4 - is enforced by alter- Law 9 ation of verb suffixes and/or prefixes. More- Politics 7 over, negating Tigrinya verbs, or adjec- Relationship 8 Sport 11 tives also require circumfixing the morpheme Social 12 ኣይ...ን /ayI...nI/3. Grammatical clitics that are General 7 attached to words further add to the complex- Total 100 ity of word morphology. These clitics are mostly proclitics and enclitics of prepositions, It is essential for a corpus to be represen- conjunctions, possessives and pronouns tative in order to achieve findings that gener- [4]. alize to the language or the specific language domain the corpus represents (Sinclair, 2004). Topic and genre are important parameters that 3 The Tigrinya Corpus are used to measure how balanced or represen- tative a corpus is [1]. In this regard, although The task of developing a Tigrinya corpus is our corpus is a collection of News genre we quite challenging for at least two reasons. have attempted to gather data from different First, the amount of information available in topics or domains in order to include more Tigrinya on the Internet is limited in size and terminologies and structures of the language in variation. Second, random crawling of text that could appear in different topics. does not serve the purpose of this research as the text’s meta-data need proper documenta- 3.2 Preprocessing tion. In the following sub-sections, the pro- cess of compiling and annotating NTC is de- In the preprocessing phase the raw text was scribed. The corpus is publicly available4 on cleaned and formatted into plain text and XML the Internet for research purposes. Processing formats. First, unnecessary punctuation marks the corpus is explained in the subsections that and foreign scripts were removed. The Corpus follow. was then structured into XML following TEI corpus encoding standards. TEI conformant styles provide an extensive set of markups 3.1 Data Collection that support associating meta-information of The raw text was collected from Tigrinya pub- the entire corpus or an extract of the corpus lications of ‘Haddas Ertra’, a newspaper based [6]. Furthermore, in an effort to normalize the in Eritrea. The newspaper hosts a range of corpus, around 60 common methods of cliti- different topics or domains. News articles cizing Tigrinya words were identified from the published from March 2013 until December corpus. This tendency occurs because it is cus- 2013 were downloaded from the official site tomary to mask laryngeals such as እ ‘I’, ኣ ‘a’ or ኢ ‘i’ with an apostrophe. The normaliza- 3 Transliteration in this paper uses the uppercase ‘I’ tion process takes two forms. In most case, to exclusively mark the epenthetic vowel, traditionally known as ‘sads’ words joined by an apostrophe were separated 4http://eng.jnlp.org/yemane/ntigcorpus into full form of their constituent parts. For

― 414 ― Copyright(C) 2016 The Association for Natural Language Processing. All Rights Reserved. example, ክጽሕፍ’ዩ /kISIHIfI’yu/ ‘he will write’ Table 2: Reduced POS tags with category and is resolved into the two words ክጽሕፍ /kISI- type information HIfI/ and እዩ /Iyu/. On the other hand, there Category Type Label are cases were a combination of prepositions N such as ኣብ /abI/ ‘in’ , ካብ /kabI/ ‘from’ and Verbal N_V pronouns such as እዚ /Izi/ ‘this’, እቲ /Iti/ ‘that’ Proper N_PRP are found almost fused together without the Pronoun PRO Verb V clitic marker. For instance, considering the Perfective V_PRF word ኣብቲ /abIti/ ‘over there’, while it rarely Imperfective V_IMF exists separated as ኣብ እቲ /abI Iti/, there are Imperative V_IMV still some instances of its cliticized form ኣብ’ቲ V_GER Auxiliary V_AUX /abI’ti/ which are merged in the normaliza- Relative V_REL tion process to ኣብቲ. As such, by minimizing ADJ orthographic variations, normalization is ex- Adverb ADV pected to simplify the data sparseness problem Preposition PRE Conjunction CON during training session in various NLP studies Interjection INT such as POS tagging. Numeral NUM Punctuation PUN Foreign Word FW 3.3 Tagset Design Unclassified UNC

The experiences of Brown corpus, Penn tree- bank and related Semitic languages were re- is given as: ‘category_type_preposition and/or viewed during the design of the corpus as well conjunction affix’; e.g. the label ‘V_IMF_PC’ as the tagsets for Tigrinya [2, 3, 7]. Many represents a Verb (V) of the IMperFective type Tigrinya linguists have classified Tigrinya (IMF) with both a Preposition and a Conjunc- parts-of-speech into eight major categories tion (PC) attached to it. [5, 9, 10]. These are verbs, nouns, pronouns, , adverbs, conjunctions, prepositions 3.4 Manual Tagging and interjections. On the other hand, an- We developed tagging guidelines based on the other study discussed Tigrinya POS by reduc- accounts of three Tigrinya books ing them into five categories [12] . Interest- [5, 9, 10]. A team of three taggers from ingly, there was a study that classified articles Eritrea carried out the manual tagging process of definiteness as entirely separate POS cate- using a tagset of 73 tags. The task was con- gories [8]. ducted in a period of four months, tagging We followed the most agreed classification, around 72k words. A simple POS annotation which considers eight major POS tags. The tool was developed to aid the manual tagging tagset was designed to encompass three lev- process. The tool was important in avoiding els of POS information. Level-1 represents the typographic errors and speeding up the tagging tags for the major categories, level-2 is a label process. In order to ensure tagging consis- for the type of the major tag and level-3 marks tency, inter-annotator agreement tests were re- if conjunction or preposition is affixed to the peated by evaluating the percentage of assign- main word. In this manner a total of 73 tags ments the taggers agreed on. Subsequently, were recognized. This design was inspired by we refined the tagging guidelines and the tag- previous POS research on Amharic [3]. How- gers were able to agree up to 95% of the time. ever, new classes that include level-2 tags for Out of the 72k words, around 19k (26%) are four types of verbs, a proper noun label and a unique words or types. Furthermore, accord- level-1 tag for foreign words were added. Ta- ing to the lexical diversity (token-type ratio), ble 2 shows the tags with level-1 and level-2 a words gets repeated 3.85 times in the cor- information. The format of the tagset labels pus. However, about 6% of the words are

― 415 ― Copyright(C) 2016 The Association for Natural Language Processing. All Rights Reserved. hapaxes. Although there is a large difference, [2] David Graff and Walker Kevin. on average a sentence is composed of about Arabic newswire part 1 - 15 words. The distribution of the 12 major Linguistic Data Consertium. tags is shown in figure 1. https://catalog.ldc.upenn.edu/LDC2001T55, 2001. Accessed: 2014-10-16. [3] Demeke Girma A. and Getachew Mesfin. Manual annotation of amharic news items with part-of-speech tags and its chal- lenges. , 2006. ELRC Work- ing Papers, 2:1–17. [4] Michael Gasser. HornMorpho 2.5 user's guide. Indiana University, Indiana, 2012. [5] Adi Ghebre. . Admas Figure 1: Distribution of POS tags in Nagaoka Forlag, Stockholm, 2 edition, 2000. Tigrinya Corpus [6] Text Encoding Intiative. TEI P5: Guidelines for electronic text encoding and interchange. http://www.tei- 4 Conclusion and Future Works c.org/Vault/P5/current/doc/tei-p5- doc/en/html/, 2013. Accessed: 2013-05- We presented a study on Tigrinya language 10. conducted with the aim of creating the first part-of-speech tagged corpus for Tigrinya, a [7] H. Kucera and W. Nelson Francis. A low resource language. A corpus of size 72k Standard Corpus of Present-Day Edited words was compiled and tagged manually for American English, for use with Digital part-of-speech. The corpus was cleaned, nor- Computers- Manual of Information. Tech- malized and formatted in TEI corpus encoding. nical report, Department of Linguistics, A new tagset of 73 tags was also designed Brown University, Providence, Rhode Is- which contains information of the major POS land, 1979. categories, types of the categories and clitics [8] Sebhatu Gebremichael Kuflu. The basic information. In the future, we plan to work principles of Tigrinian Language. For- on improving both the quality and volume of fattaresBokmaskin, Stockholm, 1997. the corpus. Revising the tagset design and rec- tifying tagging errors and inconsistencies can [9] John Mason. Tigrinya grammar. The Red improve the quality of the corpus. At present, Sea Press, Inc., New Jersey, 1996. the corpus contains text from News genre. It is important to collect data from other genres [10] Amanuel Sahle. A Comprehensive and styles in order to achieve a more represen- Tigrinya Grammar. The Press, tative corpus. The corpus will be an essential Inc., Lawrenceville NJ, 1998. resources for many NLP studies. [11] Oliver Streiter, Kevin Scannell, and Math- ias Stuflesser. Implementing NLP projects References for noncentral languages: instructions for funding bodies, strategies for developers. Springer Science+Business Media, 2007. [1] Keh Jiann Chen, Churen Huang, Liping Chang, and HuiLi Hsu. SINICA corpus: [12] Daniel Teklu Reda. Modern grammar of Design methodology for balanced corpus. Tigrinya language. Mega Publishing and In Language, Information and Computa- Distribution PLC, Addis Ababa, 2005. tion PACLING 11, 1996.

― 416 ― Copyright(C) 2016 The Association for Natural Language Processing. All Rights Reserved.