Addis Ababa University School of Graduate Studies School of Information Science
Total Page:16
File Type:pdf, Size:1020Kb
ADDIS ABABA UNIVERSITY SCHOOL OF GRADUATE STUDIES SCHOOL OF INFORMATION SCIENCE DEVELOPMENT OF A STEMMER FOR AFARAF TEXT RETRIEVAL A thesis proposed to School of Graduate Studies of Addis Ababa University in partial fulfillment of the Requirement for the Degree of Master of Science in Information Science By OSMAN TAHA OMER ADDIS ABABA November, 2015 ADDIS ABABA UNIVERSITY SCHOOL OF GRADUATE STUDIES SCHOOL OF INFORMATION SCIENCE DEVELOPMENT OF A STEMMER FOR AFARAF TEXT RETRIEVAL A THESIS SUBMITTED TO THE SCHOOL OF INFORMATION SCIENCE OF ADDIS ABABA UNIVERSITY BY OSMAN TAHA OMER IN PARTIAL FULFILLMENT OF THE REQUIREMENT FOR THE DEGREE OF MASTER OF SCIENCE IN INFORMATION SCIENCE NOVEMBER 2015 ADDIS ABABA, ETHIOPIA ADDIS ABABA UNIVERSITY SCHOOL OF GRADUATE STUDIES SCHOOL OF INFORMATION SCIENCE DEVELOPMENT OF A STEMMER FOR AFARAF TEXT RETRIEVAL BY OSMAN TAHA OMER Name and signature of members of the examining board Name Title Signature Date Ato _______________ Chairperson: ____________ _______________ Ato Ermias Abebe Advisor: ____________ _______________ Dr. Solomon Teferra Examiner: ____________ _______________ Dr. Million Meshesha Examiner: ____________ ______________ Dedication To my family DECLARATION This thesis is my original work. It has not been presented for a degree in any other university ___________________________________ OSMAN TAHA This is to certify that I have examined this copy of Master's thesis by OSMAN TAHA and have found it is complete and acceptable in all respects, and has been submitted for examination with my approval as university advisor. Name of Advisor Signature of Advisor Date Ato Ermias Ababe ______________________ ___________ NOVEMBER 2015 ADDIS ABABA, ETHIOPIA Abstract This study describes the design of a stemming algorithm for Afaraf text retrieval system. Nowadays, a considerable amount of electronical information has produced in Afaraf. Information retrieval system is a mechanism that enables users to retrieve relevant unstructured information material from large collection. The Afaraf morphology leterature reviewed in order to develop the rule-based stemmer. Each natural language has words structure in its own forms, that are different prefixes and suffixes, which need special handling of affixes with specific rules. The rule based stemmer proposed based on grammar and dictionary of Afaraf, included Numbers (singular and plural), personal pronoun, adjectives, adverbs, verbal-noun, strong and weak verb, indefinite pronoun, conditional and subjunctive mood, linkage to remove suffixes and prefixes from the word and produce stem word. For this study text document corpus are prepared by the researcher used 300 text files of Afaraf documents, which collected from different school text books, Samara university modules, Qusebaa Maca magazines and other online and experiment is made by using eight different queries. Data pre-processing techniques of VSM involved for both document indexing and query text. The evaluation conducted on the stemmer shows that the accuracy is 65.65 % with error rate of 4.50% for over-stemming and 29.85% for under-stemming. The information retrieval system registered effective performance of 0.785 precision and 0.233 recall. It has been witnessed that the challenging task in developing a full-fledged Afaraf text retrieval system is handling morpholoical word variations. The performance of the system may increase if the performance of the stemming algorithm is improved and if standard test corpus is used. I Table of Contents Abstract ........................................................................................................................................................ I Table of Contents ...................................................................................................................................... II Table list ..................................................................................................................................................... V Algorithm list ............................................................................................................................................ VI Figure and code ........................................................................................................................................ VI List of Acronyms and Abbreviations ................................................................................................... VII Chapter One ................................................................................................................................................ 1 1.1 Introduction ................................................................................................................................... 1 1.2 Statement of the Problem ............................................................................................................ 3 1.3 Objective of the Study ................................................................................................................... 5 1.3.1 General Objective ...................................................................................................................... 5 1.3.2 Specific Objectives .................................................................................................................... 5 1.4 Scope and limitation ..................................................................................................................... 5 1.5 Significance of the study .............................................................................................................. 6 1.6 Research Methodology ................................................................................................................. 7 1.6.1 Literature Review ..................................................................................................................... 7 1.6.2 Data Collection and preparation ............................................................................................ 7 1.6.3 Programming Tools .................................................................................................................. 7 1.6.4 Testing Process And Evaluation Techniques ...................................................................... 8 1.7 Organization of the Thesis .......................................................................................................... 8 Chapter Two .............................................................................................................................................. 10 2 Afaraf Literature review ................................................................................................................ 10 2.1 Afaraf Morphology ...................................................................................................................... 10 2.1.1 Affixes ........................................................................................................................................ 10 2.2 Word Formation .......................................................................................................................... 11 2.2.1 Inflection ................................................................................................................................... 11 2.2.2 Derivation ................................................................................................................................. 12 2.2.3 Compounding ........................................................................................................................... 12 2.3 Dialects and Varieties ................................................................................................................. 12 2.4 Alphabets and Sounds ................................................................................................................ 13 II 2.5 Grammar ....................................................................................................................................... 16 2.5.1 Syntax: ....................................................................................................................................... 16 2.5.2 Gender morphology ................................................................................................................ 16 2.5.3 Numbers morphology: ........................................................................................................... 17 2.6 Personal pronoun (Numiino kee Haysit Ciggiile) ................................................................ 20 2.7 Adjectives (Weelo) morphology ............................................................................................... 21 2.8 Adverbs (Abnigurra) morphology ........................................................................................... 22 2.9 Afaraf Verbal-Nouns ................................................................................................................... 23 2.10 Strong and weak verb ............................................................................................................. 25 2.11 Post-positions (Yascassi) ....................................................................................................... 25 2.12 Indefinite pronoun ( Amixxige-waa ciggile) ...................................................................... 26 2.13 Conditional and subjunctive mood ( Sharti kee Niyat Gurra) ........................................ 26 2.14 Linkage (Qaada yasgalli) ......................................................................................................