Developing Methods and Resources for Automated Processing of the African Language Igbo
Author: Ikechukwu E. Onyenwe
Supervisor: Dr. Mark R. Hepple

A thesis submitted in fulfilment of the requirements for the degree of Doctor of Philosophy

Department of Computer Science
Faculty of Engineering

April 2017

Declaration

I hereby declare that I am the sole author of this thesis. Except where specific reference is made to the work of others, the contents of this thesis are original and have not been submitted in whole or in part for consideration for any other degree or qualification in this, or any other, university. The contents are the outcome of research done under my supervisor.

Parts of this thesis have appeared in the following publications:

1. Onyenwe, Ikechukwu E., Chinedu Uchechukwu, and Mark Hepple. Part-of-speech Tagset and Corpus Development for Igbo, an African Language. In Proceedings of LAW VIII - The 8th Linguistic Annotation Workshop, pages 93–98, Dublin, Ireland, August 23-24 2014. 2015 Association for Computational Linguistics.

2. Onyenwe, Ikechukwu, Mark Hepple, Chinedu Uchechukwu, and Ignatius Ezeani. Use of Transformation-Based Learning in Annotation Pipeline of Igbo, an African Language. In Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects, pages 24–33, Hissar, Bulgaria, September 10, 2015. 2015 Association for Computational Linguistics.

3. Onyenwe, Ikechukwu E. and Mark Hepple. Predicting Morphologically-Complex Unknown Words in Igbo. In Proceedings of the Nineteenth International Conference on Text, Speech, Dialogue (TSD 2016), Brno, Czech Republic, September 12-16, 2016. Published by Springer-Verlag in Lecture Notes in Artificial Intelligence (LNAI), Volume 9924. [1]

4. Onyenwe, Ikechukwu E. and Mark Hepple. Predicting Morphologically-Complex Unknown Words in Igbo. In Proceedings of the Community-based Building of Language Resources (CBBLR) workshop, TSD 2016, Brno, Czech Republic, September 12, 2016. [2]

5. Onyenwe, Ikechukwu, Mark Hepple, and Chinedu Uchechukwu. Improving Accuracy of Igbo Corpus Annotation Using Morphological Reconstruction and Transformation-Based Learning. JEP-TALN-RECITAL 2016, Workshop TALAf 2016: Traitement Automatique des Langues Africaines (TALAf 2016: African Language Processing), July 2016, Paris, France, publisher: ATALA/AFCP, pages 1-10.

6. Onyenwe, Ikechukwu, Mark Hepple, and Ignatius Ezeani. Towards An Effective Igbo Part-of-Speech Tagger. Manuscript submitted for journal publication.

7. Onyenwe, Ikechukwu E., Mark Hepple, Chinedu Uchechukwu, and Ignatius Ezeani. A Basic Language Resource Kit Implementation for IgboNLP Project. Manuscript submitted for journal publication.

[1] The best papers which succeeded in both review processes (by the TSD 2016 Conference PC and CBBLR Workshop 2016 PC) will be published in the TSD 2016 Springer and CBBLR Proceedings.
[2] See footnote 1.

The above jointly authored publications are primarily the work of the first author. The role of the co-authors was editorial and supervisory.

This copy has been supplied on the understanding that it is copyright material and that no quotation from the thesis may be published without proper acknowledgement.

Ikechukwu E. Onyenwe
April 2017

Dedication

To my family: my lovely wife Obiọma and my son Chimdịndụ.

To these great ones: Onyeanusi Ikechukwu and Gladys N. Onyenwe for being my wonderful parents. Dr. Mark Hepple for being an extremely good mentor and supervisor to me. Prof. Boniface C.E. Egboka for believing in me. Rev. Canon Prof. A.D. Nkamnebe for his overwhelming guidance and support.

In loving memory of my beloved sister Lily. Forever in my heart, my beautiful sister with a beautiful heart.
Acknowledgements

Firstly, I would like to express my deepest appreciation and sincere thanks to my supervisor, Dr. Mark Hepple (a Reader in Computer Science), who has been an immense mentor to me. I would like to thank him for supporting my research and enabling me to grow as a research scientist. His patience, motivation, immense knowledge and guidance on both my research and my career have been invaluable. His guidance and hard questions helped me to broaden my research from various standpoints throughout the study and the writing of this PhD thesis. I could not have envisaged having a greater supervisor and mentor for my PhD study. He is the true definition of a mentor.

I would like to thank the rest of my panel committee members, my chair, Dr. John Barker, and my advisor, Dr. Mark Stevenson, for their insightful comments and suggestions. My sincere thanks also go to Dr. Uchechukwu Chinedu, a senior linguist who provided me with some Igbo linguistic materials and ideas. His collaboration throughout this study was very helpful. I gratefully appreciate the administrative teams (com-X groups) of Computer Science, University of Sheffield, for their administrative support throughout this PhD study.

Many thanks to Nnamdi Azikiwe University and the Tertiary Education Trust Fund (TETFund) Nigeria for the funding. Special thanks to the Vice Chancellor, Prof. Joe Ahaneku, and his management team for their support.

I sincerely appreciate all who have positively impacted my life, especially these great ones: Mr. Robin and Mrs. Joan Story, The Rt. Rev Prof. 'kelue Okoye, HRH Sir Dr. Harry Obi-Nwosu, Prof. S.O. Anigbogu, and Pastor (Dr.) and Mrs. Sam Okerenta.
I thank my colleagues in the IgboNLP project and the Natural Language Processing Group of the Department of Computer Science, Faculty of Engineering, The University of Sheffield, United Kingdom, for the sleepless nights we worked together to meet deadlines for panel meetings, research retreats, conferences, etc., and for all of the fun we had in the last four years. Special thanks to Mark Tice, Ignatius Ezeani, Olusayo Obajemu, Samuel Nwagbo, and Joshua Gbenga Adeyemi for their friendship and support beyond the earthly norm. I also thank my wonderful colleagues and friends within and outside Nnamdi Azikiwe University in Nigeria, and those outside Nigeria, for their calls and messages. You guys are one of the major reasons why I smile.

A special thanks to my family; words cannot express how grateful I am to my wife (seed of beauty) and son, parents, in-laws, and siblings for all of their prayers on my behalf.

To God be all the glory; great things He has done. Amen!

Ikechukwu E. Onyenwe, April 2017.

Abstract

Natural Language Processing (NLP) research is still in its infancy in Africa. Most languages in Africa have few or no NLP resources available, and Igbo is among those with none. In this study, we develop NLP resources to support NLP-based research in the Igbo language. The springboard is the development of a new part-of-speech (POS) tagset for Igbo (IgbTS), based on a slight adaptation of the EAGLES guideline to accommodate language-internal features not recognized in EAGLES. The tagset comes in three granularities: fine-grained (85 tags), medium-grained (70 tags) and coarse-grained (15 tags). The medium-grained tagset strikes a balance between the other two for practical purposes. Following this is the preprocessing of Igbo electronic texts through normalization and tokenization processes.
The tokenizer was developed in this study using the tagset's definition of a word token, and the outcome is an Igbo corpus (IgbC) of about one million tokens. The IgbTS was applied to a part of the IgbC to produce the first Igbo tagged corpus (IgbTC). To investigate the effectiveness, validity and reproducibility of the IgbTS, an inter-annotator agreement (IAA) exercise was undertaken, which led to the revision of the IgbTS where necessary. A novel automatic method was developed to bootstrap a manual annotation process by exploiting the by-products of this IAA exercise, in order to improve the IgbTC. To further improve the quality of the IgbTC, a committee-of-taggers approach was adopted to propose likely erroneous instances in the IgbTC for correction. A novel automatic method was also developed and used that applies knowledge of affixes to flag and correct all morphologically-inflected words in the IgbTC whose tags wrongly mark them as not morphologically inflected.

Experiments towards the development of an automatic POS tagging system for Igbo using the IgbTC show good accuracy scores, comparable to those reported for other languages on which these taggers have been tested, such as English. Accuracy on words not seen during the taggers' training (also called unknown words) is considerably lower, and lower still on unknown words that are morphologically complex, which indicates difficulty in handling morphologically-complex words in Igbo. This was improved by adopting a morphological reconstruction method (a linguistically-informed segmentation into stems and affixes) that reformats these morphologically-complex words into patterns learnable by machines. This enables taggers to use knowledge of the stems and associated affixes of these morphologically-complex words during the tagging process to predict their appropriate tags.
Interestingly, this method outperforms the other methods that existing taggers use for handling unknown words, and achieves an impressive increase in accuracy on both the morphologically-inflected unknown words and unknown words overall. Together, these developments constitute the first NLP toolkit for the Igbo language and a step towards achieving the objective of Basic Language Resource Kits (BLARK)