EXPLORING NEURAL WORD EMBEDDINGS FOR AMHARIC LANGUAGE

A THESIS SUBMITTED TO THE GRADUATE SCHOOL OF APPLIED SCIENCES OF NEAR EAST UNIVERSITY

By YARED YENEALEM AKLILU

In Partial Fulfillment of the Requirements for the Degree of Master of Science in Software Engineering

NEU 2019

NICOSIA, 2019

EXPLORING NEURAL WORD EMBEDDINGS FOR AMHARIC LANGUAGE

A THESIS SUBMITTED TO THE GRADUATE SCHOOL OF APPLIED SCIENCES OF NEAR EAST UNIVERSITY

By YARED YENEALEM AKLILU

In Partial Fulfillment of the Requirements for the Degree of Master of Science in Software Engineering

NICOSIA, 2019

Yared Yenealem AKLILU: EXPLORING NEURAL WORD EMBEDDINGS FOR AMHARIC LANGUAGE

Approval of Director of Graduate School of Applied Sciences

Prof. Dr. Nadire Çavuş

We certify that this thesis is satisfactory for the award of the degree of Master of Science in Software Engineering

Examining Committee in Charge:

Asst. Prof. Dr. Boran Şekeroğlu Department of Information Systems Engineering, NEU

Assoc. Prof. Dr. Yöney Kırsal Ever Department of Software Engineering, NEU

Assoc. Prof. Dr. Kamil Dimililer Supervisor, Department of Automotive Engineering, NEU

I hereby declare that this thesis has been composed solely by myself and that all the information in it has been obtained and presented in accordance with academic rules and ethical conduct. I also declare that, as required by these rules and conduct, I have fully cited and referenced all materials and results that are not original to this work. I further declare that this work has not been submitted, nor will it be concurrently submitted, in whole or in part, for the award of any other degree in this institute or any other institute.

Name, Surname: Yared Yenealem, Aklilu

Signature:

Date:

To my beloved late elder brother…

ACKNOWLEDGEMENT

Foremost, let the Most High be praised and honored, as He has led me beside the still waters. My supervisor, Assoc. Prof. Dr. Kamil Dimililer: your guidance, continuous support, and the patience and love you showed me helped me greatly in carrying out this work. You have been fully supportive from day one till the end. I thank you very much. My special respect and gratitude go to Asst. Prof. Dr. Boran Şekeroğlu for his unreserved support, from the very first day of my graduate classes until this work. You were very helpful, caring and inspiring during my stay at Near East University.

I am also very grateful to the Ethiopian Ministry of Science and Technology (now rebranded as the Ministry of Science and Higher Education) for giving me the opportunity to pursue my master's degree at this institute.

It would be a great omission not to thank my family (Sister Mastewal and Brother Dejazmach), friends and colleagues.

It would be against the will and conscience of my mind not to render my heartfelt and warmest gratitude and thanks to my late elder brother, Habtamu Yenealem. May the Almighty God grant you the surest mercies of David.


ABSTRACT

Word embeddings are a recent development in natural language processing in which words are mapped to real numbers for ease of operations on characters, words, subwords and sentences. Word embeddings for many world languages have been generated and are under active study. Though Amharic is one of the most widely spoken languages in Ethiopia, it lags behind in computational analysis, including word embeddings.

Word embeddings capture different intrinsic linguistic characteristics, such as word analogy, word similarity, out-of-vocabulary words and odd-word-out operations. In this thesis, these characteristics and operations were explored and analyzed for the Amharic language. Besides these intrinsic evaluations, the word embedding was evaluated on a multiclass Amharic text classification task as an extrinsic evaluation.

FastText, a recent method to generate and evaluate word embeddings, was utilized. It was chosen because of the morphological richness of Amharic and the ability of fastText to capture sub-word information.

The resulting fastText embedding showed that words that are similar or analogous to each other occur together or lie closer in the vector space. Related Amharic words were found closer to each other, with morphological relatedness taking the highest stake. The word embedding also learned the vector representation "ንጉሥ(King) - ወንድ(man) + ሴት(woman)", resulting in a vector close to the word "ንግሥት(queen)". Out-of-vocabulary words were also handled. Multiclass text classification on the model attained a 97.8% F1-score, with results fluctuating based on the parameters used.

Keywords: Word embedding; text classification; word relatedness; word analogy; Amharic language; fastText


ÖZET

Kelime gömme işlemleri, doğal dil işlemede, karakterlerin, kelimelerin, alt kelimelerin ve cümlelerin kullanım kolaylığı için kelimelerin gerçek sayılarla eşleştirildiği son gelişmelerdir. Birçok dünya dili için kelime yerleştirmeleri yapıldı ve bir çalışma devam ediyor. Amharca Etiyopya'da en çok konuşulan dilden biri olmasına rağmen, kelime gömme işlemleri de dahil olmak üzere hesaplama analizlerinde geride kalmaktadır.

Sözcük yerleştirmeleri, sözcük analojisi, sözcük benzerliği, sözcük dışı sözcükler ve garip sözcük çıkarma işlemleri gibi kendine özgü farklı dil karakteristiklerini yakalar. Bu tez çalışmasında, bu özellikler ve işlemler Amharca dilinde araştırılmış ve analiz edilmiştir. Bu içsel değerlendirmelerin yanı sıra, gömme kelimesi çok sınıflı Amharca metin sınıflandırma görevinde dışsal bir değerlendirme olarak değerlendirilmiştir.

FastText, kelime gömme işlemlerini üretmek ve değerlendirmek için yeni bir yöntem kullanıldı. Amharic'in morfolojik olarak zengin olması ve alt-kelime bilgisinin yakalanmasında fastText'in özellikleri nedeniyle kullanılmıştır.

FastText kullanılarak elde edilen sonuç gömme, birbirine benzer veya birbirine benzeyen kelimelerin bir arada veya uzayda daha yakın olduğunu gösterdi. İlgili Amharca kelimeler vektör uzayında birbirlerine daha yakın bulundu. Morfolojik ilişki en yüksek tehlikeyi aldı. Gömme kelimesi aynı zamanda “ንጉሥ(Kral) - ወንድ (erkek) + ሴት (kadın)” vektör gösterimini de “ንግሥት (kraliçe)” kelimesine daha yakın bir vektörle sonuçlamıştır. Kelime dışı kelimeler de ağırlandı. Model üzerindeki çoklu sınıf metin sınıflaması% 97.8 F1 puanına ulaşmıştır; Sonuç parametrelere göre dalgalanma.

Anahtar Kelimeler: Sözcük gömme; metin sınıflandırması; kelime ilişkililiği; kelime benzetmesi; Amharca dili; Fasttext


TABLE OF CONTENTS

ACKNOWLEDGEMENT ...... ii
ABSTRACT ...... iii
ÖZET ...... iv
LIST OF TABLES ...... viii
LIST OF FIGURES ...... ix
LIST OF ABBREVIATIONS ...... xi

CHAPTER 1: INTRODUCTION
1.1 Statement of the Problems ...... 1
1.2 Thesis Objectives ...... 2
1.2.1 General objectives ...... 2
1.2.2 Specific objectives ...... 3
1.3 Methods and Techniques ...... 3
1.3.1 Literature review ...... 3
1.3.2 Tool selection ...... 4
1.3.3 Data collection and preparation ...... 4
1.3.4 Models ...... 4
1.3.5 Evaluation and analysis ...... 4
1.4 Scope and Limitation ...... 5
1.5 Significance of the Study ...... 5
1.6 Thesis Outline ...... 5

CHAPTER 2: LITERATURE REVIEW AND RELATED WORKS
2.1 Overview ...... 7
2.1.1 Named entity recognition ...... 7
2.1.2 Sentiment analysis ...... 8
2.1.3 Text classification ...... 9
2.2 Word Embedding ...... 18
2.2.1 Types of word embedding ...... 21
2.3 Word Embedding Based Models ...... 24
2.4 Amharic and Amharic Word Embeddings ...... 26
2.4.1 Overview of Amharic ...... 26
2.4.2 Amharic word embeddings ...... 27
2.4.3 Works on Amharic text classification ...... 29
2.5 Word Embeddings Evaluation Methods ...... 31

CHAPTER 3: METHODOLOGY AND APPROACH
3.1 Words and Word Vectors ...... 32
3.1.1 Words and contexts ...... 32
3.1.2 Vectors and word vectors ...... 33
3.2 Word Representation ...... 34
3.2.1 Word2Vec ...... 34
3.2.2 GloVe ...... 38
3.2.3 fastText ...... 40
3.3 Word Representation for Amharic ...... 45
3.3.1 The corpus for word embedding ...... 45
3.3.2 Pre-processing the corpus for word embedding ...... 46
3.3.3 Amharic word embedding ...... 48
3.3.4 Dataset for text classification ...... 52
3.4 Visualizing Word Embeddings ...... 55

CHAPTER 4: EXPERIMENTATION AND RESULTS
4.1 Introduction ...... 56
4.2 Evaluation and Experimentation Setup ...... 56
4.3 Evaluation Metrics ...... 57
4.3.1 Intrinsic evaluation ...... 57
4.3.2 Extrinsic evaluation ...... 73
4.4 Summary ...... 77

CHAPTER 5: CONCLUSION, RECOMMENDATION AND FUTURE WORKS
5.1 Conclusion ...... 78
5.2 Recommendation ...... 79
5.3 Future Works ...... 79

REFERENCES ...... 81

Appendix 1: Amharic Punctuation Marks and Basic Ethiopic Numbers ...... 89


LIST OF TABLES

Table 3.1: Parameters used for training fastText word embeddings ...... 51
Table 3.2: The ten categories and the number of articles belonging to each category ...... 53
Table 4.1: Similarity scores between terms t1 and t2 ...... 58
Table 4.2: Most similar words for the words ልዑል(Prince), ሰው(Human), ነገሠ(Reign) ...... 60
Table 4.3: Semantic analogy ...... 62
Table 4.4: Syntactic analogy ...... 62
Table 4.5: Top 5 nearest neighbors for the word በላ ...... 63
Table 4.6: Analogical reasoning with varying window size ...... 64
Table 4.7: Word relatedness with the two models: CBOW and SG ...... 66
Table 4.8: Corpus size and word relatedness ...... 68
Table 4.9: Dimension and word relatedness ...... 69
Table 4.10: Nearest neighbors for OOV words and their cosine distance ...... 72
Table 4.11: Odd word out results ...... 72
Table 4.12: Number of datasets in each group and ratio ...... 73
Table 4.13: Precision and recall at K=2, and F1-score using different epochs ...... 74
Table 4.14: F1-score at K=1 using different epochs ...... 74
Table 4.15: Example from a validation set obtained with 100,000 epochs on an 80/20 sample, label predictions included ...... 75


LIST OF FIGURES

Figure 2.1: Text classification flow ...... 10
Figure 2.2: Classification block diagram ...... 11
Figure 2.3: Steps in machine learning based classification ...... 13
Figure 2.4: Support vector machines ...... 15
Figure 2.5: The CBOW architecture ...... 23
Figure 2.6: Continuous skip-gram architecture ...... 24
Figure 3.1: (a) a left neighborhood context with parameter τ=-n; (b) a right neighborhood context with parameter τ=+n, where n > 0 ...... 33
Figure 3.2: CBOW model diagram ...... 36
Figure 3.3: Skip-gram model ...... 38
Figure 3.4: Character n-grams example using the word "going" ...... 41
Figure 3.5: fastText model ...... 42
Figure 3.6: A more elaborated fastText classifier with hidden layer ...... 44
Figure 3.7: Sample corpora before preprocessing ...... 46
Figure 3.8: Preprocessing algorithm pseudo code ...... 47
Figure 3.9: Preprocessed dataset sample ...... 48
Figure 3.10: Model probability of a context word given a word w (colored red) ...... 49
Figure 3.11: Proposed architecture and approach ...... 52
Figure 3.12: Sample dataset with fastText labeling format ...... 53
Figure 4.1: t-SNE cosine distance between words listed on the right side of the picture. The words are names of people, languages, places, animals and foods ...... 59
Figure 4.2: Magnified version of Figure 4.1 showing how names of languages are closer to each other ...... 59
Figure 4.3: 2-D projection of 300-dimensional vectors of countries and cities ...... 61
Figure 4.4: PCA visualization showing both morphological and semantic relatedness ...... 65
Figure 4.5: t-SNE embedding of top 500 words (using default parameters) ...... 66
Figure 4.6: Magnified clusters clipped from Figure 4.5 ...... 67
Figure 4.7: t-SNE embedding of top 300 words ...... 67
Figure 4.8: PCA visualization showing word relationship ...... 70
Figure 4.9: 100-dimensional WE of OOV word ሰበርታምዕራ ...... 71
Figure 4.10: 100-dimensional WE of OOV word ቅድስታምረት ...... 71


LIST OF ABBREVIATIONS

ANN: Artificial Neural Network
CBOW: Continuous Bag of Words
CNN: Convolutional Neural Network
GloVe: Global Vectors
IDF: Inverse Document Frequency
NER: Named Entity Recognition
NLP: Natural Language Processing
NN: Neural Network
OOV: Out-of-Vocabulary
PCA: Principal Component Analysis
POS: Part of Speech
RNN: Recurrent Neural Network
SG: Skip-gram
SVM: Support Vector Machine
TF: Term Frequency
t-SNE: t-Distributed Stochastic Neighbor Embedding
WE: Word Embedding
XML: Extensible Markup Language


CHAPTER 1 INTRODUCTION

The distributional representation of words plays a crucial role in many natural language processing approaches. For words of a language to be processed and understood by machines, they need to be represented as, or converted into, real numbers. As numbers are easier for machines to operate on, words are mapped to real numbers. These numeric representations are produced by different, often complex, mathematical operations with a given dimensionality. Such representations are called word embeddings or word vectors.

Vector representation of words creates a link between two prominent fields of study: mathematics and linguistics. This linkage makes the analysis of words and other linguistic features easier using algebraic methods.

Word embeddings, as a distributional representation of words in a variable sized dimension, capture different linguistic characteristics such as word similarity and word analogy.

The fact that related words have related representation vectors gives us the chance to find similar words. That is, semantically similar and related words are mapped very close to each other in the vector space.

Another characteristic captured by word embeddings is word analogy. Words represented as vectors lend themselves to mathematical operations, chief among them addition and subtraction of word vectors. The famous relation KING - MAN + WOMAN ≈ QUEEN stems from exactly this property. The analogy goes like this: as KING is to MAN, QUEEN is to WOMAN. As studied by Mikolov et al. (2013a), proportional analogies can be drawn from such vector operations. In this thesis, intrinsic and extrinsic evaluation of word embeddings for Amharic is explored thoroughly.
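As an illustration of this property (not code from this thesis), the sketch below runs the same analogy query over a set of trained word vectors with Gensim's KeyedVectors API; the file name am_vectors.vec is a placeholder.

```python
# Hedged sketch: an analogy query over trained word vectors using Gensim.
# "am_vectors.vec" is a placeholder for vectors saved in word2vec text format.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("am_vectors.vec", binary=False)

# KING - MAN + WOMAN expressed as a most_similar query; the top hit is expected
# to be close to "queen" if the embedding has captured the analogy.
for word, score in vectors.most_similar(positive=["ንጉሥ", "ሴት"],
                                        negative=["ወንድ"], topn=5):
    print(word, round(score, 3))
```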

1.1 Statement of the Problems

Amharic is a morphologically rich Semitic language. Although it is one of the most widely spoken languages in Ethiopia, it lags behind in terms of computational linguistics.


Word representations for almost all languages of the globe have been proposed by (Mikolov et al., 2016) through Facebook's AI Research (FAIR) lab. The lab released an open-source library and model called fastText, which is dedicated to the tasks of word representation and text classification. It uses neural networks for word embedding, and the lab makes pre-trained models available for 294 languages, Amharic being one of them. fastText can be used to build word vectors using either the CBOW or the skip-gram (SG) model, and it is also an efficient method for text classification. Because of the complex morphology of Semitic languages, a large number of inflected forms are generated, which often cause unknown words to appear in the word representation (Tedla & Yamamoto, 2017). This problem is even worse for low-resource languages like Amharic, with little or no support in the form of annotated resources. For these reasons, the fastText algorithm is chosen for Amharic word embeddings.

Word embedding analysis involves both qualitative and non-qualitative analysis. In qualitative analysis, linguistic properties such as word similarity, word analogy and nearest neighborhood are studied. On the other hand, downstream tasks and non-qualitative factors such as text classification and NER are part of the analysis in the NLP arena.

Works analyzing and exploring word embeddings exist for different languages, but as Amharic is a low-resource language in terms of digitization, there have been few attempts on this topic.

Therefore, in this work the performance of fastText word embedding algorithms on the Amharic language is analyzed and explored. Qualitative analysis using word similarity, word analogy and nearest-neighbor features, and non-qualitative analysis using multiclass Amharic text classification on Amharic word embeddings, are the focus of this thesis.

1.2 Thesis Objectives

1.2.1 General objectives

The general objective of this study is to investigate, analyze and explore word embeddings for the Amharic language.

1.2.2 Specific objectives

The specific objectives of this study, in support of the general objective, are to:

- Analyze how different hyper-parameters of word embeddings achieve different accuracy levels on non-qualitative tasks

- Explore morphological linguistic features of Amharic through word embeddings, such as word similarity, word analogy and nearest neighbors

- Study Amharic sentence embedding as a sideline on the given model

- Collect and preprocess an unlabeled Amharic dataset

- Train a multiclass Amharic text classifier

- Experiment with and review different hyper-parameters for Amharic multiclass classification

- Investigate problematic cases, such as how embeddings reflect cultural bias and stereotypes

- Show how word embeddings act as a window onto history

1.3 Methods and Techniques

In this work, the fastText model is chosen for training word vector representations and for Amharic multiclass text classification. The following methods are applied in the course of the research.

1.3.1 Literature review

An extensive literature review has been conducted on concepts, tools, models, architectures and algorithms related to word embeddings and text classification. Related works on the subject, with a focus on word representation and multiclass classification, as well as a brief overview of and introduction to the Amharic language, are taken into consideration.


1.3.2 Tool selection

In this work fastText, a library from the Facebook AI Research (FAIR) lab for efficient learning of distributed word representations and text classification, Word2Vec by (Mikolov et al., 2013a), GloVe by (Pennington et al., 2014), PCA and t-SNE for dimensionality reduction and word embedding visualization, Gensim, a powerful NLP toolkit, and matplotlib for plotting have been used. Other Python libraries such as TensorFlow, Keras, NumPy and scikit-learn are deployed in the backend.
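As a minimal sketch of how these tools fit together (assuming the pre-trained Amharic fastText model is stored as wiki.am.bin, a file name not taken from this thesis), the vectors can be loaded and queried with Gensim as follows.

```python
# Hedged sketch: loading pre-trained fastText vectors for Amharic with Gensim.
# "wiki.am.bin" is an assumed file name for the released Amharic model.
from gensim.models.fasttext import load_facebook_vectors

wv = load_facebook_vectors("wiki.am.bin")

print(wv.most_similar("ንጉሥ", topn=5))   # nearest neighbours of "king"
print(wv.similarity("ንጉሥ", "ንግሥት"))     # cosine similarity of king and queen
print(wv["ሰበርታምዕራ"][:5])                # an OOV word still gets a vector via sub-words
```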

1.3.3 Data collection and preparation

An Amharic news dataset was collected from different websites, including Amharic news websites and the Amhara Mass Media Agency, a local media outlet in Ethiopia that mainly serves in Amharic. The dataset was manually annotated, preprocessed and cleaned.

1.3.4 Models

In this work, recent and popular models have been used for generating and experimenting with word embeddings: the neural-network-based CBOW and SG models (Mikolov et al., 2013a) from the fastText tool. CBOW (Continuous Bag-of-Words) tries to predict the target word from the surrounding words within the context window, while SG (Skip-gram) does the reverse, predicting the surrounding context words from the target word (Mikolov, Sutskever, Chen, Corrado, & Dean, 2013).
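A minimal sketch of training both models with the fastText Python bindings is shown below; the corpus path and hyper-parameters are illustrative placeholders, not the exact settings reported later in this thesis.

```python
# Hedged sketch: training CBOW and skip-gram embeddings with the fastText library.
# "amharic_corpus.txt" is a placeholder for a preprocessed, one-sentence-per-line corpus.
import fasttext

cbow = fasttext.train_unsupervised("amharic_corpus.txt", model="cbow",
                                   dim=300, ws=5, minn=3, maxn=6, epoch=5)
sg = fasttext.train_unsupervised("amharic_corpus.txt", model="skipgram",
                                 dim=300, ws=5, minn=3, maxn=6, epoch=5)

sg.save_model("am_sg.bin")
print(sg.get_nearest_neighbors("ንጉሥ", k=5))
```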

1.3.5 Evaluation and analysis

For evaluation and analysis, two dimensionality reduction techniques are used. The first is PCA (Principal Component Analysis), which focuses on capturing the principal components and dimensions of the data during visualization; it is a linear, deterministic algorithm (Feng, Xu, & Yan, 2012). The second is t-SNE (t-distributed Stochastic Neighbor Embedding), another popular dimensionality reduction technique, which focuses on preserving local neighborhoods in the data; it is a non-linear, non-deterministic algorithm.
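The following sketch shows how the two techniques can be applied to the most frequent words of a trained model with scikit-learn and matplotlib; the vector file name and plot settings are illustrative assumptions.

```python
# Hedged sketch: PCA and t-SNE projections of word vectors for visualization.
# "am_sg.vec" is a placeholder path to vectors in word2vec text format.
import numpy as np
import matplotlib.pyplot as plt
from gensim.models import KeyedVectors
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

wv = KeyedVectors.load_word2vec_format("am_sg.vec", binary=False)
words = wv.index_to_key[:300]                 # 300 most frequent words
X = np.array([wv[w] for w in words])

X_pca = PCA(n_components=2).fit_transform(X)                   # linear, deterministic
X_tsne = TSNE(n_components=2, metric="cosine",
              init="random").fit_transform(X)                  # non-linear, stochastic

plt.scatter(X_tsne[:, 0], X_tsne[:, 1], s=5)
for (x, y), w in zip(X_tsne, words):
    plt.annotate(w, (x, y), fontsize=6)
plt.show()
```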


1.4 Scope and Limitation

A manually annotated Amharic news dataset, prepared by the author, has been used for the multiclass text classification task. This dataset is small due to the absence of a pre-annotated Amharic corpus. For the rest of the tasks, such as the qualitative analysis, both pre-trained word vectors and newly trained vectors built from Amharic Wikipedia and Amharic books are used.

The work focuses on how models perform on Amharic language datasets. One extrinsic evaluation on downstream tasks is chosen for this work, i.e., text classification.

1.5 Significance of the Study

Word embeddings are a recent research area in the NLP community because they are crucial for various downstream tasks such as POS tagging, sentiment analysis, NER, text classification, syntax parsing and so on. Therefore, in order to study, design, analyze and improve those tasks, word embeddings should be thoroughly explored.

This work will therefore be an eye-opener for Amharic NLP research areas such as sentiment analysis. It will pave the way for improved text classification in many domains such as customer service, spam detection and document classification.

1.6 Thesis Outline

In this outline, a cursory glance of the topics presented in different chapters of this thesis is provided.

Chapter 1 presents the introduction and overall background of the thesis. It begins with a background study that gives a brief glance at the work presented. The motivation behind the work, the objectives to be met and the questions to be raised and addressed, along with the methodologies and tools required, are discussed in this chapter. The theoretical and brief historical background of word embeddings and related topics, the work of other researchers on different languages and the state-of-the-art methodologies, architectures and tools are reviewed, analyzed and presented in Chapter 2. The types, models, applications and usage of word embeddings in various NLP tasks are also given space in this second chapter.


A very brief overview of the Amharic language and of NLP-related work on the language is provided as well.

The third chapter focuses on methodologies, architectures and approaches that are planned to be utilized in carrying out this work. The ways to represent words in vectors, depth analysis of models chosen, corpus preparation and training, and methods of evaluation are the core concepts covered here.

After making ready all the tools, techniques, models and training requirements, experimentation and result analysis is the work presented in Chapter 4. Here experimental setups, evaluation metrics and the analysis of the results obtained are discussed.

Finally, Chapter 5 concludes the thesis by providing insightful recommendations and future works.


CHAPTER 2 LITERATURE REVIEW AND RELATED WORKS

In this chapter, relevant concepts and related literature on word embeddings are reviewed. Characteristics of languages in distributional representation, with a focus on the Amharic language, are also discussed.

2.1 Overview

The advent of deep learning in the scientific arena has had a significant effect on natural language processing (NLP). NLP enables machines to draw meaning from natural languages (Farzindar & Inkpen, 2015). It is a multidisciplinary area in which software that analyzes, interprets, understands and generates useful information from the natural languages used by humans is designed and built. Though the area has been around for a long time as a trending research topic, important developments and impressive milestones in NLP have been observed in recent years. Among the milestones capturing attention are Named Entity Recognition, Sentiment Analysis and Text Classification. These areas are briefly introduced below.

2.1.1 Named entity recognition

Named Entity Recognition (NER) is an Information Extraction task where important entities of a certain text, sentence, document or corpora are identified. Phrases that indicate person, organization, quantities, times, location are identified for the purpose of data mining and machine translation (Fu, 2009). The task of NER is, in short, to identify entities like person, location and organization.

The process in identifying those entities starts by identifying relevant nouns such as names and locations, and important facts such as dates and numbers that are mentioned in the given document. For example, given a statement S:

“Asrat Woldeyes was an Ethiopian surgeon, a professor of medicine at Addis Ababa University, and the founder and leader of the All-Amhara People's Organization.” In this statement, named entities are:


[Asrat Woldeyes]person, [Addis Ababa University]location, and [All-Amhara People’s Organization]organization.

The role of word vectors in NER emanates from the fact that NER requires annotation, which demands human effort, cost and time (Sienčnik, 2015), whereas word vectors do not require a pre-annotated dataset. Word vectors can be obtained by training on large unannotated corpora, which can help augment the training of small annotated datasets in downstream tasks such as NER. This in turn reduces the amount of annotated data necessary and enhances classification accuracy.

2.1.2 Sentiment analysis

The explosive growth of social media and the advance of technology have made interaction between people on the Internet viable and inevitable. While using social media and other streaming sites, people leave their opinions, reviews, tags, ratings and so on. People give their opinions and sentiments for different reasons: about products they use, groups they are fans of, institutions they are part of, their governments and social organizations, and more. People on social media also review products and services. Companies and organizations constantly push their users to react and give feedback on the services and products they offer so as to know the opinion of their users. Reviews and opinions have a tremendous effect on individuals when taking decisions, say in choosing political candidates, buying branded items and other everyday activities. These reviews and opinions can nowadays be found in commentaries, blogs, micro-blogs, social media comments, reactions and postings.

The reviews, opinions and other activities of users have polarities: negative, positive or neutral. Every word written and every utterance spoken holds sentiment information along with its context. The question is how those polarities can be identified and analyzed for different uses.

Sentiment analysis, also called opinion mining, addresses this problem by collecting, identifying, analyzing and synthesizing the contextual polarity of texts, reviews, tags and other user activities. As the name indicates, sentiment analysis is the process of analyzing intentions and sentiments in a given text, document or word. It might classify negative


and positive senses (binary sentiment analysis) or might include neutral sentiments. This has tremendously been used for opinion mining, customer reviews, product reviews (Turney, 2002), document classification (Pang et al., 2002) and so on.

Before the advent of deep learning models, sentiment analysis approaches used traditional classification models such as Naive Bayes (Narayanan, Arora, & Bhatia, 2013) and Support Vector Machines (Cortes & Vapnik, 1995). The latter model, Support Vector Machines, is also used in pattern recognition (Kirsal Ever & Dimililer, 2018). Nowadays, deep learning models perform well on NLP tasks.

All of the above tasks, and others not listed here, already operate on linguistic elements such as words. Be it sentiment analysis, NER or any other natural language processing task, they all strive to manipulate words. Humans can understand raw words and texts quite intuitively. Words, when spoken or written, are easy for humans but difficult for machines to understand, analyze and operate with. These words should somehow be converted into machine-readable formats for ease of manipulation and calculation. The words, texts or documents should be represented, without altering their semantics, syntax and context, in a machine-readable form so that machines can handle operations such as classification, analysis and recognition. According to (Firth, 1935) and (Harris, 1954), the contexts of a word are essential to infer its meaning, and the contexts in which two similar words are used are also observed to be very similar. To exploit the concepts, properties and other features of texts, words, documents and corpora at large, word embedding is the most widely used natural language tool (Bengio et al., 2003; Mikolov et al., 2010; Mikolov et al., 2013c).

2.1.3 Text classification

Text classification dates back to the early 60's but became popular in the early 90's (Sebastiani, 2002). With the fast growth of online data, text classification is becoming one of the key tasks of NLP. Information that flows over social media, or through other media such as books and videos, must be handled and organized. Tasks such as using data efficiently, classifying news stories by author or topic, classifying support tickets by urgency, tagging products by category and easing search in storage are tackled using text classification.


Text classification is among the fundamental tasks in NLP for areas like sentiment analysis, intent detection and smart replies. The goal of text classification is to classify documents (such as reviews, opinions, news, messages, posts, replies, emails, etc.) into specified categories. It involves assigning predefined tags to free-text documents (Zhang et al., 2015). The number of tags or categories can vary from two (binary classification) to n (multi-label or multi-class classification).

Unstructured and unlabeled raw data in the form of text can be found anywhere: in social networking sites and media, chat conversations, email messages, web pages and more. Due to its unstructured nature, however, extracting insights and useful information from such raw data takes time and energy. These days, text classification is used by businesses for structuring, automatic labelling and extraction, and organizing documents and texts in a cost-efficient way for process automation and enhanced decision-making.

For example, given a text t, a classifier can take the content of the text t, analyze its content and then automatically assigns relevant categories.

Figure 2.1: Text classification flow

2.1.3.1 General definition of classification

The general text classification problem can formally be defined as the process of approximating an unknown category assignment function F : D × C → {0,1}, where D is the set of all possible documents and C is the set of predefined categories. The value of F(d, c) is 1 if the text, document or data item d belongs to the category c and 0 otherwise (Feldman & Sanger, 2007). The approximating function F′ : D × C → {0,1} is the classifier, and it should produce results as "close" as possible to the actual category assignment function F.


Figure 2.2: Classification Block Diagram

2.1.3.2 Multilabel versus multiclass classification

Based on the properties of the function F in Section 2.1.3.1, classification can be distinguished into multilabel and multiclass classification. In multilabel classification, labels might overlap and data may belong to any number of labels. It assigns to each sample a set of target labels; that is, it predicts properties of a data point that are not mutually exclusive, such as the topics that are relevant for a sample. A text might be about any of football, athletics, baseball or chess at the same time, or none of these. In general, multilabel classification assigns a text, data item or sample to one label, more than one label, or no label at all.

However, if the text, data, sample or document belongs to exactly one class or label, it is known as multiclass classification. Here each sample belongs to exactly one category, as the classes or labels are mutually exclusive; each sample is assigned one and only one label.

In this work, the second type, i.e., multiclass classification, is chosen as the evaluation technique.

2.1.3.3 Approaches to text classification

Different approaches to text classification (sometimes called text categorization) have evolved over time. Before automation, day-to-day tasks were manual, and text classification was no exception.


The first successful approach used for text classification was to manually build classifiers based on knowledge engineering (KE) techniques (Krabben, 2010). This technique requires manual annotation, parsing, syntax checking and syntactic rules or patterns. The drawback of this approach, however, is that it depends on the knowledge of an expert to hand-craft rules, which hinders the portability and maintenance of the system.

According to (Krabben, 2010), machine learning techniques became increasingly popular for the text classification task in the 90's. These techniques automate classification by automatically building a classifier which learns the characteristics of each category from a set of labeled datasets. This approach is also not without drawbacks: machine learning approaches need to be trained on predefined categories, and their efficiency depends on the quality of the training datasets.

In general, text classification can be done either manually or automatically. In the former, a human annotates, interprets and categorizes the text. This gives quality results at the cost of time; it is expensive, time-consuming and laborious, and the speed, diligence and efficiency of the humans involved affect the result. The latter applies NLP, deep learning and other methods to automate classification in a faster and more cost-effective way. Within this second way, there are different approaches to classify text automatically, ranging from rule-based systems to machine learning based systems; some are in between, called hybrid systems.

In rule-based approaches, handcrafted linguistic rules articulated by linguists are used to organize text. The system uses those rules to semantically or syntactically identify relevant tags based on content. As the rules are articulated by people, these approaches can be improved over time. The problem with these models is that they depend on the skill of the linguist and on knowledge of the domain; most of the time, experts and knowledgeable persons are required to use rule-based approaches.

With machines' ability to learn, the need for manually crafted rules is questioned. Machines can learn and classify texts based on previous observations. Machines are given pre-labeled training data so that machine learning algorithms can learn the various correlations in texts and that a particular tag is expected for a particular input.


In machine learning based systems, feature extraction is conducted before training a classifier. Feature extraction is the process of transforming the data into a numerical representation, i.e., into vectors. Different approaches such as bag-of-words and n-gram models can be used for this transformation. The machine learning algorithm is then fed with the vectors and the tags to produce a classification model.

Figure 2.3: Steps in machine learning based Classification

2.1.3.4 Text classification algorithms

There are various machine learning algorithms for text classification modeling. The popular ones are Naïve Bayes, SVM and deep learning.

a) Naïve Bayes

Naïve Bayes is a statistical algorithm based on Bayes' theorem. In this model, each feature is considered independent and the conditional probabilities of the occurrence of words are computed. In text classification, Bayes' theorem is used to calculate the probability of each label for a given text, and the label with the highest probability is output (Pawar & Gawande, 2012).

Among the Naïve Bayes family of algorithms, Multinomial Naïve Bayes (MNB) is the one that assumes a multinomial distribution of features. MNB is a probabilistic model that computes class probabilities for a given dataset using Bayes' rule. Assume a vocabulary of N words and a set of classes C. Then MNB assigns a test sample d_i to the class that has the highest probability P(c|d_i), which is given by:

P(c \mid d_i) = \frac{P(c)\,P(d_i \mid c)}{P(d_i)}, \qquad c \in C \qquad (2.1)

where

P(d_i) = \sum_{k=1}^{|C|} P(k)\,P(d_i \mid k) \qquad (2.2)

The class prior P(c) is the ratio of the number of samples belonging to class c to the total number of samples. The probability of obtaining a sample like d_i in class c is represented as:

P(d_i \mid c) = \Big(\sum_{n} f_{ni}\Big)! \;\prod_{n} \frac{P(w_n \mid c)^{f_{ni}}}{f_{ni}!} \qquad (2.3)

where f_ni is the count of word n in the test sample d_i and P(w_n|c) is the probability of word n given class c, which can be computed using

P(w_n \mid c) = \frac{1 + F_{nc}}{N + \sum_{x=1}^{N} F_{xc}} \qquad (2.4)

where F_xc is the count of word x in all the training samples belonging to class c, and the Laplace smoothing technique is used to prime each word's count with one to avoid the zero-frequency problem (Kibriya et al., 2004). The final normalized, computationally inexpensive equation is:

P(d_i \mid c) = \alpha \prod_{n} P(w_n \mid c)^{f_{ni}} \qquad (2.5)

where α is a constant that is dropped due to the normalization using the Laplace estimator.

b) Support vector machines

Support Vector Machines are algorithms that divide a space into subspaces; their objective is to find the line or hyperplane with the maximum margin in an N-dimensional space that uniquely classifies the data points.

SVM determines the optimal decision boundary between vectors that belong to a given tag and vectors that do not. The algorithm draws the best "line" or hyperplane that divides the vector space into two subspaces: one for the vectors belonging to the given tag and the other for those that do not.

SVMs try to position a separating line in a high-dimensional feature space, into which the input data points are projected using kernel functions, such that it separates the data points belonging to the various classes as well as possible (Cortes & Vapnik, 1995).

Figure 2.4: Support vector machines

In Figure 2.4 there are data represented by circles and squares. New, unclassified data can be assigned to either circles or squares (both used as labels or tags) using SVMs. To do this, SVMs use a separating line (a hyperplane if multi-dimensional) to split the space into a circle zone and a square zone (see the second diagram in Figure 2.4). The distance to the nearest point on either side of the separating line is known as the margin, and SVM tries to maximize this margin. To be optimal, the separating line or hyperplane needs to satisfy two requirements: (1) cleanly separate the data, with circles on one side of the line and squares on the other, and (2) maximize the margin. The first constraint, i.e., clean separation of the data, is not easy to meet in the real world. SVM therefore deals with this problem by softening the definition of "separate". This is done by allowing a few mistakes, hence a loss function that adds a cost for misclassification. The other way is to increase the number of dimensions to create a non-linear classifier.
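As a hedged illustration (not the pipeline of this thesis, which uses fastText), the two classical learners discussed in this section can be combined with TF-IDF features in a few lines of scikit-learn; the toy documents and labels below are invented for the example.

```python
# Hedged sketch: Multinomial Naive Bayes and a linear SVM over TF-IDF features.
# The tiny toy dataset is purely illustrative.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

texts = ["the national team won the match",
         "a new phone model was released",
         "the striker scored twice in the game",
         "the app update adds a dark mode"]
labels = ["sport", "technology", "sport", "technology"]

nb_clf = make_pipeline(TfidfVectorizer(), MultinomialNB()).fit(texts, labels)
svm_clf = make_pipeline(TfidfVectorizer(), LinearSVC()).fit(texts, labels)

print(svm_clf.predict(["the team plays another match tomorrow"]))  # likely ['sport']
```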


SVM is a binary classifier developed by (Cortes & Vapnik, 1995). The algorithm maps input vectors to a very high-dimensional feature space where the data can be optimally separated by a single hyperplane; optimal here means that the separating hyperplane has the widest possible margin to any of the training data points. The two most important issues SVM takes into consideration are high-dimensional input spaces and linearly separable classification problems.

c) Deep learning

Different deep learning models such as Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN) and Hierarchical Attention Networks (HAN) have been found very effective for text classification.

Deep learning models have gained popularity in computer vision (Krizhevsky et al., 2012) and speech recognition (Graves et al., 2013) in recent years. Within NLP, much of the work with deep learning methods has involved learning word vector representations through neural language models (Bengio et al., 2003; Mikolov et al., 2013b) and performing composition over the learned word vectors for classification (Collobert et al., 2011).

A CNN is a class of deep, feedforward ANN that uses a variation of the multilayer perceptron designed to require minimal preprocessing. ANNs are nowadays widely used not only in text processing but also in image processing applications; implementing ANNs in image processing applications is now trending (Khashman & Dimililer, 2008). This network exploits the spatial structure of data to learn from it so that useful output can be obtained. CNN models, originally invented for computer vision, utilize layers over word vectors to extract local features. In NLP, the input features usually take the form of word vectors. The input to a CNN, given a tokenized text T = {t_1, ..., t_N}, is a text matrix A whose i-th row is the word vector representation of the i-th token in T. The matrix A can be denoted as A ∈ R^{N×d}, where d is the dimensionality of the word vectors.

CNNs use convolutional layers, which act like a sliding window over a matrix; they consist of many layers of convolutions with non-linear activation functions. In a CNN, the output is computed over the input layer from local connections, and each layer then applies different kernels, usually thousands of filters, and combines their results.


According to (Kim, 2014), CNNs perform remarkably well for classification using different tunings of hyperparameters. CNNs utilize the distributed representation of words by converting the tokens of each sentence into vectors, which together form a matrix used as the input. However, these models require setting hyperparameters and regularization parameters (Zhang & Wallace, 2015). Other issues such as higher training time, expensive configuration cost and the vast space of possible model architectures and hyperparameter settings are counted as downsides of CNNs (Zhang & Wallace, 2015).
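A minimal Keras sketch of such a CNN text classifier (in the spirit of this architecture, with illustrative vocabulary size, sequence length and filter settings rather than tuned values) is given below.

```python
# Hedged sketch: a simple CNN text classifier over embedded word sequences in Keras.
# All sizes below are illustrative placeholders.
import tensorflow as tf
from tensorflow.keras import layers

vocab_size, seq_len, embed_dim, num_classes = 20000, 200, 100, 10

model = tf.keras.Sequential([
    tf.keras.Input(shape=(seq_len,)),
    layers.Embedding(vocab_size, embed_dim),     # rows of the text matrix A
    layers.Conv1D(128, 5, activation="relu"),    # sliding window over word vectors
    layers.GlobalMaxPooling1D(),                 # keep the strongest feature per filter
    layers.Dense(64, activation="relu"),
    layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```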

The other family of deep learning methods is the Recurrent Neural Network (RNN). An RNN is a class of ANN in which connections form a recurrent (directed) graph along a sequence. They are networks with loops that aid the persistence of information. RNNs improve on traditional neural networks, which consider all inputs independent of each other, by gaining memory, capturing information in arbitrarily long sequences and predicting the previous and next elements in the sequence. They are deep in the temporal dimension and are used in time sequence modeling. The role of RNNs in text classification is to recurrently and sequentially process the words in a sentence and map the dense, low-dimensional representations of the words into a low-dimensional vector.

One of the features of RNNs is their capability to improve time complexity and analyze text word by word, thereby preserving the context of the text. This ability arises from their way of capturing the statistics of a long text. In this respect, RNNs fall short of balancing the roles of earlier words and recent words. This issue can be overcome by introducing the long short-term memory (LSTM) model.

The Hierarchical Attention Network (HAN) was designed to capture document hierarchies (words → sentences → paragraphs → articles → document) and the context of the words and sentences in a document. The words in a document are not all treated equally; as the word "attention" suggests, and since not all words contribute equally to the representation of a sentence's meaning, the importance of words should be weighed by introducing an attention mechanism. The attention mechanism rewards the sentences that hold strong features so as to properly classify a document.

Generally, deep learning approaches start with a sequence of words as input, in which each word is presented as a one-hot vector. The words in the sequence are then projected into a continuous vector space after being multiplied by a weight matrix, which forms a sequence of real-valued vectors. The sequences are then fed into a deep NN, which processes the word sequence in multiple layers, resulting in a prediction probability. "This whole network is tuned jointly to maximize the classification accuracy on a training set. However, one-hot-vector makes no assumption about the similarity of words, and is also a very high dimensional" (Hassan & Mahmood, 2017, p. 1108).

The above-mentioned models, such as that of (Kim, 2014), achieve good performance in practice, but they are slow at training and testing time (Joulin et al., 2016). To alleviate this limitation, (Joulin et al., 2016) came up with another approach called fastText. This approach can be used both for sentence, document or text classification and for word representation.
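A hedged sketch of the fastText classifier interface follows; the training and validation file names, the example sentence and the hyper-parameters are placeholders, and the files are assumed to follow fastText's __label__<category> <text> format.

```python
# Hedged sketch: supervised text classification with the fastText library.
# "am_news.train" / "am_news.valid" are placeholder files in fastText label format.
import fasttext

clf = fasttext.train_supervised("am_news.train",
                                epoch=25, lr=0.5, wordNgrams=2, dim=100)

n, precision, recall = clf.test("am_news.valid", k=1)   # precision/recall at k=1
print(n, precision, recall)

print(clf.predict("ብሔራዊ ቡድኑ ዛሬ ይጫወታል", k=2))           # top-2 predicted labels
```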

2.2 Word Embedding

The idea of word embeddings and representations has its roots in linguistics and the philosophy of language, especially in the works of (Firth, 1935) and (Harris, 1954). For example, (Osgood, 1964) used hand-crafted feature representations to quantify semantic similarity. In the early 1950s, scholars used the semantic differential technique to measure the meaning of concepts.

Methods using contextual features were later devised in the 1990s in different thematic study areas. The best known was Latent Semantic Analysis (LSA). LSA is a technique, in NLP in general and in distributional semantics in particular, used to analyze relationships between documents and the words inside them by producing a set of concepts related to those documents and words. The technique assumes the distributional hypothesis, which states that related words occur in similar pieces of text, and constructs a matrix containing word counts per paragraph from a corpus. It utilizes a technique called Singular Value Decomposition (SVD) to reduce the number of rows in the matrix while keeping the linguistic features intact. Linguistic features such as the contextual-usage meaning of words are extracted and represented by statistical computations applied to a corpus of text. This helps to estimate continuous representations of words.
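As a small, hedged illustration of the LSA idea, a term-document count matrix can be reduced with truncated SVD using scikit-learn; the three documents below are toy examples.

```python
# Hedged sketch: LSA as a truncated SVD over a term-document count matrix.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["the king ruled the land",
        "the queen ruled the land",
        "the farmer ploughed the field"]

counts = CountVectorizer().fit_transform(docs)          # word counts per document
concepts = TruncatedSVD(n_components=2).fit_transform(counts)
print(concepts)   # each document as a point in a 2-D latent "concept" space
```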

At roughly the same time, Latent Dirichlet Allocation(LDA) was proposed by (Pritchard et al., 2000) in the context of population genetics. This scheme was rediscovered in 2003 in the


context of machine learning by (Blei, 2003). LDA is a generative probabilistic model of a collection of documents and words and/or phrases. According to the authors, LDA can be used for collections of any discrete data, be it DNA and nucleotides, molecules and atoms or keyboards and crumbs.

Other well-known models developed on neural networks that used contextual representations are Self-Organizing Maps (SOM) and Simple Recurrent Networks (SRN). The former, developed by (Kohonen, 1982), uses unsupervised, competitive learning to produce low-dimensional representations of high-dimensional data, while keeping similarity relations between data items intact. The latter was conceived and used by (Elman, 1990). It is a version of the backpropagation NN that processes sequential input and output. It is a three-layer NN in which a copy of the hidden-layer activations is saved at each time step and fed, together with the next input, back into the hidden layer at the following time step; this context layer is fully connected to the hidden layer. Since the network only keeps this copy of its past state, the standard backpropagation algorithm can be used for training. An SRN "can be trained to read a sequence of inputs into a target output pattern, to generate a sequence of outputs from a given input pattern, or to map an input sequence to an output sequence" (Miikkulainen, 2010).

Though the idea behind word embeddings was present in the early works of (Harris, 1954) and (Firth, 1935), the appearance of automatically generated contextual features and of deep learning methods for NLP gave word embeddings the chance to become one of the most popular research areas of the early 2010s (Mandelbaum & Shalev, 2016). Since then, various developments and different embedding models have evolved.

Latent Semantic Analysis (LSA) for information retrieval, Self-Organizing Maps (SOM) - a distributional way to map words to two dimensions such that similar words are closer to each other (Ritter & Kohonen, 1989) - for competitive learning and visualization, Simple Recurrent Networks (SRN) for contextual representations, Hyperspace Analogue to Language (HAL) for inducing word representations (Lund & Burgess, 1996), and others were developed both in computational linguistics and in ANNs. Later developments use these


models as a basis. The various refined models vary based on the type of contextual information they employ. Some use documents as contexts, others use words etc.

Later, (Collobert & Weston, 2008) showed the power of pre-trained word vectors as a tool for downstream tasks ranging from structural linguistic features, such as POS tagging, to the meaning and logic behind languages, such as word-sense disambiguation. The authors also introduced a single CNN architecture that outperformed older systems.

However, it was (Mikolov, Chen, et al., 2013) who eventually popularized word embeddings with the release of word2vec. Following the release of the word2vec toolkit, word embeddings became the latest trend in natural language processing and sparked a huge amount of interest in the topic.

In 2014, Pennington et al. released GloVe, another model for unsupervised learning of word representations, which brought word embeddings into mainstream NLP. This model builds a co-occurrence matrix using the global statistics of word-word co-occurrence. GloVe (which stands for Global Vectors) combines the strengths of the word2vec skip-gram model for word analogy tasks with matrix factorization methods for global statistical information.

Bengio et al. (2003) were the first to coin the term word embeddings. In the literature, word embedding goes by different names, such as distributional semantic model, distributed representation and semantic vector space. In this thesis, the popular term word embedding will be used.

Word embedding is the task of converting words, strings or characters into machine-readable formats, specifically vectors. It is a means of representing a word as a low-dimensional vector which preserves the contextual properties of words (Mikolov, Sutskever, et al., 2013).

A word embedding, as the name indicates, embeds words into a vector space. It associates each word with a vector in a manner that preserves the relationships between words; relatedness between words is reflected in the relations between their vectors. In this vector representation, "similar words are associated with similar vectors" (Collobert & Weston, 2008; Mikolov et al., 2013c). The vectors associated with words are called word embeddings, also known as word vectors (Basirat, 2018).


Words are converted into real numbers; therefore, a word embedding can be described as a vector representation of a word.

It is used in various deep learning and natural language processing (NLP) tasks, such as sentiment analysis, caption generation (Devlin et al., 2015), named entity recognition (Turian, Bengio, Ratinov, & Roth, 2010) and machine translation (Bahdanau et al., 2014).

2.2.1 Types of word embedding

Word embeddings are classified into two broad categories: a) frequency based embedding and b) prediction based embedding. These types are discussed as follows.

a) Frequency based embedding

In frequency-based embedding, various vectorization methods are employed. Among them, the widely known ones are the count vector, the TF-IDF (Term Frequency-Inverse Document Frequency) vector and the co-occurrence vector.

In the count vector method, the number of times each word appears in each document is counted. For example, suppose there are D documents and T different words across all documents (called the vocabulary). Then the size of the count vector matrix will be D × T.

This method faces two main problems. First, the size of the vocabulary, and hence the dimension of the matrix, would be very large for a bigger corpus. For big data, with millions of documents, hundreds of millions of unique words can be extracted; the matrix would therefore be very sparse and inefficient for computation. Second, there is no clear way to score each word, whether by its frequency or just by its presence. On this second point, if word frequency is used, then in real-life corpora the least important tokens, such as stop words and punctuation marks, are the most frequent ones, which poses another problem. For this case, TF-IDF vectorization is the solution.

TF-IDF stands for Term Frequency - Inverse Document Frequency. Term frequency (TF) is the number of times a term or word occurs in a document, i.e., its frequency of occurrence (Salton & Buckley, 1988). Term frequency alone, however, cannot solve the problem that the count vector faces due to the most frequent but least relevant terms in documents. IDF diminishes the weight of the most frequent terms in a document and increases the weight of terms that are rare. IDF captures how informative a word is, i.e., whether the word is a common word or not. It is the logarithm of the ratio of the total number of documents to the number of documents in which the term t appears, as shown in the following equation.

\mathrm{idf}(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|} \qquad (2.6)

where N is the number of documents in the corpus, N = |D|, and

|\{d \in D : t \in d\}| is the number of documents in which the term t appears (i.e., tf(t, d) ≠ 0).

If the term t is not in the corpus, the denominator is adjusted to 1 + |\{d \in D : t \in d\}| to avoid division by zero.

Then

\mathrm{tf\text{-}idf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D) \qquad (2.7)
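The sketch below computes the idf of equation (2.6) directly on three toy documents and contrasts it with scikit-learn's TfidfVectorizer, which implements a smoothed variant of the same idea; the documents are invented for illustration.

```python
# Hedged sketch: idf as in equation (2.6), plus scikit-learn's (smoothed) TF-IDF.
import math
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["ኢትዮጵያ አዲስ አበባ", "ኢትዮጵያ ባህር ዳር", "ጎንደር ባህር ዳር"]

def idf(term, documents):
    n_containing = sum(1 for d in documents if term in d.split())
    denom = n_containing if n_containing else 1   # avoid division by zero, as in (2.6)
    return math.log(len(documents) / denom)

print(idf("ኢትዮጵያ", docs))            # the term appears in 2 of the 3 documents
print(TfidfVectorizer().fit_transform(docs).toarray())  # smoothed variant of (2.6)-(2.7)
```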

The other method worth discussing is the co-occurrence matrix. It captures the extent to which words occur together, so that relationships between words are also captured. This is done simply by counting how often words are found together in a corpus.

b) Prediction based embedding

Frequency based methods have been used for many natural language tasks such as sentiment analysis and text classification. However, after the introduction of word2vec (Mikolov et al., 2013b) to the NLP community, frequency based methods were shown to have limitations. Vo & Zhang (2016), in their work on learning sentiment lexicons, advocated not counting but predicting. Therefore, prediction based methods are becoming the state of the art for tasks performed using word embeddings, such as word similarity and word analogy (Mikolov et al., 2013c).

These methods are associated with the advent of neural network architectures. In this perspective, (Mikolov et al., 2013c) introduced two neural network architectures for word


vector computation. The authors' aim was to introduce methods for learning high-quality word vectors from big datasets. According to the authors, these model architectures are very effective in minimizing computational complexity by optimizing the hidden layer of the model; to boost performance, the models were designed with a simple projection layer instead of a hidden layer.

One of the proposed architectures is a feed-forward NN like the language model of (Bengio et al., 2003), but with the non-linear hidden layers removed. This model is known as the continuous bag-of-words (CBOW) model. It learns an embedding by predicting the target word based on nearby words, i.e., the surrounding words that determine the context. Basically, in the CBOW model, the average of the vectors of the surrounding words is given to the neural network to predict the target word, which appears at the output layer. The architecture is dubbed a bag-of-words model because the order in which words occur does not influence the prediction. Words before the target (the history) and words after the target (the future) are evaluated, i.e., their vectors are averaged. The model architecture is shown in Figure 2.5.

Figure 2.5: The CBOW architecture

The second architecture introduced by (Mikolov et al., 2013c) is the continuous SG model. This method is almost the inverse of the above method, but instead of predicting the target word, it predicts the surrounding words. This method, given the target word, tries to predict


the context, or nearby words. For a sequence of words, the continuous SG model takes the word in the middle of the sequence as input and predicts the words within a window-size range before and after the input word (Basirat, 2018). The architecture of the continuous SG model is depicted in Figure 2.6.

Figure 2.6: Continuous skip-gram architecture

Of the two architectures, the CBOW model performs better in tasks involving small datasets because it treats the entire context as one observation and smooths over the distributional data at the averaging stage. On the other hand, for huge datasets the SG model is better and more fine-grained, and it generally outperforms CBOW. Mikolov et al. (2013c) showed that skip-gram works better on semantic and worse on syntactic tasks.

2.3 Word Embedding Based Models

In the literature, several techniques have been proposed to build word embedding models (Basirat, 2018). The most popular are word2vec, GloVe and fastText. Each embedding model is discussed thoroughly in Chapter 3, Section 3.2; here a brief overview is provided.

a) Word2vec

Word2vec, a shallow model developed by Mikolov, Chen, et al. (2013), was one of the first neural models for efficient training of word embeddings. It learns low dimensional vectors and predicts words based on their context using the two famous neural models: CBOW and SG. The introduction of these two simple log-linear models drastically reduces the time complexity and increases the scalability and reliability of training word embeddings. Word2vec starts with a set of random word vectors. It scans the dataset in an orderly fashion, always keeping a context window around the word it is currently visiting. Word2vec ties target words and context very tightly to observe how they behave throughout the corpus. The algorithm computes the dot product between the target word and the context words and optimizes this metric using Stochastic Gradient Descent (SGD). Each time two words are encountered in a similar context, their link, or spatial closeness, is reinforced. The more evidence is found while scanning the corpus that two words are similar, the closer they will be.

The problem here is that the model only provides positive reinforcement to pull vectors closer. With a huge corpus, this would eventually lead to a state in which all vectors are concentrated in the same position. To address this issue, word2vec first proposed Hierarchical Softmax and later Negative Sampling. The latter is simpler and has been shown to be more effective. The basic premise is that each time the distance between two vectors is minimized, a few random words are sampled and their distance to the target vector is maximized. This way, it is ensured that dissimilar words stay far from each other.

b) GloVe

Amongst word embedding models, like word2vec, GloVe is a well-known algorithm. The aim of GloVe is basically to create word vectors that capture meaning in vector space by taking advantage of global count statistics instead of only local information. GloVe learns embeddings through a co-occurrence matrix and weights the loss based on word frequency.

c) fastText

fastText is a text classification and word representation model that utilizes unsupervised learning techniques to build word representations. It is an extension of word2vec which views word representation from a different angle. The problem of predicting context words in skip-gram can be tackled as a classification task.


Thereby, the fastText model takes into account the real make-up of words, such as the characters from which a word is composed, and introduces the idea of modular embeddings. fastText represents sentences as a bag of words and trains a classifier (Joulin et al., 2016). One of the peculiar features of fastText is its ability to generate vectors for out-of-vocabulary words, including unknown words and tokens. fastText considers not only the word itself but also groups of characters from that word and the subword information, such as character unigrams, bigrams, trigrams, etc., when learning word representations.

2.4 Amharic and Amharic Word Embeddings

In this section, the Amharic language is discussed with respect to word embeddings and natural language processing in general. Some aspects of the language, and the history and background of Amharic, are also included.

2.4.1 Overview of Amharic

Amharic (Amharic: አማርኛ, Amarəñña) is the official language of Ethiopia. It is the second most widely spoken Semitic language in the world, next to Arabic. In Ethiopia, it is the largest language, with a rich literary history and its own script.

The Amharic script is called Fidel. Unlike Arabic, it runs from left to right and consists of 34 basic characters, each having seven forms for each consonant-vowel combination. It has 4 labiovelar characters (though variants of other characters are also used nowadays), and about 20 more labialized consonants extended from the basic characters. In total, Amharic has more than 270 characters, each representing a consonant+vowel sequence.

Amharic is spoken by more than 90 million people (Negga, 2000), some as their first language and the rest as their second language. The majority of monolingual Amharic speakers are the Amhara people (the name Amharic, or Amarəñña, is derived from the name of the people of Amhara, ʾÄməḥära, meaning free or independent people) of Ethiopia. Furthermore, a great number of monolingual Amharic speakers live in bigger towns and administrative centers all over the country (Appleyard, as cited in Meyer, 2006). As more Ethiopians live outside their home country, the number of Amharic speakers in different countries of the world is also growing. In Washington DC, Amharic has been granted the status of one of the six non-English languages (Bernstein et al., 2014).

Amharic became the royal language of Ethiopia and was made the national (vernacular) language of the state during the reign of Emperor Yekuno Amlak, c. 1270 AD. Amharic is influenced both by Ge'ez (the liturgical language of the Ethiopian Orthodox Tewahedo Church, an ancient Christian Church established in 34 AD) and by Cushitic languages such as Agew.

Ethio-Semitic languages are Semitic languages spoken mainly in Ethiopia and modern-day Eritrea. They include Amharic, Geez, Tigre, Tigrinya, Argobba (closely related to Amharic), Harari (or Adare, spoken in Harar), Gurage (a cluster of at least twelve dialects) and Gafat (almost extinct).

According to Woodard (2008), most of the Semitic languages now spoken in Ethiopia are considered sister (or rather niece) languages. The author puts Amharic, Geez, Tigrinya and other Semitic languages under an Ethio-Semitic family. More specifically, according to Armbruster (1908), Amharic is a niece to Geez (sometimes called Ethiopic). The author further expressed that Proto-Ethio-Semitic evolved and split into Southern Semitic (Amharic) and Northern Semitic (Geez) or their intermediaries. Amharic and Geez come from the same root and are offspring of a common Ethio-Semitic proto-language.

Though the language has tremendous resources in terms of literature, novels and other linguistic materials, and an abundance of both electronic and non-electronic documents, it is considered one of the low-resource languages for natural language processing tasks. It has very few computational linguistic resources.

Amharic is under-resourced and has very few computational linguistic tools or corpora, even though, according to Gambäck et al. (2009), it is spoken nation-wide and is the lingua franca of Ethiopians (Weninger, 2011).

2.4.2 Amharic word embeddings

Word embeddings have been generated for different languages, and they have mainly been evaluated using different techniques. Tripodi & Pira (2017) analyzed the performance of skip-gram and continuous bag of words on the Italian language. They adopted a word analogy test and evaluated the generated word embeddings. The experiments were conducted by fine-tuning different hyper-parameters such as the size of the vectors, the window size of the word context, the minimum number of word occurrences and the number of negative samples. They found out that, due to the rich morphological complexity of the Italian language, increasing the number of dimensions and negative examples improves the performance of the two models in terms of semantic relationships, while syntactic relationships are negatively affected by the low frequency of terms. Their work investigated major issues such as how different hyper-parameters achieve different accuracy levels in relation-recovery tasks, morpho-syntactic and semantic analysis, and qualitative analysis of problematic cases.

A work on Croatian, one of the morphologically rich languages spoken in the Republic of Croatia, by Vasic & Brajkovic (2018) showed that a pre-trained fastText model produces the best output and that fine-tuning the parameters can provide even better results. The authors used several corpora for training, and their experiments on recent models like word2vec and fastText showed that the results are poor for morphologically rich languages such as Croatian. They showed that the fastText (pretrained CBOW) approach produced better results, possibly because the subword information used by fastText takes care of the morphology of the words (Vasic & Brajkovic, 2018). On the same language, Svoboda & Beliga (2018) performed an evaluation, this time adding specific linguistic aspects of the Croatian language. They did a comparative study of word embeddings for Croatian and English. The comparison in their experiment showed that the models for Croatian do not render as good a result as those for English.

The word embeddings of Polish, a highly inflectional language spoken primarily in Poland, were tested by Mykowiecka et al. (2017) using the word2vec tool, adjusting various parameters for tasks like synonym and analogy identification. They reported that word embeddings can be used for linguistic analysis such as similarity and analogy for Polish words and that the efficiency of the method highly depends on the dataset and parameter tuning.

Arabic is one of the most widely spoken Semitic languages in the world and is closely related to Amharic. Many researchers have tried to analyze and evaluate Arabic word embeddings. One is the work of Elrazzaz et al. (2017), who performed a methodical evaluation of Arabic word embeddings. They carried out both intrinsic and extrinsic evaluation of the embeddings. They described intrinsic evaluation as one that mostly relies on word similarity correlation datasets and analogy questions and describes the linguistic features in the low-dimensional embedding space, while extrinsic evaluation assesses the quality of the embeddings as features in models for downstream tasks like POS tagging and text classification. The authors compared CBOW and SG with another word embedding model called Polyglot, contributed by Al-Rfou et al. (2013), and found the former models superior. Soliman et al. (2017) have also crafted a set of Arabic word embedding models using data from Twitter, the Web and Wikipedia. They used Gensim to build the models and evaluated them both qualitatively and quantitatively. Their models performed well in both cases.

Tigrinya, a Semitic language very close to Amharic and spoken in Ethiopia and Eritrea, was analyzed for its word embeddings to improve a POS tagger by Tedla & Yamamoto (2017). The authors constructed a new text corpus, as Tigrinya has very few annotated resources, and investigated the optimal hyperparameters for generating word vectors for Tigrinya. They showed that the size of the context affects the quality of the semantic and syntactic relatedness of words: while a wider context gives better semantic relatedness, a shorter context renders better syntactic relatedness.

As far as the author of this thesis knows, there has been no analysis of Amharic word embeddings so far. However, for downstream tasks such as NER, Demissie (2017) used Amharic neural word vectors as a feature to design an Amharic Named Entity Recognition system. The word vectors were used only for classification, and the author left a recommendation for readers to investigate the impact of using word embeddings on Amharic Named Entity Recognition.

2.4.3 Works on Amharic text classification

Research on Amharic text classification has been carried out by many researchers, using different methodologies and approaches. Relevant works are presented here.


Gambäck et al. (2014) applied machine learning to Amharic text classification and examined the effect of operations like stemming and POS tagging on text classification performance for Amharic. They utilized a medium-sized, hand-tagged Amharic corpus and found out that stemming has no significant influence on the performance of Amharic text classification. In their work, a bag-of-words approach was used for the text classification experiment. The approach they utilized suffers from severe sparsity, which results from the very nature of bag-of-words word representations: the model harnesses very little information as the vector space is too huge. Using this approach has further drawbacks: it cannot provide vectors for out-of-vocabulary words and it ignores context.

Kelemework (2009) applied a neural network approach to 9 categories of news with a total of 1,762 items in the dataset, using LVQ (Learning Vector Quantization) for automatic Amharic news classification. The issue with LVQ is that the processing required for classification takes time, as more hidden units are often required.

A paper by Habte (as cited in Gambäck et al., 2014) used an ANN approach called SOM (Self-Organizing Maps) on a corpus of 100 news items for document classification. Eyassu & Gambäck (2005), on the other hand, utilized a set of queries for classifying news items, taking the queries as class labels. They experimented on a 206-document corpus after converting it to a term-document matrix and reducing it with a dimensionality reduction method called SVD (Singular Value Decomposition).

Zelalem (as cited in Kelemework, 2009) used a statistical method with the cosine similarity function as a matching technique for classification on a total of 1,481 news items, while Weldesellassie (2003) used 11,024 news articles, employed Naive Bayes and KNN, and found out that classification accuracy using Naive Bayes and KNN decreases if fewer documents are used in training. Weldesellassie further noticed that classifiers work well if news items are evenly distributed across the categories.

Hierarchical classification of Amharic news was carried out by Tegegnie (2010), who experimented on a total of 16,075 news items. Tegegnie found out that an increase in the number of classes, documents or features has an adverse effect on the accuracy of flat classification, and that at the peak number of features the flat classifier accuracy also diminishes.

2.5 Word Embeddings Evaluation Methods

In the NLP community there are two evaluation methods that are often used: intrinsic and extrinsic evaluation. Intrinsic evaluations are experiments in which word embeddings are evaluated based on human judgements or their visual results on word relations; word semantic similarity and word analogy are the most popular such methods. Extrinsic evaluation methods, on the other hand, are based on the ability of a given word embedding to serve as feature vectors for supervised machine learning algorithms used in downstream NLP tasks (Bakarov, 2018). The performance of the supervised model, measured on a dataset for a given NLP task, functions as a measure of the quality of the word embeddings. Word embeddings can be used in natural language processing tasks such as text classification for extrinsic evaluation.


CHAPTER 3 METHODOLOGY AND APPROACH

In this chapter, the methodology, approaches, algorithms and tools utilized in this thesis are discussed. Since word embeddings require a dataset for training, the process of dataset preparation is presented in detail. The techniques selected to evaluate the Amharic word embeddings are presented. The general architecture of the work and the ways to visualize the expected results are among the issues covered.

3.1 Words and Word Vectors

Words are fundamental units of language formed from letters. They represent sounds using a sequence of characters that can be easily understood by humans. In the linguistic syntax hierarchy, words are the atomic units of syntax which cannot be subdivided into smaller units (Basirat, 2018). For example, the Amharic sentence የአማራ ክብሩ ከላይ ነው፡፡ is composed of the words የአማራ, ክብሩ, ከላይ, ነው, and the final punctuation ፡፡ (see Appendix III for Amharic punctuation marks). In most NLP tasks and computer systems, punctuation marks are treated as word forms.

When forming a sentence, words are separated, in most languages including Amharic, by a space character. The boundary between words in a sentence is marked by the space character (exceptions are some languages, such as Chinese, with no explicit word boundary).

3.1.1 Words and contexts

Context is a connection between elements of a paragraph, a dataset or a corpus. The surrounding words woven together to attain a certain meaning form the context. For a given corpus E viewed as a set of elements e, context is defined as a function C: E \to P(E), where P(E) is the power set of E. This function is called a context function (Basirat, 2018). However, the context type customarily used in word embeddings is the neighborhood context of words. This is referred to as the window-based context of a word, formed by all words in a sequence (or window) of surrounding words. Let E = \{e_1, \ldots, e_T\} be a corpus of size T; then the neighborhood context of a word e_t \in E with parameter \tau \in \mathbb{Z} is:


\delta_n(e_t; \tau) = \begin{cases} e_1 & \text{if } t + \tau < 1 \\ e_{t+\tau} & \text{if } 1 \le t + \tau \le T \\ e_T & \text{if } t + \tau > T \end{cases} \qquad (3.1)

Depending on the sign of the parameter \tau, the neighborhood context returns the word \tau positions before or after e_t. The neighborhood context with \tau < 0 is called the history or backward neighborhood context, and the neighborhood context with \tau > 0 is called the future or forward neighborhood context.
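The following is a small sketch of the window-based neighborhood context of Equation 3.1, clamping positions that fall outside the corpus; the toy corpus and function name are illustrative assumptions.

def neighborhood(corpus, t, tau):
    # return the word at offset tau from position t (1-indexed), clamped to [1, T]
    T = len(corpus)
    pos = min(max(t + tau, 1), T)
    return corpus[pos - 1]

corpus = ["የአማራ", "ክብሩ", "ከላይ", "ነው"]
print(neighborhood(corpus, 2, -1))  # history context: "የአማራ"
print(neighborhood(corpus, 2, +1))  # future context:  "ከላይ"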

Figure 3.1: (a): a left neighborhood context with parameter τ=-n; (b): a right neighborhood context with parameter τ=+n, where n > 0.

3.1.2 Vectors and word vectors

Mathematically speaking, a vector is a quantity that determines the position of one point in space relative to another. Vectors are characterized by magnitude and direction, and they can be compared with each other in different ways. The two most common measures are the Euclidean distance and the cosine of the angle. While the Euclidean distance is the actual distance between vectors in N-dimensional space, the cosine distance is based on the angle between vectors. The Euclidean distance between two vectors a = (a_1, \ldots, a_n) and b = (b_1, \ldots, b_n) is computed as:

||a - b|| = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2} \qquad (3.2)

The cosine of the angle between the above vectors is computed using the dot product of the two vectors and their magnitudes, as shown in the formula below.


\cos\theta = \frac{\sum_{i=1}^{n} a_i b_i}{||a||\,||b||} \qquad (3.3)
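A minimal NumPy sketch of Equations 3.2 and 3.3 is given below; the vectors are toy values for illustration only.

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 1.0, 4.0])

euclidean = np.linalg.norm(a - b)                                 # Equation 3.2
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))   # Equation 3.3

print(euclidean, cosine)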

Vectors can be formed from different objects such as images, videos, numbers and words. Before the introduction of vectors into NLP, notions such as phonemes, morphemes, syntax and semantics were used to grasp the structure of words. Soon after the introduction of machine learning and deep learning, the need to represent words in a way that enables machines to understand them came into existence. The representation of words should retain meanings, semantic relationships and other linguistic features, including the context among words. The proposed way to represent words is then with vectors. Word vectors represent words as multi-dimensional continuous floating point numbers, where semantically related words are mapped to adjoining points in vector space. This representation lends the capability to calculate a distance, be it Euclidean or cosine, as a measure of the relationship between words.

3.2 Word Representation

The relationships of words can be addressed either by a morphological analyzer or in vector form. A morphological analyzer determines the morphological (structure and forms of words) relationships between words. In the vector form, words are mapped to real-valued vectors. A word vector captures and represents a word in such a way that semantically related words come closer in space and have similar vectors.

There are many popular models and algorithms for representing words with vectors. Among them, Word2Vec, GloVe and fastText have gained tremendous popularity and effectiveness in the NLP arena.

3.2.1 Word2Vec

Word2Vec is a neural network with a single input, output and hidden layer. Mikolov et al. (2013c) proposed this shallow word embedding model. The model uses the neural architectures discussed in Section 2.3: it predicts words based on their context using the CBOW and SG models. Hyper-parameters such as the embedding size and the window width are fine-tuned when creating word vectors using word2vec. The window width is simply how many words are taken as history and future of the target word, while the embedding size specifies the number of neurons in the hidden layer, i.e. the dimensionality of the embedding.

Word2vec, proposed by Mikolov et al. at Google, is a shallow, two-layer NN used to process text and produce word embeddings. It has a single hidden layer and is a fully connected feedforward network. In word2vec, the non-linearity of neural networks is removed.

Word2vec takes in a large corpus and produces a word embedding, or vector space. As it is used to represent words with distributional word embeddings, the representation keeps the semantic relationships among words. Words are represented as vectors such that words with similar meanings appear closer together. Word2vec is grouped under the predictive models (see Section 2.2.1).

According to Mikolov et al. (2013a), this model for learning distributed representations of words lessens the computational complexity caused by the non-linear hidden layers in traditional language models like the Recurrent Neural Net Language Model (RNNLM). The authors favored data efficiency, simplicity and accuracy of word representations that help to infer linguistic relationships between words based on past appearances. It generates word embeddings by sliding a window over a large corpus of text.

Word2vec comes in two flavors, the CBOW and the SG model (highlighted in Section 2.3). These two model architectures are algorithmically very similar, except that their prediction directions are opposite.

a) Continuous Bag-of-words Model

This model architecture learns an embedding by predicting the focus word based on nearby words, called the context. The surrounding words determine the context, which is represented by multiple words for a given probe word based on the size of the window. The model uses those surrounding words and tries to predict the target word. Given words w_1, w_2, \ldots, w_i, the CBOW model learns to predict each word w_k from its nearby words (w_{k-n}, \ldots, w_{k-1}, w_{k+1}, \ldots, w_{k+n}), where n is the window size.

For example, take the sentence: Amhara have engineered Ethiopia from South to North through its love and fist. Word2vec uses the history and future of the target word (that is, the words before and after the word based on their positions), called the window; let the window be 3 words. Say the target word is Ethiopia; the surrounding words then include {Amhara, have, engineered, from}. What CBOW basically does is predict the target word from the surrounding words. The idea is, given a context, to predict which word is most likely to appear as the target.

Figure 3.2: CBOW model diagram

The model diagram above (Figure 3.2) depicts the CBOW architecture. Let C be the size of the window and V the size of the vocabulary. The input is the 1-hot encoded context words \{X_1, X_2, \ldots, X_C\}, the hidden layer is an N-dimensional vector h, and the output is the 1-hot encoded word y. The input vectors are fully connected to the hidden layer via a V \times N weight matrix W, and the hidden layer is connected to the output layer via an N \times V weight matrix W'.

The model is a bag-of-words model in the sense that the order of the context words does not influence the prediction; it is dubbed continuous because the model uses a continuous distributed representation of the context (Mikolov et al., 2013a).

b) Continuous Skip-gram Model

This architecture is similar to CBOW (Mikolov et al., 2013a) but works in the reverse direction: it predicts the nearby words given the target word. Given words w_1, w_2, \ldots, w_i, the skip-gram model learns to predict the nearby words of the current target word w_k (Mikolov et al., 2013b). According to the authors, the model uses the other words that appear with the target word in the same sentence to maximize the classification of words. Each current word is used as an input to a classifier with a continuous projection layer, hence the name continuous skip-gram, and words are predicted within a range given by the window size.

In this architecture, the objective is to find word representations that are useful for predicting context given a target word. Context refers to words nearby the target word or the surrounding words.

Consider the sentence: I want to teach about word vector representation, and let the window be of size 1. The skip-gram model will break the sentence into (context, target) pairs, like:

([I, to], want), ([want, teach], to), ([to, about], teach), ([teach, word], about), ([about, vector], word), ([word, representation], vector) ....
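The pairing above can be reproduced with a short sketch; the function below is purely illustrative of how a window of size 1 yields (context, target) pairs, not a reimplementation of word2vec internals.

def context_target_pairs(tokens, window=1):
    pairs = []
    for i, target in enumerate(tokens):
        # words within `window` positions before and after the target
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((context, target))
    return pairs

sentence = "I want to teach about word vector representation".split()
for context, target in context_target_pairs(sentence, window=1):
    print(context, "->", target)
# (['want'], 'I'), (['I', 'to'], 'want'), (['want', 'teach'], 'to'), ...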


Figure 3.3: Skip-gram model

As shown in the above model (Figure 3.3), the input of the model is a word w_I and the output is the set of adjacent words in its context \{w_{O,1}, \ldots, w_{O,C}\}, where C is the window size. Taking the previous example once again, a potential training instance could be the word "vector" as input and the words {"I", "want", "to", "teach", "representation"} as outputs. All these words are represented as one-hot vectors of length V, the size of the vocabulary.

3.2.2 GloVe

Global Vectors is another popular method for learning word representations, proposed by Pennington et al. (2014). Global Vectors, GloVe for short, is an approach to capture the meaning of a word using the structure of the whole corpus. The model uses co-occurrence counts of words and fits statistics to minimize errors. GloVe constructs a co-occurrence matrix that is used to calculate the probability of the appearance of a word in the context of another word. That probability can be written as P(j|i), which reflects relationships between words. GloVe then uses the ratio of co-occurrence probabilities between words to predict the target word by training on the word-word co-occurrence matrix.

GloVe takes advantage of two major families of models: global matrix factorization and local context window methods (Pennington et al., 2014). Global matrix factorization is the process of decomposing a matrix into the product of several matrices; these matrices typically represent either the frequencies of terms in documents or the co-occurrence of terms. The other major family the authors analyzed is the local context window methods, in which the model learns by observing the contexts around target words, or vice versa, by scanning context windows across the corpus.

Pennington et al. further noted that both major families suffer from disadvantages. While global matrix factorization methods efficiently leverage statistical information, they perform relatively poorly on the word analogy task. On the other hand, context window methods poorly exploit the global statistics of the corpus but do better on the analogy task.

So the GloVe model takes the strengths of the two model families. From the global matrix factorization side, instead of learning raw co-occurrence probabilities or frequencies, the ratios of co-occurrence probabilities are used. GloVe produces numerical vectors for a huge corpus of words.

To illustrate how GloVe uses co-occurrence probabilities, suppose we want to study the relationship between two words, i and j.

Co-occurrence probability

Let X be the word-word co-occurrence count matrix, and X_{ij} the number of times word j appears in the context of word i.

Let k be a probe word.

P_{ij} = P(j \mid i) = \frac{X_{ij}}{X_i} \qquad (3.4)

P_{ij} is the probability that word j appears in the context of word i.

X_i is the number of times any word occurs in the context of word i.

For words k related to i but not to j, the ratio P_{ik}/P_{jk} will be large.

For words k that are related to both i and j, or to neither, the ratio P_{ik}/P_{jk} \approx 1.


Therefore, the ratio of co-occurrence probabilities is better at distinguishing relevant words from irrelevant ones.

In the GloVe model, the idea is, given a corpus and its words, to take data from that corpus in the form of global statistics and learn a function that gives information about the relationship between words. The authors observed that the ratios of co-occurrence probabilities are good at distinguishing relevant words, so the context of the words should be taken into account. Let the function the model learns be F. A naive formulation of the desired model is given by the authors as,

F(w_i, w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}} \qquad (3.5)

where w \in \mathbb{R}^d are word vectors and \tilde{w} \in \mathbb{R}^d are separate context word vectors. Note that the w's are real-valued word vectors.
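A small sketch of the co-occurrence counts and probability ratios behind Equations 3.4-3.5 follows; the toy English sentences and function names are assumptions for illustration and are not GloVe itself.

from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    # X[i][j] = number of times word j appears within `window` positions of word i
    X = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i, wi in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    X[wi][tokens[j]] += 1
    return X

def P(X, i, j):
    # P(j | i) = X_ij / X_i (Equation 3.4)
    Xi = sum(X[i].values())
    return X[i][j] / Xi if Xi else 0.0

sentences = [["ice", "is", "cold", "and", "solid"], ["steam", "is", "hot", "gas"]]
X = cooccurrence_counts(sentences)
# ratio P(k|i) / P(k|j): close to 1 for a probe word k related to both i and j
k, i, j = "is", "ice", "steam"
print(P(X, i, k) / max(P(X, j, k), 1e-12))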

The above two models rely on the distributional behavior of words, be it their frequency of occurrence or their context. The essence of the word itself, i.e. linguistic properties such as the morphology of the word, is not taken into account. The other recent technique, which takes these issues into consideration when mapping words to vectors, is fastText (Joulin et al., 2016).

3.2.3 fastText

fastText combines the concepts of the CBOW architecture and represents sentences using bag-of-words and bag-of-n-grams, as well as using subword information and sharing information across classes via a hidden layer. fastText is not just for word representation, which is discussed below, but also for text classification. fastText is used as a classification model that produces fast, efficient and accurate results, which makes it comparable to other well-known deep NN classifiers and embedding models. Training and classification with fastText are very fast (Joulin et al., 2016). As a linear and scalable model, fastText uses a hierarchical softmax function to lessen the computational complexity; the lower the computational complexity, the faster the search to predict classes for text classification and to produce word representations. fastText uses both bag of words and bag of n-grams, the latter to capture word order. To produce efficient outputs when using bag of n-grams, a hashing technique that maps n-grams is utilized.


fastText for Word Representation

Figure 3.4: Character n-grams example using the word "going"

fastText, being an extension of the continuous SG model (Bojanowski et al., 2017; Mikolov et al., 2013), is a robust embedding and text classification model that uses subword information. GloVe and word2vec mainly focus on whole words for learning embeddings. As a result, words that are not in the vocabulary, or newly formed words, are usually represented by a vector of zeros or simply ignored. According to the authors, popular models that learn word representations do not take into account the morphology of words or parameter sharing. For morphologically rich languages such as Amharic and Turkish, ignoring the internal structure of words by just assigning a distinct vector to each word limits the representation.

Therefore, fastText goes one level down, from the word level to character n-gram level information. fastText treats each word as a sum of character n-grams, and words are represented in the word embedding as the sum of those n-gram character vectors. In fastText, each word is broken down into its n-gram constituents in addition to the word itself. For example, let the n-grams be 3 characters long and take the following words: going, matter, apple, where and መጨቃጨቅ (Amharic: squabble).

Taking those words and n = 3 as an example, the words will be represented in the fastText embedding by the following character n-grams:

- going: <go, goi, oin, ing, ng> and the special sequence <going>
- matter: <ma, mat, att, tte, ter, er> and the special sequence <matter>
- apple: <ap, app, ppl, ple, le> and the special sequence <apple>
- where: <wh, whe, her, ere, re> and the special sequence <where>
- መጨቃጨቅ: <መጨ, መጨቃ, ጨቃጨ, ቃጨቅ, ጨቅ> and the special sequence <መጨቃጨቅ>

Each word w is represented as a bag of subword n-grams, and the word w itself is also included in the set of n-grams (Bojanowski et al., 2017). For a given word, n-grams of length 3 to 6 are taken, and the word is represented by the sum of the vector representations of its n-grams. These n-grams are called subwords.
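A sketch of fastText-style character n-gram extraction with boundary symbols < and > is given below, reproducing the "going" example above; the real library additionally hashes these n-grams into buckets, which is omitted here.

def char_ngrams(word, min_n=3, max_n=6):
    padded = "<" + word + ">"
    grams = set()
    for n in range(min_n, max_n + 1):
        for i in range(len(padded) - n + 1):
            grams.add(padded[i:i + n])
    grams.add(padded)          # the special sequence for the word itself
    return sorted(grams)

print(char_ngrams("going", 3, 3))
# ['<go', '<going>', 'goi', 'ing', 'ng>', 'oin']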

Because it considers subword information, fastText fills a gap left by other models like word2vec and GloVe: it enables reliable representation of new words, out-of-vocabulary words and rare words in the corpus.

The overall description of fastText and related concepts is given in Section 2.3. Here the text classification side of fastText is presented.

Figure 3.5: fastText model

fastText is a library for learning word representations and for text classification (Mikolov et al., 2016). As shown in Figure 3.5 above, the fastText model has three parts: the input (a sequence of words, a piece of text or a sentence), the hidden layer, and the output (the probability that the word sequence belongs to a certain category). The probability distribution over the prelabelled classes is computed with a softmax function (Equation 3.10 below). Words are also weighted according to their corpus frequency f(w) through:

P_d(w) = \sqrt{\frac{t}{f(w)}} + \frac{t}{f(w)} \qquad (3.6)

where

f(w) = \frac{Count_w}{\text{total number of tokens}} \qquad (3.7)

Each vocabulary entry in fastText is mapped to a real-valued vector, with new and out-of-vocabulary (OOV) words getting a special unique vocabulary ID. The structure of a more elaborate fastText classifier is shown in Figure 3.6. The words w_i of a text (or document) are represented by n-dimensional word vectors X_i. The vector y for a text (document) is computed as the average of the vectors of a linear bag of words of the document:

y = \frac{1}{N}\sum_{i=1}^{N} X_i \qquad (3.8)

where N is the number of words in the document and X_i is the vector of the i-th word occurrence in the document.


Figure 3.6: A more elaborated fastText classifier with hidden-layer

The vector y is input to the hidden layer, where it is multiplied by the matrix M of the hidden linear layer to get a classification vector Z:

\begin{pmatrix} z_1 \\ \vdots \\ z_m \end{pmatrix} = \begin{pmatrix} m_{11} & m_{12} & \cdots & m_{1n} \\ \vdots & & & \vdots \\ m_{m,1} & m_{m,2} & \cdots & m_{m,n} \end{pmatrix} \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = M \cdot y \qquad (3.9)

where Z is an m-dimensional vector, M is an m \times n matrix, m is the number of labels and y is an n-dimensional vector. To perform the classification, the following softmax function is used to compute the class/label probabilities:

P_j = \frac{e^{z_j}}{\sum_{k=1}^{m} e^{z_k}} \qquad (3.10)

where P_j is the predicted probability that the text or sentence belongs to the j-th label, and z_j and z_k are the components of the classification vector Z. The fastText classifier calculates the sentence vector as the average of the word vectors (normalized by their length). The described classification model thus computes the document vector y by averaging the vectors of all word occurrences.
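The following is a minimal NumPy sketch of the classifier described by Equations 3.8-3.10: average the word vectors of a document, apply the hidden linear layer M, then a softmax over the labels. The vocabulary, the word vectors and M are random placeholders, not the trained model.

import numpy as np

rng = np.random.default_rng(0)
vocab = {"ኳስ": 0, "ቡድን": 1, "ሩጫ": 2, "ግብርና": 3}   # toy vocabulary
n_dim, n_labels = 10, 4
word_vectors = rng.normal(size=(len(vocab), n_dim))   # X_i for each word
M = rng.normal(size=(n_labels, n_dim))                # hidden linear layer

def classify(tokens):
    ids = [vocab[w] for w in tokens if w in vocab]
    y = word_vectors[ids].mean(axis=0)                # Equation 3.8
    z = M @ y                                         # Equation 3.9
    p = np.exp(z - z.max())
    return p / p.sum()                                # Equation 3.10 (softmax)

print(classify(["ኳስ", "ቡድን"]))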

3.3 Word Representation for Amharic

In this section, the word representation of the Amharic dataset, the ways to evaluate the resulting word embeddings, the process of preparing the training corpus and the overall phases are discussed thoroughly. For training the word representations, an Amharic corpus containing about 1.5 million words is prepared.

3.3.1 The corpus for word embedding

In NLP tasks, the role of corpora is immense. The success or failure of most NLP applications depends on the quality of the available data, which usually takes the form of corpora. Corpora can be annotated or unannotated. Unannotated corpora consist of raw, plain text in which the linguistic information is only implicit. Annotated corpora, on the other hand, add extra explicit information to the text, such as categories, part-of-speech tags and so on.

The corpus utilized here is an unannotated one. Part of it is taken from a multilingual parallel corpus created from translations of the Bible by Christodouloupoulos & Steedman (2015). The corpus is aligned (almost) at the sentence level, and the document was formatted as an XML file containing nested elements, with each sentence marked with an ID. The rest of the corpus was collected from different Amharic books. Figure 3.7 shows a sample of the corpus before pre-processing, taken from the work of Christodouloupoulos & Steedman (2015).


Figure 3.7: Sample corpora before preprocessing

3.3.2 Pre-processing the corpus for word embedding

Pre-processing involves preparing the corpus or dataset into a format that is suitable for the training and evaluation process. As shown in Figure 3.7, the corpus is in XML format and should be processed, cleaned and made ready for further work. fastText can take the corpus intact, but the characters that are part of the XML syntax, and delimiters such as < and >, might affect the resulting embedding; they are of no use for the Amharic text in the first place.

The other characters that should be removed are the punctuation marks. Amharic has its own punctuation marks, such as ፣ (ነጠላ ሰረዝ, the Amharic comma), ፤ (ድርብ ሰረዝ, the Amharic semicolon) and ፡፡ (አራት ነጥብ, the Amharic full stop) (see Appendix III for the full list of Amharic punctuation marks), as well as white space. White space between words is important, but extra space between paragraphs is of no value for the dataset. Numbers that were part of the corpus also do not play a role in the training. These punctuation marks and numbers are therefore not needed in our training dataset, so prior pre-processing is required.

To preprocess the corpus of Christodouloupoulos & Steedman (2015) and obtain a cleaned dataset, a small script was written in Python to remove the unwanted characters, punctuation marks (including extra spaces between paragraphs) and numbers in the corpus. The script is supported by ElementTree, a library package to parse, explore, modify and populate XML files with Python.

Preprocessing algorithm pseudo code

For each text in the corpus
    Check for the presence of XML chars OR a numeral OR a punctuation mark
    If opening AND closing XML chars present
        Remove data between marks
    If numerals OR punctuation marks present
        Remove
    While size of removed chars <> 0
        Move words to the left
    If white space greater than one tab
        Move words to the left or remove
End

Figure 3.8: Preprocessing algorithm pseudo code
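The sketch below shows one possible realization of the pseudocode in Figure 3.8: the XML corpus is parsed with ElementTree, the sentence text is kept, and numerals and punctuation are stripped. The file names, the iterated tag structure and the exact punctuation list are assumptions for illustration, not the thesis script itself.

import re
import xml.etree.ElementTree as ET

# Amharic punctuation, common symbols, Latin digits and Ethiopic numerals (assumed list)
PUNCT_AND_DIGITS = re.compile(r"[፣፤።፡፥፦፧!?.,;:()\[\]<>«»\"'0-9፩-፼]+")

def clean(text):
    text = PUNCT_AND_DIGITS.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()   # collapse extra white space

tree = ET.parse("amharic_bible.xml")           # assumed input file name
with open("corpus_clean.txt", "w", encoding="utf-8") as out:
    for elem in tree.getroot().iter():         # visit every nested element
        if elem.text and elem.text.strip():
            out.write(clean(elem.text) + "\n")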

After preprocessing, the dataset looks like the sample shown in Figure 3.9.


Figure 3.9: Preprocessed dataset sample

3.3.3 Amharic word embedding

After preprocessing the corpus, the next step is training. The fastText library is used to train on the corpus. fastText provides two modes of computing word representations: CBOW and skipgram. Both architectures were discussed in Section 2.3 and are employed here for training.

Both model architectures are used in fastText to learn a dense representation for each vocabulary term in the corpus. The representation is distributional and learns from the surrounding context as well. In both model architectures, the network is a two-layer, shallow neural network.

In skipgram, as discussed in Sections 2.3 and 3.2.1, a context window of size K is considered and other parts are skipped. The relationship between the window, or panel, and the target word is explored. This is done by feeding the two-layer shallow neural network a 1-hot encoding of the target word. As the input is 1-hot encoded, the hidden layer corresponds to only one row of the input-to-hidden weight matrix. The task of the network is then to predict the i-th context word given the target.

Figure 3.10: Model probability of a context word given a word w (colored red)

feature for word w: X_w

classifier for context word c: v_c

The scores for each word are computed using the equation:

u = W'^{T} h \qquad (3.11)

Here, h is the vector of the hidden layer and W' is the hidden-to-output weight matrix. After computing the score u, C multinomial distributions are computed, where C is the window size. In Figure 3.10 above, the window size is 3, for example. The distributions are computed as:

P(w_{c,j} = w_{O,c} \mid w_I) = \frac{\exp(u_{c,j})}{\sum_{j'=1}^{V} \exp(u_{j'})} \qquad (3.12)

where w_{c,j} is the j-th word on the c-th panel of the output layer, w_{O,c} is the actual c-th word in the output context words, w_I is the input word, and u_{c,j} is the net input of the j-th unit on the c-th panel of the output layer.

The second model architecture, which has been discussed thoroughly in Section 3.2.1, is CBOW. CBOW is technically the opposite of skipgram: the specific word is taken as the target given the context (Bhattacharjee, 2018). In CBOW, given the sentence "እግዚአብሔር ሰውን ነፃ ይሆን ዘንድ ሲፈርድበት አማራ ደግሞ ልጁን ወንድ ይሆን ዘንድ ይፈርድበታል!", the word "ነፃ" can be generated given the context ["እግዚአብሔር", "ሰውን", "ይሆን", "ዘንድ", "ሲፈርድበት"]. CBOW takes the 1-hot vectors of all the context words. The algorithm is much the same as skipgram, but the hidden layer's output is generated using the following equation:

h = \frac{1}{C} W \cdot \left(\sum_{i=1}^{C} x_i\right) \qquad (3.13)

The score in CBOW is generated with the same equation used in skipgram.

For training with fastText, there are different parameters to tune. The parameters and their descriptions, as given in the fastText library, are detailed in Table 3.1 below.

Parameters can be fine-tuned to obtain more robust models; with different optimization and parameter tuning, the accuracy of the models can be adjusted. One of the features of fastText is its ability to capture subword information: it takes into account not only the word itself but also its constituent parts, or sub-characters. The length of the character n-grams is controlled during training by the -minn and -maxn flags, which set the minimum and maximum number of characters. fastText controls the size of the vocabulary with the -minCount parameter, the minimum count for words to be part of the vocabulary. The size of the window of words taken around a target word is controlled by -ws.

The number of times fastText passes over the dataset during training is controlled by the -epoch parameter; by default, fastText looks at each data point 5 times. Another parameter, -lr, controls how fast the model updates during training, i.e. the size of the update applied to the parameters of the model.

The -dim flag sets the dimension of the hidden layer in training, and thus the dimension of the embeddings; it is set to 100 by default.


Table 3.1: Parameters used for training fastText word embeddings

Parameter      Description                                Default value
minCount       minimal number of word occurrences         5
wordNgrams     max length of word n-gram                  1
minn           min length of char n-gram                  3
maxn           max length of char n-gram                  6
lr             learning rate                              0.05
dim            size of word vectors (dimension)           100
ws             size of the context window                 5
epoch          number of epochs                           5
neg            number of negatives sampled                5
loss           loss function {ns, hs, softmax}            ns

The loss function is the other key parameter; it measures the difference between the model's predictions and the actual data distribution. In machine learning and related fields, the error is computed by comparing the predicted output with the actual one using a loss function. Choosing a loss function together with an optimization algorithm is one of the key steps of machine learning (Bhattacharjee, 2018): for a given loss function and optimization algorithm pair, it is possible to optimize the parameters of the model so that it mimics the real data as closely as possible. Three loss options are currently supported by fastText: negative sampling (ns), softmax and hierarchical softmax (hs).

In general, many hyperparameters can be used to optimize the models and find the right balance in fastText, as sketched below.
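A sketch of training the Amharic embeddings with the official fastText Python bindings, using the parameter values of Table 3.1, is shown below; the corpus and output file names are assumptions.

import fasttext

model = fasttext.train_unsupervised(
    input="corpus_clean.txt",   # assumed preprocessed corpus file
    model="skipgram",           # or "cbow"
    dim=100, ws=5, epoch=5, lr=0.05,
    minn=3, maxn=6, minCount=5, neg=5, loss="ns",
)
model.save_model("amharic_sg.bin")
print(model.get_word_vector("ሰላም")[:5])   # first few components of a word vector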

Figure 3.11: Proposed architecture and approach

3.3.4 Dataset for text classification

A manually prepared and labelled dataset is used to build the fastText classification model. A total of 900 news items is collected from the web site of the Amhara Mass Media Agency (AMMA2). The dataset contains a total of approximately 210,000 words and was labelled manually after preprocessing. During data preprocessing, stop words, numerals, punctuation marks and symbols were eliminated before the training and testing processes. Each line has a list of labels, followed by the corresponding data/document. All the labels start with the __label__ prefix, which is how fastText recognizes what is a label and what is text. The data format for fastText is as follows:

__label__X __label__Y ... text, where X and Y represent the class labels and text is the document content. A sample from a training data file is given below.

Figure 3.12: Sample Dataset with fastText labeling format

As shown in Figure 3.12, there are 4 classes, where __label__1 is news related to Athletics, __label__2 is about Chess, __label__3 is about Football and __label__4 is about others.

Each article has been manually classified as belonging to one of the four predefined classes. The four classes are presented in Table 3.2 below.

Table 3.2: The four categories and the number of articles belonging to each category

Category                                 Label   Count
Athletics                                1       154
Football                                 3       332
Chess                                    4       21
Others (agriculture, economics, etc.)    2       393

2 Amhara Mass Media Agency is a government-owned news and information service located in Bahir Dar, Ethiopia. At its web site, www.amharaweb.com, it provides Ethiopia-related news.

To get training data labelled with the four categories -- Athletics, Football, Chess and Others -- the AMMA news site was scraped. Each line of the text file contains a list of labels followed by the corresponding content, as is the default setup for fastText.

The preprocessing tasks for this dataset are more or less the same as described in Section 3.3.2. However, there are some tasks unique to preparing text classification datasets. The first one is labelling, discussed above; as fastText expects a certain format for classification training, the dataset is prepared accordingly. The other task is shuffling the data.

Shuffling the data before training the classifier is important. If the labels in the data are clustered, the precision and recall, and hence the quality and performance of the model, will be poor. This is due to the way fastText learns the model: fastText uses stochastic gradient descent3 based optimization, and the training data is processed in order.

The other task which is obvious in text classification is dividing the dataset into training and testing sets. Model performance evaluation should always be done on independent data. Therefore, the whole dataset is separated out into training and testing sets.

To evaluate the performance of a classification model, only the training set is used for model training. Once training is done, the test set is classified, the predictions are compared with the actual labels, and the performance is measured. Accuracy is the most natural performance measure: it is the ratio of correctly predicted observations to the total number of observations. This measure is useful, but only for symmetric datasets where the numbers of false positives and false negatives are almost the same. Therefore, other performance measures are required to correctly quantify the performance.

One is recall, which is the percentage of the actual labels that are correctly retrieved; it measures what proportion of actual positives were identified correctly. The other measure is precision, which answers the question: what proportion of the predicted labels are actual labels? It measures what proportion of positive identifications were actually correct. The weighted average of these two measures, recall and precision, is called the F1 score. This score takes both false positives and false negatives into account.

3 Gradient descent is basically an optimization algorithm that is meant for minimizing a function.

F1\ score = 2 \times \frac{Recall \times Precision}{Recall + Precision} \qquad (3.14)
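A sketch of training and evaluating the fastText classifier on the labelled news data follows; the file names train.txt and test.txt, and the epoch, learning rate and word n-gram values, are assumptions for illustration rather than the settings used in the experiments.

import fasttext

clf = fasttext.train_supervised(input="train.txt", epoch=25, lr=0.5, wordNgrams=2)
n, precision, recall = clf.test("test.txt")            # precision/recall at k=1
f1 = 2 * precision * recall / (precision + recall)     # Equation 3.14
print(n, precision, recall, f1)
print(clf.predict("የኢትዮጵያ አትሌቶች በማራቶን አሸነፉ"))   # predict a label for a new sentence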

3.4 Visualizing Word Embeddings

Visualizing the vectors and the embeddings in space is an effective way to understand and explore the distributional properties of the models. An embedding is a mapping from discrete objects, such as words, to vectors of real numbers. In the simplest view, word embeddings are matrices of coordinates. However, since the dimension of the vectors is as high as 300, a dimensionality reduction technique is required so that the vectors can be displayed in a 2-dimensional frame. Two popular algorithms are mainly used to visualize word embeddings: PCA (Principal Component Analysis) and t-SNE (t-distributed Stochastic Neighbor Embedding).

t-SNE and PCA are popular techniques for dimensionality reduction and are usually used for the visualization of high-dimensional datasets. The idea in this case is to keep related words as close together as possible while maximizing the distance between dissimilar words.
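The sketch below projects a few word vectors to 2-D with t-SNE and plots them; the word list is illustrative and the trained model file name is an assumption carried over from the earlier training sketch.

import numpy as np
import fasttext
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

model = fasttext.load_model("amharic_sg.bin")          # assumed model file
words = ["አማርኛ", "ትግርኛ", "ኦሮምኛ", "አዲስአበባ", "ባህርዳር", "ዳቦ", "እንጀራ"]
vectors = np.array([model.get_word_vector(w) for w in words])

coords = TSNE(n_components=2, perplexity=3, init="random").fit_transform(vectors)
plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), w in zip(coords, words):
    plt.annotate(w, (x, y))                            # label each point with its word
plt.show()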


CHAPTER 4 EXPERIMENTATION AND RESULTS

4.1 Introduction

Natural Language Processing tasks are taken seriously in the research community because of their role in solving real problems in our lives and the results they provide. Be it POS tagging, sentiment analysis, text classification or the focus of this work, word embeddings, they have vital roles to play in addressing such problems.

However, there is a difficulty here: the data-intensiveness of NLP tasks. NLP and related fields like machine learning require big and rich data to produce the results we aspire to. The availability of data alone does not end the problem; the quality of the data is another challenge in these fields.

When data of useful form, quality and quantity is available, experiments with the designed models and architectures must be carried out, as each NLP task is expected to deliver a useful output. By fine-tuning different parameters, levels and features, outputs of different quality are produced. Yet the measures of accuracy and the criteria for quality outputs in NLP are still open research areas.

In this chapter, the word vector representation for the Amharic language and the evaluation of the word embeddings using intrinsic and extrinsic evaluation methods are discussed. Linguistic relationships, namely word similarities, nearest neighbors and word analogies, are chosen for intrinsic evaluation, while multi-class text classification on an Amharic dataset is used as the extrinsic evaluation.

4.2 Evaluation and Experimentation Setup

We used an Amharic corpus gathered from Wiki sources, local media sources and books, as discussed in Section 3.3. The fastText library, along with related tools such as word2vec and Gensim, is employed both for experimentation and for visualization of the results. While most of the default fastText parameters are used, some parameters are fine-tuned as follows: the embeddings produced have a size of 100 or 300.

The experimental parameters are summarized in the previous section 3.3 in Table 3.1.

The machine used for training and experimentation has the following specifications:

- Central Processing Unit (CPU): Intel(R) Core(TM) i3-2348M CPU @ 2.30GHz
- Random Access Memory (RAM): 4GB
- Operating System (OS): Ubuntu 16.04

For the evaluation of our trained embeddings, we used two kinds of evaluation metrics, intrinsic and extrinsic, which are presented in the next section.

4.3 Evaluation Metrics

In word embeddings, two evaluation methods are commonly employed: intrinsic and extrinsic evaluation. Intrinsic evaluations are experiments in which word embeddings are evaluated based on human judgments or their visual results on word relations; word semantic similarity, nearest neighbors and word analogy relations are the most popular such methods. Extrinsic evaluation methods measure the ability of a word embedding to be used as the feature vectors of supervised machine learning algorithms in downstream NLP tasks such as text classification.

We explored both evaluation methods using the fastText library and the corpus prepared.

4.3.1 Intrinsic Evaluation

Since neural word embeddings are part of NLP and are highly related to language studies, it is common to take the embeddings as tools to understand features of a given language. Intrinsic evaluation assesses how well the vectors capture the linguistic relationships (similarities, analogies) between words. It measures the quality of a word vector directly using different features. Among these, the following three linguistic features are used to evaluate the word embeddings produced.

4.3.1.1 Word similarities and relatedness

This task involves finding words related to the query word in meaning or syntax. We used the SG model to find near matches between terms. The similarity between two terms t1 and t2 is calculated using cosine similarity. Table 4.1 below shows how similar two words are, as generated from our embedding, together with their similarity scores.

Table 4.1: Similarity scores between terms t1 and t2

Term1 Term2 Similarity score

ሠላም(Peace) ሠላም(Peace) 1.0

መጣ(he comes) ሄደ(he goes) 0.693

በላ(he eats) ሰራ(he works) 0.519

መጣ(he comes) ዩኒቨርስቲ(university) 0.199

From the above table, consider the three terms መጣ (he comes), ሄደ (he goes) and ዩኒቨርስቲ (university), and compute their similarity scores using the cosine distance measure: sim(መጣ, ሄደ) = cos(vec(መጣ), vec(ሄደ)) = 0.693 and sim(መጣ, ዩኒቨርስቲ) = cos(vec(መጣ), vec(ዩኒቨርስቲ)) = 0.199.

Therefore, as the results show, the words መጣ (he comes) and ሄደ (he goes) are semantically closer than መጣ (he comes) and ዩኒቨርስቲ (university). The similarity score of a word with itself is obviously 1, as shown in the first row of Table 4.1.
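The similarity queries behind Table 4.1 can be reproduced with a short sketch using the trained model; the model file name is an assumption, and the scores will differ from the table since they depend on the corpus and training run.

import numpy as np
import fasttext

model = fasttext.load_model("amharic_sg.bin")          # assumed model file

def cosine(w1, w2):
    a, b = model.get_word_vector(w1), model.get_word_vector(w2)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine("መጣ", "ሄደ"))
print(model.get_nearest_neighbors("ሰው", k=5))          # k-nearest neighbors as in Table 4.2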

The cosine distance between words defines how closely related two words are. It describes the similarity and relatedness between words and how they occur together in the corpus. Figure 4.1 below shows words and their neighborhoods using the t-SNE visualization method. The figure depicts places, languages, people's names and foods based on their relatedness. The languages are closer to each other than to the others; this is shown separately in Figure 4.2, a magnified version focusing on the cluster of language names in Figure 4.1.


Figure 4.1: t-SNE visualization of cosine distances between the words listed on the right side of the picture. The words are names of people, languages, places, animals and foods.

Figure 4.2: Magnified version of Figure 4.1 to show how names of languages are closer to each other

Word similarity evaluation also includes nearest neighbor searches, to evaluate how well the k-nearest neighbors are generated (more on this in Section 4.3.1.3), where k is an integer. Embeddings capture proximity relationships between objects: in word embeddings, related words are placed near each other. Nearest neighborhood refers to the closest neighbors of a given word in the embedding space. This task helps to see whether a word vector captures morphological, syntactic and/or semantic relations of words correctly.

Table 4.2: Most similar words for the words ልዑል (Prince), ሰው (Human) and ነገሠ (Reign)

ልዑል(Prince) ሰው(Human) ነገሠ (Reign)

የልዑል (for prince) ከሰው (from people) ከነገሠ (if he reigned)

በልዑል (by prince) ሰውዬ (you man) ነገሠች (she reigned)

አልጋወራሽ (crown prince) ሰውና (human and) ከተከታዩ (from his follower/s)

ወራሽ (heir) ሰውንም(and human) ቀጥሎና (next and)

ልዕልት (princess) ያለሰው (without human) ንጉሥ (king)

ሰውን (the people) ንግሥት (queen)

As the above table shows, morphological variants of each query word and variants differing in POS are returned as its related words. Because Amharic is highly inflected, the results retrieved for a similarity query are dominated by inflected forms of the query word. For example, the results in the second column of the table, for the word ሰው (Human), are all related to people; moreover, all of the returned words share the root ሰው, which shows that the result combines both semantic and morphological relatedness.

4.3.1.2 Word analogy relation

When a large dataset is used to train word embeddings, the resulting vectors are able to learn subtle relations between words (Mikolov et al., 2013a). Figure 4.3 below illustrates this claim by demonstrating city–country relationships.


Figure 4.3: 2-D projection of 300-dimensional vectors of countries and cities

The figure shows that the word embedding produced, although never supplied with supervised information about what a city means to a country, automatically organizes concepts and captures the relationships between them. In the word-analogy task, given a word w2 and a sample relationship w3 : w4, the goal is to find the word w1 such that w1 : w2 best resembles w3 : w4.

Word analogy relations are a good way of examining and evaluating the quality of a word embedding, since this linguistic property of words is manifested in the embedding space. For example, according to Mikolov et al. (2013), the question "if man is to king, woman is to what?" is an analogy that word embeddings can solve; the answer is queen, which is logically correct. Given a tuple such as "ግብጽ(Egypt) : ካይሮ(Cairo) :: ኢትዮጵያ(Ethiopia) : አዲስአበባ(Addis Ababa)", the embedding model produces the correct result if the nearest vector to vector(ካይሮ) - vector(ግብጽ) + vector(ኢትዮጵያ) is vector(አዲስአበባ). In word analogies, the task is therefore to find a word v such that vector(v) is closest to vector(ካይሮ) - vector(ግብጽ) + vector(ኢትዮጵያ) according to the cosine measure.
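The analogy query just described can be sketched as plain vector arithmetic, using only calls provided by the fastText Python bindings (get_words and get_word_vector); the model file name is a hypothetical placeholder, and the linear scan over the vocabulary is kept only because this is an illustrative sketch.

import numpy as np
import fasttext

model = fasttext.load_model("amharic_sg.bin")   # hypothetical trained model

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Target vector for the analogy Egypt : Cairo :: Ethiopia : ?
target = (model.get_word_vector("ካይሮ")
          - model.get_word_vector("ግብጽ")
          + model.get_word_vector("ኢትዮጵያ"))

# Rank vocabulary words by cosine similarity to the target vector,
# excluding the three query words themselves.
exclude = {"ካይሮ", "ግብጽ", "ኢትዮጵያ"}
best = sorted(
    ((cosine(target, model.get_word_vector(w)), w)
     for w in model.get_words() if w not in exclude),
    reverse=True)[:5]
print(best)   # the top candidate is expected to be አዲስአበባ (Addis Ababa)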


The analogy relationships retrieved can vary for several reasons. One of them is the size of the dataset: the result of a word-analogy query is heavily influenced by the quality and quantity of the corpus used during training, and a very small corpus may not yield the desired analogy. Since the decisive factor is the distance between the vectors of the competing words, no explicit semantic or logical reasoning is involved in producing the output. Word analogy in word embeddings rests on the empirical observation that pairs of words standing in the same relationship tend to be separated by similar offsets in the vector space.

This analogical reasoning task has two categories: the semantic and syntactic analogies.

Table 4.3: Semantic analogy

Relatedness Word pair 1 Word pair 2

Capital-country ፓሪስ(Paris) – ፈረንሳይ(France) ካርቱም(Khartoum) – ሱዳን(Sudan)

ሱዳን(Sudan) – ካርቱም(Khartoum) ግብጽ(Egypt) – ፈሮዖን(Pharaoh)

Country-continent ኢትዮጵያ(Ethiopia) – አፍሪቃ(Africa) ቱርክ(Turkey) - ኢሲያ(Asia)

Man-woman ወንድ(man) – ንጉሥ(king) ሴት(woman) – ንግሥት(queen)

Opposite አጭር(short) – ረዥም(tall) ነጭ(white) - ጥቁር(black)

Table 4.4: Syntactic analogy

Relatedness Word pair 1 Word pair 2

Plural suffixes እናት(mother) – እናቶች(mothers) አባት(father) – አባቶች(fathers)

Passive voice suffixes ገደለ(he killed) – ተገደለ(he was killed) ሰማ(he listened) – ተሰማ(he was listened)

Pronoun/verb suffixes ሰጠ(he gave) – ሰጡ(they gave) ቀማ(he plundered) – ቀሙ(they plundered)

መስማት(listening) – መስማቱ(him listening) መናገር(speaking) – መናገሩ(him speaking)


4.3.1.3 Observations from intrinsic evaluations

We experimented with the word vectors by tuning fastText parameters in order to evaluate the embedding obtained from the corpus. Several test words were selected for similarity (nearest-neighbor), analogy and OOV queries. As Tables 4.2 and 4.5 show, the results concentrate on inflected forms of the Amharic query words. This happens because Amharic is a morphologically rich language with highly productive morphology: since there are many inflected forms for any given word, the probability of finding several inflections of the same word clustered contiguously is high.

We noticed that syntactic and morphological relatedness improves with a shorter window size. Table 4.5 compares the top 5 nearest neighbors of the word በላ (he eats) using window sizes (ws) of 2 and 5. Note that the underlined words in the table have no correlation to the task.

Table 4.5: Top 5 nearest neighbors for a word: በላ

Ws Word:በላ (he eats). Top 5 Nearest neighbors

5 ሊበላ(to eat), አከራይቶ(hire out), በላች(she eats), ለጠጣ (he who drinks), ከጠጣ(if he drinks)

2 በላች(she eats), በላሁ(I ate), ሆድ(stomach), በላዩ(over him/it), በላህ(you ate)

The shorter the window (context) size, the finer the result. A careful investigation, however, shows that semantic relationships behave differently under the same parameters: semantic relatedness, in contrast with the syntactic and morphological variations, improves when the context is longer. The following table compares the results of semantic analogical reasoning such as "city-country" and "man-woman" with two window sizes, 2 and 5.


Table 4.6: Analogical reasoning with varying window size

Semantic relatedness: ወንድ(man) – ንጉሥ(king)

Ws=2: ንግሥት(queen), ለወንድ(for male), የወንድ(to male), ወንድም(and male), ወንድና(male and), ወንድምሽ(your brother), ወንድምና(brother and), ሴት(woman)

Ws=5: ንግሥት(queen), ለወንድ(for male), ሚስት(wife), ወንድና(male and), ሴት(woman), ከወንድ(from male), ወንድምዋ(her brother), የወንድ(to male)

Semantic relatedness: ፓሪስ(Paris) – ፈረንሳይ(France)

Ws=2: ሱዳን(Sudan), ምስራቃዊ(eastern), ካርታ(map), በሱዳን(by Sudan), በምስራቃዊ(in eastern), ምስራቅን(the east), አረቢያ(Arabia), ኪሜ(km)

Ws=5: ሱዳን(Sudan), ናይሮቢ(Nairobi), በናይሮቢ(by Nairobi), በአስመራ(by Asmara), ካርቱም(Khartoum), በረሩ(they flew), ተከፈተ(opened), ከሱዳን(from Sudan)

In this table the male–female relationship is better captured when the context is 5 than when it is 2: the vector operation man(ወንድ) - king(ንጉሥ) + queen(ንግሥት) results in a vector close to woman(ሴት) given the wider window size. In general, a wider context (larger window size) improves semantic relatedness, while a narrow context (small window size) generates more inflected forms and is better for morphological relatedness.
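The comparison above can be reproduced by training two skip-gram models that differ only in their window size and querying the same word in both. This is only a sketch: corpus.txt is a hypothetical path to our preprocessed corpus, and all remaining parameters are left at the fastText defaults.

import fasttext

# Train two skip-gram models that differ only in the context window size.
model_ws2 = fasttext.train_unsupervised("corpus.txt", model="skipgram", ws=2)
model_ws5 = fasttext.train_unsupervised("corpus.txt", model="skipgram", ws=5)

# Compare the nearest neighbors of the same query word under both settings.
for name, m in [("ws=2", model_ws2), ("ws=5", model_ws5)]:
    print(name, m.get_nearest_neighbors("በላ", k=5))   # በላ = he eats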


Figure 4.4: PCA visualization showing both morphological and semantic relatedness

Semantic and morphological relatedness do not occur exclusively; these relationships appear together, as shown in Figure 4.4, where words that are semantically and morphologically related to the word ቤት (House) are visualized using the dimensionality-reduction method PCA.

The other comparison concerns the model type: CBOW vs SG. FastText supports both models for producing word vectors. Using the same parameters for both, the resulting embeddings do not differ greatly: as the table below shows, the skip-gram model performs slightly better and gives more reasonable results, but overall skip-gram and CBOW produce nearly equivalent output. Note that the underlined words in the table have no correlation to the task.


Table 4.7: Word relatedness with the two models: CBOW and SG

Model Word:በላ (he eats). Top 5 Nearest neighbors

SG ሊበላ(he-to eat), አከራይቶ(hire out), በላች(she eats), ለጠጣ (he who drinks), ከጠጣ(if he drinks)

CBOW በላች(she ate), ቁስል(wound), በላሁ(I ate) , ዓይኑም(and his eye), በላዩ(above him)

When we visualize our embedding using PCA and t-SNE, interesting results are displayed. To check the quality of the Amharic word embedding produced, the word vectors are projected onto a 2-dimensional space using the dimensionality-reduction technique t-SNE.
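A projection along these lines can be produced with scikit-learn and matplotlib; the sketch below assumes a trained model file (amharic_sg.bin, hypothetical name), and the word selection and t-SNE settings are illustrative rather than the exact ones used for the figures. An Amharic-capable font may be needed for the labels to render.

import fasttext
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

model = fasttext.load_model("amharic_sg.bin")        # hypothetical trained model
words = model.get_words()[:500]                      # first 500 vocabulary words (assumed frequency-ordered)
vectors = np.array([model.get_word_vector(w) for w in words])

# Project the word vectors onto 2-D with t-SNE using cosine distances.
coords = TSNE(n_components=2, metric="cosine", init="random",
              perplexity=30, random_state=0).fit_transform(vectors)

plt.figure(figsize=(12, 12))
plt.scatter(coords[:, 0], coords[:, 1], s=5)
for (x, y), w in zip(coords, words):
    plt.annotate(w, (x, y), fontsize=7)
plt.savefig("tsne_top500.png", dpi=200)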

Figure 4.5: t-SNE embedding of top 500 words (using default parameters)


Figure 4.6: Magnified clusters clipped from Figure 4.5

Figure 4.7: t-SNE embedding of top 300 words


A subset of words from the dataset, plotted using t-SNE, is visualized in the figures above. As expected, words with high similarity are clustered together.

When Figure 4.5 is analyzed closely, we realize that semantically and morphologically related words are clustered together. This can be observed in Figure 4.6, where the magnified clusters are clipped out separately. Semantically similar words like ዳታ (data) and መረጃ (information) are clustered together, and morphologically related ones like መንግሥት (government) and መንግሥቱን (the government) are placed close to each other.

i) Effect of corpus size

To analyze the effect of data size, we trained on the full corpus (the default for the rest of this work) and on a subset of it (extracted only for this comparison), using the default fastText parameters. We can see from Table 4.8 that when the model is trained on the larger dataset, the results move further toward morphological variants of the query words.

Table 4.8: Corpus size and word relatedness

Corpus size: 412K   Corpus size: >1M   Pretrained (by fastText)

በላ(He ate) በላ(He ate) በላ(He ate)

ትበላ (you eat) ሊበላ (he-to eat) ሊበላ (he-to eat)

በላዩ (over him) አከራይቶ (hire out) ይበላ (he would eat)

ጥዋትም (and morning) በላች (she ate) ስበላ (while I eat)

መብልም (and food) ሊጠጣ (he-to drink) ከበላ (if he ate)

ሊበላ (he-to eat) ከጠጣ (if he drank) ቆዳው(the skin)

ii) Dimension

The other parameter that needs attention is the dimension size used during training. A comparison between four dimension sizes, 50, 100, 200 and 300, shows a small improvement as the dimension increases. However, the difference in output between the two largest dimensions, 200 and 300, is negligible, as the table below shows.

Table 4.9: Dimension and word relatedness

dim   Word           k=5 nearest neighbors of the word

50    በላ(he eats)    ሊበላ(he-to eat), ቂጣና(bread and), አከራይቶ(hire out), ገፈራ(?), ይጠጣ(he can drink)

      እምነት(creed)    ለእምነት(for creed), ከእምነት(from creed), እምነትና(creed and), በእምነት(by creed), እምነትም(and creed)

100   በላ             ሊበላ, አከራይቶ, በላች(she ate), ለጠጣ(for who drinks), ከጠጣ(if he drinks)

      እምነት           በእምነት, እምነትም, ለእምነት, ከእምነት, እምነትና

200   በላ             በላች, በላህ(you ate), በላው(he ate it), በላሁ(I ate), ጠጉሩም(and his hair)

      እምነት           ለእምነት, እምነትም, ከእምነት, በእምነት, እምነትና

300   በላ             ገፈራ(?), በላች, በላህ, ሊበላ, ከጠጣ

      እምነት           ለእምነት, በእምነት, ከእምነት, እምነትም, የእምነት(to creed)


Figure 4.8: PCA visualization showing word relationship

To summarize our observations from the intrinsic evaluations: the fastText Amharic word embeddings perform better on morphological similarities, owing to the language's richness in morphemes, and fine-tuning the hyper-parameters further improves the word-similarity results.

4.3.1.4 Out of vocabulary words and odd-word out

One of fastText's distinguishing features compared to models such as GloVe and word2vec is its ability to handle new, unknown, rare, misspelled and out-of-vocabulary words, i.e. words that are not elements of the training vocabulary. Given a word w with w ∉ V, where V is the vocabulary, fastText can still produce a vector representation of w: the constituent character n-grams of w with lengths between the two parameters -minn and -maxn are taken into account, and the vector representation of w is computed from the vector representations of those fragments. Summing up the n-gram (subword) vectors results in the vector representation of the OOV word w.
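The subword decomposition just described can be inspected directly through the Python bindings, as in the following sketch (the model file name is hypothetical; the query word is one of the invented OOV words discussed below).

import fasttext

model = fasttext.load_model("amharic_sg.bin")    # hypothetical trained model

oov = "ቅድስታምረት"                                # invented word, not in the training set
subwords, indices = model.get_subwords(oov)       # character n-grams between -minn and -maxn
print(subwords)                                   # fragments from which the OOV vector is built

vec = model.get_word_vector(oov)                  # vector composed from the subword vectors
print(vec.shape)                                  # e.g. (100,) for a 100-dimensional model

print(model.get_nearest_neighbors(oov, k=5))      # OOV words still have nearest neighbors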

Figure 4.9 and Figure 4.10 below show two sample OOV words, w1 = "ቅድስታምረት" and w2 = "ሰበርታምዕራ". These words are not part of the training set, and they are not common Amharic words either, yet their vector representations are computed from the constituent character n-grams (between -minn and -maxn) of each word.

Figure 4.9: 100-dimensional word embedding of the OOV word ሰበርታምዕራ

Figure 4.10: 100-dimensional word embedding of the OOV word ቅድስታምረት

Since these unknown words are represented as vectors, they share the linguistic properties of in-vocabulary words: they have nearest and farthest neighbors, similarity scores, analogical relationships and so on.

Table 4.10 shows the nearest neighbors of those OOV words and the cosine distances between the neighbors and the OOV words.


Table 4.10: Nearest neighbors for OOV words and their cosine distance

W1: ቅድስታምረት nn ዘረ ቄጤማ ቅሪት ድፍረት ጥንተ

cosdis 0.857 0.841 0.840 0.838 0.833

W2: ሰበርታምዕራ nn በምዕራባዊ ወደምዕራባዊ ምዕራባዊ በስተምስራቅ ደቡብና

cosdis 0.894 0.893 0.882 0.856 0.852

As the table shows, the nearest words for the OOV words w1 and w2 can be retrieved: nn (nearest neighbor) lists the words closest to w1 and w2, and cosdis gives the cosine score between each neighbor and the corresponding OOV word.

Words and concepts related to w1 and w2 receive higher cosine scores, i.e. they lie closer to them in the embedding space.

Table 4.11: Odd word out results

I II III IV

ንጉሥ(king) ልዑል(prince) ንግሥት(queen) መምህር(teacher)

ታሪክ(story) ፊልም(film) ድራማ(drama) ተማሪ(student)

ጋዜጣ(gazette) ራዲዮ(radio) ቴሌቪዥን(TV) ዜና(news)

ተጓዘ(walk) ሮጠ(run) መንገድ(road) ሽንት(urine)

Odd-word out is not related to OOV handling, but it is another task that demonstrates a useful feature of word embeddings. Testing our embedding on identifying the word in a list that does not go with the others gives promising results: it can easily pick the word that does not belong (whether semantically, syntactically or morphologically). In Table 4.11 above, the words in column IV are the odd words relative to the other three words in the same row, while the words in bold are the ones the model actually returned as odd. The odd words are correctly isolated in most of the tests, with some exceptions such as row II: there, since ታሪክ(story) is more related to ፊልም(film) and ድራማ(drama) than ተማሪ(student) is, the model should have returned ተማሪ(student) as the odd word instead of ታሪክ(story).
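The fastText bindings do not ship an odd-word-out helper, so this test can follow the common approach of flagging the word whose vector is least similar to the mean of the group; a minimal sketch of that idea is shown below (model file name hypothetical).

import numpy as np
import fasttext

model = fasttext.load_model("amharic_sg.bin")   # hypothetical trained model

def odd_one_out(words):
    # The odd word is the one least similar (by cosine) to the mean vector of the group.
    vecs = np.array([model.get_word_vector(w) for w in words])
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)   # unit-normalize
    mean = vecs.mean(axis=0)
    mean = mean / np.linalg.norm(mean)
    sims = vecs @ mean
    return words[int(np.argmin(sims))]

# Row I of Table 4.11: king, prince, queen, teacher -> expected odd word: teacher
print(odd_one_out(["ንጉሥ", "ልዑል", "ንግሥት", "መምህር"]))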

4.3.2 Extrinsic evaluation

This is another evaluation method to investigate and analyze the contribution of word embeddings in tackling downstream NLP problems such as text classification and POS tagging. We used multi-class text classification for this evaluation task.

4.3.2.1 Text classification

We used our preprocessed news dataset for the text-classification task. The dataset has four labels: Athletics, Economy, Football and Chess. Each line of the news file starts with a __label__ prefix so that fastText can distinguish the label from the words of the news item.
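For illustration, each line of the classification input file has the shape sketched below, with the label prefix followed by the preprocessed news text; the texts shown are placeholders, not items from our dataset, and __label__ is fastText's default label prefix.

__label__Economy <preprocessed Amharic news text>
__label__Football <preprocessed Amharic news text>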

Table 4.12: Number of news items in each group for each split ratio

Label   70/30 (train / validation)   80/20 (train / validation)   90/10 (train / validation)

Football 213 119 249 83 293 39

Athletics 134 20 152 2 152 2

Chess 12 9 12 9 12 9

Economy 271 122 307 86 353 40

Before training our classifier, the dataset was split into train and validation sets. A number of experiments were carried out with 90/10, 80/20 and 70/30 proportions of the dataset, where the proportion is the ratio of the train set to the validation set (for example, 80/20 means 80% of the data for training and 20% for validation/testing). The train and validation datasets are stored in text format, and the number of news items in each group for each split ratio is presented in Table 4.12. We trained on the dataset with different hyper-parameters and evaluated the precision, recall and accuracy of predicting the class of a new item. The following table shows the overall results of tests conducted with default parameters but varying numbers of epochs.

Table 4.13: Precision and Recall at K=2, and F1-score using different epochs

Epoch 70/30 80/20 90/10

P@2 R@2 F1 P@2 R@2 F1 P@2 R@2 F1

5 0.474 0.948 0.632 0.472 0.944 0.629 0.444 0.889 0.592

25 0.481 0.963 0.641 0.475 0.950 0.633 0.450 0.900 0.600

50 0.493 0.985 0.657 0.483 0.967 0.644 0.467 0.933 0.622

100 0.498 0.996 0.640 0.500 1.000 0.666 0.500 1.000 0.666

Table 4.14: F1-score at K=1 using different epochs

Epoch 70/30 F1 80/20 F1 90/10 F1

5 0.722 0.689 0.833

25 0.952 0.944 0.889

50 0.948 0.944 0.889

100 0.959 0.961 0.922

As the tables show, precision, recall and F1-score all tend to increase as the number of epochs grows. Taking the 80/20 split as an example, the F1-score of our classifier reaches 0.961 at 100 epochs. In other words, increasing the number of epochs steadily improves the quality of the classifier.
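Scores of this kind are obtained from the fastText supervised classifier; the following sketch shows how precision and recall at k can be computed (the train/validation file names and the epoch value are hypothetical, and all other hyper-parameters are left at their defaults).

import fasttext

# Train the classifier on the labelled news file (hypothetical paths and epoch value).
clf = fasttext.train_supervised(input="news.train", epoch=50)

# model.test returns (number of samples, precision@k, recall@k).
n, p_at_2, r_at_2 = clf.test("news.valid", k=2)
f1 = 2 * p_at_2 * r_at_2 / (p_at_2 + r_at_2)
print(n, p_at_2, r_at_2, f1)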

Table 4.15: Example from a validation set obtained with 100,000 epochs on 80/20 sample, label predictions included.

Input Prediction Evaluation

አንድ አሜሪክ ዶላር ብር ሳንቲም ተሸጠ የውጭ ምንዛሪ የውጭ ምንዛሪለኢንተርኔት ባንክ Economy Correct መካከል ዛሬ ተካሄድ የውጭ ምንዛሪ ገበያ አንድ አሜሪክ ዶላር አማካይ ብር ሳንቲም መሸጡን ኢትዮጵይ ብሄራዊ ባንክ አስታወቅ ባንኩ ዛሬ ኢዜእ ገለፀ በእለቱ ባንክ መካከል ተካሄድ የውጭ ምንዛሪ ገበያ ለአንድ አሜሪክ ዶላር ቀረብ ከፍተኛ ዋጋ ብር ሳንቲም ዝቅተኛው ዋጋ ደግሞ ብር ሳንቲም ሆኗል በሌላ በኩል ደግሞ በአድስ አበባ ማእከላው ገበያ ዛሬ ሽህ ኩንታል ደረቅና ሽህ ኩንታል ታጠብ ቡና መቅረብ ቡን ሻይ ባለስልጣን አስታውቋል በእለቱ ሽህ ኩንታል ደረቅና ሽህ ኩንታል ታጠብ ቡና ተሸጧል አንድ ፈረሱላ ታጠብ ቡና ተሸጥ የእለቱ አማካይ ዋጋ 192ብር ሳንቲም ሲሆን ደረቅ ቡና ደግሞ ብር ሆኗል

አትሌት ቀነኒሳ አሸነፈ Athletics Correct

የኢትዮጵያ ፕሪምየር ሊግ ተስተካካይ ጨዋታዎች በተለያዩ ከተሞች ተከናውነዋል በዛሬው Football Correct ተስተካካይ ጨዋታ አዳማ፣ መቀሌና አዲስ አበባ ላይ ሦስት የፕሪምየር ሊጉ ጨዋታዎች ተካሂደዋል በዚህም አዳማ ላይ በአበበ ቢቂላ ስታዲየም ቅዱስ ጊዮርጊስን ያስተናገደው አዳማ ከተማ 0 ለ 0 በሆነ ውጤት ጨዋታውን አጠናቋል የሊጉ መሪ ቅዱስ ጊዮርጊስ ከአዳማ ከተማ ጋር ያደረገው ጨዋታ ከተከታዮቹ ጋር ያለውን ነጥብ ማስፋት የሚችልበት ዕድል ነበር በሊጉ የመጨረሻ ደረጃ ላይ የሚገኘው ደደቢት መቀሌ ላይ መከላከያን አስተናግዶ 3 ለ 0 በሆነ ውጤት ተሸንፏል ሦስተኛው የፕሪምየር ሊግ ተስተካካይ ጨዋታ በወላይታ ድቻና ሲዳማ ቡና መካከል ማምሻውን በአዲስ አበባ ስታዲየም ተካሂዷል በዚህ በሁለተኛ ሳምንት ተስተካካይ ጨዋታቸው ሲዳማ ቡና 2 ለ 1 በሆነ ውጤት ወላይታ ድቻን አሸንፏል

ሮናልዶ በአንድ ጨዋታ ሶስት ጎል አስገባ Football Correct

ቼዝ ጨዋታ በአማራና ደቡብ ክልሎች እየተካሄደ መሆኑ ተነገረ Football Incorrect

መቀሌ መጋቢት 6/2011 በትግራይ ለመጀመሪያ ጊዜ የተዘጋጀ ክልል አቀፍ የቼዝ Chess Correct ውድድር ዛሬ በመቀሌ ከተማ ተጀመረ በውድድሩ በአገር አቀፍ ደረጃ ከሚያዝያ 5 እስከ 16 ቀን 2011 ዓ ም አዲስ አበባ ላይ ለሚካሄደው የክለቦችና የግል ሻምፒዮና ውድድር ክልሉን ወክለው የሚሳተፉ ስፖርተኞች እንደሚመረጡ ተነግሯል የትግራይ ክልል ቼዝ ፌዴሬሽን ከፍተኛ ባለሙያ አቶ ካህሳይ ፍሰሃ ለኢትዮጵያ ዜና አገልግሎት እንደገለፁት በክልሉ ከዚህ ቀደም የቼዝ ስፖርት ትኩረት የተሰጠው አልነበረም ውድድሩም ከግል ደረጃ ያለፈ ባለመሆኑ የዘርፉ ስፖርት ሳያድግ መቆየቱን ጠቁመዋል “ዘንድሮ ለዘርፉ በተሰጠው ትኩረት በክልል ደረጃ 12 ክለቦች ተደራጅተው ውድድሩ


ተጀምሯል” ብለዋል ከክለቦቹ መካከል አራቱ የሴቶች መሆናቸውን የገለፁት ባለሙያው በግልና በክለቦች መካከል ለአራት ቀን በሚካሄደው ውድድር 100 ስፖርተኞች ተሳታፊ መሆናቸውን ተናግረዋል እንደ ባለሙያው ገለጻ በውድድሩ እስከ ሦስተኛ ደረጃ በመያዝ ለሚያጠናቅቁ አሸናፊዎች የሜዳሊያ የዋንጫና የገንዘብ ሽልማት ይበረከታል ከተወዳዳሪዎች መካከል ወጣት ያሬድ ኃላፎም በሰጠው አስተያየት ” ስፖርቱ በመንግስትና በስፖርት ማህበረሰቡ ትኩረት ከተሰጠው በክልልና በሀገር ደረጃ የሚወከሉ ተወዳዳሪዎችን ማፍራት ይቻላል” ብለዋል ወጣት ሳምራዊት ተክሉ በበኩሏ “ቼዝ የአካል ጥንካሬን ሳይሆን የአዕምሮ ብቃት የሚጠይቅ የስፖርት አይነት ነው” ብላለች ለስፖርት አይነቱ ትኩረት ሊሰጥ እንደሚገባም ወጧቷ አመልክታለች

ድሬዳዋ ሀምሌ 2/2010 በድሬዳዋ ለአንድ ሳምንት ሲካሄድ የቆየው የክልሎች ከ20 Chess Correct ዓመት በታች የቼዝ ሻምፒዮና ውድድር በቡድን የትግራይ ክልል አሸናፊነት ተጠናቀቀ ድሬዳዋና አማራ በግል የተካሄደውን በአሸናፊነት አጠናቀው የዋንጫና የወርቅ መዳሊያ ተሸላሚ ሆነዋል ዛሬ ማምሻውን በተካሄደው የመዝጊያ ሥነ-ሥርዓት ላይ በበላይነት ውድድሩን ያጠናቀቁ ስፖርተኞች በቀጣይ ሀገራቸውን በስፖርቱ ብርቱ ተፎካካሪ ለማድረግ እንደሚሰሩ አስተያየታቸውን ለኢዜአ ሰጥተዋል የኢትዮጵያ ቼዝ ፌደሬሽን ከድሬዳዋ አስተዳደር ጋር በመተባበር ባዘጋጀው ከ20 ዓመት በታች ሀገር አቀፍ የቼዝ ሻምፒዮና ውድድር ላይ ድሬዳዋና አዲስ አበባን ጨምሮ የትግራይ የአማራ የደቡብና የኦሮሚያ ክልሎች ተካፍለዋል ከሰኔ 25 እስከ ሐምሌ 2 ቀን 2010 በተካሄደው በዚሁ ውድድር የትግራይ ቡድን በሴትና በወንድ የቡድን ውድድሮች በአንደኝነት በማጠናቀቅ የዋንጫና የወርቅ መዳሊያ ተሸላሚ በመሆን ውድድሩን በበላይነት አጠናቋል በሴቶች የቡድን ውድድር ሁለተኛ አማራ ሶስተኛ ደግሞ ኦሮሚያ ሲሆኑ በወንዶቹ ምድብ ደግሞ ኦሮሚያ ሁለተኛ ድሬዳዋ ሶስተኛ ሆነዋል በግል የወንዶች ውድድር የድሬዳዋ ማራኪ እንድሪያስ የበላይ ሆኖ የወርቅ መዳሊያና የዋንጫ ባለቤት ሲሆን በሴቶች የግል አሸናፊ የሆነችው የአማራ ክልል ስፖርተኛ መቅደስ አማረ ናት በሽልማት ሥነ-ሥርዓት ላይ የትግራይ ስፖርተኛ ክብሮም በርሔ በሰጠው አስተያየት ”ክልላችን ለስፖርቱ ልዩ ትኩረት ሰጥቶት በመሰራቱ ውጤታማ ሆነናል” ብሏል ”በቀጣይ ሀገሬን ልክ እንደአትሌቲክሱ በአለም አደባባይ የማስጠራት ዓላማዬን ለማሳካት ጠንክሬ እሰራለሁ” ብሏል ”ውድድሩ ችሎታዬን ይበልጥ እንዳሳይ አድርጎኛል አንድ ቀን ድሬደዋንም ሆነ ሀገሬን በስፖርቱ ስማቸው እንዲነሳ አደርጋሁ” ያለው ደግሞ በወንዶች የግል ውድድር የበላይ ሆኖ የወርቅና የዋንጫ ተሸላሚ የሆነው ማራኪ እንድሪያስ ነው የአማራ ክልልን ውጤታማ ያደረገችው መቅደስ አማረ ውድድሩ ጥሩና ጠንካራ ፉክክር የታየበት እንደነበር ጠቅሳ ስፖርቱ ይበልጥ መሰረቱ እንዲጠናከር ከ20 ዓመት በታች ለሆኑት ወጣቶች ተለይቶ የተዘጋጀው ውድድር ሊበረታታ እንደሚገባ ተናግራለች የኢትዮጵያ የቼዝ ፌደሬሽን ጽህፈት ቤት ኃላፊ አቶ ሰይፉ በላይነህ ለአንድ ሳምንት ከተካሄደው የታዳጊ ወጣቶች ውድድር ለብሔራዊ ቡድን የሚመጥኑ ምርጥ ስፖርተኞች እንደተገኙበት ጠቅሰው እነዚህ ስፖርኞች ይበልጥ ውጤታማ እንዲሆኑ በትኩረት እንደሚሰራ ገልፀዋል ስፖርቱ ይበልጥ እንዲጠናከር በየዕድሜ


እርከኑ በርከታ ውድድሮች እየተዘጋጁ መሆኑንና እድሜያቸው ከ13 ዓመት ጀምሮ ያሉ ወጣቶች በፕሮጀክት እንዲታቀፉ እየተደረገ መሆኑን ተናግረዋል በመዝጊያው ሥነ- ሥርዓት ላይ ለአሸናፊዎቹ የተዘጋውን የዋንጫ የወርቅ የብርና የነሐስ መዳሊያ የሸለሙት የድሬዳዋ አስተዳደር ወጣቶችና ስፖርት ኮሚሽን ኮሚሽነር ከድር ጁሃርና ሌሎች እንግዶች ናቸው

We also tested a model trained for 100,000 epochs, with the other hyper-parameters left at their defaults; training took about 2 hours on an Intel® Core™ i3 (64-bit) processor. This test produced an F1-score of 0.978 on the 80/20 split of the dataset. Varying other parameters such as window size and dimension yielded only marginal improvements in classifier quality, while further parameter tuning gave a significant boost in speed and quality. Overall, the Amharic word embeddings can be used for text-classification tasks with better quality than traditional methods.
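Once trained, the classifier can label a new, unseen news item with a call such as the one below (a hedged sketch; the saved model name is a placeholder, and the input string is one of the headlines from Table 4.15).

import fasttext

clf = fasttext.load_model("news_classifier.bin")   # hypothetical saved classifier

# Predict the most probable label (and its probability) for a new headline.
labels, probs = clf.predict("ሮናልዶ በአንድ ጨዋታ ሶስት ጎል አስገባ", k=1)
print(labels[0], float(probs[0]))   # e.g. __label__Football with its confidence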

4.4 Summary

Word embeddings for Amharic prove to be a useful tool for analyzing the language's linguistic properties and for bringing mathematical treatment to the field. They capture properties such as context, semantic, syntactic and analogical relations. The embedding is also helpful for downstream tasks such as text classification. Embedding words as vectors in a space helps to learn the features and patterns that natural languages form.

The word embeddings capture relationships not only for words that are part of the vocabulary, but also for words that are rare, misspelled or out-of-vocabulary. The results obtained are very promising for learning language features, relationships and patterns in Amharic. The morphological richness of Amharic, however, was found to influence the results.


CHAPTER 5

CONCLUSION, RECOMMENDATION AND FUTURE WORKS

5.1 Conclusion

In this work we explored word embeddings for the Amharic language. A thorough discussion of word embeddings in general was conducted, from their history to current state-of-the-art findings, methodologies and architectures. The word embeddings were then analyzed with respect to properties of the language, such as similarity, analogy, and syntactic and morphological relatedness.

As is common in NLP, datasets (corpora) were prepared for building the Amharic word embedding. The steps needed to make the corpus ready, collectively called preprocessing, were carried out before training: punctuation marks, extra empty spaces and numerals were removed. Two datasets were used, one for the intrinsic evaluation of the embedding and the other for the extrinsic evaluation; the latter was formatted to suit the library we used, fastText.

The resulting embedding was evaluated for quality, speed and its ability to capture meaningful representations, using two evaluation techniques: intrinsic and extrinsic. The intrinsic evaluation asked how well the vectors in the embedding capture linguistic relationships between words; the relationships under consideration were word similarity, word analogy, OOV words and odd-word out. We saw that words that are similar or analogous to each other occur close together in the space, and related Amharic words are found contiguous to one another in the vector space. The analogy relationships we found were quite congruent with the work of other researchers on other languages: the word embedding automatically learned that the vector "ንጉሥ - ወንድ + ሴት" is closest to the vector of "ንግሥት". It was also shown that words that were not part of the training set, or were rare or misspelled, still receive usable vector representations. The extrinsic evaluation focused on the ability of the word embeddings to contribute to downstream tasks; for this we used multiclass text classification. The experimental results vary as the hyper-parameters are tuned, and precision, recall and F1-score fluctuate with the parameters; however, across the tested split ratios and parameters, the word embedding attained an F1-score of up to 97.8% in text classification.

5.2 Recommendation

Word embeddings can be used in a variety of tasks: sentiment analysis, NER, co-reference resolution, semantic-role labeling (SRL) and other NLP tasks can all use word embeddings as a tool to improve speed and efficiency. Since this work only explores how an Amharic word embedding behaves in sample NLP tasks such as text classification, we recommend that other NLP tasks be thoroughly studied using word embeddings. Two challenges in this respect are the scarcity of training data and the morphological richness of Amharic.

Therefore, researchers who are able to prepare a large Amharic dataset could investigate the morphological effects of Amharic on word embeddings and the role of Amharic word embeddings in various NLP tasks.

This work can serve as a starting point for NLP work involving word embeddings for the Amharic language.

5.3 Future Works

There are issues closely linked to this work that need to be addressed and studied in the future. Since word embeddings have many uses, exploring the ways these embeddings can serve in various tasks is one of the directions for future work.

Generally speaking, the following works are planned for the future:

 The impact of word embeddings on bias and stereotypes

 The role of word embeddings in word sense disambiguation, NER, POS tagging and SRL.

 Utilizing a big and rich corpus to experiment with Amharic word embeddings.


 Studying character-level word embeddings and bilingual word embeddings with Arabic or other Semitic languages.


REFERENCES

Al-Rfou, R., Perozzi, B., & Skiena, S. (2013). Polyglot: Distributed Word Representations for Multilingual NLP. 10.

Armbruster. (1908). Initia Amharica, an introduction to spoken Amharic. 432.

Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural Machine Translation by Jointly Learning to Align and Translate. ArXiv:1409.0473 [Cs, Stat]. Retrieved from http://arxiv.org/abs/1409.0473

Bakarov, A. (2018). A Survey of Word Embeddings Evaluation Methods. ArXiv:1801.09536 [Cs]. Retrieved from http://arxiv.org/abs/1801.09536

Basirat, A. (2018). Principal Word Vectors. Retrieved from http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-353866

Bengio, Y., Ducharme, R., Vincent, P., & Jauvin, C. (2003). A Neural Probabilistic Language Model. 19.

Bernstein, H., Gelatt, J., Hanson, D., & Monson, W. (2014). Ten Years of Language Access in Washington, DC. 43.

Bhattacharjee, J. (2018). fastText quick start guide: get started with Facebook's library for text representation and classification. Retrieved from http://proxy2.hec.ca/login?url=http://proquestcombo.safaribooksonline.com/?uiCode=hecmontreal&xmlId=9781789130997

Blei, D. M. (2003). Latent Dirichlet Allocation. 30.

Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics, 5, 135–146. https://doi.org/10.1162/tacl_a_00051

Christodouloupoulos, C., & Steedman, M. (2015). A massively parallel corpus: the Bible in 100 languages. Language Resources and Evaluation, 49(2), 375–395. https://doi.org/10.1007/s10579-014-9287-y


Collobert, R., & Weston, J. (2008). A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning. 8.

Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., & Kuksa, P. (2011). Natural Language Processing (Almost) from Scratch. NATURAL LANGUAGE PROCESSING, 45.

Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297. https://doi.org/10.1007/BF00994018

Demissie, D. (2017). Addis Ababa Institute of Technology School of Electrical and Computer Engineering. 74.

Devlin, J., Cheng, H., Fang, H., Gupta, S., Deng, L., He, X., … Mitchell, M. (2015). Language Models for Image Captioning: The Quirks and What Works. ArXiv:1505.01809 [Cs]. Retrieved from http://arxiv.org/abs/1505.01809

Elman, J. L. (1990). Finding Structure in Time. 16.

Elrazzaz, M., Elbassuoni, S., Shaban, K., & Helwe, C. (2017). Methodical Evaluation of Arabic Word Embeddings. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 454–458. Retrieved from http://aclweb.org/anthology/P17-2072

Eyassu, S., & Gambäck, B. (2005). Classifying Amharic news text using self-organizing maps. Proceedings of the ACL Workshop on Computational Approaches to Semitic Languages - Semitic ’05, 71. https://doi.org/10.3115/1621787.1621801

Farzindar, A., & Inkpen, D. (2015). Natural Language Processing for Social Media. Synthesis Lectures on Human Language Technologies, 8(2), 1–166. https://doi.org/10.2200/S00659ED1V01Y201508HLT030

Feng, J., Xu, H., & Yan, S. (2012). Robust PCA in High-dimension: A Deterministic Approach. 8.

Firth, J. R. (1935). THE TECHNIQUE OF SEMANTICS. Transactions of the Philological Society, 34(1), 36–73. https://doi.org/10.1111/j.1467-968X.1935.tb01254.x


Fu, G. (2009). Chinese Named Entity Recognition Using a Morpheme-Based Chunking Tagger. 2009 International Conference on Asian Language Processing, 289–292. https://doi.org/10.1109/IALP.2009.68

Gambäck, B., Olsson, F., Alemu Argaw, A., & Asker, L. (2009). Methods for Amharic Part- of-Speech Tagging. Proceedings of the First Workshop on Language Technologies for African Languages, 104–111. Retrieved from http://www.aclweb.org/anthology/W09-0715

Gambäck, B., Sahlgren, M., Alemu, A., & Asker, L. (2014). Applying Machine Learning to Amharic Text Classification.

Graves, A., Mohamed, A., & Hinton, G. (2013). Speech Recognition with Deep Recurrent Neural Networks. ArXiv:1303.5778 [Cs]. Retrieved from http://arxiv.org/abs/1303.5778

Harris, Z. S. (1954). Distributional Structure. WORD, 10(2–3), 146–162. https://doi.org/10.1080/00437956.1954.11659520

Hassan, A., & Mahmood, A. (2017). Efficient Deep Learning Model for Text Classification Based on Recurrent and Convolutional Layers. 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), 1108–1113. https://doi.org/10.1109/ICMLA.2017.00009

Joulin, A., Grave, E., Bojanowski, P., & Mikolov, T. (2016). Bag of Tricks for Efficient Text Classification. ArXiv:1607.01759 [Cs]. Retrieved from http://arxiv.org/abs/1607.01759

Kelemework, W. (2009). AUTOMATIC AMHARIC TEXT NEWS CLASSIFICATION: A NEURAL NETWORKS APPROACH. 139.

Khashman, A., & Dimililer, K. (2008). Image Compression using Neural Networks and Haar Wavelet. 4(5), 10.

Kibriya, A. M., Frank, E., Pfahringer, B., & Holmes, G. (2004). Multinomial Naive Bayes for Text Categorization Revisited. In G. I. Webb & X. Yu (Eds.), AI 2004: Advances in Artificial Intelligence (Vol. 3339, pp. 488–499). https://doi.org/10.1007/978-3-540-30549-1_43

Kim, Y. (2014). Convolutional Neural Networks for Sentence Classification. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1746–1751. Retrieved from http://www.aclweb.org/anthology/D14-1181

Kirsal Ever, Y., & Dimililer, K. (2018). The effectiveness of a new classification system in higher education as a new e-learning tool. Quality & Quantity, 52(S1), 573–582. https://doi.org/10.1007/s11135-017-0636-y

Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43(1), 59–69. https://doi.org/10.1007/BF00337288

Krabben, K. (2010). Machine Learning vs. Knowlegde Engineering in Classification of Sentences in Dutch Law. 23.

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. In F. Pereira, C. J. C. Burges, L. Bottou, & K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 25 (pp. 1097–1105). Retrieved from http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf

Lund, K., & Burgess, C. (1996). Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments, & Computers, 28(2), 203– 208. https://doi.org/10.3758/BF03204766

Mandelbaum, A., & Shalev, A. (2016). Word Embeddings and Their Use In Sentence Classification Tasks. ArXiv:1610.08229 [Cs]. Retrieved from http://arxiv.org/abs/1610.08229

Meyer, R. (2006). Amharic as lingua franca in Ethiopia. Lissan: Journal of African Languages and Linguistics 20,1/2: 117-131. Retrieved from https://www.academia.edu/5514187/Amharic_as_lingua_franca_in_Ethiopia


Miikkulainen, R. (2010). Simple Recurrent Network. In C. Sammut & G. I. Webb (Eds.), Encyclopedia of Machine Learning (pp. 906–906). https://doi.org/10.1007/978-0-387-30164-8_762

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. ArXiv:1301.3781 [Cs]. Retrieved from http://arxiv.org/abs/1301.3781

Mikolov, T., Karafiat, M., Burget, L., Cernocky, J., & Khudanpur, S. (2010). Recurrent Neural Network Based Language Model. 4.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. ArXiv:1310.4546 [Cs, Stat]. Retrieved from http://arxiv.org/abs/1310.4546

Mikolov, T., Yih, W., & Zweig, G. (2013). Linguistic Regularities in Continuous Space Word Representations. 6.

Mykowiecka, A., Marciniak, M., & Rychlik, P. (2017). Testing word embeddings for Polish. Cognitive Studies | Études Cognitives, (17). https://doi.org/10.11649/cs.1468

Narayanan, V., Arora, I., & Bhatia, A. (2013). Fast and accurate sentiment classification using an enhanced Naive Bayes model. ArXiv:1305.6143 [Cs], 8206, 194–201. https://doi.org/10.1007/978-3-642-41278-3_24

Negga, W. (2000). Wa zé ma. Croydon: W. Negga.

Osgood, C. E. (1964). Semantic Differential Technique in the Comparative Study of Cultures. American Anthropologist, 66(3), 171–200. Retrieved from JSTOR.

Pang, B., Lee, L., & Vaithyanathan, S. (2002). Thumbs up? Sentiment Classification using Machine Learning Techniques. Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing, 79–86. https://doi.org/10.3115/1118693.1118704


Pawar, P. Y., & Gawande, S. H. (2012). A Comparative Study on Different Types of Approaches to Text Categorization. International Journal of Machine Learning and Computing, 423–426. https://doi.org/10.7763/IJMLC.2012.V2.158

Pennington, J., Socher, R., & Manning, C. (2014). Glove: Global Vectors for Word Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543. Retrieved from http://www.aclweb.org/anthology/D14-1162

Pritchard, J. K., Stephens, M., & Donnelly, P. (2000). Inference of Population Structure Using Multilocus Genotype Data. 15.

Ritter, H., & Kohonen, T. (1989). Self-organizing semantic maps. Biological Cybernetics, 61(4), 241–254. https://doi.org/10.1007/BF00203171

Salton, G., & Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5), 513–523. https://doi.org/10.1016/0306-4573(88)90021-0

Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 1–47. https://doi.org/10.1145/505282.505283

Sienčnik, S. K. (2015). Adapting word2vec to Named Entity Recognition. 5.

Soliman, A. B., Eissa, K., & El-Beltagy, S. R. (2017). AraVec: A set of Arabic Word Embedding Models for use in Arabic NLP. Procedia Computer Science, 117, 256– 265. https://doi.org/10.1016/j.procs.2017.10.117

Svoboda, L., & Beliga, S. (2018). Evaluation of Croatian Word Embeddings. 7.

Tedla, Y., & Yamamoto, K. (2017). Analyzing word embeddings and improving POS tagger of tigrinya. 2017 International Conference on Asian Language Processing (IALP), 115–118. https://doi.org/10.1109/IALP.2017.8300559

Tegegnie, A. K. (2010). HIERARCHICAL AMHARIC NEWS TEXT CLASSIFICATION. 119.


Tripodi, R., & Pira, S. L. (2017). Analysis of Italian Word Embeddings. ArXiv:1707.08783 [Cs]. Retrieved from http://arxiv.org/abs/1707.08783

Turian, J., Bengio, Y., Ratinov, L., & Roth, D. (2010). A preliminary evaluation of word representations for named-entity recognition. 8.

Turney, P. D. (2002). Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews. ArXiv:Cs/0212032. Retrieved from http://arxiv.org/abs/cs/0212032

Vasic, D., & Brajkovic, E. (2018). Development and Evaluation of Word Embeddings for Morphologically Rich Languages. 2018 26th International Conference on Software, Telecommunications and Computer Networks (SoftCOM), 1–5. https://doi.org/10.23919/SOFTCOM.2018.8555822

Vo, D. T., & Zhang, Y. (2016). Don’t Count, Predict! An Automatic Approach to Learning Sentiment Lexicons for Short Text. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 219–224. Retrieved from http://anthology.aclweb.org/P16-2036

Weldesellassie. (2003). Automatic Categorization of Amharic news text: a machine learning approach. Addis Ababa University, Addis Ababa, Ethopia.

Weninger, S. (Ed.). (2011). The Semitic Languages: An International Handbook. Retrieved March 3, 2019, from https://www.academia.edu/6917762/Stefan_Weninger_et_al._eds._THE_SEMITIC_LANGUAGES_An_International_Handbook

Woodard, R. D. (Ed.). (2008). The ancient languages of Mesopotamia, Egypt and Aksum. Cambridge ; New York: Cambridge University Press.

Zhang, X., Zhao, J., & LeCun, Y. (2015). Character-level Convolutional Networks for Text Classification. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 28 (pp. 649–657). Retrieved from http://papers.nips.cc/paper/5782-character-level-convolutional-networks-for-text-classification.pdf

Zhang, Y., & Wallace, B. (2015). A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification. ArXiv:1510.03820 [Cs]. Retrieved from http://arxiv.org/abs/1510.03820


APPENDIX 1 AMHARIC PUNCTUATION MARKS AND BASIC ETHIOPIC NUMBERS

Arabic Number   Ethiopic Number   Amharic Punctuation Marks and their description

1 ፩ . (Period)

2 ፪ ፡ (Word space)

3 ፫ … (Ellipsis)

4 ፬ ። (Full stop)

5 ፭ ፣ or ፥ (Comma)

6 ፮ ፤ (Semi-colon)

7 ፯ ፦ (Preface colon)

8 ፰ ፧ (Question mark)

9 ፱ ፠ (Section mark)
