Topic Modelling and Sentiment Analysis with the Bangla Language: a Deep Learning Approach Combined with the Latent Dirichlet Allocation

by Mustakim Al Helal

A thesis submitted to the Faculty of Graduate Studies and Research in partial fulfillment of the requirements for the degree of Master of Science in Computer Science

University of Regina
Regina, Saskatchewan
September, 2018

UNIVERSITY OF REGINA
FACULTY OF GRADUATE STUDIES AND RESEARCH
SUPERVISORY AND EXAMINING COMMITTEE

Mustakim Al Helal, candidate for the degree of Master of Science in Computer Science, has presented a thesis titled, Topic Modelling and Sentiment Analysis with the Bangla Language: A Deep Learning Approach Combined with the Latent Dirichlet Allocation, in an oral examination held on August 28, 2018. The following committee members have found the thesis acceptable in form and content, and that the candidate demonstrated satisfactory knowledge of the subject material.

External Examiner: *Dr. Yllias Chali, University of Lethbridge
Supervisor: Dr. Malek Mouhoub, Department of Computer Science
Committee Member: Dr. Samira Sadaoui, Department of Computer Science
Committee Member: Dr. David Gerhard, Department of Computer Science
Chair of Defense: Dr. Maria Velez-Caicedo, Department of Geology

*via SKYPE

Abstract

This thesis investigates topic modelling and sentiment analysis for the Bangla language, and makes one contribution in each area. We propose separate models for the topic modelling task and for the sentiment analysis task. Much research exists on both tasks, but it does not address the Bangla language. Topic modelling is a powerful technique for unsupervised analysis of large document collections. Various efficient topic modelling techniques are available for English, as it is one of the most widely spoken languages in the world, but not for other languages.
Bangla is the seventh most spoken native language in the world by population, and it needs automation in many respects. This thesis finds the core topics of a Bangla news corpus and classifies news with a similarity measure, which is one of the contributions. This is the first ever tool for Bangla topic modelling. The document models are built using LDA (Latent Dirichlet Allocation) with bigrams. In recent years, people in Bangladesh have become heavily involved in social media using Bangla text. They post their opinions about products or businesses across different social sites, of which Facebook is the most prominent. We collected data from Bangla Facebook comments and applied a state-of-the-art algorithm to extract the sentiments, which is the other contribution. Our proposed system demonstrates efficient sentiment analysis, and we compare it with the existing sentiment analysis system for Bangla. However, extracting sentiments from the Bengali language is not straightforward because of its complex grammatical structure. A deep learning based method was applied to train the model and capture the underlying sentiment. The main idea centres on word-level and character-level encoding, and on the difference in model performance between the two. Overall, we explore different algorithms and techniques for topic modelling and sentiment analysis for the Bangla language.

Acknowledgements

My first debt of gratitude goes to my supervisor Dr. Malek Mouhoub, who encouraged me to follow this path and provided me with constant support during my M.Sc. studies. I would like to express my sincere thanks for his valuable guidance, financial assistance and constant encouragement. His enthusiasm, patience and diverse knowledge helped and enlightened me on many occasions. The freedom in thinking I received from him helped me overcome the difficulties with my research.
I feel proud to be his research student and cannot imagine a better supervisor. I acknowledge the Faculty of Graduate Studies and Research for providing me with financial means, in the form of scholarships, which contributed towards my tuition fees. I thank the UR International office for helping me engage in different community work. I also thank Dr. Samira Sadaoui and Dr. David Gerhard, my thesis committee members, who read my thesis and provided invaluable suggestions and useful comments for improvement. I would like to take this opportunity to thank every member of the Department of Computer Science who has helped me throughout my studies. Last but not least, I would like to extend my deepest gratitude to my beloved parents for their unconditional love and support throughout my entire life. It would not have been possible for me to come to Canada and achieve a prestigious scholarship without my parents' never-ending support. I dedicate my hard-earned M.Sc. degree to my beloved parents.

POST DEFENCE ACKNOWLEDGEMENT

My thanks go to Dr. Yllias Chali of the University of Lethbridge for being the external examiner for my M.Sc. thesis and for providing me with his invaluable comments and suggestions.

Contents

Abstract
Acknowledgements
List of Figures
List of Tables
Abbreviations

1 Introduction
    1.1 Problem Statement and Motivations
    1.2 Proposed Solution and Contributions
    1.3 Thesis Organization
2 Literature Review and Background
    2.1 Literature Study
    2.2 Background Knowledge
        2.2.1 Topic Modelling
        2.2.2 Latent Dirichlet Allocation
        2.2.3 Latent Semantic Indexing
        2.2.4 Hierarchical Dirichlet Process
        2.2.5 Singular Value Decomposition
        2.2.6 Evaluation of Topics
        2.2.7 Recurrent Neural Network
        2.2.8 Long Short Term Memory
        2.2.9 Gated Recurrent Unit
            2.2.9.1 Update Gate
            2.2.9.2 Reset Gate
            2.2.9.3 Current Memory Content
            2.2.9.4 Final memory at Current Time-Stamp
        2.2.10 Evaluation of Sentiment Analysis Model
3 Sentiment Analysis
    3.1 Data Collection and Preprocessing
    3.2 Character Encoding
    3.3 Methodology
    3.4 Proposed Model
        3.4.1 Baseline Model
        3.4.2 Character Level Model
    3.5 Experimentation
    3.6 Results and Discussion
4 Topic Modelling
    4.1 The Corpus
    4.2 The Crawler
    4.3 Preprocessing and Cleaning
        4.3.1 Tokenization
        4.3.2 Stop Words
        4.3.3 Bag of Words Model
        4.3.4 Bigram
        4.3.5 Removing Rare and Common Words
    4.4 Proposed Model
    4.5 Algorithm
    4.6 Experimentation
        4.6.1 Topic Extraction
        4.6.2 Similarity Measure
        4.6.3 Performance Comparison with Other Topic Model Algorithms
    4.7 Methodology for Classifying News Category
5 Conclusion and Discussion
Bibliography

List of Figures

2.1 A typical RNN [31]
2.2 A typical LSTM [32]
2.3 A recurrent neural network with a gated recurrent unit [32]
2.4 The GRU unit [32]
2.5 Diagram showing the sigmoid activation for merge [32]
2.6 GRU reset function [32]
2.7 Diagram showing GRU tanh function [32]
2.8 Diagram showing GRU function [32]
3.1 The dataset for the sentiment analysis work
3.2 Characters
3.3 Character encoding
3.4 Word level model architecture
3.5 Character level model architecture
3.6 Training and testing loss
3.7 Training and testing accuracy
3.8 Comparison of the two models
4.1 The news corpus: CSV file
4.2 Bangla Independent Vowels
4.3 Bangla Dependent Vowels
4.4 Bangla Consonants
4.5 Bangla words
4.6 Proposed model for topic extraction
4.7 Coherence based number of topics
4.8 Coherence based number of topics (t=10)
4.9 Coherence based number of topics (t=20)
4.10 Similarity Dissimilarity of Cosine average
4.11 Model performance comparison
4.12 Document topic distribution for movie news
4.13 Document topic distribution for Trump news

List of Tables

2.1 List of some example words after POS tagging with positive and negative polarity [1]
2.2 Comparison of F-measure for both the classifiers with different features [1]
2.3 Comparison of characteristics of topic modelling methods [2]
