Text Sentiment Analysis of Marathi Language in English and Devanagari Script
Total Page:16
File Type:pdf, Size:1020Kb
Text Sentiment Analysis of Marathi Language in English And Devanagari Script by Harry Pradeep Gavali Dissertation Submitted in partial fulfilment of the requirements for the degree of MSc in Data Analytics At Dublin Business School Supervisor: Pierpaolo Dondio January 2020 2 DECLARATION I declare that this dissertation that I have submitted to Dublin Business School for the award of MSc in Data Analytics is my own research and original work; except where otherwise stated, where it is clearly acknowledged by references. Also, this work has not been submitted for any other degree and is entirely in compliance with Dublin Business School’s academic policy. Signed: Harry Pradeep Gavali Dated: 6th January 2020. 3 ACKNOWLEDGEMENT I would like to express my deepest gratitude to my mentor and research supervisor Mr Pierpaolo Dondio, for guiding me on the right track from the beginning to the end of the research. I appreciate his constructive feedbacks and his willingness to always dedicate his precious time generously. He was always available to help, and his remarks have helped me a lot in order to make the research successful and professional. I would also like to thank my parents, my friends and their parents who invested their precious time and helped me in compiling the data for the research. Few of them had to act as judges in order to rate the data, which was a crucial part of the research. Their help is deeply appreciated. Furthermore, I would like to thank Google for allowing me to incorporate their Cloud Translation API which translated the data smoothly and swiftly. Finally, I am indebted to my family for they have been a continuous support and have always motivated me throughout the completion of the thesis. 4 ABSTRACT Marathi is a language spoken by a very large number of people in India and about 10% of the Indian population uses ‘Marathi + English’ while texting one another. This study focuses on text sentiment analysis of the mixed language text of Marathi (written in English) first, and then compares the accuracy again after the same sentences (written in Devanagari script) have been translated using Google’s cloud translation services. Same machine learning techniques were applied on both in order to maintain equality. A new and accurate dataset which comprised of day to day sentences was compiled manually in order to reduce error. The outputs were later compared and the need to develop such researches further is highlighted. The results of the research show that the algorithms like Random Forest and SVM give us the highest accuracies of 65.41% and 64.16% respectively. Keywords: machine learning, text classification, Marathi sentiment analysis, NLP Marathi. 5 Table of Contents DECLARATION ............................................................................................................................ 2 ACKNOWLEDGEMENT ................................................................................................................ 3 ABSTRACT ................................................................................................................................... 4 LIST OF FIGURES ......................................................................................................................... 6 Chapter 1: INTRODUCTION ........................................................................................................ 7 Chapter 2: LITERATURE REVIEW .............................................................................................. 13 Chapter 3: METHODOLOGY ..................................................................................................... 19 3.1 Datasets: ................................................................................................................................. 20 3.2 Sentiment Valuation: .............................................................................................................. 21 3.3 Data Preprocessing: ................................................................................................................ 21 3.4 Machine Learning:................................................................................................................... 23 3.5 Statistical Testing: ................................................................................................................... 27 Chapter 4: IMPLEMENTATION ................................................................................................. 30 Chapter 5: RESULT ................................................................................................................... 37 5.1 Random Forest algorithm. ...................................................................................................... 39 5.2. K Nearest Neighbour Algorithm. ............................................................................................ 39 5.3. Naïve Bayes Algorithm. .......................................................................................................... 40 5.4. Decision Tree Algorithm. ........................................................................................................ 41 5.5. Support Vector Machine Algorithm. ..................................................................................... 41 5.6. Logistic Regression Algorithm. ............................................................................................... 42 5.7. Statistical testing: ................................................................................................................... 42 Chapter 6: CONCLUSION .......................................................................................................... 44 6.1 Future Scope ..................................................................................................................... 46 PLAGIARISM AND REFERENCES ............................................................................................... 48 APPENDIX ................................................................................................................................. 51 6 LIST OF FIGURES Figure 1: Dissertation Roadmap .............................................................................................. 12 Figure 2: Conceptual diagram of the research process ........................................................... 29 Figure 3: Initial dataset ............................................................................................................ 30 Figure 4: Final Rating ................................................................................................................ 31 Figure 5: Final rating for Marathi dataset ................................................................................ 31 Figure 6: Graph showing different classes and their count ..................................................... 32 Figure 7: Number of stop words in every sentence ................................................................ 32 Figure 8: Sentences with stop words removed ....................................................................... 33 Figure 9: Tokenized sentences ................................................................................................. 33 Figure 10: o/p table for the algorithms ................................................................................... 34 Figure 11: Final Marathi data ................................................................................................... 34 Figure 12: Translated text obtained after using google services ............................................. 35 Figure 13: Final translated text from Marathi to English ......................................................... 35 Figure 14: Marathi data after translation and removing stop words ...................................... 36 Figure 15: o/p table for translated Marathi dataset ............................................................... 36 Figure 16: output table comparing accuracies of both methods of text analysis ................... 37 Figure 17: RF output or mixed language text .......................................................................... 39 Figure 19: KNN o/p for mixed language text ........................................................................... 39 Figure 21: Naïve Bayes o/p for mixed language text ............................................................... 40 Figure 23: DT o/p for mixed language text .............................................................................. 41 Figure 25: SVM o/p for mixed language text ........................................................................... 41 Figure 27: LR o/p for mixed language text ............................................................................... 42 Figure 29: t test output for the algorithms .............................................................................. 43 Figure 30: output for the comparison of the two text classification methods ....................... 44 7 Chapter 1: INTRODUCTION Languages have existed for millions of years and since then have been going through a lot of advancements and changes. Over the course of history, we have seen language and culture being transferred and exchanged due to travellers visiting other shores and warriors capturing different lands. This led people of the land to learn the language of the other people. Those were the earliest instances of people learning different languages for convenience purposes. Fast-forward to today, with the invention of