Bangla Natural Language Processing: a Comprehensive Review of Classical, Machine Learning, and Deep Learning Based Methods
Total Page:16
File Type:pdf, Size:1020Kb
Date of publication xxxx 00, 0000, date of current version xxxx 00, 0000. Digital Object Identifier 10.1109/ACCESS.2017.DOI Bangla Natural Language Processing: A Comprehensive Review of Classical, Machine Learning, and Deep Learning Based Methods OVISHAKE SEN 1, MOHTASIM FUAD1, MD. NAZRUL ISLAM1, JAKARIA RABBI 1, MD. KAMRUL HASAN 2, MOHAMMED BAZ3, MEHEDI MASUD4, MD. ABDUL AWAL5, AWAL AHMED FIME1, MD. TAHMID HASAN FUAD1, DELOWAR SIKDER1, AND MD. AKIL RAIHAN IFTEE1 1Department of Computer Science and Engineering, Khulna University of Engineering & Technology, Khulna-9203, Bangladesh 2Department of Electrical and Electronic Engineering, Khulna University of Engineering & Technology, Khulna-9203, Bangladesh 3Department of Computer Engineering, College of Computer and Information Technology, Taif University, PO Box. 11099, Taif 21994, Saudi Arabia 4Department of Computer Science, College of Computers and Information Technology, Taif University, P. O. Box 11099, Taif 21944, Saudi Arabia 5Electronics and Communication Engineering Discipline, Khulna University, Khulna-9208, Bangladesh Corresponding author: Jakaria Rabbi ([email protected]) ABSTRACT The Bangla language is the seventh most spoken language, with 265 million native and non-native speakers worldwide. However, English is the predominant language for online resources and technical knowledge, journals, and documentation. Consequently, many Bangla-speaking people, who have limited command of English, face hurdles to utilize English resources. To bridge the gap between limited support and increasing demand, researchers conducted many experiments and developed valuable tools and techniques to create and process Bangla language materials. Many efforts are also ongoing to make it easy to use the Bangla language in the online and technical domains. There are some review papers to understand the past, previous, and future Bangla Natural Language Processing (BNLP) trends. The studies are mainly concentrated on the specific domains of BNLP, such as sentiment analysis, speech recognition, optical character recognition, and text summarization. There is an apparent scarcity of resources that contain a comprehensive study of the recent BNLP tools and methods. Therefore, in this paper, we present a thorough review of 71 BNLP research papers and categorize them into 11 categories, namely Information Extraction, Machine Translation, Named Entity Recognition, Parsing, Parts of Speech Tagging, Question Answering System, Sentiment Analysis, Spam and Fake Detection, Text Summarization, Word Sense Disambiguation, and Speech Processing and Recognition. We study articles published between 1999 to 2021, and 50% of the arXiv:2105.14875v2 [cs.CL] 8 Jun 2021 papers were published after 2015. We discuss Classical, Machine Learning and Deep Learning approaches with different datasets while addressing the limitations and current and future trends of the BNLP. INDEX TERMS bangla natural language processing, sentiment analysis, speech recognition, support vector machine, artificial neural network, long short-term memory, gated recurrent unit, convolutional neural network. I. INTRODUCTION the computer made a far-reaching impact in our modern time ANGUAGE is a medium for living beings to communi- as it is helping everything towards making our life easier. The L cate with each other. In everyday life, communication is computer communicates and understands everything through a necessary activity for everyone which enables us to express 0 and 1 which is called machine language. But machine our feelings, judgment, necessity and language made all this language is very complex to understand for humans as well possible. As humans evolved through civilization the format as human language is not understandable to the computer. of language has been changed drastically. The invention of To reduce this language barrier, computer scientists came up VOLUME 4, 2016 1 Ovishake et al.: BNLP: A Comprehensive Review of Classical, ML, and DL Based Methods with methods to enable computers to understand and process Error [8], Knowledge Based Approaches [9], Template Based human language. The area of computer science that deals Approaches [9], and Perceptual Linear Prediction [10]. with understanding and processing human language through The classical methods used in question answering system computers is known as NLP (Natural Language Processing). are Cosine similarity [11], Jaccard similarity [11], Anaphora- In recent years, NLP has become a popular topic in computer Cataphora Resolution [12], Vector Space Model [13], Se- science as it is solving various real-life problems like auto- mantic Web Technologies [14], Inductive Rule Learning and correction, text to speech processing, machine translation, Reasoning [15], Logic Prover [16] and many more. sentiment analysis etc. As computers are getting more ca- Spam and fake detection systems mainly use the following pable of computational power, the applications of NLP is classical methods: Traditional Linguistic Features [17], Text also increasing. As different languages are used in different Mining and Probabilistic Language Model [18], Review Pro- regions of the world, it has become a necessity to increase the cessing Method [19], Time Series [20], and Active Learning domain of NLP to enable various language to be used in NLP [21]. applications. There are many languages used in the Indian Most of the sentiment analysis systems adopted machine sub-continent and Bangla is one of them. It is widely used in learning (ML) approaches. There are very few classical ap- the region of Bangladesh and the West Bengal part of India. proaches taken in sentiment analysis. One of the classical The part of NLP that deals with understanding and processing methods is the rule-based method [22]. The classical ap- tasks related to the Bangla language are known as Bangla proaches on machine translation are: Verb based approach Natural Language Processing (BNLP). Figure 1 shows the [23], Rule-based approach using parts of speech tagging [24], components of Natural Language Processing and Figure 2 Fuzzy rules [25], Rule based method [26] and many more shows the number of papers reviewed as per category. Figure methods. Parts of speech tagging use the following classical 3 shows the methods used in Bangla speech processing and approaches: Brill’s Tagger [27], Rule-based approach [28], recognition and Figure 4 shows the methods used in Bangla [29], [30], and Morphological Analysis [29], [30]. natural language processing of text data. The classical approaches on text summarization are Heuristic approach [31] and for parsing, mostly used methods are Simple Suffix Stripping algorithm [32], Score based Clus- tering algorithm [32], and Feature Unification based Morpho- logical Parsing [33]. In the case of the information extraction for the Bangla languages, many statistical approaches [34] [35] have been used to construct different systems. Also, to make a robust system, the use of the classical methods to add more feature sets to their datasets is observed. Like Hough Transform-based Fuzzy Feature [36], Gradient Feature, and Haar Wavelet Configuration [37] are used to add more fea- tures to a dataset for extraction of the information. In NER (Named Entity Recognition), most of the research works use Dictionary-based, Rule-based, and Statistical- FIGURE 1. Natural Language Processing Components [1] . based approaches [38]. There are also different statistical models such as Conditional Random Fields (CRF) [39] is A. PREPROCESSING TECHNIQUES used for NER. Most of the papers in the parsing are based The main idea behind data preprocessing is to transform the on the Ruled-based method. Different Lexical Analysis, Se- raw data into a form so that the data becomes usable and an- mantic Analysis [40], [41] has been used for parsing. The alyzable for the target task and the computer can understand researchers also have taken the help of different context- that collected data in the desired format. For every natural free grammar [40] for creating new rules for the parsing. language processing related problem, data preprocessing of The researchers have also used different open-source tools the collected raw data is one of the most crucial parts. The like PC-KIMMO [42] to construct a morphological parser. preprocessed data makes the NLP applications faster and Analyzing different semantic information [41] can be utilized more accurate for the intended task. to create a parser. In the text summarization task, various information re- B. CLASSICAL METHODS trieval techniques for summarization with relational graphs The most commonly used classical methods in speech pro- are used. The topic sentiment recognition and aggregation cessing and recognition system are Dynamic Time Warping are done through the Topic-sentiment model, and Theme (DTW) [2], Hidden Markov Model (HMM) [3], Linear Pre- Clustering [43]. Frequency, position value, cue words, and dictive Coding (LPC) [4], Gaussian Mixture Model (GMM) skeleton of the documents are also used [44] for text sum- [2], Template Matching [2], Autocorrelation [5], Cepstrum marization. The classical approaches that used in word sense [5], Speech Application Program Interface (SAPI) [6], Fac- disambiguation are Rule-based methods [45], [46]. The rules torial Hidden Markov Model [7], Minimum Classification to mark the semiotic classes and the regular expressions are 2 VOLUME 4, 2016 Ovishake et al.: BNLP: A Comprehensive Review of Classical, ML, and DL Based Methods Speech Processing