Deep Belief Networks with Feature Selection for Sentiment Classification

2016 7th International Conference on Intelligent Systems, Modelling and Simulation

Patrawut Ruangkanokmas (1), Tiranee Achalakul (2), and Khajonpong Akkarajitsakul (3)
(1, 2) Department of Computer Engineering, (3) Department of Mathematics
King Mongkut's University of Technology Thonburi (KMUTT), Bangkok, Thailand
[email protected], [email protected], [email protected]

Abstract—Due to the complexity of human languages, most sentiment classification algorithms suffer from the huge dimensionality of vocabularies, which are mostly noisy and redundant. Deep Belief Networks (DBNs) tackle this problem by learning useful information from the input corpus with their several hidden layers. Unfortunately, training a DBN is a time-consuming and computationally expensive process for large-scale applications. In this paper, a semi-supervised learning algorithm called Deep Belief Networks with Feature Selection (DBNFS) is developed. Using our chi-squared based feature selection, the complexity of the vocabulary input is decreased since some irrelevant features are filtered out, which makes the learning phase of the DBN more efficient. The experimental results show that the proposed DBNFS achieves higher classification accuracy and faster training than other well-known semi-supervised learning algorithms.

Keywords—Chi-squared Feature Selection; Deep Belief Networks; Deep Learning; Feature Selection; Restricted Boltzmann Machine; Semi-supervised Learning; Sentiment Classification

I. INTRODUCTION

Nowadays, the amount of social media data on the web tends to grow dramatically. Individuals and organizations try to extract useful information from these large datasets in order to make better judgments and enhance customer satisfaction. For example, before deciding whether to purchase a product or service, a customer looks through product reviews written by others as recommendations. In the same way, the manufacturer of the product uses this information to improve the quality of its products or services [1]. However, due to the vast amount of data available online, it is an expensive task for people to utilize the data manually. As a result, sentiment classification, which aims at determining whether the sentiment expressed in a document is positive, neutral, or negative, is helpful in business intelligence applications, recommender systems, and message filtering applications [2].

To construct an accurate sentiment classifier, many researchers have in the past few years tried to integrate the concept of deep learning with machine learning [3]-[6]. With its power to handle millions of parameters, deep learning can drastically improve a model's predictive power. One great example is the Recursive Neural Tensor Network trained on the Sentiment Treebank proposed by Richard Socher [4], which can accurately predict sentiment orientation with over 85% correctness. Unfortunately, supervised training methods need a large amount of labeled training data, and it is often difficult and time-consuming to label the data manually.

More recently, a new approach called semi-supervised learning has been proposed. It aims at utilizing a huge amount of unlabeled data together with labeled data to construct sentiment classifiers [7]. Several research papers [5], [6], [8] claim that semi-supervised deep learning models can avoid the aforementioned issues while still achieving competitive performance. Unfortunately, the current deep learning algorithms are computationally expensive for large-scale applications.

Moreover, most classification algorithms use fixed-size numerical feature vectors as inputs rather than raw variable-length text documents. Thus, it is necessary to convert a corpus of documents into a matrix with one row per document and one column per token (i.e., word) occurring in the corpus. Because of the complexity of human languages, there can be more than ten thousand feature-term dimensions, most of them noisy or redundant. This can increase both the number of classification errors and the computation time.

To overcome these problems, an effective feature selection, which filters out inessential terms occurring in the training set and keeps only meaningful terms, is a must in order to make the learning phase more efficient and more accurate [9]. Forman [10] has presented an empirical comparison of many feature selection methods. The results show that feature selection improves the performance of classification algorithms in most situations, since it reduces the dimensionality of the input data by eliminating noisy features. Thus, we can train a classification model faster, reduce memory consumption, and also obtain better accuracy.

The remainder of this paper is organized as follows. Section II presents the theoretical background of feature selection and semi-supervised deep learning used in our work. The design workflow of our proposed framework is described in Section III, and our experimental results and discussion are given in Section IV. Lastly, in Section V, the conclusion of the proposed work and our future direction are presented.

2166-0670/16 $31.00 © 2016 IEEE. DOI 10.1109/ISMS.2016.9
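The corpus-to-matrix conversion described above can be sketched as a toy bag-of-words vectorizer. The two-document corpus and whitespace tokenization below are illustrative assumptions, not the paper's actual preprocessing:

```python
from collections import Counter

def vectorize(corpus):
    """Convert raw documents into a document-term count matrix:
    one row per document, one column per vocabulary token."""
    tokenized = [doc.lower().split() for doc in corpus]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)
        matrix.append([counts[tok] for tok in vocab])
    return vocab, matrix

corpus = ["great product great price", "poor quality product"]
vocab, X = vectorize(corpus)
print(vocab)  # ['great', 'poor', 'price', 'product', 'quality']
print(X)      # [[2, 0, 1, 1, 0], [0, 1, 0, 1, 1]]
```

In a real setting the vocabulary easily exceeds ten thousand columns, which is exactly the dimensionality problem the feature selection step attacks.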
II. LITERATURE REVIEW

In this section, the theoretical background of feature selection and semi-supervised deep learning related to our work is presented.

A. Feature Selection

Feature selection is a process that aims to simplify model construction by selecting a subset of relevant features. It serves two major roles. The first role is to make the training process of a classifier more efficient by reducing the size of the vocabulary input. The second role is to increase the prediction accuracy by filtering out inessential terms or noisy features. As a result, a shorter training time and a better model representation can be achieved.

1) Feature Selection Techniques

Basically, feature selection techniques can be organized into three categories [9]: filter, wrapper, and embedded techniques. The filter-based technique is used as a pre-processing step prior to a learning algorithm: features are ranked by some criterion and then selected if their scores are above an appropriately pre-defined threshold. Next, the wrapper technique utilizes a learning algorithm to select and evaluate subsets of features among all features. Finally, the embedded technique performs feature selection as a part of the training process.

Among the three categories, the filter-based technique is the most suitable one since it is simple, fast, and independent of classifiers. With its good scalability, it can efficiently be applied to large-scale applications. Examples of the filter-based technique are Information Gain, χ² (Chi-squared), Mutual Information, the t-test, and the F-measure. Forman [10] studied these methods in his comparative research and concluded that the Chi-squared filter-based technique is the most effective way to perform feature reduction in text analysis problems. For all of these reasons, χ² is selected as the feature selection procedure in our proposed framework.

2) χ² (Chi-squared) Feature Selection

A chi-squared test is a common statistical method used to test the independence of two events. In particular, it can be used for feature selection when the two events are the occurrence of a term and the occurrence of a class. The higher the χ² score, the more dependent the feature term and the class are, and the more important the term is therefore considered. Consequently, the top-rank features are selected.

B. Semi-supervised Deep Learning

Semi-supervised deep learning models can exploit a huge amount of unlabeled data to overcome domain dependence and the lack of labeled reviews, while still achieving competitive performance [7], [13].

1) Restricted Boltzmann Machine

An RBM is an energy-based generative stochastic model that aims to learn a probability distribution over its set of inputs. It consists of an input (visible) layer and a hidden layer connected by symmetrically weighted connections, with no connections between neurons in the same layer. Figure 1 shows the undirected graphical network of an RBM.

Figure 1. An illustration of an RBM network.

To train the network, the most widely used algorithm is Contrastive Divergence (CD-k), proposed by Hinton [3]. The objective of training an RBM is to optimize the weight vectors of the network so as to minimize the reconstruction error. To lower the energy of the network while preserving the distribution of the input data as much as possible, stochastic gradient ascent on the log-likelihood of the training data is applied. More details about the equations can be found in [12].

2) Deep Belief Network

To obtain better performance, a stack of restricted Boltzmann machines can be composed into a Deep Belief Network (DBN). To construct a DBN, we can follow the steps below [3]:

2.1) The DBN is built by greedy layer-wise unsupervised learning using a stack of RBMs as building blocks. The learning algorithm makes effective use of unlabeled data, which provides a significant amount of patterns in the input data, to produce better initial weights than random ones. The objective of this phase is to learn a feature representation.

2.2) The DBN is then trained according to an exponential loss function using gradient-descent-based supervised learning. The weights of the model are refined with labeled data.
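The CD-1 update described in Section II-B.1 can be sketched for a binary RBM in NumPy. The layer sizes, learning rate, and toy data below are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b, c, lr=0.1):
    """One Contrastive Divergence (CD-1) update for a binary RBM.
    v0: batch of visible vectors; W: weights; b, c: visible/hidden biases."""
    # Positive phase: hidden activations driven by the data
    ph0 = sigmoid(v0 @ W + c)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: one Gibbs step back to a reconstruction
    pv1 = sigmoid(h0 @ W.T + b)
    ph1 = sigmoid(pv1 @ W + c)
    # Approximate gradient ascent on the log-likelihood
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / len(v0)
    b += lr * (v0 - pv1).mean(axis=0)
    c += lr * (ph0 - ph1).mean(axis=0)
    return np.mean((v0 - pv1) ** 2)  # reconstruction error

# Toy run: 8 binary samples, 6 visible units, 3 hidden units
V = rng.integers(0, 2, size=(8, 6)).astype(float)
W = 0.01 * rng.standard_normal((6, 3))
b, c = np.zeros(6), np.zeros(3)
errors = [cd1_step(V, W, b, c) for _ in range(200)]
print(errors[0], errors[-1])  # reconstruction error typically falls
```

Note that CD-1 only approximates the log-likelihood gradient; the reconstruction error is a convenient progress signal, not the quantity being optimized.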

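The two DBN construction steps (2.1 and 2.2) can be sketched as follows. The layer sizes and toy data are illustrative, and step 2.2 is simplified to training a logistic read-out on the learned features; the paper instead refines all network weights with supervised gradient descent:

```python
import numpy as np

rng = np.random.default_rng(1)
sig = lambda x: 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=100, lr=0.1):
    """Step 2.1: unsupervised CD-1 pre-training of one RBM layer."""
    W = 0.01 * rng.standard_normal((data.shape[1], n_hidden))
    b, c = np.zeros(data.shape[1]), np.zeros(n_hidden)
    for _ in range(epochs):
        ph0 = sig(data @ W + c)                       # positive phase
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        pv1 = sig(h0 @ W.T + b)                       # reconstruction
        ph1 = sig(pv1 @ W + c)
        W += lr * (data.T @ ph0 - pv1.T @ ph1) / len(data)
        b += lr * (data - pv1).mean(axis=0)
        c += lr * (ph0 - ph1).mean(axis=0)
    return W, c

# Greedy layer-wise stacking: each layer's hidden activations
# become the next layer's input (the feature representation).
X = rng.integers(0, 2, size=(20, 12)).astype(float)
feats = X
for size in (8, 4):
    W, c = train_rbm(feats, size)
    feats = sig(feats @ W + c)

# Step 2.2 (simplified): refine a supervised read-out with labeled
# data on top of the pre-trained features; full fine-tuning of all
# layers, as in the paper, is omitted here.
y = rng.integers(0, 2, size=20).astype(float)
w, b = np.zeros(feats.shape[1]), 0.0
for _ in range(200):
    p = sig(feats @ w + b)
    w -= 0.5 * (p - y) @ feats / len(y)
    b -= 0.5 * (p - y).mean()

print(feats.shape)  # (20, 4): top-level DBN feature representation
```

The pre-training pass is exactly where unlabeled data is consumed; labels are only needed for the final supervised stage.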
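For binary term/class occurrence, the χ² scoring of Section II-A.2 reduces to a statistic over a 2x2 contingency table. A minimal sketch, where the four-document corpus and labels are made up for illustration:

```python
def chi2_score(docs, labels, term):
    """Chi-squared statistic for independence between the occurrence
    of `term` and the (binary) class, from a 2x2 contingency table."""
    a = b = c = d = 0  # term present/absent crossed with class 1/0
    for doc, y in zip(docs, labels):
        if term in doc:
            a, b = a + (y == 1), b + (y == 0)
        else:
            c, d = c + (y == 1), d + (y == 0)
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

# Toy corpus: documents as token sets, label 1 = positive sentiment
docs = [{"great", "product"}, {"great", "price"},
        {"poor", "quality"}, {"poor", "product"}]
labels = [1, 1, 0, 0]

vocab = sorted(set().union(*docs))
ranked = sorted(vocab, key=lambda t: chi2_score(docs, labels, t),
                reverse=True)
print(ranked)  # class-dependent terms ("great", "poor") rank first
```

Keeping only the top-ranked terms is the filtering step that shrinks the DBN's visible layer in the proposed framework.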
