
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 10, No. 3, 2019

Optimizing the Hyperparameter of Feature Extraction and Classification Algorithms

Sani Muhammad Isa1, Rizaldi Suwandi2, Yosefina Pricilia Andrean3
Computer Science Department, BINUS Graduate Program - Master of Computer Science,
Bina Nusantara University, Jakarta, Indonesia 11480

Abstract—The process of assigning a quantitative value to a piece of text expressing a mood or effect is called sentiment analysis. Several machine learning algorithms, feature extraction approaches, and parameter optimization settings were compared to achieve the best accuracy. This paper proposes an approach for comparing the sentiment of reviews using three feature extraction methods, Word2vec, Doc2vec, and Term Frequency-Inverse Document Frequency (TF-IDF), with machine learning classification algorithms such as Support Vector Machine (SVM), Naive Bayes, and Decision Tree. A grid search algorithm is used to optimize the feature extraction and classifier parameters. The performance of these classification algorithms is evaluated based on accuracy. The approach used in this research succeeded in increasing the classification accuracy for all feature extractions and classifiers by applying grid search hyperparameter optimization to variously pre-processed data.

Keywords—Sentiment analysis; word2vec; TF-IDF (term frequency-inverse document frequency); Doc2vec; grid search

I. INTRODUCTION

Google Play is an online service developed and operated by Google. It is the official app store for Android-based mobile phones, and the entire Google Play service can be accessed through the Play Store app. Google Play sells Android apps, games, movies, and even e-books. Google Play Store app data was chosen because it has enormous potential to drive an app-making business to success. This research focuses on app reviews.

In recent years, the opinions of friends and domain experts have become our consideration for decision making in our lives, for example, which app to download, which game to play, or which e-book to read. Sentiment analysis, or opinion mining, plays an important role in this process [1]. The sentiments (expressions) are stated in natural language form.

With the increasing development of e-commerce, the need to extract valuable information from consumer comments is also increasing. It is important for the organization (the Google Play developer company) to automatically identify whether each customer review is positive, negative, or neutral [2]. Product comments contain a wealth of information about product evaluation from customers [3]. Since the main form of information on the internet is text [4], text processing is needed the most; it is required to extract the sentiment value of a review.

Text classification research ranges from designing the best feature extraction method to choosing the best classifiers. Almost all techniques of text classification are based on words [5]. For sentiment classification, this paper uses a machine learning based method, because such methods have been widely adopted due to their excellent performance.

The Word2vec, Doc2vec, and Term Frequency-Inverse Document Frequency (TF-IDF) feature extractions used in this research were implemented in Python using the Sklearn library (TF-IDF) and the Gensim library (Word2vec and Doc2vec). Word2vec is a relatively new open source feature extraction method based on deep learning [3]. Word2vec can learn word vector representations and calculate the cosine distance between them in a high-dimensional vector space. This research used Word2vec because this approach can find the semantic relationships between words in a document.

TF-IDF is also very important in this research. Balancing the weight between the less commonly used words and the most frequent or general words is one of the capabilities of TF-IDF feature extraction. TF-IDF calculates the frequency of each token in the review; this frequency shows the importance of a token to a document in the corpus [6].

To extend the learning of embeddings from words to word sequences, this research uses Doc2vec as a simple extension of Word2vec. Many types of text can be used with this feature: word n-grams, sentences, paragraphs, or documents [7]. To refer to the embedding of a word sequence, we use the term document embedding, which is supported by Doc2vec.

The classification was done with three classifiers: NB (Naive Bayes), SVM (Support Vector Machine), and DT (Decision Tree). SVM can be used to produce the highest accuracy results in text classification problems [1], and NB has higher accuracy than the others, followed by the DT classifier. This research is dedicated to selecting the best feature extraction and choosing the best model for multiclass classification by comparing the TF-IDF, Word2vec, and Doc2vec feature extractions, and to increasing the accuracy using hyperparameter optimization. Hyperparameter optimization is used to search for the parameters that produce the best classification accuracy. Because it selects points in a (linear or log-scaled) hyperparameter space, grid search is suitable in this case.

Hyperparameter tuning is also well suited to derivative-free optimization, which reflects a characteristic of grid search and makes it appropriate for solving this problem [8].

This paper consists of several sections: Section II gives the literature review, Section III describes the methodology and the techniques used in the research, Section IV presents the results and analysis, and Section V concludes the paper.

II. RELATED WORK

To know the mood or effect expressed by a text, sentiment analysis can be used to assign it a quantitative value (positive, negative, or neutral). Previous research has shown that sentiment analysis achieves good accuracy on, for example, Twitter [2], application reviews [9], documents [10], texts [3], [4], [11], newsgroups [12], news articles [13], and IMDB [14]. The Google Play review dataset from Kaggle was used in this research. Kaggle is an online web service that provides a series of datasets that can be used for research. The data have been filtered for noise and null reviews, and the user data also contain the sentiment for each review.

Sentiment analysis is a good candidate for text classification. Using machine learning, documents are automatically classified into categories that have been labeled beforehand, so it can be seen as a supervised learning task because of its objective [12].

Table I lists related research on sentiment analysis collected from Google Scholar. This literature search used keywords such as "Sentiment Analysis using Word2vec and TF-IDF", "Sentiment Analysis Google Play Review", and "Sentiment Analysis using Doc2vec".

Based on the literature review, most of the research is focused on getting better accuracy, and the method is designed according to the characteristics of the text. TF-IDF is used because it ranks among the best approaches for retrieving and labeling documents, Word2vec is used to obtain more accurate word vectors, and Doc2vec, one of the easiest approaches, uses an average of all the words in a document to represent the features of that document [18].

The SVM, NB, and DT classifiers were used in this paper for text classification. In the classification process, many classification methods and machine learning techniques have been used. Machine learning accuracy can be increased further by using an optimization algorithm, i.e. hyperparameter optimization with grid search. By adding hyperparameter optimization, a better result is obtained because the hyperparameters are chosen efficiently [19]. Such methods have been used by [12] and [4]; both produce a high level of accuracy of more than 80%. Using this method should therefore also yield high accuracy here [16].

TABLE I. LITERATURE REVIEW

Author | Dataset | Method | Result
J. H. Lau and T. Baldwin [7] | Document | Doc2vec | Better accuracy
R. Ju, P. Zhou, C. H. Li, and L. Liu [15] | Newsgroup | Latent Semantic Analysis + Word2vec | Better accuracy
D. Zhang, H. Xu, Z. Su, and Y. Xu [3] | Text (Chinese comments) | Word2vec and SVM-perf | Excellent accuracy
J. Lilleberg, Y. Zhu, and Y. Zhang [12] | Newsgroup | Word2vec, Term Frequency-Inverse Document Frequency | Word2vec is the best solution
D. Rahmawati and M. L. Khodra [13] | Article | Word2vec | Better accuracy
A. Tripathy, A. Agrawal, and S. K. Rath [14] | IMDb | Naïve Bayes, Max Entropy, SVM, SGD | Better accuracy
P. Vateekul and T. Koomsubha [2] | Twitter | Long Short Term Memory and Dynamic Convolutional Neural Network | Better than Naïve Bayes and SVM
W. Zhu, W. Zhang, G.-Z. Li, C. He, and L. Zhang [16] | Text | Word2vec and Term Frequency-Inverse Document Frequency | Better than Latent Semantic Analysis and Doc2vec
J. Gao, Y. He, X. Zhang, and Y. Xi [4] | Text | Word2vec, Term Frequency-Inverse Document Frequency | Comparison
S. Fujita [17] | Newspaper | Contextual Specificity Similarity, K-Nearest Neighbor | Contextual Specificity Similarity good in long text
L. Lin, X. Linlong, J. Wenzhen, Z. Hong, and Y. Guocai [11] | Text | CD_STR, TF-IDF weighted vector space model | Comparison
Q. Shuai, Y. Huang, L. Jin, and L. Pang [18] | Review | Doc2vec, Support Vector Machine, LogReg | Support Vector Machine and LogReg show a better result


III. METHODOLOGY

In a machine learning classification task, the data is the most important element [18]. This research uses 10,000 Google Play app reviews from Kaggle. The reviews are divided into three class categories: positive, neutral, and negative. The data are then split into train and test sets, with 80% used for training. The research methodology is represented in Fig. 1.

Fig. 1. Experiment Process (filtering and pre-processing, train/test data split, feature extraction and model training, results before optimization, hyperparameter optimization, results after optimization).

A. Dataset

In this paper, the dataset consists of user reviews from Google Play, complete with the sentiment (Positive, Neutral, or Negative) of each review. The Google Play store allows users to write reviews about the applications they download. The dataset contains 64,294 reviews and consists of the application name, review, sentiment, sentiment polarity, and sentiment subjectivity. This research used 10,000 reviews, consisting of application names, sentiments, and review texts, because we wanted to evaluate our approach against reviews containing diverse vocabularies, with the sentiment as the classification target. The original data are filtered by removing noise and null data using the Python NLTK library. Moreover, words with typos are deleted and the spelling of words is corrected using the Spelling Replacer class.

B. Filtering

The reviews were pre-processed with the following techniques [18]; a sketch of these steps is given after this list.

1) Stopword removal: To eliminate terms that are very common in the English language, we delete the stopwords (e.g., "such", "was", "any", "then"). Starting from the NLTK corpus stopword list, the words we add to the list are "'m" and "'s", while "y", "o", "s", "i", and "d" are removed from it. We also remove punctuation at the end of each sentence, i.e. trailing characters that are not letters (upper or lower case), such as @#$%^&*()-_=+~!`{[}]|\:;"<,>.?/.

2) Spelling replacement: We use the spell method from the autocorrect library to check all of the words. This library is able to correct words that do not match the English dictionary. For example, the word "lke" is corrected to "like".

3) Noise and null removal: We remove a row if it has an empty value for the application name, review, or sentiment. Moreover, we use the enchant dictionary to delete typo words. Enchant checks the words one by one; when a word is not in the English dictionary, it is deleted. For example, for the review "A big thanks ds I got bst gd health" from Kaggle, stopword removal, noise character removal, and punctuation removal give "big thanks ds got bst gd health" ("A" and "I" are removed). Correcting the words gives "big thanks ds i got BST gd health"; here "bst" is changed into "BST" (British Summer Time) because "BST" is in the English dictionary. The final step deletes the remaining typo words using enchant, giving "big thanks i got health"; some words are removed because they are not in the English dictionary.
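The following is a minimal sketch of these filtering steps, assuming the NLTK stopword corpus, the autocorrect package, and the pyenchant dictionary are installed; the clean_review helper and the exact token handling are illustrative rather than the authors' original code.

```python
import string

import enchant                       # pyenchant English dictionary, used for typo deletion
from autocorrect import Speller      # spelling correction (older releases expose `spell` instead)
from nltk.corpus import stopwords    # requires nltk.download("stopwords")

spell = Speller(lang="en")
english_dict = enchant.Dict("en_US")

# Stopword list adjusted as described above: add the clitics "'m" and "'s",
# and keep the short tokens "y", "o", "s", "i", "d" that NLTK would otherwise drop.
stop_words = (set(stopwords.words("english")) | {"'m", "'s"}) - {"y", "o", "s", "i", "d"}

def clean_review(review):
    """Stopword removal, punctuation stripping, spelling correction, typo deletion."""
    tokens = [t.strip(string.punctuation) for t in review.lower().split()]
    tokens = [t for t in tokens if t and t not in stop_words]   # stopword / empty-token removal
    tokens = [spell(t) for t in tokens]                         # e.g. "lke" -> "like"
    tokens = [t for t in tokens if english_dict.check(t)]       # drop words not in the dictionary
    return " ".join(tokens)

print(clean_review("A big thanks ds I got bst gd health"))
```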
C. Feature Extraction

To extract features from the users' reviews, the methods provided by the Sklearn and Gensim libraries described in Section I were used.

1) Word2vec: The two main learning algorithms for Word2vec are skip-gram and continuous bag-of-words. The bag-of-words model predicts a word based on its context, and the order of the words in the history does not influence the projection, whereas skip-gram predicts the surrounding words given the current word. The bag-of-words model uses a distributed representation of the context, which is what distinguishes it from skip-gram. It is important to note that the weight matrix between the projection layer and the input is shared for all word positions. This paper uses a modified Word2vec to calculate the document vector: for each word in the document, the word vector is computed, and the average of these vectors gives the document vector. This research used 300 dimensions for the word vector size, and the 300-dimensional vectors are averaged over the number of words in the review.
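A sketch of this averaged document vector with Gensim follows; the 300-dimension setting matches the paper, while the toy tokenized_reviews list and the review_vector helper are only illustrative (Gensim 4.x uses vector_size where older releases used size).

```python
import numpy as np
from gensim.models import Word2Vec

# tokenized_reviews: one token list per pre-processed review (toy placeholder data)
tokenized_reviews = [["big", "thanks", "got", "health"],
                     ["great", "app", "love", "the", "design"]]

# 300-dimensional word vectors, as used in this research
w2v = Word2Vec(sentences=tokenized_reviews, vector_size=300, window=5, min_count=1)

def review_vector(tokens):
    """Average the word vectors of a review to obtain its document vector."""
    vectors = [w2v.wv[t] for t in tokens if t in w2v.wv]
    if not vectors:
        return np.zeros(w2v.vector_size)
    return np.mean(vectors, axis=0)

X = np.vstack([review_vector(r) for r in tokenized_reviews])
print(X.shape)   # (number of reviews, 300)
```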


Fig. 2. Step Word2vec.

The utilization of Word2vec and TF-IDF for feature extraction follows [12]. In Fig. 2, di represents the document, w2v(t) denotes the vector representation of a term obtained from Word2vec, and t represents a term that exists in the document. The following steps are performed for each document: first, Word2vec is used to sum the vector representations of the document; second, the TF-IDF value is calculated and applied to the Word2vec vectors; and last, the TF-IDF and Word2vec values are merged, with the word vectors weighted by TF-IDF from the second step [13].
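A sketch of this weighting scheme is shown below; it trains a small Word2vec model and weights each word vector by its TF-IDF value in the review before averaging, which mirrors the combination described above but is not the authors' exact implementation.

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative pre-processed reviews; in the experiments these come from the filtering step
tokenized_reviews = [["big", "thanks", "got", "health"],
                     ["great", "app", "love", "the", "design"],
                     ["app", "crashes", "all", "the", "time"]]
texts = [" ".join(tokens) for tokens in tokenized_reviews]

w2v = Word2Vec(sentences=tokenized_reviews, vector_size=300, window=5, min_count=1)

tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(texts)     # one row of TF-IDF weights per review
vocab = tfidf.vocabulary_                     # term -> column index

def weighted_review_vector(tokens, row):
    """Average the Word2vec vectors, each weighted by the word's TF-IDF value in the review."""
    vecs, weights = [], []
    for t in tokens:
        if t in w2v.wv and t in vocab:
            vecs.append(w2v.wv[t])
            weights.append(row[0, vocab[t]])
    if not vecs:
        return np.zeros(w2v.vector_size)
    return np.average(vecs, axis=0, weights=weights)

X = np.vstack([weighted_review_vector(tokens, tfidf_matrix[i])
               for i, tokens in enumerate(tokenized_reviews)])
print(X.shape)    # (number of reviews, 300)
```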

2) TF-IDF: TF-IDF is used to identify the most common and the most characteristic words in a corpus. It uses the word frequency together with the inverse proportion of documents in which the word appears. The higher the TF-IDF value, the stronger the connection a word has to a document, and the higher its frequency of occurrence. The TF-IDF approach can be expressed as the equation below:

    tf-idf(t, d) = tf(t, d) × log( |D| / df(t) )        (1)

where tf(t, d) is the term frequency of term t in document d and log(|D| / df(t)) is the inverse document frequency; |D| is the total number of documents in the document set and df(t) is the number of documents containing the term t. A word that repeatedly comes up is considered an important word by TF-IDF. As a result, the term frequency and the TF-IDF weights are calculated at the same time [10].

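To make equation (1) concrete, here is a small worked sketch that follows the definitions above literally; note that library implementations such as scikit-learn's TfidfVectorizer use a smoothed variant of the same idea.

```python
import math

docs = ["good app good design",
        "bad app",
        "good support"]

def tf(term, doc):
    """Term frequency of `term` in one document."""
    words = doc.split()
    return words.count(term) / len(words)

def idf(term, docs):
    """log(|D| / df(t)): total documents over documents containing the term."""
    df = sum(1 for d in docs if term in d.split())
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf_idf("good", docs[0], docs))    # frequent in this review but common in the corpus
print(tf_idf("design", docs[0], docs))  # rarer in the corpus, so a larger IDF factor
```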
3) Doc2vec: For learning document embeddings, Doc2vec can be used as an extension of Word2vec. DBOW (the Distributed Bag of Words version of Paragraph Vector) and DMPV (the Distributed Memory version of Paragraph Vector) are the two approaches of Doc2vec. DBOW works in the same way as skip-gram, except that the input is replaced by a special token representing the document; the order of the words in the document is ignored in this architecture. DMPV is similar to CBOW: in the process, DMPV requires an additional document token alongside the words, but unlike CBOW it incorporates this token by concatenation rather than summation. The objective is again to predict a context word given the concatenated document and word vectors.

Fig. 3 shows the first of the two architectures from [18], DMPV, which adds a paragraph id to be trained with the word vectors. This paragraph id contains information that is missing from the current word.

Fig. 3. DMPV.

Fig. 4 shows the DBOW model, whose input is a paragraph id used to predict randomly sampled words in the document [18].

Fig. 4. DBOW.
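A sketch of training the two Doc2vec variants with Gensim is given below; dm=0 gives DBOW and dm=1 gives the distributed-memory (DMPV) model, while the toy tagged documents and the parameter values other than the 300 dimensions are assumptions.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each pre-processed review becomes a TaggedDocument whose tag acts as the paragraph id
tokenized_reviews = [["big", "thanks", "got", "health"],
                     ["great", "app", "love", "the", "design"],
                     ["app", "crashes", "all", "the", "time"]]
documents = [TaggedDocument(words=tokens, tags=[i])
             for i, tokens in enumerate(tokenized_reviews)]

# DBOW: predict words sampled from the document given only the paragraph id
dbow = Doc2Vec(documents, dm=0, vector_size=300, window=5, min_count=1, epochs=20)

# DMPV (distributed memory): the paragraph id is trained together with the word vectors
dmpv = Doc2Vec(documents, dm=1, vector_size=300, window=5, min_count=1, epochs=20)

print(dbow.dv[0].shape)                                   # learned vector of the first review (gensim 4.x API)
print(dbow.infer_vector(["love", "this", "app"]).shape)   # vector inferred for an unseen review
```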


D. Classification

Using the TF-IDF, Word2vec, and Doc2vec feature extractions, our initial approach was to build a vector for each document (a summation or average of its word vectors) and then use a linear Support Vector Machine (SVM), Naive Bayes, and Decision Tree to classify it. This was implemented using the Sklearn library; a combined sketch of the three classifiers is given at the end of this subsection.

1) Support Vector Machine: A Support Vector Machine (SVM) classifier is fully determined by a relatively small subset of the training instances, which makes it suitable for discrete data [18]. LinearSVC was used in this research; it is similar to an SVM with a linear kernel but is implemented in terms of liblinear rather than libsvm. The key idea behind training the model on the complete vectorized data is to find and refine the separating hyperplane, represented by the weight vector w [6]:

    w = Σj αj cj dj        (2)

The dual optimization problem gives the values of the αj. Here cj ∈ {1, −1} is the class of document dj; for the three sentiment classes (positive, negative, neutral), LinearSVC handles this in a one-vs-rest fashion. All dj for which αj is greater than zero are termed support vectors, as they are the only document vectors contributing to w.

2) Naive Bayes: This research also uses Naive Bayes as a classifier. Bayes' theorem is the basis of the Naive Bayes algorithm. The classifier works by assuming that every feature is independent of the others. The algorithm requires only a small amount of training data to estimate the parameters used for prediction, represented by P(c|d):

    P(c|d) = P(d|c) P(c) / P(d)        (3)

for a given textual review d and a class c (positive, negative, or neutral).

3) Decision Tree: A decision tree has three main components: edges, internal nodes, and leaf nodes. Each of these components represents an item in text classification: the internal nodes represent the features, the edges represent tests on the feature weights, and the leaf nodes represent the categories resulting from those tests. When a leaf node is reached, it represents the final category of the document. Decision trees have been used in many applications in speech and language processing [1].
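A combined sketch of the three classifiers with the 80/20 split described above follows; the toy texts and labels stand in for the 10,000 pre-processed reviews, and the TF-IDF features could equally be replaced by the Word2vec or Doc2vec vectors built earlier.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import BernoulliNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Placeholder data; in the experiments these are the pre-processed reviews and their sentiments
texts = ["big thanks got health", "love this app", "worst app ever", "keeps crashing",
         "it is okay", "nothing special", "great design", "refund please"]
labels = ["Positive", "Positive", "Negative", "Negative",
          "Neutral", "Neutral", "Positive", "Negative"]

X = TfidfVectorizer().fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=42)

classifiers = {
    "LinearSVC": LinearSVC(),                 # linear SVM implemented with liblinear
    "BernoulliNB": BernoulliNB(),
    "DecisionTree": DecisionTreeClassifier(),
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(name, accuracy_score(y_test, clf.predict(X_test)))
```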

Classifier performance is reported as a confusion matrix. This matrix shows the predictions for the positive, negative, and neutral reviews. The number of correctly predicted negative reviews is called True Negative; False Negative Predicted Neutral is a wrong prediction where a negative review is predicted as neutral, and False Negative Predicted Positive is a wrong prediction where a negative review is predicted as positive. False Neutral Predicted Negative and False Neutral Predicted Positive are wrong predictions where a neutral review is predicted as negative or positive. Finally, False Positive Predicted Negative and False Positive Predicted Neutral are wrong predictions where a positive review is predicted as negative or neutral. This scheme is represented in Table II.

E. Parameter Optimization

Improvement or additional optimization of the search strategy is needed to get better performance when testing each new machine; this optimization is still needed even when the first run on a new machine already gives a reasonable performance [20]. The most widely used strategies for hyperparameter optimization are grid search and manual search [21]. Grid search can only handle hyperparameters whose names and candidate values are explicitly specified by the user [22]. Grid search increases accuracy by checking every parameter combination in the supplied lists and comparing them for the best accuracy; in our implementation the results do not always improve, and for some combinations the accuracy goes down.

The parameters defined in this research are the feature extraction parameters of TF-IDF and the classifier parameters of BernoulliNB. The feature extraction uses hyperparameters such as use_idf and ngram_range for the vectorizer, and for the classifier this research used the alpha parameter of BernoulliNB. Many other parameters could be used for these feature extractions and classifiers, such as dual, tol, C, and multi_class for the LinearSVC classifier; criterion, splitter, max_depth, and max_features for the Decision Tree classifier; and size, window, and min_count for the Word2vec and Doc2vec feature extractions, but these were limited by the available computing resources.

This research used two different hyperparameter optimization settings. For Word2vec and Doc2vec, an empty parameter grid was used; only TF-IDF and BernoulliNB used a predefined hyperparameter grid, as shown in Fig. 5. A sketch of this grid search setup is given at the end of this section.

Fig. 5. Predefined Hyperparameter Example.

Fig. 1 shows where the hyperparameter optimization fits in the experiment process: the feature extraction training and the classification are the two steps optimized with grid search.

Even the empty parameter grid may succeed in increasing classification accuracy. This is possible because the model is run a second time on the processed data, which effectively trains the model twice and can generate a better and much higher accuracy for sentiment classification.

F. Evaluation

Evaluation in this research is done by calculating the classification accuracy score. The accuracy score is calculated with the Sklearn accuracy_score function, which divides the number of correct test-data predictions by the total number of test instances. T and F in the formula represent true and false predictions, and Pos, Net, and Neg represent Positive, Neutral, and Negative:

    accuracy = (TPos + TNet + TNeg) / (TPos + TNet + TNeg + FPos + FNet + FNeg)        (4)
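The sketch below covers both the grid search of the TF-IDF / BernoulliNB combination and the accuracy evaluation; the pipeline names and the candidate value lists are illustrative, apart from the reported best setting of alpha = 0.01, use_idf = True, and ngram_range = (1, 1), and the train/test variables are assumed to hold the pre-processed reviews and their sentiments.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", BernoulliNB()),
])

# Grid modelled on the predefined hyperparameters of Fig. 5; the value lists are assumptions
param_grid = {
    "tfidf__use_idf": [True, False],
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__alpha": [0.01, 0.1, 1.0],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(train_texts, train_labels)   # pre-processed review strings and their sentiment labels (assumed)

print(search.best_params_)              # best setting reported here: alpha=0.01, use_idf=True, ngram_range=(1, 1)
print(accuracy_score(test_labels, search.predict(test_texts)))   # accuracy on the held-out 20%
```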

TABLE II. CONFUSION MATRIX FOR CLASSIFIER

Correct Label | Predicted Negative | Predicted Neutral | Predicted Positive
Negative | True Negative | False Negative Predicted Neutral | False Negative Predicted Positive
Neutral | False Neutral Predicted Negative | True Neutral | False Neutral Predicted Positive
Positive | False Positive Predicted Negative | False Positive Predicted Neutral | True Positive
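The confusion matrix of Table II can be produced directly with scikit-learn; the sketch below assumes a fitted classifier clf and the held-out X_test and y_test from the classification sketch above.

```python
from sklearn.metrics import confusion_matrix

labels = ["Negative", "Neutral", "Positive"]
cm = confusion_matrix(y_test, clf.predict(X_test), labels=labels)

# Rows are the correct labels and columns the predictions, matching Table II;
# the diagonal holds the True Negative, True Neutral, and True Positive counts.
for correct_label, row in zip(labels, cm):
    print(correct_label, row)
```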

IV. RESULT AND DISCUSSION

This section discusses the results and describes the limitations, threats to validity, and implications.

A. Experiment Result

The experiment was done using the three feature extractions described in Section III. TF-IDF generates the largest feature size, since every word in the vocabulary of the dataset becomes a column. The Word2vec feature extraction requires the most time to execute, because it calculates each word vector and averages all the word vectors of a review to obtain the review/document vector.

Table III shows the classification accuracy of the normal data, the typo-deleted data, and the spelling-corrected data classified with the LinearSVC, BernoulliNB, and Decision Tree classifiers. The classification in Table III was done using the default parameters. From the feature extraction point of view, TF-IDF produced the best accuracy with the default parameters, with 81% average accuracy. From the classifier's point of view, LinearSVC showed the best result, with more than 84% average accuracy, much higher than the Naïve Bayes and Decision Tree classifiers. TF-IDF with LinearSVC produces a high accuracy of 85.66% averaged over the three datasets; it is followed by the Decision Tree with TF-IDF feature extraction, which at 85.33% average accuracy is 0.33% lower.

TABLE III. BEFORE OPTIMIZATION RESULT (A: TF-IDF, B: WORD2VEC, C: DOC2VEC)

            Normal (%)      Typo (%)        Spell (%)
            A   B   C       A   B   C       A   B   C
SVM         86  77  81      85  73  81      86  76  80
NB          73  63  75      76  58  77      74  62  79
DT          85  68  78      85  64  78      86  67  78

Table IV shows the results of grid search optimization using the empty hyperparameter grid and the predefined hyperparameters. The results are better for most feature extractions and classifiers, and the best feature extraction changes from TF-IDF to Doc2vec, with 85.33% accuracy on average. This result includes the predefined hyperparameters of Fig. 5 for the TF-IDF feature extraction with the BernoulliNB classifier.

TABLE IV. AFTER OPTIMIZATION USING GRID SEARCH RESULT (A: TF-IDF, B: WORD2VEC, C: DOC2VEC)

            Normal (%)      Typo (%)        Spell (%)
            A   B   C       A   B   C       A   B   C
SVM         89  77  88      89  77  87      88  76  88
NB          74  62  84      73  60  82      73  60  85
DT          89  66  85      89  68  84      89  68  85

The results show that the predefined parameters succeeded in increasing the accuracy of the TF-IDF and BernoulliNB combination. In Table IV, the accuracy of A-NB on the normal data increased by 1%, whereas for the typo data and the spell data it decreased by 3% and 1%, respectively. The BernoulliNB with TF-IDF combination was also evaluated with the empty hyperparameter grid and shows a worse result than the predefined hyperparameters, with 70%, 72%, and 70% for the normal, typo, and spell data, as presented in Table V.

TABLE V. GRID SEARCH OPTIMIZATION DIFFERENCE USING PREDEFINED HYPERPARAMETERS AND AN EMPTY PARAMETER GRID (BERNOULLINB WITH TF-IDF)

                        Normal (%)   Typo (%)   Spell (%)
Empty parameter grid    70           72         70
Predefined parameters   74           73         73

Grid search reports the parameters with the best accuracy for the normal, typo, and spell data. All three datasets showed the same best parameters: 'alpha': 0.01, 'use_idf': True, and 'ngram_range': (1, 1). As mentioned before, not every accuracy increased with these parameters, possibly because of the small number of parameter values predefined for the grid search; there might be another parameter combination that gives a better result, but it could not be explored in this research because of the limitations described below.

After optimization using the empty parameter grid, the results show that TF-IDF with Decision Tree produces the best accuracy, 89% on average, which is 0.33% higher than LinearSVC with the same TF-IDF feature extraction. The best feature extraction therefore depends on the classifier being used: averaged over the three classifiers, Doc2vec is the best, but for the Decision Tree classifier TF-IDF produces higher accuracy than Doc2vec.

The datasets are divided into three variants: normal data, typo data, and spell data. Normal data is the data that is not pre-processed with spell checking or typo deletion, typo data is the reviews pre-processed by deleting typo words (the enchant step), and spell data is the review data pre-processed with Autocorrect to correct misspelled words.

Fig. 6 shows the increase in accuracy for the normal data using grid search hyperparameter optimization. Hyperparameter optimization succeeded in increasing the classification accuracy through a combination of the available parameters. The most promising result is Doc2vec, with an 8.9% increase for the BernoulliNB classifier.

Fig. 6. Accuracy Chart using Normal Data.


Fig. 7 shows the increase in accuracy for the typo data. All feature extraction and classifier combinations increased at a fairly significant rate except for Word2vec with BernoulliNB, which decreased by 2.81% in accuracy from the original result.

Fig. 7. Accuracy Chart using Typo Data.

Fig. 8 shows the increase in accuracy for the spell data. In this chart it can be seen clearly that Doc2vec increased the most. In summary, over the normal, typo, and spell data, Doc2vec increased the most, by 6.77% on average; hyperparameter optimization with grid search thus manages to optimize the Doc2vec feature extraction for the Google Play review data.

Fig. 8. Accuracy Chart using Spell Data.

B. Limitation

This research is limited by the computing resources available to execute calculations on bigger data and with more complex optimization methods. The computer used for this research has a relatively low specification, with 16 GB of RAM and only 2 CPU cores; these specs are low compared with [5], which was supported by 2 Tesla K40 GPUs. Other limitations are the number of available Google Play app reviews and the available Python libraries. This research could be improved with a better English spelling corrector to fix the typos and with a much bigger dataset.

C. Threats to Validity

One threat to validity is that the qualitative evaluation of the topics relevant to requirement engineering was done by the authors of the paper; the evaluators could have incomplete knowledge or a misunderstanding of the specific information. Another threat to validity is the possibility of human error during the coding task; this was minimized by a second coder double-checking the code used for this research.

For text classification, many factors can influence the research: data, method, variables, parameters, and more can change the result. The results show that for normal data, which still includes typos and misspelled words, TF-IDF is the best feature extraction with the SVM or Decision Tree classifier, but after optimizing the hyperparameters, Doc2vec gives the better result. The normal, typo, and spell variants do have a slight impact on the accuracy but not a significant one; the typo and misspelled words only make around a 2% difference in accuracy.

V. CONCLUSION

This work presents a comparison between three feature extractions and a way to increase the classification accuracy for sentiment analysis. Before optimization, TF-IDF with LinearSVC on normal data, TF-IDF with LinearSVC on spell data, and TF-IDF with Decision Tree on spell data have the same result of 86%; it can be concluded that, in this study, TF-IDF has the highest value before optimization. After hyperparameter optimization the accuracies change, with some going up and some down. TF-IDF with LinearSVC on normal data, TF-IDF with LinearSVC on spell data, TF-IDF with Decision Tree on spell data, and TF-IDF with Decision Tree on normal data then have the same result of 89%. The changes in accuracy make Doc2vec the feature extraction with the best accuracy in total mean average. The highest increase in classification accuracy from hyperparameter optimization is Doc2vec with BernoulliNB on normal data, which increased by 8.9%.

REFERENCES
[1] N. S. Joshi and S. A. Itkat, "A Survey on Feature Level Sentiment Analysis," vol. 5, no. 4, pp. 5422–5425, 2014.
[2] P. Vateekul and T. Koomsubha, "A study of sentiment analysis using deep learning techniques on Thai Twitter data," 2016 13th Int. Jt. Conf. Comput. Sci. Softw. Eng. (JCSSE), 2016.
[3] D. Zhang, H. Xu, Z. Su, and Y. Xu, "Chinese comments sentiment classification based on word2vec and SVMperf," Expert Syst. Appl., vol. 42, no. 4, pp. 1857–1863, 2015.
[4] J. Gao, Y. He, X. Zhang, and Y. Xi, "Duplicate Short Text Detection Based on Word2vec," pp. 3–7, 2017.
[5] X. Zhang, J. Zhao, and Y. LeCun, "Character-level Convolutional Networks for Text Classification," pp. 1–9, 2015.
[6] A. Tripathy, A. Agrawal, and S. K. Rath, "Classification of Sentimental Reviews Using Machine Learning Techniques," Procedia Comput. Sci., vol. 57, pp. 821–829, 2015.
[7] J. H. Lau and T. Baldwin, "An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation," 2014.
[8] E. R. Sparks et al., "Automating Model Search for Large Scale Machine Learning," pp. 368–380.


[9] E. Guzman and W. Maalej, "How Do Users Like This Feature? A Fine Grained Sentiment Analysis of App Reviews," vol. 200, pp. 86–87, 2014.
[10] R. Ju, P. Zhou, C. H. Li, and L. Liu, "An efficient method for document categorization based on Word2vec and latent semantic analysis," Proc. 15th IEEE Int. Conf. Comput. Inf. Technol. (CIT 2015), 14th IEEE Int. Conf. Ubiquitous Comput. Commun. (IUCC 2015), 13th IEEE Int. Conf. Dependable, Auton. Secur. Comput., pp. 2276–2283, 2015.
[11] L. Lin, X. Linlong, J. Wenzhen, Z. Hong, and Y. Guocai, Text Classification Based on Word2vec and Convolutional Neural Network, vol. 3316. Springer International Publishing, 2018.
[12] J. Lilleberg, Y. Zhu, and Y. Zhang, "Support Vector Machines and Word2vec for Text Classification with Semantic Features," 2015.
[13] D. Rahmawati and M. L. Khodra, "Word2vec semantic representation in multilabel classification for Indonesian news article," 4th IGNITE Conf. / 2016 Int. Conf. Adv. Informatics: Concepts, Theory Appl. (ICAICTA), pp. 0–5, 2016.
[14] A. Tripathy, A. Agrawal, and S. K. Rath, "Classification of Sentiment Reviews using N-gram Machine Learning Approach," 2016.
[15] R. Ju, P. Zhou, C. H. Li, and L. Liu, "An Efficient Method for Document Categorization Based on Word2vec and Latent Semantic Analysis," 2015.
[16] W. Zhu, W. Zhang, G.-Z. Li, C. He, and L. Zhang, "A study of damp-heat syndrome classification using Word2vec and TF-IDF," 2016.
[17] S. Fujita, "Text Similarity Function Based on Word Embeddings for Short Text Analysis," vol. 3406, pp. 391–402, 2018.
[18] Q. Shuai, Y. Huang, L. Jin, and L. Pang, "Sentiment Analysis on Chinese Hotel Reviews with Doc2Vec and Classifiers," 2018 IEEE 3rd Adv. Inf. Technol. Electron. Autom. Control Conf. (IAEAC), pp. 1171–1174, 2018.
[19] S. R. Young, D. C. Rose, T. P. Karnowski, S. Lim, and R. M. Patton, "Optimizing Deep Learning Hyper-Parameters Through an Evolutionary Algorithm," 2008.
[20] J. Bilmes and J. Demmel, "Author Retrospective for Optimizing Matrix Multiply using PHiPAC: a Portable High-Performance ANSI C Coding Methodology," 2014.
[21] J. Bergstra and Y. Bengio, "Random Search for Hyper-Parameter Optimization," vol. 13, pp. 281–305, 2012.
[22] L. Kotthoff and K. Leyton-Brown, "Auto-WEKA 2.0: Automatic model selection and hyperparameter optimization in WEKA," vol. 18, pp. 1–5, 2017.
