Sentiment Analysis for Multilingual Corpora
Svitlana Galeshchuk
Governance Analytics, PSL Research University / University Paris Dauphine
Place du Marechal de Lattre de Tassigny, 75016 Paris, France
[email protected]

Julien Jourdan
PSL Research University / University Paris Dauphine
Place du Marechal de Lattre de Tassigny, 75016 Paris, France
[email protected]

Ju Qiu
Governance Analytics, PSL Research University / University Paris Dauphine
Place du Marechal de Lattre de Tassigny, 75016 Paris, France
[email protected]

Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing, pages 120–125, Florence, Italy, 2 August 2019. © 2019 Association for Computational Linguistics

Abstract

The paper presents a generic approach to the supervised sentiment analysis of social media content in foreign languages. The method proposes translating documents from the original language to English with Google's Neural Translation Model. The resulting texts are then converted to vectors by averaging the vectorial representation of words derived from a pre-trained Word2Vec English model. Testing the approach with several machine learning methods on Polish, Slovenian and Croatian Twitter corpora returns up to 86% classification accuracy on the out-of-sample data.

1 Introduction

Sentiment analysis is gaining prominence as a topic of research in different areas of application (journalism, political science, marketing, finance, etc.). In the last two decades, opinion-rich data sources have become widely available because of web resources and social networks. While lexicon-based frameworks have long been investigated for sentiment analysis, deep learning methods with a vectorial representation of words are proving to deliver promising results. The integration of the two types of methods is widely investigated as well. Thus, sentiment analysis approaches usually require either a fine-grained lexicon of the most frequent words along with their polarity scores, or a dataset large enough for supervised training of a deep learning network, sufficient computational memory, etc.

Moreover, most of the open-source datasets for training sentiment models comprise English-language texts. Lexicons are not always available for other languages, and constructing them remains a time-consuming task. This motivates us to build on the existing approaches and test a rather general method to run a sentiment analysis for different languages without polarity dictionaries, using relatively small datasets.

We address the challenges for sentiment analysis in Slavic languages by using the averaged vectors for each word in a document translated into English. The vectors derive from the Word2Vec model pre-trained on Google News.

Researchers tend to use language-specific pre-trained Word2Vec models (e.g., a Word2Vec model pre-trained on the Wikipedia corpus in Greek, Swedish, etc.). In contrast, we propose benefiting from Google's Neural Translation Model by translating the texts from other languages to English. Translated documents are then converted to the fixed vectorial representation with the Google Word2Vec model.[1] Supervised machine learning classifiers such as Gradient Boosting Trees, Random Forest, and Support Vector Machines provide sufficiently high accuracy on the out-of-sample data converted to the aggregate vectors.

The rest of the paper is structured as follows: Section 2 provides a brief review of related literature. Section 3 describes the methodology. Section 4 expands on the data used. Section 5 presents the results from our experiments. Section 6 concludes with some observations on our findings and identifies directions of future research.

[1] https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit

2 Related Work

This section elaborates on existing methods for sentiment analysis and the adjacent approaches to text data treatment that have helped us formulate the proposed process of sentiment analysis. Following Dashtipour et al. (2016), we divide sentiment analysis systems into lexicon-based, corpus-based and hybrid.

2.1 Lexicon-Based Methods

Lexicon-based methods employ dictionaries of pre-defined words with corresponding polarity scores. These scores define how positive the term is. Some approaches (e.g., the Vader lexicon in Hutto and Gilbert (2014)) use the opinion of several experts, and the final polarity measure equals the mean of the corresponding scores. A subset of the most popular and promising lexicon-based sentiment classifiers for English corpora has been reported in Levallois (2013). Concerning Slavic languages, a Slovak lexicon translated from English and annotated with Bare-bones particle swarm optimization helps achieve a 0.865 F1 score in sentiment classification, as reported in Krchnavy and Simko (2017). Gombar et al. (2017) construct a Croatian domain-specific lexicon for domain-specific classification; Haniewicz et al. (2013) run sentiment analysis with a polarity lexicon for reviews in Polish that renders up to 79% accuracy. We will refer to these papers later in our study to corroborate our results by comparison with the existing methods.

The idea proposed in Wan (2009) shares some similarities with our method. The authors translate Chinese text into English and then employ a lexicon-based ensemble method to classify texts as positive or negative. The reported accuracy is 85.3%, though it requires Chinese and English lexicons and some additional calculation to create the ensemble method. However, word scoring in each constructed lexicon usually relies on human treatment and perception. The task is also labor-intensive, and it may be challenging to find fine-grained lexicons for some languages.

Moreover, a well-known drawback of lexicon-based methods is contextual ignorance, as some terms may have different meanings in various documents. Besides, some documents (e.g., short texts such as tweets) sometimes do not include any word from the lexicon. The introduction of word vectorial representation tends to address this disadvantage.

2.2 Corpus-based and Hybrid Methods with Vectorial Representation

Embedding approaches usually rely on the semantic vector spaces derived from neural networks. Their application in supervised experimental set-ups for polarity analysis often demonstrates superior performance to the lexicon-based methods (Le and Mikolov, 2014; Severyn and Moschitti, 2015). As the reference point in our study we use the papers of Giatsoglou et al. (2017) and Garten et al. (2018), where the authors meticulously employ Sentence2Vec in their methodological settings. Giatsoglou et al. (2017) use Sentence2Vec based on the Word2Vec model learned from the Wikipedia corpus in Greek. The performance is evaluated with datasets of mobile phone reviews in Greek. The model that exploits the vectors derived from the Wikipedia corpora plus the reviews provides the highest accuracy of 70.89-82.40% on test samples. The authors further try hybrid methods (lexicon- and embedding-driven) that deliver slightly better results. Rotim and Šnajder (2017) use a similar approach for Croatian corpora, obtaining 0.822 as the F1 score for game reviews, but the results are much worse for the Twitter dataset. In contrast to these authors, we do not train our Word2Vec model for the corpora in Slavic languages. Instead, we employ the pre-trained Google News Word2Vec model after translating texts to English. This makes our approach more universal and easier to apply to foreign-language corpora while yielding satisfactory accuracy (see Results).

Garten et al. (2018) compute the cosine similarity between the aggregate vectorial representation of documents and the "negative" and "positive" dictionaries. Precision on the IMDB English reviews data varies between 0.70 and 0.75. Our findings show that the introduction of the polarity dictionaries delivers less accurate outputs than using Sentence2Vec. However, our set-up does not foresee unsupervised learning.

A feature-based approach for Czech language sentiment classification renders 0.69 as the F1 measure in Habernal et al. (2013). The method has to be adjusted for other languages if used.

Zhang et al. (2017) report another approach for Twitter sentiment classification, employing character-based convolutional neural networks with different languages. The method transforms the characters in alphabetic order into UTF-8 codes, facilitating sentence segmentation. The character embedding matrix is then used as an input for the convolutional neural network. We consider these findings as one of the benchmarks for comparison in our study.

3 Methodology

Recall from the previous sections that we aim to develop a sentiment analysis approach for multi-language use. Fig. 1 depicts the proposed method.[2]

[...] makes the model the state-of-the-art regarding the quality of vectors, which plays a crucial role in our study as we use the translated text from Slavic languages to English; (ii) the Google model comprises approximately 3 million words and phrases. This vocabulary covers the lion's share of the lexicon employed by users of web resources and social networks; (iii) we do not need to construct a large dataset to train our model, as the vectors have already been pre-trained with a significant number of terms.

Google Translation.
Machine translation does not always provide perfect accuracy from the linguistic point of view. However, the resulting translation with the recently introduced Google's Neural Machine Translation approach tends to deliver English text contextually similar