<<

Refining Embeddings for Sentiment Analysis

1,3 2,3,4 2,3 4 Liang-Chih Yu , Jin Wang , K. Robert Lai and Xuejie Zhang 1 Department of Information Management, Yuan Ze University, Taiwan 2 Department of Computer Science & Engineering, Yuan Ze University, Taiwan 3Innovation Center for Big Data and Digital Convergence Yuan Ze University, Taiwan

4School of Information Science and Engineering, Yunnan University, Yunnan, P.R. China

Contact: [email protected]

Abstract 1995) are also useful information for learning word embeddings. These embeddings have been Word embeddings that can capture seman- successfully used for various natural language tic and syntactic information from contexts processing tasks. have been extensively used for various In general, existing word embeddings are se- natural language processing tasks. Howev- er, existing methods for learning context- mantically oriented. They can capture semantic based word embeddings typically fail to and syntactic information from unlabeled data in capture sufficient sentiment information. an unsupervised manner but fail to capture suffi- This may result in with similar vec- cient sentiment information. This makes it diffi- tor representations having an opposite sen- cult to directly apply existing word embeddings to timent polarity (e.g., good and bad), thus sentiment analysis. Prior studies have reported degrading sentiment analysis performance. that words with similar vector representations Therefore, this study proposes a word vec- (similar contexts) may have opposite sentiment tor refinement model that can be applied to polarities, as in the example of happy-sad men- any pre-trained word vectors (e.g., tioned in (Mohammad et al., 2013) and good-bad and GloVe). The refinement model is based on adjusting the vector rep- in (Tang et al., 2016). Composing these word vec- resentations of words such that they can be tors may produce sentence vectors with similar closer to both semantically and sentimen- vector representations but opposite sentiment po- tally similar words and further away from larities (e.g., a sentence containing happy and a sentimentally dissimilar words. Experi- sentence containing sad may have similar vector mental results show that the proposed representations). Building on such ambiguous method can improve conventional word vectors will affect sentiment classification per- embeddings and outperform previously formance. proposed sentiment embeddings for both To enhance the performance of distinguishing binary and fine-grained classification on words with similar vector representations but op- Stanford Sentiment (SST). posite sentiment polarities, recent studies have suggested learning sentiment embeddings from 1 Introduction labeled data in a supervised manner (Maas et al., Word embeddings are a technique to learn con- 2011; Labutov and Lipson, 2013; Lan et al., 2016; tinuous low-dimensional vector space representa- Ren et al., 2016; Tang et al., 2016). The common tions of words by leveraging the contextual in- goal of these methods is to capture both seman- formation from large corpora. Examples include tic/syntactic and sentiment information such that C&W (Collobert and Weston, 2008; Collobert et sentimentally similar words have similar vector al., 2011), Word2vec (Mikolov et al., 2013a; representations. They typically apply an objective 2013b) and GloVe (Pennington et al., 2014). In function to optimize word vectors based on the addition to the contextual information, character- sentiment polarity labels (e.g., positive and nega- level subwords (Bojanowski et al., 2016) and se- tive) given by the training instances. The use of mantic knowledge resources (Faruqui et al., 2015; such sentiment embeddings has improved the per- Kiela et al., 2015) such as WordNet (Miller, formance of binary sentiment classification.

534 Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 534–539 Copenhagen, Denmark, September 7–11, 2017. c 2017 Association for Computational Linguistics

This study adopts another strategy to obtain Target word: good (7.89) Ranked by Ranked by both semantic and sentiment word vectors. Instead cosine similarity sentiment score of building a new model from great (7.50) excellent (7.56) labeled data, we propose a word vector refinement bad (3.24) great (7.50) model to refine existing semantically oriented terrific (7.12) fantastic (8.36) Re-ranking word vectors using sentiment lexicons. That is, the decent (6.27) wonderful (7.41) proposed model can be applied to the pre-trained nice (6.95) terrific (7.12) vectors obtained by any word representation excellent (7.56) nice (6.95) learning models (e.g., Word2vec and GloVe) as a fantastic (8.36) decent (6.27) post-processing step to adapt the pre-trained vec- solid (5.65) solid (5.65) tors to sentiment applications. The refinement lousy (3.14) bad (3.24) model is based on adjusting the pre-trained vector wonderful (7.41) lousy (3.14) of each affective word in a given sentiment lexi- con such that it can be closer to a set of both se- Figure 1: Example of nearest neighbor rank- mantically and sentimentally similar nearest ing. neighbors (i.e., those with the same polarity) and further away from sentimentally dissimilar neigh- (target word) and the other words in the lexicon bors (i.e., those with an opposite polarity). based on the cosine distance of their pre-trained The proposed refinement model is evaluated by vectors, and then select top-k most similar words examining whether our refined embeddings can as the nearest neighbors. These semantically simi- improve conventional word embeddings and out- lar nearest neighbors are then re-ranked according perform previously proposed sentiment embed- to their sentiment scores provided by the lexicon dings. To this end, several deep neural network such that the sentimentally similar neighbors can classifiers that performed well on the Stanford be ranked higher and dissimilar neighbors lower. Sentiment Treebank (SST) (Socher et al., 2013) Finally, the pre-trained vector of the target word is are selected, including convolutional neural net- refined to be closer to its semantically and senti- works (CNN) (Kim, 2014), deep averaging net- mentally similar nearest neighbors and further work (DAN) (Iyyer et al., 2015) and long-short away from sentimentally dissimilar neighbors. term memory (LSTM) (Tai et al., 2015; Looks et The following subsections provide a detailed de- al., 2017). The conventional word embeddings scription of the nearest neighbor ranking and re- used in these classifiers are then replaced by our finement model. refined versions and previously proposed senti- ment embeddings to re-run the classification for 2.1 Nearest Neighbor Ranking performance comparison. The SST is chosen be- cause it can show the effect of using different The sentiment lexicon used in this study is the ex- word embeddings on fine-grained sentiment clas- tended version of Affective Norms of English sification, whereas prior studies only reported bi- Words (E-ANEW) (Warriner et al., 2013). It con- nary classification results. tains 13,915 words and each word is associated The rest of this paper is organized as follows. with a real-valued score in [1, 9] for the dimen- Section 2 describes the proposed word vector re- sions of valence, arousal and dominance. The va- finement model. Section 3 presents the evaluation lence represents the degree of positive and nega- results. Conclusions are drawn in Section 4. tive sentiment, where values of 1, 5 and 9 respec- tively denote most negative, neutral and most pos- 2 Word Vector Refinement itive sentiment. In Fig. 1, good has a valence score of 7.89, which is greater than 5, and thus can be The refinement procedure begins by giving a set considered positive. Conversely, bad has a va- of pre-trained word vectors and a sentiment lexi- lence score of 3.24 and is thus negative. In addi- con annotated with real-valued sentiment scores. tion to the E-ANEW, other lexicons such as Sen- Our goal is to refine the pre-trained vectors of the tiWordNet (Esuli and Fabrizio, 2006), SoCal affective words in the lexicon such that they can (Taboada et al., 2011), SentiStrength (Thelwall et capture both semantic and sentiment information. al., 2012), Vader (Hutto et al., 2014), ANTUSD To accomplish this goal, we first calculate the se- (Wang and Ku, 2016) and SCL-NMA mantic similarity between each affective word (Kiritchenko and Mohammad, 2016) also provide

535

real-valued sentiment intensity or strength scores like the valence scores. For each target word to be refined, the top-k semantically similar nearest neighbors are first se- lected and ranked in descending order of their co- sine similarities. In Fig. 1, the left ranked list shows the top 10 nearest neighbors for the target word good. The semantically ranked list is then sentimentally re-ranked based on the absolute dif- ference of the valence scores between the target word and the words in the list. A smaller differ- ence indicates that the word is more sentimentally similar to the target word, and thus will be ranked higher. As shown in the right ranked list in Fig. 1, Figure 2: Conceptual diagram of word vector the re-ranking step can rank the sentimentally refinement. similar neighbors higher and the dissimilar neigh- bors lower. In the refinement model, the higher ranked sentimentally similar neighbors will re- similar neighbors and further away from lower- ceive a higher weight to refine the pre-trained vec- ranked dissimilar neighbors, as shown in Fig. 2. tor of the target word. To prevent too many words being moved to the same location and thereby producing too 2.2 Refinement Model many similar vectors, we add a constraint to keep each pre-trained vector within a certain range Once the word list ranked by both cosine similari- from its original vector. The objective function is ty and valence scores for each target word is ob- thus divided as two parts: tained, its pre-trained vector will be refined to be (1) closer to its sentimentally similar neighbors, arg minΦ (V )=

(2) further away from its dissimilar neighbors, and nk tt++11tt (2) (3) not too far away from the original vector. argmin∑∑αβdist (,) vi v i + wij dist (,) v i v j ij=11= Let V = {v1, v2, …, vn} be a set of the pre- trained vectors corresponding to the affective tt+1 where dist( vii ,) v denotes the distance between words in the sentiment lexicon. For each target to the vector of the target word in step t and t+1, i.e., be refined, the refinement model iteratively mini- the distance between the refined vector and its mizes the distance between the target word and its original vector. The later one represents the dis- top-k nearest neighbors. The objective function tance between the vector of the target word and Φ(V) can thus be defined as that of its neighbors (similar to Eq. (1)). The pa- nk rameters α and β together are used as a ratio to Φ= (1) ()V∑∑ wij dist(,) v i v j control how far the refined vector can be moved ij=11 = away from its original vector and toward its near- where n denotes the total number of vectors in V est neighbors. A greater ratio indicates a stronger to be refined, vi denotes the vector of a target constraint on keeping the refined vector closer to word, vj denotes the vector of one of its nearest its original vector. For the extreme case of α=1 neighbors in the ranked list, dist(vi, vj) denotes the and β=0, the target word will not be moved (re- distance between vi and vj, and wij denotes the fined). As the ratio decreases, the constraint de- weight of the target word’s nearest neighbor, de- creases accordingly and the refined vector can be fined as the reciprocal rank of a ranked list. For moved closer to its nearest neighbors. The setting example, excellent in Fig. 1 will receive a weight of α=0 and β=1 means that the constraint is disa- of 1, great will receive a weight of 1/2, and so on. bled. A word ranked higher will receive a higher To facilitate the calculation of the partial de- weight. This weight is used to control the move- rivative of Φ(V), dist(vi, vj) in the above equa- ment direction of the target word towards to its tions is measured by the squared Euclidean dis- nearest neighbors. That is, the target word will be tance, defined as moved closer to the higher-ranked sentimentally

536

D  Re(Glove) and Re(Word2vec): Both the pre- dist(, v v )= ( vdd − v )2 (3) ij∑ i j trained GloVe and Word2vec were refined d =1 using E-ANEW (Warriner et al., 2013). Each where D is the dimensionality of the word vectors. affective word was refined by its top k=10 The global optimal solution of Φ(V) can be found nearest neighbors with parameters of α:β=0.1 by using an iterative update method. To do so, we (1:10) (see Eq. (2)). solve the partial derivation of Eq. (2) in step t with  HyRank: It was trained using SST, NRC respect to word vector vt , and by setting i Sentiment140 and IMDB datasets. We com- ∂Φ()V = 0 t +1 pared this method because its code is public- t to obtain a new vector vi in step 3 ∂vi ly accessible . t+1. The iterative update procedure is defined as Classifiers. The above word embeddings were 4 k used by CNN (Kim, 2014) , DAN (Iyyer et al., tt 5 γβvi + ∑ wvij j 2015) , , bi-directional LSTM (Bi-LSTM) (Tai et t+1 j=1 al., 2015)6 and Tree-LSTM (Looks et al., 2017)7 vi = k (4) with default parameter values. γβ+ ∑ wij j=1 Comparative Results. Table 1 presents the eval- Through the iterative procedure, the vector uation results of using different word embeddings representation of each target word will be for different classifiers. For the pre-trained word iteratively updated until the change of the location embeddings, GloVe outperformed Word2vec for of the target word’s vector is converged. The DAN, Bi-LSTM and Tree-LSTM, whereas refinement process will be terminated when all Word2vec yielded better performance for CNN. target words are refined. After the proposed refinement model was applied, both the pre-trained Word2vec and GloVe were 3 Experimental Results improved. The Re(Word2vec) and Re(GloVe) re- spectively improved Word2vec and GloVe by This section evaluates the proposed refinement 1.7% and 1.5% averaged over all classifiers for model, conventional word embeddings and previ- binary classification, and both 1.6% for fine- ously proposed sentiment embeddings using sev- grained classification. In addition, both Re(GloVe) eral deep neural network models for binary and and Re(Word2vec) outperformed the sentiment fine-grained sentiment classification. embeddings HyRank for all classifiers on both bi- Dataset. SST was adopted as the evaluation cor- nary and fine-grained classification, indicating pus (Socher et al., 2013). The binary classification that the real-valued intensity scores used by the subtask (positive and negative) contains proposed refinement model are more effective 6920/872/1821 samples for the train/dev/test sets, than the binary polarity labels used by the previ- while the fine-grained ordinal classification sub- ously proposed sentiment embedings. task (very negative, negative, neutral, positive, The proposed method yielded better perfor- and very positive) contains 8544/1101/2210 sam- mance because it can remove semantically similar ples of the train/dev/test sets. but sentimentally dissimilar nearest neighbors for the target words by refining their vector represen- Word Embeddings. The word embeddings used tations. To demonstrate the effect, we define a for comparison included two conventional word measure noise@k to calculate the percentage of embeddings (GloVe and Word2vec), our refined versions (Re(GloVe) and Re(Word2vec)), and top k nearest neighbors with an opposite polarity previously proposed sentiment embeddings (Hy- (i.e., noise) to each word in E-ANEW. For in- Rank) (Tang et al., 2016). We used the same di- stance, in Fig. 1, the noise@10 for good is 20% mensionality of 300 for all word embeddings. because there are two words with an opposite po- larity to good among its top 10 nearest neighbors.  GloVe and Word2vec: The respective GloVe Table 2 shows the average noise@10 for different and Word2vec (skip-gram) were pre-trained 1 on Common Crawl 840B and Google- 3 http://ir.hit.edu.cn/~dytang/ News2. 4 https://github.com/yoonkim/CNN_sentence 5 https://github.com/miyyer/dan 1 http://nlp.stanford.edu/projects/glove/ 6 https://github.com/stanfordnlp/treelstm 2 https://code.google.com/archive/p/word2vec/ 7 https://github.com/tensorflow/fold

537

Method Fine-grained Binary Word Embeddings Noise@10 (%) DAN Word2vec 24.3 - Word2vec 46.2 84.5 GloVe 24.0 - GloVe 46.9 85.7 HyRank 18.5 - Re(Word2vec) 48.1 87.0 Re(Word2vec) 14.4 - Re(GloVe) 48.3 87.3 Re(GloVe) 13.8 - HyRank 47.2 86.6 Table 2: Average percentages of noisy words CNN in the top 10 nearest neighbors for different - Word2vec 48.0 87.2 word embeddings. - GloVe 46.4 85.7 - Re(Word2vec) 48.8 87.9 Experiments on SST show that the proposed - Re(GloVe) 47.7 87.5 method yielded better performance than both con- - HyRank 47.3 87.6 ventional word embeddings and sentiment em- Bi-LSTM beddings for both binary and fine-grained senti- - Word2vec 48.8 86.3 ment classification. In addition, the performances - GloVe 49.1 87.5 of various deep neural network models have also - Re(Word2vec) 49.6 88.2 been improved. Future work will evaluate the - Re(GloVe) 49.7 88.6 proposed method on another datasets. More ex- - HyRank 49.0 87.3 periments will also be conducted to provide more Tree-LSTM in-depth analysis. - Word2vec 48.8 86.7 Acknowledgments - GloVe 51.8 89.1 - Re(Word2vec) 50.1 88.3 This work was supported by the Ministry of Sci- - Re(GloVe) 54.0 90.3 ence and Technology, Taiwan, ROC, under Grant - HyRank 49.2 88.2 No. MOST 105-2221-E-155-059-MY2 and MOST 105-2218-E-006-028. The authors would Table 1: Accuracy of different classifiers with like to thank the anonymous reviewers and the ar- different word embeddings for binary and fine- ea chairs for their constructive comments. grained classification. References word embeddings. For the two semantic-oriented Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors word vectors, GloVe and Word2vec, on average with subword information. arXiv preprint arXiv: around 24% of the top 10 nearest neighbors for 1607.04606. each word are noisy words. After refinement, both Ronan Collobert and Jason Weston. 2008. A unified Re(GloVe) and Re(Word2vec) can reduce architecture for natural language processing: Deep noise@10 to around 14%. The HyRank also neural networks with multitask learning. In yielded better performance than both GloVe and Proceedings of the 25th International Conference Word2vec. on (ICML-08), pages 160–167. Ronan Collobert, Jason Weston, L´eon Bottou, 4 Conclusion Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) This study presents a word vector refinement from scratch. Journal of Machine Learning model that requires no labeled corpus and can be Research, 12:2493–2537. applied to any pre-trained word vectors. The pro- Andrea Esuli, and Fabrizio Sebastiani. 2006. Senti- posed method selects a set of semantically similar WordNet: A publicly available for nearest neighbors and then ranks the sentimentally opinion mining. In Proceedings of the 5th Interna- tional Conference on Language Resources and similar neighbors higher and dissimilar neighbors Evaluation (LREC-06), pages 417-422. lower based on a sentiment lexicon. This ranked Manaal Faruqui, Jesse Dodge, Sujay K. Jauhar, Chris list can guide the refinement procedure to itera- Dyer, Eduard H. Hovy and Noah A. Smith. 2015. tively improve the word vector representations. Retrofitting word vectors to semantic lexicons. In

538

Proceedings of the 2015 Conference of the North Advances in Neural Information Processing American Chapter of the Association for Computa- Systems (NIPS-13), pages 1-9. tional Linguistics – Human Language Technolo- Tomas Mikolov, Greg Corrado, Kai Chen, and Jeffrey gies (NAACL/HLT-15), pages 1606-1615. Dean. 2013b. Efficient estimation of word Clayton J. Hutto and Eric Gilbert. 2014. VADER: A representations in vector space. In Proceedings of parsimonious rule-based model for sentiment anal- the International Conference on Learning ysis of text. In Proceedings of 8th In- Representations (ICLR-2013), pages 1-12. ternational AAAI Conference on Weblogs and So- George A. Miller. 1995. WordNet: A lexical database cial Media, pages 216-225. for English. Communications of the ACM, Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, 38(11):39-41. and Hal Daumé III. 2015. Deep unordered Saif M. Mohammad, Bonnie J. Dorr, Graeme Hirst, composition rivals syntactic methods for text and Peter D. Turney. 2013. Computing lexical classification. In Proceedings of the 53rd Annual contrast. Computational Linguistics, 39(3):555-590. Meeting of the Association for Computational Jeffrey Pennington, Richard Socher, and Christopher Linguistics (ACL-15), pages 1681-1691. D. Manning. 2014. GloVe: Global vectors for word Douwe Kiela, Felix Hill and Stephen Clark. 2015. representation. In Proceedings of the Conference Specializing word embeddings for similarity or on Empirical Methods in Natural Language relatedness. In Proceedings of the 2015 Conference Processing (EMNLP-14), pages 1532-1543. on Empirical Methods in Natural Language Yafeng Ren, Yue Zhang, Meishan Zhang, and Processing (EMNLP-15), pages 2044-2048. Donghong Ji. 2016. Improving sentiment Yoon Kim. 2014. Convolutional neural networks for classification using topic-enriched multi-prototype sentence classification. In Proceedings of the 2014 word embeddings. In Proceedings of the 30th Conference on Empirical Methods in Natural AAAI Conference on (AAAI- Language Processing (EMNLP-14), pages 1746- 16), pages 3038-3044. 1751. Richard Socher, Alex Perelygin, and Jy Wu. 2013. Svetlana Kiritchenko and Saif M. Mohammad. 2016. Recursive deep models for semantic The effect of negators, modals, and degree adverbs compositionality over a sentiment treebank. In on sentiment composition. In Proceedings of the Proceedings of the Conference on Empirical 7th Workshop on Computational Approaches to Methods in Natural Language Processing Subjectivity, Sentiment and Social Media Analysis (EMNLP-13), pages 1631-1642. at NAACL-HLT 2016, pages 43-52. Maite Taboada, Julian Brooke, and Kimberly Voll. Igor Labutov and Hod Lipson. 2013. Re-embedding 2011. Lexicon-based methods for sentiment words. In Proceedings of the 51st Annual Meeting analysis. Computational linguistics, 37(2):267–307. of the Association for Computational Linguistics Kai Sheng Tai, Richard Socher, and Christopher D. (ACL-13), pages 489-493. Manning. 2015. Improved semantic representations Man Lan, Zhihua Zhang, Yue Lu, and Ju Wu. 2016. from tree-structured long short-term memory Three convolutional neural network-based models networks. In Proceedings of the 53rd Annual for learning sentiment word vectors towards Meeting of the Association for Computational sentiment analysis. In Proceedings of the 2016 Linguistics (ACL-15), pages 1556–1566. International Joint Conference on Neural Duyu Tang, Furu Wei, Bing Qin, Nan Yang, Ting Liu, Networks (IJCNN-16), pages 3172-3179. and Ming Zhou. 2016. Sentiment embeddings with Moshe Looks, Marcello Herreshoff, DeLesley applications to sentiment analysis. IEEE Trans. Hutchins, and Peter Norvig. 2017. Knowledge abd Data Engineering, 28(2):496-509. with dynamic computation graphs. In Proceedings Mike Thelwall, Kevan Buckley, and Georgios Pal- of the 5th International Conference on Learning toglou. 2012. Sentiment strength detection for the Representations (ICLR-17). social web. Journal of the Association for Infor- Andrew L. Maas, Raymond E. Daly, Peter T. Pham, mation Science and Technology, 63(1):163-173. Dan Huang, Andrew Y. Ng, and Christopher Potts. Shih-Ming Wang and Lun-Wei Ku. 2016. ANTUSD: 2011. Learning word vectors for sentiment analysis. A large Chinese sentiment dictionary. In Proc. of In Proceedings of the 49th Annual Meeting of the the 10th International Conference on Language Association for Computational Linguistics (ACL- Resources and Evaluation (LREC-16), pages 2697- 11), pages 142-150. 2702. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Amy Beth Warriner, Victor Kuperman, and Marc Dean. 2013a. Distributed representations of words Brysbaert. 2013. Norms of valence, arousal, and and phrases and their compositionality. In dominance for 13,915 English lemmas. Behavior Proceedings of the Annual Conference on Research Methods, 45(4):1191-1207.

539