
Cogn Comput (2017) 9:843–851 DOI 10.1007/s12559-017-9492-2

Learning Word Representations for Sentiment Analysis

Yang Li1 · Quan Pan1 · Tao Yang1 · Suhang Wang2 · Jiliang Tang3 · Erik Cambria4

Received: 29 April 2017 / Accepted: 18 July 2017 / Published online: 17 August 2017 © Springer Science+Business Media, LLC 2017

Abstract Word embedding has been proven to be a useful model for various natural language processing tasks. Traditional word embedding methods merely take into account word distributions independently from any specific tasks. Hence, the resulting representations could be sub-optimal for a given task. In the context of sentiment analysis, there are various types of prior knowledge available, e.g., sentiment labels of documents from available datasets or polarity values of words from sentiment lexicons. We incorporate such prior sentiment information at both word level and document level in order to investigate the influence each word has on the sentiment label of both target word and context word. By evaluating the performance of sentiment analysis in each category, we find the best way of incorporating prior sentiment information. Experimental results on real-world datasets demonstrate that the word representations learnt by DLJT2 can significantly improve the sentiment analysis performance. We prove that incorporating prior sentiment knowledge into the embedding process has the potential to learn better representations for sentiment analysis.

Erik Cambria (corresponding author)
[email protected]

Yang Li
[email protected]

Quan Pan
[email protected]

Tao Yang
[email protected]

Suhang Wang
[email protected]

Jiliang Tang
[email protected]

1 School of Automation, Northwestern Polytechnical University, Xi'an, Shaanxi, 710072, People's Republic of China
2 Department of Computer Science and Engineering, Arizona State University, Tempe, AZ, 85281, USA
3 Computer Science and Engineering, Michigan State University, East Lansing, MI, 48824, USA
4 School of Computer Science and Engineering, Nanyang Technological University, 50 Nanyang Ave, Singapore, 639798, Singapore

Introduction

Word embedding is a popular method for natural language processing (NLP) that aims to learn low-dimensional vector representations of words from documents. Due to its ability to capture syntactic and semantic word relationships, word embedding algorithms such as Skip-gram, CBOW [17, 18] and GloVe [23] have been proven to facilitate various NLP tasks, such as word analogy [18], parsing [1], POS tagging [13], aspect extraction [25], temporal tagging [35], personality recognition [16], and multimodal fusion [27]. The assumption behind these word embedding approaches is the distributional hypothesis that "you shall know a word by the company it keeps" [8]. By leveraging statistical information such as word co-occurrence frequencies, word embedding approaches can learn distributional vector representations that capture the semantic meanings of words.

The majority of existing word embedding algorithms merely take into account statistical information from documents [17, 18, 23]. The representations learnt by such algorithms are very general and can be applied to various tasks.
However, they are trained completely independently from any specific task; thus, they may not be optimal for a given NLP task, especially when prior knowledge or auxiliary information about such a task is available. Recent advances in document representation show that incorporating auxiliary information such as document labels or links for learning document representations can lead to better performance [31, 32]. Thus, training word embeddings for specific tasks by incorporating prior knowledge should always be done when such knowledge is available.

In sentiment analysis, there are various types of prior knowledge available. For example, many datasets provide sentiment labels for documents, which can be used to form the word-sentiment distribution, i.e., the empirical probability each word has to appear in documents with a particular sentiment. External sources can also provide prior knowledge. For example, the sentiment polarities of words can be obtained via some public sentiment lexicons such as the multi-perspective question answering (MPQA) corpus [34], NRC [20], and SenticNet [4], which have been demonstrated to be useful for sentiment analysis. Prior knowledge carries complementary information that is not available in word co-occurrence and, hence, may greatly enhance word embeddings for sentiment analysis. For example, the words "like" and "dislike" can appear in the same or similar contexts, such as I like reading books and I dislike reading books. By merely looking at word co-occurrences, we would learn similar vector representations of "like" and "dislike", as these have similar lexical behavior. From a sentiment point of view, however, such vector representations should be very different, as they convey opposite polarities. Hence, by incorporating prior sentiment knowledge about these two words, we can build more sentiment-aware word embeddings and, hence, learn better distributional representations for sentiment analysis.

In this paper, we study the problem of learning word embeddings for sentiment analysis by exploiting prior knowledge during the embedding learning process. In doing this, we faced two main challenges: (1) how to mathematically represent prior knowledge; and (2) how to efficiently incorporate the prior knowledge into the word embedding learning process. In an attempt to solve these challenges, we propose novel models that utilize word- and document-level sentiment prior knowledge. The main contributions of the paper are as follows:

– Providing a principled way to represent sentiment prior knowledge at both document and word level;
– Proposing a novel framework that incorporates different levels of prior knowledge into the word embedding learning process for better sentiment analysis; and
– Conducting extensive experiments to demonstrate the effectiveness of the proposed framework and investigating which types of prior knowledge work better.

The rest of the paper is organized as follows: "Related Works" reviews related works in the context of sentiment analysis; "The Proposed Framework" introduces the proposed model in detail and proposes a complexity analysis; in "Experimental Results", we conduct experiments to demonstrate the effectiveness of the proposed models; finally, "Conclusion" concludes the paper and proposes future work.

Related Works

Word embedding is a model that aims to learn low-dimensional continuously-valued vector representations of words and has attracted increasing attention [2, 6, 11, 17, 19, 23]. The learnt embedding is able to capture semantic and syntactic relationships and thus facilitates many NLP tasks such as word analogy [18], parsing [1], sentiment analysis [12], and machine translation [37]. The majority of existing word embedding algorithms, such as Skip-gram, CBOW [17] and GloVe [23], follow the essential idea of the distributional hypothesis [8], which states that words occurring in the same contexts tend to have similar meanings. Based on this hypothesis, Skip-gram optimizes word representations by finding a center word that is good at predicting its neighboring words, while CBOW is good at predicting the center word given the neighboring words. GloVe investigates the ratios of word co-occurrence probabilities for learning word embeddings using weighted least squares. These algorithms merely utilize statistical information such as word co-occurrence from documents in an unsupervised setting. However, for a given specific task, there is also auxiliary information available, which has been proven to be helpful for learning task-specific word embeddings that can improve the performance of such a task [14, 36]. For example, TWE proposed in [14] exploits topic information derived from latent Dirichlet allocation (LDA) to learn word embeddings that capture topical information and outperform CBOW on document classification.

Sentiment analysis [3] is a branch of affective computing [24] that requires tackling many NLP sub-tasks, including subjectivity detection [5], concept extraction [28], named entity recognition [15], and sarcasm detection [26]. Many sentiment analysis algorithms have exploited prior knowledge to improve the classification performance [9, 10, 33]. For example, emotional signals such as "lol" and emoticons are utilized for sentiment analysis in [10] and [9], respectively. USEA proposed in [33] uses the publicly available sentiment lexicon MPQA as the indication of word sentiment for unsupervised image sentiment analysis. Although it is of great potential to learn better word embeddings for sentiment analysis by incorporating prior knowledge, the work on this is rather limited.
SE-HyRank in [30] utilizes lexical-level sentiment supervision from Urban Dictionary and proposes a neural-network-based model for learning word embeddings. The proposed framework is different from SE-HyRank in that: (1) we investigate various types of prior knowledge, i.e., the sentiment distribution and ratios of polarity from labels, and sentiment weights from sentiment lexicons; (2) we propose a novel framework based on GloVe, which combines the advantages of global matrix factorization, local context window methods, and sentiment prior knowledge; and (3) we investigate which way of incorporating prior knowledge is better.

The Proposed Framework

Throughout the paper, P stands for the probability and W_i denotes the i-th word in the vocabulary. X_ij is the co-occurrence number of W_i and W_j in a context window of size p. Next, we introduce the details of the proposed framework.

A Basic Model – GloVe

GloVe is one of the most popular word embedding algorithms [23]. The intuition of GloVe is that good word vectors can be learnt from the ratios of co-occurrence probabilities. In particular, considering three words W_i, W_j and W_k, let P_ik denote the probability of W_k being the context word of W_i, i.e., W_k is a neighboring word of W_i with W_i at the center of a context window. Then, if P_ik/P_jk is close to 1, the vector representations of W_i and W_j should be similarly related to the vector representation of W_k; on the other hand, if P_ik/P_jk is large, the vector representations of W_i and W_j should be far away from each other with respect to W_k. To reflect the information captured in the ratio of co-occurrence probabilities, GloVe formulates this intuition as follows:

F((w_i − w_j)^T w̃_k) = P_ik / P_jk    (1)

where w_i and w_j are the vector representations of the target words W_i and W_j, w̃_k is the vector representation of the context word W_k, and F is a function that reflects the information contained in P_ik/P_jk. However, considering two target words, say W_i = "like" and W_j = "dislike", we would obviously have P_ik/P_jk ≈ 1 for these two words, as their contexts are very similar. Thus, by using Eq. 1, we would learn similar vector representations for "like" and "dislike," which is not correct for sentiment analysis as they have opposite sentiment polarities. Thus, we should take the prior sentiment information of words into consideration for learning more meaningful word embeddings.

Incorporating Prior Sentiment Knowledge

There are many different types of prior sentiment information available, which can generally be classified into two categories: document-level sentiment and word-level sentiment. In document-level sentiment, the word sentiment information is consistent with the labeled document. Let M_ij be the number of times word W_i appears in documents of sentiment category j, where j = 1, ..., c and c is the number of sentiment categories. If the prior sentiment information adopts the distribution form, then W_i's sentiment distribution S_i can be written as S_i = (M_i1, M_i2, ..., M_ic) / Σ_{j=1}^{c} M_ij. Alternatively, we can represent the sentiment information as the ratio between the positive category and the negative category, written as B_i = M_i,pos / (α M_i,neg), where α is used to control the sentiment discrepancy. In word-level sentiment, no document labels are available, and the prior sentiment information comes from a word sentiment dictionary. Generally, each word falls into one of the three categories {positive, neutral, negative}. We then represent the word-level sentiment information as

B_i = { 1/α  if positive;  1  if neutral;  α  if negative }    (2)

where α ∈ (0, 1] is the contrast ratio: the smaller α, the more significant the contrast. In the next two sections, we describe how to incorporate document-level sentiment and word-level sentiment, respectively.

Incorporating Document-Level Sentiment

As stated previously, by only considering the ratio of co-occurrence probabilities as in Eq. 1, the learnt embeddings cannot reflect the sentiments of words. To incorporate the sentiment information, let us consider replacing P_ik/P_jk with (P_ik/P_jk) · (S_i/S_j). Note that (P_ik/P_jk) · (S_i/S_j) is a vector and can be seen as a combined sentiment similarity and co-occurrence measure: P_ik/P_jk is the ratio of co-occurrence probabilities, which keeps the ability to capture the semantic meanings of words from documents, while S_i/S_j is a measure of sentiment similarity, which captures the sentiments from the prior information. Thus, even if two words such as "like" and "dislike" have the same contexts so that P_ik/P_jk ≈ 1, their sentiment distributions are different and hence correspond to different vector representations. With P_ik/P_jk replaced by (P_ik S_i)/(P_jk S_j), Eq. 1 is rewritten as

F(w̃_k^T w_i s_i − w̃_k^T w_j s_j) = (P_ik S_i) / (P_jk S_j)    (3)

where s_i and s_j are two vectors of length c, introduced to match the size of (P_ik S_i)/(P_jk S_j), which are also to be learnt. Similarly, we can also replace P_ik/P_jk by (P_ik B_i)/(P_jk B_j); in that case, s_i and s_j are scalars.
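As a minimal numerical sketch of the quantities above (S_i, B_i, and the sentiment-weighted ratio of Eq. 3), consider two words with toy document-level counts; all numbers below are invented for illustration only:

```python
import numpy as np

# Toy counts M[i, j]: occurrences of word i in documents of sentiment
# category j (here c = 2: [positive, negative]).
M = np.array([[90., 10.],   # "like":    mostly in positive documents
              [8.,  92.]])  # "dislike": mostly in negative documents

# Sentiment distribution S_i = (M_i1, ..., M_ic) / sum_j M_ij
S = M / M.sum(axis=1, keepdims=True)

# Sentiment ratio B_i = M_i,pos / (alpha * M_i,neg), with alpha in (0, 1]
alpha = 0.5
B = M[:, 0] / (alpha * M[:, 1])

# Even when the co-occurrence ratio P_ik / P_jk ≈ 1 ("like" and "dislike"
# share contexts), the sentiment-weighted ratio (P_ik S_i) / (P_jk S_j)
# separates the two words, component by component.
P_ratio = 1.0
weighted = P_ratio * S[0] / S[1]

print(S)         # per-word sentiment distributions
print(B)         # per-word sentiment ratios
print(weighted)  # far from 1 in both components
```

The point of the sketch is that `weighted` departs strongly from 1 even though `P_ratio` is exactly 1, which is what forces the learnt vectors of the two words apart.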

There is no difference between the target word and the context word when we select them arbitrarily from the same corpus, and the roles of target word and context word are symmetrical. To keep this consistent, on the right-hand side of Eq. 3, when exchanging w ↔ w̃, F is required to be a homomorphism between the groups (R, +) and (R_{>0}, ×), i.e., [23]

F(w̃_k^T w_i s_i − w̃_k^T w_j s_j) = F(w̃_k^T w_i s_i) / F(w̃_k^T w_j s_j)    (4)

so Eq. 3 is modified into

F(w̃_k^T w_i s_i) / F(w̃_k^T w_j s_j) = (P_ik S_i) / (P_jk S_j)    (5)

and we have

F(w̃_k^T w_i s_i) = P_ik S_i    (6)

The solution to Eq. 6 is F(x) = exp(x), which leads to

exp(w̃_k^T w_i s_i) = P_ik S_i    (7)

Since these terms are independent of the sentiment distribution, we can incorporate them into the bias b_ism; the same applies to the word w̃_k by adding the additional bias b̃_ksm:

w̃_k^T w_i s_m = log(X_ik M_i) − b_ism − b̃_ksm    (11)

After applying the sentiment ratio B_m as the prior information, we have

w̃_k^T w_i s_m = log(X_ik B_m) − log(X_i)    (12)

where m = {i, k} is the same as in Eq. 9. Unlike the case of Eq. 10, the term log(X_i) is independent of the sentiment information and of the word w_k. Hence, we apply the bias b_i to the word w_i and b̃_k to the word w̃_k.

In text analysis, it is common to have words that appear only a few times in the entire dataset. These could be non-important words or misspelled words, which generally introduce noise into the dataset. To reduce the effects of low-frequency words, we introduce a weight function f(X_ij) into the cost function, as GloVe does:

f(x) = { (x/x_max)^η  if x < x_max;  1  otherwise }    (13)
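The weight function of Eq. 13 can be sketched as follows; the default values η = 0.75 and x_max = 100 are the ones used by the GloVe reference implementation, assumed here for illustration rather than stated in this paper:

```python
def glove_weight(x, x_max=100.0, eta=0.75):
    """Weighting of Eq. 13: down-weights low-frequency co-occurrences.

    The defaults eta = 0.75 and x_max = 100 follow the GloVe reference
    implementation and are illustrative assumptions here.
    """
    return (x / x_max) ** eta if x < x_max else 1.0

w_rare = glove_weight(5.0)    # rare pair, strongly down-weighted
w_freq = glove_weight(250.0)  # capped at 1 beyond x_max
```

The cap at 1 prevents very frequent pairs (e.g., those involving stop words) from dominating the cost, while the power law suppresses noisy low-count pairs.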

Incorporating Word-Level Sentiment

Since word-level sentiment is also represented in sentiment-ratio form, i.e., B_i in Eq. 2, the loss function is similar to that of incorporating the document-level sentiment ratio of the word. Specifically, the objective function incorporating the word-level sentiment ratio of the target word (WLJT) is

J = Σ_{i,k=1}^{V} f(X_ik) (w̃_k^T w_i s_i + b_i + b̃_k − log(X_ik B_i))²    (18)

and the objective function incorporating the word-level sentiment ratio of the context word (WLJC) is

J = Σ_{i,k=1}^{V} f(X_ik) (w̃_k^T w_i s_k + b_i + b̃_k − log(X_ik B_k))²    (19)

Time Complexity

From Eqs. 14 to 17, besides Eqs. 18 and 19, the computational complexity mainly lies in the number of nonzero elements in the matrix X and, when the sentiment distribution is used as the prior information, the category number c of the document-level sentiment. If the size of the vocabulary is |V| and the corpus size is |C|, the computing time will not be worse than c|V|². When using the sentiment ratio or the sentiment value from the lexicons as the prior information, the computing time will not exceed |V|². The scales of all models mentioned above are thus no worse than O(|V|²).

To find the tight bound of the computing complexity, assume that the frequency rank r_ij of a word pair in the word co-occurrences follows the power-law function X_ij = k / r_ij^β, where k = r_max^β = |X|^β.

The actual computing complexity at the document level with the sentiment distribution as prior information is proportional to the product of the sum of X and the category number c. If we can find the relation between |C| and |X|, then both the tight bound of the computing complexity of this model and the computing complexity of the other cases (for which c = 1) will follow:

|C| ∼ Σ_ij X_ij c = Σ_{r=1}^{|X|} (k c) / r^β = k c H_{|X|,β}    (20)

where H_{n,m} is the generalized harmonic number, H_{n,m} = n^{1−m}/(1−m) + ζ(m) + O(n^{−m}) for m > 0, m ≠ 1, and ζ(m) is the Riemann zeta function. Hence, we have

|C| ∼ c|X|/(1−β) + cζ(β)|X|^β + O(1)    (21)

If β > 1, then |C| ∼ cζ(β)|X|^β and we have |X| = O(|C|^{1/β}); if β < 1, the term cζ(β)|X|^β is negligible and |X| = O(|C|). We can see that the category number c has no influence on the computing complexity, which is much better than the worst case O(|V|²) when β > 0.5 and β ≠ 1. The same applies to the other models presented in this paper.

Experimental Results

In this section, we conduct experiments to evaluate the quality of the embeddings learnt with the proposed framework and to determine which way of incorporating sentiment works best. A good word representation that captures sentiment should be good at word sentiment analysis and sentence classification. Thus, we first compare the proposed framework with state-of-the-art word embedding algorithms on word sentiment analysis. We also conduct experiments to investigate how the parameter α influences the performance. Further experiments are conducted to evaluate the performance of the proposed framework for sentence classification.

Datasets

The embedding in "Sentiment Analysis for Words" is learnt from the Stanford Sentiment Treebank (SST) [29], and the embedding in "Sentiment Analysis of Sentence" is learnt from MovieReview1 [21]. Apart from SST, the dataset MovieReview2 [22] is also used for sentence classification. The labels in each dataset can be used as document-level sentiments. There are various standard sentiment lexicons such as MPQA, NRC, and SenticNet that can be used to obtain word sentiments. However, none of the lexicons is large enough to cover all the words in the vocabulary. Thus, the best way is to build the word-level sentiment dynamically based on the words in need. Specifically, we first manually labeled a core seed dictionary. When it comes to a word that is not in the seed lexicon, we get its synonym set from the Urban Dictionary¹ and Youdao² at the same time. We then seek the intersection I = {I1, I2, I3} between the seed words and the synonym set, where I1 denotes the positive word number, I2 the neutral word number, and I3 the negative word number. The sentiment of the word is neutral when I is empty. Otherwise, the sentiment of the word is as follows:

B = { 1/α  if I1 ≥ max{I2, I3};  1  if I2 > max{I1, I3};  α  if I3 > max{I1, I2} }    (22)

where α ∈ (0, 1] is the contrast ratio. With this method, we can construct the word-level sentiment dictionary in the corpus of SST [29].

¹ For example, we extract the synonyms of the word 'like' from the page http://www.urbandictionary.com/define.php?term=like
² For example, we extract the synonyms of the word 'like' from the page http://dict.youdao.com/search?q=like

Sentiment Analysis for Words

In this subsection, we conduct experiments of sentiment analysis for words. We compare the performance of the proposed framework with state-of-the-art word embedding algorithms, i.e., CBOW [17], GloVe [23], and SE-HyRank [30]. For DLJC1, DLJC2, DLJT1, and DLJT2, we use document-level sentiment; for WLJT and WLJC, we use the word-level sentiment constructed dynamically as described above. For each word embedding algorithm, we first learn word embeddings from the SST [29]. For all models, we empirically set the vector size of the word embedding to 50 and the learning rate to 0.5, and use AdaGrad [7] to optimize the cost function. After that, we train classifiers to perform sentiment analysis of words with the learnt word vector representations as features. We use NRC and MPQA as the ground truth of the word sentiments. Note that the word-level sentiment is constructed dynamically and does not use any information from NRC or MPQA. We use Adaboost, support vector machine (SVM), and Naïve Bayes (NB) classifiers, and N-fold cross-validation is used to perform the classification test. The results with N = 5 and N = 10 are reported in Table 1. From the table, we make the following observations:

• Among our proposed models, DLJT1 and DLJC1 have the worst results, which clearly shows that the sentiment distribution as the prior information does not bring much improvement in sentiment analysis; it is hard for word embeddings to absorb the prior information when it is added in the distribution format.
• DLJT2, DLJC2, WLJT, and WLJC outperform other word embedding algorithms such as GloVe, Skip-gram and CBOW, which suggests that, by incorporating sentiment information into the learning process, word embeddings can capture more meaningful sentiment information.
• Comparing DLJT2 and DLJC2 with WLJT and WLJC, we find that DLJT2 and DLJC2 slightly outperform WLJT and WLJC, respectively. This is because the label information is more reliable than the word-level sentiment constructed dynamically using the seed lexicons. However, the word-level sentiment is still useful for learning better word embeddings, as WLJT and WLJC significantly outperform GloVe.
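As a concrete illustration, the WLJT cost of Eq. 18 can be evaluated on a toy vocabulary. The random initialization, toy co-occurrence counts, and toy sentiment ratios below are illustrative assumptions; in the experiments above, the parameters are optimized with AdaGrad rather than merely evaluated:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 10                                    # toy vocabulary size, embedding dim
X = rng.integers(0, 50, (V, V)).astype(float)   # toy co-occurrence counts
B = rng.choice([0.5, 1.0, 2.0], V)              # word-level ratios from Eq. 2, alpha = 0.5

w = rng.normal(0, 0.1, (V, d))    # target word vectors w_i
w_t = rng.normal(0, 0.1, (V, d))  # context word vectors w~_k
s = rng.normal(1, 0.1, V)         # sentiment scalars s_i
b = np.zeros(V)                   # target biases b_i
b_t = np.zeros(V)                 # context biases b~_k

def f(x, x_max=100.0, eta=0.75):  # weighting of Eq. 13
    return (x / x_max) ** eta if x < x_max else 1.0

# WLJT cost (Eq. 18), summed over pairs with nonzero co-occurrence
J = 0.0
for i in range(V):
    for k in range(V):
        if X[i, k] > 0:
            err = w_t[k] @ w[i] * s[i] + b[i] + b_t[k] - np.log(X[i, k] * B[i])
            J += f(X[i, k]) * err ** 2
print(J)
```

Swapping `s[i]` and `B[i]` for `s[k]` and `B[k]` in the inner loop yields the WLJC cost of Eq. 19.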

Table 1 Results of the word sentiment validations

Embedding        N-Fold = 5                       N-Fold = 10
                 Adaboost   SVM      NB           Adaboost   SVM      NB

NRC
  CBOW           62.36%    61.93%   61.22%       62.46%    61.86%   61.33%
  GloVe          60.59%    60.29%   60.42%       60.52%    60.22%   60.25%
  SE-HyRank      54.57%    55.07%   50.28%       54.57%    55.07%   51.56%
  DLJT1          58.82%    59.08%   59.32%       58.88%    58.28%   59.08%
  DLJC1          57.34%    58.45%   59.79%       57.67%    58.51%   59.92%
  DLJT2          64.77%    64.92%   64.94%       64.67%    64.94%   64.91%
  DLJC2          64.27%    64.70%   64.97%       64.40%    64.87%   64.97%
  WLJT           64.23%    63.43%   61.22%       64.15%    64.12%   62.49%
  WLJC           64.10%    64.03%   61.32%       63.99%    64.06%   63.26%

MPQA
  CBOW           65.18%    65.82%   64.89%       65.43%    65.57%   64.82%
  GloVe          64.11%    64.00%   63.64%       64.07%    64.07%   63.68%
  SE-HyRank      60.89%    59.78%   60.79%       60.75%    60.79%   59.21%
  DLJT1          63.21%    64.36%   63.43%       63.29%    64.47%   63.50%
  DLJC1          62.36%    62.68%   63.54%       62.25%    62.39%   63.43%
  DLJT2          67.82%    68.75%   67.82%       67.68%    68.39%   67.82%
  DLJC2          67.68%    68.43%   67.97%       67.89%    68.39%   67.93%
  WLJT           64.57%    64.64%   63.43%       65.53%    65.92%   64.36%
  WLJC           66.32%    65.25%   63.43%       66.10%    65.85%   64.50%

Data in italics (in the original) are best results
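The N-fold validation protocol behind Table 1 can be sketched as follows. The synthetic embeddings and the nearest-centroid classifier are stand-ins for the learnt word vectors and the Adaboost/SVM/NB classifiers actually used in the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins: 100 "words" with 50-d embeddings and binary
# lexicon labels (as would come from NRC or MPQA).
E = rng.normal(0, 1, (100, 50))
y = (E[:, 0] > 0).astype(int)

def nfold_accuracy(E, y, n_folds=5):
    """N-fold cross-validation with a nearest-centroid classifier."""
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, n_folds)
    accs = []
    for f in range(n_folds):
        test = folds[f]
        train = np.concatenate(folds[:f] + folds[f + 1:])
        # One centroid per class from the training fold
        centroids = np.stack([E[train][y[train] == c].mean(axis=0) for c in (0, 1)])
        # Predict the class of the nearest centroid
        pred = np.argmin(((E[test][:, None] - centroids) ** 2).sum(-1), axis=1)
        accs.append((pred == y[test]).mean())
    return float(np.mean(accs))

acc5 = nfold_accuracy(E, y, 5)
acc10 = nfold_accuracy(E, y, 10)
```

In the actual experiments, `E` would hold the 50-dimensional embeddings learnt from SST and `y` the NRC or MPQA ground-truth polarities.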

Fig. 1 The cost in each model with different α (four panels: DLJT2, WLJT, DLJC2, and WLJC; each panel shows curves #1–#8 on a logarithmic cost axis from 1E-4 to 0.1 over 50 training iterations)

Effects of α

In "Incorporating Word-Level Sentiment" and "Datasets", the contrast ratio α plays an important role in representing prior sentiment information. In order to find out how the quality of the word embedding changes with different α, we vary the value of α over {1, 0.75, 0.5, 0.25, 0.1, 0.05, 0.02, 0.01} and learn word embeddings. When α = 1, we refer to DLJT2 as DLJT2#1; when α = 0.75, we refer to DLJT2 as DLJT2#2, and so forth, up to DLJT2#8 for α = 0.01. The same applies to DLJC2, WLJT, and WLJC.

The objective function values of DLJT2, DLJC2, WLJT, and WLJC in each training iteration are shown in Fig. 1, where we can see that the cost is inversely related to α. From Eqs. 2 and 22, the smaller the value of α, the larger the distance between positive and negative. Furthermore, it takes more time to train when this distance in the loss function is larger. Hence, α should not be too small, for the sake of training time efficiency.

To find out the influence of α on word sentiment analysis, we vary the value of α over {1, 0.75, 0.5, 0.25, 0.1, 0.05, 0.02, 0.01} and learn embeddings for word sentiment analysis. The comparison results are shown in Fig. 2, where the blue lines are the results of incorporating word-level sentiment and the red lines are for document-level sentiment. Clearly, the sentiment analysis accuracy with document-level sentiment is much better than that with word-level sentiment. Generally, the accuracy increases as α decreases; when α is smaller than 0.05, most of the lines are flat. Based on the results in Figs. 1 and 2, a value of α within the range [0.01, 0.05] generally achieves better results with both word-level and document-level sentiment.

Sentiment Analysis of Sentence

In this subsection, we further conduct sentiment analysis to evaluate the quality of the word embedding learnt by DLJT2, as it has the best performance among all the models proposed for word sentiment analysis. Three datasets, MovieReview1 [21], SST [29], and MovieReview2 [22], are used for the evaluation; the word embedding is learnt from MovieReview1 [21].

Fig. 2 Sensitivity of α on DLJT2, DLJC2, WLJT, and WLJC with word sentiment analysis using Adaboost, SVM, and NB (accuracy from about 0.62 to 0.69 as α varies from 0.01 to 1)

Table 2 The coverage of the learnt words in other datasets

DataSet          Corpus Num    Coverage
SST              13799         68.58%
MovieReview2     14696         64.99%
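The coverage figures in Table 2 are simply the fraction of each dataset's vocabulary that is present among the learnt words; a minimal sketch with toy vocabularies (the word lists below are invented for illustration):

```python
# Coverage as in Table 2: fraction of a target dataset's vocabulary
# present among the words with learnt embeddings.
learnt_vocab = {"good", "bad", "movie", "plot", "acting", "great"}
target_vocab = {"good", "bad", "movie", "boring", "script"}

covered = learnt_vocab & target_vocab
coverage = len(covered) / len(target_vocab)
print(f"{coverage:.2%}")  # 3 of 5 words covered -> 60.00%
```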

The embedding learnt from MovieReview1 ensures that the scale of the learnt words covers at least 60% of the other datasets; the details are shown in Table 2. The datasets MovieReview2 [22] and SST [29], which are used for the sentiment analysis with the pre-trained embeddings, are each split into 80 and 20% sets, and the former set is used for training.

The learnt embeddings are used as input to neural networks; the words that are absent from the learnt embeddings are initialized randomly. In our experiments, we use a convolutional neural network (CNN) [12] as the classifier. For comparison, word embeddings from CBOW [18], GloVe [23], and SE-HyRank [30] are used as the baselines. Both the training episodes and the architecture of the CNN are the same for the different word embedding algorithms. The results are reported in Fig. 3. From the figure, we can see that SE-HyRank performs better here than it did in word-level sentiment analysis, which is caused by the neural network structure that it uses (which only focuses on the abstract information). The proposed model DLJT2 outperforms the other word embedding algorithms, which suggests that the word embedding learnt by DLJT2 can capture sentiments at different scales for better sentiment analysis.

Fig. 3 The accuracy of the sentiment analysis on MovieReview2 and SST by using different word embeddings (CBOW, GloVe, SE-HyRank, and DLJT2) as the initial weights

Conclusion

In this paper, we investigated the problem of incorporating sentiment prior knowledge to learn meaningful word embeddings for sentiment analysis. We proposed a novel framework that can capture sentiment information of various types. The time complexity of the proposed framework is also comparable with that of the classical word embedding algorithm GloVe. Experimental results on word sentiment analysis and sentence sentiment analysis demonstrate the effectiveness of the proposed framework and show which prior knowledge is more effective. Parameter analysis was also performed to investigate the sensitivity of the proposed framework. In this work, we utilized label information from sentiment lexicons. As future work, we plan to explore how to use some strong emotional signals within the corpus for learning sentiment-specific word embeddings without using labeled data or auxiliary lexicons.

Compliance with Ethical Standards

Conflict of Interest The authors declare that they have no conflict of interest.

Informed Consent Informed consent was not required as no humans or animals were involved.

Human and Animal Rights This article does not contain any studies with human or animal subjects performed by any of the authors.

References

1. Bansal M, Gimpel K, Livescu K. Tailoring continuous word representations for dependency parsing. In: ACL (2); 2014. p. 809–815.
2. Bengio Y, Schwenk H, Sénécal JS, Morin F, Gauvain JL. A neural probabilistic language model. J Mach Learn Res. 2003;3(6):1137–1155.
3. Cambria E, Das D, Bandyopadhyay S, Feraco A. A practical guide to sentiment analysis. Cham, Switzerland: Springer; 2017.
4. Cambria E, Poria S, Bajpai R, Schuller B. SenticNet 4: a semantic resource for sentiment analysis based on conceptual primitives. In: COLING; 2016. p. 2666–2677.
5. Chaturvedi I, Ragusa E, Gastaldo P, Zunino R, Cambria E. Bayesian network based extreme learning machine for subjectivity detection. J Franklin Inst. 2017. doi:10.1016/j.jfranklin.2017.06.007.
6. Collobert R, Weston J. A unified architecture for natural language processing: deep neural networks with multitask learning. In: ICML, Helsinki, Finland; 2008. p. 160–167.
7. Duchi J, Hazan E, Singer Y. Adaptive subgradient methods for online learning and stochastic optimization. J Mach Learn Res. 2011;12:2121–2159.
8. Harris ZS. Distributional structure. Word. 1954;10(2-3):146–162.
9. Hogenboom A, Bal D, Frasincar F, Bal M, de Jong F, Kaymak U. Exploiting emoticons in sentiment analysis. In: Proceedings of the 28th Annual ACM Symposium on Applied Computing; 2013. p. 703–710.
10. Hu X, Tang J, Gao H, Liu H. Unsupervised sentiment analysis with emotional signals. In: Proceedings of the 22nd International Conference on World Wide Web; 2013. p. 607–618.
11. Huang EH, Socher R, Manning CD, Ng AY. Improving word representations via global context and multiple word prototypes. In: ACL; 2012. p. 873–882.
12. Kim Y. Convolutional neural networks for sentence classification. In: EMNLP; 2014.
13. Lin C-C, Ammar W, Dyer C, Levin LS. Unsupervised POS induction with word embeddings. In: NAACL HLT; 2015. p. 1311–1316.
14. Liu Y, Liu Z, Chua TS, Sun M. Topical word embeddings. In: Twenty-Ninth AAAI Conference on Artificial Intelligence; 2015.
15. Ma Y, Cambria E, Gao S. Label embedding for zero-shot fine-grained named entity typing. In: COLING, Osaka; 2016. p. 171–180.
16. Majumder N, Poria S, Gelbukh A, Cambria E. Deep learning-based document modeling for personality detection from text. IEEE Intelligent Systems. 2017;32(2):74–79.
17. Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. arXiv:1301.3781; 2013.
18. Mikolov T, Sutskever I, Chen K, Corrado G, Dean J. Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems. 2013;26:3111–3119.
19. Mnih A, Hinton G. Three new graphical models for statistical language modelling. In: International Conference on Machine Learning; 2007. p. 641–648.
20. Mohammad SM, Turney PD. Crowdsourcing a word-emotion association lexicon. Comput Intell. 2013;29(3):436–465.
21. Pang B, Lee L. A sentimental education: sentiment analysis using subjectivity summarization based on minimum cuts. In: Proceedings of the ACL; 2004.
22. Pang B, Lee L. Seeing stars: exploiting class relationships for sentiment categorization with respect to rating scales. In: Proceedings of the ACL; 2005.
23. Pennington J, Socher R, Manning C. GloVe: global vectors for word representation. In: EMNLP; 2014.
24. Poria S, Cambria E, Bajpai R, Hussain A. A review of affective computing: from unimodal analysis to multimodal fusion. Information Fusion. 2017;37:98–125.
25. Poria S, Cambria E, Gelbukh A. Aspect extraction for opinion mining with a deep convolutional neural network. Knowl-Based Syst. 2016;108:42–49.
26. Poria S, Cambria E, Hazarika D, Vij P. A deeper look into sarcastic tweets using deep convolutional neural networks. In: COLING; 2016. p. 1601–1612.
27. Poria S, Chaturvedi I, Cambria E, Hussain A. Convolutional MKL based multimodal emotion recognition and sentiment analysis. In: ICDM, Barcelona; 2016. p. 439–448.
28. Rajagopal D, Cambria E, Olsher D, Kwok K. A graph-based approach to commonsense concept extraction and semantic similarity detection. In: WWW, Rio De Janeiro; 2013. p. 565–570.
29. Socher R, Perelygin A, Wu JY, Chuang J, Manning CD, Ng AY, Potts C. Recursive deep models for semantic compositionality over a sentiment treebank. In: EMNLP; 2013. p. 1631–1642.
30. Tang D, Wei F, Qin B, Yang N, Liu T, Zhou M. Sentiment embeddings with applications to sentiment analysis. IEEE Transactions on Knowledge and Data Engineering. 2016;28(2):496–509.
31. Tang J, Qu M, Mei Q. PTE: predictive text embedding through large-scale heterogeneous text networks. In: Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2015. p. 1165–1174.
32. Wang S, Tang J, Aggarwal C, Liu H. Linked document embedding for classification. In: CIKM; 2016.
33. Wang Y, Wang S, Tang J, Liu H, Li B. Unsupervised sentiment analysis for social media images. In: IJCAI, Buenos Aires, Argentina; 2015. p. 2378–2379.
34. Wilson T, Wiebe J, Hoffmann P. Recognizing contextual polarity in phrase-level sentiment analysis. In: HLT/EMNLP; 2005. p. 347–354.
35. Zhong X, Sun A, Cambria E. Time expression analysis and recognition using syntactic token types and general heuristic rules. In: ACL; 2017.
36. Zhou C, Sun C, Liu Z, Lau FCM. Category enhanced word embedding. Computer Science; 2015.
37. Zou WY, Socher R, Cer DM, Manning CD. Bilingual word embeddings for phrase-based machine translation. In: EMNLP; 2013. p. 1393–1398.