Unsupervised Word and Dependency Path Embeddings for Aspect Term Extraction

Yichun Yin1, Furu Wei2, Li Dong3, Kaimeng Xu1, Ming Zhang1∗, Ming Zhou2
1School of EECS, Peking University
2Microsoft Research
3Institute for Language, Cognition and Computation, University of Edinburgh
fyichunyin,1300012834,mzhang [email protected],ffuwei,[email protected],[email protected]
∗Corresponding author: Ming Zhang
arXiv:1605.07843v1 [cs.CL] 25 May 2016

Abstract

In this paper, we develop a novel approach to aspect term extraction based on unsupervised learning of distributed representations of words and dependency paths. The basic idea is to connect two words (w1 and w2) with the dependency path (r) between them in the embedding space. Specifically, our method optimizes the objective w1 + r ≈ w2 in the low-dimensional space, where the multi-hop dependency paths are treated as a sequence of grammatical relations and modeled by a recurrent neural network. Then, we design the embedding features that consider linear context and dependency context information for the conditional random field (CRF) based aspect term extraction. Experimental results on the SemEval datasets show that (1) with only embedding features, we can achieve state-of-the-art results; (2) our embedding method, which incorporates the syntactic information among words, yields better performance than other representative ones in aspect term extraction.

1 Introduction

Aspect term extraction [Hu and Liu, 2004; Pontiki et al., 2014; 2015] aims to identify the aspect expressions which refer to the product's or service's properties (or attributes) in the review sentence. It is a fundamental step to obtain the fine-grained sentiment of specific aspects of a product, besides the coarse-grained overall sentiment. Until now, there have been two major approaches to aspect term extraction. The unsupervised (or rule-based) methods [Qiu et al., 2011] rely on a set of manually defined opinion words as seeds and rules derived from syntactic parsing trees to iteratively extract aspect terms. The supervised methods [Jakob and Gurevych, 2010; Li et al., 2010; Chernyshevich, 2014; Toh and Wang, 2014; San Vicente et al., 2015] usually treat aspect term extraction as a sequence labeling problem, and the conditional random field (CRF) has been the mainstream method in the aspect term extraction task of SemEval.

Representation learning has been introduced and has achieved success in natural language processing (NLP) [Bengio et al., 2013], such as word embeddings [Mikolov et al., 2013b] and structured embeddings of knowledge bases [Bordes et al., 2011]. It learns distributed representations for text at different granularities, such as words, phrases and sentences, and reduces data sparsity compared with the conventional one-hot representation. The distributed representations have been reported to be useful in many NLP tasks [Turian et al., 2010; Collobert et al., 2011].

In this paper, we focus on representation learning for aspect term extraction under an unsupervised framework. Besides words, dependency paths, which have been shown to be important clues in aspect term extraction [Qiu et al., 2011], are also taken into consideration. Inspired by the representation learning of knowledge bases [Bordes et al., 2011; Neelakantan et al., 2015; Lin et al., 2015], which embeds both entities and relations into a low-dimensional space, we learn distributed representations of words and dependency paths from the text corpus. Specifically, the optimization objective is formalized as w1 + r ≈ w2. In the triple (w1, w2, r), w1 and w2 are words, and r is the corresponding dependency path, consisting of a sequence of grammatical relations. A recurrent neural network [Mikolov et al., 2010] is used to learn the distributed representations of dependency paths. Furthermore, the word embeddings are enhanced by linear context information in a multi-task learning manner.

The learned embeddings of words and dependency paths are utilized as features in CRF for aspect term extraction. The embeddings are real-valued and not necessarily in a bounded range [Turian et al., 2010]. We therefore first map the continuous embeddings into discrete embeddings to make them more appropriate for the CRF model. Then, we construct the embedding features, which include the target word embedding, the linear context embedding and the dependency context embedding, for aspect term extraction. We conduct experiments on the SemEval datasets and obtain performance comparable with the top systems. To demonstrate the effectiveness of the proposed embedding method, we also compare our method with other state-of-the-art models. With the same feature settings, our approach achieves better results. Moreover, we perform a qualitative analysis to show the effectiveness of the learned word and dependency path embeddings.
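The discretization step is only summarized in this introduction. As one illustration of how continuous embeddings can be turned into discrete CRF features, the following sketch bins each embedding dimension by its empirical quantiles; the binning scheme, the bin count, and the feature naming are assumptions made here for illustration, not the paper's actual procedure.

```python
import numpy as np

def discretize_embeddings(emb, n_bins=10):
    """Map each real-valued embedding dimension to a categorical bin id so it
    can serve as a discrete CRF feature.
    emb: (vocab_size, dim) matrix of continuous embeddings.
    Returns an integer matrix of the same shape with values in [0, n_bins).
    Equal-frequency binning is an assumption; the paper only states that the
    continuous embeddings are mapped to discrete ones."""
    vocab_size, dim = emb.shape
    discrete = np.zeros((vocab_size, dim), dtype=np.int64)
    for d in range(dim):
        # bin boundaries at empirical quantiles of dimension d
        edges = np.quantile(emb[:, d], np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
        discrete[:, d] = np.searchsorted(edges, emb[:, d])
    return discrete

def crf_features(word_id, discrete_emb):
    """Turn one word's discretized embedding into string features of the form
    'e_<dim>=<bin>' for a feature-based CRF toolkit."""
    return ["e_%d=%d" % (d, b) for d, b in enumerate(discrete_emb[word_id])]
```

Each word then contributes features such as e_3=7, which a standard feature-based CRF toolkit can consume alongside the linear-context and dependency-context features described later.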
The contributions of this paper are two-fold. First, we use the dependency path to link words in the embedding space for distributed representation learning of words and dependency paths. With this method, the syntactic information is encoded in word embeddings and the multi-hop dependency path embeddings are learned explicitly. Second, we construct the CRF features only based on the derived embeddings for aspect term extraction, and achieve state-of-the-art results.

2 Related Work

There are a number of unsupervised methods for aspect term extraction. Hu and Liu [2004] mine aspect terms with a data mining approach called association rule mining. In addition, they use opinion words to extract infrequent aspect terms. Using relationships between opinion words and aspect words to extract aspect terms is employed in many follow-up studies. In [Qiu et al., 2011], the dependency relation is used as a crucial clue, and the double propagation method is proposed to iteratively extract aspect terms and opinion words.

Supervised algorithms [Li et al., 2012; Liu et al., 2015a] have also been developed for aspect term extraction. Among these algorithms, the mainstream method is the CRF model. Li et al. [2010] propose a new machine learning framework on top of CRF to jointly extract positive opinion words, negative opinion words and aspect terms. In order to obtain a more domain-independent extraction, Jakob and Gurevych [2010] train the CRF model on review sentences from different domains. Most of the top systems in SemEval also rely on the CRF model, and hand-crafted features are constructed to boost performance.

Representation learning has been employed in NLP [Bengio et al., 2013]. Two successful applications are the word embedding [Mikolov et al., 2013b] and the knowledge embedding [Bordes et al., 2011]. The word embeddings are learned from raw textual data and are effective representations of word meanings. They have been proven useful in many NLP tasks [Turian et al., 2010; Guo et al., 2014]. The knowledge embedding method models a relation as a translation vector that connects the vectors of two entities. It optimizes an objective over all the facts in the knowledge bases, encoding the global information. The learned embeddings are useful for reasoning about missing facts and for fact extraction in knowledge bases.

As dependency paths contain rich linguistic information between words, recent works model them as dense vectors. Liu et al. [2015b] incorporate the vectors of grammatical relations (one-hop dependency paths) into the relation representation between two entities for the relation classification task. The dependency-based word embedding model [Levy and Goldberg, 2014], which encodes dependency information into word embeddings, is one work similar to ours. Lin et al. [2015] also learn the representations of relation paths, employing a path-constraint resource allocation algorithm to measure the reliability of a path. In this paper, we learn the semantic composition of dependency paths over dependency trees.

3 Method

In this section, we first present the unsupervised learning of word and dependency path embeddings. In addition, we describe how to enhance word embeddings by leveraging the linear context in a multi-task learning manner. Then, the construction of embedding features for the CRF model is discussed.

3.1 Unsupervised Learning of Word and Dependency Path Embeddings

We learn the distributed representations of words and dependency paths in a unified model, as shown in Figure 1. We extract triples (w1, w2, r) from dependency trees, where w1 and w2 denote two words, and the corresponding dependency path r is the shortest path from w1 to w2, consisting of a sequence of grammatical relations. For example, in Figure 1, given that service is w1 and professional is w2, we obtain the dependency path ∗ ←conj− ∗ −dep→ ∗ −amod→ ∗. We notice that considering lexicalized dependency paths could provide more information for the embedding learning. However, we would need to memorize more dependency path frequencies for the learning method (negative sampling) described in Section 3.3. Formally, the number of dependency paths is |Vword|^(n−1) |Vdep|^n when we take n-hop dependency paths into consideration, where Vword is the word vocabulary and Vdep is the vocabulary of grammatical relations. This is computationally expensive, even when n = 2. Hence, in this paper, we only include the grammatical relations for modeling r, which also shows promising results in the experiment section.
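To make the triple extraction concrete, the sketch below collects (w1, w2, r) triples with unlexicalized shortest dependency paths from a parsed sentence. spaCy and networkx are used purely for illustration and are not prescribed by the paper; the max_hops limit and the fact that arc direction is not encoded in the relation labels are assumptions of this sketch.

```python
import spacy
import networkx as nx

nlp = spacy.load("en_core_web_sm")  # any dependency parser would do

def extract_triples(sentence, max_hops=3):
    """Collect (w1, w2, r) triples where r is the shortest dependency path
    between w1 and w2, kept as a sequence of grammatical relations only
    (unlexicalized), following the modeling choice described above."""
    doc = nlp(sentence)
    g = nx.Graph()
    for tok in doc:
        for child in tok.children:
            # undirected edge labelled with the dependency relation; arc
            # direction could be encoded by decorating the label (assumption)
            g.add_edge(tok.i, child.i, rel=child.dep_)
    triples = []
    for w1 in doc:
        for w2 in doc:
            if w1.i >= w2.i:
                continue
            try:
                path = nx.shortest_path(g, w1.i, w2.i)
            except (nx.NetworkXNoPath, nx.NodeNotFound):
                continue
            if len(path) - 1 > max_hops:
                continue
            rels = [g.edges[a, b]["rel"] for a, b in zip(path, path[1:])]
            triples.append((w1.text, w2.text, rels))
    return triples

print(extract_triples("The service is fast and professional."))
```

Running this over the parsed review corpus yields the set of triples that the learning objective below is optimized on.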
The proposed learning model connects two words through the dependency path in the vector space. By incorporating the dependency path into the model, the word embeddings encode more syntactic information [Levy and Goldberg, 2014] and the distributed representations of dependency paths are learned explicitly. To be specific, the model requires that r be close to w2 − w1, and we minimize the following loss function for model learning:

∑_{(w1,w2,r)∈C1} ∑_{r′∼p(r)} max{0, 1 − (w2 − w1)⊤ r + (w2 − w1)⊤ r′}    (1)

where C1 represents the set of triples extracted from the dependency trees parsed from the text corpus, r′ is a sequence
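As a concrete illustration of this objective, a minimal PyTorch sketch is given below: relation embeddings are composed into a path vector r by a recurrent network, and Equation (1) is the hinge loss contrasting the observed path with a corrupted one. The embedding size, the single-layer vanilla RNN, and the way negative paths are supplied are assumptions of this sketch; the paper's negative sampling from p(r) is described in its Section 3.3.

```python
import torch
import torch.nn as nn

class WordPathModel(nn.Module):
    """Sketch of the objective w1 + r ≈ w2: a path vector r is composed from
    grammatical-relation embeddings by a recurrent network, and a margin loss
    (Eq. 1) contrasts the observed path r against a corrupted path r'."""

    def __init__(self, n_words, n_rels, dim=100):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, dim)
        self.rel_emb = nn.Embedding(n_rels, dim)
        self.rnn = nn.RNN(dim, dim, batch_first=True)

    def path_vector(self, rel_ids):
        # rel_ids: (batch, path_len) ids of grammatical relations along the path
        _, h = self.rnn(self.rel_emb(rel_ids))
        return h.squeeze(0)                                   # (batch, dim)

    def forward(self, w1, w2, path, neg_path):
        diff = self.word_emb(w2) - self.word_emb(w1)          # w2 - w1
        pos = (diff * self.path_vector(path)).sum(-1)         # (w2 - w1)^T r
        neg = (diff * self.path_vector(neg_path)).sum(-1)     # (w2 - w1)^T r'
        return torch.clamp(1.0 - pos + neg, min=0.0).mean()   # Eq. (1), hinge

# toy usage: 3 triples with 3-hop paths, assumed vocabulary sizes
model = WordPathModel(n_words=10000, n_rels=50)
w1 = torch.tensor([1, 2, 3]); w2 = torch.tensor([4, 5, 6])
path = torch.randint(0, 50, (3, 3)); neg_path = torch.randint(0, 50, (3, 3))
loss = model(w1, w2, path, neg_path)
loss.backward()
```

In practice one would draw neg_path by corrupting observed paths according to p(r) and optimize the mean loss over mini-batches of triples from C1 with any stochastic gradient method.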