Dependency-Based Convolutional Neural Networks for Sentence Embedding∗
Mingbo Ma†  Liang Huang†‡  Bing Xiang‡  Bowen Zhou‡
†Graduate Center & Queens College, City University of New York
‡IBM Watson Group, T. J. Watson Research Center, IBM
{mma2,lhuang}@gc.cuny.edu  {lhuang,bingxia,zhou}@us.ibm.com

∗This work was done at both IBM and CUNY, and was supported in part by DARPA FA8750-13-2-0041 (DEFT) and NSF IIS-1449278. We thank Yoon Kim for sharing his code, and James Cross and Kai Zhao for discussions.

Abstract

In sentence modeling and classification, convolutional neural network approaches have recently achieved state-of-the-art results, but all such efforts process word vectors sequentially and neglect long-distance dependencies. To combine deep learning with linguistic structures, we propose a dependency-based convolution approach, making use of tree-based n-grams rather than surface ones, thus utilizing non-local interactions between words. Our model improves sequential baselines on all four sentiment and question classification tasks, and achieves the highest published accuracy on TREC.

1 Introduction

Convolutional neural networks (CNNs), originally invented in computer vision (LeCun et al., 1995), have recently attracted much attention in natural language processing (NLP) on problems such as sequence labeling (Collobert et al., 2011), semantic parsing (Yih et al., 2014), and search query retrieval (Shen et al., 2014). In particular, recent work on CNN-based sentence modeling (Kalchbrenner et al., 2014; Kim, 2014) has achieved excellent, often state-of-the-art, results on various classification tasks such as sentiment, subjectivity, and question-type classification. However, despite their celebrated success, there remains a major limitation from the linguistics perspective: CNNs, having been invented on pixel matrices in image processing, only consider sequential n-grams that are consecutive on the surface string and neglect long-distance dependencies, while the latter play an important role in many linguistic phenomena such as negation, subordination, and wh-extraction, all of which can affect the sentiment, subjectivity, or other categorization of a sentence.

Indeed, in the sentiment analysis literature, researchers have incorporated long-distance information from syntactic parse trees, but the results are somewhat inconsistent: some reported small improvements (Gamon, 2004; Matsumoto et al., 2005), while others reported the opposite (Dave et al., 2003; Kudo and Matsumoto, 2004). As a result, syntactic features have yet to become popular in the sentiment analysis community. We suspect one of the reasons for this is data sparsity (according to our experiments, tree n-grams are significantly sparser than surface n-grams), but this problem has largely been alleviated by the recent advances in word embedding. Can we combine the advantages of both worlds?

We therefore propose a very simple model: dependency-based convolutional neural networks (DCNNs). Our model is similar to Kim (2014), but while his sequential CNNs put a word in its sequential context, ours considers a word and its parent, grandparent, great-grandparent, and siblings on the dependency tree. This way we incorporate long-distance information that is otherwise unavailable on the surface string.

Experiments on three classification tasks demonstrate the superior performance of our DCNNs over the baseline sequential CNNs. In particular, our accuracy on the TREC dataset outperforms all previously published results in the literature, including those with heavy hand-engineered features.

Independently of this work, Mou et al. (2015, unpublished) reported related efforts; see Sec. 3.3.

2 Dependency-based Convolution

The original CNN, first proposed by LeCun et al. (1995), applies convolution kernels on a series of continuous areas of given images, and was adapted to NLP by Collobert et al. (2011). Following Kim (2014), one-dimensional convolution operates the convolution kernel in sequential order as in Equation 1, where $x_i \in \mathbb{R}^d$ represents the $d$-dimensional word representation of the $i$-th word in the sentence, and $\oplus$ is the concatenation operator. Therefore $x_{i,j}$ refers to the concatenated word vector from the $i$-th word to the $(i+j)$-th word:

x_{i,j} = x_i \oplus x_{i+1} \oplus \cdots \oplus x_{i+j}   (1)

Sequential word concatenation $x_{i,j}$ works as an n-gram model, feeding local information into the convolution operations. However, this setting cannot capture long-distance relationships unless we enlarge the window indefinitely, which would inevitably cause the data sparsity problem.
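For concreteness, the following sketch shows how the sequential concatenation of Equation 1 turns a sentence of $d$-dimensional word vectors into surface n-gram windows, as used by a sequential CNN. This is an illustrative Python/NumPy sketch, not the released code; the function name and the toy data are our own assumptions.

```python
import numpy as np

def seq_concat(words, i, j):
    """x_{i,j}: concatenation of the word vectors from position i to i+j (Eq. 1)."""
    return np.concatenate(words[i:i + j + 1])

# Toy example: a sentence of 5 words with d = 4 dimensional embeddings.
d = 4
sentence = [np.random.randn(d) for _ in range(5)]

# All surface trigram windows (n = 3, i.e. j = 2) of the sentence.
n = 3
windows = [seq_concat(sentence, i, n - 1) for i in range(len(sentence) - n + 1)]
print(len(windows), windows[0].shape)   # 3 windows, each of size n * d = 12
```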
Figure 1: Dependency tree of the example sentence "Despite the film 's shortcomings the stories are quietly moving ." from the Movie Reviews dataset.

In order to capture long-distance dependencies, we propose the dependency-based convolution model (DCNN). Figure 1 illustrates an example from the Movie Reviews (MR) dataset (Pang and Lee, 2005). The sentiment of this sentence is clearly positive, but this is quite difficult for sequential CNNs to capture because many n-gram windows would include the highly negative word "shortcomings", and the distance between "Despite" and "shortcomings" is quite long. DCNN, however, can capture the tree-based bigram "Despite – shortcomings", thus flipping the sentiment, as well as the tree-based trigram "ROOT – moving – stories", which is highly positive.

2.1 Convolution on Ancestor Paths

We define our concatenation based on the dependency tree for a given modifier $x_i$:

x_{i,k} = x_i \oplus x_{p(i)} \oplus \cdots \oplus x_{p^{k-1}(i)}   (2)

where the function $p^k(i)$ returns the index of the $i$-th word's $k$-th ancestor, which is recursively defined as:

p^k(i) = \begin{cases} p(p^{k-1}(i)) & \text{if } k > 0 \\ i & \text{if } k = 0 \end{cases}   (3)
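The sketch below implements the ancestor function $p^k(i)$ of Equation 3 and the tree-based concatenation $x_{i,k}$ of Equation 2. It is again an illustrative Python/NumPy sketch rather than the authors' code; in particular, encoding the parse as a `heads` array of head indices (with the root pointing to itself) is our assumption.

```python
import numpy as np

def ancestor(i, k, heads):
    """p^k(i) from Eq. 3: follow the head pointer k times (p^0(i) = i)."""
    for _ in range(k):
        i = heads[i]
    return i

def tree_concat(words, i, k, heads):
    """x_{i,k} from Eq. 2: concatenate word i with its k-1 closest ancestors."""
    path = [ancestor(i, step, heads) for step in range(k)]
    return np.concatenate([words[j] for j in path])

# Toy example: 5 words with d = 4 dimensional embeddings and a made-up parse
# in which word 4 is the root (its head index points to itself).
d = 4
words = [np.random.randn(d) for _ in range(5)]
heads = [1, 4, 1, 2, 4]               # toy head indices, not a real parse

x = tree_concat(words, 3, 3, heads)   # word 3, its parent (2), and grandparent (1)
print(x.shape)                        # (3 * d,) = (12,)
```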
the long-distance Figure 1 illustrates dependen- an overnation the in feature Eq.4 map can to be get regarded the maximum as pattern acti- detec- vation cˆ = max c representing the entire feature cies weexample propose from the the dependency-based Movie Reviews (MR) convolu- dataset map.tion: Our only DTCNNs the most also similar pool the pattern maximum between ac- the tion model(Pang (DCNN). and Lee, 2005). Figure The1 illustrates sentimentof an this exam- sen- words and the filter could return the maximum ac- tence is obviously positive, but this is quite dif- tivation from feature map to detect the strongest ple fromficult the for Movie sequential Reviews CNNs (MR) because dataset many n (-gramPang activationtivation. over In sequential the whole tree CNNs, (i.e.,over max-over-time the whole pool- and Lee,windows 2005). would The include sentiment the highly of this negative sentence word sentence).ing (Collobert Since the et treeal., no2011 longer; Kim, defines 2014 a se-) operates is obviously“shortcomings”, positive, andbut thisthe distance is quite between difficult “De- for quentialover the “time” feature direction, map we to refer get to the our maximum pooling acti- sequentialspite” CNNs and “shortcomings” because many is quiten-gram long. windows DTCNN, asvation “max-over-tree”cˆ = max pooling.c representing the entire feature In order to capture enough variations, we ran- wouldhowever, include the could highly capture negative the tree-based word “short- bigram map. Our DCNNs also pool the maximum activa- “Despite – shortcomings”, thus flipping the senti- domly initialize the set of filters to detect different comings”, and the distance between “Despite” and tion from feature map to detect the strongest ac- ment, and the tree-based trigram “ROOT – moving structure patterns. Each filter’s height is the num- “shortcomings”– stories”, which is quite is highly long. positive. DCNN, however, bertivation of words over considered the whole and tree the width (i.e., is over always the whole could capture the tree-based bigram “Despite – equalsentence). to the dimensionality Since the treed of no word longer representa- defines a se- shortcomings”,2.1 Convolution thus flipping on Ancestor the sentiment, Paths and tion.quential Each “time” filter will direction, be represented we refer by only to our one pooling We define our concatenation based on the depen- the tree-based trigram “ROOT – moving – sto- featureas “max-over-tree” after max-over-tree pooling. pooling. After a series dency tree for a given modifier xi: of convolution with different filter with different ries”, which is highly positive. heights,In order multiple to features capture carry enough different variations, structural we ran- xi,k = xi xp(i) xpk 1(i) (2) ⊕ ⊕ · · · ⊕ − informationdomly initialize become the theset final of representation filters to detect of the different 2.1 Convolution onk Ancestor Paths where function p (i) returns the i-th word’s k-th inputstructure sentence. patterns. Then, this Each sentence filter’s representation height is the num- We define our concatenation based on the depen- ancestor index, which is recursively defined as: isber passed of words to a fully considered connected andsoft-max the layerwidth and is always dency tree for a given modifierk 1 xi: outputs a distribution over different labels. p(p − (i)) if k > 0 equal to the dimensionality d of word representa- pk(i) = (3) Neural networks often suffer from overtrain- xi,k = xi xpi(i) xifpkk1=(i) 0 (2) tion.