Dependency-based Convolutional Neural Networks for Sentence Embedding∗

Mingbo Ma†  Liang Huang†‡  Bing Xiang‡  Bowen Zhou‡

†Graduate Center & Queens College, City University of New York
{mma2,lhuang}@gc.cuny.edu
‡IBM Watson Group, T. J. Watson Research Center, IBM
{lhuang,bingxia,zhou}@us.ibm.com

∗ This work was done at both IBM and CUNY, and was supported in part by DARPA FA8750-13-2-0041 (DEFT) and NSF IIS-1449278. We thank Yoon Kim for sharing his code, and James Cross and Kai Zhao for discussions.

Abstract

In sentence modeling and classification, convolutional neural network approaches have recently achieved state-of-the-art results, but all such efforts process word vectors sequentially and neglect long-distance dependencies. To combine deep learning with linguistic structures, we propose a dependency-based convolution approach, making use of tree-based n-grams rather than surface ones, thus utilizing non-local interactions between words. Our model improves sequential baselines on all four sentiment and question classification tasks, and achieves the highest published accuracy on TREC.

1 Introduction

Convolutional neural networks (CNNs), originally invented in computer vision (LeCun et al., 1995), have recently attracted much attention in natural language processing (NLP) on problems such as sequence labeling (Collobert et al., 2011), semantic parsing (Yih et al., 2014), and search query retrieval (Shen et al., 2014). In particular, recent work on CNN-based sentence modeling (Kalchbrenner et al., 2014; Kim, 2014) has achieved excellent, often state-of-the-art, results on various classification tasks such as sentiment, subjectivity, and question-type classification. However, despite this celebrated success, there remains a major limitation from the linguistics perspective: CNNs, having been invented on pixel matrices in image processing, only consider sequential n-grams that are consecutive on the surface string, and they neglect long-distance dependencies. The latter play an important role in many linguistic phenomena such as negation, subordination, and wh-extraction, all of which might duly affect the sentiment, subjectivity, or other categorization of a sentence.

Indeed, in the sentiment analysis literature, researchers have incorporated long-distance information from syntactic parse trees, but the results are somewhat inconsistent: some reported small improvements (Gamon, 2004; Matsumoto et al., 2005), while others reported the opposite (Dave et al., 2003; Kudo and Matsumoto, 2004). As a result, syntactic features have yet to become popular in the sentiment analysis community. We suspect one reason for this is data sparsity (according to our experiments, tree n-grams are significantly sparser than surface n-grams), but this problem has largely been alleviated by the recent advances in word embedding. Can we combine the advantages of both worlds?

We therefore propose very simple dependency-based convolutional neural networks (DCNNs). Our model is similar to Kim (2014), but while his sequential CNNs put a word in its sequential context, ours considers a word together with its parent, grandparent, great-grandparent, and siblings on the dependency tree. This way we incorporate long-distance information that is otherwise unavailable on the surface string.

Experiments on three classification tasks demonstrate the superior performance of our DCNNs over the baseline sequential CNNs. In particular, our accuracy on the TREC dataset outperforms all previously published results in the literature, including those with heavily hand-engineered features. Independently of this work, Mou et al. (2015, unpublished) reported related efforts; see Sec. 3.3.

2 Dependency-based Convolution

The original CNN, first proposed by LeCun et al. (1995), applies convolution kernels to a series of continuous areas of given images; it was adapted to NLP by Collobert et al. (2011). Following Kim (2014), one-dimensional convolution operates the convolution kernel in sequential order as in Equation (1), where x_i ∈ R^d represents the d-dimensional word representation of the i-th word in the sentence, and ⊕ is the concatenation operator.

Figure 1: Dependency tree of an example sentence from the Movie Reviews dataset (arcs omitted in this text rendering):

    ROOT  Despite the film 's shortcomings the stories are quietly moving .

Therefore x_{i,j} refers to the concatenated word vector from the i-th word to the (i+j)-th word:

    x_{i,j} = x_i ⊕ x_{i+1} ⊕ · · · ⊕ x_{i+j}    (1)

Sequential word concatenation x_{i,j} acts as an n-gram model which feeds local information into the convolution operations. However, this setting cannot capture long-distance relationships unless we enlarge the window indefinitely, which would inevitably cause the data sparsity problem.
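To make the sequential baseline concrete, here is a minimal sketch of the window concatenation in Equation (1); this is our own illustration (not part of the paper's released code or of Kim (2014)'s), and the array names and toy dimensions are assumptions.

```python
import numpy as np

def sequential_concat(X, i, j):
    """Eq. (1): x_{i,j} = x_i (+) x_{i+1} (+) ... (+) x_{i+j}.

    X is an (l, d) matrix whose l rows are the d-dimensional
    word vectors of the sentence (0-indexed here).
    """
    return np.concatenate(X[i:i + j + 1])  # shape ((j+1) * d,)

# toy usage: word vectors for a 5-word sentence, d = 4
X = np.random.randn(5, 4)
window = sequential_concat(X, 1, 2)  # words 1..3 -> shape (12,)
```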
In order to capture long-distance dependencies, we propose the dependency-based convolution model (DCNN). Figure 1 illustrates an example from the Movie Reviews (MR) dataset (Pang and Lee, 2005). The sentiment of this sentence is obviously positive, but it is quite difficult for sequential CNNs because many n-gram windows would include the highly negative word "shortcomings", and the distance between "Despite" and "shortcomings" is quite long. DCNN, however, can capture the tree-based bigram "Despite – shortcomings", thus flipping the sentiment, as well as the tree-based trigram "ROOT – moving – stories", which is highly positive.

2.1 Convolution on Ancestor Paths

We define our concatenation based on the dependency tree for a given modifier x_i:

    x_{i,k} = x_i ⊕ x_{p(i)} ⊕ · · · ⊕ x_{p^(k−1)(i)}    (2)

where the function p^k(i) returns the index of the i-th word's k-th ancestor, defined recursively as:

    p^k(i) = p(p^(k−1)(i))   if k > 0
    p^k(i) = i               if k = 0    (3)

Figure 2 (left) illustrates ancestor path patterns of various orders. We always start the convolution with x_i and concatenate it with its ancestors. If the root node is reached, we add "ROOT" as dummy ancestors (vertical padding).

For a given tree-based concatenated word sequence x_{i,k}, the convolution operation applies a filter w ∈ R^(k×d) to x_{i,k} with a bias term b, as described in Equation (4):

    c_i = f(w · x_{i,k} + b)    (4)

where f is a non-linear activation function such as the rectified linear unit (ReLU) or the sigmoid function. The filter w is applied to each word in the sentence, generating the feature map c ∈ R^l:

    c = [c_1, c_2, · · · , c_l]    (5)

where l is the length of the sentence.
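The sketch below (again our own illustration, with assumed names such as `head`, `ROOT`, and `x_root`) implements the ancestor lookup of Equation (3), the tree-based concatenation of Equation (2), and the filter application of Equations (4)-(5); the k×d filter is stored flattened, so w · x_{i,k} is an ordinary dot product.

```python
import numpy as np

ROOT = -1  # assumed dummy index for the "ROOT" padding token

def ancestor(head, i, k):
    """Eq. (3): p^k(i), where head[i] is the parent of word i and the
    root word's head is ROOT. Once ROOT is reached it is repeated,
    which realizes the vertical padding with dummy ancestors."""
    for _ in range(k):
        if i == ROOT:
            break
        i = head[i]
    return i

def tree_concat(X, head, i, k, x_root):
    """Eq. (2): x_{i,k} = x_i (+) x_{p(i)} (+) ... (+) x_{p^{k-1}(i)}."""
    vecs = []
    for step in range(k):
        a = ancestor(head, i, step)
        vecs.append(x_root if a == ROOT else X[a])
    return np.concatenate(vecs)  # shape (k * d,)

def feature_map(X, head, w, b, k, x_root):
    """Eqs. (4)-(5): c_i = f(w . x_{i,k} + b) for every word i, with
    ReLU as f; w is the k-by-d filter flattened to shape (k * d,)."""
    relu = lambda z: np.maximum(z, 0.0)
    return np.array([relu(np.dot(w, tree_concat(X, head, i, k, x_root)) + b)
                     for i in range(len(X))])  # feature map c in R^l
```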
2.2 Max-Over-Tree Pooling and Dropout

The filters convolved with the different word concatenations in Eq. (4) can be regarded as pattern detectors: only the pattern most similar to the filter returns the maximum activation. In sequential CNNs, max-over-time pooling (Collobert et al., 2011; Kim, 2014) operates over the feature map to get the maximum activation ĉ = max c, representing the entire feature map. Our DCNNs also pool the maximum activation from the feature map, detecting the strongest activation over the whole tree (i.e., over the whole sentence). Since the tree no longer defines a sequential "time" direction, we refer to our pooling as "max-over-tree" pooling.

In order to capture enough variations, we randomly initialize the set of filters to detect different structural patterns. Each filter's height is the number of words considered, and its width is always equal to the dimensionality d of the word representation. Each filter is represented by only one feature after max-over-tree pooling. After a series of convolutions with filters of different heights, the multiple features carrying different structural information become the final representation of the input sentence. This sentence representation is then passed to a fully connected soft-max layer, which outputs a distribution over the labels.

Neural networks often suffer from overtraining. Following Kim (2014), we employ random dropout on the penultimate layer (Hinton et al., 2014) in order to prevent the co-adaptation of hidden units. In our experiments, we set the dropout rate to 0.5 and the learning rate to 0.95 by default. Following Kim (2014), training is done through stochastic gradient descent over shuffled mini-batches with the Adadelta update rule (Zeiler, 2012).
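A minimal sketch of the pooling and the training-time dropout step described above; the function names are ours, and a real implementation (e.g., on top of Kim (2014)'s Theano code) would compute this inside the computation graph rather than with raw numpy.

```python
import numpy as np

def max_over_tree(c):
    """Max-over-tree pooling: keep only the strongest activation of
    each filter anywhere in the tree, i.e., one feature per filter.
    c is a (num_filters, l) stack of feature maps from Eq. (5)."""
    return c.max(axis=1)

def dropout(v, rate=0.5, rng=np.random):
    """Random dropout on the penultimate layer (training only); the
    paper uses rate 0.5 following Kim (2014). At test time the mask
    is dropped and activations are rescaled instead."""
    return v * (rng.random_sample(v.shape) >= rate)
```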

Figure 2: Convolution patterns on trees (left: ancestor paths; right: siblings). Word concatenation always starts with m, while h, g, and g2 denote parent, grandparent, and great-grandparent, etc.; s and t denote siblings, and "_" denotes words excluded in convolution.

    ancestor paths             siblings
    n   pattern(s)             n   pattern(s)
    3   m h g                  2   s m
    4   m h g g2               3   s m h;  t s m
    5   m h g g2 g3            4   t s m h;  s m h g

2.3 Convolution on Siblings

Ancestor paths alone are not enough to capture many linguistic phenomena such as conjunction. Inspired by higher-order dependency parsing (McDonald and Pereira, 2006; Koo and Collins, 2010), we also incorporate the siblings of a given word in various ways; see Figure 2 (right) for details.

2.4 Combined Model

Powerful as it is, structural information still does not fully cover sequential information. Moreover, parsing errors (which are common, especially for informal text such as online reviews) directly affect DCNN performance, while sequential n-grams are always correctly observed. To best exploit both kinds of information, we want to combine the two models. The easiest combination is to concatenate the representations and feed them into the fully connected soft-max layer; combining features from different types of sources in this way can stabilize the performance. The final sentence representation is thus:

    ĉ = [ĉ_a^(1), ..., ĉ_a^(N_a);  ĉ_s^(1), ..., ĉ_s^(N_s);  ĉ^(1), ..., ĉ^(N)]

where the three groups are the pooled features from the ancestor, sibling, and sequential filters, and N_a, N_s, and N are the numbers of filters of each kind. In practice, we use 100 filters for each template in Figure 2. The fully combined representation is 1,100-dimensional, in contrast to 300 dimensions for the sequential CNN.
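As a sketch of the combined model (our own illustration, not the released code): pool every feature map and concatenate the three groups. With 100 filters per template and the 11 templates of Figure 2 (3 ancestor patterns, 5 sibling patterns, plus the 3 sequential window sizes), this yields the 1,100-dimensional representation mentioned above.

```python
import numpy as np

def combined_representation(ancestor_maps, sibling_maps, sequential_maps):
    """Concatenate max-over-tree pooled features from the three sources.

    Each argument is a list of (num_filters, l) feature maps, one per
    template; with 100 filters per template this gives a
    (3 + 5 + 3) * 100 = 1,100-dimensional sentence vector, which is
    then fed to the fully connected soft-max layer.
    """
    pooled = [m.max(axis=1)                       # max-over-tree pooling
              for group in (ancestor_maps, sibling_maps, sequential_maps)
              for m in group]
    return np.concatenate(pooled)
```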
3 Experiments

Table 1 summarizes our results in the context of other high-performing efforts in the literature. We implement our DCNN on top of the open-source CNN code by Kim (2014) (https://github.com/yoonkim/CNN_sentence); our implementation can be found at https://github.com/cosmmb/DCNN. We use three benchmark datasets in two categories: sentiment analysis on both the Movie Review (MR) (Pang and Lee, 2005) and Stanford Sentiment Treebank (SST-1) (Socher et al., 2013) datasets, and question classification on TREC (Li and Roth, 2002). For all datasets, we first obtain the dependency parse trees from the Stanford parser (Manning et al., 2014); note that the phrase-structure trees in SST-1 are automatically parsed and thus cannot be used as gold-standard trees. The window sizes for the different choices of convolution are shown in Figure 2. For the dataset without a development set (MR), we randomly choose 10% of the training data to indicate early stopping. In order to have a fair comparison with the baseline CNN, we also use 3 to 5 as our sequential window sizes. Most of our results are generated by GPU due to its efficiency; however, CPU could potentially produce better results, since GPU only supports float32 while CPU supports float64.

3.1 Sentiment Analysis

Both sentiment analysis datasets (MR and SST-1) are based on movie reviews. The differences between them are mainly the number of categories and whether a standard split is given. There are 10,662 sentences in the MR dataset. Each instance is labeled positive or negative, and in most cases contains one sentence. Since no standard data split is given, following the literature we use 10-fold cross validation to include every sentence in training and testing at least once. Concatenating with sibling and sequential information obviously improves DCNNs, and the final model outperforms the baseline sequential CNNs by 0.4 and ties with Zhu et al. (2015).

Different from MR, the Stanford Sentiment Treebank (SST-1) annotates finer-grained labels, very positive, positive, neutral, negative, and very negative, on an extension of the MR dataset. It contains 11,855 sentences with a standard split. Our model achieves an accuracy of 49.5, which is second only to Irsoy and Cardie (2014). We set the batch size to 100 for this task.

Table 1: Results on Movie Review (MR), Stanford Sentiment Treebank (SST-1), and TREC datasets. TREC-2 is TREC with fine-grained labels. †Results generated by GPU (all others generated by CPU). ∗Results generated from Kim (2014)'s implementation.

    Category             Model                                            MR     SST-1   TREC    TREC-2
    This work            DCNNs: ancestor                                  80.4†  47.7†   95.4†   88.4†
                         DCNNs: ancestor+sibling                          81.7†  48.3†   95.6†   89.0†
                         DCNNs: ancestor+sibling+sequential               81.9   49.5    95.4†   88.8†
    CNNs                 CNNs-non-static (Kim, 2014) – baseline           81.5   48.0    93.6    86.4∗
                         CNNs-multichannel (Kim, 2014)                    81.1   47.4    92.2    86.0∗
                         Deep CNNs (Kalchbrenner et al., 2014)            –      48.5    93.0    –
    Recursive NNs        Recursive Autoencoder (Socher et al., 2011)      77.7   43.2    –       –
                         Recursive Neural Tensor (Socher et al., 2013)    –      45.7    –       –
                         Deep Recursive NNs (Irsoy and Cardie, 2014)      –      49.8    –       –
    Recurrent NNs        LSTM on tree (Zhu et al., 2015)                  81.9   48.0    –       –
    Other deep learning  Paragraph-Vec (Le and Mikolov, 2014)             –      48.7    –       –
    Hand-coded rules     SVMs (Silva et al., 2011)                        –      –       95.0    90.8

3.2 Question Classification

In the TREC dataset, the entire set of 5,952 sentences is classified into 6 categories: abbreviation, entity, description, human, location, and numeric. In this experiment, DCNNs easily outperform all other methods even with ancestor convolution only. DCNNs with siblings achieve the best performance in the published literature. DCNNs combined with sibling and sequential information might suffer from overfitting on the training data, based on our observation. One thing to note here is that our best result even exceeds SVMs (Silva et al., 2011) with 60 hand-coded rules. We set the batch size to 210 for this task.

The TREC dataset also provides subcategories such as numeric:temperature, numeric:distance, and entity:vehicle. To make our task more realistic and challenging, we also test the proposed model with respect to the 50 subcategories. There are obvious improvements over sequential CNNs in the last column of Table 1. Like ours, Silva et al. (2011) is a tree-based system, but it uses constituency trees while ours uses dependency trees. They report a higher fine-grained accuracy of 90.8, but their parser is trained only on the QuestionBank (Judge et al., 2006), while we used the standard Stanford parser trained on both the Penn Treebank and the QuestionBank. Moreover, as mentioned above, their approach is rule-based while ours is automatically learned. For this task, we set the batch size to 30.

Figure 3: Examples from TREC (a–c), SST-1 (d), and TREC with fine-grained labels (e–f) that are misclassified by the baseline CNN but correctly labeled by our DCNN (shown as correct label ⇒ CNN's label). For example, (a) should be entity but is labeled location by CNN.

    (a) What is Hawaii 's state flower ?                        enty ⇒ loc
    (b) What is natural gas composed of ?                       enty ⇒ desc
    (c) What does a defibrillator do ?                          desc ⇒ enty
    (d) Nothing plot wise is worth emailing home about          mild negative ⇒ mild positive
    (e) What is the temperature at the center of the earth ?    NUM:temp ⇒ NUM:dist
    (f) What were Christopher Columbus ' three ships ?          ENTY:veh ⇒ LOC:other

3.3 Discussions and Examples

Compared with sentiment analysis, the advantage of our proposed model is obviously more substantial on the TREC dataset. Based on our error analysis, we conclude that this is mainly due to the difference in parse tree quality between the two tasks.

Figure 4: Examples from the TREC dataset that are misclassified by DCNN but correctly labeled by the baseline CNN (correct label ⇒ DCNN's label). For example, (a) should be numeric but is labeled entity by DCNN.

    (a) What is the speed hummingbirds fly ?                 num ⇒ enty   ("fly" mistagged as a noun)
    (b) What body of water are the Canary Islands in ?       loc ⇒ enty
    (c) What position did Willie Davis play in baseball ?    hum ⇒ enty

Figure 5: Examples from the TREC dataset that are misclassified by both DCNN and the baseline CNN (correct label ⇒ DCNN's and CNN's labels). For example, (a) should be numeric but is labeled entity by DCNN and description by CNN.

    (a) What is the melting point of copper ?                num ⇒ enty and desc
    (b) What did Jesse Jackson organize ?                    hum ⇒ enty and enty
    (c) What is the electrical output in Madrid , Spain ?    enty ⇒ num and num

In sentiment analysis, the dataset is collected from the Rotten Tomatoes website, which includes much irregular usage of language; some of the sentences even come from languages other than English. The errors in parse trees inevitably affect the classification accuracy. By contrast, the parser works substantially better on the TREC dataset, since all questions are in formal written English and the training set for the Stanford parser already includes the QuestionBank (Judge et al., 2006), which contains 2,000 TREC sentences (see http://nlp.stanford.edu/software/parser-faq.shtml).

Figure 3 visualizes examples where CNN errs while DCNN does not. For example, CNN labels (a) as location due to "Hawaii" and "state", while the long-distance backbone "What – flower" is clearly asking for an entity. Similarly, in (d), DCNN captures the obviously negative tree-based trigram "Nothing – worth – emailing". Note that our model also works with non-projective dependency trees such as the one in (b). The last two examples in Figure 3 visualize cases where DCNN outperforms the baseline CNN on fine-grained TREC. In example (e), the word "temperature" is second from the top of the tree and is the root of the 8-word span "the ... earth". When we use a window of size 5 for tree convolution, every word in that span gets convolved with "temperature", which should be the reason why DCNN gets this example correct.

Figure 4 showcases examples where the baseline CNN gets better results than DCNN. Example (a) is misclassified as entity by DCNN due to a parsing/tagging error (the Stanford parser performs its own part-of-speech tagging): the word "fly" at the end of the sentence should be a verb instead of a noun, and "hummingbirds fly" should be a relative clause modifying "speed".

There are also sentences that are misclassified by both the baseline CNN and DCNN; Figure 5 shows three such examples. Example (a) is not classified as numeric by either method due to the ambiguous meaning of the word "point", which is difficult to capture by word embedding: the word can mean location, opinion, etc., and apparently its numerical aspect is not captured by the embedding. Example (c) might be an annotation error.

From the mistakes made by DCNNs, we find that the performance of DCNN is mainly limited by two factors: the accuracy of the parser and the quality of the word embedding. Future work will focus on these two issues.

Shortly before submitting to ACL 2015, we learned that Mou et al. (2015, unpublished) had independently reported concurrent and related efforts. Their constituency model, based on their unpublished work on programming languages (Mou et al., 2014), performs convolution on pretrained recursive node representations rather than on word embeddings, and thus bears little, if any, resemblance to our dependency-based model. (Both their 2014 and 2015 reports proposed, independently of each other and independently of our work, the term "tree-based convolution", TBCNN.) Their dependency model is more closely related, but it always includes a node and all its children (resembling Iyyer et al. (2014)), which makes it a variant of our sibling model and always flat; by contrast, our ancestor model looks at the vertical path from any word to its ancestors, which is linguistically motivated (Shen et al., 2008).

4 Conclusions

We have presented a very simple dependency-based convolution framework which outperforms sequential CNN baselines on sentence modeling and classification tasks. Extensions of this model would consider dependency labels and constituency trees; we would also like to evaluate on gold-standard parse trees.

References

R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12.

Kushal Dave, Steve Lawrence, and David M. Pennock. 2003. Mining the peanut gallery: Opinion extraction and semantic classification of product reviews. In Proceedings of World Wide Web.

Michael Gamon. 2004. Sentiment classification on customer feedback data: noisy data, large feature vectors, and the role of linguistic analysis. In Proceedings of COLING.

Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Improving neural networks by preventing co-adaptation of feature detectors. Journal of Machine Learning Research, 15.

Ozan Irsoy and Claire Cardie. 2014. Deep recursive neural networks for compositionality in language. In Advances in Neural Information Processing Systems, pages 2096–2104.

Mohit Iyyer, Jordan Boyd-Graber, Leonardo Claudino, Richard Socher, and Hal Daumé III. 2014. A neural network for factoid question answering over paragraphs. In Proceedings of EMNLP.

John Judge, Aoife Cahill, and Josef van Genabith. 2006. QuestionBank: Creating a corpus of parse-annotated questions. In Proceedings of COLING.

Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. In Proceedings of ACL.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of EMNLP.

Terry Koo and Michael Collins. 2010. Efficient third-order dependency parsers. In Proceedings of ACL.

Taku Kudo and Yuji Matsumoto. 2004. A boosting algorithm for classification of semi-structured text. In Proceedings of EMNLP.

Quoc V. Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proceedings of ICML.

Y. LeCun, L. Jackel, L. Bottou, A. Brunot, C. Cortes, J. Denker, H. Drucker, I. Guyon, U. Müller, E. Säckinger, P. Simard, and V. Vapnik. 1995. Comparison of learning algorithms for handwritten digit recognition. In Int'l Conf. on Artificial Neural Nets.

Xin Li and Dan Roth. 2002. Learning question classifiers. In Proceedings of COLING.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of ACL: Demonstrations, pages 55–60.

Shotaro Matsumoto, Hiroya Takamura, and Manabu Okumura. 2005. Sentiment classification using word sub-sequences and dependency sub-trees. In Proceedings of PA-KDD.

Ryan McDonald and Fernando Pereira. 2006. Online learning of approximate dependency parsing algorithms. In Proceedings of EACL.

Lili Mou, Ge Li, Zhi Jin, Lu Zhang, and Tao Wang. 2014. TBCNN: A tree-based convolutional neural network for programming language processing. Unpublished manuscript: http://arxiv.org/abs/1409.5718.

Lili Mou, Hao Peng, Ge Li, Yan Xu, Lu Zhang, and Zhi Jin. 2015. Discriminative neural sentence modeling by tree-based convolution. Unpublished manuscript: http://arxiv.org/abs/1504.01106v5. Version 5 dated June 2, 2015; Version 1 ("Tree-based Convolution: A New Architecture for Sentence Modeling") dated Apr 5, 2015.

Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of ACL, pages 115–124.

Libin Shen, Lucas Champollion, and Aravind K. Joshi. 2008. LTAG-spinal and the treebank. Language Resources and Evaluation, 42(1):1–19.

Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Grégoire Mesnil. 2014. Learning semantic representations using convolutional neural networks for web search. In Proceedings of WWW.

J. Silva, L. Coheur, A. C. Mendes, and Andreas Wichert. 2011. From symbolic to sub-symbolic information in question classification. Artificial Intelligence Review, 35.

Richard Socher, Jeffrey Pennington, Eric H. Huang, Andrew Y. Ng, and Christopher D. Manning. 2011. Semi-supervised recursive autoencoders for predicting sentiment distributions. In Proceedings of EMNLP.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of EMNLP.

Wen-tau Yih, Xiaodong He, and Christopher Meek. 2014. Semantic parsing for single-relation question answering. In Proceedings of ACL.

Matthew Zeiler. 2012. ADADELTA: An adaptive learning rate method. Unpublished manuscript: http://arxiv.org/abs/1212.5701.

Xiaodan Zhu, Parinaz Sobhani, and Hongyu Guo. 2015. Long short-term memory over tree structures. In Proceedings of ICML.
