When Are Tree Structures Necessary for Deep Learning of Representations?
Jiwei Li1, Minh-Thang Luong1, Dan Jurafsky1 and Eduard Hovy2
1Computer Science Department, Stanford University, Stanford, CA 94305
2Language Technology Institute, Carnegie Mellon University, Pittsburgh, PA 15213
jiweil,lmthang,[email protected] [email protected]

Abstract

Recursive neural models, which use syntactic parse trees to recursively generate representations bottom-up, are a popular architecture. However, there have not been rigorous evaluations showing for exactly which tasks this syntax-based method is appropriate. In this paper, we benchmark recursive neural models against sequential recurrent neural models, enforcing apples-to-apples comparison as much as possible. We investigate 4 tasks: (1) sentiment classification at the sentence level and phrase level; (2) matching questions to answer-phrases; (3) discourse parsing; (4) semantic relation extraction. Our goal is to understand better when, and why, recursive models can outperform simpler models. We find that recursive models help mainly on tasks (like semantic relation extraction) that require long-distance connection modeling, particularly on very long sequences. We then introduce a method for allowing recurrent models to achieve similar performance: breaking long sentences into clause-like units at punctuation and processing them separately before combining. Our results thus help understand the limitations of both classes of models, and suggest directions for improving recurrent models.

1 Introduction

Deep learning based methods learn low-dimensional, real-valued vectors for word tokens, mostly from large-scale corpora (e.g., (Mikolov et al., 2013; Le and Mikolov, 2014; Collobert et al., 2011)), successfully capturing syntactic and semantic aspects of text.

For tasks where the inputs are larger text units (e.g., phrases, sentences or documents), a compositional model is first needed to aggregate tokens into a vector with fixed dimensionality that can be used as a feature for other NLP tasks. Models for achieving this usually fall into two categories: recurrent models and recursive models.

Recurrent models (also referred to as sequence models) deal successfully with time-series data (Pearlmutter, 1989; Dorffner, 1996) like speech (Robinson et al., 1996; Lippmann, 1989; Graves et al., 2013) or handwriting recognition (Graves and Schmidhuber, 2009; Graves, 2012). They were applied early on to NLP (Elman, 1990), by modeling a sentence as tokens processed sequentially, combining the current token at each step with previously built embeddings. Recurrent models can be extended to bidirectional variants that process the sequence both left-to-right and right-to-left. These models generally consider no linguistic structure aside from word order.

Recursive neural models (also referred to as tree models), by contrast, are structured by syntactic parse trees. Instead of considering tokens sequentially, recursive models combine neighbors based on the recursive structure of parse trees, starting from the leaves and proceeding recursively in a bottom-up fashion until the root of the parse tree is reached. For example, the phrase the food is delicious is composed following the operation sequence ((the food) (is delicious)) rather than the sequential order (((the food) is) delicious).
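To make the contrast concrete, the following toy sketch (not from the paper) traces the order in which the two model families combine the tokens of the example phrase; the combine function is a placeholder assumed here for a learned composition function.

    # Illustrative sketch (not the authors' code): composition order of a recurrent
    # model versus a recursive (tree) model on "the food is delicious".
    def combine(x, y):
        # stand-in for a learned composition function; here it only records structure
        return f"({x} {y})"

    tokens = ["the", "food", "is", "delicious"]

    # Recurrent / sequential order: strictly left to right
    seq = tokens[0]
    for tok in tokens[1:]:
        seq = combine(seq, tok)
    print(seq)    # (((the food) is) delicious)

    # Recursive / tree order, following the parse ((the food) (is delicious))
    tree = combine(combine("the", "food"), combine("is", "delicious"))
    print(tree)   # ((the food) (is delicious))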
Many recursive models have been proposed (e.g., (Paulus et al., 2014; Irsoy and Cardie, 2014)), and applied to various NLP tasks, among them entailment (Bowman, 2013; Bowman et al., 2014), sentiment analysis (Socher et al., 2013; Irsoy and Cardie, 2013; Dong et al., 2014), question-answering (Iyyer et al., 2014), relation classification (Socher et al., 2012; Hashimoto et al., 2013), and discourse (Li and Hovy, 2014).

One possible advantage of recursive models is their potential for capturing long-distance dependencies: two tokens may be structurally close to each other, even though they are far away in the word sequence. For example, a verb and its corresponding direct object can be far apart in terms of tokens if many adjectives lie in between, but they are adjacent in the parse tree (Irsoy and Cardie, 2013). However, we do not know if this advantage is truly important, and if so for which tasks, or whether other issues are at play. Indeed, the reliance of recursive models on parsing is also a potential disadvantage, given that parsing is relatively slow, domain-dependent, and can be errorful.

On the other hand, recent progress in multiple subfields of neural NLP has suggested that recurrent nets may be sufficient to deal with many of the tasks for which recursive models have been proposed. Recurrent models without parse structures have shown good results in sequence-to-sequence generation (Sutskever et al., 2014) for machine translation (e.g., (Kalchbrenner and Blunsom, 2013; Luong et al., 2014)), parsing (Vinyals et al., 2014), and sentiment, where for example recurrent-based paragraph vectors (Le and Mikolov, 2014) outperform recursive models (Socher et al., 2013) on the Stanford sentiment-bank dataset.

Our goal in this paper is thus to investigate a number of tasks in order to understand for which kinds of problems recurrent models may be sufficient, and for which kinds recursive models offer specific advantages. We investigate four tasks with different properties:

• Binary sentiment classification at the sentence level (Pang et al., 2002) and phrase level (Socher et al., 2013), which focuses on understanding the role of recursive models in dealing with semantic compositionality in various scenarios, such as different lengths of inputs and whether or not supervision is comprehensive.

• Phrase Matching on the UMD-QA dataset (Iyyer et al., 2014), which can help show the difference between outputs from the intermediate components of different models, i.e., representations for intermediate parse tree nodes and outputs from recurrent models at different time steps. It also helps show whether parsing is useful for finding similarities between question sentences and target phrases.

• Semantic Relation Classification on the SemEval-2010 data (Hendrickx et al., 2009), which can help understand whether parsing is helpful in dealing with long-term dependencies, such as relations between two words that are far apart in the sequence.

• Discourse parsing (RST dataset), which is useful for measuring the extent to which parsing improves discourse tasks that need to combine meanings of larger text units. Discourse parsing treats elementary discourse units (EDUs), which are usually short clauses, as the basic units to operate on. The task also sheds light on the extent to which syntactic structures help acquire short text representations.

The principal motivation for this paper is to understand better when, and why, recursive models are needed to outperform simpler models, by enforcing apples-to-apples comparison as much as possible. This paper applies existing models to existing tasks, barely offering novel algorithms or tasks. Our goal is rather an analytic one: to investigate different versions of recursive and recurrent models. This work helps understand the limitations of both classes of models, and suggests directions for improving recurrent models.

The rest of this paper is organized as follows: We detail versions of recursive/recurrent models in Section 2, present the tasks and results in Section 3, and conclude with discussions in Section 4.

2 Recursive and Recurrent Models

2.1 Notations

We assume that the text unit S, which could be a phrase, a sentence or a document, is comprised of a sequence of tokens/words: S = {w_1, w_2, ..., w_{N_S}}, where N_S denotes the number of tokens in S. Each word w is associated with a K-dimensional embedding vector e_w = {e_w^1, e_w^2, ..., e_w^K}. The goal of recursive and recurrent models is to map the sequence to a K-dimensional vector e_S, based on its tokens and their corresponding embeddings.

Standard Recurrent/Sequence Models successively take the word w_t at step t, combine its vector representation e_t with the previously built hidden vector h_{t-1} from time t-1, calculate the resulting current embedding h_t, and pass it to the next step. The embedding h_t for the current time step t is thus:

    h_t = f(W · h_{t-1} + V · e_t)    (1)

where W and V denote compositional matrices. If N_S denotes the length of the sequence, h_{N_S} represents the whole sequence S.

Standard recursive/Tree models work in a similar way, but process neighboring words in parse tree order rather than sequence order. They compute a representation for each parent node based on its immediate children, recursively, in a bottom-up fashion until reaching the root of the tree. For a given node η in the tree with left child η_left (with representation e_left) and right child η_right (with representation e_right), the standard recursive network calculates e_η as follows:

    e_η = f(W · e_{η_left} + V · e_{η_right})    (2)
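To make the two compositions concrete, the following minimal sketch (an illustrative reconstruction, not the authors' implementation) encodes a token sequence with Eq. (1) and a binary parse tree with Eq. (2); the choice of f = tanh, the random parameters, the zero initial state, and the nested-tuple tree encoding are assumptions for illustration only.

    # Illustrative sketch of Eqs. (1) and (2); f = tanh and random W, V are assumptions.
    import numpy as np

    K = 4
    rng = np.random.default_rng(1)
    W, V = rng.normal(size=(K, K)), rng.normal(size=(K, K))

    def recurrent_encode(token_embs):
        """Eq. (1): h_t = f(W·h_{t-1} + V·e_t); h_{N_S} represents the sequence."""
        h = np.zeros(K)                    # assumed initial hidden state h_0
        for e_t in token_embs:
            h = np.tanh(W @ h + V @ e_t)
        return h

    def recursive_encode(node):
        """Eq. (2): e_η = f(W·e_left + V·e_right), computed bottom-up over a binary
        parse tree given as nested tuples of leaf embedding vectors."""
        if isinstance(node, tuple):
            left, right = node
            return np.tanh(W @ recursive_encode(left) + V @ recursive_encode(right))
        return node                        # leaf: a word embedding vector

    # toy usage: three random "word embeddings" composed sequentially and by a tree
    tokens = [rng.normal(size=K) for _ in range(3)]
    h_seq = recurrent_encode(tokens)
    h_tree = recursive_encode(((tokens[0], tokens[1]), tokens[2]))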
Bidirectional Models (Schuster and Paliwal, 1997) add bidirectionality to the recurrent framework, where embeddings for each time step are calculated in both the left-to-right and right-to-left directions.

LSTM Models maintain a memory vector c_t in addition to the hidden vector, controlled by input, forget, and output gates i_t, f_t, and o_t:

    [i_t; f_t; o_t; l_t] = [σ; σ; σ; tanh] (W · [h_{t-1}; e_t])    (5)
    c_t = f_t · c_{t-1} + i_t · l_t    (6)
    h_t^s = o_t · c_t    (7)

where W ∈ R^{4K×2K}. Labels at the phrase/sentence level are predicted using the representation output at the last time step. (An illustrative sketch of this update is given below.)

Tree LSTMs Recent research has extended the LSTM idea to tree-based structures (Zhu et al., 2015; Tai et al., 2015) that associate memory and forget gates with the nodes of the parse trees.

Bi-directional LSTMs These combine bidirectional models and LSTMs.

3 Experiments

In this section, we detail our experimental settings and results. We consider the following tasks, each representative of a different class of NLP tasks.
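As referenced above, the sketch below spells out the LSTM update of Eqs. (5)-(7). It is an illustrative reconstruction, not the authors' code; the random parameters, zero initial states, and toy input sequence are assumptions, and element-wise gating follows the notation of Section 2.1.

    # Illustrative reconstruction of the LSTM update in Eqs. (5)-(7).
    import numpy as np

    K = 4
    rng = np.random.default_rng(2)
    W = rng.normal(size=(4 * K, 2 * K))        # W ∈ R^{4K x 2K}, as in Eq. (5)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def lstm_step(h_prev, c_prev, e_t):
        z = W @ np.concatenate([h_prev, e_t])  # stacked pre-activations
        i_t = sigmoid(z[0:K])                  # input gate
        f_t = sigmoid(z[K:2 * K])              # forget gate
        o_t = sigmoid(z[2 * K:3 * K])          # output gate
        l_t = np.tanh(z[3 * K:4 * K])          # candidate update
        c_t = f_t * c_prev + i_t * l_t         # Eq. (6)
        h_t = o_t * c_t                        # Eq. (7)
        return h_t, c_t

    # encode a toy sequence; the label would be predicted from the last hidden state
    h, c = np.zeros(K), np.zeros(K)
    for e_t in rng.normal(size=(5, K)):        # 5 toy token embeddings
        h, c = lstm_step(h, c, e_t)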