Learning to Embed Sentences Using Attentive Recursive Trees

The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19) Learning to Embed Sentences Using Attentive Recursive Trees Jiaxin Shi,1 Lei Hou,1∗ Juanzi Li,1 Zhiyuan Liu,1 Hanwang Zhang2 1Tsinghua University 2Nanyang Technological University [email protected], fhoulei,lijuanzi,[email protected], [email protected] Abstract interesting Sentence embedding is an effective feature representation for movie me . most deep learning-based NLP tasks. One prevailing line of methods is using recursive latent tree-structured networks to The very to . embed sentences with task-specific structures. However, ex- The movie is to me isting models have no explicit mechanism to emphasize task- very interesting informative words in the tree structure. To this end, we pro- is pose an Attentive Recursive Tree model (AR-Tree), where (a) (b) the words are dynamically located according to their impor- tance in the task. Specifically, we construct the latent tree for a sentence in a proposed important-first strategy, and place Figure 1: Two recursive trees for sentence The movie is very more attentive words nearer to the root; thus, AR-Tree can interesting to me. in the sentiment analysis task. Our AR- inherently emphasize important words during the bottom- Tree (a) is constructed by recursively selecting the most in- up composition of the sentence embedding. We propose an formative word, e.g., interesting. However, other latent trees end-to-end reinforced training strategy for AR-Tree, which is (b) are built by composing adjacent pairs, e.g., very interest- demonstrated to consistently outperform, or be at least com- ing captured by (Choi, Yoo, and goo Lee 2017), which lacks parable to, the state-of-the-art sentence embedding methods the potential to emphasize words. on three sentence understanding tasks. Introduction along its parsing tree. Tree-structured Long Short-Term Memory (Tree-LSTM) (Tai, Socher, and Manning 2015; Along with the success of representation learning (e.g., Zhu, Sobihani, and Guo 2015) is one of the most renowned word2vec (Mikolov et al. 2013)), sentence embedding, variants of TreeRNNs that is shown to be effective in which maps sentences into dense real-valued vectors that learning task-specific sentence embeddings (Bowman et al. represent their semantics, has received much attention. It is 2016). playing a critical role in many applications such as sentiment Tree-LSTM models are motivated by the intuition that in analysis (Socher et al. 2013), question answering (Wang and human languages there are complicated hierarchical struc- Nyberg 2015) and entailment recognition (Bowman et al. tures which contain rich semantics. Latent tree models (Yo- 2015). gatama et al. 2016; Maillard, Clark, and Yogatama 2017; There are three predominant approaches for construct- Choi, Yoo, and goo Lee 2017; Williams, Drozdov, and ing sentence embeddings. (1) Recurrent neural networks Bowman 2017) can learn the optimal hierarchical structure, (RNNs) encode sentences word by word in sequential or- which may vary from tasks to tasks, without explicit struc- der (Dai and Le 2015; Hill, Cho, and Korhonen 2016). ture annotations. The training signals to parse and embed (2) Convolutional neural networks (CNNs) produce sen- sentences are both from certain downstream tasks. Existing tence embeddings in a bottom-up manner, moving from lo- models place all words in leaves equally and build the tree cal n-grams to the global sentence as the receptive fields structure and the sentence embedding by composing adja- enlarge (Blunsom, Grefenstette, and Kalchbrenner 2014; cent node pairs bottom up (e.g., Figure 1b). This mechanism Hu et al. 2014). However, the above two approaches can- prevents the sentence embedding from focusing on the most not well encode linguistic composition of natural languages informative words, resulting in a performance limitation on to some extent. (3) The last approach, on which this paper certain tasks (Shi et al. 2018). focuses, exploits tree-structured recursive neural networks To address this issue, we propose an Attentive Recursive (TreeRNNs) (Socher et al. 2011; 2013) to embed a sentence Tree model (AR-Tree) for sentence embedding, which is ∗Corresponding author. a novel framework that incorporates task-specific attention Copyright c 2019, Association for the Advancement of Artificial mechanism into the latent tree structure learning (dos Santos Intelligence (www.aaai.org). All rights reserved. et al. 2016). AR-Tree represents a sentence as a binary tree 6991 that contains one word in each leaf and non-leaf node, simi- reduce parser, whose training relies on ground-truth pars- lar to the dependency parsing tree (Nivre 2003) but our AR- ing trees. In this paper, we are interested in latent trees that Tree does not depend on manual rules. To utilize the sequen- dynamically parse a sentence without syntax supervision. tial information, we expect the tree’s in-order traversal pre- Combination of latent tree learning with TreeRNNs has been serves the word sequence, so that we can easily recover the shown as an effective approach for sentence embedding as original word sequence and obtain context of a word from it jointly optimizes the sentence compositions and a task- its subtrees. As shown in Figure 1a, the key advantage of an specific objective. For example, (Yogatama et al. 2016) use AR-Tree is that those task-important words will be placed at reinforcement learning to train a shift-reduce parser without those nodes near the root and will be naturally emphasized any ground-truth. in tree-based embedding. This is attributed to our proposed Maillard, Clark, and Yogatama use a CYK chart top-down attention-first parsing strategy, inspired by easy- parser (Cocke 1970; Younger 1967; Kasami 1965) instead first parsing (Goldberg and Elhadad 2010). Specifically, we of the shift-reduce parser and make it fully differentiable introduce a trainable scoring function to measure the word with the help of the softmax annealing technique. How- attention in a sentence with respect to a task. We greedily ever, their model suffers from both time and space issues select the word with the highest score (e.g., interesting) as as the chart parser requires O(n3) time and space complex- the root node and then recursively parse the remaining two ity. (Choi, Yoo, and goo Lee 2017) propose an easy-first subsequences (e.g., The movie is and to me.) to obtain two parsing strategy, which scores each adjacent node pair us- children of the parent node. After the tree construction, we ing a query vector and greedily combines the best pair into embed the sentence using a modified Tree-LSTM unit (Tai, one parent node at each step. They use Straight-Through Socher, and Manning 2015; Zhu, Sobihani, and Guo 2015) Gumbel-Softmax estimator (Jang, Gu, and Poole 2016) to in a bottom-up manner, i.e., the resultant embedding is ob- compute parent embedding in a hard categorical gating way tained at the root node and is then applied in a downstream and enable the end-to-end training. (Williams, Drozdov, and application. As the Tree-LSTM computes node vectors in- Bowman 2017) compare above-mentioned models on sev- crementally from leaf nodes to the root node, our model nat- eral datasets and demonstrate that (Choi, Yoo, and goo Lee urally pays more attention to those shallower words, i.e., 2017) achieve the best performance. task-informative words, meanwhile remaining advantages Attention-Based Sentence Embedding. Attention-based of the recursive semantic composition (Socher et al. 2013; methods can be divided into two categories: inter- Zhu, Sobihani, and Guo 2015). attention (dos Santos et al. 2016; Munkhdalai and Yu Training AR-Tree is challenging due to the non- 2017b), which requires a pair of sentences to attend with differentiability caused by the dynamic decision-making each other, and intra-attention (Arora, Liang, and Ma 2016; procedure. To this end, we develop a novel end-to-end train- Lin et al. 2017), which does not require extra inputs except ing strategy based on REINFORCE algorithm (Williams the sentence; thus the latter is more flexible than the former. 1992). To make REINFORCE work for the structure infer- (Kim et al. 2017) incorporate structural distributions into at- ence, we equip it with a weighted reward which is sensitive tention networks using graphical models instead of recursive to the tree structures and a macro normalization strategy for trees. Note that existing latent tree-based models treat all in- the policy gradients. put words equally as leaf nodes and ignore the fact that dif- We evaluate our model on three benchmarking tasks: tex- ferent words make varying degrees of contributions to the tual entailment, sentiment classification, and author pro- sentence semantics, which is nevertheless the fundamental filing. We show that AR-Tree outperforms previous Tree- motivation of attention mechanism. To our best knowledge, LSTM models and is comparable to other state-of-the-art AR-Tree is the first model that generates attentive tree struc- sentence embedding models. Further qualitative analyses tures and allows the TreeRNNs to focus on more informative demonstrate that AR-Tree learns reasonable task-specific at- words for sentence embeddings. tention structures. To sum up, the contributions of our work are as follows: Attentive Recursive Tree • We propose Attentive Recursive Tree (AR-Tree), a Tree- We represent an input sentence of words as LSTM based sentence embedding model, which can parse S N fx x ··· x g, where x is a -dimensional word em- the latent tree structure dynamically and emphasize infor- 1; 2; ; N i Dx bedding vector. For each sentence, we build an Attentive Re- mative words inherently. cursive Tree (AR-Tree) where the root and nodes are de- • We design a novel REINFORCE algorithm for the train- noted by R and T , respectively. Each node t 2 T contains ing of discrete tree parsing.

Learning to Embed Sentences Using Attentive Recursive Trees

Deep Learning for Natural Language Processing Creating Neural Networks with Python

Text Feature Mining Using Pre-Trained Word Embeddings

A Deep Reinforcement Learning Based Multi-Step Coarse to Fine Question Answering (MSCQA) System

Deep Reinforcement Learning for Chatbots Using Clustered Actions and Human-Likeness Rewards

Abusive and Hate Speech Tweets Detection with Text Generation

Learning Sentence Embeddings Using Recursive Networks

Learning Representations of Social Media Users Arxiv:1812.00436V1

Learning Bilingual Sentence Embeddings Via Autoencoding and Computing Similarities with a Multilayer Perceptron

Dependency-Based Convolutional Neural Networks for Sentence Embedding

Automatic Text Summarization Using Reinforcement Learning with Embedding Features

Dependency-Based Convolutional Neural Networks for Sentence Embedding∗

A Sentence Embedding Method by Dissecting BERT-Based Word Models Bin Wang, Student Member, IEEE, and C.-C