Learning Structured Representation for Text Classification via Reinforcement Learning

Tianyang Zhang*, Minlie Huang*†, Li Zhao‡
*Tsinghua National Laboratory for Information Science and Technology, Dept. of Computer Science and Technology, Tsinghua University, Beijing 100084, PR China
‡Microsoft Research Asia
[email protected]; [email protected]; [email protected]
†Corresponding Author: [email protected] (Minlie Huang)

Abstract

Representation learning is a fundamental problem in natural language processing. This paper studies how to learn a structured representation for text classification. Unlike most existing representation models that either use no structure or rely on pre-specified structures, we propose a reinforcement learning (RL) method to learn sentence representation by discovering optimized structures automatically. We demonstrate two attempts to build structured representation: Information Distilled LSTM (ID-LSTM) and Hierarchically Structured LSTM (HS-LSTM). ID-LSTM selects only important, task-relevant words, and HS-LSTM discovers phrase structures in a sentence. Structure discovery in the two representation models is formulated as a sequential decision problem: the current decision of structure discovery affects following decisions, which can be addressed by policy gradient RL. Results show that our method can learn task-friendly representations by identifying important words or task-relevant structures without explicit structure annotations, and thus yields competitive performance.

Introduction

Representation learning is a fundamental problem in AI, and particularly important for natural language processing (NLP) (Bengio, Courville, and Vincent 2013; Le and Mikolov 2014). As one of the most common tasks of NLP, text classification depends heavily on the learned representation, and is widely applied in sentiment analysis (Socher et al. 2013), question classification (Kim 2014), and language inference (Bowman et al. 2015).

Mainstream representation models for text classification can be roughly classified into four types. Bag-of-words representation models ignore the order of words; they include the deep average network (Iyyer et al. 2015; Joulin et al. 2017) and autoencoders (Liu et al. 2015). Sequence representation models such as convolutional neural networks (Kim 2014; Kalchbrenner, Grefenstette, and Blunsom 2014; Lei, Barzilay, and Jaakkola 2015) and recurrent neural networks (Hochreiter and Schmidhuber 1997; Chung et al. 2014) consider word order but do not use any structure. Structured representation models such as tree-structured LSTMs (Zhu, Sobihani, and Guo 2015; Tai, Socher, and Manning 2015) and recursive autoencoders (Socher et al. 2013; 2011; Qian et al. 2015) use pre-specified parsing trees to build structured representations. Attention-based methods (Yang et al. 2016; Zhou, Wan, and Xiao 2016; Lin et al. 2017) use attention mechanisms to build representations by scoring input words or sentences differentially.

However, in existing structured representation models, the structures are either provided as input or predicted using supervision from explicit treebank annotations. There have been few studies on learning representations with automatically optimized structures. Yogatama et al. (2017) proposed to compose binary tree structures for sentence representation with only supervision from downstream tasks, but such structures are very complex and overly deep, leading to unsatisfactory classification performance. In (Chung, Ahn, and Bengio 2017), a hierarchical representation model was proposed to capture latent structure in sequences with latent variables; there, structure is discovered in a latent, implicit manner.
In this paper, we propose a reinforcement learning (RL) method to build structured sentence representations by identifying task-relevant structures without explicit structure annotations. Structure discovery in this paper is formulated as a sequential decision problem: the current decision (or action) of structure discovery affects following decisions, which can be naturally addressed by the policy gradient method (Sutton et al. 2000). A delayed reward is used to guide the learning of the policy for structure discovery. The reward is computed from the text classifier's prediction based on the structured representation, and the representation is available only when all sequential decisions are completed.

In our RL method, we design two structured representation models: Information Distilled LSTM (ID-LSTM), which selects important, task-relevant words to build the sentence representation, and Hierarchically Structured LSTM (HS-LSTM), which discovers phrase structures and builds the sentence representation with a two-level LSTM. The representation models are integrated seamlessly with a policy network and a classification network. The policy network defines a policy for structure discovery, and the classification network makes predictions on top of the structured sentence representation and facilitates reward computation for the policy network.

To summarize, our contributions are as follows:

• We propose a reinforcement learning method which discovers task-relevant structures to build structured sentence representations for text classification problems. We propose two structured representation models: information distilled LSTM (ID-LSTM) and hierarchically structured LSTM (HS-LSTM).

• Even without explicit structure annotations, our method can identify task-relevant structures effectively. Moreover, the performance is better than or comparable to that of strong baselines that use pre-specified parsing structures.

Methodology

Overview

The goal of this paper is to learn structured representations for text classification by discovering important, task-relevant structures. We argue that text classification can be improved with an optimized, structured representation.

The overall process is shown in Figure 1. The model consists of three components: the Policy Network (PNet), the structured representation models, and the Classification Network (CNet). PNet adopts a stochastic policy and samples an action at each state. It keeps sampling until the end of a sentence and produces an action sequence for the sentence. Then the structured representation models translate the actions into a structured representation. We design two representation models: information distilled LSTM (ID-LSTM) and hierarchically structured LSTM (HS-LSTM). CNet makes the classification based on the structured representation and offers reward computation to PNet. Since the reward can be computed once the final representation is available (it is completely determined by the action sequence), the process can be naturally addressed by the policy gradient method (Sutton et al. 2000).

[Figure 1: Illustration of the overall process. The policy network (PNet) samples an action at each state. The structured representation model offers state representation to PNet and outputs the final sentence representation to the classification network (CNet) when all actions are sampled. CNet performs text classification and provides reward to PNet.]
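To make the interaction among PNet, the representation models, and CNet concrete, the following is a minimal Python sketch of a single episode. It is not the paper's implementation: the state is reduced to the current word embedding, ID-LSTM is replaced by a simple average over the retained word embeddings, CNet is a plain linear-softmax classifier, and all names (run_episode, W_pnet, W_cnet) and dimensions are illustrative assumptions. Only the classification-based part of the reward, the log-probability of the gold label, is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumptions, not from the paper): embedding size and class count.
EMB_DIM, N_CLASSES = 8, 2

# Hypothetical parameters standing in for PNet (Eq. 1) and CNet weights.
W_pnet = rng.normal(scale=0.1, size=EMB_DIM)
b_pnet = 0.0
W_cnet = rng.normal(scale=0.1, size=(N_CLASSES, EMB_DIM))
b_cnet = np.zeros(N_CLASSES)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def run_episode(word_embs, gold_label):
    """One episode: sample Retain/Delete actions for the whole sentence, build a
    (simplified) sentence representation, classify it, and compute the delayed reward."""
    actions = []
    for emb in word_embs:
        # State stand-in: just the current word embedding; the paper derives the
        # state from the representation model's hidden states instead.
        p_retain = sigmoid(W_pnet @ emb + b_pnet)
        actions.append(int(rng.random() < p_retain))  # 1 = Retain, 0 = Delete

    kept = [e for e, a in zip(word_embs, actions) if a == 1]
    # Stand-in for ID-LSTM: average the retained word embeddings.
    sent_rep = np.mean(kept, axis=0) if kept else np.zeros(EMB_DIM)

    p_y = softmax(W_cnet @ sent_rep + b_cnet)   # CNet: P(y|X)
    reward = np.log(p_y[gold_label] + 1e-12)    # classification-based reward
    return actions, p_y, reward

# Usage: a five-word "sentence" with random embeddings and gold class 1.
sentence = rng.normal(size=(5, EMB_DIM))
actions, p_y, reward = run_episode(sentence, gold_label=1)
print(actions, p_y.round(3), round(reward, 3))
```

In the full model, the states come from ID-LSTM or HS-LSTM rather than raw embeddings, and the delayed reward drives policy gradient updates of PNet.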
Obviously, the three components are interleaved. The state representation of PNet is derived from the representation models, CNet relies on the final structured representation obtained from the representation model to make its prediction, and PNet obtains rewards from CNet's prediction to guide the learning of a policy. The reward can only be computed after the final sentence representation is obtained from the representation models. In order to obtain the delayed reward, which is based on CNet's prediction, we perform action sampling for the entire sentence. Once all the actions are decided, the representation models obtain a structured representation of the sentence, and it is used by CNet to compute P(y|X). The reward computed with P(y|X) is used for policy learning.

We briefly introduce the state, action and policy, reward, and objective function as follows:

State: The state encodes the current input and previous contexts, and has different definitions in the two representation models. The detailed definition of state s_t will be introduced in the following sections.

Action and Policy: We adopt binary actions in the two settings, but with different meanings. In ID-LSTM, the action space is {Retain, Delete}: a word can be deleted from or retained in the final sentence representation. In HS-LSTM, the action space is {Inside, End}, indicating that a word is inside or at the end of a phrase. Clearly, each action is a direct indicator of structure selection in both representation models.

We adopt a stochastic policy. Let a_t denote the action at state t; the policy is defined as follows:

π(a_t | s_t; Θ) = σ(W ∗ s_t + b),    (1)

where π(a_t | s_t; Θ) denotes the probability of choosing a_t, σ denotes the sigmoid function, and Θ = {W, b} denotes the parameters of PNet.

During training, the action is sampled according to the probability in Eq. 1. During test, the action with the maximal probability (i.e., a*_t = argmax_a π(a | s_t; Θ)) is chosen in order to obtain superior prediction.

Reward: Once all the actions are sampled by the policy network, the structured representation of a sentence is determined by our representation models, and the representation will be passed to CNet to obtain P(y|X), where y is the class label. The reward will be calculated from the predicted distribution P(y|X), and also has a factor considering the

Policy Network
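As a small illustration of the action selection rule in Eq. 1, the sketch below, under the same illustrative assumptions as the sketch above, computes the two-way action distribution with a sigmoid over W ∗ s_t + b, samples an action during training, and takes the argmax at test time. The state vector, dimensions, and function names are assumptions rather than the paper's code.

```python
import numpy as np

rng = np.random.default_rng(1)

def policy(state, W, b):
    """Eq. 1: pi(a_t|s_t; Theta) = sigmoid(W * s_t + b), a distribution over the two actions."""
    p1 = 1.0 / (1.0 + np.exp(-(W @ state + b)))   # P(a_t = 1 | s_t)
    return np.array([1.0 - p1, p1])               # e.g. {Delete, Retain} or {Inside, End}

def choose_action(state, W, b, training=True):
    probs = policy(state, W, b)
    if training:
        # Training: sample the action from pi(.|s_t; Theta).
        return int(rng.choice(2, p=probs))
    # Test: take the most probable action, a*_t = argmax_a pi(a|s_t; Theta).
    return int(np.argmax(probs))

# Usage with a toy 16-dimensional state (size is an assumption).
state = rng.normal(size=16)
W, b = rng.normal(scale=0.1, size=16), 0.0
print(choose_action(state, W, b, training=True), choose_action(state, W, b, training=False))
```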
