EMNLP 2017

The Conference on Empirical Methods in Natural Language Processing

Proceedings of the 2nd Workshop on Structured Prediction for Natural Language Processing

September 9-11, 2017, Copenhagen, Denmark. © 2017 The Association for Computational Linguistics

Order copies of this and other ACL proceedings from:

Association for Computational Linguistics (ACL)
209 N. Eighth Street
Stroudsburg, PA 18360
USA
Tel: +1-570-476-8006
Fax: +1-570-476-0860
[email protected]

ISBN 978-1-945626-93-7

Introduction

Welcome to the Second Workshop on Structured Prediction for NLP!

Many prediction tasks in NLP involve assigning values to mutually dependent variables. For example, when designing a model to automatically perform linguistic analysis of a sentence or a document (e.g., parsing, semantic role labeling, or discourse analysis), it is crucial to model the correlations between labels. Many other NLP tasks, such as machine translation, textual entailment, and information extraction, can be also modeled as structured prediction problems.

In order to tackle such problems, various structured prediction approaches have been proposed, and their effectiveness has been demonstrated. Studying structured prediction is interesting from both NLP and machine learning (ML) perspectives. From the NLP perspective, syntax and semantics of natural language are clearly structured, and advances in this area will enable researchers to understand the linguistic structure of data. From the ML perspective, the large amount of available text data and the complexity of linguistic structures bring challenges to the learning community. Designing expressive yet tractable models and studying efficient learning and inference algorithms become important issues.

Recently, there has been significant interest in non-standard structured prediction approaches that take advantage of non-linearity, latent components, and/or approximate inference in both the NLP and ML communities. Researchers have also been discussing the intersection between deep learning and structured prediction through the DeepStructure reading group. This workshop intends to bring together NLP and ML researchers working on diverse aspects of structured prediction and expose the participants to recent progress in this area.

This year we have eight papers covering various aspects of structured prediction, including neural networks, deep structured prediction, and imitation learning. We also invited four fantastic speakers. We hope you all enjoy the program!

Finally, we would like to thank all program committee members, speakers, and authors. We are looking forward to seeing you in Copenhagen.


Organizers:
Kai-Wei Chang, UCLA
Ming-Wei Chang, Microsoft Research
Alexander Rush, Harvard University
Vivek Srikumar, University of Utah

Program Committee:
Amir Globerson, Tel Aviv University (Israel)
Alexander Schwing (UIUC)
Ivan Titov (University of Amsterdam)
Janardhan Rao Doppa (WSU)
Jason Eisner (JHU)
Karl Stratos (TTIC)
Kevin Gimpel (TTIC)
Luke Zettlemoyer (UW)
Matt Gormley (CMU)
Mohit Bansal (UNC)
Mo Yu (IBM)
Noah Smith (UW)
Parisa Kordjamshidi (Tulane University)
Raquel Urtasun (University of Toronto)
Roi Reichart (Technion)
Ofer Meshi (Google)
Scott Yih (Microsoft)
Shay Cohen (University of Edinburgh)
Shuohang Wang (Singapore Management University)
Waleed Ammar (AI2)
Yoav Artzi (Cornell)


Table of Contents

Dependency Parsing with Dilated Iterated Graph CNNs Emma Strubell and Andrew McCallum ...... 1

Entity Identification as Multitasking Karl Stratos ...... 7

Towards Neural Machine Translation with Latent Tree Attention James Bradbury and Richard Socher ...... 12

Structured Prediction via Learning to Search under Bandit Feedback Amr Sharaf and Hal Daumé III ...... 17

Syntax Aware LSTM model for Semantic Role Labeling Feng Qian, Lei Sha, Baobao Chang, LuChen Liu and Ming Zhang ...... 27

Spatial Language Understanding with Multimodal Graphs using Declarative Learning based Programming Parisa Kordjamshidi, Taher Rahgooy and Umar Manzoor ...... 33

Boosting Information Extraction Systems with Character-level Neural Networks and Free Noisy Supervision Philipp Meerkamp and Zhengyi Zhou ...... 44

Piecewise Latent Variables for Neural Variational Text Processing Iulian Vlad Serban, Alexander Ororbia II, Joelle Pineau and Aaron Courville ...... 52


Conference Program

Thursday, September 7, 2017

9:00–10:30 Section 1

9:00–9:15 Welcome Organizers

9:15–10:00 Invited Talk

10:00–10:30 Dependency Parsing with Dilated Iterated Graph CNNs Emma Strubell and Andrew McCallum

10:30–11:00 Coffee Break

11:00–12:15 Section 2

11:00–11:45 Invited Talk

11:45–12:15 Poster Madness

12:15–14:00 Lunch

Thursday, September 7, 2017 (continued)

14:00–15:30 Section 3

14:00–14:45 Poster Session

Entity Identification as Multitasking Karl Stratos

Towards Neural Machine Translation with Latent Tree Attention James Bradbury and Richard Socher

Structured Prediction via Learning to Search under Bandit Feedback Amr Sharaf and Hal Daumé III

Syntax Aware LSTM model for Semantic Role Labeling Feng Qian, Lei Sha, Baobao Chang, LuChen Liu and Ming Zhang

Spatial Language Understanding with Multimodal Graphs using Declarative Learning based Programming Parisa Kordjamshidi, Taher Rahgooy and Umar Manzoor

Boosting Information Extraction Systems with Character-level Neural Networks and Free Noisy Supervision Philipp Meerkamp and Zhengyi Zhou

14:45–15:30 Invited Talk

15:30–16:00 Coffee Break

Thursday, September 7, 2017 (continued)

16:00–17:30 Section 4

16:00–16:45 Invited Talk

16:45–17:15 Piecewise Latent Variables for Neural Variational Text Processing Iulian Vlad Serban, Alexander Ororbia II, Joelle Pineau and Aaron Courville

17:15–17:30 Closing


Dependency Parsing with Dilated Iterated Graph CNNs

Emma Strubell and Andrew McCallum
College of Information and Computer Sciences
University of Massachusetts Amherst
{strubell, mccallum}@cs.umass.edu

Abstract

Dependency parses are an effective way to inject linguistic knowledge into many downstream tasks, and many practitioners wish to efficiently parse sentences at scale. While recent advances in GPU hardware have enabled neural networks to achieve significant gains over the previous best models, these models still fail to leverage GPUs' capability for massive parallelism due to their requirement of sequential processing of the sentence. In response, we propose Dilated Iterated Graph Convolutional Neural Networks (DIG-CNNs) for graph-based dependency parsing, a graph convolutional architecture that allows for efficient end-to-end GPU parsing. In experiments on the English Penn TreeBank benchmark, we show that DIG-CNNs perform on par with some of the best neural network parsers.

[Figure 1: Receptive field for predicting the head-dependent relationship between likes and eating in the sentence "My dog also likes eating sausage." Darker cells indicate that more layers include that cell's representation. Heads and labels corresponding to the gold tree are indicated.]

1 Introduction

By vastly accelerating and parallelizing the core numeric operations for performing inference and computing gradients in neural networks, recent developments in GPU hardware have facilitated the emergence of deep neural networks as state-of-the-art models for many NLP tasks, such as syntactic dependency parsing. The best neural dependency parsers generally consist of two stages: First, they employ a sequence model such as a bidirectional LSTM to encode each token in context; next, they compose these token representations into a parse tree. Transition-based dependency parsers (Nivre, 2009; Chen and Manning, 2014; Andor et al., 2016) produce a well-formed tree by predicting and executing a series of shift-reduce actions, whereas graph-based parsers (McDonald et al., 2005; Kiperwasser and Goldberg, 2016; Dozat and Manning, 2017) generally employ attention to produce marginals over each possible edge in the graph, followed by a dynamic programming algorithm to find the most likely tree given those marginals.

Because of their dependency on sequential processing of the sentence, none of these architectures fully exploits the massive parallel processing capability that GPUs possess. If we wish to maximize GPU resources, graph-based dependency parsers are more desirable than their transition-based counterparts, since attention over the edge-factored graph can be parallelized across the entire sentence, unlike the transition-based parser, which must sequentially predict and perform each transition.

By encoding token-level representations with an Iterated Dilated CNN (ID-CNN) (Strubell et al., 2017), we can also remove the sequential dependencies of the RNN layers. Unlike Strubell et al. (2017), who use 1-dimensional convolutions over the sentence to produce token representations, our network employs 2-dimensional convolutions over the adjacency matrix of the sentence's parse tree, modeling attention from the bottom up. By training with an objective that encourages our model to predict trees using only simple matrix operations, we additionally remove the computational cost of dynamic programming inference. Combining all of these ideas, we present Dilated Iterated Graph CNNs (DIG-CNNs): a combined convolutional neural network architecture and training objective for efficient, end-to-end GPU graph-based dependency parsing.

We demonstrate the efficacy of these models in experiments on the English Penn TreeBank, in which our models perform similarly to the state of the art.

2 Dilated Convolutions

Though common in other areas such as computer vision, 2-dimensional convolutions are rarely used in NLP, since it is usually unclear how to process text as a 2-dimensional grid. However, 2-dimensional convolutional layers are a natural model for embedding the adjacency matrix of a sentence's parse.

A 2-dimensional convolutional neural network layer transforms each input element, in our case an edge in the dependency graph, as a linear function of the width r_w and height r_h window of surrounding input elements (other possible edges in the dependency graph). In this work we assume square convolutional windows: r_h = r_w.

Dilated convolutions perform the same operation, except rather than transforming directly adjacent inputs, the convolution is defined over a wider input window by skipping over δ inputs at a time, where δ is the dilation width. A dilated convolution of width 1 is equivalent to a simple convolution. Using the same number of parameters as a simple convolution with the same radius, the δ > 1 dilated convolution incorporates broader context into the representation of a token than a simple convolution.

2.1 Iterated Dilated CNNs

Stacking many dilated CNN layers can easily incorporate information from a whole sentence. For example, with a radius of 1 and 4 layers of dilated convolutions, the effective input window size for each token is width 31, which exceeds the average sentence length (23) in the Penn TreeBank corpus. However, simply increasing the depth of the CNN can cause considerable over-fitting when data is sparse relative to the growth in model parameters. To address this, we employ Iterated Dilated CNNs (ID-CNNs) (Strubell et al., 2017), which instead apply the same small stack of dilated convolutions repeatedly, each time taking the result of the last stack as input to the current iteration. Applying the parameters recurrently in this way increases the size of the window of context incorporated into each token representation while allowing the model to generalize well. Their training objective additionally computes a loss for the output of each application, encouraging parameters that allow subsequent stacks to resolve dependency violations from their predecessors.

3 Dilated Iterated Graph CNNs

We describe how to extend ID-CNNs (Strubell et al., 2017) to 2-dimensional convolutions over the adjacency matrix of a sentence's parse tree, allowing us to model the parse tree through the whole network, incorporating evidence about nearby head-dependent relationships in every layer of the network, rather than modeling at the token level followed by a single layer of attention to produce head-dependent compatibilities between tokens. ID-CNNs allow us to efficiently incorporate evidence from the entire tree without sacrificing generalizability.

3.1 Model architecture

Let x = [x_1, ..., x_T] be our input text and let y = [y_1, ..., y_T] be labels with domain size D for the edge between each token x_i and its head x_j. We predict the most likely y given a conditional model P(y | x) in which the tags are conditionally independent given some features for x:

    P(y | x) = prod_{t=1}^{T} P(y_t | F(x))        (1)

¹In practice, we include a dummy root token at the beginning of the sentence which serves as the head of the root. We do not predict a head for this dummy token.
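As a concrete illustration of 2-dimensional dilated convolutions over pairwise edge features, the following PyTorch sketch builds the T × T grid of concatenated token-pair vectors and passes it through two dilated convolutional layers. This is an illustrative reconstruction under stated assumptions, not the authors' released code; the hidden sizes and layer count are arbitrary placeholders.

```python
# Sketch: 2-D dilated convolutions over a T x T grid of token-pair features.
# Not the authors' code; dimensions are placeholders for illustration.
import torch
import torch.nn as nn

T, d = 10, 64                          # sentence length and token dimension
x = torch.randn(1, T, d)               # token representations (e.g., embeddings)

# Build the edge grid: cell (i, j) holds the concatenation [x_i; x_j].
xi = x.unsqueeze(2).expand(1, T, T, d)
xj = x.unsqueeze(1).expand(1, T, T, d)
edges = torch.cat([xi, xj], dim=-1).permute(0, 3, 1, 2)   # (1, 2d, T, T)

# A dilation-1 layer followed by a dilation-2 layer; padding keeps the T x T
# resolution, while dilation widens each cell's receptive field over nearby edges.
conv1 = nn.Conv2d(2 * d, 2 * d, kernel_size=3, padding=1, dilation=1)
conv2 = nn.Conv2d(2 * d, 2 * d, kernel_size=3, padding=2, dilation=2)

h = torch.relu(conv2(torch.relu(conv1(edges))))
print(h.shape)                         # torch.Size([1, 128, 10, 10])
```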

The local conditional distributions of Eqn. (1) come from a straightforward extension of ID-CNNs (Strubell et al., 2017) to 2-dimensional convolutions. This network takes as input a sequence of T vectors x_t and outputs a T × T matrix of per-class scores h_ij for each pair of tokens in the sentence.

We denote the k-th dilated convolutional layer of dilation width δ as D_δ^(k). The first layer in the network transforms the input to a graph by concatenating all pairs of vectors in the sequence x_i, x_j and applying a 2-dimensional dilation-1 convolution D_1^(0) to form an initial edge representation c_ij^(0) for each token pair:

    c_ij^(0) = D_1^(0) [x_i; x_j]        (2)

We denote vector concatenation with [· ; ·]. Next, L_c layers of dilated convolutions of exponentially increasing dilation width are applied to c_ij^(0), folding increasingly broader context into the embedded representation of the edge e_ij at each layer. Let r(·) denote the ReLU activation function (Glorot et al., 2011). Beginning with the input representation c_ij^(0), we define the stack of layers with the following recurrence:

    c_ij^(k) = r( D_{2^{k-1}}^(k) c_ij^(k-1) )        (3)

and add a final dilation-1 layer to the stack:

    c_ij^(L_c+1) = r( D_1^(L_c) c_ij^(L_c) )        (4)

We refer to this stack of dilated convolutions as a block B(·), which has output resolution equal to its input resolution. To incorporate even broader context without over-fitting, we avoid making B deeper and instead iteratively apply B L_b times, introducing no extra parameters. Starting with b_ij^(1) = B(c_ij^(0)), we define the output of block m:

    b_ij^(m) = B( b_ij^(m-1) )        (5)

We apply a simple affine transformation W_o to this final representation to obtain label scores for each edge e_ij:

    h_ij^(L_b) = W_o b_ij^(L_b)        (6)

We can obtain the most likely head (and its label) for each dependent by computing the argmax over all labels for all heads for each dependent:

    h_t = argmax_j h_ij^(L_b)        (7)

3.2 Training

Our main focus is to apply the DIG-CNN as feature extraction for the conditional model described in Sec. 3.1, where tags are conditionally independent given deep features, since this enables prediction that is parallelizable across all possible edges. Here, maximum likelihood training is straightforward because the likelihood decouples into the sum of the likelihoods of independent logistic regression problems for every edge, with natural parameters given by Eqn. (6):

    - (1/T) sum_{t=1}^{T} log P(y_t | h_t)        (8)

We could also use the DIG-CNN as input features for an MST parser, where the partition function and its gradient are computed using Kirchhoff's Matrix-Tree Theorem (Tutte, 1984), but our aim is to approximate inference in a tree-structured model using greedy inference and expressive features over the input, in order to perform inference as efficiently as possible on a GPU.

To help bridge the gap between these two techniques, we use the training technique described in Strubell et al. (2017). The tree-structured graphical model has preferable sample complexity and accuracy, since prediction directly reasons in the space of structured outputs. Instead, we compile some of this reasoning in output space into DIG-CNN feature extraction. Rather than explicitly reasoning over output labels during inference, we train the network such that each block is predictive of output labels. Subsequent blocks learn to correct dependency violations of their predecessors, refining the final sequence prediction.

To do so, we first define predictions of the model after each of the L_b applications of the block. Let h_t^(m) be the result of applying the matrix W_o from (6) to b_t^(m), the output of block m. We minimize the average of the losses for each application of the block:

    - (1/L_b) sum_{m=1}^{L_b} (1/T) sum_{t=1}^{T} log P(y_t | h_t^(m))        (9)

By rewarding accurate predictions after each application of the block, we learn a model where later blocks are used to refine initial predictions. The loss also helps reduce the vanishing gradient problem (Hochreiter, 1998) for deep architectures.
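The iterated-block objective of Eqns. (5)–(9) and the greedy decoding of Eqn. (7) can be sketched as follows. This is a hedged, self-contained PyTorch illustration rather than the paper's implementation: the block is a small shared stack of 2-D dilated convolutions, heads and labels are scored jointly by a 1×1 convolution standing in for W_o, and all sizes are placeholders.

```python
# Sketch of iterated blocks with a per-block loss (Eqn. 9) and greedy decoding (Eqn. 7).
# Illustrative only; not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class EdgeBlock(nn.Module):
    """Shared block B(.): a few dilated 2-D convolutions over the edge grid."""
    def __init__(self, dim):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv2d(dim, dim, 3, padding=2 ** k, dilation=2 ** k) for k in range(3)]
            + [nn.Conv2d(dim, dim, 3, padding=1)]          # final dilation-1 layer
        )

    def forward(self, c):
        for conv in self.convs:
            c = torch.relu(conv(c))
        return c


class DIGCNNSketch(nn.Module):
    def __init__(self, dim=128, n_labels=45, n_blocks=3):
        super().__init__()
        self.block = EdgeBlock(dim)                        # applied n_blocks times, shared weights
        self.out = nn.Conv2d(dim, n_labels, 1)             # plays the role of W_o
        self.n_blocks = n_blocks

    def forward(self, edges):                              # edges: (1, dim, T, T)
        scores, c = [], edges
        for _ in range(self.n_blocks):
            c = self.block(c)
            scores.append(self.out(c))                     # (1, n_labels, T, T) per application
        return scores

    @staticmethod
    def loss(scores, heads, labels):
        # heads, labels: (T,) gold head index and edge label for each dependent token.
        total = 0.0
        for h in scores:
            _, L, T, _ = h.shape
            flat = h.permute(0, 2, 3, 1).reshape(T, T * L)  # row i: scores of (head j, label l)
            target = heads * L + labels                     # joint (head, label) index
            total = total + F.cross_entropy(flat, target)
        return total / len(scores)                          # average over block applications

    @staticmethod
    def decode(final_scores):
        _, L, T, _ = final_scores.shape
        flat = final_scores.permute(0, 2, 3, 1).reshape(T, T * L)
        best = flat.argmax(dim=-1)
        return torch.div(best, L, rounding_mode="floor"), best % L   # (heads, labels)
```

Averaging a cross-entropy term after every application of the same block is what lets later applications act as refinements of earlier predictions while adding no parameters.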

We apply dropout (Srivastava et al., 2014) to the raw inputs x_ij and to each block's output b_t^(m) to help prevent overfitting.

4 Related work

Currently, the most accurate parser in terms of labeled and unlabeled attachment scores is the neural network graph-based dependency parser of Dozat and Manning (2017). Their parser builds token representations with a bidirectional LSTM over word embeddings, followed by head and dependent MLPs. Compatibility between heads and dependents is then scored using a biaffine model, and the highest scoring head for each dependent is selected.

Previously, Chen and Manning (2014) pioneered neural network parsing with a transition-based dependency parser which used features from a fast feed-forward neural network over word, token and label embeddings. Many improved upon this work by increasing the size of the network and using a structured training objective (Weiss et al., 2015; Andor et al., 2016). Kiperwasser and Goldberg (2016) were the first to present a graph-based neural network parser, employing an MLP with bidirectional LSTM inputs to score arcs and labels. Cheng et al. (2016) propose a similar network, except with additional forward and backward encoders to allow for conditioning on previous predictions. Kuncoro et al. (2016) take a different approach, distilling a consensus of many LSTM-based transition-based parsers into one graph-based parser. Ma and Hovy (2017) employ a similar model, but add a CNN over characters as an additional word representation and perform structured training using the Matrix-Tree Theorem. Hashimoto et al. (2017) train a large network which performs many NLP tasks including part-of-speech tagging, chunking, graph-based parsing, and entailment, observing benefits from multitasking with these tasks.

Despite their success in the area of computer vision, in NLP convolutional neural networks have mainly been relegated to tasks such as sentence classification, where each input sequence is mapped to a single label rather than a label for each token (Kim, 2014; Kalchbrenner et al., 2014; Zhang et al., 2015; Toutanova et al., 2015). As described above, CNNs have also been used to encode token representations from embeddings of their characters, which similarly perform a pooling operation over characters. Lei et al. (2015) present a CNN variant where convolutions adaptively skip neighboring words. While the flexibility of this model is powerful, its adaptive behavior is not well-suited to GPU acceleration.

More recently, inspired by the success of deep dilated CNNs for image segmentation in computer vision (Yu and Koltun, 2016; Chen et al., 2015), convolutional neural networks have been employed as fast models for tagging, speech generation and machine translation. van den Oord et al. (2016) use dilated CNNs to efficiently generate speech, and Kalchbrenner et al. (2016) describe an encoder-decoder model for machine translation which uses dilated CNNs over bytes in both the encoder and decoder. Strubell et al. (2017) first described the one-dimensional ID-CNN architecture which is the basis for this work, demonstrating its success as a fast and accurate NER tagger. Gehring et al. (2017) report state-of-the-art results and much faster training from using many CNN layers with gated activations as encoders and decoders for a sequence-to-sequence model. While our architecture is similar to the encoder architecture of these models, ours is differentiated by (1) being tailored to smaller-data regimes such as parsing via our iterated architecture and loss, and (2) employing two-dimensional convolutions to model the adjacency matrix of the parse tree. We are the first to our knowledge to use dilated convolutions for parsing, or to use two-dimensional dilated convolutions for NLP.

5 Experimental Results

5.1 Data and Evaluation

We train our parser on the English Penn TreeBank with the typical data split: training on sections 2–21, testing on section 23 and using section 22 for development. We convert constituency trees to dependencies using the Stanford dependency framework v3.5 (de Marneffe and Manning, 2008), and use part-of-speech tags from the Stanford left3words part-of-speech tagger. As is the norm for this dataset, our evaluation excludes punctuation. Hyperparameters that resulted in the best performance on the validation set were selected via grid search. A more detailed description of optimization and data pre-processing can be found in the Appendix.

Model                              UAS    LAS
Kiperwasser and Goldberg (2016)    93.9   91.9
Cheng et al. (2016)                94.10  91.49
Kuncoro et al. (2016)              94.3   92.1
Hashimoto et al. (2017)            94.67  92.90
Ma and Hovy (2017)                 94.9   93.0
Dozat and Manning (2017)           95.74  94.08
DIG-CNN                            93.70  91.72
DIG-CNN + Eisner                   94.03  92.00

Table 1: Labeled and unlabeled attachment scores of our model compared to state-of-the-art graph-based parsers.

5.2 English PTB Results

We compare our model's labeled and unlabeled attachment scores to the neural network graph-based dependency parsers described in Sec. 4. Without enforcing trees at test time, our model performs just under the LSTM-based parser of Kiperwasser and Goldberg (2016), and a few points lower than the state of the art. When we post-process our model's outputs into trees, like all the other models in our table, our results increase to perform slightly above Kiperwasser and Goldberg (2016). We believe our model's relatively poor performance compared to existing models is due to its limited incorporation of context from the entire sentence. While each bidirectional LSTM token representation observes all tokens in the sentence, our reported model observes a relatively small window, only 9 tokens. We hypothesize that this window is not sufficient for producing accurate parses. Still, we believe this is a promising architecture for graph-based parsing, and with further experimentation it could meet or exceed the state of the art while running faster by better leveraging GPU architecture.

6 Conclusion

We present DIG-CNNs, a fast, end-to-end convolutional architecture for graph-based dependency parsing. Future work will experiment with deeper CNN architectures which incorporate broader sentence context in order to increase accuracy without sacrificing speed.

Acknowledgments

We thank Patrick Verga and David Belanger for helpful discussions. This work was supported in part by the Center for Intelligent Information Retrieval, in part by DARPA under agreement number FA8750-13-2-0020, in part by Defense Advanced Research Projects Agency (DARPA) contract number HR0011-15-2-0036, in part by the National Science Foundation (NSF) grant number DMR-1534431, and in part by the National Science Foundation (NSF) grant number IIS-1514053. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor.

References

Daniel Andor, Chris Alberti, David Weiss, Aliaksei Severyn, Alessandro Presta, Kuzman Ganchev, Slav Petrov, and Michael Collins. 2016. Globally normalized transition-based neural networks. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.

Danqi Chen and Christopher D. Manning. 2014. A fast and accurate dependency parser using neural networks. In EMNLP.

Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. 2015. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In ICLR.

Hao Cheng, Hao Fang, Xiaodong He, Jianfeng Gao, and Li Deng. 2016. Bi-directional attention with agreement for dependency parsing. In EMNLP.

Marie-Catherine de Marneffe and Christopher D. Manning. 2008. The Stanford typed dependencies representation. In COLING 2008 Workshop on Cross-framework and Cross-domain Parser Evaluation.

Timothy Dozat and Christopher D. Manning. 2017. Deep biaffine attention for neural dependency parsing. In ICLR.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122.

Xavier Glorot, Antoine Bordes, and Yoshua Bengio. 2011. Deep sparse rectifier neural networks. In AISTATS.

Kazuma Hashimoto, Caiming Xiong, Yoshimasa Tsuruoka, and Richard Socher. 2017. A joint many-task model: Growing a neural network for multiple NLP tasks. arXiv preprint arXiv:1611.01587.

Sepp Hochreiter. 1998. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 6(02):107–116.

Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aaron van den Oord, Alex Graves, and Koray Kavukcuoglu. 2016. Neural machine translation in linear time. arXiv preprint arXiv:1610.10099.

Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In EMNLP.

Eliyahu Kiperwasser and Yoav Goldberg. 2016. Simple and accurate dependency parsing using bidirectional LSTM feature representations. Transactions of the Association for Computational Linguistics 4:313–327.

Adhiguna Kuncoro, Miguel Ballesteros, Lingpeng Kong, Chris Dyer, and Noah A. Smith. 2016. Distilling an ensemble of greedy dependency parsers into one MST parser. In EMNLP.

Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2015. Molding CNNs for text: non-linear, non-consecutive convolutions. In Empirical Methods in Natural Language Processing.

Xuezhe Ma and Eduard Hovy. 2017. Neural probabilistic model for non-projective MST parsing. arXiv preprint arXiv:1701.00874.

Ryan McDonald, Fernando Pereira, Kiril Ribarov, and Jan Hajic. 2005. Non-projective dependency parsing using spanning tree algorithms. In Proc. Human Language Technology Conf. and Conf. Empirical Methods Natural Language Process. (HLT/EMNLP), pages 523–530.

Joakim Nivre. 2009. Non-projective dependency parsing in expected linear time. In Proceedings of the 47th Annual Meeting of the ACL and the 4th IJCNLP of the AFNLP.

Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1):1929–1958.

Emma Strubell, Patrick Verga, David Belanger, and Andrew McCallum. 2017. Fast and accurate sequence labeling with iterated dilated convolutions. arXiv preprint arXiv:1702.02098.

Kristina Toutanova, Danqi Chen, Patrick Pantel, Hoifung Poon, Pallavi Choudhury, and Michael Gamon. 2015. Representing text for joint embedding of text and knowledge bases. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pages 1499–1509.

William Thomas Tutte. 1984. Graph theory, volume 11. Addison-Wesley, Menlo Park.

Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. 2016. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.

David Weiss, Chris Alberti, Michael Collins, and Slav Petrov. 2015. Structured training for neural network transition-based parsing. In Annual Meeting of the Association for Computational Linguistics.

Fisher Yu and Vladlen Koltun. 2016. Multi-scale context aggregation by dilated convolutions. In International Conference on Learning Representations (ICLR).

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems 28 (NIPS).

Entity Identification as Multitasking∗

Karl Stratos Toyota Technological Institute at Chicago [email protected]

Abstract

Standard approaches in entity identification hard-code boundary detection and type prediction into labels and perform Viterbi. This has two disadvantages: 1. the runtime complexity grows quadratically in the number of types, and 2. there is no natural segment-level representation. In this paper, we propose a neural architecture that addresses these disadvantages. We frame the problem as multitasking, separating boundary detection and type prediction but optimizing them jointly. Despite its simplicity, this architecture performs competitively with fully structured models such as BiLSTM-CRFs while scaling linearly in the number of types. Furthermore, by construction, the model induces type-disambiguating embeddings of predicted mentions.

1 Introduction

A popular convention in segmentation tasks such as named-entity recognition (NER) and chunking is the so-called "BIO"-label scheme. It hard-codes boundary detection and type prediction into labels using the indicators "B" (Beginning), "I" (Inside), and "O" (Outside). For instance, the sentence Where is John Smith is tagged as Where/O is/O John/B-PER Smith/I-PER. In this way, we can treat the problem as sequence labeling and apply standard structured models such as CRFs.

But this approach has certain disadvantages. First, the runtime complexity grows quadratically in the number of types (assuming exact decoding with first-order label dependency). We emphasize that the asymptotic runtime remains quadratic even if we heuristically prune previous labels based on the BIO scheme. This is not an issue when the number of types is small but quickly becomes problematic as the number grows. Second, there is no segment-level prediction: every prediction happens at the word level. As a consequence, models do not induce representations corresponding to multi-word mentions, which can be useful for downstream tasks such as named-entity disambiguation (NED).

In this paper, we propose a neural architecture that addresses these disadvantages. Given a sentence, the model uses bidirectional LSTMs (BiLSTMs) to induce features and separately predicts:

1. Boundaries of mentions in the sentence.
2. Entity types of the boundaries.

Crucially, during training, the errors of these two predictions are minimized jointly.

One might suspect that the separation could degrade performance; neither prediction accounts for the correlation between entity types. But we find that this is not the case due to joint optimization. In fact, our model performs competitively with fully structured models such as BiLSTM-CRFs (Lample et al., 2016), implying that the model is able to capture the entity correlation indirectly by multitasking. On the other hand, the model scales linearly in the number of types and induces segment-level embeddings of predicted mentions that are type-disambiguating by construction.

∗Part of the work was done while the author was at Bloomberg L.P.



2 Related Work

Our work is directly inspired by Lample et al. (2016) who demonstrate that a simple neural architecture based on BiLSTMs achieves state-of-the-art performance on NER with no external features. They propose two models. The first makes structured prediction of NER labels with a CRF loss (LSTM-CRF) using the conventional BIO-label scheme. The second, which performs slightly worse, uses a shift-reduce framework mirroring transition-based dependency parsing (Yamada and Matsumoto, 2003). While the latter also scales linearly in the number of types and produces embeddings of predicted mentions, our approach is quite different. We frame the problem as multitasking and do not need the stack/buffer data structure. Semi-Markov models (Kong et al., 2015; Sarawagi et al., 2004) explicitly incorporate the segment structure but are computationally intensive (quadratic in the sentence length).

Multitasking has been shown to be effective in numerous previous works (Collobert et al., 2011; Yang et al., 2016; Kiperwasser and Goldberg, 2016). This is especially true with neural networks, which greatly simplify joint optimization across multiple objectives. Most of these works consider multitasking across different problems. In contrast, we decompose a single problem (NER) into two natural subtasks and perform them jointly. Particularly relevant in this regard is the parsing model of Kiperwasser and Goldberg (2016), which multitasks edge prediction and classification.

LSTMs (Hochreiter and Schmidhuber, 1997), and other variants of recurrent neural networks such as GRUs (Chung et al., 2014), have recently been wildly successful in various NLP tasks (Lample et al., 2016; Kiperwasser and Goldberg, 2016; Chung et al., 2014). Since there are many detailed descriptions of LSTMs available, we omit a precise definition. For our purposes, it is sufficient to treat an LSTM as a mapping φ : R^d × R^d' → R^d' that takes an input vector x and a state vector h to output a new state vector h' = φ(x, h).

3 Model

Let C denote the set of character types, W the set of word types, and E the set of entity types. Let ⊕ denote the vector concatenation operation. Our model first constructs a network over a sentence closely following Lample et al. (2016); we describe it here for completeness. The model parameters Θ associated with this base network are:

- Character embedding e_c ∈ R^25 for c ∈ C
- Character LSTMs φ_f^C, φ_b^C : R^25 × R^25 → R^25
- Word embedding e_w ∈ R^100 for w ∈ W
- Word LSTMs φ_f^W, φ_b^W : R^150 × R^100 → R^100

Let w_1 ... w_n ∈ W denote a word sequence where word w_i has character w_i(j) ∈ C at position j. First, the model computes a character-sensitive word representation v_i ∈ R^150 as

    f_j^C = φ_f^C( e_{w_i(j)}, f_{j-1}^C )    for j = 1 ... |w_i|
    b_j^C = φ_b^C( e_{w_i(j)}, b_{j+1}^C )    for j = |w_i| ... 1
    v_i = f_{|w_i|}^C ⊕ b_1^C ⊕ e_{w_i}

for each i = 1 ... n.¹ Next, the model computes

    f_i^W = φ_f^W( v_i, f_{i-1}^W )    for i = 1 ... n
    b_i^W = φ_b^W( v_i, b_{i+1}^W )    for i = n ... 1

and induces a character- and context-sensitive word representation h_i ∈ R^200 as

    h_i = f_i^W ⊕ b_i^W        (1)

for each i = 1 ... n. These vectors are used to define the boundary detection loss and the type classification loss described below.

Boundary detection loss  We frame boundary detection as predicting BIO tags without types. A natural approach is to optimize the conditional probability of the correct tags y_1 ... y_n ∈ {B, I, O}:

    p(y_1 ... y_n | h_1 ... h_n) ∝ exp( sum_{i=1}^{n} T_{y_{i-1}, y_i} × g_{y_i}(h_i) )        (2)

where g : R^200 → R^3 is a function that adjusts the length of the LSTM output to the number of targets. We use a feedforward network g(h) = W^2 relu(W^1 h + b^1) + b^2. We write Θ_1 to refer to T ∈ R^{3×3} and the parameters in g. The boundary detection loss is given by the negative log likelihood:

    L_1(Θ, Θ_1) = - sum_l log p( y^(l) | h^(l) )

where l iterates over tagged sentences in the data. The global normalizer for (2) can be computed using dynamic programming; see Collobert et al. (2011). Note that the runtime complexity of boundary detection is constant despite dynamic programming, since the number of tags is fixed (three).

Type classification loss  Given a mention boundary 1 ≤ s ≤ t ≤ n, we predict its type using (1) as follows. We introduce an additional pair of LSTMs φ_f^E, φ_b^E : R^200 × R^200 → R^200 and compute a corresponding mention representation μ ∈ R^|E| as

    f_j^E = φ_f^E( h_j, f_{j-1}^E )    for j = s ... t
    b_j^E = φ_b^E( h_j, b_{j+1}^E )    for j = t ... s
    μ = q( f_t^E ⊕ b_s^E )        (3)

where q : R^400 → R^|E| is again a feedforward network that adjusts the vector length to |E|.² We write Θ_2 to refer to the parameters in φ_f^E, φ_b^E, q. Now we can optimize the conditional probability of the correct type τ:

    p(τ | h_s ... h_t) ∝ exp( μ_τ )        (4)

The type classification loss is given by the negative log likelihood:

    L_2(Θ, Θ_2) = - sum_l log p( τ^(l) | h_s^(l) ... h_t^(l) )

where l iterates over typed mentions in the data.

Joint loss  The final training objective is to minimize the sum of the boundary detection loss and the type classification loss:

    L(Θ, Θ_1, Θ_2) = L_1(Θ, Θ_1) + L_2(Θ, Θ_2)        (5)

In stochastic gradient descent (SGD), this amounts to computing the tagging loss l_1 and the classification loss l_2 (summed over all mentions) at each annotated sentence, and then taking a gradient step on l_1 + l_2. Observe that the base network Θ is optimized to handle both tasks. During training, we use gold boundaries and types to optimize L_2(Θ, Θ_2). At test time, we predict boundaries from the tagging layer (2) and classify them using the classification layer (4).

4 Experiments

Data  We use two NER datasets: CoNLL 2003, which has four entity types PER, LOC, ORG and MISC (Tjong Kim Sang and De Meulder, 2003), and the newswire portion of OntoNotes Release 5.0, which has 18 entity types (Weischedel et al., 2013).

Implementation and baseline  We denote our model Mention2Vec and implement it using the DyNet library.³ We use the same pre-trained word embeddings as Lample et al. (2016). We use the Adam optimizer (Kingma and Ba, 2014) and apply dropout at all LSTM layers (Hinton et al., 2012). We perform minimal tuning over development data. Specifically, we perform a 5 × 5 grid search over learning rates 0.0001 ... 0.0005 and dropout rates 0.1 ... 0.5 and choose the configuration that gives the best performance on the dev set.

We also re-implement the BiLSTM-CRF model of Lample et al. (2016); this is equivalent to optimizing just L_1(Θ, Θ_1) but using typed BIO tags. Lample et al. (2016) use different details in optimization (SGD with gradient clipping), data pre-processing (replacing every digit with a zero), and the dropout scheme (dropout at BiLSTM input (1)). As a result, our re-implementation is not directly comparable and obtains different (slightly lower) results. But we emphasize that the main goal of this paper is to demonstrate the utility of the proposed approach rather than obtaining a new state-of-the-art result on NER.

CoNLL 2003 (4 types)    F1     # words/sec
BiLSTM-CRF              90.22  3889
Mention2Vec             90.90  4825

OntoNotes (18 types)    F1     # words/sec
BiLSTM-CRF              90.77  495
Mention2Vec             89.37  4949

Table 1: Test F1 scores on CoNLL 2003 and the OntoNotes newswire portion.

Model                           F1
McCallum and Li (2003)          84.04
Collobert et al. (2011)         89.59
Lample et al. (2016) – Greedy   89.15
Lample et al. (2016) – Stack    90.33
Lample et al. (2016) – CRF      90.94
Mention2Vec                     90.90

Table 2: Test F1 scores on CoNLL 2003.

¹For simplicity, we assume some random initial state vectors such as f_0^C and b_{|w_i|+1}^C when we describe LSTMs.
²Clearly, one can consider different networks over the boundary, for instance simple bag-of-words or convolutional neural networks. We leave the exploration as future work.
³https://github.com/karlstratos/mention2vec
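To make the multitask decomposition concrete, here is a hedged PyTorch sketch (the paper's released code uses DyNet, so this is not the author's implementation): a shared BiLSTM feeds an untyped BIO boundary tagger and a span-level type classifier, and the two losses are summed as in Eqn. (5). It simplifies the paper's formulation — a plain softmax tagger stands in for the first-order tagger of Eqn. (2), and the mention embedding uses the span BiLSTM's final state rather than the exact concatenation of Eqn. (3); all dimensions are placeholders.

```python
# Illustrative sketch of the Mention2Vec-style joint objective; not the author's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Mention2VecSketch(nn.Module):
    def __init__(self, vocab_size, n_types, word_dim=100, hidden=100):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, word_dim)
        self.base = nn.LSTM(word_dim, hidden, bidirectional=True, batch_first=True)
        self.boundary = nn.Linear(2 * hidden, 3)             # B, I, O (no types)
        self.span_lstm = nn.LSTM(2 * hidden, hidden, bidirectional=True, batch_first=True)
        self.type_out = nn.Linear(2 * hidden, n_types)        # mention embedding -> |E| scores

    def forward(self, words):                                 # words: (1, n) token ids
        h, _ = self.base(self.emb(words))                     # (1, n, 2*hidden)
        return h

    def loss(self, words, bio_tags, mentions):
        # bio_tags: (1, n) gold B/I/O ids; mentions: list of (s, t, type) gold spans.
        h = self.forward(words)
        l1 = F.cross_entropy(self.boundary(h).squeeze(0), bio_tags.squeeze(0))
        l2 = h.new_zeros(())
        for s, t, typ in mentions:                            # gold boundaries during training
            span, _ = self.span_lstm(h[:, s:t + 1])
            mu = self.type_out(span[:, -1])                   # span representation -> type scores
            l2 = l2 + F.cross_entropy(mu, torch.tensor([typ]))
        return l1 + l2                                         # joint objective, Eqn. (5)
```

At test time one would decode BIO tags from the boundary layer, group them into spans, and run only those spans through the type classifier, which is what keeps the cost linear in the number of types.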

PER
... In another letter dated January 1865, a well-to-do Washington matron wrote to Lincoln to plead for ...
Chang and Washington were the only men's seeds in action on a day that saw two seeded women's ...
"Just one of those things, I was just trying to make contact," said Bragg.
Washington's win was not comfortable, either.

LOC
Lauck, from Lincoln, Nebraska, yelled a tirade of abuse at the court after his conviction for inciting ...
... warring factions, with the PUK aiming to break through to KDP's headquarters in Saladhuddin.
... is not expected to travel to the West Bank before Monday," Nabil Abu Rdainah told Reuters.
... off a bus near his family home in the village of Donje Ljupce in the municipality of Podujevo.

ORG
English division three - Swansea v Lincoln.
SOCCER - OUT-OF-SORTS NEWCASTLE CRASH 2 1 AT HOME.
Moura, who appeared to have elbowed Cyprien in the final minutes of the 3 0 win by Neuchatel, was ...
In Sofia: Leviski Sofia (Bulgaria) 1 Olimpija (Slovenia) 0

WORK OF ART
... Bond novels, and "Treasure Island," produced by Charlton Heston who also stars in the movie.
... probably started in 1962 with the publication of Rachel Carson's book "Silent Spring."
... Victoria Petrovich) spout philosophic bon mots with the self-conscious rat-a-tat pacing of "Laugh In."
Dennis Farney's Oct. 13 page-one article "River of Despair," about the poverty along the ...

GPE
... from a naval station at Treasure Island near the Bay Bridge to San Francisco to help fight fires.
... lived in an expensive home on Lido Isle, an island in Newport's harbor, according to investigators.
... Doris Moreno, 37, of Bell Gardens; and Ana L. Azucena, 27, of Huntington Park.
One group of middle-aged manufacturing men from the company's Zama plant outside Tokyo was ...

ORG
... initiative will spur members of the General Agreement on Tariffs and Trade to reach ...
... question of Taiwan's membership in the General Agreement on Tariffs and Trade should ...
"He doesn't know himself," Kathy Stanwick of the Abortion Rights League says of ...
... administrative costs, management and research, the Office of Technology Assessment just reported.

Table 3: Nearest neighbors of detected mentions in CoNLL 2003 and OntoNotes using (3).

4.1 NER Performance

Table 1 compares the NER performance and decoding speed between BiLSTM-CRF and Mention2Vec. The F1 scores are obtained on test data. The speed is measured by the average number of words decoded per second.

On CoNLL 2003, in which the number of types is small, our model achieves 90.50 compared to 90.22 of BiLSTM-CRF with a minor speed improvement. This shows that despite the separation between boundary detection and type classification, we can achieve good performance through joint optimization. On OntoNotes, in which the number of types is much larger, our model still performs well with an F1 score of 89.37 but is behind BiLSTM-CRF, which achieves 90.77. We suspect that this is due to strong correlation between mention types that fully structured models can exploit more effectively. However, our model is also an order of magnitude faster: 4949 compared to 495 words/second.

Finally, Table 2 compares our model with other works in the literature on CoNLL 2003. McCallum and Li (2003) use CRFs with manually crafted features; Collobert et al. (2011) use convolutional neural networks; Lample et al. (2016) use BiLSTMs in a greedy tagger (Greedy), a stack-based model (Stack), and a global tagger using a CRF output layer (CRF). Mention2Vec performs competitively.

4.2 Mention Embeddings

Table 3 shows nearest neighbors of detected mentions using the mention representations μ in (3). Since μ_τ represents the score of type τ, the mention embeddings are clustered by entity types by construction. The model induces completely different representations even when the mention has the same lexical form. For instance, based on its context, Lincoln receives a person, location, or organization representation; Treasure Island receives a book or location representation. The model also learns representations for long multi-word expressions such as the General Agreement on Tariffs and Trade.

5 Conclusion

We have presented a neural architecture for entity identification that multitasks boundary detection and type classification. Joint optimization enables the base BiLSTM network to capture the correlation between entities indirectly via multitasking. As a result, the model is competitive with fully structured models such as BiLSTM-CRFs on CoNLL 2003 while being more scalable and also inducing context-sensitive mention embeddings clustered by entity types.

There are many interesting future directions, such as applying this framework to NED, in which type classification is much more fine-grained, and finding a better method for optimizing the multitasking objective (e.g., instead of using gold boundaries for training, dynamically use predicted boundaries during training).

Acknowledgments

The author would like to thank Linpeng Kong for his consistent help with DyNet and Miguel Ballesteros for pre-trained word embeddings.

References

Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. In NIPS Deep Learning Workshop.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. The Journal of Machine Learning Research 12:2493–2537.

Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Eliyahu Kiperwasser and Yoav Goldberg. 2016. Simple and accurate dependency parsing using bidirectional LSTM feature representations. Transactions of the Association for Computational Linguistics 4:313–327.

Lingpeng Kong, Chris Dyer, and Noah A. Smith. 2015. Segmental recurrent neural networks. arXiv preprint arXiv:1511.06018.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proceedings of NAACL.

Andrew McCallum and Wei Li. 2003. Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 188–191.

Barbara Plank. 2016. Keystroke dynamics as signal for shallow syntactic parsing. In Proceedings of COLING.

Barbara Plank, Anders Søgaard, and Yoav Goldberg. 2016. Multilingual part-of-speech tagging with bidirectional long short-term memory models and auxiliary loss. In Proceedings of ACL.

Sunita Sarawagi, William W. Cohen, et al. 2004. Semi-Markov conditional random fields for information extraction. In NIPS, volume 17, pages 1185–1192.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 142–147.

Ralph Weischedel, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, et al. 2013. OntoNotes Release 5.0 LDC2013T19. Linguistic Data Consortium, Philadelphia, PA.

Hiroyasu Yamada and Yuji Matsumoto. 2003. Statistical dependency analysis with support vector machines. In Proceedings of IWPT, volume 3, pages 195–206.

Zhilin Yang, Ruslan Salakhutdinov, and William Cohen. 2016. Multi-task cross-lingual sequence tagging from scratch. arXiv preprint arXiv:1603.06270.

Towards Neural Machine Translation with Latent Tree Attention

James Bradbury and Richard Socher [email protected] [email protected]

Abstract

Building models that take advantage of the hierarchical structure of language without a priori annotation is a longstanding goal in natural language processing. We introduce such a model for the task of machine translation, pairing a recurrent neural network grammar encoder with a novel attentional RNNG decoder and applying policy gradient reinforcement learning to induce unsupervised tree structures on both the source and target. When trained on character-level datasets with no explicit segmentation or parse annotation, the model learns a plausible segmentation and shallow parse, obtaining performance close to an attentional baseline.

1 Introduction

Many efforts to exploit linguistic hierarchy in NLP tasks make use of the output of a self-contained parser system trained from a human-annotated treebank (Huang et al., 2006). An alternative approach aims to jointly learn the task at hand and relevant aspects of linguistic hierarchy, inducing from an unannotated training dataset parse trees that may or may not correspond to treebank annotation practices (Wu, 1997; Chiang, 2005).

Most deep learning models for NLP that aim to make use of linguistic hierarchy integrate an external parser, either to prescribe the recursive structure of the neural network (Pollack, 1990; Goller and Küchler, 1996; Socher et al., 2013) or to provide a supervision signal or training data for a network that predicts its own structure (Socher et al., 2010; Bowman et al., 2016; Dyer et al., 2016b). But some recently described neural network models take the second approach and treat hierarchical structure as a latent variable, applying inference over graph-based conditional random fields (Kim et al., 2017), the straight-through estimator (Chung et al., 2017), or policy gradient reinforcement learning (Yogatama et al., 2017) to work around the inapplicability of gradient-based learning to problems with discrete latent states.

For the task of machine translation, syntactically-informed models have shown promise both inside and outside the deep learning context, with hierarchical phrase-based models frequently outperforming traditional ones (Chiang, 2005) and neural MT models augmented with morphosyntactic input features (Sennrich and Haddow, 2016; Nadejde et al., 2017), a tree-structured encoder (Eriguchi et al., 2016; Hashimoto and Tsuruoka, 2017), and a jointly trained parser (Eriguchi et al., 2017) each outperforming purely-sequential baselines.

Drawing on many of these precedents, we introduce an attentional neural machine translation model whose encoder and decoder components are both tree-structured neural networks that predict their own constituency structure as they consume or emit text. The encoder and decoder networks are variants of the RNNG model introduced by Dyer et al. (2016b), allowing tree structures of unconstrained arity, while text is ingested at the character level, allowing the model to discover and make use of structure within words.

The parsing decisions of the encoder and decoder RNNGs are parameterized by a stochastic policy trained using a weighted sum of two objectives: a language model loss term that rewards predicting the next character with high likelihood, and a tree attention term that rewards one-to-one attentional correspondence between constituents in the encoder and decoder.

We evaluate this model on the German-English language pair of the flickr30k dataset, where it obtains similar performance to a strong character-level baseline.


Analysis of the latent trees produced by the encoder and decoder shows that the model learns a reasonable segmentation and shallow parse, and most phrase-level constituents constructed while ingesting the German input sentence correspond meaningfully to constituents built while generating the English output.

2 Model

2.1 Encoder/Decoder Architecture

The model consists of a coupled encoder and decoder, where the encoder is a modified stack-only recurrent neural network grammar (Kuncoro et al., 2017) and the decoder is a stack-only RNNG augmented with constituent-level attention. An RNNG is a top-down transition-based model that jointly builds a sentence representation and parse tree, representing the parser state with a StackLSTM and using a bidirectional LSTM as a constituent composition function. Our implementation is detailed in Figure 1, and differs from Dyer et al. (2016b) in that it lacks separate new-nonterminal tokens for different phrase types, and thus does not include the phrase type as an input to the composition function. Instead, the values of x_i for the encoder are fixed to a constant x_enc while the values of x_j for the decoder are determined through an attention procedure (Section 2.2).

[Figure 1: At a given timestep during either encoding or decoding there are three possible transitions (although one or more may be forbidden): begin a new nonterminal constituent (NT), predict and ingest a terminal (GEN), or end the current nonterminal (REDUCE). If the chosen transition is NT, the RNNG adds a new-nonterminal token x_{i+1} to the active constituent and begins a new nonterminal constituent (1a). If the transition is GEN, the RNNG predicts the next token (Section 2.3) and adds the ground-truth next token e from the context buffer at the cursor location (1b). If the transition is REDUCE, the contents of the active nonterminal are passed to the composition function, the new-nonterminal token x_i is replaced with the result of the composition h_i, and the StackLSTM rolls back to the previously active constituent (1c). In all three cases, the StackLSTM then advances one step with the newly added token as input (x_{i+1}, e, or h_i).]

As originally described, the RNNG predicts parser transitions using a one-layer tanh perceptron with three concatenated inputs: the last state of a unidirectional LSTM over the stack contents (s), the last state of a unidirectional LSTM over the reversed buffer of unparsed tokens (b), and the result of an LSTM over the past transitions (a). All three of these states can be computed with at most one LSTM step per parser transition using the StackLSTM algorithm (Dyer et al., 2016a). But such a baseline RNNG is actually outperformed by one which conditions the parser transitions only on the stack representation (Kuncoro et al., 2017). Restricting our model to this stack-only case allows both the encoder and decoder to be supervised using a language model loss, while allowing the model access to b would give it a trivial way to predict the next character and obtain zero loss.

2.2 Attention

With the exception of the attention mechanism, the encoder and decoder are identical. While the encoder uses a single token to represent a new nonterminal, the decoder represents a new nonterminal on the stack as a sum weighted by structural attention of the phrase representations of all nonterminal tree nodes produced by the encoder. In particular, we use the normalized dot products between the decoder stack representation s_j^dec and the stack representation at each encoder node s_i^enc (that is, the hidden state of the StackLSTM up to and including x_i^enc but not h_i^enc) as coefficients in a weighted sum of the phrase embeddings h_i^enc corresponding to the encoder nodes:

    α_ij = softmax_{all i}( s_i^enc · s_j^dec )
    x_j^dec = sum_i α_ij h_i^enc        (1)

Since the dot products between encoder and decoder stack representations are a measure of structural similarity between the (left context of) the current decoder state and the encoder state, within a particular decoder nonterminal the model reduces to ordinary sequence-to-sequence transduction. Starting from the encoder's representation of the corresponding nonterminal or a weighted combination of such representations, the decoder will emit a translated sequence of child constituents (both nonterminal and terminal) one by one, applying attention only when emitting nonterminal children.

2.3 Training

We formulate our model as a stochastic computation graph (Schulman et al., 2015), leading to a training paradigm that combines backpropagation (which provides the exact gradient through deterministic nodes) and vanilla policy gradient (which provides a Monte Carlo estimator for the gradient through stochastic nodes).

There are several kinds of training signals in our model. First, when the encoder or decoder chooses the GEN action it passes the current stack state s through a one-layer softmax perceptron, giving the probability that the next token is each of the characters in the vocabulary. The language model loss L_k for each generated token is the negative log probability assigned to the ground-truth next token. The other differentiable training signal is the coverage loss L_c, which is a measure of how much the attention weights diverge from the ideal of a one-to-one mapping. This penalty is computed as a sum of three MSE terms:

    L_c = mean_{all i}( 1 - sum_{all j} α_ij )^2
        + mean_{all i}( 1 - max_{all j} α_ij )^2
        + mean_{all j}( 1 - max_{all i} α_ij )^2        (2)

Backpropagation using the differentiable losses affects only the weights of the output softmax perceptron. The overall loss function for these weights is a weighted sum of all terms L_k and L_c:

    L = 100 L_c + 10 sum_{all k} L_k        (3)

There are additionally nondifferentiable rewards r that bias the model towards or away from certain kinds of tree structures. Here, negative numbers correspond to penalties. We assign a tree reward of -1 when the model predicts a REDUCE with only one child constituent (REDUCE with zero child constituents is forbidden) or predicts two REDUCE or NT transitions in a row. This biases the model against unary branching and reduces its likelihood of producing an exclusively left- or right-branching tree structure. In addition, for all constituents except the root, we assign a tree reward based on the size and type of its children. If n and t are the number of nonterminal and terminal children, this reward is 4t if all children are terminal and 9√n otherwise. A reward structure like this biases the model against freely mixing terminals and nonterminals within the same constituent and provides incentive to build substantial tree structures early on in training so the model doesn't get stuck in trivial local minima.

Within both the encoder and decoder, each stochastic action node has a corresponding tree reward r_k if the action was REDUCE (otherwise zero) and a corresponding language model loss L_k if the action was GEN (otherwise zero). We subtract an exponential moving average baseline from each tree reward and additional exponential moving average baselines—computed independently for each character z in the vocabulary, because we want to reduce the effect of character frequency—from the language model losses. If GEN(k) is the number of GEN transitions among actions one through k, and γ is a decay constant, the final reward R_k^m for action k with m ∈ {enc, dec} is:

    r̂_k = r_k - r_baseline
    L̂_k = L_k - L_baseline(z_k)
    R̂_k^m = sum_{κ=k}^{K_m} γ^{GEN(κ) - GEN(k)} ( r̂_κ - L̂_κ^m )        (4)
    R_k^enc = R̂_k^enc - L_c - sum_{κ=1}^{K_dec} L̂_κ^dec    (m = enc)

These rewards define the gradient that each stochastic node (with normalized action probabilities p_k^a and chosen action a_k) produces during backpropagation according to the standard multinomial score function estimator (REINFORCE):

    ∇_θ L = mean_{all k}( -R_k ∇_θ log p_k^{a_k} ) = mean_{all k}( -(R_k / p_k^{a_k}) ∇_θ p_k^{a_k} )        (5)

3 Results

We evaluated our model on the German-English language pair of the flickr30k data, the textual component of the WMT Multimodal Translation shared task (Specia et al., 2016). An attentional sequence-to-sequence model with two layers and 384 hidden units from the OpenNMT project (Klein et al., 2017) was run at the character level as a baseline, obtaining 32.0 test BLEU with greedy inference. Our model with the same hidden size and greedy inference achieves test BLEU of 28.5 after removing repeated bigrams.

We implemented the model in PyTorch, benefiting from its strong support for dynamic and stochastic computation graphs, and trained with batch size 10 and the Adam optimizer (Kingma and Ba, 2015) with early stopping after 12 epochs. Character embeddings and the encoder's x_enc embedding were initialized to random 384-dimensional vectors. The value of γ and the decay constant for the baselines' exponential moving average were both set to 0.95. A random selection of translations is included in the supplemental material, while two attention plots are shown in Figure 2. Figure 2b demonstrates a common pathology of the model, where a phrasal encoder constituent would be attended to during decoding of the head word of the corresponding decoder constituent, while the head word of the encoder constituent would be attended to during decoding of the decoder constituent corresponding to the whole phrase. Another common pathology is repeated sentence fragments in the translation, which are likely generated because the model cannot condition future attention directly on past attention weights (the "input feeding" approach introduced by Luong et al. (2015)).

Translation quality also suffers because of our use of a stack-only RNNG, which we chose because an RNNG with both stack and buffer inputs is incompatible with a language model loss. During encoding, the model must decide at the very beginning of the sentence how deeply to embed the first character. But with a stack-only RNNG, it must make this decision randomly, since it isn't able to use the buffer representation—which contains the entire sentence.

[Figure 2: Attention visualizations for two sentences from the development set ("Zwei Inderinnen sehen Stoff durch." / "Two Indian women look over fabric." and "Ein Junge in einer Gruppe von sitzenden Kindern lächelt für die Kamera." / "A boy in a group of seated children smiles for the camera."). Attention between two constituents is represented by a shaded rectangle whose projections on the x and y axes cover the encoder and decoder constituents respectively.]

4 Conclusion

We introduce a new approach to leveraging unsupervised tree structures in NLP tasks like machine translation. Our experiments demonstrate that a small-scale MT dataset contains sufficient training signal to infer latent linguistic structure, and we are excited to learn what models like the one presented here can discover in full-size translation corpora. One particularly promising avenue of research is to leverage the inherently compositional phrase representations h_i^enc produced by the encoder for other NLP tasks.

There are also many possible directions for improving the model itself and the training process. Value function baselines can replace exponential moving averages, pure reinforcement learning can replace teacher forcing, and beam search can be used in place of greedy inference. Solutions to the translation pathologies presented in Section 3 are likely more complex, although one possible approach would leverage variational inference using a teacher model that can see the buffer and helps train a stack-only student model.

15 References Maria Nadejde, Siva Reddy, Rico Sennrich, Tomasz Dwojak, Marcin Junczys-Dowmunt, Philipp Koehn, Samuel Bowman, Jon Gauthier, Abhinav Rastogi, and Alexandra Birch. 2017. Syntax-aware neu- Raghav Gupta, Christopher Manning, and Christo- ral machine translation using ccg. arXiv preprint pher Potts. 2016. A fast unified model for parsing arXiv:1702.01147 . and sentence understanding. In ACL. Jordan B Pollack. 1990. Recursive distributed repre- David Chiang. 2005. A hierarchical phrase-based sentations. Artificial Intelligence 46(1):77–105. model for statistical machine translation. In ACL. John Schulman, Nicolas Heess, Theophane Weber, and Junyoung Chung, Sungjin Ahn, and Yoshua Bengio. Pieter Abbeel. 2015. Gradient estimation using 2017. Hierarchical multiscale recurrent neural net- stochastic computation graphs. In NIPS. works. In ICLR. Rico Sennrich and Barry Haddow. 2016. Linguistic in- Chris Dyer, Miguel Ballesteros, Wang Ling, Austin put features improve neural machine translation. In Matthews, and Noah A Smith. 2016a. Transition- WMT. based dependency parsing with stack long short- term memory. In EMNLP. Richard Socher, Christopher Manning, and Andrew Ng. 2010. Learning continuous phrase representa- Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, tions and syntactic parsing with recursive neural net- and Noah Smith. 2016b. Recurrent neural network works. In NIPS Workshop on Deep Learning and grammars. In NAACL. Unsupervised . Akiko Eriguchi, Kazuma Hashimoto, and Yoshimasa Richard Socher, Alex Perelygin, Jean Wu, Jason Tsuruoka. 2016. Tree-to-sequence attentional neu- Chuang, Christopher Manning, Andrew Ng, and ACL ral machine translation. In . Christopher Potts. 2013. Recursive deep models Akiko Eriguchi, Yoshimasa Tsuruoka, and Kyunghyun for semantic compositionality over a sentiment tree- Cho. 2017. Learning to parse and translate im- bank. In EMNLP. proves neural machine translation. arXiv preprint Lucia Specia, Stella Frank, Khalil Simaan, and arXiv:1702.03525 . Desmond Elliott. 2016. A shared task on multi- Christoph Goller and Andreas Kuchler.¨ 1996. Learning modal machine translation and crosslingual image task-dependent distributed representations by back- description. In WMT. propagation through structure. In IEEE Interna- Dekai Wu. 1997. Stochastic inversion transduction tional Conference on Neural Networks. IEEE, vol- grammars and bilingual parsing of parallel corpora. ume 1, pages 347–352. Computational linguistics 23(3):377–403. Kazuma Hashimoto and Yoshimasa Tsuruoka. 2017. Dani Yogatama, Phil Blunsom, Chris Dyer, Edward Neural machine translation with source-side latent Grefenstette, and Wang Ling. 2017. Learning to graph parsing. arXiv preprint arXiv:1702.02265 . compose words into sentences with reinforcement Liang Huang, Kevin Knight, and Aravind Joshi. 2006. learning. In ICLR. A syntax-directed translator with extended domain of locality. In CHPJI-NLP. Association for Compu- tational Linguistics. Yoon Kim, Carl Denton, Luong Hoang, and Alexan- der Rush. 2017. Structured attention networks. In ICLR. Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR. G. Klein, Y. Kim, Y. Deng, J. Senellart, and A. M. Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. ArXiv preprint arXiv:1701.02810 . Adhiguna Kuncoro, Miguel Ballesteros, Lingpeng Kong, Chris Dyer, Graham Neubig, and Noah Smith. 2017. What do recurrent neural network grammars learn about syntax? In EACL. 
Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective approaches to attention- based neural machine translation. In EMNLP.

16 Structured Prediction via Learning to Search under Bandit Feedback

Amr Sharaf and Hal Daume´ III University of Maryland, College Park amr,hal @cs.umd.edu { }

Abstract Pre-Trained POS Tagging Model We present an algorithm for structured Empirical Empirical/NN prediction under online bandit feedback. Methods in Methods/NN in/IN The learner repeatedly predicts a sequence Natural Natural/NNP of actions, generating a structured output. Language π Language/NNP Processing Processing/NNP It then observes feedback for that output and no others. We consider two cases: a pure bandit setting in which it only ob- Input Update Policy Output serves a loss, and more fine-grained feed- hamming_loss(input, output) = 1 back in which it observes a loss for every action. We find that the fine-grained feed- Total Loss for predicted structure back is necessary for strong empirical per- formance, because it allows for a robust Figure 1: BLS for learning POS tagging. We learn π variance-reduction strategy. We empiri- a policy , whose output a user sees. The user cally compare a number of different algo- views predicted tags and provides a loss (and pos- rithms and exploration methods and show sibly additional feedback, such as which words are π the efficacy of BLS on sequence labeling labeled incorrectly). This is used to update . and dependency parsing tasks. The goal of the system is to minimize it’s cumula- 1 Introduction tive loss over time, using the feedback to improve itself. This introduces a fundamental exploration- In structured prediction the goal is to jointly pre- versus-exploitation trade-off, in which the system dict the values of a collection of variables that in- must try new things in hopes that it will learn teract. In the usual “supervised” setting, at train- something useful, but also in which it is penalized ing time, you have access to ground truth outputs (by high cumulative loss) for exploring too much.1 (e.g., dependency trees) on which to build a pre- One natural question we explore in this paper dictor. We consider the substantially harder case is: in addition to the loss, what forms of feed- of online bandit structured prediction, in which back are both easy for a user to provide and use- the system never sees supervised training data, but ful for a system to utilize? At one extreme, one instead must make predictions and then only re- can solicit no additional feedback, which makes ceives feedback about the quality of that single the learning problem very difficult. We describe prediction. The model we simulate (Figure 1) is: Bandit Learning to Search,BLS2, an approach 1. the world reveals an input (e.g., a sentence) for improving joint predictors from different types 2. the algorithm produces a single structured of bandit feedback. The algorithm predicts out- prediction (e.g., full parse tree); puts and observes the loss of the predicted struc- 3. the world provides a loss (e.g., overall quality 1 rating) and possibly a small amount of addi- Unlike active learning—in which the system chooses which examples it wants feedback on—in our setting the sys- tional feedback; tem is beholden to the human’s choice in inputs. 4. the algorithm updates itself 2Our implementation will be made freely available.

17

Proceedings of the 2nd Workshop on Structured Prediction for Natural Language Processing, pages 17–26 Copenhagen, Denmark, September 7–11, 2017. c 2017 Association for Computational Linguistics ture; but then it uses a regression strategy to es- VB NNP NNP loss = 1 timate counterfactual costs of (some) other struc- Empirical/ Methods/ in/ Natural/ Language/ Processing/ loss = 0 tures that it did not predict. This variance reduc- JJ NN IN NNP NNP NNP tion technique ( 2.2) is akin to doubly-robust esti- S R E § NN VB NNP loss = 2 mation in contextual bandits. The trade-off is that { Roll-in one-step in order to accurately train these regressors, BLS deviation { requires per-action feedback from the user (e.g., Roll-out which words were labeled incorrectly). It appears that this feedback is necessary; with out it, accu- Figure 2: A search space for part of speech tag- racy degrades over time. Additionally, we con- ging, explored by a policy that chooses to “ex- sider several forms of exploration beyond a sim- plore” at state R. ple -greedy strategy, including Boltzmann explo- ration and Thompson sampling ( 2.4). We demon- § ing the learner to take the action with lowest cost, strate the efficacy of these developments on POS updating the learned policy from πˆi to πˆi+1. tagging, syntactic chunking and dependency pars- In the bandit setting, this is not possible: we can ing ( 3), in which we show improvements over § only evaluate one loss; nevertheless, we can follow both LOLS (Chang et al., 2015) and Policy Gradi- a similar algorithmic structure. Our specific algo- ent (Sutton et al., 1999). rithm, BLS, is summarized in algorithm 1. We start with a pre-trained reference policy πref and 2 Learning with Bandit Feedback seek to improve it with bandit feedback. On each example, an exploration algorithm ( 2.4) chooses We operate in the learning to search framework, § a family of algorithms for solving structured pre- whether to explore or exploit. If it chooses to ex- diction problems, which essentially train a policy ploit, a random learned policy is used to make a to make sequence of decisions that are stitched to- prediction and no updates are made. If, instead, it gether into a final structured output. Such algo- chooses to explore, it executes a roll-in as usual, rithms decompose a joint prediction task into a se- a single deviation at time t according to the ex- quence of action predictions, such as predicting ploration policy, and then a roll-out. Upon com- the label of the next word in sequence labeling pletion it suffers a bandit loss for the entire com- or predicting a shift/reduce action in dependency plete trajectory. It then uses a cost estimator ρ to parsing3; these predictions are tied by features guess the costs of the un-taken actions. From this and/or internal state. Algorithms in this family it forms a complete cost vector, and updates the have recently met with success in neural networks underlying policy based on this cost vector. Fi- (Bengio et al., 2015; Wiseman and Rush, 2016), nally, it updates the cost estimator ρ. though date back to models typically based on lin- ear policies (Collins and Roark, 2004; Daume´ III 2.1 Cost Estimation by Importance Sampling and Marcu, 2005; Xu et al., 2007; Daume´ III et al., The simplest possible method of cost estimation 2009; Ross et al., 2010; Ross and Bagnell, 2014; is importance sampling (Horvitz and Thompson, Doppa et al., 2014; Chang et al., 2015). 1952; Chang et al., 2015). 
If the third action is Most learning to search algorithms operate by the one explored with probability p3 and a cost cˆ3 considering a search space like that shown in Fig- is observed, then the cost vector for all actions is ure 2. The learning algorithm first rolls-in for a set to 0, 0, cˆ /p , 0,..., 0 . This yields an unbi- h 3 3 i few steps using a roll-in policy πin to some state ased estimate of the true cost vector because in ex- R, then considers all possible actions available at pectation over all possible actions, the cost vector state R, and then rolls out using a roll-out policy equals cˆ , cˆ ,..., cˆ . h 1 2 K i πout until the end of the sequence. In the fully Unfortunately, this type of cost estimate suffers supervised case, the learning algorithm can then from huge variance (see experiments in 3). If ac- § compute a loss for all possible outputs, and use tions are explored uniformly at random, then all this loss to drive learning at state R, by encourag- cost vectors look like 0, 0,Kcˆ , 0,... 0 , which h 3 i varies quite far from its expectation when K is 3Although the decomposition is into a sequence of predic- tions, such approaches are not limited to “left-to-right” style large. To better understand the variance reduction prediction tasks (Ross et al., 2010; Stoyanov et al., 2011). issue, consider the part of speech tagging exam-

18 N Input: examples xi , reference policy Input: current state: st; roll-in trajectory: τ; { }i=1 πref, exploration algorithm explorer, K regression functions (one for every and rollout-parameter β 0 action): ρ; set of allowed actions: ≥ out π initial policy A(st); roll-out policy: π ; explored 0 ← action: at; bandit loss: c I ← ∅ ρ initial cost estimator t τ ← ← | | each x in training examples Initialize cˆ: a vector of size A(st) for i do | | explorer cˆ0 0 if chooses not to explore then ← π Unif( ) // pick policy for (a, s) τ do ← I ∈ y predict using π cˆ0 cˆ0 + ρa(Φ(s)) i ← ← c bandit loss of y end ← i for a A(st) do else ∈ t Unif(0,T 1) // deviation time if a = at then ← − τ roll-in with πˆ for t rounds cˆ(a) c ← i ← s final state in τ else t ← at = explorer(st) // deviation action cˆ(a) cˆ0 ← out πout πref with prob β, else πˆ y roll-out with π from τ + a ← i ← y roll-out with πout from τ + a for (a0, s0) in y do i ← t c bandit loss of y cˆ(a) cˆ(a) + ρa0 (Φ(s0)) ← i ← cˆ est cost(s , τ, ρ, A(s ), a , c) end ← t t t πˆ update(ˆπ , (Φ(x , s ), cˆ)) end i+1 ← i i t πˆ return cˆ: a vector of size A(st) , where cˆ(a) I ← I ∪ { i+1} | | update cost estimator ρ is the estimated cost for action a at state st. Algorithm 2: est cost end end Algorithm 1: Bandit Learning to Search (BLS) Figure 2 were tagged incorrectly. This would add a loss of 2 to any of the estimated costs, but could be very difficult to fit because this loss was based ple from Figure 2. As in the figure, suppose that on past actions, not the current action. the deviation occurs at time step 3 and that during roll-in, the first two words are tagged correctly by 2.2 Doubly Robust Cost Estimation the roll-in policy. At t = 3, there are 45 possible To address the challenge of high variance cost es- actions (each possible part of speech) to take from timates, we adopt a strategy similar to the doubly- the deviation state, of which three are shown; each robust estimation used in the (non-structured) con- action (under uniform exploration) will be taken textual bandit setting (Dudik et al., 2011). In par- with probability 1/45. If the first is taken, a loss of ticular, we train a separate set of regressors to esti- one will be observed, if the second, a loss of zero, mate the total costs of any action, which we use to and if the third, a loss of two. When a fair coin is impute a counterfactual cost for untaken actions. flipped, perhaps the third choice is selected, which Algorithm2 spells this out, taking advantage of will induce a cost vector of ~c = 0, 0, 90, 0,... . h i an action-to-cost regressor, ρ. To estimate the cost In expectation over this random choice, we have of an un-taken action a0 at a deviation, we simulate [c] is correct, implying unbiasedness, but the Ea the execution of a0, followed by the execution of variance is very large: ((Kc )2). O max the current policy through the end. The cost of This problem is exacerbated by the fact that that entire trajectory is estimated by summing ρ many learning to search algorithms define the cost over all states along the path. For example, in the of an action a to be the difference between the part of speech tagging example above, we learn cost of a and the minimum cost. This is desirable 45 regressors: one for each part of speech. The because when the policy is predicting greedily, it cost of a roll-out is estimated as the sum of these should choose the action that adds the least cost; regressors over the entire predicted sequence. 
it should not need to account for already-incurred Using this doubly-robust estimate strategy ad- cost. For example, suppose the first two words in dresses both of the problems mentioned in 2.1. § 19 First, this is able to provide a cost estimate for all actions. Second, because ρ is deterministic, it will 1 (V(x,a) ν [r∗(x, a) + (1 ρ1(x, a))∆(x, a)]) give the same cost to the common prefix of all tra- n ∼ − jectories, thus resolving credit assignment.  1  +E(x,a) ν ρ1(x, a)Vr D( x,a)[r] ∼ µˆ1(a x) ∼ ·| The remaining challenge is: how to estimate |  1 µ1(a x) 2 +E(x,a) ν − | ρ1(x, a)∆(x, a) ] these regressors. Unfortunately, this currently ∼ µˆ1(a x) | comes at an additional cost to the user: we must observe per-action feedback. In particular, when The theorem show that the variance can be de- the user views a predicted output (e.g., part of composed into three terms. The first term ac- speech sequence), we ask for a binary signal for counts for the randomness in the context features. each action whether the predicted action was right The second term accounts for randomness in re- or wrong. Thus, for a sentence of length T , we wards and disappears when rewards are determin- generate T training examples for every time step istic functions of the context and actions. The last 1 t T . Each training example has the form: term accounts for the disagreement between ac- ≤ ≤ tions taken by the exploration policy µ and the (at, ct), where at is the predicted action at time t, target policy ν. This decomposition shows that and ct is a binary cost, either 1 if the predicted ac- tion was correct, or zero otherwise. This amounts doubly robust estimation yields accurate value es- to a user “crossing out” errors, which hopefully timates when there is either a good model of re- incurs low overhead. Using these T training ex- wards or a good model of the exploration policy. amples, we can effectively train the K regressors We can build on this result for BLS to show an identical for estimating the cost of unobserved actions. result under the following reduction. The exploration policy µ in our setting is defined as fol- lows: for every exploration round, a randomly se- lected time-step is assigned a randomly chosen ac- 2.3 Theoretical Analysis tion, and a deterministic reference policy is used to generate the roll-in and roll-out trajectories. Our In order to analyze the variance of the BLS es- goal is to evaluate and optimize a better target pol- timator (in particular to demonstrate that it has icy ν. Under this setting, and assuming that the lower variance than importance sampling), we structures are generated i.i.d from a fixed but un- provide a reduction to contextual bandits in an i.i.d known distribution, the structured prediction prob- setting. Dud´ık et al.(2014) studied the contextual lem will be equivalent to a contextual bandit prob- bandit setting, which is similar to out setting but lem were we consider the roll-in trajectory as part without the complexity of sequences of actions. of the context. (In particular, if T = 1 then our setting is ex- actly the contextual bandit setting.) They studied 2.4 Options for Exploration Strategies the task of off-policy evaluation and optimization In addition to the -greedy exploration algorithm, for a target policy ν using doubly robust estima- we consider the following exploration strategies: tion given historic data from an exploration pol- icy µ consisting of contexts, actions, and received Boltzmann (Softmax) Exploration. Boltz- rewards. 
They prove that this approach yields ac- mann exploration varies the action probabilities curate value estimates when there is either a good as a graded function of estimated value. The (but not necessarily consistent) model of rewards greedy action is still given the highest selection or a good (but not necessarily consistent) model of probability, but all the others are ranked and past policy. In particular, they show: weighted according to their cost estimates; ac- tion a is chosen with probability proportional 1 Theorem. Let ∆(x, a) and ρk(x, a) denote, re- to exp temp c(a) , where “temp” is a positive spectively, the additive error of the reward esti- parameterh calledi the temperature, and c(a) is the mator rˆ and the multiplicative error of the action current predicted cost of taking action a. High probability estimator µˆk. If exploration policy µ temperatures cause the actions to be all (nearly) and the estimator µˆk are stationary, and the target equiprobable. Low temperatures cause a greater policy ν is deterministic, then the variance of the difference in selection probability for actions that ˆ doubly robust estimator Vµ[VDR] is: differ in their value estimates. 20 Thompson Sampling estimates the following main, but does poorly out of domain. We sim- elements: a set Θ of parameters µ; a prior distribu- ulate bandit feedback over the entire Penn Tree- tion P (µ) on these parameters; past observations bank Wall Street Journal (sections 02–21 and 23), D consisting of observed contexts and rewards; comprising 42k sentences and about one million a likelihood function P (r b, µ), which gives the words. (Adapting from tweets to WSJ is nonstan- | probability of reward given a context b and a pa- dard; we do it here because we need a large dataset rameter µ; In each round, Thompson Sampling on which to simulate bandit feedback.) The mea- selects an action according to its posterior prob- sure of performance is average per-word accuracy ability of having the best parameter µ. This is (one minus Hamming loss). achieved by taking a sample of parameter for each Noun Phrase Chunking is a sequence segmen- action, using the posterior distributions, and se- tation task, in which sentences are divided into lecting that action that produces the best sam- base noun phrases.We solve this problem using ple (Agrawal and Goyal, 2013; Komiyama et al., a sequence span identification predictor based on 2015). We use Gaussian likelihood function and Begin-In-Out encoding, following (Ratinov and Gaussian prior for the Thompson Sampling algo- Roth, 2009), though applied to chunking rather rithm. In addition, we make a linear payoff as- than named-entity recognition. We used the sumption similar to (Agrawal and Goyal, 2013), CoNLL-2000 datasetfor training and testing. We where we assume that there is an unknown under- used the smaller test split (2, 012 sentences) for d lying parameter µa R such that the expected training a reference policy, and used the training ∈ cost for each action a, given the state st and con- split (8, 500 sentences) for online evaluation. Per- T text xi is Φ(xi, st) µa. formance was measured by F-score over predicted noun phrases (for which one has to predict the en- 3 Experimental Results tire noun phrase correctly to get any points). 
The evaluation framework we consider is the fully Dependency Parsing is a syntactic analysis online setup described in the introduction, mea- task, in which each word in a sentence gets as- suring the degree to which various algorithms can signed a grammatical head (or “parent”). The ex- effectively improve upon a reference policy by ob- perimental setup is similar to part-of-speech tag- serving only a partial feedback signal, and effec- ging. We train an arc-eager dependency parser tively balancing exploration and exploitation. We (Nivre, 2003), which chooses among (at most) learn from one structured example at every time four actions at each state: Shift, Reduce, Left or step, and we do a single pass over the available Right. As in part of speech tagging, the reference examples. We measure loss as the average cumu- policy is trained on the TweetNLP dataset (us- lative loss over time, thus algorithms are appropri- ing an oracle due to (Goldberg and Nivre, 2013)), ately “penalized” for unnecessary exploration. and evaluated on the Penn Treebank corpus (again, sections 02 21 and section 23). The loss is un- − 3.1 Tasks, Policy Classes and Data Sets labeled attachment score (UAS), which measures the fraction of words that pick the correct parent. We experiment with the following three tasks. For each, we briefly define the problem, describe the 3.2 Main Results policy class that we use for solving that problem in a learning to search framework (we adopt a sim- Here, we describe experimental results (Table 1) ilar setting to that of (Chang et al., 2016), who de- comparing several algorithms: (line B) The ban- scribes the policies in more detail), and describe dit variant of the LOLS algorithm, which uses the data sets that we use. The regression problems importance sampling and -greedy exploration; are solved using squared error regression, and the (lines C-F) BLS, with bandit feedback and per- classification problems (policy learning) is solved word error correction, with variance reduction and via cost-sensitive one-against-all. four exploration strategies: -greedy, Boltzmann, Part-Of-Speech Tagging over the 45 Penn Thompson, and “oracle” exploration in which case Treebank (Marcus et al., 1993) tags. To simu- the oracle action is always chosen during explo- late a domain adaptation setting, we train a ref- ration; (line G) The Policy Gradient reinforcement erence policy on the TweetNLP dataset (Owoputi learning algorithm, with -greedy exploration on et al., 2013), which achieves good accuracy in do- one-step deviations; and (line H) A fully super-

21 POS DepPar Chunk Variance for Doubly Robust and Importance Sampling 80 Algorithm Variant Acc UAS F-Scr BLS 70 LOLS A. Reference - 47.24 44.15 74.73 B. LOLS -greedy 2.29 18.55 31.76 60 50 C. BLS -greedy 86.55 56.04 90.03 D. Boltz. 89.62 57.20 90.91 40 Variance E. Thomp. 89.37 56.60 90.06 30 F. Oracle 89.23 56.60 90.58 20

G. Policy -greedy 75.10 - 90.07 10 ∇ H. 0 DAgger Full Sup 96.51 90.64 95.29 0 10000 20000 30000 40000 Number of Sequences Table 1: Total progressive accuracies for the dif- Figure 3: Analyzing the variance of the the cost ferent algorithms on the three natural language estimates from LOLS and BLS over a run of the processing tasks. LOLS uniformly decreases per- algorithm for POS; the x-axis is number of sen- formance over the Reference baseline. BLS, tences processed, y-axis is empirical variance. which integrates cost regressors, uniformly im- proves, often quite dramatically. The overall ef- fect of the exploration mechanism is small, but mic settings. To understand the effect of vari- in all cases Boltzmann exploration is statistically ance, it is enough to compare the performance significantly better than the other options at the of the Reference policy (the policy learned from p < 0.05 level (because the sample size is so the out of domain data) with that of LOLS. In large). Policy Gradient for dependency parsing is 1 all of these cases, running LOLS substantially de- missing because after processing 4 of the data, it creases performance. Accuracy drops by 45% for was substantially subpar. POS tagging, 26% for dependency parsing and 43% for noun phrase chunking. For POS tagging, vised “upper bound” trained with DAgger. the LOLS accuracy falls below the accuracy one From these results, we draw the following con- would get for random guessing (which is approxi- clusions (the rest of this section elaborates on mately 14% on this dataset for NN)! these conclusions in more detail): When the underlying algorithm changes from 1. The original LOLS algorithm is ineffective at LOLS to BLS, the overall accuracies go up signif- improving the accuracy of a poor reference icantly. Part of speech tagging accuracy increases policy (A vs B); from 47% to 86%; dependency parsing accuracy 2. Collecting additional per-word feedback in from 44% to 57%; and chunking F-score from BLS allows the algorithm to drastically im- 74% to 90%. These numbers naturally fall be- prove on the reference (A vs C) and on LOLS low state of the art for fully (B vs C); we show in 3.3 that this happens on these data sets, precisely because these results § are based only on bandit feedback (see 3.6). because of variance reduction; § 3. Additional leverage can be gained by varying the exploration strategy, and in general Boltz- 3.4 Effect of Exploration Strategy mann exploration is effective (C,D,E), but the Figure 4 shows the effect of the choice of  for Oracle exploration strategy is not optimal (F -greedy exploration in BLS. Overall, best results vs D); see 3.4; are achieved with remarkably high epsilon, which § 4. For large action spaces like POS tagging, is possibly counter-intuitive. The reason this hap- the BLS-type updates outperform Policy pens is because BLS only explores on one out of T Gradient-type updates, when the exploration time steps, of which there are approximately 30 in strategy is held constant (G vs D), see 3.5. each of these experiments (the sentence lengths). § 5. Bandit feedback is less effective than full This means that even with  = 1, we only take feedback (H vs D) ( 3.6). a random action roughly 3.3% of the time. It is § therefore not surprising that large  is the most ef- 3.3 Effect of Variance Reduction fective strategy. 
Overall, although the differences Table 1 shows the progressive validation accura- are small, the best choice of  across these differ- cies for all three tasks for a variety of algorith- ent tasks is 0.6. ≈ 22 Figure 4: Analyzing the effect of  in exploration/exploitation trade-off. Overall, large values of  are strongly preferred.

Returning to Table 1, we can consider the effect linear function approximation models the proba- of different exploration mechanisms: -greedy, bility of selecting action at at state st under πθ: Boltzmann (or softmax) exploration, and Thomp- π (a s ) exp(θT φ(s )), where K is the total θ t| t ∝ at t son sampling. Overall, Boltzmann exploration number of actions. PG maximizes the total ex- was the most effective strategy, gaining about 3% pected return under the distribution of trajectories accuracy in POS tagging, just over 1% in depen- sampled from the policy πθ. dency parsing, and just shy of 1% in noun phrase To balance the exploration / exploitation trade- chunking. Although the latter two effects are off, we use exactly the same epsilon greedy tech- small, they are statistically significant, which is nique used in BLS (Algorithm1). For each tra- measurable due to the fact that the evaluation sets jectory τ sampled from πθ, a state is selected uni- are very large. In general, Thompson sampling is formly at random, and an action is selected greed- also effective, though worse than Boltzmann. ily with probability . The policy πθ is used to con- Finally, we consider a variant in which when- struct the roll-in and roll-out trajectories. For ev- ever BLS requests exploration, the algorithm ery trajectory τ, we collect the same binary grades “cheats” and chooses the gold standard decision from the user as in BLS, and use them to train a re- at that point. This is the “oracle exploration” line gression function to estimate the per-step reward. in Table 1. We see that this does not improve These estimates are then be summed up to com- overall quality, which suggests that a good explo- pute the total return Gt from time step t onwards ration strategy is not one that always does the right (Algorithm2). thing, but one that also explored bad—but useful- We use standard policy gradient update for opti- to-learn-from—options. mizing the policy θ based on the observed rewards:

3.5 Policy Gradient Updates θ θ + α log(π (s , a ))G (1) ← ∇θ θ t t t A natural question is: how does bandit structured prediction compare to more standard approaches The results of this experiment are shown in line to reinforcement learning (we revisit the question G of Table 1. Here, we see that on POS tagging, of how these problems differ in 4). We chose where the number of actions is very large, PG sig- § Policy Gradient (Sutton et al., 1999) as a point of nificantly underperforms BLS. Our initial experi- comparison. The main question we seek to ad- ments in dependency parsing showed the PG sig- 1 dress is how the BLS update rule compares to the nificantly underperformed BLS after processing 4 Policy Gradient update rule. In order to perform of the data. The difference is substantially smaller this comparison, we hold the exploration strategy in chunking, where PG is on part with BLS with fixed, and implement the Policy Gradient update -greedy exploration. Figure 4 shows the effect rule inside our system. of  on PG, where we see that it also prefers large values of , but its performance saturates as  1. More formally, the policy gradient optimiza- → tion is similar to that used in BLS. PG maintains 3.6 Bandit Feedback vs Full Feedback a policy πθ, which is parameterized by a set of parameters θ. Features are extracted from each Finally, we consider the trade-off between ban- state st to construct the feature vectors φ(st), and dit feedback in BLS and full feedback. To make

23 this comparison, we run the fully supervised al- Learning from partial feedback has generated a gorithm DAgger (Ross et al., 2010) which is ef- vast amount of work in the literature, dating back fectively the same algorithm as LOLS and BLS to the seminal introduction of multi-armed bandits under full supervision. In Table 1, we can see by (Robbins, 1985). However, the vast number of that full supervision dramatically improves perfor- papers on this topic does not consider joint pre- mance from around 90% to 97% in POS tagging, diction tasks; see (Auer et al., 2002; Auer, 2003; 57% to 91% in dependency parsing, and 91% to Langford and Zhang, 2008; Srinivas et al., 2009; 95% in chunking. Of course, achieving this im- Li et al., 2010; Beygelzimer et al., 2010; Dudik proved performance comes at a high labeling cost: et al., 2011; Chapelle and Li, 2011; Valko et al., a human has to provide exact labels for each deci- 2013) and references inter alia. There, the system sion, not just binary “yes/no” labels. observes (bandit) feedback for every decision. Other forms of contextual bandits on structured 4 Discussion & Conclusion problems have been considered recently. Kalai and Vempala(2005) studied the structured prob- The most similar algorithm to ours is the bandit lem of online shortest paths, where one has a di- version of LOLS (Chang et al., 2015) (which is an- rected graph and a fixed pair of nodes (s, t). Each alyzed theoretically but not empirically); the key period, one has to pick a path from s to t, and then differences between BLS and LOLS are: (a) BLS the times on all the edges are revealed. The goal employs a doubly-robust estimator for “guessing” of the learner is to improve it’s path predictions the costs of counterfactual actions; (b) BLS em- over time. Relatedly, Krishnamurthy et al.(2015) ploys alternative exploration strategies; (c) BLS is studied a variant of the contextual bandit problem, effective in practice at improving the performance where on each round, the learner plays a sequence of an initial policy. of actions, receives a score for each individual ac- In the NLP community, Sokolov et al.(2016a) tion, and obtains a final reward that is a linear com- and Sokolov et al.(2016b) have proposed a pol- bination to those scores. icy gradient-like method for optimizing log-linear In this paper, we presented a computationally models like conditional random fields (Lafferty efficient algorithm for structured contextual ban- et al., 2001) under bandit feedback. Their eval- dits, BLS, by combining: locally optimal learning uation is most impressive on the problem of do- to search (to control the structure of exploration) main adaptation of a machine translation system, and doubly robust cost estimation (to control the in which they show that their approach is able to variance of the cost estimation). This provides the learn solely from bandit-style feedback, though re- first practically applicable learning to search algo- quiring a large number of samples. rithm for learning from bandit feedback. Unfortu- In the learning-to-search setting, the difference nately, this comes at a cost to the user: they must between structured prediction under bandit feed- make more fine-grained judgments of correctness back and reinforcement learning gets blurry. A than in a full bandit setting. 
In particular, they distinction in the problem definition is that the must mark each decision as correct or incorrect:it world is typically assumed to be fixed and stochas- is an open question whether this feedback can be tic in RL, while the world is both deterministic removed without incurring a substantially larger and known (conditioned on the input, which is ran- sample complexity. A second large open question dom) in bandit structured prediction: given a state is whether the time step at which to deviate can and action, the algorithm always knows what the be chosen more intelligently, similar to selective next state will be. A difference in solution is that sampling (Shi et al., 2015), using active learning. there has been relatively little work in reinforce- ment learning that explicitly begins with a refer- Acknowledgements ence policy to improve and often assumes an ab initio training regime. In practice, in large state We thank the anonymous reviewers for many help- spaces, this makes the problem almost impossi- ful comments. This work was supported by NSF ble, and practical settings like AlphaGo (Silver grant IIS-1618193 and LTS grant DO-0032. Opin- et al., 2016) require imitation learning to initialize ions, findings, conclusions, or recommendations a good policy, after which reinforcement learning expressed here are those of the authors and do not is used to improve that policy. necessarily reflect the view of the sponsor(s).

24 References Miroslav Dudik, Daniel Hsu, Satyen Kale, Nikos Karampatziakis, John Langford, Lev Reyzin, and Shipra Agrawal and Navin Goyal. 2013. Thompson Tong Zhang. 2011. Efficient optimal learning for sampling for contextual bandits with linear payoffs. contextual bandits. arXiv preprint arXiv:1106.2369 In ICML (3). pages 127–135. .

Peter Auer. 2003. Using confidence bounds for Yoav Goldberg and Joakim Nivre. 2013. Training de- exploitation-exploration trade-offs. The Journal of terministic parsers with non-deterministic oracles. Machine Learning Research 3:397–422. Transactions of the ACL 1.

Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Daniel G Horvitz and Donovan J Thompson. 1952. Robert E Schapire. 2002. The nonstochastic multi- A generalization of sampling without replacement armed bandit problem. SIAM Journal on Computing from a finite universe. Journal of the American sta- 32(1):48–77. tistical Association 47(260):663–685.

Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Adam Kalai and Santosh Vempala. 2005. Efficient al- Noam Shazeer. 2015. Scheduled sampling for se- gorithms for online decision problems. Journal of quence prediction with recurrent neural networks. Computer and System Sciences 71(3):291–307. In Advances in Neural Information Processing Sys- tems. pages 1171–1179. Junpei Komiyama, Junya Honda, and Hiroshi Nak- agawa. 2015. Optimal regret analysis of thomp- Alina Beygelzimer, John Langford, Lihong Li, Lev son sampling in stochastic multi-armed bandit Reyzin, and Robert E Schapire. 2010. Contextual problem with multiple plays. arXiv preprint bandit algorithms with supervised learning guaran- arXiv:1506.00779 . tees. arXiv preprint arXiv:1002.4058 . Akshay Krishnamurthy, Alekh Agarwal, and Miroslav Kai-Wei Chang, He He, Hal Daume´ III, John Langford, Dudik. 2015. Efficient contextual semi-bandit learn- and Stephane´ Ross. 2016. A credit assignment com- ing. arXiv preprint arXiv:1502.05890 . piler for joint prediction. In NIPS. John Lafferty, Andrew McCallum, and Fernando CN Kai-Wei Chang, Akshay Krishnamurthy, Alekh Agar- Pereira. 2001. Conditional random fields: Prob- wal, Hal Daume´ III, and John Langford. 2015. abilistic models for segmenting and labeling se- Learning to search better than your teacher. In Pro- quence data . ceedings of The 32nd International Conference on Machine Learning. pages 2058–2066. John Langford and Tong Zhang. 2008. The epoch- greedy algorithm for multi-armed bandits with side Olivier Chapelle and Lihong Li. 2011. An empirical information. In Advances in neural information pro- evaluation of thompson sampling. In Advances in cessing systems. pages 817–824. neural information processing systems. pages 2249– 2257. Lihong Li, Wei Chu, John Langford, and Robert E Schapire. 2010. A contextual-bandit approach to Michael Collins and Brian Roark. 2004. Incremental personalized news article recommendation. In Pro- parsing with the perceptron algorithm. In Proceed- ceedings of the 19th international conference on ings of the 42nd Annual Meeting on Association for World wide web. ACM, pages 661–670. Computational Linguistics. Association for Compu- tational Linguistics, page 111. Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated Hal Daume´ III, John Langford, and Daniel Marcu. corpus of english: The penn treebank. Computa- 2009. Search-based structured prediction. Machine tional linguistics 19(2):313–330. learning 75(3):297–325. Joakim Nivre. 2003. An efficient algorithm for projec- Hal Daume´ III and Daniel Marcu. 2005. Learning tive dependency parsing. In IWPT. pages 149–160. as search optimization: Approximate large margin methods for structured prediction. In Proceedings Olutobi Owoputi, Brendan O’Connor, Chris Dyer, of the 22nd international conference on Machine Kevin Gimpel, Nathan Schneider, and Noah A learning. ACM, pages 169–176. Smith. 2013. Improved part-of-speech tagging for online conversational text with word clusters. Asso- Janardhan Rao Doppa, Alan Fern, and Prasad Tade- ciation for Computational Linguistics. palli. 2014. Hc-search: A learning framework for search-based structured prediction. J. Artif. Intell. Lev Ratinov and Dan Roth. 2009. Design challenges Res.(JAIR) 50:369–407. and misconceptions in named entity recognition. In CoNLL. Miroslav Dud´ık, Dumitru Erhan, John Langford, Li- hong Li, et al. 2014. Doubly robust policy Herbert Robbins. 1985. 
Some aspects of the sequential evaluation and optimization. Statistical Science design of experiments. In Herbert Robbins Selected 29(4):485–511. Papers, Springer, pages 169–177.

25 Stephane Ross and J Andrew Bagnell. 2014. Rein- forcement and imitation learning via interactive no- regret learning. arXiv preprint arXiv:1406.5979 . Stephane´ Ross, Geoffrey J Gordon, and J Andrew Bag- nell. 2010. A reduction of imitation learning and structured prediction to no-regret online learning. arXiv preprint arXiv:1011.0686 . Tianlin Shi, Jacob Steinhardt, and Percy Liang. 2015. Learning where to sample in structured prediction. In AISTATS. David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Ju- lian Schrittwieser, Ioannis Antonoglou, Veda Pan- neershelvam, Marc Lanctot, et al. 2016. Mastering the game of go with deep neural networks and tree search. Nature 529(7587):484–489. Artem Sokolov, Julia Kreutzer, Christopher Lo, and Stefan Riezler. 2016a. Learning structured predic- tors from bandit feedback for interactive NLP. In Proceedings of the 54th Annual Meeting of the As- sociation for Computational Linguistics (Volume 1: Long Papers). Association for Computational Lin- guistics (ACL). https://doi.org/10.18653/v1/p16- 1152. Artem Sokolov, Stefan Riezler, and Tanguy Urvoy. 2016b. Bandit structured prediction for learning from partial feedback in statistical machine transla- tion. arXiv preprint arXiv:1601.04468 . Niranjan Srinivas, Andreas Krause, Sham M Kakade, and Matthias Seeger. 2009. Gaussian process opti- mization in the bandit setting: No regret and experi- mental design. arXiv preprint arXiv:0912.3995 . Veselin Stoyanov, Alexander Ropson, and Jason Eis- ner. 2011. Empirical risk minimization of graphi- cal model parameters given approximate inference, decoding, and model structure. In AISTATS. pages 725–733.

Richard S Sutton, David A McAllester, Satinder P Singh, Yishay Mansour, et al. 1999. Policy gradient methods for reinforcement learning with function approximation. In NIPS. volume 99, pages 1057– 1063.

Michal Valko, Nathaniel Korda, Remi´ Munos, Ilias Flaounas, and Nelo Cristianini. 2013. Finite-time analysis of kernelised contextual bandits. arXiv preprint arXiv:1309.6869 .

Sam Wiseman and Alexander M Rush. 2016. Sequence-to-sequence learning as beam-search op- timization. arXiv preprint arXiv:1606.02960 .

Yuehua Xu, Alan Fern, and Sung Wook Yoon. 2007. Discriminative learning of beam-search heuristics for planning. In IJCAI. pages 2041–2046.

26 Syntax Aware LSTM model for Semantic Role Labeling

2 1 1 2 2 Feng Qian , Lei Sha , Baobao Chang , Lu-chen Liu , Ming Zhang ∗ 1 Key Laboratory of Computational Linguistics, Ministry of Education School of Electronics Engineering and Computer Science, Peking University 2 Institute of Network Computing and Information Systems School of Electronics Engineering and Computer Science, Peking University nickqian, shalei, chbb, liuluchen, mzhang cs @pku.edu.cn { }

ᦄ੊ ྋࣁ ᧣ັ Ԫඳ ܻࢩ WORD Abstract P olice now investigate accident cause

ROLE >@A0 >@AM–TMP REL >@A1

In Semantic Role Labeling (SRL) task, the IBOES S–A0 S–AM–TMP REL B–A1 E–A1

advmod compound – nn tree structured dependency relation is rich DEPENDENCY P ARSING nsubj dobj in syntax information, but it is not well ROOT handled by existing models. In this pa- Figure 1: A sentence from Chinese Proposition per, we propose Syntax Aware Long Short Bank 1.0 (CPB 1.0)(Xue and Palmer, 2003) with Time Memory (SA-LSTM). The structure semantic role labels and dependency. of SA-LSTM changes according to de- pendency structure of each sentence, so that SA-LSTM can model the whole tree Traditional methods (Sun and Jurafsky, 2004; structure of dependency relation in an ar- Xue, 2008; Ding and Chang, 2008, 2009; Sun, chitecture engineering way. Experiments 2010) do classification according to manually de- demonstrate that on Chinese Proposition signed features. requires ex- pertise and is labor intensive. Recent works based Bank (CPB) 1.0, SA-LSTM improves F1 by 2.06% than ordinary bi-LSTM with on Recurrent Neural Network (RNN) (Zhou and feature engineered dependency relation in- Xu, 2015; Wang et al., 2015; He et al., 2017) ex- tract features automatically, and significantly out- formation, and gives state-of-the-art F1 of 79.92%. On English CoNLL 2005 dataset, perform traditional methods. However, because SA-LSTM brings improvement (2.1%) to RNN methods treat language as sequential data, bi-LSTM model and also brings slight im- they fail to integrate the tree structured depen- provement (0.3%) when added to the state- dency into RNN. of-the-art model. We propose Syntax Aware Long Short Time Memory (SA-LSTM) to directly model complex 1 Introduction tree structure of dependency relation in an ar- chitecture engineering way. Architecture of SA- The task of Semantic Role Labeling (SRL) is to LSTM is shown in Figure2. SA-LSTM is based recognize arguments of a given predicate in a sen- on bidirectional LSTM (bi-LSTM). In order to tence and assign semantic role labels. Many NLP model the whole dependency tree, we add addi- works such as machine translation (Xiong et al., tional directed connections between dependency 2012; Aziz et al., 2011) benefit from SRL because related words in bi-LSTM. SA-LSTM integrates of the semantic structure it provides. Figure1 the whole dependency tree directly into the model shows a sentence with semantic role label. in an architecture engineering way. Also, to take Dependency relation is considered important dependency relation type into account, we intro- for SRL task (Xue, 2008; Punyakanok et al., 2008; duce trainable weights for different types of de- Pradhan et al., 2005), since it can provide rich pendency relation. The weights can be trained to structure and syntax information for SRL. At the indicate importance of a dependency type. bottom of Figure1 shows dependency of the sen- SA-LSTM is able to directly model the whole tence. tree structure of dependency relation in an archi- ∗ Corresponding Author tecture engineering way. Experiments show that

27

Proceedings of the 2nd Workshop on Structured Prediction for Natural Language Processing, pages 27–32 Copenhagen, Denmark, September 7–11, 2017. c 2017 Association for Computational Linguistics

SA-LSTM can model dependency relation bet- Dependency Tree Structure ROOT nsubj dobj ter than traditional feature engineering way. SA- advmod compound – nn LSTM gives state of the art F1 on CPB 1.0 and ᦄ੊ ྋࣁ ᧣ັ Ԫඳ ܻࢩ also shows improvement on English CoNLL 2005 P olice now investigate accident cause Architecture Engineering dataset. Bidirectional RNN Structure

dobj 2 Syntax Aware LSTM

nsubj In this section, we first introduce ordinary bi- advmod compound – nn LSTM. Based on bi-LSTM, we then introduce the proposed SA-LSTM. Finally, we introduce how to do optimization for SA-LSTM. Cell Structure hidden state from dependency related cell 2.1 Conventional bi-LSTM Model for SRL h weighttype St t In a corpus sentence, each word wt has a feature representation xt which is generated automatically as (Wang et al., 2015) did. zt is feature embedding Ct 1 Ct for wt, calculated as followed: tanh

zt = f(W1xt) (1) tanh n1 n0 where W1 R × . n0 is the dimension of word ht 1 ∈ h feature representation. t In a corpus sentence, each word wt has six in- Zt ternal vectors, C, gi, gf , go, Ct, and ht, shown in Figure 2: Structure of Syntax Aware LSTM. The Equation2: e purple square is the current cell that is calculating. C = f(Wczt + Ucht 1 + bc) The green square is a dependency related cell. − gj = σ(Wjzt + Ujht 1 + bj) j i, f, o e − ∈ { } (2) n3 n2 Ct = gi C + gf Ct 1 where W2 R × , n2 is 2 ht, W3 − ∈ × ∈ n4 n3 n ht = go f(Ct) R × and 4 is the number of tags in IOBES e tagging schema. where C is the candidate value of the current cell state. g are gates used to control the flow of infor- 2.2 Syntax Aware LSTM Model for SRL mation.eCt is the current cell state. ht is hidden This section introduces the proposed SA-LSTM state of wt. Wx and Ux are matrixes used in linear model. Figure2 shows the structure of SA-LSTM. transformation: SA-LSTM is based on bidirectional LSTM. By ar-

nh n1 chitecture engineering, SA-LSTM can model the Wx, x c, i, f, o R × ∈ { } ∈ (3) whole tree structure of dependency relation. nh nh Ux, x c, i, f, o R × ∈ { } ∈ St is the key component of SA-LSTM. It stands As convention, f stands for tanh and σ stands for for information from other dependency related sigmoid. means the element-wise multiplica- words, and is calculated as shown in Equation6: tion. t 1 In order to make use of bidirectional informa- − T T St = f( α hi) (6) −→h ←−h × tion, the forward t and backward t are con- i=0 catenated together, as shown in Equation4: X T T at = [−→ht , ←−ht ] (4) 1 If there exists dependency α = relation from w to w (7) Finally, o is the result vector with each dimension  i t t  corresponding to the score of each semantic role 0 Otherwise tag, and are calculated as shown in Equation5:  St is the weighted sum of all hidden state vectors ot = W3f(W2at) (5) hi which come from previous words wi . Note

28 that, α 0, 1 indicates whether there is a de- Method F1% ∈ { } pendency relation pointed from wi to wt. Xue(2008) 71.90 We add a gate gs to constrain information from Sun et al.(2009) 74.12 St, as shown in Equation8: Yand and Zong(2014) 75.31 Wang et al.(Bi-LSTM)(2015) 77.09 gs = σ(Wszt + Usht 1 + bs) (8) − Sha et al.(2016) 77.69 Path LSTM, Roth et al. (2016)3 79.01 To protect the original word information from be- BiLSTM+feature engineering depen- 77.75 ing diluted (Wu et al., 2016) by S , we add S to t t dency hidden layer vector ht instead of adding to cell SA-LSTM(Random Initialized) 79.81 state Ct. So ht in SA-LSTM cell is calculated as: SA-LSTM(Pre-trained Embedding) 79.92 h = g f(C ) + g S (9) Table 1: Results comparison on CPB 1.0 t o t s t For example, in Figure2, there is a dependency likelihood of a single sentence is: relation “advmod” from green square to purple exp(s(x, y, θ)) square. By Equation7, only the hidden state of log p(y x, θ) = log | y exp(s(x, y0, θ)) green square is selected, then transformed by gs 0 (12) in Equation8, finally added to hidden layer of the = s(x, y, θ) log P exp(s(x, y0, θ)) − y0 purple cell. X SA-LSTM changes structure by adding differ- where y0 ranges from all valid paths of answers. ent connections according to dependency relation. We use to calculate the global In this way, SA-LSTM integrates the whole tree normalization. Besides, we collected those impos- structure of dependency. sible transitions from corpus beforehand. When However, by using α in Equation7, we do not calculating global normalization, we prevent cal- take dependency type into account, so we further culating impossible paths which contains impossi- improve the way α is calculated from Equation7 ble transitions. to Equation 10. Each typem of dependency rela- 3 Experiment tion is assigned a trainable weight αm. In this way, SA-LSTM can model differences between types of 3.1 Experiment setting dependency relation. In order to compare with previous Chinese SRL works, we choose to do experiment on CPB 1.0. αm exists typem dependency We also follow the same data setting as previous α = w w (10)  relation from i to t Chinese SRL work (Xue, 2008; Sun et al., 2009)  0 Otherwise did. Pre-trained1 word embeddings are tested on SA-LSTM and shows improvement. 2.3 Optimization For English SRL, we test on CoNLL 2005 This section introduces optimization methods for dataset. SA-LSTM. We use maximum likelihood criterion We use Stanford Parser (Chen and Manning, to train SA-LSTM. We choose stochastic gradient 2014) to get dependency relation. The training descent algorithm to optimize parameters. set of Chinese parser overlaps a part of CPB 1.0 Given a training pair T = (x, y) where T is test set, so we retrained the parser. Dimension the current training pair, x denotes current train- of hyper parameters are tuned according to devel- ing sentence, and y is the corresponding correct opment set. n1 = 200, nh = 100, n2 = 200, answer path. yt = k means that the t-th word has n3 = 100, = 0.001. the k-th semantic role label. The score of ot is calculated as: 3.2 Syntax Aware LSTM Performance To prove that SA-LSTM models dependency Ni relation better than simple feature engineering s(x, y, θ) = otyt (11) t=1 1 X Trained by on Chinese Gigaword Corpus 2All experiment code and related files are available on re- where Ni is the word number of the current sen- quest tence and θ stands for all parameters. So the log 3We test the model on CPB 1.0

Figure 3: Visualization of trained weight α_m. X axis is the Universal Dependency type, Y axis is the weight.

3.2 Syntax Aware LSTM Performance

To prove that SA-LSTM models the dependency relation better than simple feature engineering, we design an experiment in which the dependency relation is added to a bi-LSTM in a traditional feature engineering way.

Given a word w_t, F_t is the average of all dependency-related x_i of previous words w_i, as shown in Equation 13:

F_t = \frac{1}{T} \sum_{i=0}^{t-1} \alpha \times x_i   (13)

where T is the number of dependency-related words and \alpha is a \{0, 1\} variable calculated as in Equation 7. F_t is then concatenated to x_t to form a new feature representation, and these representations are fed into the bi-LSTM.
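For contrast with the structural connection of Equations 6 to 10, the feature-engineering baseline of Equation 13 can be sketched as follows (again our own plain-Scala illustration, with the same hypothetical hasArc predicate; not code from the paper):

// Illustrative sketch of Equation (13): average the input embeddings x_i of
// dependency-related previous words; the result F_t is concatenated to x_t,
// so syntax enters only as an extra feature rather than as a change of structure.
def dependencyFeature(t: Int,
                      x: Array[Array[Double]],
                      hasArc: (Int, Int) => Boolean): Array[Double] = {
  val related = (0 until t).filter(i => hasArc(i, t)).map(i => x(i))
  val dim = x(t).length
  if (related.isEmpty) new Array[Double](dim)   // no dependency-related predecessor
  else Array.tabulate(dim)(d => related.map(_(d)).sum / related.size)
}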

As shown in Table 1, on CPB 1.0 SA-LSTM reaches a 79.81% F1 score with random initialization and a 79.92% F1 score with pre-trained word embeddings. Both are the best F1 scores published on the CPB 1.0 dataset.

In contrast to the “bi-LSTM+feature engineering dependency” model, it is clear that the architecture-engineering method of SA-LSTM gains more improvement (77.09% to 79.81%) than the simple feature engineering method (77.09% to 77.75%). Path-LSTM (Roth and Lapata, 2016) embeds the dependency path between predicate and argument for each word using an LSTM, then classifies according to this path embedding and some other features. SA-LSTM (79.81% F1) outperforms Path-LSTM (79.01% F1) on CPB 1.0.

Both “bi-LSTM + feature engineering dependency” and Path-LSTM only model dependency parsing information for each single word, which cannot capture the whole dependency tree structure. However, by building the dependency relation directly into the structure of SA-LSTM and changing the way information flows, SA-LSTM is able to model the whole tree structure of the dependency relation.

We also test our SA-LSTM on the English CoNLL 2005 dataset. Replacing the conventional bi-LSTM by SA-LSTM brings a 1.7% F1 improvement. Replacing the bi-LSTM layers of the state-of-the-art model (He et al., 2017) by SA-LSTM^1 brings a 0.3% F1 improvement.

^1 We add syntax-aware connections to every bi-LSTM layer in the 8-layer model of (He et al., 2017).

Method                                        F1%
Bi-LSTM (2 layers)                            74.52
Bi-LSTM + SA-LSTM (2 layers)                  76.63
He (2017) (Single Model, state of the art)    81.62
He (Single Model, 8 layers) + SA-LSTM         81.90

Table 2: Results on English CoNLL 2005

3.3 Visualization of Trained Weights

According to Equation 10, the influence from a single type of dependency relation is multiplied by the type weight α_m. When α_m is 0, the influence from that type of dependency relation is ignored entirely; the bigger the weight, the more influence that type of dependency relation has on the whole system.

As shown in Figure 3, the dependency relation type dobj receives the highest weight after training, as shown by the red bar. According to grammar knowledge, dobj should be an informative relation for the SRL task, and SA-LSTM gives dobj the most influence automatically. This further demonstrates that the result of SA-LSTM is highly in accordance with grammar knowledge.

4 Related works

Semantic role labeling (SRL) was first defined by Gildea and Jurafsky (2002). Early works (Gildea and Jurafsky, 2002; Sun and Jurafsky, 2004) on SRL obtained promising results without a large annotated SRL corpus. Xue and Palmer (2003) built the Chinese Proposition Bank to standardize Chinese SRL research.

Traditional works such as (Xue and Palmer, 2005; Xue, 2008; Ding and Chang, 2009; Sun et al., 2009; Chen et al., 2006; Yang et al., 2014) use feature engineering methods. Their methods can take the dependency relation into account in a feature engineering way, such as through syntactic path features. It is obvious that feature engineering methods cannot fully capture the tree structure of the dependency relation.

More recent SRL works often use neural network based methods. Collobert and Weston (2008) proposed a Convolutional Neural Network (CNN) method for SRL. Zhou and Xu (2015) proposed a bidirectional RNN-LSTM method for English SRL, and Wang et al. (2015) proposed a bi-RNN-LSTM method for Chinese SRL, on which SA-LSTM is based. He et al. (2017) further extend the work of Zhou and Xu (2015). NN based methods extract features automatically and significantly outperform traditional methods. However, most NN based methods cannot utilize the dependency relation, which is considered important for semantics-related NLP tasks (Xue, 2008; Punyakanok et al., 2008; Pradhan et al., 2005).

The works of Roth and Lapata (2016) and Sha et al. (2016) have the same motivation as SA-LSTM, but proceed in different ways. Sha et al. (2016) use the dependency relation as a feature for argument relation classification. Roth and Lapata (2016) embed the dependency path into the feature representation of each word using an LSTM. In contrast, SA-LSTM utilizes the dependency relation in an architecture engineering way, by integrating the whole dependency tree structure directly into the SA-LSTM structure.

5 Conclusion

We propose the Syntax Aware LSTM model for Semantic Role Labeling (SRL). SA-LSTM is able to directly model the whole tree structure of the dependency relation in an architecture engineering way. Experiments show that SA-LSTM can model the dependency relation better than the traditional feature engineering way. SA-LSTM gives state-of-the-art F1 on CPB 1.0 and also shows improvement on the English CoNLL 2005 dataset.

Acknowledgments

We thank Natali for proofreading the paper and giving so many valuable suggestions.

This paper is partially supported by the National Natural Science Foundation of China (NSFC Grant Nos. 61472006 and 91646202) as well as the National Basic Research Program (973 Program No. 2014CB340405).

References

Wilker Aziz, Miguel Rios, and Lucia Specia. 2011. Shallow semantic trees for SMT. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 316–322. Association for Computational Linguistics.

Danqi Chen and Christopher D. Manning. 2014. A fast and accurate dependency parser using neural networks. In EMNLP, pages 740–750.

Wenliang Chen, Yujie Zhang, and Hitoshi Isahara. 2006. An empirical study of Chinese chunking. In Proceedings of the COLING/ACL Main Conference Poster Sessions, pages 97–104. Association for Computational Linguistics.

Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160–167. ACM.

Weiwei Ding and Baobao Chang. 2008. Improving Chinese semantic role classification with hierarchical strategy. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 324–333. Association for Computational Linguistics.

Weiwei Ding and Baobao Chang. 2009. Word based Chinese semantic role labeling with semantic chunking. International Journal of Computer Processing Of Languages, 22(02n03):133–154.

Daniel Gildea and Daniel Jurafsky. 2002. Automatic labeling of semantic roles. Computational Linguistics, 28(3):245–288.

Luheng He, Kenton Lee, Mike Lewis, and Luke Zettlemoyer. 2017. Deep semantic role labeling: What works and what's next. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Sameer Pradhan, Kadri Hacioglu, Wayne Ward, James H. Martin, and Daniel Jurafsky. 2005. Semantic role chunking combining complementary syntactic views. In Proceedings of the Ninth Conference on Computational Natural Language Learning, pages 217–220. Association for Computational Linguistics.

Vasin Punyakanok, Dan Roth, and Wen-tau Yih. 2008. The importance of syntactic parsing and inference in semantic role labeling. Computational Linguistics, 34(2):257–287.

Michael Roth and Mirella Lapata. 2016. Neural semantic role labeling with dependency path embeddings. arXiv preprint arXiv:1605.07515.

Lei Sha, Tingsong Jiang, Sujian Li, Baobao Chang, and Zhifang Sui. 2016. Capturing argument relationships for Chinese semantic role labeling. In EMNLP, pages 2011–2016.

Honglin Sun and Daniel Jurafsky. 2004. Shallow semantic parsing of Chinese. In Proceedings of NAACL 2004, pages 249–256.

Weiwei Sun. 2010. Improving Chinese semantic role labeling with rich syntactic features. In Proceedings of the ACL 2010 Conference Short Papers, pages 168–172. Association for Computational Linguistics.

Weiwei Sun, Zhifang Sui, Meng Wang, and Xin Wang. 2009. Chinese semantic role labeling with shallow parsing. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 3, pages 1475–1483. Association for Computational Linguistics.

Zhen Wang, Tingsong Jiang, Baobao Chang, and Zhifang Sui. 2015. Chinese semantic role labeling with bidirectional recurrent neural networks. In EMNLP, pages 1626–1631.

Huijia Wu, Jiajun Zhang, and Chengqing Zong. 2016. An empirical exploration of skip connections for sequential tagging. arXiv preprint arXiv:1610.03167.

Deyi Xiong, Min Zhang, and Haizhou Li. 2012. Modeling the translation of predicate-argument structure for SMT. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, pages 902–911. Association for Computational Linguistics.

Nianwen Xue. 2008. Labeling Chinese predicates with semantic roles. Computational Linguistics, 34(2):225–255.

Nianwen Xue and Martha Palmer. 2003. Annotating the propositions in the Penn Chinese Treebank. In Proceedings of the Second SIGHAN Workshop on Chinese Language Processing - Volume 17, pages 47–54. Association for Computational Linguistics.

Nianwen Xue and Martha Palmer. 2005. Automatic semantic role labeling for Chinese verbs. In IJCAI, volume 5, pages 1160–1165. Citeseer.

Haitong Yang, Chengqing Zong, et al. 2014. Multi-predicate semantic role labeling. In EMNLP, pages 363–373.

Jie Zhou and Wei Xu. 2015. End-to-end learning of semantic role labeling using recurrent neural networks. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Spatial Language Understanding with Multimodal Graphs using Declarative Learning based Programming

Parisa Kordjamshidi and Taher Rahgooy and Umar Manzoor
Computer Science Department, Tulane University, New Orleans, USA
{pkordjam,trahgooy,umanzoor}@tulane.edu

Abstract

This work is on a previously formalized semantic evaluation task of spatial role labeling (SpRL) that aims at extraction of formal spatial meaning from text. Here, we report the results of initial efforts towards exploiting visual information in the form of images to help spatial language understanding. We discuss the way of designing new models in the framework of declarative learning-based programming (DeLBP). The DeLBP framework facilitates combining modalities and representing various data in a unified graph. The learning and inference models exploit the structure of the unified graph as well as the global first order domain constraints beyond the data to predict the semantics which forms a structured meaning representation of the spatial context. Continuous representations are used to relate the various elements of the graph originating from different modalities. We improved over the state-of-the-art results on SpRL.

1 Introduction

Spatial language understanding is important in many real-world applications such as geographical information systems, robotics, and navigation when the robot has a camera on the head and receives instructions about grabbing objects and finding their locations, etc. One approach towards spatial language understanding is to map the natural language to a formal spatial representation appropriate for spatial reasoning. The previous research on spatial role labeling (Kordjamshidi et al., 2010, 2017b, 2012) and ISO-Space (Pustejovsky et al., 2011, 2015) aimed at formalizing such a problem and providing machine learning solutions to find such a mapping in a data-driven way (Kordjamshidi and Moens, 2015; Kordjamshidi et al., 2011). Such extractions are made from available textual resources. However, spatial semantics are the most relevant and useful information for visualization of the language and, consequently, accompanying visual information could help disambiguation and extraction of the spatial meaning from text. Recently, there has been a large community effort to prepare new resources for combining vision and language data (Krishna et al., 2017), though not explicitly focused on formal spatial semantic representations. The current tasks are mostly image centered, such as image captioning, that is, generating image descriptions (Kiros et al., 2014; Karpathy and Li, 2014), image retrieval using textual descriptions, or visual question answering (Antol et al., 2015).

In this work, we consider a different problem, that is, how images can help in the extraction of a structured spatial meaning representation from text. This task has been recently proposed as a CLEF pilot task1, the data is publicly available and the task overview will be published (Kordjamshidi et al., 2017a). Our interest in formal meaning representation distinguishes our work from other vision and language tasks and the choice of the data, since our future goal is to integrate explicit qualitative spatial reasoning models into learning and spatial language understanding.

The contribution of this paper is a) we report results on combining vision and language that extend and improve the spatial role labeling state-of-the-art models, b) we model the task in the framework of declarative learning based programming and show its expressiveness in representing such complex structured output tasks. DeLBP provides the possibility of seamless integration of heteroge-

1 http://www.cs.tulane.edu/~pkordjam/mSpRL_CLEF_lab.htm



this task is to extract spatial roles including, spatial indicator trajector landmark composed-of composed-of composed-of (a) Spatial indicators (sp): these are triggers in- spatial relation is-a dicating the existence of spatial information is-a is-a Distance in a sentence; Direction Region is-a is-a above EQ is-a is-a left (b) Trajectors (tr): these are the entities whose is-a is-a DC is-a is-a is-a is-a location are described; PP right EC front is-a PO (c) these are the reference ob- back below Landmarks (lm): jects for describing the location of the trajec- Figure 1: Given spatial ontology (Kordjamshidi tors. and Moens, 2015) In the textual description of Figure2, the loca- tion of kids (trajector) has been described with re- neous data in addition to considering domain on- spect to the stairs (landmark) using the preposition tological and linguistic knowledge in learning and on (spatial indicator). This is example of some inference. To improve the state-of-the-art results spatial roles that we aim to extract from the whole in SpRL and exploiting the visual information we text. The second level of this task is to extract spa- rely on existing techniques for continuous repre- tial relations. sentations of image segments and text phrases, and measuring similarity to find the best alignments. (d) Spatial relations (sr): these indicate a link The challenging aspect of this work is that the between the three above mentioned roles formal representation of the textual spatial seman- (sp.tr.lm), forming spatial triplets. tics is very different from the raw spatial infor- (e) Relation types: these indicate the type of re- mation extracted from image segments using their lations in terms of spatial calculi formalisms. geometrical relationships. To alleviate this prob- Each relation can have multiple types. lem the embeddings of phrases as well as the em- For the above example we have the triplet beddings of the relations helped connecting the spatial relation(kids, on, stairs). Recognizing two modalities. This approach helped improv- the spatial relations is very challenging because ing the state of the art results on spatial role la- there could be several spatial roles in the sentence beling (Kordjamshidi et al., 2012) for recognizing and the model should be able to recognize the right spatial roles. links. The formal type of this relation could be EC that is externally connected. The previous 2 Problem Description research (Kordjamshidi and Moens, 2015) shows The goal is to extract spatial information from text the extraction of triplets is the most challenging while exploiting accompanying visual resources, part of the task for this dataset, therefore we fo- that is, images. We briefly define the task which cus on (a)-(d) tasks in this paper. The hypothe- is based on a previous formalization of spatial role sis of this paper is that knowing the objects and labeling (SpRL) (Kordjamshidi et al., 2011; Ko- their geometrical relationships in the companion rdjamshidi and Moens, 2015). Given a piece of image might help the inference for the extraction text, S, here a sentence, which is segmented into of roles as well as the relations from sentences. 
a number of phrases, the goal is to identify the In our training dataset, the alignment between the phrases that carry spatial roles and classify them text and image is very coarse-grained and merely according to a given set of spatial concepts; iden- the whole text is associated with the image, that tify the links between the roles and form spatial is, no sentence alignment, no phrase alignment for relations (triplets) and finally classify the spatial segments, etc is available. relations given a set of relation types. A more for- Each companion image I contains a number of mal definition of the problem is given in Section5, segments each of which is related to an object and where we describe our computational model. The the objects spatial relationships can be described spatial concepts and relation types are depicted in qualitatively based on their geometrical structure Figure1 which shows a light-weight spatial ontol- of the image. In this paper, we assume the image ogy. Figure2 shows an example of an image and segments are given and the image object annota- the related textual description. The first level of tions are based on a given object ontology. More-

34 (d) Application program: Specifying the final end-to-end program that starts with reading the raw data into the declared data-model graph referred to as data population and then calls the learners and constrained learners for training, prediction and evaluation. Each application program defines the configura- tion of an end-to-end model based on the above- mentioned components. In the following sections we describe these components and the way they are defined for multimodal spatial role labeling. Figure 2: Image textual description:“About 20 kids in traditional clothing and hats waiting on 3.1 Data Model stairs. A house and a green wall with gate in the A graph is used to explicitly represent the structure background. A sign saying that plants can’t be of the data. This graph is called the data-model picked up on the right.” and contains typed nodes, edges and properties. over, the relationships between objects in the im- The node types are domain’s basic data structures, ages are assumed to be given. The spatial relation- called base types. The base types are mostly pre- ships are obtained by parsing the images and com- established in Saul (Kordjamshidi et al., 2016), puting a number of relations based on geometrical including base types for representing documents, relationships between the objects boundaries. This sentences, phrases, etc, referred to as linguistic implies the spatial representation of the objects in units. In this work, we also have added a set of the image is very different from the spatial ontol- preliminary image base types in Saul that could ogy that we use to describe the spatial meaning be extended to facilitate working on visual data from text; this issue makes combining information task-independently in the future. The below code from images very challenging. shows a data model schema including nodes of lin- guistic units and image segments. The typed nodes 3 Declarative Modeling are declared as follows: To extend the SpRL task to a multimodal setting, val documents = node[Document] we firstly, replicated the state-of-the-art models val sentences = node[Sentence] val tokens = node[Token] using the framework of declarative learning based val phrases = node[Phrase] programming (DeLBP) Saul (Kordjamshidi et al., val pairs = node[Relation] val images = node[Image] 2015, 2016). The goal was to extend the pre- val segments = node[Segement] viously designed configurations easily and facil- val segmentPairs = node[SegmentRelation] itate the integration of various resources of data and knowledge into learning models. In DeLBP (‘val‘ is a Scala keyword to define variables; documents, sen- framework, we need to define the following build- tences, etc, are the programmer-defined variables; ‘node‘ is a ing blocks for an application program, Saul keyword to define typed graph nodes; Document, Sen- tence, etc, are the NLP and other base types built-in for Saul.) (a) DataModel: Declaring a graph schema to Given the base types, domain sensors can be used represent the domain’s concepts and their re- to populate raw data into the data-model. Sensors lationships. This is a first order graph called are black box functions that operate on base types a data-model and can be populated with the and can generate instances of nodes, properties actual data instances. and edges. 
An edge that connects documents to (b) Learners: Declaring basic learning models sentences using a sensor called ‘documentToSen- in terms of their inputs and outputs; where the tenceMatching‘ is defined as: inputs and outputs are properties of the data- val documentTosentence = edge(documents, model’s nodes. sentences) documentTosentence.addSensor( (c) Constraints: Declaring constraints among documentToSentenceMatching_) output labels using first order logical expres- sions. (‘edge‘ is a Saul keyword, addSensor is a Saul function) 35 The properties are assigned to the graph nodes features are used sometimes based on the head- only and defined based on the existing domain word of the phrases and sometimes by concatena- property sensors. The following example receives tion of the same features for all the tokens in a a phrase and returns the dependency relation label phrase. The relations are, in-fact, a pair of phrases of its head: and the pair features are based on the features of val headDependencyRelation = property( the phrases. The relational features between two phrases){x => getDependencyRelation phrases include their path, distance, before/after (getHeadword(x))} information. In addition to the previously used (‘property‘ is a keyword in Saul, getDependencyRelation and features, here, we add phrase and image embed- getHeadword are two NLP sensors applied on words and dings described in the next section. The details of phrases respectively.) the linguistic features are omitted due to the lack 3.2 Learners of space and since the code is publicly available. The learners are basically a set of classic classi- fiers each of which is related to a target variable 3.2.2 Image and Text Embeddings in the output space. The output variables are a subset of elements represented in the ontology of Using continuous representations has several ad- Figure1. The previous work shows the challeng- vantages in our models. One important aspect is ing element of the ontology is the extraction of compensating for the lack of lexical information spatial triplets. Therefore, in this work our goal due to the lack of training data for this problem. is to improve the extraction of the roles and spa- Another aspect is the mapping between image seg- tial triplets. Each classifier/learner is applied on ments and the phrases occurring in the textual de- a typed node which is defined in the data-model. scriptions and establishing a connection between For example, a trajector role classifier is applied the two modalities. The experiments show these on the phrase nodes and defined as follows: components improve the generalization capabil- ity of our trained models. Since our dataset is object TrajectorRoleClassifier extends Learnable(phrases) { very small, our best embeddings were the com- def label = trajectorRole monly used word2vec (Pennington et al., 2014) override lazy val classifier = new model trained over google’s gigaword+wikipedia SparseNetworkLearner override def feature = using( corpora. headDependencyRelation,...)} Text Embeddings. We generate the embeddings (‘label‘ is a Saul function to define the output of the classifier, for candidate roles. More specifically, for each ‘trajectorRole‘ is a name of a property in the datamodel to be phrase we find its syntactic head and then we use predicted. ‘feature‘ is a Saul function to define the list of the vector representation of the syntactic head as a properties of the datamodel to be used as input features.) feature of the phrase. 
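As a hedged illustration of this head-embedding feature (our own sketch with a hypothetical sensor name, not code from the paper's repository), it could be declared in the same style as the Saul properties shown earlier:

// Sketch only: attach the embedding of a phrase's syntactic head as a feature,
// mirroring the headDependencyRelation property above. `wordVectorOf` is a
// hypothetical sensor returning the word2vec vector of a word.
val headwordEmbedding = property(phrases) { x: Phrase =>
  wordVectorOf(getHeadword(x)).toList     // a real-valued (vector) feature for the phrase
}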
This is added to the rest of All other learners are defined similarly and they linguistically motivated features. can use different types of data-model properties Image Embeddings. For the image side we rely as ‘feature‘s or as ‘label‘. In our proposed model, on a number of assumptions given the type of only the role and pair classifiers are used and image corpora available for our task. As men- triplets of relations are generated based on the re- tioned in Section1, the input images are assumed sults of the pair classifiers afterwards. to be segmented and the segments have been la- beled according to a given ontology of concepts. 3.2.1 Role and Relation Properties For example, the ontology for a specific object Spatial Roles are applied on phrases and most like Bush can be entity->landscape-nature-> of the features are used based on the previous vegetation->trees->bush. Given the image works (Kordjamshidi and Moens, 2015), however segments, the spatial relations between segments the previous work on this data is mostly token- are automatically extracted in a pairwise exhaus- based; we have extended the features to phrase- tive manner using the geometrical properties of the based and added some more features. We use lin- segments (Escalante et al., 2010). These relations guistically motivated features such as lexical form are limited to relationships such as besides, dis- of the words in the phrases, lemmas, pos-tags, de- joint, below, above, x-aligned, and y-aligned. In pendency relations, subcategorization, etc. These this work, we employed the pre-processed images

36 that were publicly available2. Since the segment are able to design various end-to-end configura- label ontology is independent from the textual de- tions for learning and inference. The first step scriptions, finding the alignment between the seg- for an application program is to populate the an- ments and the words/phrases in the text is very notated corpus in the graph schema, that is, our challenging. To alleviate this problem, we exploit declared data-model. To simplify the procedure the embeddings of the image segment labels us- of populating the graph with linguistic annota- ing the same representations that is used for words tions, we have established a generic XML reader in the text. We measure the similarity between that is able to read the annotated corpora from the segment label embeddings and word embed- XML into the Saul data-model and provide us a dings to help the fine-grained alignments between populated graph. The nodes related to the lin- the image segments and text phrases. To clar- guistic units (i.e. sentence, phrase, etc) are pop- ify, we tried the following variations: we compute ulated with the annotations as their properties. the word embeddings of image segment labels and The population can be done in various ways, for words in the text candidate phrases, then we find example, SpRLDatamodel.documents.populate the most similar object in the image to each can- (xmlReader.documentList()) reads the content didate phrase. We use the embedding of the most of DOCUMENT tag or its pre-defined4 equivalent similar object as a feature of the phrase. Another into documents nodes in the data-model. Popu- variation that we tried is to exploit the embeddings lating documents can lead to populating all other of the image segment ontologies. The vector rep- types of nodes such as sentences, tokens, etc if the resentation of each segment label is computed by necessary sensors and edges are specified before- averaging over the representation of all the onto- hand. Saul functions and data-model primitives logical concepts related to that segment. can be used to make graph traversal queries to ac- cess any information that we need from either im- 3.3 Global Constraints age or text for candidate selection, feature extrac- The key point of considering global correlations tion. in our extraction model is formalizing a number The feature extraction includes segmentation of of global constraints and exploiting those in learn- the text and candidate generation for roles and pair ing and inference. The constraints are declared us- relations. Not all tokens are candidates for playing ing first order logical expressions, for example, the trajector roles, most certainly verbs will not play constraint, ”if there exists a trajector or a landmark this role. After populating the data into the graph in the sentence then an indicator should also exist we program the training strategy. We have the pos- in the sentence” , we call it integrity constraint and sibility of training for each concept independently, it is expressed as follows: that is, each declared classifier can call the learn ((sentences(s) >phraseEdge)._exists{x: , for example, trajectorClassifier.learn(). 
Phrase=>(TrajectorRoleClassifier∼ on However, the independently trained classifiers can x is"Trajector") or( exploit the global constraints like the one we de- LandmarkRoleClassifier onx is" Landmark"}))==>((sentences(s) > fined in Section 3.3 and be involved in a global phraseEdge)._exists{y:Phrase=>∼ inference jointly with other role and relation clas- IndicatorRoleClassifier ony is" sifiers. Such a model is referred to as L+I (Pun- Indicator"}) yakanok et al., 2008). Moreover, the parameters of The domain knowledge is inspired from this the declared classifiers can be trained jointly and work (Kordjamshidi and Moens, 2015).3 The for this purpose we need to call joinTrain and first order constraints are automatically converted pass the list of classifiers and the constraints to be to linear algebraic constraints for our underlying used together. We use L+I models in this paper computational models. due to the efficiency of the training.
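To make this compilation step concrete, here is one standard way (our own illustration, not an encoding taken from the paper) that the integrity constraint above can be turned into linear inequalities over 0/1 indicator variables, with z^{tr}_x, z^{lm}_x and z^{sp}_y denoting the trajector, landmark and spatial-indicator decisions for phrases of a sentence s:

\sum_{y \in s} z^{sp}_{y} \;\ge\; z^{tr}_{x} \quad\text{and}\quad \sum_{y \in s} z^{sp}_{y} \;\ge\; z^{lm}_{x} \qquad \text{for every phrase } x \in s,

so that whenever any phrase is labeled as a trajector or a landmark, at least one phrase in the same sentence must be labeled as a spatial indicator.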

4 Application program 5 Computational Model Using the building blocks of a DeLBP includ- The problem we address in this paper is formu- ing a data-model, learners and constraints, we lated as a structured prediction problem as the out-

2 http://www.imageclef.org/photodata
3 Constraints code is available on GitHub.
4 The programmer is able to specify the tags that are related to the base types before reading the xml.

37 put contains a number of spatial roles and relation- phrase by its headword, the pair by the distance of ships that obey some global constraints. In learn- the two headwords, an image segment by the vec- ing models for structured output prediction, given tor representation of its concept). We refer to the a set of N input-output pairs of training examples text-related nodes and image-related nodes differ- E = (xi, yi) : i = 1..N , we learn ently as x and x , respectively. The goal is to { ∈ X × Y } T I an objective function g(x, y; W ) which is a linear map this pair to a set of spatial objects and spatial discriminant function defined over the combined relationships, that is f :(x , x ) y. T I 7→ feature representation of the inputs and outputs de- The output y is represented by a set of la- noted by f(x, y) (Ioannis Tsochantaridis and Al- bels l = l , . . . , l each of which is a prop- { 1 P } tun, 2006): erty of a node. The labels can have semantic relationships. In our model the set of labels is g(x, y; W ) = W, f(x, y) . (1) l = tr, lm, sp, sp.tr, sp.lm, sp.tr.lm . Note that h i { } these labels are applied merely to the parts of the W denotes a weight vector and , denotes a dot h i text, tr, lm and sp are applied on the phrase of product between two vectors. A popular discrimi- a sentence, sp.tr and sp.lm are applied on pairs native training approach is to minimize the follow- of phrases in the sentence, and finally sp.tr.lm is ing convex upper bound of the loss function over applied on triplets of phrases. According to the the training data: terminology used in (Kordjamshidi and Moens, 2015), the labels of atomic components of the text N X (here phrases) are referred to as single-labels and l(W ) = max(g(xi, y; W ) g(xi, yi; W ) +∆(yi, y)), y − i=1 ∈Y the labels that are applied to composed compo- nents of the input such as pairs or triplets are re- the inner maximization is called loss-augmented ferred to as linked-labels. These labels help to rep- inference and finds the so called most violated resent y with a set of indicator functions that indi- constraints/outputs (y) per training example. This cate which segments of the sentence play a specific is the base of inference-based-training models spatial role and which are involved in relations. (IBT). However, the inference over structures can The labels are defined with a graph query that ex- be limited to the prediction time which is known tracts a property from the data-model. The lp(xk) as learning plus inference (L+I) models. L+I uses or shorter lpk denotes an indicator function indi- the independently trained models (this is known cating whether component xk has the label lp. For as piece-wise training as well (Sutton and McCal- example, sp(on) shows whether on plays a spa- lum, 2009)) and has shown to be very efficient and tial role and sp.tr(on, kids) shows whether kids competitive compared to IBT models in various is a trajector of on. As expected, the form of the tasks (Punyakanok et al., 2005). Given this gen- output is dependent on the input since we are deal- eral formalization of the problem we can easily ing with a structured output prediction problem. In consider both configurations of L+I and IBT using our problem setting the spatial roles and relations a declarative representation of our inference prob- are still assigned to the components of the text and lem as briefly discussed in Section4. 
We define the connections, similarities and embeddings from our structured model in terms of first order con- image are used as additional information for im- straints and classifiers. proving the extractions from text. Here in Saul’s generic setting, inputs x and out- The main objective g is written in terms of the puts y are sub-graphs of the data-model and each instantiations of the feature functions, labels and learning model can use parts and substructures of their related blocks of weights wp in w = this graph. In other words, x is a set of nodes [w1, w2,..., wP ], x , . . . , x and each node has a type p. Each { 1 K } x x is described by a set of properties; this set X X k g(x, y; w) = wp, fp(xk, lp) (2) ∈ h i of properties will be converted to a feature vector lp l x C ∈ k∈ lp φp. Given the multimodal setting of our problem, X X = wp, φp(xk) lpk h i xi ’s can represent segments of an image or various l l x C p∈ k∈ lp linguistic units of a text, such as a phrase (atomic * + X X node) or a pair of phrases (composed node), and = wp, (φp(xk)lpk) , l l x C each type is described by its own properties (e.g. a p∈ k∈ lp

38 where fp(xk, lp) are the local joint feature vector task settings (Kordjamshidi et al., 2012; Olek- for each candidate xk. This feature vector is com- sandr Kolomiyets and Bethard, 2013; Pustejovsky puted by scalar multiplication of the input feature et al., 2015) and with a variation in the annota- vector of xk (i.e. φp(xk)), and the output label lpk. tion schemes. This data includes about 1213 sen- Given this objective, we can view the inference tence containing 20,095 words with 1706 anno- task as a combinatorial constrained optimization tated relations. We preferred this data compared given the polynomial g which is written in terms to more recent related corpora (Pustejovsky et al., of labels, subject to the constraints that describe 2015; Oleksandr Kolomiyets and Bethard, 2013) the relationships between the labels (either single for two main reasons. First is the availability of or linked labels). For example, the is-a relation- the aligned images and the second is the static na- ships can be defined as the following constraint, ture of the most spatial descriptions. (l(x ) is 1) (l (x ) is 1), where l and l are Implementation. As mentioned before, we used c ⇒ 0 c 0 two distinct labels that are applicable on the node Saul (Kordjamshidi et al., 2015, 2016) framework with the same type of xc. These constraints are that allows flexible relational feature extraction as added as a part of Saul’s objective, so we have the well as declarative formulation of the global infer- following objective form, which is in fact a con- ence. We extend Saul’s basic data structures and strained conditional model (Chang et al., 2012), sensors to be able to work with multimodal data g = w, f(x, y) ρ, c(x, y) , where c is the and to populate raw as well as annotated text eas- h i − h i constraint function and ρ is the vector of penalties ily into a Saul multimodal data-model. The code for violating each constraint. This representation is available in Github.5 We face the following corresponds to an integer linear program, and thus challenges when solving this problem: the training can be used to encode any MAP problem. Specif- data is very small; the annotation schemes for the ically, the g function is written as the sum of local text and images are very different and they have joint feature functions which are the counterparts been annotated independently; the image annota- of the probabilistic factors: tions regarding the spatial relations include very naively generated exhaustively pairwise relations X X g(x, y; w) = wp, fp(xk, lpk) which are not very relevant to what human de- h i lp l x τ ∈ k∈{ } scribes by viewing the images. We try to address C (3) X| | these challenges by feature engineering, exploit- + ρ c (x, y), m m ing global constraints and using continuous repre- m=1 sentations for text and image segments. We report where C is a set of global constraints that can the results of the following models in Table1. hold among various types of nodes. g can repre- This is our baseline model built with sent a general scoring function rather than the one BM: extensive feature engineering as described in corresponding to the likelihood of an assignment. Section 3.2.1. We train independent classi- Note that this objective is automatically generated fiers for the roles and relations classification based on the high level specifications of learners in this model; and constraints as described in Section3. 
BM+C: This is the BM that uses global con- 6 Experimental Results straints to impose, for example, the integrity and consistency of the assignments of the In this section, we experimentally show the in- roles and relation labels at the sentence level. fluence of our new features, constraints, phrase embeddings and image embeddings and compare BM+C+E: To deal with the lack of lexical in- them with the previous research. formation, the features of roles and relations Data. We use the SemEval-2012 shared tasks are augmented by w2vec word embeddings, data (Kordjamshidi et al., 2012) that consists of the results of this model without using con- textual descriptions of 613 images originally se- straints (BM+E) are reported too; lected from the IAPR TC-12 dataset (Grubinger In this model in addition to text et al., 2006), provided by the CLEF organiza- BM+E+I+C: embeddings, we augment the text phrase fea- tion. In the previous works only the text part of this data has been used in various shared 5https://github.com/HetML/SpRL

                 Trajector              Landmark               Spatial indicator      Spatial triplet
                 Pr     R      F1       Pr     R      F1       Pr     R      F1       Pr     R      F1
BM               56.72  69.57  62.49    72.97  86.21  79.05    94.76  97.74  96.22    75.18  45.47  56.67
BM+C             65.56  69.91  67.66    77.74  87.78  82.46    94.83  96.86  95.83    75.21  48.46  58.94
BM+E             55.87  77.35  64.88    71.47  89.18  79.35    94.76  97.74  96.22    66.50  57.30  61.56
BM+E+C           64.40  76.77  70.04    76.99  89.35  82.71    94.85  97.48  96.15    68.34  57.93  62.71
BM+E+I           56.53  79.29  66.00    71.78  87.44  78.84    94.76  97.74  96.22    64.12  57.08  60.39
BM+E+I+C         64.49  77.92  70.57    77.66  89.18  83.02    94.87  97.61  96.22    66.46  57.61  61.72
BM+E+C-10f       78.49  77.67  78.03    86.43  88.93  87.62    91.70  94.71  93.17    80.85  60.23  68.95
SOP2015-10f      -      -      -        -      -      -        90.5   84     86.9     67.3   57.3   61.7
SemEval-2012     78.2   64.6   70.7     89.4   68.0   77.2     94.0   73.2   82.3     61.0   54.0   57.3

Table 1: Experimental results on SemEval-2012 data including images. BM: Baseline Model, C: Constraints, E: Text Embeddings, I: Image Embeddings.
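For reference (not stated explicitly in the paper, but following standard usage), the F1 columns in Table 1 are the harmonic mean of the corresponding precision and recall columns; e.g., for the BM trajector row:

F_1 = \frac{2 \cdot \mathrm{Pr} \cdot \mathrm{R}}{\mathrm{Pr} + \mathrm{R}} = \frac{2 \times 56.72 \times 69.57}{56.72 + 69.57} \approx 62.49.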

tures with the embeddings of the most similar and landmarks with indicators have been reported, image segments. The version without con- there is no reports on trajecotrs and landmarks pre- straints is denoted as BM+E+I. diction accuracy as designated independent roles SemEval-2012: This model is the best per- –those are left empty in the table. There are forming model of SemEval-2012 (Roberts some differences in our evaluation and the previ- and Harabagiu, 2012). It generates the candi- ous systems evaluations.The SOP2015-10f is eval- date triplets and classifies them as spatial/not- uated by 10-fold cross validation rather than the spatial. It does an extensive feature extraction train/test split. To be able to compare, we re- for the triplets. The roles then are simply in- port the 10-fold cross validation results of our ferred from the relations. The results are re- best model BM+E+C and refer to it as BM+E+C- ported with the same train/test split. 10f in Table1 which is outperforming other mod- SOP2015-10f: This model is an structured els. Note that the folds are chosen randomly and output prediction model that does a global in- might be different from the previous evaluation ference on the whole ontology including the setting. Another difference is that our evaluation prediction of relations and relation types (Ko- is done phrase-based and overlapping phrases are rdjamshidi and Moens, 2015). counted as true predictions. The SemEval-2012 and SOP2015-10f models operate on classifying The experimental results in Table1 show that tokens/words which are the headwords of the an- adding constraints to our baseline and other model notated roles. However, our identified phrases variations consistently improves the classifica- cover the headwords of role (trajectors and land- tion of trajectors and landmarks dramatically al- marks) phrases with 100% and for spatial indica- though it slightly decreases the F1 of spatial in- tors 98% which keeps the comparisons fair yet. dicators in some cases. Adding word embed- dings (BM+C+E) shows a significant improve- Our results exceed the stat-of-the-art models ment on roles and spatial relations. The re- significantly. Both word and image embeddings sults on BM+E+I+C show that image embeddings help expanding our semantic dimensions for spa- improves trajectors and landmarks compared to tial objects but interestingly the spatial indicators BM+E+C, though the results of triples are slightly can not be improved using embeddings. Since the dropped (62.71 61.72). indicators are mostly prepositions, it seems cap- → Our results exceed the state of the art models turing the semantic dimensions of prepositions us- reported in SemEval-2012 (Kordjamshidi et al., ing continuous vectors is harder than other lexical 2012). The SemEval-2012 best model uses same categories such as nouns and verbs. This is even train/test split as ours (Roberts and Harabagiu, worse when we use images since the terminology 2012). The results of the best performing model of the relations in the images is very different from in (Kordjamshidi and Moens, 2015), SOP2015- the way the relations are expressed in the language 10f, are lower than our best model in this work. using prepositions. Though the improvement on Although that model uses structured training but objects can improve the relations but it will be in- here the embeddings make a significant improve- teresting to investigate how the semantics of the ment. 
While SOP2015-10f performance results relations can be captured using richer representa- on triples, spatial indicators, pairs of trajector tions for spatial prepositions. A possible direction

40 for our work could have been to train deep models ages and visual question answering (Antol et al., that map the images to the formal semantic repre- 2015). However the center of attention has been sentations of the text’s content, however for train- understanding images. Here, our aim is to exploit ing such models using only 2013 sentences related the images for text understanding.This task is as to about 600 images will not be feasible. The ex- challenging as the former ones or even more chal- isting large corpora which contain image and text, lenging because among the many possible objects do not contain formal semantic annotation with the and relationships in the image a very small subset textual description. Dealing with this problems re- of those are important and have been expressed in mains as our future work. the text. Therefore the available visual corpora are not exactly the type of the data that can be used 7 Related Research to train supervised models for our task though it could provide some indirect supervision particu- This work can be related to many research works larly for having a unified semantic representation from various perspectives. However, for the sake of spatial objects (Ludwig et al., 2016). of both clarity and conciseness, we limit our ex- This work can be improved by exploiting exter- ploration in this section to two research directions. nal models and corpora (Pustejovsky and Yocum, First body of related work is about the specific 2014) but this will remain for our future investiga- SpRL task that we are solving. This direction is tion. Our task can benefit from the research per- aiming at obtaining a generic formal spatial mean- formed on reference resolution that targets identi- ing representation from text. The second body fying the objects in the image that are mentioned of the work is about combining vision and lan- in the text (Schlangen et al., 2016). Having a high- guage which itself has a large research community quality alignment by training explicit models for around it recently and has turned to a hot topic. resolving references should help recognizing the Several research efforts in the past few years spatial objects mentioned in the text and the type aimed at defining a framework for the extrac- of spatial relations according to the image. Ex- tion of spatial information form natural language. plicit reference resolution between modalities in These efforts start from defining linguistic an- dialogue systems are also inspiring (Fang et al., notation schemes (Pustejovsky and Moszkowicz, 2014). In the mentioned reference a graph repre- 2008; Kordjamshidi et al., 2010; Pustejovsky and sentation of the scene is gradually made by ma- Moszkowicz, 2012; jeet Mani, 2009), annotating chine based on the grasped static visual informa- data and defining tasks (Kordjamshidi et al., 2012; tion and the representation is corrected and com- Oleksandr Kolomiyets and Bethard, 2013; Puste- pleted dynamically as the dialogue between the jovsky et al., 2015) to operate on the annotated machine and human is going on. However, in this corpora and learn extraction models. However, work there is no learning component and there there exists, yet, a large gap between the current is no spatially annotated data to be used for our models and the ones that can perform reasonably goal of formal spatial meaning representation for well in practice for real world applications in vari- a generic text. ous domains. 
Though we follow that line of work, In this work we take a small step and investi- we aim at exploiting the visual data in improving gate the ways to integrate information from both spatial extraction models. We exploit the visual in- modalities for our textual extraction target. Our formation accompanying the text which is mostly results are compared to the previous work (Kord- available nowadays. We aim at text understanding jamshidi and Moens, 2015) that exploit the text while assuming that the text highlights the most part of the same spatially annotated corpora and important information that might be confirmed by improve the results when exploiting the accompa- the image. Our goal is to use the image to rec- nying images. ognize and disambiguate the spatial information from text. 8 Conclusion Our work is very related to the research done by community and in the inter- In this paper, we deal with the problem of spa- section of vision and language. There are many tial role labeling which targets at mapping natural progressive research works on generating image language text to a formal spatial meaning repre- captions (Karpathy and Li, 2014), retrieving im- sentation. We use the information from accom-

41 panying segmented images to improve the spatial methods for structured and interdependent output role extractions. Although, there are many recent variables. Journal of Machine Learning Research research on combining vision and language, none 6(2):1453–1484. of them consider obtaining a formal spatial mean- jeet Mani. 2009. SpatialML: annotation scheme ing representation as a target while we do and our for marking spatial expression in natural language. approach will be helpful for adding explicit rea- Technical Report Version 3.0, The MITRE Corpo- soning component to the learning models in the ration. future. We manifest the expressivity of declarative Andrej Karpathy and Fei-Fei Li. 2014. Deep visual- learning based programming paradigm for design- semantic alignments for generating image descrip- ing global models for this task. We put both the tions. CoRR abs/1412.2306. image and text related to a scene in a unified data- Ryan Kiros, Ruslan Salakhutdinov, and Richard S. model graph and use them as structured learning Zemel. 2014. Unifying visual-semantic embeddings examples. We extract features by traversing the with multimodal neural language models. CoRR graph and using the continuous representations to abs/1411.2539. connect the image segment nodes to the nodes re- Parisa Kordjamshidi, Steven Bethard, and Marie- lated to the text phrases. We exploit the continu- Francine Moens. 2012. SemEval-2012 task 3: Spa- ous representation to align the similar concepts in tial role labeling. In Proceedings of the First Joint the two modalities. We exploit global first order Conference on Lexical and Computational Seman- constraints for global inference over roles and re- tics: Proceedings of the Sixth International Work- shop on Semantic Evaluation (SemEval). volume 2, lations. Our models improve the state of the art pages 365–373. results on previous spatial role labeling models. Parisa Kordjamshidi, Daniel Khashabi, Christos Christodoulopoulos, Bhargav Mangipudi, Sameer References Singh, and Dan Roth. 2016. Better call saul: Flex- ible programming for learning and inference in nlp. Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Mar- In Proc. of the International Conference on Compu- garet Mitchell, Dhruv Batra, C. Lawrence Zitnick, tational Linguistics (COLING). and Devi Parikh. 2015. VQA: Visual Question An- swering. In International Conference on Computer Parisa Kordjamshidi and Marie-Francine Moens. Vision (ICCV). 2015. Global machine learning for spatial on- tology population. Web Semant. 30(C):3–21. Ming-Wei Chang, Lev Ratinov, and Dan Roth. 2012. https://doi.org/10.1016/j.websem.2014.06.001. Structured learning with constrained conditional models. Machine Learning 88(3):399–431. Parisa Kordjamshidi, Taher Rahgooy, Marie-Francine Moens, James Pustejovsky, Umar Manzoor, and Hugo Jair Escalante, Carlos A. Hernndez, Je- Kirk Roberts. 2017a. CLEF 2017: Multimodal Spa- sus A. Gonzalez, A. Lpez-Lpez, Manuel tial Role Labeling (mSpRL) Task Overview. In Julio Montes, Eduardo F. Morales, L. Enrique Su- Gonzalo Liadh Kelly Lorraine Goeuriot Thomas car, Luis Villaseor, and Michael Grubinger. Mandl Linda Cappellato Nicola Ferro Gareth J. 2010. The segmented and annotated IAPR F. Jones, Samus Lawless, editor, Experimental IR tc-12 benchmark. Computer Vision and{ Im-} Meets Multilinguality, Multimodality, and Interac- age Understanding 114(4):419 – 428. Special tion 8th International Conference of the CLEF As- issue on Image and Video Retrieval Evaluation. 

Boosting Information Extraction Systems with Character-level Neural Networks and Free Noisy Supervision

Philipp Meerkamp (Bloomberg LP, 731 Lexington Avenue, New York, NY 10022, USA) [email protected]
Zhengyi Zhou (AT&T Labs Research, 33 Thomas Street, New York, NY 10007, USA) [email protected]

Abstract

We present an architecture to boost the precision of existing information extraction systems. This is achieved by augmenting the existing parser, which may be constraint-based or hybrid statistical, with a character-level neural network. Our architecture combines the ability of constraint-based or hybrid extraction systems to easily incorporate domain knowledge with the ability of deep neural networks to leverage large amounts of data to learn complex features. The network is trained using a measure of consistency between extracted data and existing databases as a form of cheap, noisy supervision. Our architecture does not require large-scale manual annotation or a system rewrite. It has led to large precision improvements over an existing, highly-tuned production information extraction system used at Bloomberg LP for financial language text.

1 Introduction

1.1 Information extraction in finance

Unstructured textual data is abundant in the financial domain (see Figure 1). This type of text is usually not in a format that lends itself to immediate processing. Hence, information extraction is an essential step in many business applications that wish to use the financial text data. Examples of such business applications include creating time series databases for macroeconomic forecasting, or real-time extraction of time series data to inform algorithmic trading strategies. For example, consider extracting data from Figure 1 into numerical relations of the form ts_tick_abs(TS symbol, numerical value), e.g. ts_tick_abs(US Unemployment, 4.9%), or ts_tick_rel(TS symbol, change in num. value), e.g. ts_tick_rel(US Hourly Earnings, 2.8%). For these business applications, the extraction needs to be fast, accurate, and low-cost.

Figure 1: Tweet containing financial data.

There are several challenges to extracting information from financial language text.

• Financial text can be very ambiguous. Language in the financial domain often trades grammatical correctness for brevity. A multitude of numeric tokens need to be disambiguated into entities such as prices, rates, percentage changes, quantities, dates, times, and others. Finally, many words could be company names, stock symbols, domain-specific terms, or have entirely different meanings.

• Large-scale annotated data with ground truths is typically not available. Domain expert manual annotations are expensive and limited in scale. This is especially a problem because the size of the problem domain is large, with many types of financial instruments, language conventions, and text formats.

These two challenges lead to a high risk of extracting false positives.

Bloomberg has mature information extraction systems for financial language text, which are the result of nearly a decade of effort.


When improving upon such large industrial extraction systems, it is invaluable if the proposed improvement does not require a complete system rewrite.

The existing extraction systems leverage both constraint-based and statistical methods. Constraint-based methods can effectively incorporate domain knowledge (e.g. unemployment numbers cannot be negative numbers, while changes in unemployment numbers can be negative). These constraints can be complex, and may involve multiple entities within an extraction candidate. Adding a single constraint can in many cases vastly improve the system's accuracy. In a purely statistical system, achieving the equivalent behavior would require large amounts of labeled training data.

These existing systems generate extractions with high recall, and in this work, we propose to boost the precision of the existing systems using a deep neural network.

1.2 Our contribution

We present an information extraction architecture that boosts the precision of an existing parser using a deep neural network. The architecture gains the neural network's ability to leverage large amounts of data to learn complex features specific to the application at hand. At the same time, the existing parser may leverage constraints to easily incorporate domain knowledge. Our method uses a potentially noisy but cheaply available source of supervision, e.g. via consistency checks of extracted data against existing databases (e.g. an extraction ts_tick_abs(US Unemployment, 49%) would be implausible given recent US employment history), or via human interaction (e.g., clicks on online advertisements).

Our extraction system has two main advantages:

• We improve the existing extraction systems cheaply using large amounts of free noisy labels, without the need for manual annotation. This is particularly valuable in large application domains, where manual annotations covering the entire domain would be prohibitively expensive.

• Our method leverages existing codebases fully, and requires no system rewrite. This is critical in large industrial extraction systems, where a rewrite would take many man-years. Even for new systems, the decoupling of candidate generation and the neural network offers advantages: the candidate-generating parser can easily enforce constraints that would be difficult to incorporate in a system relying entirely on neural networks.

We are not aware of alternative approaches that achieve the above: purely neural network or purely statistical approaches require substantial amounts of human annotation, while our method does not. Constraint-based or hybrid statistical approaches are competitors to the existing extraction systems rather than to our work; our work could also be used to boost the precision of other state-of-the-art constraint-based or hybrid methods. We believe that our approach, with small modifications, can be applied to extraction systems in many other applications. Compared to the existing, client-serving extraction engine, our system reduced the number of false positive extractions by more than 90%. Our system constituted a substantial improvement and is being deployed to production.

We review some related work in Section 1.3. Section 2 details the design, training, and supervision of our system. We present results in Section 3 and conclude with some discussions in Section 4.

1.3 Related Work

Deep neural networks have recently been applied to several problems in natural language processing. Mikolov introduced the recurrent network language model (Kombrink et al., 2011), which can for instance consume the beginning of a sentence word by word, encoding the sentence in its state vector, and predict a likely continuation of the sentence. Long Short-Term Memory (LSTM) architectures (Gers, 2001; Hochreiter and Schmidhuber, 1997) have resulted in large improvements in machine translation (Sutskever et al., 2014). There is a growing literature of LSTMs and deep convolutional architectures being used for text classification (e.g. Zhang et al., 2015; Kim, 2014; dos Santos and Gatti, 2014) and language modeling problems (Kim et al., 2016; Karpathy, 2015).

Several studies have considered using deep neural networks for information extraction problems. Socher et al. introduce compositional vector grammars, which combine probabilistic context-free grammars with a recursive neural network (2013).

Nguyen et al. use convolutional neural networks (CNN) and recurrent neural networks with word- and entity-position-embeddings for relation extraction and event detection (Nguyen et al., 2016; Nguyen and Grishman, 2015). Zhou and Xu use a bidirectional (Schuster and Paliwal, 1997) word-level LSTM combined with a conditional random field (CRF) (Lafferty et al., 2001) for semantic role labeling (2015). In some cases, providing a neural network character-level rather than word-level input can be beneficial. Chiu and Nichols use a bidirectional LSTM-CNN with character embeddings for named entity recognition (2015), and Ballesteros et al. found that character-level LSTMs improve dependency parsing accuracy in many languages, relative to word-level approaches (2015). Recently, Miwa and Bansal presented an end-to-end LSTM-based architecture for relation extraction, achieving state-of-the-art results on SemEval-2010 Task 8 (2016).

While much of the recent work on information extraction focuses on statistical methods and neural networks, constraint-based information extraction systems are still commonly used in industry applications. Chiticariu et al. suggest that this might be because constraint-based systems can easily incorporate domain knowledge (2013). The need to incorporate constraints into information extraction systems, as well as the difficulty of doing this efficiently, have been recognized for some time.

Several hybrid approaches combine constraint-based and statistical methods. Toutanova et al. model dependencies between semantic frame arguments with a CRF (2008). Chang et al. incorporate declarative constraints into statistical sequence-based models to obtain constrained conditional models (2012). When making a prediction (inference step), their model solves an integer linear program. This typically involves maximizing a scoring function, based e.g. on a CRF, over all outputs that satisfy the constraints, resulting in high computational costs. Täckström et al. recently proposed a more computationally efficient approach to semantic role labeling with constraints, introducing a dynamic programming approach for inference for log-linear models with constraints (2015).

There are several purely statistical approaches, such as those using hidden Markov models (Baum and Petrie, 1966), CRFs (Lafferty et al., 2001) or structured support vector machines (Tsochantaridis et al., 2004).

Our approach is fundamentally different from all of the above. We propose to boost the precision of an existing information extraction system using a neural network trained with free noisy supervision. Our method does not require large amounts of human annotations as purely statistical or neural network-based approaches would. Our work could also be used to boost the precision of other state-of-the-art constraint or hybrid methods.

2 Design

2.1 Overview

The information extraction pipeline we developed consists of four stages (see the right pane, "Execution", in Figure 2); a schematic code sketch of the pipeline follows the list.

1. Generate candidate extractions: The document is parsed using a potentially constraint-based or hybrid statistical parser, which outputs a set of candidate extractions. Each candidate extraction consists of the character offsets of all extracted constituent entities, and a representation of the extracted relation. It may additionally contain auxiliary information that the parser may have generated, such as part-of-speech tags.

2. Score extractions using the trained neural network: Each candidate extraction, together with the section of the document it was found in, is encoded into feature data X. A deep neural network is used to compute a neural network correctness score s̃ for each extraction candidate. We detail the input and architecture of the neural network in Section 2.2 and its training in Section 2.3.

3. Score extractions based on cheap, noisy supervision: We compute a consistency score s for the candidate extraction, measuring whether the extracted relation is consistent with cheaply available, but potentially noisy, supervision from e.g. an existing database. We discuss how s is computed in Section 2.4.

4. Classify extractions as correct or incorrect: A linear classifier classifies extraction candidates as correct or incorrect extractions, based on the neural network correctness score s̃, the database consistency score s, and potentially other features. Candidates classified as incorrect are discarded.
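The sketch below illustrates how these four stages compose. The component names (candidate_parser, scoring_network, consistency_score, linear_classifier, encode) are placeholders introduced here for illustration, not the system's actual interfaces.

```python
# Minimal sketch of the four-stage pipeline described above. The callables
# passed in stand for the existing parser, the neural scorer, the database
# consistency check, and the final linear classifier.
def extract(document, candidate_parser, encode, scoring_network,
            consistency_score, linear_classifier):
    accepted = []
    # Stage 1: high-recall candidate generation by the existing parser.
    for candidate in candidate_parser(document):
        # Stage 2: encode candidate + surrounding text, score with the network.
        features = encode(candidate, document)        # feature data X
        s_tilde = scoring_network(features)           # network correctness score
        # Stage 3: consistency of the extracted relation with reference data.
        s = consistency_score(candidate)              # noisy database consistency score
        # Stage 4: linear classifier over (s_tilde, s); discard rejected candidates.
        if linear_classifier([s_tilde, s]) == "correct":
            accepted.append(candidate)
    return accepted
```

Stage 1 preserves recall, while stages 2-4 are responsible for the precision boost.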


Figure 2: Training (left) and execution (right) set-up. Blocks marked “L” are neural network LSTM cells, while blocks marked “F” are fully-connected layers.


Figure 3: We use a neural network comprised of (i) an LSTM processing encoded parser output and document text character-by-character (labeled LSTM, green), (ii) a fully-connected layer (FC, blue) that takes document-level features as input, and (iii) a fully-connected layer (FC, grey) that takes the output vectors of layers (i) and (ii) as input to generate a correctness estimate for the extraction candidate. Layer (iii) uses a sigmoid activation function to generate a correctness estimate ỹ ∈ (0, 1), from which we compute the network correctness score as s̃ := σ^{-1}(ỹ).

2.2 Neural network input and architecture

The neural network processes each candidate extraction independently. To estimate the correctness of an extracted candidate, the network is provided with two pieces of input (see Figure 3 for the full structure of the neural network).

First, the network is provided with a vector g (right box in Figure 3) containing document- or extraction-level features, such as attributes of the document's author, or word-level n-gram features. The second piece of input consists of a sequence of vectors f_i (left box in Figure 3), encoding the document text and the parser's output at the character level. There is one vector f_i for each character c_i of the document section where the extraction candidate is found.

The vectors f_i are a concatenation of (i) a one-hot encoding of the character, and (ii) information about entities the parser identified at the position of c_i. For (i) we use a restricted character set of size 94, including [a-zA-Z0-9], several whitespace and special characters, and an indicator to represent characters not present in our character set. For (ii), we include in f_i a vector of indicators specifying whether or not any of the entities appearing in the relations supported by the parser were found at the position of character c_i.

The input vectors f_i feed into an LSTM, which accumulates state until the input sequence has been exhausted. The global feature vector g feeds into a fully-connected layer, which is then concatenated with the output of the LSTM, and passed through another fully-connected layer with sigmoid activation σ to produce a network correctness estimate ỹ. We subsequently compute the network correctness score s̃ for the candidate extraction via s̃ = σ^{-1}(ỹ).

We regularize the LSTM weight matrices and state vector using dropout (see (Srivastava et al., 2014) and (Zaremba et al., 2015)). Similar to (Sutskever et al., 2014), we found that reversing the LSTM's input resulted in a substantial increase in accuracy.
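A minimal PyTorch sketch of a network of this shape is given below. The layer sizes, module names, and the omission of dropout are simplifying assumptions for illustration, not the production configuration.

```python
# Sketch of the candidate scoring network: an LSTM over per-character vectors
# f_i (one-hot character + entity indicators), a fully-connected layer over the
# global feature vector g, and a final sigmoid layer producing y_tilde.
import torch
import torch.nn as nn

class CandidateScorer(nn.Module):
    def __init__(self, char_feat_dim, global_feat_dim, lstm_units=256, fc_units=128):
        super().__init__()
        self.lstm = nn.LSTM(char_feat_dim, lstm_units, batch_first=True)
        self.global_fc = nn.Sequential(nn.Linear(global_feat_dim, fc_units), nn.ReLU())
        self.out = nn.Linear(lstm_units + fc_units, 1)

    def forward(self, f_seq, g):
        # f_seq: (batch, seq_len, char_feat_dim). Reverse the sequence, as the
        # paper reports this improves accuracy.
        f_rev = torch.flip(f_seq, dims=[1])
        _, (h_n, _) = self.lstm(f_rev)              # final LSTM state
        h = torch.cat([h_n[-1], self.global_fc(g)], dim=-1)
        y_tilde = torch.sigmoid(self.out(h))        # correctness estimate in (0, 1)
        s_tilde = torch.log(y_tilde) - torch.log1p(-y_tilde)  # s_tilde = sigma^{-1}(y_tilde)
        return y_tilde, s_tilde
```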
2.3 Neural network training

We train the neural network by referencing candidates extracted by the high-recall candidate-generating parser against a potentially noisy reference source (see Figure 2, left panel on "Training"). In our application, this reference is Bloomberg's proprietary database of historical time series data, which enables us to check how well the extracted numerical data fits into time series in the database. Concretely, we compute a consistency score s ∈ (−∞, ∞) that measures the degree of consistency with the database. Depending on the application, the score may for instance be a squared relative error, an absolute error, or a more complex error function. In many applications, the score s will be noisy (see Section 2.4 for further discussion). We threshold s to obtain binary correctness labels y ∈ {0, 1}. We then use the binary correctness labels y for supervised neural network training, with binary cross-entropy loss as the loss function. This allows us to train a network that can compute a pseudo-likelihood ỹ ∈ (0, 1) of a given extraction candidate to agree with the database. Thus, ỹ estimates how likely the extraction candidate is correct.

The neural network's training data consists of candidates generated by the candidate-generating parser, and noisy binary consistency labels y. In our application, the database labeled a large majority of candidates as correct. To obtain balanced training data, we generated 6 sets of training data, each containing the same set of around 1 million negative cases and disjoint sets of 1 million positive cases each. Correspondingly, we trained an ensemble of 6 networks, and averaged their network scores s̃. We found this to work much better than oversampling the smaller class.

2.4 Noisy supervision

We assume that the noise in the source of supervision is limited in magnitude, e.g. < 5%. We moreover assume that there are no strong patterns in the distribution of the noise: if the noise correlates with certain attributes of the candidate extraction, the pseudo-likelihoods ỹ might no longer be a good estimate of the candidate extraction's probability of being a correct extraction.

There are two sources of noise in our application's database supervision. First, there is a high rate of false positives. It is not rare for the parser to generate an extraction candidate ts_tick_abs(TS symbol, numerical value) in which the numerical value fits into the time series of the time series symbol, but the extraction is nonetheless incorrect. False negatives are also a problem: many financial time series are sparse and are rarely observed. As a result, it is common for differences between reference numerical values and extracted numerical values to be large even for correct extractions. Limits for acceptable differences between extracted data and reference data are incorporated into the computation of the database consistency score s. The formula for computing s is application dependent, and may involve auxiliary data related to the extracted entities.
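The following sketch illustrates how a noisy consistency score s could be thresholded into a binary label y and used with binary cross-entropy training. It assumes the CandidateScorer module from the previous sketch; the threshold value and the score function are application-dependent assumptions rather than the paper's exact formulas.

```python
# Sketch of noisy supervision: threshold a database consistency score s into a
# binary label y, then train the scorer with binary cross-entropy loss.
import torch
import torch.nn.functional as F

def make_label(s, threshold=0.0):
    # y = 1 if the extracted value is sufficiently consistent with the database.
    return 1.0 if s >= threshold else 0.0

def training_step(model, optimizer, f_seq, g, s_scores, threshold=0.0):
    y = torch.tensor([make_label(s, threshold) for s in s_scores]).unsqueeze(1)
    y_tilde, _ = model(f_seq, g)
    loss = F.binary_cross_entropy(y_tilde, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```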

3 Results

Compared to using the existing, highly-tuned, client-facing extraction systems, this work reduced the number of false positive extractions by more than 90%, while the drop in recall was smaller than 1%. Our character-level neural network considerably boosted the precision of the existing information extraction systems, and is being deployed to production.

We are not aware of alternative approaches that can improve an existing extraction engine in a way that is directly comparable to ours. Purely neural network-based or purely statistical approaches require large amounts of human annotation, while our method does not. Constraint-based or hybrid statistical approaches are competitors to the existing extraction system rather than to our work; indeed, our work could also be used to boost the precision of state-of-the-art constraint or hybrid methods.

When trained with 256 LSTM hidden units and 2 million samples for 150 epochs, the neural network alone achieved a training accuracy of 95.8% and a test accuracy of 94.9%, relative to the noisy database supervision. Note that since the supervision provided by the databases is imperfect, it is not unexpected that the network's accuracy is substantially below 100%. We examined the errors in the training set, and found that they are primarily due to the network generalizing correctly, with the network being correct almost always when strongly disagreeing with the database's label.

We include in Figure 4 how data size and network size affect network accuracy. Note that the network's accuracy decreases substantially if the network size (number of LSTM units) is reduced. Accuracy also decreases if smaller quantities of training data are available. To achieve acceptably small latencies in client-serving scenarios, we limit the network size to 256 LSTM hidden units, even though larger networks could achieve slightly greater accuracies. We moreover limited the document input text provided to the LSTM neural network to the lines on which the extraction candidate was found.

Figure 4: Neural network accuracies for different network and training data sizes.

Besides the neural network architecture described in this paper, we also considered an approach based on character-level n-grams (see Figure 5). In this approach, we replaced the LSTM unit of the neural network with a single fully-connected layer. At the same time, we replaced the character-level input sequence f_0, . . . , f_k with a single binary vector representing a bag-of-words of n-grams. More precisely, we computed character-level feature vectors f'_0, . . . , f'_k analogously to the f_i, except that the f'_i contain encodings of character classes (upper case letter, lower case letter, digit, and the same list of special characters used in the f_i) rather than of characters themselves. Each binary vector f'_i was then mapped to a string f̃'_i containing the same sequence of 0s and 1s as f'_i. Finally, we computed n-grams over the sequence of strings f̃'_i. All n-gram features were tf-idf normalized. We cross-validated on a validation set to tune various hyperparameters, and found that using 2-grams up to 12-grams worked best. The resulting classifier achieved an accuracy of 96.9% on the training set, and 90.4% on the test set. This constitutes a big gap to the 94.9% test accuracy achieved by the neural network, especially since we estimate the accuracy of the database supervision to be around 95%. This suggests that the recurrent neural network was able to internally compute higher-quality features than the bag-of-words n-gram features.
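As a rough illustration of the n-gram baseline's feature construction, the sketch below maps characters to character classes and builds tf-idf character n-gram features. It is simplified: it ignores the parser-annotation indicators included in the paper's f'_i vectors, and the class definitions and vectorizer settings are assumptions.

```python
# Simplified character-class n-gram features with tf-idf normalization.
from sklearn.feature_extraction.text import TfidfVectorizer

def char_class(ch):
    if ch.isupper():
        return "U"
    if ch.islower():
        return "l"
    if ch.isdigit():
        return "d"
    return "s"  # whitespace / special characters

def to_class_string(text):
    return "".join(char_class(ch) for ch in text)

# analyzer="char" with ngram_range=(2, 12) mirrors the 2-grams to 12-grams
# reported to work best; features are tf-idf normalized by the vectorizer.
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(2, 12))
X = vectorizer.fit_transform([to_class_string("IBM up 7%")])
```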
4 Discussion

In this work, we presented an architecture for information extraction from text using a combination of an existing parser and a deep neural network. The deep neural network can boost the precision of an existing high-recall parser. To train the neural network, we use measures of consistency between extracted data and existing databases as a form of noisy supervision.


Figure 5: Extraction candidate correctness estimation with an n-gram-based classifier. It differs from Figure 3 only in the green fully-connected layer labeled FC, and its input.

The architecture resulted in substantial improvements over mature and highly-tuned information extraction systems for financial language text at Bloomberg. While we used time series databases to derive measures of consistency for candidate extractions, our set-up can easily be applied to a variety of other information extraction tasks for which potentially noisy reference data is cheaply available.

We believe our encoding of document text and parser output makes learning particularly easy for the neural network. We encode the candidate-generating parser's document annotations character-by-character into vectors f_i that also include a one-hot encoding of the character itself. We believe that this encoding makes it easier for the network to learn character-level characteristics of the entities that are part of the semantic frame. As an additional benefit, our encoding lends itself well to processing both by recurrent architectures (processing character-by-character input vectors f_i) and convolutional architectures (performing 1D convolutions over an input matrix whose columns are vectors f_i).

Our architecture can easily incorporate global attributes of the document. In our application, we found it useful to add one-hot encoded n-gram and word shape features of the document, allowing the neural network to consider information that might be located far from where the extraction candidate was found. Other potentially useful features could be based on document length, document embeddings, document creator features, and more.

We experimented with other neural network architectures. A slight variation would be to use a bidirectional LSTM instead of a simple LSTM. In addition to LSTM architectures, we experimented with character-level convolutional neural networks. In this setting, we concatenated the vectors f_i into a single matrix for all character indices, performed 1D convolutions and pooling operations 3 times, and passed the result through a fully-connected layer. We found the performance of this approach to be very similar to that of the LSTM we used. Finally, a hybrid approach, stacking an LSTM on a single convolutional layer, gave very similar results as the LSTM and convolutional architectures.

One could use a variety of sources of information as distant and cheap supervision. While we used existing databases, other applications may use supervision e.g. from user interactions with the extracted data, say if the extracted data is presented to a user who can accept, modify, or reject the extraction. In such a system, the linear classifier classifying candidate extractions would have only a single feature, i.e. the neural network score.

Acknowledgments

We would like to thank our managers Alex Bozic, Tim Phelan, and Joshwini Pereira for supporting this project, as well as David Rosenberg from the CTO's office for providing access to GPU infrastructure.

References

Miguel Ballesteros, Chris Dyer, and Noah A. Smith. 2015. Improved transition-based parsing by modeling characters instead of words with LSTMs. In EMNLP. pages 349–359.

Leonard E. Baum and Ted Petrie. 1966. Statistical inference for probabilistic functions of finite state Markov chains. Annals of Mathematical Statistics.

Ming-Wei Chang, Lev Ratinov, and Dan Roth. 2012. Structured learning with constrained conditional models. Machine Learning 88(3):399–431.

Laura Chiticariu, Yunyao Li, and Frederick R. Reiss. 2013. Rule-based information extraction is dead! Long live rule-based information extraction systems! In EMNLP. pages 827–832.

Jason P.C. Chiu and Eric Nichols. 2015. Named entity recognition with bidirectional LSTM-CNNs. http://arxiv.org/pdf/1511.08308v4.pdf.

Cicero dos Santos and Maira Gatti. 2014. Deep convolutional neural networks for sentiment analysis of short texts. In COLING. pages 69–78.

Felix Gers. 2001. Long Short-Term Memory in Recurrent Neural Networks. Ph.D. thesis, École polytechnique fédérale de Lausanne.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.

Andrej Karpathy. 2015. The unreasonable effectiveness of recurrent neural networks. http://karpathy.github.io/2015/05/21/rnn-effectiveness/.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In EMNLP. pages 1746–1751.

Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. 2016. Character-aware neural language models. In AAAI. pages 2741–2749.

Stefan Kombrink, Tomas Mikolov, Martin Karafiát, and Lukáš Burget. 2011. Recurrent neural network based language modeling in meeting recognition. In INTERSPEECH.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML. pages 282–289.

Makoto Miwa and Mohit Bansal. 2016. End-to-end relation extraction using LSTMs on sequences and tree structures. In ACL. pages 1105–1116.

Thien Huu Nguyen, Kyunghyun Cho, and Ralph Grishman. 2016. Joint event extraction via recurrent neural networks. In NAACL. pages 300–309.

Thien Huu Nguyen and Ralph Grishman. 2015. Event detection and domain adaptation with convolutional neural networks. In ACL. pages 365–371.

M. Schuster and K. K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45(11):2673–2681.

Richard Socher, John Bauer, Christopher D. Manning, and Andrew Y. Ng. 2013. Parsing with compositional vector grammars. In ACL. pages 455–465.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. JMLR 15(1):1929–1958.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS. pages 3104–3112.

Oscar Täckström, Kuzman Ganchev, and Dipanjan Das. 2015. Efficient inference and structured learning for semantic role labeling. Transactions of the Association for Computational Linguistics 3:29–41.

Kristina Toutanova, Aria Haghighi, and Christopher D. Manning. 2008. A global joint model for semantic role labeling. Computational Linguistics 34(2).

Ioannis Tsochantaridis, Thomas Hofmann, Thorsten Joachims, and Yasemin Altun. 2004. Support vector machine learning for interdependent and structured output spaces. In ICML. pages 104–.

Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 2015. Recurrent neural network regularization. https://arxiv.org/abs/1409.2329.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In NIPS. pages 649–657.

Jie Zhou and Wei Xu. 2015. End-to-end learning of semantic role labeling using recurrent neural networks. In ACL. pages 1127–1137.

Piecewise Latent Variables for Neural Variational Text Processing

Iulian V. Serban^1* and Alexander G. Ororbia II^2* and Joelle Pineau^3 and Aaron Courville^1
^1 Department of Computer Science and Operations Research, Université de Montréal
^2 College of Information Sciences & Technology, Penn State University
^3 School of Computer Science, McGill University
iulian [DOT] vlad [DOT] serban [AT] umontreal [DOT] ca, ago109 [AT] psu [DOT] edu,
jpineau [AT] cs [DOT] mcgill [DOT] ca, aaron [DOT] courville [AT] umontreal [DOT] ca

Abstract

Advances in neural variational inference have facilitated the learning of powerful directed graphical models with continuous latent variables, such as variational autoencoders. The hope is that such models will learn to represent rich, multi-modal latent factors in real-world data, such as natural language text. However, current models often assume simplistic priors on the latent variables — such as the uni-modal Gaussian distribution — which are incapable of representing complex latent factors efficiently. To overcome this restriction, we propose the simple, but highly flexible, piecewise constant distribution. This distribution has the capacity to represent an exponential number of modes of a latent target distribution, while remaining mathematically tractable. Our results demonstrate that incorporating this new latent distribution into different models yields substantial improvements in natural language processing tasks such as document modeling and natural language generation for dialogue.

1 Introduction

The development of the variational autoencoder framework (Kingma and Welling, 2014; Rezende et al., 2014) has paved the way for learning large-scale, directed latent variable models. This has led to significant progress in a diverse set of machine learning applications, ranging from computer vision (Gregor et al., 2015; Larsen et al., 2016) to natural language processing tasks (Mnih and Gregor, 2014; Miao et al., 2016; Bowman et al., 2015; Serban et al., 2017b). It is hoped that this framework will enable the learning of generative processes of real-world data — including text, audio and images — by disentangling and representing the underlying latent factors in the data. However, latent factors in real-world data are often highly complex. For example, topics in newswire text and responses in conversational dialogue often possess latent factors that follow non-linear (non-smooth), multi-modal distributions (i.e. distributions with multiple local maxima).

Nevertheless, the majority of current models assume a simple prior in the form of a multivariate Gaussian distribution in order to maintain mathematical and computational tractability. This is often a highly restrictive and unrealistic assumption to impose on the structure of the latent variables. First, it imposes a strong uni-modal structure on the latent variable space; latent variable samples from the generating model (prior distribution) all cluster around a single mean. Second, it forces the latent variables to follow a perfectly symmetric distribution with constant kurtosis; this makes it difficult to represent asymmetric or rarely occurring factors. Such constraints on the latent variables increase pressure on the down-stream generative model, which in turn is forced to carefully partition the probability mass for each latent factor throughout its intermediate layers. For complex, multi-modal distributions — such as the distribution over topics in a text corpus, or natural language responses in a dialogue system — the uni-modal Gaussian prior inhibits the model's ability to extract and represent important latent structure in the data. In order to learn more expressive latent variable models, we therefore need more flexible, yet tractable, priors.

* The first two authors contributed equally.


In this paper, we introduce a simple, flexible prior distribution based on the piecewise constant distribution. We derive an analytical, tractable form that is applicable to the variational autoencoder framework and propose a differentiable parametrization for it. We then evaluate the effectiveness of the distribution when utilized both as a prior and as approximate posterior across variational architectures in two natural language processing tasks: document modeling and natural language generation for dialogue. We show that the piecewise constant distribution is able to capture elements of a target distribution that cannot be captured by simpler priors — such as the uni-modal Gaussian. We demonstrate state-of-the-art results on three document modeling tasks, and show improvements on a dialogue natural language generation task. Finally, we illustrate qualitatively how the piecewise constant distribution represents multi-modal latent structure in the data.

2 Related Work

The idea of using an artificial neural network to approximate an inference model dates back to the early work of Hinton and colleagues (Hinton and Zemel, 1994; Hinton et al., 1995; Dayan and Hinton, 1996). Researchers later proposed Markov chain Monte Carlo methods (MCMC) (Neal, 1992), which do not scale well and mix slowly, as well as variational approaches which require a tractable, factored distribution to approximate the true posterior distribution (Jordan et al., 1999). Others have since proposed using feed-forward inference models to initialize the mean-field inference algorithm for training Boltzmann architectures (Salakhutdinov and Larochelle, 2010; Ororbia II et al., 2015). Recently, the variational autoencoder framework (VAE) was proposed by Kingma and Welling (2014) and Rezende et al. (2014), closely related to the method proposed by Mnih and Gregor (2014). This framework allows the joint training of an inference network and a directed generative model, maximizing a variational lower-bound on the data log-likelihood and facilitating exact sampling of the variational posterior. Our work extends this framework.

With respect to document modeling, neural architectures have been shown to outperform well-established topic models such as Latent Dirichlet Allocation (LDA) (Hofmann, 1999; Blei et al., 2003). Researchers have successfully proposed several models involving discrete latent variables (Salakhutdinov and Hinton, 2009; Hinton and Salakhutdinov, 2009; Srivastava et al., 2013; Larochelle and Lauly, 2012; Uria et al., 2014; Lauly et al., 2016; Bornschein and Bengio, 2015; Mnih and Gregor, 2014). The success of such discrete latent variable models — which are able to partition probability mass into separate regions — serves as one of our main motivations for investigating models with more flexible continuous latent variables for document modeling. More recently, Miao et al. (2016) proposed to use continuous latent variables for document modeling.

Researchers have also investigated latent variable models for dialogue modeling and dialogue natural language generation (Bangalore et al., 2008; Crook et al., 2009; Zhai and Williams, 2014). The success of discrete latent variable models in this task also motivates our investigation of more flexible continuous latent variables. Closely related to our proposed approach is the Variational Hierarchical Recurrent Encoder-Decoder (VHRED, described below) (Serban et al., 2017b), a neural architecture with latent multivariate Gaussian variables.

Researchers have explored more flexible distributions for the latent variables in VAEs, such as autoregressive distributions, hierarchical probabilistic models and approximations based on MCMC sampling (Rezende et al., 2014; Rezende and Mohamed, 2015; Kingma et al., 2016; Ranganath et al., 2016; Maaløe et al., 2016; Salimans et al., 2015; Burda et al., 2016; Chen et al., 2017; Ruiz et al., 2016). These are all complementary to our approach; it is possible to combine them with the piecewise constant latent variables. In parallel to our work, multiple research groups have also proposed VAEs with discrete latent variables (Maddison et al., 2017; Jang et al., 2017; Rolfe, 2017; Johnson et al., 2016). This is a promising line of research; however, these approaches often require approximations which may be inaccurate when applied to larger-scale tasks, such as document modeling or natural language generation. Finally, discrete latent variables may be inappropriate for certain natural language processing tasks.

3 Neural Variational Models

We start by introducing the neural variational learning framework. We focus on modeling discrete output variables (e.g. words) in the context of natural language processing applications. However, the framework can easily be adapted to handle continuous output variables.

3.1 Neural Variational Learning

Let w_1, . . . , w_N be a sequence of N tokens (words) conditioned on a continuous latent variable z. Further, let c be an additional observed variable which conditions both z and w_1, . . . , w_N. Then, the distribution over words is:

  P_θ(w_1, . . . , w_N | c) = ∫ ∏_{n=1}^{N} P_θ(w_n | w_{<n}, z, c) P_θ(z | c) dz,

where θ are the model parameters. The model is trained by maximizing the variational lower-bound:

  log P_θ(w_1, . . . , w_N | c) ≥ −KL[Q_ψ(z | w_1, . . . , w_N, c) || P_θ(z | c)] + E_{z∼Q_ψ(z | w_1,...,w_N, c)}[ Σ_{n=1}^{N} log P_θ(w_n | w_{<n}, z, c) ],   (1)

where the distribution Q_ψ(z | w_1, . . . , w_N, c) approximates the true posterior P_θ(z | w_1, . . . , w_N, c). Q is called the encoder, or sometimes the recognition model or inference model, and it is parametrized by ψ. The distribution P_θ(z | c) is the prior model for z, where the only available information is c. The VAE framework further employs the re-parametrization trick, which allows one to move the derivative of the lower-bound inside the expectation. To accomplish this, z is parametrized as a transformation of a fixed, parameter-free random distribution z = f_θ(ε), where ε is drawn from a random distribution. Here, f is a transformation of ε, parametrized by θ, such that f_θ(ε) ∼ P_θ(z | c). For example, ε might be drawn from a standard Gaussian distribution and f might be defined as f_θ(ε) = µ + σε, where µ and σ are in the parameter set θ. In this case, z is able to represent any Gaussian with mean µ and variance σ².

Model parameters are learned by maximizing the variational lower-bound in eq. (1) using gradient descent, where the expectation is computed using samples from the approximate posterior.

The majority of work on VAEs proposes to parametrize z as a multivariate Gaussian distribution. However, this unrealistic assumption may critically hurt the expressiveness of the latent variable model. See Appendix A for a detailed discussion. This motivates the proposed piecewise constant latent variable distribution.

3.2 Piecewise Constant Distribution

We propose to parametrize the latent variables with a piecewise constant distribution. We assume z ∈ [0, 1] and define its density with n pieces as:

  P(z) = (1/K) Σ_{i=1}^{n} a_i 1_{(i−1)/n ≤ z < i/n},

where a_i > 0, for i = 1, . . . , n. The normalization constant is:

  K = Σ_{i=1}^{n} a_i / n.

A piecewise constant distribution with n ≥ 2 pieces is capable of representing a bi-modal distribution. When n > 2, a vector z of piecewise constant variables can represent a probability density with 2^{|z|} modes. Figure 1 illustrates how these variables help model complex, multi-modal distributions.

In order to compute the variational bound, we need to draw samples from the piecewise constant distribution using its inverse cumulative distribution function (CDF). Further, we need to compute the KL divergence between the prior and posterior. The inverse CDF and KL divergence quantities are both derived in Appendix B. During training we must compute derivatives of the variational bound in eq. (1). These expressions involve derivatives of indicator functions, which have derivatives zero everywhere except for the changing points, where the derivative is undefined. However, the probability of sampling the value exactly at its changing point is effectively zero. Thus, we fix these derivatives to zero. Similar approximations are used in training networks with rectified linear units.
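As an illustration of the inverse-CDF sampling step (the closed-form expressions derived in Appendix B are not reproduced here), the following NumPy sketch draws samples from a piecewise constant density on [0, 1] given unnormalized piece weights a_1, . . . , a_n; the weight values in the example are arbitrary.

```python
import numpy as np

def sample_piecewise_constant(a, num_samples=1, rng=None):
    """Inverse-CDF sampling for a piecewise constant density on [0, 1]."""
    rng = np.random.default_rng() if rng is None else rng
    a = np.asarray(a, dtype=float)
    n = len(a)
    cdf = np.cumsum(a) / a.sum()                 # CDF value at each boundary i/n
    u = rng.uniform(size=num_samples)            # epsilon ~ Uniform(0, 1)
    idx = np.searchsorted(cdf, u, side="left")   # piece each u falls into
    lower_cdf = np.where(idx > 0, cdf[idx - 1], 0.0)
    within = (u - lower_cdf) / (cdf[idx] - lower_cdf)  # position inside the piece
    return (idx + within) / n                    # map back to [0, 1]

# Example: two pieces with weights (1, 9) put ~90% of the mass in [0.5, 1).
samples = sample_piecewise_constant([1.0, 9.0], num_samples=5)
```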

4 Latent Variable Parametrizations

In this section, we develop the parametrization of both the Gaussian variable and our proposed piecewise constant latent variable.

Let x be the current output sequence, which the model must generate (e.g. w_1, . . . , w_N). Let c be the observed conditioning information. If the task contains additional conditioning information, this will be embedded by c. For example, for dialogue natural language generation c represents an embedding of the dialogue history, while for document modeling c = ∅.

4.1 Gaussian Parametrization

Let µ^{prior} and σ^{2,prior} be the prior mean and variance, and let µ^{post} and σ^{2,post} be the approximate posterior mean and variance. For Gaussian latent variables, the prior distribution mean and variances are encoded using linear transformations of a hidden state. In particular, the prior distribution covariance is encoded as a diagonal covariance matrix using a softplus function:

  µ^{prior} = H_µ^{prior} Enc(c) + b_µ^{prior},
  σ^{2,prior} = diag(log(1 + exp(H_σ^{prior} Enc(c) + b_σ^{prior}))),

where Enc(c) is an embedding of the conditioning information c (e.g. for dialogue natural language generation this might, for example, be produced by an LSTM encoder applied to the dialogue history), which is shared across all latent variable dimensions. The matrices H_µ^{prior}, H_σ^{prior} and vectors b_µ^{prior}, b_σ^{prior} are learnable parameters.

For the posterior distribution, previous work has shown it is better to parametrize the posterior distribution as a linear interpolation of the prior distribution mean and variance and a new estimate of the mean and variance based on the observation x (Fraccaro et al., 2016). The interpolation is controlled by a gating mechanism, allowing the model to turn on/off latent dimensions:

  µ^{post} = (1 − α_µ) µ^{prior} + α_µ (H_µ^{post} Enc(c, x) + b_µ^{post}),
  σ^{2,post} = (1 − α_σ) σ^{2,prior} + α_σ diag(log(1 + exp(H_σ^{post} Enc(c, x) + b_σ^{post}))),

where Enc(c, x) is an embedding of both c and x. The matrices H_µ^{post}, H_σ^{post} and the vectors b_µ^{post}, b_σ^{post}, α_µ, α_σ are parameters to be learned. The interpolation mechanism is controlled by α_µ and α_σ, which are initialized to zero (i.e. initialized such that the posterior is equal to the prior).

Figure 1: Joint density plot of a pair of Gaussian and piecewise constant variables. The horizontal axis corresponds to z_1, which is a univariate Gaussian variable. The vertical axis corresponds to z_2, which is a piecewise constant variable.

4.2 Piecewise Constant Parametrization

We parametrize the piecewise prior parameters using an exponential function applied to a linear transformation of the conditioning information:

  a_i^{prior} = exp(H_{a,i}^{prior} Enc(c) + b_{a,i}^{prior}),  i = 1, . . . , n,

where the matrix H_a^{prior} and vector b_a^{prior} are learnable. As before, we define the posterior parameters as a function of both c and x:

  a_i^{post} = exp(H_{a,i}^{post} Enc(c, x) + b_{a,i}^{post}),  i = 1, . . . , n,

where H_a^{post} and b_a^{post} are parameters.

5 Variational Text Modeling

We now introduce two classes of VAEs. The models are extended by incorporating the Gaussian and piecewise latent variable parametrizations.

5.1 Document Model

The neural variational document model (NVDM) has previously been proposed for document modeling (Mnih and Gregor, 2014; Miao et al., 2016), where the latent variables are Gaussian. Since the original NVDM uses Gaussian latent variables, we will refer to it as G-NVDM. We propose two novel models building on G-NVDM. The first model we propose uses piecewise constant latent variables instead of Gaussian latent variables.

We refer to this model as P-NVDM. The second model we propose uses a combination of Gaussian and piecewise constant latent variables. The models sample the Gaussian and piecewise constant latent variables independently and then concatenate them together into one vector. We refer to this model as H-NVDM.

Let V be the vocabulary of document words. Let W represent a document matrix, where row w_i is the 1-of-|V| binary encoding of the i'th word in the document. Each model has an encoder component Enc(W), which compresses a document vector into a continuous distributed representation upon which the approximate posterior is built. For document modeling, word order information is not taken into account and no additional conditioning information is available. Therefore, each model uses a bag-of-words encoder, defined as a multi-layer perceptron (MLP) Enc(c = ∅, x) = Enc(x). Based on preliminary experiments, we choose the encoder to be a two-layered MLP with parametrized rectified linear activation functions (we omit these parameters for simplicity). For the approximate posterior, each model has the parameter matrix W_a^{post} and vector b_a^{post} for the piecewise latent variables, and the parameter matrices W_µ^{post}, W_σ^{post} and vectors b_µ^{post}, b_σ^{post} for the Gaussian means and variances. For the prior, each model has parameter vector b_a^{prior} for the piecewise latent variables, and vectors b_µ^{prior}, b_σ^{prior} for the Gaussian means and variances. We initialize the bias parameters to zero in order to start with a centered prior, and we learn the prior parameters, which were fixed to a standard Gaussian in previous work. This increases the flexibility of the model and makes optimization easier. In addition, we use a gating mechanism for the approximate posterior of the Gaussian variables. This gating mechanism allows the model to turn off latent variables (i.e. fix the approximate posterior to equal the prior for specific latent variables) when computing the final posterior parameters. Furthermore, Miao et al. (2016) alternated between optimizing the approximate posterior parameters and the generative model parameters, while we optimize all parameters simultaneously.

5.2 Dialogue Model

The variational hierarchical recurrent encoder-decoder (VHRED) model has previously been proposed for dialogue modeling and natural language generation (Serban et al., 2017b, 2016a). The model decomposes dialogues using a two-level hierarchy: sequences of utterances (e.g. sentences), and sub-sequences of tokens (e.g. words). Let w_n be the n'th utterance in a dialogue with N utterances. Let w_{n,m} be the m'th word in the n'th utterance from vocabulary V given as a 1-of-|V| binary encoding. Let M_n be the number of words in the n'th utterance. For each utterance n = 1, . . . , N, the model generates a latent variable z_n. Conditioned on this latent variable, the model then generates the next utterance:

  P_θ(w_1, z_1, . . . , w_N, z_N) = ∏_{n=1}^{N} P_θ(z_n | w_{<n}) P_θ(w_n | z_n, w_{<n}).

The latent variable z_n is given as input to the decoder RNN, which then computes the output probabilities of the words in the next utterance. The model is trained by maximizing the variational lower-bound, which factorizes into independent terms for each sub-sequence (utterance):

  log P_θ(w_1, . . . , w_N) ≥ Σ_{n=1}^{N} ( −KL[Q_ψ(z_n | w_1, . . . , w_n) || P_θ(z_n | w_{<n})] + E_{Q_ψ(z_n | w_1,...,w_n)}[ log P_θ(w_n | z_n, w_{<n}) ] ).

Model      20-NG   RCV1   CADE
LDA        1058    --     --
docNADE    896     --     --
NVDM       836     --     --
G-NVDM     651     905    339
H-NVDM-3   607     865    258
H-NVDM-5   566     833    294

Table 1: Test perplexities on three document modeling tasks.
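Schematically, training maximizes this factorized bound by summing one KL term and one reconstruction term per utterance, as in the sketch below. The methods posterior_kl and reconstruction_log_prob stand in for the model-specific computations (Gaussian and/or piecewise constant KL divergences and the decoder RNN likelihood) and are hypothetical names, not part of the original implementation.

```python
# Schematic per-dialogue objective for the utterance-level latent variable models.
def variational_lower_bound(utterances, model):
    bound = 0.0
    for n, utterance in enumerate(utterances):
        # Sample z_n from the approximate posterior and compute its KL term.
        z_sample, kl = model.posterior_kl(utterance, context_index=n)
        # Reconstruction term: log-probability of the utterance given z_n and context.
        bound += model.reconstruction_log_prob(utterance, z_sample, context_index=n) - kl
    return bound
```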

G-NVDM        H-NVDM-3    H-NVDM-5
environment   project     science
project       gov         built
flight        major       high
lab           based       technology
mission       earth       world
launch        include     form
field         science     scale
working       nasa        sun
build         systems     special
gov           technical   area

Table 2: Word query similarity test on 20 News-Groups: for the query "space", we retrieve the top 10 nearest words in word embedding space based on Euclidean distance. H-NVDM-5 associates multiple meanings to the query, while G-NVDM only associates the most frequent meaning.

of hidden units was chosen via preliminary experimentation with smaller models. On 20-NG, we use the same set-up as (Hinton and Salakhutdinov, 2009) and therefore report the perplexities of a topic model (LDA, (Hinton and Salakhutdinov, 2009)), the document neural auto-regressive estimator (docNADE, (Larochelle and Lauly, 2012)), and a neural variational document model with a fixed standard Gaussian prior (NVDM, lowest reported perplexity, (Miao et al., 2016)).

Figure 2: Latent variable approximate posterior means t-SNE visualization on 20-NG for G-NVDM and H-NVDM-5. Colors correspond to the topic labels assigned to each document.

Results: In Table 1, we report the test document perplexity exp(−(1/D) Σ_n (1/L_n) log P_θ(x_n)). We use the variational lower-bound as an approximation based on 10 samples, as was done in (Mnih and Gregor, 2014). First, we note that the baseline model (i.e. the NVDM) is more competitive when both the prior and posterior models are learnt together (i.e. the G-NVDM), as opposed to the fixed prior of (Miao et al., 2016). Next, we observe that integrating our proposed piecewise variables yields even better results in our document modeling experiments, substantially improving over the baselines. More importantly, in the 20-NG and Reuters datasets, increasing the number of pieces from 3 to 5 further reduces perplexity. Thus, we have achieved a new state-of-the-art perplexity on the 20 News-Groups task and — to the best of our knowledge — better perplexities on the CADE12 and RCV1 tasks compared to using a state-of-the-art model like the G-NVDM. We also evaluated the converged models using a non-parametric inference procedure, where a separate approximate posterior is learned for each test example in order to tighten the variational lower-bound. H-NVDM also performed best in this evaluation across all three datasets, which confirms that the performance improvement is due to the piecewise components. See the appendix for details.

In Table 2, we examine the top ten highest ranked words given the query term "space", using the decoder parameter matrix. The piecewise variables appear to have a significant effect on what is uncovered by the model. In the case of "space", the hybrid with 5 pieces seems to value two senses of the word: one related to "outer space" (e.g., "sun", "world", etc.) and another related to the dimensions of depth, height, and width within which things may exist and move (e.g., "area", "form", "scale", etc.). On the other hand, G-NVDM appears to only capture the "outer space" sense of

58 Model Activity Entity as a conversation between a user and a technical HRED4.77 2.43 support assistant — the model must generate the G-VHRED9.242.49 next appropriate response in the dialogue. For ex- P-VHRED52.49 ample, when it is the turn of the technical support H-VHRED8.413.72 assistant, the model must generate an appropriate response helping the user resolve their problem. Table 3: Ubuntu evaluation using F1 metrics w.r.t. We evaluate the models using the activity- and activities and entities. G-VHRED, P-VHRED and entity-based metrics designed specifically for the H-VHRED all outperform the baseline HRED. Ubuntu domain (Serban et al., 2017a). These G-VHRED performs best w.r.t. activities andH- metrics compare the activities and entities in the VHRED performs best w.r.t. entities. model generated responses with those of the ref- erence responses; activities are verbs referring to high-level actions (e.g. download, install, unzip) the word. More examples are in the appendix. and entities are nouns referring to technical ob- Finally, we visualized the means of the approx- jects (e.g. Firefox, GNOME). The more activities imate posterior latent variables on 20-NG through and entities a model response overlaps with the a t-SNE projection. As shown in Figure 2, both reference response (e.g. expert response) the more G-NVDM and H-NVDM-5 learn representations likely the response will lead to a solution. which disentangle the topic clusters on 20-NG. The models were trained to maxi- However, G-NVDM appears to have more dis- Training mize the log-likelihood of training examples us- persed clusters and more outliers (i.e. data points ing a learning rate of0.0002 and mini-batches in the periphery) compared to H-NVDM-5. Al- of size 80. We use a variant of truncated back- though it is difficult to draw conclusions based on propagation. We terminate the training procedure these plots, thesefindings could potentially be ex- for each model using early stopping, estimated plained by the Gaussian latent variablesfitting the using one stochastic sample per validation exam- latent factors poorly. ple. We evaluate the models by generating dia- 6.2 Dialogue Modeling logue responses: conditioned on a dialogue con- text, wefix the model latent variables to their me- Task We evaluate VHRED on a natural language dian values and then generate the response using a generation task, where the goal is to generate re- beam search with size 5. We select model hyper- sponses in a dialogue. This is a difficult prob- parameters based on the validation set using the F1 lem, which has been extensively studied in the activity metric, as described earlier. recent literature (Ritter et al., 2011; Lowe et al., 2015; Sordoni et al., 2015; Li et al., 2016; Ser- It is often difficult to train generative models ban et al., 2016a,b). Dialogue response generation for language with stochastic latent variables (Bow- has recently gained a significant amount of atten- man et al., 2015; Serban et al., 2017b). For the tion from industry, with high-profile projects such latent variable models, we therefore experiment as Google SmartReply (Kannan et al., 2016) and with reweighing the KL divergence terms in the Microsoft Xiaoice (Markoff and Mozur, 2015). variational lower-bound with values0.25,0.50, Even more recently, Amazon has announced the 0.75 and1.0. 
Training: The models were trained to maximize the log-likelihood of training examples using a learning rate of 0.0002 and mini-batches of size 80. We use a variant of truncated back-propagation. We terminate the training procedure for each model using early stopping, estimated using one stochastic sample per validation example. We evaluate the models by generating dialogue responses: conditioned on a dialogue context, we fix the model latent variables to their median values and then generate the response using a beam search with size 5. We select model hyper-parameters based on the validation set using the F1 activity metric, as described earlier.

It is often difficult to train generative models for language with stochastic latent variables (Bowman et al., 2015; Serban et al., 2017b). For the latent variable models, we therefore experiment with reweighting the KL divergence terms in the variational lower-bound with the values 0.25, 0.50, 0.75 and 1.0. In addition to this, we linearly increase the KL divergence weights from zero to their final value over the first 75,000 training batches. Finally, we weaken the decoder RNN by randomly replacing words inputted to the decoder RNN with the unknown token with 25% probability. These steps are important for effectively training the models, and the latter two have been used in previous work by Bowman et al. (2015) and Serban et al. (2017b).
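The two heuristics just described, linearly annealing the KL weight over the first 75,000 batches and replacing decoder input words with the unknown token 25% of the time, can be sketched as follows. This is an illustration under our own naming, not the authors' implementation, and the token IDs in the example are hypothetical.

```python
import random

def kl_weight(batch_index, final_weight=1.0, warmup_batches=75000):
    """Linearly anneal the weight on the KL term from 0 to final_weight."""
    return final_weight * min(1.0, batch_index / warmup_batches)

def drop_decoder_inputs(token_ids, unk_id, drop_prob=0.25):
    """Randomly replace decoder input tokens with the unknown token."""
    return [unk_id if random.random() < drop_prob else t for t in token_ids]

# Example: the annealed weight after 30,000 batches, and word dropout
# applied to a short token sequence with a hypothetical UNK id of 0.
print(kl_weight(30000))                    # 0.4
print(drop_decoder_inputs([5, 8, 13], 0))  # e.g. [5, 0, 13]
```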

HRED (Baseline): We compare to the HRED model (Serban et al., 2016a): a sequence-to-sequence model shown to outperform other established models on this task, such as the LSTM RNN language model (Serban et al., 2017a). The HRED model's encoder RNN uses a bidirectional GRU RNN encoder, where the forward and backward RNNs each have 1000 hidden units. The context RNN is a GRU encoder with 1000 hidden units, and the decoder RNN is an LSTM decoder with 2000 hidden units.2 The encoder and context RNNs both use layer normalization (Ba et al., 2016).3 We also experiment with an additional rectified linear layer applied on the inputs to the decoder RNN. As with other hyper-parameters, we choose whether to include this additional layer based on validation set performance. HRED, as well as all other models, uses a word embedding dimensionality of 400.

G-VHRED: We compare to G-VHRED, which is VHRED with Gaussian latent variables (Serban et al., 2017b). G-VHRED uses the same hyper-parameters for the encoder, context and decoder RNNs as the HRED model. The model has 100 Gaussian latent variables per utterance.

P-VHRED: The first model we propose is P-VHRED, which is the VHRED model with piecewise constant latent variables. We use n = 3 pieces for each latent variable. P-VHRED also uses the same hyper-parameters for the encoder, context and decoder RNNs as the HRED model. Similar to G-VHRED, P-VHRED has 100 piecewise constant latent variables per utterance.

H-VHRED: The second model we propose is H-VHRED, which has 100 piecewise constant (with n = 3 pieces per variable) and 100 Gaussian latent variables per utterance. H-VHRED also uses the same hyper-parameters for the encoder, context and decoder RNNs as HRED.
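For intuition about the piecewise constant latent variables used above, the sketch below draws a sample from a piecewise constant density on [0, 1] via its inverse CDF, assuming n equal-width pieces with positive, unnormalized weights. This is a generic illustration rather than the paper's exact parameterization or its differentiable reparameterization.

```python
import numpy as np

def sample_piecewise_constant(piece_weights, rng=np.random):
    """Inverse-CDF sample from a piecewise constant density on [0, 1].

    piece_weights holds positive, unnormalized masses for n equal-width
    pieces (e.g., n = 3 as in P-VHRED and H-VHRED).
    """
    weights = np.asarray(piece_weights, dtype=np.float64)
    probs = weights / weights.sum()      # probability mass of each piece
    cdf = np.cumsum(probs)
    u = rng.uniform()
    k = int(np.searchsorted(cdf, u))     # index of the piece containing u
    mass_before = cdf[k] - probs[k]
    # Interpolate linearly within piece k, which covers [k/n, (k+1)/n].
    return (k + (u - mass_before) / probs[k]) / len(weights)

# Example: three pieces, with most of the mass on the last piece.
print(sample_piecewise_constant([0.2, 0.3, 1.5]))
```

In P-VHRED and H-VHRED, samples of this kind (produced by the paper's differentiable construction) would take the place of, or complement, the Gaussian latent samples per utterance.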
Model      Activity   Entity
HRED       4.77       2.43
G-VHRED    9.24       2.49
P-VHRED    5          2.49
H-VHRED    8.41       3.72

Table 3: Ubuntu evaluation using F1 metrics w.r.t. activities and entities. G-VHRED, P-VHRED and H-VHRED all outperform the baseline HRED. G-VHRED performs best w.r.t. activities and H-VHRED performs best w.r.t. entities.

Results: The results are given in Table 3. All latent variable models outperform HRED w.r.t. both activities and entities. This strongly suggests that the high-level concepts represented by the latent variables help generate meaningful, goal-directed responses. Furthermore, each type of latent variable appears to help with a different aspect of the generation task. G-VHRED performs best w.r.t. activities (e.g., download, install and so on), which occur frequently in the dataset. This suggests that the Gaussian latent variables learn useful latent representations for frequent actions. On the other hand, H-VHRED performs best w.r.t. entities (e.g., Firefox, GNOME), which are often much rarer and mutually exclusive in the dataset. This suggests that the combination of Gaussian and piecewise latent variables helps learn useful representations for entities, which could not be learned by Gaussian latent variables alone. We further conducted a qualitative analysis of the model responses, which supports these conclusions. See Appendix G.4

7 Conclusions

In this paper, we have sought to learn rich and flexible multi-modal representations of latent variables for complex natural language processing tasks. We have proposed the piecewise constant distribution for the variational autoencoder framework. We have derived closed-form expressions for the necessary quantities required for its use in the autoencoder framework, and proposed an efficient, differentiable implementation of it. We have incorporated the proposed piecewise constant distribution into two model classes, NVDM and VHRED, and evaluated the proposed models on document modeling and dialogue modeling tasks. We have achieved state-of-the-art results on three document modeling tasks, and have demonstrated substantial improvements on a dialogue modeling task. Overall, the results highlight the benefits of incorporating the flexible, multi-modal piecewise constant distribution into variational autoencoders. Future work should explore other natural language processing tasks where the data is likely to arise from complex, multi-modal latent factors.

Acknowledgments

The authors acknowledge NSERC, Canada Research Chairs, CIFAR, IBM Research, Nuance Foundation and Microsoft Maluuba for funding. Alexander G. Ororbia II was funded by a NACME-Sloan scholarship. The authors thank Hugo Larochelle for sharing the NewsGroup 20 dataset. The authors thank Laurent Charlin, Sungjin Ahn, and Ryan Lowe for constructive feedback. This research was enabled in part by support provided by Calcul Québec (www.calculquebec.ca) and Compute Canada (www.computecanada.ca).

2 Since training lasted between 1 and 3 weeks for each model, we had to fix the number of hidden units during preliminary experiments on the training and validation datasets.
3 We did not apply layer normalization to the decoder RNN, because several of our colleagues have found that this may hurt the performance of generative language models.
4 Results on a Twitter dataset are given in the appendix.

References

J. L. Ba, J. R. Kiros, and G. E. Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.

S. Bangalore, G. Di Fabbrizio, and A. Stent. 2008. Learning the structure of task-driven human–human dialogs. IEEE Transactions on Audio, Speech, and Language Processing, 16(7):1249–1259.

D. M. Blei, A. Y. Ng, and M. I. Jordan. 2003. Latent Dirichlet allocation. JMLR, 3:993–1022.

J. Bornschein and Y. Bengio. 2015. Reweighted wake-sleep. In ICLR.

S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, and S. Bengio. 2015. Generating sentences from a continuous space. In Conference on Computational Natural Language Learning.

Y. Burda, R. Grosse, and R. Salakhutdinov. 2016. Importance weighted autoencoders. In ICLR.

A. Cardoso-Cachopo. 2007. Improving Methods for Single-label Text Categorization. PhD thesis, Instituto Superior Tecnico, Universidade Tecnica de Lisboa.

X. Chen, D. P. Kingma, T. Salimans, Y. Duan, P. Dhariwal, J. Schulman, I. Sutskever, and P. Abbeel. 2017. Variational lossy autoencoder. In ICLR.

N. Crook, R. Granell, and S. Pulman. 2009. Unsupervised classification of dialogue acts using a Dirichlet process mixture model. In Special Interest Group on Discourse and Dialogue (SIGDIAL), pages 341–348.

P. Dayan and G. E. Hinton. 1996. Varieties of Helmholtz machine. Neural Networks, 9(8):1385–1403.

L. Devroye. 1986. Sample-based non-uniform random variate generation. In Proceedings of the 18th Conference on Winter Simulation, pages 260–265. ACM.

M. Farber. 2016. Amazon's 'Alexa Prize' will give college students up to $2.5M to create a socialbot. Fortune.

M. Fraccaro, S. K. Sønderby, U. Paquet, and O. Winther. 2016. Sequential neural models with stochastic layers. In NIPS, pages 2199–2207.

K. Gregor, I. Danihelka, A. Graves, and D. Wierstra. 2015. DRAW: A recurrent neural network for image generation. In ICLR.

G. E. Hinton, P. Dayan, B. J. Frey, and R. M. Neal. 1995. The "wake-sleep" algorithm for unsupervised neural networks. Science, 268(5214):1158–1161.

G. E. Hinton and R. Salakhutdinov. 2009. Replicated softmax: an undirected topic model. In NIPS, pages 1607–1614.

G. E. Hinton and R. S. Zemel. 1994. Autoencoders, minimum description length and Helmholtz free energy. In NIPS, pages 3–10.

T. Hofmann. 1999. Probabilistic latent semantic indexing. In ACM SIGIR Conference on Research and Development in Information Retrieval, pages 50–57. ACM.

E. Jang, S. Gu, and B. Poole. 2017. Categorical reparameterization with Gumbel-Softmax. In ICLR.

M. Johnson, D. K. Duvenaud, A. Wiltschko, R. P. Adams, and S. R. Datta. 2016. Composing graphical models with neural networks for structured representations and fast inference. In NIPS, pages 2946–2954.

M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. 1999. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233.

A. Kannan, K. Kurach, et al. 2016. Smart Reply: Automated response suggestion for email. In KDD.

D. Kingma and J. Ba. 2015. Adam: A method for stochastic optimization. In ICLR.

D. P. Kingma, T. Salimans, and M. Welling. 2016. Improving variational inference with inverse autoregressive flow. In NIPS, pages 4736–4744.

D. P. Kingma and M. Welling. 2014. Auto-encoding variational Bayes. In ICLR.

H. Larochelle and S. Lauly. 2012. A neural autoregressive topic model. In NIPS, pages 2708–2716.

A. B. Lindbo Larsen, S. K. Sønderby, and O. Winther. 2016. Autoencoding beyond pixels using a learned similarity metric. In ICML, pages 1558–1566.

S. Lauly, Y. Zheng, A. Allauzen, and H. Larochelle. 2016. Document neural autoregressive distribution estimation. arXiv preprint arXiv:1603.05962.

J. Li, M. Galley, C. Brockett, J. Gao, and B. Dolan. 2016. A diversity-promoting objective function for neural conversation models. In The North American Chapter of the Association for Computational Linguistics (NAACL), pages 110–119.

Ryan Lowe, Nissan Pow, Iulian Serban, and Joelle Pineau. 2015. The Ubuntu Dialogue Corpus: A large dataset for research in unstructured multi-turn dialogue systems. In Special Interest Group on Discourse and Dialogue (SIGDIAL).

Ryan T. Lowe, Nissan Pow, Iulian V. Serban, Laurent Charlin, Chia-Wei Liu, and Joelle Pineau. 2017. Training end-to-end dialogue systems with the Ubuntu Dialogue Corpus. Dialogue & Discourse, 8(1).

Lars Maaløe, Casper Kaae Sønderby, Søren Kaae Sønderby, and Ole Winther. 2016. Auxiliary deep generative models. In ICML, pages 1445–1453.

C. J. Maddison, A. Mnih, and Y. W. Teh. 2017. The concrete distribution: A continuous relaxation of discrete random variables. In ICLR.

J. Markoff and P. Mozur. 2015. For sympathetic ear, more Chinese turn to smartphone program. New York Times.

Y. Miao, L. Yu, and P. Blunsom. 2016. Neural variational inference for text processing. In ICML, pages 1727–1736.

A. Mnih and K. Gregor. 2014. Neural variational inference and learning in belief networks. In ICML, pages 1791–1799.

R. M. Neal. 1992. Connectionist learning of belief networks. Artificial Intelligence, 56(1):71–113.

A. G. Ororbia II, C. L. Giles, and D. Reitter. 2015. Online semi-supervised learning with deep hybrid Boltzmann machines and denoising autoencoders. arXiv preprint arXiv:1511.06964.

R. Pascanu, T. Mikolov, and Y. Bengio. 2012. On the difficulty of training recurrent neural networks. In ICML, pages 1310–1318.

Rajesh Ranganath, Dustin Tran, and David Blei. 2016. Hierarchical variational models. In ICML, pages 324–333.

D. J. Rezende and S. Mohamed. 2015. Variational inference with normalizing flows. In ICML, pages 1530–1538.

D. J. Rezende, S. Mohamed, and D. Wierstra. 2014. Stochastic backpropagation and approximate inference in deep generative models. In ICML, pages 1278–1286.

A. Ritter, C. Cherry, and W. B. Dolan. 2011. Data-driven response generation in social media. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 583–593.

J. T. Rolfe. 2017. Discrete variational autoencoders. In ICLR.

Francisco R. Ruiz, Michalis Titsias RC AUEB, and David Blei. 2016. The generalized reparameterization gradient. In NIPS, pages 460–468.

R. Salakhutdinov and G. E. Hinton. 2009. Semantic hashing. International Journal of Approximate Reasoning, 50(7):969–978.

R. Salakhutdinov and H. Larochelle. 2010. Efficient learning of deep Boltzmann machines. In AISTATS, pages 693–700.

T. Salimans, D. P. Kingma, and M. Welling. 2015. Markov chain Monte Carlo and variational inference: Bridging the gap. In ICML, pages 1218–1226.

R. Sennrich, B. Haddow, and A. Birch. 2016. Neural machine translation of rare words with subword units. In Association for Computational Linguistics (ACL).

I. V. Serban, T. Klinger, G. Tesauro, K. Talamadupula, B. Zhou, Y. Bengio, and A. Courville. 2017a. Multiresolution recurrent neural networks: An application to dialogue response generation. In Thirty-First AAAI Conference (AAAI).

I. V. Serban, A. Sordoni, Y. Bengio, A. Courville, and J. Pineau. 2016a. Building end-to-end dialogue systems using generative hierarchical neural network models. In Thirtieth AAAI Conference (AAAI).

I. V. Serban, A. Sordoni, R. Lowe, L. Charlin, J. Pineau, A. Courville, and Y. Bengio. 2017b. A hierarchical latent variable encoder-decoder model for generating dialogues. In Thirty-First AAAI Conference (AAAI).

Iulian Vlad Serban, Ryan Lowe, Laurent Charlin, and Joelle Pineau. 2016b. Generative deep neural networks for dialogue: A short review. In NIPS, Let's Discuss: Learning Methods for Dialogue Workshop.

A. Sordoni, M. Galley, M. Auli, C. Brockett, Y. Ji, M. Mitchell, J. Nie, J. Gao, and B. Dolan. 2015. A neural network approach to context-sensitive generation of conversational responses. In Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT 2015), pages 196–205.

N. Srivastava, R. R. Salakhutdinov, and G. E. Hinton. 2013. Modeling documents with deep Boltzmann machines. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence (UAI), pages 616–624.

B. Uria, I. Murray, and H. Larochelle. 2014. A deep and tractable density estimator. In ICML, pages 467–475.

K. Zhai and J. D. Williams. 2014. Discovering latent structure in task-oriented dialogues. In Association for Computational Linguistics (ACL), pages 36–46.

Author Index

Bradbury, James, 12

Chang, Baobao, 27 Courville, Aaron, 52

Daumé III, Hal, 17

Kordjamshidi, Parisa, 33

Liu, LuChen, 27

Manzoor, Umar, 33 McCallum, Andrew, 1 Meerkamp, Philipp, 44

Ororbia II, Alexander, 52

Pineau, Joelle, 52

Qian, Feng, 27

Rahgooy, Taher, 33

Serban, Iulian Vlad, 52 Sha, Lei, 27 Sharaf, Amr, 17 Socher, Richard, 12 Stratos, Karl, 7 Strubell, Emma, 1

Zhang, Ming, 27 Zhou, Zhengyi, 44
