
Maximum Margin Reward Networks for Learning from Explicit and Implicit Supervision

Haoruo Peng¹  Ming-Wei Chang²  Wen-tau Yih²
¹University of Illinois, Urbana-Champaign  ²Microsoft Research, Redmond
[email protected]  {minchang,scottyih}@microsoft.com

Abstract

Neural networks have achieved state-of-the-art performance on several structured-output prediction tasks, trained in a fully supervised fashion. However, annotated examples in structured domains are often costly to obtain, which thus limits the applications of neural networks. In this work, we propose Maximum Margin Reward Networks, a neural network-based framework that aims to learn from both explicit (full structures) and implicit supervision signals (delayed feedback on the correctness of the predicted structure). On named entity recognition and semantic parsing, our model outperforms previous systems on the benchmark datasets, CoNLL-2003 and WebQuestionsSP.

[Figure 1: Learning a semantic parser using implicit supervision signals (labeled answers). Since there are no gold parses, a model needs to explore different parses, whose quality can only be indirectly verified by comparing the retrieved answers with the labeled answers. The figure shows the question "Who played Meg in Season 1 of Family Guy?", a candidate parse λx.∃y. cast(FamilyGuySeason1, y) ∧ actor(y, x), the answers it retrieves from the KB (Lacey Chabert, Seth MacFarlane, Alex Borstein, Seth Green, John Viener, Alec Sulkin), and the labeled answer A: Lacey Chabert.]

1 Introduction

Structured-output prediction problems, where the goal is to determine the values of a set of interdependent variables, are ubiquitous in NLP. Structures of such problems can range from simple sequences like part-of-speech tagging (Ling et al., 2015) and named entity recognition (Lample et al., 2016), to complex syntactic or semantic analysis such as dependency parsing (Dyer et al., 2015) and semantic parsing (Dong and Lapata, 2016). State-of-the-art methods for these tasks are often neural network models trained using fully annotated structures, which can be costly or time-consuming to obtain. Weakly supervised learning settings, where the algorithm assumes only the existence of implicit signals on whether a prediction is correct, are thus more appealing in many scenarios.

For example, Figure 1 shows a weakly supervised setting of learning semantic parsers using only question–answer pairs. When the system generates a candidate semantic parse during training, its quality needs to be indirectly measured by comparing the answers derived from the knowledge base with the provided labeled answers.

This setting of implicit supervision increases the difficulty of learning a neural model, not only because the signals are vague and noisy, but also because they are delayed. For instance, among different semantic parses that result in the same answers, typically only a few correctly represent the meaning of the question. Moreover, the correctness of the answers corresponding to a parse can only be evaluated through an external oracle (e.g., executing the query on the knowledge base) after the parse is fully constructed. Updating the model before the search for a full semantic parse is complete is generally infeasible.¹ It is also not clear how to leverage implicit and explicit signals integrally during learning when both kinds of labels are present.

¹Existing weakly supervised methods (Clarke et al., 2010; Artzi and Zettlemoyer, 2013) often leverage domain-specific heuristics, which are not always available.

In this work, we propose Maximum Margin Reward Networks (MMRN), a general neural network-based framework that is able to learn from both implicit and explicit supervision signals. By casting structured-output learning as a search problem, the key insight in MMRN is a special mechanism of rewards.

Rewards can be viewed as the training signals that drive the model to explore the search space and to find the correct structure. The explicit supervision signals can be viewed as a source of immediate rewards, as we can often instantly know the correctness of the current action. On the other hand, the implicit supervision can be viewed as a source of delayed rewards, where the reward of the actions is only revealed later. We unify these two types of reward signals by using a maximum margin update, inspired by structured SVM (Joachims et al., 2009).

The effectiveness of MMRN is demonstrated on three NLP tasks: named entity recognition, entity linking and semantic parsing. MMRN outperforms the current best results on the CoNLL-2003 named entity recognition dataset (Tjong Kim Sang and De Meulder, 2003), reaching 91.4% F1 in the closed setting where no gazetteer is allowed. It also performs comparably to the existing state-of-the-art systems on entity linking. Models for these two tasks are trained using explicit supervision. For semantic parsing, where only implicit supervision signals are provided, MMRN is able to learn from delayed rewards, improving the entity linking component and the overall semantic parsing framework jointly, and outperforms the best published system by 1.4% absolute on the WebQSP dataset (Yih et al., 2016).

In the rest of the paper, we survey the most related work in Sec. 2 and give an in-depth discussion comparing MMRN with other learning frameworks in Sec. 7. We start the description of our method with the search formulation and the state–action spaces of our targeted tasks in Sec. 3, followed by the reward and learning algorithm in Sec. 4 and the detailed neural model design in Sec. 5. Sec. 6 reports the experimental results and Sec. 8 concludes the paper.

2 Related Work

Structured output prediction tasks have been studied extensively in the field of natural language processing (NLP). Many supervised structured learning algorithms have been proposed for capturing the relationships between output variables. These models include the structured perceptron (Collins, 2002; Collins and Roark, 2004), conditional random fields (Lafferty et al., 2001), and structured SVM (Taskar et al., 2004; Joachims et al., 2009). Later, the learning to search framework was proposed (Daumé and Marcu, 2005; Daumé et al., 2009), which casts the structured prediction task as a general search problem. Most recently, recurrent neural networks such as LSTM models (Hochreiter and Schmidhuber, 1997) have been used as a general tool for structured output models (Vinyals et al., 2015).

Latent structured learning algorithms address the problem of learning from incompletely labeled data (Yu and Joachims, 2009; Quattoni et al., 2007). The main difference compared to our framework is the existence of the external environment when learning from implicit signals.

Upadhyay et al. (2016) first proposed the idea of learning from implicit supervision and is the most related work to ours. Compared to their linear algorithm, our framework is more principled and general, as we integrate the concept of margin in our method. Furthermore, we also extend the framework using neural models.

3 Search-based Inference

In our framework, predicting the best structured output, i.e., inference, is formulated as a state/action search problem. Our search space can be described as follows. The initial state, s_0, is the starting point of the search process. We define γ(s) as the set of all feasible actions that can be taken at s, and denote by s' = τ(s, a) the transition function, where s' is the new state after taking action a from s. A path h is a sequence of state–action pairs, starting with the initial state: h = {(s_0, a_0), ..., (s_k, a_k)}, where s_i = τ(s_{i−1}, a_{i−1}), ∀i = 1, ..., k. We write h ⇝ ŝ if ŝ = τ(s_k, a_k) is the final state to which the path h leads.
A path is essentially a partial or complete structured prediction. For each input x, we define H(x) to be the set of all possible paths for the input. We also define E(x) = {h | h ∈ H(x), h ⇝ ŝ, γ(ŝ) = ∅}, which is the set of all paths that lead to terminal states.

Given a state s and an action a, the scoring function f_θ(s, a) measures the quality of an immediate action with respect to the current state, where θ denotes the model parameters. The score of a path h is defined as the sum of the scores of the state–action pairs in h: f_θ(h) = Σ_{i=0}^{k} f_θ(s_i, a_i). During test time, inference finds the best path in E(x): arg max_{h ∈ E(x)} f_θ(h; x). In practice, inference is often approximated by beam search when no efficient exact algorithm exists.
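To make the formulation concrete, the following is a minimal Python sketch of such a beam-search inference procedure. The callables initial_state, feasible_actions (γ), transition (τ), and score (f_θ) are hypothetical stand-ins for the task-specific components described in the rest of this section, not the authors' implementation.

    def beam_search(x, initial_state, feasible_actions, transition, score, beam_size=20):
        """Approximately solve arg max over E(x) of f_theta(h) = sum_i f_theta(s_i, a_i).

        A partial path is kept as (state, cumulative score); a state with no
        feasible actions is terminal, matching the definition of E(x)."""
        beam = [(initial_state(x), 0.0)]           # start from s_0 with path score 0
        finished = []
        while beam:
            candidates = []
            for state, path_score in beam:
                actions = feasible_actions(state)  # gamma(s)
                if not actions:                    # terminal state: a complete path
                    finished.append((state, path_score))
                    continue
                for a in actions:
                    next_state = transition(state, a)  # s' = tau(s, a)
                    candidates.append((next_state, path_score + score(state, a)))
            # keep only the top-scoring partial paths
            beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        return max(finished, key=lambda c: c[1])   # best complete path found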

In the remainder of this section, we describe the states and actions of the targeted tasks in this work: named entity recognition, entity linking and semantic parsing. The model and learning algorithm will be discussed in Sec. 4 and Sec. 5.

[Figure 2: Semantic parses in λ-calculus (top) and as a query graph (bottom) of the question "who played meg in season 1 of family guy?". The λ-calculus form is λx.∃y. cast(FamilyGuySeason1, y) ∧ actor(y, x) ∧ character(y, MegGriffin); the query graph connects FamilyGuySeason1 via cast to y and via actor to the answer x, with the constraint that y's character is MegGriffin.]

3.1 Named entity recognition

The task of named entity recognition (NER) is to identify entity mentions in a sentence, as well as to assign their types, such as Person or Location. Following the conventional setting, we treat it as a sequence labeling problem using the standard BIOES encoding. For instance, a "B-LOC" tag on a word means that the word is the beginning of a multi-word location entity.

Given a sentence as input, the states represent the tags assigned to the words. Starting from the initial state s_0, where no tag has been assigned, the search process explores the sequence tagging in left-to-right order. For each word, the actions are the legitimate tags that can be assigned to it, which depend on previous actions. For example, if the "S-PER" tag ("S" denotes a single-word entity) has been assigned to the previous word, then an action of labeling the current word with either "I-PER" or "E-PER" cannot be taken. The search reaches a terminal state when all words in the sentence have been tagged.
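As an illustration, γ(s) for this tagging scheme can be written down directly; the transition rules below are our rendering of the standard BIOES constraints (e.g., "I-X" or "E-X" may only follow "B-X" or "I-X" of the same type), not code from the paper.

    ENTITY_TYPES = ["PER", "LOC", "ORG", "MISC"]

    def feasible_tags(prev_tag):
        """gamma(s) for BIOES NER: the legal tags for the current word given
        the previous action (tag); prev_tag is None at the initial state s_0."""
        if prev_tag is None or prev_tag == "O" or prev_tag[0] in ("E", "S"):
            # Outside an entity: stay outside, or begin a new entity.
            return ["O"] + [p + "-" + t for t in ENTITY_TYPES for p in ("B", "S")]
        # Inside an entity (previous tag is B-X or I-X): continue or end it.
        entity_type = prev_tag.split("-")[1]
        return ["I-" + entity_type, "E-" + entity_type]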
3.2 Entity linking

The problem of entity linking (EL) is similar to NER, but instead of tagging a mention with one of a small set of generic entity types, the goal here is to ground the mention to a specific entity, stored in a knowledge base or described by a Wikipedia page. For example, consider the sentence "nfl news: draft results for giants" and assume that the mention candidates "nfl" and "giants" are given. A state reflects how we have assigned entity labels to these candidates. Following the same left-to-right order and starting from the empty assignment s_0, the first action is to assign an entity label to the first candidate "nfl". A legitimate action set can be all the entities that have been associated with this mention in the training set (e.g., "National Football League" or "National Fertilizers Limited"). Once the action is completed, the transition function moves the focus to the next mention candidate (i.e., "giants"). The search reaches a terminal state when all the candidate mentions in the sentence have been linked.

3.3 Semantic parsing

Our third targeted task is semantic parsing (SP), the task of mapping a text utterance to a formal meaning representation. In this paper, we focus on a specific type of semantic parsing problem that maps a natural language question to a structured query, which is executed on a knowledge base to retrieve the answer to the original question.

Figure 2 shows the semantic parses of an example question "who played meg in season 1 of family guy", assuming the knowledge base is Freebase (Bollacker et al., 2008). An entity linking component plays an important role by mapping "meg" to MegGriffin and "season 1 of family guy" to FamilyGuySeason1. Predicates like cast, actor and character also come from the knowledge base and define the relationships between these entities and the answer. Together, the semantic parse in λ-calculus is shown at the top of Figure 2. Equivalently, the semantic parse can be represented as a query graph (Figure 2, bottom), which is used in the STAGG system (Yih et al., 2015). The nodes are either grounded entities or variables, where x is the answer entity. The edges denote the relationships between two entities.

Regardless of the choice of formal language, the process of constructing the semantic parse is typically formulated as a search problem. A state is essentially a partial or complete semantic parse, and an action extends the current semantic parse by adding a new relation or constraint.

Different from previous systems, which treat entity linking as a static component, our search space consists of the search spaces of both entity linking and semantic parsing. That is, the search space is the union of the entity linking search space described in Section 3.2 and the search space of the semantic parses, which we describe below. Integrating the search spaces allows the model to use implicit signals to update both the semantic parsing and the entity linking systems. To the best of our knowledge, this is the first work that jointly learns the entity linking and semantic parsing systems.

Our search space is defined as follows. Starting from the initial state s_0, the model first explores the entity linking search space. Once the entity linking assignments are made (e.g., FamilyGuySeason1 in Figure 2), the second phase determines the main relationship between the topic entity and the answer (e.g., the cast-actor chain between FamilyGuySeason1 and x). Constraints (e.g., the character is MegGriffin) that describe additional properties the answer needs to have are added last. In this case, any state that is a legitimate semantic parse (consisting of one topic entity and one main relationship, as well as zero or more constraints) can lead to a terminal state.
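A schematic rendering of this staged joint search space is sketched below; the state fields and candidate generators are illustrative names of our own, not the STAGG implementation.

    from dataclasses import dataclass, field

    @dataclass
    class ParseState:
        """A (partial) joint state: entity links plus a growing query graph."""
        mentions: list                                   # mention candidates, linked left to right
        links: list = field(default_factory=list)        # chosen entity per mention
        relation: tuple = None                           # e.g., ("cast", "actor") chain to the answer x
        constraints: list = field(default_factory=list)  # e.g., ("character", "MegGriffin")

    def feasible_actions(state, entity_cands, relation_cands, constraint_cands):
        """gamma(s) for the joint EL + SP space: phase 1 links mentions, phase 2
        picks the main relationship, phase 3 optionally adds constraints."""
        if len(state.links) < len(state.mentions):
            mention = state.mentions[len(state.links)]
            return [("link", e) for e in entity_cands(mention)]
        if state.relation is None:
            return [("relation", r) for r in relation_cands(state.links)]
        # A legitimate parse may stop here (applying "stop" yields a state
        # with gamma(s) empty, i.e., a terminal state) or grow stricter.
        return [("constrain", c) for c in constraint_cands(state)] + [("stop",)]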

[Figure 3: For the question "who played meg in season 1 of family guy?", a candidate semantic parse s lists all the actors in "Family Guy Season 1": Y(s) = {Lacey Chabert, Seth MacFarlane, Alex Borstein, Seth Green, John Viener, Alec Sulkin}. Comparing Y(s) to the answer set A = {Lacey Chabert}, the precision is 1/6 and the recall is 1. The F1 score used for the reward is therefore 2/7.]

4 Maximum Margin Reward Networks

In this section, we introduce the learning framework of MMRN, which includes two main components: reward and max-margin loss. The former is a mechanism for using implicit and explicit supervision signals in a unified way; the latter formally defines the learning objective.

4.1 Reward

The key insight of MMRN is that different types of supervision signals can be represented with an appropriate design of the reward function. A reward function is defined over a state–action pair, R(s, a), representing the true quality of taking action a in state s. The reward of a path is then defined as R(h) = Σ_{i=0}^{k} R(s_i, a_i). Intuitively, when annotated action sequences (explicit supervision signals) exist, the model only needs to learn to imitate the annotated sequence. For instance, when learning NER in the fully supervised setting, the equivalent of using the Hamming distance is to define the reward R(s, a) to be 1 if a matches the annotated sequence at the current state, and 0 otherwise.

In the setting where only implicit supervision is available, the reward function can still be designed to capture the signals. For instance, when only question–answer pairs exist for learning the semantic parser, the reward can be defined by comparing the answers derived from a candidate parse with the labeled answers. More formally, assume that s = τ(s', a) is the state after applying action a to state s'. Let Y(s) be the set of predicted answers generated from state s, with Y(s) = ∅ when s is not a legitimate semantic parse. The reward function R(s', a) can then be defined by comparing Y(s) with the labeled answers, A, of the input question. While a set similarity function like the Jaccard coefficient could be used as the reward function, we chose the F1 score in this work, as it was used as the evaluation metric in previous work (Berant et al., 2013). Figure 3 shows an example of this reward function.
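Since this reward only compares two answer sets, it takes a few lines of code; the sketch below is ours, mirroring the worked example in the caption of Figure 3.

    def f1_reward(predicted, labeled):
        """F1 between the answers Y(s) retrieved by a parse and the labeled
        set A; an illegitimate parse has Y(s) empty and therefore reward 0."""
        predicted, labeled = set(predicted), set(labeled)
        overlap = len(predicted & labeled)
        if overlap == 0:
            return 0.0
        precision = overlap / len(predicted)
        recall = overlap / len(labeled)
        return 2 * precision * recall / (precision + recall)

    # The example of Figure 3: six retrieved actors, one labeled answer.
    cast = {"Lacey Chabert", "Seth MacFarlane", "Alex Borstein",
            "Seth Green", "John Viener", "Alec Sulkin"}
    assert abs(f1_reward(cast, {"Lacey Chabert"}) - 2 / 7) < 1e-9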
4.2 Max-Margin Loss & Learning Algorithm

The MMRN learning algorithm can be viewed as an extension of M³N (Taskar et al., 2004) and structured SVM (Joachims et al., 2009; Yu and Joachims, 2009). The learning algorithm takes three steps, where the first two involve two different search procedures. The final step updates the model with respect to the inference results.

Finding the best path  The first search step is to find the best path h* by solving the following optimization problem:

    h* = arg max_{h ∈ E(x)} R(h; y) + ε·f_θ(h).    (1)

The first term selects the path that has the highest reward. Because it is possible that several paths share the same reward, the second term leverages the current model and serves as the tie-breaker, where ε is a hyper-parameter set to a small positive number in our experiments.

When explicit supervision is available, solving Eq. (1) is trivial: the search simply returns the annotated sequence. In the case of implicit supervision, where true rewards are only revealed for complete action sequences, the search problem becomes difficult, as the rewards of early state–action pairs are zero.

In this situation, the search algorithm uses the model score f_θ to guide the search. One possible design is to use beam search for the optimization problem, where the search procedure follows the current model in the early stage (given that R(h) = 0). After generating several complete action sequences, the true reward function is then used to find h*. The tie-breaker also picks the best sequence when multiple sequences lead to the same reward. Note that h* can change between iterations because of the tie-breaker.

Finding the most violated path  Once h* is found, it is used as our reference path. We would like to update the model so that the scoring function f_θ behaves similarly to the reward R. More formally, we aim to update the model parameters θ to satisfy the following constraint:

    f_θ(h*) − f_θ(h) ≥ R(h*) − R(h), ∀h.

The constraint implies that the "best" action sequence should rank higher than any other sequence by a margin computed from the rewards, R(h*) − R(h). The degree of violation of this constraint with respect to h is thus (R(h*) − R(h)) − (f_θ(h*) − f_θ(h)) = f_θ(h) − R(h) − f_θ(h*) + R(h*). The max-margin loss is defined accordingly:

    L(h, h*) = max(f_θ(h) − R(h) − f_θ(h*) + R(h*), 0).

L(h, h*) is our optimization objective: we want to update the model by fixing the biggest violation. Note that the associated constraint is violated only when L(h, h*) is positive. Finding the path ĥ that maximizes the violation in this step is equivalent to maximizing f_θ(h) − R(h), given that the remaining terms are constant with respect to h.

When only explicit supervision signals exist, our objective function reduces to the one for optimizing structured SVM without regularization. For implicit signals, we find h* approximately before we optimize the margin loss. In this case, the search is not exact, as the reward signals are delayed. Nevertheless, we found the margin loss worked well empirically, as it generally kept decreasing until becoming stable.

Algorithm 1 summarizes the learning procedure of MMRN. Search is used in both Line 2 and Line 3. In Line 4, the algorithm performs a gradient update to modify all the model parameters.

Algorithm 1 Maximum Margin Reward Networks
1: for a random labeled example (x, y) do
2:   h* ← arg max_{h ∈ E(x)} R(h; y) + ε·f_θ(h)
3:   ĥ ← arg max_{h ∈ E(x)} f_θ(h) − R(h; y)
4:   update θ by minimizing L(ĥ, h*)
5: end for
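One iteration of Algorithm 1 can be sketched as follows. Here search stands for a (beam-search) approximation of the two arg max problems, path_score and path_reward stand for f_θ(h) and R(h; y), and the gradient step is left to whatever autodiff framework implements f_θ; the structure follows the algorithm, while the names are our own.

    def mmrn_update(x, y, search, path_score, path_reward, optimizer, eps=1e-3):
        """One step of Algorithm 1 on a labeled example (x, y)."""
        # Line 2: reference path, reward plus a small model-score tie-breaker.
        h_star = search(x, objective=lambda h: path_reward(h, y) + eps * path_score(h))
        # Line 3: most violated path, maximizing f_theta(h) - R(h; y).
        h_hat = search(x, objective=lambda h: path_score(h) - path_reward(h, y))
        # Line 4: hinge loss L(h_hat, h_star); positive only when the margin
        # constraint f(h*) - f(h) >= R(h*) - R(h) is violated.
        loss = max(path_score(h_hat) - path_reward(h_hat, y)
                   - path_score(h_star) + path_reward(h_star, y), 0.0)
        if loss > 0:
            optimizer.step(loss)  # back-propagate through f_theta on both paths
        return loss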
4.3 Practical Considerations

Although the learning algorithm of MMRN is simple and general, the quality of the learned model is dictated by the effectiveness of the search procedure. Increasing the beam size generally helps improve the model, but also slows down training, and has a limited effect when dealing with a large search space. Domain-specific heuristics for pruning the search space should thus be used when available. For instance, in the semantic parsing task, when the reward of a legitimate semantic parse is 0, it implies that none of the derived answers is included in the labeled answer set. When all possible follow-up actions can only make the semantic parse stricter (e.g., adding constraints), and thus result in a subset of the currently derived answers, it is clear that the rewards of all these new states are 0 as well. Paths from such a state can therefore be pruned (see the sketch below).

Another strategy for improving search quality is to use an approximated reward in the early stage of search. Very often the true rewards at this stage are 0 and are not useful for guiding the search to find the best path. The approximated reward function can be thought of as estimating whether a high-reward state is reachable from the current state. The effectiveness of this strategy has been demonstrated by several recent efforts (Mnih et al., 2013; Krishnamurthy et al., 2015; Silver et al., 2016; Narasimhan et al., 2016).
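The pruning rule above fits in a one-line predicate: once a legitimate parse retrieves no labeled answer, and every follow-up action can only restrict the answer set, all of its descendants also have reward 0. A sketch under that monotonicity assumption, where execute is a hypothetical hook that runs a partial parse against the KB:

    def can_prune(state, labeled_answers, execute, actions_only_restrict):
        """True if the subtree rooted at this legitimate parse can be skipped."""
        retrieved = execute(state)                # Y(s)
        return (actions_only_restrict             # e.g., only constraints remain
                and len(retrieved) > 0            # legitimate parse
                and not set(retrieved) & set(labeled_answers))  # reward already 0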

5 Neural Architectures

While the learning algorithm of MMRN described in Sec. 4 is general, the exact model design is task-dependent. In this section, we describe in detail the neural network architectures for the three targeted tasks: named entity recognition, entity linking and semantic parsing.

[Figure 4: The action scoring model for NER: f_θ(s, a) is the dot product of a state embedding, determined by the input x, the current word index (set by the state s), and the previous action embedding, with an action embedding determined by the tag type. Figure 5: The action scoring model for EL: two hidden layers over statistic features and the input x, where the state s determines the mention index m and the action a determines the entity index, whose embedding is the average of its entity type embeddings.]

5.1 Named Entity Recognition

Recall that NER is formulated as a sequence labeling problem, and each action labels a word with a tag in the BIOES encoding (cf. Sec. 3.1).

The model of the action scoring function f_θ(s, a) is depicted in Figure 4; it is essentially the dot product of the action embedding and the state embedding. The action embedding is initialized randomly for each action, but can be fine-tuned during training (i.e., by back-propagating the error through the network and updating the word/entity type embeddings). The state embedding is the concatenation of the bi-LSTM word embedding of the current word, the character-based word embedding, and the embedding of the previous action. We also include the orthographic embeddings proposed by Limsopatham and Collier (2016).
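A minimal PyTorch rendering of this scorer is given below: f_θ(s, a) as the dot product between a state embedding (the bi-LSTM output at the current word concatenated with the previous-action embedding) and a learned action embedding. The dimensions and module layout are our own choices; the character-based and orthographic embeddings of the full model are omitted for brevity.

    import torch
    import torch.nn as nn

    class NERActionScorer(nn.Module):
        """f_theta(s, a) = <state embedding, action embedding>; cf. Figure 4."""
        def __init__(self, vocab_size, num_tags, word_dim=100, hidden_dim=100):
            super().__init__()
            self.word_emb = nn.Embedding(vocab_size, word_dim)
            self.bilstm = nn.LSTM(word_dim, hidden_dim,
                                  bidirectional=True, batch_first=True)
            state_dim = 2 * hidden_dim + word_dim   # bi-LSTM output + previous action
            self.prev_action_emb = nn.Embedding(num_tags + 1, word_dim)  # +1: no previous tag
            self.action_emb = nn.Embedding(num_tags, state_dim)          # randomly initialized

        def forward(self, word_ids, position, prev_action, action):
            # word_ids: (1, sentence_len); position, prev_action, action: scalar LongTensors
            context, _ = self.bilstm(self.word_emb(word_ids))  # (1, len, 2 * hidden_dim)
            state = torch.cat([context[0, position],
                               self.prev_action_emb(prev_action)], dim=-1)
            return torch.dot(state, self.action_emb(action))   # scalar f_theta(s, a)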

5.2 Entity Linking

An action in entity linking determines whether a mention should be linked to a particular entity (cf. Sec. 3.2). As shown in Figure 5, we design the scoring function as a feed-forward neural network that takes three different input vectors: (1) surface features from hand-crafted mention–entity statistics, similar to the ones used in (Yang and Chang, 2015); (2) mention context embeddings from a bidirectional LSTM module; (3) entity embeddings constructed from entity type embeddings. All these embeddings, except the feature vectors, are fine-tuned during training.

Some unique properties of our entity linking model are worth noting. First, we add mention context embeddings from a bidirectional LSTM module as additional input. While using LSTMs is common practice for sequence labeling, it is not usually done for short-text entity linking. For each mention, we only extract the output of the bi-LSTM module at the start and end tokens of the mention, and concatenate them as the mention context embedding. Second, we construct entity embeddings as the average of an entity's Freebase (Bollacker et al., 2008) type embeddings², initialized using pre-trained embeddings. Adding these two types of embeddings has been shown to improve the performance in our experiments.

²We use only the 358 most frequent Freebase entity types.

5.3 Semantic Parsing

Our semantic parsing model follows the STAGG system (Yih et al., 2015), which uses a stage-wise search procedure to expand the candidate semantic parses gradually (cf. Sec. 3.3). Compared to the original system, we make two notable changes. First, we use a two-layer feed-forward neural network to replace the original linear ranker that scores the candidate semantic parses. Second, instead of using a separately trained entity linking system, we incorporate our entity linking network described in Sec. 5.2 as part of the semantic parsing model. The training process thus fine-tunes the entity linking component to improve the semantic parsing system.

6 Experiments

It is important for a general machine learning model to work with both implicit and explicit supervision signals. We validate our learning framework when explicit supervision signals are present, and also demonstrate support for the scenario where supervision signals are mixed. Specifically, in this section we report the experimental results of MMRN on named entity recognition and entity linking, both using explicit supervision, and on semantic parsing, using implicit supervision. In all our experiments, we tuned hyper-parameters on the development set (for each task respectively), and then re-trained the models on the combination of the training and development sets.

6.1 Named entity recognition

We use the CoNLL-2003 shared task data for the NER experiments, where the standard evaluation metric is the F1 score. The pre-trained word embeddings are 100-dimensional GloVe vectors trained on 6 billion tokens (Pennington et al., 2014)³. The search procedure is conducted using beam search, and the reward function is simply the number of correct tag assignments to the words.

The results are shown in Table 1, compared with recently proposed systems based on neural models. When the beam size is set to 20, MMRN achieves 91.4, the best published result so far (without using any gazetteers). Notice that when the beam size is 5, the performance drops to 90.03. This demonstrates the importance of search quality when applying MMRN.

System                     F1
Collobert et al. (2011)    89.59
Huang et al. (2015)        90.10
Chiu and Nichols (2015)    90.77
Ratinov and Roth (2009)    90.88
Lample et al. (2016)       90.94
Ma and Hovy (2016)         91.21
MMRN-NER (Beam = 5)        90.03
MMRN-NER (Beam = 20)       91.39

Table 1: Explicit Supervision: Named Entity Recognition. Our MMRN with beam size 20 outperforms the current best systems, which are based on neural networks.

6.2 Entity linking

For entity linking, we adopt two publicly available datasets for tweet entity linking: NEEL (Cano et al., 2014)⁴ and TACL (Guo et al., 2013; Fang and Chang, 2014; Yang and Chang, 2015; Yang et al., 2016). We follow prior work (Guo et al., 2013; Yang and Chang, 2015) and perform the standard evaluation for an end-to-end entity linking system by computing precision, recall, and F1 scores according to the entity references and the system output. An output entity is considered correct if it matches the gold entity and the mention boundary overlaps with the gold mention boundary. Interested readers can refer to (Carmel et al., 2014) for more detail.

We initialize the word embeddings from pre-trained GloVe vectors trained on the Twitter corpus, and the type embeddings from the pre-trained skip-gram model (Mikolov et al., 2013)⁵. The sizes of both embeddings are set to 200. Inference is done using a dynamic programming algorithm.

Results of the entity linking experiments are presented in Table 2, compared with those of S-MART (Yang and Chang, 2015)⁶ and NTEL (Yang et al., 2016)⁷, two state-of-the-art entity linking systems for short texts. Our MMRN-EL is comparable to the best system. We also conducted two ablation studies by removing the entity type vectors (MMRN-EL - Entity) and by removing the LSTM vectors (MMRN-EL - LSTM). Both show significant performance drops, which validates the importance of these two additional input vectors.

System             NEEL-Test F1   TACL F1
S-MART             77.7           63.6
NTEL               77.9           68.1
MMRN-EL            78.5           67.5
MMRN-EL - Entity   77.4           66.5
MMRN-EL - LSTM     76.6           66.0

Table 2: Explicit Supervision: Entity Linking. Our system trained with MMRN is comparable to the state-of-the-art NTEL system.

³Available at http://nlp.stanford.edu/projects/glove/
⁴The NEEL dataset was originally created for an entity linking competition: http://microposts2016.seas.upenn.edu/challenge.html
⁵Available at https://code.google.com/archive/p/word2vec/
⁶The winning system of the NEEL challenge.
⁷For a fair comparison, we compare to the results of NTEL that do not use pretrained user embeddings.
Because there can be multiple 6.2 Entity linking answers to a question, the quality of a semantic For entity linking, we adopt two publicly avail- parser is measured using the averaged F1 score of able datasets for tweet entity linking: NEEL (Cano the predicted answers. et al., 2014)4 and TACL (Guo et al., 2013; Fang 5Available at https://code.google.com/archive/p/word2vec/ 3Available at http://nlp.stanford.edu/projects/glove/ 6The winning system of the NEEL challenge. 4NEEL dataset was originally created for an entity link- 7To have a fair comparison, we compare to the results of ing competition: http://microposts2016.seas. NTEL which do not use pretrained user embedding. upenn.edu/challenge.html 8Available at http://aka.ms/WebQSP

We experiment with two configurations of incorporating the entity linking component. MMRN-PIPELINE trains an MMRN-EL model on the entity linking labels in WebQSP separately. Given a question, its entities are first predicted and then used as input to the semantic parsing system. In contrast, MMRN-JOINT incorporates the MMRN-EL model into the whole framework. During this joint training process, 15 entity link results are sampled according to the current MMRN-EL model and passed to the downstream networks. In both cases, we use the entity linking model previously trained on the NEEL dataset to initialize the parameters. As discussed in Sec. 4.1, in this implicit supervision setting, we directly set the (delayed) reward function to be the F1 score, obtained by comparing the annotated answers with the predicted answers.
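A hedged sketch of that sampling step follows; the paper states only that 15 assignments are sampled according to the current model, so the softmax-over-scores choice and the function names are our assumptions.

    import math
    import random

    def sample_entity_links(mention_candidates, el_score, k=15):
        """Draw k entity assignments for a question, mention by mention, with
        probability proportional to exp(score) under the current MMRN-EL model.
        (Scores assumed moderate; subtract the max for numerical stability.)"""
        samples = []
        for _ in range(k):
            assignment = []
            for mention, entities in mention_candidates:
                weights = [math.exp(el_score(mention, e)) for e in entities]
                assignment.append(random.choices(entities, weights=weights, k=1)[0])
            samples.append(assignment)
        return samples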
                 SP         EL
                 Avg. F1    P      R      F1
MMRN-PIPELINE    62.5       85.6   77.5   81.3
MMRN-JOINT       68.1       89.3   78.9   83.7
REINFORCE        62.9       87.5   76.6   81.7
REINFORCE+       66.7       91.1   76.9   83.4
STAGG            66.8       –      –      –

Table 3: Implicit Supervision: Semantic Parsing. By updating the entity linking and semantic parsing models jointly, MMRN-JOINT improves over MMRN-PIPELINE by 5 points in F1 and outperforms REINFORCE+ (SP). It also improves the entity linking results on the WebQSP questions (EL).

Table 3 summarizes the results of the MMRN-based semantic parsing systems and other strong baselines. The SP column reports the averaged F1 scores. Compared to the pipeline approach (MMRN-PIPELINE), the joint learning framework (MMRN-JOINT) improves significantly, reaching 68.1% F1. To compare different learning methods, we also apply REINFORCE (Williams, 1992), a popular policy gradient algorithm, to train our joint model using the same setting and reward function.⁹ MMRN-JOINT outperforms REINFORCE and its variant REINFORCE+, which re-normalizes the probabilities of the sampled candidate sequences. Its result is also better than the state-of-the-art STAGG system. Note that we use the same architectures and initialization procedures for MMRN-PIPELINE/JOINT and REINFORCE/REINFORCE+. The superior performance of MMRN-JOINT therefore shows that joint learning plays a crucial role beyond the choice of architecture. Comparing to STAGG, note that Yih et al. (2016) did not jointly train the entity linker and semantic parser, but they did improve their results by taking the top 10 predictions of their entity linking system for re-ranking parses. Our algorithm further allows updating the entity linker with the labels for semantic parsing, and shows superior performance.

Our joint model also improves the entity linking predictions on the questions in WebQSP using the implicit signals (the EL columns in Table 3). The entity linking F1 score of MMRN-JOINT is 2.4 points higher than the MMRN-PIPELINE baseline. Note that the entity linking results of MMRN-PIPELINE (row 1) are exactly the results of the entity linking component MMRN-EL. The result is also better than REINFORCE, and comparable to REINFORCE+.

Recently, Liang et al. (2016) proposed the Neural Symbolic Machine (NSM) and reported the best result of 69.0 F1 on the WebQSP dataset under the weak supervision setting.¹⁰ The NSM architecture for semantic parsing is significantly different from the architecture used in (Yih et al., 2016) and the one used in this paper. In contrast, MMRN is a general learning framework that allows joint training of existing models (i.e., the entity linking and semantic parsing modules). This allows MMRN to use the labels of the semantic parsing task as implicit supervision signals for the entity linking module. It would be interesting to apply MMRN to the newly proposed architectures as well.

⁹The REINFORCE algorithm uses warm initialization: the entity linking parameters are initialized using the model trained on the NEEL dataset.
¹⁰That paper was published after the submission of this paper.

7 Discussion

We discuss several issues that are highly related to MMRN in this section.

Learning to Search  There are two main differences between MMRN and search-based algorithms such as SEARN (Daumé et al., 2009) and DAGGER (Ross et al., 2011). First, both SEARN and DAGGER focus on imitation learning, assuming explicit supervision signals exist. They use a two-step model learning approach: (1) create cost-sensitive examples by listing state–action pairs and their corresponding (estimated) losses; (2) apply cost-aware training algorithms. In contrast, MMRN directly updates the parameters using back-propagation based on the search results for each example. Second, SEARN mixes the optimal and current policies during learning, while MMRN performs search twice and simply pushes the current policy towards the optimal one. Recently, Chang et al. (2015) extended this line of work and discussed different roll-in and roll-out strategies during training for structured contextual bandit settings. As MMRN uses two search procedures, there is no need to mix different search policies.

Reinforcement Learning  In many reinforcement learning scenarios, the search space is not fully controllable by the agent. For example, a chess-playing agent cannot control the move made by its opponent, and has to commit to a single move and wait for the opponent. The agent can still think ahead and build a search tree, but only one move can be made in the end. In contrast, in scenarios like semantic parsing, the whole search space is controlled by the agent itself. Therefore, from the initial state, we can explore several search paths and obtain their real rewards. This may explain why MMRN can be more efficient than REINFORCE, as MMRN can use the reward signals of multiple paths more effectively. In addition, MMRN is not a probabilistic model, so it does not need to handle normalization issues, which often cause large variance in estimating the gradient direction when optimizing the expected reward.

Semantic Parsing  MMRN can be applied to many semantic parsing tasks. One key step is to design the right approximated reward for a given task to guide the beam search towards the reference parses in MMRN, given that the actual reward is often very sparse. In our companion paper (Iyyer et al., 2017), we used a simple form of approximated reward to obtain feedback as early as possible during search: the semantic parse is executed as soon as it is executable (even if it is not yet complete), and the execution results are used to compute the Jaccard coefficient with respect to the labeled answers as the approximated reward. The use of approximated rewards proved effective in (Iyyer et al., 2017).

An important research direction for semantic parsing is to reduce the supervision cost. In (Yih et al., 2016), the authors demonstrated that labeling semantic parses is possible, and often more effective, with a sophisticated labeling interface. However, collecting answers may still be easier or faster for certain problems or annotators. This suggests that we could allow annotators to choose whether to label semantic parses or answers in order to minimize the supervision cost. MMRN would be an ideal learning algorithm for this scenario.

8 Conclusion

This paper proposes Maximum Margin Reward Networks, a structured learning framework that can learn from both explicit and implicit supervision signals. In the future, we plan to apply Maximum Margin Reward Networks to other structured learning tasks. Improving MMRN for dealing with large search spaces is an important future direction as well.

Acknowledgments

We thank the anonymous reviewers for their insightful comments. The first author is partly sponsored by DARPA under agreement number FA8750-13-2-0008. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA or the U.S. Government.
References

Yoav Artzi and Luke Zettlemoyer. 2013. Weakly supervised learning of semantic parsers for mapping instructions to actions. TACL.

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In EMNLP.

Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In ICDM.

Amparo E. Cano, Giuseppe Rizzo, Andrea Varga, Matthew Rowe, Milan Stankovic, and Aba-Sah Dadzie. 2014. Making sense of microposts (#microposts2014): named entity extraction & linking challenge. In CEUR Workshop.

David Carmel, Ming-Wei Chang, Evgeniy Gabrilovich, Bo-June Paul Hsu, and Kuansan Wang. 2014. ERD'14: entity recognition and disambiguation challenge. In ACM SIGIR Forum.

Kai-Wei Chang, Akshay Krishnamurthy, Alekh Agarwal, Hal Daumé III, and John Langford. 2015. Learning to search better than your teacher. In ICML.

Jason P. C. Chiu and Eric Nichols. 2015. Named entity recognition with bidirectional LSTM-CNNs. arXiv preprint arXiv:1511.08308.

James Clarke, Dan Goldwasser, Ming-Wei Chang, and Dan Roth. 2010. Driving semantic parsing from the world's response. In CoNLL.

Michael Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In SIGDAT.

Michael Collins and Brian Roark. 2004. Incremental parsing with the perceptron algorithm. In ACL.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. JMLR.

Hal Daumé, John Langford, and Daniel Marcu. 2009. Search-based structured prediction. Machine Learning.

Hal Daumé and Daniel Marcu. 2005. Learning as search optimization: approximate large margin methods for structured prediction. In ICML.

Li Dong and Mirella Lapata. 2016. Language to logical form with neural attention. In ACL.

Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. 2015. Transition-based dependency parsing with stack long short-term memory. In ACL.

Yuan Fang and Ming-Wei Chang. 2014. Entity linking on microblogs with spatial and temporal signals. TACL.

Yuhang Guo, Bing Qin, Ting Liu, and Sheng Li. 2013. Microblog entity linking by leveraging extra posts. In EMNLP.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation.

Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991.

Mohit Iyyer, Wen-tau Yih, and Ming-Wei Chang. 2017. Search-based neural structured learning for sequential question answering. In ACL.

Thorsten Joachims, Thomas Finley, and Chun-Nam John Yu. 2009. Cutting-plane training of structural SVMs. Machine Learning.

Akshay Krishnamurthy and Hal Daumé III. 2015. Learning to search better than your teacher. arXiv preprint arXiv:1502.02206.

John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. arXiv preprint arXiv:1603.01360.

Chen Liang, Jonathan Berant, Quoc Le, Kenneth D. Forbus, and Ni Lao. 2016. Neural symbolic machines: Learning semantic parsers on Freebase with weak supervision. arXiv preprint arXiv:1611.00020.

Nut Limsopatham and Nigel Collier. 2016. Bidirectional LSTM for named entity recognition in twitter messages. In WNUT.

Wang Ling, Tiago Luís, Luís Marujo, Ramón Fernandez Astudillo, Silvio Amir, Chris Dyer, Alan W. Black, and Isabel Trancoso. 2015. Finding function in form: Compositional character models for open vocabulary word representation. In EMNLP.

Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. arXiv preprint arXiv:1603.01354.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In NIPS.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.

Karthik Narasimhan, Adam Yala, and Regina Barzilay. 2016. Improving information extraction by acquiring external evidence with reinforcement learning. In EMNLP.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP.

Ariadna Quattoni, Sybor Wang, Louis-Philippe Morency, Michael Collins, and Trevor Darrell. 2007. Hidden conditional random fields. PAMI.

Lev Ratinov and Dan Roth. 2009. Design challenges and misconceptions in named entity recognition. In CoNLL.

Stéphane Ross, Geoffrey J. Gordon, and Drew Bagnell. 2011. A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS.

David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature.

Ben Taskar, Carlos Guestrin, and Daphne Koller. 2004. Max-margin Markov networks. In NIPS.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In NAACL.

Shyam Upadhyay, Ming-Wei Chang, Kai-Wei Chang, and Wen-tau Yih. 2016. Learning from explicit and implicit supervision jointly for algebra word problems. In EMNLP.

Oriol Vinyals, Łukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. 2015. Grammar as a foreign language. In NIPS.

Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning.

Yi Yang and Ming-Wei Chang. 2015. S-MART: Novel tree-based structured learning algorithms applied to tweet entity linking. In ACL.

Yi Yang, Ming-Wei Chang, and Jacob Eisenstein. 2016. Toward socially-infused information extraction: Embedding authors, mentions, and entities. In EMNLP.

Wen-tau Yih, Ming-Wei Chang, Xiaodong He, and Jianfeng Gao. 2015. Semantic parsing via staged query graph generation: Question answering with knowledge base. In ACL.

Wen-tau Yih, Matthew Richardson, Christopher Meek, Ming-Wei Chang, and Jina Suh. 2016. The value of semantic parse labeling for knowledge base question answering. In ACL.

Chun-Nam John Yu and Thorsten Joachims. 2009. Learning structural SVMs with latent variables. In ICML.
