Benchmarking Approximate Inference Methods for Neural Structured Prediction

Lifu Tu    Kevin Gimpel
Toyota Technological Institute at Chicago, Chicago, IL, 60637, USA
{lifu,kgimpel}@ttic.edu

Abstract

Exact structured inference with neural network scoring functions is computationally challenging, but several methods have been proposed for approximating inference. One approach is to perform gradient descent with respect to the output structure directly (Belanger and McCallum, 2016). Another approach, proposed recently, is to train a neural network (an "inference network") to perform inference (Tu and Gimpel, 2018). In this paper, we compare these two families of inference methods on three sequence labeling datasets. We choose sequence labeling because it permits us to use exact inference as a benchmark in terms of speed, accuracy, and search error. Across datasets, we demonstrate that inference networks achieve a better speed/accuracy/search error trade-off than gradient descent, while also being faster than exact inference at similar accuracy levels. We find further benefit by combining inference networks and gradient descent, using the former to provide a warm start for the latter.[1]

[1] Code is available at github.com/lifu-tu/BenchmarkingApproximateInference

1 Introduction

Structured prediction models commonly involve complex inference problems for which finding exact solutions is intractable (Cooper, 1990). There are generally two ways to address this difficulty. One is to restrict the model family to those for which inference is feasible. For example, state-of-the-art methods for sequence labeling use structured energies that decompose into label-pair potentials and then use rich neural network architectures to define the potentials (Collobert et al., 2011; Lample et al., 2016; Ma and Hovy, 2016, inter alia). Exact dynamic programming algorithms like the Viterbi algorithm can then be used for inference.

The second approach is to retain computationally-intractable scoring functions but then use approximate methods for inference. For example, some researchers relax the structured output space from a discrete space to a continuous one and then use gradient descent to maximize the score function with respect to the output (Belanger and McCallum, 2016). Another approach is to train a neural network (an "inference network") to output a structure in the relaxed space that has high score under the structured scoring function (Tu and Gimpel, 2018). This idea was proposed as an alternative to gradient descent in the context of structured prediction energy networks (Belanger and McCallum, 2016).

In this paper, we empirically compare exact inference, gradient descent, and inference networks for three sequence labeling tasks. We train conditional random fields (CRFs) for sequence labeling with neural networks used to define the potentials. We choose a scoring function that permits exact inference via Viterbi so that we can benchmark the approximate methods in terms of search error in addition to speed and accuracy. We consider three families of neural network architectures to serve as inference networks: convolutional neural networks (CNNs), recurrent neural networks (RNNs), and sequence-to-sequence models with attention (seq2seq; Sutskever et al., 2014; Bahdanau et al., 2015). We also use multi-task learning while training inference networks, combining the structured scoring function with a local cross entropy loss.

Our empirical findings can be summarized as follows. Gradient descent works reasonably well for tasks with small label sets and primarily local structure, like part-of-speech tagging. However, gradient descent struggles on tasks with long-distance dependencies, even with small label set sizes. For tasks with large label set sizes, inference networks and Viterbi perform comparably, with Viterbi taking much longer. In this regime, it is difficult for gradient descent to find a good solution, even with many iterations.

In comparing inference network architectures, we find that (1) CNNs are the best choice for tasks with primarily local structure, like part-of-speech tagging; (2) RNNs can handle longer-distance dependencies while still offering high decoding speeds; and (3) seq2seq networks consistently work better than RNNs, but are also the most computationally expensive.

We also compare search error between gradient descent and inference networks and measure correlations with input likelihood. We find that inference networks achieve lower search error on instances with higher likelihood (under a pretrained language model), while for gradient descent the correlation between search error and likelihood is closer to zero. This shows the impact of the use of dataset-based learning of inference networks, i.e., they are more effective at amortizing inference for more common inputs.

Finally, we experiment with two refinements of inference networks. The first fine-tunes the inference network parameters for a single test example to minimize the energy of its output. The second uses an inference network to provide a warm start for gradient descent. Both lead to reductions in search error and higher accuracies for certain tasks, with the warm start method leading to a better speed/accuracy trade-off.

2 Sequence Models

For sequence labeling tasks, given an input sequence x = ⟨x1, x2, ..., x|x|⟩, we wish to output a sequence y = ⟨y1, y2, ..., y|x|⟩ ∈ Y(x). Here Y(x) is the structured output space for x. Each label y_t is represented as an L-dimensional one-hot vector, where L is the number of labels.

Conditional random fields (CRFs; Lafferty et al., 2001) form one popular class of methods for structured prediction, especially for sequence labeling. We define our structured energy function to be similar to those often used in CRFs for sequence labeling:

    EΘ(x, y) = − ( Σ_t Σ_{i=1..L} y_{t,i} (u_i⊤ f(x, t)) + Σ_t y_{t−1}⊤ W y_t )

where y_{t,i} is the ith entry of the vector y_t. In the standard discrete-label setting, each y_t is a one-hot vector, but this energy is generalized to be able to use both discrete labels and continuous relaxations of the label space, which we will introduce below. Also, we use f(x, t) ∈ R^d to denote the "input feature vector" for position t, u_i ∈ R^d is a label-specific parameter vector used for modeling the local scoring function, and W ∈ R^{L×L} is a parameter matrix learned to model label transitions. For the feature vectors we use a bidirectional long short-term memory network (BLSTM; Hochreiter and Schmidhuber, 1997), so this forms a BLSTM-CRF (Lample et al., 2016; Ma and Hovy, 2016).
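To make the energy concrete, below is a minimal PyTorch-style sketch of computing EΘ(x, y) for one sentence, assuming the feature vectors f(x, t) are stacked into a matrix and the label-specific vectors u_i into a matrix U; all names and shapes are illustrative rather than the authors' implementation.

```python
import torch

def crf_energy(feats, Y, U, W):
    """Energy of a label sequence under the BLSTM-CRF scoring function.

    feats: (T, d) tensor of input feature vectors f(x, t), e.g., BLSTM outputs.
    Y:     (T, L) tensor whose rows are one-hot labels or relaxed label distributions.
    U:     (L, d) tensor stacking the label-specific parameter vectors u_i.
    W:     (L, L) transition parameter matrix.
    """
    local = (Y * (feats @ U.t())).sum()   # sum_t sum_i y_{t,i} (u_i^T f(x, t))
    trans = ((Y[:-1] @ W) * Y[1:]).sum()  # sum_t y_{t-1}^T W y_t
    return -(local + trans)               # energy is the negated total score
```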
For training, we use the standard conditional log-likelihood objective for CRFs, using the forward and backward dynamic programming algorithms to compute gradients. For a given input x at test time, prediction is done by choosing the output with the lowest energy:

    argmin_{y ∈ Y(x)} EΘ(x, y)

The Viterbi algorithm can be used to solve this problem exactly for the energy defined above.
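For reference, the following sketch shows standard Viterbi decoding for this energy, reusing the U and W parameterization from the energy sketch above (minimizing the energy is the same as maximizing the summed local and transition scores); it is illustrative, not the authors' code.

```python
import torch

def viterbi_decode(feats, U, W):
    """Exact argmin over discrete label sequences for the energy above."""
    emit = feats @ U.t()                  # (T, L) local scores u_i^T f(x, t)
    T, L = emit.shape
    score = emit[0].clone()               # best score of any path ending in each label
    backpointers = []
    for t in range(1, T):
        # total[j, i] = best score ending in j at t-1, plus transition j->i, plus emission of i
        total = score.unsqueeze(1) + W + emit[t].unsqueeze(0)
        score, idx = total.max(dim=0)     # maximize over the previous label j
        backpointers.append(idx)
    best = [int(score.argmax())]
    for idx in reversed(backpointers):    # follow backpointers from the end
        best.append(int(idx[best[-1]]))
    return list(reversed(best))           # O(T L^2) time, the usual Viterbi cost
```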

2.1 Modeling Improvements: BLSTM-CRF+

For our experimental comparison, we consider two CRF variants. The first is the basic model described above, which we refer to as BLSTM-CRF. Below we describe three additional techniques that we add to the basic model. We will refer to the CRF with these three techniques as BLSTM-CRF+. Using these two models permits us to assess the impact of model complexity and performance level on the inference method comparison.

Fine-Tuning. We used pretrained, fixed word embeddings when using the BLSTM-CRF model, but for the more complex BLSTM-CRF+ model, we fine-tune the pretrained word embeddings during training.

Character-Based Embeddings. Character-based word embeddings provide consistent improvements in sequence labeling (Lample et al., 2016; Ma and Hovy, 2016). In addition to pretrained word embeddings, we produce a character-based embedding for each word using a character convolutional network like that of Ma and Hovy (2016). The filter size is 3 characters and the character embedding dimensionality is 30. We use max pooling over the character sequence in the word, and the resulting embedding is concatenated with the word embedding before being passed to the BLSTM.

Dropout. We also add dropout during training (Hinton et al., 2012). Dropout is applied before the character embeddings are fed into the CNNs, at the final word embedding layer before the input to the BLSTM, and after the BLSTM. The dropout rate is 0.5 for all experiments.

3 Gradient Descent for Inference

To use gradient descent (GD) for structured inference, researchers typically relax the output space from a discrete, combinatorial space to a continuous one and then use gradient descent to solve the following optimization problem:

    argmin_{y ∈ Y_R(x)} EΘ(x, y)

where Y_R is the relaxed continuous output space. For sequence labeling, Y_R(x) consists of length-|x| sequences of probability distributions over output labels. To obtain a discrete labeling for evaluation, the most probable label at each position is returned.

There are multiple settings in which gradient descent has been used for structured inference, e.g., image generation (Johnson et al., 2016), structured prediction energy networks (Belanger and McCallum, 2016), and machine translation (Hoang et al., 2017). Gradient descent has the advantage of simplicity. Standard autodifferentiation toolkits can be used to compute gradients of the energy with respect to the output once the output space has been relaxed. However, one challenge is maintaining constraints on the variables being optimized. Therefore, we actually perform gradient descent in an even more relaxed output space Y_R′(x), which consists of length-|x| sequences of vectors, where each vector y_t ∈ R^L. When computing the energy, we use a softmax transformation on each y_t, solving the following optimization problem with gradient descent:

    argmin_{y ∈ Y_R′(x)} EΘ(x, softmax(y))                                   (1)

where the softmax operation above is applied independently to each vector y_t in the output structure y.
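The following is a minimal sketch of this relaxed gradient descent inference, reusing crf_energy from the sketch in Section 2; the random initialization, momentum update, and final discretization follow the description here and in Section 5, while the particular default step count and learning rate are only illustrative (the paper tunes both).

```python
import torch

def gd_inference(feats, U, W, num_labels, steps=20, lr=1.0, momentum=0.9):
    """Approximately solve Eq. (1): minimize E(x, softmax(y)) over unconstrained logits y."""
    y = torch.randn(feats.size(0), num_labels, requires_grad=True)  # random start
    opt = torch.optim.SGD([y], lr=lr, momentum=momentum)            # GD with momentum
    for _ in range(steps):
        opt.zero_grad()
        crf_energy(feats, torch.softmax(y, dim=-1), U, W).backward()
        opt.step()
    # Discrete labeling for evaluation: most probable label at each position.
    return torch.softmax(y, dim=-1).argmax(dim=-1)
```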
4 Inference Networks

Tu and Gimpel (2018) define an inference network ("infnet") AΨ : X → Y_R and train it with the goal that

    AΨ(x) ≈ argmin_{y ∈ Y_R(x)} EΘ(x, y)

where Y_R is the relaxed continuous output space as defined in Section 3. For sequence labeling, for example, an inference network AΨ takes a sequence x as input and outputs a distribution over labels for each position in x. Below we will consider three families of neural network architectures for AΨ.

For training the inference network parameters Ψ, Tu and Gimpel (2018) explored stabilization and regularization terms and found that a local cross entropy loss consistently worked well for sequence labeling. We use this local cross entropy loss in this paper, so we perform learning by solving the following:

    argmin_Ψ Σ_{⟨x,y⟩} [ EΘ(x, AΨ(x)) + λ ℓ_token(y, AΨ(x)) ]

where the sum is over ⟨x, y⟩ pairs in the training set. The token-level loss is defined:

    ℓ_token(y, A(x)) = Σ_{t=1..|y|} CE(y_t, A(x)_t)                           (2)

where y_t is the L-dimensional one-hot label vector at position t in y, A(x)_t is the inference network's output distribution at position t, and CE stands for cross entropy. We will give more details on how ℓ_token is defined for different inference network families below. It is also the loss used in our non-structured baseline models.
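A sketch of this combined training objective is shown below; infnet stands for any of the architectures in Section 4.1, energy_fn is the pretrained, fixed BLSTM-CRF energy, and the optimizer choice, averaging of the cross entropy, and other details are assumptions for illustration rather than the authors' exact setup.

```python
import torch
import torch.nn.functional as F

def train_inference_network(infnet, energy_fn, train_data, lam=1.0, lr=1e-3, epochs=10):
    """Minimize E(x, A(x)) + lambda * token-level cross entropy over the training set."""
    opt = torch.optim.Adam(infnet.parameters(), lr=lr)
    for _ in range(epochs):
        for x, gold in train_data:            # gold: (T,) integer label indices
            probs = infnet(x)                 # (T, L) per-position label distributions
            ce = F.nll_loss(torch.log(probs + 1e-12), gold)   # local cross entropy loss
            loss = energy_fn(x, probs) + lam * ce
            opt.zero_grad()
            loss.backward()
            opt.step()
    return infnet
```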

4.1 Inference Network Architectures

We now describe options for inference network architectures for sequence labeling. For each, we optionally include the modeling improvements described in Section 2.1. When doing so, we append "+" to the setting's name to indicate this (e.g., infnet+).

4.1.1 Convolutional Neural Networks

CNNs are frequently used in NLP to extract features based on symbol subsequences, whether words or characters (Collobert et al., 2011; Kalchbrenner et al., 2014; Kim, 2014; Kim et al., 2016; Zhang et al., 2015). CNNs use filters that are applied to symbol sequences and are typically followed by some sort of pooling operation. We apply filters over a fixed-size window centered on the word being labeled and do not use pooling. The feature maps f_n(x, t) for (2n + 1)-gram filters are defined:

    f_n(x, t) = g(W_n [v_{x_{t−n}}; ...; v_{x_{t+n}}] + b_n)

where g is a nonlinearity, v_{x_t} is the embedding of word x_t, and W_n and b_n are filter parameters. We consider two CNN configurations: one uses n = 0 and n = 1 and the other uses n = 0 and n = 2. For each, we concatenate the two feature maps and use them as input to the softmax layer over outputs. In each case, we use H filters for each feature map.
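As one concrete instantiation, a sketch of the n ∈ {0, 1} configuration (1-gram and 3-gram filters, H filters each, no pooling, concatenated and fed to a per-position softmax) is given below; the class and argument names are illustrative.

```python
import torch
import torch.nn as nn

class CNNInferenceNet(nn.Module):
    """CNN inference network with 1-gram and 3-gram filters over word embeddings."""

    def __init__(self, emb_dim, num_labels, H=100):
        super().__init__()
        self.conv1 = nn.Conv1d(emb_dim, H, kernel_size=1)             # n = 0
        self.conv3 = nn.Conv1d(emb_dim, H, kernel_size=3, padding=1)  # n = 1
        self.out = nn.Linear(2 * H, num_labels)

    def forward(self, emb):                     # emb: (T, emb_dim) word embeddings
        x = emb.t().unsqueeze(0)                # (1, emb_dim, T) layout for Conv1d
        f1 = torch.relu(self.conv1(x))          # (1, H, T) feature map for n = 0
        f3 = torch.relu(self.conv3(x))          # (1, H, T) feature map for n = 1
        feats = torch.cat([f1, f3], dim=1).squeeze(0).t()   # (T, 2H)
        return torch.softmax(self.out(feats), dim=-1)       # (T, L) label distributions
```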
4.1.2 Recurrent Neural Networks

For sequence labeling, it is common to use a BLSTM that runs over the input sequence and produces a softmax distribution over labels at each position in the sequence. We use this "BLSTM tagger" as our RNN inference network architecture. The parameter H refers to the size of the hidden vectors in the forward and backward LSTMs, so the full dimensionality passed to the softmax layer is 2H.

4.1.3 Sequence-to-Sequence Models

Sequence-to-sequence (seq2seq; Sutskever et al., 2014) models have been successfully used for many sequential modeling tasks. It is common to augment models with an attention mechanism that focuses on particular positions of the input sequence while generating the output sequence (Bahdanau et al., 2015). Since sequence labeling tasks have equal input and output sequence lengths and a strong connection between corresponding entries in the sequences, Goyal et al. (2018) used fixed attention that deterministically attends to the ith input when decoding the ith output, and hence does not learn any attention parameters; the decoder thus defines a distribution P(y_t | y_{<t}, x) over the label at each position.

When using this inference network, we redefine the local loss to the standard training criterion for seq2seq models, namely the sum of the log losses for each output conditioned on the previous outputs in the sequence. We always use the previous predicted label as input (as used in "scheduled sampling," Bengio et al., 2015) during training because it works better for our tasks. In our experiments, the forward and backward encoder LSTMs use hidden dimension H, as does the LSTM decoder. Thus the model becomes similar to the BLSTM tagger except with conditioning on previous labeling decisions in a left-to-right manner.

We also experimented with the use of beam search for both the seq2seq baseline and inference networks and did not find much difference in the results. Also, as alternatives to the deterministic position-based attention described above, we experimented with learned local attention (Luong et al., 2015) and global attention, but they did not work better on our tasks.

4.2 Methods to Improve Inference Networks

To further improve the performance of an inference network for a particular test instance x, we propose two novel approaches that leverage the strengths of inference networks to provide effective starting points and then use instance-level fine-tuning in two different ways.

4.2.1 Instance-Tailored Inference Networks

For each test example x, we initialize an instance-specific inference network AΨ(x) using the trained inference network parameters, then run gradient descent on the following loss:

    argmin_Ψ EΘ(x, AΨ(x))                                                    (3)

This procedure fine-tunes the inference network parameters for a single test example to minimize the energy of its output. For each test example, the process is repeated, with a new instance-specific inference network being initialized from the trained inference network parameters.

4.2.2 Warm-Starting Gradient Descent with Inference Networks

The second approach uses the output of the trained inference network as the initialization for gradient descent inference (Section 3), and then runs gradient descent on the energy for the given test example.
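The two refinements can be sketched as follows, building on the inference network and gradient descent sketches above; per-example step counts and learning rates are illustrative (the paper tunes them on development data).

```python
import copy
import torch

def instance_tailored(infnet, energy_fn, x, steps=10, lr=1e-3):
    """Fine-tune a copy of the trained inference network on one test input (Eq. 3)."""
    net = copy.deepcopy(infnet)                 # fresh instance-specific copy
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        energy_fn(x, net(x)).backward()         # energy-only objective, no gold labels
        opt.step()
    return net(x).argmax(dim=-1)

def warm_started_gd(infnet, energy_fn, x, steps=10, lr=1.0, momentum=0.9):
    """Initialize gradient descent inference from the inference network output."""
    y = torch.log(infnet(x).detach() + 1e-12).requires_grad_()   # softmax(y) ~ infnet(x)
    opt = torch.optim.SGD([y], lr=lr, momentum=momentum)
    for _ in range(steps):
        opt.zero_grad()
        energy_fn(x, torch.softmax(y, dim=-1)).backward()
        opt.step()
    return torch.softmax(y, dim=-1).argmax(dim=-1)
```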

5 Experimental Setup

We perform experiments on three tasks: Twitter part-of-speech tagging (POS), named entity recognition (NER), and CCG supersense tagging (CCG).

5.1 Datasets

POS. We use the annotated data from Gimpel et al. (2011) and Owoputi et al. (2013), which contains 25 POS tags. For training, we combine the 1000-tweet OCT27TRAIN set and the 327-tweet OCT27DEV set. For validation, we use the 500-tweet OCT27TEST set, and for testing we use the 547-tweet DAILY547 test set. We use the 100-dimensional skip-gram embeddings from Tu et al. (2017), which were trained on a dataset of 56 million English tweets using word2vec (Mikolov et al., 2013). The evaluation metric is tagging accuracy.

NER. We use the CoNLL 2003 English data (Tjong Kim Sang and De Meulder, 2003). There are four entity types: PER, LOC, ORG, and MISC. There is a strong local dependency between neighboring labels because this is a labeled segmentation task. We use the BIOES tagging scheme, so there are 17 labels. We use 100-dimensional pretrained GloVe (Pennington et al., 2014) embeddings. The task is evaluated with micro-averaged F1 score using the conlleval script.

CCG. We use the standard splits from CCGbank (Hockenmaier and Steedman, 2002). We only keep sentences with length less than 50 in the original training data when training the CRF. The training data contains 1,284 unique labels, but because the label distribution has a long tail, we use only the 400 most frequent labels, replacing the others by a special tag ∗. The percentages of ∗ in train/development/test are 0.25/0.23/0.23%. When the gold standard tag is ∗, the prediction is always evaluated as incorrect. We use the same GloVe embeddings as in NER. Because of the compositional nature of supertags, this task has more non-local dependencies. The task is evaluated with per-token accuracy.
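As a small check on the label-space sizes above, the BIOES scheme over the four CoNLL 2003 entity types yields 4 × 4 + 1 = 17 labels; the exact label strings below (e.g., "B-PER") follow the usual convention and are shown only for illustration.

```python
entity_types = ["PER", "LOC", "ORG", "MISC"]
labels = ["O"] + [f"{prefix}-{etype}" for etype in entity_types
                  for prefix in ("B", "I", "E", "S")]
assert len(labels) == 17   # 4 entity types x {B, I, E, S} + the O tag
```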
Local Baselines. We consider local (non-structured) baselines that use the same architectures as the inference networks but train using only the local loss ℓ_token.

Structured Baselines. We train the BLSTM-CRF and BLSTM-CRF+ models with the standard conditional log-likelihood objective. We tune hyperparameters on the development sets.

Gradient Descent for Inference. We use gradient descent for structured inference by solving Eq. 1. We randomly initialize y ∈ Y_R′(x) and, for N iterations, we compute the gradient of the energy with respect to y, then update y using gradient descent with momentum, which we found to generally work better than a constant step size. We tune N and the learning rate via instance-specific oracle tuning, i.e., we choose them separately for each input to maximize performance (accuracy or F1 score) on that input. Even with this oracle tuning, we find that gradient descent struggles to compete with the other methods.

Inference Networks. To train the inference networks, we first train the BLSTM-CRF or BLSTM-CRF+ model with the standard conditional log-likelihood objective. The hidden sizes H are tuned in that step. We then fix the energy function and train the inference network AΨ using the combined loss from Section 4. For instance-tailored inference networks and when using inference networks as a warm start for gradient descent, we tune the number of epochs N and the learning rate on the development set, and report the performance on the test set, using the same values of N and the learning rate for all test examples.

6 BLSTM-CRF Results

This first section of results uses the simpler BLSTM-CRF modeling configuration. In Section 7 below we present results with the stronger BLSTM-CRF+ configuration and also apply the same modeling improvements to the baselines and inference networks.

                     Twitter POS Tagging         NER                        CCG Supertagging
                     CNN    BLSTM   seq2seq      CNN    BLSTM   seq2seq     CNN    BLSTM   seq2seq
local baseline       89.6   88.0    88.9         79.9   85.0    85.3        90.6   92.2    92.7
infnet               89.9   89.5    89.7         82.2   85.4    86.1        91.3   92.8    92.9
gradient descent            89.1                        84.4                       89.0
Viterbi                     89.2                        87.2                       92.4

Table 1: Test results for all tasks. Inference networks, gradient descent, and Viterbi are all optimizing the BLSTM-CRF energy. Best result per task is in bold.

Table 1 shows test results for all tasks and architectures. The inference networks use the same architectures as the corresponding local baselines, but their parameters are trained with both the local loss and the BLSTM-CRF energy, leading to consistent improvements. CNN inference networks work well for POS, but struggle on NER and CCG compared to other architectures. BLSTMs work well, but are outperformed slightly by seq2seq models across all three tasks. Using the Viterbi algorithm for exact inference yields the best performance for NER but is not best for the other two tasks.

[Figure 1: Development results for inference networks with different architectures and hidden sizes (H). (a) NER (F1 vs. hidden size); (b) CCG supertagging (accuracy (%) vs. hidden size). Each panel compares the local baseline and the inference network for the CNN, BLSTM, and seq2seq architectures.]

                      {1,3}-gram   {1,5}-gram
POS   local baseline  89.2         88.7
      infnet          89.6         89.0
NER   local baseline  84.6         85.4
      infnet          86.7         86.8
CCG   local baseline  89.5         90.4
      infnet          90.3         91.4

Table 2: Development results for CNNs with two filter sets (H = 100).

It may be surprising that an inference network trained to mimic Viterbi would outperform Viterbi in terms of accuracy, which we find for the CNN for POS tagging and the seq2seq inference network for CCG. We suspect this occurs for two reasons. One is due to the addition of the local loss in the inference network objective; the inference networks may be benefiting from this multi-task training. Edunov et al. (2018) similarly found benefit from a combination of token-level and sequence-level losses. The other potential reason is beneficial inductive bias with the inference network architecture. For POS tagging, the CNN architecture is clearly well-suited to this task given the strong performance of the local CNN baseline. Nonetheless, the CNN inference network is able to improve upon both the CNN baseline and Viterbi.

Hidden Sizes. For the test results in Table 1, we did limited tuning of H for the inference networks based on the development sets. Figure 1 shows the impact of H on performance. Across H values, the inference networks outperform the baselines. For NER and CCG, seq2seq outperforms the BLSTM, which in turn outperforms the CNN.

Tasks and Window Sizes. Table 2 shows that CNNs with smaller windows are better for POS, while larger windows are better for NER and CCG. This suggests that POS has more local dependencies among labels than NER and CCG.

6.1 Speed Comparison

Asymptotically, Viterbi takes O(nL²) time, where n is the sequence length. The BLSTM and our deterministic-attention seq2seq models have time complexity O(nL). CNNs also have complexity O(nL) but are more easily parallelizable. Table 3 shows test-time inference speeds for inference networks, gradient descent, and Viterbi for the BLSTM-CRF model. We use GPUs and a minibatch size of 10 for all methods. CNNs are 1-2 orders of magnitude faster than the others. BLSTMs work almost as well as seq2seq models and are 2-4 times faster in our experiments. Viterbi is actually faster than seq2seq when L is small, but for CCG, which has L = 400, it is 4-5 times slower. Gradient descent is slower than the others because it generally needs many iterations (20-50) for competitive performance.

        CNN     BLSTM   seq2seq   Viterbi   GD
POS     12500   1250    357       500       20
NER     10000   1000    294       360       23
CCG     6666    1923    1000      232       16

Table 3: Speed comparison of inference networks across tasks and architectures (examples/sec).

                     POS    NER    CCG
local baseline       91.3   90.5   94.1
infnet+              91.3   90.8   94.2
gradient descent     90.8   89.8   90.4
Viterbi              90.9   91.6   94.3

Table 4: Test results with BLSTM-CRF+. For local baseline and inference network architectures, we use CNN for POS, seq2seq for NER, and BLSTM for CCG.

                            F1
local baseline (BLSTM)      90.3
infnet+ (1-layer BLSTM)     90.7
infnet+ (2-layer BLSTM)     91.1
Viterbi                     91.6

Table 5: NER test results (for BLSTM-CRF+) with more layers in the BLSTM inference network.

6.2 Search Error

We can view inference networks as approximate search algorithms and assess characteristics that affect search error. To do so, we train two LSTM language models (one on word sequences and one on gold label sequences) on the Twitter POS data. We also compute the difference in the BLSTM-CRF energies between the inference network output y_inf and the Viterbi output y_vit as the search error: EΘ(x, y_inf) − EΘ(x, y_vit). We compute the same search error for gradient descent.

For the BLSTM inference network, Spearman's ρ between the word sequence perplexity and search error is 0.282; for the label sequence perplexity, it is 0.195. For gradient descent inference, Spearman's ρ between the word sequence perplexity and search error is 0.122; for the label sequence perplexity, it is 0.064. These positive correlations mean that for frequent sequences, inference networks and gradient descent exhibit less search error. We also note that the correlations are higher for the inference network than for gradient descent, showing the impact of amortization during learning of the inference network parameters. That is, since we are learning to do inference from a dataset, we would expect search error to be smaller for more frequent sequences, and we do indeed see this correlation.
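This analysis can be reproduced with a few lines of code; the sketch below assumes an energy_fn like the earlier crf_energy sketch and per-sentence perplexities from the pretrained language models, and uses SciPy's spearmanr for the rank correlation.

```python
from scipy.stats import spearmanr

def search_error(energy_fn, x, y_approx, y_viterbi):
    """E(x, y_inf) - E(x, y_vit): how much worse the approximate decode scores."""
    return float(energy_fn(x, y_approx) - energy_fn(x, y_viterbi))

def perplexity_correlation(perplexities, errors):
    """Spearman's rho between per-sentence LM perplexity and search error."""
    rho, _ = spearmanr(perplexities, errors)
    return rho
```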
7 BLSTM-CRF+ Results

We now compare inference methods when using the improved modeling techniques described in Section 2.1 (i.e., the setting we called BLSTM-CRF+). We use these improved techniques for all models, including the CRF, the local baselines, gradient descent, and the inference networks. When training inference networks, both the inference network architectures and the structured energies use the techniques from Section 2.1. So, when referring to inference networks in this section, we use the name infnet+.

The results are shown in Table 4. With a more powerful local architecture, structured prediction is less helpful overall, but inference networks still improve over the local baselines on 2 of 3 tasks.

POS. As in the BLSTM-CRF setting, the local CNN baseline and the CNN inference network outperform Viterbi. This is likely because the CRFs use BLSTMs as feature networks, but our results show that CNN baselines are consistently better than BLSTM baselines on this task. As in the BLSTM-CRF setting, gradient descent works quite well on this task, comparable to Viterbi, though it is still much slower.

NER. We see slightly higher BLSTM-CRF+ results than several previous state-of-the-art results (cf. 90.94, Lample et al., 2016, and 91.37, Ma and Hovy, 2016). The stronger BLSTM-CRF+ configuration also helps the inference networks, improving performance from 90.5 to 90.8 for the seq2seq architecture over the local baseline. Though gradient descent reached high accuracies for POS tagging, it does not perform well on NER, possibly due to the greater amount of non-local information in the task.

While we see strong performance with infnet+, it still lags behind Viterbi in F1. We consider additional experiments in which we increase the number of layers in the inference networks. We use a 2-layer BLSTM as the inference network and also use weight annealing of the local loss hyperparameter λ, setting it to λ = e^(−0.01t) where t is the epoch number. Without this annealing, the 2-layer inference network was difficult to train. The weight annealing was helpful for encouraging the inference network to focus more on the non-local information in the energy function rather than the token-level loss. As shown in Table 5, these changes yield an improvement of 0.4 in F1.

                                        Twitter POS Tagging       NER                      CCG Supertagging
                                  N     Acc. (↑)   Energy (↓)     F1 (↑)    Energy (↓)     Acc. (↑)   Energy (↓)
gold standard                           100        -159.65        100       -230.63        100        -480.07
BLSTM-CRF+/Viterbi                      90.9       -163.20        91.6      -231.53        94.3       -483.09
gradient descent                  10    89.2       -161.69        81.9      -227.92        65.1       -412.81
                                  20    90.8       -163.06        89.1      -231.17        74.6       -414.81
                                  30    90.8       -163.02        89.6      -231.30        83.0       -447.64
                                  40    90.7       -163.03        89.8      -231.34        88.6       -471.52
                                  50    90.8       -163.04        89.8      -231.35        90.0       -476.56
                                  100   -          -              -         -              90.1       -476.98
                                  500   -          -              -         -              90.1       -476.99
                                  1000  -          -              -         -              90.1       -476.99
infnet+                                 91.3       -162.07        90.8      -231.19        94.2       -481.32
discretized output from infnet+         91.3       -160.87        90.8      -231.34        94.2       -481.95
instance-tailored infnet+         3     91.0       -162.59        91.3      -231.32        94.3       -481.91
                                  5     90.9       -162.81        91.2      -231.37        94.3       -482.23
                                  10    91.3       -162.85        91.5      -231.39        94.3       -482.56
infnet+ as warm start for         3     91.4       -163.06        91.4      -231.42        94.4       -482.62
gradient descent                  5     91.2       -163.12        91.4      -231.45        94.4       -482.64
                                  10    91.2       -163.15        91.5      -231.46        94.4       -482.78

Table 6: Test set results of approximate inference methods for three tasks, showing performance metrics (accuracy and F1) as well as average energy of the output of each method. The inference network architectures in the above experiments are: CNN for POS, seq2seq for NER, and BLSTM for CCG. N is the number of epochs for GD inference or instance-tailored fine-tuning.

CCG. Our BLSTM-CRF+ reaches an accuracy of 94.3%, which is comparable to several recent results (93.53, Xu et al., 2016; 94.3, Lewis et al., 2016; and 94.50, Vaswani et al., 2016). The local baseline, the BLSTM inference network, and Viterbi are all extremely close in accuracy. Gradient descent struggles here, likely due to the large number of candidate output labels.

7.1 Speed, Accuracy, and Search Error

Table 6 compares inference methods in terms of both accuracy and energies reached during inference. For each number N of gradient descent iterations in the table, we tune the learning rate per sentence and report the average accuracy/F1 with that fixed number of iterations. We also report the average energy reached. For inference networks, we report energies both for the output directly and when we discretize the output (i.e., choose the most probable label at each position).

Gradient Descent Across Tasks. The number of gradient descent iterations required for competitive performance varies by task. For POS, 20 iterations are sufficient to reach accuracy and energy close to Viterbi. For NER, roughly 40 iterations are needed for gradient descent to reach its highest F1 score, and for its energy to become very close to that of the Viterbi outputs. However, its F1 score is much lower than Viterbi's. For CCG, gradient descent requires far more iterations, presumably due to the larger number of labels in the task. Even with 1000 iterations, the accuracy is 4% lower than Viterbi and the inference networks. Unlike POS and NER, the inference network reaches much lower energies than gradient descent on CCG, suggesting that the inference network may not suffer from the same challenges of searching high-dimensional label spaces as those faced by gradient descent.

Inference Networks Across Tasks. For POS, the inference network does not have lower energy than gradient descent with ≥ 20 iterations, but it does have higher accuracy. This may be due in part to our use of multi-task learning for inference networks. The discretization of the inference network outputs increases the energy on average for this task, whereas it decreases the energy for the other two tasks. For NER, the inference network reaches a similar energy as gradient descent, especially when discretizing the output, but is considerably better in F1. The CCG task shows the largest difference between gradient descent and the inference network, as the latter is much better in both accuracy and energy.

Instance Tailoring and Warm Starting. Across tasks, instance tailoring and warm starting lead to lower energies than infnet+. The improvements in energy are sometimes joined by improvements in accuracy, notably for NER where the gains range from 0.4 to 0.7 in F1.

Warm starting gradient descent yields the lowest energies (other than Viterbi), showing promise for the use of gradient descent as a local search method starting from inference network output.

[Figure 2: CCG test results for inference methods (GD = gradient descent). The x-axis is the total inference time (s) for the test set and the y-axis is accuracy (%); curves are shown for Viterbi, GD, infnet+, instance-tailored infnet+, and infnet+ as a warm start for GD. The numbers on the GD curve are the number of gradient descent iterations.]

Wall Clock Time Comparison. Figure 2 shows the speed/accuracy trade-off for the inference methods, using wall clock time for test set inference as the speed metric. On this task, Viterbi is time-consuming because of the larger label set size. The inference network has comparable accuracy to Viterbi but is much faster. Gradient descent needs much more time to get close to the others but plateaus before actually reaching similar accuracy. Instance-tailoring and warm starting reside between infnet+ and Viterbi, with warm starting being significantly faster because it does not require updating inference network parameters.

8 Related Work

The most closely related prior work is that of Tu and Gimpel (2018), who experimented with RNN inference networks for sequence labeling. We compared three architectural families, showed the relationship between optimal architectures and downstream tasks, compared inference networks to gradient descent, and proposed novel variations.

We focused in this paper on sequence labeling, in which CRFs with neural network potentials have emerged as a state-of-the-art approach (Lample et al., 2016; Ma and Hovy, 2016; Strubell et al., 2017; Yang et al., 2018). Our results suggest that inference networks can provide a feasible way to speed up test-time inference over Viterbi without much loss in performance. The benefits of inference networks may be coming in part from multi-task training; Edunov et al. (2018) similarly found benefit from combining token-level and sequence-level losses.

We focused on structured prediction in this paper, but inference networks are useful in other settings as well. For example, it is common to use a particular type of inference network to approximate posterior inference in neural approaches to latent-variable probabilistic modeling, such as variational autoencoders (Kingma and Welling, 2013) and, more closely related to this paper, variational sequential labelers (Chen et al., 2018). In such settings, Kim et al. (2018) have found benefit with instance-specific updating of inference network parameters, which is related to our instance-level fine-tuning. There are also connections between structured inference networks and amortized structured inference (Srikumar et al., 2012) as well as methods for neural knowledge distillation and model compression (Hinton et al., 2015; Ba and Caruana, 2014; Kim and Rush, 2016).

Gradient descent is used for inference in several settings, e.g., structured prediction energy networks (Belanger and McCallum, 2016), image generation applications (Mordvintsev et al., 2015; Gatys et al., 2015), finding adversarial examples (Goodfellow et al., 2015), learning paragraph embeddings (Le and Mikolov, 2014), and machine translation (Hoang et al., 2017). Gradient descent has started to be replaced by inference networks in some of these settings, such as image transformation (Johnson et al., 2016; Li and Wand, 2016). Our results provide more evidence that gradient descent can be replaced by inference networks or improved through combination with them.

9 Conclusion

We compared several methods for approximate inference in neural structured prediction, finding that inference networks achieve a better speed/accuracy/search error trade-off than gradient descent. We also proposed instance-level inference network fine-tuning and using inference networks to initialize gradient descent, finding further reductions in search error and improvements in performance metrics for certain tasks.

Acknowledgments

We would like to thank Ke Li for suggesting experiments that combine inference networks and gradient descent, the anonymous reviewers for their feedback, and NVIDIA for donating GPUs used in this research.

References

Jimmy Ba and Rich Caruana. 2014. Do deep nets really need to be deep? In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2654–2662.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of International Conference on Learning Representations (ICLR).

David Belanger and Andrew McCallum. 2016. Structured prediction energy networks. In Proceedings of the 33rd International Conference on Machine Learning - Volume 48, ICML'16, pages 983–992.

Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 1171–1179.

Mingda Chen, Qingming Tang, Karen Livescu, and Kevin Gimpel. 2018. Variational sequential labelers for semi-supervised learning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 215–226.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12.

Gregory F. Cooper. 1990. The computational complexity of probabilistic inference using Bayesian belief networks (research note). Artificial Intelligence, 42(2-3).

Sergey Edunov, Myle Ott, Michael Auli, David Grangier, and Marc'Aurelio Ranzato. 2018. Classical structured prediction losses for sequence to sequence learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 355–364, New Orleans, Louisiana.

Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. 2015. A neural algorithm of artistic style. CoRR, abs/1508.06576.

Kevin Gimpel, Nathan Schneider, Brendan O'Connor, Dipanjan Das, Daniel Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A. Smith. 2011. Part-of-speech tagging for Twitter: annotation, features, and experiments. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 42–47, Portland, Oregon, USA.

Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. 2015. Explaining and harnessing adversarial examples. In Proceedings of International Conference on Learning Representations (ICLR).

Kartik Goyal, Graham Neubig, Chris Dyer, and Taylor Berg-Kirkpatrick. 2018. A continuous relaxation of beam search for end-to-end training of neural sequence models. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI), New Orleans, Louisiana.

Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. 2015. Distilling the knowledge in a neural network. In NIPS Workshop.

Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580.

Cong Duy Vu Hoang, Gholamreza Haffari, and Trevor Cohn. 2017. Towards decoding as continuous optimisation in neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 146–156, Copenhagen, Denmark.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation.

Julia Hockenmaier and Mark Steedman. 2002. Acquiring compact lexicalized grammars from a cleaner treebank. In Proceedings of the Third International Conference on Language Resources and Evaluation, LREC 2002, May 29-31, 2002, Las Palmas, Canary Islands, Spain.

Justin Johnson, Alexandre Alahi, and Li Fei-Fei. 2016. Perceptual losses for real-time style transfer and super-resolution. In Proceedings of European Conference on Computer Vision (ECCV).

Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 655–665, Baltimore, Maryland.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751, Doha, Qatar.

Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. 2016. Character-aware neural language models. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI'16, pages 2741–2749. AAAI Press.

Yoon Kim and Alexander M. Rush. 2016. Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1317–1327, Austin, Texas.

Yoon Kim, Sam Wiseman, Andrew Miller, David Sontag, and Alexander Rush. 2018. Semi-amortized variational autoencoders. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2678–2687, Stockholmsmässan, Stockholm, Sweden.

Diederik Kingma and Max Welling. 2013. Auto-encoding variational Bayes. CoRR, abs/1312.6114.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML '01, pages 282–289.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 260–270, San Diego, California.

Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32, ICML'14.

Mike Lewis, Kenton Lee, and Luke Zettlemoyer. 2016. LSTM CCG parsing. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 221–231, San Diego, California.

Chuan Li and Michael Wand. 2016. Precomputed real-time texture synthesis with Markovian generative adversarial networks. In Proceedings of European Conference on Computer Vision (ECCV), pages 702–716.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon, Portugal.

Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1064–1074, Berlin, Germany.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119.

Alexander Mordvintsev, Christopher Olah, and Mike Tyka. 2015. DeepDream - a code example for visualizing neural networks. Google Research.

Olutobi Owoputi, Brendan O'Connor, Chris Dyer, Kevin Gimpel, Nathan Schneider, and Noah A. Smith. 2013. Improved part-of-speech tagging for online conversational text with word clusters. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 380–390, Atlanta, Georgia.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar.

Vivek Srikumar, Gourab Kundu, and Dan Roth. 2012. On amortizing inference cost for structured prediction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1114–1124, Jeju Island, Korea.

Emma Strubell, Patrick Verga, David Belanger, and Andrew McCallum. 2017. Fast and accurate entity recognition with iterated dilated convolutions. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2670–2680, Copenhagen, Denmark.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 3104–3112.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 142–147.

Lifu Tu and Kevin Gimpel. 2018. Learning approximate inference networks for structured prediction. In Proceedings of International Conference on Learning Representations (ICLR).

Lifu Tu, Kevin Gimpel, and Karen Livescu. 2017. Learning to embed words in context for syntactic tasks. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pages 265–275, Vancouver, Canada.

Ashish Vaswani, Yonatan Bisk, Kenji Sagae, and Ryan Musa. 2016. Supertagging with LSTMs. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 232–237, San Diego, California.

Wenduan Xu, Michael Auli, and Stephen Clark. 2016. Expected F-measure training for shift-reduce parsing with recurrent neural networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 210–220, San Diego, California.

Jie Yang, Shuailong Liang, and Yue Zhang. 2018. Design challenges and misconceptions in neural sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3879–3889, Santa Fe, New Mexico, USA.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 649–657.

A Appendix

Local Baselines. We consider local (non-structured) baselines that use the same architectures as the inference networks but train using only the local loss ℓ_token. We tune the learning rate ({5, 1, 0.5, 0.1, 0.05, 0.01, 0.005, 0.001, 0.0005}). We train on the training set, use the development sets for tuning and early stopping, and report results on the test sets.

Structured Baselines. We train the BLSTM-CRF and BLSTM-CRF+ models with the standard conditional log-likelihood objective. We tune hyperparameters on the development sets. The tuned BLSTM hidden size H for BLSTM-CRF is 100 for POS/NER and 512 for CCG; for BLSTM-CRF+ the tuned hidden size is 100 for POS, 200 for NER, and 400 for CCG.

Gradient Descent for Inference. For the number of epochs N, we consider values in the set {5, 10, 20, 30, 40, 50, 100, 500, 1000}. For each N, we tune the learning rate over the set {1e4, 5e3, 1e3, 500, 100, 50, 10, 5, 1}. These learning rates may appear extremely large when we are accustomed to choosing rates for empirical risk minimization, but we generally found that the most effective learning rates for structured inference are orders of magnitude larger than those effective for learning. To provide as strong performance as possible for the gradient descent method, we tune N and the learning rate via oracle tuning, i.e., we choose them separately for each input to maximize performance (accuracy or F1 score) on that input.

Inference Networks. To train the inference networks, we first train the BLSTM-CRF or BLSTM-CRF+ model with the standard conditional log-likelihood objective. The hidden sizes H are tuned in that step. We then fix the energy function and train the inference network AΨ using the combined loss from Section 4. We tune the learning rate over the set {5, 1, 0.5, 0.1, 0.05, 0.01, 0.005, 0.001, 0.0005} for the inference network and the local loss weight λ over the set {0.2, 0.5, 1, 2, 5}. We use early stopping on the development sets and report the results on the test sets using the trained inference networks.
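For concreteness, the per-input oracle tuning for gradient descent inference can be sketched as below, using the grids listed above; decode_fn stands for a gradient descent decoder such as the earlier gd_inference sketch and score_fn for per-input accuracy or F1 (both illustrative names).

```python
import itertools

N_GRID = [5, 10, 20, 30, 40, 50, 100, 500, 1000]
LR_GRID = [1e4, 5e3, 1e3, 500, 100, 50, 10, 5, 1]

def oracle_tune(decode_fn, score_fn, x, gold):
    """Pick the (N, learning rate) pair that maximizes this input's own score."""
    best_score, best_pred = None, None
    for n, lr in itertools.product(N_GRID, LR_GRID):
        pred = decode_fn(x, steps=n, lr=lr)
        score = score_fn(pred, gold)
        if best_score is None or score > best_score:
            best_score, best_pred = score, pred
    return best_pred
```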
