Benchmarking Approximate Inference Methods for Neural Structured Prediction
Lifu Tu and Kevin Gimpel
Toyota Technological Institute at Chicago, Chicago, IL, 60637, USA
{lifu,kgimpel}@ttic.edu

Abstract

Exact structured inference with neural network scoring functions is computationally challenging but several methods have been proposed for approximating inference. One approach is to perform gradient descent with respect to the output structure directly (Belanger and McCallum, 2016). Another approach, proposed recently, is to train a neural network (an "inference network") to perform inference (Tu and Gimpel, 2018). In this paper, we compare these two families of inference methods on three sequence labeling datasets. We choose sequence labeling because it permits us to use exact inference as a benchmark in terms of speed, accuracy, and search error. Across datasets, we demonstrate that inference networks achieve a better speed/accuracy/search error trade-off than gradient descent, while also being faster than exact inference at similar accuracy levels. We find further benefit by combining inference networks and gradient descent, using the former to provide a warm start for the latter.¹

¹ Code is available at github.com/lifu-tu/BenchmarkingApproximateInference

1 Introduction

Structured prediction models commonly involve complex inference problems for which finding exact solutions is intractable (Cooper, 1990). There are generally two ways to address this difficulty. One is to restrict the model family to those for which inference is feasible. For example, state-of-the-art methods for sequence labeling use structured energies that decompose into label-pair potentials and then use rich neural network architectures to define the potentials (Collobert et al., 2011; Lample et al., 2016; Ma and Hovy, 2016, inter alia). Exact dynamic programming algorithms like the Viterbi algorithm can be used for inference.

The second approach is to retain computationally-intractable scoring functions but then use approximate methods for inference. For example, some researchers relax the structured output space from a discrete space to a continuous one and then use gradient descent to maximize the score function with respect to the output (Belanger and McCallum, 2016). Another approach is to train a neural network (an "inference network") to output a structure in the relaxed space that has high score under the structured scoring function (Tu and Gimpel, 2018). This idea was proposed as an alternative to gradient descent in the context of structured prediction energy networks (Belanger and McCallum, 2016).

In this paper, we empirically compare exact inference, gradient descent, and inference networks for three sequence labeling tasks. We train conditional random fields (CRFs) for sequence labeling with neural networks used to define the potentials. We choose a scoring function that permits exact inference via Viterbi so that we can benchmark the approximate methods in terms of search error in addition to speed and accuracy. We consider three families of neural network architectures to serve as inference networks: convolutional neural networks (CNNs), recurrent neural networks (RNNs), and sequence-to-sequence models with attention (seq2seq; Sutskever et al., 2014; Bahdanau et al., 2015). We also use multi-task learning while training inference networks, combining the structured scoring function with a local cross entropy loss.
Our empirical findings can be summarized as follows. Gradient descent works reasonably well for tasks with small label sets and primarily local structure, like part-of-speech tagging. However, gradient descent struggles on tasks with long-distance dependencies, even with small label set sizes. For tasks with large label set sizes, inference networks and Viterbi perform comparably, with Viterbi taking much longer. In this regime, it is difficult for gradient descent to find a good solution, even with many iterations.

In comparing inference network architectures: (1) CNNs are the best choice for tasks with primarily local structure, like part-of-speech tagging; (2) RNNs can handle longer-distance dependencies while still offering high decoding speeds; and (3) seq2seq networks consistently work better than RNNs, but are also the most computationally expensive.

We also compare search error between gradient descent and inference networks and measure correlations with input likelihood. We find that inference networks achieve lower search error on instances with higher likelihood (under a pretrained language model), while for gradient descent the correlation between search error and likelihood is closer to zero. This shows the impact of the use of dataset-based learning of inference networks, i.e., they are more effective at amortizing inference for more common inputs.

Finally, we experiment with two refinements of inference networks. The first fine-tunes the inference network parameters for a single test example to minimize the energy of its output. The second uses an inference network to provide a warm start for gradient descent. Both lead to reductions in search error and higher accuracies for certain tasks, with the warm start method leading to a better speed/accuracy trade-off.

2 Sequence Models

For sequence labeling tasks, given an input sequence x = ⟨x_1, x_2, ..., x_{|x|}⟩, we wish to output a sequence y = ⟨y_1, y_2, ..., y_{|x|}⟩ ∈ Y(x). Here Y(x) is the structured output space for x. Each label y_t is represented as an L-dimensional one-hot vector, where L is the number of labels.

Conditional random fields (CRFs; Lafferty et al., 2001) form one popular class of methods for structured prediction, especially for sequence labeling. We define our structured energy function to be similar to those often used in CRFs for sequence labeling:

E_\Theta(x, y) = -\left( \sum_{t} \sum_{i=1}^{L} y_{t,i} \left( u_i^\top f(x, t) \right) + \sum_{t} y_{t-1}^\top W y_t \right)

where y_{t,i} is the ith entry of the vector y_t. In the standard discrete-label setting, each y_t is a one-hot vector, but this energy is generalized to be able to use both discrete labels and continuous relaxations of the label space, which we will introduce below. Also, we use f(x, t) ∈ R^d to denote the "input feature vector" for position t, u_i ∈ R^d is a label-specific parameter vector used for modeling the local scoring function, and W ∈ R^{L×L} is a parameter matrix learned to model label transitions. For the feature vectors we use a bidirectional long short-term memory (BLSTM; Hochreiter and Schmidhuber, 1997), so this forms a BLSTM-CRF (Lample et al., 2016; Ma and Hovy, 2016).

For training, we use the standard conditional log-likelihood objective for CRFs, using the forward and backward dynamic programming algorithms to compute gradients. For a given input x at test time, prediction is done by choosing the output with the lowest energy:

\operatorname*{argmin}_{y \in \mathcal{Y}(x)} E_\Theta(x, y)

The Viterbi algorithm can be used to solve this problem exactly for the energy defined above.
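To make the energy function concrete, the following is a minimal PyTorch sketch of E_Θ(x, y) for the BLSTM-CRF described above. It is a sketch under assumptions rather than the authors' released code: the module name BLSTMCRFEnergy, the tensor names, and the initialization are illustrative, and the label argument y is taken to be a (batch, T, L) tensor holding either one-hot labels or relaxed label distributions, matching the generalization of the energy discussed above.

```python
import torch
import torch.nn as nn


class BLSTMCRFEnergy(nn.Module):
    """E_Theta(x, y) = -(sum_t sum_i y_{t,i} u_i^T f(x,t) + sum_t y_{t-1}^T W y_t)."""

    def __init__(self, emb_dim, hidden_dim, num_labels):
        super().__init__()
        # f(x, t): a BLSTM over the input embeddings gives the feature vector at position t
        self.blstm = nn.LSTM(emb_dim, hidden_dim, bidirectional=True, batch_first=True)
        # the label-specific vectors u_i, stacked as the rows of a linear map
        self.U = nn.Linear(2 * hidden_dim, num_labels, bias=False)
        # W: label-transition parameter matrix (L x L)
        self.W = nn.Parameter(0.01 * torch.randn(num_labels, num_labels))

    def forward(self, emb, y):
        # emb: (batch, T, emb_dim) word embeddings for x
        # y:   (batch, T, L) one-hot labels or relaxed label distributions
        feats, _ = self.blstm(emb)                 # (batch, T, 2*hidden_dim)
        local = self.U(feats)                      # (batch, T, L): u_i^T f(x, t) for all i, t
        local_score = (y * local).sum(dim=(1, 2))  # sum_t sum_i y_{t,i} u_i^T f(x, t)
        # transition term: sum_t y_{t-1}^T W y_t over consecutive label vectors
        trans_score = torch.einsum('bti,ij,btj->b', y[:, :-1], self.W, y[:, 1:])
        return -(local_score + trans_score)        # lower energy = better labeling
```

Writing the energy with y passed in as an explicit, possibly relaxed tensor is what lets the same E_Θ be reused unchanged by the approximate inference methods of Sections 3 and 4.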
2.1 Modeling Improvements: BLSTM-CRF+

For our experimental comparison, we consider two CRF variants. The first is the basic model described above, which we refer to as BLSTM-CRF. Below we describe three additional techniques that we add to the basic model. We will refer to the CRF with these three techniques as BLSTM-CRF+. Using these two models permits us to assess the impact of model complexity and performance level on the inference method comparison.

Word Embedding Fine-Tuning. We used pretrained, fixed word embeddings when using the BLSTM-CRF model, but for the more complex BLSTM-CRF+ model, we fine-tune the pretrained word embeddings during training.

Character-Based Embeddings. Character-based word embeddings provide consistent improvements in sequence labeling (Lample et al., 2016; Ma and Hovy, 2016). In addition to pretrained word embeddings, we produce a character-based embedding for each word using a character convolutional network like that of Ma and Hovy (2016). The filter size is 3 characters and the character embedding dimensionality is 30. We use max pooling over the character sequence in the word, and the resulting embedding is concatenated with the word embedding before being passed to the BLSTM.

Dropout. We also add dropout during training (Hinton et al., 2012). Dropout is applied before the character embeddings are fed into the CNNs, at the final word embedding layer before the input to the BLSTM, and after the BLSTM. The dropout rate is 0.5 for all experiments.

3 Gradient Descent for Inference

To use gradient descent (GD) for structured inference, researchers typically relax the output space from a discrete, combinatorial space to a continuous one and then use gradient descent to solve the following optimization problem:

\operatorname*{argmin}_{y \in \mathcal{Y}_R(x)} E_\Theta(x, y)

where Y_R is the relaxed continuous output space.

4 Inference Networks

Tu and Gimpel (2018) define an inference network ("infnet") A_Ψ : X → Y_R and train it with the goal that

A_\Psi(x) \approx \operatorname*{argmin}_{y \in \mathcal{Y}_R(x)} E_\Theta(x, y)

where Y_R is the relaxed continuous output space as defined in Section 3. For sequence labeling, for example, an inference network A_Ψ takes a sequence x as input and outputs a distribution over labels for each position in x. Below we will consider three families of neural network architectures for A_Ψ.

For training the inference network parameters Ψ, Tu and Gimpel (2018) explored stabilization and regularization terms and found that a local cross entropy loss consistently worked well for sequence labeling. We use this local cross entropy loss in this paper, so we perform learning by minimizing, over the training examples, the energy of the inference network's output combined with this local cross entropy loss.
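The following is a minimal PyTorch sketch of these two approximate inference strategies, built on the energy module sketched in Section 2. It is illustrative rather than the authors' implementation: the optimizer choices, the step count, the learning rate, and the weight lam on the cross entropy term are assumptions, and infnet stands for any of the architectures above (CNN, RNN, or seq2seq) that maps an input sequence to per-position label logits.

```python
import torch
import torch.nn.functional as F


def gradient_descent_inference(energy_fn, emb, num_labels, steps=50, lr=0.1):
    """Section 3: relax each y_t to softmax(logits_t) and run gradient descent
    on the logits to (approximately) minimize E_Theta(x, y) over Y_R(x)."""
    for p in energy_fn.parameters():      # Theta is pretrained and kept fixed
        p.requires_grad_(False)
    batch, T, _ = emb.shape
    logits = torch.zeros(batch, T, num_labels, requires_grad=True)
    opt = torch.optim.SGD([logits], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        y_relaxed = F.softmax(logits, dim=-1)          # a point in the relaxed space Y_R(x)
        energy_fn(emb, y_relaxed).sum().backward()
        opt.step()
    return logits.argmax(dim=-1)                       # discretize at the end


def infnet_training_step(infnet, energy_fn, emb, gold_labels, optimizer, lam=1.0):
    """Section 4: one training step for an inference network A_Psi, combining the
    energy of its relaxed output with a token-level cross entropy loss.
    The weight lam is an illustrative assumption."""
    for p in energy_fn.parameters():      # Theta stays fixed; only Psi is trained
        p.requires_grad_(False)
    optimizer.zero_grad()
    logits = infnet(emb)                               # (batch, T, L) per-position label scores
    y_relaxed = F.softmax(logits, dim=-1)              # A_Psi(x) in the relaxed space
    loss = energy_fn(emb, y_relaxed).mean()
    # gold_labels: (batch, T) gold label indices for the local cross entropy term
    loss = loss + lam * F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), gold_labels.reshape(-1))
    loss.backward()
    optimizer.step()
    return loss.item()
```

The optimizer passed to infnet_training_step is assumed to cover only the inference network parameters Ψ; the energy parameters Θ are pretrained with the CRF objective of Section 2 and kept fixed in both functions.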