Structured Prediction Energy Networks

David Belanger and Andrew McCallum
College of Information and Computer Sciences, University of Massachusetts Amherst

Abstract

We introduce structured prediction energy networks (SPENs), a flexible framework for structured prediction. A deep architecture is used to define an energy function of candidate labels, and then predictions are produced by using back-propagation to iteratively optimize the energy with respect to the labels. This deep architecture captures dependencies between labels that would lead to intractable graphical models, and performs structure learning by automatically learning discriminative features of the structured output. One natural application of our technique is multi-label classification, which traditionally has required strict prior assumptions about the interactions between labels to ensure tractable learning and prediction. We are able to apply SPENs to multi-label problems with substantially larger label sets than previous applications of structured prediction, while modeling high-order interactions using minimal structural assumptions. Overall, deep learning provides remarkable tools for learning features of the inputs to a prediction problem, and this work extends these techniques to learning features of structured outputs. Our experiments provide impressive performance on a variety of benchmark multi-label classification tasks, demonstrate that our technique can be used to provide interpretable structure learning, and illuminate fundamental trade-offs between feed-forward and iterative structured prediction.
1. Introduction

Structured prediction is an important problem in a variety of domains. Consider an input x and a structured output y, such as a labeling of time steps, a collection of attributes for an image, a parse of a sentence, or a segmentation of an image into objects. Such problems are challenging because the number of candidate y is exponential in the number of output variables that comprise it. As a result, practitioners encounter computational considerations, since prediction requires searching an enormous space, and also statistical considerations, since learning accurate models from limited data requires reasoning about commonalities between distinct structured outputs. Therefore, structured prediction is fundamentally a problem of representation, where the representation must capture both the discriminative interactions between x and y and also allow for efficient combinatorial optimization over y. With this perspective, it is unsurprising that there are natural combinations of structured prediction and deep learning, a powerful framework for representation learning.

We consider two principal approaches to structured prediction: (a) as a feed-forward function y = f(x), and (b) using an energy-based viewpoint y = arg min_y' E_x(y') (LeCun et al., 2006). Feed-forward approaches include, for example, predictors using local convolutions plus a classification layer (Collobert et al., 2011), fully-convolutional networks (Long et al., 2015), or sequence-to-sequence predictors (Sutskever et al., 2014). Here, end-to-end learning can be performed easily using gradient descent. In contrast, the energy-based approach may involve non-trivial optimization to perform predictions, and includes, for example, conditional random fields (CRFs) (Lafferty et al., 2001). From a modeling perspective, energy-based approaches are desirable because directly parametrizing E_x(·) provides practitioners with better opportunities to utilize domain knowledge about properties of the structured output. Furthermore, such a parametrization may be more parsimonious, resulting in improved generalization from limited data. On the other hand, prediction and learning are more complex.

For energy-based prediction, prior applications of deep learning have mostly followed a two-step construction: first, choose an existing model structure for which the search problem y = arg min_y' E_x(y') can be performed efficiently, and then express the dependence of E_x(·) on x via a deep architecture. For example, the tables of potentials of an undirected graphical model can be parametrized via a deep network applied to x (LeCun et al., 2006; Collobert et al., 2011; Jaderberg et al., 2014; Huang et al., 2015; Schwing & Urtasun, 2015; Chen et al., 2015). The advantage of this approach is that it employs deep architectures to perform representation learning on x, while leveraging existing algorithms for combinatorial prediction, since the dependence of E_x(y') on y' remains unchanged. In some of these examples, exact prediction is intractable, such as for loopy graphical models, and standard techniques for learning with approximate inference are employed. An alternative line of work has directly maximized the performance of iterative approximate prediction algorithms by performing back-propagation through the iterative procedure (Stoyanov et al., 2011; Domke, 2013; Hershey et al., 2014; Zheng et al., 2015).

All of these families of deep structured prediction techniques assume a particular graphical model for E_x(·) a-priori, but this construction perhaps imposes an excessively strict inductive bias. Namely, practitioners are unable to use the deep architecture to perform structure learning, representation learning that discovers the interaction between different parts of y. In response, this paper explores Structured Prediction Energy Networks (SPENs), where a deep architecture encodes the dependence of the energy on y, and predictions are obtained by approximately minimizing the energy iteratively via gradient descent.

Using gradient-based methods to predict structured outputs was mentioned in LeCun et al. (2006), but applications have been limited since then. Mostly, the approach has been applied for alternative goals, such as generating adversarial examples (Szegedy et al., 2014; Goodfellow et al., 2015), embedding examples in low-dimensional spaces (Le & Mikolov, 2014), or image synthesis (Mordvintsev et al., 2015; Gatys et al., 2015a;b). This paper provides a concrete extension of the implicit regression approach of LeCun et al. (2006) to structured objects, with a target application (multi-label classification), a family of candidate architectures (Section 3), and a training algorithm (a structured SVM (Taskar et al., 2004; Tsochantaridis et al., 2004)).

Overall, SPENs offer substantially different tradeoffs than prior applications of deep learning to structured prediction. Most energy-based approaches form predictions using optimization algorithms that are tailored to the problem structure, such as message passing for loopy graphical models. Since SPEN prediction employs gradient descent, an extremely generic algorithm, practitioners can explore a wider variety of differentiable energy functions. In particular, SPENs are able to model high-arity interactions that would result in unmanageable treewidth if the problem was posed as an undirected graphical model. On the other hand, SPEN prediction lacks algorithmic guarantees, since it only performs local optimization of the energy.

SPENs are particularly well suited to multi-label classification problems. These are naturally posed as structured prediction, since the labels exhibit rich interaction structure. However, unlike problems with grid structure, where there is a natural topology for the prediction variables, the interactions between labels must be learned from data. Prior applications of structured prediction, e.g. using CRFs, have been limited to small-scale problems, since the techniques' complexity, both in terms of the number of parameters to estimate and the per-iteration cost of algorithms like belief propagation, grows at least quadratically in the number of labels L (Ghamrawi & McCallum, 2005; Finley & Joachims, 2008; Meshi et al., 2010; Petterson & Caetano, 2011). For SPENs, though, both the per-iteration prediction complexity and the number of parameters scale linearly in L. We only impose mild prior assumptions about labels' interactions: they can be encoded by a deep architecture. Motivated by recent compressed sensing approaches to multi-label classification (Hsu et al., 2009; Kapoor et al., 2012), we further assume that the first layer of the network performs a small set of linear projections of the prediction variables. This provides a particularly parsimonious representation of the energy function and an interpretable tool for structure learning.
On a selection of benchmark multi-label classification tasks, the expressivity of our deep energy function provides accuracy improvements against a variety of competitive baselines, including a novel adaptation of the 'CRF as RNN' approach of Zheng et al. (2015). We also offer experiments contrasting SPEN learning with alternative SSVM-based techniques and analyzing the convergence behavior and speed-accuracy tradeoffs of SPEN prediction in practice. Finally, experiments on synthetic data with rigid mutual exclusivity constraints between labels demonstrate the power of SPENs to perform structure learning and illuminate important tradeoffs in the expressivity and parsimony of SPENs vs. feed-forward predictors. We encourage further application of SPENs in various domains.

2. Structured Prediction Energy Networks

For many structured prediction problems, an x → y mapping can be defined by posing y as the solution to a potentially non-linear combinatorial optimization problem, with parameters dependent on x (LeCun et al., 2006):

    min_y E_x(y)  subject to  y ∈ {0,1}^L.   (1)

This includes binary CRFs, where there is a coordinate of y for every node in the graphical model. Problem (1) could be rendered tractable by assuming certain structure (e.g., a tree) for the energy function E_x(·). Instead, we consider general E_x(·), but optimize over a convex relaxation of the constraint set:

    min_ȳ E_x(ȳ)  subject to  ȳ ∈ [0,1]^L.   (2)

In general, E_x(ȳ) may be non-convex, in which case exactly solving (2) is intractable. A reasonable approximate optimization procedure, however, is to minimize (2) via gradient descent, obtaining a local minimum.
Optimization over [0,1]^L can be performed using projected gradient descent, or entropic mirror descent by normalizing over each coordinate (Beck & Teboulle, 2003). We use the latter because it maintains iterates in (0,1)^L, which allows using energy functions and loss functions that diverge at the boundary.

There are no guarantees that our predicted ȳ is nearly 0-1. In some applications, we may round ȳ to obtain predictions that are usable downstream. It may also be useful to maintain 'soft' predictions, e.g. for detection problems.

In the posterior inference literature, mean-field approaches also consider a relaxation from y to ȳ, where ȳ_i would be interpreted as the marginal probability that y_i = 1 (Jordan et al., 1999). Here, the practitioner starts with a probabilistic model for which inference is intractable, and obtains a mean-field objective when seeking to perform approximate variational inference. We make no such probabilistic assumptions, however, and instead adopt a discriminative approach by directly parametrizing the objective that the inference procedure optimizes.

Continuous optimization over ȳ can be performed using black-box access to a gradient subroutine for E_x(ȳ). Therefore, it is natural to parametrize E_x(ȳ) using deep architectures, a flexible family of multivariate function approximators that provide efficient gradient calculation.
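To make the prediction procedure concrete, the following NumPy sketch (ours, not the authors' released code) minimizes a generic energy over [0,1]^L with per-coordinate entropic mirror descent, assuming only black-box access to a gradient routine; the step size, iteration budget, and tolerances are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(energy_grad, num_labels, eta=0.1, max_iters=100, tol=1e-4):
    """Approximately minimize E_x(y_bar) over [0,1]^L with entropic mirror descent.

    energy_grad: callable returning dE/dy_bar at the current iterate
    (black-box access to a gradient subroutine, as described above).
    """
    y_bar = np.full(num_labels, 0.5)            # start at the center of [0,1]^L
    for _ in range(max_iters):
        g = energy_grad(y_bar)
        # Per-coordinate exponentiated-gradient update on the pair (y_i, 1 - y_i).
        logits = np.log(y_bar) - np.log(1.0 - y_bar) - eta * g
        y_new = np.clip(sigmoid(logits), 1e-6, 1.0 - 1e-6)  # stay strictly interior
        if np.max(np.abs(y_new - y_bar)) < tol:  # absolute change in the iterate
            return y_new
        y_bar = y_new
    return y_bar
```

Because each step is an exponentiated-gradient update on the coordinate-wise pair (ȳ_i, 1 − ȳ_i), the iterates remain inside (0,1)^L, which is what permits energy and loss terms that diverge at the boundary.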

A SPEN parametrizes E_x(ȳ) as a neural network that takes both x and ȳ as inputs and returns the energy (a single number). In general, a SPEN consists of two deep architectures. First, the feature network F(x) produces an f-dimensional feature representation for the input. Next, the energy E_x(ȳ) is given by the output of the energy network E(F(x), ȳ). Here, F and E are arbitrary deep networks. Note that the energy only depends on x via the value of F(x). During iterative prediction, we improve efficiency by precomputing F(x) and not back-propagating through F when differentiating the energy with respect to ȳ.
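A minimal sketch of this two-network decomposition (the class and function names are ours, not from the paper's implementation) makes the precomputation explicit:

```python
import numpy as np

class SPEN:
    """E_x(y_bar) = energy_net(F(x), y_bar); F and the energy net are arbitrary."""

    def __init__(self, feature_net, energy_net):
        self.feature_net = feature_net   # x -> f-dimensional features
        self.energy_net = energy_net     # (features, y_bar) -> scalar energy

    def prediction_objective(self, x):
        feats = self.feature_net(x)      # computed once per example
        # During iterative prediction we only differentiate with respect to
        # y_bar, so there is no need to back-propagate through feature_net.
        return lambda y_bar: self.energy_net(feats, y_bar)
```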

3. Example SPEN Architecture

We now provide a more concrete example of the architecture for a SPEN. All of our experiments use the general configuration described in this section. We denote matrices in upper case and vectors in lower case. We use g() to denote a coordinate-wise non-linearity function, and may use different non-linearities, e.g. sigmoid vs. rectifier, in different places. Appendix A.2 provides a computation graph for this architecture.

For our feature network, we employ a simple 2-layer network:

    F(x) = g(A_2 g(A_1 x)).   (3)

Our energy network is the sum of two terms. First, the local energy network scores ȳ as the sum of L linear models:

    E_x^local(ȳ) = Σ_{i=1}^L ȳ_i b_i^T F(x).   (4)

Here, each b_i is an f-dimensional vector of parameters for each label.

This score is added to the output of the global energy network, which scores configurations of ȳ independent of x:

    E_x^label(ȳ) = c_2^T g(C_1 ȳ).   (5)

The product C_1 ȳ is a set of learned affine (linear + bias) measurements of the output that capture salient features of the labels used to model their dependencies. By learning the measurement matrix C_1 from data, the practitioner imposes minimal assumptions a-priori on the interaction structure between the labels, but can model sophisticated interactions by feeding C_1 ȳ through a non-linear function. This has the additional benefit that the number of parameters to estimate grows linearly in L. In Section 7.3, we present experiments exploring the usefulness of the measurement matrix as a means to perform structure learning.

In general, there is a tradeoff between using increasingly expressive energy networks and being more vulnerable to overfitting. In some of our experiments, we add another layer of depth to (5). It is also natural to use a global energy network that conditions on x, such as:

    E_x^cond(ȳ) = d_2^T g(D_1 [ȳ; F(x)]).   (6)

Our experiments consider tasks with limited training data, however, and here we found that using a data-independent global energy helped prevent overfitting.

Finally, it may be desirable to choose an architecture for g that results in a convex problem in y (Amos & Kolter, 2016). Our experiments select g based on accuracy, rather than algorithmic guarantees resulting from convexity.
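For concreteness, the sketch below (ours, not the original implementation) instantiates Equations (3)-(5) in NumPy with randomly initialized parameters and an illustrative choice of g = tanh; the paper leaves g as a generic coordinate-wise non-linearity, and the dimensions are arbitrary. The analytic gradient with respect to ȳ is exactly what the iterative prediction routine above consumes.

```python
import numpy as np

rng = np.random.default_rng(0)
L, n_in, f, m = 16, 64, 32, 4     # labels, input dim, feature dim, measurements

# Hypothetical parameter shapes; real values would come from SSVM training.
A1, A2 = rng.normal(size=(f, n_in)), rng.normal(size=(f, f))
B = rng.normal(size=(L, f))                           # rows are the b_i of Eq. (4)
C1, c1 = rng.normal(size=(m, L)), rng.normal(size=m)  # affine measurements, Eq. (5)
c2 = rng.normal(size=m)

g = np.tanh                        # one choice of coordinate-wise non-linearity

def feature_net(x):                # Eq. (3): F(x) = g(A2 g(A1 x))
    return g(A2 @ g(A1 @ x))

def energy(feats, y_bar):          # local term (4) plus global term (5)
    local = y_bar @ (B @ feats)
    global_ = c2 @ g(C1 @ y_bar + c1)
    return local + global_

def energy_grad_y(feats, y_bar):   # analytic dE/dy_bar for this architecture
    t = np.tanh(C1 @ y_bar + c1)   # derivative of tanh is (1 - tanh^2)
    return B @ feats + C1.T @ ((1.0 - t ** 2) * c2)
```

Plugging energy_grad_y (with F(x) precomputed once per example) into the mirror-descent routine sketched in Section 2 yields the full prediction procedure.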
3.1. Conditional Random Fields as SPENs

There are important parallels between the example SPEN architecture given above and the parametrization of a CRF (Lafferty et al., 2001). Here, we use CRF to refer to any structured linear model, which may or may not be trained to maximize the conditional log likelihood. For the sake of notational simplicity, consider a fully-connected pairwise CRF with local potentials that depend on x, but data-independent pairwise potentials. Suppose we apply E_x(·) directly to y, rather than to the relaxation ȳ. The corresponding global energy net would be:

    E_x^crf(y) = y^T S_1 y + s^T y.   (7)

In applications with large label spaces, (7) is troublesome in terms of both the statistical efficiency of parameter estimation and the computational efficiency of prediction, because of the quadratic dependence on L. Statistical issues can be mitigated by imposing parameter tying of the CRF potentials, using a low-rank assumption, e.g. (Srikumar & Manning, 2014; Jernite et al., 2015), or using a deep architecture to map x to a table of CRF potentials (LeCun et al., 2006). Computational concerns can be mitigated by choosing a sparse graph. This is difficult for practitioners when they do not know the dependencies between labels a-priori. Furthermore, modeling higher-order interactions than pairwise relationships is very expensive with a CRF, but presents no extra cost for SPENs. Finally, note that using affine, rather than linear, measurements in C_1 is critical to allow SPENs to be sufficiently universal to model dissociativity between labels, a common characteristic of CRFs.

For CRFs, the interplay between the graph structure and the set of representable conditional distributions is well-understood (Koller & Friedman, 2009). However, characterizing the representational capacity of SPENs is more complex, as it depends on the general representational capacity of the deep architecture chosen.

4. Learning SPENs

In Section 2, we described a technique for producing predictions by performing continuous optimization in the space of outputs. Now we discuss a gradient-based technique for learning the parameters of the network E_x(ȳ).

In many structured prediction applications, the practitioner is able to interact with the model in only two ways: (1) evaluate the model's energy on a given value of y, and (2) minimize the energy with respect to y. This occurs, for example, when predicting combinatorial structures such as bipartite matchings and graph cuts. A popular technique in these settings is the structured support vector machine (SSVM) (Taskar et al., 2004; Tsochantaridis et al., 2004).

If we assume (incorrectly) that our prediction procedure is not subject to optimization errors, then (1) and (2) apply to our model and it is straightforward to train using an SSVM. This ignores errors resulting from the potential non-convexity of E_x(ȳ) or the relaxation from y to ȳ. However, such an assumption is a reasonable way to construct an approximate learning procedure.

Define ∆(y_p, y_g) to be an error function between a prediction y_p and the ground truth y_g, such as the Hamming loss. Let [·]_+ = max(0, ·). The SSVM minimizes:

    Σ_{x_i, y_i} max_y [∆(y_i, y) − E_{x_i}(y) + E_{x_i}(y_i)]_+.   (8)

Here, the [·]_+ function is redundant when performing exact energy minimization. We require it, however, because gradient descent only performs approximate minimization of the non-convex energy. Note that the signs in (8) differ from convention because here prediction minimizes E_x(·).

We minimize our loss with respect to the parameters of the deep architecture E_x using mini-batch stochastic gradient descent. For {x_i, y_i}, computing the subgradient of (8) with respect to the prediction requires loss-augmented inference:

    y_p = arg min_y (−∆(y_i, y) + E_{x_i}(y)).   (9)

With this, the subgradient of (8) with respect to the model parameters is obtained by back-propagation through E_x.

We perform loss-augmented inference by again using gradient descent on the relaxation ȳ, rather than performing combinatorial optimization over y. Since ∆ is a discrete function such as the Hamming loss, we need to approximate it with a differentiable surrogate loss, such as the squared loss or log loss. For the log loss, which diverges at the boundary, mirror descent is crucial, since it maintains ȳ ∈ (0,1)^L. The objective (8) only considers the energy values of the ground truth and the prediction, ensuring that they are separated by a margin, not the actual ground truth and predicted labels (9). Therefore, we do not round the output of (9) in order to approximate a subgradient of (8); instead, we use the ȳ obtained by approximately minimizing (9).

Training undirected graphical models using an SSVM loss is conceptually more attractive than training SPENs, though. In loopy graphical models, it is tractable to solve the LP relaxation of MAP inference using graph-cut or message passing techniques, e.g. (Boykov & Kolmogorov, 2004; Globerson & Jaakkola, 2008). Using the LP relaxation, instead of exact MAP inference, in the inner loop of CRF SSVM learning is fairly benign, since it is guaranteed to over-generate margin violations in (8) (Kulesza & Pereira, 2007; Finley & Joachims, 2008). On the other hand, SPENs may be safer from the perils of inexact optimization during learning than training CRFs with the log loss. As LeCun & Huang (2005) discuss, the loss function for un-normalized models is unaffected by "low-energy areas that are never reached by the inference algorithm."

Finally, we have found that it is useful to initialize the parameters of the feature network by first training them using a simple local classification loss, ignoring any interactions between coordinates of y. For problems with very limited training data, we have found that overfitting can be lessened by keeping the feature network's parameters fixed when training the global energy network parameters.
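The following sketch (our illustration, not the original implementation) shows one stochastic SSVM update built from the pieces above. It treats all energy parameters as a single flat vector for brevity, uses the squared distance to y_i as the differentiable surrogate for ∆, and assumes helper callables for the energy, its gradient in ȳ, its gradient in the parameters, and the relaxed prediction routine from Section 2.

```python
import numpy as np

def ssvm_step(x, y_true, feature_net, energy, energy_grad_y, energy_grad_params,
              params, predict, lr=0.01):
    """One approximate SSVM update (Eq. 8) for a single example.

    predict(grad_fn, L) approximately minimizes an energy over [0,1]^L;
    energy_grad_params returns dE/dparams as a flat vector shaped like `params`.
    """
    feats = feature_net(x)
    L = y_true.shape[0]

    def delta(y_bar):                        # differentiable surrogate for Delta
        return np.sum((y_bar - y_true) ** 2)

    def delta_grad(y_bar):
        return 2.0 * (y_bar - y_true)

    def aug_grad(y_bar):                     # loss-augmented objective (Eq. 9)
        return -delta_grad(y_bar) + energy_grad_y(feats, y_bar)

    y_hat = predict(aug_grad, L)             # relaxed loss-augmented inference

    violation = delta(y_hat) - energy(feats, y_hat) + energy(feats, y_true)
    if violation > 0:                        # hinge [.]_+ in Eq. (8)
        # Subgradient w.r.t. parameters: dE(y_true)/dtheta - dE(y_hat)/dtheta.
        grad = (energy_grad_params(feats, y_true, params)
                - energy_grad_params(feats, y_hat, params))
        params = params - lr * grad
    return params
```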
5. Applications of SPENs

SPENs are a natural model for multi-label classification, an important task in a variety of machine learning applications. The data consist of {x, y} pairs, where y = {y_1, ..., y_L} ∈ {0,1}^L is a set of multiple binary labels we seek to predict and x is a feature vector. In many cases, we are given no structure among the L labels a-priori, though the labels may be quite correlated. SPENs are a very natural model for multi-label classification because learning the measurement matrix C_1 in (5) provides an automatic method for discovering this interaction structure. Section 6.1 discusses related prior work.

SPENs are very general, though, and can be applied to any prediction problem that can be posed, for example, as MAP inference in an undirected graphical model. In many applications of graphical models, the practitioner employs certain prior knowledge about dependencies in the data to choose the graph structure, and certain invariances in the data to impose parameter tying schemes. For example, when tagging sequences with a linear-chain CRF, the parameterization of the local and pairwise potential functions is shared across time. Similarly, when applying a SPEN, we can express the global energy net (5) using temporal convolutions, i.e. C_1 has a repeated block-diagonal structure. Section A.4 describes details for improving the accuracy and efficiency of SPENs in practice.
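As an illustration of the parameter tying just described, the sketch below (ours; the filter width, block sizes, and helper name are arbitrary) builds a measurement matrix whose rows apply one shared filter at every temporal offset, so that C_1 ȳ amounts to a temporal convolution over the label sequence.

```python
import numpy as np

def conv_measurement_matrix(filt, num_steps, labels_per_step):
    """Build a C_1 with a repeated, parameter-tied block: C_1 @ y_bar computes
    the same linear measurements of every window of labels.

    filt: (num_measurements, window * labels_per_step) shared filter.
    """
    window = filt.shape[1] // labels_per_step
    L = num_steps * labels_per_step
    rows = []
    for t in range(num_steps - window + 1):
        block = np.zeros((filt.shape[0], L))
        start = t * labels_per_step
        block[:, start:start + filt.shape[1]] = filt  # same parameters at every offset
        rows.append(block)
    return np.concatenate(rows, axis=0)

# Example: 3 measurements over windows of 2 time steps, 4 labels per step, 10 steps.
C1 = conv_measurement_matrix(np.ones((3, 8)), num_steps=10, labels_per_step=4)
```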
6. Related Work

6.1. Multi-Label Classification

The simplest multi-label classification approach is to independently predict each label y_i using a separate classifier, also known as the 'binary relevance model'. This can perform poorly, particularly when certain labels are rare or some are highly correlated. Modeling improvements use max-margin or ranking losses that directly address the multi-label structure (Elisseeff & Weston, 2001; Godbole & Sarawagi, 2004; Bucak et al., 2009).

Correlations between labels can be modeled explicitly using models with low-dimensional embeddings of labels (Ji & Ye, 2009; Cabral et al., 2011; Yu et al., 2014; Bhatia et al., 2015). This can be achieved, for example, by using low-rank parameter matrices. In the SPEN framework, such a model would consist of a linear feature network (3) of the form F(x) = A_1 x, where A_1 has fewer rows than there are target labels, and no global energy network. While the prediction cost of such methods grows linearly with L, these models have limited expressivity, and can not capture strict structural constraints among labels, such as mutual exclusivity and implicature. By using a non-linear multi-layer perceptron (MLP) for the feature network, with hidden layers of lower dimensionality than the input, we are able to capture similar low-dimensional structure, but also capture interactions between outputs. In our experiments, this MLP is a competitive baseline that has been under-explored in prior work.

It is natural to approach multi-label classification using structured prediction, which models interactions between prediction labels directly. However, the number of parameters to estimate and the per-iteration computational complexity of these models grow super-linearly in L (Ghamrawi & McCallum, 2005; Finley & Joachims, 2008; Meshi et al., 2010; Petterson & Caetano, 2011), or require strict assumptions about labels' dependencies (Read et al., 2011; Niculescu-Mizil & Abbasnejad, 2015).

Our parametrization of the global energy network (5) in terms of linear measurements of the labels is inspired by prior approaches using compressed sensing and error-correcting codes for multi-label classification (Hsu et al., 2009; Hariharan et al., 2010; Kapoor et al., 2012). However, these rely on assumptions about the sparsity of the true labels or prior knowledge about label interactions, and often do not learn the measurement matrix from data. We do not assume that the labels are sparse. Instead, we assume their interactions can be parametrized by a deep network applied to a set of linear measurements of the labels.

6.2. Deep Structured Models

It is natural to parametrize the potentials of a CRF using deep features (LeCun et al., 2006; Collobert et al., 2011; Jaderberg et al., 2014; Huang et al., 2015; Chen et al., 2014; Schwing & Urtasun, 2015; Chen et al., 2015). Alternatively, non-iterative feed-forward predictors can be constructed using structured models as motivation (Stoyanov et al., 2011; Domke, 2013; Kunisch & Pock, 2013; Hershey et al., 2014; Li & Zemel, 2014; Zheng et al., 2015). Here, a model family is chosen, along with an iterative approximate inference technique for the model. The inference technique is then unrolled into a computation graph, for a fixed number of iterations, and parameters are learned end-to-end using back-propagation. This directly optimizes the performance of the approximate inference procedure, and it is used as a baseline in our experiments.

While these techniques can yield expressive dependence on x and improved training, the dependence of their expressivity and scalability on y is limited, since they build on an underlying graphical model. They also require deriving model-structure-specific inference algorithms.

6.3. Iterative Prediction using Neural Networks

Our use of backpropagation to perform gradient-based prediction differs from most deep learning applications, where backpropagation is used to update the network parameters. However, backpropagation-based prediction has been useful in a variety of deep learning applications, including siamese networks (Bromley et al., 1993), methods for generating adversarial examples (Szegedy et al., 2014; Goodfellow et al., 2015), methods for embedding documents as dense vectors (Le & Mikolov, 2014), and successful techniques for image generation and texture synthesis (Mordvintsev et al., 2015; Gatys et al., 2015a;b). Carreira et al. (2015) propose an iterative structured prediction method for human pose estimation, where predictions are constructed incrementally as y_{t+1} = y_t + ∆(x, y_t). The network ∆ is trained as a multi-variate regression task, by defining a ground truth trajectory for intermediate y_t.

7. Experiments

7.1. Multi-Label Classification Benchmarks

Table 1 compares SPENs to a variety of high-performing baselines on a selection of standard multi-label classification tasks. Dataset sizes, etc. are described in Table 4. We contrast SPENs with BR: independent per-label logistic regression; MLP: a multi-layer perceptron with ReLU non-linearities trained with a per-label logistic loss, i.e. the feature network equation (3) coupled with the local energy network equation (4); and LR: the low-rank-weights method of Yu et al. (2014). BR and LR results are from Lin et al. (2014). The local energy of the SPEN is identical to the MLP.

We also compare to deep mean field (DMF), an instance of the deep 'unrolling' technique described in Section 6.2. We consider 5 iterations of mean-field inference in a fully connected pairwise CRF with data-dependent pairwise factors, and perform end-to-end maximum likelihood training. Local potentials are identical to the MLP classifier, and their parameters are clamped to reduce overfitting (unlike any of the other methods, the DMF has L^2 parameters). Details of our DMF predictor, which may be of independent interest, are provided in Section A.3. Note that we only obtained high performance by using pretrained unary potentials from the MLP. Without this, accuracy was about half that of Table 1.

We report the example-averaged (macro average) F1 measure. For Bibtex and Delicious, we tune hyperparameters by pooling the train and test data and sampling without replacement to make a split of the same size as the original. For Bookmarks, we use the same train-dev-test split as Lin et al. (2014). We selected 15 linear measurements (rows of C_1 in (5)) for Bookmarks and Bibtex, and 5 for Delicious. Section A.5 describes additional choices of hyperparameters. For SPENs, we obtain predictions by rounding ȳ_i above a threshold tuned on held-out data.
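The evaluation and thresholding just described can be sketched as follows (our code; the candidate threshold grid is an assumption, not taken from the paper):

```python
import numpy as np

def example_f1(y_true, y_pred):
    """Example-averaged F1: compute F1 per example, then average over examples."""
    tp = np.sum(y_true * y_pred, axis=1)
    pred_pos = np.sum(y_pred, axis=1)
    true_pos = np.sum(y_true, axis=1)
    f1 = 2.0 * tp / np.maximum(pred_pos + true_pos, 1e-8)
    return float(np.mean(f1))

def tune_threshold(y_true_dev, y_bar_dev, grid=np.linspace(0.05, 0.95, 19)):
    """Pick the rounding threshold for the relaxed predictions on held-out data."""
    scores = [example_f1(y_true_dev, (y_bar_dev > t).astype(float)) for t in grid]
    return float(grid[int(np.argmax(scores))])
```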
There are multiple key results in Table 1. First, SPENs are competitive compared to all of the other methods, including DMF, a structured prediction technique. While DMF scales computationally to moderate problem sizes, since the algorithm in Section A.3 is vectorized and can be run efficiently on a GPU, it can not scale statistically, since the pairwise potentials have so many parameters. As a result, we found it difficult to avoid overfitting with DMF on the Bookmarks and Delicious datasets. In fact, the best performance is obtained by using the MLP unary potentials and ignoring pairwise terms. Second, the MLP, a technique that has not been treated as a baseline in recent literature, is surprisingly accurate as well. Finally, the MLP outperformed the SPEN on the Delicious dataset. Here, we found that accurate prediction requires well-calibrated soft predictions to be combined with a confidence threshold. The MLP, which is trained with a per-label logistic loss, is better at producing soft predictions than SPENs, which are trained with a margin loss. To obtain the SPEN result for Delicious in Table 1, we need to smooth the test-time prediction problem with extra entropy terms to obtain softer predictions.

Table 1. Comparison of various methods on 3 standard datasets in terms of F1 (larger is better).

               BR     LR     MLP    DMF    SPEN
    Bibtex     37.2   39.0   38.9   40.0   42.2
    Bookmarks  30.7   31.0   33.8   33.1   34.4
    Delicious  26.5   35.3   37.8   34.2   37.5

Many multi-label classification methods approach learning as a missing data problem. Here, the training labels y are assumed to be correct only when y_i = 1. When y_i = 0, they are treated as missing data, whose values can be imputed using assumptions about the rank (Lin et al., 2014) or sparsity (Bucak et al., 2011; Agrawal et al., 2013) of the matrix of training labels. For certain multi-label tasks, such modeling is useful because only positive labels are annotated. For example, the approach of Lin et al. (2014) achieves 44.2 on the Bibtex dataset, outperforming our method, but only 33.3 on Delicious, substantially worse than the MLP or SPEN. Missing data modeling is orthogonal to the modeling of SPENs, and we can combine missing data techniques with SPENs.

7.2. Comparison to Alternative SSVM Approaches

Due to scalability considerations, most prior applications of CRFs to multi-label classification have been restricted to substantially smaller L than those considered in Table 1. In Table 2, we consider the 14-label yeast dataset (Elisseeff & Weston, 2001), which is the largest label space fit using a CRF in Finley & Joachims (2008) and Meshi et al. (2010). Finley & Joachims (2008) analyze the effects of inexact prediction on SSVM training and on test-time prediction. Table 2 considers exact prediction using an ILP solver (EXACT), loopy belief propagation (LBP), solving the LP relaxation (LP), the deep mean-field network described in the previous section (DMF), and SPENs, where the same prediction technique is used at train and test time. All results, besides SPEN and DMF, are from Finley & Joachims (2008). The SPEN and DMF use linear feature networks. We report Hamming error, using 10-fold cross validation.

Table 2. Comparing different prediction methods, which are used both during SSVM training and at test time, using the setup of Finley & Joachims (2008) on the Yeast dataset, with Hamming error (smaller is better). SPENs perform comparably to EXACT and LP, which provide stronger guarantees for SSVM training.

    EXACT       LP          LBP         DMF       SPEN
    20.2 ± .5   20.5 ± .5   24.3 ± .6   23 ± .2   20.0 ± .3

We use Table 2 to make two arguments. First, it provides justification for our use of the deep mean field network as an MRF baseline in the previous section, since it performs comparably to the other MRF methods in the table, and substantially better than LBP. Second, a key argument of Finley & Joachims (2008) is that SSVM training is more effective when the train-time inference method will not under-generate margin violations. Here, LBP and SPEN, which both approximately minimize a non-convex inference objective, have such a vulnerability, whereas LP does not, since solving the LP relaxation provides a lower bound on the true optimum of (9). Since SPEN performs similarly to EXACT and LP, this suggests that perhaps the effect of inexact prediction is more benign for SPENs than for LBP. However, SPENs exhibit alternative expressive power to pairwise CRFs, and thus it is difficult to fully isolate the effect of SSVM training on accuracy.
7.3. Structure Learning Using SPENs

Next, we perform experiments on synthetic data designed to demonstrate that the label measurement matrix, C_1 in the global energy network (5), provides a useful tool for analyzing the structure of dependencies between labels. SPENs impose a particular inductive bias about the interaction between x and y. Namely, the interactions between different labels in y do not depend on x. Our experiments show that this parametrization allows SPENs to excel in regimes of limited training data, due to their superior parsimony compared to analogous feed-forward approaches.

To generate data, we first draw a design matrix X with 64 features, with each entry drawn from N(0,1). Then, we generate a 64 x 16 weights matrix A, again from N(0,1). Then, we construct Z = XA and split the 16 columns of Z into 4 consecutive blocks. For each block, we set Y_ij = 1 if Z_ij is the maximum entry in its row-wise block, and 0 otherwise. We seek a model with predictions that reliably obey these within-block exclusivity constraints.

Figure 1 depicts block structure in the learned measurement matrix. Measurements that place equal weight on every element in a block can be used to detect violations of the mutual exclusivity constraints characteristic of the data generating process. The choice of network architecture can significantly affect the interpretability of the measurement matrix, however. When using ReLU, which acts as the identity for positive activations, violations of the data constraints can be detected by taking linear combinations of the measurements (a), since multiple hidden units place large weight on some labels. This obfuscates our ability to perform structure learning by investigating the measurement matrix. On the other hand, since applying HardTanh to the measurements saturates from above, the network learns to utilize each measurement individually, yielding substantially more interpretable structure learning in (b).

Figure 1. Learned SPEN measurement matrices on synthetic data containing mutual exclusivity of labels within size-4 blocks, for two different choices of nonlinearity in the global energy network: (a) ReLU, (b) HardTanh. The 16 labels are on the horizontal axis and the 4 hidden units are on the vertical axis.

Next, in Table 3 we compare: a linear classifier, a 3-layer ReLU MLP with hidden units of size 64 and 16, and a SPEN with a simple linear local energy network and a 2-layer global energy network with HardTanh activations and 4 hidden units. Using fewer hidden units in the MLP results in substantially poorer performance. We avoid using a non-linear local energy network because we want to force the global energy network to capture all label interactions.

Table 3. Comparing performance (F1) on the synthetic task with block-structured mutual exclusivity between labels. Due to its parsimonious parametrization, the SPEN succeeds with limited data. With more data, the MLP performs comparably, suggesting that even rigid constraints among labels can be predicted in a feed-forward fashion using a sufficiently expressive architecture.

    # train examples   Linear   3-Layer MLP   SPEN
    1.5k               80.0     81.6          91.5
    15k                81.8     96.3          96.7

Note that the SPEN consistently outperforms the MLP, particularly when training on only 1.5k examples. In the limited data regime, the difference is because the MLP has 5x more parameters, since we use a simple linear feature network in the SPEN. We also inject domain knowledge about the constraint structure when designing the global energy network's architecture. Figure 5 in the Appendix demonstrates that we can perform the same structure learning as in Figure 1 on this small training set.

Next, observe that for 15k examples the performance of the MLP and SPEN are comparable. Initially, we hypothesized that the mutual exclusivity constraints of the labels could not be satisfied by a feed-forward predictor, and that reconciling their interactions would require an iterative procedure. However, it seems that a large, expressive MLP can learn an accurate predictor when presented with lots of examples. Going forward, we would like to investigate the parsimony vs. expressivity tradeoffs of SPENs and MLPs.
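The data-generating process used in this section is small enough to state directly; the following sketch (the function name and seed are ours) reproduces it:

```python
import numpy as np

def make_block_exclusive_data(n, n_features=64, n_labels=16, block_size=4, seed=0):
    """Synthetic data of Section 7.3: within each block of 4 consecutive labels,
    exactly the arg-max coordinate of Z = X A is turned on."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, n_features))          # design matrix, entries ~ N(0, 1)
    A = rng.normal(size=(n_features, n_labels))   # 64 x 16 weights, entries ~ N(0, 1)
    Z = X @ A
    Y = np.zeros_like(Z)
    for b in range(n_labels // block_size):
        cols = slice(b * block_size, (b + 1) * block_size)
        winners = np.argmax(Z[:, cols], axis=1)
        Y[np.arange(n), b * block_size + winners] = 1.0
    return X, Y
```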
7.4. Convergence Behavior of SPEN Prediction

This section provides experiments analyzing the behavior of SPENs' test-time optimization in practice. All figures for this section appear in Appendix A.1.

Prediction, both at train and test time, is performed in parallel in large minibatches on a GPU. Despite providing substantial speedups, this approach is subject to the 'curse of the last reducer,' where unnecessary gradient computation is performed on instances for which optimization has already converged. Convergence is determined by relative changes in the optimization objective and absolute changes in the iterates' values. In Figure 2 we provide a histogram of the iteration at which examples converge on the Bibtex dataset. The vast majority of examples converge (at around 20 steps) much before the slowest example (41 steps). Predictions are often spiked at either 0 or 1, despite optimizing a non-convex energy over the set [0,1]. We expect that this results from the energy function being fit to 0-1 data.

Since the speed of batch prediction is largely influenced by the worst-case iteration complexity, we seek ways to decrease this worst case while maintaining high aggregate accuracy. We hypothesize that prediction is slow on 'hard' examples for which the model would have made incorrect predictions anyway. In response, we terminate prediction when a certain percentage of the examples have converged at a certain threshold. In Figure 3, we vary this percentage from 50% to 100%, obtaining a 3-fold speedup at nearly no decrease in accuracy. In Figure 4, we vary the tightness of the convergence threshold, terminating optimization when 90% of examples have converged. This also achieves a 3-fold speedup at little degradation in accuracy. Finally, Figures 2-3 can be shifted by about 5 iterations to the left by initializing optimization at the output of the MLP. Section 7.1 uses configurations tuned for accuracy, not speed.

Unsurprisingly, prediction using a feed-forward method, such as the above MLP or linear models, is substantially faster than a SPEN. The average total times to classify all 2515 items in the Bibtex test set are 0.0025 and 1.2 seconds for the MLP and SPEN, respectively. While SPENs are much slower, the speed per test example is still practical for various applications. Here, we used the termination criterion described above where prediction is halted when 90% of examples have converged. If we were to somehow circumvent the curse-of-the-last-reducer issue, the average number of seconds of computation per example for SPENs would be 0.8 seconds. Note that the feature network (3) needs to be evaluated only once for both the MLP and the SPEN. Therefore, the extra cost of SPEN prediction does not depend on the expense of feature computation.

We have espoused SPENs' O(L) per-iteration complexity and number of parameters. Strictly speaking, problems with larger L may require a more expressive global energy network (5) with more rows in the measurement matrix C_1. This dependence is complex and data-dependent, however. Since the dimension of C_1 affects model capacity, a good value depends less on L than on the amount of training data.

Finally, it is difficult to analyze the ability of SPEN prediction to perform accurate global optimization. However, we can establish an upper bound on its performance by counting the number of times that the energy returned by our optimization is greater than the value of the energy evaluated at the ground truth. Unfortunately, such a search error occurs about 8% of the time on the Bibtex dataset.
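A sketch of the batched termination rule described in this section (the helper is ours; the tolerances and the 90% fraction are of the kind discussed above, not prescribed values):

```python
import numpy as np

def batch_converged(prev_obj, obj, prev_y, y, rel_tol=1e-3, abs_tol=1e-3, frac=0.9):
    """Stop a batched prediction run once `frac` of the examples have converged.

    Per-example convergence: relative change in the objective below rel_tol and
    l-infinity change in the iterate below abs_tol (the criterion of Section 7.4).
    """
    rel = np.abs(obj - prev_obj) / np.maximum(np.abs(prev_obj), 1e-8)
    step = np.max(np.abs(y - prev_y), axis=1)
    converged = (rel < rel_tol) & (step < abs_tol)
    return np.mean(converged) >= frac
```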
8. Conclusion and Future Work

Structured prediction energy networks employ deep architectures to perform representation learning for structured objects, jointly over both x and y. This provides straightforward prediction using gradient descent and an expressive framework for the energy function. We hypothesize that more accurate models can be trained from limited data using the energy-based approach, due to superior parsimony and better opportunities for practitioners to inject domain knowledge. Deep networks have transformed our ability to learn hierarchies of features for the inputs to prediction problems. SPENs provide a step towards using deep networks to perform automatic structure learning.

Future work will consider SPENs that are convex with respect to y (Amos & Kolter, 2016), but not necessarily the model parameters, and training methods that backpropagate through gradient-based prediction (Domke, 2012).

Acknowledgements

This work was supported in part by the Center for Intelligent Information Retrieval and in part by NSF grant #CNS-0958392. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor. Thanks to Luke Vilnis for helpful advice.

References

Agrawal, Rahul, Gupta, Archit, Prabhu, Yashoteja, and Varma, Manik. Multi-label learning with millions of labels: Recommending advertiser bid phrases for web pages. In International Conference on World Wide Web, 2013.

Amos, Brandon and Kolter, J. Zico. Input-convex deep networks. ICLR Workshop, 2016.

Beck, Amir and Teboulle, Marc. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3), 2003.

Bhatia, Kush, Jain, Himanshu, Kar, Purushottam, Jain, Prateek, and Varma, Manik. Locally non-linear embeddings for extreme multi-label learning. CoRR, abs/1507.02743, 2015.

Boykov, Yuri and Kolmogorov, Vladimir. An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. Pattern Analysis and Machine Intelligence, 26(9):1124–1137, 2004.

Bromley, Jane, Bentz, James W., Bottou, Léon, Guyon, Isabelle, LeCun, Yann, Moore, Cliff, Säckinger, Eduard, and Shah, Roopak. Signature verification using a siamese time delay neural network. Pattern Recognition and Artificial Intelligence, 7(04), 1993.

Bucak, Serhat S., Mallapragada, Pavan Kumar, Jin, Rong, and Jain, Anil K. Efficient multi-label ranking for multi-class learning: application to object recognition. In International Conference on Computer Vision, 2009.

Bucak, Serhat Selcuk, Jin, Rong, and Jain, Anil K. Multi-label learning with incomplete class assignments. In Computer Vision and Pattern Recognition, 2011.

Cabral, Ricardo S., Torre, Fernando, Costeira, João P., and Bernardino, Alexandre. Matrix completion for multi-label image classification. In NIPS, 2011.

Carreira, Joao, Agrawal, Pulkit, Fragkiadaki, Katerina, and Malik, Jitendra. Human pose estimation with iterative error feedback. arXiv preprint arXiv:1507.06550, 2015.

Chen, L.-C., Schwing, A. G., Yuille, A. L., and Urtasun, R. Learning deep structured models. In ICML, 2015.

Chen, Liang-Chieh, Papandreou, George, Kokkinos, Iasonas, Murphy, Kevin, and Yuille, Alan L. Semantic image segmentation with deep convolutional nets and fully connected CRFs. ICCV, 2014.

Collobert, Ronan, Weston, Jason, Bottou, Léon, Karlen, Michael, Kavukcuoglu, Koray, and Kuksa, Pavel. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493–2537, 2011.

Domke, Justin. Generic methods for optimization-based modeling. In AISTATS, volume 22, pp. 318–326, 2012.

Domke, Justin. Learning graphical model parameters with approximate marginal inference. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(10):2454–2467, 2013.

Elisseeff, André and Weston, Jason. A kernel method for multi-labelled classification. In NIPS, 2001.

Finley, Thomas and Joachims, Thorsten. Training structural SVMs when exact inference is intractable. In ICML, 2008.

Gatys, Leon A., Ecker, Alexander S., and Bethge, Matthias. A neural algorithm of artistic style. CoRR, abs/1508.06576, 2015a.

Gatys, Leon A., Ecker, Alexander S., and Bethge, Matthias. Texture synthesis using convolutional neural networks. In NIPS, 2015b.

Ghamrawi, Nadia and McCallum, Andrew. Collective multi-label classification. In ACM International Conference on Information and Knowledge Management, 2005.

Globerson, Amir and Jaakkola, Tommi S. Fixing max-product: Convergent message passing algorithms for MAP LP-relaxations. In NIPS, 2008.

Godbole, Shantanu and Sarawagi, Sunita. Discriminative methods for multi-labeled classification. In Advances in Knowledge Discovery and Data Mining, pp. 22–30. Springer, 2004.

Goodfellow, Ian J., Shlens, Jonathon, and Szegedy, Christian. Explaining and harnessing adversarial examples. International Conference on Learning Representations, 2015.

Hariharan, Bharath, Zelnik-Manor, Lihi, Varma, Manik, and Vishwanathan, S.V.N. Large scale max-margin multi-label classification with priors. In ICML, 2010.

Hershey, John R., Roux, Jonathan Le, and Weninger, Felix. Deep unfolding: Model-based inspiration of novel deep architectures. arXiv preprint arXiv:1409.2574, 2014.

Hsu, Daniel, Kakade, Sham, Langford, John, and Zhang, Tong. Multi-label prediction via compressed sensing. In NIPS, 2009.

Huang, Zhiheng, Xu, Wei, and Yu, Kai. Bidirectional LSTM-CRF models for sequence tagging. CoRR, abs/1508.01991, 2015.

Jaderberg, Max, Simonyan, Karen, Vedaldi, Andrea, and Zisserman, Andrew. Deep structured output learning for unconstrained text recognition. International Conference on Learning Representations, 2014.

Jernite, Yacine, Rush, Alexander M., and Sontag, David. A fast variational approach for learning Markov random field language models. In ICML, 2015.

Ji, Shuiwang and Ye, Jieping. Linear dimensionality reduction for multi-label classification. In IJCAI, volume 9, pp. 1077–1082, 2009.

Jordan, Michael I., Ghahramani, Zoubin, Jaakkola, Tommi S., and Saul, Lawrence K. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999.

Kapoor, Ashish, Viswanathan, Raajay, and Jain, Prateek. Multilabel classification using Bayesian compressed sensing. In NIPS, 2012.

Koller, Daphne and Friedman, Nir. Probabilistic graphical models: principles and techniques. MIT Press, 2009.

Kulesza, Alex and Pereira, Fernando. Structured learning with approximate inference. In NIPS, 2007.

Kunisch, Karl and Pock, Thomas. A bilevel optimization approach for parameter learning in variational models. SIAM Journal on Imaging Sciences, 6(2):938–983, 2013.

Lafferty, John D., McCallum, Andrew, and Pereira, Fernando C.N. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, 2001.

Le, Quoc and Mikolov, Tomas. Distributed representations of sentences and documents. In ICML, 2014.

LeCun, Yann and Huang, Fu Jie. Loss functions for discriminative training of energy-based models. In AISTATS, 2005.

LeCun, Yann, Chopra, Sumit, Hadsell, Raia, Ranzato, M., and Huang, F. A tutorial on energy-based learning. Predicting Structured Data, 1, 2006.

Li, Yujia and Zemel, Richard S. Mean-field networks. ICML Workshop on Learning Tractable Probabilistic Models, 2014.

Lin, Victoria (Xi), Singh, Sameer, He, Luheng, Taskar, Ben, and Zettlemoyer, Luke. Multi-label learning with posterior regularization. In NIPS Workshop on Modern Machine Learning and NLP, 2014.

Long, Jonathan, Shelhamer, Evan, and Darrell, Trevor. Fully convolutional networks for semantic segmentation. In Computer Vision and Pattern Recognition, 2015.

Meshi, Ofer, Sontag, David, Globerson, Amir, and Jaakkola, Tommi S. Learning efficiently with approximate inference via dual losses. In ICML, 2010.

Mordvintsev, Alexander, Olah, Christopher, and Tyka, Mike. Inceptionism: Going deeper into neural networks, June 2015.

Niculescu-Mizil, Alexandru and Abbasnejad, Ehsan. Label filters for large scale multilabel classification. In ICML 2015 Workshop on Extreme Classification, 2015.

Petterson, James and Caetano, Tibério S. Submodular multi-label learning. In NIPS, 2011.

Read, Jesse, Pfahringer, Bernhard, Holmes, Geoff, and Frank, Eibe. Classifier chains for multi-label classification. Machine Learning, 85(3):333–359, 2011.

Schwing, Alexander G. and Urtasun, Raquel. Fully connected deep structured networks. CoRR, abs/1503.02351, 2015.

Srikumar, Vivek and Manning, Christopher D. Learning distributed representations for structured output prediction. In NIPS, 2014.

Stoyanov, Veselin, Ropson, Alexander, and Eisner, Jason. Empirical risk minimization of graphical model parameters given approximate inference, decoding, and model structure. In International Conference on Artificial Intelligence and Statistics, 2011.

Sutskever, Ilya, Vinyals, Oriol, and Le, Quoc V. Sequence to sequence learning with neural networks. In NIPS, 2014.

Szegedy, Christian, Zaremba, Wojciech, Sutskever, Ilya, Bruna, Joan, Erhan, Dumitru, Goodfellow, Ian, and Fergus, Rob. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014.

Taskar, B., Guestrin, C., and Koller, D. Max-margin Markov networks. NIPS, 2004.

Tsochantaridis, Ioannis, Hofmann, Thomas, Joachims, Thorsten, and Altun, Yasemin. Support vector machine learning for interdependent and structured output spaces. In ICML, 2004.

Yu, Hsiang-Fu, Jain, Prateek, Kar, Purushottam, and Dhillon, Inderjit S. Large-scale multi-label learning with missing labels. In ICML, 2014.

Zheng, Shuai, Jayasumana, Sadeep, Romera-Paredes, Bernardino, Vineet, Vibhav, Su, Zhizhong, Du, Dalong, Huang, Chang, and Torr, Philip. Conditional random fields as recurrent neural networks. In International Conference on Computer Vision (ICCV), 2015.

A. Appendix

A.1. Analysis of Convergence Behavior

Figure 2. Histogram of # required iterations for each example in a large batch to converge. Unnecessary computation is performed on already-converged examples while waiting for the final ones to converge.

Figure 3. Accuracy vs. # required iterations for varying convergence percentage. Optimization on a batch is terminated if X% of the examples in the batch have converged. This provides little decrease in accuracy, while providing an impressive speedup. We hypothesize that many of the slow-converging examples were ones for which prediction was going to be incorrect anyway, so it is acceptable to terminate these early.

Figure 4. Accuracy vs. # required iterations curve for varying convergence tolerance. By using a looser convergence tolerance, we can sacrifice accuracy for speed.

A.2. SPEN Architecture for Multi-Label Classification

[Figure: computation graph for the example SPEN architecture of Section 3. The input x feeds the feature network F(x); F(x) and ȳ feed the local energy, ȳ feeds the global energy, and the two terms are summed to give the total energy.]

A.3. Deep Mean Field Predictor

Consider a fully-connected pairwise CRF for multi-label prediction. We have:

    P(y|x) ∝ exp( Σ_{i,j} B_ij^(x)(y_i, y_j) + Σ_i U_i^(x)(y_i) )   (10)

Here, the dependence of the pairwise and unary potential functions B and U on x is arbitrary, and we leave this dependence implicit going forward. There are 4 possible values for B_ij, depending on the values of y_i and y_j. Similarly, there are two possible values for U_i. Suppose that y is represented as a vector in {0,1}^L; then we can re-write (10) as

    P(y|x) ∝ exp( y^T A_1 y + (1 − y)^T A_2 y + (1 − y)^T A_3 (1 − y) + C_1^T y + C_2^T (1 − y) )   (11)-(12)

Here, A_1, A_2, and A_3 are matrices and C_1 and C_2 are vectors. Collecting terms, we obtain a matrix A and vector C such that

    P(y|x) ∝ exp( y^T A y + C^T y + constant ),

i.e.,

    P(y|x) ∝ exp( y^T A y + C^T y ).   (13)

We seek to perform mean-field variational inference in this CRF. Let ȳ^t ∈ [0,1]^L be the estimate for the marginals of y at timestep t. Define ȳ^t_{i,0} to be a vector that is equal to ȳ^t everywhere but coordinate i, where it is 0 (i.e. we are conditioning the value of the i-th label to be 0). Similarly, define ȳ^t_{i,1}. The mean-field updates set

    ȳ_i^{t+1} = exp(e_i^1) / (exp(e_i^1) + exp(e_i^0)) = Sigmoid(e_i^1 − e_i^0),   (14)

where

    e_i^1 = (ȳ^t_{i,1})^T A ȳ^t_{i,1} + C^T ȳ^t_{i,1}   and   e_i^0 = (ȳ^t_{i,0})^T A ȳ^t_{i,0} + C^T ȳ^t_{i,0}.

Define s_i = Σ_{j≠i} A_{i,j} ȳ^t_j. Many terms in e_i^1 − e_i^0 cancel. We are left with

    e_i^1 − e_i^0 = s_i + C_i.

A vectorized form of the mean-field updates is presented in Algorithm 1.

Algorithm 1 Vectorized Mean-Field Inference for a Fully-Connected Pairwise CRF for Multi-Label Classification
  Input: x, m
  A, C ← GetPotentials(x)
  Initialize ȳ uniformly as [0.5]^L
  D ← diag(A)
  for t = 1 to m do
    E ← Aȳ − D + C
    ȳ ← Sigmoid(E)
  end for
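A direct NumPy transcription of Algorithm 1 for a single example (GetPotentials is left abstract here, as in the pseudocode):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def deep_mean_field(A, C, num_iters):
    """Vectorized mean-field updates of Algorithm 1.

    A: (L, L) pairwise potential matrix, C: (L,) unary potential vector,
    both produced from x by the potential networks (GetPotentials).
    """
    L = C.shape[0]
    y_bar = np.full(L, 0.5)          # uniform initialization
    D = np.diag(A)                   # diagonal of A, removed from the update
    for _ in range(num_iters):
        E = A @ y_bar - D + C
        y_bar = sigmoid(E)
    return y_bar
```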

Table 4. Properties of the datasets.

                #labels   #features   # train   % true labels
    Bibtex        159       1836        4880        2.40
    Delicious     983        500       12920       19.02
    Bookmarks     208       2150       60000        2.03
    Yeast          14        103        2417       30.3

Figure 5. Structure learning on the synthetic task using 10% of the data. The measurement matrix still recovers interactions between the labels characteristic of the data generating process.

A.4. Details for Improving Efficiency and Accuracy of SPENs

Various tricks of the trade from the deep learning literature, such as momentum, can be applied to improve the prediction-time optimization performance of the entropic mirror descent approach described in Section 2. These are particularly important because E_x(ȳ) is generally non-convex. We perform inference in minibatches in parallel on GPUs.

When 'soft' predictions are useful, it can be useful to augment E_x(ȳ) with an extra term for the entropy of ȳ. This can be handled at essentially no computational cost, by simply normalizing the iterates in entropic mirror descent at a certain 'temperature.' This is only done at test time, not in the inner loop of learning.

Typically, backpropagation computes the gradient of the output with respect to the input and also the gradient of the output with respect to any parameters of the network. During inference, however, we only care about gradients with respect to the input ȳ. Therefore, we can obtain a considerable speedup by avoiding computation of the parameter gradients.

We train the local energy network first, using a local label-wise prediction loss. Then, we clamp the parameters of the local energy network and train the global energy network. Finally, we perform an additional pass of training, where all parameters are updated using a small learning rate.

A.5. Hyperparameters

For prediction, both at test time and in the inner loop of learning, we ran gradient descent with momentum = 0.95, a learning rate of 0.1, and no learning rate decay. We terminated prediction when either the relative change in the objective was below a tolerance or the l∞ change between iterates was below an absolute tolerance.

For training, we used SGD with momentum 0.9, with the learning rate and learning rate decay tuned on development data. We used l2 regularization both when pre-training the feature network and during SSVM training, with the l2 weights tuned on development data.

We did not tune the sizes of the hidden layers for the feature network. These were set based on intuition and the size of the data, the number of training examples, etc.