Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19)

Learning and Inference for Structured Prediction: A Unifying Perspective

Aryan Deshwal¹, Janardhan Rao Doppa¹, Dan Roth²
¹School of EECS, Washington State University
²Department of Computer and Information Science, University of Pennsylvania
{aryan.deshwal,jana.doppa}@wsu.edu, [email protected]

Abstract

In a structured prediction problem, one needs to learn a predictor that, given a structured input, produces a structured object, such as a sequence, tree, or clustering output. Prototypical structured prediction tasks include part-of-speech tagging (predicting the POS tag sequence for an input sentence) and semantic segmentation of images (predicting semantic labels for the pixels of an input image). Unlike simple classification problems, here there is a need to assign values to multiple output variables while accounting for the dependencies between them. Consequently, the prediction step itself (aka "inference" or "decoding") is computationally expensive, and so is the learning process, which typically requires making predictions as part of it. The key learning and inference challenges are due to the exponential size of the structured output space and depend on its complexity. In this paper, we present a unifying perspective of the different frameworks that address structured prediction problems and compare them in terms of their strengths and weaknesses. We also discuss important research directions, including the integration of deep learning advances into structured prediction methods, and learning from weakly supervised signals and active querying to overcome the challenges of building structured predictors from small amounts of labeled data.

1 Introduction

Structured prediction (SP) tasks arise in several domains including natural language processing, computer vision, computational biology, and graph analysis. In a SP problem, the goal is to learn a mapping from structured inputs to structured outputs. For example, in semantic labeling of images, the structured input is an image and the structured output is a labeling of the image regions. In structured prediction tasks, we need to predict multiple output variables by exploiting the dependencies between them. Learning for probabilistic SP is #P-hard due to the computation of the partition function. Specifically, the time complexity of exact inference depends on the complexity of the model that tries to capture the dependency structure between input and output variables. Efficient solutions exist only when this dependency structure forms a tree with small width. Therefore, the core research challenge in structured prediction has been to achieve a balance between two conflicting goals: 1) the approach must be flexible enough to allow complex and accurate predictors to be learned; and 2) it must support computationally-efficient inference of outputs.

There are different structured prediction frameworks that make varying trade-offs between the above two goals. In this paper, we present a unifying perspective of the different frameworks to solve structured prediction problems and compare them in terms of their strengths and weaknesses. The unifying perspective relies on two key abstractions: 1) inference is formulated as an implicit or explicit search process; and 2) learning is either performed with a fixed inference scheme, or also optimizes the efficiency and accuracy of inference. We also discuss some important future research directions in this area, namely, integrating advances in deep learning for structured prediction and learning from small amounts of labeled data. Since the literature on structured prediction is vast, we refer the reader to our recent IJCAI and AAAI tutorials for a complete set of references.

2 Problem Setup

A structured prediction problem specifies a space of structured inputs X, a space of structured outputs Y, and a non-negative loss function L : X × Y × Y → ℝ+ such that L(x, y′, y*) is the loss associated with labeling a particular input x by output y′ when the true output is y*. Without loss of generality, we assume that each structured output y ∈ Y is represented using d discrete and/or continuous variables v1, v2, ..., vd, and each variable vi can take candidate values from a set C(vi). Since all algorithms will be learning functions or objectives over input-output pairs, they assume the availability of a joint feature function Φ : X × Y → ℝ^m that computes an m-dimensional feature vector for any input-output pair.

Example 1: POS tagging task. Each structured input is a sentence. Each output variable vi corresponds to the POS tag of one word, and C(vi) is the list of all candidate POS tags. Hamming loss (the number of incorrect POS tags) is typically employed as the loss function. Joint features include unary features (representing words and their POS tags, as in a multi-class classifier) and structural features (e.g., label pairs to capture the strength of label transitions).

Example 2: Image labeling task. Each structured input is an image. Each output variable vi corresponds to a semantic label of one pixel in the image, and C(vi) is the list of all candidate labels. Intersection-over-Union (IoU) loss, i.e., the similarity between the predicted region and the ground-truth region for a given semantic labeling, is employed as the loss function. Unlike Hamming loss, IoU loss doesn't decompose over the losses of individual output variables. Joint features include unary features and structural features (e.g., context features that count the co-occurrences of different labels in different spatial relationships such as left, right, above, and below; intuitively, we are capturing, for example, whether a pixel labeled "sky" is below another pixel labeled "grass").

We are provided with a training set of structured input-output pairs {(x, y*)} drawn from an unknown target distribution D. The goal is to return a function/predictor from structured inputs to outputs whose predicted outputs have low expected loss with respect to the distribution D. The manner in which this goal is achieved varies among SP algorithms.
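To make the setup concrete, here is a minimal sketch (ours, not from the paper) of the Hamming loss and a joint feature function Φ for the POS tagging task of Example 1, with unary and label-transition features; all function names are illustrative.

```python
from collections import Counter

def hamming_loss(y_pred, y_true):
    """Number of incorrectly labeled output variables (the loss in Example 1)."""
    return sum(a != b for a, b in zip(y_pred, y_true))

def joint_features(x, y):
    """Sparse joint feature map Phi(x, y): unary (word, tag) features plus
    structural (tag, next-tag) transition features."""
    phi = Counter()
    for i, (word, tag) in enumerate(zip(x, y)):
        phi[("unary", word, tag)] += 1.0
        if i + 1 < len(y):
            phi[("trans", tag, y[i + 1])] += 1.0
    return phi

x = ["the", "dog", "barks"]       # structured input: a sentence
y_star = ["DT", "NN", "VBZ"]      # ground-truth POS tag sequence
print(hamming_loss(["DT", "NN", "NN"], y_star))   # -> 1
```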

3 Cost Function Learning Approaches

Cost function learning approaches correspond to generalizations of traditional classification algorithms: Conditional Random Fields (CRFs) [Lafferty et al., 2001], Structured Perceptron [Collins, 2002], and Structured Support Vector Machines (SSVM) [Taskar et al., 2003; Tsochantaridis et al., 2004]. These methods typically learn a linear cost function C(x, y) = w · Φ(x, y) to score a candidate structured output y given a structured input x, where w ∈ ℝ^m is the weight vector.

The learning approach is iterative by nature. In each iteration, we perform three main steps: select a subset of inputs D′ from the training set and compute predictions D′_y by running the inference algorithm with the current weights; generate ranking examples if predictions don't match the ground-truth outputs; and update the weights based on the new or aggregate ranking examples using an appropriate optimization method. By varying |D′|, we can get online (|D′| = 1), mini-batch (|D′| < |D|), and full-batch (|D′| = |D|) training methods. Cost function learning approaches treat the inference solver as a "black box" during training (aka learning with inference).

Algorithm 1 Cost Function Learning Framework
Input: D = {x, y*}, structured training examples
1: Initialize the weights of cost function C: w ← 0
2: repeat
3:   Select a batch D′ ⊆ D for (loss-augmented) inference
4:   // Call inference algorithm
5:   D′_y ← Inference-Solver(D′, C(x, y))
6:   for each (x, y*) ∈ D′ and ŷ ∈ D′_y do
7:     If y* ≠ ŷ, create a ranking example C(x, y*) < C(x, ŷ)
8:   end for
9:   // Optimization to update weights
10:  Update weights w using new or aggregate ranking examples
11: until convergence
12: return w, weights of the learned cost function C
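A minimal sketch of Algorithm 1 (ours), assuming the online setting (|D′| = 1) and instantiating the weight-update step with a perceptron-style move; the inference solver is the assumed black box, and `joint_features` is the Φ sketched above.

```python
from collections import defaultdict

def learn_cost_function(D, inference_solver, joint_features, epochs=10):
    """Sketch of Algorithm 1 with |D'| = 1 and a perceptron-style update.
    C(x, y) = w . Phi(x, y) and lower cost is better, so each ranking
    example C(x, y*) < C(x, y_hat) is enforced by lowering the cost of
    the ground truth and raising the cost of the wrong prediction."""
    w = defaultdict(float)
    for _ in range(epochs):
        for x, y_star in D:
            y_hat = inference_solver(x, w)   # black box: approx. argmin_y C(x, y)
            if y_hat != y_star:              # create and apply a ranking example
                for f, v in joint_features(x, y_star).items():
                    w[f] -= v
                for f, v in joint_features(x, y_hat).items():
                    w[f] += v
    return w                                 # weights of the learned cost function C
```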

Learning with approximate inference. It is conceivable that the above learning mechanism may not be reliable with approximate inference solvers. Researchers have investigated the impact of approximation on learning by considering two categories of approximate inference solvers [Finley and Joachims, 2008]: 1) Undergenerating, which consider a subset of the structured output space (e.g., greedy search); and 2) Overgenerating, which consider a superset of the structured output space (e.g., relaxations).

Integer linear programming (ILP) formulations [Roth and Yih, 2004; Chang et al., 2012] have shown great success in a large number of NLP applications. ILP formulations were found to be helpful in modeling a large number of problems; the inference problems can be solved exactly or via various relaxation methods and are shown to work very well in practice. A recent work [Meshi et al., 2016] provided a theoretical insight that explains this success: a large number of relaxed solutions are integral (relaxations are tight), and this tightness on training instances generalizes to testing instances (a PAC theory for ILP-based inference).

Decomposed learning. The decomposed learning framework [Samdani and Roth, 2012; Sontag et al., 2010] improves the speed of training by performing inference over a subset of the structured output space Y′(x) ⊂ Y(x). The size of Y′(x) is determined by a parameter k and grows exponentially as a function of k. The general construction considers all candidate structured outputs whose Hamming loss with respect to the correct output y* is at most k (referred to as the neighborhood of y*): Y′(x) = {y ∈ Y(x) : Hamm-Loss(y, y*) ≤ k}. Samdani and Roth [2012] provide theoretical conditions under which decomposed learning is equivalent to standard learning over the entire output space Y(x). In practice, decomposed learning is shown to achieve the same accuracy as standard learning with a small value of k (e.g., 1, 2, 3) [Samdani and Roth, 2012; Sontag et al., 2010], which significantly improves the training time.
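The neighborhood Y′(x) has an explicit construction, sketched below (our code): enumerate every output within Hamming distance k of y*. Its size is exponential in k but small for k = 1, 2, 3.

```python
from itertools import combinations, product

def hamming_ball(y_star, candidates, k):
    """Yield Y'(x) = {y : Hamm-Loss(y, y*) <= k}; candidates[i] is C(v_i)."""
    d = len(y_star)
    yield tuple(y_star)                            # distance 0
    for m in range(1, k + 1):                      # exactly m incorrect variables
        for pos in combinations(range(d), m):
            wrong = [[c for c in candidates[i] if c != y_star[i]] for i in pos]
            for vals in product(*wrong):
                y = list(y_star)
                for i, v in zip(pos, vals):
                    y[i] = v
                yield tuple(y)

# 1 + (2 positions * 2 wrong labels) = 5 outputs for d = 2, |C(v_i)| = 3, k = 1:
print(len(list(hamming_ball(("DT", "NN"), [["DT", "NN", "VB"]] * 2, k=1))))
```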
Amortized inference and structured learning. We need to solve inference problems for multiple structured inputs during both training and testing. The naive approach is to run an inference solver independently on each input example. It is conceivable that we can learn useful knowledge while solving inference problems on past examples to improve the speed of inference on future examples. This is referred to as amortizing the cost of inference [Srikumar et al., 2012], which is highly related to the speedup learning literature [Fern, 2010]. Recent work has exploited the ILP inference formulation as an abstraction to provably achieve amortized inference [Srikumar et al., 2012] and significantly improve the speed of inference and of training cost functions [Chang et al., 2015b]. The key idea is to store a set of cached solutions for ILP problems and reuse them for new inference problems, without calling the inference solver, when theoretical conditions are met. For NLP applications, many sentences have identical structured outputs such as POS tag sequences, parse trees, semantic parses, etc. Therefore, the amortization theorems "fire" often and result in significant savings. Viewing inference procedures as computational search processes will allow us to study generic approaches to address speedup learning problems arising in structured prediction. Some examples include treating ILP inference as a white box (i.e., branch and bound search) to learn heuristic functions to achieve amortized inference.
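The simplest mechanical picture of amortization is a solution cache, as in the sketch below (ours). Note this only reuses a solution when the new inference problem is identical; the amortization theorems of [Srikumar et al., 2012] are stronger, reusing cached ILP solutions for provably equivalent but non-identical problems.

```python
class AmortizedSolver:
    """Wrap a black-box inference solver with a solution cache so repeated
    problems (e.g., sentences with identical output structure) skip solving."""
    def __init__(self, solver):
        self.solver = solver
        self.cache = {}
        self.calls = 0            # number of actual solver invocations

    def solve(self, key, problem):
        # 'key' must encode the inference problem; equal keys => equal solutions.
        if key not in self.cache:
            self.calls += 1
            self.cache[key] = self.solver(problem)
        return self.cache[key]
```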

Theoretical results. Early theoretical results for generalization were based on covering number bounds for decomposable loss functions and linear scoring functions [Taskar et al., 2003], and on PAC-Bayesian theory. Recent results based on factor graph complexity provide improved bounds, and served as a motivation to derive the voted risk minimization principle to achieve better generalization by learning an ensemble of simpler scoring functions [Cortes et al., 2016].

Figure 1: An example search space for handwriting recognition.

Structured prediction cascades. This approach addresses inference complexity via cascade training [Felzenszwalb and McAllester, 2007; Weiss and Taskar, 2010], where efficiency is achieved by performing multiple runs of inference from a coarse level to a fine level of abstraction using learned cost functions of varying complexity. We can view this as a form of progressive filtering of candidate structured outputs by trading off the accuracy (number of errors) and efficiency (number of filtered outputs) of filtering at each level. Weiss and Taskar [2010] developed a forward training approach to learn the weights of the different cost functions employed in the cascade. These methods have shown good success in practice, but they need to place some restrictions on the form of the cost functions to facilitate "cascading."

4 Search-based Learning Approaches

In this section, we discuss different search-based learning approaches and their distinction from traditional cost function learning methods. Search-based approaches formulate inference as an explicit search problem.

Overview. Search-based methods formulate the problem of structured prediction as an explicit state-space search process using a search architecture (search space, search procedure, and termination criteria). They learn appropriate search control knowledge (e.g., heuristics, cost functions) using training data to optimize the accuracy of this search architecture in making predictions. Unlike traditional approaches, there is no need to solve a global optimization problem at prediction time. In effect, the system learns how to do inference (aka learning for inference), in contrast to the learning with inference style of cost function learning approaches.

Potential advantages. Some advantages of search-based methods include: 1) They scale gracefully with the representation complexity: we can employ higher-order features for the functions that guide the search without increasing the inference complexity. 2) Since inference is modeled as a "white box", the learning process can observe search errors and perform robust training; this can also help with debugging from a practitioner's perspective. 3) They can potentially be trained to optimize non-decomposable loss functions.

4.1 Heuristic Learning Methods

Heuristic learning approaches vary depending on the search space and search procedure employed in the overall search architecture. There are three key elements in the heuristic learning framework: 1) a search space over partial structured outputs; 2) a search procedure that is used to make predictions (e.g., beam search, greedy search); and 3) search control knowledge to guide the search process.

Search space. A search space over partial structured outputs Sp is a 4-tuple ⟨I, A, Φ, T⟩, where I is a function that maps an input x to an initial search node, A is a finite set of actions (or operators), Φ is a feature function from search nodes to real-valued feature vectors, and T is the terminal state predicate that maps search nodes to {0, 1} indicating whether the node is a terminal or not. Each terminal node in the search space corresponds to a complete structured output, while non-terminal nodes correspond to partial structured outputs. Fig. 1 provides an illustration of the search space for a simple handwriting recognition problem. Each search state is a pair (x, y′) where x is the structured input (a binary image of the handwritten word) and y′ is a partial labeling of the word. The arcs in this space correspond to actions that label the characters in the input image in a left-to-right order by extending y′ in all possible ways by one element. The terminal states or leaves of this space correspond to complete labelings of the input x. The terminal state corresponding to the correct output y* is labeled as the goal state.
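As an illustration (with our own naming), the 4-tuple ⟨I, A, Φ, T⟩ for the left-to-right sequence-labeling space of Fig. 1 can be sketched as:

```python
class SequenceLabelingSpace:
    """Search space over partial outputs <I, A, Phi, T> for left-to-right
    sequence labeling: a search node is (x, y') with y' a label prefix."""
    def __init__(self, labels, node_features):
        self.labels = labels                 # candidate values C(v_i)
        self.node_features = node_features   # feature function over nodes

    def initial(self, x):                    # I: maps input to initial node
        return (x, ())

    def actions(self, node):                 # A: extend y' by one label
        x, y = node
        return [(x, y + (lab,)) for lab in self.labels]

    def phi(self, node):                     # Phi: node -> feature vector
        return self.node_features(*node)

    def terminal(self, node):                # T: all output variables labeled?
        x, y = node
        return len(y) == len(x)
```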
Making predictions. Given a structured input x, a search procedure P, and a learned heuristic function H, the decision process for predicting a structured output corresponds to exploring the search tree rooted at the initial node until reaching a terminal state. The complete structured output ŷ from that terminal state is returned as the predicted output.

Learning objective. The goal is to learn the parameters of the heuristic function H such that for each training example (x, y*), the search procedure P guided by H will return y* as the predicted output.

Learning for Beam Search

We discuss the general framework for learning beam search heuristics and different instantiations below.

General framework. The heuristic function H is represented as a linear function: H(n) = w · Φ(n), where w stands for the weights and Φ(n) corresponds to the features of search node n. The general template for learning the weights of heuristic H is shown in Algorithm 2. It is an iterative online learning approach. In each iteration, we perform learning on each training example (x, y*) ∈ D as follows. Beam B is initialized with the initial node I(x). Beam search is performed using the current weights until reaching a terminal state. If a search error is identified, the weights are updated to move towards correctness and search is continued by updating the beam. There are three key design choices in this framework: 1) How is a search error defined? 2) What is the form of the weight update? and 3) How is the beam updated after a weight update?

Algorithm 2 Heuristic Learning for Beam Search
Input: D = {x, y*}, structured input-output training examples; Sp = ⟨I, A, Φ, T⟩, search space; b: beam width
1: Initialize the weights of heuristic function H: w ← 0
2: repeat
3:   for each training example (x, y*) ∈ D do
4:     Initialize beam B with I(x)
5:     repeat
6:       Perform beam search with current weights w
7:       if Search Error then
8:         Update weights w
9:         Reset beam B
10:      end if
11:      (Dis)continue search
12:    until reaching terminal state
13:  end for
14: until convergence or maximum iterations
15: return w, weights of the learned heuristic function H

Instantiations. We describe three main instantiations of the general framework that make different design choices. 1) Early update [Collins and Roark, 2004]. A search error is declared when a target node (a node with all output variables correct; the green colored nodes in Fig. 1) is not present in the beam; intuitively, the search cannot reach the goal node anymore. The weight update is similar to the structured perceptron over the partially labeled outputs. The positive and negative labelings differ in the last output variable. After the weight update, the beam is reset with the initial node I(x) and search is (dis)continued for the current training example. One drawback of early update is that learning can be very slow. 2) Learning as Search Optimization (LaSO) update [Daumé III and Marcu, 2005; Xu et al., 2009]. A search error is defined similarly to early update. Intuitively, the weight update tries to increase the score of target nodes and decrease the score of non-target nodes. Specifically, the weights are updated by adding the average feature vector of target nodes and subtracting the average feature vector of non-target nodes in the candidate set. After the weight update, the beam is reset with the target nodes in the candidate set (ties are broken using the updated heuristic scores). Xu et al. [2009] study different notions of margin and the conditions under which the weights will converge. Their work formalizes the intuition that learning becomes easier as we increase the beam width b (i.e., the amount of search used for reasoning). 3) Max-violation update [Huang et al., 2012]. The key insight in this approach is that we only need violations to ensure convergence of the weights (not exact search). This framework is referred to as the violation-fixing perceptron. Early update is a special case of this framework as it guarantees a violation, but learning is very slow. To improve the speed of learning, the max-violation update considers the worst mistake (i.e., the prefix where the violation is maximum) by performing beam search until reaching a terminal state. A single weight update is performed per training example (x, y*).
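A sketch (ours) of one training pass of Algorithm 2 under the early-update instantiation, using the SequenceLabelingSpace above; the LaSO variant would instead reset the beam with the surviving target nodes and continue searching.

```python
def dot(w, phi):
    return sum(w[f] * v for f, v in phi.items())

def early_update_epoch(D, space, w, b):
    """One pass over D: beam search with width b; when the target node
    (gold prefix) falls out of the beam, declare a search error, make a
    perceptron update, and discontinue the current example."""
    for x, y_star in D:
        beam = [space.initial(x)]
        while not space.terminal(beam[0]):
            succ = [s for node in beam for s in space.actions(node)]
            succ.sort(key=lambda s: dot(w, space.phi(s)), reverse=True)
            beam = succ[:b]
            depth = len(beam[0][1])
            target = (x, tuple(y_star[:depth]))   # node with all labels correct
            if target not in beam:                # search error
                for f, v in space.phi(target).items():
                    w[f] += v
                for f, v in space.phi(beam[0]).items():
                    w[f] -= v
                break                             # early update: stop this example
    return w
```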

Learning for Greedy Search

Greedy search corresponds to constructing structured outputs based on a sequence of decisions: one for each output variable v1, v2, ..., vd. This is formalized for a given structured-prediction problem by defining a search space that employs a fixed ordering over the output variables (e.g., left-to-right in sequence labeling). The search control knowledge corresponds to a recurrent classifier (or policy) that predicts the label for the next output variable. In Fig. 1, the highlighted search nodes correspond to the trajectory of the optimal recurrent classifier (i.e., a classifier that chooses the correct action at every state leading to the goal state).

Imitation learning formulation. The problem of learning a greedy policy can be formulated in the framework of Imitation Learning (IL). IL is considered to be an efficient framework for learning sequential decision-making policies when a good expert policy is available to drive the learning process. For structured prediction, ground-truth structured outputs can be considered as demonstrations of the expert policy.

Exact imitation and error propagation. The most basic approach to learning a recurrent classifier is via exact imitation of the trajectory followed by the optimal classifier. For example, the sequence of highlighted states in Fig. 1 corresponds to the solution path for producing y* from x. The exact imitation approach learns a classifier by creating one classification training example for each search node n on the solution path, with feature vector Φ(n) and label equal to the action followed by the path at node n. The aggregate set of classification examples collected over all the training examples in D is given to a classifier learning algorithm (e.g., SVM) to induce a recurrent classifier. We can view this as a reduction of structured prediction to classifier learning for simple outputs [Beygelzimer et al., 2016]. This allows us to leverage off-the-shelf classifier learning algorithms to solve structured prediction problems. Unlike standard classification problems that assume IID input examples, our problem is non-IID because the next state depends on the decision of the classifier at the previous state. Therefore, recurrent classifiers learned via exact imitation can be prone to error propagation.

Advanced imitation learning algorithms. To address the error propagation issue, we can employ advanced imitation learning algorithms including SEARN [Daumé III et al., 2009], DAgger [Ross et al., 2011], and LOLS [Chang et al., 2015a]. The general idea is to iteratively observe the behavior of the current recurrent classifier and augment the training data to better represent important states. This additional data allows us to learn robust classifiers that can recover from their own mistakes. Recent work has extended IL methods to handle continuous structured outputs including time-series [Venkatraman et al., 2015; Le et al., 2017].
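Below is a sketch (ours) contrasting exact imitation with a DAgger-flavored pass for the left-to-right space above: both label visited states with the expert (gold) action, but the DAgger-style pass rolls in with the current policy so that states induced by the learner's own mistakes are represented in the training data.

```python
def exact_imitation_data(D, space):
    """Reduction to classification: one example (Phi(n), gold action) per
    node n on the gold solution path."""
    data = []
    for x, y_star in D:
        node = space.initial(x)
        for a_star in y_star:                     # expert action = gold label
            data.append((space.phi(node), a_star))
            node = (x, node[1] + (a_star,))       # follow the expert
    return data

def dagger_epoch(D, space, policy, data):
    """DAgger-style aggregation: visit states by following the *current*
    policy, but record the expert's action at every visited state."""
    for x, y_star in D:
        node = space.initial(x)
        while not space.terminal(node):
            a_star = y_star[len(node[1])]          # expert action here
            data.append((space.phi(node), a_star))
            node = (x, node[1] + (policy(space.phi(node)),))
    return data   # retrain the classifier on the aggregated dataset
```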
Easy-First Learning

The main drawback of classifier-based structured prediction approaches is that they require a fixed ordering over the output variables. The easy-first framework [Goldberg and Elhadad, 2010; Stoyanov and Eisner, 2012; Xie et al., 2015] overcomes this drawback by allowing the learner to dynamically select the ordering. Specifically, we modify the search space to consider labeling all unassigned output variables at each greedy search step. The structured output is constructed incrementally by making the easiest (most confident) decision at each step to gather more evidence for making hard decisions later. This principle is very similar to that of constraint satisfaction algorithms.

Good and bad actions. Given the actions A(s) from a specific state s, we consider any action a ∈ A(s) that results in a state s′ with L(s′) < L(s) as a good action; otherwise, it is a bad action. Within the easy-first framework, it is typical to encounter states that have more than one good action. We denote the set of all good actions in state s as G(s) and the set of all bad actions as B(s).

Action scoring function. The action scoring function H evaluates all actions in A(s) and guides the greedy search to incrementally produce the structured output. Existing approaches consider linear action-scoring functions H(s, a) = w · Φ(s, a), where w is the weight vector and Φ(s, a) is a predefined feature function.

Learning approach. The learning goal within the easy-first framework is to learn a weight vector w such that the highest-scoring action at each step is a good action. The general online training procedure is described in Algorithm 3. In any given state s, if the current highest-scoring action ap is a bad action, then the weights are updated. Existing strategies for the weight update include: 1) a perceptron-style update involving the highest-scoring good action and the highest-scoring bad action [Goldberg and Elhadad, 2010]; 2) the highest-scoring good action and all violated bad actions [Xie et al., 2015]; and 3) regularized versions of (1) and (2) that minimize the change of weights, which were shown to perform better in practice [Xie et al., 2015]. After the weight update, a ChooseAction procedure is called to select the next action, and we transition to the next search state. The choice of ChooseAction determines the training trajectories. Two types of approaches have been pursued in the literature for this purpose: on-trajectory training, which always chooses an action in G(s) (e.g., the highest-scoring action in G(s)), and off-trajectory training, which always chooses the highest-scoring action based on the current scoring function even when it is a bad action. This is repeated until reaching a terminal state.

Algorithm 3 Easy-First Policy Learning
Input: D = {x, y*}, structured input-output training examples; Sp = ⟨I, A, Φ, T⟩; L: loss function
1: Initialize the weights of policy H: w ← 0
2: repeat
3:   for each training example (x, y*) ∈ D do
4:     s ← I(x) and TERMINATE ← False
5:     while not TERMINATE do
6:       ap ← argmax_{a ∈ A(s)} w · Φ(s, a)
7:       // update weights if a bad action is selected
8:       if ap ∈ B(s) then
9:         UPDATE(w, G(s), B(s))
10:      end if
11:      ac ← ChooseAction(A(s))
12:      s ← Apply ac on s
13:      if Terminal(s) then
14:        TERMINATE ← True
15:      end if
16:    end while
17:  end for
18: until convergence or maximum iterations
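One step of Algorithm 3, sketched below (ours), with the perceptron-style update (strategy 1) and both ChooseAction variants; `good(s, a)` is an assumed oracle that tests membership in G(s) using the loss function available at training time.

```python
def easy_first_step(s, A, good, w, phi, dot, on_trajectory=True):
    """If the highest-scoring action is bad, update w toward the best good
    action and away from the best bad action, then pick the next action."""
    ranked = sorted(A, key=lambda a: dot(w, phi(s, a)), reverse=True)
    G = [a for a in ranked if good(s, a)]      # good actions G(s), best first
    B = [a for a in ranked if not good(s, a)]  # bad actions B(s), best first
    if G and B and ranked[0] == B[0]:          # best action is bad: update
        for f, v in phi(s, G[0]).items():
            w[f] += v
        for f, v in phi(s, B[0]).items():
            w[f] -= v
    if on_trajectory and G:                    # ChooseAction: stay on a good path
        return G[0]
    return max(A, key=lambda a: dot(w, phi(s, a)))   # off-trajectory choice
```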


4.2 Heuristic and Cost Learning Methods

In this section, we discuss approaches that learn both heuristic and cost functions for structured prediction.

HC-Search Framework

Overview. The HC-Search framework [Doppa et al., 2014a; Lam et al., 2015; Doppa et al., 2014c] decomposes the structured prediction problem into three steps: 1) find an initial complete output; 2) explore a search tree of alternative candidate outputs rooted at the initial solution; and 3) score each of these candidates to select the best one. Step 1 can be done by any method. Step 2 is guided by a learned heuristic H, and Step 3 is performed by a learned cost function C.

Advantages. First, in the standard approaches a global cost function must be trained to "defend against" the exponentially-large set of all wrong candidate outputs to the problem. This is expensive both computationally and in terms of sample complexity, and it can require highly expressive representations (e.g., higher-order potentials). In contrast, the heuristic function H only needs to correctly rank the successors of each state that is expanded during the heuristic search in Step 2, and the cost function C only needs to correctly find the best of these in Step 3. These are much easier learning problems, and hence simpler potential functions can be applied. Second, there is no need to solve a global optimization problem at prediction time. In effect, the system learns not only how to score candidate solutions but also how to find good candidates; it learns to do inference more efficiently. Third, HC-Search can be applied to non-decomposable loss functions; this is another consequence of using a search space formulation. Finally, HC-Search provides a clean engineering methodology for determining which components of the system would most benefit from additional engineering effort.

Key learning challenges. There are three key learning challenges in this framework: How can we automatically design an efficient search space over outputs? (search space design); How can we learn a heuristic function H for effectively guiding the search? (heuristic learning); and How can we learn a cost function C that can accurately select the best output among the candidate outputs? (cost function learning)

Search space design. The effectiveness of this framework critically depends on the quality of the search space, where quality is defined as the expected search depth at which target outputs y* can be located. If the target search depth is small, the heuristic and cost function learning problems become easier. Some generic search spaces over structured outputs include the Flipbit space, the Limited Discrepancy Search (LDS) space [Doppa et al., 2014b] that leverages greedy recurrent classifiers, and the Randomized Segmentation space [Lam et al., 2015] for computer vision tasks. We note that search space design is a very under-studied problem and a lot more work needs to be done in this direction.

Heuristic function learning. The heuristic learning approach is based on the observation that for many SP problems, we can quickly generate very high-quality outputs by guiding the search procedure using the true loss function (which is only available during training). Motivated by this observation, the heuristic learning problem is formulated in the framework of imitation learning, to learn a heuristic that mimics the search decisions made by the true loss function on training examples. This is a generic approach that is applicable to all ranking-based search procedures (e.g., greedy search and beam search). The aggregate set of ranking constraints collected over all of the training examples is given to a rank learning algorithm to learn the heuristic function H.

Cost function learning. Given a learned heuristic H, we want to learn a cost function that correctly ranks the candidate outputs generated by the search procedure guided by H. This is formulated as another rank-learning problem, such that the cost function C scores the best-loss output generated during search higher than the other outputs.
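The prediction-time behavior of HC-Search can be sketched as follows (our simplification: greedy search with a fixed budget as the Step 2 exploration; `initial_output`, `successors`, `H`, and `C` are assumed inputs):

```python
def hc_search_predict(x, initial_output, successors, H, C, budget=50):
    """HC-Search prediction: Step 1 builds an initial complete output,
    Step 2 explores alternatives guided by the heuristic H, and Step 3
    returns the candidate scored best (lowest) by the cost function C."""
    y = initial_output(x)                    # Step 1: any method works
    candidates = [y]
    for _ in range(budget):                  # Step 2: H-guided greedy search
        neighbors = successors(x, y)         # e.g., flip one output variable
        if not neighbors:
            break
        y = min(neighbors, key=lambda z: H(x, z))
        candidates.append(y)
    return min(candidates, key=lambda z: C(x, z))   # Step 3: select with C
```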
Other Frameworks

Re-ranking methods generate a k-best set of candidate outputs and score them using a learned cost function. Candidate outputs can be produced using a generative model [Collins, 2002] or a MAP inference solver [Yadollahpour et al., 2013]. Additionally, if we model an ILP inference solver as a computational search process using a branch-and-bound procedure, we can learn heuristics and policies to adaptively decide the node searching order to improve the speed of inference: another way to amortize the cost of inference. Recent work on randomized greedy search based amortized inference and learning can be seen as an instantiation of HC-Search: the evaluation function used to select good starting solutions for greedy search and the scoring function used to select the final structured output correspond to H and C respectively, but they are learned using different methods.

5 Deep Learning ∩ Structured Prediction

An interesting question to ask is: "Do we need structured prediction in the era of deep learning?" The answer is yes, because deep models are sample-inefficient and structure is a compact way of representing large amounts of data [Liang et al., 2008]. Therefore, it is beneficial to integrate the advances in deep learning with structured prediction frameworks.

Decoupled integration. Classical deep networks allow us to learn complex unary features over structured input variables. In decoupled integration, we can employ the unary features learned from deep networks within existing structured prediction methods [Huang et al., 2015].

Search-based learning with deep models. To leverage structural dependencies, auto-regressive models including recurrent neural networks (RNNs), long short-term memory networks (LSTMs), and attention networks [Sutskever et al., 2014; Kim et al., 2017] order the output variables and predict one variable at a time conditioned on the previous variables. This ordered sequence of output variables is typically generated using a decoding procedure based on greedy search or beam search. During the training phase, the prediction of each variable is conditioned on the ground-truth values of the previous output variables. As a result, the model is never exposed to its own errors during training. This phenomenon is referred to as the exposure bias problem. To overcome this challenge, prior work has leveraged the general ideas from the search-based methods presented in Section 4.1. For example, the LaSO approach is leveraged for beam search decoding [Wiseman and Rush, 2016] and advanced imitation learning methods are leveraged for greedy decoding [Bengio et al., 2015]. Another challenge with auto-regressive models is the mismatch between the training loss (e.g., token-level cross entropy) and the task loss defined over the complete structure (e.g., BLEU score in translation). Reinforcement learning (RL) methods are employed with greedy decoding [Ranzato et al., 2015] to alleviate this challenge by training over the entire structure using the task loss as a reward function. A recent work [Tan et al., 2018] presented a unified framework for training via both the token-level maximum likelihood principle and RL.

Cost function learning with deep models. The two above-mentioned approaches employ a very simple form of structure among the output variables to learn representations: no dependency structure between output variables in decoupled integration, and a linear ordering among output variables in search-based learning. Therefore, these methods impose an excessively strict inductive bias. To overcome these drawbacks, recent work explored tight integration of deep models with the cost function learning approaches that were discussed in Section 3. Due to lack of space, we only provide some representative examples. The deep structured prediction framework of [Chen et al., 2015] replaces clique potentials with a deep network and approximates the partition function with loopy belief propagation. The structured prediction energy networks (SPENs) framework allows us to learn a non-linear cost function over structured input-output pairs in the structured SVM training regime [Belanger et al., 2017]. SPENs fall within the general framework of energy-based learning [LeCun et al., 2006]. Deep value networks (DVNs) learn a non-linear regressor to approximate the negative loss value of a candidate structured output [Gygli et al., 2017]. Both SPENs and DVNs employ gradient-based inference in a relaxed continuous space. The approximate inference networks (InfNet) method [Tu and Gimpel, 2018] employs an additional trained deep neural network to directly produce the output of the inference problem for a given cost function. A recent work [Graber et al., 2018] developed a generalization of SPENs with improved results.
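The gradient-based inference used by SPENs and DVNs can be sketched as projected gradient descent on a relaxed output (our schematic; `energy_grad` is an assumed gradient of the learned energy or negative-value function). The sensitivity to `steps` and `lr` is exactly the stability issue raised among the challenges discussed next.

```python
import numpy as np

def gradient_based_inference(x, energy_grad, d, steps=50, lr=0.1):
    """SPEN/DVN-style inference sketch: relax the discrete output to
    y in [0,1]^d, descend the learned energy E(x, y) by projected
    gradient steps, and round the result back to a discrete structure."""
    y = np.full(d, 0.5)                   # relaxed continuous output
    for _ in range(steps):
        y = y - lr * energy_grad(x, y)    # gradient of E(x, y) w.r.t. y
        y = np.clip(y, 0.0, 1.0)          # project back onto [0,1]^d
    return (y > 0.5).astype(int)          # round; output constraints NOT enforced
```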
Outstanding challenges. Some important challenges for methods that integrate deep models with structured prediction are as follows. a) Incorporating constraints: Many SP tasks require the structured outputs to satisfy some constraints (e.g., valid trees in parsing). Incorporating these constraints is a major challenge for methods that perform gradient-based inference and learning [Lee et al., 2017]. These constraints can be classified into three main categories: relational, logical, and scientific. Relational constraints enforce simple relations among entities, which can be specified manually or mined from large amounts of unstructured text available on the web. Logical constraints occur in domains where structured output variables are related by logical propositions [Xu et al., 2017]. Little attention has been paid to scientific constraints, which require the predicted outputs to satisfy the true dynamics of our world based on physics [Stewart and Ermon, 2017]. These constraints can also be thought of as prior knowledge for DL models and can improve their sample-efficiency. b) Stability and robustness: The training of DL models for SP is prone to instability, as discussed for earlier versions of SPENs. The training of SPENs was improved [Belanger et al., 2017], but this approach is prone to over-fitting. In our own experience, the performance of DVNs is very sensitive to the parameters of gradient-based inference. The deep SP with non-linear output transformations approach [Graber et al., 2018] doesn't support variable-length structured outputs yet.

6 Future Directions

In this section, we briefly discuss three important future directions in structured prediction research.

Structured learning with weak supervision. Since acquiring a large amount of labeled data is resource-expensive for many real-world SP tasks, there is a great need to study SP algorithms with indirect and weak supervision [Chang et al., 2010; Roth, 2017]. Prior work has studied algorithms for leveraging some forms of weak supervision. Posterior Regularization (PR) [Ganchev et al., 2010] is a general framework to leverage side-information (e.g., constraints) and has been shown to be successful in multiple domains. The key idea behind this approach is to restrict the posterior distribution to a space of distributions which satisfies the constraints in expectation. The PR framework also provides a unifying perspective and shows the connections to closely related approaches like constraint-driven learning, generalized expectation criteria, and Bayesian measurements. However, the most general paradigm to deal with the difficulty of obtaining labeled structures for training is to consider a simple derivative of the structured representation, one that is easy to supervise, and that depends on the quality of the structured representation. For example, a derivative for the aforementioned POS tagging task could be the question of whether a given sequence of words has a corresponding sequence of POS tags that is "legitimate", which is easy to supervise. A derivative for a semantic parsing problem could be the correctness of the response from a database, where the semantic parse is used to convert a natural language query into a database query [Clarke et al., 2010]. Many open problems exist within this formulation, and addressing them is likely to facilitate more scalable and realistic structured prediction.

Active learning for structured prediction. Active learning allows the learner to ask labeling queries to improve the sample-efficiency of learning. Prior work has studied active learning algorithms for structured prediction [Culotta and McCallum, 2005; Roth and Small, 2006; Tosh and Dasgupta, 2018]. These approaches differ in the SP framework (e.g., CRFs, structured SVM); the query form (e.g., labeling complete structured outputs or a single output variable); and the query selection strategy (e.g., uncertainty reduction, margin-based). Recent work [Ning et al., 2019] has shown that querying labels for partial structures and learning from them allows us to use the available query budget better. Co-active learning [Shivaswamy and Joachims, 2012] allows the human to observe the predicted output and provide a slightly corrected output (better than the prediction but not optimal) as feedback to improve the model. Future work should explore alternate query forms that are easy for humans to answer and still provide a useful training signal for SP models, along with good human-computer interfaces to improve the usability of interactive learning systems.

Multi-task structured prediction. Prior work on structured prediction mostly considers single tasks. However, interpreting unstructured data (e.g., text, images, videos, speech, and sensor data) requires solving multiple SP tasks jointly. For example, entity analysis in natural language processing involves solving multiple structured prediction problems such as mention detection, coreference resolution, and entity linking. The most common approach is to employ a "pipeline" architecture, where the different tasks are solved one after another in sequence. While it has the advantages of simplicity and scalability, the pipeline architecture is too sensitive to the task order and is prone to error propagation. There is some recent work on exploring scalable joint modeling approaches to solve multi-task SP problems [Durrett and Klein, 2014; Ma et al., 2017; Sanh et al., 2018], but this is a very rich and fertile problem space with many challenges.

Acknowledgements

This work was supported in part by NSF grants #1910213 and #1845922, and in part by Contracts W911NF15-1-0461 and HR0011-17-S-0048 with the US Defense Advanced Research Projects Agency (DARPA). The views expressed are those of the authors and do not reflect the official policy or position of the Department of Defense or the U.S. Government.

References

[Belanger et al., 2017] David Belanger, Bishan Yang, and Andrew McCallum. End-to-End Learning for Structured Prediction Energy Networks. In ICML, pages 429–439, 2017.
[Bengio et al., 2015] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks. In NIPS, 2015.
[Beygelzimer et al., 2016] Alina Beygelzimer, Hal Daumé III, John Langford, and Paul Mineiro. Learning Reductions That Really Work. Proceedings of the IEEE, 104(1), 2016.
[Chang et al., 2010] M. Chang, V. Srikumar, D. Goldwasser, and D. Roth. Structured Output Learning with Indirect Supervision. In ICML, pages 199–206, 2010.
[Chang et al., 2012] Ming-Wei Chang, Lev-Arie Ratinov, and Dan Roth. Structured Learning with Constrained Conditional Models. MLJ, 88(3), 2012.
[Chang et al., 2015a] Kai-Wei Chang, Akshay Krishnamurthy, Alekh Agarwal, Hal Daumé III, and John Langford. Learning to Search Better than Your Teacher. In ICML, 2015.
[Chang et al., 2015b] Kai-Wei Chang, Shyam Upadhyay, Gourab Kundu, and Dan Roth. Structural Learning with Amortized Inference. In AAAI, 2015.
[Chen et al., 2015] Liang-Chieh Chen, Alexander G. Schwing, Alan L. Yuille, and Raquel Urtasun. Learning Deep Structured Models. In ICML, pages 1785–1794, 2015.
[Clarke et al., 2010] J. Clarke, D. Goldwasser, M. Chang, and D. Roth. Driving Semantic Parsing from the World's Response. In CoNLL, pages 18–27, 2010.
[Collins and Roark, 2004] Michael Collins and Brian Roark. Incremental Parsing with the Perceptron Algorithm. In ACL, 2004.
[Collins, 2002] Michael Collins. Ranking Algorithms for Named Entity Extraction: Boosting and the Voted Perceptron. In ACL, 2002.
[Cortes et al., 2016] Corinna Cortes, Vitaly Kuznetsov, Mehryar Mohri, and Scott Yang. Structured Prediction Theory Based on Factor Graph Complexity. In NIPS, pages 2514–2522, 2016.
[Culotta and McCallum, 2005] Aron Culotta and Andrew McCallum. Reducing Labeling Effort for Structured Prediction Tasks. In AAAI, 2005.
[Daumé III and Marcu, 2005] Hal Daumé III and Daniel Marcu. Learning as Search Optimization: Approximate Large Margin Methods for Structured Prediction. In ICML, 2005.
[Daumé III et al., 2009] Hal Daumé III, John Langford, and Daniel Marcu. Search-based Structured Prediction. MLJ, 75(3), 2009.
[Doppa et al., 2014a] Janardhan Rao Doppa, Alan Fern, and Prasad Tadepalli. HC-Search: A Learning Framework for Search-based Structured Prediction. JAIR, 50:369–407, 2014.
[Doppa et al., 2014b] Janardhan Rao Doppa, Alan Fern, and Prasad Tadepalli. Structured Prediction via Output Space Search. JMLR, 15(1):1317–1350, 2014.
[Doppa et al., 2014c] Janardhan Rao Doppa, Jun Yu, Chao Ma, Alan Fern, and Prasad Tadepalli. HC-Search for Multi-Label Prediction: An Empirical Study. In AAAI, 2014.
[Durrett and Klein, 2014] Greg Durrett and Dan Klein. A Joint Model for Entity Analysis: Coreference, Typing, and Linking. TACL, 2:477–490, 2014.
[Felzenszwalb and McAllester, 2007] Pedro F. Felzenszwalb and David A. McAllester. The Generalized A* Architecture. JAIR, 29:153–190, 2007.
[Fern, 2010] Alan Fern. Speedup Learning. In Encyclopedia of Machine Learning, pages 907–911. 2010.
[Finley and Joachims, 2008] Thomas Finley and Thorsten Joachims. Training Structural SVMs when Exact Inference is Intractable. In ICML, pages 304–311, 2008.
[Ganchev et al., 2010] Kuzman Ganchev, Jennifer Gillenwater, Ben Taskar, et al. Posterior Regularization for Structured Latent Variable Models. JMLR, 11(Jul):2001–2049, 2010.
[Goldberg and Elhadad, 2010] Yoav Goldberg and Michael Elhadad. An Efficient Algorithm for Easy-first Non-directional Dependency Parsing. In NAACL, pages 742–750, 2010.
[Graber et al., 2018] Colin Graber, Ofer Meshi, and Alexander Schwing. Deep Structured Prediction with Nonlinear Output Transformations. In NIPS, pages 6320–6331, 2018.
[Gygli et al., 2017] Michael Gygli, Mohammad Norouzi, and Anelia Angelova. Deep Value Networks Learn to Evaluate and Iteratively Refine Structured Outputs. In ICML, 2017.
[Hazan et al., 2016] Tamir Hazan, Alexander G. Schwing, and Raquel Urtasun. Blending Learning and Inference in Conditional Random Fields. JMLR, 17(1):8305–8329, 2016.
[Huang et al., 2012] Liang Huang, Suphan Fayong, and Yang Guo. Structured Perceptron with Inexact Search. In NAACL, 2012.
[Huang et al., 2015] Zhiheng Huang, Wei Xu, and Kai Yu. Bidirectional LSTM-CRF Models for Sequence Tagging. arXiv preprint arXiv:1508.01991, 2015.
[Kim et al., 2017] Yoon Kim, Carl Denton, Luong Hoang, and Alexander M. Rush. Structured Attention Networks. In ICLR, 2017.
[Kulesza and Pereira, 2007] Alex Kulesza and Fernando Pereira. Structured Learning with Approximate Inference. In NIPS, 2007.
[Lafferty et al., 2001] John Lafferty, Andrew McCallum, and Fernando Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In ICML, 2001.
[Lam et al., 2015] Michael Lam, Janardhan Rao Doppa, Sinisa Todorovic, and Thomas G. Dietterich. HC-Search for Structured Prediction in Computer Vision. In CVPR, 2015.
[Le et al., 2017] Hoang Minh Le, Yisong Yue, Peter Carr, and Patrick Lucey. Coordinated Multi-agent Imitation Learning. In ICML, pages 1995–2003, 2017.
[LeCun et al., 2006] Yann LeCun, Sumit Chopra, Raia Hadsell, M. Ranzato, and F. Huang. A Tutorial on Energy-based Learning. Predicting Structured Data, 2006.
[Lee et al., 2017] Jay Yoon Lee, Michael Wick, Sanket Vaibhav Mehta, Jean-Baptiste Tristan, and Jaime Carbonell. Gradient-based Inference for Networks with Output Constraints. arXiv preprint arXiv:1707.08608, 2017.
[Liang et al., 2008] Percy Liang, Hal Daumé III, and Dan Klein. Structure Compilation: Trading Structure for Features. In ICML, pages 592–599, 2008.
[Ma et al., 2017] Chao Ma, Janardhan Rao Doppa, Prasad Tadepalli, Hamed Shahbazi, and Xiaoli Z. Fern. Multi-Task Structured Prediction for Entity Analysis: Search-Based Learning Algorithms. In ACML, 2017.
[Meshi et al., 2016] Ofer Meshi, Mehrdad Mahdavi, Adrian Weller, and David Sontag. Train and Test Tightness of LP Relaxations in Structured Prediction. In ICML, 2016.
[Ning et al., 2019] Qiang Ning, Hangfeng He, Chuchu Fan, and Dan Roth. Partial or Complete, That's The Question. In NAACL, 2019.
[Ranzato et al., 2015] Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence Level Training with Recurrent Neural Networks. arXiv preprint arXiv:1511.06732, 2015.
[Ross et al., 2011] Stéphane Ross, Geoffrey J. Gordon, and Drew Bagnell. A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. In AISTATS, 2011.
[Roth and Small, 2006] Dan Roth and Kevin Small. Margin-Based Active Learning for Structured Output Spaces. In ECML, 2006.
[Roth and Yih, 2004] D. Roth and W. Yih. A Linear Programming Formulation for Global Inference in Natural Language Tasks. In CoNLL, 2004.
[Roth, 2017] Dan Roth. Incidental Supervision: Moving beyond Supervised Learning. In AAAI, 2017.
[Samdani and Roth, 2012] Rajhans Samdani and Dan Roth. Efficient Decomposed Learning for Structured Prediction. In ICML, 2012.
[Sanh et al., 2018] Victor Sanh, Thomas Wolf, and Sebastian Ruder. A Hierarchical Multi-task Approach for Learning Embeddings from Semantic Tasks. CoRR, abs/1811.06031, 2018.
[Shivaswamy and Joachims, 2012] Pannaga Shivaswamy and Thorsten Joachims. Online Structured Prediction via Coactive Learning. In ICML, 2012.
[Sontag et al., 2010] David Sontag, Ofer Meshi, Tommi S. Jaakkola, and Amir Globerson. More Data Means Less Inference: A Pseudo-max Approach to Structured Learning. In NIPS, pages 2181–2189, 2010.
[Srikumar et al., 2012] Vivek Srikumar, Gourab Kundu, and Dan Roth. On Amortizing Inference Cost for Structured Prediction. In EMNLP, pages 1114–1124, 2012.
[Stewart and Ermon, 2017] Russell Stewart and Stefano Ermon. Label-free Supervision of Neural Networks with Physics and Domain Knowledge. In AAAI, 2017.
[Stoyanov and Eisner, 2012] Veselin Stoyanov and Jason Eisner. Easy-first Coreference Resolution. In COLING, 2012.
[Stoyanov et al., 2011] Veselin Stoyanov, Alexander Ropson, and Jason Eisner. Empirical Risk Minimization of Parameters Given Approximate Inference, Decoding, and Model Structure. In AISTATS, 2011.
[Sutskever et al., 2014] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to Sequence Learning with Neural Networks. In NIPS, pages 3104–3112, 2014.
[Tan et al., 2018] Bowen Tan, Zhiting Hu, Zichao Yang, Ruslan Salakhutdinov, and Eric Xing. Connecting the Dots Between MLE and RL for Sequence Generation. arXiv preprint arXiv:1811.09740, 2018.
[Taskar et al., 2003] Benjamin Taskar, Carlos Guestrin, and Daphne Koller. Max-Margin Markov Networks. In NIPS, 2003.
[Tosh and Dasgupta, 2018] Christopher Tosh and Sanjoy Dasgupta. Interactive Structure Learning with Structural Query-by-Committee. In NeurIPS, pages 1121–1131, 2018.
[Tsochantaridis et al., 2004] Ioannis Tsochantaridis, Thomas Hofmann, Thorsten Joachims, and Yasemin Altun. Support Vector Machine Learning for Interdependent and Structured Output Spaces. In ICML, page 104, 2004.
[Tu and Gimpel, 2018] Lifu Tu and Kevin Gimpel. Learning Approximate Inference Networks for Structured Prediction. In ICLR, 2018.
[Venkatraman et al., 2015] Arun Venkatraman, Martial Hebert, and J. Andrew Bagnell. Improving Multi-Step Prediction of Learned Time Series Models. In AAAI, 2015.
[Weiss and Taskar, 2010] David Weiss and Benjamin Taskar. Structured Prediction Cascades. In AISTATS, 2010.
[Wiseman and Rush, 2016] Sam Wiseman and Alexander M. Rush. Sequence-to-Sequence Learning as Beam-Search Optimization. In EMNLP, pages 1296–1306, 2016.
[Xie et al., 2015] Jun Xie, Chao Ma, Janardhan Rao Doppa, Prashanth Mannem, Xiaoli Z. Fern, Thomas G. Dietterich, and Prasad Tadepalli. Learning Greedy Policies for the Easy-First Framework. In AAAI, 2015.
[Xu et al., 2009] Yuehua Xu, Alan Fern, and Sung Wook Yoon. Learning Linear Ranking Functions for Beam Search with Application to Planning. JMLR, 10(Jul):1571–1610, 2009.
[Xu et al., 2017] Jingyi Xu, Zilu Zhang, Tal Friedman, Yitao Liang, and Guy Van den Broeck. A Semantic Loss Function for Deep Learning with Symbolic Knowledge. arXiv preprint arXiv:1711.11157, 2017.
[Yadollahpour et al., 2013] Payman Yadollahpour, Dhruv Batra, and Gregory Shakhnarovich. Discriminative Re-ranking of Diverse Segmentations. In CVPR, pages 1923–1930, 2013.
