Predict and Constrain: Modeling Cardinality in Deep Structured Prediction
Nataly Brukhim¹  Amir Globerson¹

¹Tel Aviv University, Blavatnik School of Computer Science. Correspondence to: Nataly Brukhim <[email protected]>, Amir Globerson <[email protected]>.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Abstract

Many machine learning problems require the prediction of multi-dimensional labels. Such structured prediction models can benefit from modeling dependencies between labels. Recently, several deep learning approaches to structured prediction have been proposed. Here we focus on capturing cardinality constraints in such models, namely, constraining the number of non-zero labels that the model outputs. Such constraints have proven very useful in previous structured prediction approaches, but it is a challenge to introduce them into a deep learning framework. Here we show how to do this via a novel deep architecture. Our approach outperforms strong baselines, achieving state-of-the-art results on multi-label classification benchmarks.

1. Introduction

Deep structured prediction models have attracted considerable interest in recent years (Belanger et al., 2017; Zheng et al., 2015; Schwing & Urtasun, 2015; Chen et al., 2015; Ma & Hovy, 2016). The goal in the structured prediction setting is to predict multiple labels simultaneously while utilizing dependencies between labels to improve accuracy. Examples of such application domains include dependency parsing of text and semantic image segmentation.

Several recent approaches have proposed to combine the representational power of deep neural networks with the ability of structured prediction models to capture label dependence. This concept has been shown to be effective for various applications (Zheng et al., 2015; Schwing & Urtasun, 2015; Ma & Hovy, 2016; Belanger et al., 2017). However, there are still many important label dependency structures which are not fully captured by these approaches.

One of the most effective forms of label dependencies is cardinality (Tarlow et al., 2012; Tarlow et al., 2010; Milch et al., 2008; Swersky et al., 2012; Gupta et al., 2007), namely, the fact that the overall number of labels taking a specific value has a distribution specific to the domain. For example, in natural language processing, cardinality potentials can express a constraint on the number of occurrences of a part-of-speech (e.g., that each sentence contains at least one verb (Ganchev et al., 2010)). In computer vision, a cardinality potential can encode a prior distribution over object sizes in an image.

Although cardinality potentials have been very effective in many structured prediction works, they have not yet been successfully integrated into deep structured prediction frameworks. This is precisely the goal of our work.

The challenge in modeling cardinality in deep learning is that cardinality is essentially a combinatorial notion, and it is thus not clear how to integrate it into a differentiable model. Our proposal to achieve this goal is based on two key observations.

First, we note that learning to predict how many labels are active for a given input is easier than predicting which labels are active. Hence, we break our inference process into two complementary components: we first estimate the label cardinality for a given input using a learned neural network, and then predict a label that satisfies this constraint.

The second observation is that constraining a label to satisfy a cardinality constraint can be approximated via projected gradient descent, where the cardinality constraint corresponds to a set of linear constraints. Thus, the overall label prediction architecture performs projected gradient descent in label space. Moreover, the above projection can be implemented via sorting, and results in an end-to-end differentiable architecture which can be directly optimized to maximize any performance measure (e.g., F1, recall at K, etc.). Importantly, with this approach cardinality can be naturally integrated with other higher-order scores, such as the global scores considered in Belanger & McCallum (2016), and used in our model as well.

Our proposed method significantly improves prediction accuracy on several datasets, when compared to recent deep structured prediction methods. We also experiment with other approaches for modeling cardinality in deep structured prediction, and observe that our Predict and Constrain method outperforms these.

Taken together, our results demonstrate that deep structured prediction models can benefit from representing cardinality, and that a Predict and Constrain approach is an effective method for introducing cardinality in a differentiable end-to-end manner.
2. Preliminaries

We consider the setting of assigning L labels y = (y_1, …, y_L) to an input x ∈ X. We assume the y_i are binary labels; however, our approach can be generalized to the multi-class case. We denote [L] = {1, …, L}.

To model our problem in a structured prediction framework, we consider a score function s(x, y). The score is meant to reflect how "compatible" a label y is with an input x. Thus, for a given input x, the predicted label is:

    \hat{y} = \arg\max_y \, s(x, y)    (1)

In practice, we will maximize over y by relaxing the integrality constraint and using projected gradient ascent.

We assume that s(x, y) can be decomposed as follows:

    s(x, y) = \sum_i s_i(x, y_i) + s_g(y) + s_{z(x)}(y)    (2)

where s_i is a function that depends on a single output variable (i.e., a unary potential), s_g is an arbitrary learned global potential defined over all variables and independent of the input, and s_{z(x)} is a cardinality potential which constrains y to be of cardinality z(x), as explained in Section 3. For brevity, we denote the cardinality potential as s_z.

Concretely, s_i(x, y_i) is defined as the multiplication of y_i by a linear function of a feature representation of the input, where the feature representation is given by a multi-layer perceptron f(x). The global score s_g is obtained by applying a neural network over the variables, and evaluates such assignments independently of x, similarly to the architecture used by Belanger & McCallum (2016).

The components of the score function s(x, y) are parameterized by a set of weights w. As in Belanger & McCallum (2016), we learn these by end-to-end minimization of a loss function on the training data. In what follows we describe our method in more detail.
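To make the decomposition in Equation (2) concrete, the sketch below shows one way the unary and global terms could be parameterized in PyTorch. This is a minimal illustration under our own assumptions: the class name ScoreNet, the hidden width, and the activation choices are placeholders rather than the architecture used in the paper, and the cardinality term s_z is deliberately omitted here because it is enforced by projection (Section 3).

```python
import torch
import torch.nn as nn

class ScoreNet(nn.Module):
    """Sketch of s(x, y) = sum_i s_i(x, y_i) + s_g(y).
    The cardinality potential s_z is handled separately, via projection."""

    def __init__(self, x_dim, num_labels, hidden=128):
        super().__init__()
        # f(x): multi-layer perceptron producing a feature representation of the input.
        self.f = nn.Sequential(
            nn.Linear(x_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Linear map from features to one unary coefficient per label.
        self.unary = nn.Linear(hidden, num_labels)
        # Global potential s_g: a small network over the (relaxed) label vector only.
        self.global_net = nn.Sequential(
            nn.Linear(num_labels, hidden), nn.Softplus(),
            nn.Linear(hidden, 1),
        )

    def unary_scores(self, x):
        # s_i(x, y_i) = y_i * unary_scores(x)_i
        return self.unary(self.f(x))

    def forward(self, x, y):
        # y is the relaxed label vector with entries in [0, 1].
        s_unary = (self.unary_scores(x) * y).sum(dim=-1)
        s_global = self.global_net(y).squeeze(-1)
        return s_unary + s_global
```

The unary term multiplies each y_i by a linear function of the features f(x), and the global term scores the label vector independently of the input, mirroring the decomposition above.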
3. Learning with Cardinality Potentials

We next explain how cardinality is modeled in our score function, and how the score is maximized. Our cardinality score has two key components. First, a learned cardinality predictor z = h(x) that maps an input to an estimate of the cardinality of y. Second, the cardinality potential s_z(y), which corresponds to the hard constraint that the cardinality of y is equal to z. Namely:

    s_z(y) = \begin{cases} 0 & \text{if } \sum_i y_i = z \\ -\infty & \text{otherwise} \end{cases}    (3)

We refer to the method of using this two-component potential as a Predict and Constrain approach.

3.1. Inference with Projected Gradient Ascent

We turn to the problem of maximizing the score function with respect to y. Exact maximization is intractable for the potentials we consider. As in Belanger & McCallum (2016), we address this by first relaxing the integrality constraint on y, and using gradient ascent for inference. However, the cardinality constraint means that we cannot use naive gradient ascent on the relaxed y. Instead, we will repeatedly project y onto the constrained space.

We first relax the discrete constraint on the binary variables y to a continuous constraint y_i ∈ [0, 1]. Next, the score s(x, y) is maximized with respect to the relaxed y using projected gradient ascent, as depicted in Figure 1. At each iteration of projected gradient ascent, the following two updates are performed. First, we apply a gradient update step over the score function, excluding the cardinality potential, with respect to y. Second, we project the result onto the cardinality-constrained space, as explained in Section 3.2.

Figure 1. Deep structured model with unrolled projected gradient ascent inference.

The above projected-gradient updates are repeated for T iterations, where the initial variables y_0 are given by a sigmoid function applied over the unary terms. Since all update operations are differentiable, this results in an end-to-end differentiable architecture. Thus, we can directly apply a loss function to the output of this network, y_T, and optimize it (see Belanger et al., 2017; Maclaurin et al., 2015).

Let ℓ(y, y*) be a label-loss function, reflecting the error of predicting a label y when the true label is y*. For example, ℓ(y, y*) might be the cross-entropy loss function, or a differentiable approximation of any performance measure of interest, such as the F1 measure we use in our experiments (see Equation (10)).

Given training data (x, y*), let y_t(x) be the value of the label y after t iterations of projected gradient ascent. We would like to minimize the final loss ℓ(y_T(x), y*). As in other end-to-end approaches (Belanger et al., 2017), we found it better to apply the loss function over all T iterations. Specifically, we minimize the following objective:

    \bar{\ell}(x, y^*) = \frac{1}{T} \sum_{t=1}^{T} \frac{\ell(y_t(x), y^*)}{T - t + 1}

By using all iterates in the objective, the model learns to converge quickly during inference.

Algorithm 1 Soft projection onto the simplex
  Input: vector v ∈ R^L, cardinality z ∈ [Z]
  Sort v into µ: µ_1 ≥ … ≥ µ_L
  Compute µ̄, the cumulative sums of µ
  Let I be the vector of indices (1, …, L)
  δ = Softsign(µ ∘ I − (µ̄ − z))
  ρ = Softmax(δ ∘ I)
  θ = (µ̄ᵀρ − z) / (Iᵀρ)
  Output: max(v − θ, 0)
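Algorithm 1 can be transcribed almost line by line into a differentiable function. The sketch below is our reading of it in PyTorch, for a single unbatched vector v and a scalar cardinality z; the function name soft_project and the tensor-handling details are our own, not from the paper. The Softsign and Softmax appear to replace the hard threshold and hard index selection of the exact sorting-based simplex projection, which is what keeps the operation differentiable.

```python
import torch
import torch.nn.functional as F

def soft_project(v, z):
    """Differentiable (soft) projection of v in R^L onto {y : sum_i y_i = z, y_i >= 0},
    following our reading of Algorithm 1. v: shape [L]; z: scalar cardinality."""
    L = v.shape[-1]
    mu, _ = torch.sort(v, descending=True)           # mu_1 >= ... >= mu_L
    mu_bar = torch.cumsum(mu, dim=-1)                # cumulative sums of mu
    idx = torch.arange(1, L + 1, dtype=v.dtype, device=v.device)   # the index vector I
    delta = F.softsign(mu * idx - (mu_bar - z))      # soft test of j*mu_j - (mu_bar_j - z) > 0
    rho = torch.softmax(delta * idx, dim=-1)         # soft selection of the largest admissible index
    theta = ((mu_bar * rho).sum() - z) / (idx * rho).sum()          # theta = (mu_bar^T rho - z) / (I^T rho)
    return torch.clamp(v - theta, min=0.0)           # max(v - theta, 0)
```

For completeness, here is how the unrolled inference of Section 3.1 might call this projection. The helper name constrained_inference, the iteration count T, and the step size are illustrative placeholders; it assumes a score module like the ScoreNet sketch above, with parameters that require gradients so the loop stays end-to-end differentiable, and a cardinality z produced by the learned predictor h(x).

```python
def constrained_inference(score_net, x, z, T=10, step_size=0.1):
    """Unrolled projected gradient ascent on the relaxed labels (illustrative, unbatched)."""
    y = torch.sigmoid(score_net.unary_scores(x))     # y_0: sigmoid of the unary terms
    iterates = []
    for _ in range(T):
        s = score_net(x, y)                          # score excluding the cardinality term
        grad = torch.autograd.grad(s, y, create_graph=True)[0]
        y = soft_project(y + step_size * grad, z)    # ascent step, then soft projection
        iterates.append(y)
    return iterates
```

Returning all iterates mirrors the training objective above, which applies the loss to every y_t rather than only to the final output y_T.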