Learning Max-Margin Tree Predictors
Ofer Meshi†∗   Elad Eban†∗   Gal Elidan‡   Amir Globerson†
†School of Computer Science and Engineering   ‡Department of Statistics
The Hebrew University of Jerusalem, Israel
∗Authors contributed equally.

Abstract

Structured prediction is a powerful framework for coping with the joint prediction of interacting outputs. A central difficulty in using this framework is that the correct label dependence structure is often unknown. At the same time, we would like to avoid an overly complex structure that will lead to intractable prediction. In this work we address the challenge of learning tree structured predictive models that achieve high accuracy while at the same time facilitating efficient (linear time) inference. We start by proving that this task is in general NP-hard, and then suggest an approximate alternative. Our CRANK approach relies on a novel Circuit-RANK regularizer that penalizes non-tree structures and can be optimized using a convex-concave procedure. We demonstrate the effectiveness of our approach on several domains and show that its accuracy matches that of fully connected models, while performing prediction substantially faster.

1 Introduction

Numerous applications involve the joint prediction of complex outputs. For example, in document classification the goal is to assign the most relevant (possibly multiple) topics to each document; in gene annotation, we would like to assign each gene a set of relevant functional tags out of a large set of possible cellular functions; in medical diagnosis, we would like to identify all the diseases a given patient suffers from. Although the output space in such problems is typically very large, it often has intrinsic structure which can be exploited to construct efficient predictors. Indeed, in recent years the use of structured output prediction has resulted in state-of-the-art results in many real-world problems from computer vision, natural language processing, computational biology, and other fields [Bakir et al., 2007]. Such predictors can be learned from data using formulations such as Max-Margin Markov Networks (M³N) [Taskar et al., 2003, Tsochantaridis et al., 2006] or conditional random fields (CRFs) [Lafferty et al., 2001].

While the prediction and learning tasks are generally computationally intractable [Shimony, 1994, Sontag et al., 2010], for some models they can be carried out efficiently. For example, when the model consists of pairwise dependencies between output variables, and these form a tree structure, prediction can be computed efficiently using dynamic programming, at a cost linear in the number of output variables [Pearl, 1988]. Moreover, despite their simplicity, tree structured models are often sufficiently expressive to yield highly accurate predictors. Accordingly, much of the research on structured prediction has focused on this setting [e.g., Lafferty et al., 2001, Collins, 2002, Taskar et al., 2003, Tsochantaridis et al., 2006].
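To make the dynamic program concrete, here is a minimal sketch of max-product message passing on a rooted tree (our illustration, not code from the paper; the function name, the dict-based tree encoding, and the score-table layout are all our own choices):

```python
import numpy as np

def tree_map(node_scores, edge_scores, children):
    """MAP labeling of a pairwise model whose edges form a tree, via
    max-product dynamic programming (cost linear in the number of
    output variables).  node_scores[i] is a length-K score vector for
    node i; edge_scores[c][y_parent, y_child] is the score table
    between node c and its parent; children maps each node to its
    children in a tree rooted at node 0."""
    msg, best_state = {}, {}

    def upward(i):
        # Total score of the subtree rooted at i, as a function of y_i.
        score = node_scores[i].copy()
        for c in children.get(i, []):
            score += upward(c)
        if i == 0:
            return score
        table = edge_scores[i] + score[None, :]   # axes: [y_parent, y_i]
        best_state[i] = table.argmax(axis=1)      # best y_i per parent state
        msg[i] = table.max(axis=1)
        return msg[i]

    root_score = upward(0)
    y = {0: int(root_score.argmax())}

    def downward(i):
        for c in children.get(i, []):
            y[c] = int(best_state[c][y[i]])
            downward(c)

    downward(0)
    return y, float(root_score.max())

# Example: chain 0 - 1 - 2, binary labels, agreement potentials.
K = 2
node = {i: np.zeros(K) for i in range(3)}
node[0][1] = 0.5                        # bias node 0 toward state 1
edge = {1: np.eye(K), 2: np.eye(K)}     # reward y_parent == y_child
print(tree_map(node, edge, {0: [1], 1: [2]}))   # ({0: 1, 1: 1, 2: 1}, 2.5)
```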
Given the above success of tree structured models, it is unfortunate that in many scenarios, such as a document classification task, there is no obvious way in which to choose the most beneficial tree. A natural question is thus how to find the tree model that best fits a given structured prediction problem. This is precisely the problem we address in the current paper. Specifically, we ask what tree structure is optimal in terms of a max-margin objective [Taskar et al., 2003]. Somewhat surprisingly, this optimal tree problem has received very little attention in the context of discriminative structured prediction (the most relevant work is Bradley and Guestrin [2010], which we address in Section 6).

Our contributions are as follows. We begin by proving that it is NP-hard in general to find the optimal max-margin predictive tree, in marked contrast to the generative case, where the optimal tree can be learned efficiently [Chow and Liu, 1968]. To cope with this theoretical barrier, we propose an approximation scheme that uses regularization to penalize non-tree models. Concretely, we propose a regularizer based on the circuit rank of a graph [Berge, 1962], namely the minimal number of edges that must be removed from the graph in order to obtain a tree. Minimization of the resulting objective is still difficult, and we further approximate it using a difference of continuous convex envelopes. The resulting objective can then be readily optimized using the convex-concave procedure [Yuille and Rangarajan, 2003].

We apply our method to synthetic and varied real-world structured output prediction tasks. First, we show that the learned tree model is competitive with a fully connected max-margin model that is substantially more computationally demanding at prediction time. Second, we show that our approach is superior to several baseline alternatives (e.g., greedy structure learning) in terms of both generalization performance and running time.

2 The Max-margin Tree

Let x be an input vector (e.g., a document) and y a discrete output vector (e.g., the topics assigned to the document, where y_i = 1 when topic i is addressed in x). As in most structured prediction approaches, we assume that inputs are mapped to outputs according to a linear discrimination rule: y(x; w) = argmax_{y'} w^⊤φ(x, y'), where φ(x, y) is a function that maps input-output pairs to a feature vector and w is the corresponding weight vector. We call w^⊤φ(x, y') the score assigned to the prediction y' given an input x.

Assume we have a set of M labeled pairs {(x^m, y^m)}_{m=1}^M from which we would like to learn w. In the M³N formulation proposed by Taskar et al. [2003], w is learned by minimizing the following (regularized) structured hinge loss:

\ell(w) = \frac{\lambda}{2} \|w\|^2 + \frac{1}{M} \sum_m h^m(w),

where

h^m(w) = \max_y \left[ w^\top \phi(x^m, y) + \Delta(y, y^m) \right] - w^\top \phi(x^m, y^m),   (1)

and Δ(y, y^m) is a label-loss function measuring the cost of predicting y when the true label is y^m (e.g., the 0/1 or Hamming distance). Thus, the learning problem involves a loss-augmented prediction problem for each training example.
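As a concrete illustration of Eq. (1), the following sketch evaluates the objective with the inner loss-augmented maximization done by brute force over all labelings (our illustration only; phi, delta, and all names are our own, and with a tree structured model the maximization would instead be carried out by a dynamic program like the one above):

```python
import itertools
import numpy as np

def structured_hinge_loss(w, phi, delta, data, n_vars, n_states, lam):
    """Regularized structured hinge loss, Eq. (1).  data is a list of
    (x, y_true) pairs; phi(x, y) returns a feature vector and delta a
    label loss.  The max over y is exhaustive, for illustration only."""
    total = 0.0
    for x, y_true in data:
        h_m = max(w @ phi(x, y) + delta(y, y_true)
                  for y in itertools.product(range(n_states), repeat=n_vars))
        total += h_m - w @ phi(x, y_true)
    return 0.5 * lam * (w @ w) + total / len(data)

def hamming(y, y_true):
    """A common choice for the label loss Delta: the Hamming distance."""
    return sum(int(a != b) for a, b in zip(y, y_true))
```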
Since the space of possible outputs may be quite large, the maximization over y can be computationally intractable. It is therefore useful to consider score functions that decompose into simpler ones. One commonly used decomposition consists of scores over single variables and pairs of variables that correspond to the nodes and edges of a graph G, respectively:

w^\top \phi(x, y) = \sum_{ij \in E(G)} w_{ij}^\top \phi_{ij}(x, y_i, y_j) + \sum_{i \in V(G)} w_i^\top \phi_i(x, y_i).   (2)

Importantly, when the graph G has a tree structure, the maximization over y can be solved exactly and efficiently using dynamic programming algorithms (e.g., Belief Propagation [Pearl, 1988]).

As mentioned above, we consider problems where there is no natural way to choose a particular tree structure, and our goal is to learn the optimal tree from training data. We next formalize this objective. In a tree structured model, the set of edges ij in Eq. (2) forms a tree. This is equivalent to requiring that the vectors w_ij in Eq. (2) be non-zero only on the edges of some tree. To make this precise, we first define, for a given spanning tree T, the set W_T of weight vectors that "agree" with T:¹

W_T = \{ w : ij \notin T \implies w_{ij} = 0 \}.   (3)

Next we consider the set W_∪ of weight vectors that agree with some spanning tree. Denoting the set of all spanning trees by \mathcal{T}, we have W_∪ = \bigcup_{T \in \mathcal{T}} W_T. The problem of finding the optimal max-margin tree predictor is therefore:

\min_{w \in W_\cup} \ell(w).   (4)

We denote this as the MTreeN problem. In what follows, we first show that this problem is NP-hard, and then present an approximation scheme.

¹Note that weights corresponding to single node features are not restricted.

3 Learning M³N Trees is NP-hard

We start by showing that learning the optimal tree in the discriminative max-margin setting is NP-hard. As noted, this is somewhat of a surprise given that the best tree is easily learned in the generative setting [Chow and Liu, 1968], and that tree structured models are often used precisely for their computational advantages.

In particular, we consider the problem of deciding whether there exists a tree structured model that correctly labels a given dataset (i.e., deciding whether the dataset is separable with a tree model). Formally, we define the MTreeN decision problem as determining whether the following set is empty:

\{ w \in W_\cup : w^\top \phi(x^m, y^m) \ge w^\top \phi(x^m, y) + \Delta(y, y^m) \;\; \forall m, y \}.   (5)

To facilitate the identifiability of the model parameters, which is needed later, we adopt the formalism of Sontag et al. [2010] and define the score:

S(y; x, T, w) = \sum_{ij \in T} w_{ij}^\top \phi_{ij}(x, y_i, y_j) + \sum_i \left( w_i^\top \phi_i(x, y_i) + x_i(y_i) \right)
             \equiv \sum_{ij \in T} \theta_{ij}(y_i, y_j) + \sum_i \theta_i(y_i) + \sum_i x_i(y_i),   (6)

where x_i(y_i) is a bias term which does not depend on w,² and for notational convenience we have dropped the dependence of θ on x and w.

The hardness reduction is built around a graph G over n nodes and a corresponding set of model parameters. For clarity of exposition, we defer the proof that these parameters are identifiable using a polynomial number of training examples to App. A. The singleton parameters are

\theta_i(y_i) = \begin{cases} D & i = 1,\; y_1 = 0 \\ 0 & \text{otherwise}, \end{cases}   (8)

and the pairwise parameters for ij \in E(G) are:

\theta_{ij}(y_i, y_j) = \begin{cases} -n^2 & y_i \ne y_j \\ 0 & y_i = y_j = 0 \\ 1 & y_i = y_j = i \\ 1 & y_i = y_j = j \\ 0 & \text{otherwise}. \end{cases}   (9)
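To make the construction concrete, the sketch below instantiates the tables of Eqs. (8) and (9) for a given graph. It reflects our reading of the construction, not code from the paper: we assume nodes are numbered 1, ..., n and that each variable y_i ranges over the n + 1 states {0, 1, ..., n}, as the conditions y_i = y_j = i and y_i = y_j = j suggest.

```python
import numpy as np

def reduction_parameters(n, graph_edges, D):
    """Score tables of Eqs. (8) and (9).  Nodes are 1..n and every
    variable takes values in {0, 1, ..., n} (our assumption)."""
    K = n + 1
    theta_i = np.zeros((n + 1, K))            # theta_i[i, y_i]; row 0 unused
    theta_i[1, 0] = D                         # Eq. (8): D iff i = 1 and y_1 = 0
    theta_ij = {}
    for (i, j) in graph_edges:
        t = np.full((K, K), -float(n ** 2))   # Eq. (9): -n^2 when y_i != y_j
        np.fill_diagonal(t, 0.0)              # agreement scores 0 by default...
        t[i, i] = 1.0                         # ...except y_i = y_j = i
        t[j, j] = 1.0                         # ...and y_i = y_j = j
        theta_ij[(i, j)] = t
    return theta_i, theta_ij
```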