Convex Envelopes of Complexity Controlling Penalties: the Case Against Premature Envelopment
Vladimir Jojic, Suchi Saria, Daphne Koller
Stanford University

Appearing in Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS) 2011, Fort Lauderdale, FL, USA. Volume 15 of JMLR: W&CP 15. Copyright 2011 by the authors.

Abstract

Convex envelopes of the cardinality and rank functions, the $\ell_1$ and nuclear norms, have gained immense popularity due to their sparsity inducing properties. This has given rise to a natural approach to building objectives with sparse optima, whereby such convex penalties are added to another objective. Such a heuristic approach to objective building does not always work. For example, addition of an $\ell_1$ penalty to the KL-divergence fails to induce any sparsity, as the $\ell_1$ norm of any vector in a simplex is a constant. However, a convex envelope of KL and a cardinality penalty can be obtained that indeed trades off sparsity and KL-divergence. We consider the cases of two composite penalties, elastic net and fused lasso, which combine multiple desiderata. In both of these cases, we show that a hard objective relaxed to obtain penalties can be more tightly approximated. Further, by construction, it is impossible to get a better convex approximation than the ones we derive. Thus, constructing a joint envelope across different parts of the objective provides a means to trade off tightness and computational cost.

1 Introduction

Compact summarization of data succinctly describes many tasks shared across areas such as statistics, machine learning, information theory, and computational biology. The quality of a model is measured as a trade-off between reconstruction error and the complexity of the model's parametrization. A number of costs exist that capture this trade-off: MDL (Barron et al., 1998), BIC (Schwarz, 1978), AIC (Akaike, 1973), Bayes factors (Kass and Raftery, 1995) and so on. These costs are discontinuous and in some cases even their evaluation (e.g., Bayes factors) can be challenging.

Recent work on discovering compact representations of data has focused on combinations of convex losses, mostly stemming from generalized linear models, and convex sparsity promoting penalties. The uniqueness of optima in these costs combined with parameter sparsity makes them quite desirable. The joint simplicity of generalized linear losses and sparsity of parameters makes the models easily interpretable, a critical feature when the results of such fits need to be communicated across fields.

The sparsity inducing penalties have found applications across a variety of disciplines: $\ell_1$, most commonly under the guise of lasso (Tibshirani, 1996) and elastic net (Zou and Hastie, 2005); the nuclear norm within a variety of applications such as compressed sensing (Candes et al., 2006); and recommender matrix reconstruction (Candes and Recht, 2009).

However, the crucial observation that $\ell_1$ is the tightest convex relaxation of cardinality, and the nuclear norm of rank (Fazel, 2002), has not been leveraged. At the same time, the family of lasso penalties continues to grow via juxtaposition. The constituent weights of these fragmented penalties are adjusted via cross validation, posing serious computational problems and at the same time eroding interpretability. The goal of this paper is to illustrate how the juxtaposition of convex penalties corresponds to piecemeal relaxation of a hard penalty, and to show that the provably tightest convex approximation can be constructed for hard penalties. We also show that in some fairly common cases, elastic net and fused lasso, tighter convex relaxations exist. In other cases, we show a dramatic failure of the juxtaposition approach in inducing any change to the objective: the sparsifying convex penalty fails to sparsify.
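To make the simplex observation concrete, the following short numerical sketch (not from the paper; a minimal illustration, with a made-up target distribution q and penalty weight lam) checks that the $\ell_1$ norm of any point in the probability simplex equals one, so adding $\lambda\|x\|_1$ to a KL objective over the simplex shifts every candidate by the same constant and cannot change the minimizer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical target distribution q on a 5-dimensional simplex.
q = np.array([0.4, 0.3, 0.15, 0.1, 0.05])
lam = 0.7  # penalty weight, arbitrary choice for illustration

def kl(p, q):
    """KL(p || q), treating 0 log 0 as 0."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

# Sample random points on the simplex and compare the two objectives.
for _ in range(3):
    p = rng.dirichlet(np.ones(5))
    l1 = np.abs(p).sum()             # always exactly 1 on the simplex
    plain = kl(p, q)
    penalized = plain + lam * l1     # = plain + lam, for every feasible p
    print(f"||p||_1 = {l1:.6f}, KL = {plain:.4f}, KL + lam*||p||_1 = {penalized:.4f}")
```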
We show how an envelope can be computed and envelope-penalized losses can be optimized. We illustrate how these tasks can be accomplished both when an envelope is available in closed form and in cases where a closed form is not available.

In contrast to recent work on relationships between submodular set functions, convex envelopes of these functions, and sparsity inducing norms (Bach, 2010), we focus on envelopes of composite penalties involving cardinality. The key insight is that a joint envelope over multiple hard penalties is tighter than envelopment of each penalty in isolation.

The structure of the paper is as follows: first we state results on Fenchel duality and envelopes and show the main steps in deriving a convex envelope of cardinality. We then give two simple optimization algorithms for cases where the closed form of the envelope is not available. The first algorithm numerically evaluates an envelope. The second algorithm provides a general blueprint for optimizing envelope-penalized losses. We proceed to illustrate how simple juxtaposition of penalties, KL-divergence and $\ell_1$ in this case, can fail to produce any effect, but also show the existence of a sparsified KL-divergence. We then focus on the elastic net penalty, composed of $\ell_2$ and $\ell_1$. For this penalty, a relaxation of cardinality and $\ell_2$, we show the tightest convex relaxation and compare the performance of these two relaxations in the task of support recovery. Given a closed-form solution of the envelope, we show a simple coordinate descent method for a loss penalized by this envelope. Finally, we look at another example of a composite penalty, fused lasso, show that it also has a corresponding hard penalty, related to its degrees of freedom, and use its envelope to illustrate its benefits in the context of a biological application of copy number variation analysis.

[Figure 1: plot of card(x) and the $\ell_1$ norm over $[-2, 2]$. Caption: $\ell_1$ is an envelope of card(·) only in the $[-1, 1]$ interval.]

2 Convex envelope

Following Hiriart-Urruty and Lemaréchal (1993), we will assume that the functions that we aim to envelope are not identically $+\infty$ and can be minorized by an affine function on a set of interest, for example the box $[-1, 1]^n$. Here we restate the definition of a conjugate and the theorem about the tightness of the envelope.

Definition The conjugate of a function $f(x \in \mathcal{X})$ is the function
$$f^*(y) = \sup_{x \in \mathcal{X}} \langle y, x \rangle - f(x). \quad (1)$$

Theorem 2.1 For a function $f(x \in \mathcal{X})$ the envelope (biconjugate)
$$f^{**}(z) = (f^*)^*(z) = \sup_{y} \langle y, z \rangle - f^*(y) \quad (2)$$
is the pointwise supremum of all the affine functions on $\mathbb{R}^n$ majorized by $f$.

Hence the envelope of $f$ is the tightest convex under-approximation of the function $f$.

2.1 An example of convex envelope derivation: Cardinality to $\ell_1$

We show the computation of the convex envelope for $\mathrm{card}(\cdot)$ as a prototype for such derivations (Boyd and Vandenberghe, 2004). The penalty is given as
$$f_\lambda(x \in [-1, 1]^n) = \lambda\,\mathrm{card}(x)$$
and its conjugate is, by definition,
$$f^*(y \in \mathbb{R}^n) = \sup_{x \in [-1,1]^n} \langle x, y \rangle - \lambda\,\mathrm{card}(x).$$
We will assume an ordering $\sigma$ of the coordinates of $y$ such that $y_{\sigma(i-1)}^2 \geq y_{\sigma(i)}^2$ and introduce an auxiliary variable $r$:
$$f^*(y) = \max_{r} \sup_{x \in [-1,1]^r} \langle x, y_{\sigma(1:r)} \rangle - \lambda r.$$
Once $r$ is fixed, the supremum is achieved when the inner product is maximized subject to the constraint $x \in [-1, 1]^n$, that is, $x_i = \mathrm{sgn}(y_i)$; hence
$$f^*(y) = \max_{r} \sum_{i=1}^{r} \big( |y_{\sigma(i)}| - \lambda \big) = \sum_{i=1}^{n} \big( |y_i| - \lambda \big)_+.$$

The biconjugate is
$$f^{**}(z \in [-1,1]^n) = \sup_{y \in \mathbb{R}^n} \langle z, y \rangle - \sum_{i=1}^{n} \big( |y_i| - \lambda \big)_+.$$
Since the above supremum is separable in coordinates, we can reason about each coordinate in isolation and, for each, consider the cases in which the positive part is zero or positive. Elementary reasoning produces the closed-form solution of the supremum,
$$\sum_{i=1}^{n} \lambda |z_i|.$$
Hence the biconjugate of $\lambda\,\mathrm{card}(\cdot)$ on $[-1,1]^n$ is $\lambda \|\cdot\|_1$. Figure 1 shows both $f$ and its envelope $f^{**}$. Note that the envelope is only valid on the box $[-C, C]^n$. WLOG we will assume that $C = 1$ in the rest of the paper. As the box boundaries grow, the envelope $f^{**}$ can be shown to tend to a constant; however, a change in the box boundaries corresponds to a scaling of the penalty parameter $\lambda$.
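As a sanity check on this derivation (not part of the paper; a small numerical sketch with function names of our choosing), the conjugate $\sum_i (|y_i| - \lambda)_+$ can be substituted into the definition of the biconjugate and the coordinate-wise supremum approximated on a grid; for points inside the box the result matches $\lambda\|z\|_1$.

```python
import numpy as np

lam = 0.5

def conjugate(y, lam):
    """Conjugate of lam*card(.) on the box [-1,1]^n: sum_i (|y_i| - lam)_+."""
    return np.maximum(np.abs(y) - lam, 0.0).sum()

def biconjugate_numeric(z, lam, y_max=5.0, grid=2001):
    """Approximate f**(z) = sup_y <z, y> - f*(y) on a 1-D grid per coordinate.

    Both <z, y> and the conjugate are sums over coordinates, so the supremum
    separates and each coordinate can be maximized independently.
    """
    ys = np.linspace(-y_max, y_max, grid)
    per_coord = [np.max(zi * ys - np.maximum(np.abs(ys) - lam, 0.0)) for zi in z]
    return float(np.sum(per_coord))

rng = np.random.default_rng(1)
z = rng.uniform(-1.0, 1.0, size=4)   # a point inside the box [-1,1]^n
print("numeric f**(z):", biconjugate_numeric(z, lam))
print("lam * ||z||_1 :", lam * np.abs(z).sum())
```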
3 Evaluating the envelope and optimizing envelope-penalized losses

In this section we show numerical methods for two common tasks: pointwise evaluation of the envelope and optimization of envelope-penalized losses. Pointwise evaluation of the envelope is useful for exploring the behavior of the envelope, such as the shape of its level sets. However, for the second task of minimizing penalized losses, explicit pointwise evaluation of the envelope can be bypassed in the interest of efficient optimization.

3.1 Pointwise evaluation of the envelope

Algorithm 1 A subgradient method for computing the envelope $f^{**}(z)$
  Input: $z$
  Output: $f^{**}(z)$
  for $k = 1$ to MAXITER do
    $y^{k+1} = y^k + (1/k)\,\nabla h(y^k, z)$
  end for
  return $\langle y^{\mathrm{MAXITER}+1}, z \rangle - f^*(y^{\mathrm{MAXITER}+1})$

We focus on envelopes of functions that do not have a closed form solution but turn out to be numeri- [...] $f^*(y) = \sup_{\beta} \langle \beta, y \rangle - \lambda f(\beta)$ in this case is just as hard as the original problem, subset selection. This is not surprising, as the tightness of the convex envelope implies that it touches the global minimum of the function. Hence, envelopment does not erase the hardness of the original problem.

3.2 The variational inequality approach to minimizing envelope-penalized losses

Minimization of an envelope-penalized loss
$$\mathrm{PLoss}(z) = \mathrm{Loss}(z) + f^{**}(z) \quad (5)$$
seems to require pointwise evaluation of the envelope. When a closed-form envelope is available, this does not pose a problem.
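Returning to Algorithm 1 above, it can be made concrete for the cardinality example of Section 2.1, where the conjugate has the closed form $\sum_i (|y_i| - \lambda)_+$. The sketch below is an illustrative Python rendering under our own assumptions (the function names and the MAXITER value are ours, and we take $h(y, z)$ to be the concave objective $\langle y, z \rangle - f^*(y)$ maximized in (2), using a coordinate-wise supergradient); for a point $z$ inside the box it should approach $\lambda\|z\|_1$.

```python
import numpy as np

lam = 0.5
MAXITER = 5000

def conj_card(y, lam):
    """Closed-form conjugate of lam*card(.) on [-1,1]^n: sum_i (|y_i| - lam)_+."""
    return np.maximum(np.abs(y) - lam, 0.0).sum()

def supergrad_h(y, z, lam):
    """A supergradient of h(y, z) = <y, z> - f*(y) with respect to y.

    The subdifferential of (|y_i| - lam)_+ contains sgn(y_i) when |y_i| > lam
    and 0 when |y_i| < lam, so z minus that choice is a valid supergradient.
    """
    return z - np.sign(y) * (np.abs(y) > lam)

def envelope_subgradient(z, lam, maxiter=MAXITER):
    """Algorithm 1: subgradient ascent on h(., z), returning <y, z> - f*(y)."""
    y = np.zeros_like(z)
    for k in range(1, maxiter + 1):
        y = y + (1.0 / k) * supergrad_h(y, z, lam)
    return float(np.dot(y, z) - conj_card(y, lam))

z = np.array([0.8, -0.3, 0.0, 0.55])   # a point inside the box [-1,1]^n
print("subgradient estimate of f**(z):", envelope_subgradient(z, lam))
print("closed-form lam * ||z||_1      :", lam * np.abs(z).sum())
```

With diminishing steps $1/k$ the iterates oscillate around the per-coordinate maximizers, so the returned value is only approximate, but it converges to the closed-form envelope value as MAXITER grows.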