Convex envelopes of complexity controlling penalties: the case against premature envelopment

Vladimir Jojic        Suchi Saria        Daphne Koller
Stanford University

Abstract

Convex envelopes of the cardinality and rank functions, ℓ1 and nuclear norm, have gained immense popularity due to their sparsity inducing properties. This has given rise to a natural approach to building objectives with sparse optima whereby such convex penalties are added to another objective. Such a heuristic approach to objective building does not always work. For example, addition of an ℓ1 penalty to the KL-divergence fails to induce any sparsity, as the ℓ1 norm of any vector in a simplex is a constant. However, a convex envelope of KL and a cardinality penalty can be obtained that indeed trades off sparsity and KL-divergence.

We consider the cases of two composite penalties, elastic net and fused lasso, which combine multiple desiderata. In both of these cases, we show that a hard objective relaxed to obtain penalties can be more tightly approximated. Further, by construction, it is impossible to get a better convex approximation than the ones we derive. Thus, constructing a joint envelope across different parts of the objective provides a means to trade off tightness and computational cost.

Appearing in Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS) 2011, Fort Lauderdale, FL, USA. Volume 15 of JMLR: W&CP 15. Copyright 2011 by the authors.

1 Introduction

Compact summarization of data succinctly describes many tasks shared across areas such as statistics, machine learning, information theory, and computational biology. The quality of a model is measured as a trade-off between reconstruction error and the complexity of the model's parametrization. A number of costs exist that capture this trade-off: MDL (Barron et al., 1998), BIC (Schwarz, 1978), AIC (Akaike, 1973), Bayes factors (Kass and Raftery, 1995) and so on. These costs are discontinuous and in some cases even their evaluation (e.g., Bayes factors) can be challenging.

Recent work on discovering compact representations of data has focused on combinations of convex losses, mostly stemming from generalized linear models, and convex sparsity promoting penalties. The uniqueness of optima in these costs combined with parameter sparsity makes them quite desirable. The joint simplicity of generalized linear losses and sparsity of parameters makes the models easily interpretable, a critical feature when the results of such fits need to be communicated across fields.

The sparsity inducing penalties have found applications across a variety of disciplines: ℓ1, most commonly under the guise of the lasso (Tibshirani, 1996) and elastic net (Zou and Hastie, 2005); and the nuclear norm, within a variety of applications such as compressed sensing (Candes et al., 2006) and recommender matrix reconstruction (Candes and Recht, 2009).

However, the crucial observation that ℓ1 is the tightest convex relaxation of cardinality, and the nuclear norm of rank (Fazel, 2002), has not been leveraged. At the same time, the family of lasso penalties continues to grow via juxtaposition. The constituent weights of these fragmented penalties are adjusted via cross validation, posing serious computational problems and at the same time eroding interpretability. The goal of this paper is to illustrate how the juxtaposition of convex penalties corresponds to piecemeal relaxation of a hard penalty and to show that the provably tightest convex approximation can be constructed for hard penalties. We also show that in some fairly common cases, elastic net and fused lasso, tighter convex relaxations exist. In other cases, we show a dramatic failure of the juxtaposition approach in inducing any change to the objective: the sparsifying convex penalty fails to sparsify.

We show how an envelope can be computed and how envelope-penalized losses can be optimized. We illustrate how these tasks can be accomplished both when an envelope is available in closed form and in cases where a closed form is not available.

In contrast to recent work on relationships between submodular set functions, convex envelopes of these functions, and sparsity inducing norms (Bach, 2010), we focus on envelopes of composite penalties involving cardinality. The key insight is that a joint envelope over multiple hard penalties is tighter than envelopment of each penalty in isolation.

The structure of the paper is as follows: first we state results on Fenchel duality and envelopes and show the main steps in deriving a convex envelope of cardinality. We then give two simple optimization algorithms for cases where the closed form of the envelope is not available. The first algorithm numerically evaluates an envelope. The second algorithm provides a general blueprint for optimizing envelope-penalized losses. We proceed to illustrate how simple juxtaposition of penalties, KL-divergence and ℓ1 in this case, can fail to produce any effect, but also show the existence of a sparsified KL-divergence. We then focus on the elastic net penalty, composed of ℓ2 and ℓ1. For this penalty, a relaxation of cardinality and ℓ2, we show the tightest convex relaxation and compare the performance of these two relaxations in the task of support recovery. Given a closed-form solution of the envelope, we show a simple coordinate descent method for a loss penalized by this envelope. Finally, we look at another example of a composite penalty, fused lasso, show that it also has a corresponding hard penalty, related to its degrees of freedom, and use its envelope to illustrate its benefits in the context of a biological application of copy number variation analysis.

2 Convex envelope

Following Hiriart-Urruty and Lemarchal (1993) we will assume that the functions that we aim to envelope are not identically +∞ and can be minorized by an affine function on a set of interest, for example the box [−1, 1]^n. Here we restate the definition of a conjugate and the theorem about the tightness of the envelope.

Definition. The conjugate of a function f(x ∈ X) is the function
$$f^*(y) = \sup_{x \in \mathcal{X}} \langle y, x \rangle - f(x). \qquad (1)$$

Theorem 2.1. For a function f(x ∈ X) the envelope (biconjugate)
$$f^{**}(z) = (f^*)^*(z) = \sup_y \langle y, z \rangle - f^*(y) \qquad (2)$$
is the pointwise supremum of all the affine functions on R^n majorized by f.

Hence the envelope of f is the tightest convex under-approximation of the function f.

Figure 1: ℓ1 is an envelope of card(·) only in the [−1, 1] interval.
2.1 An example of convex envelope derivation: cardinality to ℓ1

We show the computation of the convex envelope for card(·) as a prototype for such derivations (Boyd and Vandenberghe, 2004). The penalty is given as
$$f_\lambda(x \in [-1,1]^n) = \lambda\,\mathrm{card}(x)$$
and its conjugate is by definition
$$f^*(y \in \mathbb{R}^n) = \sup_{x \in [-1,1]^n} \langle x, y \rangle - \lambda\,\mathrm{card}(x).$$
We will assume an ordering σ of the coordinates of y such that $y_{\sigma(i)}^2 \geq y_{\sigma(i+1)}^2$ and introduce an auxiliary variable r,
$$f^*(y) = \max_r \sup_{x \in [-1,1]^r} \langle x, y_{\sigma(1:r)} \rangle - \lambda r.$$
Once r is fixed, the supremum is achieved when the inner product is maximized subject to the constraint x ∈ [−1, 1]^n, that is x_i = sgn(y_i), hence
$$f^*(y) = \max_r \sum_{i=1}^{r} \left( |y_{\sigma(i)}| - \lambda \right) = \sum_{i=1}^{n} \left( |y_i| - \lambda \right)_+ .$$
The biconjugate is
$$f^{**}(z \in [-1,1]^n) = \sup_{y \in \mathbb{R}^n} \langle z, y \rangle - \sum_{i=1}^{n} \left( |y_i| - \lambda \right)_+ .$$
Since the above supremum is separable in coordinates, we can reason about each coordinate in isolation and, for each, consider the cases when the positive part is zero and when it is non-negative. Elementary reasoning produces the closed form solution of the supremum (biconjugate)
$$\sum_{i=1}^{n} \lambda |z_i|.$$
Hence the biconjugate of λ card(·) on [−1, 1]^n is λ‖·‖₁.
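As a quick numerical check of this derivation, the following Python sketch (our own illustration, not part of the original paper; function names are ours) compares the closed form Σ_i(|y_i| − λ)_+ with a brute-force maximization over all supports for a small n:

    import itertools
    import numpy as np

    def card_conjugate_closed_form(y, lam):
        """Closed form of the conjugate of lam*card(.) on the box [-1,1]^n:
        f*(y) = sum_i (|y_i| - lam)_+ ."""
        return np.sum(np.maximum(np.abs(y) - lam, 0.0))

    def card_conjugate_brute_force(y, lam):
        """Brute force: for every support S, the best x in [-1,1]^n supported on S
        sets x_i = sgn(y_i), giving sum_{i in S} |y_i| - lam*|S|."""
        n = len(y)
        best = 0.0  # corresponds to the empty support, x = 0
        for r in range(1, n + 1):
            for S in itertools.combinations(range(n), r):
                best = max(best, np.sum(np.abs(y[list(S)])) - lam * r)
        return best

    rng = np.random.default_rng(0)
    for _ in range(100):
        y = rng.normal(scale=2.0, size=6)
        lam = rng.uniform(0.1, 1.5)
        assert np.isclose(card_conjugate_closed_form(y, lam),
                          card_conjugate_brute_force(y, lam))
    print("conjugate formula matches brute force")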

Figure 1 shows both f and its envelope f**. Note that the envelope is only valid on the box [−C, C]^n. WLOG we will assume that C = 1 in the rest of the paper. As the box boundaries grow, the envelope f** can be shown to tend to a constant. However, a change in the box boundaries corresponds to a scaling of the penalty parameter λ.

Algorithm 1 A subgradient method for computing the envelope f**(z)
  Input: z
  Output: f**(z)
  for k = 1 to MAXITER do
    y^{k+1} = y^k + (1/k) ∇_y h(y^k, z)
  end for
  return ⟨y^{MAXITER+1}, z⟩ − f*(y^{MAXITER+1})
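Algorithm 1 is simple to instantiate whenever the conjugate f* and a maximizer x_y of the conjugate problem are computable pointwise (the subgradient z − x_y is derived in Sec. 3.1). The Python sketch below is our own illustration, not code from the paper: it runs Algorithm 1 for the cardinality penalty, where f*(y) = Σ_i(|y_i| − λ)_+ and x_{y,i} = sgn(y_i)·[|y_i| > λ], so the numerical estimate of f**(z) can be checked against the known closed form λ‖z‖₁:

    import numpy as np

    def conj_card(y, lam):
        # f*(y) = sum_i (|y_i| - lam)_+ for f(x) = lam*card(x) on [-1,1]^n
        return np.sum(np.maximum(np.abs(y) - lam, 0.0))

    def argmax_card(y, lam):
        # an x achieving sup_x <y,x> - lam*card(x): x_i = sgn(y_i) if |y_i| > lam else 0
        return np.sign(y) * (np.abs(y) > lam)

    def envelope_value(z, conj, argmax, num_iter=5000):
        """Algorithm 1: subgradient ascent on h(y, z) = <y, z> - f*(y).
        A subgradient in y is z - x_y with x_y a maximizer (see Sec. 3.1)."""
        y = np.zeros_like(z)
        best = -np.inf
        for k in range(1, num_iter + 1):
            grad = z - argmax(y)
            y = y + (1.0 / k) * grad
            best = max(best, np.dot(y, z) - conj(y))
        return best

    lam = 0.7
    z = np.array([0.9, -0.3, 0.0, 0.5])
    approx = envelope_value(z, lambda y: conj_card(y, lam), lambda y: argmax_card(y, lam))
    print(approx, lam * np.sum(np.abs(z)))  # the two values should be close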

3 Evaluating the envelope and optimizing envelope-penalized losses

In this section we show numerical methods for two common tasks: pointwise evaluation of the envelope and optimization of envelope-penalized losses. Pointwise evaluation of the envelope is useful for exploring the behavior of the envelope, such as the shape of its level sets. However, for the second task of minimizing penalized losses, explicit pointwise evaluation of the envelope can be bypassed in the interest of efficient optimization.

3.1 Pointwise evaluation of the envelope

We focus on envelopes of functions that do not have a closed form but turn out to be numerically tractable. The requirement in this scheme is that pointwise computation of the conjugate f*(y) = sup_x ⟨y, x⟩ − f(x) is tractable. Given this assumption, we can find an x_y that achieves the supremum, where the subscript is used to indicate dependence on y. The envelope is given as
$$f^{**}(z) = \sup_y h(y, z), \qquad h(y, z) = \langle y, z \rangle - f^*(y). \qquad (3)$$
If the conjugate has a closed form then ∇_y h(y, z) can be computed symbolically. Alternatively, we can apply Danskin's theorem (Bertsekas, 1999) to obtain ∇_y h(y, z). For this, we need to define the set X_y that contains all x that achieve the supremum of ⟨y, x⟩ − f(x). Then
$$\nabla_y h(y, z) = \{ z - x^* : x^* \in X_y \}. \qquad (4)$$
In either case, we can evaluate the envelope using the standard subgradient method, for example Algorithm 1. We note that the form of the envelope problem makes it amenable to the smoothing methods of Nesterov (2005).

However, the conjugate is not always tractable. An example of such a function is
$$f(\beta) = \|D - M\beta\|_2^2 + \lambda\,\mathrm{card}(\beta),$$
a linear regression problem with target variable D ∈ R^m and predictor matrix M ∈ R^{m×n}. Evaluating f*(y) = sup_β ⟨β, y⟩ − f(β) in this case is just as hard as the original problem, subset selection. This is not surprising, as the tightness of the convex envelope implies that it touches the global minimum of the function. Hence, envelopment does not erase the hardness of the original problem.

3.2 The variational inequality approach to minimizing envelope-penalized losses

Minimization of an envelope-penalized loss
$$\mathrm{PLoss}(z) = \mathrm{Loss}(z) + f^{**}(z) \qquad (5)$$
seems to require pointwise evaluation of the envelope. When a closed-form envelope is available, this does not pose a problem. In the previous section, we have shown a method for evaluating a convex envelope numerically. It would seem that in order to minimize Eq. 5 we would need to wrap an outer loop optimizing z around an inner loop computing the envelope. To further complicate matters, the approximate nature of the inner loop would then necessitate an appeal to approximate subgradient schemes such as (Solodov and Svaiter, 2000). We bypass these considerations by recasting the optimization problem as a variational inequality problem.

The problem of minimizing the cost in Eq. 5 can be written as a saddle problem
$$\inf_z \sup_y \; \mathrm{Loss}(z) + h(y, z) \qquad (6)$$
which is equivalent to finding $V = \begin{bmatrix} y \\ z \end{bmatrix}$ such that
$$\langle F(V), W - V \rangle \geq 0, \quad \forall W \in \mathcal{V},$$
where
$$F\left( \begin{bmatrix} y \\ z \end{bmatrix} \right) = \begin{bmatrix} -\nabla_y h(y, z) \\ \nabla_z \mathrm{Loss}(z) + y \end{bmatrix} \qquad (7)$$
and 𝒱 = R^n × B, with B denoting the set on which f** is defined, for example the box [−1, 1]^n. Algorithm 2 solves the variational inequality problem (Nemirovski, 2005; Tseng, 2008). One choice for D(V, W) is (1/2)‖V − W‖²₂, but other distance terms can be used, especially on coordinates that correspond to the set B.

Finally, L denotes the Lipschitz constant of F and is influenced primarily by the Loss function.
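For concreteness, here is a minimal sketch (our own illustration, with hypothetical helper names such as make_F) of how the operator F of Eq. 7 can be assembled once a gradient of the loss and a maximizer of the conjugate problem are available; the example at the bottom uses a squared loss and the cardinality penalty of Sec. 2.1:

    import numpy as np

    def make_F(grad_loss, argmax_conj):
        """Assemble the monotone operator F of Eq. 7 for the saddle problem
        inf_z sup_y Loss(z) + <y, z> - f*(y).
        grad_loss(z)   : gradient of the loss at z
        argmax_conj(y) : an x attaining sup_x <y, x> - f(x), so that a
                         subgradient of h in y is z - argmax_conj(y) (Eq. 4)."""
        def F(y, z):
            Fy = argmax_conj(y) - z          # -grad_y h(y, z)
            Fz = grad_loss(z) + y            # grad_z Loss(z) + y
            return Fy, Fz
        return F

    # example: squared loss (1/2)||d - z||^2 with the cardinality penalty of Sec. 2.1
    d = np.array([1.2, -0.1, 0.4])
    lam = 0.5
    F = make_F(grad_loss=lambda z: z - d,
               argmax_conj=lambda y: np.sign(y) * (np.abs(y) > lam))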

Algorithm 2 Proximal point algorithm for a VIP
  for k = 1 to MAXITER do
    W^k = argmin_{V ∈ 𝒱} ⟨V, F(V^k)⟩ + L · D(V, V^k)
    V^{k+1} = argmin_{V ∈ 𝒱} ⟨V, F(W^k)⟩ + L · D(V, V^k)
  end for
  return V^{MAXITER+1}_z

Figure 2: a) KL(p‖Unif) + 0.33 card(p) and its convex envelope on the 2d simplex. b) Contours of the convex envelope of KL(p‖Unif) + 0.33 card(p) on the 3d simplex. The contours become closer to straight lines near low complexity solutions, the corners and sides of the simplex. This shape of contours is reminiscent of ℓ1's contours. The switch to entropy-like curved contours occurs towards the middle of the simplex.

This numerical algorithm requires O(L/ε) iterations to achieve an ε-approximate solution (Nemirovski, 2005; Tseng, 2008). This complexity still compares favorably to projected subgradients optimizing a closed form cost, which in general require O(1/ε²) iterations for a solution of the same quality.
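Below is a minimal sketch of Algorithm 2 (ours, not code from the paper) with the Euclidean choice D(V, W) = (1/2)‖V − W‖²₂, under which each argmin reduces to a step of length 1/L along −F, projected onto B for the z block; it consumes an operator F assembled as in the previous sketch:

    import numpy as np

    def project_box(z, lo=-1.0, hi=1.0):
        # projection onto the box B = [lo, hi]^n on which f** is defined
        return np.clip(z, lo, hi)

    def solve_vip(F, y0, z0, L, num_iter=2000):
        """Algorithm 2 with D(V, W) = 0.5*||V - W||^2: each argmin becomes a
        (projected) step of size 1/L along -F, evaluated first at V^k, then at W^k."""
        y, z = y0.copy(), z0.copy()
        for _ in range(num_iter):
            Fy, Fz = F(y, z)
            wy = y - Fy / L                      # y block is unconstrained
            wz = project_box(z - Fz / L)         # z block lives in B
            Fy, Fz = F(wy, wz)
            y = y - Fy / L
            z = project_box(z - Fz / L)
        return z

    # usage with the operator built in the previous sketch:
    # z_hat = solve_vip(F, y0=np.zeros(3), z0=np.zeros(3), L=2.0)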

4 Sparsified KL-divergence

The use of KL-divergence is ubiquitous, and more recent work has investigated incorporating penalties and constraints with the divergence (Graça et al., 2007). Obtaining sparse multinomials has been attempted by using a negative-weight Dirichlet prior (Bicego et al., 2007), as well as by inducing sparsity in sufficient statistics for mixture proportions via the ℓ∞ norm (Graca et al., 2009). The approach of using the ℓ1 norm to induce sparsity has served well in sparsifying objectives. However, in the case of KL this augmentation fails immediately and obviously as soon as the optimization problem is written:
$$\begin{aligned}
\text{minimize} \quad & \left\langle q, \log\tfrac{q}{p} \right\rangle + \lambda \|q\|_1 \\
\text{subject to} \quad & q_i \geq 0, \ \forall i, \quad \textstyle\sum_i q_i = 1.
\end{aligned}$$
Since q is constrained to lie in a simplex, its ℓ1 norm is always 1, and thus the penalty is a constant. Of course, this does not mean that sparsification of the KL-divergence is impossible.

Recalling that the ℓ1 norm is the tightest convex relaxation (envelope) of card(·), the de-facto sparsifying penalty, we can pose a different problem,
$$\begin{aligned}
\text{minimize} \quad & f(q) = \left\langle q, \log\tfrac{q}{p} \right\rangle + \lambda\,\mathrm{card}(q) \\
\text{subject to} \quad & q_i \geq 0, \ \forall i, \quad \textstyle\sum_i q_i = 1,
\end{aligned}$$
and seek the convex envelope of f on the n-dimensional simplex ∆_n. We obtain the conjugate
$$f^*(y) = \max_{r \in \{1,\dots,n\}} \left\{ \log\left( \sum_{i=1}^{r} \exp(y_{\sigma(i)})\, p_{\sigma(i)} \right) - \lambda r \right\} \qquad (8)$$
where σ is an order such that
$$\exp(y_{\sigma(i)})\, p_{\sigma(i)} \geq \exp(y_{\sigma(i+1)})\, p_{\sigma(i+1)}, \quad \forall i.$$
The derivation of the envelope can proceed from this point by conjugating f*(y). Here we opt to use the scheme proposed in Sec. 3 to numerically evaluate the envelope. The requirements for application of that scheme were that we can compute the conjugate and its gradient efficiently. Exact computation of f*(y) requires sorting a vector of length n to obtain the order σ, and a pass through the logarithm of the cumulative sums to obtain an r* that maximizes Eq. 8. We can plug f* into the definition of h(y, z) in Eq. 3 and obtain a subgradient ∇_y h(y, z):
$$\frac{\partial h(y,z)}{\partial y_{\sigma(i)}} =
\begin{cases}
z_{\sigma(i)}, & i > r^* \\[6pt]
z_{\sigma(i)} - \dfrac{\exp(y_{\sigma(i)})\, p_{\sigma(i)}}{\sum_{j=1}^{r^*} \exp(y_{\sigma(j)})\, p_{\sigma(j)}}, & i \leq r^*
\end{cases}$$
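The conjugate in Eq. 8 and the subgradient above are inexpensive: sort by exp(y_i)p_i, accumulate, and pick the maximizing r. The sketch below is our own illustration of that computation (function names are ours); for clarity it uses a plain cumulative sum rather than a numerically hardened log-sum-exp:

    import numpy as np

    def kl_card_conjugate(y, p, lam):
        """Eq. 8: f*(y) = max_r log(sum of the r largest exp(y_i)*p_i) - lam*r.
        Returns the value, the maximizing r*, and the sort order sigma."""
        w = np.exp(y) * p
        sigma = np.argsort(-w)               # descending order of exp(y_i) * p_i
        csum = np.cumsum(w[sigma])
        vals = np.log(csum) - lam * np.arange(1, len(y) + 1)
        r_star = int(np.argmax(vals)) + 1
        return vals[r_star - 1], r_star, sigma

    def kl_card_subgradient(y, z, p, lam):
        """Subgradient of h(y, z) = <y, z> - f*(y), as used by Algorithm 1."""
        _, r_star, sigma = kl_card_conjugate(y, p, lam)
        w = np.exp(y) * p
        g = z.copy()
        active = sigma[:r_star]
        g[active] -= w[active] / np.sum(w[active])
        return g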

Figure 2 shows the convex envelope of KL(p‖Unif) + 0.33 card(p) on the 2d simplex as well as its contour plot on the 3d simplex. Here we can conclude that a joint envelope of the KL-divergence and cardinality yields a sparsifying objective, whereas a separate envelope of the objective parts yields only the KL-divergence. Not only does the simple addition of penalties yield a loose convex approximation of the desired objective, but in some cases it fails completely to produce any effect. Thus, a joint envelope of all parts of the objective is preferable when computationally feasible.

5 Envelope of cardinality and ℓ2

The elastic net penalty (Zou and Hastie, 2005) consists of a sum of ℓ2 and ℓ1. These two penalties induce a grouping effect, i.e., joint selection of correlated predictors, and sparsity in regression weights. Considering again ℓ1 as a convex relaxation of cardinality, we can ask whether jointly enveloping a penalty that combines ℓ2 and cardinality would yield a different convex penalty.

Indeed, it does, and the resulting penalty is the "Berhu" penalty (Owen, 2006). The intuition behind the introduction of this penalty was the desire to produce a robust ridge regression estimator, and the form of the penalty was arrived at by rearranging parts of the Huber penalty. Our construction shows that this penalty is the tightest convex relaxation of cardinality and ℓ2.

Envelope of ℓ2 + card(·)  The penalty is
$$f(x) = \frac{\kappa}{2}\|x\|_2^2 + \lambda\,\mathrm{card}(x) \qquad (9)$$
where x ∈ B_n = [−1, 1]^n and
$$\mathrm{card}(x) = \sum_{i=1}^{n} [x_i \neq 0]. \qquad (10)$$
We can extend f to the whole of R^n so that the conjugate is well defined,
$$f(x) = \begin{cases} \infty, & x \notin B_n \\ \frac{\kappa}{2}\|x\|_2^2 + \lambda\,\mathrm{card}(x), & x \in B_n. \end{cases} \qquad (11)$$
However, this does not affect computation of the conjugate, so we can focus on x ∈ B_n.

Conjugate  The conjugate of f is given by
$$f^*(y) = \sup_{x \in B_n} \langle x, y \rangle - \frac{\kappa}{2}\|x\|_2^2 - \lambda\,\mathrm{card}(x) \qquad (12)$$
and rewriting,
$$f^*(y) = \frac{1}{2\kappa}\|y\|_2^2 + \sup_{x \in B_n} -\frac{\kappa}{2}\left\|x - \frac{y}{\kappa}\right\|_2^2 - \lambda\,\mathrm{card}(x). \qquad (13)$$
The conjugate simplifies into a coordinate-wise representation:
$$f_i^*(y) = \begin{cases}
0, & y_i^2 \leq 2\lambda\kappa \\
\frac{1}{2\kappa} y_i^2 - \lambda, & 2\lambda\kappa \leq y_i^2 \leq \kappa^2 \\
|y_i| - \lambda - \frac{\kappa}{2}, & \kappa^2 \leq y_i^2.
\end{cases} \qquad (14)$$

Biconjugate  We form the biconjugate, which also remains in coordinate-wise form:
$$f^{**}(z \in B_n) = \sup_{y \in \mathbb{R}^n} \sum_i y_i z_i - f_i^*(y). \qquad (15)$$
Since the supremum is separable, we can express it in a pointwise manner,
$$f_i^{**}(z_i) = \sup_{y_i} \begin{cases}
z_i y_i, & y_i^2 \leq 2\lambda\kappa \\
z_i y_i - \frac{1}{2\kappa} y_i^2 + \lambda, & 2\lambda\kappa \leq y_i^2 \leq \kappa^2 \\
z_i y_i + \frac{\kappa}{2} - |y_i| + \lambda, & \kappa^2 \leq y_i^2.
\end{cases} \qquad (16)$$
We note that the function under the supremum is continuous at the boundaries y_i² = 2λκ and y_i² = κ² of the conditions above. Computing the optima for each of the three conditions, we obtain the envelope below; note that the third condition in Eq. 16 is never active:
$$f_i^{**}(z_i) = \begin{cases}
|z_i| \sqrt{2\lambda\kappa}, & |z_i| \leq \sqrt{\tfrac{2\lambda}{\kappa}} \\[4pt]
\frac{\kappa}{2} z_i^2 + \lambda, & |z_i| \geq \sqrt{\tfrac{2\lambda}{\kappa}}
\end{cases} \qquad (17)$$
and
$$f^{**}(z) = \sum_i f_i^{**}(z_i).$$

Figure 3: Two relaxations of 0.2 card(x) + 0.5 ℓ2(x) on the [−1, 1]² box: the convex envelope (left) and the elastic net penalty 0.2 ℓ1(x) + 0.5 ℓ2(x) (right). The level sets of the convex envelope for the same value of the penalty are smaller, a consequence of the tightness of the envelope compared to elastic net. Note that the contours of the convex envelope close to low complexity solutions are composed of straight lines, with an ℓ1-like diamond for small coordinates. The regime switch to the curved, ℓ2-like contours occurs for larger coordinates.

Optimizing penalty and loss jointly  We can now take the envelope f** of the target function f and add it to the loss to obtain a penalized loss PL(z). WLOG, we assume that all predictors are standardized, i.e., Σ_j X_{i,j} = 0 and X_i^T X_i = 1:
$$\mathrm{PL}(z) = \frac{1}{2}\|Y - Xz\|_2^2 + \sum_i \begin{cases}
|z_i| \sqrt{2\lambda\kappa}, & |z_i| \leq \sqrt{\tfrac{2\lambda}{\kappa}} \\[4pt]
\frac{\kappa}{2} z_i^2 + \lambda, & |z_i| \geq \sqrt{\tfrac{2\lambda}{\kappa}}.
\end{cases}$$
This problem can be optimized by coordinate descent. To derive the update equations, we will use the following notation: Y_{−i} = Y − Σ_{j≠i} X_j z_j. Coordinate descent then iterates the following update, with Z_{−i} = Y_{−i}^T X_i:
$$z_i = \begin{cases}
0, & |Z_{-i}| \leq \sqrt{2\lambda\kappa} \\[4pt]
Z_{-i} - \mathrm{sgn}(Z_{-i})\sqrt{2\lambda\kappa}, & \sqrt{2\lambda\kappa} \leq |Z_{-i}|, \ \ \tfrac{|Z_{-i}|}{1+\kappa} \leq \sqrt{\tfrac{2\lambda}{\kappa}} \\[4pt]
\tfrac{1}{1+\kappa} Z_{-i}, & \tfrac{|Z_{-i}|}{1+\kappa} \geq \sqrt{\tfrac{2\lambda}{\kappa}}.
\end{cases}$$
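A compact sketch of this coordinate descent (our own illustration, assuming the columns of X are standardized as stated above; function names are ours). berhu_envelope_update implements the three-case rule for z_i directly:

    import numpy as np

    def berhu_envelope_update(Z, lam, kappa):
        """Coordinate update for the envelope of (kappa/2)||x||^2 + lam*card(x):
        soft threshold at sqrt(2*lam*kappa), ridge shrinkage 1/(1+kappa) for large Z."""
        t1 = np.sqrt(2.0 * lam * kappa)                   # soft-threshold level
        t2 = (1.0 + kappa) * np.sqrt(2.0 * lam / kappa)   # switch to the ridge regime
        if abs(Z) <= t1:
            return 0.0
        if abs(Z) <= t2:
            return Z - np.sign(Z) * t1
        return Z / (1.0 + kappa)

    def coordinate_descent(X, Y, lam, kappa, num_sweeps=100):
        """Minimize 0.5*||Y - X z||^2 + sum_i f_i**(z_i) by cyclic coordinate descent.
        Assumes each column X[:, i] has zero mean and unit norm."""
        n = X.shape[1]
        z = np.zeros(n)
        residual = Y - X @ z
        for _ in range(num_sweeps):
            for i in range(n):
                residual += X[:, i] * z[i]       # remove coordinate i from the fit
                Z = X[:, i] @ residual           # Z_{-i} = Y_{-i}^T X_i
                z[i] = berhu_envelope_update(Z, lam, kappa)
                residual -= X[:, i] * z[i]       # add the updated coordinate back
        return z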

As a consequence of Theorem 2.1, the derived penalty is the tightest convex relaxation of card(·) + ℓ2. Figure 3 illustrates the differences between the elastic net penalty and the envelope of card(·) + ℓ2. The envelope has smaller level sets for the same penalty levels, which is a direct consequence of its tightness compared to elastic net. Further, the envelope smoothly transitions from ℓ1-like diamond contours for small coordinates to ℓ2-like round contours for larger coordinates.

In Figure 4 we show a comparison of the performance of the two penalties in the task of support recovery. The penalty constants λ1 and λ2, corresponding to the sparsity and ℓ2 portions, were sought on a grid of values, independently for the two methods. The ROC curves illustrate the best recovery rates achieved using the two penalties with their respective parameters set by 5-fold cross-validation. The convex envelope penalty recovers a higher proportion of the support compared to the elastic net penalty.

Figure 4: Comparison of elastic net and the convex envelope of ℓ2 + card(·) on a synthetic support recovery task. The data was generated from a linear model with 5000 predictors, with true support consisting of 125 predictors. The support consisted of 25 groups of 5 correlated predictors, with correlation bounded from below by 0.9. For both methods, λ1 corresponded to the penalty associated with the sparsity component and λ2 corresponded to the squared ℓ2 component. We scanned a grid of values for both λ1 and λ2 ranging from -15 to 0 in the log domain, in increments of 1. The optimal penalties are chosen via 5-fold cross-validation. We show an ROC curve (false positive support vs. true positive support) on the task of support recovery for the best performing choice of λs for each penalty. The false positive range is truncated at 2.5%, reflecting the imbalance between the sizes of the true positive and the true negative sets in this task.

6 Envelope of the non-zero-block penalty

The fused lasso penalty proposed in (Tibshirani et al., 2005) is meant to combine sparsity and smoothness. Sparsity in a vector of values is induced by ℓ1 on the coordinates. Smoothness is induced by ℓ1 on the differences between subsequent coordinate values. This penalty, in the case of a neighborhood relationship induced by a chain, is
$$\lambda_1 \sum_i |x_i| + \lambda_2 \sum_i |x_i - x_{i-1}|,$$
combined with a squared error loss
$$\|y - x\|_2^2 \qquad (18)$$
where y (the data vector) and x are both column vectors. The combination of this squared loss and the fused lasso penalty yields the fused lasso signal approximator.

One characterization of the complexity of a fused lasso fit is given by the number of distinct non-zero blocks (Eq. 13 of Tibshirani et al. (2005)). While optima of the fused lasso cost exhibit a "blocky" structure, the penalty can deviate significantly from simply counting the number of non-zero blocks. Here we show how the convex envelope of the non-zero-block count function can be optimized, even if a closed form is not available.

6.1 Conjugate of degrees of freedom

The complexity measure of a fused lasso fit, its degrees of freedom, stated in Equation 13 of Tibshirani et al. (2005), is the cardinality of the complement of the set {i : x_i = 0} ∪ {i : x_i − x_{i−1} = 0, x_i, x_{i−1} ≠ 0}, which is exactly {i : x_i ≠ 0, x_i ≠ x_{i−1}}. Hence we can formulate the hard penalty relaxed by the fused lasso as
$$\lambda\,\mathrm{card}(\langle x \neq 0, x \neq x_\pi \rangle) = \lambda \sum_i [x_i \neq 0, \ x_i \neq x_{i-1}],$$
where x_π denotes the neighbors of x and ⟨·⟩ denotes the set of coordinates at which both conditions hold. We define the non-zero-block penalty as
$$f_{\mathrm{nzb}}(x) = \lambda \langle x \neq 0, x \neq x_\pi \rangle.$$
The conjugate is then given by
$$f_{\mathrm{nzb}}^*(y) = \sup_{x \in [-1,1]^N} \langle x, y \rangle - \lambda \langle x \neq 0, x \neq x_\pi \rangle. \qquad (19)$$
Next, we show that the optimization of the supremum can be achieved by a Viterbi-like algorithm over a small number of hidden states, by characterizing the optimal assignments to x.
Proposition 6.1. The supremum of h(x) = ⟨x, y⟩ − λ⟨x ≠ 0, x ≠ x_π⟩ over x ∈ B_n can be achieved by an x ∈ S_n = {−1, 0, 1}^n.

Proof. Assume that there exists a maximizing assignment x* ∈ [−1, 1]^N. The assignment x* can be broken down into blocks of consecutive indices I_1, ..., I_m such that x*_{I_j} = a_j (a slight abuse of notation), with a_{j−1} ≠ a_j and at least one a_j ∉ {−1, 0, 1}. We can now write the function h(x*) = Σ_{j=1}^{m} h_j(x*_{I_j}) where h_j(x_{I_j}) = ⟨x_{I_j}, y_{I_j}⟩ − λ. We now proceed to construct x' such that h(x') ≥ h(x*). For each block I_j such that a_j ∈ {−1, 0, 1}, we keep the same values, x'_{I_j} = a_j, hence h_j(x'_{I_j}) ≥ h_j(x*_{I_j}). For blocks such that a_j ∉ {−1, 0, 1} and y_{I_j} ≠ 0, setting x'_{I_j} = sgn(Σ_{l∈I_j} y_l) achieves h_j(x'_{I_j}) ≥ h_j(x*_{I_j}). Hence we can obtain an x' from x* such that h(x') ≥ h(x*) and x' ∈ {−1, 0, 1}^N; since x* achieves the supremum, so does x'.

Hence, we can reformulate the computation of the conjugate as optimization over this smaller discrete set:
$$f_{\mathrm{nzb}}^*(y) = \max_{x \in S_n} \langle x, y \rangle - \lambda \langle x \neq 0, x \neq x_\pi \rangle \qquad (20)$$
and highlight the applicability of a Viterbi-like algorithm in computing the conjugate:
$$\begin{aligned}
f_{\mathrm{nzb}}^*(y) = \max_{x_1 \in S_1}\; & x_1 y_1 - \lambda [x_1 \neq 0] \; + \\
\max_{x_2 \in S_1}\; & x_2 y_2 - \lambda [x_2 \neq 0, x_2 \neq x_1] + \dots + \\
\max_{x_n \in S_1}\; & x_n y_n - \lambda [x_n \neq 0, x_n \neq x_{n-1}]. \qquad (21)
\end{aligned}$$
Given a maximizing assignment x* and using Eq. 4, we can compute a subgradient ∇_y h(y, z) = z − x* and use it in Algorithm 1, specialized to the problem of the envelope of the non-zero-block penalty. The complexity of the Viterbi pass is linear in the length of the data vector, and hence computation of the subgradient is linear.
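Because Eq. 21 decomposes along the chain, a standard Viterbi pass over the three states {−1, 0, 1} recovers both f*_nzb(y) and a maximizing assignment x*. The following sketch is our own illustration of that dynamic program (function names are ours):

    import numpy as np

    STATES = (-1.0, 0.0, 1.0)

    def nzb_conjugate_viterbi(y, lam):
        """Compute f*_nzb(y) = max over x in {-1,0,1}^n of
        <x, y> - lam * #{i : x_i != 0, x_i != x_{i-1}}  (Eq. 20/21),
        together with a maximizing assignment x*."""
        n = len(y)
        score = {s: s * y[0] - lam * (s != 0.0) for s in STATES}
        back = []
        for i in range(1, n):
            new_score, pointers = {}, {}
            for s in STATES:
                # lam is paid only when x_i is non-zero and starts a new block
                cands = {prev: score[prev] + s * y[i] - lam * (s != 0.0 and s != prev)
                         for prev in STATES}
                best_prev = max(cands, key=cands.get)
                new_score[s], pointers[s] = cands[best_prev], best_prev
            score, back = new_score, back + [pointers]
        last = max(score, key=score.get)
        x = [last]
        for pointers in reversed(back):
            x.append(pointers[x[-1]])
        return score[last], np.array(x[::-1])

    # the subgradient used by Algorithm 1 is then z - x_star:
    # value, x_star = nzb_conjugate_viterbi(y, lam); grad = z - x_star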
6.2 Optimization of a squared loss penalized by the envelope of the non-zero-block count

We now specialize the two updates in Algorithm 2 to the problem of optimizing the loss from Eq. 18 penalized by the envelope of the non-zero-block count,
$$\frac{1}{2}\|d - z\|_2^2 + \lambda f_{\mathrm{nzb}}^{**}(z), \qquad (22)$$
where we write d for the data vector (denoted y in Eq. 18) to avoid a clash with the dual variable y. Adopting the notation Viterbi(·, ·) to indicate Viterbi decoding for the cost given in Eq. 21, we can specify the updates inside the loop of Algorithm 2:
$$\begin{aligned}
W_y^k &= V_y^k - \tfrac{1}{L}\left( \mathrm{Viterbi}(V_y^k, \lambda) - V_z^k \right) \\
W_z^k &= \Pi_{B_n}\!\left( V_z^k - \tfrac{1}{L}\left( V_z^k - d + V_y^k \right) \right) \\
V_y^{k+1} &= V_y^k - \tfrac{1}{L}\left( \mathrm{Viterbi}(W_y^k, \lambda) - W_z^k \right) \\
V_z^{k+1} &= \Pi_{B_n}\!\left( V_z^k - \tfrac{1}{L}\left( W_z^k - d + W_y^k \right) \right)
\end{aligned}$$
The complexity of an iteration of this algorithm is linear in the size of the data sample.

6.3 Comparison to fused lasso

For purposes of comparison, we use the loss from Eq. 18,
$$L(z) = \frac{1}{2}\|y - z\|_2^2,$$
and compare the performance of the two penalties: fused lasso and the envelope of the non-zero-block count, f**_nzb.

From the HapMap CNV Project we obtained copy number estimates for chromosomes 12, 13, 15 and 18 of individual NA10855. For both methods we perform a pathwise fit, controlling for the number of degrees of freedom and plotting the best reconstruction error for a model of that complexity; see Figure 5. This regime corresponds to a purely unsupervised compression task. In these examples, the envelope-penalized squared-error loss can yield models that capture more of the data variance with a smaller number of degrees of freedom. Notably, these examples have significant amounts of structure. In cases of simpler signals, near constant or with a small number of switching points, the two penalties do not recover noticeably different models.

7 Conclusion

Convex envelopes of cardinality offer a direct way to capture parameter complexity and have yielded two very useful penalties, ℓ1 and the nuclear norm. In spite of the introduction of novel lasso-style penalties, a principled approach to combining these penalties has not been put forward. We illustrated the hazards of simple composition of penalties on the example of KL-divergence and ℓ1. In that case, we showed that introduction of the sparsifying penalty fails to yield any sparsity. However, the joint convex envelope of KL-divergence and cardinality yields the desired sparsified objective. Further, we showed that two well known penalties, elastic net and fused lasso, can be viewed as relaxations of hard penalties, and that those hard penalties can in turn be enveloped to yield even tighter convex relaxations. We also showed two simple algorithms for envelope evaluation and for optimization of envelope-penalized losses. We suggest that convex envelopes with guarantees on tightness provide a powerful approach to combining penalties, explicitly trading off computational cost, and investigating novel hybrid penalties.

Figure 5: Model fits to copy number variation data from the HapMap project. The data is shown on the right. The measurements span four chromosomes (12, 13, 15 and 18) from individual NA10855. The plot shows the reconstruction error for varying complexity of models (degrees of freedom). In the plot, FL denotes fused lasso penalized loss and NZB denotes loss penalized by the envelope of non-zero-block counts.

References

H. Akaike. Information theory and an extension of the maximum likelihood principle. 2nd Internat. Sympos. Inform. Theory, Tsahkadsor 1971, 267–281, 1973.

F. Bach. Structured sparsity-inducing norms through submodular functions. In NIPS, 2010.

A. Barron, J. Rissanen, and Bin Yu. The minimum description length principle in coding and modeling. Information Theory, IEEE Transactions on, 44(6):2743–2760, October 1998.

D. P. Bertsekas. Nonlinear Programming. Athena Scientific, September 1999. ISBN 1886529000.

M. Bicego, M. Cristani, and V. Murino. Sparseness achievement in hidden Markov models. In ICIAP '07: Proceedings of the 14th International Conference on Image Analysis and Processing, pages 67–72, Washington, DC, USA, 2007. IEEE Computer Society. ISBN 0-7695-2877-5.

S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, March 2004. ISBN 0521833787.

E. J. Candes and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, 2009.

E. J. Candes, J. Romberg, and T. Tao. Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. Information Theory, IEEE Transactions on, 52(2):489–509, Feb. 2006.

M. Fazel. Matrix Rank Minimization with Applications. PhD thesis, Dept. of Elec. Eng., Stanford University, 2002.

J. Graca, K. Ganchev, B. Taskar, and F. Pereira. Posterior vs parameter sparsity in latent variable models. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 664–672, 2009.

J. Graça, K. Ganchev, and B. Taskar. Expectation maximization and posterior constraints. In NIPS, 2007.

J.-B. Hiriart-Urruty and C. Lemarchal. Convex Analysis and Minimization Algorithms. Part 1: Fundamentals. Springer-Verlag, 1993.

R. E. Kass and A. E. Raftery. Bayes factors. Journal of the American Statistical Association, 90(430):773–795, 1995.

A. Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM J. on Optimization, 15(1):229–251, 2005. ISSN 1052-6234.

Yu. Nesterov. Smooth minimization of non-smooth functions. Math. Program., 103(1):127–152, 2005. ISSN 0025-5610.

A. B. Owen. A robust hybrid of lasso and ridge regression. Technical report, Stanford University, 2006.

G. Schwarz. Estimating the dimension of a model. Ann. Stat., 6:461–464, 1978.

M. V. Solodov and B. F. Svaiter. An inexact hybrid generalized proximal point algorithm and some new results on the theory of Bregman functions. Mathematics of Operations Research, 25:214–230, 2000.

R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society (Series B), 58:267–288, 1996.

R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society, Series B, pages 91–108, 2005.

P. Tseng. On accelerated proximal gradient methods for convex-concave optimization. SIAM Journal on Optimization, 2008.

H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67:301–320, 2005.