On the of strong convexity and strong smoothness: Learning applications and matrix regularization

Sham M. Kakade and Shai Shalev-Shwartz and Ambuj Tewari Toyota Technological Institute—Chicago, USA {sham,shai,tewari}@tti-c.org

Abstract strictly convex functions). Here, the notion of strong con- vexity provides one means to characterize this gap in terms We show that a function is strongly convex with of some general (rather than just Euclidean). respect to some norm if and only if its conjugate This work examines the notion of strong convexity from function is strongly smooth with respect to the a duality perspective. We show a function is strongly convex dual norm. This result has already been found with respect to some norm if and only if its (Fenchel) con- to be a key component in deriving and analyz- jugate function is strongly smooth with respect to its dual ing several learning algorithms. Utilizing this du- norm. Roughly speaking, this notion of smoothness (defined ality, we isolate a single inequality which seam- precisely later) provides a second order upper bound of the lessly implies both generalization bounds and on- conjugate function, which has already been found to be a key line regret bounds; and we show how to construct component in deriving and analyzing several learning algo- strongly convex functions over matrices based on rithms. Using this relationship, we are able to characterize strongly convex functions over vectors. The newly a number of matrix based penalty functions, of recent in- constructed functions (over matrices) inherit the terest, as being strongly convex functions, which allows us strong convexity properties of the underlying vec- to immediately derive online algorithms and generalization tor functions. We demonstrate the potential of bounds when using such functions. this framework by analyzing several learning algo- We now briefly discuss related work and our contribu- rithms including group Lasso, kernel learning, and tions. online control with adversarial quadratic costs. 1.1 Related work The notion of strong convexity takes its roots in optimiza- 1 Introduction tion based on ideas in Nemirovski and Yudin [1978] (where it was defined with respect to the Euclidean norm) — the As we tackle more challenging learning problems, there is generalization to arbitrary norms was by Nesterov [2005]. an increasing need for algorithms which efficiently impose Relatively recently, its use in machine learning has been two more sophisticated forms of prior knowledge. Examples in- fold: in deriving regret bounds for online algorithm and gen- clude: the group Lasso problem (for “shared” feature selec- eralization bounds in batch settings. tion across problems), kernel learning, multi-class predic- The duality of strong convexity and strong smoothness tion, and multi-task learning. A central question here is to was first used by Shalev-Shwartz and Singer [2006], Shalev- understand the generalization ability of such algorithms in Shwartz [2007] in the context of deriving low regret online terms of the attendant complexity restrictions imposed by algorithms. Here, once we choose a particular strongly con- the algorithm (such analyses often illuminate the nature in vex penalty function, we immediately have a family of algo- which our prior knowledge is being imposed). rithms along with a regret bound for these algorithms that is There is growing body of work suggesting that the no- in terms of a certain strong convexity parameter. A variety of tion of strong convexity is a fundamental tool in designing algorithms (and regret bounds) can be seen as special cases. and analyzing (the regret or generalization ability of) a wide A similar technique, in which the Hessian is directly range of learning algorithms (which we discuss in the next bounded, is described by Grove et al. [2001], Shalev-Shwartz Subsection). The underlying intuition for this is as follows: and Singer [2007]. Another related approach involved Most of our efficient algorithms (both in the batch and on- bounding a Bregman divergence [Kivinen and Warmuth, line settings) impose some complexity control via the use of 1997, 2001, Gentile, 2002] (see Cesa-Bianchi and Lugosi some strictly convex penalty function (either explicitly via [2006] for a detailed survey). Another interesting applica- a regularizer or implicitly in the design of an online update tion of the very same duality is for deriving and analyzing rule). Central to understanding these algorithms is the man- boosting algorithms [Shalev-Shwartz and Singer, 2008]. ner in which these penalty functions are strictly convex, i.e. More recently, Kakade et al. [2008] showed how to use the behavior of the “gap” by which these convex functions the very same duality for bounding the Rademacher com- lie above their tangent planes (which is strictly positive for plexity of classes of linear predictors. That the Rademacher complexity is closely related to Fenchel duality was shown in former, we are able to show how the ||·||2,1 “group” norm en- Meir and Zhang [2003], and the work in Kakade et al. [2008] joys certain (shared) feature selection properties (with only made the further connection to strong convexity. Again, logarithmic dependence on the number of features). In the under this characterization, a number of generalization and latter, we show how kernel learning (learning a kernel as a margin bounds (for methods which use linear prediction) convex combination of base kernels) has only a mild depen- are immediate corollaries, as one only needs to specify the dence on the number of base kernels used (only logarithmic). strong convexity parameter from which these bounds easily follow (see Kakade et al. [2008] for details). 2 The duality of strong convexity and strong The concept of strong smoothness (essentially a second smoothness order upper bound on a function) has also been in play in a different literature, for the analysis of the concentration of 2.1 Preliminaries martingales in smooth Banach spaces [Pinelis, 1994, Pisier, Here, we briefly recall some key definitions from convex 1975]. This body of work seeks to understand the concen- analysis that are useful throughout the paper (for details, see tration properties of a random variable ||Xt||, where Xt is a any of the several excellent references on the subject, e.g. (vector valued) martingale and ||·|| is a smooth norm, say an Borwein and Lewis [2006], Rockafellar [1970]). Lp-norm. We consider convex functions f : X → R ∪ {∞}, where Recently, Juditsky and Nemirovski [2008] proved that X is a Euclidean equipped with an inner product a norm is strongly convex if and only if its conjugate is h·, ·i. We denote R∗ = R ∪ {∞}. strongly smooth. This duality was useful in deriving con- centration properties of a random variable ||M||, where now Definition 1 Given a convex function f : X → R∗, its sub- M is a random matrix. The norms considered here were the differential at x ∈ X , denoted by ∂f(x), is defined as, L ||M|| L (Schatten) p-matrix norms (where p is the p norm ∂f(x) := {y ∈ X : ∀z, f(x + z) ≥ f(x) + hy, zi} of the singular values of M) and certain “block” composed norms (such as the || · || norm). ∗ 2,q Definition 2 Given a convex function f : X → R , the Fenchel conjugate f ? : X → R∗ is defined as 1.2 Our Contributions f ?(y) := sup hx, yi − f(x) The first contribution of this paper is to further distill this x∈X theory of strong convexity. While Shalev-Shwartz [2007] have shown that strong convexity (of general functions) im- We also deal with a variety of norms in this paper. Recall plies strong smoothness, here we show that the other di- the definition of the dual norm. rection also holds and thus the two notions are equivalent. This result generalizes the recent results of Juditsky and Ne- Definition 3 Given a norm k · k on X , its dual k · k? is the mirovski [2008] to functions rather than norms. This gen- norm (also on X ) defined as, eralization has a number of consequences which this work kyk? := sup{hx, yi : kxk ≤ 1} explores. For example, in Corollary 7, we isolate an impor- tant inequality that follows from the strong-convexity/strong- An important property of the dual norm is that the 1 2 1 2 smoothness duality and show that this inequality alone seam- Fenchel conjugate of the function 2 kxk is 2 kyk?. lessly yields regret bounds and Rademacher bounds. The definition of Fenchel conjugate implies The second contribution of this paper is in deriving new ∀x, y, f(x) + f ?(y) ≥ hx, yi , families of strongly convex (smooth) functions. To do so, we rely and further generalize the recent results of Juditsky and which is known as the Fenchel-Young inequality. An equiv- Nemirovski [2008]. In particular, we obtain a strongly con- alent and useful definition of the subdifferential can be given vex function over matrices based on strongly convex vector in terms of the Fenchel conjugate, functions, which leads to a number of corollaries relevant to ∂f(x) = {y ∈ X : f(x) + f ∗(y) = hx, yi} problems of recent interest. Furthermore, this characterization allows us to place a 2.2 Main result wider class of online algorithms (along with regret bounds) Recall that the domain of a function f : X → R∗ is the set as special cases of the general primal-dual framework de- of x such that f(x) < ∞ (allowing f to take infinite values veloped in Shalev-Shwartz [2007]. Examples which are is the effective way to restrict its domain to a proper subset now immediate corollaries include: online PCA [Warmuth of X ). We first define strong convexity. and Kuzmin, 2006], the perceptron algorithm derived with a Schatten norm complexity function [Cavallanti et al., 2008], Definition 4 A function f : X → R∗ is β-strongly convex and the multi-task algorithm of Agarwal et al. [2008]. These w.r.t. a norm k · k if for all x, y in the relative interior of the corollaries follow once we characterize a certain strong con- domain of f and α ∈ (0, 1) we have vexity parameter (along with the derivative of the conjugate function) — here, a family of online algorithms all enjoy the f(αx + (1 − α)y) ≤αf(x) + (1 − α)f(y) same regret bound, with no further analysis required. 1 2 − 2 βα(1 − α)kx − yk Finally, we use the generality of our results for obtaining new (and sharper) generalization bounds for various applica- We now define strong smoothness. Note that a strongly tions, including the group Lasso and kernel learning. In the smooth function f is always finite. Definition 5 A function f : X → R is β-strongly smooth Algorithm 1 Follow the Regularized Leader w.r.t. a norm k · k if f is everywhere differentiable and if for ? w1 ← ∇f (0) all x, y we have 1 for t = 1 to T do f(x + y) ≤ f(x) + h∇f(x), yi + βkyk2 Play w ∈ S 2 t Receive lt and pick vt ∈ ∂lt(wt) ?  Pt  The following central theorem shows that strong convex- wt+1 ← ∇f −η s=1 vt ity and strong smoothness are dual properties. Recall that the biconjugate f ?? equals f if and only if f is closed and end for convex.

Theorem 6 (Strong/Smooth Duality) Assume that f is a Proof: Apply Corollary 7 to the sequence −ηv1,..., −ηvT closed and convex function. Then f is β-strongly convex to get, for all u, ? 1 w.r.t. a norm k · k if and only if f is β -strongly smooth T T T w.r.t. the dual norm k · k?. X X 1 X −η hv , ui − f(u) ≤ −η hv , w i + kηv k2 . t t t 2β t ? Subtly, note that while the domain of a strongly convex t=1 t=1 t=1 function f may be a proper subset of X (important for a num- ? Using the that lt is V -Lipschitz, we get kvtk? ≤ V . Plugging ber of settings), its conjugate f always has a domain which this into the inequality above and rearranging gives, is X (since f ? is strongly smooth then it is finite and every- where differentiable). T 2 X f(u) ηV T The proof is provided in the appendix. hv , w − ui ≤ + . t t η 2β t=1 2.3 Machine learning implications of the strong-convexity / strong-smoothness duality By convexity of lt, lt(wt)−lt(u) ≤ hvt, wt − ui. Therefore, The following direct corollary of Thm. 6 is central in proving T T X X f(u) ηV 2T both regret and generalization bounds. l (w ) − l (u) ≤ + . t t t η 2β t=1 t=1 Corollary 7 If f is β strongly convex w.r.t. k·k and f ?(0) = Take the max over u ∈ S on both sides to finish the proof. 0, then, for any sequence v1, . . . , vn and for any u we have n X ? 2.3.2 Rademacher Bound hvi, ui − f(u) ≤ f (v1:n) i=1 Let X be an input space. Let T = (X1,...,Xn) be a dataset n n consisting of i.i.d. samples from some fixed distribution on X 1 X X ≤ h∇f ?(v ), v i + kv k2 X . For a class of real valued functions F ⊆ R , define its 1:i−1 i 2β i ? i=1 i=1 Rademacher complexity on T to be " n # where v denotes the sum Pi v . 1 X 1:i j=1 j R (F) := sup  f(X ) . T E n i i f∈F i=1 Proof: The first inequality is Fenchel-Young and the second is from the definition of smoothness by induction. Here, the expectation is over i’s, which are i.i.d. Rademacher random variables, i.e. P(i = −1) = P(1 = From this we can easily obtain regret bounds and 1 +1) = 2 . Since T is random, this is a random variable. We Rademacher bounds. can also take expectation over the choice of T and define, 2.3.1 Regret Bound Rn(F) := E [RT (F)] Algorithm 1 provides one common algorithm (Follow the Regularized Leader) which achieves the following regret which gives us a number that is a function of the input space, bound. It is one of a family of algorithms which enjoys the the function class and the sample size n. It is well known that same regret bound (see Shalev-Shwartz [2007]). bounds on Rademacher complexity of a class immediately yield generalization bounds for classifiers picked from that Theorem 8 (Regret) Suppose Algorithm 1 is used with a class. Recently, Kakade et al. [2008] proved Rademacher function f that is β-strongly convex w.r.t. a norm k · k on complexity bounds for classes consisting of linear predic- ? tors using strong convexity arguments. We now give a quick S and has f (0) = 0. Suppose the loss functions lt are con- proof of their main result using Corollary 7. This proof is vex and V -Lipschitz w.r.t. the dual norm k · k?. Then, the algorithm run with any positive η enjoys the regret bound, essentially the same as their original proof but highlights the importance of Corollary 7. T T 2 X X maxu∈S f(u) ηV T lt(wt) − min lt(u) ≤ + Theorem 9 (Generalization) Let f be a β-strongly convex u∈S η 2β t=1 t=1 function w.r.t. a norm k · k on S such that f ?(0) = 0. Let m×n X = {x : kxk? ≤ X} and W = {w : f(w) ≤ fmax}. Recall that any matrix X ∈ R can be decomposed as, Consider the class of linear functions, X = UDiag(σ(X))V F = {x 7→ hw, xi : w ∈ W} . where σ(X) denotes the vector (σ1, σ2, . . . σl) (l = Then, for any dataset T ∈ X n, we have min{m, n}), where σ1 ≥ σ2 ≥ ... ≥ σl ≥ 0 are the s singular values of X arranged in non-increasing order, and 2fmax m×m n×n RT (F) ≤ X . U ∈ R ,V ∈ R are orthogonal matrices. Also, any βn matrix X ∈ Sn can be decomposed as, Therefore, the same bound holds for Rn(F). X = UDiag(λ(X))U >

Proof: Let λ > 0. Apply Corollary 7 with u = w and where λ(X) = (λ1, λ2, . . . λn), where λ1 ≥ λ2 ≥ ... ≥ λn vi = λiXi to get, are the eigenvalues of X arranged in non-increasing order, n n and U is an orthogonal matrix. Two important results re- X λ2 X sup hw, λ X i ≤ k X k2 + sup f(w) late matrix inner products to inner products between singular i i 2β i i ? w∈W i=1 i=1 w∈W (and eigen-) values n X + h∇f ?(v ),  X i Theorem 10 (von Neumann) Any two matrices X,Y ∈ 1:i−1 i i m×n i=1 R satisfy the inequality λ2X2n hX,Y i ≤ hσ(X), σ(Y )i . ≤ + f 2β max Equality holds above, if and only if, there exist orthogonal n X ? U, V such that + h∇f (v1:i−1), iXii . i=1 X = UDiag(σ(X))VY = UDiag(σ(Y ))V. Now take expectation on both sides. The left hand side is Theorem 11 (Fan) Any two matrices X,Y ∈ n satisfy the nλRT (F) and the last term above becomes zero. Dividing S throughout by nλ, we get, inequality hX,Y i ≤ hλ(X), λ(Y )i . 2 λX fmax RT (F) ≤ + . Equality holds above, if and only if, there exists orthogonal 2β nλ U such that Optimizing over λ gives us the result. X = UDiag(λ(X))U > Y = UDiag(λ(Y ))U > .

3 Examples of strongly convex matrix We say that a function g : Rn → R∗ is symmetric if g(x) functions is invariant under arbitrary permutations of the components of x. We say g is absolutely symmetric if g(x) is invariant We now provide examples of strongly convex functions over under arbitrary permutations and sign changes of the compo- matrices, which have a number of algorithmic implications. nents of x. We begin by understanding the strong convexity properties Given a function f : Rl → R∗, we can define a function of functions which only depend on the singular values of the f ◦ σ : Rm×n → R∗ as, matrix — this class includes the Schatten norms. We then turn to understanding norms of matrices which constructed (f ◦ σ)(X) := f(σ(X)) . in a certain “group” manner, where a norm is first applied n ∗ Similarly, given a function g : R → R , we can define a to each column of the matrix (to obtain a vector) and then a n ∗ function g ◦ λ : S → R as, norm is applied to this resultant vector. This class for norms include the || · ||2,1 norm of recent interest (e.g. for the group (g ◦ λ)(X) := g(λ(X)) . Lasso). We start by presenting tools useful for analyzing matri- This allows us to define functions over matrices starting from functions over vectors. Note that when we use f ◦ σ we are ces, borrowing heavily from Lewis [1995] and Juditsky and m×n n Nemirovski [2008]. In fact, Juditsky and Nemirovski [2008] assuming that X = R and for g ◦ λ we have X = S . already proved the strong smoothness for Schatten p-norms. The following result allows us to immediately compute the We provide additional results for the entropy based matrix conjugate of f ◦ σ and g ◦ λ in terms of the conjugates of f functions. Also, our results on the group norms are more and g respectively. general that those in Juditsky and Nemirovski [2008]. Theorem 12 (Lewis [1995]) Let f : Rl → R∗ be an abso- 3.1 of matrix functions lutely symmetric function. Then, We consider the vector space X = Rm×n of real matrices (f ◦ σ)? = f ? ◦ σ . of size m × n and the vector space X = n of symmetric S n ∗ matrices of size n × n, both equipped with the inner product, Let g : R → R be a symmetric function. Then, ? ? hX,Y i := Tr(X>Y ) . (g ◦ λ) = g ◦ λ . Proof: Lewis [1995] proves the case for singular values. For 3.2 Strongly convex matrix functions the eigenvalue case, the proof is entirely analogous to that in We first provide results on functions which only depend on Lewis [1995], except that Fan’s inequality is used instead of the singular values of a matrix and then provide results on von Neumann’s inequality. group norms. Using this general result, we are able to define certain Unitarily invariant matrix functions. Our first result is matrix norms. on the p-Schatten norm kXkS(p) := kσ(X)kp (which fol- lows from results in Juditsky and Nemirovski [2008]) and l ∗ Corollary 13 (Matrix norms) Let f : R → R be abso- on an entropy-based matrix function. lutely symmetric. Then if f = k · k is a norm on Rl then f ◦ σ = kσ(·)k is a norm on Rm×n. Further, the dual of this Theorem 16 (Schatten and entropic matrix functions) norm is kσ(·)k . ? • Define F (X) = P λ (X) log(λ (X)) on its domain: Let g : Rn → R∗ be symmetric. Then if g = k · k is a i i i n n n norm on R then g ◦ λ = kλ(·)k is a norm on S . Further, {X ∈ S : X  0, Tr(X) = 1}, the dual of this norm is kλ(·)k . ? i.e. the set of symmetric positive semidefinite matrices with trace 1, and F (X) = ∞ elsewhere (on n). We Another nice result allows us to compute subdifferentials S have that F (X) is 1/2-strongly convex w.r.t. the trace of f ◦ σ and g ◦ λ (note that elements in the subdifferential norm kλ(X)k . of f ◦ σ and g ◦ λ are matrices) from the subdifferentials of 1 1 2 f and g respectively. • For p ∈ [1, 2], the function F (X) = 2 kσ(X)kp is min{ 1 , p − 1}-strongly convex w.r.t. the p-Schatten l ∗ 2 Theorem 14 (Lewis [1995]) Let f : R → R be absolutely norm kXkS(p) := kσ(X)kp. symmetric and X ∈ Rm×n. Then, Proof: For the first part, we prove that the function (g ◦ > ∂(f ◦ σ)(X) = {UDiag(µ)V : µ ∈ ∂f(σ(X)) λ)(X) is 2-smooth on Sn w.r.t. kλ(X)k∞ where > n ! U, V orthogonal,X = UDiag(σ(X))V } X g(x) = log exp(xi) . n ∗ n Let g : R → R be symmetric and X ∈ S . Then, i=1 ? ? > Since g is symmetric, by Thm. 12, (g ◦ λ) is g ◦ λ, where ∂(g ◦ λ)(X) = {UDiag(µ)U : µ ∈ ∂g(λ(X)) g? can be shown to be the function > U orthogonal,X = UDiag(λ(X))U } n ? X g (x) = xi log xi Proof: Again, Lewis [1995] proves the case for singular val- i=1 ues. For the eigenvalue case, again, the proof is identical {x ≥ 0 : P x = 1} g?(x) = ∞ to that in Lewis [1995], except that Fan’s inequality is used with domain i i and instead of von Neumann’s inequality. elsewhere. Note that by Thm. 6, 2-smoothness of (g ◦ λ) implies 1/2-strong convexity of (g ◦ λ)?. Our final tool is a technical result from Juditsky and Ne- Let us now prove 2-smoothness of g ◦ λ. Fix arbitrary mirovski [2008]. X,H ∈ Sn, and define n X Lemma 15 (Juditsky and Nemirovski [2008]) Let ∆ be an f(t) = exp(λi(X + tH)) ∗ open interval. Suppose φ : ∆ → R is a twice differen- i=1 tiable convex function such that φ00 is monotonically non- and let h(t) = log(f(t)). Note that h(t) = (g ◦λ)(X +tH). decreasing. Let Sn(∆) be the set of all symmetric n × n matrices with eigenvalues in ∆. Define the function F : To prove 2-smoothness of g ◦ λ, it suffices to prove n ∗ 00 2 S (∆) → R h (0) ≤ 2kλ(H)k∞ . n X By the chain rule, F (X) = φ(λ (X)) i (f 0(t))2 f 00(t) i=1 h00(t) = − + . f(t)2 f(t) and let 00 f(t) = F (X + tH) The first term in non-positive and therefore h (0) ≤ f 00(0)/f(0). By Lemma 15 (with φ(x) = exp(x)), n n for some X ∈ S (∆),H ∈ S . Then, we have, n 00 X 2 n f (0) ≤ 2 exp(λi(X))λi(H) 00 X 00 2 f (0) ≤ 2 φ (λi(X))λi(H) . i=1 i=1 n 2 X ≤ 2kλ(H)k∞ exp(λi(X)) Proof: This follows directly from Proposition 3.1 in Juditsky i=1 and Nemirovski [2008]. 2 = 2kλ(H)k∞f(0) , 00 00 2 whence h (0) ≤ f (0)/f(0) ≤ 2kλ(H)k∞. previous algorithms/regret bounds are now special cases of For the second part, let q be the dual exponent of p, i.e. the results herein (using the family of algorithms described 1/q + 1/p = 1. Note that kσ(X)kp and kσ(X)kq are dual in Shalev-Shwartz [2007]), including online PCA [Warmuth norms by Corollary 13. Now we use the result [Juditsky and and Kuzmin, 2006], the perceptron algorithm derived with a 1 2 Nemirovski, 2008, Example 3.3] which says that 2 kσ(X)kq Schatten norm [Cavallanti et al., 2008], and the multi-task is max{2, q −1}-smooth w.r.t. kσ(X)kq. Hence, by Thm. 6, algorithm (using the || · ||2,1 group norm) of Agarwal et al. 1 2 1 [2008]. Note that in order to derive an online algorithm with kσ(X)kp is min{ , p−1}-strongly convex w.r.t. kσ(X)kp. 2 2 penalty F (e.g. as in Algorithm 1), we must specify ∇F ?, which is often straightforward to compute using the calculus of certain matrix functions discussed in Subsection 3.1. Group Norms. Let X = (X1X2 ...Xn) be a m × n real m We now demonstrate how to obtain a few generalization matrix with columns Xi ∈ R . Given norms Ψ and Φ on m n and regret bounds for problems of recent interest. R and R , we define the norm kXkΨ,Φ as

kXkΨ,Φ := Φ(Ψ(X1),..., Ψ(Xn)) . 4.1 Group Lasso Consider the setting of k-multivariate regression or classi- That is, we apply Ψ to each column of X to get a vector in n R fication problems, where the dataset consists of i.i.d. pairs to which we apply the norm Φ to get the value of kXk . Ψ,Φ (x , y ) where x ∈ d is an example vector and y ∈ k It is easy to check that this is indeed a norm. i i i R i R are the responses for k different problems. To predict the k An important special case is when Φ = k · k and Ψ = r responses, we learn a matrix W ∈ k×d such that W x is a k · k for r, p ≥ 1. In this case, we denote the norm k · k R p Ψ,Φ good predictor of y. The rows W (1 ≤ i ≤ k) are the linear by k · k . i,· p,r predictors for the individual problems. If the same features The dual of k · k is also easily calculated from the Ψ,Φ are going to be relevant across the k problems, then natural duals Ψ and Φ of Ψ and Φ under a mild condition on Φ. ? ? block regularization schemes have been already proposed in n the literature [Yuan and Lin, 2006]. With the squared loss Lemma 17 Let Φ be an absolutely symmetric norm on R . Then these schemes try to solve an optimization problem of the form, (k · kΨ,Φ)? = k · kΨ?,Φ? n 1 X 2 min kyi − W xik2 + λkW kp,r (1) Proof: See Sec. A.2 in the Appendix. W n i=1 We now state our main theorem for group norms. A spe- for some p, r ∈ [1, 2] and λ > 0. For some choices of cial case of this theorem (when Φ = k · ks, s ≥ 2) appeared (p, r) there is no coupling across problems. For example, in Juditsky and Nemirovski [2008]. We not only generalize the choices (2, 2) and (1, 1) exactly correspond to solving k their result but also provide a simpler proof. independent L2- and L1-regularized problems respectively. However, for other choices, we get more interesting coupling Theorem 18 (Group Norms) Let Ψ, Φ be absolutely sym- √ of the k problems. For example, the group Lasso choice sets metric norms on m, n. Let Φ2 ◦ : n → ∗ denote R R R R p = 2 and r = 1. That is, we take the L2-norm of the k the following function, columns of W and add them up. √ √ √ (Φ2 ◦ )(x) := Φ2( x ,..., x ) . Let us focus on the constrained form of the group Lasso 1 n problem Eq. (1), 2 √ n Suppose, (Φ ◦ ) is a norm on R . Further, let the functions 1 n 2 2 X 2 ¯ Ψ and Φ be σ - and σ -smooth w.r.t. Ψ and Φ respectively. min kyi − W xik2 s.t. kW k2,1 ≤ W2,1 1 2 W n 2 i=1 Then, k · kΨ,Φ is (σ1 + σ2)-smooth w.r.t. k · kΨ,Φ. for some W¯ 2,1 > 0. Proof: See Sec. A.2 in the Appendix. In order to obtain generalization bounds for the solution

Lemma 17 implies that (k · kp,r)? is k · kq,s where 1/p + of this problem, we need to control the Rademacher com- 2 √ plexity of the function class, 1/q = 1 and 1/r+1/s = 1. Moreover, when s ≥ 2, k·ks ◦ 2 is simply k · k s which is a norm. Thm. 18 now gives us the F = {(x, y) 7→ ky − W xk : kW k2,1 ≤ W¯ 2,1} . (2) 2 2 following corollary. Theorem 20 (Group Lasso) Let the distribution of x, y be 1 2 such that kxk∞ ≤ X∞ and kyk2 ≤ Y2 a.s. Then, for the Corollary 19 Let q, s ≥ 2. The function 2 k · kq,s is (q + s − m×n class defined above in Eq. (2), we have 2)-smooth w.r.t. k · kq,s on R . √ 2 Y + eW¯ X log d R (F) ≤ 2 2,√1 ∞ 4 Applications n n The potential of this framework is that once we characterize Note that this bound shows feature selection properties the β-strong convexity of our penalty function F (e.g over of the group Lasso, in the following sense: if there are q rel- matrices or other abstract convex sets), then we often im- evant (shared) features across all problems (whose weights mediately obtain both a family of online algorithms (along 2 q √log d with their regret bounds) and generalization bounds. In fact, are bounded), then the above bound scales as O( n ). for matrix based penalty functions, a number of dedicated The above bound directly leads to a generalization bound. Proof: We have In particular, we can choose a finite set {K1,...,Kk} of ky − W xk2 = y>y − 2y>W x + x>W >W x base kernels and consider the convex combinations, 2   > >  > >  k k = y y − 2Tr y W x + Tr x W W x + X X  K = µjKj : µj ≥ 0, µj = 1 . > >  > > c = y y − 2Tr xy W + Tr W W xx j=1 j=1  > > > > = y y − 2 yx ,W + W W, xx This is the unconstrained function class. In applications, one where the inner products appearing in the last line are matrix constrains the function class in some way. The class consid- inner products. Now consider the classes: ered in Lanckriet et al. [2004] is  F = {(x, y) 7→ 2 W, yx> : kW k ≤ W¯ } , n k 1 2,1 2,1  X X F + = x 7→ αiK(xi, ·): K = µjKj, µj ≥ 0, F = {(x, y) 7→ W >W, xx> : kW k ≤ W¯ } . Kc 2 2,1 2,1  i=1 j=1 It is straightforward to show:  k  R (F) ≤ R (F ) + R (F ) . (3) X > 2 n n 1 n 2 µj = 1, α K(T )α ≤ 1/γ (6) For F1, we use Thm. 9 with k · k = k · k2,r for r ∈ j=1  (1, 2] and f(W ) = 1 kW k2 . Let 1/r + 1/s = 1, so that 2 2,r where γ > 0 is a margin parameter and K(T )i,j = 1 2 s ∈ [2, ∞). By Corollary 19, 2 k · k2,s is s-smooth. Hence, K(xi, xj) is the n × n Gram matrix of the kernel K on the 1 2 dataset T . by Thm. 6, its conjugate 2 k · k2,r is 1/s-strongly convex. > 1/s Moreover kyx k2,s ≤ d Y2X∞. Now, Thm. 9 gives us, Theorem 21 (Kernel learning) Consider the class F + de- Kc r s fined in Eq. (6). Let K (x, x) ≤ B for 1 ≤ j ≤ k and R (F ) ≤ 2d1/sY X W¯ . j n 1 2 ∞ 2,1 n x ∈ X . Then, s Setting s = log d gives, B log k RT (F + ) ≤ e . r Kc 2 log d γ n Rn(F1) ≤ 2eY2X∞W¯ 2,1 . (4) n Before we present the proof, first note that the depen- For F2, note that dence on the number of features, k, is rather mild (only loga- X rithmic) — implying that we can learn a kernel as a (convex) kW >W k = | hW ,W i | 1,1 ·,i ·,j combination of a rather large number of base kernels. i,j Also, let us discuss how the above improves upon the X ≤ kW·,ik2 · kW·,jk2 prior bounds provided by Lanckriet et al. [2004] and Sre- i,j bro and Ben-David [2006] (neither of which had logarithmic q Bk  X X k dependence). The former proves a bound of O 2 = kW·,ik2 · kW·,jk2 γ n i j which is quite inferior to our bound. We cannot compare our bound directly to the bound in Srebro and Ben-David = kW k2 . 2,1 [2006] as they do not work with Rademacher complexities. > 2 Also, kxx k∞,∞ ≤ X∞. Now, using the L∞/L1 result However, if one compares the resulting generalization error from Sec. 3.1 in Kakade et al. [2008], we get bounds, then their bound is r s 3  2 log d k log n B + B log √γn log nB 2 ¯ 2 γ2k γ2 B γ2 Rn(F2) ≤ X∞W2,1 . (5) O n  n  The result follows from Eq. (3), with Eq. (4) and Eq. (5). and ours is s ! 4.2 Kernel Learning B log k O . We briefly review the kernel learning setting first explored γ2n in Lanckriet et al. [2004]. Let X be an input space and let n If k ≥ n, their bound is vacuous (while ours is still meaning- T = (x1,..., xn) ∈ X be the training dataset. Kernel ful). If k ≤ n, our bound is better. algorithms work with the space of linear functions, Proof: Let Hj be the RKHS of Kj, ( n ) X ( l ) x 7→ αiK(xi, x): αi ∈ R . X l Hj = αiKj(x˜i, ·): l > 0, x˜i ∈ X , α ∈ . i=1 R i=1 In kernel learning, we consider a kernel family K and con- equipped with the inner product sider the class, * l m + ( n ) X X 0 0 X 0 0 X αiKj(x˜i, ·), α Kj(x˜ , ·) = αiα Kj(x˜i, x˜ ) x 7→ αiK(xi, x): K ∈ K, αi ∈ R . i j j j i=1 j=1 i,j i=1 Hj Consider the space H = H1 × ... × Hk equipped with the where the second inequality is by Cauchy-Schwarz. Hence, inner product, Eq. (7) holds. √ Since kK (x, ·)k ≤ B, we also have kφ~(x)k ≤ k √ j Hj 2,s X 1/s 1 h~u,~vi := hu , v i . k B for any x ∈ X . By Corollary 19, k · k2,s is s- i i Hi 2 smooth. Hence, by Thm. 6, its conjugate 1 k · k2 is 1/s- i=1 2 2,r √ strongly convex. Now, we use Thm. 9 with X = k1/s B, Let r, s be dual exponents with r ∈ (1, 2], s ∈ [2, ∞). For 1 2 fmax = γ and β = 1/s, to get ~w ∈ H, let k · k2,r be the norm defined by 2 s 1 Bs k ! r 1/s Rn(F + ) ≤ Rn(Fr) ≤ k . X Kc 2 k~wk = kw kr . γ n 2,r i Hi i=1 Setting s = log k finishes the proof. We now claim that 4.3 Online Control with Quadratic Costs F + ⊆ F (7) Kc r Consider a finite horizon control problem where at each time where step t, the learner (or controller in this context) has to choose D ~ E a “control” direction u ∈ n, ku k = 1. However, instead Fr := {x 7→ ~w, φ(x) : ~w ∈ H, k~wk2,r ≤ 1/γ} , t R t 2 of a fixed quadratic cost function, as is assumed in Linear- and Quadratic control, the cost function is chosen adversarially. φ~(x) = (K (x, ·),...,K (x, ·)) ∈ H . More specifically, at each time step t, the adversary chooses 1 k n a positive semidefinite matrix Ct ∈ S and the learner incurs To see this, pick arbitrary an arbitrary f in F + . Thus, for > Kc the cost ut Ctut. As is usual, we define the learner’s regret some αi’s and µj’s, we have, after T time steps to be T T n  K  X > X > X X ut Ctut − inf u Ctu . f(x) = αi  µjKj(xi, x) u : kuk =1 t=1 2 t=1 i=1 j=1 Note that this is similar to the online variance minimization k n X X problem described in Warmuth and Kuzmin [2006]. Suppose = µjαiKj(xi, x) Algorithm 2 is run with j=1 i=1 n D E X = ~w, φ~(x) F (X) = λi(X) log(nλi(X)) . (8) i=1 P where ~w ∈ H is such that Since F = f ◦ λ for f(x) = i xi log(nxi), we have, by ? ? ? P n Thm. 12, F = f ◦λ where f (y) = log (( i exp(yi))/n). X X = UDiag(λ(X))U > wj = µjαiKj(xi, ·) ∈ Hj . Also, by Thm. 14,if then i=1 ∇F ?(X) = UDiag(∇f ?(λ(X)))U > Moreover, exp(λ (X)) exp(λ (X)) = UDiag 1 ,..., n U > 2 2 Z Z k~wk2,r ≤ k~wk2,1 where Z = Pn exp(λ (X)).  2 i=1 i k n X X Theorem 22 (Control) Suppose Algorithm 2 is run with the =  µjαiKj(xi, ·)  j=1 i=1 choice of F given in Eq. (8) and the sequence Ct is such that Hj C ∈ n, kλ(C )k ≤ K. Then, we have, for any η > 0,  2 t S t ∞ k n " T T # X X = µ α K (x , ·) X > X > log n 2  j i j i  E ut Ctut − inf u Ctu ≤ + ηK T. u : kuk2=1 η j=1 i=1 Hj t=1 t=1 k n 2 X X Proof: Note that " n # ≤ µj αiKj(xi, ·)  >  X > j=1 i=1 Hj E ut Ctut = E λt,ivt,iCtvt,i k i=1 X " n # = µ α>K (T )α j j X > j=1 = E λt,i Ct, vt,ivt,i i=1  k  "* n +# > X = α  µjKj(T ) α X > = E Ct, λt,ivt,ivt,i j=1 i=1 2 ≤ 1/γ , = E [hCt,Wti] Algorithm 2 Online Control with Quadratic Costs Y. Nesterov. Primal-dual subgradient methods for convex problems. Technical report, Center for Operations Research and Econometrics (CORE), Catholic University of S1 ← 0 Louvain (UCL), 2005. W1 ← Diag(1/n, 1/n, . . . , 1/n) I. Pinelis. Optimum bounds for the distributions of martingales in banach spaces. Ann. Probab, 22(4):1679–1706, 1994. G. Pisier. Martingales with values in uniformly convex spaces. Israel J. Math., 20: for t = 1 to T do 326–350, 1975. Compute the eigen-decomposition R.T. Rockafellar. Convex Analysis. Princeton University Press, 1970. S. Shalev-Shwartz. Online Learning: Theory, Algorithms, and Applications. PhD n thesis, The Hebrew University, 2007. X W = λ v v> S. Shalev-Shwartz and Y. Singer. Convex repeated games and Fenchel duality. In t t,i t,i t,i Advances in Neural Information Processing Systems 20, 2006. i=1 S. Shalev-Shwartz and Y. Singer. A primal-dual perspective of online learning algo- rithms. Machine Learning Journal, 2007. Choose it such that P(it = j) = λt,j S. Shalev-Shwartz and Y. Singer. On the equivalence of weak learnability and linear separability: New relaxations and efficient boosting algorithms. In Proceedings of Apply “control” ut = vt,it n the Nineteenth Annual Conference on Computational Learning Theory, 2008. Receive cost matrix Ct ∈ S N. Srebro and S. Ben-David. Learning bounds for support vector machines with > Incur cost ut Ctut learned kernels. In Proceedings of the Nineteenth Annual Conference on Com- putational Learning Theory, pages 169–183, 2006. St+1 ← St + Ct ? M. Warmuth and D. Kuzmin. Online variance minimization. In Proceedings of the Wt+1 ← ∇F (−ηSt+1) Nineteenth Annual Conference on Computational Learning Theory, 2006. M. Yuan and Y. Lin. Model selection and estimation in regression with grouped vari- end for ables. Journal of the Royal Statistical Society: Series B, 68(1):49–67, 2006.

Note that F ?(0) = 0. Now, Thm. 8 along with Thm. 16 A Technical Lemmas and Proofs gives us, A.1 The duality of strong convexity and strong T T smoothness X X log n ηK2T hCt,Wti − min hCt,W i ≤ + , Proof:[of Thm. 6] First, [Shalev-Shwartz, 2007, Lemma W ∈S η 2 · 1 t=1 t=1 2 15] yields the claim 1 ⇒ 2. It is left to prove that f is strongly convex assuming that f ? is strongly smooth. For where S is the set of symmetric positive semidefinite matri- ? > simplicity assume that β = 1. Denote g(y) = f (x + ces with trace 1. Note that if kuk2 = 1, then uu ∈ S and ? ? > > y)−(f (x)+h∇f (x), yi). By the smoothness assumption, Ct, uu = u Ctu. Therefore, 1 2 ? 1 2 g(y) ≤ 2 kyk?. This implies that g (a) ≥ 2 kak because of T T [Shalev-Shwartz and Singer, 2008, Lemma 19] and that the X X > log n 2 conjugate of half squared norm is half squared of the dual hCt,Wti − min u Ctu ≤ + ηK T. kuk =1 η t=1 2 t=1 norm. Using the definition of g we have Taking expectation now proves the result. g?(a) = sup hy, ai − g(y) y = sup hy, ai − (f ?(x + y) − (f ?(x) + h∇f ?(x), yi)) References y ? ? ? Alekh Agarwal, Alexander Rakhlin, and Peter Bartlett. Matrix regularization tech- = sup hy, a + ∇f (x)i − f (x + y) + f (x) niques for online multitask learning. Technical report, EECS Department, Univer- y sity of California, Berkeley, 2008. ? ? ? J. Borwein and A. Lewis. Convex Analysis and Nonlinear Optimization. Springer, = sup hz − x, a + ∇f (x)i − f (z) + f (x) 2006. z G. Cavallanti, N. Cesa-Bianchi, and C. Gentile. Linear algorithms for online multitask = f(a + ∇f ?(x)) + f ?(x) − hx, a + ∇f ?(x)i classification. In Proceedings of the Nineteenth Annual Conference on Computa- tional Learning Theory, pages 251–262, 2008. ?? N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge Univer- where we have used that f = f, in the last step. De- sity Press, 2006. note u = ∇f ?(x). From the equality in Fenchel-Young (e.g. C. Gentile. The robustness of the p-norm algorithms. Machine Learning, 53(3), 2002. A. J. Grove, N. Littlestone, and D. Schuurmans. General convergence results for linear [Shalev-Shwartz and Singer, 2008, Lemma 17]) we obtain ? discriminant updates. Machine Learning, 43(3):173–210, 2001. that hx, ui = f (x) + f(u) and thus A. Juditsky and A. Nemirovski. Large deviations of vector-valued martingales in 2- smooth normed spaces. submitted to Annals of Probability, 2008. g?(a) = f(a + u) − f(u) − hx, ai . S.M. Kakade, K. Sridharan, and A. Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Advances in Neural Informa- ? 1 2 tion Processing Systems 22, 2008. Combining with g (a) ≥ 2 kak , we have J. Kivinen and M. Warmuth. Relative loss bounds for multidimensional regression problems. Journal of Machine Learning, 45(3):301–329, July 2001. 1 2 J. Kivinen and M. Warmuth. Exponentiated gradient versus gradient descent for linear f(a + u) − f(u) − hx, ai ≥ kak , (9) predictors. Information and Computation, 132(1):1–64, January 1997. 2 G.R.G. Lanckriet, N. Cristianini, P.L. Bartlett, L. El Ghaoui, and M.I. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning which holds for all a, x, with u = ∇f ?(x). Research, 5:27–72, 2004. Now let us prove that for any point u0 in the relative inte- A. S. Lewis. The convex analysis of unitarily invariant matrix functions. Journal of 0 0 ? Convex Analysis, 2(2):173–183, 1995. rior of the domain of f that if x ∈ ∂f(u ) then u = ∇f (x). R. Meir and T. Zhang. Generalization error bounds for Bayesian mixture algorithms. Let u := ∇f ?(x) and we must show that u0 = u. By Journal of Machine Learning Research, 4:839–860, 2003. 0 ? 0 A. Nemirovski and D. Yudin. Problem complexity and method efficiency in optimiza- Fenchel-Young, we have that hx, u i = f (x) + f(u ), and, ?? tion. Nauka Publishers, Moscow, 1978. again by Fenchel-Young (and f = f), we have hx, ui = ? f (x)+f(u). We can now apply Equation Eq. (9), to obtain: Furthermore, since Ψ(Xi) = |θi|Ψ(Zi) = |θi| and Φ is ab- 0 = hx, ui − f(u) − (hx, u0i − f(u0)) solutely symmetric, we have kXkΨ,Φ = Φ(|θ|) = Φ(θ) = 1. Thus, the sup in Eq. (10) is at least kY k . 1 Ψ?,Φ? = f(u0) − f(u) − hx, u0 − ui ≥ ku0 − uk2 , 2 Proof:[of Thm. 18] Note that an equivalent definition of σ- which implies that u0 = ∇f ?(x). smoothness of a function f w.r.t. a norm k · k is that, for all Next, let u1, u2 be two points in the relative interior of x, y and α ∈ [0, 1], we have the domain of f, let α ∈ (0, 1), and let u = αu1 + (1 − 1 f(αx + (1 − α)y) ≥ αf(x) + (1 − α)f(y) α)u2. Let x ∈ ∂f(u) (which is non-empty ). We have that u = ∇f ?(x), by the previous argument. Now we are able 1 − σα(1 − α)kx − yk2 . to apply Equation Eq. (9) twice, once with a = u1 − u and 2 once with a = u2 − u (and both with x) to obtain m×n 1 Let X,Y ∈ R be arbitrary matrices with columns Xi f(u ) − f(u) − hx, u − ui ≥ ku − uk2 and Y respectively. We need to prove 1 1 2 1 i 1 2 2 2 f(u ) − f(u) − hx, u − ui ≥ ku − uk2 k(1 − α)X + αY kΨ,Φ ≥ αkXkΨ,Φ + (1 − α)kY kΨ,Φ 2 2 2 2 1 2 Finally, summing up the above two equations with coeffi- − (σ1 + σ2)α(1 − α)kX − Y kΨ,Φ . (11) cients α and 1 − α we obtain that f is strongly convex. 2 We have, A.2 Group Norms 2 k(1 − α)X + αY kΨ,Φ Proof:[of Lemma 17] Recall that, by definition, 2 = Φ (..., Ψ(αXi + (1 − α)Yi),...) (kY kΨ,Φ)? = sup{hX,Y i : kXkΨ,Φ ≤ 1} 2 √ 2 n = (Φ ◦ )(..., Ψ (αXi + (1 − α)Yi),...) X = sup{ hXi,Yii : kXkΨ,Φ ≤ 1} (10) 2 √ 2 2 ≥ (Φ ◦ )(. . . , αΨ (Xi) + (1 − α)Ψ (Yi) i=1 1 2 We first prove (kY k ) ≤ kY k . Let Ψ(~ X) be a − σ1α(1 − α)Ψ (Xi − Yi),...) Ψ,Φ ? Ψ?,Φ? 2 shorthand for (Ψ(X1),..., Ψ(Xn)). Now, we have, 2 √ 2 2 n n ≥ (Φ ◦ )(. . . , αΨ (Xi) + (1 − α)Ψ (Yi),...) X X hXi,Yii ≤ Ψ(Xi)Ψ?(Yi) 1 2 √ 2 − σ1α(1 − α)(Φ ◦ )(..., Ψ (Xi − Yi),...) i=1 i=1 2 D E 2 p 2 2 = Ψ(~ X), Ψ~?(Y ) = Φ (..., αΨ (Xi) + (1 − α)Ψ (Yi),...) 1 2 ≤ Φ(Ψ(~ X)) · Φ (Ψ~ (Y )) − σ1α(1 − α)kX − Y k . (12) ? ? 2 Ψ,Φ = kXk · kY k . Ψ,Φ Ψ?,Φ? Now, we use that, for any x, y ≥ 0 and α ∈ [0, 1], we have Thus, the sup in Eq. (10) is no more than kY kΨ?,Φ? . n p 2 2 To prove (kY kΨ,Φ)? ≥ kY kΨ?,Φ? , let θ ∈ R be such αx + (1 − α)y ≥ αx + (1 − α)y . that Φ(θ) = 1 and Thus, we have D ~ E ~ θ, Ψ?(Y ) = Φ?(Ψ?(Y )) = kY kΨ?,Φ? . 2 p 2 2 m Φ (..., αΨ (Xi) + (1 − α)Ψ (Yi),...) Further, for each i, let Zi ∈ be such that Ψ(Zi) = 1 and R 2 hZi,Yii = Ψ?(Yi). Now, let X be the matrix with columns ≥ Φ (. . . , αΨ(Xi) + (1 − α)Ψ(Yi),...) Xi = θiZi. Then, we have 2 2 ≥ αΦ (..., Ψ(Xi),...) + (1 − α)Φ (..., Ψ(Yi),...) n n X X 1 2 hXi,Yii = θi hZi,Yii − σ2α(1 − α)Φ (..., Ψ(Xi) − Ψ(Yi),...) i=1 i=1 2 n ≥ αkXk2 + (1 − α)kY k2 X Ψ,Φ Ψ,Φ = θiΨ?(Yi) 1 2 i=1 − σ2α(1 − α)Φ (..., Ψ(Xi − Yi),...) D E 2 ~ 2 2 = θ, Ψ?(Y ) = αkXkΨ,Φ + (1 − α)kY kΨ,Φ = kY k . 1 2 Ψ?,Φ? − σ α(1 − α)kX − Y k 2 2 Ψ,Φ 1The set ∂f(u) is not empty for all u in the relative interior of the domain of f. See the relative max formula in [Borwein and Plugging this into Eq. (12) proves Eq. (11). Lewis, 2006, page 42] or [Rockafellar, 1970, page 253]. If u is not in the interior of f, then ∂f(u) is empty. But, a function is defined to be essentially strictly convex if it is strictly convex on any subset of {u : ∂f(u) 6= ∅}. The last set is called the domain of ∂f and it contains the relative interior of the domain of f, so we’re ok here.