
Chapter 6

Stochastic approximation

This chapter is dedicated to stochastic approximation for large sums. Stochastic approximation has a long history starting with the Robbins-Monro algorithm [53] and the ODE method [38] of Ljung and later extensions; see [11] for a complete exposition and [16]. The idea of using stochastic approximation in large scale settings gained significant interest in the machine learning literature, see for example [17]. We provide examples of non-asymptotic convergence rate analyses for stochastic subgradient and stochastic proximal gradient for finite sums.

6.1 Motivation, large n

6.1.1 Lasso estimator

The Lasso estimator is given as follows:
$$\hat\theta \in \arg\min_{\theta \in \mathbb{R}^d} \frac{1}{2n}\|X\theta - Y\|_2^2 + \lambda \|\theta\|_1.$$
We have seen that this optimization problem has a favorable structure which allows devising efficient algorithms. Another way to write the same optimization problem is to consider

$$\hat\theta \in \arg\min_{\theta \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \frac{1}{2}(x_i^T\theta - y_i)^2 + \lambda \|\theta\|_1,$$
which actually exhibits an additional sum structure. In this chapter we will be considering optimization problems of the form

$$\min_{x \in \mathbb{R}^d} F(x) := \frac{1}{n}\sum_{i=1}^n f_i(x) + g(x), \qquad (6.1)$$
where the $f_i$ and $g$ are convex lower semicontinuous functions.

6.1.2 Stochastic approximation

To solve problem (6.1), one may use first order methods such as the ones described in the previous chapter. Computing a subgradient in this case requires computing a subgradient of each $f_i$, $i = 1, \dots, n$, and averaging them. The computational cost is of the order of $n$ subgradient computations and $n$ vector operations. When $n$ is very large, or even infinite, this could be prohibitive. Intuitively, if there is redundancy in the elements of the sum, one should be able to take advantage of it. For example, suppose that $f_i = f$ for all $i$; then blindly computing the gradient of $F$ has a cost of the order of $n \times d$, while only $d$ operations (computing one gradient) would suffice.

More generally, one could rewrite the objective function in (6.1) in the following form:

$$F : x \mapsto \mathbb{E}[f_I(x)] + g(x),$$
where $I$ denotes a uniform random variable over $\{1, \dots, n\}$. Stochastic approximation, or stochastic optimization, algorithms allow handling such objectives. The main algorithmic step is as follows:

• For any $x \in \mathbb{R}^d$,
• sample $i$ uniformly at random in $\{1, \dots, n\}$,
• perform an algorithmic step using only the value of $f_i(x)$ and $\nabla f_i(x)$, or possibly $v \in \partial f_i(x)$.

This simple example can be extended to more general random variables $I$: under proper integrability and domination conditions, one can exchange gradient (or subgradient) and expectation, assuming $g = 0$ for simplicity.

• If for each value of $I$, $f_I$ is continuously differentiable, we then have for any $x \in \mathbb{R}^d$,
$$\mathbb{E}[\nabla f_I(x)] = \nabla \mathbb{E}[f_I(x)] = \nabla F(x).$$
• Assume that $f_I$ is convex for all realizations of $I$, and that we have access to a random variable $v_I$ such that $v_I \in \partial f_I(x)$ almost surely; then the expectation is convex and
$$\mathbb{E}[v_I] \in \partial \mathbb{E}[f_I(x)] = \partial F(x).$$

Hence the process of using a single element of the sum in an algorithm can be seen as performing optimization based on noisy unbiased estimates of the gradient, or subgradient, of the objective. This intuition is described more formally in the coming section.
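As a sanity check, the unbiasedness of the single-sample oracle can be illustrated numerically. The sketch below is illustrative (the data and function names are assumptions, not from the notes): it builds a finite sum of smooth quadratic terms and verifies that averaging many sampled gradients approaches the full gradient of $F$.

```python
import numpy as np

# Finite sum F(x) = (1/n) sum_i f_i(x) with f_i(x) = 0.5 * (a_i^T x - b_i)^2.
rng = np.random.default_rng(0)
n, d = 200, 5
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def full_gradient(x):
    # gradient of F at x: (1/n) * A^T (A x - b)
    return A.T @ (A @ x - b) / n

def stochastic_gradient(x):
    # unbiased estimate: gradient of one uniformly sampled term f_i
    i = rng.integers(n)
    return A[i] * (A[i] @ x - b[i])

x = rng.standard_normal(d)
# Averaging many independent stochastic gradients approximates the full gradient.
avg = np.mean([stochastic_gradient(x) for _ in range(20000)], axis=0)
```

Each sampled gradient costs $O(d)$ instead of $O(nd)$, which is the point of the whole chapter.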

6.2 Prototype stochastic approximation algorithm

This section describes the Robbins-Monro algorithm for stochastic approximation. Consider a Lipschitz map $h : \mathbb{R}^p \to \mathbb{R}^p$; the goal is to find a zero of $h$, while one only has access to unbiased noisy estimates of $h$. The Robbins-Monro algorithm is described as follows: $(X_k)_{k \in \mathbb{N}}$ is a sequence of random variables such that for any $k \in \mathbb{N}$,

$$X_{k+1} = X_k + \alpha_k \left(h(X_k) + M_{k+1}\right) \qquad (6.2)$$

where

• $(\alpha_k)_{k \in \mathbb{N}}$ is a sequence of positive step sizes satisfying
$$\sum_{k=0}^{\infty} \alpha_k = +\infty, \qquad \sum_{k=0}^{\infty} \alpha_k^2 < +\infty.$$

• $(M_k)_{k \in \mathbb{N}}$ is a martingale difference sequence with respect to the increasing family of $\sigma$-fields
$$\mathcal{F}_k = \sigma(X_m, M_m,\ m \le k) = \sigma(X_0, M_1, \dots, M_k).$$
This means that $\mathbb{E}[M_{k+1} \mid \mathcal{F}_k] = 0$, for all $k \in \mathbb{N}$.
• In addition, we assume that there exists a positive constant $C$ such that
$$\sup_{k \in \mathbb{N}} \mathbb{E}\left[\|M_{k+1}\|_2^2 \mid \mathcal{F}_k\right] \le C.$$

The intuition here is that our hypotheses on the step sizes ensure that the quantity $\sum_{k=0}^{+\infty} \mathbb{E}\left[\alpha_k^2 \|M_{k+1}\\|^2 \mid \mathcal{F}_k\right]$ is finite, and hence the zero-mean martingale $\sum_{k=0}^{K} \alpha_k M_{k+1}$ has square summable increments and converges to a square integrable random variable $\bar M$ in $\mathbb{R}^p$, both almost surely and in $L^2$ (see for example [29, Section 5.4]).
The long term behaviour of such recursions is at the heart of the field of stochastic approximation. The fact that the step sizes tend to $0$ and that the sum of perturbations stabilizes suggests that in the limit one obtains trajectories of a continuous time differential equation. This is formalized in the next section.
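The recursion (6.2) can be sketched on a toy problem. Below, a hypothetical scalar map $h(x) = 3 - x$ (an illustrative assumption, not from the notes) is observed through additive Gaussian noise, and the steps $\alpha_k = 1/(k+1)$ satisfy both step size conditions; the iterate settles near the zero of $h$.

```python
import numpy as np

rng = np.random.default_rng(1)

def noisy_h(x):
    # unbiased noisy evaluation of h(x) = 3 - x; the noise plays the role of M_{k+1}
    return (3.0 - x) + rng.standard_normal()

x = 0.0
for k in range(200000):
    a_k = 1.0 / (k + 1)          # sum a_k = +inf, sum a_k^2 < +inf
    x = x + a_k * noisy_h(x)
# x is now close to 3, the unique zero of h
```

For this particular $h$, the recursion reduces to a running average of the noisy targets, which makes the almost sure convergence easy to believe.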

6.3 The ODE approach

For optimization we may choose $h = -\nabla F$, assuming that $F$ has Lipschitz gradient. We consider the Robbins-Monro algorithm in this setting. This idea dates back to Ljung [38]; see also [11] for an advanced presentation. An accessible exposition of the following result is found in [16].

Theorem 6.3.1. Conditioning on boundedness of $\{X_k\}_{k \in \mathbb{N}}$, almost surely, the (random) set of accumulation points of the sequence is compact, connected and invariant by the flow generated by the continuous time limit:

$$\dot x = h(x).$$

This theorem means that for any accumulation point $\bar x$ of the algorithm, the unique solution $x : \mathbb{R} \to \mathbb{R}^p$ of the continuous time ODE satisfying $x(0) = \bar x$ remains bounded for all $t \in \mathbb{R}$. This allows concluding in the convex case.

Corollary 6.3.1. If $F$ is convex, differentiable and attains its minimum, setting $h = -\nabla F$, conditioning on the event that $\sup_{k \in \mathbb{N}} \|X_k\|$ is finite, almost surely, all the accumulation points of $X_k$ are critical points of $F$.

Proof. Fix $\bar x \in \mathbb{R}^p$ such that $\nabla F(\bar x) \ne 0$; this means that $F(\bar x) - F^* > 0$. Consider the solution to
$$\dot x = \nabla F(x),$$
starting at $\bar x$. We have

$$\frac{\partial}{\partial t} F(x(t)) = \|\nabla F(x(t))\|_2^2 \ge 0,$$
$$\frac{\partial}{\partial t} \frac{1}{2}\|x(t) - x^*\|_2^2 = \langle \nabla F(x(t)),\, x(t) - x^* \rangle \ge F(x(t)) - F^* \ge F(\bar x) - F^* > 0.$$

We deduce that $F$ is increasing along the trajectory and that $\|x(t) - x^*\|$ diverges; hence the solution escapes any compact set, which means that $\bar x$ does not belong to a compact invariant set.

The power of the ODE approach lies in the fact that it allows treating much more complicated situations, beyond convexity and differentiability.

6.4 Rates for convex optimization

In the context of convex optimization problems of the form described in the introduction of this chapter, one can obtain precise convergence rate estimates using elementary arguments.

6.4.1 Stochastic subgradient descent

Proposition 6.4.1. Consider the problem

$$\min_{x \in \mathbb{R}^d} F(x) := \frac{1}{n}\sum_{i=1}^n f_i(x),$$
where each $f_i$ is convex and $L$-Lipschitz. Choose $x_0 \in \mathbb{R}^d$, a sequence of random variables $(i_k)_{k \in \mathbb{N}}$ independently identically distributed uniformly on $\{1, \dots, n\}$, and a sequence of positive step sizes $(\alpha_k)_{k \in \mathbb{N}}$. Consider the recursion
$$x_{k+1} = x_k - \alpha_k v_k \qquad (6.3)$$
$$v_k \in \partial f_{i_k}(x_k) \qquad (6.4)$$
Then for all $K \in \mathbb{N}$, $K \ge 1$,
$$\mathbb{E}[F(\bar x_K) - F^*] \le \frac{\|x_0 - x^*\|_2^2 + L^2 \sum_{k=0}^K \alpha_k^2}{2 \sum_{k=0}^K \alpha_k},$$

where $\bar x_K = \frac{\sum_{k=0}^K \alpha_k x_k}{\sum_{k=0}^K \alpha_k}$.

Proof. We fix $k \in \mathbb{N}$ and condition on $i_1, \dots, i_k$ so that $x_k$ and $x_{k+1}$ are fixed. We have for any $k \in \mathbb{N}$,
$$\frac{1}{2}\|x_{k+1} - x^*\|_2^2 = \frac{1}{2}\|x_k - \alpha_k v_k - x^*\|_2^2 = \frac{1}{2}\|x_k - x^*\|_2^2 + \alpha_k \langle v_k, x^* - x_k \rangle + \frac{\alpha_k^2}{2}\|v_k\|_2^2 \le \frac{1}{2}\|x_k - x^*\|_2^2 + \alpha_k \left(f_{i_k}(x^*) - f_{i_k}(x_k)\right) + \frac{\alpha_k^2}{2}L^2.$$

Conditioning onx k and taking expectation with respect toi k,

$$\mathbb{E}\left[\frac{1}{2}\|x_{k+1} - x^*\|_2^2 \,\Big|\, x_k\right] \le \frac{1}{2}\|x_k - x^*\|_2^2 + \frac{\alpha_k^2 L^2}{2} + \mathbb{E}\left[\alpha_k\left(f_{i_k}(x^*) - f_{i_k}(x_k)\right) \mid x_k\right] = \frac{1}{2}\|x_k - x^*\|_2^2 + \frac{\alpha_k^2 L^2}{2} + \alpha_k\left(F(x^*) - F(x_k)\right).$$

Taking expectation with respect to $x_k$ and using the tower property of conditional expectation, we have

$$\mathbb{E}\left[\frac{1}{2}\|x_{k+1} - x^*\|_2^2\right] \le \mathbb{E}\left[\frac{1}{2}\|x_k - x^*\|_2^2\right] + \frac{\alpha_k^2 L^2}{2} + \alpha_k \mathbb{E}\left[F(x^*) - F(x_k)\right].$$
By summing up, we obtain, for all $K \in \mathbb{N}$, $K \ge 1$,
$$\frac{\sum_{k=0}^K \alpha_k \mathbb{E}[F(x_k) - F^*]}{\sum_{k=0}^K \alpha_k} \le \frac{\|x_0 - x^*\|_2^2 + L^2 \sum_{k=0}^K \alpha_k^2}{2 \sum_{k=0}^K \alpha_k},$$
and the result follows from convexity of $F$.

Corollary 6.4.1. Under the hypotheses of Proposition 6.4.1, we have the following.
• If $\alpha_k = \alpha$ is constant, we have
$$\mathbb{E}[F(\bar x_k) - F^*] \le \frac{\|x_0 - x^*\|^2}{2(k+1)\alpha} + \frac{L^2 \alpha}{2}.$$

• In particular, choosing $\alpha_i = \frac{\|x_0 - x^*\|}{L\sqrt{k+1}}$, we have

$$\mathbb{E}[F(\bar x_k) - F^*] \le \frac{\|x_0 - x^*\|\, L}{\sqrt{k+1}}.$$

• Choosing $\alpha_k = \|x_0 - x^*\|/(L\sqrt{k})$ for all $k$, we obtain for all $k$
$$\mathbb{E}[F(\bar x_k) - F^*] = O\left(\frac{\|x_0 - x^*\|_2\, L\,(1 + \log(k))}{\sqrt{k}}\right).$$
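The recursion (6.3)-(6.4) with the weighted average $\bar x_K$ can be sketched on a least absolute deviations problem, whose terms are Lipschitz but nonsmooth. The data and the step size scaling below are illustrative assumptions, not taken from the notes.

```python
import numpy as np

# F(x) = (1/n) sum_i |a_i^T x - b_i|; each term is ||a_i||-Lipschitz and nonsmooth.
rng = np.random.default_rng(2)
n, d = 100, 3
A = rng.standard_normal((n, d))
x_true = np.array([1.0, -2.0, 0.5])
b = A @ x_true  # noiseless data, so F attains its minimum 0 at x_true

def F(x):
    return np.mean(np.abs(A @ x - b))

x = np.zeros(d)
x_avg, weight = np.zeros(d), 0.0
for k in range(50000):
    i = rng.integers(n)                    # sample one term uniformly
    v = np.sign(A[i] @ x - b[i]) * A[i]    # v_k in the subdifferential of f_{i_k}
    a_k = 1.0 / np.sqrt(k + 1)             # decreasing step sizes
    x = x - a_k * v
    x_avg = x_avg + a_k * x                # accumulate the weighted average
    weight += a_k
x_avg = x_avg / weight                     # bar{x}_K as in the proposition
```

The averaged iterate, not the last one, is the quantity controlled by the $O(\log k/\sqrt{k})$ bound above.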

6.4.2 Stochastic proximal gradient descent

This method is sometimes called FOBOS in the literature. I could not find a reference for the following result.

Proposition 6.4.2. Consider the problem

$$\min_{x \in \mathbb{R}^d} F(x) := \frac{1}{n}\sum_{i=1}^n f_i(x) + g(x),$$
where each $f_i$ is convex with $L$-Lipschitz gradient and $g$ is convex. Choose $x_0 \in \mathbb{R}^d$, a sequence of random variables $(i_k)_{k \in \mathbb{N}}$ independently identically distributed uniformly on $\{1, \dots, n\}$, and a sequence of positive step sizes $(\alpha_k)_{k \in \mathbb{N}}$. Consider the recursion
$$x_{k+1} = \mathrm{prox}_{\alpha_k g / L}\left(x_k - \frac{\alpha_k}{L} \nabla f_{i_k}(x_k)\right). \qquad (6.5)$$
Assume the following:

• $0 < \alpha_k \le 1$, for all $k \in \mathbb{N}$;
• $f_i$ and $g$ are $G$-Lipschitz for all $i = 1, \dots, n$.
Then for all $K \in \mathbb{N}$, $K \ge 1$,
$$\mathbb{E}[F(\bar x_K) - F^*] \le \frac{L\|x_0 - x^*\|_2^2 + \frac{2G^2}{L}\sum_{k=0}^K \alpha_k^2}{2\sum_{k=0}^K \alpha_k},$$

where $\bar x_K = \frac{\sum_{k=0}^K \alpha_k x_k}{\sum_{k=0}^K \alpha_k}$.

Proof. We fix $k \in \mathbb{N}$ and condition on $i_1, \dots, i_k$ so that $x_k$ and $x_{k+1}$ are deterministic. Note that the prox iteration gives
$$\frac{\alpha_k}{L}\partial g(x_{k+1}) + x_{k+1} \ni x_k - \frac{\alpha_k}{L}\nabla f_{i_k}(x_k),$$
$$\|x_{k+1} - x_k\|_2 \le 2G\frac{\alpha_k}{L}.$$

Fix $k \in \mathbb{N}$; applying Lemma 5.4.1 with $x = x_k$, $z = x^*$ and $y = x_{k+1}$, using the fact that $f_{i_k}$ has $L/\alpha_k$-Lipschitz gradient,

$$f_{i_k}(x^*) + g(x^*) + \frac{L}{2\alpha_k}\|x^* - x_k\|_2^2 - \frac{L}{2\alpha_k}\|x_{k+1} - x^*\|_2^2 \ge f_{i_k}(x_{k+1}) + g(x_{k+1}) \ge f_{i_k}(x_k) + g(x_k) - 2G\|x_{k+1} - x_k\|_2 \ge f_{i_k}(x_k) + g(x_k) - 4G^2\frac{\alpha_k}{L}.$$

And
$$\frac{\alpha_k}{L}\left(f_{i_k}(x_k) + g(x_k) - F^*\right) \le \frac{1}{2}\|x^* - x_k\|_2^2 - \frac{1}{2}\|x_{k+1} - x^*\|_2^2 + 4G^2\frac{\alpha_k^2}{L^2}.$$

We have, using the tower property of expectation, with respect to $i_k$ first and the remaining randomness in a second step,

$$\mathbb{E}\left[\frac{\alpha_k}{L}\left(f_{i_k}(x_k) + g(x_k) - F^*\right)\right] = \mathbb{E}\left[\mathbb{E}\left[\frac{\alpha_k}{L}\left(f_{i_k}(x_k) + g(x_k) - F^*\right) \,\Big|\, x_k\right]\right] = \mathbb{E}\left[\frac{\alpha_k}{L}\left(F(x_k) - F^*\right)\right] \le \mathbb{E}\left[\frac{1}{2}\|x^* - x_k\|_2^2\right] - \mathbb{E}\left[\frac{1}{2}\|x_{k+1} - x^*\|_2^2\right] + 4G^2\frac{\alpha_k^2}{L^2}.$$
By summing, we obtain, for any $K \in \mathbb{N}$,
$$\frac{\mathbb{E}\left[\sum_{k=0}^K \alpha_k \left(F(x_k) - F^*\right)\right]}{\sum_{k=0}^K \alpha_k} \le \frac{L\|x_0 - x^*\|_2^2 + \frac{2G^2}{L}\sum_{k=0}^K \alpha_k^2}{2\sum_{k=0}^K \alpha_k},$$
and the result follows by Jensen's inequality.

Corollary 6.4.2. Under the hypotheses of Proposition 6.4.2:
• If $\alpha_k = \alpha$ is constant, we have for all $k \ge 1$
$$\mathbb{E}[F(\bar x_k) - F^*] \le \frac{L\|x_0 - x^*\|^2}{2(k+1)\alpha} + \frac{G^2\alpha}{L}.$$

• In particular, choosing $\alpha_i = \frac{1}{\sqrt{2k+2}}$, for $i = 1, \dots, k$, for some $k \in \mathbb{N}$, we have

$$\mathbb{E}[F(\bar x_k) - F^*] \le \frac{L\|x_0 - x^*\|_2^2 + \frac{G^2}{L}}{\sqrt{2k+2}}.$$

• Choosing $\alpha_k = 1/\sqrt{2k+2}$ for all $k$, we obtain for all $k$
$$\mathbb{E}[F(\bar x_k) - F^*] = O\left(\frac{\left(L\|x_0 - x^*\|_2^2 + \frac{G^2}{L}\right)\log(k)}{\sqrt{2k+2}}\right).$$
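A minimal sketch of recursion (6.5) on the Lasso objective, where the proximity operator of the scaled $\ell_1$ norm is soft-thresholding. The data, the crude per-term Lipschitz estimate and the step sizes below are illustrative assumptions, not prescriptions from the notes.

```python
import numpy as np

# f_i(x) = 0.5*(a_i^T x - b_i)^2, g = lam * ||x||_1; prox of t*lam*|.|_1 is soft-thresholding.
rng = np.random.default_rng(3)
n, d = 200, 10
A = rng.standard_normal((n, d))
x_true = np.zeros(d)
x_true[:2] = [2.0, -1.0]                 # sparse ground truth (illustrative)
b = A @ x_true
lam = 0.1
L = np.max(np.sum(A ** 2, axis=1))       # bound on the Lipschitz constants of grad f_i

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def F(x):
    return 0.5 * np.mean((A @ x - b) ** 2) + lam * np.sum(np.abs(x))

x = np.zeros(d)
x_avg, weight = np.zeros(d), 0.0
for k in range(20000):
    i = rng.integers(n)
    a_k = 1.0 / np.sqrt(k + 1)           # step sizes in (0, 1]
    grad = A[i] * (A[i] @ x - b[i])      # gradient of f_{i_k} at x_k
    x = soft_threshold(x - (a_k / L) * grad, a_k * lam / L)   # prox step (6.5)
    x_avg = x_avg + a_k * x
    weight += a_k
x_avg = x_avg / weight
```

Soft-thresholding keeps iterates sparse along the way, which is one practical appeal of the proximal variant over plain stochastic subgradient on the $\ell_1$ term.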

6.5 Minimizing the population risk

The methods which we have seen can be used to minimize functions of the form

$$x \mapsto \mathbb{E}_Z[f(x, Z)],$$
where $x$ denotes some model parameters and $Z$ denotes a random variable describing our population. In this case, $Z$ could be the input-output pair $(X, Y)$ of a regression problem, for which we try to minimize the expected prediction error over a certain parametric regression function class $\mathcal{F}$:

$$R(f) = \mathbb{E}\left[(f(X) - Y)^2\right] = \int_{\mathcal{X} \times \mathcal{Y}} (f(x) - y)^2\, P(dx, dy).$$

This can be done by replacing the finite sum by an expectation, and the independent indices by i.i.d. samples of the random variable $Z$. The results are exactly the same.
Such procedures are usually called "single pass" procedures: given a dataset $(x_i, y_i)_{i=1}^n$ for a regression problem, performing one pass of a stochastic algorithm, looking at each point only once, amounts to performing $n$ steps of the same stochastic algorithm on the population risk. This illustrates a strong relation between optimization and statistics. We have seen that in this setting, there is no hope to obtain estimators with statistical rates much faster than $1/n$ in terms of mean squared error. Similarly, the rates which we obtained for stochastic algorithms are of the order of $1/\sqrt{k}$; this is also optimal in a precise sense. These algorithms provide estimators with statistical efficiency of the order of $1/\sqrt{n}$. The gap stands because we considered regression problems with squared loss, a very special structure, while here the convex functions which we considered are arbitrary. For strongly convex functions, stochastic optimization algorithms may show faster convergence rates of the order $1/k$.

Chapter 7

Block coordinate methods

Block decomposition methods appeared as alternatives to solve optimization problems involving a large number of dimensions. The idea is to reduce the complexity of a single iteration by updating only a few coordinates at a time. The use of such methods was advocated by Nesterov [44]; extensions such as [50] appeared in the continuity of these works. The survey [64] is a good entry point to the literature.

7.1 Motivation, large d

The Lasso estimator is given as follows:
$$\hat\theta \in \arg\min_{\theta \in \mathbb{R}^d} \frac{1}{2n}\|X\theta - Y\|_2^2 + \lambda\|\theta\|_1.$$
We have seen that the optimization problem has a favorable structure which allows devising efficient algorithms. The cost of each iteration depends on the dimension (here $d^2$), which for some problems may be limiting. A possible alternative is to update coordinates independently, reducing the cost of each iteration.
In general, this approach is not convergent for nonsmooth functions (can you see why?); however, the Lasso problem, despite being nonsmooth, fits coordinate descent methods because the nonsmooth part is separable. We shall see two variations of such algorithms, deterministic and random, with convergence rate estimates in both cases. A good introduction to the topic can be found in [64] and a pioneering work in optimization is described in [44]. The literature on the subject has completely exploded in the past years.

7.2 Description of the algorithm

We consider optimization problems of the form

$$\min_{x \in \mathbb{R}^p} F(x) = f(x) + \sum_{i=1}^p g_i(x_i),$$
where $f : \mathbb{R}^p \to \mathbb{R}$ has $L$-Lipschitz gradient and the $g_i : \mathbb{R} \to \mathbb{R}$ are convex lower semicontinuous univariate functions. We denote by $e_1, \dots, e_p$ the elements of the canonical basis. Block coordinate descent algorithms are given a sequence of coordinate indices $(i_k)_{k \in \mathbb{N}}$ and, starting at $x_0 \in \mathbb{R}^p$, update coordinates one by one at each iteration. For example
$$x_{k+1} = \arg\min_{y = x_k + t e_{i_k}} f(x_k) + \langle \nabla f(x_k), y - x_k \rangle + \frac{L}{2}\|y - x_k\|_2^2 + g_{i_k}(y),$$
or
$$x_{k+1} = \arg\min_{y = x_k + t e_{i_k}} f(y) + g_{i_k}(y).$$


The first option corresponds to a block proximal gradient algorithm; the second option corresponds to exact block minimization. Block coordinate descent algorithms are usually analysed under coercivity assumptions:

Assumption 7.2.1. The sublevel set $\{y \in \mathbb{R}^p,\ F(y) \le F(x_0)\}$ is compact and, for any $y \in \mathbb{R}^p$ such that $F(y) \le F(x_0)$, $\|y - x^*\|_2 \le R$.

7.3 Convergence rate analysis using random blocks

7.3.1 Smooth setting

The following technical Lemma is classical.

Lemma 7.3.1. Let $(A_k)_{k \in \mathbb{N}}$ be a sequence of positive real numbers and $\gamma > 0$ be such that
$$A_k - A_{k+1} \ge \gamma A_k^2,$$
then for all $k \in \mathbb{N}$, $k \ge 1$, $A_k \le (\gamma k)^{-1}$.

Proof. We have for all $k \in \mathbb{N}$,
$$\frac{1}{A_{k+1}} - \frac{1}{A_k} = \frac{A_k - A_{k+1}}{A_k A_{k+1}} \ge \gamma \frac{A_k^2}{A_{k+1} A_k} = \gamma \frac{A_k}{A_{k+1}} \ge \gamma.$$

Hence for all $k \in \mathbb{N}$,
$$\frac{1}{A_k} \ge \frac{1}{A_0} + \gamma k \ge \gamma k.$$

Proposition 7.3.1. Consider the problem

$$\min_{x \in \mathbb{R}^p} f(x),$$
where $f : \mathbb{R}^p \to \mathbb{R}$ is convex differentiable with $L$-Lipschitz gradient. Choose $x_0 \in \mathbb{R}^p$ and a sequence of random variables $(i_k)_{k \in \mathbb{N}}$ independently identically distributed uniformly on $\{1, \dots, p\}$. Consider the recursion
$$x_{k+1} = x_k - \frac{1}{L}\nabla_{i_k} f(x_k). \qquad (7.1)$$
Then for all $k \in \mathbb{N}$, $k \ge 1$,
$$\mathbb{E}[f(x_k) - f^*] \le \frac{2pLR^2}{k}.$$

Proof. Fix $k \in \mathbb{N}$, and condition on $x_k$ and $i_0, \dots, i_k$ so that $x_{k+1}$ is deterministic. We remark that $t \mapsto f(x_k + t e_{i_k})$ is convex with $L$-Lipschitz gradient. Applying Lemma 5.4.1 with $x = z = x_k$, $y = x_{k+1}$,
$$f(x_k) \ge f(x_{k+1}) + \frac{L}{2}\|x_{k+1} - x_k\|_2^2 = f(x_{k+1}) + \frac{1}{2L}\|\nabla_{i_k} f(x_k)\|_2^2,$$
and in particular $f$ is decreasing along the sequence. Taking expectation with respect to $i_k$, we obtain

$$\mathbb{E}[f(x_{k+1}) \mid x_k] \le f(x_k) - \frac{1}{2pL}\|\nabla f(x_k)\|_2^2. \qquad (7.2)$$

From convexity, we have $f^* \ge f(x) - \|\nabla f(x)\|\|x - x^*\|$ and, using the fact that $f$ is decreasing along the sequence, $\|x_k - x^*\|$ remains bounded by $R$. We have
$$\|\nabla f(x_k)\|_2^2 \ge \frac{(f(x_k) - f^*)^2}{R^2},$$
and

$$\mathbb{E}[f(x_{k+1}) \mid x_k] - f^* \le f(x_k) - f^* - \frac{1}{2pL}\|\nabla f(x_k)\|_2^2 \le f(x_k) - f^* - \frac{(f(x_k) - f^*)^2}{2pLR^2}.$$

Taking expectation with respect to $x_k$ and using the fact that $\mathbb{E}[Z^2] \ge \mathbb{E}[Z]^2$, we obtain
$$\mathbb{E}[f(x_{k+1}) - f^*] \le \mathbb{E}[f(x_k) - f^*] - \frac{\mathbb{E}[f(x_k) - f^*]^2}{2pLR^2}.$$

Applying Lemma 7.3.1, we obtain for all $k \in \mathbb{N}$, $k \ge 1$,
$$\mathbb{E}[f(x_k) - f^*] \le \frac{2pLR^2}{k}.$$
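Recursion (7.1) can be sketched on a smooth quadratic, where coordinate $i$ of the gradient is available in closed form at cost $O(p)$. The problem instance below is an illustrative assumption.

```python
import numpy as np

# f(x) = 0.5 * x^T Q x - c^T x with Q positive definite.
rng = np.random.default_rng(4)
p = 20
M = rng.standard_normal((p, p))
Q = M @ M.T + np.eye(p)              # positive definite Hessian
c = rng.standard_normal(p)
L = np.linalg.eigvalsh(Q).max()      # Lipschitz constant of the full gradient
f = lambda x: 0.5 * x @ Q @ x - c @ x
f_star = f(np.linalg.solve(Q, c))    # optimal value, for monitoring only

x = np.zeros(p)
for k in range(20000):
    i = rng.integers(p)              # uniform random coordinate
    g_i = Q[i] @ x - c[i]            # i-th coordinate of the gradient, O(p) cost
    x[i] -= g_i / L                  # recursion (7.1): update only coordinate i
```

With a per-coordinate Lipschitz constant instead of the global $L$, larger steps are possible, but the global $L$ matches the statement above.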

7.3.2 Extension to the nonsmooth setting

The following is a simplification of the arguments given in [50].

Proposition 7.3.2. Consider the problem

$$\min_{x \in \mathbb{R}^p} F(x) := f(x) + \sum_{i=1}^p g_i(x),$$
where $f : \mathbb{R}^p \to \mathbb{R}$ is convex differentiable with $L$-Lipschitz gradient, and each $g_i : \mathbb{R}^p \to \mathbb{R}$ is convex, lower semicontinuous and only depends on coordinate $i$. Choose $x_0 \in \mathbb{R}^p$ and a sequence of random variables $(i_k)_{k \in \mathbb{N}}$ independently identically distributed uniformly on $\{1, \dots, p\}$. Consider the recursion
$$x_{k+1} = \arg\min_y f(x_k) + \langle \nabla_{i_k} f(x_k), y - x_k \rangle + \frac{L}{2}\|y - x_k\|_2^2 + g_{i_k}(y) \qquad (7.3)$$
$$= \mathrm{prox}_{g_{i_k}/L}\left(x_k - \frac{1}{L}\nabla_{i_k} f(x_k)\right). \qquad (7.4)$$
Set $C = \max\left\{R^2, F(x_0) - F^*\right\}$, where $R$ is given in Assumption 7.2.1. We have, for all $k \ge 1$,
$$\mathbb{E}[F(x_k) - F^*] \le \frac{2pC}{k}.$$

Proof. Fix $k \in \mathbb{N}$, and condition on $x_k$ and $i_0, \dots, i_k$ so that $x_{k+1}$ is deterministic. We remark that $t \mapsto f(x_k + t e_{i_k})$ is convex with $L$-Lipschitz gradient. Noting that the iteration actually solves a univariate problem, applying Lemma 5.4.1 with $x = z = x_k$, $y = x_{k+1}$,
$$f(x_k) + g_{i_k}(x_k) \ge f(x_{k+1}) + g_{i_k}(x_{k+1}) + \frac{L}{2}\|x_{k+1} - x_k\|_2^2.$$

By assumption, $g_i$ depends only on coordinate $i$, so that $g_i(x_k) = g_i(x_{k+1})$ for $i \ne i_k$. Hence
$$F(x_k) \ge F(x_{k+1}) + \frac{L}{2}\|x_{k+1} - x_k\|_2^2,$$
so that $F$ is non increasing along the sequence and, for any $k \in \mathbb{N}$, $\|x_k - x^*\|_2^2 \le C$ almost surely. We write $g = \sum_{i=1}^p g_i$. By the descent lemma, we have
$$f(x_{k+1}) + g_{i_k}(x_{k+1}) \le f(x_k) + \langle \nabla_{i_k} f(x_k), x_{k+1} - x_k \rangle + \frac{L}{2}\|x_{k+1} - x_k\|_2^2 + g_{i_k}(x_{k+1}),$$
$$F(x_{k+1}) \le f(x_k) + \langle \nabla_{i_k} f(x_k), x_{k+1} - x_k \rangle + \frac{L}{2}\|x_{k+1} - x_k\|_2^2 + g_{i_k}(x_{k+1}) + \sum_{i \ne i_k} g_i(x_k).$$
Since each $g_i$ only depends on coordinate $i$, $\mathrm{prox}_g$ can be computed coordinate by coordinate. Taking expectation with respect to $i_k$ and setting $z_k = \mathrm{prox}_{g/L}\left(x_k - \frac{1}{L}\nabla f(x_k)\right)$, we obtain

$$\mathbb{E}[F(x_{k+1}) \mid x_k] \le \frac{1}{p}\left(f(x_k) + \langle \nabla f(x_k), z_k - x_k \rangle + \frac{L}{2}\|z_k - x_k\|_2^2 + g(z_k)\right) + \frac{p-1}{p}F(x_k). \qquad (7.5)$$
By definition of the proximity operator, we have for any $y \in \mathbb{R}^p$,
$$f(x_k) + \langle \nabla f(x_k), z_k - x_k \rangle + \frac{L}{2}\|z_k - x_k\|_2^2 + g(z_k) \le f(x_k) + \langle \nabla f(x_k), y - x_k \rangle + \frac{L}{2}\|y - x_k\|_2^2 + g(y) \le F(y) + \frac{L}{2}\|y - x_k\|_2^2.$$
In particular, for any $\alpha \in [0, 1]$,
$$f(x_k) + \langle \nabla f(x_k), z_k - x_k \rangle + \frac{L}{2}\|z_k - x_k\|_2^2 + g(z_k) \le F(\alpha x^* + (1 - \alpha)x_k) + \frac{\alpha^2 L}{2}\|x^* - x_k\|_2^2 \le \alpha F(x^*) + (1 - \alpha)F(x_k) + \frac{\alpha^2 L}{2}C.$$

The minimum is attained for $\alpha = (F(x_k) - F^*)/C \le 1$, so that
$$f(x_k) + \langle \nabla f(x_k), z_k - x_k \rangle + \frac{L}{2}\|z_k - x_k\|_2^2 + g(z_k) - F^* \le \left(1 - \frac{F(x_k) - F^*}{2C}\right)\left(F(x_k) - F^*\right).$$
Plugging this in (7.5), we obtain

$$\mathbb{E}[F(x_{k+1}) \mid x_k] - F^* \le \frac{F(x_k) - F^*}{p}\left(1 - \frac{F(x_k) - F^*}{2C}\right) + \frac{p-1}{p}\left(F(x_k) - F^*\right)$$

$$= \left(F(x_k) - F^*\right)\left(1 - \frac{F(x_k) - F^*}{2pC}\right). \qquad (7.6)$$
Taking expectation with respect to $x_k$ and using the fact that $\mathbb{E}[Z^2] \ge \mathbb{E}[Z]^2$, we have
$$\mathbb{E}[F(x_{k+1}) - F^*] \le \mathbb{E}[F(x_k) - F^*] - \frac{1}{2pC}\mathbb{E}[F(x_k) - F^*]^2. \qquad (7.7)$$
The result follows from Lemma 7.3.1.
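A sketch of recursion (7.3)-(7.4) on a Lasso-type objective: since the $\ell_1$ penalty is separable, the coordinate-wise proximity operator is a scalar soft-threshold. The data and penalty level below are illustrative assumptions.

```python
import numpy as np

# F(x) = 0.5*||Ax - b||^2 + lam*||x||_1, with g_i(x) = lam*|x_i| separable.
rng = np.random.default_rng(5)
n, p = 50, 10
A = rng.standard_normal((n, p))
b = rng.standard_normal(n)
lam = 0.5
L = np.linalg.eigvalsh(A.T @ A).max()    # Lipschitz constant of grad f

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def F(x):
    return 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))

x = np.zeros(p)
for k in range(5000):
    i = rng.integers(p)                  # uniform random coordinate
    g_i = A[:, i] @ (A @ x - b)          # i-th coordinate of grad f
    # recursion (7.4): coordinate-wise prox of g_i / L
    x[i] = soft_threshold(x[i] - g_i / L, lam / L)
```

As shown in the proof, this scheme decreases $F$ monotonically, unlike the stochastic algorithms of Chapter 6.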

7.4 Convergence rates using deterministic blocks

Deterministic block selection may lead to similar theoretical guarantees; there is some computational overhead but, as we shall see, this is affordable for Lasso instances. A broader discussion on this aspect is found in [48].

Proposition 7.4.1. Consider the problem
$$\min_{x \in \mathbb{R}^p} f(x),$$
where $f : \mathbb{R}^p \to \mathbb{R}$ is convex differentiable with $L$-Lipschitz gradient. Choose $x_0 \in \mathbb{R}^p$, and consider the recursion
$$x_{k+1} = x_k - \frac{1}{L}\nabla_{i_k} f(x_k), \qquad (7.8)$$

where $i_k$ is the largest block of $\nabla f(x_k)$ in Euclidean norm. Then for all $k \in \mathbb{N}$, $k \ge 1$,
$$f(x_k) - f^* \le \frac{2pLR^2}{k}.$$

Proof. The proof is essentially the same as in Proposition 7.3.1. We have for any $k \in \mathbb{N}$,
$$\|\nabla f(x_k)\|_2^2 \le p\|\nabla_{i_k} f(x_k)\|_2^2,$$
and one obtains, using the same arguments as in (7.2),
$$f(x_{k+1}) \le f(x_k) - \frac{1}{2pL}\|\nabla f(x_k)\|_2^2, \qquad (7.9)$$
and the rest of the analysis is the same.

Using similar ideas, one obtains the same behaviour for the deterministic block proximal gradient algorithm.

Proposition 7.4.2. Consider the problem

$$\min_{x \in \mathbb{R}^p} F(x) := f(x) + \sum_{i=1}^p g_i(x),$$
where $f : \mathbb{R}^p \to \mathbb{R}$ is convex differentiable with $L$-Lipschitz gradient, and each $g_i : \mathbb{R}^p \to \mathbb{R}$ is convex, lower semicontinuous and only depends on coordinate $i$. Choose $x_0 \in \mathbb{R}^p$ and consider the recursion
$$x_{k+1} = \arg\min_y f(x_k) + \langle \nabla_{i_k} f(x_k), y - x_k \rangle + \frac{L}{2}\|y - x_k\|_2^2 + g_{i_k}(y) \qquad (7.10)$$
$$= \mathrm{prox}_{g_{i_k}/L}\left(x_k - \frac{1}{L}\nabla_{i_k} f(x_k)\right), \qquad (7.11)$$
where $i_k$ is given by
$$i_k \in \arg\min_i \left\{ \langle y_i - x_k, \nabla_i f(x_k) \rangle + \frac{L}{2}\|y_i - x_k\|_2^2 + g_i(y_i) - g_i(x_k) \right\}, \qquad y_i = \mathrm{prox}_{g_i/L}\left(x_k - \frac{1}{L}\nabla_i f(x_k)\right).$$
Set $C = \max\left\{R^2, F(x_0) - F^*\right\}$, where $R$ is given in Assumption 7.2.1. We have, for all $k \ge 1$,
$$F(x_k) - F^* \le \frac{2pC}{k}.$$

Proof. Using the same arguments as in the proof of Proposition 7.3.2, we obtain for any $k \in \mathbb{N}$,
$$F(x_{k+1}) \le f(x_k) + \langle \nabla_{i_k} f(x_k), x_{k+1} - x_k \rangle + \frac{L}{2}\|x_{k+1} - x_k\|_2^2 + g_{i_k}(x_{k+1}) + \sum_{i \ne i_k} g_i(x_k)$$
$$= F(x_k) + \langle \nabla_{i_k} f(x_k), x_{k+1} - x_k \rangle + \frac{L}{2}\|x_{k+1} - x_k\|_2^2 + g_{i_k}(x_{k+1}) - g_{i_k}(x_k).$$

Since each $g_i$ only depends on coordinate $i$, $\mathrm{prox}_g$ can be computed coordinate by coordinate. Setting $z_k = \mathrm{prox}_{g/L}\left(x_k - \frac{1}{L}\nabla f(x_k)\right)$ and $g = \sum_i g_i$, we deduce from the definition of $i_k$,
$$F(x_{k+1}) \le F(x_k) + \frac{1}{p}\left(\langle \nabla f(x_k), z_k - x_k \rangle + \frac{L}{2}\|z_k - x_k\|_2^2 + g(z_k) - g(x_k)\right)$$
$$= F(x_k) + \frac{1}{p}\left(f(x_k) + \langle \nabla f(x_k), z_k - x_k \rangle + \frac{L}{2}\|z_k - x_k\|_2^2 + g(z_k) - g(x_k) - f(x_k)\right)$$
$$= \frac{p-1}{p}F(x_k) + \frac{1}{p}\left(f(x_k) + \langle \nabla f(x_k), z_k - x_k \rangle + \frac{L}{2}\|z_k - x_k\|_2^2 + g(z_k)\right).$$
This is similar to (7.5) and the result follows from the same arguments.
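The deterministic rule of Proposition 7.4.1, where the block (here a single coordinate) with the largest gradient entry is selected, can be sketched as follows; the quadratic instance is an illustrative assumption.

```python
import numpy as np

# Greedy coordinate descent (7.8) on f(x) = 0.5 * x^T Q x - c^T x.
rng = np.random.default_rng(6)
p = 15
M = rng.standard_normal((p, p))
Q = M @ M.T + np.eye(p)              # positive definite
c = rng.standard_normal(p)
L = np.linalg.eigvalsh(Q).max()
f = lambda x: 0.5 * x @ Q @ x - c @ x
f_star = f(np.linalg.solve(Q, c))

x = np.zeros(p)
for k in range(20000):
    grad = Q @ x - c
    i = int(np.argmax(np.abs(grad)))  # deterministic rule: largest gradient block
    x[i] -= grad[i] / L               # same update as (7.1), but no randomness
```

The extra argmax costs one pass over the gradient, which is the overhead discussed in Section 7.5 for the Lasso case.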

7.5 Comments on complexity for quadratic problems

The Lasso problem is a special case for block descent methods since the objective is quadratic. This leads to the following remarks:
• Computing the gradient of the Lasso objective costs a matrix-vector product, whose complexity is of the order of $d^2$.

• Given $\theta \in \mathbb{R}^d$ and $\beta = X^T(X\theta - Y)$, choosing $\tilde\theta$ differing from $\theta$ in at most one coordinate, computing $X^T(X\tilde\theta - Y)$ given $\beta$ costs only of the order of $d$ operations, by only considering the corresponding column of $X^TX$.

As a consequence, the cost of performing one iteration of full proximal gradient for the Lasso problem is roughly equivalent to the cost of performing $d$ iterations of random block proximal gradient. Given the value of the gradient, the added complexity of computing the deterministic block is of the order of $d$, as it requires only one pass through the coordinates of the gradient and the current estimate $\theta$. Hence the deterministic rule has similar complexity per iteration as the random block rule.
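The $O(d)$ bookkeeping described above can be checked numerically: maintaining $\beta = X^T(X\theta - Y)$, a change $\delta$ in coordinate $j$ of $\theta$ shifts $\beta$ by exactly $\delta$ times column $j$ of $X^TX$. The instance below is illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
n, d = 30, 8
X = rng.standard_normal((n, d))
Y = rng.standard_normal(n)
G = X.T @ X                          # precomputed once, d x d

theta = rng.standard_normal(d)
beta = X.T @ (X @ theta - Y)         # full computation, order n*d

j, delta = 3, 0.7                    # change a single coordinate of theta
theta[j] += delta
beta += delta * G[:, j]              # order d update using one column of X^T X

beta_full = X.T @ (X @ theta - Y)    # order n*d recomputation, for comparison
```

Both vectors coincide up to floating point error, which is why one coordinate step of block proximal gradient on the Lasso is a factor $d$ cheaper than a full gradient step.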