Noname manuscript No. (will be inserted by the editor)

Stochastic proximal splitting algorithm for composite minimization

Andrei Patrascu · Paul Irofti

Received: date / Accepted: date

Abstract Supported by the recent contributions in multiple branches, the first-order splitting algorithms became central for structured nonsmooth optimization. In the large-scale or noisy contexts, when only stochastic information on the objective func- tion is available, the extension of proximal gradient schemes to stochastic oracles is heavily based on the tractability of the proximal operator corresponding to nons- mooth component, which has been deeply analyzed in the literature. However, there remained some questions about the difficulty of the composite models where the nonsmooth term is not proximally tractable anymore. Therefore, in this paper we tackle composite optimization problems, where the access only to stochastic infor- mation on both smooth and nonsmooth components is assumed, using a stochastic proximal first-order scheme with stochastic proximal updates. We provide sublinear 1  k convergence rates (in expectation of squared distance to the optimal set) under theO strong convexity assumption on the objective function. Also, linear convergence is achieved for convex feasibility problems. The empirical behavior is illustrated by numerical tests on parametric sparse representation models.

1 Introduction

In this paper we consider the following convex composite optimization problem:

min f(x) + h(x), x n ∈R

A. Patrascu and P. Irofti are with the Research Center for Logic, Optimization and Security (LOS), De- partment of Computer Science, Faculty of Mathematics and Computer Science, University of Bucharest. (e-mails: [email protected], [email protected]). The research of A. Patrascu was supported by a grant of the Romanian Ministry of Education and Research, CNCS - UEFISCDI, project number PN-III- arXiv:1912.02039v3 [math.OC] 2 Oct 2020 P1-1.1-PD-2019-1123, within PNCDI III. Also, the research work of P. Irofti was supported by a grant of the Romanian Ministry of Education and Research, CNCS - UEFISCDI, project number PN-III-P1-1.1- PD-2019-0825, within PNCDI III. 2 Andrei Patrascu, Paul Irofti

where f is the smooth component and h is a proper, convex, lower-semicontinuous function. In literature, many applications from statistics [22] or signal processing of- ten motivate noisy contexts allowing access only to stochastic first order information on smooth function f, having regularizer h as a typical proximally-tractable convex function. By proximally-tractable we mean that the proximal map of a given function is computable in closed form or, at most, in linear time. Therefore, in these situations the following stochastic model

min E[f(x; ξ)] + h(x), x n ∈R where ξ is a random variable, express better the previous real assumptions. However, the recent dimensionality inflation of machine learning [8,26,33] and signal process- ing [26,27] models gave birth to optimization problems with complicated regularizers or complicated many constraints. As practical examples we recall: parametric sparse represention [27], group lasso [8,26,33], CUR-like factorization [31], graph trend filtering [24,29], dictionary learning [27,32]. Motivated by all these models, in this paper instead of assuming proximal-tractable regularizer h, we consider that h is ex- pressed as an expectation of stochastic proximally-tractable components h( ; ξ) (i.e. · h(x) = E[h(x; ξ)]). Thus we focus on the following model:

min F (x) := Eξ Ω[f(x; ξ)] + Eξ Ω[h(x; ξ)], (1) x n ∈ ∈ ∈R where ξ is a random variable associated with probability space (P,Ω). Functions f( ; ξ): Rn R are smooth with Lipschitz gradients and h( , ξ): Rn ( , + ] · → · 7→ −∞ ∞ are proper convex and lower-semicontinuous, E[ ] is the expectation over respective random variable. In general, many existing primal· schemes encounter computational difficulties when a large (possibly infinite) number of constraints are present, since they are based on full projections onto complicated feasible set. Contributions. (i) We analyze a stochastic first-order splitting scheme relying on stochastic gradients and stochastic proximal updates, which naturally generalize the widely known Stochastic (SGD) and Stochastic Proximal Point (SPP) algorithms toward composite models with untractable regularizations; (ii) we 1 provide ( k ) iteration complexity estimates which were previously unknown for this type ofO schemes. In particular, for convex feasibility problems, the analysis yields naturally linear convergence rates. We briefly recall further the milestone results from stochastic optimization literature with focus on the complexity of stochastic first-order methods.

1.1 Previous work

Great attention has been given in the last decade to the behaviour of stochastic first order schemes, with special focus in stochastic gradient descent (SGD), on a variety of models under different convexity properties, see [10–13,16,22,25]. Since the anal- ysis of SGD naturally require a typical smoothness, appropriate extensions are nec- essary to attack nonsmooth models. These extensions are embodied by the stochastic Stochastic proximal splitting algorithm for composite minimization 3 proximal point (SPP) algorithm, which has been recently analyzed using various dif- ferentiability assumptions, see [1,5,9,18,23,28,30] and has shown surprinsing analyt- ical and empirical performances. In [28] is considered the typical stochastic learning model involving the expectation of random particular components f( ; ξ) defined by · T the composition of a smooth function and a linear operator, i.e.: f(x; ξ) = `(aξ x), n where aξ R . Their complexity analysis requires smoothness and strong convexity ∈ 1  to obtain in the quadratic mean and an kγ convergence rate, using vanishing step- size. The generalization of these convergenceO guarantees is undertaken in [18], where no linear composition structure is required and an (in)finite number of constraints are included in the stochastic model, i.e. h( ; ξ) = IXξ ( ). However, the analysis of [18] requires strong convexity and Lipschitz· gradient continuity· for each functional com- ponent f( ; ξ). Note that our analysis surpasses this restriction and provides a natu- ral generalization· of [18] to nonsmooth composite models. Further, in [5] a general asymptotic convergence analysis of slightly modified SPP scheme has been provided, under mild convexity assumptions on a finitely constrained stochastic problem. In [30] similar SPP algorithms are developed, which are also tailored for complicated feasible sets. The authors focus on mild convex models deriving optimal (1/√k) rates. Recently, in [1], the authors analyze SPP schemes for stochastic modelsO with ”shared minimizers” obtaining linear convergence results, for variable stepsize SPP.   Also, without shared minimizers assumption, they obtain for SPP 1 in convex O √k Lipschitz continuous case and, furthermore, 1  in strongly convex case. O k Splitting first-order schemes received significantly attention due to their natural in- sight and simplicity in contexts when a sum of two components are minimized (see [4,14]). However, when the information access is limited only to stochastic samples of the two components, extending the existing guarantees is not straightforward. No- tice that overall, on one hand, SPP avoids any splitting in composite models by treat- ing the constrained smooth problems as black box expectations (see [1,18,28]). On the other hand, until recently the composite nonsmooth models assumed proximal- tractable h in order to extend proof arguments of the results related to stochastic smooth optimization [22]. Therefore, only recently the full stochastic composite mod- els with stochastic regularizers have been properly tackled [24], where almost sure asymptotic convergence is established for a stochastic splitting scheme, where each iteration represents a typical proximal gradient update with respect to stochastic sam- ples of f and h. In our paper we analyze the nonasymptotic behaviour of this scheme and point the relations with other algorithms from the literature. The stochastic splitting schemes are also related to the model-based methods de- veloped in [6]. Here, the authors developed a unified algorithmic framework, which generates stochastic algorithms, for different models arising in learning applications. They assume their composite objective function to be the sum of a (weakly convex) stochastic component with bounded gradients and simple (proximally tractable) con- vex regularization. Although their framework is algorithmically more general, our analysis avoid these boundedness assumptions, allows objectives with a component having Lipschitz continuous gradient and do not require proximal tractability on reg- ularizations. 4 Andrei Patrascu, Paul Irofti

Notations. We use notation [m] = 1, , m . For x, y Rn denote the scalar { ··· } ∈ product x, y = xT y and Euclidean norm by x = √xT x. The projection op- h i k k erator onto set X is denoted by πX and the distance from x to the set X is de- noted distX (x) = minz X x z . The indicator function of a set X is denoted: ( ∈ k − k 0, if x X IX (x) = ∈ . We use notations ∂h(x; ξ) for the subdifferential set , otherwise ∞ and g (x; ξ) for a subgradient of h( ; ξ) at x. In differentiable case, f( ; ξ) is the h · ∇ · gradient of component ξ. Finally, we use E[ ] for (conditional) expectation operator. ·

1.2 Preliminaries

We denote the set of optimal solutions with X∗ and x∗ any optimal point for (1). Assumption 1 The objective function of our main problem (1) satisfies: (i) The function f( ; ξ) has L -Lipschitz gradient, i.e. there exists L > 0 such that: · f f n f(x; ξ) f(y; ξ) Lf x y , x, y R , ξ Ω. k∇ − ∇ k ≤ k − k ∀ ∈ ∈ and f is σ strongly convex, i.e. there exists σ > 0 satisfying: f − f σf 2 n f(x) f(y) + f(y), x y + x y x, y R . (2) ≥ h∇ − i 2 k − k ∀ ∈ n (ii) There exists subgradient mappings gF ( ; ξ): dom F ( ; ξ) Ω R such that · · × 7→ gF (x; ξ) ∂F (x; ξ), ξ Ω and E[gF (x; ξ)] ∂F (x). ∈ ∀ ∈ ∈ (iii) F ( ; ξ) has bounded gradients on the optimal set: there exists F∗ 0 such that  · 2 S ≥ E gF (x∗; ξ) ∗ < for all x∗ X∗. k k ≤ SF ∞ ∈ (iv) For any gF (x∗) ∂F (x∗) there exists bounded subgradients gF (x∗; ξ) ∈ 2 ∈ ∂F (x∗; ξ), such that E[gF (x∗; ξ)] = gF (x∗) and E[ gF (x∗; ξ) ] < ∗ . Moreover, k k SF for simplicity, we assume throughout the paper gF (x∗) = E[gF (x∗; ξ)] = 0. The first part of the above assumption is natural in the stochastic optimization prob- lems. The Assumption (ii) guarantee the existence of a subgradient mapping. The third part Assumption (iii) is standard in the literature related to proximal stochastic algorithms. Notice that most stochastic subgradient algorithms require bounded gra- dients on the entire domain [10,13,16], which is more restrictive than condition (iii). The assumption (iv) needs a more consistent discussion, detailed in [17]. However, for completeness we include it in the remark below.

Remark 1 Denote the set E[∂F ( ; ξ)] = E[gF ( ; ξ)] gF ( ; ξ) ∂F ( ; ξ) . In gen- · { · | · ∈ · } eral, for convex functions it can be easily shown E [∂F (x; ξ)] ∂F (x) for all x dom(F ) (see [21]). However, (iv) is guaranteed by the stronger⊆ equality: ∈ E [∂h(x; ξ)] = ∂h(x). (3) Discrete case. Let us consider finite discrete domains Ω = 1, , m . Then [20, Theorem 23.8] guarantees that the finite sum objective function{ ··· from (1)} satisfy (3) if T ri(dom(h( ; ξ))) = . The ri(dom( )) can be further relaxed to dom( ) for ξ Ω · 6 ∅ · · ∈ Stochastic proximal splitting algorithm for composite minimization 5

polyhedral components. In particular, let X1, ,Xm be finitely many closed convex m ··· T satisfying qualification condition: ri(Xi) = , then also (3) holds, i.e. X (x) = i=1 6 ∅ N Pm Xi (x) (see ( by [20, Corrolary 23.8.1])). Again, ri(Xi) can be relaxed to the set i=1 N itself for polyhedral sets. As pointed by [3], the (bounded) linear regularity property m of Xi i=1 implies the intersection qualification condition. {Under} support of these arguments, observe that (iv) can be easily checked for most finite-sum examples arisen in convex learning problems. Continuous case. In the nondiscrete case, sufficient conditions for (3) are discussed in [21]. Based on the arguments from [21], an assumption equivalent to (iv) is con- sidered in [24] under the name of 2 integrable representation of x∗ (definition in [24, − Section B]). On short, if h( ; ξ) is normal convex integrand with full domain then x∗ · admits an 2 integrable representation gh(x∗; ξ) ξ Ω, and implicitly (iv) holds. We mention− that deriving a more complicated{ } result∈ similar to Lemma 3 we could avoid assumption (iv). However, since (iv) facilitates the simplicity and naturality of our results and while our target applications are not excluded, we assume throughout the paper that (iv) holds. T Let closed convex sets Xξ ξ Ω and X = ξ Ω Xξ, then a favorable ”conditioning” property for most projection{ } ∈ methods is the∈ linear regularity property: there exists κ > 0 such that

2 2 n E[dist (x)] κdist (x) x R . (4) Xξ ≥ X ∀ ∈ Given some smoothing parameter µ > 0, define Moreau envelope of h(x; ξ) and the prox operator as follows:

1 2 hµ(x; ξ) := min h(z; ξ) + z x z n 2µk − k ∈R 1 2 proxh,µ(x; ξ) := arg min h(z; ξ) + z x . z n 2µk − k ∈R

The approximate hµ( ; ξ) inherits the same convexity properties of h( ; ξ) and ad- · 1 · ditionally has Lipschitz continuous gradient with constant µ , see [19]. In partic-

ular, when h(x; ξ) = IXξ (x) the prox operator becomes the projection operator

proxh,µ(x; ξ) = πXξ (x).

2 Stochastic Splitting Proximal Gradient Algorithm

In the following section we present the Stochastic Splitting Proximal Gradient (SSPG) algorithm and analyze its nonasymptotic convergence towards the optimal set of the original problem (1). The asymptotic convergence of vanishing stepsize SSPG have been analyzed in [24]. 0 n Let x R be a starting point and µk k 0 be a nonincreasing positive sequence of stepsizes.∈ { } ≥ 6 Andrei Patrascu, Paul Irofti

Stochastic Splitting Proximal Gradient algorithm (SSPG): For k 0 compute ≥ 1. Choose randomly ξk Ω w.r.t. probability distribution P 2. Update: ∈

yk = xk µ f(xk; ξ ) − k∇ k k+1 prox k x = h,µk (y ; ξk) 3. If the stoppping criterion holds, then STOP, otherwise k = k + 1. The SSPG iteration k+1 prox k k  is mainly a Stochas- x = h,µk x µk f(x ; ξk); ξk tic Proximal Gradient iteration based on− stochastic∇ proximal maps [24]. Thus, the first step of algorithm SSPG consists of a varying-stepsize (vanilla) stochastic gradi- ent update, while the second step rely on a stochastic proximal update, or equivalently a gradient step in the direction of the randomly sampled gradient of expected Moreau envelope hµ( ). Further results will state that a diminishing stepsize is an appropri- ate choice to· obtain convergence in expectation. By varying our central model, this general SSPG scheme recovers several well-known stochastic first order algorithms. (i) In the smooth case (h = 0), SSPG reduces to vanilla SGD [10]:

xk+1 = xk µ f(xk; ξ ). − k∇ k (ii) By considering proximal-tractable regularizers (i.e. h( ; ξ) = h( )) or simple · · convex sets (i.e. h( ; ξ) = IX ( ), with πX ( ) computable in closed form), then we recover Proximal (or· Projected,· respectively)· SGD [22]:

xk+1 = prox xk µ f(xk; ξ ) or xk+1 = π xk µ f(xk; ξ ) . h,µk − k∇ k X − k∇ k (iii) For nonsmooth objective functions, when f = 0, SSPG is equivalent with SPP iteration [1,18,28]: k+1 prox k x = h,µk (x ; ξk).

(iv) For CFPs, i.e. f = 0, h( ; ξ) = IXξ ( ), the SSPG iteration is simplified since · · k k and k+1 prox k . Thus it recovers Randomized Alternating y = x x = h,µk (x ; ξk) Projections scheme [2]: xk+1 = π (xk). Xξk Therefore, the below results will implicitly represent unifying convergence rates for these algorithms, under stated assumptions.

3 Iteration complexity in expectation

In this section we assume that the function f is strongly convex and derive conver- gence rates for this particular case. We recall a few elementary inequalities that will be used in our proofs. The proof of the following lemma can be found in [15]. Lemma 1 ([15]) Under Assumption 1(i), the following inequality holds:

Lf 2 n f(x; ξ) f(y; ξ) + f(y; ξ), x y + x y x, y R . ≤ h∇ − i 2 k − k ∀ ∈ Stochastic proximal splitting algorithm for composite minimization 7

A second elementary bound the we will use is: let v, x, y Rn and ν > 0, then ∈ 1 ν v, x y + x y 2 v 2. (5) h − i 2ν k − k ≥ − 2 k k

We denote the history of index choices by Ξk = ξ0, , ξk 1 . { ··· − } 1 k Lemma 2 Let Assumption 1 hold and µk . Then the sequence x k 0 gen- ≤ 2Lf { } ≥ erated by SSPG satisfies:

k+1 2 k 2 E[ x x∗ ] (1 σf µk) E[ x x∗ ] k − k ≤ −  k − k  k+1 1 k+1 k 2 + 2µkE F (x∗) F (x ; ξk) x x . − − 4µk k − k

Proof First notice that from the optimality conditions of the subproblem minz h(z; ξ)+ 1 z y 2, the following relation holds: 2µ k − k

k+1 1 k+1 k gh(x ; ξk) + x y = 0. (6) µk − Further we obtain the main recurrence:

k+1 2 k 2 k+1 k k k+1 k 2 x x∗ = x x∗ + 2 x x , x x∗ + x x k − k k − k h − − i k − k k 2 k+1 k k+1 k+1 k 2 = x x∗ + 2 x x , x x∗ x x k − k h − − i − k − k k 2 k k+1 k+1 k+1 k 2 = x x∗ + 2 µ f(x ; ξ ) + µ g (x ; ξ ), x∗ x x x k − k h k∇ k k h k − i−k − k k 2 k k+1 k+1 k 2 x x∗ + 2µ f(x ; ξ ), x∗ x x x ≤ k − k kh∇ k − i − k − k k+1 + 2µk [ h(x∗; ξk) h(x ; ξk)]  − k 2 k k+1 k 1 k+1 k 2 = x x∗ 2µk f(x ; ξk), x x + x x (7) k − k − h∇ − i 4µk k − k  k+1 k k 1 k+1 k 2 + h(x ; ξ ) + 2µ f(x ; ξ ), x∗ x x x + 2µ h(x∗; ξ ). k kh∇ k − i − 2k − k k k

1 By using in (7) the stepsize bound µk and Lemma 1, we obtain: ≤ 2Lf

1 µk ≤ 2Lf k+1 2 k 2 x x∗ x x∗ k − k ≤ k − k  L  2µ f(xk; ξ ), xk+1 xk + f xk+1 xk 2 + h(xk+1; ξ ) − k h∇ k − i 2 k − k k

k k 1 k+1 k 2 + 2µ f(x ; ξ ), x∗ x x x + 2µ h(x∗; ξ ) kh∇ k − i − 2k − k k k Lemma 1   k 2 k+1 k x x∗ 2µ F (x ; ξ ) f(x ; ξ ) ≤ k − k − k k − k

k k 1 k+1 k 2 + 2µ f(x ; ξ ), x∗ x x x + 2µ h(x∗; ξ ). kh∇ k − i − 2k − k k k 8 Andrei Patrascu, Paul Irofti

Further we take expectation w.r.t. ξk in both sides and use the strong convexity prop- k erty (2) (with x = x and y = x∗) to finally derive the following:

k+1 2 k 2 k k+1  E[ x x∗ Ξk] x x∗ + 2µkE[ f(x ; ξk) F (x ; ξk) Ξk] k − k | ≤ k − k − | k k 1  k+1 k 2  + 2µk f(x ), x∗ x E x x Ξk + 2µkh(x∗) h∇ − i − 2 k − k | k 2  k k+1  x x∗ + 2µkE f(x ; ξk) F (x ; ξk) Ξk ≤ k − k − |  k σf k 2 1  k+1 k 2  + 2µk F (x∗) f(x ) x x∗ E x x Ξk − − 2 k − k − 2 k − k |   k 2 k+1 1 k+1 k 2 =(1 σf µk) x x∗ +2µkE F (x∗) F (x ; ξk) x x Ξk . − k − k − − 4µk k − k |

Finally, by taking full expectation E[ ], over the entire index history, we obtain the above result. ·

Further we present some lower bounds on the second term from the right hand side, which will allow the synthesis of general complexity estimates over stochastic and deterministic contexts.

Lemma 3 Given µ > 0, let Assumption 1 hold. Then F satisfies the following rela- tions: given k 0 ≥ h k+1 1 k+1 k 2i  2 (i) E F (x ; ξk) F ∗ + x x µkE gF (x∗; ξ) . − 4µk k − k ≥ − k k (ii) Let Xξ ξ Ω be convex sets satisfying linear regularity with constant κ, such that { T} ∈ X := ξ Ω Xξ = . Also let h( ; ξ) = IXξ ( ), then ∈ 6 ∅ · ·   k+1 1 k+1 k 2 E F (x ; ξk) F ∗ + x x − 4µk k − k  2 4µk 2 κ  2 k  2µkE f(x∗; ξ) f(x∗) + E distX (x ) ≥ − k∇ k − κ k∇ k 16µk

Proof In order to prove (i) let x∗ X∗ and gF (x∗; ξ) ∂F (x∗; ξ), by convexity of f( ; ξ) we have: ∈ ∈ · k+1 1 k+1 k 2 F (x ; ξk) F ∗ + x x − 4µk k − k k+1 1 k+1 k 2 gF (x∗; ξk), x x∗ + x x ≥ h − i 4µk k − k k k+1 k 1 k+1 k 2 gF (x∗; ξk), x x∗ + gF (x∗; ξk), x x + x x . (8) ≥ h − i h − i 4µk k − k On one hand, based on Assumption 1 (iv), we observe that the first term of the right hand side has null expectation:

 k   k  E gF (x∗; ξk), x x∗ = E E [gF (x∗; ξk) Ξk] , x x∗ h − i | − Assump. 1(iv)  k  = E gF (x∗), x x∗ = 0. (9) − Stochastic proximal splitting algorithm for composite minimization 9

On the other hand, for any x∗ X∗, the second term can be lower bounded by (5): ∈

k+1 k 1 k+1 k 2 2 gF (x∗; ξk), x x + x x µk gF (x∗; ξk) . (10) h − i 4µk k − k ≥ − k k

We take full expectation (over the entire index history) in (8) and use relations (9)- (10) to get the first part (i).

To derive the second part (ii), we proceed slightly different:

k+1 1 k+1 k 2 F (x ; ξk) F ∗ + x x − 4µk k − k k+1 1 k+1 k 2 k+1 1 k+1 k 2 = f(x ; ξk) f ∗ + x x + h(x ; ξk) + x x − 8µk k − k 8µk k − k k+1 1 k+1 k 2 1 k 2 f(x ; ξk) f ∗ + x x + min h(z; ξk) + z x ≥ − 8µk k − k z 8µk k − k k+1 1 k+1 k 2 k f(x∗; ξk), x x∗ + x x + h4µk (x ; ξk) ≥ h∇ − i 8µk k − k k = f(x∗; ξ ), x x∗ + h∇ k − i k+1 k 1 k+1 k 2 k f(x∗; ξk), x x + x x + h4µk (x ; ξk). (11) h∇ − i 8µk k − k

We proceed to bound each term of right hand side, similarly as in (i). By using the optimality condition (i.e. f(x∗), z x∗ 0 for all z X), the expectation of the first term can be lowerh∇ bounded as− follows:i ≥ ∈

 k  E f(x∗; ξk), x x∗ h∇ − i  k  = E E [ f(x∗; ξk) Ξk] , x x∗ ∇ | −  k  = E f(x∗), x x∗ ∇ −  k k k  = E f(x∗), πX (x ) x∗ + f(x∗), x πX (x ) h∇ − i h∇ − i O.C.  k k  E f(x∗), x πX (x ) ≥ h∇ − i C.S. q  k   2 k  E f(x∗) distX (x ) f(x∗) E dist (x ) , (12) ≥ −k k ≥ −k k X

where in the second inequality we used Cauchy-Schwartz inequality and in the last p one we used E[Y ] E[Y 2]. After taking full expectation in (11) and using (5), ≤ 1 h 2 i κ 2 (12) and the linear regularity property hµ(x) := E dist (x) dist (x) 2µ Xξ ≥ 2µ X 10 Andrei Patrascu, Paul Irofti

yields the following:   k+1 1 k+1 k 2 E F (x ; ξk) F ∗ + x x − 4µk k − k (5)+(12) q  2  2 k  k 2µkE f(x∗; ξ) f(x∗) E dist (x ) + E[h4µk (x )] ≥ − k∇ k − k k X lin.reg. q 2  2 k  κ 2 k 2µkE[ f(x∗; ξ) f(x∗) E distX (x ) + E[distX (x )] ≥ − k∇ k − k k 8µk (5)  2 4µk 2 κ 2 k 2µkE f(x∗; ξ) f(x∗) + E[distX (x )]. ≥ − k∇ k − κ k∇ k 16µk In the last inequality we also used (5).

Next we present the main recurrences which will finally generate our nonasymptotic convergence rates.

k Theorem 2 Let Assumptions 1 hold, then the SSPG sequence x k 0 satisfies: { } ≥ 1 (i) Let µk for all k 0, then: ≤ 2Lf ≥ k+1 2 k 2 2 E[ x x∗ ] (1 σf µk)E[ x x∗ ] + µ Σ, k − k ≤ − k − k k  2 where Σ = 2E gF (x∗; ξ) . k k (ii) In particular, let h( ; ξ) = IXξ ( ) such that Xξ ξ Ω are linearly regular sets with constant κ. Then· the following· recurrence{ holds:} ∈

k+1 2 k 2 2 κ 2 k E[ x x∗ ] (1 σf µk)E[ x x∗ ] + µ Σ E[dist (x )], k − k ≤ − k − k k − 8 X

 2 8 2 where Σ = 4E f(x∗; ξ) + f(x∗) . k∇ k κ k∇ k Proof The proof result straightforwardly from Lemmas 2 and 3.

1 Remark 2 Consider deterministic setting F ( ; ξ) = F ( ) and µk = , then SSPG · · 2Lf becomes the proximal gradient algorithm and Theorem 2(i) holds with gF (x∗; ξ) = gF (x∗) = 0, implying that Σ = 0. Thus the well-known iteration complexity es-   timate Lf log(1/) [4,14] of proximal gradient algorithm is recovered up to a O σf constant from Theorem 2.

The above recurrences generates immediately the following convergence rates.

Corollary 3 Under Assumption 1, the following convergence rates hold: 1 k 2 1  (i) Let µk = γ , γ (0, 1) then: E[ x x∗ ] γ k ∈ k − k ≤ O k  1  if  k µ0σf > e 1 1 k 2 O ln k  − (ii) Let µk = , then : E[ x x∗ ] if µ0σf > e 1 k O k − k − k ≤  2 ln(1+µ0σf )  1  if µ σ < e 1. O k 0 f − Stochastic proximal splitting algorithm for composite minimization 11

(iii) For constant stepsize µk = µ > 0, the recurrence from Theorem 2(i) implies:

k 2 k 0 2 µ E[ x x∗ ] (1 µσf ) x x∗ + Σ, (13) k − k ≤ − k − k σf  2 where Σ = 2E gF (x∗; ξ) . k k (iv) Moreover, consider convex feasibility problem where f = 0, h( ; ξ) = IXξ ( ) · · with κ linearly regular sets Xξ ξ Ω and constant stepsize µk = µ. Then the SSPG − k { } ∈ sequence x k 0 converges linearly as follows: { } ≥ k 2 k  κ 2 0 E[dist (x )] 1 dist (x ). X ≤ − 8 X Proof The proof for the first two results (i) and (ii) follows similar lines with [18, Corrolary 15]. However, for completeness we present it in appendix. For (iii), notice that Theorem 2(i) straightforwardly implies: k 2 k 1 2 2 E[ x x∗ ] (1 σf µ)E[ x − x∗ ] + µ Σ k − k ≤ − k − k k 1 k 0 2 2 X− i (1 µσ ) x x∗ + µ Σ (1 µσ ) ≤ − f k − k − f i=0 k 0 2 µ k (1 µσf ) x x∗ + Σ[1 (1 µσf ) ] ≤ − k − k σf − − k 0 2 µ (1 µσf ) x x∗ + Σ. ≤ − k − k σf

Now, let CFP settings f = 0, h( ; ξ) = IXξ ( ) hold. Theorem 2(ii) states that: · · k+1 2 k 2 κ 2 k E[ x x∗ ] E[ x x∗ ] E[dist (x )] x∗ X∗. k − k ≤ k − k − 8 X ∀ ∈ T k Since in this case, X = ξ Xξ = X∗, then by choosing x∗ = πX (x ) and using in the LHS that xk+1 π (xk) dist (xk+1), then we obtain: k − X k ≥ X 2 k+1 k+1 k 2  κ 2 k E[dist (x )] E[ x πX (x ) ] 1 E[dist (x )], X ≤ k − k ≤ − 8 X which yields the linear convergence rate of SSPG. Remark 3 Although sublinear (1/k) convergence rates for strongly convex ob- jectives are typically obtained inO literature for many first-order stochastic schemes [10,18], these rates are novel due to their generalization potential. Regarding second rate (ii), although it expresses a geometric decrease of the initial   residual term, this rate states that, after 1 log 1  iterations, the sequence O µσf  k x k 0 will remain (in expectation) in a bounded neighborhood of the optimal point { } ≥ 2 µ x : x x∗ Σ . This fact suggests that only sufficiently small constant { k − k ≤ σf } stepsizes guarantee the convergence of SSPG sequence. In the convex feasibility setting, SSPG reduces to Randomized Alternating Projec- tions algorithm for which the obtained linear rate obeys the rates from the literature up to a constant. However, we believe that using some refinements of proof arguments there might be obtained optimal rates w.r.t. the constants. 12 Andrei Patrascu, Paul Irofti

4 Application to Parametric Sparse Representation

Sparse representation [7] starts with the given signal y Rm and aims to find the ∈ sparse signal x Rn by projecting y to a much smaller subspace through the over- ∈ m n complete dictionary T R × . Among the many Dictionary Learning techniques, we focus on the multi-parametric∈ cosparse model proposed in [27], since it could effi- ciently illustrates the empirical capabilities of SSPG. The multi-parametric cosparse representation problem is given by: 2 min T x y 2 x k − k (14) s.t. ∆x δ, k k1 ≤ where T and x correspond to the dictionary and, respectively, the resulting sparse rep- p n resentation, with sparsity being imposed on a scaled subspace ∆x with ∆ R × . In 1 2 ∈ pursuit of (1), we move to the exact penalty problem minx 2m T x y 2 + λ ∆x 1. In order to limit the solution norm we further regularize the unconstrainedk − k objectivek k using an `2 term as follows:

1 2 α 2 min T x y 2 + λ ∆x 1 + x 2. (15) x 2mk − k k k 2 k k The decomposition which puts the above formulation into model (1) consists of: 1 α f(x; ξ) = (T x y )2 + x 2 (16) 2 ξ − ξ 2 2 k k2

where Tξ represents line ξ of matrix T , and h(x; ξ) = mλ ∆ x . (17) | ξ | To compute the SSPG iteration for the sparse representation problem, we note that 1 2 the proximal operator proxh,µ(x; ξ) = arg min mλ ∆ξz + z x is given by z n | | 2µ k − k ∈R ( ∆ξx T ∆ξx x 2 ∆ if | | 2 µ ∆ξ ξ mλ ∆ξ prox (x; ξ) = − k k k k ≤ h,µ T ∆ξx x µmλsgn(∆ x)∆ if | | 2 > µ ξ ξ mλ ∆ξ − k k Also, the gradient of the smooth function f( ; ξ) is given by · f(x; ξ) = T T T + αI  x T T y ∇ ξ ξ n − ξ ξ and with that we are ready to formulate the resulting particular variant of SSPG. SSPG - Sparse Representation (SSPG-SR): For k 0 compute ≥ 1. Choose randomly ξk Ω w.r.t. probability distribution P 2. Update: ∈ ∆ yk yk = I µ T T T + αI  xk + µ T T y , β = ξk n k ξk ξk n k ξk ξk k 2 − ∆ξk ( k k k T if k+1 y βk∆ξk βk mλµ x = k − T | | ≤ y µmλsgn(βk)∆ if βk > mλµ − ξk | | 3. If the stopping criterion holds, then STOP, otherwise k = k + 1. Stochastic proximal splitting algorithm for composite minimization 13

cvx 140 prox sspg 120

100

80 time (s) 60

40

20

0

0 20 40 60 80 100 120 atoms

Fig. 1 Increasing problem dimension. Features are 4 times the number of atoms. (α = 0.5, λ = 5, ε = 10−6)

We use randomly generated data with a standard normal distribution. 1 In the follow- ing we present numerical simulations that show-case the application of SSPG to the multi-parametric sparse representation problem and compare it to CVX and Proxi- mal Gradient method (denoted prox) [14]. Notice that at each iteration Proximal k Gradient method requires, given current point xPG, the computation of: – gradient f(xk ) = 1 T T (T xk y) ∇ PG m PG − – proximal update z(xk ) = arg min λ ∆z + L z (xk 1 f(xk )) 2 PG z k k1 2 k − PG − L ∇ PG k Our first experiment investigates the impact of the problem size on execution time. We vary the atoms in the dictionary starting from n = 5 up to n = 125 in increments of 5 and maintain a ratio of 4 features per atom such that m = 4n. First ? 6 CVX is used to determine x within a ε = 10− margin. Then Proximal Gradient method and SSPG-SR are executed until x? is reached with the same approximation error. Figure 1 depicts the results. The experiment is stopped after n = 125 when CVX execution times become too large. We can clearly see a similar yet gentler increasing trend for Proximal Gradient method (prox) and indeed other experiments have shown it to reach an impas shortly after. SSPG-SR on the other hand maintains a steady pace throughout the tests. Next, we focus on the iterations of Proximal Gradient method and SSGP-SR. To this end we design a similar experiment consisting of 4 tests where we vary the

1 Data generating code available at https://github.com/pirofti/SSPG 14 Andrei Patrascu, Paul Irofti

time (s) time (s) 0 1 2 0 2 4 6 8 107 105 103 100

2 1 2

∥ 10− ∥ ⋆ ⋆

x 5 x 5 10− − 10− m = 30 m = 180 − x x

∥ 10 ∥ 9 10− 10− prox 13 10 15 10− sspg −

104 104

1 1

2 10− 10− 2 ∥ ∥ ⋆ ⋆

x 6 6 x 10− 10− − m = 330 m = 480 − x x

∥ 11 11 ∥ 10− 10−

16 16 10− 10−

0 10 20 0 20 40 time (s) time (s)

Fig. 2 Proximal Gradient method versus 10 rounds of SSPG iterations. Features are 6 times the number of atoms. (α = 0.2, λ = 5 · 10−4, ε = 10−6) problem size by increasing the number of features with a ratio of 6 features per atom ? 6 (m = 6n). As before, we use CVX to determine x with ε = 10− . With identical parameters and initialization we execute 10 rounds of SSPG-SR and compare them to the Proximal Gradient method iterations. We time the execution of each iteration and record its progress xk x? . The result is shown in Figure 2. In all 4 panels it is clearly seen that thek 10− SSPG-SRk rounds perform much faster than Proximal Gradient method at least in a first phase. Further increases of the problem dimension lead to a stall in Proximal Gradient method. While SSPG-SR reaches a solution in less than a second in all the tests, Proximal Gradient method goes from 5 seconds in the first, to 9, 24 and 50 in the last. We note that even though the stopping criterion 6 was the same for both methods (ε = 10− ), Proximal Gradient method gets us much closer to x? because, in our parameter setting, the SR problem is well-conditioned.

5 Conclusion

In this paper we presented preliminary guarantees for stochastic gradient schemes with stochastic proximal updates, which unify some well-known schemes in the liter- ature. For future work, would be interesting to analyze convergence rate on (non)convex functions satisfying more relaxed convexity conditions (i.e. quadratic growth) and the empirical behaviour of SSPG scheme under various stepsize choices. Stochastic proximal splitting algorithm for composite minimization 15

References

1. Hilal Asi and John C Duchi. Stochastic (approximate) proximal point meth- ods: Convergence, optimality, and adaptivity. SIAM Journal on Optimization, 29(3):2257–2290, 2019. 2. Heinz Bauschke, Frank Deutsch, Hein Hundal, and Sung-Ho Park. Accelerating the convergence of the method of alternating projections. Transactions of the American Mathematical Society, 355(9):3433–3461, 2003. 3. Heinz H Bauschke, Jonathan M Borwein, and Wu Li. Strong conical hull in- tersection property, bounded linear regularity, jameson’s property (g), and error bounds in . Mathematical Programming, 86(1):135–160, 1999. 4. Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences, 2(1):183–202, 2009. 5. Pascal Bianchi. Ergodic convergence of a stochastic proximal point algorithm. SIAM Journal on Optimization, 26(4):2235–2260, 2016. 6. Damek Davis and Dmitriy Drusvyatskiy. Stochastic model-based minimization of weakly convex functions. SIAM Journal on Optimization, 29(1):207–239, 2019. 7. Michael Elad. Sparse and redundant representations: from theory to applications in signal and image processing. Springer Science & Business Media, 2010. 8. David Hallac, Jure Leskovec, and Stephen Boyd. Network lasso: Clustering and optimization in large graphs. In Proceedings of the 21th ACM SIGKDD inter- national conference on knowledge discovery and data mining, pages 387–396, 2015. 9. Jayash Koshal, Angelia Nedic, and Uday V Shanbhag. Regularized iterative stochastic approximation methods for stochastic variational inequality problems. IEEE Transactions on Automatic Control, 58(3):594–609, 2012. 10. Eric Moulines and Francis R Bach. Non-asymptotic analysis of stochastic ap- proximation algorithms for machine learning. In Advances in Neural Information Processing Systems, pages 451–459, 2011. 11. Angelia Nedic.´ Random projection algorithms for convex set intersection prob- lems. In 49th IEEE Conference on Decision and Control (CDC), pages 7655– 7660. IEEE, 2010. 12. Angelia Nedic.´ Random algorithms for convex minimization problems. Mathe- matical programming, 129(2):225–253, 2011. 13. Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on optimization, 19(4):1574–1609, 2009. 14. Yu Nesterov. Gradient methods for minimizing composite functions. Mathemat- ical Programming, 140(1):125–161, 2013. 15. Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013. 16. Lam M Nguyen, Phuong Ha Nguyen, Marten van Dijk, Peter Richtarik,´ Katya Scheinberg, and Martin Taka´c.ˇ Sgd and hogwild! convergence without the 16 Andrei Patrascu, Paul Irofti

bounded gradients assumption. arXiv preprint arXiv:1802.03801, 2018. 17. Andrei Patras¸cu.˘ New nonasymptotic convergence rates of stochastic proximal point algorithm for stochastic convex optimization. Optimization, pages 1–29, 2020. 18. Andrei Patrascu and Ion Necoara. Nonasymptotic convergence of stochastic proximal point methods for constrained convex optimization. The Journal of Machine Learning Research, 18(1):7204–7245, 2017. 19. R Tyrrell Rockafellar and Roger J-B Wets. Variational analysis, volume 317. Springer Science & Business Media, 2009. 20. R.T. Rockafellar. Convex Analysis. Princeton University Press, Princeton, New Jersey, 1988. 21. R.T. Rockafellar and R.J.-B. Wets. On the interchange of subdifferentiation and conditional expectation for convex functionals. Stochastics, 7(1):173–182, 1982. 22. Lorenzo Rosasco, Silvia Villa, and Bang Congˆ Vu.˜ Convergence of stochastic proximal gradient algorithm. Applied Mathematics & Optimization, pages 1–27, 2019. 23. Ernest K Ryu and Stephen Boyd. Stochastic proximal iteration: a non-asymptotic improvement upon stochastic gradient descent. Author website, early draft, 2016. 24. Adil Salim, Pascal Bianchi, and Walid Hachem. Snake: a stochastic proximal gradient algorithm for regularized problems over large graphs. IEEE Transac- tions on Automatic Control, 64(5):1832–1847, 2019. 25. Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. Pega- sos: Primal estimated sub-gradient solver for svm. Mathematical programming, 127(1):3–30, 2011. 26. Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. A proximal gradient algorithm for decentralized composite optimization. IEEE Transactions on Signal Processing, 63(22):6013–6023, 2015. 27. Florin Stoican and Paul Irofti. Aiding dictionary learning through multi- parametric sparse representation. Algorithms, 12(7):131, 2019. 28. Panos Toulis, Dustin Tran, and Edo Airoldi. Towards stability and optimality in stochastic gradient descent. In Artificial Intelligence and Statistics, pages 1290– 1298, 2016. 29. Rohan Varma, Harlin Lee, Jelena Kovacevic, and Yuejie Chi. Vector-valued graph trend filtering with non-convex penalties. IEEE Transactions on Signal and Information Processing over Networks, 2019. 30. Mengdi Wang and Dimitri P Bertsekas. Stochastic first-order methods with ran- dom constraint projection. SIAM Journal on Optimization, 26(1):681–717, 2016. 31. Xiao Wang, Shuxiong Wang, and Hongchao Zhang. Inexact proximal stochastic gradient method for convex composite optimization. Computational Optimiza- tion and Applications, 68(3):579–618, 2017. 32. Yael Yankelevsky and Michael Elad. Dual graph regularized dictionary learn- ing. IEEE Transactions on Signal and Information Processing over Networks, 2(4):611–624, 2016. 33. Wenliang Zhong and James Kwok. Accelerated stochastic gradient method for composite regularization. In Artificial Intelligence and Statistics, pages 1086– 1094, 2014. Stochastic proximal splitting algorithm for composite minimization 17

6 Appendix

Proof (of Corollary 3) For simplicity denote θk = (1 µkσf ) , then Theorem 2 implies that: −

k ! k  k   k+1 2 Y 0 2 X Y 2 E x x∗ θi x x∗ + Σ  θj µ . k − k ≤ k − k i i=0 i=0 j=i+1

1 t By using the Bernoulli inequality 1 tx 1+tx (1 + x)− for t [0, 1], x 0, then we have: − ≤ ≤ ∈ ≥

u u u u P 1 γ Y Y  µ0  Y 1/iγ − i θ = 1 σ (1 + µ σ )− = (1 + µ σ ) i=l . (18) i − iγ f ≤ 0 f 0 f i=l i=l i=l On the other hand, if we use the lower bound

u u+1 X 1 Z 1 dτ = ϕ1 γ (u + 1) ϕ1 γ (l). (19) iγ ≥ τ γ − − − i=l l then we can finally derive:

k  k  m  k  k  k  X Y 2 X Y 2 X Y 2  θj µi =  θj µi +  θj µi i=0 j=i+1 i=0 j=i+1 i=m+1 j=i+1

m k  k  (18)+(19) X ϕ1−γ (i+1) ϕ1−γ (k) 2 X Y (1 + µ0σf ) − µ + µm+1  (1 µjσf ) µi ≤ i − i=0 i=m+1 j=i+1 m ϕ1−γ (m) ϕ1−γ (k) X 2 (1 + µ σ ) − µ ≤ 0 f i i=0 k  k  µm+1 X Y +  (1 µjσf ) (1 (1 σf µi)) σ − − − f i=m+1 j=i+1 m ϕ1−γ (m) ϕ1−γ (k) 2 X 1 = (1 + µ σ ) − µ 0 f 0 i2γ i=0 k  k k  µm+1 X Y Y +  (1 µjσf ) (1 µjσf ) σ − − − f i=m+1 j=i+1 j=i   1 2γ k ϕ1−γ (m) ϕ1−γ (k) m − 1 µm+1 Y (1 + µ0σf ) − − + 1 (1 µjσf ) ≤ 1 2γ σ − − − f j=m+1

ϕ1−γ (m) ϕ1−γ (k) µm+1 (1 + µ0σf ) − ϕ1 2γ (m) + . ≤ − σf 18 Andrei Patrascu, Paul Irofti

By denoting the second constant θ˜ = 1 , then the last relation implies the 0 1+µ0σf following bound:

 k+1 2 ˜ϕ1−γ (k) 0 2 ˜ϕ1−γ (k) ϕ1−γ (m) µm+1 E x x∗ θ0 x x∗ + θ0 − ϕ1 2γ (m)Σ + Σ. k − k ≤ k − k − σf

2 k 2 Denote rk = E[ x x∗ ]. To derive an explicit convergence rate order we analyze upper bounds onk function− kφ. (i) First assume that γ (0, 1 ). This implies that 1 2γ > 0 and that: ∈ 2 − 1 2γ 1 2γ     k  − k  − k k 2 1 2 ϕ1 2γ ϕ1 2γ = − . (20) − 2 ≤ − 2 1 2γ ≤ 1 2γ − − x 1 On the other hand, by using the inequality e− for all x 0, we obtain: ≤ 1+x ≥

k−2     ϕ1−γ (k) ϕ1−γ ( ) k k−2 ˜ k ˜ − 2 (ϕ1−γ (k) ϕ1−γ ( 2 )) ln θ0 θ0 ϕ1 2γ = e − ϕ1 2γ − 2 − 2 1−2γ k  k ϕ1 2γ (20) 21−2γ (1 2γ) − 2 − ≤ 1 + [ϕ (k) ϕ ( k 1)] ln 1 ≤ 1 [k1 γ ( k 1)1 γ ] ln 1 1 γ 1 γ 2 θ˜ 1 γ − 2 − θ˜ − − − − 0 − − − 0 1−2γ k γ γ   21−2γ (1 2γ) 1 γ 2 k− 1 = − = − = . k1−γ 1 1 γ 1 1 2γ 1 1 γ 1 γ [1 ( ) ] ln 1 2γ 2 − [1 ( ) − ] ln O k 1 γ 6 − θ˜ 6 θ0 − − 0 − − Therefore, in this case, the overall rate will be given by:

1−γ     2 (k ) 2 1 1 r θO r + . k+1 ≤ 0 0 O kγ ≈ O kγ

1 k If γ = , then the definition of ϕ1 2γ ( ) provides that: 2 − 2     2 (√k) 2 (√k) 1 1 r θ˜O r + θ˜O (ln k) + . k+1 ≤ 0 0 0 O O √k ≈ O √k

1 k  1 When γ ( 2 , 1), it is obvious that ϕ1 2γ 2 2γ 1 and therefore the order of the convergence∈ rate changes into: − ≤ −

1−γ     2 (k ) 2 1 1 r θ˜O [r + (1)] + . k+1 ≤ 0 0 O O kγ ≈ O kγ

1 ln ˜ ˜ln k+1 1  θ0 (ii) Lastly, if γ = 1, by using θ0 k we obtain the second part of our result. ≤