
arXiv:1601.05972v3 [math.SP] 1 Jul 2017

On the Graph Fourier Transform for Directed Graphs

Stefania Sardellitti, Member, IEEE, Sergio Barbarossa, Fellow, IEEE, and Paolo Di Lorenzo, Member, IEEE

S. Sardellitti and S. Barbarossa are with the DIET Dept., Sapienza University of Rome, Via Eudossiana 18, 00184 Rome, Italy (e-mail: [email protected], [email protected]). P. Di Lorenzo is with the Dept. of Engineering, University of Perugia, Via G. Duranti 93, 06125 Perugia, Italy (e-mail: [email protected]). This work has been supported by the TROPIC Project, Nr. ICT-318784. The work of P. Di Lorenzo was funded by the "Fondazione Cassa di Risparmio di Perugia". A Matlab code to implement the algorithms proposed in this paper is available at https://sites.google.com/site/stefaniasardellitti/code-supplement.

Abstract—The analysis of signals defined over a graph is relevant in many applications, such as social and biological networks, big data, or smart grids, and a key tool for analyzing these signals is the so-called Graph Fourier Transform (GFT). Alternative definitions of the GFT have been suggested in the literature, based on the eigen-decomposition of either the graph Laplacian or the adjacency matrix. In this paper, we address the general case of directed graphs and we propose an alternative approach that builds the graph Fourier basis as the set of orthonormal vectors that minimize a continuous extension of the graph cut size, known as the Lovász extension. To cope with the non-convexity of the problem, we propose two alternative iterative optimization methods, properly devised for handling the orthogonality constraints. Finally, we extend the method to minimize a continuous relaxation of the balanced cut size. The resulting problem is again non-convex and we propose an efficient solution method based on an explicit-implicit gradient algorithm.

Index Terms—Graph signal processing, Graph Fourier Transform, total variation, clustering.

I. INTRODUCTION

Graph signal processing (GSP) has attracted a lot of interest in the last years because of its many potential applications, from social and economic networks to gene regulatory networks, smart grids, and so on. GSP represents a promising tool for the processing and analysis of discrete signals defined on the vertices of a (possibly weighted) graph. Many recent works in the literature attempt to extend classical discrete-time signal processing (DSP) theory from signals defined over time, or images, to signals defined over the vertices of a graph [1]–[3], by introducing the basic concepts of graph-based filtering [7], graph-based transforms [8]–[12], and sampling and uncertainty principles. A central role in GSP is played by spectral analysis, which is based on the introduction of the so-called Graph Fourier Transform (GFT). Alternative definitions of the GFT have been introduced, coming from different motivations, like building a basis with minimal variation, filtering, etc.; see, e.g., [4], [5], [8], [13], [14]. Two basic approaches have been suggested. The first one is rooted in spectral graph theory and uses the graph Laplacian as the reference unit; see, e.g., [5] and the references therein. This approach applies to undirected graphs: the Fourier basis is constituted by the eigenvectors of the graph Laplacian, which represent the basis that minimizes the ℓ2-norm graph total variation. The approach is well motivated on undirected graphs, where minimizing the ℓ2-norm total variation is equivalent to minimizing the quadratic form built on the Laplacian matrix, so that an orthonormal basis is guaranteed to exist. However, these properties do not hold anymore in the directed graph case. An alternative approach, valid for the more general and challenging case of directed graphs, was proposed in [1], [4]. That method builds the GFT basis on the Jordan decomposition of the adjacency matrix and uses the associated generalized eigenvectors as the basis. This second method is rooted in the association of the adjacency matrix with the shift operator, which is at the basis of the algebraic framework for shift-invariant linear filtering of graph signals [15], [16], and it paved the way to the GFT definition proposed in [4]. However, the framework proposed in [4] raises some important issues requiring further investigation. First, the basis vectors are in general linearly independent, but not orthogonal, so that the resulting transform is not unitary and does not preserve scalar products. Second, the total variation introduced in [4] does not respect some desirable properties; for example, a constant signal does not have zero total variation [17], [18]. Finally, it is well known that the numerical computation of the Jordan decomposition often incurs numerical instabilities, even for moderate-size matrices [19], although alternative decomposition methods have recently been suggested to tackle this instability [20].

In some applications, one of the major motivations for using the GFT is the analysis of signals defined over graphs exhibiting clustering properties, i.e. signals that are smooth within subsets of highly interconnected nodes (clusters), while they can vary arbitrarily across different clusters. In such cases, the GFT of these signals is typically sparse and its sparsity carries relevant information on the data under analysis. These signals are, in the graph setting, the analogy of band-limited time signals. Within the machine learning context, GSP can play a key role in unsupervised and semi-supervised learning, as suggested in [21], [22]. In these applications, the input is a point cloud and the goal is to detect clusters, with limited or without supervision. Graph-based methods tackle these problems by associating a graph to the point cloud: the vertices are the points themselves, whereas edges between pairs of points are established if two points are sufficiently close. The goal of clustering/classification is to associate a different label to each cluster. If we look at the labels as a signal defined over the points (vertices), this signal is band-limited by construction [21], [22].

In this paper, we propose a novel alternative approach to build the GFT basis for the general case of directed graphs. Rather than starting from the decomposition of one of the graph matrix descriptors, either adjacency or Laplacian, we start by identifying an objective function to be minimized, and then we build an orthogonal matrix that minimizes that objective function. More specifically, we choose as objective function the graph cut size, as its minimization leads to identifying clusters. We consider the general case of directed graphs, which subsumes undirected graphs as a particular case. The cut function is a set function and its minimization is NP-hard; however, exploiting the sub-modularity property of the cut size, it has been shown that there exists a lossless convex relaxation of the cut size, named its Lovász extension [23], [24], whose minimization preserves the optimality of the solution of the original non-convex problem. Interestingly, the Lovász extension of the cut size gives rise to an alternative definition of total variation of a graph signal that captures the edges' directivity. Furthermore, in the case of undirected graphs, the Lovász extension reduces to the ℓ1-norm total variation of a graph signal, which represents the discrete counterpart of the total variation of continuous-time signals, a quantity playing a fundamental role in the continuous-time Fourier Transform; see, e.g., [17], [13]. We define the GFT basis as the set of orthonormal vectors that minimize the Lovász extension of the cut size. Unfortunately, even though the objective function is convex, the resulting problem is non-convex because of the orthogonality constraint imposed on the basis vectors. Thus, to find a (possibly local) solution of the problem in an efficient manner, we exploit two recently developed methods that are specifically tailored to handle non-convex orthogonality constraints, namely, the splitting orthogonality constraints (SOC) method [25], and the proximal alternating minimized augmented Lagrangian (PAMAL) method [26]. The SOC method is quite simple to implement and, even if no convergence proof has been provided yet, extensive numerical results validate the effectiveness and robustness of such a strategy. Conversely, the PAMAL algorithm, which hybridizes the augmented Lagrangian method and the proximal minimization scheme, is known to guarantee convergence. Furthermore, any limit point of each sequence generated by the PAMAL method satisfies the Karush-Kuhn-Tucker conditions of the original non-convex problem [26]. Finally, to prevent the resulting basis vectors from being excessively sparse, we consider the minimization of a continuous relaxation of the balanced cut size. To solve the corresponding non-convex fractional problem, we adopt an efficient and convergent algorithm based on the explicit-implicit gradient method [27].

The paper is organized as follows. Sec. II introduces the graph signal variations as the continuous Lovász extension of the min-cut size. In Sec. III, we define the GFT as the set of optimal orthonormal vectors minimizing the graph signal variation, and in Sec. IV we illustrate the optimization methods used for solving the resulting non-convex problem. Then, in Sec. V we conceive the GFT as the solution of a balanced min-cut problem, while Sec. VI illustrates some numerical examples validating the effectiveness of the proposed approaches. Finally, Sec. VII draws some conclusions.

II. MIN-CUT SIZE AND ITS LOVÁSZ EXTENSION

In this section, we recall the definitions of cut size and Lovász extension, as they will form the basic tools for our definition of the GFT. We consider a graph G = {V, E} consisting of a set of N vertices (or nodes) V = {1, ..., N} along with a set of edges E = {a_ij}, i, j ∈ V, such that a_ij > 0 if there is a direct link from node j to node i, or a_ij = 0 otherwise. We denote by |V| the cardinality of V, i.e. the number of elements of V. A signal s on a graph G is defined as a mapping from the vertex set to a real vector of size N = |V|, i.e. s : V → R. Let A denote the N × N adjacency matrix with entries given by the edge weights a_ij for i, j = 1, ..., N. The graph Laplacian is defined as L := D − A, where the in-degree matrix D is a diagonal matrix whose ith diagonal entry is d_i = Σ_j a_ij.

One of the basic operations over graphs is clustering, i.e. the partition of the graph into disjoint subgraphs, such that the vertices within each subgraph (cluster) are highly interconnected, whereas there are only a few links between different clusters. Finding a good partition can be formulated as the minimization of the cut size [28], whose definition is reported here below. Let us consider a subset of vertices S ⊂ V, and its complement set in V denoted by S̄. The edge boundary of S is defined as the set of edges with one end in S and the other end in S̄. The cut size between S and S̄ is defined as the sum of the weights over the boundary [28], i.e.

cut(S, S̄) := Σ_{i∈S, j∈S̄} a_ji.   (1)

Finding the partition that minimizes the cut size in (1) is an NP-hard problem. To overcome this difficulty, we exploit the sub-modularity property of the cut size [24], which ensures that its Lovász extension is a convex function [24]. We briefly recall some of the main definitions and properties here below.

Given the set V and its power set 2^V, i.e. the set of all its subsets, let us consider a real-valued set function F : 2^V → R. The cut size in (1) is an example of set function, with F(S) := cut(S, S̄). Every element of the power set 2^V may be associated to a vertex of the hyper-cube {0, 1}^N. Namely, a set S ⊆ V can be uniquely identified with the indicator vector 1_S, i.e. the vector which is 1 at entry j if j ∈ S, and 0 otherwise. Then, a set function F can be defined on the vertices of the hyper-cube {0, 1}^N. The Lovász extension of a set function F [23], [24] allows the extension of a set function defined on the vertices of the hyper-cube {0, 1}^N to the full hypercube [0, 1]^N, and hence to the entire space R^N. We recall its definition hereafter.

Definition 1: Let F : 2^V → R be a set function with F(∅) = 0. Let x ∈ R^N be ordered w.l.o.g. in increasing order, such that x_1 ≤ x_2 ≤ ... ≤ x_N. Define C_0 ≜ V and C_i ≜ {j ∈ V : x_j > x_i} for i > 0. Then, the Lovász extension f : R^N → R of F, evaluated at x, is given by:

f(x) = Σ_{i=1}^{N} x_i (F(C_{i−1}) − F(C_i))
     = Σ_{i=1}^{N−1} F(C_i)(x_{i+1} − x_i) + x_1 F(V).   (2)
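Definition 1 can be turned into a short numerical check. The sketch below (our illustration; the toy adjacency matrix is an assumption, with the paper's convention that A[i, j] = a_ij weighs the edge from node j to node i) computes the Lovász extension of the cut-size set function and verifies the interpolation property f(1_S) = F(S):

```python
import numpy as np

def lovasz_extension(F, x):
    """Lovász extension of Definition 1: sort x increasingly and accumulate
    f(x) = sum_i x_(i) * (F(C_{i-1}) - F(C_i)), with C_0 = V and C_i the
    suffix sets of the sorted order."""
    order = np.argsort(x)                  # indices of x in increasing order
    xs = np.asarray(x, dtype=float)[order]
    N = len(xs)
    f = 0.0
    C_prev = set(range(N))                 # C_0 = V
    for i in range(N):
        C_i = set(order[i + 1:])           # vertices ranked strictly above i
        f += xs[i] * (F(C_prev) - F(C_i))
        C_prev = C_i
    return f

# Toy directed graph; A[i, j] = a_ij = weight of the edge from j to i.
A = np.array([[0, 2, 0],
              [0, 0, 1],
              [3, 0, 0]], dtype=float)

def cut(S):
    """Cut size of eq. (1): F(S) = sum_{i in S, j in Sbar} a_ji."""
    Sbar = set(range(3)) - set(S)
    return float(sum(A[j, i] for i in S for j in Sbar))

# Interpolation property: f agrees with F on all indicator vectors.
for S in [set(), {0}, {0, 1}, {1, 2}]:
    ind = np.array([1.0 if j in S else 0.0 for j in range(3)])
    assert np.isclose(lovasz_extension(cut, ind), cut(S))
```

On general (non-indicator) vectors, this same routine evaluates the convex relaxation whose closed form for the cut size is derived in Sec. III.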

Note that f(x) is piecewise affine w.r.t. x, and F(S) = f(1_S) for all S ⊆ V. An interesting class of set functions is given by the submodular set functions, whose definition follows next.

Definition 2: A set function F : 2^V → R is submodular if and only if, ∀ A, B ⊆ V, it satisfies the following inequality:

F(A) + F(B) ≥ F(A ∪ B) + F(A ∩ B).

A fundamental property of a submodular set function is that its Lovász extension is a convex function. This is formally stated in the following proposition [24, p. 23].

Proposition 1: Let F : 2^V → R be a submodular function and f be its Lovász extension. Then, it holds

min_{S⊆V} F(S) = min_{x∈{0,1}^N} f(x) = min_{x∈[0,1]^N} f(x).

Moreover, the set of minimizers of f(x) on [0, 1]^N is the convex hull of the minimizers of f(x) on {0, 1}^N.

The cut size function in (1) is known for being submodular; see, e.g., [24], [29]. More specifically, as shown in [24, p. 54], the cut function is equal to the positive linear combination of the functions G_ij : S ↦ (1_S)_i [1 − (1_S)_j], i.e.

cut(S) = Σ_{i,j∈V} a_ji G_ij(S).

The function G̃_ij is the extension to V of a function G_ij defined only on the power set of {i, j}, where G_ij({i}) = 1 and all other values are zero, so that, from (2), its Lovász extension is G̃_ij(x_i, x_j) = [x_i − x_j]_+ with [y]_+ := max{y, 0}. Therefore, the Lovász extension of the cut size function, in the general case of directed graphs, is given by:

f(x) = Σ_{i,j=1}^{N} a_ji [x_i − x_j]_+ := GDV(x).   (3)

We term this function the Graph Directed Variation (GDV), as it captures the edges' directivity. For undirected graphs, imposing a_ij = a_ji, the Lovász extension of the cut size boils down to

f(x) = Σ_{i,j=1, i>j}^{N} a_ji |x_i − x_j| := GAV(x).   (4)

Interestingly, this function, which we call Graph Absolute Variation (GAV), represents the discrete counterpart of the ℓ1-norm total variation, which plays a key role in the classical analysis of continuous-time signals [17], [13].

III. GRAPH FOURIER BASIS AND DIRECTED TOTAL VARIATION

Alternative definitions of the GFT have been proposed in the literature, depending on the different perspectives used to emphasize specific signal features. In the case of undirected graphs, the GFT of a vector s was defined as [5]

ŝ = U^T s,   (5)

where the columns of the matrix U are the eigenvectors of the Laplacian L, i.e. L = U Λ U^T. This definition is basically rooted in the clustering properties of these eigenvectors; see, e.g., [30]. In fact, by definition of eigenvector, the Fourier basis used in (5) can be thought of as the solution of the following sequence of optimization problems:

u_k = arg min_{u_k∈R^N} u_k^T L u_k := arg min_{u_k∈R^N} GQV(u_k)
s.t. u_k^T u_ℓ = δ_kℓ, ℓ = 1, ..., k,   (6)

for k = 2, ..., N, where δ_kℓ is the Kronecker delta, and we used the property that, for undirected graphs, the quadratic form built on the Laplacian is the ℓ2-norm, or graph quadratic variation (GQV), i.e.

GQV(x) := Σ_{i,j=1, j>i}^{N} a_ji (x_i − x_j)^2.

Thus, the Fourier basis obtained from (6) coincides with the set of orthonormal vectors that minimize the ℓ2-norm total variation. In all applications where the graph signals exhibit a cluster behavior, meaning that the signal is relatively smooth within each cluster, whereas it can vary arbitrarily from cluster to cluster, the GFT defined as in (5) helps emphasizing the presence of clusters [30]. However, the identification of the Laplacian eigenvectors as the orthonormal vectors that minimize the GQV is only valid for undirected graphs, for which the quadratic form built on the Laplacian reduces to the GQV. For directed graphs, the quadratic form in (6) captures only properties associated to the symmetrized Laplacian (i.e., L_s = (L + L^T)/2), and hence it cannot capture the edges' directivity. The generalization to directed graphs was proposed in [4] as

ŝ = V^{−1} s,   (7)

where V comes from the Jordan decomposition of the non-symmetric adjacency matrix A, i.e. A = V J V^{−1}.
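The two variation measures in (3) and (4) are straightforward to implement. The sketch below (ours; the random graph is arbitrary, and the paper's convention A[i, j] = a_ij is assumed) also checks that GDV vanishes on constant signals, is positively homogeneous, and reduces to GAV when the graph is undirected:

```python
import numpy as np

def gdv(A, x):
    """Graph Directed Variation, eq. (3): sum_{i,j} a_ji [x_i - x_j]_+ ."""
    d = np.maximum(x[:, None] - x[None, :], 0.0)   # d[i, j] = [x_i - x_j]_+
    return float(np.sum(A.T * d))                  # (A.T)[i, j] = a_ji

def gav(A, x):
    """Graph Absolute Variation, eq. (4): sum_{i>j} a_ji |x_i - x_j|."""
    d = np.abs(x[:, None] - x[None, :])            # symmetric |x_i - x_j|
    return float(np.sum(np.triu(A * d, 1)))        # one term per ordered pair

rng = np.random.default_rng(0)
A = rng.uniform(0, 1, (6, 6))
np.fill_diagonal(A, 0)                             # no self-loops
x = rng.normal(size=6)

# Property ii): a constant signal has zero directed variation.
assert np.isclose(gdv(A, np.full(6, 3.0)), 0.0)
# Property iii): positive homogeneity.
assert np.isclose(gdv(A, 2.5 * x), 2.5 * gdv(A, x))
# Undirected case: symmetrizing A makes eq. (3) collapse to eq. (4).
As = (A + A.T) / 2
assert np.isclose(gdv(As, x), gav(As, x))
```

The last check mirrors the derivation in the text: with a_ij = a_ji, each unordered pair {i, j} contributes a_ji([x_i − x_j]_+ + [x_j − x_i]_+) = a_ji |x_i − x_j|.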
It is easy to show that the directed variation GDV satisfies the following properties:
i) GDV(x) ≥ 0, ∀x ∈ R^N;
ii) GDV(x) = 0, ∀x = c1 with c ≥ 0;
iii) GDV(αx) = α GDV(x), ∀α ≥ 0, i.e. it is positively homogeneous;
iv) GDV(x + y) ≤ GDV(x) + GDV(y), ∀x, y ∈ R^N.
GDV is neither a proper norm nor a semi-norm since, in the latter case, it should be absolutely homogeneous. However, it meets the desired property ii), ensuring that a constant graph signal has zero total variation.

To estimate the variations of the graph Fourier basis and to identify an ordering among frequencies, the total variation of a vector s was defined in [4] as

TVA(s) = ||s − A_norm s||_1,   (8)

where A_norm := A/|λ_max(A)|. The previous definition leads to the elegant theory of algebraic signal processing over graphs [1], [4], [15], [16]. However, there are some critical issues associated with that definition that need to be further explored. First, the definition of total variation as given in (8) does not ensure that a constant graph signal has zero total variation, and this collides with the common meaning of total variation [17], [13], [18]. Second, the columns of V are linearly independent complex generalized eigenvectors, but in general they are not orthogonal. This gives rise to a GFT that does not preserve inner products when passing from the observation to the transformed domain. Furthermore, the computation of the Jordan decomposition incurs serious and intractable numerical instabilities when the graph size exceeds even moderate values [19], and more stable matrix decomposition methods have to be adopted to tackle its instability issues [20]. To overcome some of these criticalities, very recently the authors of [14] proposed a shift operator based on the directed Laplacian of a graph. Using the Jordan decomposition, the graph Laplacian is decomposed as

L = V_L J_L V_L^{−1}   (9)

and the GFT is defined in [14] as

ŝ = V_L^{−1} s.   (10)

To quantify oscillations in the graph harmonics and to order the frequencies, the total variation was defined in [14] as

TVL(s) = ||L s||_1.   (11)

This definition of total variation ensures a zero value for constant graph signals. Furthermore, the eigenvalues with small absolute value correspond to low frequencies. Nevertheless, the GFT given by F = V_L^{−1} is still a non-unitary transform, and its computation is affected by the numerical instabilities associated with the Jordan decomposition.

In this paper, we propose a novel method to build the graph Fourier basis as the set of N orthonormal vectors x_i, i = 1, ..., N, that minimize the total variation defined in (3), which represents the continuous convex Lovász extension of the graph cut size in (1). The first vector is certainly the constant vector, i.e. x_1 = b1, with b = 1/√N, as this (unit-norm) vector yields a total variation equal to zero. Let us introduce the matrix X := (x_1, ..., x_N) ∈ R^{N×N} containing all the basis vectors. Thus, the search for the GFT basis can be formally stated as the search for the orthonormal vectors that minimize the directed total variation in (3), i.e.

min_{X∈R^{N×N}} GDV(X) := Σ_{k=1}^{N} GDV(x_k)
s.t. X^T X = I, x_1 = b1.   (P)

The constraints are used to find an orthonormal basis and to prevent the trivial null solution. Although the objective function is convex, problem P is non-convex due to the orthogonality constraint. In the next section, we present two alternative optimization strategies aimed at solving the non-convex, non-differentiable problem P in an efficient manner.

IV. OPTIMIZATION ALGORITHMS

To avoid handling the non-convex orthogonality constraints directly, several methods have been proposed in the literature based on the solution of a sequence of unconstrained problems approaching the feasibility condition, such as the penalty methods [31], [32] and the augmented Lagrangian based methods [33], [34]. The penalty method is generally simple, but it suffers from slow convergence and ill-conditioning. On the other hand, the standard augmented Lagrangian method solves a sequence of sub-problems that usually have no analytical solutions, and the choice of initial points ensuring a fast convergence rate is usually nontrivial. To cope with these issues, in this section we present two alternative iterative algorithms to solve the non-convex, non-smooth problem P, hinging on some recently developed methods for solving non-differentiable problems with non-convex constraints [25], [26]. The first method, introduced in [25] and called the splitting orthogonality constraints (SOC) method, is based on the alternating direction method of multipliers (ADMM) [35], [36] and the split Bregman method [37], [38]. The SOC method leads to some important benefits, as it is simple to implement and the resulting non-convex sub-problem with orthonormality constraint admits a closed-form solution. Although no convergence proof of the SOC method has been provided yet, numerical results validate its value and robustness.

An alternative optimization method that tackles the non-convex minimization problem P and guarantees convergence is the PAMAL algorithm recently developed in [26]. The algorithm combines the augmented Lagrangian method with proximal alternating minimization, and a convergence proof was provided in [26]. More specifically, this method has the so-called sub-sequence convergence property, i.e. there exists at least one convergent sub-sequence, and any limit point satisfies the Karush-Kuhn-Tucker (KKT) conditions of the original non-convex problem. Building on these algorithms, in the sequel we introduce two efficient optimization strategies that build the basis for the Graph Fourier Transform as the solution of problem P.

A. SOC method

The SOC algorithm was developed in [25] and tackles orthogonality constrained problems by iteratively solving a convex problem and a quadratic problem that admits a closed-form solution. More specifically, introducing an auxiliary variable P = X to split the orthogonality constraint, problem P is equivalent to

min_{X,P∈R^{N×N}} GDV(X)
s.t. X = P, x_1 = b1, P^T P = I.   (12)

The first constraint is linear and, as discussed in [25], it can be handled using Bregman iteration. Therefore, by adding the Bregman penalty function [37], problem (12) is equivalent to the following simple two-step procedure:

(X^k, P^k) ≜ arg min_{X,P∈R^{N×N}} GDV(X) + (β/2)||X − P + B^{k−1}||_F^2
             s.t. x_1 = b1, P^T P = I;
B^k = B^{k−1} + X^k − P^k,

where β is a strictly positive constant. Similarly to ADMM and the split Bregman iteration [39], the above problem can be solved by alternately minimizing with respect to X and P.
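A minimal sketch of the SOC-style splitting described above (ours, not the paper's reference implementation): the convex X-step is approximated here by a few subgradient descent steps on GDV plus the Bregman quadratic, while the P-step uses the closed-form SVD projection onto orthogonal matrices; step sizes, iteration counts, and the toy graph are our assumptions.

```python
import numpy as np

def gdv_subgrad(A, x):
    """A subgradient of GDV(x) = sum_{i,j} a_ji [x_i - x_j]_+ ."""
    active = (x[:, None] - x[None, :]) > 0   # terms with x_i > x_j are active
    W = A.T * active                          # W[i, j] = a_ji on active terms
    return W.sum(axis=1) - W.sum(axis=0)      # d/dx_i minus d/dx_j contributions

def soc_gft(A, beta=1.0, outer=50, inner=20, step=0.05, seed=0):
    N = A.shape[0]
    b = 1.0 / np.sqrt(N)
    rng = np.random.default_rng(seed)
    X, _ = np.linalg.qr(rng.normal(size=(N, N)))
    X[:, 0] = b                               # constraint x_1 = b*1
    P, B = X.copy(), np.zeros((N, N))
    for _ in range(outer):
        # X-step: subgradient descent on GDV(X) + (beta/2)||X - P + B||_F^2.
        C = P - B
        for _ in range(inner):
            G = np.column_stack([gdv_subgrad(A, X[:, k]) for k in range(N)])
            X -= step * (G + beta * (X - C))
            X[:, 0] = b                       # re-impose the first column
        # P-step: nearest orthogonal matrix to X + B (Proposition 2).
        Q, _, Rt = np.linalg.svd(X + B)
        P = Q @ Rt
        # Bregman/dual update.
        B += X - P
    return X, P

A = np.array([[0, 1, 0, 0],
              [0, 0, 2, 0],
              [0, 0, 0, 1],
              [1, 0, 0, 0]], dtype=float)     # a directed 4-node cycle (toy)
X, P = soc_gft(A)
assert np.allclose(P.T @ P, np.eye(4), atol=1e-8)  # P orthonormal by construction
assert np.allclose(X[:, 0], 0.5)                   # first vector fixed at 1/sqrt(N)
```

In the paper the X-step is solved as a full convex program; the subgradient loop above is only a cheap stand-in to make the splitting structure executable.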

Algorithm 1: SOC method
Set β > 0, X^0 ∈ R^{N×N} with (X^0)^T X^0 = I and x_1^0 = b1, P^0 = X^0, B^0 = 0, k = 1.
Repeat
  Find X^k as the solution of P_k in (13),
  Y^k = X^k + B^{k−1},
  Compute the SVD decomposition Y^k = Q̄ S R̄^T,
  P^k = Q̄ R̄^T,
  B^k = B^{k−1} + X^k − P^k,
  k = k + 1,
until convergence.

The problem is solved by iteratively minimizing with respect to X and P:

1. X^k ≜ arg min_{X∈R^{N×N}} GDV(X) + (β/2)||X − P^{k−1} + B^{k−1}||_F^2
        s.t. x_1 = b1   (P_k)
2. P^k ≜ arg min_{P∈R^{N×N}} ||P − (X^k + B^{k−1})||_F^2
        s.t. P^T P = I   (Q_k)
3. B^k = B^{k−1} + X^k − P^k.   (13)

The interesting aspect of this formulation is that subproblem P_k is convex, whereas the second constrained quadratic problem Q_k has a closed-form solution, as illustrated in the following proposition.

Proposition 2: Define Y^k = X^k + B^{k−1} and let Y^k = Q̄ S R̄^T be its SVD decomposition, where Q̄, R̄ ∈ R^{N×N} are unitary matrices, and S ∈ R^{N×N} is the diagonal matrix whose entries are the singular values of Y^k. Then, the optimal solution of the quadratic non-convex problem Q_k in (13) is P^k = Q̄ R̄^T.
Proof: See the proof of Theorem 2.1 in [25].

Combining (13) and Proposition 2, the main steps of the SOC method are summarized in Algorithm 1. It is important to remark that the choice of the coefficient β strongly affects the convergence behavior of the algorithm: a large value of β will enforce the equality constraint more strongly, while a too small β might not be able to guarantee that the solution satisfies the orthogonality constraint. Hence, a proper tuning of the coefficient β is important to ensure fast convergence of the algorithm. Although, as remarked in [25], the convergence analysis of the SOC algorithm is still an open problem, we will show next that the numerical results testify the validity and robustness of this method when applied to our case.

B. PAMAL method

As an alternative efficient method to tackle the non-convexity of problem P, we propose here an approach based on the PAMAL algorithm [26]. The method solves the orthogonality constrained problem by iteratively updating the primal variables and the multipliers estimates. To this end, let us reformulate the problem as follows. Let us introduce the sets S_1 ≜ {x : x = ±b1} and S_t ≜ {P ∈ R^{N×N} : P^T P = I}, where S_t represents the Stiefel manifold [40]. For any set S, its indicator function is defined as

δ_S(X) = 0 if X ∈ S, +∞ otherwise.   (14)

Given these symbols, problem (12) is equivalent to the following one:

min_{X,P∈R^{N×N}} f(X, P) ≜ GDV(X) + δ_{S_1}(x_1) + δ_{S_t}(P)
s.t. H(X, P) ≜ P − X = 0.   (P_e)

The basic idea to solve a problem in the form of P_e was proposed in [26], and combines the augmented Lagrangian method [41], [33] with the alternating proximal minimization algorithm, known as the PAM method [42], which deals with non-smooth, non-convex optimization. According to the augmented Lagrangian method, we add a penalty term to the objective function in order to associate a high cost to unfeasible points. In particular, the augmented Lagrangian function associated with the non-smooth problem P_e is

L(X, P, Λ) = f(X, P) + ⟨Λ, H(X, P)⟩ + (ρ/2)||H(X, P)||_F^2,

where ρ is a positive penalty coefficient, Λ ∈ R^{N×N} represents the multipliers matrix, while the matrix inner product is defined as ⟨A, B⟩ ≜ tr(A^T B). The proposed augmented Lagrangian method reduces problem P_e to a sequence of problems that alternately update, at each iteration k, the following three steps:

1. Compute the critical point (X^k, P^k) of the function L(X, P, Λ^k; ρ^k) by solving
   (X^k, P^k) ≜ min_{X,P∈R^{N×N}} L(X, P, Λ^k; ρ^k);   (15)
2. Update the multiplier estimates Λ^k;
3. Update the penalty parameter ρ^k.

We will show next how to implement the previous steps, which are described in detail in Algorithm 2.

Computation of the critical points (X^k, P^k). The optimal solution (X^k, P^k) of problem (15) is computed using an approximate algorithm, i.e. finding a subgradient point Θ^k ∈ ∂L(X^k, P^k, Λ^k; ρ^k) satisfying, with a prescribed tolerance value ǫ^k, the following inequality

||Θ^k||_∞ ≤ ǫ^k   (16)

with P^k ∈ S_t. To evaluate such a point, we exploit a coordinate-descent method with proximal regularization based on the PAM method proposed in [43]. More specifically, at the k-th outer iteration of the algorithm, we compute (X^k, P^k) by iteratively solving, at each inner iteration n, the following proximal regularization of a two-block Gauss-Seidel method:

X^{k,n} = arg min_{X∈R^{N×N}, x_1=b1} L(X, P^{k,n−1}, Λ^k; ρ^k) + (c_1^{k,n−1}/2)||X − X^{k,n−1}||_F^2   (P̃_{k,n})

P^{k,n} = arg min_{P∈R^{N×N}} L(X^{k,n}, P, Λ^k; ρ^k) + (c_2^{k,n−1}/2)||P − P^{k,n−1}||_F^2   (Q̃_{k,n})
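The closed-form solutions in Proposition 2 (and, below, Proposition 3) are instances of the classical nearest-orthogonal-matrix projection: the minimizer of ||P − Y||_F over orthogonal P is the product of the SVD factors of Y with the singular values replaced by ones. A quick numerical sanity check (ours):

```python
import numpy as np

rng = np.random.default_rng(1)
Y = rng.normal(size=(5, 5))

# Closed form: if Y = Q diag(s) R^T, then P* = Q R^T solves
# min ||P - Y||_F  s.t.  P^T P = I.
Q, s, Rt = np.linalg.svd(Y)
P_star = Q @ Rt

assert np.allclose(P_star.T @ P_star, np.eye(5), atol=1e-10)  # P* is orthogonal

# P* beats any other orthogonal candidate (here: random ones and the identity).
best = np.linalg.norm(P_star - Y)
assert best <= np.linalg.norm(np.eye(5) - Y) + 1e-12
for _ in range(200):
    O, _ = np.linalg.qr(rng.normal(size=(5, 5)))  # a random orthogonal matrix
    assert best <= np.linalg.norm(O - Y) + 1e-12
```

The underlying identity is ||P − Y||_F^2 = ||Y||_F^2 + N − 2 tr(P^T Y), so the minimization reduces to maximizing tr(P^T Y), which is bounded by the sum of the singular values and attained at P* = Q R^T.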

where the proximal parameters c_i^{k,n} can be arbitrarily chosen as long as they satisfy

0 < c ≤ c_i^{k,n} ≤ c̄ < ∞, ∀k, n ∈ N, i = 1, 2, for some c, c̄ > 0.   (17)

The first convex problem P̃_{k,n} can be solved through any convex optimization numerical tool, whereas the second problem Q̃_{k,n} admits a closed-form solution, as stated in the following proposition.

Proposition 3: Define the matrix

F ≜ (c_2^{k,n−1} P^{k,n−1} + ρ^k X^{k,n} − Λ^k)(ρ^k + c_2^{k,n−1})^{−1}

with SVD decomposition F = Q Σ T^T, where Q, T ∈ R^{N×N} are unitary matrices, while Σ is a diagonal matrix with entries given by the singular values of F. The optimal solution of the non-convex problem Q̃_{k,n} is given by P^{k,n} = Q T^T.
Proof: See Appendix A.

Algorithm 2: PAMAL method
Given the parameters {ǫ^k}_{k∈N}, 0 < ǫ^k < 1, τ ∈ [0, 1), γ > 1, k = 1, ρ^k > 0, Λ^k ∈ R^{N×N}, Λ_min ≤ Λ^k ≤ Λ_max.
Repeat
  Step 1: Compute (X^k, P^k) as in Algorithm 3 such that there exists Θ^k ∈ ∂L(X^k, P^k, Λ^k; ρ^k) with ||Θ^k||_∞ ≤ ǫ^k and (P^k)^T P^k = I.
  Step 2: Update the multiplier estimates
    Λ^{k+1} = [Λ^k + ρ^k (P^k − X^k)]_T,
    where [·]_T is the projection on T ≜ {Λ : Λ_min ≤ Λ ≤ Λ_max}.
  Step 3: Set R^k = P^k − X^k, and update the penalty parameter as
    ρ^{k+1} = ρ^k if ||R^k||_∞ ≤ τ ||R^{k−1}||_∞, γρ^k otherwise;
  k = k + 1,
until convergence.

Algorithm 3: PAM method for solving step 1 in Algorithm 2
Let (X^{1,0}, P^{1,0}) be any finite initialization. For k ≥ 2, set (X^{k,0}, P^{k,0}) = (X^{k−1}, P^{k−1}), n = 0.
Repeat
  Step 1: Set n = n + 1. Compute X^{k,n} by solving problem P̃_{k,n}.
  Step 2: P^{k,n} = Q T^T, where Q, T come from the SVD decomposition
    (c_2^{k,n−1} P^{k,n−1} + ρ^k X^{k,n} − Λ^k)/(ρ^k + c_2^{k,n−1}) = Q Σ T^T.
  Step 3: Set (X^k, P^k) = (X^{k,n}, P^{k,n}), Θ^k = Θ^{k,n},
until ||Θ^{k,n}||_∞ ≤ ǫ^k.

Algorithm 2 describes the outer loop of the PAMAL method, whereas in Algorithm 3 we report the inner iterations needed to solve problems P̃_{k,n} and Q̃_{k,n} in step 1 of Algorithm 2. The inner iterations are terminated when there exists a subgradient point Θ^{k,n} ∈ ∂L(X^{k,n}, P^{k,n}, Λ^k; ρ^k) satisfying ||Θ^{k,n}||_∞ ≤ ǫ^k, P^{k,n} ∈ S_t, where Θ^{k,n} ≜ (Θ_1^{k,n}, Θ_2^{k,n}), with the subgradients given by

Θ_1^{k,n} = c_1^{k,n−1}(X^{k,n−1} − X^{k,n}) + ρ^k(P^{k,n−1} − P^{k,n})
Θ_2^{k,n} = c_2^{k,n−1}(P^{k,n−1} − P^{k,n}).   (18)

Update of the multipliers and penalty coefficients. The rule for updating the multipliers matrix in Step 2 of Algorithm 2 needs some further discussion. We adopt the classical first-order approximation by imposing that the estimates of the multipliers must be bounded. Then, we explicitly project the multipliers matrix onto the compact box set T ≜ {Λ : Λ_min ≤ Λ ≤ Λ_max}, with −∞ < [Λ_min]_{i,j} ≤ [Λ_max]_{i,j} < ∞, ∀i, j. The boundedness of the multipliers is a fundamental assumption needed to preserve the property that global minimizers of the original problem are obtained if each outer iteration of the penalty method computes a global minimum of the subproblem. Unfortunately, assumptions that imply boundedness of the multipliers tend to be very strong and often hard to verify. Nevertheless, following [26], [41], [44], we also impose the boundedness of the multipliers. This implies that, in the convergence proofs, we will assume that the true multipliers fall within the bounds imposed by the algorithm; see, e.g., [26]. Regarding the setting of the remaining parameters of the proposed algorithm, we will assume that: i) the sequence of positive tolerance parameters {ǫ^k}_{k∈N} is chosen such that lim_{k→∞} ǫ^k = 0; ii) the penalty parameter ρ^k is updated according to the infeasibility degree by following the rule described in step 3 of Algorithm 2 [26], [33].

Convergence Analysis. We now discuss in detail the convergence properties of the proposed PAMAL method. Assume that: i) the proximal parameters {c_i^{k,n}}, ∀k, n, are arbitrarily chosen as long as they satisfy (17); ii) the sequence {ǫ^k}_{k∈N} is chosen such that lim_{k→∞} ǫ^k = 0; iii) the penalty parameter ρ^k is updated according to the rule described in Algorithm 2. The PAM method, as given in Algorithm 3, guarantees global convergence to a critical point [43, Th. 6.2], provided that the penalty parameters {ρ^k}_{k∈N} in Algorithm 2 satisfy some mild conditions, as stated in the following theorem.

Theorem 1: Denote by {(X^{k,n}, P^{k,n})}_{n∈N} the sequence generated by Algorithm 3. The function L in (15) satisfies the Kurdyka-Łojasiewicz (K-Ł) property¹. Then, Θ^{k,n} defined by (18) satisfies

Θ^{k,n} ∈ ∂L(X^{k,n}, P^{k,n}, Λ^k; ρ^k), ∀n ∈ N.   (19)

Also, if γ > 1 and ρ^1 > 0, for each k ∈ N it holds

||Θ^{k,n}||_∞ → 0, as n → ∞.   (20)

Proof: See Appendix B.

The convergence claim for Algorithm 2 to a stationary solution of problem P_e is stated in the following theorem.

Theorem 2: Let {(X^k, P^k)}_{k∈N} be the sequence generated by Algorithm 2. Suppose ρ^1 > 0 and γ > 1. Then, the set of limit points of {(X^k, P^k)}_{k∈N} is non-empty, and every limit point satisfies the KKT conditions of the original problem P_e.
Proof: The proof follows similar arguments as in [26, Th. 3.1-3.5], and thus is omitted due to space limitations.

Remark 1: Note that both Algorithms 1 and 3 have to compute, at each step of their loops, the SVD of an N × N matrix. Therefore, at each iteration their computational cost is proportional to O(N^3). So, clearly, there is a complexity issue that deserves further investigation to enable the application to large-size graphs. In this paper, we have not investigated methods to reduce the complexity of the approach by exploiting, for instance, the sparsity of the graphs under analysis. Also, we have not optimized the selection of the parameters involved in both the SOC and PAMAL methods. However, even if complexity is an issue, the proposed approach is more numerically stable than the only method available today for the analysis of directed graphs, based on the Jordan decomposition.

¹The reader can refer to Appendix B for a definition of the Kurdyka-Łojasiewicz (K-Ł) property.

Algorithm 4: Balanced graph signal variation
For k = 2, ..., N
  Set n = 0, x_k^0 = x a nonzero vector with m(x_k^0) = 0, α > 0, 0 < ǫ ≪ 1.
  Repeat

Remark 2.
multipliers fall within the bounds imposed by the algorithm, P see, e.g. [26]. Regarding the setting of the remaining param- Proof. The proof follows similar arguments as in [26, Th. 3.1- eters of the proposed algorithm, we will assume that: i) the 3.5], and thus is omitted due to space limitation. k Remark 1. Note that both Algorithms 1 and 3 at each step sequence of positive tolerance parameters ǫ k∈N is chosen k { } k of their loops have to compute the SVD of an N N such that limk→∞ ǫ = 0; ii) the penalty parameter ρ is × updated according to the infeasibility degree by following the matrix. Therefore, at each iteration their computational cost is proportional to (N 3). So, clearly, there is a complexity issue rule described in step 3 of Algorithm 2 [26], [33]. O Convergence Analysis. We now discuss in details the conver- that deserves further investigations to enable the application gence properties of the proposed PAMAL method. Assume to large size graphs. In this paper, we have not investigated k,n methods to reduce the complexity of the approach exploiting, that: i) the proximal parameters ci ∀k,n are arbitrarily { } k for instance, the sparsity of the graphs under analysis. Also, we chosen as long as they satisfy (17); ii) the sequence ǫ k∈N k { } have not optimized the selection of the parameters involved in is chosen such that limk→∞ ǫ =0; iii) the penalty parameter k ρ is updated according to the rule described in Algorithm 2. 1The reader can refer to Appendix B for a definition of the Kurdyka- The PAM method, as given in Algorithm 3, guarantees global Łojasiewicz (K-Ł) property. 7 both SOC and PAMAL methods. However, even if complexity Algorithm 4 : Balanced graph signal variation is an issue, the proposed approach is more numerically stable For k = 2,...,N n 0 n than the only method available today for the analysis of Set n = 0, xk = x nonzero vector with m(xk ) = 0, α> 0, 0 < ǫ ≪ 1. directed graphs, based on the Jordan decomposition. Repeat Remark 2. 
The two alternative methods proposed above to wn n ∈ sign(xk ), solve the non-convex problem are robust to random initial- vn = wn − mean(wn)1, P hn = xn + αvn, izations, as testified also by the numerical results presented k n n+1 E(xk ) hn 2 in the sequel. In terms of implementation complexity, SOC xˆk = arg min f(xk)+ k xk − k2, ∈X b 2α algorithm is easier to code even though, to the best of our xk k n+1 n+1 n+1 knowledge, a theoretical proof of its convergence is still yk = xˆk − m(xˆk ), yn+1 lacking. n+1 k xk = n+1 , n = n + 1, k yk k2 n n−1 INIMIZATION OF BALANCED TOTAL VARIATION until | E(xk ) − E(xk ) |< ǫ, V. M n+1 n+1 xˆk The minimization of the total variation as in (12) is inspired xk = n+1 , k xˆk k2 by the min-cut problem. However, in some cases, this might end. favor the appearance of very sparse vectors or of very small clusters, possibly also isolated nodes. One way to prevent these undesired solutions passes through the introduction of the N bases xk k=1, with x1 = b1, by iteratively solving, for balanced cut [45], [46]. A popular definition for the balanced k =2,...,N{ } , the following problem cut of undirected graph is the Cheeger cut [47], which is given b by: min E(xk) ( k) x ∈RN cut( , ¯) k P (25) min S S . (21) s.t. x T x = δ , ℓ =1, . . . , k. S⊆V min( , ¯ ) k ℓ k,ℓ |S| |S| Note that min( , ¯ ) attains its maximum when = ¯ = Note that problem b is non-convex in both the constraints set Pk N/2, so that,|S| for| aS| given value of cut( , ¯), the|S| minimum|S| and the objective function. Recently, several algorithms [45], occurs when and ¯ have approximatelyS equalS size. While the [49], [48], [46], have been proposed to minimize relaxations problem statedS aboveS is NP-hard, a tight continuous relaxation of the balanced cut problem that are similar to (22). Typi- of the balanced cut problems has recently been shown to cally, these algorithms give excellent numerical performance, provide excellent clustering results [46,48,49]. 
In [49], [27] although theoretical convergence proofs are not available. For it was proved that the balanced Cheeger cut problem in (21) instance, in [27], the authors proposed an algorithm minimiz- for undirected graphs admits the following exact continuous ing (22), along with a proof of convergence to a critical point relaxation of the original problem. This method is a new steepest descent a x x algorithm based on the explicit-implicit gradient [50] of the i j,i>j ji i j f(x ) min | − | (22) function E(x ) , k where B(x )= x (i) m(x ) . N k B(x ) k i k k x∈R P P xi m(x) k | − | i | − | The explicit-implicit subgradient of theP non-smooth function P where m(x) stands for the median value of x. Note that since E(xk) is given by it holds xi m(x) = 0, x span 1 , problem (22) i | − | ∀ ∈ { } xn+1 xn ∂x f(xn+1) E(xn)∂x B(xn) is well-definedP if x 1. Then, the problem in (22) can be k − k = k k − k k k (26) ⊥ n n recast as: τ − B(xk )

aji xi xj or i j,i>j | − | min . (23) n+1 x∈RN ,x⊥1 P P m x n i xi ( ) n+1 n n ∂xk f(xk ) n E(xk ) n | − | xk = xk τ n + τ n ∂xk B(xk ). (27) In [49] it was proved thatP (22) is an exact relaxation of the − B(xk ) B(xk ) Cheeger cut problem and, for any minimizer x, there is a Let us now consider the following proximal minimization number ν such that, i, the binary solution xν (i)=1 if x(i) > ∀ problem ν and xν (i)=0 for x(i) ν, is also a minimizer of the ≤ Cheeger cut problem. Then, from the equivalence of problems B(xn) xn+1 , x k x gn 2 (22) and (23), this result holds true also for any minimizer of k arg min f( k)+ n k . (28) x ∈RN 2τ k − k (23). In the sequel, we formulate the problem of finding the k Fourier basis minimizing the balanced total variation in both Any stationary solution of (28) will be also solution of the cases of directed and undirected graphs. To this end, let us subgradient equation define the function τ n x ∂ f(x )+ x gn =0, (29) , f( k) n xk k k E(xk) (24) B(xk ) − xk(i) m(xk) i | − | P so that at step n +1 one gets where f(xk) = GAV(xk) in (4), or f(xk) = GDV(xk) in (3), in case of undirected or directed graphs, respectively. τ n xn+1 gn x (30) According to problem (22), we can find a set of Fourier k = n ∂xk f( k). − B(xk ) 8
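To make the updates (26)-(30) concrete, the following sketch (in Python/NumPy rather than the paper's Matlab; the 4-node path graph, step size τ, and starting vector are toy choices of ours) evaluates the ratio E(x) = f(x)/B(x) for the undirected case f = GAV and performs the explicit half of one iteration, g^n = x^n + τ (E(x^n)/B(x^n)) w^n, with the subgradient choice w^n = sign(x − m(x)) ∈ ∂B(x^n):

```python
import numpy as np

def median_center(x):
    # B(x) = sum_i |x_i - m(x)|, with m(x) the median of x
    m = np.median(x)
    return np.sum(np.abs(x - m)), m

def gav(A, x):
    # graph absolute variation: sum_{i>j} a_ji |x_i - x_j| (undirected case)
    n = len(x)
    return sum(A[j, i] * abs(x[i] - x[j])
               for i in range(n) for j in range(n) if i > j)

def explicit_step(A, x, tau):
    # explicit half of the explicit-implicit iteration:
    # g = x + tau * E(x)/B(x) * w,  with w = sign(x - m(x)) in dB(x)
    B, m = median_center(x)
    E = gav(A, x) / B
    w = np.sign(x - m)
    return x + tau * (E / B) * w

# toy 4-node path graph (our own example, not from the paper)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
x = np.array([1.0, 0.5, -0.5, -1.0])
B, _ = median_center(x)
E = gav(A, x) / B        # scale-invariant ratio: E(c x) = E(x) for c > 0
g = explicit_step(A, x, tau=0.5)
```

The implicit half is then the proximal problem (28), which in practice is solved subject to the orthogonality constraints with respect to the previously computed basis vectors.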

Fig. 1: Examples of graphs with: (a) 2 directed links; (b) 3 directed links; (c) 1 directed cycle.

Replacing in this last equality the expression of x_k^{n+1} given in (27), we obtain the following set of two equations to be iteratively updated:

g^n = x_k^n + τ^n ( E(x_k^n) / B(x_k^n) ) w^n,  with w^n ∈ ∂_{x_k} B(x_k^n)

x_k^{n+1} = arg min_{x_k ∈ X̃_k} f(x_k) + ( B(x_k^n) / (2τ^n) ) ‖x_k − g^n‖²

where we define X̃_k ≜ {x_k ∈ R^N : x_k^T x_ℓ = 0, for ℓ = 1, ..., k − 1}. Note that X̃_k is a set of linear constraints since, for each vector x_k, the previously computed vectors x_ℓ, for ℓ = 1, ..., k − 1, are assumed to be known. The unit-norm constraint is satisfied through a simple projection of the optimal solution onto the unit sphere. As shown in [49], [27], the algorithm decreases the objective function and preserves the zero-mean property of the successive iterates. It was also observed in [27] that a faster convergence rate can be achieved when the step size is chosen as τ^n = α B(x_k^n)/E(x_k^n), with α > 0.

The formal description of the iterative optimization method is given in Algorithm 4, where we denote by sign(a) and mean(a), respectively, the element-wise sign and the mean value of a vector a. The convergence analysis of the algorithm to a critical point of E was derived in [49], [27] for undirected graphs. However, since for directed graphs f(x_k) preserves all the required properties (i.e., it is non-smooth and convex), the convergence results in [49], [27] hold also for the minimization of the balanced directed variation.

Algorithm 4: Balanced graph signal variation
For k = 2, ..., N
  Set n = 0, x_k^0 = x, a nonzero vector with m(x_k^0) = 0, α > 0, 0 < ǫ ≪ 1.
  Repeat
    w^n ∈ sign(x_k^n),
    v^n = w^n − mean(w^n)1,
    h^n = x_k^n + α v^n,
    x̂_k^{n+1} = arg min_{x_k ∈ X̃_k} f(x_k) + ( E(x_k^n) / (2α) ) ‖x_k − h^n‖²₂,
    y_k^{n+1} = x̂_k^{n+1} − m(x̂_k^{n+1}),
    x_k^{n+1} = y_k^{n+1} / ‖y_k^{n+1}‖₂,  n = n + 1,
  until | E(x_k^n) − E(x_k^{n−1}) | < ǫ,
  x_k = x̂_k^{n+1} / ‖x̂_k^{n+1}‖₂.
end

VI. NUMERICAL RESULTS

In this section, we present some numerical results to assess the effectiveness of the proposed strategy for building the GFT basis. First, we illustrate some examples of application and then we compare the proposed approach with alternative definitions of the GFT basis, as given in [5], [4], [14]. In all our experiments, the parameters of the SOC and PAMAL methods are set as follows (unless stated otherwise): β = 100, τ = 0.5, γ = 1.5, ρ^1 = 50, ǫ^k = (0.9)^k, ∀ k ∈ N, Λ_min = −1000·I, Λ_max = 1000·I, Λ^1 = 0, c^{k,n} = c_i = c̄ = 0.5, ∀ i, k, n.

Examples of bases for directed graphs. For the sake of understanding the structure of the GFT basis vectors obtained with our methods, we start by considering the simple directed graphs depicted in Fig. 1, i.e., a directed graph composed of N = 15 nodes with three clusters, connected by (a) 2 directed links, (b) 3 directed links, and (c) a directed cycle. As a first example, in Fig. 2 we report the basis vectors {x_k}_{k=1}^{15} obtained through Algorithm 2 for graph (a) in Fig. 1. The intensity of the vector entries is encoded in the color associated to each vertex. Directed and undirected edges are represented by arrowed and continuous lines, respectively. The order chosen to plot the basis vectors corresponds to increasing values of the directed variation GDV(x_k) (reported on top of each subgraph). It is possible to notice that the basis vectors tend to identify clusters and, furthermore, the value assumed by the basis vectors within each cluster is exactly constant. This is a useful property in view of applications to unsupervised or semi-supervised clustering, where the label (signal) associated to each cluster is exactly constant within the cluster. This property does not hold with current methods based on the eigenvectors of either the Laplacian or adjacency matrices, whose behavior within each cluster is only smooth, but not exactly constant. To grasp the reason for this difference, it is worth noticing that, in the case of undirected graphs, the above property is a consequence of having minimized an ℓ1-norm (see, e.g., (4)), rather than an ℓ2-norm, as in the case of the Laplacian eigenvectors. It is interesting to remark from Fig. 2 how there are three basis vectors that yield a zero directed variation. In particular, besides the constant vector x_1, the vectors x_2 and x_3, even if not constant, yield zero variation just by assigning values to the entries of the cluster {11÷15} smaller than the values of the clusters {1÷5} and {6÷10}. Since there is no directed edge between the clusters {1÷5} and {6÷10}, there are two ways to enforce the previous property, still maintaining vector orthogonality. As a further example, let us consider graph (b) in Fig. 1, where we added a directed link from node 7 to node 5. From Fig. 3 we observe that, in this case, the number of basis vectors having zero directed variation reduces to two, since the presence of the new directed link leaves only one possible way, besides the constant vector, to have GDV = 0 while still preserving basis orthogonality. In Fig. 4, we report the optimal basis, computed using Algorithm 2, for the graph with a directed cycle depicted in Fig. 1c. Interestingly, in this case, there can only be one vector that yields zero directed variation: the constant vector. In fact, the cyclical structure of the graph now prevents the existence of non-constant vectors able to null the directed variation. The properties described above are a
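The zero-variation vectors discussed above are easy to reproduce numerically. The sketch below builds a toy graph of our own (two 3-node clusters joined by a single directed link, not one of the graphs in Fig. 1) and checks, taking GDV(x) = Σ_{i,j} a_ji max(x_i − x_j, 0) as the directed-variation expression used in the Appendix, that a vector which is constant within clusters and assigns the smaller value to the downstream cluster yields zero GDV, whereas the reversed assignment does not:

```python
import numpy as np

def gdv(A, x):
    # directed variation: sum_{i,j} a_ji * max(x_i - x_j, 0);
    # zero when x is non-increasing along every directed edge
    n = len(x)
    return sum(A[j, i] * max(x[i] - x[j], 0.0)
               for i in range(n) for j in range(n))

# toy graph: two 3-node clusters, undirected edges inside each cluster
# (encoded as symmetric entries) and ONE directed edge 2 -> 3 between them
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    A[i, j] = A[j, i] = 1.0          # intra-cluster, undirected
A[2, 3] = 1.0                         # directed link into cluster {3, 4, 5}

x_down = np.array([1., 1., 1., 0., 0., 0.])   # downstream cluster smaller
x_up   = np.array([0., 0., 0., 1., 1., 1.])   # downstream cluster larger
```

Here gdv(A, x_down) = 0 while gdv(A, x_up) > 0, mirroring how the optimal basis vectors achieve zero variation by decreasing along directed links.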

[Figs. 2 and 3 (panels omitted): each basis vector is plotted on the graph with its directed variation as panel title. Fig. 2: GDV(x_1) = GDV(x_2) = GDV(x_3) = 0, growing up to GDV(x_15) = 5. Fig. 3: GDV(x_1) = GDV(x_2) = 0, GDV(x_3) = 2, growing up to GDV(x_15) = 5.]
Fig. 2: Optimal basis vectors x_k, k = 1, ..., 15, for Algorithm 2 and the directed graph in Fig. 1a. Fig. 3: Optimal basis vectors x_k, k = 1, ..., 15, for Algorithm 2 and the graph in Fig. 1b.

unique and interesting consequence of the edge directivity. In fact, as can be observed from Fig. 5, the optimal basis for the corresponding undirected graph (obtained by simply removing the edge directivity) has only one vector with zero variation, the constant vector. Conversely, in the cases shown before, we had three, two, and one vectors yielding zero variation.

Convergence test. Since the optimization problem P is non-convex, there is of course the possibility that the proposed methods fall into a local minimum. Furthermore, while the PAMAL method guarantees convergence, the SOC algorithm might also fail to converge because, theoretically speaking, there is no convergence analysis. To test what happens, we considered several independent initializations of both SOC and PAMAL algorithms in the search for a basis for the graph of Fig. 1a. In Fig. 6, we report the average behavior (± the standard deviation) of the directed variation versus the iteration index m, which counts the overall number of (outer and inner) iterations for Algorithms 1 and 2. The curves refer to 200 independent initializations of the SOC and PAMAL algorithms, using the same initialization for both. We can observe that in all cases the algorithms converge, but there is indeed a spread in the final variation, meaning that both methods can incur local minima. Nonetheless, the spread is quite limited, which suggests that bases associated to different local minima behave similarly in terms of total variation. Additionally, since the PAMAL algorithm solves the orthogonality-constrained, non-convex problem by iteratively updating the primal variables and the multipliers, the objective function evaluated at each (inner and outer) iteration does not necessarily follow a monotonic decay, as can be noticed in the lower subplot of Fig. 6.

Comparison with alternative GFT bases. We now compare the GFT basis found with our methods with the bases associated to either the Laplacian or the adjacency matrix, as proposed in [5], [4] and references therein. To compare the results, we applied all algorithms to several independent realizations of random graphs. We chose as family of random graphs the so-called scale-free graphs, as they are known to fit many situations of practical interest [51]. In the generation of random scale-free graphs, it is possible to set the minimum degree d_min of each node. To compare our method with the GFT definition proposed in [1], since the eigenvectors of an asymmetric matrix can be complex and the directed total variation GDV, as defined in (3), does not represent a valid metric for complex vectors, we restricted the comparison to undirected scale-free graphs, in which case the adjacency and Laplacian matrices are real and
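The paper does not detail its scale-free generator; one standard construction consistent with a settable minimum degree d_min is Barabási-Albert preferential attachment with m = d_min edges per new node. This choice, and all parameters below, are our assumption, not the authors' code:

```python
import numpy as np

def barabasi_albert(n, m, seed=0):
    # preferential attachment: each new node attaches m edges to existing
    # nodes with probability proportional to their current degree;
    # every node then ends with degree >= m (the minimum degree d_min)
    rng = np.random.default_rng(seed)
    A = np.zeros((n, n))
    # start from a small connected seed: a clique on m + 1 nodes
    A[:m + 1, :m + 1] = 1 - np.eye(m + 1)
    for v in range(m + 1, n):
        deg = A[:v, :v].sum(axis=1)
        targets = rng.choice(v, size=m, replace=False, p=deg / deg.sum())
        for t in targets:
            A[v, t] = A[t, v] = 1.0
    return A

A = barabasi_albert(30, m=3)   # undirected scale-free graph, d_min = 3
deg = A.sum(axis=1)
```

The resulting adjacency matrix is real and symmetric, matching the setting used for the comparison with the eigenvector-based GFT definitions.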

[Figs. 4 and 5 (panels omitted): each basis vector is plotted on the graph with its variation as panel title. Fig. 4 (directed cycle): GDV(x_1) = 0 and all other vectors have strictly positive variation, up to GDV(x_15) = 5.2. Fig. 5 (undirected counterpart): GAV(x_1) = GQV(x_1) = 0, growing up to GAV(x_15) = 5.65, GQV(x_15) = 4.9.]
Fig. 4: Optimal basis vectors x_k, k = 1, ..., 15, for Algorithm 2 and the graph in Fig. 1c. Fig. 5: Optimal basis vectors x_k, k = 1, ..., 15, for Algorithm 2 and the undirected counterpart of the graph in Fig. 1c.

symmetric, so that their eigenvectors are real. In the sequel, we will use the notations GAV(X) := Σ_{k=1}^N GAV(x_k) and GQV(X) := Σ_{k=1}^N GQV(x_k) to denote, respectively, the total graph absolute and quadratic variation of a matrix X. In Fig. 7, we compare the following metrics: a) GAV(X⋆), derived by solving problem P through the SOC and PAMAL methods; b) GAV(V), where V are the eigenvectors of the adjacency matrix, according to the GFT defined in (7); c) GAV(U), where U are the eigenvectors of the Laplacian matrix, assuming the GFT as in (5), which for undirected graphs is equivalent to the GFT defined in (10). More specifically, Fig. 7 shows the previous metrics vs. the minimum degree of the graph, averaged over 100 independent realizations of scale-free graphs of N = 20 nodes. As we can notice from Fig. 7, the bases built using the SOC and PAMAL algorithms yield a significantly lower total variation than the conventional bases built with either adjacency or Laplacian eigenvectors. This is primarily due to the fact that our optimization methods tend to assign constant values within each cluster. Finally, in Fig. 8 we compare the alternative basis vectors using the GQV as performance metric. So, in Fig. 8 we report the GQV(X⋆) metric derived from the SOC and PAMAL methods, together with GQV(V) and GQV(U) obtained, respectively, from the eigenvectors of the adjacency and the Laplacian matrix. Again, the results are averaged over 100 independent realizations of scale-free graphs, vs. the average minimum degree, under the same settings of Fig. 7. Interestingly, even if our basis vectors X⋆ do not coincide with V or U, they provide the same GQV, within negligible numerical inaccuracies. Indeed, the invariance of the metric GQV(X), for any square, orthogonal matrix X, can be easily proved from the equality GQV(X) = Σ_{k=1}^N x_k^T L x_k = trace(X^T L X), by observing that trace(X^T L X) = trace(L) for any orthogonal matrix X. Interestingly, this implies that, for undirected graphs, our orthogonal matrix X⋆ can be obtained by applying an orthogonal transform to the Laplacian eigenvector basis.

Complexity issues. Clearly, looking at both SOC and PAMAL methods, complexity is a non-trivial issue which deserves further investigation, especially when the size of the graph increases. To get an idea of the computing time, in Fig. 9 we report the execution time of both SOC and PAMAL algorithms as a function of the number of vertices in the graph. The results have been obtained by running a non-compiled Matlab program, with no optimization of the parameters involved, by setting ρ^1 = β = 20. The program ran on a laptop with an Intel Core i7-4500 processor (CPU 1.8-2.4 GHz). The graphs under test were generated as geometric random graphs with an equal percentage of directed links as N increases.
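The GQV invariance claimed above follows from trace(X^T L X) = trace(L X X^T) = trace(L) for any orthogonal X. A quick numerical check (random toy Laplacian and a QR-generated orthogonal matrix, both our own choices, not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(0)

# random undirected toy graph and its Laplacian L = D - A
A = rng.random((8, 8))
A = (A + A.T) / 2
np.fill_diagonal(A, 0)
L = np.diag(A.sum(axis=1)) - A

# any orthogonal X leaves GQV(X) = trace(X^T L X) unchanged
X, _ = np.linalg.qr(rng.standard_normal((8, 8)))
gqv = np.trace(X.T @ L @ X)   # equals trace(L) up to round-off
```

This is why the quadratic variation cannot discriminate between orthonormal bases, while the absolute (ℓ1-type) variation can.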

Fig. 6: Average directed variation (± the standard deviation) for SOC and PAMAL methods vs. the iteration index m for the graph of Fig. 1a, averaged over 200 random initializations of the algorithms. Fig. 8: Average GQV versus the average minimum degree according to alternative GFT definitions for undirected scale-free graphs with N = 20 nodes.

Fig. 9: Execution time vs. the number of nodes for RGGs with 25% of directed links and β = ρ^1 = 20. Fig. 7: Average absolute total variation versus the average minimum degree according to alternative GFT definitions for undirected scale-free graphs with N = 20 nodes.

Examples with real networks. As an application to real graphs, in Fig. 10 we considered the directed graph obtained from the street map of Rome, incorporating the true directions of traffic lanes in the area around Mazzini square. The graph is composed of 239 nodes. Even though the scope of this paper is to propose a method to build a GFT basis, so that we do not dig further into applications, this is an example that has interesting applications in GSP. The problem in this case is to build a map of vehicular traffic in a city, starting from a subset of measurements collected by road-side units or sent by cars equipped with ad hoc equipment. The problem can be interpreted as the reconstruction of the entire graph signal from a subset of samples, and then it builds on graph sampling theory [10]. In Fig. 11 we report some basis vectors obtained by using Algorithm 2 with ρ^1 = 10. We can observe that the basis vectors highlight clusters, while capturing the edges' directivity.

Balanced total variation. In some cases, the solution of the total variation problem in (12) can cut the graph into subsets of very different cardinality. As an extreme case, it may not be uncommon to have a subset composed of only one node and the other set containing all the rest of the network. To prevent such a behavior, Algorithm 4 aims at minimizing the balanced total variation. An example of its application to the graph of Fig. 10 is reported in Fig. 12, where we show some basis vectors computed using Algorithm 4. Comparing these vectors with the corresponding ones obtained with the PAMAL algorithm, see, e.g., Fig. 11, we can see how clusters of single nodes are now avoided.

VII. CONCLUSION

In this paper we have proposed an alternative approach to build an orthonormal basis for the Graph Fourier Transform (GFT). The approach considers the general case of a directed graph and then includes the undirected case as a particular example. The search method starts from the identification of an objective function and then looks for an orthonormal basis that minimizes that function. More specifically, motivated by the need to detect clustering behaviors in graph signals, we chose the cut size as objective function. We showed that this approach leads, without loss of optimality, to the minimization of a function that represents a directed total variation of graph signals, as it captures the edges' directivity. Interestingly, in the case of undirected graphs, this function converts into an ℓ1-norm total variation, which represents the graph (discrete) counterpart of the ℓ1-norm total variation that plays a key role in the classical Fourier Transform of continuous-time signals [17]. We compared our basis vectors with the eigenvectors of either the Laplacian or adjacency matrix, assuming as performance metric either our graph absolute variation or the graph quadratic variation. As expected, our method outperforms the other methods when using the absolute variation, as it is built by minimizing that metric. However, what was interesting to see is that our basis performs as well as the alternative bases when we assumed the graph quadratic variation as performance metric. Before concluding, we wish to point out that, as always, our alternative approach to build a GFT basis has its own merits and shortcomings when compared to alternative approaches. For example, having restricted the search to the real domain, differently from available methods, our method fails to find the complex exponentials as the GFT basis in the case of circular graphs. Furthermore, other methods, like the ones in [1], starting from the identification of the adjacency matrix as the shift operator, are more suitable than our approach to devise a filtering theory over graphs.

Fig. 10: Directed graph associated to the street map of Rome (Piazza Mazzini).

APPENDIX

A. Closed-form solution for problem Q̃^{k,n}

In this section we provide a closed-form solution for the non-convex problem Q̃^{k,n}. This problem can be equivalently written as

P^{k,n} = arg min_{P∈R^{N×N}} g_{k,n−1}(P)  s.t.  P^T P = I    (31)

where g_{k,n−1}(P) ≜ ⟨Λ^k, P − X^{k,n−1}⟩ + (ρ^k/2) ‖P − X^{k,n−1}‖_F² + (c_2^{k,n−1}/2) ‖P − P^{k,n−1}‖_F². Our proof consists of two steps: i) first, we find the stationary solutions by solving the KKT necessary conditions; ii) then, we prove that the resulting closed-form solution is a global minimum of the non-convex problem (31). The Lagrangian function L_P associated to (31) can be written as

L_P = ⟨Λ^k, P − X^{k,n−1}⟩ + (ρ^k/2) ‖P − X^{k,n−1}‖_F² + (c_2^{k,n−1}/2) ‖P − P^{k,n−1}‖_F² + ⟨Λ_1, P^T P − I⟩    (32)

where Λ_1 ∈ R^{N×N} is the multipliers' matrix associated to the orthogonality constraint. The KKT conditions then become

a) ∇_P L_P = P[(ρ^k + c_2^{k,n−1}) I + 2Λ_1] − c_2^{k,n−1} P^{k,n−1} − ρ^k X^{k,n−1} + Λ^k = 0,
b) Λ_1 ⊥ (P^T P − I) = 0    (33)

where we chose Λ_1 = Λ_1^T. Hence, defining B ≜ I + 2Λ_1/(ρ^k + c_2^{k,n−1}), from equation a) one gets

P B = F    (34)

with F ≜ ( c_2^{k,n−1} P^{k,n−1} + ρ^k X^{k,n−1} − Λ^k ) / ( ρ^k + c_2^{k,n−1} ). Let QΣT^T be the SVD decomposition of F. From (34), it turns out that

P B = QΣT^T    (35)

and, using the orthogonality condition b) in (33), it holds

B^T B = TΣ²T^T  ⇒  B = TΣT^T.    (36)

Therefore, replacing B in (35), we get

P TΣT^T = QΣT^T  ⇒  P = QT^T.    (37)

It remains to prove that P⋆ = P^{k,n} = QT^T is a global minimum for problem (31). To this end, it is sufficient to show that

g_{k,n−1}(P⋆) ≤ g_{k,n−1}(P),  ∀ P : P^T P = I    (38)

i.e., using the equalities ‖P⋆‖_F² = ‖P‖_F² = N, we have to prove that, ∀ P : P^T P = I, it results

trace(P⋆^T (Λ^k − ρ^k X^{k,n−1} − c_2^{k,n−1} P^{k,n−1})) ≤ trace(P^T (Λ^k − ρ^k X^{k,n−1} − c_2^{k,n−1} P^{k,n−1})).    (39)

Using the above definition of F, (39) reduces to

trace(P⋆^T F) ≥ trace(P^T F),  ∀ P : P^T P = I    (40)

and, since P⋆ = QT^T, the final inequality to hold true is

trace(Σ) ≥ trace(T^T P^T QΣ),  ∀ P : P^T P = I.    (41)
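The closed form P⋆ = QT^T derived above is an orthogonal-Procrustes-type solution: among orthogonal matrices, it maximizes trace(P^T F), attaining the value trace(Σ), the sum of the singular values of F. A numerical sanity check with a random F (our own toy data, not a quantity from the algorithm):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 6
F = rng.standard_normal((N, N))

# closed-form orthogonal maximizer of trace(P^T F):
# P* = Q T^T, where F = Q Sigma T^T is the SVD of F
Q, s, Tt = np.linalg.svd(F)   # Tt is T^T in the paper's notation
P_star = Q @ Tt

# P* is orthogonal and trace(P*^T F) = trace(Sigma) = sum of singular values
val_star = np.trace(P_star.T @ F)

# no random orthogonal candidate should beat it
worse = [np.trace(np.linalg.qr(rng.standard_normal((N, N)))[0].T @ F)
         for _ in range(200)]
```

In the algorithm, F collects the proximal and multiplier terms of the subproblem, so this single SVD replaces an iterative solver for the orthogonality-constrained step.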

[Fig. 11 (panels omitted): each of the six basis vectors is plotted on the graph with its directed variation as panel title: GDV(x_3) = 0, GDV(x_5) = 0, GDV(x_17) = 0, GDV(x_27) = 0.44, GDV(x_29) = 0.57, GDV(x_63) = 0.91.]

Fig. 11: Optimal basis vectors x_k, k = 3, 5, 17, 27, 29, 63, for Algorithm 2 and the graph in Fig. 10.

[Fig. 12 (panels omitted): each of the six basis vectors is plotted on the graph with its directed variation as panel title: GDV(x_2) = 0, GDV(x_3) = 0, GDV(x_4) = 0.026, GDV(x_5) = 0.34, GDV(x_6) = 0.4, GDV(x_7) = 0.53.]

Fig. 12: Optimal basis vectors x_k, k = 2, ..., 7, for Algorithm 4 and the graph in Fig. 10.

Define Z := T^T P^T Q, so that Z^T Z = I. Then, from (41) we get

trace(Σ) ≥ trace(Z^T Σ),  ∀ Z : Z^T Z = I.    (42)

This last inequality holds because Σ_ii > 0 and Z_ii ≤ |Z_ii| ≤ 1, ∀ i, where the latter is implied by Z^T Z = I [40]. Additionally, Z_ii = 1, ∀ i, if and only if Z = I, so that the equality in (42) holds if and only if Z = I, i.e., P⋆ = QT^T.

B. Proof of Theorem 1

For lack of space, we omit here the details of the proof, which proceeds using similar arguments as in the proof of Proposition 2.5 in [26]. However, to invoke this correspondence, we need to prove that the following properties hold true: i) the function L_k in (15) satisfies the Kurdyka-Łojasiewicz (K-Ł) property; ii) L_k is a coercive function. To prove point i), let us first introduce some definitions [52].

Definition 3: A semi-algebraic subset of R^n is a finite union of sets of the form

{x ∈ R^n : P_1(x) = 0, ..., P_k(x) = 0, Q_1(x) > 0, ..., Q_l(x) > 0}    (43)

where P_1, ..., P_k and Q_1, ..., Q_l are polynomials in n variables.

Definition 4: A function f : R^n → R is said to be semi-algebraic if its graph, defined as gph f := {(x, f(x)) | x ∈ R^n}, is a semi-algebraic set.

It is shown [cf. [42], Th. 3] that semi-algebraic functions satisfy the K-Ł property.

Definition 5: A function φ(x) satisfies the Kurdyka-Łojasiewicz (K-Ł) property at a point x̄ ∈ dom(∂φ) if there exists θ ∈ [0, 1) such that

|φ(x) − φ(x̄)|^θ / dist(0, ∂φ(x))    (44)

is bounded around x̄.

The global convergence of the PAM method established in [43] requires the objective function to satisfy the K-Ł property. Define W := (X, P) and consider the function L_k in (15), i.e.,

L_k(W) = L(X, P, Λ^k; ρ^k) = f_1(X) + f_2(P) + g_k(X, P)    (45)

where f_1(X) = GDV(X), f_2(P) = δ_{S_t}(P) and g_k(X, P) = ⟨Λ^k, P − X⟩ + (ρ^k/2) ‖P − X‖_F². Observe that f_1(X) = Σ_{i,j=1}^N a_ji max(x_i − x_j, 0) is the weighted sum of the functions f_ij(x_i, x_j) = max(x_i − x_j, 0). Since a finite sum of semi-algebraic functions is also a semi-algebraic function, it is sufficient to show that f_ij is semi-algebraic. Assume, w.l.o.g., y_ij = x_i − x_j, so that z = f_ij(y_ij) = max(y_ij, 0). The graph of f_ij becomes

gph f_ij = {(y_ij, z) : z = y_ij, y_ij ≥ 0} ∪ {(y_ij, z) : z = 0, y_ij ≤ 0}

and, according to Definition 3, it is a semi-algebraic set. Then f_1(X), as a sum of semi-algebraic functions, is also semi-algebraic. Since f_2(P) and g_k(X, P) are semi-algebraic functions, it follows that L_k(W) is also semi-algebraic. It remains to prove point ii), i.e., to assess that L_k is a coercive function, namely that L_k(W) → ∞ when ‖W‖_∞ → ∞. Clearly, the term f_2(P) is coercive. The remaining terms in (45) can be written as

f_1(X) + g_k(X, P) = GDV(X) + (ρ^k/2)⟨X, X⟩ − ρ^k⟨P, X⟩ − ⟨Λ^k, X⟩ + ⟨Λ^k, P⟩ + (ρ^k/2)‖P‖_F².

Since P ∈ S_t, it holds ‖P‖_F² = N. Thus, from the inequalities ⟨A, B⟩ ≥ −‖A‖_F ‖B‖_F and ‖B‖_F ≤ ‖B‖_1, it holds ⟨Λ^k, P⟩ ≥ −√N ‖Λ^k‖_1, so that one gets

f_1(X) + g_k(X, P) ≥ GDV(X) + (ρ^k/2)⟨X, X⟩ − ρ^k‖X‖_1 − ⟨Λ^k, X⟩ − √N ‖Λ^k‖_1 + ρ^k N/2

where we used the inequality ρ^k⟨P, X⟩ ≤ ρ^k‖X‖_1. Observe that the sequence {ρ^k}_{k∈N} is non-decreasing when γ > 1, so that ρ^k > ρ^1. Then the function f_1(X) + g_k(X, P) is coercive, GDV(X) + (ρ^k/2)⟨X, X⟩ being a positive function.

REFERENCES

[1] A. Sandryhaila and J. M. F. Moura, "Discrete signal processing on graphs," IEEE Trans. Signal Process., vol. 61, no. 7, pp. 1644–1656, Apr. 2013.
[2] S. K. Narang and A. Ortega, "Perfect reconstruction two-channel wavelet filterbanks for graph structured data," IEEE Trans. Signal Process., vol. 60, no. 6, pp. 2786–2799, 2012.
[3] ——, "Compact support biorthogonal wavelet filter banks for arbitrary undirected graphs," IEEE Trans. Signal Process., vol. 61, no. 19, pp. 4673–4685, 2013.
[4] A. Sandryhaila and J. M. F. Moura, "Discrete signal processing on graphs: Frequency analysis," IEEE Trans. Signal Process., vol. 62, no. 12, pp. 3042–3054, Jun. 2014.
[5] D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Vandergheynst, "The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains," IEEE Signal Process. Mag., vol. 30, no. 3, pp. 83–98, May 2013.
[6] D. K. Hammond, P. Vandergheynst, and R. Gribonval, "Wavelets on graphs via spectral graph theory," Appl. Comput. Harmon. Anal., vol. 30, pp. 129–150, 2011.
[7] S. K. Narang, G. Shen, and A. Ortega, "Unidirectional graph-based wavelet transforms for efficient data gathering in sensor networks," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Mar. 2010, pp. 2902–2905.
[8] I. Pesenson, "Sampling in Paley-Wiener spaces on combinatorial graphs," Trans. of the American Math. Society, vol. 360, no. 10, pp. 5603–5627, Oct. 2008.
[9] A. Agaskar and Y. M. Lu, "A spectral graph uncertainty principle," IEEE Trans. Inform. Theory, vol. 59, no. 7, pp. 4338–4356, Jul. 2013.
[10] M. Tsitsvero, S. Barbarossa, and P. Di Lorenzo, "Signals on graphs: Uncertainty principle and sampling," IEEE Trans. Signal Process., vol. 64, no. 18, pp. 4845–4860, Sep. 2016.
[11] M. Tsitsvero and S. Barbarossa, "On the degree of freedom of signals on graphs," in Proc. European Signal Process. Conf., Nice, Sep. 2015, pp. 1521–1525.
[12] S. Chen, R. Varma, A. Sandryhaila, and J. Kovačević, "Discrete signal processing on graphs: Sampling theory," IEEE Trans. Signal Process., vol. 63, no. 24, pp. 6510–6523, Dec. 2015.
[13] X. Zhu and M. Rabbat, "Approximating signals supported on graphs," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Mar. 2012, pp. 3921–3924.
[14] R. Singh, A. Chakraborty, and B. S. Manoj, "Graph Fourier transform based on directed Laplacian," in Proc. Int. Conf. Signal Process. and Commun. (SPCOM), Jun. 2016, pp. 1–5.
[15] M. Püschel and J. M. F. Moura, "Algebraic signal processing theory: Foundation and 1-D time," IEEE Trans. Signal Process., vol. 56, no. 8, pp. 3572–3585, Aug. 2008.
[16] ——, "Algebraic signal processing theory: 1-D space," IEEE Trans. Signal Process., vol. 56, no. 8, pp. 3586–3599, Aug. 2008.
[17] S. Mallat, A Wavelet Tour of Signal Processing: The Sparse Way. Academic Press, 2009.
[18] F. Lozes, A. Elmoataz, and O. Lézoray, "Partial difference operators on weighted graphs for image processing on surfaces and point clouds," IEEE Trans. Image Process., vol. 23, no. 9, pp. 3896–3909, Sep. 2014.
[19] G. H. Golub and J. H. Wilkinson, "Ill-conditioned eigensystems and the computation of the Jordan canonical form," SIAM Review, vol. 18, no. 4, pp. 578–619, Oct. 1976.
[20] B. Girault, "Signal processing on graphs - Contributions to an emerging field," Ph.D. thesis, Ecole normale supérieure de Lyon - ENS Lyon, Dec. 2015. [Online]. Available: https://tel.archives-ouvertes.fr/tel-01256044
[21] A. Gadde, A. Anis, and A. Ortega, "Active semi-supervised learning using sampling theory for graph signals," in Proc. 20th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining (KDD '14), New York, NY, USA: ACM, 2014, pp. 492–501.
[22] A. Anis, A. E. Gamal, S. Avestimehr, and A. Ortega, "Asymptotic
[44] E. G. Birgin, D. Fernández, and J. M. Martínez, "On the boundedness of penalty parameters in an augmented Lagrangian method with constrained subproblems," Optimization Methods and Software, vol. 27, pp. 1001–1024, 2012.
[45] J. Shi and J. Malik, "Normalized cuts and image segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 8, pp. 888–905, Aug. 2000.
[46] M. Hein and S. Setzer, "Beyond spectral clustering - tight relaxations of balanced graph cuts," in Advances in Neural Inform. Process. Systems (NIPS), 2011, pp. 2366–2374.
[47] J. Cheeger, "A lower bound for the smallest eigenvalue of the Laplacian," in Problems in Analysis, R. C. Gunning, ed., Princeton Univ. Press, pp. 195–199, 1970.
[48] M. Hein and T. Bühler, "An inverse power method for nonlinear eigenproblems with applications in 1-spectral clustering and sparse PCA," in Advances in Neural Inform. Process.
Systems (NIPS), 2010, justification of bandlimited interpolation of graph signals for semi- pp. 847–855. supervised learning,” in Proc. IEEE Int. Conf. Acoust., Speech Signal [49] A. Szlam and X. Bresson, “Total variation and Cheeger cuts,” in Proc. Process. (ICASSP), Apr. 2015, pp. 5461–5465. 27th Int. Conf. on (ICML), 2010, pp. 1039–1046. [23] L. Lov´asz, “Submodular functions and convexity,” in A. Bachem et al. [50] S. Boyd and N. Parikh, Proximal Algorithms. Foundations and Trends (eds.) Math. Program. The State of the Art, Springer Berlin Heidelberg, in Optimization, 2013, vol. 1, no. 3. pp. 235–257, 1983. [51] R. Albert and A.-L. Barab´asi, “Statistical mechanics of complex net- [24] F. Bach, “Learning with submodular functions: A convex optimization works,” Rev. Mod. Phys, pp. 47–97, 2002. perspective,” Foundations and Trends in Machine Learning, vol. 6, no. [52] J. Bochnak, M. Coste, and M. F. Roy, Real Algebraic Geometry. 2–3, pp. 145–373, 2013. Springer-Verlag, Berlin, 1998. [25] R. Lai and S. Osher, “A splitting method for orthogonality constrained problems,” J. Scientific Computing, vol. 58, no. 2, pp. 431–449, Feb. 2014. [26] W. Chen, H. Ji, and Y. You, “An augmented Lagrangian method for l1- regularized optimization problems with orthogonality constraints,” SIAM J. Scientific Computing, vol. 38, no. 4, pp. B570–B592, 2016. [27] X. Bresson, T. Laurent, D. Uminsky, and J. H. von Brecht, “Convergence and energy landscape for Cheeger cut clustering,” in Advances in Neural Inform. Process. Systems (NIPS), 2012, pp. 1394–1402. [28] M. Newman, Networks: An Introduction. New York, NY, USA: Oxford Univ. Press, 2010. [29] L. Jost, S. Setzer, and M. Hein, “Nonlinear eigenproblems in data analysis: Balanced graph cuts and the ratioDCA-Prox,” in Extraction of Quantifiable Information from Complex Systems, Springer Intern. Publishing, vol. 102, pp. 263–279, 2014. [30] F. R. K. Chung, Spectral Graph Theory. American Math. Soc., 1997. [31] J. Nocedal and S. J. 
Wright, Numerical Optimization. Springer, 2006. [32] F. Bethuel, H. Brezis, and F. H´elein, “Asymptotics for the minimization of a Ginzburg-Landau functional,” Calculus of Variations and Partial Differential Equations, vol. 1, no. 2, pp. 123–148, 1993. [33] D. P. Bertsekas, Constraint optimization and Lagrange multiplier meth- ods. Belmont Massachusetts: Athena Scientific, 1999. [34] M. Fortin and R. Glowinski, Augmented Lagrangian Methods: Applica- tions to the Numerical Solution of Boundary-Value Problems. North Holland, 2000, vol. 15. [35] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. Foundations and Trends in Machine Learning, 2010, vol. 3, no. 1. [36] R. Glowinski and P. Le Tallee, Augmented Lagrangian and Operator- Splitting Methods in Nonlinear Mechanics. SIAM, 1989. [37] W. Yin, S. Osher, D. Goldfarb, and J. Darbon, “Bregman iterative algorithms for l1-minimization with application to compressed sensing,” SIAM J. Imag. Sciences, vol. 1, pp. 143–168, 2008. [38] S. Osher, M. Burger, D. Goldfarb, J. Xu, and W. Yin, “An iterative regu- larization method for total variation-based image restoration,” Multiscale Model. Simul., vol. 4, no. 2, pp. 460–489, 2005. [39] T. Goldstein and S. Osher, “The split Bregman method for l1-regularized problems,” SIAM J. Imag. Sciences, vol. 2, no. 2, pp. 323–343, 2009. [40] J. H. Manton, “Optimization algorithms exploiting unitary constraints,” IEEE Trans. Signal Process., vol. 50, no. 3, pp. 635–650, Mar. 2002. [41] R. Andreani, E. G. Birgin, J. M. Mart´ınez, and M. L. Schuverdt, “On augmented Lagrangian methods with general lower–level constraints,” SIAM J. Optimiz., vol. 18, no. 4, pp. 1286–1309, 2007. [42] J. Bolte, S. Sabach, and M. Teboulle, “Proximal alternating linearized minimization for nonconvex and nonsmooth problems,” Math. Program., vol. 146, no. 1–2, pp. 459–494, Aug. 2014. [43] H. Attouch, J. 
Bolte, and B. F. Svaiter, “Convergence of descent methods for semi–algebraic and tame problems: proximal algorithms, forward–backward splitting, and regularized Gauss–Seidel methods,” Math. Program., vol. 137, no. 1–2, pp. 91–129, Feb. 2013.
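As a concrete aside, the weighted sum $f_1(\mathbf{X}) = \sum_{i,j} a_{ji} \max(x_i - x_j, 0)$ used in the semi-algebraicity argument is straightforward to evaluate numerically. The NumPy sketch below is illustrative only: the function name `gdv` is ours, and the convention that `W[i, j]` stores the weight multiplying $\max(x_i - x_j, 0)$ is an assumption that must be matched to the paper's adjacency indexing.

```python
import numpy as np

def gdv(W, X):
    """Weighted directed variation: sum_{i,j} W[i, j] * max(x_i - x_j, 0),
    accumulated over every column x of X.  Illustrative sketch only; the
    pairing of W[i, j] with the ordered pair (i, j) is an assumption."""
    X = np.atleast_2d(X.T).T            # treat a 1-D signal as one column
    total = 0.0
    for x in X.T:                       # loop over graph signals (columns)
        diff = x[:, None] - x[None, :]  # diff[i, j] = x_i - x_j
        total += np.sum(W * np.maximum(diff, 0.0))
    return total

# Tiny 3-node example: only the pairs (1,2) and (2,3) carry weight.
W = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 2.0],
              [0.0, 0.0, 0.0]])
x = np.array([3.0, 1.0, 0.0])
# 1*max(3-1,0) + 2*max(1-0,0) = 4; a signal that increases along the
# weighted pairs contributes nothing.
print(gdv(W, x))                          # -> 4.0
print(gdv(W, np.array([0.0, 1.0, 3.0])))  # -> 0.0
```

Each term $\max(x_i - x_j, 0)$ is piecewise linear, which is exactly the two-piece polyhedral decomposition of $\mathrm{gph}\, f_{ij}$ invoked above.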
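Likewise, the remaining two terms of $\mathcal{L}_k$ in (45) are easy to make concrete: $g_k(\mathbf{X}, \mathbf{P})$ is a polynomial in the entries of $(\mathbf{X}, \mathbf{P})$, which is why it is trivially semi-algebraic, while $\delta_{St}(\mathbf{P})$ only tests membership in the Stiefel manifold ($\mathbf{P}^T \mathbf{P} = \mathbf{I}$). A hedged sketch, with function names and the numerical tolerance chosen here for illustration (not the paper's Matlab code):

```python
import numpy as np

def stiefel_indicator(P, tol=1e-8):
    """delta_St(P): 0 if P has orthonormal columns (P^T P = I), else +inf.
    The tolerance is an implementation choice for floating point."""
    k = P.shape[1]
    return 0.0 if np.allclose(P.T @ P, np.eye(k), atol=tol) else np.inf

def g_aug(X, P, Lam, rho):
    """Augmented term <Lam, P - X> + (rho/2)||P - X||_F^2 -- a polynomial
    in the entries of (X, P), hence semi-algebraic."""
    D = P - X
    return np.sum(Lam * D) + 0.5 * rho * np.sum(D * D)

# With P = X and P on the Stiefel manifold, every term vanishes.
P = np.eye(5)[:, :2]          # trivially orthonormal columns
Lam = np.ones((5, 2))
print(stiefel_indicator(P))                       # -> 0.0
print(g_aug(P, P, Lam, rho=10.0))                 # -> 0.0
# For X = 0: <Lam, P> + (rho/2)||P||_F^2 = 2 + 5*2 = 12
print(g_aug(np.zeros((5, 2)), P, Lam, rho=10.0))  # -> 12.0
```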