A structured diagonal Hessian approximation method with evaluation complexity analysis for nonlinear least squares

Hassan Mohammad¹,² and Sandra A. Santos²

¹ Department of Mathematical Sciences, Faculty of Physical Sciences, Bayero University, Kano, Nigeria. [email protected]

² Department of Applied Mathematics, Institute of Mathematics, Statistics and Scientific Computing, University of Campinas, Campinas, Brazil. [email protected]

June 9, 2018

Abstract

This work proposes a Jacobian-free strategy for addressing large-scale nonlinear least-squares problems, in which structured secant conditions are used to define a diagonal approximation of the Hessian. Proper safeguards are devised to ensure descent directions along the generated sequence. A worst-case evaluation complexity analysis is provided within the framework of a non-monotone line search. Numerical experiments contextualize the proposed strategy by addressing structured problems from the literature, also solved by related and recently presented conjugate gradient and multivariate spectral gradient strategies. The comparative computational results show a favorable performance of the proposed approach.

Keywords: Nonlinear least-squares problems; Large-scale problems; Jacobian-free strategy; Evaluation complexity; Global convergence; Computational results

1 Introduction

Nonlinear least-squares problems constitute a special class of unconstrained optimization problems that involves the minimization of the sum of squares of the so-called residual functions, usually assumed to be twice continuously differentiable. These problems may be solved either by general unconstrained minimization methods, or by specialized methods that take into account the particular structure of the objective (Björck, 1996). Nonlinear least-squares problems arise in many applications, such as data fitting, optimal control, parameter estimation, experimental design, and imaging problems (see, e.g., (Cornelio, 2011; Golub and Pereyra, 2003; Henn, 2003; Kim et al., 2007; Li et al., 2012; López et al., 2015; Tang, 2011), among others). Most of the classical and modified algorithms for solving nonlinear least-squares problems require the computation of the gradient vector and the Hessian matrix of the objective function. The Hessian, in this case, is the sum of two terms. The first one involves only the first-order derivatives (the Jacobian) of the residual functions, whereas the second involves their second-order derivatives. In practice, computing and storing the complete Hessian matrix is too expensive, because the exact second-order derivatives of the residual functions are rarely available at a reasonable cost. Such a matrix also requires a large amount of storage when considering large-scale problems. However, if the residual of the objective function is nonzero at the minimizer, the computation of the exact second-order

derivatives of the residual functions might become relevant. Nonzero residual problems occur in many practical instances (see (Nocedal and Wright, 2006; Sun and Yuan, 2006)). Newton-type methods, which require the exact Hessian matrix, may be too expensive for such problems. As a consequence, alternative approaches, which rest upon function evaluations and first-order derivatives, have been developed (see (Nazareth, 1980) for details). This is the case of the Gauss-Newton method (GN) and the Levenberg-Marquardt method (LM) (Hartley, 1961; Levenberg, 1944; Marquardt, 1963; Morrison, 1960). Both GN and LM neglect the second-order term of the Hessian matrix of the objective function. As a result, they are expected to perform well for solving small residual problems, but they may perform poorly on large residual problems (Dennis and Schnabel, 1996). Even the requirement of computing the Jacobian matrix for the aforementioned methods may contribute to some inefficiency in the case of large-scale problems. Thus, Jacobian-free approaches are of interest for solving such instances (see, for example, (Knoll and Keyes, 2004; Xu et al., 2012, 2016)).

Many studies have been devoted to the development of methods for minimizing the sum of squares of nonlinear functions (see, e.g., (Al-Baali, 2003; Al-Baali and Fletcher, 1985; Bartholomew-Biggs, 1977; Betts, 1976; Fletcher and Xu, 1987; Lukšan, 1996; McKeown, 1975; Nazareth, 1980)). Along the derivative-free philosophy, we should mention the seminal work of (Powell, 1965), as well as the recent class of algorithms developed in (Zhang et al., 2010), based on polynomial interpolation-based models. From another standpoint, and inspired by the regularizing aspect of the LM method, further analyses of the performance of methods with second- and higher-order regularization terms have been developed (Bellavia et al., 2010; Birgin et al., 2017; Cartis et al., 2013, 2015; Grapiglia et al., 2015; Nesterov, 2007; Zhao and Fan, 2016).

Structured quasi-Newton (SQN) methods have been proposed (Brown and Dennis, 1970; Dennis, 1973) to overcome the difficulties that arise when solving large residual problems using GN or LM methods. These methods combine a Gauss-Newton step and a quasi-Newton step in order to make good use of the structure of the Hessian of the objective function. SQN methods show compelling and improved numerical performance compared to the classical methods (Spedicato and Vespucci, 1988; Wang et al., 2010). However, their performance on large-scale problems is not encouraging, because they require a considerably large amount of matrix storage. Owing to this drawback, matrix-free approaches are preferable for solving large-scale nonlinear least-squares problems.

Motivated by the above discussion, we propose an alternative approximation of the complete Hessian matrix of the objective function of the nonlinear least-squares problem, using a diagonal matrix that carries some information of both the first- and the second-order terms of the Hessian. Our proposed approach is Jacobian-free, not requiring the storage of the Jacobian matrix, but only the action of its transpose upon a vector. Therefore, it is suitable for very large-scale problems.

The remainder of this paper is organized as follows. In Section 2 we discuss some preliminaries, in Section 3 we present the proposed method, and its evaluation complexity analysis is developed in Section 4. Section 5 contains the numerical experiments.
Some conclusions and prospective future work are given in Section 6. Throughout this paper, ∥·∥ stands for the Euclidean norm of vectors and the induced 2-norm of matrices.

2 Preliminaries

In this section, we first recall the diagonal quasi-Newton method for general unconstrained optimization problems, as described in (Deng and Wan, 2012), where the authors provide details about the manuscript (Shi and Sun, 2006), written in Chinese. According to (Deng and Wan, 2012), (Shi and Sun, 2006) presented a sparse (diagonal) quasi-Newton method for the unconstrained optimization problem

\[
\min_{x\in\mathbb{R}^n} f(x), \tag{1}
\]

where $f : \mathbb{R}^n \to \mathbb{R}$ is a continuously differentiable function. Their idea was motivated by the work of (Barzilai and Borwein, 1988) and (Raydan, 1993, 1997). Considering an iterative procedure for solving (1), in which the sequence $\{x_k\}$ is generated, the unconstrained optimization problem

\[
\min_{H_k} \frac{1}{2}\left\|H_k y_{k-1} - s_{k-1}\right\|^2 \;=\; \min_{\lambda_k} \frac{1}{2}\left\|(\lambda_k I)\, y_{k-1} - s_{k-1}\right\|^2, \quad k = 1, 2, \ldots, \tag{2}
\]
where $s_{k-1} = x_k - x_{k-1}$ and $y_{k-1} = \nabla f(x_k) - \nabla f(x_{k-1})$, is solved to obtain a scalar multiple of the identity that approximates the inverse of the Hessian.

Shi and Sun assumed $H_k$ to be a diagonal matrix, say $H_k = \mathrm{diag}(h_k^i)$, $i = 1, 2, \ldots, n$, and solved the constrained problem
\[
\min_{\ell_k \le h_k^i \le u_k} \; \frac{1}{2}\sum_{i=1}^{n} \left(h_k^i y_{k-1}^i - s_{k-1}^i\right)^2, \tag{3}
\]

where $\ell_k$ and $u_k$ are given bounds for $h_k^i$ such that $0 < \ell_k \le h_k^i \le u_k$, so that $H_k$ is safely positive definite. The solution of problem (3) is given by
\[
h_k^i = \begin{cases}
\dfrac{s_{k-1}^i}{y_{k-1}^i}, & \text{if } \ell_k \le \dfrac{s_{k-1}^i}{y_{k-1}^i} \le u_k,\\[1.5ex]
\ell_k, & \text{if } \dfrac{s_{k-1}^i}{y_{k-1}^i} < \ell_k,\\[1.5ex]
u_k, & \text{if } \dfrac{s_{k-1}^i}{y_{k-1}^i} > u_k,\\[1.5ex]
h_{k-1}^i, & \text{if } y_{k-1}^i = 0.
\end{cases} \tag{4}
\]
Shi and Sun's algorithm avoids the computation and storage of matrices in its iterations, making it suitable for large-scale problems. However, the numerical results depend on the bound parameters $\ell_k$ and $u_k$.

In a similar perspective, (Han et al., 2008) presented a multivariate spectral gradient method for unconstrained optimization. They found a solution to the problem
\[
\min_{b_k^i} \; \frac{1}{2}\sum_{i=1}^{n} \left(b_k^i s_{k-1}^i - y_{k-1}^i\right)^2 \tag{5}
\]
as follows:

\[
b_k^i = \begin{cases}
\dfrac{y_{k-1}^i}{s_{k-1}^i}, & \text{if } \dfrac{y_{k-1}^i}{s_{k-1}^i} > 0,\\[1.5ex]
\dfrac{s_{k-1}^T s_{k-1}}{s_{k-1}^T y_{k-1}}, & \text{otherwise}.
\end{cases} \tag{6}
\]
In order to bound the diagonal entries $b_k^i$, the authors used the technique introduced by (Raydan, 1997), that is, for some $0 < \epsilon < 1$ and $\delta > 0$, if $b_k^i \le \epsilon$ or $b_k^i \ge 1/\epsilon$, then $b_k^i$ is set to $\delta$.
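To make the update (6) concrete, the following minimal sketch (Python/NumPy, not taken from the paper, whose experiments were coded in MATLAB) computes the diagonal entries $b_k^i$ and applies the Raydan-type safeguard described above; the function name and the default values of $\epsilon$ and $\delta$ are illustrative assumptions.

```python
import numpy as np

def msgm_diagonal(s_prev, y_prev, eps=1e-10, delta=1.0):
    """Diagonal entries b_k of Eq. (6) with the Raydan-type safeguard.

    s_prev = x_k - x_{k-1},  y_prev = grad f(x_k) - grad f(x_{k-1}).
    Assumes s_prev^T y_prev != 0 for the fallback value.
    """
    ratio = np.divide(y_prev, s_prev,
                      out=np.zeros_like(y_prev), where=s_prev != 0)
    fallback = s_prev.dot(s_prev) / s_prev.dot(y_prev)   # s^T s / s^T y
    b = np.where(ratio > 0, ratio, fallback)
    # Safeguard: replace entries outside [eps, 1/eps] by delta.
    b = np.where((b <= eps) | (b >= 1.0 / eps), delta, b)
    return b
```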

Now, let us consider a special case of (1), namely, the least-squares problem (LS) of the form
\[
\min_{x\in\mathbb{R}^n} f(x), \qquad f(x) = \frac{1}{2}\|F(x)\|^2 = \frac{1}{2}\sum_{i=1}^{m} \left(F_i(x)\right)^2, \tag{7}
\]
where $F : \mathbb{R}^n \to \mathbb{R}^m$ ($m \ge n$) is a twice continuously differentiable mapping. The gradient and the Hessian of the objective function in (7) are given by

\[
\nabla f(x) = \sum_{i=1}^{m} F_i(x)\nabla F_i(x) = J(x)^T F(x), \tag{8}
\]

\[
\nabla^2 f(x) = \sum_{i=1}^{m} \nabla F_i(x)\nabla F_i(x)^T + \sum_{i=1}^{m} F_i(x)\nabla^2 F_i(x) = J(x)^T J(x) + S(x), \tag{9}
\]

where $F_i(x)$ is the $i$-th component of $F(x)$, $\nabla^2 F_i(x)$ is its Hessian, $J(x) = \nabla F(x)^T \in \mathbb{R}^{m\times n}$ is the Jacobian of the residual function $F$ at $x$, and $S(x)$ is the matrix representing the second term of (9). For simplicity, we denote the residual function computed at the $k$-th iterate by $F_k = F(x_k)$, and also $f_k = f(x_k)$, $J_k = J(x_k)$, and $g_k = \nabla f_k = J_k^T F_k$. To refer to the $i$-th component of a vector $v_k$, we use $v_k^i$.
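As a small illustration of the Jacobian-free point of view adopted later, the following sketch (Python/NumPy, with hypothetical routine names) evaluates $f$ and $g = J^T F$ for problem (7) under the assumption that only a routine computing products $J(x)^T v$ is available, rather than the Jacobian itself.

```python
import numpy as np

def objective_and_gradient(x, residual, jac_t_vec):
    """f(x) = 0.5 * ||F(x)||^2 and g(x) = J(x)^T F(x), Eqs. (7)-(8).

    residual(x)     -> F(x) in R^m
    jac_t_vec(x, v) -> J(x)^T v in R^n   (Jacobian-transpose-vector product)
    """
    F = residual(x)
    f = 0.5 * F.dot(F)
    g = jac_t_vec(x, F)
    return f, g
```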

2.1 Building blocks for approximating the Hessian

In what follows, we provide the necessary elements for developing our proposed diagonal Hessian approximation, taking into account the special structure of the objective function of the nonlinear least-squares problem. The first-order term of the Hessian of the objective function of (7) at a certain iterate $k-1$, $k \ge 1$, is given by
\[
A(x_{k-1}) = \sum_{i=1}^{m} \nabla F_i(x_{k-1}) \nabla F_i(x_{k-1})^T, \tag{10}
\]
so that the secant equation satisfied by the updating matrix $A(x_k)$ can be obtained as follows. From the Taylor expansion of $F_i(x_{k-1})$, since $s_{k-1} = x_k - x_{k-1}$, we have

\[
F_i(x_{k-1}) = F_i(x_k) - \nabla F_i(x_k)^T s_{k-1} + o(\|s_{k-1}\|), \tag{11}
\]

where $o : \mathbb{R}_+ \to \mathbb{R}$ is such that $\lim_{\xi \to 0} o(\xi)/\xi = 0$. Pre-multiplying (11) by $\nabla F_i(x_k)$ and summing both sides from $i = 1$ to $m$, we have
\[
J_k^T J_k s_{k-1} = J_k^T \gamma_{k-1} + \left(J_k^T 1_m\right) o(\|s_{k-1}\|),
\]

where $\gamma_{k-1} = F_k - F_{k-1}$, and $1_m \in \mathbb{R}^m$ is the vector with all entries equal to one. Let $A_k \approx J_k^T J_k$ be a diagonal matrix which approximately satisfies the secant condition
\[
A_k s_{k-1} \approx \hat{y}_{k-1}, \tag{12}
\]
where
\[
\hat{y}_{k-1} = J_k^T \gamma_{k-1}. \tag{13}
\]
Now, to approximate the Hessian term with the second-order derivatives, from the Taylor expansion of $\nabla F_i(x_{k-1})$, one can get

\[
\nabla^2 F_i(x_k)(x_k - x_{k-1}) = \nabla F_i(x_k) - \nabla F_i(x_{k-1}) + o(\|s_{k-1}\|), \tag{14}
\]

in which $o : \mathbb{R}_+ \to \mathbb{R}^n$ is such that each component $i \in \{1, \ldots, n\}$ satisfies $\lim_{\xi \to 0} o^i(\xi)/\xi = 0$. Pre-multiplying (14) by $F_i(x_k)$ and summing both sides from $i = 1$ to $m$ implies
\[
S(x_k) s_{k-1} = (J_k - J_{k-1})^T F_k + \left(F_k^T 1_m\right) o(\|s_{k-1}\|), \tag{15}
\]
with $1_m$ as before. The left-hand side of (15) is the product of $S(x_k)$, the second-order term of the Hessian, by $s_{k-1}$. Let $B_k \approx S(x_k)$ be a diagonal matrix approximately satisfying

\[
B_k s_{k-1} \approx \bar{y}_{k-1}, \tag{16}
\]
with
\[
\bar{y}_{k-1} = (J_k - J_{k-1})^T F_k. \tag{17}
\]
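Both right-hand sides can be formed from Jacobian-transpose-vector products alone. The sketch below (Python/NumPy; the routine name jac_t_vec is an illustrative assumption) computes $\hat{y}_{k-1}$ of (13) and $\bar{y}_{k-1}$ of (17) without ever forming $J_k$ or $J_{k-1}$.

```python
import numpy as np

def structured_secant_rhs(x_prev, x_curr, residual, jac_t_vec):
    """Right-hand sides of the secant conditions (12) and (16).

    Returns (y_hat, y_bar) with
      y_hat = J_k^T (F_k - F_{k-1})      -- Eq. (13)
      y_bar = (J_k - J_{k-1})^T F_k      -- Eq. (17)
    using only products of a Jacobian transpose with a vector.
    """
    F_prev, F_curr = residual(x_prev), residual(x_curr)
    gamma = F_curr - F_prev
    y_hat = jac_t_vec(x_curr, gamma)
    y_bar = jac_t_vec(x_curr, F_curr) - jac_t_vec(x_prev, F_curr)
    return y_hat, y_bar
```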

The choice of the right-hand side of (16) as the vector $\bar{y}_{k-1}$ in (17) was analyzed by (Dennis et al., 1981), and shown to be the better choice among the ones by (Betts, 1976; Dennis, 1973). Next, we provide an auxiliary result that will be important for defining the matrices $A_k$ and $B_k$ in view of the secant conditions (12) and (16), respectively.

Lemma 2.1 Let $D = \mathrm{diag}(d)$ be a diagonal matrix in $\mathbb{R}^{n\times n}$, and let $c$ and $s$ be vectors in $\mathbb{R}^n$. Then, the solution of the constrained linear least-squares problem with simple bounds
\[
\begin{array}{rl}
\displaystyle\min_{d\in\mathbb{R}^n} & \dfrac{1}{2}\|\mathrm{diag}(d)s - c\|^2\\[1.5ex]
\text{subject to} & -d \le 0
\end{array} \tag{18}
\]
is given by
\[
d^i = \begin{cases}
\dfrac{c^i}{s^i}, & \text{if } \dfrac{c^i}{s^i} > 0,\\[1.5ex]
0, & \text{if } \dfrac{c^i}{s^i} \le 0 \text{ or } s^i = 0,
\end{cases}
\qquad i = 1, 2, \ldots, n. \tag{19}
\]

Proof Let $\Phi(d) = \frac{1}{2}\|\mathrm{diag}(d)s - c\|^2$. For $d$ to be a local solution of (18), it must satisfy the following Karush-Kuhn-Tucker (KKT) conditions, in which $\mu \in \mathbb{R}^n$: (i) $\nabla\Phi(d) - \mu = 0$; (ii) $\mu \ge 0$; (iii) $d \ge 0$; (iv) $\mu^T d = 0$. Now, $\Phi(d) = \frac{1}{2}\sum_{i=1}^{n}(d^i s^i - c^i)^2$, and thus, for all $i = 1, 2, \ldots, n$,
\[
\frac{\partial}{\partial d^i}\Phi(d) = (d^i s^i - c^i)s^i,
\]
i.e., $\nabla\Phi(d) = \mathrm{diag}(\mathrm{diag}(d)s - c)\,s$. From the KKT conditions (i) and (ii), $\mu = \nabla\Phi(d)$, and thus

\[
(d^i s^i - c^i)s^i \ge 0, \quad \forall i, \tag{20}
\]
so that, from the KKT condition (iv),

\[
\sum_{i=1}^{n} d^i (d^i s^i - c^i)s^i = 0. \tag{21}
\]
Now, by the KKT condition (iii),
\[
d^i \ge 0, \quad \forall i. \tag{22}
\]
From (20), if $s^i = 0$, then $d^i$ is free to assume any nonnegative value; otherwise, if either $s^i > 0$ or $s^i < 0$, then $d^i \ge c^i/s^i$. Therefore, for all $i \in \{1, 2, \ldots, n\}$ such that $s^i \ne 0$ we have

\[
d^i \ge \frac{c^i}{s^i}. \tag{23}
\]
Consequently, from (22) and (23), it follows that
\[
d^i \ge \max\left\{0, \frac{c^i}{s^i}\right\}, \quad \forall i. \tag{24}
\]

Now, considering Equation (21), with the assumption that $s^i \ne 0$, we obtain
\[
\sum_{i \,|\, s^i \ne 0} d^i (d^i s^i - c^i)s^i = 0. \tag{25}
\]

From (20) and (22), we notice that the nonnegative terms of the sum (25) only add up to zero if each term is zero. Thus, because $s^i \ne 0$, either $d^i = 0$ or $d^i s^i - c^i = 0$. As a result, whenever $s^i$ ($\ne 0$) and $c^i$ have the same sign, $d^i = c^i/s^i$; otherwise, $d^i = 0$. This reasoning, together with (24), gives (19) and completes the proof.

According to Lemma 2.1, the resulting diagonal matrix is positive semidefinite. However, to obtain a descent direction that will be used with a suitable line search technique, we require our diagonal matrices to be positive definite. With this aim, we provide next our safeguarding strategy for assembling the diagonal matrices $A_k$ and $B_k$, so that the zeros of relation (19) are replaced by strictly positive values.
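Before the safeguards are introduced, note that the unsafeguarded rule (19) is immediate to vectorize; the sketch below (Python/NumPy, illustrative only) returns the nonnegative diagonal prior to any safeguarding.

```python
import numpy as np

def diagonal_secant_lsq(s, c):
    """Solution (19) of the bound-constrained problem (18):
    d_i = c_i / s_i when that ratio is positive, and 0 otherwise
    (including the case s_i = 0)."""
    ratio = np.divide(c, s, out=np.zeros_like(c), where=s != 0)
    return np.where(ratio > 0, ratio, 0.0)
```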

2.2 Safeguarding the components of the new diagonal matrices

In order to retain the positive definiteness of the Hessian approximation, the components of the diagonal matrices $A_k = \mathrm{diag}(a_k^i)$ and $B_k = \mathrm{diag}(b_k^i)$, namely $a_k^i$ and $b_k^i$ for all $i = 1, 2, \ldots, n$, must be strictly positive. By applying Lemma 2.1 to the secant conditions (12) and (16), for $i = 1, 2, \ldots, n$, expression (19) gives

\[
a_k^i = \begin{cases}
\dfrac{\hat{y}_{k-1}^i}{s_{k-1}^i}, & \text{if } \dfrac{\hat{y}_{k-1}^i}{s_{k-1}^i} > 0,\\[1.5ex]
0, & \text{if } \dfrac{\hat{y}_{k-1}^i}{s_{k-1}^i} \le 0 \text{ or } s_{k-1}^i = 0,
\end{cases}
\]
and
\[
b_k^i = \begin{cases}
\dfrac{\bar{y}_{k-1}^i}{s_{k-1}^i}, & \text{if } \dfrac{\bar{y}_{k-1}^i}{s_{k-1}^i} > 0,\\[1.5ex]
0, & \text{if } \dfrac{\bar{y}_{k-1}^i}{s_{k-1}^i} \le 0 \text{ or } s_{k-1}^i = 0.
\end{cases}
\]
If $s_{k-1}^i = 0$, then $a_k^i$ and $b_k^i$ are free to assume any nonnegative safeguarding value. So we focus on the case where $s_{k-1}^i \ne 0$ and the pairs $\hat{y}_{k-1}^i, s_{k-1}^i$ and $\bar{y}_{k-1}^i, s_{k-1}^i$ have different signs, $i = 1, 2, \ldots, n$. We consider the following two possibilities, in which $\beta \in (0, 1)$ is a shrinking factor, and $\delta > 0$ is a tolerance for ensuring strictly positive values.

Case I: Assume that $s_{k-1}^i > 0$.

• If $\hat{y}_{k-1}^i \le 0$, then $(J_k^T F_k)^i - (J_k^T F_{k-1})^i \le 0$, and thus $(J_k^T F_k)^i \le (J_k^T F_{k-1})^i$. In this case, we redefine
\[
\hat{y}_{k-1}^i = \beta \max\left\{ \max\left\{ (J_k^T F_k)^i, (J_k^T F_{k-1})^i \right\}, \delta \right\}, \tag{26}
\]
so that $a_k^i = \hat{y}_{k-1}^i / s_{k-1}^i > 0$.

• If $\bar{y}_{k-1}^i \le 0$, then $(J_k^T F_k)^i - (J_{k-1}^T F_k)^i \le 0$, so that $(J_k^T F_k)^i \le (J_{k-1}^T F_k)^i$. In this case, we redefine
\[
\bar{y}_{k-1}^i = \beta \max\left\{ \max\left\{ (J_k^T F_k)^i, (J_{k-1}^T F_k)^i \right\}, \delta \right\}, \tag{27}
\]
obtaining $b_k^i = \bar{y}_{k-1}^i / s_{k-1}^i > 0$.

Case II: Assume that $s_{k-1}^i < 0$.

• If $\hat{y}_{k-1}^i \ge 0$, then $(J_k^T F_k)^i - (J_k^T F_{k-1})^i \ge 0$, i.e., $(J_k^T F_k)^i \ge (J_k^T F_{k-1})^i$. In this case, we redefine
\[
\hat{y}_{k-1}^i = -\beta \max\left\{ \max\left\{ (J_k^T F_k)^i, (J_k^T F_{k-1})^i \right\}, \delta \right\}, \tag{28}
\]
so that $a_k^i = \hat{y}_{k-1}^i / s_{k-1}^i > 0$.

• If $\bar{y}_{k-1}^i \ge 0$, then $(J_k^T F_k)^i - (J_{k-1}^T F_k)^i \ge 0$, and thus $(J_k^T F_k)^i \ge (J_{k-1}^T F_k)^i$. In this case, we redefine
\[
\bar{y}_{k-1}^i = -\beta \max\left\{ \max\left\{ (J_k^T F_k)^i, (J_{k-1}^T F_k)^i \right\}, \delta \right\}, \tag{29}
\]
obtaining $b_k^i = \bar{y}_{k-1}^i / s_{k-1}^i > 0$.

3 The model algorithm

In this section we present the structured diagonal Hessian approximation model algorithm. The search direction of the new structured diagonal Hessian approximation can be obtained as follows:

\[
(A_k + B_k) d_k = -g_k, \tag{30}
\]
where
\[
A_k + B_k = \begin{cases}
I, & \text{if } k = 0,\\
\mathrm{diag}(a_k) + \mathrm{diag}(b_k), & \text{if } k \ge 1,
\end{cases} \tag{31}
\]
is a diagonal matrix, whose components are computed by
\[
a_k^i + b_k^i = \begin{cases}
\dfrac{\hat{y}_{k-1}^i + \bar{y}_{k-1}^i}{s_{k-1}^i}, & \text{if } s_{k-1}^i \ne 0,\\[1.5ex]
1, & \text{if } s_{k-1}^i = 0,
\end{cases}
\qquad i = 1, 2, \ldots, n, \tag{32}
\]
where the vectors $\hat{y}_{k-1}$ and $\bar{y}_{k-1}$ are defined in (13) and (17), respectively, with some components possibly redefined in (26)–(29), the safeguards of the previous section. Additionally, we safeguard $H_k = A_k + B_k$ against very small and very large values by means of a projection of its components onto a given scalar interval $[\ell, u]$ such that $0 < \ell \le 1$ and $u \ge 1$. Hence, the $i$-th component of matrix $H_k$ is
\[
h_k^i = \begin{cases}
\max\left\{ \min\left\{ \dfrac{\hat{y}_{k-1}^i + \bar{y}_{k-1}^i}{s_{k-1}^i},\, u \right\},\, \ell \right\}, & \text{if } s_{k-1}^i \ne 0,\\[1.5ex]
1, & \text{if } s_{k-1}^i = 0.
\end{cases} \tag{33}
\]
Our local structured diagonal Hessian approximation scheme for iteratively solving nonlinear least-squares problems is given by
\[
x_{k+1} = x_k - H_k^{-1} g_k. \tag{34}
\]
To globalize our algorithm, we use the efficient non-monotone line search of (Zhang and Hager, 2004). In this line search, if $d_k$ is a descent direction for the function $f$ at $x_k$, then the step length $\alpha$ should satisfy the following sufficient decrease condition:

\[
f(x_{k+1}) \le C_k + \theta \alpha g_k^T d_k, \quad \theta \in (0, 1), \tag{35}
\]
where
\[
C_0 = f(x_0), \qquad C_{k+1} = \frac{\eta_k Q_k C_k + f(x_{k+1})}{Q_{k+1}}, \tag{36}
\]

and $Q_0 = 1$, $Q_{k+1} = \eta_k Q_k + 1$, $\eta_k \in [0, 1]$. The sequence $\{C_k\}$ in the above line search scheme is a convex combination of the function values $f(x_0), f(x_1), \ldots, f(x_k)$. The parameter $\eta_k$ controls the degree of monotonicity: if $\eta_k = 0$ for each $k$, then the line search is purely monotone (Armijo-type); otherwise it is non-monotone. Based on these ideas, our globally convergent model algorithm is stated in Algorithm 1.
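For concreteness, the following sketch (Python/NumPy, hypothetical names, not the authors' MATLAB implementation) realizes the acceptance test (35) with backtracking by halving and the update (36) of the pair $(C_k, Q_k)$.

```python
def nonmonotone_backtracking(f, x, g, d, C, theta=1e-4):
    """Backtrack until the Zhang-Hager condition (35) holds.

    f : callable returning the objective value; g, d : gradient and direction;
    C : current non-monotone reference value C_k.
    Returns the accepted step alpha and f(x + alpha * d).
    """
    alpha, gTd = 1.0, g.dot(d)
    f_trial = f(x + alpha * d)
    while f_trial > C + theta * alpha * gTd:
        alpha *= 0.5
        f_trial = f(x + alpha * d)
    return alpha, f_trial

def update_reference(C, Q, f_new, eta):
    """Update (C_k, Q_k) -> (C_{k+1}, Q_{k+1}) according to Eq. (36)."""
    Q_new = eta * Q + 1.0
    C_new = (eta * Q * C + f_new) / Q_new
    return C_new, Q_new
```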

Algorithm 1: Structured Diagonal Hessian Approximation Method (SDHAM)

1. Input: $x_0 \in \mathbb{R}^n$, $\beta, \theta \in (0, 1)$, $0 \le \eta_{\min} \le \eta_{\max} \le 1$, $0 < \ell \le 1 \le u$, $\delta, \varepsilon > 0$, and $k_{\max} \in \mathbb{N}$.
2. Set $k = 0$, $H_k = I$, $Q_k = 1$. Compute $F_k$ and $g_k$, and set $C_k = f_k$.
3. While $\|g_k\| > \varepsilon$ and $k \le k_{\max}$ do
4.   Compute $d_k = -H_k^{-1} g_k$.
5.   Initialize $\alpha = 1$.
6.   While $f(x_k + \alpha d_k) > C_k + \alpha \theta g_k^T d_k$ do
7.     $\alpha = \alpha/2$.
     end
8.   Set $\alpha_k = \alpha$. Compute the next iterate $x_{k+1} = x_k + \alpha_k d_k$, and set $s_k = \alpha_k d_k$ and $\gamma_k = F_{k+1} - F_k$.
9.   Compute $H_{k+1}$ using Equation (33), where
\[
\hat{y}_k^i = (J_{k+1}^T \gamma_k)^i, \qquad \bar{y}_k^i = \left((J_{k+1} - J_k)^T F_{k+1}\right)^i, \qquad i = 1, 2, \ldots, n,
\]
     and $\hat{y}_k^i$, $\bar{y}_k^i$ are safeguarded against possible sign disagreement with $s_k^i$ using Equations (26)–(29) and $\delta > 0$.
10.  Choose $\eta_k \in [\eta_{\min}, \eta_{\max}]$ and compute $Q_{k+1}$ and $C_{k+1}$ using Equation (36).
11.  Set $k = k + 1$.
   end
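The sketch below (Python/NumPy, not the authors' MATLAB code) assembles the safeguarded diagonal $H_{k+1}$ of Equation (33) from already safeguarded vectors and computes the corresponding search direction; the default bound values mirror the roles of $\ell$ and $u$ in Algorithm 1 and are purely illustrative.

```python
import numpy as np

def sdham_diagonal(y_hat, y_bar, s, lower=1e-30, upper=1e30):
    """Safeguarded diagonal H_{k+1} of Eq. (33).

    y_hat, y_bar are assumed to be already safeguarded via (26)-(29);
    components with s^i = 0 receive the value 1.
    """
    ratio = np.divide(y_hat + y_bar, s, out=np.ones_like(s), where=s != 0)
    return np.where(s != 0, np.clip(ratio, lower, upper), 1.0)

def sdham_direction(h, g):
    """Search direction d_k = -H_k^{-1} g_k of Eq. (30), in O(n) operations."""
    return -g / h
```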

Remark 3.1 Since the matrix $H_k$ is diagonal, $H_k^{-1}$ is obtained by inverting the diagonal components of $H_k$, which are nonzero according to (33) and the safeguarding rule of Subsection 2.2. The product at line 4 of Algorithm 1 is simply the product between the elements of $H_k^{-1}$ and the corresponding components of $g_k$, computed in $O(n)$ operations.

Remark 3.2 To compute the gradient $g_k$, Algorithm 1 requires the product of the Jacobian transpose and the residual vector (i.e., $g_k = J_k^T F_k$). Furthermore, the proposed structured diagonal approximation of the Hessian matrix $H_k$ requires the vectors $\hat{y}_{k-1}$ and $\bar{y}_{k-1}$, both of which are obtained from the product of the Jacobian transpose and a vector. For structured problems with a known sparsity pattern, such products can be obtained by coding a loop-free subroutine that computes them directly, without forming or storing the Jacobian matrix, as illustrated below.
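As an illustration of Remark 3.2, consider the Broyden Tridiagonal residual of Table 1 in its usual form, $F_i(x) = (3 - 2x_i)x_i - x_{i-1} - 2x_{i+1} + 1$ with $x_0 = x_{n+1} = 0$. Its Jacobian is tridiagonal, so $J(x)^T v$ can be written with vectorized shifts and no explicit matrix; this sketch (Python/NumPy) is only indicative and is not the authors' implementation.

```python
import numpy as np

def broyden_tridiag_residual(x):
    """F_i(x) = (3 - 2 x_i) x_i - x_{i-1} - 2 x_{i+1} + 1, with x_0 = x_{n+1} = 0."""
    xm = np.concatenate(([0.0], x[:-1]))   # x_{i-1}
    xp = np.concatenate((x[1:], [0.0]))    # x_{i+1}
    return (3.0 - 2.0 * x) * x - xm - 2.0 * xp + 1.0

def broyden_tridiag_jtv(x, v):
    """Loop-free product (J(x)^T v)_j = (3 - 4 x_j) v_j - v_{j+1} - 2 v_{j-1}."""
    vm = np.concatenate(([0.0], v[:-1]))   # v_{j-1}
    vp = np.concatenate((v[1:], [0.0]))    # v_{j+1}
    return (3.0 - 4.0 * x) * v - vp - 2.0 * vm
```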

4 The evaluation complexity analysis

We now turn to the evaluation complexity analysis of Algorithm 1 (SDHAM). First we state our assumptions, which are assumed to hold throughout the subsequent analysis:

A. The level set $L(x_0) = \{x \in \mathbb{R}^n : f(x) \le f(x_0)\}$ is bounded, where $x_0 \in \mathbb{R}^n$ is a given initial iterate.

B. The objective function $f$ is twice continuously differentiable.

C. The gradient $\nabla f(x) = J(x)^T F(x)$ is uniformly continuous on an open convex set $\Omega$ that contains the level set $L(x_0)$.

Remark 4.1 Under Assumptions A and B, Assumption C implies that there exists $M > 0$ such that $\|\nabla^2 f(x)\| \le M$ for all $x \in \Omega$. Moreover, if $\nabla f$ is Lipschitz continuous on $\Omega$, then Assumption C holds, so that Assumption C is weaker than the Lipschitz continuity of $\nabla f$ on $\Omega$.

Next we prove that the directions generated by the Algorithm SDHAM satisfy the direction assumption of (Zhang and Hager, 2004).

Lemma 4.2 The directions generated by the Algorithm SDHAM satisfy the following relations:

(a) $g_k^T d_k \le -c_1 \|g_k\|^2$, and

(b) $\|d_k\| \le c_2 \|g_k\|$,

for all $k$, where $c_1, c_2 > 0$.

Proof (a) From the definition of $d_k$, and the boundedness of $H_k^{-1}$,
\[
g_k^T d_k = -g_k^T H_k^{-1} g_k = -\sum_{i=1}^{n} (g_k^i)^2 / h_k^i \le -\frac{1}{u}\sum_{i=1}^{n} (g_k^i)^2 = -c_1 \|g_k\|^2,
\]
where $0 < c_1 = 1/u$.

(b) It follows from the bounds $\ell \le h_k^i \le u$, for all $k$ and all $i$, and from the definition of $d_k$, that

\[
\|d_k\|^2 = g_k^T H_k^{-2} g_k = \sum_{i=1}^{n} \left(\frac{g_k^i}{h_k^i}\right)^2 \le \frac{1}{\ell^2}\sum_{i=1}^{n} (g_k^i)^2,
\]
therefore $\|d_k\| \le c_2 \|g_k\|$, with $0 < c_2 = 1/\ell$, concluding the proof.

The well-definiteness of the Algorithm SDHAM is proved next.

Proposition 4.3 Under Assumption C, the Algorithm SDHAM is well defined.

Proof Let $x_k \in \mathbb{R}^n$ be an iterate generated by the Algorithm SDHAM, and assume that $\|g_k\| \ne 0$. From Lemma 4.2 (a), the direction $d_k = -H_k^{-1} g_k$ satisfies $g_k^T d_k \le -c_1\|g_k\|^2 < 0$, being thus a descent direction for $f$ at $x_k$. Therefore, from Lemma 4.2 (b) and Assumption C, there exists a small enough step length $\hat{\alpha} > 0$ such that
\[
f(x_k + \hat{\alpha} d_k) \le C_k + \theta \hat{\alpha}\, g_k^T d_k
\]
is verified, and the next iterate $x_{k+1} = x_k + \hat{\alpha} d_k$ is well defined. For proving the existence of $\hat{\alpha}$, assume, for the sake of a contradiction, that

\[
f(x_k + \alpha_j d_k) > C_k + \theta \alpha_j g_k^T d_k, \tag{37}
\]
for all $\{\alpha_j\}$, $\alpha_j \ge 0$, $\alpha_j$ strictly decreasing, and such that $\lim_{j\to\infty} \alpha_j = 0$. Notice that, for all $k$, we have $f(x_k) = \frac{1}{2}\|F_k\|^2 \ge 0$, and $C_k$ is a convex combination of $C_{k-1}$ and $f(x_k)$. Since $C_0 = f(x_0)$, it follows that $C_k$ is also nonnegative for all $k$.

Now, in view of Lemma 4.2 (b), relation (37) gives

\[
f(x_k + \alpha_j d_k) > C_k - \theta \alpha_j c_2 \|g_k\|^2. \tag{38}
\]

Taking the limit as $j \to \infty$ in (38), since $\theta \alpha_j c_2 \|g_k\|^2 \to 0$, we reach

\[
f(x_k) \ge C_k. \tag{39}
\]

But
\[
C_k = \frac{\eta_{k-1} Q_{k-1} C_{k-1} + f(x_k)}{\eta_{k-1} Q_{k-1} + 1},
\]
i.e., $C_k$ is between $C_{k-1}$ and $f(x_k)$. Due to (39), it follows that $C_k = f(x_k)$. Consequently, $\eta_{k-1} Q_{k-1} C_{k-1} = 0$. As $C_{k-1} \ne 0$ and $Q_{k-1} \ne 0$, then $\eta_{k-1} = 0$, so the non-monotone line search turns into a monotone one. But, in such a case, (37) becomes

\[
f(x_k + \alpha_j d_k) > f(x_k) + \theta \alpha_j g_k^T d_k, \tag{40}
\]
\[
\Longrightarrow \quad \frac{f(x_k + \alpha_j d_k) - f(x_k)}{\alpha_j} > \theta g_k^T d_k. \tag{41}
\]
Taking (41) to the limit as $j \to \infty$, Assumption C gives $g_k^T d_k \ge \theta g_k^T d_k$, and since $g_k^T d_k < 0$, we have $\theta \ge 1$, a contradiction, concluding the proof.

The following result is obtained from Lemma 1.1 of (Zhang and Hager, 2004), being auxiliary for the analysis that comes in the sequel.

Lemma 4.4 The iterates generated by the Algorithm SDHAM satisfy $f_k \le C_k \le \zeta_k$, for all $k \ge 0$, where $\zeta_k$ is given by
\[
\zeta_k = \frac{1}{k+1}\sum_{i=0}^{k} f_i. \tag{42}
\]

Next, another auxiliary result for establishing the complexity analysis of the Algorithm SDHAM is presented.

Lemma 4.5 The sequence of iterates {xk} generated by the Algorithm SDHAM is contained in the level set L(x0).

Proof From the mechanism of Algorithm SDHAM, and Lemma 4.2 (a), we have

\[
f(x_{k+1}) \le C_k + \alpha_k \theta g_k^T d_k \le C_k - \alpha_k \theta c_1 \|g_k\|^2. \tag{43}
\]

Combining the updating relations (36) and (43),

\[
C_{k+1} = \frac{\eta_k Q_k C_k + f_{k+1}}{Q_{k+1}} \le \frac{\eta_k Q_k C_k + C_k - \alpha_k \theta c_1 \|g_k\|^2}{Q_{k+1}},
\]
and since $Q_{k+1} = \eta_k Q_k + 1$, we obtain

\[
C_{k+1} + \frac{\alpha_k \theta c_1 \|g_k\|^2}{Q_{k+1}} \le C_k, \quad \forall k. \tag{44}
\]

Notice that, by Lemma 4.4, $f_k \le C_k$. Therefore, from (44) we have

\[
f_{k+1} \le C_{k+1} \le C_k \le C_{k-1} \le \cdots \le C_0 = f_0,
\]
so that $\{x_k\} \subset L(x_0)$, and the proof is complete.

The worst-case evaluation complexity result of the Algorithm SDHAM is established below, resting upon the hypothesis that the gradient of $f$ is Lipschitz continuous.

Theorem 4.6 Let $f$ be the objective function of problem (7), and suppose that its gradient is Lipschitz continuous in $\mathbb{R}^n$, with constant $L > 0$. Assume that the Algorithm SDHAM stops with $\|g_k\| \le \varepsilon$. Then, the Algorithm SDHAM needs at most
\[
\left\lfloor \varrho\, \frac{f(x_0)}{\varepsilon^2} \right\rfloor
\]
function evaluations, where
\[
\varrho = \left( \theta c_1 \min\left\{ 1, \frac{(1-\theta)c_1}{L c_2^2} \right\} \right)^{-1}, \tag{45}
\]
and $c_1, c_2$ are the constants of Lemma 4.2.

Proof Under the Lipschitz continuity hypothesis, due to the Taylor expansion
\[
f(x + d) = f(x) + \int_0^1 \nabla f(x + \xi d)^T d \, d\xi,
\]
it follows that
\[
f(x + d) \le f(x) + \nabla f(x)^T d + \frac{L}{2}\|d\|^2, \tag{46}
\]
for all $x, d \in \mathbb{R}^n$. Now, let us take $\alpha > 0$ such that
\[
\alpha \le \frac{2(1-\theta)c_1}{L c_2^2} =: \nu, \tag{47}
\]
from which
\[
\alpha \le \frac{2(1-\theta)\left(-g_k^T d_k\right)}{L c_2^2 \|g_k\|^2} \le \frac{2(1-\theta)\left(-g_k^T d_k\right)}{L \|d_k\|^2},
\]
where the first inequality comes from Lemma 4.2 (a), and the second, from Lemma 4.2 (b). Therefore,
\[
\alpha \le \frac{2(\theta - 1)\, g_k^T d_k}{L \|d_k\|^2},
\]
and hence
\[
g_k^T d_k + \frac{L}{2}\alpha\|d_k\|^2 \le \theta g_k^T d_k.
\]

Multiplying the inequality above by $\alpha$, adding $f(x_k)$ to both sides, and noticing that, from Lemma 4.4, $f(x_k) \le C_k$, we obtain
\[
f(x_k) + \alpha g_k^T d_k + \frac{L}{2}\alpha^2 \|d_k\|^2 \le C_k + \alpha\theta g_k^T d_k.
\]
Applying relation (46), we have

\[
f(x_k + \alpha d_k) \le C_k + \alpha\theta g_k^T d_k. \tag{48}
\]

Since there exists $\alpha > 0$, upper bounded by $\nu$ given in (47), that satisfies the non-monotone decrease condition (48), the backtracking mechanism of the Algorithm SDHAM guarantees that
\[
\alpha_k \ge \min\left\{ 1, \frac{\nu}{2} \right\} = \min\left\{ 1, \frac{(1-\theta)c_1}{L c_2^2} \right\}. \tag{49}
\]
As a result, the number of backtracks necessary to achieve (48) is bounded by a constant that depends only on $\theta$, $c_1$, $c_2$ and $L$. Moreover, whenever (48) is satisfied, we obtain
\[
f(x_{k+1}) \le C_k - \theta\alpha_k c_1 \|g_k\|^2 \le C_k - \theta c_1 \min\left\{ 1, \frac{(1-\theta)c_1}{L c_2^2} \right\} \|g_k\|^2,
\]
where the first inequality comes from Lemma 4.2 (a), and the second one, from (49). Hence, if $\|g_k\| > \varepsilon$, then
\[
f(x_{k+1}) \le C_k - \theta c_1 \min\left\{ 1, \frac{(1-\theta)c_1}{L c_2^2} \right\} \varepsilon^2. \tag{50}
\]

From Lemma 4.5, $C_k \le f(x_0)$ for all $k$, and thus the total number of function evaluations performed while $\|g_k\| > \varepsilon$ cannot exceed
\[
\frac{f(x_0) - f_{\min}}{\theta c_1 \min\left\{ 1, \dfrac{(1-\theta)c_1}{L c_2^2} \right\} \varepsilon^2},
\]
in which $f_{\min} = 0$ is a lower bound for the objective function $f$ of problem (7), which may be attained or not, depending on whether the problem has zero or nonzero residue. Consequently, if the Algorithm SDHAM stops with $\|g_k\| \le \varepsilon$, then it has performed, at most,
\[
\left\lfloor \varrho\, \frac{f(x_0)}{\varepsilon^2} \right\rfloor
\]
function evaluations, where the constant $\varrho$ satisfies (45), concluding the proof.

Remark 4.7 We stress that Theorem 4.6 also encompasses the well-definiteness of the Algorithm SDHAM. Proposition 4.3, however, demands a slightly weaker hypothesis, cf. the reasoning of Remark 4.1. Moreover, alternatively to Theorem 4.6, the global convergence of the Algorithm SDHAM could have been obtained along the lines of Theorem 2.2 of (Zhang and Hager, 2004).

5 Numerical experiments

In this section we present the numerical experiments to assess the effectiveness of the Algorithm SDHAM for solving structured nonlinear least-squares problems. SDHAM was compared with the methods proposed by (Kobayashi et al., 2010) and (Han et al., 2008). The former is a conjugate gradient method for nonlinear least-squares that exploits the structure of the Hessian of the objective function, denoted CGSQN, which stands for conjugate gradient structured quasi-Newton. The latter is a multivariate spectral gradient method, referred to as MSGM, devised for unconstrained minimization, so that the special structure of the Hessian of the least-squares objective function is ignored. Nevertheless, its search directions are computed using a diagonal Hessian approximation similar to the one we have proposed within the Algorithm SDHAM, as detailed in (6). SDHAM is implemented using the non-monotone line search of (Zhang and Hager, 2004) with a quadratic-cubic backtracking strategy for computing the step size (Nocedal and Wright, 2006). The parameters are chosen as $\eta_{\min} = 0.1$, $\eta_{\max} = 0.85$, $\beta = 0.1$, $\ell = 10^{-30}$, $u = 10^{30}$, and $\delta = 10^{-4}$. We

have slightly relaxed the choice of (Santos and Silva, 2014) by defining the sequence $\{\eta_k\}$ of the Algorithm SDHAM as follows:

\[
\eta_k = 0.75\, e^{-(k/45)^2} + 0.1. \tag{51}
\]

It is clear from (51) that $0.1 \le \eta_k \le 0.85$. The CGSQN method was implemented using the strong Wolfe conditions

\[
\begin{array}{l}
f(x_k + \alpha_k d_k) \le f(x_k) + \theta \alpha_k g_k^T d_k,\\[1ex]
\left| g(x_k + \alpha_k d_k)^T d_k \right| \le -\sigma g_k^T d_k,
\end{array} \tag{52}
\]
for computing the step size $\alpha_k$, with $\theta = 0.0001$ and $\sigma = 0.1$. As recommended by the authors in (Kobayashi et al., 2010), we used the choices $\rho = 1$, $t = 1$ for the constants that appear in the CG algorithm. Furthermore, at each iteration ($k \ge 1$), the initial step length was chosen as

\[
\alpha_k^0 = \alpha_{k-1} \frac{g_{k-1}^T d_{k-1}}{g_k^T d_k}, \qquad \alpha_0 = 1.
\]
The MSGM was implemented using the Grippo, Lampariello and Lucidi non-monotone line search (Grippo et al., 1986) for computing the step length $\lambda_k$, with $\theta = 0.0001$ and $M = 10$. If the computed step size is not accepted, then the next trial step length is taken as $\alpha = 0.5\alpha$. The safeguarding parameter that keeps the sequence of the diagonal matrix entries (6) of the MSGM method uniformly bounded was chosen in the following way:
\[
\delta_k = \begin{cases}
1, & \text{if } \|g_k\| > 1,\\
\|g_k\|^{-1}, & \text{if } 10^{-5} \le \|g_k\| \le 1,\\
10^{5}, & \text{if } \|g_k\| < 10^{-5}.
\end{cases} \tag{53}
\]
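The rule (53) is straightforward to code; the sketch below (Python/NumPy, illustrative only) returns the MSGM safeguarding parameter for a given gradient.

```python
import numpy as np

def msgm_delta(g):
    """Safeguarding parameter delta_k of Eq. (53) for the MSGM method."""
    gnorm = np.linalg.norm(g)
    if gnorm > 1.0:
        return 1.0
    if gnorm >= 1e-5:          # 1e-5 <= ||g_k|| <= 1
        return 1.0 / gnorm
    return 1e5                 # ||g_k|| < 1e-5
```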

All algorithms were coded in MATLAB R2017a, and run on a DELL PC with an Intel Core i3 processor, 2.30GHz CPU speed, 4GB of RAM and the Windows 7 operating system. We use the same stopping criterion for all the tested algorithms, namely $\|g_k\| \le 10^{-4} = \varepsilon$. We addressed 28 large-scale problems of dimensions $n = m \in \{1000, 5000, 10000\}$, except for Function 21, for which we have used $n = m \in \{1500, 7500, 15000\}$, together with 12 small-scale problems, giving us a total of 96 instances. In Table 1, we list the test function names, including their references, the initial point used in the numerical experiments, and the residual size. Further details of the test problems can be found in the corresponding references.

The numerical results are shown in Tables 2, 3, 4 and 5, where we report the number of iterations (ITER), the number of function evaluations (FEVAL), the number of matrix-vector products (MVP), the CPU time in seconds (TIME), and the value of the squared residual at the stopping point, $\frac{1}{2}\|F(x^*)\|^2$ (VALf), for each of the methods tested. Failures are reported and denoted by F if the number of iterations is greater than 10000, the number of function evaluations is greater than 50000, or the norm of the gradient is not a number.

We see from Tables 2–5 that, out of the 96 instances, SDHAM failed for 3, whereas its closest competitor MSGM failed for 14 problems, and CGSQN failed for 33 problems. It is worth mentioning that the method we have proposed failed to solve 3 problems for which the other two methods also failed. We have used the performance profiles of (Dolan and Moré, 2002) to plot graphs in log2 scale according to the data from Tables 2–5, considering the 93 problems obtained after eliminating those 3 problems that could not be solved by any of the tested methods.

It is clear from Figure 1 that SDHAM successfully solves and wins for 60% of the problems with the fewest number of iterations, and it also tops the graph for the fraction of $\tau > 2$. MSGM solves and wins for 50% of the problems, and lastly, CGSQN solves and wins for about 15% of the problems. In terms of the number of function evaluations, SDHAM outperforms MSGM and CGSQN. This

can be easily observed from Figure 2. Based on this, we claim that in most of the experiments SDHAM requires the least number of backtracks for the step length computation, implying that the computed direction is mostly acceptable, as further discussed below. Figure 3, on the other hand, shows that SDHAM requires more matrix-vector products (MVP) than MSGM. This is not surprising, because SDHAM is a structured algorithm, whereas MSGM is not. In contrast with the other competitors, it can be observed from Figure 4 that the SDHAM algorithm requires less CPU time to reach the approximate solution.

In addition, we have assessed the effectiveness of the SDHAM direction using comparative box-plots of the distribution of the ratios FEVAL/ITER. After eliminating all the failures for each of the methods, we have prepared the corresponding box-plots for each method (MSGM, 82 problems; CGSQN, 63 problems; and SDHAM, 93 problems). Figure 5 shows that SDHAM solved 75% of the 93 problems with less than 1.5 function evaluations per iteration, whereas MSGM solved 75% of the 82 problems with less than 3.0 function evaluations per iteration, and CGSQN solved 75% of the 63 problems with no more than 3.8 function evaluations per iteration. These values imply that, in these experiments, SDHAM requires the least number of backtracks for the step length computation, compared to its counterparts.

Table 1: List of test functions with reference, starting point and residual size

S/N | Function name (reference) | Starting point | Residual size
Large-scale problems:
1 | Penalty Function I (La Cruz et al., 2004) | (1/3, 1/3, ..., 1/3) | zero
2 | Variably Dimensioned (Moré et al., 1981) | (1 − 1/n, 1 − 2/n, ..., 0) | zero
3 | Trigonometric Function (Moré et al., 1981) | (1/n, ..., 1/n) | zero
4 | Discrete Boundary Value (Moré et al., 1981) | (1/(n+1)(1/(n+1) − 1), ...) | zero
5 | Full Rank (Moré et al., 1981) | (1, 1, ..., 1) | small
6 | Linear Rank 1 Function (Moré et al., 1981) | (1, 1, ..., 1) | large
7 | Problem 202 (Lukšan and Vlček, 2003) | (2, 2, ..., 2) | zero
8 | Problem 206* (Lukšan and Vlček, 2003) | (1/n, ..., 1/n) | zero
9 | Problem 212 (Lukšan and Vlček, 2003) | (0.5, ..., 0.5) | zero
10 | AR Boundary Value Problem* (Lukšan and Vlček, 2003) | (1/n, ..., 1/n) | zero
11 | Strictly Convex Function I (Raydan, 1997) | (1/n, 2/n, ..., 1) | large
12 | Strictly Convex Function II (Raydan, 1997) | (1, 1, ..., 1) | large
13 | Brown Almost Linear (Moré et al., 1981) | (0.5, 0.5, ..., 0.5) | zero
14 | Exponential Function 1 (La Cruz et al., 2004) | (n/(n−1), n/(n−1), ..., n/(n−1)) | zero
15 | Exponential Function 2 (La Cruz et al., 2004) | (1/n², 1/n², ..., 1/n²) | zero
16 | Singular Function (La Cruz et al., 2004) | (1, 1, ..., 1) | zero
17 | Logarithmic Function (La Cruz et al., 2004) | (1, 1, ..., 1) | zero
18 | Trigonometric Exponential System* (Lukšan and Vlček, 2003) | (4, 4, ..., 4) | zero
19 | Extended Freudenstein & Roth (La Cruz et al., 2004) | (6, 3, 6, 3, ..., 6, 3) | zero
20 | Extended Powell Singular (La Cruz et al., 2004) | (1.5E-4, ..., 1.5E-4) | zero
21 | Function 21* (La Cruz et al., 2004) | (−1, −1, ..., −1) | zero
22 | Singular Broyden (Lukšan and Vlček, 2003) | (−1, −1, ..., −1) | zero
23 | Broyden Tridiagonal Function (Moré et al., 1981) | (−1, −1, ..., −1) | zero
24 | Generalized Broyden Tridiagonal (Lukšan and Vlček, 2003) | (−1, −1, ..., −1) | zero
25 | Extended Rosenbrock (Moré et al., 1981) | (−1, 1, −1, 1, ..., −1, 1) | zero
26 | Extended Himmelblau (Jamil and Yang, 2013) | (1, 1/n, 1, 1/n, ..., 1, 1/n) | zero
27 | Function 27 (La Cruz et al., 2004) | (100, 1/n², 1/n², ..., 1/n²) | zero
28 | Trigonometric Logarithmic Function** | (1, 1, ..., 1) | zero
Small-scale problems:
29 | Beale Function (Moré et al., 1981) | (1, 1) | zero
30 | Bard Function* (Moré et al., 1981) | (−1000, −1000, −1000) | large
31 | Brown Badly Scaled (Moré et al., 1981) | (1, 1) | zero
32 | Branin RCOS Function (Jamil and Yang, 2013) | (−5, 0) | small
33 | Jennrich and Sampson* (Moré et al., 1981) | (0.2, 0.2) | large
34 | Box 3D Function* (Moré et al., 1981) | (0, 0, 1) | zero
35 | Rank Deficient Jacobian (Gonçalves and Santos, 2016) | (−1, 1) | small
36 | Rosenbrock Function* (Moré et al., 1981) | (−1, 1) | zero
37 | Parameterized Problem (Huschens, 1994) | (10, 10) | small
38 | Freudenstein & Roth Function (Moré et al., 1981) | (0.5, −2) | large
39 | Linear Rank 1 with zero cols. & rows (Moré et al., 1981) | (1, 1, ..., 1) | small
40 | Linear Rank 2 (La Cruz et al., 2004) | (1, 1/n, 1/n, ..., 1/n) | zero
*The initial point is different from the source. **See Appendix for the details.

Table 2: Numerical Results for MSGM, CGSQN and SDHAM on large-scale Problems 1 to 9, F means failure.

ITER FEVAL MVP TIME VALf PROB m = n MSGM CGSQN SDHAM MSGM CGSQN SDHAM MSGM CGSQN SDHAM MSGM CGSQN SDHAM MSGM CGSQN SDHAM 1000 6 F 5 8 F 6 6 F 16 0.04937 F 0.021117 2.56E-06 F 1.23E-05 1 5000 5 F 15 6 F 16 6 F 46 0.466885 F 0.448912 4.84E-05 F 7.92E-05 10000 5 F 23 6 F 24 6 F 70 1.270691 F 0.784702 2.60E-06 F 1.42E-04 1000 1 F 3 2 F 5 2 F 10 0.119752 F 0.02357 2.23E-22 F 5.17E-19 2 5000 F F 4 F F 6 F F 13 F F 0.162431 F F 3.37E-25 10000 2 F 4 26 F 6 3 F 13 0.310928 F 0.245139 3.81E-23 F 5.83E-25 1000 5 F 8 18 F 41 7 F 25 0.027587 F 0.163666 1.29E-20 F 6.27E-19 3 5000 F F F F F F F F F F F F F F F 10000 F F F F F F F F F F F F F F F 1000 20 22 28 48 36 34 21 89 85 0.151333 0.193595 0.256873 3.49E-08 1.74E-08 3.70E-08 4 5000 6 9 12 20 18 15 7 37 37 0.248217 0.3218 0.504632 5.63E-09 2.82E-09 3.40E-09 10000 3 8 6 18 17 9 4 33 19 0.370197 0.422372 0.461165 2.02E-09 1.26E-09 1.00E-09 1000 2 F 2 3 F 3 3 F 7 0.17259 F 0.014425 0.502 F 0.502 5 5000 2 F 2 3 F 3 3 F 7 0.06091 F 0.064377 0.5004 F 0.5004 10000 2 F 2 3 F 3 3 F 7 0.03268 F 0.261286 0.5002 F 0.5002 1000 F F 3 F F 5 F F 10 F F 0.034039 F F 124.8126 6 5000 F F 3 F F 5 F F 10 F F 0.072743 F F 624.8125 10000 F F 3 F F 5 F F 10 F F 0.125148 F F 1.25E+03 16 1000 7 11 5 9 46 6 8 45 16 0.029173 0.044294 0.018465 5.01E-15 4.07E-09 1.39E-10 7 5000 7 14 5 9 53 6 8 57 16 0.060253 0.176751 0.307302 2.50E-14 2.21E-11 6.94E-10 10000 7 14 5 9 53 6 8 57 16 0.283629 0.5305 0.250679 5.01E-14 4.35E-11 1.39E-09 1000 32 20 38 60 33 45 33 81 115 0.209246 0.071715 0.20781 3.66E-08 2.17E-08 3.03E-08 8 5000 6 6 6 22 14 8 7 25 19 0.207378 0.077327 0.173812 4.42E-09 2.97E-09 2.83E-09 10000 3 5 3 19 12 5 4 21 10 0.262789 0.167809 0.185471 2.50E-09 1.07E-09 7.16E-10 1000 7 15 7 9 45 8 8 61 22 0.066021 0.05367 0.167718 1.37E-10 7.48E-10 3.32E-12 9 5000 7 15 7 9 48 8 8 61 22 0.199102 0.200552 0.239117 1.37E-10 7.61E-10 3.32E-12 10000 7 13 7 9 47 8 8 53 22 0.24259 0.345388 0.230193 1.37E-10 2.96E-09 3.32E-12 Table 3: Numerical Results for MSGM, CGSQN and SDHAM on large-scale Problems 10 to 18, F means failure.

ITER FEVAL MVP TIME VALf PROB m = n MSGM CGSQN SDHAM MSGM CGSQN SDHAM MSGM CGSQN SDHAM MSGM CGSQN SDHAM MSGM CGSQN SDHAM 1000 39 20 51 82 33 67 40 81 154 0.109706 0.06443 0.252709 3.32E-08 2.07E-08 2.80E-08 10 5000 8 6 11 24 14 17 9 25 34 0.197314 0.087121 0.271173 4.70E-09 2.96E-09 3.75E-09 10000 2 5 2 18 12 5 3 21 7 0.139015 0.216182 0.14838 2.78E-09 1.07E-09 3.24E-09 1000 1 8 4 2 32 5 2 33 13 0.156459 0.044034 0.059165 0.5 0.5 0.5 11 5000 1 8 4 2 32 5 2 33 13 0.062429 0.042786 0.161094 0.5 0.5 0.5 10000 1 8 4 2 32 5 2 33 13 0.199935 0.073398 0.137328 0.5 0.5 0.5 1000 9 F 8 10 F 11 10 F 25 0.021313 F 0.042862 1.67E+06 F 1.67E+06 12 5000 9 F 9 10 F 12 10 F 28 0.129961 F 0.370585 2.08E+08 F 2.08E+08 10000 9 F 12 10 F 15 10 F 37 0.271169 F 0.374726 1.67E+09 F 1.67E+09 1000 3 F 4 5 F 5 4 F 13 0.007375 F 0.031734 1.24E-07 F 1.24E-07 13 5000 4 F 4 7 F 6 5 F 13 0.066471 F 0.144092 5.00E-09 F 5.00E-09 10000 F F F F F F F F F F F F F F F 1000 4 14 5 13 26 6 5 57 16 0.060312 0.039216 0.045708 7.37E-08 4.53E-08 4.53E-08 14 5000 2 8 4 13 17 5 3 33 13 0.056524 0.155331 0.140486 1.10E-07 3.82E-08 3.62E-08 10000 2 6 3 14 14 4 3 25 10 0.167216 0.181063 0.123839 5.49E-08 2.99E-08 7.24E-08 1000 2 18 2 22 43 4 3 73 7 0.092363 0.079672 0.022668 1.13E-11 1.01E-10 1.13E-11 15 5000 2 30 2 27 62 4 3 121 7 0.167989 0.412895 0.074947 4.53E-13 7.72E-12 4.53E-13 10000 2 35 2 29 70 4 3 141 7 0.292698 0.675369 0.174287 1.13E-13 2.69E-12 1.13E-13 17 1000 31 109 29 35 170 31 32 437 88 0.218564 0.541997 0.347942 3.23E-06 1.69E-06 1.70E-06 16 5000 38 61 33 41 139 35 39 245 100 0.835618 1.122087 0.64672 1.52E-06 1.78E-06 2.20E-06 10000 43 41 35 48 121 37 44 165 106 0.950093 1.459593 1.224682 9.68E-07 4.66E-06 1.75E-06 1000 1 16 6 2 41 7 2 65 19 0.685516 0.060063 0.038491 0 3.89E-09 5.18E-12 17 5000 1 16 6 2 49 7 2 65 19 0.027315 0.34729 0.323921 0 3.29E-10 2.54E-11 10000 1 12 6 2 43 7 2 49 19 0.181986 0.310444 0.213843 0 1.44E-10 5.06E-11 1000 26 F 17 36 F 19 27 F 52 0.280566 F 0.327048 1.94E-11 F 1.37E-12 18 5000 28 F 17 48 F 19 29 F 52 0.666795 F 0.740089 9.39E-12 F 3.85E-13 10000 28 F 16 34 F 18 29 F 49 0.966948 F 0.991824 2.80E-11 F 2.65E-11 Table 4: Numerical Results for MSGM, CGSQN and SDHAM on large-scale Problems 19 to 28, F means failure.

ITER FEVAL MVP TIME VALf PROB m = n MSGM CGSQN SDHAM MSGM CGSQN SDHAM MSGM CGSQN SDHAM MSGM CGSQN SDHAM MSGM CGSQN SDHAM 1000 73 F 31 468 F 35 74 F 94 0.485248 F 0.172042 4.18E-10 F 3.25E-13 19 5000 F F 31 F F 35 F F 94 F F 0.605424 F F 1.62E-12 10000 F F 31 F F 35 F F 94 F F 0.757161 F F 3.25E-12 1000 2 5 1 15 25 2 3 21 4 0.257663 0.030449 0.009738 1.21E-12 3.28E-12 1.21E-12 20 5000 2 5 1 15 25 3 3 21 4 0.171162 0.077916 0.131582 6.03E-12 1.64E-11 6.04E-12 10000 2 5 1 15 25 3 3 21 4 0.294376 0.285255 0.110694 1.21E-11 3.28E-11 1.21E-11 1500 71 38 88 101 86 102 72 153 265 0.501871 0.254971 0.852861 1.79E-10 1.01E-10 5.66E-10 21 7500 72 39 103 102 88 118 73 157 310 1.116035 0.869544 2.26944 7.08E-10 1.08E-10 6.42E-10 15000 75 40 106 107 91 124 76 161 319 1.696645 1.437185 3.708238 2.49E-10 4.84E-11 4.78E-10 1000 20 136 36 27 222 38 21 545 109 0.163765 0.522557 0.174102 3.23E-08 1.75E-08 9.70E-08 22 5000 22 484 37 30 578 39 23 1937 112 0.455263 3.565122 1.054517 5.15E-08 9.74E-09 1.32E-07 10000 21 980 37 28 1073 39 22 3921 112 0.503334 10.03143 1.139421 2.09E-07 2.50E-08 2.68E-07 1000 19 110 13 32 182 15 20 441 40 0.045901 0.314076 0.045379 3.08E-12 4.18E-13 6.56E-12 23 5000 17 45 13 31 119 15 18 181 40 0.2226 0.452803 0.296201 8.06E-12 2.14E-12 6.56E-12 18 10000 16 76 13 28 153 15 17 305 50 0.366592 1.195828 0.343847 7.25E-12 6.37E-12 6.56E-12 1000 13 141 13 20 215 15 14 565 40 0.028782 0.46639 0.046192 6.52E-12 5.19E-13 1.56E-13 24 5000 13 43 13 20 121 15 14 173 40 0.184298 0.479552 0.230443 6.79E-12 3.57E-12 1.56E-13 10000 13 209 13 20 291 15 14 837 40 0.313876 2.48811 0.357831 6.82E-12 1.02E-11 1.56E-13 1000 F 1 1 F 2 2 F 5 4 F 0.015252 0.011977 F 0 0 25 5000 F 1 1 F 2 2 F 5 4 F 0.010777 0.048624 F 0 0 10000 F 1 1 F 2 2 F 5 4 F 0.232439 0.059754 F 0 0 1000 28 97 28 49 148 33 29 389 85 0.188102 0.218337 0.177441 7.98E-11 2.99E-10 2.78E-10 26 5000 29 37 30 45 88 35 30 149 91 0.592021 0.387213 0.440403 3.01E-11 1.33E-10 3.70E-12 10000 30 36 32 47 89 36 32 145 97 0.860893 0.594024 0.74386 2.02E-10 4.30E-11 6.27E-11 1000 30 13 25 31 75 27 31 53 76 0.344238 0.053442 0.082136 2.30E-07 2.69E-08 1.45E-07 27 5000 30 20 25 31 80 27 31 89 76 0.365395 0.282087 0.356092 2.30E-07 2.56E-07 1.45E-07 10000 30 19 25 31 81 27 31 77 76 0.455881 0.422523 0.497633 2.30E-07 8.29E-10 1.45E-07 1000 1 12 6 2 45 7 2 49 19 0.046555 0.222659 0.030187 0 1.04E-09 2.73E-12 28 5000 1 10 6 2 45 7 2 41 19 0.025227 0.325564 0.174144 0 4.43E-10 1.33E-11 10000 1 8 6 2 39 7 2 33 19 0.055576 0.699232 0.207987 0 3.36E-14 2.65E-11 Table 5: Numerical Results for MSGM, CGSQN and SDHAM on small-scale Problems 29 to 40, F means failure.

ITER FEVAL MVP TIME VALf PROB n, m MSGM CGSQN SDHAM MSGM CGSQN SDHAM MSGM CGSQN SDHAM MSGM CGSQN SDHAM MSGM CGSQN SDHAM 29 2,3 174 F 65 400 F 86 175 F 196 0.422211 F 0.479203 2.74E-08 F 9.26E-09 30 3,15 2 13 1 3 74 3 3 53 4 0.130271 0.20183 0.033895 8.72E+00 8.7174 8.72E+00 31 2,3 F F 10 F F 12 F F 31 F F 0.188574 F F 1.05E-27 32 2,2 21 F 10 35 F 12 22 F 31 0.185068 F 0.053585 1.99E-01 F 1.99E-01 33 2,20 5 15 10 11 77 11 6 61 25 0.016485 0.054146 0.054585 62.1811 62.1811 62.1811 19 34 3,10 27 22 34 40 55 43 28 89 103 0.303147 0.075935 0.191894 9.12E-10 1.84E-09 2.45E-09 35 2,3 7 14 7 8 48 9 8 57 22 0.14489 0.047093 0.021486 1.29E+00 1.29E+00 1.29E+00 36 2,2 F 1 1 F 2 2 F 5 4 F 0.002377 0.015603 F 0.00E+00 0.00E+00 37 2,3 8 F 6 24 F 8 9 F 19 6.91E-02 F 0.017107 5.00E-01 F 5.00E-01 38 2,2 693 692 64 2084 747 81 694 2769 193 0.549462 0.521478 0.062439 2.45E+01 2.45E+01 2.45E+01 39 10,10 2 F 1 3 F 3 3 F 4 8.66E-02 F 0.050119 1.8235 F 1.8235 40 10,10 10 F 9 37 F 11 11 F 28 0.122116 F 0.048189 1.96E-10 F 3.44E-20 1

[Figure: performance-profile curves for MSGM, CGSQN and SDHAM; horizontal axis τ (log2 scale), vertical axis P(τ).]
Figure 1: Performance profiles (log2 scaled) with respect to the number of iterations (ITER) for problems 1-40.

[Figure: performance-profile curves for MSGM, CGSQN and SDHAM; horizontal axis τ (log2 scale), vertical axis P(τ).]
Figure 2: Performance profiles (log2 scaled) with respect to the number of function evaluations (FEVAL) for problems 1-40.

[Figure: performance-profile curves for MSGM, CGSQN and SDHAM; horizontal axis τ (log2 scale), vertical axis P(τ).]
Figure 3: Performance profiles (log2 scaled) with respect to the number of matrix-vector products (MVP) for problems 1-40.

[Figure: performance-profile curves for MSGM, CGSQN and SDHAM; horizontal axis τ (log2 scale), vertical axis P(τ).]
Figure 4: Performance profiles (log2 scaled) with respect to the CPU time (TIME) for problems 1-40.

Figure 5: Box-plots of the distributions of the ratios FEVAL/ITER for the problems actually solved by MSGM, CGSQN and SDHAM. The corresponding medians are 1.7, 2.4 and 1.2, respectively.

6 Final remarks

We have proposed a structured diagonal Hessian approximation method for solving nonlinear least-squares problems (SDHAM). The presented algorithm is a Jacobian-free strategy, requiring neither forming nor storing the Jacobian matrix. Instead, it requires only a loop-free subroutine for computing the product of the Jacobian transpose by a vector. This is an advantage, especially for very large-scale and structured problems. In addition, our proposed method is suitable for both zero and nonzero residual problems.

To the best of our knowledge, this is the first time an attempt is made to diagonally approximate the complete Hessian of the objective function of (7), taking into account its particular structure. Although the approach is not new for general unconstrained optimization problems, we have addressed a special case of the unconstrained optimization problem, that is, minimizing sums of squares of nonlinear functions, taking advantage of the intrinsic structure of such a problem.

In the Algorithm SDHAM, the direction is always minus the product of a carefully devised positive definite diagonal approximation of the inverse of the Hessian matrix with the gradient. Consequently, any suitable line search strategy can be used for the global convergence of the algorithm. We have chosen Zhang and Hager's non-monotone line search. We have proved that, for obtaining an ε-accurate stationary point (i.e., $\|g_k\| \le \varepsilon$), the Algorithm SDHAM requires $O(\varepsilon^{-2})$ function evaluations. Our preliminary numerical results show that SDHAM efficiently solved more than 90% of the tested problems within the fewest number of iterations and function evaluations. The results also reveal that SDHAM is slightly faster than MSGM and CGSQN, matrix-free algorithms with similar requirements to SDHAM.

In future research, we intend to investigate further techniques for obtaining good approximations of the Hessian matrix, taking into account its special structure, and to develop the local convergence analysis of the Algorithm SDHAM. Because of the low memory requirements of SDHAM, we hope it will perform well when applied to practical data fitting and imaging problems. Indeed, this is also another subject for future research. Furthermore, the Algorithm SDHAM can be easily modified to address general large-scale nonlinear systems of equations directly, without the least-squares framework.

Appendix

Problem 28 in Table 1 is a modification of the Logarithmic Function (La Cruz et al., 2004), the 17th problem of Table 1, being defined as follows.

28. Trigonometric Logarithmic Function:
\[
F_i(x) = \ln(x_i + 1) - \frac{\sin(x_i)}{n}, \quad \text{for } i = 1, 2, \ldots, n.
\]

Acknowledgements This research is supported by the Academic Staff Training and Development (AST&D), Tertiary Education Trust Fund (TETFund), Nigeria (ACT 2011), and by the Brazilian agencies FAPESP (Fundação de Amparo à Pesquisa do Estado de São Paulo, grants 2013/05475-7 and 2013/07375-0) and CNPq (Conselho Nacional de Desenvolvimento Científico e Tecnológico, grant 302915/2016-8). We thank Dr. Douglas S. Gonçalves, affiliated with the Federal University of Santa Catarina, Brazil, for providing us with useful suggestions at the preliminary stage of this work.

References

M. Al-Baali. Quasi-Newton algorithms for large-scale nonlinear least-squares. In High Performance Algorithms and Software for Nonlinear Optimization, pages 1–21. Springer, 2003.

22 M. Al-Baali and R. Fletcher. Variational methods for non-linear least-squares. Journal of the Operational Research Society, 36(5):405–421, 1985. M. C. Bartholomew-Biggs. The estimation of the Hessian matrix in nonlinear least squares problems with non-zero residuals. Mathematical Programming, 12(1):67–80, 1977. J. Barzilai and J. M. Borwein. Two-point step size gradient methods. IMA Journal of Numerical Analysis, 8(1):141–148, 1988. Stefania Bellavia, Coralia Cartis, Nicholas I. M. Gould, Benedetta Morini, and Philippe L. Toint. Convergence of a regularized euclidean residual algorithm for nonlinear least-squares. SIAM Journal on Numerical Analysis, 48(1):1–29, 2010. J. T. Betts. Solving the nonlinear least square problem: Application of a general method. Journal of Optimization Theory and Applications, 18(4):469–483, 1976. E. G. Birgin, J. L. Gardenghi, J. M. Mart´ınez,S. A. Santos, and Ph. L. Toint. Worst-case evalu- ation complexity for unconstrained nonlinear optimization using high-order regularized models. Mathematical Programming, 163(1-2):359–368, 2017. A. Bj¨orck. Numerical methods for least squares problems. SIAM, Philadelphia, 1996. K. M Brown and J. E. Dennis. A new algorithm for nonlinear least-squares curve fitting. Technical report, Cornell University, New York, 1970. Coralia Cartis, Nicholas I. M. Gould, and Philippe L. Toint. On the evaluation complexity of cubic regularization methods for potentially rank-deficient nonlinear least-squares problems and its relevance to constrained nonlinear optimization. SIAM Journal on Optimization, 23(3):1553– 1574, 2013. Coralia Cartis, Nicholas I. M. Gould, and Philippe L. Toint. On the evaluation complexity of constrained nonlinear least-squares and general constrained nonlinear optimization using second- order methods. SIAM Journal on Numerical Analysis, 53(2):836–851, 2015. A. Cornelio. Regularized nonlinear least squares methods for hit position reconstruction in small gamma cameras. Applied Mathematics and Computation, 217(12):5589–5595, 2011. S. Deng and Z. Wan. A diagonal quasi-Newton spectral conjugate gradient algorithm for nonconvex unconstrained optimization problems. In Proceedings of the 5th International Conference on Optimization and Control with Applications, pages 305–310, December 2012. J. E. Dennis and R. B. Schnabel. Numerical Methods for Unconstrained Optimization and Nonlinear Equations, volume 16. SIAM, 1996. J. E. Dennis, D. M. Gay, and R. E. Walsh. An adaptive nonlinear least-squares algorithm. ACM Transactions on Mathematical Software (TOMS), 7(3):348–368, 1981. J.E. Dennis. Some computational techniques for the nonlinear least squares problem. In George D. Byrne and Charles A. Hall, editors, Numerical Solution of Systems of Nonlinear Algebraic Equa- tions, pages 157–183. Academic Press, New York, 1973. E. D. Dolan and J. J. Mor´e.Benchmarking optimization software with performance profiles. Math- ematical programming, 91(2):201–213, 2002. R. Fletcher and C. Xu. Hybrid methods for nonlinear least squares. IMA Journal of Numerical Analysis, 7(3):371–389, 1987. G. Golub and V. Pereyra. Separable nonlinear least squares: the variable projection method and its applications. Inverse Problems, 19(2):R1–R26, 2003.

23 D. S. Gon¸calves and S. A. Santos. Local analysis of a spectral correction for the Gauss-Newton model applied to quadratic residual problems. Numerical Algorithms, 73(2):407–431, 2016. Geovani N. Grapiglia, Jinyun Yuan, and Ya xiang Yuan. On the convergence and worst-case com- plexity of trust-region and regularization methods for unconstrained optimization. Mathematical Programming, 152(1-2):491–520, 2015. L. Grippo, F. Lampariello, and S. Lucidi. A nonmonotone line search technique for Newton’s method. SIAM Journal on Numerical Analysis, 23(4):707–716, 1986. L. Han, G. Yu, and L. Guan. Multivariate spectral gradient method for unconstrained optimization. Applied Mathematics and Computation, 201(1-2):621–630, 2008. H. O. Hartley. The modified Gauss-Newton method for the fitting of non- functions by least squares. Technometrics, 3(2):269–280, 1961. S. Henn. A Levenberg-Marquardt scheme for nonlinear image registration. BIT Numerical Mathe- matics, 43(4):743–759, 2003. J. Huschens. On the use of product structure in secant methods for nonlinear least squares problems. SIAM Journal on Optimization, 4(1):108–129, 1994. M. Jamil and X. S. Yang. A literature survey of benchmark functions for global optimisation problems. International Journal of Mathematical Modeling and Numerical Optimisation, 4(2): 150–194, 2013. S.J. Kim, K. Koh, M. Lustig, S. Boyd, and D. Gorinevsky. An interior-point method for large-scale ℓ1-regularized least squares. IEEE Journal of Selected Topics in Signal Processing, 1(4):606–617, 2007. D. A. Knoll and D. E. Keyes. Jacobian-free Newton–Krylov methods: a survey of approaches and applications. Journal of Computational Physics, 193(2):357–397, 2004. M. Kobayashi, Y. Narushima, and H. Yabe. Nonlinear conjugate gradient methods with structured secant condition for nonlinear least squares problems. Journal of Computational and Applied Mathematics, 234(2):375–397, 2010. W. La Cruz, J. M. Mart´ınez,and M. Raydan. Spectral residual method without gradient information for solving large-scale nonlinear systems: theory and experiments. Technical Report RT-04-08, Universidad Central de Venezuela, Venezuela, July 2004. K. Levenberg. A method for the solution of certain non-linear problems in least squares. Quarterly of Applied Mathematics, 2(2):164–168, 1944. J. Li, F. Ding, and G. Yang. Maximum likelihood least squares identification method for input non- linear finite impulse response moving average systems. Mathematical and Computer Modelling, 55(3-4):442–450, 2012. D. C. L´opez, T. Barz, S. Korkel, and G. Wozny. A Levenberg-Marquardt scheme for nonlinear image registration. Computers and Chemical Engineering, 77(Supplement C):24–42, 2015. L. Lukˇsan.Hybrid methods for large sparse nonlinear least squares. Journal of Optimization Theory and Applications, 89(3):575–595, 1996. L. Lukˇsanand J. Vlˇcek. Test problems for unconstrained optimization. Technical Report 897, Academy of Sciences of the Czech Republic, Institute of Computer Science, 2003. D. W. Marquardt. An algorithm for least-squares estimation of nonlinear parameters. Journal of the society for Industrial and Applied Mathematics, 11(2):431–441, 1963.

24 J. J. McKeown. Specialised versus general-purpose algorithms for minimising functions that are sums of squared terms. Mathematical Programming, 9(1):57–68, 1975. J. J. Mor´e,B. S. Garbow, and K. E. Hillstrom. Testing unconstrained optimization software. ACM Transactions on Mathematical Software (TOMS), 7(1):17–41, 1981. D. D. Morrison. Methods for nonlinear least squares problems and convergence proofs. In J. Lorell and F. Yagi, editors, Proceedings of the Seminar on Tracking Programs and Orbit Determination, pages 1–9. Jet Propulsion Laboratory, Pasadena, 1960. L. Nazareth. Some recent approaches to solving large residual nonlinear least squares problems. SIAM Review, 22(1):1–11, 1980. Yu. Nesterov. Modified Gauss-Newton scheme with worst case guarantees for global performance. Optimization Methods and Software, 22(3):469–483, 2007. J. Nocedal and S. J. Wright. Numerical Optimization. Springer Science, New York, 2 edition, 2006. M. J. D. Powell. A method for minimizing a sum of squares of non-linear functions without calculating derivatives. The Computer Journal, 7(4):303–307, 1965. M. Raydan. On the Barzilai and Borwein choice of steplength for the gradient method. IMA Journal of Numerical Analysis, 13(3):321–326, 1993. M. Raydan. The Barzilai and Borwein gradient method for the large scale unconstrained minimiza- tion problem. SIAM Journal on Optimization, 7(1):26–33, 1997. S. A. Santos and R. C. M. Silva. An inexact and nonmonotone proximal method for smooth unconstrained minimization. Journal of Computational and Applied Mathematics, 269:86–100, 2014. Z. Shi and G. Sun. A diagonal-sparse quasi-Newton method for unconstrained optimization problem (in Chinese). Journal of Systems Science and Mathematical Sciences, 26(1):101–112, 2006. E. Spedicato and M. T. Vespucci. Numerical experiments with variations of the Gauss-Newton algorithm for nonlinear least squares. Journal of Optimization Theory and Applications, 57(2): 323–339, 1988. W. Sun and Y. X. Yuan. Optimization Theory and Methods: . Springer Science, New York, 2006. Li Min Tang. A regularization homotopy for ill-posed nonlinear least squares problem and its application. In Advances in Civil Engineering, ICCET 2011, volume 90 of Applied Mechanics and Materials, pages 3268–3273. Trans Tech Publications, 10 2011. F. Wang, D. H. Li, and L. Qi. Global convergence of Gauss-Newton-MBFGS method for solving the nonlinear least squares problem. Advanced Modeling and Optimization, 12(1):1–18, 2010. W. Xu, T. F. Coleman, and G. Liu. A secant method for nonlinear least-squares minimization. Computational Optimization and Applications, 51(1):159–173, 2012. W. Xu, N. Zheng, and K. Hayami. Jacobian-free implicit inner-iteration preconditioner for nonlinear least squares problems. Journal of Scientific Computing, 68(3):1055–1081, 2016. H. Zhang and W. W. Hager. A nonmonotone line search technique and its application to uncon- strained optimization. SIAM Journal on Optimization, 14(4):1043–1056, 2004. H. Zhang, A. R. Conn, and K. Scheinberg. A derivative-free algorithm for least-squares minimiza- tion. SIAM Journal on Optimization, 20(6):3555–3576, 2010. Ruixue Zhao and Jinyan Fan. Global complexity bound of the Levenberg-Marquardt method. Optimization Methods and Software, 31(4):805–814, 2016.
