<<

Acceleration of RED via Vector Extrapolation

Tao Hong, Yaniv Romano and Michael Elad

Abstract—Models play an important role in inverse problems, serving as the prior for representing the original signal to be recovered. REgularization by Denoising (RED) is a recently introduced general framework for constructing such priors using state-of-the-art denoising algorithms. Using RED, solving inverse problems is shown to amount to an iterated denoising process. However, as the complexity of denoising algorithms is generally high, this might lead to an overall slow algorithm. In this paper, we suggest an accelerated technique based on vector extrapolation (VE) to speed-up existing RED solvers. Numerical experiments validate the gain obtained by VE, leading to substantial savings in computations compared with the original fixed-point method.

Index Terms—Inverse problem, RED – REgularization by Denoising, fixed-point, vector extrapolation, acceleration.

I. INTRODUCTION

Inverse problems in imaging address the reconstruction of clean images from their corrupted versions. The corruption can be a blur, loss of samples, downscaling or a more complicated operator (e.g., CT and MRI), accompanied by a noise contamination. Roughly speaking, inverse problems are characterized by two main parts: the first is called the forward model, which formulates the relation between the noisy measurement and the desired signal, and the second is the prior, describing the log-probability of the destination signal.

In recent years, we have witnessed a massive advancement in a basic inverse problem referred to as image denoising [1], [2], [3], [4], [5], [6]. Indeed, recent work goes as far as speculating that the performance obtained by leading image denoising algorithms is getting very close to the possible ceiling [7], [8], [9]. This motivated researchers to seek ways to exploit this progress in order to address general inverse problems. Successful attempts, as in [10], [11], [12], suggested an exhaustive manual adaptation of existing denoising algorithms, or the priors used in them, treating specific alternative missions. This line of work has a clear limitation, as it does not offer a flexible and general scheme for incorporating various image denoising achievements for tackling other advanced image processing tasks. This led to the following natural question: is it possible to suggest a general framework that utilizes the abundance of high-performance image denoising algorithms for addressing general inverse problems? Venkatakrishnan et al. gave a positive answer to this question, proposing a framework called the Plug-and-Play Priors (P3) method [13], [14], [15]. Formulating the inverse problem as an optimization task and handling it via the Alternating Direction Method of Multipliers (ADMM) scheme [16], P3 shows that the whole problem is decomposed into a sequence of image denoising sub-problems, coupled with simpler computational steps. The P3 scheme provides a constructive answer to the desire to use denoisers within inverse problems, but it suffers from several key disadvantages: P3 does not define a clear objective function, since the regularization used is implicit; tuning the parameters in P3 is extremely delicate; and since P3 is tightly coupled with the ADMM, it has no flexibility with respect to the numerical scheme.

A novel framework named REgularization by Denoising (RED) [17] proposes an appealing alternative while overcoming all these flaws. The core idea in RED is the use of the given denoiser within an expression of regularization that generalizes a Laplacian term. The work in [17] carefully shows that the gradient of this regularization is in fact the denoising residual. This, in turn, leads to several iterative algorithms, all guaranteed to converge to the global minimum of the inverse problem's penalty function, while using a denoising step in each iteration.

The idea of using a state-of-the-art denoising algorithm for constructing an advanced prior for general inverse problems is very appealing.¹ However, a fundamental problem still exists due to the high complexity of typical denoising algorithms, which are required to be activated many times in such a recovery process. Indeed, the evidence from the numerical experiments posed in [17] clearly exposes this problem in all the three methods proposed, namely the steepest descent, the fixed-point (FP) strategy and the ADMM scheme. Note that the FP method is parameter-free and the most efficient among the three, and yet this approach too requires the activation of the denoising algorithm dozens of times for a completion of the recovery algorithm.

Our main contribution in this paper is to address these difficulties by applying vector extrapolation (VE) [18], [19], [20] to accelerate the FP algorithm shown in [17]. Our simulations illustrate the effectiveness of VE for this acceleration, saving more than 50% of the overall computations involved compared with the native FP method.

The rest of this paper is organized as follows. We review RED and its FP method in Section II. Section III recalls the Vector Extrapolation acceleration idea. Several experiments on image deblurring and super-resolution, which follow the ones given in [17], show the effectiveness of VE, and these are brought in Section IV. We conclude our paper in Section V.

T. Hong and M. Elad are with the Department of Computer Science, Technion - Israel Institute of Technology, Haifa, 32000, Israel (e-mail: {hongtao,elad}@cs.technion.ac.il). Y. Romano is with Stanford University, Stanford, CA 94305, U.S.A. (e-mail: [email protected]).

¹Firstly, it enables utilizing the vast progress in image denoising for solving challenging inverse problems as explained above. Secondly, RED enables utilizing the denoiser as a black-box.

II. REGULARIZATION BY DENOISING (RED)

This section reviews the framework of RED, which utilizes denoising algorithms as image priors [17]. We also describe its original solver based on the Fixed Point (FP) method.

A. Inverse Problems as Optimization Tasks

From an estimation point of view, the signal x is to be recovered from its measurements y using the posterior conditional probability P(x|y). Using maximum a posteriori probability (MAP) estimation and Bayes' rule, the estimation task is formulated as

$$x^*_{\mathrm{MAP}} = \arg\max_x P(x|y) = \arg\max_x \frac{P(y|x)P(x)}{P(y)} = \arg\max_x P(y|x)P(x) = \arg\min_x -\log\{P(y|x)\} - \log P(x).$$

The last equality is obtained by the fact that P(y) does not depend on x. The term −log{P(y|x)} is known as the log-likelihood ℓ(y,x). A typical example is

$$\ell(y,x) \triangleq -\log\{P(y|x)\} = \frac{1}{2\sigma^2}\|Hx - y\|_2^2, \qquad (1)$$

referring to the case y = Hx + e, where H is any linear degradation operator and e is a white mean zero Gaussian noise with variance σ².² Note that the expression ℓ(y,x) depends on the distribution of the noise. Now, we can write the MAP optimization problem as

$$x^*_{\mathrm{MAP}} = \arg\min_x \ell(y,x) + \alpha R(x), \qquad (2)$$

where α > 0 is a trade-off parameter to balance ℓ(y,x) and R(x). R(x) ≜ −log P(x) refers to the prior that describes the statistical nature of x. This term is typically referred to as the regularization, as it is used to stabilize the inversion by emphasizing the features of the recovered signal. In the following, we will describe how RED activates denoising algorithms for composing R(x). Note that Equation (2) defines a wide family of inverse problems including, but not limited to, inpainting, deblurring, super-resolution [21] and more.

B. RED and the Fixed-Point Method

Define f(x) as an abstract and differentiable denoiser.³ RED suggests applying the following form as the prior:

$$R(x) = \frac{1}{2} x^T (x - f(x)), \qquad (3)$$

where T denotes the transpose operator. The term x^T(x − f(x)) is an image-adaptive Laplacian regularizer, which favors either a small residual x − f(x), or a small inner product between x and the residual [17]. Plugged into Equation (2), this leads to the following minimization task:

$$\min_x E(x) \triangleq \ell(y,x) + \frac{\alpha}{2} x^T (x - f(x)). \qquad (4)$$

The prior R(x) of RED is a convex function and easily differentiated if the following two conditions are met:
• Local Homogeneity: For any scalar c arbitrarily close to 1, we have f(cx) = c f(x).
• Strong Passivity: The Jacobian ∇_x f(x) is stable in the sense that its spectral radius is upper bounded by one, ρ(∇_x f(x)) ≤ 1.

Surprisingly, the gradient of E(x) is given by

$$\nabla_x E(x) = \nabla_x \ell(y,x) + \alpha(x - f(x)). \qquad (5)$$

As discussed experimentally and theoretically in [17, Section 3.2], many of the state-of-the-art denoising algorithms satisfy the above-mentioned two conditions, and thus the gradient of (4) is simply evaluated through (5). As a consequence, E(x) in Equation (4) is a convex function if ℓ(y,x) is convex, as in the case of (1). In such cases any gradient-based algorithm can be utilized to address Equation (4), leading to its global minimum.

Note that evaluating the gradient of E(x) calls for one denoising activation, resulting in an expensive operation since the complexity of good-performing denoising algorithms is typically high. Because of the slow convergence speed of the steepest descent and the high complexity of ADMM, the work reported in [17] suggested using the FP method to handle the minimization task posed in Equation (4). The development of the FP method is rather simple, relying on the fact that the global minimum of (4) should satisfy the first-order optimality condition, i.e., ∇_x ℓ(y,x) + α(x − f(x)) = 0. The FP method solves this equation via the following iterative formula:

$$\nabla_x \ell(y, x_{k+1}) + \alpha(x_{k+1} - f(x_k)) = 0. \qquad (6)$$

The explicit expression of the above for ℓ(y,x) = \frac{1}{2\sigma^2}\|Hx - y\|_2^2 is

$$x_{k+1} = \left(\frac{1}{\sigma^2} H^T H + \alpha I\right)^{-1} \left(\frac{1}{\sigma^2} H^T y + \alpha f(x_k)\right), \qquad (7)$$

where I represents the identity matrix.⁴ The convergence of the FP method is guaranteed since

$$\rho\!\left(\left(\frac{1}{\sigma^2} H^T H + \alpha I\right)^{-1} \alpha \nabla_x f(x_k)\right) < 1.$$

Although the FP method is more efficient than the steepest descent and the ADMM, it still needs hundreds of iterations, which means hundreds of denoising activations, to reach the desired minimum. This results in a high complexity algorithm, which we aim to address in this work. In the next section, we introduce an acceleration technique called Vector Extrapolation (VE) to substantially reduce the number of iterations in the FP method.

²White Gaussian noise is assumed throughout this paper.
³This denoising function admits a noisy image x, and removes additive Gaussian noise from it, assuming a prespecified noise energy.
⁴This matrix inversion is calculated in the Fourier domain for block-circulant H, or using iterative methods for the more general cases.
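To make the FP update (7) concrete, the following is a minimal Python/NumPy sketch of one iteration for the deblurring case, assuming H is a circular (block-circulant) convolution so that the matrix inverse is applied diagonally in the Fourier domain, as noted in footnote 4. The names fp_red_step, kernel_fft and denoise are ours for illustration; denoise stands in for the black-box denoiser f (TNRD in our experiments).

```python
import numpy as np

def fp_red_step(x_k, y, kernel_fft, sigma, alpha, denoise):
    """One RED fixed-point update, Eq. (7), for a circular-convolution blur H:
    x_{k+1} = (H^T H / sigma^2 + alpha I)^{-1} (H^T y / sigma^2 + alpha f(x_k)).
    The inverse is diagonal in the Fourier domain when H is block-circulant."""
    rhs = np.conj(kernel_fft) * np.fft.fft2(y) / sigma**2 \
          + alpha * np.fft.fft2(denoise(x_k))
    denom = np.abs(kernel_fft)**2 / sigma**2 + alpha
    return np.real(np.fft.ifft2(rhs / denom))

# Example wiring (hypothetical names):
#   kernel_fft = np.fft.fft2(psf, s=y.shape)        # PSF padded to the image size
#   denoise = lambda x: my_denoiser(x, noise_level)  # any differentiable denoiser f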
III. PROPOSED METHOD VIA VECTOR EXTRAPOLATION

We begin this section by introducing the philosophy of VE in linear and nonlinear systems, and then discuss three variants of VE⁵: Minimal Polynomial Extrapolation (MPE), Reduced Rank Extrapolation (RRE) and Singular Value Decomposition Minimal Polynomial Extrapolation (SVD-MPE) [22], [20]. Efficient implementation of these three variants is also discussed. We end this section by embedding VE in the FP method for RED, offering an acceleration of this scheme. Finally, we discuss the convergence and stability properties of VE.

⁵We refer the interested readers to [20] and the references therein to explore the VE technique further.

A. VE in Linear and Nonlinear Systems

Consider a vector set {x_i ∈ ℝ^N} generated via a linear process,

$$x_{i+1} = A x_i + b, \quad i = 0,1,\cdots, \qquad (8)$$

where A ∈ ℝ^{N×N}, b ∈ ℝ^N and x_0 is the initial vector. If ρ(A) < 1, a limit point x* exists, being the FP of (8), x* = A x* + b. We turn to describe how VE works on such linear systems [23]. Denote u_i = x_{i+1} − x_i, i = 0,1,⋯, and define the defective vector e_i as

$$e_i = x_i - x^*, \quad i = 0,1,\cdots. \qquad (9)$$

Subtracting x* from both sides of (8) and utilizing the fact that x* is the FP, we have e_{i+1} = A e_i, resulting in

$$e_{i+1} = A^{i+1} e_0. \qquad (10)$$

We define a new extrapolated vector x_{(m,κ)} as a weighted average of the form

$$x_{(m,\kappa)} = \sum_{i=0}^{\kappa} \gamma_i x_{m+i}, \qquad (11)$$

where Σ_{i=0}^{κ} γ_i = 1. Substituting (9) in (11) and using (10) and Σ_{i=0}^{κ} γ_i = 1, we have

$$x_{(m,\kappa)} = \sum_{i=0}^{\kappa} \gamma_i (x^* + e_{m+i}) = x^* + \sum_{i=0}^{\kappa} \gamma_i e_{m+i} = x^* + \sum_{i=0}^{\kappa} \gamma_i A^i e_m. \qquad (12)$$

Note that the optimal {γ_i} and κ should be chosen so as to force Σ_{i=0}^{κ} γ_i A^i e_m = 0. This way, we attain the FP through only one extrapolation step.

More broadly speaking, given a nonzero matrix B ∈ ℝ^{N×N} and an arbitrary nonzero vector u ∈ ℝ^N, we can find a unique polynomial P(z) with smallest degree yielding P(B)u = 0. Such a P(z) is called the minimal polynomial of B with respect to the vector u. Notice that the zeros of P(z) are eigenvalues of B. Thus, assume that the minimal polynomial of A with respect to e_m can be represented as

$$P(z) = \sum_{i=0}^{\kappa} c_i z^i, \quad c_\kappa = 1, \qquad (13)$$

resulting in P(A)e_m = 0. So, we have

$$\sum_{i=0}^{\kappa} c_i A^i e_m = \sum_{i=0}^{\kappa} c_i e_{m+i} = 0. \qquad (14)$$

Multiplying both sides of (14) by A results in Σ_{i=0}^{κ} c_i A e_{m+i} = Σ_{i=0}^{κ} c_i e_{m+i+1} = 0, and thus we receive

$$\sum_{i=0}^{\kappa} c_i e_{m+i} = \sum_{i=0}^{\kappa} c_i e_{m+i+1} = 0. \qquad (15)$$

Subtracting these expressions gives

$$\sum_{i=0}^{\kappa} c_i (e_{m+i+1} - e_{m+i}) = \sum_{i=0}^{\kappa} c_i (x_{m+i+1} - x_{m+i}) = \sum_{i=0}^{\kappa} c_i u_{m+i} = 0. \qquad (16)$$

This suggests that {c_i} could be determined by solving the linear equations posed in (16). Once {c_i} are obtained, {γ_i} are calculated through γ_i = c_i / Σ_{j=0}^{κ} c_j. Note that Σ_{j=0}^{κ} c_j ≠ 0 if I − A is not singular, yielding Σ_{j=0}^{κ} c_j = P(1) ≠ 0. Assuming κ is the degree of the minimal polynomial of A with respect to e_m, we can find a set of {γ_i} satisfying Σ_{i=0}^{κ} γ_i = 1 and resulting in Σ_{i=0}^{κ} γ_i x_{m+i} = x*. However, the degree of the minimal polynomial of A can be as large as N, which in our case is very high. Moreover, we also do not have a way to obtain this degree with an easy algorithm. Because of these two difficulties, approximate methods have been developed to extrapolate the next vector via the previous ones, and we discuss them in Subsection III-B.

Turning to the nonlinear case, denote F as the FP function used to evaluate the next vector,

$$x_{i+1} = F(x_i), \quad i = 0,1,\cdots, \qquad (17)$$

where F is an N-dimensional vector-valued function, F: ℝ^N → ℝ^N. We say x* is a FP of F if x* = F(x*). Expanding F(x) in its Taylor series yields

F(x) = F(x*) + F′(x*)(x − x*) + O(‖x − x*‖²) as x → x*,

where F′(·) is the Jacobian matrix of F(·). Recalling F(x*) = x*, we have

F(x) = x* + F′(x*)(x − x*) + O(‖x − x*‖²) as x → x*.

Assuming the sequence x_0, x_1, … converges to x* (which holds if ρ(F′(x*)) < 1), it follows that x_i will be close enough to x* for all large i, and hence

x_{i+1} = x* + F′(x*)(x_i − x*) + O(‖x_i − x*‖²), as i → ∞.

Then, we rewrite this in the form

x_{i+1} − x* = F′(x*)(x_i − x*) + O(‖x_i − x*‖²), as i → ∞.

For large i, the vectors {x_i} behave as in the linear system (I − A)x = b through

x_{i+1} = A x_i + b, i = 0,1,⋯,

where A = F′(x*) and b = [I − F′(x*)]x*. This implies that the nonlinear system yields the same formula as the linear one and motivates us to extrapolate the next vector from the previous ones as in linear systems. Indeed, such an extension has been shown to be successful in various areas of science and engineering, e.g., computational fluid dynamics, semiconductor research, tomography and geometrical image processing [19], [24], [20].
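As a small sanity check of the linear-system derivation above, the following NumPy experiment builds a toy iteration (8), solves the difference equations (16) for {c_i} with c_κ = 1 and κ = N, and verifies that the extrapolated vector (11) recovers the fixed point in a single extrapolation step. The toy dimensions and names are ours for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5
A = 0.5 * rng.standard_normal((N, N)) / np.sqrt(N)   # contraction: spectral radius < 1 (generically)
b = rng.standard_normal(N)
x_star = np.linalg.solve(np.eye(N) - A, b)            # exact fixed point of x = A x + b

x = [rng.standard_normal(N)]                          # x_0
for _ in range(N + 1):                                # generate x_1, ..., x_{N+1}
    x.append(A @ x[-1] + b)

# Eq. (16): sum_i c_i (x_{i+1} - x_i) = 0 with c_kappa = 1, here with m = 0 and kappa = N.
U = np.column_stack([x[i + 1] - x[i] for i in range(N + 1)])
c = np.append(np.linalg.lstsq(U[:, :N], -U[:, N], rcond=None)[0], 1.0)
gamma = c / c.sum()                                   # Eq. (11) weights, summing to one
x_extrap = sum(g * xi for g, xi in zip(gamma, x))     # combines x_0, ..., x_N
print(np.linalg.norm(x_extrap - x_star))              # ~ machine precision: FP in one step
```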

B. Derivations of MPE, RRE and SVD-MPE

We turn to discuss how to obtain the next vector approximately by extrapolating the previous ones. Due to the fact that the degree of the minimal polynomial can be as large as N and we cannot obtain it, an arbitrary positive number κ is set as the degree, being much smaller than the true one. With such a replacement, the linear equations in (16) become inconsistent and there does not exist a solution for {c_i}, c_κ = 1, in the ordinary sense. Alternatively, we solve instead

$$\min_c \|U_\kappa^m c\|_2^2, \quad \text{s.t. } c_\kappa = 1, \qquad (18)$$

where c = [c_0 ⋯ c_κ]^T and U_κ^m = [u_m ⋯ u_{m+κ}]. Then evaluating γ_i through c_i / Σ_{i=0}^{κ} c_i results in the next vector x_{(m,κ)} = Σ_{i=0}^{κ} γ_i x_{m+i} as a new approximation. This method is known as Minimal Polynomial Extrapolation (MPE) [25]. The detailed steps for obtaining the next vector through MPE are shown in Algorithm 1. To solve the constrained problem in (18), we suggest utilizing QR decomposition with the modified Gram-Schmidt (MGS) procedure [25], [26]. The MGS procedure for the matrix U_κ^m is shown in Algorithm 2.

Algorithm 1 Minimal Polynomial Extrapolation (MPE)
Initialization: A sequence of vectors {x_m, x_{m+1}, x_{m+2}, ⋯, x_{m+κ+1}} is produced by the baseline algorithm (FP in our case).
Output: A new vector x_{(m,κ)}.
1: Construct the matrix
   U_κ^m = [x_{m+1} − x_m, ⋯, x_{m+κ+1} − x_{m+κ}] ∈ ℝ^{N×(κ+1)}
   and then compute its QR factorization via Algorithm 2, U_κ^m = Q_κ R_κ.
2: Denote r_{κ+1} as the (κ+1)-th column of R_κ without the last row and solve the following κ×κ upper triangular system:
   R_{κ−1} c′ = −r_{κ+1},   c′ = [c_0, c_1, ⋯, c_{κ−1}]^T,
   where R_{κ−1} consists of the first κ columns of R_κ without the last row. Finally, evaluate {γ_i} through {c_i / Σ_{i=0}^{κ} c_i}.
3: Compute ξ = [ξ_0, ξ_1, ⋯, ξ_{κ−1}]^T through
   ξ_0 = 1 − γ_0;  ξ_j = ξ_{j−1} − γ_j,  j = 1, ⋯, κ − 1.
4: Compute η = [η_0, η_1, ⋯, η_{κ−1}]^T = R_{κ−1} ξ. Then we attain x_{(m,κ)} = x_m + Q_{κ−1} η as the new initial vector, where Q_{κ−1} represents the first κ columns of Q_κ.

Algorithm 2 Modified Gram-Schmidt (MGS)
Output: Q_κ and R_κ (r_{ij} denotes the (i,j)-th element of R_κ, and q_i and u_i represent the i-th column of Q_κ and U_κ^m, respectively).
1: Compute r_{11} = ‖u_1‖_2 and q_1 = u_1 / r_{11}.
2: for i = 2, ⋯, κ + 1 do
3:   Set u_i^{(1)} = u_i
4:   for j = 1, ⋯, i − 1 do
5:     r_{ji} = q_j^T u_i^{(j)} and u_i^{(j+1)} = u_i^{(j)} − r_{ji} q_j
6:   end for
7:   Compute r_{ii} = ‖u_i^{(i)}‖_2 and q_i = u_i^{(i)} / r_{ii}
8: end for

Now, let us discuss the other two variants of VE, i.e., Reduced Rank Extrapolation (RRE) [25] and SVD-MPE [22]. The main difference among RRE, MPE and SVD-MPE is at Step 2 in Algorithm 1, regarding the evaluation of {γ_i}. In RRE and SVD-MPE, we utilize the following methods to obtain {γ_i}:

RRE: Solving R_κ^T R_κ d = 1 (with 1 denoting the all-ones vector) through forward and backward substitution, we obtain γ through d / Σ_i d_i. Actually, such a formulation of γ is the solution of

$$\min_\gamma \|U_\kappa^m \gamma\|_2^2, \quad \text{s.t. } \sum_i \gamma_i = 1. \qquad (19)$$

SVD-MPE: Computing the SVD decomposition R_κ = UΣV^T, we have γ = v_{κ+1} / Σ_i v_{i,κ+1}, where v_{κ+1} and v_{i,κ+1} represent the last column and the (i,κ+1)-th element of the matrix V, respectively.

Here are two remarks regarding RRE, MPE and SVD-MPE:
• Observing the derivations of MPE, SVD-MPE and RRE, we notice that RRE's solution must exist unconditionally, while MPE and SVD-MPE may not exist because the sums of {c_i} and {v_{i,κ+1}} in MPE and SVD-MPE may become zero. Thus RRE may be more robust in practice [24]. However, MPE and RRE are related, as revealed in [27]. Specifically, if MPE does not exist, we have x_{(m,κ)}^{RRE} = x_{(m,κ−1)}^{RRE}. Otherwise, the following holds:
  μ_κ x_{(m,κ)}^{RRE} = μ_{κ−1} x_{(m,κ−1)}^{RRE} + ν_κ x_{(m,κ)}^{MPE},  with μ_κ = μ_{κ−1} + ν_κ,
  where μ_κ, μ_{κ−1} and ν_κ are positive scalars depending only on x_{(m,κ)}^{RRE}, x_{(m,κ−1)}^{RRE} and x_{(m,κ)}^{MPE}, respectively. Furthermore, the performance of MPE and RRE is similar – both methods either perform well or work poorly [20].
• Observe that we only need to store κ + 2 vectors in memory at all steps of Algorithm 2. When formulating the matrix U_κ^m, we overwrite the vector x_{m+i} with u_{m+i} = x_{m+i} − x_{m+i−1} once the latter is computed, and only x_m is always kept in memory. Next, u_{m+i} is overwritten by q_i, i = 1, ⋯, κ+1, when computing the matrix Q_κ. Thus, we do not need to save the vectors x_{m+1}, ⋯, x_{m+κ+1}, which implies that no additional memory is required in running Algorithm 2.
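The following is a compact NumPy sketch of Algorithm 1, assuming the columns of the input already hold x_m, ..., x_{m+κ+1}. For brevity it uses NumPy's built-in QR factorization in place of the MGS procedure of Algorithm 2; the function name mpe and the array layout are our own choices, not the authors' code.

```python
import numpy as np
from scipy.linalg import solve_triangular

def mpe(X):
    """Minimal Polynomial Extrapolation (sketch of Algorithm 1).

    X holds x_m, ..., x_{m+kappa+1} as its columns (shape N x (kappa+2))."""
    U = np.diff(X, axis=1)                        # u_{m+i} = x_{m+i+1} - x_{m+i}
    Q, R = np.linalg.qr(U)                        # stands in for the MGS of Algorithm 2
    kappa = U.shape[1] - 1
    c = solve_triangular(R[:kappa, :kappa],       # Step 2: R_{kappa-1} c' = -r_{kappa+1}
                         -R[:kappa, kappa])
    c = np.append(c, 1.0)                         # enforce c_kappa = 1
    gamma = c / c.sum()                           # gamma_i = c_i / sum_j c_j
    xi = 1.0 - np.cumsum(gamma[:-1])              # Step 3: xi_j = 1 - (gamma_0 + ... + gamma_j)
    eta = R[:kappa, :kappa] @ xi                  # Step 4: eta = R_{kappa-1} xi
    return X[:, 0] + Q[:, :kappa] @ eta           # x_(m,kappa) = x_m + Q_{kappa-1} eta
```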

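For completeness, here is a sketch of the two alternative weight computations described above. Only Step 2 of Algorithm 1 changes, so these routines return {γ_i}, and the extrapolated vector can then be formed as Σ_i γ_i x_{m+i}. Again, the names are ours, and NumPy's QR/SVD stand in for the MGS-based implementation.

```python
import numpy as np
from scipy.linalg import solve_triangular

def rre_gamma(U_m):
    """RRE weights: solve R^T R d = 1 by forward/backward substitution, gamma = d / sum(d)."""
    _, R = np.linalg.qr(U_m)
    ones = np.ones(R.shape[1])
    d = solve_triangular(R, solve_triangular(R.T, ones, lower=True))
    return d / d.sum()

def svd_mpe_gamma(U_m):
    """SVD-MPE weights: last right singular vector of R, normalized to sum to one."""
    _, R = np.linalg.qr(U_m)
    _, _, Vt = np.linalg.svd(R)
    v_last = Vt[-1]                  # last column of V, i.e., v_{kappa+1}
    return v_last / v_last.sum()

# Either weight vector is then combined as x_new = sum(gamma[i] * x[m + i]).
```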
C. Embedding VE in the Baseline Algorithm

We introduce VE in its cycling formulation for practical usage. One cycle means that we activate the baseline algorithm to produce {x_i} and then utilize VE once to evaluate the new vector, which serves as a novel initial point. Naturally, we repeat such a cycle many times. The steps of utilizing VE in its cycling mode are shown in Algorithm 3.

Algorithm 3 Baseline Algorithm + Vector Extrapolation
Initialization: Choose nonnegative integers m and κ and an initial vector x_0. The baseline algorithm is the FP method, as given in (7).
Output: Final solution x*.
1: while 1 do
2:   Obtain the series of x_i through the baseline algorithm, 1 ≤ i ≤ m+κ+1, and save x_{m+i} for 0 ≤ i ≤ κ+1 to formulate U_κ^m.
3:   Call Algorithm 1 to obtain x_{(m,κ)}.
4:   If the termination criterion of the algorithm is satisfied, set x* = x_{(m,κ)} and break; otherwise, set x_{(m,κ)} as the new initial point x_0 and go to Step 2.
5: end while

A few comments are in order:
• In practice, we utilize VE in its cycling formulation. Specifically, the iterative form shown in Algorithm 3 is named full cycling [20]. To save the complexity of computing {γ_i}, one may reuse the previous {γ_i}, a method known as cycling with frozen γ_i. Parallel VE can also be developed if more machines are available. Explaining the details of the last two strategies is out of the scope of this paper; we refer the reader to [20] for more information.
• Numerical experience also indicates that cycling with an even moderately large m > 0 will avoid stalling [28]. Moreover, we also recommend setting m > 0 when the problem becomes challenging to solve.
• In our case, the termination criterion in Algorithm 3 can be the total number of iterations (the number of calls to the baseline algorithm) or the difference between two consecutive vectors. Furthermore, we also recommend running additional iterations of the baseline algorithm after terminating the VE, which can stabilize the accelerated algorithm in practice.
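A minimal sketch of the cycling procedure of Algorithm 3 follows, reusing the fp_red_step and mpe sketches given earlier. The names, the fixed cycle budget and the flattening convention are our own illustrative assumptions.

```python
import numpy as np

def ve_cycling(fp_step, x0, m=0, kappa=5, n_cycles=10):
    """Cycled VE (sketch of Algorithm 3): run the baseline FP map, extrapolate once, restart."""
    x = x0
    for _ in range(n_cycles):                    # a fixed cycle budget replaces the stop rule here
        seq = [x]
        for _ in range(m + kappa + 1):           # x_1, ..., x_{m+kappa+1} from the baseline
            seq.append(fp_step(seq[-1]))
        x = mpe(np.column_stack(seq[m:]))        # extrapolate x_m, ..., x_{m+kappa+1}; new start
    return x

# Example wiring (hypothetical): for 2-D images, wrap the FP step so it maps vectors to vectors,
#   fp = lambda v: fp_red_step(v.reshape(y.shape), y, kernel_fft, sigma, alpha, denoise).ravel()
#   x_hat = ve_cycling(fp, y.ravel(), m=0, kappa=5).reshape(y.shape)
```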
D. Convergence and Stability Properties

We mention existing results regarding the convergence and stability properties of VE in order to understand this technique better. A rich literature has examined the convergence and stability properties of RRE, MPE and SVD-MPE in linear systems [29], [30]. Assuming the matrix A is diagonalizable, the kth iterate x_k has the form x_k = x* + Σ_{i=1}^{κ} v_i λ_i^k, where (λ_i, v_i) are some or all of the eigenvalues and corresponding eigenvectors of A, with distinct nonzero eigenvalues. Ordering the λ_i as |λ_1| ≥ |λ_2| ≥ ⋯, the following asymptotic behavior holds for all three variants of VE when |λ_κ| > |λ_{κ+1}|:

$$x_{(m,\kappa)} - x^* = O(\lambda_{\kappa+1}^m) \quad \text{as } m \to \infty. \qquad (20)$$

This implies that the sequence {x_{(m,κ)}}_{m=0}^∞ converges to x* faster than the original sequence {x_k}. As shown in (20), for a large m, (8) reduces the contributions of the smaller λ_i to the error x_{(m,κ)} − x*, while VE eliminates the contributions of the κ largest λ_i. This indicates that x_{(m,κ)} − x* is smaller than each of the errors x_{m+i} − x*, i = 0,1,⋯,κ, when m is large enough. We mention another observation: an increasing κ generally results in a faster convergence of VE. However, a large κ increases the storage requirements and also requires a much higher computational cost. Numerical experiments indicate that a moderate κ already works well in practice.

If the following condition holds, we say VE is stable:

$$\sup_m \sum_{i=0}^{\kappa} |\gamma_i^{(m,\kappa)}| < \infty. \qquad (21)$$

Here, we denote {γ_i} by {γ_i^{(m,κ)}} to show their dependence on m and κ. If (21) holds true, the error in x_i will not be magnified severely. As shown in [29], [30], [31], MPE and RRE obey such a stability property.

For nonlinear systems, the analysis of convergence and stability becomes extremely challenging. One of the main results is the quadratic convergence theorem [32], [23], [33]. This theorem is built on the special assumption that κ is set to be the degree of the minimal polynomial of F′(x*). The proof of the quadratic convergence was shown in [32]. In a following work, Smith, Ford and Sidi noticed that there exists a gap in this proof [23]. Jbilou et al. suggested two more conditions in order to close the gap [33]:
• The matrix F′(x*) − I is nonsingular;
• F′(·) satisfies the following Lipschitz condition:
  ‖F′(x) − F′(y)‖ ≤ L‖x − y‖, L > 0.

Surprisingly, these two conditions are met by the RED scheme. The first condition is satisfied by the fact that

$$\rho\!\left(\left(\frac{1}{\sigma^2} H^T H + \alpha I\right)^{-1} \alpha \nabla_x f(x)\right) < 1.$$

The second one is also true, due to the assumption in RED that the denoiser f(x) is differentiable. So we claim that it is possible for VE to solve RED with a quadratic convergence rate. Although VE can lead to a quadratic convergence rate, trying to achieve such a rate may not be realistic because κ can be as large as N. However, we may obtain a linear but fast convergence in practice with even moderate values of m and κ, as is also demonstrated in the following numerical experiments.

IV. EXPERIMENTAL RESULTS

We follow the same experiments of image deblurring and super-resolution as presented in [17] to investigate the acceleration performance of VE. The trainable nonlinear reaction diffusion (TNRD) method [6] is chosen as the denoising engine, and we choose the FP method as our baseline algorithm. For a fair comparison, the same parameters suggested in [17] for the different image processing tasks are set in our experiments. In [17], the authors compared RED with other popular algorithms in image deblurring and super-resolution tasks, showing its superiority. As the main purpose of this paper is to present the acceleration of our method for solving RED, we omit the comparisons with other popular algorithms. In the following, we mainly show the acceleration obtained by applying MPE with FP for solving RED, and then discuss the choice of the parameters in VE, i.e., m and κ. In addition, we compare our method with three other methods: steepest descent (SD), Nesterov's acceleration [34] and Limited-memory BFGS (L-BFGS) [35]. Note that we need to determine a proper step-size for these methods [35]. However, evaluating the objective value or gradient in RED is expensive, implying that any line-search method becomes prohibitive. In contrast, our framework as described in Algorithm 3 does not suffer from such a problem. In the following, we manually choose a fixed step-size that yields a good convergence behavior. Finally, we compare the differences among RRE, MPE and SVD-MPE. All of the experiments are conducted on a workstation with an Intel(R) Xeon(R) CPU E5-2699 @ 2.20GHz.

A. Image deblurring

In this experiment, we degrade the test images by convolving them with two different point spread functions (PSFs), i.e., a 9×9 uniform blur and a Gaussian blur with a standard deviation of 1.6. In both of these cases, we add an additive Gaussian noise with σ = √2 to the blurred images. The parameters m and κ in Algorithm 3 are set to 0 and 5 for the image deblurring task. Additionally, we apply VE to the case where the baseline algorithm is SD, called SD-MPE, with the parameters m and κ chosen as 0 and 8. The value of the cost function and the peak signal to noise ratio (PSNR) versus iteration or CPU time are given in Figs. 2 and 3.⁶ These correspond to both the uniform and the Gaussian blur kernels, all tested on the "Starfish" image. Clearly, we observe that SD is the slowest algorithm. Surprisingly, SD-MPE and Nesterov's method yield almost the same convergence speed, despite their totally different schemes. Moreover, we note that FP is faster than L-BFGS, Nesterov's method and SD-MPE. Undoubtedly, FP-MPE is the fastest one, both in terms of iterations and CPU time, which indicates the effectiveness of MPE's acceleration. To provide a visual impression, we show the evolution of the reconstruction quality of the different algorithms in Fig. 1. Clearly, the third column of FP-MPE achieves the best reconstruction fastest, while the other methods need more iterations to obtain a comparable result.

The additional nine test images suggested in [17] are also included in our experiments, in order to investigate the performance of VE further. In this experiment we focus on the comparison between FP-MPE and FP for these images. We first run the native FP method for 200 iterations and denote the final image by x*; the corresponding cost value is E(x*). We then activate Algorithm 3 with the same initial value as used in the FP method and examine how many iterations are needed to attain the same or a lower objective value than E(x*). The final numbers of iterations for the different images are given in Table I. Clearly, an acceleration is observed for all the test images in the image deblurring task.

B. Image super-resolution

We generate a low resolution image by blurring the ground truth one with a 7×7 Gaussian kernel with standard deviation 1.6 and then downsampling by a factor of 3. Afterwards, an additive Gaussian noise with σ = 5 is added to the resulting image. The same parameters m and κ used in the deblurring task for FP-MPE are adopted here. For SD-MPE, the parameters m and κ are set to 1 and 10, respectively. We choose "Plants" as our test image because it needs more iterations for FP-MPE to converge. As observed from Fig. 4, while L-BFGS and Nesterov's method are faster than the FP method, our acceleration method (FP-MPE) is quite competitive with both. Furthermore, we investigate all of the test images shown in [17] to see how many iterations are needed for MPE to achieve the same or a lower cost compared with the FP method. The results are shown in Table II. As can be seen, MPE works better than the FP method, indicating an effective acceleration for solving RED.

C. The Choice of the Parameters and the Difference Among RRE, MPE and SVD-MPE

We conclude by discussing the robustness to the choice of the parameters m and κ for the MPE algorithm. To this end, the single image super-resolution task is chosen as our study. Furthermore, we choose to demonstrate this robustness on the "Plants" image, since it required the largest number of iterations in the MPE recovery process. As seen from Figs. 5(a)-5(c), MPE always converges faster than the regular FP method for different m and κ. Moreover, we also observe that a lower objective value is attained through MPE. Notice that MPE exhibits some oscillations because it is not a monotonically accelerating technique. However, we still see that a lower cost is achieved if additional iterations are given.

In part (d) of Fig. 5, an optimal pair of m and κ is chosen for MPE, RRE and SVD-MPE for the single image super-resolution task with the "Plants" image.⁷ We see that all three methods yield an acceleration and a lower cost, demonstrating the effectiveness of the various variants of VE. Moreover, we see that SVD-MPE converges faster at the beginning, but MPE yields the lowest eventual cost.

⁶The goal of this paper is to investigate the performance of solving RED with VE rather than the restoration results. Therefore, we present the recovered PSNR versus iteration or running time. One can utilize no-reference quality metrics like NFERM [36] and ARISMc [37] to further examine the restoration results.
⁷The optimal m and κ are obtained by searching in the range [0,10] with κ ≥ 2, seeking the fastest convergence for these three methods.
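As a reference for the degradation models described in Sections IV-A and IV-B, the following is a small NumPy/SciPy sketch of how the measurements could be generated (blur plus noise for deblurring; Gaussian blur, 3× downsampling and noise for super-resolution). This is our own illustrative sketch; the exact kernels, boundary handling and downsampling used in [17] may differ.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, uniform_filter

def degrade_deblurring(img, kernel="uniform", sigma_noise=np.sqrt(2), rng=None):
    """Blur with a 9x9 uniform or std-1.6 Gaussian PSF, then add white Gaussian noise."""
    rng = rng or np.random.default_rng(0)
    if kernel == "uniform":
        blurred = uniform_filter(img, size=9)
    else:
        blurred = gaussian_filter(img, sigma=1.6)
    return blurred + sigma_noise * rng.standard_normal(img.shape)

def degrade_super_resolution(img, scale=3, sigma_noise=5.0, rng=None):
    """Blur with a Gaussian of std 1.6 (truncated to ~7x7 support), downsample, add noise."""
    rng = rng or np.random.default_rng(0)
    low_res = gaussian_filter(img, sigma=1.6, truncate=3 / 1.6)[::scale, ::scale]
    return low_res + sigma_noise * rng.standard_normal(low_res.shape)
```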
V. CONCLUSION

The work reported in [17] introduced RED – a flexible framework for using arbitrary image denoising algorithms as priors for general inverse problems. This scheme amounts to iterative algorithms in which the denoiser is called repeatedly. While appealing and quite practical, there is one major weakness to the RED scheme – the complexity of denoising algorithms is typically high, which implies that the use of RED is likely to be costly in run-time. This work aims at deploying RED efficiently, alleviating the above described shortcoming. An acceleration technique based on the Vector Extrapolation (VE) methodology is proposed in this paper. The proposed algorithms are demonstrated to substantially reduce the number of iterations required for the overall recovery process. We also observe that the choice of the parameters in the VE scheme is robust.

(a) SD: 22.56dB (input) → 25.65dB → 26.47dB → 26.97dB → 27.35dB → original.

(b) SD-MPE: 22.56dB (input) → 27.46dB → 28.62dB → 29.29dB → 29.69dB → original.

(c) Nesterov: 22.56dB (input) → 27.44dB → 28.78dB → 29.43dB → 29.79dB → original.

(d) LBFGS: 22.56dB (input) → 27.80dB → 29.52dB → 30.14dB → 30.40dB → original.

(e) FP: 22.56dB (input) → 28.91dB → 29.76dB → 30.13dB → 30.31dB → original.

(f) FP-MPE: 22.56dB (input) → 30.07dB → 30.55dB → 30.60dB → 30.60dB → original.

Fig. 1. Reconstruction results of different algorithms at various iterations for the uniform kernel. From left to right: Blurred input → #20 → #40 → #60 → #80 → Ground truth.

TABLE I
THE NUMBER OF ITERATIONS WITH DIFFERENT IMAGES FOR FP AND FP-MPE IN THE IMAGE DEBLURRING TASK TO ATTAIN THE SAME COST.

Deblurring: Uniform kernel, σ = √2
Image:            Butterfly  Boats  C. Man  House  Parrot  Lena  Barbara  Starfish  Peppers  Leaves
RED: FP-TNRD          200     200    200     200    200    200     200      200       200     200
RED: FP-MPE-TNRD       60      55     55      80     50     50      55       50        55      55

Deblurring: Gaussian kernel, σ = √2
Image:            Butterfly  Boats  C. Man  House  Parrot  Lena  Barbara  Starfish  Peppers  Leaves
RED: FP-TNRD          200     200    200     200    200    200     200      200       200     200
RED: FP-MPE-TNRD       70      65     55      65     55     80      45       55        95      80

TABLE II
THE NUMBER OF ITERATIONS WITH DIFFERENT IMAGES FOR FP AND FP-MPE IN THE IMAGE SUPER-RESOLUTION TASK TO ATTAIN THE SAME COST.

Super-Resolution, scaling = 3, σ = 5
Image:            Butterfly  Flower  Girl  Parth.  Parrot  Raccoon  Bike  Hat  Plants
RED: FP-TNRD          200      200    200    200     200      200    200  200    200
RED: FP-MPE-TNRD       60       65     50     70      55       60     60   50     70

Fig. 2. Image Deblurring - Uniform Kernel, "Starfish" Image. (a) Cost value versus iteration. (b) Cost value versus CPU time. (c) PSNR versus iteration. (d) PSNR versus CPU time.

Fig. 3. Image Deblurring - Gaussian Kernel, "Starfish" Image. (a) Cost value versus iteration. (b) Cost value versus CPU time. (c) PSNR versus iteration. (d) PSNR versus CPU time.

Fig. 4. Image Super-Resolution, "Plants" Image. (a) Cost value versus iteration. (b) Cost value versus CPU time. (c) PSNR versus iteration. (d) PSNR versus CPU time.

Fig. 5. Exploring the robustness to the choice of the parameters in MPE, and the difference among the three VE schemes. All these graphs correspond to the test image "Plants" for the single image super-resolution task. (a) m = 0 and κ = 3, 5, 10. (b) m = 5 and κ = 3, 5, 10. (c) m = 10 and κ = 3, 5, 10. (d) Comparison among MPE, RRE and SVD-MPE.

REFERENCES

[1] A. Buades, B. Coll, and J.-M. Morel, "A non-local algorithm for image denoising," in Computer Vision and Pattern Recognition (CVPR), IEEE Computer Society Conference on, vol. 2, pp. 60–65, 2005.
[2] M. Elad and M. Aharon, "Image denoising via sparse and redundant representations over learned dictionaries," IEEE Transactions on Image Processing, vol. 15, no. 12, pp. 3736–3745, 2006.
[3] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, "Image denoising by sparse 3-D transform-domain collaborative filtering," IEEE Transactions on Image Processing, vol. 16, no. 8, pp. 2080–2095, 2007.
[4] D. Zoran and Y. Weiss, "From learning models of natural image patches to whole image restoration," in Computer Vision (ICCV), 2011 IEEE International Conference on, pp. 479–486, IEEE, 2011.
[5] W. Dong, L. Zhang, G. Shi, and X. Li, "Nonlocally centralized sparse representation for image restoration," IEEE Transactions on Image Processing, vol. 22, no. 4, pp. 1620–1630, 2013.
[6] Y. Chen and T. Pock, "Trainable nonlinear reaction diffusion: A flexible framework for fast and effective image restoration," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1256–1272, 2017.
[7] P. Chatterjee and P. Milanfar, "Is denoising dead?," IEEE Transactions on Image Processing, vol. 19, no. 4, pp. 895–911, 2010.
[8] P. Milanfar, "A tour of modern image filtering: New insights and methods, both practical and theoretical," IEEE Signal Processing Magazine, vol. 30, no. 1, pp. 106–128, 2013.
[9] A. Levin and B. Nadler, "Natural image denoising: Optimality and inherent bounds," in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pp. 2833–2840, IEEE, 2011.
[10] M. Protter, M. Elad, H. Takeda, and P. Milanfar, "Generalizing the nonlocal-means to super-resolution reconstruction," IEEE Transactions on Image Processing, vol. 18, no. 1, pp. 36–51, 2009.
[11] A. Danielyan, V. Katkovnik, and K. Egiazarian, "BM3D frames and variational image deblurring," IEEE Transactions on Image Processing, vol. 21, no. 4, pp. 1715–1728, 2012.
[12] C. A. Metzler, A. Maleki, and R. G. Baraniuk, "Optimal recovery from compressive measurements via denoising-based approximate message passing," in Sampling Theory and Applications (SampTA), 2015 International Conference on, pp. 508–512, IEEE, 2015.
[13] S. V. Venkatakrishnan, C. A. Bouman, and B. Wohlberg, "Plug-and-play priors for model based reconstruction," in Global Conference on Signal and Information Processing (GlobalSIP), 2013 IEEE, pp. 945–948, IEEE, 2013.
[14] S. Sreehari, S. V. Venkatakrishnan, B. Wohlberg, G. T. Buzzard, L. F. Drummy, J. P. Simmons, and C. A. Bouman, "Plug-and-play priors for bright field electron tomography and sparse interpolation," IEEE Transactions on Computational Imaging, vol. 2, no. 4, pp. 408–423, 2016.
[15] S. H. Chan, X. Wang, and O. A. Elgendy, "Plug-and-play ADMM for image restoration: Fixed-point convergence and applications," IEEE Transactions on Computational Imaging, vol. 3, no. 1, pp. 84–98, 2017.
[16] S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, et al., "Distributed optimization and statistical learning via the alternating direction method of multipliers," Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011.
[17] Y. Romano, M. Elad, and P. Milanfar, "The little engine that could: Regularization by denoising (RED)," SIAM Journal on Imaging Sciences, vol. 10, no. 4, pp. 1804–1844, 2017.
[18] S. Cabay and L. Jackson, "A polynomial extrapolation method for finding limits and antilimits of vector sequences," SIAM Journal on Numerical Analysis, vol. 13, no. 5, pp. 734–752, 1976.
[19] N. Rajeevan, K. Rajgopal, and G. Krishna, "Vector-extrapolated fast maximum likelihood estimation algorithms for emission tomography," IEEE Transactions on Medical Imaging, vol. 11, no. 1, pp. 9–20, 1992.
[20] A. Sidi, Vector Extrapolation Methods with Applications, vol. 17. SIAM, 2017.
[21] A. Kirsch, An Introduction to the Mathematical Theory of Inverse Problems, vol. 120. Springer Science & Business Media, 2011.
[22] A. Sidi, "SVD-MPE: An SVD-based vector extrapolation method of polynomial type," arXiv preprint arXiv:1505.00674, 2015.
[23] D. A. Smith, W. F. Ford, and A. Sidi, "Extrapolation methods for vector sequences," SIAM Review, vol. 29, no. 2, pp. 199–233, 1987.
[24] G. Rosman, L. Dascal, A. Sidi, and R. Kimmel, "Efficient Beltrami image filtering via vector extrapolation methods," SIAM Journal on Imaging Sciences, vol. 2, no. 3, pp. 858–878, 2009.
[25] A. Sidi, "Efficient implementation of minimal polynomial and reduced rank extrapolation methods," Journal of Computational and Applied Mathematics, vol. 36, no. 3, pp. 305–337, 1991.
[26] G. H. Golub and C. F. Van Loan, Matrix Computations, vol. 3. JHU Press, 2012.
[27] A. Sidi, "Minimal polynomial and reduced rank extrapolation methods are related," Advances in Computational Mathematics, vol. 43, no. 1, pp. 151–170, 2017.
[28] A. Sidi and Y. Shapira, "Upper bounds for convergence rates of acceleration methods with initial iterations," Numerical Algorithms, vol. 18, no. 2, pp. 113–132, 1998.
[29] A. Sidi, W. F. Ford, and D. A. Smith, "Acceleration of convergence of vector sequences," SIAM Journal on Numerical Analysis, vol. 23, no. 1, pp. 178–196, 1986.
[30] A. Sidi and J. Bridger, "Convergence and stability analyses for some vector extrapolation methods in the presence of defective iteration matrices," Journal of Computational and Applied Mathematics, vol. 22, no. 1, pp. 35–61, 1988.
[31] A. Sidi, "Convergence and stability properties of minimal polynomial and reduced rank extrapolation algorithms," SIAM Journal on Numerical Analysis, vol. 23, no. 1, pp. 197–209, 1986.
[32] S. Skelboe, "Computation of the periodic steady-state response of nonlinear networks by extrapolation methods," IEEE Transactions on Circuits and Systems, vol. 27, no. 3, pp. 161–175, 1980.
[33] K. Jbilou and H. Sadok, "Some results about vector extrapolation methods and related fixed-point iterations," Journal of Computational and Applied Mathematics, vol. 36, no. 3, pp. 385–398, 1991.
[34] Y. Nesterov, Lectures on Convex Optimization. Springer, 2018.
[35] J. Nocedal and S. J. Wright, Numerical Optimization. Springer, 2006.
[36] K. Gu, G. Zhai, X. Yang, and W. Zhang, "Using free energy principle for blind image quality assessment," IEEE Transactions on Multimedia, vol. 17, no. 1, pp. 50–63, 2015.
[37] K. Gu, G. Zhai, W. Lin, X. Yang, and W. Zhang, "No-reference image sharpness assessment in autoregressive parameter space," IEEE Transactions on Image Processing, vol. 24, no. 10, pp. 3218–3231, 2015.