Linear and Nonlinear Heuristic Regularisation for Ill-Posed Problems

Doctoral Thesis to obtain the academic degree of Doktor der technischen Wissenschaften in the Doctoral Program Technische Wissenschaften

Submitted by Kemal Raik, MA MSc.

Submitted at Industrial Mathematics Institute

Supervisor and First Examiner: Priv.-Doz. DI Dr Stefan Kindermann

Second Examiner: Univ.-Prof. Dr Bernd Hofmann

July 2020

JOHANNES KEPLER UNIVERSITY LINZ
Altenbergerstraße 69, 4040 Linz, Österreich
www.jku.at
DVR 0093696

Abstract

In this thesis, we cover the so-called heuristic (a.k.a. error-free or data-driven) parameter choice rules for the regularisation of ill-posed problems (which just so happen to be prominent in the treatment of inverse problems). We consider the linear theory associated with both continuous regularisation methods, such as that of Tikhonov, and also iterative procedures, such as Landweber's method. We provide background material associated with each of the aforementioned regularisation methods as well as the standard results found in the literature. In particular, the convergence theory for heuristic rules is typically based on a noise-restricted analysis. We also introduce some more recent developments in the linear theory for certain instances: in the case of operator perturbations or weakly bounded noise for linear Tikhonov regularisation. In both of the aforementioned cases, novel parameter choice rules were derived; in the case of weakly bounded noise, out of necessity, and in the case of operator perturbations, an entirely new class of parameter choice rules is discussed (so-called semi-heuristic rules, which could be said to be the "middle ground" between heuristic rules and a-posteriori rules). We then delve further into the abyss of the relatively unknown; namely, the nonlinear theory (by which we mean that the regularisation is nonlinear), for which the development and analysis of heuristic rules are still in their infancy. Most notably in this thesis, we present a recent study of the convergence theory for heuristic Tikhonov regularisation with convex penalty terms which attempts to generalise, to some extent, the restricted noise analysis of the linear theory. As the error in this setting is measured in terms of the Bregman distance, this naturally lends itself to the introduction of some novel parameter choice rules. Finally, we illustrate and supplement most of the preceding by including a numerics section which displays the effectiveness of heuristic parameter choice rules, and we conclude with a discussion of the results as well as a speculation on the potential future scope of research in this exciting area of applied mathematics.

Zusammenfassung

In dieser Dissertation behandeln wir sogenannte heuristische Parameterwahlregeln (auch noisefreie oder datengesteuerte Parameterwahlregeln genannt) für die Regularisierung schlecht gestellter Probleme (welche bei der Behandlung von inversen Problemen eine herausragende Rolle spielen). Wir behandeln zunächst lineare inverse Probleme sowohl in Kombination mit kontinuierlichen Regularisierungsmethoden, wie zum Beispiel der Tichonov Regularisierung, als auch mit iterativen, wie etwa dem Landweber Verfahren. Wir liefern Hintergrundmaterial zu jeder dieser Regularisierungsmethoden als auch die dazugehörigen Standardresultate aus der Literatur, wie etwa die Konvergenztheorie für heuristische Parameterwahlregeln, die typischerweise auf einer Analysis mit Einschränkungen an den Datenfehler basieren. Außerdem stellen wir einige neuere Entwicklungen in der linearen Theorie für bestimmte Spezialfälle vor: im Fall von Operatorstörungen oder schwach beschränkten Datenfehlern für lineare Tichonovregularisierung. In beiden Fällen werden neuartige Parameterwahlen vorgestellt; im Fall von schwach beschränkten Datenfehlern aus Notwendigkeit, und im Fall von Operatorstörungen wird eine völlig neue Klasse von Parameterwahlregeln diskutiert (sogenannte semi-heuristische Regeln, die in gewisser Weise ein "Mittelweg" zwischen heuristischen und a-posteriori-Regeln sind). Anschließend tauchen wir weiter in den Abgrund des relativ Unbekannten ein, nämlich in die nichtlineare Theorie (d.h., wenn die Regularisierungsmethode nichtlinear ist), für die die Entwicklung und Analyse heuristischer Regeln noch im Kindheitsstadium sind. Bemerkenswert in dieser Arbeit ist, dass wir eine aktuelle Konvergenztheorie für heuristische Tikhonov-Regularisierung mit konvexem Strafterm entwickeln, die versucht, die Konvergenzanalyse mit Datenfehlerbeschränkungen der linearen Theorie bis zu einem gewissen Grad zu verallgemeinern. Da der Fehler bei diesen Methoden üblicherweise in der Bregman-Distanz gemessen wird, bieten sich dementsprechend einige neuartige Regeln für die Parameterwahl an. Schließlich veranschaulichen und ergänzen wir die meisten der vorhergehenden Resultate in einem Abschnitt mit numerischen Experimenten, der die Wirksamkeit heuristischer Parameterwahlen illustriert, und wir schließen mit einer Diskussion der Ergebnisse sowie Spekulationen über mögliche zukünftige Forschungsinhalte in diesem spannenden Anwendungsbereich der Mathematik.

Acknowledgements

First and foremost, I would like to acknowledge and thank my supervisor, Dr Stefan Kindermann, who provided significant guidance over the course of my doctoral studies. The research topic on which this thesis is based came from his proposal, which was granted funding by the Austrian Science Fund (FWF), to whom I also extend my thanks. Moreover, much of the content of this thesis is based on research which was jointly conducted by myself and my supervisor. I would also like to thank Professor Bernd Hofmann for agreeing to be the second examiner for this thesis and also for being the original organiser of the Chemnitz Symposium on Inverse Problems, which I have twice had the pleasure of participating in. I also owe a great deal of thanks to my family in London for their continued support whilst I have been in Linz, particularly my mother, who has visited me here on a great many occasions. Given the natural beauty of Linz, she did not need much convincing, however. Finally, I would like to thank my friends and colleagues; most notably, and in no particular order, Dr Simon Hubmer, Fabian Hinterer, Onkar Sandip Jadhav, Alexander Ploier and Dr Günter Auzinger for their friendship and discussions, both academic and otherwise.

Kemal Raik Linz, July 2020

Contents

1 Introduction
1.1 Examples
1.2 Preliminaries
1.3 Regularisation Methods
1.3.1 Continuous Methods
1.3.2 Iterative Methods
1.3.3 Parameter Choice Rules
1.4 Heuristic Parameter Choice Rules

I Theory

2 Linear Tikhonov Regularisation
2.1 Classical Theory
2.1.1 Heuristic Parameter Choice Rules
2.2 Weakly Bounded Noise
2.2.1 Modified Parameter Choice Rules
2.2.2 Predictive Mean-Square Error
2.2.3 Generalised Cross-Validation
2.3 Operator Perturbations
2.3.1 Semi-Heuristic Parameter Choice Rules

3 Convex Tikhonov Regularisation
3.1 Classical Theory
3.2 Parameter Choice Rules
3.2.1 Convergence Analysis
3.2.2 Convergence Rates (for the Heuristic Discrepancy Rule)
3.3 Diagonal Operator Case Study
3.3.1 Muckenhoupt Conditions

4 Iterative Regularisation
4.1 Landweber Iteration for Linear Operators
4.1.1 Heuristic Stopping Rules
4.2 Landweber Iteration for Nonlinear Operators
4.2.1 Heuristic Parameter Choice Rules

II Numerics

5 Semi-Heuristic Rules
5.1 Gaußian Operator Noise Perturbation
5.1.1 Tomography Operator Perturbed by Gaußian Operator
5.2 Smooth Operator Perturbation
5.2.1 Fredholm Integral Operator Perturbed by Heat Operator
5.2.2 Blur Operator Perturbed by Tomography Operator
5.3 Summary

6 Heuristic Rules for Convex Regularisation
6.1 $\ell^1$ Regularisation
6.2 $\ell^{3/2}$ Regularisation
6.3 $\ell^3$ Regularisation
6.4 TV Regularisation
6.5 Summary

7 The Simple L-curve Rules for Linear and Convex Tikhonov Regularisation
7.1 Linear Tikhonov Regularisation
7.1.1 Diagonal Operator
7.1.2 Examples from IR Tools
7.2 Convex Tikhonov Regularisation
7.2.1 $\ell^1$ Regularisation
7.3 $\ell^{3/2}$ Regularisation
7.4 TV Regularisation
7.5 Summary

8 Heuristic Rules for Nonlinear Landweber Iteration
8.1 Test Problems
8.1.1 Nonlinear Hammerstein Operator
8.1.2 Auto-Convolution
8.1.3 Summary

III Future Scope

9 Future Scope
9.1 Convex Heuristic Regularisation
9.2 Heuristic Blind Kernel Deconvolution
9.2.1 Deconvolution
9.2.2 Semi-Blind Deconvolution
9.3 Meta-Heuristics

A Functional Calculus
B Convex Analysis

Chapter 1

Introduction

Typically in the "real world", we have problems in which we would like to extract information from given data, e.g., acoustic sound waves and X-ray sinograms, among other examples. In particular, an acoustic sound wave recorded on the surface of the Earth contains information regarding the subsurface, and X-rays contain information on the density of the material which they pass through. In order to recover this information, one must, in effect, reverse the aforementioned processes, i.e., solve the inverse problem. In the theory of inverse problems, this is usually mathematically formalised in operator theoretic terms. That is, we generally consider an equation of the form
$$Ax = y, \qquad (1.1)$$
in which $A : X \to Y$ is a continuous operator mapping between two vector spaces, called the "forward operator". The objective is then to invert the forward operator and to thus recover the solution $x$ from measured data $y$. Generally speaking, the data we measure is considered corrupted to reflect, for instance, real-world machine error, and what we consider in fact is a perturbation of the data, $y^\delta = y + e$ (where $e$ may be very small), which we call noisy data; naturally, $y$ is then called exact data. One should mention that the noise model, i.e., $e$, may be deterministic or stochastic, although in this thesis, we will limit ourselves to the deterministic framework for ill-posed problems. On the topic of stochastic ill-posed problems, though, we opt to refer the reader to [15, 23]. Note that significant parts of this thesis are derived from the papers [51, 89–91], of which the author of this thesis was a coauthor.

1.1 Examples

Examples of inverse problems may be found in theory as well as a wide variety of applications, ranging from differentiation as a theoretically grounded example to tomography as an example found in application.

Differentiation Indeed, differentiation and integration may be seen as opposites of one another and therefore we may define one as the direct (or forward) and the other as the inverse problem, respectively. In this way, we see that the definition of the inverse problem is rather arbitrary, as either may qualify. However, it is the norm to define differentiation as the inverse problem, and the reason for this is that, unlike integration, the differentiation problem is ill-posed; a concept which we will illustrate by example now and define somewhat more rigorously in the following section. We include the following example from [35]: for any $f \in C^1[0,1]$, consider the perturbed function
$$f_n^\delta(x) := f(x) + \delta\sin\frac{nx}{\delta},$$
with $\delta \in (0,1)$ and $n \in \{2,3,\dots\}$ arbitrary. In the language of inverse problems, the first term would be the "exact data" and the second term would be the "noise", and subsequently their sum is referred to as the "data with noise" or, in even more colloquial terms, the "noisy data". Now, differentiating the function above yields
$$(f_n^\delta)'(x) = f'(x) + n\cos\frac{nx}{\delta}.$$
Note that
$$\|f - f_n^\delta\|_\infty = \delta, \qquad (1.2)$$
whereas
$$\|f' - (f_n^\delta)'\|_\infty = n. \qquad (1.3)$$
Or, to put it into words, arbitrarily small data errors (1.2) (e.g., $\delta < 1$) may lead to arbitrarily large solution errors (1.3) (e.g., $n \to \infty$). That is, there is a lack of continuous dependence between the data and the solution, which makes this problem, i.e., differentiation, ill-posed. This consequently leads one to approximate the ill-posed problem by a well-posed problem, i.e., to regularise (another term which we shall define later).
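The following minimal numerical sketch (with the hypothetical test function $f(x) = \sin(x)$ and arbitrarily chosen values of $\delta$ and $n$, neither of which appear in the text above) illustrates the amplification described by (1.2)-(1.3): a perturbation of size $\delta$ in the data produces an error of roughly size $n$ in the derivative.

```python
import numpy as np

# Sketch of the ill-posedness of differentiation: the data error stays ~delta,
# cf. (1.2), while the error in the derivative grows like n, cf. (1.3).
x = np.linspace(0.0, 1.0, 20001)
f = np.sin(x)                                   # exact data (assumed test function)
delta, n = 0.05, 100
f_delta = f + delta * np.sin(n * x / delta)     # noisy data f_n^delta

df = np.gradient(f, x)                          # numerical derivative of exact data
df_delta = np.gradient(f_delta, x)              # numerical derivative of noisy data

print("data error     :", np.max(np.abs(f_delta - f)))    # approximately delta
print("solution error :", np.max(np.abs(df_delta - df)))  # approximately n
```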

Tomography In computerised tomography (CT), which is particularly prevalent in the medical field for a variety of applications, one seeks to reconstruct the density of a medium from X-ray measurements. This falls under the umbrella of non-destructive testing, in which one would like to understand the properties within a medium without causing any physical damage, hence the term non-destructive. In computerised tomography, the subject of interest is usually the human body or, more precisely, a body part. In particular, if we restrict ourselves to a two-dimensional domain and let $\Omega \subset \mathbb{R}^2$ represent the (compact) cross-section of a human body, then the "aim of the game" is to recover the density, which we denote by a two-dimensional function $f : \Omega \to \mathbb{R}$, from X-ray measurements in the plane where $\Omega$ lies.

In particular, the X-rays travel in straight lines which are parametrised by their normal vector $\theta \in \mathbb{R}^2$ ($\|\theta\| = 1$) and their distance $s > 0$ from the origin (cf. [35]). The forward operator which maps this is known as the Radon transform, and we can represent it by the following integral expression
$$(Rf)(s,\theta) := \int_{\mathbb{R}} f(s\theta + t\theta^\perp)\, dt. \qquad (1.4)$$
For the derivation of (1.4), the reader is referred to [118]. Note that the Radon transform was named after the Austrian mathematician Johann Radon who, in fact, derived it on entirely theoretical grounds (cf. [130]). It is quite obvious then that $R$ is the operator which models the forward problem. The inverse problem is therefore to invert the Radon transform and recover the density distribution $f$. This problem has been treated quite extensively and we refer the reader to [118], [102, Chapter 6] and the references therein. In fact, in $\mathbb{R}^2$, an explicit inversion formula for (1.4) already exists thanks to Johann Radon and the aforementioned reference [130]. The point, and its relevance to the theory of ill-posed problems, is that the formula involves taking the derivative of the data and, as already shown in the previous example, differentiation is an ill-posed problem! Thus, by consequence, inversion of the Radon transform is also ill-posed.
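As a purely illustrative sketch (not part of the theory above), the following snippet simulates the forward Radon transform of a standard phantom and a naive inversion of noisy sinogram data; it assumes the scikit-image library is available, and the noise level is an arbitrary choice. Even modest sinogram noise visibly degrades the reconstruction, in line with the ill-posedness just discussed.

```python
import numpy as np
from skimage.data import shepp_logan_phantom
from skimage.transform import radon, iradon, rescale

# Forward problem: sinogram y = Rf of a phantom density f.
f = rescale(shepp_logan_phantom(), 0.25)            # small test image
theta = np.linspace(0.0, 180.0, 180, endpoint=False)
sinogram = radon(f, theta=theta)

# Inverse problem: reconstruct f from a noisy sinogram y^delta.
delta = 0.01 * np.max(sinogram)
sinogram_delta = sinogram + delta * np.random.randn(*sinogram.shape)
f_rec = iradon(sinogram_delta, theta=theta)         # filtered back-projection

print("relative reconstruction error:",
      np.linalg.norm(f_rec - f) / np.linalg.norm(f))
```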

Backwards Heat Equation This example was also taken from [35], and we refer the reader to the aforementioned reference for further details as we only give a relatively brief description of the problem. The “forward” heat equation is a well known one, and is also referred to as the diffusion equation as it mathematically models the diffusion of heat in a body or medium. The one-dimensional heat equation is usually written in the following way:

 ∂ ∂2  u(x, t) − u(x, t) = 0,  ∂t ∂x (1.5) u(x, 0) = u0, in Ω,   u = 0, on ∂Ω × [0,T ], with an initial and Dirichlet boundary condition, where Ω ⊂ R is the domain of the body/medium with a constant temperature equal to 0 on its boundary. The forward operator which describes the forward problem would be the one which maps A : u(·, 0) 7→ f, where

f(x) = u(x, T ), (x ∈ Ω) where T > 0 is the (final) time of measurement. This can be solved via, e.g., Fourier analysis. However, our concern is the inverse problem which would be to determine the initial temperature distri- bution u(x, 0) from data derived from measurements of the final temperature.

Note, however, that there is no solution for this inverse problem unless, that is, $f$ is assumed to be analytic (cf. [25, 35]). Restricting ourselves to $f$ for which a solution exists still does not remedy all our problems, as this unique (cf. [35]) solution would still not depend continuously on the data. Therefore, two out of three requirements for a "well-posed" problem would be violated (namely, existence, uniqueness and continuous dependence of the solution on the data; see below). In order to see this, we follow the example of [35] by writing (1.5) as
$$\begin{cases} \Delta\phi_k + \lambda_k\phi_k = 0, & \text{in } \Omega, \\ \phi_k = 0, & \text{on } \partial\Omega, \end{cases} \qquad (1.6)$$
where $\{\lambda_k\}_k$ and $\{\phi_k\}_k$ represent the eigenvalues and eigenfunctions of the Dirichlet problem (1.6) on $\Omega$, respectively, with $\phi_k \in L^2(\Omega)$ normalised such that $\|\phi_k\|_{L^2(\Omega)} = 1$ for all $k \in \mathbb{N}$. Letting
$$u_k(x,t) := \frac{1}{\lambda_k}\phi_k(x)\exp(\lambda_k(T-t)),$$
and plugging into (1.5), we see that
$$(\Delta u_k)(x,t) = \frac{1}{\lambda_k}(\Delta\phi_k)(x)\exp(\lambda_k(T-t)) = -\phi_k(x)\exp(\lambda_k(T-t)) = \frac{\partial}{\partial t}u_k(x,t),$$
which confirms that $u_k$ satisfies (1.5), with $f_k = \phi_k/\lambda_k$. Now, since $\lambda_k \to \infty$, we get that $\|f_k\|_{L^2} \to 0$, whereas
$$\|u_k(\cdot,0)\|_{L^2} = \frac{\exp(\lambda_k T)}{\lambda_k} \to \infty,$$
as $k \to \infty$. Therefore, considering $f_k$ as perturbations of $f = 0$ with (data) error $1/\lambda_k$ (measured in the $L^2$ norm), the (solution) error of the inverse problem is amplified exponentially by the factor $\exp(\lambda_k T)$. Thus, in the "quantification" of ill-posed problems, namely the so-called degree of ill-posedness (again, see below for details), the backwards heat equation is said to be severely (aka exponentially) ill-posed.
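A small numerical illustration of this exponential amplification is given below, under the simplifying assumptions $\Omega = (0,\pi)$, $\lambda_k = k^2$ and $\phi_k(x) = \sin(kx)$; these concrete eigenpairs and the chosen $T$ and $\delta$ are assumptions for the sketch only.

```python
import numpy as np

# The forward heat map damps the k-th coefficient by exp(-lambda_k*T); naive
# inversion therefore amplifies data errors by exp(+lambda_k*T).
T = 0.1
k = np.arange(1, 21)
lam = k.astype(float) ** 2

c0 = 1.0 / k**2                      # coefficients of the initial state u(., 0)
cT = c0 * np.exp(-lam * T)           # coefficients of the data f = u(., T)

delta = 1e-6
cT_delta = cT + delta                # perturb every data coefficient by delta
c0_rec = cT_delta * np.exp(lam * T)  # naive backwards-in-time solution

print("data error     :", np.linalg.norm(cT_delta - cT))
print("solution error :", np.linalg.norm(c0_rec - c0))   # blows up ~ exp(lam_max*T)
```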

1.2 Preliminaries

Assume henceforth, until stated otherwise, that $A \in \mathcal{L}(X,Y)$ is a continuous linear operator between two Hilbert spaces. In case $A^{-1}$ does not exist, one seeks to construct a generalised inverse $A^\dagger$ which recovers the best approximate solution, denoted by $x^\dagger := A^\dagger y$ (cf. [116, 117]).

Definition 1. The Moore-Penrose generalised inverse is defined as the unique linear extension of the operator $\tilde{A}^{-1} : \operatorname{range} A \to (\ker A)^\perp$ to the new domain $\operatorname{dom} A^\dagger := \operatorname{range} A \oplus (\operatorname{range} A)^\perp$, with $(\operatorname{range} A)^\perp =: \ker A^\dagger$, where $\tilde{A}$ is the restriction of $A$ to the orthogonal complement of its kernel.

Figure 1.1: The Moore-Penrose generalised inverse, as we see in this illustration, maps the direct sum of the range of $A$ and its complement to the complement of the kernel of $A$.

The Moore-Penrose generalised inverse A† : range A ⊕ range A⊥ → ker A⊥ thus allows a simple way to compute the best approximate solution which is usually expressed in terms of the least-squares solution of (1.1) [35]:

Definition 2. A vector $x \in X$ is called a least-squares solution of (1.1) if
$$\|Ax - y\| = \inf_{z\in X}\|Az - y\|.$$
The best approximate solution of (1.1) may be defined as $x \in X$ satisfying
$$\|x\| = \inf\{\|z\| \mid z \text{ is a least-squares solution of } Ax = y\}.$$

In particular, the following theorem is key in understanding the relationship between the least-squares solution of (1.1) and the generalised inverse [35]:

Theorem 1. Let $y \in \operatorname{dom} A^\dagger$. Then $x \in X$ is a least-squares solution of $Ax = y$ if and only if the Gaußian normal equation
$$A^*Ax = A^*y$$
holds.
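As a quick finite-dimensional sanity check of Theorem 1 (with a small, hypothetical matrix chosen only for illustration), one can verify numerically that the best approximate solution $A^\dagger y$ satisfies the normal equation:

```python
import numpy as np

# The best approximate solution A^+ y satisfies A^T A x = A^T y (Theorem 1).
A = np.array([[1.0, 0.0],
              [0.0, 1e-3],
              [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])

x_dagger = np.linalg.pinv(A) @ y                    # best approximate solution
x_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)     # least-squares solution

print(np.allclose(A.T @ A @ x_dagger, A.T @ y))     # True: normal equation holds
print(np.allclose(x_dagger, x_lstsq))               # True: both coincide here
```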

Courtesy of Theorem 1, for $A^*A$ continuously invertible, one could "naively" compute the best approximate solution as
$$x^\dagger = A^\dagger y = (A^*A)^{-1}A^*y = \int_0^\infty \frac{1}{\lambda}\, dE_\lambda A^*y, \qquad (1.7)$$
for $y \in \operatorname{dom} A^\dagger$, where $\{E_\lambda\}$ refers to the spectral family of the self-adjoint operator $A^*A$, and we refer to Appendix A for further details and explanation. However, for ill-posed problems (to be defined), it turns out that (1.7) is not well defined as, in particular, $0$ could be in the spectrum of $A^*A$, which is precisely where the integrand in (1.7) has a pole [139]. On the other hand, what can be deduced from Theorem 1 is that $A^\dagger = (A^*A)^\dagger A^*$ [35]. Another question that begs to be asked is whether $A^\dagger y^\delta$ would be a good approximation of $A^\dagger y$. In other words, if $\|y^\delta - y\|$ is small, does that imply that $\|A^\dagger y^\delta - A^\dagger y\|$ will remain small? The answer, it turns out, depends on whether (1.1) is well-posed or not. There are essentially two well-known definitions of well-posedness, which are attributed to Hadamard and Nashed, respectively.

Definition 3. A problem (1.1) is said to be well-posed according to Hadamard [50] if all three of the following criteria are satisfied:

1. For all admissible data, a solution exists (i.e., range A = Y );

2. The solution is unique (i.e., ker A = {0});

3. The solution depends continuously on the data (i.e., A−1 ∈ L(Y,X)).

The working definition we opt to proceed with, however, is that of Nashed [116,117]:

Definition 4. The problem (1.1) is said to be well-posed according to Nashed if $\operatorname{range} A = \overline{\operatorname{range} A}$, i.e., if the range of $A$ is closed.

If any one of the criteria of Hadamard, or the criterion of Nashed, fails to be satisfied, then (1.1) is said to be ill-posed. The latter definition may be linked with the following theorem [139]:

Theorem 2 (Open Mapping Theorem). $A^\dagger$ is continuous if and only if $\operatorname{range} A = \overline{\operatorname{range} A}$.

Whilst we do not present the proof here and instead refer the reader to [35], we illustrate it via a rather abstract and general example. For instance, if $K$ is a compact operator (note that we usually opt to write $K$ in place of $A$ whenever referring to a compact operator), then its range is closed if and only if it is finite dimensional (cf. [33]). Thus, in case $\dim\operatorname{range} K = \infty$, (1.1) would automatically be ill-posed. That is, $0$ would be an accumulation point of the spectrum, so that (1.7) would no longer be well defined. In particular, for a compact operator, with $y_n^\delta := y + \delta u_n$, one has $\|y_n^\delta - y\| = \delta$, but
$$\|K^\dagger y_n^\delta - K^\dagger y\| = \frac{\delta}{\sigma_n} \to \infty,$$
as $n \to \infty$. In particular, for a compact operator $K$, we can write
$$K^\dagger y = \sum_{i=1}^\infty \frac{1}{\sigma_i}\langle y, u_i\rangle v_i, \qquad (1.8)$$
where $\{\sigma_i; v_i, u_i\}$ is its singular system (cf. Appendix A), whenever
$$y \in \operatorname{dom} K^\dagger \iff \sum_{i=1}^\infty \frac{|\langle y, u_i\rangle|^2}{\sigma_i^2} < \infty. \qquad (1.9)$$
Note that (1.9) is commonly referred to as the Picard condition (cf. [35]). Whilst it is only applicable in case the model operator is compact, therefore yielding a singular value decomposition, it provides insight as to when the data $y$ is in the domain of the pseudo-inverse (i.e., is attainable). In particular, it says that the Fourier coefficients, i.e., $\langle y, u_n\rangle$, with respect to the singular functions $u_n$, should decay faster than the singular values $\sigma_n$. We may also accordingly quantify degrees of ill-posedness (cf. [67]):

Definition 5. Let $K : X \to Y$ be a compact linear operator between two Hilbert spaces. Then the equation $Kx = y$ with singular value decomposition (A.1) is said to be mildly ill-posed if $\sigma_i = \mathcal{O}(i^{-\beta})$ for some $\beta > 0$ and severely ill-posed if $\sigma_i = \mathcal{O}(e^{-i})$.

The faster the singular values decay, the more difficult the inverse problem is to solve. Thus, a severely ill-posed problem tends to be more problematic than the mildly ill-posed case. The "irony" is that for heuristic rules (which will be defined later) to "work", the problem should in fact be sufficiently ill-posed (but not "too ill-posed"). This will be explained in further detail in the appropriate chapter. Note that in the example of differentiation, which we mentioned as a first example, the order of ill-posedness is given by $\sigma_i = \mathcal{O}(i^{-1})$ (cf. [35]). There is further literature on extensions of the degree of ill-posedness, cf. [69], as well as extensions for nonlinear equations, cf. [70, 71].
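The following sketch discretises a hypothetical smoothing (first-kind Fredholm) operator, inspects the decay of its singular values, and shows how the naive pseudo-inverse (1.8) amplifies even a tiny data perturbation; the kernel, grid and noise level are arbitrary choices made purely for illustration.

```python
import numpy as np

# Discretise a smoothing kernel, look at the singular value decay, and observe
# the blow-up of the naive pseudo-inverse solution under a tiny perturbation.
n = 200
t = np.linspace(0, 1, n)
K = np.exp(-(t[:, None] - t[None, :])**2 / 0.01) / n   # Gaussian kernel matrix

U, s, Vt = np.linalg.svd(K)
print("sigma_1 / sigma_n:", s[0] / s[-1])              # enormous: ill-conditioned

x_true = np.sin(np.pi * t)
y_delta = K @ x_true + 1e-6 * np.random.randn(n)       # tiny noise on the data

x_naive = np.linalg.pinv(K) @ y_delta                  # naive inversion, cf. (1.8)
print("relative error of naive inversion:",
      np.linalg.norm(x_naive - x_true) / np.linalg.norm(x_true))
```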

1.3 Regularisation Methods

We have seen that the pseudo-inverse does not always yield an acceptable solution. For this reason, we would like to find a different way to compute

an acceptable (approximate) solution. As the topic of this thesis is regularisation (cf. [35, 49, 115, 145–149]), this is the method which we will explain below. In particular, it is possible to divide regularisation methods into continuous and iterative methods (although both are related, as will become apparent):

1.3.1 Continuous Methods

In case $A^\dagger$ is unbounded, we seek to approximate it by a parametric family of continuous operators $\{R_\alpha\}_{\alpha>0}$, with $R_\alpha : Y \to X$, such that $R_\alpha \to A^\dagger$ pointwise as $\alpha \to 0$ [35]. In particular, if we consider the integral of the form
$$R_\alpha y = g_\alpha(A^*A)A^*y = \int_0^\infty g_\alpha(\lambda)\, dE_\lambda A^*y, \qquad (1.10)$$
then we see that the "aim of the game" is to choose a filter function $g_\alpha$ such that
$$g_\alpha(\lambda) \to \frac{1}{\lambda}, \qquad (\lambda > 0),$$
as $\alpha \to 0$, as it is apparent that this would then imply convergence of the regularisation operator to the generalised inverse (see (1.7)). In case the forward operator is compact, one may also recall that we can express (1.10) as
$$R_\alpha y = \sum_{i=1}^\infty g_\alpha(\sigma_i)\langle y, u_i\rangle v_i,$$
in which we would like $g_\alpha(\sigma_i) \to 1/\sigma_i$ as $\alpha \to 0$ (see (1.8)). We also introduce the residual filter function as
$$r_\alpha(\lambda) := 1 - \lambda g_\alpha(\lambda), \qquad (1.11)$$
which may be derived, for instance, by considering the error. In particular, the residual function is derived in [35] as follows:
$$x^\dagger - R_\alpha y = x^\dagger - g_\alpha(A^*A)A^*y = (I - g_\alpha(A^*A)A^*A)x^\dagger = \int_0^\infty (1 - \lambda g_\alpha(\lambda))\, dE_\lambda x^\dagger = r_\alpha(A^*A)x^\dagger.$$
The following convergence proof is also courtesy of [35]:

Theorem 3. If $g_\alpha$ is piecewise continuous and there exists a positive constant $C$ such that
$$|\lambda g_\alpha(\lambda)| \le C, \qquad (1.12)$$
and
$$\lim_{\alpha\to 0} g_\alpha(\lambda) = \frac{1}{\lambda}, \qquad (1.13)$$
for all $\lambda \in (0, \|A\|^2)$, then
$$R_\alpha y = g_\alpha(A^*A)A^*y \to x^\dagger,$$
as $\alpha \to 0$, for all $y \in \operatorname{dom} A^\dagger$. Note that if $y \notin \operatorname{dom} A^\dagger$, then
$$\|R_\alpha y\| = \|g_\alpha(A^*A)A^*y\| \to \infty,$$
as $\alpha \to 0$.

Proof. The proof essentially boils down to exploiting Lebesgue's Dominated Convergence Theorem [36]. We may write
$$\|R_\alpha y - x^\dagger\|^2 = \int_0^{\|A\|^2+} r_\alpha(\lambda)^2\, d\|E_\lambda x^\dagger\|^2,$$
and due to (1.12), it follows that
$$|r_\alpha(\lambda)| = |1 - \lambda g_\alpha(\lambda)| \le 1 + C,$$
for all $\lambda \in (0, \|A\|^2)$. Therefore,
$$\lim_{\alpha\to 0}\int_0^{\|A\|^2+} r_\alpha(\lambda)^2\, d\|E_\lambda x^\dagger\|^2 = \int_0^{\|A\|^2+} \lim_{\alpha\to 0} r_\alpha(\lambda)^2\, d\|E_\lambda x^\dagger\|^2, \qquad (1.14)$$
is a consequence of the boundedness of the integrand, thereby allowing one to utilise the aforementioned dominated convergence theorem. Now, due to (1.13), it follows that
$$\lim_{\alpha\to 0} r_\alpha(\lambda) = 1 - \lim_{\alpha\to 0}\lambda g_\alpha(\lambda) = 1 - 1 = 0,$$
for all $\lambda > 0$, and since $r_\alpha(0) = 1$, we also have that
$$\lim_{\alpha\to 0} r_\alpha(0) = 1.$$
Hence, without going into great detail, it follows that the integral in (1.14) is equal to the "jump" of $\lambda \mapsto \|E_\lambda x^\dagger\|^2$ at $\lambda = 0$, i.e., it is equal to $\lim_{\lambda\to 0^+}\|E_\lambda x^\dagger\|^2 - \|E_0 x^\dagger\|^2 = \|Px^\dagger\|^2$, where $P : X \to \ker A$ is the orthogonal projection. For further detail, we refer to [35]. In essence, since $x^\dagger \in (\ker A)^\perp$, one has $Px^\dagger = 0$, which yields the desired result that
$$\lim_{\alpha\to 0}\|R_\alpha y - x^\dagger\|^2 = 0,$$
thus completing the convergence proof. For the divergence result, we assume, for the sake of trying to prove a contradiction, that there exists a sequence $\{\alpha_n\}$ with $\alpha_n \to 0$ such that $\{\|R_{\alpha_n}y\|\}$ is uniformly bounded, so that there exists a subsequence $\{R_{\alpha_{n_k}}y\}$ such that $R_{\alpha_{n_k}}y \rightharpoonup x \in X$. Now, due to the weak sequential continuity of $A$, it follows that $AR_{\alpha_{n_k}}y \rightharpoonup Ax$. On the other hand, $AR_{\alpha_{n_k}}y \to Qy$, which implies $Ax = Qy$, where $Q : Y \to \overline{\operatorname{range} A}$ is the orthogonal projection, i.e., we must have $y \in \operatorname{dom} A^\dagger$; that is, no such bounded sequence can exist for $y \notin \operatorname{dom} A^\dagger$, and this completes the proof.

Next we provide some typical examples for the choice of the regularisation operator $R_\alpha$:

Example 1. Sticking with the general form (1.10), Tikhonov regularisation (cf. [145, 146]), sometimes also called Tikhonov-Phillips regularisation [128], amounts to choosing
$$g_\alpha(\lambda) = \frac{1}{\lambda + \alpha},$$
(i.e., taking $R_\alpha = (A^*A + \alpha I)^{-1}A^*$, which is well defined as it is well known that the operator $A^*A + \alpha I$ is invertible for $\alpha > 0$). It is trivial to see that this tends to the desired integrand as $\alpha$ vanishes (and, equivalently, that $(A^*A + \alpha I)^{-1}A^* \to (A^*A)^\dagger A^*$ as $\alpha \to 0$).
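In matrix form, Tikhonov regularisation can be applied directly through the SVD, replacing the unstable coefficients $1/\sigma_i$ of the pseudo-inverse by the damped factors $\sigma_i/(\sigma_i^2+\alpha)$. The sketch below (with a hypothetical smoothing kernel, exact solution and noise level, chosen only for illustration) also shows the typical behaviour of the error as a function of $\alpha$:

```python
import numpy as np

# Tikhonov regularisation as spectral filtering through the SVD of K.
def tikhonov_svd(K, y, alpha):
    U, s, Vt = np.linalg.svd(K, full_matrices=False)
    # damped filter factors sigma_i / (sigma_i^2 + alpha) instead of 1/sigma_i
    return Vt.T @ ((s / (s**2 + alpha)) * (U.T @ y))

n = 100
t = np.linspace(0, 1, n)
K = np.exp(-np.abs(t[:, None] - t[None, :]) / 0.05) / n   # smoothing kernel
x_true = t * (1 - t)
y_delta = K @ x_true + 1e-4 * np.random.randn(n)

for alpha in [1e-1, 1e-3, 1e-6, 1e-9, 1e-12]:
    err = np.linalg.norm(tikhonov_svd(K, y_delta, alpha) - x_true)
    print(f"alpha = {alpha:.0e}   error = {err:.3e}")      # error is V-shaped in alpha
```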

Example 2. An alternative is the spectral cutoff method (cf. [35, 102]), which consists of choosing
$$g_\alpha(\lambda) = \chi_{(\alpha,\infty)}(\lambda)\,\frac{1}{\lambda},$$
i.e.,
$$R_\alpha y = \int_\alpha^\infty \frac{1}{\lambda}\, dE_\lambda A^*y,$$
where $\chi_{(\alpha,\infty)}$ is the characteristic function on the interval specified by the subscript; that is,
$$\chi_{(\alpha,\infty)}(\lambda) := \begin{cases} 1, & \text{if } \lambda \in (\alpha,\infty), \\ 0, & \text{otherwise.} \end{cases}$$
The intuition of this regularisation method is that we cut off the problematic eigenvalues congregating near zero. As before, it is clear to see once more that $R_\alpha y \to A^\dagger y$ as $\alpha \to 0$.

Now, in the presence of noise, e.g., given data $y^\delta = y + e$, convergence is not so straightforward to prove, as a quick glance at Figure 1.2 may suggest. In particular, one may observe in Figure 1.2 that for a given noise level $\delta > 0$, the error tends to infinity as the parameter $\alpha$ tends to zero. In the presence of noise, however, convergence is proven with respect to $\delta$ (rather than $\alpha$, with an $\alpha = \alpha_*$ such that $0 < \alpha_* < \alpha_{\max} < \infty$, usually depending on $\delta$). It is common to define

$$x_\alpha^\delta = R_\alpha y^\delta \quad \text{and} \quad x_\alpha = R_\alpha y,$$
where the latter, $x_\alpha$, is really an auxiliary quantity which we consider only in proofs, as it is not possible to calculate it in practice, since we tend to have the presence of noise in our data. $x_\alpha^\delta$, on the other hand, is the approximate solution which we actually calculate via regularisation. The approximation lies in a neighbourhood of the exact solution $x^\dagger$, the proximity to which is determined by $\alpha$. The idea is to choose $\alpha$ as small as possible in order to obtain a

more accurate approximation whilst bearing the stability of the computation in mind. In order to understand the relationship between the stability and approximation errors, it is insightful to estimate the total error as

$$\|x_\alpha^\delta - x^\dagger\| \le \|x_\alpha^\delta - x_\alpha\| + \|x_\alpha - x^\dagger\|, \qquad (1.15)$$
where the first and second terms correspond to the data propagation (a.k.a. stability) error and the approximation error, respectively. Subsequently,

$$\|x_\alpha^\delta - x_\alpha\| \le B(\alpha), \quad \text{and} \quad \|x_\alpha - x^\dagger\| \le V(\alpha), \qquad (1.16)$$
where $\alpha \mapsto B(\alpha)$ and $\alpha \mapsto V(\alpha)$ are typically decreasing and increasing functions, respectively. Thus, one has that for small $\alpha$, the data propagation error blows up, and for large $\alpha$, one gets a larger approximation error.

Figure 1.2: In this plot, the total error, i.e., $\|x_\alpha^\delta - x^\dagger\|$, is represented by the blue line, whilst the propagated data error ($\|x_\alpha^\delta - x_\alpha\|$) and the approximation error ($\|x_\alpha - x^\dagger\|$), represented by the red lines, are increasing for $\alpha \to 0^+$ and $\alpha \to \infty$, respectively.

Source Conditions As well as proving convergence, as in Theorem 3, we are also interested in convergence rates (i.e., the “speed” of convergence). In particular, it is known that the approximate solution may converge arbitrarily slowly to the desired solution in the absence of certain conditions. We give the following proposition from [141]:

Proposition 1. Let $\operatorname{range} A$ be non-closed and let $\alpha = \alpha(\delta, y^\delta)$ be a parameter choice rule. Then there does not exist any function $f : (0,\infty) \to (0,\infty)$ with $\lim_{\delta\to 0} f(\delta) = 0$ for which
$$\|x_{\alpha(\delta,y^\delta)}^\delta - x^\dagger\| \le f(\delta)$$
holds for all $y \in \operatorname{dom} A^\dagger$ with $\|y\| \le 1$ and all $\delta > 0$.

In other words, the proposition above states that one cannot obtain uniform convergence rates for ill-posed problems, i.e., the convergence may be arbitrarily slow. We do not give the proof of Proposition 1 here, but instead refer the reader to [35]. Thus, in order to overcome this issue, one usually enforces a so-called source condition on the solution, i.e.,

$$x^\dagger \in \operatorname{range}(\phi(A^*A)) \iff x^\dagger = \phi(A^*A)\omega, \quad \|\omega\| \le C, \qquad (1.17)$$
with an index function $\phi : \mathbb{R}^+ \to \mathbb{R}^+$ (i.e., $\phi$ is continuous, strictly monotonically increasing and satisfies $\phi(0) = 0$, cf. [83]), i.e.,
$$\int_0^{\|A\|^2+} \frac{1}{\phi^2(\lambda)}\, d\|E_\lambda x^\dagger\|^2 < \infty. \qquad (1.18)$$
Typical index functions include the so-called Hölder-type ones:
$$\phi(\lambda) = \lambda^\mu, \qquad (\mu > 0), \qquad (1.19)$$
where a larger $\mu$ indicates a higher degree of smoothness (of the solution) and, for this reason, we may refer to it as the smoothness index. For problems in which the singular values decay very rapidly (i.e., for severely ill-posed problems as defined previously), (1.19) may be unnecessarily strong and therefore a more appropriate index function may be
$$\phi(\lambda) = \left(\log\frac{C}{\lambda}\right)^{-\mu},$$
i.e., a logarithmic-type source condition [73]. The following proposition from [110] shows that for (1.17) and $\phi$ satisfying (1.20), we achieve the desired bound on the approximation error:

Proposition 2. Let $x^\dagger$ satisfy a source condition (1.17) and suppose that there exists a positive constant $C$ such that
$$r_\alpha(\lambda)\phi(\lambda) \le C\phi(\alpha), \qquad (1.20)$$
for all $\alpha > 0$ and $\lambda \in (0, \|A\|^2)$. Then the approximation error may be estimated as
$$\|x_\alpha - x^\dagger\| \le C\phi(\alpha), \qquad (1.21)$$
for all $\alpha \in (0, \alpha_{\max})$.

Proof. Recalling that $y = Ax^\dagger$, we may write
$$\|x_\alpha - x^\dagger\|^2 = \|R_\alpha Ax^\dagger - x^\dagger\|^2 = \int_0^\infty |1 - \lambda g_\alpha(\lambda)|^2\, d\|E_\lambda x^\dagger\|^2 = \int_0^\infty |r_\alpha(\lambda)|^2\frac{\phi^2(\lambda)}{\phi^2(\lambda)}\, d\|E_\lambda x^\dagger\|^2 \le C^2\phi^2(\alpha)\int_0^\infty \frac{1}{\phi^2(\lambda)}\, d\|E_\lambda x^\dagger\|^2 \le C'\phi^2(\alpha),$$
where we used (1.20) in the penultimate inequality and the final inequality is a consequence of (1.18), thereby completing the proof.

Note that (1.20) typically does not hold for all $\phi$; for instance, if we consider the $\mu$-dependent ones, then (1.20) may only hold for $\mu$ in a finite interval $(0, \mu_0)$, where $\mu_0$ would be the so-called qualification index, i.e., the maximum $\mu$ for which optimal convergence rates hold, i.e., where saturation occurs (to be exemplified in Chapter 2).
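To see the bound (1.21) with the Hölder index function $\phi(\lambda) = \lambda^\mu$ "in action", one can run a small diagonal toy model; the geometric singular values, the choice $\mu = 1/2$ and $\omega = (1,\dots,1)$ below are assumptions made for this sketch only.

```python
import numpy as np

# Diagonal toy model: under x = (A^*A)^mu omega, the noise-free Tikhonov error
# decays essentially like alpha^mu, in line with (1.21) for phi(lambda)=lambda^mu.
sigma = 0.9 ** np.arange(1, 401)            # singular values of a diagonal operator
mu = 0.5
x_dagger = (sigma**2) ** mu                 # source element with omega = (1,...,1)

for alpha in [1e-2, 1e-3, 1e-4, 1e-5]:
    x_alpha = (sigma**2 / (sigma**2 + alpha)) * x_dagger   # noise-free Tikhonov
    err = np.linalg.norm(x_alpha - x_dagger)
    print(f"alpha = {alpha:.0e}   error = {err:.3e}   error/alpha^mu = {err/alpha**mu:.3f}")
```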

Iterating the Regularisation After regularising the problem (1.1), one may opt to regularise again and consequently define the so-called second regularisation iterate. In particular, given an "initial guess", which would be the (first) regularised solution $x_\alpha^\delta$, we can consider the equation
$$A(z + x_\alpha^\delta) = y^\delta \implies Az = y^\delta - Ax_\alpha^\delta,$$
and thereby iterate the process of regularisation to yield a new vector, $z_\alpha^\delta := R_\alpha(y^\delta - Ax_\alpha^\delta)$. Thus, the second iterate may be defined as
$$x_{\alpha,\delta}^{\mathrm{II}} := z_\alpha^\delta + x_\alpha^\delta = R_\alpha(y^\delta - Ax_\alpha^\delta) + R_\alpha y^\delta = 2R_\alpha y^\delta - R_\alpha AR_\alpha y^\delta,$$
i.e.,
$$x_{\alpha,\delta}^{\mathrm{II}} = (2R_\alpha - R_\alpha AR_\alpha)y^\delta, \qquad (1.22)$$
is the second regularisation iterate, and we adopt the notation $x_\alpha^{\mathrm{II}}$ for the noise-free case. Note that it is possible to repeat this process $n$ times and we exemplify this with, e.g., (2.9). It is also useful to define
$$p_\alpha^\delta := y^\delta - Ax_\alpha^\delta \quad \text{and} \quad p_{\alpha,\delta}^{\mathrm{II}} := y^\delta - Ax_{\alpha,\delta}^{\mathrm{II}}, \qquad (1.23)$$
as the residuals with respect to a regularisation and the second iterate of an iterated regularisation, respectively. Moreover, $p_\alpha$ and $p_\alpha^{\mathrm{II}}$ denote their noise-free variants.

The aforementioned families of regularisation operators, however, are not sufficient for determining a stable approximation of $x^\dagger$. In particular, which $\alpha \in (0,\alpha_{\max})$ does one opt for? If we choose it too large, then the approximation error increases; but if we choose it too small, then we may not be able to compute the solution stably in the presence of data error. A parameter choice is needed, i.e., a way to select $\alpha = \alpha_*$ which will ensure a stable yet accurate computation of the regularised solution. To this end, we define a regularisation method as a pair $(R_{\alpha_*}, \alpha_*)$ such that $R_{\alpha_*}y^\delta \to A^\dagger y$ and $\alpha_* \to 0$ as $\delta \to 0$. We will expand on this in Section 1.3.3.

1.3.2 Iterative Methods

If we recall the Gaußian normal equation, $A^*Ax = A^*y$, then it is not hard to see that one can transform this into the following fixed point equation
$$x = x + A^*(y - Ax),$$
which may be used to approximate $x^\dagger$ via the iteration
$$x_k = x_{k-1} + A^*(y - Ax_{k-1}), \qquad (1.24)$$
with $k \in \mathbb{N}$ and some initial guess $x_0$. Note that (1.24) is known as Landweber iteration (cf. [97]) and will be explored further in Chapter 4. Iterative methods may be more generally expressed as
$$x_k = x_{k-1} + G_k(x_{k-1}, y), \qquad (1.25)$$
for different choices of $G_k$ (cf. [81] and the references therein). The iteration index $k$ acts as the regularisation parameter for iterative methods and the relationship with $\alpha$ is given by $k \sim \frac{1}{\alpha}$. Thus, if it were not already clear before, it should be clear now that the asymptotics are inverted in the sense that, where $\alpha \to 0$ for continuous methods, the iterative regularisation should be such that $x_k \to x^\dagger$ as $k \to \infty$. Examples include Landweber iteration, already stated, and also Newton-type methods, among others.

In case (1.1) is ill-posed, it is not true that $x_k^\delta \to x^\dagger$ as $k \to \infty$. Instead, one observes a phenomenon known as semi-convergence, in which $\|x_k^\delta - x^\dagger\| \to \infty$ as $k \to \infty$, but for a particular $k_*$ (which we call the stopping index, to be detailed in the following section), one has that $x_{k_*}^\delta \to x^\dagger$ as $\delta \to 0$. Note that we adapt the notation of (1.23) so that
$$p_k^\delta := y^\delta - Ax_k^\delta \quad \text{and} \quad p_k := y - Ax_k \qquad (1.26)$$
define the residual variables in this case. Moreover, it follows from (1.24) that the second iterates in this case correspond to
$$x_{k,\delta}^{\mathrm{II}} = x_{2k}^\delta \quad \text{and} \quad p_{k,\delta}^{\mathrm{II}} = p_{2k}^\delta, \qquad (1.27)$$
which can easily be verified, i.e., iterating an iterative regularisation, e.g., Landweber iteration, is tantamount to doubling the iteration index.
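A minimal sketch of Landweber iteration (1.24) on a hypothetical discretised problem is given below; a relaxation parameter $\omega$ (not present in (1.24), where $\|A\| \le 1$ is implicitly assumed) is included so that the iteration is convergent for the test matrix, and with noisy data the printed output typically exhibits the semi-convergence just described.

```python
import numpy as np

# Landweber iteration x_k = x_{k-1} + omega * K^T (y - K x_{k-1}); with noisy
# data the error first decreases and then grows again (semi-convergence).
def landweber_errors(K, y, x_true, k_max):
    omega = 1.0 / np.linalg.norm(K, 2) ** 2     # step size so that omega*||K||^2 <= 1
    x = np.zeros(K.shape[1])
    errors = []
    for _ in range(k_max):
        x = x + omega * (K.T @ (y - K @ x))
        errors.append(np.linalg.norm(x - x_true))
    return errors

n = 100
t = np.linspace(0, 1, n)
K = np.exp(-np.abs(t[:, None] - t[None, :]) / 0.05) / n
x_true = np.sin(np.pi * t)
y_delta = K @ x_true + 1e-3 * np.random.randn(n)

errors = landweber_errors(K, y_delta, x_true, 5000)
print("smallest error at iteration k =", int(np.argmin(errors)) + 1)
```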

1.3.3 Parameter Choice Rules

The optimal parameter would be the one which minimises the total error:
$$\alpha_{\mathrm{opt}} = \operatorname*{argmin}_{\alpha\in(0,\alpha_{\max})} \|x_\alpha^\delta - x^\dagger\|. \qquad (1.28)$$
However, (1.28) is not implementable as $x^\dagger$ is unknown. Similarly, in case one regularises via an iterative method, one can also use a stopping rule which, much like a parameter choice rule, is a methodology for approximating the optimal stopping index $k_{\mathrm{opt}}$ (which minimises the total error with respect to $k \in \mathbb{N}$). Therefore, we are left with the task of approximating $\alpha_{\mathrm{opt}}$ via an implementable parameter choice rule, the consequent parameter of which we shall denote henceforth by $\alpha_*$. In particular, such rules may be categorised as follows:

• a-priori: assumes knowledge of $\delta$. As the name indicates, such a parameter may be computed prior to measurement of the data. For instance, one may choose $\alpha_* = \mathcal{O}(\delta^{s(\mu)})$ or, equivalently, $k_* = \mathcal{O}(\delta^{s(\mu)})$, with an exponent $s(\mu)$ that depends on the smoothness index $\mu$ of the solution.

• a-posteriori: assumes knowledge of $\delta$ and $y^\delta$. A typical example would be Morozov's discrepancy principle (cf. [114, 150]), in which one would compute the parameter as
$$\alpha_* = \alpha(\delta, y^\delta) = \sup\{\alpha \in (0,\alpha_{\max}) \mid \|p_\alpha^\delta\| \le \tau\delta\}, \qquad (1.29)$$

or, equivalently, $k_* = \inf\{k \in \mathbb{N} \mid \|p_k^\delta\| \le \tau\delta\}$, with a constant $\tau \ge 1$ (a small numerical sketch of this rule is given after Definition 6 below).

• heuristic (a.k.a. error-free, data-driven) [45, 83, 84, 87]: only assumes knowledge of $y^\delta$, i.e., $\alpha_* = \alpha(y^\delta)$, or equivalently, $k_* = k(y^\delta)$. As this is the topic of the thesis, we proceed to expand on this particular way of selecting the parameter in the next section.

In a way, the most "disadvantaged" of the above classes of parameter choice rules are the a-priori rules, since they require knowledge of both the noise level $\delta$ and the smoothness index $\mu$. At first glance, the naive reader would assume the heuristic class of rules to be the natural choice, given that they are the least strenuous with regard to the information required to implement them. However, it will become apparent in the subsequent convergence theory (and to some degree in the numerics) that heuristic rules are not without their banes. Note that in the stochastic case, i.e., where the noise $e$ is considered random, it is common to assume that it has a Gaußian distribution. As such, it is typical in that case to utilise statistical estimators in order to determine the noise level $\delta$ (under certain restrictive conditions) and then to combine it with one of the a-posteriori methods mentioned above (cf. [15, 65]). In terms of heuristic rules, the most popular in the stochastic case are arguably the cross-validation and generalised cross-validation rules (cf. [106, 107]). We will discuss the GCV rule in a deterministic framework in Section 2.2.3.

Figure 1.3: In this plot, the selected parameter $\alpha_{\mathrm{opt}}$ is the so-called oracle choice which minimises the sum of $B(\alpha)$ and $V(\alpha)$, which are the upper bounds for the data propagation and approximation errors, respectively. The class of heuristic parameter choice rules which we will focus on are based on minimising functionals which try to emulate the aforementioned oracle choice.

Definition 6. For a continuous linear operator $A : X \to Y$ between two Hilbert spaces, the family of operators $\{R_\alpha\}$, with $R_\alpha : Y \to X$ for each $\alpha \in (0,\alpha_{\max})$, is called a regularisation (for $A^\dagger$) if there exists a parameter choice rule $\alpha = \alpha_*$ such that
$$\lim_{\delta\to 0}\;\sup_{\substack{y^\delta\in Y\\ \|y-y^\delta\|\le\delta}} \|R_{\alpha_*}y^\delta - A^\dagger y\| = 0, \qquad (1.30)$$
for all $y \in \operatorname{dom} A^\dagger$, where $\alpha_*$ is also such that
$$\lim_{\delta\to 0}\;\sup_{\substack{y^\delta\in Y\\ \|y-y^\delta\|\le\delta}} \alpha_* = 0. \qquad (1.31)$$
For any $y \in \operatorname{dom} A^\dagger$, a pair $(R_\alpha, \alpha)$ is called a (convergent) regularisation method (for solving (1.1)) whenever (1.30) and (1.31) hold. Note that (1.30) is often referred to as worst case convergence as, in particular, the method must converge for the supremum of all possible noise.
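For concreteness, the sketch below (hypothetical test problem, with the noise level $\delta$ known exactly, both being assumptions for illustration) compares the oracle choice (1.28) with the discrepancy principle (1.29) for Tikhonov regularisation, before we turn to the heuristic rules.

```python
import numpy as np

# Compare the oracle parameter (1.28) with Morozov's discrepancy principle (1.29).
def tikhonov(K, y, alpha):
    U, s, Vt = np.linalg.svd(K, full_matrices=False)
    return Vt.T @ ((s / (s**2 + alpha)) * (U.T @ y))

n = 100
t = np.linspace(0, 1, n)
K = np.exp(-np.abs(t[:, None] - t[None, :]) / 0.05) / n
x_true = np.sin(np.pi * t)
noise = np.random.randn(n)
delta = 1e-4
y_delta = K @ x_true + delta * noise / np.linalg.norm(noise)   # ||y^delta - y|| = delta

alphas = np.logspace(-12, 0, 200)
errors = [np.linalg.norm(tikhonov(K, y_delta, a) - x_true) for a in alphas]
residuals = [np.linalg.norm(K @ tikhonov(K, y_delta, a) - y_delta) for a in alphas]

tau = 1.1
admissible = [a for a, r in zip(alphas, residuals) if r <= tau * delta]
print("oracle alpha_opt    :", alphas[int(np.argmin(errors))])
print("discrepancy alpha_* :", max(admissible) if admissible else "none admissible")
```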

22 1.4 Heuristic Parameter Choice Rules

In case one does not have knowledge of the noise level $\delta$, one may select the parameter depending on the measured data alone, i.e., $\alpha_* = \alpha(y^\delta)$. Generally speaking, such a parameter is typically selected as the minimiser of some convex functional $\psi : (0,\alpha_{\max}) \times Y \to \mathbb{R}\cup\{\infty\}$, i.e.,
$$\alpha_* = \operatorname*{argmin}_{\alpha\in(0,\alpha_{\max})} \psi(\alpha, y^\delta), \qquad (1.32)$$
where $\psi$ acts as a surrogate functional, i.e., it should serve as an estimator for the error (cf. Figure 1.4). Similarly, a heuristic stopping rule may be defined via
$$k_* = \operatorname*{argmin}_{k\in[1,k_{\max}]} \psi(k, y^\delta), \qquad (1.33)$$
where $\psi : \mathbb{N}\times Y \to \mathbb{R}\cup\{\infty\}$. In this thesis, whenever $A : X \to Y$ is a continuous linear operator between Hilbert spaces, the functionals we consider may be represented via their spectral form (cf. Appendix A):
$$\psi(\alpha, y^\delta)^2 = \int_0^\infty \Phi_\alpha(\lambda)\, d\|F_\lambda y^\delta\|^2, \qquad (1.34)$$
with $\Phi_\alpha : (0,\infty) \to \mathbb{R}$ a spectral filter function. The idea of the $\psi$-functionals, and consequently the associated heuristic rules, is that the aforementioned functional should act as a surrogate for the error functional (cf. Figure 1.4), i.e.,
$$\psi(\alpha, y^\delta) \sim \|x_\alpha^\delta - x^\dagger\|.$$
Note that for a $\psi$-functional of the form (1.34), we may apply the triangle inequality to estimate the functional by two components:
$$\psi(\alpha, y^\delta) \le \psi(\alpha, y - y^\delta) + \psi(\alpha, y), \qquad (1.35)$$
with the first term denoting the "noisy" component and the second term being associated with exact data. In particular, it is for this reason that (under certain conditions)
$$\psi(\alpha, y - y^\delta) \sim \|x_\alpha^\delta - x_\alpha\|, \qquad \psi(\alpha, y) \sim \|x_\alpha - x^\dagger\|, \qquad (1.36)$$
i.e., the functional acting on the noise should behave similarly to the data propagation error, and the functional acting on the exact data should analogously behave like the approximation error. In order to prove convergence of a regularisation method (cf. Definition 6), one should be able to prove convergence with respect to the parameter choice for all possible noise, i.e., for the worst case scenario (cf. (1.30)). A pitfall for heuristic rules is the so-called Bakushinskii veto [4], which is the following:

Figure 1.4: In this plot, we illustrate the principle that the heuristic $\psi$-functional should somehow act as a surrogate for the error (represented by the black curve), so that minimising the $\psi$-functional, which only requires knowledge of the data, should yield a parameter which is as close as possible to the optimal one.

Proposition 3 (Bakushinskii's veto). Suppose (1.1) is ill-posed, i.e., $A^\dagger$ is not continuous, and let $\alpha = \alpha_*$. Then we have that
$$\|x_{\alpha_*}^\delta - x^\dagger\| \to 0, \quad \text{as } \delta \to 0,$$
only if $\alpha_*$ depends on $\delta$.

Proof. Suppose that $\alpha_* = \alpha(y^\delta)$ does not depend on $\delta$. Then
$$\lim_{\delta\to 0}\;\sup_{\substack{y^\delta\in Y\\ \|y^\delta-y\|\le\delta}} \|R_{\alpha(y^\delta)}y^\delta - A^\dagger y\| = 0 \qquad (1.37)$$
would imply that $R_{\alpha(y)}y = A^\dagger y$ for all $y \in \operatorname{dom} A^\dagger$. Thus, if $\{y_n\}$ (with $y_n \in \operatorname{dom} A^\dagger$) is such that $y_n \to y \in \operatorname{dom} A^\dagger$, then this would in turn imply that $A^\dagger y_n = R_{\alpha(y_n)}y_n \to A^\dagger y$, i.e., that $A^\dagger$ is continuous [35].

Contrary to Bakushinskii's veto, however, heuristic parameter choice rules, when implemented, are often successful, usually providing better than satisfactory results (cf. [9]). A possible reason could be that Bakushinskii's veto considers the supremum of all possible noise, i.e., the worst possible noise, which in practical implementations is often not the encountered scenario. Therefore, Kindermann and Neubauer proposed a noise-restricted analysis in which one can prove convergence of the heuristic rules (cf. [83, 87]), i.e., instead of attempting to prove (1.37), one proves
$$\lim_{\delta\to 0}\;\sup_{\substack{y^\delta\in Y\\ y - y^\delta\in\mathcal{N}}} \|R_{\alpha_*}y^\delta - A^\dagger y\| = 0,$$
where $\mathcal{N} \subset Y$ is a restriction on the noise aimed to bypass the aforementioned veto of Bakushinskii. In particular, with this concept of convergence, Proposition 3 no longer applies. Prior to the aforementioned solution, Glasko and Kriskin (cf. [45]) also defined
$$\mathcal{N} := \left\{ e \in Y \;\middle|\; \|x_\alpha^\delta - x_\alpha\| \le C\psi(\alpha, y - y^\delta) \text{ for all } \alpha \in (0,\alpha_{\max}) \right\}, \qquad (1.38)$$
the so-called auto-regularisation set, in order to prove convergence with respect to the quasi-optimality rule (although, as stated, one may use this same condition to prove convergence in the linear theory with respect to all the mentioned heuristic rules), which postulates a bound on the data propagation error. In particular, we provide a sketch below of how one would utilise this condition in order to prove convergence:
$$\|x_{\alpha_*}^\delta - x^\dagger\| \le \|x_{\alpha_*}^\delta - x_{\alpha_*}\| + \|x_{\alpha_*} - x^\dagger\| \le C\psi(\alpha_*, y - y^\delta) + \phi(\alpha_*),$$
where the first inequality is a mere application of the triangle inequality and the second inequality follows from the above auto-regularisation condition (1.38) and the bound (1.21) for the approximation error by the index function. Then, due to the sub-additivity of the $\psi$-functionals, one obtains the bound
$$\psi(\alpha_*, y - y^\delta) \le \psi(\alpha_*, y^\delta) + \psi(\alpha_*, y). \qquad (1.39)$$
Additionally, the functionals satisfy $\psi(\alpha_*, y^\delta) \to 0$ as $\delta \to 0$ (cf. Proposition 4) and $\psi(\alpha_*, y) \le C\alpha_*$ for all $\alpha_* \in (0,\alpha_{\max})$. Hence, since it can (and will) be shown that $\alpha_* \to 0$ as $\delta \to 0$, it follows that $x_{\alpha_*}^\delta \to x^\dagger$ in the strong (i.e., norm) topology.

One may view (1.38) as quite an abstract condition, however, which does not provide any particular insight or qualitative information regarding the noise. Prior to defining some concrete examples of $\psi$-functionals, we first provide some associated "axioms". In particular, from [83], we can assume that all $\psi$-functionals (in the linear Hilbert space setting) satisfy the following assumptions:
$$\psi(\alpha, y^\delta) \ge 0, \quad \text{for all } \alpha \in (0,\alpha_{\max}) \text{ and } y^\delta \in Y; \qquad \text{(A1)}$$
$$\psi(\alpha, y^\delta) = \psi(\alpha, -y^\delta) \quad \text{for all } \alpha \in (0,\alpha_{\max}) \text{ and } y^\delta \in Y; \qquad \text{(A2)}$$
$$\psi(\alpha, \cdot) : Y \to \mathbb{R}\cup\{\infty\} \text{ is continuous for all } \alpha \in (0,\alpha_{\max}); \qquad \text{(A3)}$$
$$\psi(\cdot, y) : (0,\alpha_{\max}) \to \mathbb{R}\cup\{\infty\} \text{ is lower semi-continuous for all } y \in Y; \qquad \text{(A4)}$$
$$y \in \operatorname{dom} A^\dagger \text{ implies } \lim_{\alpha\to 0}\psi(\alpha, y) = 0. \qquad \text{(A5)}$$

We may now give the following useful convergence result from [83]:

Proposition 4. Let $\alpha_*$ be selected by (1.32) and let $\psi$ satisfy (A1)-(A5). Then $\psi(\alpha_*, y^\delta) \to 0$ as $\delta \to 0$ for all $y^\delta \in Y$.

Proof. It is immediate that $\psi(\alpha_*, y^\delta) \le \psi(\alpha, y^\delta)$ for all $\alpha \in (0,\alpha_{\max})$. Additionally, by (A3), it follows that
$$\limsup_{\delta\to 0}\psi(\alpha_*, y^\delta) \le \psi(\alpha, y), \quad \text{for all } \alpha \in (0,\alpha_{\max}).$$
In light of (A5), it follows that
$$0 \le \liminf_{\delta\to 0}\psi(\alpha_*, y^\delta) \le \limsup_{\delta\to 0}\psi(\alpha_*, y^\delta) \le 0,$$
hence yielding the desired result.

Definition 7. Let $g_\alpha$ be such that $r_\alpha \ge 0$. Then, for such a regularisation, we can define four "classical" heuristic parameter choice rules via the functionals (which also satisfy (A1), (A2), (A3), (A4), (A5)):
$$\psi_{\mathrm{HD}}(\alpha, y^\delta) := \frac{1}{\sqrt{\alpha}}\|p_\alpha^\delta\| = \frac{1}{\sqrt{\alpha}}\|r_\alpha(AA^*)y^\delta\|, \qquad \text{(HD)}$$
$$\psi_{\mathrm{HR}}(\alpha, y^\delta) := \frac{1}{\sqrt{\alpha}}\langle p_{\alpha,\delta}^{\mathrm{II}}, p_\alpha^\delta\rangle^{\frac{1}{2}} = \frac{1}{\sqrt{\alpha}}\|r_\alpha(AA^*)^{\frac{3}{2}}y^\delta\|, \qquad \text{(HR)}$$
$$\psi_{\mathrm{L}}(\alpha, y^\delta) := \langle x_\alpha^\delta,\, x_{\alpha,\delta}^{\mathrm{II}} - x_\alpha^\delta\rangle^{\frac{1}{2}} = \|A^*g_\alpha(AA^*)r_\alpha(AA^*)^{\frac{1}{2}}y^\delta\|, \qquad \text{(L)}$$
$$\psi_{\mathrm{QO}}(\alpha, y^\delta) := \|x_{\alpha,\delta}^{\mathrm{II}} - x_\alpha^\delta\| = \|A^*g_\alpha(AA^*)r_\alpha(AA^*)y^\delta\|, \qquad \text{(QO)}$$
which define the heuristic discrepancy (HD), Hanke-Raus (HR) (cf. [59]), simple-L (L) (cf. [91]) and quasi-optimality (QO) (cf. [45, 87, 145, 146]) functionals, respectively.
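For a matrix problem, all four functionals of Definition 7 can be evaluated cheaply through the SVD; the sketch below does this for Tikhonov regularisation (where the filter factors $g_\alpha$, $r_\alpha$ act on $\sigma_i^2$) on a hypothetical test matrix and selects $\alpha_*$ as the minimiser over a logarithmic grid, cf. (1.32).

```python
import numpy as np

# Evaluate the HD, HR, L and QO functionals for Tikhonov regularisation via the SVD.
def heuristic_rules(K, y_delta, alphas):
    U, s, Vt = np.linalg.svd(K, full_matrices=False)
    b = U.T @ y_delta                   # spectral coefficients of the data
    psi = {"HD": [], "HR": [], "L": [], "QO": []}
    for a in alphas:
        r = a / (s**2 + a)              # residual filter r_alpha(sigma^2)
        g = 1.0 / (s**2 + a)            # Tikhonov filter g_alpha(sigma^2)
        psi["HD"].append(np.linalg.norm(r * b) / np.sqrt(a))
        psi["HR"].append(np.linalg.norm(r**1.5 * b) / np.sqrt(a))
        psi["L"].append(np.linalg.norm(s * g * np.sqrt(r) * b))
        psi["QO"].append(np.linalg.norm(s * g * r * b))
    return {name: alphas[int(np.argmin(values))] for name, values in psi.items()}

n = 100
t = np.linspace(0, 1, n)
K = np.exp(-np.abs(t[:, None] - t[None, :]) / 0.05) / n
y_delta = K @ (t * (1 - t)) + 1e-4 * np.random.randn(n)

for name, alpha_star in heuristic_rules(K, y_delta, np.logspace(-12, 0, 200)).items():
    print(name, "selects alpha_* =", alpha_star)
```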

These would generally be considered the so-called "classical rules". Their stopping rule counterparts are defined analogously by replacing the $p_\alpha^\delta$, $p_{\alpha,\delta}^{\mathrm{II}}$ terms above by $p_k^\delta$, $p_{2k}^\delta$, respectively, and similarly $x_\alpha^\delta$ and $x_{\alpha,\delta}^{\mathrm{II}}$ are replaced by $x_k^\delta$ and $x_{2k}^\delta$, respectively, in the same fashion; similarly for the filter functions. In this thesis, we will generally focus slightly more on continuous regularisation methods, however. Moreover, one should note that the second equalities with the filter functions are only valid in the case of linear regularisation over Hilbert spaces. Thus, the expressions in the first equalities are the more general and can be used in a more expanded setting. The heuristic discrepancy rule was originally proposed as a stopping rule for Landweber iteration in the paper [59], whilst the Hanke-Raus rule was suggested as a parameter choice rule for Tikhonov regularisation. However, one may just as well use the HD rule for continuous regularisation methods

such as Tikhonov regularisation and, for this reason, it is sometimes rather confusingly also called the "Hanke-Raus" rule. The intuition is that since
$$\psi_{\mathrm{HD}}(\alpha, y) = \frac{1}{\sqrt{\alpha}}\|r_\alpha(AA^*)Ax^\dagger\| \sim \|r_\alpha(A^*A)x^\dagger\| = \|x_\alpha - x^\dagger\|,$$

in most cases, it seems reasonable to expect that $\psi_{\mathrm{HD}}(\alpha, y^\delta) = \|r_\alpha(AA^*)y^\delta\|/\sqrt{\alpha}$ may be used for error estimation (cf. [35]). The logic behind the quasi-optimality rule is that the minimum of $\psi_{\mathrm{QO}}$ will normally be attained near the "cross-over point", where the propagated data and approximation errors have roughly the same order of magnitude (cf. [35]). This incidentally turns out to be an effective parameter choice rule, as will be revealed in later sections.

Note that we may also "extend" the above to define a class of ratio rules by simply dividing the functionals above by $\|x_\alpha^\delta\|$, e.g.,
$$\psi_{\mathrm{HDR}}(\alpha, y^\delta) := \frac{\|p_\alpha^\delta\|}{\sqrt{\alpha}\,\|x_\alpha^\delta\|},$$
which defines the heuristic discrepancy ratio functional, and in similar fashion, we use the notation $\psi_{\mathrm{HRR}}$ and $\psi_{\mathrm{QOR}}$ to refer to the Hanke-Raus ratio and quasi-optimality ratio functionals [99]. The $\psi_{\mathrm{HDR}}$ functional defined above is also known as the Brezinski-Rodriguez-Seatzu rule [18]. In the case of stopping rules, the ratio rules are defined in an equivalent manner, replacing $\alpha$ with $1/k$, and clearly also $x_\alpha^\delta$ and $p_\alpha^\delta$ by $x_k^\delta$ and $p_k^\delta$, respectively. The rationale behind the ratio rules is that whilst the standard functionals approximate the total error, the ratio functionals similarly approximate the relative error:
$$\mathrm{Err}_{\mathrm{rel}}(\alpha) = \frac{\|x_\alpha^\delta - x^\dagger\|}{\|x^\dagger\|}.$$
However, we may disappoint the reader by stating now that the ratio rules lie beyond the scope of this thesis.

A further class of heuristic rules includes those which combine pre-existing functionals, e.g.,
$$\psi_{\mathrm{HME}}(\alpha, y^\delta) := \frac{\psi_{\mathrm{HR}}(\alpha, y^\delta)}{\psi_{\mathrm{HD}}(\alpha, y^\delta)} = \frac{1}{\sqrt{\alpha}}\,\frac{\langle p_{\alpha,\delta}^{\mathrm{II}}, p_\alpha^\delta\rangle}{\|p_{\alpha,\delta}^{\mathrm{II}}\|},$$
which defines the functional for the heuristic monotone error rule [52, 144]. One can observe that this is nothing other than the ratio of the Hanke-Raus and heuristic discrepancy functionals. This is also one of the "heuristifications" of certain a-posteriori parameter choice rules; in this case, the monotone error rule, which selects the parameter as

$$\alpha_* = \sup\left\{\alpha \in (0,\alpha_{\max}) \;\middle|\; \frac{\langle p_{\alpha,\delta}^{\mathrm{II}}, p_\alpha^\delta\rangle}{\|p_{\alpha,\delta}^{\mathrm{II}}\|} \le \tau\delta \right\},$$

for a suitably chosen $\tau \ge 1$. Similarly, the HD rule is a heuristification of the discrepancy principle (1.29) mentioned earlier, and the Hanke-Raus rule is a heuristification of the so-called improved discrepancy principle [34, 35, 44]:

$$\alpha_* = \sup\left\{\alpha \in (0,\alpha_{\max}) \mid \langle p_{\alpha,\delta}^{\mathrm{II}}, p_\alpha^\delta\rangle \le \tau\delta \right\}, \qquad (1.40)$$
also known as the Raus-Gfrerer rule, which has the advantage of always being an optimal-order method, contrary to the standard discrepancy principle. On the other hand, an a-posteriori rule seemingly based on a heuristic rule is the so-called Lepskii rule, a.k.a. the balancing principle [96, 100, 109],

$$\alpha_* = \sup\left\{\alpha \in (0,\alpha_{\max}) \mid \|x_{\alpha,\delta}^{\mathrm{II}} - x_\alpha^\delta\| \le \tau\delta \right\},$$
which is arguably an "a-posteriori-fication" of the quasi-optimality rule. Another, in fact very popular, heuristic rule is the generalised cross-validation rule [125, 152]. However, that rule is restricted to the finite-dimensional case and we shall discuss it in Section 2.2.3.

Graph-Based Heuristic Rules In addition to the $\psi$-based methods, alternative options for heuristic rules include choosing the parameter $\alpha_* = \alpha(y^\delta)$, again based on the data alone, via a graph-based method. Arguably the most popular of these methods, and perhaps of all heuristic rules, is the L-curve rule [60]. Other methods include the U-curve [93], the V-curve [40] and the recently proposed Q-curve rules [135].

The L-curve method consists of plotting the graph $(\log\|p_\alpha^\delta\|, \log\|x_\alpha^\delta\|)$ which, in principle, should produce an "L"-shaped curve, allowing one to subsequently select the parameter $\alpha_*$ as the corner of the "L", i.e., the point of maximum curvature. See Figure 1.5. The intuition behind the L-curve is that we would like to choose a parameter such that the residual is small, whilst also maintaining stability of the solution, i.e., $\|x_\alpha^\delta\|$ should not "blow up". Whilst the logic may seem sound, it has been observed that the rule may sometimes fail and cannot be subjected to the same analysis as the $\psi$-based methods, e.g., the noise-restricted analysis induced by (2.17). Thus, it is difficult to garner an understanding of when the method works and when it does not. Recently, however, a $\psi$-functional rule based on the L-curve method was developed [91] for which a convergence analysis consistent with the other $\psi$-based methods is possible. See Section 2.1.1 for details.

We will not go into detail on the other methods, since the focus of this thesis is on the $\psi$-functional based methods, but the V-curve essentially chooses the parameter as the minimiser of the speed of the parameterisation of the L-curve. The Q-curve, on the other hand, consists of plotting the graph of $(\log\langle p_{\alpha,\delta}^{\mathrm{II}}, p_\alpha^\delta\rangle, \log\psi_{\mathrm{QO}}(\alpha, y^\delta))$ and choosing the point of maximum curvature, similarly as we do for the aforementioned L-curve rule. Whilst the Q-curve also exhibits an "L"-shaped curve, in spite of its name (which is derived from its incorporation of the QO functional rather than the general shape of its plots), the U-curve, in principle, should actually exhibit a "U"-shaped plot. We would plot $(\alpha, U(\alpha))$, where
$$U(\alpha) := \frac{1}{\|p_\alpha^\delta\|} + \frac{1}{\|x_\alpha^\delta\|},$$
and we would select the parameter $\alpha_*$ as the one which corresponds to the local maximum on the left-hand side of the graph. Indeed, the "U" shape arises from the observation that $U(\alpha) \to \infty$ whenever $\alpha \to 0$ and also as $\alpha \to \infty$ (cf. [93]).

Figure 1.5: In this figure, we see a typical example of how an L-curve plot should look. The idea is that the graph of $(\|Ax_\alpha^\delta - y^\delta\|, \|x_\alpha^\delta\|)$ should exhibit an "L" shape and the parameter is selected as the "corner point".
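A rudimentary sketch of the L-curve rule for Tikhonov regularisation is given below (hypothetical test matrix; the corner is located via the discrete curvature of the log-log curve, which is only one of several ways of defining the "corner").

```python
import numpy as np

# L-curve rule: compute (log ||p_alpha||, log ||x_alpha||) over a grid of alphas
# and take the point of maximum discrete curvature as the corner.
def l_curve_corner(K, y_delta, alphas):
    U, s, Vt = np.linalg.svd(K, full_matrices=False)
    b = U.T @ y_delta
    res = np.array([np.linalg.norm((a / (s**2 + a)) * b) for a in alphas])
    sol = np.array([np.linalg.norm((s / (s**2 + a)) * b) for a in alphas])
    x, y = np.log(res), np.log(sol)
    dx, dy = np.gradient(x), np.gradient(y)
    ddx, ddy = np.gradient(dx), np.gradient(dy)
    curvature = (dx * ddy - dy * ddx) / (dx**2 + dy**2) ** 1.5
    return alphas[int(np.argmax(np.abs(curvature)))]

n = 100
t = np.linspace(0, 1, n)
K = np.exp(-np.abs(t[:, None] - t[None, :]) / 0.05) / n
y_delta = K @ np.sin(np.pi * t) + 1e-4 * np.random.randn(n)
print("L-curve corner at alpha =", l_curve_corner(K, y_delta, np.logspace(-12, 0, 200)))
```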

Part I

Theory

Chapter 2

Linear Tikhonov Regularisation

In this chapter, we provide a theoretical analysis of Tikhonov regularisation, which the observant reader will recall was introduced in Example 1 of Chapter 1. This form of regularisation was introduced by its namesake, Tikhonov (cf. [145, 146]), in its variational form (see (2.4) below) in order to approximate the solution of ill-posed Fredholm integral equations. However, since its inception, Tikhonov regularisation has been thoroughly analysed via its spectral representation (see (2.1) below) for linear ill-posed problems (cf. [35] and the references therein). We therefore devote the first section of this chapter to an exposition of the aforementioned classical theory with relevance to the heuristic parameter choice rules. There are then a further two sections which expand the classical theory to particular deviations from the typical framework for ill-posed problems: when the noise is unbounded in the usual norm and when the operator is perturbed, respectively. As mentioned, this method is sometimes also referred to as Tikhonov-Phillips regularisation, as Phillips [128] suggested it around the same time. Further references include [49, 147] and those therein. A scan of the Table of Contents would reveal that the following chapter is also on Tikhonov regularisation, but that covers the relatively more recent application to so-called convex regularisation problems.

2.1 Classical Theory

Now we delve into the classical convergence theory for Tikhonov regularisation based on error estimates. The functional calculus for operators will feature heavily in the analysis and we therefore refer the unacquainted reader to Appendix A. Otherwise, we proceed. Let $A : X \to Y$ be a continuous linear operator between two Hilbert spaces. The idea of Tikhonov regularisation is to shift the spectrum of $A^*A$ away from the singularity (at zero), i.e., we estimate (1.7) by
$$x_\alpha^\delta = (A^*A + \alpha I)^{-1}A^*y^\delta, \qquad (2.1)$$
with $\alpha \in (0,\alpha_{\max})$ (cf. [145, 146]). In this case, we recall from Example 1 that the associated filter function is given by
$$g_\alpha(\lambda) = \frac{1}{\lambda+\alpha}, \qquad (2.2)$$
and the filter function for its associated residual is given by
$$r_\alpha(\lambda) = \frac{\alpha}{\lambda+\alpha}. \qquad (2.3)$$
The equation (2.1) may also be written in the variational form [35]:

Proposition 5. Let $x_\alpha^\delta$ be defined by (2.1). Then
$$x_\alpha^\delta = \operatorname*{argmin}_{x\in X}\; \|Ax - y^\delta\|^2 + \alpha\|x\|^2, \qquad (2.4)$$
for all $\alpha \in (0,\alpha_{\max})$ and $y^\delta \in Y$.

Proof. From the optimality condition, it is clear that $x_\alpha^\delta$ satisfies
$$0 = A^*(Ax_\alpha^\delta - y^\delta) + \alpha x_\alpha^\delta = (A^*A + \alpha I)x_\alpha^\delta - A^*y^\delta \iff (A^*A + \alpha I)x_\alpha^\delta = A^*y^\delta,$$
from which the result follows trivially.

In later sections, we will see that the form (2.4) lends itself better to generalisation than the spectral formulation (2.1) (cf. Chapter 3). Now we are ready to prove the following error estimates, which will also be utilised for subsequent results [35]:

Proposition 6. There exist positive constants such that

$$\|x_\alpha^\delta - x_\alpha\| \le C\frac{\delta}{\sqrt{\alpha}}, \qquad \|x_\alpha - x^\dagger\| = o(1) \text{ as } \alpha \to 0,$$
$$\|A(x_\alpha^\delta - x_\alpha)\| \le C\delta, \qquad \|Ax_\alpha - y\| \le C\sqrt{\alpha},$$
for all $y, y^\delta \in Y$ and $\alpha \in (0,\alpha_{\max})$.

Proof. First we estimate the data propagation error:
$$\|x_\alpha^\delta - x_\alpha\|^2 = \|(A^*A+\alpha I)^{-1}A^*(y^\delta - y)\|^2 = \|A^*(AA^*+\alpha I)^{-1}(y^\delta - y)\|^2 = \|(AA^*)^{\frac12}(AA^*+\alpha I)^{-1}(y^\delta - y)\|^2 = \int_0^\infty \frac{\lambda}{(\lambda+\alpha)^2}\, d\|F_\lambda(y^\delta - y)\|^2 \le \frac{C}{\alpha}\int_0^\infty d\|F_\lambda(y^\delta - y)\|^2,$$
since
$$\frac{\lambda}{(\lambda+\alpha)^2} = \frac{\lambda}{\lambda+\alpha}\cdot\frac{1}{\lambda+\alpha} \le \frac{1}{\alpha}.$$
Thus, in this case,
$$\|x_\alpha^\delta - x_\alpha\| \le B(\alpha) = C\frac{\delta}{\sqrt{\alpha}}.$$
The approximation error is estimated as
$$\|x_\alpha - x^\dagger\|^2 = \|[(A^*A+\alpha I)^{-1}A^*A - I]x^\dagger\|^2 = \|\alpha(A^*A+\alpha I)^{-1}x^\dagger\|^2 = \int_0^\infty \frac{\alpha^2}{(\lambda+\alpha)^2}\, d\|E_\lambda x^\dagger\|^2.$$
In particular, we proceed similarly as in [35] and Theorem 3: observing that the integrand, i.e., the residual function, is bounded from above, we may apply the dominated convergence theorem identically as before to get
$$\lim_{\alpha\to 0}\int_0^\infty \frac{\alpha^2}{(\lambda+\alpha)^2}\, d\|E_\lambda x^\dagger\|^2 = \int_0^\infty \lim_{\alpha\to 0}\frac{\alpha^2}{(\lambda+\alpha)^2}\, d\|E_\lambda x^\dagger\|^2,$$
and since the limit tends to zero for all $\lambda > 0$ and to $1$ for $\lambda = 0$, the result follows as in the proof of Theorem 3. Now, we estimate the data discrepancy as
$$\|A(x_\alpha^\delta - x_\alpha)\|^2 = \|A(A^*A+\alpha I)^{-1}A^*(y^\delta - y)\|^2 = \|(AA^*+\alpha I)^{-1}AA^*(y^\delta - y)\|^2 = \int_0^\infty \frac{\lambda^2}{(\lambda+\alpha)^2}\, d\|F_\lambda(y^\delta - y)\|^2 \le C\delta^2.$$
Finally, we may bound the exact discrepancy in similar fashion:
$$\|Ax_\alpha - y\|^2 = \|A(x_\alpha - x^\dagger)\|^2 = \|\alpha A(A^*A+\alpha I)^{-1}x^\dagger\|^2 = \int_0^\infty \frac{\alpha^2\lambda}{(\lambda+\alpha)^2}\, d\|E_\lambda x^\dagger\|^2 \le C\alpha,$$
thereby completing the proof.

The above estimates cater for a simple proof of the following convergence result for Tikhonov regularisation [35]:

Theorem 4. Let $\alpha = \alpha(\delta)$ be chosen such that $\alpha \to 0$ and $\delta/\sqrt{\alpha} \to 0$ as $\delta \to 0$. Then we have
$$\|x_\alpha^\delta - x^\dagger\| \to 0, \quad \text{as } \delta \to 0.$$

33 Proof. From the estimates in the previous proposition, we can estimate the total error as δ † δ kx − x k ≤ C √ + O(1), α α √ for all α ∈ (0, αmax). Therefore, choosing α = α(δ) such that δ/ α → 0 and α → 0 as δ → 0 yields convergence. Due to Proposition 1, convergence in Theorem 4 may be arbitrarily slow. We remind the reader, however, that Proposition 1 (on arbitrarily slow conver- gence) is not just valid for Tikhonov regularisation, but for all regularisation methods. In any case, to prove convergence rates, we postulate the following H¨older type source condition:

x† ∈ range(A∗A)µ, µ > 0, (2.5) i.e., there exists an ω such that x† = (A∗A)µω. Thence, we obtain the following estimates [35]:

Proposition 7. Let (2.5) hold. Then there exists constants such that

† µ kxα − x k ≤ Cα , for µ ∈ [0, 1), (2.6)   µ+ 1 1 kAx − yk ≤ Cα 2 , for µ ∈ 0, , (2.7) α 2 for all α ∈ (0, αmax) and y ∈ Y . Proof. We have

† 2 ∗ −1 † 2 ∗ −1 ∗ µ 2 kxα − x k = kα(A A + αI) x k = kα(A A + αI) (A A) ωk Z ∞ 2 2µ  2 2µ  Z ∞ α λ 2 α λ 2 = 2 dkEλωk ≤ sup 2 dkEλωk 0 (λ + α) λ∈(0,∞) (λ + α) 0  µα 2µ = (µ − 1)2 kωk2 = Cα2µ, 1 − µ for µ ∈ [0, 1), and

2 ∗ −1 ∗ µ 2 kAxα − yk = kαA(A A + αI) (A A) ωk Z ∞ 2 2µ+1  2 2µ+1  Z ∞ α λ 2 α λ 2 = 2 dkEλωk ≤ sup 2 dkEλωk 0 (λ + α) λ∈(0,∞) (λ + α) 0 1 α + 2µα2µ = α (1 − 4µ2)kωk2 = Cα2µ+1, 4 1 − 2µ

1 for µ ∈ [0, 2 ), which yields the desired estimates.

34 In light of the above, we are now able to give the convergence rates result [35]:

Corollary 1. Let (2.5) hold with µ ∈ [0, 1). Then, choosing α = α(δ) = 2 αopt := δ 2µ+1 yields that

 2µ  δ † 2µ+1 kxα − x k = O δ , (2.8) as δ → 0.

Proof. Recall the estimates (1.15) and (1.16). Then it follows from the data propagation error estimate of Proposition 6 and (2.6), that computing   δ † δ µ kxα − x k ≤ inf C √ + Cα α∈(0,αmax) α

2 yields the desired estimate, with α = αopt = δ 2µ+1 . The estimate (2.8) is known as the optimal order of convergence (cf. [35]). From the estimates above, we may observe that Tikhonov regularisation exhibits the well-known saturation effect in which the convergence rates do not improve for µ ≥ 1 and in particular, why we do not assume a source condition (2.5) with µ ≥ 1. In general, the saturation effect describes the behaviour of regularisation methods for which (2.8) does not hold for all µ > 0 for only up to some finite qualification index µ0. For Tikhonov regularisation, we observe that the qualification is µ0 = 1 (cf. [35]). Note that there also exist generalisations of the source condition above, cf. [110,111]. One can also define an iterative regularisation scheme, as in Chapter 1 and (1.22), with the Tikhonov regularisation operator, known as iterated Tikhonov regularisation (cf. [57]). This is given by the following expression:

δ ∗ −1 ∗ δ δ δ xα,n := (A A + αI) (A y + αxα,n−1), xα,0 := 0. (2.9) It is also possible to express (2.9) in its variational form:

δ δ 2 δ 2 xα,n = argmin kAx − y k + αkx − xαn−1 k . x∈X

In this case, one may consider regularisation for α → 0 with a fixed n ∈ N and that is indeed usually the case. However, one may also fix an α ∈ (0, αmax) and consider (2.9) as an iterative regularisation method for n → ∞ [35]. The respective filter and residual functions may be written in the form: (λ + α)n − αn g (λ) = , α,n λ(λ + α)n  α n r (λ) = , α,n λ + α

35 which, we note, is a rather convenient form for the residual function. Of particular interest to us (for usage in several parameter choice rules) is the second Tikhonov iterate, which we denote by (cf. (1.22) in Chapter 1):

II ∗ −1 ∗ δ δ xα,δ = (A A + αI) (A y + αxα), (2.10) which is simple to compute and follows from plugging n = 2 into (2.9). One may also expand the expression in (2.10) to get

II ∗ −1 ∗ δ δ δ xα,δ = (A A + αI) A [y + (y − Axα)], where we observe that the iterates are essentially “adding back the noise”; δ δ δ that is, we add y − Axα to the initial data y (and then regularise). Note that the filter function associated with the second Tikhonov iterate is given by λ + 2α gII (λ) := , α (λ + α)2 and  α 2 rII (λ) := α λ + α is then the respective filter function for the associated residual.

Remark. The qualification for the second Tikhonov iterate is µ0 = 2 and in general, for iterated Tikhonov regularisation, (1.20) holds for H¨oldertype source conditions `ala (2.5) with qualification µ0 = n.

2.1.1 Heuristic Parameter Choice Rules For Tikhonov regularisation, we may consider the “classical” ψ-functional based heuristic parameter choice rules, defined in Definition 7 of Chapter 1, as the minimisers of (1.34), with

αk−j−1λj Φj,k(λ) = . (2.11) α (λ + α)k

In particular, from the definitions of gα (2.2) and rα (2.3), for Tikhonov regularisation, we have the following: if k = 2 and j = 0, then this defines the heuristic discrepancy rule; k = 3 and j = 0 defines the Hanke-Raus rule; k = 4 and j = 1 defines the quasi-optimality rule, and all of the aforementioned fall into the so-called R1 family of rules (cf. [133]). Another rule, which does not fall into the R1 family, is the simplified L-curve rule, which is defined by taking k = 3 and j = 1. In addition to the aforementioned R1-rules, there is the greater family of R*-rules [54]. We may also express this in tabular form for he benefit of the reader:

36 k = 2 k = 3 k = 4 j = 0 HD HR j = 1 L QO Or more explicitly: α Φ (λ) = , (HD) α (λ + α)2 α2 Φ (λ) = , (HR) α (λ + α)3 αλ Φ (λ) = , (L) α (λ + α)3 α2λ Φ (λ) = . (QO) α (λ + α)4 Relatively recent results on the quasi-optimality rule may be found in [7,8]. Note that for Tikhonov regularisation, the simple-L functional may be writ- ten in the following form [91]:  ∂  ψ (α, yδ) = − xδ , α xδ . (2.12) L α ∂α α Naturally, it follows that the respective ratio rule may be expressed as

δ ∂ δ δ xα, α ∂α xα ψLR(α, y ) = − δ 2 , (2.13) kxαk although, as mentioned previously, we will generally focus more on the non- ratio rules. An a-posteriori version of this rule was also proposed in [91], namely,   ∂   α = sup α ∈ (0, α ) | −α xδ , α xδ ≤ τδ , ∗ max α ∂α α for an appropriately chosen τ ≥ 1, although this rule is yet to have been investigated.

The Simple-L Rules Note that the simple-L and simple-L ratio rules are relatively new devel- opments (cf. [91]) which drew inspiration from the more classical L-curve δ method (cf. [60, 62, 64, 98]) in which one plots the graph of (log(kAxα − δ 2 δ 2 y k ), log(kxαk )) and selects the parameter as the maximiser of the curva- ture of the so-called L-graph, i.e., the curve κ(α) α 7→ , χ(α)

37 δ δ 2 δ 2 where κ(α) := log(kAxα − y k ) and χ(α) := log(kxαk ). In essence, this corresponds to selecting

α∗ = argmax θ(α), α∈(0,αmax) where θ : (0, αmax) → R ∪ {∞} denotes the signed curvature defined as [60] χ00(α)κ0(α) − χ0(α)κ00(α) θ(α) := 3 . (2.14) (χ0(α)2 + κ0(α)2) 2

In particular, for Tikhonov regularisation, (2.14) may be simplified to an expression devoid of second derivatives; namely, if we set

δ δ 2 δ 2 %(α) := kAxα − y k and υ(α) := kxαk , (2.15) and observing that we can write Z ∞ 0 ∂ ∗ −1 ∗ δ 2 ∂ 1 ∗ δ 2 υ (α) = k(A A + αI) A y k = 2 dkEλA y k ∂α ∂α 0 (λ + α) Z ∞ Z ∞ ∂ λ δ 2 λ δ 2 = 2 dkFλy k = −2 3 dkFλy k , 0 ∂α (λ + α) 0 (λ + α) and similarly again,

Z ∞ 2 Z ∞ 0 ∂ α δ 2 αλ δ 2 % (α) = 2 dkFλy k = 2 3 dkFλy k , 0 ∂α (λ + α) 0 (λ + α) we have the identity %0(α) = −αυ0(α). Moreover, we may define the so-called logarithmic derivatives of % and υ by

∂ %0 %00% − (%0)2 ∂ υ0 υ00υ − (υ0)2 κ00(α) = = and χ00(α) = = . ∂α % %2 ∂α υ υ2 Furthermore, ∂ %00 = (−αυ0) = −υ0 − αυ00, ∂α and we may use all of the above to rewrite (2.14) as

υ(α)%(α) %(α)υ(α) + αυ0(α)%(α) + α2υ0(α)υ(α) θ(α) = 0 3 , (2.16) |υ (α)| (%(α)2 + α2υ(α)2) 2 as was observed in [55, 63, 91, 151]. The described L-curve method is what may be called a “graphical” method and cannot be analysed in the same manner as the ψ-based methods. Note that other related methods include

38 the so-called V-curve [40], U-curve [93, 94] and more recently, Q-curve [135] (see Section 1.4). Indeed, the V-curve selects the parameter similarly to the L-curve rule by minimising the function

 ∂  α ∂α κ(α) θV (α) = ∂ . α ∂α χ(α)

Preceding the ψL method was Reginka’s rule (cf. [136]) which composed of minimising the functional:

δ δ τ δ δ ψR(α, y ) = kxαk kAxα − y k. (τ > 0)

Indeed, one can also find in [35, Proposition 4.37] the result that α∗ = δ argmin ψR(α, y ) if and only if α∗ = argmax θ(α). However, the choice of τ in ψR is somewhat of a cumbersome dilemma. Moreover, it was also ob- served in [83] that Reginska’s method is not subject to the same analysis as the other ψ-based methods (e.g. the HD, HR or QO rules) in the sense that ψR is neither subadditive, nor can it satisfy a noise restriction `ala (2.17); thus the motivation for the “new” L functionals, which are also motivated by the following proposition [91]: Proposition 8. We have υ(α) %(α) θ(α) = C (ζ) − C (ζ), ζ := , α|υ0(α)| 1 2 αυ(α) where C1,C2 are positive constants depending on ζ and satisfy 2 1 0 ≤ C1(ζ) ≤ √ , and 0 ≤ C2(ζ) ≤ √ , 3 3 2 for all α ∈ (0, αmax). Proof. The expression (2.16) can easily be rewritten as (2.15) with ζ2 ζ(1 + ζ) c1(ζ) = 3 , c2(ζ) = 3 . (ζ2 + 1) 2 (ζ2 + 1) 2 √ By elementary calculus, we may find the maxima for c1 at ζ = 2 and for c2 at ζ = 1 yielding the upper bounds. Upon observation of the expression in Proposition 8, and the realisation that the function α 7→ υ(α)/α|υ0(α)| is the sole contributer to large values of θ (recalling that the L-curve method maximises θ), it follows that one may in effect “simplify” the task of computing α∗ by instead minimising the reciprocal of the aforementioned function: the direct result is the simple-L ratio rule. The simple-L rule is then derived from the observation of the previous one being a ratio rule.

39 Convergence Analysis Arguably the most important part of this section is the convergence analysis of the mentioned rules. For this, we first prove some more preliminary bounds before finally providing theorems which give convergence rates for the total error with respect to each heuristic parameter choice rule. First we show that if the parameter α is selected according to a heuristic parameter choice rule, then it may be estimated from above by the corre- sponding ψ-functional: Proposition 9. Let δ α∗ = argmin ψ(α, y ), α∈(0,αmax) with ψ ∈ {ψHD, ψHR, ψQO, ψL}. Then

 2 δ δ Cψ (α, y ), if ψ = ψHD and ∃ C1,C2 > 0 s.t. ky k ≥ C,  ∗ δ  or ψ = ψL and ∃ C s.t. kA y k ≥ C, α∗ ≤ Cψ(α, yδ), if ψ = ψ and ∃ C s.t. kyδk ≥ C,  HR  ∗ δ or ψ = ψQO and ∃ C > 0 s.t. kA y k ≥ C, for all yδ ∈ Y . Proof. Note that we may write √  αk(AA∗ + αI)−1yδk, if ψ = ψ ,  HD  3  ∗ − 2 δ δ αk(AA + αI) y k, if ψ = ψHR, ψ(α, y ) = ∗ −2 ∗ δ αk(A A + αI) A y k, if ψ = ψQO,  √ ∗ − 3 ∗ δ  αk(A A + αI) 2 A y k, if ψ = ψL. Now, notice that for s ≥ 0, we have k(AA∗ + αI)s(AA∗ + αI)−syδk ≤ k(AA∗ + αI)skk(AA∗ + αI)−syδk, i.e., δ ∗ −s δ ky k C k(AA + αI) y k ≥ ∗ s ≥ , k(AA + αI) k Cs ∗ s where k(AA + αI) k ≤ Cs. Then the result subsequently follows for the 3 heuristic discrepancy and Hanke-Raus functionals with s = 1 and s = 2 , respectively. The estimate for the quasi-optimality and simple-L functionals follow simi- larly, as

∗ δ      δ t kA y k C 3 1 ψQO/L(α, y ) ≥ α ∗ s ≥ , s ∈ , 2 , t ∈ , 1 k(A A + αI) k Cs 2 2

∗ s with k(AA + αI) k ≤ Cs.

40 Corollary 2. Let α∗ be selected as in Proposition 9. Then α∗ → 0 as δ → 0. Proof. It follows from Propositions 9 and 4 that

s δ s δ α∗ ≤ ψ (α∗, y ) ≤ ψ (α, y ) → 0, s ∈ {1, 2} as δ → 0.

Noise Restriction Due to the ever ominous Bakushinkii veto (cf. Propo- sition 3), we would like to restrict the set of admissible noise in order to satisfy the auto-regularisation condition (1.38), thereby proving convergence of the method. In particular, whilst the former condition is quite abstract, a weaker and incidentally more insightful condition was given; namely the so-called Muckenhoupt condition, which requires that the noise belongs to the following set:

 Z ∞ Z α  p −1 2 p−1 2 Np := e ∈ Y | α λ dkFλek ≤ C λ dkFλek ∀ α ∈ (0, αmax) . α 0 (2.17) Notice that for (2.17) to hold, there should be sufficiently many high fre- quency components (for the right-hand side to dominate). Therefore, this condition tells us that the noise should be sufficiently irregular; i.e., very smooth perturbations of the noise do not satisfy the Muckenhoupt condi- tion. The observant reader might recall that this is the “irony” we referred to in the description of degrees of ill-posedness in Chapter 1. This yields the following proposition (cf. [83,87]):

Proposition 10. Let ( N , if ψ ∈ {ψ , ψ }, e = y − yδ ∈ 1 HD HR N2, if ψ ∈ {ψQO, ψL}.

Then δ δ kxα − xαk ≤ ψ(α, y − y ), (2.18) δ for all α ∈ (0, αmax) and y, y ∈ Y . Proof. Recall that the data propagation error may be represented in its spec- tral form by Z ∞ δ 2 λ δ 2 kxα − xαk = 2 dkFλ(y − y )k . (2.19) 0 (λ + α) The idea of the proof of this theorem is to split the above integral into two parts: λ ∈ (0, α) and λ ∈ (α, ∞) which will allow us to utilise the Muckenhoupt condition (2.17) above to either achieve an appropriate bound

41 for the data propagation error from above and then to subsequently connect it with a bound for the ψ-functional acting on the noise from below. Since Z ∞ Z ∞ λ δ 2 −1 δ 2 2 dkFλ(y − y )k ≤ C λ dkFλ(y − y )k , (2.20) α (λ + α) α and Z α Z α λ δ 2 1 δ 2 2 dkFλ(y − y )k ≤ C 2 λ dkFλ(y − y )k (2.21) 0 (λ + α) α 0 Z α 1 δ 2 ≤ C dkFλ(y − y )k , (2.22) α 0 it follows from the above estimates that (2.19) may be bounded as

 Z α Z ∞  δ 2 1 δ 2 −1 δ 2 kxα − xαk ≤ C 2 λ dkFλ(y − y )k + λ dkFλ(y − y )k α 0 α  Z α Z ∞  1 δ 2 −1 δ 2 ≤ C dkFλ(y − y )k + λ dkFλ(y − y )k α 0 α Recall the Muckenhoupt condition (2.17), which allows us to bound the sec- ond term of the inequality above with p = 1 and p = 2, respectively, yielding  1 Z α  dkF (y − yδ)k2, if p = 1, Z ∞ α λ λ−1 dkF (y − yδ)k2 ≤ C 0 (2.23) λ 1 Z α α  λ dkF (y − yδ)k2, if p = 2.  2 λ α 0

In case ψ ∈ {ψQO, ψL}, we observe, similarly as in [83,87,91], that

Z α 2 2 δ α λ δ 2 ψQO(α, y − y ) ≥ 4 dkFλQ(y − y )k 0 (λ + α) Z α Z α λ δ 2 ¯ 1 δ 2 ≥ C 2 dkFλQ(y − y )k ≥ C 2 λ dkFλQ(y − y )k , 0 (λ + α) α 0 and Z α Z α 2 δ αλ δ 2 λ δ 2 ψL(α, y − y ) ≥ 3 dkFλ(y − y )k ≥ C 2 dkFλ(y − y )k . 0 (λ + α) 0 α so that with (2.23), where p = 2, we achieve the desired estimate. In case ψ ∈ {ψHD, ψHR}, we follow the example of [83] and find that Z α Z α 2 δ α δ 2 C δ 2 ψHD(α, y − y ) ≥ 2 dkFλ(y − y )k ≥ dkFλ(y − y )k 0 (λ + α) α 0

42 and Z α 2 Z α 2 δ α δ 2 1 δ 2 ψHR(α, y − y ) ≥ 3 dkFλ(y − y )k ≥ dkFλ(y − y )k . 0 (λ + α) α 0 Then similarly as for the two previous parameter choice rules, we use (2.23) with p = 1 to complete the proof. We also bound the ψ-functional acting on the noise from above (cf. [83,87]): Proposition 11. We have  δ √ , if ψ ∈ {ψHD, ψHR} ψ(α, y − yδ) ≤ C α  δ kxα − xαk, if ψ ∈ {ψQO, ψL},

δ for all α ∈ (0, αmax) and y, y ∈ Y .

Proof. For ψ = ψHD, one can immediately estimate that Z ∞ 2 2 δ 1 α δ 2 ψHD(α, y − y ) = 2 dkFλ(y − y )k α 0 (λ + α) Z ∞ 2 1 δ 2 δ ≤ dkFλ(y − y )k ≤ . α 0 α

If ψ = ψHR, then Z ∞ 2 2 δ α δ 2 ψHR(α, y − y ) = 3 dkFλ(y − y )k 0 (λ + α) Z ∞ 2 1 δ 2 δ ≤ dkFλ(y − y )k ≤ C . 0 λ + α α

For ψ = ψQO, we have

Z ∞ 2 2 δ α λ δ 2 ψQO(α, y − y ) = 4 dkFλ(y − y )k 0 (λ + α) Z ∞ 2 α λ δ 2 = 2 · 2 dkFλ(y − y )k 0 (λ + α) (λ + α) Z ∞ λ δ 2 δ 2 ≤ 2 dkFλ(y − y )k = kxα − xαk , 0 (λ + α) and for ψ = ψL, the estimate follows similarly as for the quasi-optimality functional, since the filter function for the simple-L rule satisfies αλ λ Φ (λ) = ≤ , α (α + λ)3 (α + λ)2 thus completing the proof.

43 Now, the next proposition proves that the ψ-functionals may be bounded from above by the approximation error or, at least, a similar estimate:

Proposition 12. If ψ ∈ {ψHR, ψQO}, then we have

† ψ(α, y) ≤ Ckxα − x k. (2.24)

If ψ ∈ {ψHD, ψL} and (2.5) holds, then

sZ ∞   α † 2 µ 1 ψ(α, y) ≤ dkEλx k ≤ Cα , µ ≤ 0 λ + α 2 for all α ∈ (0, αmax) and y ∈ Y .

Proof. Let ψ = ψHD. Then Z ∞ Z ∞ 2 αλ † 2 λ α † 2 ψHD(α, y) = 2 dkEλx k = dkEλx k 0 (λ + α) 0 λ + α λ + α Z ∞ α † 2 ≤ dkEλx k , 0 λ + α

λ from which the result follows thanks to (2.5), courtesy of the fact that λ+α ≤ 1 for all λ ≥ 0. Note that the rest of the estimates in this proof also follow from the aforementioned upper bound. For ψ = ψHR, it easily follows that

Z ∞ 2 Z ∞ 2 2 α λ † 2 α λ † 2 ψHR(α, y) = 3 dkEλx k ≤ 2 dkEλx k 0 (λ + α) 0 (λ + α) λ + α † 2 ≤ kxα − x k .

If ψ = ψQO, then it is immediate that

Z ∞ 2 2 Z ∞ 2 2 2 α λ † 2 α λ † 2 ψQO(α, y) = 4 dkEλx k = 2 2 dkEλx k 0 (λ + α) 0 (λ + α) (λ + α) † 2 ≤ kxα − x k .

For ψ = ψL, it follows easily that

Z ∞ 2 Z ∞ 2 αλ † 2 α † 2 ψL(α, y) = 3 dkEλx k ≤ dkEλx k , 0 (λ + α) 0 λ + α

µ 1 and this expression is the order of α (for µ ≤ 2 ) which follows similarly via (1.20), and this completes the proof, as the remaining details are identical as for the HD rule.

44 Remark. We remark that in order to bound the HD functional from above by the approximation error, as one can do for the HR and QO rules, the following regularity condition was considered in [83]:

Z ∞ 2 Z ∞ 2 α † 2 α † 2 λ 2 dkEλx k ≤ Cα 2 dkEλx k , (2.25) α (λ + α) α (λ + α) for all α ∈ (0, αmax), with which one can prove: Z α Z ∞ 2 αλ † 2 ψHD(α, y) = + 2 dkEλx k 0 α (λ + α) Z α 2 Z ∞ 2 α † 2 α λ † 2 ≤ 2 dkEλx k + 2 dkEλx k 0 (λ + α) α (λ + α) α (2.25) Z α 2 Z ∞ 2 α † 2 α † 2 ≤ 2 dkEλx k + C 2 dkEλx k 0 (λ + α) α (λ + α) † 2 ≤ max{1,C}kxα − x k .

However, the inequality (2.25) is very restrictive as it is rarely satisfied in practice, so we will not consider it in general. In the absence of any source conditions, we now have all the tools to prove that when the parameter is selected according to a heuristic rule, the corre- sponding total error converges as the noise level decays to zero: Theorem 5. Let δ α∗ = argmin ψ(α, y ), α∈(0,αmax) with ψ ∈ {ψHD, ψHR, ψQO, ψL} and let ( N , if ψ ∈ {ψ , ψ }, y − yδ ∈ 1 HD HR N2, if ψ ∈ {ψQO, ψL},

δ ∗ δ and suppose ky k 6= 0 in case ψ ∈ {ψHD, ψHR}, and kA y k 6= 0 in case ψ ∈ {ψQO, ψL}. Then δ † kxα∗ − x k → 0, as δ → 0. Proof. We have

δ † δ † (2.18) δ †  kxα∗ − x k ≤ kxα∗ − xα∗ k + kxα∗ − x k = O ψ(α∗, y − y ) + kxα∗ − x k (1.39) δ †  = O ψ(α, y ) + ψ(α∗, y) + kxα∗ − x k ,

δ δ where we have used the optimality of α∗; namely, that ψ(α∗, y ) ≤ ψ(α, y ) for all α ∈ (0, αmax) (cf. (1.32)).

45 If ψ ∈ {ψQO, ψL}, then it follows from Proposition 12 that δ † δ †  kxα∗ − x k = O ψQO/L(α, y ) + kxα∗ − x k δ †  = O ψQO/L(α, y − y ) + ψQO/L(α, y) + kxα∗ − x k δ †  = O kxα − xαk + kxα − x k + φ(α∗) , with a function φ satisfying φ(α) → 0 as α → 0 (cf. (1.21)), and since α∗ → 0 due to Corollary 2, it follows that choosing any α = α(δ) such that α → 0 and δ2/α → 0 as δ → 0 yields the desired result. If ψ = ψHD, it also follows from Propositions 11 and 12, that  δ  kxδ − x†k = O √ + α + α , α∗ α ∗ and the result follows selecting α as before (cf. Corollary 2). If we now consider the source condition (2.5), then, in addition to being able to prove convergence of the error, we may also prove (suboptimal) conver- gence rates. However, before doing so, we first give the following proposition which, like Proposition 9, provides an estimate for the ψ-functionals from below, this time by the approximation error and in terms of the smoothness index µ: Proposition 13. Let (2.5) hold with µ ∈ [0, 1) and assume kA∗Ax†k 6= 0. Then  2µ 1 † ψ (α, y), if ψ ∈ {ψHD, ψL} and µ ≤ , kxα − x k ≤ C 2 µ ψ (α, y), if ψ ∈ {ψHR, ψQO}, for all α ∈ (0, αmax) and y ∈ Y .

Proof. If ψ = ψL, we can estimate Z ∞ 2 Z ∞ 2 αλ † 2 α 2 † 2 ψL(α, y) = 3 dkEλx k ≥ 2 3 λ dkEλx k 0 (λ + α) (kAk + αmax) 0 ∗ † 2 = C1k(A A)x k α, and conversely, we have that  µ † µ 1 2 kxα − x k ≤ C2α ≤ C2 ∗ † 2 ψL(α, y) C1k(A A)x k C = ψ2µ(α, y), k(A∗A)x†k2µ L

If ψ = ψQO, we estimate, analogously, Z ∞ 2 2 2 α λ † 2 ∗ † 2 2 ψQO(α, y) = 4 dkEλx k ≥ Ck(A A)x k α , 0 (λ + α)

46 by which we similarly arrive at the estimate:

† µ µ kxα − x k ≤ Cα ≤ CψQO(α, y). Note that the proofs for the heuristic discrepancy and Hanke-Raus rule follow in exactly the same fashion. The rates for the HD rule match that of the L rule and the Hanke-Raus matches the quasi-optimality rule’s rates in this instance due to the order of the α factor in the respective filter functions. We remark that in the above proposition, we can see that the HD and L rules 1 both exhibit the early saturation effect, as the result only holds for µ ≤ 2 rather than 1. This also has a “knock-on” effect on the subsequent results, namely, the main convergence rates result of this section, which we provide via the following corollary:

Corollary 3. In addition to the conditions of Theorem 5, let the source condition (2.5) hold with µ ∈ [0, 1). Then   4µ µ 1 O δ 2µ+1 , if ψ ∈ {ψHD, ψL} and µ ≤ , δ † 2 kxα − x k = ∗  2µ µ O δ 2µ+1 , if ψ ∈ {ψHR, ψQO}, as δ → 0.

Proof. Similar as in the proof of the previous theorem, we have

δ † δ †  kxα∗ − x k = O ψ(α, y ) + ψ(α∗, y) + kxα∗ − x k , which follows from the optimality of α∗, cf. (1.32) and the inequality (1.39). For ψ ∈ {ψHD, ψL}, by Propositions 11 and 12, we have

 δ  kxδ − x†k = O √ + αµ + αµ with µ ≤ 1 α∗ α ∗  2µ  2µ+1 2µ δ = O δ + ψHD/L(α, y )

 4µ µ = O δ 2µ+1 ,

2 δ where the second equality follows from the fact that α∗ ≤ ψHD/L(α∗, y ) with 1 µ ≤ 2 (cf. Prop. 9). For ψ ∈ {ψHR, ψQO}, the proofs follow similarly, except that since α∗ ≤ δ ψ(α∗, y ) for those rules, we get

 2µ   2µ  δ † 2µ+1 µ δ 2µ+1 µ kxα∗ − x k = O δ + ψHR/QO(α, y ) = O δ , thereby completing the proof.

47 Remark. Note that the heuristic discrepancy and simple-L rules exhibit different rates to the Hanke-Raus and quasi-optimality rules. Moreover, if one wanted to prove that, say the HD rule satisfied the sharper bound (2.24), then one would have to require an additional condition, namely (2.25), which we have already stated is somewhat restrictive (and thus we do not consider it in general). For certain problems, however, the HD and simple-L rules 1 are optimal, e.g., whenever µ = 2 . On the other hand, for µ = 1, the QO and HR rules yield optimal rates. Thus, for solutions with low smoothness, the HD and L rules in fact offer themselves as the better alternatives. For numerical implementations and results of the aforementioned rules, we refer the reader to [9,53,126]. As stated already, the convergence rates of the previous corollary are subop- timal and in order to prove optimal rates (i.e., rates which match (2.8)), we require an additional regularity condition [83,87]: Z ∞ Z α 2 −2 † 2 † 2 α λ dkEλx k ≥ C dkEλx k , (2.26) α 0 for all α ∈ (0, αmax). Note, however, that we also have the weaker condition (cf. [91]): Z α Z ∞ † 2 −1 † 2 dkEλx k ≤ Cα λ dkEλx k , (2.27) 0 α which suffices for the HD and simple L-curve rules in the following Lemma 1. Condition (2.27) is the weaker of the above two conditions in the sense that (2.26) implies (2.27). Considering the addition of the latest regularity condition, we now state the following lemma which allows us to prove that the ψ-functional for the considered heuristic parameter choice rules may be bounded from below by the approximation error [83,87]:

Lemma 1. For µ ∈ [0, 1), let (2.26) hold for ψ ∈ {ψHR, ψQO} and (2.27) hold for ψ ∈ {ψHD, ψL}. Then we have † ψ(α, y) ≥ Ckxα − x k, for all y ∈ Y and α ∈ (0, αmax). Proof. First, we decompose the spectral form of the approximation error as the sum of two integrals: Z α Z ∞ 2 † 2 α † 2 kxα − x k = + 2 dkEλx k . 0 α (λ + α) Now, similarly as in [87], we estimate the second integral of the sum as: Z ∞ 2 Z ∞ α † 2 2 −2 † 2 2 dkEλx k ≥ Cα λ dkEλx k α (λ + α) α (2.26) Z α Z α 2 † 2 α † 2 ≥ C dkEλx k ≥ C 2 dkEλx k . (2.28) 0 0 (λ + α)

48 Thus, with the above, we obtain an upper bound for the overall approxima- tion error: Z ∞ 2 † 2 α † 2 kxα − x k ≤ C 2 dkEλx k . α (λ + α)

Hence, for all ψ ∈ {ψHD, ψHR, ψQO, ψL}, it suffices to prove that

Z ∞ 2 2 α † 2 ψ (α, y) ≥ 2 dkEλx k . α (λ + α)

For ψ = ψQO, we can estimate

Z ∞ 2 2 Z ∞ 2 2 α λ † 2 α † 2 ψQO(α, y) ≥ 2 dkEλx k ≥ C 2 dkEλx k , α (λ + α) α (λ + α) and similarly for ψ = ψHR:

Z ∞ 2 Z ∞ 2 2 α † 2 α † 2 ψHR(α, y) ≥ 3 λ dkEλx k ≥ 2 dkEλx k . α (λ + α) α (λ + α)

Note that for ψ ∈ {ψHD, ψL}, we may use the weaker condition (2.27). In particular, we can bound

Z α 2 Z α 2 Z α α † 2 α † 2 † 2 2 dkEλx k ≤ 2 dkEλx k ≤ dkEλx k . (2.29) 0 (λ + α) 0 α 0 Thus,

Z α 2 Z ∞ 2 † 2 α † 2 λ α † 2 kxα − x k ≤ 2 dkEλx k + C 2 dkEλx k 0 (λ + α) α α (λ + α) Z ∞ 2 λ α † 2 ≤ C 2 dkEλx k , α α (λ + α) where the first inequality follows from the fact that λ/α ≥ 1 for all λ ≥ α, and the second inequality follows from a combination of (2.29) and (2.27). For ψ = ψHD, the proof follows from the observation that

Z ∞ Z ∞ 2 2 αλ † 2 α † 2 ψHD(α, y) ≥ 2 dkEλx k ≥ 2 dkEλx k , α (λ + α) α (λ + α) whereas for ψ = ψL, we observe that

Z ∞ 2 Z ∞ 2 αλ † 2 1 α † 2 ψL(α, y) ≥ 3 dkEλx k ≥ dkEλx k , α (λ + α) 2 α λ from which the result follows via (2.27).

49 Now we state the optimal convergence rates theorem for the heuristic pa- rameter choice rules which, more or less, combines all of the above results in this section:

Theorem 6. Let the source condition (2.5) hold with µ ∈ [0, 1), with the 1 δ additional restriction that µ ≤ 2 for ψ ∈ {ψHD, ψL}, and let y − y ∈ N1 δ for ψ = {ψHD, ψHR} and y − y ∈ N2 whenever ψ ∈ {ψL, ψQO}. Suppose, in addition, that the regularity condition (2.27) holds for ψ ∈ {ψHD, ψL} and that (2.26) holds for ψ ∈ {ψHR, ψQO}. Then

 2µ  δ † 2µ+1 kxα∗ − x k = O δ , as δ → 0.

Proof. We first assume that for δ sufficiently small there exists anα ¯ ∈ (0, αmax) such that

δ †  δ † kxα¯ − xα¯k + kxα¯ − x k = inf kxα − xαk + kxα − x k . α∈(0,αmax)

† Assumeα ¯ ≥ α∗. Then it follows, since α 7→ kxα − x k is a monotonically † † increasing function, that kxα∗ − x k ≤ kxα¯ − x k. Now, we estimate

δ δ δ  kxα∗ − xα∗ k = Cψ(α∗, y − y ) ≤ C ψ(α, y ) + ψ(α∗, y)  1 ψ(α, y − yδ) + ψ(α, y) + αµ , if ψ ∈ {ψ , ψ }, µ ≤ , ≤ C ∗ HD L 2  δ †  ψ(α, y − y ) + ψ(α, y) + kxα∗ − x k , if ψ ∈ {ψHR, ψQO}.    δ µ µ O √ + α + α , if ψ = ψHD,  α ∗   δ µ µ O kx − xαk + α + α , if ψ = ψL, = α ∗  δ  O √ + kx − x†k + kx − x†k , if ψ = ψ ,  α α∗ HR  α  δ † †  O kxα − xαk + kxα − x k + kxα∗ − x k , if ψ = ψQO.    δ µ O √ +α ¯ , if ψ = ψHD,  α¯   δ µ O kx − xα¯k +α ¯ , if ψ = ψL, = α¯  δ  O √ + kx − x†k , if ψ = ψ ,  α¯ HR  α¯  δ †  O kxα¯ − xα¯k + kxα¯ − x k , if ψ = ψQO, where we have used the estimates for the ψ-functionals from Propositions 11, † † µ µ 12, and forα ¯ ≥ α∗, we recall kxα∗ − x k ≤ kxα¯ − x k and α∗ ≤ α¯ .

50 Moreover, continuing with the case thatα ¯ < α∗: † δ δ  kxα∗ − x k ≤ Cψ(α∗, y) ≤ C ψ(α, y ) + ψ(α∗, y − y )    δ µ δ O √ + α + √ , if ψ = ψHD,  α α  ∗  δ µ δ  O kxα − xαk + α + kxα∗ − xα∗ k , if ψ = ψL, =    δ † δ O √ + kxα − x k + √ , if ψ = ψHR,  α α∗   δ † δ  O kxα − xαk + kxα − x k + kxα∗ − xα∗ k , if ψ = ψQO,    δ µ O √ +α ¯ , if ψ = ψHD,  α¯   δ µ O kx − xα¯k +α ¯ , if ψ = ψL, = α¯  δ  O √ + kx − x†k , if ψ = ψ ,  α¯ HR  α¯  δ †  O kxα¯ − xα¯k + kxα¯ − x k , if ψ = ψQO, where we have used the estimate from Lemma 1 which give a bound for the approximation error by the ψ-functionals , and forα ¯ < α∗, since α 7→ δ δ kxα−xαk is a monotonically decreasing function, it follows that kxα∗ −xα∗ k ≤ δ kxα¯ − xα¯k. Thus, for all α ∈ (0, αmax), we have    δ µ O √ +α ¯ , if ψ = ψHD,  α¯   δ µ O kx − xα¯k +α ¯ , if ψ = ψL, kxδ − x†k = α¯ α∗  δ  O √ + kx − x†k , if ψ = ψ ,  α¯ HR  α¯  δ †  O kxα¯ − xα¯k + kxα¯ − x k , if ψ = ψQO.  2µ  = O δ 2µ+1 , as δ → 0, which completes the proof. Remark. One may observe in the proof of the above theorem that the quasi- optimality rule actually satisfies δ †  δ † kxα∗ − x k ≤ C inf kxα − xαk + kxα − x k , α∈(0,αmax) which is quite remarkable as the upper bound above is almost the most optimal rate that is possible. We may organise the above four rules into a table in terms of early saturation and required noise restriction as follows: 1 Early Saturation: µ ≤ 2 No Early Saturation: µ < 1 Np with p = 1 HD HR Np with p = 2 L QO

51 where the required noise restrictions were demonstrated in Proposition 10 and the early saturation effect of the HD and L rules was first observed when bounding the approximation error in terms of the smoothness index µ. Namely, we saw in Proposition 13 that we had to restrict the range of 1 possible µ to µ ∈ [0, 2 ). The Hanke-Raus and quasi-optimality rules are thus a kind of “remedy” to this effect for the heuristic discrepancy and L rules, respectively, as they are closely related. We remark that for a-posteriori parameter choice rules, the discrepancy (1.29) and improved discrepancy (1.40) principles have a similar relationship in that the latter is also the “remedy” for the former (with respect to saturation). Returning to the topic of heuristic parameter choice rules, we mention that in spite of being “remedies” regarding saturation, the noise restrictions required for the L and quasi-optimality rules are more stringent compared to the somewhat weaker restriction required for the convergence of the other two rules.

2.2 Weakly Bounded Noise

This section is largely analogous with the paper [89]. In this scenario, we deviate from the classical theory somewhat and consider the case in which the noise is unbounded in the image space, but satisfies another constraint which qualifies it to be named as in the title of this section. We describe this in the following definition: Definition 8. If the noise e = y − yδ satisfies  1 τ := k(AA∗)pek < ∞, with p ∈ 0, , 2 then we say that the noise is weakly bounded [29,30]. That is to say that the noise belongs to the Hilbert space Z, which can be defined as the completion of range A with respect to k(AA∗)p · k complemented by range A⊥ (cf. [89]). The above noise model may be interpreted as a deterministic treatment of the white noise model, which is studied extensively in the stochastic treatment of ill-posed problems [88]. The question of whether one may still regularise, in the determistic sense, in the presence of this large, i.e., weakly bounded noise can be verified by the proceeding proposition [51]: Proposition 14. Suppose that e ∈ Z. Then the Tikhonov regularised solu- δ tion xα ∈ X is well defined. Proof. Notice that

δ ∗ ∗ −1 δ kxαk = kA (AA + αI) y k ≤ kA∗(AA∗ + αI)−1yk + kA∗(AA∗ + αI)−1(y − yδ)k.

52 The first term is well defined, thus we proceed to show that the second term is also finite Z ∞ ∗ 1/2 ∗ −1 δ 2 1 ∗ 1/2 δ 2 k(AA ) (AA + αI) (y − y )k = 2 dkFλ(AA ) (y − y )k 0 (λ + α) 2 1 ∗ 1/2 δ 2 τ ≤ sup 2 k(AA ) (y − y )k = 2 < ∞, λ∈(0,kAk2) (λ + α) α

1 1 ∗ 2 ∗ p with p = 2 , which proves existence, since k(A A) ek ≤ Ck(A A) ek for 1 p ≤ 2 . We would like to replicate the analogous data error estimates as we had in case the noise was bounded in Y . Indeed, the following error estimates are courtesy of [28,112]: Proposition 15. There exist constants such that

δ τ δ τ kx − xαk ≤ C and kA(x − xα)k ≤ C , (2.30) α p+ 1 α p α 2 α

δ for all y, y ∈ Z and α ∈ (0, αmax). Proof. We have

Z ∞ 1−2p δ 2 λ 2p δ 2 kxα − xαk = 2 λ dkFλ(y − y)k 0 (λ + α) 1 τ 2 ≤ C k(AA∗)p(yδ − y)k2 = C . p α2p+1 p α2p+1 Moreover,

Z ∞ 2−2p δ 2 λ 2p δ 2 kA(xα − xα)k = 2 λ dkFλ(y − y)k 0 (λ + α) 1 τ 2 ≤ C k(AA∗)p(yδ − y)k2 = C , p α2p p α2p where Cp are positive constants depending on p. The observant reader will notice that the above proposition has omitted ap- proximation error estimates and will realise that as those may be estimated independently of the noise, they are identical to the ones we derived in Propo- sition 6. Thus, with the error estimates, one can even prove convergence of the Tikhonov regularised solution, even in the weakly bounded noise case:

p+ 1 Theorem 7. Choosing α = α(τ) such that α(τ) → 0 and τ/α(τ) 2 → 0 as τ → 0, we have δ † kxα − x k → 0, as τ → 0.

53 Proof. The estimate for the data propagation error of the previous proposi- tion and the classical estimate for the approximation error allows us to bound the error as τ kxδ − x†k ≤ C + Cα → 0, (2.31) α p+ 1 α 2 p+ 1 as τ → 0, if we choose α such that τ/α 2 → 0 and α → 0 as τ → 0. In order to prove convergence rates (which, in this instance, will be in terms of τ), we recall from the usual theory that we require a source condition. To this end, we assume once more that (2.5) holds [28,30]:

Corollary 4. Let x† satisfy the source condition (2.5) with µ ∈ [0, 1). Then 2 choosing α = α(τ) = αopt ∼ τ 2µ+2p+1 , we have

2µ δ † 2µ+2p+1 kxα − x k = O(τ ), as τ → 0.

2 Proof. If we take the infimum of (2.31) over α, then we get α ∼ τ 2µ+2p+1 . Indeed, with this a-priori choice of parameter, one gets the convergence rate above. What we observe is that the above results are true generalisations of the standard theory in the sense that setting p = 0 so that rather than e ∈ Z, we have e ∈ Y , the results are analogous with the standard ones (see Section 2.1).

2.2.1 Modified Parameter Choice Rules In this setting, the discrepancy is not well defined and for this reason, we introduce, from [89]

δ 1 ∗ q δ δ ψMHD(α, y ) : = k(AA ) (Ax − y )k, (2.32) q+ 1 α α 2 δ 1 ∗ q II δ ∗ q δ δ 1 ψMHR(α, y ) : = h(AA ) (Ax − y ), (AA ) (Ax − y )i 2 , (2.33) q+ 1 α,δ α α 2 as the modified heuristic discrepancy and Hanke-Raus rules, respectively, with q ≥ 0, a parameter to be specified. The sharp eyed reader will notice the lack of a modified quasi-optimality rule; but this rule need not be changed as it lived in the domain space and is therefore well defined in any case. The L-curve rule has also been omitted, for the same reason, as well as the fact that its analysis was not included in [89] as it was introduced in a later paper (cf. [91]), and we also forgo that task of analysing it here in the weakly

54 bounded noise case. The linear filter functions (2.11) may be generalised to include the modified parameter functionals above; in particular αmλ2q Φq,m(λ) := , α α2q(λ + α)m+1  q ≥ p and m = 1, if ψ = ψ , (2.34)  MHD with q ≥ p and m = 2, if ψ = ψMHR,  1 q = 2 and m = 3, if ψ = ψQO.

Noise Restriction We may also generalise the Muckenhoupt inequalities, i.e., the set of permissible noise as  Z ∞ Z α  ν+1 −1 2 ν 2 Nν := e ∈ Z | α λ dkFλek ≤ C λ dkFλek ∀ α ∈ (0, αmax) , α 0 (2.35) where ν = 1 when ψ = ψQO and ν = 2q whenever ψ ∈ {ψMHD, ψMHR} [2]. In the following, we state some examples in which the condition (2.35) holds

Case Study of Noise Restriction Note that in the classical situation of (strongly) bounded noise, it has been verified that (2.35) is satisfied in typical situations [87]. Moreover, for coloured Gaußian noise, (2.35) holds almost surely for mildly ill-posed problems [88]. ∗ Suppose A is compact such that AA has eigenvalues {λi} with polynomial decay, and we assume a certain polynomial decay or growth of the noise δ ∗ e = y − y with respect to the eigenfunctions of AA , denoted by {ui}: 1 1 λ = , γ > 0, and |hyδ − y, u i|2 = κ β ∈ , κ > 0. (2.36) i iγ i iβ R Then ∞ ∞ X 1 X 1 kyδ − yk2 = κ , τ 2 = k(AA∗)p(yδ − y)k2 = κ . iβ iβ+2pγ i=1 i=1 If we consider the case of unbounded but weakly bounded noise, i.e., kyδ − yk2 = ∞ but τ < ∞, then the exponents β, p should thus satisfy

β ≤ 1 and β + 2pγ > 1, thus β ∈ (1 − 2pγ, 1]. The inequality in (2.35) can then be written as

ν+1 X γ−β ν+1 X 1 δ 2 κα i = α |hy − y, uii| λi 1 λ ≥α 1≤i≤α− γ i X X 1 ≤ C λν|hyδ − y, u i|2 = Cκ . i i iγν+β λ ≤α 1 i i≥α− γ

55 − 1 Defining N∗ = α γ , we have ( X Z N∗ N γ−β+1 if γ − β > −1, iγ−β ≤ xγ−β dx ≤ C ∗ 1 1 1 if γ − β < −1, 1≤i≤α− γ and  C Z ∞ X 1 1  γν+β−1 if γν + β > 1, γν+β ∼ γν+β dx ∼ N∗ i ∗ x 1 N  i≥α− γ ∞ if γν + β ≤ 1. −γ Since α = N∗ , we arrive at the sufficient inequality

−γ(ν+1)+1+γ−β 1−γν−β N∗ ≤ CN , in the case that γ − β > −1 and γν + β > 1. Since the exponents match, the noise condition is then satisfied. If γν + β ≤ 1, then the inequality is clearly satisfied because of the divergent right-hand side. Thus, the noise condition holds for β < γ + 1. Roughly speaking, this means that the noise should not be too regular (rela- tive to the smoothing of the operator). In particular, the deterministic model of white noise, where β = 0 (no decay) satisfies a noise condition if the op- erator is smoothing. Most importantly, the assumption of a noise condition (2.35) is compatible with a weakly bounded noise situation.

Convergence Analysis The convergence analysis of regularisation methods with standard (non- heuristic) parameter choice rules in the weakly bounded noise setting is well established (cf. [28]). For the analysis of heuristic rules, we follow the exam- ple of [89] and prove the following estimates for the associated functionals:

Lemma 2. Let x† satisfy (2.5) with µ ∈ [0, 1). Then

δ • If y − y ∈ N1, then there exists a positive constant C such that

δ δ δ Ckxα − xαk ≤ ψQO(α, y − y ) ≤ kxα − xαk, (2.37) † ψQO(α, y) ≤ kxα − x k; (2.38)

δ 1 • if y − y ∈ N2q, with q ∈ min{p + 1, 2 − µ}, then there exist positive constants such that

δ δ τ kx − xαk ≤ ψMHD(α, y − y ) ≤ C , (2.39) α p+ 1 α 2 µ ψMHD(α, y) ≤ Cα ; (2.40)

56 δ 3 • if y − y ∈ N2q and q ≤ min{p + 2 , 1 − µ}, then there exist positive constants such that

δ δ τ Ckx − xαk ≤ ψMHR(α, y − y ) ≤ C , (2.41) α p+ 1 α 2 µ ψMHR(α, y) ≤ Cα , (2.42) for all α ∈ (0, αmax). Proof. In the following, we utilise the following useful estimate: that for t ≥ 0, there exists a positive constant such that

λt C ≤ , (2.43) α + λ αmax{1−t,0} for all α, λ ≥ 0. The estimates (2.37),(2.38) do not require the weakly boundedness condition on the noise and therefore the proofs are nigh on identical to the ones found in Chapter 2, Propositions 10, 11 and (12). For m > 0 and with (2.43), we obtain

2(q−q) !m+1 λ2q αm λ2p λ m+1 ≤ α2q (λ + α)(m+1) α2q−m λ + α 1 λ2p λ2p ≤ Cλ2p = C = C , 2q−m+max{1− 2(q−q) ,0}(m+1) max{1+2p,2q−m} 1+2p α m+1 α α

m+1 if q < 2 + p. Moreover,

1+2µ+2q !m+1 λ2q αm 1 λ m+1 λ1+2µ ≤ α2q (λ + α)(m+1) α2q−m λ + α 1 1 ≤ C = C ≤ Cα2µ, 2q−m+max{1− 1+2µ+2q ,0}(m+1) max{−2µ,2q−m} α m+1 α

m if q < 2 − µ. Using the spectral representation, the upper estimates (2.39) (using m = 2) and (2.41) (using m = 3) follow from the first estimate, while (2.40) and (2.42) are obtained similarly by the second one. For the lower bound, we estimate

Z kAk2+ 2 2 δ 1 2q α δ 2 ψMHD(α, y − y ) = 2q+1 λ 2 dkFλ(y − y )k α 0 (λ + α) Z α Z kAk2+ 1 2q δ 2 1 2q−2 δ 2 ≥ C 2q+1 λ dkFλ(y − y )k + C 2q−1 λ dkFλ(y − y )k , α 0 α α (2.44)

57 for all α ∈ (0, αmax), since for λ ≤ α, it follows that α/(λ + α) ≥ C and for λ ≥ α, one has α/(λ + α) ≥ Cα/λ, yielding the estimate above. Conversely, Z α λ kxδ − x k2 = dkF (y − yδ)k2 α α (λ + α)2 λ 0 (2.45) Z α Z kAk2+ 1 λ δ 2 −1 δ 2 ≤ C dkFλ(y − y )k + C λ dkFλ(y − y )k . α 0 α α R α Since 2q − 1 ≤ 0, we observe that the term with 0 in the above inequal- ity is bounded by the corresponding term in (2.44). Thus, using the noise condition, the second term can be bounded by the first one of (2.44). For the lower bound of the modified Hanke-Raus functional, we can estimate

 λ2q 2q 2  if λ ≤ α, λ α  2q+1 ≥ C α α2q (λ + α)3 λ2q−3  if λ ≥ α. α2q−2

δ Now, using N2q and (2.45), we can estimate kxα − xαk by the part of δ ψMHR(α, y − y ) restricted to λ ≤ α. The part for λ ≥ α can then be estimated from below by 0, i.e., Z α δ 1 2q δ 2 ψMHR(α, y − y ) ≥ C 2q+1 λ dkFλ(y − y )k α 0 2 (2.35) Z kAk + −1 δ 2 ≥ C λ dkFλ(y − y )k , α for all α ∈ (0, αmax).

1 Note that our parameters satisfy µ ≥ 0 and p ∈ [0, 2 ], hence, the restrictions 1 on the parameter q reduce to q ≤ 2 − µ or q ≤ 1 − µ for the modified heuristic discrepancy or Hanke-Raus rules, respectively. Since, by definition, q ≥ p must hold in any case, we have as a restriction to the smoothness 1 index that µ ∈ [0, 2 − p] for ψMHD and µ ∈ [0, 1 − p] for ψMHR, respectively. Only then, there exists a possible choice for q that satisfies the conditions of the previous lemma. Observe that the interval for µ is smaller for ψMHD, which also illustrates a saturation effect of the discrepancy-based rules which is well-known in the standard noise case, i.e., p = 0 (see Section 2.1).

Theorem 8. Let x† satisfy the source condition (2.5) and in addition, sup- pose that  y − yδ ∈ N and A∗y 6= 0, ψ = ψ ,  1 QO δ 1 1 ∗ q (y − y ) ∈ N2q, µ ∈ [0, 2 − p], q ∈ [p, 2 − µ], (AA ) y 6= 0, ψ = ψMHD,  δ ∗ q (y − y ) ∈ N2q, µ ∈ [0, 1 − p], q ∈ [p, 1 − µ], (AA ) y 6= 0, ψ = ψMHR.

58 Then   2µ µ O τ 2µ+2p+1 , if ψ = ψQO,    2µ 2µ  δ † 2µ+2p+1 · 1−2q kxα∗ − x k = O τ , if ψ = ψMHD,   2µ · µ  O τ 2µ+2p+1 1−q , if ψ = ψMHR, as τ → 0.

Proof. We treat the different parameter choice rules separately:

• From the definition of α∗ and the triangle inequality, it follows, with 2 α = τ 2µ+2p+1 , that

δ δ δ ψQO(α∗, y ) ≤ ψQO(α, y ) ≤ ψQO(α, y − y) + ψQO(α, y) † δ µ τ  2µ  ≤ kxα − x k + kx − xαk ≤ Cα + C = O τ 2µ+2p+1 . α p+ 1 α 2 By the triangle inequality, (2.37) and (2.38) of Lemma 24,

δ † † δ kxα∗ − x k ≤ kxα∗ − x k + kxα∗ − xα∗ k † δ  = O kxα∗ − x k + ψQO(α∗, y − y ) † δ  ≤ O kxα∗ − x k + ψQO(α∗, y ) + ψQO(α∗, y)  2µ  µ 2µ+2p+1 = O α∗ + τ .

Note that

Z kAk2 2 δ 2 λ δ 2 ψQO(α, y ) ≥ α 2 4 dkFλy k 0 (λ + kAk ) Z kAk2 2 1 δ 2 (2.46) ≥ α 2 4 λ dkFλy k (2kAk ) 0 2 2 1  ∗ ∗ 1 −p  2 ≥ α kA yk − kAA k 2 τ ≥ Cα , (2kAk2)4

for all α ∈ (0, αmax) and τ sufficiently small. Hence for α = α∗, it 2µ follows that α∗ ≤ Cτ 2µ+2p+1 . Therefore, we may deduce that

2µ δ † 2µ+2p+1 µ kxα∗ − x k = O(τ ), as τ → 0.

• Note that from (AA∗)qy 6= 0, we may conclude, as in (2.46), that

1    2µ 2  δ 1 −q α∗ ≤ C ψMHD(α, y ) 2 = O τ 2µ+2p+1 1−2q .

59 Then it follows, as above, from (2.39) and (2.40), that δ † † δ µ δ kxα∗ − x k ≤ kxα∗ − x k + kxα∗ − xα∗ k = O(α∗ + ψMHD(α∗, y − y ))   µ µ τ  2µ 2µ 2µ  = O α + α + = O τ 2µ+2p+1 1−2q + τ 2µ+2p+1 , ∗ 1 +p α 2 as τ → 0. • One may similarly verify that if k(AA∗)qyk ≥ C, then

δ 1 α∗ ≤ CψMHR(α∗, y ) 1−q . Therefore, δ † δ † kxα∗ − x k ≤ kxα∗ − xα∗ k + kxα∗ + x k µ δ  = O α∗ + ψMHR(α, y) + ψMHR(α, y − y )  δ µ 2µ   2µ µ  = O ψMHR(α, y ) 1−q + τ 2µ+2p+1 = O τ 2µ+2p+1 1−q .

For the quasi-optimality rule, one may notice that the above convergence rates are optimal for the saturation case µ = 1, but they are only suboptimal for µ < 1 (similarly as in Section 2.1). Let us further discuss the assumptions in this theorem: for the modified heuristic discrepancy rule, the first condition on q is not particularly restric- 1 1 tive. However, the requirement q ≤ 2 −µ implies that µ ≤ 2 −q, which means 1 that we obtain a saturation at µ = 2 − q. This is akin to the bounded noise 1 case (q = 0), where this method saturates at µ = 2 (see Section 2.1). It is well known that a similar phenomenon occurs for the non-heuristic analogue of this method, namely the discrepancy principle [35]. In contrast to the modified discrepancy rule, we observe that the saturation for the modified Hanke-Raus rule occurs at µ = 1 − q. Hence, again analo- gous to the bounded noise case (and to the non-heuristic case), the modified Hanke-Raus method yields convergence rates for a wider range of smoothness classes. We may, however, impose an additional condition as before in order to achieve an optimal convergence rate. More specifically, since it was independent of the noise, we may consider the condition from the standard theory described before; namely, (2.26).

Theorem 9. Let the assumptions of the previous theorem hold and let α∗ be selected according to either the quasi-optimality, modified heuristic dis- crepancy, or the modified Hanke-Raus rule. Then, assuming the regularity condition (2.26), it follows that  2µ  δ † 2µ+2p+1 kxα∗ − x k = O τ , as τ → 0.

60 Proof. The proof for the quasi-optimality rule is analogous to the standard theory (see Corollary 6), so therefore we omit it. For the modified heuristic discrepancy and Hanke-Raus rules, we show that the regularity condition † implies that ψ(α, y) ≥ Ckxα − x k. Recall that

Z kAk2+ 2 † 2 α † 2 kxα − x k = 2 dkEλx k 0 (α + λ) Z α Z kAk2+ † 2 2 1 † 2 ≤ C dkEλx k + Cα 2 dkEλx k . (2.47) 0 α λ

For λ ≥ α we have the following estimate for m > 0 with a constant Cq,m:

λ2q λαm λ2q λαm αm αm+1 ≥ = C ≥ C . α (λ + α)m+1 λ (λ + λ)m+1 q,m λm q,m λm+1

Thus, for ψ = ψMHD taking m = 1 and for ψ = ψMHR, taking m = 2, we obtain

Z kAk2+  2q m 2 λ λα † 2 ψ (α, y) ≥ m+1 dkEλx k 0 α (λ + α) Z kAk2+  2q m λ λα † 2 ≥ m+1 dkEλx k α α (λ + α) Z kAk2+ 2 α † 2 ≥ C 2 dkEλx k . α λ By the regularity condition, the first integral in the upper bound in (2.47) can be estimated by the second part which agrees up to a constant with the lower bound for ψ(α, y) in both cases. In the proof of Theorem 8, the † µ † estimate kxα∗ − x k ≤ Cα∗ can then be replaced by kxα∗ − x k ≤ Cψ(α∗, y), which leads to the optimal rate.

2.2.2 Predictive Mean-Square Error The predictive mean-square error functional (cf. [89]) is given by

δ δ ψPMS(α, y ) := kAxα − yk.

It is clear that this is not an implementable parameter choice rule as the functional to be minimised requires knowledge of the exact data y. The motivation for studying it, however, becomes apparent when we consider the generalised cross-validation rule. Indeed, if we now turn our attention to finite dimensional ill-conditioned problem

Anx = yn,

61 n for An : X → R , then we can define the generalised cross-validation func- tional by 1 ψ (α, yδ ) := kA xδ − yδ k, GCV n ρ(α) n α n with α ρ(α) := tr (A A∗ + αI )−1 . (2.48) n n n n 2 δ 2 For i.i.d. noise, the expected value of ψGCV(α, yn) − kek estimates the pre- dicted mean-square error functional squared (cf. [105,152]). The predictive mean-square error functional differs from the previous ones in the sense that it has different upper bounds. In fact, from (2.30), one immediately finds that τ 2 ψ2 (α, yδ) ≤ C + Cα2µ+1, PMS α2p 1 δ for µ ≤ 2 . It can easily be seen that for strongly bounded noise ky −yk < ∞, the method fails as it selects α∗ = 0. 2 The minimum of the upper bound is obtained for α = αopt = O(τ 2p+2µ+1 ), but the resulting rate is of the order

h (2µ+1) i2 2 δ 2p+2µ+1 ψPMS(α, y ) ≤ C τ ,

δ which agrees with the optimal rate for the error in the A-norm, i.e., kxα − † δ † x kA := kA(xα − x )k. Thus, for this method, it is not reasonable to bound δ † the functional ψPMS by expressions involving kxα −xαk or kxα −x k. Rather, we try to directly relate the selected regularisation parameter α∗ to the op- timal choice αopt. To do so, we need some estimates from below, although in this case, we will need to introduce a noise condition of a different type and an additional condition on the exact solution. Lemma 3. Suppose that there exists a positive constant C such that yδ −y ∈ Z satisfies Z ∞ 2 δ 2 τ dkFλQ(y − y )k ≥ C 2p−ε , (2.49) α α for all α ∈ (0, αmax) and ε > 0 small. Then

δ τ kA(xα − xα)k ≥ C p− ε . α 2 Proof. From (2.49), one can estimate Z ∞ 2 Z ∞ δ 2 λ δ 2 δ 2 kA(xα − xα)k = 2 dkFλQ(y − y )k ≥ dkFλQ(y − y )k 0 (α + λ) α τ 2 ≥ C . α2p−ε

62 Let us exemplify condition (2.49): for the case in (2.36), we have that ( Z Z N∗ 1−β δ 2 X 1 1 CN∗ if 1 − β > 0, dkFλQ(y − y )k = β ∼ β dx = λ≥α i 1 x C if 1 − β < 0, 1≤i≤N∗

1−β 1 − γ with N∗ = 1 . This gives that the left-hand side is of the order of α . α γ 1−β For (2.49) to hold true, we require that γ ≥ 2p − ε, which means that 1 + εγ ≥ β + 2pγ.

If we now choose p close to the smallest admissible exponent for the weakly bounded noise condition, i.e. 2pγ = 1 − β + εγ, with ε small, then the condition holds. In other words, our interpretation of the stated noise con- dition means that k(AA∗)p(yδ − y)k < ∞ and p is selected as the minimal exponent such that this holds. This noise condition automatically excludes the (strongly) bounded noise case. The example also shows that the desired inequality with ε = 0 cannot be achieved.

1 δ Theorem 10. Let µ ≤ 2 , α∗ be the minimiser of ψPMS(α, y ), assume that the noise satisfies (2.49) and that Ax† 6= 0. Then

( 2µ 2µ+1 2µ+2p+1 2 δ † Cτ , if α∗ ≥ αopt, kx − x k ≤ 2µ 2p+1 α∗ − Cτ 2µ+2p+1 (2p−)(2µ+2p+1) , if α∗ ≤ αopt.

If additionally for some 2 > 0, Z ∞ 2µ−1 2 2µ−1+2 λ dkEλωk ≥ Cα , (2.50) α then for the first case we have

2µ 2µ+1 δ † 2µ+2p+1 2µ+1+2 kxα∗ − x k ≤ Cτ , if α∗ ≥ αopt.

† Proof. If α∗ ≥ αopt, it follows from Ax 6= 0 that

2 2 kAxα − yk ≥ Cα , (2.51) and if (2.50) holds, then one even has that

Z ∞ 1+2µ 2 † 2 λ α 2 kA(xα − x )k ≥ 2 dkEλωk α (α + λ) Z ∞ 2 2µ−1 2 2µ+1+2 ≥ α λ dkEλωk ≥ Cα . (2.52) α

63 δ 2 Since α 7→ kA(xα − xα)k is a monotonically decreasing function and using Young’s inequality [36], we may obtain that

h 2µ+1 i2 t δ 2 2 2µ+2p+1 Cα∗ ≤ kA(xαopt − xαopt )k + kAxαopt − yk ≤ C τ , i.e., 2µ+1 2 α∗ ≤ Cτ 2µ+2p+1 t , where t = 2 or t = 2µ + 1 + 2 if (2.50) holds. If α∗ ≤ αopt, then we may bound the functional from below as 1 ψ2 (α, yδ) ≥ kA(xδ − x )k2 − kAx − yk2, PMS 2 α α α for all α ∈ (0, αmax), which allows us to obtain 1 kA(xδ − x )k2 − kAx − yk2 ≤ ψ2 (α , yδ) ≤ ψ2 (α , yδ) 2 α∗ α∗ α∗ PMS ∗ PMS opt h 2µ+1 i2 δ 2 2 2µ+2p+1 ≤ 2kA(xαopt − xαopt )k + 2kAxαopt − yk ≤ C τ . i.e., by Lemma 3,

2 τ 1 h 2µ+1 i2 2µ+1 δ 2 2 2µ+2p+1 C 2p−ε − Cα∗ ≤ kA(xα∗ − xα∗ )k − kAxα∗ − yk ≤ C τ . α∗ 2

Now, from α∗ ≤ αopt, we get

2 τ h 2µ+1 i2 2µ+2p+1 2µ+1 C 2p−ε ≤ C τ + Cα∗ α∗ h 2µ+1 i2 h 2µ+1 i2 2µ+2p+1 2µ+1 2µ+2p+1 ≤ C τ + Cαopt ≤ C τ , i.e., h 2p i2 2 2p 2p−ε 2µ+2p+1 2µ+2p+1 · 2p−ε α∗ ≥ C τ ⇐⇒ α∗ ≥ Cτ .

Then inserting the respective bounds for α∗ into (1.15) yields the desired rates. Condition (2.50) can again be verified as we did for the noise condition for some canonical examples. The inequality with 2 = 0 does not usu- ally hold. The condition can be interpreted as the claim that x† satisfies a source condition with a certain µ but this exponent cannot be increased, i.e., x† 6∈ range((A∗A)µ+). A similar condition was used by Lukas in his analysis of the generalised cross-validation rule [105]. The theorem shows that we may obtain almost optimal convergence results but only under rather restrictive conditions. Moreover, the method shows a 1 saturation effect at µ = 2 comparable to the heuristic discrepancy rule.

64 2.2.3 Generalised Cross-Validation The generalised cross-validation rule was proposed and studied in particular by Wahba [152], and is most popular in a statistical context [125] but less so for deterministic inverse problems. It is derived from the cross-validation method by combining the associated estimates with certain weights. Impor- tantly, it was shown in [152] that the expected value of the generalised cross- validation functional converges to the expected value of the PMS-functional as the dimension tends to infinity. This is why, in the preceding section, we studied ψPMS in detail.

∗ Proposition 16. Let supn tr(AnAn) < ∞; then it follows that the weight ρ(α) in ψGCV is monotonically increasing with ρ(0) = 0 and bounded with ρ(α) ≤ 1. Furthermore, for α > 0, it follows that ρ(α) → 1 as the dimension n → ∞. Proof. Observe that, from the definition of ρ, namely (2.48), we may derive

n n α 1 X α 1 X λi ρ(α) = tr((A∗ A + αI)−1) = = 1 − n n n n α + λ n α + λ i=1 i i=1 i n 1 X λi = 1 − . (2.53) n α + λ i=1 i Moreover, it is clear that ρ(α) ≤ 1 for all α > 0, since the second term is positive. Additionally, since

n ∞ ∞ X λi X λi ∗ X 1 lim = ≤ sup tr(AnAn) , n→∞ α + λ α + λ n α + λ i=1 i i=1 i i=1 i is clearly bounded, as per the assumption on the supremum in the statement of the proposition, and since the series is clearly convergent, it follows that taking the limit 1/n → 0 as n → ∞ yields that the second term of (2.53) converges to 0 as n → ∞. Thus, ρ(α) → 1 as n → ∞. The fact that ρ(0) = 0 is obvious. This is also the reason why one has to study the GCV in terms of weakly bounded noise. The limit limn→∞ ψGCV tends pointwise to the residual δ δ kAxα − y k, which in the bounded noise case does not yield a reasonable parameter choice as then α∗ = 0 is always chosen. Note that in a stochastic context, and using the expected value of ψGCV, a convergence analysis has been done by Lukas [105]. In contrast, we analyse the deterministic case. We now consider the ill-conditioned problem

Anx = yn, (2.54)

δ n where we only have noisy data yn ∈ R .

65 We impose a discretisation independent source condition; that is, † ∗ µ x = (AnAn) ω, kωk ≤ C, 0 < µ ≤ 1, where C does not depend on the dimension n. Furthermore, let us restate some definitions for this discrete setting: n δ 2 X 2p δ 2 δn := kyn − ynk, τ := λi |hyn − yn, uii| . i=1 Note that in an asymptotically weakly bounded noise case, we might assume that τ is bounded independent of n while δn might be unbounded as n tends to infinity. Moreover, we impose a noise condition of similar type as for the predictive mean-square error, but slightly different: 2 2 X α δ 2 τ 2 |hyn − yn, uii| ≥ C 2p−ε , for all α ∈ I, (2.55) λi α λi≥α where C does not depend on n. This is different from the condition stated in [89], as the author of this thesis found that the condition there was false and it is thus corrected here with arguably a much more restrictive condition (which is nevertheless needed to for the proceeding results of this section). Note that in the discrete case, one must restrict the noise condition to an interval with I = [αmin, αmax] with αmin > 0. We also state a regularity condition

X 2µ−1 2 2µ−1+2 λ |hω, vii| ≥ Cα , for all α ∈ I, (2.56)

λi≥α ∗ where {vi} denote the eigenfunctions of AnAn. In order to deduce convergence rates, we look to bound the functional from above as we did for the other functionals in the previous sections: δ n Lemma 4. For yn ∈ R , there exist positive constants such that C 1 ψ (α, y ) ≤ Cα2µ+1, µ ≤ , (2.57) GCV n ρ(α) 2 C 1 ψ (α, yδ − y ) ≤ δ2, µ ≤ , (2.58) GCV n n ρ(α) n 2 1 1 ψ (α, yδ ) ≤ Cα2µ+1 + δ2 , µ ≤ . (2.59) GCV n ρ(α) n 2 Proof. It is a standard result (cf. [35]) that δ δ δ kAnxα − yn − (Anxα − yn)k ≤ kyn − ynk ≤ δn. 2µ+1 Similarly, by the usual source condition, we obtain k(Anxα − yn)k ≤ Cα 1 for µ ≤ 2 (see Proposition 7). The result follows from the triangle inequality.

66 The proceeding results generally follow from the infinite dimensional setting and we similarly obtain the following bounds from below:

Lemma 5. Suppose that α ∈ I and also that (2.55) holds. Then

1  τ 2  ψ (α, yδ − y ) ≥ C . GCV n n ρ(α) α2p−ε

† Moreover, if kAnx k ≥ C0, with an n-independent constant, then there exists an n-independent constant C with 1 ψ (α, y ) ≥ C α2. GCV n ρ(α)

If (2.56) holds and α ∈ I, then 1 1 ψ (α, y ) ≥ C α2µ+1+2 , µ ≤ . GCV n ρ(α) 2

Proof. We can estimate

2 2 δ 1 X α δ 2 ψGCV(α, yn − yn) ≥ 2 2 |hyn − yn, uii| ρ (α) (λi + α) λi≥α 2 1 X α δ 2 ≥ C 2 2 |hyn − yn, uii| ρ (α) λi λi≥α (2.55) 1  τ 2  ≥ C . ρ2(α) α2p−ε

The remaining two inequalities in the lemma follow analogously to (2.51) and (2.52).

1 δ Theorem 11. Let µ ≤ 2 , assume α∗ is the minimiser of ψGCV(α, yn) and suppose further that α∗ ∈ I such that (2.55) holds. Then

− 1   2p−ε 2 − 2 2 2µ+1 2 2p−ε 2p−ε 2p−ε α∗ ≥ inf (Cα + Cδn) τ ≥ Cδn τ . α≥α∗

On the other hand

1   t 1 2µ+1 2 α∗ ≤ inf Cα + Cδn , α≤α∗ ρ(α) with t = 2. If α∗ ∈ I and (2.56) hold, then the above upper bound on α∗ holds with t = 2µ + 1 + 2.

67 Proof. Take an arbitraryα ¯ and consider first the case α∗ ≤ α¯. Following on from the previous lemmas and using (2.58), we have

 2  1 τ δ δ C 2p−ε ≤ ψGCV(α∗, yn − yn) ≤ ψGCV(α∗, yn) + ψGCV(α∗, yn) ρ(α∗) α∗ δ 1 2µ+1 1 2µ+1 2 1 2µ+1 ≤ ψGCV(¯α, yn) + C α∗ ≤ Cα¯ + δn + C α∗ . ρ(α∗) ρ(¯α) ρ(α∗)

Hence, by the monotonicity of α 7→ α2µ+1 and since ρ is monotonically increasing (cf. Proposition 16), we obtain that

 2  τ ρ(α∗) 2µ+1 2 2µ+1 C 2p−ε ≤ Cα¯ + δn + α∗ α∗ ρ(¯α) 2µ+1 2 2µ+1 2µ+1 2 ≤ Cα¯ + δn + α∗ ≤ Cα¯ + δn .

Hence,

− 1   2p−ε 2 − 2 2 2µ+1 2 2p−ε 2p−ε 2p−ε α∗ ≥ inf (Cα + Cδn) τ ≥ Cδn τ . α≥α∗

Now, suppose α∗ ≥ α¯. Then using that α∗ is a minimiser, and with t as in the statement of the theorem

C t δ δ α∗ ≤ ψGCV(α∗, yn) ≤ ψGCV(α∗, yn) + ψGCV(α∗, yn − yn) ρ(α∗)

1 2µ+1 2 1 2 ≤ (Cα¯ + Cδn) + C δn ρ(¯α) ρ(α∗) 1 1 ≤ (Cα¯2µ+1 + Cδ2) + C δ2. ρ(¯α) n ρ(¯α) n

Hence, as ρ(α∗) is bounded from above by 1, it follows that

1   t 1 2µ+1 2 α∗ ≤ inf Cα + Cδn . α≤α∗ ρ(α)

1 δ Theorem 12. Suppose that µ ≤ 2 , α∗ is the minimiser of ψGCV(α, yn), where α∗ ∈ I such that (2.55) and (2.56) are satisfied. Suppose further that 2 2µ+1 one has ρ(δn ) ≥ C. Then

1 2µ   2p−ε δ † t τ kxα∗ − x k ≤ δn + δn , δn with t as in Theorem 11.

68 Proof. Since δ † µ δn kxα∗ − x k ≤ Cα∗ + C √ , α∗

2 2µ+1 we may take the balancing parameterα ¯ = δn . From the previous theorem, it follows that if α∗ ≤ α¯, then by Theorem 11

2 2 τ 2p−ε  τ  2p−ε α∗ ≥ 1 ≥ . 2µ+1 2 2p−ε δn [infα≥α∗ (Cα + Cδn)]

On the other hand, if α∗ ≥ α¯, and ρ(¯α) ≥ C, then

2 α∗ ≤ Cδ t .

µ √ Thus, taking for α and δn/ α the worst of these estimates, we obtain the desired result. This result establishes convergence rates in the discrete case. However, the required conditions are somewhat restrictive as we need that the selected α∗ has to be in a certain interval (although this is to be expected in a finite- 2 dimensional setting). Note that the term δn in Theorem 11 can be replaced by 2 δ any reasonable monotonically decreasing upper bound for ψGCV(α, yn − yn). 2 δ In particular, if we could conclude that α∗ is in a region where ψGCV(α, yn − τ 2 yn) ≤ C α2p , then we would obtain similar convergence results as for the 2 2µ+1 predictive mean square error. Moreover, the condition that ρ(δn ) > C restricts the analysis to the weakly bounded noise scenario, in which case δn → ∞ as n → ∞. The standard “bounded noise” case is ruled out in Theorem 12, because if δn would tend to zero, then this would lead to a contradiction with Proposition 16. In general, however, the performance of the GCV-rule for the regularisation of deterministic inverse problems is subpar compared to other heuristic rules, e.g., those mentioned in the previous sections; cf., e.g., [9, 53]. This is also illustrated by the fact that we had to impose stronger conditions for the convergence results compared to the aforementioned rules.

2.3 Operator Perturbations

We now consider another deviation from the classical theory and attempt to reprove the standard results for this scenario: suppose that, in addition to noisy data yδ (that is, we return to the usual setting in which ky − yδk ≤ δ), one also only has knowledge of a noisy operator

Aη = A + ∆A,

69 such that kAη − Ak ≤ η. Note that this section is largely based on the paper [51]. We also refer the reader to [103,134,143,150]. The specific situation that we consider here, which is often met in practical situations, is that we suppose we have knowledge of the operator noise level, i.e., we assume η known, but we do not know the level of the data error δ. The regularised solution in this case defined as

δ ∗ −1 ∗ δ xα,η = (AηAη + αI) Aηy , (2.60) and the task is then to determine a parameter α such that

δ † kxα,η − x k → 0, as δ, η → 0. First, we state the following useful lemma from [51] which provides some useful bounds for the subsequent analysis:

1 Lemma 6. Let α ∈ (0, αmax) and for s ∈ {0, 2 , 1}, define ( ( (A∗A )s if s ∈ {0, 1}, (A∗A)s if s ∈ {0, 1}, B := η η B := η,s ∗ 1 s ∗ 1 Aη if s = 2 , A if s = 2 . ˆ ˆ Let Bη,s and Bs be the operators we get from Bη,s and Bs by changing the ∗ ∗ roles of the operators Aη with Aη and A with A , respectively. Then for 1 3 s ∈ {0, 2 , 1} and t ∈ {−1, − 2 , −2}, there exist positive constants Cs,t such that ∗ q ∗ t η (A Aη + αI) Bη,s − (A A + αI) Bs ≤ Cs,t . (2.61) η 1 −s−t α 2

∗ t ˆ ∗ t ˆ η (AηA + αI) Bη,s − (AA + αI) Bs ≤ Cs,t . (2.62) η 1 −s−t α 2 Proof. We prove (2.61), this gives (2.62) changing the roles of the operators ∗ ∗ Aη ↔ Aη and A ↔ A . We recall the elementary estimates 1 k(A∗A + αI)−1k ≤ , α 1 k(A∗A + αI)−1A∗k ≤ √ , (2.63) 2 α k(A∗A + αI)−1A∗Ak ≤ 1,

∗ ∗ which also hold with A and A replaced by Aη and Aη, respectively. For s ∈ {0, 1}, it follows from some algebraic manipulations, the fact that Bs,Bη.s

70 commute with the inverses below, and the previous estimates that

∗ −1 ∗ −1 (AηAη + αI) Bη,s − Bs(A A + αI) ∗ −1  ∗ ∗  ∗ −1 = (AηAη + αI) Bη,s(A A + αI) − (AηAη + αI)Bs (A A + αI) ∗ −1  ∗ ∗  ∗ −1 = (AηAη + αI) Bη,sA A − AηAηBs (A A + αI) ∗ −1 ∗ −1 + α(AηAη + αI) [Bη,s − Bs](A A + αI) .

In the case s = 0 and Bη,0 = B0 = I, we find

∗ ∗ ∗ ∗ ∗ Bη,0A A − AηAηB0 = (A − Aη)A + Aη(A − Aη), which, using (2.63), gives C0,−1 = 1. Similarly, we can prove that C1,−1 = 1. 1 ∗ ∗ 5 For the case s = , if Bη,s = A and Bs = A , we obtain C 1 = with 2 η 2 ,−1 4 minor modifications noting that (A∗A + αI)−1A∗ = A∗(AA∗ + αI)−1. The other cases of t follow in a similar manner by

∗ t ∗ t (AηAη + αI) Bη,s − Bs(A A + αI) ∗ t+1  ∗ −1 ∗ −1 = (AηAη + αI) (AηAη + αI) Bη,s − Bs(A A + αI)  ∗ t+1 ∗ t+1 ∗ −1 + (AηAη + αI) − (A A + αI) Bs(A A + αI) ,

3 and by using (2.63) and the result for t = −1. For t = − 2 , we employ an additional identity from semigroup operator calculus [92],

1 1 ∗ − 2 ∗ − 2 (AηAη + αI) − (A A + αI) π ∞ sin( ) Z 1 2 − 2  ∗ −1 ∗ −1 = w (AηAη + (α + w)I) − (A A + (α + w)I) dw, π 0 which leads to Z ∞ 1 1 C0,−1η 1 ∗ − 2 ∗ − 2 k(AηAη + αI) − (A A + αI) k ≤ √ 3 dw π 0 w(α + w) 2 2C η ≤ 0,−1 , π α thereby finishing the proof.

2.3.1 Semi-Heuristic Parameter Choice Rules In terms of parameter selection, previous work can be found in, e.g., [103, 104,134,143]. Note that the aforementioned references generally consider a- posteriori rules; namely, the generalised discrepancy principle which selects the parameter as

δ  δ δ δ  α∗ = α(δ, η, y ) = inf α ∈ (0, αmax) | kAηxα,η − y k = τ δ + ηkxα,ηk , (2.64)

71 with τ ≥ 1 and also a generalised balancing principle. The fact that prohibits the direct use of a minimisation-based rule with, say, a functional, which especially for this section, we define to be of the form

ψ : (0, αmax) × L(X,Y ) × Y → R ∪ {∞} δ δ (α, Aη, y ) 7→ ψ(α, Aη, y ), is that we are faced with an additional operator error, which is usually not random or irregular and hence it would be unrealistic to assume that for the operator perturbation an analogous inequality to the Muckenhoupt condition (2.17) holds. The remedy is to employ a modified functional, which uses the noisy operator Aη, but is designed to emulate a functional for the unperturbed operator. To guarantee a minimiser and convergence of the regularised solution, we restrict the minimisation to an interval [γ, αmax], where the lower bound γ is selected depending on η (but not on δ):

δ ¯ δ α∗ := α(η, y ) := argmin ψ(α, Aη, y ), γ = γ(η) > 0, (2.65) α∈[γ,αmax] with ¯ δ δ δ ψ(α, Aη, y ) := ψ(α, Aη, y ) − J (α, Aη, y , η). In this way, we combine heuristic rules with an η-based choice. Here, ψ is a standard heuristic parameter choice functional and J is the so-called compensating functional. Note that the noise restrictions for the convergence results of the heuristic parameter choice rules may fail to be satisfied in case of smooth noise (i.e., when the noise is in the range of the forward operator). Incidentally, operator noise, when it exists, tends to be smooth. Therefore, the purpose of the compensating functional is to subtract this smooth part of δ δ the noise, i.e., it should behave approximately like ψ(α, A, y ) − ψ(α, Aη, y ). For the compensating functional, we propose two possibilities:

δ δ J (α, Aη, y , η) = Dηkxα,ηk, (SH1) or η J (α, A , yδ, η) = D√ , (SH2) η α which define the (SH1) and (SH2) variants of the semi-heuristic functionals. As a consequence of Lemma 6, we obtain some useful bounds.

Lemma 7. For any of the functionals ψ ∈ {ψHD, ψHR, ψQO}, and any α ∈ (0, αmax), we have

ηkx†k ψ(α, A ,A x†) ≤ C √ + ψ(α, A, Ax†), (2.66) η η s,t α

72 δ ηkx†k ψ(α, A , yδ) ≤ √ + (1 + C ) √ + ψ(α, A, Ax†), (2.67) η α s,t α 1 with the constants Cs,t from Lemma 6: s = 2 , t = −1 for the heuristic 1 3 discrepancy, s = 2 , t = − 2 the Hanke-Raus, and s = 1, t = −2 for the quasi-optimality functionals, respectively. Proof. The inequality (2.66) follows from (2.61) and (2.62), the inequality (2.67) from (2.66) and from the inequalities

† † ψ(α, Aη, yδ) ≤ ψ(α, Aη, yδ − y) + ψ(α, Aη, (A − Aη)x ) + ψ(α, Aη,Aηx ) δ ηkx†k ≤ √ + √ + ψ(α, A ,A x†). α α η η

We remark that the term ψ(α, A, Ax†) converges to 0 as α → 0; see Proposi- tion 12. Furthermore, if x† additionally satisfies a source condition (2.5), then the expression can be bounded by a convergence rate of order α (with some exponent depending on the source condition) that agrees with the standard † rate for the approximation error kxα − x k (see Proposition 7).

Convergence Analysis

Suppose that α∗ is the selected parameter by the proposed parameter choice rules with the operator noise (2.65). In the following lemma, we show that for such a choice of parameter, it follows that α∗ → 0 if all noise (with respect to both the data and the operator) vanishes: ¯ Lemma 8. Let α∗ be selected as in (2.65), i.e., with ψ as in and ψ ∈ {ψHD, ψHR, ψQO} with either (SH1) or (SH2). Suppose there exist positive constants (not necessarily equal which we denote universally by C) such that δ ∗ δ ky k ≥ C for ψ ∈ {ψHD, ψHR} and kAηy k ≥ C for ψ = ψQO. η √ If γ = γ(η) is chosen such that γ → 0 as η → 0 then

α∗ → 0, as δ, η → 0. Proof. Firstly, we recall Proposition 9 of Chapter 2, which proved that one may estimate the parameter α by the ψ-functionals. In particular, it follows that ( √ δ C α if ψ = ψHD, ψ(α, Aη, y ) ≥ (2.68) Cα if ψ ∈ {ψHR, ψQO}. By the standard error estimate

δ δ ky k kxα∗,ηk ≤ √ , α∗

73 we find, for the case in which the compensating functional is chosen as in (SH1) using (2.68) and (2.67) with t ∈ {1/2, 1} suited to ψ according to (2.68),

δ t ky k ¯ δ ¯ δ Cα∗ − Dη √ ≤ ψ(α∗,Aη, y ) = inf ψ(α, Aη, y ) α∗ α∈[γ,αmax] δ ≤ inf ψ(α, Aη, y ) α∈[γ,αmax]  †  δ ηkx k † ≤ inf √ + (1 + Cp,q) √ + ψ(α, A, Ax ) α∈[γ,αmax] α α   † δ † ηkx k ≤ inf √ + +ψ(α, A, Ax ) + (1 + Cp,q) √ . (2.69) α∈[γ,αmax] α γ Hence,   t δ † η Cα∗ ≤ inf √ + ψ(α, A, Ax ) + (C + D)√ . α∈[γ,αmax] α γ It is not difficult to verify the same estimate analogously for the case in which the compensating functional is chosen according to (SH2). Inserting the (non-optimal) choice α = δ + γ in the infimum, we obtain an upper bound that tends to 0 as δ, γ → 0. By the hypothesis, the last two terms vanish, thereby proving the desired result.

Remark. If α∗ is the minimiser of ψ(α, Aη, yδ), then this functional is the same as (SH1) and/or (SH2) with D = 0 and one obtains the same result as above; namely, that α∗ → 0 as δ, η → 0 provided that the conditions in the lemma are fulfilled. Now, we can establish an estimate from above for the total error which is derived courtesy of a lower estimate of the parameter choice functional with the data error, which we recall, due to Bakushinskii’s veto (cf. Proposition 3), necessitates a restriction on the noise. At first we study bounds for the functional in (2.65). ¯ Proposition 17. Let α∗ be selected according to (2.65) with ψ as in (SH1). Suppose that for the noisy data yδ, the noise condition is satisfied, with (2.17) with e ∈ N1 for ψ ∈ {ψHD, ψHR} and e ∈ N2 for ψ = ψQO. Then, for η sufficiently small, we get the following error estimate for all ψ ∈ {ψHD, ψHR, ψQO}: δ † kxα∗,η − x k  −1 ηδ ¯ † ≤ (1 − DηCnc) C + Cnc inf ψ(α, Aη, yδ) + DCncηkx k α∗ α∈[γ,αmax]  η † † ∗ −1 † + C √ kx k + Cψ(α∗, A, Ax ) + α∗k(A A + α∗I) x k . α∗ (2.70)

74 Proof. We begin by estimating the terms:

δ † ∗ −1 ∗ δ ∗ −1 ∗ † kxα∗,η − x k = k(AηAη + α∗I) Aηy − (AηAη + α∗I) (AηAη + α∗I)x k ∗ −1  ∗ δ ∗ † † ≤ k(AηAη + α∗I) Aηy − AηAηx − α∗x k ∗ −1  ∗ δ ∗ † ≤ k(AηAη + α∗I) Aη(y − y) + Aη(A − Aη)x k ∗ −1 † + α∗k(AηAη + α∗I) x k ∗ −1 ∗ δ η † ∗ −1 † ≤ k(AηAη + α∗I) Aη(y − y)k + √ kx k + α∗k(AηAη + α∗I) x k. 2 α∗ By (2.61), the last term can be bounded by

∗ −1 †  ∗ −1 ∗ −1 † α∗k(AηAη + α∗I) x k ≤ α∗k (AηAη + α∗I) − (A A + α∗I) x k ∗ −1 † + α∗k(A A + α∗I) x k † ηkx k ∗ −1 † ≤ C0,−1 √ + α∗k(A A + α∗I) x k. α∗ This leaves the remaining term:

∗ −1 ∗ δ k(AηAη + α∗I) Aη(y − y)k  ∗ −1 ∗ ∗ −1 ∗ δ ≤ k (AηAη + α∗I) Aη − (A A + α∗I) A (y − y)k ∗ −1 ∗ δ + k(A A + α∗I) A (y − y)k

5ηδ δ ≤ + Cncψ(α∗, A, y − y), 4α∗ where we used the noise condition (2.17) for the last estimate. Utilising the noise condition (2.17) again, and with the operator error esti- mates (2.61), (2.62), for all ψ ∈ {ψHD, ψHR, ψQO}, we obtain

δ † ∗ −1 ∗ δ kxα∗,η − x k = k(AηAη + α∗I) Aη(y − y)k

5ηδ δ δη ≤ + Cncψ(α∗,Aη, y − y) + CncCp,q 4α∗ α∗

(5 + CncCp,q)ηδ δ ≤ + Cncψ(α∗,Aη, y ) + Cncψ(α∗,Aη, y) 4α∗ (5 + CncCp,q)ηδ ¯ δ δ ≤ + Cncψ(α∗,Aη, y ) + DCncηkxα∗,ηk 4α∗ † + Cncψ(α∗,Aη, Ax ) Cηδ ¯ δ δ † † ≤ + Cnc inf ψ(α, Aη, y ) + DCncηkxα∗,η − x k + DCncηkx k α∗ α∈[γ,αmax] † + Cncψ(α∗,Aη, (Aη + A − Aη)x ) Cηδ ¯ δ δ † † ≤ + Cnc inf ψ(α, Aη, y ) + DCncηkxα∗,η − x k + DCncηkx k α∗ α∈[γ,αmax] † † + Cncψ(α∗,Aη,Aηx ) + Cncψ(α∗,Aη, (A − Aη)x ).

75 The last terms can be bounded using standard error estimates (cf. Proposi- tion 11) by 1 ηkx†k ψ(α ,A , (A − A )x†) ≤ √ k(A − A )x†k = √ , ∗ η η α η α while for the other term we employ (2.61) and (2.66) ηkx†k ψ(α ,A ,A x†) ≤ C √ + ψ(α , A, Ax†). ∗ η η p,q α ∗

Hence, for all ψ ∈ {ψHD, ψHR, ψQO}, we obtain

δ † (1 − DηCnc)kxα∗,η − x k Cηδ ¯ † ≤ + Cnc inf ψ(α, Aη, yδ) + DCncηkx k α∗ α∈[γ,αmax] η † † ∗ −1 † + C √ kx k + Cψ(α∗, A, Ax ) + α∗k(A A + α∗I) x k. α∗

The proof is easily adapted to obtain a similar proposition for the alternative choice of compensating functional as in (SH2):

Proposition 18. Let the assumptions of the Proposition 17 hold. Let α∗ be selected according to (2.65) with ψ¯ as in (SH2). Then for η sufficiently small, we get

δ † kxα∗,η − x k ηδ ¯ η ≤ C + Cnc inf ψ(α, Aη, yδ) + CncC √ α∗ α∈[γ,αmax] α∗ (2.71) η † † ∗ −1 † + C √ kx k + Cψ(α∗, A, Ax ) + α∗k(A A + α∗I) x k. α∗ Note that the setting D = 0 in the previous propositions yields upper bounds for the total errors in the case of employing the unmodified heuristic rules.

Theorem 13. Let α∗ be selected as in (2.65). Suppose that the noise con- dition (2.17) and the conditions of Lemma 8 are satisfied and furthermore suppose that γ ∈ [0, αmax] satisfies η ≤ C as η → 0, γ where C is a constant. Then

δ † kxα∗,η − x k → 0, as δ, η → 0.

76 Proof. Since we have that α∗ ≥ γ, the conditions in the theorem imply that ηδ η √ † ∗ −1 † γ → 0, γ → 0. The terms with ψ(α∗, A, Ax ) and α∗k(A A + α∗I) x k vanish by standard arguments (cf. Proposition 4 and Proposition 6, respec- ¯ tively), as α∗ → 0 according to Lemma 8. Finally, infα∈[γ,αmax] ψ(α, Aη, yδ) tends to 0 because of (2.69) and we may take an appropriate choice for α in the infimum.

Remark. Note that one might use more general functionals than those in (SH1) and (SH2) by replacing η with ηs, s ∈ (0, 1). Still, in this case, similar convergence results are valid with a slightly adapted choice of γ (depending on s). However, we observed through some numerical experimentation (see Chapter 5) that s = 1 appeared to be a natural choice, which is fully in line with our motivation that the compensating term should represent the error in ψ due to operator perturbations. We further remark that the unmodified heuristic choice (i.e., with D = 0), stipulating the same condition as in the previous theorem, also yields conver- gence as the errors tend to zero. However, as will be observed in Chapter 5, which gives a numerical study related to the methods of this section, the modified rules represent a substantial improvement. Convergence rates were not proven in [51] and also lie beyond the scope of this thesis.

77 Chapter 3

Convex Tikhonov Regularisation

In this chapter, we revert to the original representation of Tikhonov reg- ularisation, namely (2.4) and make use of it to expand our horizons from linear ill-posed problems to convex regularisation. This form of regularisa- tion gained popularity in recent decades due to the demand from application which, for instance, seeks to recover solutions which may be sparse or in the case of image processing, one may want to retain the edges when denoising an image. For the aforementioned applications, one may replace the square quadratic regularisation term in (2.4) with an `1 norm or a total variation seminorm, respectively. This is the flexibility which (2.4) allows us over the spectral form (2.1). The downside is that the analysis is arguably much more difficult in the absence of spectral theory. We begin this chapter with an initial section which covers the classical theory, mainly from the works of [21, 68, 142]. Then the proceeding sections are on heuristic parameter choice rules with a focus on proving convergence with respect to an analogous noise restriction to (2.17) from the linear theory.

3.1 Classical Theory

We assume henceforth that A : X → Y is a continuous linear operator with X as a Banach space (and that Y is a Hilbert space). Then the functional calculus for self-adjoint operators used previously becomes void. Moreover, the Moore-Penrose generalised inverse may no longer exists, thus we must redefine our notion of the best approximate solution. One survey in this setting can be found in [140]. One may also consult with [142] in case one wants to consider Y as a Banach space also, although this is beyond the scope of this thesis. Note that this section is by and large analogous to [90]. In this context, it is common to define the best approximate solution as the

78 so-called R-minimising solution, i.e., x† = argmin R(x). x∈X Ax=y One may still regularise ´ala Tikhonov, however, if one sticks with the vari- ational form: δ 1 δ 2 xα ∈ argmin kAx − y k + αR(x), (3.1) α 2 where the regularisation term R : X → R ∪ {∞} may be a generalisation of the usual square norm, which we shall assume to be convex, proper, coercive and weak-∗ lower semi-continuous (akin to the aforementioned norm func- tional), which are in fact the standard assumptions for the well-defindness of (3.1) cf. [78, 140, 142]. In contrast with k · k2, the R functional need not be differentiable. The optimality condition for the Tikhonov functional may be stated as fol- lows: ∗ δ δ δ 0 ∈ A (Axα − y ) + α∂R(xα), (3.2) δ for all α ∈ (0, αmax) and y ∈ Y . We may define specific selections of the δ subgradient of R at xα as: 1 ξδ : = − A∗(Axδ − yδ) ∈ ∂R(xδ ), α α α α and denote by ξα its respective noise-free variant. In this scenario, we also have to consider a different source condition to, e.g., (2.5). Namely, ∂R(x†) ∈ range A∗, (3.3) i.e., there exists ω such that A∗ω ∈ ∂R(x†) [21,140]. Subsequently, we define ξ := A∗ω, to be a subgradient of R at x†. It is useful to define the residual vectors as before:

δ δ δ pα := y − Axα, pα := y − Axα. For notational purposes, we also define shorthands for differences of noisy and noise-free variables:

δ δ ∆y := y − y, ∆pα := pα − pα. Additionally, in this context, rather than prove convergence in norm, we follow the examples of e.g., [21,68,142], and prove estimates and convergence with respect to the Bregman distance, denoted by Dξ(·, ·), cf. Appendix B and in particular, Definition 13. We begin by stating a useful estimate for the data propagation error:

79 Lemma 9. We have the following upper bound for the data propagation error: 1 D (xδ , x ) ≤ h∆y − ∆p , ∆p i, (3.4) ξα α α α α α δ for all α ∈ (0, αmax) and y, y ∈ Y . Proof. We may estimate

δ sym δ 1 δ 1 Dξα (xα, xα) ≤ D δ (xα, xα) = h∆pα,A(xα − xα)i = h∆pα, ∆y − ∆pαi, ξα,ξα α α where Dsym is the symmetric Bregman distance, cf. Appendix B and Defini- tion B.4, which proves the desired result. We give the error estimates [21], which, contrary to the previous chapter, are in terms of the Bregman distance: Proposition 19. Assume that (3.3) holds. Then there exist positive con- stants such that

† 2 Dξ(xα, x ) ≤ Ckωk α, kAxα − yk ≤ Ckωkα, δ2 D (xδ , x ) ≤ C , kA(xδ − x )k ≤ Cδ, ξα α α α α α and  δ √ 2 D (xδ , x†) ≤ C √ + kωk α , kAxδ − yδk ≤ δ + Ckωkα, (3.5) ξ α α α

δ for all α ∈ (0, αmax) and y, y ∈ Y . Proof. We have 1 1 kAx − yk2 + αR(x ) ≤ kAx† − yk2 + αR(x†) = αR(x†). 2 α α 2 Now, from † † † Dξ(xα, x ) = R(xα) − R(x ) − hξ, xα − x i, it follows that

† † † αR(x ) = αR(xα) − αhξ, xα − x i − αDξ(xα, x ). Thus, 1 kp k2 + αR(x ) ≤ αR(x†) 2 α α 1 ⇐⇒ kp k2 + αR(x ) ≤ αR(x ) − αhξ, x − x†i − αD (x , x†) 2 α α α α ξ α 1 ⇐⇒ kp k2 + αD (x , x†) ≤ −αhξ, x − x†i, 2 α ξ α α

80 i.e., 1 kp k2 ≤ −αD (x , x†) − αhA∗ω, x − x†i ≤ −αhω, p i 2 α ξ α α α ≤ α|hω, pαi| ≤ αkωkkpαk.

Hence, we obtain the discrepancy estimate: kpαk ≤ 2kωkα. Now, for the approximation error, we recall the previous estimate 1 1 kp k2 + αD (x , x†) ≤ αhξ, x − x†i ≤ αkωkkp k ≤ kωk2α2 + kp k2 . 2 α ξ α α α 2 α Now, for the estimates with noise, we have 1 1 1 kpδ k2 + αD (xδ , x ) ≤ kpδ k2 − αh A∗p , xδ − x i 2 α ξα α α 2 α α α α α 1 = kpδ k2 − hp ,A(xδ − x )i 2 α α α α 1 = hpδ , pδ i − hA(xδ − x ), p i − hy − yδ,A(xδ − x )i 2 α α α α α α α 1 = h pδ + Axδ − Ax , pδ i − hy − yδ,A(xδ − x )i 2 α α α α α α 1 = hAx − yδ + Axδ + Axδ − Ax − Ax , Ax − Axδ + Axδ − yδi 2 α α α α α α α α δ δ − hy − y ,A(xα − xα)i 1 = hAx − yδ + Axδ + Axδ − Ax − Ax , Axδ − yδi 2 α α α α α α 1 − hAx − yδ + Axδ + Axδ − Ax − Ax ,A(xδ − x )i 2 α α α α α α α δ δ − hy − y ,A(xα − xα)i 1 1 1 = kAxδ − yδk2 + hA(xδ − x ), Axδ − yδi − kA(xδ − x )k2 2 α 2 α α α 2 α α 1 − hAxδ − yδ,A(xδ − x )i − hy − yδ,A(xδ − x )i 2 α α α α α 1 1 = kpδ k2 − kA(xδ − x )k2 − hy − yδ,A(xδ − x )i, 2 α 2 α α α α i.e., 1 kpδ k2 + αD (xδ , x ) 2 α ξα α α 1 1 ≤ kpδ k2 − kA(xδ − x )k2 − hy − yδ,A(xδ − x )i 2 α 2 α α α α 1 ⇐⇒ kA(xδ − x )k2 + αD (xδ − x ) ≤ −hy − yδ,A(xδ − x )i. 2 α α ξ α α α α

81 From the non-negativity of the Bregman distance, it follows that the above inequality implies that 1 kA(xδ − x )k2 ≤ −hy − yδ,A(xδ − x )i ≤ ky − yδkkA(xδ − x )k 2 α α α α α α δ ≤ δkA(xα − xα)k.

δ Thus, the data discrepancy can be estimated as kA(xα − xα)k ≤ 2δ. Now, from 1 kA(xδ − x )k2 + αD (xδ , x ) ≤ −hy − yδ,A(xδ − x )i 2 α α ξα α α α α 1 1 ≤ δkA(xδ − x )k ≤ δ2 + kA(xδ − x )k2, α α 2 2 α α we get the estimate for the data propagation error. For the total error and discrepancy estimates, we begin by recalling that 1 1 kAxδ − yδk2 + αD (xδ , x ) ≤ kAx − yk2 − αhξ , xδ − x i, 2 α ξα α α 2 α α α α which implies

δ δ 2 2 ∗ δ kAxα − y k ≤ kAxα − yk + 2hA (Axα − y), xα − xαi 2 2 δ ≤ 4kωk α + 2hAxα − y, A(xα − xα)i 2 2 δ ≤ 4kωk α + 2kAxα − ykkA(xα − xα)k ≤ Ckωk2α2 + Cδ2 ≤ (Cδ + Ckωkα)2 , thereby yielding the estimate for the total discrepancy. Finally, for the total δ error, we may use the optimality of xα to estimate 1 δ2 kAxδ − yδk2 + αR(xδ ) ≤ + αR(x†), 2 α α 2 which is equivalent to 1 δ2 kAxδ − yδk2 ≤ + R(x†) − R(xδ ). 2α α 2α α Therefore,

1 δ δ 2 δ † kAx − y k + D δ (x , x ) 2α α ξα α δ2 ≤ + R(x†) − R(xδ ) + R(xδ ) − R(x†) − hξ, xδ − x†i 2α α α α δ2 ≤ −hA∗ω, xδ − x†i + , α 2α

82 which allows us to estimate the total error was 2 δ † δ † δ 1 δ δ 2 D δ (x , x ) ≤ −hω, A(x − xα + xα − x )i + − kAx − y k ξα α α 2α 2α α δ2 ≤ kωkkA(xδ − x )k + kωkkAx − yk + α α α 2α δ2 ≤ Ckωkδ + Ckωk2α + 2α  δ √ 2 ≤ C √ + kωk α , α which completes the proof [78]. Corollary 5. Let a source condition (3.3) hold. Then we have

δ † δ † Dξ(xα, x ) ≤ Dξα (xα, xα) + Dξ(xα, x ) + Ckωkδ, (3.6)

δ for all α ∈ (0, αmax) and y, y ∈ Y .

δ † Proof. It follows from (B.3), with xα = x1, x = x2 and xα = x3, that

δ † δ δ † δ Dξ(xα, x ) = Dξα (xα, xα) + Dξ(xα, x ) + hξα − ξ, xα − xαi. Observe that the last term can be estimated as 1 hξ − ξ, xδ − x i = −h A∗(Ax − y) + A∗ω, xδ − x i α α α α α α α 1 = − hAx − y, A(xδ − x )i − hω, A(xδ − x )i α α α α α α  1  ≤ kA(xδ − x )k kAx − yk + kωk α α α α ≤ Cδ(Ckωk + kωk) = Ckωkδ, where we used the estimates of the previous proposition. Theorem 14. Let a source condition (3.3) hold. Then

δ † Dξ(xα, x ) → 0, as δ → 0. Proof. We may estimate the error in three parts, as in (3.6), with the first and second terms corresponding to the data propagation and approximation errors, respectively. Subsequently, from Proposition 19, we obtain δ2 D (xδ , x†) ≤ C + Cα + Cδ. (3.7) ξ α α Thus, choosing α such that δ2/α → 0 and α → 0 as δ → 0 yields the desired result.

83 In this setting, we can also generalise iterated Tikohonov regularisation (2.9) to a convex analogue, namely, Bregman iteration, which is defined as

δ 1 δ 2 δ xα,k ∈ argmin kAx − y k + αDξδ (x, xα,k−1), (3.8) x∈X 2 α,k−1

δ δ with ξα,k−1 ∈ ∂R(xα,k−1) for k ∈ N. For certain parameter choice rules, we are particularly interested in the second Bregman iterate (cf. [17,124,154]), which we denote by

II 1 δ 2 δ x ∈ argmin kAx − y k + αD δ (x, x ). (3.9) α,δ ξα α x∈X 2 Note that the second Bregman iterate may be computed by minimising a slightly simpler expression which does not involve the Bregman distance [153]:

Proposition 20. We may compute the second Bregman iterate as

II 1 δ δ δ 2 xα,δ ∈ argmin kAx − y − (y − Axα)k + αR(x). x∈X 2 for all α ∈ (0, αmax).

δ Proof. Expanding the second term in (3.9), and using the definition of ξα, II we see that xα,δ minimises 1 kAx − yδk2 + αR(x) − αR(xδ ) − αhξδ , x − xδ i 2 α α α 1 = kAx − yδk2 − hAxδ − yδ,A(x − xδ )i + αR(x) − αR(xδ ) 2 α α α 1 = kAx − yδk2 − hAxδ − yδ, Ax − yδ − (Axδ − yδ)i + αR(x) − αR(xδ ) 2 α α α 1 = kAx − yδ + Axδ − yδ − Axδ + yδk2 − hAx − yδ − (Axδ − yδ), Axδ − yδi 2 α α α α δ + αR(x) − αR(xα) 1 = kAx − yδ − (yδ − Axδ )k2 + hAx − yδ + Axδ − yδ, Axδ − yδi 2 α α α 1 + kAxδ − yδk2 − hAx − yδ − (Axδ − yδ), Axδ − yδi 2 α α α δ + αR(x) − αR(xα) 1 = kAx − yδ − (yδ − Axδ )k2 + αR(x) + 2kAxδ − yδk2 − R(xδ ). 2 α α α Notice that the last two terms do not depend on x; hence, the result follows.

84 For the residual vector with the second Bregman iterate, we stick to the notation: II δ II II II pα,δ := y − Axα,δ, pα := y − Axα , for noisy and exact data, respectively. Subsequently, we define:

II II II ∆pα := pα,δ − pα .

Proposition 21. We have

II δ δ II kpα,δk ≤ kpαk, and R(xα) ≤ R(xα,δ), (3.10) for all α ∈ (0, αmax).

II Proof. From the optimality of xα,δ, it follows that

1 II 2 II δ 1 δ 2 δ δ 1 δ 2 kp k + αD δ (x , x ) ≤ kp k + αD δ (x , x ) = kp k , 2 α,δ ξα α,δ α 2 α ξα α α 2 α i.e.,

1 II 2 1 δ 2 II δ 1 δ 2 kp k ≤ kp k − αD δ (x , x ) ≤ kp k , 2 α,δ 2 α ξα α,δ α 2 α which follows from the non-negativity of the Bregman distance. Similarly as for the Tikhonov functional (i.e. (3.2)), we can state the opti- mality condition for the Bregman functional in the same manner:

∗ II δ δ δ II 0 ∈ A (Axα,δ − y − (y − Axα)) + α∂R(xα,δ).

It follows immediately as a consequence of the above that one can define a II specific selection of the subgradient of R at xα,δ as

1 II ξII := − A∗(AxII − yδ − (yδ − Axδ )) ∈ ∂R(xδ ), α,δ α α,δ α α

II and denote by ξα its respective noise-free variant.

δ II Proposition 22. The residuals pα, pα,δ may be expressed in terms of a prox- imal mapping operator, proxJ : Y → Y ,

−1 proxJ = (I + ∂J ) , (3.11)

∗ 1 ∗ with the functional J := αR ◦ α A in the following form:

δ δ II δ  δ pα := proxJ (y ), pα,δ := proxJ y + pα − pα. (3.12)

85 Proof. It follows from the optimality condition for the Tikhonov functional that

δ ∗ δ δ δ ∗ ∗ δ δ ∂R(xα) 3 −A (Axα − y )/α ⇐⇒ xα ∈ ∂R (−A (Axα − y )/α), (3.13) since x∗ ∈ ∂R(x) ⇐⇒ x ∈ ∂R∗(x∗) for all x ∈ X and x∗ ∈ X∗ (cf. [32]). Furthermore, (3.13) is equivalent to

δ δ ∗ ∗ δ δ δ Axα − y ∈ A∂R (−A (Axα − y )/α) − y δ ∗ ∗ δ δ ⇐⇒ (−pα) − A∂R (A pα/α) ∈ −y .

We can rewrite the above as

∗ ∗ δ δ ∗ ∗ δ δ (I + A∂R (A · /α))(pα) ∈ y ⇐⇒ (I + α∂(R (A · /α)))(pα) ∈ y , due to the identity A(∂R∗(A∗·)) = ∂(R∗(A∗·)), which holds true if R∗ is finite and continuous at a point in the range of A∗ (cf. [32, Prop. 5.7]), for instance, at 0 = A∗0. By a result of Rockafellar [137, Thm. 4C, Thm 7A], this follows if R has bounded sublevel sets, which is a consequence of the assumed coercivity. By definition of the proximal mapping, (3.12) follows for δ II pα and analogously for pα,δ.

3.2 Parameter Choice Rules

Similarly as before, we define a ψ-functional based heuristic parameter choice rule by δ α∗ ∈ argmin ψ(α, y ). (3.14) α∈(0,αmax) 2 In the linear theory, αmax is usually chosen as kAk . In the convex case, its choice is irrelevant for the convergence theory. In practice, one might choose it similarly as in the linear case. In case no α∗ exists in the defined interval, the reader is referred to [83, (2.13), p. 237], which details how one can extend the definition of ψ to overcome that issue. For simplicity, we persevere with (3.14) and simply assume existence. The conceptual basis for (3.14) in this setting is that the surrogate functional δ † ψ should act as an error estimator of the form ψ ∼ Dξ(xα, x ), so that its minimiser should, in theory, be a good approximation of the optimal parameter choice. As in the linear theory, the error estimating behaviour is not guaranteed to hold in general (due to the Bakushinskii veto) unless restrictions on the noise are postulated cf. (2.17) in Chapter 2. Since the Bregman distance in the linear case corresponds to the norm squared, we henceforth, in this chapter, redefine the ψ-functionals of before such that they coincide in the linear case with the squared versions of the ones in,

86 2 e.g., Chapter 1 and 2 (i.e., e.g., ψHD in this chapter corresponds to ψHD of Chapter 2). As we continue to assume that Y is a Hilbert space and the heuristic dis- crepancy (cf. [79, 80, 154]), Hanke-Raus and heuristic monotone error rules are defined in the Y space, the expressions for their respective functionals II may remain unchanged with the consideration that xα,δ would be the second Bregman iterate rather than the second Tikhonov iterate: 1 1 ψ (α, yδ) := kAxδ − yδk2, ψ (α, yδ) := hAxII − yδ, Axδ − yδi. HD α α HR α α,δ α However, the quasi-optimality and simple-L rules were defined in the X space, and since we do not prove convergence in the strong topology of X, perhaps this would not be the appropriate formulation of the aforementioned rules in this setting. Indeed, we prove convergence with respect to the Bregman distance, however, its general non-symmetricity opens the possibility for sev- eral ways of defining the quasi-optimality functional, for instance (cf. [86]). Indeed, we define

δ II δ δ δ II ψRQO(α, y ) := D δ (x , x ), ψLQO(α, y ) := D II (x , x ), (3.15) ξα α,δ α ξα,δ α α,δ δ sym II δ ψSQO(α, y ) := D II δ (xα,δ, xα), (3.16) ξα,δ,ξα as the right, left and symmetric quasi-optimality functionals, respectively. For the simple-L rule, there also exist a plethora of options, one being

δ II δ ψL(α, y ) := R(xα,δ) − R(xα), (3.17) which we will not explore in this section. However, it does make an ap- pearance in Chapter 7, where we test it numerically, and this is thus far the author’s preferred definition of ψL in this setting. Of course, this does not coincide exactly with ψL as defined in (2.12) in Chapter 2, since there, δ δ δ II δ 2 δ II ψL(α, y ) = hxα, xα − xα,δi = kxαk − hxα, xα,δi. Note that we omit ψLQO from the following analysis as in preliminary numer- ical tests, it performed very poorly in comparison to the other rules (which are tested in Chapter 6) and this was also the conclusion of [86]. As in the previous settings, it is also possible to define the ratio rules in the δ 2 convex setting; for instance, by replacing the kxαk from the expressions in δ the linear setting with R(xα). Similarly as for the heuristic discrepancy and Hanke-Raus rules, the symmet- ric quasi-optimality functional may also be expressed in terms of residuals: Proposition 23. We have that 1 ψ (α, yδ) = hpδ − pII , pII i, SQO α α α,δ α,δ δ for all α ∈ (0, αmax) and y ∈ Y .

87 Proof. From the symmetric Bregman distance definition (cf. (B.4)), we have

δ sym II δ II δ II δ ψSQO(α, y ) = D II δ (xα,δ, xα) = hξα,δ − ξα, xα,δ − xαi ξα,δ,ξα 1 = hA∗(Axδ − yδ) − A∗(AxII − yδ + Axδ − yδ), xII − xδ i α α α,δ α α,δ α 1 = hA(xδ − xII ), AxII − yδi α α α,δ α,δ 1 = hAxδ − yδ − (AxII − yδ), AxII − yδi, α α α,δ α,δ

δ for all α ∈ (0, αmax) and y ∈ Y , which is what we wanted to show. The proposition below provides some useful estimates:

δ Proposition 24. For all α ∈ (0, αmax) and y ∈ Y , we have

δ δ δ δ 0 ≤ ψRQO(α, y ) ≤ ψSQO(α, y ) ≤ ψHR(α, y ) ≤ ψHD(α, y ). (3.18)

Moreover, if (3.3) holds, then

 δ √ 2 ψ (α, yδ) ≤ √ + 2kωk α . (3.19) HD α

In particular, with (3.3) and α∗ selected as (3.14) with ψ ∈ {ψHD, ψHR, ψSQO, ψRQO}, we have that δ lim ψ(α∗, y ) = 0. (3.20) δ→0 Proof. Since

δ δ δ II ψSQO(α, y ) = ψRQO(α, y ) + D II (x , x ), ξα,δ α α,δ it follows immediately that ψSQO ≥ ψRQO. Moreover, from Proposition 23 and Proposition 22, we get that 1 1 1 1 ψ (α, yδ) = hpδ − pII , pII i = hpδ , pII i − kpII k2 ≤ hpδ , pII i SQO α α α,δ α,δ α α α,δ α α,δ α α α,δ δ (3.12) 1 δ δ δ δ δ δ = ψ (α, y ) = prox ∗ (y + p ) − prox ∗ (y ), y + p − y HR α J α J α (B.7) 1 δ δ δ δ ≤ k prox ∗ (y + p ) − prox ∗ (y )kkp k α J α J α (3.10) 1 ≤ kpδ k2 = ψ (α, yδ). α α HD

The last estimate (3.19) follows from (3.5). As the respective α∗ are the δ minimisers, we may estimate ψ(α∗, y ) by the minimiser over α of (3.19), which is easily shown to be of the order of δ and thus tends to 0.

88 3.2.1 Convergence Analysis As observed in the previous chapters, for heuristic rules in the linear case, it is often standard to show convergence of the selected regularisation parameter α∗ as the noise level tends to zero. This is not necessarily true in the convex case, as we shall see. The next lemma deals with the (exceptional) case in which limδ→0 α∗ 6= 0.

Lemma 10. Assume that A : X → Y is compact. Let α∗ be the minimiser of ψ ∈ {ψHD, ψHR, ψSQO, ψRQO}. In case of ψSQO, ψRQO, we additionally assume that R is strictly convex. Suppose that limδ→0 α∗ =α ¯ > 0. Then

δ † Dξ(xα∗ , x ) → 0, as δ → 0.

δ † Proof. We show that any subsequence of Dξ(xα∗ , x ) has a convergent sub- sequence with limit 0. In case of ψHD, the result follows from [78, proof of Thm. 3.5] even without the compactness assumption of A. From the same δ proof, it follows also that xα∗ is bounded and similarly, we may show that II xα∗,δ is bounded as δ → 0. Hence, there exist weakly (or weakly-*) convergent subsequences with δ II xα∗ * v, xα∗,δ * z. δ In [78], it was shown that v = xα¯, i.e., the limit of xα∗ is the regularised solution for exact data and regularisation parameterα ¯. Now using the com- pactness of A, we may find strongly convergent subsequences for the residuals as δ → 0: δ II pα∗ → y − Axα¯ =: pα¯, pα∗,δ → y − Az. II From lower semicontinuity, the minimisation property of xα∗,δ, and the strong δ convergence of pα∗ , we obtain that for any x ∈ X 1 kAz − y − (y − Ax )k2 +α ¯R(z) 2 α¯ 1 II δ δ δ 2 δ ≤ lim inf kAxα ,δ − y − (y − Axα )k + α∗R(xα) δ→0 2 ∗ ∗ 1 δ δ δ 2 ≤ lim inf kAx − y − (y − Axα )k + α∗R(x) δ→0 2 ∗ 1 = kAx − y − (y − Ax )k2 +α ¯R(x). 2 α¯ Hence, z is the minimiser of the functional on the left-hand side and by its II II II II uniqueness, it follows that z = xα¯ and thus pα∗,δ → y − Axα¯ =: pα¯ . From the optimality conditions, we furthermore obtain the strong convergence of δ ξα∗ → ξα¯.

89 Consider ψHR: it follows from (B.7) that

1 II 2 1 δ II δ kpα∗,δk ≤ hpα∗ , pα∗,δi = ψHR(α∗, y ), α∗ α∗

II II and by (3.20), we find that pα¯ = 0. Since pα¯ = proxJ (y +pα¯)−pα¯, it follows that

proxJ (y + pα¯) = pα¯ = proxJ (y).

As proxJ is bijective, we obtain that pα¯ = 0. From this, the result follows as in [78, Proof of Thm. 3.5]. II δ In case of ψSQO, ψRQO, we have, by (3.20), that D δ (x , x ) → 0, and from ξα∗ α∗,δ α∗ δ the lower semicontinuity, and the strong convergence of the subgradient ξα, it follows that II Dξα¯ (xα¯ , xα¯) = 0. By the assumed strict convexity of R, the Bregman distance is strictly pos- II II itive for nonidentical arguments; thus xα¯ = xα¯ ⇒ pα¯ = pα¯ . Employing the proximal representation, we thus have that

proxJ (y + pα¯) − pα¯ = pα¯ ⇔ 2pα¯ + ∂J (2pα¯) = y + pα¯

⇔ pα¯ + ∂J (2pα¯) = y = pα¯ + ∂J (pα¯) ⇔ ∂J (2pα¯) = ∂J (pα¯).

The strict convexity implies 2pα¯ = pα¯, hence pα¯ = 0. The results then follows as before from those in [78].

Noise Restriction As discussed, the auto-regularisation condition (1.38) becomes void in this setting due to the dominating term ψ(α, y − yδ) no longer making sense when the regularisation operator is nonlinear. In order to grasp this, firstly consider that the ψ functionals can no longer be expressed in the spectral form `ala (1.34). Thus, there are no linear filter functions as such to act on the vector y − yδ. Instead, we stipulate an alternative auto-regularisation condition, as suggested in [90]. In the following, we state the main convergence theorem of this section along with the analogous auto-regularisation conditions:

Theorem 15. Let A : X → Y be compact, the source condition (3.3) be sat- isfied, α∗ be the minimiser of ψ ∈ {ψHD, ψHR, ψSQO, ψRQO}, and assume there exist constants C > 0 such that the respective auto-regularisation condition

2 h∆y − ∆pα, ∆pαi ≤ Ck∆pαk , (ARC-HD) II h∆y − ∆pα, ∆pαi ≤ Ch∆pα , ∆pαi, (ARC-HR) II II h∆y − ∆pα, ∆pαi ≤ Ch∆pα − ∆pα , ∆pα i, (ARC-SQR)

1 II δ h∆y − ∆pα, ∆pαi ≤ CD δ (x , x ) + O(α), (ARC-RQO) α ξα α,δ α

90 δ holds for all α ∈ (0, αmax) and y ∈ Y for the heuristic discrepancy, Hanke- Raus, symmetric quasi-optimality and right quasi-optimality rules. If ψ ∈ {ψSQO, ψRQO}, assume in addition that R is strictly convex. Then it follows that the method converges; i.e.,

δ † Dξ(xα∗ , x ) → 0, as δ → 0.

δ † Proof. Take an arbitrary subsequence of Dξ(xα∗ , x ). We show that it con- tains a convergent subsequence with limit 0, which would prove the state- ment. Since α∗ is bounded, we may consider a subsequence, which, in ad- δ † dition satisfies α∗ → α¯. In case α∗ → α¯ > 0, the result limδ→0 Dξ(xα∗ , x ) = 0 follows from Lemma 10 without even needing to invoke the auto-regularisation conditions. Thus, let us assume that α∗ → 0. Then it follows from the estimate (3.6) and Proposition 19 that one only δ needs to prove convergence of the data propagation error Dξα (xα, xα), which we can immediately estimate via (3.4) and the respective auto-regularisation condition in Theorem 15. For α∗ minimising the heuristic discrepancy functional, it follows from Propo- δ sition 19 and (3.20) that limδ→0 ψHD(α∗, y) = 0 and limδ→0 ψHD(α∗, y ) = 0. Thus, we may conclude that

2 k∆p k2 kpδ k kp k  2 α∗ α∗ α∗ p δ p ≤ √ + √ = ψHD(α∗, y ) + ψHD(α∗, y) → 0, α∗ α∗ α∗

δ as δ → 0. Hence, using (3.4) and (ARC-HD) yields that Dξα∗ (xα∗ , xα∗ ) → 0 as δ → 0. For the approximation error, it follows from Proposition 19 that † Dξ(xα∗ , x ) ≤ Cα∗ → 0. Therefore, each term in (3.6) tends to 0 as δ → 0. Let α∗ be the minimiser of the Hanke-Raus functional. Then, as before, we estimate the Bregman distance as (3.4) and from (ARC-HR), deduce that

δ C II C II C II Dξα∗ (xα∗ , xα∗ ) ≤ h∆pα∗ , ∆pα∗i = hpα∗ , ∆pα∗i − hpα∗,δ, ∆pα∗i α∗ α∗ α∗ C II C II δ C II δ C II = hpα∗ , pα∗ i − hpα∗ , pα∗ i + hpα∗,δ, pα∗ i − hpα∗,δ, pα∗ i α∗ α∗ α∗ α∗ δ C II δ C II ≤ CψHR(α, y ) + CψHR(α∗, y) − hpα∗ , pα∗ i − hpα∗,δ, pα∗ i. (3.21) α∗ α∗ The last two terms can be estimated via the Cauchy-Schwarz inequality and (3.10), and are bounded from above by Proposition 19

1 δ 1 C kpα∗ kkpα∗ k ≤ 2ωα∗(δ + 2kωkα∗). α∗ α∗ Moreover, the last terms decay to zero as δ → 0 and clearly, the first couple δ of terms vanish as the noise decays also, since limδ→0 ψHR(α∗, y ) = 0 due to

91 (3.20) and from (3.18) and Proposition 19, ψHR(α∗, y) ≤ ψHD(α∗, y) → 0 as δ → 0. For α∗ minimising ψSQO, note that from (3.6) and (ARC-SQR), it remains to estimate

δ 1 δ II II 1 II II Dξα∗ (xα∗ , xα∗ ) ≤ hpα∗ − pα∗,δ, pα∗,δi + hpα∗ − pα∗ , pα∗ i α∗ α∗ 1 II II 1 δ II II − hpα∗ − pα∗ , pα∗,δi − hpα∗ − pα∗,δ, pα∗ i α∗ α∗ δ ≤ ψSQO(α, y ) + ψSQO(α∗, y)

1 II II 1 δ II II − hpα∗ − pα∗ , pα∗,δi − hpα∗ − pα∗,δ, pα∗ i. α∗ α∗ As before, the result follow very similarly by estimating the first two terms from above (cf. Proposition 24) and the “remainder” terms via the Cauchy- Schwarz inequality, the triangle inequality, and (3.10) and all of which vanish as the noise decays. We omit the proof for the right quasi-optimality rule as it is analogous to the above (and in fact, even simpler as the RHS of its associated auto- regularisation condition (ARC-RQO) vanishes due to (3.20)). Remark. Note that Theorem 15 holds true if the left-hand side of the auto- δ regularisation conditions, i.e., h∆y − ∆pα, ∆pαi is replaced by Dξα (xα, xα). Moreover, it is easy to see that for (ARC-HD) it is enough to prove

2 h∆pα, ∆yi ≤ Ck∆pαk , (3.22) for some positive constant C. Remark. Observe that the heuristic discrepancy, the Hanke-Raus, and the symmetric quasi-optimality rules can all be expressed in terms of the residuals δ II of the Bregman iteration pα, pα,δ. The similarity of patterns in the formulae for ψ may provide a hint that such a larger family of rules could be defined in the convex case as well, similar as in the linear case; see (2.11) in Chapter 2. The auto-regularisation condition is an implicit condition on the noise. One may observe that it resembles the analogous condition of (1.38) in the linear case. To gain a better understanding and in particular show that they can be satisfied in practical situations, in Section 3.3, we will derive sufficient and more transparent conditions in the form of Muckenhoupt-type inequalities (2.17) .

3.2.2 Convergence Rates (for the Heuristic Discrep- ancy Rule) With the aid of the source condition, auto-regularisation condition and an additional regularity condition, we can even derive rates of convergence. We

92 will do so only for the heuristic discrepancy rule, as per the title of this section, as rates for the other rules lie beyond the scope of this thesis. These results stem from [90]. We start with the following proposition:

Proposition 25. Suppose that ∂R∗ satisfies the following condition:

x → 0 ⇒ ξ ∈ ∂R∗(x) * 0. (3.23)

Then, for all positive constants D > 0, there exists another constant D1 > 0 such that for all yδ ∈ Y with kyδk ≥ D, it holds that

δ α ≤ D1ψHD(α, y ) ∀α ∈ (0, αmax). (3.24)

Proof. Suppose that the statement (3.24) is not true. Then there would exist a constant D > 0 and a sequence of yδk such that kyδk k ≥ D, and a sequence of αk with pδk δk αk ψHD(αk, y ) = → 0 as k → ∞. αk αk

δk pαk Define zk := . From its representation as a proximal mapping, we have αk that zk satisfies δk ∗ ∗ y = αkzk + A∂R (A zk). ∗ Thus, from zk → 0, the boundedness of αk, A zk → 0, and (3.23), we obtain

δk 2 δk ∗ ∗ ∗ δk 0 < D ≤ ky k = αkhzk, y i + h∂R (A zk),A y iX∗,X∗∗ → 0, which yields a contradiction; hence the statement of the proposition is true.

We now state the main convergence rates result:

Proposition 26. Let the source condition (3.3) hold, α∗ minimise ψHD and suppose the auto-regularisation condition (ARC-HD) is satisfied. Assume that kyδk ≥ C and, in addition that (3.23) holds. Then δ † Dξ(xα∗ , x ) = O(δ), for δ → 0.

Proof. Note that from Proposition 25 and since α∗ is the global minimiser, δ δ for any α, we have that α∗ ≤ CψHD(α∗, y ) ≤ CψHD(α, y ). Observe that

93 from (3.6), it follows that kwk2 D (xδ , x†) ≤ D (xδ , x ) + α + 6kwkδ ξ α∗ ξα∗ α∗ α∗ 2 ∗ (ARC-HD)  2 p δ p ≤ ψHD(α, y ) + ψHD(α∗, y) + Cδ + Cα∗  δ √ √ 2 ≤ √ + C α + C α + Cδ + Cα via Proposition 19 α ∗ ∗ !  δ √ 2 δ2 = O √ + α + + α + δ since α ≤ Cψ (α, yδ), α α ∗ HD = O (δ) , choosing α = α(δ) = δ.

Note that if the source condition (3.3) holds, α∗ is selected as the minimiser of ψ ∈ {ψHR, ψSQO, ψRQO} and the respective auto-regularisation conditions ((ARC-HR),(ARC-SQR), or (ARC-RQO)) are satisfied, and additionally, that for some µ > 0 µ δ α∗ ≤ Cψ(α∗, y ), (3.25) then one may also prove, analogously that

1 δ † µ Dξ(xα, x ) = O(δ ), for δ → 0. In the linear theory, the inequality (3.25) corresponds to (12) from Chapter 2, and we recall that under certain additional regularity conditions on x†, e.g., (2.26), optimal convergence rates are obtained (cf. Theorem 6). We note that the condition on ∂R∗, (3.23), holds if R∗ is continuously differ- ∗ p entiable in 0 with ∇R(0) = 0. This is true, for instance, for R (x) = kxk`p with 1 < p < ∞. On the other hand we observe that the conclusion in Proposition 25 is not satisfied for exact penalisation methods [21] (such as 1 ` -regularisation) as then the residual pα (and hence ψHD) could vanish for nonzero α, which does not concur with the estimate (3.24).

3.3 Diagonal Operator Case Study

This section, as the rest of the chapter, is from the paper [90]. In the following analysis, we consider the case of the operator A : X → Y being diagonal between spaces of summable sequences; in particular, X = `q(N), with 1 < q < ∞, and Y = `2(N), and the regularisation functional is selected as the `q-norm to the q-th power divided by q. The main objective in this setting is to state sufficient conditions for the auto-regularisation conditions in the form of Muckenhoupt-type inequalities similar to (2.17) and to illustrate their validity for specific instances.

94 Let {en}n∈N be the canonical (i.e., Cartesian) basis for X and also Y , and let {λn}n∈N be a sequence of real (and for simplicity) positive scalars monoton- ically decaying to 0. Then we define a diagonal operator A : `q(N) → `2(N),

Aen = λnen. (3.26)

The regularisation functional is chosen as the q-th power of the `q-norm (divided by q):

1 q q−1 R := k · k q with ∂R(x) = {|x | sgn(x )} , q ∈ (1, ∞). q ` n n n∈N (3.27) As we assume q > 1, the choice of sgn at 0 is not relevant and we may assume sgn(0) = 0 throughout. In the present situation, the problem decouples and the components of the regularised solution can be computed independently of each other. Thus, for notational purposes, one may omit the sequence index n for the components of the regularised solutions and write

δ δ δ xα =: {xα,n}n =: {xα }n,

xα =: {xα,n}n =: {xα}n, δ δ y =: {y }n,

y =: {y}n,

δ δ δ where xα, xα , y , y ∈ R. As the problem decouples, xα and xα can be computed by an optimisation problem on R, i.e., the optimality conditions read

δ δ δ q−1 y α xα + γn|xα | sgn(x) = , with γn := 2 , λn λn

II II δ q−1 and similar expressions hold for xα,δ := xαn,δn . Because the term |xα | is homogeneous of degree q − 1, by an appropriate scaling we can further δ simplify expressions: define the components of pα, pα as

δ δ δ pα := y − λnxα , pα := y − λnxα,

δ II and we use the expressions y, y , ∆y, ∆pα, ∆pα to denote the components of δ II y , y, ∆y, ∆pα, ∆pα , respectively, where we again omit the sequence index n in the notation:

δ δ y = yn,

y = yn, δ ∆y : = yn − yn, δ ∆pα : = pαn − pαn, II II II ∆pα : = pαn − pα,δ,

95 for n ∈ N. Letting

1 1   2−q q−1 2−q α hq(x) := x + |x| sgn(x), x ∈ R, ηn := γn = 2 , λn −1 Φq(y) := hq (y), we obtain via some simple calculation that

δ yδ II  yδ yδ  = η Φ ( ), = η Φ , +Φ ∗ ( ) , (3.28) xα n q ηnλn xα,δ n q ηnλn q ηnλn δ yδ II   yδ yδ  yδ  = λ η Φ ∗ ( ), = λ η Φ ∗ + Φ ∗ ( ) − Φ ∗ ( ) , pα n n q ηnλn pα,δ n n q ηnλn q ηnλn q ηnλn (3.29)

∗ where q is the conjugate index to q. Note that Φq corresponds to a proximal operator on R. For x > 0 we have

q−1 x ≤ hq(x). (3.30)

Moreover, Φq is monotonically increasing and it is not difficult to verify that for any 1 < q < ∞,

1 q∗−2 x 7→ Φq∗ (x 2−q ) is increasing for x > 0. (3.31)

We now state useful estimates for the function Φq:

Lemma 11. For 1 < q < ∞, q 6= 2, there exist constants Dp,Dq, and for any τ > 0, a constant Dq,τ , such that for all x1 > 0 and |x2| ≤ x1, 1 Φ (x ) − Φ (x ) 1 ≤ q 1 q 2 ≤ , (3.32) q−2 q−2 1 + DpΦq(x1) x1 − x2 1 + DpΦq(x1)  1 2−q  x q−1 if 1 < q < 2, 1 D 1 ≤ q (3.33) q−2 1 2−q 1 + DpΦq(x1)  q−1  x1 if q > 2 and ∀x1 > τ. Dq,τ

Proof. For any z1 > 0, |z2| ≤ z1, we have

h (z ) − h (z ) zq−1 − |z |q−1 sgn(z ) q 1 q 2 = 1 + 1 2 2 z1 − z2 z1 − z2   ( q−2 q−2 z2 ≤ 1 + z1 Dp, = 1 + z1 k q−2 z1 ≥ 1 + z1 Dp, where 1 − |z|q−1 sgn(z) D ≤ k(z) := ≤ D , ∀z ∈ [−1, 1]. p 1 − z p

96 Replacing zi by Φq(xi) yields the lower bound and the first upper bound in (3.32). In case 1 < q < 2, we find that

2−q 1 Φ (x )2−q x q−1 = q 1 ≤ 1 , q−2 2−q 1 + DpΦq(x1) Φq(x1) + Dp Dp

2−q where we used that Φq(x1) ≥ 0 in the denominator and the estimate 1 q−1 Φq(x1) ≤ x1 that follows from (3.30). Now consider the case q > 2. Then

2−q q−1 2−q 1 Cτ q−1 ≤ x ∀x1 ≥ τ, q−2 1 1 + DpΦq(x1) Dp

q−1 where we used the estimate hq(x) ≤ Cτ x for x ≥ hq(τ). This yields the result.

3.3.1 Muckenhoupt Conditions As in the rest of this section, we follow the example of [90] very closely. In case the forward operator is diagonal, we may specialise the auto-regularisation condition to Muckenhoupt-type inequalities [83, 87, 88] similar to the linear case, cf. (2.17) in Chapter 2. If we consider A : X → Y a compact operator 2 and R = k · k`2 , then the Muckenhoupt-type conditions take the following form, with some t ∈ {1, 2}: There exist constants C1,C2 such that for all ∆y

 2 t−1 X 2 α X 2 σn |∆y| ≤ C2 |∆y| , (3.34) σ2 α 2 n 2 σn σn {n: α ≥C1} {n: α

δ for all α ∈ (0, αmax), where ∆y = hy − y, uni with un the eigenvectors and 2 ∗ σn the eigenvalues of AA . The integer t is taken as t = 1 for the heuristic discrepancy and Hanke-Raus rules and t = 2 for the quasi-optimality rules, which is analogous with the theory of Chapter 2, where the Np condition (2.17) held for the HD and HR rules with p = 1 and for the quasi-optimality rule with p = 2. Here, t plays the role of p. In this case, the linear auto- regularisation conditions hold for the respective rules and one can prove convergence of the method. The Muckenhoupt inequalities hold in many situations; see, e.g., [87,88]. In order to realise the insight (3.34) provides, one can observe that the right- hand side of the inequality (3.34) represents the high frequency components of the noise. Thus, in order for this upper bound to hold, one can interpret this as requiring that the noise be sufficiently irregular. This is analogous with the requirements of Chapter 2. In the case of the diagonal setting above, we have that σn = λn and the definition of ∆y agrees with that in (3.34).

97 For later reference, we define the following sequence of positive numbers:

q δ 2−q θq,n := λn max{|y|, |y |} . (3.35) Then the following theorem provides a sufficient condition for the auto- regularisation condition to hold:

The Heuristic Discrepancy Rule Theorem 16. Let A be a diagonal operator (3.26) and R the regularisation functional in (3.27) with q ∈ (1, ∞). Suppose that for some constant C1, δ there exists a constant C2 such that for all y and α ∈ (0, αmax)

X 2 α X 2 |∆y| ≤ C2 |∆y| . (3.36) θq,n θq,n θq,n {n: α ≥C1} {n: α

X X 2 ∆pα∆y ≤ C |∆pα| . (3.37) n∈N n∈N

Let IHD ⊂ N be a set of indices where, for some fixed constants β1, β2, it holds that

n ∈ IHD ⇒ |∆y| ≤ β1|∆pα|, (3.38) α n 6∈ IHD ⇒ |∆pα| ≤ β2|∆y| . (3.39) θq,n Note that we construct this set in the proceeding. We first show that for (3.37), it is sufficient that there exists a constant C2 such that

X 2 α X 2 |∆y| ≤ C2 |∆y| . (3.40) θq,n n6∈IHD n∈IHD Indeed, splitting the sum in (3.37) into two parts and using (3.39), (3.38), and (3.40), and noting that ∆pα always has the same sign as ∆y, we obtain X X X ∆pα∆y = ∆pα∆y + ∆pα∆y

n∈N n6∈IHD n∈IHD X 2 α X 2 ≤ β2 |∆y| + β1 |∆pα| θq,n n6∈IHD n∈IHD X 2 X 2 2 X 2 ≤ β2C2 |∆y| + β1 |∆pα| ≤ (β2C2β1 + β1) |∆pα| .

n∈IHD n∈IHD n∈IHD

98 The Lipschitz continuity of the proximal mapping |∆pα| ≤ |∆y| now implies (3.37). Note that (3.36) has the form of (3.40) with θ I := {n : q,n < C }. HD α 1

Thus, it remains to verify that for this index set, there exist constants β1, β2 for which (3.39) and (3.38) hold. We note that by monotonicity, the ratio ∆pα is always positive and invariant ∆y when y, yδ are switched and when y, yδ are replaced by −y, −yδ. Thus, without loss of generality, we may assume that y > 0 and |yδ| ≤ y such that y = max{|y|, |yδ|}. Using this convention, noting that

1  α  2−q λ η = n n λq and the definition of θq,n, (3.35), we have that  2−q δ 2−q q y max{|y|, |y |} λ θq,n = n = . (3.41) λnηn α α

Thus for n ∈ IHD and by (3.31), we have

∗ ∗  q −2  1 q −2 y 2−q Φq∗ ≤ Φq∗ C1 . λnηn Using (3.29), Lemma 11 with y max{|y|, |yδ|} yδ x1 = = and x2 = , λnηn λnηn λnηn we find that for n ∈ IHD

∆pα 1 1 ≥ q∗−2 ≥ q∗−2 > 0, (3.42) ∆y  y   1  1 + D Φq∗ 2−q p λnηn 1 + DpΦq∗ C1 which verifies (3.38). θq,n In view of (3.39), let n 6∈ IHD, then as α ≥ C1, we conclude that 1 y 2−q ∗ ≥ C1 if q ≤ 2 ( i.e. q > 2). (3.43) λnηn y Applying Lemma 11 with Φ ∗ , x = , we observe that the conditions on q 1 λnηn the right-hand side of (3.33) hold true by (3.43). Noting that the exponents 2−q∗ in (3.32) satisfy q∗−1 = q − 2, we obtain the upper bound

2−q∗   q∗−1  q−2 ∆pα y y α ≤ C˜ = C˜ = C˜ , (3.44) ∆y λnηn λnηn θq,n which verifies (3.39) and thus completes the proof.

99 The Hanke-Raus Rule Contrary to the heuristic discrepancy case, we have to impose a restriction on the regularisation functional exponent q in order to keep certain expressions positive: 3 Lemma 12. If q ≥ 2 , then it follows that II ∆pα ∆pα ≥ 0, δ for all α ∈ (0, αmax) and y ∈ Y .

δ Proof. Setting z = y , z = y , noting (3.29) and the identity 1 ηnλn 2 ηnλn

Φq∗ (z1 + Φq∗ (z1)) = Φq∗ (hq∗ (Φq∗ (z1)) + Φq∗ (z1)) , it is enough to verify that the mapping

F : p 7→ Φq∗ (hq∗ (p) + p) − p, is monotonically increasing. As this function is differentiable everywhere except at p = 0, it suffices to prove the inequality ∗ q∗−2 0 2 + (q − 1)|p| 0 ≤ F (p) = ∗ q∗−2 − 1, 1 + (q − 1)|Φq∗ (hq∗ (p) + p) | for any p ∈ R. Since F is antisymmetric and hence F 0 is symmetric, it is in fact sufficient to prove this inequality for p > 0. Setting r = Φq∗ (hq∗ (p) + p) ≥ p, we thus have to show that 2 + (q∗ − 1)|p|q∗−2 ≥ 1, where h ∗ (r) = h ∗ (p) + p. (3.45) 1 + (q∗ − 1)|r|q∗−2 q q

Defining the number ζ implicitly by hq∗ (ζp) = hq∗ (p) + p, (i.e., r = ζp), it q∗−2 2−ζ follows that ζ ∈ [1, 2] and that p = ζq∗−1−1 . Plugging this formula and that for r into the inequality (3.45), we obtain that monotonicity holds if (2 − ζ)(ζq∗−2 − 1) (q∗ − 1) ≤ 1, ∀ζ ∈ [1, 2]. (3.46) ζq∗−1 − 1 Some detailed analysis reveals that this is satisfied for q∗ ≤ 3, which is 3 equivalent to q ≥ 2 . The next lemma is needed to estimate a term in (ARC-HR). Lemma 13. Let n ∈ N be such that θ q,n ≤ C , (3.47) α 1 with a constant C1 that is sufficiently small. Then there is a constant β1 depending on C1 but independent of n with II ∆pα∆y ≤ β1∆pα ∆pα.

100 Proof. Define

δ δ δ δ y y y+pα y y y +pα x = + Φ ∗ ( ) = , x = + Φ ∗ ( ) = . 1 ηnλn q ηnλn ηnλn 2 ηnλn q ηnλn ηnλn

δ From the definition of pα, pα it follows that sgn(x1) = sgn(y), sgn(x2) = δ δ sgn(y ) and x1, x2 are increasing functions of pα, pα, respectively. Thus the following ratio

II ∆p + ∆pα Φq∗ (x1) − Φq∗ (x2) RII := α = , ∆pα + ∆y x1 − x2 is always positive and, moreover, invariant when x1, x2 are switched and, respectively, replaced by −x1, −x2. Thus, we may assume without loss of generality (otherwise we redefine the variables x1, x2) that x1 > 0 and |x2| ≤ δ x1, which is equivalent to y > 0 and |y | ≤ y. Applying Lemma 11 yields then

II II ∆pα + ∆pα Φq∗ (x1) − Φq∗ (x2) 1 R = = ≥ q∗−2 . ∆pα + ∆y x1 − x2 1 + DpΦq∗ (x1)

It follows from (3.47) and y ≤ y + pα ≤ 2y that

 2−q  2−q  2−q 2−q y + pα y θq,n 2−q x1 = ≤ C4 = C4 = C4 C1, ληn ληn α where C4 ∈ {2, 1} depending whether q > 2 or q < 2. In any case, we obtain with (3.31) that as before,

∗ ∗  q −2  1 q −2 y + pα 2−q Φq∗ ≤ Φq∗ C4C1 , λnηn and hence II 1 R ≥ ∗ .  1 q −2 2−q 1 + DpΦq∗ C4C1

Some standard calculus furthermore reveals that

∗  1 q −2 2−q lim Φq∗ C4C1 = 0. C1→0

Thus we may choose C1 sufficiently small such that

∗  1 q −2 2−q DpΦq∗ C4C1 ≤ θ < 1, (3.48)

101 II 1 1 as then R ≥ 1+θ > 2 . From this inequality and using Lipschitz continuity of the residuals, |∆pα| ≤ |∆y|, we find that II II 2  2 ∆pα ∆pα = R |∆pα| + ∆y∆pα − |∆pα|

1 1 2 ≥ ∆y∆pα − (1 − )|∆pα| 1 + θ 1 + θ 2 ≥ ( − 1)∆y∆pα. 1 + θ This completes the proof. 3 Theorem 17. Let A and R be as in Theorem 16 with 2 ≤ q < ∞. Suppose that there is a sufficiently small constant C1, and a constant C2 such that for all yδ, (3.36) holds. Then the auto-regularisation condition (ARC-HR) holds for the Hanke-Raus rule.

Proof. As in the HD-case, we define a set of indices IHR with the property II n ∈ IHR ⇒ ∆pα∆y ≤ β1∆pα ∆pα and ∆y ≤ β2∆pα, (3.49) α n 6∈ IHR ⇒ |∆pα| ≤ β3|∆y| . (3.50) θq,n

Then, sufficient for (ARC-HR) is that a constant C2 exists with

$$\sum_{n \notin I_{HR}} |\Delta_y|^2\,\frac{\alpha}{\theta_{q,n}} \leq C_2 \sum_{n \in I_{HR}} |\Delta_y|^2. \tag{3.51}$$
This can be seen as follows:
$$\begin{aligned}
\sum_{n\in\mathbb{N}} \bigl(\Delta_{p_\alpha}\Delta_y - |\Delta_{p_\alpha}|^2\bigr) &\leq \sum_{n\in I_{HR}} \Delta_{p_\alpha}\Delta_y + \sum_{n\notin I_{HR}} \Delta_{p_\alpha}\Delta_y \leq \beta_1 \sum_{n\in I_{HR}} \Delta_{p_\alpha}^{II}\Delta_{p_\alpha} + \beta_3 \sum_{n\notin I_{HR}} |\Delta_y|^2\,\frac{\alpha}{\theta_{q,n}} \\
&\leq \beta_1 \sum_{n\in I_{HR}} \Delta_{p_\alpha}^{II}\Delta_{p_\alpha} + \beta_3 C_2 \sum_{n\in I_{HR}} |\Delta_y|^2 \leq (\beta_1 + \beta_3 C_2\beta_2\beta_1) \sum_{n\in I_{HR}} \Delta_{p_\alpha}^{II}\Delta_{p_\alpha} \leq (\beta_1 + \beta_3 C_2\beta_2\beta_1) \sum_{n\in\mathbb{N}} \Delta_{p_\alpha}^{II}\Delta_{p_\alpha},
\end{aligned}$$
where we used that $\Delta_{p_\alpha}^{II}\Delta_{p_\alpha} \geq 0$ in the last step. Hence (ARC-HR) follows from (3.51). Note that (3.36) has the form (3.51) with
$$I_{HR} := \Bigl\{n : \frac{\theta_{q,n}}{\alpha} \leq C_1\Bigr\},$$
and $C_1$ sufficiently small. We have already shown that for such indices, $\Delta_y \leq \beta_2\Delta_{p_\alpha}$ holds, and for $n \notin I_{HR}$, (3.50) holds. Moreover, from Lemma 13 it follows that on $I_{HR}$ also $\Delta_{p_\alpha}\Delta_y \leq \beta_1\Delta_{p_\alpha}^{II}\Delta_{p_\alpha}$ holds. Thus, collecting these results yields that (3.36) implies the auto-regularisation condition (ARC-HR).

The smallness condition on $C_1$ is given by (3.48).

The Symmetric Quasi-Optimality Rule

Similarly to the Hanke-Raus rule, we first have to verify the nonnegativity of certain expressions:

Lemma 14. If $q \geq \frac{3}{2}$, then
$$\bigl(\Delta_{p_\alpha} - \Delta_{p_\alpha}^{II}\bigr)\Delta_{p_\alpha}^{II} \geq 0,$$
for all $\alpha \in (0, \alpha_{\max})$ and $y, y^\delta \in \mathbb{R}$.

Proof. Recall the mapping $F : p_\alpha \mapsto p_\alpha^{II}$ defined in the proof of Lemma 12. In order to prove the statement, it is enough to show that
$$\bigl((p_1 - F(p_1)) - (p_2 - F(p_2))\bigr)\bigl(F(p_1) - F(p_2)\bigr) \geq 0 \qquad \forall p_1, p_2.$$
It is not difficult to see that if $F$ is monotone and Lipschitz continuous, then this inequality holds true. Thus, we have to prove that
$$0 \leq F'(p) \leq 1, \qquad \forall p.$$

As in the proof of Lemma 12, we may employ the variable $\zeta \in [1,2]$, for which the inequality $0 \leq F'(p)$ was verified for $q \geq \frac{3}{2}$. The additional condition $F'(p) \leq 1$ leads to $\zeta^{q^*-2} \geq \frac{1}{2}$, which holds for any $q^* \geq 1$. This shows the result.

Theorem 18. Let $A$ and $\mathcal{R}$ be as in Theorem 16 with $\frac{3}{2} \leq q < \infty$. Suppose that there are constants $C_1, C_2, C_3$, with $C_1$ sufficiently small, such that for all $y^\delta$ and $\alpha \in (0, \alpha_{\max})$
$$\sum_{\{n:\, \theta_{q,n}/\alpha \,\geq\, C_1\}} |\Delta_y|^2\,\frac{\alpha}{\theta_{q,n}} \;+ \sum_{\{n:\, \theta_{q,n}/\alpha \,\leq\, C_1\}\cap I_2^c} |\Delta_y|^2 \;\leq\; C_2 \sum_{\{n:\, \theta_{q,n}/\alpha \,\leq\, C_1\}\cap I_2} |\Delta_y|^2 \Bigl(\frac{\theta_{q,n}}{\alpha}\Bigr)^{\frac{1}{q-1}}. \tag{3.52}$$
Then the auto-regularisation condition (ARC-SQR) holds for the symmetric quasi-optimality rule.

Proof. We define an index set $I_{SQO}$ with the property that
$$n \in I_{SQO} \;\Rightarrow\; |\Delta_{p_\alpha} - \Delta_y| \leq \beta_1|\Delta_{p_\alpha} - \Delta_{p_\alpha}^{II}| \ \text{ and } \ |\Delta_{p_\alpha}| \leq \beta_2|\Delta_{p_\alpha}^{II}|. \tag{3.53}$$
Then, for (ARC-SQR) it is sufficient to prove that
$$\sum_{n \notin I_{SQO}} (\Delta_y - \Delta_{p_\alpha})\Delta_{p_\alpha} \;\leq\; C \sum_{n \in I_{SQO}} \bigl(\Delta_{p_\alpha} - \Delta_{p_\alpha}^{II}\bigr)\Delta_{p_\alpha}^{II}. \tag{3.54}$$

This can be seen as in the previous cases, since the sum $\sum_{n \in I_{SQO}} (\Delta_y - \Delta_{p_\alpha})\Delta_{p_\alpha}$ can be bounded by $\sum_{n \in I_{SQO}} (\Delta_{p_\alpha} - \Delta_{p_\alpha}^{II})\Delta_{p_\alpha}^{II}$ by the definition of $I_{SQO}$, and the sum $\sum_{n \notin I_{SQO}}$ on the right can be bounded from below by 0 according to Lemma 14. Now take $I_{SQO} = I_{HR} \cap I_2$.

Similarly to the Hanke-Raus rule, the inequality $|\Delta_{p_\alpha}| \leq \beta_2|\Delta_{p_\alpha}^{II}|$ holds on $I_{HR}$ with $C_1$ sufficiently small; thus $I_{SQO}$ satisfies the requirements (3.53). It remains to show that the stated condition (3.52) implies (3.54). The left-hand side of (3.54) can be bounded from above by
$$\sum_{n \notin I_{SQO}} (\Delta_y - \Delta_{p_\alpha})\Delta_{p_\alpha} \leq \sum_{n \notin I_{HR}} (\Delta_y - \Delta_{p_\alpha})\Delta_{p_\alpha} + \sum_{n \in I_{HR}\cap I_2^c} (\Delta_y - \Delta_{p_\alpha})\Delta_{p_\alpha} \leq \sum_{n \notin I_{HR}} |\Delta_y|^2\,\frac{\alpha}{\theta_{q,n}} + \sum_{n \in I_{HR}\cap I_2^c} |\Delta_y|^2,$$
where we used the estimates (3.39) on the complement of $I_{HR}$ and the Lipschitz continuity of the proximal mapping for the second sum:
$$(\Delta_y - \Delta_{p_\alpha})\Delta_{p_\alpha} \leq |\Delta_y||\Delta_{p_\alpha}| \leq |\Delta_y|^2.$$
Thus, the left-hand side of (3.52) serves as an upper bound for the left-hand side of (3.54). The sum on the right-hand side of (3.54) can be bounded from below as follows: the summation index $n$ is in $I_{SQO} \subset I_{HR}$, hence
$$\begin{aligned}
\bigl(\Delta_{p_\alpha} - \Delta_{p_\alpha}^{II}\bigr)\Delta_{p_\alpha}^{II} &\geq |\Delta_{p_\alpha} - \Delta_y|\,|\Delta_{p_\alpha}| \geq \beta_1(\Delta_y - \Delta_{p_\alpha})\Delta_y = \beta_1|\Delta_y|^2\Bigl(1 - \frac{\Delta_{p_\alpha}}{\Delta_y}\Bigr) \geq \beta_1|\Delta_y|^2\Biggl(1 - \frac{1}{1 + D_p\,\Phi_{q^*}\!\bigl(\frac{y}{\lambda_n\eta_n}\bigr)^{q^*-2}}\Biggr) \\
&= \beta_1|\Delta_y|^2\,\frac{D_p\,\Phi_{q^*}\!\bigl((\frac{\theta_{q,n}}{\alpha})^{\frac{1}{2-q}}\bigr)^{q^*-2}}{1 + D_p\,\Phi_{q^*}\!\bigl((\frac{\theta_{q,n}}{\alpha})^{\frac{1}{2-q}}\bigr)^{q^*-2}} \geq \beta_1|\Delta_y|^2\,\frac{C\bigl(\frac{\theta_{q,n}}{\alpha}\bigr)^{\frac{1}{q-1}}}{1 + D_p\,\Phi_{q^*}\!\bigl(C_1^{\frac{1}{2-q}}\bigr)^{q^*-2}},
\end{aligned}$$
where we used (3.41), (3.42), and a bound for $z > 0$ on $I_{SQO}$ of the form
$$\Phi_{q^*}\!\bigl(z^{\frac{1}{2-q}}\bigr)^{q^*-2} \geq C'\,z^{\frac{1}{q-1}},$$
that can be obtained by similar means as above. Thus, the right-hand side of (3.52) is a lower bound for the right-hand side of (3.54). Together, (3.52) implies (3.54) and thus the desired auto-regularisation condition.

Remark. The condition in (3.52) has an additional sum over the index set $I_{HR} \cap I_2^c$ on the left-hand side. It might be possible to prove that this set is empty, e.g., if $I_{HR} \subset I_2$; then the corresponding sum would vanish, and this is what happens in the linear case ($q = 2$). However, we postpone a more detailed analysis of this issue to the future. We also point out that the Muckenhoupt-type conditions (3.36) and (3.52) (except for the additional sum) agree with the respective ones for the linear case $q = 2$, so that they appear, in fact, as natural extensions of the linear convergence theory.

Case Study of Noise Restrictions

For the case that the operator ill-posedness, the regularity of the exact solution, and the noise show some typical behaviour, we investigate the restrictions that the Muckenhoupt-type condition (3.36) imposes on the noise. In particular, we would like to point out that these restrictions are completely realistic and are satisfied in paradigmatic situations. Consider a polynomially ill-posed problem, with a given decay of the exact solution and a polynomial decay of the error:
$$\lambda_n = \frac{D_1}{n^\beta}, \qquad |y| = \frac{D_2}{n^\nu}, \qquad \Delta_y = \delta\,s_n\,\frac{1}{n^\kappa},$$
for $\nu > \beta > 0$, $0 < \kappa < \nu$, and $s_n \in \{-1, 1\}$. The restrictions $\kappa < \nu$, $\nu > \beta > 0$ are natural, as the noise is usually less regular than the exact solution and the exact solution has higher decay rates than $\lambda_n$ due to regularity. In the linear case, Muckenhoupt-type conditions lead to restrictions on the regularity of the noise, i.e., upper bounds for the decay rate $\kappa$. This is perfectly in line with their interpretation as conditions for sufficiently irregular noise. In the following, we write $\sim$ if the left and right expressions can be estimated by constants independent of $n$ (there may be a $q$-dependence, however). The numbers $\theta_{q,n}$ that appear in (3.36) now read as
$$\theta_{q,n} := \max\{|y|, |y^\delta|\}^{2-q}\,\lambda_n^q = \max\{1, |y^\delta|/|y|\}^{2-q}\,|y|^{2-q}\,\lambda_n^q \sim \frac{1}{n^{\beta q + \nu(2-q)}}\,\max\Bigl\{1, \Bigl|1 + \frac{s_n\delta}{C_2}\,n^{\nu-\kappa}\Bigr|\Bigr\}^{2-q}.$$

We additionally impose the restriction that $\theta_{q,n} \to 0$ monotonically for sufficiently large $n$. If $2 - q > 0$, this is trivially satisfied, while for $2 - q < 0$ we require that
$$\beta q + \kappa(2-q) > 0, \qquad \text{if } 2-q < 0. \tag{3.55}$$

Under these assumptions, for any $\alpha$ sufficiently small, we find an $n^*$ such that $\theta_{q,n^*} = C_1\alpha$ and $\theta_{q,n} \leq C_1\alpha$ for $n \geq n^*$. Expressing $\alpha$ in terms of $\theta_{q,n^*}$ yields a sufficient condition for (3.36), as
$$\theta_{q,n^*} \sum_{n=1}^{n^*} \frac{|\Delta_y|^2}{\theta_{q,n}} \;\leq\; C \sum_{n=n^*+1}^{\infty} |\Delta_y|^2 \sim \frac{1}{n^{*\,2\kappa-1}}. \tag{3.56}$$

By the straightforward estimate $\max\bigl\{1, \bigl|1 + \tfrac{s_n\delta}{C_2}n^{\nu-\kappa}\bigr|\bigr\} \sim 1 + \delta n^{\nu-\kappa}$, the inequality (3.56) reduces to
$$\frac{(1 + \delta n^{*\,\nu-\kappa})^{2-q}}{n^{*\,\beta q + \nu(2-q) - 2\kappa}} \sum_{n=1}^{n^*} \frac{n^{\beta q + \nu(2-q) - 2\kappa}}{(1 + \delta n^{\nu-\kappa})^{2-q}} \leq C n^*. \tag{3.57}$$
For any $x \geq 0$ and $0 \leq z \leq 1$, it holds that
$$1 \leq \frac{1 + x}{1 + zx} \leq \frac{1}{z}.$$

We use this inequality with $z = (n/n^*)^{\nu-\kappa}$ and $x = \delta n^{*\,\nu-\kappa}$. Then, we obtain the sufficient conditions
$$\begin{cases} \dfrac{1}{n^{*\,\beta q+\nu(2-q)-2\kappa-(2-q)(\nu-\kappa)}} \displaystyle\sum_{n=1}^{n^*} n^{\beta q+\nu(2-q)-2\kappa-(2-q)(\nu-\kappa)} \leq C n^*, & 2-q > 0,\\[2ex] \dfrac{1}{n^{*\,\beta q+\nu(2-q)-2\kappa}} \displaystyle\sum_{n=1}^{n^*} n^{\beta q+\nu(2-q)-2\kappa} \leq C n^*, & 2-q < 0.\end{cases}$$
These inequalities are satisfied if the exponent of $n$ is strictly larger than $-1$. This finally leads to the restrictions
$$\begin{cases} \kappa \leq \beta + \dfrac{1}{q}, & q < 2,\\[1ex] \kappa \leq \dfrac{2-q}{2}\,\nu + \dfrac{q}{2}\,\beta + \dfrac{1}{2}, & q > 2.\end{cases}$$

Note that for $q > 2$, we additionally require (3.55). We hope to have convinced the reader that the imposed conditions on the noise are not too restrictive and that, in particular, the set of noise satisfying them is nonempty. These conditions provide a hint for which cases the methods may work or fail: in case $\frac{3}{2} \leq q \leq 2$, both the heuristic discrepancy and the Hanke-Raus rules are reasonable. The conditions on the noise are less restrictive the smaller $q$ is. In case $1 < q < \frac{3}{2}$, our convergence analysis only applies to the heuristic discrepancy rule, as the nonnegativity condition of the Hanke-Raus rule is
not guaranteed in this case. It could be said that the heuristic discrepancy rule is the more robust one then. In case $q > 2$, we observe that the restriction on the noise depends on the regularity of the exact solution. For highly regular exact solutions ($\nu \gg 1$), the noise condition might fail to be satisfied as $q$ becomes very large. This happens for both the heuristic discrepancy and the Hanke-Raus rules. We did not include the quasi-optimality condition in this analysis, as it still requires further study. However, the conditions for it are usually even more restrictive than for the Hanke-Raus rules, and we expect similar problems for the case $q > 2$.

Chapter 4

Iterative Regularisation

In this chapter, we initially revert to treating linear ill-posed problems through an iterative method, namely Landweber iteration (cf. [97]) which the reader might recall from Section 1.3.2. This presents an alternative approach to Tikhonov regularisation which we covered quite thoroughly in the preceding chapters. In the second section of this chapter, we then cover the basics of Landweber iteration for nonlinear forward operators.

4.1 Landweber Iteration for Linear Operators

Let $A : X \to Y$ be a linear operator mapping between Hilbert spaces. Then recall that we may solve the linear problem $Ax = y$ via Landweber iteration (1.24), more specifically defined as the iterative procedure
$$x_k^\delta = x_{k-1}^\delta + \omega A^*\bigl(y^\delta - Ax_{k-1}^\delta\bigr), \qquad k \in \mathbb{N}, \tag{4.1}$$
with an initial guess $x_0^\delta := x_0$, which we take as $x_0 = 0$, and a relaxation (i.e., step-size) parameter $\omega \in (0, \|A\|^{-2}]$. Note that without loss of generality, we may assume $\|A\| < 1$ and thence drop the parameter $\omega$. In terms of spectral theory, which we will use for our analysis as we did for Tikhonov regularisation, the Landweber iterates may be expressed as
$$x_k = \int_0^\infty g_k(\lambda)\, dE_\lambda A^* y, \tag{4.2}$$
with the filter function this time defined as
$$g_k(\lambda) := \sum_{j=0}^{k-1} (1-\lambda)^j, \qquad \text{i.e.,} \qquad x_k = \sum_{j=0}^{k-1} (I - A^*A)^j A^* y,$$

where the second equality follows from the geometric sum formula and our assumption that $\|A\| < 1$. Incidentally, the filter function in (4.2) may also be expressed as
$$g_k(\lambda) = \frac{1 - (1-\lambda)^k}{\lambda}, \qquad (0 < \lambda < 1)$$
and consequently we can write the filter function for the associated residual as
$$r_k(\lambda) = (1-\lambda)^k. \tag{4.3}$$
We now state the following proposition from [35, Theorem 6.1, p. 155], which proves that Landweber iteration is a convergent regularisation method:

Proposition 27. If $y \in \operatorname{dom} A^\dagger$, then $x_k \to x^\dagger$ as $k \to \infty$. If $y \notin \operatorname{dom} A^\dagger$, then $\|x_k\| \to \infty$ as $k \to \infty$.

Proof. For $y \in \operatorname{dom} A^\dagger$, we have
$$x_k - x^\dagger = r_k(A^*A)x^\dagger = (I - A^*A)^k x^\dagger,$$
which follows from the definition of the residual filter function, cf. (4.3). Moreover, due to our assumption that $\|A\| < 1$, we have that for all $\lambda \in (0,1)$:
$$|\lambda g_k(\lambda)| = \Bigl|\lambda \sum_{j=0}^{k-1} (1-\lambda)^j\Bigr| = |1 - (1-\lambda)^k| \leq C,$$
and
$$\lambda g_k(\lambda) = 1 - (1-\lambda)^k \to 1,$$
as $k \to \infty$ for $\lambda > 0$. Hence,
$$g_k(\lambda) \to \frac{1}{\lambda}, \qquad (\lambda > 0)$$
as $k \to \infty$. The rest of the proof then follows à la [35, Theorem 4.1, p. 72].

The observant reader will have noticed that the above proposition is merely a particular instance of Theorem 3. Now, in the next proposition, we provide the standard error estimates [35]:

Proposition 28. Assume that $x^\dagger$ satisfies the source condition (2.5). Then we have
$$\|x_k^\delta - x_k\| \leq C\sqrt{k}\,\delta, \qquad \|x_k - x^\dagger\| \leq C(\mu,\omega)(k+1)^{-\mu},$$
$$\|A(x_k^\delta - x_k)\| \leq 2\delta, \qquad \|Ax_k - y\| \leq C(\mu,\omega)(2\mu + k + 1)^{-\mu-1}\sqrt{k},$$
and consequently
$$\|x_k^\delta - x^\dagger\| \leq C\sqrt{k}\,\delta + C(\mu,\omega)(k+1)^{-\mu}, \qquad \|Ax_k^\delta - y^\delta\| \leq \delta + C(\mu,\omega)(2\mu + k + 1)^{-\mu-1}\sqrt{k},$$
for all $k \in \mathbb{N}$ and $y, y^\delta \in Y$.

Proof. We begin by following the example of [35, Lemma 6.2, p. 156], first estimating the data propagation error as
$$\|x_k^\delta - x_k\| = \Bigl\|\sum_{i=0}^{k-1}(I - A^*A)^i A^*(y^\delta - y)\Bigr\| \leq \Bigl\|\sum_{i=0}^{k-1}(I - A^*A)^i A^*\Bigr\|\,\|y^\delta - y\|.$$

Moreover, recalling that {Rk} define our family of regularisation operators, the first term in the product can be estimated as

$$\begin{aligned}
\Bigl\|\sum_{i=0}^{k-1}(I - A^*A)^i A^*\Bigr\|^2 = \|R_k\|^2 = \|R_kR_k^*\| &= \Bigl\|\sum_{i=0}^{k-1}(I - A^*A)^i\bigl(I - (I - A^*A)^k\bigr)\Bigr\| \\
&= \Bigl\|\sum_{i=0}^{k-1}(I - A^*A)^i - \sum_{i=0}^{k-1}(I - A^*A)^i(I - A^*A)^k\Bigr\| \\
&\leq \Bigl\|\sum_{i=0}^{k-1}(I - A^*A)^i\Bigr\| + \Bigl\|\sum_{i=0}^{k-1}(I - A^*A)^i\Bigr\|\,\bigl\|(I - A^*A)^k\bigr\| \leq 2\Bigl\|\sum_{i=0}^{k-1}(I - A^*A)^i\Bigr\| \leq 2k,
\end{aligned}$$
which allows us to prove the first estimate. We proceed to estimate the approximation error [35, Theorem 6.5, p. 159]:
$$\begin{aligned}
\|x_k - x^\dagger\|^2 = \|(I - A^*A)^k x^\dagger\|^2 &= \int_0^{1+} (1-\lambda)^{2k}\, d\|E_\lambda x^\dagger\|^2 = \int_0^{1+} \lambda^{2\mu}(1-\lambda)^{2k}\, d\|E_\lambda \omega\|^2 \\
&\leq \Bigl(\frac{\mu}{\mu+k}\Bigr)^{2\mu} \int_0^{1+} d\|E_\lambda \omega\|^2 \leq \begin{cases} (k+1)^{-2\mu}\|\omega\|^2, & \text{if } \mu \leq 1,\\ \mu^{2\mu}(k+1)^{-2\mu}\|\omega\|^2, & \text{if } \mu > 1.\end{cases}
\end{aligned}$$

The residual with exact data may be estimated as
$$\begin{aligned}
\|Ax_k - y\|^2 = \|(I - AA^*)^k y\|^2 = \|A(I - A^*A)^k x^\dagger\|^2 &= \int_0^{1+} \lambda^{2\mu+1}(1-\lambda)^{2k}\, d\|E_\lambda\omega\|^2 \\
&\leq \Bigl(\frac{k}{2\mu+1+k}\Bigr)^{k}\Bigl(\frac{2\mu+1}{2\mu+1+k}\Bigr)^{2\mu+1}\|\omega\|^2 \leq (2\mu+1)^{2\mu+1}(2\mu+k+1)^{-2\mu-2}\,k\,\|\omega\|^2.
\end{aligned}$$

Furthermore,
$$\begin{aligned}
\|A(x_k^\delta - x_k)\| = \|Ag_k(A^*A)A^*(y^\delta - y)\| = \|AA^*g_k(AA^*)(y^\delta - y)\| &= \|(I - (I - AA^*)^k)(y^\delta - y)\| \\
&\leq \delta + \|(I - AA^*)^k(y^\delta - y)\| \leq 2\delta.
\end{aligned}$$

Finally, the residual with noise can be estimated as follows:
$$\begin{aligned}
\|Ax_k^\delta - y^\delta\| = \|(I - AA^*)^k y^\delta\| = \|(I - AA^*)^k(y^\delta - y + y)\| &\leq \|(I - AA^*)^k(y^\delta - y)\| + \|(I - AA^*)^k y\| \\
&\leq \delta + C(\mu,\omega)(2\mu+k+1)^{-\mu-1}\sqrt{k}.
\end{aligned}$$

Note that the estimate for the total error also follows from a simple application of the triangle inequality.
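For concreteness, the following is a minimal numpy sketch of the Landweber iteration (4.1) for a discretised linear problem. The operator, data, scaling, and iteration budget below are illustrative assumptions chosen only for demonstration and are not tied to any particular example in this thesis.

```python
import numpy as np

def landweber(A, y_delta, k_max, omega=None, x0=None):
    """Run k_max Landweber steps x_k = x_{k-1} + omega * A^T (y_delta - A x_{k-1}),
    cf. (4.1), and return the list of all iterates."""
    m, n = A.shape
    if omega is None:
        omega = 1.0 / np.linalg.norm(A, 2) ** 2     # omega in (0, ||A||^{-2}]
    x = np.zeros(n) if x0 is None else np.asarray(x0, dtype=float).copy()
    iterates = [x.copy()]
    for _ in range(k_max):
        x = x + omega * A.T @ (y_delta - A @ x)
        iterates.append(x.copy())
    return iterates

# Small illustrative example:
rng = np.random.default_rng(1)
A = rng.standard_normal((50, 50))
A /= 2.0 * np.linalg.norm(A, 2)                     # rescale so that ||A|| < 1
x_true = rng.standard_normal(50)
y_delta = A @ x_true + 1e-3 * rng.standard_normal(50)
xs = landweber(A, y_delta, k_max=200)
print(np.linalg.norm(xs[-1] - x_true))
```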

Remark. We observe, courtesy of the proposition above, that, as opposed to Tikhonov regularisation, Landweber iteration does not exhibit Hölder-type saturation, i.e., with a Hölder-type source condition (2.5), Proposition 2 holds for all $\mu > 0$. In other words, the qualification index is $\mu_0 = \infty$ [35]. It should be noted, however, that one may take a more general view of qualification à la [108]: if for all $0 \leq \lambda \leq \alpha_{\max}$ there exists a constant $C_\lambda > 0$ such that
$$\inf_{0 \leq \alpha \leq \alpha_{\max}} \frac{|1 - \lambda g_\alpha(\lambda)|}{\rho(\alpha)} \geq C_\lambda,$$

then $\rho(\alpha)$ is said to be the maximal qualification. Now, taking $\alpha \sim \frac{1}{k}$, [108] gives the maximal qualification for Landweber iteration as $\rho_\kappa(k) = e^{-\kappa k}$, with a positive constant $\kappa > 0$, and consequently with saturation $\varphi(\delta) = \delta\bigl(\log\frac{1}{\delta}\bigr)^{\frac{1}{2\kappa}}$. Thus, one concludes that one of the most "powerful" regularisations may not regularise optimally, even for mildly ill-posed problems (cf. [108]).

Remark. Note that in the absence of a source condition, one may still obtain estimates, and subsequently convergence, for the error components with respect to the exact data. For instance,
$$\|Ax_k - y\| \leq \frac{\|x_1 - x^\dagger\|}{\sqrt{k}},$$
where $x_1$ is the first Landweber iterate (cf. [35, p. 158]). In light of the estimates in the previous proposition, one may observe that the asymptotic behaviour of the function $k \mapsto \|x_k^\delta - x^\dagger\|$ is opposite to that of $\alpha \mapsto \|x_\alpha^\delta - x^\dagger\|$, where $x_\alpha^\delta$ minimises (2.1). That is, the data propagation error is of the order of $\sqrt{k}\,\delta$, which is the dominating term for large $k$ and "blows up" (i.e., tends to infinity) as $k \to \infty$. The approximation error is of the order of $(k+1)^{-\mu}$, which is larger for small $k$. Thus, as stated previously,
the behaviour of the stopping index can be somewhat loosely related to the Tikhonov regularisation parameter by $\alpha \sim \frac{1}{k}$. We now state the general convergence rates theorem for Landweber iteration [35]:

Theorem 19. If $x^\dagger$ satisfies a source condition (1.17), then choosing $k = k(\delta) = k_{\mathrm{opt}} := O\bigl(\delta^{-\frac{2}{2\mu+1}}\bigr)$ yields
$$\|x_{k_{\mathrm{opt}}}^\delta - x^\dagger\| = O\bigl(\delta^{\frac{2\mu}{2\mu+1}}\bigr),$$
as $\delta \to 0$.

Proof. From the previous estimates of Proposition 28, we have
$$\|x_k^\delta - x^\dagger\| \leq \|x_k^\delta - x_k\| + \|x_k - x^\dagger\| = O\bigl(\sqrt{k}\,\delta + (k+1)^{-\mu}\bigr),$$
and the result simply follows by bounding $(k+1)^{-\mu} \leq k^{-\mu}$ and the choice $k = k_{\mathrm{opt}}$.

4.1.1 Heuristic Stopping Rules

We may use a heuristic stopping rule to select $k_*$ according to (1.33) with the functionals defined as in Chapter 1:
$$\begin{aligned}
\psi_{HD}(k, y^\delta) &= \sqrt{k}\,\|p_k^\delta\| = \sqrt{k}\,\|r_k(AA^*)y^\delta\|,\\
\psi_{HR}(k, y^\delta) &= \sqrt{k}\,\langle p_{2k}^\delta, p_k^\delta\rangle^{\frac12} = \sqrt{k}\,\|r_k^{\frac32}(AA^*)y^\delta\|,\\
\psi_{L}(k, y^\delta) &= \langle x_k^\delta, x_{2k}^\delta - x_k^\delta\rangle^{\frac12} = \|A^* g_k(AA^*)\, r_k^{\frac12}(AA^*)\,y^\delta\|,\\
\psi_{QO}(k, y^\delta) &= \|x_{2k}^\delta - x_k^\delta\| = \|A^* g_k(AA^*)\, r_k(AA^*)\,y^\delta\|,
\end{aligned}$$
where $x_{2k}^\delta$, $p_{2k}^\delta$ may be identified with the second iterates as explained in Chapter 1; see (1.27). Recalling that the heuristic functionals may be expressed in the form
$$\psi^2(k, y^\delta) = \int_0^{1+} \Phi_k(\lambda)\, d\|F_\lambda y^\delta\|^2,$$
we can, similarly as for Tikhonov regularisation, write the associated filter functions $\Phi_k : (0,\infty) \to \mathbb{R}$ for the heuristic rules as [83, 91]:
$$\begin{aligned}
\Phi_k(\lambda) &= k\,r_k^2(\lambda) = k(1-\lambda)^{2k}, &&\text{(HD)}\\
\Phi_k(\lambda) &= k\,r_k^3(\lambda) = k(1-\lambda)^{3k}, &&\text{(HR)}\\
\Phi_k(\lambda) &= \lambda g_k^2(\lambda)\, r_k(\lambda) = \frac{1}{\lambda}\bigl(1 - (1-\lambda)^k\bigr)\bigl((1-\lambda)^k - (1-\lambda)^{2k}\bigr), &&\text{(L)}\\
\Phi_k(\lambda) &= \lambda g_k^2(\lambda)\, r_k^2(\lambda) = \frac{1}{\lambda}\bigl[(1-\lambda)^k - (1-\lambda)^{2k}\bigr]^2. &&\text{(QO)}
\end{aligned}$$

The observant reader will note from the above equations that for Landweber iteration, one has $\psi_{HD}(k, y^\delta) \approx \psi_{HR}(k, y^\delta)$. This was an observation also made in [59] (where both rules were originally introduced). Analogously to Proposition 9 of Chapter 2, we first prove that the stopping index can be estimated from above by its associated heuristic functional:

Proposition 29. Let
$$k_* = \operatorname*{argmin}_{k \in \mathbb{N}}\, \psi(k, y^\delta),$$
with $\psi \in \{\psi_{HD}, \psi_{HR}, \psi_{QO}, \psi_L\}$, and assume there exist positive constants such that $\|y^\delta\| \geq C_1$ whenever $\psi \in \{\psi_{HD}, \psi_{HR}\}$ and $\|A^*y^\delta\| \geq C_2$ whenever $\psi \in \{\psi_L, \psi_{QO}\}$. Then
$$\begin{aligned}
\psi_{HD}(k_*, y^\delta) &\geq C\sqrt{k_*}\,(1 - \|AA^*\|)^{k_*}, & \psi_{HR}(k_*, y^\delta) &\geq C\sqrt{k_*}\,(1 - \|AA^*\|)^{\frac{3}{2}k_*},\\
\psi_{L}(k_*, y^\delta) &\geq C(1 - \|A^*A\|)^{\frac{k_*}{2}}, & \psi_{QO}(k_*, y^\delta) &\geq C(1 - \|A^*A\|)^{k_*},
\end{aligned}$$
for all $y^\delta \in Y$.

Proof. This follows identically as in the proof of Proposition 9. For instance, if ψ = ψHD, then since

$$\|y^\delta\| = \|(I - AA^*)^{-k}(I - AA^*)^k y^\delta\| \leq \|(I - AA^*)^{-k}\|\,\|(I - AA^*)^k y^\delta\|,$$
one has that
$$\|(I - AA^*)^k y^\delta\| \geq \frac{\|y^\delta\|}{\|(I - AA^*)^{-k}\|} \geq C(1 - \|AA^*\|)^k,$$
which follows from the observation that
$$\|(I - AA^*)^{-k}\| \leq \|(I - AA^*)^{-1}\|^k \leq \frac{1}{(1 - \|AA^*\|)^k},$$
where we used the Neumann series estimate $\|(I - AA^*)^{-1}\| \leq (1 - \|AA^*\|)^{-1}$ [139]. Thus, combining the above yields
$$\psi_{HD}(k, y^\delta) = \sqrt{k}\,\|(I - AA^*)^k y^\delta\| \geq C\sqrt{k}\,(1 - \|AA^*\|)^k,$$
from which we can derive the estimate $k \leq C'(1 - \|AA^*\|)^{-2k}\,\psi_{HD}^2(k, y^\delta)$. For $\psi = \psi_{HR}$, it follows in exactly the same way, observing that we may write $\psi_{HR}(k, y^\delta) = \sqrt{k}\,\|(I - AA^*)^{\frac{3}{2}k} y^\delta\|$. For $\psi = \psi_{QO}$, it follows similarly.
In particular, since we may write
$$\|x_{2k}^\delta - x_k^\delta\| = \Bigl\|\sum_{j=0}^{2k-1}(I - A^*A)^j A^* y^\delta - \sum_{j=0}^{k-1}(I - A^*A)^j A^* y^\delta\Bigr\| = \Bigl\|\sum_{j=k}^{2k-1}(I - A^*A)^j A^* y^\delta\Bigr\|,$$
we thus have that
$$\begin{aligned}
\psi_{QO}^2(k, y^\delta) = \|x_{2k}^\delta - x_k^\delta\|^2 &= \sum_{j=k}^{2k-1}\sum_{i=k}^{2k-1}\bigl\langle (I - A^*A)^j A^* y^\delta,\, (I - A^*A)^i A^* y^\delta\bigr\rangle = \sum_{j=k}^{2k-1}\sum_{i=k}^{2k-1}\bigl\langle (I - A^*A)^{i+j} A^* y^\delta,\, A^* y^\delta\bigr\rangle \\
&\geq \sum_{j=k}^{2k-1}\sum_{i=k}^{2k-1} (1 - \|A^*A\|)^{i+j}\,\|A^* y^\delta\|^2 = \Biggl(\frac{1 - (1 - \|A^*A\|)^k}{1 - (1 - \|A^*A\|)}\Biggr)^2 (1 - \|A^*A\|)^{2k}\,\|A^* y^\delta\|^2,
\end{aligned}$$
where we have twice used the formula for the geometric sum. Now, since $1 - \|A^*A\| < 1$ and
$$0 \leq (1 - \|A^*A\|)^k \leq (1 - \|A^*A\|) \implies 1 - (1 - \|A^*A\|)^k \geq 1 - (1 - \|A^*A\|) = \|A^*A\|,$$
it follows that
$$\frac{1 - (1 - \|A^*A\|)^k}{1 - (1 - \|A^*A\|)} \geq \frac{\|A^*A\|}{\|A^*A\|} = 1.$$
Hence,
$$\psi_{QO}(k, y^\delta) \geq (1 - \|A^*A\|)^k\,\|A^* y^\delta\|,$$
yielding the desired estimate.

Finally, for $\psi = \psi_L$, it similarly follows that
$$\begin{aligned}
\psi_L^2(k, y^\delta) = \langle x_k^\delta, x_{2k}^\delta - x_k^\delta\rangle &= \Bigl\langle \sum_{j=0}^{k-1}(I - A^*A)^j A^* y^\delta,\, \sum_{i=k}^{2k-1}(I - A^*A)^i A^* y^\delta\Bigr\rangle \geq \sum_{j=0}^{k-1}\sum_{i=k}^{2k-1} (1 - \|A^*A\|)^{j+i}\,\|A^* y^\delta\|^2 \\
&= \Bigl(\sum_{j=0}^{k-1}(1 - \|A^*A\|)^j\Bigr)\Bigl(\sum_{i=k}^{2k-1}(1 - \|A^*A\|)^i\Bigr)\,\|A^* y^\delta\|^2 \\
&= (1 - \|A^*A\|)^k\,\frac{1 - (1 - \|A^*A\|)^k}{1 - (1 - \|A^*A\|)}\Bigl(\sum_{j=0}^{k-1}(1 - \|A^*A\|)^j\Bigr)\,\|A^* y^\delta\|^2 \\
&= (1 - \|A^*A\|)^k\Biggl(\frac{1 - (1 - \|A^*A\|)^k}{1 - (1 - \|A^*A\|)}\Biggr)^2\|A^* y^\delta\|^2 \geq C(1 - \|A^*A\|)^k,
\end{aligned}$$
by similar arguments as we used for the $\psi_{QO}$ estimate, which completes the proof.

Proposition 30. Let the source condition (2.5) be satisfied. Then for all $\psi \in \{\psi_{HD}, \psi_{HR}, \psi_L, \psi_{QO}\}$, there exist positive constants such that
$$\psi(k, y - y^\delta) \leq C\sqrt{k}\,\delta, \qquad \text{and} \qquad \psi(k, y) \leq C_\mu (k+1)^{-\mu},$$
for all $k \in \mathbb{N}$ and $y, y^\delta \in Y$.

Proof. In the case $\psi = \psi_{HD}$, it follows immediately that
$$\psi_{HD}(k, y - y^\delta) = \sqrt{k}\,\|(I - AA^*)^k(y - y^\delta)\| \leq 2\sqrt{k}\,\delta,$$
and
$$\psi_{HD}(k, y) = \sqrt{k}\,\|Ax_k - y\| \leq C(2\mu + k + 1)^{-\mu-1}\,k \leq C(k+1)^{-\mu-1}(k+1) = C(k+1)^{-\mu}.$$

For $\psi = \psi_{HR}$, the estimate follows similarly to the HD rule from the observation that
$$\psi_{HR}(k, y - y^\delta) \leq \sqrt{k}\,\|(I - AA^*)^{\frac{3}{2}k}\|\,\|y - y^\delta\| \leq C\sqrt{k}\,\delta,$$
and
$$\psi_{HR}(k, y) = \sqrt{k}\,\bigl\|Ax_{\frac{3}{2}k} - y\bigr\| \leq C\bigl(2\mu + \tfrac{3}{2}k + 1\bigr)^{-\mu-1}k \leq C\bigl(\tfrac{3}{2}k + 1\bigr)^{-\mu-1}(k+1) \leq C_\mu (k+1)^{-\mu},$$
with $C_\mu = \bigl(\tfrac{2}{3}\bigr)^{-\mu-1}$. For $\psi = \psi_{QO}$, note that we may write
$$\psi_{QO}(k, y) = \|A^* g_k(AA^*)\, r_k(AA^*)\, y\| \leq \|(AA^*)^{\frac12} g_k(AA^*)^{\frac12}\|\,\|g_k(AA^*)^{\frac12}\|\,\|r_k(AA^*)y\| \leq C\sqrt{k}\,\|p_k\| \leq C(k+1)^{-\mu},$$
since $|\lambda g_k(\lambda)| \leq C$ and $|g_k(\lambda)| \leq k$ for all $\lambda \in (0,1)$, and due to similar considerations as above. Furthermore,
$$\psi_{QO}(k, y - y^\delta) = \|A^* g_k(AA^*)\, r_k(AA^*)(y - y^\delta)\| \leq \|(AA^*)^{\frac12} g_k(AA^*)^{\frac12}\|\,\|r_k(AA^*)\|\,\|g_k(AA^*)^{\frac12}\|\,\|y - y^\delta\| \leq \sqrt{k}\,\delta,$$
which follows from the previous recollections and the fact that $|r_k(\lambda)| \leq C$ for all $0 < \lambda < 1$. For $\psi = \psi_L$, the upper bounds follow similarly, as
$$\begin{aligned}
\psi_L(k, y) &= \|A^* g_k(AA^*)\, r_k(AA^*)^{\frac12}\, y\| = \|A^*A\, g_k(A^*A)\, r_k(A^*A)^{\frac12} x^\dagger\| \\
&\leq \|A^*A\, g_k(A^*A)\|\,\|r_k(A^*A)^{\frac12} x^\dagger\| \leq C\Bigl(\frac{k}{2} + 1\Bigr)^{-\mu} = C\cdot 2^\mu (k+2)^{-\mu} \leq C_\mu (k+1)^{-\mu},
\end{aligned}$$
and also
$$\psi_L(k, y - y^\delta) \leq \|(AA^*)^{\frac12} g_k(AA^*)^{\frac12}\|\,\|r_k(AA^*)^{\frac12}\|\,\|g_k(AA^*)^{\frac12}\|\,\|y - y^\delta\| \leq C\sqrt{k}\,\delta,$$
due to similar estimates as before, which completes the proof.

Note that the Muckenhoupt condition (2.17) remains identical to that used for Tikhonov regularisation when we consider the relationship $\alpha \sim \frac{1}{k}$; that is, $e \in \mathcal{N}_p$ if there exists a positive constant such that

$$\int_{\frac1k}^{\infty} \lambda^{-1}\, d\|F_\lambda e\|^2 \leq C\, k^p \int_0^{\frac1k} \lambda^{p-1}\, d\|F_\lambda e\|^2, \tag{4.4}$$
for all $k \in \mathbb{N}$.

Proposition 31. Assume that $k > 1$ and
$$e = y - y^\delta \in \begin{cases}\mathcal{N}_1, & \text{if } \psi \in \{\psi_{HD}, \psi_{HR}\},\\ \mathcal{N}_2, & \text{if } \psi \in \{\psi_{QO}, \psi_L\}.\end{cases}$$
Then there exists a positive constant such that
$$\|x_k^\delta - x_k\| \leq C\,\psi(k, y - y^\delta),$$
for all $k \in \mathbb{N}$ [83, 91].

Proof. First, we estimate the data propagation error as
$$\|x_k^\delta - x_k\|^2 = \left(\int_0^{\frac1k} + \int_{\frac1k}^{1+}\right) \lambda\, g_k^2(\lambda)\, d\|F_\lambda(y^\delta - y)\|^2.$$
We may bound the integrand of the first integral by the estimates:

$$|\lambda g_k^2(\lambda)| \leq Ck^2\lambda, \qquad \text{and} \qquad |\lambda g_k^2(\lambda)| \leq C|g_k(\lambda)| \leq Ck,$$

and similarly, since $\lambda g_k^2(\lambda) = \lambda^{-1}\bigl(1 - (1-\lambda)^k\bigr)^2$, it follows that the integrand of the second integral can be bounded from above by $\lambda^{-1}$, since $1 - (1-\lambda)^k < 1$; both of the aforementioned estimates in fact hold for all $\lambda \in (0,1)$. Thus, we have
$$\|x_k^\delta - x_k\|^2 \leq Ck^2 \int_0^{\frac1k} \lambda\, d\|F_\lambda(y^\delta - y)\|^2 + \int_{\frac1k}^{1+} \lambda^{-1}\, d\|F_\lambda(y^\delta - y)\|^2, \tag{4.5}$$
$$\|x_k^\delta - x_k\|^2 \leq Ck \int_0^{\frac1k} d\|F_\lambda(y^\delta - y)\|^2 + \int_{\frac1k}^{1+} \lambda^{-1}\, d\|F_\lambda(y^\delta - y)\|^2. \tag{4.6}$$
In case $p = 2$, the second term in (4.5) may be bounded by the first term via (4.4), and for $p = 1$ the same holds for the two terms in (4.6). Hence, for each $\psi$-functional, similarly as for Tikhonov regularisation, it suffices to bound the $\psi$-functional acting on the noise $y - y^\delta$ from below by the first term in either (4.6) or (4.5); that is,
$$\psi^2(k, y - y^\delta) \geq Ck^p \int_0^{\frac1k} \lambda^{p-1}\, d\|F_\lambda(y - y^\delta)\|^2, \tag{4.7}$$
where $p = 1$ for $\psi \in \{\psi_{HD}, \psi_{HR}\}$ and $p = 2$ for $\psi \in \{\psi_{QO}, \psi_L\}$, as per the statement of the proposition. For $\psi = \psi_{HD}$, we have

$$\psi_{HD}^2(k, y - y^\delta) = k\,\|(I - AA^*)^k(y - y^\delta)\|^2 \geq k\int_0^{\frac1k}(1-\lambda)^{2k}\, d\|F_\lambda(y - y^\delta)\|^2 \geq Ck\int_0^{\frac1k} d\|F_\lambda(y - y^\delta)\|^2 \overset{(4.4)}{\geq} \int_{\frac1k}^{1+} \lambda^{-1}\, d\|F_\lambda(y - y^\delta)\|^2,$$
where the second to last inequality follows from the fact that for $0 < \lambda \leq \frac1k$, we have
$$(1-\lambda)^{2k} \geq \Bigl(\bigl(1 - \tfrac1k\bigr)^k\Bigr)^2 \geq C^2, \qquad (k > 2) \tag{4.8}$$
which is what we wanted to show. Similarly, for $\psi = \psi_{HR}$, we have
$$\psi_{HR}^2(k, y - y^\delta) = k\,\bigl\langle (I - AA^*)^{2k}(y - y^\delta),\, (I - AA^*)^k(y - y^\delta)\bigr\rangle \geq k\int_0^{\frac1k}(1-\lambda)^{2k}(1-\lambda)^k\, d\|F_\lambda(y - y^\delta)\|^2 \overset{(4.8)}{\geq} Ck\int_0^{\frac1k} d\|F_\lambda(y - y^\delta)\|^2 \overset{(4.4)}{\geq} \int_{\frac1k}^{1+} \lambda^{-1}\, d\|F_\lambda(y - y^\delta)\|^2,$$
by the same token. In case $\psi = \psi_{QO}$,
$$\psi_{QO}^2(k, y - y^\delta) \geq \int_0^{\frac1k} \lambda\, g_k^2(\lambda)\, r_k^2(\lambda)\, d\|F_\lambda(y - y^\delta)\|^2 \geq Ck^2 \int_0^{\frac1k} \lambda\, d\|F_\lambda(y - y^\delta)\|^2,$$
which follows from the fact that for $0 < \lambda \leq \frac1k$, we have
$$g_k(\lambda) = \sum_{j=0}^{k-1}(1-\lambda)^j \geq k(1-\lambda)^k \geq Ck,$$
since $(1-\lambda)^k \geq C$ for $0 < \lambda \leq \frac1k$ by (4.8), and
$$r_k(\lambda) = (1-\lambda)^k \geq C > 0,$$
which follows from the fact that $1 - \lambda > 0$ for all $\lambda < 1$ and $0 < \lambda \leq \frac1k$ (cf. (4.8)); this yields (4.7), as desired. Finally, for $\psi = \psi_L$,
$$\psi_L^2(k, y - y^\delta) \geq \int_0^{\frac1k} \lambda\, g_k^2(\lambda)\, r_k(\lambda)\, d\|F_\lambda(y - y^\delta)\|^2 \geq Ck^2 \int_0^{\frac1k} \lambda\, d\|F_\lambda(y - y^\delta)\|^2,$$
which follows similarly as for the quasi-optimality rule above.

We now state the following convergence rates theorem, partly courtesy of [83]:

Theorem 20. Without loss of generality, assume $\|A\| < 1$ and suppose $x^\dagger$ satisfies (1.17). Let $k_*$ be chosen as in (1.33) with $\psi \in \{\psi_{HD}, \psi_{HR}, \psi_{QO}, \psi_L\}$. Furthermore, assume that there exist positive constants such that $\|y^\delta\| \geq C_1$ and $\|A^*y^\delta\| \geq C_2$ whenever $\psi \in \{\psi_{HD}, \psi_{HR}\}$ and $\psi \in \{\psi_{QO}, \psi_L\}$, respectively. We also assume that
$$e = y - y^\delta \in \begin{cases}\mathcal{N}_1, & \text{if } \psi \in \{\psi_{HD}, \psi_{HR}\},\\ \mathcal{N}_2, & \text{if } \psi \in \{\psi_L, \psi_{QO}\}.\end{cases}$$

Then
$$\|x_{k_*}^\delta - x^\dagger\| = \begin{cases} O\Bigl(\bigl(-W\bigl(-C\delta^{\frac{4\mu}{2\mu+1}}\bigr)\bigr)^{-\mu}\Bigr), & \psi \in \{\psi_{HD}, \psi_{HR}\},\\[1ex] O\bigl((-\log(\delta))^{-\mu}\bigr), & \psi \in \{\psi_L, \psi_{QO}\},\end{cases}$$
as $\delta \to 0$, where $W$ is Lambert's $W$-function.

Proof. We can approximate
$$\begin{aligned}
\|x_{k_*}^\delta - x^\dagger\| &= O\bigl(\|x_{k_*}^\delta - x_{k_*}\| + \|x_{k_*} - x^\dagger\|\bigr) = O\bigl(\psi(k_*, y - y^\delta) + (k_*+1)^{-\mu}\bigr) \\
&= O\bigl(\psi(k, y^\delta) + \psi(k_*, y) + (k_*+1)^{-\mu}\bigr) = O\bigl(\psi(k, y - y^\delta) + \psi(k, y) + \psi(k_*, y) + (k_*+1)^{-\mu}\bigr). \tag{4.9}
\end{aligned}$$

For $\psi = \psi_{HD}$, it follows from Propositions 30 and 31 that
$$\|x_{k_*}^\delta - x^\dagger\| = O\bigl(\sqrt{k}\,\delta + (k+1)^{-\mu} + (k_*+1)^{-\mu}\bigr). \tag{4.10}$$

Thus, from Proposition 29, we may recall

$$Ck_*(1 - \|AA^*\|)^{2k_*} \leq \psi_{HD}^2(k, y^\delta) \iff C\varphi(k_*) \leq \psi_{HD}^2(k, y^\delta),$$
with $\varphi(x) := x(1 - \|AA^*\|)^{2x}$. If we assume that $\|AA^*\|$ is not too small (which may always be achieved through appropriate scaling), then $\varphi$ is a decreasing function for $x > 1$. Thus,
$$k_* \geq \varphi^{-1}\bigl(C\psi_{HD}^2(k, y^\delta)\bigr),$$
with
$$\varphi^{-1}(y) = \frac{W\bigl(2\log(1 - \|AA^*\|)\,y\bigr)}{2\log(1 - \|AA^*\|)} = \frac{\bigl|W\bigl(2\log(1 - \|AA^*\|)\,y\bigr)\bigr|}{\bigl|2\log(1 - \|AA^*\|)\bigr|},$$
since both numerator and denominator are negative, where $W$ represents Lambert's $W$-function (cf. [26]), which is negative for negative arguments. Hence,
$$k_* \geq O\bigl(\bigl|W\bigl(-C\psi_{HD}^2(k, y^\delta)\bigr)\bigr|\bigr),$$

as $\delta \to 0$, where we have replaced $2\log(1 - \|AA^*\|)$ by $-C < 0$, where $C$ is a positive constant, as it does not depend on any variable. Therefore, we can estimate the $(k_*+1)^{-\mu}$ term in (4.10) by $k_*^{-\mu}$, and then, with the above, get
$$\begin{aligned}
\|x_{k_*}^\delta - x^\dagger\| &= O\Bigl(\sqrt{k}\,\delta + (k+1)^{-\mu} + \bigl|W\bigl(-C\psi_{HD}^2(k, y^\delta)\bigr)\bigr|^{-\mu}\Bigr) \\
&= O\Bigl(\sqrt{k}\,\delta + (k+1)^{-\mu} + \Bigl|W\Bigl(-C\bigl(\sqrt{k}\,\delta + (k+1)^{-\mu}\bigr)^2\Bigr)\Bigr|^{-\mu}\Bigr) \\
&= O\Bigl(\bigl|W\bigl(-C\delta^{\frac{4\mu}{2\mu+1}}\bigr)\bigr|^{-\mu}\Bigr) = O\Bigl(\bigl(-W\bigl(-C\delta^{\frac{4\mu}{2\mu+1}}\bigr)\bigr)^{-\mu}\Bigr),
\end{aligned}$$
as $\delta \to 0$, where we have once again used that $W$ is negative for negative arguments and that $|w| = -w$ for any $w < 0$. The third equality holds since, for all $z \in (0,1)$, it follows that $W(-Cz^2) \in (-1, 0)$, which, in turn, implies that $z < 1 < |W(-Cz^2)|^{-\mu}$ for all $\mu > 0$.

Note that for $\psi = \psi_{HR}$, since $k_* \geq O\bigl(\bigl|W\bigl(-C\psi_{HR}^2(k, y^\delta)\bigr)\bigr|\bigr)$ due to Proposition 29 (with a slightly different constant to the $\psi_{HD}$ case), we get the same rates as for the HD rule, since the estimates in Proposition 30 are more or less identical. Similarly, for $\psi = \psi_{QO}$, from (4.9) we have
$$\|x_{k_*}^\delta - x^\dagger\| = O\bigl(\sqrt{k}\,\delta + (k+1)^{-\mu} + (k_*+1)^{-\mu}\bigr). \tag{4.11}$$

Recalling the estimates from Proposition 29:
$$C(1 - \|A^*A\|)^{k_*} \leq \psi_{QO}(k, y^\delta) \iff C\phi(k_*) \leq \psi_{QO}(k, y^\delta),$$
with $\phi(x) := (1 - \|A^*A\|)^x$. Since $\phi^{-1}(y) = \log(y)/\log(1 - \|A^*A\|)$, where we note that the denominator is negative (since $1 - \|A^*A\| < 1$), we have
$$k_* \geq C\,\frac{\log\bigl(\psi_{QO}(k, y^\delta)\bigr)}{\log(1 - \|A^*A\|)} = C\,\frac{\bigl|\log\bigl(\psi_{QO}(k, y^\delta)\bigr)\bigr|}{\bigl|\log(1 - \|A^*A\|)\bigr|}.$$

Therefore, we can estimate the $(k_*+1)^{-\mu}$ term in (4.11) by $k_*^{-\mu}$, and then, with the above, get
$$\begin{aligned}
\|x_{k_*}^\delta - x^\dagger\| &= O\Bigl(\sqrt{k}\,\delta + (k+1)^{-\mu} + \bigl|\log\bigl(\sqrt{k}\,\delta + (k+1)^{-\mu}\bigr)\bigr|^{-\mu}\Bigr) \\
&= O\Bigl(\bigl|\log\bigl(\sqrt{k}\,\delta + (k+1)^{-\mu}\bigr)\bigr|^{-\mu}\Bigr) = O\Bigl(\bigl|\log\bigl(\delta^{\frac{2\mu}{2\mu+1}}\bigr)\bigr|^{-\mu}\Bigr) = O\bigl(|\log(\delta)|^{-\mu}\bigr) = O\bigl((-\log(\delta))^{-\mu}\bigr),
\end{aligned}$$
as $\delta \to 0$. Note that the second equality holds by the following logic: for all $z \in (0,1)$, there should exist a positive constant such that

$$z \leq C|\log(z)|^{-\mu} \iff z(-\log(z))^{\mu} \leq C. \tag{4.12}$$

Let $f(z) := z(-\log z)^{\mu}$. Then it is clear that $f \to 0$ as $z \to 0$, $f(1) = 0$, $f > 0$ on $(0,1)$ and $f \in C^1((0,1))$. Moreover, its maximum is attained at some $z_0 \in (0,1)$, which satisfies
$$f'(z_0) = 0.$$

That is to say,
$$(-\log z_0)^{\mu} - \mu(-\log z_0)^{\mu-1} = 0, \qquad \text{i.e.,} \qquad -\log(z_0) = \mu \iff z_0 = e^{-\mu}.$$
Thus, the maximum of $f$ is
$$f(z_0) = z_0(-\log z_0)^{\mu} = e^{-\mu}\mu^{\mu} = \Bigl(\frac{\mu}{e}\Bigr)^{\mu}.$$
Hence, (4.12) holds for $z \in (0,1)$ if and only if
$$\Bigl(\frac{\mu}{e}\Bigr)^{\mu} \leq C, \qquad \text{i.e., for} \qquad C = C_\mu = \Bigl(\frac{\mu}{e}\Bigr)^{\mu}.$$

Finally, since the estimates for ψL are nigh on identical to those of ψQO, the proof is also analogous.

Remark. The above theorem thus exemplifies that "infinite" qualification leads to slow convergence rates (cf. [83]). In particular, we see that the convergence is much slower than the rates for Tikhonov regularisation (cf. Corollary 3). Thus, one may conclude that for heuristic rules, saturation is in fact beneficial for faster convergence rates.

In order to get optimal convergence rates, we must once again, as in Chapter 2, prove that we may bound the approximation error from above by the $\psi$-functionals, as we did in Lemma 1 of Chapter 2, courtesy of a regularity condition on the solution. Note, however, that these conditions would be somewhat more restrictive than, e.g., (2.26) for Tikhonov regularisation [83]. However, this is beyond the scope of this thesis, and therefore we end our treatment of linear ill-posed problems with Landweber iteration here, with the suboptimal convergence rates theorem above. The aforementioned optimal convergence rates results with the required regularity conditions may be found in [83]. In
particular, the said regularity condition is of the form: there exists an index function $\phi : \mathbb{R}^+ \to \mathbb{R}^+$ such that
$$\int_{\frac1k}^{1+} r_k^2(\lambda)\, d\|E_\lambda x^\dagger\|^2 \leq \phi\left(\int_0^{\frac1k} r_k^2(\lambda)\, d\|E_\lambda x^\dagger\|^2\right). \tag{4.13}$$
For further details, we point the reader in the direction of the aforementioned reference.

4.2 Landweber Iteration for Nonlinear Operators

Now we consider the case in which the forward operator is nonlinear. Note that, to the best of the author's knowledge, the theory of heuristic rules in this setting has not been developed at all, and indeed no ground-breaking results are presented here either. Instead, we conjecture some possible heuristic parameter choice rules based on the ones we highlighted earlier for linear ill-posed problems, as stated in the forthcoming paper [77]. Let $X$ and $Y$ be Hilbert spaces, as in the previous section, and let $F : \operatorname{dom} F \subset X \to Y$ be a nonlinear operator (which we also assume to be continuously Fréchet differentiable); i.e., we consider

F (x) = y, (4.14) where we have noisy data yδ satisfying

kyδ − yk ≤ δ, (4.15)

(4.1) may be replaced by

$$x_k^\delta = x_{k-1}^\delta + \omega F'(x_{k-1}^\delta)^*\bigl(y^\delta - F(x_{k-1}^\delta)\bigr), \qquad k \in \mathbb{N}, \tag{4.16}$$

where $F'(x_{k-1}^\delta)$ denotes the Fréchet derivative of $F$ at $x_{k-1}^\delta$, with an initial guess $x_0^\delta = x_0$ (cf. [35, 58, 81]). Note that since the argument of $F'(\cdot)^*$ in (4.16) changes at each iteration step, the sequence $\{x_k^\delta\}$ is not guaranteed to remain within a certain invariant subspace (e.g., in $\operatorname{range}((A^*A)^\mu)$ as in the preceding section); therefore, rather restrictive additional conditions must be imposed in order to derive convergence rates (cf. [35]). A typical source condition (cf. [35]) one considers for nonlinear Landweber iteration is
$$x^\dagger - x_0 = F'(x^\dagger)^*\omega. \qquad (\omega \in Y) \tag{4.17}$$

A restrictive condition, one of those we mentioned there would be, is the so-called tangential cone condition:
$$\|F(x) - F(\tilde{x}) - F'(x)(x - \tilde{x})\| \leq \eta\,\|F(x) - F(\tilde{x})\|, \qquad \eta < \tfrac12, \tag{4.18}$$
for $x, \tilde{x} \in B_{2\rho}(x_0) \subset \operatorname{dom} F$ (cf. [35, 81]), which is a condition on the nonlinearity of $F$. The debate on whether this condition is too strong to be satisfied in many practical situations rages on, yet there are some results which show that it is indeed satisfied in certain realistic examples, e.g., [76, 82]. Furthermore, when proving convergence with respect to the discrepancy principle, the parameter $\tau$ in (1.29) has to satisfy
$$\tau > 2\,\frac{1 + \eta}{1 - 2\eta} \geq 2, \tag{4.19}$$
where the factor 2 can be slightly improved by an expression depending on $\eta$ which tends to 1 as $\eta \to 0$, thereby recovering the optimal bound in the linear case [56]. If, in addition, the step-size $\omega$ is chosen small enough such that, locally, one has
$$\omega\,\|F'(x)\|^2 \leq 1, \tag{4.20}$$
then Landweber iteration combined with the discrepancy principle (1.29) converges to a solution $x^*$ as the noise level $\delta$ goes to 0 [58, 81]. Furthermore, convergence to the minimum-norm solution $x^\dagger$ can be established given that, in a sufficiently large neighbourhood, it holds that $\ker(F'(x^\dagger)) \subset \ker(F'(x))$. In order to prove convergence rates, in addition to source conditions, further restrictions on the nonlinearity of $F$ are necessary, see (4.18), cf. [35, 81]. Although the tangential cone condition (4.18) holds for a number of different applications (see e.g. [82] and the references therein), and despite the fact that several attempts have been made to replace it by more general conditions [85], these can still be difficult to prove for specific applications. Furthermore, even if the tangential cone condition can be proven, the exact value of $\eta$ typically remains unknown. Since this also renders condition (4.19) impractical, the parameter $\tau$ in the discrepancy principle then has to be chosen manually. Since in the linear case the theory implies that $\tau$ close to 1 is optimal, popular choices include $\tau = 1.1$ or $\tau = 2$. These work well in many situations, but are also known to fail in others (compare with Section 8 below). In any case, this shows that for practical applications involving nonlinear operators, informed "heuristic" parameter choices remain necessary even if the noise level $\delta$ is known.

4.2.1 Heuristic Parameter Choice Rules

As mentioned, to the best of the author's knowledge, there is no convergence analysis available for heuristic rules in the case of nonlinear Landweber iteration within the current spectrum of scientific literature. One may, however, postulate that the parameter choice rules represented by the $\psi$-functionals of Section 4.1.1 may be applied to the nonlinear Landweber case with the appropriate adjustments. Certainly, in Section 8, we include the results from the upcoming paper [77], in which a numerical study of the aforementioned $\psi$-functionals was done, where they were defined as:
$$\begin{aligned}
\psi_{HD}(k, y^\delta) &:= \sqrt{k}\,\|F(x_k^\delta) - y^\delta\|,\\
\psi_{HR}(k, y^\delta) &:= \sqrt{k}\,\bigl\langle y^\delta - F(x_{2k}^\delta),\, y^\delta - F(x_k^\delta)\bigr\rangle^{\frac12},\\
\psi_{QO}(k, y^\delta) &:= \|x_{2k}^\delta - x_k^\delta\|,\\
\psi_{L}(k, y^\delta) &:= \bigl\langle x_k^\delta,\, x_{2k}^\delta - x_k^\delta\bigr\rangle^{\frac12},
\end{aligned} \tag{4.21}$$
with $x_k^\delta, x_{2k}^\delta \in X$ defined by (4.16).
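To illustrate how the functionals (4.21) might be used in practice, the following is a minimal sketch that runs the nonlinear Landweber iteration (4.16), evaluates the four functionals along the stored iterates, and returns the minimising index. The forward operator F, its Jacobian Fprime, the step size, and the iteration budget are placeholders to be supplied by the user, and clipping negative inner products to zero is an ad-hoc safeguard, not part of any stated theory.

```python
import numpy as np

def nonlinear_landweber(F, Fprime, y_delta, x0, omega, k_max):
    """Nonlinear Landweber iteration (4.16); returns all iterates x_0, ..., x_{k_max}."""
    xs = [np.asarray(x0, dtype=float)]
    for _ in range(k_max):
        x = xs[-1]
        xs.append(x + omega * Fprime(x).T @ (y_delta - F(x)))
    return xs

def heuristic_stopping_index(F, xs, y_delta, rule="HD"):
    """Evaluate one of the functionals in (4.21) for k = 1, ..., len(xs)//2 - 1
    and return the minimising stopping index k_*."""
    k_range = range(1, len(xs) // 2)
    vals = []
    for k in k_range:
        rk, r2k = y_delta - F(xs[k]), y_delta - F(xs[2 * k])
        if rule == "HD":
            vals.append(np.sqrt(k) * np.linalg.norm(rk))
        elif rule == "HR":
            vals.append(np.sqrt(k) * np.sqrt(max(r2k @ rk, 0.0)))
        elif rule == "QO":
            vals.append(np.linalg.norm(xs[2 * k] - xs[k]))
        else:  # "L"
            vals.append(np.sqrt(max(xs[k] @ (xs[2 * k] - xs[k]), 0.0)))
    return list(k_range)[int(np.argmin(vals))]
```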

Part II

Numerics

Foreword

In order to illustrate the theory of the preceding sections, we conduct various numerical tests spanning a plethora of different settings from the previous chapters. The outline of this part of the thesis is as follows: in Chapter 5, we examine the actual performance of the semi-heuristic rules of Section 2.3 and compare them to standard heuristic rules and also to an a-posteriori rule for Tikhonov regularisation in the context of weakly bounded noise. For a comparison of heuristic rules for Tikhonov regularisation in the standard setting with the presence of (strongly) bounded noise, we refer the reader to [53]. Note that the numerics and results are from the paper [51]. Thereafter, in Chapter 6, we present the results of [90], which demonstrate the effectiveness of certain novel and already known heuristic rules for convex Tikhonov regularisation. We include the results of [91], which showcase the numerical performance of the ψL and ψLR rules against the more classical L-curve rule of Hansen and also the quasi-optimality rule, in Chapter 7. Finally, we test some heuristic rules for a couple of examples for nonlinear Landweber iteration in Chapter 8. Note that at the end of each section, we shall provide a brief summary detailing the concluding observations. The manner in which the numerics are assessed varies between the chapters. For instance, Chapter 5 is the only one in which we compute the ratio of the relative error with respect to each heuristic rule and the optimal error, and subsequently plot a heat diagram. We leave the details vague for now, as they will be explained in the relevant section. In Chapter 6, however, we do away with the aforementioned heat plots and merely compute the total error for each noise level and plot it alongside exemplary functional plots; similarly in Chapter 8. In Chapter 7, there are no plots, but results are displayed in tabular form. Typical issues which arise in the numerical implementation of the ψ-functional based heuristic rules are most often associated with the mandatory discretisation, as will be detailed in the following chapters.

Chapter 5

Semi-Heuristic Rules

In this chapter, we revert to the setting of Section 2.3 and demonstrate the numerical performance of the semi-heuristic rules discussed there. As in that section, all of the results are taken from [51]. We pit the various modified functionals $\bar\psi$ against one another, and also against the generalised discrepancy principle, see, e.g., [103], for comparison's sake, on a series of test problems. We provide two types of experiments: one with random operator noise and the other with a smooth operator perturbation. Note that heuristic rules can fail in the case of smooth errors that do not satisfy a noise condition. Thus, a smooth operator perturbation is the most critical case for heuristic rules, and, as we will observe, the semi-heuristic methods will prove to be more effective in that case. For each of the proposed rules, we compute the relative error with respect to the selected regularisation parameter $\alpha_*$ and the error obtained by the theoretically optimal choice of $\alpha$ by
$$\mathrm{Err}_{\mathrm{rel}} := \frac{\|x_{\alpha_*,\eta}^\delta - x^\dagger\|}{\|x^\dagger\|} \qquad \text{and} \qquad \mathrm{Err}_{\mathrm{opt}} := \frac{\|x_{\alpha_{\mathrm{opt}},\eta}^\delta - x^\dagger\|}{\|x^\dagger\|},$$
respectively. Furthermore, we denote the ratio of these errors by

$$\mathrm{Err}_{\mathrm{per}} := \frac{\|x_{\alpha_*,\eta}^\delta - x^\dagger\|}{\|x_{\alpha_{\mathrm{opt}},\eta}^\delta - x^\dagger\|}.$$
Note that in our simulations, we are afforded the luxury of knowing $x^\dagger$, thereby allowing us to minimise the error and compute $\alpha_{\mathrm{opt}}$ and $\mathrm{Err}_{\mathrm{opt}}$. For the standard heuristic rules, we search for $\alpha \in [\lambda_{\min}, \|A\|^2]$, where $\lambda_{\min}$ is the minimum eigenvalue of the matrix $A^*A$. (However, if $\lambda_{\min}$ is below $10^{-14}$, then we choose $\alpha_{\min} = 10^{-14}$ to avoid numerical instabilities.) In some cases of large operator noise, the heuristic rules selected $\alpha_* = \alpha_{\max}$; in this situation, we select the parameter corresponding to the smallest interior local minimum instead. For the semi-heuristic rules, however, we restrict our search to the interval $[\gamma, \alpha_{\max}]$, where $\gamma = O(\eta)$ as above. Furthermore, in

127 each experiment, we have scaled the operator and the exact solution so that kAk = 1 and kx†k = 1.
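In code, the above error measures amount to a few norm evaluations. The following is a minimal sketch, assuming a grid of candidate reconstructions is already available; names and the grid itself are illustrative.

```python
import numpy as np

def error_metrics(x_alpha_star, x_alpha_opt, x_true):
    """Compute Err_rel, Err_opt, and the ratio Err_per as defined above."""
    err_rel = np.linalg.norm(x_alpha_star - x_true) / np.linalg.norm(x_true)
    err_opt = np.linalg.norm(x_alpha_opt - x_true) / np.linalg.norm(x_true)
    err_per = np.linalg.norm(x_alpha_star - x_true) / np.linalg.norm(x_alpha_opt - x_true)
    return err_rel, err_opt, err_per

# alpha_opt is obtainable only because x_true is known in the simulations, e.g.:
# alphas, reconstructions = ...  # assumed given
# alpha_opt = alphas[np.argmin([np.linalg.norm(x - x_true) for x in reconstructions])]
```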

5.1 Gaußian Operator Noise Perturbation

5.1.1 Tomography Operator Perturbed by Gaußian Operator

In this experiment, we use the tomo package from Hansen's Regularisation Tools [61] to define the finite-dimensional operator (i.e., matrix) $A_\eta \in \mathbb{R}^{n\times n}$, where $A_\eta = A + C\Delta A$, with $A$ the tomography operator, which models the penetration of a two-dimensional domain by rays in random directions. We use random Gaußian distributed operator noise, i.e., $\Delta A \in \mathbb{R}^{n\times n}$ is a matrix with random entries. The data noise is defined as $\delta = C\|\epsilon\|$, where $\epsilon \in \mathbb{R}^n$ is a Gaußian distributed noise vector. In the following configuration, we set n = 625 and f = 1, according to Hansen's Tools. We provide a dot plot, namely Figure 5.1, in which we compare the error $\mathrm{Err}_{\mathrm{rel}}$ according to the relative error function for each parameter choice rule and for 100 different realisations of data errors and operator perturbations, with values of $\delta$ and $\eta$ ranging from 1% to 10%. Each asterisk in the plot corresponds to the relative error, $\mathrm{Err}_{\mathrm{rel}}$, for a realisation of operator and data noise. Note that "SH1" and "SH2" in Figure 5.1 refer to the modified rules with $\eta\|x_{\alpha,\eta}^\delta\|$, cf. (SH1), and $\eta/\sqrt{\alpha}$, cf. (SH2), as compensating functionals, respectively. Recall that the standard rules (in blue) correspond to the semi-heuristic rules with $D = 0$ and a search for a parameter $\alpha$ in the interval $[\lambda_{\min}, \|A\|^2]$. The last row in the plot is then the dot plot of the relative error for the optimal choice of $\alpha$, namely $e_{\mathrm{opt}}$. In each row, the green circles represent the median of the respective relative errors over the 100 realisations. We see that the semi-heuristic rules present a noticeable improvement for all parameter choice rules, although the discrepancy in performance seems to be slightly more pronounced for the quasi-optimality and Hanke-Raus rules. One may also observe that the generalised discrepancy principle (cf. (2.64)), marked in the first row of Figure 5.1, is the worst performing. We also compare the difference between the values of $\mathrm{Err}_{\mathrm{per}}$ with respect to the modified parameter choice rule and its unmodified counterpart, respectively, as a percentage. For example, for any configuration of data and operator noise, we would compute the value

$$\vartheta(\delta, \eta) = \bigl(\mathrm{Err}_{\mathrm{per}} - \overline{\mathrm{Err}}_{\mathrm{per}}\bigr) \times 100, \tag{5.1}$$
where $\mathrm{Err}_{\mathrm{per}}$ and $\overline{\mathrm{Err}}_{\mathrm{per}}$ denote the error ratio for the standard heuristic rule (i.e., $D = 0$ and $\alpha_{\min} = \lambda_{\min}$) and the modified rule (SH1) or (SH2),

respectively. This value is computed for several noise levels $\delta$ and operator error levels $\eta$. Note that positive values indicate that the semi-heuristic rules outperform their heuristic counterparts, and vice versa. The plots of Figure 5.4 indicate that the semi-heuristic rules do not necessarily offer improvements for small data and operator noise, but exhibit increased performance for larger noise of both aforementioned varieties. In particular, this is more pronounced for the quasi-optimality rule, where we may observe blotches of dark red which indicate significant improvement over the standard heuristic rule. The standard heuristic rules also performed reasonably well; a possible explanation is that the argument for using the compensating functional was based on the regularity of the operator noise, and therefore it is probable that the irregularity of the operator noise in this scenario did not aid the premise of using the modified rules.

5.2 Smooth Operator Perturbation

5.2.1 Fredholm Integral Operator Perturbed by Heat Operator

To simulate a deterministic, possibly smooth, operator perturbation, we first consider the Fredholm integral operator of the first kind perturbed by a heat operator, which we think is an instance where the noise condition for $A_\eta$ might fail and where a semi-heuristic modification is highly advisable. For the implementation, we use the baart and heat packages from Hansen's Regularisation Tools [61] to define the finite-dimensional operator $A_\eta \in \mathbb{R}^{n\times n}$, with $n = 400$, where $A_\eta = A + C\Delta A$ is the superposition of the baart operator and the scaled heat operator, respectively. More precisely, the baart operator is the discretisation of a Fredholm integral equation of the first kind with kernel $K_1 : (s_1, t_1) \mapsto \exp(s_1 \cos t_1)$, where $s_1 \in [0, \pi/2]$, $t_1 \in [0, \pi]$, and the heat operator is taken to be the Volterra integral operator with kernel $K_2 : (s_2, t_2) \mapsto k(s_2 - t_2)$, where
$$k(t) := \frac{t^{-\frac{3}{2}}}{2\sqrt{\pi}}\,\exp\Bigl(-\frac{1}{4t}\Bigr),$$
for $t \in [0, 1]$. The exact solution is given by $y(s) = 2\sin(s)/s$ and the data noise is defined as before. We proceed similarly as in the previous experiment. In Figure 5.2, we observe that the best performing rule is in fact the semi-heuristic quasi-optimality rule (SH1). The semi-heuristic variants of the Hanke-Raus and heuristic discrepancy rules are also improvements on the original rules, although this is slightly more pronounced for the semi-heuristic Hanke-Raus rules. As in

the previous experiment, the generalised discrepancy rule performs worst; although, in this scenario, this is even more pronounced, as the average relative error (which is not visible in the plot) is 4.1627. Note that the exact solution in this experiment is smooth, and we hypothesised that the well-known suboptimality (i.e., saturation) of the discrepancy principle (see, e.g., [35]) was a possible cause. Indeed, when we reran the experiment using a less smooth (piecewise constant) solution, the results turned out to be similar to the previous experiment; thus, the failure of the discrepancy rule here is due to the saturation effect. Note that a-posteriori rules which do not exhibit the saturation effect include the modified discrepancy principle and the monotone error rule (cf. [134]). In Figure 5.5, the plots for the heuristic discrepancy and Hanke-Raus rules demonstrate that the semi-heuristic rules offer an overall improvement for all ranges of operator and data noise. However, we observe that the semi-heuristic quasi-optimality rules perform slightly worse for small data and operator noise, but exhibit much better performance when both of the mentioned noises are larger. Additionally, one may also observe that the semi-heuristic Hanke-Raus rules perform significantly better than their standard heuristic counterparts for very large operator noise.

5.2.2 Blur Operator Perturbed by Tomography Operator

In the next experiment, we again simulate a deterministic operator perturbation by considering the blur operator from Hansen's tools and perturbing it by the tomography operator from before. For the blur operator, we set band = 8 and sigma = 0.9, which is modelled by the Gaußian point spread function:
$$h(x, y) = \frac{1}{2\pi\sigma^2}\,\exp\Bigl(-\frac{x^2 + y^2}{2\sigma^2}\Bigr).$$
In Figure 5.3, we observe, as before, that the semi-heuristic rules exhibit improvements over their standard counterparts for the heuristic discrepancy and Hanke-Raus rules, although the standard quasi-optimality rule performs quite well, and in this case its semi-heuristic variants do not necessarily present a better choice. The generalised discrepancy rule performs similarly as in the other experiments. In Figure 5.6, it is difficult to draw any meaningful conclusions, although it seems that for large operator noise and reasonable data noise, the semi-heuristic discrepancy and Hanke-Raus rules perform better than the standard heuristic rules. Consequently, one may conclude that in many situations, the semi-heuristic rules offer an improvement on their standard heuristic counterparts.

Note that in all experiments, the minimiser of the standard heuristic functionals in the range $[\lambda_{\min}, \alpha_{\max}]$ was occasionally $\alpha_{\max}$, particularly when the operator noise was large. We rectified this failure by the interior-minima search described above. Had we not done so, the improvement of the semi-heuristic methods would have been even more pronounced.

5.3 Summary

The numerical experiments above confirm that the semi-heuristic methods may yield an improvement over the standard parameter choice rules in a great many situations. In case one does not have knowledge of the operator noise level, we recommend using the standard heuristic rules, as they also perform quite well in many situations. The numerical experiments, at least in our setup, showed little benefit from additional knowledge of the data noise level. Incidentally, the optimal choices of D and γ present room for further research.

Figure 5.1: Tomography operator perturbed by random operator: D = 600 for SH1, D = 0.05 for SH2 and γ = 0.005 × η.

Figure 5.2: Fredholm operator of the first kind perturbed by heat operator: D = 600 for SH1, D = 0.12 for SH2 and γ = 0.07 × η.

Figure 5.3: Blur operator perturbed by tomography operator: D = 500 for SH1, D = 0.2 for SH2 and γ = 0.01 × η.

Figure 5.4: Tomography operator perturbed by random operator: set-up identical to Figure 5.1. Red indicates that the semi-heuristic rules perform better than their standard heuristic counterparts and vice versa.

Figure 5.5: Fredholm operator of the first kind perturbed by heat operator: set-up identical to Figure 5.2. Red indicates that the semi-heuristic rules perform better than their standard heuristic counterparts and vice versa.

Figure 5.6: Blur operator perturbed by tomography operator: set-up identical to Figure 5.3. Red indicates that the semi-heuristic rules perform better than their standard heuristic counterparts and vice versa.

Chapter 6

Heuristic Rules for Convex Regularisation

In this chapter, we illustrate and verify the theory of Chapter 3; in particular, the rules based on the functionals $\psi_{HD}$, $\psi_{HR}$, $\psi_{SQO}$ and $\psi_{RQO}$ as defined in Section 3.2. We omit $\psi_{LQO}$, as this performed far and away the most poorly in preliminary testing, which was also the observation in [86]. These numerics are from [90], and there the $\psi_L$ functional as defined in Section 3.2 was not tested. However, the reader should not despair: in Chapter 7, one will find numerical tests for $\psi_L$. It should be stressed that the convergence analysis of that chapter provides an understanding of the behaviour of the rules, but there exist further factors that influence the actual quality of the results using heuristic rules (e.g., for optimal-order results (see Theorem 6 in Chapter 2), a regularity condition on $x^\dagger$ is often required (see, e.g., (2.26)), and the value of the constants in the estimates is important). It should be noted as well that the results of Chapter 3 only proved sufficient conditions for convergence, which does not necessarily mean that the methods fail when the conditions are violated. (For instance, by including $\delta$-dependent estimates, one may find weaker convergence conditions that hold for certain ranges of the noise level.) Still, a preliminary understanding can be gained from the numerical experiments in this section. We remind the reader once more that all the results of this chapter are based on [90]. In all experiments, we consider the discretised space $\mathbb{R}^n$ and, for the regularisation functional, $\mathcal{R} = \frac{1}{q}\|\cdot\|_{\ell^q}^q$ with $q$ to be specified, and $\mathcal{R} = \mathrm{TV}$ defined below. Due to the discretisation, we opt to choose the parameter $\alpha_* \in [\alpha_{\min}, \alpha_{\max}]$. We also choose to select $\alpha_{\max} = \|A\|^2$ (apart from for TV regularisation below), and for the trickier issue of the lower bound, we set $\alpha_{\min} = \sigma_{\min}$, the smallest singular value of $A^*A$ (again excluding the TV regularisation case, for which we fix $\alpha_{\min} = 10^{-9}$). Other methodologies for selecting $\alpha_{\min}$ were suggested in [51, 84].

A difficult issue in the parameter selection procedure is how global minima appearing at the boundary, i.e., $\alpha_* = \alpha_{\min}$ or $\alpha_* = \alpha_{\max}$, should be treated. For linear regularisation methods, usually only the lower bound at $\alpha_{\min}$ is an issue, which can be handled with the above-mentioned techniques. However, for the present convex regularisation case, we observed several instances (in particular for $\ell^1$-penalties) where the global minimum was at the right boundary at $\alpha_{\max}$, leading to a suboptimal choice of $\alpha_*$, which is quite an unusual phenomenon (in the linear case). We treated this issue through "brute force" by explicitly excluding boundary minima and selecting $\alpha_*$ from the set of interior minima, even when the boundary values of $\psi$ would have smaller values. Only when the set of interior minima is empty do we consider boundary minima again. Additionally, we always rescale the forward operator and exact solution so that $\|A\| = \|x^\dagger\| = 1$. In each experiment, we compute 10 different realisations of the noise, in which the noise level is logarithmically increasing. We also compute the error (the measure for which will differ for the various regularisation schemes) induced by each parameter choice rule, as well as the optimal parameter choice, which will be computed as the minimiser of the respective error functional itself. Moreover, we include plots of the total error at each noise level for all considered scenarios (left-hand side plots), and we also provide a supplementary plot of the parameter choice functionals (right-hand side plots) for an example noise level, which is specified in the error plot via a dotted vertical line. Note that in case we plot the $\psi$-functionals for $\delta = 10\%$, the aforementioned line will not be visible due to it lying on the boundary of the error graph. For our operators, we use the tomography operator (tomo) from Hansen's Regularisation Tools (cf. [61]) with n = 625 and f = 1. We also define a diagonal matrix $A \in \mathbb{R}^{n\times n}$, initially with eigenvalues $\lambda_i = C\,\frac{1}{i^\beta}$ and subsequently also with $\lambda_i = C\exp(-i\beta)$, which simulate mildly and severely ill-posed problems, respectively. In either case, we take an exact solution $x_i^\dagger = C\cdot s_i\,\frac{1}{i^\nu}$ and data perturbed by noise $e_i = C\,\mathcal{N}(0,1)\,\frac{1}{i^\kappa}$, where $s_i \in \{-1, 1\}$ are random, and set the parameters as $n = 20$, $\beta = 4$, $\nu = 2$ and $\kappa = 1$.
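The diagonal test problem just described is easy to set up in a few lines. The following is a minimal sketch; the normalisations (rescaling the operator, the exact solution, and the noise to a prescribed relative level) and the random seed are assumptions made for this illustration.

```python
import numpy as np

def diagonal_test_problem(n=20, beta=4, nu=2, kappa=1, delta=0.01, severely=False, seed=0):
    """Diagonal operator with eigenvalues ~ 1/i^beta (or exp(-i*beta)),
    exact solution x_i ~ s_i/i^nu, and Gaussian noise decaying like 1/i^kappa."""
    rng = np.random.default_rng(seed)
    i = np.arange(1, n + 1, dtype=float)
    lam = np.exp(-i * beta) if severely else 1.0 / i**beta
    A = np.diag(lam / lam.max())                     # rescale so that ||A|| = 1
    s = rng.choice([-1.0, 1.0], size=n)
    x_true = s / i**nu
    x_true /= np.linalg.norm(x_true)                 # rescale so that ||x_true|| = 1
    y = A @ x_true
    e = rng.standard_normal(n) / i**kappa
    y_delta = y + delta * e / np.linalg.norm(e)      # noise of prescribed relative level
    return A, x_true, y, y_delta
```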

6.1 $\ell^1$ Regularisation

A particularly interesting application of convex variational Tikhonov regu- larisation is the case in which q = 1, since it is sparsity enforcing. In fact, it is the most sparsity enforcing regularisation method whilst still remaining a convex minimisation problem. Significant work in the area of sparse regularisation includes [20,48,101,132]. Whilst it does not fit with the Muckenhoupt-type conditions we derived ear- lier, it is nevertheless an interesting regularisation scheme for the practitioner who would be eager to see the performance of the studied rules. Note that in

this case, we minimise the Tikhonov and Bregman functionals using FISTA (cf. [12]). The corresponding proximal mapping operator is the soft-thresholding operator. In this experiment, we use the tomography operator defined above. The solution $x^\dagger$ in our experiment is chosen to be sparse in a custom-built manner. For each parameter choice rule, we compute the error as

$$\mathrm{Err}_{\ell^1}(\alpha_*) = \|x_{\alpha_*}^\delta - x^\dagger\|_{\ell^1}.$$
We may observe in Figure 6.1 that, for smaller noise levels, the heuristic discrepancy rule appears to be the best performing, while for larger noise levels, the right quasi-optimality rule performs best. The two worst performers are the symmetric QO and HR rules, where the latter bucks this trend for small noise. In Figure 6.2, we display a plot of the surrogate functionals $\psi$ for a specific example. In particular, the Hanke-Raus, symmetric, and right quasi-optimality functionals exhibit some erroneous negative values which hamper their performance. Indeed, in this example, negative values were frequently observed for the symmetric QO rule, less so for the HR rule, and less still for the right QO rule. This issue will be discussed later on.
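For reference, a compact sketch of FISTA with the soft-thresholding proximal map for the functional $\tfrac12\|Ax - y^\delta\|^2 + \alpha\|x\|_{\ell^1}$ is given below. The step size, iteration count, and variable names are illustrative; this is a generic FISTA sketch (cf. [12]), not a reproduction of the exact implementation used in [90].

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal map of tau*||.||_1 (componentwise soft thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def fista_l1(A, y_delta, alpha, n_iter=500):
    """Minimise 0.5*||A x - y_delta||^2 + alpha*||x||_1 via FISTA."""
    L = np.linalg.norm(A, 2) ** 2                    # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    z, t = x.copy(), 1.0
    for _ in range(n_iter):
        x_new = soft_threshold(z - A.T @ (A @ z - y_delta) / L, alpha / L)
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t**2))
        z = x_new + ((t - 1.0) / t_new) * (x_new - x)
        x, t = x_new, t_new
    return x
```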

6.2 $\ell^{3/2}$ Regularisation

An interesting case for the purpose of illustrating our theory is $q = \frac{3}{2}$, since the respective proximal mapping operator for computing the minimiser (3.1) has a closed-form solution which is easily computed. In this scenario, we use the diagonal operator defined above, in the foreword of this chapter, with the given parameters, and we compute the error with the Bregman distance; namely,
$$\mathrm{Err}_{\ell^q}(\alpha_*) = D_\xi(x_{\alpha_*}^\delta, x^\dagger), \qquad q \in (1, \infty).$$
We firstly consider the mildly ill-posed case and observe in Figure 6.3 that the Hanke-Raus rule is the best performing one in case the noise level is relatively small, although for mid-range noise levels the heuristic discrepancy rule performs slightly better, and for larger noise levels still, the quasi-optimality rules match the heuristic discrepancy rule. Note that the quasi-optimality rules appear indistinguishable in this plot, and we remark too that the plots of their respective functionals were very similar (see Figure 6.4). The relatively poor performance of the quasi-optimality rules may be explained by the observation in Figure 6.4 that the selected minimisers of the quasi-optimality functionals are suboptimal. If the other local (instead of global) minima were selected (e.g., those left of $\alpha = 10^{-2}$), then the results would be much improved. This is a common phenomenon in many of our

experiments involving the diagonal operator with $q = \frac{3}{2}$. We observe that the HD and HR functionals oscillate as well, although in Figure 6.4, at least, the "correct" minimisers were chosen. In the severely ill-posed setting, we notice in Figure 6.5 that the heuristic rules display a mixed performance. Observe in Figure 6.6 that the global minimum of the quasi-optimality functionals appears to be the "wrong" choice, as in the previous experiment. Indeed, the optimal parameter choice almost coincides with a local minimum of the quasi-optimality functionals. We also see in this example plot that the HD and HR rules grossly overestimate the parameter. A similar pattern was observed for other considered noise levels.
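The closed-form proximal map mentioned at the start of this section can be written out explicitly. For the penalty $\mathcal{R}(x) = \frac{1}{q}\sum_i |x_i|^q$ with $q = \frac{3}{2}$, the scalar optimality condition $x + \tau\,\mathrm{sgn}(x)\sqrt{|x|} = v$ reduces to a quadratic in $\sqrt{|x|}$. The sketch below implements this derivation; the numerical check at the end uses arbitrary illustrative values.

```python
import numpy as np

def prox_l32(v, tau):
    """Proximal map of tau * (2/3)*sum|x_i|^{3/2}, i.e. of tau*R with
    R(x) = (1/q)||x||_q^q and q = 3/2.  Componentwise, x solves
    x + tau*sign(x)*sqrt(|x|) = v; substituting u = sqrt(|x|) gives a quadratic."""
    u = 0.5 * (-tau + np.sqrt(tau**2 + 4.0 * np.abs(v)))
    return np.sign(v) * u**2

# Quick consistency check of the optimality condition (illustrative values):
v, tau = 1.3, 0.4
x = prox_l32(v, tau)
assert abs(x + tau * np.sign(x) * np.sqrt(abs(x)) - v) < 1e-12
```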

6.3 $\ell^3$ Regularisation

Based on the Muckenhoupt-type conditions of Chapter 3, we postulated that for q > 2, the parameter choice rules we consider are likely to face mishaps. Consequently, we have elected to run a numerical experiment with q = 3 in order to illustrate what happens in practice. As in the previous experiment, we consider the diagonal operator and compute the error induced by the parameter choice rules as before. Firstly considering the mildly ill-posed case, we see in Figure 6.7 that all the rules appear to perform very well. In Figure 6.8, we examine an example case in which the heuristic discrepancy would have selected α∗ = αmax, being its global minimum, were it not for our method which effectively disqualifies it. For the severely ill-posed setting, we observe in Figure 6.9 that the rules perform very well in general, with the Hanke-Raus rule being the worst and the quasi-optimality rules being the best of the bunch.

6.4 TV Regularisation

Selecting
$$R(x) := \sup_{\substack{\varphi \in C_0^\infty(\Omega;\mathbb{R}^n) \\ \|\varphi\|_\infty \le 1}} \int_\Omega x(t)\, \operatorname{div} \varphi(t)\, dt,$$
with div denoting the divergence and Ω ⊂ ℝⁿ an open subset, yields total variation (TV) regularisation [138]. For the numerical treatment and for functions on the real line, this is often discretised as R = Σ_k |(∆x)_k| = ‖∆x‖_{ℓ¹} with a (e.g., forward) difference operator ∆. For our numerical implementation, we used the FISTA algorithm [12], with the proximal mapping operator for the total variation functional being computed using a fast Newton-type method, courtesy of the code provided by [5, 6].
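For orientation, the following is a generic FISTA sketch in the spirit of [12], assuming callables A and At for the forward operator and its adjoint, a proximal map prox_R(v, t) for the penalty, and a constant L ≥ ‖A‖²; it is not the thesis implementation (which also relies on the TV proximal map from [5, 6]), merely an illustration of the iteration used.

```python
import numpy as np

def fista(A, At, y, prox_R, alpha, L, x0, n_iter=500):
    """Generic FISTA for min_x 0.5*||A(x) - y||^2 + alpha*R(x).

    A, At: callables for the forward operator and its adjoint;
    prox_R(v, t): proximal map of t*R at v;
    L: upper bound for ||A||^2 (Lipschitz constant of the data-term gradient).
    """
    x_prev = x0
    z = x0
    t = 1.0
    for _ in range(n_iter):
        grad = At(A(z) - y)                    # gradient of the data-fit term at z
        x = prox_R(z - grad / L, alpha / L)    # forward-backward (proximal gradient) step
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t ** 2))
        z = x + ((t - 1.0) / t_next) * (x - x_prev)   # Nesterov-type extrapolation
        x_prev, t = x, t_next
    return x
```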

Note that in this case, we choose αmax such that ‖x_{α∗}^δ‖ ≤ C for a reasonable constant. Moreover, for each parameter choice rule, we compute the error via the so-called "strict metric":
$$\mathrm{Err}_{\mathrm{TV}}(\alpha_*) = |R(x_{\alpha_*}^\delta) - R(x^\dagger)| + \|x_{\alpha_*}^\delta - x^\dagger\|_{\ell^1},$$
which was suggested in [86]. In this instance, we consider the tomography operator.
One may observe in Figure 6.11 that the right quasi-optimality rule appears to be the best performing one overall, although for smaller noise levels, the heuristic discrepancy rule seems to be preferable. The Hanke-Raus and symmetric quasi-optimality rules are generally the worst performing, with a tendency to exhibit negative values; see, e.g., Figure 6.12. In Figure 6.12, we display a plot of the ψ-functionals for one particular example, in which we can see a demonstration of the symmetric quasi-optimality functional's proneness to exhibiting negative values due to numerical errors. In this particular plot, we see that even when the Hanke-Raus functional is non-negative, it overestimates the parameter. Recall that the same phenomenon of negativity was observed for ℓ¹ regularisation, cf. Figure 6.2, although on this occasion, the right QO rule is better behaved.
The existence of negative values is a numerical artefact, as all ψ-functionals are, by definition, non-negative. The symmetric QO functional appears to be the most susceptible of them. The negative values only occur for the ℓ¹ and TV penalties; hence, we conjecture that this could be related to the lack of strict convexity of the penalty functional R. The negative values are likely caused by numerical cancellation and, perhaps more importantly, by approximation errors in the numerical computation of the minimiser of the Tikhonov functional. Indeed, when we calculate the minimiser more accurately, negative values occur less frequently.
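The strict metric above is straightforward to evaluate numerically; the snippet below is a small sketch, assuming the penalty R is available as a callable and the reconstructions are stored as arrays.

```python
import numpy as np

def err_strict(x, x_ref, R):
    """Strict metric |R(x) - R(x_ref)| + ||x - x_ref||_{l1}, as suggested in [86]."""
    return abs(R(x) - R(x_ref)) + np.sum(np.abs(x - x_ref))
```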

6.5 Summary

To summarise the numerical experiments presented above, we remark that the rules worked reasonably well, even in instances contrary to the expectations set by the theory. We observed that while none of the studied parameter choice rules were completely immune to mishaps, the heuristic discrepancy rule could perhaps be said to be the most robust overall, with the right quasi-optimality rule often presenting itself as the better choice for medium to large noise levels. The symmetric quasi-optimality rule, on the other hand, appears to be less reliable (cf. Sections 6.1 and 6.4) and prone to numerical errors, especially for non-strictly convex penalties. For such cases, one should use other methods instead. The Hanke-Raus rule is not a stellar performer either. An important issue that seems to influence the actual performance is that the surrogate functionals ψ sometimes do not estimate the error precisely enough, for instance, when a "wrong" local minimum is selected. In the linear theory, this issue can be related to the lack of certain regularity conditions for x† (e.g., (2.26) in Chapter 2) being satisfied [83]; however, for the present convex case, no such analysis is available yet, which, also in light of the above experiments, makes it difficult to offer a particular recommendation for a rule.

[The figure panels of Chapter 6 are omitted here; only the captions are retained. Each error plot compares the HD, HR, symmetric QO, and right QO rules against the optimal parameter as functions of δ; each ψ-plot shows ψ_HD, ψ_HR, ψ_SQO, ψ_RQO and the error D_ξ(x_α^δ, x†) as functions of α.]

Figure 6.1: Error plots Err_{ℓ¹}(α∗) for different rules and optimal choice: ℓ¹ regularisation, tomo operator.
Figure 6.2: Plot of ψ-functionals: ℓ¹ regularisation, δ = 0.02%.
Figure 6.3: Error plots Err_{ℓ^q}(α∗) for different rules and optimal choice: ℓ^{3/2} regularisation, mildly ill-posed.
Figure 6.4: Plot of ψ-functionals: ℓ^{3/2} regularisation, δ = 0.06%.
Figure 6.5: Error plots Err_{ℓ^q}(α∗) for different rules and optimal choice: ℓ^{3/2}, severely ill-posed.
Figure 6.6: Plot of ψ-functionals: ℓ^{3/2}, severely ill-posed, δ = 1.3%.
Figure 6.7: Error plots Err_{ℓ^q}(α∗) for different rules and optimal choice: ℓ³ regularisation, mildly ill-posed.
Figure 6.8: Plot of ψ-functionals: ℓ³ regularisation, δ = 10%.
Figure 6.9: Error plots Err_{ℓ^q}(α∗) for different rules and optimal choice: ℓ³, severely ill-posed.
Figure 6.10: Plot of ψ-functionals: ℓ³, severely ill-posed, δ = 10%.
Figure 6.11: Error plots Err_TV(α∗) for different rules and optimal choice: TV regularisation, tomo operator.
Figure 6.12: Plot of ψ-functionals: TV regularisation, δ = 0.06%.

Chapter 7

The Simple L-curve Rules for Linear and Convex Tikhonov Regularisation

In this chapter, we essentially present the results of [91], which was the paper that introduced the ψ_L and ψ_LR functionals for minimisation-based heuristic regularisation as an alternative to the more commonly known L-curve rule of Hansen [60]. We therefore pit these rules against one another for both linear and convex Tikhonov regularisation, and also include the quasi-optimality rule for reference. Moreover, these rules get their own separate chapter since they are the most recently suggested rules and thus were not compared in the papers preceding [91].
The numerical tests are performed with the following noise levels: 0.01%, 0.1%, 1%, 5%, 10%, 20%, 50%. Here, the first two are classified as "small", the second pair as "medium", and the last triple is classified as "large". For each noise level, 10 experiments were done. Moreover, as stated, the methods in question are: ψ_L (simple-L), ψ_LR (simple-L ratio), the QO-method, and the original L-curve method defined by maximising the curvature.
A general observation was that whenever the L-curve showed a clear corner, the parameter selected by both ψ_L and ψ_LR was very close to that corner, which confirms the idea of those methods being simplifications of the L-curve method (for the derivations, see Section 2.1.1). Note, however, that closeness on the L-curve does not necessarily mean that the selected parameter is close as well, since the parameterisation around the corner becomes "slow".
We compare the four methods, namely the two new simple-L rules, the QO-rule, and the original L-curve, according to their total error for the respective selected α, and calculate the ratio of the obtained error to the optimal error:
$$\mathrm{Err}_{\mathrm{per}}(\alpha_*) = \frac{d(x_{\alpha_*}^\delta, x^\dagger)}{d(x_{\alpha_{\mathrm{opt}}}^\delta, x^\dagger)}, \tag{7.1}$$
as we computed in Section 5, where one would typically compute Err_per with d(x, y) := ‖x − y‖ for the case of linear regularisation.

Table 7.1: Tikhonov regularisation, diagonal operator: median of the error ratio (7.1) over 10 runs.

                     simple-L   simple-L rat.    QO     L-curve
  s = 2, µ = 0.25
    δ small            1.02         1.02         1.03     9.49
    δ medium           1.01         1.02         1.08     1.78
    δ large            1.79         1.06         1.18     1.15
    δ = 50%            1.97         3.64         1.42     1.46
  s = 2, µ = 0.5
    δ small            1.48         1.48         1.01    50.68
    δ medium           1.66         1.72         1.07     3.78
    δ large            1.78         1.59         1.01     2.52
    δ = 50%            3.09         5.07         1.48     1.90
  s = 2, µ = 1
    δ small            3.88         3.88         1.07    77.12
    δ medium           2.01         2.01         1.07     7.98
    δ large            1.57         1.66         1.08     2.33
    δ = 50%            2.97         4.07         1.27     1.32

7.1 Linear Tikhonov Regularisation

We begin with classical Tikhonov regularisation, in which case we compute the regularised solution as (2.1).

7.1.1 Diagonal Operator

At first, we consider a diagonal operator A with singular values having polynomial decay, σ_i = i^{-s}, i = 1, . . . , n, for some value s, and consider an exact solution also with polynomial decay, ⟨x†, v_i⟩ = (−1)^i i^{-p}. The size of the diagonal matrix A ∈ ℝ^{n×n} was chosen as n = 500. Furthermore, we added random noise (coloured Gaußian noise) ⟨e, u_i⟩ = δ i^{-0.6} ẽ_i, where the ẽ_i are standard normally distributed values. Table 7.1 displays the median of the values of Err_per over 10 experiments with different random noise realisations and for varying smoothness indices µ. The table provides some information about the performance of the rules. Based on additional numbers not presented here, we can state some conclusions:

• The simple-L and simple-L ratio rules outperform the other rules for the small smoothness index µ = 0.25 and small data noise. Except for very large δ, the simple-L ratio is slightly better than the simple-L curve. For very large δ, the simple-L method works but is inferior to QO, while the simple-L ratio method fails then.

• For a high smoothness index, the QO-rule outperforms the other rules and is then the method of choice.

• The original L-curve method often fails for small δ. For larger δ it often works only acceptably. Only in situations where δ is quite large (> 20%) did we find several instances in which it outperforms all other rules.

A similar experiment was performed for a higher smoothing operator by setting s = 4, with similar conclusions. We note that the theory has indicated that for µ = 0.5, the simple-L curve is order optimal without any additional condition on x†, while for the QO-rule this happens at µ = 1 (see Theorem 6 in Chapter 2). One would thus expect the simple-L rule to perform better for µ = 0.5. However, this was not the case (only for µ ≤ 0.25), and the reason is unclear. (We did not do experiments with an x† that does not satisfy the regularity condition (2.27), though.) Still, the observation that the simple-L methods perform better for small µ is backed by the numerical results.
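For concreteness, the following is a minimal sketch of a diagonal test problem of the above type; the decay exponent p of the exact solution (which encodes the smoothness index µ) and the rescaling of the noise to a prescribed relative noise level δ are assumptions made here purely for illustration, not the exact configuration used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
n, s, p, delta = 500, 2.0, 2.5, 0.01            # delta: relative noise level (assumed)
i = np.arange(1, n + 1)
sigma = i ** (-s)                               # singular values sigma_i = i^{-s}
x_true = (-1.0) ** i * i ** (-p)                # <x_true, v_i> = (-1)^i i^{-p}
y_exact = sigma * x_true
noise = i ** (-0.6) * rng.standard_normal(n)    # coloured Gaussian noise
noise *= delta * np.linalg.norm(y_exact) / np.linalg.norm(noise)
y_delta = y_exact + noise

# Tikhonov solutions over a grid of alpha and the optimal parameter (for Err_per).
alphas = np.logspace(-12, 0, 200)
errs = [np.linalg.norm(sigma * y_delta / (sigma ** 2 + a) - x_true) for a in alphas]
alpha_opt, err_opt = alphas[int(np.argmin(errs))], min(errs)
```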

7.1.2 Examples from IR Tools

For the next scenario, we consider a rotational blurring operator from IR Tools [42], namely PRblur, which outputs a sparse operator (which we chose to be of size 1024 × 1024), and seek to reconstruct the satellite image solution provided in the package. The data is corrupted with white Gaußian noise, which is chosen such that ‖e‖/‖x†‖ yields the relative noise level. Note that the operator and solution are normalised such that ‖A‖ = ‖x†‖ = 1, and our parameter search is restricted to the interval α ∈ [10⁻¹⁰, ‖A‖²]. Similarly to the previous experiment, in Table 7.2, we record the median of the values of Err_per over 10 different experiments with varying random noise realisations.

Table 7.2: Tikhonov regularisation, rotational blur operator: median of the error ratio (7.1) over 10 runs.

                simple-L   simple-L rat.    QO      L-curve
  δ small         1.29         1.29        74.04      9.43
  δ medium        1.05         1.09         1.19      2.59
  δ large         1.02         1.01         1.10      1.02
  δ = 50%         1.00        65.21         1.05      1.00

Note that the simple-L rules outperform the other ones in the majority of cases. However, the margin of improvement compared to the QO rule is not large. We observed that for small noise, the QO rule often overestimates the optimal parameter. All rules performed quite well, and the L-curve method in particular showed noticeable improvement as the noise level increased. The simple-L ratio method failed for 50% noise, however. In [91, Theorem 2.13], a smallness condition was required for the noise in order to prove convergence rates for the simple-L rules; thus, the numerical results indicate that this does indeed seem to be necessary.
Whilst using the IR Tools package, we encountered a recurring problem, especially for the tomography operator (to be discussed), whereby the error and functional plots did not display the typical shape one would expect. In particular, the stability error did not "blow up" as α → 0. Similar problems have been encountered for the L-curve method, which fails to show a clear corner point. The theoretical explanation for this appears to be that the tomography operator in IR Tools is not sufficiently ill-conditioned, i.e., the singular values do not decay sufficiently fast, which is a point regarding that package which we would like to emphasise! Note that the Muckenhoupt-type condition may not be satisfied in this case, which is a possible cause of failure for the heuristic rules. It is interesting to note that this phenomenon does not occur for the problems from Hansen's Regularisation Tools [61] used in Chapters 5 and 6.
We now consider the tomography operator from the IR Tools package, i.e., PRtomo, with the true solution being a Shepp-Logan phantom. The operator in question is of size 8100 × 1024. In Table 7.3, one can find a record of the median values of Err_per for 10 different realisations of each noise level, where we omit the "small" noise case in this scenario, as the problem is not sufficiently ill-conditioned, so that for small noise levels, α∗ = α_min yields the best choice. We also replace the 1% noise level by 2% in this scenario. (We revert to the 1% noise level in the subsequent scenarios.)

Table 7.3: Tikhonov regularisation, tomography operator: median of the error ratio (7.1) over 10 runs.

                simple-L   simple-L rat.    QO     L-curve
  δ medium        1.36         1.36         4.32     1.68
  δ large         1.38         1.39         1.20     2.71
  δ = 50%         2.37         1.99         1.63     4.00

In this scenario, the L-curve method actually performed worse for larger noise and most often chose α∗ = α_min. All rules, however, were guilty of quite often underestimating the optimal parameter, which is a result of the insufficient ill-posedness mentioned above.

7.2 Convex Tikhonov Regularisation

We now investigate the heuristic rules for convex Tikhonov regularisation, i.e., we consider x_α^δ as the minimiser of the functional (3.1) with a non-quadratic penalty R. Henceforth, the simple-L methods will consist of minimising the functionals as defined in (3.17), i.e.,
$$\psi_L(\alpha, y^\delta) = R(x_{\alpha,\delta}^{II}) - R(x_\alpha^\delta), \qquad \psi_{LR}(\alpha, y^\delta) := \frac{R(x_{\alpha,\delta}^{II}) - R(x_\alpha^\delta)}{R(x_\alpha^\delta)},$$
which define the simple-L and simple-L ratio functionals, respectively. Note that one can also consider a discrete-type functional,
$$\psi_L(\alpha_n) = R(x_{\alpha_{n+1}}^\delta) - R(x_{\alpha_n}^\delta), \qquad \alpha_n = \alpha_0 q^n, \quad q < 1,$$
as an alternative "convexification" of the simple L-curve method, but the formerly stated method yielded more fruitful results in our experiments and we therefore opted to stick with that.
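As a rough illustration of the discrete-type variant just mentioned, the sketch below evaluates it on a geometric grid; tikhonov(alpha) and R are placeholder callables for the minimiser of (3.1) and the penalty, respectively, and the absolute value is taken merely to guard against small negative values caused by numerical errors (cf. Section 7.2.1).

```python
import numpy as np

def discrete_simple_L(tikhonov, R, alpha0, q=0.8, n_steps=60):
    """Discrete simple-L rule: minimise |R(x_{alpha_{n+1}}) - R(x_{alpha_n})|
    over the geometric grid alpha_n = alpha0 * q**n."""
    alphas = alpha0 * q ** np.arange(n_steps + 1)
    R_vals = np.array([R(tikhonov(a)) for a in alphas])
    psi = np.abs(np.diff(R_vals))
    return alphas[int(np.argmin(psi))]
```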

7.2.1 ℓ¹ Regularisation

To begin with, we consider R = ‖·‖₁ and the rotational blur operator as before (with the same size as in our earlier configuration), but this time we would like to reconstruct a sparse solution x†, and we therefore opt for the sppattern solution from the IR Tools package, which is a sparse image of geometric shapes. We choose Gaußian white noise as before, corresponding to the respective noise levels. Note that we compute a minimiser via FISTA [12]. In this case, we measure the error with the ℓ¹ norm, i.e., we compute Err_per with d(x, y) := ‖x − y‖_{ℓ¹}.
In our experiments, we observed that the values of the aforementioned simple-L functionals were particularly small, therefore on occasion yielding negative values due to numerical errors. This problem was easily rectified, however, by taking the absolute value of (3.15), which is theoretically equivalent to the original functional in any case. For the quasi-optimality functional, we opted to use the so-called right quasi-optimality rule, which we have defined in this thesis as the standard quasi-optimality rule (3.15) (cf. [90]). For selecting the parameter according to the L-curve method of Hansen, maximising the curvature via (2.16) is no longer an implementable strategy, as R is now non-smooth. However, it was still possible to compute the corner point due to the discretisation of the problem. In Table 7.4, one may find a record of the results.

Table 7.4: ℓ¹ regularisation, rotational blur operator: median of the error ratio (7.1) over 10 runs.

                simple-L   simple-L rat.    QO     L-curve
  δ small         2.07         2.07         1.18     2.19
  δ medium        4.82         4.82         1.04     4.68
  δ large         1.00         1.00         1.00     6.73
  δ = 50%         1.08         1.08         1.08    10.37

As mentioned already, the simple-L functionals produced very small values and were therefore somewhat oscillatory, i.e., they were prone to exhibiting multiple local minima. Our algorithm selected the smallest interior minimum, but in some plots, we observed that there were larger local minima which would have corresponded to a more accurate estimation of the optimal parameter. It should be noted that for medium noise, the L-curve was quite "hit & miss" and, for larger noise, quite unsatisfactory.

7.3 ℓ^{3/2} Regularisation

Continuing with the theme of convex Tikhonov regularisation, and more specifically ℓ^{3/2} regularisation, we now consider (3.1) with R = ‖·‖_{3/2}^{3/2}. The considered forward operator A : ℓ^{3/2}(ℕ) → ℓ²(ℕ) is a diagonal operator with polynomially decaying singular values as considered previously, i.e., σ_i = i^{-s}, and we also consider a solution with polynomial decay, ⟨x†, v_i⟩ = (−1)^i i^{-p}, and add random noise ⟨e, u_i⟩ = δ i^{-0.6} ẽ_i. The size of the operator in question is 625 × 625. Note that in this scenario, we are easily able to compute the Tikhonov solution and the second Bregman iterate, as we have a closed-form solution of the associated proximal mapping operator; see Chapter 3 and [90]. A table of results is compiled in Table 7.5 and the following observations are noted:

• Barring the quasi-optimality rule, all methods were generally subpar in the case of small noise for all tested smoothness indices. In general, the quasi-optimality rule appears to be the best performing overall, although it is trumped on a few occasions.

• The “sweet spot” for both simple-L methods appears to be medium to large noise. Overall, at least, they appear to perform marginally better for smaller smoothness indices. The original L-curve method performs quite well for larger noise, as has been observed in other experiments, but the margin for error is quite large for smaller noise levels.

Table 7.5: ℓ^{3/2} regularisation, diagonal operator: median of the error ratio (7.1) over 10 runs.

                     simple-L   simple-L rat.    QO      L-curve
  s = 2, µ = 0.25
    δ small            6.59         6.59         1.02    471.03
    δ medium           1.95         1.95         1.18     31.22
    δ large            1.10         1.10         1.07      1.11
    δ = 50%            1.13         1.21         1.11      1.22
  s = 2, µ = 0.5
    δ small           14.41        14.41         1.00      8.91
    δ medium           2.05         2.05         1.01    115.00
    δ large            1.09         1.09         1.03      1.15
    δ = 50%            4.72         5.61         1.01      1.80
  s = 2, µ = 1
    δ small           20.51        20.51         1.46      4.29
    δ medium           1.36         1.36         1.33    107.77
    δ large            1.14         1.14         1.40      1.34
    δ = 50%            7.06         9.28         1.08      1.53

7.4 TV Regularisation

We now suppose that x_α^δ is the minimiser of (3.1) with R = TV, the total variation seminorm. The functional is minimised using FISTA, with the proximal mapping operator for the total variation seminorm being computed by a fast Newton-type method as in [90]. In this case, we compute the error with respect to α via the so-called strict metric
$$d_{\mathrm{strict}}(x_\alpha^\delta, x^\dagger) := |R(x_\alpha^\delta) - R(x^\dagger)| + \|x_\alpha^\delta - x^\dagger\|_{\ell^1},$$
which was suggested in, e.g., [86], and we subsequently record the values of Err_per with d = d_strict; the results are provided in Table 7.6. The operator in question is the tomography operator arising from PRtomo with the same configuration as before. We add white Gaußian noise corresponding to the respective noise levels.
We note the following observations: all the functionals were oscillatory, exhibiting local minima which were much more pronounced compared to the linear case. This oscillatory behaviour is often a cause for the selection of a false parameter, cf. the subpar results in Table 7.6. An inspection of this table reveals that the simple-L ratio method appears to be the best performing overall, which we also observed in other experiments involving TV regularisation not recorded here.

Table 7.6: TV regularisation, tomography operator: median of the error ratio (7.1) over 10 runs.

                simple-L   simple-L rat.    QO      L-curve
  δ small         6.86         6.86        504.24     5.65
  δ medium       59.55         6.67         61.85     8.43
  δ large         8.17         6.98          8.20     9.73
  δ = 50%         2.35         2.47          2.35     9.99

7.5 Summary

To summarise the numerical results presented above, the simple-L methods are near optimal for linear Tikhonov regularisation in the case of low smoothness of the exact solution. Moreover, the simple-L rule in particular edges out the simple-L ratio rule, but the margin of difference is small and only apparent for larger noise levels. We also considered convex Tikhonov regularisation, for which the simple-L functionals had to be adapted from their original forms. In any case, they were successfully implemented and demonstrated more than satisfactory results. It is interesting to note, however, that in this setting, the simple-L ratio method appeared to be the slightly superior of the two variants, especially for TV regularisation, where it also outperformed the RQO rule in general.

Chapter 8

Heuristic Rules for Nonlinear Landweber Iteration

Note that this chapter is taken entirely from the as yet unpublished preprint [77], of which the author of this thesis is a coauthor, but for which the numerics are largely credited to Simon Hubmer and Ekaterina Sherina. We introduce a couple of test problems on which we evaluate the performance of the heuristic stopping rules described in (4.21). We opt to present "classical" examples from the aforementioned paper, although, there, one will find a wider range of problems including, e.g., integral equations, tomography, and parameter estimation. For the problems we treat here, we provide a short review and describe their precise mathematical setting in Section 8.1 below.
For each of the problems in question, we start from a known solution x†, which defines exact data y. Random noise corresponding to different noise levels δ is subsequently added to y to simulate noisy data y^δ. The step-size ω for Landweber iteration (4.16) is computed via (4.20), based on a numerical estimate of ‖F′(x†)‖. Afterwards, we run Landweber iteration for a predetermined number of iterations k_max, which is chosen manually for each problem via a visual inspection of the error, residual, and heuristic functionals, such that all important points of the experiment are captured for this comparison. Following each application of Landweber iteration, we compute the values of the heuristic functionals, as well as their corresponding minimisers k∗. For each of the different heuristic rules, we then compute the resulting absolute error
$$\mathrm{Err}(k_*) = \|x_{k_*}^\delta - x^\dagger\|,$$
and, as in the previous chapters, for comparison, we also compute the optimal stopping index
$$k_{\mathrm{opt}} := \operatorname*{argmin}_{k \in \mathbb{N}} \|x_k^\delta - x^\dagger\|,$$

together with the corresponding optimal absolute error. Furthermore, we also compute the stopping index k_DP determined by the discrepancy principle (1.29), which can also be interpreted as the "first" minimiser of the functional

$$\varphi_{\mathrm{DP}}(k) := \big|\, \|F(x_k^\delta) - y^\delta\| - \tau\delta \,\big|.$$

As noted in Section 4.2, since the exact value of η in (4.18) is unknown for our test problems, a suitable value for τ has to be chosen manually. Depending on the problem, we use one of the popular choices τ = 1.1 or τ = 2, although, as we are going to see below, these are not necessarily the "optimal" ones. In any case, the corresponding results are useful reference points for the performance of the different heuristic rules. Let us point out at least two peculiarities that we encounter when applying heuristic rules to nonlinear Landweber regularisation.

• The convergence theory for nonlinear inverse problems is a local one, i.e., one can only prove convergence when the initial guess is sufficiently close to the exact solution x†, and in the case where noise is present, the iteration usually diverges out of the neighbourhood of the solution x† as k → ∞. In particular, it may also happen that x_k^δ "falls" out of the domain of the forward operator. Consequently, the functionals ψ(k, y^δ) in (4.21) may not be defined for very large k. By definition, however, one would have to compute a minimiser over all k, which is then not practically possible. Note that this is in contrast to Tikhonov regularisation (see Chapter 2), where the solution is always well-defined for any α. In practice, a remedy is the introduction of an upper limit on the number of iterations up to which the functional ψ(k, y^δ) is computed.

• The second issue concerns the fact that the approximation error (i.e., (1.36)) is quite often poorly estimated by ψ, albeit only for the first few iterations. For practical considerations, the consequence is that the functionals ψ(k, y^δ) typically exhibit a local minimum for small k. This happens quite often, even in the linear case. However, this local minimum is rarely the global one, which usually appears much later for larger k, although an inexperienced practitioner may be tempted to take this local minimum for the global one in order to save having to compute further iterations. Note that the underlying reason for this local minimum may be explained (at least in the linear case) from the analysis found in [83]. That is, in order to estimate the approximation error reliably (i.e., for (1.36)), some regularity conditions for x† have to be satisfied (see (4.13)). These conditions are more restrictive for Landweber iteration than for Tikhonov regularisation and usually hold for small iteration numbers k, typically with "bad" constants.

Note that, due to the above considerations, we consequently discard the first few iterations (with regard to selecting a global minimiser) in our experiments.
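To make the comparison above concrete, the following is a minimal sketch of the stopping machinery, assuming generic callables F and dF_adj (the latter applying the adjoint of the Fréchet derivative to a residual); the heuristic functional values psi_values are assumed to have been computed separately according to (4.21), and the discarding of the first few iterations is reflected by the k_skip argument.

```python
import numpy as np

def landweber(F, dF_adj, y_delta, x0, omega, kmax):
    """Nonlinear Landweber iteration x_{k+1} = x_k + omega * F'(x_k)^*(y_delta - F(x_k)).

    F(x) evaluates the forward operator; dF_adj(x, r) applies the adjoint of the
    Frechet derivative at x to a residual r. Returns all iterates and residual norms.
    """
    xs, res = [x0], [np.linalg.norm(y_delta - F(x0))]
    x = x0
    for _ in range(kmax):
        r = y_delta - F(x)
        x = x + omega * dF_adj(x, r)
        xs.append(x)
        res.append(np.linalg.norm(y_delta - F(x)))
    return xs, np.array(res)

def stop_discrepancy(res, tau, delta):
    """Discrepancy principle: first k with ||F(x_k) - y_delta|| <= tau*delta."""
    hits = np.where(res <= tau * delta)[0]
    return int(hits[0]) if hits.size else len(res) - 1

def stop_heuristic(psi_values, k_skip=5):
    """Select k_* as the minimiser of a heuristic functional psi(k),
    discarding the first k_skip iterations as discussed above."""
    return int(np.argmin(psi_values[k_skip:])) + k_skip
```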

8.1 Test problems

8.1.1 Nonlinear Hammerstein Operator

A typical nonlinear inverse problem [58, 74, 75, 121–123], which is used for testing, in particular, the behaviour of iterative regularisation methods, is based on so-called nonlinear Hammerstein operators of the form
$$F : H^1[0,1] \to L^2[0,1], \qquad F(x)(s) := \int_0^1 k(s,t)\,\varphi(x(t))\, dt.$$
Here, we look at a special instance of this operator, namely
$$F(x)(s) := \int_0^s x(t)^3\, dt,$$
for which it is known (see for example [123]) that the tangential cone condition (4.18) holds locally around a solution x†, given that it is bounded away from zero. Furthermore, the Fréchet derivative and its adjoint, which are needed for the implementation of Landweber iteration, are easy to compute explicitly. In our experiments here, we discretise with n = 128, and the exact solution is x†(s) = 2 + (s − 0.5)/10. As the initial guess, we take x₀ = 1 and, for the discrepancy principle, we opt for τ = 2. We conduct experiments for noise levels δ ∈ (0.1, 2) with step-size 0.1.
In Figure 8.1, we observe that, in general, none of the rules perform particularly well, including the a-posteriori rule, the discrepancy principle. Amongst the heuristic rules, at least, the heuristic discrepancy and Hanke-Raus rules appear to perform the best for lower noise levels, but for noise levels over 1.2% they perform the worst, where the Hanke-Raus rule exhibits this phenomenon to the greatest degree. The quasi-optimality and simple-L rules maintain a consistent performance across the entire tested range of δ. In Figure 8.2, we see example plots corresponding to the 1% noise level case. In this particular instance, we observe that the quasi-optimality and simple-L rules overestimate the optimal stopping index, whereas the heuristic discrepancy and Hanke-Raus rules fare better.
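A minimal discretisation sketch of this special Hammerstein operator is given below; the midpoint grid and, in particular, the use of the L² adjoint of F′(x) (rather than the H¹ adjoint required by the setting above) are simplifying assumptions made purely for illustration.

```python
import numpy as np

n = 128
h = 1.0 / n
t = (np.arange(n) + 0.5) * h      # midpoint grid on [0, 1]

def F(x):
    """Discrete Hammerstein operator F(x)(s) = integral_0^s x(t)^3 dt (cumulative quadrature)."""
    return np.cumsum(x ** 3) * h

def dF(x, v):
    """Directional derivative F'(x)v (s) = integral_0^s 3 x(t)^2 v(t) dt."""
    return np.cumsum(3 * x ** 2 * v) * h

def dF_adj(x, w):
    """L^2 adjoint of F'(x): (F'(x)^* w)(t) = 3 x(t)^2 * integral_t^1 w(s) ds."""
    tail = np.cumsum(w[::-1])[::-1] * h
    return 3 * x ** 2 * tail

x_true = 2 + (t - 0.5) / 10       # exact solution from the setting above
y_exact = F(x_true)               # exact data
```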

Figure 8.1: Nonlinear Hammerstein equation: error plots for different parameter choice rules and the optimal choice.
Figure 8.2: Plot of ψ-functionals and k∗: nonlinear Hammerstein equation, δ = 1%.

8.1.2 Auto-Convolution

As an additional test example, we consider the problem of (de-)auto-convolution [22, 38, 47, 131]. Among the many inverse problems based on integral operators, auto-convolution is particularly interesting due to its importance in laser optics [1, 14, 43]. Mathematically, it amounts to solving an operator equation of the form (4.14) with the operator
$$F : L^2[0,1] \to L^2[0,1], \qquad F(x)(s) := (x * x)(s) := \int_0^1 x(s-t)\, x(t)\, dt,$$
where the functions in L²[0,1] are interpreted as 1-periodic functions on ℝ. While deriving the Fréchet derivative of F and its adjoint is straightforward, it is not known whether the tangential cone condition (4.18) holds. However, for small enough noise levels δ, the residual functional is locally convex around the exact solution x†, provided that x† has only finitely many non-zero Fourier coefficients.
In this experiment, we take x†(s) = 10 + √2 sin(2πs) as the exact solution, the initial guess as x₀(s) = 10 + (√2/4) sin(2πs), and τ = 1.1 for the discrepancy principle. We conduct experiments for noise levels δ ∈ (0.01, 0.1) with step-size 0.005.
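The following is a minimal sketch of the discretised autoconvolution operator and the adjoint of its Fréchet derivative F′(x)h = 2x ∗ h, under the assumptions that [0, 1) is sampled at n equidistant points, the periodic convolution is implemented via the FFT, and the L² inner product is approximated by the correspondingly weighted Euclidean one.

```python
import numpy as np

n = 128
h = 1.0 / n
s = (np.arange(n) + 0.5) * h

def circ(a, b):
    """Periodic convolution on [0, 1): h * sum_j a[(k-j) mod n] * b[j]."""
    return h * np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

def F(x):
    """Autoconvolution F(x) = x * x for 1-periodic functions."""
    return circ(x, x)

def dF_adj(x, w):
    """Adjoint of F'(x) v = 2 x * v with respect to the discrete (h-weighted) L^2 product."""
    return 2 * h * np.real(np.fft.ifft(np.conj(np.fft.fft(x)) * np.fft.fft(w)))

x_true = 10 + np.sqrt(2) * np.sin(2 * np.pi * s)          # exact solution from above
x0 = 10 + 0.25 * np.sqrt(2) * np.sin(2 * np.pi * s)       # initial guess from above
y_exact = F(x_true)
```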

Figure 8.3: Autoconvolution: error plots for different parameter choice rules and the optimal choice.

Figure 8.4: Plot of ψ-functionals: autoconvolution, δ = 0.5%.

In Figure 8.3, we observe that, barring the Hanke-Raus rule, the heuristic rules outperformed the discrepancy principle, with the heuristic discrepancy rule even matching, or almost matching, the optimal stopping rule in all tested instances. From this plot, it is evident that the Hanke-Raus rule is by far the worst performing. In Figure 8.4, we provide example plots of the functionals and their associated stopping indices. In this example, we see that the simple-L rule overestimates the optimal stopping index, but since the error does not increase significantly as the iterations progress, it is the Hanke-Raus rule which exhibits the largest error, as it underestimates the optimal stopping index and the error is significantly larger for smaller k.

8.1.3 Summary

The results of this chapter merely present a first step into the realm of applying heuristic rules to nonlinear Landweber iteration. However, it is at least possible to draw some speculative conclusions based on the two experiments presented in the preceding sections. Certainly, in Section 8.1.2 on auto-convolution, the heuristic stopping rules presented themselves as not only viable, but also very effective methods for terminating Landweber iteration. The tests of Section 8.1.1 on the nonlinear Hammerstein problem were less conclusive, as neither the heuristic rules nor the discrepancy rule performed consistently. Finally, we remind the interested reader that further experiments will be conducted in the upcoming paper [77]. The additional tests in that paper, in addition to subsequent future research, will hopefully help achieve a better understanding of heuristic stopping rules for nonlinear Landweber iteration.

Part III

Future Scope

Chapter 9

Future Scope

Several topics of current interest remain beyond the scope of this thesis, and we therefore did not discuss them in the preceding chapters. However, for the interested reader, we provide a brief summary of them, as far as the author's knowledge permits:

9.1 Convex Heuristic Regularisation

In Section 3.3, we presented the recent work of [90], which showed that, under Muckenhoupt-type noise restrictions, one can prove convergence of certain heuristic rules in the Bregman distance in case the operator is diagonal. Clearly, the restriction to diagonal operators is severe and does not correspond to the majority of practical situations. Therefore, it would be interesting to see whether restrictions such as (3.36) can be lifted to the case where A is a more general linear operator. One may conjecture that this could perhaps be achieved through projections onto the noise vector. Furthermore, there is as yet little theory for heuristic rules in case A : X → Y is a nonlinear operator and/or Y (as well as X) is a Banach space. The development of a noise-restricted analysis in either of those cases would be quite a feat and certainly a project for the future, especially given that the majority of natural phenomena are usually modelled via nonlinear operators.

9.2 Heuristic Blind Kernel Deconvolution

A popular instance of an inverse problem comes in the form of the deconvolution problem [3, 127]. In particular, we consider the ill-posed problem (1.1) with A replaced by the convolution operator A = k ∗ ; that is,
$$k * f = g, \tag{9.1}$$
where k, f, g ∈ H¹ and only noisy data g^δ = g + e are available, with δ, as usual, the noise level. Moreover, we assume, also as usual, that the solution does not depend continuously on the data. This problem is a typical example of the overlap between the fields of inverse problems and image processing. It is typical for k to represent a point-spread function that acts on the image f to produce the blurred (and noisy) image g^δ.

9.2.1 Deconvolution

In the standard deconvolution case, one generally assumes knowledge of the kernel k and may therefore determine a regularised solution of the one unknown (i.e., f†) via
$$f_\alpha^\delta \in \operatorname*{argmin}_{f \in H^1} T_\alpha^\delta(f), \qquad \text{where} \quad T_\alpha^\delta(f) := \tfrac{1}{2}\|k * f - g^\delta\|^2 + \alpha\, TV(f), \tag{9.2}$$
which is a special instance of the Tikhonov functional with the regularisation term given by the functional TV : X → ℝ ∪ {∞} and α ∈ (0, α_max). Note that when k = id, the identity function, (9.2) is used for denoising and is commonly referred to in the literature as Rudin-Osher-Fatemi (ROF) regularisation [138]. For working purposes, one may consider ∫‖∇f‖ with a difference operator ∇; this is the so-called isotropic version, and the version with the ℓ¹ norm is known as the anisotropic case. In this case, we may solve the minimisation problem (9.2) with the alternating direction method of multipliers (ADMM) [16, 27, 41, 46]. Note that the augmented Lagrangian [13, 66, 129] for the discretised version of (9.2) is given by
$$\begin{aligned} L(\tilde f, \hat f, f, \lambda, \xi) = {} & \tfrac{1}{2}\|k * f - g^\delta\|^2 + \alpha \sum_{i=1}^{n} \|\hat f_i\| + \tfrac{\omega_1}{2}\|\tilde f - f\|^2 - \lambda^T(\tilde f - f) \\ & + \sum_{i=1}^{n} \Big( \tfrac{\omega_2}{2}\|\hat f_i - D_i f\|^2 - \xi_i^T(\hat f_i - D_i f) \Big), \end{aligned}$$
where λ ∈ ℝⁿ and ξ ∈ ℝ^{n×2} [19]. Certainly, it would be interesting to see how the heuristic rules perform here. In some preliminary testing, which involved deconvolving a noisy image that had been blurred with a Gaußian point-spread function, the ψ_LR rule seemed the most promising.

9.2.2 Semi-Blind Deconvolution

In this case, we have two unknowns: namely, we do not know k in addition to the unknown f†. However, the separating factor from (completely) blind deconvolution is that in semi-blind deconvolution, we assume knowledge of an approximation of k, namely k_ε satisfying ‖k − k_ε‖ ≤ ε, where ε is the approximation error. We consider a different functional to (9.2), which was suggested in [19]; namely,
$$T_{\alpha,\beta}^{\delta,\varepsilon}(k, f) = \tfrac{1}{2}\|k * f - g^\delta\|^2 + \alpha\big(\|f\|^2 + TV(f)\big) + \beta\big(\|k - k_\varepsilon\|^2 + TV(k)\big), \tag{9.3}$$
from which we may obtain the regularised solutions for the two unknowns:
$$(k_\beta, f_\alpha) \in \operatorname*{argmin}_{k, f \in H^1} T_{\alpha,\beta}^{\delta,\varepsilon}(k, f). \tag{9.4}$$

In [19], a computationally efficient way of computing the minimisers was also proposed, based on the associated augmented Lagrangian of (9.3),
$$\begin{aligned} L(\tilde f, \hat f, f, \tilde k, \hat k, k, \lambda, \xi, \zeta, \mu) = {} & \|k * f - g^\delta\|^2 + \alpha\Big(\|f\|^2 + \sum_{i=1}^{n}\|\hat f_i\|\Big) + \beta\Big(\|k - k_\varepsilon\|^2 + \sum_{i=1}^{n}\|\hat k_i\|\Big) \\ & + \tfrac{\omega_1}{2}\|\tilde f - f\|^2 - \langle \lambda, \tilde f - f\rangle + \sum_{i=1}^{n}\Big( \tfrac{\omega_2}{2}\|\hat f_i - D_i f\|^2 - \langle \xi_i, \hat f_i - D_i f\rangle \Big) \\ & + \tfrac{\omega_3}{2}\|\tilde k - k\|^2 - \langle \zeta, \tilde k - k\rangle + \sum_{i=1}^{n}\Big( \tfrac{\omega_4}{2}\|\hat k_i - D_i k\|^2 - \langle \mu_i, \hat k_i - D_i k\rangle \Big), \end{aligned}$$
where λ, ζ ∈ ℝⁿ and ξ, µ ∈ ℝ^{n×2}, together with the ADMM method. In the paper [19], the parameters α and β were chosen by "trial and error". Therefore, an interesting avenue of further work would be to develop and implement a heuristic parameter choice rule for selecting (α∗, β∗). Clearly, the parameter choice rules previously discussed would not be applicable in their current form as presented in the previous chapters. There are no doubt many possibilities for how one could proceed, but one idea in contention could be to do it in an iterative manner, i.e., for a fixed initial guess f₀, one would first solve
$$k_\beta \in \operatorname*{argmin}_{k \in H^1} \Big\{ \tfrac{1}{2}\|k * f_0 - g^\delta\|^2 + \beta\big(\|k - k_\varepsilon\|^2 + TV(k)\big) \Big\},$$
with a heuristic parameter choice rule
$$\beta_* \in \operatorname*{argmin}_{\beta \in (0,\beta_{\max})} \psi(\beta, g^\delta).$$
Then, one could proceed to minimise
$$f_{\alpha(\beta_*)}^\delta \in \operatorname*{argmin}_{f \in H^1} \Big\{ \tfrac{1}{2}\|k_{\beta_*} * f - g^\delta\|^2 + \alpha(\beta_*)\big(\|f\|^2 + TV(f)\big) \Big\},$$
and finally obtain a solution via a heuristic rule
$$\alpha_* \in \operatorname*{argmin}_{\alpha \in (0,\alpha_{\max})} \psi(\alpha(\beta_*), g^\delta),$$
which would yield f_{α∗}^δ. Of course, there may be more effective alternatives and they, along with the aforementioned algorithm, should be investigated numerically.
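Purely as an illustration of this iterative idea, the sketch below strings the two heuristic selections together; solve_k, solve_f, psi_k, and psi_f are hypothetical placeholder callables (the kernel and image subproblem solvers, with the fixed initial guess f₀ baked into solve_k, and the respective heuristic functionals), not part of any existing implementation.

```python
def semi_blind_heuristic(solve_k, solve_f, psi_k, psi_f, betas, alphas):
    """Alternating heuristic selection of (beta_*, alpha_*) as sketched above."""
    # Step 1: select beta_* heuristically for the kernel subproblem (fixed f0).
    k_candidates = {beta: solve_k(beta) for beta in betas}
    beta_star = min(betas, key=lambda b: psi_k(b, k_candidates[b]))
    k_star = k_candidates[beta_star]
    # Step 2: with the kernel k_{beta_*} fixed, select alpha_* for the image subproblem.
    f_candidates = {alpha: solve_f(alpha, k_star) for alpha in alphas}
    alpha_star = min(alphas, key=lambda a: psi_f(a, f_candidates[a]))
    return beta_star, alpha_star, f_candidates[alpha_star]
```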

9.3 Meta-Heuristics

Note that all of the previous section may be considered an application of the greater branch of meta-heuristics, e.g., when one has a regularisation of the form
$$T_{\alpha_j}^\delta(x) := \tfrac{1}{2}\|Ax - y^\delta\|^2 + \sum_{i=1}^{j} \alpha_i R_i(x), \qquad (j \in \mathbb{N}), \tag{9.5}$$
where we can observe that (9.3) is merely the above with j = 2, R₁ = ‖·‖² + TV, and R₂ = ‖· − k_ε‖² + TV. The general principle of meta-heuristic regularisation has been applied (numerically) with some success (cf. [39, 119, 120]), with typical applications involving the theory of reproducing kernel Hilbert spaces (RKHS) in machine learning, and other examples including, e.g., finance (cf. [72]). Thus, there is much potential for further investigation of this "sub-branch" of heuristic parameter choice rules.

Bibliography

[1] S. W. Anzengruber, S. Bürger, B. Hofmann, and G. Steinmeyer, Variational regularization of complex deautoconvolution and phase retrieval in ultrashort laser pulse characterization, Inverse Problems, 32 (2016), pp. 035002, 27.

[2] M. A. Ariño and B. Muckenhoupt, Maximal functions on classical Lorentz spaces and Hardy's inequality with weights for nonincreasing functions, Transactions of the American Mathematical Society, 320 (1990), pp. 727–735.

[3] G. Ayers and J. C. Dainty, Iterative blind deconvolution method and its applications, Optics Letters, 13 (1988), pp. 547–549.

[4] A. Bakushinskiy, Remarks on choosing a regularization parameter using quasi-optimality and ratio criterion, USSR Computational Mathematics and Mathematical Physics, 24 (1985), pp. 181–182.

[5] A. Barbero and S. Sra, Fast Newton-type methods for total variation regularization, in Proceedings of the 28th International Conference on Machine Learning (ICML-11), Citeseer, 2011, pp. 313–320.

[6] , Modular proximal optimization for multidimensional total-variation regularization, arXiv preprint arXiv:1411.0589, (2014).

[7] F. Bauer and S. Kindermann, The quasi-optimality criterion for classical inverse problems, Inverse Problems, 24 (2008), pp. 035002, 20.

[8] , Recent results on the quasi-optimality principle, J. Inverse Ill-Posed Probl., 17 (2009), pp. 5–18.

[9] F. Bauer and M. A. Lukas, Comparing parameter choice methods for regularization of ill-posed problems, Mathematics and Computers in Simulation, 81 (2011), pp. 1795–1841.

[10] H. H. Bauschke and P. L. Combettes, Convex analysis and monotone operator theory in Hilbert spaces, CMS Books in Mathematics/Ouvrages de Mathématiques de la SMC, Springer, Cham, second ed., 2017. With a foreword by Hédy Attouch.

[11] A. Beck, First-order methods in optimization, vol. 25 of MOS-SIAM Series on Optimization, Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA; Mathematical Optimization Society, Philadelphia, PA, 2017.

[12] A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM J. Imaging Sci., 2 (2009), pp. 183–202.

[13] D. P. Bertsekas and J. N. Tsitsiklis, Parallel and distributed computation: numerical methods, Athena Scientific, Belmont, MA, 2014. Originally published by Prentice-Hall, Inc. in 1989. Includes corrections (1997).

[14] S. Birkholz, G. Steinmeyer, S. Koke, D. Gerth, S. Burger,¨ and B. Hofmann, Phase retrieval via regularization in self- diffraction-based spectral interferometry, J. Opt. Soc. Am. B, 32 (2015), pp. 983–992.

[15] N. Bissantz, T. Hohage, A. Munk, and F. Ruymgaart, Con- vergence rates of general regularization methods for statistical inverse problems and applications, SIAM Journal on Numerical Analysis, 45 (2007), pp. 2610–2636.

[16] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, Distributed optimization and statistical learning via the alternating di- rection method of multipliers, Found. Trends Mach. Learn., 3 (2011), p. 1–122.

[17] L. M. Bregman` , A relaxation method of finding a common point of convex sets and its application to the solution of problems in convex programming, Z.ˇ Vyˇcisl.Mat i Mat. Fiz., 7 (1967), pp. 620–631.

[18] C. Brezinski, G. Rodriguez, and S. Seatzu, Error estimates for linear systems with applications to regularization, Numerical Algo- rithms, 49 (2008), pp. 85–104.

[19] A. Buccini, M. Donatelli, and R. Ramlau, A semiblind regu- larization algorithm for inverse problems with application to image de- blurring, SIAM Journal on Scientific Computing, 40 (2018), pp. A452– A483.

[20] M. Burger, J. Flemming, and B. Hofmann, Convergence rates in `1-regularization if the sparsity assumption fails, Inverse Problems, 29 (2013), pp. 025013, 16.

[21] M. Burger and S. Osher, Convergence rates of convex variational regularization, Inverse Problems, 20 (2004), pp. 1411–1421.

[22] S. Burger¨ and B. Hofmann, About a deficit in low-order conver- gence rates on the example of autoconvolution, Applicable Analysis, 94 (2015), pp. 477–493.

[23] L. Cavalier, Inverse problems in statistics, in Inverse problems and high-dimensional estimation, vol. 203 of Lect. Notes Stat. Proc., Springer, Heidelberg, 2011, pp. 3–96.

[24] G. Chen and M. Teboulle, Convergence analysis of a proximal- like minimization algorithm using Bregman functions, SIAM Journal on Optimization, 3 (1993), pp. 538–543.

[25] D. L. Colton, Analytic theory of partial differential equations, vol. 8 of Monographs and Studies in Mathematics, Pitman (Advanced Pub- lishing Program), Boston, Mass.-London, 1980.

[26] R. M. Corless, G. H. Gonnet, D. E. G. Hare, D. J. Jeffrey, and D. E. Knuth, On the Lambert W function, Adv. Comput. Math., 5 (1996), pp. 329–359.

[27] J. Douglas, Jr. and H. H. Rachford, Jr., On the numerical solution of heat conduction problems in two and three space variables, Trans. Amer. Math. Soc., 82 (1956), pp. 421–439.

[28] H. Egger, Regularization of inverse problems with large noise, Journal of Physics: Conference Series, 124 (2008), p. 012022.

[29] P. N. Eggermont, V. LaRiccia, and M. Z. Nashed, Noise Mod- els for Ill-Posed Problems, Springer Berlin Heidelberg, Berlin, Heidel- berg, 2015, pp. 1633–1658.

[30] P. P. B. Eggermont, V. N. LaRiccia, and M. Z. Nashed, On weakly bounded noise in ill-posed problems, Inverse Problems, 25 (2009), pp. 115018, 14.

[31] I. Ekeland and R. Temam´ , Analyse convexe et probl`emesvariation- nels, Dunod; Gauthier-Villars, Paris-Brussels-Montreal, Que., 1974. Collection Etudes´ Math´ematiques.

[32] , Convex analysis and variational problems, vol. 28 of Classics in Applied Mathematics, Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, English ed., 1999. Translated from the French.

[33] H. W. Engl, Integralgleichungen, Springer Lehrbuch Mathematik [Springer Mathematics Textbook], Springer-Verlag, 1997.

[34] H. W. Engl and H. Gfrerer, A posteriori parameter choice for general regularization methods for solving linear ill-posed problems, Appl. Numer. Math., 4 (1988), pp. 395–417.

[35] H. W. Engl, M. Hanke, and A. Neubauer, Regularization of inverse problems, vol. 375 of Mathematics and its Applications, Kluwer Academic Publishers Group, Dordrecht, 1996.

[36] L. C. Evans, Partial differential equations, vol. 19 of Graduate Stud- ies in Mathematics, American Mathematical Society, Providence, RI, second ed., 2010.

[37] W. Fenchel, On conjugate convex functions, Canad. J. Math., 1 (1949), pp. 73–77.

[38] G. Fleischer and B. Hofmann, On inversion rates for the auto- convolution equation, Inverse Problems, 12 (1996), pp. 419–435.

[39] M. Fornasier, V. Naumova, and S. V. Pereverzyev, Parame- ter choice strategies for multipenalty regularization, SIAM Journal on Numerical Analysis, 52 (2014), pp. 1770–1794.

[40] G. Frasso and P. H. Eilers, L-and V-curves for optimal smoothing, Statistical Modelling, 15 (2015), pp. 91–111.

[41] D. Gabay and B. Mercier, A dual algorithm for the solution of nonlinear variational problems via finite element approximation, Com- puters & Mathematics with Applications, 2 (1976), pp. 17–40.

[42] S. Gazzola, P. C. Hansen, and J. G. Nagy, IR Tools: a MAT- LAB package of iterative regularization methods and large-scale test problems, Numer. Algorithms, 81 (2019), pp. 773–811.

[43] D. Gerth, B. Hofmann, S. Birkholz, S. Koke, and G. Stein- meyer, Regularization of an autoconvolution problem in ultrashort laser pulse characterization, Inverse Problems in Science and Engineer- ing, 22 (2014), pp. 245–266.

[44] H. Gfrerer, An a posteriori parameter choice for ordinary and it- erated Tikhonov regularization of ill-posed problems leading to optimal convergence rates, Math. Comp., 49 (1987), pp. 507–522, S5–S12.

[45] V. Glasko and Y. Kriskin, On the quasi-optimality principle for ill- posed problems in Hilbert space, U.S.S.R. Comput.Maths.Math.Phys., 24 (1984), pp. 1–7.

[46] R. Glowinski and A. Marroco, Sur l'approximation, par éléments finis d'ordre un, et la résolution, par pénalisation-dualité, d'une classe de problèmes de Dirichlet non linéaires, ESAIM: Mathematical Modelling and Numerical Analysis - Modélisation Mathématique et Analyse Numérique, 9 (1975), pp. 41–76.

[47] R. Gorenflo and B. Hofmann, On autoconvolution and regular- ization, Inverse Problems, 10 (1994), pp. 353–373.

[48] M. Grasmair, M. Haltmeier, and O. Scherzer, Sparse regular- ization with `q penalty term, Inverse Problems, 24 (2008), pp. 055020, 13.

[49] C. W. Groetsch, Inverse problems in the mathematical sciences, Vieweg Mathematics for Scientists and Engineers, Friedr. Vieweg & Sohn, Braunschweig, 1993.

[50] J. Hadamard, Lectures on Cauchy’s problem in linear partial differ- ential equations, Dover Publications, New York, 1953.

[51] U. Hamarik,¨ U. Kangro, S. Kindermann, and K. Raik, Semi- heuristic parameter choice rules for Tikhonov regularisation with opera- tor perturbations, Journal of Inverse and Ill-posed Problems, 27 (2019), pp. 117–131.

[52] U. Hamarik,¨ U. Kangro, R. Palm, T. Raus, and U. Taut- enhahn, Monotonicity of error of regularized solution and its use for parameter choice, Inverse Probl. Sci. Eng., 22 (2014), pp. 10–30.

[53] U. Hamarik,¨ R. Palm, and T. Raus, Comparison of parameter choices in regularization algorithms in case of different information about noise level, Calcolo, 48 (2011), pp. 47–59.

[54] U. Hamarik,¨ R. Palm, and T. Raus, A family of rules for parame- ter choice in Tikhonov regularization of ill-posed problems with inexact noise level, Journal of Computational and Applied Mathematics, 236 (2012), pp. 2146–2157.

[55] M. Hanke, Limitations of the L-curve method in ill-posed problems, BIT, 36 (1996), pp. 287–301.

[56] , A note on the nonlinear Landweber iteration, Numer. Funct. Anal. Optim., 35 (2014), pp. 1500–1510.

[57] M. Hanke and C. W. Groetsch, Nonstationary iterated Tikhonov regularization, Journal of Optimization Theory and Applications, 98 (1998), pp. 37–53.

[58] M. Hanke, A. Neubauer, and O. Scherzer, A convergence analysis of the Landweber iteration for nonlinear ill-posed problems, Numerische Mathematik, 72 (1995), pp. 21–37.

[59] M. Hanke and T. Raus, A general heuristic for choosing the regu- larization parameter in ill-posed problems, SIAM Journal on Scientific Computing, 17 (1996), pp. 956–972.

[60] P. C. Hansen, Analysis of discrete ill-posed problems by means of the L-curve, SIAM Review, 34 (1992), pp. 561–580.

[61] , Regularization tools: a Matlab package for analysis and solution of discrete ill-posed problems, Numer. Algorithms, 6 (1994), pp. 1–35.

[62] , Rank-deficient and discrete ill-posed problems, SIAM Monographs on Mathematical Modeling and Computation, Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 1998. Numerical aspects of linear inversion.

[63] , The L-curve and its use in the numerical treatment of inverse problems, in in Computational Inverse Problems in Electrocardiology, ed. P. Johnston, Advances in Computational Bioengineering, WIT Press, 2000, pp. 119–142.

[64] P. C. Hansen and D. P. O’Leary, The use of the L-curve in the regularization of discrete ill-posed problems, SIAM Journal on Scientific Computing, 14 (1993), pp. 1487–1503.

[65] B. Harrach, T. Jahn, and R. Potthast, Beyond the Bakushinkii veto: regularising linear inverse problems without knowing the noise distribution, Numer. Math., 145 (2020), pp. 581–603.

[66] M. R. Hestenes, Multiplier and gradient methods, J. Optim. Theory Appl., 4 (1969), pp. 303–320.

[67] B. Hofmann, Ill-posedness and regularization of inverse problems—a review of mathematical methods, in The Inverse Problem (Tegernsee, 1994), Res. Meas. Approv., VCH, Weinheim, 1995, pp. 45–66.

[68] B. Hofmann, B. Kaltenbacher, C. Poeschl, and O. Scherzer, A convergence rates result for Tikhonov regularization in Banach spaces with non-smooth operators, Inverse Problems, 23 (2007), p. 987.

[69] B. Hofmann and S. Kindermann, On the degree of ill-posedness for linear problems with non-compact operators, Methods Appl. Anal., 17 (2010), pp. 445–461.

[70] B. Hofmann and O. Scherzer, Factors influencing the ill-posedness of nonlinear problems, Inverse Problems, 10 (1994), pp. 1277–1297.

[71] , Local ill-posedness and source conditions of operator equations in Hilbert spaces, Inverse Problems, 14 (1998), pp. 1189–1206.

[72] C. Hofmann, B. Hofmann, and A. Pichler, Simultaneous identi- fication of volatility and interest rate functions—a two-parameter regu- larization approach, Electron. Trans. Numer. Anal., 51 (2019), pp. 99– 117.

[73] T. Hohage, Logarithmic convergence rates of the iteratively regular- ized Gauss-Newton method for an inverse potential and an inverse scat- tering problem, Inverse Problems, 13 (1997), p. 1279.

[74] S. Hubmer and R. Ramlau, Convergence analysis of a two-point gradient method for nonlinear ill-posed problems, Inverse Problems, 33 (2017), pp. 095004, 30.

[75] , Nesterov’s accelerated gradient method for nonlinear ill-posed problems with a locally convex residual functional, Inverse Problems, 34 (2018), pp. 095003, 30.

[76] S. Hubmer, E. Sherina, A. Neubauer, and O. Scherzer, Lam´e parameter estimation from static displacement field measurements in the framework of nonlinear inverse problems, SIAM Journal on Imaging Sciences, 11 (2018), pp. 1268–1293.

[77] S. Hubmer, S. Kindermann, K. Raik, and E. Sherina, A numerical comparison of some heuristic stopping rules for nonlinear Landweber iteration, (to appear).

[78] B. Jin and D. A. Lorenz, Heuristic parameter-choice rules for con- vex variational regularization based on error estimates, SIAM J. Numer. Anal., 48 (2010), pp. 1208–1229.

[79] Q. Jin, Hanke-Raus heuristic rule for variational regularization in Ba- nach spaces, Inverse Problems, 32 (2016), pp. 085008, 19.

[80] , On a heuristic stopping rule for the regularization of inverse prob- lems by the augmented Lagrangian method, Numer. Math., 136 (2017), pp. 973–992.

[81] B. Kaltenbacher, A. Neubauer, and O. Scherzer, Iterative regularization methods for nonlinear ill-posed problems, vol. 6 of Radon Series on Computational and Applied Mathematics, Walter de Gruyter GmbH & Co. KG, Berlin, 2008.

[82] B. Kaltenbacher, T. T. N. Nguyen, and O. Scherzer, The tangential cone condition for some coefficient identification model problems in parabolic PDEs, arXiv preprint arXiv:1908.01239, (2019).

[83] S. Kindermann, Convergence analysis of minimization-based noise level-free parameter choice rules for linear ill-posed problems, Electron. Trans. Numer. Anal., 38 (2011), pp. 233–257.

[84] , Discretization independent convergence rates for noise level-free parameter choice rules for the regularization of ill-conditioned problems, Electron. Trans. Numer. Anal, 40 (2013), pp. 58–81.

[85] , Convergence of the gradient method for ill-posed problems, Inverse Probl. Imaging, 11 (2017), pp. 703–720.

[86] S. Kindermann, L. D. Mutimbu, and E. Resmerita, A numerical study of heuristic parameter choice rules for total variation regulariza- tion, Journal of Inverse and Ill-Posed Problems, 22 (2014), pp. 63–94.

[87] S. Kindermann and A. Neubauer, On the convergence of the qua- sioptimality criterion for (iterated) Tikhonov regularization, Inverse Problems and Imaging, 2 (2008), pp. 291–299.

[88] S. Kindermann, S. Pereverzyev, Jr., and A. Pilipenko, The quasi-optimality criterion in the linear functional strategy, Inverse Problems, 34 (2018), pp. 075001, 24.

[89] S. Kindermann and K. Raik, Heuristic parameter choice rules for Tikhonov regularization with weakly bounded noise, Numerical Func- tional Analysis and Optimization, 40 (2019), pp. 1373–1394.

[90] , Convergence of heuristic parameter choice rules for convex Tikhonov regularization, SIAM Journal on Numerical Analysis, 58 (2020), pp. 1773–1800.

[91] , A simplified L-curve method as error estimator, Electron. Trans. Numer. Anal., 53 (2020), pp. 217–238.

[92] M. A. Krasnosel’skii, P. Zabreyko, E. I. Pustylnik, and P. Sobolevski, Integral operators in spaces of summable functions, Journal of Engineering Mathematics, 10 (1976), pp. 190–190.

[93] D. Krawczyk-Stańdo and M. Rudnicki, Regularization parameter selection in discrete ill-posed problems—the use of the U-curve, International Journal of Applied Mathematics and Computer Science, 17 (2007), pp. 157–164.

[94] D. Krawczyk-Stańdo and M. Rudnicki, The use of L-curve and U-curve in inverse electromagnetic modelling, in Intelligent Computer Techniques in Applied Electromagnetics, Springer, 2008, pp. 73–82.

[95] E. Kreyszig, Introductory functional analysis with applications, Wi- ley Classics Library, John Wiley & Sons, Inc., New York, 1989.

[96] Y. A. Kriksin, The choice of regularization parameter for the solution of a linear operator equation, Zh. Vychisl. Mat. i Mat. Fiz., 25 (1985), pp. 1092–1097, 1119–1120.

[97] L. Landweber, An iteration formula for Fredholm integral equations of the first kind, Amer. J. Math., 73 (1951), pp. 615–624.

[98] C. L. Lawson and R. J. Hanson, Solving least squares problems, Prentice-Hall, Inc., Englewood Cliffs, N.J., 1974. Prentice-Hall Series in Automatic Computation.

[99] A. S. Leonov, On the choice of a regularization parameter according to the criteria of quasioptimality and ratio for ill-posed problems of linear algebra with a perturbed ope, Dokl. Akad. Nauk SSSR, 262 (1982), pp. 1069–1072.

[100] O. V. Lepski˘ı, A problem of adaptive estimation in Gaussian white noise, Teoriya Veroyatnostei i ee Primeneniya, 35 (1990), pp. 459–470.

[101] D. A. Lorenz, P. Maass, and P. Q. Muoi, Gradient descent for Tikhonov functionals with sparsity constraints: theory and numerical comparison of step size rules, Electron. Trans. Numer. Anal., 39 (2012), pp. 437–463.

[102] A. K. Louis, Inverse und schlecht gestellte Probleme, Teubner Studi- enb¨ucher Mathematik. [Teubner Mathematical Textbooks], B. G. Teub- ner, Stuttgart, 1989.

[103] S. Lu, S. V. Pereverzev, Y. Shao, and U. Tautenhahn, On the generalized discrepancy principle for Tikhonov regularization in Hilbert scales, J. Integral Equations Appl., 22 (2010), pp. 483–517.

[104] S. Lu, S. V. Pereverzev, and U. Tautenhahn, Regularized total least squares: Computational aspects and error bounds, SIAM J. Matrix Anal. Appl., 31 (2009), pp. 918–941.

[105] M. A. Lukas, Asymptotic optimality of generalized cross-validation for choosing the regularization parameter, Numerische Mathematik, 66 (1993), pp. 41–66.

[106] , Robust generalized cross-validation for choosing the regularization parameter, Inverse Problems, 22 (2006), pp. 1883–1902.

[107] , Strong robust generalized cross-validation for choosing the regu- larization parameter, Inverse Problems, 24 (2008), pp. 034006, 16.

[108] P. Mathé, Saturation of regularization methods for linear ill-posed problems in Hilbert spaces, SIAM J. Numer. Anal., 42 (2004), pp. 968–973.

[109] , The Lepskiĭ principle revisited, Inverse Problems, 22 (2006), pp. L11–L15.

[110] P. Mathé and S. V. Pereverzev, Geometry of linear ill-posed problems in variable Hilbert scales, Inverse Problems, 19 (2003), p. 789.

[111] , The discretized discrepancy principle under general source conditions, Journal of Complexity, 22 (2006), pp. 371–381.

[112] P. Mathé and U. Tautenhahn, Regularization under general noise assumptions, Inverse Problems, 27 (2011), pp. 035016, 15.

[113] J.-J. Moreau, Proximité et dualité dans un espace hilbertien, Bull. Soc. Math. France, 93 (1965), pp. 273–299.

[114] V. A. Morozov, On the solution of functional equations by the method of regularization, Soviet Math. Dokl., 7 (1966), pp. 414–417.

[115] , Regularization methods for ill-posed problems, CRC Press, Boca Raton, FL, 1993. Translated from the 1987 Russian original.

[116] M. Z. Nashed, Generalized inverses, normal solvability, and iteration for singular operator equations, in Nonlinear Functional Anal. and Appl. (Proc. Advanced Sem., Math. Res. Center, Univ. of Wisconsin, Madison, Wis., 1970), Academic Press, New York, 1971, pp. 311–359.

[117] , Aspects of generalized inverses in analysis and regularization, in Generalized inverses and applications (Proc. Sem., Math. Res. Center, Univ. Wisconsin, Madison, Wis., 1973), 1976, pp. 193–244. Publ. Math. Res. Center Univ. Wisconsin, No. 32.

[118] F. Natterer, The mathematics of computerized tomography, vol. 32 of Classics in Applied Mathematics, Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 2001. Reprint of the 1986 original.

[119] V. Naumova and S. V. Pereverzyev, Multi-penalty regularization with a component-wise penalization, Inverse Problems, 29 (2013), pp. 075002, 15.

[120] V. Naumova, S. V. Pereverzyev, and S. Sivananthan, Extrapolation in variable RKHSs with application to the blood glucose reading, Inverse Problems, 27 (2011), pp. 075010, 13.

[121] A. Neubauer, On Landweber iteration for nonlinear ill-posed prob- lems in Hilbert scales, Numer. Math., 85 (2000), pp. 309–328.

[122] , Some generalizations for Landweber iteration for nonlinear ill-posed problems in Hilbert scales, Journal of Inverse and Ill-posed Problems, 24 (2016), pp. 393–406.

[123] , A New Gradient Method for Ill-Posed Problems, Numerical Functional Analysis and Optimization, 0 (2017), pp. 1–26.

[124] S. Osher, M. Burger, D. Goldfarb, J. Xu, and W. Yin, An iterative regularization method for total variation-based image restoration, Multiscale Modeling & Simulation, 4 (2005), pp. 460–489.

[125] F. O’Sullivan, A statistical perspective on ill-posed inverse problems, Statist. Sci., 1 (1986), pp. 502–527. With comments and a rejoinder by the author.

[126] R. Palm, Numerical comparison of regularization algorithms for solv- ing ill-posed problems, PhD thesis, Tartu University Press, 2010.

[127] E. Pantin, J.-L. Starck, F. Murtagh, and K. Egiazarian, Deconvolution and blind deconvolution in astronomy, Blind Image Deconvolution: Theory and Applications, (2007), pp. 277–317.

[128] D. L. Phillips, A technique for the numerical solution of certain integral equations of the first kind, Journal of the ACM (JACM), 9 (1962), pp. 84–97.

[129] M. J. D. Powell, A method for nonlinear constraints in minimization problems, in Optimization (Sympos., Univ. Keele, Keele, 1968), Academic Press, London, 1969, pp. 283–298.

[130] J. Radon, Über die Bestimmung von Funktionen durch ihre Integralwerte längs gewisser Mannigfaltigkeiten, Ber. Verh. Sächs. Akad. Wiss. Leipzig, Math.-Phys. Kl., 69 (1917).

[131] R. Ramlau, TIGRA—an iterative algorithm for regularizing nonlinear ill-posed problems, Inverse Problems, 19 (2003), pp. 433–465.

[132] R. Ramlau and E. Resmerita, Convergence rates for regularization with sparsity constraints, Electron. Trans. Numer. Anal., 37 (2010), pp. 87–104.

[133] T. Raus, An a posteriori choice of the regularization parameter in case of approximately given error bound of data, Tartu Riikl. Ül. Toimetised, (1990), pp. 73–87.

[134] T. Raus and U. Hämarik, On the quasi-optimal rules for the choice of the regularization parameter in case of a noisy operator, Advances in Computational Mathematics, 36 (2012), pp. 221–233.

[135] , Q-curve and area rules for choosing heuristic parameter in Tikhonov regularization, arXiv preprint arXiv:1809.02061, (2018).

[136] T. Regińska, A regularization parameter in discrete ill-posed problems, SIAM Journal on Scientific Computing, 17 (1996), pp. 740–749.

[137] R. T. Rockafellar, Level sets and continuity of conjugate convex functions, Trans. Amer. Math. Soc., 123 (1966), pp. 46–63.

[138] L. I. Rudin, S. Osher, and E. Fatemi, Nonlinear total variation based noise removal algorithms, vol. 60, 1992, pp. 259–268. Experimental mathematics: computational issues in nonlinear science (Los Alamos, NM, 1991).

[139] W. Rudin, Functional analysis, International Series in Pure and Applied Mathematics, McGraw-Hill, Inc., New York, second ed., 1991.

[140] O. Scherzer, M. Grasmair, H. Grossauer, M. Haltmeier, and F. Lenzen, Variational Methods in Imaging, Applied Mathematical Sciences, Springer New York, 2008.

[141] E. Schock, Arbitrarily slow convergence, uniform convergence and superconvergence of Galerkin-like methods, IMA J. Numer. Anal., 5 (1985), pp. 153–160.

[142] T. Schuster, B. Kaltenbacher, B. Hofmann, and K. S. Kazimierski, Regularization methods in Banach spaces, vol. 10 of Radon Series on Computational and Applied Mathematics, Walter de Gruyter GmbH & Co. KG, Berlin, 2012.

[143] U. Tautenhahn, Regularization of linear ill-posed problems with noisy right hand side and noisy operator, J. Inverse Ill-Posed Probl., 16 (2008), pp. 507–523.

[144] U. Tautenhahn and U. Hämarik, The use of monotonicity for choosing the regularization parameter in ill-posed problems, Inverse Problems, 15 (1999), p. 1487.

[145] A. N. Tikhonov, On the regularization of ill-posed problems, Dokl. Akad. Nauk SSSR, 153 (1963), pp. 49–52.

[146] , On the solution of ill-posed problems and the method of regular- ization, Dokl. Akad. Nauk SSSR, 151 (1963), pp. 501–504.

[147] A. N. Tikhonov and V. Y. Arsenin, Solutions of ill-posed problems, V. H. Winston & Sons, Washington, D.C.: John Wiley & Sons, New York-Toronto, Ont.-London, 1977. Translated from the Russian, Preface by translation editor Fritz John, Scripta Series in Mathematics.

[148] A. N. Tikhonov and V. B. Glasko, The approximate solution of Fredholm integral equations of the first kind, USSR Computational Mathematics and Mathematical Physics, 4 (1964), pp. 236–247.

[149] , Use of the regularization method in non-linear problems, USSR Computational Mathematics and Mathematical Physics, 5 (1965), pp. 93–107.

[150] G. M. Vaĭnikko and A. Y. Veretennikov, Iteratsionnye protsedury v nekorrektnykh zadachakh, “Nauka”, Moscow, 1986.

[151] C. R. Vogel, Non-convergence of the L-curve regularization parameter selection method, Inverse Problems, 12 (1996), p. 535.

[152] G. Wahba, Spline models for observational data, vol. 59 of CBMS- NSF Regional Conference Series in Applied Mathematics, Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 1990.

[153] W. Yin, Analysis and generalizations of the linearized Bregman method, SIAM Journal on Imaging Sciences, 3 (2010), pp. 856–877.

[154] Z. Zhang and Q. Jin, Heuristic rule for non-stationary iterated Tikhonov regularization in Banach spaces, Inverse Problems, 34 (2018), pp. 115002, 26.

Appendices

Appendix A

Functional Calculus

For the analysis of linear operators acting on Hilbert spaces, the Borel functional calculus for self-adjoint operators proves to be particularly useful, as does the functional calculus for compact operators, more commonly known as the singular value decomposition. We largely draw on material from [35]. Other useful references which detail the following construction are [95, 139]. Before we go into the Borel functional calculus, we first treat the less general, but simpler, functional calculus for compact operators. Let X and Y be Hilbert spaces. In particular, without going into the derivation, we note that we can define an eigensystem for a compact linear operator K : X → Y as {σn; vn, un}, where σn are the singular values of K (i.e., σn² are the eigenvalues of the self-adjoint compact operator K∗K) and vn are the corresponding orthonormal eigenvectors of K∗K, so that for all x ∈ dom K, we can write

\[
Kx = \sum_{n=1}^{\infty} \sigma_n \langle x, v_n\rangle u_n, \tag{A.1}
\]

where {un} are the orthonormal eigenvectors of KK∗ and may be defined by

\[
u_n := \frac{Kv_n}{\|Kv_n\|}.
\]

Moreover, {σn²; vn} defines a singular system for K∗K, so that
\[
K^*Kx = \sum_{n=1}^{\infty} \sigma_n^2 \langle x, v_n\rangle v_n, \tag{A.2}
\]
for all x ∈ dom(K∗K). For self-adjoint operators (which are not necessarily compact), we have the aforementioned Borel functional calculus. Before we go into more detail, note that (A.2) may be written as
\[
K^*Kx = \int_0^{\infty} \lambda \, dE_\lambda x, \tag{A.3}
\]

where λ are the eigenvalues and {Eλ} denotes the so-called spectral family of K∗K. Note that Eλ : X → Xλ is an orthogonal projector, where

\[
X_\lambda := \operatorname{span}\{v_n \mid \sigma_n^2 < \lambda,\ n \in \mathbb{N}\} \ \big(+\ \ker(K^*K) \text{ if } \lambda > 0\big).
\]

Now, for λ ≤ 0, one has that Eλ = 0. Moreover, as the eigenvectors {vn} span range(K∗K), it follows that

\[
X_\lambda = \operatorname{range}(K^*K) + \ker(K^*K) = X,
\]

for λ > σ1²; thus Eλ = I whenever λ > σ1². Much in line with [35], we also remark that for all λ1 ≤ λ2 and x ∈ X, one has that ⟨Eλ1 x, x⟩ ≤ ⟨Eλ2 x, x⟩, i.e., Eλ is “monotonically increasing” with respect to λ. Furthermore, Eλ is piecewise constant with “jumps” at λ = σn² (and at λ = 0 if and only if ker K = ker(K∗K) ≠ {0}) of “height”

\[
\sum_{\substack{n=1 \\ \sigma_n^2 = \lambda}}^{\infty} \langle \cdot, v_n\rangle v_n,
\]
that is, the orthogonal projector onto the span of all eigenvectors corresponding to the eigenvalue σn² (of multiplicity possibly greater than 1) is added. We may consequently define

\[
\int_0^{\|K\|^2+} f(\lambda)\, dE_\lambda x := \sum_{n \in \mathbb{N}} f(\sigma_n^2)\langle x, v_n\rangle v_n,
\]
\[
\int_0^{\|K\|^2+} f(\lambda)\, d\langle E_\lambda x_1, x_2\rangle := \sum_{n \in \mathbb{N}} f(\sigma_n^2)\langle x_1, v_n\rangle\langle x_2, v_n\rangle,
\]
\[
\int_0^{\|K\|^2+} f(\lambda)\, d\|E_\lambda x\|^2 := \sum_{n \in \mathbb{N}} f(\sigma_n^2)|\langle x, v_n\rangle|^2,
\]

for a (piecewise) continuous function f and x1, x2 ∈ X, where ∥K∥² + ε = σ1² + ε for any ε > 0, although in the integration bound, we opt to write “∥K∥²+”. Thence,
\[
\int_0^{\|K\|^2+} f(\lambda)\, dE_\lambda x = \sum_{n \in \mathbb{N}} f(\sigma_n^2)\langle x, v_n\rangle v_n = f(K^*K)x.
\]
In particular, in case f = id, the identity function, we may write

\[
K^*Kx = \int_0^{\|K\|^2+} \lambda \, dE_\lambda x.
\]
A particularly useful result, which we cite from [35], is the following observation.

Proposition 32. For all continuous functions f, one has that

\[
f(K^*K)K^* = K^*f(KK^*), \tag{A.4}
\]
where f(KK∗) is defined analogously to f(K∗K) with the spectral family, which we denote by {Fλ}, defined by

\[
F_\lambda y := \sum_{\substack{n=1 \\ \sigma_n^2 < \lambda}}^{\infty} \langle y, u_n\rangle u_n \ (+\, I - Q),
\]
with I − Q : Y → ker(KK∗) = range(KK∗)⊥ an orthogonal projection. Moreover, for the norm and inner product operations, we have
\[
\langle f(K^*K)x_1, x_2\rangle = \int f(\lambda)\, d\langle E_\lambda x_1, x_2\rangle, \tag{A.5}
\]
\[
\|f(K^*K)x\|^2 = \int f^2(\lambda)\, d\|E_\lambda x\|^2, \tag{A.6}
\]
for all x, x1, x2 ∈ X. Furthermore, we have the following useful bound:

\[
\|f(K^*K)\| \le \sup_{\lambda \in (0, \|K\|^2)} |f(\lambda)|. \tag{A.7}
\]

Note that (A.4) and (A.7) are also valid for any linear operator A : X → Y (as will subsequently become apparent). For further details on the well-definedness of the aforementioned integrals and also additional properties of the spectral family, we refer to [35]. It further follows from [35, Proposition 2.14, p. 46] that for any self-adjoint operator, say T : X → X, there exists a unique spectral family {Eλ} such that for all functions f ∈ M, where M denotes the space of all measurable functions with respect to the measure d∥Eλx∥² for all x ∈ X, we may define the operator
\[
f(T)x = \int f(\lambda)\, dE_\lambda x, \qquad x \in \operatorname{dom}(f(T)), \tag{A.8}
\]
where
\[
\operatorname{dom}(f(T)) := \left\{ x \in X \;\middle|\; \int f^2(\lambda)\, d\|E_\lambda x\|^2 < \infty \right\}. \tag{A.9}
\]

Additionally, from [35, Proposition 2.16, p. 46], the following holds for all f, g ∈ M:

1. For x1 ∈ dom f(T) and x2 ∈ dom g(T),
\[
\langle f(T)x_1, g(T)x_2\rangle = \int f(\lambda)g(\lambda)\, d\langle E_\lambda x_1, x_2\rangle.
\]

2. If x ∈ dom f(T), then f(T)x ∈ dom g(T) if and only if x ∈ dom((fg)(T)). Moreover, g(T)f(T)x = (gf)(T)x.

Finally, following on from the observation of [35, p. 46], we may replace K in the expressions (A.3), (A.5), and (A.6) with A, defined as before.
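To complement the abstract construction above, the following is a minimal NumPy sketch (not part of the thesis) that evaluates f(K∗K) for a matrix K via its singular value decomposition, in line with the spectral representation above. As an illustrative, hypothetical choice of filter function we take the Tikhonov-type f(λ) = 1/(λ + α), for which f(K∗K)K∗y coincides with the Tikhonov-regularised solution (K∗K + αI)⁻¹K∗y.

```python
import numpy as np

def f_of_KtK(K, f):
    """Evaluate f(K^*K) for a matrix K via the SVD: f(K^*K) = V f(Sigma^2) V^T."""
    # Full SVD so that f(0) also acts on ker(K), mirroring the spectral family.
    U, s, Vt = np.linalg.svd(K, full_matrices=True)
    sigma2 = np.zeros(K.shape[1])
    sigma2[: len(s)] = s ** 2   # eigenvalues sigma_n^2 of K^*K (zero on the kernel)
    return Vt.T @ np.diag(f(sigma2)) @ Vt

# Hypothetical example: Tikhonov filter f(lambda) = 1/(lambda + alpha).
rng = np.random.default_rng(0)
K = rng.standard_normal((20, 10))
y = rng.standard_normal(20)
alpha = 1e-2

x_filter = f_of_KtK(K, lambda lam: 1.0 / (lam + alpha)) @ (K.T @ y)
x_direct = np.linalg.solve(K.T @ K + alpha * np.eye(10), K.T @ y)
print(np.allclose(x_filter, x_direct))  # True
```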

Appendix B

Convex Analysis

For the analysis of Chapter 3 in particular, we require a different set of tools from that of the preceding appendix. In particular, convex analysis and subdifferential calculus are the tools of choice and perhaps necessity. We consequently provide this appendix as background for the unacquainted reader and present only the fundamental required definitions and results. Our general source of reference is [31, 32]. Note that in the Hilbert space setting, a useful reference is also [10]. In finite dimensions, one may moreover consider [11].

Firstly, the underlying basis of convex analysis is given by the definitions of convex sets and functionals. Let X henceforth be a Banach space, unless stated otherwise. We review these basic concepts courtesy of [32, 140]:

Definition 9. A set U ⊂ X is said to be convex if λx + (1 − λ)y ∈ U for all x, y ∈ U and λ ∈ (0, 1). On the other hand, a functional R : X → R ∪ {∞} is said to be convex if
\[
R(\lambda x + (1 - \lambda)y) \le \lambda R(x) + (1 - \lambda)R(y), \tag{B.1}
\]
for all x, y ∈ X and λ ∈ (0, 1). The functional R is said to be strictly convex if the inequality (B.1) is strict, i.e., holds with “<” whenever x ≠ y ∈ dom R.

Note that reversing the inequality in (B.1) defines a concave functional. Typical examples of convex functionals include the norms ∥·∥_{ℓ^p} : X → R ∪ {∞} with p ≥ 1. Another useful concept in the analysis of Chapter 3 in particular is that of lower semicontinuity, which is defined as follows [140]:

Definition 10. A functional R : X → R ∪ {∞} is said to be lower semicontinuous if
\[
R(x) \le \liminf_{k \to \infty} R(x_k), \tag{B.2}
\]
whenever xk → x.



Figure B.1: For a convex function R : R → R ∪ {∞}, the line connecting any two points R(x) and R(y) lies above or on the graph.

One may also define upper semi-continuity in much the same way by simply reversing the inequality in (B.2). The following concept, as will be uncovered, is closely related to convexity and, incidentally, also lower semicontinuity:

Definition 11. The epigraph of a functional R : X → R ∪ {∞} is defined as
\[
\operatorname{epi} R := \{(x, \lambda) \in X \times \mathbb{R} \mid R(x) \le \lambda\}.
\]
In particular, we have the following proposition [32, 140]:

Proposition 33. A functional R : X → R ∪ {∞} is convex if and only if its epigraph epi R is convex, and lower-semicontinuous if its epigraph is closed.

This is a particularly useful result as very often, it is simpler to prove the convexity of a functional’s epigraph than to prove the inequality (B.1) holds. Much of convex analysis stems from the generalisation of the concept of a derivative for convex functions:

Definition 12. The subdifferential of a functional R : X → R ∪ {∞} at x ∈ X is defined as the set-valued mapping
\[
\partial R : X \rightrightarrows X^*, \qquad x \mapsto \{\xi \in X^* \mid R(x') \ge R(x) + \langle \xi, x' - x\rangle_{X^* \times X} \text{ for all } x' \in X\},
\]
for all x ∈ X (cf. [11, 32]).



Figure B.2: For a concave function R : R → R ∪ {∞}, the line connecting any two points R(x) and R(y) lies below or on the graph.

Now we provide some standard examples of subdifferentials:

Example 3. We may compute ∂R(x) for various choices of R; a typical example is R = (1/q)∥·∥_{ℓ^q}^q with q ∈ (1, ∞), where one has
\[
\partial\left(\tfrac{1}{q}\|\cdot\|_{\ell^q}^q\right)(x) = \left\{\big(|x_n|^{q-1}\operatorname{sgn}(x_n)\big)_{n \in \mathbb{N}}\right\},
\]
with sgn the set-valued mapping

\[
\operatorname{sgn} : \mathbb{R} \rightrightarrows [-1, 1], \qquad x \mapsto \begin{cases} \{-1\}, & \text{if } x < 0, \\ [-1, 1], & \text{if } x = 0, \\ \{1\}, & \text{if } x > 0. \end{cases}
\]
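As a small finite-dimensional illustration (not from the thesis), the following NumPy sketch computes the unique subgradient of R = (1/q)∥·∥_{ℓ^q}^q for q ∈ (1, ∞) and numerically checks the defining inequality of Definition 12 at randomly drawn points x′.

```python
import numpy as np

def subgradient_lq(x, q):
    """Unique subgradient of R(x) = (1/q)*||x||_q^q for q in (1, inf):
    xi_n = |x_n|^(q-1) * sgn(x_n), componentwise."""
    return np.abs(x) ** (q - 1) * np.sign(x)

def R(x, q):
    return np.sum(np.abs(x) ** q) / q

# Numerical check of the subgradient inequality of Definition 12:
# R(x') >= R(x) + <xi, x' - x> for randomly drawn test points x'.
rng = np.random.default_rng(0)
q = 1.5
x = rng.standard_normal(5)
xi = subgradient_lq(x, q)
for _ in range(100):
    x_prime = rng.standard_normal(5)
    assert R(x_prime, q) >= R(x, q) + xi @ (x_prime - x) - 1e-10
print("subgradient inequality verified")
```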

For quantifying or measuring the error in convex regularisation and approximation theory, rather than using the norm, the so-called Bregman distance is often employed [21] and is defined as follows:

Definition 13. For a convex functional R : X → R ∪ {∞}, the Bregman distance Dξ2 : X × X → R ∪ {∞} from x1 to x2 with respect to ξ2 ∈ ∂R(x2) is given by

\[
D_{\xi_2}(x_1, x_2) := R(x_1) - R(x_2) - \langle \xi_2, x_1 - x_2\rangle_{X^* \times X},
\]
cf. [17].
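For intuition, here is a minimal sketch (again not from the thesis) computing D_{ξ₂}(x₁, x₂) in finite dimensions for a differentiable R, in which case ∂R(x₂) = {∇R(x₂)}; for the quadratic R(x) = ½∥x∥², the Bregman distance reduces to ½∥x₁ − x₂∥².

```python
import numpy as np

def bregman_distance(R, grad_R, x1, x2):
    """Bregman distance D_{xi2}(x1, x2) with xi2 = grad_R(x2),
    assuming R is differentiable at x2 (so the subgradient is unique)."""
    xi2 = grad_R(x2)
    return R(x1) - R(x2) - xi2 @ (x1 - x2)

# For R(x) = 0.5*||x||^2 the Bregman distance reduces to 0.5*||x1 - x2||^2.
R = lambda x: 0.5 * np.dot(x, x)
grad_R = lambda x: x
x1 = np.array([1.0, 2.0, -1.0])
x2 = np.array([0.0, 1.0, 1.0])
d = bregman_distance(R, grad_R, x1, x2)
print(d, 0.5 * np.linalg.norm(x1 - x2) ** 2)  # both equal 3.0
```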




Figure B.3: In this plot, a convex function R : R → R ∪ {∞} is non-differentiable at the point x0. Its subdifferential at that point, ∂R(x0), is the set of tangent lines marked by the arrows.

Note that the Bregman distance is not really a distance (a.k.a. a metric) as it does not satisfy the triangle inequality; nor is it in general symmetric. We do, however, have the following useful so-called three point identity:

\[
D_{\xi_2}(x_1, x_2) = D_{\xi_3}(x_1, x_3) + D_{\xi_2}(x_3, x_2) + \langle \xi_3 - \xi_2, x_1 - x_3\rangle, \tag{B.3}
\]
where ξ2 and ξ3 are subgradients of a functional R : X → R ∪ {∞} at x2 and x3, respectively (cf. [24, Lemma 3.1]). One can also define:

Definition 14. The symmetric Bregman distance from x1 to x2 in X with respect to ξ1 ∈ ∂R(x1) and ξ2 ∈ ∂R(x2) is defined as

\[
D^{\mathrm{sym}}_{\xi_1,\xi_2}(x_1, x_2) := \langle \xi_1 - \xi_2, x_1 - x_2\rangle = D_{\xi_2}(x_1, x_2) + D_{\xi_1}(x_2, x_1). \tag{B.4}
\]
Clearly, as is hinted by the name, the symmetric Bregman distance, contrary to the usual Bregman distance, is symmetric. It is also trivial from the definition that the symmetric Bregman distance is greater than or equal to the standard Bregman distance. We will utilise the following important class of monotone operators extensively throughout our analysis [11]:

Definition 15. If X is a Hilbert space, then for a proper, convex and lower semi-continuous functional R : X → R ∪ {∞} and γ > 0, the proximal point

mapping with respect to R is defined as

\[
\operatorname{prox}_{\gamma R} : X \to X, \qquad x \mapsto \operatorname*{argmin}_{z \in X}\ \frac{1}{2\gamma}\|z - x\|_X^2 + R(z).
\]

In particular, we have the following resolvent representation:

\[
\operatorname{prox}_{\gamma R} = (I + \gamma\,\partial R)^{-1}. \tag{B.5}
\]
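For the standard example R = ∥·∥_{ℓ¹} (not treated explicitly here), the proximal mapping of γR is the well-known componentwise soft-thresholding operator. The sketch below (an illustration, not part of the thesis) also checks the resolvent characterisation (B.5), namely that (x − prox_{γR}(x))/γ is a subgradient of ∥·∥₁ at prox_{γR}(x).

```python
import numpy as np

def prox_l1(x, gamma):
    """Proximal mapping of gamma*||.||_1: componentwise soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

x = np.array([3.0, -0.2, 0.5, -2.0])
gamma = 1.0
p = prox_l1(x, gamma)
print(p)  # [ 2. -0.  0. -1.]

# Resolvent check (B.5): (x - p)/gamma must be a subgradient of ||.||_1 at p,
# i.e. it equals sign(p_n) wherever p_n != 0 and lies in [-1, 1] wherever p_n == 0.
residual = (x - p) / gamma
assert np.all(np.abs(residual) <= 1.0 + 1e-12)
assert np.allclose(residual[p != 0], np.sign(p[p != 0]))
```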

Definition 16. Let R : X → R∪{∞} be proper. Then the Fenchel conjugate (cf. [37]) of R is defined as

\[
R^* : X^* \to \mathbb{R} \cup \{\infty\}, \qquad x^* \mapsto \sup_{x \in X} \langle x^*, x\rangle_{X^* \times X} - R(x).
\]

Note also that R∗ is always closed and convex [32]. A typical example of a conjugate function: whenever R = (1/p)∥·∥_{ℓ^p}^p with p > 1, then R∗ = (1/p∗)∥·∥_{ℓ^{p∗}}^{p∗}, where 1/p + 1/p∗ = 1. The Fenchel conjugate also satisfies the following additional properties [32]:

\[
R^*(0) = -\inf_{x \in X} R(x),
\]
\[
R \le G \implies G^* \le R^*,
\]
\[
(\lambda R)^*(x^*) = \lambda R^*\!\left(\frac{x^*}{\lambda}\right) \quad (\lambda > 0),
\]
\[
(R + \gamma)^* = R^* - \gamma \quad (\gamma \in \mathbb{R}).
\]
What is the Fenchel conjugate of the Fenchel conjugate? That question is answered as follows [32]:

Definition 17. For R : X → R ∪ {∞}, we can define the biconjugate of R as

\[
R^{**} : X \to \mathbb{R} \cup \{\infty\}, \qquad x \mapsto \sup_{x^* \in X^*} \langle x, x^*\rangle - R^*(x^*).
\]

Proposition 34. If R is proper, convex, and lower semicontinuous, then R∗∗ = R.

Corollary 6. For every R : X → R ∪ {∞}, we have R∗∗∗ = R∗.

An important link between the Fenchel conjugate and the aforementioned proximal point mapping is given by the following proposition, which allows us to represent a vector as the sum of two proximal mappings [113]:

Proposition 35. Let R : X → R ∪ {∞} be proper, convex, and lower semicontinuous, and let γ > 0. Then we have Moreau’s decomposition:

\[
x = \operatorname{prox}_{R}(x) + \operatorname{prox}_{R^*}(x), \qquad x = \operatorname{prox}_{\gamma R}(x) + \gamma\,\operatorname{prox}_{\gamma^{-1} R^*}\!\left(\frac{x}{\gamma}\right),
\]

for all x ∈ X, where

\[
\operatorname{prox}_{R^*} : X \to X, \qquad x \mapsto \operatorname*{argmin}_{z \in X} \left( \frac{1}{2}\|z - x\|^2 + R^*(z) \right).
\]
The above proposition is most useful when one would like to compute the proximal mapping with respect to the Fenchel conjugate, for instance. Rewriting Moreau’s decomposition above, one may subsequently write

\[
\operatorname{prox}_{R^*}(x) = (I - \operatorname{prox}_{R})(x), \qquad \operatorname{prox}_{\gamma R^*}(x) = x - \gamma\,\operatorname{prox}_{\gamma^{-1} R}\!\left(\frac{x}{\gamma}\right), \tag{B.6}
\]
for all x ∈ X and γ > 0. We will make use of the firm non-expansivity of the proximal mapping operator:

\[
\langle \operatorname{prox}_{J}(y_1) - \operatorname{prox}_{J}(y_2),\, y_1 - y_2\rangle \ge \|\operatorname{prox}_{J}(y_1) - \operatorname{prox}_{J}(y_2)\|^2, \tag{B.7}
\]
for which we refer to [10, 11]. We have the following useful identity from [32, Prop. 5.7, p. 27]:

Proposition 36. Suppose there exists a point Ax̄ where R is continuous and finite. Then we have

\[
\partial(R \circ A)(x) = A^*\,\partial R(Ax),
\]
for all x ∈ X.
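To close this appendix with a concrete illustration (not from the thesis), the sketch below numerically verifies Moreau’s decomposition with γ = 1 and the firm non-expansivity (B.7) for R = ∥·∥_{ℓ¹}: its Fenchel conjugate is the indicator function of the unit ℓ∞ ball, whose proximal mapping is the projection (componentwise clipping) onto that ball.

```python
import numpy as np

def prox_l1(x, gamma):
    """Soft-thresholding: proximal mapping of gamma*||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

def prox_l1_conjugate(x):
    """Proximal mapping of the conjugate of ||.||_1, i.e. of the indicator
    of the unit l-infinity ball: the projection (clipping) onto that ball."""
    return np.clip(x, -1.0, 1.0)

rng = np.random.default_rng(1)
x = rng.standard_normal(6)

# Moreau's decomposition with gamma = 1: x = prox_R(x) + prox_{R*}(x).
assert np.allclose(x, prox_l1(x, 1.0) + prox_l1_conjugate(x))

# Firm non-expansivity (B.7) of the proximal mapping.
y1, y2 = rng.standard_normal(6), rng.standard_normal(6)
p1, p2 = prox_l1(y1, 1.0), prox_l1(y2, 1.0)
assert (p1 - p2) @ (y1 - y2) >= np.linalg.norm(p1 - p2) ** 2 - 1e-12
print("Moreau decomposition and firm non-expansivity verified")
```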

Statutory Declaration

I hereby declare that the thesis submitted is my own unaided work, that I have not used other than the sources indicated, and that all direct and indirect sources are acknowledged as references. This printed thesis is identical with the electronic version submitted.

Linz, July 2020

———————————————— Kemal Raik

Curriculum Vitae

Name: Kemal Raik

Nationality: British

Date of Birth: 20 April 1991

Place of Birth: London, United Kingdom

Education: 1996–2002 Primary School Caldecot School, London Crawford School, London Kilmorie School, London

2002–2006 Secondary School Forest Hill School, London

2007–2010 Sixth Form Hillsyde Sixth Form, London St Joseph’s College, London

2010–2014 Bachelor’s in Mathematics, University of Aberdeen, Scotland Bachelor’s Thesis: Functional analysis

2015–2017 Master’s in Mathematical Sciences, Utrecht University, The Netherlands Master’s Thesis: Inverse Schrödinger scattering for seismic imaging

2017–2020 Doctorate in Engineering Sciences (Industrial Mathematics), Johannes Kepler University Linz, Austria Doctoral Thesis: Linear and nonlinear heuristic regularisation for ill-posed problems

Positions: 2017-2020 Research Assistant, Industrial Mathematics Institute, Johannes Kepler University Linz

Scientific Publications: Uno Hämarik, Urve Kangro, Stefan Kindermann, Kemal Raik. Semi-heuristic parameter choice rules for Tikhonov regularisation with operator perturbations. Journal of Inverse and Ill-posed Problems, Vol. 27 (1), p. 117-131, 2019.

Stefan Kindermann, Kemal Raik. Heuristic parameter choice rules for Tikhonov regularisation with weakly bounded noise. Numerical Functional Analysis and Optimization, Vol. 40 (12), p. 1373-1394, 2019.

Stefan Kindermann, Kemal Raik. A simplified L-curve method as error estimator. Electronic Transactions on Numerical Analysis, Vol. 53, p.217-238, 2020.

Stefan Kindermann, Kemal Raik. Convergence of heuristic parameter choice rules for convex Tikhonov regularization. SIAM Journal on Numerical Analysis, Vol. 58, p.1773-1800, 2020.

Simon Hubmer, Stefan Kindermann, Kemal Raik, Ekaterina Sherina. A numerical comparison of some heuristic stopping rules for nonlinear Landweber iteration. Forthcoming, 2020.

Scientific Talks at Conferences, Workshops and Universities: Oct–June 19th Internet Seminar: Infinite Dimensional Analysis, 2016 Casalmaggiore, Italy

Feb–Mar Würzburg Winter School, 2018 “Modern Methods in Nonsmooth Optimization”, Würzburg

Oct–June 21st Internet Seminar: Functional Calculus, 2018 Wuppertal, Germany

Sept Chemnitz Symposium on Inverse Problems, 2018 Chemnitz, Germany

June University of Kent Postgraduate Seminar, 2019 Canterbury, United Kingdom

July Applied Inverse Problems (AIP) Conference, 2019 Grenoble, France

July International Conference for Industrial and 2019 Applied Mathematics (ICIAM), Valencia, Spain

Oct Chemnitz Symposium on Inverse Problems: On Tour, 2019 Frankfurt, Germany

Special Interests: Powerlifting, Boxing, Football, Photography.
