Linear and Nonlinear Heuristic Regularisation for Ill-Posed Problems

Doctoral Thesis to obtain the academic degree of Doktor der technischen Wissenschaften in the Doctoral Program Technische Wissenschaften

Submitted by: Kemal Raik, MA MSc
Submitted at: Industrial Mathematics Institute
Supervisor and First Examiner: Priv.-Doz. DI Dr Stefan Kindermann
Second Examiner: Univ.-Prof. Dr Bernd Hofmann
July 2020

JOHANNES KEPLER UNIVERSITY LINZ
Altenbergerstraße 69, 4040 Linz, Österreich
www.jku.at
DVR 0093696
Abstract
In this thesis, we cover the so-called heuristic (aka error-free or data-driven) parameter choice rules for the regularisation of ill-posed problems (which just so happen to be prominent in the treatment of inverse problems). We consider the linear theory associated with both continuous regularisation methods, such as that of Tikhonov, and also iterative procedures, such as Landweber's method. We provide background material associated with each of the aforementioned regularisation methods, as well as the standard results found in the literature. In particular, the convergence theory for heuristic rules is typically based on a noise-restricted analysis. We also introduce some more recent developments in the linear theory for certain instances: in the case of operator perturbations or weakly bounded noise for linear Tikhonov regularisation. In both of the aforementioned cases, novel parameter choice rules were derived: for the case of weakly bounded noise, through necessity, and in the case of operator perturbations, an entirely new class of parameter choice rules is discussed (so-called semi-heuristic rules, which could be said to be the "middle ground" between heuristic rules and a-posteriori rules). We then delve further into the abyss of the relatively unknown; namely, the nonlinear theory (by which we mean that the regularisation is nonlinear), for which the development and analysis of heuristic rules are still in their infancy. Most notably in this thesis, we present a recent study of the convergence theory for heuristic Tikhonov regularisation with convex penalty terms, which attempts to generalise, to some extent, the restricted noise analysis of the linear theory. As the error in this setting is measured in terms of the Bregman distance, this naturally lends itself to the introduction of some novel parameter choice rules.
Finally, we illustrate and supplement most of the preceding by including a numerics section which displays the effectiveness of heuristic parameter choice rules, and conclude with a discussion of the results as well as speculation on the potential future scope of research in this exciting area of applied mathematics.
Zusammenfassung
In dieser Dissertation behandeln wir sogenannte heuristische Parameterwahlregeln (auch noisefreie oder datengesteuerte Parameterwahlregeln genannt) für die Regularisierung schlecht gestellter Probleme (welche bei der Behandlung von inversen Problemen eine herausragende Rolle spielen). Wir behandeln zunächst lineare inverse Probleme sowohl in Kombination mit kontinuierlichen Regularisierungsmethoden, wie zum Beispiel der Tichonov-Regularisierung, als auch mit iterativen, wie etwa dem Landweber-Verfahren. Wir liefern Hintergrundmaterial zu jeder dieser Regularisierungsmethoden als auch die dazugehörigen Standardresultate aus der Literatur, wie etwa die Konvergenztheorie für heuristische Parameterwahlregeln, die typischerweise auf einer Analysis mit Einschränkungen an den Datenfehler basieren. Außerdem stellen wir einige neuere Entwicklungen in der linearen Theorie für bestimmte Spezialfälle vor: im Fall von Operatorstörungen oder schwach beschränkten Datenfehlern für lineare Tichonovregularisierung. In beiden Fällen werden neuartige Parameterwahlen vorgestellt; im Fall von schwach beschränkten Datenfehlern aus Notwendigkeit, und im Fall von Operatorstörungen wird eine völlig neue Klasse von Parameterwahlregeln diskutiert (sogenannte semi-heuristische Regeln, die in gewisser Weise ein „Mittelweg" zwischen heuristischen und a-posteriori-Regeln sind). Anschließend tauchen wir weiter in den Abgrund des relativ Unbekannten ein, nämlich in die nichtlineare Theorie (d.h., wenn die Regularisierungsmethode nichtlinear ist), für die die Entwicklung und Analyse heuristischer Regeln noch im Kindheitsstadium sind. Bemerkenswert in dieser Arbeit ist, dass wir eine aktuelle Konvergenztheorie für heuristische Tikhonov-Regularisierung mit konvexem Strafterm entwickeln, die versucht, die Konvergenzanalyse mit Datenfehlerbeschränkungen der linearen Theorie bis zu einem gewissen Grad zu verallgemeinern.
Da der Fehler bei diesen Methoden üblicherweise in der Bregman-Distanz gemessen wird, bieten sich dementsprechend einige neuartige Regeln für die Parameterwahl an. Schließlich veranschaulichen und ergänzen wir die meisten der vorhergehenden Resultate in einem Abschnitt mit numerischen Experimenten, der die Wirksamkeit heuristischer Parameterwahlen illustriert, und wir schließen mit einer Diskussion der Ergebnisse sowie Spekulationen über mögliche zukünftige Forschungsinhalte in diesem spannenden Anwendungsbereich der Mathematik.
Acknowledgements
First and foremost, I would like to acknowledge and thank my supervisor, Dr Stefan Kindermann, who provided significant guidance over the course of my doctoral studies. The research topic on which this thesis is based arose from his proposal, which was granted funding from the Austrian Science Fund (FWF), to whom I also extend my thanks. Moreover, much of the contents of this thesis are based on research which was jointly conducted by myself and my supervisor. I would also like to thank Professor Bernd Hofmann for agreeing to be the second examiner for this thesis, and also for being the original organiser of the Chemnitz Symposium on Inverse Problems, in which I have twice had the pleasure of participating. I also owe a great deal of thanks to my family in London for their continued support whilst I have been in Linz, particularly my mother, who has visited me here on many occasions. Given the natural beauty of Austria, she did not need much convincing, however. Finally, I would like to thank my friends and colleagues; most notably, and in no particular order, Dr Simon Hubmer, Fabian Hinterer, Onkar Sandip Jadhav, Alexander Ploier and Dr Günter Auzinger, for their friendship and discussions, both academic and otherwise.
Kemal Raik Linz, July 2020
Contents
1 Introduction 7
  1.1 Examples 7
  1.2 Preliminaries 10
  1.3 Regularisation Methods 13
    1.3.1 Continuous Methods 14
    1.3.2 Iterative Methods 20
    1.3.3 Parameter Choice Rules 21
  1.4 Heuristic Parameter Choice Rules 23
I Theory 30
2 Linear Tikhonov Regularisation 31
  2.1 Classical Theory 31
    2.1.1 Heuristic Parameter Choice Rules 36
  2.2 Weakly Bounded Noise 52
    2.2.1 Modified Parameter Choice Rules 54
    2.2.2 Predictive Mean-Square Error 61
    2.2.3 Generalised Cross-Validation 65
  2.3 Operator Perturbations 69
    2.3.1 Semi-Heuristic Parameter Choice Rules 71
3 Convex Tikhonov Regularisation 78
  3.1 Classical Theory 78
  3.2 Parameter Choice Rules 86
    3.2.1 Convergence Analysis 89
    3.2.2 Convergence Rates (for the Heuristic Discrepancy Rule) 92
  3.3 Diagonal Operator Case Study 94
    3.3.1 Muckenhoupt Conditions 97
4 Iterative Regularisation 108
  4.1 Landweber Iteration for Linear Operators 108
    4.1.1 Heuristic Stopping Rules 112
  4.2 Landweber Iteration for Nonlinear Operators 122
    4.2.1 Heuristic Parameter Choice Rules 123
II Numerics 125
5 Semi-Heuristic Rules 127
  5.1 Gaußian Operator Noise Perturbation 128
    5.1.1 Tomography Operator Perturbed by Gaußian Operator 128
  5.2 Smooth Operator Perturbation 129
    5.2.1 Fredholm Integral Operator Perturbed by Heat Operator 129
    5.2.2 Blur Operator Perturbed by Tomography Operator 130
  5.3 Summary 131
6 Heuristic Rules for Convex Regularisation 136
  6.1 ℓ^1 Regularisation 137
  6.2 ℓ^{3/2} Regularisation 138
  6.3 ℓ^3 Regularisation 139
  6.4 TV Regularisation 139
  6.5 Summary 140
7 The Simple L-curve Rules for Linear and Convex Tikhonov Regularisation 144
  7.1 Linear Tikhonov Regularisation 145
    7.1.1 Diagonal Operator 145
    7.1.2 Examples from IR Tools 146
  7.2 Convex Tikhonov Regularisation 148
    7.2.1 ℓ^1 Regularisation 148
  7.3 ℓ^{3/2} Regularisation 149
  7.4 TV Regularisation 150
  7.5 Summary 151
8 Heuristic Rules for Nonlinear Landweber Iteration 152
  8.1 Test problems 154
    8.1.1 Nonlinear Hammerstein Operator 154
    8.1.2 Auto-Convolution 154
    8.1.3 Summary 157
III Future Scope 158
9 Future Scope 159
  9.1 Convex Heuristic Regularisation 159
  9.2 Heuristic Blind Kernel Deconvolution 159
    9.2.1 Deconvolution 160
    9.2.2 Semi-Blind Deconvolution 160
  9.3 Meta-Heuristics 162
A Functional Calculus 177
B Convex Analysis 181
Chapter 1
Introduction
Typically in the "real world", we have problems in which we would like to extract information from given data, e.g., acoustic sound waves and X-ray sinograms, among other examples. In particular, an acoustic sound wave recorded on the surface of the Earth contains information regarding the subsurface, and X-rays contain information on the density of the material which they pass through. In order to recover this information, one must, in effect, reverse the aforementioned processes, i.e., solve the inverse problem. In the theory of inverse problems, this is usually mathematically formalised in operator theoretic terms. That is, we generally consider an equation of the form

Ax = y, (1.1)

in which A : X → Y is a continuous operator mapping between two vector spaces, called the "forward operator". The objective is then to invert the forward operator and thus to recover the solution x from measured data y. Generally speaking, the data we measure is considered corrupted to reflect, for instance, real-world machine error, and what we consider in fact is a perturbation of the data, y^δ = y + e (where e may be very small), which we call noisy data; naturally, y is then called exact data. One should mention that the noise model, i.e., e, may be deterministic or stochastic, although in this thesis we will limit ourselves to the deterministic framework for ill-posed problems. On the topic of stochastic ill-posed problems, though, we opt to refer the reader to [15, 23]. Note that significant parts of this thesis are derived from the papers [51, 89–91], of which the author of this thesis was a coauthor.
1.1 Examples
Examples of inverse problems may be found in theory as well as a wide variety of applications, ranging from differentiation as a theoretically grounded example to tomography as an example found in application.
Differentiation Indeed, differentiation and integration may be seen as opposites of one another, and therefore we may define one as the direct (or forward) problem and the other as the inverse problem, respectively. In this way, we see that the definition of the inverse problem is rather arbitrary, as either may qualify. However, it is the norm to define differentiation as the inverse problem, and the reason for this is that, unlike integration, the differentiation problem is ill-posed; a concept which we will illustrate by example now and define somewhat more rigorously in the following section. We include the following example from [35]: for any f ∈ C^1[0,1], consider the perturbed function

f_n^δ(x) := f(x) + δ sin(nx/δ),

with δ ∈ (0,1) and n ∈ {2,3,...} arbitrary. In the language of inverse problems, the first term would be the "exact data" and the second term would be the "noise", and subsequently their sum is referred to as the "data with noise" or, in even more colloquial terms, the "noisy data". Now, differentiating the function above yields

(f_n^δ)'(x) = f'(x) + n cos(nx/δ).

Note that

‖f − f_n^δ‖_∞ = δ, (1.2)

whereas

‖f' − (f_n^δ)'‖_∞ = n. (1.3)

Or, to put it into words, arbitrarily small data errors (1.2) (e.g., δ < 1) may lead to arbitrarily large solution errors (1.3) (e.g., n → ∞). That is, there is a lack of continuous dependence between the data and the solution, which makes this problem, i.e., differentiation, ill-posed. This consequently leads one to approximate the ill-posed problem by a well-posed problem, i.e., to regularise (another term which we shall define later).
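The blow-up in (1.2)-(1.3) is easy to verify numerically. The following is an illustrative sketch (the grid, the choice f = sin, and the values of δ and n are ours, not from the thesis): the perturbation stays of size δ in the sup norm, while its derivative has sup norm exactly n.

```python
import numpy as np

# Ill-posedness of differentiation: the data perturbation delta*sin(n x/delta)
# is small, but its derivative n*cos(n x/delta) is large.
x = np.linspace(0.0, 1.0, 10_001)
f = np.sin(x)                 # any f in C^1[0,1] works; this choice is arbitrary
delta = 0.01

data_errs, deriv_errs = [], []
for n in (10, 100, 1000):
    f_pert = f + delta * np.sin(n * x / delta)
    data_errs.append(np.max(np.abs(f_pert - f)))                   # stays ~ delta
    deriv_errs.append(np.max(np.abs(n * np.cos(n * x / delta))))   # equals n
```

The data errors remain bounded by δ while the derivative errors grow without bound as n → ∞, exactly as (1.2) and (1.3) predict.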
Tomography In computerised tomography (CT), which is particularly prevalent in the medical field for a variety of applications, one seeks to reconstruct the density of a medium from X-ray measurements. This falls under the umbrella of non-destructive testing, in which one would like to understand the properties within a medium without causing any physical damage, hence the term non-destructive. In computerised tomography, the subject of interest is usually the human body or, more precisely, a body part. In particular, if we restrict ourselves to a two dimensional domain and let Ω ⊂ R² represent the (compact) cross-section of a human body, then the "aim of the game" is to recover the density, which we denote by a two-dimensional function f : Ω → R, from X-ray measurements in the plane where Ω lies.
In particular, the X-rays travel in straight lines which are parametrised by their normal vector θ ∈ R² (‖θ‖ = 1) and their distance s > 0 from the origin (cf. [35]). The forward operator which maps this is known as the Radon transform, and we can represent it by the following integral expression

(Rf)(s, θ) := ∫_R f(sθ + tθ^⊥) dt. (1.4)

For the derivation of (1.4), the reader is referred to [118]. Note that the Radon transform was named after the Austrian mathematician Johann Radon who, in fact, derived it on entirely theoretical grounds (cf. [130]). It is quite obvious then that R is the operator which models the forward problem. The inverse problem is therefore to invert the Radon transform and recover the density distribution f. This problem has been treated quite extensively and we refer the reader to [118], [102, Chapter 6] and the references therein. In fact, in R², an explicit inversion formula for (1.4) already exists thanks to Johann Radon and the aforementioned reference [130]. The point, and its relevance to the theory of ill-posed problems, is that the formula involves taking the derivative of the data and, as already shown in the previous example, differentiation is an ill-posed problem! Thus, by consequence, inversion of the Radon transform is also ill-posed.
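The line-integral structure of (1.4) can be mimicked discretely. The sketch below is our own toy discretisation (the Gaussian test density and all grid choices are arbitrary, not from the thesis): it approximates (Rf)(s, θ) by a Riemann sum along the line sθ + tθ^⊥. For a radially symmetric f, the result should not depend on θ.

```python
import numpy as np

def radon_value(f, s, theta, t_grid):
    """Riemann-sum approximation of (Rf)(s, theta) = int_R f(s*theta + t*theta_perp) dt."""
    theta_perp = np.array([-theta[1], theta[0]])   # rotate theta by 90 degrees
    pts = s * theta[None, :] + t_grid[:, None] * theta_perp[None, :]
    return np.sum(f(pts)) * (t_grid[1] - t_grid[0])

# radially symmetric test density: depends only on |x|^2
gauss = lambda p: np.exp(-np.sum(p**2, axis=1))
t_grid = np.linspace(-5.0, 5.0, 2001)
vals = [radon_value(gauss, 0.5, np.array([np.cos(a), np.sin(a)]), t_grid)
        for a in (0.0, 0.7, 1.3)]
# all three values coincide up to quadrature error, reflecting the rotation
# invariance of the Radon transform of a radial function
```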
Backwards Heat Equation This example was also taken from [35], and we refer the reader to the aforementioned reference for further details as we only give a relatively brief description of the problem. The “forward” heat equation is a well known one, and is also referred to as the diffusion equation as it mathematically models the diffusion of heat in a body or medium. The one-dimensional heat equation is usually written in the following way:
∂/∂t u(x,t) − ∂²/∂x² u(x,t) = 0,
u(x,0) = u₀, in Ω, (1.5)
u = 0, on ∂Ω × [0,T],

with an initial condition and a Dirichlet boundary condition, where Ω ⊂ R is the domain of the body/medium, with a constant temperature equal to 0 on its boundary. The forward operator which describes the forward problem would be the one which maps A : u(·,0) ↦ f, where
f(x) = u(x, T ), (x ∈ Ω) where T > 0 is the (final) time of measurement. This can be solved via, e.g., Fourier analysis. However, our concern is the inverse problem which would be to determine the initial temperature distri- bution u(x, 0) from data derived from measurements of the final temperature.
Note, however, that there is no solution for this inverse problem unless, that is, f is assumed to be analytic (cf. [25,35]). Restricting ourselves to f for which a solution exists still does not remedy all our problems, as this unique (cf. [35]) solution would still not depend continuously on the data. Therefore, two out of three requirements for a "well-posed" problem would be violated (namely, existence, uniqueness and continuous dependence of the solution on the data; see below). In order to see this, we follow the example of [35] by writing (1.5) as

Δφ_k + λ_k φ_k = 0, in Ω,
φ_k = 0, on ∂Ω, (1.6)

where {λ_k}_k and {φ_k}_k represent the eigenvalues and eigenfunctions for the Dirichlet problem (1.6) on Ω, respectively, with φ_k ∈ L²(Ω) normalised such that ‖φ_k‖_{L²(Ω)} = 1 for all k ∈ N. Letting

u_k(x,t) := (1/λ_k) φ_k(x) exp(λ_k(T − t)),

and plugging into (1.5), we see that

(Δu_k)(x,t) = (1/λ_k) (Δφ_k)(x) exp(λ_k(T − t)) = −φ_k(x) exp(λ_k(T − t)) = ∂/∂t u_k(x,t),

which confirms that u_k satisfies (1.5), with f_k = φ_k/λ_k. Now, since λ_k → ∞, we get that ‖f_k‖_{L²} → 0, whereas

‖u_k(·,0)‖_{L²} = exp(λ_k T)/λ_k → ∞,

as k → ∞. Therefore, considering f_k as perturbations of f = 0 with (data) error 1/λ_k (measured in the L² norm), the (solution) error of the inverse problem is amplified exponentially by the factor exp(λ_k T). Thus, in the "quantification" of ill-posed problems, namely, the so-called degree of ill-posedness (again, see below for details), the backwards heat equation is said to be severely (aka exponentially) ill-posed.
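A sine-basis toy computation makes the exponential amplification tangible. This is our own illustrative discretisation (taking Ω = (0, π), so that λ_k = k²; the mode count, T, and noise level are arbitrary choices): the forward heat map damps mode k by exp(−k²T), and naively inverting it amplifies data noise by exp(k²T).

```python
import numpy as np

# Forward heat problem in a sine basis on (0, pi): mode k of the initial value
# is damped by exp(-k^2 T). Inverting this diagonal map amplifies noise.
rng = np.random.default_rng(1)
K, T, delta = 30, 0.1, 1e-6
k = np.arange(1, K + 1)
c0 = 1.0 / (1.0 + k)**2                  # smooth initial condition (coefficients)
decay = np.exp(-k**2 * T)                # diagonal forward solution operator
data = c0 * decay                        # exact final-time coefficients
noisy = data + delta * rng.standard_normal(K)
recon_exact = data / decay               # inverting exact data: recovers c0
recon_noisy = noisy / decay              # naive inversion of noisy data: blows up
```

Even with noise of size 10⁻⁶, the reconstruction error is astronomically large, since the highest mode is amplified by exp(k²T) = exp(90).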
1.2 Preliminaries
Assume henceforth, until stated otherwise, that A ∈ L(X,Y) is a continuous linear operator between two Hilbert spaces. In case A⁻¹ does not exist, one seeks to construct a generalised inverse A† which recovers the best approximate solution, denoted by x† := A†y (cf. [116,117]).
Definition 1. The Moore-Penrose generalised inverse is defined as the unique linear extension of the operator Ã⁻¹ : range A → (ker A)^⊥ to the new domain dom A† := range A ⊕ (range A)^⊥, with (range A)^⊥ =: ker A†, where à is the restriction of A to the orthogonal complement of its kernel.
Figure 1.1: The Moore-Penrose generalised inverse, as we see in this illustration, maps the direct sum of the range of A and its complement to the complement of the kernel of A.
The Moore-Penrose generalised inverse A† : range A ⊕ (range A)^⊥ → (ker A)^⊥ thus allows a simple way to compute the best approximate solution, which is usually expressed in terms of the least-squares solution of (1.1) [35]:
Definition 2. A vector x ∈ X is called a least-squares solution of (1.1) if
kAx − yk = inf kAz − yk. z∈X
The best approximate solution of (1.1) may be defined as x ∈ X satisfying
kxk = inf {kzk | z is a least-squares solution of Ax = y} .
In particular, the following theorem is key in understanding the relationship between the least-squares solution of (1.1) and the generalised inverse [35]:
Theorem 1. Let y ∈ dom A†. Then x ∈ X is a least-squares solution of Ax = y if and only if the Gaußian normal equation A*Ax = A*y holds.
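Theorem 1 and Definition 2 can be checked numerically with NumPy, whose `lstsq` returns a least-squares solution and whose `pinv` implements the Moore-Penrose inverse. This is a sketch; the rank-deficient test matrix is an arbitrary choice of ours.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 5))
A[:, 4] = 0.0                                  # force a nontrivial kernel: rank 4
y = rng.standard_normal(8)

x_ls, *_ = np.linalg.lstsq(A, y, rcond=None)   # a least-squares solution
x_dag = np.linalg.pinv(A) @ y                  # best approximate solution A^dagger y

# both satisfy the Gaussian normal equation A* A x = A* y,
# and A^dagger y has minimal norm among all least-squares solutions
normal_ok = (np.allclose(A.T @ A @ x_ls, A.T @ y)
             and np.allclose(A.T @ A @ x_dag, A.T @ y))
```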
Courtesy of Theorem 1, for A*A continuously invertible, one could "naively" compute the best approximate solution as

x† = A†y = (A*A)⁻¹A*y = ∫₀^∞ (1/λ) dE_λ A*y, (1.7)
for y ∈ dom A†, where {E_λ} refers to the spectral family of the self-adjoint operator A*A, and we refer to Appendix A for further details and explanation. However, for ill-posed problems (to be defined), it turns out that (1.7) is not well defined as, in particular, 0 could be in the spectrum of A*A, which is precisely where the integrand in (1.7) has a pole [139]. On the other hand, what can be deduced from Theorem 1 is that A† = (A*A)†A* [35]. Another question that begs to be asked is whether A†y^δ would be a good approximation of A†y. In other words, if ‖y^δ − y‖ is small, does that imply that ‖A†y^δ − A†y‖ will remain small? The answer, it turns out, also depends on whether (1.1) is well-posed or not. There are essentially two well-known definitions of well-posedness, which are attributed to Hadamard and Nashed, respectively.
Definition 3. A problem (1.1) is said to be well-posed according to Hadamard [50] if all three of the following criteria are satisfied:
1. For all admissible data, a solution exists (i.e., range A = Y );
2. The solution is unique (i.e., ker A = {0});
3. The solution depends continuously on the data (i.e., A−1 ∈ L(Y,X)).
The working definition we opt to proceed with, however, is that of Nashed [116,117]:
Definition 4. The problem (1.1) is said to be well-posed according to Nashed if range A is closed, i.e., range A = \overline{range A}.
If any one of the criteria of Hadamard, or the criterion of Nashed, fails to be satisfied, then (1.1) is said to be ill-posed. The latter definition may be linked with the following theorem [139]:
Theorem 2 (Open Mapping Theorem). A† is continuous if and only if range A is closed, i.e., range A = \overline{range A}.
Whilst we do not present the proof here and instead refer the reader to [35], we illustrate it via a rather abstract and general example. For instance, if K is a compact operator (note that we usually opt to write K in place of A whenever referring to a compact operator), then its range is closed if and only if it is finite dimensional (cf. [33]). Thus, in case dim range K = ∞, (1.1) would automatically be ill-posed. That is, 0 would be an accumulation point of the spectrum, so that (1.7) would no longer be well defined. In particular, for a compact operator, with y_n^δ := y + δu_n, we have ‖y_n^δ − y‖ = δ, but

‖K†y_n^δ − K†y‖ = δ/σ_n → ∞,

as n → ∞. In particular, for a compact operator K, we can write

K†y = Σ_{i=1}^∞ (1/σ_i) ⟨y, u_i⟩ v_i, (1.8)

where {σ_i; v_i, u_i} is its singular system (cf. Appendix A), whenever
y ∈ dom K† ⟺ Σ_{i=1}^∞ (1/σ_i²) |⟨y, u_i⟩|² < ∞. (1.9)

Note that (1.9) is commonly referred to as the Picard condition (cf. [35]). Whilst it is only applicable in case the model operator is compact, therefore yielding a singular value decomposition, it provides insight as to when the data y is in the domain of the pseudo-inverse (i.e., is attainable). In particular, it says that the Fourier coefficients, i.e., ⟨y, u_n⟩, with respect to the singular functions u_n, should decay faster than the singular values σ_n. We may also accordingly quantify degrees of ill-posedness (cf. [67]):

Definition 5. Let K : X → Y be a compact linear operator between two Hilbert spaces. Then the equation Kx = y with singular value decomposition (A.1) is said to be mildly ill-posed if σ_i = O(i^{-β}) for some β > 0 and severely ill-posed if σ_i = O(e^{-i}).

The faster the singular values decay, the more difficult the inverse problem is to solve. Thus, a severely ill-posed problem tends to be more problematic than the mildly ill-posed case. The "irony" is that for heuristic rules (which will be defined later) to "work", the problem should in fact be sufficiently ill-posed (but not "too ill-posed"). This will be explained in further detail in the appropriate chapter. Note that in the example of differentiation, which we mentioned as a first example, the order of ill-posedness is given by σ_i = O(i^{-1}) (cf. [35]). There is further literature on the degree of ill-posedness, cf. [69], as well as extensions to nonlinear equations, cf. [70,71].
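Definition 5 can be probed numerically: discretising the integration operator (whose inversion is differentiation) and inspecting its singular values exhibits the O(1/i) decay, i.e., mild ill-posedness with β = 1. This is our own rectangle-rule discretisation; the grid size is arbitrary.

```python
import numpy as np

n = 200
h = 1.0 / n
K = h * np.tril(np.ones((n, n)))          # rectangle-rule discretisation of
                                          # (Kx)(t) = int_0^t x(s) ds on [0,1]
s = np.linalg.svd(K, compute_uv=False)    # singular values, in descending order
i = np.arange(1, n + 1)
scaled = s * i                            # bounded above and below iff sigma_i ~ 1/i
```

The products σ_i · i stay within fixed positive bounds, confirming the O(i⁻¹) decay quoted above for differentiation.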
1.3 Regularisation Methods
We have seen that the pseudo-inverse does not always yield an acceptable solution. For this reason, we would like to find a different way to compute an acceptable (approximate) solution. As the topic of this thesis is regularisation (cf. [35,49,115,145–149]), this is the method which we will explain below. In particular, it is possible to divide regularisation methods into continuous and iterative ones (although both are related, as will become apparent):
1.3.1 Continuous Methods

In case A† is unbounded, we seek to approximate it by a parametric family of continuous operators {R_α}_{α>0}, with R_α : Y → X, such that R_α → A† pointwise as α → 0 [35]. In particular, if we consider the integral of the form:
R_α y = g_α(A*A)A*y = ∫₀^∞ g_α(λ) dE_λ A*y, (1.10)

then we see that the "aim of the game" is to choose a filter function g_α such that

g_α(λ) → 1/λ, (λ > 0),

as α → 0, as it is apparent that this would then imply convergence of the regularisation operator to the generalised inverse (see (1.7)). In case the forward operator is compact, one may also recall that we can express (1.10) as

R_α y = Σ_{i=1}^∞ g_α(σ_i²) σ_i ⟨y, u_i⟩ v_i,

in which we would like g_α(σ_i²)σ_i → 1/σ_i as α → 0 (see (1.8)). We also introduce the residual filter function as
rα(λ) := 1 − λgα(λ), (1.11) which may be derived, for instance, by considering the error. In particular, the residual function is derived in [35] as follows:
x† − R_α y = x† − g_α(A*A)A*y = (I − g_α(A*A)A*A)x† = ∫₀^∞ (1 − λg_α(λ)) dE_λ x† = r_α(A*A)x†.

The following convergence proof is also courtesy of [35]:
Theorem 3. If g_α is piecewise continuous and there exists a positive constant C such that

|λg_α(λ)| ≤ C, (1.12)

and

lim_{α→0} g_α(λ) = 1/λ, (1.13)
for all λ ∈ (0, ‖A‖²), then

R_α y = g_α(A*A)A*y → x†,

as α → 0, for all y ∈ dom A†. Note that if y ∉ dom A†, then

‖R_α y‖ = ‖g_α(A*A)A*y‖ → ∞,

as α → 0.

Proof. The proof essentially boils down to exploiting Lebesgue's Dominated Convergence Theorem [36]. We may write

‖R_α y − x†‖² = ∫₀^{‖A‖²⁺} r_α(λ)² d‖E_λ x†‖²,

and due to (1.12), it follows that
|r_α(λ)| = |1 − λg_α(λ)| ≤ 1 + C,

for all λ ∈ (0, ‖A‖²). Therefore,

lim_{α→0} ∫₀^{‖A‖²⁺} r_α(λ)² d‖E_λ x†‖² = ∫₀^{‖A‖²⁺} lim_{α→0} r_α(λ)² d‖E_λ x†‖², (1.14)

is a consequence of the boundedness of the integrand, thereby allowing one to utilise the aforementioned dominated convergence theorem. Now, due to (1.13), it follows that
lim_{α→0} r_α(λ) = 1 − lim_{α→0} λg_α(λ) = 1 − 1 = 0,

for all λ > 0, and since r_α(0) = 1, we also have that

lim_{α→0} r_α(0) = 1.

Hence, without going into great detail, it follows that the integral in (1.14) is equal to the "jump" of λ ↦ ‖E_λ x†‖² at λ = 0, i.e., it is equal to lim_{λ→0⁺} ‖E_λ x†‖² − ‖E₀ x†‖² = ‖Px†‖², where P : X → ker A is an orthogonal projection. For further detail, we refer to [35]. In essence, since x† ∈ (ker A)^⊥, one has Px† = 0, which yields the desired result that

lim_{α→0} ‖R_α y − x†‖² = 0,

thus completing the convergence proof. For the divergence result, we assume, for the sake of deriving a contradiction, that there exists a sequence {α_n} with α_n → 0 such that {‖R_{α_n} y‖} is uniformly bounded, so that there exists a subsequence {R_{α_{n_k}} y} with R_{α_{n_k}} y ⇀ x ∈ X. Now, due to the weak sequential continuity of A, it follows that AR_{α_{n_k}} y ⇀ Ax. On the other hand, AR_{α_n} y → Qy, which implies Ax = Qy, where Q : Y → \overline{range A} is an orthogonal projection, i.e., we must have y ∈ dom A†; that is, no such bounded sequence can exist for y ∉ dom A†, and this completes the proof.
Next we provide some typical examples for the choice of the regularisation operator R_α:

Example 1. Sticking with the general form (1.10), Tikhonov regularisation (cf. [145,146]), sometimes also called Tikhonov-Phillips regularisation [128], amounts to choosing

g_α(λ) = 1/(λ + α),

(i.e., taking R_α = (A*A + αI)⁻¹A*, which is well defined as it is well known that the operator A*A + αI is invertible for α > 0). It is trivial to see that this tends to the desired integrand as α vanishes (and, equivalently, that (A*A + αI)⁻¹A* → (A*A)†A* as α → 0).
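A minimal numerical sketch of Example 1 (all choices, namely the operator, noise level and α, are ours for illustration): Tikhonov regularisation stabilises the inversion of a discretised integration operator, whereas the naive inverse amplifies the noise.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
h = 1.0 / n
A = h * np.tril(np.ones((n, n)))            # ill-conditioned discretised integration
x_true = np.sin(np.linspace(0.0, np.pi, n))
y = A @ x_true
delta = 1e-2
y_delta = y + delta * rng.standard_normal(n)

def tikhonov(A, rhs, alpha):
    """R_alpha rhs = (A*A + alpha I)^{-1} A* rhs, cf. Example 1."""
    return np.linalg.solve(A.T @ A + alpha * np.eye(A.shape[1]), A.T @ rhs)

x_naive = np.linalg.solve(A, y_delta)       # unregularised inversion
x_alpha = tikhonov(A, y_delta, alpha=1e-3)
err_naive = np.linalg.norm(x_naive - x_true)
err_alpha = np.linalg.norm(x_alpha - x_true)
# err_alpha is typically far below err_naive for a sensible alpha
```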
Example 2. An alternative is the spectral cutoff method (cf. [35,102]), which consists of choosing

g_α(λ) = χ_{(α,∞)}(λ) (1/λ),

i.e.,

R_α y = ∫_α^∞ (1/λ) dE_λ A*y,

where χ_{(α,∞)} is the characteristic function of the interval specified by the subscript; that is,

χ_{(α,∞)}(λ) := 1 if λ ∈ (α,∞), and 0 otherwise.

The intuition of this regularisation method is that we cut off the problematic eigenvalues congregating near zero. As before, it is clear to see once more that R_α y → A†y as α → 0.

Now, in the presence of noise, e.g., given data y^δ = y + e, convergence is not so straightforward to prove, as a quick glance at Figure 1.2 may suggest. In particular, one may observe in Figure 1.2 that for a given noise level δ > 0, the error tends to infinity as the parameter α tends to zero. In the presence of noise, however, convergence is proven with respect to δ (rather than α, with an α = α* such that 0 < α* < α_max < ∞, usually depending on δ). It is common to define
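In matrix form, Example 2 is the truncated SVD: since the spectral values λ of A*A are the squared singular values σ_i², the filter χ_(α,∞) keeps exactly the modes with σ_i² > α. A sketch (the random test matrix is an arbitrary choice of ours); with α = 0 the method reduces to the pseudo-inverse.

```python
import numpy as np

def spectral_cutoff(A, y, alpha):
    """Truncated-SVD realisation of g_alpha(lambda) = chi_(alpha,inf)(lambda)/lambda."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    keep = s**2 > alpha                       # modes surviving the cutoff
    return Vt[keep].T @ ((U.T @ y)[keep] / s[keep])

rng = np.random.default_rng(4)
A = rng.standard_normal((6, 4))
y = rng.standard_normal(6)
x_cut = spectral_cutoff(A, y, alpha=0.0)      # no cutoff
x_dag = np.linalg.pinv(A) @ y                 # Moore-Penrose solution
```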
x_α^δ = R_α y^δ and x_α = R_α y,

where the latter, x_α, is really an auxiliary quantity which we consider only in proofs, as it is not possible to calculate it in practice, since we tend to have the presence of noise in our data. x_α^δ, on the other hand, is the approximate solution which we actually calculate via regularisation. The approximation lies in a neighbourhood of the exact solution x†, the proximity to which is determined by α. The idea is to choose α as small as possible in order to obtain a
more accurate approximation whilst bearing stability of the computation in mind. In order to understand the relationship between the stability and approximation errors, it is insightful to estimate the total error as
‖x_α^δ − x†‖ ≤ ‖x_α^δ − x_α‖ + ‖x_α − x†‖, (1.15)

where the first and second terms correspond to the data propagation (a.k.a. stability) error and the approximation error, respectively. Subsequently,
‖x_α^δ − x_α‖ ≤ B(α), and ‖x_α − x†‖ ≤ V(α), (1.16)

where α ↦ B(α) and α ↦ V(α) are typically decreasing and increasing functions, respectively. Thus, one has that for small α, the data propagation error blows up, and for large α, one gets a larger approximation error.
Figure 1.2: In this plot, the total error, i.e., ‖x_α^δ − x†‖, is represented by the blue line, whilst the propagated data error (‖x_α^δ − x_α‖) and the approximation error (‖x_α − x†‖), represented by the red lines, are increasing for α → 0⁺ and α → ∞, respectively.
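The behaviour shown in Figure 1.2 can be reproduced numerically for Tikhonov regularisation. This is an illustrative sketch (operator, noise and the α-grid are our arbitrary choices): the propagated data error decreases in α while the approximation error increases, producing the familiar U-shaped total error.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 60
h = 1.0 / n
A = h * np.tril(np.ones((n, n)))             # ill-conditioned forward operator
x_true = np.linspace(0.0, 1.0, n)**2
y = A @ x_true
y_delta = y + 1e-3 * rng.standard_normal(n)

def tik(rhs, alpha):
    return np.linalg.solve(A.T @ A + alpha * np.eye(n), A.T @ rhs)

alphas = np.logspace(-8, -1, 8)
prop = [np.linalg.norm(tik(y_delta, a) - tik(y, a)) for a in alphas]   # ~ B(alpha)
approx = [np.linalg.norm(tik(y, a) - x_true) for a in alphas]          # ~ V(alpha)
# prop decreases monotonically in alpha, approx increases monotonically,
# so their sum (an upper bound for the total error) is U-shaped
```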
Source Conditions As well as proving convergence, as in Theorem 3, we are also interested in convergence rates (i.e., the “speed” of convergence). In particular, it is known that the approximate solution may converge arbitrarily slowly to the desired solution in the absence of certain conditions. We give the following proposition from [141]:
Proposition 1. Let range A be non-closed and let α = α(δ, y^δ) be a parameter choice rule. Then there does not exist any function f : (0,∞) → (0,∞) with lim_{δ→0} f(δ) = 0 for which

‖x_{α(δ,y^δ)}^δ − x†‖ ≤ f(δ),

holds for all y ∈ dom A† with ‖y‖ ≤ 1 and all δ > 0.

In other words, the proposition above states that one cannot obtain uniform convergence rates for ill-posed problems, i.e., the convergence may be arbitrarily slow. We do not give the proof of Proposition 1 here, but instead refer the reader to [35]. Thus, in order to overcome this issue, one usually enforces a so-called source condition on the solution, i.e.,
x† ∈ range(φ(A*A)) ⟺ x† = φ(A*A)ω, ‖ω‖ ≤ C, (1.17)

with an index function φ : R⁺ → R⁺ (i.e., φ is continuous, strictly monotonically increasing and satisfies φ(0) = 0, cf. [83]), i.e.,

∫₀^{‖A‖²⁺} (1/φ²(λ)) d‖E_λ x†‖² < ∞. (1.18)

Typical index functions include the so-called Hölder type ones:
φ(λ) = λ^µ, (µ > 0) (1.19)

where larger µ indicates higher degrees of smoothness (of the solution), and for this reason we may refer to it as the smoothness index. For problems in which the singular values decay very rapidly (i.e., for severely ill-posed problems as defined previously), (1.19) may be unnecessarily strong, and therefore a more appropriate index function may be

φ(λ) = (log(C/λ))^{-µ},

i.e., a logarithmic type source condition [73]. The following proposition from [110] shows that for x† satisfying (1.17), with r_α satisfying (1.20), we achieve the desired bound on the approximation error:

Proposition 2. Let x† satisfy a source condition (1.17) and suppose that there exists a positive constant C such that

r_α(λ)φ(λ) ≤ Cφ(α), (1.20)

for all α > 0 and λ ∈ (0, ‖A‖²). Then the approximation error may be estimated as

‖x_α − x†‖ ≤ Cφ(α), (1.21)

for all α ∈ (0, α_max).
Proof. Recalling that y = Ax†, we may write

‖x_α − x†‖² = ‖R_α Ax† − x†‖² = ∫₀^∞ |1 − λg_α(λ)|² d‖E_λ x†‖²
= ∫₀^∞ |r_α(λ)|² (φ²(λ)/φ²(λ)) d‖E_λ x†‖² ≤ C²φ²(α) ∫₀^∞ (1/φ²(λ)) d‖E_λ x†‖² ≤ C′φ²(α),

where we used (1.20) in the penultimate inequality, and the final inequality is a consequence of (1.18), thereby completing the proof.

Note that (1.20) typically does not hold for all φ; for instance, if we consider the µ-dependent ones (1.19), then (1.20) may only hold for µ in a finite interval (0, µ₀), where µ₀ would be the so-called qualification index, i.e., the maximum µ for which optimal convergence rates hold, i.e., where saturation occurs (to be exemplified in the following Chapter 2).
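To make Proposition 2 concrete, one can verify (1.20) by hand for the Tikhonov filter of Example 1 with the Hölder index function (1.19); this standard computation is our worked example, not taken verbatim from the thesis:

```latex
% Tikhonov: g_\alpha(\lambda) = (\lambda+\alpha)^{-1}, hence
% r_\alpha(\lambda) = 1 - \lambda g_\alpha(\lambda) = \alpha/(\lambda+\alpha).
% For \varphi(\lambda) = \lambda^\mu with 0 < \mu \le 1:
r_\alpha(\lambda)\,\varphi(\lambda)
  = \frac{\alpha\,\lambda^\mu}{\lambda+\alpha}
  = \alpha^\mu \cdot \frac{\alpha^{1-\mu}\,\lambda^\mu}{\lambda+\alpha}
  \le \alpha^\mu,
% since \alpha^{1-\mu}\lambda^\mu \le (1-\mu)\alpha + \mu\lambda \le \lambda+\alpha
% by the weighted arithmetic-geometric mean inequality. Thus (1.20) holds with
% C = 1, and (1.21) yields \|x_\alpha - x^\dagger\| = O(\alpha^\mu).
% For \mu > 1, however, \sup_\lambda r_\alpha(\lambda)\lambda^\mu behaves like
% \alpha rather than \alpha^\mu: the rate saturates, so Tikhonov regularisation
% has qualification index \mu_0 = 1.
```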
Iterating the Regularisation After regularising the problem (1.1), one may opt to regularise again and consequently define the so-called second regularisation iterate. In particular, given an "initial guess", which would be the (first) regularised solution $x_\alpha^\delta$, we can consider the equation
\[ A(z + x_\alpha^\delta) = y^\delta \implies Az = y^\delta - Ax_\alpha^\delta, \]
and thereby iterate the process of regularisation to yield a new vector, $z_\alpha^\delta := R_\alpha(y^\delta - Ax_\alpha^\delta)$. Thus, the second iterate may be defined as
\[ x_{\alpha,\delta}^{II} := z_\alpha^\delta + x_\alpha^\delta = R_\alpha(y^\delta - Ax_\alpha^\delta) + R_\alpha y^\delta = 2R_\alpha y^\delta - R_\alpha A R_\alpha y^\delta, \]
i.e.,
\[ x_{\alpha,\delta}^{II} = (2R_\alpha - R_\alpha A R_\alpha)y^\delta \tag{1.22} \]
is the second regularisation iterate and we adopt the notation $x_\alpha^{II}$ for the noise-free case. Note that it is possible to repeat this process $n$ times and we exemplify this with e.g., (2.9). It is also useful to define
\[ p_\alpha^\delta := y^\delta - Ax_\alpha^\delta \quad \text{and} \quad p_{\alpha,\delta}^{II} := y^\delta - Ax_{\alpha,\delta}^{II}, \tag{1.23} \]
as the residuals with respect to a regularisation and the second iterate of an iterated regularisation, respectively. Moreover, $p_\alpha$ and $p_\alpha^{II}$ denote their noise-free variants.

The aforementioned families of regularisation operators, however, are not sufficient for determining a stable approximation of $x^\dagger$. In particular, which $\alpha \in (0, \alpha_{\max})$ does one opt for? If we choose it too large, then the approximation error increases; but if we choose it too small, then we may not be able to compute it stably in the presence of data error. A parameter choice is needed, i.e., a way to select $\alpha = \alpha_*$ which will ensure a stable yet accurate computation of the regularised solution. To this end, we define a regularisation method as a pair $(R_{\alpha_*}, \alpha_*)$ such that $R_{\alpha_*} y^\delta \to A^\dagger y$ and $\alpha_* \to 0$ as $\delta \to 0$. We will expand on this in Section 1.3.3.
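The second-iterate construction (1.22) is straightforward to check numerically. The following sketch is purely illustrative: it uses Tikhonov's regularisation operator $R_\alpha = (A^*A + \alpha I)^{-1}A^*$ (introduced in Chapter 2) on a small random matrix, and verifies that regularising the residual equation and adding the result back reproduces the closed form $(2R_\alpha - R_\alpha A R_\alpha)y^\delta$.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 5))
y = rng.standard_normal(8)
alpha = 0.1

# Tikhonov regularisation operator R_alpha = (A^T A + alpha I)^{-1} A^T
R = np.linalg.solve(A.T @ A + alpha * np.eye(5), A.T)

x1 = R @ y                    # first regularised solution
z = R @ (y - A @ x1)          # regularise the residual equation A z = y - A x1
x2_steps = z + x1             # second iterate built in two steps
x2_formula = (2 * R - R @ A @ R) @ y   # closed form (1.22)
print(np.allclose(x2_steps, x2_formula))  # True
```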
1.3.2 Iterative Methods

If we recall the Gaußian normal equation $A^*Ax = A^*y$, then it is not hard to see that one can transform this into the following fixed point equation
\[ x = x + A^*(y - Ax), \]
which suggests the iteration
\[ x_k = x_{k-1} + A^*(y - Ax_{k-1}), \tag{1.24} \]
with $k \in \mathbb{N}$ and some initial guess $x_0$, to approximate $x^\dagger$. Note that (1.24) is known as Landweber iteration (cf. [97]) and will be explored further in Chapter 4. Iterative methods may be more generally expressed as
\[ x_k = x_{k-1} + G_k(x_{k-1}, y), \tag{1.25} \]
for different choices of $G_k$ (cf. [81] and the references therein). The iteration index $k$ acts as the regularisation parameter for iterative methods and the relationship with $\alpha$ is given by $k \sim \frac{1}{\alpha}$. Thus, if it were not already clear before, it should be clear now that the asymptotics are inverted in the sense that where $\alpha \to 0$ for continuous methods, the iterative regularisation should be such that $x_k \to x^\dagger$ as $k \to \infty$. Examples include Landweber iteration, already stated, and also Newton-type methods, among others.

In case (1.1) is ill-posed, then it is not true that $x_k^\delta \to x^\dagger$ as $k \to \infty$. Instead, one observes a phenomenon known as semi-convergence, in which $\|x_k^\delta - x^\dagger\| \to \infty$ as $k \to \infty$, but for a particular $k_*$ (which we call the stopping index, to be detailed in the following section), one has that $x_{k_*}^\delta \to x^\dagger$ as $\delta \to 0$. Note that we adapt the notation of (1.23) so that
\[ p_k := y - Ax_k, \quad \text{and} \quad p_k^\delta := y^\delta - Ax_k^\delta, \tag{1.26} \]
define the residual variables in this case. Moreover, it follows from (1.24) that the second iterates in this case correspond to
\[ x_{k,\delta}^{II} = x_{2k}^\delta, \quad \text{and} \quad p_{k,\delta}^{II} = p_{2k}^\delta, \tag{1.27} \]
which can easily be verified, i.e., iterating an iterative regularisation, e.g., Landweber iteration, is tantamount to doubling the iteration index.
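The doubling identity (1.27) can likewise be verified numerically. The sketch below is purely illustrative (a small random matrix, scaled so that $\|A\| \le 1$): it runs Landweber iteration and checks that the second iterate of the $k$-step regularisation coincides with $2k$ plain steps.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 5))
A /= np.linalg.norm(A, 2)            # scale so that ||A|| <= 1
y = rng.standard_normal(8)

def landweber(A, y, k):
    # k steps of x_j = x_{j-1} + A^T (y - A x_{j-1}), starting from x_0 = 0
    x = np.zeros(A.shape[1])
    for _ in range(k):
        x = x + A.T @ (y - A @ x)
    return x

k = 7
xk = landweber(A, y, k)
z = landweber(A, y - A @ xk, k)      # regularise the residual equation
x_second = z + xk                    # second iterate of the k-step regularisation
print(np.allclose(x_second, landweber(A, y, 2 * k)))  # True
```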
1.3.3 Parameter Choice Rules

The optimal parameter would be the one which minimises the total error:
\[ \alpha_{\mathrm{opt}} = \operatorname*{argmin}_{\alpha \in (0,\alpha_{\max})} \|x_\alpha^\delta - x^\dagger\|. \tag{1.28} \]
However, (1.28) is not implementable as $x^\dagger$ is unknown. Similarly, in case one regularises via an iterative method, one can also use a stopping rule which, much like a parameter choice rule, is a methodology for approximating the optimal stopping index, $k_{\mathrm{opt}}$ (which minimises the total error with respect to $k \in \mathbb{N}$). Therefore, we are left with the task of approximating $\alpha_{\mathrm{opt}}$ via an implementable parameter choice rule, the consequent parameter of which we shall denote henceforth by $\alpha_*$. In particular, such rules may be categorised as follows:

• a-priori: assumes knowledge of $\delta$. As the name indicates, such a parameter may be computed prior to measurement of the data. For instance, one may choose $\alpha_* = \mathcal{O}(\delta^{s(\mu)})$ or, equivalently (recalling $k \sim 1/\alpha$), $k_* = \mathcal{O}(\delta^{-s(\mu)})$, with an exponent $s(\mu)$ that depends on the smoothness index $\mu$ of the solution.
• a-posteriori: assumes knowledge of $\delta$ and $y^\delta$. A typical example would be Morozov's discrepancy principle (cf. [114,150]), in which one would compute the parameter as
\[ \alpha_* = \alpha(\delta, y^\delta) = \sup\{\alpha \in (0, \alpha_{\max}) \mid \|p_\alpha^\delta\| \le \tau\delta\}, \tag{1.29} \]
or equivalently, $k_* = \inf\{k \in \mathbb{N} \mid \|p_k^\delta\| \le \tau\delta\}$, with a constant $\tau \ge 1$.

• heuristic (a.k.a. error-free, data-driven) [45, 83, 84, 87]: only assumes knowledge of $y^\delta$, i.e., $\alpha_* = \alpha(y^\delta)$, or equivalently, $k_* = k(y^\delta)$. As this is the topic of the thesis, we proceed to expand on this particular way of selecting the parameter in the next section.

In a way, the most "disadvantaged" of the above classes of parameter choice rules are the a-priori rules, since they require knowledge of both the noise level $\delta$ and the smoothness index $\mu$. At first glance, the naive reader would assume the heuristic class of rules to be the natural choice, given that they are the least strenuous with regard to the information required to implement them. However, it will become apparent in the subsequent convergence theory (and to some degree in the numerics) that heuristic rules are not without their banes.

Note that in the stochastic case, i.e., where the noise $e$ is considered random, it is common to assume that it has a Gaußian distribution. As such, it is typical in that case to utilise statistical estimators in order to determine the noise level $\delta$ (under certain restrictive conditions) and then to combine it with one of the a-posteriori methods mentioned above (cf. [15,65]). In terms of heuristic rules, the most popular in the stochastic case are arguably the cross-validation and generalised cross-validation rules (cf. [106,107]). We will discuss the GCV rule in a deterministic framework in Section 2.2.3.

Figure 1.3: In this plot, the selected parameter $\alpha_{\mathrm{opt}}$ is the so-called oracle choice which minimises the sum of $B(\alpha)$ and $V(\alpha)$, the upper bounds for the data propagation and approximation errors, respectively. The class of heuristic parameter choice rules which we will focus on are based on minimising functionals which try to emulate the aforementioned oracle choice.

Definition 6. For a continuous linear operator $A : X \to Y$ between two Hilbert spaces, the family of operators $\{R_\alpha\}$, with $R_\alpha : Y \to X$ for each $\alpha \in (0, \alpha_{\max})$, is called a regularisation (for $A^\dagger$) if there exists a parameter choice rule $\alpha = \alpha_*$ such that
\[ \lim_{\delta \to 0} \sup_{\substack{y^\delta \in Y \\ \|y - y^\delta\| \le \delta}} \|R_{\alpha_*} y^\delta - A^\dagger y\| = 0, \tag{1.30} \]
for all $y \in \operatorname{dom} A^\dagger$, where $\alpha_*$ is also such that
\[ \lim_{\delta \to 0} \sup_{\substack{y^\delta \in Y \\ \|y - y^\delta\| \le \delta}} \alpha_* = 0. \tag{1.31} \]
For any $y \in \operatorname{dom} A^\dagger$, a pair $(R_\alpha, \alpha)$ is called a (convergent) regularisation method (for solving (1.1)) whenever (1.30) and (1.31) hold. Note that (1.30) is often referred to as worst case convergence as, in particular, the method must converge for the supremum over all possible noise.
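For concreteness, the discrepancy principle (1.29) can be sketched numerically as a grid search for the largest $\alpha$ whose residual does not exceed $\tau\delta$. The toy problem below is our own illustrative construction (a random orthogonal factorisation with polynomially decaying singular values and a solution of source type); it is a sketch, not a recipe.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 40
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
s = 1.0 / np.arange(1, n + 1) ** 2           # rapidly decaying singular values
A = U @ np.diag(s) @ V.T

x_true = V @ (s * rng.standard_normal(n))    # a "smooth" solution of source type
y = A @ x_true
delta = 1e-6
e = rng.standard_normal(n)
y_delta = y + delta * e / np.linalg.norm(e)  # noise of norm exactly delta

def tikhonov(A, y, alpha):
    m = A.shape[1]
    return np.linalg.solve(A.T @ A + alpha * np.eye(m), A.T @ y)

# discrepancy principle: largest alpha with ||y_delta - A x_alpha|| <= tau * delta;
# the residual is monotonically increasing in alpha, so scan from large to small
tau = 1.1
alphas = np.logspace(-16, 0, 200)[::-1]
alpha_star = next(a for a in alphas
                  if np.linalg.norm(y_delta - A @ tikhonov(A, y_delta, a)) <= tau * delta)
x_star = tikhonov(A, y_delta, alpha_star)
print(alpha_star, np.linalg.norm(x_star - x_true) / np.linalg.norm(x_true))
```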
1.4 Heuristic Parameter Choice Rules
In case one does not have knowledge of the noise level $\delta$, one may select the parameter depending on the measured data alone, i.e., $\alpha_* = \alpha(y^\delta)$. Generally speaking, such a parameter is typically selected as the minimiser of some convex functional $\psi : (0, \alpha_{\max}) \times Y \to \mathbb{R} \cup \{\infty\}$, i.e.,
\[ \alpha_* = \operatorname*{argmin}_{\alpha \in (0,\alpha_{\max})} \psi(\alpha, y^\delta), \tag{1.32} \]
where $\psi$ acts as a surrogate functional, i.e., it should serve as an estimator for the error (cf. Figure 1.4). Similarly, a heuristic stopping rule may be defined via
\[ k_* = \operatorname*{argmin}_{k \in [1,k_{\max}]} \psi(k, y^\delta), \tag{1.33} \]
where $\psi : \mathbb{N} \times Y \to \mathbb{R} \cup \{\infty\}$. In this thesis, whenever $A : X \to Y$ is a continuous linear operator between Hilbert spaces, the functionals we consider may be represented via their spectral form (cf. Appendix A):
\[ \psi(\alpha, y^\delta)^2 = \int_0^\infty \Phi_\alpha(\lambda) \, d\|F_\lambda y^\delta\|^2, \tag{1.34} \]
with $\Phi_\alpha : (0, \infty) \to \mathbb{R}$ a spectral filter function. The idea of the $\psi$-functionals, and consequently the associated heuristic rules, is that the aforementioned functional should act as a surrogate for the error functional (cf. Figure 1.4), i.e.,
\[ \psi(\alpha, y^\delta) \sim \|x_\alpha^\delta - x^\dagger\|. \]
Note that for a $\psi$-functional of the form (1.34), we may apply the triangle inequality to estimate the functional by two components:
\[ \psi(\alpha, y^\delta) \le \psi(\alpha, y - y^\delta) + \psi(\alpha, y), \tag{1.35} \]
with the first term denoting the "noisy" component and the second term being associated with the exact data. In particular, it is for this reason that (under certain conditions),
\[ \psi(\alpha, y - y^\delta) \sim \|x_\alpha^\delta - x_\alpha\|, \qquad \psi(\alpha, y) \sim \|x_\alpha - x^\dagger\|, \tag{1.36} \]
i.e., the functional acting on the noise should behave similarly to the data propagation error, and the functional acting on the exact data should analogously behave like the approximation error.

In order to prove convergence of a regularisation method (cf. Definition 6), one should be able to prove convergence with respect to the parameter choice for all possible noise, i.e., for the worst case scenario (cf. (1.30)). A pitfall for heuristic rules is the so-called Bakushinskii veto [4], which is the following:
Figure 1.4: In this plot, we illustrate the principle that the heuristic $\psi$-functional should somehow act as a surrogate for the error (represented by the black curve), so that minimising the $\psi$-functional, which only requires knowledge of the data, should yield a parameter which is as close as possible to the optimal one.
Proposition 3 (Bakushinskii's veto). Suppose (1.1) is ill-posed, i.e., $A^\dagger$ is not continuous, and let $\alpha = \alpha_*$. Then we have that
\[ \|x_{\alpha_*}^\delta - x^\dagger\| \to 0, \quad \text{as } \delta \to 0, \]
only if $\alpha_*$ depends on $\delta$.
Proof. Suppose that $\alpha_* = \alpha(y^\delta)$ does not depend on $\delta$. Then
\[ \lim_{\delta \to 0} \sup_{\substack{y^\delta \in Y \\ \|y^\delta - y\| \le \delta}} \|R_{\alpha(y^\delta)} y^\delta - A^\dagger y\| = 0 \tag{1.37} \]
would imply that $R_{\alpha(y)}y = A^\dagger y$ for all $y \in \operatorname{dom} A^\dagger$. Thus, if $\{y_n\}$ (with $y_n \in \operatorname{dom} A^\dagger$) is such that $y_n \to y \in \operatorname{dom} A^\dagger$, then this would in turn imply that $A^\dagger y_n = R_{\alpha(y_n)} y_n \to A^\dagger y$, i.e., that $A^\dagger$ is continuous [35].

Contrary to Bakushinskii's veto, however, heuristic parameter choice rules are often successful when implemented, usually providing better than satisfactory results (cf. [9]). A possible reason could be that Bakushinskii's veto considers the supremum over all possible noise, i.e., the worst possible noise, which in practical implementations is often not the encountered scenario. Therefore, Kindermann and Neubauer proposed a noise
restricted analysis in which one can prove convergence of the heuristic rules (cf. [83,87]), i.e., instead of attempting to prove (1.37), one proves
\[ \lim_{\delta \to 0} \sup_{\substack{y^\delta \in Y \\ y - y^\delta \in \mathcal{N}}} \|R_{\alpha_*} y^\delta - A^\dagger y\| = 0, \]
where $\mathcal{N} \subset Y$ is a restriction on the noise aimed to bypass the aforementioned veto of Bakushinskii. In particular, with this concept of convergence, Proposition 3 no longer applies.

Prior to the aforementioned solution, Glasko and Kriksin (cf. [45]) also defined
\[ \mathcal{N} := \left\{ e \in Y \mid \|x_\alpha^\delta - x_\alpha\| \le C\psi(\alpha, y - y^\delta) \ \text{for all } \alpha \in (0, \alpha_{\max}) \right\}, \tag{1.38} \]
the so-called auto-regularisation set, which postulates a bound on the data propagation error, in order to prove convergence with respect to the quasi-optimality rule (although, as stated, one may use this same condition to prove convergence in the linear theory with respect to all the mentioned heuristic rules). In particular, we provide a sketch below of how one would utilise this condition in order to prove convergence:
\[ \|x_{\alpha_*}^\delta - x^\dagger\| \le \|x_{\alpha_*}^\delta - x_{\alpha_*}\| + \|x_{\alpha_*} - x^\dagger\| \le C\psi(\alpha_*, y - y^\delta) + \varphi(\alpha_*), \]
where the first inequality is a mere application of the triangle inequality, and the second inequality follows from the auto-regularisation condition (1.38) above and the bound of the approximation error by the index function, (1.21). Then, due to the sub-additivity of the $\psi$-functionals, one obtains the bound
\[ \psi(\alpha_*, y - y^\delta) \le \psi(\alpha_*, y^\delta) + \psi(\alpha_*, y). \tag{1.39} \]
Additionally, the functionals satisfy $\psi(\alpha_*, y^\delta) \to 0$ as $\delta \to 0$ (cf. Proposition 4) and $\psi(\alpha, y) \le C\alpha$ for all $\alpha \in (0, \alpha_{\max})$. Hence, since it can (and will) be shown that $\alpha_* \to 0$ as $\delta \to 0$, it follows that $x_{\alpha_*}^\delta \to x^\dagger$ in the strong (i.e., norm) topology.

One may view (1.38) as quite an abstract condition, however, which does not provide any particular insight or qualitative information regarding the noise. Prior to defining some concrete examples of $\psi$-functionals, we first provide some associated "axioms".
In particular, from [83], we can assume that all $\psi$-functionals (in the linear Hilbert space setting) satisfy the following assumptions:
\[ \psi(\alpha, y^\delta) \ge 0, \ \text{for all } \alpha \in (0, \alpha_{\max}) \text{ and } y^\delta \in Y; \tag{A1} \]
\[ \psi(\alpha, y^\delta) = \psi(\alpha, -y^\delta), \ \text{for all } \alpha \in (0, \alpha_{\max}) \text{ and } y^\delta \in Y; \tag{A2} \]
\[ \psi(\alpha, \cdot) : Y \to \mathbb{R} \cup \{\infty\} \text{ is continuous for all } \alpha \in (0, \alpha_{\max}); \tag{A3} \]
\[ \psi(\cdot, y) : (0, \alpha_{\max}) \to \mathbb{R} \cup \{\infty\} \text{ is l.s.c. for all } y \in Y; \tag{A4} \]
\[ y \in \operatorname{dom} A^\dagger \text{ implies } \lim_{\alpha \to 0} \psi(\alpha, y) = 0. \tag{A5} \]
We may now give the following useful convergence result from [83]:
Proposition 4. Let $\alpha_*$ be selected by (1.32) and let $\psi$ satisfy (A1)-(A5). Then $\psi(\alpha_*, y^\delta) \to 0$ as $\delta \to 0$ for all $y^\delta \in Y$.
Proof. It is immediate that $\psi(\alpha_*, y^\delta) \le \psi(\alpha, y^\delta)$ for all $\alpha \in (0, \alpha_{\max})$. Additionally, by (A3), it follows that
\[ \limsup_{\delta \to 0} \psi(\alpha_*, y^\delta) \le \psi(\alpha, y), \quad \text{for all } \alpha \in (0, \alpha_{\max}). \]
In light of (A5), it follows that
\[ 0 \le \liminf_{\delta \to 0} \psi(\alpha_*, y^\delta) \le \limsup_{\delta \to 0} \psi(\alpha_*, y^\delta) \le 0, \]
hence yielding the desired result.
Definition 7. Let $g_\alpha$ be such that $r_\alpha \ge 0$. Then, for such a regularisation, we can define four "classical" heuristic parameter choice rules via the functionals (which also satisfy (A1)-(A5)):
\[ \psi_{\mathrm{HD}}(\alpha, y^\delta) := \frac{1}{\sqrt{\alpha}} \|p_\alpha^\delta\| = \frac{1}{\sqrt{\alpha}} \|r_\alpha(AA^*) y^\delta\|, \tag{HD} \]
\[ \psi_{\mathrm{HR}}(\alpha, y^\delta) := \frac{1}{\sqrt{\alpha}} \langle p_{\alpha,\delta}^{II}, p_\alpha^\delta \rangle^{\frac{1}{2}} = \frac{1}{\sqrt{\alpha}} \|r_\alpha(AA^*)^{\frac{3}{2}} y^\delta\|, \tag{HR} \]
\[ \psi_{\mathrm{L}}(\alpha, y^\delta) := \langle x_\alpha^\delta, x_{\alpha,\delta}^{II} - x_\alpha^\delta \rangle^{\frac{1}{2}} = \|A^* g_\alpha(AA^*) r_\alpha(AA^*)^{\frac{1}{2}} y^\delta\|, \tag{L} \]
\[ \psi_{\mathrm{QO}}(\alpha, y^\delta) := \|x_{\alpha,\delta}^{II} - x_\alpha^\delta\| = \|A^* g_\alpha(AA^*) r_\alpha(AA^*) y^\delta\|, \tag{QO} \]
which define the heuristic discrepancy (HD), Hanke-Raus (HR) (cf. [59]), simple-L (L) (cf. [91]) and quasi-optimality (QO) (cf. [45,87,145,146]) functionals, respectively.
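For linear Tikhonov regularisation (cf. Chapter 2), the two equalities in Definition 7 can be cross-checked numerically via the SVD, where $\lambda = \sigma^2$. The sketch below is our own illustration on a small random matrix; the tail term accounts for the part of $y^\delta$ outside $\operatorname{range}(A)$, on which $r_\alpha(AA^*)$ acts as the identity.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((12, 8))
y = rng.standard_normal(12)
alpha = 0.3
m, n = A.shape

R = np.linalg.solve(A.T @ A + alpha * np.eye(n), A.T)  # Tikhonov R_alpha
x1 = R @ y
x2 = R @ (y + (y - A @ x1))          # second iterate, cf. (1.22)
p1 = y - A @ x1                      # residual
p2 = y - A @ x2                      # residual of the second iterate

psi_HD = np.linalg.norm(p1) / np.sqrt(alpha)
psi_HR = np.sqrt(p2 @ p1) / np.sqrt(alpha)
psi_QO = np.linalg.norm(x2 - x1)

# spectral forms via the SVD, lambda = sigma^2
U, sig, Vt = np.linalg.svd(A)
lam = sig ** 2
c = U.T @ y                          # coefficients of y in the left singular basis
g = 1 / (lam + alpha)                # g_alpha(lambda)
r = alpha / (lam + alpha)            # r_alpha(lambda)
tail = np.linalg.norm(c[n:]) ** 2 if m > n else 0.0   # r_alpha(0) = 1 on range(A)^perp

psi_HD_spec = np.sqrt(np.sum(r**2 * c[:n]**2) + tail) / np.sqrt(alpha)
psi_QO_spec = np.sqrt(np.sum(lam * g**2 * r**2 * c[:n]**2))
print(np.isclose(psi_HD, psi_HD_spec), np.isclose(psi_QO, psi_QO_spec))
```

Note that $\langle p_{\alpha,\delta}^{II}, p_\alpha^\delta \rangle$ is indeed nonnegative here, since $r_\alpha \ge 0$ for Tikhonov regularisation.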
These would generally be considered the so-called "classical rules". Their stopping rule counterparts are defined analogously by replacing the $p_\alpha^\delta$, $p_{\alpha,\delta}^{II}$ terms above by $p_k^\delta$, $p_{2k}^\delta$, respectively, and similarly $x_\alpha^\delta$ and $x_{\alpha,\delta}^{II}$ are replaced by $x_k^\delta$ and $x_{2k}^\delta$, respectively, in the same fashion; similarly for the filter functions. In this thesis, however, we will generally focus slightly more on continuous regularisation methods. Moreover, one should note that the second equalities with the filter functions are only valid in the case of linear regularisation over Hilbert spaces. Thus, the expressions in the first equalities are the more general and can be used in a more expanded setting.

The heuristic discrepancy rule was originally proposed as a stopping rule for Landweber iteration in the paper [59], whilst the Hanke-Raus rule was suggested as a parameter choice rule for Tikhonov regularisation. However, one may just as well use the HD rule for continuous regularisation methods
such as Tikhonov regularisation and for this reason, it is sometimes rather confusingly also called the "Hanke-Raus" rule. The intuition is that since
\[ \psi_{\mathrm{HD}}(\alpha, y) = \frac{1}{\sqrt{\alpha}} \|r_\alpha(AA^*) A x^\dagger\| \sim \|r_\alpha(A^*A) x^\dagger\| = \|x_\alpha - x^\dagger\| \]
in most cases, it seems reasonable to expect that $\psi_{\mathrm{HD}}(\alpha, y^\delta) = \|r_\alpha(AA^*) y^\delta\|/\sqrt{\alpha}$ may be used for error estimation (cf. [35]).

The logic behind the quasi-optimality rule is that the minimum of $\psi_{\mathrm{QO}}$ will normally be attained near the "cross-over point" where the propagated data and approximation errors have roughly the same order of magnitude (cf. [35]). This incidentally turns out to be an effective parameter choice rule, as will be revealed in later sections.

Note that we may also "extend" the above to define a class of ratio rules by simply dividing the functionals above by $\|x_\alpha^\delta\|$, e.g.,
\[ \psi_{\mathrm{HDR}}(\alpha, y^\delta) := \frac{\|p_\alpha^\delta\|}{\sqrt{\alpha}\,\|x_\alpha^\delta\|}, \]
which defines the heuristic discrepancy ratio functional, and in similar fashion, we use the notation $\psi_{\mathrm{HRR}}$ and $\psi_{\mathrm{QOR}}$ to refer to the Hanke-Raus ratio and quasi-optimality ratio functionals [99]. The $\psi_{\mathrm{HDR}}$ functional defined above is also known as the Brezinski-Rodriguez-Seatzu rule [18]. In the case of stopping rules, the ratio rules are defined in an equivalent manner, replacing $\alpha$ with $1/k$, and clearly also $x_\alpha^\delta$ and $p_\alpha^\delta$ by $x_k^\delta$ and $p_k^\delta$, respectively. The rationale behind the ratio rules is that whilst the standard functionals approximate the total error, the ratio functionals similarly approximate the relative error:
\[ \mathrm{Err}_{\mathrm{rel}}(\alpha) = \frac{\|x_\alpha^\delta - x^\dagger\|}{\|x^\dagger\|}. \]
However, we may disappoint the reader by stating now that the ratio rules lie beyond the scope of this thesis.

A further class of heuristic rules includes those which combine pre-existing functionals, e.g.,
\[ \psi_{\mathrm{HME}}(\alpha, y^\delta) := \frac{1}{\sqrt{\alpha}} \frac{\langle p_{\alpha,\delta}^{II}, p_\alpha^\delta \rangle}{\|p_{\alpha,\delta}^{II}\|}, \]
which defines the functional for the heuristic monotone error rule [52,144]. One can observe that this is essentially a ratio of the Hanke-Raus and heuristic discrepancy functionals. This is also one of the "heuristifications" of certain a-posteriori parameter choice rules; in this case, the monotone error rule, which selects the parameter as
\[ \alpha_* = \sup\left\{ \alpha \in (0, \alpha_{\max}) \;\middle|\; \frac{\langle p_{\alpha,\delta}^{II}, p_\alpha^\delta \rangle}{\|p_{\alpha,\delta}^{II}\|} \le \tau\delta \right\}, \]
for a suitably chosen $\tau \ge 1$. Similarly, the HD rule is a heuristification of the discrepancy principle (1.29) mentioned earlier, and the Hanke-Raus rule is a heuristification of the so-called improved discrepancy principle [34,35,44]:
\[ \alpha_* = \sup\left\{ \alpha \in (0, \alpha_{\max}) \mid \langle p_{\alpha,\delta}^{II}, p_\alpha^\delta \rangle^{\frac{1}{2}} \le \tau\delta \right\}, \tag{1.40} \]
also known as the Raus-Gfrerer rule, which has the advantage of always being an optimal order method, contrary to the standard discrepancy principle. On the other hand, an a-posteriori rule seemingly based on a heuristic rule is the so-called Lepskii rule, a.k.a. the balancing principle [96,100,109]:
\[ \alpha_* = \sup\left\{ \alpha \in (0, \alpha_{\max}) \mid \|x_{\alpha,\delta}^{II} - x_\alpha^\delta\| \le \tau\delta \right\}, \]
which is arguably an "a-posteriori-fication" of the quasi-optimality rule. Another, in fact very popular, heuristic rule is the generalised cross-validation rule [125,152]. However, that rule is restricted to the finite dimensional case and we shall discuss it in Section 2.2.3.
Graph Based Heuristic Rules In addition to the $\psi$-based methods, alternative options for heuristic rules include choosing the parameter $\alpha_* = \alpha(y^\delta)$, again based on the data alone, via a graph based method. Arguably the most popular of these methods, and perhaps of all heuristic rules, is the L-curve rule [60]. Other methods include the U-curve [93], the V-curve [40] and the recently proposed Q-curve rules [135].

The L-curve method consists of plotting the graph $(\log\|p_\alpha^\delta\|, \log\|x_\alpha^\delta\|)$ which, in principle, should produce an "L" shaped curve, allowing one to subsequently select the parameter $\alpha_*$ as the corner of the "L", i.e., the point of maximum curvature; see Figure 1.5. The intuition behind the L-curve is that we would like to choose a parameter such that the residual is small, whilst also maintaining stability of the solution, i.e., $\|x_\alpha^\delta\|$ should not "blow up". Whilst the logic may seem sound, it has been observed that the rule may sometimes fail and cannot be subjected to the same analysis as the $\psi$-based methods, e.g., the noise restricted analysis induced by (2.17). Thus, it is difficult to garner an understanding of when the method works and when it does not. Recently, however, a $\psi$-functional rule based on the L-curve method was developed [91] for which a convergence analysis consistent with the other $\psi$-based methods is possible; see Section 2.1.1 for details.

We will not go into detail on the other methods, since the focus of this thesis is on the $\psi$-functional based methods, but the V-curve essentially chooses the parameter as the minimiser of the speed of the parameterisation of the L-curve. The Q-curve, on the other hand, consists of plotting the graph of $(\log\langle p_{\alpha,\delta}^{II}, p_\alpha^\delta \rangle, \log \psi_{\mathrm{QO}}(\alpha, y^\delta))$ and choosing the point of maximum curvature, similarly as we do for the aforementioned L-curve rule. Whilst the Q-curve also exhibits an "L" shaped curve, in spite of its name (which is derived from
its incorporation of the QO functional rather than the general shape of its plots), the U-curve, in principle, should actually exhibit a "U" shaped plot. We would plot $(\alpha, U(\alpha))$, where
\[ U(\alpha) := \frac{1}{\|p_\alpha^\delta\|} + \frac{1}{\|x_\alpha^\delta\|}, \]
and we would select the parameter $\alpha_*$ as the one which corresponds to the local maximum on the left-hand side of the graph. Indeed, the "U" shape arises from the observation that $U(\alpha) \to \infty$ whenever $\alpha \to 0$ and also as $\alpha \to \infty$ (cf. [93]).

Figure 1.5: In this figure, we see a typical example of how an L-curve plot should look. The idea is that the graph of $(\|Ax_\alpha^\delta - y^\delta\|, \|x_\alpha^\delta\|)$ should exhibit an "L" shape and the parameter is selected as the "corner point".
Part I
Theory
Chapter 2
Linear Tikhonov Regularisation
In this chapter, we provide a theoretical analysis of Tikhonov regularisation, which the observant reader will recall was introduced in Example 1 of Chapter 1. This form of regularisation was introduced by its namesake, Tikhonov (cf. [145,146]), in its variational form (see (2.4) below) in order to approximate the solution of ill-posed Fredholm integral equations. However, since its inception, Tikhonov regularisation has been thoroughly analysed via its spectral representation (see (2.1) below) for linear ill-posed problems (cf. [35] and the references therein). We therefore devote the first section of this chapter to an exposition of the aforementioned classical theory with relevance to the heuristic parameter choice rules. There are then a further two sections which expand the classical theory to particular deviations from the typical framework for ill-posed problems: when the noise is unbounded in the usual norm and when the operator is perturbed, respectively.

As mentioned, this method is sometimes also referred to as Tikhonov-Phillips regularisation, as Phillips [128] suggested it around the same time. Further references include [49,147] and those therein. A scan of the Table of Contents would reveal that the following chapter is also on Tikhonov regularisation, but that covers the relatively more recent application to so-called convex regularisation problems.
2.1 Classical Theory
Now we delve into the classical convergence theory for Tikhonov regularisation based on error estimates. The functional calculus for operators will feature heavily in the analysis and we therefore refer the unacquainted reader to Appendix A. Otherwise, we proceed.

Let $A : X \to Y$ be a continuous linear operator between two Hilbert spaces. The idea of Tikhonov regularisation is to shift the spectrum of $A^*A$ away from the singularity (at zero), i.e., we estimate (1.7) by
\[ x_\alpha^\delta = (A^*A + \alpha I)^{-1} A^* y^\delta, \tag{2.1} \]
with $\alpha \in (0, \alpha_{\max})$ (cf. [145,146]). In this case, we recall from Example 1 that the associated filter function is given by
\[ g_\alpha(\lambda) = \frac{1}{\lambda + \alpha}, \tag{2.2} \]
and the filter function for its associated residual is given by
\[ r_\alpha(\lambda) = \frac{\alpha}{\lambda + \alpha}. \tag{2.3} \]
The equation (2.1) may also be written in the variational form [35]:
Proposition 5. Let $x_\alpha^\delta$ be defined by (2.1). Then
\[ x_\alpha^\delta = \operatorname*{argmin}_{x \in X} \|Ax - y^\delta\|^2 + \alpha\|x\|^2, \tag{2.4} \]
for all $\alpha \in (0, \alpha_{\max})$ and $y^\delta \in Y$.
Proof. From the optimality condition, it is clear that the minimiser $x_\alpha^\delta$ of (2.4) satisfies
\[ 0 = A^*(Ax_\alpha^\delta - y^\delta) + \alpha x_\alpha^\delta = (A^*A + \alpha I)x_\alpha^\delta - A^*y^\delta \iff (A^*A + \alpha I)x_\alpha^\delta = A^*y^\delta, \]
from which the result follows trivially.

In the latter sections, we will see that the form (2.4) lends itself better to generalisation than the spectral formulation (2.1) (cf. Chapter 3). Now we are ready to prove the following error estimates, which will also be utilised for subsequent results [35]:

Proposition 6. There exist positive constants such that
\[ \|x_\alpha^\delta - x_\alpha\| \le C\frac{\delta}{\sqrt{\alpha}}, \qquad \|x_\alpha - x^\dagger\| = o(1), \ \text{as } \alpha \to 0, \]
\[ \|A(x_\alpha^\delta - x_\alpha)\| \le C\delta, \qquad \|Ax_\alpha - y\| \le C\sqrt{\alpha}, \]
for all $y, y^\delta \in Y$ and $\alpha \in (0, \alpha_{\max})$.

Proof. First we estimate the data propagation error
\[ \|x_\alpha^\delta - x_\alpha\|^2 = \|(A^*A + \alpha I)^{-1}A^*(y^\delta - y)\|^2 = \|A^*(AA^* + \alpha I)^{-1}(y^\delta - y)\|^2 = \|(AA^*)^{\frac{1}{2}}(AA^* + \alpha I)^{-1}(y^\delta - y)\|^2 = \int_0^\infty \frac{\lambda}{(\lambda + \alpha)^2} \, d\|F_\lambda(y^\delta - y)\|^2 \le \frac{C}{\alpha} \int_0^\infty d\|F_\lambda(y^\delta - y)\|^2, \]
since
\[ \frac{\lambda}{(\lambda + \alpha)^2} = \frac{\lambda}{\lambda + \alpha} \cdot \frac{1}{\lambda + \alpha} \le \frac{1}{\alpha}. \]
Thus, in this case,
\[ \|x_\alpha^\delta - x_\alpha\| \le B(\alpha) = C\frac{\delta}{\sqrt{\alpha}}. \]
The approximation error is estimated as
\[ \|x_\alpha - x^\dagger\|^2 = \|[(A^*A + \alpha I)^{-1}A^*A - I]x^\dagger\|^2 = \|\alpha(A^*A + \alpha I)^{-1}x^\dagger\|^2 = \int_0^\infty \frac{\alpha^2}{(\lambda + \alpha)^2} \, d\|E_\lambda x^\dagger\|^2. \]
In particular, we proceed similarly as in [35] and Theorem 3: observing that the integrand, i.e., the residual function, is bounded from above, we may apply the dominated convergence theorem identically as before to get
\[ \lim_{\alpha \to 0} \int_0^\infty \frac{\alpha^2}{(\lambda + \alpha)^2} \, d\|E_\lambda x^\dagger\|^2 = \int_0^\infty \lim_{\alpha \to 0} \frac{\alpha^2}{(\lambda + \alpha)^2} \, d\|E_\lambda x^\dagger\|^2, \]
and since the limit tends to zero for all $\lambda > 0$ and to $1$ for $\lambda = 0$, the result follows as in the proof of Theorem 3. Now, we estimate the data discrepancy as
\[ \|A(x_\alpha^\delta - x_\alpha)\|^2 = \|A(A^*A + \alpha I)^{-1}A^*(y^\delta - y)\|^2 = \|(AA^* + \alpha I)^{-1}AA^*(y^\delta - y)\|^2 = \int_0^\infty \frac{\lambda^2}{(\lambda + \alpha)^2} \, d\|F_\lambda(y^\delta - y)\|^2 \le C\delta^2. \]
Finally, we may bound the exact discrepancy in similar fashion:
\[ \|Ax_\alpha - y\|^2 = \|A(x_\alpha - x^\dagger)\|^2 = \|\alpha A(A^*A + \alpha I)^{-1}x^\dagger\|^2 = \int_0^\infty \frac{\alpha^2\lambda}{(\lambda + \alpha)^2} \, d\|E_\lambda x^\dagger\|^2 \le C\alpha, \]
thereby completing the proof.

The above estimates cater for a simple proof of the following convergence result for Tikhonov regularisation [35]:

Theorem 4. Let $\alpha = \alpha(\delta)$ be chosen such that $\alpha \to 0$ and $\delta/\sqrt{\alpha} \to 0$ as $\delta \to 0$. Then we have
\[ \|x_\alpha^\delta - x^\dagger\| \to 0, \quad \text{as } \delta \to 0. \]
Proof. From the estimates in the previous proposition, we can estimate the total error as
\[ \|x_\alpha^\delta - x^\dagger\| \le C\frac{\delta}{\sqrt{\alpha}} + o(1), \]
for all $\alpha \in (0, \alpha_{\max})$. Therefore, choosing $\alpha = \alpha(\delta)$ such that $\delta/\sqrt{\alpha} \to 0$ and $\alpha \to 0$ as $\delta \to 0$ yields convergence.

Due to Proposition 1, convergence in Theorem 4 may be arbitrarily slow. We remind the reader, however, that Proposition 1 (on arbitrarily slow convergence) is not just valid for Tikhonov regularisation, but for all regularisation methods. In any case, to prove convergence rates, we postulate the following Hölder-type source condition:
\[ x^\dagger \in \operatorname{range}((A^*A)^\mu), \quad \mu > 0, \tag{2.5} \]
i.e., there exists an $\omega$ such that $x^\dagger = (A^*A)^\mu\omega$. Thence, we obtain the following estimates [35]:
Proposition 7. Let (2.5) hold. Then there exist constants such that
\[ \|x_\alpha - x^\dagger\| \le C\alpha^\mu, \quad \text{for } \mu \in [0, 1), \tag{2.6} \]
\[ \|Ax_\alpha - y\| \le C\alpha^{\mu + \frac{1}{2}}, \quad \text{for } \mu \in \left[0, \tfrac{1}{2}\right), \tag{2.7} \]
for all $\alpha \in (0, \alpha_{\max})$ and $y \in Y$.

Proof. We have
\[ \|x_\alpha - x^\dagger\|^2 = \|\alpha(A^*A + \alpha I)^{-1}x^\dagger\|^2 = \|\alpha(A^*A + \alpha I)^{-1}(A^*A)^\mu\omega\|^2 = \int_0^\infty \frac{\alpha^2\lambda^{2\mu}}{(\lambda + \alpha)^2} \, d\|E_\lambda\omega\|^2 \le \sup_{\lambda \in (0,\infty)} \frac{\alpha^2\lambda^{2\mu}}{(\lambda + \alpha)^2} \, \|\omega\|^2 = C\alpha^{2\mu}, \]
for $\mu \in [0, 1)$, and
\[ \|Ax_\alpha - y\|^2 = \|\alpha A(A^*A + \alpha I)^{-1}(A^*A)^\mu\omega\|^2 = \int_0^\infty \frac{\alpha^2\lambda^{2\mu+1}}{(\lambda + \alpha)^2} \, d\|E_\lambda\omega\|^2 \le \sup_{\lambda \in (0,\infty)} \frac{\alpha^2\lambda^{2\mu+1}}{(\lambda + \alpha)^2} \, \|\omega\|^2 = C\alpha^{2\mu+1}, \]
for $\mu \in [0, \frac{1}{2})$, which yields the desired estimates.
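Before stating the rates result, it is instructive to balance the two bounds numerically. The sketch below is our own illustration with all constants set to one: it minimises the total-error bound $\delta/\sqrt{\alpha} + \alpha^\mu$ over a grid and recovers the scalings $\alpha_* \sim \delta^{2/(2\mu+1)}$ for the minimiser and $\delta^{2\mu/(2\mu+1)}$ for the minimum empirically.

```python
import numpy as np

def bound(alpha, delta, mu):
    # total-error bound: data-propagation term + approximation term (constants set to 1)
    return delta / np.sqrt(alpha) + alpha ** mu

mu = 0.5
alphas = np.logspace(-12, 0, 4000)
results = []
for delta in (1e-4, 1e-6):
    a_star = alphas[np.argmin(bound(alphas, delta, mu))]
    results.append((a_star, bound(a_star, delta, mu)))

# empirical exponents between the two noise levels
(a1, e1), (a2, e2) = results
exp_alpha = np.log(a2 / a1) / np.log(1e-6 / 1e-4)  # expected 2/(2 mu + 1) = 1 for mu = 1/2
exp_err = np.log(e2 / e1) / np.log(1e-6 / 1e-4)    # expected 2 mu/(2 mu + 1) = 1/2
print(exp_alpha, exp_err)
```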
In light of the above, we are now able to give the convergence rates result [35]:

Corollary 1. Let (2.5) hold with $\mu \in [0, 1)$. Then, choosing $\alpha = \alpha(\delta) = \alpha_{\mathrm{opt}} := \delta^{\frac{2}{2\mu+1}}$ yields
\[ \|x_\alpha^\delta - x^\dagger\| = \mathcal{O}\left(\delta^{\frac{2\mu}{2\mu+1}}\right), \tag{2.8} \]
as $\delta \to 0$.
Proof. Recall the estimates (1.15) and (1.16). Then it follows from the data propagation error estimate of Proposition 6 and (2.6) that computing
\[ \|x_\alpha^\delta - x^\dagger\| \le \inf_{\alpha \in (0,\alpha_{\max})} \left( C\frac{\delta}{\sqrt{\alpha}} + C\alpha^\mu \right) \]
yields the desired estimate, with $\alpha = \alpha_{\mathrm{opt}} = \delta^{\frac{2}{2\mu+1}}$.

The estimate (2.8) is known as the optimal order of convergence (cf. [35]). From the estimates above, we may observe that Tikhonov regularisation exhibits the well-known saturation effect, in which the convergence rates do not improve for $\mu \ge 1$; this is, in particular, why we do not assume a source condition (2.5) with $\mu \ge 1$. In general, the saturation effect describes the behaviour of regularisation methods for which (2.8) does not hold for all $\mu > 0$, but only up to some finite qualification index $\mu_0$. For Tikhonov regularisation, we observe that the qualification is $\mu_0 = 1$ (cf. [35]). Note that there also exist generalisations of the source condition above, cf. [110,111].

One can also define an iterative regularisation scheme, as in Chapter 1 and (1.22), with the Tikhonov regularisation operator, known as iterated Tikhonov regularisation (cf. [57]). This is given by the following expression:
\[ x_{\alpha,n}^\delta := (A^*A + \alpha I)^{-1}(A^*y^\delta + \alpha x_{\alpha,n-1}^\delta), \qquad x_{\alpha,0}^\delta := 0. \tag{2.9} \]
It is also possible to express (2.9) in its variational form:
\[ x_{\alpha,n}^\delta = \operatorname*{argmin}_{x \in X} \|Ax - y^\delta\|^2 + \alpha\|x - x_{\alpha,n-1}^\delta\|^2. \]
In this case, one may consider regularisation for $\alpha \to 0$ with a fixed $n \in \mathbb{N}$, and that is indeed usually the case. However, one may also fix an $\alpha \in (0, \alpha_{\max})$ and consider (2.9) as an iterative regularisation method for $n \to \infty$ [35]. The respective filter and residual functions may be written in the form:
\[ g_{\alpha,n}(\lambda) = \frac{(\lambda + \alpha)^n - \alpha^n}{\lambda(\lambda + \alpha)^n}, \qquad r_{\alpha,n}(\lambda) = \left(\frac{\alpha}{\lambda + \alpha}\right)^n, \]
which, we note, is a rather convenient form for the residual function. Of particular interest to us (for usage in several parameter choice rules) is the second Tikhonov iterate, which we denote by (cf. (1.22) in Chapter 1):
\[ x_{\alpha,\delta}^{II} = (A^*A + \alpha I)^{-1}(A^*y^\delta + \alpha x_\alpha^\delta), \tag{2.10} \]
which is simple to compute and follows from plugging $n = 2$ into (2.9). One may also expand the expression in (2.10) to get
\[ x_{\alpha,\delta}^{II} = (A^*A + \alpha I)^{-1}A^*[y^\delta + (y^\delta - Ax_\alpha^\delta)], \]
where we observe that the iterates are essentially "adding back the noise"; that is, we add $y^\delta - Ax_\alpha^\delta$ to the initial data $y^\delta$ (and then regularise). Note that the filter function associated with the second Tikhonov iterate is given by
\[ g_\alpha^{II}(\lambda) := \frac{\lambda + 2\alpha}{(\lambda + \alpha)^2}, \]
and
\[ r_\alpha^{II}(\lambda) := \left(\frac{\alpha}{\lambda + \alpha}\right)^2 \]
is then the respective filter function for the associated residual.
Remark. The qualification for the second Tikhonov iterate is $\mu_0 = 2$ and, in general, for iterated Tikhonov regularisation, (1.20) holds for Hölder-type source conditions à la (2.5) with qualification $\mu_0 = n$.
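The stated filter functions can be checked against the recursion (2.9) in spectral form. The sketch below is our own illustration using scalar spectral calculus (coefficient-wise, with $g_0 = 0$); it confirms the closed form of $g_{\alpha,n}$ and that the induced residual is $r_{\alpha,n}(\lambda) = (\alpha/(\lambda+\alpha))^n$.

```python
import numpy as np

def g_closed(lam, alpha, n):
    # claimed filter function of n-fold iterated Tikhonov regularisation
    return ((lam + alpha) ** n - alpha ** n) / (lam * (lam + alpha) ** n)

def g_recursion(lam, alpha, n):
    # filter obtained directly from the recursion (2.9) in spectral form:
    # g_k = (1 + alpha * g_{k-1}) / (lam + alpha), starting from g_0 = 0
    g = np.zeros_like(lam)
    for _ in range(n):
        g = (1 + alpha * g) / (lam + alpha)
    return g

lam = np.logspace(-6, 1, 50)
alpha = 0.2
for n in (1, 2, 5):
    assert np.allclose(g_closed(lam, alpha, n), g_recursion(lam, alpha, n))

# the induced residual filter 1 - lam * g matches (alpha/(lam+alpha))^n for n = 2
r2 = 1 - lam * g_recursion(lam, alpha, 2)
print(np.allclose(r2, (alpha / (lam + alpha)) ** 2))  # True
```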
2.1.1 Heuristic Parameter Choice Rules

For Tikhonov regularisation, we may consider the "classical" $\psi$-functional based heuristic parameter choice rules, defined in Definition 7 of Chapter 1, as the minimisers of (1.34), with
\[ \Phi_\alpha^{j,k}(\lambda) = \frac{\alpha^{k-j-1}\lambda^j}{(\lambda + \alpha)^k}. \tag{2.11} \]
In particular, from the definitions of $g_\alpha$ (2.2) and $r_\alpha$ (2.3) for Tikhonov regularisation, we have the following: if $k = 2$ and $j = 0$, then this defines the heuristic discrepancy rule; $k = 3$ and $j = 0$ defines the Hanke-Raus rule; $k = 4$ and $j = 1$ defines the quasi-optimality rule, and all of the aforementioned fall into the so-called R1 family of rules (cf. [133]). Another rule, which does not fall into the R1 family, is the simplified L-curve rule, which is defined by taking $k = 3$ and $j = 1$. In addition to the aforementioned R1-rules, there is the greater family of R*-rules [54]. We may also express this in tabular form for the benefit of the reader:
        k = 2    k = 3    k = 4
j = 0   HD       HR
j = 1            L        QO

Or more explicitly:
\[ \Phi_\alpha(\lambda) = \frac{\alpha}{(\lambda + \alpha)^2}, \tag{HD} \]
\[ \Phi_\alpha(\lambda) = \frac{\alpha^2}{(\lambda + \alpha)^3}, \tag{HR} \]
\[ \Phi_\alpha(\lambda) = \frac{\alpha\lambda}{(\lambda + \alpha)^3}, \tag{L} \]
\[ \Phi_\alpha(\lambda) = \frac{\alpha^2\lambda}{(\lambda + \alpha)^4}. \tag{QO} \]
Relatively recent results on the quasi-optimality rule may be found in [7,8]. Note that for Tikhonov regularisation, the simple-L functional may be written in the following form [91]:
\[ \psi_{\mathrm{L}}(\alpha, y^\delta) = \left( -\left\langle x_\alpha^\delta, \alpha\frac{\partial}{\partial\alpha} x_\alpha^\delta \right\rangle \right)^{\frac{1}{2}}. \tag{2.12} \]
Naturally, it follows that the respective ratio rule may be expressed as
\[ \psi_{\mathrm{LR}}(\alpha, y^\delta) = \frac{\left( -\left\langle x_\alpha^\delta, \alpha\frac{\partial}{\partial\alpha} x_\alpha^\delta \right\rangle \right)^{\frac{1}{2}}}{\|x_\alpha^\delta\|}, \tag{2.13} \]
although, as mentioned previously, we will generally focus more on the non-ratio rules. An a-posteriori version of this rule was also proposed in [91], namely,
\[ \alpha_* = \sup\left\{ \alpha \in (0, \alpha_{\max}) \;\middle|\; -\left\langle x_\alpha^\delta, \alpha\frac{\partial}{\partial\alpha} x_\alpha^\delta \right\rangle \le \tau\delta \right\}, \]
for an appropriately chosen $\tau \ge 1$, although this rule is yet to be investigated.
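The derivative form (2.12) is easy to sanity-check numerically for Tikhonov regularisation: differentiating (2.1) gives $\frac{\partial}{\partial\alpha} x_\alpha^\delta = -(A^*A + \alpha I)^{-2}A^*y^\delta$, so one extra linear solve yields the exact derivative. The sketch below is our own illustration on a small random matrix; it compares that derivative against a central finite difference and checks (2.12) against the spectral filter $\Phi_\alpha(\lambda) = \alpha\lambda/(\lambda+\alpha)^3$ of the simple-L rule.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((10, 6))
y = rng.standard_normal(10)
alpha = 0.25
n = A.shape[1]
M = A.T @ A + alpha * np.eye(n)

x = np.linalg.solve(M, A.T @ y)
dx_exact = -np.linalg.solve(M, x)   # d/d alpha x_alpha = -(A^T A + alpha I)^{-2} A^T y

h = 1e-6                            # central finite difference in alpha, for comparison
xp = np.linalg.solve(A.T @ A + (alpha + h) * np.eye(n), A.T @ y)
xm = np.linalg.solve(A.T @ A + (alpha - h) * np.eye(n), A.T @ y)
dx_fd = (xp - xm) / (2 * h)

psi_L = np.sqrt(-x @ (alpha * dx_exact))   # (2.12)

# spectral-filter form: psi_L^2 = sum_i alpha * lam_i / (lam_i + alpha)^3 * c_i^2
U, sig, Vt = np.linalg.svd(A)
lam, c = sig ** 2, (U.T @ y)[:n]
psi_L_spec = np.sqrt(np.sum(alpha * lam / (lam + alpha) ** 3 * c ** 2))

print(np.allclose(dx_exact, dx_fd, atol=1e-5), np.isclose(psi_L, psi_L_spec))
```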
The Simple-L Rules

Note that the simple-L and simple-L ratio rules are relatively new developments (cf. [91]) which drew inspiration from the more classical L-curve method (cf. [60,62,64,98]), in which one plots the graph of $(\log(\|Ax_\alpha^\delta - y^\delta\|^2), \log(\|x_\alpha^\delta\|^2))$ and selects the parameter as the maximiser of the curvature of the so-called L-graph, i.e., the curve
\[ \alpha \mapsto (\kappa(\alpha), \chi(\alpha)), \]
where $\kappa(\alpha) := \log(\|Ax_\alpha^\delta - y^\delta\|^2)$ and $\chi(\alpha) := \log(\|x_\alpha^\delta\|^2)$. In essence, this corresponds to selecting
α∗ = argmax θ(α), α∈(0,αmax) where θ : (0, αmax) → R ∪ {∞} denotes the signed curvature defined as [60] χ00(α)κ0(α) − χ0(α)κ00(α) θ(α) := 3 . (2.14) (χ0(α)2 + κ0(α)2) 2
In particular, for Tikhonov regularisation, (2.14) may be simplified to an expression devoid of second derivatives; namely, if we set
\[ \varrho(\alpha) := \|Ax_\alpha^\delta - y^\delta\|^2 \quad\text{and}\quad \upsilon(\alpha) := \|x_\alpha^\delta\|^2, \tag{2.15} \]
then, observing that we can write
\[ \upsilon'(\alpha) = \frac{\partial}{\partial\alpha}\|(A^*A+\alpha I)^{-1}A^*y^\delta\|^2 = \frac{\partial}{\partial\alpha}\int_0^\infty \frac{1}{(\lambda+\alpha)^2}\, d\|E_\lambda A^*y^\delta\|^2 = \int_0^\infty \frac{\partial}{\partial\alpha}\frac{\lambda}{(\lambda+\alpha)^2}\, d\|F_\lambda y^\delta\|^2 = -2\int_0^\infty \frac{\lambda}{(\lambda+\alpha)^3}\, d\|F_\lambda y^\delta\|^2, \]
and similarly,
\[ \varrho'(\alpha) = \int_0^\infty \frac{\partial}{\partial\alpha}\frac{\alpha^2}{(\lambda+\alpha)^2}\, d\|F_\lambda y^\delta\|^2 = 2\int_0^\infty \frac{\alpha\lambda}{(\lambda+\alpha)^3}\, d\|F_\lambda y^\delta\|^2, \]
we have the identity $\varrho'(\alpha) = -\alpha\upsilon'(\alpha)$. Moreover, since $\kappa' = \varrho'/\varrho$ and $\chi' = \upsilon'/\upsilon$, we may compute
\[ \kappa''(\alpha) = \frac{\partial}{\partial\alpha}\frac{\varrho'}{\varrho} = \frac{\varrho''\varrho - (\varrho')^2}{\varrho^2} \quad\text{and}\quad \chi''(\alpha) = \frac{\partial}{\partial\alpha}\frac{\upsilon'}{\upsilon} = \frac{\upsilon''\upsilon - (\upsilon')^2}{\upsilon^2}. \]
Furthermore,
\[ \varrho'' = \frac{\partial}{\partial\alpha}(-\alpha\upsilon') = -\upsilon' - \alpha\upsilon'', \]
and we may use all of the above to rewrite (2.14) as
\[ \theta(\alpha) = \frac{\upsilon(\alpha)\varrho(\alpha)}{|\upsilon'(\alpha)|}\cdot\frac{\varrho(\alpha)\upsilon(\alpha) + \alpha\upsilon'(\alpha)\varrho(\alpha) + \alpha^2\upsilon'(\alpha)\upsilon(\alpha)}{\left(\varrho(\alpha)^2 + \alpha^2\upsilon(\alpha)^2\right)^{3/2}}, \tag{2.16} \]
as was observed in [55,63,91,151]. The described L-curve method is what may be called a "graphical" method and cannot be analysed in the same manner as the $\psi$-based methods. Note that other related methods include
the so-called V-curve [40], the U-curve [93,94] and, more recently, the Q-curve [135] (see Section 1.4). Indeed, the V-curve selects the parameter similarly to the L-curve rule, by minimising the function
\[ \theta_V(\alpha) = \frac{\alpha\frac{\partial}{\partial\alpha}\kappa(\alpha)}{\alpha\frac{\partial}{\partial\alpha}\chi(\alpha)}. \]
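For Tikhonov regularisation, the derivative-free curvature expression (2.16) is straightforward to evaluate spectrally. The sketch below uses an assumed toy spectrum and data coefficients (illustrative only, not from the thesis) to compute $\varrho$, $\upsilon$, $\upsilon'$ and $\theta$, and also checks the identity $\varrho'(\alpha) = -\alpha\upsilon'(\alpha)$ numerically:

```python
import numpy as np

# Hedged sketch: the second-derivative-free curvature (2.16) for Tikhonov.
# lam are assumed eigenvalues of A A^*; coeff are assumed squared spectral
# data coefficients |<y^delta, u_i>|^2 (smooth part plus rough noise part).

rng = np.random.default_rng(1)
lam = 1.0 / np.arange(1, 201) ** 2          # polynomially decaying spectrum
coeff = lam + 1e-8 * rng.random(200)        # smooth data + small rough noise

def rho(a):        # residual functional rho(alpha) = ||A x_a^d - y^d||^2
    return np.sum(a**2 / (lam + a) ** 2 * coeff)

def ups(a):        # size functional upsilon(alpha) = ||x_a^d||^2
    return np.sum(lam / (lam + a) ** 2 * coeff)

def ups_prime(a):  # upsilon'(alpha), computed spectrally as above
    return -2.0 * np.sum(lam / (lam + a) ** 3 * coeff)

def theta(a):      # curvature via (2.16): no second derivatives needed
    r, u, up = rho(a), ups(a), ups_prime(a)
    return (u * r / abs(up)) * (r * u + a * up * r + a**2 * up * u) \
        / (r**2 + a**2 * u**2) ** 1.5

alphas = np.logspace(-8, 0, 100)
a_star = max(alphas, key=theta)  # L-curve rule: maximise the curvature
print(a_star)
```

The identity $\varrho' = -\alpha\upsilon'$ is exact here, since both sides equal $2\int \alpha\lambda/(\lambda+\alpha)^3\, d\|F_\lambda y^\delta\|^2$.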
Preceding the $\psi_L$ method was Regińska's rule (cf. [136]), which consists of minimising the functional
\[ \psi_R(\alpha, y^\delta) = \|x_\alpha^\delta\|^\tau\,\|Ax_\alpha^\delta - y^\delta\|, \qquad (\tau > 0). \]
Indeed, one can also find in [35, Proposition 4.37] the result that $\alpha_* = \operatorname{argmin} \psi_R(\alpha, y^\delta)$ if and only if $\alpha_* = \operatorname{argmax} \theta(\alpha)$. However, the choice of $\tau$ in $\psi_R$ is somewhat of a cumbersome dilemma. Moreover, it was also observed in [83] that Regińska's method is not subject to the same analysis as the other $\psi$-based methods (e.g. the HD, HR or QO rules), in the sense that $\psi_R$ is neither subadditive, nor can it satisfy a noise restriction à la (2.17); hence the motivation for the "new" L functionals, which are also motivated by the following proposition [91]:

Proposition 8. We have
\[ \theta(\alpha) = \frac{\upsilon(\alpha)}{\alpha|\upsilon'(\alpha)|}\,C_1(\zeta) - C_2(\zeta), \qquad \zeta := \frac{\varrho(\alpha)}{\alpha\upsilon(\alpha)}, \]
where $C_1, C_2$ are positive functions of $\zeta$ satisfying
\[ 0 \le C_1(\zeta) \le \frac{2}{3\sqrt{3}} \quad\text{and}\quad 0 \le C_2(\zeta) \le \frac{1}{\sqrt{2}}, \]
for all $\alpha \in (0,\alpha_{\max})$.

Proof. The expression (2.16) can easily be rewritten in the above form with
\[ C_1(\zeta) = \frac{\zeta^2}{(\zeta^2+1)^{3/2}}, \qquad C_2(\zeta) = \frac{\zeta(1+\zeta)}{(\zeta^2+1)^{3/2}}. \]
By elementary calculus, we may find the maxima of $C_1$ at $\zeta = \sqrt{2}$ and of $C_2$ at $\zeta = 1$, yielding the upper bounds.

Upon observation of the expression in Proposition 8, and the realisation that the function $\alpha \mapsto \upsilon(\alpha)/(\alpha|\upsilon'(\alpha)|)$ is the sole contributor to large values of $\theta$ (recalling that the L-curve method maximises $\theta$), it follows that one may in effect "simplify" the task of computing $\alpha_*$ by instead minimising the reciprocal of the aforementioned function: the direct result is the simple-L ratio rule. The simple-L rule is then derived from the observation that the previous one is a ratio rule.
Convergence Analysis

Arguably the most important part of this section is the convergence analysis of the aforementioned rules. For this, we first prove some preliminary bounds before finally providing theorems which give convergence rates for the total error with respect to each heuristic parameter choice rule.

First, we show that if the parameter $\alpha$ is selected according to a heuristic parameter choice rule, then it may be estimated from above by the corresponding $\psi$-functional:

Proposition 9. Let
\[ \alpha_* = \operatorname{argmin}_{\alpha\in(0,\alpha_{\max})} \psi(\alpha, y^\delta), \]
with $\psi \in \{\psi_{HD}, \psi_{HR}, \psi_{QO}, \psi_L\}$. Then
\[ \alpha_* \le \begin{cases} C\psi^2(\alpha_*, y^\delta), & \text{if } \psi = \psi_{HD} \text{ and } \exists\, C > 0 \text{ s.t. } \|y^\delta\| \ge C, \\ & \text{or } \psi = \psi_L \text{ and } \exists\, C > 0 \text{ s.t. } \|A^*y^\delta\| \ge C, \\ C\psi(\alpha_*, y^\delta), & \text{if } \psi = \psi_{HR} \text{ and } \exists\, C > 0 \text{ s.t. } \|y^\delta\| \ge C, \\ & \text{or } \psi = \psi_{QO} \text{ and } \exists\, C > 0 \text{ s.t. } \|A^*y^\delta\| \ge C, \end{cases} \]
for all $y^\delta \in Y$.

Proof. Note that we may write
\[ \psi(\alpha, y^\delta) = \begin{cases} \sqrt{\alpha}\,\|(AA^*+\alpha I)^{-1}y^\delta\|, & \text{if } \psi = \psi_{HD}, \\ \alpha\,\|(AA^*+\alpha I)^{-3/2}y^\delta\|, & \text{if } \psi = \psi_{HR}, \\ \alpha\,\|(A^*A+\alpha I)^{-2}A^*y^\delta\|, & \text{if } \psi = \psi_{QO}, \\ \sqrt{\alpha}\,\|(A^*A+\alpha I)^{-3/2}A^*y^\delta\|, & \text{if } \psi = \psi_L. \end{cases} \]
Now, notice that for $s \ge 0$, we have
\[ \|y^\delta\| = \|(AA^*+\alpha I)^s(AA^*+\alpha I)^{-s}y^\delta\| \le \|(AA^*+\alpha I)^s\|\,\|(AA^*+\alpha I)^{-s}y^\delta\|, \]
i.e.,
\[ \|(AA^*+\alpha I)^{-s}y^\delta\| \ge \frac{\|y^\delta\|}{\|(AA^*+\alpha I)^s\|} \ge \frac{C}{C_s}, \]
where $\|(AA^*+\alpha I)^s\| \le C_s$. The result then follows for the heuristic discrepancy and Hanke-Raus functionals with $s = 1$ and $s = 3/2$, respectively. The estimates for the quasi-optimality and simple-L functionals follow similarly, as
\[ \psi_{QO/L}(\alpha, y^\delta) \ge \alpha^t\,\frac{\|A^*y^\delta\|}{\|(A^*A+\alpha I)^s\|} \ge \alpha^t\,\frac{C}{C_s}, \qquad s \in \left\{\tfrac{3}{2}, 2\right\},\ t \in \left\{\tfrac{1}{2}, 1\right\}, \]
with $\|(A^*A+\alpha I)^s\| \le C_s$.
Corollary 2. Let $\alpha_*$ be selected as in Proposition 9. Then $\alpha_* \to 0$ as $\delta \to 0$.

Proof. It follows from Propositions 9 and 4 that
\[ \alpha_* \le C\psi^s(\alpha_*, y^\delta) \le C\psi^s(\alpha, y^\delta) \to 0, \qquad s \in \{1, 2\}, \]
as $\delta \to 0$.
Noise Restriction. Due to the ever ominous Bakushinskii veto (cf. Proposition 3), we would like to restrict the set of admissible noise in order to satisfy the auto-regularisation condition (1.38), thereby proving convergence of the method. In particular, whilst the former condition is quite abstract, a weaker and incidentally more insightful condition was given; namely the so-called Muckenhoupt condition, which requires that the noise belong to the following set:
\[ N_p := \left\{ e \in Y \;\middle|\; \alpha^p\int_\alpha^\infty \lambda^{-1}\, d\|F_\lambda e\|^2 \le C\int_0^\alpha \lambda^{p-1}\, d\|F_\lambda e\|^2 \quad \forall\, \alpha \in (0,\alpha_{\max}) \right\}. \tag{2.17} \]
Notice that for (2.17) to hold, there should be sufficiently many high frequency components (so that the right-hand side dominates). Therefore, this condition tells us that the noise should be sufficiently irregular; i.e., very smooth perturbations of the data do not satisfy the Muckenhoupt condition. The observant reader might recall that this is the "irony" we referred to in the description of degrees of ill-posedness in Chapter 1. This yields the following proposition (cf. [83,87]):
Proposition 10. Let
\[ e = y - y^\delta \in \begin{cases} N_1, & \text{if } \psi \in \{\psi_{HD}, \psi_{HR}\}, \\ N_2, & \text{if } \psi \in \{\psi_{QO}, \psi_L\}. \end{cases} \]
Then
\[ \|x_\alpha^\delta - x_\alpha\| \le C\psi(\alpha, y - y^\delta), \tag{2.18} \]
for all $\alpha \in (0,\alpha_{\max})$ and $y, y^\delta \in Y$.

Proof. Recall that the data propagation error may be represented in its spectral form by
\[ \|x_\alpha^\delta - x_\alpha\|^2 = \int_0^\infty \frac{\lambda}{(\lambda+\alpha)^2}\, d\|F_\lambda(y - y^\delta)\|^2. \tag{2.19} \]
The idea of the proof is to split the above integral into two parts, $\lambda \in (0,\alpha)$ and $\lambda \in (\alpha,\infty)$, which allows us to utilise the Muckenhoupt condition (2.17) above to achieve an appropriate bound
for the data propagation error from above, and subsequently to connect it with a bound from below for the $\psi$-functional acting on the noise. Since
\[ \int_\alpha^\infty \frac{\lambda}{(\lambda+\alpha)^2}\, d\|F_\lambda(y-y^\delta)\|^2 \le C\int_\alpha^\infty \lambda^{-1}\, d\|F_\lambda(y-y^\delta)\|^2, \tag{2.20} \]
and
\[ \int_0^\alpha \frac{\lambda}{(\lambda+\alpha)^2}\, d\|F_\lambda(y-y^\delta)\|^2 \le C\frac{1}{\alpha^2}\int_0^\alpha \lambda\, d\|F_\lambda(y-y^\delta)\|^2 \tag{2.21} \]
\[ \le C\frac{1}{\alpha}\int_0^\alpha d\|F_\lambda(y-y^\delta)\|^2, \tag{2.22} \]
it follows from the above estimates that (2.19) may be bounded as
\[ \|x_\alpha^\delta - x_\alpha\|^2 \le C\left( \frac{1}{\alpha^2}\int_0^\alpha \lambda\, d\|F_\lambda(y-y^\delta)\|^2 + \int_\alpha^\infty \lambda^{-1}\, d\|F_\lambda(y-y^\delta)\|^2 \right) \le C\left( \frac{1}{\alpha}\int_0^\alpha d\|F_\lambda(y-y^\delta)\|^2 + \int_\alpha^\infty \lambda^{-1}\, d\|F_\lambda(y-y^\delta)\|^2 \right). \]
Recall the Muckenhoupt condition (2.17), which allows us to bound the second term of the inequality above, with $p = 1$ and $p = 2$, respectively, yielding
\[ \int_\alpha^\infty \lambda^{-1}\, d\|F_\lambda(y-y^\delta)\|^2 \le C \begin{cases} \dfrac{1}{\alpha}\displaystyle\int_0^\alpha d\|F_\lambda(y-y^\delta)\|^2, & \text{if } p = 1, \\[6pt] \dfrac{1}{\alpha^2}\displaystyle\int_0^\alpha \lambda\, d\|F_\lambda(y-y^\delta)\|^2, & \text{if } p = 2. \end{cases} \tag{2.23} \]
In case $\psi \in \{\psi_{QO}, \psi_L\}$, we observe, similarly as in [83,87,91], that
\[ \psi_{QO}^2(\alpha, y-y^\delta) \ge \int_0^\alpha \frac{\alpha^2\lambda}{(\lambda+\alpha)^4}\, d\|F_\lambda Q(y-y^\delta)\|^2 \ge C\int_0^\alpha \frac{\lambda}{(\lambda+\alpha)^2}\, d\|F_\lambda Q(y-y^\delta)\|^2 \ge \bar{C}\,\frac{1}{\alpha^2}\int_0^\alpha \lambda\, d\|F_\lambda Q(y-y^\delta)\|^2, \]
and
\[ \psi_L^2(\alpha, y-y^\delta) \ge \int_0^\alpha \frac{\alpha\lambda}{(\lambda+\alpha)^3}\, d\|F_\lambda(y-y^\delta)\|^2 \ge C\frac{1}{\alpha^2}\int_0^\alpha \lambda\, d\|F_\lambda(y-y^\delta)\|^2, \]
so that with (2.23), where $p = 2$, we achieve the desired estimate. In case $\psi \in \{\psi_{HD}, \psi_{HR}\}$, we follow the example of [83] and find that
\[ \psi_{HD}^2(\alpha, y-y^\delta) \ge \int_0^\alpha \frac{\alpha}{(\lambda+\alpha)^2}\, d\|F_\lambda(y-y^\delta)\|^2 \ge \frac{C}{\alpha}\int_0^\alpha d\|F_\lambda(y-y^\delta)\|^2 \]
and
\[ \psi_{HR}^2(\alpha, y-y^\delta) \ge \int_0^\alpha \frac{\alpha^2}{(\lambda+\alpha)^3}\, d\|F_\lambda(y-y^\delta)\|^2 \ge \frac{C}{\alpha}\int_0^\alpha d\|F_\lambda(y-y^\delta)\|^2. \]
Then, similarly as for the two previous parameter choice rules, we use (2.23) with $p = 1$ to complete the proof.

We also bound the $\psi$-functional acting on the noise from above (cf. [83,87]):

Proposition 11. We have
\[ \psi(\alpha, y-y^\delta) \le C \begin{cases} \dfrac{\delta}{\sqrt{\alpha}}, & \text{if } \psi \in \{\psi_{HD}, \psi_{HR}\}, \\[6pt] \|x_\alpha^\delta - x_\alpha\|, & \text{if } \psi \in \{\psi_{QO}, \psi_L\}, \end{cases} \]
for all $\alpha \in (0,\alpha_{\max})$ and $y, y^\delta \in Y$.
Proof. For $\psi = \psi_{HD}$, one can immediately estimate
\[ \psi_{HD}^2(\alpha, y-y^\delta) = \frac{1}{\alpha}\int_0^\infty \frac{\alpha^2}{(\lambda+\alpha)^2}\, d\|F_\lambda(y-y^\delta)\|^2 \le \frac{1}{\alpha}\int_0^\infty d\|F_\lambda(y-y^\delta)\|^2 \le \frac{\delta^2}{\alpha}. \]
If $\psi = \psi_{HR}$, then
\[ \psi_{HR}^2(\alpha, y-y^\delta) = \int_0^\infty \frac{\alpha^2}{(\lambda+\alpha)^3}\, d\|F_\lambda(y-y^\delta)\|^2 \le \int_0^\infty \frac{1}{\lambda+\alpha}\, d\|F_\lambda(y-y^\delta)\|^2 \le C\frac{\delta^2}{\alpha}. \]
For $\psi = \psi_{QO}$, we have
\[ \psi_{QO}^2(\alpha, y-y^\delta) = \int_0^\infty \frac{\alpha^2\lambda}{(\lambda+\alpha)^4}\, d\|F_\lambda(y-y^\delta)\|^2 = \int_0^\infty \frac{\alpha^2}{(\lambda+\alpha)^2}\cdot\frac{\lambda}{(\lambda+\alpha)^2}\, d\|F_\lambda(y-y^\delta)\|^2 \le \int_0^\infty \frac{\lambda}{(\lambda+\alpha)^2}\, d\|F_\lambda(y-y^\delta)\|^2 = \|x_\alpha^\delta - x_\alpha\|^2, \]
and for $\psi = \psi_L$, the estimate follows similarly as for the quasi-optimality functional, since the filter function for the simple-L rule satisfies
\[ \Phi_\alpha(\lambda) = \frac{\alpha\lambda}{(\alpha+\lambda)^3} \le \frac{\lambda}{(\alpha+\lambda)^2}, \]
thus completing the proof.
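The noise sets $N_p$ of (2.17) can also be probed numerically. The sketch below (with an assumed polynomially decaying spectrum and two illustrative noise profiles, none of which are taken from the thesis) computes the smallest admissible constant $C$ over a grid of $\alpha$; rough, white-noise-like coefficients keep it bounded, while a very smooth perturbation does not:

```python
import numpy as np

# Hedged sketch: checking the Muckenhoupt-type condition (2.17) for a
# discrete spectrum. lam are eigenvalues of A A^*; e2 are the squared
# noise coefficients |<e, u_i>|^2. The condition requires a uniform C with
#   alpha^p * sum_{lam > alpha} e2/lam  <=  C * sum_{lam <= alpha} lam^(p-1) * e2.

def muckenhoupt_constant(lam, e2, p, alphas):
    """Smallest C such that the N_p inequality holds on the given alpha grid."""
    ratios = []
    for a in alphas:
        hi = lam > a
        lhs = a**p * np.sum(e2[hi] / lam[hi])
        rhs = np.sum(lam[~hi] ** (p - 1) * e2[~hi])
        if rhs > 0:
            ratios.append(lhs / rhs)
    return max(ratios)

i = np.arange(1, 2001)
lam = i**-2.0                  # polynomially decaying spectrum
alphas = np.logspace(-6, -1, 40)

rough = np.ones_like(lam)      # white-noise-like coefficients (no decay)
smooth = lam**2                # very smooth perturbation

print(muckenhoupt_constant(lam, rough, p=1, alphas=alphas))   # stays bounded
print(muckenhoupt_constant(lam, smooth, p=1, alphas=alphas))  # grows as alpha -> 0
```

This matches the discussion above: the condition fails precisely when the perturbation lacks sufficiently many high-frequency components.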
Now, the next proposition proves that the $\psi$-functionals may be bounded from above by the approximation error or, at least, a similar estimate:
Proposition 12. If $\psi \in \{\psi_{HR}, \psi_{QO}\}$, then we have
\[ \psi(\alpha, y) \le C\|x_\alpha - x^\dagger\|. \tag{2.24} \]
If $\psi \in \{\psi_{HD}, \psi_L\}$ and (2.5) holds, then
\[ \psi(\alpha, y) \le \sqrt{\int_0^\infty \frac{\alpha}{\lambda+\alpha}\, d\|E_\lambda x^\dagger\|^2} \le C\alpha^\mu, \qquad \mu \le \tfrac{1}{2}, \]
for all $\alpha \in (0,\alpha_{\max})$ and $y \in Y$.
Proof. Let $\psi = \psi_{HD}$. Then
\[ \psi_{HD}^2(\alpha, y) = \int_0^\infty \frac{\alpha\lambda}{(\lambda+\alpha)^2}\, d\|E_\lambda x^\dagger\|^2 = \int_0^\infty \frac{\lambda}{\lambda+\alpha}\cdot\frac{\alpha}{\lambda+\alpha}\, d\|E_\lambda x^\dagger\|^2 \le \int_0^\infty \frac{\alpha}{\lambda+\alpha}\, d\|E_\lambda x^\dagger\|^2, \]
from which the result follows thanks to (2.5), courtesy of the fact that $\lambda/(\lambda+\alpha) \le 1$ for all $\lambda \ge 0$. Note that the remaining estimates in this proof also follow from the aforementioned upper bound. For $\psi = \psi_{HR}$, it easily follows that
\[ \psi_{HR}^2(\alpha, y) = \int_0^\infty \frac{\alpha^2\lambda}{(\lambda+\alpha)^3}\, d\|E_\lambda x^\dagger\|^2 \le \int_0^\infty \frac{\alpha^2}{(\lambda+\alpha)^2}\cdot\frac{\lambda}{\lambda+\alpha}\, d\|E_\lambda x^\dagger\|^2 \le \|x_\alpha - x^\dagger\|^2. \]
If $\psi = \psi_{QO}$, then it is immediate that
\[ \psi_{QO}^2(\alpha, y) = \int_0^\infty \frac{\alpha^2\lambda^2}{(\lambda+\alpha)^4}\, d\|E_\lambda x^\dagger\|^2 = \int_0^\infty \frac{\alpha^2}{(\lambda+\alpha)^2}\cdot\frac{\lambda^2}{(\lambda+\alpha)^2}\, d\|E_\lambda x^\dagger\|^2 \le \|x_\alpha - x^\dagger\|^2. \]
For $\psi = \psi_L$, it follows easily that
\[ \psi_L^2(\alpha, y) = \int_0^\infty \frac{\alpha\lambda^2}{(\lambda+\alpha)^3}\, d\|E_\lambda x^\dagger\|^2 \le \int_0^\infty \frac{\alpha}{\lambda+\alpha}\, d\|E_\lambda x^\dagger\|^2, \]
and this expression is of the order of $\alpha^\mu$ (for $\mu \le 1/2$), which follows similarly via (1.20); this completes the proof, as the remaining details are identical to those for the HD rule.
Remark. We remark that in order to bound the HD functional from above by the approximation error, as one can do for the HR and QO rules, the following regularity condition was considered in [83]:
\[ \int_\alpha^\infty \frac{\alpha^2\lambda}{(\lambda+\alpha)^2}\, d\|E_\lambda x^\dagger\|^2 \le C\alpha\int_\alpha^\infty \frac{\alpha^2}{(\lambda+\alpha)^2}\, d\|E_\lambda x^\dagger\|^2, \tag{2.25} \]
for all $\alpha \in (0,\alpha_{\max})$, with which one can prove:
\[ \psi_{HD}^2(\alpha, y) = \left(\int_0^\alpha + \int_\alpha^\infty\right)\frac{\alpha\lambda}{(\lambda+\alpha)^2}\, d\|E_\lambda x^\dagger\|^2 \le \int_0^\alpha \frac{\alpha^2}{(\lambda+\alpha)^2}\, d\|E_\lambda x^\dagger\|^2 + \frac{1}{\alpha}\int_\alpha^\infty \frac{\alpha^2\lambda}{(\lambda+\alpha)^2}\, d\|E_\lambda x^\dagger\|^2 \]
\[ \overset{(2.25)}{\le} \int_0^\alpha \frac{\alpha^2}{(\lambda+\alpha)^2}\, d\|E_\lambda x^\dagger\|^2 + C\int_\alpha^\infty \frac{\alpha^2}{(\lambda+\alpha)^2}\, d\|E_\lambda x^\dagger\|^2 \le \max\{1, C\}\,\|x_\alpha - x^\dagger\|^2. \]
However, the inequality (2.25) is very restrictive, as it is rarely satisfied in practice, so we will not consider it in general.

In the absence of any source conditions, we now have all the tools to prove that when the parameter is selected according to a heuristic rule, the corresponding total error converges as the noise level decays to zero:

Theorem 5. Let
\[ \alpha_* = \operatorname{argmin}_{\alpha\in(0,\alpha_{\max})} \psi(\alpha, y^\delta), \]
with $\psi \in \{\psi_{HD}, \psi_{HR}, \psi_{QO}, \psi_L\}$, let
\[ y - y^\delta \in \begin{cases} N_1, & \text{if } \psi \in \{\psi_{HD}, \psi_{HR}\}, \\ N_2, & \text{if } \psi \in \{\psi_{QO}, \psi_L\}, \end{cases} \]
and suppose $\|y^\delta\| \ne 0$ in case $\psi \in \{\psi_{HD}, \psi_{HR}\}$, and $\|A^*y^\delta\| \ne 0$ in case $\psi \in \{\psi_{QO}, \psi_L\}$. Then
\[ \|x_{\alpha_*}^\delta - x^\dagger\| \to 0, \qquad \text{as } \delta \to 0. \]

Proof. We have
\[ \|x_{\alpha_*}^\delta - x^\dagger\| \le \|x_{\alpha_*}^\delta - x_{\alpha_*}\| + \|x_{\alpha_*} - x^\dagger\| \overset{(2.18)}{=} O\!\left(\psi(\alpha_*, y - y^\delta)\right) + \|x_{\alpha_*} - x^\dagger\| \overset{(1.39)}{=} O\!\left(\psi(\alpha, y^\delta) + \psi(\alpha_*, y)\right) + \|x_{\alpha_*} - x^\dagger\|, \]
where we have used the optimality of $\alpha_*$; namely, that $\psi(\alpha_*, y^\delta) \le \psi(\alpha, y^\delta)$ for all $\alpha \in (0,\alpha_{\max})$ (cf. (1.32)).
If $\psi \in \{\psi_{QO}, \psi_L\}$, then it follows from Proposition 12 that
\[ \|x_{\alpha_*}^\delta - x^\dagger\| = O\!\left(\psi_{QO/L}(\alpha, y^\delta)\right) + \|x_{\alpha_*} - x^\dagger\| = O\!\left(\psi_{QO/L}(\alpha, y-y^\delta) + \psi_{QO/L}(\alpha, y)\right) + \|x_{\alpha_*} - x^\dagger\| = O\!\left(\|x_\alpha^\delta - x_\alpha\| + \|x_\alpha - x^\dagger\| + \phi(\alpha_*)\right), \]
with a function $\phi$ satisfying $\phi(\alpha) \to 0$ as $\alpha \to 0$ (cf. (1.21)), and since $\alpha_* \to 0$ due to Corollary 2, it follows that choosing any $\alpha = \alpha(\delta)$ such that $\alpha \to 0$ and $\delta^2/\alpha \to 0$ as $\delta \to 0$ yields the desired result. If $\psi \in \{\psi_{HD}, \psi_{HR}\}$, it also follows from Propositions 11 and 12 that
\[ \|x_{\alpha_*}^\delta - x^\dagger\| = O\!\left(\frac{\delta}{\sqrt{\alpha}} + \phi(\alpha) + \phi(\alpha_*)\right), \]
and the result follows by selecting $\alpha$ as before (cf. Corollary 2).

If we now consider the source condition (2.5), then, in addition to being able to prove convergence of the error, we may also prove (suboptimal) convergence rates. However, before doing so, we first give the following proposition which, like Proposition 9, provides an estimate for the $\psi$-functionals from below, this time by the approximation error and in terms of the smoothness index $\mu$:

Proposition 13. Let (2.5) hold with $\mu \in [0,1)$ and assume $\|A^*Ax^\dagger\| \ne 0$. Then
\[ \|x_\alpha - x^\dagger\| \le C \begin{cases} \psi^{2\mu}(\alpha, y), & \text{if } \psi \in \{\psi_{HD}, \psi_L\} \text{ and } \mu \le \tfrac{1}{2}, \\ \psi^\mu(\alpha, y), & \text{if } \psi \in \{\psi_{HR}, \psi_{QO}\}, \end{cases} \]
for all $\alpha \in (0,\alpha_{\max})$ and $y \in Y$.
Proof. If $\psi = \psi_L$, we can estimate
\[ \psi_L^2(\alpha, y) = \int_0^\infty \frac{\alpha\lambda^2}{(\lambda+\alpha)^3}\, d\|E_\lambda x^\dagger\|^2 \ge \frac{\alpha}{(\|A\|^2+\alpha_{\max})^3}\int_0^\infty \lambda^2\, d\|E_\lambda x^\dagger\|^2 = C_1\|(A^*A)x^\dagger\|^2\,\alpha, \]
and conversely, we have that
\[ \|x_\alpha - x^\dagger\| \le C_2\alpha^\mu \le C_2\left(\frac{1}{C_1\|(A^*A)x^\dagger\|^2}\,\psi_L^2(\alpha, y)\right)^{\!\mu} = \frac{C}{\|(A^*A)x^\dagger\|^{2\mu}}\,\psi_L^{2\mu}(\alpha, y). \]
If $\psi = \psi_{QO}$, we estimate, analogously,
\[ \psi_{QO}^2(\alpha, y) = \int_0^\infty \frac{\alpha^2\lambda^2}{(\lambda+\alpha)^4}\, d\|E_\lambda x^\dagger\|^2 \ge C\|(A^*A)x^\dagger\|^2\,\alpha^2, \]
by which we similarly arrive at the estimate
\[ \|x_\alpha - x^\dagger\| \le C\alpha^\mu \le C\psi_{QO}^\mu(\alpha, y). \]
Note that the proofs for the heuristic discrepancy and Hanke-Raus rules follow in exactly the same fashion: the rates for the HD rule match those of the L rule, and the Hanke-Raus rates match the quasi-optimality rates, due to the order of the $\alpha$ factor in the respective filter functions.

We remark that in the above proposition, we can see that the HD and L rules both exhibit the early saturation effect, as the result only holds for $\mu \le 1/2$ rather than $\mu \le 1$. This also has a "knock-on" effect on the subsequent results; namely, the main convergence rates result of this section, which we provide via the following corollary:
Corollary 3. In addition to the conditions of Theorem 5, let the source condition (2.5) hold with $\mu \in [0,1)$. Then
\[ \|x_{\alpha_*}^\delta - x^\dagger\| = \begin{cases} O\!\left(\delta^{\frac{4\mu^2}{2\mu+1}}\right), & \text{if } \psi \in \{\psi_{HD}, \psi_L\} \text{ and } \mu \le \tfrac{1}{2}, \\[6pt] O\!\left(\delta^{\frac{2\mu^2}{2\mu+1}}\right), & \text{if } \psi \in \{\psi_{HR}, \psi_{QO}\}, \end{cases} \]
as $\delta \to 0$.
Proof. Similarly as in the proof of the previous theorem, we have
\[ \|x_{\alpha_*}^\delta - x^\dagger\| = O\!\left(\psi(\alpha, y^\delta) + \psi(\alpha_*, y)\right) + \|x_{\alpha_*} - x^\dagger\|, \]
which follows from the optimality of $\alpha_*$ (cf. (1.32)) and the inequality (1.39). For $\psi \in \{\psi_{HD}, \psi_L\}$, by Propositions 11 and 12, we have
\[ \|x_{\alpha_*}^\delta - x^\dagger\| = O\!\left(\frac{\delta}{\sqrt{\alpha}} + \alpha^\mu + \alpha_*^\mu\right), \qquad \mu \le \tfrac{1}{2}, \]
\[ = O\!\left(\delta^{\frac{2\mu}{2\mu+1}} + \psi_{HD/L}^{2\mu}(\alpha, y^\delta)\right) = O\!\left(\delta^{\frac{4\mu^2}{2\mu+1}}\right), \]
where the second equality follows from the fact that $\alpha_* \le C\psi_{HD/L}^2(\alpha_*, y^\delta)$ with $\mu \le 1/2$ (cf. Proposition 9) and the choice $\alpha \sim \delta^{\frac{2}{2\mu+1}}$. For $\psi \in \{\psi_{HR}, \psi_{QO}\}$, the proofs follow similarly, except that since $\alpha_* \le C\psi(\alpha_*, y^\delta)$ for those rules, we get
\[ \|x_{\alpha_*}^\delta - x^\dagger\| = O\!\left(\delta^{\frac{2\mu}{2\mu+1}} + \psi_{HR/QO}^\mu(\alpha, y^\delta)\right) = O\!\left(\delta^{\frac{2\mu^2}{2\mu+1}}\right), \]
thereby completing the proof.
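The exponents of Corollary 3 can be compared directly with the optimal rate exponent $2\mu/(2\mu+1)$ by exact rational arithmetic. A small illustrative sketch (not from the thesis):

```python
from fractions import Fraction

# Exact-arithmetic comparison of the suboptimal exponents in Corollary 3
# with the optimal exponent 2*mu/(2*mu + 1). Pass mu as a Fraction so that
# all arithmetic stays rational.

def optimal(mu):
    return 2 * mu / (2 * mu + 1)

def hd_l(mu):      # heuristic discrepancy / simple-L (requires mu <= 1/2)
    return 4 * mu**2 / (2 * mu + 1)

def hr_qo(mu):     # Hanke-Raus / quasi-optimality
    return 2 * mu**2 / (2 * mu + 1)

for mu in (Fraction(1, 4), Fraction(1, 2), Fraction(1, 1)):
    print(mu, optimal(mu), hd_l(mu), hr_qo(mu))
```

At $\mu = 1/2$ the HD/L exponent coincides with the optimal one, and at $\mu = 1$ the HR/QO exponent does; for small $\mu$ the HD/L exponent is the larger (i.e., faster) of the two suboptimal rates, in line with the remark that follows.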
Remark. Note that the heuristic discrepancy and simple-L rules exhibit different rates to the Hanke-Raus and quasi-optimality rules. Moreover, if one wanted to prove that, say, the HD rule satisfied the sharper bound (2.24), then one would have to require an additional condition, namely (2.25), which we have already stated is somewhat restrictive (and thus we do not consider it in general). For certain problems, however, the HD and simple-L rules are optimal, e.g., whenever $\mu = 1/2$. On the other hand, for $\mu = 1$, the QO and HR rules yield optimal rates. Thus, for solutions with low smoothness, the HD and L rules in fact offer themselves as the better alternatives. For numerical implementations and results of the aforementioned rules, we refer the reader to [9,53,126].

As stated already, the convergence rates of the previous corollary are suboptimal, and in order to prove optimal rates (i.e., rates which match (2.8)), we require an additional regularity condition [83,87]:
\[ \alpha^2\int_\alpha^\infty \lambda^{-2}\, d\|E_\lambda x^\dagger\|^2 \ge C\int_0^\alpha d\|E_\lambda x^\dagger\|^2, \tag{2.26} \]
for all $\alpha \in (0,\alpha_{\max})$. Note, however, that we also have the weaker condition (cf. [91]):
\[ \int_0^\alpha d\|E_\lambda x^\dagger\|^2 \le C\alpha\int_\alpha^\infty \lambda^{-1}\, d\|E_\lambda x^\dagger\|^2, \tag{2.27} \]
which suffices for the HD and simple-L rules in the following Lemma 1. Condition (2.27) is the weaker of the two in the sense that (2.26) implies (2.27).

With the addition of the latest regularity condition, we now state the following lemma, which allows us to prove that the $\psi$-functionals for the considered heuristic parameter choice rules may be bounded from below by the approximation error [83,87]:
Lemma 1. For $\mu \in [0,1)$, let (2.26) hold for $\psi \in \{\psi_{HR}, \psi_{QO}\}$ and (2.27) hold for $\psi \in \{\psi_{HD}, \psi_L\}$. Then we have
\[ \psi(\alpha, y) \ge C\|x_\alpha - x^\dagger\|, \]
for all $y \in Y$ and $\alpha \in (0,\alpha_{\max})$.

Proof. First, we decompose the spectral form of the approximation error as the sum of two integrals:
\[ \|x_\alpha - x^\dagger\|^2 = \left(\int_0^\alpha + \int_\alpha^\infty\right)\frac{\alpha^2}{(\lambda+\alpha)^2}\, d\|E_\lambda x^\dagger\|^2. \]
Now, similarly as in [87], we estimate the second integral of the sum as
\[ \int_\alpha^\infty \frac{\alpha^2}{(\lambda+\alpha)^2}\, d\|E_\lambda x^\dagger\|^2 \ge C\alpha^2\int_\alpha^\infty \lambda^{-2}\, d\|E_\lambda x^\dagger\|^2 \overset{(2.26)}{\ge} C\int_0^\alpha d\|E_\lambda x^\dagger\|^2 \ge C\int_0^\alpha \frac{\alpha^2}{(\lambda+\alpha)^2}\, d\|E_\lambda x^\dagger\|^2. \tag{2.28} \]
Thus, with the above, we obtain an upper bound for the overall approximation error:
\[ \|x_\alpha - x^\dagger\|^2 \le C\int_\alpha^\infty \frac{\alpha^2}{(\lambda+\alpha)^2}\, d\|E_\lambda x^\dagger\|^2. \]
Hence, for all $\psi \in \{\psi_{HD}, \psi_{HR}, \psi_{QO}, \psi_L\}$, it suffices to prove that
\[ \psi^2(\alpha, y) \ge C\int_\alpha^\infty \frac{\alpha^2}{(\lambda+\alpha)^2}\, d\|E_\lambda x^\dagger\|^2. \]
For $\psi = \psi_{QO}$, we can estimate
\[ \psi_{QO}^2(\alpha, y) \ge \int_\alpha^\infty \frac{\alpha^2\lambda^2}{(\lambda+\alpha)^4}\, d\|E_\lambda x^\dagger\|^2 \ge C\int_\alpha^\infty \frac{\alpha^2}{(\lambda+\alpha)^2}\, d\|E_\lambda x^\dagger\|^2, \]
and similarly for $\psi = \psi_{HR}$:
\[ \psi_{HR}^2(\alpha, y) \ge \int_\alpha^\infty \frac{\alpha^2\lambda}{(\lambda+\alpha)^3}\, d\|E_\lambda x^\dagger\|^2 \ge C\int_\alpha^\infty \frac{\alpha^2}{(\lambda+\alpha)^2}\, d\|E_\lambda x^\dagger\|^2. \]
Note that for $\psi \in \{\psi_{HD}, \psi_L\}$, we may use the weaker condition (2.27). In particular, we can bound
\[ \int_0^\alpha \frac{\alpha^2}{(\lambda+\alpha)^2}\, d\|E_\lambda x^\dagger\|^2 \le \int_0^\alpha \frac{\alpha^2}{\alpha^2}\, d\|E_\lambda x^\dagger\|^2 \le \int_0^\alpha d\|E_\lambda x^\dagger\|^2. \tag{2.29} \]
Thus,
\[ \|x_\alpha - x^\dagger\|^2 \le \int_0^\alpha \frac{\alpha^2}{(\lambda+\alpha)^2}\, d\|E_\lambda x^\dagger\|^2 + C\int_\alpha^\infty \frac{\lambda}{\alpha}\cdot\frac{\alpha^2}{(\lambda+\alpha)^2}\, d\|E_\lambda x^\dagger\|^2 \le C\int_\alpha^\infty \frac{\lambda}{\alpha}\cdot\frac{\alpha^2}{(\lambda+\alpha)^2}\, d\|E_\lambda x^\dagger\|^2, \]
where the first inequality follows from the fact that $\lambda/\alpha \ge 1$ for all $\lambda \ge \alpha$, and the second inequality follows from a combination of (2.29) and (2.27). For $\psi = \psi_{HD}$, the proof then follows from the observation that
\[ \psi_{HD}^2(\alpha, y) \ge \int_\alpha^\infty \frac{\alpha\lambda}{(\lambda+\alpha)^2}\, d\|E_\lambda x^\dagger\|^2 = \int_\alpha^\infty \frac{\lambda}{\alpha}\cdot\frac{\alpha^2}{(\lambda+\alpha)^2}\, d\|E_\lambda x^\dagger\|^2, \]
whereas for $\psi = \psi_L$, we observe that
\[ \psi_L^2(\alpha, y) \ge \int_\alpha^\infty \frac{\alpha\lambda^2}{(\lambda+\alpha)^3}\, d\|E_\lambda x^\dagger\|^2 \ge \frac{1}{8}\int_\alpha^\infty \frac{\alpha}{\lambda}\, d\|E_\lambda x^\dagger\|^2, \]
from which the result follows via (2.27).
Now we state the optimal convergence rates theorem for the heuristic parameter choice rules, which, more or less, combines all of the above results in this section:
Theorem 6. Let the source condition (2.5) hold with $\mu \in [0,1)$, with the additional restriction that $\mu \le 1/2$ for $\psi \in \{\psi_{HD}, \psi_L\}$, and let $y - y^\delta \in N_1$ for $\psi \in \{\psi_{HD}, \psi_{HR}\}$ and $y - y^\delta \in N_2$ whenever $\psi \in \{\psi_L, \psi_{QO}\}$. Suppose, in addition, that the regularity condition (2.27) holds for $\psi \in \{\psi_{HD}, \psi_L\}$ and that (2.26) holds for $\psi \in \{\psi_{HR}, \psi_{QO}\}$. Then
\[ \|x_{\alpha_*}^\delta - x^\dagger\| = O\!\left(\delta^{\frac{2\mu}{2\mu+1}}\right), \qquad \text{as } \delta \to 0. \]
Proof. We first assume that, for $\delta$ sufficiently small, there exists an $\bar{\alpha} \in (0,\alpha_{\max})$ such that
\[ \|x_{\bar{\alpha}}^\delta - x_{\bar{\alpha}}\| + \|x_{\bar{\alpha}} - x^\dagger\| = \inf_{\alpha\in(0,\alpha_{\max})}\left( \|x_\alpha^\delta - x_\alpha\| + \|x_\alpha - x^\dagger\| \right). \]
Assume first that $\bar{\alpha} \ge \alpha_*$. Then it follows, since $\alpha \mapsto \|x_\alpha - x^\dagger\|$ is a monotonically increasing function, that $\|x_{\alpha_*} - x^\dagger\| \le \|x_{\bar{\alpha}} - x^\dagger\|$. Now, we estimate
\[ \|x_{\alpha_*}^\delta - x_{\alpha_*}\| \le C\psi(\alpha_*, y - y^\delta) \le C\left(\psi(\alpha, y^\delta) + \psi(\alpha_*, y)\right) \le C \begin{cases} \psi(\alpha, y-y^\delta) + \psi(\alpha, y) + \alpha_*^\mu, & \text{if } \psi \in \{\psi_{HD}, \psi_L\},\ \mu \le \tfrac{1}{2}, \\ \psi(\alpha, y-y^\delta) + \psi(\alpha, y) + \|x_{\alpha_*} - x^\dagger\|, & \text{if } \psi \in \{\psi_{HR}, \psi_{QO}\}, \end{cases} \]
\[ = \begin{cases} O\!\left(\frac{\delta}{\sqrt{\alpha}} + \alpha^\mu + \alpha_*^\mu\right), & \text{if } \psi = \psi_{HD}, \\ O\!\left(\|x_\alpha^\delta - x_\alpha\| + \alpha^\mu + \alpha_*^\mu\right), & \text{if } \psi = \psi_L, \\ O\!\left(\frac{\delta}{\sqrt{\alpha}} + \|x_\alpha - x^\dagger\| + \|x_{\alpha_*} - x^\dagger\|\right), & \text{if } \psi = \psi_{HR}, \\ O\!\left(\|x_\alpha^\delta - x_\alpha\| + \|x_\alpha - x^\dagger\| + \|x_{\alpha_*} - x^\dagger\|\right), & \text{if } \psi = \psi_{QO}, \end{cases} \]
and, setting $\alpha = \bar{\alpha}$,
\[ = \begin{cases} O\!\left(\frac{\delta}{\sqrt{\bar{\alpha}}} + \bar{\alpha}^\mu\right), & \text{if } \psi = \psi_{HD}, \\ O\!\left(\|x_{\bar{\alpha}}^\delta - x_{\bar{\alpha}}\| + \bar{\alpha}^\mu\right), & \text{if } \psi = \psi_L, \\ O\!\left(\frac{\delta}{\sqrt{\bar{\alpha}}} + \|x_{\bar{\alpha}} - x^\dagger\|\right), & \text{if } \psi = \psi_{HR}, \\ O\!\left(\|x_{\bar{\alpha}}^\delta - x_{\bar{\alpha}}\| + \|x_{\bar{\alpha}} - x^\dagger\|\right), & \text{if } \psi = \psi_{QO}, \end{cases} \]
where we have used the estimates for the $\psi$-functionals from Propositions 11 and 12, and, for $\bar{\alpha} \ge \alpha_*$, we recall that $\|x_{\alpha_*} - x^\dagger\| \le \|x_{\bar{\alpha}} - x^\dagger\|$ and $\alpha_*^\mu \le \bar{\alpha}^\mu$.

Moreover, continuing with the case $\bar{\alpha} < \alpha_*$:
\[ \|x_{\alpha_*} - x^\dagger\| \le C\psi(\alpha_*, y) \le C\left(\psi(\alpha, y^\delta) + \psi(\alpha_*, y - y^\delta)\right) = \begin{cases} O\!\left(\frac{\delta}{\sqrt{\alpha}} + \alpha^\mu + \frac{\delta}{\sqrt{\alpha_*}}\right), & \text{if } \psi = \psi_{HD}, \\ O\!\left(\|x_\alpha^\delta - x_\alpha\| + \alpha^\mu + \|x_{\alpha_*}^\delta - x_{\alpha_*}\|\right), & \text{if } \psi = \psi_L, \\ O\!\left(\frac{\delta}{\sqrt{\alpha}} + \|x_\alpha - x^\dagger\| + \frac{\delta}{\sqrt{\alpha_*}}\right), & \text{if } \psi = \psi_{HR}, \\ O\!\left(\|x_\alpha^\delta - x_\alpha\| + \|x_\alpha - x^\dagger\| + \|x_{\alpha_*}^\delta - x_{\alpha_*}\|\right), & \text{if } \psi = \psi_{QO}, \end{cases} \]
and, setting $\alpha = \bar{\alpha}$,
\[ = \begin{cases} O\!\left(\frac{\delta}{\sqrt{\bar{\alpha}}} + \bar{\alpha}^\mu\right), & \text{if } \psi = \psi_{HD}, \\ O\!\left(\|x_{\bar{\alpha}}^\delta - x_{\bar{\alpha}}\| + \bar{\alpha}^\mu\right), & \text{if } \psi = \psi_L, \\ O\!\left(\frac{\delta}{\sqrt{\bar{\alpha}}} + \|x_{\bar{\alpha}} - x^\dagger\|\right), & \text{if } \psi = \psi_{HR}, \\ O\!\left(\|x_{\bar{\alpha}}^\delta - x_{\bar{\alpha}}\| + \|x_{\bar{\alpha}} - x^\dagger\|\right), & \text{if } \psi = \psi_{QO}, \end{cases} \]
where we have used the estimates from Lemma 1, which give a bound for the approximation error by the $\psi$-functionals, and, for $\bar{\alpha} < \alpha_*$, since $\alpha \mapsto \|x_\alpha^\delta - x_\alpha\|$ is a monotonically decreasing function, it follows that $\|x_{\alpha_*}^\delta - x_{\alpha_*}\| \le \|x_{\bar{\alpha}}^\delta - x_{\bar{\alpha}}\|$ (and, similarly, $\delta/\sqrt{\alpha_*} \le \delta/\sqrt{\bar{\alpha}}$). Combining the two cases, we thus have
\[ \|x_{\alpha_*}^\delta - x^\dagger\| = \begin{cases} O\!\left(\frac{\delta}{\sqrt{\bar{\alpha}}} + \bar{\alpha}^\mu\right), & \text{if } \psi = \psi_{HD}, \\ O\!\left(\|x_{\bar{\alpha}}^\delta - x_{\bar{\alpha}}\| + \bar{\alpha}^\mu\right), & \text{if } \psi = \psi_L, \\ O\!\left(\frac{\delta}{\sqrt{\bar{\alpha}}} + \|x_{\bar{\alpha}} - x^\dagger\|\right), & \text{if } \psi = \psi_{HR}, \\ O\!\left(\|x_{\bar{\alpha}}^\delta - x_{\bar{\alpha}}\| + \|x_{\bar{\alpha}} - x^\dagger\|\right), & \text{if } \psi = \psi_{QO}, \end{cases} = O\!\left(\delta^{\frac{2\mu}{2\mu+1}}\right), \]
as $\delta \to 0$, which completes the proof.

Remark. One may observe in the proof of the above theorem that the quasi-optimality rule actually satisfies
\[ \|x_{\alpha_*}^\delta - x^\dagger\| \le C\inf_{\alpha\in(0,\alpha_{\max})}\left( \|x_\alpha^\delta - x_\alpha\| + \|x_\alpha - x^\dagger\| \right), \]
which is quite remarkable, as the upper bound above is almost the best possible. We may organise the above four rules into a table in terms of early saturation and the required noise restriction as follows:

                      Early saturation: µ ≤ 1/2    No early saturation: µ < 1
  N_p with p = 1      HD                           HR
  N_p with p = 2      L                            QO
where the required noise restrictions were demonstrated in Proposition 10, and the early saturation effect of the HD and L rules was first observed when bounding the approximation error in terms of the smoothness index $\mu$; namely, we saw in Proposition 13 that we had to restrict the range of admissible $\mu$ to $\mu \le 1/2$. The Hanke-Raus and quasi-optimality rules are thus a kind of "remedy" to this effect for the heuristic discrepancy and L rules, respectively, as they are closely related. We remark that, for a-posteriori parameter choice rules, the discrepancy (1.29) and improved discrepancy (1.40) principles have a similar relationship, in that the latter is also the "remedy" for the former (with respect to saturation). Returning to the topic of heuristic parameter choice rules, we mention that, in spite of being "remedies" regarding saturation, the noise restrictions required for the L and quasi-optimality rules are more stringent than the somewhat weaker restriction required for the convergence of the other two rules.
2.2 Weakly Bounded Noise
This section largely follows the paper [89]. In this scenario, we deviate from the classical theory somewhat and consider the case in which the noise is unbounded in the image space, but satisfies another constraint which qualifies it for the name in the title of this section. We describe this in the following definition:

Definition 8. If the noise $e = y - y^\delta$ satisfies
\[ \tau := \|(AA^*)^p e\| < \infty, \qquad \text{with } p \in \left(0, \tfrac{1}{2}\right], \]
then we say that the noise is weakly bounded [29,30].

That is to say that the noise belongs to the Hilbert space $Z$, which can be defined as the completion of $\operatorname{range}(A)$ with respect to $\|(AA^*)^p\cdot\|$, complemented by $\operatorname{range}(A)^\perp$ (cf. [89]). The above noise model may be interpreted as a deterministic treatment of the white noise model, which is studied extensively in the stochastic treatment of ill-posed problems [88]. The question of whether one may still regularise, in the deterministic sense, in the presence of this large, i.e., weakly bounded, noise is answered by the following proposition [51]:

Proposition 14. Suppose that $e \in Z$. Then the Tikhonov-regularised solution $x_\alpha^\delta \in X$ is well defined.

Proof. Notice that
\[ \|x_\alpha^\delta\| = \|A^*(AA^*+\alpha I)^{-1}y^\delta\| \le \|A^*(AA^*+\alpha I)^{-1}y\| + \|A^*(AA^*+\alpha I)^{-1}(y-y^\delta)\|. \]
The first term is well defined; thus we proceed to show that the second term is also finite:
\[ \|(AA^*)^{1/2}(AA^*+\alpha I)^{-1}(y-y^\delta)\|^2 = \int_0^\infty \frac{1}{(\lambda+\alpha)^2}\, d\|F_\lambda(AA^*)^{1/2}(y-y^\delta)\|^2 \le \sup_{\lambda\in(0,\|A\|^2)}\frac{1}{(\lambda+\alpha)^2}\,\|(AA^*)^{1/2}(y-y^\delta)\|^2 = \frac{\tau^2}{\alpha^2} < \infty, \]
with $p = 1/2$, which proves existence, since $\|(AA^*)^{1/2}e\| \le C\|(AA^*)^p e\|$ for $p \le 1/2$.

We would like to replicate the data error estimates we had in the case where the noise was bounded in $Y$. Indeed, the following error estimates are courtesy of [28,112]:

Proposition 15. There exist constants such that
\[ \|x_\alpha^\delta - x_\alpha\| \le C\frac{\tau}{\alpha^{p+\frac{1}{2}}} \quad\text{and}\quad \|A(x_\alpha^\delta - x_\alpha)\| \le C\frac{\tau}{\alpha^p}, \tag{2.30} \]
for all $y, y^\delta \in Z$ and $\alpha \in (0,\alpha_{\max})$.

Proof. We have
\[ \|x_\alpha^\delta - x_\alpha\|^2 = \int_0^\infty \frac{\lambda^{1-2p}}{(\lambda+\alpha)^2}\,\lambda^{2p}\, d\|F_\lambda(y^\delta-y)\|^2 \le C_p\frac{1}{\alpha^{2p+1}}\,\|(AA^*)^p(y^\delta-y)\|^2 = C_p\frac{\tau^2}{\alpha^{2p+1}}. \]
Moreover,
\[ \|A(x_\alpha^\delta - x_\alpha)\|^2 = \int_0^\infty \frac{\lambda^{2-2p}}{(\lambda+\alpha)^2}\,\lambda^{2p}\, d\|F_\lambda(y^\delta-y)\|^2 \le C_p\frac{1}{\alpha^{2p}}\,\|(AA^*)^p(y^\delta-y)\|^2 = C_p\frac{\tau^2}{\alpha^{2p}}, \]
where the $C_p$ are positive constants depending on $p$.

The observant reader will notice that the above proposition omits approximation error estimates, and will realise that, as those may be estimated independently of the noise, they are identical to the ones derived in Proposition 6. Thus, with these error estimates, one can prove convergence of the Tikhonov-regularised solution even in the weakly bounded noise case:
Theorem 7. Choosing $\alpha = \alpha(\tau)$ such that $\alpha(\tau) \to 0$ and $\tau/\alpha(\tau)^{p+\frac{1}{2}} \to 0$ as $\tau \to 0$, we have
\[ \|x_\alpha^\delta - x^\dagger\| \to 0, \qquad \text{as } \tau \to 0. \]

Proof. The estimate for the data propagation error from the previous proposition and the classical estimate for the approximation error allow us to bound the error as
\[ \|x_\alpha^\delta - x^\dagger\| \le C\frac{\tau}{\alpha^{p+\frac{1}{2}}} + \|x_\alpha - x^\dagger\| \to 0, \tag{2.31} \]
as $\tau \to 0$, if we choose $\alpha$ such that $\tau/\alpha^{p+\frac{1}{2}} \to 0$ and $\alpha \to 0$ as $\tau \to 0$.

In order to prove convergence rates (which, in this instance, will be in terms of $\tau$), we recall from the usual theory that we require a source condition. To this end, we assume once more that (2.5) holds [28,30]:
Corollary 4. Let $x^\dagger$ satisfy the source condition (2.5) with $\mu \in [0,1)$. Then, choosing $\alpha = \alpha(\tau) = \alpha_{\mathrm{opt}} \sim \tau^{\frac{2}{2\mu+2p+1}}$, we have
\[ \|x_\alpha^\delta - x^\dagger\| = O\!\left(\tau^{\frac{2\mu}{2\mu+2p+1}}\right), \qquad \text{as } \tau \to 0. \]

Proof. Under (2.5), the approximation error term in (2.31) is of the order $\alpha^\mu$; taking the infimum of $C\tau/\alpha^{p+\frac{1}{2}} + C\alpha^\mu$ over $\alpha$ then yields $\alpha \sim \tau^{\frac{2}{2\mu+2p+1}}$. Indeed, with this a-priori choice of parameter, one gets the convergence rate above.

What we observe is that the above results are true generalisations of the standard theory, in the sense that setting $p = 0$, so that rather than $e \in Z$ we have $e \in Y$, recovers results analogous to the standard ones (see Section 2.1).
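The data-error estimate (2.30) can be checked spectrally. The sketch below uses an assumed polynomially decaying spectrum with non-decaying, white-noise-like coefficients (illustrative only): $\|e\|$ diverges as more modes are added, yet $\tau$ is finite for $p = 1/2$, and the bound $\|x_\alpha^\delta - x_\alpha\| \le \tau/\alpha^{p+1/2}$ holds:

```python
import numpy as np

# Hedged sketch: the weak-noise data error estimate (2.30), evaluated
# spectrally. lam are assumed eigenvalues of A A^*; e2 the squared noise
# coefficients. For p = 1/2 the per-mode inequality
#   lam/(lam + a)^2 <= lam^(2p)/a^(2p+1)
# holds with constant 1, so the summed bound holds with C = 1.

i = np.arange(1, 100001)
lam = i**-2.0             # smoothing operator: polynomial eigenvalue decay
e2 = np.ones_like(lam)    # deterministic "white noise": no decay, ||e|| diverges
p = 0.5
tau = np.sqrt(np.sum(lam ** (2 * p) * e2))  # finite weak norm

def data_error(a):        # ||x_a^d - x_a||, spectral form
    return np.sqrt(np.sum(lam / (lam + a) ** 2 * e2))

for a in (1e-1, 1e-2, 1e-3):
    print(a, data_error(a), tau / a ** (p + 0.5))
```

The printed pairs illustrate that the data propagation error stays below $\tau/\alpha^{p+1/2}$, as Proposition 15 predicts.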
2.2.1 Modified Parameter Choice Rules

In this setting, the discrepancy is not well defined, and for this reason we introduce, following [89],
\[ \psi_{\mathrm{MHD}}(\alpha, y^\delta) := \frac{1}{\alpha^{q+\frac{1}{2}}}\,\|(AA^*)^q(Ax_\alpha^\delta - y^\delta)\|, \tag{2.32} \]
\[ \psi_{\mathrm{MHR}}(\alpha, y^\delta) := \frac{1}{\alpha^{q+\frac{1}{2}}}\,\left\langle (AA^*)^q(Ax_{\alpha,\delta}^{II} - y^\delta),\ (AA^*)^q(Ax_\alpha^\delta - y^\delta)\right\rangle^{\frac{1}{2}}, \tag{2.33} \]
as the modified heuristic discrepancy and Hanke-Raus rules, respectively, with $q \ge 0$ a parameter to be specified. The sharp-eyed reader will notice the lack of a modified quasi-optimality rule; this rule need not be changed, as it lives in the domain space and is therefore well defined in any case. The L-curve rule has also been omitted for the same reason, as well as the fact that its analysis was not included in [89], since it was introduced in a later paper (cf. [91]); we also forgo the task of analysing it here in the weakly bounded noise case.

The linear filter functions (2.11) may be generalised to include the modified parameter functionals above; in particular,
\[ \Phi_\alpha^{q,m}(\lambda) := \frac{\alpha^m\lambda^{2q}}{\alpha^{2q}(\lambda+\alpha)^{m+1}}, \qquad \text{with } \begin{cases} q \ge p \text{ and } m = 1, & \text{if } \psi = \psi_{\mathrm{MHD}}, \\ q \ge p \text{ and } m = 2, & \text{if } \psi = \psi_{\mathrm{MHR}}, \\ q = \tfrac{1}{2} \text{ and } m = 3, & \text{if } \psi = \psi_{QO}. \end{cases} \tag{2.34} \]
Noise Restriction. We may also generalise the Muckenhoupt inequalities, i.e., the set of permissible noise, as
\[ N_\nu := \left\{ e \in Z \;\middle|\; \alpha^{\nu+1}\int_\alpha^\infty \lambda^{-1}\, d\|F_\lambda e\|^2 \le C\int_0^\alpha \lambda^\nu\, d\|F_\lambda e\|^2 \quad \forall\, \alpha\in(0,\alpha_{\max}) \right\}, \tag{2.35} \]
where $\nu = 1$ when $\psi = \psi_{QO}$ and $\nu = 2q$ whenever $\psi \in \{\psi_{\mathrm{MHD}}, \psi_{\mathrm{MHR}}\}$ [2]. In the following, we state some examples in which the condition (2.35) holds.
Case Study of Noise Restriction. Note that in the classical situation of (strongly) bounded noise, it has been verified that (2.35) is satisfied in typical situations [87]. Moreover, for coloured Gaussian noise, (2.35) holds almost surely for mildly ill-posed problems [88].

Suppose $A$ is compact such that $AA^*$ has eigenvalues $\{\lambda_i\}$ with polynomial decay, and assume a certain polynomial decay or growth of the noise $e = y^\delta - y$ with respect to the eigenfunctions $\{u_i\}$ of $AA^*$:
\[ \lambda_i = \frac{1}{i^\gamma}, \quad \gamma > 0, \qquad\text{and}\qquad |\langle y^\delta - y, u_i\rangle|^2 = \frac{\kappa}{i^\beta}, \quad \beta \in \mathbb{R},\ \kappa > 0. \tag{2.36} \]
Then
\[ \|y^\delta - y\|^2 = \kappa\sum_{i=1}^\infty \frac{1}{i^\beta}, \qquad \tau^2 = \|(AA^*)^p(y^\delta-y)\|^2 = \kappa\sum_{i=1}^\infty \frac{1}{i^{\beta+2p\gamma}}. \]
If we consider the case of unbounded but weakly bounded noise, i.e., $\|y^\delta - y\|^2 = \infty$ but $\tau < \infty$, then the exponents $\beta, p$ should satisfy
\[ \beta \le 1 \quad\text{and}\quad \beta + 2p\gamma > 1, \qquad\text{thus}\qquad \beta \in (1-2p\gamma, 1]. \]
The inequality in (2.35) can then be written as
\[ \kappa\alpha^{\nu+1}\sum_{1\le i\le \alpha^{-1/\gamma}} i^{\gamma-\beta} = \alpha^{\nu+1}\sum_{\lambda_i\ge\alpha} \frac{1}{\lambda_i}\,|\langle y^\delta - y, u_i\rangle|^2 \le C\sum_{\lambda_i\le\alpha} \lambda_i^\nu\,|\langle y^\delta-y, u_i\rangle|^2 = C\kappa\sum_{i\ge\alpha^{-1/\gamma}} \frac{1}{i^{\gamma\nu+\beta}}. \]
Defining $N_* = \alpha^{-1/\gamma}$, we have
\[ \sum_{1\le i\le N_*} i^{\gamma-\beta} \le \int_1^{N_*} x^{\gamma-\beta}\, dx \le C\begin{cases} N_*^{\gamma-\beta+1}, & \text{if } \gamma-\beta > -1, \\ 1, & \text{if } \gamma-\beta < -1, \end{cases} \]
and
\[ \sum_{i\ge N_*} \frac{1}{i^{\gamma\nu+\beta}} \sim \int_{N_*}^\infty \frac{1}{x^{\gamma\nu+\beta}}\, dx \sim \begin{cases} \dfrac{C}{N_*^{\gamma\nu+\beta-1}}, & \text{if } \gamma\nu+\beta > 1, \\[4pt] \infty, & \text{if } \gamma\nu+\beta \le 1. \end{cases} \]
Since $\alpha = N_*^{-\gamma}$, we arrive at the sufficient inequality
\[ N_*^{-\gamma(\nu+1)+1+\gamma-\beta} \le CN_*^{1-\gamma\nu-\beta}, \]
in the case that $\gamma-\beta > -1$ and $\gamma\nu+\beta > 1$. Since the exponents match, the noise condition is then satisfied. If $\gamma\nu+\beta \le 1$, then the inequality is clearly satisfied because of the divergent right-hand side. Thus, the noise condition holds for $\beta < \gamma+1$.

Roughly speaking, this means that the noise should not be too regular (relative to the smoothing of the operator). In particular, the deterministic model of white noise, where $\beta = 0$ (no decay), satisfies a noise condition whenever the operator is smoothing. Most importantly, the assumption of a noise condition (2.35) is compatible with a weakly bounded noise situation.
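The polynomial-decay case study can be replicated numerically. The sketch below (illustrative parameters, not from the thesis) evaluates both sides of (2.35) for $\lambda_i = i^{-\gamma}$ and $|\langle e, u_i\rangle|^2 = \kappa i^{-\beta}$; with $\gamma = 2$ and $\nu = 1$, white noise ($\beta = 0 < \gamma + 1$) yields a bounded constant, while an overly smooth perturbation ($\beta = 4 > \gamma + 1$) does not:

```python
import numpy as np

# Hedged sketch: the generalised noise condition (2.35) for polynomially
# decaying eigenvalues lam_i = i^(-gamma) and noise coefficients
# kappa * i^(-beta). The analysis above predicts a bounded constant
# precisely when beta < gamma + 1.

def condition_constant(gamma, beta, nu, alphas, n=200000, kappa=1.0):
    """Worst ratio of left to right side of (2.35) on the given alpha grid."""
    i = np.arange(1, n + 1, dtype=float)
    lam = i**-gamma
    e2 = kappa * i**-beta
    worst = 0.0
    for a in alphas:
        hi = lam >= a
        lhs = a ** (nu + 1) * np.sum(e2[hi] / lam[hi])
        rhs = np.sum(lam[~hi] ** nu * e2[~hi])
        worst = max(worst, lhs / rhs)
    return worst

alphas = np.logspace(-5, -2, 30)
print(condition_constant(gamma=2.0, beta=0.0, nu=1.0, alphas=alphas))  # white noise: bounded
print(condition_constant(gamma=2.0, beta=4.0, nu=1.0, alphas=alphas))  # too smooth: blows up
```

The two printed constants differ by several orders of magnitude, mirroring the dichotomy $\beta < \gamma + 1$ versus $\beta \ge \gamma + 1$ derived above.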
Convergence Analysis

The convergence analysis of regularisation methods with standard (non-heuristic) parameter choice rules in the weakly bounded noise setting is well established (cf. [28]). For the analysis of heuristic rules, we follow the example of [89] and prove the following estimates for the associated functionals:
Lemma 2. Let $x^\dagger$ satisfy (2.5) with $\mu \in [0,1)$. Then:

• If $y - y^\delta \in N_1$, then there exist positive constants such that
\[ C\|x_\alpha^\delta - x_\alpha\| \le \psi_{QO}(\alpha, y-y^\delta) \le \|x_\alpha^\delta - x_\alpha\|, \tag{2.37} \]
\[ \psi_{QO}(\alpha, y) \le \|x_\alpha - x^\dagger\|; \tag{2.38} \]

• if $y - y^\delta \in N_{2q}$, with $q \le \min\{p+1, \tfrac{1}{2}-\mu\}$, then there exist positive constants such that
\[ C\|x_\alpha^\delta - x_\alpha\| \le \psi_{\mathrm{MHD}}(\alpha, y-y^\delta) \le C\frac{\tau}{\alpha^{p+\frac{1}{2}}}, \tag{2.39} \]
\[ \psi_{\mathrm{MHD}}(\alpha, y) \le C\alpha^\mu; \tag{2.40} \]

• if $y - y^\delta \in N_{2q}$ and $q \le \min\{p+\tfrac{3}{2}, 1-\mu\}$, then there exist positive constants such that
\[ C\|x_\alpha^\delta - x_\alpha\| \le \psi_{\mathrm{MHR}}(\alpha, y-y^\delta) \le C\frac{\tau}{\alpha^{p+\frac{1}{2}}}, \tag{2.41} \]
\[ \psi_{\mathrm{MHR}}(\alpha, y) \le C\alpha^\mu, \tag{2.42} \]
for all $\alpha \in (0,\alpha_{\max})$.

Proof. In the following, we utilise the useful estimate that, for $t \ge 0$, there exists a positive constant such that
\[ \frac{\lambda^t}{\alpha+\lambda} \le \frac{C}{\alpha^{\max\{1-t,\,0\}}}, \tag{2.43} \]
for all $\alpha, \lambda \ge 0$. The estimates (2.37) and (2.38) do not require the weak boundedness condition on the noise, and therefore the proofs are nigh on identical to those of Propositions 10, 11 and 12. For $m > 0$ and with (2.43), we obtain
\[ \frac{\lambda^{2q}\alpha^m}{\alpha^{2q}(\lambda+\alpha)^{m+1}} = \frac{1}{\alpha^{2q-m}}\,\lambda^{2p}\left(\frac{\lambda^{\frac{2(q-p)}{m+1}}}{\lambda+\alpha}\right)^{\!m+1} \le C\frac{\lambda^{2p}}{\alpha^{2q-m+\max\left\{1-\frac{2(q-p)}{m+1},\,0\right\}(m+1)}} = C\frac{\lambda^{2p}}{\alpha^{\max\{1+2p,\,2q-m\}}} = C\frac{\lambda^{2p}}{\alpha^{1+2p}}, \]
if $q < \frac{m+1}{2} + p$. Moreover,
\[ \frac{\lambda^{2q}\alpha^m}{\alpha^{2q}(\lambda+\alpha)^{m+1}}\,\lambda^{1+2\mu} = \frac{1}{\alpha^{2q-m}}\left(\frac{\lambda^{\frac{1+2\mu+2q}{m+1}}}{\lambda+\alpha}\right)^{\!m+1} \le C\frac{1}{\alpha^{2q-m+\max\left\{1-\frac{1+2\mu+2q}{m+1},\,0\right\}(m+1)}} = C\frac{1}{\alpha^{\max\{-2\mu,\,2q-m\}}} \le C\alpha^{2\mu}, \]
if $q < \frac{m}{2} - \mu$. Using the spectral representation, the upper estimates in (2.39) (with $m = 1$) and (2.41) (with $m = 2$) follow from the first estimate, while (2.40) and (2.42) are obtained similarly from the second. For the lower bound, we estimate
Z kAk2+ 2 2 δ 1 2q α δ 2 ψMHD(α, y − y ) = 2q+1 λ 2 dkFλ(y − y )k α 0 (λ + α) Z α Z kAk2+ 1 2q δ 2 1 2q−2 δ 2 ≥ C 2q+1 λ dkFλ(y − y )k + C 2q−1 λ dkFλ(y − y )k , α 0 α α (2.44)
57 for all α ∈ (0, αmax), since for λ ≤ α, it follows that α/(λ + α) ≥ C and for λ ≥ α, one has α/(λ + α) ≥ Cα/λ, yielding the estimate above. Conversely, Z α λ kxδ − x k2 = dkF (y − yδ)k2 α α (λ + α)2 λ 0 (2.45) Z α Z kAk2+ 1 λ δ 2 −1 δ 2 ≤ C dkFλ(y − y )k + C λ dkFλ(y − y )k . α 0 α α R α Since 2q − 1 ≤ 0, we observe that the term with 0 in the above inequal- ity is bounded by the corresponding term in (2.44). Thus, using the noise condition, the second term can be bounded by the first one of (2.44). For the lower bound of the modified Hanke-Raus functional, we can estimate
\[
\frac{\lambda^{2q}\alpha^{2}}{\alpha^{2q}(\lambda + \alpha)^{3}} \ \ge\ C\begin{cases}\dfrac{\lambda^{2q}}{\alpha^{2q+1}} & \text{if } \lambda \le \alpha,\\[1.5ex] \dfrac{\lambda^{2q-3}}{\alpha^{2q-2}} & \text{if } \lambda \ge \alpha.\end{cases}
\]
Now, using $\mathcal{N}_{2q}$ and (2.45), we can estimate $\|x_\alpha^\delta - x_\alpha\|$ by the part of $\psi_{\mathrm{MHR}}(\alpha, y^\delta - y)$ restricted to $\lambda \le \alpha$. The part for $\lambda \ge \alpha$ can then be estimated from below by $0$, i.e.,
\[
\psi_{\mathrm{MHR}}^{2}(\alpha, y^\delta - y) \ \ge\ C\frac{1}{\alpha^{2q+1}}\int_0^{\alpha}\lambda^{2q}\,d\|F_\lambda(y^\delta - y)\|^2 \ \overset{(2.35)}{\ge}\ C\int_\alpha^{\|A\|^2+}\lambda^{-1}\,d\|F_\lambda(y^\delta - y)\|^2,
\]
for all $\alpha \in (0, \alpha_{\max})$.
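The elementary estimate (2.43), on which the upper bounds above rest, is easy to sanity-check numerically. The following Python sketch (our own illustration, not part of the thesis) verifies it with the explicit constant $C = 1$ for $t \in [0, 1]$; for $t > 1$ the bound still holds on a bounded spectral interval, with a constant depending on the upper bound for $\lambda$.

```python
import numpy as np

# Verify lambda^t / (alpha + lambda) <= C / alpha^{max(1-t, 0)} with C = 1
# for t in [0, 1] on a logarithmic grid of alpha and lambda.
alphas = np.logspace(-8, 0, 40)
lambdas = np.logspace(-8, 0, 40)
A, L = np.meshgrid(alphas, lambdas)
for t in np.linspace(0.0, 1.0, 11):
    lhs = L**t / (A + L)
    rhs = 1.0 / A ** max(1.0 - t, 0.0)
    assert np.all(lhs <= rhs + 1e-12)
print("estimate (2.43) holds with C = 1 for t in [0, 1]")
```

The underlying proof is one line: $\lambda^t \le (\alpha+\lambda)^t$ for $t \ge 0$, hence $\lambda^t/(\alpha+\lambda) \le (\alpha+\lambda)^{t-1} \le \alpha^{t-1}$ whenever $t \le 1$.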
Note that our parameters satisfy $\mu \ge 0$ and $p \in [0, \tfrac12]$; hence the restrictions on the parameter $q$ reduce to $q \le \tfrac12 - \mu$ or $q \le 1 - \mu$ for the modified heuristic discrepancy or Hanke--Raus rules, respectively. Since, by definition, $q \ge p$ must hold in any case, the smoothness index is restricted to $\mu \in [0, \tfrac12 - p]$ for $\psi_{\mathrm{MHD}}$ and $\mu \in [0, 1 - p]$ for $\psi_{\mathrm{MHR}}$, respectively; only then does a choice of $q$ exist that satisfies the conditions of the previous lemma. Observe that the interval for $\mu$ is smaller for $\psi_{\mathrm{MHD}}$, which also illustrates a saturation effect of the discrepancy-based rules that is well known in the standard noise case, i.e., $p = 0$ (see Section 2.1).
Theorem 8. Let $x^\dagger$ satisfy the source condition (2.5) and, in addition, suppose that
\[
\begin{aligned}
&y - y^\delta \in \mathcal{N}_1 \text{ and } A^*y \ne 0, && \psi = \psi_{\mathrm{QO}},\\
&(y - y^\delta) \in \mathcal{N}_{2q},\ \mu \in [0, \tfrac12 - p],\ q \in [p, \tfrac12 - \mu],\ (AA^*)^q y \ne 0, && \psi = \psi_{\mathrm{MHD}},\\
&(y - y^\delta) \in \mathcal{N}_{2q},\ \mu \in [0, 1 - p],\ q \in [p, 1 - \mu],\ (AA^*)^q y \ne 0, && \psi = \psi_{\mathrm{MHR}}.
\end{aligned}
\]
Then
\[
\|x_{\alpha_*}^\delta - x^\dagger\| = \begin{cases}
O\big(\tau^{\frac{2\mu}{2\mu+2p+1}\mu}\big), & \text{if } \psi = \psi_{\mathrm{QO}},\\[0.5ex]
O\big(\tau^{\frac{2\mu}{2\mu+2p+1}\cdot\frac{2\mu}{1-2q}}\big), & \text{if } \psi = \psi_{\mathrm{MHD}},\\[0.5ex]
O\big(\tau^{\frac{2\mu}{2\mu+2p+1}\cdot\frac{\mu}{1-q}}\big), & \text{if } \psi = \psi_{\mathrm{MHR}},
\end{cases}
\]
as $\tau \to 0$.
Proof. We treat the different parameter choice rules separately:
• From the definition of $\alpha_*$ and the triangle inequality, it follows, with $\alpha = \tau^{\frac{2}{2\mu+2p+1}}$, that
\[
\psi_{\mathrm{QO}}(\alpha_*, y^\delta) \le \psi_{\mathrm{QO}}(\alpha, y^\delta) \le \psi_{\mathrm{QO}}(\alpha, y^\delta - y) + \psi_{\mathrm{QO}}(\alpha, y)
\le \|x_\alpha - x^\dagger\| + \|x_\alpha^\delta - x_\alpha\| \le C\alpha^{\mu} + C\frac{\tau}{\alpha^{p+\frac12}} = O\big(\tau^{\frac{2\mu}{2\mu+2p+1}}\big).
\]
By the triangle inequality and (2.37), (2.38) of Lemma 24,
\[
\|x_{\alpha_*}^\delta - x^\dagger\| \le \|x_{\alpha_*} - x^\dagger\| + \|x_{\alpha_*}^\delta - x_{\alpha_*}\|
= O\big(\|x_{\alpha_*} - x^\dagger\| + \psi_{\mathrm{QO}}(\alpha_*, y^\delta - y)\big)
\le O\big(\|x_{\alpha_*} - x^\dagger\| + \psi_{\mathrm{QO}}(\alpha_*, y^\delta) + \psi_{\mathrm{QO}}(\alpha_*, y)\big)
= O\big(\alpha_*^{\mu} + \tau^{\frac{2\mu}{2\mu+2p+1}}\big).
\]
Note that
\[
\psi_{\mathrm{QO}}^{2}(\alpha, y^\delta) \ge \alpha^2\int_0^{\|A\|^2}\frac{\lambda}{(\lambda + \|A\|^2)^4}\,d\|F_\lambda y^\delta\|^2
\ge \frac{\alpha^2}{(2\|A\|^2)^4}\int_0^{\|A\|^2}\lambda\,d\|F_\lambda y^\delta\|^2
\ge \frac{\alpha^2}{(2\|A\|^2)^4}\big(\|A^*y\| - \|AA^*\|^{\frac12-p}\tau\big)^2 \ge C\alpha^2, \tag{2.46}
\]
for all $\alpha \in (0, \alpha_{\max})$ and $\tau$ sufficiently small. Hence, for $\alpha = \alpha_*$, it follows that $\alpha_* \le C\tau^{\frac{2\mu}{2\mu+2p+1}}$. Therefore, we may deduce that
\[
\|x_{\alpha_*}^\delta - x^\dagger\| = O\big(\tau^{\frac{2\mu}{2\mu+2p+1}\mu}\big), \quad \text{as } \tau \to 0.
\]
• Note that from $(AA^*)^q y \ne 0$, we may conclude, as in (2.46), that
\[
\alpha_* \le C\,\psi_{\mathrm{MHD}}(\alpha_*, y^\delta)^{\frac{1}{\frac12 - q}} = O\big(\tau^{\frac{2\mu}{2\mu+2p+1}\cdot\frac{2}{1-2q}}\big).
\]
Then it follows, as above, from (2.39) and (2.40) that
\[
\|x_{\alpha_*}^\delta - x^\dagger\| \le \|x_{\alpha_*} - x^\dagger\| + \|x_{\alpha_*}^\delta - x_{\alpha_*}\|
= O\big(\alpha_*^{\mu} + \psi_{\mathrm{MHD}}(\alpha_*, y^\delta - y)\big)
= O\Big(\alpha_*^{\mu} + \alpha^{\mu} + \frac{\tau}{\alpha^{p+\frac12}}\Big)
= O\big(\tau^{\frac{2\mu}{2\mu+2p+1}\cdot\frac{2\mu}{1-2q}} + \tau^{\frac{2\mu}{2\mu+2p+1}}\big),
\]
as $\tau \to 0$.
• One may similarly verify that if $\|(AA^*)^q y\| \ge C$, then
\[
\alpha_* \le C\,\psi_{\mathrm{MHR}}(\alpha_*, y^\delta)^{\frac{1}{1-q}}.
\]
Therefore,
\[
\|x_{\alpha_*}^\delta - x^\dagger\| \le \|x_{\alpha_*}^\delta - x_{\alpha_*}\| + \|x_{\alpha_*} - x^\dagger\|
= O\big(\alpha_*^{\mu} + \psi_{\mathrm{MHR}}(\alpha, y) + \psi_{\mathrm{MHR}}(\alpha, y^\delta - y)\big)
= O\big(\psi_{\mathrm{MHR}}(\alpha, y^\delta)^{\frac{\mu}{1-q}} + \tau^{\frac{2\mu}{2\mu+2p+1}}\big)
= O\big(\tau^{\frac{2\mu}{2\mu+2p+1}\cdot\frac{\mu}{1-q}}\big).
\]
For the quasi-optimality rule, one may notice that the above convergence rates are optimal in the saturation case $\mu = 1$, but only suboptimal for $\mu < 1$ (similarly to Section 2.1). Let us further discuss the assumptions in this theorem: for the modified heuristic discrepancy rule, the first condition on $q$ is not particularly restrictive. However, the requirement $q \le \tfrac12 - \mu$ implies that $\mu \le \tfrac12 - q$, which means that we obtain a saturation at $\mu = \tfrac12 - q$. This is akin to the bounded noise case ($q = 0$), where this method saturates at $\mu = \tfrac12$ (see Section 2.1). It is well known that a similar phenomenon occurs for the non-heuristic analogue of this method, namely the discrepancy principle [35]. In contrast to the modified discrepancy rule, we observe that the saturation for the modified Hanke--Raus rule occurs at $\mu = 1 - q$. Hence, again analogously to the bounded noise case (and to the non-heuristic case), the modified Hanke--Raus method yields convergence rates for a wider range of smoothness classes. We may, however, impose an additional condition, as before, in order to achieve an optimal convergence rate. More specifically, since it is independent of the noise, we may consider the condition from the standard theory described before, namely (2.26).
Theorem 9. Let the assumptions of the previous theorem hold and let $\alpha_*$ be selected according to either the quasi-optimality, modified heuristic discrepancy, or modified Hanke--Raus rule. Then, assuming the regularity condition (2.26), it follows that
\[
\|x_{\alpha_*}^\delta - x^\dagger\| = O\big(\tau^{\frac{2\mu}{2\mu+2p+1}}\big), \quad \text{as } \tau \to 0.
\]
Proof. The proof for the quasi-optimality rule is analogous to the standard theory (see Corollary 6), so we omit it. For the modified heuristic discrepancy and Hanke--Raus rules, we show that the regularity condition implies that $\psi(\alpha, y) \ge C\|x_\alpha - x^\dagger\|$. Recall that
\[
\|x_\alpha - x^\dagger\|^2 = \int_0^{\|A\|^2+}\frac{\alpha^2}{(\alpha+\lambda)^2}\,d\|E_\lambda x^\dagger\|^2
\le C\int_0^{\alpha} d\|E_\lambda x^\dagger\|^2 + C\alpha^2\int_\alpha^{\|A\|^2+}\frac{1}{\lambda^2}\,d\|E_\lambda x^\dagger\|^2. \tag{2.47}
\]
For $\lambda \ge \alpha$, we have the following estimate for $m > 0$ with a constant $C_{q,m}$:
\[
\frac{\lambda^{2q}\lambda\alpha^{m}}{\alpha^{2q}(\lambda + \alpha)^{m+1}} \ \ge\ \frac{\lambda^{2q}\lambda\alpha^{m}}{\lambda^{2q}(\lambda + \lambda)^{m+1}} \ =\ C_{q,m}\frac{\alpha^{m}}{\lambda^{m}} \ \ge\ C_{q,m}\frac{\alpha^{m+1}}{\lambda^{m+1}}.
\]
Thus, for $\psi = \psi_{\mathrm{MHD}}$ taking $m = 1$, and for $\psi = \psi_{\mathrm{MHR}}$ taking $m = 2$, we obtain
\[
\psi(\alpha, y)^2 \ \ge\ \int_0^{\|A\|^2+}\frac{\lambda^{2q}\lambda\alpha^{m}}{\alpha^{2q}(\lambda + \alpha)^{m+1}}\,d\|E_\lambda x^\dagger\|^2
\ \ge\ \int_\alpha^{\|A\|^2+}\frac{\lambda^{2q}\lambda\alpha^{m}}{\alpha^{2q}(\lambda + \alpha)^{m+1}}\,d\|E_\lambda x^\dagger\|^2
\ \ge\ C\int_\alpha^{\|A\|^2+}\frac{\alpha^2}{\lambda^2}\,d\|E_\lambda x^\dagger\|^2.
\]
By the regularity condition, the first integral in the upper bound in (2.47) can be estimated by the second part, which agrees up to a constant with the lower bound for $\psi(\alpha, y)^2$ in both cases. In the proof of Theorem 8, the estimate $\|x_{\alpha_*} - x^\dagger\| \le C\alpha_*^{\mu}$ can then be replaced by $\|x_{\alpha_*} - x^\dagger\| \le C\psi(\alpha_*, y)$, which leads to the optimal rate.
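To make the abstract rules concrete, here is a small self-contained Python experiment (the discretisation, noise level and grid are our own ad-hoc choices, not taken from the thesis) that runs spectral Tikhonov regularisation and selects $\alpha$ by minimising a discrete quasi-optimality functional $\|x_{\alpha_{k+1}}^\delta - x_{\alpha_k}^\delta\|$ over a geometric grid:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 80
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
s = 0.8 ** np.arange(n)                     # geometrically decaying singular values
A = U @ np.diag(s) @ V.T
x_true = V @ (s * rng.standard_normal(n))   # a smoothness (source-type) assumption
y_delta = A @ x_true + 1e-3 * rng.standard_normal(n)

def tikhonov(alpha):
    # x_alpha^delta = (A^T A + alpha I)^{-1} A^T y^delta, computed via the SVD
    return V @ ((s / (s**2 + alpha)) * (U.T @ y_delta))

alphas = np.logspace(-10, 0, 100)
xs = [tikhonov(a) for a in alphas]
# discrete quasi-optimality functional on a geometric grid
psi = [np.linalg.norm(xs[k + 1] - xs[k]) for k in range(len(alphas) - 1)]
k_star = int(np.argmin(psi))

errs = [np.linalg.norm(xk - x_true) for xk in xs]
print("selected alpha:", alphas[k_star])
print("error at selection vs. best over grid:", errs[k_star], min(errs))
```

No claim of optimality is made for this toy run; the point is merely that the selection uses no knowledge of the noise level, in line with the heuristic rules analysed above.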
2.2.2 Predictive Mean-Square Error

The predictive mean-square error functional (cf. [89]) is given by
\[
\psi_{\mathrm{PMS}}(\alpha, y^\delta) := \|Ax_\alpha^\delta - y\|.
\]
It is clear that this is not an implementable parameter choice rule, as the functional to be minimised requires knowledge of the exact data $y$. The motivation for studying it, however, becomes apparent when we consider the generalised cross-validation rule. Indeed, if we now turn our attention to the finite-dimensional ill-conditioned problem
Anx = yn,
for $A_n : X \to \mathbb{R}^n$, then we can define the generalised cross-validation functional by
\[
\psi_{\mathrm{GCV}}(\alpha, y_n^\delta) := \frac{1}{\rho(\alpha)}\|A_n x_\alpha^\delta - y_n^\delta\|,
\]
with
\[
\rho(\alpha) := \frac{\alpha}{n}\operatorname{tr}\big((A_nA_n^* + \alpha I_n)^{-1}\big). \tag{2.48}
\]
For i.i.d. noise, the expected value of $\psi_{\mathrm{GCV}}^2(\alpha, y_n^\delta) - \|e\|^2$ estimates the predictive mean-square error functional squared (cf. [105, 152]).
The predictive mean-square error functional differs from the previous ones in the sense that it has different upper bounds. In fact, from (2.30), one immediately finds that
\[
\psi_{\mathrm{PMS}}^2(\alpha, y^\delta) \le C\frac{\tau^2}{\alpha^{2p}} + C\alpha^{2\mu+1},
\]
for $\mu \le \tfrac12$. It can easily be seen that for strongly bounded noise $\|y^\delta - y\| < \infty$, the method fails, as it selects $\alpha_* = 0$. The minimum of the upper bound is attained at $\alpha = \alpha_{\mathrm{opt}} = O\big(\tau^{\frac{2}{2p+2\mu+1}}\big)$, but the resulting rate is of the order
\[
\psi_{\mathrm{PMS}}^2(\alpha, y^\delta) \le C\Big[\tau^{\frac{2\mu+1}{2p+2\mu+1}}\Big]^2,
\]
which agrees with the optimal rate for the error in the $A$-norm, i.e., $\|x_\alpha^\delta - x^\dagger\|_A := \|A(x_\alpha^\delta - x^\dagger)\|$. Thus, for this method, it is not reasonable to bound the functional $\psi_{\mathrm{PMS}}$ by expressions involving $\|x_\alpha^\delta - x_\alpha\|$ or $\|x_\alpha - x^\dagger\|$. Rather, we try to directly relate the selected regularisation parameter $\alpha_*$ to the optimal choice $\alpha_{\mathrm{opt}}$. To do so, we need some estimates from below, although in this case we will need to introduce a noise condition of a different type and an additional condition on the exact solution.

Lemma 3. Suppose that there exists a positive constant $C$ such that the noise $y^\delta - y$ satisfies
\[
\int_\alpha^\infty d\|F_\lambda Q(y^\delta - y)\|^2 \ge C\frac{\tau^2}{\alpha^{2p-\varepsilon}}, \tag{2.49}
\]
for all $\alpha \in (0, \alpha_{\max})$ and $\varepsilon > 0$ small. Then
\[
\|A(x_\alpha^\delta - x_\alpha)\| \ge C\frac{\tau}{\alpha^{p-\frac{\varepsilon}{2}}}.
\]
Proof. From (2.49), one can estimate
\[
\|A(x_\alpha^\delta - x_\alpha)\|^2 = \int_0^\infty\frac{\lambda^2}{(\alpha+\lambda)^2}\,d\|F_\lambda Q(y^\delta - y)\|^2
\ge C\int_\alpha^\infty d\|F_\lambda Q(y^\delta - y)\|^2 \ge C\frac{\tau^2}{\alpha^{2p-\varepsilon}}.
\]
Let us exemplify condition (2.49): for the case in (2.36), we have that
\[
\int_{\lambda \ge \alpha} d\|F_\lambda Q(y^\delta - y)\|^2 = \sum_{1 \le i \le N_*}\frac{1}{i^\beta} \sim \int_1^{N_*}\frac{1}{x^\beta}\,dx = \begin{cases} CN_*^{1-\beta} & \text{if } 1 - \beta > 0,\\ C & \text{if } 1 - \beta < 0,\end{cases}
\]
with $N_* = \big(\frac{1}{\alpha}\big)^{\frac{1}{\gamma}}$. This gives that the left-hand side is of the order of $\alpha^{-\frac{1-\beta}{\gamma}}$. For (2.49) to hold true, we require that $\frac{1-\beta}{\gamma} \ge 2p - \varepsilon$, which means that $1 + \varepsilon\gamma \ge \beta + 2p\gamma$.
If we now choose $p$ close to the smallest admissible exponent for the weakly bounded noise condition, i.e., $2p\gamma = 1 - \beta + \varepsilon\gamma$ with $\varepsilon$ small, then the condition holds. In other words, our interpretation of the stated noise condition is that $\|(AA^*)^p(y^\delta - y)\| < \infty$ and $p$ is selected as the minimal exponent such that this holds. This noise condition automatically excludes the (strongly) bounded noise case. The example also shows that the desired inequality with $\varepsilon = 0$ cannot be achieved.
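The partial-sum asymptotics invoked in this example are easily confirmed numerically; a quick Python check (the values of $\beta$ are arbitrary illustrative choices):

```python
import numpy as np

# sum_{i<=N} i^{-beta} ~ N^{1-beta}/(1-beta) for beta < 1, and is bounded
# (by zeta(beta)) for beta > 1 -- the case distinction used above.
N = 10**6
i = np.arange(1, N + 1, dtype=float)
s_half = np.sum(i ** -0.5)                  # beta = 1/2 < 1
assert abs(s_half / (N**0.5 / 0.5) - 1.0) < 0.01
s_three_halves = np.sum(i ** -1.5)          # beta = 3/2 > 1
assert s_three_halves < 3.0                 # zeta(3/2) ~ 2.612
print("case distinction for the partial sums confirmed")
```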
Theorem 10. Let $\mu \le \tfrac12$, let $\alpha_*$ be the minimiser of $\psi_{\mathrm{PMS}}(\alpha, y^\delta)$, and assume that the noise satisfies (2.49) and that $Ax^\dagger \ne 0$. Then
\[
\|x_{\alpha_*}^\delta - x^\dagger\| \le \begin{cases} C\tau^{\frac{2\mu}{2\mu+2p+1}\cdot\frac{2\mu+1}{2}}, & \text{if } \alpha_* \ge \alpha_{\mathrm{opt}},\\[1ex] C\tau^{\frac{2\mu}{2\mu+2p+1}-\frac{\varepsilon(2p+1)}{(2p-\varepsilon)(2\mu+2p+1)}}, & \text{if } \alpha_* \le \alpha_{\mathrm{opt}}.\end{cases}
\]
If additionally, for some $\varepsilon_2 > 0$,
\[
\int_\alpha^\infty \lambda^{2\mu-1}\,d\|E_\lambda\omega\|^2 \ge C\alpha^{2\mu-1+\varepsilon_2}, \tag{2.50}
\]
then in the first case we have
\[
\|x_{\alpha_*}^\delta - x^\dagger\| \le C\tau^{\frac{2\mu}{2\mu+2p+1}\cdot\frac{2\mu+1}{2\mu+1+\varepsilon_2}}, \quad \text{if } \alpha_* \ge \alpha_{\mathrm{opt}}.
\]
Proof. If $\alpha_* \ge \alpha_{\mathrm{opt}}$, it follows from $Ax^\dagger \ne 0$ that
\[
\|Ax_\alpha - y\|^2 \ge C\alpha^2, \tag{2.51}
\]
and if (2.50) holds, then one even has
\[
\|A(x_\alpha - x^\dagger)\|^2 \ge \int_\alpha^\infty\frac{\lambda^{1+2\mu}\alpha^2}{(\alpha+\lambda)^2}\,d\|E_\lambda\omega\|^2
\ge C\alpha^2\int_\alpha^\infty\lambda^{2\mu-1}\,d\|E_\lambda\omega\|^2 \ge C\alpha^{2\mu+1+\varepsilon_2}. \tag{2.52}
\]
Since $\alpha \mapsto \|A(x_\alpha^\delta - x_\alpha)\|^2$ is a monotonically decreasing function, and using Young's inequality [36], we may obtain that
\[
C\alpha_*^{t} \le \|A(x_{\alpha_{\mathrm{opt}}}^\delta - x_{\alpha_{\mathrm{opt}}})\|^2 + \|Ax_{\alpha_{\mathrm{opt}}} - y\|^2 \le C\Big[\tau^{\frac{2\mu+1}{2\mu+2p+1}}\Big]^2,
\]
i.e.,
\[
\alpha_* \le C\tau^{\frac{2\mu+1}{2\mu+2p+1}\cdot\frac{2}{t}},
\]
where $t = 2$, or $t = 2\mu + 1 + \varepsilon_2$ if (2.50) holds. If $\alpha_* \le \alpha_{\mathrm{opt}}$, then we may bound the functional from below as
\[
\psi_{\mathrm{PMS}}^2(\alpha, y^\delta) \ge \frac12\|A(x_\alpha^\delta - x_\alpha)\|^2 - \|Ax_\alpha - y\|^2,
\]
for all $\alpha \in (0, \alpha_{\max})$, which allows us to obtain
\[
\frac12\|A(x_{\alpha_*}^\delta - x_{\alpha_*})\|^2 - \|Ax_{\alpha_*} - y\|^2 \le \psi_{\mathrm{PMS}}^2(\alpha_*, y^\delta) \le \psi_{\mathrm{PMS}}^2(\alpha_{\mathrm{opt}}, y^\delta)
\le 2\|A(x_{\alpha_{\mathrm{opt}}}^\delta - x_{\alpha_{\mathrm{opt}}})\|^2 + 2\|Ax_{\alpha_{\mathrm{opt}}} - y\|^2 \le C\Big[\tau^{\frac{2\mu+1}{2\mu+2p+1}}\Big]^2,
\]
i.e., by Lemma 3,
\[
C\frac{\tau^2}{\alpha_*^{2p-\varepsilon}} - C\alpha_*^{2\mu+1} \le \frac12\|A(x_{\alpha_*}^\delta - x_{\alpha_*})\|^2 - \|Ax_{\alpha_*} - y\|^2 \le C\Big[\tau^{\frac{2\mu+1}{2\mu+2p+1}}\Big]^2.
\]
Now, from $\alpha_* \le \alpha_{\mathrm{opt}}$, we get
\[
C\frac{\tau^2}{\alpha_*^{2p-\varepsilon}} \le C\Big[\tau^{\frac{2\mu+1}{2\mu+2p+1}}\Big]^2 + C\alpha_*^{2\mu+1}
\le C\Big[\tau^{\frac{2\mu+1}{2\mu+2p+1}}\Big]^2 + C\alpha_{\mathrm{opt}}^{2\mu+1} \le C\Big[\tau^{\frac{2\mu+1}{2\mu+2p+1}}\Big]^2,
\]
i.e.,
\[
\alpha_*^{2p-\varepsilon} \ge C\Big[\tau^{\frac{2p}{2\mu+2p+1}}\Big]^2 \iff \alpha_* \ge C\tau^{\frac{2}{2\mu+2p+1}\cdot\frac{2p}{2p-\varepsilon}}.
\]
Then inserting the respective bounds for $\alpha_*$ into (1.15) yields the desired rates.
Condition (2.50) can again be verified, as we did for the noise condition, for some canonical examples. The inequality with $\varepsilon_2 = 0$ does not usually hold. The condition can be interpreted as the claim that $x^\dagger$ satisfies a source condition with a certain $\mu$, but that this exponent cannot be increased, i.e., $x^\dagger \notin \operatorname{range}\big((A^*A)^{\mu+\varepsilon_2}\big)$. A similar condition was used by Lukas in his analysis of the generalised cross-validation rule [105].
The theorem shows that we may obtain almost optimal convergence results, but only under rather restrictive conditions. Moreover, the method shows a saturation effect at $\mu = \tfrac12$, comparable to the heuristic discrepancy rule.
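The oracle nature of $\psi_{\mathrm{PMS}}$ is easy to see in a simulation: it can be evaluated only because the exact data $y$ are available to us. The following Python sketch (toy problem of our own choosing, not from the thesis) compares its minimiser with the error-optimal parameter:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 60
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
s = 0.75 ** np.arange(n)
A = U @ np.diag(s) @ V.T
x_true = V @ (s * rng.standard_normal(n))
y = A @ x_true                               # exact data: needed by psi_PMS
y_delta = y + 1e-3 * rng.standard_normal(n)

def x_alpha(alpha):
    return V @ ((s / (s**2 + alpha)) * (U.T @ y_delta))

alphas = np.logspace(-8, 1, 150)
psi_pms = [np.linalg.norm(A @ x_alpha(a) - y) for a in alphas]   # oracle rule
err = [np.linalg.norm(x_alpha(a) - x_true) for a in alphas]
a_pms = alphas[int(np.argmin(psi_pms))]
a_opt = alphas[int(np.argmin(err))]
print(f"PMS-selected alpha: {a_pms:.2e}, error-optimal alpha: {a_opt:.2e}")
```

Note that $\psi_{\mathrm{PMS}}$ measures the error in the $A$-norm, so its minimiser need not coincide with the minimiser of the norm error, consistent with the discussion above.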
2.2.3 Generalised Cross-Validation

The generalised cross-validation rule was proposed and studied in particular by Wahba [152], and is most popular in a statistical context [125], but less so for deterministic inverse problems. It is derived from the cross-validation method by combining the associated estimates with certain weights. Importantly, it was shown in [152] that the expected value of the generalised cross-validation functional converges to the expected value of the PMS-functional as the dimension tends to infinity. This is why, in the preceding section, we studied $\psi_{\mathrm{PMS}}$ in detail.
Proposition 16. Let $\sup_n \operatorname{tr}(A_nA_n^*) < \infty$; then it follows that the weight $\rho(\alpha)$ in $\psi_{\mathrm{GCV}}$ is monotonically increasing with $\rho(0) = 0$ and bounded with $\rho(\alpha) \le 1$. Furthermore, for $\alpha > 0$, it follows that $\rho(\alpha) \to 1$ as the dimension $n \to \infty$.
Proof. Observe that, from the definition of $\rho$, namely (2.48), we may derive
\[
\rho(\alpha) = \frac{\alpha}{n}\operatorname{tr}\big((A_n^*A_n + \alpha I)^{-1}\big) = \frac1n\sum_{i=1}^n\frac{\alpha}{\alpha+\lambda_i} = 1 - \frac1n\sum_{i=1}^n\frac{\lambda_i}{\alpha+\lambda_i}. \tag{2.53}
\]
Moreover, it is clear that $\rho(\alpha) \le 1$ for all $\alpha > 0$, since the second term is positive. Additionally, since
\[
\lim_{n\to\infty}\sum_{i=1}^n\frac{\lambda_i}{\alpha+\lambda_i} = \sum_{i=1}^\infty\frac{\lambda_i}{\alpha+\lambda_i} \le \sup_n\operatorname{tr}(A_n^*A_n)\,\frac{1}{\alpha}
\]
is clearly bounded, as per the assumption on the supremum in the statement of the proposition, and since the series is clearly convergent, taking the limit $1/n \to 0$ as $n \to \infty$ yields that the second term of (2.53) converges to $0$ as $n \to \infty$. Thus, $\rho(\alpha) \to 1$ as $n \to \infty$. The fact that $\rho(0) = 0$ is obvious.
This is also the reason why one has to study the GCV in terms of weakly bounded noise: the limit $\lim_{n\to\infty}\psi_{\mathrm{GCV}}$ tends pointwise to the residual $\|Ax_\alpha^\delta - y^\delta\|$, which in the bounded noise case does not yield a reasonable parameter choice, as then $\alpha_* = 0$ is always chosen. Note that in a stochastic context, and using the expected value of $\psi_{\mathrm{GCV}}$, a convergence analysis has been carried out by Lukas [105]. In contrast, we analyse the deterministic case. We now consider the ill-conditioned problem
Anx = yn, (2.54)
where we only have noisy data $y_n^\delta \in \mathbb{R}^n$.
We impose a discretisation-independent source condition; that is,
\[
x^\dagger = (A_n^*A_n)^{\mu}\omega, \qquad \|\omega\| \le C, \qquad 0 < \mu \le 1,
\]
where $C$ does not depend on the dimension $n$. Furthermore, let us restate some definitions for this discrete setting:
\[
\delta_n := \|y_n - y_n^\delta\|, \qquad \tau^2 := \sum_{i=1}^n \lambda_i^{2p}\,|\langle y_n - y_n^\delta, u_i\rangle|^2.
\]
Note that in an asymptotically weakly bounded noise case, we might assume that $\tau$ is bounded independently of $n$, while $\delta_n$ might be unbounded as $n$ tends to infinity. Moreover, we impose a noise condition of a similar type as for the predictive mean-square error, but slightly different:
\[
\sum_{\lambda_i \ge \alpha}\frac{\alpha^2}{\lambda_i^2}\,|\langle y_n - y_n^\delta, u_i\rangle|^2 \ge C\frac{\tau^2}{\alpha^{2p-\varepsilon}}, \quad \text{for all } \alpha \in I, \tag{2.55}
\]
where $C$ does not depend on $n$. This is different from the condition stated in [89], as the author of this thesis found that the condition there was false; it is thus corrected here with an arguably much more restrictive condition (which is nevertheless needed for the subsequent results of this section). Note that in the discrete case, one must restrict the noise condition to an interval $I = [\alpha_{\min}, \alpha_{\max}]$ with $\alpha_{\min} > 0$. We also state a regularity condition
\[
\sum_{\lambda_i \ge \alpha}\lambda_i^{2\mu-1}\,|\langle\omega, v_i\rangle|^2 \ge C\alpha^{2\mu-1+\varepsilon_2}, \quad \text{for all } \alpha \in I, \tag{2.56}
\]
where $\{v_i\}$ denote the eigenfunctions of $A_n^*A_n$. In order to deduce convergence rates, we look to bound the functional from above, as we did for the other functionals in the previous sections:

Lemma 4. For $y_n^\delta \in \mathbb{R}^n$, there exist positive constants such that
\[
\psi_{\mathrm{GCV}}^2(\alpha, y_n) \le C\frac{1}{\rho^2(\alpha)}\alpha^{2\mu+1}, \quad \mu \le \tfrac12, \tag{2.57}
\]
\[
\psi_{\mathrm{GCV}}^2(\alpha, y_n^\delta - y_n) \le C\frac{1}{\rho^2(\alpha)}\delta_n^2, \tag{2.58}
\]
\[
\psi_{\mathrm{GCV}}^2(\alpha, y_n^\delta) \le C\frac{1}{\rho^2(\alpha)}\big(\alpha^{2\mu+1} + \delta_n^2\big), \quad \mu \le \tfrac12. \tag{2.59}
\]
Proof. It is a standard result (cf. [35]) that
\[
\|A_nx_\alpha^\delta - y_n^\delta - (A_nx_\alpha - y_n)\| \le \|y_n^\delta - y_n\| \le \delta_n.
\]
Similarly, by the usual source condition, we obtain $\|A_nx_\alpha - y_n\|^2 \le C\alpha^{2\mu+1}$ for $\mu \le \tfrac12$ (see Proposition 7). The result follows from the triangle inequality.
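In the discrete setting, $\rho(\alpha)$ from (2.48) and the GCV functional are cheap to evaluate through the SVD; a minimal Python sketch (the test problem and noise level are our own assumptions, not the thesis' experiments):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
s = 1.0 / np.arange(1, n + 1) ** 2           # polynomially ill-conditioned
A = U @ np.diag(s) @ V.T
x_true = V @ rng.standard_normal(n)
y_delta = A @ x_true + 1e-4 * rng.standard_normal(n)

def rho(alpha):
    # rho(alpha) = (alpha/n) tr((A A^* + alpha I)^{-1}) = (1/n) sum_i alpha/(alpha + s_i^2)
    return np.mean(alpha / (alpha + s**2))

def psi_gcv(alpha):
    x = V @ ((s / (s**2 + alpha)) * (U.T @ y_delta))
    return np.linalg.norm(A @ x - y_delta) / rho(alpha)

alphas = np.logspace(-10, 0, 120)
a_star = alphas[int(np.argmin([psi_gcv(a) for a in alphas]))]
print("GCV-selected alpha:", a_star)
```

As Proposition 16 below states, $\rho$ is monotonically increasing with $\rho(\alpha) \le 1$, which the sketch confirms pointwise for any fixed dimension.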
The remaining results generally follow from the infinite-dimensional setting, and we similarly obtain the following bounds from below:
Lemma 5. Suppose that α ∈ I and also that (2.55) holds. Then
\[
\psi_{\mathrm{GCV}}^2(\alpha, y_n^\delta - y_n) \ge C\frac{1}{\rho^2(\alpha)}\frac{\tau^2}{\alpha^{2p-\varepsilon}}.
\]
Moreover, if $\|A_nx^\dagger\| \ge C_0$, with an $n$-independent constant, then there exists an $n$-independent constant $C$ with
\[
\psi_{\mathrm{GCV}}^2(\alpha, y_n) \ge C\frac{1}{\rho^2(\alpha)}\alpha^2.
\]
If (2.56) holds and $\alpha \in I$, then
\[
\psi_{\mathrm{GCV}}^2(\alpha, y_n) \ge C\frac{1}{\rho^2(\alpha)}\alpha^{2\mu+1+\varepsilon_2}, \quad \mu \le \tfrac12.
\]
Proof. We can estimate
\[
\psi_{\mathrm{GCV}}^2(\alpha, y_n^\delta - y_n) \ge \frac{1}{\rho^2(\alpha)}\sum_{\lambda_i \ge \alpha}\frac{\alpha^2}{(\lambda_i+\alpha)^2}\,|\langle y_n^\delta - y_n, u_i\rangle|^2
\ge C\frac{1}{\rho^2(\alpha)}\sum_{\lambda_i \ge \alpha}\frac{\alpha^2}{\lambda_i^2}\,|\langle y_n^\delta - y_n, u_i\rangle|^2
\overset{(2.55)}{\ge} C\frac{1}{\rho^2(\alpha)}\frac{\tau^2}{\alpha^{2p-\varepsilon}}.
\]
The remaining two inequalities in the lemma follow analogously to (2.51) and (2.52).
Theorem 11. Let $\mu \le \tfrac12$, assume $\alpha_*$ is the minimiser of $\psi_{\mathrm{GCV}}(\alpha, y_n^\delta)$, and suppose further that $\alpha_* \in I$ such that (2.55) holds. Then
\[
\alpha_* \ \ge\ \Big[\inf_{\alpha \ge \alpha_*}\big(C\alpha^{2\mu+1} + C\delta_n^2\big)\Big]^{-\frac{1}{2p-\varepsilon}}\tau^{\frac{2}{2p-\varepsilon}} \ \ge\ C\delta_n^{-\frac{2}{2p-\varepsilon}}\tau^{\frac{2}{2p-\varepsilon}}.
\]
On the other hand
\[
\alpha_* \ \le\ \Big[\inf_{\alpha \le \alpha_*}\frac{1}{\rho^2(\alpha)}\big(C\alpha^{2\mu+1} + C\delta_n^2\big)\Big]^{\frac{1}{t}},
\]
with $t = 2$. If $\alpha_* \in I$ and (2.56) holds, then the above upper bound on $\alpha_*$ holds with $t = 2\mu + 1 + \varepsilon_2$.
Proof. Take an arbitrary $\bar\alpha$ and consider first the case $\alpha_* \le \bar\alpha$. Following on from the previous lemmas and using (2.58), we have
\[
C\frac{1}{\rho^2(\alpha_*)}\frac{\tau^2}{\alpha_*^{2p-\varepsilon}} \le \psi_{\mathrm{GCV}}^2(\alpha_*, y_n^\delta - y_n)
\le C\big(\psi_{\mathrm{GCV}}^2(\alpha_*, y_n^\delta) + \psi_{\mathrm{GCV}}^2(\alpha_*, y_n)\big)
\le C\psi_{\mathrm{GCV}}^2(\bar\alpha, y_n^\delta) + C\frac{1}{\rho^2(\alpha_*)}\alpha_*^{2\mu+1}
\le \frac{C}{\rho^2(\bar\alpha)}\big(\bar\alpha^{2\mu+1} + \delta_n^2\big) + C\frac{1}{\rho^2(\alpha_*)}\alpha_*^{2\mu+1}.
\]
Hence, by the monotonicity of $\alpha \mapsto \alpha^{2\mu+1}$ and since $\rho$ is monotonically increasing (cf. Proposition 16), we obtain that
\[
C\frac{\tau^2}{\alpha_*^{2p-\varepsilon}} \le \big(C\bar\alpha^{2\mu+1} + \delta_n^2\big)\frac{\rho^2(\alpha_*)}{\rho^2(\bar\alpha)} + \alpha_*^{2\mu+1}
\le C\bar\alpha^{2\mu+1} + \delta_n^2 + \alpha_*^{2\mu+1} \le C\big(\bar\alpha^{2\mu+1} + \delta_n^2\big).
\]
Hence,
\[
\alpha_* \ \ge\ \Big[\inf_{\alpha \ge \alpha_*}\big(C\alpha^{2\mu+1} + C\delta_n^2\big)\Big]^{-\frac{1}{2p-\varepsilon}}\tau^{\frac{2}{2p-\varepsilon}} \ \ge\ C\delta_n^{-\frac{2}{2p-\varepsilon}}\tau^{\frac{2}{2p-\varepsilon}}.
\]
Now, suppose $\alpha_* \ge \bar\alpha$. Then, using that $\alpha_*$ is a minimiser, and with $t$ as in the statement of the theorem,
\[
\frac{C}{\rho^2(\alpha_*)}\alpha_*^{t} \le \psi_{\mathrm{GCV}}^2(\alpha_*, y_n)
\le C\big(\psi_{\mathrm{GCV}}^2(\alpha_*, y_n^\delta) + \psi_{\mathrm{GCV}}^2(\alpha_*, y_n^\delta - y_n)\big)
\le \frac{1}{\rho^2(\bar\alpha)}\big(C\bar\alpha^{2\mu+1} + C\delta_n^2\big) + C\frac{1}{\rho^2(\alpha_*)}\delta_n^2
\le \frac{1}{\rho^2(\bar\alpha)}\big(C\bar\alpha^{2\mu+1} + C\delta_n^2\big) + C\frac{1}{\rho^2(\bar\alpha)}\delta_n^2.
\]
Hence, as $\rho(\alpha_*)$ is bounded from above by $1$, it follows that
\[
\alpha_* \ \le\ \Big[\inf_{\alpha \le \alpha_*}\frac{1}{\rho^2(\alpha)}\big(C\alpha^{2\mu+1} + C\delta_n^2\big)\Big]^{\frac{1}{t}}.
\]
Theorem 12. Suppose that $\mu \le \tfrac12$, that $\alpha_*$ is the minimiser of $\psi_{\mathrm{GCV}}(\alpha, y_n^\delta)$, and that $\alpha_* \in I$ such that (2.55) and (2.56) are satisfied. Suppose further that $\rho\big(\delta_n^{\frac{2}{2\mu+1}}\big) \ge C$. Then
\[
\|x_{\alpha_*}^\delta - x^\dagger\| \ \le\ C\left(\delta_n^{\frac{2\mu}{t}} + \Big(\frac{\delta_n}{\tau}\Big)^{\frac{1}{2p-\varepsilon}}\delta_n\right),
\]
with $t$ as in Theorem 11.
Proof. Since
\[
\|x_{\alpha_*}^\delta - x^\dagger\| \le C\alpha_*^{\mu} + C\frac{\delta_n}{\sqrt{\alpha_*}},
\]
we may take the balancing parameter $\bar\alpha = \delta_n^{\frac{2}{2\mu+1}}$. If $\alpha_* \le \bar\alpha$, then by Theorem 11,
\[
\alpha_* \ \ge\ \frac{\tau^{\frac{2}{2p-\varepsilon}}}{\Big[\inf_{\alpha \ge \alpha_*}\big(C\alpha^{2\mu+1} + C\delta_n^2\big)\Big]^{\frac{1}{2p-\varepsilon}}} \ \ge\ C\Big(\frac{\tau}{\delta_n}\Big)^{\frac{2}{2p-\varepsilon}}.
\]
On the other hand, if $\alpha_* \ge \bar\alpha$ and $\rho(\bar\alpha) \ge C$, then
\[
\alpha_* \le C\delta_n^{\frac{2}{t}}.
\]
Thus, taking for $\alpha_*^{\mu}$ and $\delta_n/\sqrt{\alpha_*}$ the worst of these estimates, we obtain the desired result.
This result establishes convergence rates in the discrete case. However, the required conditions are somewhat restrictive, as the selected $\alpha_*$ has to lie in a certain interval (although this is to be expected in a finite-dimensional setting). Note that the term $\delta_n^2$ in Theorem 11 can be replaced by any reasonable monotonically decreasing upper bound for $\psi_{\mathrm{GCV}}^2(\alpha, y_n^\delta - y_n)$. In particular, if we could conclude that $\alpha_*$ lies in a region where $\psi_{\mathrm{GCV}}^2(\alpha, y_n^\delta - y_n) \le C\frac{\tau^2}{\alpha^{2p}}$, then we would obtain similar convergence results as for the predictive mean-square error. Moreover, the condition $\rho\big(\delta_n^{\frac{2}{2\mu+1}}\big) > C$ restricts the analysis to the weakly bounded noise scenario, in which case $\delta_n \to \infty$ as $n \to \infty$. The standard "bounded noise" case is ruled out in Theorem 12, because if $\delta_n$ tended to zero, this would lead to a contradiction with Proposition 16. In general, however, the performance of the GCV-rule for the regularisation of deterministic inverse problems is subpar compared to other heuristic rules, e.g., those mentioned in the previous sections; cf., e.g., [9, 53]. This is also illustrated by the fact that we had to impose stronger conditions for the convergence results compared to the aforementioned rules.
2.3 Operator Perturbations
We now consider another deviation from the classical theory and attempt to reprove the standard results for this scenario: suppose that, in addition to noisy data $y^\delta$ (that is, we return to the usual setting in which $\|y - y^\delta\| \le \delta$), one also only has knowledge of a noisy operator
Aη = A + ∆A,
such that $\|A_\eta - A\| \le \eta$. Note that this section is largely based on the paper [51]. We also refer the reader to [103, 134, 143, 150]. The specific situation that we consider here, which is often met in practice, is that we have knowledge of the operator noise level, i.e., we assume $\eta$ known, but we do not know the level of the data error $\delta$. The regularised solution in this case is defined as
\[
x_{\alpha,\eta}^\delta = (A_\eta^*A_\eta + \alpha I)^{-1}A_\eta^*y^\delta, \tag{2.60}
\]
and the task is then to determine a parameter $\alpha$ such that
\[
\|x_{\alpha,\eta}^\delta - x^\dagger\| \to 0, \quad \text{as } \delta, \eta \to 0.
\]
First, we state the following lemma from [51], which provides some useful bounds for the subsequent analysis:
Lemma 6. Let $\alpha \in (0, \alpha_{\max})$ and for $s \in \{0, \tfrac12, 1\}$, define
\[
B_{\eta,s} := \begin{cases}(A_\eta^*A_\eta)^s & \text{if } s \in \{0,1\},\\ A_\eta^* & \text{if } s = \tfrac12,\end{cases} \qquad
B_s := \begin{cases}(A^*A)^s & \text{if } s \in \{0,1\},\\ A^* & \text{if } s = \tfrac12.\end{cases}
\]
Let $\hat B_{\eta,s}$ and $\hat B_s$ be the operators we get from $B_{\eta,s}$ and $B_s$ by changing the roles of the operators $A_\eta$ with $A_\eta^*$ and $A$ with $A^*$, respectively. Then for $s \in \{0, \tfrac12, 1\}$ and $t \in \{-1, -\tfrac32, -2\}$, there exist positive constants $C_{s,t}$ such that
\[
\big\|(A_\eta^*A_\eta + \alpha I)^{t}B_{\eta,s} - (A^*A + \alpha I)^{t}B_s\big\| \le C_{s,t}\frac{\eta}{\alpha^{\frac12-s-t}}, \tag{2.61}
\]
\[
\big\|(A_\eta A_\eta^* + \alpha I)^{t}\hat B_{\eta,s} - (AA^* + \alpha I)^{t}\hat B_s\big\| \le C_{s,t}\frac{\eta}{\alpha^{\frac12-s-t}}. \tag{2.62}
\]
Proof. We prove (2.61); this gives (2.62) upon changing the roles of the operators $A_\eta \leftrightarrow A_\eta^*$ and $A \leftrightarrow A^*$. We recall the elementary estimates
\[
\|(A^*A + \alpha I)^{-1}\| \le \frac{1}{\alpha}, \qquad
\|(A^*A + \alpha I)^{-1}A^*\| \le \frac{1}{2\sqrt{\alpha}}, \qquad
\|(A^*A + \alpha I)^{-1}A^*A\| \le 1, \tag{2.63}
\]
which also hold with $A$ and $A^*$ replaced by $A_\eta$ and $A_\eta^*$, respectively. For $s \in \{0, 1\}$, it follows from some algebraic manipulations, the fact that $B_s$, $B_{\eta,s}$ commute with the inverses below, and the previous estimates that
\[
\begin{aligned}
&(A_\eta^*A_\eta + \alpha I)^{-1}B_{\eta,s} - B_s(A^*A + \alpha I)^{-1}\\
&\quad= (A_\eta^*A_\eta + \alpha I)^{-1}\big[B_{\eta,s}(A^*A + \alpha I) - (A_\eta^*A_\eta + \alpha I)B_s\big](A^*A + \alpha I)^{-1}\\
&\quad= (A_\eta^*A_\eta + \alpha I)^{-1}\big[B_{\eta,s}A^*A - A_\eta^*A_\eta B_s\big](A^*A + \alpha I)^{-1} + \alpha(A_\eta^*A_\eta + \alpha I)^{-1}\big[B_{\eta,s} - B_s\big](A^*A + \alpha I)^{-1}.
\end{aligned}
\]
In the case $s = 0$ and $B_{\eta,0} = B_0 = I$, we find
\[
B_{\eta,0}A^*A - A_\eta^*A_\eta B_0 = (A^* - A_\eta^*)A + A_\eta^*(A - A_\eta),
\]
which, using (2.63), gives $C_{0,-1} = 1$. Similarly, we can prove that $C_{1,-1} = 1$. For the case $s = \tfrac12$, with $B_{\eta,s} = A_\eta^*$ and $B_s = A^*$, we obtain $C_{\frac12,-1} = \tfrac54$ with minor modifications, noting that $(A^*A + \alpha I)^{-1}A^* = A^*(AA^* + \alpha I)^{-1}$. The other cases of $t$ follow in a similar manner by
\[
\begin{aligned}
&(A_\eta^*A_\eta + \alpha I)^{t}B_{\eta,s} - B_s(A^*A + \alpha I)^{t}\\
&\quad= (A_\eta^*A_\eta + \alpha I)^{t+1}\big[(A_\eta^*A_\eta + \alpha I)^{-1}B_{\eta,s} - B_s(A^*A + \alpha I)^{-1}\big]\\
&\qquad+ \big[(A_\eta^*A_\eta + \alpha I)^{t+1} - (A^*A + \alpha I)^{t+1}\big]B_s(A^*A + \alpha I)^{-1},
\end{aligned}
\]
and by using (2.63) and the result for $t = -1$. For $t = -\tfrac32$, we employ an additional identity from semigroup operator calculus [92],
\[
(A_\eta^*A_\eta + \alpha I)^{-\frac12} - (A^*A + \alpha I)^{-\frac12}
= \frac{\sin(\frac{\pi}{2})}{\pi}\int_0^\infty w^{-\frac12}\big[(A_\eta^*A_\eta + (\alpha+w)I)^{-1} - (A^*A + (\alpha+w)I)^{-1}\big]\,dw,
\]
which leads to
\[
\big\|(A_\eta^*A_\eta + \alpha I)^{-\frac12} - (A^*A + \alpha I)^{-\frac12}\big\| \le \frac{C_{0,-1}\eta}{\pi}\int_0^\infty\frac{1}{\sqrt{w}\,(\alpha+w)^{\frac32}}\,dw \le \frac{2C_{0,-1}\eta}{\pi\alpha},
\]
thereby finishing the proof.
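The elementary estimates (2.63) are also easy to confirm numerically for a concrete matrix; a short Python check (random matrix, illustration only):

```python
import numpy as np

# Numerical check of the elementary estimates (2.63):
#   ||(A^T A + aI)^{-1}||       <= 1/a,
#   ||(A^T A + aI)^{-1} A^T||   <= 1/(2 sqrt(a)),
#   ||(A^T A + aI)^{-1} A^T A|| <= 1.
rng = np.random.default_rng(3)
A = rng.standard_normal((30, 20))
I = np.eye(20)
for a in (1e-6, 1e-2, 1.0):
    R = np.linalg.inv(A.T @ A + a * I)
    assert np.linalg.norm(R, 2) <= 1 / a + 1e-9
    assert np.linalg.norm(R @ A.T, 2) <= 1 / (2 * np.sqrt(a)) + 1e-9
    assert np.linalg.norm(R @ A.T @ A, 2) <= 1 + 1e-9
print("estimates (2.63) verified")
```

Each bound follows from the spectral mapping $\sigma^2 \mapsto 1/(\sigma^2+a)$, $\sigma/(\sigma^2+a)$, $\sigma^2/(\sigma^2+a)$ together with $\sigma^2 + a \ge 2\sigma\sqrt{a}$.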
2.3.1 Semi-Heuristic Parameter Choice Rules

In terms of parameter selection, previous work can be found in, e.g., [103, 104, 134, 143]. Note that the aforementioned references generally consider a-posteriori rules; namely, the generalised discrepancy principle, which selects the parameter as
\[
\alpha_* = \alpha(\delta, \eta, y^\delta) = \inf\big\{\alpha \in (0, \alpha_{\max}) \;\big|\; \|A_\eta x_{\alpha,\eta}^\delta - y^\delta\| = \tau\delta + \eta\|x_{\alpha,\eta}^\delta\|\big\}, \tag{2.64}
\]
with $\tau \ge 1$, and also a generalised balancing principle. What prohibits the direct use of a minimisation-based rule with a functional, which for this section we define to be of the form
\[
\psi : (0, \alpha_{\max}) \times \mathcal{L}(X,Y) \times Y \to \mathbb{R} \cup \{\infty\}, \qquad (\alpha, A_\eta, y^\delta) \mapsto \psi(\alpha, A_\eta, y^\delta),
\]
is that we are faced with an additional operator error, which is usually not random or irregular; hence it would be unrealistic to assume that an inequality analogous to the Muckenhoupt condition (2.17) holds for the operator perturbation. The remedy is to employ a modified functional which uses the noisy operator $A_\eta$ but is designed to emulate a functional for the unperturbed operator. To guarantee a minimiser and convergence of the regularised solution, we restrict the minimisation to an interval $[\gamma, \alpha_{\max}]$, where the lower bound $\gamma$ is selected depending on $\eta$ (but not on $\delta$):
\[
\alpha_* := \alpha(\eta, y^\delta) := \operatorname*{argmin}_{\alpha \in [\gamma, \alpha_{\max}]}\bar\psi(\alpha, A_\eta, y^\delta), \qquad \gamma = \gamma(\eta) > 0, \tag{2.65}
\]
with
\[
\bar\psi(\alpha, A_\eta, y^\delta) := \psi(\alpha, A_\eta, y^\delta) - \mathcal{J}(\alpha, A_\eta, y^\delta, \eta).
\]
In this way, we combine heuristic rules with an $\eta$-based choice. Here, $\psi$ is a standard heuristic parameter choice functional and $\mathcal{J}$ is the so-called compensating functional. Note that the noise restrictions for the convergence results of the heuristic parameter choice rules may fail to be satisfied in the case of smooth noise (i.e., when the noise is in the range of the forward operator). Incidentally, operator noise, when it exists, tends to be smooth. Therefore, the purpose of the compensating functional is to subtract this smooth part of the noise, i.e., it should behave approximately like $\psi(\alpha, A, y^\delta) - \psi(\alpha, A_\eta, y^\delta)$. For the compensating functional, we propose two possibilities:
\[
\mathcal{J}(\alpha, A_\eta, y^\delta, \eta) = D\eta\|x_{\alpha,\eta}^\delta\|, \tag{SH1}
\]
or
\[
\mathcal{J}(\alpha, A_\eta, y^\delta, \eta) = D\frac{\eta}{\sqrt{\alpha}}, \tag{SH2}
\]
which define the (SH1) and (SH2) variants of the semi-heuristic functionals. As a consequence of Lemma 6, we obtain some useful bounds.
Lemma 7. For any of the functionals $\psi \in \{\psi_{\mathrm{HD}}, \psi_{\mathrm{HR}}, \psi_{\mathrm{QO}}\}$ and any $\alpha \in (0, \alpha_{\max})$, we have
\[
\psi(\alpha, A_\eta, A_\eta x^\dagger) \le C_{s,t}\frac{\eta\|x^\dagger\|}{\sqrt{\alpha}} + \psi(\alpha, A, Ax^\dagger), \tag{2.66}
\]
\[
\psi(\alpha, A_\eta, y^\delta) \le \frac{\delta}{\sqrt{\alpha}} + (1 + C_{s,t})\frac{\eta\|x^\dagger\|}{\sqrt{\alpha}} + \psi(\alpha, A, Ax^\dagger), \tag{2.67}
\]
with the constants $C_{s,t}$ from Lemma 6: $s = \tfrac12$, $t = -1$ for the heuristic discrepancy, $s = \tfrac12$, $t = -\tfrac32$ for the Hanke--Raus, and $s = 1$, $t = -2$ for the quasi-optimality functionals, respectively.
Proof. The inequality (2.66) follows from (2.61) and (2.62); the inequality (2.67) follows from (2.66) and from the inequalities
\[
\psi(\alpha, A_\eta, y^\delta) \le \psi(\alpha, A_\eta, y^\delta - y) + \psi(\alpha, A_\eta, (A - A_\eta)x^\dagger) + \psi(\alpha, A_\eta, A_\eta x^\dagger)
\le \frac{\delta}{\sqrt{\alpha}} + \frac{\eta\|x^\dagger\|}{\sqrt{\alpha}} + \psi(\alpha, A_\eta, A_\eta x^\dagger).
\]
We remark that the term $\psi(\alpha, A, Ax^\dagger)$ converges to $0$ as $\alpha \to 0$; see Proposition 12. Furthermore, if $x^\dagger$ additionally satisfies a source condition (2.5), then the expression can be bounded by a convergence rate of order $\alpha$ (with some exponent depending on the source condition) that agrees with the standard rate for the approximation error $\|x_\alpha - x^\dagger\|$ (see Proposition 7).
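Before the convergence analysis, a minimal Python sketch of the semi-heuristic rule (2.65) with the (SH2) compensation may help fix ideas. The test problem, the weight $D$ and the cut-off $\gamma(\eta) = \eta$ are ad-hoc choices of ours (not the thesis' numerical experiments), and $\psi$ is taken to be the heuristic discrepancy functional $\|A_\eta x_{\alpha,\eta}^\delta - y^\delta\|/\sqrt{\alpha}$:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 60
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
s = 0.8 ** np.arange(n)
A = U @ np.diag(s) @ V.T
x_true = V @ rng.standard_normal(n)
y = A @ x_true

eta = 1e-3                                    # known operator noise level
A_eta = A + eta * rng.standard_normal((n, n)) / np.sqrt(n)   # perturbed operator
y_delta = y + 1e-3 * rng.standard_normal(n)   # data noise of unknown level

Ue, se, Vte = np.linalg.svd(A_eta)

def x_alpha_eta(alpha):
    # regularised solution (2.60) built from the noisy operator
    return Vte.T @ ((se / (se**2 + alpha)) * (Ue.T @ y_delta))

def psi_hd(alpha):
    # heuristic discrepancy functional, evaluated with the noisy operator
    return np.linalg.norm(A_eta @ x_alpha_eta(alpha) - y_delta) / np.sqrt(alpha)

D, gamma = 1.0, eta                           # compensating weight, cut-off gamma(eta)
alphas = np.logspace(np.log10(gamma), 0, 80)
# (SH2): minimise psi(alpha) - D * eta / sqrt(alpha) over [gamma, alpha_max]
vals = [psi_hd(a) - D * eta / np.sqrt(a) for a in alphas]
a_star = alphas[int(np.argmin(vals))]
print("semi-heuristic (SH2) selected alpha:", a_star)
```

The restriction of the grid to $[\gamma, \alpha_{\max}]$ mirrors the role of $\gamma(\eta)$ in (2.65): it prevents the compensated functional from driving the selection to $\alpha = 0$.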
Convergence Analysis
Suppose that $\alpha_*$ is the parameter selected by the proposed parameter choice rule with operator noise, (2.65). In the following lemma, we show that for such a choice of parameter, it follows that $\alpha_* \to 0$ if all the noise (with respect to both the data and the operator) vanishes:

Lemma 8. Let $\alpha_*$ be selected as in (2.65), i.e., with $\bar\psi$ as above and $\psi \in \{\psi_{\mathrm{HD}}, \psi_{\mathrm{HR}}, \psi_{\mathrm{QO}}\}$ with either (SH1) or (SH2). Suppose there exist positive constants (not necessarily equal, all denoted by $C$) such that $\|y^\delta\| \ge C$ for $\psi \in \{\psi_{\mathrm{HD}}, \psi_{\mathrm{HR}}\}$ and $\|A_\eta^*y^\delta\| \ge C$ for $\psi = \psi_{\mathrm{QO}}$. If $\gamma = \gamma(\eta)$ is chosen such that $\frac{\eta}{\sqrt{\gamma}} \to 0$ as $\eta \to 0$, then
\[
\alpha_* \to 0, \quad \text{as } \delta, \eta \to 0.
\]
Proof. Firstly, we recall Proposition 9 of Chapter 2, which proved that one may estimate the parameter $\alpha$ by the $\psi$-functionals. In particular, it follows that
\[
\psi(\alpha, A_\eta, y^\delta) \ge \begin{cases} C\sqrt{\alpha} & \text{if } \psi = \psi_{\mathrm{HD}},\\ C\alpha & \text{if } \psi \in \{\psi_{\mathrm{HR}}, \psi_{\mathrm{QO}}\}.\end{cases} \tag{2.68}
\]
By the standard error estimate
\[
\|x_{\alpha_*,\eta}^\delta\| \le \frac{\|y^\delta\|}{\sqrt{\alpha_*}},
\]
we find, for the case in which the compensating functional is chosen as in (SH1), using (2.68) and (2.67) with $t \in \{\tfrac12, 1\}$ suited to $\psi$ according to (2.68),
\[
\begin{aligned}
C\alpha_*^{t} - D\eta\frac{\|y^\delta\|}{\sqrt{\alpha_*}} &\le \bar\psi(\alpha_*, A_\eta, y^\delta) = \inf_{\alpha\in[\gamma,\alpha_{\max}]}\bar\psi(\alpha, A_\eta, y^\delta)
\le \inf_{\alpha\in[\gamma,\alpha_{\max}]}\psi(\alpha, A_\eta, y^\delta)\\
&\le \inf_{\alpha\in[\gamma,\alpha_{\max}]}\left[\frac{\delta}{\sqrt{\alpha}} + (1 + C_{s,t})\frac{\eta\|x^\dagger\|}{\sqrt{\alpha}} + \psi(\alpha, A, Ax^\dagger)\right]\\
&\le \inf_{\alpha\in[\gamma,\alpha_{\max}]}\left[\frac{\delta}{\sqrt{\alpha}} + \psi(\alpha, A, Ax^\dagger)\right] + (1 + C_{s,t})\frac{\eta\|x^\dagger\|}{\sqrt{\gamma}}. 
\end{aligned} \tag{2.69}
\]
Hence,
\[
C\alpha_*^{t} \le \inf_{\alpha\in[\gamma,\alpha_{\max}]}\left[\frac{\delta}{\sqrt{\alpha}} + \psi(\alpha, A, Ax^\dagger)\right] + (C + D)\frac{\eta}{\sqrt{\gamma}}.
\]
It is not difficult to verify the same estimate analogously for the case in which the compensating functional is chosen according to (SH2). Inserting the (non-optimal) choice $\alpha = \delta + \gamma$ in the infimum, we obtain an upper bound that tends to $0$ as $\delta, \gamma \to 0$. By the hypothesis, the last two terms vanish, thereby proving the desired result.
Remark. If $\alpha_*$ is the minimiser of $\psi(\alpha, A_\eta, y^\delta)$, then this functional coincides with (SH1) and/or (SH2) with $D = 0$, and one obtains the same result as above; namely, that $\alpha_* \to 0$ as $\delta, \eta \to 0$, provided that the conditions in the lemma are fulfilled.
Now we can establish an estimate from above for the total error, which is derived courtesy of a lower estimate of the parameter choice functional with the data error; this, as we recall from Bakushinskii's veto (cf. Proposition 3), necessitates a restriction on the noise. At first, we study bounds for the functional in (2.65).

Proposition 17. Let $\alpha_*$ be selected according to (2.65) with $\bar\psi$ as in (SH1). Suppose that for the noisy data $y^\delta$, the noise condition (2.17) is satisfied, with $e \in \mathcal{N}_1$ for $\psi \in \{\psi_{\mathrm{HD}}, \psi_{\mathrm{HR}}\}$ and $e \in \mathcal{N}_2$ for $\psi = \psi_{\mathrm{QO}}$. Then, for $\eta$ sufficiently small, we get the following error estimate for all $\psi \in \{\psi_{\mathrm{HD}}, \psi_{\mathrm{HR}}, \psi_{\mathrm{QO}}\}$:
\[
\begin{aligned}
\|x_{\alpha_*,\eta}^\delta - x^\dagger\| \le (1 - D\eta C_{\mathrm{nc}})^{-1}\bigg[&C\frac{\eta\delta}{\alpha_*} + C_{\mathrm{nc}}\inf_{\alpha\in[\gamma,\alpha_{\max}]}\bar\psi(\alpha, A_\eta, y^\delta) + DC_{\mathrm{nc}}\eta\|x^\dagger\|\\
&+ C\frac{\eta}{\sqrt{\alpha_*}}\|x^\dagger\| + C\psi(\alpha_*, A, Ax^\dagger) + \alpha_*\|(A^*A + \alpha_* I)^{-1}x^\dagger\|\bigg]. 
\end{aligned} \tag{2.70}
\]
Proof. We begin by estimating
\[
\begin{aligned}
\|x_{\alpha_*,\eta}^\delta - x^\dagger\| &= \big\|(A_\eta^*A_\eta + \alpha_* I)^{-1}A_\eta^*y^\delta - (A_\eta^*A_\eta + \alpha_* I)^{-1}(A_\eta^*A_\eta + \alpha_* I)x^\dagger\big\|\\
&\le \big\|(A_\eta^*A_\eta + \alpha_* I)^{-1}\big[A_\eta^*y^\delta - A_\eta^*A_\eta x^\dagger - \alpha_* x^\dagger\big]\big\|\\
&\le \big\|(A_\eta^*A_\eta + \alpha_* I)^{-1}\big[A_\eta^*(y^\delta - y) + A_\eta^*(A - A_\eta)x^\dagger\big]\big\| + \alpha_*\big\|(A_\eta^*A_\eta + \alpha_* I)^{-1}x^\dagger\big\|\\
&\le \big\|(A_\eta^*A_\eta + \alpha_* I)^{-1}A_\eta^*(y^\delta - y)\big\| + \frac{\eta}{2\sqrt{\alpha_*}}\|x^\dagger\| + \alpha_*\big\|(A_\eta^*A_\eta + \alpha_* I)^{-1}x^\dagger\big\|.
\end{aligned}
\]
By (2.61), the last term can be bounded by
\[
\alpha_*\big\|(A_\eta^*A_\eta + \alpha_* I)^{-1}x^\dagger\big\|
\le \alpha_*\big\|\big[(A_\eta^*A_\eta + \alpha_* I)^{-1} - (A^*A + \alpha_* I)^{-1}\big]x^\dagger\big\| + \alpha_*\big\|(A^*A + \alpha_* I)^{-1}x^\dagger\big\|
\le C_{0,-1}\frac{\eta\|x^\dagger\|}{\sqrt{\alpha_*}} + \alpha_*\big\|(A^*A + \alpha_* I)^{-1}x^\dagger\big\|.
\]
This leaves the remaining term:
\[
\big\|(A_\eta^*A_\eta + \alpha_* I)^{-1}A_\eta^*(y^\delta - y)\big\|
\le \big\|\big[(A_\eta^*A_\eta + \alpha_* I)^{-1}A_\eta^* - (A^*A + \alpha_* I)^{-1}A^*\big](y^\delta - y)\big\| + \big\|(A^*A + \alpha_* I)^{-1}A^*(y^\delta - y)\big\|
\le \frac{5\eta\delta}{4\alpha_*} + C_{\mathrm{nc}}\psi(\alpha_*, A, y^\delta - y),
\]
where we used the noise condition (2.17) for the last estimate. Utilising the noise condition (2.17) again, and with the operator error estimates (2.61), (2.62), for all $\psi \in \{\psi_{\mathrm{HD}}, \psi_{\mathrm{HR}}, \psi_{\mathrm{QO}}\}$ we obtain
\[
\begin{aligned}
\big\|(A_\eta^*A_\eta + \alpha_* I)^{-1}A_\eta^*(y^\delta - y)\big\|
&\le \frac{5\eta\delta}{4\alpha_*} + C_{\mathrm{nc}}\psi(\alpha_*, A_\eta, y^\delta - y) + C_{\mathrm{nc}}C_{s,t}\frac{\delta\eta}{\alpha_*}\\
&\le \frac{(5 + C_{\mathrm{nc}}C_{s,t})\eta\delta}{4\alpha_*} + C_{\mathrm{nc}}\psi(\alpha_*, A_\eta, y^\delta) + C_{\mathrm{nc}}\psi(\alpha_*, A_\eta, y)\\
&\le \frac{(5 + C_{\mathrm{nc}}C_{s,t})\eta\delta}{4\alpha_*} + C_{\mathrm{nc}}\bar\psi(\alpha_*, A_\eta, y^\delta) + DC_{\mathrm{nc}}\eta\|x_{\alpha_*,\eta}^\delta\| + C_{\mathrm{nc}}\psi(\alpha_*, A_\eta, Ax^\dagger)\\
&\le C\frac{\eta\delta}{\alpha_*} + C_{\mathrm{nc}}\inf_{\alpha\in[\gamma,\alpha_{\max}]}\bar\psi(\alpha, A_\eta, y^\delta) + DC_{\mathrm{nc}}\eta\|x_{\alpha_*,\eta}^\delta - x^\dagger\| + DC_{\mathrm{nc}}\eta\|x^\dagger\|\\
&\qquad+ C_{\mathrm{nc}}\psi\big(\alpha_*, A_\eta, (A_\eta + A - A_\eta)x^\dagger\big)\\
&\le C\frac{\eta\delta}{\alpha_*} + C_{\mathrm{nc}}\inf_{\alpha\in[\gamma,\alpha_{\max}]}\bar\psi(\alpha, A_\eta, y^\delta) + DC_{\mathrm{nc}}\eta\|x_{\alpha_*,\eta}^\delta - x^\dagger\| + DC_{\mathrm{nc}}\eta\|x^\dagger\|\\
&\qquad+ C_{\mathrm{nc}}\psi(\alpha_*, A_\eta, A_\eta x^\dagger) + C_{\mathrm{nc}}\psi(\alpha_*, A_\eta, (A - A_\eta)x^\dagger).
\end{aligned}
\]
The last terms can be bounded using standard error estimates (cf. Proposition 11) by
\[
\psi(\alpha_*, A_\eta, (A - A_\eta)x^\dagger) \le \frac{1}{\sqrt{\alpha_*}}\|(A - A_\eta)x^\dagger\| \le \frac{\eta\|x^\dagger\|}{\sqrt{\alpha_*}},
\]
while for the other term we employ (2.61) and (2.66):
\[
\psi(\alpha_*, A_\eta, A_\eta x^\dagger) \le C_{s,t}\frac{\eta\|x^\dagger\|}{\sqrt{\alpha_*}} + \psi(\alpha_*, A, Ax^\dagger).
\]
Hence, for all $\psi \in \{\psi_{\mathrm{HD}}, \psi_{\mathrm{HR}}, \psi_{\mathrm{QO}}\}$, we obtain
\[
\begin{aligned}
(1 - D\eta C_{\mathrm{nc}})\|x_{\alpha_*,\eta}^\delta - x^\dagger\| \le{}& C\frac{\eta\delta}{\alpha_*} + C_{\mathrm{nc}}\inf_{\alpha\in[\gamma,\alpha_{\max}]}\bar\psi(\alpha, A_\eta, y^\delta) + DC_{\mathrm{nc}}\eta\|x^\dagger\|\\
&+ C\frac{\eta}{\sqrt{\alpha_*}}\|x^\dagger\| + C\psi(\alpha_*, A, Ax^\dagger) + \alpha_*\|(A^*A + \alpha_* I)^{-1}x^\dagger\|.
\end{aligned}
\]
The proof is easily adapted to obtain a similar proposition for the alternative choice of compensating functional as in (SH2):
Proposition 18. Let the assumptions of Proposition 17 hold, and let $\alpha_*$ be selected according to (2.65) with $\bar\psi$ as in (SH2). Then, for $\eta$ sufficiently small, we get
\[
\begin{aligned}
\|x_{\alpha_*,\eta}^\delta - x^\dagger\| \le{}& C\frac{\eta\delta}{\alpha_*} + C_{\mathrm{nc}}\inf_{\alpha\in[\gamma,\alpha_{\max}]}\bar\psi(\alpha, A_\eta, y^\delta) + C_{\mathrm{nc}}C\frac{\eta}{\sqrt{\alpha_*}}\\
&+ C\frac{\eta}{\sqrt{\alpha_*}}\|x^\dagger\| + C\psi(\alpha_*, A, Ax^\dagger) + \alpha_*\|(A^*A + \alpha_* I)^{-1}x^\dagger\|. 
\end{aligned} \tag{2.71}
\]
Note that setting $D = 0$ in the previous propositions yields upper bounds for the total error in the case of employing the unmodified heuristic rules.
Theorem 13. Let $\alpha_*$ be selected as in (2.65). Suppose that the noise condition (2.17) and the conditions of Lemma 8 are satisfied, and furthermore suppose that $\gamma \in [0, \alpha_{\max}]$ satisfies
\[
\frac{\eta}{\gamma} \le C \quad \text{as } \eta \to 0,
\]
where $C$ is a constant. Then
\[
\|x_{\alpha_*,\eta}^\delta - x^\dagger\| \to 0, \quad \text{as } \delta, \eta \to 0.
\]
Proof. Since we have that $\alpha_* \ge \gamma$, the conditions in the theorem imply that $\frac{\eta\delta}{\gamma} \to 0$ and $\frac{\eta}{\sqrt{\gamma}} \to 0$. The terms with $\psi(\alpha_*, A, Ax^\dagger)$ and $\alpha_*\|(A^*A + \alpha_* I)^{-1}x^\dagger\|$ vanish by standard arguments (cf. Proposition 4 and Proposition 6, respectively), as $\alpha_* \to 0$ according to Lemma 8. Finally, $\inf_{\alpha\in[\gamma,\alpha_{\max}]}\bar\psi(\alpha, A_\eta, y^\delta)$ tends to $0$ because of (2.69), where we may take an appropriate choice for $\alpha$ in the infimum.
Remark. Note that one might use more general functionals than those in (SH1) and (SH2) by replacing $\eta$ with $\eta^s$, $s \in (0, 1)$. In this case, similar convergence results remain valid with a slightly adapted choice of $\gamma$ (depending on $s$). However, we observed through some numerical experimentation (see Chapter 5) that $s = 1$ appeared to be a natural choice, which is fully in line with our motivation that the compensating term should represent the error in $\psi$ due to operator perturbations.
We further remark that the unmodified heuristic choice (i.e., with $D = 0$), under the same condition as in the previous theorem, also yields convergence as the errors tend to zero. However, as will be observed in Chapter 5, which gives a numerical study related to the methods of this section, the modified rules represent a substantial improvement. Convergence rates were not proven in [51] and also lie beyond the scope of this thesis.
Chapter 3
Convex Tikhonov Regularisation
In this chapter, we revert to the original variational representation of Tikhonov regularisation, namely (2.4), and use it to expand our horizons from linear ill-posed problems to convex regularisation. This form of regularisation has gained popularity in recent decades due to demand from applications: for instance, one may seek to recover solutions which are sparse, or, in image processing, one may want to retain edges when denoising an image. For these applications, one may replace the quadratic regularisation term in (2.4) with an $\ell^1$ norm or a total variation seminorm, respectively. This is the flexibility which (2.4) affords us over the spectral form (2.1). The downside is that the analysis is arguably much more difficult in the absence of spectral theory. We begin this chapter with an initial section covering the classical theory, drawn mainly from [21, 68, 142]. The subsequent sections then treat heuristic parameter choice rules, with a focus on proving convergence under a noise restriction analogous to (2.17) from the linear theory.
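To make the contrast concrete on a single component of a diagonal problem (an illustration of ours; the function names are hypothetical): the quadratic penalty merely damps a coefficient, whereas the $\ell^1$ penalty thresholds it, setting small, noise-dominated coefficients exactly to zero — the mechanism behind sparse recovery.

```python
def shrink_l2(y, alpha):
    # Minimiser of (1/2)(x - y)^2 + (alpha/2) x^2: linear shrinkage.
    return y / (1.0 + alpha)

def shrink_l1(y, alpha):
    # Minimiser of (1/2)(x - y)^2 + alpha*|x|: soft-thresholding.
    if y > alpha:
        return y - alpha
    if y < -alpha:
        return y + alpha
    return 0.0

# A small coefficient is merely damped by the quadratic penalty
# but removed entirely by the l1 penalty:
print(shrink_l2(0.3, 0.5), shrink_l1(0.3, 0.5))
```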
3.1 Classical Theory
We assume henceforth that $A : X \to Y$ is a continuous linear operator with $X$ a Banach space (and $Y$ a Hilbert space). Then the functional calculus for self-adjoint operators used previously becomes void. Moreover, the Moore–Penrose generalised inverse may no longer exist, thus we must redefine our notion of the best approximate solution. A survey of this setting can be found in [140]. One may also consult [142] in case one wants to consider $Y$ as a Banach space as well, although this is beyond the scope of this thesis. Note that this section is by and large analogous to [90]. In this context, it is common to define the best approximate solution as the
so-called $R$-minimising solution, i.e.,
\[
x^\dagger \in \operatorname*{argmin}_{\substack{x\in X\\ Ax = y}} R(x).
\]
One may still regularise à la Tikhonov, however, if one sticks with the variational form:
\[
x_\alpha^\delta \in \operatorname*{argmin}_{x\in X}\; \frac{1}{2}\|Ax - y^\delta\|^2 + \alpha R(x), \qquad (3.1)
\]
where the regularisation term $R : X \to \mathbb{R}\cup\{\infty\}$ may be a generalisation of the usual squared norm, which we shall assume to be convex, proper, coercive and weak-$*$ lower semi-continuous (akin to the aforementioned norm functional); these are in fact the standard assumptions for the well-definedness of (3.1), cf. [78, 140, 142]. In contrast with $\|\cdot\|^2$, the functional $R$ need not be differentiable.
The optimality condition for the Tikhonov functional may be stated as follows:
\[
0 \in A^*(Ax_\alpha^\delta - y^\delta) + \alpha\,\partial R(x_\alpha^\delta), \qquad (3.2)
\]
for all $\alpha \in (0,\alpha_{\max})$ and $y^\delta \in Y$. We may define a specific selection of the subgradient of $R$ at $x_\alpha^\delta$ as
\[
\xi_\alpha^\delta := -\frac{1}{\alpha}A^*(Ax_\alpha^\delta - y^\delta) \in \partial R(x_\alpha^\delta),
\]
and denote by $\xi_\alpha$ its respective noise-free variant. In this scenario, we also have to consider a different source condition to, e.g., (2.5). Namely,
\[
\partial R(x^\dagger) \cap \operatorname{range} A^* \neq \emptyset, \qquad (3.3)
\]
i.e., there exists $\omega$ such that $A^*\omega \in \partial R(x^\dagger)$ [21, 140]. Subsequently, we define $\xi := A^*\omega$ to be a subgradient of $R$ at $x^\dagger$. It is useful to define the residual vectors as before:
\[
p_\alpha^\delta := y^\delta - Ax_\alpha^\delta, \qquad p_\alpha := y - Ax_\alpha.
\]
For notational purposes, we also define shorthands for differences of noisy and noise-free variables:
\[
\Delta y := y^\delta - y, \qquad \Delta p_\alpha := p_\alpha^\delta - p_\alpha.
\]
Additionally, in this context, rather than prove convergence in norm, we follow the examples of, e.g., [21, 68, 142], and prove estimates and convergence with respect to the Bregman distance, denoted by $D_\xi(\cdot,\cdot)$, cf. Appendix B and in particular Definition 13. We begin by stating a useful estimate for the data propagation error:
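Since the Bregman distance is the error measure used throughout this chapter, a small sketch (our own, for scalar $R$ with a single-valued subgradient) may help: for the quadratic functional it reduces to half the squared distance, while for other convex functionals it is non-negative but genuinely non-symmetric.

```python
def bregman(R, dR, u, v):
    # D_xi(u, v) = R(u) - R(v) - <xi, u - v> with xi = dR(v), the
    # (here unique) subgradient of R at v.
    return R(u) - R(v) - dR(v) * (u - v)

# Quadratic case: the Bregman distance is half the squared distance.
R2, dR2 = (lambda x: 0.5 * x * x), (lambda x: x)
print(bregman(R2, dR2, 3.0, 1.0))  # 0.5*(3-1)^2 = 2.0

# Fourth-power case: non-negative but not symmetric in its arguments.
R4, dR4 = (lambda x: x**4 / 4.0), (lambda x: x**3)
print(bregman(R4, dR4, 2.0, 1.0), bregman(R4, dR4, 1.0, 2.0))
```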
Lemma 9. We have the following upper bound for the data propagation error:
\[
D_{\xi_\alpha}(x_\alpha^\delta, x_\alpha) \le \frac{1}{\alpha}\langle \Delta y - \Delta p_\alpha, \Delta p_\alpha\rangle, \qquad (3.4)
\]
for all $\alpha \in (0,\alpha_{\max})$ and $y, y^\delta \in Y$.
Proof. We may estimate
\[
D_{\xi_\alpha}(x_\alpha^\delta, x_\alpha) \le D^{\mathrm{sym}}_{\xi_\alpha^\delta,\xi_\alpha}(x_\alpha^\delta, x_\alpha) = \frac{1}{\alpha}\langle \Delta p_\alpha, A(x_\alpha^\delta - x_\alpha)\rangle = \frac{1}{\alpha}\langle \Delta p_\alpha, \Delta y - \Delta p_\alpha\rangle,
\]
where $D^{\mathrm{sym}}$ is the symmetric Bregman distance, cf. Appendix B and Definition B.4, which proves the desired result.
We give the error estimates [21], which, contrary to the previous chapter, are in terms of the Bregman distance:
Proposition 19. Assume that (3.3) holds. Then there exist positive constants such that
\[
D_\xi(x_\alpha, x^\dagger) \le C\|\omega\|^2\alpha, \qquad \|Ax_\alpha - y\| \le C\|\omega\|\alpha,
\]
\[
D_{\xi_\alpha}(x_\alpha^\delta, x_\alpha) \le C\,\frac{\delta^2}{\alpha}, \qquad \|A(x_\alpha^\delta - x_\alpha)\| \le C\delta,
\]
and
\[
D_\xi(x_\alpha^\delta, x^\dagger) \le C\Big(\frac{\delta}{\sqrt{\alpha}} + \|\omega\|\sqrt{\alpha}\Big)^2, \qquad \|Ax_\alpha^\delta - y^\delta\| \le \delta + C\|\omega\|\alpha, \qquad (3.5)
\]
for all $\alpha \in (0,\alpha_{\max})$ and $y, y^\delta \in Y$.
Proof. We have
\[
\frac{1}{2}\|Ax_\alpha - y\|^2 + \alpha R(x_\alpha) \le \frac{1}{2}\|Ax^\dagger - y\|^2 + \alpha R(x^\dagger) = \alpha R(x^\dagger).
\]
Now, from
\[
D_\xi(x_\alpha, x^\dagger) = R(x_\alpha) - R(x^\dagger) - \langle\xi, x_\alpha - x^\dagger\rangle,
\]
it follows that
\[
\alpha R(x^\dagger) = \alpha R(x_\alpha) - \alpha\langle\xi, x_\alpha - x^\dagger\rangle - \alpha D_\xi(x_\alpha, x^\dagger).
\]
Thus,
\begin{align*}
&\frac{1}{2}\|p_\alpha\|^2 + \alpha R(x_\alpha) \le \alpha R(x^\dagger)\\
\iff\;& \frac{1}{2}\|p_\alpha\|^2 + \alpha R(x_\alpha) \le \alpha R(x_\alpha) - \alpha\langle\xi, x_\alpha - x^\dagger\rangle - \alpha D_\xi(x_\alpha, x^\dagger)\\
\iff\;& \frac{1}{2}\|p_\alpha\|^2 + \alpha D_\xi(x_\alpha, x^\dagger) \le -\alpha\langle\xi, x_\alpha - x^\dagger\rangle,
\end{align*}
i.e.,
\[
\frac{1}{2}\|p_\alpha\|^2 \le -\alpha D_\xi(x_\alpha, x^\dagger) - \alpha\langle A^*\omega, x_\alpha - x^\dagger\rangle \le \alpha\langle\omega, p_\alpha\rangle \le \alpha|\langle\omega, p_\alpha\rangle| \le \alpha\|\omega\|\,\|p_\alpha\|,
\]
where we used that $\langle A^*\omega, x_\alpha - x^\dagger\rangle = \langle\omega, Ax_\alpha - y\rangle = -\langle\omega, p_\alpha\rangle$ and the non-negativity of the Bregman distance.
Hence, we obtain the discrepancy estimate $\|p_\alpha\| \le 2\|\omega\|\alpha$. Now, for the approximation error, we recall the previous estimate:
\[
\frac{1}{2}\|p_\alpha\|^2 + \alpha D_\xi(x_\alpha, x^\dagger) \le -\alpha\langle\xi, x_\alpha - x^\dagger\rangle \le \alpha\|\omega\|\,\|p_\alpha\| \le \frac{1}{2}\big(\|\omega\|^2\alpha^2 + \|p_\alpha\|^2\big).
\]
Now, for the estimates with noise, the minimality of $x_\alpha^\delta$ and the identity $D_{\xi_\alpha}(x_\alpha^\delta, x_\alpha) = R(x_\alpha^\delta) - R(x_\alpha) - \langle\xi_\alpha, x_\alpha^\delta - x_\alpha\rangle$ with $\xi_\alpha = \frac{1}{\alpha}A^*p_\alpha$ yield
\begin{align*}
\frac{1}{2}\|p_\alpha^\delta\|^2 + \alpha D_{\xi_\alpha}(x_\alpha^\delta, x_\alpha)
&\le \frac{1}{2}\|Ax_\alpha - y^\delta\|^2 - \alpha\Big\langle \frac{1}{\alpha}A^*p_\alpha,\, x_\alpha^\delta - x_\alpha\Big\rangle\\
&= \frac{1}{2}\|Ax_\alpha - y^\delta\|^2 - \langle p_\alpha, A(x_\alpha^\delta - x_\alpha)\rangle\\
&= \frac{1}{2}\|p_\alpha^\delta\|^2 - \frac{1}{2}\|A(x_\alpha^\delta - x_\alpha)\|^2 - \langle y - y^\delta, A(x_\alpha^\delta - x_\alpha)\rangle,
\end{align*}
where the last equality follows by expanding $\|p_\alpha^\delta\|^2 = \|(Ax_\alpha - y^\delta) + A(x_\alpha^\delta - x_\alpha)\|^2$, i.e.,
\[
\frac{1}{2}\|A(x_\alpha^\delta - x_\alpha)\|^2 + \alpha D_{\xi_\alpha}(x_\alpha^\delta, x_\alpha) \le -\langle y - y^\delta, A(x_\alpha^\delta - x_\alpha)\rangle.
\]
From the non-negativity of the Bregman distance, it follows that the above inequality implies that
\[
\frac{1}{2}\|A(x_\alpha^\delta - x_\alpha)\|^2 \le -\langle y - y^\delta, A(x_\alpha^\delta - x_\alpha)\rangle \le \|y - y^\delta\|\,\|A(x_\alpha^\delta - x_\alpha)\| \le \delta\,\|A(x_\alpha^\delta - x_\alpha)\|.
\]
Thus, the data discrepancy can be estimated as $\|A(x_\alpha^\delta - x_\alpha)\| \le 2\delta$. Now, from
\[
\frac{1}{2}\|A(x_\alpha^\delta - x_\alpha)\|^2 + \alpha D_{\xi_\alpha}(x_\alpha^\delta, x_\alpha) \le -\langle y - y^\delta, A(x_\alpha^\delta - x_\alpha)\rangle \le \delta\,\|A(x_\alpha^\delta - x_\alpha)\| \le \frac{1}{2}\delta^2 + \frac{1}{2}\|A(x_\alpha^\delta - x_\alpha)\|^2,
\]
we get the estimate for the data propagation error. For the total error and discrepancy estimates, we begin by recalling that
\[
\frac{1}{2}\|Ax_\alpha^\delta - y^\delta\|^2 + \alpha D_{\xi_\alpha}(x_\alpha^\delta, x_\alpha) \le \frac{1}{2}\|Ax_\alpha - y^\delta\|^2 - \alpha\langle\xi_\alpha, x_\alpha^\delta - x_\alpha\rangle,
\]
which implies
\begin{align*}
\|Ax_\alpha^\delta - y^\delta\|^2 &\le \|Ax_\alpha - y^\delta\|^2 + 2\langle A^*(Ax_\alpha - y), x_\alpha^\delta - x_\alpha\rangle\\
&\le (2\|\omega\|\alpha + \delta)^2 + 2\langle Ax_\alpha - y, A(x_\alpha^\delta - x_\alpha)\rangle\\
&\le (2\|\omega\|\alpha + \delta)^2 + 2\|Ax_\alpha - y\|\,\|A(x_\alpha^\delta - x_\alpha)\|\\
&\le C\|\omega\|^2\alpha^2 + C\delta^2 \le (C\delta + C\|\omega\|\alpha)^2,
\end{align*}
thereby yielding the estimate for the total discrepancy. Finally, for the total error, we may use the optimality of $x_\alpha^\delta$ to estimate
\[
\frac{1}{2}\|Ax_\alpha^\delta - y^\delta\|^2 + \alpha R(x_\alpha^\delta) \le \frac{\delta^2}{2} + \alpha R(x^\dagger),
\]
which is equivalent to
\[
\frac{1}{2\alpha}\|Ax_\alpha^\delta - y^\delta\|^2 \le \frac{\delta^2}{2\alpha} + R(x^\dagger) - R(x_\alpha^\delta).
\]
Therefore,
\[
\frac{1}{2\alpha}\|Ax_\alpha^\delta - y^\delta\|^2 + D_\xi(x_\alpha^\delta, x^\dagger) \le \frac{\delta^2}{2\alpha} + R(x^\dagger) - R(x_\alpha^\delta) + R(x_\alpha^\delta) - R(x^\dagger) - \langle\xi, x_\alpha^\delta - x^\dagger\rangle \le -\langle A^*\omega, x_\alpha^\delta - x^\dagger\rangle + \frac{\delta^2}{2\alpha},
\]
which allows us to estimate the total error as
\begin{align*}
D_\xi(x_\alpha^\delta, x^\dagger) &\le -\langle\omega, A(x_\alpha^\delta - x_\alpha + x_\alpha - x^\dagger)\rangle + \frac{\delta^2}{2\alpha} - \frac{1}{2\alpha}\|Ax_\alpha^\delta - y^\delta\|^2\\
&\le \|\omega\|\,\|A(x_\alpha^\delta - x_\alpha)\| + \|\omega\|\,\|Ax_\alpha - y\| + \frac{\delta^2}{2\alpha}\\
&\le C\|\omega\|\delta + C\|\omega\|^2\alpha + \frac{\delta^2}{2\alpha}\\
&\le C\Big(\frac{\delta}{\sqrt{\alpha}} + \|\omega\|\sqrt{\alpha}\Big)^2,
\end{align*}
which completes the proof [78].
Corollary 5. Let the source condition (3.3) hold. Then we have
\[
D_\xi(x_\alpha^\delta, x^\dagger) \le D_{\xi_\alpha}(x_\alpha^\delta, x_\alpha) + D_\xi(x_\alpha, x^\dagger) + C\|\omega\|\delta, \qquad (3.6)
\]
for all $\alpha \in (0,\alpha_{\max})$ and $y, y^\delta \in Y$.
Proof. It follows from (B.3), with $x_\alpha^\delta = x_1$, $x^\dagger = x_2$ and $x_\alpha = x_3$, that
\[
D_\xi(x_\alpha^\delta, x^\dagger) = D_{\xi_\alpha}(x_\alpha^\delta, x_\alpha) + D_\xi(x_\alpha, x^\dagger) + \langle\xi_\alpha - \xi, x_\alpha^\delta - x_\alpha\rangle.
\]
Observe that the last term can be estimated as
\begin{align*}
\langle\xi_\alpha - \xi, x_\alpha^\delta - x_\alpha\rangle &= -\Big\langle \frac{1}{\alpha}A^*(Ax_\alpha - y) + A^*\omega,\, x_\alpha^\delta - x_\alpha\Big\rangle\\
&= -\frac{1}{\alpha}\langle Ax_\alpha - y, A(x_\alpha^\delta - x_\alpha)\rangle - \langle\omega, A(x_\alpha^\delta - x_\alpha)\rangle\\
&\le \|A(x_\alpha^\delta - x_\alpha)\|\Big(\frac{1}{\alpha}\|Ax_\alpha - y\| + \|\omega\|\Big)\\
&\le C\delta(C\|\omega\| + \|\omega\|) = C\|\omega\|\delta,
\end{align*}
where we used the estimates of the previous proposition.
Theorem 14. Let the source condition (3.3) hold. Then
\[
D_\xi(x_\alpha^\delta, x^\dagger) \to 0, \quad \text{as } \delta \to 0.
\]
Proof. We may estimate the error in three parts, as in (3.6), with the first and second terms corresponding to the data propagation and approximation errors, respectively. Subsequently, from Proposition 19, we obtain
\[
D_\xi(x_\alpha^\delta, x^\dagger) \le C\,\frac{\delta^2}{\alpha} + C\alpha + C\delta. \qquad (3.7)
\]
Thus, choosing $\alpha = \alpha(\delta)$ such that $\delta^2/\alpha \to 0$ and $\alpha \to 0$ as $\delta \to 0$ yields the desired result.
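A one-dimensional sanity check of this convergence result (our illustration; $A$ is the scalar $\lambda$, $R(x) = |x|^q/q$, and the choice $\alpha = \delta$ satisfies the conditions of the proof): the Bregman-distance error decays as the noise level shrinks.

```python
def tikhonov_1d(lam, y_delta, alpha, q, lo=-10.0, hi=10.0):
    # Solve the optimality condition lam*(lam*x - y) + alpha*|x|^(q-1)*sgn(x) = 0
    # by bisection; the left-hand side is strictly increasing in x.
    def g(x):
        s = 1.0 if x >= 0 else -1.0
        return lam * (lam * x - y_delta) + alpha * abs(x)**(q - 1) * s
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def bregman_err(u, x_true, q):
    # Bregman distance D_xi(u, x_true) for R(x) = |x|^q / q.
    R = lambda x: abs(x)**q / q
    xi = abs(x_true)**(q - 1) * (1.0 if x_true >= 0 else -1.0)
    return R(u) - R(x_true) - xi * (u - x_true)

lam, x_true, q = 0.5, 1.0, 1.5
errs = []
for delta in (1e-1, 1e-2, 1e-3):
    y_delta = lam * x_true + delta          # worst-case perturbation of size delta
    u = tikhonov_1d(lam, y_delta, alpha=delta, q=q)
    errs.append(bregman_err(u, x_true, q))
print(errs)  # decreasing towards 0
```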
In this setting, we can also generalise iterated Tikhonov regularisation (2.9) to a convex analogue, namely Bregman iteration, which is defined as
\[
x_{\alpha,k}^\delta \in \operatorname*{argmin}_{x\in X}\; \frac{1}{2}\|Ax - y^\delta\|^2 + \alpha D_{\xi_{\alpha,k-1}^\delta}(x, x_{\alpha,k-1}^\delta), \qquad (3.8)
\]
with $\xi_{\alpha,k-1}^\delta \in \partial R(x_{\alpha,k-1}^\delta)$ for $k \in \mathbb{N}$. For certain parameter choice rules, we are particularly interested in the second Bregman iterate (cf. [17, 124, 154]), which we denote by
\[
x_{\alpha,\delta}^{II} \in \operatorname*{argmin}_{x\in X}\; \frac{1}{2}\|Ax - y^\delta\|^2 + \alpha D_{\xi_\alpha^\delta}(x, x_\alpha^\delta). \qquad (3.9)
\]
Note that the second Bregman iterate may be computed by minimising a slightly simpler expression which does not involve the Bregman distance [153]:
Proposition 20. We may compute the second Bregman iterate as
\[
x_{\alpha,\delta}^{II} \in \operatorname*{argmin}_{x\in X}\; \frac{1}{2}\big\|Ax - y^\delta - (y^\delta - Ax_\alpha^\delta)\big\|^2 + \alpha R(x),
\]
for all $\alpha \in (0,\alpha_{\max})$.
Proof. Expanding the second term in (3.9) and using the definition of $\xi_\alpha^\delta$, we see that $x_{\alpha,\delta}^{II}$ minimises
\begin{align*}
&\frac{1}{2}\|Ax - y^\delta\|^2 + \alpha R(x) - \alpha R(x_\alpha^\delta) - \alpha\langle\xi_\alpha^\delta, x - x_\alpha^\delta\rangle\\
&= \frac{1}{2}\|Ax - y^\delta\|^2 + \langle Ax_\alpha^\delta - y^\delta, A(x - x_\alpha^\delta)\rangle + \alpha R(x) - \alpha R(x_\alpha^\delta)\\
&= \frac{1}{2}\big\|Ax - y^\delta - (y^\delta - Ax_\alpha^\delta)\big\|^2 + \alpha R(x) - \frac{3}{2}\|Ax_\alpha^\delta - y^\delta\|^2 - \alpha R(x_\alpha^\delta),
\end{align*}
where we completed the square in the last step. The last two terms do not depend on $x$; hence, the result follows.
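In the quadratic case $R = \frac{1}{2}\|\cdot\|^2$, both sides of the equivalence in Proposition 20 have closed forms on a scalar problem, so the claim can be checked directly (an illustration of ours):

```python
def tikhonov(lam, y, alpha):
    # argmin (1/2)(lam*x - y)^2 + (alpha/2) x^2  (quadratic R)
    return lam * y / (lam**2 + alpha)

def bregman_step(lam, y, alpha, x_prev):
    # argmin (1/2)(lam*x - y)^2 + (alpha/2)(x - x_prev)^2, the Bregman
    # iteration (3.8) for R = (1/2)|.|^2.
    return (lam * y + alpha * x_prev) / (lam**2 + alpha)

lam, y, alpha = 0.7, 1.3, 0.05
x1 = tikhonov(lam, y, alpha)
x2_bregman = bregman_step(lam, y, alpha, x1)
x2_shifted = tikhonov(lam, y + (y - lam * x1), alpha)  # shifted data, Proposition 20
print(abs(x2_bregman - x2_shifted))  # agrees up to rounding
```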
For the residual vectors of the second Bregman iterate, we stick to the notation
\[
p_{\alpha,\delta}^{II} := y^\delta - Ax_{\alpha,\delta}^{II}, \qquad p_\alpha^{II} := y - Ax_\alpha^{II},
\]
for noisy and exact data, respectively. Subsequently, we define:
\[
\Delta p_\alpha^{II} := p_{\alpha,\delta}^{II} - p_\alpha^{II}.
\]
Proposition 21. We have
\[
\|p_{\alpha,\delta}^{II}\| \le \|p_\alpha^\delta\| \qquad\text{and}\qquad R(x_\alpha^\delta) \le R(x_{\alpha,\delta}^{II}), \qquad (3.10)
\]
for all $\alpha \in (0,\alpha_{\max})$.
Proof. From the optimality of $x_{\alpha,\delta}^{II}$, it follows that
\[
\frac{1}{2}\|p_{\alpha,\delta}^{II}\|^2 + \alpha D_{\xi_\alpha^\delta}(x_{\alpha,\delta}^{II}, x_\alpha^\delta) \le \frac{1}{2}\|p_\alpha^\delta\|^2 + \alpha D_{\xi_\alpha^\delta}(x_\alpha^\delta, x_\alpha^\delta) = \frac{1}{2}\|p_\alpha^\delta\|^2,
\]
i.e.,
\[
\frac{1}{2}\|p_{\alpha,\delta}^{II}\|^2 \le \frac{1}{2}\|p_\alpha^\delta\|^2 - \alpha D_{\xi_\alpha^\delta}(x_{\alpha,\delta}^{II}, x_\alpha^\delta) \le \frac{1}{2}\|p_\alpha^\delta\|^2,
\]
which follows from the non-negativity of the Bregman distance.
Similarly as for the Tikhonov functional (i.e., (3.2)), we can state the optimality condition for the Bregman functional in the same manner:
\[
0 \in A^*\big(Ax_{\alpha,\delta}^{II} - y^\delta - (y^\delta - Ax_\alpha^\delta)\big) + \alpha\,\partial R(x_{\alpha,\delta}^{II}).
\]
It follows immediately as a consequence of the above that one can define a specific selection of the subgradient of $R$ at $x_{\alpha,\delta}^{II}$ as
\[
\xi_{\alpha,\delta}^{II} := -\frac{1}{\alpha}A^*\big(Ax_{\alpha,\delta}^{II} - y^\delta - (y^\delta - Ax_\alpha^\delta)\big) \in \partial R(x_{\alpha,\delta}^{II}),
\]
and denote by $\xi_\alpha^{II}$ its respective noise-free variant.
Proposition 22. The residuals $p_\alpha^\delta$, $p_{\alpha,\delta}^{II}$ may be expressed in terms of a proximal mapping operator $\operatorname{prox}_{\mathcal{J}} : Y \to Y$,
\[
\operatorname{prox}_{\mathcal{J}} = (I + \partial\mathcal{J})^{-1}, \qquad (3.11)
\]
with the functional $\mathcal{J} := \alpha R^* \circ \frac{1}{\alpha}A^*$, in the following form:
\[
p_\alpha^\delta = \operatorname{prox}_{\mathcal{J}}(y^\delta), \qquad p_{\alpha,\delta}^{II} = \operatorname{prox}_{\mathcal{J}}(y^\delta + p_\alpha^\delta) - p_\alpha^\delta. \qquad (3.12)
\]
Proof. It follows from the optimality condition for the Tikhonov functional that
\[
-\frac{1}{\alpha}A^*(Ax_\alpha^\delta - y^\delta) \in \partial R(x_\alpha^\delta) \iff x_\alpha^\delta \in \partial R^*\Big(-\frac{1}{\alpha}A^*(Ax_\alpha^\delta - y^\delta)\Big), \qquad (3.13)
\]
since $x^* \in \partial R(x) \iff x \in \partial R^*(x^*)$ for all $x \in X$ and $x^* \in X^*$ (cf. [32]). Furthermore, (3.13) is equivalent to
\[
Ax_\alpha^\delta - y^\delta \in A\,\partial R^*\Big(-\frac{1}{\alpha}A^*(Ax_\alpha^\delta - y^\delta)\Big) - y^\delta \iff (-p_\alpha^\delta) - A\,\partial R^*\Big(\frac{1}{\alpha}A^* p_\alpha^\delta\Big) \ni -y^\delta.
\]
We can rewrite the above as
\[
\big(I + A\,\partial R^*(A^*\cdot/\alpha)\big)(p_\alpha^\delta) \ni y^\delta \iff \big(I + \alpha\,\partial(R^*(A^*\cdot/\alpha))\big)(p_\alpha^\delta) \ni y^\delta,
\]
due to the identity $A(\partial R^*(A^*\cdot)) = \partial(R^*(A^*\cdot))$, which holds true if $R^*$ is finite and continuous at a point in the range of $A^*$ (cf. [32, Prop. 5.7]), for instance at $0 = A^*0$. By a result of Rockafellar [137, Thm. 4C, Thm. 7A], this follows if $R$ has bounded sublevel sets, which is a consequence of the assumed coercivity. By the definition of the proximal mapping, (3.12) follows for $p_\alpha^\delta$, and analogously for $p_{\alpha,\delta}^{II}$.
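For a single component with $A = \lambda$ and $R = \frac{1}{2}|\cdot|^2$, one has $\mathcal{J}(p) = \lambda^2 p^2/(2\alpha)$ and $\operatorname{prox}_{\mathcal{J}}$ is linear, so the representation (3.12) can be verified numerically (a sketch under these simplifying assumptions):

```python
def prox_J(p, lam, alpha):
    # prox of J(p) = lam^2 * p^2 / (2*alpha): solve z + (lam^2/alpha)*z = p.
    return p / (1.0 + lam**2 / alpha)

lam, alpha, y_delta = 0.8, 0.02, 1.1

# Residuals of the first and second iterates computed directly ...
x1 = lam * y_delta / (lam**2 + alpha)
p_direct = y_delta - lam * x1
x2 = (lam * y_delta + alpha * x1) / (lam**2 + alpha)
p2_direct = y_delta - lam * x2

# ... and via the proximal representation (3.12):
p_prox = prox_J(y_delta, lam, alpha)
p2_prox = prox_J(y_delta + p_direct, lam, alpha) - p_direct
print(p_direct, p_prox, p2_direct, p2_prox)
```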
3.2 Parameter Choice Rules
Similarly as before, we define a $\psi$-functional-based heuristic parameter choice rule by
\[
\alpha_* \in \operatorname*{argmin}_{\alpha\in(0,\alpha_{\max})} \psi(\alpha, y^\delta). \qquad (3.14)
\]
In the linear theory, $\alpha_{\max}$ is usually chosen as $\|A\|^2$. In the convex case, its choice is irrelevant for the convergence theory; in practice, one might choose it similarly as in the linear case. In case no $\alpha_*$ exists in the defined interval, the reader is referred to [83, (2.13), p. 237], which details how one can extend the definition of $\psi$ to overcome that issue. For simplicity, we persevere with (3.14) and simply assume existence.
The conceptual basis for (3.14) in this setting is that the surrogate functional $\psi$ should act as an error estimator of the form $\psi \sim D_\xi(x_\alpha^\delta, x^\dagger)$, so that its minimiser should, in theory, be a good approximation of the optimal parameter choice. As in the linear theory, this error-estimating behaviour is not guaranteed to hold in general (due to the Bakushinskii veto) unless restrictions on the noise are postulated, cf. (2.17) in Chapter 2.
Since the Bregman distance in the linear case corresponds to the norm squared, we henceforth, in this chapter, redefine the $\psi$-functionals so that they coincide in the linear case with the squared versions of the ones in,
e.g., Chapters 1 and 2 (i.e., $\psi_{HD}$ in this chapter corresponds to $\psi_{HD}^2$ of Chapter 2).
As we continue to assume that $Y$ is a Hilbert space, and the heuristic discrepancy (cf. [79, 80, 154]), Hanke–Raus and heuristic monotone error rules are defined in the $Y$ space, the expressions for their respective functionals may remain unchanged, with the consideration that $x_{\alpha,\delta}^{II}$ would be the second Bregman iterate rather than the second Tikhonov iterate:
\[
\psi_{HD}(\alpha, y^\delta) := \frac{1}{\alpha}\|Ax_\alpha^\delta - y^\delta\|^2, \qquad \psi_{HR}(\alpha, y^\delta) := \frac{1}{\alpha}\langle Ax_{\alpha,\delta}^{II} - y^\delta, Ax_\alpha^\delta - y^\delta\rangle.
\]
However, the quasi-optimality and simple-L rules were defined in the $X$ space, and since we do not prove convergence in the strong topology of $X$, this would perhaps not be the appropriate formulation of these rules in the present setting. Indeed, we prove convergence with respect to the Bregman distance; its general non-symmetry, however, opens the possibility of several ways of defining the quasi-optimality functional (cf. [86]). Indeed, we define
\[
\psi_{RQO}(\alpha, y^\delta) := D_{\xi_\alpha^\delta}(x_{\alpha,\delta}^{II}, x_\alpha^\delta), \qquad \psi_{LQO}(\alpha, y^\delta) := D_{\xi_{\alpha,\delta}^{II}}(x_\alpha^\delta, x_{\alpha,\delta}^{II}), \qquad (3.15)
\]
\[
\psi_{SQO}(\alpha, y^\delta) := D^{\mathrm{sym}}_{\xi_{\alpha,\delta}^{II},\,\xi_\alpha^\delta}(x_{\alpha,\delta}^{II}, x_\alpha^\delta), \qquad (3.16)
\]
as the right, left and symmetric quasi-optimality functionals, respectively. For the simple-L rule, there also exists a plethora of options, one being
\[
\psi_{L}(\alpha, y^\delta) := R(x_{\alpha,\delta}^{II}) - R(x_\alpha^\delta), \qquad (3.17)
\]
which we will not explore in this section. However, it does make an appearance in Chapter 7, where we test it numerically, and this is thus far the author's preferred definition of $\psi_L$ in this setting. Of course, this does not coincide exactly with $\psi_L$ as defined in (2.12) in Chapter 2, since there,
\[
\psi_L(\alpha, y^\delta) = \langle x_\alpha^\delta, x_\alpha^\delta - x_{\alpha,\delta}^{II}\rangle = \|x_\alpha^\delta\|^2 - \langle x_\alpha^\delta, x_{\alpha,\delta}^{II}\rangle.
\]
Note that we omit $\psi_{LQO}$ from the following analysis as, in preliminary numerical tests, it performed very poorly in comparison to the other rules (which are tested in Chapter 6); this was also the conclusion of [86]. As in the previous settings, it is also possible to define the ratio rules in the convex setting, for instance by replacing the $\|x_\alpha^\delta\|^2$ from the expressions in the linear setting with $R(x_\alpha^\delta)$.
Similarly as for the heuristic discrepancy and Hanke–Raus rules, the symmetric quasi-optimality functional may also be expressed in terms of residuals:
Proposition 23. We have that
\[
\psi_{SQO}(\alpha, y^\delta) = \frac{1}{\alpha}\langle p_\alpha^\delta - p_{\alpha,\delta}^{II}, p_{\alpha,\delta}^{II}\rangle,
\]
for all $\alpha \in (0,\alpha_{\max})$ and $y^\delta \in Y$.
Proof. From the definition of the symmetric Bregman distance (cf. (B.4)), we have
\begin{align*}
\psi_{SQO}(\alpha, y^\delta) = D^{\mathrm{sym}}_{\xi_{\alpha,\delta}^{II},\,\xi_\alpha^\delta}(x_{\alpha,\delta}^{II}, x_\alpha^\delta) &= \langle\xi_{\alpha,\delta}^{II} - \xi_\alpha^\delta,\, x_{\alpha,\delta}^{II} - x_\alpha^\delta\rangle\\
&= \frac{1}{\alpha}\big\langle A^*(Ax_\alpha^\delta - y^\delta) - A^*\big(Ax_{\alpha,\delta}^{II} - y^\delta + Ax_\alpha^\delta - y^\delta\big),\, x_{\alpha,\delta}^{II} - x_\alpha^\delta\big\rangle\\
&= \frac{1}{\alpha}\langle A(x_\alpha^\delta - x_{\alpha,\delta}^{II}),\, Ax_{\alpha,\delta}^{II} - y^\delta\rangle\\
&= \frac{1}{\alpha}\big\langle Ax_\alpha^\delta - y^\delta - (Ax_{\alpha,\delta}^{II} - y^\delta),\, Ax_{\alpha,\delta}^{II} - y^\delta\big\rangle,
\end{align*}
for all $\alpha \in (0,\alpha_{\max})$ and $y^\delta \in Y$, which is what we wanted to show.
The proposition below provides some useful estimates:
Proposition 24. For all $\alpha \in (0,\alpha_{\max})$ and $y^\delta \in Y$, we have
\[
0 \le \psi_{RQO}(\alpha, y^\delta) \le \psi_{SQO}(\alpha, y^\delta) \le \psi_{HR}(\alpha, y^\delta) \le \psi_{HD}(\alpha, y^\delta). \qquad (3.18)
\]
Moreover, if (3.3) holds, then
\[
\psi_{HD}(\alpha, y^\delta) \le \Big(\frac{\delta}{\sqrt{\alpha}} + 2\|\omega\|\sqrt{\alpha}\Big)^2. \qquad (3.19)
\]
In particular, with (3.3) and $\alpha_*$ selected as in (3.14) with $\psi \in \{\psi_{HD}, \psi_{HR}, \psi_{SQO}, \psi_{RQO}\}$, we have that
\[
\lim_{\delta\to0}\psi(\alpha_*, y^\delta) = 0. \qquad (3.20)
\]
Proof. Since
\[
\psi_{SQO}(\alpha, y^\delta) = \psi_{RQO}(\alpha, y^\delta) + D_{\xi_{\alpha,\delta}^{II}}(x_\alpha^\delta, x_{\alpha,\delta}^{II}),
\]
it follows immediately that $\psi_{SQO} \ge \psi_{RQO}$. Moreover, from Proposition 23 and Proposition 22, we get that
\begin{align*}
\psi_{SQO}(\alpha, y^\delta) = \frac{1}{\alpha}\langle p_\alpha^\delta - p_{\alpha,\delta}^{II}, p_{\alpha,\delta}^{II}\rangle &= \frac{1}{\alpha}\langle p_\alpha^\delta, p_{\alpha,\delta}^{II}\rangle - \frac{1}{\alpha}\|p_{\alpha,\delta}^{II}\|^2 \le \frac{1}{\alpha}\langle p_\alpha^\delta, p_{\alpha,\delta}^{II}\rangle\\
&= \psi_{HR}(\alpha, y^\delta) \stackrel{(3.12)}{=} \frac{1}{\alpha}\big\langle \operatorname{prox}_{\mathcal{J}}(y^\delta + p_\alpha^\delta) - \operatorname{prox}_{\mathcal{J}}(y^\delta),\; y^\delta + p_\alpha^\delta - y^\delta\big\rangle\\
&\stackrel{\mathrm{(B.7)}}{\le} \frac{1}{\alpha}\big\|\operatorname{prox}_{\mathcal{J}}(y^\delta + p_\alpha^\delta) - \operatorname{prox}_{\mathcal{J}}(y^\delta)\big\|\,\|p_\alpha^\delta\|\\
&\stackrel{(3.10)}{\le} \frac{1}{\alpha}\|p_\alpha^\delta\|^2 = \psi_{HD}(\alpha, y^\delta).
\end{align*}
The last estimate (3.19) follows from (3.5). As the respective $\alpha_*$ are the minimisers, we may estimate $\psi(\alpha_*, y^\delta)$ by the minimum over $\alpha$ of (3.19), which is easily shown to be of the order of $\delta$ and thus tends to 0.
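The chain (3.18) can also be observed numerically. In the quadratic diagonal case (our illustrative setting), the residuals of the first and second iterates have the closed forms $p_n = \alpha y_n/(\lambda_n^2+\alpha)$ and $p_n^{II} = \alpha^2 y_n/(\lambda_n^2+\alpha)^2$, and the Bregman distances reduce to (half of) squared norms:

```python
lambdas = [2.0**(-k) for k in range(8)]
y_delta = [1.0 / (k + 1) for k in range(8)]
alpha = 0.1

# Residuals of the first and second (Bregman = iterated Tikhonov) iterates:
p1 = [alpha * y / (lam**2 + alpha) for lam, y in zip(lambdas, y_delta)]
p2 = [alpha**2 * y / (lam**2 + alpha)**2 for lam, y in zip(lambdas, y_delta)]

psi_hd = sum(r * r for r in p1) / alpha
psi_hr = sum(r2 * r1 for r1, r2 in zip(p1, p2)) / alpha
psi_sqo = sum((r1 - r2) * r2 for r1, r2 in zip(p1, p2)) / alpha
# For R = (1/2)||.||^2 the Bregman distance is symmetric, so the
# right quasi-optimality value is half the symmetric one.
psi_rqo = 0.5 * psi_sqo
print(psi_rqo, psi_sqo, psi_hr, psi_hd)  # nondecreasing order, as in (3.18)
```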
3.2.1 Convergence Analysis
As observed in the previous chapters, for heuristic rules in the linear case, it is standard to show convergence of the selected regularisation parameter $\alpha_*$ to zero as the noise level tends to zero. This does not necessarily hold in the convex case, as we shall see. The next lemma deals with the (exceptional) case in which $\lim_{\delta\to0}\alpha_* \neq 0$.
Lemma 10. Assume that $A : X \to Y$ is compact. Let $\alpha_*$ be the minimiser of $\psi \in \{\psi_{HD}, \psi_{HR}, \psi_{SQO}, \psi_{RQO}\}$. In the case of $\psi_{SQO}$, $\psi_{RQO}$, we additionally assume that $R$ is strictly convex. Suppose that $\lim_{\delta\to0}\alpha_* = \bar\alpha > 0$. Then
\[
D_\xi(x_{\alpha_*}^\delta, x^\dagger) \to 0, \quad \text{as } \delta \to 0.
\]
Proof. We show that any subsequence of $D_\xi(x_{\alpha_*}^\delta, x^\dagger)$ has a convergent subsequence with limit 0. In the case of $\psi_{HD}$, the result follows from [78, proof of Thm. 3.5], even without the compactness assumption on $A$. From the same proof, it follows also that $x_{\alpha_*}^\delta$ is bounded, and similarly we may show that $x_{\alpha_*,\delta}^{II}$ is bounded as $\delta \to 0$. Hence, there exist weakly (or weakly-$*$) convergent subsequences with
\[
x_{\alpha_*}^\delta \rightharpoonup v, \qquad x_{\alpha_*,\delta}^{II} \rightharpoonup z.
\]
In [78], it was shown that $v = x_{\bar\alpha}$, i.e., the limit of $x_{\alpha_*}^\delta$ is the regularised solution for exact data and regularisation parameter $\bar\alpha$. Now, using the compactness of $A$, we may find strongly convergent subsequences for the residuals as $\delta \to 0$:
\[
p_{\alpha_*}^\delta \to y - Ax_{\bar\alpha} =: p_{\bar\alpha}, \qquad p_{\alpha_*,\delta}^{II} \to y - Az.
\]
From lower semicontinuity, the minimisation property of $x_{\alpha_*,\delta}^{II}$, and the strong convergence of $p_{\alpha_*}^\delta$, we obtain that for any $x \in X$
\begin{align*}
\frac{1}{2}\|Az - y - (y - Ax_{\bar\alpha})\|^2 + \bar\alpha R(z)
&\le \liminf_{\delta\to0}\; \frac{1}{2}\big\|Ax_{\alpha_*,\delta}^{II} - y^\delta - (y^\delta - Ax_{\alpha_*}^\delta)\big\|^2 + \alpha_* R(x_{\alpha_*,\delta}^{II})\\
&\le \liminf_{\delta\to0}\; \frac{1}{2}\big\|Ax - y^\delta - (y^\delta - Ax_{\alpha_*}^\delta)\big\|^2 + \alpha_* R(x)\\
&= \frac{1}{2}\|Ax - y - (y - Ax_{\bar\alpha})\|^2 + \bar\alpha R(x).
\end{align*}
Hence, $z$ is the minimiser of the functional on the left-hand side and, by its uniqueness, it follows that $z = x_{\bar\alpha}^{II}$, and thus $p_{\alpha_*,\delta}^{II} \to y - Ax_{\bar\alpha}^{II} =: p_{\bar\alpha}^{II}$. From the optimality conditions, we furthermore obtain the strong convergence of $\xi_{\alpha_*}^\delta \to \xi_{\bar\alpha}$.
Consider $\psi_{HR}$: it follows from (B.7) that
\[
\frac{1}{\alpha_*}\|p_{\alpha_*,\delta}^{II}\|^2 \le \frac{1}{\alpha_*}\langle p_{\alpha_*}^\delta, p_{\alpha_*,\delta}^{II}\rangle = \psi_{HR}(\alpha_*, y^\delta),
\]
and by (3.20), we find that $p_{\bar\alpha}^{II} = 0$. Since $p_{\bar\alpha}^{II} = \operatorname{prox}_{\mathcal{J}}(y + p_{\bar\alpha}) - p_{\bar\alpha}$, it follows that
\[
\operatorname{prox}_{\mathcal{J}}(y + p_{\bar\alpha}) = p_{\bar\alpha} = \operatorname{prox}_{\mathcal{J}}(y).
\]
As $\operatorname{prox}_{\mathcal{J}}$ is bijective, we obtain that $p_{\bar\alpha} = 0$. From this, the result follows as in [78, Proof of Thm. 3.5].
In the case of $\psi_{SQO}$, $\psi_{RQO}$, we have, by (3.20), that $D_{\xi_{\alpha_*}^\delta}(x_{\alpha_*,\delta}^{II}, x_{\alpha_*}^\delta) \to 0$, and from the lower semicontinuity and the strong convergence of the subgradients $\xi_{\alpha_*}^\delta$, it follows that
\[
D_{\xi_{\bar\alpha}}(x_{\bar\alpha}^{II}, x_{\bar\alpha}) = 0.
\]
By the assumed strict convexity of $R$, the Bregman distance is strictly positive for nonidentical arguments; thus $x_{\bar\alpha}^{II} = x_{\bar\alpha}$, and hence $p_{\bar\alpha}^{II} = p_{\bar\alpha}$. Employing the proximal representation, we thus have that
\[
\operatorname{prox}_{\mathcal{J}}(y + p_{\bar\alpha}) - p_{\bar\alpha} = p_{\bar\alpha} \;\iff\; 2p_{\bar\alpha} + \partial\mathcal{J}(2p_{\bar\alpha}) = y + p_{\bar\alpha}
\]
\[
\iff\; p_{\bar\alpha} + \partial\mathcal{J}(2p_{\bar\alpha}) = y = p_{\bar\alpha} + \partial\mathcal{J}(p_{\bar\alpha}) \;\iff\; \partial\mathcal{J}(2p_{\bar\alpha}) = \partial\mathcal{J}(p_{\bar\alpha}).
\]
The strict convexity implies $2p_{\bar\alpha} = p_{\bar\alpha}$, hence $p_{\bar\alpha} = 0$. The result then follows as before from those in [78].
Noise Restriction
As discussed, the auto-regularisation condition (1.38) becomes void in this setting, since the dominating term $\psi(\alpha, y - y^\delta)$ no longer makes sense when the regularisation operator is nonlinear. To see this, observe first that the $\psi$-functionals can no longer be expressed in spectral form à la (1.34); thus, there are no linear filter functions as such to act on the vector $y - y^\delta$. Instead, we stipulate an alternative auto-regularisation condition, as suggested in [90]. In the following, we state the main convergence theorem of this section along with the analogous auto-regularisation conditions:
Theorem 15. Let $A : X \to Y$ be compact, let the source condition (3.3) be satisfied, let $\alpha_*$ be the minimiser of $\psi \in \{\psi_{HD}, \psi_{HR}, \psi_{SQO}, \psi_{RQO}\}$, and assume there exist constants $C > 0$ such that the respective auto-regularisation condition
\[
\langle\Delta y - \Delta p_\alpha, \Delta p_\alpha\rangle \le C\|\Delta p_\alpha\|^2, \qquad \text{(ARC-HD)}
\]
\[
\langle\Delta y - \Delta p_\alpha, \Delta p_\alpha\rangle \le C\langle\Delta p_\alpha^{II}, \Delta p_\alpha\rangle, \qquad \text{(ARC-HR)}
\]
\[
\langle\Delta y - \Delta p_\alpha, \Delta p_\alpha\rangle \le C\langle\Delta p_\alpha - \Delta p_\alpha^{II}, \Delta p_\alpha^{II}\rangle, \qquad \text{(ARC-SQR)}
\]
\[
\frac{1}{\alpha}\langle\Delta y - \Delta p_\alpha, \Delta p_\alpha\rangle \le C D_{\xi_\alpha^\delta}(x_{\alpha,\delta}^{II}, x_\alpha^\delta) + O(\alpha), \qquad \text{(ARC-RQO)}
\]
holds for all $\alpha \in (0,\alpha_{\max})$ and $y^\delta \in Y$ for the heuristic discrepancy, Hanke–Raus, symmetric quasi-optimality and right quasi-optimality rules, respectively. If $\psi \in \{\psi_{SQO}, \psi_{RQO}\}$, assume in addition that $R$ is strictly convex. Then it follows that the method converges; i.e.,
\[
D_\xi(x_{\alpha_*}^\delta, x^\dagger) \to 0, \quad \text{as } \delta \to 0.
\]
Proof. Take an arbitrary subsequence of $D_\xi(x_{\alpha_*}^\delta, x^\dagger)$. We show that it contains a convergent subsequence with limit 0, which proves the statement. Since $\alpha_*$ is bounded, we may consider a subsequence which, in addition, satisfies $\alpha_* \to \bar\alpha$. In case $\alpha_* \to \bar\alpha > 0$, the result $\lim_{\delta\to0} D_\xi(x_{\alpha_*}^\delta, x^\dagger) = 0$ follows from Lemma 10, without even needing to invoke the auto-regularisation conditions. Thus, let us assume that $\alpha_* \to 0$.
Then it follows from the estimate (3.6) and Proposition 19 that one only needs to prove convergence of the data propagation error $D_{\xi_\alpha}(x_\alpha^\delta, x_\alpha)$, which we can immediately estimate via (3.4) and the respective auto-regularisation condition in Theorem 15.
For $\alpha_*$ minimising the heuristic discrepancy functional, it follows from Proposition 19 and (3.20) that $\lim_{\delta\to0}\psi_{HD}(\alpha_*, y) = 0$ and $\lim_{\delta\to0}\psi_{HD}(\alpha_*, y^\delta) = 0$. Thus, we may conclude that
\[
\frac{\|\Delta p_{\alpha_*}\|^2}{\alpha_*} \le \Big(\frac{\|p_{\alpha_*}^\delta\|}{\sqrt{\alpha_*}} + \frac{\|p_{\alpha_*}\|}{\sqrt{\alpha_*}}\Big)^2 = \Big(\sqrt{\psi_{HD}(\alpha_*, y^\delta)} + \sqrt{\psi_{HD}(\alpha_*, y)}\Big)^2 \to 0,
\]
as $\delta \to 0$. Hence, using (3.4) and (ARC-HD) yields that $D_{\xi_{\alpha_*}}(x_{\alpha_*}^\delta, x_{\alpha_*}) \to 0$ as $\delta \to 0$. For the approximation error, it follows from Proposition 19 that $D_\xi(x_{\alpha_*}, x^\dagger) \le C\alpha_* \to 0$. Therefore, each term in (3.6) tends to 0 as $\delta \to 0$.
Let $\alpha_*$ be the minimiser of the Hanke–Raus functional. Then, as before, we estimate the Bregman distance by (3.4) and, from (ARC-HR), deduce that
\begin{align*}
D_{\xi_{\alpha_*}}(x_{\alpha_*}^\delta, x_{\alpha_*}) &\le \frac{C}{\alpha_*}\langle\Delta p_{\alpha_*}^{II}, \Delta p_{\alpha_*}\rangle = \frac{C}{\alpha_*}\langle p_{\alpha_*,\delta}^{II}, \Delta p_{\alpha_*}\rangle - \frac{C}{\alpha_*}\langle p_{\alpha_*}^{II}, \Delta p_{\alpha_*}\rangle\\
&= \frac{C}{\alpha_*}\langle p_{\alpha_*,\delta}^{II}, p_{\alpha_*}^\delta\rangle + \frac{C}{\alpha_*}\langle p_{\alpha_*}^{II}, p_{\alpha_*}\rangle - \frac{C}{\alpha_*}\langle p_{\alpha_*}^{II}, p_{\alpha_*}^\delta\rangle - \frac{C}{\alpha_*}\langle p_{\alpha_*,\delta}^{II}, p_{\alpha_*}\rangle\\
&= C\psi_{HR}(\alpha_*, y^\delta) + C\psi_{HR}(\alpha_*, y) - \frac{C}{\alpha_*}\langle p_{\alpha_*}^{II}, p_{\alpha_*}^\delta\rangle - \frac{C}{\alpha_*}\langle p_{\alpha_*,\delta}^{II}, p_{\alpha_*}\rangle. \qquad (3.21)
\end{align*}
The last two terms can be estimated via the Cauchy–Schwarz inequality and (3.10), and are bounded from above by Proposition 19:
\[
\frac{C}{\alpha_*}\|p_{\alpha_*}^\delta\|\,\|p_{\alpha_*}\| \le \frac{C}{\alpha_*}\,2\|\omega\|\alpha_*\,(\delta + 2\|\omega\|\alpha_*).
\]
Moreover, these last terms decay to zero as $\delta \to 0$, and clearly the first couple of terms vanish as the noise decays also, since $\lim_{\delta\to0}\psi_{HR}(\alpha_*, y^\delta) = 0$ due to
(3.20), and, from (3.18) and Proposition 19, $\psi_{HR}(\alpha_*, y) \le \psi_{HD}(\alpha_*, y) \to 0$ as $\delta \to 0$.
For $\alpha_*$ minimising $\psi_{SQO}$, note that from (3.6) and (ARC-SQR), it remains to estimate
\begin{align*}
D_{\xi_{\alpha_*}}(x_{\alpha_*}^\delta, x_{\alpha_*}) &\le \frac{C}{\alpha_*}\langle p_{\alpha_*}^\delta - p_{\alpha_*,\delta}^{II}, p_{\alpha_*,\delta}^{II}\rangle + \frac{C}{\alpha_*}\langle p_{\alpha_*} - p_{\alpha_*}^{II}, p_{\alpha_*}^{II}\rangle\\
&\qquad - \frac{C}{\alpha_*}\langle p_{\alpha_*} - p_{\alpha_*}^{II}, p_{\alpha_*,\delta}^{II}\rangle - \frac{C}{\alpha_*}\langle p_{\alpha_*}^\delta - p_{\alpha_*,\delta}^{II}, p_{\alpha_*}^{II}\rangle\\
&\le C\psi_{SQO}(\alpha_*, y^\delta) + C\psi_{SQO}(\alpha_*, y)\\
&\qquad - \frac{C}{\alpha_*}\langle p_{\alpha_*} - p_{\alpha_*}^{II}, p_{\alpha_*,\delta}^{II}\rangle - \frac{C}{\alpha_*}\langle p_{\alpha_*}^\delta - p_{\alpha_*,\delta}^{II}, p_{\alpha_*}^{II}\rangle.
\end{align*}
As before, the result follows very similarly by estimating the first two terms from above (cf. Proposition 24) and the "remainder" terms via the Cauchy–Schwarz inequality, the triangle inequality, and (3.10), all of which vanish as the noise decays.
We omit the proof for the right quasi-optimality rule as it is analogous to the above (and in fact even simpler, as the right-hand side of its associated auto-regularisation condition (ARC-RQO) vanishes due to (3.20)).
Remark. Note that Theorem 15 holds true if the left-hand side of the auto-regularisation conditions, i.e., $\langle\Delta y - \Delta p_\alpha, \Delta p_\alpha\rangle$, is replaced by $D_{\xi_\alpha}(x_\alpha^\delta, x_\alpha)$. Moreover, it is easy to see that for (ARC-HD) it is enough to prove
\[
\langle\Delta p_\alpha, \Delta y\rangle \le C\|\Delta p_\alpha\|^2, \qquad (3.22)
\]
for some positive constant $C$.
Remark. Observe that the heuristic discrepancy, the Hanke–Raus, and the symmetric quasi-optimality rules can all be expressed in terms of the residuals $p_\alpha^\delta$, $p_{\alpha,\delta}^{II}$ of the Bregman iteration. The similarity of patterns in the formulae for $\psi$ hints that a larger family of rules could be defined in the convex case as well, similarly to the linear case; see (2.11) in Chapter 2.
The auto-regularisation condition is an implicit condition on the noise. One may observe that it resembles the analogous condition (1.38) in the linear case. To gain a better understanding, and in particular to show that these conditions can be satisfied in practical situations, we will derive in Section 3.3 sufficient and more transparent conditions in the form of Muckenhoupt-type inequalities similar to (2.17).
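To see what such a noise restriction excludes, consider linear Tikhonov regularisation on a diagonal operator, where $\Delta p_n = \alpha\,\Delta y_n/(\lambda_n^2+\alpha)$ (an illustration of ours). The ratio $\langle\Delta p, \Delta y\rangle/\|\Delta p\|^2$ appearing in (3.22) stays bounded for irregular noise concentrated on the small singular values, but blows up for smooth perturbations carried by the large ones:

```python
lambdas = [2.0**(-k) for k in range(20)]
alpha = 1e-4

def arc_ratio(dy):
    # <Delta p, Delta y> / ||Delta p||^2 with Delta p_n = alpha*dy_n/(lam_n^2+alpha)
    dp = [alpha * d / (lam**2 + alpha) for lam, d in zip(lambdas, dy)]
    num = sum(a * b for a, b in zip(dp, dy))
    den = sum(a * a for a in dp)
    return num / den

rough = [0.0] * 19 + [1.0]    # noise on the smallest singular value
smooth = [1.0] + [0.0] * 19   # perturbation on the largest singular value
print(arc_ratio(rough), arc_ratio(smooth))
```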
3.2.2 Convergence Rates (for the Heuristic Discrepancy Rule)
With the aid of the source condition, the auto-regularisation condition and an additional regularity condition, we can even derive rates of convergence. We will do so only for the heuristic discrepancy rule, as per the title of this section, as rates for the other rules lie beyond the scope of this thesis. These results stem from [90]. We start with the following proposition:
Proposition 25. Suppose that ∂R∗ satisfies the following condition:
\[
x_k \to 0 \;\Rightarrow\; \xi_k \rightharpoonup 0 \quad \text{for any selection } \xi_k \in \partial R^*(x_k). \qquad (3.23)
\]
Then, for every constant $D > 0$, there exists another constant $D_1 > 0$ such that for all $y^\delta \in Y$ with $\|y^\delta\| \ge D$, it holds that
\[
\alpha \le D_1\,\psi_{HD}(\alpha, y^\delta) \qquad \forall\alpha \in (0,\alpha_{\max}). \qquad (3.24)
\]
Proof. Suppose that the statement (3.24) is not true. Then there would exist a constant $D > 0$, a sequence $y^{\delta_k}$ with $\|y^{\delta_k}\| \ge D$, and a sequence $\alpha_k$ with
\[
\frac{\psi_{HD}(\alpha_k, y^{\delta_k})}{\alpha_k} = \Big\|\frac{p_{\alpha_k}^{\delta_k}}{\alpha_k}\Big\|^2 \to 0 \quad \text{as } k \to \infty.
\]
Define $z_k := \frac{p_{\alpha_k}^{\delta_k}}{\alpha_k}$. From its representation as a proximal mapping, we have that $z_k$ satisfies
\[
y^{\delta_k} \in \alpha_k z_k + A\,\partial R^*(A^* z_k).
\]
Thus, from $z_k \to 0$, the boundedness of $\alpha_k$, $A^* z_k \to 0$, and (3.23), we obtain
\[
0 < D^2 \le \|y^{\delta_k}\|^2 = \alpha_k\langle z_k, y^{\delta_k}\rangle + \langle\xi_k, A^* y^{\delta_k}\rangle_{X^{**},X^*} \to 0, \qquad \xi_k \in \partial R^*(A^* z_k),
\]
which yields a contradiction; hence the statement of the proposition is true.
We now state the main convergence rates result:
Proposition 26. Let the source condition (3.3) hold, let $\alpha_*$ minimise $\psi_{HD}$, and suppose the auto-regularisation condition (ARC-HD) is satisfied. Assume that $\|y^\delta\| \ge C$ and, in addition, that (3.23) holds. Then
\[
D_\xi(x_{\alpha_*}^\delta, x^\dagger) = O(\delta), \quad \text{for } \delta \to 0.
\]
Proof. Note that, from Proposition 25 and since $\alpha_*$ is the global minimiser, for any $\alpha$ we have that $\alpha_* \le C\psi_{HD}(\alpha_*, y^\delta) \le C\psi_{HD}(\alpha, y^\delta)$. Observe that from (3.6), it follows that
\begin{align*}
D_\xi(x_{\alpha_*}^\delta, x^\dagger) &\le D_{\xi_{\alpha_*}}(x_{\alpha_*}^\delta, x_{\alpha_*}) + \frac{\|\omega\|^2}{2}\alpha_* + 6\|\omega\|\delta\\
&\stackrel{\text{(ARC-HD)}}{\le} \Big(\sqrt{\psi_{HD}(\alpha, y^\delta)} + \sqrt{\psi_{HD}(\alpha_*, y)}\Big)^2 + C\delta + C\alpha_*\\
&\le \Big(\frac{\delta}{\sqrt{\alpha}} + C\sqrt{\alpha} + C\sqrt{\alpha_*}\Big)^2 + C\delta + C\alpha_* \quad \text{via Proposition 19}\\
&= O\Big(\Big(\frac{\delta}{\sqrt{\alpha}} + \sqrt{\alpha}\Big)^2 + \frac{\delta^2}{\alpha} + \alpha + \delta\Big) \quad \text{since } \alpha_* \le C\psi_{HD}(\alpha, y^\delta)\\
&= O(\delta), \quad \text{choosing } \alpha = \alpha(\delta) = \delta.
\end{align*}
Note that if the source condition (3.3) holds, $\alpha_*$ is selected as the minimiser of $\psi \in \{\psi_{HR}, \psi_{SQO}, \psi_{RQO}\}$, the respective auto-regularisation conditions ((ARC-HR), (ARC-SQR), or (ARC-RQO)) are satisfied, and additionally, for some $\mu > 0$,
\[
\alpha_*^\mu \le C\psi(\alpha_*, y^\delta), \qquad (3.25)
\]
then one may also prove, analogously, that
\[
D_\xi(x_{\alpha_*}^\delta, x^\dagger) = O(\delta^{1/\mu}), \quad \text{for } \delta \to 0.
\]
In the linear theory, the inequality (3.25) corresponds to (12) from Chapter 2, and we recall that under certain additional regularity conditions on $x^\dagger$, e.g., (2.26), optimal convergence rates are obtained (cf. Theorem 6).
We note that the condition (3.23) on $\partial R^*$ holds if $R^*$ is continuously differentiable at 0 with $\nabla R^*(0) = 0$. This is true, for instance, for $R^*(x) = \|x\|^p_{\ell^p}$ with $1 < p < \infty$. On the other hand, we observe that the conclusion of Proposition 25 is not satisfied for exact penalisation methods [21] (such as $\ell^1$-regularisation), as then the residual $p_\alpha$ (and hence $\psi_{HD}$) could vanish for nonzero $\alpha$, which does not concur with the estimate (3.24).
3.3 Diagonal Operator Case Study
This section, like the rest of the chapter, is based on the paper [90]. In the following analysis, we consider the case of the operator $A : X \to Y$ being diagonal between spaces of summable sequences; in particular, $X = \ell^q(\mathbb{N})$ with $1 < q < \infty$ and $Y = \ell^2(\mathbb{N})$, and the regularisation functional is selected as the $\ell^q$-norm to the $q$-th power divided by $q$. The main objective in this setting is to state sufficient conditions for the auto-regularisation conditions in the form of Muckenhoupt-type inequalities similar to (2.17), and to illustrate their validity in specific instances.
Let $\{e_n\}_{n\in\mathbb{N}}$ be the canonical (i.e., Cartesian) basis for $X$ and also $Y$, and let $\{\lambda_n\}_{n\in\mathbb{N}}$ be a sequence of real (and, for simplicity, positive) scalars monotonically decaying to 0. Then we define a diagonal operator $A : \ell^q(\mathbb{N}) \to \ell^2(\mathbb{N})$ by
Aen = λnen. (3.26)
The regularisation functional is chosen as the $q$-th power of the $\ell^q$-norm (divided by $q$):
\[
R := \frac{1}{q}\|\cdot\|^q_{\ell^q} \quad\text{with}\quad \partial R(x) = \{|x_n|^{q-1}\operatorname{sgn}(x_n)\}_{n\in\mathbb{N}}, \qquad q \in (1,\infty). \qquad (3.27)
\]
As we assume $q > 1$, the choice of $\operatorname{sgn}$ at 0 is not relevant and we may assume $\operatorname{sgn}(0) = 0$ throughout.
In the present situation, the problem decouples and the components of the regularised solution can be computed independently of each other. Thus, for notational purposes, one may omit the sequence index $n$ for the components of the regularised solutions and write
\[
x_\alpha^\delta =: \{x_{\alpha,n}^\delta\}_n =: \{x_\alpha^\delta\}_n, \qquad x_\alpha =: \{x_{\alpha,n}\}_n =: \{x_\alpha\}_n, \qquad y^\delta =: \{y^\delta\}_n, \qquad y =: \{y\}_n,
\]
where the components $x_\alpha, x_\alpha^\delta, y^\delta, y \in \mathbb{R}$. As the problem decouples, $x_\alpha$ and $x_\alpha^\delta$ can be computed by an optimisation problem on $\mathbb{R}$, i.e., the optimality conditions read
\[
x_\alpha^\delta + \gamma_n|x_\alpha^\delta|^{q-1}\operatorname{sgn}(x_\alpha^\delta) = \frac{y^\delta}{\lambda_n}, \qquad \text{with } \gamma_n := \frac{\alpha}{\lambda_n^2},
\]
and similar expressions hold for $x_{\alpha,\delta}^{II} =: \{x_{\alpha,\delta}^{II}\}_n$. Because the term $|x_\alpha^\delta|^{q-1}$ is homogeneous of degree $q-1$, by an appropriate scaling we can further simplify expressions: define the components of $p_\alpha^\delta$, $p_\alpha$ as
\[
p_\alpha^\delta := y^\delta - \lambda_n x_\alpha^\delta, \qquad p_\alpha := y - \lambda_n x_\alpha,
\]
and we use the expressions $y$, $y^\delta$, $\Delta y$, $\Delta p_\alpha$, $\Delta p_\alpha^{II}$ to denote the components of $y$, $y^\delta$, $\Delta y$, $\Delta p_\alpha$, $\Delta p_\alpha^{II}$, respectively, where we again omit the sequence index $n$ in the notation:
\[
y^\delta = y_n^\delta, \qquad y = y_n, \qquad \Delta y := y_n^\delta - y_n, \qquad \Delta p_\alpha := p_{\alpha,n}^\delta - p_{\alpha,n}, \qquad \Delta p_\alpha^{II} := p_{\alpha,\delta,n}^{II} - p_{\alpha,n}^{II},
\]
for $n \in \mathbb{N}$. Letting
\[
h_q(x) := x + |x|^{q-1}\operatorname{sgn}(x), \quad x \in \mathbb{R}, \qquad \eta_n := \gamma_n^{\frac{1}{2-q}} = \Big(\frac{\alpha}{\lambda_n^2}\Big)^{\frac{1}{2-q}}, \qquad \Phi_q(y) := h_q^{-1}(y),
\]
we obtain via some simple calculation that
\[
x_\alpha^\delta = \eta_n\,\Phi_q\Big(\frac{y^\delta}{\eta_n\lambda_n}\Big), \qquad x_{\alpha,\delta}^{II} = \eta_n\,\Phi_q\Big(\frac{y^\delta}{\eta_n\lambda_n} + \Phi_{q^*}\Big(\frac{y^\delta}{\eta_n\lambda_n}\Big)\Big), \qquad (3.28)
\]
\[
p_\alpha^\delta = \lambda_n\eta_n\,\Phi_{q^*}\Big(\frac{y^\delta}{\eta_n\lambda_n}\Big), \qquad p_{\alpha,\delta}^{II} = \lambda_n\eta_n\Big(\Phi_{q^*}\Big(\frac{y^\delta}{\eta_n\lambda_n} + \Phi_{q^*}\Big(\frac{y^\delta}{\eta_n\lambda_n}\Big)\Big) - \Phi_{q^*}\Big(\frac{y^\delta}{\eta_n\lambda_n}\Big)\Big), \qquad (3.29)
\]
where $q^*$ is the conjugate index to $q$. Note that $\Phi_q$ corresponds to a proximal operator on $\mathbb{R}$. For $x > 0$ we have
\[
x^{q-1} \le h_q(x). \qquad (3.30)
\]
Moreover, Φq is monotonically increasing and it is not difficult to verify that for any 1 < q < ∞,
\[
x \mapsto \Phi_{q^*}\big(x^{\frac{1}{2-q}}\big)^{q^*-2} \quad \text{is increasing for } x > 0. \qquad (3.31)
\]
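$\Phi_q$ has no closed form for general $q$, but $h_q$ is continuous and strictly increasing, so it can be inverted by bisection. The sketch below (our own) also checks the Moreau-type identity $\Phi_q(t) + \Phi_{q^*}(t) = t$, which is what makes the residual formulas (3.29) work:

```python
def h(q, x):
    # h_q(x) = x + |x|^(q-1) * sgn(x), strictly increasing on R.
    s = 1.0 if x >= 0 else -1.0
    return x + abs(x)**(q - 1) * s

def Phi(q, t, lo=-1e6, hi=1e6):
    # Invert the strictly increasing h_q by bisection: h_q(Phi_q(t)) = t.
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if h(q, mid) < t:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

q = 1.5
q_star = q / (q - 1.0)  # conjugate index, here 3
for t in (0.3, 1.0, 4.0):
    # Phi_q is the proximal map of |x|^q/q on R; Moreau's decomposition
    # gives Phi_q(t) + Phi_{q*}(t) = t.
    assert abs(Phi(q, t) + Phi(q_star, t) - t) < 1e-8
print("ok")
```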
We now state useful estimates for the function Φq:
Lemma 11. For $1 < q < \infty$, $q \neq 2$, there exist constants $D_p$, $\overline{D}_p$, $D_q$, and, for any $\tau > 0$, a constant $D_{q,\tau}$, such that for all $x_1 > 0$ and $|x_2| \le x_1$,
\[
\frac{1}{1 + \overline{D}_p\,\Phi_q(x_1)^{q-2}} \le \frac{\Phi_q(x_1) - \Phi_q(x_2)}{x_1 - x_2} \le \frac{1}{1 + D_p\,\Phi_q(x_1)^{q-2}}, \qquad (3.32)
\]
\[
\frac{1}{1 + D_p\,\Phi_q(x_1)^{q-2}} \le
\begin{cases}
\dfrac{1}{D_q}\,x_1^{\frac{2-q}{q-1}} & \text{if } 1 < q < 2,\\[2mm]
\dfrac{1}{D_{q,\tau}}\,x_1^{\frac{2-q}{q-1}} & \text{if } q > 2 \text{ and } x_1 > \tau.
\end{cases} \qquad (3.33)
\]
Proof. For any $z_1 > 0$, $|z_2| \le z_1$, we have
\[
\frac{h_q(z_1) - h_q(z_2)}{z_1 - z_2} = 1 + \frac{z_1^{q-1} - |z_2|^{q-1}\operatorname{sgn}(z_2)}{z_1 - z_2} = 1 + z_1^{q-2}\,k\Big(\frac{z_2}{z_1}\Big)
\begin{cases}
\le 1 + z_1^{q-2}\,\overline{D}_p,\\[1mm]
\ge 1 + z_1^{q-2}\,D_p,
\end{cases}
\]
where
\[
D_p \le k(z) := \frac{1 - |z|^{q-1}\operatorname{sgn}(z)}{1 - z} \le \overline{D}_p, \qquad \forall z \in [-1,1].
\]
Replacing $z_i$ by $\Phi_q(x_i)$ yields the lower bound and the first upper bound in (3.32). In case $1 < q < 2$, we find that
\[
\frac{1}{1 + \underline{D}_p\,\Phi_q(x_1)^{q-2}} = \frac{\Phi_q(x_1)^{2-q}}{\Phi_q(x_1)^{2-q} + \underline{D}_p} \le \frac{x_1^{\frac{2-q}{q-1}}}{\underline{D}_p},
\]
where we used that $\Phi_q(x_1)^{2-q} \ge 0$ in the denominator and the estimate $\Phi_q(x_1) \le x_1^{\frac{1}{q-1}}$, which follows from (3.30). Now consider the case $q > 2$. Then
\[
\frac{1}{1 + \underline{D}_p\,\Phi_q(x_1)^{q-2}} \le \frac{C_\tau^{\frac{q-2}{q-1}}}{\underline{D}_p}\, x_1^{\frac{2-q}{q-1}} \qquad \forall x_1 \ge \tau,
\]
where we used the estimate $h_q(x) \le C_\tau x^{q-1}$ for $x \ge \Phi_q(\tau)$. This yields the result.
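The two-sided bound (3.32) admits a quick numerical sanity check (our own illustration): since $h_q'(x) = 1 + (q-1)|x|^{q-2} \ge 1$ for $q > 1$, the inverse $\Phi_q$ is nonexpansive and increasing, so every difference quotient of $\Phi_q$ lies in $(0, 1]$.

```python
import math

def h(q, x):
    # h_q(x) = x + |x|^{q-1} sgn(x)
    return x + abs(x) ** (q - 1) * math.copysign(1.0, x) if x != 0 else 0.0

def phi(q, z, tol=1e-12):
    # inverse of h_q by bisection (h_q(u) >= u for u >= 0; odd function)
    if z < 0:
        return -phi(q, -z, tol)
    lo, hi = 0.0, z
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if h(q, mid) < z:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# difference quotients of Phi_q as in (3.32): positive and at most 1,
# reflecting h_q' >= 1 (monotonicity and nonexpansiveness)
for q in (1.5, 3.0):
    for x1, x2 in ((1.0, 0.5), (2.0, -1.0), (0.3, 0.1)):
        quot = (phi(q, x1) - phi(q, x2)) / (x1 - x2)
        assert 0.0 < quot <= 1.0
```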
3.3.1 Muckenhoupt Conditions

As in the rest of this section, we follow the example of [90] very closely. In case the forward operator is diagonal, we may specialise the auto-regularisation condition to Muckenhoupt-type inequalities [83, 87, 88], similar to the linear case, cf. (2.17) in Chapter 2. If we consider a compact operator $A : X \to Y$ and $\mathcal{R} = \|\cdot\|_{\ell^2}^2$, then the Muckenhoupt-type conditions take the following form, with some $t \in \{1, 2\}$: there exist constants $C_1, C_2$ such that for all $\Delta_y$,
\[
\sum_{\{n:\, \sigma_n^2/\alpha \,\ge\, C_1\}} \frac{\alpha}{\sigma_n^2}\,|\Delta_y|^2 \;\le\; C_2 \sum_{\{n:\, \sigma_n^2/\alpha \,<\, C_1\}} \Big(\frac{\sigma_n^2}{\alpha}\Big)^{t-1} |\Delta_y|^2, \tag{3.34}
\]
for all $\alpha \in (0, \alpha_{\max})$, where $\Delta_y = \langle y^\delta - y, u_n\rangle$, with $u_n$ the eigenvectors and $\sigma_n^2$ the eigenvalues of $AA^*$. The integer $t$ is taken as $t = 1$ for the heuristic discrepancy and Hanke-Raus rules and $t = 2$ for the quasi-optimality rules, which is analogous to the theory of Chapter 2, where the $\mathcal{N}_p$ condition (2.17) held for the HD and HR rules with $p = 1$ and for the quasi-optimality rule with $p = 2$. Here, $t$ plays the role of $p$. In this case, the linear auto-regularisation conditions hold for the respective rules and one can prove convergence of the method.

The Muckenhoupt inequalities hold in many situations; see, e.g., [87, 88]. In order to realise the insight that (3.34) provides, one can observe that the right-hand side of the inequality (3.34) represents the high-frequency components of the noise. Thus, in order for this upper bound to hold, the noise must be sufficiently irregular. This is analogous to the requirements of Chapter 2. In the diagonal setting above, we have $\sigma_n = \lambda_n$, and the definition of $\Delta_y$ agrees with that in (3.34).

For later reference, we define the following sequence of positive numbers:
\[
\theta_{q,n} := \lambda_n^q\,\max\{|y|, |y^\delta|\}^{2-q}. \tag{3.35}
\]
Then the following theorem provides a sufficient condition for the auto-regularisation condition to hold.

The Heuristic Discrepancy Rule

Theorem 16. Let $A$ be a diagonal operator (3.26) and $\mathcal{R}$ the regularisation functional in (3.27) with $q \in (1, \infty)$. Suppose that for some constant $C_1$, there exists a constant $C_2$ such that for all $y^\delta$ and $\alpha \in (0, \alpha_{\max})$,
\[
\sum_{\{n:\, \theta_{q,n}/\alpha \,\ge\, C_1\}} \frac{\alpha}{\theta_{q,n}}\,|\Delta_y|^2 \;\le\; C_2 \sum_{\{n:\, \theta_{q,n}/\alpha \,<\, C_1\}} |\Delta_y|^2. \tag{3.36}
\]
Then the auto-regularisation condition (ARC-HD) holds for the heuristic discrepancy rule.

Proof. The auto-regularisation condition in question reads
\[
\sum_{n\in\mathbb{N}} \Delta_{p_\alpha}\Delta_y \;\le\; C \sum_{n\in\mathbb{N}} |\Delta_{p_\alpha}|^2. \tag{3.37}
\]
Let $I_{HD} \subset \mathbb{N}$ be a set of indices where, for some fixed constants $\beta_1, \beta_2$, it holds that
\[
n \in I_{HD} \;\Rightarrow\; |\Delta_y| \le \beta_1\,|\Delta_{p_\alpha}|, \tag{3.38}
\]
\[
n \notin I_{HD} \;\Rightarrow\; |\Delta_{p_\alpha}| \le \beta_2\,\frac{\alpha}{\theta_{q,n}}\,|\Delta_y|. \tag{3.39}
\]
Note that we construct this set in what follows.
We first show that for (3.37), it is sufficient that there exists a constant $C_2$ such that
\[
\sum_{n\notin I_{HD}} \frac{\alpha}{\theta_{q,n}}\,|\Delta_y|^2 \;\le\; C_2 \sum_{n\in I_{HD}} |\Delta_y|^2. \tag{3.40}
\]
Indeed, splitting the sum in (3.37) into two parts and using (3.39), (3.38), and (3.40), and noting that $\Delta_{p_\alpha}$ always has the same sign as $\Delta_y$, we obtain
\[
\sum_{n\in\mathbb{N}} \Delta_{p_\alpha}\Delta_y = \sum_{n\notin I_{HD}} \Delta_{p_\alpha}\Delta_y + \sum_{n\in I_{HD}} \Delta_{p_\alpha}\Delta_y
\le \beta_2 \sum_{n\notin I_{HD}} \frac{\alpha}{\theta_{q,n}}\,|\Delta_y|^2 + \beta_1 \sum_{n\in I_{HD}} |\Delta_{p_\alpha}|^2
\]
\[
\le \beta_2 C_2 \sum_{n\in I_{HD}} |\Delta_y|^2 + \beta_1 \sum_{n\in I_{HD}} |\Delta_{p_\alpha}|^2
\le (\beta_2 C_2 \beta_1^2 + \beta_1) \sum_{n\in I_{HD}} |\Delta_{p_\alpha}|^2.
\]
The Lipschitz continuity of the proximal mapping, $|\Delta_{p_\alpha}| \le |\Delta_y|$, now implies (3.37). Note that (3.36) has the form of (3.40) with
\[
I_{HD} := \Big\{n : \frac{\theta_{q,n}}{\alpha} < C_1\Big\}.
\]
Thus, it remains to verify that for this index set, there exist constants $\beta_1, \beta_2$ for which (3.38) and (3.39) hold.

We note that by monotonicity, the ratio $\frac{\Delta_{p_\alpha}}{\Delta_y}$ is always positive and invariant when $y, y^\delta$ are switched and when $y, y^\delta$ are replaced by $-y, -y^\delta$. Thus, without loss of generality, we may assume that $y > 0$ and $|y^\delta| \le y$, such that $y = \max\{|y|, |y^\delta|\}$. Using this convention, noting that
\[
\lambda_n\eta_n = \Big(\frac{\alpha}{\lambda_n^q}\Big)^{\frac{1}{2-q}}
\]
and the definition (3.35) of $\theta_{q,n}$, we have that
\[
\Big(\frac{y}{\lambda_n\eta_n}\Big)^{2-q} = \frac{\max\{|y|, |y^\delta|\}^{2-q}\,\lambda_n^q}{\alpha} = \frac{\theta_{q,n}}{\alpha}. \tag{3.41}
\]
Thus, for $n \in I_{HD}$ and by (3.31), we have
\[
\Phi_{q^*}\Big(\frac{y}{\lambda_n\eta_n}\Big)^{q^*-2} \le \Phi_{q^*}\big(C_1^{\frac{1}{2-q}}\big)^{q^*-2}.
\]
Using (3.29) and Lemma 11 with
\[
x_1 = \frac{y}{\lambda_n\eta_n} = \frac{\max\{|y|, |y^\delta|\}}{\lambda_n\eta_n} \qquad \text{and} \qquad x_2 = \frac{y^\delta}{\lambda_n\eta_n},
\]
we find that for $n \in I_{HD}$,
\[
\frac{\Delta_{p_\alpha}}{\Delta_y} \;\ge\; \frac{1}{1 + \overline{D}_p\,\Phi_{q^*}\big(\frac{y}{\lambda_n\eta_n}\big)^{q^*-2}} \;\ge\; \frac{1}{1 + \overline{D}_p\,\Phi_{q^*}\big(C_1^{\frac{1}{2-q}}\big)^{q^*-2}} \;>\; 0, \tag{3.42}
\]
which verifies (3.38).

In view of (3.39), let $n \notin I_{HD}$; then, as $\frac{\theta_{q,n}}{\alpha} \ge C_1$, we conclude that
\[
\frac{y}{\lambda_n\eta_n} \ge C_1^{\frac{1}{2-q}} \quad \text{if } q \le 2 \quad (\text{i.e., } q^* \ge 2). \tag{3.43}
\]
Applying Lemma 11 with $\Phi_{q^*}$ and $x_1 = \frac{y}{\lambda_n\eta_n}$, we observe that the conditions on the right-hand side of (3.33) hold true by (3.43). Noting that the exponents in (3.33) satisfy $\frac{2-q^*}{q^*-1} = q - 2$, we obtain the upper bound
\[
\frac{\Delta_{p_\alpha}}{\Delta_y} \;\le\; \tilde{C}\,\Big(\frac{y}{\lambda_n\eta_n}\Big)^{\frac{2-q^*}{q^*-1}} = \tilde{C}\,\Big(\frac{y}{\lambda_n\eta_n}\Big)^{q-2} = \tilde{C}\,\frac{\alpha}{\theta_{q,n}}, \tag{3.44}
\]
which verifies (3.39) and thus completes the proof.
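The scalar closed form (3.28) can be cross-checked against direct minimisation. The sketch below (our own illustration; the penalty constant $2\alpha/q$ is our choice so that the scalar Tikhonov functional $(\lambda x - y)^2 + \frac{2\alpha}{q}|x|^q$ reproduces the optimality condition stated earlier, and (3.27) may normalise differently) verifies that $x_\alpha = \eta\,\Phi_q(y/(\eta\lambda))$ beats a brute-force grid search:

```python
import math

def h(q, x):
    return x + abs(x) ** (q - 1) * math.copysign(1.0, x) if x != 0 else 0.0

def phi(q, z, tol=1e-13):
    # Phi_q = h_q^{-1} by bisection
    if z < 0:
        return -phi(q, -z, tol)
    lo, hi = 0.0, z
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if h(q, mid) < z:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

q, lam, alpha, y = 1.5, 0.7, 0.05, 0.9
gamma = alpha / lam ** 2
eta = gamma ** (1.0 / (2.0 - q))       # eta_n = gamma_n^{1/(2-q)}

# closed form x_alpha = eta * Phi_q(y / (eta * lam)) from (3.28)
x_cf = eta * phi(q, y / (eta * lam))

def J(x):
    # scalar Tikhonov functional; its optimality condition is
    # x + gamma |x|^{q-1} sgn(x) = y / lam
    return (lam * x - y) ** 2 + (2.0 * alpha / q) * abs(x) ** q

grid = [i / 10000.0 for i in range(-20000, 20001)]   # x in [-2, 2]
assert J(x_cf) <= min(J(x) for x in grid) + 1e-6
```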
The Hanke-Raus Rule

Contrary to the heuristic discrepancy case, we have to impose a restriction on the regularisation functional exponent $q$ in order to keep certain expressions positive:

Lemma 12. If $q \ge \frac{3}{2}$, then it follows that
\[
\Delta_{p_\alpha}^{II}\,\Delta_{p_\alpha} \ge 0,
\]
for all $\alpha \in (0, \alpha_{\max})$ and $y^\delta \in Y$.

Proof. Setting $z_1 = \frac{y}{\eta_n\lambda_n}$, $z_2 = \frac{y^\delta}{\eta_n\lambda_n}$, noting (3.29) and the identity
\[
\Phi_{q^*}(z_1 + \Phi_{q^*}(z_1)) = \Phi_{q^*}\big(h_{q^*}(\Phi_{q^*}(z_1)) + \Phi_{q^*}(z_1)\big),
\]
it is enough to verify that the mapping
\[
F : p \mapsto \Phi_{q^*}(h_{q^*}(p) + p) - p
\]
is monotonically increasing. As this function is differentiable everywhere except at $p = 0$, it suffices to prove the inequality
\[
0 \le F'(p) = \frac{2 + (q^* - 1)|p|^{q^*-2}}{1 + (q^* - 1)\,|\Phi_{q^*}(h_{q^*}(p) + p)|^{q^*-2}} - 1,
\]
for any $p \in \mathbb{R}$. Since $F$ is antisymmetric and hence $F'$ is symmetric, it is in fact sufficient to prove this inequality for $p > 0$. Setting $r = \Phi_{q^*}(h_{q^*}(p) + p) \ge p$, we thus have to show that
\[
\frac{2 + (q^* - 1)|p|^{q^*-2}}{1 + (q^* - 1)|r|^{q^*-2}} \ge 1, \qquad \text{where } h_{q^*}(r) = h_{q^*}(p) + p. \tag{3.45}
\]
Defining the number $\zeta$ implicitly by $h_{q^*}(\zeta p) = h_{q^*}(p) + p$ (i.e., $r = \zeta p$), it follows that $\zeta \in [1, 2]$ and that $p^{q^*-2} = \frac{2 - \zeta}{\zeta^{q^*-1} - 1}$. Plugging this formula and that for $r$ into the inequality (3.45), we obtain that monotonicity holds if
\[
(q^* - 1)\,\frac{(2 - \zeta)(\zeta^{q^*-2} - 1)}{\zeta^{q^*-1} - 1} \le 1, \qquad \forall \zeta \in [1, 2]. \tag{3.46}
\]
Some detailed analysis reveals that this is satisfied for $q^* \le 3$, which is equivalent to $q \ge \frac{3}{2}$.

The next lemma is needed to estimate a term in (ARC-HR).

Lemma 13. Let $n \in \mathbb{N}$ be such that
\[
\frac{\theta_{q,n}}{\alpha} \le C_1, \tag{3.47}
\]
with a constant $C_1$ that is sufficiently small. Then there is a constant $\beta_1$, depending on $C_1$ but independent of $n$, with
\[
\Delta_{p_\alpha}\Delta_y \le \beta_1\,\Delta_{p_\alpha}^{II}\,\Delta_{p_\alpha}.
\]

Proof. Define
\[
x_1 = \frac{y}{\eta_n\lambda_n} + \Phi_{q^*}\Big(\frac{y}{\eta_n\lambda_n}\Big) = \frac{y + p_\alpha}{\eta_n\lambda_n}, \qquad
x_2 = \frac{y^\delta}{\eta_n\lambda_n} + \Phi_{q^*}\Big(\frac{y^\delta}{\eta_n\lambda_n}\Big) = \frac{y^\delta + p_\alpha^\delta}{\eta_n\lambda_n}.
\]
From the definition of $p_\alpha$, $p_\alpha^\delta$ it follows that $\operatorname{sgn}(x_1) = \operatorname{sgn}(y)$, $\operatorname{sgn}(x_2) = \operatorname{sgn}(y^\delta)$, and that $x_1$, $x_2$ are increasing functions of $p_\alpha$, $p_\alpha^\delta$, respectively.
Thus the ratio
\[
R^{II} := \frac{\Delta_{p_\alpha} + \Delta_{p_\alpha}^{II}}{\Delta_{p_\alpha} + \Delta_y} = \frac{\Phi_{q^*}(x_1) - \Phi_{q^*}(x_2)}{x_1 - x_2}
\]
is always positive and, moreover, invariant when $x_1, x_2$ are switched and, respectively, replaced by $-x_1, -x_2$. Thus, we may assume without loss of generality (otherwise we redefine the variables $x_1, x_2$) that $x_1 > 0$ and $|x_2| \le x_1$, which is equivalent to $y > 0$ and $|y^\delta| \le y$. Applying Lemma 11 then yields
\[
R^{II} = \frac{\Phi_{q^*}(x_1) - \Phi_{q^*}(x_2)}{x_1 - x_2} \ge \frac{1}{1 + \overline{D}_p\,\Phi_{q^*}(x_1)^{q^*-2}}.
\]
It follows from (3.47) and $y \le y + p_\alpha \le 2y$ that
\[
x_1^{2-q} = \Big(\frac{y + p_\alpha}{\lambda_n\eta_n}\Big)^{2-q} \le C_4\,\Big(\frac{y}{\lambda_n\eta_n}\Big)^{2-q} = C_4\,\frac{\theta_{q,n}}{\alpha} \le C_4\,C_1,
\]
where $C_4 \in \{2^{2-q}, 1\}$ depending on whether $q < 2$ or $q > 2$. In any case, we obtain with (3.31) that, as before,
\[
\Phi_{q^*}\Big(\frac{y + p_\alpha}{\lambda_n\eta_n}\Big)^{q^*-2} \le \Phi_{q^*}\big((C_4 C_1)^{\frac{1}{2-q}}\big)^{q^*-2},
\]
and hence
\[
R^{II} \ge \frac{1}{1 + \overline{D}_p\,\Phi_{q^*}\big((C_4 C_1)^{\frac{1}{2-q}}\big)^{q^*-2}}.
\]
Some standard calculus furthermore reveals that
\[
\lim_{C_1 \to 0} \Phi_{q^*}\big((C_4 C_1)^{\frac{1}{2-q}}\big)^{q^*-2} = 0.
\]
Thus we may choose $C_1$ sufficiently small such that
\[
\overline{D}_p\,\Phi_{q^*}\big((C_4 C_1)^{\frac{1}{2-q}}\big)^{q^*-2} \le \theta < 1, \tag{3.48}
\]
as then $R^{II} \ge \frac{1}{1+\theta} > \frac{1}{2}$. From this inequality and using the Lipschitz continuity of the residuals, $|\Delta_{p_\alpha}| \le |\Delta_y|$, we find that
\[
\Delta_{p_\alpha}^{II}\,\Delta_{p_\alpha} = R^{II}\big(|\Delta_{p_\alpha}|^2 + \Delta_y\Delta_{p_\alpha}\big) - |\Delta_{p_\alpha}|^2
\ge \frac{1}{1+\theta}\,\Delta_y\Delta_{p_\alpha} - \Big(1 - \frac{1}{1+\theta}\Big)|\Delta_{p_\alpha}|^2
\ge \Big(\frac{2}{1+\theta} - 1\Big)\Delta_y\Delta_{p_\alpha}.
\]
This completes the proof.

Theorem 17. Let $A$ and $\mathcal{R}$ be as in Theorem 16 with $\frac{3}{2} \le q < \infty$. Suppose that there is a sufficiently small constant $C_1$ and a constant $C_2$ such that for all $y^\delta$, (3.36) holds. Then the auto-regularisation condition (ARC-HR) holds for the Hanke-Raus rule.

Proof. As in the HD-case, we define a set of indices $I_{HR}$ with the properties
\[
n \in I_{HR} \;\Rightarrow\; \Delta_{p_\alpha}\Delta_y \le \beta_1\,\Delta_{p_\alpha}^{II}\Delta_{p_\alpha} \;\text{ and }\; \Delta_y \le \beta_2\,\Delta_{p_\alpha}, \tag{3.49}
\]
\[
n \notin I_{HR} \;\Rightarrow\; |\Delta_{p_\alpha}| \le \beta_3\,\frac{\alpha}{\theta_{q,n}}\,|\Delta_y|. \tag{3.50}
\]
Then, sufficient for (ARC-HR) is that a constant $C_2$ exists with
\[
\sum_{n\notin I_{HR}} \frac{\alpha}{\theta_{q,n}}\,|\Delta_y|^2 \;\le\; C_2 \sum_{n\in I_{HR}} |\Delta_y|^2. \tag{3.51}
\]
This can be seen as follows:
\[
\sum_{n\in\mathbb{N}} \big(\Delta_{p_\alpha}\Delta_y - |\Delta_{p_\alpha}|^2\big) \le \sum_{n\in I_{HR}} \Delta_{p_\alpha}\Delta_y + \sum_{n\notin I_{HR}} \Delta_{p_\alpha}\Delta_y
\le \beta_1 \sum_{n\in I_{HR}} \Delta_{p_\alpha}^{II}\Delta_{p_\alpha} + \beta_3 \sum_{n\notin I_{HR}} \frac{\alpha}{\theta_{q,n}}\,|\Delta_y|^2
\]
\[
\le \beta_1 \sum_{n\in I_{HR}} \Delta_{p_\alpha}^{II}\Delta_{p_\alpha} + \beta_3 C_2 \sum_{n\in I_{HR}} |\Delta_y|^2
\le (\beta_1 + \beta_3 C_2 \beta_2 \beta_1) \sum_{n\in I_{HR}} \Delta_{p_\alpha}^{II}\Delta_{p_\alpha}
\le (\beta_1 + \beta_3 C_2 \beta_2 \beta_1) \sum_{n\in\mathbb{N}} \Delta_{p_\alpha}^{II}\Delta_{p_\alpha},
\]
where we used that $\Delta_{p_\alpha}^{II}\Delta_{p_\alpha} \ge 0$ in the last step. Hence (ARC-HR) follows from (3.51). Note that (3.36) has the form (3.51) with
\[
I_{HR} := \Big\{n : \frac{\theta_{q,n}}{\alpha} \le C_1\Big\},
\]
and $C_1$ sufficiently small. We have already shown that for such indices, $\Delta_y \le \beta_2\Delta_{p_\alpha}$ holds, and that for $n \notin I_{HR}$, (3.50) holds. Moreover, from Lemma 13 it follows that on $I_{HR}$ also $\Delta_{p_\alpha}\Delta_y \le \beta_1\Delta_{p_\alpha}^{II}\Delta_{p_\alpha}$ holds. Thus, collecting these results yields that (3.36) implies the auto-regularisation condition (ARC-HR). The smallness condition on $C_1$ is given by (3.48).

The Symmetric Quasi-Optimality Rule

Similarly as for the Hanke-Raus rule, we first have to verify the nonnegativity of certain expressions:

Lemma 14. If $q \ge \frac{3}{2}$, then
\[
(\Delta_{p_\alpha} - \Delta_{p_\alpha}^{II})\,\Delta_{p_\alpha}^{II} \ge 0,
\]
for all $\alpha \in (0, \alpha_{\max})$ and $y, y^\delta \in \mathbb{R}$.

Proof. Recall the mapping $F : p_\alpha \mapsto p_\alpha^{II}$ defined in the proof of Lemma 12. In order to prove the statement, it is enough to show that
\[
\big((p_1 - F(p_1)) - (p_2 - F(p_2))\big)\,\big(F(p_1) - F(p_2)\big) \ge 0 \qquad \forall p_1, p_2.
\]
It is not difficult to see that if $F$ is monotone and Lipschitz continuous, then this inequality holds true. Thus, we have to prove that $0 \le F'(p) \le 1$ for all $p$. As in the proof of Lemma 12, we may employ the variable $\zeta \in [1, 2]$, with which the inequality $0 \le F'(p)$ was already verified for $q \ge \frac{3}{2}$. The additional condition $F'(p) \le 1$ leads to $\zeta^{q^*-2} \ge \frac{1}{2}$, which holds for any $q^* \ge 1$. This shows the result.

Theorem 18. Let $A$ and $\mathcal{R}$ be as in Theorem 16 with $\frac{3}{2} \le q < \infty$. Suppose that there are constants $C_1, C_2, C_3$, with $C_1$ sufficiently small, such that for all $y^\delta$ and $\alpha \in (0, \alpha_{\max})$,
\[
\sum_{\{n:\, \frac{\theta_{q,n}}{\alpha} \ge C_1\}} \frac{\alpha}{\theta_{q,n}}\,|\Delta_y|^2 \;+ \sum_{\{n:\, \frac{\theta_{q,n}}{\alpha} \le C_1\}\cap I_2^c} |\Delta_y|^2
\;\le\; C_2 \sum_{\{n:\, \frac{\theta_{q,n}}{\alpha} \le C_3\}} \Big(\frac{\theta_{q,n}}{\alpha}\Big)^{\frac{1}{q-1}}\,|\Delta_y|^2. \tag{3.52}
\]
Then the auto-regularisation condition (ARC-SQR) holds for the symmetric quasi-optimality rule.

Proof.
We define an index set $I_{SQR}$ with the property that
\[
n \in I_{SQR} \;\Rightarrow\; |\Delta_{p_\alpha} - \Delta_y| \le \beta_1\,|\Delta_{p_\alpha} - \Delta_{p_\alpha}^{II}| \;\text{ and }\; |\Delta_{p_\alpha}| \le \beta_2\,|\Delta_{p_\alpha}^{II}|. \tag{3.53}
\]
Then, for (ARC-SQR) it is sufficient to prove that
\[
\sum_{n\notin I_{SQR}} (\Delta_y - \Delta_{p_\alpha})\,\Delta_{p_\alpha} \;\le\; C \sum_{n\in I_{SQR}} (\Delta_{p_\alpha} - \Delta_{p_\alpha}^{II})\,\Delta_{p_\alpha}^{II}. \tag{3.54}
\]
This can be seen as in the previous cases, since the sum $\sum_{n\in I_{SQR}} (\Delta_y - \Delta_{p_\alpha})\Delta_{p_\alpha}$ can be bounded by $\sum_{n\in I_{SQR}} (\Delta_{p_\alpha} - \Delta_{p_\alpha}^{II})\Delta_{p_\alpha}^{II}$ by the definition of $I_{SQR}$, and the sum $\sum_{n\notin I_{SQR}}$ on the right can be bounded from below by $0$ according to Lemma 14. Now take
\[
I_{SQR} = I_{HR} \cap I_2.
\]
Similarly as for the Hanke-Raus rule, the inequality $|\Delta_{p_\alpha}| \le \beta_2|\Delta_{p_\alpha}^{II}|$ holds true on $I_{HR}$ with $C_1$ sufficiently small; thus $I_{SQR}$ satisfies the requirements (3.53). It remains to show that the stated condition (3.52) implies (3.54). The left-hand side in (3.54) can be bounded from above by
\[
\sum_{n\notin I_{SQR}} (\Delta_y - \Delta_{p_\alpha})\Delta_{p_\alpha} \le \sum_{n\notin I_{HR}} (\Delta_y - \Delta_{p_\alpha})\Delta_{p_\alpha} + \sum_{n\in I_{HR}\cap I_2^c} (\Delta_y - \Delta_{p_\alpha})\Delta_{p_\alpha}
\le \sum_{n\notin I_{HR}} \frac{\alpha}{\theta_{q,n}}\,|\Delta_y|^2 + \sum_{n\in I_{HR}\cap I_2^c} |\Delta_y|^2,
\]
where we used the estimate (3.39) on the complement of $I_{HR}$ and the Lipschitz continuity of the proximal mapping for the second sum:
\[
(\Delta_y - \Delta_{p_\alpha})\Delta_{p_\alpha} \le |\Delta_y|\,|\Delta_{p_\alpha}| \le |\Delta_y|^2.
\]
Thus, the left-hand side of (3.52) serves as an upper bound for the left-hand side of (3.54). The sum on the right-hand side of (3.54) can be bounded from below as follows: the summation index $n$ lies in $I_{SQR} \subset I_{HR}$, hence, with a generic constant $\beta_1 > 0$ (absorbing the lower bound (3.42) for $\frac{\Delta_{p_\alpha}}{\Delta_y}$),
\[
(\Delta_{p_\alpha} - \Delta_{p_\alpha}^{II})\,\Delta_{p_\alpha}^{II} \ge \beta_1\,|\Delta_{p_\alpha} - \Delta_y|\,|\Delta_{p_\alpha}| \ge \beta_1\,(\Delta_y - \Delta_{p_\alpha})\,\Delta_y
= \beta_1\,|\Delta_y|^2\Big(1 - \frac{\Delta_{p_\alpha}}{\Delta_y}\Big)
\]
\[
\ge \beta_1\,|\Delta_y|^2\,\bigg(1 - \frac{1}{1 + \underline{D}_p\,\Phi_{q^*}\big(\frac{y}{\lambda_n\eta_n}\big)^{q^*-2}}\bigg)
= \beta_1\,|\Delta_y|^2\,\frac{\underline{D}_p\,\Phi_{q^*}\big((\frac{\theta_{q,n}}{\alpha})^{\frac{1}{2-q}}\big)^{q^*-2}}{1 + \underline{D}_p\,\Phi_{q^*}\big((\frac{\theta_{q,n}}{\alpha})^{\frac{1}{2-q}}\big)^{q^*-2}}
\ge \beta_1\,|\Delta_y|^2\,\frac{C'\,\big(\frac{\theta_{q,n}}{\alpha}\big)^{\frac{1}{q-1}}}{1 + \underline{D}_p\,\Phi_{q^*}\big(C_1^{\frac{1}{2-q}}\big)^{q^*-2}},
\]
where we used (3.41), (3.42), and a bound for $z > 0$ on $I_{SQR}$ of the form
\[
\Phi_{q^*}\big(z^{\frac{1}{2-q}}\big)^{q^*-2} \ge C'\,z^{\frac{1}{q-1}},
\]
which can be obtained by similar means as above. Thus, the right-hand side of (3.52) is a lower bound for the right-hand side of (3.54). Together, (3.52) implies (3.54) and thus the desired auto-regularisation condition.

Remark.
The condition (3.52) has an additional sum over the index set $I_{HR} \cap I_2^c$ on the left-hand side. It might be possible to prove that this set is empty, e.g., if $I_{HR} \subset I_2$; then the corresponding sum would vanish, and this happens in the linear case ($q = 2$). However, we postpone a more detailed analysis of this issue to the future. We also point out that the Muckenhoupt-type conditions (3.36) and (3.52) (except for the additional sum) agree with the respective ones for the linear case $q = 2$, so that they appear, in fact, as natural extensions of the linear convergence theory.

Case Study of Noise Restrictions

For the case that the operator ill-posedness, the regularity of the exact solution, and the noise show some typical behaviour, we investigate the restrictions that the Muckenhoupt-type condition (3.36) imposes on the noise. In particular, we would like to point out that the restrictions are completely realistic and are satisfied in paradigmatic situations. Consider a polynomially ill-posed problem, with a given decay of the exact solution and a polynomial decay of the error:
\[
\lambda_n = \frac{D_1}{n^\beta}, \qquad |y| = \frac{D_2}{n^\nu}, \qquad \Delta_y = \frac{\delta\,s_n}{n^\kappa},
\]
for $\nu > \beta > 0$, $0 < \kappa < \nu$, and $s_n \in \{-1, 1\}$. The restrictions $\kappa < \nu$ and $\nu > \beta > 0$ are natural, as the noise is usually less regular than the exact solution, and the exact solution has higher decay rates than $\lambda_n$ due to regularity. In the linear case, Muckenhoupt-type conditions lead to restrictions on the regularity of the noise, i.e., upper bounds for the decay rate $\kappa$. This is perfectly in line with their interpretation as conditions for sufficiently irregular noise. In the following, we write $\sim$ if the left and right expressions can be estimated by constants independent of $n$ (there may be a $q$-dependence, however). The numbers $\theta_{q,n}$ that appear in (3.36) now read as
\[
\theta_{q,n} := \max\{|y|, |y^\delta|\}^{2-q}\,\lambda_n^q = \max\{1, |y^\delta|/|y|\}^{2-q}\,|y|^{2-q}\,\lambda_n^q
\sim \frac{1}{n^{\beta q + \nu(2-q)}}\,\max\Big\{1, \Big|1 + \frac{s_n\delta}{D_2}\,n^{\nu-\kappa}\Big|\Big\}^{2-q}.
\]
We additionally impose the restriction that, for sufficiently large $n$, $\theta_{q,n} \to 0$ monotonically. If $2 - q > 0$, this is trivially satisfied, while for $2 - q < 0$ we require that
\[
\beta q + \kappa(2 - q) > 0, \qquad \text{if } 2 - q < 0. \tag{3.55}
\]
Under these assumptions, for any $\alpha$ sufficiently small, we find an $n^*$ such that $\theta_{q,n^*} = C_1\alpha$ and $\theta_{q,n} \le C_1\alpha$ for $n \ge n^*$. Expressing $\alpha$ in terms of $\theta_{q,n^*}$ yields a sufficient condition for (3.36), as
\[
\theta_{q,n^*}\sum_{n=1}^{n^*} \frac{|\Delta_y|^2}{\theta_{q,n}} \;\le\; C \sum_{n=n^*+1}^{\infty} |\Delta_y|^2 \;\sim\; \frac{1}{n^{*\,2\kappa-1}}. \tag{3.56}
\]
By the straightforward estimate $\max\{1, |1 + \frac{s_n\delta}{D_2} n^{\nu-\kappa}|\} \sim 1 + \delta n^{\nu-\kappa}$, the inequality (3.56) reduces to
\[
\frac{(1 + \delta\,n^{*\,\nu-\kappa})^{2-q}}{n^{*\,\beta q + \nu(2-q) - 2\kappa}} \sum_{n=1}^{n^*} \frac{n^{\beta q + \nu(2-q) - 2\kappa}}{(1 + \delta n^{\nu-\kappa})^{2-q}} \;\le\; C\,n^*. \tag{3.57}
\]
For any $x \ge 0$ and $0 < z \le 1$, it holds that
\[
1 \le \frac{1 + x}{1 + zx} \le \frac{1}{z}.
\]
We use this inequality with $z = \big(\frac{n}{n^*}\big)^{\nu-\kappa}$ and $x = \delta\,n^{*\,\nu-\kappa}$. Then we obtain the sufficient conditions
\[
\frac{1}{n^{*\,\beta q + \nu(2-q) - 2\kappa - (2-q)(\nu-\kappa)}} \sum_{n=1}^{n^*} n^{\beta q + \nu(2-q) - 2\kappa - (2-q)(\nu-\kappa)} \le C\,n^*, \qquad 2 - q > 0,
\]
\[
\frac{1}{n^{*\,\beta q + \nu(2-q) - 2\kappa}} \sum_{n=1}^{n^*} n^{\beta q + \nu(2-q) - 2\kappa} \le C\,n^*, \qquad 2 - q < 0.
\]
These inequalities are satisfied if the exponent of $n$ is strictly larger than $-1$. This finally leads to the restrictions
\[
\kappa \le \beta + \frac{1}{q}, \qquad q < 2,
\]
\[
\kappa \le \frac{q}{2}\,\beta + \frac{2-q}{2}\,\nu + \frac{1}{2}, \qquad q > 2.
\]
Note that for $q > 2$, we additionally require (3.55). We hope to have convinced the reader that the imposed conditions on the noise are not too restrictive and, in particular, that the set of noise sequences satisfying them is nonempty. These conditions also provide a hint as to the cases in which the methods may work or fail:

In case $\frac{3}{2} \le q \le 2$, both the heuristic discrepancy and the Hanke-Raus rules are reasonable. The conditions on the noise are less restrictive the smaller $q$ is.

In case $1 < q < \frac{3}{2}$, our convergence analysis only applies to the heuristic discrepancy rule, as the nonnegativity condition for the Hanke-Raus rule is not guaranteed in this case. It could be said that the heuristic discrepancy rule is the more robust one then.

In case $q > 2$, we observe that the restriction on the noise depends on the regularity of the exact solution.
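The restriction for $q < 2$ can be probed numerically. The following rough experiment (our own choice of constants, with $D_1 = D_2 = 1$ and $s_n = (-1)^n$) evaluates both sides of (3.36) for the polynomial model along a grid of $\alpha$ values, with $\kappa < \beta + \frac{1}{q}$, and checks that the ratio stays bounded:

```python
# polynomial model: lambda_n = n^-beta, |y_n| = n^-nu, Delta_y = delta*(-1)^n*n^-kappa
beta, nu, kappa, q, delta = 1.0, 2.0, 1.2, 1.5, 1e-2   # kappa < beta + 1/q
N, C1 = 5000, 1.0

lam  = [n ** -beta for n in range(1, N + 1)]
yex  = [n ** -nu for n in range(1, N + 1)]
dy   = [delta * (-1) ** n * n ** -kappa for n in range(1, N + 1)]
ydel = [yex[n] - dy[n] for n in range(N)]
theta = [lam[n] ** q * max(abs(yex[n]), abs(ydel[n])) ** (2 - q)
         for n in range(N)]

worst = 0.0
for k in range(3, 10):
    a = 10.0 ** -k                       # regularisation parameter alpha
    lhs = sum(a / theta[n] * dy[n] ** 2
              for n in range(N) if theta[n] / a >= C1)
    rhs = sum(dy[n] ** 2 for n in range(N) if theta[n] / a < C1)
    if rhs > 0:
        worst = max(worst, lhs / rhs)
assert 0.0 < worst < 1e3   # ratio stays bounded, as the theory predicts
```

The left-hand sum collects the low-frequency (large-$\theta_{q,n}$) noise components with weight $\alpha/\theta_{q,n}$; a noise sequence decaying too quickly would make the right-hand side vanish faster than the left.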
For highly regular exact solutions ($\nu \gg 1$), the noise condition might fail to be satisfied as $q$ becomes very large. This happens for both the heuristic discrepancy and the Hanke-Raus rules. We did not include the quasi-optimality condition in this analysis as it still requires further study. However, the conditions for it are usually even more restrictive than for the Hanke-Raus rule, and we expect similar problems for the case $q > 2$.

Chapter 4

Iterative Regularisation

In this chapter, we initially revert to treating linear ill-posed problems through an iterative method, namely Landweber iteration (cf. [97]), which the reader might recall from Section 1.3.2. This presents an alternative approach to Tikhonov regularisation, which we covered quite thoroughly in the preceding chapters. In the second section of this chapter, we then cover the basics of Landweber iteration for nonlinear forward operators.

4.1 Landweber Iteration for Linear Operators

Let $A : X \to Y$ be a linear operator mapping between Hilbert spaces. Then recall that we may solve the linear problem $Ax = y$ via Landweber iteration (1.24), more specifically defined as the iterative procedure
\[
x_k^\delta = x_{k-1}^\delta + \omega A^*(y^\delta - A x_{k-1}^\delta), \qquad k \in \mathbb{N}, \tag{4.1}
\]
with an initial guess $x_0^\delta := x_0$, which we take as $x_0 = 0$, and a relaxation (i.e., step-size) parameter $\omega \in (0, \|A\|^{-2}]$. Note that without loss of generality, we may assume $\|A\| < 1$ and hence drop the parameter $\omega$. In terms of spectral theory, which we will use for our analysis as we did for Tikhonov regularisation, the Landweber iterates may be expressed as
\[
x_k = \int_0^\infty g_k(\lambda)\, dE_\lambda A^* y, \tag{4.2}
\]
with the filter function this time defined as
\[
g_k(\lambda) := \sum_{j=0}^{k-1} (1 - \lambda)^j,
\]
i.e.,
\[
x_k = \sum_{j=0}^{k-1} (I - A^*A)^j A^* y,
\]
where the second equality follows from the geometric sum formula and our assumption that $\|A\| < 1$.
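For a diagonal operator, the iteration (4.1) acts componentwise, and the iterate can be compared against its filter representation $x_{k,n} = g_k(\sigma_n^2)\,\sigma_n y_n = (1 - (1-\sigma_n^2)^k)\,y_n/\sigma_n$. A minimal sketch of ours (toy data, exact right-hand side):

```python
# Landweber for a diagonal operator A = diag(sigma_n) with ||A|| < 1, x_0 = 0:
#   x_k,n = x_{k-1},n + sigma_n * (y_n - sigma_n * x_{k-1},n)
sigma = [0.9, 0.5, 0.2, 0.05]
xdag  = [1.0, -2.0, 0.5, 3.0]          # exact solution x^dagger
y     = [s * x for s, x in zip(sigma, xdag)]

K = 200
x = [0.0] * len(sigma)
for _ in range(K):
    x = [xn + s * (yn - s * xn) for xn, s, yn in zip(x, sigma, y)]

# filter form: x_{K,n} = (1 - (1 - sigma_n^2)^K) * y_n / sigma_n
for n, s in enumerate(sigma):
    closed = (1.0 - (1.0 - s * s) ** K) * y[n] / s
    assert abs(x[n] - closed) < 1e-10

# with exact data the iterates converge to x^dagger (Proposition 27);
# the component with the largest sigma_n converges fastest
assert abs(x[0] - xdag[0]) < 1e-12
```

The slow convergence of the small-$\sigma_n$ components (here $\sigma_4 = 0.05$ is still far from converged after 200 steps) is precisely the ill-posedness that the stopping index must balance against noise amplification.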
Incidentally, the filter function in (4.2) may also be expressed as
\[
g_k(\lambda) = \frac{1 - (1 - \lambda)^k}{\lambda}, \qquad (0 < \lambda < 1),
\]
and consequently we can write the filter function for the associated residual as
\[
r_k(\lambda) = (1 - \lambda)^k. \tag{4.3}
\]
We now state the following proposition from [35, Theorem 6.1, p. 155], which proves that Landweber iteration is a convergent regularisation method:

Proposition 27. If $y \in \operatorname{dom} A^\dagger$, then $x_k \to x^\dagger$ as $k \to \infty$. If $y \notin \operatorname{dom} A^\dagger$, then $\|x_k\| \to \infty$ as $k \to \infty$.

Proof. For $y \in \operatorname{dom} A^\dagger$, we have
\[
x_k - x^\dagger = -r_k(A^*A)\,x^\dagger = -(I - A^*A)^k\,x^\dagger,
\]
which follows from the definition of the residual filter function, cf. (4.3). Moreover, due to our assumption that $\|A\| < 1$, we have that for all $\lambda \in (0, 1)$:
\[
|\lambda g_k(\lambda)| = \Big|\lambda \sum_{j=0}^{k-1}(1-\lambda)^j\Big| = |1 - (1-\lambda)^k| \le C,
\]
and
\[
\lambda g_k(\lambda) = 1 - (1-\lambda)^k \to 1, \qquad \text{as } k \to \infty \text{ for } \lambda > 0.
\]
Hence,
\[
g_k(\lambda) \to \frac{1}{\lambda}, \qquad (\lambda > 0),
\]
as $k \to \infty$. The rest of the proof then follows à la [35, Theorem 4.1, p. 72].

The observant reader will have noticed that the above proposition is merely a particular instance of Theorem 3. Now, in the next proposition, we provide the standard error estimates [35]:

Proposition 28. Assume that $x^\dagger$ satisfies the source condition (2.5). Then we have
\[
\|x_k^\delta - x_k\| \le C\sqrt{k}\,\delta, \qquad \|x_k - x^\dagger\| \le C(\mu, \omega)\,(k+1)^{-\mu},
\]
\[
\|A(x_k^\delta - x_k)\| \le 2\delta, \qquad \|Ax_k - y\| \le C(\mu, \omega)\,(2\mu + k + 1)^{-\mu-1}\sqrt{k},
\]
and consequently
\[
\|x_k^\delta - x^\dagger\| \le C\sqrt{k}\,\delta + C(\mu, \omega)\,(k+1)^{-\mu},
\]
\[
\|Ax_k^\delta - y^\delta\| \le \delta + C(\mu, \omega)\,(2\mu + k + 1)^{-\mu-1}\sqrt{k},
\]
for all $k \in \mathbb{N}$ and $y, y^\delta \in Y$.

Proof. We begin by following the example of [35, Lemma 6.2, p. 156], first estimating the data propagation error as
\[
\|x_k^\delta - x_k\| = \Big\|\sum_{i=0}^{k-1}(I - A^*A)^i A^*(y^\delta - y)\Big\| \le \Big\|\sum_{i=0}^{k-1}(I - A^*A)^i A^*\Big\|\,\|y^\delta - y\|.
\]
Moreover, recalling that $\{R_k\}$ define our family of regularisation operators, the first term in the product can be estimated as
\[
\Big\|\sum_{i=0}^{k-1}(I - A^*A)^i A^*\Big\|^2 = \|R_k\|^2 = \|R_k R_k^*\|
= \Big\|\sum_{i=0}^{k-1}(I - A^*A)^i\big(I - (I - A^*A)^k\big)\Big\|
\]
\[
= \Big\|\sum_{i=0}^{k-1}(I - A^*A)^i - \sum_{i=0}^{k-1}(I - A^*A)^i(I - A^*A)^k\Big\|
\le \Big\|\sum_{i=0}^{k-1}(I - A^*A)^i\Big\| + \Big\|\sum_{i=0}^{k-1}(I - A^*A)^i\Big\|\,\|(I - A^*A)^k\|
\le 2\Big\|\sum_{i=0}^{k-1}(I - A^*A)^i\Big\| \le 2k,
\]
which allows us to prove the first estimate. We proceed to estimate the approximation error [35, Theorem 6.5, p. 159]:
\[
\|x_k - x^\dagger\|^2 = \|(I - A^*A)^k x^\dagger\|^2 = \int_0^{1^+} (1-\lambda)^{2k}\, d\|E_\lambda x^\dagger\|^2
= \int_0^{1^+} \lambda^{2\mu}(1-\lambda)^{2k}\, d\|E_\lambda \omega\|^2
\]
\[
\le \Big(\frac{\mu}{\mu + k}\Big)^{2\mu} \int_0^{1^+} d\|E_\lambda \omega\|^2
\le \begin{cases} (k+1)^{-2\mu}\,\|\omega\|^2, & \text{if } \mu \le 1, \\ \mu^{2\mu}(k+1)^{-2\mu}\,\|\omega\|^2, & \text{if } \mu > 1. \end{cases}
\]
The residual with exact data may be estimated as
\[
\|Ax_k - y\|^2 = \|(I - AA^*)^k y\|^2 = \|A(I - A^*A)^k x^\dagger\|^2
= \int_0^{1^+} \lambda^{2\mu+1}(1-\lambda)^{2k}\, d\|E_\lambda \omega\|^2
\]
\[
\le \Big(\frac{2\mu+1}{2\mu+1+k}\Big)^{2\mu+1}\,\frac{k}{2\mu+1+k}\,\|\omega\|^2
= (2\mu+1)^{2\mu+1}\,(2\mu + k + 1)^{-2\mu-2}\,k\,\|\omega\|^2.
\]
And
\[
\|A(x_k^\delta - x_k)\| = \|A g_k(A^*A)A^*(y^\delta - y)\| = \|AA^* g_k(AA^*)(y^\delta - y)\|
= \|(I - (I - AA^*)^k)(y^\delta - y)\| \le \delta + \|(I - AA^*)^k(y^\delta - y)\| \le 2\delta.
\]
Finally, the residual with noisy data can be estimated as follows:
\[
\|Ax_k^\delta - y^\delta\| = \|(I - AA^*)^k y^\delta\| = \|(I - AA^*)^k(y^\delta - y + y)\|
\le \|(I - AA^*)^k(y^\delta - y)\| + \|(I - AA^*)^k y\|
\le \delta + C(\mu, \omega)\,(2\mu + k + 1)^{-\mu-1}\sqrt{k}.
\]
Note that the estimate for the total error also follows from a simple application of the triangle inequality.

Remark. We observe, courtesy of the proposition above, that, as opposed to Tikhonov regularisation, Landweber iteration does not exhibit Hölder-type saturation; i.e., with a Hölder-type source condition (2.5), Proposition 28 holds for all $\mu > 0$. In other words, the qualification index is $\mu_0 = \infty$ [35]. It should be noted, however, that one may take a more general view of qualification à la [108]: if for all $0 \le \lambda \le \alpha_{\max}$ there exists a constant $C_\lambda > 0$ such that
\[
\inf_{0 \le \alpha \le \alpha_{\max}} \frac{|1 - \lambda g_\alpha(\lambda)|}{\rho(\alpha)} \ge C_\lambda,
\]
then $\rho(\alpha)$ is said to be the maximal qualification.
Now, taking $\alpha \sim \frac{1}{k}$, [108] gives the maximal qualification for Landweber iteration as $\rho_\kappa(k) = e^{-\kappa k}$, with a positive constant $\kappa > 0$, and consequently with saturation
\[
\varphi(\delta) = \delta\,\Big(\log\frac{1}{\delta}\Big)^{\frac{1}{2\kappa}}.
\]
Thus, one concludes that one of the most "powerful" regularisations may not regularise optimally, even for mildly ill-posed problems (cf. [108]).

Remark. Note that in the absence of a source condition, one may still obtain estimates, and subsequently convergence, for the error components with respect to the exact data. For instance,
\[
\|Ax_k - y\| \le \frac{\|x_1 - x^\dagger\|}{\sqrt{k}},
\]
where $x_1$ is the first Landweber iterate (cf. [35, p. 158]).

In light of the estimates in the previous proposition, one may observe that the asymptotic behaviour of the function $k \mapsto \|x_k^\delta - x^\dagger\|$ is opposite to that of $\alpha \mapsto \|x_\alpha^\delta - x^\dagger\|$, where $x_\alpha^\delta$ minimises (2.1). That is, the data propagation error is of the order of $\sqrt{k}\,\delta$, which is the dominating term for large $k$ and "blows up" (i.e., tends to infinity) as $k \to \infty$. The approximation error is of the order of $(k+1)^{-\mu}$, which is the larger term for small $k$. Thus, as stated previously, the behaviour of the stopping index can be somewhat loosely related to the Tikhonov regularisation parameter by $\alpha \sim \frac{1}{k}$.

We now state the general convergence rates theorem for Landweber iteration [35]:

Theorem 19. If $x^\dagger$ satisfies a source condition (1.17), then choosing $k = k(\delta) = k_{\mathrm{opt}} := O\big(\delta^{-\frac{2}{2\mu+1}}\big)$ yields
\[
\|x_k^\delta - x^\dagger\| = O\big(\delta^{\frac{2\mu}{2\mu+1}}\big), \qquad \text{as } \delta \to 0.
\]

Proof. From the previous estimates of Proposition 28, we have
\[
\|x_{k}^\delta - x^\dagger\| \le \|x_k^\delta - x_k\| + \|x_k - x^\dagger\| = O\big(\sqrt{k}\,\delta + (k+1)^{-\mu}\big),
\]
and the result simply follows by bounding $(k+1)^{-\mu} \le k^{-\mu}$ and the choice of $k = k_{\mathrm{opt}}$.
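The balance $\sqrt{k}\,\delta$ versus $(k+1)^{-\mu}$ and the resulting rate of Theorem 19 can be illustrated by brute-force minimisation of the error bound (our own numerical sketch, with $\mu = 1$ and unit constants):

```python
import math

mu = 1.0

def best_error(delta, kmax=300000):
    # minimise the total-error bound  sqrt(k)*delta + k^(-mu)  over integer k
    return min(math.sqrt(k) * delta + k ** -mu for k in range(1, kmax))

d1, d2 = 1e-4, 1e-6
e1, e2 = best_error(d1), best_error(d2)

# empirical rate exponent between the two noise levels
rate = math.log(e1 / e2) / math.log(d1 / d2)

# Theorem 19 predicts the exponent 2*mu/(2*mu + 1) = 2/3
assert abs(rate - 2.0 / 3.0) < 0.02
```

The minimiser sits near $k \sim \delta^{-2/(2\mu+1)}$ (here roughly $k \approx (2/\delta)^{2/3}$), matching $k_{\mathrm{opt}}$ of the theorem.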
4.1.1 Heuristic Stopping Rules

We may use a heuristic stopping rule to select $k_*$ according to (1.33), with the functionals defined as in Chapter 1:
\[
\psi_{HD}(k, y^\delta) = \sqrt{k}\,\|p_k^\delta\| = \sqrt{k}\,\|r_k(AA^*)\,y^\delta\|,
\]
\[
\psi_{HR}(k, y^\delta) = \sqrt{k}\,\langle p_{2k}^\delta, p_k^\delta\rangle^{\frac{1}{2}} = \sqrt{k}\,\|r_k^{\frac{3}{2}}(AA^*)\,y^\delta\|,
\]
\[
\psi_{L}(k, y^\delta) = \langle x_k^\delta, x_{2k}^\delta - x_k^\delta\rangle^{\frac{1}{2}} = \|A^*\,g_k(AA^*)\,r_k^{\frac{1}{2}}(AA^*)\,y^\delta\|,
\]
\[
\psi_{QO}(k, y^\delta) = \|x_{2k}^\delta - x_k^\delta\| = \|A^*\,g_k(AA^*)\,r_k(AA^*)\,y^\delta\|,
\]
where $x_{2k}^\delta$, $p_{2k}^\delta$ may be identified with the second iterates as explained in Chapter 1; see (1.27). Recalling that the heuristic functionals may be expressed in the form
\[
\psi^2(k, y^\delta) = \int_0^{1^+} \Phi_k(\lambda)\, d\|F_\lambda y^\delta\|^2,
\]
we can, similarly as for Tikhonov regularisation, write the associated filter functions $\Phi_k : (0, \infty) \to \mathbb{R}$ for the heuristic rules as [83, 91]:
\[
\Phi_k(\lambda) = k\,r_k^2(\lambda) = k\,(1-\lambda)^{2k}, \tag{HD}
\]
\[
\Phi_k(\lambda) = k\,r_k^3(\lambda) = k\,(1-\lambda)^{3k}, \tag{HR}
\]
\[
\Phi_k(\lambda) = \lambda\,g_k^2(\lambda)\,r_k(\lambda) = \frac{1}{\lambda}\big(1 - (1-\lambda)^k\big)\big((1-\lambda)^k - (1-\lambda)^{2k}\big), \tag{L}
\]
\[
\Phi_k(\lambda) = \lambda\,g_k^2(\lambda)\,r_k^2(\lambda) = \frac{1}{\lambda}\big[(1-\lambda)^k - (1-\lambda)^{2k}\big]^2. \tag{QO}
\]
The observant reader will note from the above equations that for Landweber iteration, one has $\psi_{HD}(k, y^\delta) \approx \psi_{HR}(k, y^\delta)$. This observation was also made in [59] (where both rules were originally introduced). Analogously to Proposition 9 of Chapter 2, we first prove that we can estimate the stopping index from above by its associated heuristic functional:

Proposition 29. Let
\[
k_* = \operatorname*{argmin}_{k \in \mathbb{N}}\, \psi(k, y^\delta),
\]
with $\psi \in \{\psi_{HD}, \psi_{HR}, \psi_{QO}, \psi_{L}\}$, and assume there exist positive constants such that $\|y^\delta\| \ge C_1$ whenever $\psi \in \{\psi_{HD}, \psi_{HR}\}$ and $\|A^* y^\delta\| \ge C_2$ whenever $\psi \in \{\psi_{L}, \psi_{QO}\}$. Then
\[
\psi_{HD}(k_*, y^\delta) \ge C\sqrt{k_*}\,(1 - \|AA^*\|)^{k_*},
\]
\[
\psi_{HR}(k_*, y^\delta) \ge C\sqrt{k_*}\,(1 - \|AA^*\|)^{\frac{3k_*}{2}},
\]
\[
\psi_{L}(k_*, y^\delta) \ge C\,(1 - \|A^*A\|)^{\frac{k_*}{2}},
\]
\[
\psi_{QO}(k_*, y^\delta) \ge C\,(1 - \|A^*A\|)^{k_*},
\]
for all $y^\delta \in Y$.

Proof. This follows identically as in the proof of Proposition 9.
For instance, if $\psi = \psi_{HD}$, then since
\[
\|y^\delta\| = \|(I - AA^*)^{-k}(I - AA^*)^k y^\delta\| \le \|(I - AA^*)^{-k}\|\,\|(I - AA^*)^k y^\delta\|,
\]
one has that
\[
\|(I - AA^*)^k y^\delta\| \ge \frac{\|y^\delta\|}{\|(I - AA^*)^{-k}\|} \ge C\,(1 - \|AA^*\|)^k,
\]
which follows from the observation that
\[
\|(I - AA^*)^{-k}\| \le \|(I - AA^*)^{-1}\|^k \le \frac{1}{(1 - \|AA^*\|)^k},
\]
where we used the Neumann series estimate $\|(I - AA^*)^{-1}\| \le (1 - \|AA^*\|)^{-1}$ [139]. Thus, combining the above yields
\[
\psi_{HD}(k, y^\delta) = \sqrt{k}\,\|(I - AA^*)^k y^\delta\| \ge C\sqrt{k}\,(1 - \|AA^*\|)^k,
\]
from which we can derive the estimate $k \le C'\,(1 - \|AA^*\|)^{-2k}\,\psi_{HD}^2(k, y^\delta)$. For $\psi = \psi_{HR}$, it follows in exactly the same way, observing that we may write $\psi_{HR}(k, y^\delta) = \sqrt{k}\,\|(I - AA^*)^{\frac{3k}{2}} y^\delta\|$. For $\psi = \psi_{QO}$, it follows similarly. In particular, since we may write
\[
x_{2k}^\delta - x_k^\delta = \sum_{j=0}^{2k-1}(I - A^*A)^j A^* y^\delta - \sum_{j=0}^{k-1}(I - A^*A)^j A^* y^\delta = \sum_{j=k}^{2k-1}(I - A^*A)^j A^* y^\delta,
\]
we thus have that
\[
\psi_{QO}^2(k, y^\delta) = \|x_{2k}^\delta - x_k^\delta\|^2 = \sum_{j=k}^{2k-1}\sum_{i=k}^{2k-1}\big\langle (I - A^*A)^j A^* y^\delta,\, (I - A^*A)^i A^* y^\delta\big\rangle
= \sum_{j=k}^{2k-1}\sum_{i=k}^{2k-1}\big\langle (I - A^*A)^{i+j} A^* y^\delta,\, A^* y^\delta\big\rangle
\]
\[
\ge \sum_{j=k}^{2k-1}\sum_{i=k}^{2k-1}(1 - \|A^*A\|)^{i+j}\,\|A^* y^\delta\|^2
= \bigg(\frac{1 - (1 - \|A^*A\|)^k}{1 - (1 - \|A^*A\|)}\bigg)^2\,(1 - \|A^*A\|)^{2k}\,\|A^* y^\delta\|^2,
\]
where we have twice used the formula for the geometric sum. Now, since $1 - \|A^*A\| < 1$ and
\[
0 \le (1 - \|A^*A\|)^k \le 1 - \|A^*A\| \;\implies\; 1 - (1 - \|A^*A\|)^k \ge 1 - (1 - \|A^*A\|) = \|A^*A\|,
\]
it follows that
\[
\frac{1 - (1 - \|A^*A\|)^k}{1 - (1 - \|A^*A\|)} \ge \frac{\|A^*A\|}{\|A^*A\|} = 1.
\]
Hence,
\[
\psi_{QO}(k, y^\delta) \ge (1 - \|A^*A\|)^k\,\|A^* y^\delta\|,
\]
yielding the desired estimate. Finally, for $\psi = \psi_L$, it similarly follows that
\[
\psi_L^2(k, y^\delta) = \langle x_k^\delta, x_{2k}^\delta - x_k^\delta\rangle = \Big\langle \sum_{j=0}^{k-1}(I - A^*A)^j A^* y^\delta,\, \sum_{i=k}^{2k-1}(I - A^*A)^i A^* y^\delta\Big\rangle
\ge \sum_{j=0}^{k-1}\sum_{i=k}^{2k-1}(1 - \|A^*A\|)^{j+i}\,\|A^* y^\delta\|^2
\]
\[
= \bigg(\sum_{j=0}^{k-1}(1 - \|A^*A\|)^j\bigg)\bigg(\sum_{i=k}^{2k-1}(1 - \|A^*A\|)^i\bigg)\,\|A^* y^\delta\|^2
= \bigg(\frac{1 - (1 - \|A^*A\|)^k}{1 - (1 - \|A^*A\|)}\bigg)^2\,(1 - \|A^*A\|)^k\,\|A^* y^\delta\|^2
\ge C\,(1 - \|A^*A\|)^k,
\]
by similar arguments as we used for the $\psi_{QO}$ estimate, which completes the proof.

Proposition 30. Let the source condition (2.5) be satisfied.
Then for all $\psi \in \{\psi_{HD}, \psi_{HR}, \psi_{L}, \psi_{QO}\}$, there exist positive constants such that
\[
\psi(k, y - y^\delta) \le C\sqrt{k}\,\delta, \qquad \text{and} \qquad \psi(k, y) \le C_\mu\,(k + 1)^{-\mu},
\]
for all $k \in \mathbb{N}$ and $y, y^\delta \in Y$.

Proof. In case $\psi = \psi_{HD}$, it follows immediately that
\[
\psi_{HD}(k, y - y^\delta) = \sqrt{k}\,\|(I - AA^*)^k(y - y^\delta)\| \le \sqrt{k}\,\delta,
\]
and
\[
\psi_{HD}(k, y) = \sqrt{k}\,\|Ax_k - y\| \le C\,(2\mu + k + 1)^{-\mu-1}\,k \le C\,(k+1)^{-\mu-1}(k+1) = C\,(k+1)^{-\mu}.
\]
For $\psi = \psi_{HR}$, the estimates follow similarly as for the HD rule from the observations that
\[
\psi_{HR}(k, y - y^\delta) \le \sqrt{k}\,\|(I - AA^*)^{\frac{3k}{2}}\|\,\|y - y^\delta\| \le C\sqrt{k}\,\delta,
\]
and
\[
\psi_{HR}(k, y) = \sqrt{k}\,\big\|Ax_{\frac{3k}{2}} - y\big\| \le C\,\Big(2\mu + \frac{3k}{2} + 1\Big)^{-\mu-1}\,k \le C_\mu\,(k+1)^{-\mu},
\]
with $C_\mu = \big(\frac{2}{3}\big)^{-\mu-1} C$. For $\psi = \psi_{QO}$, note that we may write
\[
\psi_{QO}(k, y) = \|A^* g_k(AA^*)\,r_k(AA^*)\,y\|
\le \|(AA^*)^{\frac{1}{2}} g_k(AA^*)^{\frac{1}{2}}\|\,\|g_k(AA^*)^{\frac{1}{2}}\|\,\|r_k(AA^*)\,y\|
\le C\sqrt{k}\,\|p_k\| \le C\,(k + 1)^{-\mu},
\]
since $|\lambda g_k(\lambda)| \le C$ and $|g_k(\lambda)| \le k$ for all $\lambda \in (0, 1)$, and due to similar considerations as above. Furthermore,
\[
\psi_{QO}(k, y - y^\delta) = \|A^* g_k(AA^*)\,r_k(AA^*)(y - y^\delta)\|
\le \|(AA^*)^{\frac{1}{2}} g_k(AA^*)^{\frac{1}{2}}\|\,\|r_k(AA^*)\|\,\|g_k(AA^*)^{\frac{1}{2}}\|\,\|y - y^\delta\|
\le \sqrt{k}\,\delta,
\]
which follows from the previous recollections and the fact that $|r_k(\lambda)| \le C$ for all $0 < \lambda < 1$. For $\psi = \psi_L$, the upper bounds follow similarly, as
\[
\psi_L(k, y) = \|A^* g_k(AA^*)\,r_k(AA^*)^{\frac{1}{2}}\,y\| = \|A^*A\,g_k(A^*A)\,r_k(A^*A)^{\frac{1}{2}}\,x^\dagger\|
\le \|A^*A\,g_k(A^*A)\|\,\|r_k(A^*A)^{\frac{1}{2}}\,x^\dagger\|
\le C\,\Big(\frac{k}{2} + 1\Big)^{-\mu} = C\cdot 2^\mu\,(k + 2)^{-\mu} \le C_\mu\,(k + 1)^{-\mu},
\]
and also
\[
\psi_L(k, y - y^\delta) \le \|(AA^*)^{\frac{1}{2}} g_k(AA^*)^{\frac{1}{2}}\|\,\|r_k(AA^*)^{\frac{1}{2}}\|\,\|g_k(AA^*)^{\frac{1}{2}}\|\,\|y - y^\delta\| \le C\sqrt{k}\,\delta,
\]
due to similar estimates as before, which completes the proof.

Note that the Muckenhoupt condition (2.17) remains identical to that used for Tikhonov regularisation when we consider the relationship $\alpha \sim \frac{1}{k}$; that is, $e \in \mathcal{N}_p$ if there exists a positive constant such that
\[
\int_{1/k}^{\infty} \lambda^{-1}\, d\|F_\lambda e\|^2 \;\le\; C\,k^{p}\int_0^{1/k} \lambda^{p-1}\, d\|F_\lambda e\|^2, \tag{4.4}
\]
for all $k \in \mathbb{N}$.

Proposition 31. Assume that $k > 1$ and
\[
e = y - y^\delta \in \begin{cases} \mathcal{N}_1, & \text{if } \psi \in \{\psi_{HD}, \psi_{HR}\}, \\ \mathcal{N}_2, & \text{if } \psi \in \{\psi_{QO}, \psi_{L}\}. \end{cases}
\]
Then there exists a positive constant such that
\[
\|x_k - x_k^\delta\| \le C\,\psi(k, y - y^\delta),
\]
for all $k \in \mathbb{N}$ [83, 91].

Proof. First we estimate the data propagation error as
\[
\|x_k - x_k^\delta\|^2 = \bigg(\int_0^{1/k} + \int_{1/k}^{1^+}\bigg)\, \lambda\,g_k^2(\lambda)\, d\|F_\lambda(y^\delta - y)\|^2.
\]
We may bound the integrand of the first integral via the estimates
\[
|\lambda\,g_k^2(\lambda)| \le C\,k^2\lambda \qquad \text{and} \qquad |\lambda\,g_k^2(\lambda)| \le C\,|g_k(\lambda)| \le C\,k,
\]
and similarly, since $\lambda\,g_k^2(\lambda) = \lambda^{-1}\big(1 - (1-\lambda)^k\big)^2$, it follows that the integrand of the second integral can be bounded from above by $\lambda^{-1}$, since $1 - (1-\lambda)^k < 1$; both of the aforementioned estimates hold in fact for all $\lambda \in (0, 1)$. Thus, we have
\[
\|x_k - x_k^\delta\|^2 \le C\bigg(k^2\int_0^{1/k} \lambda\, d\|F_\lambda(y^\delta - y)\|^2 + \int_{1/k}^{1^+} \lambda^{-1}\, d\|F_\lambda(y^\delta - y)\|^2\bigg), \tag{4.5}
\]
\[
\|x_k - x_k^\delta\|^2 \le C\bigg(k\int_0^{1/k} d\|F_\lambda(y^\delta - y)\|^2 + \int_{1/k}^{1^+} \lambda^{-1}\, d\|F_\lambda(y^\delta - y)\|^2\bigg). \tag{4.6}
\]
In case $p = 2$, the second term in (4.5) may be bounded by the first term via (4.4); for $p = 1$, similarly, the second term in (4.6) may be estimated by the first. Hence, for each $\psi$-functional, similarly as for Tikhonov regularisation, it suffices to bound the $\psi$-functional acting on the noise $y - y^\delta$ from below by the first term in either (4.6) or (4.5); that is,
\[
\psi^2(k, y - y^\delta) \ge C\,k^p \int_0^{1/k} \lambda^{p-1}\, d\|F_\lambda(y - y^\delta)\|^2, \tag{4.7}
\]
where $p = 1$ for $\psi \in \{\psi_{HD}, \psi_{HR}\}$ and $p = 2$ for $\psi \in \{\psi_{QO}, \psi_{L}\}$, as per the statement of the proposition. For $\psi = \psi_{HD}$, we have
\[
\psi_{HD}^2(k, y - y^\delta) = k\,\|(I - AA^*)^k(y - y^\delta)\|^2 \ge k\int_0^{1/k}(1-\lambda)^{2k}\, d\|F_\lambda(y - y^\delta)\|^2
\]
\[
\ge C\,k\int_0^{1/k} d\|F_\lambda(y - y^\delta)\|^2 \;\overset{(4.4)}{\ge}\; C\int_{1/k}^{1^+} \lambda^{-1}\, d\|F_\lambda(y - y^\delta)\|^2,
\]
where the second-to-last inequality follows from the fact that for $0 < \lambda \le \frac{1}{k}$ we have
\[
(1 - \lambda)^{2k} \ge \Big(1 - \frac{1}{k}\Big)^{2k} \ge C^2, \qquad (k > 2), \tag{4.8}
\]
which is what we wanted to show. Similarly, for $\psi = \psi_{HR}$, we have
\[
\psi_{HR}^2(k, y - y^\delta) = k\,\big\langle (I - AA^*)^{2k}(y - y^\delta),\, (I - AA^*)^k(y - y^\delta)\big\rangle
\ge k\int_0^{1/k}(1-\lambda)^{2k}(1-\lambda)^k\, d\|F_\lambda(y - y^\delta)\|^2
\]
\[
\overset{(4.8)}{\ge} C\,k\int_0^{1/k} d\|F_\lambda(y - y^\delta)\|^2 \;\overset{(4.4)}{\ge}\; C\int_{1/k}^{1^+} \lambda^{-1}\, d\|F_\lambda(y - y^\delta)\|^2,
\]
by the same token.
In case $\psi = \psi_{QO}$,
\[
\psi_{QO}^2(k, y - y^\delta) \ge \int_0^{1/k} \lambda\,g_k^2(\lambda)\,r_k^2(\lambda)\, d\|F_\lambda(y - y^\delta)\|^2
\ge C\,k^2\int_0^{1/k} \lambda\, d\|F_\lambda(y - y^\delta)\|^2,
\]
which follows from the fact that for $0 < \lambda \le \frac{1}{k}$, we have