Eindhoven University of Technology
MASTER
Approximation of inverses of BTTB matrices for preconditioning applications
Schneider, F.S.
Award date: 2017
APPROXIMATION OF INVERSES OF BTTB MATRICES
for Preconditioning Applications
MASTER THESIS
by
Frank Schneider
December 2016
Dr. Maxim Pisarenco Department of Research ASML Netherlands B.V., Veldhoven
Dr. Michiel Hochstenbach Department of Mathematics and Computer Science Technische Universiteit Eindhoven (TU/e)
Prof. Dr. Bernard Haasdonk Institute of Applied Analysis and Numerical Simulation Universität Stuttgart
Submitted in partial fulfillment of the requirements for the degree of Master of Science (M.Sc.) in Industrial and Applied Mathematics (IAM) to the Department of Mathematics and Computer Science, Technische Universiteit Eindhoven (TU/e), as well as for the degree of Master of Science (M.Sc.) in Simulation Technology to the Institute of Applied Analysis and Numerical Simulation, Universität Stuttgart.
The work described in this thesis has been carried out under the auspices of ASML Netherlands B.V., Veldhoven, The Netherlands.
ABSTRACT
The metrology of integrated circuits (ICs) requires multiple solutions of a large-scale linear system. The time needed for solving this system greatly determines the number of chips that can be processed per time unit.

Since the coefficient matrix is partly composed of block-Toeplitz-Toeplitz-block (BTTB) matrices, approximations of its inverse are interesting candidates for a preconditioner.

In this work, different approximation techniques, such as an approximation by sums of Kronecker products or an approximation by inverting the corresponding generating function, are examined and, where necessary, generalized for BTTB and BTTB-block matrices. The computational complexity of each approach is assessed and their utilization as a preconditioner evaluated.

The performance of the discussed preconditioners is investigated for a number of test cases stemming from real-life applications.
ACKNOWLEDGEMENT
First and foremost, I wish to thank my supervisor from ASML, Maxim Pisarenco. Maxim has supported me not only by providing valuable feedback over the course of the thesis, but also by always being there to answer all my questions. He guided the thesis while allowing me the freedom to explore the areas that tempted me the most.

I also want to thank my supervisor from the TU/e, Michiel Hochstenbach, who was an excellent source of knowledge, academically and emotionally. Thank you for all the helpful feedback, not only regarding the work and the thesis, but also regarding future plans.

I owe thanks to the members of my thesis committee, professor Barry Koren and Martijn van Beurden from the TU/e and professor Bernard Haasdonk from the University of Stuttgart. Thank you for your valuable guidance and insightful comments.
Thank you very much, everyone!
Frank Schneider
Eindhoven, December 28, 2016.
CONTENTS

i   introduction                                                      1
1   motivation                                                        3
    1.1   Photolithography                                            4
          1.1.1   Metrology                                           5
    1.2   Other Applications                                          8
          1.2.1   Deblurring Images                                   9
          1.2.2   Further Applications                               11
2   linear systems                                                   13
    2.1   Iterative Solvers                                          13
          2.1.1   CG Method                                          13
          2.1.2   Other Methods                                      16
    2.2   Preconditioning                                            17
3   toeplitz systems                                                 19
    3.1   Multi-level Toeplitz Matrices                              21
    3.2   Circulant Matrices                                         23
    3.3   Hankel                                                     23
4   problem description                                              25
    4.1   Full Problem                                               25
    4.2   BTTB-Block System                                          28
    4.3   BTTB System                                                29
5   thesis overview                                                  31

ii  preconditioners                                                  33
6   overview of the preconditioning techniques                       35
7   full c preconditioner                                            37
    7.1   Application to Full Problem                                37
          7.1.1   Inversion                                          37
          7.1.2   MVP                                                37
8   circulant approximation                                          39
    8.1   Circulant Approximation for Toeplitz Matrices              39
          8.1.1   Circulant Preconditioners                          40
    8.2   Circulant Approximation for BTTB Matrices                  42
          8.2.1   Toeplitz-block Matrices                            43
          8.2.2   Block-Toeplitz Matrices                            43
    8.3   Application to BTTB-block Matrices                         45
          8.3.1   Inversion                                          45
          8.3.2   MVP                                                46
9   inverse generating function approach                             47
    9.1   Inverse Generating Function for Toeplitz and BTTB Matrices 47
          9.1.1   Unknown Generating Function                        48
          9.1.2   Numerical Integration for Computing the Fourier Coefficients  49
          9.1.3   Numerical Inversion of the Generating Function     50
          9.1.4   Example                                            51
          9.1.5   Efficient Inversion and MVP                        51
    9.2   Inverse Generating Function for BTTB-block Matrices        52
          9.2.1   General Approach                                   53
          9.2.2   Preliminaries                                      53
          9.2.3   Proof of Clustering of the Eigenvalues             57
          9.2.4   Example                                            63
    9.3   Regularizing Functions                                     63
    9.4   Numerical Experiments                                      65
          9.4.1   Convergence of the IGF                             65
          9.4.2   IGF for a BTTB-block Matrix                        66
10  kronecker product approximation                                  69
    10.1  Optimal Approximation for BTTB Matrices                    69
          10.1.1  Algorithm                                          71
          10.1.2  Inverse and MVP                                    71
    10.2  BTTB-block Matrices                                        73
          10.2.1  One Term Approximation                             75
          10.2.2  Multiple Terms Approximation                       77
    10.3  Numerical Experiments                                      79
          10.3.1  Convergence of the Kronecker Product Approximation 79
          10.3.2  Decay of Singular Values                           80
          10.3.3  Relation to the Generating Function                80
11  more ideas                                                       83
    11.1  Transformation Based Preconditioners                       83
          11.1.1  Discrete Sine and Cosine Transform                 83
          11.1.2  Hartley Transform                                  84
    11.2  Banded Approximations                                      85
    11.3  Koyuncu Factorization                                      86
    11.4  Low-Rank Update                                            88

iii benchmarks                                                       91
12  benchmarks                                                       93
    12.1  Transformation-based Preconditioner                        99
    12.2  Kronecker Product Approximation                           100
    12.3  Inverse Generating Function                               101
    12.4  Banded Approximation                                      103

iv  conclusion                                                      105
13  future work                                                     107
    13.1  Inverse Generating Function                               107
          13.1.1  Regularization                                    107
          13.1.2  Other Kernels                                     107
    13.2  Kronecker Product Approximation                           108
          13.2.1  Using a Common Basis                              108
    13.3  Preconditioner Selection                                  108
14  conclusion                                                      111

v   appendix                                                        113
a   inversion formulas for kronecker product approximation          115
    a.1   One Term Approximation                                    115
          a.1.1   Sum Approximation                                 115
    a.2   Multiple Terms Approximation                              119
          a.2.1   Sum Approximation                                 119
bibliography                                                        123
LIST OF FIGURES

Figure 1.1   Moore's law.                                             3
Figure 1.2   Photolithographic process.                               4
Figure 1.3   Close-up of a wafer.                                     5
Figure 1.4   Effect of focus on the gratings.                         6
Figure 1.5   Indirect grating measurement.                            6
Figure 1.6   Shape parameters for a trapezoidal grating.              7
Figure 1.7   Example for a PSF.                                      10
Figure 1.8   Blurring problem.                                       10
Figure 2.1   Minimization function.                                  14
Figure 2.2   Convergence of gradient descent and conjugate gradient (CG) method for different functions φ.  15
Figure 2.3   Preconditioner trade off.                               18
Figure 4.1   Sparsity patterns of the matrices C, G and M as well as the resulting matrix A.  25
Figure 4.2   Sparsity pattern of C.                                  26
Figure 4.3   Color plots of all levels of C.                         27
Figure 8.1   Color plots for a Toeplitz-block matrix and its circulant-block approximation.  43
Figure 8.2   Color plots for a block-Toeplitz matrix and its block-circulant approximation.  44
Figure 9.1   Illustration of the inverse generating function approach (marked in red).  48
Figure 9.2   Illustration of the inverse generating function approach for unknown generating functions, with the changes marked in red.  49
Figure 9.3   Illustration of the inverse generating function approach with numerical integration (highlighted in red).  50
Figure 9.4   Illustration of the inverse generating function approach using a sampled generating function.  51
Figure 9.5   Color plots for the inverse of the original BTTB matrix, $T[f]^{-1}$, the result of the inverse generating function method $T[1/f]$ and the difference between those two.  52
Figure 9.6   Illustration of the inverse generating function for Toeplitz-block matrices.  53
Figure 9.7   Color plots for the inverse of the original 2 × 2 BTTB-block matrix, $T[F(x,y)]^{-1}$, the result of the inverse generating function method $T[1/F(x,y)]$ and the difference of those two.  64
Figure 9.8   Degrees of regularization.                              65
Figure 9.9   Convergence of the inverse generating function (IGF) method towards the exact inverse.  66
Figure 9.10  Distribution of eigenvalues for the IGF.                68
Figure 10.1  Relative difference of the Kronecker product approximation (using all terms) and the original BTTB matrix, for 500 randomly created test cases.  80
Figure 10.2  Decay of the singular values of a sample test case.     81
Figure 10.3  Relation of the Kronecker product approximation and the generating functions (taken from test case 1b).  82
Figure 10.4  Convergence of the Generating Function.                 82
Figure 11.1  Color plots for a BTTB matrix and the approximation resulting from discrete sine transform (DST) II.  84
Figure 11.2  Color plots for a BTTB matrix and the approximation resulting from discrete cosine transform (DCT) II.  84
Figure 11.3  Color plots for a BTTB matrix and the approximation resulting from a Hartley transformation.  85
Figure 11.4  Color plots for a BTTB matrix and a tridiagonal approximation on both levels.  85
Figure 11.5  Relative difference of $G_M$ and $U_k S_k V_k^H$ for different values of k and four different test cases.  89
Figure 12.1  Box plots for the relative speed up of each preconditioner compared to the circulant preconditioner.  98
LIST OF TABLES

Table 4.1   Structure of C on each level, from highest (level Z) to lowest (level X).  27
Table 4.2   Convergence rates (number of iterations) for (4.1) using induced dimension reduction (IDR)(6).  28
Table 6.1   Applicability of different preconditioning methods.      35
Table 9.1   The average number of iterations for 3 × 3 BTTB matrix.  68
Table 12.1  Color-code for the tables in the benchmark chapter.      96
Table 12.2  Number of iterations if a selected preconditioner is used on a certain test case.  97
Table 12.3  Number of iterations for transformation based preconditioners.  99
Table 12.4  Number of iterations for preconditioners based on the Kronecker product approximation.  100
Table 12.5  Number of iterations for preconditioners based on the Kronecker product approximation with approximate singular value decomposition (SVD).  102
Table 12.6  Number of iterations for preconditioners based on the IGF.  103
Table 12.7  Number of iterations for preconditioners based on banded approximations.  104
NOMENCLATURE

vectors
$x$: the vector $x = (x_1, x_2, \dots)^T$ with the elements $x_1, x_2, \dots \in \mathbb{C}$.
Vector Norms
$\|x\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}$ is the p-norm of $x$ for $1 \leqslant p \leqslant \infty$.

$\|x\|_\infty = \max_{1 \leqslant i \leqslant n} |x_i|$ is the $\infty$-norm.

matrices
$A^{(n)}$: a matrix of size $n \times n$, with the entries $(A)_{i,j} = A_{i,j} = a_{i,j}$ for $i, j = 1, \dots, n$.

$A^{(n_1;n_2)}_{i,j;k,l}$: refers to the $(k,l)$th entry of the $(i,j)$th block of matrix $A$, which has the size $(n_1 \cdot n_2) \times (n_1 \cdot n_2)$. $A_{i,j;:,:}$ or $A_{i,j;}$ consequently refers to the $(i,j)$th block matrix of $A$.

$A^{(\text{name})}$: endows the matrix with a certain name that helps in understanding its purpose.

$T_i$: a matrix with just one index refers to an entry of a Toeplitz, circulant or similar matrix that can be described with just a few entries. This is also true, for example, for two-level Toeplitz matrices, where one entry can be referred to as $T_{i;j}$. Note that $i$ and $j$ can be negative, following the specific nomenclature of the class of matrix.
Furthermore:
$I$ is the identity matrix, where $I_{i,j} = \delta_{ij}$.

$A^T$ is the transposed matrix of $A$, where $A^T_{i,j} = A_{j,i}$.

$A^{-1}$ is the inverse matrix of $A$, where $AA^{-1} = A^{-1}A = I$.

$A^H$ is the conjugated transposed matrix of $A$, where $(A^H)_{i,j} = \overline{A_{j,i}}$. The overbar denotes the complex conjugate ($\overline{a + bi} = a - bi$).

A square matrix is called symmetric if $A = A^T$.

A matrix $A$ is called positive definite if $x^H A x > 0$ and real, for all non-zero vectors $x$.

A matrix $A$ is called Hermitian or self-adjoint if $A = A^H$.

$\lambda_k$ denotes the eigenvalues of $A$, where $A v_k = \lambda_k v_k$.

$\kappa(A)$ is the condition number of $A$, which is defined as $\kappa(A) = \|A^{-1}\| \cdot \|A\|$ (usually using the 2-norm).
Matrix Norms
$\|A\|_p = \sup_{x \neq 0} \frac{\|Ax\|_p}{\|x\|_p}$, the matrix norm induced by the vector norm $\|x\|_p$. In particular:

$\|A\|_1 = \max_{1 \leqslant j \leqslant n} \sum_{i=1}^{m} |a_{i,j}|$, which is simply the maximum absolute column sum of the matrix.

$\|A\|_\infty = \max_{1 \leqslant i \leqslant m} \sum_{j=1}^{n} |a_{i,j}|$, which is simply the maximum absolute row sum of the matrix.

$\|A\|_2 = \sqrt{\lambda_{\max}(A^H A)}$.

$\|A\|_F = \left( \sum_{i=1}^{m} \sum_{j=1}^{n} |a_{i,j}|^2 \right)^{1/2}$, called the Frobenius norm.

miscellaneous

$\delta_{ij}$: the Kronecker delta, with $\delta_{ij} = \begin{cases} 0 & \text{if } i \neq j, \\ 1 & \text{if } i = j. \end{cases}$

$f \in O(g)$ is equivalent to: for $x \to a < \infty$, $\exists\, C > 0\ \exists\, \varepsilon > 0\ \forall x \in \{x : d(x,a) < \varepsilon\} : |f(x)| \leqslant C \cdot |g(x)|$, known as big O notation.

$i$: the imaginary number, where $i^2 = -1$.

acronyms
BCCB block-circulant-circulant-block
BiCG biconjugate gradient
BiCGSTAB biconjugate gradient stabilized
BTTB block-Toeplitz-Toeplitz-block
CCD charge-coupled device
CG conjugate gradient
DCT discrete cosine transform
DFT discrete Fourier transform
DST discrete sine transform
DTT discrete trigonometric transform
FFT Fast Fourier transform
GMRES generalized minimal residual
HPD Hermitian positive definite
IC integrated circuit
IDR induced dimension reduction
IGF inverse generating function
MVP matrix-vector product
ODE ordinary differential equation
PDE partial differential equation
PSF point spread function
SVD singular value decomposition
Part I
INTRODUCTION
This part introduces the main application that motivated this master thesis. It explains the process of fabrication of ICs via photolithography and also includes a description of a method of inspecting and monitoring the production quality of the fabricated ICs. This metrology process requires the solution of large linear systems of equations. Before the structure of this particular linear system is further described along with two related (reduced) linear systems, the basic terms concerning linear systems and their iterative solvers are introduced. Furthermore, the idea of preconditioning is described along with the definition of Toeplitz systems. In the last chapter of this part, the main objectives of this master thesis are discussed along with the main results. This chapter concludes with an outline of the following parts and chapters.
MOTIVATION

Following Moore's Law (see Figure 1.1), the performance of integrated circuits (ICs) has steadily increased and fueled what is known as the "Digital Revolution". In 1965, Gordon E. Moore proposed in an article that the number of transistors that can be packed into a given unit of space will double roughly every two years [40]. Due to the ever increasing complexity and availability of digital electronics, influential developments such as the personal computer, the internet or the cellular phone have been made possible and affected almost every area of our lives.
Figure 1.1: Moore's law. This figure shows the number of transistors of landmark microprocessors against their year of introduction. The line shows the proposed doubling in transistor count every two years.
An integrated circuit (IC) can be thought of as a very advanced, miniaturized electric circuit. Using transistors, resistors and capacitors as building blocks, one can implement the basic logical operations: not, and, or, etc. On a higher level, this allows the construction of complex circuits such as microprocessors or flash memories [42, 46].
Today, the world around us is full of integrated circuits and microprocessors. One can find them in computers, smartphones, televisions, cars and almost every modern electrical device [42]. But the need for more complex and powerful electronic devices is everlasting, with new computationally expensive areas such as computer simulations rising in importance. This motivates advances in photolithography, the main process of fabricating these integrated circuits (ICs).
1.1 photolithography

The fabrication of ICs is a multi-billion dollar industry that requires a tightly controlled production environment. Hundreds of these ICs are produced at the same time on a thin slice of silicon, called a (silicon) wafer, and they are later cut apart into single IC chips. The often complex and interconnected designs of the ICs are copied onto a silicon wafer in a process known as photolithography.

The steps of printing one layer of an IC onto the wafer are visualized in Figure 1.2 (compare [27, 42, 46]):

(1) Prepare wafer: Prior to use, the silicon wafer has to be cleaned chemically.

(2) Deposit barrier layer: In the next step, the wafer is covered with a thin barrier layer, which is usually silicon dioxide (SiO2).

(3) Application of photoresist: After this, the wafer is coated with a light-sensitive material called photoresist.

(4) Mask alignment and exposure to UV light: The mask carrying the complex pattern of the IC is carefully aligned and the whole wafer is exposed to high-intensity ultraviolet light. The photoresist is only exposed to the UV light in areas where the mask is transparent, and the pattern of the mask gets "copied" onto the photoresist.

(5) Development: After developing the photoresist (similar to the development of photographic films), it washes away in areas where it has been exposed to the UV light (or vice versa for negative photoresists), making the desired pattern visible on the wafer.

(6) Etching: Chemical etching is used to remove any barrier material (SiO2) not protected by the coating photoresist.

(7) Photoresist removal: In the last step, the photoresist is removed from the wafer, leaving just the barrier layer with the desired pattern.

Figure 1.2: Photolithographic process.

This process is repeated for each layer of the IC. The number of layers varies greatly, but usually lies between 20 and 40 [33]. Each layer is processed one after the other.

In order to produce a working IC, the mask as well as each layer needs to be aligned with high precision relative to the wafer and the underlying layers. Since the sizes of the structures on the wafer are in the magnitude of nanometers, this requires a complex process called metrology.
Metrology can also be used to extract information on the quality of the photolithographic process, by measuring metrology targets or gratings that were printed between the actual chips.
1.1.1 Metrology
Step (4) in the photolithographic process requires not only a careful and precise alignment of the mask, but also the correct focus for the exposure, both of which are highly non-trivial tasks performed by high-tech lithography systems. ASML is the largest supplier in the world of photolithographic systems for the semiconductor industry.

Because of that, small gratings between the chips on the wafer are included as test structures for quality control and high-precision alignment, as seen in Figure 1.3.
Figure 1.3: Close-up of a wafer. The wafer (large) contains several chips, each with a complicated structure (top right). Between the chips gratings have been printed (bottom right) for the purpose of quality control and alignment (source: [46]).
Since these gratings pass through the exact same production cycle as the actual chips, they show the same production biases or shortcomings. However, it is easier to use the gratings as metrology targets because of their simple periodic structure compared to the chip's complex architecture. The exact shape of the gratings contains information such as an incorrect focus (see Figure 1.4), over- or underexposure, etc.
Because of the small size of the gratings (in the magnitude of 100 nm), classical optical microscopy is not usable. In 1873, Abbe [1] found
Figure 1.4: Effect of focus on the gratings. While the central grating is a product of a wafer in focus, the other two gratings were the result of a lithographic process with an incorrect focus. They would both be considered as not reaching the quality standard and therefore be sorted out (source: [46]).
that for light with wavelength λ, the resolution of the resulting picture is at best

$$d = \frac{\lambda}{2 n \sin \Theta},$$

where $n$ is the refractive index of the medium being imaged in and $\Theta$ is the half-angle subtended by the optical objective lens. This means that the maximal resolution is bounded by $c \cdot \lambda$ with a $c$ close to 1. To increase the resolution, shorter wavelengths such as UV light and X-rays can be used.
Figure 1.5: Indirect grating measurement. Subfigure (a) depicts the process of indirect grating measurement using the scattering of light (source: [46]). A (simulated) output of the CCD is shown in subfigure (b) (source: [33]).
On the other hand, electron microscopy has its own drawbacks, such as being slow and potentially destructive [46]. Therefore, indirect measurements are preferred (see Figure 1.5a).
For the indirect grating measurement, light is directed (through filters and an optical system) at the gratings. Depending on the grating's geometry, the light is scattered in a certain way. Part of the scattered light is captured by a CCD. Figure 1.5 illustrates the method of indirect grating measurement and the light intensity measured by the CCD.
This, however, does not give direct access to the geometrical shape of the gratings. The actual interest of the metrology step is to find out the geometrical parameters p of the gratings.

For a trapezoidal grating, for example, three shape parameters $p_k$ can be used to describe the grating's geometry: the height $p_1$ of the gratings, the average width $p_2$ of a grating and the angle of the side wall $p_3$ (see Figure 1.6).

Figure 1.6: Shape parameters.
To extract the geometrical parameters p of the gratings, an inverse model of the scattering process is required:
• Forward problem - Scattering simulation: Given a certain shape p of the gratings, simulate the light intensities measured by the sensor I(p).
• Inverse problem - Profile reconstruction: Given a measured light intensity at the sensor $I_{CCD}$ (see Figure 1.5b), reconstruct the geometrical parameters p. This is done by computing

$$\min_p \|I_{CCD} - I(p)\|,$$

where $I(p)$ is the result of the forward problem given the parameter p. This minimization is realized using the Gauss-Newton algorithm, which requires the computation of the first-order derivatives. They are approximated using finite differences, requiring $O(n)$ computations of $I(p)$.

Since visible light is an electromagnetic wave with a wavelength between 400 and 700 nm, its diffraction is described by Maxwell's equations. Using the time-harmonic assumption $\tilde{E}(x, y, z, t) = E(x, y, z)\, e^{-i\omega t}$, the integral form of the equations is as follows (see [6, 18, 31]):
$$\oiint_{\partial\Omega} E \cdot ds = \frac{1}{\epsilon_0} \iiint_{\Omega} \rho \, dV \qquad \text{Gauss's law}$$

$$\oiint_{\partial\Omega} B \cdot ds = 0 \qquad \text{Gauss's law for magnetism}$$

$$\oint_{\partial\Sigma} E \cdot dl = -i\omega \iint_{\Sigma} B \cdot ds \qquad \text{Faraday's law}$$

$$\oint_{\partial\Sigma} B \cdot dl = \mu_0 \iint_{\Sigma} J \cdot ds - \mu_0 \epsilon_0\, i\omega \iint_{\Sigma} E \cdot ds \qquad \text{Ampère-Maxwell law}$$
where:
$E$: the electric field,
$B$: the magnetic field,
$\rho$: the electric charge density,
$\epsilon_0$: the vacuum permittivity or electric constant,
$\omega$: the frequency,
$J$: the electric current density,
$\Omega$: a fixed volume with boundary surface $\partial\Omega$,
$\Sigma$: a fixed open surface with boundary curve $\partial\Sigma$,
$\oint$: denotes a closed line integral,
$\oiint$: denotes a closed surface integral.

Solving the discretized Maxwell's equations in the case of light scattering at gratings requires the solution of a linear system Ax = b, which is the most expensive step of the forward problem. The exact structure and characteristics of this linear system are further described in Chapter 4.
To solve the inverse problem, multiple instances of the forward problem have to be solved in the optimization process. This further motivates the search for an efficient solution of the forward problem in general and the resulting linear system in particular.
1.2 other applications
Besides the mentioned application, Toeplitz, Toeplitz-like and multilevel Toeplitz systems arise in a variety of mathematics, scientific computing and engineering applications (see [10, 41]).
Some of these applications arise from the fact that a discrete convolution can be written as a matrix-vector product (MVP) between a Toeplitz matrix and a vector:
Lemma 1.2.1: Discrete Convolution
Let $h$ and $x$ be two vectors of size $m$ and $n$, respectively. The convolution $h * x$ can be computed by the MVP:

$$h * x = \begin{pmatrix}
h_1 & 0 & \cdots & 0 \\
h_2 & h_1 & \ddots & \vdots \\
h_3 & h_2 & \ddots & 0 \\
\vdots & h_3 & \ddots & h_1 \\
h_{m-1} & \vdots & \ddots & h_2 \\
h_m & h_{m-1} & \ddots & h_3 \\
0 & h_m & \ddots & \vdots \\
\vdots & \ddots & \ddots & h_{m-1} \\
0 & \cdots & 0 & h_m
\end{pmatrix} \cdot \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}$$
The two-dimensional case will result in a two-level Toeplitz matrix, also called a BTTB matrix.
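To make Lemma 1.2.1 concrete, the following is a minimal sketch (assuming NumPy and SciPy are available) that builds the $(m+n-1) \times n$ Toeplitz matrix above and checks it against the library convolution:

```python
# Sketch: the full convolution h * x as an MVP with a Toeplitz matrix.
import numpy as np
from scipy.linalg import toeplitz

h = np.array([1.0, 2.0, 3.0])          # size m = 3
x = np.array([4.0, 5.0, 6.0, 7.0])     # size n = 4

m, n = len(h), len(x)
# First column: h padded with zeros; first row: h_1 followed by zeros.
col = np.concatenate([h, np.zeros(n - 1)])
row = np.concatenate([h[:1], np.zeros(n - 1)])
T = toeplitz(col, row)                  # shape (m + n - 1, n)

assert np.allclose(T @ x, np.convolve(h, x))
```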
The problem of deblurring images is an example of such an application that stems from a discrete convolution.
1.2.1 Deblurring Images
A model for a blurred image b is usually written:
$$Ax = b, \qquad (1.1)$$

where $x$ is the original image, $A$ is the blurring matrix and $b$ is the blurred image (see [23]).
The blurring can be described by a point spread function (PSF), such as the one in Figure 1.7. A PSF describes the response of the optical system, e.g. the camera or lens, to a point source. Due to imperfections in the camera or lens system, the intensity of the point source will be spread over multiple pixels, and the image gets blurred.
Figure 1.7: Example for a PSF. A typical (Gaussian) point spread function (PSF) for a blurring problem.
The full blurring of an image with a given PSF is consequently described by a convolution of the image with the PSF. In the discrete case, this leads to the fact that the blurring matrix A has a two-level Toeplitz structure (in the two-dimensional case). Depending on the boundary conditions, Toeplitz-like matrices are also possible (such as block-circulant-circulant-block (BCCB) matrices or matrices with a Hankel structure [23, VIP 9]). The result of such a blurring model can be seen in Figure 1.8.
Figure 1.8: Blurring problem. Subfigure (a) shows the original image; subfigure (b) is the result of blurring with the PSF of Figure 1.7 (source of original image: http://sipi.usc.edu/database/database.php?volume=misc&image=12#top).
To extract the original (sharp) image, the BTTB system (1.1) has to be solved. For applications in the field of image processing and restoration and their relation to preconditioning, see also Kamm and Nagy [29], Koyuncu [32], Lin and Zhang [35].
1.2.2 Further Applications
Other applications that include a Toeplitz, Toeplitz-like or multi-level Toeplitz system occur in areas such as (see for example [10, 22, 23, 41, 45]):

• Numerical ordinary differential equations (ODEs) and partial differential equations (PDEs)
• Statistics
• Signal processing and filtering
• Control theory
• Stochastic automata and neural networks
LINEAR SYSTEMS

This chapter aims at introducing the basic definitions and methods regarding systems of linear equations. These concepts will be used in subsequent chapters.
A system of linear equations (or linear system) is given by
$$Ax = b, \qquad (2.1)$$

with the coefficient matrix $A \in \mathbb{C}^{n \times n}$, the right-hand side vector $b \in \mathbb{C}^n$ and the unknown solution $x \in \mathbb{C}^n$. This system has a unique solution if the coefficient matrix A is invertible (also called nonsingular). Consequently, the solution $x_*$ is then

$$x_* = A^{-1} b.$$
The solution can be calculated using direct methods such as Gaussian elimination. However, for larger systems this is computationally expensive and iterative solvers are preferred [2]. (An implementation of Gaussian elimination has a complexity of $O(n^3)$.)

2.1 iterative solvers
Given an initial guess $x_0$, an iterative solver computes a sequence of approximations $\{x_k\}$ of the true solution $x_*$ until the residual $r_k = Ax_k - b$ satisfies $\frac{\|r_k\|}{\|b\|} < \text{tol}$.
2.1.1 CG Method
The CG method was proposed in the 1950s for symmetric and positive definite coefficient matrices A [25]. It is the most prominent iterative solver for large sparse systems [52] and the basis for a lot of more advanced and specialized algorithms.

Solving (2.1) is equivalent to the following minimization problem (a proof for the equality can be found in [52]):

$$\min_x \phi(x) \stackrel{\text{def}}{=} \frac{1}{2} x^T A x - b^T x \qquad (2.2)$$

which means that if $\phi(x)$ becomes smaller with each iteration, we also get closer to the solution $x_*$.
$\phi(x)$ in (2.2) describes a quadratic function, where the exact form is described by the matrix A and the vector b (see Figure 2.1).
Figure 2.1: Minimization function. Example for a function φ(x) with the corresponding contour lines.
The gradient of the minimization problem is the residual of the linear system:

$$\nabla \phi(x) = Ax - b = r(x).$$

The solution $x_*$ of the minimization problem fulfills the necessary condition $\nabla\phi(x_*) = 0 \iff Ax_* = b$.

The basic idea of the CG method is that the minimum of $\phi(x)$ can be found by taking steps in the direction of the negative gradient at the current point, i.e.

$$x_{k+1} = x_k - \gamma \cdot \nabla\phi(x_k) = x_k - \gamma \cdot Ax_k + \gamma \cdot b.$$

The step size $\gamma$ can be chosen to minimize $\phi(x_{k+1})$ along the direction of the gradient.
In contrast to the gradient descent method (also known as the steepest descent method), the CG method does not directly use the gradient as the descent direction, but insists that each new descent direction is conjugate to all directions used before. (Two vectors $p_i$ and $p_k$ are conjugate with respect to A if $p_i^T A p_k = 0$.)

Figure 2.2 compares the convergence of the gradient descent method with the CG method. It is important to note that the form of the function $\phi$ - and therefore the matrix A - plays a vital role in the convergence speed of the methods (as illustrated with Figure 2.2b).
(a) Contour plot for the same function $\phi$ as in Figure 2.1 with the convergence of the gradient descent (blue) and the CG method (green). (b) Contour plot for a function $\phi$ resulting from a scaled identity matrix. Both methods would converge within a single step, regardless of the starting point.

Figure 2.2: Convergence of gradient descent and CG method for different functions $\phi$. The closer the contours resemble circles, the faster the methods converge.
2.1.1.1 Convergence Rate
Definition 2.1.1: Order of Convergence
Let a sequence $\{x_k\}$ converge to $x_*$, and let $e_k = x_k - x_*$ be the error at step k. If there exists a constant $C > 0$ such that for a $p > 0$:

$$\lim_{k \to \infty} \frac{\|e_{k+1}\|}{\|e_k\|^p} = C$$

is satisfied, then we say the order of convergence of $\{x_k\}$ is (at least) p. If $p = 1$ and $C < 1$, then the convergence rate is said to be linear.
The order of convergence allows a prediction of the number of iterations a solver needs until it reaches a satisfactory solution. A higher order of convergence is preferred, since it means less computation time. (A sequence $\{x_k\}$ converges to $x_*$ if $\lim_{k \to \infty} |x_k - x_*| = 0$.)

The performance of the CG method for solving a linear system depends on the condition number $\kappa(A)$ of its coefficient matrix A [41, 43]. For example:
Theorem 2.1.2: Convergence Rate of CG Method

If the coefficient matrix A of the linear system (2.1) is Hermitian positive definite with the condition number $\kappa(A)$, then the CG method converges in the following way:

$$\frac{\|e_k\|}{\|e_0\|} \leqslant 2 \left( \frac{\sqrt{\kappa(A)} - 1}{\sqrt{\kappa(A)} + 1} \right)^k$$

Here $e_k$ denotes the error vector $e_k = x_k - x_*$, where $x_k$ is the k-th iterate of the CG method and $x_*$ is the exact solution of the linear system.
This theorem implies linear convergence for the CG method. At the same time, the CG method converges faster for linear systems with a smaller condition number. More precise convergence rates can be stated if the distribution of the eigenvalues of A is known (see among others [41]). In the special case of clustered eigenvalues, we get:
Lemma 2.1.3: Convergence Rate for Clustered Spectrum
If the eigenvalues $\lambda_k$ of A are such that

$$0 < \delta \leqslant \lambda_1 \leqslant \dots \leqslant \lambda_i \leqslant 1 - \epsilon \leqslant \lambda_{i+1} \leqslant \dots \leqslant \lambda_{n-j} \leqslant 1 + \epsilon \leqslant \lambda_{n-j+1} \leqslant \dots \leqslant \lambda_n$$

for a $\delta > 0$ and $1 > \epsilon > 0$, then the CG method converges in the following way:

$$\frac{\|e_k\|}{\|e_0\|} \leqslant 2 \left( \frac{1 + \epsilon}{\delta} \right)^i \epsilon^{\,k-i-j}, \qquad k > i + j$$
This implies that a matrix A with eigenvalues that are tightly clustered around 1 and away from 0, with only as few exceptions as possible, is desirable. This relates to a function $\phi$ with almost circular contours (as shown in Figure 2.2b).
2.1.2 Other Methods
Besides the basic CG method, three other iterative solvers are important to this work. All three do not impose any restriction on the coefficient matrix A, such as symmetry, and are therefore viable methods for solving the linear system described in Chapter 4. (In general, no convergence rates are known for these advanced methods. In Chapter 12 we will compare different preconditioners by the number of MVPs they need until convergence.)

• The biconjugate gradient stabilized (BiCGSTAB) method is an improved version of the biconjugate gradient (BiCG) method, a generalization of the CG method for nonsymmetric coefficient matrices [53]. The method was described by van der Vorst [58].

• The induced dimension reduction (IDR) method was invented by Wesseling and Sonneveld [59].

• The generalized minimal residual (GMRES) method was proposed by Saad and Schultz [49]. The main drawback of this method is its relatively high storage requirements [3].
2.2 preconditioning
The goal of preconditioning is to transform the original linear system (i.e. (2.1)) into a different linear system that has the same solution $x_*$, but a better convergence rate.
This can be done by multiplying the whole linear system (2.1) with the inverse of a preconditioner P, thus getting a left-preconditioned system:

$$P^{-1}Ax = P^{-1}b \qquad (2.3)$$

Since the linear system was multiplied by $P^{-1}$ on both sides, the solution $x_*$ is still the same.

The right-preconditioned system:

$$AP^{-1}y = b \quad \text{with} \quad P^{-1}y = x \qquad (2.4)$$

has the same solution as well, which can be seen easily.
In order to have a better convergence rate, the preconditioner P should be chosen so that the new linear system can be solved easily. This in general means that the left-hand sides of the preconditioned linear systems ($P^{-1}A$ for (2.3) and $AP^{-1}$ for (2.4)) need to have a smaller condition number $\kappa$ than the original linear system [24].

For CG-like methods the distribution of the eigenvalues of $P^{-1}A$ or $AP^{-1}$ is also highly important, because it influences the convergence rate (see Section 2.1.1.1).

Note that in practice it is not necessary to compute the matrix-matrix product $P^{-1}A$ or $AP^{-1}$; instead, the iterative solvers use the preconditioner in each iteration in an MVP. This results in an extra MVP of a vector with $P^{-1}$ compared to the original system. The additional cost of the extra MVP in each iteration needs to be compensated by a faster convergence, i.e. fewer iterations.
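As a sketch of how a solver consumes a preconditioner in practice (here with SciPy's GMRES and an illustrative diagonal P; names and setup are assumptions, not the thesis's configuration):

```python
# Sketch: the preconditioner enters the solver as an operator applying
# P^{-1} to a vector in every iteration; P^{-1}A is never formed.
import numpy as np
from scipy.sparse.linalg import LinearOperator, gmres

n = 100
A = np.diag(np.linspace(1.0, 1000.0, n)) + 0.1 * np.random.rand(n, n)
b = np.random.rand(n)

d = np.diag(A)                                       # simple Jacobi choice P = diag(A)
M = LinearOperator((n, n), matvec=lambda v: v / d)   # applies P^{-1} to a vector

x, info = gmres(A, b, M=M)                           # preconditioned solve
```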
Loosely speaking, if P is a good approximation of A, the solvers will converge fast. If P = A, then the linear system can be solved in one step. However, the inversion of A in order to construct the preconditioner is equal to solving the problem and computationally expensive. Therefore, easily invertible approximations of A are interesting choices for P.
In conclusion, a perfect preconditioner should satisfy the following conditions:
C1  P should be a good approximation of A

C2  $P^{-1}$ should be easy to compute

C3  $P^{-1}$ should be easily applied to a vector, i.e. the MVP $P^{-1}x$ should be cheap to compute
While C1 influences the total number of iterations needed, C2 determines the initial computation cost of the solver (the inverse is calculated exactly one time at the beginning). Additionally, C3 decides the computational cost of each iteration.

In general, the closer P approximates A, the more complex it is to compute its inverse and an MVP with it. Figure 2.3 illustrates this trade off between a heavily preconditioned system (close to P = A) and a system without preconditioning (P = I). The blue line symbolizes the number of iterations needed to solve the system, the green line the time needed in each iteration and the black line is the product of both and represents the total time needed. The optimal preconditioner (marked with the dashed red line), which minimizes the total complexity of solving the system, depends on the complexity of the problem.

Figure 2.3: Preconditioner trade off.

TOEPLITZ SYSTEMS

Definition 3.0.1: Toeplitz Matrix
A matrix $T^{(n)} \in \mathbb{C}^{n \times n}$ with constant entries along each diagonal, i.e. of the form

$$T^{(n)} = \begin{pmatrix}
t_0 & t_{-1} & \cdots & t_{-n+2} & t_{-n+1} \\
t_1 & t_0 & t_{-1} & \ddots & t_{-n+2} \\
\vdots & t_1 & t_0 & \ddots & \vdots \\
t_{n-2} & \ddots & \ddots & \ddots & t_{-1} \\
t_{n-1} & t_{n-2} & \cdots & t_1 & t_0
\end{pmatrix},$$
is called a Toeplitz matrix.
Each entry $t_{i,j}$ of a Toeplitz matrix only depends on the difference of its indices i and j:

$$t_{i,j} = t_{i+1,j+1} =: t_{i-j} = t_k, \qquad k = -n+1, \dots, 0, \dots, n-1.$$

In contrast to general n × n matrices, a Toeplitz matrix is thus well defined by only $2n - 1$ (rather than $n^2$) entries $t_k$.
$$T^{(4)} = \begin{pmatrix} 9 & 1 & 3 & 6 \\ 2 & 9 & 1 & 3 \\ 4 & 2 & 9 & 1 \\ 1 & 4 & 2 & 9 \end{pmatrix}$$

is a Toeplitz matrix.
Definition 3.0.3: Toeplitz System
AToeplitz system, is a linear system
T x = b ,
where T is a Toeplitz matrix as defined in definition 3.0.1.
We can interpret $T^{(n)}$ as a principal submatrix of an $\infty \times \infty$ matrix $T^{(\infty)}$. (A principal submatrix is obtained by removing rows and columns with the same indices from a larger matrix.)

$$T^{(\infty)} = \begin{pmatrix}
\ddots & \vdots & \vdots & \vdots & \vdots & \vdots & \\
\cdots & t_0 & t_{-1} & t_{-2} & t_{-3} & t_{-4} & \cdots \\
\cdots & t_1 & t_0 & t_{-1} & t_{-2} & t_{-3} & \cdots \\
\cdots & t_2 & t_1 & t_0 & t_{-1} & t_{-2} & \cdots \\
\cdots & t_3 & t_2 & t_1 & t_0 & t_{-1} & \cdots \\
\cdots & t_4 & t_3 & t_2 & t_1 & t_0 & \cdots \\
 & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots
\end{pmatrix}$$
We can further assume that the diagonal coefficients $\{t_k\}_{k=-\infty}^{\infty}$ of this $T^{(\infty)}$ matrix are the Fourier coefficients of a function $f(x)$:

$$t_k = \frac{1}{2\pi} \int_{-\pi}^{\pi} f(x)\, e^{-ikx} \, dx \qquad (3.1)$$

hence:

$$f(x) = \sum_{k=-\infty}^{\infty} t_k\, e^{ikx} \qquad (3.2)$$

We call $f(x)$ the generating function of $T^{(\infty)}$, as well as of any principal submatrix $T^{(n)}$ (of size n × n). In the same way, $T^{(n)}[f]$ is the Toeplitz matrix $T^{(n)} \in \mathbb{C}^{n \times n}$ induced by the generating function $f(x)$.
In many practical problems, the generating function is usually given first, not the corresponding Toeplitz matrices [10, 41]. This is for example true for:
• Numerical differential equations, where the equation gives f • Filter design, where the transfer function gives f • Image restoration, where the blurring function gives f
In the main application of this thesis (see Section 1.1) this is also true. The coefficient matrix is a result of the grating's geometry and the refractive index of the grating's materials (see also Chapter 4).
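The passage from a generating function to its Toeplitz matrix is straightforward to sketch in code, approximating the Fourier coefficients (3.1) by an equispaced (DFT-type) quadrature (assuming NumPy/SciPy; the helper name is illustrative):

```python
# Sketch: assemble T^(n)[f] from a generating function f via (3.1).
import numpy as np
from scipy.linalg import toeplitz

def toeplitz_from_f(f, n, m=2048):
    x = np.linspace(-np.pi, np.pi, m, endpoint=False)
    k = np.arange(-(n - 1), n)                # k = -n+1, ..., n-1
    # t_k = (1/2pi) * int f(x) e^{-ikx} dx, approximated by the mean over x
    t = (f(x)[None, :] * np.exp(-1j * k[:, None] * x)).mean(axis=1)
    col = t[n - 1:]                           # t_0, t_1, ..., t_{n-1}
    row = t[n - 1::-1]                        # t_0, t_{-1}, ..., t_{-n+1}
    return toeplitz(col, row)

# f(x) = 2 - 2cos(x) generates the tridiagonal matrix with 2 on the
# diagonal and -1 on the off-diagonals:
T = toeplitz_from_f(lambda x: 2 - 2 * np.cos(x), 5)
```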
One of the special properties of Toeplitz matrices is that we can compute an MVP with $T^{(n)}$ in only $O(2n \log 2n)$ operations, using the Fast Fourier transform (FFT). This works by embedding the Toeplitz matrix in a circulant matrix of twice the size, and then computing the MVP (see Section 8.1 for how this is done).
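A sketch of this FFT-based MVP (the circulant machinery is spelled out in Section 8.1; code assumes NumPy/SciPy):

```python
# Sketch: O(n log n) Toeplitz MVP via circulant embedding and the FFT.
import numpy as np
from scipy.linalg import toeplitz

def toeplitz_matvec(col, row, x):
    """col: first column (t_0..t_{n-1}); row: first row (t_0..t_{-n+1})."""
    n = len(x)
    # First column of the 2n x 2n circulant embedding of T.
    c = np.concatenate([col, [0], row[:0:-1]])
    y = np.fft.ifft(np.fft.fft(c) * np.fft.fft(np.concatenate([x, np.zeros(n)])))
    return y[:n]                      # take .real for real data

col = np.array([2.0, -1.0, 0.0, 0.0])
row = np.array([2.0, -1.0, 0.0, 0.0])
x = np.random.rand(4)
assert np.allclose(toeplitz(col, row) @ x, toeplitz_matvec(col, row, x).real)
```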
Another important aspect of a Toeplitz matrix is that, while its inverse is not Toeplitz, it can be factorized by Toeplitz matrices, according to the Gohberg–Semencul formula [20].
Theorem 3.0.4: Gohberg–Semencul Formula
If the Toeplitz matrix $T \in \mathbb{R}^{n \times n}$ is such that each of the systems of equations

$$T x = e_1$$
$$T y = e_n$$

is solvable and the condition $x_1 \neq 0$ is fulfilled, then the matrix T is invertible, and its inverse is formed according to the formula

$$T^{-1} = x_1^{-1} \left( \mathrm{Lower}(x)\,\mathrm{Lower}(Jy)^T - \mathrm{Lower}(Z_0 y)\,\mathrm{Lower}(Z_0 J x)^T \right)$$

where $\mathrm{Lower}(x)$ denotes a lower triangular Toeplitz matrix with x as the first column, J is the anti-diagonal matrix with ones on the anti-diagonal and zeros everywhere else and $Z_0 = \mathrm{Lower}(e_2)$.
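The formula translates directly into code; the following is a sketch (assuming NumPy/SciPy, and based on the reconstruction of the formula above) that builds $T^{-1}$ from the two solves:

```python
# Sketch of Theorem 3.0.4 (Gohberg-Semencul) on a small test matrix.
import numpy as np
from scipy.linalg import toeplitz, solve

def lower(v):
    """Lower triangular Toeplitz matrix with v as first column."""
    return toeplitz(v, np.zeros_like(v))

n = 5
T = toeplitz([4.0, 1, 0.5, 0.25, 0.125])   # symmetric Toeplitz test matrix
e1, en = np.eye(n)[:, 0], np.eye(n)[:, -1]
x, y = solve(T, e1), solve(T, en)           # the two required solves

J = np.fliplr(np.eye(n))                    # anti-diagonal (flip) matrix
Z0 = lower(np.eye(n)[:, 1])                 # shift-down matrix, Lower(e_2)
Tinv = (lower(x) @ lower(J @ y).T - lower(Z0 @ y) @ lower(Z0 @ J @ x).T) / x[0]

assert np.allclose(Tinv, np.linalg.inv(T))
```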
3.1 multi-level toeplitz matrices
We can also define matrices that possess a Toeplitz structure on one or more levels of a matrix, therefore defining multi-level Toeplitz matrices. We can first define:
Definition 3.1.1: Toeplitz-Block Matrix
A block matrix $T_{(TB)}^{(m;n)} \in \mathbb{C}^{m \cdot n \times m \cdot n}$, where each of the m × m blocks is an n × n Toeplitz matrix, is called a Toeplitz-block matrix. It has the form:

$$T_{(TB)}^{(m;n)} = \begin{pmatrix}
T_{1,1;} & T_{1,2;} & \cdots & T_{1,m;} \\
T_{2,1;} & T_{2,2;} & \cdots & T_{2,m;} \\
\vdots & \vdots & \ddots & \vdots \\
T_{m,1;} & T_{m,2;} & \cdots & T_{m,m;}
\end{pmatrix}, \quad \text{where } T_{i,j;} \text{ is Toeplitz,}$$
not to be confused with: 22 toeplitz systems
Definition 3.1.2: Block-Toeplitz Matrix
A block matrix $T_{(BT)}^{(m;n)} \in \mathbb{C}^{m \cdot n \times m \cdot n}$ of the form:

$$T_{(BT)}^{(m;n)} = \begin{pmatrix}
A_{0;} & A_{-1;} & \cdots & A_{1-m;} \\
A_{1;} & A_{0;} & \cdots & A_{2-m;} \\
\vdots & \vdots & \ddots & \vdots \\
A_{m-1;} & A_{m-2;} & \cdots & A_{0;}
\end{pmatrix},$$

where $A_{k;}$ is arbitrary.
A combination of Definitions 3.1.1 and 3.1.2 is:
A block-Toeplitz matrix whose blocks $T_{k;}$ are themselves Toeplitz matrices is called a block-Toeplitz-Toeplitz-block (BTTB) matrix. (A BTTB matrix is sometimes also called a two-level Toeplitz matrix.)

$$T_{(BTTB)}^{(m;n)} = \begin{pmatrix}
T_{0;} & T_{-1;} & \cdots & T_{1-m;} \\
T_{1;} & T_{0;} & \cdots & T_{2-m;} \\
\vdots & \vdots & \ddots & \vdots \\
T_{m-1;} & T_{m-2;} & \cdots & T_{0;}
\end{pmatrix},$$

where $T_{k;}$ is Toeplitz.
Example 3.1.4: BTTB matrix

$$T_{(BTTB)}^{(4;3)} = \left(\begin{array}{ccc|ccc|ccc|ccc}
4 & 2 & 7 & 8 & 6 & 5 & 1 & 0 & 6 & 9 & 2 & 1 \\
6 & 4 & 2 & 3 & 8 & 6 & 3 & 1 & 0 & 3 & 9 & 2 \\
8 & 6 & 4 & 8 & 3 & 8 & 6 & 3 & 1 & 4 & 3 & 9 \\ \hline
7 & 3 & 1 & 4 & 2 & 7 & 8 & 6 & 5 & 1 & 0 & 6 \\
1 & 7 & 3 & 6 & 4 & 2 & 3 & 8 & 6 & 3 & 1 & 0 \\
9 & 1 & 7 & 8 & 6 & 4 & 8 & 3 & 8 & 6 & 3 & 1 \\ \hline
6 & 0 & 9 & 7 & 3 & 1 & 4 & 2 & 7 & 8 & 6 & 5 \\
3 & 6 & 0 & 1 & 7 & 3 & 6 & 4 & 2 & 3 & 8 & 6 \\
0 & 3 & 6 & 9 & 1 & 7 & 8 & 6 & 4 & 8 & 3 & 8 \\ \hline
4 & 3 & 2 & 6 & 0 & 9 & 7 & 3 & 1 & 4 & 2 & 7 \\
2 & 4 & 3 & 3 & 6 & 0 & 1 & 7 & 3 & 6 & 4 & 2 \\
3 & 2 & 4 & 0 & 3 & 6 & 9 & 1 & 7 & 8 & 6 & 4
\end{array}\right)$$
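A short sketch that assembles such a BTTB matrix from its defining coefficients $t_{k;l}$ (assuming NumPy/SciPy; the helper name is illustrative):

```python
# Sketch: build a BTTB matrix with m x m blocks of size n x n from its
# coefficients t_{k;l}, k = -m+1..m-1 (block level), l = -n+1..n-1.
import numpy as np
from scipy.linalg import toeplitz

def bttb(t, m, n):
    """t: (2m-1, 2n-1) array with t[k + m - 1, l + n - 1] = t_{k;l}."""
    blocks = [[None] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            coeffs = t[(i - j) + m - 1]       # block T_{i-j;}
            col = coeffs[n - 1:]              # t_{k;0}, ..., t_{k;n-1}
            row = coeffs[n - 1::-1]           # t_{k;0}, ..., t_{k;-n+1}
            blocks[i][j] = toeplitz(col, row)
    return np.block(blocks)
```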
A BTTB matrix corresponds to a generating function with two variables; the elements of a BTTB matrix are the Fourier coefficients of a bivariate function:

$$t_{k;l} = \frac{1}{(2\pi)^2} \int_{-\pi}^{\pi} \int_{-\pi}^{\pi} f(x, y)\, e^{-i(kx+ly)} \, dx \, dy$$

hence:

$$f(x, y) = \sum_{k=-\infty}^{\infty} \sum_{l=-\infty}^{\infty} t_{k;l}\, e^{i(kx+ly)}$$

instead of (3.1) and (3.2), respectively.
3.2 circulant matrices
A special kind of Toeplitz matrix is the circulant matrix, where each column is a circular shift of its preceding column:

$$C = \begin{pmatrix}
c_0 & c_1 & \cdots & c_{n-2} & c_{n-1} \\
c_{n-1} & c_0 & c_1 & \ddots & c_{n-2} \\
\vdots & c_{n-1} & c_0 & \ddots & \vdots \\
c_2 & \ddots & \ddots & \ddots & c_1 \\
c_1 & c_2 & \cdots & c_{n-1} & c_0
\end{pmatrix}$$

Block-circulant, circulant-block and BCCB matrices can be defined similarly to Definitions 3.1.1, 3.1.2 and 3.1.3.
A circulant matrix is fully defined by just n coefficients. In Chapter 8 further useful properties are described and used.
3.3 hankel
A related matrix type is the Hankel matrix. It is basically an upside-down Toeplitz matrix, which means that it has constant values along its anti-diagonals:

$$H = \begin{pmatrix}
h_{-n+1} & h_{-n+2} & \cdots & h_{-1} & h_0 \\
h_{-n+2} & & h_{-1} & h_0 & h_1 \\
\vdots & & & & \vdots \\
h_{-1} & h_0 & h_1 & & h_{n-2} \\
h_0 & h_1 & \cdots & h_{n-2} & h_{n-1}
\end{pmatrix}$$

We can again define block-Hankel, Hankel-block and block-Hankel-Hankel-block (BHHB) matrices, and even combinations of all the matrix types above, such as a block-Hankel-Toeplitz-block (BHTB) matrix.

Analogously to Toeplitz matrices, a Hankel matrix is described by $2n - 1$ coefficients.
PROBLEM DESCRIPTION

This chapter describes the full problem arising from the motivation illustrated in Chapter 1. Then two closely related and reduced problems are derived that will be the main topic of this work.
4.1 full problem
As described in Section 1.1.1, a numerical solution of Maxwell’s equations describing light scattering on a 2D-periodic structure (the gratings) requires the solution of the linear system
$$Ax = b, \qquad (4.1)$$

with $A \in \mathbb{C}^{n \times n}$ the coefficient matrix, $b \in \mathbb{C}^n$ the right-hand side and $x \in \mathbb{C}^n$ the unknown solution.
It is known that A can be decomposed via:
A = C − GM ,
where $C, G, M \in \mathbb{C}^{n \times n}$ are sparse matrices. (In a sparse matrix most elements are zero. If most elements are non-zero, the matrix is called dense.) Since the sparsity patterns of G and M are complementary, A is a dense matrix (see Figure 4.1). Note that the matrix M has an identical structure as C.
Figure 4.1: Sparsity patterns of the matrices C, G and M as well as the resulting matrix A.
The sparsity pattern of C is shown in more detail in Figure 4.2.
0
C1;
2000
C2;
4000
C3;
6000
. .
CNz;
... 0 2000 4000 6000
Figure 4.2: Sparsity pattern of C.
Each block $C_{l;}$ of C is of the form (a matrix of the form of $C_{l;}$ can be called a BTTB-block matrix):

$$C_{l;} = \begin{pmatrix}
C_{l;1,1;} & C_{l;1,2;} & C_{l;1,3;} \\
C_{l;2,1;} & C_{l;2,2;} & C_{l;2,3;} \\
C_{l;3,1;} & C_{l;3,2;} & C_{l;3,3;}
\end{pmatrix}, \qquad (4.2)$$
where each block $C_{l;i,j}$ is a BTTB matrix with the following generating function $F_{C_{l;i,j}}$:

$$F_{C_{l;i,j}}(x, y) = n_i(x, y)\, n_j(x, y) \left( \frac{\epsilon_b}{\epsilon(x, y)} - 1 \right) + \delta_{ij} \qquad \text{for } i, j \in \{x, y, z\},$$

where $\epsilon(x, y)$ are the piecewise constant material properties and $n_i(x, y)$, $n_j(x, y)$ are the normal-vector fields, which are piecewise constant for polygonal shapes.
Additionally, each Cl; is symmetric on a block level, i. e. Cl;i,j = Cl;j,i . 4.1 full problem 27
The structure of matrix C on each level is shown in color plots in Figure 4.3 and is summarized in Table 4.1.
Figure 4.3: Color plots of all levels of C. The image shows color plots of each level of C. On the top level, the block-diagonal structure can be seen easily. The next image shows the 3 × 3 symmetry that the matrix possesses on the next level. The bottom two levels each possess a Toeplitz structure, which is clearly visible in the color plots.
Table 4.1: Structure of C on each level, from highest (level Z) to lowest (level X).
    Level   Structure
    Z       Diagonal
            3 × 3 Symmetric
    Y       Toeplitz
    X       Toeplitz
Because C is sparse and an approximation of A, $C^{-1}$ is a good candidate for a preconditioner. In fact, preliminary investigations have shown that by choosing $C^{-1}$ as a preconditioner, the number of iterations of the solver can be drastically reduced in comparison to choosing no preconditioner at all (which is equivalent to choosing the identity I as a preconditioner), see Table 4.2.
However, computing the inverse $C^{-1}$, as well as an MVP with it, is quite expensive. Since the inverse of a BTTB matrix is not BTTB, the MVP with $C^{-1}$ has a cost of $O(n^2)$. This is computationally expensive in comparison to the cost of an MVP in the unpreconditioned system (here it is an MVP with C and therefore a cost of $O(n \log n)$).
Therefore, approximations of C with a cheaper inverse and MVP are wanted. Nevertheless, using C without any changes is still a possible preconditioner and an option for harder problems, see Section 2.2.
Table 4.2: Convergence rates (number of iterations) for (4.1) using IDR(6).

    Case       P = I     P = C
    Case 1a      732       225
    Case 1b      297       102
    Case 2a  > 99998       381
    Case 2b  > 99998      3288
    Case 3a      275       130
    Case 3b     1817       212
    Case 4     85135       310
So the full problem P1 can be defined as follows:

P1  Find a good preconditioner (fulfilling C1 to C3) for a system Cx = b, with C as defined in (4.2).

Note that our full problem only uses C as the coefficient matrix. In Section 11.4, we will discuss possible ways to include G and M into the preconditioner, for a better adaptation to the complete matrix A.

4.2 bttb-block system

Using C as a preconditioner requires computing the inverse $C^{-1}$. Since C is a block diagonal matrix, the inverse is given by

$$C^{-1} = \begin{pmatrix}
C_{1;}^{-1} & 0 & \cdots & \cdots & 0 \\
0 & C_{2;}^{-1} & \ddots & & \vdots \\
\vdots & \ddots & C_{3;}^{-1} & \ddots & \vdots \\
\vdots & & \ddots & \ddots & 0 \\
0 & \cdots & \cdots & 0 & C_{N_z;}^{-1}
\end{pmatrix}$$
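A sketch of how such a block-diagonal preconditioner is applied in practice: each block is factorized once, and the $N_z$ solves are independent (names and factorization choice are illustrative; assumes NumPy/SciPy):

```python
# Sketch: apply C^{-1} block by block, with each C_l prefactorized once.
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def blockdiag_inv_apply(factors, v, block_size):
    """factors: list of lu_factor(C_l) results; v: vector of length Nz*block_size."""
    out = np.empty(len(v), dtype=complex)
    for l, lu in enumerate(factors):
        s = slice(l * block_size, (l + 1) * block_size)
        out[s] = lu_solve(lu, v[s])      # independent solve per block
    return out
```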
This means that if we solve the following reduced problem P2, we will also solve the full problem.
P2  Find a good preconditioner (fulfilling C1 to C3) for a system

$$Dx = b,$$

where

$$D = \begin{pmatrix}
D_{1,1;} & D_{1,2;} & D_{1,3;} \\
D_{2,1;} & D_{2,2;} & D_{2,3;} \\
D_{3,1;} & D_{3,2;} & D_{3,3;}
\end{pmatrix}$$

and each $D_{i,j;} \in \mathbb{C}^{n \times n}$ is a BTTB matrix.

Furthermore, the matrix D is symmetric on the block level, i.e. $D_{i,j;} = D_{j,i;}$.
A solution to this reduced problem P2 can be used for other applications, such as the one described in Chapter 1.
4.3 bttb system
A simplification of the previous problem P2 is the following:

P3  Find a good preconditioner (fulfilling C1 to C3) for a system

$$Ex = b,$$

where $E \in \mathbb{C}^{n \times n}$ is a BTTB matrix with a piecewise constant generating function.
Solving this problem can be an important step on the path to solving the full problem P1 and can be used for different applications, some of which are mentioned in Section 1.2.
THESIS OVERVIEW

Chapter 1 introduces the main application which motivates the research subject of this thesis. The metrology of integrated circuits (ICs) requires the solution of a high-dimensional dense linear system. The time needed to solve such a system with an iterative solver can be reduced by applying a preconditioner to the linear system. Besides this main application, further applications of this thesis are described, such as the deblurring of images.
Chapter 2 gives an introduction into the mathematics of linear systems, iterative solvers and the field of preconditioning. Additionally, three conditions for a good preconditioner are derived.
In Chapter 3, the special structure of Toeplitz systems is described. The concept of an associated generating function is explained and special properties of Toeplitz systems are mentioned.
Chapter 4 focuses on the mathematical description of the main application mentioned in the first chapter. The special structure of the linear system is illustrated. Two linear systems with a simpler structure are introduced that will be considered as intermediate problems.
After this short outline of the thesis and its content, Chapter 6 will start off the part about different preconditioning techniques. In this chapter, a quick overview of the preconditioners described in the following is given.
The first preconditioner is described in Chapter 7. The complete matrix C is considered as a preconditioner and evaluated.
In Chapter 8, different circulant approximations of the Toeplitz structures are considered. This approximation is described in the context of Toeplitz, multi-level Toeplitz and Toeplitz-block matrices.
Chapter 9 considers T[1/f] as a preconditioner for T[f]. This preconditioner is then generalized to the Toeplitz-block and the BTTB-block case. A proof is provided that if F is HPD, the spectrum of $T[F^{-1}]T[F]$ is clustered. The chapter closes by proposing a regularization for cases where F is not HPD.
Chapter 10 describes the Kronecker product approximation that approximates a matrix by a sum of Kronecker products. Several options to adapt this method to the BTTB-block case are suggested. Additionally, the relationship between this method and the generating function is illustrated.
Chapter 11 collects several additional preconditioning ideas and describes them briefly.
Chapter 12 compares the performance of the suggested preconditioners for several test cases by comparing the required number of iterations until convergence.
Suggestions for future investigations are presented in Chapter 13. This includes aspects of different topics that could not be analyzed in the limited time available for this thesis.
Finally, Chapter 14 summarizes the main results of this work and presents conclusions.

Part II
PRECONDITIONERS
The following part describes various techniques for approximating (multi-level) Toeplitz matrices. Each approximation method will be explained and discussed in terms of its applicability for the presented case. If necessary, changes and generalizations are made to adapt it to the main application of this work. They are explained along with an examination of the complexities of each preconditioner, in terms of inversion and MVP with its inverse.
OVERVIEW OF THE PRECONDITIONING TECHNIQUES
This whole part will discuss and propose several preconditioners that can be applied to the full problem described in Section 4.1. The preconditioners in this part are proposed for general Toeplitz, BTTB and BTTB-block systems, to make sure the results obtained in this part can be used in general.
Table 6.1 illustrates the different preconditioning techniques and whether they can be applied to a simple Toeplitz matrix, a BTTB matrix and a BTTB-block matrix. The table shows the applicability of different preconditioning methods for Toeplitz, BTTB and BTTB-block systems. A black check mark denotes that the method has been used in literature before, while the symbol of a light bulb denotes that the application of this method was done in this work. A cross mark means that during this work no way of applying the method to this case was found, or it was not considered.
Table 6.1: Applicability of different preconditioning methods.

    Preconditioning Technique     Toeplitz matrix   BTTB matrix   BTTB-block matrix
    Circulant
    DST I - IV
    DCT I - IV
    Hartley
    Diagonal
    Banded with bandwidth > 2
    Inverse Generating Function
    Kronecker, s = 1                                               -
    Kronecker, s > 2                                               -
      with approximate SVD
    Koyuncu                                                        -
To simplify the notation in the subsequent chapters, $T_{(Toep)}$ will denote a simple Toeplitz matrix of size $N_x \times N_x$. $T_{(BTTB)}$ will denote a BTTB matrix, with $N_y \times N_y$ blocks, each of size $N_x \times N_x$. Additionally, $T_{(Block)}$ denotes a 3 × 3 block matrix, whose blocks are BTTB matrices, with $N_y \times N_y$ blocks, each of size $N_x \times N_x$.

FULL C PRECONDITIONER

The first and most obvious choice is to use the complete matrix C as a preconditioner (P = C). Compared to the subsequent preconditioners, which are based on C but are approximations of it, this choice is the best in terms of reducing iterations.
However, each iteration is computationally expensive, as is the inversion of C. Nevertheless, using the complete matrix C can be a good choice for hard problems and is additionally an interesting reference point for the subsequent preconditioners.
7.1 application to full problem
Since C is used as a full matrix, without any changes, this approach is directly applicable to the full problem.
7.1.1 Inversion
An exact inversion of a BTTB-block matrix can be done using the Gaussian elimination algorithm. However, the complexity is $O\left((N_x N_y)^3 N_z\right)$. So far, no exact inversion formulas with smaller complexity are known for BTTB-block matrices.
7.1.2 MVP
Since the inverse of a BTTB matrix is not BTTB, $C^{-1}$ does not possess any structure that can be used for an optimized MVP. This means that the complexity of the MVP with $C^{-1}$ is $O\left((N_x N_y)^2 N_z\right)$.
CIRCULANT APPROXIMATION
8.1 circulant approximation for toeplitz matrices
It is known that a Toeplitz matrix T can be approximated well by a circulant matrix C [10, 12, 14, 54, 57]. This relates to condition C1 of a good preconditioner.

Additionally, it is well known [10, 17] that any circulant matrix $C \in \mathbb{C}^{n \times n}$ can be diagonalized, such that

$$C = \left(F^{(n)}\right)^H \Lambda^{(n)} F^{(n)}, \qquad (8.1)$$

where $F^{(n)}$ is the Fourier matrix of order n, i.e.

$$F^{(n)}_{j,k} = \frac{1}{\sqrt{n}}\, e^{\frac{2\pi i j k}{n}},$$

and $\Lambda^{(n)} = \mathrm{diag}(F^{(n)} c)$, where c is the first column of C, is a diagonal matrix holding the eigenvalues of C. (diag(A) denotes a diagonal matrix where the diagonal is equal to the diagonal of A.)

Via this decomposition it is easy to compute the inverse of a circulant matrix as follows:
$$C^{-1} = \left(F^H \Lambda F\right)^{-1} = F^{-1} \Lambda^{-1} \left(F^H\right)^{-1} = F^H \Lambda^{-1} F.$$

Therefore, the inversion of an n × n circulant matrix can be done efficiently in $O(n \log n)$. This relates to condition C2 of the preconditioner. In the same way, the MVP can be computed in $O(n \log n)$, which satisfies condition C3:

$$C^{-1}x = F^H \Lambda^{-1} F x = F^H \mathrm{diag}(Fc)^{-1} F x.$$

(Since most FFT algorithms rely upon the factorization of n, the complexity increases if n is prime. However, specialized algorithms have been developed that guarantee a complexity of $O(n \log n)$ even when n is prime, e.g., [47].)

The mentioned properties of circulant matrices make them a suitable choice as preconditioners. There are different methods of approximating a Toeplitz matrix with a circulant one, which will be described in the following section.
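In code, both the inversion and the MVP with $C^{-1}$ reduce to FFTs; a minimal sketch in the first-column convention (assuming NumPy/SciPy):

```python
# Sketch of (8.1): applying C^{-1}x in O(n log n) via the FFT.
import numpy as np
from scipy.linalg import circulant

def circulant_solve(c, x):
    """c: first column of C. Returns C^{-1} x; fft(c) are the eigenvalues."""
    return np.fft.ifft(np.fft.fft(x) / np.fft.fft(c))

c = np.array([9.0, 1, 3, 2])
x = np.random.rand(4)
assert np.allclose(circulant(c) @ circulant_solve(c, x).real, x)
```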
8.1.1 Circulant Preconditioners
In this work, we will look at three different circulant preconditioners, each minimizing a certain norm:
•S trang’s preconditioner CS(T ) [54]: Minimizing kC − T k1 for C Hermitian and circulant, if T is Hermitian.
• T. Chan’s (optimal) preconditioner CC(T ) [14]: Minimizing ||C − T ||F.
•T yrtyshnikov’s (superoptimal) preconditioner CT (T )[57]: Min- −1 imizing ||I − C T ||F.
Each preconditioner will be discussed in the following sections.
8.1.1.1 Strang’s Preconditioner
Strang's preconditioner [54] essentially uses the first half of the first row of the original Hermitian Toeplitz matrix T and completes the rest of the first row with a flipped part of the first column, to make it circulant. In other words:
    C_S(T) = circ([ (T_{1,j})_{j=1}^{⌊n/2⌋+1} , transpose(flip((T_{i,1})_{i=2}^{n−⌊n/2⌋})) ]) ,

where C_S(T) denotes Strang's preconditioner for the Hermitian Toeplitz matrix T, and circ(x) the circulant matrix defined by x as its first row.
Example 8.1.1: Strang’s preconditioner
Take T from Example 3.0.2,

    T = [ 9 1 3 6
          2 9 1 3
          4 2 9 1
          1 4 2 9 ] ,

then

    C_S(T) = circ(9 1 3 2) = [ 9 1 3 2
                               2 9 1 3
                               3 2 9 1
                               1 3 2 9 ] .
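A small Python/NumPy sketch (names chosen here for illustration, not taken from the thesis) that assembles Strang's first row and reproduces this example:

    import numpy as np

    def strang_first_row(first_row, first_col):
        # Keep the first floor(n/2)+1 entries of the first row of T and
        # fill the remaining entries with a flipped part of the first column.
        n = len(first_row)
        head = first_row[: n // 2 + 1]
        tail = first_col[1 : n - n // 2][::-1]
        return np.concatenate([head, tail])

    row = strang_first_row(np.array([9, 1, 3, 6]), np.array([9, 2, 4, 1]))
    # row is [9, 1, 3, 2], matching C_S(T) above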
Incidentally, C_S(T) minimizes ||C − T||_1 and ||C − T||_∞ over all Hermitian circulant matrices C, for Hermitian matrices T [12]. It can be shown that C_S(T)^{-1} T has a clustered spectrum (see for example Chan and Jin [10]), thus resulting in a fast convergence of the preconditioned CG method.
8.1.1.2 T. Chan’s Optimal Preconditioner
Another very popular choice is Chan's optimal preconditioner C_C [14], which minimizes ||A − C||_F over all circulant matrices C for an arbitrary matrix A. (Chan's preconditioner is sometimes denoted c_F(T), highlighting its relation to the discrete Fourier transform.) Using the decomposition from (8.1), we get

    ||A − C||_F = ||A − F^H Λ F||_F = ||F A F^H − Λ||_F ,

which is minimized for Λ = diag(F A F^H) and therefore
    C_C(A) = F^H diag(F A F^H) F .
If A is a Toeplitz matrix, the entries of the first row of CC(A) can be given explicitly by
    c_i = (i a_{i−n} + (n − i) a_i) / n ,   i = 0, ... , n − 1 ,

which is equal to averaging the corresponding diagonals of A [10, Eq. (2.6)].
Important properties of C_C(A) are that it inherits the positive definiteness of A and that, again, the spectrum of C_C(T)^{-1} T is clustered [10, 57].
Example 8.1.2: Chan’s (optimal) preconditioner
Take again T from Example 3.0.2,

    T = [ 9 1 3 6
          2 9 1 3
          4 2 9 1
          1 4 2 9 ] ,

then

    C_C(T) = [ 9   1   3.5 3
               3   9   1   3.5
               3.5 3   9   1
               1   3.5 3   9 ] .
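A Python/NumPy sketch of the diagonal-averaging formula (names chosen here for illustration), reproducing this example:

    import numpy as np

    def chan_first_row(first_row, first_col):
        # c_i = (i * a_{i-n} + (n - i) * a_i) / n, i = 0, ..., n-1, where
        # a_i comes from the first row and a_{i-n} from the first column.
        n = len(first_row)
        c = np.zeros(n)
        c[0] = first_row[0]
        for i in range(1, n):
            c[i] = (i * first_col[n - i] + (n - i) * first_row[i]) / n
        return c

    row = chan_first_row(np.array([9, 1, 3, 6]), np.array([9, 2, 4, 1]))
    # row is [9, 1, 3.5, 3], matching C_C(T) above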
8.1.1.3 Tyrtyshnikov’s Superoptimal Preconditioner
Similar to Chan's preconditioner, we can try to minimize ||I − C^{-1} A||_F over all circulant matrices C for an arbitrary matrix A.
It can be shown [10] that such a preconditioner is related to Chan's preconditioner by
    C_T(A) = C_C(A A^H) C_C(A^H)^{-1} .
Example 8.1.3: Tyrtyshnikov’s (superoptimal) preconditioner
Take again T from Example 3.0.2,

    T = [ 9 1 3 6
          2 9 1 3
          4 2 9 1
          1 4 2 9 ] ,

then

    C_T(T) = [ 9.42   0.8142 3.3981 3.0040
               3.0040 9.42   0.8142 3.3981
               3.3981 3.0040 9.42   0.8142
               0.8142 3.3981 3.0040 9.42   ] .
Similar to the previous preconditioners, C_T(T)^{-1} T proves to have a clustered spectrum [10].
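A Python/NumPy sketch (illustrative names, not the thesis implementation) that evaluates C_T(A) through the relation above, working entirely with circulant eigenvalues; chan_eigenvalues extends Chan's construction to an arbitrary matrix by averaging each wrapped diagonal:

    import numpy as np

    def chan_eigenvalues(A):
        # Optimal circulant approximation of an arbitrary A: average every
        # wrapped diagonal (row - col = m mod n), then take the DFT.
        n = A.shape[0]
        c = np.array([np.mean([A[(k + m) % n, k] for k in range(n)])
                      for m in range(n)])
        return np.fft.fft(c)

    def tyrtyshnikov_eigenvalues(A):
        # C_T(A) = C_C(A A^H) C_C(A^H)^{-1}; circulant matrices share the
        # Fourier eigenvectors, so the product acts eigenvalue-wise.
        return chan_eigenvalues(A @ A.conj().T) / chan_eigenvalues(A.conj().T)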
8.2 circulant approximation for bttb matrices
The goal of this section is to generalize the method of circulant approximations to BTTB matrices. Each level will be handled separately, starting with the upper level, which is equivalent to a block-Toeplitz matrix.
A BTTB matrix can then be approximated with a BCCB matrix by applying the Toeplitz-block and the block-Toeplitz approximations consecutively. Chan and Jin [9] showed that approximating both levels separately with Chan's optimal preconditioner is equivalent to solving
    min_{C(BCCB)} ||T(BTTB) − C(BCCB)||_F ,

where C(BCCB) is BCCB.
8.2.1 Toeplitz-block Matrices
A Toeplitz-block matrix is of the form

    T(TB)^{(m,n)} = [ T_{1,1} T_{1,2} ... T_{1,m}
                      T_{2,1} T_{2,2} ... T_{2,m}
                       ...                ...
                      T_{m,1} T_{m,2} ... T_{m,m} ] ,

where each T_{i,j}, i, j = 1, 2, ... , m, is a Toeplitz matrix.
Therefore, a natural choice is to approximate each Toeplitz block T_{i,j} with its circulant approximation C(T_{i,j}) [15].
The resulting matrix is of the form

    C(T(TB)^{(m,n)}) = [ C(T_{1,1}) C(T_{1,2}) ... C(T_{1,m})
                         C(T_{2,1}) C(T_{2,2}) ... C(T_{2,m})
                          ...                      ...
                         C(T_{m,1}) C(T_{m,2}) ... C(T_{m,m}) ] .
Figure 8.1 shows an example of a Toeplitz-block matrix and its approximation with a circulant-block matrix.
(a) Color plot of a sample Toeplitz-block matrix with 6 × 6 blocks, each of size 4 × 4; the Toeplitz structure is clearly visible. (b) Color plot of the circulant approximation of the matrix in (a); the Toeplitz structure of each block has been approximated by a circulant one.
Figure 8.1: Color plots for a Toeplitz-block matrix and its circulant-block approximation.
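A sketch of the blockwise approximation in Python/NumPy (reusing the illustrative chan_first_row helper from Section 8.1.1.2; all names are assumptions, not thesis code):

    import numpy as np

    def circulant_block_approximation(T, m, n):
        # Replace every n x n block T_{i,j} of an (m n) x (m n) Toeplitz-block
        # matrix by its optimal circulant approximation.
        C = np.zeros_like(T, dtype=float)
        for i in range(m):
            for j in range(m):
                B = T[i*n:(i+1)*n, j*n:(j+1)*n]
                row = chan_first_row(B[0, :], B[:, 0])
                for k in range(n):
                    # row k of a circulant matrix is its first row shifted right by k
                    C[i*n+k, j*n:(j+1)*n] = np.roll(row, k)
        return C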
8.2.2 Block-Toeplitz Matrices
The case of a block-Toeplitz matrix can be handled identically, after transforming it into a Toeplitz-block matrix first.
A block-Toeplitz matrix has the form

    T(BT)^{(m,n)} = [ A_0     A_{−1}  ... A_{1−m}
                      A_1     A_0     ... A_{2−m}
                       ...                ...
                      A_{m−1} A_{m−2} ... A_0     ] ,

where the blocks A_k are arbitrary.
We can now define a permutation matrix P such that
    (T(TB)^{(n,m)})_{k,l;i,j} := (P T(BT)^{(m,n)} P^H)_{k,l;i,j} = (T(BT)^{(m,n)})_{i,j;k,l} .
That means that after applying the permutation
    T(TB) = P T(BT) P^H ,   (8.2)
a block-Toeplitz matrix T(BT) will become a Toeplitz-block matrix T(TB). Note that the number of blocks and the size of the blocks will be swapped between those two matrices.
This means that a suitable approximation for block-Toeplitz matrices T(BT) can be computed via the following steps (a sketch of the permutation follows the list):
1. Transformation: Transform the block-Toeplitz matrix into a Toeplitz-block matrix by applying the permutation defined in (8.2): T(TB) = P T(BT) P^H.
2. Toeplitz-block approximation: Apply the approximation C(T(TB)) as defined in Section 8.2.1.
3. Back transformation: Transform the result of the previous step back by applying C(T(BT)) = P^H C(T(TB)) P.
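As mentioned above, a Python/NumPy sketch of the permutation (names are illustrative): P is the perfect shuffle that exchanges the two index levels, so applying it as a symmetric row/column permutation turns T(BT) into T(TB):

    import numpy as np

    def shuffle_indices(m, n):
        # Maps position (i, k) -- block i, inner index k -- to (k, i).
        return np.arange(m * n).reshape(m, n).T.reshape(-1)

    def to_toeplitz_block(T_bt, m, n):
        # T_TB = P T_BT P^H realized as a symmetric index permutation.
        idx = shuffle_indices(m, n)
        return T_bt[np.ix_(idx, idx)]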
The result of such an approximation is illustrated in Figure 8.2.
(a) Color plot of a sample block-Toeplitz matrix with 4 × 4 blocks, each of size 6 × 6. (b) Color plot of the circulant approximation of the matrix in (a).
Figure 8.2: Color plots for a block-Toeplitz matrix and its block-circulant approximation.
8.3 application to bttb-block matrices
As stated in the previous sections, we can approximate each BTTB matrix by a BCCB one. The resulting matrix is therefore BCCB-block. As the following two sections will describe, no further approximation on the 3 × 3 symmetric level is required to achieve an efficient inversion and MVP in that case.
8.3.1 Inversion
To compute this, we can use the block-inversion formulas. Let A be a 3 × 3-block matrix where each block is a BCCB matrix of size N_x N_y × N_x N_y, i.e.,

    A = [ A_{1,1} A_{1,2} A_{1,3}
          A_{2,1} A_{2,2} A_{2,3}
          A_{3,1} A_{3,2} A_{3,3} ] ,

where each A_{i,j} is a BCCB matrix.
This 3 × 3 block matrix can be condensed into a 2 × 2 block matrix, to which the known block-inversion formula (see for example (2.8.25) in [5]) can be applied. (For the full problem discussed in Section 4.1, the matrix A described here is equal to one of the blocks of C.)

    A = [ W Q
          R S ] ,

where

    W = [ A_{1,1} A_{1,2}          Q = [ A_{1,3}
          A_{2,1} A_{2,2} ] ,            A_{2,3} ] ,

    R = [ A_{3,1} A_{3,2} ] ,      S = A_{3,3} .
Then by applying the block inversion formula, one gets
    A^{-1} = [ W Q ]^{-1}
             [ R S ]
           = [ W^{-1} + W^{-1} Q Z R W^{-1}    −W^{-1} Q Z ]
             [ −Z R W^{-1}                      Z          ] ,

where
    Z = (S − R W^{-1} Q)^{-1}
is the Schur complement.
The computation requires W^{-1}, which can be computed by applying the block inversion formula again, this time on W, thus
    W^{-1} = [ A_{1,1} A_{1,2} ]^{-1}
             [ A_{2,1} A_{2,2} ]
           = [ A_{1,1}^{-1} + A_{1,1}^{-1} A_{1,2} Z' A_{2,1} A_{1,1}^{-1}    −A_{1,1}^{-1} A_{1,2} Z' ]
             [ −Z' A_{2,1} A_{1,1}^{-1}                                        Z'                       ] ,
where
    Z' = (A_{2,2} − A_{2,1} A_{1,1}^{-1} A_{1,2})^{-1}
is the Schur complement.
Thus, computing A^{-1} requires the computation of Z and Z'. Since BCCB matrices form an algebra, Z and Z' are BCCB matrices as well, so their computation can be done efficiently in the preprocessing phase. (The fact that BCCB matrices form an algebra can be checked easily by using the decomposition via the Fourier transformation, see (8.1).)

The total computation of A^{-1} requires multiple multiplications, inversions and additions of BCCB matrices, so the total costs are O((N_x N_y) log(N_x N_y)).
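Because BCCB matrices are simultaneously diagonalized by the two-dimensional Fourier transform, the whole Schur-complement computation can be carried out on eigenvalue arrays, where every operation acts elementwise. A Python/NumPy sketch for the 2 × 2 case above (names are illustrative, not thesis code):

    import numpy as np

    def bccb_eigenvalues(first_col, Nx, Ny):
        # Eigenvalues of a BCCB matrix from its first column, via a 2D FFT
        # (the two-level analogue of (8.1)).
        return np.fft.fft2(first_col.reshape(Ny, Nx))

    def block_2x2_inverse(W, Q, R, S):
        # W, Q, R, S are eigenvalue arrays of BCCB blocks; since BCCB
        # matrices form an algebra, all operations are elementwise.
        Winv = 1.0 / W
        Z = 1.0 / (S - R * Winv * Q)              # inverted Schur complement
        return (Winv + Winv * Q * Z * R * Winv,   # (1,1) block of A^{-1}
                -Winv * Q * Z,                    # (1,2) block
                -Z * R * Winv,                    # (2,1) block
                Z)                                # (2,2) block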
8.3.2 MVP
As seen in the previous section, if A is a BCCB-block matrix, so is A^{-1}. The MVP with a BCCB matrix can be computed in O((N_x N_y) log(N_x N_y)).
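A sketch of such an MVP in Python/NumPy (illustrative naming; lam is the eigenvalue array of the BCCB matrix, e.g. as computed by the bccb_eigenvalues sketch above):

    import numpy as np

    def bccb_mvp(lam, x, Nx, Ny):
        # y = A x for a BCCB matrix A with eigenvalue array lam (shape Ny x Nx):
        # two 2D FFTs, i.e. O(Nx Ny log(Nx Ny)) operations.
        X = np.fft.fft2(x.reshape(Ny, Nx))
        return np.fft.ifft2(lam * X).reshape(-1)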
9 INVERSE GENERATING FUNCTION APPROACH

As described in Chapter 3, a Toeplitz matrix can be associated with a generating function (see (3.2)). It has been shown ([11], [34]) that under certain assumptions on f, T[1/f] is a good preconditioner for T[f].
The first section will describe the approach using the inverse generating function (IGF) for Toeplitz matrices. However, the method works almost identically for multi-level Toeplitz matrices, such as BTTB matrices. For the BTTB case, the main difference is that the generating function is bivariate.
The subsequent section will focus on the generalization to the Toeplitz-block and BTTB-block case. It also includes a proof that the eigenvalues of the preconditioned system are clustered in that case as well.
9.1 inverse generating function for toeplitz and bttb matrices
In many applications, a generating function is given and from it the corresponding Toeplitz or BTTB matrix is computed (see the upper arrow A in Figure 9.1). In the case of a BTTB matrix, the generating function is bivariate, as shown in Figure 9.1; the procedure, however, is almost identical whether a Toeplitz or a BTTB matrix is used. Computing the inverse of the two-level Toeplitz matrix generated by f(x, y) is expensive and therefore a step that should be avoided (see the downward arrow on the right side of Figure 9.1).

Instead, the inverse of the generating function can be computed (step B) and the Toeplitz matrix generated by 1/f can be computed. As shown by Chan and Ng [11] and Lin and Wang [34], this matrix is a good approximation of T[f]^{-1} if certain conditions are met. (Under the assumption that f ∈ C_2π is positive, Chan and Ng [11] have shown that the eigenvalues of T[1/f] T[f] are clustered around one, which (indirectly) satisfies condition C1 for a good preconditioner. Lin and Wang [34] showed the same for BTTB matrices.)

The IGF approach works in three steps, if the computation of T[f] from f is included. The steps are illustrated in Figure 9.1 and are

A  Building T[f] by computing the Fourier coefficients of f (see (3.1)).

B  Computing the inverse of f.
C  Computing the matrix generated by 1/f, i.e., T[1/f].
[Diagram: f(x, y) --(A: Fourier transformation)--> T[f(x, y)]; f(x, y) --(B: inversion)--> 1/f(x, y) --(C: Fourier transformation)--> T[1/f(x, y)] ≈ T[f(x, y)]^{-1}.]
Figure 9.1: Illustration of the inverse generating function approach (marked in red).
In the next sections, the required steps A to C of the IGF approach will be replaced by numerical alternatives (Ã to C̃). This is necessary in cases where the analytical way cannot be used or is too computationally expensive. In the cases relevant to this work, all three alternatives have to be applied in order to implement this approach (see Figure 9.4). Note that all the suggested alternatives can be computed efficiently with the use of the FFT.
9.1.1 Unknown Generating Function
In some cases, the generating function is not (explicitly) known, but only the (multilevel) Toeplitz matrix. In these cases, the starting point of the IGF method is the matrix itself. Therefore, the direction of step A has to be reversed (see Figure 9.2) and consequently the first step has to be

Ã  Approximate the (actual) generating function with an f̃, using the matrix elements.

This can be done in a variety of ways, the most straightforward being
    f̃ = Σ_{k=−(N_x−1)}^{N_x−1} t_k e^{ikx} ,   (9.1)
for an N_x × N_x (one-level) Toeplitz matrix. This approach can be easily generalized to BTTB and other multilevel Toeplitz matrices. Computing the approximation f̃ as in (9.1) is equivalent to
    f̃ = D_{N_x−1} ∗ f ,
where D_{N_x−1} denotes the Dirichlet kernel of order N_x − 1 and f the exact generating function [7, pp. 1011–1016].
[Diagram: T[f(x, y)] --(Ã: approximation)--> f̃(x, y); f̃(x, y) --(B: inversion)--> 1/f̃(x, y) --(C: Fourier transformation)--> T[1/f̃(x, y)] ≈ T[f(x, y)]^{-1}.]
Figure 9.2: Illustration of the inverse generating function approach for unknown generating functions, with the changes marked in red.
Alternative kernels K of the specific form f̃ = K ∗ f = Σ_{k=−(N_x−1)}^{N_x−1} b_k t_k e^{ikx}, for some coefficients b_k, such as the Fejér kernel [7, pp. 1016–1020], are possible choices as well and worth investigating in the future.
If the kernel used is of the form above, then f̃ can be evaluated on a grid using an FFT in O(N_x log N_x) operations.
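A Python/NumPy sketch of such an evaluation for the Dirichlet kernel (9.1) (illustrative names; the grid x_j = 2πj/(sN_x) − π anticipates the sampling used in Section 9.1.2):

    import numpy as np

    def sample_generating_function(t_pos, t_neg, s=2):
        # Evaluate f~(x_j) = sum_{k=-(n-1)}^{n-1} t_k e^{i k x_j} at
        # x_j = 2*pi*j/(s*n) - pi; the factor (-1)^k absorbs the shift by -pi.
        # t_pos = (t_0, ..., t_{n-1}), t_neg = (t_{-1}, ..., t_{-(n-1)}).
        n = len(t_pos)
        N = s * n
        a = np.zeros(N, dtype=complex)
        for k in range(n):
            a[k % N] += (-1) ** k * t_pos[k]
        for k in range(1, n):
            a[-k % N] += (-1) ** k * t_neg[k - 1]
        return N * np.fft.ifft(a)   # one FFT of length N = s * n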
9.1.2 Numerical Integration for Computing the Fourier Coefficients
In general, the Fourier coefficients of 1/f cannot be computed analytically, which is why a numerical alternative is described in this section (see Figure 9.3).
[Diagram: f(x, y) --(A: Fourier transformation)--> T[f(x, y)]; f(x, y) --(B: inversion)--> 1/f(x, y) --(C̃: numerical integration)--> T̃[1/f(x, y)] ≈ T[f(x, y)]^{-1}.]

Figure 9.3: Illustration of the inverse generating function approach with numerical integration (highlighted in red).
In order to get the elements of T[1/f], the following integral (in the simple one-level Toeplitz case) has to be computed:
    (1/2π) ∫_{−π}^{π} (1/f(x)) e^{−ikx} dx .
Instead of an analytical integration, this can be transformed, using the rectangular (also called mid-point) rule, into a numerical integration that requires only point evaluations of f(x), i.e.
    (1/(s N_x)) Σ_{j=0}^{s N_x − 1} (1 / f(2πj/(s N_x) − π)) e^{−ik(2πj/(s N_x) − π)} ,   (9.2)

where s is the sampling rate of the rectangular rule. In general, more accurate (and more complicated) methods of numerical integration could be applied, such as higher-order Newton–Cotes formulae [8, Sec. 4.3].
For the rectangular rule, the numerical integration step can again be computed using an FFT of order s N_x.
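A Python/NumPy sketch of (9.2) realized with one FFT (names are illustrative), returning the Toeplitz coefficients of T̃[1/f] from point values of f:

    import numpy as np

    def toeplitz_coefficients(f_vals, n):
        # f_vals holds f(x_j) on the grid x_j = 2*pi*j/N - pi, N = s*n.
        # The rectangular rule (9.2) for the coefficients of 1/f is a
        # scaled DFT; (-1)^k again accounts for the grid shift by -pi.
        N = len(f_vals)
        ghat = np.fft.fft(1.0 / f_vals) / N
        return {k: (-1.0) ** k * ghat[k % N] for k in range(-(n - 1), n)}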
9.1.3 Numerical Inversion of the Generating Function
Sometimes the generating function f is not given explicitly, but only allows for computationally efficient function evaluations. However, if we use both numerical alternatives from the previous sections, function evaluations are sufficient to compute T[1/f] (see Figure 9.4). Therefore, we can change (9.1) to
    f̃_grid(x̄_j) = Σ_{k=−(n−1)}^{n−1} t_k e^{ik x̄_j} ,

where x̄_j = 2πj/(sn) − π is one of the sampling points. Thus, 1/f̃_grid can be computed by inverting pointwise, and be used in (9.2).
[Diagram: T[f(x, y)] --(Ã: approximation)--> f̃(x, y); f̃(x, y) --(B̃: pointwise inversion)--> 1/f̃(x, y) --(C̃: numerical integration)--> T̃[1/f̃(x, y)] ≈ T[f(x, y)]^{-1}.]
Figure 9.4: Illustration of the inverse generating function approach using a sampled generating function.
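Combining the two previous sketches gives the fully numerical IGF pipeline of Figure 9.4, under the same illustrative naming (t_pos, t_neg and n are assumed to describe the given Toeplitz matrix):

    # From the coefficients t_k of T[f] to the coefficients of T~[1/f~]:
    f_vals = sample_generating_function(t_pos, t_neg, s=2)  # step A~ (sampling)
    t_inv = toeplitz_coefficients(f_vals, n)                # steps B~ and C~:
    # the pointwise inversion B~ happens inside via 1/f_vals, followed by
    # the rectangular-rule integration C~.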
9.1.4 Example
Figure 9.5 shows an example of the result of the IGF method. The top image shows the inverse of the original BTTB matrix, while the bottom figure shows the BTTB matrix generated by the inverse generating function. It is easily visible that in this case both matrices are quite similar.
9.1.5 Efficient Inversion and MVP
Besides being a good approximation, which was shown in [11] and [34], the preconditioner also needs to fulfill the conditions regarding the inversion and its MVP. Condition C2 is satisfied, since no inverse needs to be computed. As shown in the previous sections, all the steps of the inverse generating function method can be computed efficiently using FFTs.
The last condition C3 requires an efficient MVP with T[1/f], which is satisfied trivially since it has the same structure as T[f] and therefore the same costs in an MVP. For a BTTB matrix with N_y × N_y blocks, each of size N_x × N_x, the costs for an MVP are O(N_x N_y log(N_x N_y)).
Figure 9.5: Color plots for the inverse of the original BTTB matrix, T[f]^{-1}, the result of the inverse generating function method, T[1/f], and the difference between the two.
9.2 inverse generating function for bttb-block matrices
In this section, the IGF approach will be generalized to (multi-level) Toeplitz-block matrices. First, the general approach of the IGF method for block matrices will be described. Subsequently, a proof is provided showing that the eigenvalues of the preconditioned system are clustered around one. This proof is currently in preparation for publication [50]; it is similar to, and a generalization of, the ones provided by Chan and Ng [11] and Lin and Wang [34].
9.2.1 General Approach
Let us introduce a matrix-valued generating function

    F(s) = [ f^{11}(s) f^{12}(s) ... f^{1M}(s)
             f^{21}(s) f^{22}(s) ... f^{2M}(s)
              ...                    ...
             f^{M1}(s) f^{M2}(s) ... f^{MM}(s) ] ,   (9.3)

and associate it with the corresponding Toeplitz-block matrix generated by F(s),

    T[F(s)] = [ T[f^{11}(s)] T[f^{12}(s)] ... T[f^{1M}(s)]
                T[f^{21}(s)] T[f^{22}(s)] ... T[f^{2M}(s)]
                 ...                          ...
                T[f^{M1}(s)] T[f^{M2}(s)] ... T[f^{MM}(s)] ] .   (9.4)

(Note that in definition (9.4), T[F] denotes a Toeplitz-block matrix, while T[F] could also refer to a block-Toeplitz matrix, see for example [51, Eq. (1) and (2)]. However, both definitions are similar, since they are related by a permutation (see (8.2)). Therefore, the results described here also apply to standard block-Toeplitz matrices, and vice versa, the analysis of block-Toeplitz matrices can be used here.)

We can define the matrix-valued inverse generating function F(s)^{-1} (if the inverse exists) and use T[F^{-1}] as a preconditioner, analogously to before. This method is illustrated in Figure 9.6.

[Diagram: T[F(x, y)] --(Ã: approximation)--> F̃(x, y); F̃(x, y) --(B̂: matrix inversion)--> F̃(x, y)^{-1} --(C̃: numerical integration)--> T̃[F̃(x, y)^{-1}] ≈ T[F(x, y)]^{-1}.]
Figure 9.6: Illustration of the inverse generating function for Toeplitz-block matrices.
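For the block case, step B̂ becomes a pointwise inversion of small M × M matrices on the sampling grid; a Python/NumPy sketch (names are illustrative):

    import numpy as np

    def pointwise_matrix_inverse(F_samples):
        # F_samples has shape (N, M, M): one matrix F(s_j) per grid point.
        # np.linalg.inv inverts the whole batch at once; each entry of the
        # result then feeds the numerical integration of the respective block.
        return np.linalg.inv(F_samples)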
9.2.2 Preliminaries
In this section, smaller lemmas and their proofs are reproduced that are needed in Section 9.2.3, starting with some definitions that will simplify the nomenclature later on. If not stated otherwise, the 2-norm is meant when || · || is written.
Definition 9.2.1: min_s λ_min(F(s))
For any matrix-valued function F ∈ L^1, where F(s) = F(s)^H, we define
    min_s λ_min(F(s)) ≡ sup_y {y ∈ R : λ_1(F(s)) > y for a.e. s ∈ [−π, π]} ,
where λ_j(F(s)), j = 1, ... , n, are the eigenvalues of F(s) sorted in non-decreasing order.
Roughly speaking, this denotes the smallest eigenvalue of F over all s ∈ [−π, π].
Definition 9.2.2: max_s λ_max(F(s))
For any matrix-valued function F ∈ L^1, where F(s) = F(s)^H, we define
    max_s λ_max(F(s)) ≡ inf_y {y ∈ R : λ_n(F(s)) ≤ y for a.e. s ∈ [−π, π]} ,
where λ_j(F(s)), j = 1, ... , n, are the eigenvalues of F(s) sorted in non-decreasing order.
Roughly speaking, this denotes the largest eigenvalue of F over all s ∈ [−π, π].
The next two lemmas state the linearity of T[F] and the fact that the Hermitian structure is preserved under T[F]. Both lemmas can be checked easily by utilizing (3.1).
Lemma 9.2.3: Linearity of T[F ]
T[F ] is a linear mapping, such that T[a · A + b · B] = a · T[A] + b · T[B].
Lemma 9.2.4: Hermitian structure of T[F ]
Let F be a matrix-valued Hermitian function, i.e., f^{uv} = (f^{vu})^*. Then T[F] will be Hermitian, i.e.,
    T[f^{uv}]_w = (T[f^{vu}]_{−w})^* .
In the case of a scalar-valued function f, the well-known Grenander and Szegő theorem (see, for example, [10, p. 13]) provides much information on the distribution of the eigenvalues of T[f]. An extension for block-Toeplitz matrices (see [55]) provides a similar result for our case.
Lemma 9.2.5: Distribution of Eigenvalues
Let F be Hermitian and λ be an eigenvalue of T[F ]. Then it holds that
    min_s λ_min(F(s)) ≤ λ ≤ max_s λ_max(F(s)) .
Proof. We can directly use [39, Thm. 3.1] or [51, Sec. 2] in combination with the fact that the block-Toeplitz matrix in those papers can be transformed by a similarity transformation into a Toeplitz-block matrix (following definition (9.4)).
We will use the following result in Section 9.2.3 to show the cluster- ing of the eigenvalues of the preconditioned system, in particular, to quantify the impact of the small norm perturbation on the spectrum.
Lemma 9.2.6: Bauer–Fike theorem for HPD matrices
Let A be Hermitian positive definite (HPD) and let µ be an eigenvalue of A + E. Then there exists a λ, which is an eigenvalue of A, such that:
    |λ − µ| ≤ ||E|| .
Proof. For a proof see [48, pp. 59–60] in combination with [48, Thm. 1.8].
This section closes with some lemmas that will be used numerous times during the proof of the clustering of the eigenvalues.
Lemma 9.2.7: Sum of Hermitian matrices
If A and B are Hermitian, so is A + B.
Proof. (A + B)^H = A^H + B^H = A + B.
The next lemma is a general inequality between the spectral norm and the Frobenius norm.
Lemma 9.2.8: 2-Norm and Frobenius norm
    ||A||_2 ≤ ||A||_F
Proof. We can write the Frobenius norm as ||A||_F^2 = Σ_{j=1}^n ||A e_j||_2^2. At the same time, for an arbitrary x with ||x||_2 = 1, i.e. Σ_{j=1}^n |x_j|^2 = 1, we have

    ||A x||_2^2 = ||Σ_{j=1}^n x_j A e_j||_2^2 ≤ (Σ_{j=1}^n |x_j| ||A e_j||_2)^2
                ≤ Σ_{j=1}^n |x_j|^2 · Σ_{j=1}^n ||A e_j||_2^2 = Σ_{j=1}^n ||A e_j||_2^2 = ||A||_F^2 ,

where the first inequality is the triangle inequality and for the second one we used the Cauchy–Schwarz inequality. Since this is true for arbitrary x, it is also true for max_{||x||_2=1} ||A x||_2^2 = ||A||_2^2.
Lemma 9.2.9: Sub-multiplicativity
All induced matrix norms are sub-multiplicative, i.e. ||A B|| ≤ ||A|| ||B||.
Proof.
    ||A B|| = max_{||x||=1} ||(A B) x|| = max_{||x||=1} ||A (B x)||
            ≤ max_{||x||=1} ||A|| ||B x|| = ||A|| max_{||x||=1} ||B x|| = ||A|| ||B|| .
Lemma 9.2.10: Eigenvalues of the inverse
If the matrix A has the eigenvalue λ, then A^{-1} has the eigenvalue λ^{-1}.
Proof. If λ is an eigenvalue of A, then A v = λ v. If we multiply both sides with A^{-1}, we get A^{-1} A v = λ A^{-1} v, from which directly follows A^{-1} v = (1/λ) v.
Lemma 9.2.11: Rank
    rank(A B) ≤ min(rank(A), rank(B))
Proof. We can identify the matrix-vector product A x with the linear transform A(x). Then rank(A B) = rank(A(B(x))) ≤ rank(A). We can do the same to get rank(D C) ≤ rank(D); if we set C = A^T and D = B^T, we get rank(D C) = rank((A B)^T) ≤ rank(B^T) = rank(B).

We later need the fact that a similarity transformation A ↦ P^{-1} A P does not change the eigenvalues.
Lemma 9.2.12: Similarity transform
If λ is an eigenvalue of A, then it is also an eigenvalue of Ã = P^{-1} A P for any invertible matrix P.
Proof. Let A v = λ v; then, for ṽ = P^{-1} v, Ã ṽ = (P^{-1} A P) P^{-1} v = P^{-1} A v = λ P^{-1} v = λ ṽ.
9.2.3 Proof of Clustering of the Eigenvalues
We first show that the rank of T[P^{-1}] T[P] − I is bounded by 2KM if all entries of P(s) are trigonometric polynomials of degree K or smaller. We follow the proof of Lemma 2 from [11] and generalize it to the Toeplitz-block case.
Lemma 9.2.1
Let p^{uv}, 1 ≤ u, v ≤ M, be trigonometric polynomials of degree K in C_2π, i.e.,

    p^{uv}(s) = Σ_{k=−K}^{K} p̂_k^{uv} e^{iks} .

Define

    P = [ p^{11} p^{12} ... p^{1M}
          p^{21} p^{22} ... p^{2M}
           ...               ...
          p^{M1} p^{M2} ... p^{MM} ] ,

and assume its invertibility. Then for n > 2K, rank(T[P^{-1}] T[P] − I) ≤ 2KM, where T[P^{-1}], T[P] ∈ C^{nM×nM} and I denotes the identity matrix of appropriate size.
Proof. Let
    R(s) = P(s)^{-1} ,   (9.5)
with its entries
    r^{uv}(s) = Σ_{k=−∞}^{∞} r̂_k^{uv} e^{iks} .

Then (9.5) implies
R(s) P (s) = I .
Therefore
    Σ_{m=1}^{M} r^{um}(s) p^{mv}(s) = δ_{u−v} = δ_{u−v} Σ_{l=−∞}^{∞} δ_l e^{ils} ,   (9.6)
where δ_i is the Kronecker delta, which is 1 if i = 0 and 0 otherwise. On the other hand,
    Σ_{m=1}^{M} r^{um}(s) p^{mv}(s)
      = Σ_{m=1}^{M} ( Σ_{k'=−∞}^{∞} r̂_{k'}^{um} e^{ik's} ) ( Σ_{k=−K}^{K} p̂_k^{mv} e^{iks} )
      = Σ_{m=1}^{M} Σ_{k'=−∞}^{∞} Σ_{k=−K}^{K} r̂_{k'}^{um} p̂_k^{mv} e^{i(k'+k)s}
      = Σ_{m=1}^{M} Σ_{l=−∞}^{∞} Σ_{k=−K}^{K} r̂_{l−k}^{um} p̂_k^{mv} e^{ils}        [k' = l − k]
      = Σ_{l=−∞}^{∞} ( Σ_{m=1}^{M} Σ_{k=−K}^{K} r̂_{l−k}^{um} p̂_k^{mv} ) e^{ils} .   (9.7)

Comparing the coefficients of e^{ils} on the right-hand sides of (9.6) and (9.7), we see that
    Σ_{m=1}^{M} Σ_{k=−K}^{K} r̂_{l−k}^{um} p̂_k^{mv} = δ_{u−v} δ_l = { 1, if u = v and l = 0,
                                                                      0, otherwise.

Hence for n > 2K, the entries of T[P^{-1}] T[P] − I are all zeros except for entries in the first and last K columns of each Toeplitz block. Thus rank(T[P^{-1}] T[P] − I) ≤ 2KM.

We now slightly deviate from [11, Lem. 3] and instead follow [34, Lem. 2] to show that T[F^{-1}] − T^{-1}[F] can be written as the sum of a low-rank matrix G_n and a matrix H_n of small norm.
Lemma 9.2.2
Define F(s) as in (9.3), where each f^{uv} ∈ C_2π, 1 ≤ u, v ≤ M, and F_min > 0. Then for all ε > 0, there exist positive integers N and K such that for all n > N,
    T[F^{-1}] − T^{-1}[F] = G + H ,
where rank(G) ≤ 2KM and ||H|| < ε.
Proof. 1. Since f^{uv} ∈ C_2π, 1 ≤ u, v ≤ M, following the Weierstrass approximation theorem (see [16, pp. 4–6]), given ε > 0 there exists a trigonometric polynomial

    p^{uv}(s) = Σ_{k=−K}^{K} p̂_k^{uv} e^{iks} ,

such that
    ||f^{uv}(s) − p^{uv}(s)||_∞ ≡ max_s |f^{uv}(s) − p^{uv}(s)| ≤ ε ,   1 ≤ u, v ≤ M.
2. Since F and P are Hermitian, F − P is also Hermitian. We can use the linearity, the Hermitian structure and the distribution of the eigenvalues (see Lemma 9.2.5) to show that the matrices T[F] and T[P] can be made arbitrarily close:
    ||T[F] − T[P]|| = ||T[F − P]||
                    = max_k |λ_k(T[F − P])|
                    ≤ max { |min_s λ_min(F(s) − P(s))| , |max_s λ_max(F(s) − P(s))| } .
We first derive an upper bound for the second element of the maximum above
    |max_s λ_max(F(s) − P(s))|^2 ≤ max_{s,k} |λ_k(F(s) − P(s))|^2
                                 ≤ max_s ||F(s) − P(s)||_2^2
                                 ≤ max_s ||F(s) − P(s)||_F^2
                                 = max_s Σ_{u,v} |f^{uv}(s) − p^{uv}(s)|^2
                                 ≤ Σ_{u,v} max_s |f^{uv}(s) − p^{uv}(s)|^2
                                 ≤ M^2 ε^2 .
For the first element we get
    |min_s λ_min(F(s) − P(s))|^2 ≤ max_{s,k} |λ_k(F(s) − P(s))|^2 ≤ M^2 ε^2 ,
by using the same steps as for |max_s λ_max(F(s) − P(s))|^2. Thus ||T[F] − T[P]||^2 ≤ M^2 ε^2.

3. Since T[F] is invertible, T[P] is also invertible for a sufficiently small ε. We can also derive that P is HPD if F is HPD and ||F − P|| ≤ cε for some c. We can now write
    T[F^{-1}] − T^{-1}[F] = (T[F^{-1}] − T[P^{-1}]) + [(T[P^{-1}] − T^{-1}[P]) + (T^{-1}[P] − T^{-1}[F])] =: G + H ,
where
    G = T[P^{-1}] − T^{-1}[P] = (T[P^{-1}] T[P] − I) T^{-1}[P] ,
    H = (T[F^{-1}] − T[P^{-1}]) + (T^{-1}[P] − T^{-1}[F]) .
4. Since P consists of trigonometric polynomials of degree K, we can use Lemma 9.2.1 to show that G is low-rank