
TOPICS IN HIGH-DIMENSIONAL APPROXIMATION THEORY

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of the Ohio State University

By Yeonjong Shin, Graduate Program in Mathematics

The Ohio State University 2018

Dissertation Committee: Dongbin Xiu (Advisor), Ching-Shan Chou, Chuan Xue

© Copyright by Yeonjong Shin 2018

ABSTRACT

Several topics in high-dimensional approximation theory are discussed. The fundamental problem in approximation theory is to approximate an unknown function, called the target function, by using its data/samples/observations. Depending on the scenario of data collection, different approximation techniques need to be sought in order to utilize the data properly. This dissertation is concerned with four different approximation methods for four different data scenarios.

First, suppose the data collecting procedure is resource intensive, e.g., it may require expensive numerical simulations or experiments. As the performance of the approximation highly depends on the data set, one needs to carefully decide where to collect the data. We thus developed a method of designing a quasi-optimal point set which guides where to collect the data, before the actual data collection procedure. We showed that the resulting quasi-optimal set notably outperforms other standard choices.

Second, it is not always the case that we obtain exact data. Suppose the data is corrupted by unexpected external errors. These corruption errors can have large magnitude and in most cases are non-random. We proved that a well-known classical method, the least absolute deviation (LAD), can effectively eliminate the corruption errors. This is a first systematic and quantitative analysis of the robustness of LAD toward such corruption.

Third, suppose the amount of data is insufficient. This leads to an underdetermined system which admits infinitely many solutions. The common approach is to seek a sparse solution via the ℓ1-norm, following the work of [22, 24]. A key issue is to promote the sparsity using as few data as possible. We consider an alternative approach employing ℓ1−ℓ2 minimization, motivated by its contours. We improve the existing theoretical recovery results of ℓ1−ℓ2 and extend them to general non-sparse signals and to orthogonal polynomial approximation.

Lastly, suppose the data is sequentially or continuously collected. Due to its large volume, it could be very difficult to store and process all the data, which is a common challenge in big data. Thus a sequential function approximation method is presented. It is an iterative method which requires only vector operations (no matrices). It updates the current solution using only one sample at a time and does not require storage of the data set. Thus the method can handle sequentially collected data and gradually improve its accuracy. We establish sharp upper and lower error bounds and also derive the optimal sampling probability measure. We remark that this is the first work in this new direction.

ACKNOWLEDGMENTS

With the completion of this dissertation, one big chapter of my life as a student has finally come to an end. At the moment of writing my dissertation acknowledgments, I find myself somewhat speechless and unsure of where to begin. It has been a long, adventurous journey, and it would not have been possible without the help of many individuals.

First and foremost, I express my deepest gratitude to my Ph.D. thesis advisor, Professor

Dongbin Xiu. During my graduate years, he has been a kind and thoughtful mentor and an enthusiastic and insightful advisor. His scientific insight is exceptional and his ability to share it is second to none. I am truly grateful for the support and opportunities he has provided. He has not only taught me many invaluable academic lessons, but many important life lessons that I will treasure for a lifetime. It was a great pleasure to be his student and to work with him.

Secondly, I extend my gratitude to three of my undergraduate professors. Professor

Hyoung June Ko is at the Department of Mathematics, Yonsei University in South Korea.

He is the one who taught me how to study and understand mathematics and the underlying common concept flowing through all fields of mathematics. Also he has taught me what the true treasures in my life are. Professor Jeong-Hoon Kim is also at the Department of

Mathematics, Yonsei University. He is the one who helped me a lot in preparing for and applying to graduate schools. Without his help, I might not have had the opportunity to study abroad. Professor Yoon Mo Jung was at Yonsei University and is now at the Department of Mathematics, Sungkyunkwan University in South Korea. He was my undergraduate research advisor. It was he who guided me to the field of computational mathematics and scientific computing. He has provided so much realistic advice and so many valuable comments, and shared his experience of a path I have yet to tread. Also he is the one who introduced Professor Dongbin Xiu to me at Yonsei University, back in 2013.

Thirdly, I express my appreciation to my dissertation committee members, Professor

Ching-Shan Chou and Professor Chuan Xue, for sparing their precious time. Also I thank Dr. Jeongjin Lee, who recognized my mathematical talent and encouraged me to study more advanced mathematics in my teenage years.

Last but not the least, my utmost thanks and immense gratitude go to my family.

My parents, Kangseok Shin and Junghee Hwang in South Korea, and my older brother

Yeonsang Shin who is an architect in Tokyo, Japan, have always been there for me and believed in me in every step of my life. My parents-in-law, Jangwon Lee and Mihyung Yoo in South Korea, and my brother-in-law Daesop Lee in California, have helped me in various ways in my life in the United States. Altogether my entire family have supported me in all those years and made my life a much better and a much more comfortable one. Without their support and love, this would probably never have been written.

Above all I would like to thank my beloved wife, a great pianist, Yunjin Lee, who is a doctoral student at the University of Texas at Austin. Ever since we were undergraduates together at Yonsei University, she has been supportive and encouraging, and has taken care of me. From writing thoughtful cards, to listening to my thoughts, fears, and excitement, to doing everything in her power to make sure I had the strength I needed to face the many challenges along the way, she has always been there doing anything she could to lighten the load, and make me smile along the way.

VITA

1988 ...... Born in Seoul, South Korea

2013 ...... Bachelor of Science in Mathematics, Yonsei University

2013 ...... Bachelor of Arts in Economics, Yonsei University

Present ...... Graduate Research Associate, The Ohio State University

PUBLICATIONS

[7] Y. Shin, K. Wu and D. Xiu, Sequential function approximation using randomized samples, J. Comput. Phys., 2017 (submitted for publication).

[6] K. Wu, Y. Shin and D. Xiu, A randomized quadrature method for high dimensional approximation, SIAM J. Sci. Comput., 39(5), A1811-A1833 (2017).

[5] Y. Shin and D. Xiu, A randomized algorithm for multivariate function approximation, SIAM J. Sci. Comput., 39(3), A983-A1002 (2017).

[4] L. Yan, Y. Shin and D. Xiu, Sparse approximation using ℓ1-ℓ2 minimization and its applications to stochastic collocation, SIAM J. Sci. Comput., 39(1), A229-A254 (2017).

[3] Y. Shin and D. Xiu, Correcting data corruption errors for multivariate function approximation, SIAM J. Sci. Comput., 38(4), A2492-A2511 (2016).

[2] Y. Shin and D. Xiu, On a near optimal sampling strategy for least squares polynomial regression, J. Comput. Phys., 326, 931-946 (2016).

[1] Y. Shin and D. Xiu, Nonadaptive quasi-optimal points selection for least squares linear regression, SIAM J. Sci. Comput., 38(1), A385-A411 (2016).

FIELDS OF STUDY

Major : Mathematics

Specialization: Approximation theory

TABLE OF CONTENTS

Page

Abstract ...... ii

Acknowledgments ...... iv

Vita...... vi

List of Figures ...... xii

List of Tables ...... xvi

Chapters

1 Introduction ...... 1

1.1 Expensive data and Least squares ...... 3

1.2 Few data and Advanced sparse approximation ...... 6

1.3 Big data and Sequential approximation ...... 7

1.4 Corrupted data and Least absolute deviation ...... 10

1.5 Application: Uncertainty Quantification ...... 11

1.6 Objective and Outline ...... 13

2 Review: Approximation, regression and orthogonal polynomials .... 14

2.1 Function approximation ...... 14

2.2 Overdetermined linear system ...... 17

2.3 Underdetermined linear system ...... 19

2.3.1 Sparsity ...... 20

2.4 Orthogonal Polynomials ...... 21

2.5 Tensor Quadrature ...... 24

2.6 Christoffel function and the pluripotential equilibrium ...... 26

2.6.1 Pluripotential equilibrium measure on Bounded domains ...... 27

2.6.2 Pluripotential equilibrium measure on Unbounded domains . . . . . 27

2.7 Uncertainty Quantification ...... 30

2.7.1 Stochastic Galerkin method ...... 31

2.7.2 Stochastic collocation ...... 32

3 Optimal Sampling ...... 34

3.1 Quasi-optimal subset selection ...... 35

3.1.1 S-optimality ...... 39

3.2 Greedy algorithm and implementation ...... 41

3.2.1 Fast greedy algorithm without determinants ...... 43

3.3 Polynomial least squares via quasi-optimal subset ...... 44

3.4 Orthogonal polynomial least squares via quasi-optimal subset ...... 45

3.4.1 Asymptotic distribution of quasi-optimal points ...... 46

3.5 Near optimal subset: quasi-optimal subset for Christoffel least squares . . . 51

3.5.1 Asymptotic distribution of near optimal points ...... 54

3.6 Summary ...... 57

3.6.1 Quasi optimal sampling for polynomial least squares ....... 57

3.6.2 Near optimal sampling for Christoffel least squares ...... 58

4 Advanced Sparse Approximation ...... 59

4.1 Review on ℓ1-ℓ2 minimization ...... 59

4.2 Recovery properties of ℓ1-ℓ2 minimization ...... 61

4.3 Function approximation by Legendre polynomials via ℓ1-ℓ2 ...... 64

4.4 Function approximation by Legendre polynomials via a weighted ℓ1-ℓ2 . . . 68

4.5 Summary ...... 71

5 Sequential Approximation ...... 72

5.1 Randomized Kaczmarz algorithm ...... 72

5.2 Sequential function approximation ...... 73

5.3 Convergence and error analysis ...... 75

5.4 Randomized tensor quadrature approximation ...... 80

5.4.1 Computational aspects of discrete sampling ...... 83

6 Correcting Corruption Errors in Data ...... 89

6.1 Assumptions ...... 91

6.2 Auxiliary results ...... 92

6.2.1 Sparsity related results ...... 92

6.2.2 Probabilistic results of the samples ...... 93

6.3 Error Analysis ...... 94

7 Numerical examples ...... 102

7.1 Optimal sampling ...... 104

7.1.1 Matrix stability ...... 105

7.1.2 Function approximation: Quasi-optimal sampling ...... 108

7.1.3 Function approximation: Near-optimal sampling ...... 115

7.1.4 Stochastic collocation for SPDE ...... 117

7.2 Advanced Sparse Approximation ...... 122

7.2.1 Function approximation: Sparse functions ...... 122

7.2.2 Function approximation: Non-sparse functions ...... 128

7.2.3 Stochastic collocation: Elliptic problem ...... 131

7.3 Sequential Approximation ...... 134

7.3.1 Continuous sampling ...... 134

7.3.2 Discrete sampling on the tensor Gauss quadrature ...... 142

7.4 Correcting Data Corruption Errors ...... 156

Bibliography ...... 164

Appendices

A Proofs of Chapter 3 ...... 171

A.1 Proof of Theorem 3.2.1 ...... 171

A.2 Proof of Theorem 3.2.2 ...... 172

A.3 Proof of Theorem 3.4.1 ...... 173

A.4 Proof of Theorem 3.4.2 ...... 175

B Proofs of Chapter 4 ...... 179

B.1 Proof of Theorem 4.2.1 ...... 179

B.2 Proof of Theorem 4.2.2 ...... 182

B.3 Proof of Theorem 4.2.3 ...... 186

C Proofs of Chapter 5 ...... 190

C.1 Proof of Theorem 5.3.1 ...... 190

C.2 Proof of Theorem 5.3.3 ...... 194

D Proofs of Chapter 6 ...... 196

D.1 Proof of Theorem 6.2.1 ...... 196

D.2 Proofs for the probabilistic results of the samples ...... 198

D.2.1 Proof of Theorem 6.2.2 ...... 198

D.2.2 Proof of Theorem 6.2.3 ...... 199

LIST OF FIGURES

Figure Page

3.1 Empirical distribution of the quasi-optimal set in 1D for determined system

by four different types of orthogonal polynomials ...... 49

3.2 Empirical distribution of the quasi-optimal set in 1D for over-determined

system by Legendre and Chebyshev ...... 50

3.3 Empirical distribution of the quasi-optimal set in 1D for over-determined

system by Jacobi(1,1) and Jacobi(0,2) ...... 50

3.4 Convergence of the empirical measures of the quasi-optimal sets ...... 51

3.5 Empirical distribution of the near optimal set for over-determined system . 55

7.1 Condition numbers of the optimal model matrix by Legendre at d = 2 . . . 106

7.2 Condition numbers of the optimal model matrix by Legendre at d = 4 . . . 106

7.3 Condition numbers of the optimal model matrix by Hermite at d = 2 . . . . 107

7.4 Condition numbers of the optimal model matrix by Laguerre at d = 2 . . . 107

7.5 Errors of quasi-optimal approximation results by Legendre at d = 2 . . . . . 109

7.6 Errors of quasi-optimal approximation for the Franke function by Legendre at d = 2 . 111

7.7 Results of quasi-optimal approximation for the Franke function by Legendre at d = 2 . 111

7.8 Errors of quasi-optimal approximation results by monomials at d = 2 . . . . 112

7.9 Errors of quasi-optimal approximation for the Franke function by monomials at d = 2 . 113

7.10 Results of quasi-optimal approximation for the Franke function by monomials at d = 2 114

7.11 Quasi-optimal function approximation results by Legendre at d = 5 . . . . . 114

7.12 Near-optimal function approximation results by Legendre at d = 2 . . . . . 116

7.13 Near-optimal function approximation results for the Franke function by Legendre at d = 2 ...... 117

7.14 Near-optimal function approximation results by Legendre at d = 4 . . . . . 118

7.15 Near-optimal function approximation results by Hermite at d = 3 ...... 119

7.16 Convergence of optimal SPDE approximation at d = 10 ...... 120

7.17 Errors of optimal SPDE approximation at d = 10 ...... 121

7.18 ℓ1-ℓ2 approx. recovery rates for a sparse function w.r.t. the number of samples in d = 1 ...... 124

7.19 ℓ1-ℓ2 approx. errors for a sparse function w.r.t. the number of samples in d = 1 ...... 124

7.20 ℓ1-ℓ2 approx. recovery rates for a sparse function w.r.t. the sparsity in d = 1 . 125

7.21 ℓ1-ℓ2 approx. errors for a sparse function w.r.t. the sparsity in d = 1 . . . . 125

7.22 ℓ1-ℓ2 approx. recovery rates for a sparse function w.r.t. the number of samples in d = 3 ...... 126

7.23 ℓ1-ℓ2 approx. errors for a sparse function w.r.t. the number of samples in d = 3 ...... 127

7.24 ℓ1-ℓ2 approx. recovery rates and errors for a sparse function w.r.t. the sparsity in d = 3 ...... 128

7.25 ℓ1-ℓ2 approx. errors for a sparse function w.r.t. the number of samples in d = 10 at s = 10 ...... 129

7.26 ℓ1-ℓ2 approx. errors for a sparse function w.r.t. the number of samples in d = 10 at s = 20 ...... 129

7.27 ℓ1-ℓ2 approx. errors for Runge function w.r.t. the number of samples at d = 1 . 130

7.28 ℓ1-ℓ2 approx. errors for non-sparse functions w.r.t. the number of samples at d = 2 ...... 131

7.29 Stochastic Collocation by ℓ1-ℓ2 approximation at d = 3 ...... 132

7.30 Stochastic Collocation by ℓ1-ℓ2 approximation at d = 10 ...... 133

7.31 Sequential approximation via RBF basis at d = 2 ...... 136

7.32 Sequential approximation via Legendre at d = 2 ...... 137

7.33 Sequential approximation via Legendre at d = 10 ...... 139

7.34 Approximation errors in the converged solutions in d = 10 ...... 139

7.35 Error convergence versus the iteration count in d = 20, 40 ...... 141

7.36 Optimal sampling sequential approximation at d = 2 ...... 142

7.37 Optimal sampling sequential approximation at d = 40 ...... 142

7.38 Sequential approximation on tensor quadrature grid by Legendre at d = 2 . 144

7.39 Sequential approximation on tensor quadrature grid by Hermite at d = 2 . . 145

7.40 Sequential approximation on tensor quadrature grid by trigonometric polynomials at d = 2 ...... 147

7.41 Sequential approximation on tensor quadrature grid by Legendre at d = 10 148

7.42 Numerical and theoretical errors of sequential approximation on tensor quadra-

ture grid by Legendre at d = 10 ...... 149

7.43 Numerical and theoretical errors of sequential approximation on tensor quadra-

ture grid by Legendre at d = 40 ...... 150

7.44 Sequential approximation on tensor quadrature grid by Legendre at d = 40

via a non-optimal sampling ...... 151

7.45 Sequential approximation on tensor quadrature grid by Hermite at d = 10 . 152

7.46 Sequential approximation on tensor quadrature grid by Hermite at d = 40 . 152

7.47 Numerical and theoretical errors of sequential approximation on tensor quadra-

ture grid by Legendre at d = 100 ...... 153

7.48 Sequential approximation on tensor quadrature grid by Hermite at d = 100 154

7.49 Sequential approximation on tensor quadrature grid at d = 500 ...... 155

7.50 Correcting data corruption using LAD by Radial basis for Franke function

in d =2...... 158

7.51 Correcting data corruption using LAD by Legendre basis for Gaussian and

continuous functions in d =4...... 159

7.52 Correcting data corruption using LAD by Legendre basis for corner peak and

product peak functions in d = 4 ...... 159

7.53 Correcting data corruption using LAD by Legendre basis for Gaussian in

d = 4 with different corruption levels ...... 160

7.54 Correcting data corruption using LAD by Legendre basis for product peak

in d = 4 with different corruption levels ...... 161

7.55 Correcting data corruption using LAD by Legendre basis for Gaussian in

d =10...... 162

7.56 Correcting data corruption using LAD by Legendre basis for corner peak in

d =10...... 163

LIST OF TABLES

Table Page

7.1 Convergence rate of sequential approximation at d = 2 ...... 137

7.2 The number of iterations used to reach the converged solutions with the

cardinality of the polynomial space at d = 10 ...... 140

7.3 Convergence rate of sequential approximation at d = 10 ...... 140

7.4 The number of iterations used to reach the converged solutions with the

cardinality of the polynomial space at d = 20 and d = 40 ...... 141

CHAPTER 1

INTRODUCTION

Modern approximation theory is concerned with approximating an unknown target function using its data/samples/observations. Modern problems in approximation are driven by applications in a large number of fields, including biology, medicine, imaging, engineering, and uncertainty quantification, just to name a few. More importantly, these problems are increasingly being formulated in very high dimensions.

What makes high dimensions so special in approximation theory? Let us begin by briefly answering this question. In theory (under mild assumptions), there exists a best approximation in any given space of approximants. Such an approximation is obtained by orthogonally projecting the target function onto the linear space of approximants. The best approximation is determined by the orthogonal projection coefficients, or the best coefficients. However, evaluating such coefficients requires the computation of integrals, and computing such integrals is a challenging task. Thus in practice, the best coefficients are unknown. Therefore, many efforts have been devoted to finding an estimate or an approximation of the best coefficients. One straightforward approach is to approximate the integrals using a quadrature rule (a cubature rule in the multivariate case). A quadrature rule approximates an integral by a weighted sum of function values at prescribed nodes. This approach (if applicable) produces near best approximations. Thus its solution is often used as a reference solution. In one dimension, Gaussian quadrature is one of the most commonly used rules and is optimal [91], i.e., it requires the minimal number of data for a fixed accuracy. In high dimensions, the most straightforward way is to extend one-dimensional quadrature rules to the high-dimensional space by tensor construction. However, the number of data needed for such a rule grows exponentially with dimension. This becomes too large in very high dimensions and makes the approximation problem intractable. For extensive reviews and collections of cubature rules, see, for example, [32, 50, 51, 89, 98, 44, 45].

Therefore, it is the high dimensionality that makes the approximation challenging, as the volume of data and/or the complexity of the approximation grows rapidly with dimension. This is the so-called curse of dimensionality. Hence other, tractable high-dimensional approximation methods are needed.

This dissertation is concerned with the fundamental problem of function approximation in high dimensions. Let f(x) be an unknown function defined on a domain D ⊂ R^d, d ≥ 1. We are interested in approximating f(x) using its sample values f(x_i), i = 1, . . . , m.

Remark 1.0.1. In this dissertation, we employ the linear regression model for approximation. Let φ_i(x), i = 1, . . . , p, be a set of basis functions in D. We then seek an approximation in the form of a linear combination of the basis functions

$$\tilde{f}(x) = \sum_{i=1}^{p} c_i \phi_i(x).$$

Let y = [f(x_1), . . . , f(x_m)]^T ∈ R^m be the data vector, c = [c_1, . . . , c_p]^T ∈ R^p be the coefficient vector, and A = (a_ij) = (φ_j(x_i)), 1 ≤ i ≤ m, 1 ≤ j ≤ p, be the model matrix. In the oversampled case, m > p, the standard approach seeks to find an approximation f̃ whose coefficients minimize the error, i.e.,

$$\min_{c} \|y - Ac\|. \tag{1.1}$$

When the vector 2-norm is used, this results in the well-known least squares (LSQ) method, whose literature is too large to mention here. When the vector 1-norm is used, this results in the least absolute deviations (LAD) method. The LAD method has been studied rather extensively, dating back to the early work of [1, 6, 73, 87, 11, 68]. Since the reduction of LAD to linear programming was established ([26]), the computation of LAD has become a straightforward procedure. Many of the statistical properties of LAD have also been studied, cf. [7, 76, 74]. Also, see [12, 75] and the references therein. Despite these advances, the LSQ method has remained the more popular choice in practical regression, largely due to its convenient inference procedure and its connection to the analysis of variance (ANOVA) (cf. [81, 42, 40, 33]).

In the undersampled case, m < p, the common approach seeks to find a sparse approximation f̃ whose coefficients are sparse. Typically, it is obtained by solving a constrained optimization problem, i.e.,

$$\min_{c} F(c), \quad \text{subject to } Ac = y, \tag{1.2}$$

where F is an objective function measuring sparsity. Among other choices of F, one of the most common is the ℓ1-norm, F(c) = ‖c‖_1, following the remarkable success of compressive sensing (CS). Although sparsity is naturally measured by the ℓ0 norm, the ℓ1 norm is widely adopted in practice, as its use leads to convex optimization. Following the early fundamental work on CS, cf. [24, 22, 34], a large amount of literature on ℓ1 minimization has emerged. We shall not conduct a review of the CS literature here, as this dissertation does not contain any ℓ1-norm approach. The objective of this dissertation is to introduce four different high-dimensional function approximation methods developed by the author in [84, 85, 103, 86, 96, 82, 83].

Each method is designed for a specific data scenario. In the rest of this chapter, we describe these data-related scenarios and their corresponding methods, along with literature reviews.

1.1 Expensive data and Least squares

Linear regression is one of the most widely used approximation models for estimation, prediction, and learning of complex systems. The least squares method (LSQ), dating back to Gauss and Legendre, is the simplest and thus the most common approach in linear regression. Countless works have been devoted to the least squares method, and the amount of literature is too large to mention.

The performance of LSQ obviously depends on the sample points. The choices of the

samples typically follow two distinct approaches: random samples and deterministic samples. In random sampling one draws the samples from a probability measure, which is often defined by the underlying problem, whereas in deterministic sampling the samples follow certain fixed and deterministic rules. The rules aim to fill up the space, in which the samples are allowed to take place, in a systematic manner. Examples include quasi Monte Carlo methods, lattice rules, orthogonal arrays, etc. The study of these generally falls into the topic of design of experiments (DOE), see, for example, [3, 5, 14, 47, 49, 60, 77, 95], and the references therein.

Here we address a rather different question regarding the choice of sample points. Let Θ_M be a set of candidate sample points, where M > 1 is the total number of points. We assume Θ_M is a large and dense set with M ≫ 1 such that an accurate regression result can be obtained. For an accurate result, however, one is required to collect a large number (M ≫ 1) of sample data. This can be resource intensive if the data collection procedure requires expensive numerical simulations or experimentations. Let m be a number 1 ≤ m < M, denoting the number of samples one can afford to collect. We then seek to find a subset Θ_m ⊂ Θ_M such that the LSQ result based on Θ_m is as close as possible to that based on Θ_M. However, theoretically, it is impossible to obtain the optimal subset without having all the data information on Θ_M.

This dissertation introduces a non-adaptive quasi-optimal subset, originally proposed in [84], which seeks to make the LSQ result on Θ_m close to that on Θ_M without knowledge of the data. The determination of the quasi-optimal subset is based on the optimization of a quantity that measures the determinant and the column space of the model matrix. The non-adaptive feature of the quasi-optimal subset is useful for practical problems, as it allows one to determine, prior to conducting expensive simulations or experimentations, where to collect the data samples. Moreover, a greedy algorithm is presented to compute the quasi-optimal set efficiently.
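To illustrate the flavor of such a greedy selection (the precise S-optimality criterion and the fast determinant-free algorithm are developed in Chapter 3), the sketch below grows a subset by repeatedly adding the candidate row that most increases a determinant-based score of the selected model matrix. The function name, initialization, and score are illustrative assumptions, not the algorithm of [84].

```python
import numpy as np
import scipy.linalg

def greedy_subset(A, m):
    """Greedily pick m rows of the M x p candidate model matrix A (m >= p).
    Score: determinant of the selected Gram matrix A_S^T A_S, grown one row
    at a time via the matrix determinant lemma. A rough sketch only; the
    quasi-optimal (S-optimal) criterion of Chapter 3 refines this score."""
    M, p = A.shape
    # initialize with p rows chosen by QR with column pivoting on A^T,
    # which tends to give a well-conditioned square submatrix
    _, _, piv = scipy.linalg.qr(A.T, pivoting=True)
    S = list(piv[:p])
    for _ in range(m - p):
        G = A[S].T @ A[S]
        # det(G + a a^T) = det(G) * (1 + a^T G^{-1} a): pick the largest gain
        gains = [a @ np.linalg.solve(G, a) if i not in S else -np.inf
                 for i, a in enumerate(A)]
        S.append(int(np.argmax(gains)))
    return S
```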

We then apply the quasi-optimal sampling strategy to function approximation. The polynomial least squares solution is sought and is obtained by solving an over-determined linear system of equations for the coefficients. Let the size of the model matrix be M × p, where M is the number of samples and p is the number of unknown coefficients in the polynomial, i.e., the number of basis functions. A well accepted rule-of-thumb is to use linear over-sampling by letting M = αp, where α = 1.5 ∼ 3. On the other hand, some recent mathematical analysis revealed that such linear over-sampling is asymptotically unstable for polynomial least squares using Monte Carlo and quasi Monte Carlo sampling, cf. [31, 65, 64, 62]. In fact, for asymptotically stable polynomial regression, one should have at least M ∝ p log p and in many cases M ∝ p^2.

In the setup of function approximation, we introduce two different sampling strategies

[84, 85]. One is a direct application of the quasi-optimal sampling to the ordinary least squares. The other is the combination of the quasi-optimal sampling and a method termed

Christoffel least squares (CLS) [66]. CLS is a weighted least squares problem, where the weights are derived from the Christoffel function of orthogonal polynomials. Instead of using standard MC samples from the measure of the underlying problem, one samples from the pluripotential equilibrium measure. Analysis and extensive numerical tests in both bounded and unbounded domains were presented in [66] and demonstrated much improved stability properties. Again, an important feature of these strategies is nonadaptiveness. Prior to collecting the data at the samples, we apply the quasi-optimal sampling method to obtain the quasi-optimal subset, for any given number of samples specified by the user. The data of the target function are then collected at this subset only and the corresponding (weighted) polynomial least squares problem is solved for the approximation of the target function.

Extensive numerical examples are provided in Chapter 7.1 showing that the quasi-optimal sampling methods can deliver accurate approximations with O(1) oversampling. Compared to the well accepted linear oversampling rate (1.5 ∼ 3), these represent notable improvements. Also, the quasi-optimal set notably outperforms other standard design-of-experiments choices.

1.2 Few data and Advanced sparse approximation

In high dimensional approximations, we often face the situation where the number of samples is (severely) less than the cardinality of the linear space from which the approximant is sought. Consequently, this leads to an underdetermined system that admits an infinite number of solutions. The common approach, following the remarkable success of compressive sensing (CS), is to seek a sparse approximation via ℓ1 norm minimization.

In this dissertation, we introduce an alternative approach to construct sparse approximations for underdetermined systems. Specifically, we employ the ℓ1−ℓ2 minimization method, which corresponds to (1.2) with F(c) = ‖c‖_1 − ‖c‖_2. The method was originally proposed and studied in [59, 105]. Here we introduce a further study [103] of ℓ1−ℓ2 minimization.

The motivation for using ℓ1−ℓ2 is that the contours of ℓ1−ℓ2 are closer to those of the ℓ0 norm than the contours of the ℓ1 norm are, so the use of ℓ1−ℓ2 minimization should be more sparsity promoting. Numerical examples in [59, 105] suggest that this is indeed the case. The theoretical estimates of the recoverability in [59, 105], however, cannot verify the numerical findings, as the estimates for ℓ1−ℓ2 minimization tend to be worse than those for the standard ℓ1 minimization. Nevertheless, the more sparsity promoting property of ℓ1−ℓ2 minimization seems to be evident in a large number of numerical tests in [59, 105], which also discussed and analyzed in detail the optimization based on the DCA (difference of convex algorithm [92]) for the ℓ1−ℓ2 minimization problem. It has been recognized in the CS literature that non-convex methods can be sparsity promoting (cf. [27]), e.g., reweighted ℓ1 minimization ([25]) and ℓp minimization with p < 1 ([28, 57]). From this point of view, it is not surprising that ℓ1−ℓ2 minimization, as a non-convex method, appears to be more sparsity promoting. Comparisons between ℓ1−ℓ2 minimization and other non-convex methods, such as ℓp minimization with p < 1, can be found in [59].

This dissertation contains the following results. First, a set of recoverability estimates for exact recovery of sparse signals is presented. The new estimates improve those in [59, 105], although they are still unable to verify the improvement of ℓ1−ℓ2 minimization over ℓ1 minimization. New theoretical estimates of the ℓ1−ℓ2 recoverability for general non-sparse signals are presented, which were not available before. Again, these estimates are not better than those for the standard ℓ1 minimization.
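To make the DCA approach mentioned above concrete, here is a minimal sketch: each outer step linearizes the concave part −‖c‖_2 at the current iterate and solves the resulting weighted-ℓ1 linear program via the standard splitting c = c⁺ − c⁻. This is an illustration under these assumptions, not the implementation analyzed in [59, 105, 103].

```python
import numpy as np
from scipy.optimize import linprog

def l1_minus_l2_dca(A, y, n_outer=10):
    """DCA sketch for: min ||c||_1 - ||c||_2  subject to  Ac = y  (m < p).
    Each outer step solves a weighted-l1 LP with variables z = [c_plus, c_minus]."""
    m, p = A.shape
    c = np.linalg.lstsq(A, y, rcond=None)[0]     # start from the least-norm solution
    for _ in range(n_outer):
        nrm = np.linalg.norm(c)
        v = c / nrm if nrm > 1e-12 else np.zeros(p)   # subgradient of ||c||_2
        obj = np.concatenate([1.0 - v, 1.0 + v])      # sum(c+ + c-) - v @ (c+ - c-)
        A_eq = np.hstack([A, -A])
        res = linprog(obj, A_eq=A_eq, b_eq=y,
                      bounds=[(0, None)] * (2 * p), method="highs")
        z = res.x
        c = z[:p] - z[p:]
    return c
```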

We then extend the ℓ1−ℓ2 minimization technique to orthogonal polynomial approximation. In particular, we focus on the use of Legendre polynomials and the case of multiple dimensions. This is similar to the study of ℓ1 minimization for Legendre polynomials in [36, 61, 79, 102]. Both the straightforward ℓ1−ℓ2 minimization and the Chebyshev weighted ℓ1−ℓ2 minimization are discussed, together with recoverability results for both sparse polynomials and general non-sparse polynomials. The estimates suggest that the Chebyshev weighted ℓ1−ℓ2 minimization is more efficient in low dimensions. In high dimensions, the standard non-weighted ℓ1−ℓ2 minimization with uniform random samples should be preferred. We remark that a similar result holds for the ℓ1 minimization of Legendre polynomials [102].

Based on various numerical results presented in Chapter 7.2, it is observed that ℓ1−ℓ2 minimization performs consistently better than the standard ℓ1 minimization. The available theoretical estimates, in [59, 105] and the improved ones herein ([103]), do not validate this finding. Nevertheless, our study suggests that ℓ1−ℓ2 minimization can be a viable option for sparse orthogonal polynomial approximations for underdetermined systems. This is especially useful in stochastic collocation, as the samples are often extraordinarily expensive to acquire.

1.3 Big data and Sequential approximation

In extremely high dimensions, any classical approximation technique becomes intractable due to its gigantic complexity of approximation, p ≫ 1. We thus introduce a sequential function approximation method proposed in [86, 96, 82]. The resulting algorithm is particularly suitable for high dimensional problems when data collection is cheap.

This method is motivated by the randomized Kaczmarz algorithm developed in [88].

The Kaczmarz method was first proposed in [56] as an iterative algorithm for solving a consistent overdetermined system of equations Ax = b. In its original form, the Kaczmarz algorithm updates the k-th solution to

$$x^{(k+1)} = x^{(k)} + \frac{b_i - \langle x^{(k)}, A_i\rangle}{\|A_i\|_2^2}\, A_i,$$

where A_i is the i-th row of the matrix A and is chosen in a cyclic manner to sweep through the rows of A. Due to its simplicity, the Kaczmarz algorithm has found use in applications such as tomography and image processing, cf. [48, 54]. In the randomized Kaczmarz method, the i-th row is randomly drawn with probability proportional to ‖A_i‖_2^2. It was shown that the randomized Kaczmarz (RK) method converges exponentially in expectation ([88]). Many works followed, analyzing the properties and improving the performance of the RK method, cf. [70, 106, 29, 38, 58, 15, 94, 71], all of which focus on linear systems of equations.
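A minimal sketch of the randomized Kaczmarz iteration just described, with rows drawn with probability proportional to ‖A_i‖_2^2, is as follows.

```python
import numpy as np

def randomized_kaczmarz(A, b, n_iter=10000, seed=0):
    """Randomized Kaczmarz sketch: at each step, draw row i with probability
    proportional to ||A_i||_2^2 and project onto the hyperplane <A_i, x> = b_i."""
    rng = np.random.default_rng(seed)
    m, p = A.shape
    row_norms2 = np.sum(A**2, axis=1)
    prob = row_norms2 / row_norms2.sum()
    x = np.zeros(p)
    for _ in range(n_iter):
        i = rng.choice(m, p=prob)
        a = A[i]
        x += (b[i] - a @ x) / row_norms2[i] * a
    return x
```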

In this dissertation, we introduce a sequential function approximation method [86] which extends the idea of the RK method beyond linear systems of equations and applies it to function approximation in multiple dimensions. We remark that function approximation by the RK method was demonstrated in [88] in one of the examples. It was for a bandlimited function in a Fourier basis. This corresponds to the case where the target function resides in the linear subspace of the approximation, and with a finite number of samples. Here, however, we consider a more general setting, where the target function does not need to reside in the linear subspace of the approximation.

The sequential function approximation method [86] has a distinct feature: the model matrix A of size m × p is never formed. This is particularly useful for high dimensional problems with d ≫ 1. In this case, the dimensionality of the linear subspace of approximants is also large, i.e., p ≫ 1. When data collection is cheap and the amount of data is large, the so-called big data problem, we have m ≫ p ≫ 1. Even the declaration of the model matrix A could easily exceed the system memory of any computer. The sequential approximation method provides a viable option for this, as it iteratively seeks the approximation one data point at a time. This implies that every iteration of the method requires only operations on a vector of size 1 × p, thus inducing O(p) operations. The implementation of the method becomes irrespective of the size of the data, allowing one to handle an extremely large amount of data and/or continuously arriving data.
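This matrix-free structure can be sketched as follows, assuming a hypothetical interface in which basis(x) returns the length-p vector of basis values at a point and sample_x(rng) draws one point from the chosen sampling measure; each iteration uses only O(p) work and storage, and the model matrix is never formed. The update itself is the Kaczmarz-type correction discussed above.

```python
import numpy as np

def sequential_approx(f, basis, p, sample_x, n_iter=200000, seed=0):
    """Sequential function approximation sketch (hypothetical interface):
    one random sample per iteration, O(p) vector work, no matrix is formed."""
    rng = np.random.default_rng(seed)
    c = np.zeros(p)
    for _ in range(n_iter):
        x = sample_x(rng)
        psi = basis(x)                 # one row of the (never formed) model matrix
        r = f(x) - psi @ c             # residual at the new sample
        c += r / (psi @ psi) * psi     # Kaczmarz-type rank-one update
    return c
```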

A general convergence analysis is presented. The analysis reveals that by drawing the sample data randomly from a probability measure, the method converges under mild conditions. Both upper and lower bounds for the convergence rate are established, which translate into bounds for the approximation error. The convergence and error are measured against the best L^2 approximation of the target function, which is the orthogonal projection onto the linear subspace of approximants. The established convergence rate and error bounds are in expectation with respect to different sampling sequences. We also present an almost sure convergence result: under a more restrictive condition, the bounds for the convergence rate and the approximation error are valid almost surely. Furthermore, we introduce an optimal sampling probability which results in the optimal rate of convergence, 1 − 1/p. With the optimal sampling, the error analysis is expressed as an equality, rather than the inequalities which are common in standard error estimates. When orthogonal polynomials are used as the basis functions, this optimal sampling measure has a close connection with the equilibrium measure of pluripotential theory ([9]). This connection has been discussed in the context of polynomial least squares ([52, 66]).

Upon the introduction of this sequential approximation method, we apply it on full tensor quadrature grids [96]. As mentioned earlier in the introduction, due to the underlying tensor structure, the number of samples grows exponentially. This limits its applicability to very high dimensions. However, the sequential approximation method allows us to take advantage of the desirable mathematical properties of the full tensor quadrature points in an efficient manner. Both the theoretical analysis and the numerical examples indicate that highly accurate approximation results can be reached at K ∼ γp iterations, where p is the dimensionality of the polynomial space and γ = 5 ∼ 10. Let the total number of the full tensor points be m^d. This implies that only a very small portion of the full tensor Gauss points are used, as γp ≪ m^d when d ≫ 1. The operation count of the method is O(Kp), which becomes O(p^2) and can be notably smaller than the operation count of the standard least squares method, O(p^3). Also, since the sequential method always operates on row vectors of length p, it requires only O(p) storage of real numbers and avoids matrix operations, which would require O(p^2) storage. This feature makes the method highly efficient in very high dimensions d ≫ 1.

1.4 Corrupted data and Least absolute deviation

We now turn our attention to the quality of data. So far, we have assumed that the collected

data is exact (or perhaps contaminated by some white noise). In practice, however, this is not always

the case. Let us consider the case when the data vector is corrupted by unexpected external

errors. That is, the data vector is

$$b = f + e_s, \tag{1.3}$$

where e_s ∈ R^m is the corruption error vector with sparsity s. Let the model matrix A be overdetermined, m > p, and let the sparsity s be reasonably small, i.e., only a relatively small number s of the entries of e_s are non-zero. However, the locations of the s non-zero entries are arbitrary and their magnitude (in absolute value) can be large. This can be a likely

situation in practice when the machines collecting the samples, via either experimentation

or computation, become faulty with a small probability and produce samples with erratic

errors. (Sometimes these are referred to as “soft errors”, as opposed to the standard noisy

errors which are modeled as i.i.d. random variables.) Note that the underlying mechanism

causing these corruption errors could be completely deterministic, e.g., via faulty machinery,

however, the location and amplitude of the errors are unknown.

In this dissertation, we introduce results by [83] which show that these corruption

errors can be effectively eliminated (with overwhelming probability) by conducting ℓ1-minimization (the LAD method), i.e.,

$$\min_{c} \|b - Ac\|_1. \tag{1.4}$$

By “effectively eliminated”, we mean the approximation f̃(x) constructed using the solution of (1.4) has an error that can be bounded by the approximation errors of the ℓ1- and ℓ2-based models using the uncorrupted data. We remark that it has long been recognized that the LAD method is more robust (than the LSQ method) with respect to outliers. In a way, one may regard the corruption errors discussed here as a kind of outlier. From this perspective, the work of [83] provided a first systematic and quantitative analysis of the “robustness” of LAD towards outliers. However, corruption is a much more general concept than outliers, as it can occur much more frequently (e.g., at 30-40% of the data).

The analysis presented here [83] is motivated by the work of [23, 20], where it was shown that ℓ1-minimization for overdetermined systems can remove sparse errors. The major, and important, difference between the work of [83] and that of [23, 20] is that [23, 20] deals with the linear system of equations f = Ac and, more importantly, assumes the system is consistent. That is, the system f = Ac can be solved exactly in the overdetermined case m > p. Here we are concerned with linear regression and function approximation. Consequently, the system f = Ac is never consistent and has an error e = f − Ac ≠ 0. The error e is directly related to the approximation error f̃(x) − f(x), which is usually not sparse. With the additional sparse corruption error e_s added to f, the overall error e + e_s is not sparse. Therefore, the theory of [23, 20] does not readily apply.

The theoretical analysis for (1.4) was developed in [83] and it was shown there that

the approximation f̃(x) constructed by the solution of (1.4) has an error bounded by the errors of the LAD and LSQ approximations obtained by using the uncorrupted data. In

this dissertation, we aim to give a comprehensive introduction to these results. Extensive

numerical examples are provided in Chapter 7.4.

1.5 Application: Uncertainty Quantification

High-dimensional approximation theory is deeply related to uncertainty quantification (UQ), and two of the approximation methods presented in this dissertation can directly apply

to UQ problems. Here we briefly introduce what UQ is and how UQ is related to the

approximation theory.

According to Wikipedia, uncertainty quantification is the science of quantitative characterization and reduction of uncertainties in both computational and real world applications. According to the U.S. Department of Energy (2009, p. 135), uncertainty quantification studies all sources of error and uncertainty. More precisely, UQ is the end-to-end study of the reliability of scientific inferences. A comprehensive introduction to UQ can be found in [90].

In recent years, stochastic computation has drawn enormous attention because of the need for UQ in complex physical systems. Many approaches have focused on solving large and complex systems under uncertainty. One of the most popular approaches is to employ orthogonal polynomials, following the work of generalized polynomial chaos (gPC) [46, 101].

In the most standard gPC expansion, the basis functions are orthogonal polynomials satisfying

$$\mathbb{E}[\phi_i\phi_j] = \int \phi_i(x)\phi_j(x)\,d\mu(x) = \delta_{ij}, \tag{1.5}$$

where E stands for the expectation operator and δ_ij is the Kronecker delta function. The type of orthogonal polynomials is then determined by the probability measure µ via the orthogonality condition. For example, the Gaussian distribution leads to Hermite polynomials, the uniform distribution leads to Legendre polynomials, etc. Many variations of the gPC expansion have appeared since its introduction in [101]. For detailed reviews, see [99]. A brief review can be found in Chapter 2.7.
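As a small numerical sanity check of the orthonormality condition (1.5), one can verify it for, e.g., normalized probabilists' Hermite polynomials under the Gaussian measure using Gauss-Hermite quadrature; the snippet below is illustrative only.

```python
import numpy as np
from numpy.polynomial import hermite_e as He
from math import factorial

# Check E[psi_i psi_j] = delta_ij for normalized probabilists' Hermite polynomials
x, w = He.hermegauss(30)                 # nodes/weights for weight exp(-x^2/2)
w = w / np.sqrt(2 * np.pi)               # normalize to the standard normal density
n = 5
psi = np.array([He.hermeval(x, np.eye(n + 1)[k]) / np.sqrt(factorial(k))
                for k in range(n + 1)])
gram = psi * w @ psi.T                   # (i,j) entry approximates E[psi_i psi_j]
print(np.round(gram, 10))                # ~ identity matrix
```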

The basic idea of the gPC approach is to construct an orthogonal polynomial approximation of the solution of the stochastic system. To obtain the gPC expansion numerically, Galerkin and collocation methods are widely used. Chapter 2.7 compactly reviews the general frameworks. Galerkin approaches require one to modify existing deterministic codes.

On the other hand, collocation approaches allow one to reuse existing deterministic codes.

This makes the collocation approach highly popular in practical implementation.

In stochastic collocation, we first choose a set of samples of the input random variables. Then, given these samples, the problem becomes deterministic and can be solved by the existing code repetitively. Finally, a polynomial approximation of the solution can be constructed via the solution samples. This is where high-dimensional multivariate polynomial approximation becomes important. Popular approaches for stochastic collocation include sparse grids ([100, 39, 4, 41, 67]), discrete expansion ([80, 97]), sparse ℓ1-minimization approximation ([35, 102, 104]), etc. Least squares has also been explored ([30, 93, 2, 55]).

1.6 Objective and Outline

The objective of this dissertation is to give a comprehensive introduction to four different high-dimensional approximation methods, developed by the author in [84, 85, 103, 86, 96,

82, 83]. We remark that each method is designed for a specific data related scenario.

The dissertation is organized as follows. In Chapter 2, we review preliminary mathematics which will frequently be used throughout this dissertation. This includes some concepts and results from approximation theory, compressive sensing, pluripotential theory, orthogonal polynomials, tensor quadrature, and uncertainty quantification. Chapter 3 is about an optimal sampling strategy for the least squares method. Assuming that collecting data is expensive, we discuss the optimal locations at which to collect data, prior to the actual data collection procedure, in order to achieve the best least squares performance. This chapter is based on [84, 85]. In Chapter 4, assuming only a small amount of data is available, we discuss an advanced sparse approximation using ℓ1−ℓ2 minimization. This is based on the work of [103]. In Chapter 5, a sequential function approximation method is introduced. This method is designed for high dimensional problems with d ≫ 1 and it is especially suitable

when the data is sequentially or continuously collected. It is an iterative method which

requires only vector operations (no matrices) and it never forms a matrix. The contents of

this chapter are based on [86, 96, 82]. Chapter 6 discusses a way to eliminate the effect of

the corruption errors in data. In this case, we assume the data is corrupted by unexpected

external errors. This chapter aims to present a mathematical analysis which shows that

the least absolute deviation method can effectively eliminate the corruption errors. It is

based on the work of [83]. Finally, extensive numerical examples for all four methods are

provided in Chapter 7.

CHAPTER 2

REVIEW: APPROXIMATION, REGRESSION AND ORTHOGONAL POLYNOMIALS

In this chapter, we review preliminary mathematics that will be frequently used throughout this dissertation.

2.1 Function approximation

Let D ⊂ R^d, d ≥ 1, be a domain, equipped with a (probability) measure ω. Let L^2_ω(D) be the standard Hilbert space with the inner product

$$\langle g, h\rangle_\omega = \int_D g(x)h(x)\,d\omega, \qquad \forall g, h \in L^2_\omega(D), \tag{2.1}$$

and the corresponding induced norm ‖·‖_ω. We assume the measure ω is absolutely continuous and often explicitly write

$$d\omega = w(x)\,dx. \tag{2.2}$$

Let f(x) be an unknown function in L^2_ω(D). Let φ_i(x), i = 1, . . . , p, be a set of basis functions in D and let Π be the linear space spanned by those basis functions, i.e.,

$$\Pi = \mathrm{span}\{\phi_j(x) \mid 1 \le j \le p\}. \tag{2.3}$$

Then we seek an approximation in Π which can be written as

$$\tilde{f}(x; c) = \sum_{i=1}^{p} c_i \phi_i(x), \qquad c = [c_1, \cdots, c_p]^T. \tag{2.4}$$

For a given measure ω (2.2) and a linear space Π (2.3), one can always find the best L^2_ω approximation of f in Π. It is defined as

$$P_\Pi f = \operatorname*{argmin}_{g \in \Pi} \|f - g\|_\omega.$$

Equivalently, this can be written as the problem of finding the expansion coefficients which minimize the error with respect to L^2_ω, i.e.,

$$\hat{c} = \operatorname*{argmin}_{c \in \mathbb{R}^p} \Big\| f - \sum_{i=1}^{p} c_i\phi_i \Big\|_\omega, \qquad \text{where } c = [c_1, \cdots, c_p]^T. \tag{2.5}$$

Then the best L^2_ω approximation of f can be written as

$$P_\Pi f = \sum_{i=1}^{p} \hat{c}_i\phi_i, \qquad \text{where } \hat{c} = [\hat{c}_1, \cdots, \hat{c}_p]^T.$$

Let V be the matrix of size p × p whose (i, j)-entry is ⟨φ_i, φ_j⟩_ω and let f̂_φ be the vector of size p × 1 whose i-th entry is ⟨f, φ_i⟩_ω. By differentiating (2.5), it readily follows that the best L^2_ω projection coefficient vector ĉ is

$$\hat{c} = V^{-1}\hat{f}_\phi. \tag{2.6}$$

We remark that if the basis {φ_j(x), j = 1, . . . , p} is orthonormal, ĉ is simply f̂_φ. If the basis φ_j is not orthonormal, there exists a linear transformation T (a matrix) which transforms the non-orthonormal basis φ_j into an orthonormal basis, for example, via the Gram-Schmidt process. That is,

$$T^{-1}[\phi_1(x), \phi_2(x), \dots, \phi_p(x)]^T = [\psi_1(x), \psi_2(x), \dots, \psi_p(x)]^T, \tag{2.7}$$

where

$$\langle \psi_i, \psi_j\rangle_\omega = \delta_{ij}, \qquad 1 \le i, j \le p. \tag{2.8}$$

In terms of the orthonormal basis ψ_j, the best L^2_ω approximation is written as

$$P_\Pi f = \sum_{j=1}^{p} \hat{f}_j\psi_j(x), \qquad \hat{f}_j = \langle f, \psi_j\rangle_\omega. \tag{2.9}$$

Let f̂_ψ = [f̂_1, ··· , f̂_p]^T. Using the transformation (2.7), we have T^{-1}f̂_φ = f̂_ψ. Also, ĉ = T^{-T}f̂_ψ. It then follows from (2.6) that

$$V^{-1}\hat{f}_\phi = \hat{c} = T^{-T}\hat{f}_\psi = T^{-T}T^{-1}\hat{f}_\phi = (TT^T)^{-1}\hat{f}_\phi,$$

which implies V = TT^T.

If we adopt matrix-vector notation to simplify the exposition,

$$\Phi(x) = [\phi_1(x), \dots, \phi_p(x)]^T, \qquad \Psi(x) = [\psi_1(x), \dots, \psi_p(x)]^T \tag{2.10}$$

as the general basis and the orthonormal basis, respectively, the orthogonal projection can be written as

$$P_\Pi f = \langle \Phi(x), \hat{c}\rangle = \langle \Psi(x), \hat{f}_\psi\rangle, \qquad \hat{c} = T^{-T}\hat{f}_\psi = (TT^T)^{-1}\hat{f}_\phi, \tag{2.11}$$

where ⟨·, ·⟩ denotes the vector inner product.
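For illustration, the projection coefficients (2.6) can be computed numerically for a simple non-orthonormal basis by approximating the inner products with quadrature; the particular basis and target function below are arbitrary choices for this example.

```python
import numpy as np
from numpy.polynomial.legendre import leggauss

# Sketch: best L^2 coefficients c_hat = V^{-1} f_phi of (2.6) for the monomial
# basis {1, x, x^2} on [-1,1] with d(omega) = dx/2, via Gauss-Legendre quadrature.
x, w = leggauss(20)
w = w / 2.0                                   # uniform probability measure on [-1,1]
phi = np.vstack([x**0, x, x**2])              # basis values, shape (p, npts)
f = np.exp(x)                                 # an arbitrary target function
V = phi * w @ phi.T                           # V_ij = <phi_i, phi_j>_omega
f_phi = phi * w @ f                           # (f_phi)_i = <f, phi_i>_omega
c_hat = np.linalg.solve(V, f_phi)             # best L^2_omega coefficients
print(c_hat)
```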

Remark 2.1.1. The best L^2_ω projection coefficient vector ĉ is available only if either f̂_ψ or f̂_φ is available. And f̂_ψ or f̂_φ is usually unavailable, as it requires knowledge of the unknown target function f(x). Hence, the best L^2_ω projection coefficients ĉ are not available in practice.

We are thus interested in approximating f(x) using its sample values f(x_i), i = 1, . . . , m. Let

$$A = (a_{ij}), \qquad a_{ij} = \phi_j(x_i), \quad 1 \le i \le m, \ 1 \le j \le p, \tag{2.12}$$

be the model matrix,

$$y = [f(x_1), \dots, f(x_m)]^T \in \mathbb{R}^m$$

be the samples of f, and

$$b = y + e$$

be the data vector, where either e = 0 (exact data) or e may be noise or an external corruption error. For ease of exposition, let us assume (for now) that e = 0, i.e., b = y.

Remark 2.1.2. The data y consists of two kinds of information. One is the location where

the sample is collected, i.e., Θ = {x_i}_{1≤i≤m}. The other is the function value at each location. Altogether, we have {(x_i, f(x_i))}_{1≤i≤m}.

Remark 2.1.3. For the rest of this dissertation, the model matrix A is assumed to be full rank.

16 2.2 Overdetermined linear system

In an overdetermined linear system, the number of data points m is greater than the number of basis functions p, i.e., m > p. The standard approach seeks to find an approximation f̃ whose coefficients minimize the error, i.e.,

$$c^* = \operatorname*{argmin}_{c \in \mathbb{R}^p} \|y - Ac\|. \tag{2.13}$$

Different norms measuring the error result in different approximation methods.

Least squares method

When the Euclidean norm (2-norm) is used, the minimization (2.13) results in the well-known ordinary least squares (OLS) method. That is,

$$c^* = \operatorname*{argmin}_{c \in \mathbb{R}^p} \|y - Ac\|_2^2 = \operatorname*{argmin}_{c \in \mathbb{R}^p} \sum_{i=1}^{m} \big(f(x_i) - \tilde{f}(x_i; c)\big)^2, \tag{2.14}$$

where f̃(x; c) is defined in (2.4). One can generalize the ordinary least squares to the weighted least squares (WLS) method. Let W = diag(w_1, . . . , w_m) be a diagonal matrix with positive entries. Then the weighted least squares minimizes the weighted error, i.e.,

$$c^* = \operatorname*{argmin}_{c \in \mathbb{R}^p} \big\| \sqrt{W}\,y - \sqrt{W}A\,c \big\|_2. \tag{2.15}$$

Its solution is well known and has an analytical closed form

$$c^* = (A^T W A)^{-1} A^T W y. \tag{2.16}$$

Note that when W is an identity matrix, the WLS (2.15) becomes the OLS (2.14).
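A direct implementation of the closed form (2.16) is immediate; the sketch below solves the normal equations rather than forming the inverse explicitly.

```python
import numpy as np

def weighted_least_squares(A, y, w):
    """Weighted least squares per (2.16): solve (A^T W A) c = A^T W y, W = diag(w)."""
    Aw = A * w[:, None]                    # W A
    return np.linalg.solve(A.T @ Aw, Aw.T @ y)
```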

We remark that the least squares can be considered as a discrete projection. Let us consider a discrete measure σ on the data locations Θ = {x_i}_{i=1}^{m} such that

$$\sigma(x) = \sum_{i=1}^{m} w_i\,\delta(x - x_i).$$

With this discrete measure σ, one can define the discrete indefinite inner product

$$\langle g, h\rangle_\sigma = \sum_{i=1}^{m} w_i\, g(x_i) h(x_i), \tag{2.17}$$

which does not satisfy the non-degeneracy condition on L^2_ω(D). Let us denote its corresponding semi-norm by ‖·‖_σ. Then (2.15) can be viewed as minimizing the error with respect to σ, i.e.,

$$\big\|\sqrt{W}\,y - \sqrt{W}A\,c\big\|_2^2 = \|f(x) - \tilde{f}(x; c)\|_\sigma^2 = \int_\Theta \Big(f(x) - \sum_{i=1}^{p} c_i\phi_i(x)\Big)^2 d\sigma(x). \tag{2.18}$$

If the discrete indefinite inner product (2.17) satisfies the non-degeneracy condition on a linear subspace Π of L^2_ω, this discrete measure σ indeed induces an inner product and its corresponding norm on Π. Then the least squares is a discrete projection of f onto Π. Such Π exists; however, since this is not directly related to our topic, we do not discuss it further here. The interested reader can consult [63].

Least absolute deviation

When the 1-norm is used in (2.13), this results in the least absolute deviation (LAD) method. That is,

$$c^* = \operatorname*{argmin}_{c \in \mathbb{R}^p} \|y - Ac\|_1 = \operatorname*{argmin}_{c \in \mathbb{R}^p} \sum_{i=1}^{m} \big| f(x_i) - \tilde{f}(x_i; c)\big|. \tag{2.19}$$

Though the idea of LAD is just as straightforward as that of LSQ, the LAD is not as simple to compute. Unlike least squares, least absolute deviations do not have an analytical solution. Also, its solution is not necessarily unique. However, since the LAD can be recast as a linear program, its solution can be found by the following linear program:

$$\min_{c,\,u} \sum_{i=1}^{m} u_i \quad \text{subject to} \quad \begin{cases} Ac - u - y \le 0, \\ Ac + u - y \ge 0. \end{cases} \tag{2.20}$$

At the optimum, u is equal to the vector of absolute errors of the LAD solution. By using any linear programming solver, e.g., ℓ1-magic [19], the LAD problem (2.19) can be solved.
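For concreteness, the linear program (2.20) can be passed to any generic LP solver; the sketch below uses scipy's linprog with stacked variables (c, u) and is meant as an illustration rather than a reference implementation.

```python
import numpy as np
from scipy.optimize import linprog

def lad_fit(A, y):
    """LAD via the linear program (2.20): minimize sum(u) subject to
    Ac - u <= y and -Ac - u <= -y, with c free and u >= 0."""
    m, p = A.shape
    obj = np.concatenate([np.zeros(p), np.ones(m)])
    A_ub = np.block([[ A, -np.eye(m)],
                     [-A, -np.eye(m)]])
    b_ub = np.concatenate([y, -y])
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * p + [(0, None)] * m, method="highs")
    return res.x[:p]
```

On data containing a few large corrupted entries, comparing this fit with an ordinary least squares fit illustrates the robustness of LAD discussed in Sections 1.4 and Chapter 6.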

The overdetermined ℓ1 minimization problem (2.19) is equivalent to a constrained ℓ1-norm minimization problem (2.22). This equivalence relation is a central mechanism for correcting the corruption errors in Chapter 6.

Let F ∈ R^{(m−p)×m} be a matrix whose rows span the kernel (left null space) of the model matrix A. One way to construct it is via QR factorization. Let

$$A = QR = \hat{Q}\hat{R}$$

be the full QR factorization and the skinny QR factorization of A, respectively, where, following the traditional notation, we write

$$Q = [\hat{Q}, \tilde{Q}] \in \mathbb{R}^{m \times m}, \qquad R = [\hat{R}; 0] \in \mathbb{R}^{m \times p}.$$

We can then set F = Q̃^T and immediately obtain

$$FA = \tilde{Q}^T\hat{Q}\hat{R} = 0. \tag{2.21}$$

It is then easy to verify that the following problem (2.22) is equivalent to (2.19):

$$\min_{g} \|g\|_1 \quad \text{subject to} \quad Fg = Fy. \tag{2.22}$$

Proof of equivalence between (2.19) and (2.22). Note that any g satisfying Fg = Fy has the form g = y − Ac for some c. Suppose c^* is a solution to (2.19) and set g^* = y − Ac^*. Then g^* satisfies Fg^* = F(y − Ac^*) = Fy. Since c^* is the minimizer of (2.19), we have ‖g^*‖_1 ≤ ‖g‖_1 for any g satisfying Fg = Fy, which implies g^* is a minimizer of (2.22). Conversely, let g^* be a minimizer of (2.22). Since Fg^* = Fy, there exists c^* satisfying g^* = y − Ac^*. For any c, let g = y − Ac. Then it satisfies Fg = Fy. Since g^* is a minimizer, we have ‖y − Ac^*‖_1 = ‖g^*‖_1 ≤ ‖g‖_1 = ‖y − Ac‖_1, which implies c^* is a minimizer of (2.19). Hence, the equivalence is shown.
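The construction of F via the full QR factorization and the property (2.21) can be checked directly; the random matrix below is an arbitrary example.

```python
import numpy as np

def corruption_kernel(A):
    """Build F = Qtilde^T from the full QR factorization A = QR, so that FA = 0
    as in (2.21); the rows of F span the orthogonal complement of range(A)."""
    m, p = A.shape
    Q, _ = np.linalg.qr(A, mode="complete")   # full QR, Q is m x m
    return Q[:, p:].T

# quick check on a random overdetermined matrix
A = np.random.default_rng(0).standard_normal((12, 4))
F = corruption_kernel(A)
print(np.allclose(F @ A, 0))                  # True
```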

For further discussion on LAD, the interested reader can consult the early work of

[1, 6, 73, 87, 11, 68] and [7, 76, 74] for the statistical properties. Also, see [12, 75] and the references therein.

2.3 Underdetermined linear system

When the number of data points is smaller than the number of basis functions, that is, m < p, the minimization problem (2.13) admits infinitely many solutions, regardless of the choice of the norm, as all solutions satisfy Ac = y.

Since A has full rank m, the set of all solutions is

$$\{c \in \mathbb{R}^p \mid y = Ac\} = \{c_p + z \in \mathbb{R}^p \mid z \in \mathcal{N}(A)\},$$

where c_p is any particular solution satisfying Ac_p = y and N(A) is the null space of A. Therefore one can choose, among the solution set, a specific solution which has a feature of interest. Typically such a specific choice is made via an optimization problem

$$\min_{c} F(c), \quad \text{subject to } Ac = y, \tag{2.23}$$

where F is an objective function which measures the feature of interest one wants to enforce on the solution.

2.3.1 Sparsity

Sparsity plays an important role in many fields of science. For example, sparsity leads to

dimensionality reduction and efficient modeling. Thus among many other features, sparsity

is one of the most common and popular choices. Mathematically, sparsity is measured by the ℓ0-norm. For v ∈ R^p,

$$\|v\|_0 = |\{i \mid v_i \neq 0, \ 1 \le i \le p\}|.$$

That is, ‖v‖_0 is the number of non-zero entries of v. Ideally, one wants to solve (2.23) with F(c) = ‖c‖_0. Solving this combinatorial optimization problem directly is infeasible even for modest size problems.

Following the remarkable success of compressive sensing (CS), the use of the ℓ1-norm (F(c) = ‖c‖_1) has become very popular. A large number of results have been obtained for CS. We review some definitions and results relevant to our topic. Following standard notation, we denote the sparsity by s.

Definition 2.3.1. For T ⊆ {1, . . . , p} and each number s, s-restricted isometry constant of

A is the smallest δs ∈ (0, 1) such that

2 2 2 (1 − δs)kxk2 ≤ kAT xk2 ≤ (1 + δs)kxk2 (2.24) 20 |T | for all subsets T with |T | ≤ s and all x ∈ R . The matrix A is said to satisfy the s-RIP

(restricted isometry property) with δs.

Here AT stands for the submatrix of A formed by the columns chosen from the index set T .

0 0 Definition 2.3.2. If s + s ≤ n, the s, s -restricted orthogonality constant θs,s0 of A is the smallest number such that

0 0 |hAx, Ax i| ≤ θs,s0 kxk2kx k2, (2.25)

where x and x0 are s-sparse and s0-sparse, respectively.

The following two properties shall be used in a later chapter. The first one relates the constants δ_s and θ_{s,s′} ([23]),

θ_{s,s′} ≤ δ_{s+s′} ≤ θ_{s,s′} + max(δ_s, δ_{s′}).   (2.26)

The second one ([18]) states that, for any a ≥ 1 and positive integers s, s′ such that as′ is an integer,

θ_{s,as′} ≤ √a θ_{s,s′}.   (2.27)

Lemma 2.3.3 (Lemma 2.1, [21]). We have

|⟨Ax, Ax′⟩| ≤ δ_{s+s′} ‖x‖_2 ‖x′‖_2   (2.28)

for all x, x′ supported on disjoint subsets T, T′ ⊂ {1, . . . , p} with |T| ≤ s, |T′| ≤ s′.

Proof of Lemma 2.3.3. The proof follows readily by combining Definition 2.3.2 and (2.26).

2.4 Orthogonal Polynomials

We seek an approximation f̃ ≈ f in a finite-dimensional linear space. Let Π ⊂ L²_ω(D) be a linear subspace with dim Π = p, and {φ_j(x), j = 1, . . . , p} be a basis. Note that a given basis determines a single linear subspace; however, a linear subspace can be spanned by various choices of basis, e.g., related by rotations.

One of the most widely used linear subspaces is a polynomial space. Let Λ be an index set of multi-indices α = (α_1, . . . , α_d) ∈ N_0^d, with |α|_1 = α_1 + ··· + α_d and |α|_∞ = max_i |α_i|. For each index set Λ, a polynomial space is defined as

P_Λ = span{x^α = x_1^{α_1} ··· x_d^{α_d}, α ∈ Λ}.   (2.29)

The two most popular polynomial spaces are the total degree polynomial space and the tensor product space. The total degree space P_n^d is the linear space of d-variate polynomials of degree up to n ≥ 1, defined as

P_n^d = span{x^α, |α|_1 ≤ n},   (2.30)

whose dimension is

dim P_n^d = (n+d choose d) = (n + d)! / (n! d!).   (2.31)

The tensor product space of degree n is

P_{d,n}^{TP} = span{x^α, |α|_∞ ≤ n},   (2.32)

whose dimension is

dim P_{d,n}^{TP} = (n + 1)^d.

Classical orthogonal polynomials, such as those of Legendre, Laguerre and Hermite, but

also discrete ones, due to Chebyshev, Krawtchouk and others, have found widespread use in

all of mathematics, science, engineering and computations. A comprehensive review

can be found in [91, 99]. In this dissertation, orthogonal polynomials will be used as our

primary basis functions in approximating f.

In principle, one can always orthogonalize any polynomial basis via a certain orthogonalization procedure, e.g., Gram-Schmidt. In order to define what we mean by orthogonality, an inner product has to be introduced. As can be seen in (2.1), different choices of the (probability) measure ω (2.2) define different inner products. In the univariate case, the explicit form of the orthogonal polynomials is known for many measures ω. For example, the uniform measure corresponds to Legendre polynomials, the Gaussian measure to Hermite polynomials, etc. In the multivariate case, one straightforward way to construct orthogonal polynomials is to tensorize univariate orthogonal polynomials.

Let x = (x_1, ··· , x_d) ∈ D ⊂ R^d for d > 1. Let x ∈ R and {ψ_j(x)}_{j≥1} be univariate orthogonal polynomials with respect to a one-dimensional measure ϖ(x)dx. That is,

∫_R ψ_i(x) ψ_j(x) ϖ(x) dx = δ_ij.

Remark 2.4.1. In principle, one can employ different univariate orthogonal polynomials

in different coordinates. However, for simplicity, we employ the same univariate orthogonal polynomials in all coordinates.

Let us consider a d-variate measure ω which is a product of univariate measures, i.e.,

dω = ϖ(x)dx = ϖ(x_1) ··· ϖ(x_d) dx_1 ··· dx_d.   (2.33)

Let Λ be a multi-index set and P_Λ be its corresponding polynomial space, defined in (2.29). For each i = (i_1, ··· , i_d) ∈ Λ, the multivariate orthogonal polynomial ψ_i(x) is constructed as the tensor product of the corresponding one-dimensional polynomials ψ_{i_j}(x_j), i.e.,

ψ_i(x) = ∏_{j=1}^d ψ_{i_j}(x_j),  i ∈ Λ.   (2.34)

For notational convenience, we will employ a linear ordering between the multi-index i and

the single index i. That means, for any 1 ≤ i ≤ |Λ|, there exists a unique multi-index i such

that

i ←→ i = (i1, . . . , id). (2.35)

Throughout this dissertation, we shall frequently interchange the use of the single index i

and the multi-index i, with the one-to-one correspondence (2.35) assumed. By the tensor

structure and univariate orthogonality, it can be readily checked that

∫_D ψ_i(x) ψ_j(x) dω = δ_ij.
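A minimal sketch of the tensor construction (2.34), assuming normalized Legendre polynomials under the uniform probability measure on [−1, 1]^d and a total-degree index set (both choices are made here only for illustration); the orthonormality is checked by Monte Carlo.

import numpy as np
from itertools import product
from math import comb

d, n = 2, 3
Lambda = [a for a in product(range(n + 1), repeat=d) if sum(a) <= n]
assert len(Lambda) == comb(n + d, d)          # dimension of P_n^d, cf. (2.31)

def psi_1d(k, x):
    """Univariate Legendre polynomial of degree k, orthonormal under dx/2 on [-1,1]."""
    c = np.zeros(k + 1); c[k] = 1.0
    return np.sqrt(2 * k + 1) * np.polynomial.legendre.legval(x, c)

def psi(alpha, x):
    """Tensor-product basis (2.34): psi_alpha(x) = prod_j psi_{alpha_j}(x_j)."""
    return np.prod([psi_1d(aj, x[..., j]) for j, aj in enumerate(alpha)], axis=0)

# Monte Carlo check of int psi_i psi_j domega = delta_ij
x = np.random.uniform(-1, 1, size=(100000, d))
G = np.array([[np.mean(psi(a, x) * psi(b, x)) for b in Lambda] for a in Lambda])
print(np.max(np.abs(G - np.eye(len(Lambda)))))   # small, O(1/sqrt(#samples))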

For reference, an overview of the basic results on orthogonal polynomials of several variables can be found in [37].

2.5 Tensor Quadrature

The fundamental problem in numerical integration is to compute an approximate solution to a high-dimensional integral

∫ f(x) dx

to a given degree of accuracy. This is directly related to computing the best L²_ω projection coefficients ĉ (2.5).

First, we briefly review the well-known Gaussian quadrature in 1D, as it is the most popular choice. Generally, a quadrature rule is a set of pairs of points and weights {(x_i, w_i)}_{1≤i≤m} such that

Q(f) = Σ_{i=1}^m w_i f(x_i) ≈ ∫_a^b f(x) dx.

Its accuracy is often measured by its polynomial degree of exactness. A quadrature rule has exactness of degree up to n if

∫_a^b f(x) dx = Q(f)

whenever f(x) is a polynomial of degree at most n, but

∫_a^b f(x) dx ≠ Q(f)

for some polynomial f(x) of degree n + 1.

For any distinct points {x_j}_{1≤j≤m}, one can always find corresponding weights {w_j}_{1≤j≤m} such that the quadrature rule {(x_j, w_j)}_{1≤j≤m} has exactness of degree m−1. The weights can be found by solving a linear system of equations,

Aw = b,  w = [w_1, ··· , w_m]^T,

where the i-th entry of b is b_i = ∫ x^{i−1} dx and the (i, j)-th entry of A is a_ij = (x_j)^{i−1}. As the points are distinct, A is invertible. Thus w = A^{−1}b gives the quadrature weights, whose exactness is m − 1.

In the Gaussian quadrature rule, the points are designed to maximize the polynomial degree of exactness. Given m points, the maximum exactness is 2m − 1, and the Gaussian quadrature achieves this maximum. The Gaussian quadrature is associated with a family of orthogonal polynomials. Let ψ_j(x) be the orthogonal polynomial of degree j, where the orthogonality holds with respect to ϖ(x)dx, i.e.,

∫ ψ_i(x) ψ_j(x) ϖ(x) dx = δ_ij.   (2.36)

Let {z_j}_{1≤j≤m} be the zeros of ψ_m, i.e., ψ_m(z_j) = 0 for all j, and let

w_j = ( Σ_{k=1}^m (1/γ_k) ψ_k(z_j)² )^{−1},  γ_k = ‖ψ_k‖_ω².   (2.37)

Then {(zj, wj)}1≤j≤m is a Gaussian quadrature rule. Depending on a choice of orthogo- nal polynomial, a different Gaussian quadrature is obtained. For example, when Legendre

polynomials are used, this results in Gauss-Legendre quadrature. Similarly when Hermite

polynomials are used, this results in Gauss-Hermite quadrature. Details of Gauss quadra-

tures are well documented in the literature, see, for example, [91].

Based on the 1D quadrature, we now construct the tensor quadrature. For each dimension i, i = 1, . . . , d, let (z_ℓ^(i), w_ℓ^(i)), ℓ = 1, . . . , m, be a set of one-dimensional quadrature points and weights such that

Σ_{ℓ=1}^m w_ℓ^(i) f(z_ℓ^(i)) ≈ ∫ f(x_i) ϖ_i(x_i) dx_i

for any integrable function f. We then proceed to tensorize the univariate quadrature rules.

Let

Θ_{i,m} = {z_1^(i), . . . , z_m^(i)},  i = 1, . . . , d,   (2.38)

be the one-dimensional quadrature point sets. The tensor product is taken to construct a d-dimensional point set

Θ_M = Θ_{1,m} ⊗ · · · ⊗ Θ_{d,m}.   (2.39)

Obviously, M = #Θ_M = m^d. As before, an ordering scheme can be employed to order the points via a single index, i.e., for each j = 1, . . . , M,

z_j = (z_{j_1}^(1), . . . , z_{j_d}^(d)),  j ←→ j ∈ {1, . . . , m}^d.   (2.40)

That is, each single index j corresponds to a unique combination j := (j_1, . . . , j_d) with |j|_∞ ≤ m. Each point has the scalar weight

w_j = w_{j_1}^(1) × · · · × w_{j_d}^(d),  j = 1, . . . , M,

with the same j ↔ j correspondence. Then {(z_j, w_j)}_{1≤j≤M} is a tensor quadrature. Note that when Gaussian quadrature is used for the one-dimensional rules, the resulting tensor quadrature is called the tensor Gauss quadrature.
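A minimal sketch of the tensorization (2.38)-(2.40), assuming a Gauss-Legendre 1D rule and a separable test integrand chosen only for illustration.

import numpy as np
from itertools import product

d, m = 3, 4
z1d, w1d = np.polynomial.legendre.leggauss(m)

# each multi-index j = (j_1,...,j_d) gives a tensor point z_j and weight w_j
Z = np.array([[z1d[j] for j in jj] for jj in product(range(m), repeat=d)])
W = np.array([np.prod([w1d[j] for j in jj]) for jj in product(range(m), repeat=d)])
assert Z.shape == (m**d, d)                    # M = m^d points

f = lambda x: np.prod(x**2, axis=1)            # f(x) = x_1^2 * ... * x_d^2
print(np.dot(W, f(Z)), (2 / 3) ** d)           # matches the exact integral over [-1,1]^d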

2.6 Christoffel function and the pluripotential equilibrium

Here we briefly introduce some results from pluripotential theory. Let P_n^d ⊂ L²_ω(D) be the total degree polynomial space of degree up to n, so that p = dim P_n^d = (n+d choose d). Let {φ_j(x), j = 1, . . . , p} be an orthonormal basis of P_n^d. The "diagonal" of the reproducing kernel of P_n^d is defined by

K_n(x) := Σ_{i=1}^p φ_i²(x).   (2.41)

Note that K_n does not depend on the choice of the orthonormal basis. The (normalized) Christoffel function from the theory of orthogonal polynomials ([72]) is

λ_n(x) := p / K_n(x).   (2.42)
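A small sketch of (2.41)-(2.42), assuming univariate normalized Legendre polynomials on [−1, 1] (an illustrative choice only): K_n(x) is the sum of squared orthonormal basis functions and λ_n(x) = p / K_n(x).

import numpy as np

n = 20                                            # total degree
p = n + 1                                         # dimension of the univariate space

def K_n(x):
    x = np.asarray(x, dtype=float)
    K = np.zeros_like(x)
    for k in range(p):
        c = np.zeros(k + 1); c[k] = 1.0
        phi_k = np.sqrt(2 * k + 1) * np.polynomial.legendre.legval(x, c)
        K += phi_k**2                             # accumulate phi_k(x)^2 as in (2.41)
    return K

x = np.linspace(-0.99, 0.99, 5)
print(p / K_n(x))     # normalized Christoffel function; smaller near the endpoints, consistent with (2.44)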

Let us define a sampling measure µ_D such that

dµ_D = v(z)dz,   (2.43)

where v(z) is a sampling weight function. In bounded domains, this sampling measure is the equilibrium measure of the domain. In unbounded domains, it is a measure whose weight function is a scaled version of the √w-weighted equilibrium measure. See below for the details.

2.6.1 Pluripotential equilibrium measure on Bounded domains

Let D be a compact set of nonzero d-dimensional Lebesgue measure with nonempty interior, and let w be a continuous function on the interior of D such that 0 < ∫_D p²(z) w(z) dz < ∞ for any nontrivial algebraic polynomial p(z). Then the pair (D, w) is called "admissible".

Suppose w(z) is an admissible weight function on a compact domain D. It is known that

lim_{n→∞} λ_n(x) = ω(x)/v(x),  a.e. x ∈ D,   (2.44)

where v(x) is the Lebesgue weight function (a probability density) of the pluripotential equilibrium measure of D ([9]),

dµ_D(x) = v(x)dx.

For some specific domains D, the equilibrium measure µ_D is well known. For example, the equilibrium measure for D = [−1, 1] is the arcsine measure with "Chebyshev" density, whose explicit formula is

v_D(z) = 1 / (π √(1 − z²)),  z ∈ [−1, 1].

For d ≥ 2, the equilibrium measure for D = [−1, 1]^d is the product of the univariate measures, i.e.,

v(z) = dµ_D/dz = ∏_{j=1}^d 1 / (π √(1 − z_j²)),  z ∈ [−1, 1]^d.   (2.45)

When D is the unit simplex, the equilibrium density is

v_D(z) = C [ (1 − ‖z‖_1) ∏_{j=1}^d z_j ]^{−1/2},

where C is a normalization constant.
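A minimal sketch of drawing i.i.d. samples from the equilibrium ("Chebyshev"/arcsine) measure of [−1, 1]^d in (2.45), using the elementary fact that cos(πU) has the arcsine density when U is uniform on [0, 1]; the variance check is included only as a sanity test.

import numpy as np

def sample_equilibrium_cube(num, d, rng=np.random.default_rng()):
    # each coordinate has density 1/(pi*sqrt(1-z^2)) on [-1, 1]
    return np.cos(np.pi * rng.uniform(size=(num, d)))

z = sample_equilibrium_cube(100000, d=2)
# second moment of the arcsine density is 1/2:  int z^2/(pi*sqrt(1-z^2)) dz = 1/2
print(z[:, 0].var(), 0.5)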

2.6.2 Pluripotential equilibrium measure on Unbounded domains

Suppose D is an origin-centered unbounded conic domain with nonzero d-dimensional Lebesgue measure, and w = exp(−2Q(z)), with Q satisfying (i) lim_{|z|→∞} Q(z)/|z| > 0, and (ii) there is a constant t ≥ 1 such that

Q(cz) = c^t Q(z),  ∀z ∈ D, c > 0.   (2.46)

As special cases, the one-sided exponential weight w(z) = exp(−Σ_j z_j) on D = [0, ∞)^d and the Gaussian density function w(z) = exp(−Σ_j z_j²) on D = R^d satisfy (2.46) with t = 1 and t = 2, respectively.

Let µ_{D,Q} be the weighted pluripotential equilibrium measure and denote the Lebesgue density of the corresponding weighted equilibrium measure as

dµ_{D,Q}(z) = v_{D,Q}(z)dz

(assuming such a density exists).

For each polynomial degree n ∈ N, let φ_i^{(n)}(x) be the family orthonormal under w^n, i.e.,

∫_D φ_i^{(n)}(x) φ_j^{(n)}(x) w^n(x) dx = δ_ij.   (2.47)

For the whole space R^d with w^n(x) = exp(−‖√n x‖_2²), the corresponding Hermite polynomial satisfies h_i^{(n)}(x) = n^{d/4} h_i(√n x). For the tensor half space [0, ∞)^d with w^n(x) = exp(−‖nx‖_1), the corresponding Laguerre polynomial satisfies L_i^{(n)}(x) = n^{d/2} L_i(nx). Thus, we can write these as φ_i^{(n)}(x) = n^{d/2t} φ_i(n^{1/t} x), where the homogeneity exponent is t = 2 for the whole space and t = 1 for the tensor half space. Then the reproducing kernel associated with the total degree polynomial space of degree n with w^n is defined by

K_n^{(n)}(x) = Σ_{1≤i≤p} (φ_i^{(n)}(x))².   (2.48)

Then the following result is known.

Theorem 2.6.1 ([9, 10, 8]). Let D ⊂ R^d be an unbounded convex cone with weight w. Then

lim_{n→∞} (1/p) w^n(x) K_n^{(n)}(x) = v_{D,Q}(x) = dµ_{D,Q}/dx.   (2.49)

Note that the weighted equilibrium measure µ_{D,Q} here has compact support even though D is unbounded. Thus the sampling weight function is a scaled version of the √w-weighted equilibrium measure,

v(z) = n^{−d/t} v_{D,Q}(n^{−1/t} z) = n^{−d/t} (dµ_{D,Q}/dz)(n^{−1/t} z),  n = max_{α∈Λ} |α|.

For convenience, let

dµ_D = v(z)dz

be the sampling measure for unbounded domains. Since the measure µ_{D,Q} has compact support, we require the scaling factor n^{1/t} to effectively expand the support to cover the unbounded domain.

Explicit formulae for multivariate weighted equilibrium measures µD,Q on real-valued sets D are not yet fully known. For a few special cases, the forms of the measures were conjectured in [66]. More specifically,

• Tensor "half space" D = [0, ∞)^d ⊂ R^d with

w(z) = exp(−Σ_{j=1}^d z_j),  Q(z) = −log √(w(z)).

The equilibrium measure is

v_{D,Q}(z) = dµ_{D,Q}/dz = C √( (4 − Σ_{j=1}^d z_j)^d / ∏_{j=1}^d z_j ),   (2.50)

where C is a normalization constant. The measure µ_D is supported only on the set 4T^d, with

T^d = { z ∈ R^d | z_j ≥ 0, ‖z‖_1 = Σ_{j=1}^d |z_j| ≤ 1 }.   (2.51)

• Whole space D = R^d with

w(z) = exp(−|z|²),  Q(z) = −log √(w(z)).

The equilibrium measure is

v_{D,Q}(z) = dµ_{D,Q}/dz = C (2 − |z|²)^{d/2},   (2.52)

where C is a normalization constant. The measure µ_D is supported only on the set √2 B^d, where B^d is the d-dimensional unit ball.

2.7 Uncertainty Quantification

Most of the contents in this chapter can be found in [99]. Let us consider a system of partial differential equations (PDEs) defined in a spatial domain D ⊂ R^ℓ, ℓ = 1, 2, 3, and a time domain [0, T] with T > 0,

  ∂u/∂t (x, t, Z) = L(u; Z),   D × (0, T] × Ω,
  B(u; Z) = 0,                 ∂D × [0, T] × Ω,     (2.53)
  u = u_0,                     D × {t = 0} × Ω,

where L is a (nonlinear) differential operator, B is the boundary condition operator, u_0 is the initial condition, and Z = (Z_1, ··· , Z_d) ∈ Ω ⊂ R^d, d ≥ 1, is the set of independent random variables. The solution is then a random quantity,

u(x, t, Z): D̄ × [0, T] × Ω → R^{dim u},   (2.54)

where dim u ≥ 1 is the dimension of u. For ease of presentation, we shall set dim u = 1.

The fundamental assumption here is that (2.53) is a well-posed system almost surely.

That is, for each realization of the random variables Z, the problem of (2.53) is well posed

in the deterministic sense, almost surely.

Let ω be the probability measure of Z, {φ_i(Z)}_{i=1}^p be a set of basis functions for the random variable Z, and Π be the linear space spanned by {φ_i(Z)}_{i=1}^p. The key idea is to construct an approximation v_p ∈ Π which has the form of a linear combination of the basis functions of Π, i.e.,

v_p(x, t, Z) = Σ_{i=1}^p c_i(x, t) φ_i(Z).   (2.55)

The best L²_ω approximation is defined by

P_Π u(x, t, Z) = argmin_{g∈Π} E|g(x, t, Z) − u(x, t, Z)|²,   (2.56)

where the expectation E is taken over Z. Note that the following are equivalent:

E|g(x, t, Z) − u(x, t, Z)|² = ‖g(x, t, Z) − u(x, t, Z)‖_ω² = ∫_Ω |g(x, t, z) − u(x, t, z)|² dω(z).   (2.57)

When the basis is orthonormal, the best L²_ω approximation can be written explicitly as

P_Π u(x, t, Z) = Σ_{i=1}^p û_i(x, t) φ_i(Z),  û_i(x, t) = E[u(x, t, Z) φ_i(Z)].

Though this is the optimal (in the L²_ω sense) approximation in Π, it is not of practical use, since the projection requires knowledge of the unknown solution.

2.7.1 Stochastic Galerkin method

The stochastic Galerkin procedure is a straightforward extension of the classical Galerkin approach for deterministic equations. For any given (fixed) x and t, we seek an approximation v_p ∈ Π,

v_p(x, t, Z) = Σ_{i=1}^p c_i^G(x, t) φ_i(Z),

whose coefficients c^G(x, t) = [c_1^G(x, t), ··· , c_p^G(x, t)]^T satisfy, for 1 ≤ i ≤ p,

  E[ (∂v_p/∂t (x, t, Z) − L(v_p; Z)) · φ_i(Z) ] = 0,   D × (0, T] × Ω,
  E[ B(v_p; Z) · φ_i(Z) ] = 0,                          ∂D × [0, T] × Ω,     (2.58)
  E[ (v_p(x, 0, Z) − u(x, 0, Z)) · φ_i(Z) ] = 0,        D × {t = 0} × Ω.

That is, we seek a solution in Π such that the residue of (2.53) is orthogonal to the space

Π. Upon evaluating the expectations in (2.58), the dependence in Z disappears. The result is a system of (usually coupled) deterministic equations. The size of the system is p, the dimension of the approximation space Π. For a fixed (x, t) = (xfix, tfix), one can obtain a numerical solution vp(xfix, tfix,Z) only at (xfix, tfix) by solving (2.58) via a well-established deterministic code.

When orthogonal polynomials are used as the basis and the distribution of Z is the corresponding orthogonal probability measure, that is, Z ∼ ω and

∫ φ_i(z) φ_j(z) dω(z) = δ_ij,

the method is referred to as the generalized polynomial chaos (gPC) Galerkin method. Furthermore, if L is a linear differential operator, (2.58) becomes, for 1 ≤ i ≤ p,

  ∂c_i^G/∂t (x, t) − Σ_{j=1}^p E[ L(c_j^G(x, t); Z) φ_j(Z) φ_i(Z) ] = 0,   D × (0, T] × Ω,
  E[ B(v_p; Z) · φ_i(Z) ] = 0,                                            ∂D × [0, T] × Ω,     (2.59)
  c_i^G(x, 0) − E[ u(x, 0, Z) φ_i(Z) ] = 0,                               D × {t = 0} × Ω.

Typically, this results in a coupled system of equations.

2.7.2 Stochastic collocation

The collocation methods are a popular choice for complex systems for which well-established deterministic codes exist. In deterministic simulations, collocation methods are those that require the residue of the governing equations to be zero at discrete nodes in the computational domain; the nodes are called collocation points. The same definition can be extended to stochastic simulations.

Let I_Z be the support of Z and consider the same problem (2.53). For any given x and t, let

ν_p(x, t, Z) = Σ_{i=1}^p c_i(x, t) φ_i(Z)

be a numerical approximation of u. In general, ν_p(·, ·, Z) ≈ u(·, ·, Z) in a proper sense in I_Z, and the system (2.53) cannot be satisfied for all Z after substituting ν_p for u.

Let Θ = {z_1, ··· , z_m} ⊂ I_Z be a set of m either prescribed nodes or random realizations of Z in the random space, where m ≥ 1 is the number of nodes. Then, in the collocation method, for all j = 1, ··· , m, we enforce (2.53) at the node z_j by solving for u(x, t, z_j),

  ∂u/∂t (x, t, z_j) = L(u; z_j),   D × (0, T],
  B(u; z_j) = 0,                   ∂D × [0, T],     (2.60)
  u = u_0,                         D × {t = 0}.

Because the value of the random parameter Z is fixed, (2.60) is a deterministic problem. Therefore, solving the system poses no difficulty provided one has a well-established deterministic algorithm. For fixed (x, t) = (x_fix, t_fix), let u^(j) = u(x_fix, t_fix, z_j), for j = 1, ··· , m, be the solutions of (2.60). We remark that the more complex the problem (2.60), the more expensive it is to obtain a sample solution. This results in an uncoupled system of deterministic equations. In other words, we have the problem of constructing an accurate approximation to u(x_fix, t_fix, Z) using its samples {u^(j) = u(x_fix, t_fix, z_j)}_{1≤j≤m} by determining the expansion coefficients c = [c_1(x_fix, t_fix), ··· , c_p(x_fix, t_fix)]^T. Let

A = (a_ij),  a_ij = φ_j(z_i),

be the model matrix and

u = [u^(1), ··· , u^(m)]^T

be the vector of solutions at (x_fix, t_fix, z_j), j = 1, ··· , m. Then this can be compactly written as

Ac ≈ u.   (2.61)
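A small sketch of this post-processing step: assemble a_ij = φ_j(z_i) from the basis and nodes and recover c by least squares. The univariate normalized Legendre basis and the stand-in "solution" u are assumptions made only for illustration.

import numpy as np

p, m = 6, 40
rng = np.random.default_rng(1)
z = rng.uniform(-1, 1, size=m)                       # collocation nodes (random here)
u = np.exp(z) * np.sin(np.pi * z)                    # stand-in samples u^(j) at (x_fix, t_fix)

def phi(j, x):                                       # normalized Legendre basis
    c = np.zeros(j + 1); c[j] = 1.0
    return np.sqrt(2 * j + 1) * np.polynomial.legendre.legval(x, c)

A = np.column_stack([phi(j, z) for j in range(p)])   # a_ij = phi_j(z_i)
c, *_ = np.linalg.lstsq(A, u, rcond=None)            # solve A c ~= u in the least squares sense

z_test = np.linspace(-1, 1, 5)
u_p = sum(c[j] * phi(j, z_test) for j in range(p))   # evaluate nu_p at new parameter values
print(np.max(np.abs(u_p - np.exp(z_test) * np.sin(np.pi * z_test))))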

This is typically achieved by utilizing multivariate approximation theory. Popular approaches for stochastic collocation include sparse grid interpolation ([100, 39, 4, 41, 67]), discrete expansion ([80, 97]), sparse ℓ1-minimization approximation ([35, 102, 104]), etc. Least squares approaches have also been explored ([30, 93, 2, 55]).

CHAPTER 3

OPTIMAL SAMPLING

In this chapter, a sampling methodology for least squares (LSQ) regression is introduced.

The strategy designs a quasi-optimal sample set in such a way that, for a given number of samples, it can deliver the regression result as close as possible to the result obtained by a

(much) larger set of candidate samples. The quasi-optimal set is determined by maximizing a quantity measuring the mutual column orthogonality and the determinant of the model matrix. This procedure is non-adaptive, in the sense that it does not depend on the sample data. This is useful in practice, as it allows one to determine, prior to the potentially expensive data collection procedure, where to sample the underlying system. We also introduce its efficient implementation via a greedy algorithm.

We then apply the quasi-optimal sampling strategy to polynomial least squares. Two strategies are presented. One is a direct application to ordinary least squares. The other is a combination of quasi-optimal sampling and a method termed Christoffel least squares (CLS) [66]. Since the quasi-optimal set allows one to obtain almost optimal regression results for any affordable number of samples, the new strategies result in polynomial least squares methods with high accuracy and robust stability at an almost minimal number of samples. Extensive numerical examples in Chapter 7.1 demonstrate that the approach notably outperforms other standard choices of sample sets.

An immediate application of the quasi-optimal set is stochastic collocation for UQ, where data collection usually requires expensive PDE solutions. It is demonstrated that the quasi- optimal LSQ can deliver very competitive results, compared to gPC Galerkin method, and yet it remains fully non-intrusive and is much more flexible for practical computations.

3.1 Quasi-optimal subset selection

A general linear regression model can be written as

y = Aβ + ε,

where y ∈ R^m is the data vector, ε ∈ R^m is the error vector, A ∈ R^{m×p} is the model matrix, and β ∈ R^p are the regression parameters. Here m stands for the number of data/observations/samples, and p stands for the number of parameters or factors in the regression model. In this chapter, we assume m ≥ p.

For a diagonal matrix W whose entries are positive, the well-known weighted least squares solution is

β̂ = (A^T W A)^{−1} A^T W y,

which minimizes the weighted sum of squared residuals,

β̂ = argmin_{β∈R^p} ‖√W Aβ − √W y‖_2.

For simplicity of presentation, we consider ordinary least squares (OLS), i.e., W is the identity matrix of size m × m. We remark that all results remain valid for weighted least squares by substituting A → √W A and y → √W y.

Optimal subset

Let A_M ∈ R^{M×p} and y_M ∈ R^M with M ≫ 1, and let

β̂_M = argmin_{β∈R^p} ‖A_M β − y_M‖_2   (3.1)

be the least squares solution to an "ideal" system, in the sense that the system is sampled at a large number (M) of locations with data y_M and can result in the "best" linear regression approximation. Since data collection for many practical systems can be a resource-intensive procedure, requiring time-consuming numerical simulations and/or expensive experimentation, we recognize that the ideal system (3.1) is often unattainable due to the limited resources one possesses.

To circumvent the difficulty, we consider a sub-problem,

β̂_m = argmin_{β∈R^p} ‖A_m β − y_m‖_2,   (3.2)

where m < M (often m ≪ M) is the number of samples one can afford to collect given the resources, A_m is a sub-matrix of A_M, and y_m is the corresponding sub-vector of y_M. The performance of the sub-problem (3.2) obviously depends on the selection of the m samples.

Our goal is to determine the optimal m-point set, within the M-point candidate set, that delivers the best performance. This can be mathematically formulated as follows.

  m×M Let Sm = S(1); ··· ; S(m) ∈ R be a row selection matrix, where each row S(i) =

[0, ··· , 0, 1, 0, ··· , 0] is a length-M vector with all zeros except a one at the k_i-th location, 1 ≤ k_i ≤ M. We further require that the locations of the entry 1 in each row be distinct, i.e., {k_i}_{1≤i≤m} are distinct. Then, for any matrix C ∈ R^{M×ℓ} with ℓ ≥ 1, S_m C results in an m × ℓ submatrix consisting of the distinct rows k_1, . . . , k_m of C. The sub-problem (3.2) can be written equivalently as

β̂_m(S_m) = argmin_{β∈R^p} ‖S_m A_M β − S_m y_M‖_2,   (3.3)

where the solution dependence on S_m is explicitly shown. We now define the optimal subset as follows.

Definition 3.1.1 (Optimal subset). Let β̂_M be the OLS solution of the ideal system defined in (3.1), and β̂_m be the OLS solution of the sub-problem (3.3), where 1 ≤ m < M. The m-point optimal subset is defined as the solution of

S_m^† = argmin_{S_m} ‖β̂_M − β̂_m(S_m)‖_2.   (3.4)

The optimal subset then delivers an OLS solution that is the closest to the OLS solution of the full problem. The definition is not directly useful, however, as its solution (if it exists) obviously depends on the full data vector y_M. In the remainder of this chapter, we shall use Definition 3.1.1 to motivate our design of a quasi-optimal m-point subset that is non-adaptive, i.e., the resulting subset is independent of y_M and can be determined prior to data collection.

Quasi-optimal point set

Here we present a quasi-optimal sample set. It is motivated by the definition of optimal subset in Definition 3.1.1. An important feature is that the quasi-optimal subset selection is non-adaptive, i.e., it is independent of the data yM .

Remark 3.1.2. Since the true optimality in Definition 3.1.1 can not be achieved without having the full data vector yM , this strategy is termed quasi-optimal.

Motivation

We start by estimating the difference between the least squares solutions of the sub-problem and the full problem.

Theorem 3.1.1. Let S_m be an m-row selection matrix with 1 ≤ m < M, and let

β̂_m(S_m) = argmin_β ‖S_m A_M β − S_m y_M‖_2,
                                                   (3.5)
β̂_M = argmin_β ‖A_M β − y_M‖_2,

be the least squares solutions of the m-row sub-problem and the M-row full problem, respectively. Let A_M = Q_M R be the QR factorization of A_M. Then,

‖β̂_M − β̂_m(S_m)‖ ≤ ‖R^{−1}‖ · ‖g‖,   (3.6)

where g is the least squares solution to

S_m Q_M x = S_m P^⊥ y_M   (3.7)

satisfying

((S_m Q_M)^T S_m Q_M) g = (S_m Q_M)^T S_m P^⊥ y_M,   (3.8)

and P^⊥ = (I − Q_M Q_M^T) is the projection operator onto the orthogonal complement of the column space of Q_M.

Proof. We note that S_m A_M = S_m Q_M R. Then,

‖β̂_M − β̂_m‖ = ‖(A_M^T A_M)^{−1} A_M^T y_M − ((S_m A_M)^T S_m A_M)^{−1} (S_m A_M)^T S_m y_M‖
            = ‖(R^T R)^{−1} R^T Q_M^T y_M − (R^T (S_m Q_M)^T S_m Q_M R)^{−1} (R^T (S_m Q_M)^T) S_m y_M‖
            = ‖R^{−1} Q_M^T y_M − R^{−1} ((S_m Q_M)^T S_m Q_M)^{−1} (S_m Q_M)^T S_m y_M‖
            ≤ ‖R^{−1}‖ ‖Q_M^T y_M − ((S_m Q_M)^T S_m Q_M)^{−1} (S_m Q_M)^T S_m y_M‖
            = ‖R^{−1}‖ · ‖g‖,

where

g = −Q_M^T y_M + ((S_m Q_M)^T S_m Q_M)^{−1} (S_m Q_M)^T S_m y_M.

Let P y_M = Q_M Q_M^T y_M be the orthogonal projection of the data vector y_M onto the column space of Q_M. Then y_M can be decomposed as

y_M = P y_M + P^⊥ y_M = Q_M Q_M^T y_M + (I − Q_M Q_M^T) y_M.

We then have

g = −Q_M^T y_M + ((S_m Q_M)^T S_m Q_M)^{−1} (S_m Q_M)^T S_m (Q_M Q_M^T y_M + P^⊥ y_M)
  = ((S_m Q_M)^T S_m Q_M)^{−1} (S_m Q_M)^T S_m P^⊥ y_M.

This completes the proof.

Note that if S_m is constructed in such a way that S_m Q_M retains the same column orthogonality as Q_M, the right-hand side of (3.8) will be zero and consequently g = 0. In this case, the sub-problem solution β̂_m becomes equivalent to the full-size problem solution β̂_M. Obviously, in general this will not be the case for the reduced system with m < M, and consequently g ≠ 0. To achieve a smaller ‖g‖, we then seek to (a) maximize the column orthogonality of S_m Q_M so that the right-hand side of (3.8) is minimized; and (b) maximize the determinant of S_m Q_M so that the nonzero solution of (3.8) is smaller. Note that both conditions (a) and (b) are independent of the data y_M.

3.1.1 S-optimality

Based on the previous discussion, we now introduce the S-optimality criterion. Let A be an m×p matrix and A^(i) its i-th column. We define

S(A) := ( √(det(A^T A)) / ∏_{i=1}^p ‖A^(i)‖ )^{1/p}.   (3.9)

Obviously, we assume A^(i) ≠ 0 for all columns i = 1, . . . , p. If A is a square matrix with m = p, then S(A) = ( |det A| / ∏_{i=1}^p ‖A^(i)‖ )^{1/p}; and if m < p, S(A) = 0. The following result also holds.

Theorem 3.1.2. For m × p matrix A with m ≥ p, S(A) = 1 if and only if all columns of

A are mutually orthogonal.

Proof. Let A = QR be the QR factorization of A, where Q ∈ R^{m×m} is orthogonal and R ∈ R^{m×p} is upper triangular,

R = [ R_p ; 0 ],  R_p = (r_{ij})_{1≤i≤j≤p} upper triangular, with rows p+1, . . . , m of R identically zero.   (3.10)

Then det(A^T A) = ∏_{i=1}^p r_ii² and ‖A^(i)‖² = Σ_{j=1}^i r_ji² for 1 ≤ i ≤ p.

(⇒) For each i, ‖A^(i)‖² ≥ r_ii². Thus ∏_{i=1}^p ‖A^(i)‖² ≥ ∏_{i=1}^p r_ii² = det(A^T A). Suppose S(A) = 1; then ∏_{i=1}^p ‖A^(i)‖ = √(det(A^T A)). Therefore r_ji = 0 for 1 ≤ j < i and all 1 ≤ i ≤ p. This implies R is diagonal and consequently the columns of A are orthogonal.

(⇐) Now suppose the columns of A are mutually orthogonal. Then A can be written as

A = [A^(1) ··· A^(p)] = [ A^(1)/‖A^(1)‖ ··· A^(p)/‖A^(p)‖ ] diag(‖A^(1)‖, . . . , ‖A^(p)‖) = QR,

which is a natural QR decomposition. We then have √(det(A^T A)) = ∏_{i=1}^p ‖A^(i)‖ and S(A) = 1.
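A minimal sketch of evaluating the criterion (3.9); the use of a log-determinant for numerical robustness is an implementation choice, not part of the definition.

import numpy as np

def S_value(A):
    """S(A) = ( sqrt(det(A^T A)) / prod_i ||A^(i)|| )^(1/p), with columns A^(i)."""
    m, p = A.shape
    if m < p:
        return 0.0
    col_norms = np.linalg.norm(A, axis=0)
    sign, logdet = np.linalg.slogdet(A.T @ A)     # log det for robustness
    if sign <= 0:
        return 0.0
    return np.exp((0.5 * logdet - np.sum(np.log(col_norms))) / p)

# S = 1 exactly when the columns are mutually orthogonal (Theorem 3.1.2)
Q, _ = np.linalg.qr(np.random.randn(50, 5))
print(S_value(Q), S_value(np.random.randn(50, 5)))   # ~1.0 vs < 1.0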

With this S-optimality criterion, we are in a position to define the quasi-optimal set.

Definition 3.1.3 (Quasi-optimal subset). Let AM and yM be the model matrix and data

over the M-point candidate sample set, as defined in (3.1), and AM = QM R be the QR factorization of AM . Let 1 ≤ m < M be a given integer. The quasi-optimal m-point subset

is defined by selecting the m rows out of AM such that

S_m^† = argmax_{S_m} S(S_m Q_M).   (3.11)

If the columns of AM are mutually orthogonal, then the QR factorization is not necessary and

S_m^† = argmax_{S_m} S(S_m A_M).   (3.12)

Let I_k = {i_1, . . . , i_k} be a sequence consisting of k unique indices, where 1 ≤ i_j ≤ M for all j = 1, . . . , k, and i_n ≠ i_ℓ for any n ≠ ℓ. Also, let Q_{I_k} = [Q_(i_1); . . . ; Q_(i_k)] denote the matrix constructed from the I_k rows of Q_M. Note the trivial case Q_M = Q_{{1,...,M}}. Then the row selection problem (3.11) can be written equivalently as an index selection problem,

I_m^† = argmax_{I_m} S(Q_{I_m}).   (3.13)

We remark that (3.11) seeks the m-row submatrix S_m Q_M with the maximum S(S_m Q_M) value. From the definition of S in (3.9) and its property in Theorem 3.1.2, it is then clear that maximizing this value leads to both a larger determinant and better column orthogonality. This should lead to a smaller difference between the sub-problem OLS solution and the full-problem solution, via the bound in terms of ‖g‖ (3.8) in Theorem 3.1.1.

We also emphasize that the maximization of S(SmQM ) does not necessarily imply the strict minimization of g in (3.8). Therefore, it is possible to define a different quasi-optimal subset using a different criterion to minimize g. Here we propose to use the S(SmQM ) value, as it is relatively appealing mathematically and fairly straightforward to compute

(see the next section).

3.2 Greedy algorithm and implementation

Selecting m rows out of M rows using (3.11) is an NP-hard optimization problem. To circumvent the difficulty, we propose to use a greedy algorithm, as commonly done in optimization problems. By using the index selection format (3.13), the greedy algorithm is as follows.

Given the current selection of k ≥ p rows, which results in an index sequence I_k = {i_1, . . . , i_k}, find the next row index such that

i_{k+1} = argmax_{i ∈ I_M \ I_k} S(Q_{I_k ∪ {i}}),  I_{k+1} = I_k ∪ {i_{k+1}},   (3.14)

where I_M = {1, . . . , M} is the set of all indices.

Initial index set

The greedy algorithm requires an initial choice of the first p indices to start, since S(A) = 0 for any m×p matrix A with m < p. In order to determine the first p indices, we apply the above greedy algorithm to a column-truncated square version of the submatrices. For any k < p, let Q̃_{I_k} be the k × k square matrix consisting of the I_k rows and the first k columns of Q_M.

For a given index set I_k with 1 ≤ k < p, find

i_{k+1} = argmax_{i ∈ I_M \ I_k} S(Q̃_{I_k ∪ {i}}),  I_{k+1} = I_k ∪ {i_{k+1}}.   (3.15)

This algorithm starts with an initial choice of index i_1, which can be random. It stops when k = p and resumes the standard greedy algorithm (3.14).

Evaluations of S(A)

The greedy algorithm (3.14), resp. (3.15), requires one to, at each k-th step, search through the remaining candidate rows, construct a (k+1)×p matrix for each such row, compute its corresponding S value, and then find the row with the maximum S value. This can be an expensive procedure, as the S value requires computation of the determinant of every such matrix. Here we provide a procedure that avoids the computation of the determinants. It is based on the following results.

Theorem 3.2.1. Let A be an m × p full-rank matrix with m ≥ p and r be a length-p vector. Consider the (m + 1) × p matrix Ã = [A; r^T]. Then the following holds:

(S(Ã))^{2p} = (det A^T A) · (1 + r^T (A^T A)^{−1} r) / ∏_{i=1}^p (‖A^(i)‖² + r_i²).   (3.16)

The proof can be found in Appendix A.1. This result indicates that, for any m × p matrix A, the S value of the (m + 1) × p matrix with an arbitrary additional row r^T can be computed via (3.16). Consequently, in the greedy algorithm (3.14), to search through the remaining candidate rows at the k-th step, one can find the row that gives the maximum value of the right-hand side of (3.16) without the (det A^T A) term, as it is a common factor for the current (k × p) matrix. This avoids the computation of the determinant altogether.

For the computation of the initial p rows in the greedy algorithm (3.15), we need to compute the S value of square matrices with one extra row (added from the remaining candidate rows) and one extra column (added from the remaining columns). To avoid the explicit computation of the determinants, we introduce the following result.

Theorem 3.2.2. Let A be an m × m full-rank square matrix, r and c be length-m vectors, and γ be a scalar. Consider the (m + 1) × (m + 1) matrix

Ã = [ A   c ; r^T  γ ].

Then the following holds:

S(Ã)^{2(m+1)} = det(A^T A) · (1 + r^T b) / ∏_{i=1}^m (‖A^(i)‖² + r_i²) · (c^T c + γ² − α) / (c^T c + γ²),   (3.17)

where

b = (A^T A)^{−1} r,  g = (A^T A)^{−1} A^T c,   (3.18)

α = (c^T A + γ r^T) ( I − b r^T / (1 + r^T b) ) (g + γ b).

The proof can be found in Appendix A.2.

3.2.1 Fast greedy algorithm without determinants

By using Theorem 3.2.1, the greedy algorithm (3.14) can be rewritten in the following equivalent form.

Given the current selection of k ≥ p rows, which results in an index sequence I_k = {i_1, . . . , i_k}, find the next row index such that

i_{k+1} = argmax_{i ∈ I_M \ I_k} δ_i,  I_{k+1} = I_k ∪ {i_{k+1}},   (3.19)

where

δ_i = (1 + Q_(i) (Q_{I_k}^T Q_{I_k})^{−1} Q_(i)^T) / ∏_{j=1}^p (‖Q_{I_k}^(j)‖² + Q_{i,j}²),   (3.20)

where Q_(i) is the i-th row of Q_M and Q_{i,j} is its (i, j)-th entry.

where Q(i) is the i-th row of QM and Qi,j is its (i, j)-th entry.

Similarly, the greedy algorithm (3.15) to obtain the first p rows can be modified by using

Theorem 3.2.2.

For a given index set I_k with 1 ≤ k < p, find

i_{k+1} = argmax_{i ∈ I_M \ I_k} δ̂_i,  I_{k+1} = I_k ∪ {i_{k+1}},   (3.21)

where

δ̂_i = (1 + Q̃_(i) β_i) / ∏_{j=1}^k (‖Q̃_{I_k}^(j)‖² + Q̃_{(i),j}²) · (‖Q̃_{I_k}^{(k+1)}‖² + Q_{i,k+1}² − α_i) / (‖Q̃_{I_k}^{(k+1)}‖² + Q_{i,k+1}²),   (3.22)

and

β_i = (Q̃_{I_k}^T Q̃_{I_k})^{−1} Q̃_(i)^T,  g = (Q̃_{I_k}^T Q̃_{I_k})^{−1} Q̃_{I_k}^T Q̃_{I_k}^{(k+1)},

α_i = (Q̃_{I_k}^{(k+1)T} Q̃_{I_k} + Q_{i,k+1} Q̃_(i)) ( I − β_i Q̃_(i) / (1 + Q̃_(i) β_i) ) (g + Q_{i,k+1} β_i).
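A straightforward sketch of the greedy step (3.19)-(3.20). It assumes the initial p rows are already available (e.g., from (3.21)), and, for clarity, re-factorizes the small Gram matrix at every step rather than using rank-one updates; it is therefore an illustration of the selection rule, not an optimized implementation.

import numpy as np

def greedy_select(Q_M, I_init, m):
    M, p = Q_M.shape
    I = list(I_init)                               # initial p indices (assumed given)
    while len(I) < m:
        Qk = Q_M[I, :]                             # current k x p submatrix Q_Ik
        G_inv = np.linalg.inv(Qk.T @ Qk)           # (Q_Ik^T Q_Ik)^{-1}
        col_sq = np.sum(Qk**2, axis=0)             # ||Q_Ik^(j)||^2 for each column j
        best_i, best_delta = None, -np.inf
        for i in set(range(M)) - set(I):
            q = Q_M[i, :]                          # candidate row Q_(i)
            numer = 1.0 + q @ G_inv @ q            # 1 + Q_(i) (Q_Ik^T Q_Ik)^{-1} Q_(i)^T
            denom = np.prod(col_sq + q**2)         # prod_j (||Q_Ik^(j)||^2 + Q_{i,j}^2)
            delta = numer / denom                  # the score (3.20)
            if delta > best_delta:
                best_i, best_delta = i, delta
        I.append(best_i)
    return I

# usage: Q_M from the QR factorization of a random 200 x 5 model matrix
A_M = np.random.randn(200, 5)
Q_M, _ = np.linalg.qr(A_M)
print(sorted(greedy_select(Q_M, I_init=range(5), m=12)))   # 12 quasi-optimal rows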

3.3 Polynomial least squares via quasi-optimal subset

We now discuss the quasi-optimal subset in terms of (orthogonal) polynomial least squares

regression, along with its special properties.

In polynomial regression, we are interested in approximating an unknown function f(x)

using its samples f(xi), i = 1, ··· , m. Let Π be a p-dimensional polynomial space and

{φ_j(x), 1 ≤ j ≤ p} be its polynomial basis. Given a set of points Θ = {x_i}_{1≤i≤m}, the model matrix A is constructed as follows:

A = (a_ij),  a_ij = φ_j(x_i),  1 ≤ i ≤ m, 1 ≤ j ≤ p.

To emphasize its dependence on the point set Θ, we often denote the model matrix by A(Θ).

By employing least squares, one can construct an approximation

f̃(x) = Σ_{i=1}^p c_i^* φ_i(x) ∈ Π,

where the coefficient vector c^* = [c_1^*, ··· , c_p^*]^T minimizes the squared sum of the errors,

c^* = argmin_{c∈R^p} ‖Ac − y‖_2,  y = [f(x_1), ··· , f(x_m)]^T.

Note that each row of A corresponds to a point and each column of A corresponds to a polynomial in the basis. Thus the quasi-optimal sample set in Definition 3.1.3 can be readily expressed in terms of the sample points in polynomial regression.

Let ΘM = {x1, . . . , xM } be the candidate sample point set, and Θm = {xi1 , . . . , xim } ⊂

ΘM be an arbitrary m-point subset with m ≥ p. Let A(Θm) = Q(Θm)R be the QR factorization of the model matrix A(Θm). Then, the quasi-optimal set, following Definition 3.1.3, is

Θ_m^† = argmax_{Θ_m ⊂ Θ_M} S(Q(Θ_m)).   (3.23)

The greedy algorithm can also be readily constructed, similar to (3.14).

Given the current selection of k ≥ p samples, which results in the set Θ_k = {x_{i_1}, . . . , x_{i_k}}, find the next sample point such that

x_{i_{k+1}} = argmax_{x ∈ Θ_M \ Θ_k} S(Q(Θ_k ∪ {x})),  Θ_{k+1} = Θ_k ∪ {x_{i_{k+1}}}.   (3.24)

Again, this greedy algorithm requires a choice of the first p samples, as S(Q(Θ_m)) = 0 whenever m < p. The greedy algorithm above can be adapted to this situation by truncating the p columns to the first m columns to form an m × m square sub-matrix Q̃(Θ_m) and then applying (3.24), similar to the procedure in (3.15). Note that in this situation we require an ordering of the multivariate polynomial basis of Π. Different orderings will lead to different choices of the first p samples.

3.4 Orthogonal polynomial least squares via quasi-optimal subset

Let Π be a linear subspace of L²_ω(D) and {φ_i(x)}_{1≤i≤p} be an orthogonal basis of Π, satisfying the mutual orthogonality

∫_D φ_i(x) φ_j(x) dω = δ_ij,  1 ≤ i, j ≤ p.

Remark 3.4.1. A popular choice of Π is the total degree polynomial space of degree up to n, P_n^d, defined in (2.30), whose dimension is p = dim P_n^d = (n+d choose d). Accordingly, its orthogonal basis functions are orthogonal polynomials.

When an orthogonal basis is used and the sample points are drawn i.i.d. from the underlying orthogonal probability measure ω, the columns of the model matrix A are approximately orthogonal. We can then apply the scheme directly to the model matrix A, avoiding the QR factorization.

Let A(Θm) be the model matrix constructed by a family of orthogonal polynomials and the iid samples from the underlying orthogonal probability measure. Then, the quasi-

optimal set, following Definition 3.1.3, is

Θ_m^† = argmax_{Θ_m ⊂ Θ_M} S(A(Θ_m)).   (3.25)

The greedy algorithm can also be readily constructed, similar to (3.24).

Given the current selection of k ≥ p samples, which results in the set Θ_k = {x_{i_1}, . . . , x_{i_k}}, find the next sample point such that

x_{i_{k+1}} = argmax_{x ∈ Θ_M \ Θ_k} S(A(Θ_k ∪ {x})),  Θ_{k+1} = Θ_k ∪ {x_{i_{k+1}}}.   (3.26)

The first p samples can be chosen in a similar manner as in Chapter 3.3. That is, we apply the above greedy algorithm to the column-truncated square version of the submatrices. For any k < p, let A^sq(Θ_k) be the k × k square matrix consisting of the rows corresponding to Θ_k and the first k columns of A_M = A(Θ_M). Then the first p points are selected as follows:

For a given set of k points Θ_k with 1 ≤ k < p, find

x_{k+1} = argmax_{x ∈ Θ_M \ Θ_k} S(A^sq(Θ_k ∪ {x})),  Θ_{k+1} = Θ_k ∪ {x_{k+1}}.   (3.27)

This algorithm starts with an initial choice of point x1, which can be a random choice. It stops when k = p and resumes the standard greedy algorithm (3.26).

Let S_m^† be the row selection operator corresponding to the point set Θ_m^† (3.25). Then the quasi-optimal subset results in a small-size least squares problem,

min_c ‖S_m^† A_M c − S_m^† y_M‖_2.   (3.28)

3.4.1 Asymptotic distribution of quasi-optimal points

Let {φ_j(x)} be orthogonal polynomials in D ⊂ R^d satisfying

∫_D φ_i(x) φ_j(x) dω = δ_ij.   (3.29)

Suppose the total degree polynomial space P_n^d is spanned by {φ_j(x)}_{1≤j≤p}, where p = dim P_n^d. Suppose also that the sample points are drawn from the underlying orthogonal probability measure ω. In this setup, we present some unique properties regarding the asymptotic distribution of the quasi-optimal set. To this end, we focus on the empirical distribution of a finite point set. For any k-point set Θ_k = {x_1, . . . , x_k}, we define its empirical distribution as

ν_k(x) := (1/k) Σ_{i=1}^k δ_{x_i}(x).   (3.30)

Asymptotic distribution of determined systems in 1D

First, we examine the property of a set of points which maximizes S in the continuous setting and in bounded interval [−1, 1]. More specifically, we focus on the determined systems, i.e., the case of square model matrix A.

Let φ_j(x) be the orthogonal polynomial of degree j and let (n − 1) be the highest polynomial degree. That is,

∫ φ_i(x) φ_j(x) dω = δ_ij,  0 ≤ i, j < n.   (3.31)

Then the columns of the model matrix consist of φ_0(x), . . . , φ_{n−1}(x), with p = n. We seek an n-point set Θ_n = {x_1, . . . , x_n} such that the S value of the square (n × n) model matrix A is maximized, i.e.,

Θ_n^S = argmax_{Θ_n ⊂ [−1,1]} S(A(Θ_n)).   (3.32)

Note that since the matrix is square, we have

S(A) = ( |det A| / ∏_{i=1}^n ‖A^(i)‖ )^{1/n}.   (3.33)

This resembles the so-called Fekete points, which maximize the determinant of the Vandermonde-like matrix, which is indeed the square model matrix in this case. That is,

Θ_n^F = argmax_{Θ_n ⊂ [−1,1]} |det(A(Θ_n))|.   (3.34)

The Fekete points are used to obtain stable polynomial interpolation (cf. [13]). It is well known that the empirical measure of the Fekete points converges to the equilibrium measure, which is the arcsine distribution on [−1, 1],

ν^F(x) = lim_{n→∞} ν_n^F(x) = 1 / (π √(1 − x²)).   (3.35)

For notational convenience, let

‖φ_k(Θ_n)‖_2² = Σ_{i=1}^n φ_k(x_i)²,

where Θ_n = {x_i}_{1≤i≤n} and φ_k(x) is a basis polynomial. The following theorem shows that the asymptotic distribution of the quasi-optimal set from (3.32) is the same as that of the Fekete points (3.34).

Theorem 3.4.1. Let the n-point set Θ_n^S be determined by (3.32), where the matrix A is constructed using the orthonormal basis (3.29) of degree up to (n − 1) with measure µ(x) on x ∈ [−1, 1]. Assume the basis functions satisfy that ‖φ_k‖_∞ grows at most polynomially with respect to the degree k and that ‖φ_k(Θ_n^S)‖_2 is uniformly bounded from below for all n. Let Θ_n^F be the Fekete points (3.34) and assume the absolute determinant |det A(Θ_n^F)| is uniformly bounded from below for all n. Then the empirical measure ν_n of the quasi-optimal set Θ_n^S satisfies

ν^S(x) = lim_{n→∞} ν_n(x) = ν^F(x)  (weak-*),   (3.36)

where ν^F is the equilibrium measure for the Fekete points (3.35).

The proof can be found in Appendix A.3. In Fig. 3.1, we show the convergence of the empirical measures for a variety of orthogonal polynomials on [−1, 1]. These include the Legendre polynomial, the Chebyshev polynomial, and the Jacobi polynomials with parameters (α, β) = (1, 1) and (0, 2). The quasi-optimal points are computed using the greedy algorithm (3.26) over a candidate set of 100,000 uniformly distributed points in [−1, 1]. On the left of Fig. 3.1, we observe that at polynomial order n = 30, the empirical distributions of the quasi-optimal points from the four different orthogonal polynomials are all close to the asymptotic distribution, as indicated by Theorem 3.4.1. On the right of Fig. 3.1, the maximum norm differences between the empirical distributions and the asymptotic distribution are shown at increasing degree of polynomials. We clearly observe algebraic convergence of the empirical measures towards the asymptotic distribution.

Asymptotic distribution of over-determined systems

We present the asymptotic behavior of the quasi-optimal subset for over-determined systems

with model matrix A ∈ R^{m×p}, m > p. We set the approximation space to be the total degree



Figure 3.1: Empirical distribution of the quasi-optimal set for a determined system. Left: the empirical measures at order n = 30 for four different types of orthogonal polynomials; Right: maximum norm differences between the empirical measures and the asymptotic distribution at increasing degrees of orthogonal polynomials.

polynomial space, whose dimension is p = dim P_n^d = (n+d choose d) for d ≥ 1. Let the orthonormal polynomials {φ_j(x)}_{1≤j≤p} satisfying (3.29) be its basis. In order to discuss the asymptotic behavior, we now extend the definition of S to matrices with an infinite number of rows.

Definition 3.4.2. Let A_∞ be a matrix of size ∞ × p and S_k be a k-row selection matrix, where k > p. Then

S(A_∞) = lim inf_{k→∞} ( sup_{S_k} S(S_k A_∞) ).   (3.37)

Theorem 3.4.2. Assume that for each degree n ≥ 1, there exists a candidate set of infinitely many

points Θ_{∞,n} such that S(A(Θ_{∞,n})) = 1. Then there exists a sequence {Θ_{k,n}}, where k > p = (n+d choose d), such that

lim_{k→∞} S(A(Θ_{k,n})) = 1,   (3.38)

where Θ_{k,n} is the k-point quasi-optimal subset. Let ν_{k,n} be its empirical measure. Then, for the sequence of measures ν_{k,n}, n ∈ N, there exists a convergent subsequence ν_{k_i,n_j} such that

ν^*(x) = lim_{j→∞} lim_{i→∞} ν_{k_i,n_j}(x) = ω(x),   (3.39)

where ω is the measure in the orthogonality (3.29).

The proof can be found in Appendix A.4. Numerical tests were conducted for four different orthogonal polynomials, as shown in Fig. 3.2 for the Legendre polynomial (left) and the Chebyshev polynomial (right), and in Fig. 3.3 for the Jacobi polynomials with (α, β) = (1, 1) (left) and (α, β) = (0, 2) (right). At order n = 30, it can be seen that the empirical measures are close to the corresponding continuous measure ω from the respective orthogonality relations. The maximum norm differences between the empirical measures and the corresponding continuous measures ω are shown in Fig. 3.4, for increasing order of polynomials.

Algebraic convergence can be observed for all four different cases.


Figure 3.2: Empirical distribution of the quasi-optimal set for over-determined system at polynomial order n = 30. Left: Legendre polynomial; Right: Chebyshev polynomial.


Figure 3.3: Empirical distribution of the quasi-optimal set for over-determined system at polynomial order n = 30. Left: Jacobi polynomial with (α, β) = (1, 1); Right: Jacobi polynomial with (α, β) = (0, 2).


Figure 3.4: Maximum norm difference between the empirical measures of the quasi-optimal sets and the measures in the orthogonality (3.29), for over-determined systems obtained by different orthogonal polynomials at increasing degrees.

3.5 Near optimal subset: quasi-optimal subset for Christoffel least squares

We now present another least squares strategy which combines the Christoffel least squares method (CLS) [66] and the quasi-optimal point selection method (S-optimality).

Let us first briefly review CLS [66]. Let {φ_j(x)} be orthogonal polynomials satisfying

∫_D φ_i(x) φ_j(x) dω = δ_ij

and P_n^d = span{φ_j(x), 1 ≤ j ≤ p}, where p = dim P_n^d. The Christoffel least squares method is a weighted least squares method whose weights come from the (normalized) Christoffel function λ_n(x) (2.42),

λ_n(x) = p / Σ_{i=1}^p φ_i²(x).

More importantly, the samples are drawn i.i.d. from the equilibrium measure µ_D (2.43) (not from the underlying orthogonal probability measure ω). Let A be the model matrix constructed by sampling the orthogonal polynomial basis {φ_j(x), 1 ≤ j ≤ p} at m i.i.d. samples Θ_m = {x_i}_{1≤i≤m} from µ_D, and let y = [f(x_1), ··· , f(x_m)]^T. Then CLS solves the following weighted least squares problem (2.15):

min_{c∈R^p} ‖√W Ac − √W y‖_2,

where

W = diag(w_1, . . . , w_M),  w_i = λ_n(x_i),  {x_i}_{i=1}^M ∼ µ_D.   (3.40)

In the setting of CLS, the quasi-optimal subset selection procedure can easily be applied.

Let Θ_M = {x_1, . . . , x_M} be the candidate sample point set drawn from the equilibrium measure µ_D (2.43), and let Θ_m = {x_{i_1}, . . . , x_{i_m}} ⊂ Θ_M be an arbitrary m-point subset with m ≥ p. Let √W A(Θ_m) be the matrix constructed by a family of orthogonal polynomials, a set of samples Θ_m, and the corresponding Christoffel function weights. We then have a "near optimal" point set, defined as follows:

Θ_m^S = argmax_{Θ_m ⊂ Θ_M} S(√W A(Θ_m)).   (3.41)

Remark 3.5.1. We remark that by using the CLS method, the full-size M-point model matrix for the least squares problem achieves much better stability, as detailed in [66]. Once this better-conditioned model matrix is constructed by the CLS method, the S-optimality strategy ([84]) allows one to determine the near optimal smaller subset of m points such that the m-point least squares solution is almost the closest to the (unavailable) full-set CLS solution. It is in this sense that we term the current method "near optimal", although true optimality cannot be proved.

The greedy algorithm is readily constructed, similar to (3.26).

Given the current selection of k ≥ p samples, which results in the set Θ_k = {x_{i_1}, . . . , x_{i_k}}, find the next sample point such that

x_{i_{k+1}} = argmax_{x ∈ Θ_M \ Θ_k} S(√W A(Θ_k ∪ {x})),  Θ_{k+1} = Θ_k ∪ {x_{i_{k+1}}}.   (3.42)

Again, since the quantity S is defined only for matrices of size k × p with k ≥ p, the greedy algorithm requires specification of the initial p points to start. Here we discuss two options to determine the initial p points. One option is to proceed in a similar manner as in (3.27). That is, for any k < p, let √W A^sq(Θ_k) be the k × k square matrix consisting of the rows corresponding to Θ_k and the first k columns of √W A_M = √W A(Θ_M).

For a given set of k points Θ_k with 1 ≤ k < p, find

x_{k+1} = argmax_{x ∈ Θ_M \ Θ_k} S(√W A^sq(Θ_k ∪ {x})),  Θ_{k+1} = Θ_k ∪ {x_{k+1}}.   (3.43)

This algorithm starts with an initial choice of point x1, which can be a random choice. It stops when k = p and resumes the standard greedy algorithm (3.42).

Another choice is to use the concept of Fekete points, which is used in polynomial interpolation theory. The Fekete points maximize the determinant of the Vandermonde-like matrix. Using the same principle, we can find the first p sample points Θ_p that maximize the determinant of the p × p square matrix √W A^sq(Θ_p), i.e.,

Θ_p = argmax_{Θ_p ⊂ Θ_M} |det √W A^sq(Θ_p)|.   (3.44)

The motivation for using the Fekete points is that the asymptotic distribution of the Fekete points in a bounded domain is known to be the equilibrium measure [10]. This is the same property as that of the near optimal samples from Theorems 3.5.1 and 3.5.2. More importantly, the Fekete points can be easily computed using a greedy algorithm, resulting in the approximate Fekete points (AFP) [13]. The algorithm for computing the AFP is remarkably simple. This results in a much reduced computational effort for determining the initial p points. Our numerical examples demonstrate that using the AFP as the initial p points results in negligible difference in the final result, with the square-matrix approach being slightly better.

Let S_m^S be the row selection operator corresponding to the point set Θ_m^S (3.41). Then the near-optimal subset results in a small-size least squares problem,

min_c ‖S_m^S √W A_M c − S_m^S √W y_M‖_2.   (3.45)

3.5.1 Asymptotic distribution of near optimal points

We now discuss the asymptotic distributions of the near optimal samples. For a set of m points Θ_m^S selected by (3.41), its empirical distribution is

ν_m(x) := (1/m) Σ_{z ∈ Θ_m^S} δ_z(x).

We are interested in the asymptotic limit of ν_m as m → ∞. Let us restrict to the total degree polynomial space (2.30) and use the subscript n to explicitly denote the degree. Thus p_n = dim P_n^d = (n+d choose d). Let

ψ_{i,n}(x) = √(λ_n(x)) φ_i(x),  i = 1, . . . , p_n,   (3.46)

where λ_n(x) is the (normalized) Christoffel function (2.42). These form a non-polynomial orthonormal basis satisfying

∫_D ψ_{i,n}(x) ψ_{j,n}(x) ṽ_n(x) dx = δ_ij,  ṽ_n(x) = ω(x)/λ_n(x).   (3.47)

For bounded domains, the following result holds.

Theorem 3.5.1 (Bounded). Suppose that for each n, there exists a candidate set of infinitely many points Θ_{∞,n} with model matrix A_{∞,p_n} such that S(√W A_{∞,p_n}) = 1. Then there exists a sequence {Θ_{k,n}}_{k>p_n} with empirical measures ν_{k,n}, which contains a subsequence ν_{k_l,n} that converges to a measure ν_n^* satisfying

lim_{l→∞} ∫_D ψ_{i,n}(x) ψ_{j,n}(x) ν_{k_l,n}(x) dx = ∫_D ψ_{i,n}(x) ψ_{j,n}(x) ν_n^*(x) dx = δ_ij.   (3.48)

The sequence {ν_n^*} converges to v, i.e.,

lim_{n→∞} ν_n^*(x) = v(x) = dµ_D/dx,  a.e.,   (3.49)

where µ_D is the equilibrium measure of D.

Proof. From Theorem 3.4.2 and (3.48), we have

δ_ij = ∫_D ψ_{i,n}(x) ψ_{j,n}(x) ν_n^*(x) dx = ∫_D φ_i(x) φ_j(x) ω_n(x) dx,   (3.50)

where

ω_n(x) = λ_n(x) ν_n^*(x).   (3.51)

From (2.44) and the uniqueness of orthogonal polynomial probability measure,

ω(x) = lim_{n→∞} ω_n(x) = lim_{n→∞} λ_n(x) ν_n^*(x) = (ω(x)/v(x)) lim_{n→∞} ν_n^*(x),  a.e.   (3.52)

Therefore, ν^* := lim_{n→∞} ν_n^* = v.

Numerical tests were conducted to verify the theoretical estimates. Here we present the results obtained with the Legendre polynomial, the Chebyshev polynomial, and the Jacobi polynomials with parameters (α, β) = (1, 1) and (0, 2). The near optimal points are computed using the greedy algorithm from [84], described in Chapter 3.2.1, over a candidate set of 100,000 uniformly distributed points in [−1, 1]. On the left of Figure 3.5, we observe that at polynomial order n = 100, the empirical distributions of the points from the four different orthogonal polynomials are all close to the asymptotic distribution. The maximum norm differences between the empirical distributions and the asymptotic distribution are shown on the right of Fig. 3.5, for increasing order of polynomials. Algebraic convergence can be observed for all four cases.


Figure 3.5: Empirical distribution of the near optimal set for an over-determined system. Left: the empirical measures at order n = 100 for four different types of orthogonal polynomials; Right: maximum norm differences between the empirical measures and the asymptotic distribution at increasing degrees of orthogonal polynomials.

The following theorem concerns the asymptotic distribution in unbounded domains.

Theorem 3.5.2 (Unbounded). Suppose that for each n, there exists a candidate set of infinitely many points Θ̄_{∞,n} ⊂ S_w such that the model matrix A_{∞,p_n} using the scaled point set n^{1/t} Θ̄_{∞,n} satisfies S(√W A_{∞,p_n}) = 1. Here p_n = (n+d choose d) is the cardinality of the polynomial space. Then there exists a sequence {n^{1/t} Θ̄_{k,n}}_{k>p_n} with empirical measures {ν_{k,n}}_{k>p_n}, which contains a convergent subsequence {ν_{k_l,n}}_{l≥1} that converges to a measure ν_n^* satisfying

lim_{l→∞} ∫_D ψ_{i,n}(x) ψ_{j,n}(x) ν_{k_l,n}(x) dx = ∫_D ψ_{i,n}(x) ψ_{j,n}(x) ν_n^*(x) dx = δ_ij.   (3.53)

Let µ_n^*(z) := n^{d/t} ν_n^*(n^{1/t} z). Then the sequence {µ_n^*}_{n≥1} converges to the weighted equilibrium measure µ_{D,Q}, i.e.,

lim_{n→∞} µ_n^*(z) = v(z) = dµ_{D,Q}/dz,  a.e.   (3.54)

Proof. By assumption, for each n there exists a candidate set of infinitely many points Θ̄_{∞,n} ⊂ S_w such that S(√W A_{∞,n}) = 1. By Theorem 3.4.2, we have

δ_ij = ∫_D ψ_{i,n}(x) ψ_{j,n}(x) ν_n^*(x) dx = ∫_D φ_i(x) φ_j(x) (p_n / K_n(x)) ν_n^*(x) dx.

By setting w_n(x) := (p_n / K_n(x)) ν_n^*(x), we can write

δ_ij = ∫_D φ_i(x) φ_j(x) w_n(x) dx.

It then follows from the uniqueness of the orthogonal probability measure that

lim_{n→∞} (p_n / K_n(x)) ν_n^*(x) = lim_{n→∞} w_n(x) = w(x).   (3.55)

Substituting the change of variable x = n^{1/t} z into the measure ν_n^*(x)dx produces

ν_n^*(x)dx = n^{d/t} ν_n^*(n^{1/t} z)dz = µ_n^*(z)dz,  where µ_n^*(z) := n^{d/t} ν_n^*(n^{1/t} z).

From Theorem 2.6.1, we have

lim_{n→∞} K_n^{(n)}(z) w^n(z) / p_n = dµ_{D,Q}/dz.   (3.56)

Therefore, by combining (3.55) and (3.56), it follows that

lim_{n→∞} (p_n / (K_n(x) w(x))) ν_n^*(x) = lim_{n→∞} (p_n / (K_n^{(n)}(z) w^n(z))) µ_n^*(z) = 1.

This implies

lim_{n→∞} µ_n^*(z) = dµ_{D,Q}/dz.

3.6 Summary

Here we summarize least squares methods using orthogonal polynomials and the quasi- optimal sampling strategies. We again remark that the entire procedure for solving either

(3.25) or (3.41) is independent of the data vector yM . Upon conducting the iid samples from either the underlying orthogonal probability measure ω (3.29) or the equilibrium measure

µD (2.43), one solves either (3.25) or (3.41) to decide the m-point subset. It is only after † this step that one is required to collect the function data at the m select points (either Θm

S or Θm). Therefore, the problem of either (3.25) or (3.41) results in a “smaller” least squares problem of size m × p, and the full data vector yM is never required.

3.6.1 Quasi optimal sampling for ordinary least squares

The quasi optimal sampling strategy consists of the following steps.

• For a given bounded domain D with a measure ω, draw M ≫ 1 i.i.d. random samples from the underlying orthogonal probability measure ω (2.2).

• Construct the model matrix A_M.

• For any given number m < M, identify the quasi-optimal subset by applying the S-optimality strategy (3.25) to the model matrix A_M.

• Solve the ordinary least squares problem (3.28), restricted to the subset of m points.

57 3.6.2 Near optimal sampling for Christoffel least squares

The near optimal sampling strategy consists of the following steps:

• For a given domain D with a measure ω, draw M ≫ 1 i.i.d. random samples from the equilibrium measure µ_D (2.43).

• Construct the weighted model matrix √W A_M (3.40).

• For any given number m < M, identify the quasi-optimal subset by applying the S-optimality strategy (3.42) to the weighted model matrix √W A_M.

• Solve the weighted least squares problem (3.45), restricted to the subset of m points.

It is worth noting that the construction of the weighted model matrix √W A_M can be easily accomplished by normalizing each row of the original unweighted model matrix A_M and then rescaling it by a factor of √p, i.e.,

A_(i) → √p A_(i)/‖A_(i)‖,  i = 1, . . . , M,

where A_(i) is the i-th row of A_M. This is because each row of the weighted model matrix √W A_M is

√(w_i) A_(i) = √(p/K_n(x_i)) A_(i),  i = 1, . . . , M,

where

‖A_(i)‖ = √( Σ_{k=1}^p φ_k(x_i)² ) = √(K_n(x_i)).
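A small sketch of the row-normalization identity above, using a random stand-in model matrix for illustration: √W A_M can be formed by normalizing each row and scaling by √p, matching the direct construction with w_i = p / K_n(x_i).

import numpy as np

def weighted_model_matrix(A_M):
    row_norms = np.linalg.norm(A_M, axis=1, keepdims=True)   # ||A_(i)|| = sqrt(K_n(x_i))
    p = A_M.shape[1]
    return np.sqrt(p) * A_M / row_norms

A_M = np.random.randn(1000, 10)                              # stand-in model matrix
W = np.diag(A_M.shape[1] / np.sum(A_M**2, axis=1))           # w_i = p / K_n(x_i)
print(np.max(np.abs(np.sqrt(W) @ A_M - weighted_model_matrix(A_M))))   # ~0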

CHAPTER 4

ADVANCED SPARSE APPROXIMATION

In this chapter we present an optimization problem of minimizing ℓ1−ℓ2 under linear constraints. That is,

\min_{x\in\mathbb{R}^p}\|x\|_1 - \|x\|_2 \quad \text{subject to} \quad Ax = b,

where A is a matrix of size m × p, b is a vector of size m × 1, and m < p. The ℓ1−ℓ2 minimization is designed to find a sparse solution in the feasible set {x | Ax = b}. This

problem was first proposed in [59] and then studied in [105]. Here a further theoretical study is presented, which improves and extends the theory of ℓ1−ℓ2 minimization. Theoretical estimates regarding its recoverability for sparse signals are improved, and similar theoretical results are extended to non-sparse signals. This extension includes the analysis of (sparse) Legendre polynomial approximations via (weighted) ℓ1−ℓ2 minimization. We also remark that this sparse function approximation has an immediate application to stochastic collocation in UQ. Various numerical examples are presented in Chapter 7.2.

We observe that the ℓ1−ℓ2 minimization seems to produce results that are consistently better than the ℓ1 minimization, although a rigorous proof of this is lacking.

4.1 Review of ℓ1-ℓ2 minimization

We consider the standard recovery problem for the underdetermined system Ax = b, where the measurement matrix A ∈ R^{m×p} with m < p. In many practical problems, the system is severely underdetermined with m ≪ p.

The ℓ1−ℓ2 minimization method proposed in [59, 105] seeks to find a sparse solution to the underdetermined system Ax = b by solving

\min_{x\in\mathbb{R}^p}\|x\|_1 - \|x\|_2 \quad \text{subject to} \quad Ax = b.   (4.1)

If the data b contain noise, the corresponding “de-noising” version is

\min_{x\in\mathbb{R}^p}\|x\|_1 - \|x\|_2 \quad \text{subject to} \quad \|Ax - b\|_2 \le \tau,   (4.2)

where τ is related to the noise level in b.
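As a rough illustration of how (4.1) and (4.2) might be solved numerically (the text defers numerics to Chapter 7.2 and does not prescribe an algorithm here), the sketch below uses a difference-of-convex iteration that linearizes the −‖x‖_2 term at the current iterate; this particular scheme, the generic convex solver, and the synthetic test signal are assumptions for illustration only.

```python
import numpy as np
import cvxpy as cp

def l1_minus_l2(A, b, tau=0.0, n_iter=15):
    """Sketch: solve min ||x||_1 - ||x||_2 s.t. Ax = b (or ||Ax-b||_2 <= tau)."""
    m, p = A.shape
    x = np.zeros(p)
    for _ in range(n_iter):
        nx = np.linalg.norm(x)
        g = x / nx if nx > 0 else np.zeros(p)          # subgradient of ||x||_2
        xv = cp.Variable(p)
        constraints = [A @ xv == b] if tau == 0 else [cp.norm(A @ xv - b, 2) <= tau]
        # convex subproblem: minimize ||x||_1 - <g, x>
        cp.Problem(cp.Minimize(cp.norm1(xv) - g @ xv), constraints).solve()
        x = xv.value
    return x

# tiny usage example with a synthetic sparse signal
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 100))
x_true = np.zeros(100); x_true[[3, 30, 77]] = [1.5, -2.0, 0.7]
x_rec = l1_minus_l2(A, A @ x_true)
```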

Remark 4.1.1. The standard (and popular) approach to finding a sparse solution is compressive sensing (CS), cf. [24, 22, 34]. Although sparsity is naturally measured by the ℓ0 norm, the ℓ1 norm is widely adopted, for it leads to a constrained optimization problem

\min_x \|x\|_1 \quad \text{subject to} \quad Ax = b   (4.3)

that can be solved as a convex optimization. (The use of the ℓ0 norm leads to an NP-hard optimization problem [69].)

As discussed in [59, 105], the motivation for using ℓ1−ℓ2 is that, since the contours of ‖x‖_1 − ‖x‖_2 are closer to those of ‖x‖_0, the ℓ1−ℓ2 formulation should be more sparsity-promoting than the use of the ℓ1 norm. The recoverability of the ℓ1−ℓ2 minimization problem (4.1) has been studied in [59, 105] in the context of exact recovery, which assumes the true solution is s-sparse. The major results from [59, 105] are summarized as follows.

Let x̄ be a vector with sparsity s satisfying

a(s) = \left(\frac{\sqrt{3s} - 1}{\sqrt{s} + 1}\right)^2 > 1,   (4.4)

and suppose A satisfies the condition

\delta_{3s} + a(s)\,\delta_{4s} < a(s) - 1.   (4.5)

Then, the following results hold ([105]),

• Let b = Ax̄; then x̄ is the unique solution to (4.1).

• Let b = Ax̄ + e, where e ∈ R^m is any perturbation with ‖e‖_2 ≤ τ; then the solution x^{opt} to (4.2) satisfies ‖x^{opt} − x̄‖_2 ≤ C_s τ, where

C_s = \frac{2\sqrt{1 + a(s)}}{\sqrt{a(s)}\,(1 - \delta_{4s}) - \sqrt{1 + \delta_{3s}}} > 0.   (4.6)

4.2 Recovery properties of ℓ1-ℓ2 minimization

We present two sets of recovery results for the ℓ1−ℓ2 minimization: one for exact recovery of sparse functions, which improves the results in [105], and the other for recovery of non-sparse functions, which was not available before.

Exact recovery of sparse solutions

Theorem 4.2.1. Let x̄ be a vector with sparsity s satisfying

a(s) = \left(\frac{3s - 1}{\sqrt{3s} + \sqrt{4s - 1}}\right)^2 > 1,   (4.7)

and suppose A satisfies the condition

\delta_{3s} + a(s)\,\delta_{4s} < a(s) - 1.

Then, the following results hold

• Let b = Ax̄; then x̄ is the unique solution to (4.1).

• Let b = Ax̄ + e, where e ∈ R^m is any perturbation with ‖e‖_2 ≤ τ; then the solution x^{opt} to (4.2) satisfies ‖x^{opt} − x̄‖_2 ≤ C_s τ, where

C_s = \frac{2\big(\sqrt{3s} - \sqrt{s\,a(s)}\big)}{\sqrt{a(s)}\,(1 - \delta_{4s}) - \sqrt{1 + \delta_{3s}}}.

The proof can be found in Appendix B.1. This result is very similar to that from [105], summarized in (4.4)–(4.6). Upon comparing the constant a(s) in (4.7) against that in (4.4), the current result represents an improvement over that from [105]: the condition (4.7) is satisfied for s > 2, whereas the condition (4.4) is satisfied only for s > 7. Thus the new condition is missing only the case of s = 2.

We now present another estimate for the exact recovery, using the more natural measure δ_s instead of δ_{3s} + a(s)δ_{4s}.

Theorem 4.2.2. Let x̄ be any vector with sparsity s satisfying

\delta_s < \frac{1}{1 + c(s)},   (4.8)

where

c(s) := \sqrt{5} + \frac{9}{4\sqrt{5}\,s} + \frac{9}{2\sqrt{5}}\sqrt{\frac{1}{4s^2} + \frac{2}{s}}.

Then, the following results hold.

• Let b = Ax̄; then the solution x^{opt} to (4.1) is exact, i.e., x^{opt} = x̄.

• Let b = Ax̄ + e, where e ∈ R^m is any perturbation with ‖e‖_2 ≤ τ; then the solution x^{opt} to (4.2) satisfies

\|x^{opt} - \bar{x}\|_2 \le \tilde{C}_s\,\tau,   (4.9)

where

\tilde{C}_s = \frac{2\beta(s)\sqrt{1 + 1/(1 + c(s))}}{1 - (1 + c(s))\,\delta_s}, \qquad \beta(s) = \sqrt{\frac{1}{4s}} + \sqrt{\frac{1}{4s} + 2}.   (4.10)

The proof can be found in Appendix B.2. The derivation uses a different technique from that in Theorem 4.2.1 and in [105]. This allows us to extend the analysis to the recovery of non-sparse signals, as presented in the following section.

Recovery of non-sparse solutions

To quantify the recovery accuracy for a general non-sparse solution x̄, it is customary to measure the solution against its "best" sparse version x̄_s, which is the s-sparse vector containing only the s largest entries (in absolute value) of x̄.

Theorem 4.2.3. Let x̄ ∈ R^n be an arbitrary vector and x̄_s be the s-sparse vector containing only the s largest entries (in absolute value) of x̄. Let us assume

\delta_{2s} < \frac{1}{1 + \sqrt{2} + \frac{2\sqrt{2}}{\sqrt{s} - 1}}.   (4.11)

Then, the following results hold.

• Let b = Ax̄; then the solution x^{opt} to (4.1) satisfies

\|x^{opt} - \bar{x}\|_2 \le C_{2,s}\,\|\bar{x} - \bar{x}_s\|_1.   (4.12)

• Let b = Ax̄ + e, where e ∈ R^m is any perturbation with ‖e‖_2 ≤ τ; then the solution x^{opt} to (4.2) satisfies

\|x^{opt} - \bar{x}\|_2 \le C_{1,s}\,\tau + C_{2,s}\,\|\bar{x} - \bar{x}_s\|_1.   (4.13)

Here, the constants are

C_{1,s} = \frac{4\sqrt{s(1 + \delta_{2s})}}{\sqrt{s} - 1 - \big((\sqrt{2} + 1)\sqrt{s} + \sqrt{2} - 1\big)\delta_{2s}}, \qquad
C_{2,s} = \frac{2 + (2\sqrt{2} - 2)\,\delta_{2s}}{\sqrt{s} - 1 - \big((\sqrt{2} + 1)\sqrt{s} + \sqrt{2} - 1\big)\delta_{2s}}.   (4.14)

The proof can be found in Appendix B.3. Note that by using the standard CS notation for the error of the best s-term approximation of any vector x ∈ R^n, i.e.,

\sigma_{s,p}(x) = \inf_{\|y\|_0 \le s}\|y - x\|_p,   (4.15)

the term ‖x̄ − x̄_s‖_1 in the theorem is σ_{s,1}(x̄).

Remark 4.2.1. It must be pointed out that the estimates in Theorems 4.2.2 and 4.2.3 are derived using the same approach developed in the standard CS literature. It is worthwhile to compare the current estimates for the ℓ1−ℓ2 minimization against the estimates for standard CS using ℓ1 minimization. For the ℓ1 minimization, [17] derived the estimate δ_s < \frac{1}{1 + \sqrt{5}} for sparse solutions, and [21] derived the estimate δ_{2s} < \sqrt{2} − 1 for non-sparse solutions. The current corresponding estimates for the ℓ1−ℓ2 minimization are (4.8) and (4.11), respectively. It is evident that the estimates for the ℓ1−ℓ2 minimization are worse than those for the ℓ1 minimization. The work of [59, 105] also arrived at similarly worse (compared to the ℓ1 minimization) estimates for ℓ1−ℓ2 minimization, even though our current estimates notably extend those in [59, 105]. On the other hand, the numerical examples in this work and in [59, 105] demonstrate that the ℓ1−ℓ2 minimization consistently outperforms the ℓ1 minimization for a variety of problems. This leads to the conviction that the current estimates for the ℓ1−ℓ2 minimization may be overly conservative. New estimates, perhaps not in the popular form of an RIP condition, may be required to fully understand the properties of the ℓ1−ℓ2 minimization. This shall be pursued in a separate study and reported later.

4.3 Function approximation by Legendre polynomials via ℓ1-ℓ2

Let P_n^d be the total degree polynomial space of degree up to n (2.30), whose dimension is

p = \dim(P_n^d) = \binom{n + d}{d}.   (4.16)

Let {φ_j(z)} be an orthogonal polynomial system, i.e.,

\int \phi_i(z)\phi_j(z)\,d\omega(z) = \delta_{ij},   (4.17)

and let {φ_j(z)}_{1≤j≤p} be a basis for P_n^d. For an unknown function f(z) ∈ L^2_ω, we seek to construct an approximation f_n(z) ∈ P_n^d of the form

f_n(z) = \sum_{j=1}^{p} c_j\phi_j(z)   (4.18)

using samples of f, f(z_i), i = 1, ··· , m. By enforcing f_n(z_i) = f(z_i) (interpolation) or f_n(z_i) ≈ f(z_i), we arrive at a system of equations for the expansion coefficients, Ac = y or Ac ≈ y, where c = (c_1, . . . , c_p)^T is the expansion coefficient vector, y = (f(z_1), . . . , f(z_m))^T is the sample vector, and

A = (a_{ij})_{1\le i\le m,\,1\le j\le p}, \qquad a_{ij} = \phi_j(z_i),   (4.19)

is the Vandermonde-like matrix. Note that polynomial spaces other than P_n^d can be chosen, e.g., the tensor product space. Here we focus on P_n^d (2.30) because it is the most common choice.

To be consistent with the notations in the previous sections, hereafter we will use x to

denote the coefficient vector c and b the data vector y. The system Ac = y is then rewritten

as Ax = b.
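A small sketch of assembling the Vandermonde-like matrix A in (4.19) for the total-degree Legendre basis follows. The orthonormal normalization √(2k+1) P_k (with respect to the uniform probability measure on [−1,1]) and the values of d, n, and m below are illustrative assumptions.

```python
import numpy as np
from itertools import product

d, n, m = 2, 3, 50
idx = [k for k in product(range(n + 1), repeat=d) if sum(k) <= n]   # multi-index set T_n
p = len(idx)                                                        # p = C(n+d, d)

def leg1d(k, x):
    """Orthonormal Legendre polynomial of degree k on [-1,1] with weight dx/2."""
    c = np.zeros(k + 1); c[k] = 1.0
    return np.sqrt(2 * k + 1) * np.polynomial.legendre.legval(x, c)

rng = np.random.default_rng(0)
z = rng.uniform(-1, 1, size=(m, d))          # i.i.d. samples from the uniform measure
A = np.array([[np.prod([leg1d(k[l], z[i, l]) for l in range(d)]) for k in idx]
              for i in range(m)])            # a_ij = phi_j(z_i), size m x p
```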

The orthogonal polynomial system {φj(z)} is called a bounded orthonormal system if it is uniformly bounded, for some K ≥ 1,

\sup_j \|\phi_j\|_\infty \le K.   (4.20)

We now recall the following result for bounded orthonormal systems.

Theorem 4.3.1 ([78, 79]). Let A ∈ R^{m×p} be the Vandermonde-like matrix (6.2), whose entries φ_j(z_i) are from a bounded orthonormal system (4.17) satisfying (4.20), and let the points z_i, i = 1, ··· , m, be i.i.d. random samples from the measure µ. Assume

m \ge C\,\delta^{-2} K^2 s \log^3(s)\log(n),   (4.21)

then with probability at least 1 − n^{−γ log^3(s)}, the RIC δ_s of \frac{1}{\sqrt{m}}A satisfies δ_s ≤ δ. Here C, γ > 0 are universal constants.

Using this property of the Vandermonde-like matrix A, we now present the recoverability result of the ℓ1−ℓ2 minimization for Legendre approximation, i.e., when {φ_j(z)} are Legendre polynomials.

Exact recovery of sparse functions

We first study the recovery properties for sparse target functions. Let us assume the target function f(z) consists of s non-zero terms in its Legendre expansion. That is,

f(z) = \sum_{j\in T}\bar{x}_j\phi_j(z), \qquad |T| = s.   (4.22)

Let f_n ∈ P_n^d be a Legendre approximation of f in the form of (4.18). We then have

\|f(z) - f_n(z)\|_\omega = \|\bar{x} - x\|_2, \qquad x = [x_1, \cdots, x_p]^T,   (4.23)

owing to the orthonormality of the basis functions. Note that ‖·‖_ω is the norm defined in (2.1). Here it is assumed that T ⊆ T_n, the multi-index set for the polynomial space P_n^d. We now consider the practical case of high dimensionality, where d ≥ n. Note that for even moderately high dimensions d > 1, one often cannot afford high-order polynomial expansions, and d ≥ n is often the case.

Theorem 4.3.2. Let A be the Vandermonde-like matrix constructed using the Legendre polynomials in P_n^d and m sampling points {z_i}_{1≤i≤m} from the uniform measure. Suppose d ≥ n and

m \ge C\,\delta^{-2}\,3^n s\log^3(s)\log(p),   (4.24)

where

\delta < \frac{1}{1 + c(s)}, \qquad c(s) = \sqrt{5} + \frac{9}{4\sqrt{5}\,s} + \frac{9}{2\sqrt{5}}\sqrt{\frac{1}{4s^2} + \frac{2}{s}},

and C > 0 is universal. Let the target function be constructed as (4.22), with the s-sparse

coefficient vector x̄ satisfying T ⊆ T_n. Then, with a probability exceeding 1 − p^{−γ log^3(s)}, the following results hold.

• Let b = Ax̄; then the solution x^{opt} to (4.1) is exact, i.e., x^{opt} = x̄.

• Let b = Ax̄ + e, where e ∈ R^m is any perturbation with ‖e‖_2 ≤ τ; then the solution x^{opt} to (4.2) satisfies

\|x^{opt} - \bar{x}\|_2 \le \frac{\tilde{C}_s}{\sqrt{m}}\,\tau,   (4.25)

where \tilde{C}_s is from (4.10).

Proof. It is sufficient to derive (4.25), which immediately delivers the noiseless case by setting τ = 0. Since m ≥ Cδ^{−2} 3^n s log^3(s) log(p), and A is constructed using the Legendre polynomials and sample points from the uniform measure, then, with probability at least 1 − p^{−γ log^3(s)}, the restricted isometry constant δ_s of \frac{1}{\sqrt{m}}A satisfies δ_s ≤ δ < \frac{1}{1 + c(s)}. The constraint condition can be equivalently written as

\left\|\frac{1}{\sqrt{m}}Ax - \frac{1}{\sqrt{m}}b\right\|_2 \le \frac{1}{\sqrt{m}}\,\tau

without changing the solution. The proof then follows immediately from Theorem 4.2.2.

Recovery of non-sparse polynomial functions

For a general unknown function f_full, let f be its best approximation in the polynomial space P_n^d (2.30). Typically, f_full ∈ L^2_ω and f = P f_full is the orthogonal projection of f_full onto P_n^d. Note that if f_full ∈ P_n^d, then f_full = f. In general, the best approximation f is not sparse. That is, we have

f(z) = \sum_{j\in T_n}\bar{x}_j\phi_j(z), \qquad |T_n| = p = \dim P_n^d,   (4.26)

where T_n is the index set for P_n^d. We present the reconstruction results for f using the ℓ1−ℓ2 minimization, in the context of Legendre polynomials. Let x̄_s be the "best" s-sparse version of x̄. A sparse approximation is constructed by the ℓ1−ℓ2 minimization problem, (4.1) or (4.2), and the quality of the approximation is measured by comparing the reconstructed coefficient vector x to x̄_s.

Theorem 4.3.3. Let A be the Vandermonde-like matrix constructed using the Legendre polynomials in P_n^d and m sampling points {z_i}_{1≤i≤m} from the uniform measure. Suppose d ≥ n and

m \ge C\,\delta^{-2}\,3^n s\log^3(2s)\log(p),   (4.27)

where

\delta < \frac{1}{1 + \sqrt{2} + \frac{2\sqrt{2}}{\sqrt{s} - 1}},

and C > 0 is universal. Let the target function be a Legendre series (4.26) with an arbitrary coefficient vector x̄ ∈ R^p, p = dim P_n^d, and let x̄_s be its truncated vector corresponding to the s largest entries (in absolute value). Then, with a probability exceeding 1 − p^{−γ log^3(2s)}, the

following results hold.

• Let b = Ax̄; then the solution x^{opt} to (4.1) satisfies

\|x^{opt} - \bar{x}\|_2 \le C_{2,s}\,\|\bar{x} - \bar{x}_s\|_1.   (4.28)

• Let b = Ax̄ + e, where e ∈ R^m is any perturbation with ‖e‖_2 ≤ τ; then the solution x^{opt} to (4.2) satisfies

\|x^{opt} - \bar{x}\|_2 \le \frac{C_{1,s}}{\sqrt{m}}\,\tau + C_{2,s}\,\|\bar{x} - \bar{x}_s\|_1.   (4.29)

Here the constants C1,s and C2,s are from (4.14).

Proof. It is sufficient to derive (4.29), as (4.28) can be obtained by setting τ = 0. Since m ≥ Cδ^{−2} 3^n s log^3(2s) log(p), and A is constructed using the Legendre polynomials and sample points from the uniform measure, then, with probability at least 1 − p^{−γ log^3(2s)}, the restricted isometry constant δ_{2s} of \frac{1}{\sqrt{m}}A satisfies δ_{2s} ≤ δ < \frac{1}{1 + \sqrt{2} + \frac{2\sqrt{2}}{\sqrt{s} - 1}}. The constraint condition can be equivalently written as

\left\|\frac{1}{\sqrt{m}}Ax - \frac{1}{\sqrt{m}}b\right\|_2 \le \frac{1}{\sqrt{m}}\,\tau.

The conclusion then follows from Theorem 4.2.3.

4.4 Function approximation by Legendre polynomials via a weighted ℓ1-ℓ2

We consider a weighted version of the ℓ1−ℓ2 minimization. The counterparts to the noiseless ℓ1−ℓ2 minimization (4.1) and the de-noising ℓ1−ℓ2 minimization (4.2) are, respectively,

\min_{x\in\mathbb{R}^p}\|x\|_1 - \|x\|_2 \quad \text{subject to} \quad WAx = Wb,   (4.30)

\min_{x\in\mathbb{R}^p}\|x\|_1 - \|x\|_2 \quad \text{subject to} \quad \|WAx - Wb\|_2 \le \tau.   (4.31)

Here

W_{i,j} = (\pi/2)^{d/2}\prod_{\ell=1}^{d}\big(1 - z_\ell^2\big)^{1/4}\,\delta_{ij}   (4.32)

is the tensored Chebyshev weight and τ is related to the noise level in Wb.
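A short sketch of forming the diagonal preconditioner (4.32) follows, reading the weight of row i as evaluated at the i-th sampling point's coordinates (the natural interpretation of the formula; the sample array and dimensions below are placeholders).

```python
import numpy as np

def chebyshev_weights(z):
    """Tensored Chebyshev weight (4.32): z is an (m, d) array of sampling points."""
    m, d = z.shape
    w = (np.pi / 2) ** (d / 2) * np.prod((1.0 - z**2) ** 0.25, axis=1)
    return np.diag(w)                       # W, so that W A and W b can be formed

rng = np.random.default_rng(0)
z = np.cos(np.pi * rng.random((30, 2)))     # Chebyshev (arcsine) samples on [-1,1]^2
W = chebyshev_weights(z)
```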

Exact recovery of sparse functions via a weighted ℓ1-ℓ2

We first consider the recovery of sparse target functions, i.e., f has the form of (4.22).

Theorem 4.4.1. Let A be the Vandermonde-like matrix constructed using the Legendre polynomials in P_n^d, d ≥ 1, and m sampling points {z_i}_{1≤i≤m} from the Chebyshev measure. Suppose

m \ge C\,\delta^{-2}\,2^d s\log^3(s)\log(p),   (4.33)

where

\delta < \frac{1}{1 + c(s)}, \qquad c(s) := \sqrt{5} + \frac{9}{4\sqrt{5}\,s} + \frac{9}{2\sqrt{5}}\sqrt{\frac{1}{4s^2} + \frac{2}{s}},

and C > 0 is universal. Let the target function be a Legendre series (4.22) with an s-sparse coefficient vector x̄ satisfying T ⊆ T_n. Then, with a probability exceeding 1 − p^{−γ log^3(s)}, the following results hold.

• Let Wb = WAx̄; then the solution x^{opt} to (4.30) is exact.

• Let Wb = WAx̄ + e, where e ∈ R^m is any perturbation with ‖e‖_2 ≤ τ; then the solution x^{opt} to (4.31) satisfies

\|x^{opt} - \bar{x}\|_2 \le \frac{\tilde{C}_s}{\sqrt{m}}\,\tau,   (4.34)

where \tilde{C}_s is from (4.10).

Proof. Since m ≥ Cδ^{−2} 2^d s log^3(s) log(p) and A is constructed using the Legendre polynomials and sample points from the Chebyshev measure, then, with probability at least 1 − p^{−γ log^3(s)}, the restricted isometry constant δ_s of \frac{1}{\sqrt{m}}WA satisfies δ_s ≤ δ < \frac{1}{1 + c(s)}. The constraint condition can be equivalently written as

\left\|\frac{1}{\sqrt{m}}WAx - \frac{1}{\sqrt{m}}Wb\right\|_2 \le \frac{1}{\sqrt{m}}\,\tau.

The conclusion then follows from Theorem 4.2.2.

Recovery of non-sparse functions via a weighted ℓ1-ℓ2

We present the recovery properties for general non-sparse target functions (4.26).

Theorem 4.4.2. Let A be the Vandermonde-like matrix constructed using the Legendre polynomials in P_n^d, d ≥ 1, and m sampling points {z_i}_{1≤i≤m} from the Chebyshev measure. Suppose

m \ge C\,\delta^{-2}\,2^d s\log^3(2s)\log(p),   (4.35)

where

\delta < \frac{1}{1 + \sqrt{2} + \frac{2\sqrt{2}}{\sqrt{s} - 1}},

and C > 0 is universal. Let the target function be a Legendre series (4.26) with an arbitrary coefficient vector x̄ ∈ R^p, p = dim P_n^d, and let x̄_s be its truncated vector corresponding to the s largest entries (in absolute value). Then, with a probability exceeding 1 − p^{−γ log^3(2s)}, the

following results hold.

• Let Wb = WAx̄; then the solution x^{opt} to (4.30) satisfies

\|x^{opt} - \bar{x}\|_2 \le C_{2,s}\,\|\bar{x} - \bar{x}_s\|_1.   (4.36)

• Let Wb = WAx̄ + e, where e ∈ R^m is any perturbation with ‖e‖_2 ≤ τ; then the solution x^{opt} to (4.31) satisfies

\|x^{opt} - \bar{x}\|_2 \le \frac{C_{1,s}}{\sqrt{m}}\,\tau + C_{2,s}\,\|\bar{x} - \bar{x}_s\|_1.   (4.37)

Here the constants C1,s and C2,s are from (4.14).

Proof. Since m ≥ Cδ^{−2} 2^d s log^3(2s) log(p) and A is constructed using the Legendre polynomials and sample points from the Chebyshev measure, then, with probability at least 1 − p^{−γ log^3(2s)}, the restricted isometry constant δ_{2s} of \frac{1}{\sqrt{m}}WA satisfies δ_{2s} ≤ δ < \frac{1}{1 + \sqrt{2} + \frac{2\sqrt{2}}{\sqrt{s} - 1}}. The constraint condition can be equivalently written as

\left\|\frac{1}{\sqrt{m}}WAx - \frac{1}{\sqrt{m}}Wb\right\|_2 \le \frac{1}{\sqrt{m}}\,\tau.

The conclusion then follows from Theorem 4.2.3.

4.5 Summary

In this chapter we study the ℓ1−ℓ2 minimization methods for sparse approximations of functions. In particular, we focus on the use of Legendre polynomials, which are widely used in UQ computations. We derive the recoverability properties of the ℓ1−ℓ2 minimization, for both the direct minimization and the Chebyshev preconditioned minimization.

The theoretical estimates indicate that in low dimensions the Chebyshev preconditioned

`1−`2 minimization should be preferred, whereas in high dimensions the straightforward unweighted `1−`2 minimization should be preferred. Numerical examples are provided in

Chapter 7.2 and verify these findings. The `1−`2 minimization methods seem to produce better results than the `1 minimization method for a variety of test problems. Although the theoretical proof is still lacking, it nevertheless can serve as a viable alternative for high dimensional function approximations.

CHAPTER 5

SEQUENTIAL APPROXIMATION

We introduce a sequential function approximation method. For an unknown target function f(x), the method sequentially updates an approximation using only a single random sample of f at a time. Unlike traditional function approximation approaches, this method never forms or stores a matrix in any sense. This results in a simple numerical implementation using only vector operations and avoids the need to store the entire data set. Thus the sequential method is particularly suitable when the data set is exceedingly large.

The sequential method is motivated by the randomized Kaczmarz (RK) method [88].

The RK method is a randomized iterative algorithm for solving (overdetermined) linear systems of equations. In this chapter, we present a sequential function approximation in high dimensions. Convergence analysis is also presented both in expectation and almost surely. The analysis establishes the optimal sampling probability measure which results in the optimal rate of convergence. Upon introducing the general sequential method, an application to the Gauss-tensor quadrature grid is discussed. This results in a very efficient method by taking advantage of the tensor structure of the grids.

Various numerical examples (up to hundreds of dimensions) are provided in Chapter 7.3 to verify theoretical analysis and demonstrate the effectiveness of the method.

5.1 Randomized Kaczmarz algorithm

Let us start by briefly reviewing the randomized Kaczmarz (RK) algorithm ([88]). It is an iterative method to solve an overdetermined linear system of equations, Ax = b, where A is a full-rank matrix of size m × p with m > p. For an arbitrary initial approximation x^{(0)} to the solution, the (k + 1)-th iteration is computed by

x^{(k+1)} = x^{(k)} + \frac{b_{i_k} - \langle x^{(k)}, A_{i_k}\rangle}{\|A_{i_k}\|_2^2}\,A_{i_k},   (5.1)

where i_k is randomly chosen from {1, . . . , m} with probabilities proportional to ‖A_{i_k}‖_2^2. Error analysis has been studied for both consistent and inconsistent systems, in terms of the expectation, cf. [88, 70]. The expectation is understood as the conditional expectation over all the previous random choices of the row indices. Here we summarize some main results about the RK method.

• Consistent system ([88]). For a consistent system where x∗ is the solution to Ax∗ = b,

the k-th iterate of (5.1) satisfies

\mathbb{E}\|x^{(k)} - x^*\|_2 \le \big(1 - \kappa(A)^{-2}\big)^{k/2}\cdot\|x^{(0)} - x^*\|_2,

where \kappa(A) = \|A\|_F\,\|A^{-1}\|.

• Inconsistent system ([70]). For an inconsistent system, let x∗ satisfy Ax∗ = y and

b = y + e. Then the k-th iterate of (5.1) satisfies

\mathbb{E}\|x^{(k)} - x^*\|_2 \le \big(1 - \kappa(A)^{-2}\big)^{k/2}\cdot\|x^{(0)} - x^*\|_2 + \kappa(A)\,\gamma,

where \gamma = \max_i \frac{|e_i|}{\|A_i\|_2}.

Here the spectral and Frobenius norm of A are defined as, respectively,

\|A\| = \max_{\|x\|_2 = 1}\|Ax\|_2 \quad \text{and} \quad \|A\|_F = \sqrt{\sum_{i,j} a_{ij}^2}.

From the fact that κ(A) ≥ √p, it is easy to see that the optimal convergence rate of the RK method is (1 − 1/p).
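A brief sketch of the RK iteration (5.1) for an overdetermined system is given below; the matrix, right-hand side, and iteration count are synthetic placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 10))
x_true = rng.standard_normal(10)
b = A @ x_true

row_prob = np.sum(A**2, axis=1) / np.sum(A**2)     # P(i) proportional to ||A_i||_2^2
x = np.zeros(10)
for _ in range(5000):
    i = rng.choice(A.shape[0], p=row_prob)
    x += (b[i] - A[i] @ x) / (A[i] @ A[i]) * A[i]  # update (5.1)
print(np.linalg.norm(x - x_true))
```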

5.2 Sequential function approximation

The setup of this chapter is mostly introduced in Chapter 2.1. Readers who are not familiar with this setup are encouraged to review Chapter 2.1.

Let f be an unknown target function in L^2_ω(D) (2.1), where D is a domain in R^d, d ≥ 1. Let Π be a p-dimensional linear subspace of L^2_ω(D) in which the approximation is sought, and let {φ_j(x)}_{1≤j≤p} be its basis. The goal is to construct an approximation f̃ in the p-dimensional linear space Π,

\tilde{f}(x) = \sum_{j=1}^{p} c_j\,\phi_j(x), \qquad c = [c_1, \cdots, c_p]^T,

by appropriately determining its expansion coefficients c. Let P_Π f be the best L^2_ω approximation of f onto Π defined in (2.11). This can be written as

P_\Pi f(x) = \sum_{j=1}^{p}\hat{c}_j\,\phi_j(x), \qquad \hat{c} = [\hat{c}_1, \cdots, \hat{c}_p]^T,

where ĉ is defined in (2.5). For notational convenience, we adopt the matrix-vector notation

\Phi(x) = [\phi_1(x), \cdots, \phi_p(x)]^T.

Then f̃(x) = ⟨Φ(x), c⟩, where ⟨·,·⟩ is the standard vector inner product. In this chapter, we assume the basis functions satisfy

0 < \inf_{x\in D}\|\Phi(x)\|_2^2 \le \sup_{x\in D}\|\Phi(x)\|_2^2 < \infty.   (5.2)

Algorithm

First, we choose a sampling measure ν on D ⊂ R^d, from which samples of x are drawn. Note that the sampling measure ν is not necessarily the same as the measure ω (2.2) of the space L^2_ω in which the unknown target function f lies. Also, ν could be a discrete probability measure.

Starting from an arbitrary initial choice of the coefficient vector c(0), we draw i.i.d.

samples from the sampling probability measure ν. Let zk ∈ D, k = 1,... , be the k-th

sample drawn from ν and f(zk) its function value. The proposed RK method then updates the coefficient vector in the following way, for k = 1, ··· ,

c^{(k)} = c^{(k-1)} + \frac{f(z_k) - \langle \Phi(z_k), c^{(k-1)}\rangle}{\|\Phi(z_k)\|_2^2}\,\Phi(z_k), \qquad z_k \sim \nu,   (5.3)

where ⟨·,·⟩ is the vector inner product and ‖·‖_2 the vector 2-norm. The key feature of the algorithm is that there is no need to explicitly construct the

“model matrix” in the linear system (2.12), which is used in almost all function regression methods. The algorithm incorporates data as they arrive and can be flexible in many practical situations. The computational cost at each iteration is O(p), which depends only

on the dimensionality of the linear space Π and not on the size of the data.
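A minimal sketch of the update (5.3) in one dimension follows. The choices D = [−1, 1], ω the uniform probability measure, an orthonormal Legendre basis, the sampling measure ν = ω, and the test target function are assumptions for illustration, not prescribed by the text.

```python
import numpy as np

p = 6
def Phi(x):
    # orthonormal Legendre basis: sqrt(2j+1) * P_j(x) on [-1,1] with weight dx/2
    return np.array([np.sqrt(2 * j + 1) *
                     np.polynomial.legendre.legval(x, np.eye(p)[j]) for j in range(p)])

f = lambda x: np.exp(x) * np.sin(3 * x)          # illustrative unknown target
rng = np.random.default_rng(1)

c = np.zeros(p)                                  # c^(0) = 0
for k in range(5000):
    z = rng.uniform(-1.0, 1.0)                   # z_k ~ nu (here nu = omega)
    phi = Phi(z)
    # update (5.3): only vector operations, one sample at a time, no matrix stored
    c += (f(z) - phi @ c) / (phi @ phi) * phi

# the current approximation f_tilde(x) = <Phi(x), c> can be evaluated anywhere
print(c[:3])
```

The cost of each pass through the loop is O(p), matching the statement above that the per-iteration work depends only on the dimension of Π and not on the amount of data seen so far.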

5.3 Convergence and error analysis

Convergence in Expectation

For any probability measure ν on D (possibly either absolutely continuous or discrete) satisfying

\|f\|_\nu < \infty, \qquad |\langle \phi_i, \phi_j\rangle_\nu| < \infty,

we define another measure ν̃ associated with ν (not necessarily a probability measure) by

d\tilde{\nu}(x) = \frac{\tilde{p}}{\sum_{j=1}^{p}\phi_j^2(x)}\,d\nu(x),   (5.4)

where

\tilde{p} := \sum_{j=1}^{p}\|\phi_j\|_\omega^2.   (5.5)

Note that when the basis is orthonormal, i.e., ‖φ_j‖_ω = 1 for all j = 1, . . . , p, we have p̃ = p.

Let ⟨·,·⟩_ν̃ be its corresponding inner product and

\Sigma = (\sigma_{ij})_{1\le i,j\le p}, \qquad \sigma_{ij} = \langle\phi_i, \phi_j\rangle_{\tilde{\nu}},   (5.6)

be the covariance matrix of the basis {φ_j(x), j = 1, . . . , p} under this new measure. Obviously, Σ is symmetric and positive definite. It possesses the eigenvalue decomposition

\Sigma = Q^T \Lambda Q,   (5.7)

where Q is orthogonal and Λ = diag(λ1, . . . , λp) with

\lambda_{\max}(\Sigma) = \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p = \lambda_{\min}(\Sigma) > 0.   (5.8)

Let us now consider the best L^2_ω approximation P_Π f (2.11). Let

e = [\epsilon_1, \ldots, \epsilon_p]^T, \qquad \epsilon_j = \langle f - P_\Pi f, \phi_j\rangle_{\tilde{\nu}}, \quad 1 \le j \le p,   (5.9)

be the projection of its error onto the basis using the new measure ν̃. Obviously, if ν̃ = ω, then e = 0.

We are now in position to introduce the convergence of the RK algorithm (5.3), in terms of its coefficient vector compared to the best approximation coefficient vector cˆ in (2.11) in

expectation. At the k-th iteration, the expectation E is understood as the expectation over

the random samples {zj}1≤j≤k of the algorithm. Also for simplicity of the exposition and without loss of generality, we assume c(0) = 0.

Theorem 5.3.1. Let ĉ be the coefficient vector of the L^2_ω-projection (2.11) and ν be the sampling measure from which the i.i.d. samples are drawn. Let ν̃ be defined in (5.4) and

Σ in (5.6) with its eigenvalues labeled as in (5.8). Then, the k-th iterative solution of the

algorithm (5.3) satisfies

F_\ell + (r_\ell)^k\big(\|\hat{c}\|_2^2 - F_\ell\big) + \epsilon^{(k)} \;\le\; \mathbb{E}\|c^{(k)} - \hat{c}\|_2^2 \;\le\; F_u + (r_u)^k\big(\|\hat{c}\|_2^2 - F_u\big) + \epsilon^{(k)},   (5.10)

where

r_u = 1 - \lambda_{\min}(\Sigma)/\tilde{p}, \qquad F_u = \frac{\|f\|_{\tilde{\nu}}^2}{\lambda_{\min}(\Sigma)} - \|\hat{c}\|_2^2,
r_\ell = 1 - \lambda_{\max}(\Sigma)/\tilde{p}, \qquad F_\ell = \frac{\|f\|_{\tilde{\nu}}^2}{\lambda_{\max}(\Sigma)} - \|\hat{c}\|_2^2,   (5.11)

and

\epsilon^{(k)} = -\frac{2}{\tilde{p}}\,\hat{c}^T Q^T D^{(k)} Q\,e,   (5.12)

with

D^{(k)} = \mathrm{diag}\big[d_1^{(k)}, \ldots, d_p^{(k)}\big], \qquad d_j^{(k)} = \frac{1 - (1 - \lambda_j/\tilde{p})^k}{\lambda_j/\tilde{p}}, \quad 1 \le j \le p.

F_\ell - 2\hat{c}^T\Sigma^{-1}e \;\le\; \lim_{k\to\infty}\mathbb{E}\|c^{(k)} - \hat{c}\|_2^2 \;\le\; F_u - 2\hat{c}^T\Sigma^{-1}e.   (5.13)

The proof can be found in Appendix C.1.

76 The convergence bounds can be notably tighter when one chooses the sampling measure

ν in the following special form.

Corollary 5.3.1. Let the sampling probability measure ν be

d\nu(x) = \frac{\sum_{j=1}^{p}\phi_j^2(x)}{\tilde{p}}\,d\omega(x).   (5.14)

Then, the k-th iterative solution of (5.3) satisfies

F_\ell + (r_\ell)^k\big(\|\hat{c}\|_2^2 - F_\ell\big) \;\le\; \mathbb{E}\|c^{(k)} - \hat{c}\|_2^2 \;\le\; F_u + (r_u)^k\big(\|\hat{c}\|_2^2 - F_u\big).   (5.15)

Furthermore, if the basis {φj(x), j = 1, . . . , p} is orthonormal with respect to the measure ω, i.e., Φ(x) = Ψ(x) in (2.7), then the following equality holds,

\mathbb{E}\|c^{(k)} - \hat{c}\|_2^2 = \|f - P_\Pi f\|_\omega^2 + r^k\big(\|\hat{c}\|_2^2 - \|f - P_\Pi f\|_\omega^2\big), \qquad r = 1 - \frac{1}{p}.   (5.16)

Proof. The choice of ν in the form of (5.14) results in dν̃ = dω. We immediately have Σ = TT^T and ε^{(k)} = 0 in (5.12), because e = 0 in (5.9). The proof then follows naturally.

Error bounds in expectation

The convergence results in terms of the coefficient vector readily give us error bounds of the resulting function approximation fe. Let

\tilde{f}^{(k)}(x) = \sum_{j=1}^{p} c_j^{(k)}\phi_j(x) = \langle \Phi(x), c^{(k)}\rangle   (5.17)

be the approximation constructed by the coefficient vector obtained at the kth step of (5.3).

Then the following result holds.

Theorem 5.3.2. Under the same conditions as in Theorem 5.3.1, and denoting by E = ‖f − P_Π f‖_ω the error of the orthogonal projection, the k-th step approximation (5.17) satisfies

E^2 + \lambda_{\min}(TT^T)\cdot\big(E_\ell^{(k)} + \epsilon^{(k)}\big) \;\le\; \mathbb{E}\|f - \tilde{f}^{(k)}\|_\omega^2 \;\le\; E^2 + \lambda_{\max}(TT^T)\cdot\big(E_u^{(k)} + \epsilon^{(k)}\big),   (5.18)

where

E_\ell^{(k)} = F_\ell + (r_\ell)^k\big(\|\hat{c}\|_2^2 - F_\ell\big), \qquad E_u^{(k)} = F_u + (r_u)^k\big(\|\hat{c}\|_2^2 - F_u\big),   (5.19)

Moreover, if the sampling probability measure ν takes the special form of (5.14), then

E^2 + \lambda_{\min}(\Sigma)\cdot E_\ell^{(k)} \;\le\; \mathbb{E}\|f - \tilde{f}^{(k)}\|_\omega^2 \;\le\; E^2 + \lambda_{\max}(\Sigma)\cdot E_u^{(k)},   (5.20)

where Σ is the covariance matrix defined in (5.6).

Furthermore, if the basis {φj(x), j = 1, . . . , p} is orthonormal with respect to the measure ω, i.e., Φ(x) = Ψ(x) in (2.7), then the following equality holds,

\mathbb{E}\|f - \tilde{f}^{(k)}\|_\omega^2 = 2E^2 + r^k\big(\|P_\Pi f\|_\omega^2 - E^2\big), \qquad r = 1 - \frac{1}{p}.   (5.21)

Proof. The proof consists of straightforward derivations from Theorem 5.3.1 and Corollary

5.3.1.

The last equality (5.21) is achieved when one employs the special form of the sampling probability measure (5.14) and the orthonormal basis Ψ(x) with respect to the measure ω.

The subsequent rate of convergence is 1 − 1/p, which is the optimal rate of convergence

for RK methods. Hereafter, we will refer to the sampling measure (5.14) as the optimal sampling measure.

Almost sure convergence

We now provide almost sure convergence for a special case. In particular, we assume f ∈ Π.

Consequently, E = kf − PΠfkω = 0. We also employ the orthonormal basis Ψ(x) (2.8) with respect to the measure ω and the optimal sampling measure (5.14). It then follows immediately from (5.16) of Corollary 5.3.1 that

\lim_{k\to\infty}\mathbb{E}\|f - \tilde{f}^{(k)}\|_\omega^2 = \lim_{k\to\infty}\mathbb{E}\|c^{(k)} - \hat{c}\|_2^2 = 0.

In fact, this holds true without the expectation.

Theorem 5.3.3. Assume f ∈ Π, and the orthonormal basis Ψ(x) (2.8) with respect to the measure ω and the optimal sampling measure (5.14) are used in (5.3). Then, the coefficient

vector satisfies

\lim_{k\to\infty}\|c^{(k)} - \hat{c}\|_2^2 = 0 \quad \text{almost surely}.   (5.22)

Consequently, the k-th step approximation f̃^{(k)} (5.17) satisfies

\lim_{k\to\infty}\|f - \tilde{f}^{(k)}\|_\omega^2 = 0 \quad \text{almost surely}.   (5.23)

The proof can be found in Appendix C.2.

Remarks on the optimal sampling measure

The results in Corollary 5.3.1 and Theorem 5.3.2 state that the optimal convergence rate

(1 − 1/p) can be achieved when one uses an orthonormal basis in the approximation and, more importantly, draws samples from the optimal sampling measure (5.14). In general this probability measure may not possess a simple analytical form. Consequently, samples cannot be drawn via a standard random number generator. One practical way is to employ methods such as Markov chain Monte Carlo (MCMC) via, for example, the Metropolis-Hastings algorithm [53, 52]. The literature on MCMC methods is abundant and will not be reviewed here.

We also remark that if one uses (normalized) orthogonal polynomials as the orthonormal basis Ψ(x) in the approximation, the corresponding optimal sampling measure has a strong

T connection with the equilibrium measure. More specifically, let Ψ(x) = [ψ1(x), . . . , ψp(x)]

d be an orthonormal polynomial basis from the total degree polynomial space of degree n, Pn of (2.30), and denote the corresponding optimal sampling measure (5.14) by

Pp ψ2(x) dµ (x) = j=1 j dω(x), (5.24) n p

where the dependence on the degree n is explicitly shown. Then, the measure converges to

\lim_{n\to\infty} d\mu_n(x) = d\mu_D(x), \quad \text{almost everywhere } x \in D,   (5.25)

where µD is the pluripotential equilibrium measure of D ([9]). For a compact review, please see Chapter 2.6.

For the tensor-product bounded domain D = [−1, 1]^d, the equilibrium measure is known to be the product of arcsine measures with "Chebyshev" density, i.e.,

d\mu_D(x) = \frac{1}{\prod_{j=1}^{d}\pi\sqrt{1 - x_j^2}}\,dx.   (5.26)

Drawing samples from this measure is straightforward. Therefore, for orthogonal polynomial approximations of moderate to high degrees, it was proposed in [86] to draw samples from this equilibrium measure. This simplifies the sampling procedure and yet should deliver near-optimal convergence. We remark that a similar practice has been discussed for least squares polynomial approximation in [52, 66].
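Indeed, sampling the product arcsine measure (5.26) takes only one line: if U is uniform on (0, 1), then cos(πU) has density 1/(π√(1−x²)) on [−1, 1]. A sketch (the dimension and sample count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d, num_samples = 10, 1000
# each column is an i.i.d. arcsine ("Chebyshev") coordinate on [-1, 1]
samples = np.cos(np.pi * rng.random((num_samples, d)))   # shape (num_samples, d)
```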

5.4 Randomized tensor quadrature approximation

We apply the sequential approximation method to the tensor Gauss quadrature grid. The tensor Gauss quadrature is reviewed in Chapter 2.5. For each dimension i, i = 1, ··· , d, let {(z_j^{(i)}, w_j^{(i)})}_{1≤j≤m} be a one-dimensional quadrature rule with exactness of degree at least 2n on D_i ⊂ R, i.e.,

\sum_{j=1}^{m} w_j^{(i)} f(z_j^{(i)}) = \int_{D_i} f(x_i)\,\varpi_i(x_i)\,dx_i, \qquad \forall f \in \mathbb{P}^1_{2n},   (5.27)

where P^1_{2n} is the polynomial space of degree up to 2n. For each D_i, there are many choices that satisfy this requirement. For example, when D_i is an interval, we have the Gauss-Lobatto rule, the Gauss-Radau rule, the Gauss-Legendre rule, to name a few.

Based on the one-dimensional quadratures, we construct the tensor quadrature rule as follows.

Let D = D_1 ⊗ · · · ⊗ D_d ⊂ R^d, M = m^d, and

\Theta_{i,m} = \{z_1^{(i)}, \cdots, z_m^{(i)}\}, \qquad W_{i,m} = \{w_1^{(i)}, \cdots, w_m^{(i)}\}, \quad i = 1, \cdots, d,   (5.28)

\Theta_M = \Theta_{1,m} \otimes \cdots \otimes \Theta_{d,m}, \qquad W_M = W_{1,m} \otimes \cdots \otimes W_{d,m},   (5.29)

clear that |Θ_M| = |W_M| = m^d = M. Let [m] = {1, ··· , m} and i = (i_1, ··· , i_d) ∈ [m]^d. By employing a linear ordering between the multi-index i and the single index i, i.e.,

z_i = \big(z_{i_1}^{(1)}, \cdots, z_{i_d}^{(d)}\big), \qquad w_i = \big(w_{i_1}^{(1)}, \cdots, w_{i_d}^{(d)}\big), \qquad i \longleftrightarrow (i_1, \ldots, i_d) \in [m]^d,

the tensor quadrature rule {(z_i, w_i)}_{1≤i≤M} is constructed. By the exactness of the one-dimensional quadratures, the tensor quadrature has the following exactness,

\sum_{j=1}^{M} w^{(j)} f(z^{(j)}) = \int_D f(x)\,\varpi(x)\,dx, \qquad \forall f \in \mathbb{P}^{TP}_{d,2n},   (5.30)

where

\mathbb{P}^{TP}_{d,2n} := \mathrm{span}\{x^\alpha = x_1^{\alpha_1}\cdots x_d^{\alpha_d},\ |\alpha|_\infty \le 2n\}

is the tensor product of the one-dimensional polynomial space P^1_{2n}. Since P^d_{2n} ⊆ P^{TP}_{d,2n} for any d ≥ 1 and n ≥ 0, the 2n polynomial exactness (5.30) obviously holds for all f ∈ P^d_{2n}.

Let ω be a product of the univariate measures ϖ_i, i = 1, ··· , d. Let P^d_n be the total degree polynomial space (2.30) of degree up to n, whose dimension is p = dim P^d_n = \binom{n+d}{d}. Let {ψ_j(x)}_{1≤j≤p} be an orthogonal basis for P^d_n. For f ∈ L^2_ω, we denote its orthogonal projection onto P^d_n by P_Π f. Again, our goal is to construct an approximation in P^d_n for the unknown target function f(x) ∈ L^2_ω(D).

With the tensor quadrature points defined, we proceed to apply the randomized sequential approximation method (5.3) with the following discrete sampling measure. For any tensor quadrature point z_j ∈ Θ_M, let us define the discrete sampling probability measure

\mu_*(x) := \sum_{j=1}^{M} w_j\,\frac{\|\Psi(z_j)\|_2^2}{p}\,\delta(x - z_j),   (5.31)

which satisfies ∫ dµ_* = 1 by the 2n polynomial exactness of the tensor quadrature; here δ(x) is the Dirac delta function.

Setting an initial choice of c^{(0)} = 0, one then computes, for k = 1, . . . ,

c^{(k)} = c^{(k-1)} + \frac{f(z_{j_k}) - \langle c^{(k-1)}, \Psi(z_{j_k})\rangle}{\|\Psi(z_{j_k})\|_2^2}\,\Psi(z_{j_k}), \qquad z_{j_k} \sim \mu_*,   (5.32)

which is a version of (5.3) using ν = µ∗ as a discrete sampling measure. Again, the

81 implementation of the algorithm is remarkably simple. One randomly draws a point from the tensor quadrature set ΘM using the discrete probability (5.31), and then applies the iteration update (5.32), which requires only vector operations. The iteration continues until convergence is reached.
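The small-dimensional sketch below illustrates the iteration (5.32) with the discrete measure (5.31) built by direct enumeration (only feasible for tiny grids; the dimension-by-dimension sampler of Section 5.4.1 replaces this in high dimensions). The choices d = 2, n = 2, the uniform measure, the orthonormal Legendre basis, the 3-point Gauss rule, and the test function are all assumptions for illustration.

```python
import numpy as np
from numpy.polynomial.legendre import leggauss
from itertools import product

d, n, m = 2, 2, 3
nodes, wts = leggauss(m)
wts = wts / 2.0                       # weights of the uniform *probability* measure on [-1,1]

idx = [k for k in product(range(n + 1), repeat=d) if sum(k) <= n]   # total-degree index set
p = len(idx)

def phi1d(k, x):                      # orthonormal Legendre of degree k at x
    c = np.zeros(k + 1); c[k] = 1.0
    return np.sqrt(2 * k + 1) * np.polynomial.legendre.legval(x, c)

def Psi(z):                           # multivariate orthonormal basis vector at a point z
    return np.array([np.prod([phi1d(k[i], z[i]) for i in range(d)]) for k in idx])

# tensor grid, its weights, and the discrete sampling probability (5.31)
grid = np.array(list(product(nodes, repeat=d)))
gw = np.array([np.prod(w) for w in product(wts, repeat=d)])
prob = gw * np.array([Psi(z) @ Psi(z) for z in grid]) / p
# prob sums to 1 by the 2n-exactness of the tensor quadrature

f = lambda z: np.cos(z[0] + z[1])     # illustrative target function
rng = np.random.default_rng(0)
c = np.zeros(p)
for _ in range(10 * p):               # K ~ 10 p iterations, per the convergence criterion
    j = rng.choice(len(grid), p=prob)
    psi = Psi(grid[j])
    c += (f(grid[j]) - psi @ c) / (psi @ psi) * psi   # update (5.32)

zt = np.array([0.2, -0.3])
print(abs(Psi(zt) @ c - f(zt)))       # pointwise error of the resulting approximation
```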

The theoretical convergence rate of the algorithm (5.32) readily follows from the general analysis presented in Theorem 5.3.1.

For convenience, we introduce the following tensor quadrature probability measure

\mu(x) := \sum_{j=1}^{M} w_j\,\delta(x - z_j).

Due to the exactness of the tensor quadrature, this probability measure defines an inner

product on P^d_n, defined by

\langle g, h\rangle_w := \int g(x)h(x)\,d\mu(x)

for g, h ∈ L^2_ω(D). We denote its induced discrete norm by ‖·‖_w. Note that for g ∈ P^d_n, ‖g‖_ω = ‖g‖_w.

Theorem 5.4.1. Assume c(0) = 0. The k-th iterative solution of the algorithm (5.32)

satisfies

\mathbb{E}\|c^{(k)} - \hat{c}\|_2^2 = \|f - P_\Pi f\|_w^2 + E + r^k\big(\|P_\Pi f\|_w^2 - \|f - P_\Pi f\|_w^2 - E\big),   (5.33)

where

E = 2\big(\langle f - P_\Pi f, P_\Pi f\rangle_w - \langle \hat{c}, e\rangle\big), \qquad r = 1 - 1/p,

and e = c̃ − ĉ with c̃_j := ⟨f, ψ_j⟩_w. And,

\lim_{k\to\infty}\mathbb{E}\|c^{(k)} - \hat{c}\|_2^2 = \|f - P_\Pi f\|_w^2 + E = \|f\|_w^2 - \|\tilde{c}\|_2^2 + \|e\|_2^2.   (5.34)

Furthermore, the resulting approximation f̃^{(k)} = ⟨c^{(k)}, Ψ(x)⟩ satisfies

\mathbb{E}\|\tilde{f}^{(k)} - f\|_\omega^2 = \|f - P_\Pi f\|_\omega^2 + \|f - P_\Pi f\|_w^2 + E + r^k\big(\|P_\Pi f\|_w^2 - \|f - P_\Pi f\|_w^2 - E\big).   (5.35)

And,

\lim_{k\to\infty}\mathbb{E}\|\tilde{f}^{(k)} - f\|_\omega^2 = \|f - P_\Pi f\|_\omega^2 + \|f - P_\Pi f\|_w^2 + E.   (5.36)

Proof. By choosing µ_* of (5.31) as ν in Theorem 5.3.1, we obtain

r_u = r_\ell = r, \qquad F_\ell = F_u = \|f\|_w^2 - \|\hat{c}\|_2^2, \qquad \epsilon^{(k)} = -2\langle \hat{c}, e\rangle.

The proof then follows from (5.10).

Theorem 5.4.1, as an equality, gives the expected numerical error of the proposed algorithm (5.32). The error converges at a rate of 1 − 1/p. Upon convergence, as the iteration step k → ∞, the error depends only on the best approximation P_Π f of the target function and the accuracy of the tensor quadrature rule. Note that if the quadrature induces no error, Theorem 5.4.1 is identical to Corollary 5.3.1 and 6.3.1, as ‖f‖_ω = ‖f‖_w, c̃ = ĉ, and e = 0. The expectation E in (5.33) shall be understood as the expectation over the random sequence {z_{j_ℓ}}_{1≤ℓ≤k} of the algorithm.

Remark 5.4.1 (Almost sure convergence). For the special case of f ∈ P^d_n, the algorithm (5.32) converges almost surely. This is a result of Theorem 5.3.3.

5.4.1 Computational aspects of discrete sampling

We discuss several essential issues regarding the implementation of the algorithm (5.32). We remark that the algorithm (5.32) is in fact a special case of the general algorithm (5.3), which can use any probability measure to draw the samples. However, the optimal rate of convergence is achieved by using the special probability (5.31), which is used in the algorithm (5.32). The optimal convergence rate is essential in high dimensions, and the performance of the main algorithm is significantly superior to that of the general algorithm.

Sampling the discrete probability (5.31)

The probability distribution (5.31) is critical for the optimality of the main algorithm (5.32).

This is a multivariate distribution with a non-standard form and cannot be easily sampled by existing software. Direct sampling requires the evaluation and storage of µ_*(j) for all j = 1, . . . , M. This becomes impractical in high dimensions, as M can be prohibitively large.

One alternative is to use the accept-reject method, which is a popular approach for sampling non-standard probability distributions. Although applicable in this case, it results in an increasingly large portion of rejections and becomes highly inefficient in higher dimensions. Here we present an effective sampling strategy for µ_*(j). It utilizes conditional probability to conduct the sampling dimension by dimension.

Let x be a random variable with the distribution (5.31). Then µ_*(j) is the probability of x taking the tensor quadrature point z_j = (z_{j_1}^{(1)}, . . . , z_{j_d}^{(d)}):

\mu_*(j) = \mathbb{P}(x = z_j) = \mathbb{P}\big(x_1 = z_{j_1}^{(1)}, \cdots, x_d = z_{j_d}^{(d)}\big)
= \frac{1}{p}\sum_{|k|\le n}\big(w_{j_1}^{(1)}\phi_{k_1}^2(z_{j_1}^{(1)})\big)\cdots\big(w_{j_d}^{(d)}\phi_{k_d}^2(z_{j_d}^{(d)})\big).

The marginal probability of \{x_1 = z_{j_1}^{(1)}\} is

\mathbb{P}\big(x_1 = z_{j_1}^{(1)}\big) = \sum_{j_2=1}^{m}\cdots\sum_{j_d=1}^{m}\mathbb{P}\big(x_1 = z_{j_1}^{(1)}, \cdots, x_d = z_{j_d}^{(d)}\big)
= \frac{1}{p}\sum_{|k|\le n}\big(w_{j_1}^{(1)}\phi_{k_1}^2(z_{j_1}^{(1)})\big)\prod_{\ell=2}^{d}\|\phi_{k_\ell}(x)\|_\omega^2   (5.37)
= \frac{w_{j_1}^{(1)}}{p}\sum_{k_1=0}^{n}\binom{n - k_1 + d - 1}{d - 1}\,\phi_{k_1}^2(z_{j_1}^{(1)}),

where the 2n polynomial exactness of the tensor quadrature (5.30) is used. In fact, without any special treatment of the first component, this result applies to any dimension i = 1, . . . , d:

\mathbb{P}\big(x_i = z_{j_i}^{(i)}\big) = \frac{w_{j_i}^{(i)}}{p}\sum_{k_i=0}^{n}\binom{n - k_i + d - 1}{d - 1}\,\phi_{k_i}^2(z_{j_i}^{(i)}).

The marginal distribution of the first ℓ components is

\mathbb{P}\big(x_1 = z_{j_1}^{(1)}, \cdots, x_\ell = z_{j_\ell}^{(\ell)}\big) = \sum_{j_{\ell+1}=1}^{m}\cdots\sum_{j_d=1}^{m}\mathbb{P}\big(x_1 = z_{j_1}^{(1)}, \cdots, x_d = z_{j_d}^{(d)}\big)   (5.38)
= \frac{1}{p}\sum_{|k|\le n}\big(w_{j_1}^{(1)}\phi_{k_1}^2(z_{j_1}^{(1)})\big)\cdots\big(w_{j_\ell}^{(\ell)}\phi_{k_\ell}^2(z_{j_\ell}^{(\ell)})\big).

These probabilities can be explicitly evaluated. Since the quadrature set ΘM (5.29) is

the tensor product of the one-dimensional sets Θ_{i,m} (5.28), we can efficiently sample µ_*(j) component by component in a recursive manner. Also, it is important to realize that the

points in each one-dimensional quadrature point set Θ_{i,m} (5.28) are uniquely labelled by the integers {1, . . . , m}. That is,

\Theta_{i,m} = \{z_1^{(i)}, \ldots, z_m^{(i)}\} \longleftrightarrow \{1, \ldots, m\}, \qquad i = 1, \ldots, d.   (5.39)

Then, each point in the tensor quadrature set ΘM corresponds to a unique multi-index.

Sampling a random variable x from the set ΘM is equivalent to sampling a random multi- index α from the tensorized integer set [m]d.

Our sampling method is now described in Algorithm 1.

Algorithm 1 Draw a point z_j = (z_{j_1}^{(1)}, . . . , z_{j_d}^{(d)}) from the tensor quadrature set Θ_M (5.29) using the discrete distribution (5.31)
1: Evaluate the one-dimensional marginal probability mass P(x_1 = z_ℓ^{(1)}) using (5.37) for all ℓ = 1, . . . , m.
2: From the probabilities {P(x_1 = z_ℓ^{(1)})}_{ℓ=1}^{m}, draw z_{j_1}^{(1)} ∈ Θ_{1,m}, or, equivalently, draw j_1 from the integer set {1, . . . , m}.
3: Set p̂ = P(x_1 = z_{j_1}^{(1)}).
4: for k from 2 to d do
5:    Evaluate the following probabilities using (5.38). For ℓ = 1, 2, ··· , m,
      P_k^{(ℓ)} := P(x_1 = z_{j_1}^{(1)}, ··· , x_{k−1} = z_{j_{k−1}}^{(k−1)}, x_k = z_ℓ^{(k)}).
6:    Evaluate the conditional probabilities P̂_k^{(ℓ)} = P_k^{(ℓ)} / p̂.
7:    From the conditional probabilities {P̂_k^{(ℓ)}}_{ℓ=1}^{m}, draw z_{j_k}^{(k)} ∈ Θ_{k,m}, or, equivalently, draw j_k from the integer set {1, . . . , m}.
8:    Set p̂ = P_k^{(j_k)}.
9: end for
10: return z_j = (z_{j_1}^{(1)}, . . . , z_{j_d}^{(d)}) and the multi-index j = (j_1, . . . , j_d).

The distinct feature of this algorithm is that it involves only one-dimensional sampling in Step 2 for the first component and Step 7 for the kth component, k = 2, . . . , d, and can

be trivially realized.

Convergence criterion

The convergence rate (1 − 1/p) in Theorem 5.4.1 is in the form of an equality. This provides us with a very sharp convergence criterion. Let K be the total number of iterations of the main algorithm (5.32), and let K = γp, where γ is a constant independent of p. Then

we derive

\left(1 - \frac{1}{p}\right)^K = \exp\left(\gamma p\,\ln\left(1 - \frac{1}{p}\right)\right) = \exp\left(-\gamma\sum_{i=1}^{\infty}\frac{1}{i\,p^{i-1}}\right) = \exp\left(-\gamma + O\!\left(\frac{1}{p}\right)\right) \approx e^{-\gamma}, \quad \text{if } p \gg 1.

According to Theorem 5.4.1, this implies that the square of the iteration error, the last term in (5.35), becomes roughly e^{−γ} times smaller. For example, when γ = 5, e^{−γ} ≈ 6.7 × 10^{−3}; when γ = 10, e^{−γ} ≈ 4.5 × 10^{−5}. In most problems, γ = 10 can be sufficient. Our extensive numerical tests verify that K ∼ 10p is a good criterion for accurate solutions. On the other hand, if one desires the iteration error to be at a certain small level ε ≪ 1, then the iteration can be stopped after ∼ −log(ε)·p steps.

In high dimensions d ≫ 1, the total number of tensor quadrature points M = (n + 1)^d grows exceptionally fast. It is much bigger than the cardinality of the polynomial space p = \binom{n+d}{d}. That is, M ≫ p. The linear stopping criterion K = γp implies that not all the tensor quadrature points are sampled. In fact, in high dimensions, only a very small random portion (consisting of K points) of the full grid Θ_M is used in the iteration to reach accurate convergence. This remarkable feature implies that the exceptionally large set Θ_M is never needed during implementation.

Computational cost

Owing to the tensor structure of the underlying grids, the randomized tensor quadrature algorithm (5.32) allows a remarkably efficient implementation in high dimensions d ≫ 1. For simplicity of exposition, let us assume, only in this subsection, that we use the same type of polynomials with the same degree in each dimension. Prior to the computation, one needs to store only one-dimensional data regarding the polynomial basis. More specifically, at each one-dimensional quadrature point j = 1, . . . , m, we store

w_j^{(\bullet)}, \quad \{\phi_i(z_j^{(\bullet)})\}_{i=0}^{n}, \quad \{\phi_i^2(z_j^{(\bullet)})\}_{i=0}^{n}.   (5.40)

This is a very small amount of data to store, irrespective of the dimension d of the problem. The evaluations of multi-dimensional polynomials, whenever needed, can be quickly conducted by using these stored one-dimensional data and the multi-index to single-index mapping (2.40). For example, for any multi-index j_k = ((j_k)_1, ··· , (j_k)_d) drawn from Algorithm 1, we have

\psi_i(z_{j_k}) = \phi_{i_1}\big(z_{(j_k)_1}^{(1)}\big)\cdots\phi_{i_d}\big(z_{(j_k)_d}^{(d)}\big), \qquad 1 \le i \le p.

We now provide a rough estimate of the operation count of the main algorithm (5.32).

At each iteration step, one needs to first draw a random sample based on the probability

(5.31), and then conduct the iteration (5.32). The sampling step, when using Algorithm 1, requires the component information of w_j^{(\bullet)} and {φ_i^2(z_j^{(\bullet)})}_{i=0}^{n}, which can be readily retrieved. In our current implementation, the major cost stems from the evaluation of (5.38).

This requires memory storage of O(d × p) terms in (5.38) and then O(d × p) flops. Then, in the iteration step of (5.32), the evaluation of the polynomial basis functions requires

O(d × p) flops, and the vector inner product and norm computation require O(p) flops.

The required memory storage is O(p) real numbers, for the vectors of length p. However, the efficient implementation requires the storage of the multi-index to single-index mapping

(2.40), which involves O(d × p) entries.

In summary, the algorithm (5.32) requires the memory storage of O(d×p) real numbers,

and each iteration step involves O(d × p) flops. Note that p = \binom{n+d}{d} ∼ d^n. Since any reasonable polynomial approximation requires n ≥ 2, we have p ≫ d in high dimensions. Assume the algorithm (5.32) terminates after K = γp steps with γ ∼ 10; then the overall computational complexity of the algorithm is O(Kp) ∼ O(p^2) flops and O(p) storage of real numbers.

On the other hand, most of the existing regression methods require explicit operations on the model matrix, which consists of the polynomial basis functions evaluated at the sample points. Let J ≥ p be the number of sample points. Then the memory storage is O(J × p) ∼ O(p^2) real numbers. The operation counts depend on the method; for example, for the least squares method, it is O(J × p^2) ∼ O(p^3) flops. We observe that the dominating cost and storage for the current method (5.32) are both one order smaller than those of the standard regression methods such as least squares.

CHAPTER 6

CORRECTING CORRUPTION ERRORS IN DATA

We now turn our attention to the quality of data. We discuss the problem of constructing an accurate function approximation when the data are corrupted by unexpected errors. Here the unexpected corruption errors are different from standard observational noise (e.g., white noise), which is usually modeled as i.i.d. random variables. The corruption errors can have much larger magnitude and in most cases are sparse.

This chapter mainly aims to present a theorem (Theorem 6.3.2) showing that sparse corruption errors can be effectively eliminated by using ℓ1-minimization, also known as the least absolute deviations (LAD) method. Theorem 6.3.2 states that, with overwhelming probability, the error of the ℓ1-minimization solution with the corrupted data is bounded by the errors of the ℓ1- and ℓ2-minimization solutions with respect to the uncorrupted data and the sparsity of the corruption errors. This ensures that the ℓ1-minimization solution with the corrupted data is close to the regression results with uncorrupted data, thus effectively eliminating the corruption errors.

Several numerical examples are presented in Chapter 7.4 to verify the theoretical finding.

Setup

We consider the problem of approximating an unknown function f(x) in a bounded domain D ⊂ R^d, d ≥ 1. Let φ_i(x), i = 1, . . . , p, be a set of basis functions; we write the approximation as

\tilde{f}(x; c) = \sum_{j=1}^{p} c_j\phi_j(x).   (6.1)

where c = (c_1, . . . , c_p)^T is the coefficient vector. Let

A = (a_{ij}), \qquad a_{ij} = \phi_j(x_i), \quad 1 \le i \le m, \ 1 \le j \le p,   (6.2)

be the model matrix, and

f = (f(x_1), . . . , f(x_m))^T   (6.3)

be the samples of f(x), and

b = f + e_s   (6.4)

be the data, where e_s is the external corruption error.

`1-minimization problem (the LAD method)

(P0) min kb − Ack1. (6.5) c∈Rp

Let F ∈ R^{(m−p)×m} be the kernel of the model matrix A. Then the following problem is equivalent to (P0):

(P1) \qquad \min \|g\|_1 \quad \text{subject to} \quad Fg = Fb.   (6.6)

The equivalence was established in [23]; the proof can also be found in Chapter 2.2. The error correction work of [23] is based on the fact that the sparse solution of (P1) can be exact (with overwhelming probability) when the error e_s is sparse. Consequently, the solution of (P0) can recover the solution of the exact system f = Ac even in the presence of the error.

This, however, is not the case in our setting, as for function approximations we do not have the exact system f = Ac. To further clarify this, we derive from (P1)

b = g + Ac = g + f - (f - Ac).

Using (6.4), we have

g = e_s + (f - Ac) \quad \text{and} \quad Fg = Fe_s + Ff.   (6.7)

It is now obvious that g can be interpreted as the sum of the corruption error e_s and the approximation error f − Ac. Therefore, even if the corruption error e_s is sparse, as assumed in [23] and in this work, the term g is not sparse because of the approximation error of the regression model Ac for the uncorrupted data f. Therefore, the exact error correction for (P0), as established in [23], is not possible. This chapter aims to show that the solution of (P0) can still be very accurate and closely related to the regression error f − Ac with respect to the uncorrupted data f.

6.1 Assumptions

Throughout this chapter we assume the model matrix A ∈ R^{m×p} is overdetermined (i.e., m > p) and of full rank. We then make the following assumptions.

• (A1) The target function f(x) is uniformly continuous in D.

• (A2) The corruption error e_s ∈ R^m is sparse and nontrivial.

• (A3) The sample points x1, . . . , xm are drawn as i.i.d. random variables from a prob- ability measure µ.

By definition, Assumption (A1) implies that, for any ε > 0, there exists an r_ε depending only on ε such that for any x_1 and x_2 satisfying ‖x_1 − x_2‖_2 ≤ r_ε, we have

|f(x_1) - f(x_2)| < \varepsilon.   (6.8)

Note that r_ε depends only on f and ε.

The Assumption (A2) implies the following.

• By "sparse", we assume that only s ≥ 1 entries of e_s are non-zero and denote by s its sparsity. Compared to the data length m, s should be relatively small. The quantitative meaning of this smallness will be made clear in the analysis.

larger than the approximation errors using the uncorrupted data, kf−Ack, in a suitable

norm. The precise meaning of this will be made clear in our analysis. Naturally, if

the corruption error es is too small, it will be mixed up with the approximation error f − Ac and can not be distinguished. In this case, it will be treated as the noises by

the regression methods (LAD or LSQ).

We remark that these two assumptions on the corruption error e_s are rather mild. They can be considered reasonable assumptions on faulty sampling of functions, i.e., the sampling of the function f(x) experiences a small percentage (sparse) of abnormal (nontrivial) readings. The underlying mechanism causing the faulty sampling may be completely deterministic. Therefore, we make no probabilistic assumptions on the corruption errors; (A2) shall be the only assumption.

6.2 Auxiliary results

We now present a few auxiliary results that will be used in the core part of the proof. The

first few results are related to the sparsity of the solution and the rest are concerned with the probabilistic distribution of the samples.

6.2.1 Sparsity related results

Let s denote the sparsity of the corruption error es, i.e., the number of nonzero entries in es. The following theorem is a modified version of Theorem 1.2 in [21]. The main difference is the incorporation of the term σ2s,1(x), where

\sigma_{s,1}(x) := \inf_{\|z\|_0 \le s}\|z - x\|_1

is the error of the best s-term approximation of a vector x ∈ R^p in the vector 1-norm.

Theorem 6.2.1. Assume that \delta_{3s} < \frac{2}{2 + \sqrt{6}} and y = Ax + z with ‖z‖_2 ≤ τ. Then the solution x^* to

\min_x \|x\|_1 \quad \text{subject to} \quad \|y - Ax\|_2 \le \tau   (6.9)

∗ −1/2 kx − xk2 ≤ C0τ + C1s σ2s,1(x), (6.10)

where √ √ 1 + 2 2(1 + 2ρ) C = α, C = , 0 1 − ρ 1 1 − ρ √ √ (6.11) 2 1 + δ 6δ α = 3s , ρ = 3s . 1 − δ3s 2(1 − δ3s) The proof can be found in Appendix D.1.

6.2.2 Probabilistic results of the samples

Let Θm = {x1, . . . , xm} be the m-point sample set, Θe = {xi1 , . . . , xis } ⊂ Θm be the samples

where the corruption errors occur, and ΘR = Θm \ Θe be the uncorrupted sample points. For notational convenience, we write

Θe = {z1, . . . , zs}, ΘR = Θm \ Θe = {y1, . . . , ym−s}. (6.12)

Theorem 6.2.2. Under the assumption (A3), for each zi ∈ Θe, i = 1, . . . , s, with probability P(m, s, r, µ) depending on m, s, r and the sampling probability measure µ, there exists at

least one point yji ∈ ΘR that lies in its ball B(zi, r) and the yji are distinct for each zi. Here s s !m−s−|α| X m − s Y X P(m, s, r, µ) = 1 − pαi 1 − p , (6.13) α i i α∈Ω i=1 i=1 where s Y Ω = {α = (α1, ··· , αs): αi = 0, |α| ≤ m − s}, i=1 (6.14)   R dµ m − s (m − s)! B(zi,r) = , pi = R . α α1! ··· αs!(m − s − |α|)! D dµ The proof can be found in Appendix D.2.1.

Intuitively, when the sparsity of the corruption errors is fixed, the probability of finding an uncorrupted sample near any given corrupted sample should go to one, when the total number of samples approaches infinity. This is indeed the case.

93 Theorem 6.2.3. Under the assumption (A3), we have

lim P(m, s, r, µ) = 1. (6.15) m→∞

Moreover, for samples satisfying pi =p ¯ for all i, we have

P(m, s, r, µ) > 1 − 2s(1 − p¯)m−s.

In another word, Theorem 6.2.2 holds with overwhelming probability.

The proof can be found in Appendix D.2.2.

6.3 Error Analysis

We now derive the main theorem (Theorem 6.3.2) for the function approximation properties

of (P0) subject of corruption errors es. Again, let

n X fe(x; c) = cjφj(x) (6.16) j=1

be the approximation of f(x). Using Theorem 6.2.2, we immediately obtain

Corollary 6.3.1. Under the assumptions (A1) and (A3), there exists r which depends on f,  and c such that with the probability P(m, s, r, µ) defined in (6.13), for any zk ∈ Θe there exists yik ∈ ΘR such that yik ∈ B(zk, r), and

|f(zk) − fe(zk; c)| < |f(yik ) − fe(yik ; c)| + . (6.17)

0 Furthermore, if kck∞ ≤ M for some finite constant M, there exists r which depends

0 on f, , p, and M such that with probability P(m, s, r , µ), there exists yik ∈ ΘR such that 0 yik ∈ B(zk, r ), and (6.17) holds.

Proof. Since f and fe are uniformly continuous, then for a fixed , there exist r,f and r,c

such that for all x, y with kx − yk2 ≤ r, where r = min{r,f , r,c},

|f(x) − f(y)| < /2, |fe(x; c) − fe(y; c)| < /2.

94 Therefore,

|f(x) - \tilde{f}(x; c) - f(y) + \tilde{f}(y; c)| \le |f(x) - f(y)| + |\tilde{f}(x; c) - \tilde{f}(y; c)| < \varepsilon.

This completes the first part of the proof.

Under the condition ‖c‖_∞ ≤ M and the uniform continuity of the basis φ_j(x), there exists an r_b such that for all x, y satisfying ‖x − y‖_2 < r_b,

|\phi_i(x) - \phi_i(y)| < \frac{\varepsilon}{2Mp}, \qquad 1 \le i \le p.

And we immediately obtain

|\tilde{f}(x; c) - \tilde{f}(y; c)| \le M\sum_{j=1}^{p}|\phi_j(x) - \phi_j(y)| < \varepsilon/2.

By taking

r'_\varepsilon = \min\{r_{\varepsilon,f}, r_b\},   (6.18)

the proof is completed.

For notational convenience, we shall use P_f(m, s, ε) to stand for P(m, s, r'_ε, µ) in the rest of the discussion.

We now present the main theory regarding (P0) subject to corrupted data. Without

causing any practical differences, we require the coefficient vector to be bounded and write

(P0) equivalently as,

(P0) \qquad c^* = \operatorname*{argmin}_{\|c\|_2 \le M}\|b - Ac\|_1.   (6.19)

To quantify the quality of the solution of (P0), we define the following regression solutions using the uncorrupted data:

c^{LAD} = \operatorname*{argmin}_{c}\|f - Ac\|_1 \quad \text{and} \quad \tau_1 = \|f - Ac^{LAD}\|_1,
c^{LSQ} = \operatorname*{argmin}_{c}\|f - Ac\|_2 \quad \text{and} \quad \tau_2 = \|f - Ac^{LSQ}\|_2.   (6.20)

These are the idealized (and unavailable) regression solutions when the data are uncorrupted

and will be used to quantify the quality of the solution of (P0) with corrupted data.

95 We also denote

\Lambda = \mathrm{supp}(e_s), \qquad R = \{1, \ldots, m\}\setminus\Lambda,   (6.21)

the index sets of the samples in Θ_e and Θ_R, respectively, and

\Lambda_c^0 = \{j \in \Lambda : \mathrm{sign}(e_s)_j \ne \mathrm{sign}(f - Ac)_j\}.   (6.22)

That is, the index set Λ_c^0 represents the samples where the corruption errors and the regression errors with respect to the uncorrupted data have opposite signs. Finally, for an index set Λ, we denote by ‖x‖_Λ = ‖x_Λ‖_1 its 1-norm on the index set.

For any T ⊂ [m] = {1, . . . , m}, let A_T denote the submatrix of A consisting of all rows corresponding to the index set T. We are now in a position to present the main result. In addition to the assumptions (A1)–(A3), we also assume that f_T ∉ range(A_T) for any T ⊂ [m]. This is the discrete version of assuming the target function f(x) is not in the span of

the basis {φj(x)}. This is almost always the case in practice.

Lemma 6.3.2. Under the assumptions (A1)–(A3), where the sparse corruption error es being nontrivial implies

\min_{j\in\Lambda}|(e_s)_j| > \|f - Ac^{LAD}\|_1,   (6.23)

let us further assume f_T ∉ range(A_T), where T ⊂ {1, . . . , m} with |T| ≥ m − 2s. Then,

with the probability P_f(m, s, ε^*) defined in (6.13) with the radius of (6.18), the solution c^* to (P0) satisfies:

(a) \|e_s\|_1 < \|b - Ac^*\|_1;

(b) |(f - Ac^*)_{\Lambda_{c^*}^0}| < |(e_s)_{\Lambda_{c^*}^0}|;

(c) (b - Ac^*)(i) \ne 0, \quad \forall i \in \Lambda,

where \Lambda_{c^*}^0 = \{j \in \Lambda : \mathrm{sign}(e_s(j)) \ne \mathrm{sign}((f - Ac^*)(j))\} and \varepsilon^* = \min_c \frac{1}{s}\sigma_{2s,1}(f - Ac).

Proof. From Corollary 6.3.1, with the probability P_f(m, s, ε^*),

|f(z_k) - \tilde{f}_c(z_k)| < |f(y_{i_k}) - \tilde{f}_c(y_{i_k})| + \varepsilon^*, \qquad k = 1, \cdots, s.

We then partition $\Lambda$ into the following sets,

$$\Lambda = \Lambda_c^1 \cup \Lambda_c^2, \qquad \Lambda_c^2 = \Lambda_c^{[2,s]} \cup \Lambda_c^{[2,a]},$$

where $e_c = f - Ac$ and

$$\Lambda_c^1 = \{j \in \Lambda : \operatorname{sign}(e_s(j)) = \operatorname{sign}(e_c(j))\}, \qquad \Lambda_c^2 = \Lambda'_c,$$
$$\Lambda_c^{[2,s]} = \{j \in \Lambda_c^2 : |e_s(j)| > |e_c(j)|\}, \qquad \Lambda_c^{[2,a]} = \{j \in \Lambda_c^2 : |e_s(j)| \le |e_c(j)|\}.$$

For notational convenience, let $\Omega_c = \{i_k\}_{1 \le k \le s}$ and $R'_c = R \cup \Lambda_c^1$.

Part (a): For any $c$ with $\|c\|_2 \le M$, with the probability $P_f(m, s, \epsilon^*)$, we have

$$\|f - Ac\|_{\Lambda_c^2} = \sum_{k \in \Lambda_c^2} |f(z_k) - \tilde{f}_c(z_k)| < \sum_{i_k \in \Omega_c} |f(y_{i_k}) - \tilde{f}_c(y_{i_k})| + |\Omega_c|\,\epsilon^*$$
$$\le \|f - Ac\|_{\Omega_c} + s\epsilon^* \le \|f - Ac\|_{\Omega_c} + \sigma_{2s,1}(e_c)$$
$$\le \|f - Ac\|_{\Omega_c} + \|f - Ac\|_{R'_c \setminus \Omega_c} = \|f - Ac\|_{R'_c}.$$

Therefore,

$$\|b - Ac^*\|_1 = \|e_{c^*}\|_R + \|e_s + e_{c^*}\|_{\Lambda^1_{c^*}} + \|e_s + e_{c^*}\|_{\Lambda^2_{c^*}}$$
$$\ge \|e_{c^*}\|_{R'_{c^*}} - \|e_{c^*}\|_{\Lambda^2_{c^*}} + \|e_s\|_1$$
$$> \|e_s\|_1.$$

Parts (b) and (c): Suppose $|\Lambda^{[2,a]}_{c^*}| > 0$; otherwise the proof is trivial. Then the solution $d^*$ to (P1) can be written as follows:

$$\|d^*\|_1 = \|d^*\|_R + \|d^*\|_{\Lambda^1_{c^*}} + \|d^*\|_{\Lambda^{[2,s]}_{c^*}} + \|d^*\|_{\Lambda^{[2,a]}_{c^*}}. \qquad (6.24)$$

We now establish the following.

Lemma 6.3.3. $|d^*(j)| \ge |e_s(j)|$ for all $j \in \Lambda^{[2,a]}_{c^*}$. Moreover, $\|d^*\|_{\Lambda^{[2,a]}_{c^*}} \ge \|e_s\|_{\Lambda^{[2,a]}_{c^*}}$.

Proof. From the definition of $\Lambda^{[2,s]}_{c^*}$, we have

$$|e_{c^*}(j)| < |e_s(j)|, \qquad \forall j \in \Lambda^{[2,s]}_{c^*}.$$

It follows from $d^* = e_s + e_{c^*}$ that for all $j \in \Lambda^{[2,s]}_{c^*}$,

$$|(d^* - e_s)(j)| < |e_s(j)|$$
$$\Leftrightarrow\quad e_s(j) - |e_s(j)| < d^*_j < e_s(j) + |e_s(j)|$$
$$\Leftrightarrow\quad \begin{cases} 0 < d^*_j < 2e_s(j) & \text{if } \operatorname{sign}(e_s(j)) > 0, \\ 2e_s(j) < d^*_j < 0 & \text{if } \operatorname{sign}(e_s(j)) < 0. \end{cases}$$

Also, from the fact that $\operatorname{sign}(e_s(j)) \ne \operatorname{sign}(e_{c^*}(j))$, we obtain for all $j \in \Lambda^{[2,s]}_{c^*}$,

$$\begin{cases} d^*_j < e_s(j) & \text{if } \operatorname{sign}(e_{c^*}(j)) < 0, \\ d^*_j > e_s(j) & \text{if } \operatorname{sign}(e_{c^*}(j)) > 0. \end{cases}$$

By combining the above results, we have

$$0 < |d^*_j| < |e_s(j)|, \qquad \forall j \in \Lambda^{[2,s]}_{c^*}.$$

Therefore, for all $j \in \Lambda^{[2,a]}_{c^*}$, $|d^*(j)| \ge |e_s(j)|$.

Suppose that $c^{LAD}$ satisfies $\|f - Ac^{LAD}\|_1 < \|e_s\|_{\Lambda^{[2,a]}_{c^*}}$ and $c^* \ne c^{LAD}$. We denote $d^{LAD} = e_s + f - Ac^{LAD}$. Then it follows from $\|e_s\|_{\Lambda^{[2,a]}_{c^*}} - \|e_{c^*}\|_{\Lambda^{[2,a]}_{c^*}} \le 0$ that

$$\|d^*\|_1 = \|d^*\|_R + \|d^*\|_{\Lambda^1_{c^*}} + \|d^*\|_{\Lambda^{[2,s]}_{c^*}} + \|d^*\|_{\Lambda^{[2,a]}_{c^*}}$$
$$\ge \|e_{c^*}\|_R + \|e_{c^*}\|_{\Lambda^1_{c^*}} + \|e_s\|_{\Lambda^1_{c^*}} + \|e_s\|_{\Lambda^{[2,s]}_{c^*}} - \|e_{c^*}\|_{\Lambda^{[2,s]}_{c^*}} + \|e_s\|_{\Lambda^{[2,a]}_{c^*}}$$
$$= \|e_s\|_1 + \|e_{c^*}\|_{R'_{c^*}} - \|e_{c^*}\|_{\Lambda^{[2,s]}_{c^*}}$$
$$\ge \|e_s\|_1 + \|e_s\|_{\Lambda^{[2,a]}_{c^*}} + \|e_{c^*}\|_{R'_{c^*}} - \|e_{c^*}\|_{\Lambda^{[2,s]}_{c^*}} - \|e_{c^*}\|_{\Lambda^{[2,a]}_{c^*}}$$
$$= \|e_s\|_1 + \|e_s\|_{\Lambda^{[2,a]}_{c^*}} + \|e_{c^*}\|_{R'_{c^*}} - \|e_{c^*}\|_{\Lambda^2_{c^*}}$$
$$> \|e_s\|_1 + \|e_s\|_{\Lambda^{[2,a]}_{c^*}} + \|e_{c^*}\|_{R'_{c^*} \setminus \Omega_{c^*}} - s\epsilon_0$$
$$> \|e_s\|_1 + \|e_s\|_{\Lambda^{[2,a]}_{c^*}}$$
$$\ge \|e_s + f - Ac^{LAD}\|_1 = \|d^{LAD}\|_1.$$

Note that in the second inequality we use the fact $\|d^*\|_{\Lambda^{[2,a]}_{c^*}} \ge \|e_s\|_{\Lambda^{[2,a]}_{c^*}}$. In the sixth inequality the following arguments are used: since $\|e_{c^*}\|_{R \cup \Lambda^1 \setminus \Omega} > 0$, once $\epsilon_0$ is chosen properly, say $\epsilon_0 = \epsilon^*$ where $\epsilon^* = \frac{1}{s}\min_c \sigma_{2s,1}(e_c)$, then $\|e_{c^*}\|_{R'_{c^*} \setminus \Omega_{c^*}} - s\epsilon_0 > 0$ and the inequality holds. However, the above inequality leads to a contradiction, which implies that $|\Lambda^{[2,a]}_{c^*}| = 0$. This completes the proof.

We now introduce the concept of Haar condition.

Definition 6.3.4. Let $a_1, \dots, a_m$ be a set of vectors in $\mathbb{R}^{p+1}$, with $m \ge p + 1$. The set is said to satisfy the Haar condition if every subset of $p + 1$ vectors is linearly independent.

Theorem 6.3.1 ([16], Theorem 3). For any vector $b \in \mathbb{R}^m$ and matrix $A \in \mathbb{R}^{m \times p}$, there exists a solution $c^*$ to (P0) such that $b - Ac^*$ has at least $p$ zero entries. Furthermore, if the augmented matrix $[A;\, b] \in \mathbb{R}^{m \times (p+1)}$ satisfies the Haar condition, then $b - Ac^*$ has exactly $p$ zero entries.

Based on this result, we immediately obtain the following.

Corollary 6.3.5. Under the assumptions in Theorem 6.3.2, the solution $c^*$ to (P0) interpolates at least $p$ non-corrupted data $f$ with the probability $P_f(m, s, \epsilon^*)$, which is defined in (6.13).

Proof. From part (c) in Lemma 6.3.2, $(b - Ac^*)(i) \ne 0$ for all $i \in \Lambda$. Therefore, $(f - Ac^*)_R$ has at least $p$ zero entries.

We now present our main result.

Theorem 6.3.2. Suppose that $F$, the kernel of the model matrix $A$, satisfies $\delta_{3s} < \frac{2}{2+\sqrt{6}} \approx 0.4495$ and the assumptions in Lemma 6.3.2 are satisfied. Then, with probability $P_f(m, s, \epsilon)$ defined in (6.13), the solution $c^*$ to (P0) satisfies

$$\tau_2 \le \|f - Ac^*\|_2 \le \frac{3}{2} C_0 \tau_2 + \frac{\tau_1 + s\epsilon}{\sqrt{s}} C_1, \qquad (6.25)$$

where $C_0$, $C_1$ are from (6.11) in Theorem 6.2.1, $s\epsilon \le s\epsilon^* = \min_c \sigma_{2s,1}(f - Ac)$, and $\tau_1$ and $\tau_2$ are the LAD and LSQ approximation errors of the uncorrupted data, respectively, defined in (6.20).

Proof. The lower bound in the inequality obviously holds. We now consider the upper bound. Let us consider the auxiliary problem (P1′)

$$(P1')\qquad \min \|d\|_1 \quad \text{subject to} \quad \|Fd - Fe_s\|_2 \le \tau_2, \qquad (6.26)$$

and denote its solution as $d^*_{P1'}$. Since $d^*$ and $e_s$ are feasible solutions to (P1′) and $d^* - e_s = f - Ac^*$, it follows from Theorem 6.2.1 that

$$\|d^* - e_s\|_2 \le \|d^*_{P1'} - e_s\|_2 + \|d^*_{P1'} - d^*\|_2 \le \frac{1}{2} C_0 \tau_2 + \left( C_0 \tau_2 + \frac{C_1}{\sqrt{s}} \sigma_{2s,1}(d^*) \right) \le \frac{3}{2} C_0 \tau_2 + \frac{C_1}{\sqrt{s}} \sigma_{2s,1}(d^*). \qquad (6.27)$$

From Lemma 6.3.2,

$$C := \|b - Ac^*\|_1 - \|e_s\|_1 = \|f - Ac^*\|_R + \|f - Ac^*\|_{\Lambda^1_{c^*}} - \|f - Ac^*\|_{\Lambda^2_{c^*}}.$$

Since $c^{LAD}$ is a feasible solution to (P0), we have

$$C \le \|b - Ac^{LAD}\|_1 - \|e_s\|_1 \le \|f - Ac^{LAD}\|_1 = \tau_1.$$

With the probability $P_f(m, s, \epsilon)$, we obtain

$$\|f - Ac^*\|_{\Lambda^2_{c^*}} \le \|f - Ac^*\|_{\Omega_{c^*}} + s\cdot\epsilon,$$

where $\Omega_{c^*} \subset R$. Hence the following estimate holds:

$$\|f - Ac^*\|_R = C + \|f - Ac^*\|_{\Lambda^2_{c^*}} - \|f - Ac^*\|_{\Lambda^1_{c^*}} \le \tau_1 + \|f - Ac^*\|_{\Lambda^2_{c^*}} \le \tau_1 + \|f - Ac^*\|_{\Omega_{c^*}} + s\cdot\epsilon.$$

From the definition of $\sigma_{2s,1}$, we obtain

$$\sigma_{2s,1}(d^*) = \sigma_{2s,1}(b - Ac^*) \le \|f - Ac^*\|_{R \setminus \Omega_{c^*}} \le \tau_1 + s\cdot\epsilon.$$

Together, (6.27) becomes

$$\|d^* - e_s\|_2 \le \frac{3}{2} C_0 \tau_2 + \frac{\tau_1 + s\epsilon}{\sqrt{s}} C_1 \qquad (6.28)$$

and the proof is complete.

CHAPTER 7

NUMERICAL EXAMPLES

In this chapter we provide extensive numerical examples to demonstrate the performance of all the approximation methods presented in this dissertation. Two kinds of numerical tests are conducted: (1) function approximation and (2) stochastic collocation for uncertainty quantification.

Function Approximation

To examine the performance of the methods, we employ several functions from [43], which have been widely used for multi-dimensional function integration and approximation tests.

More specifically, we use the following functions

$$f_1(x) = \exp\left( -\sum_{i=1}^{d} \sigma_i^2 \left( \frac{x_i + 1}{2} - \chi_i \right)^2 \right); \qquad \text{(GAUSSIAN)}$$
$$f_2(x) = \exp\left( -\sum_{i=1}^{d} \sigma_i \left| \frac{x_i + 1}{2} - \chi_i \right| \right); \qquad \text{(CONTINUOUS)}$$
$$f_3(x) = \left( 1 + \sum_{i=1}^{d} \sigma_i \frac{x_i + 1}{2} \right)^{-(d+1)}, \quad \sigma_i = \frac{1}{i^2}; \qquad \text{(CORNER PEAK)} \qquad (7.1)$$
$$f_4(x) = \prod_{i=1}^{d} \left( \sigma_i^{-2} + \left( \frac{x_i + 1}{2} - \chi_i \right)^2 \right)^{-1}, \qquad \text{(PRODUCT PEAK)}$$

where $x = (x_1, \dots, x_d) \in \mathbb{R}^d$, $\sigma = [\sigma_1, \dots, \sigma_d]$ are parameters controlling the difficulty of the functions, and $\chi = [\chi_1, \dots, \chi_d]$ are shifting parameters.

Also, exclusively for d = 2, we consider the Franke function

$$f(x) = 0.75\exp\left( -\frac{(9x_1 - 2)^2}{4} - \frac{(9x_2 - 2)^2}{4} \right) + 0.75\exp\left( -\frac{(9x_1 + 1)^2}{49} - \frac{9x_2 + 1}{10} \right)$$
$$+\; 0.5\exp\left( -\frac{(9x_1 - 7)^2}{4} - \frac{(9x_2 - 3)^2}{4} \right) - 0.2\exp\left( -(9x_1 - 4)^2 - (9x_2 - 7)^2 \right). \qquad (7.2)$$

We also consider the following two functions

$$f_5(x) = \cos\left( \sqrt{\sum_{i=1}^{d} (x_i - \chi_i)^2} \right), \qquad f_6(x) = \frac{9}{5 - 4\cos\left( \sum_{i=1}^{d} x_i \right)}, \qquad (7.3)$$

which are fully coupled in all dimensions. We remark that these functions do not possess low-dimensional structure.
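For concreteness, a minimal Python sketch of two of the above test functions is given below (the GAUSSIAN f1 from (7.1) and the Franke function (7.2)); the vectorized implementation and the example parameter values are our own choices for illustration, not code from the experiments.

import numpy as np

def gaussian_f1(x, sigma, chi):
    """GAUSSIAN test function f1 in (7.1); x has shape (n_points, d)."""
    x = np.atleast_2d(x)
    z = (x + 1.0) / 2.0 - chi          # map [-1,1]^d to [0,1]^d and shift by chi
    return np.exp(-np.sum((sigma**2) * z**2, axis=1))

def franke(x):
    """Franke function (7.2); x has shape (n_points, 2)."""
    x1, x2 = x[:, 0], x[:, 1]
    return (0.75 * np.exp(-(9*x1 - 2)**2 / 4 - (9*x2 - 2)**2 / 4)
            + 0.75 * np.exp(-(9*x1 + 1)**2 / 49 - (9*x2 + 1) / 10)
            + 0.5 * np.exp(-(9*x1 - 7)**2 / 4 - (9*x2 - 3)**2 / 4)
            - 0.2 * np.exp(-(9*x1 - 4)**2 - (9*x2 - 7)**2))

# Example: evaluate f1 at random points in [-1,1]^2 with sigma = [1,1], chi = [1,0.5]
pts = np.random.uniform(-1, 1, size=(5, 2))
print(gaussian_f1(pts, sigma=np.array([1.0, 1.0]), chi=np.array([1.0, 0.5])))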

Stochastic Collocation for Uncertainty Quantification

We now consider the following stochastic diffusion equation, a standard benchmark in UQ computations. Consider, in a one-dimensional spatial domain,

$$-\frac{\partial}{\partial x}\left( \kappa(x, y) \frac{\partial u}{\partial x}(x, y) \right) = f, \qquad (x, y) \in (0, 1) \times \mathbb{R}^d, \qquad (7.4)$$

with forcing term $f = \text{const}$ and boundary conditions

$$u(0, y) = 0, \qquad u(1, y) = 0, \qquad y \in \mathbb{R}^d.$$

We assume that the random diffusivity takes an analytical form (for benchmarking purposes)

$$\kappa(x, y) = 1 + \sigma \sum_{k=1}^{d} \frac{1}{k^2 \pi^2} \cos(2\pi k x)\, y_k, \qquad (7.5)$$

where $y = (y_1, \dots, y_d)$ is a random vector with independent and identically distributed components.
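As an illustration of how a single realization of (7.4)–(7.5) can be solved, the following Python sketch discretizes the equation with a standard second-order finite difference scheme on a uniform grid; the grid size, the choice f = 2, and the helper names are our own assumptions and not the solver used in the dissertation.

import numpy as np

def kappa(x, y, sigma=1.0):
    """Random diffusivity (7.5) at spatial points x for one realization y."""
    d = len(y)
    k = np.arange(1, d + 1)
    modes = np.cos(2 * np.pi * np.outer(x, k)) / (k**2 * np.pi**2)  # (len(x), d)
    return 1.0 + sigma * modes @ y

def solve_diffusion(y, f_const=2.0, n=200):
    """Finite-difference solve of -(kappa u')' = f on (0,1), u(0)=u(1)=0."""
    h = 1.0 / n
    x_half = (np.arange(n) + 0.5) * h          # midpoints where kappa is sampled
    k_half = kappa(x_half, y)
    # Tridiagonal system for the interior nodes x_1, ..., x_{n-1}
    main = (k_half[:-1] + k_half[1:]) / h**2
    off = -k_half[1:-1] / h**2
    A = np.diag(main) + np.diag(off, 1) + np.diag(off, -1)
    b = f_const * np.ones(n - 1)
    u_int = np.linalg.solve(A, b)
    return np.concatenate(([0.0], u_int, [0.0]))  # attach boundary values

# Example realization: y uniformly distributed in [-1,1]^10
u = solve_diffusion(np.random.uniform(-1, 1, size=10))
print(u.max())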

7.1 Optimal sampling

We present numerical examples to demonstrate the performance of the optimal sampling strategies. We first examine the stability of the model matrix for orthogonal polynomial regression and then provide approximation results using monomials and Legendre polynomials, for a selection of functions in multiple dimensions.

For comparisons against other methods, we employ the standard sampling methods

• Monte Carlo (MC) sampling. Due to the random nature of the method, all MC results

reported in this section are statistical averages over 50 simulations.

• Quasi Monte Carlo (QMC) sampling. We adopt the Sobol sequence.

The results by the quasi-optimal set, following Definition 3.1.3, will be denoted by

S(Q), emphasizing that the S is applied to the Q matrix out of the QR-factorization of the model matrix A, whereas S(A) means S is applied to the model matrix A directly.

For least squares problems using orthogonal polynomial basis with dense candidate points, the difference between S(A) and S(Q) should be marginal. One can directly use S(A)

to avoid the QR factorization. In the comparison with the near-optimal sampling, we often refer to S(A) as "S-optimality". In almost all examples, we also compute the "best"

approximation errors obtained by OLS on the dense candidate sample sets and plot them

in dashed lines for reference. For the nearly optimal sampling method presented in Chapter

3.5 ([66]), we use “nOPT”, where the initial p points are determined by the truncated square matrices approach (3.43), and “nOPT-F” where the initial p points are determined by the approximate Fekete points (3.44). In almost all the figures, we use “CLS” to stand for the

Christoffel least squares method by [66].

In all examples we use the total degree polynomial spaces (2.30) to construct the polynomial basis functions. We use α to denote the sampling rate, i.e., m = α · p. We focus on the linear and low oversampling ratios α = 1.1 to 2, to demonstrate the effectiveness of the method.
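The dimension of the total degree space and the corresponding multi-indices are easy to generate programmatically; a small Python sketch (our own helper, not code from the dissertation) is shown below.

from itertools import product
from math import comb

def total_degree_dim(n, d):
    """Cardinality p = dim P_n^d = C(n+d, d) of the total degree polynomial space."""
    return comb(n + d, d)

def total_degree_indices(n, d):
    """All multi-indices (i_1, ..., i_d) with i_1 + ... + i_d <= n."""
    return [idx for idx in product(range(n + 1), repeat=d) if sum(idx) <= n]

# Examples matching the values quoted in this chapter:
print(total_degree_dim(5, 2))           # 21
print(total_degree_dim(20, 2))          # 231
print(total_degree_dim(4, 10))          # 1001
print(len(total_degree_indices(5, 2)))  # 21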

7.1.1 Matrix stability

We first numerically investigate the condition number of the model matrix resulting from the quasi-optimal and the near-optimal sampling selections, i.e.,

$$\kappa(A) = \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)}, \qquad \text{for the quasi-optimal sampling},$$

and

$$\kappa(\sqrt{W}A) = \frac{\sigma_{\max}(\sqrt{W}A)}{\sigma_{\min}(\sqrt{W}A)}, \qquad \text{for the near-optimal sampling},$$

where $\sigma_{\max}$ and $\sigma_{\min}$ are the maximum and minimum singular values of the matrix, respectively. To reduce the effect of statistical fluctuation, we report the statistical average of the condition numbers over an ensemble of 50 independent tests.
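These condition numbers follow directly from the singular values; a brief Python sketch is given below, where the weight matrix W is assumed to be supplied as the vector of its diagonal entries (an assumption on our part).

import numpy as np

def cond(A):
    """kappa(A) = sigma_max / sigma_min via the singular values of A."""
    s = np.linalg.svd(A, compute_uv=False)
    return s[0] / s[-1]

def weighted_cond(A, w):
    """kappa(sqrt(W) A) for a diagonal weight matrix W = diag(w)."""
    return cond(np.sqrt(w)[:, None] * A)

# Example: a random 100 x 21 model matrix with uniform weights
A = np.random.randn(100, 21)
print(cond(A), weighted_cond(A, np.ones(100)))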

Bounded domain

Without loss of generality, we consider the $[-1, 1]^d$ hypercube and use Legendre polynomials.

In Fig. 7.1, the condition numbers in d = 2 with respect to increasing polynomial degree from n = 5 to n = 35 are shown. It can be seen that at the very low over-sampling rate of α = 1.2, both the CLS and the S-optimal points exhibit weak instability. The near optimal sampling method, however, exhibits much better stability, almost as good as the

CLS reference solutions, obtained by oversampling 1.5p log p, which is proven to be stable.

Similar results for d = 4 can be seen from Fig. 7.2.

The condition numbers of the standard Monte Carlo method using the corresponding uniform random samples are not shown. It is known to be unstable and the condition numbers become so large that they skew the scale of the figures out of proportion.

Unbounded domain

For the unbounded domain case, we first consider the whole space $\mathbb{R}^d$ with the Gaussian density function $\exp(-\|z\|_2^2)$. The Hermite orthogonal polynomials are used as the basis functions. In Figure 7.3, we report the condition numbers with respect to the increasing polynomial degree in both d = 2 and d = 4. The results of α = 1.2 are shown as solid lines and those of α = 2 as dashed lines.


Figure 7.1: d = 2, Legendre polynomial. The condition numbers of the model matrix at increasing polynomial degree from 5 to 35. Left: oversampling rate α = 1.2; Right: oversampling rate α = 2.


Figure 7.2: d = 4, Legendre polynomial. The condition numbers of the model matrix at increasing polynomial degree from 1 to 8. Left: oversampling rate α = 1.2; Right: oversampling rate α = 2.

Again, the reference solutions are obtained by CLS with α = log p, which is proven to be stable. Similar to the bounded domain case, we observe much improved stability from the current methods "nOPT" and "nOPT-F".

We then consider the “half-space” of [0, ∞)d with an exponential density function

$\exp(-\|z\|_1)$. The Laguerre orthogonal polynomials are used as the basis functions. The condition numbers are plotted in Figure 7.4. Again, we observe the much improved stability property of the near-optimal method at such a low, linear oversampling rate. It is as good

as the mathematically stable solution obtained by CLS using p log p samples.


Figure 7.3: Condition numbers of Hermite polynomials of increasing degree with oversampling rate α = 1.2 (solid lines) and α = 2 (dashed lines). Left: d = 2; Right: d = 4.


Figure 7.4: Condition numbers of Laguerre polynomials of increasing degree with oversampling rate α = 1.2 (solid lines) and α = 2 (dashed lines). Left: d = 2; Right: d = 4.

7.1.2 Function approximation: Quasi-optimal sampling

Two dimensional results

We consider the two-dimensional case (d = 2) in $[-1, 1]^2$. The quasi-optimal sample sets are generated from M = 10,000 uniformly distributed candidate points. To compute the numerical errors in the least squares approximation, we compute the difference between the target functions and the OLS approximations on another set of 20,000 points and report the $\ell_2$ norm of the errors.
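The OLS fit and the error evaluation just described amount to a linear least squares solve followed by an $\ell_2$-norm comparison on an independent test set; a minimal Python sketch follows, with a generic basis-evaluation helper standing in for the Legendre or monomial model matrix (its name and signature are our own).

import numpy as np

def ols_error(build_matrix, target, train_pts, test_pts):
    """Fit by ordinary least squares on train_pts, report the l2 error on test_pts.

    build_matrix(points) -> model matrix whose columns are the basis functions,
    target(points)       -> vector of exact function values.
    """
    A = build_matrix(train_pts)
    c, *_ = np.linalg.lstsq(A, target(train_pts), rcond=None)
    residual = build_matrix(test_pts) @ c - target(test_pts)
    return np.linalg.norm(residual)

# Example: a 1D Legendre basis of degree 5 on [-1,1]
deg = 5
basis = lambda x: np.polynomial.legendre.legvander(x.ravel(), deg)
f = lambda x: np.exp(-x.ravel()**2)
train = np.random.uniform(-1, 1, 40)
test = np.random.uniform(-1, 1, 20000)
print(ols_error(basis, f, train, test))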

Legendre polynomial regression

We first examine the results based on Legendre polynomial regression. We fix the polynomial degree at n = 5. At d = 2, we have $p = \dim \mathbb{P}_n^d = 21$. The numerical errors of the OLS approximations for the four test functions in (7.1) are shown in Fig. 7.5, where the error decay with respect to the number of sample points is plotted. The number of samples starts from m = 22, which corresponds to oversampling by one point. In all the plots, we also show the best approximation errors, which are obtained by the OLS approximations using the entire set of M = 10,000 candidate points, in dashed lines as reference. The results for all the test functions are similar and lead to the following observations.

• The results of the quasi-optimal sets from S(Q) and S(A) are similar. This is expected,

as the columns of $A_M$ are nearly orthogonal on the dense candidate points.

• The results from the quasi-optimal sets are notably more accurate than MC and QMC,

especially when the number of samples m is small. The errors of the quasi-optimal

sets are very close to the best achievable errors by the candidate set. This is also

expected, due to the quasi-optimal nature of the subset design.

• The quasi-optimal sets result in much faster convergence with respect to the number of

samples. In all the examples, oversampling by just a few points will lead to converged

numerical approximations. On the other hand, MC and QMC require m ∼ α · p, where α = 2 ∼ 3 for converged results. The latter is well known.


Figure 7.5: Errors versus number of sample points by Legendre polynomials for the four test functions (7.1) at d = 2 and n = 5. Top left: f1 with σ = [1, 1], χ = [1, 0.5]; Top right: f2 with σ = [−2, 1], χ = [0.25, −0.75]; Bottom left: f3; Bottom right: f4 with σ = [−3, 2], χ = [0.5, 0.5]. Dashed lines are the best errors obtained by using all candidate points.

Numerical results for other degrees of polynomials are similar and are not shown here.

Instead, we use the Franke function (7.2) and demonstrate the performance of the quasi-optimal set at varying polynomial degrees. In Fig. 7.6, we plot the errors obtained at degree n = 20, which results in $p = \dim \mathbb{P}_n^d = 231$, with respect to the increasing number of sample points. Again, we observe that the errors from the quasi-optimal sets are very close to the best error obtained by using all the candidate points. The errors are much smaller than those of MC and QMC, and converge after just a few oversampling points. On the left of Fig. 7.7, the errors obtained at oversampling rates of 1.2 (solid lines) and 3 (dotted lines) are plotted, with respect to increasing polynomial degrees from n = 5 to n = 20. We observe that the quasi-optimal sets result in exponential convergence of the errors. On the other hand, the MC and QMC results show weak instability at higher degrees, especially at the smaller oversampling rate of 1.2. This is consistent with the recent analysis from [31, 65, 64], which showed that linear oversampling is asymptotically unstable. On the right of Fig. 7.7, the condition numbers of the model matrix A are plotted with respect to the polynomial degree. We observe that the quasi-optimal set results in much smaller condition numbers and also a much smaller growth rate at higher degrees.

Monomial polynomial regression

We now examine the performance of OLS using a monomial basis. Since the basis functions are not orthogonal, the quasi-optimal sets are determined by maximizing the S-value for the

Q matrix (denoted as S(Q) in the figures).

Fig. 7.8 shows the errors versus the number of sample points for the four test functions

(7.1) at degree n = 5 (which results in $p = \dim \mathbb{P}_n^d = 21$). Again, we observe the drastic accuracy improvement of the quasi-optimal sampling: its errors stay very close to the best errors obtained by OLS on all the candidate points. Also, the errors quickly converge after only a few oversampling points. Similar behavior is observed for the Franke function, approximated by a degree n = 20 polynomial, as shown in Fig. 7.9.

The errors at different degrees of approximation are shown on the left of Fig. 7.10, where solid lines refer to an oversampling rate of 1.2 and the dotted lines to an oversampling rate of 3.


Figure 7.6: Errors versus number of sample points by Legendre polynomials at d = 2 and n = 20 for the Franke function (7.2). The dashed line is the best error obtained by using all candidate points.


Figure 7.7: Least squares approximation of the Franke function (7.2) by Legendre polynomials at d = 2 with oversampling rates of 1.2 (solid lines) and 3 (dotted lines). Left: Errors versus polynomial degree; Right: Condition number of the model matrix versus polynomial degree.


Figure 7.8: Errors versus number of sample points by monomials for the four test functions (7.1) at d = 2 and n = 5. Top left: f1 with σ = [1, 1], χ = [1, 0.5]; Top right: f2 with σ = [−2, 1], χ = [0.25, −0.75]; Bottom left: f3; Bottom right: f4 with σ = [−3, 2], χ = [0.5, 0.5]. Dashed lines are the best errors obtained by using all candidate points.


Figure 7.9: Errors versus number of sample points by monomials at d = 2 and n = 20 for the Franke function (7.2). The dashed line is the best error obtained by using all candidate points.

Again, we observe convergence of the quasi-optimal sets and weak instability of the MC

and QMC. The condition numbers of the model matrix are shown on the right of Fig. 7.10.

The quasi-optimal set does induce smaller condition numbers, although the growth rate

with respect to the polynomial degree is roughly the same as that of MC and QMC. This

should be due to the lack of orthogonality of the monomial basis.

Higher dimensional examples: function approximations

We now present results in higher dimensions. First, we consider d = 5, and present the

results for f2 and f3 from (7.1). The polynomial basis functions are the normalized tensor product of Legendre orthogonal polynomials. The quasi-optimal set is computed using the

greedy algorithm over a candidate set of 20, 000 uniformly distributed points. Numerical

errors are computed via the $\ell_2$ norm over another set of 40,000 points. The errors for the degree n = 5 Legendre least squares approximation are shown in Fig. 7.11.

At d = 5, we have $p = \dim \mathbb{P}_n^d = 252$. We observe similar behavior of the errors with respect to the increasing number of sample points. Again, the quasi-optimal set results are notably more accurate than those of MC and QMC, especially when the number of samples is small.


Figure 7.10: Least squares approximation of the Franke function (7.2) by monomials at d = 2 with oversampling rates of 1.2 (solid lines) and 3 (dotted lines). Left: Errors versus polynomial degree; Right: Condition number of the model matrix versus polynomial degree.

Results for other functions are very similar and thus not shown.


Figure 7.11: Errors versus number of sample points by Legendre polynomials at d = 5 and n = 5. Left: function f2 with σ = [1, 2, 3, 4, 5], χ = [0.5, ··· , 0.5]; Right: Corner Peak f3. Dashed lines are the best errors obtained by using all candidate points.

The function approximation results at other dimensions (e.g., d = 10) are also similar to those at d = 5. We choose not to show those results. Instead, we will show the results for stochastic collocation at d = 10 later.

7.1.3 Function approximation: Near-optimal sampling

Bounded domain results

We consider the two-dimensional case d = 2 on $[-1, 1]^2$ and seek approximations using Legendre polynomials. The polynomial degree is fixed at n = 10. At d = 2, $p = \dim \mathbb{P}_n^d = 66$. Thus, oversampling starts at m = 67. The numerical errors of the different methods for

the four test functions in (7.1) are shown in Fig. 7.12, with respect to the number of sample

points. We observe that all methods show decay of errors when the number of samples

is increased, as expected. The near-optimal sampling methods, using either the truncated square

matrices approach (nOPT) or the Fekete point approach (nOPT-F) for the initial p points,

deliver very similar results. The results by the quasi-optimal points (S-optimality) from

Chapter 3.4 are also very similar. All of these are notably better than the CLS from [66].

Since the near-optimal method is a combination of both CLS [66] and the quasi-optimal (S-optimal) points, these results suggest that the high accuracy is mostly due to the quasi-optimal sampling strategy (3.25). The reference solutions here are obtained

by CLS using p log p points.

Error behavior against the polynomial degree is also investigated. Here we present the

results for the Franke function (7.2). On the left of Fig. 7.13, we plot the errors obtained

at degree n = 20, which results in $p = \dim \mathbb{P}_n^d = 231$. Very similar results by the different methods are observed. On the right of Fig. 7.13, the errors with respect to increasing polynomial degrees from n = 5 to n = 20 are plotted. Here, all methods use a very low oversampling rate of α = 1.1, except that the reference solution is obtained by p log p points.

We again notice the small errors by the quasi-optimal (S-optimal) points (3.25) and the two

versions of the near-optimal points (3.43) and (3.44).

We now present a higher dimensional case of d = 4. The results for the target functions f3 and f4 in (7.1), in the domain $[-1, 1]^d$, are shown in Fig. 7.14. On the left of the figure, the error decay with respect to the number of samples at a fixed polynomial degree n = 8 is

plotted. On the right of the figure, the error decay with respect to increasing polynomial

degree is plotted, at the oversampling rate α = 1.1. Similar conclusions as in d = 2 can be drawn here.


Figure 7.12: Errors versus number of sample points by Legendre polynomials for the four test functions (7.1) at d = 2 and n = 10. Top left: f1 with σ = (1, 1) and χ = (1, 0.5); Top right: f2 with σ = (−2, 1) and χ = (0.25, −0.75); Bottom left: f3; Bottom right: f4 with σ = (−3, 2) and χ = (0.5, 0.5). Dashed lines are the best errors obtained by the reference solutions.


Figure 7.13: Left: Errors versus number of sample points by Legendre polynomials at d = 2 and n = 20 for the Franke function (7.2). Right: Errors versus polynomial degrees at the oversampling rate of 1.1. The dashed line is the best error obtained by the reference solution.


Unbounded domain results

For the unbounded domain case, we consider the whole space in d = 3 and use Hermite polynomials. We examine both the Corner Peak function from (7.1) and the following exponential function

$$f_u(x) = \exp\left( -\sum_{i=1}^{d} x_i \right).$$

In Figure 7.15, we show the errors with respect to increasing polynomial degree from n = 1 to n = 10 at the oversampling rate of 1.1. The reference solutions are obtained by CLS using an oversampling rate of α = 2. The left plot shows the results of approximating $f_u(x)$, while the right plot shows the results for the Corner Peak. Again, we observe that the quasi- and near-optimal methods deliver very accurate performance.

7.1.4 Stochastic collocation for SPDE

We now consider the stochastic diffusion problem (7.4). For the numerical test, we set

f = 2, σ = 1, d = 10 and let y be uniformly distributed in $[-1, 1]^{10}$. For comparison, we also employ the stochastic Galerkin method using Legendre polynomials to solve the problem, and we denote by $u_n^G$ its solution with expansion order n.


Figure 7.14: Errors by Legendre polynomials in d = 4. Top row: f3; Bottom row: f4 with σ = (1, 2, 3, 4) and χ = (0.5, 0.5, 0.5, 0.5). Left column: The errors versus number of sample points for f3 and f4 at n = 8. Right column: The errors versus increasing polynomial degree from 1 to 8 with the oversampling rate α = 1.1. Dashed lines are the best errors obtained by the reference solutions.


Figure 7.15: Errors against polynomial degree, for the d = 3 whole space using Hermite polynomials. Left: $f(x) = \exp(-\sum_{i=1}^{d} x_i)$; Right: Corner Peak function with $\sigma_i = i$ and $\chi_i = 0.5$. The dashed lines are the best errors obtained by the reference solutions.

Note that the stochastic Galerkin method solves exactly $p = \dim \mathbb{P}_n^d$ equations and exhibits exponential error convergence in this case. However, the resulting deterministic equations for the expansion coefficients are coupled and require care to solve. The properties of the Galerkin system are well studied and documented in a large amount of literature. Interested readers can consult [99] for a quick review.

At expansion order n = 4, the stochastic Galerkin solution induces negligible errors and will be used as the reference solution. The errors by the quasi/near optimal subset least squares methods will be computed against this reference solution.

In Fig. 7.16, the error convergence with respect to the polynomial degree is shown, for both the quasi/near optimal least squares methods and the standard gPC Galerkin. We observe similar exponential convergence by the quasi/near optimal LSQ methods. Note

that we used a mere 10-point oversampling, i.e., $m = \dim \mathbb{P}_n^d + 10$, for each n. With only 10 more equations to solve, the quasi/near optimal LSQs achieve accuracy similar to that of the gPC Galerkin. Since the quasi/near optimal LSQs are completely non-intrusive and result in uncoupled equations, they are much easier to solve than the coupled system of the gPC Galerkin.


Figure 7.16: Error convergence versus Legendre polynomial expansion degree at dimension d = 10. Both S-optimality and nOPT are obtained by oversampling 10 points, i.e., $m = \dim \mathbb{P}_n^d + 10$. The gPC Galerkin solution at n = 4 is used as the reference solution.

In Fig. 7.17, the error convergences of the quasi/near optimal LSQ solutions are shown with respect to the number of samples, at different polynomial orders. At each polynomial

order n = 1, 2, 3, 4, the number of samples starts from $\dim \mathbb{P}_n^d + 1$. For comparison, the gPC Galerkin is also shown. Note that the Galerkin method uses exactly $p = \dim \mathbb{P}_n^d$ equations, but they are coupled. We observe that the quasi/near optimal LSQ methods are highly competitive against the gPC Galerkin method for this problem. With very similar accuracy, the quasi/near optimal LSQ methods solve only a slightly larger number of equations. However, the method is completely non-intrusive and much easier to adopt in practice than the Galerkin method. It can be seen that the results by nOPT have very similar accuracy, compared to gPC Galerkin, and the number of samples is only marginally larger than $p = \dim \mathbb{P}_n^d$. Since the gPC Galerkin solves p coupled equations, the quasi/near optimal methods represent convenient alternatives. It can also be seen that in this case, the nOPT (near-optimal) has slightly better accuracy than the S-optimal points

(quasi-optimal), especially at lower orders.


Figure 7.17: Error convergence versus the number of equations to solve, at different Legendre expansion degree n = 1, 2, 3, 4 and dimension d = 10. The gPC Galerkin solution at n = 4 is used as the reference solution.

7.2 Advanced Sparse Approximation

In this section we present several numerical examples of the $\ell_1$-$\ell_2$ minimization method. In function approximation, we first choose the target functions as sparse polynomial functions with arbitrarily chosen sparse coefficients. We then use the $\ell_1$-$\ell_2$ minimization to recover the target functions and report the recovery probability against the sparsity. We then show the results for non-sparse target functions. In all tests we use 100 independent trials and report the averaged results to reduce the statistical oscillations. Finally, we report the results for solving a stochastic diffusion equation, which is one of the standard benchmark problems in UQ.

The Legendre polynomials are employed in all tests. We use "direct-uniform (L1-L2)" to denote the standard $\ell_1$-$\ell_2$ minimization using samples from the uniform distribution, whose properties are stated in Theorems 4.3.2 and 4.3.3. We also use "pre-Chebyshev (L1-L2)" to denote the Chebyshev weighted $\ell_1$-$\ell_2$ minimization using samples from the Chebyshev distribution, whose properties are stated in Theorems 4.4.1 and 4.4.2. In both cases, we also perform the $\ell_1$ minimization methods, denoted as "direct-uniform (L1)" and "pre-Chebyshev (L1)", respectively, and compare the results against those by the $\ell_1$-$\ell_2$ minimization methods.

For the solution of the $\ell_1$-$\ell_2$ minimization problem, we employ the difference of convex functions algorithm (DCA) [92], which was adopted and modified to solve the $\ell_1$-$\ell_2$ minimization problem in [59, 105]. In particular, we used Algorithm 2 in [59] and set $\epsilon_{\text{outer}} = 10^{-6}$ and $\epsilon_{\text{inner}} = 10^{-7}$. The algorithm essentially requires an iteration over a standard $\ell_1$ minimization solver, and the iteration typically converges within 10 steps. Much more detailed analysis and pseudo-code of the algorithms can be found in [59].
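To make the structure of such an iteration concrete, the sketch below implements a generic DCA-type loop for the equality-constrained problem min ||x||_1 − ||x||_2 subject to Ax = b, where each subproblem is a weighted $\ell_1$ problem solved here as a linear program; this is a simplified illustration under our own assumptions, not the exact Algorithm 2 of [59] used in the experiments.

import numpy as np
from scipy.optimize import linprog

def l1_subproblem(A, b, v):
    """Solve min ||x||_1 - v^T x  s.t.  Ax = b, via the LP split x = u - w, u, w >= 0."""
    m, p = A.shape
    c = np.concatenate([np.ones(p) - v, np.ones(p) + v])
    A_eq = np.hstack([A, -A])
    res = linprog(c, A_eq=A_eq, b_eq=b, bounds=(0, None), method="highs")
    u, w = res.x[:p], res.x[p:]
    return u - w

def dca_l1_minus_l2(A, b, max_outer=10, tol=1e-6):
    """DCA-type iteration for min ||x||_1 - ||x||_2  s.t.  Ax = b."""
    x = l1_subproblem(A, b, np.zeros(A.shape[1]))   # initialize with the plain l1 solution
    for _ in range(max_outer):
        nrm = np.linalg.norm(x)
        v = x / nrm if nrm > 0 else np.zeros_like(x)  # subgradient of ||x||_2 at x
        x_new = l1_subproblem(A, b, v)
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# Example: recover a 5-sparse vector from 40 random measurements in dimension 100
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 100))
x_true = np.zeros(100); x_true[rng.choice(100, 5, replace=False)] = rng.standard_normal(5)
x_rec = dca_l1_minus_l2(A, A @ x_true)
print(np.max(np.abs(x_rec - x_true)))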

7.2.1 Function approximation: Sparse functions

We first choose a sparsity level s and then fix s coefficients of the target polynomials, while

keeping the rest of the coefficients zero. The values of the s non-zero coefficients are drawn

from a Gaussian distribution with zero mean and unit variance. The procedure produces

target coefficients $\bar{x}$ that we seek to recover using the $\ell_1$-$\ell_2$ minimization algorithms. We repeat the experiment 100 times for each fixed sparsity s and calculate the success rate. A recovery is considered successful when the resulting coefficient vector $x_{\text{opt}}$ satisfies $\|x_{\text{opt}} - \bar{x}\|_\infty \le 10^{-3}$.
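The experiment just described can be organized as a short driver loop; the sketch below (our own harness, reusing the dca_l1_minus_l2 routine sketched earlier and an unnormalized 1D Legendre model matrix for simplicity) illustrates how the success rate for a given sparsity level might be estimated.

import numpy as np

def success_rate(solver, n=80, s=5, m=40, trials=100, tol=1e-3, seed=0):
    """Fraction of trials in which an s-sparse Legendre coefficient vector is recovered."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(trials):
        x_bar = np.zeros(n + 1)
        support = rng.choice(n + 1, size=s, replace=False)
        x_bar[support] = rng.standard_normal(s)          # N(0,1) nonzero coefficients
        pts = rng.uniform(-1, 1, size=m)                  # uniformly distributed samples
        A = np.polynomial.legendre.legvander(pts, n)      # Legendre model matrix
        x_opt = solver(A, A @ x_bar)
        hits += np.max(np.abs(x_opt - x_bar)) <= tol      # success criterion from the text
    return hits / trials

# Example usage (assumes dca_l1_minus_l2 from the previous sketch is in scope):
# print(success_rate(dca_l1_minus_l2, n=80, s=5, m=40, trials=20))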

One-dimensional tests

We first examine the performance of the $\ell_1$-$\ell_2$ minimization methods in one spatial dimension (d = 1), $x \in [-1, 1]$. The degree of the polynomial space $\mathbb{P}_n^d$ is fixed at n = 80. As Theorems 4.3.2 and 4.4.1 indicate, the $\ell_1$-$\ell_2$ minimization will reach a success probability of $1 - p^{-\gamma \log^3(s)}$ only when a sufficiently large number of sample points is used. This is clearly demonstrated in Fig. 7.18, where the success rate is plotted against the increasing number of sample points, at fixed sparsity levels s. It is also obvious that the Chebyshev weighted $\ell_1$-$\ell_2$ minimization, pre-Chebyshev (L1-L2), produces better recovery probability than the unweighted version, direct-uniform (L1-L2). It should also be noted that both $\ell_1$-$\ell_2$ minimization algorithms are notably better than the two $\ell_1$ minimization methods, pre-Chebyshev (L1) and direct-uniform (L1). The numerical errors in the reconstructed functions are plotted in Fig. 7.19. Again, we observe the superior performance of the weighted $\ell_1$-$\ell_2$ minimization, pre-Chebyshev (L1-L2). Both $\ell_1$-$\ell_2$ minimization methods are consistently better than the corresponding $\ell_1$ minimization methods.

We then examine the effect of the sparsity s of the target functions. Here we fix the polynomial space $\mathbb{P}_n^d$ with n = 80 and the number of samples m = 60, and vary the sparsity s of the target polynomial functions. The probability of successful recovery is plotted in Fig. 7.20, while the numerical errors in the reconstructed Legendre polynomials are plotted in Fig. 7.21. Once again, we observe that both $\ell_1$-$\ell_2$ minimization algorithms perform better than the $\ell_1$ minimization algorithms, with the Chebyshev weighted $\ell_1$-$\ell_2$ minimization method offering the best performance.


Figure 7.18: Probability of successful recovery vs. number of sample points. Left: s = 5; Right: s = 10. Line patterns: dotted-circle, direct-uniform (L1); dotted-triangle, pre-Chebyshev (L1); solid-circle, direct-uniform (L1-L2); solid-triangle, pre-Chebyshev (L1-L2).


Figure 7.19: Reconstruction error vs. number of sample points. Left: s = 5; Right: s = 10. Line patterns: dotted-circle, direct-uniform (L1); dotted-triangle, pre-Chebyshev (L1); solid-circle, direct-uniform (L1-L2); solid-triangle, pre-Chebyshev (L1-L2).


Figure 7.20: Probability of successful recovery vs. sparsity s (n = 80 and m = 60). Line patterns: dotted-circle, direct-uniform (L1); dotted-triangle, pre-Chebyshev (L1); solid-circle, direct-uniform (L1-L2); solid-triangle, pre-Chebyshev (L1-L2).


Figure 7.21: Reconstruction error vs. sparsity s (n = 80 and m = 60). Line patterns: dotted-circle, direct-uniform (L1); dotted-triangle, pre-Chebyshev (L1); solid-circle, direct-uniform (L1-L2); solid-triangle, pre-Chebyshev (L1-L2).

Low-dimensional examples in d = 3

We then examine the performance of the methods in a relatively low dimension of d = 3. The polynomial space $\mathbb{P}_n^d$ is fixed at n = 10, which results in $p = \dim \mathbb{P}_n^d = 286$. The probability of successful recovery vs. the number of sample points (m) is plotted in Fig. 7.22, for a fixed sparsity level of s = 10. The corresponding numerical errors in the reconstructed Legendre polynomials are plotted in Fig. 7.23. Again, we observe that the two $\ell_1$-$\ell_2$ minimization algorithms produce notably better results than the two $\ell_1$ minimization algorithms. Note that the difference between the weighted and unweighted algorithms becomes smaller, compared to the results in one dimension in the

previous section.

We then examine the effect of the sparsity s of the target polynomial function, by fixing

the number of samples at m = 100. On the left of Fig. 7.24, the probability of successful

recovery is plotted; while on the right the numerical errors in the reconstructions are plotted.

We again observe the best performance by the Chebyshev weighted algorithm, pre-Chebyshev (L1-L2).


Figure 7.22: Probability of successful recovery vs. number of sample points (d = 3 and s = 10). Line patterns: dotted-circle, direct-uniform (L1); dotted-triangle, pre-Chebyshev (L1); solid-circle, direct-uniform (L1-L2); solid-triangle, pre-Chebyshev (L1-L2).


Figure 7.23: Reconstruction error vs. number of sample points (d = 3 and s = 10). Line patterns: dotted-circle, direct-uniform (L1); dotted-triangle, pre-Chebyshev (L1); solid-circle, direct-uniform (L1-L2); solid-triangle, pre-Chebyshev (L1-L2).

High-dimensional examples in d = 10

We now provide a set of tests in a higher dimension of d = 10. We fix the polynomial space

$\mathbb{P}_n^d$ at n = 4, which results in $p = \dim \mathbb{P}_n^d = 1{,}001$. Note that in practical applications, one can rarely afford to use high-degree polynomials in high dimensions; thus the condition

d ≥ n in Theorem 4.3.2 is often satisfied.

The numerical errors in the reconstructions are plotted in Fig. 7.25 and Fig. 7.26, for

fixed sparsity s = 10 and s = 20, respectively. We observe a sharp error reduction at

around m ∼ 100 samples. This is well below the cardinality of the polynomial space,

$m \ll p = 1{,}001$. It demonstrates the effectiveness of the sparse approximation algorithms.

The results also show that, contrary to the performance in low dimensions, it is the unweighted $\ell_1$-$\ell_2$ minimization algorithm, direct-uniform (L1-L2), that offers the best performance. This is similar to the finding in [102], where it was shown that the unweighted $\ell_1$ minimization should be preferred in high dimensions over the Chebyshev weighted $\ell_1$ minimization. The results of the two $\ell_1$ minimization methods are also shown in Fig. 7.25 and Fig. 7.26


Figure 7.24: Performance vs. sparsity level s for d = 3 and m = 100. Left: probability of successful recovery; Right: numerical errors in the reconstruction. Line patterns: dotted-circle, direct-uniform (L1); dotted-triangle, pre-Chebyshev (L1); solid-circle, direct-uniform (L1-L2); solid-triangle, pre-Chebyshev (L1-L2).

for comparison. Once again, we observe that the $\ell_1$-$\ell_2$ minimization algorithms outperform the $\ell_1$ minimization algorithms. We also emphasize that the difference between the unweighted $\ell_1$-$\ell_2$ minimization and the Chebyshev weighted $\ell_1$-$\ell_2$ minimization is much smaller, compared to the difference between the two $\ell_1$ minimization methods reported in [102].

7.2.2 Function approximation: Non-sparse functions

In this section we present numerical tests for approximations of general non-sparse functions.

To quantify the performance of the algorithms, we evaluate the discrete $\ell_2$ errors of the reconstructed solutions against the exact functions at 1,000 random points. These points are different from the random samples used in the sparse reconstructions.

One-dimensional tests

In one dimension (d = 1), we consider the well-known Runge function

$$f(z) = \frac{1}{1 + 25z^2}, \qquad z \in [-1, 1].$$

We fix the polynomial space $\mathbb{P}_k^d$ at k = 80 and increase the number (m) of samples to obtain the sparse approximations. The numerical errors are plotted in Fig. 7.27. Once again,

128 100

10-2

10-4

10-6

10-8

10-10 101 102 103 Number of samples

Figure 7.25: Reconstruction error vs. number of points (d = 10 and s = 10). Line patterns: dotted-circle, direct-uniform (L1); dotted-triangle, pre-Chebyshev (L1); solid-circle, direct-uniform (L1-L2); solid-triangle, pre-Chebyshev (L1-L2).


Figure 7.26: Reconstruction error vs. number of points (d = 10 and s = 20). Line patterns: dotted-circle, direct-uniform (L1); dotted-triangle, pre-Chebyshev (L1); solid-circle, direct-uniform (L1-L2); solid-triangle, pre-Chebyshev (L1-L2).

we observe that the $\ell_1$-$\ell_2$ minimization algorithms perform better than the $\ell_1$ minimization algorithms.


Figure 7.27: Reconstruction errors of the 1D Runge function vs. the number of sample points (d = 1, k = 80). Line patterns: dotted-circle, direct-uniform (L1); dotted-triangle, pre-Chebyshev (L1); solid-circle, direct-uniform (L1-L2); solid-triangle, pre-Chebyshev (L1-L2).

Two-dimensional tests

In two dimensions (d = 2), we consider the following two functions: for $z^{(i)} \in [0, 1]$, $i = 1, \dots, d$,

$$f_7 = \exp\left( -\sum_{i=1}^{d} \sigma_i \left| z^{(i)} - \chi_i \right| \right), \qquad f_8 = \left( 1 + \sum_{i=1}^{d} \sigma_i z^{(i)} \right)^{-(d+1)}, \quad \sigma_i = \frac{1}{i}. \qquad (7.6)$$

Here, σ = [σ1, ··· , σd] are parameters controlling the difficulty of the functions, and χ =

[χ1, ··· , χd] are shifting parameters.

We fix the polynomial space $\mathbb{P}_k^d$ with k = 15. The numerical errors are reported in Fig. 7.28. Again, we observe the better performance of the $\ell_1$-$\ell_2$ minimization algorithms over the $\ell_1$ minimization algorithms, with the Chebyshev weighted $\ell_1$-$\ell_2$ minimization algorithm, pre-Chebyshev (L1-L2), offering the best performance.


Figure 7.28: Reconstruction errors vs. number of sample points (d = 2 and k = 15). Left: f7 in (7.6) with σ = [−2, 1] and χ = [0.25, −0.75]; Right: f8 in (7.6). Line patterns: dotted-circle, direct-uniform (L1); dotted-triangle, pre-Chebyshev (L1); solid-circle, direct-uniform (L1-L2); solid-triangle, pre-Chebyshev (L1-L2).

Numerical tests in higher dimensions, e.g., d = 10, are also conducted. We omit the

results, as they are similar to those obtained by solving the stochastic elliptic equation.

Instead, we now present the results for the stochastic elliptic equation.

7.2.3 Stochastic collocation: Elliptic problem

A direct application of the $\ell_1$-$\ell_2$ minimization method is stochastic collocation, which is one of the major methods in uncertainty quantification (UQ).

We now consider the stochastic elliptic equation, one of the most widely used benchmark problems in UQ, in one spatial dimension, described in (7.4). We first test a relatively low-dimensional random input case with d = 3. The polynomial space $\mathbb{P}_n^d$ is set at n = 10, which results in $p = \dim \mathbb{P}_n^d = 286$. The numerical errors are plotted in Fig. 7.29. We observe rather smooth error decay with the increasing number of sample points. Again, the $\ell_1$-$\ell_2$ minimization algorithms offer better performance than the $\ell_1$ minimization algorithms. For further comparison, we also plot the errors by the sparse grids collocation method using Clenshaw-Curtis grids (cf. [100]). The sparse approximation methods, both the $\ell_1$-$\ell_2$ minimization and the $\ell_1$ minimization, are notably superior to the sparse grids method.


Figure 7.29: Approximation errors of the stochastic elliptic equation (7.4) vs. number of sample points (d = 3 and n = 10). Line patterns: dotted-circle, direct-uniform (L1); dotted-triangle, pre-Chebyshev (L1); solid-circle, direct-uniform (L1-L2); solid-triangle, pre-Chebyshev (L1-L2); solid-square, sparse grid.

Next we vary the dimensionality of the random inputs, i.e., d in (7.5). Computations were performed for various values of d. Here we report the results at d = 10. The polynomial space $\mathbb{P}_n^d$ is fixed at n = 4, which results in $p = \dim \mathbb{P}_n^d = 1{,}001$. The results are plotted in Fig. 7.30. Again, we observe the better performance of the $\ell_1$-$\ell_2$ minimization algorithms. In such a high-dimensional case, the unweighted version, direct-uniform (L1-L2), is better than the Chebyshev weighted version, pre-Chebyshev (L1-L2), contrary to the low-dimensional cases. The difference between the unweighted and weighted algorithms is, however, much less obvious than the difference between the two $\ell_1$ minimization algorithms. For further comparison, we also plot the errors obtained by the sparse grids collocation method. Both the $\ell_1$-$\ell_2$ and $\ell_1$ minimization methods are notably better than the sparse grids method.


Figure 7.30: Reconstruction error vs. number of sample points (d = 10 and n = 4). Line patterns: dotted-circle, direct-uniform (L1); dotted-triangle, pre-Chebyshev (L1); solid-circle, direct-uniform (L1-L2); solid-triangle, pre-Chebyshev (L1-L2); solid-square, sparse grid.

7.3 Sequential Approximation

We provide numerical examples to demonstrate the performance of the sequential function approximation method (5.3). Two types of error measures are employed. One is the error in the expansion coefficient $c^{(k)}$ at the kth iteration of the sequential algorithm (5.3), compared against the best approximation coefficient $\hat{c}$ in (2.11). The best approximation coefficients $\hat{c}$ are computed by high-accuracy numerical quadrature. This is straightforward in low dimensions; it becomes difficult in high dimensions.

The vector 2-norm shall be used to examine the difference. The other error measure is the error in the function approximation $\tilde{f}^{(k)}$ at the kth step (5.17). To examine this error we independently draw a set of dense samples from the underlying probability measure ω (2.2), and compare the difference between $\tilde{f}^{(k)}$ and the target function f at these sampling points. The vector 2-norm is again used to report the difference. Unless otherwise noted, we report the averaged results over 50 independent simulations to reduce the statistical oscillations in the results.
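To indicate the flavor of such a sequential scheme, the sketch below implements a Kaczmarz-type one-sample coefficient update; this is our own schematic stand-in, written only to illustrate the one-sample, matrix-free structure, and the precise update used in the experiments is the one defined by (5.3) in Chapter 5.

import numpy as np

def sequential_fit(basis, target, sample, p, num_iter):
    """Schematic one-sample-at-a-time coefficient update (Kaczmarz-type sketch).

    basis(x)  -> length-p vector of basis function values at one point x (assumption),
    target(x) -> scalar function value at x,
    sample()  -> draws one point from the chosen sampling measure.
    """
    c = np.zeros(p)
    for _ in range(num_iter):
        x = sample()
        phi = basis(x)                      # row vector of basis values; no matrix is formed
        r = target(x) - phi @ c             # residual at the new sample
        c += r * phi / (phi @ phi)          # project onto the new data constraint
    return c

# Example: degree-5 Legendre basis on [-1,1] with Chebyshev (equilibrium) sampling
deg = 5
basis = lambda x: np.polynomial.legendre.legvander(np.atleast_1d(x), deg)[0]
target = lambda x: np.exp(-x**2)
sample = lambda: np.cos(np.pi * np.random.rand())   # Chebyshev-distributed point
c = sequential_fit(basis, target, sample, deg + 1, 20000)
print(c[:3])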

7.3.1 Continuous sampling

We shall first comprehensively report results in two dimensions (d = 2), using both orthogonal and non-orthogonal bases, to verify the theoretical results. We then report more results in higher dimensions (d ≥ 10), to demonstrate the effectiveness of the method.

The theoretical results in Chapter 5.3 indicate that (by Jensen's inequality) the errors of the sequential method at the kth iterate should scale as $\sim r^{k/2}$. This is exponential convergence. Therefore, we use a semi-log scale to plot the errors against the iteration count. The slope of the resulting convergence curve is then estimated numerically as $s_{\text{num}}$. This is related to the logarithm of the convergence rate in the following way,

$$s_{\text{num}} = \frac{1}{2} \log r_{\text{num}}. \qquad (7.7)$$

On the other hand, whenever possible we also report the optimal convergence rate via

$$s_{\text{opt}} = \frac{1}{2} \log r_{\text{opt}}, \qquad r_{\text{opt}} = 1 - \frac{1}{p}. \qquad (7.8)$$
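The numerical slope $s_{\text{num}}$ is simply the least squares slope of log(error) against the iteration count; a short Python sketch (our own, using the convention of (7.7)) is:

import numpy as np

def convergence_slope(iterations, errors):
    """Least squares slope of log(error) vs. iteration count, i.e., s_num in (7.7)."""
    s_num = np.polyfit(np.asarray(iterations, float), np.log(errors), 1)[0]
    r_num = np.exp(2.0 * s_num)        # invert (7.7): r_num = exp(2 s_num)
    return s_num, r_num

# Example: synthetic errors decaying like r^(k/2) with r = 1 - 1/21 (p = 21)
k = np.arange(1, 2001)
errs = (1 - 1/21) ** (k / 2)
print(convergence_slope(k, errs))      # s_num close to 0.5 * log(1 - 1/21)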

Two dimensions: non-orthogonal basis by radial basis functions

We first focus on the two-dimensional case with $D = [-1, 1]^2$. We employ a basis using

radial basis functions, which are non-orthogonal. In particular, we consider the Gaussian

radial basis function (RBF),

$$\phi_i(x) = \exp\left( -\epsilon^2 \|x - z_i\|_2^2 \right), \qquad (7.9)$$

centered at the points $z_i$ with parameter $\epsilon > 0$. The centers $z_i$ are chosen to be the tensor product of equidistant points. Let n be the number of equidistant points in each dimension. The total number of basis functions is then p = n × n. In the following results, we use $\epsilon = 4.1$.
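A small Python sketch of this Gaussian RBF basis on $[-1,1]^2$, with tensor-product equidistant centers and $\epsilon = 4.1$, is given below; the helper names are our own.

import numpy as np

def rbf_centers(n):
    """Tensor product of n equidistant points per dimension in [-1,1]^2."""
    pts = np.linspace(-1.0, 1.0, n)
    X, Y = np.meshgrid(pts, pts)
    return np.column_stack([X.ravel(), Y.ravel()])       # shape (n*n, 2)

def rbf_matrix(x, centers, eps=4.1):
    """Model matrix with phi_i(x) = exp(-eps^2 ||x - z_i||_2^2) as in (7.9)."""
    d2 = np.sum((x[:, None, :] - centers[None, :, :])**2, axis=2)
    return np.exp(-eps**2 * d2)

# Example: p = 9 x 9 Gaussian RBF basis evaluated at 5 random points
Z = rbf_centers(9)
x = np.random.uniform(-1, 1, size=(5, 2))
print(rbf_matrix(x, Z).shape)                            # (5, 81)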

In Fig. 7.31, the approximation results for the Franke function (7.2) are shown. The

errors are the function approximation errors evaluated at 20, 000 independently generated

points. On the left, we show the results with respect to the number of iterations, obtained

by uniformly distributed sampling measure; and on the right those by tensor Chebyshev

sampling measure. We clearly observe the convergence with respect to the iteration count.

The errors saturate at certain levels, which are dictated by the number of basis functions

used in the computations. These results verify the results in Theorem 5.3.1 and 5.3.2, as

the convergence of the sequential method (5.3) can be obtained by any reasonable sampling

measure and non-orthogonal basis.

Two dimensions: orthogonal basis by Legendre polynomials

We now present the results obtained by normalized Legendre polynomials. We employ the

standard total degree polynomial space (2.30) with degree n, whose number of basis func-

tions follows (2.31). The errors reported here are the errors in the coefficient vector, where

the best approximation coefficients $\hat{c}$ are computed by Gauss quadrature with a sufficiently


Figure 7.31: Approximation errors versus number of iterations by the Gaussian RBF basis for the Franke function (7.2), with 6 × 6, 9 × 9, and 12 × 12 basis functions. Left: Results by the uniform sampling measure; Right: Results by the Chebyshev sampling measure.

large number of quadrature points.

The convergence of the errors with respect to the iteration count is shown in Fig. 7.32,

for all the test functions in (7.1) and at different polynomial degrees. Here the sampling measure is the tensor Chebyshev measure (5.26), which is the equilibrium measure $d\mu_D$. As noted in Section 5.3, this is a good approximation to the optimal sampling measure $\mu_n$ (5.24), which shall result in the optimal error convergence rate (1 − 1/p) by Corollary 5.3.1.
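Samples from the tensor Chebyshev (equilibrium) measure on $[-1,1]^d$ are easy to generate coordinate-wise via the cosine transform of uniform variates; a brief sketch (our own) is:

import numpy as np

def sample_tensor_chebyshev(num_samples, d, rng=None):
    """Draw points in [-1,1]^d whose coordinates follow the Chebyshev (arcsine) density."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform(0.0, 1.0, size=(num_samples, d))
    return np.cos(np.pi * u)          # each coordinate has density 1/(pi*sqrt(1-x^2))

# Example: 5 samples in d = 2
print(sample_tensor_chebyshev(5, 2))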

The results in Fig. 7.32 clearly demonstrate the exponential convergence rate. The rate becomes smaller for higher degree polynomials (which have larger p), as expected. The error saturation level also becomes smaller at higher polynomial degree, as expected.

The quantities $s_{\text{num}}$ and $s_{\text{opt}}$, which measure half the logarithm of the numerical convergence rate by (7.7) and of the optimal convergence rate by (7.8), respectively, are tabulated in Table 7.1, for all the test functions in (7.1) at various polynomial degrees. We observe that the numerical rates match very well with the optimal rates, especially at relatively high polynomial degrees n ≥ 10. This verifies that the equilibrium measure $\mu_D$ (5.26), from which we draw samples in all the examples, is indeed a good approximation of the optimal sampling measure $\mu_n$ (5.24).


Figure 7.32: Errors versus number of iterations by Legendre polynomials for the four test functions (7.1) at d = 2. Top left: f1 with σ = [1, 1],χ = [1, 0.5]; Top right: f2 with σ = [−2, 1],χ = [0.25, −0.75]; Bottom left: f3; Bottom right: f4 with σ = [−3, 2],χ = [0.5, 0.5].

          f1                                       f2
  n     s_num          s_opt              n      s_num          s_opt
  5     -2.5 x 10^-2   -2.4 x 10^-2       20     -2.2 x 10^-3   -2.2 x 10^-3
  10    -7.9 x 10^-3   -7.6 x 10^-3       30     -1.0 x 10^-3   -1.0 x 10^-3
  20    -2.2 x 10^-3   -2.2 x 10^-3       40     -5.8 x 10^-4   -5.8 x 10^-4

          f3                                       f4
  n     s_num          s_opt              n      s_num          s_opt
  5     -2.3 x 10^-2   -2.4 x 10^-2       10     -7.9 x 10^-3   -7.6 x 10^-3
  10    -7.9 x 10^-3   -7.6 x 10^-3       20     -2.2 x 10^-3   -2.2 x 10^-3
  20    -2.2 x 10^-3   -2.2 x 10^-3       30     -1.0 x 10^-3   -1.0 x 10^-3

Table 7.1: Numerical rate of convergence $s_{\text{num}}$ (7.7) against the optimal rate of convergence $s_{\text{opt}}$ (7.8) in two dimensions (d = 2), for the test functions (7.1) at different polynomial degrees.

Higher dimensions (d ≥ 10): Legendre polynomial basis

For higher dimensional tests, we focus on the orthogonal basis case and use normalized

Legendre polynomials in $D = [-1, 1]^d$. We also focus on the GAUSSIAN function in (7.1), as the results for other functions in (7.1) behave similarly in terms of the convergence rate. At high dimensions $d \gg 1$, accurate computation of the best approximation coefficient $\hat{c}$ is not easy. Therefore, we examine the numerical errors via the function approximation errors of $\tilde{f}^{(k)}$, evaluated at 100,000 points independent from the simulations. The sampling measure is set to be the tensor Chebyshev measure, the equilibrium measure, from (5.26).

First we consider the d = 10 case and present the results for f1 from (7.1). In Fig. 7.33, we plot the error decay at different degrees with respect to the number of iterations. The

exponential convergence of the errors is clearly shown. The errors saturate at lower levels

for higher degree polynomials, as expected. The numerical errors in the converged results

are plotted in Fig. 7.34, with respect to the different degree of polynomials. One clearly

observes the exponential convergence of the approximation error at increasing degree of

polynomials. This is expected for this smooth target function. The number of iterations m to achieve the converged solutions is tabulated in Table 7.2. For reference, we also tabulate the cardinality of the polynomial space, p, at different polynomial degrees. In this case, the number of iterations m is also the number of function data. These results indicate that the sequential method is indeed suitable for "big data" problems, where a large amount of data can be collected. At high degrees n ≥ 9, the sequential method never requires the formation of the model matrix, which would be of size $O(10^6 \times 10^5)$ and cannot be easily handled. Instead, the method requires only operations on the row vectors, whose size is $O(1 \times 10^5)$. All

the computations here were performed on a standard desktop computer with Intel i7-4770

CPU at 3.40GHz and with 24.0GB RAM.

The half logarithm of the numerical convergence rate and the optimal convergence rate

are tabulated in Table 7.3. We now observe differences between the rates. The numerical

rates are notably slower than the optimal rates. This is not surprising, as the equilibrium

measure µD, from which we draw the samples, is not a good approximation of the optimal


Figure 7.33: Approximation errors at varying degrees from n = 5 to 9 versus the number of iterations by Legendre polynomials at d = 10. The target function is the GAUSSIAN function f1 in (7.1) with $\sigma_i = 1$ and $\chi_i = 0.5$.


Figure 7.34: Approximation errors in the converged solutions with respect to the polynomial degree by Legendre polynomials at d = 10. The target function is the GAUSSIAN function f1 in (7.1) with σi = 1 and χi = 0.5.

d = 10     n = 1      n = 2      n = 3        n = 4        n = 5
p          11         66         286          1,001        3,003
m          10,000     10,000     10,000       26,000       80,000

           n = 6      n = 7      n = 8        n = 9        n = 10
p          8,008      19,448     43,758       92,378       184,756
m          220,000    500,000    1,700,000    3,000,000    7,000,000

Table 7.2: The number of iterations (m) used to reach the converged solutions in Fig. 7.33, along with the cardinality of the polynomial space (p) at dimension d = 10. The target function is the GAUSSIAN function f1 in (7.1) with σi = 1 and χi = 0.5.

sampling measure $\mu_n$ at the relatively low polynomial degrees n. The higher dimension d = 10 would also magnify the effect of the difference.

         n = 5          n = 6          n = 7          n = 8          n = 9
s_num    -6.7 x 10^-5   -3.2 x 10^-5   -1.4 x 10^-5   -6.8 x 10^-6   -3.4 x 10^-6
s_opt    -1.7 x 10^-4   -6.2 x 10^-5   -2.6 x 10^-5   -1.1 x 10^-5   -5.4 x 10^-6

Table 7.3: Numerical rate of convergence snum (7.7) against the optimal rate of convergence sopt (7.8) in ten dimension d = 10 for the GAUSSIAN function f1 in (7.1) with σi = 1 and χi = 0.5.

For higher dimensions, we present the results in d = 20 and d = 40 in Fig. 7.35. As the complexity of the problem grows exponentially in higher dimensions, we confine ourselves to polynomial degree n ≤ 5. The numbers of iterations/data m required to reach the converged solution are tabulated in Table 7.4, along with the cardinality p of the polynomial space.

Again, all the simulations were conducted on the same desktop computer.

Optimal sampling measure

In the examples by Legendre orthogonal polynomials, we have employed the equilibrium measure µD in (5.26) as the sampling measure. It can be considered as an approximation to the optimal sampling measure µn in (5.24), at reasonably high polynomial degrees. Based on the results in Theorem 5.3.1 and 5.3.2, the optimal sampling measure µn should deliver

140 0 10 10-1 n=3 n=2 n=4 n=3 n=5 n=4

10-1

10-2

10-3 10-2 0 1M 2M 3M 4M 5M 6M 7M 0 5M 10M 15M 20M 25M

Figure 7.35: Error convergence with respect to the iteration count for the GAUSSAN function f1 with σi = 1 and χi = 0.5 by Legendre polynomials. Left: d = 20; Right: d = 40.

d = 20 n = 1 n = 2 n = 3 n = 4 n = 5 p 21 231 1,771 10,626 53,130 m 10,000 50,000 200,000 900,000 4,000,000 d = 40 n = 1 n = 2 n = 3 n = 4 n = 5 p 41 861 12,341 135,751 1,221,759 m 1,000 100,000 1,500,000 20,700,000 —

Table 7.4: The number of iterations (m) used to reach the converged solutions in Fig. 7.35, along with the cardinality of the polynomial space (p) at dimensions d = 20 and d = 40.

the optimal convergence rate (1 − 1/p). Here we examine the difference of using µn and µD.

Samples from µn are readily generated by MCMC. We first examine the low dimensional case d = 2. In Fig. 7.36, the results for approx- imating the GAUSSIAN function f1 in (7.1) are shown. We observe that the difference in the results is negligible, even at the low polynomial degree of n = 1. This indicates that

the equilibrium measure µD can indeed be a good replacement of the optimal measure µn in low dimensions.

At higher dimensions, however, the difference starts to become more visible. Here we

present the results at a high dimension of d = 40. The results for the target function f1 in (7.1) are shown in Fig. 7.37. We now clearly observe that the use of the equilibrium

141 100 0 10 100 dµ 1 dµ dµ 5 20 dµ dµ dµ 10-1

10-5 10-2

10-1

10-3 10-10

10-4

10-2 10-5 10-15 0 10 20 30 40 50 0 100 200 300 400 500 0 5K 10K 15K 20K

Figure 7.36: Errors versus number of iterations by Legendre polynomials for the GAUSSIAN with σ = [1, 1], χ = [1, 0.5] at d = 2. The solid line is the results by µn and the dashed line is by µ. From left to right, the polynomial degrees are set to be n = 1, n = 5 and n = 20.

measure µD results in much slower convergence than the optimal measure µn. The differ-

ence between µD and µn becomes magnified at higher dimensions, which is not surprising. Therefore, we conclude that at higher dimensions it is advantageous to use the optimal

sampling measure, for faster convergence. However, we emphasize again that the method

shall converge regardless what measure one uses.

-1 -1 10 10 10-1 dµ dµ dµ 1 2 3 dµ dµ dµ

-2 -2 10 10 10-2

10-3 10-3 10-3 0 200 400 600 800 1000 0 10K 20K 30K 40K 50K 0 100K 200K 300K 400K 500K

Figure 7.37: Errors versus number of iterations by Legendre polynomials for the GAUSSIAN with σi = 1, χi = 0.5 at d = 40. The solid line is the results by µn and the dashed line is by µ. The polynomial degrees are set to be (left) n = 1, (middle) n = 2 and (right) n = 3.

7.3.2 Discrete sampling on the tensor Gauss quadrature

We present several numerical examples to demonstrate the effectiveness of the sequential approximation on a tensor quadrature grid. We consider three types of different, and rep-

142 resentative, polynomials as the basis. They are Legendre polynomials in bounded domain,

Hermite polynomials in unbounded domain, and trigonometric polynomials in periodic do- main. The dimensions of our examples include low-dimension d = 2, intermediate dimen- sions d = 10 and d = 40, and high dimensions d = 100 and d = 500.

We note that for the functions of product form, f1, f2, and f4, all the terms in the convergence estimation (5.35) can be numerically computed with high accuracy. This gives

us a direct comparison between the theoretical convergence result (5.35) and the numerical

convergence.

In low dimension d = 2, we shall present numerical results averaged over 100 independent

simulations. This gives us a better comparison against Theorem 5.4.1, whose results are

stated in expectation. In higher dimensions, all numerical results are obtained over single

simulation.

The convergence behaviors for all the target functions are very similar. Therefore, we

rather arbitrarily choose subsets of the results to present here.

Low dimension d = 2

We first present comprehensive results in two dimension (d = 2). At this low dimension,

the computation (5.32) is cheap. We therefore present averaged numerical results over 100

independent simulations to reduce the statistical oscillations.

Legendre polynomial

We first consider the normalized Legendre polynomials approximation for the target func-

2 tions f1, f2, f3 and f4 in (7.1) over the domain D = [−1, 1] . The tensor quadrature point set is constructed by the standard one-dimensional Gauss-Legendre quadrature points. The

convergence history of the function approximation errors are displayed for different polyno-

mial degrees n in Fig. 7.38, where the theoretical rates are also plotted. Good agreement 1/2  1  between the numerical convergence rate and the theoretical rate 1 − p is observed. As excepted, the convergence rate becomes smaller for higher degree (n) polynomials, because the cardinality (p) of the polynomial space becomes larger. The saturation level of the

143 errors becomes smaller at higher degree polynomial, due to the higher order accuracy.

0 0 10 10 n=5 n=10 n=20 10−1 10−5 Theoretical rates 10−2

10−10

10−3 n=20 n=30 10−15 Theoretical rates n=40 −4 10 0 5K 10K 15K 20K 0 5K 10K 15K 20K

0 2 10 10 n=5 n=10 n=20 100 Theoretical rates 10−5 10−2

10−4

10−10 −6 10 n=10 n=20 Theoretical rates n=30 −8 10 0 5K 10K 15K 20K 0 5K 10K 15K 20K

Figure 7.38: Function approximation errors versus number of iterations by Legendre poly- nomials for four test functions in (7.1) at d = 2. Top left: f1 with σ = [1, 1] and χ = [1, 0.5]; Top right: f2 with σ = [−2, 1] and χ = [0.25, −0.75]; Bottom left: f3; Bottom right: f4 with σ = [−3, 2] and χ = [0.5, 0.5].

Hermite polynomial

We then examine the use of Hermite polynomials in unbounded domain. The normalized

Hermite polynomials are used. The tensor quadrature set is constructed via the standard

one-dimensional Gauss-Hermite quadrature points. The results for f1, f2, f4, and f5, are

144 shown in Fig. 7.39, along with the theoretical rates. Again, we observe excellent agreement between the numerical convergence and the theoretical convergence.

0 0 10 10

10−2 10−1 Theoretical rates Theoretical rates 10−4

10−2 10−6 n=10 n=10 n=15 n=20 n=20 n=40 −8 −3 10 10 0 5K 10K 15K 20K 0 5K 10K 15K 20K

0 0 10 10

10−2

10−4 Theoretical rates 10−5 Theoretical rates

10−6

10−8 10−10

10−10 n=5 n=10 n=10 n=15 n=20 n=20 −12 −15 10 10 0 5K 10K 15K 20K 0 5K 10K 15K 20K

Figure 7.39: Function approximation errors versus number of iterations by Hermite poly- nomials for four of test functions in (7.1) and (7.3) at d = 2. Top left: f1 with σ = [1, 1] and χ = [0.55, 0.5]; Top right: f2 with σ = [0.2, 0.1] and χ = [0.5, 0.5]; Bottom left: f4 with 1 σ = [− 3 , 0.5] and χ = [0.5, 0.5]; Bottom right: f5 with χi = 0.

Trigonometric polynomial

We now examine the use of trigonometric polynomials for function approximation, which is

essentially multivariate Fourier expansion. The domain is D = [0, 2π]d, and linear subspace

145 for approximation is ˆd  iα·x Pn := span e , |α| ≤ n/2 , (7.10) where α = (α1, . . . , αd) is multi-index with |α| = |α1| + ··· + |αd| and αi ∈ Z, i = 1, ··· , d. The cardinality of this space is

d X d n  p = dim Πˆ d = 2d−i 2 . (7.11) n i d − i n i=max{d− 2 ,0}

It is well known that the trapezoidal rule on uniform grids with equal weights is a quadrature rule for trigonometric polynomials. Consequently, we employ (n + 1) equally spaced points

{2π`/(n + 1), ` = 0, 1, ··· , n} in each dimension to construct the tensor quadrature points. ˆ d This ensures the exactness of the tensor quadrature for any functions in Π2n. In this case, the optimal sampling probability distribution (5.31) becomes the discrete uniform distribution.

In Fig. 7.40, the numerical convergence of the errors in f1, f2, f4 and f6 is shown. Again, excellent agreement between the numerical convergence rate and the theoretical convergence rate can be observed. The errors saturate at smaller levels when larger n is

used, as expected.

Intermediate dimensions d = 10 and d = 40

We now focus on dimensions d = 10 and d = 40. Hereafter all numerical results are reported

as those of single simulation.

Legendre polynomial

We first consider the approximation in d = 10 dimension using the normalized Legendre

10 polynomials. The results for approximating f1 in (7.1) over D = [−1, 1] are presented in Fig. 7.41 for different degrees. The exponential convergence of the errors with respect to

the iteration count can be clearly observed. For the polynomial degrees n = 5, 6, 7, 8, 9, the cardinality of the polynomial space is p = 3003, 8008, 19448, 43758 and 92378, respectively, as shown in Table 7.2. We observe from Fig. 7.41 that the ∼ 10p iteration count is indeed

a good (and quite conservative) criterion for converged solution.

146 0 0 10 10

10−1 10−1

Theoretical rates 10−2 Theoretical rates 10−2 10−3 n=20 n=20 n=30 n=30 n=40 n=40 −4 −3 10 10 0 5K 10K 15K 20K 0 5K 10K 15K 20K

1 1 10 10

100 100

Theoretical rates Theoretical rates 10−1 10−1

10−2 10−2 n=20 n=20 n=30 n=30 n=40 n=40 −3 −3 10 10 0 5K 10K 15K 20K 0 5K 10K 15K 20K

Figure 7.40: Function approximation errors versus number of iterations by trigonometric polynomials for four of test functions in (7.1) and (7.3) at d = 2. Top left: f1 with σi = 1 π+1 π+1 and χi = 2 ; Top right: f2 with σi = 1 and χi = 2 ; Bottom left: f4 with σi = 2 and π+1 χi = 2 ; Bottom right: f6.

147 Furthermore, for this separable function, all the terms in the theoretical convergence formula (5.35) in Theorem 5.4.1 can be accurately computed. We thus obtain

 k (k) 2 −5 1 n = 7 : fe − f ' 6.1279 × 10 + 172.5288 × 1 − , E ω 19448 and  k (k) 2 −6 1 n = 8 : fe − f ' 2.8532 × 10 + 172.5289 × 1 − . E ω 43758 We plot these “theoretical curves” in Fig. 7.42, along with the numerical convergence over a single simulation. We observe that they agree with each other very well and the difference is indistinguishable.

0 10 n=5 n=5 0 10 n=6 n=6 n=7 10−1 n=7 −1 n=8 n=8 10 n=9 n=9 10−2 10−2

10−3 10−3

−4 10−4 10

−5 −5 10 10 0 500K 1000K 1500K 2000K 0 500K 1000K 1500K 2000K

Figure 7.41: Coefficient errors (left) and function approximation errors (right) versus number of iterations by Legendre polynomials of different degree n at d = 10. The target function is the GAUSSIAN function f1 in (7.1) with σi = 1 and χi = 0.375.

For higher dimensions, we present the results of d = 40 in Fig. 7.43. As the complexity

of the problem grows exponentially in higher dimensions, we confine ourselves to polynomial

degree n ≤ 4. Again, the “theoretical convergence curves” agree very well with the actual

numerical convergence.

We now examine the convergence rate with different sampling probability. As Theorem

5.3.1 indicates, the sequential approximation on a tensor quadrature grid shall converge via

148 0 0 10 10 Numerical Numerical Theoretical Theoretical 10−1 10−1

10−2

10−2

10−3

10−3 10−4

−4 −5 10 10 0 100K 200K 300K 400K 0 200K 400K 600K 800K 1000K 1200K

Figure 7.42: Numerical and theoretical function approximation errors versus number of iterations by Legendre polynomials at d = 10: n = 7 (left) and n = 8 (right). The target function is the GAUSSIAN function f1 in (7.1) with σi = 1 and χi = 0.375.

any proper discrete sampling probability, as in the general algorithm (5.3). However, the optimal sampling probability (5.31) in the algorithm (5.3) results in the algorithm (5.32) and this shall give us the optimal rate of convergence. This can be clearly seen in Fig. 7.44, where the numerical convergence by the optimal sampling probability (5.31) and by the discrete uniform probability are shown. Top of Fig. 7.44 is for the GAUSSIAN function f1 and its bottom for the function f5. We clearly see that the general algorithm (5.3) using the non-optimal sampling measure does converge. However, its rate of convergence is significantly slower than that of the algorithm (5.32) which is a version of (5.3) using an optimal discrete measure. The difference in performance grows bigger in higher dimensions.

Consequently, we shall focus exclusively on the algorithm (5.32).

Hermite polynomial

We now consider approximations in unbounded domain using Hermite polynomial in d = 10 and d = 40. We present the results of function f5 in (7.3). The results are shown in Fig. 7.45 and 7.46, for the case of d = 10 and d = 40, respectively. Again, we observe the exponential rate of convergence and its good agreement with the theoretical rate.

149 −1 10 −1 Numerical 10 Numerical Theoretical Theoretical

10−2 10−2

−3 10 −3 0 5K 10K 15K 20K 10 0 200 400 600 800 1000

−1 −1 10 10 Numerical Numerical Theoretical Theoretical

−2 10 10−2

−3 −3 10 10 0 50K 100K 150K 200K 0 500K 1000K 1500K 2000K

Figure 7.43: Numerical and theoretical function approximation errors versus number of iterations by Legendre polynomials at d = 40. Top left: n = 1; Top right: n = 2; Bottom left: n = 3; Bottom right: n = 4. The target function is the GAUSSIAN function f1 in (7.1) with σi = 1 and χi = 0.375.

150 −1 −1 10 10 Optimal Optimal Nonoptimal Nonoptimal

−2 −2 10 10

−3 −3 10 10 0 100K 200K 300K 400K 0 2M 4M 6M 8M 10M

0 0 10 10 Optimal Optimal rate Nonoptimal

−1 −1 10 10

Optimal Optimal rate

−2 −2 Nonoptimal 10 10 0 100K 200K 300K 400K 0 2M 4M 6M 8M 10M

Figure 7.44: Function approximation errors obtained by optimal sampling strategy (5.31) and a non-optimal sampling strategy (uniform) with Legendre polynomials at d = 40: left (n = 3); right (n = 4). The target functions are (Top) the GAUSSIAN function f1 in (7.1) with σi = 1 and χi = 0.375 and (Bottom) f5 in (7.3) with χi = −0.1.

151 0 10

Theoretical rates

10−1

10−2

n=5 10−3 n=6 n=7 n=8 n=9 −4 10 0 500K 1000K 1500K 2000K

Figure 7.45: Function approximation errors versus number of iterations by Hermite poly- nomials of different degree n at d = 10. The target function is f5 in (7.3) with χi = −0.1.

0 0 10 10 Numerical Numerical Theoretical rate Theoretical rate

−1 10 10−1

−2 −2 10 10 0 50K 100K 150K 200K 0 500K 1000K 1500K 2000K

Figure 7.46: Function approximation errors versus number of iterations by Hermite poly- nomials of different degree n at d = 40: left (n = 3); right (n = 4). The target function is f5 in (7.3) with χi = −0.1.

152 High dimensions d ≥ 100

We now present the results at high dimensions d ≥ 100. In Fig. 7.47, the Legendre approximation results in [−1, 1]100 are shown, for polynomial order of n = 2 and n = 3.

The results shown here are for the GAUSSIAN function f1 in (7.1). We also plot the theoretical convergence using Theorem 5.4.1. Excellent agreement between the theory and the numerical results are observed. The cardinality of the polynomial space is p = 5151 for n = 2 and p = 176851 for n = 3. We observe that convergence is reached at around

K ∼ 50, 000 steps for n = 2 and K ∼ 1.5 × 106 steps for n = 3. This verifies the

K ∼ 10p convergence criterion. Note that the total number of the tensor quadrature points is M = 3100 ≈ 5.2 × 1047 for n = 2, and M = 4100 ≈ 1.6 × 1060 for n = 3. The algorithm

(5.32) thus converges after using only a very small portion of the tensor quadrature.

−3 −3 10 10 Numerical Numerical Theoretical Theoretical

−4 10 10−4

−5 −5 10 10 0 20K 40K 60K 80K 100K 0 500K 1000K 1500K 2000K

Figure 7.47: Numerical and theoretical function approximation errors versus number of iterations by Legendre polynomials at d = 100: n = 2 (left) and n = 3 (right). The target function is the GAUSSIAN function f1 in (7.1) with σi = 1 and χi = 0.375.

100 In Fig. 7.48 we present the approximation in unbounded domain R by Hermite polynomials of degree n = 2 and n = 3, for the function f5 in (7.3). Again, we observe the good agreement between the numerical convergence and the theoretical convergence rate.

The K ∼ 10p convergence criterion also holds well.

153 0 10 0 10 Numerical Numerical Theoretical rate Theoretical rate

−1 −1 10 10 0 10K 20K 30K 40K 50K 0 500K 1000K 1500K 2000K

Figure 7.48: Function approximation errors versus number of iterations by Hermite poly- nomials at d = 100: n = 2 (left) and n = 3 (right). The target function is f5 in (7.3) with χi = −0.1.

Finally, we present the approximation results in d = 500 dimension. This is for the func-

tion f5 in (7.3), by polynomials of degree n = 2. The cardinality of the polynomial space is p = 125, 751. The left of Fig. 7.49 is for the results in bounded domain by Legendre polyno-

mials, and the right is for unbounded domain by Hermite polynomials. Again, we observe

the expected exponential convergence and its agreement with the theoretical convergence.

500 238 Note that in this case the tensor quadrature grids ΘM consists of M = 3 ≈ 3.6 × 10 points. A number too large to be handled by most . However, the current ran- domized tensor quadrature method converges after ∼ 1.2 × 106 steps, following the ∼ 10p convergence rule and using only a tiny portion, ∼ 1/10232, of the tensor quadrature points.

154 0 0 10 10

10−1

Numerical Numerical Theoretical rate Theoretical rate −2 −1 10 10 0 500K 1000K 1500K 2000K 0 500K 1000K 1500K 2000K

Figure 7.49: Function approximation errors versus number of iterations of n = 2 at d = 500. 500 500 Left: Legendre polynomials in [−1, 1] ; Right: Hermite polynomials in R . The target function is f5 in (7.3) with χi = 0.

155 7.4 Correcting Data Corruption Errors

In this section we provide numerical examples of the error correction technique in Chapter

6. There exist a variety of ways to solve the minimization problem (P0) (6.5). Here we employ a mature method, the popular primal-dual algorithm, for to solve (P0) (6.5). In particular, we employ the software package `1-Magic ([19]). We remark that a distinct feature of the solution of (P0) is that it interpolates at least p non-corrupted data f with

∗ probability Pf (m, s,  ). (See Corollary 6.3.5.) All of our numerical results indicate that this property holds true numerically. However, a proof that the primal-dual algorithm satisfies this property is not available at the moment. This will be pursued in a future work.

To generate corrupted data, we sample the test functions at m number of points from

i.i.d. uniform distribution in [−1, 1]d. This defines the uncorrupted data vector f. We then

add corruption errors es to α·m samples, where 0 ≤ α < 1 is the corruption rate, and obtain

the corrupted data vector b = f + es. The corruption errors are drawn from a Gaussian distribution N (µ, σ2), where the mean µ and variance σ2 are chosen separately for each test

function to ensure the errors are sufficiently big.

We carefully examine the numerical performance of the `1 minimization method (P0) (6.5), which is denoted as LAD (least absolute deviation) here. For comparison, we also

examine the results obtained by the standard least squares method (LSQ). Based on the

analysis of this paper, we expect LAD to eliminate the corruption errors and LSQ not. We

also produce reference solutions using the LSQ method for uncorrupted data f. This will

be denoted as p-LSQ, which stands for LSQ with perfect data. If LAD can successfully

eliminate the corruption errors, we expect its results to be close to those of p-LSQ. The

numerical approximation errors by the methods are computed on a fixed dense set of points,

different from the sampling points used in the approximation. All results are produced using

2-norm and over an average of 50 runs to reduce the effect of random fluctuation.

Different basis functions in the approximations are considered. We present the results

by Legendre polynomials for polynomial regression and by radial basis functions for non-

polynomial regression. Different dimensions d are also considered. All the methods behave

156 quite similarly in different dimensions. We shall present a set of selected results to cover all combinations of the tests.

Two dimensional examples : radial basis functions

Here we present the results by radial basis functions. In particular, we consider the Gaussian radial basis function,

2 2 φi(x) = exp(− kx − zik2), (7.12) centered at the points zi with parameter  > 0. The centers zi are chosen to be tensor equidistance points.

In Fig. 7.50, the approximation results using  = 4.1 for the Franke function (7.2) are shown. The corruption errors are drawn from N (10, 1), rather big errors. The rate of corruption is α = 0.05, that is, 5% of the samples are corrupted. On the left of Fig. 7.50, the approximation errors based on m = 664 samples are shown, with respect to increasing number of basis functions. We clearly observe that LAD solutions stay very close to the p-LSQ solutions. This indicates that the `1 minimization problem (P0) effectively removes the corruption errors and its solution is very close to the LSQ solution with perfect data.

The standard LSQ solutions, however, are inevitably contanimated by the corruption errors and have no accuracy at all. On the right of Fig. 7.50, the results based on m = 366 sample

points are shown. Again, we observe that the standard LSQ method has no accuracy

because of the presence of the corruption errors. The LAD method again stays close to

the idealized p-LSQ with perfect data, as expected. Its performance deteriorates when the

number of basis functions becomes larger. This is due to the relatively smaller number of

uncorrupted samples, 0.95×366 ≈ 348. For a regression method to converge, the number of

samples needs to grow appropriately when the number of basis functions grows. This is an

issue independent of the topic of this dissertation. Consequently, hereafter we will present

more results in term of Legendre polynomial basis, as this issue is much better understood

in term of polynomial regression.

157 101 101 LAD LAD LSQ LSQ p-LSQ p-LSQ

100 100

10-1 10-1

10-2 10-2 0 10 20 30 40 50 60 70 80 90 0 10 20 30 40 50 60 70 80 90

Figure 7.50: Errors with respect to the number of basis functions by radial basis functions in d = 2 for the Franke function, with α = 0.05 and es ∼ N (10, 1). Left: m = 664; Right: m = 366.

Four dimensional examples: Legendre polynomial basis

We now present the results in four dimension d = 4 by using normalized Legendre polyno- mials. We examine the approximation errors according to the total degree of the Legendre n+d polynomials. Let n ≥ 0 be the degree, then the number of basis functions is p = d . To obtain stable LSQ polynomial regression result, it is necessary to choose the number of samples by m ∼ p log p ([31]). This is used in the examples here. We remark that it is not clear whether the m ∼ p log p rule holds for LAD polynomial regression. Nevertheless, our numerical tests indicate this seems to be sufficient not incurring instability.

In Fig. 7.51, the results for approximating f1 and f2 (7.1) are shown. The corruption rate is fixed at 5%. It is evident that the LAD solutions are very close to the idealized p-LSQ solutions with non-corrupted data. This verifies that the `1 minimization (P0) (6.5) can effectively remove the corruption errors. The standard LSQ, on the other hand, succumbs to the corruption errors and fails to converge at all.

Fig. 7.52 shows the approximation errors for the CORNER PEAK function f3 (7.1). On the left, the results of 5% corruption rate are shown. Again, we observe that the LAD effectively eliminates the corruption errors and its results stay very close to the idealized p-LSQ with perfect data. The results of 30% corruption rate are shown on the right of

158 100 100 LAD LSQ p-LSQ

10-1

10-2 10-1

10-3

LAD LSQ p-LSQ 10-4 10-2 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8

Figure 7.51: Approximation errors with respect to the increasing degree of Legendre poly- nomials in d = 4 and with α = 0.05. Left: GAUSSIAN f1 with σi = 2 and χi = 0.5. Corruption error es ∼ N (0, 1); Right: CONTINUOUS f2 with σi = i and χi = 0.5. Cor- ruption error es ∼ N (0, 1).

Fig. 7.52. The LAD solution exhibits larger approximation errors, compared to that of p-LSQ with perfect data. But the LAD solution converges at about the same rate. This can be expected by Theorem 6.3.2, where the upper bound of the LAD approximation error depends explicitly on s, the sparsity of the corruption error, which becomes larger when

α = 30%. Again, the standard LSQ has no accuracy at all.

100 100

10-1 10-1

10-2 10-2

10-3 10-3

10-4 10-4 LAD LAD LSQ LSQ p-LSQ p-LSQ 10-5 10-5 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8

Figure 7.52: CORNER PEAK f3. Approximation errors with respect to the increasing degree of Legendre polynomials in d = 4 and with corruption error es ∼ N (0, 1). Left: α = 0.05; Right: α = 0.30.

159 We now examine the performance of the method with respect to the perturbation level of the corruption errors. In Fig. 7.53 we plot the approximation results for the GAUSSIAN function f1 with a fixed corruption rate of 5%. On the left the corruption level is at N (0, 0.3). We observe that the LAD solution is almost identical to the p-LSQ solution for all polynomial degrees. The LSQ result is similar to those of LAD and p-LSQ, for up to polynomial degree n = 3. At this stage the approximation error of LSQ is sufficiently large such that the N (0, 0.3) corruption errors are treated by LSQ as noise. For higher degree

polynomials n > 3, the approximation errors by LSQ shall be smaller and the corruption

errors can not be treated as noise. Therefore, for n > 3, the LSQ method starts to lose

accuracy, as it can not eliminate the corruption errors. Similar results are shown on the

right of Fig. 7.53, for a smaller corruption error N (0, 0.1). Here the LSQ starts to lose

accuracy at a higher degree n > 5, where the approximation errors by LSQ become so

small that the corruption errors can not be treated as noise. On the other hand, the LAD

produces results similar to p-LSQ for all polynomial degrees. For degree n ≤ 5, it treats

the corruption errors as noises; and for n > 5 it eliminates the corruption errors.

100 100 LAD LSQ p-LSQ

10-1

10-1

10-2

10-2

10-3

LAD LSQ p-LSQ 10-4 10-3 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8

Figure 7.53: GAUSSIAN f1 with σi = 2 and χi = 0.5. Approximation errors with respect to the increasing degree of Legendre polynomials in d = 4 and with α = 0.05. Left: 2 2 es ∼ N (0, 0.3 ); Right: es ∼ N (0, 0.1 ).

We now examine the case of much bigger corruption errors. Fig. 7.54 shows the results

160 for the PRODUCT PEAK function f4 (7.1). Again, the corruption rate is fixed at 5%.

The results on the left of Fig. 7.54 are obtained with es ∼ N (0, 1) corruption error. It is clear that both LAD and LSQ are able to produce accurate results, compared to p-

LSQ with perfect data. This is because the function f4, with the parameters chosen here, has f ∼ O(105), with a peak value 216 = 65, 536. Consequently, the corruption errors from N (0, 1) are too small to be detected. Instead, they can be effectively considered as the standard random noises. Both LAD and LSQ are able “filter” them out to produce unbiased

5 2 estimates. To generate “true” corruption errors, we set es ∼ N (10 , 20 ), sufficiently large to be considered “nontrivial” from the assumption (A2). The results are shown on the right of Fig. 7.54. It is clear that the LAD can eliminate these corruption errors, even at

O(105), and produce results almost as accurate as those by the p-LSQ with perfect data.

The standard LSQ fails to produce any accuracy, which is not surprising.

104 105

104

103

103

102

102

LAD LAD LSQ LSQ p-LSQ p-LSQ 101 101 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8

Figure 7.54: PRODUCT PEAK f4 with σi = d, χi = 0. Approximation errors with respect to the increasing degree of Legendre polynomials in d = 4 and with α = 0.05. Left: 5 2 es ∼ N (0, 1); Right: es ∼ N (10 , 20 ).

These results indicate that the LAD method is insensitive to the magnitude of the corruption errors, as long as they are sparse. When the corruption errors are small, they will be treated as the standard noise and the LAD behaves like the standard p-LSQ (as if there is no corruption). When the corruption errors are large, the LAD will effectively

161 eliminate them (as long as they are sparse) and produce results similar to p-LSQ. Either way, the LAD results are always similar to p-LSQ. LSQ, however, shall suffer when the corruption errors exist.

Ten dimensional examples: Legendre polynomial basis

For high dimensional examples, a variety of dimensionality was tested. Here we present the results in d = 10. In Fig. 7.55, the results for the GAUSSIAN function f1 (7.1) are shown. Again, we observe that at lower rate of corruption α = 10%, the LAD results are very close to the ideal p-LSQ with perfect data. At higher rate of corruption α = 30%, the LAD still exhibits similar convergence rate as the p-LSQ, even though it does have larger error. This is an expected result from Theorem 6.3.2.

The results for the CORNER PEAK function f3 (7.1) are shown in Fig. 7.56. Very similar results are obtained from the rest of the test functions and not repeated here.

100 100 LAD LAD LSQ LSQ p-LSQ p-LSQ

10-1 10-1

10-2 10-2 1 2 3 4 5 1 2 3 4 5

Figure 7.55: GAUSSIAN function f1 with σi = 2 and χi = 0.5. Approximation errors with respect to the increasing degree of Legendre polynomials in d = 10, and with corruption error es ∼ N (0, 1). Left: α = 0.1; Right: α = 0.3.

162 100 100 LAD LAD LSQ LSQ p-LSQ p-LSQ

10-1 10-1

10-2 10-2

10-3 10-3 1 2 3 4 5 1 2 3 4 5

Figure 7.56: CORNER PEAK function f3 with σi = 2 and χi = 0.5. Approximation errors with respect to the increasing degree of Legendre polynomials in d = 10, and with corruption error es ∼ N (0, 1). Left: α = 0.1; Right: α = 0.3.

163 BIBLIOGRAPHY

[1] N.N. Abdelmalek. On the discrete linear l1 approximation and l1 solutions of overde- termined linear equations. J. Approx. Theory, 11:38–53, 1974.

[2] A.Doostan and G. Iaccarino. A least-squares approximation of partial differential equations with high-dimensional random inputs. J. Comput. Phys., 228(12):4332– 4345, 2009.

[3] A.C. Atkinson, A.N. Donev, and R.D. Tobias. Optimum Experimental Designs, with SAS. Oxford Univ. Press, 2007.

[4] I. Babuska, F. Nobile, and R. Tempone. A stochastic collocation method for ellip- tic partial differential equations with random input data. SIAM J. Numer. Anal., 45(3):1005–1034, 2007.

[5] R.A. Bailey. Design of Comparative Experiments. Cambridge Univ. Press, 2008.

[6] I. Barrodale and F.D.K. Roberts. An improved algorithm for discrete l1 linear ap- proximation. SIAM J. Numer. Anal., 10(5):839–848, 1973.

[7] G. Bassett and R. Koenker. Asymptotic theory of least absolute error regression. J. American Stat. Assoc., 73:618622, 1978.

[8] R.J. Berman. Bergman kernels and equilibrium measures for line bundles over pro- jective manifolds. Am. J. Math, 131(5):1485–1524, 2009.

[9] R.J. Berman. Bergman kernels for weighted polynomials and weighted equilibrium measures of cn. Indiana Univ. Math. J., 58(4):1921–1946, 2009.

[10] R.J. Berman, S. Boucksom, and D.W. Nystr¨om. Fekete points and convergence to- wards equilibrium measures on complex manifolds. Acta Math., 207(1):1–27, 2011.

[11] P. Bloomfield and W. Steiger. Least absolute deviations curve-titting. SIAM J. Sci. Comput., 1(2):290–301, 1980.

[12] P. Bloomfield and W. Steiger. Least absolute deviations: theory, applications and algorithms. Birkhauser, 1983.

[13] L. Bos, S. De Marchi, A. Sommariva, and M. Vianello. Computing multivariate fekete and leja points by . SIAM J. Numer. Anal., 48(5):1984, 2010.

164 [14] G.E. Box, J.S. Hunter, and W.G. Hunter. for Experimenters: Design, Innovation, and Discovery. Wiley, 2nd edition, 2005. [15] C. Brezinski and M. RedivoZaglia. Convergence acceleration of Kaczmarz’s method. J. Eng. Math., 93:3–19, 2015.

[16] J.A. Cadzow. Minimum `1, `2, and `∞ norm approximate solutions to an overdeter- mined system of linear equations. Digital Signal Proc., 12(4):524–560, 2002. [17] T.T. Cai, L. Wang, and G. Xu. New bounds for restricted isometry constants. Infor- mation Theory, IEEE Transactions on, 56(9):4388 –4394, 2010. [18] T.T. Cai, L. Wang, and G. Xu. Shifting inequality and recovery of sparse signals. Signal Processing, IEEE Transactions on, 58(3):1300–1308, 2010. [19] E. Cand`esand J. Romberg. l1-magic: Recovery of sparse signals via convex program- ming. URL: www. acm. caltech. edu/l1magic/downloads/l1magic. pdf, 4:14, 2005. [20] E. Cand`es, M. Rudelson, T. Tao, and R. Vershynin. Error correction via linear programming. In 2005 46th Annual IEEE Symposium on Foundations of Computer Science (FOCS’05), Pittsburg, PA, USA, pages 668–681. IEEE, 2005. [21] E.J. Cand`es. The restricted isometry property and its implications for compressed sensing. Comptes Rendus Mathematique, 346(9):589–592, 2008. [22] E.J. Cand`es,J.K. Romberg, and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Comm. Pure Appl. Math., 59(8):1207–1223, 2006. [23] E.J. Cand`esand T. Tao. Decoding by linear programming. IEEE Trans. Inform. Theory, 51(12):4203–4215, 2005. [24] E.J. Cand`esand T. Tao. Near-optimal signal recovery from random projections: universal encoding strategies? IEEE Trans. Inform. Theory, 52(12):5406–5425, 2006.

[25] E.J. Cand`es,M.B. Wakin, and S.P. Boyd. Enhancing sparsity by reweighted `1 min- imization. J. Fourier Anal. Appl., 14:877–905, 2008. [26] A. Charnes, W.W. Cooper, and R.O. Ferguson. Optimal estimation of executive compensation by linear programming. Management Sci., 1:138151, 1955. [27] R. Chartrand. Exact reconstruction of sparse signals via nonconvex minimization. IEEE Signal Proc. Lett., 14(10):707–710, 2007. [28] R. Chartrand and W. Yin. Iteratively reweighted algorithms for compressive sensing. In IEEE International Conference on Acoustics, Speech and Signal Processing, Las Vegas, NV, USA, March 31-April 4, 2008, pages 3869–3872. IEEE, 2008. [29] X. Chen and A.M. Powell. Almost sure convergence of the Kaczmarz algorithm with random measurements. J. Fourier Anal. Appl., 18:1195–1214, 2012. [30] H. Cheng and A. Sandu. Collocation least-squares polynomial chaos method. In Proceedings of the 2010 Spring Simulation Multiconference, SpringSim ’10, San Diego, CA, USA, 2010. Society for Computer Simulation International. 165 [31] A. Cohen, M. A. Davenport, and D. Leviatan. On the stability and accuracy of least squares approximations. Found. Comput. Math., 13(5):819–834, 2013.

[32] R. Cools. An encyclopaedia of cubature formulas. J. Complexity, 19:445–453, 2003.

[33] D.R. Cox. Principles of . Cambridge University Press, 2006.

[34] D. L. Donoho. Compressed sensing. IEEE Trans. Inform. Theory, 52(4):1289–1306, 2006.

[35] A. Doostan and H. Owhadi. A non-adaptive sparse approximation for PDEs with stochastic inputs. J. Comput. Phys., 230(8):3015–3034, 2011.

[36] Alireza Doostan and Houman Owhadi. A non-adapted sparse approximation of pdes with stochastic inputs. J. Comput. Phys., 230(8):3015–3034, 2011.

[37] C.F. Dunkl and Y. Xu. Orthogonal polynomials of several variables. Cambridge Univ. Press, 2001.

[38] Y.C. Eldar and D. Needell. Acceleration of randomized Kaczmarz method via the Johnson-Lindenstrauss lemma. Numer. Algo., 58(2):163–177, 2011.

[39] M. Eldred. Recent advances in non-intrusive polynomial chaos and stochas- tic collocation methods for uncertainty analysis and design. In 50th AIAA/ASME/ASCE/AHS/ASC Structures, Structural Dynamics, and Materials Conference, volume AIAA-2009-2249, 2009.

[40] D.A. Freedman. Statistical Models: Theory and Practice. Cambridge University Press, 2005.

[41] B. Ganapathysubramanian and N. Zabaras. Sparse grid collocation methods for stochastic natural convection problems. J. Comput. Phys., 225(1):652–685, 2007.

[42] A. Gelman. Analysis of variance? why it is more important than ever. Ann. Statis., 33:153, 2005.

[43] A. Genz. Testing multidimensional integration routines. In B. Ford, J.C. Rault, and F. Thomasset, editors, Tools, Methods, and Languages for Scientific and Engineering Computation, pages 81–94. North-Holland, 1984.

[44] T. Gerstner and M. Griebel. Numerical integration using sparse grids. Numer. Alg., 18:209–232, 1998.

[45] T. Gerstner and M. Griebel. Dimension–adaptive tensor–product quadrature. Com- puting, 71(1):65–87, 2003.

[46] R.G. Ghanem and P. Spanos. Stochastic Finite Elements: a Spectral Approach. Springer-Verlag, 1991.

[47] P. Goos and B. Jones. Optimal Design of Experiments: A Case Study Approach. Wiley, 2011.

166 [48] R. Gordon, R. Bender, and G.T. Herman. Algebraic reconstruction techniques (ART) for three-dimensional electron microscopy and X-ray photography. J. of theor. Bio., 29(3):471–481, 1970.

[49] L. Gyørfi, M. Kohler, A. Krzy˙zak, and H. Walk. A Distribution-Free Theory of Nonparametric Regression. Springer Series in Statistics, Springer-Verlag, Berlin, 2002.

[50] S. Haber. Numerical evaluation of multiple integrals. SIAM Rev., 12(4):481–526, 1970.

[51] P.C. Hammer and A.H. Stroud. Numerical evaluation of multiple integrals ii. Math. Comp., 12(64):272–280, 1958.

[52] J. Hampton and A. Doostan. Coherence motivated sampling and convergence analysis of least squares polynomial chaos regression. Computer Methods Appl. Mech. and Eng., 290:73–97, 2015.

[53] W.K. Hastings. Monte carlo sampling methods using markov chains and their appli- cations. Biometrika, 57(1):97–109, 1970.

[54] G.T. Herman. Fundamentals of computerized tomography: image reconstruction from projections. Springer Science & Business Media, 2009.

[55] S. Hosder, R. W. Walters, and M. Balch. Point-collocation nonintrusive polynomial chaos method for stochastic computational fluid dynamics. AIAA J., 48:2721–2730, 2010.

[56] S. Kaczmarz. Angen¨aherteaufl¨osungvon systemen linearer gleichungen. Bull. Inter- nat. Acad. Polonaise Sci. Lett., 35:355–357, 1937.

[57] M.-J. Lai, Y. Xu, and W. Yin. Improved iteratively reweighted least squares for unconstrained smoothed `q minimization. SIAM J. Numer. Anal., 51(2):927–957, 2013.

[58] J. Liu and S.J. Wright. An accelerated randomized Kaczmarz algorithm. Math. Comp., 85(297):153–178, 2016.

[59] Y. Lou, P. Yin, Q. He, and J. Xin. Computing sparse representation in a highly coherent dictionary based on difference of l1 and l2. J. Sci. Comput., 64:178–196, 2015.

[60] R.L. Mason, R.F. Gunst, and J.L. Hess. Statistical design and analysis of experiments with applications to engineering and science. 1989.

[61] L Mathelin and Gallivan K.A. A compressed sensing approach for partial differential equations with random input data. Comm. Comput. Phys., 12:919–954, 2012.

[62] G. Migliorati and F. Nobile. Analysis of discrete least squares on multivariate polyno- mial spaces with evaluations at low-discrepancy point sets. J. Complexity, 31(4):517– 542, 2015.

167 [63] G. Migliorati, F. Nobile, E. Von Schwerin, and R. Tempone. Analysis of discrete L2 projection on polynomial spaces with random evaluations. Found. Comput. Math., 14(3):419–456, 2014.

[64] G. Migliorati, F. Nobile, E. von Schwerin, and R. Tempone. Approximation of quanti- ties of interest in stochastic PDEs by the random discrete L2 projection on polynomial spaces. SIAM J. Sci. Comput., 35(3):A1440–A1460, May 2013.

[65] G. Migliorati, F. Nobile, E. von Schwerin, and R. Tempone. Analysis of the discrete L2 projection on polynomial spaces with random evaluations. Found. Comput. Math., 14(3):419–456, 2014.

[66] A. Narayan, J.D. Jakeman, and T. Zhou. A Christoffel function weighted least squares algorithm for collocation approximations. Math. Comput., 86(306):1913–1947, 2017.

[67] A. Narayan and D. Xiu. Constructing nested nodal sets for multivariate polynomial interpolation. SIAM J. Sci. Comput., 35(5):A2293–A2315, 2013.

[68] S.C. Narula and J.F. Wellington. The minimum sum of absolute errors regression: a state of the art survey. Inter. Stat. Rev., 50(3):317326, 1982.

[69] B.K. Natarajan. Sparse approximate solutions to linear systems. SIAM J. Comput., 24(2):227–234, 1995.

[70] D. Needell. Randomized Kaczmarz solver for noisy linear systems. BIT Numer. Math., 50(2):395–403, 2010.

[71] D. Needell, N. Srebro, and R. Ward. Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. Math. Programming, 155:549–573, 2016.

[72] Paul Nevai. G´ezafreud, orthogonal polynomials and christoffel functions. a case study. Journal of Approximation Theory, 48(1):3–167, 1986.

[73] M.R. Osborne and G.A. Watson. On an algorithm for discrete nonlinear l1 approxi- mation. Computer J., 14(2):184–188, 1971.

[74] D. Pollard. Asymptotics for least absolute deviation regression estimators. Econo- metric Theory, 7:186–199, 1991.

[75] S. Portnoy and R. Koenker. The Gaussian hare and the Laplacian tortoise: Com- putability of squared-error vs. absolute-error estimators, with discussion. Stat. Sci- ence, 12:279–300, 1997.

[76] J.L. Powell. Least absolute deviations estimation for the censored regression model. J. Econometrics, 25:303–325, 1984.

[77] F. Pukelsheim. Optimal Design of Experiments. SIAM, 2006.

[78] H. Rauhut. Compressive sensing and structured random matrices. In Massimo For- nasier, editor, Theoretical Foundations and Numerical Methods for Sparse Recovery, pages 1–92. Berlin, New York (DE GRUYTER), 2010.

168 [79] H. Rauhut and R. Ward. Sparse Legendre expansions via `1-minimization. J. Approx. Theory, 164:517–533, 2012.

[80] M.T. Reagan, H.N. Najm, R.G. Ghanem, and O.M. Knio. Uncertainty quantification in reacting-flow simulations through non-intrusive spectral projection. Combustion & Flame, 132(3):545–555, 2003.

[81] H. Scheffe. The Analysis of Variance. Wiley, New York, 1959.

[82] Y. Shin, K. Wu, and D. Xiu. Sequential function approximation using randomized samples. J. Comput. Phys., submitted, 2017.

[83] Y. Shin and D. Xiu. Correcting data corruption errors for multivariate function approximation. SIAM J. Sci. Comput., 38(4):A2492–A2511, 2016.

[84] Y. Shin and D. Xiu. Nonadaptive quasi-optimal points selection for least squares linear regression. SIAM J. Sci. Comput., 38(1):A385–A411, 2016.

[85] Y. Shin and D. Xiu. On a near optimal sampling strategy for least squares polynomial regression. J. Comput. Phys., 326:931–946, 2016.

[86] Y. Shin and D. Xiu. A randomized algorithm for multivariate function approximation. SIAM J. Sci. Comput., 39(3):A983–A1002, 2017.

[87] K. Spyropoulos, E. Kiountouzis, and A. Young. Discrete approximation in the l1 norm. Computer J., 16(2):180–186, 1972.

[88] T. Strohmer and R. Vershynin. A randomized Kaczmarz algorithm with exponential convergence. J. Fourier Anal. Appl., 15(2):262–278, 2009.

[89] A.H. Stroud. Approximate calculation of multiple integrals. Prentic-Hall, Inc., Engle- wood Cliffs, New Jersey, 1971.

[90] Timothy John Sullivan. Introduction to uncertainty quantification, volume 63. Springer, 2015.

[91] G. Szeg¨o. Orthogonal Polynomials. American Mathematical Society, Providence, RI, 1939.

[92] P.D. Tao and L.T.H. An. Convex analysis approachj to D.C. programming: theory, algorithms and applicatins. Acta Math. Vietnamica, 22(1):289–355, 1997.

[93] M.A. Tatang, W.W. Pan, R.G. Prinn, and G.J. McRae. An efficient method for parametric uncertainty analysis of numerical geophysical model. J. Geophy. Res., 102:21925–21932, 1997.

[94] T. Wallace and A. Sekmen. Acceleration of Kaczmarz using orthogonal subspace projections. In Biomedical Sciences and Engineering Conference (BSEC),Oak Ridge, Tennessee, USA, May 21-23, 2013, pages 1–4. IEEE, 2013.

[95] C.F. Wu and M.S. Hamada. Experiments: planning, analysis, and optimization. John Wiley & Sons, 2009. 169 [96] K. Wu, Y. Shin, and D. Xiu. A randomized tensor quadrature method for high dimensional polynomial approximation. SIAM J. Sci. Comput., 39(5):A1811–A1833, 2017.

[97] D. Xiu. Efficient collocational approach for parametric uncertainty analysis. Comm. Comput. Phys., 2(2):293–309, 2007.

[98] D. Xiu. Numerical integration formulas of degree two. Appl. Numer. Math., 58(10):1515–1520, 2008.

[99] D. Xiu. Numerical methods for stochastic computations. Princeton Univeristy Press, Princeton, New Jersey, 2010.

[100] D. Xiu and J.S. Hesthaven. High-order collocation methods for differential equations with random inputs. SIAM J. Sci. Comput., 27(3):1118–1139, 2005.

[101] D. Xiu and G.E. Karniadakis. The Wiener-Askey polynomial chaos for stochastic differential equations. SIAM J. Sci. Comput., 24(2):619–644, 2002.

[102] L. Yan, L. Guo, and D. Xiu. Stochastic collocation algorithms using `1-minimization. Int. J. Uncertainty Quantification, 2(3):279–293, 2012.

[103] L. Yan, Y. Shin, and D. Xiu. Sparse approximation using `1 −`2 minimization and its application to stochastic collocation. SIAM J. Sci. Comput., 39(1):A229–A254, 2017.

[104] X. Yang and G.E. Karniadakis. Reweighted l1 minimization method for stochastic elliptic differential equations. J. Comput. Phys., 248:87–108, 2013.

[105] P. Yin, Y. Lou, Q. He, and J. Xin. Minimization of `1−2 for compressed sensing. SIAM J. Sci. Comput., 37(1):A536–A563, 2015.

[106] A. Zouzias and N.M. Freris. Randomized extended Kaczmarz for solving least squares. SIAM J. Matrix Anal. Appl., 34(2):773–793, 2013.

170 APPENDIX A

PROOFS OF CHAPTER 3

A.1 Proof of Theorem 3.2.1

To prove Theorem 3.2.1, we first recall the following result.

Theorem A.1.1 (Sylvester’s determinant theorem). Let A be a m × p matrix and B be a

p × m matrix. Then,

det(Im + AB) = det(Ip + BA), (A.1)

where Im and Ip are the m × m and p × p identity matrices, respectively.

Based on this, we immediately derive the following results.

Corollary A.1.1. Let X be an invertible m × m matrix and c and r be length-m vectors.

Then,

det(X + crT ) = det(X)(1 + rT X−1c). (A.2)

Proof.

T −1 T −1 T det(X + cr ) = det(X(Im + X cr )) = det(X) det(Im + X cr )

T −1 T −1 = det(X) det(I1 + r X c) = det(X)(1 + r X c).

Corollary A.1.2. Let A be a m × m invertible square matrix, r and c length-m vectors,

171 and d a real number. Then,   A c   T −1 det   = det A · (d − r A c). (A.3) rT d

We also recall the following theorem.

Theorem A.1.2 (Sherman–Morrison formula). Let A be an invertible square matrix and u,v be vectors. Suppose that 1 + vT A−1u 6= 0. Then,

A−1uvT A−1 (A + uvT )−1 = A−1 − . (A.4) 1 + vT A−1u

Using these results, the proof of Theorem 3.2.1 immediately follows.

Proof. Since AeT Ae = AT A + rrT ,

det(AeT Ae) = det(AT A) · (1 + rT (AT A)−1r).

Therefore,

det(AeT Ae) S(A)2p = e Qp (i) 2 2 i=1(kA k + ri ) 1 + rT (AT A)−1r = (det AT A) . Qp (i) 2 2 i=1(kA k + ri )

A.2 Proof of Theorem 3.2.2

Proof. Note that   DAT c + γr T   T T Ae Ae =   ,D = A A + rr . cT A + γrT cT c + γ2

Using the definition (3.9) , we have

det AeT Ae S(Ae)2(m+1) = Qm+1 (i) 2 i=1 kAe k cT c + d2 − (cT A + γrT )D−1(cT A + γrT )T = det D , Qm+1 (i) 2 i=1 kAe k 172 where Corollary A.1.1 is applied. Upon using Corollary A.1.2, we have

det D = det(AT A) · (1 + rT (AT A)−1r) = det(AT A) · (1 + rT b)

Also, upon applying the Sherman-Morrison formula A.1.2, we have

(AT A)−1rrT (AT A)−1 D−1 = (AT A + rrT )−1 = (AT A)−1 − . 1 + rT (AT A)−1r

Then,

 (AT A)−1rrT (AT A)−1  D−1(cT A + γrT )T = (AT A)−1 − (AT c + γr) 1 + rT (AT A)−1r  (AT A)−1rrT  = I − ((AT A)−1AT c + γ(AT A)−1r) 1 + rT (AT A)−1r  brT  = I − (g + γb). 1 + rT b

After combining all derivations, the proof is established.

A.3 Proof of Theorem 3.4.1

Proof. Let E = B(0, 1) ⊂ C be in the complex domain. Let A and M be the Vandermonde matrix constructed by the orthogonal polynomial φ and monomials, respectively. Then we have A = M · U where U is an upper triangular matrix. Let us define

 2/n(n−1) Y δn(Θn) :=  |zi − zj| , (A.5) 1≤i

2/n(n−1) where Θn = {z1, ··· , zn} ∈ E. Since δn(Θn) = (det M) , we have

S S F Θn = argmax δn (Θn), Θn = argmax δn(Θn), Θn Θn

F where Θn is the Fekete points.

173 We now define   X 1 En(Θn) := log | det U| + log  . (A.7) |zi − zj| 1≤i

S En (Θn) := En(Θn) + βn(Θn), (A.8) where v n−1 u n X uX 2 βn(Θn) := log t |φl(zk)| . (A.9) l=0 k=1

Since (det U) is independent of Θn, the maximization of (A.5) and (A.6) is equivalent to the minimization of (A.7) and (A.8), respectively. Since

n(n − 1) 1 En(Θn) = log + log | det U| = Een(Θn) + log | det U|, 2 δn(Θn) where n(n − 1) 1 Een(Θn) = log , 2 δn(Θn) we have

F S S S F F En(Θn ) + βn(Θn) ≤ En(Θn) + βn(Θn) ≤ En(Θn ) + βn(Θn )

and subsequently,

E (ΘS) E (ΘF ) β (ΘF ) β (ΘS) en n en n n n n n − ≤ − . n(n − 1) n(n − 1) n(n − 1) n(n − 1)

Since kφlk∞ grows at most polynomially with l,

n−1 Y √ k n kφl(Θn)k ≤ (C n(n + 1)) , for any Θn, l=0 for some constants C, and k < ∞. Also, the uniform boundedness from below implies

Qn−1 S n n log α1 l=0 kφl(Θn)k ≥ α1 = e for some α1 > 0. We then have

√ k S n log C n(n + 1) ≥ βn(Θn) ≥ n log α1.

F Similarly, for Θn , the uniform boundedness of the determinant from below implies

174 F F | det A(Θn )| ≥ α2 > 0, for some α2 independent of n. Then, En(Θn ) ≥ log α2 > −∞, and we have

F F n log nM ≥ βn(Θn ) ≥ En(Θn ) ≥ log α2.

Consequently, β (ΘS) β (ΘF ) lim n n = 0, lim n n = 0. n→∞ n(n − 1) n→∞ n(n − 1) From the fact that

F Een(Θn ) F lim = 0, lim δn(Θ ) = τ(E) = 1, n→∞ n(n − 1) n→∞ n

we then have

S Een(Θn) S lim = 0, lim δn(Θ ) = 1. n→∞ n(n − 1) n→∞ n

Therefore, by the uniqueness of the equilibrium measure,

1 X weak-* lim δz = µE (A.10) n→∞ n S z∈Θn

In the case of x ∈ [−1, 1], the equilibrium measure on E is the arcsine distribution, i.e.,

S 1 X weak-* F ν = lim δx = ν (A.11) n→∞ n S x∈<(Θn)

S S where <(Θn) is the real part of Θn.

A.4 Proof of Theorem 3.4.2

Proof. For notational convenience, we write A(Θ∞,n) and A(Θk,n) as A∞,n and Ak,n, respec-

tively. From the assumption that for each n, S(A∞,n) = 1, we have a sequence of {Θk,n}k>p n+d where p = d , and a sequence of row selection operators {Sk,n}k>p corresponding to

{Θk,n}p

T T Ak,nAk,n = W Aek,nAek,nW (A.12)

175 T and (i, j) component of Aek,nAek,n is

T X φi(x)φj(x) [Aek,nAek,n]ij = (A.13) kφikΘk,n kφjkΘk,n x∈Θk,n where s X 2 kφikΘk,n = φi(x) . x∈Θk,n

By substituting (A.12) into the definition of S(Ak,n), we have

q T det Ak,nAk,n q q T det W T S(Ak,n) = = det Ae Aek,n = det Ae Aek,n. (A.14) Qp (i) k,n Qp (i) k,n i=1 kAk,nk2 i=1 kAk,nk2 We now show the following intermediate result.

Lemma A.4.1. If limk→∞ S(Ak,n) = 1, then

T lim Aek,nAek,n = Ip, (A.15) k→∞ where Ip is the identity matrix of size p × p.

T Proof. Note that for each n, all diagonal entries of Aek,nAek,n are 1. Let {λi(k)} be the T T eigenvalues of Aek,nAek,n. Since Aek,nAek,n is positive definite, λi(k) > 0 for all 1 ≤ i ≤ p. Let T Pchar(Θk,n; x) be the characteristic polynomial of Aek,nAek,n. Then

p Y Pchar(Θk,n; x) = (x − λi). (A.16) i=1 Let us recall the arithmetic-geometric inequality

λ (k) + ··· + λ (k) q 1 p ≥ p λ (k) · λ (k) ··· λ (k). (A.17) p 1 2 p

Note that the equality holds when λ1(k) = λ2(k) = ··· = λp(k). Then (A.17) implies

T tr(A Ak,n) q q ek,n e p T p 2 1 = ≥ det Ae Aek,n = S(Ak,n) . (A.18) p k,n

After taking the limit on (A.18), the equality holds. This implies that all the eigenvalues

176 become 1, i.e., p Y p lim Pchar(Θk,n; x) = lim (x − λi(k)) = (x − 1) . k→∞ k→∞ i=1 Furthermore,

T Pchar(Θk,n; x) = det(xIp − Aek,nAek,n)   x − 1 a1,2 ··· a1,p      a1,2 x − 1 ··· a2,p  = det    . . . .   . . .. .   . . .    a1,p a2,p ··· x − 1   p X 2 p−1 = (x − 1) −  ai,j (x − 1) + ··· 1≤i

Since the coefficient of (x−1)p−1 has to converge to zero as k → ∞, the lemma is proved.

Based on the lemma, we have

X φi(x)φj(x) lim = 0, 1 ≤ i, j ≤ p, i 6= j. (A.19) k→∞ kφikΘk,n kφjkΘk,n x∈Θk,n

Since kφik∞ ≤ Cp for all 1 ≤ i ≤ p (in compact domain), we have

1 1 X X φi(x)φj(x) 2 φi(x)φj(x) ≤ . Cp k kφikΘk,n kφjkΘk,n x∈Θk,n x∈Θk,n By taking the limit on both sides, we have

1 X lim φi(x)φj(x) = 0, 1 ≤ i, j ≤ p, i 6= j. (A.20) k→∞ k x∈Θk,n

Let νk,n be the empirical measure of Θk,n. We then have a sequence of empirical measures R {νk,n}k>p such that Γ dνk,n = 1 for any n and k > p. We now introduce a result from pluripotential theory. Let Γ be a compact subset of

d R and C(Γ) be the space of continuous functions f :Γ → R, equipped with the usual sup-norm. Let P(Γ) be the collection of all Borel probability measure on Γ.

177 Definition A.4.2. A sequence (µn)n≥1 in P(Γ) is weak*-convergent to µ ∈ P(Γ), if Z Z fdµn → fdµ for each f ∈ C(Γ). (A.21) Γ Γ

Lemma A.4.3. Every sequence (µn)n≥1 in P(Γ) has a subsequence which is weak*-convergent to some µ ∈ P(Γ).

By using the Lemma A.4.3 on the sequence {νk,n}k>p for each n, there exists a subse-

∞ ∗ quence {νkj ,n}j=1 such that it converges to some measure νn in the weak* sense, and we immediately have

Z Z ∗ 1 X φiφjdνn = lim φiφjdνkl,n = lim φi(x)φj(x) = 0. (A.22) l→∞ l→∞ kl Γ Γ x∈Θ kl,n

∗ In other words, for fixed n ∈ N, there exists a measure νn which satisfies the polynomial orthogonality for up to degree n. Furthermore, by applying Lemma A.4.3 on the collection

∗ ∗ ∗ of νn, there exists a subsequence νnj that converges to a measure ν with  Z  0 if i 6= j, ∗  φiφjdν = (A.23) Γ  γi if i = j.

Therefore, by the uniqueness of orthogonal probability measure, we have ν∗ = µ.

178 APPENDIX B

PROOFS OF CHAPTER 4

B.1 Proof of Theorem 4.2.1

Proof. Let Ax¯ be the noise free observation. Then, the triangle inequality leads to

opt opt kAx − Ax¯k2 ≤ kAx − bk2 + kAx¯ − bk2 ≤ 2. (B.1)

opt opt opt Sincex ¯ is feasible, we have kx k1−kx k2 ≤ kx¯k1−kx¯k2, where x can be decomposed

opt as x =x ¯ + h. Let T0 be the support ofx ¯, and hT0 be the vector which coincides with h

on the indices in T0 and is zero elsewhere outside T0. Then,

opt opt kx k1 − kx¯k1 ≤ kx k2 − kx¯k2 ≤ khk2,

opt kx k1 ≤ kx¯k1 + khk2,

and

opt kx k = kx¯ + hk ≥ kx¯k − kh k + kh c k . 1 1 1 T0 1 T0 1

These inequalities then lead to

kh c k ≤ kh k + khk . (B.2) T0 1 T0 1 2

c The rest of the proof follows from [22]. We begin by dividing T0 into subsets of size M and enumerate T c as n , n , ··· , n in decreasing order of magnitude of hc . Set 0 1 2 N−|T0| T0

Tj = {nl :(j − 1)M + 1 ≤ l ≤ jM}. That is, T1 contains the indices of the M largest coefficients of hc , T contains the indices of the next M largest coefficients, and so on. T0 2

179 With this decomposition, the `2-norm of h is concentrated on T01 = T0 ∪ T1. Indeed, the

kth largest value of h c obeys T0

|h c | ≤ kh c k /k, T0 (k) T0 1

and, therefore, N 2 2 X 2 2 kh c k ≤ kh c k 1/k ≤ kh c k /M. T0 2 T0 1 T0 1 k=M+1 Furthermore, (B.2) gives p khT0 k1 + khk2 |T0|khT0 k2 + khk2 khT c k2 ≤ √ ≤ √ . 0 M M

Consequently,

p !2 2 2 2 2 |T0|khT0 k2 + khk2 khk = khT k + khT c k ≤ khT k + √ 2 01 2 01 2 01 2 M r  |T | |T |r 1 1 ≤ 1 + 0 kh k2 + 2 0 kh k khk + khk2, M T01 2 M M T01 2 2 M 2

which gives r  1  |T |r 1  |T | 1 − khk2 − 2 0 kh k khk − 1 + 0 kh k2 ≤ 0. M 2 M M T01 2 2 M T01 2

Therefore,

khk2 ≤ B|T0|,M khT01 k2, (B.3) where

p|T | + pM|T | + M(M − 1) B := 0 0 . |T0|,M M − 1

Note that by construction, the magnitude of each coefficient in Tj+1 is less than the average

of the magnitudes in Tj, i.e.,

|hTj+1 (t)| ≤ khTj k1/M.

Then,

2 2 khTj+1 k2 ≤ khTj k1/M

180 and it follows that

√ √ p X X khT0 k1 + khk2 |T0|khT0 k2 + khk2 kh k ≤ kh k / M = kh c k / M ≤ √ ≤ √ . Tj 2 Tj 1 T0 1 j≥2 j≥1 M M

Thus, we have

q p X kAhk2 ≥ 1 − δM+|T0|khT01 k2 − 1 + δM khTj k2 j≥2 p q p |T0|khT0 k2 + khk2 ≥ 1 − δM+|T |khT k2 − 1 + δM √ 0 01 M

≥ C|T0|,M · khT01 k2 − DM khk2,

√ q p p 1+δM where C|T0|,M := 1 − δM+|T0| − |T0|/M 1 + δM and DM = M .

It then follows from (B.3) and kAhk2 ≤ 2 that

kAhk2 + DM khk2 khk2 ≤ B|T0|,M khT01 k2 ≤ B|T0|,M · (B.4) C|T0|,M 2 DM ≤ B|T0|,M + B|T0|,M khk2. (B.5) C|T0|,M C|T0|,M

Therefore, we have

  DM 2 1 − B|T0|,M khk2 ≤ B|T0|,M . (B.6) C|T0|,M C|T0|,M

Let $M = 3|T_0|$ and $|T_0| = s$. The left-hand side is positive if
\[
a(s) = \left(\frac{3s-1}{\sqrt{3}\,s + \sqrt{4s-1}}\right)^2 > 1,
\]
and $A$ satisfies
\[
\delta_{3s} + a(s)\,\delta_{4s} < a(s) - 1.
\]
Thus we have
\[
\|h\|_2 \le C_S\cdot\epsilon, \tag{B.7}
\]
where
\[
C_S := \frac{2\big[\sqrt{3s} - \sqrt{s\cdot a(s)}\big]}{\sqrt{a(s)(1-\delta_{4s})} - \sqrt{1+\delta_{3s}}}.
\]
This completes the proof.
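As a quick numerical illustration (not part of the original proof), the constants above can be evaluated directly. The Python sketch below assumes hypothetical values of $s$, $\delta_{3s}$, and $\delta_{4s}$, and reports $a(s)$ and $C_S$ only when the condition $\delta_{3s} + a(s)\delta_{4s} < a(s)-1$ holds.

```python
import math

def a_const(s):
    """a(s) = ((3s - 1) / (sqrt(3)*s + sqrt(4s - 1)))**2, as in the proof of Theorem 4.2.1."""
    return ((3 * s - 1) / (math.sqrt(3) * s + math.sqrt(4 * s - 1))) ** 2

def recovery_constant(s, delta3s, delta4s):
    """Return C_S when delta_{3s} + a(s)*delta_{4s} < a(s) - 1 holds, else None."""
    a = a_const(s)
    if not (delta3s + a * delta4s < a - 1):
        return None  # RIP condition of Theorem 4.2.1 violated
    num = 2 * (math.sqrt(3 * s) - math.sqrt(s * a))
    den = math.sqrt(a * (1 - delta4s)) - math.sqrt(1 + delta3s)
    return num / den

# Example with hypothetical RIP constants: s = 20, delta_{3s} = delta_{4s} = 0.1
print(a_const(20), recovery_constant(20, 0.1, 0.1))
```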

B.2 Proof of Theorem 4.2.2

Proof. In order to prove Theorem 4.2.2, we first prove the following theorem.

Theorem B.2.1. Suppose $\bar{x}$ is $s$-sparse and $b = A\bar{x} + e$, where $e\in\mathbb{R}^m$ is any perturbation with $\|e\|_2\le\tau$. Let $s_1, s_2$ be positive integers such that $s_1\ge s$ and $8(s_1-s)\le s_2$. Let
\[
a(s_1,s_2;s) := \sqrt{\frac{s_1}{s_2}} + \frac{\sqrt{s_2}}{4\sqrt{s_1}} - \frac{2(s_1-s)}{\sqrt{s_1 s_2}}, \tag{B.8}
\]
\[
b(s_1) := \sqrt{\frac{1}{4s_1}} + \sqrt{\frac{1}{4s_1} + 2}. \tag{B.9}
\]
Then, if $A$ satisfies
\[
\delta_{s_1} + \big[a(s_1,s_2;s) + b(s_1)/\sqrt{s_2}\big]\,\theta_{s_1,s_2} < 1, \tag{B.10}
\]
the solution $x^{\mathrm{opt}}$ of (4.2) satisfies
\[
\|x^{\mathrm{opt}} - \bar{x}\|_2 \le \frac{2b(s_1)\sqrt{1+\delta_{s_1}}}{1 - \delta_{s_1} - \big[a(s_1,s_2;s) + b(s_1)/\sqrt{s_2}\big]\,\theta_{s_1,s_2}}\cdot\tau. \tag{B.11}
\]

Proof. The proof mostly follows [17]. Let $h = x^{\mathrm{opt}} - \bar{x}$. We can assume $|h(1)| \ge |h(2)| \ge \cdots \ge |h(n)|$ after rearranging the indices if necessary. Let $T = \{1,\cdots,s\}$ and let $\Omega$ be the support of $\bar{x}$. Since
\[
\begin{aligned}
\|x^{\mathrm{opt}}\|_1 = \|\bar{x} + h\|_1 &= \|\bar{x}_\Omega + h_\Omega\|_1 + \|h_{\Omega^c}\|_1\\
&\ge \|\bar{x}\|_1 - \|h_\Omega\|_1 + \|h_{\Omega^c}\|_1,
\end{aligned}
\]
it follows from $\|x^{\mathrm{opt}}\|_1 \le \|\bar{x}\|_1 + \|h\|_2$ that
\[
\|h_{\Omega^c}\|_1 \le \|h_\Omega\|_1 + \|h\|_2. \tag{B.12}
\]
Since $\Omega^c\cap T$ and $\Omega\cap T^c$ both have the same cardinality $s - |\Omega\cap T|$, we have
\[
\|h_{\Omega\cap T^c}\|_1 \le \|h_{\Omega^c\cap T}\|_1. \tag{B.13}
\]
From (B.12) and (B.13), it can be shown that
\[
\begin{aligned}
\|h_T\|_1 = \|h_{\Omega\cap T}\|_1 + \|h_{\Omega^c\cap T}\|_1 &= \|h_\Omega\|_1 + \|h_{\Omega^c\cap T}\|_1 - \|h_{\Omega\cap T^c}\|_1\\
&\ge -\|h\|_2 + \|h_{\Omega^c}\|_1 + \|h_{\Omega^c\cap T}\|_1 - \|h_{\Omega\cap T^c}\|_1\\
&= -\|h\|_2 + \|h_{\Omega^c\cap T^c}\|_1 + 2\|h_{\Omega^c\cap T}\|_1 - \|h_{\Omega\cap T^c}\|_1\\
&= -\|h\|_2 + \|h_{T^c}\|_1 + 2\|h_{\Omega^c\cap T}\|_1 - 2\|h_{\Omega\cap T^c}\|_1\\
&\ge -\|h\|_2 + \|h_{T^c}\|_1.
\end{aligned}
\]
Thus, we have
\[
\|h_{T^c}\|_1 \le \|h_T\|_1 + \|h\|_2. \tag{B.14}
\]

From [17], we have
\[
\|x\|_2 - \frac{\|x\|_1}{\sqrt{p}} \le \frac{\sqrt{p}}{4}\Big(\max_{1\le i\le p}|x_i| - \min_{1\le i\le p}|x_i|\Big) \tag{B.15}
\]
for any $x\in\mathbb{R}^p$. Let us partition $\{1,\cdots,p\}$ into the following sets:
\[
S_0 = \{1,2,\cdots,s_1\},\quad S_1 = \{s_1+1,\cdots,s_1+s_2\},\quad S_2 = \{s_1+s_2+1,\cdots,s_1+2s_2\},\ \cdots
\]
Then, it follows from (B.15) that
\[
\begin{aligned}
\sum_{i\ge 1}\|h_{S_i}\|_2 &\le \sum_{i\ge 1}\frac{\|h_{S_i}\|_1}{\sqrt{s_2}} + \frac{\sqrt{s_2}}{4}\sum_{i\ge 1}\big(|h(s_1+(i-1)s_2+1)| - |h(s_1+is_2)|\big)\\
&\le \frac{\|h_{S_0^c}\|_1}{\sqrt{s_2}} + \frac{\sqrt{s_2}}{4}|h(s_1+1)|\\
&= \frac{\|h_{T^c}\|_1 - \sum_{j=1}^{s_1-s}|h(s+j)|}{\sqrt{s_2}} + \frac{\sqrt{s_2}}{4}|h(s_1+1)|\\
&\le \frac{\|h_{T^c}\|_1 - (s_1-s)|h(s_1+1)|}{\sqrt{s_2}} + \frac{\sqrt{s_2}}{4}|h(s_1+1)|\\
&\le \frac{\|h\|_2 + \|h_T\|_1 - (s_1-s)|h(s_1+1)|}{\sqrt{s_2}} + \frac{\sqrt{s_2}}{4}|h(s_1+1)|\\
&\le \frac{\|h\|_2 + \|h_{S_0}\|_1 - 2(s_1-s)|h(s_1+1)|}{\sqrt{s_2}} + \frac{\sqrt{s_2}}{4}|h(s_1+1)|\\
&\le \frac{\|h\|_2}{\sqrt{s_2}} + \sqrt{\frac{s_1}{s_2}}\,\|h_{S_0}\|_2 + \left(\frac{\sqrt{s_2}}{4} - \frac{2(s_1-s)}{\sqrt{s_2}}\right)|h(s_1+1)|\\
&\le \frac{\|h\|_2}{\sqrt{s_2}} + \left(\sqrt{\frac{s_1}{s_2}} + \frac{\sqrt{s_2}}{4\sqrt{s_1}} - \frac{2(s_1-s)}{\sqrt{s_1 s_2}}\right)\|h_{S_0}\|_2\\
&= \frac{\|h\|_2}{\sqrt{s_2}} + a(s_1,s_2;s)\,\|h_{S_0}\|_2.
\end{aligned}
\]

For notational convenience, let us write $a(s_1,s_2;s)$ and $b(s_1)$ as $a$ and $b$, respectively. Thus, we obtain
\[
\begin{aligned}
|\langle Ah, Ah_{S_0}\rangle| &= \Big|\langle Ah_{S_0}, Ah_{S_0}\rangle + \sum_{i\ge 1}\langle Ah_{S_i}, Ah_{S_0}\rangle\Big|\\
&\ge (1-\delta_{s_1})\|h_{S_0}\|_2^2 - \theta_{s_1,s_2}\,\|h_{S_0}\|_2\sum_{i\ge 1}\|h_{S_i}\|_2\\
&\ge (1-\delta_{s_1})\|h_{S_0}\|_2^2 - \theta_{s_1,s_2}\,\|h_{S_0}\|_2\left(\frac{\|h\|_2}{\sqrt{s_2}} + a\,\|h_{S_0}\|_2\right),
\end{aligned}
\]
which implies
\[
\|h_{S_0}\|_2 \le \frac{\theta_{s_1,s_2}}{\sqrt{s_2}\,(1-\delta_{s_1} - a\theta_{s_1,s_2})}\,\|h\|_2 + \frac{|\langle Ah, Ah_{S_0}\rangle|}{(1-\delta_{s_1} - a\theta_{s_1,s_2})\,\|h_{S_0}\|_2}. \tag{B.16}
\]

Also, from the following relation
\[
\|h_{S_0^c}\|_2^2 \le \|h_{S_0^c}\|_1\,\frac{\|h_{S_0}\|_1}{s_1} \le \frac{\|h_{S_0}\|_1^2 + \|h_{S_0}\|_1\|h\|_2}{s_1} \le \|h_{S_0}\|_2^2 + \frac{1}{\sqrt{s_1}}\,\|h_{S_0}\|_2\|h\|_2,
\]
we have
\[
\|h\|_2^2 = \|h_{S_0}\|_2^2 + \|h_{S_0^c}\|_2^2 \le 2\|h_{S_0}\|_2^2 + \frac{1}{\sqrt{s_1}}\,\|h_{S_0}\|_2\|h\|_2.
\]
Or, equivalently,
\[
\|h\|_2^2 - \frac{1}{\sqrt{s_1}}\,\|h_{S_0}\|_2\|h\|_2 - 2\|h_{S_0}\|_2^2 \le 0.
\]
Then, the quadratic formula gives
\[
\|h\|_2 \le \left(\sqrt{\frac{1}{4s_1}} + \sqrt{\frac{1}{4s_1} + 2}\right)\|h_{S_0}\|_2 = b\,\|h_{S_0}\|_2. \tag{B.17}
\]

By combining (B.16) and (B.17), we obtain
\[
\begin{aligned}
\|h\|_2 &\le b\,\|h_{S_0}\|_2\\
&\le \frac{b\,\theta_{s_1,s_2}}{\sqrt{s_2}\,(1-\delta_{s_1}-a\theta_{s_1,s_2})}\,\|h\|_2 + \frac{b\,|\langle Ah, Ah_{S_0}\rangle|}{(1-\delta_{s_1}-a\theta_{s_1,s_2})\,\|h_{S_0}\|_2}.
\end{aligned}
\]
This gives
\[
\begin{aligned}
\left(1 - \frac{b\,\theta_{s_1,s_2}}{\sqrt{s_2}\,(1-\delta_{s_1}-a\theta_{s_1,s_2})}\right)\|h\|_2
&\le \frac{b\,|\langle Ah, Ah_{S_0}\rangle|}{(1-\delta_{s_1}-a\theta_{s_1,s_2})\,\|h_{S_0}\|_2}\\
&\le \frac{b\,\|Ah\|_2\,\|Ah_{S_0}\|_2}{(1-\delta_{s_1}-a\theta_{s_1,s_2})\,\|h_{S_0}\|_2}\\
&\le \frac{2b\sqrt{1+\delta_{s_1}}}{1-\delta_{s_1}-a\theta_{s_1,s_2}}\cdot\tau.
\end{aligned}
\]
Therefore,
\[
\|h\|_2 \le \frac{2b\sqrt{1+\delta_{s_1}}}{1-\delta_{s_1}-(a+b/\sqrt{s_2})\,\theta_{s_1,s_2}}\cdot\tau.
\]

We now return to the proof of Theorem 4.2.2. In the previous theorem, let us set $s_1 = s$ and $s_2 = \frac{4}{9}s$, where $s\equiv 0 \pmod 9$. Then,
\[
a(s_1,s_2;s) = \frac{5}{3} \quad\text{and}\quad \frac{b(s_1)}{\sqrt{s_2}} = \frac{3}{4s} + \frac{3}{2}\sqrt{\frac{1}{4s^2} + \frac{2}{s}}.
\]
If $A$ satisfies
\[
\delta_s + \left(\frac{5}{3} + \frac{3}{4s} + \frac{3}{2}\sqrt{\frac{1}{4s^2}+\frac{2}{s}}\right)\theta_{s,\frac{4}{9}s} < 1,
\]
the solution $x^{\mathrm{opt}}$ of (4.2) satisfies
\[
\|x^{\mathrm{opt}} - \bar{x}\|_2 \le \frac{2b(s)\sqrt{1+\delta_s}}{1 - \delta_s - \left(\frac{5}{3} + \frac{3}{4s} + \frac{3}{2}\sqrt{\frac{1}{4s^2}+\frac{2}{s}}\right)\theta_{s,\frac{4}{9}s}}\cdot\tau.
\]
Note that
\[
\begin{aligned}
\delta_s + \left(\frac{5}{3} + \frac{3}{4s} + \frac{3}{2}\sqrt{\frac{1}{4s^2}+\frac{2}{s}}\right)\theta_{s,\frac{4}{9}s}
&= \delta_s + \left(\frac{5}{3} + \frac{3}{4s} + \frac{3}{2}\sqrt{\frac{1}{4s^2}+\frac{2}{s}}\right)\theta_{\frac{9}{5}\cdot\frac{5}{9}s,\,\frac{4}{9}s}\\
&\le \delta_s + \sqrt{\frac{9}{5}}\left(\frac{5}{3} + \frac{3}{4s} + \frac{3}{2}\sqrt{\frac{1}{4s^2}+\frac{2}{s}}\right)\theta_{\frac{5}{9}s,\,\frac{4}{9}s}\\
&\le \left(1 + \sqrt{5} + \frac{9}{4\sqrt{5}\,s} + \frac{9}{2\sqrt{5}}\sqrt{\frac{1}{4s^2}+\frac{2}{s}}\right)\delta_s\\
&= (1 + c(s))\,\delta_s < 1.
\end{aligned}
\]
Therefore, under the condition
\[
\delta_s < \frac{1}{1+c(s)},
\]
we have
\[
\|x^{\mathrm{opt}} - \bar{x}\|_2 \le \frac{2b(s)\sqrt{1+\delta_s}}{1 - \delta_s - \left(\frac{5}{3} + \frac{3}{4s} + \frac{3}{2}\sqrt{\frac{1}{4s^2}+\frac{2}{s}}\right)\theta_{s,\frac{4}{9}s}}\cdot\tau
\le \frac{2b(s)\sqrt{1+1/(1+c(s))}}{1 - (1+c(s))\,\delta_s}\cdot\tau.
\]
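For concreteness, the quantities $b(s)$ and $c(s)$ appearing in this proof can be evaluated numerically. The sketch below is illustrative only: the sample values $s = 36$ (divisible by 9) and $\delta_s = 0.2$ are hypothetical, and the reported constant is the factor multiplying $\tau$ in the intermediate bound $2b(s)\sqrt{1+\delta_s}/\big(1-(1+c(s))\delta_s\big)$.

```python
import math

def b_const(s1):
    """b(s1) = sqrt(1/(4 s1)) + sqrt(1/(4 s1) + 2), cf. (B.9)."""
    return math.sqrt(1.0 / (4 * s1)) + math.sqrt(1.0 / (4 * s1) + 2)

def c_const(s):
    """c(s) as it appears in the proof of Theorem 4.2.2 (s assumed divisible by 9)."""
    return (math.sqrt(5)
            + 9.0 / (4.0 * math.sqrt(5) * s)
            + 9.0 / (2.0 * math.sqrt(5)) * math.sqrt(1.0 / (4 * s * s) + 2.0 / s))

def error_constant(s, delta_s):
    """Factor multiplying tau, valid when delta_s < 1/(1 + c(s)); otherwise None."""
    c = c_const(s)
    if delta_s >= 1.0 / (1.0 + c):
        return None
    return 2 * b_const(s) * math.sqrt(1 + delta_s) / (1 - (1 + c) * delta_s)

# Example with hypothetical values: s = 36, delta_s = 0.2
print(c_const(36), error_constant(36, 0.2))
```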

B.3 Proof of Theorem 4.2.3

Proof. The proof mostly follows [21]. From Lemma 2.1 in [21], we have
\[
|\langle Ax, Ax'\rangle| \le \delta_{s+s'}\,\|x\|_2\|x'\|_2 \tag{B.18}
\]
for all $x, x'$ supported on disjoint subsets $T, T'\subset\{1,\cdots,p\}$ with $|T|\le s$, $|T'|\le s'$. Note that $x_T$ is the vector equal to $x$ on the index set $T$ and zero elsewhere. From the triangle inequality, we have
\[
\|A(x^{\mathrm{opt}} - x)\|_2 \le 2\tau. \tag{B.19}
\]
Set $x^{\mathrm{opt}} = x + h$ and decompose $h$ into a sum of vectors $h_{T_j}$, $j = 0,1,\cdots$, each of which has sparsity at most $s$. Here, $T_0$ corresponds to the locations of the $s$ largest coefficients of $x$; $T_1$ to the locations of the $s$ largest coefficients of $h_{T_0^c}$, and so on. For each $j\ge 2$,

\[
\|h_{T_j}\|_2 \le s^{1/2}\|h_{T_j}\|_\infty \le s^{-1/2}\|h_{T_{j-1}}\|_1,
\]
and thus
\[
\sum_{j\ge 2}\|h_{T_j}\|_2 \le s^{-1/2}\|h_{T_0^c}\|_1. \tag{B.20}
\]
In particular,
\[
\|h_{(T_0\cup T_1)^c}\|_2 = \Big\|\sum_{j\ge 2}h_{T_j}\Big\|_2 \le \sum_{j\ge 2}\|h_{T_j}\|_2 \le s^{-1/2}\|h_{T_0^c}\|_1. \tag{B.21}
\]
Since $x^{\mathrm{opt}}$ is a solution of the $\ell_1-\ell_2$ minimization,
\[
\|h\|_2 + \|x\|_1 \ge \|x^{\mathrm{opt}}\|_1 = \|x + h\|_1 \ge \|x_{T_0}\|_1 - \|h_{T_0}\|_1 + \|h_{T_0^c}\|_1 - \|x_{T_0^c}\|_1,
\]
which implies
\[
\|h_{T_0^c}\|_1 \le \|h_{T_0}\|_1 + 2\|x_{T_0^c}\|_1 + \|h\|_2. \tag{B.22}
\]
By substituting (B.22) into (B.21), it follows from $\|h_{T_0}\|_1 \le s^{1/2}\|h_{T_0}\|_2$ that
\[
\|h_{(T_0\cup T_1)^c}\|_2 \le \|h_{T_0}\|_2 + 2s^{-1/2}\|x_{T_0^c}\|_1 + s^{-1/2}\|h\|_2. \tag{B.23}
\]

From the fact that $Ah_{T_0\cup T_1} = Ah - \sum_{j\ge 2}Ah_{T_j}$, we obtain
\[
\|Ah_{T_0\cup T_1}\|_2^2 = \langle Ah_{T_0\cup T_1}, Ah\rangle - \sum_{j\ge 2}\langle Ah_{T_0\cup T_1}, Ah_{T_j}\rangle.
\]
It follows from (B.19) and the RIP that
\[
|\langle Ah_{T_0\cup T_1}, Ah\rangle| \le \|Ah_{T_0\cup T_1}\|_2\|Ah\|_2 \le 2\tau\sqrt{1+\delta_{2s}}\,\|h_{T_0\cup T_1}\|_2.
\]
Also, it follows from (B.18) that
\[
|\langle Ah_{T_0}, Ah_{T_j}\rangle| \le \delta_{2s}\,\|h_{T_0}\|_2\|h_{T_j}\|_2,
\]
and since $\|h_{T_0}\|_2 + \|h_{T_1}\|_2 \le \sqrt{2}\,\|h_{T_0\cup T_1}\|_2$,
\[
\begin{aligned}
(1-\delta_{2s})\|h_{T_0\cup T_1}\|_2^2 \le \|Ah_{T_0\cup T_1}\|_2^2
&\le |\langle Ah_{T_0\cup T_1}, Ah\rangle| + \sum_{j\ge 2}|\langle Ah_{T_0\cup T_1}, Ah_{T_j}\rangle|\\
&\le 2\tau\sqrt{1+\delta_{2s}}\,\|h_{T_0\cup T_1}\|_2 + \delta_{2s}\big(\|h_{T_0}\|_2 + \|h_{T_1}\|_2\big)\sum_{j\ge 2}\|h_{T_j}\|_2\\
&\le \|h_{T_0\cup T_1}\|_2\Big(2\tau\sqrt{1+\delta_{2s}} + \sqrt{2}\,\delta_{2s}\sum_{j\ge 2}\|h_{T_j}\|_2\Big).
\end{aligned}
\]
Upon dividing both sides by $(1-\delta_{2s})\|h_{T_0\cup T_1}\|_2$, it follows from (B.22) that
\[
\begin{aligned}
\|h_{T_0\cup T_1}\|_2 &\le \frac{2\sqrt{1+\delta_{2s}}}{1-\delta_{2s}}\,\tau + \frac{\sqrt{2}\,\delta_{2s}}{1-\delta_{2s}}\sum_{j\ge 2}\|h_{T_j}\|_2\\
&\le \frac{2\sqrt{1+\delta_{2s}}}{1-\delta_{2s}}\,\tau + \frac{\sqrt{2}\,\delta_{2s}}{1-\delta_{2s}}\,s^{-1/2}\|h_{T_0^c}\|_1\\
&\le \frac{2\sqrt{1+\delta_{2s}}}{1-\delta_{2s}}\,\tau + \frac{\sqrt{2}\,\delta_{2s}}{1-\delta_{2s}}\,s^{-1/2}\big(\|h_{T_0}\|_1 + 2\|x_{T_0^c}\|_1 + \|h\|_2\big)\\
&\le \frac{2\sqrt{1+\delta_{2s}}}{1-\delta_{2s}}\,\tau + \frac{\sqrt{2}\,\delta_{2s}}{1-\delta_{2s}}\,s^{-1/2}\big(s^{1/2}\|h_{T_0\cup T_1}\|_2 + 2\|x_{T_0^c}\|_1 + \|h\|_2\big),
\end{aligned}
\]
which gives

\[
\|h_{T_0\cup T_1}\|_2 \le \left(1 - \frac{\sqrt{2}\,\delta_{2s}}{1-\delta_{2s}}\right)^{-1}\left(\frac{2\sqrt{1+\delta_{2s}}}{1-\delta_{2s}}\,\tau + \frac{\sqrt{2}\,\delta_{2s}}{1-\delta_{2s}}\,s^{-1/2}\big(2\|x_{T_0^c}\|_1 + \|h\|_2\big)\right),
\]
if $\delta_{2s} < \sqrt{2} - 1$. Finally,
\[
\begin{aligned}
\|h\|_2 &\le \|h_{T_0\cup T_1}\|_2 + \|h_{(T_0\cup T_1)^c}\|_2 \le \|h_{T_0\cup T_1}\|_2 + \|h_{T_0}\|_2 + 2s^{-1/2}\|x_{T_0^c}\|_1 + s^{-1/2}\|h\|_2\\
&\le 2\|h_{T_0\cup T_1}\|_2 + 2s^{-1/2}\|x_{T_0^c}\|_1 + s^{-1/2}\|h\|_2\\
&\le 2\left(1 - \frac{\sqrt{2}\,\delta_{2s}}{1-\delta_{2s}}\right)^{-1}\left(\frac{2\sqrt{1+\delta_{2s}}}{1-\delta_{2s}}\,\tau + \frac{\sqrt{2}\,\delta_{2s}}{1-\delta_{2s}}\,s^{-1/2}\big(2\|x_{T_0^c}\|_1 + \|h\|_2\big)\right)
+ 2s^{-1/2}\|x_{T_0^c}\|_1 + s^{-1/2}\|h\|_2\\
&= \frac{4\sqrt{1+\delta_{2s}}}{1-(\sqrt{2}+1)\delta_{2s}}\,\tau + \frac{2+(2\sqrt{2}-2)\delta_{2s}}{1-(\sqrt{2}+1)\delta_{2s}}\,\frac{\|x_{T_0^c}\|_1}{\sqrt{s}} + \frac{1+(\sqrt{2}-1)\delta_{2s}}{1-(\sqrt{2}+1)\delta_{2s}}\,\frac{\|h\|_2}{\sqrt{s}},
\end{aligned}
\]
which gives
\[
\left(1 - \frac{1+(\sqrt{2}-1)\delta_{2s}}{\big(1-(\sqrt{2}+1)\delta_{2s}\big)\sqrt{s}}\right)\|h\|_2 \le \frac{4\sqrt{1+\delta_{2s}}}{1-(\sqrt{2}+1)\delta_{2s}}\,\tau + \frac{2+(2\sqrt{2}-2)\delta_{2s}}{1-(\sqrt{2}+1)\delta_{2s}}\,\frac{\|x_{T_0^c}\|_1}{\sqrt{s}}.
\]
For $\delta_{2s} < \frac{\sqrt{s}-1}{(\sqrt{2}+1)\sqrt{s} + \sqrt{2}-1}$, we immediately have
\[
\|h\|_2 \le C_1\,\tau + C_2\,\|x_{T_0^c}\|_1,
\]
where
\[
C_1 = \frac{4\sqrt{1+\delta_{2s}}}{1 - \frac{1}{\sqrt{s}} - \left(\sqrt{2}+1+\frac{\sqrt{2}-1}{\sqrt{s}}\right)\delta_{2s}},
\qquad
C_2 = \frac{2+(2\sqrt{2}-2)\delta_{2s}}{\sqrt{s} - 1 - \big((\sqrt{2}+1)\sqrt{s}+\sqrt{2}-1\big)\delta_{2s}}.
\]
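The constants $C_1$ and $C_2$ can likewise be evaluated for given $s$ and $\delta_{2s}$. The following sketch (with hypothetical inputs) also checks the validity condition $\delta_{2s} < (\sqrt{s}-1)/\big((\sqrt{2}+1)\sqrt{s}+\sqrt{2}-1\big)$.

```python
import math

def l1l2_constants(s, delta2s):
    """C1, C2 in the bound ||h||_2 <= C1*tau + C2*||x_{T0^c}||_1 (proof of Theorem 4.2.3).
    Returns None when the condition on delta_{2s} fails."""
    r2 = math.sqrt(2.0)
    rs = math.sqrt(s)
    den = rs - 1 - ((r2 + 1) * rs + r2 - 1) * delta2s
    if den <= 0:
        return None
    C1 = 4 * math.sqrt(1 + delta2s) * rs / den
    C2 = (2 + (2 * r2 - 2) * delta2s) / den
    return C1, C2

# Example with hypothetical values: s = 25, delta_{2s} = 0.1
print(l1l2_constants(25, 0.1))
```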

APPENDIX C

PROOFS OF CHAPTER 5

C.1 Proof of Theorem 5.3.1

To prove Theorem 5.3.1, we first introduce two preliminary results.

Preliminary

Let
\[
\tilde{c} = [\tilde{c}_1,\ldots,\tilde{c}_p]^T, \qquad \tilde{c}_j = \langle f, \phi_j\rangle_{\tilde\nu}.
\]
At the $k$-th iteration, the expectation $\mathbb{E}_k$ is understood as the expectation with respect to the random sample $z_k = X_k$, conditioned upon the previous $(k-1)$ random variables $\{X_j\}_{1\le j\le k-1}$, i.e.,
\[
\mathbb{E}_k(\cdot) := \mathbb{E}_{X_k}(\cdot\,|\,X_1,\cdots,X_{k-1}). \tag{C.1}
\]

Lemma C.1.1. The $k$-th iterative solution of the algorithm (5.3) satisfies
\[
\mathbb{E}(c^{(k)}) = \frac{1}{\tilde{p}}\sum_{j=1}^{k}\left(I - \frac{1}{\tilde{p}}\Sigma\right)^{j-1}\tilde{c}. \tag{C.2}
\]

Proof. Let us define $g(c^{(k)}, X)$ as follows:
\[
g(c^{(k)}, X) = \frac{f(X) - \langle\Phi(X), c^{(k)}\rangle}{\|\Phi(X)\|_2^2}\,\Phi(X). \tag{C.3}
\]
Then,
\[
c^{(k)} = c^{(k-1)} + g(c^{(k-1)}, X). \tag{C.4}
\]

By taking the expectation of (C.4) we obtain
\[
\begin{aligned}
\mathbb{E}_k(c^{(k)}) = c^{(k-1)} + \mathbb{E}_k\big(g(c^{(k-1)}, X)\big) &= c^{(k-1)} + \frac{1}{\tilde{p}}\tilde{c} - \frac{1}{\tilde{p}}\Sigma\,c^{(k-1)}\\
&= \left(I - \frac{1}{\tilde{p}}\Sigma\right)c^{(k-1)} + \frac{1}{\tilde{p}}\tilde{c}.
\end{aligned}
\]
Then, by applying this recursive relation iteratively and taking the full expectation, we have
\[
\mathbb{E}(c^{(k)}) = \frac{1}{\tilde{p}}\sum_{j=1}^{k}\left(I - \frac{1}{\tilde{p}}\Sigma\right)^{j-1}\tilde{c}.
\]
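The iteration (C.3)-(C.4) involves only vector operations and one sample at a time, so it is simple to implement. The Python sketch below illustrates one possible realization; the Legendre basis, the uniform sampling of $X$ on $[-1,1]$, and the target function are placeholders for illustration, not the settings used in the dissertation.

```python
import numpy as np

def sequential_update(c, x, f, basis):
    """One step of c^(k) = c^(k-1) + g(c^(k-1), X), with g as in (C.3)."""
    phi = basis(x)                     # Phi(X) = (phi_1(X), ..., phi_p(X))
    residual = f(x) - phi @ c          # f(X) - <Phi(X), c^(k-1)>
    return c + residual / (phi @ phi) * phi

# Toy illustration (hypothetical setup): approximate f on [-1, 1]
# with the first p Legendre polynomials, sampling X uniformly.
p = 5
basis = lambda x: np.array([np.polynomial.legendre.Legendre.basis(j)(x) for j in range(p)])
f = lambda x: np.exp(x)
rng = np.random.default_rng(0)
c = np.zeros(p)
for _ in range(20000):
    c = sequential_update(c, rng.uniform(-1.0, 1.0), f, basis)
print(c)  # the coefficients settle near the projection coefficients of f, up to a residual term
```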

Lemma C.1.2. The $k$-th iterative solution of the algorithm (5.3) satisfies
\[
(1 - r_\ell^{\,k})\,\frac{\|f\|_{\tilde\nu}^2}{\lambda_{\max}(\Sigma)} \le \mathbb{E}(\|c^{(k)}\|_2^2) \le (1 - r_u^{\,k})\,\frac{\|f\|_{\tilde\nu}^2}{\lambda_{\min}(\Sigma)}, \tag{C.5}
\]
where $r_u = 1 - \lambda_{\min}(\Sigma)/\tilde{p}$ and $r_\ell = 1 - \lambda_{\max}(\Sigma)/\tilde{p}$.

Proof. From the relation $c^{(k)} = c^{(k-1)} + g(c^{(k-1)}, X)$, we have
\[
\mathbb{E}_k(\|c^{(k)}\|_2^2) = \|c^{(k-1)}\|_2^2 + 2\,\mathbb{E}_k\big(\langle c^{(k-1)}, g(c^{(k-1)},X)\rangle\big) + \mathbb{E}_k\big(\|g(c^{(k-1)},X)\|_2^2\big).
\]
The second term can be computed as
\[
\mathbb{E}_k\big(\langle c^{(k-1)}, g(c^{(k-1)},X)\rangle\big) = \frac{1}{\tilde{p}}\,\tilde{c}^T c^{(k-1)} - \frac{1}{\tilde{p}}\,{c^{(k-1)}}^T\Sigma\,c^{(k-1)},
\]
and the third term becomes
\[
\mathbb{E}_k\big(\|g(c^{(k-1)},X)\|_2^2\big) = \mathbb{E}_k\!\left(\frac{\big(f(X) - \langle c^{(k-1)},\Phi(X)\rangle\big)^2}{\|\Phi(X)\|_2^2}\right)
= \frac{\|f\|_{\tilde\nu}^2}{\tilde{p}} - \frac{2}{\tilde{p}}\,\tilde{c}^T c^{(k-1)} + \frac{1}{\tilde{p}}\,{c^{(k-1)}}^T\Sigma\,c^{(k-1)}.
\]
We then obtain
\[
\begin{aligned}
\mathbb{E}_k(\|c^{(k)}\|_2^2) &= \|c^{(k-1)}\|_2^2 - \frac{1}{\tilde{p}}\,{c^{(k-1)}}^T\Sigma\,c^{(k-1)} + \frac{\|f\|_{\tilde\nu}^2}{\tilde{p}}\\
&= {c^{(k-1)}}^T\left(I - \frac{1}{\tilde{p}}\Sigma\right)c^{(k-1)} + \frac{\|f\|_{\tilde\nu}^2}{\tilde{p}} \qquad\text{(C.6)}\\
&\le \left(1 - \frac{\lambda_{\min}(\Sigma)}{\tilde{p}}\right)\|c^{(k-1)}\|_2^2 + \frac{\|f\|_{\tilde\nu}^2}{\tilde{p}}.
\end{aligned}
\]
Similarly, the lower bound can be derived as
\[
\mathbb{E}_k(\|c^{(k)}\|_2^2) \ge \left(1 - \frac{\lambda_{\max}(\Sigma)}{\tilde{p}}\right)\|c^{(k-1)}\|_2^2 + \frac{\|f\|_{\tilde\nu}^2}{\tilde{p}}.
\]
After repeating the same operation $(k-1)$ times, the $k$-th term can be bounded by
\[
(1 - r_\ell^{\,k})\,\frac{\|f\|_{\tilde\nu}^2}{\lambda_{\max}(\Sigma)} \le \mathbb{E}(\|c^{(k)}\|_2^2) \le (1 - r_u^{\,k})\,\frac{\|f\|_{\tilde\nu}^2}{\lambda_{\min}(\Sigma)}, \tag{C.7}
\]
where $r_u = 1 - \lambda_{\min}(\Sigma)/\tilde{p}$ and $r_\ell = 1 - \lambda_{\max}(\Sigma)/\tilde{p}$.
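As an illustration of Lemma C.1.2, the sandwich bounds in (C.5) can be computed from any given $\Sigma$, recalling that $\tilde{p} = \mathrm{tr}(\Sigma)$. The matrix and the value of $\|f\|_{\tilde\nu}^2$ below are hypothetical.

```python
import numpy as np

def lemma_c12_bounds(Sigma, f_norm_sq, k):
    """Lower/upper bounds of (C.5) for E||c^(k)||_2^2, with p~ = tr(Sigma),
    r_u = 1 - lambda_min/p~ and r_l = 1 - lambda_max/p~."""
    eig = np.linalg.eigvalsh(Sigma)
    p_tilde = np.trace(Sigma)
    r_u = 1 - eig.min() / p_tilde
    r_l = 1 - eig.max() / p_tilde
    lower = (1 - r_l**k) * f_norm_sq / eig.max()
    upper = (1 - r_u**k) * f_norm_sq / eig.min()
    return lower, upper

# Example with a hypothetical 3x3 Gram-type matrix and ||f||^2 = 1
Sigma = np.array([[1.0, 0.2, 0.0], [0.2, 1.0, 0.1], [0.0, 0.1, 1.0]])
print(lemma_c12_bounds(Sigma, 1.0, k=50))
```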

Proof of Theorem 5.3.1

We now complete the proof of Theorem 5.3.1.

Proof.

\[
\begin{aligned}
\mathbb{E}(\|c^{(k)} - \hat{c}\|_2^2) &= \|\hat{c}\|_2^2 - 2\hat{c}^T\mathbb{E}(c^{(k)}) + \mathbb{E}(\|c^{(k)}\|_2^2)\\
&\le \|\hat{c}\|_2^2 - 2\hat{c}^T\mathbb{E}(c^{(k)}) + (1 - r_u^{\,k})\,\frac{\|f\|_{\tilde\nu}^2}{\lambda_{\min}(\Sigma)}\\
&= \|\hat{c}\|_2^2 - \frac{2}{\tilde{p}}\,\hat{c}^T\sum_{j=1}^{k}\left(I - \frac{1}{\tilde{p}}\Sigma\right)^{j-1}\tilde{c} + (1 - r_u^{\,k})\,\frac{\|f\|_{\tilde\nu}^2}{\lambda_{\min}(\Sigma)}.
\end{aligned}
\]
Similarly,
\[
\mathbb{E}(\|c^{(k)} - \hat{c}\|_2^2) \ge \|\hat{c}\|_2^2 - \frac{2}{\tilde{p}}\,\hat{c}^T\sum_{j=1}^{k}\left(I - \frac{1}{\tilde{p}}\Sigma\right)^{j-1}\tilde{c} + (1 - r_\ell^{\,k})\,\frac{\|f\|_{\tilde\nu}^2}{\lambda_{\max}(\Sigma)}.
\]

From the eigenvalue decomposition of $\Sigma$, we have
\[
\sum_{j=1}^{k}\left(I - \frac{1}{\tilde{p}}\Sigma\right)^{j-1} = Q^T D^{(k)} Q,
\]
where
\[
D^{(k)} = \mathrm{diag}\big[d_1^{(k)},\ldots,d_p^{(k)}\big], \qquad d_j^{(k)} = \frac{1 - (1 - \lambda_j/\tilde{p})^k}{\lambda_j/\tilde{p}}.
\]
Note that $\mathrm{tr}(\Sigma) = \tilde{p} = \sum_{j=1}^{p}\lambda_j$. Thus $0 < \lambda_j/\tilde{p} < 1$. Therefore, the second term in the lower and upper bounds for $\mathbb{E}(\|c^{(k)} - \hat{c}\|_2^2)$ can be written as follows:
\[
\frac{2}{\tilde{p}}\,\hat{c}^T\sum_{j=1}^{k}\left(I - \frac{1}{\tilde{p}}\Sigma\right)^{j-1}\tilde{c} = \frac{2}{\tilde{p}}\,\hat{c}^T Q^T D^{(k)} Q\,\tilde{c}.
\]

From the expansion of $P_\Pi f$, we have
\[
f(x) = \sum_{j=1}^{p}\hat{c}_j\phi_j(x) + f(x) - P_\Pi f(x) \;\Longrightarrow\; \tilde{c} = \Sigma\hat{c} + e,
\]
where $e = [\epsilon_1,\ldots,\epsilon_p]$ and $\epsilon_k = \langle f - P_\Pi f, \phi_k\rangle_{\tilde\nu}$. Thus we obtain
\[
\begin{aligned}
\frac{2}{\tilde{p}}\,\hat{c}^T\sum_{j=1}^{k}\left(I - \frac{1}{\tilde{p}}\Sigma\right)^{j-1}\tilde{c} &= \frac{2}{\tilde{p}}\,\hat{c}^T Q^T D^{(k)} Q\,\tilde{c}\\
&= \frac{2}{\tilde{p}}\,\hat{c}^T Q^T D^{(k)} Q\,\Sigma\hat{c} + \frac{2}{\tilde{p}}\,\hat{c}^T Q^T D^{(k)} Q\,e\\
&= 2\,\hat{c}^T Q^T\frac{D^{(k)}\Lambda}{\tilde{p}}\,Q\,\hat{c} + \frac{2}{\tilde{p}}\,\hat{c}^T Q^T D^{(k)} Q\,e\\
&\ge 2(1 - r_u^{\,k})\,\|\hat{c}\|_2^2 - \epsilon^{(k)},
\end{aligned}
\]
where $\epsilon^{(k)} := -\frac{2}{\tilde{p}}\,\hat{c}^T Q^T D^{(k)} Q\,e$. Similarly, we have
\[
\frac{2}{\tilde{p}}\,\hat{c}^T\sum_{j=1}^{k}\left(I - \frac{1}{\tilde{p}}\Sigma\right)^{j-1}\tilde{c} \le 2(1 - r_\ell^{\,k})\,\|\hat{c}\|_2^2 - \epsilon^{(k)}.
\]

Therefore,
\[
\begin{aligned}
\mathbb{E}\|c^{(k)} - \hat{c}\|_2^2 &\le \|\hat{c}\|_2^2 - 2(1 - r_u^{\,k})\,\|\hat{c}\|_2^2 + \epsilon^{(k)} + (1 - r_u^{\,k})\,\frac{\|f\|_{\tilde\nu}^2}{\lambda_{\min}(\Sigma)}\\
&= \left(\frac{\|f\|_{\tilde\nu}^2}{\lambda_{\min}(\Sigma)} - \|\hat{c}\|_2^2\right) + r_u^{\,k}\left(2\|\hat{c}\|_2^2 - \frac{\|f\|_{\tilde\nu}^2}{\lambda_{\min}(\Sigma)}\right) + \epsilon^{(k)},
\end{aligned}
\]
and
\[
\mathbb{E}\|c^{(k)} - \hat{c}\|_2^2 \ge \left(\frac{\|f\|_{\tilde\nu}^2}{\lambda_{\max}(\Sigma)} - \|\hat{c}\|_2^2\right) + r_\ell^{\,k}\left(2\|\hat{c}\|_2^2 - \frac{\|f\|_{\tilde\nu}^2}{\lambda_{\max}(\Sigma)}\right) + \epsilon^{(k)}.
\]

C.2 Proof of Theorem 5.3.3

Proof. The proof mostly follows Theorem 6.2 in [29]. Let
\[
Y_k = r^{-k}\big(\|c^{(k)} - \hat{c}\|_2^2 - \|f - P_\Pi f\|_\omega^2\big), \qquad r = 1 - 1/p,
\]
and let $\mathcal{F}_k$ be the sigma algebra generated by the i.i.d. random variables $\{X_j\}_{1\le j\le k}$. Since $f\in\Pi$, we have $Y_k = r^{-k}\|c^{(k)} - \hat{c}\|_2^2 \ge 0$ for all $k$. From (C.5) and (C.6), we have

\[
\begin{aligned}
\mathbb{E}_k(\|c^{(k)} - \hat{c}\|_2^2) &= \|\hat{c}\|_2^2 - 2\hat{c}^T\mathbb{E}_k(c^{(k)}) + \mathbb{E}_k(\|c^{(k)}\|_2^2)\\
&= \|\hat{c}\|_2^2 - 2\hat{c}^T\left(r\,c^{(k-1)} + \frac{1}{p}\hat{c}\right) + r\,\|c^{(k-1)}\|_2^2 + \frac{1}{p}\|f\|_\omega^2\\
&= r\big(\|c^{(k-1)}\|_2^2 - 2\hat{c}^T c^{(k-1)} + \|\hat{c}\|_2^2\big) + \frac{1}{p}\|f - P_\Pi f\|_\omega^2\\
&= r\big(\|c^{(k-1)} - \hat{c}\|_2^2 - \|f - P_\Pi f\|_\omega^2\big) + \|f - P_\Pi f\|_\omega^2,
\end{aligned}
\]
where $\mathbb{E}_k$ is defined in (C.1). Hence
\[
\begin{aligned}
\mathbb{E}[Y_k\,|\,\mathcal{F}_{k-1}] &= \mathbb{E}_k\big[r^{-k}\big(\|c^{(k)} - \hat{c}\|_2^2 - \|f - P_\Pi f\|_\omega^2\big)\big]\\
&= r^{-k}\big(\mathbb{E}_k[\|c^{(k)} - \hat{c}\|_2^2] - \|f - P_\Pi f\|_\omega^2\big)\\
&= r^{-k}\,r\big(\|c^{(k-1)} - \hat{c}\|_2^2 - \|f - P_\Pi f\|_\omega^2\big)\\
&= Y_{k-1}.
\end{aligned}
\]

It follows from $\mathbb{E}[|Y_k|] = \|P_\Pi f\|_\omega^2 - \|f - P_\Pi f\|_\omega^2 < \infty$ that $\{Y_k, \mathcal{F}_k\}_{k=1}^{\infty}$ is a martingale. Then, by the Martingale Convergence Theorem, there exists a finite-valued random variable $Y_\infty$ such that
\[
\lim_{k\to\infty} Y_k = Y_\infty \quad \text{almost surely}.
\]
Since $Y_k = r^{-k}\big(\|c^{(k)} - \hat{c}\|_2^2 - \|f - P_\Pi f\|_\omega^2\big)$, we have
\[
\lim_{k\to\infty} r^{-k}\big(\|c^{(k)} - \hat{c}\|_2^2 - \|f - P_\Pi f\|_\omega^2\big) = Y_\infty \quad \text{almost surely}.
\]
This leads to
\[
\lim_{k\to\infty}\|c^{(k)} - \hat{c}\|_2^2 = \|f - P_\Pi f\|_\omega^2 \quad \text{almost surely}.
\]

APPENDIX D

PROOFS OF CHAPTER 6

D.1 Proof of Theorem 6.2.1

Proof. We slightly modify the proof of Theorem 1.2 in [21] to incorporate the term $\sigma_{2s,1}(x)$. It can be easily seen that
\[
\|F(x^* - x)\|_2 \le \|Fx^* - y\|_2 + \|Fx - y\|_2 \le 2\tau. \tag{D.1}
\]
Set $x^* = x + h$ and decompose $h$ into a sum of vectors $h_{T_0}, h_{T_1}, h_{T_2},\ldots$, each of which has sparsity at most $s$. Here $T_0$ corresponds to the locations of the $s$ largest coefficients of $x$; $T_1$ to the locations of the $s$ largest coefficients of $h_{T_0^c}$; and so on. Note that for each $j\ge 3$,
\[
\|h_{T_j}\|_2 \le s^{1/2}\|h_{T_j}\|_\infty \le s^{-1/2}\|h_{T_{j-1}}\|_1,
\]
and thus
\[
\sum_{j\ge 3}\|h_{T_j}\|_2 \le s^{-1/2}\sum_{j\ge 3}\|h_{T_{j-1}}\|_1 \le s^{-1/2}\|h_{(T_0\cup T_1)^c}\|_1. \tag{D.2}
\]
In particular,
\[
\|h_{(T_0\cup T_1\cup T_2)^c}\|_2 \le \sum_{j\ge 3}\|h_{T_j}\|_2 \le s^{-1/2}\|h_{(T_0\cup T_1)^c}\|_1. \tag{D.3}
\]
Since $x^*$ is the $\ell_1$ minimizer,
\[
\begin{aligned}
\|x\|_1 \ge \|x + h\|_1 &= \sum_{i\in T_0\cup T_1}|x_i + h_i| + \sum_{i\in(T_0\cup T_1)^c}|x_i + h_i|\\
&\ge \|x_{T_0\cup T_1}\|_1 - \|h_{T_0\cup T_1}\|_1 + \|h_{(T_0\cup T_1)^c}\|_1 - \|x_{(T_0\cup T_1)^c}\|_1,
\end{aligned}
\]
which gives
\[
\|h_{(T_0\cup T_1)^c}\|_1 \le \|h_{T_0\cup T_1}\|_1 + 2\sigma_{2s,1}(x). \tag{D.4}
\]
By applying (D.3), it follows from (D.4) and $\|h_{T_0\cup T_1}\|_1 \le \sqrt{2s}\,\|h_{T_0\cup T_1}\|_2$ that
\[
\|h_{(T_0\cup T_1\cup T_2)^c}\|_2 \le \sqrt{2}\,\|h_{T_0\cup T_1}\|_2 + 2s^{-1/2}\sigma_{2s,1}(x). \tag{D.5}
\]

Since $Fh_{T_0\cup T_1\cup T_2} = Fh - \sum_{j\ge 3}Fh_{T_j}$, we have
\[
\|Fh_{T_0\cup T_1\cup T_2}\|_2^2 = \langle Fh_{T_0\cup T_1\cup T_2}, Fh\rangle - \sum_{j\ge 3}\langle Fh_{T_0\cup T_1\cup T_2}, Fh_{T_j}\rangle.
\]
Then (D.1), the restricted isometry property and Lemma 2.3.3 give
\[
|\langle Fh_{T_0\cup T_1\cup T_2}, Fh\rangle| \le \|Fh_{T_0\cup T_1\cup T_2}\|_2\|Fh\|_2 \le 2\tau\sqrt{1+\delta_{3s}}\,\|h_{T_0\cup T_1\cup T_2}\|_2.
\]
Also, it follows from Lemma 2.3.3 that for $j\ge 3$,
\[
\begin{aligned}
|\langle Fh_{T_0\cup T_1}, Fh_{T_j}\rangle| &\le \delta_{3s}\,\|h_{T_0\cup T_1}\|_2\|h_{T_j}\|_2,\\
|\langle Fh_{T_1\cup T_2}, Fh_{T_j}\rangle| &\le \delta_{3s}\,\|h_{T_1\cup T_2}\|_2\|h_{T_j}\|_2,\\
|\langle Fh_{T_2\cup T_0}, Fh_{T_j}\rangle| &\le \delta_{3s}\,\|h_{T_2\cup T_0}\|_2\|h_{T_j}\|_2.
\end{aligned}
\]
From the arithmetic-geometric mean inequality, we have
\[
\frac{\|h_{T_0\cup T_1}\|_2 + \|h_{T_1\cup T_2}\|_2 + \|h_{T_2\cup T_0}\|_2}{2} \le \frac{\sqrt{6}}{2}\,\|h_{T_0\cup T_1\cup T_2}\|_2,
\]
and since $h_{T_0\cup T_1\cup T_2} = \frac{1}{2}(h_{T_0\cup T_1} + h_{T_1\cup T_2} + h_{T_2\cup T_0})$,
\[
\begin{aligned}
\sum_{j\ge 3}|\langle Fh_{T_0\cup T_1\cup T_2}, Fh_{T_j}\rangle| &\le \frac{1}{2}\sum_{j\ge 3}\sum_{0\le i<k\le 2}|\langle Fh_{T_i\cup T_k}, Fh_{T_j}\rangle|\\
&\le \frac{\|h_{T_0\cup T_1}\|_2 + \|h_{T_1\cup T_2}\|_2 + \|h_{T_2\cup T_0}\|_2}{2}\,\delta_{3s}\sum_{j\ge 3}\|h_{T_j}\|_2\\
&\le \frac{\sqrt{6}}{2}\,\delta_{3s}\,\|h_{T_0\cup T_1\cup T_2}\|_2\sum_{j\ge 3}\|h_{T_j}\|_2.
\end{aligned}
\]
Hence, we obtain
\[
(1-\delta_{3s})\|h_{T_0\cup T_1\cup T_2}\|_2^2 \le \|Fh_{T_0\cup T_1\cup T_2}\|_2^2 \le \|h_{T_0\cup T_1\cup T_2}\|_2\Big(2\tau\sqrt{1+\delta_{3s}} + \frac{\sqrt{6}}{2}\,\delta_{3s}\sum_{j\ge 3}\|h_{T_j}\|_2\Big).
\]
Therefore, it follows from (D.2) that

\[
\|h_{T_0\cup T_1\cup T_2}\|_2 \le \alpha\tau + \rho\,s^{-1/2}\|h_{(T_0\cup T_1)^c}\|_1, \tag{D.6}
\]
where
\[
\alpha \equiv \frac{2\sqrt{1+\delta_{3s}}}{1-\delta_{3s}}, \qquad \rho \equiv \frac{\sqrt{6}}{2}\,\frac{\delta_{3s}}{1-\delta_{3s}}.
\]
Now we conclude from the last inequality and (D.4) that
\[
\|h_{T_0\cup T_1\cup T_2}\|_2 \le \alpha\tau + \rho\,\|h_{T_0\cup T_1\cup T_2}\|_2 + 2\rho\,s^{-1/2}\sigma_{2s,1}(x),
\]
and by solving for $\|h_{T_0\cup T_1\cup T_2}\|_2$, we obtain
\[
\|h_{T_0\cup T_1\cup T_2}\|_2 \le \frac{\alpha}{1-\rho}\,\tau + \frac{2\rho}{1-\rho}\,s^{-1/2}\sigma_{2s,1}(x).
\]
Note that the condition $\delta_{3s} < \frac{2}{2+\sqrt{6}}$ guarantees $\rho < 1$. Finally, by using (D.5) we have
\[
\begin{aligned}
\|h\|_2 &\le \|h_{T_0\cup T_1\cup T_2}\|_2 + \|h_{(T_0\cup T_1\cup T_2)^c}\|_2\\
&\le (1+\sqrt{2})\,\|h_{T_0\cup T_1\cup T_2}\|_2 + 2s^{-1/2}\sigma_{2s,1}(x)\\
&\le \frac{1+\sqrt{2}}{1-\rho}\,\alpha\tau + 2\left(\frac{(1+\sqrt{2})\rho}{1-\rho} + 1\right)s^{-1/2}\sigma_{2s,1}(x).
\end{aligned}
\]

This completes the proof.
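For reference, the constants $\alpha$, $\rho$, and the two factors in the final bound can be evaluated numerically. The sketch below uses a hypothetical RIP constant $\delta_{3s}$ and returns None when the condition $\delta_{3s} < 2/(2+\sqrt{6})$ fails.

```python
import math

def d1_constants(delta3s):
    """alpha, rho and the two constants in the final bound of Theorem 6.2.1.
    Valid when delta_{3s} < 2/(2 + sqrt(6)), which guarantees rho < 1."""
    if delta3s >= 2.0 / (2.0 + math.sqrt(6.0)):
        return None
    alpha = 2 * math.sqrt(1 + delta3s) / (1 - delta3s)
    rho = math.sqrt(6.0) / 2.0 * delta3s / (1 - delta3s)
    c_tau = (1 + math.sqrt(2)) / (1 - rho) * alpha
    c_sigma = 2 * ((1 + math.sqrt(2)) * rho / (1 - rho) + 1)  # multiplies s^{-1/2} * sigma_{2s,1}(x)
    return alpha, rho, c_tau, c_sigma

# Example with a hypothetical RIP constant delta_{3s} = 0.2
print(d1_constants(0.2))
```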

D.2 Proofs for the probabilistic results of the samples

This section contains the proofs of the two theorems regarding the probabilistic properties of the i.i.d. samples in Section 6.2.

D.2.1 Proof of Theorem 6.2.2

Proof. For fixed $\epsilon$ and $r$, let $E_{m,s}$ be the event that for each $Z_i\in\widetilde{\Theta}^Z_s$ there exists at least one random variable $Z_{j_i}$, $j_i > s$, which lies in $B(Z_i, r)$, and $\{Z_{j_i}\}_{i=1}^{s}$ are all distinct. Since $P(E_{m,s}) = 1 - P(E_{m,s}^c)$, it suffices to compute $P(E_{m,s}^c)$. $E_{m,s}^c$ is the event that there exists at least one $Z_i\in\widetilde{\Theta}^Z_s$ such that for all $j > s$, $Z_j\notin B(Z_i, r)$. We can partition the event $E_{m,s}^c$ in terms of the number of such random variables. To be more precise, let $\Omega_k$ be the event that there exist exactly $k$ random variables $Z_i\in\widetilde{\Theta}^Z_s$ such that for all $j > s$, $Z_j\notin B(Z_i, r)$. Then
\[
E_{m,s}^c = \bigcup_{k=1}^{s}\Omega_k, \qquad \Omega_i\cap\Omega_j = \emptyset \ \text{ if } i\ne j,
\]
and furthermore,
\[
P(E_{m,s}^c) = \sum_{k=1}^{s} P(\Omega_k). \tag{D.7}
\]
Let us introduce the multi-index notation $\alpha = (\alpha_1,\ldots,\alpha_s)$ with $|\alpha| = \sum_{i=1}^{s}|\alpha_i|$, where $\alpha_i$ is the number of random variables $Z_j$, $j > s$, that lie in $B(Z_i, r)$. In our setting, $|\alpha|\le m - s$. Then, for each $k$, the event $\Omega_k$ can be written in terms of $\alpha$ as follows:
\[
\Omega_k = \{\alpha : |\alpha|\le m - s,\ \|\alpha\|_0 = s - k\}.
\]
Then the probability of the event $\Omega_k$ is
\[
P(\Omega_k) = \sum_{\alpha\in\Omega_k}\binom{m-s}{\alpha}\prod_{i=1}^{s}p_i^{\alpha_i}\left(1 - \sum_{i=1}^{s}p_i\right)^{m-s-|\alpha|},
\]

D.2.2 Proof of Theorem 6.2.3

Proof. Let us partition Ω into the following sets;

s [ Ω = Ωk, where Ωk = {α ∈ Ω: kαk0 = s − k}. k=1

Note that kαk0 = s − k implies that the number of zero entries in α is k. Furthermore each

Ωk can be partitioned as

[ Ωk = Ωk,I where Ωk,I = {α ∈ Ωk : αI = 0}, I

k where I = {ji}i=1 ⊂ {1, 2, . . . , s} is an index set such that αji = 0. Then for each I,

s s !m−s−|α| s !m−s−|α| X m − s Y X X m − s Y X pαi 1 − p ≤ pαi 1 − p α i i α i i α∈Ωk,I i=1 i=1 α i/∈I i=1

199 !m−s−|α| X = 1 − pi . i/∈I Therefore, by adding up all the subcomponents, we obtain

m−s s !m−s s   X X X P(m, s, r, µ) ≥ 1 − 1 − pi − 1 − pi i=1 j=1 i6=j  m−s X X − 1 − pi − · · · 1≤j1

As letting m → ∞, the property is proved.

If pi =p ¯ for all i, we have

s P(m, s, r, µ) ≥ 1 − (1 − sp¯)m−s − (1 − (s − 1)¯p)m−s 1 s  s  − (1 − 2p)m−s − · · · − (1 − p¯)m−s 2 s − 1  s  s  s > 1 − 1 + + ··· + + (1 − p¯)m−s 1 s − 1 s

= 1 − 2s(1 − p¯)m−s.

200