TOPICS IN HIGH-DIMENSIONAL APPROXIMATION THEORY
DISSERTATION
Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of the Ohio State University
By
Yeonjong Shin
Graduate Program in Mathematics
The Ohio State University 2018
Dissertation Committee: Dongbin Xiu (Advisor), Ching-Shan Chou, Chuan Xue
© Copyright by Yeonjong Shin 2018
ABSTRACT
Several topics in high-dimensional approximation theory are discussed. The fundamental problem in approximation theory is to approximate an unknown function, called the target function, using its data (samples/observations). Depending on how the data are collected, different approximation techniques need to be sought in order to utilize the data properly. This dissertation is concerned with four approximation methods, each designed for a different data scenario.
First, suppose the data collection procedure is resource intensive, e.g., it requires expensive numerical simulations or experiments. As the performance of the approximation highly depends on the data set, one needs to decide carefully where to collect data. We thus developed a method of designing a quasi-optimal point set which guides us where to collect the data, before the actual data collection procedure. We showed that the resulting quasi-optimal set notably outperforms other standard choices.
Second, it is not always the case that we obtain exact data. Suppose the data is corrupted by unexpected external errors. These unexpected corruption errors can have large magnitude and in most cases are non-random. We proved that a well-known classical method, the least absolute deviation (LAD), can effectively eliminate the corruption errors. This is the first systematic mathematical analysis of the robustness of LAD toward such corruption.
Third, suppose the number of data points is insufficient. This leads to an underdetermined system which admits infinitely many solutions. The common approach is to seek a sparse solution via the ℓ1-norm, following the work of [22, 24]. A key issue is to promote sparsity using as few data as possible. We consider an alternative approach by employing ℓ1−ℓ2 minimization, motivated by its contours. We improve the existing theoretical recovery results for ℓ1−ℓ2 and extend them to function approximation.
Lastly, suppose the data is sequentially or continuously collected. Due to its large volume, it can be very difficult to store and process all of the data, which is a common challenge in big data. Thus a sequential function approximation method is presented. It is an iterative method which requires only vector operations (no matrices). It updates the current solution using only one sample at a time and does not require storage of the data set. Thus the method can handle sequentially collected data and gradually improve its accuracy. We establish sharp upper and lower error bounds and derive the optimal sampling probability measure. We remark that this is the first work in this new direction.
ACKNOWLEDGMENTS
With the completion of this dissertation, one big chapter of my life as a student has finally come to an end. At the moment of writing these acknowledgments, I find myself somewhat speechless and unsure of where to begin. It has been a long, adventurous journey, and it would not have been possible without the help of many individuals.
First and foremost, I express my deepest gratitude to my Ph.D. thesis advisor, Professor
Dongbin Xiu. During my graduate years, he has been a kind and thoughtful mentor and an enthusiastic and insightful advisor. His scientific insight is exceptional and his ability to share it is second to none. I am truly grateful for the support and opportunities he has provided. He has not only taught me many invaluable academic lessons, but also many important life lessons that I will treasure for a lifetime. It was a great pleasure to be his student and to work with him.
Secondly, I extend my gratitude to three of my undergraduate professors. Professor
Hyoung June Ko is at the Department of Mathematics, Yonsei University in South Korea.
He is the one who taught me how to study and understand mathematics and the underlying common concepts flowing through all fields of mathematics. He also taught me what the true treasures in my life are. Professor Jeong-Hoon Kim is also at the Department of Mathematics, Yonsei University. He is the one who helped me greatly in preparing and applying to graduate schools. Without his help, I might not have had the opportunity to study abroad. Professor Yoon Mo Jung was at Yonsei University and is now at the Department of Mathematics, Sungkyunkwan University in South Korea. He was my undergraduate research advisor. It was he who guided me to the field of computational mathematics and scientific computing. He has provided so much realistic advice and many valuable comments, and has shared his experience of a path I have yet to tread. He is also the one who introduced Professor Dongbin Xiu to me at Yonsei University, back in 2013.
Thirdly, I express my appreciation to my dissertation committee members, Professor
Ching-Shan Chou and Professor Chuan Xue, for sparing their precious time. I also thank Dr. Jeongjin Lee, who recognized my mathematical talent and encouraged me to study more advanced mathematics in my teenage years.
Last but not least, my utmost thanks and immense gratitude go to my family.
My parents, Kangseok Shin and Junghee Hwang in South Korea, and my older brother
Yeonsang Shin, who is an architect in Tokyo, Japan, have always been there for me and believed in me at every step of my life. My parents-in-law, Jangwon Lee and Mihyung Yoo in South Korea, and my brother-in-law Daesop Lee in California, have helped me in various ways during my life in the United States. Altogether, my entire family has supported me all these years and made my life a much better and much more comfortable one. Without their support and love, this dissertation would probably never have been written.
Above all, I would like to thank my beloved wife, a great pianist, Yunjin Lee, who is a doctoral student at the University of Texas at Austin. Ever since we were undergraduates together at Yonsei University, she has been supportive and encouraging, and has taken care of me. From writing thoughtful cards, to listening to my thoughts, fears, and excitement, to doing everything in her power to make sure I had the strength I needed to face the many challenges along the way, she has always been there doing anything she could to lighten the load and make me smile along the way.
VITA
1988 ...... Born in Seoul, South Korea
2013 ...... Bachelor of Science in Mathematics, Yonsei University
2013 ...... Bachelor of Arts in Economics, Yonsei University
Present ...... Graduate Research Associate, The Ohio State University
PUBLICATIONS
[7] Y. Shin, K. Wu and D. Xiu, Sequential function approximation using randomized samples, J. Comput. Phys., 2017 (submitted for publication).
[6] K. Wu, Y. Shin and D. Xiu, A randomized tensor quadrature method for high dimensional polynomial approximation, SIAM J. Sci. Comput., 39(5), A1811-A1833 (2017).
[5] Y. Shin and D. Xiu, A randomized algorithm for multivariate function approximation, SIAM J. Sci. Comput., 39(3), A983-A1002 (2017).
[4] L. Yan, Y. Shin and D. Xiu, Sparse approximation using `1-`2 minimization and its applications to stochastic collocation, SIAM J. Sci. Comput., 39(1), A229-A254 (2017).
[3] Y. Shin and D. Xiu, Correcting data corruption errors for multivariate function approximation, SIAM J. Sci. Comput., 38(4), A2492-A2511 (2016).
[2] Y. Shin and D. Xiu, On a near optimal sampling strategy for least squares polynomial regression, J. Comput. Phys., 326, 931-946 (2016).
[1] Y. Shin and D. Xiu, Nonadaptive quasi-optimal points selection for least squares linear regression, SIAM J. Sci. Comput. 38(1), A385-A411 (2016).
FIELDS OF STUDY
Major Field: Mathematics
Specialization: Approximation theory
TABLE OF CONTENTS
Page
Abstract ...... ii
Acknowledgments ...... iv
Vita...... vi
List of Figures ...... xii
List of Tables ...... xvi
Chapters
1 Introduction ...... 1
1.1 Expensive data and Least squares ...... 3
1.2 Few data and Advanced sparse approximation ...... 6
1.3 Big data and Sequential approximation ...... 7
1.4 Corrupted data and Least absolute deviation ...... 10
1.5 Application: Uncertainty Quantification ...... 11
1.6 Objective and Outline ...... 13
2 Review: Approximation, regression and orthogonal polynomials .... 14
2.1 Function approximation ...... 14
2.2 Overdetermined linear system ...... 17
2.3 Underdetermined linear system ...... 19
2.3.1 Sparsity ...... 20
2.4 Orthogonal Polynomials ...... 21
2.5 Tensor Quadrature ...... 24
2.6 Christoffel function and the pluripotential equilibrium ...... 26
2.6.1 Pluripotential equilibrium measure on Bounded domains ...... 27
2.6.2 Pluripotential equilibrium measure on Unbounded domains . . . . . 27
2.7 Uncertainty Quantification ...... 30
2.7.1 Stochastic Galerkin method ...... 31
2.7.2 Stochastic collocation ...... 32
3 Optimal Sampling ...... 34
3.1 Quasi-optimal subset selection ...... 35
3.1.1 S-optimality ...... 39
3.2 Greedy algorithm and implementation ...... 41
3.2.1 Fast greedy algorithm without determinants ...... 43
3.3 Polynomial least squares via quasi-optimal subset ...... 44
3.4 Orthogonal polynomial least squares via quasi-optimal subset ...... 45
3.4.1 Asymptotic distribution of quasi-optimal points ...... 46
3.5 Near optimal subset: quasi-optimal subset for Christoffel least squares . . . 51
3.5.1 Asymptotic distribution of near optimal points ...... 54
3.6 Summary ...... 57
3.6.1 Quasi optimal sampling for ordinary least squares ...... 57
3.6.2 Near optimal sampling for Christoffel least squares ...... 58
4 Advanced Sparse Approximation ...... 59
4.1 Review on ℓ1-ℓ2 minimization ...... 59
4.2 Recovery properties of ℓ1-ℓ2 minimization ...... 61
4.3 Function approximation by Legendre polynomials via ℓ1-ℓ2 ...... 64
4.4 Function approximation by Legendre polynomials via a weighted ℓ1-ℓ2 ...... 68
4.5 Summary ...... 71
5 Sequential Approximation ...... 72
5.1 Randomized Kaczmarz algorithm ...... 72
5.2 Sequential function approximation ...... 73
5.3 Convergence and error analysis ...... 75
5.4 Randomized tensor quadrature approximation ...... 80
5.4.1 Computational aspects of discrete sampling ...... 83
6 Correcting Corruption Errors in Data ...... 89
6.1 Assumptions ...... 91
6.2 Auxiliary results ...... 92
6.2.1 Sparsity related results ...... 92
6.2.2 Probabilistic results of the samples ...... 93
6.3 Error Analysis ...... 94
7 Numerical examples ...... 102
7.1 Optimal sampling ...... 104
7.1.1 Matrix stability ...... 105
7.1.2 Function approximation: Quasi-optimal sampling ...... 108
7.1.3 Function approximation: Near-optimal sampling ...... 115
7.1.4 Stochastic collocation for SPDE ...... 117
7.2 Advanced Sparse Approximation ...... 122
7.2.1 Function approximation: Sparse functions ...... 122
7.2.2 Function approximation: Non-sparse functions ...... 128
7.2.3 Stochastic collocation: Elliptic problem ...... 131
7.3 Sequential Approximation ...... 134
7.3.1 Continuous sampling ...... 134
7.3.2 Discrete sampling on the tensor Gauss quadrature ...... 142
7.4 Correcting Data Corruption Errors ...... 156
Bibliography ...... 164
Appendices
A Proofs of Chapter 3 ...... 171
A.1 Proof of Theorem 3.2.1 ...... 171
A.2 Proof of Theorem 3.2.2 ...... 172
A.3 Proof of Theorem 3.4.1 ...... 173
A.4 Proof of Theorem 3.4.2 ...... 175
B Proofs of Chapter 4 ...... 179
B.1 Proof of Theorem 4.2.1 ...... 179
B.2 Proof of Theorem 4.2.2 ...... 182
B.3 Proof of Theorem 4.2.3 ...... 186
C Proofs of Chapter 5 ...... 190
C.1 Proof of Theorem 5.3.1 ...... 190
C.2 Proof of Theorem 5.3.3 ...... 194
D Proofs of Chapter 6 ...... 196
D.1 Proof of Theorem 6.2.1 ...... 196
D.2 Proofs for the probabilistic results of the samples ...... 198
D.2.1 Proof of Theorem 6.2.2 ...... 198
D.2.2 Proof of Theorem 6.2.3 ...... 199
LIST OF FIGURES
Figure Page
3.1 Empirical distribution of the quasi-optimal set in 1D for determined system
by four different types of orthogonal polynomials ...... 49
3.2 Empirical distribution of the quasi-optimal set in 1D for over-determined
system by Legendre and Chebyshev ...... 50
3.3 Empirical distribution of the quasi-optimal set in 1D for over-determined
system by Jacobi(1,1) and Jacobi(0,2) ...... 50
3.4 Convergence of the empirical measures of the quasi-optimal sets ...... 51
3.5 Empirical distribution of the near optimal set for over-determined system . 55
7.1 Condition numbers of the optimal model matrix by Legendre at d = 2 . . . 106
7.2 Condition numbers of the optimal model matrix by Legendre at d = 4 . . . 106
7.3 Condition numbers of the optimal model matrix by Hermite at d = 2 . . . . 107
7.4 Condition numbers of the optimal model matrix by Laguerre at d = 2 . . . 107
7.5 Errors of quasi-optimal approximation results by Legendre at d = 2 . . . . . 109
7.6 Errors of quasi-optimal approximation for the Franke by Legendre at d = 2 111
7.7 Results of quasi-optimal approximation for the Franke by Legendre at d = 2 111
7.8 Errors of quasi-optimal approximation results by monomials at d = 2 . . . . 112
7.9 Errors of quasi-optimal approximation for the Franke by monomials at d = 2 113
7.10 Results of quasi-optimal approximation for the Franke by monomials at d = 2 114
7.11 Quasi-optimal function approximation results by Legendre at d = 5 ...... 114
7.12 Near-optimal function approximation results by Legendre at d = 2 . . . . . 116
7.13 Near-optimal function approximation results for the Franke by Legendre at
d =2...... 117
7.14 Near-optimal function approximation results by Legendre at d = 4 . . . . . 118
7.15 Near-optimal function approximation results by Hermite at d = 3 ...... 119
7.16 Convergence of optimal SPDE approximation at d = 10 ...... 120
7.17 Errors of optimal SPDE approximation at d = 10 ...... 121
7.18 ℓ1-ℓ2 approx. recovery rates for a sparse function w.r.t. the number of samples in d = 1 ...... 124
7.19 ℓ1-ℓ2 approx. errors for a sparse function w.r.t. the number of samples in d = 1 ...... 124
7.20 ℓ1-ℓ2 approx. recovery rates for a sparse function w.r.t. the sparsity in d = 1 ...... 125
7.21 ℓ1-ℓ2 approx. errors for a sparse function w.r.t. the sparsity in d = 1 ...... 125
7.22 ℓ1-ℓ2 approx. recovery rates for a sparse function w.r.t. the number of samples in d = 3 ...... 126
7.23 ℓ1-ℓ2 approx. errors for a sparse function w.r.t. the number of samples in d = 3 ...... 127
7.24 ℓ1-ℓ2 approx. recovery rates and errors for a sparse function w.r.t. the sparsity in d = 3 ...... 128
7.25 ℓ1-ℓ2 approx. errors for a sparse function w.r.t. the number of samples in d = 10 at s = 10 ...... 129
7.26 ℓ1-ℓ2 approx. errors for a sparse function w.r.t. the number of samples in d = 10 at s = 20 ...... 129
7.27 ℓ1-ℓ2 approx. errors for the Runge function w.r.t. the number of samples at d = 1 ...... 130
7.28 ℓ1-ℓ2 approx. errors for non-sparse functions w.r.t. the number of samples at d = 2 ...... 131
7.29 Stochastic Collocation by ℓ1-ℓ2 approximation at d = 3 ...... 132
7.30 Stochastic Collocation by ℓ1-ℓ2 approximation at d = 10 ...... 133
7.31 Sequential approximation via RBF basis at d = 2 ...... 136
7.32 Sequential approximation via Legendre at d = 2 ...... 137
7.33 Sequential approximation via Legendre at d = 10 ...... 139
7.34 Approximation errors in the converged solutions in d = 10 ...... 139
7.35 Error convergence versus the iteration count in d = 20, 40 ...... 141
7.36 Optimal sampling sequential approximation at d = 2 ...... 142
7.37 Optimal sampling sequential approximation at d = 40 ...... 142
7.38 Sequential approximation on tensor quadrature grid by Legendre at d = 2 ...... 144
7.39 Sequential approximation on tensor quadrature grid by Hermite at d = 2 ...... 145
7.40 Sequential approximation on tensor quadrature grid by trigonometric polynomials at d = 2 ...... 147
7.41 Sequential approximation on tensor quadrature grid by Legendre at d = 10 148
7.42 Numerical and theoretical errors of sequential approximation on tensor quadra-
ture grid by Legendre at d = 10 ...... 149
7.43 Numerical and theoretical errors of sequential approximation on tensor quadra-
ture grid by Legendre at d = 40 ...... 150
7.44 Sequential approximations on tensor quadrature grid by Legendre at d = 40
via a non-optimal sampling ...... 151
7.45 Sequential approximation on tensor quadrature grid by Hermite at d = 10 . 152
7.46 Sequential approximation on tensor quadrature grid by Hermite at d = 40 . 152
7.47 Numerical and theoretical errors of sequential approximation on tensor quadra-
ture grid by Legendre at d = 100 ...... 153
7.48 Sequential approximation on tensor quadrature grid by Hermite at d = 100 154
7.49 Sequential approximation on tensor quadrature grid at d = 500 ...... 155
7.50 Correcting data corruption using LAD by Radial basis for Franke function
in d =2...... 158
7.51 Correcting data corruption using LAD by Legendre basis for Gaussian and
continuous functions in d =4...... 159
7.52 Correcting data corruption using LAD by Legendre basis for corner peak and
product peak functions in d = 4 ...... 159
7.53 Correcting data corruption using LAD by Legendre basis for Gaussian in
d = 4 with different corruption levels ...... 160
7.54 Correcting data corruption using LAD by Legendre basis for product peak
in d = 4 with different corruption levels ...... 161
7.55 Correcting data corruption using LAD by Legendre basis for Gaussian in
d =10...... 162
7.56 Correcting data corruption using LAD by Legendre basis for corner peak in
d =10...... 163
LIST OF TABLES
Table Page
7.1 Convergence rate of sequential approximation at d = 2 ...... 137
7.2 The number of iterations used to reach the converged solutions with the
cardinality of the polynomial space at d = 10 ...... 140
7.3 Convergence rate of sequential approximation at d = 10 ...... 140
7.4 The number of iterations used to reach the converged solutions with the
cardinality of the polynomial space at d = 20 and d = 40 ...... 141
CHAPTER 1
INTRODUCTION
Modern approximation theory is concerned with approximating an unknown target function using its data (samples/observations). Modern approximation problems are driven by applications in a large number of fields, including biology, medicine, imaging, engineering, and uncertainty quantification, just to name a few. More importantly, these problems are increasingly being formulated in very high dimensions.
What makes high dimensions so special in approximation theory? Let us begin by briefly answering this question. In theory (under mild assumptions), there exists a best approximation for any given space of approximants. Such an approximation is obtained by orthogonally projecting the target function onto the linear space of approximants. The best approximation is determined by the orthogonal projection coefficients, or the best coefficients. However, evaluating such coefficients requires the computation of integrals, and computing such integrals is a challenging task. Thus, in practice, the best coefficients are unknown. Therefore, many efforts have been devoted to finding an estimate or approximation of the best coefficients. One straightforward approach is to approximate the integrals using a quadrature rule (cubature in the multivariate case), i.e., a numerical integration formula. This approach (if applicable) produces near-best approximations, and its solution is often used as a reference solution. In one dimension, Gaussian quadrature is one of the most commonly used rules, and the resulting rules are optimal [91], i.e., the minimal number of data is needed for a fixed accuracy. In high dimensions, the most straightforward way is to extend one-dimensional quadrature rules to the high-dimensional space by a tensor construction. However, the number of data needed for the rule grows
exponentially with dimension. This becomes too large in very high dimensions and makes the approximation problem intractable. For extensive reviews and collections of cubature rules, see, for example, [32, 50, 51, 89, 98, 44, 45].
Therefore, it is the high dimension that makes the approximation challenging, as the volume of data and/or the complexity of the approximation grows rapidly with dimension. This is the so-called curse of dimensionality. Hence other tractable high-dimensional approximation methods are needed.
This dissertation is concerned with the fundamental problem of function approximation in high dimensions. Let f(x) be an unknown function defined on a domain D ⊂ \mathbb{R}^d, d ≥ 1. We are interested in approximating f(x) using its sample values f(x_i), i = 1, . . . , m.
Remark 1.0.1. In this dissertation, we employ the linear regression model for approximation. Let φ_i(x), i = 1, . . . , p, be a set of basis functions in D. We then seek an approximation which has the form of a linear combination of the basis functions,

\tilde{f}(x) = \sum_{i=1}^{p} c_i \phi_i(x).
Let y = [f(x_1), . . . , f(x_m)]^T \in \mathbb{R}^m be the data vector, c = [c_1, . . . , c_p]^T \in \mathbb{R}^p be the coefficient vector, and A = (a_{ij}) = (\phi_j(x_i)), 1 ≤ i ≤ m, 1 ≤ j ≤ p, be the model matrix. In the oversampled case, m > p, the standard approach seeks to find an approximation \tilde{f} whose coefficients minimize the error, i.e.,

\min_{c} \| y - A c \|. \qquad (1.1)
When the vector 2-norm is used, this results in the well-known least squares (LSQ) method, whose literature is too large to mention here. When the vector 1-norm is used, this results in the least absolute deviations (LAD) method. The LAD method has been studied rather extensively, dating back to the early work of [1, 6, 73, 87, 11, 68]. Since the reduction of LAD to linear programming was established ([26]), the computation of LAD has become a straightforward procedure. Many of the statistical properties of LAD have also been studied, cf. [7, 76, 74]; see also [12, 75] and the references therein. Despite this progress, the LSQ method has remained the more popular choice in practical regression, largely due to its convenient inference procedure and its connection to the analysis of variance (ANOVA) (cf. [81, 42, 40, 33]).
In the undersampled case, m < p, the common approach seeks to find a sparse approximation \tilde{f}, i.e., one whose coefficients are sparse. Typically, it is obtained by solving a constrained optimization problem,

\min_{c} F(c), \quad \text{subject to } A c = y, \qquad (1.2)

where F is an objective function measuring sparsity. Among the possible choices of F, one of the most common is the ℓ1-norm, F(c) = \|c\|_1, following the remarkable success of compressive sensing (CS). Although sparsity is naturally measured by the ℓ0 norm, the ℓ1 norm is widely adopted in practice, as its use leads to a convex optimization problem. Following the early fundamental work on CS, cf. [24, 22, 34], a large amount of literature on ℓ1 minimization has emerged. We shall not conduct a review of the CS literature here, as this dissertation does not contain any ℓ1-norm approach. The objective of this dissertation is to introduce four different high-dimensional function approximation methods developed by the author in [84, 85, 103, 86, 96, 82, 83].
Each method is designed for a specific data scenario. In the rest of this chapter, we describe these data-related scenarios and their corresponding methods, along with literature reviews.
1.1 Expensive data and Least squares
Linear regression is one of the most widely used approximation models for estimation, prediction, and learning of complex systems. The least squares method (LSQ), dating back to Gauss and Legendre, is the simplest and thus most common approach in linear regression. Countless works have been devoted to the least squares method, and the amount of literature is too large to mention.
The performance of LSQ obviously depends on the sample points. The choices of the samples typically follow two distinct approaches: random samples and deterministic samples. In random sampling, one draws the samples from a probability measure, which is often defined by the underlying problem, whereas in deterministic sampling the samples follow certain fixed and deterministic rules. The rules aim to fill up the space, in which the samples are allowed to take place, in a systematic manner; examples include quasi-Monte Carlo methods, lattice rules, orthogonal arrays, etc. The study of these generally falls under the topic of design of experiments (DOE); see, for example, [3, 5, 14, 47, 49, 60, 77, 95] and the references therein.
Here we address a rather different question regarding the choice of sample points. Let
Θ_M be a set of candidate sample points, where M > 1 is the total number of points. We assume Θ_M is a large and dense set, with M ≫ 1, such that an accurate regression result can be obtained. For an accurate result, however, one is required to collect a large number (M ≫ 1) of sample data. This can be resource intensive if the data collection procedure requires expensive numerical simulations or experiments. Let m be a number with 1 ≤ m < M, denoting the number of samples one can afford to collect. We then seek to find a subset Θ_m ⊂ Θ_M such that the LSQ result based on Θ_m is as close as possible to that based on Θ_M. Theoretically, however, it is impossible to obtain the optimal subset without having all the data information on Θ_M. This dissertation introduces a non-adaptive quasi-optimal subset, originally proposed in
[84], which seeks to make the LSQ result on Θ_m close to that on Θ_M without knowledge of the data. The determination of the quasi-optimal subset is based on the optimization of a quantity that measures the determinant and the column space of the model matrix.
The non-adaptive feature of the quasi-optimal subset is useful for practical problems, as it allows one to determine, prior to conducting expensive simulations or experiments, where to collect the data samples. Moreover, a greedy algorithm is presented to compute the quasi-optimal set efficiently.
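To make the idea concrete, the following is a minimal sketch of such a greedy selection in Python. It uses a plain determinant-based (D-optimal-type) surrogate rather than the exact S-optimality criterion developed in Chapter 3; the candidate set, the basis callable, and the regularization of the Gram matrix are illustrative assumptions.

    import numpy as np

    def greedy_subset(candidates, basis, m):
        """Greedily pick m candidate points that maximize det(A_S^T A_S).

        candidates : (M, d) array of candidate sample points
        basis      : callable mapping an (n, d) array to an (n, p) model matrix
        This is a D-optimal-type surrogate, not the exact S-optimality
        criterion of Chapter 3.
        """
        A = basis(candidates)                     # full M x p model matrix
        M, p = A.shape
        chosen = []
        for _ in range(m):
            best_j, best_val = None, -np.inf
            for j in range(M):
                if j in chosen:
                    continue
                S = chosen + [j]
                G = A[S].T @ A[S]                 # Gram matrix of the trial subset
                _, logdet = np.linalg.slogdet(G + 1e-12 * np.eye(p))
                if logdet > best_val:
                    best_j, best_val = j, logdet
            chosen.append(best_j)
        return np.array(chosen)

    # Example usage with a 2D quadratic monomial basis (illustrative)
    def monomials_2d(X):
        x1, x2 = X[:, 0], X[:, 1]
        return np.column_stack([np.ones(len(X)), x1, x2, x1 * x2, x1**2, x2**2])

    cands = np.random.default_rng(0).uniform(-1, 1, (200, 2))
    print(greedy_subset(cands, monomials_2d, m=12))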
We then apply the quasi-optimal sampling strategy to function approximation. The polynomial least squares solution is sought, and it is obtained by solving an over-determined linear system of equations for the coefficients. Let the size of the model matrix be M × p, where M is the number of samples and p is the number of unknown coefficients in the polynomial, i.e., the number of basis functions. A well-accepted rule of thumb is to use linear over-sampling by letting M = αp, where α = 1.5 ∼ 3. On the other hand, recent mathematical analysis revealed that such linear over-sampling is asymptotically unstable for polynomial least squares using Monte Carlo and quasi-Monte Carlo sampling, cf. [31, 65, 64, 62]. In fact, for asymptotically stable polynomial regression, one should have at least M ∝ p log p and in many cases M ∝ p^2.
In the setup of function approximation, we introduce two different sampling strategies
[84, 85]. One is a direct application of the quasi-optimal sampling to the ordinary least squares. The other is the combination of the quasi-optimal sampling and a method termed
Christoffel least squares (CLS) [66]. CLS is a weighted least squares problem, where the weights are derived from the Christoffel function of orthogonal polynomials. Instead of using the standard MC samples from the measure of the underlying problem, one samples from the pluripotential equilibrium measure. Analysis and extensive numerical tests in both bounded and unbounded domains were presented in [66] and demonstrated a much improved stability property. Again, an important feature of these strategies is non-adaptiveness. Prior to collecting the data at the samples, we apply the quasi-optimal sampling method to obtain the quasi-optimal subset, for any given number of samples specified by the user. The data of the target function are then collected at this subset only, and the corresponding (weighted) polynomial least squares problem is solved for the approximation of the target function.
Extensive numerical examples are provided in Chapter 7.1 showing that the quasi-optimal sampling methods can deliver accurate approximations with O(1) oversampling. Compared to the well-accepted linear oversampling rate (1.5 ∼ 3), this represents a notable improvement. The quasi-optimal set also notably outperforms other standard design-of-experiment choices.
1.2 Few data and Advanced sparse approximation
In high-dimensional approximation, we often face the situation where the number of samples is (severely) smaller than the cardinality of the linear space from which the approximant is sought. Consequently, this leads to an underdetermined system that admits an infinite number of solutions. The common approach, following the remarkable success of compressive sensing (CS), is to seek a sparse approximation via ℓ1-norm minimization. In this dissertation, we introduce an alternative approach to construct sparse approximations for underdetermined systems. Specifically, we employ the ℓ1−ℓ2 minimization method, which corresponds to (1.2) with F(c) = \|c\|_1 − \|c\|_2. The method was originally proposed and studied in [59, 105]. Here we introduce a further study [103] of ℓ1−ℓ2 minimization.
The motivation for using ℓ1−ℓ2 is that its contours are closer to those of the ℓ0 norm than the contours of the ℓ1 norm are; hence ℓ1−ℓ2 minimization should be more sparsity promoting. Numerical examples in [59, 105] suggest that this is indeed the case. The theoretical estimates of the recoverability in [59, 105], however, cannot verify the numerical findings, as the estimates for ℓ1−ℓ2 minimization tend to be worse than those for the standard ℓ1 minimization. Nevertheless, the stronger sparsity-promoting property of ℓ1−ℓ2 minimization seems to be evident in a large number of numerical tests in [59, 105], which also discussed and analyzed in detail the optimization algorithms based on DCA (the difference-of-convex algorithm [92]) for the ℓ1−ℓ2 minimization problem. It has been recognized in the CS literature that non-convex methods can be sparsity promoting (cf. [27]), e.g., reweighted ℓ1 minimization ([25]) and ℓp minimization with p < 1 ([28, 57]). From this point of view, it is not surprising that ℓ1−ℓ2 minimization, as a non-convex method, appears to be more sparsity promoting. Comparisons between ℓ1−ℓ2 minimization and other non-convex methods, such as ℓp minimization with p < 1, can be found in [59]. This dissertation contains the following results. First, a set of recoverability estimates for exact recovery of sparse signals is presented. The new estimates improve those in [59, 105], although they are still unable to verify the improvement of ℓ1−ℓ2 minimization over ℓ1 minimization. New theoretical estimates of the ℓ1−ℓ2 recoverability for general non-sparse signals are presented, which were not available before. Again, these estimates are not better than those for the standard ℓ1 minimization.
We then extend the ℓ1−ℓ2 minimization technique to orthogonal polynomial approximation. In particular, we focus on the use of Legendre polynomials and the case of multiple dimensions. This is similar to the study of ℓ1 minimization for Legendre polynomials in [36, 61, 79, 102]. Both the straightforward ℓ1−ℓ2 minimization and the Chebyshev-weighted ℓ1−ℓ2 minimization are discussed, together with recoverability results for both sparse polynomials and general non-sparse polynomials. The estimates suggest that the Chebyshev-weighted ℓ1−ℓ2 minimization is more efficient in low dimensions. In high dimensions, the standard non-weighted ℓ1−ℓ2 minimization with uniform random samples should be preferred. We remark that a similar result holds for ℓ1 minimization with Legendre polynomials [102].
Based on various numerical results presented in Chapter 7.2, it is observed that ℓ1−ℓ2 minimization performs consistently better than standard ℓ1 minimization. The available theoretical estimates, in [59, 105] and the improved ones presented here ([103]), do not validate this finding. Nevertheless, our study suggests that ℓ1−ℓ2 minimization can be a viable option for sparse orthogonal polynomial approximation of underdetermined systems. This is especially useful in stochastic collocation, as the samples are often extraordinarily expensive to acquire.
1.3 Big data and Sequential approximation
In extremely high dimensions, any classical approximation technique becomes intractable due to the gigantic complexity of the approximation, p ≫ 1. We thus introduce a sequential function approximation method proposed in [86, 96, 82]. The resulting algorithm is particularly suitable for high-dimensional problems when data collection is cheap.
This method is motivated by the randomized Kaczmarz algorithm developed in [88].
The Kaczmarz method was first proposed in [56] as an iterative algorithm for solving a consistent overdetermined system of equations Ax = b. In its original form, the Kaczmarz algorithm updates the k-th solution to
x^{(k+1)} = x^{(k)} + \frac{b_i - \langle x^{(k)}, A_i \rangle}{\| A_i \|_2^2} A_i,

where A_i is the i-th row of the matrix A and is chosen in a cyclic manner to sweep through the rows of A. Due to its simplicity, the Kaczmarz algorithm has found use in applications such as computed tomography and image processing, cf. [48, 54]. In the randomized Kaczmarz
method, the i-th row is randomly drawn with probability proportional to \| A_i \|_2^2. It was shown that the randomized Kaczmarz (RK) method converges exponentially in expectation ([88]). Many works followed, analyzing the properties and improving the performance of the RK method, cf. [70, 106, 29, 38, 58, 15, 94, 71], all of which focus on linear systems of equations.
In this dissertation, we introduce a sequential function approximation method [86] which extends the idea of the RK method beyond linear systems of equations and applies it to function approximation in multiple dimensions. We remark that function approximation by the RK method was demonstrated in [88] in one of the examples, for a bandlimited function in a Fourier basis. This corresponds to the case in which the target function resides in the linear subspace of the approximation, with a finite number of samples. Here, however, we consider a more general setting, where the target function need not reside in the linear subspace of the approximation.
The sequential function approximation method [86] has a distinct feature: the model matrix A of size m × p is never formed. This is particularly useful for high-dimensional problems with d ≫ 1. In this case, the dimensionality of the linear subspace of approximants is also large, i.e., p ≫ 1. When data collection is cheap and the amount of data is large, known as the big data problem, we have m ≫ p ≫ 1. Even the declaration of the model matrix A could easily exceed the system memory of any computer. The sequential approximation method provides a viable option for this, as it iteratively seeks the approximation one data point at a time. This implies that every iteration of the method requires only operations on a vector of size 1 × p, thus inducing O(p) operations. The implementation of the method becomes independent of the size of the data set, allowing one to handle an extremely large number of data points and/or continuously arriving data.
A general convergence analysis is presented. The analysis reveals that, by drawing the sample data randomly from a probability measure, the method converges under mild conditions. Both upper and lower bounds for the convergence rate are established, which translate into bounds for the approximation error. The convergence and error are measured against the best L^2 approximation of the target function, which is the orthogonal projection onto the linear subspace of approximants. The established convergence rate and error bounds are in expectation with respect to different sampling sequences. We also present an almost sure convergence result: under a more restrictive condition, the bounds for the convergence rate and the approximation error are valid almost surely. Furthermore, we introduce an optimal sampling probability which results in the optimal rate of convergence, 1 − 1/p. With the optimal sampling, the error analysis is expressed as an equality, rather than the inequalities which are common in standard error estimates. When orthogonal polynomials are used as the basis functions, this optimal sampling measure has a close connection with the equilibrium measure of pluripotential theory ([9]). This connection has been discussed in the context of polynomial least squares ([52, 66]).
Following the introduction of this sequential approximation method, we apply it to the full tensor quadrature grids [96]. As mentioned earlier, due to the underlying tensor structure, the number of samples grows exponentially with dimension, which limits the applicability of tensor quadrature to very high dimensions. However, the sequential approximation method allows us to take advantage of the desirable mathematical properties of the full tensor quadrature points in an efficient manner. Both the theoretical analysis and the numerical examples indicate that highly accurate approximation results can be reached in K ∼ γp iterations, where p is the dimensionality of the polynomial space and γ = 5 ∼ 10. Let the total number of full tensor points be m^d. This implies that only a very small portion of the full tensor Gauss points is used, as γp ≪ m^d when d ≫ 1. The operation count of the method is O(Kp), which becomes O(p^2) and can be notably smaller than the operation count of the standard least squares method, O(p^3). Also, since the sequential method always operates on row vectors of length p, it requires only O(p) storage of real numbers and avoids matrix operations, which would require O(p^2) storage. This feature makes the method highly efficient in very high dimensions d ≫ 1.
1.4 Corrupted data and Least absolute deviation
We now turn our attention to the quality of data. So far, we have assumed that the collected
data is exact (or perhaps up to some white noise). In practice, however, this is not always the case. Let us consider the case when the data vector is corrupted by unexpected external errors. That is, the data vector is

b = f + e_s, \qquad (1.3)

where e_s \in \mathbb{R}^m is the corruption error vector with sparsity s. Let the model matrix A be overdetermined, m > p, and let the sparsity s be reasonably small, i.e., all but a relatively small number s of the entries of e_s are zero. The locations of the s non-zero entries are arbitrary, however, and their magnitudes (in absolute value) can be large. This is a likely situation in practice when the machines collecting the samples, via either experimentation or computation, become faulty with a small probability and produce samples with erratic errors. (Sometimes these are referred to as "soft errors", as opposed to the standard noisy errors, which are modeled as i.i.d. random variables.) Note that the underlying mechanism causing these corruption errors could be completely deterministic, e.g., via faulty machinery; however, the locations and amplitudes of the errors are unknown.
In this dissertation, we introduce results by [83] which show that these corruption
errors can be effectively eliminated (with overwhelming probability) by conducting ℓ1-minimization (the LAD method), i.e.,

\min_{c} \| b - A c \|_1. \qquad (1.4)
By "effectively eliminated", we mean that the approximation \tilde{f}(x) constructed using the solution of (1.4) has an error that can be bounded by the approximation errors of the ℓ1 and ℓ2 models using the uncorrupted data. We remark that it has long been recognized that the LAD method is more robust (than the LSQ method) with respect to outliers. In a way, one may regard the corruption errors discussed here as a kind of outlier. From this perspective, the work of [83] provided the first systematic and quantitative analysis of the "robustness" of LAD toward outliers. However, corruption is a much more general concept than outliers, as it can occur much more frequently (e.g., in 30-40% of the data).
The analysis presented here [83] is motivated by the work of [23, 20], where it was shown that ℓ1-minimization for overdetermined systems can remove sparse errors. The major, and important, difference between the work of [83] and that of [23, 20] is that [23, 20] deals with the linear system of equations f = Ac and, more importantly, assumes the system is consistent. That is, the system f = Ac can be solved exactly in the overdetermined case m > p. Here we are concerned with linear regression and function approximation. Consequently, the system f = Ac is never consistent and has an error e = f − Ac ≠ 0. The error e is directly related to the approximation error \tilde{f}(x) − f(x), which is usually not sparse. With the additional sparse corruption error e_s added to f, the overall error e + e_s is not sparse. Therefore, the theory of [23, 20] does not readily apply.
The theoretical analysis for (1.4) was developed in [83] and it was shown there that
the approximation \tilde{f}(x) constructed from the solution of (1.4) has an error bounded by the errors of the LAD and LSQ approximations obtained using the uncorrupted data. In
this dissertation, we aim to give a comprehensive introduction to these results. Extensive
numerical examples are provided in Chapter 7.4.
1.5 Application: Uncertainty Quantification
High-dimensional approximation theory is deeply related to uncertainty quantification (UQ), and two of the approximation methods presented in this dissertation directly apply to UQ problems. Here we briefly introduce what UQ is and how it is related to approximation theory.
According to Wikipedia, uncertainty quantification is the science of quantitative characterization and reduction of uncertainties in both computational and real-world applications. According to the U.S. Department of Energy (2009, p. 135), uncertainty quantification studies all sources of error and uncertainty. More precisely, UQ is the end-to-end study of the reliability of scientific inferences. A comprehensive introduction to UQ can be found in [90].
In recent years, stochastic computation has drawn enormous attention because of the need for UQ in complex physical systems. Many approaches have focused on solving large and complex systems under uncertainty. One of the most popular approaches is to employ orthogonal polynomials, following the work on generalized polynomial chaos (gPC) [46, 101]. In the most standard gPC expansion, the basis functions are orthogonal polynomials satisfying

\mathbb{E}[\phi_i \phi_j] = \int \phi_i(x) \phi_j(x) \, d\mu(x) = \delta_{ij}, \qquad (1.5)

where \mathbb{E} stands for the expectation operator and δ_{ij} is the Kronecker delta. The type of orthogonal polynomials is then determined by the probability distribution via the orthogonality condition. For example, the Gaussian distribution leads to Hermite polynomials, the uniform distribution leads to Legendre polynomials, etc. Many variations of the gPC expansion have appeared since its introduction in [101]. For detailed reviews, see [99]. A brief review can be found in Chapter 2.7.
The basic idea of the gPC approach is to construct an orthogonal polynomial approximation of the solution of the stochastic system. To obtain the gPC expansion numerically, Galerkin and collocation methods are widely used. Chapter 2.7 compactly reviews the general frameworks. Galerkin approaches require one to modify existing deterministic codes.
On the other hand, collocation approaches allow one to reuse existing deterministic codes.
This makes the collocation approach highly popular in practical implementation.
In stochastic collocation, we first choose a set of samples of the input random variables. Then, given these samples, the problem becomes deterministic and can be solved by the existing code repetitively. Finally, a polynomial approximation of the solution can be constructed via the solution samples. This is where high-dimensional multivariate polynomial approximation becomes important. Popular approaches for stochastic collocation include sparse grid interpolation ([100, 39, 4, 41, 67]), discrete expansion ([80, 97]), sparse ℓ1-minimization approximation ([35, 102, 104]), etc. Least squares has also been explored ([30, 93, 2, 55]).
1.6 Objective and Outline
The objective of this dissertation is to give a comprehensive introduction to four different high-dimensional approximation methods, developed by the author in [84, 85, 103, 86, 96,
82, 83]. We remark that each method is designed for a specific data-related scenario.
The dissertation is organized as follows. In Chapter 2, we review preliminary mathematics which will be used frequently throughout this dissertation. This includes concepts and results from approximation theory, compressive sensing, pluripotential theory, orthogonal polynomials, tensor quadrature, and uncertainty quantification. Chapter 3 is about an optimal sampling strategy for the least squares method. Assuming that collecting data is expensive, we discuss the optimal locations at which to collect data, prior to the actual data collection procedure, in order to achieve the best least squares performance. This chapter is based on [84, 85]. In Chapter 4, assuming only a small number of data points is available, we discuss an advanced sparse approximation using ℓ1−ℓ2 minimization. This is based on the work of [103]. In Chapter 5, a sequential function approximation method is introduced. This method is designed for high-dimensional problems with d ≫ 1 and is especially suitable
when the data is sequentially or continuously collected. It is an iterative method which
requires only vector operations and never forms a matrix. The contents of this chapter are based on [86, 96, 82]. Chapter 6 discusses a way to eliminate the effect of corruption errors in data. In this case, we assume the data is corrupted by unexpected external errors. This chapter aims to present a mathematical analysis which shows that the least absolute deviation method can effectively eliminate the corruption errors. It is based on the work of [83]. Finally, extensive numerical examples for all four methods are
provided in Chapter 7.
CHAPTER 2
REVIEW: APPROXIMATION, REGRESSION AND ORTHOGONAL POLYNOMIALS
In this chapter, we review preliminary mathematics that will be used frequently throughout this dissertation.
2.1 Function approximation
Let D ⊂ \mathbb{R}^d, d ≥ 1, be a domain equipped with a (probability) measure ω. Let L^2_\omega(D) be the standard Hilbert space with the inner product

\langle g, h \rangle_\omega = \int_D g(x) h(x) \, d\omega, \quad \forall g, h \in L^2_\omega(D), \qquad (2.1)

and the corresponding induced norm \| \cdot \|_\omega. We assume the measure ω is absolutely continuous and often explicitly write

d\omega = w(x) \, dx. \qquad (2.2)
Let f(x) be an unknown function in L^2_\omega(D). Let φ_i(x), i = 1, . . . , p, be a set of basis functions in D and Π be the linear space they span, i.e.,

\Pi = \mathrm{span}\{ \phi_j(x) \mid 1 \le j \le p \}. \qquad (2.3)

Then we seek an approximation in Π which can be written as

\tilde{f}(x; c) = \sum_{i=1}^{p} c_i \phi_i(x), \quad c = [c_1, \cdots, c_p]^T. \qquad (2.4)
For a given measure ω (2.2) and a linear space Π (2.3), one can always find the best L^2_\omega approximation of f in Π. It is defined as

P_\Pi f = \operatorname*{argmin}_{g \in \Pi} \| f - g \|_\omega.

Equivalently, this can be written as the problem of finding the expansion coefficients which minimize the error with respect to L^2_\omega, i.e.,

\hat{c} = \operatorname*{argmin}_{c \in \mathbb{R}^p} \Big\| f - \sum_{i=1}^{p} c_i \phi_i \Big\|_\omega, \quad \text{where } c = [c_1, \cdots, c_p]^T. \qquad (2.5)

Then the best L^2_\omega approximation of f can be written as

P_\Pi f = \sum_{i=1}^{p} \hat{c}_i \phi_i, \quad \text{where } \hat{c} = [\hat{c}_1, \cdots, \hat{c}_p]^T.

Let V be the p × p matrix whose (i, j)-entry is \langle \phi_i, \phi_j \rangle_\omega, and let \hat{f}_\phi be the p × 1 vector whose i-th entry is \langle f, \phi_i \rangle_\omega. By differentiating (2.5), it readily follows that the best L^2_\omega projection coefficients are

\hat{c} = V^{-1} \hat{f}_\phi. \qquad (2.6)

We remark that if the basis \{\phi_j(x), j = 1, . . . , p\} is orthonormal, \hat{c} is simply \hat{f}_\phi. If the basis
\phi_j is not orthonormal, there exists a linear transformation (matrix) T which transforms the non-orthonormal basis \phi_j into an orthonormal basis, for example via the Gram-Schmidt process. That is,

T^{-1} [\phi_1(x), \phi_2(x), \ldots, \phi_p(x)]^T = [\psi_1(x), \psi_2(x), \ldots, \psi_p(x)]^T, \qquad (2.7)

where

\langle \psi_i, \psi_j \rangle_\omega = \delta_{ij}, \quad 1 \le i, j \le p. \qquad (2.8)

In terms of the orthonormal basis \psi_j, the best L^2_\omega approximation is written as

P_\Pi f = \sum_{j=1}^{p} \hat{f}_j \psi_j(x), \quad \hat{f}_j = \langle f, \psi_j \rangle_\omega. \qquad (2.9)

Let \hat{f}_\psi = [\hat{f}_1, \cdots, \hat{f}_p]^T. Using the transformation (2.7), we have T^{-1} \hat{f}_\phi = \hat{f}_\psi. Also, \hat{c} = T^{-T} \hat{f}_\psi. It then follows from (2.6) that

V^{-1} \hat{f}_\phi = \hat{c} = T^{-T} \hat{f}_\psi = T^{-T} T^{-1} \hat{f}_\phi = (T T^T)^{-1} \hat{f}_\phi,

which implies V = T T^T.
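As a small numerical illustration of (2.6), the following sketch computes the best L^2 projection coefficients of a target function onto a non-orthonormal monomial basis on [-1, 1] under the uniform probability measure, approximating the inner products by Gauss-Legendre quadrature; the target function, basis, and quadrature order are illustrative choices.

    import numpy as np

    # Gauss-Legendre quadrature on [-1, 1], normalized to the uniform probability measure
    nodes, weights = np.polynomial.legendre.leggauss(30)
    weights = weights / 2.0

    def inner(g, h):
        """Discrete approximation of <g, h>_omega."""
        return np.sum(weights * g(nodes) * h(nodes))

    # Non-orthonormal basis: monomials 1, x, x^2, x^3
    basis = [lambda x, k=k: x**k for k in range(4)]
    f = lambda x: np.exp(x)

    V = np.array([[inner(phi_i, phi_j) for phi_j in basis] for phi_i in basis])  # Gram matrix
    f_phi = np.array([inner(f, phi) for phi in basis])
    c_hat = np.linalg.solve(V, f_phi)   # best L^2 projection coefficients, cf. (2.6)
    print(c_hat)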
If we adopt the matrix-vector notation

\Phi(x) = [\phi_1(x), \ldots, \phi_p(x)]^T, \quad \Psi(x) = [\psi_1(x), \ldots, \psi_p(x)]^T \qquad (2.10)

for the general basis and the orthonormal basis, respectively, the orthogonal projection can be written as

P_\Pi f = \langle \Phi(x), \hat{c} \rangle = \langle \Psi(x), \hat{f}_\psi \rangle, \quad \hat{c} = T^{-T} \hat{f}_\psi = (T T^T)^{-1} \hat{f}_\phi, \qquad (2.11)

where \langle \cdot, \cdot \rangle denotes the vector inner product.
Remark 2.1.1. The best L^2_\omega projection coefficients \hat{c} are available only if either \hat{f}_\psi or \hat{f}_\phi is available, and \hat{f}_\psi or \hat{f}_\phi is usually unavailable as it requires knowledge of the unknown target function f(x). Hence, the best L^2_\omega projection coefficients \hat{c} are not available in practice.
We are thus interested in approximating f(x) using its sample values f(x_i), i = 1, . . . , m. Let

A = (a_{ij}), \quad a_{ij} = \phi_j(x_i), \quad 1 \le i \le m, \ 1 \le j \le p, \qquad (2.12)

be the model matrix, and

y = [f(x_1), . . . , f(x_m)]^T \in \mathbb{R}^m
be the samples of f, and
b = y + e,
be the data vector, where either e = 0 (exact data) or e may be the noise or the external
corruption error. For ease of exposition, let us assume (for now) e = 0, i.e., b = y.
Remark 2.1.2. The data y consists of two kinds of information. One is the location where
the sample is collected, i.e., \Theta = \{x_i\}_{1 \le i \le m}. The other is the function value at each location. Altogether, we have \{(x_i, f(x_i))\}_{1 \le i \le m}.
Remark 2.1.3. For the rest of this dissertation, the model matrix A is assumed to be full rank.
2.2 Overdetermined linear system
In an overdetermined linear system, the number of data points m is greater than the number of basis functions p, i.e., m > p. The standard approach seeks to find an approximation \tilde{f} whose coefficients minimize the error, i.e.,

c^* = \operatorname*{argmin}_{c \in \mathbb{R}^p} \| y - A c \|. \qquad (2.13)

Different norms for measuring the error result in different approximation methods.
Least squares method
When the Euclidean norm (2-norm) is used, the minimization (2.13) results in the well-known ordinary least squares (OLS) method. That is,

c^* = \operatorname*{argmin}_{c \in \mathbb{R}^p} \| y - A c \|_2^2 = \operatorname*{argmin}_{c \in \mathbb{R}^p} \sum_{i=1}^{m} \big( f(x_i) - \tilde{f}(x_i; c) \big)^2, \qquad (2.14)

where \tilde{f}(x; c) is defined in (2.4). One can generalize the ordinary least squares to the weighted least squares (WLS) method. Let W = \mathrm{diag}(w_1, . . . , w_m) be a diagonal matrix with positive entries. Then the weighted least squares minimizes the weighted error, i.e.,
c^* = \operatorname*{argmin}_{c \in \mathbb{R}^p} \| \sqrt{W} y - \sqrt{W} A c \|_2. \qquad (2.15)
Its solution is well known and has an analytical closed form
c^* = (A^T W A)^{-1} A^T W y. \qquad (2.16)
Note that when W is an identity matrix, the WLS (2.15) becomes the OLS (2.14).
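As a minimal illustration of (2.12)-(2.16), the following sketch assembles a model matrix from sample points and solves the weighted least squares problem via the closed form (2.16); the monomial basis, target function, and unit weights are illustrative choices.

    import numpy as np

    rng = np.random.default_rng(1)
    m, p = 50, 5
    x = rng.uniform(-1.0, 1.0, m)                  # sample locations
    A = np.vander(x, p, increasing=True)           # model matrix a_ij = phi_j(x_i), monomial basis
    y = np.sin(np.pi * x)                          # data vector of samples f(x_i)
    w = np.ones(m)                                 # WLS weights; all ones reduces to OLS

    # Closed-form WLS solution (2.16): c = (A^T W A)^{-1} A^T W y
    AtW = A.T * w
    c_wls = np.linalg.solve(AtW @ A, AtW @ y)
    print(c_wls)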
We remark that the least squares method can be considered as a discrete projection. Let us consider a discrete measure σ on the data locations \Theta = \{x_i\}_{i=1}^m such that
\sigma(x) = \sum_{i=1}^{m} w_i \, \delta(x - x_i).

With this discrete measure σ, one can define the discrete indefinite inner product
\langle g, h \rangle_\sigma = \sum_{i=1}^{m} w_i \, g(x_i) h(x_i), \qquad (2.17)

which does not satisfy the non-degeneracy condition on L^2_\omega(D). Let us denote its corresponding semi-norm by \| \cdot \|_\sigma. Then (2.15) can be viewed as minimizing the error with respect to σ, i.e.,
\| \sqrt{W} y - \sqrt{W} A c \|_2^2 = \| f(x) - \tilde{f}(x; c) \|_\sigma^2 = \int_\Theta \Big( f(x) - \sum_{i=1}^{p} c_i \phi_i(x) \Big)^2 d\sigma(x). \qquad (2.18)

If the discrete indefinite inner product (2.17) satisfies the non-degeneracy condition on a
linear subspace Π of L^2_\omega, this discrete measure σ indeed induces an inner product and its corresponding norm on Π. Then the least squares solution is a discrete projection of f onto Π. Such a Π exists; however, since this is not directly related to our topic, we do not discuss it further here. The interested reader can consult [63].
Least absolute deviation
When the 1-norm is used in (2.13), this results in the least absolute deviation (LAD) method. That is,

c^* = \operatorname*{argmin}_{c \in \mathbb{R}^p} \| y - A c \|_1 = \operatorname*{argmin}_{c \in \mathbb{R}^p} \sum_{i=1}^{m} \big| f(x_i) - \tilde{f}(x_i; c) \big|. \qquad (2.19)

Though the idea of LAD is just as straightforward as that of LSQ, LAD is not as simple to compute. Unlike least squares, least absolute deviations does not have an analytical solution, and its solution is not necessarily unique. However, since LAD can be recast as a linear program, its solution can be found via the following linear program:

\min_{c, u} \sum_{i=1}^{m} u_i \quad \text{subject to} \quad A c - u - y \le 0, \quad A c + u - y \ge 0. \qquad (2.20)
At the optimum, u equals the vector of absolute errors of the LAD solution. The LAD problem (2.19) can thus be solved using any linear programming solver, for example, ℓ1-magic [19].
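For illustration, here is a minimal sketch that solves the LAD problem (2.19) through the linear program (2.20) using SciPy's general-purpose linprog solver (an illustrative choice; any LP solver would do), stacking the unknowns as [c; u].

    import numpy as np
    from scipy.optimize import linprog

    def lad_fit(A, y):
        """Solve min_c ||y - A c||_1 via the LP (2.20), with variables z = [c; u]."""
        m, p = A.shape
        obj = np.concatenate([np.zeros(p), np.ones(m)])     # minimize sum of u_i
        # A c - u <= y   and   -A c - u <= -y   (i.e., |A c - y| <= u componentwise)
        A_ub = np.block([[A, -np.eye(m)], [-A, -np.eye(m)]])
        b_ub = np.concatenate([y, -y])
        bounds = [(None, None)] * p + [(0, None)] * m        # c free, u >= 0
        res = linprog(obj, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
        return res.x[:p]

    # Example: cubic monomial fit with a few large outliers in the data (illustrative)
    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, 100)
    A = np.vander(x, 4, increasing=True)
    y = np.cos(np.pi * x)
    y[:5] += 10.0                                            # corrupt a few entries
    print(lad_fit(A, y))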
The overdetermined ℓ1 minimization problem (2.19) is equivalent to the constrained ℓ1-norm minimization problem (2.22). This equivalence is a central mechanism for correcting corruption errors in Chapter 6.
Let F \in \mathbb{R}^{(m-p) \times m} be the kernel of the model matrix A, i.e., a matrix satisfying FA = 0. One way to construct it is via QR factorization. Let
A = QR = \hat{Q}\hat{R}

be the full QR factorization and the skinny (reduced) QR factorization of A, respectively, where, following the traditional notation, we write

Q = [\hat{Q}, \tilde{Q}] \in \mathbb{R}^{m \times m}, \quad R = [\hat{R}; 0] \in \mathbb{R}^{m \times p}.
We can then set F = \tilde{Q}^T and immediately obtain

F A = \tilde{Q}^T \hat{Q} \hat{R} = 0. \qquad (2.21)
It is then easy to verify that the following problem (2.22) is equivalent to (2.19).
\min_{g} \| g \|_1 \quad \text{subject to} \quad F g = F y. \qquad (2.22)
Proof of equivalence between (2.19) and (2.22). Note that any g satisfying F g = F y has the form g = y − Ac for some c. Suppose c^* is a solution to (2.19) and set g^* = y − Ac^*. Then g^* satisfies F g^* = F(y − Ac^*) = F y. Since c^* is the minimizer of (2.19), we have \|g^*\|_1 \le \|g\|_1 for any g satisfying F g = F y, which implies that g^* is a minimizer of (2.22). Conversely, let g^* be a minimizer of (2.22). Since F g^* = F y, there exists c^* satisfying g^* = y − Ac^*. For any c, let g = y − Ac; then it satisfies F g = F y. Since g^* is a minimizer, we have \|y − Ac^*\|_1 = \|g^*\|_1 \le \|g\|_1 = \|y − Ac\|_1, which implies that c^* is a minimizer of (2.19). Hence, the equivalence is shown.
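A small numerical check of this construction (NumPy's QR routine is an illustrative choice):

    import numpy as np

    rng = np.random.default_rng(2)
    m, p = 12, 4
    A = rng.standard_normal((m, p))
    y = rng.standard_normal(m)

    Q, R = np.linalg.qr(A, mode="complete")   # full QR factorization, Q is m x m
    F = Q[:, p:].T                            # F = Q_tilde^T, of size (m - p) x m
    print(np.allclose(F @ A, 0.0))            # verifies (2.21): F A = 0

    # Any g of the form y - A c automatically satisfies F g = F y
    c = rng.standard_normal(p)
    g = y - A @ c
    print(np.allclose(F @ g, F @ y))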
For further discussion of LAD, the interested reader can consult the early work of
[1, 6, 73, 87, 11, 68] and [7, 76, 74] for the statistical properties. Also, see [12, 75] and the references therein.
2.3 Underdetermined linear system
When the number of data points is smaller than the number of basis functions, that is, m < p, the minimization problem (2.13) admits infinitely many solutions, regardless of the choice of the norm, as all solutions satisfy Ac = y.
Since A has full rank m, the set of all solutions is

\{ c \in \mathbb{R}^p \mid y = A c \} = \{ c_p + z \in \mathbb{R}^p \mid z \in \mathcal{N}(A) \},

where c_p is any particular solution satisfying A c_p = y and \mathcal{N}(A) is the null space of A. Therefore, one can choose among the solution set a specific solution which has a feature of interest. Typically such a specific choice is made via an optimization problem,
\min_{c} F(c), \quad \text{subject to } A c = y, \qquad (2.23)
where F is an objective function which measures the feature of interest one wants to enforce
on the solution.
2.3.1 Sparsity
Sparsity plays an important role in many fields of science. For example, sparsity leads to
dimensionality reduction and efficient modeling. Thus among many other features, sparsity
is one of the most common and popular choices. Mathematically, sparsity is measured by the ℓ0-norm. For v \in \mathbb{R}^p,
\| v \|_0 = \big| \{ i \mid v_i \neq 0, \ 1 \le i \le p \} \big|.
That is, \|v\|_0 is the number of non-zero entries of v. Ideally, one would like to solve (2.23) with F(c) = \|c\|_0. However, solving this combinatorial optimization problem directly is infeasible even for problems of modest size.
Following the remarkable success of compressive sensing (CS), the use of the ℓ1-norm (F(c) = \|c\|_1) has become very popular. A large number of results have been obtained for CS. We review some definitions and results relevant to our topic. Following standard notation,
we denote the sparsity by s.
Definition 2.3.1. For T ⊆ {1, . . . , p} and each number s, the s-restricted isometry constant of A is the smallest δ_s ∈ (0, 1) such that

(1 - \delta_s) \|x\|_2^2 \le \|A_T x\|_2^2 \le (1 + \delta_s) \|x\|_2^2 \qquad (2.24)

for all subsets T with |T| ≤ s and all x \in \mathbb{R}^{|T|}. The matrix A is said to satisfy the s-RIP (restricted isometry property) with δ_s.
Here A_T stands for the submatrix of A formed by the columns chosen from the index set T.
Definition 2.3.2. If s + s' ≤ n, the (s, s')-restricted orthogonality constant θ_{s,s'} of A is the smallest number such that

|\langle A x, A x' \rangle| \le \theta_{s,s'} \|x\|_2 \|x'\|_2, \qquad (2.25)

where x and x' are s-sparse and s'-sparse, respectively.
The following two properties shall be used in the later chapter. The first one relates the
constants δ_s and θ_{s,s'} ([23]),

\theta_{s,s'} \le \delta_{s+s'} \le \theta_{s,s'} + \max(\delta_s, \delta_{s'}). \qquad (2.26)
The second one ([18]) states that, for any a ≥ 1 and positive integers s, s' such that a s' is an integer,

\theta_{s, a s'} \le \sqrt{a}\, \theta_{s,s'}. \qquad (2.27)
Lemma 2.3.3 (Lemma 2.1, [21]). We have
0 0 |hAx, Ax i| ≤ δs+s0 kxk2kx k2 (2.28)
for all x, x0 supported on disjoint subsets T,T 0 ⊂ {1, . . . , m} with |T | ≤ s, |T 0| ≤ s0.
Proof of Lemma 2.3.3. The proof is readily followed by combining Definition 2.3.2 and
(2.26).
2.4 Orthogonal Polynomials
2 We seek to find an approximation fe ≈ f in a finite dimensional linear space. Let Π ⊂ Lω(D)
be the linear subspace with dim Π = p, and {φj(x), j = 1, . . . , p} be a basis. Note that
21 a set of basis determines only a single linear subspace. However a linear subspace can be spanned by various choices of basis, e.g., rotations.
One of the most widely used linear subspaces is polynomial space. Let Λ be an index
d set for multi-index α = (α1, . . . , αd) ∈ N0 with |α|1 = α1 + ··· + αd and |α|∞ = maxi |αi|. For each index set Λ, a polynomial space is defined
α α1 αd PΛ = {x = x1 ··· xd , α ∈ Λ}. (2.29)
There are two most popular polynomial spaces: the total degree polynomial space, and
d tensor product space. The total degree space Pn is the linear space of d-variate polynomials of degree up to n ≥ 1, defined as
d α Pn = span{x , |α|1 ≤ n}, (2.30) whose dimension is n + d (n + d)! dim d = = . (2.31) Pn d n!d! The tensor degree space of degree n is
TP α Pd,n = span{x , |α|∞ ≤ n}, (2.32)
whose dimension is
TP d dim Pd,n = (n + 1) .
Classical orthogonal polynomials, such as those of Legendre, Laguerre and Hermite, but
also discrete ones, due to Chebyshev, Krawtchouk and others, have found widespread use in
all areas of mathematics, science, engineering and computations. A comprehensive review
can be found in [91, 99]. In this dissertation, orthogonal polynomials will be used as our
primary basis functions in approximating f.
In principle, one can always orthogonalize any polynomial basis via certain orthogonal-
ization procedure, e.g., Gram-Schmidt. In order to define what we mean by orthogonality,
an inner product has to be introduced. As it can be seen in (2.1), for different choice of
(probability) measures ω (2.2), different inner products are defined. In univariate case, the
22 explicit form of the orthogonal polynomials is known for many measures ω. For exam- ple, uniform measure corresponds to Legendre polynomials, Gaussian measure to Hermite polynomials, etc. In multivariate case, one straightforward way to construct orthogonal polynomials is to tensorize univariate orthogonal polynomials.
d Let x = (x1, ··· , xd) ∈ D ⊂ R for d > 1. Let x ∈ R and {ψj(x)}j≥1 be univariate orthogonal polynomials with respect to one dimensional measure $(x)dx. That is,
Z ψi(x)ψj(x)$(x)dx = δij. R Remark 2.4.1. In principle, one can employ different univariate orthogonal polynomials
in different coordinate. However, for simplicity, we employ the same univariate orthogonal
polynomials in all coordinate.
Let consider a d-variate measure ω which is a product of univariate measure, i.e.,
dω = $(x)dx = $(x1) ··· $(xd)dx1 ··· dxd. (2.33)
Let Λ be a multi-index set and PΛ be its corresponding polynomial space, defined (2.29).
For each i = (i1, ··· , id) ∈ Λ, the multivariate orthogonal polynomials ψi(x) are constructed
by tensor products of their corresponding one dimensional polynomials ψii (xi), i.e.,
d Y ψi(x) = ψij (xj), i ∈ Λ. (2.34) j=1
For notational convenience, we will employ a linear ordering between the multi-index i and
the single index i. That means, for any 1 ≤ i ≤ |Λ|, there exists a unique multi-index i such
that
i ←→ i = (i1, . . . , id). (2.35)
Throughout this dissertation, we shall frequently interchange the use of the single index i
and the multi-index i, with the one-to-one correspondence (2.35) assumed. By the tensor
structure and univariate orthogonality, it can be readily checked that
Z ψi(x)ψj(x)dω = δij. D
23 For reference, an overview of the basic results on orthogonal polynomial of several variables can be found in [37].
2.5 Tensor Quadrature
The fundamental problem in numerical integration is to compute an approximate solution to a high-dimensional integral Z f(x)dx
2 to a given degree of accuracy. This is directly related to compute the best Lω projection coefficients cˆ (2.5).
First, we briefly review the well-known Gaussian quadrature in 1D as it is the most popular choice. Generally a quadrature rule is a set of pairs of points and its weights
{(xi, wi)}1≤i≤m such that
m X Z b Q(f) = wif(xi) ≈ f(x)dx. i=1 a Its accuracy is often measured by the exactness of polynomial degree. A quadrature rule has the exactness of degree upto n if
Z b f(x)dx = Q(f), a whenever f(x) is a polynomial of degree at most n, but
Z b f(x)dx 6= Q(f), a for some polynomial f(x) of degree n + 1.
For any distinct points {xj}1≤j≤m, one can always find its corresponding weights {wj}1≤j≤m such that the quadrature rule {(xj, wj)}1≤j≤m has the exactness of degree m−1. The weight can be found by solving a linear system of equations
T Aw = b, w = [w1, ··· , wm] ,
R i−1 i−1 where the i-th entry of b is bi = x dx and the (i, j)-th entry of A is aij = (xj) .
24 As points are distinct, A is invertible. Thus w = A−1b is the quadrature weights whose
exactness is m − 1.
In the Gaussian quadrature rule, the points are designed to maximize the exactness of
polynomial degree. Given m points, the maximum exactness is 2m − 1 and the Gaussian
quadrature achieves such maximum exactness. The Gaussian quadrature incorporates with
a family of orthogonal polynomials. Let ψj(x) be orthogonal polynomials of degree j and the orthogonality holds with respect to $(x)dx, i.e.,
Z ψi(x)ψj(x)$(x)dx = δij. (2.36)
Let {zj}1≤j≤m be zeros of ψm, i.e, ψm(zj) = 0 for all j, and let
m !−1 X 1 2 2 wj = ψk(zj) , γk = kψkkω. (2.37) γk k=1
Then {(zj, wj)}1≤j≤m is a Gaussian quadrature rule. Depending on a choice of orthogo- nal polynomial, a different Gaussian quadrature is obtained. For example, when Legendre
polynomials are used, this results in Gauss-Legendre quadrature. Similarly when Hermite
polynomials are used, this results in Gauss-Hermite quadrature. Details of Gauss quadra-
tures are well documented in the literature, see, for example, [91].
Based on 1D quadrature, we now construct the tensor quadrature. For each dimension (i) (i) i, i = 1, . . . , d, let (z` , w` ), ` = 1, . . . , m, be a set of one-dimensional quadrature points and weights such that m Z X (i) (i) w` f(z` ) ≈ f(xi)$i(xi)dxi `=1 for any integrable function f. We then proceed to tensorize the univariate quadrature rules.
Let
n (i) (i)o Θi,m = z1 , . . . , zm , i = 1, . . . , d, (2.38) be the one dimensional quadrature point set. The tensor product is taken to construct a d-dimensional point set
ΘM = Θ1,m ⊗ · · · ⊗ Θd,m. (2.39)
25 d Obviously, M = #ΘM = m . As before, an ordering scheme can be employed to order the points via a single index, i.e., for each j = 1,...,M,
z = z(1), . . . , z(d) , j ←→ j ∈ {1, . . . , m}d. (2.40) j j1 jd
That is, each single index j corresponds to a unique combination of j := (j1, . . . , jd) with
|j|∞ ≤ m. Each point has the scalar weight
w = w(1) × · · · × w(d), j = 1,...,M, j j1 jd
with the same j ↔ j correspondence. Then {(zj, wj)}1≤j≤M is a tensor quadrature. Note that when Gaussian quadrature is used for one-dimensional rule, the resulting tensor quadra-
ture is called the tensor Gauss quadrature.
2.6 Christoffel function and the pluripotential equilibrium
d 2 Here we briefly introduce some results from the pluripotential theory. Let Pn ⊂ Lω(D) d d+n be the total degree polynomial space of degree upto n. So that p = dim Pn = d . Let d {φj(x), j = 1, . . . , p} be a basis of Pn. The “diagnoal” of the reproducing kernel of Pn is defined by p X 2 Kn(x) := φi (x). (2.41) i=1
Note that Kn does not depend on the choice of the orthonormal basis. The (normalized) Christoffel function from the theory of orthogonal polynomials ([72]) is
p λn(x) := . (2.42) Kn(x)
Let define a sampling measure µD such that
dµD = v(z)dz (2.43) where v(z) is a sampling weight function. In bounded domains, this sampling measure is the equilibrium measure of the domain. In unbounded domains, this is a measure whose √ weight function is a scaled version of the w-weighted equilibrium measure. Please see
26 below for the details.
2.6.1 Pluripotential equilibrium measure on Bounded domains
Let D is a compact set of nonzero d-dimensional Lebesgue measure and nonempty interior,
R 2 and w is a continuous function on the interior of D such that 0 < D p (z)w(z)dz < ∞ for any nontrivial algebraic polynomial p(z). Then the pair (D, w) is called “admissible”.
Suppose w(z) is an admissible weight function on a compact domain D. It is known that ω(x) lim λn(x) = , a.e. x ∈ D, (2.44) n→∞ v(x) where v(x) is the Lebesgue weight function (a probability density) of the pluripotential
equilibrium measure of D ([9])
dµD(x) = v(x)dx.
In some specific domain D, the equilibrium measure µD is well knwon. For example, the equilibrium measure for D = [−1, 1] is the arcsine measure with “Chebyshev” density and
its explicit formula is 1 vD(z) = √ , z ∈ [−1, 1]. π 1 − z2 For d ≥ 2, the equilibrium measure for D = [−1, 1]d is the product measure of the univariate measure, i.e., dµ 1 v(z) = D = , z ∈ [−1, 1]d. (2.45) dz Qd q 2 j=1 π 1 − zj When D is the unit simplex, the equilibrium density is
d Y −1/2 vD(z) = C[(1 − kzk1) zj] j=1 where C is a normalization constant.
2.6.2 Pluripotential equilibrium measure on Unbounded domains
Suppose D is an origin-centered unbounded conic domain with nonzero d-dimensional
Lebesgue measure, and w = exp(−2Q(z)), with Q satisfying (i) lim|z|→∞ Q(z)/|z| > 0,
27 and (ii) there is a constant t ≥ 1 such that
Q(cz) = ctQ(z), ∀z ∈ D, c > 0. (2.46)
P d As special cases, the one-sided exponential weight w(z) = exp(− j zj) on D = [0, ∞) and P 2 d the Gaussian density function w(z) = exp(− j zj ) on D = R satisfy (2.46) with t = 1 and t = 2, respectively.
Let µD,Q be the weighted pluripotential equilibrium measure and denote the Lebesgue density of the corresponding weighted equilibrium measure as
dµD,Q(z) = vD,Q(z)dz,
(assuming such a density exists).
(n) n For each polynomial degree n ∈ N, let φi (x) be the family orthonormal under w , i.e.,
Z (n) (n) n φi (x)φj (x)w (x)dx = δij. (2.47) D d n √ 2 For the whole space R with w (x) = exp(−k nxk2), the corresponding Hermite poly- (n) d √ d n nomial satisfies hi (x) = n 4 hi( nx). For the tensor half space [0, ∞) with w (x) = (n) d exp(−knxk1), the corresponding Laguerre polynomial satisifies Li (x) = n 2 Li(nx). Thus, (n) d/2t 1/t we can write these as φi (x) = n φi(n x), where the homogeneity exponent t = 2 for the whole space and t = 1 for the tensor half space. Then the reproducing kernel associated with the total degree polynomial space of degree n with wn is defined by
(n) X (n) 2 Kn = (φi (x)) . (2.48) 1≤i≤p
Then the following result is known.
d Theorem 2.6.1. [[9, 10, 8]] Let D ⊂ R be an unbounded convex cone with w. Then
1 n (n) dµD,Q lim w (x)K (x) = vD,Q(x) = . (2.49) n→∞ p n dx
Note that the weighted equilibrium measure µD,Q here has compact support even though √ D is unbounded. Thus the sampling weight function is a scaled version of the w-weighted
28 equilibrium measure
−d/t −1/t −d/t dµD.Q −1/t v(z) = n vD,Q(n z) = n (n z), n = max |α| dz α∈Λ
For convenience, let
dµD = v(z)dz be the sampling measure for unbounded domain. Since the measure µD,Q has compact support, we require scaling factor n1/t to effectively expand the support to contain the unbounded domain.
Explicit formulae for multivariate weighted equilibrium measures µD,Q on real-valued sets D are not yet fully known. For a few special cases, the forms of the measures were conjectured in [66]. More specifically,
d d • Tensor “half space” D = [0, ∞) ⊂ R with
d X p ω(z) = exp − zj ,Q(z) = − log w(z). j=1
The equilibrium measure is v u Pd d dµD,Q u(4 − j=1 zj) v (z) = = Ct , (2.50) D,Q dz Qd j=1 zj
where C is a normalization constant. The measure µD is supported only on the set 4T d with
d d d X T = z ∈ R |zj ≥ 0, kzk1 = |zj| ≤ 1 . (2.51) j=1
d • Whole space D = R with
w(z) = exp(−|z|2),Q(z) = − log pw(z).
The equilibrium measure is
dµ d/2 v (z) = D,Q = C 2 − |z|2 , (2.52) D,Q dz
29 where C is a normalization constant. The measure µD is supported only on the set √ 2Bd, where Bd is the d-dimensional unit ball.
2.7 Uncertainty Quantification
Most of the contents in this chapter can be found in [99]. Let consider a system of partial
` differential equations (PDEs) defined in a spatial domain D ⊂ R , ` = 1, 2, 3, and a time domain [0,T ] with T > 0, ∂ u(x, t, Z) = L(u; Z),D × (0,T ] × Ω, ∂t B(u; Z) = 0, ∂D × [0,T ] × Ω, (2.53) u = u0,D × {t = 0} × Ω,
where L is a (nonlinear) differential operator, B is the boundary condition operator, u0 is
d the initial condition, and Z = (Z1, ··· ,Zd) ∈ Ω ⊂ R , d ≥ 1 is the set of independent random variables. The solution is then a random quantity,
dim u u(x, t, Z): D¯ × [0,T ] × Ω 7→ R , (2.54)
where dim u ≥ 1 is the dimension of u. For ease of presentation, we shall set dim u = 1.
The fundamental assumption here is that (2.53) is a well-posed system almost surely.
That is, for each realization of the random variables Z, the problem of (2.53) is well posed
in the deterministic sense, almost surely.
p Let ω be the probability measure for Z, {φi(Z)}i=1 be a set of basis for the random p variable Z and Π be the linear space of Z spanned by {φi(Z)}i=1. The key idea is to
construct an approximation vp ∈ Π which has the form of a linear combination of the basis functions of Π, i.e., p X vp(x, t, Z) = ci(x, t)φi(Z). (2.55) i=1 2 The best Lω approximation is defined by
2 PΠu(x, t, Z) = argmin E|g(x, t, Z) − u(x, t, Z)| (2.56) g∈Π
30 where the expectation E is taken over Z. Note that the following are equivalent Z 2 2 2 E|g(x, t, Z) − u(x, t, Z)| = kg(x, t, Z) − u(x, t, Z)kω = |g(x, t, z) − u(x, t, z)| dω(z). Ω (2.57)
2 When the basis are orthonormal, the best Lω approximation can be explicitly written as
p X PΠu(x, t, Z) = uˆi(x, t)φi(Z), uˆi(x, t) = E[u(x, t, Z)φi(Z)]. i=1
2 Though this is the optimal (in the Lω sense) approximation in Π, it is not of practical use since the projection requires knowledge of the unknown solution.
2.7.1 Stochastic Galerkin method
The stochastic Galerkin procedure is a straightforward extension of the classical Galerkin approach for deterministic equations. For any given x and t,(x and t are fixed), we seek an approximation vp ∈ Π, p X G vp(x, t, Z) = ci (x, t)φi(Z) i=1 G G G T whose coefficients c (x, t) = [c1 (x, t), ··· , cp (x, t)] satisfy for 1 ≤ i ≤ p, ∂ E vp(x, t, Z) − L(vp; Z) · φi(Z) = 0,D × (0,T ] × Ω, ∂t (2.58) E [B(vp; Z) · φi(Z)] = 0, ∂D × [0,T ] × Ω, E [(vp(x, 0,Z) − u(x, 0,Z)) · φi(Z)] = 0,D × {t = 0} × Ω,
That is, we seek a solution in Π such that the residue of (2.53) is orthogonal to the space
Π. Upon evaluating the expectations in (2.58), the dependence in Z disappears. The result is a system of (usually coupled) deterministic equations. The size of the system is p, the dimension of the approximation space Π. For a fixed (x, t) = (xfix, tfix), one can obtain a numerical solution vp(xfix, tfix,Z) only at (xfix, tfix) by solving (2.58) via a well-established deterministic code.
When the orthogonal polynomials are used as the basis and the distribution of the
31 random variable Z is the orthogonal probability measure, that is, Z ∼ ω and
Z φi(z)φj(z)dω(z) = δij, the method refers to the generalized polynomial chaos (gPC) Galerkin method. Further- more, if L is a linear differential operator, (2.58) becomes for 1 ≤ i ≤ p, ∂ G Pp h G i ci (x, t) − E L(cj (x, t); Z)φj(Z)φi(Z) = 0,D × (0,T ] × Ω, ∂t j=1 (2.59) E [B(vp; Z) · φi(Z)] = 0, ∂D × [0,T ] × Ω, G ci (x, 0) − E [u(x, 0,Z)φi(Z)] = 0,D × {t = 0} × Ω.
Typically, this results in a coupled system of equations.
2.7.2 Stochastic collocation
The collocation methods are a popular choice for complex systems where well-established deterministic codes exist. In deterministic numerical analysis, collocation methods are those that require the residue of the governing equations to be zero at discreet nodes in the computational domain. The nodes are called collocation points. The same definition can be extended to stochastic simulations.
Let IZ be the support of Z and consider the same problem of (2.53). For any given x and t, let p X νp(x, t, Z) = ci(x, t)φi(Z) i=1
be a numerical approximation of u. In general, νp(·, ·,Z) ≈ u(·, ·,Z) in a proper sense in
IZ , and the system (2.53) cannot be satisfied for all Z after substituting u for νp.
Let Θ = {z1, ··· , zm} ⊂ IZ be a set of m either prescribed nodes or random realizations of Z in the random space, where m ≥ 1 is the number of nodes. Then in the collocation
32 method, for all j = 1, ··· , m, we enforce (2.60) at the node zj by solving for u(x, t, zj), ∂ u(x, t, zj) = L(u; zj),D × (0,T ], ∂t B(u; zj) = 0, ∂D × [0,T ], (2.60) u = u0,D × {t = 0}.
Because the value of the random parameter Z is fixed, (2.60) is a deterministic problem.
Therefore, solving the system poses no difficulty provided one has a well-established deter-
(j) ministic algorithm. For fixed (x, t) = (xfix, tfix), let u = u(xfix, tfix, zj), for j = 1, ··· , m, be the solution of (2.60). We remark that the more complex problem of (2.60), the more expensive to obtain a sample solution of it. This results in a deterministic uncoupled system of equations. In other words, we have a problem of constructing an accurate approximation
(j) to u(xfix, tfix,Z) using its samples {u = u(xfix, tfix, zj)}1≤j≤m by determining the expansion
T coefficients c = [c1(xfix, tfix), ··· , cp(xfix, tfix)] . Let
A = (aij), aij = φj(zi), be the model matrix and
u = [u(1), ··· , u(m)]T ,
be the vector of solutions at (xfix, tfix, zi), j = 1, ··· , m. Then this can be compactly written as
Ac ≈ u. (2.61)
This is typically achieved by utilizing the multivariate approximation theory. The popular
approaches for stochastic collocation include sparse grids interpolation ([100, 39, 4, 41, 67]),
discrete expansion ([80, 97]), sparse `1-minimization approximation ([35, 102, 104]), etc. The least squares has also been explored ([30, 93, 2, 55]).
33 CHAPTER 3
OPTIMAL SAMPLING
In this chapter, a sampling methodology for least squares (LSQ) regression is introduced.
The strategy designs a quasi-optimal sample set in such a way that, for a given number of samples, it can deliver the regression result as close as possible to the result obtained by a
(much) larger set of candidate samples. The quasi-optimal set is determined by maximizing a quantity measuring the mutual column orthogonality and the determinant of the model matrix. This procedure is non-adaptive, in the sense that it does not depend on the sample data. This is useful in practice, as it allows one to determine, prior to the potentially expensive data collection procedure, where to sample the underlying system. We also introduce its efficient implementation via a greedy algorithm
We then apply the quasi-optimal sampling strategy on polynomial least squares. Two strategies are presented. One is a direct application for ordinary least squares. The other is a combination of quasi-optimal sampling and a method, termed Christoffel least squares
(CLS) [66]. Since the quasi-optimal set allows one to obtain almost optimal regression results under any affordable number of samples, the new strategies result in polynomial least squares methods with high accuracy and robust stability at almost minimal number of samples. Extensive numerical examples in Chapter 7.1 demonstrate that it notably outperforms than other standard choices of sample sets.
An immediate application of the quasi-optimal set is stochastic collocation for UQ, where data collection usually requires expensive PDE solutions. It is demonstrated that the quasi- optimal LSQ can deliver very competitive results, compared to gPC Galerkin method, and yet it remains fully non-intrusive and is much more flexible for practical computations.
34 3.1 Quasi-optimal subset selection
A general linear regression model can be written as
y = Aβ + ,
m m m×p where y ∈ R is the data vector, ∈ R is the error vector, A ∈ R is the model p matrix, and β ∈ R are the regression parameters. Here m stands for the number of data/observations/samples, p stands for the number of parameters or factors in the regres- sion model. In this chapter, we assume m ≥ p.
For a diagonal matrix W whose entries are positive, the well known weighted least
squares solution is
βˆ = (AT WA)−1AT W y,
which minimizes the squared sum of the residue
√ √ ˆ β = argmin W Aβ − W y . β∈Rp 2
For simplicity of presentation, we consider ordinary least squares (OLS), i.e., W is the identity matrix of size m × m. We remark that all results remain to be valid for the √ √ weighted least squares by substituting A → WA and y → W y.
Optimal subset
M×p M Let AM ∈ R and yM ∈ R with M 1, and
ˆ βM = argmin kAM β − yM k2 (3.1) β∈Rp be the least squares solution to an “ideal” system, in the sense that the system is sam- pled at a large number (M) of locations with data yM and can result in the “best” lin- ear regression approximation. Since data collection for many practical systems can be an resource-intensive procedure, requiring time consuming numerical simulations and/or ex- pensive experimentations, we recognize that the ideal system (3.1) is often unattainable due to the limited resource one possesses.
35 To circumvent the difficulty, we consider a sub-problem
βˆm = argmin kAmβ − ymk2, (3.2) β∈Rp where m < M (often m M) is the number of sample data one can afford to collect, given the resource, and Am is the sub-matrix of AM and ym is the sub-vector of yM . The performance of the sub-problem (3.2) obviously depends on the selection of the m samples.
Our goal is to determine the optimal m-point set, within the M-point candidate set, that delivers the best performance. This can be mathematically formulated as follow.
m×M Let Sm = S(1); ··· ; S(m) ∈ R be a row selection matrix, where each row S(i) =
[0, ··· , 0, 1, 0, ··· , 0] is a length-M vector with all zeros but a one at ki location, 1 ≤ ki ≤ M.
We further require that the locations of the entry 1 in each row are distinct, i.e., {ki}1≤i≤M
M×` are distinct. Then for any matrices C ∈ R with ` ≥ 1, SmC results in a m × `
submatrix consisting of the distinct k1, . . . , km rows of C. The sub-problem (3.2) can be written equivalently as
ˆ βm(Sm) = argmin kSmAM β − SmyM k2, (3.3) β∈Rp where the solution dependence on Sm is explicitly shown. We now define the optimal subset as follows.
ˆ Definition 3.1.1 (Optimal subset). Let βM be the OLS solution of the ideal system defined in (3.1), and βˆm be the OLS solution of the sub-problem (3.3), where 1 ≤ m < M. The m-point optimal subset is defined as the solution of
† ˆ ˆ Sm = argmin kβM − βm(Sm)k2 . (3.4) Sm
The optimal subset then delivers a OLS solution that is the closest to the OLS solution of the full problem. The definition is not highly useful, as its solution (if exists) obviously depends on the full data vector yM . In the remainder of this paper, we shall use Definition 3.1.1 to motivate our design of a quasi-optimal m-point subset that is non-adaptive, i.e., the resulting subset is independent of yM and can be determined prior to data collection.
36 Quasi-optimal point set
Here we present a quasi-optimal sample set. It is motivated by the definition of optimal subset in Definition 3.1.1. An important feature is that the quasi-optimal subset selection is non-adaptive, i.e., it is independent of the data yM .
Remark 3.1.2. Since the true optimality in Definition 3.1.1 can not be achieved without having the full data vector yM , this strategy is termed quasi-optimal.
Motivation
We start by estimating the difference between the least squares solutions of the sub-problem and the full problem.
Theorem 3.1.1. Let Sm be a m-row selection matrix with 1 ≤ m < M and
ˆ βm(Sm) = argmin kSmAM β − SmyM k2 , β (3.5) ˆ βM = argmin kAM β − yM k2 , β
be the least squares solutions of the m-row sub-problem and the M-row full problem, respec-
tively. Let AM = QM R be the QR factorization of AM . Then,
ˆ ˆ −1 kβM − βm(Sm)k ≤ kR k · kgk, (3.6)
where g is the least squares solution to
⊥ SmQM x = SmP yM (3.7)
satisfying
T T ⊥ ((SmQM ) SmQM )g = (SmQM ) SmP yM , (3.8)
⊥ T and P = (I − QM QM ) is projection operator to the orthogonal complement of the column
space of QM .
37 Proof. We note that SmAM = SmQM R. Then,
ˆ ˆ T −1 T T −1 T kβM − βmk = k(AM AM ) AM yM − ((SmAM ) SmAM ) (SmAM ) SmyM k
T −1 T T T T −1 T T = k(R R) R QM yM − (R (SmQM ) SmQM R) (R (SmQM ) )SmyM k
−1 T −1 T −1 T = kR QM yM − R ((SmQM ) SmQM ) (SmQM ) SmyM k
−1 T T −1 T ≤ kR kkQM yM − ((SmQM ) SmQM ) (SmQM ) SmyM k
= kR−1k · kgk,
where
T T −1 T g = −QM yM + ((SmQM ) SmQM ) (SmQM ) SmyM .
T Let PyM = QM QM yM be the orthogonal projection of the data vector yM onto the column
space of QM . Then yM can be decomposed as
⊥ T T yM = PyM + P yM = QM QM yM + (I − QM QM )yM .
We then have
T T −1 T T ⊥ g = −QM yM + ((SmQM ) SmQM ) (SmQM ) Sm(QM QM yM + P yM )
T −1 T ⊥ = ((SmQM ) SmQM ) (SmQM ) SmP yM .
And the proof is complete.
Note that if Sm is constructed in such a way that that SmQM retains the same or-
thogonality as QM , the right-hand-side of (3.8) will be zero and consequently g = 0. In
this case, the sub-problem solution βˆm becomes equivalent to the full size problem solution ˆ βM . Obviously, in general this will not be case for the reduced system with m < M and consequently g 6= 0. To achieve smaller kgk, we then seek to (a) maximize the column
orthogonality of SmQM so that the right-hand-side of (3.8) is minimized; and (b) maximize
the determinant of SmQM so that the non-zero solution of (3.8) is smaller. Note that both
condition (a) and (b) are irrespective of the data yM .
38 3.1.1 S-optimality
Based on the previous discussion, we now introduce S-optimality criterion. Let A be a m×p matrix, and A(i) its i-th column. We define
√ ! 1 det AT A p S(A) := . (3.9) Qp (i) i=1 A
Obviously, we assume A(i) 6= 0 for all columns i = 1, . . . , p. 1 | det A| p If A is a square matrix with m = p, then S(A) = Qp (i) ; and if m < p, S(A) = 0. i=1 kA k The following result also holds.
Theorem 3.1.2. For m × p matrix A with m ≥ p, S(A) = 1 if and only if all columns of
A are mutually orthogonal.
m×m Proof. Let A = QR be the QR factorization of A, where Q ∈ R is orthogonal and m×p R ∈ R is upper triangular r11 r12 ··· r1p 0 r22 ··· r2p . . . . . . .. . R = 0 0 ··· rpp . (3.10) 0 0 ··· 0 . . . . . . . . . . . . 0 0 ··· 0
T Qp 2 (i) 2 Pi 2 Then, det A A = i=1 rii, and kA k = j=1 rji, for 1 ≤ i ≤ p. (i) 2 2 Qp (i) 2 Qp 2 T (⇒) For each i, ||A || ≥ rii. Thus, i=1 kA k ≥ i=1 rii = det A A. Suppose √ Qp (i) T S(A) = 1, then i=1 kA k = det A A. Therefore, rji = 0 if 1 ≤ j < i for all 1 ≤ i ≤ p. This implies R is diagonal and consequently the columns of A are orthogonal.
(⇐) Now suppose the columns of A are mutually orthogonal. Then it can be written as
A = A(1) ··· A(p)
39 kA(1)k 0 ··· 0 0 kA(2)k · · · 0 = A(1) A(p) (1) ··· (p) . . . . kA k kA k . . .. . . . . 0 0 · · · kA(p)k
= QR,
√ T Qp (i) which is a natural QR decomposition. We then have det A A = i=1 kA k and S(A) = 1.
With this S-optimality criterion, we are in a position to define the quasi-optimal set.
Definition 3.1.3 (Quasi-optimal subset). Let AM and yM be the model matrix and data
over the M-point candidate sample set, as defined in (3.1), and AM = QM R be the QR factorization of AM . Let 1 ≤ m < M be a given integer. The quasi-optimal m-point subset
is defined by selecting the m rows out of AM such that
† Sm = argmax S(SmQM ). (3.11) Sm
If the columns of AM are mutually orthogonal, then the QR factorization is not necessary and
† Sm = argmax S(SmAM ). (3.12) Sm
Let Ik = {i1, . . . , ik} be a sequence consisting k unique indices, where 1 ≤ ij ≤ M for
all j = 1, . . . , k, and in 6= i` for any n 6= `. Also, let QIk = [Q(i1); ... ; Q(ik)] denote the
matrix constructed by the Ik rows out of the QM . Note the trivial case of QM = Q{1,...,M}. Then, the row selection problem (3.11) can be written equivalently as an index selection
problem
† Im = argmax S (QIm ) . (3.13) Im
We remark that (3.11) seeks to find the m-row submatrix SmQM with maximum S(SmQM ) value. From the definition of S in (3.9) and its property in Theorem 3.1.2, it is then clear
that maximizing this value leads to both larger determinant and better column orthogonal-
ity. This should lead to smaller difference between the sub-problem OLS solution and the 40 full-problem solution, via the bound in term of kgk (3.8) in Theorem 3.1.1.
We also emphasize that the maximization of S(SmQM ) does not necessarily imply the strict minimization of g in (3.8). Therefore, it is possible to define a different quasi-optimal subset using a different criterion to minimize g. Here we propose to use the S(SmQM ) value, as it is relatively appealing mathematically and fairly straightforward to compute
(see the next section).
3.2 Greedy algorithm and implementation
Selecting m rows out of M rows using (3.11) is an NP-hard optimization problem. To circumvent the difficulty, we propose to use a greedy algorithm, as commonly done in optimization problems. By using the index selection format (3.13), the greedy algorithm is as follows.
Given the current selection of k ≥ p rows, which results in a index sequence Ik = {i1, . . . , ik}, find the next row index such that ik+1 = argmax S QIk∪{i} , Ik+1 = Ik ∪ {ik+1}, (3.14) i∈IM \Ik
where IM = {1,...,M} is the set of all indices.
Initial index set
The greedy algorithm requires an initial choice of the first p indices to start, as since
S(A) = 0 for any m×p matrices A with m < p. In order to determine the first p indices, we
apply the above greedy algorithm to the column-truncated square version of the submatrices.
For any k < p, let QeIk be the k × k square matrix consisting the Ik rows and the first k
columns of QM .
For a given index Ik with 1 ≤ k < p, find ik+1 = argmax S QeIk∪{i} , Ik+1 = Ik ∪ {ik+1}. (3.15) i∈IM \Ik
This algorithm starts with an initial choice of index i1, which can be a random choice. It stops when k = p and resumes the standard greedy algorithm (3.14). 41 Evaluations of S(A)
The greedy algorithm (3.14), reps. (3.15), requires one to, at each kth step, search through the remaining candidate rows, construct a (k +1)×p matrix for each such row and compute its corresponding S value, and then find the row with the maximum S value. This can be an expensive procedure, as the S value requires computation of the determinant of every such matrices. Here we provide a procedure to avoid the computations of the determinants.
It is based on the following results.
Theorem 3.2.1. Let A be a m × p full rank matrix with m ≥ p and r be a length-p vector.
Consider the (m + 1) × p matrix Ae = [A; rT ]. Then, the following holds
1 + rT (AT A)−1r (S(A))2p = (det AT A) . (3.16) e Qp (i) 2 2 i=1(kA k + ri )
The proof can be found in Appendix A.1. This result indicates that, for any m × p
matrix A, the S value of the (m + 1) × p matrix with an arbitrary additional row rT
can be computed via (3.16). This indicates that in the greedy algorithm (3.14), to search
through the remaining rows in the candidate rows at the k-th step, one can find the row
that gives the maximum value of the right-hand-side of (3.16) without the (det AT A) term,
as it is a common factor for the current (k × p) matrix. This avoids the computation of the determinant altogether.
For the computation of the initial p rows in the greedy algorithm (3.15), we need to compute the S value of square matrices with one extra row (added from the remaining candidate rows) and one extra column (added from the remain columns). To avoid the explicit computation of the determinants, we introduce the following result.
Theorem 3.2.2. Let A be a m × m full rank square matrix, r and c be length-m vectors,
and γ be a real number. Consider the (m + 1) × (m + 1) matrix A c Ae = . rT γ
42 Then, the following holds
(1 + rT b) (cT c + γ2 − α) S(A)2(m+1) = det(AT A) · (3.17) e Qm (i) 2 2 T 2 i=1(kA k + ri ) c c + γ where
b = (AT A)−1r, g = (AT A)−1AT c, (3.18) brT α = (cT A + γrT ) I − (g + γb). 1 + rT b
The proof can be found in Appendix A.2.
3.2.1 Fast greedy algorithm without determinants
By using Theorem 3.2.1, the greedy algorithm (3.14) can be rewritten in the following equivalent form.
Given the current selection of k ≥ p rows, which results in a index sequence Ik = {i1, . . . , ik}, find the next row index such that
ik+1 = argmax δi, Ik+1 = Ik ∪ {ik+1}, (3.19) i∈IM \Ik
where 1 + Q (QT Q )−1QT (i) Ik Ik (i) δi = , (3.20) Qp (kQ(j)k2 + Q2 ) j=1 Ik i,j
where Q(i) is the i-th row of QM and Qi,j is its (i, j)-th entry.
Similarly, the greedy algorithm (3.15) to obtain the first p rows can be modified by using
Theorem 3.2.2.
For a given index Ik with 1 ≤ k < p, find ˆ ik+1 = argmax δi, Ik+1 = Ik ∪ {ik+1}, (3.21) i∈IM \Ik
where (k+1) 1 + Q β kQ k2 + Q2 − α e(i) i eIk i,k+1 i δˆi = · , (3.22) Qk (kQ(j)k2 + Q2 ) kQ(k+1)k2 + Q2 j=1 eIk e(i),j eIk i,k+1 and
β = (QT Q )−1QT , g = (QT Q )−1QT Q(k+1), i eIk eIk e(i) eIk eIk eIk eIk
43 ! βiQe(i) α = (Q(k+1)T Q + Q Q ) I − (g + Q β ). i eIk eIk i,k+1 e(i) i,k+1 i 1 + Qe(i)βi
3.3 Polynomial least squares via quasi-optimal subset
We now discuss the quasi-optimal subset in term of (orthogonal) polynomial least squares
regression, along with its special properties.
In polynomial regression, we are interested in approximating an unknown function f(x)
using its samples f(xi), i = 1, ··· , m. Let Π be a p-dimensional polynomial space and
{φj(x), 1 ≤ j ≤ p} be its polynomial basis. Given a set of points Θ = {xi}1≤i≤m, the model matrix A is constructed as follow
A = (aij), aij = φj(xi), 1 ≤ i ≤ m, 1 ≤ j ≤ p.
To emphasis its dependency on the point set Θ, we often denote the model matrix by A(Θ).
By employing least squares, one can construct an approximation
p X ∗ fe(x) = ci φi(x) ∈ Π j=1
∗ ∗ ∗ T where the coefficients c = [c1, ··· , cp] minimizes the squared sum of the errors
∗ T c = argmin kAc − yk2, y = [f(x1), ··· , f(xm)] . c∈Rp
Note that each row of A corresponds to a point and each column of A corresponds to a polynomial in the basis. Thus the quasi-optimal sample set in Definition 3.1.3 can be readily expressed in terms of the sample points in polynomial regression.
Let ΘM = {x1, . . . , xM } be the candidate sample point set, and Θm = {xi1 , . . . , xim } ⊂
ΘM be an arbitrary m-point subset with m ≥ p. Let A(Θm) = Q(Θm)R be the QR factorization of the model matrix A(Θm). Then, the quasi-optimal set, following Definition 3.1.3, is
† Θm = argmax S(Q(Θm)). (3.23) Θm⊂ΘM The greedy algorithm can also be readily constructed, similar to (3.14).
44 Given the current selection of k ≥ p samples, which result in the set Θk =
{xi1 , . . . , xik }, find the next sample point such that
xik+1 = argmax S(Q(Θk ∪ {x})), Θk+1 = Θk ∪ {xik+1 }. (3.24) x∈ΘM \Θk
Again, this greedy algorithm requires a choice of the first p samples, as S(Q(Θm)) = 0 whenever m < p. The greedy algorithm above can be adapted to this situation by truncating the p columns to its first m columns to form a m × m square sub-matrix Qe(Θm) and then applying (3.24), similar to the procedure in (3.15). Note in this situation, we require an ordering of the multi-variate polynomial basis in Π. Different ordering will lead to different choices of the first p samples.
3.4 Orthogonal polynomial least squares via quasi-optimal subset
2 Let Π be a linear subspace of Lω(D) and {φi(x)}1≤i≤p be its orthogonal basis which satisfies the mutual orthogonality
Z φi(x)φj(x)dω = δij, 1 ≤ i, j ≤ p. D
Remark 3.4.1. A popular choice of Π is the total degree polynomial space of degree upto
d d d+n n, Pn, defined in (2.30) whose dimension is p = dim Pn = d . Accordingly its orthogonal basis functions are orthogonal polynomials.
When orthogonal basis are used and the sample points are drawn from the underlying orthogonal probability measure ω in iid manners, the columns of the model matrix A are
approximately orthogonal. We then apply the model matrix A directly on the scheme,
avoiding the QR factorization.
Let A(Θm) be the model matrix constructed by a family of orthogonal polynomials and the iid samples from the underlying orthogonal probability measure. Then, the quasi-
optimal set, following Definition 3.1.3, is
† Θm = argmax S(A(Θm)). (3.25) Θm⊂ΘM
The greedy algorithm can also be readily constructed, similar to (3.24). 45 Given the current selection of k ≥ p samples, which result in the set Θk =
{xi1 , . . . , xik }, find the next sample point such that
xik+1 = argmax S(A(Θk ∪ {x})), Θk+1 = Θk ∪ {xik+1 }. (3.26) x∈ΘM \Θk
The first p samples can be chosen in a similar manner as in Chapter 3.3. That is, we apply the above greedy algorithm to the column-truncated square version of the submatrices.
sq For any k < p, let A (Θk) be the k ×k square matrix consisting the rows corresponding
Θk and the first k columns of AM = A(ΘM ). Then the first p points are selected as follow:
For a given set of k points Θk with 1 ≤ k < p, find
sq xk+1 = argmax S (A (Θk ∪ {x})) , Θk+1 = Θk ∪ {xk+1}. (3.27) x∈ΘM \Θk
This algorithm starts with an initial choice of point x1, which can be a random choice. It stops when k = p and resumes the standard greedy algorithm (3.26).
† † Let Sm be the row selection operator corresponding to the point set Θm (3.25). Then the quasi-optimal subset results in a small-size least squares problem
† † min kS AM c − S yM k2 (3.28) c m m
3.4.1 Asymptotic distribution of quasi-optimal points
d Let {φj(x)} be orthogonal polynomials in D ⊂ R satisfying Z φi(x)φj(x)dω = δij (3.29) D
d d Suppose the total degree polynomial space Pn is spanned by {φj(x)}1≤j≤p where p = dim Pn. Suppose also that the sample points are drawn from the underly orthogonal probability measure ω. In this setup, we present some unique properties regarding the asymptotic distribution of the quasi-optimal set. To this end, we focus on the empirical distribution of a finite point set. For any k-point set Θk = {x1, . . . , xk}, we define its empirical distribution as k 1 X ν (x) := δ (x). (3.30) k k xi i=1
46 Asymptotic distribution of determined systems in 1D
First, we examine the property of a set of points which maximizes S in the continuous setting and in bounded interval [−1, 1]. More specifically, we focus on the determined systems, i.e., the case of square model matrix A.
Let φj(x) be an orthogonal polynomial of degree j and (n − 1) be the highest degree of orthogonal polynomials. That is
Z φi(x)φj(x)dω = δij, 0 ≤ i, j < n. (3.31)
Then the columns of the model matrix consist of φ0(x), . . . , φn−1(x), with p = n. We seek
to find a n-points set Θn = {x1, . . . , xn} such that the S value of the square (n × n) model matrix A is maximized, i.e.,
S Θn = argmax S(A(Θn)). (3.32) Θn⊂[−1,1]
Note that since the matrix is square, we have
1 | det A| n S(A) = . (3.33) Qn (i) i=1 kA k
This resembles the so-called Fekete points, which seeks to find the n-point set that maximizes
the determinant of the Vandermonde-like matrix, which is indeed the square model matrix
in this case. That is,
F Θn = argmax | det(A(Θn))|. (3.34) Θn⊂[−1,1] The Fekete points are used to obtain stable polynomial interpolations (cf. [13]). It is well known that the empirical measure of the Fekete points converges to the equilibrium measure, which is the arcsin distribution in [−1, 1],
1 νF (x) = lim νF (x) = √ . (3.35) n→∞ n π 1 − x2
For notational convenience, let
n 2 X 2 kφk(Θn)k2 = φk(xi) i=1
47 where Θn = {xi}1≤i≤n and φk(x) is a basis function. The following theorem shows that the asymptotic distribution of the quai-optimal set from (3.32) is the same as that of the
Fekete point (3.34).
Theorem 3.4.1. Let the n-point set Θn be determined by (3.32), where the matrix A is constructed using orthonormal basis (3.29) of degree up to (n − 1) with measure µ(x) in x ∈ [−1, 1]. Assume the basis functions satisfy that kφkk∞ grows at most polynomially with
S F respect to the degree k and kφk(Θn)k2 are uniformly bounded from below for all n. Let Θn F be the Fekete points (3.34) and assume its absolute determinant | det A(Θn )| is uniformly bounded from below for all n. Then, the asymptotic distribution of the empirical measure
S νn of the quasi-optimal set Θn satisfies
S weak-* F ν (x) = lim νn(x) = ν (x) (3.36) n→∞
where νF is the equilibrium measure for the Fekete points (3.35).
The proof can be found in Appendix A.3. In Fig. 3.1, we show the convergence of
the empirical measures for a variety of orthogonal polynomials in [−1, 1]. These include
Legendre polynomial, Chebyshev polynomial, Jacobi polynomials with parameters (α, β) =
(1, 1) and (0, 2). The quasi-optimal points are computed using the greedy algorithm (3.26)
over a candidate set of 100, 000 uniformly distribution points in [−1, 1]. On the left of
Fig. 3.1, we observe that at polynomial order n = 30, the empirical distributions of the quasi-
optimal points from the four different orthogonal polynomials are all close to the asymptotic
distribution, as indicated by Theorem 3.4.1. On the right of Fig. 3.1, the maximum norm
differences between the empirical distributions and the asymptotic distribution are shown, at
increasing degree of polynomials. We observe clearly algebraic convergence of the empirical
measures towards the asymptotic distribution.
Asymptotic distribution of over-determined systems
We present the asymptotic behavior of the quasi-optimal subset for over-determined systems
m×p with model matrix A ∈ R , m > p. We set the approximation space to be the total degree
48 1 Asymptotic Dist 0.9 Legendre Chebyshev 0.8 Jacobi(1,1) Jacobi(0,2) 0.7
−1 0.6 10
0.5
0.4
0.3
0.2 Legendre Chebyshev 0.1 Jacobi(1,1) Jacobi(0,2) −2 0 10 −1 −0.5 0 0.5 1 5 10 15 20 25 30
Figure 3.1: Empirical distribution of the quasi-optimal set for determined system. Left: the empirical measures at order n = 30 for four different type of orthogonal polynomials; Right: Maximum norm differences between the empirical measures and the asymptotic distribution at increasing degrees of orthogonal polynomials.
d n+d polynomial space whose dimension is p = dim Pn = d for d ≥ 1. Let the orthonormal polynomials {φj(x)}1≤j≤p satisfying (3.29) be its basis. In order to discuss the asymptotic behaviors, we now extend the definition of S to matrices with infinite number of rows.
Definition 3.4.2. Let A∞ be a matrix of size ∞ × p and Sk be a k-row selection matrix, where k > p. Then ! S(A∞) = lim inf sup S(SkA∞) . (3.37) k→∞ Sk Theorem 3.4.2. Assume that for each degree n ≥ 1, there exists a candidate set of infinite
points Θ∞,n such that S(A(Θ∞,n)) = 1. Then there exists a sequence {Θk,n} where k > p = n+d d such that
lim S(A(Θk,n)) = 1 (3.38) k→∞
where Θk,n is the k-point quasi-optimal subset. Let νk,n be its empirical measure. Then, for
the sequence of measures νk,n, n ∈ N, there exists a convergent subsequence νki,nj such that
∗ ν (x) = lim lim νk ,n (x) = ω(x), (3.39) j→∞ i→∞ i j
where ω is the measure in the orthogonality (3.29).
The proof can be found in Appendix A.4. Numerical tests were conducted for four 49 different orthogonal polynomials, as shown in Fig. 3.2 for Legendre polynomial (left) and
Chebyshev polynomial (right), and in Fig. 3.3 for Jacobi polynomial with (α, β) = (1, 1)
(left) and (α, β) = (0, 2) (right). At order n = 30, it can be seen that the empirical mea- sures are close to the corresponding continuous measure µ from the respective orthogonality relations. The maximum norm differences between the empirical measures and the corre- sponding continuous measures ω are shown in Fig. 3.4, for increasing order of polynomials.
Algebraic convergence can be observed for all four different cases.
1 1 Empirical Dist Empirical Dist 0.9 Asymptotic Dist 0.9 Asymptotic Dist
0.8 0.8
0.7 0.7
0.6 0.6
0.5 0.5
0.4 0.4
0.3 0.3
0.2 0.2
0.1 0.1
0 0 −1 −0.5 0 0.5 1 −1 −0.5 0 0.5 1
Figure 3.2: Empirical distribution of the quasi-optimal set for over-determined system at polynomial order n = 30. Left: Legendre polynomial; Right: Chebyshev polynomial.
1 1 Empirical Dist Empirical Dist 0.9 Asymptotic Dist 0.9 Asymptotic Dist
0.8 0.8
0.7 0.7
0.6 0.6
0.5 0.5
0.4 0.4
0.3 0.3
0.2 0.2
0.1 0.1
0 0 −1 −0.5 0 0.5 1 −1 −0.5 0 0.5 1
Figure 3.3: Empirical distribution of the quasi-optimal set for over-determined system at polynomial order n = 30. Left: Jacobi polynomial with (α, β) = (1, 1); Right: Jacobi polynomial with (α, β) = (0, 2).
50
−1 10
−2 10 Legendre Chebyshev Jacobi(1,1) Jacobi(0,2)
5 10 15 20 25 30
Figure 3.4: Maximum norm difference between the empirical measures of the quasi-optimal sets and the measures in the orthogonality (3.29), for over-determined systems obtained by different orthogonal polynomials at increasing degrees.
3.5 Near optimal subset: quasi-optimal subset for Christoffel least squares
We now present another least squares strategy which combines the Christoffel least squares method (CLS) [66] and the quasi-optimal point selection method (S-optimality).
Let us first briefly review the CLS [66]. Let {φj(x)} be orthogonal polynomial satisfying Z φi(x)φj(x)dω = δij D
d d and Pn = span{φj(x), 1 ≤ j ≤ p} where p = dim Pn. The Christoffel least squares method is a weighted least squares whose weights come from the (normalized) Christoffel function
λn(x) (2.42) p λn(x) = Pp 2 . i=1 φi (x)
And more importantly, the samples are drawn from either the equilibrium measure µD (2.43) in iid manners (not from the underlying orthogonal probability measure ω). Let A be the model matrix constructed by sampling orthogonal polynomial basis {φj(x), 1 ≤ j ≤ p} at
T m iid samples Θm = {xi}1≤i≤m from µD. Let y = [f(x1), ··· , f(xm)] . Then CLS solves 51 the following weighted least squares problem (2.15)
√ √ min k W Ac − W yk2, c∈Rp
where
M W = diag(w1, . . . , wM ), wi = λn(xi), {xi}i=1 ∼ µD. (3.40)
In the setting of CLS, the quasi-optimal subset selection procedure can easily be applied.
Let ΘM = {x1, . . . , xM } be the candidate sample point set drawn from the equilibrium measure µD (2.43). And let Θm = {xi1 , . . . , xim } ⊂ ΘM be an arbitrary m-point subset with √ m ≥ p. Let WA(Θm) be the matrix constructed by a family of orthogonal polynomials, a set of samples Θm, and its corresponding Christoffel function weights. We then have a “near optimal” point set defined as follow:
√ S Θm = argmax S( WA(Θm)). (3.41) Θm⊂ΘM
Remark 3.5.1. We remark that by using the CLS method, the full size M-point model
matrix for the least squares problem achieves much better stability, as detailed by [66].
Once the much better conditioned model matrix is constructed by the CLS method, the S-
optimality strategy ([84]) allows one to determine the near optimal smaller subset with m points such that the m-point least squares solution is almost the closest to the (unavailable) full set CLS solution. It is in this sense that we term the current method “near optimal”, although true optimality can not be proved.
The greedy algorithm is readily constructed, similar to (3.26).
Given the current selection of k ≥ p samples, which result in the set Θk =
{xi1 , . . . , xik }, find the next sample point such that √ xik+1 = argmax S( WA(Θk ∪ {x})), Θk+1 = Θk ∪ {xik+1 }. (3.42) x∈ΘM \Θk
Again, since the quantity S is defined only for matrices of size k × p with k ≥ p, the greedy algorithm requires specification of the initial p points to start. Here we discuss two options to determine the initial p points. One option is to follow a similar manner as in
52 √ sq (3.27). That is, for any k < p, let WA (Θk) be the k × k square matrix consisting the √ √ rows corresponding Θk and the first k columns of WAM = WA(ΘM ).
For a given set of k points Θk with 1 ≤ k < p, find √ sq xk+1 = argmax S WA (Θk ∪ {x}) , Θk+1 = Θk ∪ {xk+1}. (3.43) x∈ΘM \Θk
This algorithm starts with an initial choice of point x1, which can be a random choice. It stops when k = p and resumes the standard greedy algorithm (3.42).
Another choice is to use the concept of Fekete points, which is used in polynomial inter- polation theory. The Fekete points seek to maximize the determinant of the Vandermonde- like matrix. Using the same principle, we can find the first p sample points Θp which √ sq maximize the determinant of the p × p square matrix WA (Θp), i.e.,
√ sq Θp = argmax | det WA (Θp)|. (3.44) Θp⊂ΘM
The motivation for using the Fekete points is that it is known that the asymptotic distribution of the Fekete points in bounded domain is the asymptotic measure [10]. This is the same property as the near optimal samples from Theorem 3.5.1 and 3.5.2. More importantly, the Fekete points can be easily computed using a greedy algorithm, resulting in the approximate Fekete points (AFP) [13]. The algorithm for computing the AFP is remarkably simple. This results in a much reduced computational effort for determine the initial p points. Our numerical examples demonstrate that the use of the AFP as the initial p points results in negligible difference in the final result, with the use of the square matrices approach slightly better.
S S Let Sm be the row selection operator corresponding to the point set Θm (3.41). Then the near-optimal subset results in a small-size least squares problem
√ √ S S min kS WAM c − S W yM k2 (3.45) c m m
53 3.5.1 Asymptotic distribution of near optimal points
We now discuss the asymptotic distributions of the near optimal samples. For a set of m
S points Θm selected by (3.41), its empirical distribution is
1 X ν (x) := δ (x). m m z S z∈Θm
We are interested in the asymptotic limit of νm as m → ∞. Let us restrict to the total degree polynomial space (2.30) and use the subscript n to explicitly denote the degree. Thus
d n+d pn = dim Pn = d . Let
p ψi,n(x) = λn(x)φi(x), i = 1, . . . , pn. (3.46)
where λn(x) is the (normalized) Christoffel function (2.42). They form a non-polynomial orthonormal basis satisfying
Z ψi,n(x)ψj,n(x)ven(x)dz = δij, ven(x) = ω(x)/λn(x). (3.47) D
For bounded domains, the following result holds.
Theorem 3.5.1 (Bounded). Suppose that for each n, there exists a candidate set of infinite √ points Θ∞,n with model matrix A∞,pn such that S( WA∞,pn )) = 1. Then there exists a
sequence {Θk,n}k>pn with empirical measure νk,n, which contains a subsequence νkl,n that ∗ converges to measure νn satisfying Z Z ∗ lim ψi,n(x)ψj,n(x)νkl,n(x)dx = ψi,n(x)ψj,n(x)νn(x)dx = δij. (3.48) l→∞ D D
∗ The sequence {νn} converges to v, i.e.,
dµ lim ν∗(x) = v(x) = D , a.e. (3.49) n→∞ n dx where dµD is the equilibrium measure of D.
Proof. From Theorem 3.4.2 and (3.48), we have
Z Z ∗ δij = ψi,n(x)ψj,n(x)νn(x)dx = φi(x)φj(x)ωn(x)dx, (3.50) D D 54 where
∗ ωn(x) = λn(x)νn(x). (3.51)
From (2.44) and the uniqueness of orthogonal polynomial probability measure,
∗ ω(x) ∗ ω(x) = lim ωn(x) = lim λn(x)ν = lim ν (x), a.e. (3.52) n→∞ n→∞ n v(x) n→∞
∗ ∗ Therefore, ν := limn→∞ νn = v.
Numerical tests were conducted to verify the theoretical estimates. Here we present the results obtained by Legendre polynomial, Chebyshev polynomial, and Jacobi polynomials with parameters (α, β) = (1, 1) and (0, 2). The near optimal points are computed using the greedy algorithm from [84], or Chapter 3.2.1, over a candidate set of 100,000 uniformly distribution points in [−1, 1]. On the left of Figure 3.5, we observe that at polynomial order n = 100, the empirical distributions of the points from the four different orthogonal polynomials are all close to the asymptotic distribution. The maximum norm difference between the empirical distributions and the asymptotic distribution are shown in Fig. 3.5, for increasing order of polynomials. Algebraic convergence can be observed for all four different cases.
1 Asymptotic Legendre 0.9 −1 Chebyshev 10 0.8 Jacobi(1,1) Jacobi(0,2) 0.7
0.6
0.5 −2 10 0.4
0.3
0.2 Legendre Chebyshev 0.1 Jacobi(1,1) Jacobi(0,2) −3 10 0 1 2 −1 −0.5 0 0.5 1 10 10
Figure 3.5: Empirical distribution of the near optimal set for over-determined system. Left: the empirical measures at order n = 100 for four different type of orthogonal polynomials; Right: Maximum norm differences between the empirical measures and the asymptotic distribution at increasing degrees of orthogonal polynomials.
55 The following theorem is about the asymptotic distribution in unbounded domains.
Theorem 3.5.2 (Unbounded). Suppose that for each n, there exists a candidate set of ¯ infinite points Θ∞,n ⊂ Sw such that the model matrix A∞,pn using the scaled point set √ 1/t ¯ n+d n Θ∞,n satisfies S( WA∞,pn )) = 1. Here pn = d is the cardinality of the polynomial 1/t ¯ space. Then there exists a sequence {n Θk,n}k>pn with empirical measure {νk,n}k>pn , ∗ which contains a convergent subsequence {νkl,n}l≥1 that converges to measure νn satisfying Z Z ∗ lim ψi,n(x)ψj,n(x)νkl,n(x)dx = ψi,n(x)ψj,n(x)νn(x)dx = δij. (3.53) l→∞ D D
∗ d/t ∗ 1/t ∗ Let µn(z) := n νn(n z). Then, the sequence {µn}n≥1 converges to the weighted equilib- rium measure dµD,Q, i.e.,
dµ lim µ∗ (z) = v(z) = D,Q , a.e. (3.54) n→∞ n dz
Proof. From the assumption that for each n, there exists a candidate set of infinite points √ Θ¯ ∞,n ⊂ Sw such that S( WA∞,n)) = 1. By Theorem 3.4.2, we have Z Z ∗ pn ∗ δij = ψi,n(x)ψj,n(x)νn(x)dx = φi(x)φj(x) νn(x)dx. D D Kn(x)
By setting w (x) := pn ν∗(x), we can write n Kn(x) n Z δij = φi(x)φj(x)wn(x)dx. D
It then follows from the uniqueness of the orthogonal probability measure that
pn ∗ lim νn(x) = lim wn(x) = w(x). (3.55) n→∞ Kn(x) n→∞
1/t ∗ Substituting the change of variable x = n z into the measure νn(x)dx produces
∗ d/t ∗ 1/t ∗ ∗ d/t ∗ 1/t νn(x)dx = n νn(n z)dz = µn(z)dz, where µn(z) := n νn(n z).
From Theorem 2.6.1, we have
K(n)(z)wn(z) dµ lim n = D,Q . (3.56) n→∞ pn dz
56 Therefore by combining (3.55) and (3.56), it follows
pn ∗ pn ∗ lim νn(x) = lim µn(z) = 1. n→∞ n→∞ (n) n Kn(x)w(x) Kn (z)w (z) This implies
dµ lim µ∗ (z) = D,Q . n→∞ n dz
3.6 Summary
Here we summarize least squares methods using orthogonal polynomials and the quasi- optimal sampling strategies. We again remark that the entire procedure for solving either
(3.25) or (3.41) is independent of the data vector yM . Upon conducting the iid samples from either the underlying orthogonal probability measure ω (3.29) or the equilibrium measure
µD (2.43), one solves either (3.25) or (3.41) to decide the m-point subset. It is only after † this step that one is required to collect the function data at the m select points (either Θm
S or Θm). Therefore, the problem of either (3.25) or (3.41) results in a “smaller” least squares problem of size m × p, and the full data vector yM is never required.
3.6.1 Quasi optimal sampling for ordinary least squares
The quasi optimal sampling strategy consists of the following steps.
• For a given bounded domain D with a measure ω, draw M 1 number of i.i.d.
random samples from the underlying orthogonal probability measure ω (2.2).
• Construct the model matrix AM .
• For any given number m < M, identify the quasi-optimal subset by applying the
S-optimality strategy (3.25) on the model matrix AM .
• Solve the ordinary least squares problem (3.28) but only to the subset problem with
m points.
57 3.6.2 Near optimal sampling for Christoffel least squares
The near optimal sampling strategy consists of the following steps:
• For a given domain D with a measure ω, draw M 1 number of iid random samples
from the equilibrium measure µD (2.43). √ • Construct the weighted model matrix WAM (3.40).
• For any given number m < M, identify the quasi-optimal subset by applying the √ S-optimality strategy (3.42) on the weighted model matrix WAM .
• Solve the weighted least squares problem (3.45) but only to the subset problem with
m points. √ It is worth noting that the construction of the weighted model matrix WAM can be easily accomplished by normalizing each row of the original unweighted model matrix AM √ and then rescaling it by a factor of p, i.e.,
√ A(i) A(i) → p , i = 1,...,M kA(i)k where A(i) is the i-th row of AM . This is because each row of the weighted model matrix √ WA is M √ √ p wiA(i) = p A(i), i = 1,...,M, Kn(xi) where v u p uX 2 p kA(i)k = t φk(xi) = Kn(xi). k=1
58 CHAPTER 4
ADVANCED SPARSE APPROXIMATION
In this chapter we present an optimization problem of minimizing `1−`2 under linear con- straints. That is,
min kxk1 − kxk2 subject to Ax = b x∈Rp where A is a matrix of size m × p, b is a vector of size m × 1, and m < p. The `1−`2 minimization is designed to find a spare solution in the feasible set {x | Ax = b}. This
problem was first proposed in [59] and then was studied in [105]. Here a further theoretical
study is discussed, which improves and extends the theory of `1−`2 minimization. Theo- retical estimates regarding its recoverability for sparse signals are improved. Also similar theoretical results are extended for non-sparse signals. This extension includes the analysis of (sparse) Legendre polynomial approximations via (weighted) `1−`2 minimization. We also remark that this sparse function approximation has an immediate application for stochastic collocation of UQ. Various numerical examples are presented in Chapter 7.2.
We observe that the `1−`2 minimization seems to produce results that are consistently better than the `1 minimization, although a rigorous proof of this is lacking.
4.1 Review on `1-`2 minimization
We consider the standard recovery problem of an underdetermined system $Ax = b$, where the measurement matrix $A \in \mathbb{R}^{m\times p}$ with $m < p$. In many practical problems, the problem is severely underdetermined with $m \ll p$.
The `1−`2 minimization method proposed in [59, 105] seeks to find a sparse solution to
the underdetermined system Ax = b by solving
$$\min_{x\in\mathbb{R}^p} \|x\|_1 - \|x\|_2 \quad \text{subject to} \quad Ax = b. \tag{4.1}$$
If the data b contain noise, the corresponding “de-noising” version is
$$\min_{x\in\mathbb{R}^p} \|x\|_1 - \|x\|_2 \quad \text{subject to} \quad \|Ax - b\|_2 \le \tau, \tag{4.2}$$
where $\tau$ is related to the noise level in $b$.
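Problems (4.1) and (4.2) are non-convex because of the $-\|x\|_2$ term. One common way to solve (4.1) in the literature is a difference-of-convex (DC) iteration that linearizes $-\|x\|_2$ at the current iterate; the sketch below illustrates this using the cvxpy package. It is only an illustrative solver under these assumptions, not necessarily the one used for the numerical results in Chapter 7.2.

```python
import numpy as np
import cvxpy as cp

def l1_minus_l2_dca(A, b, n_outer=10):
    """Difference-of-convex sketch for (4.1): at each outer step, minimize
    ||x||_1 - <x_prev/||x_prev||_2, x> subject to Ax = b, which is convex."""
    p = A.shape[1]
    x_prev = np.zeros(p)
    for _ in range(n_outer):
        nrm = np.linalg.norm(x_prev)
        g = x_prev / nrm if nrm > 0 else np.zeros(p)   # subgradient of ||.||_2 at x_prev
        x = cp.Variable(p)
        cp.Problem(cp.Minimize(cp.norm1(x) - g @ x), [A @ x == b]).solve()
        x_prev = x.value
    return x_prev
```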
Remark 4.1.1. The standard (and popular) approach to finding a sparse solution is compressive sensing (CS), cf. [24, 22, 34]. Although sparsity is naturally measured by the $\ell_0$ norm, the $\ell_1$ norm is widely adopted, for it leads to a constrained optimization problem
$$\min_{x\in\mathbb{R}^p} \|x\|_1 \quad \text{subject to} \quad Ax = b \tag{4.3}$$
that can be solved as a convex optimization problem. (The use of the $\ell_0$ norm leads to an NP-hard optimization problem [69].)
As discussed in [59, 105], the motivation for using $\ell_1-\ell_2$ is that the contours of $\|x\|_1 - \|x\|_2$ are closer to those of $\|x\|_0$, so the $\ell_1-\ell_2$ formulation should be more sparsity-promoting than the use of the $\ell_1$ norm. The recoverability of the $\ell_1-\ell_2$ minimization problem (4.1) has been studied in [59, 105] in the context of exact recovery, which assumes the true solution is $s$-sparse. The major results from [59, 105] are summarized as follows.
Let $\bar{x}$ be a vector with sparsity $s$ satisfying
$$a(s) = \left(\frac{\sqrt{3s}-1}{\sqrt{s}+1}\right)^2 > 1, \tag{4.4}$$
and suppose $A$ satisfies the condition
$$\delta_{3s} + a(s)\,\delta_{4s} < a(s) - 1. \tag{4.5}$$
Then, the following results hold ([105]),
• Let $b = A\bar{x}$, then $\bar{x}$ is the unique solution to (4.1).
• Let $b = A\bar{x} + e$, where $e \in \mathbb{R}^m$ is any perturbation with $\|e\|_2 \le \tau$, then the solution $x^{opt}$ to (4.2) satisfies $\|x^{opt} - \bar{x}\|_2 \le C_s\tau$, where
$$C_s = \frac{2\sqrt{1+a(s)}}{\sqrt{a(s)}\,(1-\delta_{4s}) - \sqrt{1+\delta_{3s}}} > 0. \tag{4.6}$$
4.2 Recovery properties of `1-`2 minimization
We present two sets of recovery results for the `1−`2 minimization. One for exact recovery of sparse functions, which improves the results in [105], and the other for recovery of non-sparse functions, which were not available before.
Exact recovery of sparse solutions
Theorem 4.2.1. Let $\bar{x}$ be a vector with sparsity $s$ satisfying
$$a(s) = \left(\frac{3s-1}{\sqrt{3s}+\sqrt{4s-1}}\right)^2 > 1, \tag{4.7}$$
and suppose $A$ satisfies the condition
$$\delta_{3s} + a(s)\,\delta_{4s} < a(s) - 1.$$
Then, the following results hold
• Let $b = A\bar{x}$, then $\bar{x}$ is the unique solution to (4.1).
• Let $b = A\bar{x} + e$, where $e \in \mathbb{R}^m$ is any perturbation with $\|e\|_2 \le \tau$, then the solution $x^{opt}$ to (4.2) satisfies $\|x^{opt} - \bar{x}\|_2 \le C_s\tau$, where
$$C_s = \frac{2\sqrt{3s} - \sqrt{s\,a(s)}}{\sqrt{a(s)}\,(1-\delta_{4s}) - \sqrt{1+\delta_{3s}}}.$$
The proof can be found in Appendix B.1. This result is very similar to that from [105], summarized in (4.4)–(4.6). Upon comparing the constant $a(s)$ in (4.7) against that in (4.4), the current result represents an improvement over that from [105]: the condition (4.7) is satisfied for $s > 2$, whereas the condition (4.4) is satisfied only for $s > 7$. Thus the new condition is only missing the case of $s = 2$. We now present another estimate for the exact recovery, using the more natural measure of $\delta_s$, instead of $\delta_{3s} + a(s)\delta_{4s}$.
Theorem 4.2.2. Let $\bar{x}$ be any vector with sparsity $s$ satisfying
$$\delta_s < \frac{1}{1+c(s)}, \tag{4.8}$$
where
$$c(s) := \sqrt{5} + \frac{9}{4\sqrt{5}\,s} + \frac{9}{2\sqrt{5}}\sqrt{\frac{1}{4s^2} + \frac{2}{s}}.$$
Then, the following results hold.
• Let $b = A\bar{x}$, then the solution $x^{opt}$ to (4.1) is exact, i.e., $x^{opt} = \bar{x}$.
• Let $b = A\bar{x} + e$, where $e \in \mathbb{R}^m$ is any perturbation with $\|e\|_2 \le \tau$, then the solution $x^{opt}$ to (4.2) satisfies
$$\|x^{opt} - \bar{x}\|_2 \le \widetilde{C}_s\,\tau, \tag{4.9}$$
where
$$\widetilde{C}_s = \frac{2\beta(s)\sqrt{1 + 1/(1+c(s))}}{1 - (1+c(s))\,\delta_s}, \qquad \beta(s) = \sqrt{\frac{1}{4s}} + \sqrt{\frac{1}{4s} + 2}. \tag{4.10}$$
The proof can be found in Appendix B.2. The derivation uses a different technique from that in Theorem 4.2.1 and in [105]. This allows us to extend the analysis to the study of recovery of non-sparse signals, as presented in the following section.
Recovery of non-sparse solutions
To quantify the recovery accuracy for a general non-sparse solution $\bar{x}$, it is customary to measure the solution against its "best" sparse version $\bar{x}_s$, which is the $s$-sparse vector containing only the $s$ largest entries (in absolute value) of $\bar{x}$.
Theorem 4.2.3. Let $\bar{x} \in \mathbb{R}^n$ be an arbitrary vector and $\bar{x}_s$ be the $s$-sparse vector containing only the $s$ largest entries (in absolute value) of $\bar{x}$. Let us assume
$$\delta_{2s} < \frac{1}{1 + \sqrt{2} + \frac{2\sqrt{2}}{\sqrt{s}-1}}. \tag{4.11}$$
Then, the following results hold.
• Let $b = A\bar{x}$, then the solution $x^{opt}$ to (4.1) satisfies
$$\|x^{opt} - \bar{x}\|_2 \le C_{2,s}\,\|\bar{x} - \bar{x}_s\|_1. \tag{4.12}$$
• Let $b = A\bar{x} + e$, where $e \in \mathbb{R}^m$ is any perturbation with $\|e\|_2 \le \tau$, then the solution $x^{opt}$ to (4.2) satisfies
$$\|x^{opt} - \bar{x}\|_2 \le C_{1,s}\,\tau + C_{2,s}\,\|\bar{x} - \bar{x}_s\|_1. \tag{4.13}$$
Here, the constants are
$$C_{1,s} = \frac{4\sqrt{s(1+\delta_{2s})}}{\sqrt{s} - 1 - \big((\sqrt{2}+1)\sqrt{s} + \sqrt{2} - 1\big)\delta_{2s}}, \qquad C_{2,s} = \frac{2 + (2\sqrt{2}-2)\,\delta_{2s}}{\sqrt{s} - 1 - \big((\sqrt{2}+1)\sqrt{s} + \sqrt{2} - 1\big)\delta_{2s}}. \tag{4.14}$$
The proof can be found in Appendix B.3. Note that by using the standard CS notation
for the error of the best $s$-term approximation of any vector $x \in \mathbb{R}^n$, i.e.,
$$\sigma_{s,p}(x) = \inf_{\|y\|_0 \le s} \|y - x\|_p, \tag{4.15}$$
the term $\|\bar{x} - \bar{x}_s\|_1$ in the theorem is $\sigma_{s,1}(\bar{x})$.
Remark 4.2.1. It must be pointed out that the estimates in Theorems 4.2.2 and 4.2.3 are derived by using the same approach developed in the standard CS literature. It is worthwhile to compare the current estimates for the $\ell_1-\ell_2$ minimization against the estimates for the standard CS using $\ell_1$ minimization. For the $\ell_1$ minimization, [17] derived an estimate of $\delta_s < \frac{1}{1+\sqrt{5}}$ for sparse solutions, and [21] derived an estimate of $\delta_{2s} < \sqrt{2} - 1$ for non-sparse solutions. The current corresponding estimates for the $\ell_1-\ell_2$ minimization are (4.8) and (4.11), respectively. It is evident that the estimates for the $\ell_1-\ell_2$ minimization are
worse than those of the $\ell_1$ minimization. The work of [59, 105] also arrived at similarly worse (compared to the $\ell_1$ minimization) estimates for $\ell_1-\ell_2$ minimization, even though our current estimates notably extend those in [59, 105]. On the other hand, the numerical
examples in this dissertation and in [59, 105] demonstrate that the $\ell_1-\ell_2$ minimization consistently outperforms the $\ell_1$ minimization for a variety of problems. This leads to the conviction that the current estimates for the $\ell_1-\ell_2$ minimization may be overly conservative. New estimates, perhaps not in the popular form of an RIP condition, may be required to fully understand the properties of the $\ell_1-\ell_2$ minimization. This shall be pursued in a separate study and reported later.
4.3 Function approximation by Legendre polynomials via `1-`2
Let $\mathbb{P}_n^d$ be the total degree polynomial space of degree up to $n$ (2.30), whose dimension is
$$p = \dim(\mathbb{P}_n^d) = \binom{d+n}{d}. \tag{4.16}$$
Let $\{\phi_j(z)\}$ be an orthogonal polynomial system, i.e.,
$$\int \phi_i(z)\,\phi_j(z)\,d\omega(z) = \delta_{ij}, \tag{4.17}$$
and let $\{\phi_j(z)\}_{1\le j\le p}$ be a basis for $\mathbb{P}_n^d$. For an unknown function $f(z) \in L^2_\omega$, we seek to construct an approximation $f_n(z) \in \mathbb{P}_n^d$ of the form
$$f_n(z) = \sum_{j=1}^{p} c_j\,\phi_j(z) \tag{4.18}$$
using samples of $f$, $f(z_i)$, $i = 1,\dots,m$. By enforcing $f_n(z_i) = f(z_i)$ (interpolation) or $f_n(z_i) \approx f(z_i)$, we arrive at a system of equations for the expansion coefficients, $Ac = y$ or
$Ac \approx y$, where $c = (c_1,\dots,c_p)^T$ is the expansion coefficient vector, $y = (f(z_1),\dots,f(z_m))^T$ is the sample vector, and
$$A = (a_{ij})_{1\le i\le m,\,1\le j\le p}, \qquad a_{ij} = \phi_j(z_i), \tag{4.19}$$
is the Vandermonde-like matrix. Note that polynomial spaces other than $\mathbb{P}_n^d$ can be chosen, e.g., the tensor product space. Here we focus on $\mathbb{P}_n^d$ (2.30) because it is the most common choice.
To be consistent with the notations in the previous sections, hereafter we will use x to
denote the coefficient vector c and b the data vector y. The system Ac = y is then rewritten
as Ax = b.
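As an illustration, the Vandermonde-like matrix (4.19) can be assembled in the one-dimensional Legendre case as below (the dissertation uses the total-degree tensorized basis in $d$ dimensions; this univariate sketch only shows the structure). The factor $\sqrt{2j+1}$ normalizes the Legendre polynomials with respect to the uniform probability measure on $[-1,1]$.

```python
import numpy as np
from numpy.polynomial import legendre

def legendre_model_matrix(z, p):
    """1D sketch of (4.19): a_ij = phi_j(z_i), with phi_j = sqrt(2j+1) P_j the
    Legendre polynomials orthonormal w.r.t. the uniform measure dz/2 on [-1,1]."""
    A = np.zeros((len(z), p))
    for j in range(p):
        coeffs = np.zeros(j + 1)
        coeffs[j] = 1.0                       # selects P_j
        A[:, j] = np.sqrt(2 * j + 1) * legendre.legval(z, coeffs)
    return A
```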
The orthogonal polynomial system {φj(z)} is called a bounded orthonormal system if it is uniformly bounded, for some K ≥ 1,
$$\sup_{j} \|\phi_j\|_\infty \le K. \tag{4.20}$$
We now recall the following result for bounded orthonormal systems.
Theorem 4.3.1 ([78, 79]). Let $A \in \mathbb{R}^{m\times p}$ be the Vandermonde-like matrix (4.19), whose entries $\phi_j(z_i)$ are from a bounded orthonormal system (4.17) satisfying (4.20), and the points $z_i$, $i = 1,\dots,m$, are i.i.d. random samples from the measure $\mu$. Assume
$$m \ge C\,\delta^{-2} K^2\, s\,\log^3(s)\,\log(n). \tag{4.21}$$
Then with probability at least $1 - n^{-\gamma\log^3(s)}$, the RIC $\delta_s$ of $\frac{1}{\sqrt{m}}A$ satisfies $\delta_s \le \delta$. Here $C, \gamma > 0$ are universal constants.
Using this property of the Vandermonde-like matrix A, we now present the recoverability result of the `1−`2 minimization for Legendre approximation, i.e., {φj(z)} are Legendre polynomials.
Exact recovery of sparse functions
We first study the recovery properties for sparse target functions. Let us assume the target function $f(z)$ consists of $s$ non-zero terms in its Legendre expansion. That is,
$$f(z) = \sum_{j\in T} \bar{x}_j\,\phi_j(z), \qquad |T| = s. \tag{4.22}$$
Let $f_n \in \mathbb{P}_n^d$ be a Legendre approximation of $f$ in the form of (4.18). We then have
$$\|f(z) - f_n(z)\|_\omega = \|\bar{x} - x\|_2, \qquad x = [x_1,\dots,x_p]^T, \tag{4.23}$$
owing to the orthonormality of the basis functions. Note that k · kω is the norm defined in
(2.1). Here it is assumed that $T \subseteq T_n$, the multi-index set for the polynomial space $\mathbb{P}_n^d$. We now consider the practical case of high dimensionality, where $d \ge n$. Note that for even moderately high dimensions $d > 1$, one often cannot afford high-order polynomial expansions, and $d \ge n$ is often the case.
Theorem 4.3.2. Let A be the Vandermonde-like matrix constructed using the Legendre
polynomials in $\mathbb{P}_n^d$ and $m$ sampling points $\{z_i\}_{1\le i\le m}$ from the uniform measure. Suppose $d \ge n$ and
$$m \ge C\,\delta^{-2}\,3^n\, s\,\log^3(s)\,\log(p), \tag{4.24}$$
where
$$\delta < \frac{1}{1+c(s)}, \qquad c(s) = \sqrt{5} + \frac{9}{4\sqrt{5}\,s} + \frac{9}{2\sqrt{5}}\sqrt{\frac{1}{4s^2} + \frac{2}{s}},$$
and $C > 0$ is universal. Let the target function be constructed as (4.22), with the $s$-sparse
coefficient vector $\bar{x}$ satisfying $T \subseteq T_n$. Then, with probability exceeding $1 - p^{-\gamma\log^3(s)}$, the following results hold.
• Let $b = A\bar{x}$, then the solution $x^{opt}$ to (4.1) is exact, i.e., $x^{opt} = \bar{x}$.
• Let $b = A\bar{x} + e$, where $e \in \mathbb{R}^m$ is any perturbation with $\|e\|_2 \le \tau$, then the solution $x^{opt}$ to (4.2) satisfies
$$\|x^{opt} - \bar{x}\|_2 \le \frac{\widetilde{C}_s}{\sqrt{m}}\,\tau, \tag{4.25}$$
where $\widetilde{C}_s$ is from (4.10).
Proof. It is sufficient to derive (4.25), which immediately delivers the noiseless case by setting $\tau = 0$. Since $m \ge C\delta^{-2}3^n s\log^3(s)\log(p)$, and $A$ is constructed using the Legendre polynomials and sample points from the uniform measure, then, with probability at least $1 - p^{-\gamma\log^3(s)}$, the restricted isometry constant $\delta_s$ of $\frac{1}{\sqrt{m}}A$ satisfies $\delta_s \le \delta < \frac{1}{1+c(s)}$. The constraint condition can be equivalently written as
$$\left\|\frac{1}{\sqrt{m}}Ax - \frac{1}{\sqrt{m}}b\right\|_2 \le \frac{1}{\sqrt{m}}\tau$$
without changing the solution. The proof then follows immediately from Theorem 4.2.2.
Recovery of non-sparse polynomial functions
For a general unknown function $f_{\rm full}$, let $f$ be its best approximation in the polynomial space $\mathbb{P}_n^d$ (2.30). Typically, $f_{\rm full} \in L^2_\omega$ and $f = Pf_{\rm full}$ is the orthogonal projection of $f_{\rm full}$ onto $\mathbb{P}_n^d$. Note that if $f_{\rm full} \in \mathbb{P}_n^d$, then $f_{\rm full} = f$. In general, the best approximation $f$ is not sparse. That is, we have
$$f(z) = \sum_{j\in T_n} \bar{x}_j\,\phi_j(z), \qquad |T_n| = p = \dim \mathbb{P}_n^d, \tag{4.26}$$
where $T_n$ is the index set for $\mathbb{P}_n^d$. We present the reconstruction results for $f$ using the $\ell_1-\ell_2$
minimization, in the context of Legendre polynomials. Let $\bar{x}_s$ be the "best" $s$-sparse version of $\bar{x}$. A sparse approximation is constructed by the $\ell_1-\ell_2$ minimization problem, (4.1) or (4.2), and the quality of the approximation is measured by comparing the reconstructed coefficient vector $x$ to $\bar{x}_s$.
Theorem 4.3.3. Let A be the Vandermonde-like matrix constructed using the Legendre
polynomials in $\mathbb{P}_n^d$ and $m$ sampling points $\{z_i\}_{1\le i\le m}$ from the uniform measure. Suppose $d \ge n$ and
$$m \ge C\,\delta^{-2}\,3^n\, s\,\log^3(2s)\,\log(p), \tag{4.27}$$
where
$$\delta < \frac{1}{1 + \sqrt{2} + \frac{2\sqrt{2}}{\sqrt{s}-1}},$$
and $C > 0$ is universal. Let the target function be a Legendre series (4.26) with an arbitrary coefficient vector $\bar{x} \in \mathbb{R}^p$, $p = \dim\mathbb{P}_n^d$, and let $\bar{x}_s$ be its truncated vector corresponding to the $s$ largest entries (in absolute value). Then, with probability exceeding $1 - p^{-\gamma\log^3(2s)}$, the
following results hold.
• Let $b = A\bar{x}$, then the solution $x^{opt}$ to (4.1) satisfies
$$\|x^{opt} - \bar{x}\|_2 \le C_{2,s}\,\|\bar{x} - \bar{x}_s\|_1. \tag{4.28}$$
• Let $b = A\bar{x} + e$, where $e \in \mathbb{R}^m$ is any perturbation with $\|e\|_2 \le \tau$, then the solution $x^{opt}$ to (4.2) satisfies
$$\|x^{opt} - \bar{x}\|_2 \le \frac{C_{1,s}}{\sqrt{m}}\,\tau + C_{2,s}\,\|\bar{x} - \bar{x}_s\|_1. \tag{4.29}$$
Here the constants C1,s and C2,s are from (4.14).
Proof. It is sufficient to derive (4.29), as (4.28) can be obtained by setting τ = 0. Since
$m \ge C\delta^{-2}3^n s\log^3(2s)\log(p)$, and $A$ is constructed using the Legendre polynomials and sample points from the uniform measure, then, with probability at least $1 - p^{-\gamma\log^3(2s)}$, the restricted isometry constant $\delta_{2s}$ of $\frac{1}{\sqrt{m}}A$ satisfies $\delta_{2s} \le \delta < \frac{1}{1+\sqrt{2}+\frac{2\sqrt{2}}{\sqrt{s}-1}}$. The constraint condition can be equivalently written as
$$\left\|\frac{1}{\sqrt{m}}Ax - \frac{1}{\sqrt{m}}b\right\|_2 \le \frac{1}{\sqrt{m}}\tau.$$
The conclusion then follows from Theorem 4.2.3.
4.4 Function approximation by Legendre polynomials via a weighted `1-`2
We consider a weighted version of the `1−`2 minimization. The counterparts to the noiseless
$\ell_1-\ell_2$ minimization (4.1) and the de-noising $\ell_1-\ell_2$ minimization (4.2) are,
$$\min_{x\in\mathbb{R}^p} \|x\|_1 - \|x\|_2 \quad \text{subject to} \quad WAx = Wb, \tag{4.30}$$
$$\min_{x\in\mathbb{R}^p} \|x\|_1 - \|x\|_2 \quad \text{subject to} \quad \|WAx - Wb\|_2 \le \tau, \tag{4.31}$$
respectively. Here
$$W_{i,j} = (\pi/2)^{d/2}\prod_{\ell=1}^{d}\big(1 - z_{i,\ell}^2\big)^{1/4}\,\delta_{ij} \tag{4.32}$$
is the tensored Chebyshev weight (with $z_{i,\ell}$ the $\ell$-th coordinate of the sample $z_i$), and $\tau$ is related to the noise level in $Wb$.
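A short sketch of applying the preconditioner (4.32), assuming the samples are stored as an $(m, d)$ array `Z` and the model matrix `A` has already been built:

```python
import numpy as np

def chebyshev_weights(Z):
    """Diagonal of W in (4.32): w_i = (pi/2)^(d/2) * prod_l (1 - z_{i,l}^2)^(1/4)."""
    d = Z.shape[1]
    return (np.pi / 2) ** (d / 2) * np.prod((1.0 - Z**2) ** 0.25, axis=1)

# weighted system for (4.30)-(4.31):
# w = chebyshev_weights(Z); WA = w[:, None] * A; Wb = w * b
```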
Exact recovery of sparse functions via a weighted `1-`2
We first consider the recovery of sparse target functions, i.e., f has the form of (4.22).
Theorem 4.4.1. Let A be the Vandermonde-like matrix constructed using the Legendre
polynomials in $\mathbb{P}_n^d$, $d \ge 1$, and $m$ sampling points $\{z_i\}_{1\le i\le m}$ from the Chebyshev measure. Suppose
$$m \ge C\,\delta^{-2}\,2^d\, s\,\log^3(s)\,\log(p), \tag{4.33}$$
where
$$\delta < \frac{1}{1+c(s)}, \qquad c(s) := \sqrt{5} + \frac{9}{4\sqrt{5}\,s} + \frac{9}{2\sqrt{5}}\sqrt{\frac{1}{4s^2} + \frac{2}{s}},$$
and $C > 0$ is universal. Let the target function be a Legendre series (4.22) with an $s$-sparse coefficient vector $\bar{x}$ satisfying $T \subseteq T_n$. Then, with probability exceeding $1 - p^{-\gamma\log^3(s)}$, the following results hold.
• Let $Wb = WA\bar{x}$, then the solution $x^{opt}$ to (4.30) is exact.
• Let $Wb = WA\bar{x} + e$, where $e \in \mathbb{R}^m$ is any perturbation with $\|e\|_2 \le \tau$, then the solution $x^{opt}$ to (4.31) satisfies
$$\|x^{opt} - \bar{x}\|_2 \le \frac{\widetilde{C}_s}{\sqrt{m}}\,\tau, \tag{4.34}$$
where $\widetilde{C}_s$ is from (4.10).
Proof. Since $m \ge C\delta^{-2}2^d s\log^3(s)\log(p)$ and $A$ is constructed using the Legendre polynomials and sample points from the Chebyshev measure, then, with probability at least $1 - p^{-\gamma\log^3(s)}$, the restricted isometry constant $\delta_s$ of $\frac{1}{\sqrt{m}}WA$ satisfies $\delta_s \le \delta < \frac{1}{1+c(s)}$. The constraint condition can be equivalently written as
$$\left\|\frac{1}{\sqrt{m}}WAx - \frac{1}{\sqrt{m}}Wb\right\|_2 \le \frac{1}{\sqrt{m}}\tau.$$
The conclusion then follows from Theorem 4.2.2.
Recovery of non-sparse functions via a weighted `1-`2
We present the recovery properties for general non-sparse target functions (4.26).
Theorem 4.4.2. Let A be the Vandermonde-like matrix constructed using the Legendre
polynomials in $\mathbb{P}_n^d$, $d \ge 1$, and $m$ sampling points $\{z_i\}_{1\le i\le m}$ from the Chebyshev measure. Suppose
$$m \ge C\,\delta^{-2}\,2^d\, s\,\log^3(2s)\,\log(p), \tag{4.35}$$
where
$$\delta < \frac{1}{1 + \sqrt{2} + \frac{2\sqrt{2}}{\sqrt{s}-1}},$$
and $C > 0$ is universal. Let the target function be a Legendre series (4.26) with an arbitrary coefficient vector $\bar{x} \in \mathbb{R}^p$, $p = \dim\mathbb{P}_n^d$, and let $\bar{x}_s$ be its truncated vector corresponding to the $s$ largest entries (in absolute value). Then, with probability exceeding $1 - p^{-\gamma\log^3(2s)}$, the
following results hold.
• Let $Wb = WA\bar{x}$, then the solution $x^{opt}$ to (4.30) satisfies
$$\|x^{opt} - \bar{x}\|_2 \le C_{2,s}\,\|\bar{x} - \bar{x}_s\|_1. \tag{4.36}$$
• Let $Wb = WA\bar{x} + e$, where $e \in \mathbb{R}^m$ is any perturbation with $\|e\|_2 \le \tau$, then the solution $x^{opt}$ to (4.31) satisfies
$$\|x^{opt} - \bar{x}\|_2 \le \frac{C_{1,s}}{\sqrt{m}}\,\tau + C_{2,s}\,\|\bar{x} - \bar{x}_s\|_1. \tag{4.37}$$
Here the constants C1,s and C2,s are from (4.14).
Proof. Since $m \ge C\delta^{-2}2^d s\log^3(2s)\log(p)$ and $A$ is constructed using the Legendre polynomials and sample points from the Chebyshev measure, then, with probability at least $1 - p^{-\gamma\log^3(2s)}$, the restricted isometry constant $\delta_{2s}$ of $\frac{1}{\sqrt{m}}WA$ satisfies $\delta_{2s} \le \delta < \frac{1}{1+\sqrt{2}+\frac{2\sqrt{2}}{\sqrt{s}-1}}$. The constraint condition can be equivalently written as
$$\left\|\frac{1}{\sqrt{m}}WAx - \frac{1}{\sqrt{m}}Wb\right\|_2 \le \frac{1}{\sqrt{m}}\tau.$$
The conclusion then follows from Theorem 4.2.3.
4.5 Summary
In this chapter we study the `1−`2 minimization methods for sparse approximations of functions. In particular, we focus on the use of Legendre polynomials, which are widely used in UQ computations. We derive the recoverability properties of the `1−`2 minimization, for both the direct minimization and the Chebyshev preconditioned minimization.
The theoretical estimates indicate that in low dimensions the Chebyshev preconditioned
`1−`2 minimization should be preferred, whereas in high dimensions the straightforward unweighted `1−`2 minimization should be preferred. Numerical examples are provided in
Chapter 7.2 and verify these findings. The `1−`2 minimization methods seem to produce better results than the `1 minimization method for a variety of test problems. Although a theoretical proof is still lacking, the `1−`2 minimization nevertheless can serve as a viable alternative for high-dimensional function approximations.
CHAPTER 5
SEQUENTIAL APPROXIMATION
We introduce a sequential function approximation method. For an unknown target function f(x), the method sequentially updates an approximation using only a single random sample of f at a time. Unlike traditional function approximation approaches, this method never forms or stores a matrix in any sense. This results in a simple numerical implementation using only vector operations and avoids the need to store the entire data set. The sequential method is thus particularly suitable when the data set is exceedingly large.
The sequential method is motivated by the randomized Kaczmarz (RK) method [88].
The RK method is a randomized iterative algorithm for solving (overdetermined) linear systems of equations. In this chapter, we present a sequential function approximation method in high dimensions. Convergence analysis is presented both in expectation and in the almost sure sense. The analysis establishes the optimal sampling probability measure, which results in the optimal rate of convergence. Upon introducing the general sequential method, an application to the tensor Gauss quadrature grid is discussed. This results in a very efficient method that takes advantage of the tensor structure of the grids.
Various numerical examples (up to hundreds of dimensions) are provided in Chapter 7.3 to verify theoretical analysis and demonstrate the effectiveness of the method.
5.1 Randomized Kaczmarz algorithm
Let us start by briefly reviewing the Randomized Kaczmarz (RK) algorithm ([88]). It is an iterative method to solve an overdetermined linear system of equations, $Ax = b$, where $A$ is a full-rank matrix of size $m \times p$ with $m > p$. For an arbitrary initial approximation $x^{(0)}$ to the solution, the $(k+1)$-th iteration is computed by
$$x^{(k+1)} = x^{(k)} + \frac{b_{i_k} - \langle x^{(k)}, A_{i_k}\rangle}{\|A_{i_k}\|_2^2}\,A_{i_k}, \tag{5.1}$$
where $i_k$ is randomly chosen from $\{1,\dots,m\}$ with probabilities proportional to $\|A_{i_k}\|_2^2$. Error analysis has been studied for both consistent and inconsistent systems, in terms of the expectation, cf. [88, 70]. The expectation is understood as the conditional expectation over all the previous random choices of the row indices. Here we summarize some main results about the RK method.
• Consistent system ([88]). For a consistent system where x∗ is the solution to Ax∗ = b,
the k-th iterate of (5.1) satisfies
$$\mathbb{E}\|x^{(k)} - x^*\|_2 \le \left(1 - \kappa(A)^{-2}\right)^{k/2}\cdot\|x^{(0)} - x^*\|_2,$$
where $\kappa(A) = \|A\|_F\,\|A^{-1}\|$.
• Inconsistent system ([70]). For an inconsistent system, let x∗ satisfy Ax∗ = y and
b = y + e. Then the k-th iterate of (5.1) satisfies
$$\mathbb{E}\|x^{(k)} - x^*\|_2 \le \left(1 - \kappa(A)^{-2}\right)^{k/2}\cdot\|x^{(0)} - x^*\|_2 + \kappa(A)\,\gamma,$$
where $\gamma = \max_i \dfrac{|e_i|}{\|A_i\|_2}$.
Here the spectral and Frobenius norm of A are defined as, respectively,
$$\|A\| = \max_{\|x\|_2 = 1}\|Ax\|_2 \qquad\text{and}\qquad \|A\|_F = \sqrt{\sum_{i,j} a_{ij}^2}.$$
From the fact that $\kappa(A) \ge \sqrt{p}$, it is easy to see that the optimal convergence rate of the RK method is $(1 - 1/p)$.
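A compact NumPy sketch of the RK iteration (5.1), with rows sampled proportionally to their squared norms as in [88]:

```python
import numpy as np

def randomized_kaczmarz(A, b, n_iter, x0=None, rng=None):
    """RK iteration (5.1): pick row i_k with probability ||A_i||_2^2 / ||A||_F^2,
    then project the current iterate onto the hyperplane <A_i, x> = b_i."""
    rng = np.random.default_rng(rng)
    m, p = A.shape
    row_norms2 = np.sum(A**2, axis=1)
    probs = row_norms2 / row_norms2.sum()
    x = np.zeros(p) if x0 is None else x0.copy()
    for _ in range(n_iter):
        i = rng.choice(m, p=probs)
        x += (b[i] - A[i] @ x) / row_norms2[i] * A[i]
    return x
```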
5.2 Sequential function approximation
The setup of this chapter is mostly introduced in Chapter 2.1. For readers who are not familiar with this setup, we suggest reviewing Chapter 2.1.
Let $f$ be an unknown target function in $L^2_\omega(D)$ (2.1), where $D$ is a domain in $\mathbb{R}^d$, $d \ge 1$. Let $\Pi$ be a $p$-dimensional linear subspace of $L^2_\omega(D)$ in which the approximation is sought, and let $\{\phi_j(x)\}_{1\le j\le p}$ be its basis. The goal is to construct an approximation $\widetilde{f}$ in the $p$-dimensional linear space $\Pi$,
$$\widetilde{f}(x) = \sum_{j=1}^{p} c_j\,\phi_j(x), \qquad c = [c_1,\dots,c_p]^T,$$
by appropriately determining its expansion coefficients $c$. Let $P_\Pi f$ be the best $L^2_\omega$ approximation of $f$ onto $\Pi$ defined in (2.11). This can be written as
$$P_\Pi f(x) = \sum_{j=1}^{p} \hat{c}_j\,\phi_j(x), \qquad \hat{c} = [\hat{c}_1,\dots,\hat{c}_p]^T,$$
where $\hat{c}$ is defined in (2.5). For notational convenience, we adopt the matrix-vector notation
$$\Phi(x) = [\phi_1(x),\dots,\phi_p(x)]^T.$$
Then $\widetilde{f}(x) = \langle \Phi(x), c\rangle$, where $\langle\cdot,\cdot\rangle$ is the standard vector inner product. In this chapter, we assume the basis functions satisfy
$$0 < \inf_{x\in D}\|\Phi(x)\|_2^2 \le \sup_{x\in D}\|\Phi(x)\|_2^2 < \infty. \tag{5.2}$$
Algorithm
First, we choose a sampling measure $\nu$ on $D \subset \mathbb{R}^d$, from which samples of $x$ are drawn. Note that the sampling measure $\nu$ is not necessarily the same as the measure $\omega$ (2.2) of the space $L^2_\omega$ in which the unknown target function $f$ lies. Also, $\nu$ could be a discrete probability measure.
Starting from an arbitrary initial choice of the coefficient vector $c^{(0)}$, we draw i.i.d. samples from the sampling probability measure $\nu$. Let $z_k \in D$, $k = 1,\dots$, be the $k$-th sample drawn from $\nu$ and $f(z_k)$ its function value. The proposed RK method then updates the coefficient vector in the following way, for $k = 1,\dots$,
$$c^{(k)} = c^{(k-1)} + \frac{f(z_k) - \langle \Phi(z_k), c^{(k-1)}\rangle}{\|\Phi(z_k)\|_2^2}\,\Phi(z_k), \qquad z_k \sim \nu, \tag{5.3}$$
where $\langle\cdot,\cdot\rangle$ is the vector inner product and $\|\cdot\|_2$ the vector 2-norm.
The key feature of the algorithm is that there is no need to explicitly construct the
“model matrix” in the linear system (2.12), which is used in almost all function regression methods. The algorithm incorporates data as they arrive and can be flexible in many practical situations. The computational cost at each iteration is O(p), which depends only
on the dimensionality of the linear space Π and not on the size of the data.
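A minimal sketch of one pass of the update (5.3); `basis(z)` is assumed to return the vector $\Phi(z)$ and `nu_sampler()` to draw a point $z \sim \nu$:

```python
import numpy as np

def sequential_step(c, z, f_value, basis):
    """One update of (5.3): c <- c + (f(z) - <Phi(z), c>) / ||Phi(z)||^2 * Phi(z)."""
    Phi = basis(z)
    return c + (f_value - Phi @ c) / (Phi @ Phi) * Phi

# usage sketch, starting from c = 0:
#   for k in range(n_iter):
#       z = nu_sampler()          # draw z_k ~ nu
#       c = sequential_step(c, z, f(z), basis)
```

Each step costs O(p) and touches only vectors, consistent with the discussion above.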
5.3 Convergence and error analysis
Convergence in Expectation
For any probability measure ν on D (possibly either absolutely continuous or discrete) satisfying
$$\|f\|_\nu < \infty, \qquad |\langle \phi_i, \phi_j\rangle_\nu| < \infty,$$
we define another measure $\tilde{\nu}$ associated with $\nu$ (not necessarily a probability measure) by
$$d\tilde{\nu}(x) = \frac{\tilde{p}}{\sum_{j=1}^{p}\phi_j^2(x)}\,d\nu(x), \tag{5.4}$$
where
$$\tilde{p} := \sum_{j=1}^{p}\|\phi_j\|_\omega^2. \tag{5.5}$$
Note that when the basis is orthonormal, i.e., $\|\phi_j\|_\omega = 1$ for all $j = 1,\dots,p$, we have $\tilde{p} = p$.
Let $\langle\cdot,\cdot\rangle_{\tilde{\nu}}$ be its corresponding inner product and
$$\Sigma = (\sigma_{ij})_{1\le i,j\le p}, \qquad \sigma_{ij} = \langle \phi_i, \phi_j\rangle_{\tilde{\nu}}, \tag{5.6}$$
be the covariance matrix of the basis $\{\phi_j(x), j = 1,\dots,p\}$ under this new measure. Obviously, $\Sigma$ is symmetric and positive definite. It possesses the eigenvalue decomposition
$$\Sigma = Q^T\Lambda Q, \tag{5.7}$$
where Q is orthogonal and Λ = diag(λ1, . . . , λp) with
$$\lambda_{\max}(\Sigma) = \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p = \lambda_{\min}(\Sigma) > 0. \tag{5.8}$$
Let us now consider the best $L^2_\omega$ approximation $P_\Pi f$ (2.11). Let
$$e = [\epsilon_1,\dots,\epsilon_p]^T, \qquad \epsilon_j = \langle f - P_\Pi f, \phi_j\rangle_{\tilde{\nu}}, \quad 1 \le j \le p, \tag{5.9}$$
be the projection of its error onto the basis using the new measure $\tilde{\nu}$. Obviously, if $\tilde{\nu} = \omega$, then $e = 0$.
We are now in a position to present the convergence of the RK algorithm (5.3), in terms of its coefficient vector compared to the best approximation coefficient vector $\hat{c}$ in (2.11), in expectation. At the $k$-th iteration, the expectation $\mathbb{E}$ is understood as the expectation over the random samples $\{z_j\}_{1\le j\le k}$ of the algorithm. For simplicity of the exposition and without loss of generality, we assume $c^{(0)} = 0$.
Theorem 5.3.1. Let $\hat{c}$ be the coefficient vector of the $L^2_\omega$-projection (2.11) and $\nu$ be the sampling measure from which the i.i.d. samples are drawn. Let $\tilde{\nu}$ be defined in (5.4) and
Σ in (5.6) with its eigenvalues labeled as in (5.8). Then, the k-th iterative solution of the
algorithm (5.3) satisfies
$$F_\ell + (r_\ell)^k\big(\|\hat{c}\|_2^2 - F_\ell\big) + \epsilon^{(k)} \;\le\; \mathbb{E}\|c^{(k)} - \hat{c}\|_2^2 \;\le\; F_u + (r_u)^k\big(\|\hat{c}\|_2^2 - F_u\big) + \epsilon^{(k)}, \tag{5.10}$$
where
$$r_u = 1 - \lambda_{\min}(\Sigma)/\tilde{p}, \qquad F_u = \frac{\|f\|_{\tilde{\nu}}^2}{\lambda_{\min}(\Sigma)} - \|\hat{c}\|_2^2,$$
$$r_\ell = 1 - \lambda_{\max}(\Sigma)/\tilde{p}, \qquad F_\ell = \frac{\|f\|_{\tilde{\nu}}^2}{\lambda_{\max}(\Sigma)} - \|\hat{c}\|_2^2, \tag{5.11}$$
and
$$\epsilon^{(k)} = -\frac{2}{\tilde{p}}\,\hat{c}^T Q^T D^{(k)} Q\, e, \tag{5.12}$$
with
$$D^{(k)} = \mathrm{diag}\big[d_1^{(k)},\dots,d_p^{(k)}\big], \qquad d_j^{(k)} = \frac{1 - (1 - \lambda_j/\tilde{p})^k}{\lambda_j/\tilde{p}}, \quad 1 \le j \le p.$$
In the limit of $k \to \infty$,
$$F_\ell - 2\hat{c}^T\Sigma^{-1}e \;\le\; \lim_{k\to\infty}\mathbb{E}\|c^{(k)} - \hat{c}\|_2^2 \;\le\; F_u - 2\hat{c}^T\Sigma^{-1}e. \tag{5.13}$$
The proof can be found in Appendix C.1.
The convergence bounds can be notably tighter when one chooses the sampling measure
ν in the following special form.
Corollary 5.3.1. Let the sampling probability measure ν be
$$d\nu(x) = \frac{\sum_{j=1}^{p}\phi_j^2(x)}{\tilde{p}}\,d\omega(x). \tag{5.14}$$
Then, the k-th iterative solution of (5.3) satisfies
$$F_\ell + (r_\ell)^k\big(\|\hat{c}\|_2^2 - F_\ell\big) \;\le\; \mathbb{E}\|c^{(k)} - \hat{c}\|_2^2 \;\le\; F_u + (r_u)^k\big(\|\hat{c}\|_2^2 - F_u\big). \tag{5.15}$$
Furthermore, if the basis {φj(x), j = 1, . . . , p} is orthonormal with respect to the measure ω, i.e., Φ(x) = Ψ(x) in (2.7), then the following equality holds,
$$\mathbb{E}\|c^{(k)} - \hat{c}\|_2^2 = \|f - P_\Pi f\|_\omega^2 + r^k\big(\|\hat{c}\|_2^2 - \|f - P_\Pi f\|_\omega^2\big), \qquad r = 1 - \frac{1}{p}. \tag{5.16}$$
Proof. The choice of $\nu$ in the form of (5.14) results in $d\tilde{\nu} = d\omega$. We immediately have $\Sigma = TT^T$ and $\epsilon^{(k)} = 0$ in (5.12), because $e = 0$ in (5.9). The proof then follows naturally.
Error bounds in expectation
The convergence results in terms of the coefficient vector readily give us error bounds for the resulting function approximation $\widetilde{f}$. Let
$$\widetilde{f}^{(k)}(x) = \sum_{j=1}^{p} c_j^{(k)}\phi_j(x) = \langle \Phi(x), c^{(k)}\rangle \tag{5.17}$$
be the approximation constructed by the coefficient vector obtained at the kth step of (5.3).
Then the following result holds.
Theorem 5.3.2. Under the same conditions as in Theorem 5.3.1, and denoting by $E = \|f - P_\Pi f\|_\omega$ the error of the orthogonal projection, the $k$-th step approximation (5.17) satisfies
$$E^2 + \lambda_{\min}(TT^T)\cdot\big(E_\ell^{(k)} + \epsilon^{(k)}\big) \;\le\; \mathbb{E}\|f - \widetilde{f}^{(k)}\|_\omega^2 \;\le\; E^2 + \lambda_{\max}(TT^T)\cdot\big(E_u^{(k)} + \epsilon^{(k)}\big), \tag{5.18}$$
where
$$E_\ell^{(k)} = F_\ell + (r_\ell)^k\big(\|\hat{c}\|_2^2 - F_\ell\big), \qquad E_u^{(k)} = F_u + (r_u)^k\big(\|\hat{c}\|_2^2 - F_u\big), \tag{5.19}$$
with $F_\ell, F_u, r_\ell, r_u$ defined in (5.11), $\epsilon^{(k)}$ in (5.12), and $T$ the transformation matrix (2.7) between the bases.
Moreover, if the sampling probability measure ν takes the special form of (5.14), then
$$E^2 + \lambda_{\min}(\Sigma)\cdot E_\ell^{(k)} \;\le\; \mathbb{E}\|f - \widetilde{f}^{(k)}\|_\omega^2 \;\le\; E^2 + \lambda_{\max}(\Sigma)\cdot E_u^{(k)}, \tag{5.20}$$
where $\Sigma$ is the covariance matrix defined in (5.6).
Furthermore, if the basis {φj(x), j = 1, . . . , p} is orthonormal with respect to the measure ω, i.e., Φ(x) = Ψ(x) in (2.7), then the following equality holds,
$$\mathbb{E}\|f - \widetilde{f}^{(k)}\|_\omega^2 = 2E^2 + r^k\big(\|P_\Pi f\|_\omega^2 - E^2\big), \qquad r = 1 - \frac{1}{p}. \tag{5.21}$$
Proof. The proof consists of straightforward derivations from Theorem 5.3.1 and Corollary
5.3.1.
The last equality (5.21) is achieved when one employs the special form of the sampling probability measure (5.14) and the orthonormal basis Ψ(x) with respect to the measure ω.
The subsequent rate of convergence is 1 − 1/p, which is the optimal rate of convergence
for RK methods. Hereafter, we will refer to the sampling measure (5.14) as the optimal sampling measure.
Almost sure convergence
We now provide almost sure convergence for a special case. In particular, we assume f ∈ Π.
Consequently, $E = \|f - P_\Pi f\|_\omega = 0$. We also employ the orthonormal basis $\Psi(x)$ (2.8) with respect to the measure $\omega$ and the optimal sampling measure (5.14). It then follows immediately from (5.16) of Corollary 5.3.1 that
$$\lim_{k\to\infty}\mathbb{E}\|f - \widetilde{f}^{(k)}\|_\omega^2 = \lim_{k\to\infty}\mathbb{E}\|c^{(k)} - \hat{c}\|_2^2 = 0.$$
In fact, this holds true without the expectation.
Theorem 5.3.3. Assume f ∈ Π, and the orthonormal basis Ψ(x) (2.8) with respect to the measure ω and the optimal sampling measure (5.14) are used in (5.3). Then, the coefficient
vector satisfies
$$\lim_{k\to\infty}\|c^{(k)} - \hat{c}\|_2^2 = 0 \quad \text{almost surely}. \tag{5.22}$$
Consequently, the $k$-th step approximation $\widetilde{f}^{(k)}$ (5.17) satisfies
$$\lim_{k\to\infty}\|f - \widetilde{f}^{(k)}\|_\omega^2 = 0 \quad \text{almost surely}. \tag{5.23}$$
The proof can be found in Appendix C.2.
Remarks on the optimal sampling measure
The results in Corollary 5.3.1 and Theorem 5.3.2 state that the optimal convergence rate
$(1 - 1/p)$ can be achieved when one uses an orthonormal basis in the approximation and, more importantly, draws samples from the optimal sampling measure (5.14). In general this probability measure may not possess a simple analytical form. Consequently, samples cannot be drawn via a standard random number generator. One practical way is to employ methods such as Markov chain Monte Carlo (MCMC) via, for example, the Metropolis-Hastings algorithm [53, 52]. The literature on MCMC methods is abundant and will not be reviewed here.
We also remark that if one uses (normalized) orthogonal polynomials as the orthonormal basis Ψ(x) in the approximation, the corresponding optimal sampling measure has a strong
connection with the equilibrium measure. More specifically, let $\Psi(x) = [\psi_1(x),\dots,\psi_p(x)]^T$ be an orthonormal polynomial basis for the total degree polynomial space of degree $n$, $\mathbb{P}_n^d$ of (2.30), and denote the corresponding optimal sampling measure (5.14) by
$$d\mu_n(x) = \frac{\sum_{j=1}^{p}\psi_j^2(x)}{p}\,d\omega(x), \tag{5.24}$$
where the dependence on the degree n is explicitly shown. Then, the measure converges to
$$\lim_{n\to\infty} d\mu_n(x) = d\mu_D(x), \quad \text{almost everywhere } x \in D, \tag{5.25}$$
where µD is the pluripotential equilibrium measure of D ([9]). For a compact review, please see Chapter 2.6.
For the tensor-product bounded domain $D = [-1,1]^d$, the equilibrium measure is known to be the product of arcsine measures with "Chebyshev" density, i.e.,
$$d\mu_D(x) = \frac{1}{\prod_{j=1}^{d}\pi\sqrt{1 - x_j^2}}\,dx. \tag{5.26}$$
Drawing samples from this measure is straightforward (see the sketch below). Therefore, for orthogonal polynomial approximations of moderate to high degrees, it was proposed in [86] to draw samples from this equilibrium measure. This simplifies the sampling procedure and yet should deliver near optimal convergence. We remark that a similar practice has been discussed for least squares polynomial approximation in [52, 66].
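For instance, on $[-1,1]^d$ one can draw from (5.26) coordinate-wise via the inverse-CDF map $x = \cos(\pi u)$ with $u \sim \mathrm{Uniform}(0,1)$; a small sketch:

```python
import numpy as np

def sample_chebyshev_measure(n_samples, d, rng=None):
    """Draw samples from the product arcsine ("Chebyshev") measure (5.26) on [-1,1]^d:
    if U ~ Uniform(0,1), then cos(pi*U) has density 1/(pi*sqrt(1-x^2))."""
    rng = np.random.default_rng(rng)
    return np.cos(np.pi * rng.random((n_samples, d)))
```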
5.4 Randomized tensor quadrature approximation
We apply the sequential approximation method to the tensor Gauss quadrature grid. The tensor Gauss quadrature is reviewed in Chapter 2.5.
For each dimension $i$, $i = 1,\dots,d$, let $\{(z_j^{(i)}, w_j^{(i)})\}_{1\le j\le m}$ be a one-dimensional quadrature rule with exactness of degree at least $2n$ on $D_i \subset \mathbb{R}$, i.e.,
$$\sum_{j=1}^{m} w_j^{(i)}\, f(z_j^{(i)}) = \int_{D_i} f(x_i)\,\varpi_i(x_i)\,dx_i, \qquad \forall f \in \mathbb{P}_{2n}^1, \tag{5.27}$$
where $\mathbb{P}_{2n}^1$ is the polynomial space of degree up to $2n$. For many $D_i$, there are multiple choices that satisfy this requirement. For example, when $D_i$ is an interval, we have the Gauss-Lobatto rule, the Gauss-Radau rule, the Gauss-Legendre rule, etc., just to name a few.
Based on the one-dimensional quadratures, we construct the tensor quadrature rule as follows.
Let $D = D_1 \otimes \cdots \otimes D_d \subset \mathbb{R}^d$, $M = m^d$, and
$$\Theta_{i,m} = \{z_1^{(i)},\dots,z_m^{(i)}\}, \qquad W_{i,m} = \{w_1^{(i)},\dots,w_m^{(i)}\}, \qquad i = 1,\dots,d, \tag{5.28}$$
be the one-dimensional quadrature point set and weight set, respectively. Let
$$\Theta_M = \Theta_{1,m}\otimes\cdots\otimes\Theta_{d,m}, \qquad W_M = W_{1,m}\otimes\cdots\otimes W_{d,m}, \tag{5.29}$$
be the tensor product of all one-dimensional quadrature point sets and weight sets, respectively. It is
clear that $|\Theta_M| = |W_M| = m^d = M$. Let $[m] = \{1,\dots,m\}$ and $\mathbf{i} = (i_1,\dots,i_d) \in [m]^d$. By employing a linear ordering between the multi-index $\mathbf{i}$ and the single index $i$, i.e.,
$$z_i = \big(z_{i_1}^{(1)},\dots,z_{i_d}^{(d)}\big), \qquad w_i = w_{i_1}^{(1)}\cdots w_{i_d}^{(d)}, \qquad i \;\longleftrightarrow\; \mathbf{i} = (i_1,\dots,i_d) \in [m]^d,$$
the tensor quadrature rule $\{(z_i, w_i)\}_{1\le i\le M}$ is constructed. By the exactness of the one-dimensional quadratures, the tensor quadrature has the following exactness,
$$\sum_{j=1}^{M} w_j\, f(z_j) = \int_D f(x)\,\prod_{i=1}^{d}\varpi_i(x_i)\,dx, \qquad \forall f \in \mathbb{P}_{d,2n}^{TP}, \tag{5.30}$$
where
$$\mathbb{P}_{d,2n}^{TP} := \mathrm{span}\{x^\alpha = x_1^{\alpha_1}\cdots x_d^{\alpha_d},\ |\alpha|_\infty \le 2n\}$$
is the tensor product of the one-dimensional polynomial space $\mathbb{P}_{2n}^1$. Since $\mathbb{P}_{2n}^d \subseteq \mathbb{P}_{d,2n}^{TP}$ for any $d \ge 1$ and $n \ge 0$, the $2n$ polynomial exactness (5.30) obviously holds for all $f \in \mathbb{P}_{2n}^d$.
Let $\omega$ be the product of the univariate measures $\varpi_i$, $i = 1,\dots,d$. Let $\mathbb{P}_n^d$ be the total degree polynomial space (2.30) of degree up to $n$, whose dimension is $p = \dim\mathbb{P}_n^d = \binom{n+d}{d}$. Let $\{\psi_j(x)\}_{1\le j\le p}$ be an orthonormal basis for $\mathbb{P}_n^d$. For $f \in L^2_\omega$, we denote its orthogonal projection onto $\mathbb{P}_n^d$ by $P_\Pi f$. Again, our goal is to construct an approximation in $\mathbb{P}_n^d$ for the unknown target function $f(x) \in L^2_\omega(D)$.
With the tensor quadrature points defined, we proceed to apply the randomized sequential approximation method (5.3) with the following discrete sampling measure. For any tensor quadrature point $z_j \in \Theta_M$, let us define the sampling discrete probability measure
$$\mu_*(x) := \sum_{j=1}^{M} w_j\,\frac{\|\Psi(z_j)\|_2^2}{p}\,\delta(x - z_j), \tag{5.31}$$
which satisfies $\int d\mu_* = 1$ by the $2n$ polynomial exactness of the tensor quadrature; here $\delta(x)$ is the Dirac delta function.
Setting the initial choice $c^{(0)} = 0$, one then computes, for $k = 1,\dots$,
$$c^{(k)} = c^{(k-1)} + \frac{f(z_{j_k}) - \langle c^{(k-1)}, \Psi(z_{j_k})\rangle}{\|\Psi(z_{j_k})\|_2^2}\,\Psi(z_{j_k}), \qquad z_{j_k} \sim \mu_*, \tag{5.32}$$
which is a version of (5.3) using ν = µ∗ as a discrete sampling measure. Again, the
implementation of the algorithm is remarkably simple. One randomly draws a point from the tensor quadrature set $\Theta_M$ using the discrete probability (5.31), and then applies the iteration update (5.32), which requires only vector operations. The iteration continues until convergence is reached.
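A sketch of the randomized tensor quadrature iteration (5.31)–(5.32) is given below. For clarity it enumerates the $M$ quadrature points explicitly; in practice one would exploit the tensor structure of $\Theta_M$ to avoid touching all $M = m^d$ points, and `basis(z)` is assumed to return $\Psi(z)$.

```python
import numpy as np

def randomized_tensor_quadrature(f, points, weights, basis, n_iter, rng=None):
    """Iteration (5.32): draw z_j with probability w_j * ||Psi(z_j)||^2 / p from (5.31),
    then apply the vector update; points is (M, d), weights is (M,)."""
    rng = np.random.default_rng(rng)
    norms2 = np.array([basis(z) @ basis(z) for z in points])
    p = len(basis(points[0]))
    probs = weights * norms2 / p            # sums to 1 by the 2n-exactness of the rule
    probs = probs / probs.sum()             # guard against round-off
    c = np.zeros(p)
    for _ in range(n_iter):
        j = rng.choice(len(weights), p=probs)
        Psi_j = basis(points[j])
        c += (f(points[j]) - Psi_j @ c) / norms2[j] * Psi_j
    return c
```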
The theoretical convergence rate of the algorithm (5.32) readily follows from the general analysis presented in Theorem 5.3.1.
For convenience, we introduce the following tensor quadrature probability measure
$$\mu(x) := \sum_{j=1}^{M} w_j\,\delta(x - z_j).$$
Due to the exactness of the tensor quadrature, this probability measure defines an inner
product on $\mathbb{P}_n^d$, defined by
$$\langle g, h\rangle_w := \int g(x)\,h(x)\,d\mu(x)$$
for $g, h \in L^2_\omega(D)$. Let us denote its induced discrete norm by $\|\cdot\|_w$. Note that for $g \in \mathbb{P}_n^d$, $\|g\|_\omega = \|g\|_w$.
Theorem 5.4.1. Assume $c^{(0)} = 0$. The $k$-th iterative solution of the algorithm (5.32)
satisfies
$$\mathbb{E}\|c^{(k)} - \hat{c}\|_2^2 = \|f - P_\Pi f\|_w^2 + E + r^k\big(\|P_\Pi f\|_w^2 - \|f - P_\Pi f\|_w^2 - E\big), \tag{5.33}$$
where