
DEPARTMENT OF INFORMATICS

TECHNICAL UNIVERSITY OF MUNICH

Bachelor's Thesis in Informatics: Games Engineering

Orthogonal Matrix Decomposition for Adaptive Sparse Grid Density Estimation Methods

Orthogonale Matrix-Zerlegung für Adaptive Dünngitter-Dichteschätzungsverfahren

Author: Dmitrij Boschko
Supervisor: Prof. Dr. Hans-Joachim Bungartz
Advisors: Kilian Röhner, Michael Obersteiner
Submission Date: 15.09.2017

I confirm that this bachelor's thesis in informatics: games engineering is my own work and I have documented all sources and material used.

Munich, 15.09.2017, Dmitrij Boschko

Zusammenfassung

In this thesis, a new algorithm for adaptive sparse grid density estimation is presented and examined. In the context of an offline/online splitting, an orthogonal decomposition of the underlying system matrix and rank-one updates via the Sherman-Morrison formula are employed. This approach allows for regularization, followed by refinement and/or coarsening of the grid. The runtime of the current implementation is examined and compared with the Cholesky-decomposition-based method for adaptive sparse grid density estimation. With the new method, an improvement in solving the system of linear equations is achieved, while the speed of the other functionalities of the online phase is slightly inferior. Moreover, the new method can be further optimized through parallelization.

Abstract

In this thesis, a new algorithm for adaptive sparse grid density estimation is introduced and analyzed. In an offline/online splitting context, an orthogonal decomposition of the underlying system matrix and Sherman-Morrison rank-one updates are used. This allows for regularization, which can be followed by adaptivity. The current implementation's runtime is analyzed and evaluated against the Cholesky-decomposition-based algorithm for the same task. The new algorithm achieves faster solving times, while the other online subroutines stay in the same runtime complexity, though they are slightly slower. Furthermore, the new algorithm can be optimized through parallelization.

Contents

1 Introduction and Motivation

2 Theoretical Background
  2.1 Probability Density Estimation
  2.2 Sparse Grids
    2.2.1 Construction of Sparse Grids
    2.2.2 Spatial Adaptivity on Sparse Grids
  2.3 Sparse Grid Density Estimation
  2.4 Offline/Online Splitting
  2.5 Orthogonal Decomposition into Hessenberg Matrix
  2.6 Sherman-Morrison Formula

3 Algorithm
  3.1 Offline Phase
    3.1.1 Decomposing into Hessenberg Matrix
    3.1.2 Inverting Tridiagonal Matrix
  3.2 Online Phase
    3.2.1 Refining
    3.2.2 Coarsening
    3.2.3 Solving Resulting System

4 Implementation

5 Evaluation
  5.1 Offline Phase
  5.2 Online Phase
    5.2.1 Refining
    5.2.2 Coarsening
    5.2.3 Solving the System
  5.3 Example: Ripley Dataset

6 Possible Improvement

7 Conclusion

8 Appendix

Bibliography

1 Introduction and Motivation

Due to technological improvements in storage and computation power, it is possible to easily obtain and store large amounts of data. In many fields, this data can be used to extract important information, e.g. customer behavior, which then makes it possible to predict trends and optimize marketing strategies. Correctly retrieving information from data is, in general, a complicated process, but many techniques have been developed for it. Knowledge discovery in databases (KDD) often examines tasks similar to those of machine learning, and both fields use learning approaches to solve them. Unsupervised learning, supervised learning and reinforcement learning are the standard learning tasks today and can be studied more extensively in [HTF09]. These methods can be used to solve problems such as image recognition, stock market prediction, medical diagnosis, playing complicated games and many more. This thesis mainly covers unsupervised learning through a density estimation approach using sparse grids.

Probability density estimation is one of the most commonly used methods in unsupervised learning, but it can also serve as the basis for supervised learning methods for problems such as classification. Given a set of data samples, the goal of density estimation is to construct an approximation of the underlying, unknown probability distribution. One problem of this approach, and of learning in general, is overfitting. Overfitted functions match the given data samples so well that they tend to describe the random errors, i.e. noise, instead of the underlying relation of the data itself. This phenomenon arises in highly complex models, as the learning leads to over-optimized performance on the training data, thus ending up learning the data by heart instead of generalizing the underlying relations. There are many different ways to counter overfitting, such as cross-validation, early stopping, priors and regularization, see [HTF09].

To maintain an accurate representation of a learning problem, the number of considered features of the data has to be high enough. Since for every single one of $N$ data points each of $d$ features has to be considered, $Nd$ values have to be processed, which scales badly with high dimensions. One possible way to tackle the resulting exponentially growing complexity is to use sparse grids. The sparse grid approach reduces the complexity of the processed data from $O(2^{nd})$ to $O(2^n n^{d-1})$, where $d$ is the dimensionality and $n$ is a given discretization level with $N = 2^n$.


The obtained sparse grid structure is known a priori, but does not always match the underlying problem being modeled. It is often the case that data points are not contained in the a priori grid, or that after initializing the grid, new data becomes available and the grid has to be fitted to the new data. The process of adding and removing points without changing the dimension is called spatial adaptivity. Adaptivity strategies have to make sure that suitable points are chosen while not destroying the sparse grid structure. There are a few different strategies for choosing suitable points to adapt, see [Pfl10] and [Kre16]. Often, data is not available all at once but arrives during processing, e.g. in online data mining, where new data comes from a stream. One way to adapt to the new data without having to recompute everything from scratch is to split the procedure into an offline and an online phase. The offline phase does all the necessary preparations to assure fast processing of incoming data in the online phase. Given the problems mentioned, an adaptive sparse grid density estimation algorithm should be able to:

• compute a probability density function

• support efficient countermeasures against overfitting

• maintain a sparse grid structure

• do refinement and/or coarsening

• split the procedure into offline and fast online phases

• efficiently solve the resulting system of linear equations

An existing algorithm with the required functionality is the Cholesky-decomposition-based sparse grid density estimation approach studied in [Sie16], although it generally has a high runtime when the regularization parameter is changed. In this thesis, an adaptive sparse grid density estimation algorithm with offline/online splitting is introduced and analyzed. It is constructed to fulfill the criteria mentioned above. First, the necessary theoretical introduction to sparse grids is given, alongside the mathematical background for understanding each step of the algorithm (Chapter 2). Then, the provided theory is used to fully explain the procedure of the algorithm (Chapter 3). After that, the practical implementation is presented (Chapter 4), followed by its performance and accuracy evaluation (Chapter 5). In the end, some possible improvements to the algorithm's procedure and implementation are considered (Chapter 6).

2 Theoretical Background

This chapter gives a mathematical overview of the topics necessary for understanding the algorithm. Unless stated otherwise, vectors are denoted as lower-case letters with an arrow, $\vec{b}$, and their $i$-th component as $b_i$. Similarly, matrices are denoted as $A$ and an entry at the $i$-th row and the $j$-th column as $A_{ij}$. For $d$-dimensional multi-indices $\vec{\ell}$ and $\vec{k}$, the relational operator $\le$ is defined component-wise, i.e.

$$\vec{\ell} \le \vec{k} \;\Leftrightarrow\; \ell_i \le k_i, \quad 1 \le i \le d.$$

The $\ell_1$-norm and the maximum-norm are defined as

$$|\vec{k}|_1 := \sum_{i=1}^{d} k_i \quad \text{and} \quad |\vec{k}|_\infty := \max_{1 \le i \le d} |k_i|.$$

$\Omega$ always denotes $[0,1]^d \subset \mathbb{R}^d$, and the $L_p$-norm of a function $f : \Omega \to \mathbb{R}$ is defined by

$$\|f\|_{L_p}^p := \int_\Omega |f(\vec{x})|^p \, d\vec{x}.$$

The $L_2$-inner-product for continuous functions $f, g : [0,1] \to \mathbb{R}$ is defined as

$$\langle f, g \rangle = \int_\Omega f(y)\, g(y) \, dy.$$

2.1 Probability Density Estimation

Let $S = \{\vec{x}_1, \dots, \vec{x}_N\} \subset \mathbb{R}^d$ be a set of $N$ sample points (which are feature vectors) drawn from a probability density $f(X)$, where $X$ is a random variable. The goal of probability density estimation is to find an approximation of $f$, which will be denoted $\hat{p}$. While this is one of the standard approaches to solving unsupervised learning tasks, such as clustering, probability density estimation can also serve as a basis for supervised learning tasks, such as classification. Whereas parametric density estimation additionally assumes a given shape of the underlying density function, non-parametric density estimation only considers the given data $S$. In practice, the shape of $f$ often

cannot be inferred due to a lack of information. More detailed studies on density estimation in general can be found in [HTF09]. A common method for non-parametric density estimation is through kernels,

$$\hat{p}(y) = \frac{1}{N} \sum_{i=1}^{N} K\!\left(\frac{y - x_i}{\sigma^2}\right),$$

where the $K$ are kernel functions centered at the $x_i$. The estimate depends on the chosen kernel function $K$ and the bandwidth $\sigma$. While the kernel can more or less be chosen freely, finding well-performing bandwidths is not an easy task. The cost of evaluating $\hat{p}$ depends on the number of data points in $S$. Even when discretizing the data points by using bins and considering one kernel function per bin, a growing dimension of the $\vec{x}_i \in S$ (i.e. adding features to the data) leads to exponential growth of the number of bins. For a more detailed explanation, the reader is referred to Section 7 of [Peh13]. One possible way of coping with this exponential growth is to use sparse grids.
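To make the kernel estimator concrete, the following minimal C++ sketch evaluates a one-dimensional estimate with a Gaussian kernel. It uses the textbook normalization $\frac{1}{N\sigma}$ rather than the exact form above, and the function names are illustrative only, not part of any library discussed in this thesis.

```cpp
#include <cmath>
#include <vector>

// Gaussian kernel K(u) = exp(-u^2/2) / sqrt(2*pi), centered at zero.
double gaussianKernel(double u) {
    const double pi = 3.14159265358979323846;
    return std::exp(-0.5 * u * u) / std::sqrt(2.0 * pi);
}

// Kernel density estimate p_hat(y) = 1/(N*sigma) * sum_i K((y - x_i) / sigma)
// for one-dimensional samples x_i and bandwidth sigma.
double kernelDensityEstimate(const std::vector<double>& samples, double y, double sigma) {
    double sum = 0.0;
    for (double xi : samples) {
        sum += gaussianKernel((y - xi) / sigma);
    }
    return sum / (static_cast<double>(samples.size()) * sigma);
}
```

Every evaluation touches all $N$ samples, which is exactly the cost issue described above; binning reduces the number of kernel centers, but the number of bins grows exponentially with the dimension.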

2.2 Sparse Grids

An extensive coverage of the topic of adaptive sparse grids is given in [Pfl10]. The idea of sparse grids is to omit some of the points of a Cartesian full grid while not compromising the accuracy of problems solved on the grid points' information. This is done through a tensor-product construction and induces a structure on the new grid. By omitting points, the model of the problem can be distorted, e.g. because the problem depends strongly on a region of points which were omitted. That is why, in addition to omitting points from the full grid, it must be possible to add certain points back into the sparse grid. The process of adding and removing points from the sparse grid is referred to as adaptivity. In this section, a mathematical construction of sparse grids is given, followed by an illustration of the adaptivity process.

2.2.1 Construction of Sparse Grids

The fundament of sparse grids are hierarchical basis functions, which will serve as kernels. Though different kinds of basis functions can be considered, see [Pfl10], this section only deals with linear hat basis functions. In the case of zero boundaries, the one-dimensional standard hat function

$$\varphi(x) = \max(0, 1 - |x|), \quad x \in \Omega$$

is dilated and translated depending on a given level $\ell$ and an index $i$,

$$\varphi_{\ell,i}(x) := \varphi(2^{\ell} x - i), \quad \ell \in \mathbb{N}_{>0},\ i \in \{0, 1, \dots, 2^{\ell}\}.$$

Based on this, the hierarchical index sets,

$$I_\ell := \{i \in \mathbb{N} \mid 1 \le i \le 2^{\ell} - 1,\ i \text{ odd}\},$$

define a set of hierarchical subspaces

$$W_\ell := \operatorname{span}\{\varphi_{\ell,i}(x) \mid i \in I_\ell\}.$$

By using multi-index notation, the defined mathematical objects can be generalized to the multi-dimensional case. The multi-dimensional hierarchical basis function is obtained through a tensor product,

$$\varphi_{\vec{\ell},\vec{i}}(\vec{x}) := \prod_{j=1}^{d} \varphi_{\ell_j, i_j}(x_j),$$

where $\vec{i}$ and $\vec{\ell}$ are now multi-indices whose $k$-th components serve as the index and level in the $k$-th dimension. In a similar manner one can now extend

$$I_{\vec{\ell}} := \left\{ \vec{i} \in (\mathbb{N}_{>0})^d \;\middle|\; 1 \le i_j \le 2^{\ell_j} - 1,\ i_j \text{ odd},\ 1 \le j \le d \right\},$$

and

$$W_{\vec{\ell}} := \operatorname{span}\{\varphi_{\vec{\ell},\vec{i}}(\vec{x}) \mid \vec{i} \in I_{\vec{\ell}}\}.$$

Given a fixed grid level $n$, the space of piecewise $d$-linear functions on a full grid with mesh width $h_n$ in each dimension is given by the direct sum

$$V_n := \bigoplus_{|\vec{\ell}|_\infty \le n} W_{\vec{\ell}}.$$

A $d$-dimensional sparse grid of level $\ell$ is defined as

$$V_\ell^{(1)} := \bigoplus_{|\vec{\ell}|_1 \le \ell + d - 1} W_{\vec{\ell}}.$$

Figure 2.1 and Figure 2.2 show examples of a space of basis functions $W_{\vec{\ell}}$ and a sparse grid $V_\ell^{(1)}$, respectively. As illustrated in Figure 2.3, when considering basis functions for interpolation, the interpolant $u$ can be written as the weighted sum of basis functions

$$u(\vec{x}) = \sum_{i=1}^{N} \alpha_i \varphi_i(\vec{x}),$$

and can thus be uniquely described by the coefficient vector $\vec{\alpha}$. Therefore, given the basis functions, interpolating a function is equivalent to obtaining the vector $\vec{\alpha}$. This also holds for multi-dimensional interpolation on sparse grids, where the interpolant is described by

$$u(\vec{x}) = \sum_{|\vec{\ell}|_\infty \le n} \sum_{\vec{i} \in I_{\vec{\ell}}} \alpha_{\vec{\ell},\vec{i}}\, \varphi_{\vec{\ell},\vec{i}}(\vec{x}).$$
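As a sketch of how the hat basis and the interpolant above can be evaluated, the following C++ fragment implements $\varphi(x) = \max(0, 1 - |x|)$, the dilated and translated $\varphi_{\vec{\ell},\vec{i}}$, and the weighted sum $u(\vec{x}) = \sum_k \alpha_k \varphi_k(\vec{x})$. The types and function names are illustrative, not the SG++ interfaces.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// One-dimensional standard hat function phi(x) = max(0, 1 - |x|).
double hat(double x) { return std::max(0.0, 1.0 - std::fabs(x)); }

// d-dimensional hierarchical basis function phi_{l,i}(x) = prod_j phi(2^{l_j} * x_j - i_j).
double hatBasis(const std::vector<std::uint32_t>& level,
                const std::vector<std::uint32_t>& index,
                const std::vector<double>& x) {
    double value = 1.0;
    for (std::size_t j = 0; j < x.size(); ++j) {
        value *= hat(std::ldexp(x[j], static_cast<int>(level[j])) - index[j]);
        if (value == 0.0) break;  // x lies outside the support in dimension j
    }
    return value;
}

// A grid point is identified by its level/index multi-indices.
struct GridPoint { std::vector<std::uint32_t> level, index; };

// Interpolant u(x) = sum_k alpha_k * phi_k(x) over all grid points.
double evaluate(const std::vector<GridPoint>& grid, const std::vector<double>& alpha,
                const std::vector<double>& x) {
    double u = 0.0;
    for (std::size_t k = 0; k < grid.size(); ++k)
        u += alpha[k] * hatBasis(grid[k].level, grid[k].index, x);
    return u;
}
```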

Figure 2.1: Basis functions of the subspaces $W_{\vec{\ell}}$ for $|\vec{\ell}|_\infty \le 3$ in two dimensions. Figure taken from [Pfl10].


Figure 2.2: Left: the subspaces $W_{\vec{\ell}}$ up to $\ell_j = 3$ in dimensions $j \in \{1, 2\}$. Right: the resulting sparse grid space. Figure taken from [Pfl10].

Figure 2.3: Left: interpolation of a function using basis functions and weights $\alpha_i$. Right: the corresponding basis functions. Figure taken from [Pfl10].


2.2.2 Spatial Adaptivity on Sparse Grids

To adapt to the nature of the problem, it is often necessary to insert certain points into the sparse grid. An example could be the modeled problem's dependency on certain subspaces of the full grid: if the sparse grid's basis functions only have a small support in this specific area, the resulting loss in accuracy will be too great. Additionally adding independent single points to the grid, however, might distort the structure. Because many sparse grid algorithms rely on the sparse grid's structure (e.g. for traversing hierarchical basis functions), it has to be reestablished after adding a point. To refine a grid point, all of its children have to be added, if not yet created. Additionally, it always has to be made sure that the parents of all grid points are contained in the sparse grid. An illustration of this process is given in Figure 2.4. While this procedure describes the refinement process itself, it is not clear which points to choose for refinement. Different strategies exist; however, none of them is perfectly stable, as a counterexample can be brought up for each refinement strategy [Pfl10].

Figure 2.4: Starting with a two-dimensional grid of level 2 (left), the red point is considered for refinement. The refined point and its children get added to the grid (middle), and another point (red) is considered for further refinement, leading to a new grid with added ancestors, marked in gray (right). Figure taken from [Pfl10].

In the context of high-dimensional data, every added grid point means significant added runtime for the algorithms. It therefore makes sense to remove unnecessary points from the grid again, i.e. to coarsen the grid. A more detailed discussion of adaptivity strategies is contained in [Pfl10] and [Kre16].


2.3 Sparse Grid Density Estimation

Starting with a data set $S = \{\vec{x}_1, \dots, \vec{x}_N\}$, which is sufficient to obtain an overfitted initial guess $p_\varepsilon$, spline smoothing is used to obtain a generalized approximation $\hat{p}$. The spline smoothing finds a function $\hat{p}$ in a function space $F$ such that

$$\hat{p} = \arg\min_{f \in F} \int_\Omega \left(f(\vec{x}) - p_\varepsilon(\vec{x})\right)^2 d\vec{x} + \lambda \|\Lambda f\|_{L_2}^2.$$

The right term $\lambda \|\Lambda f\|_{L_2}^2$ is a regularization term, which assures that the method does not overfit, whereas minimizing the left term $(f(\vec{x}) - p_\varepsilon(\vec{x}))^2$ ensures fitting the initial guess $p_\varepsilon$, which was highly overfitted. This means that $\lambda \in \mathbb{R}_{>0}$ controls the tradeoff between accuracy of the approximation towards the initial guess, i.e. interpolating $S$, and overfitting. By transforming the integrals, it is possible to obtain a system of linear equations

$$(R + \lambda C)\,\vec{\alpha} = \vec{b},$$

where $R_{ij} = \langle \varphi_i, \varphi_j \rangle$ and $C_{ij} = \langle \Lambda\varphi_i, \Lambda\varphi_j \rangle$ describe the system matrix, and $b_i = \frac{1}{N} \sum_{j=1}^{N} \varphi_i(\vec{x}_j)$ are the components of the right-hand-side vector of the system of linear equations. Choosing $C = I$, the identity matrix, results in the regularization term limiting the growth of the hierarchical coefficients $\alpha_i$. In general, $C$ does not have to be the identity. For a more elaborate derivation of the system of linear equations, [Peh13] and the references therein can be consulted. With this result, sparse grid based density estimation can be achieved by building the system of linear equations out of the basis functions and then solving this system for $\vec{\alpha}$. The estimated density function can then be evaluated via the weighted sum of the coefficients of $\vec{\alpha}$ and the basis functions at given grid points.
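For illustration, a minimal sketch of assembling the right-hand side $b_i = \frac{1}{N}\sum_{j=1}^{N}\varphi_i(\vec{x}_j)$ is given below. The callback `basisEval` stands in for evaluating the $i$-th sparse grid basis function and is a hypothetical interface, not the SG++ API; the entries $R_{ij} = \langle\varphi_i,\varphi_j\rangle$ would be assembled analogously from the pairwise $L_2$-inner-products.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Assemble b_i = (1/N) * sum_j phi_i(x_j) for the system (R + lambda*C) alpha = b.
// basisEval(i, x) evaluates the i-th basis function at the sample x (hypothetical callback).
std::vector<double> assembleRhs(
    std::size_t numBasis,
    const std::function<double(std::size_t, const std::vector<double>&)>& basisEval,
    const std::vector<std::vector<double>>& samples) {
    std::vector<double> b(numBasis, 0.0);
    for (std::size_t i = 0; i < numBasis; ++i) {
        for (const auto& x : samples) b[i] += basisEval(i, x);
        b[i] /= static_cast<double>(samples.size());
    }
    return b;
}
```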

2.4 Offline/Online Splitting

Given a sparse grid with basis functions, the matrix of the system of linear equations given by $(R + \lambda I)\vec{\alpha} = \vec{b}$ can be computed without consideration of specific data points. In the context of online data streams, only the right-hand side $\vec{b}$ changes, so it is advantageous to prepare the system for solving while no new data is coming in. This is the basic idea of offline/online splitting. The offline phase is often used to decompose the matrix $(R + \lambda I)$ and store it. The decomposition is done in a way that leads to efficient solving once a vector $\vec{b}$ is given. When new data comes in frequently, the faster online solving process amortizes the costly offline decomposition. Different decompositions have been researched in the context of sparse grid density estimation. Eigenvalue and LU decompositions for the offline phase have been researched in [Peh13]. Although the Eigenvalue decomposition has a higher runtime than LU, it allows the regularization parameter $\lambda$ to be changed efficiently without recalculating the whole system matrix, which the LU decomposition does not. Both the Eigenvalue and the LU decomposition lack efficient support for adaptivity, because every time the grid changes, the whole adapted term has to be decomposed from scratch. In [Sie16], the Cholesky decomposition into $(R + \lambda I) = LL^T$ has been researched. The Cholesky decomposition allows for modifications with which the matrix can be adapted to refined and coarsened grid points; changing the regularization parameter, however, requires as much time as a new offline step in general.

2.5 Orthogonal Decomposition into Hessenberg Matrix

The QR-decomposition is a commonly used decomposition of a matrix $A \in \mathbb{R}^{m \times n}$ into an upper-triangular matrix $R$ and an orthogonal matrix $Q$. It is achieved by transforming the matrix $A$ with Householder reflections or Givens rotations. For a detailed review of QR-decompositions, the reader is referred to [DR08]. Given a real square matrix $A \in \mathbb{R}^{d \times d}$, it can be transformed into an upper-triangular matrix with an additional sub-diagonal after $d - 2$ Householder reflections $H_i$, $1 \le i \le d - 2$. The resulting form is called an upper Hessenberg matrix.

∗ ∗ · · · ∗ ∗ ∗ ∗ ∗ · · · ∗ ∗ ∗ ∗ ∗ · · · ∗ ∗ ∗ ∗ ∗ · · · ∗ ∗ ∗ ∗ ∗ · · · ∗ ∗ ∗ ∗ ∗ · · · ∗ ∗ ∗           0 ∗ · · · ∗ ∗ ∗ ∗ ∗ · · · ∗ ∗ ∗ 0 ∗ · · · ∗ ∗ ∗     H1   Hd−2···H1 A  .  ∗ ∗ · · · ∗ ∗ ∗ −→ 0 ∗ · · · ∗ ∗ ∗ −→ 0 0 .. ∗ ∗ ∗        ......   ......   ......   ......   ......   ......  ∗ ∗ · · · ∗ ∗ ∗ 0 ∗ · · · ∗ ∗ ∗ 0 0 ··· 0 ∗ ∗


Additionally, if $A \in \mathbb{R}^{d \times d}$ is symmetric, the reflections can also be applied from the other side without destroying the previously introduced zeroes.

∗ ∗ ∗ ∗ · · · ∗ ∗ ∗ ∗ 0 0 ··· 0 0 ∗ ∗ ∗ ∗ · · · ∗ ∗ ∗ ∗ ∗ 0 ··· 0 0      .   .  ∗ ∗ ∗ ∗ .. ∗ ∗ 0 ∗ ∗ ∗ .. 0 0      . . .   . . .  ∗ ∗ ∗ ∗ .. .. .  Hd−2···H1 AH1···Hd−2 0 0 ∗ ∗ .. .. .    −→    . . . . .   . . . . .   ...... ∗ ∗  ...... ∗ 0      .   .  ∗ ∗ ∗ .. ∗ ∗ ∗ 0 0 0 .. ∗ ∗ ∗ ∗ ∗ ∗ · · · ∗ ∗ ∗ 0 0 0 ··· 0 ∗ ∗

With $Q := H_1 \cdots H_{d-2}$, and since Householder reflections are symmetric, the decomposition can be written as $A = QTQ^T$. As $Q$ is a composition of orthogonal matrices, it is orthogonal itself. $T$ is in Hessenberg form, and since $A$ was symmetric, $T$ is also symmetric. Using this decomposition on a square matrix $A \in \mathbb{R}^{d \times d}$ thus yields an orthogonal decomposition into the orthogonal matrix $Q$ and the symmetric, tridiagonal matrix $T$. Given such a decomposition $A = QTQ^T$, one can easily invert $A$ by inverting $T$, using the orthogonality of $Q$ and basic rules for matrix operations, i.e.

$$A^{-1} = (QTQ^T)^{-1} = (Q^T)^{-1} T^{-1} Q^{-1} = Q T^{-1} Q^T.$$

A symmetric tridiagonal system can be solved in linear time, e.g. by using an LR decomposition with forward and backward substitution. For more details, see [DR08].
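The following sketch shows how a system $A\vec{x} = \vec{b}$ with $A = QTQ^T$ can be solved using the orthogonality of $Q$ and a linear-time symmetric tridiagonal solve (a Thomas-style elimination; it assumes non-zero pivots and does no pivoting). The dense matrix type and the function names are illustrative only.

```cpp
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;  // row-major dense matrix

// Solve T*y = r for a symmetric tridiagonal T given by its diagonal 'diag' and its
// sub/super-diagonal 'sub' (forward elimination + back substitution, O(n)).
Vec solveSymmTridiag(Vec diag, Vec sub, Vec r) {
    const std::size_t n = diag.size();
    for (std::size_t i = 1; i < n; ++i) {           // forward elimination
        double m = sub[i - 1] / diag[i - 1];
        diag[i] -= m * sub[i - 1];
        r[i] -= m * r[i - 1];
    }
    Vec y(n);
    y[n - 1] = r[n - 1] / diag[n - 1];
    for (std::size_t i = n - 1; i-- > 0;)           // back substitution
        y[i] = (r[i] - sub[i] * y[i + 1]) / diag[i];
    return y;
}

// Given A = Q*T*Q^T, solve A*x = b as x = Q * solve(T, Q^T * b).
Vec solveViaDecomposition(const Mat& Q, const Vec& diag, const Vec& sub, const Vec& b) {
    const std::size_t n = b.size();
    Vec qtb(n, 0.0), x(n, 0.0);
    for (std::size_t i = 0; i < n; ++i)             // Q^T * b
        for (std::size_t j = 0; j < n; ++j) qtb[i] += Q[j][i] * b[j];
    Vec y = solveSymmTridiag(diag, sub, qtb);
    for (std::size_t i = 0; i < n; ++i)             // Q * y
        for (std::size_t j = 0; j < n; ++j) x[i] += Q[i][j] * y[j];
    return x;
}
```

The two dense products with $Q$ dominate with $O(n^2)$ each, while the tridiagonal solve itself is only $O(n)$.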

2.6 Sherman-Morrison Formula

A rank-one update of a matrix is the addition of another matrix with rank one to the original matrix. This can be stated as

$$A + \vec{u}\vec{v}^T, \quad \vec{u}, \vec{v} \in \mathbb{R}^d.$$

For invertible $A \in \mathbb{R}^{d \times d}$, $(A + \vec{u}\vec{v}^T)$ is invertible if and only if $1 + \vec{v}^T A^{-1}\vec{u} \ne 0$. The inverse can then be calculated as

$$(A + \vec{u}\vec{v}^T)^{-1} = A^{-1} - \frac{A^{-1}\vec{u}\vec{v}^T A^{-1}}{1 + \vec{v}^T A^{-1}\vec{u}}.$$


For a proof, one can multiply the right-hand-side expression by $(A + \vec{u}\vec{v}^T)$ or check [Bar51]. As long as the updated matrix is invertible again, that is, as long as $1 + \vec{v}^T A^{-1}\vec{u} \ne 0$ holds, the formula can be applied repeatedly. An additional rank-one update with $\vec{u}_{(2)}, \vec{v}_{(2)} \in \mathbb{R}^d$ results in

$$\left((A + \vec{u}\vec{v}^T) + \vec{u}_{(2)}\vec{v}_{(2)}^T\right)^{-1} = (A + \vec{u}\vec{v}^T)^{-1} - \frac{(A + \vec{u}\vec{v}^T)^{-1}\,\vec{u}_{(2)}\vec{v}_{(2)}^T\,(A + \vec{u}\vec{v}^T)^{-1}}{1 + \vec{v}_{(2)}^T (A + \vec{u}\vec{v}^T)^{-1}\,\vec{u}_{(2)}}.$$

Substituting the additive component of the first rank-one update,

$$B := -\frac{A^{-1}\vec{u}\vec{v}^T A^{-1}}{1 + \vec{v}^T A^{-1}\vec{u}},$$

leads to $(A + \vec{u}\vec{v}^T)^{-1} = A^{-1} + B$, and the second update can be written as

$$\left((A + \vec{u}\vec{v}^T) + \vec{u}_{(2)}\vec{v}_{(2)}^T\right)^{-1} = (A^{-1} + B) - \frac{(A^{-1} + B)\,\vec{u}_{(2)}\vec{v}_{(2)}^T\,(A^{-1} + B)}{1 + \vec{v}_{(2)}^T (A^{-1} + B)\,\vec{u}_{(2)}}.$$

By iteratively substituting the additive components, multiple applications of the Sherman-Morrison formula can be expressed by just calculating the additive component of the $n$-th update with $\vec{u}_{(n)}, \vec{v}_{(n)} \in \mathbb{R}^d$, starting with $B_0 = 0 \in \mathbb{R}^{d \times d}$. This leads to the iterative formula for the additive component

$$B_n = B_{n-1} - \frac{(A^{-1} + B_{n-1})\,\vec{u}_{(n)}\vec{v}_{(n)}^T\,(A^{-1} + B_{n-1})}{1 + \vec{v}_{(n)}^T (A^{-1} + B_{n-1})\,\vec{u}_{(n)}}.$$
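A minimal sketch of one Sherman-Morrison step is given below; applying it repeatedly, each time feeding the previously updated inverse back in, realizes exactly the iterated form with the additive components $B_n$. Names are illustrative, and the caller must ensure the denominator $1 + \vec{v}^T A^{-1}\vec{u}$ is non-zero.

```cpp
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;  // row-major dense matrix

// One Sherman-Morrison step: given Ainv = A^{-1}, return (A + u*v^T)^{-1}
// = Ainv - (Ainv*u)(v^T*Ainv) / (1 + v^T*Ainv*u).
Mat shermanMorrison(const Mat& Ainv, const Vec& u, const Vec& v) {
    const std::size_t n = u.size();
    Vec Au(n, 0.0), vA(n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            Au[i] += Ainv[i][j] * u[j];   // A^{-1} u
            vA[i] += v[j] * Ainv[j][i];   // i-th component of v^T A^{-1}
        }
    double denom = 1.0;
    for (std::size_t j = 0; j < n; ++j) denom += v[j] * Au[j];  // 1 + v^T A^{-1} u
    Mat result = Ainv;
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            result[i][j] -= Au[i] * vA[j] / denom;  // subtract the rank-one correction
    return result;
}
```

Chaining calls, e.g. `Ainv = shermanMorrison(Ainv, u2, v2);`, corresponds to accumulating $B_n$ on top of $A^{-1}$.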

3 Algorithm

The beginning of this chapter first gives a quick overview of the introduced algorithm for adaptive sparse grid density estimation before looking at each step in detail. The subsections consist of an explanation of what each step does, why it works, a runtime analysis, and pseudocode of the subroutines. Given the level and the dimension of the grid, resulting in a grid of size $n$, the procedure can be split up into five subroutines.

1. Calculate $R_{ij}$ for all $i, j \in \{1, 2, \dots, n\}$

2. Decompose $R = QTQ^T$

3. Invert $(R + \lambda I)^{-1} = Q(T + \lambda I)^{-1}Q^T$

4. Apply adaptivity by adding the row/column of $L_2$-products $\vec{x}$ to $(R + \lambda I)$:

   a) Adjust the $k$-th value of $\vec{x}$ to $x_k = (x_k - 1 + \lambda) \cdot 0.5$

   b) Set the adequate sign of the rank-one update ($+$ if refining, $-$ if coarsening)

   c) Perform the Sherman-Morrison updates to obtain $((R + \lambda I) \pm \vec{x}\vec{e}^T \pm \vec{e}\vec{x}^T)^{-1}$

5. Solve the resulting system of linear equations: $\vec{\alpha} = QT^{-1}Q^T\vec{b} + B\vec{b}$


The offline phase consists of steps 1–3. Optionally, right after the offline steps are finished, tuning of the regularization parameter $\lambda$ can be applied as often as needed. Steps 4 and 5 make up the online phase, after which $\vec{\alpha}$ is obtained and the probability density function can be evaluated. Note: since this thesis is a proof of concept, the algorithm and its subroutines are neither fully optimized nor parallelized yet. Details on possible optimizations are contained in Chapter 6. Also, in order to perform rank-one updates through the Sherman-Morrison formula, other orthogonal decompositions, such as the Eigenvalue decomposition, can be considered as well.

3.1 Offline Phase

Choosing an initial a priori sparse grid fixes the number of grid points and their hat basis functions. In this subsection, the fixed size of the initial a priori grid is denoted as $n \in \mathbb{N}_{>0}$. This leads to a real square system matrix $R + \lambda I$ of size $n$.

3.1.1 Decomposing into Hessenberg Matrix

Because of the symmetry of the $L_2$-inner-product of the basis functions $\varphi_i$, the resulting system matrix $R$ is symmetric. By applying Householder reflections to this matrix, the resulting Hessenberg matrix will be symmetric and tridiagonal. This leads to a decomposition of the form $R = QTQ^T$, where $Q \in \mathbb{R}^{n \times n}$ is an orthogonal matrix and $T \in \mathbb{R}^{n \times n}$ is a symmetric tridiagonal matrix. Symmetric matrices obtained from decompositions profit from the possibility to store only the diagonal and the upper half of the entries and to use the lower half of the storage for information about the transformations performed. Usually, this is advised when transforming a given matrix into a Hessenberg matrix. In the context of offline/online splitting, however, it can be advantageous to already calculate and store the values of $Q$ in the offline phase, so they do not have to be computed in the online phase when solving the system. The runtime of a transformation into a Hessenberg matrix is approximately $\frac{5}{3}n^3 \in O(n^3)$, as stated on p. 256 of [DR08]. Unfortunately, unpacking the values of $Q$ explicitly takes significant additional runtime. Accepting the unpacking costs in the offline phase, however, benefits the speed of solving the system in the online phase.

Decomposition into Hessenberg Matrix

    R ← system matrix of the initial grid
    Q ← orthogonal matrix
    diag ← diagonal of the Hessenberg matrix T
    subdiag ← sub- and super-diagonal of the Hessenberg matrix T

    function orthogonal_tridiagonal_decomposition(R)
        R ← decompositionIntoHessenbergForm(R)
        Q, diag, subdiag ← unpackFromCompactStorage(R)
    end function
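A sketch of how this offline step can be realized with the GSL routines named in Chapter 4 (gsl_linalg_symmtd_decomp and gsl_linalg_symmtd_unpack) is shown below, assuming the signatures documented in the GSL manual; error handling and the integration into the SG++ class structure are omitted.

```cpp
#include <gsl/gsl_linalg.h>
#include <gsl/gsl_matrix.h>
#include <gsl/gsl_vector.h>

// Decompose the symmetric system matrix stored in A (size n x n) into Q T Q^T with
// T symmetric tridiagonal, and unpack Q, diag(T) and subdiag(T) explicitly.
void orthogonalTridiagonalDecomposition(gsl_matrix* A, gsl_matrix* Q,
                                        gsl_vector* diag, gsl_vector* subdiag) {
    const size_t n = A->size1;
    gsl_vector* tau = gsl_vector_alloc(n - 1);          // Householder coefficients
    gsl_linalg_symmtd_decomp(A, tau);                   // compact tridiagonal form in A
    gsl_linalg_symmtd_unpack(A, tau, Q, diag, subdiag); // explicit Q, diag, subdiag
    gsl_vector_free(tau);
}
```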

3.1.2 Inverting Tridiagonal Matrix

By using the obtained decomposition $R = QTQ^T$, solving the system $(R + \lambda I)\vec{\alpha} = \vec{b}$ can be done by transforming

$$R + \lambda I = QTQ^T + \lambda I QQ^T \qquad (Q \text{ orthogonal})$$
$$= QTQ^T + Q(\lambda I)Q^T \qquad (\lambda I \text{ commutes with } Q)$$
$$= Q(T + \lambda I)Q^T \qquad \text{(distributivity)}$$

and then inverting

$$\left(Q(T + \lambda I)Q^T\right)^{-1} = (Q^T)^{-1}(T + \lambda I)^{-1}Q^{-1} = Q(T + \lambda I)^{-1}Q^T.$$

Optionally, the regularization parameter $\lambda$ can be tuned here without having to decompose the initial system matrix again. This is done by adding $\lambda$ to the diagonal of $T$, inverting $(T + \lambda I)$, and then solving for $\vec{\alpha} = Q(T + \lambda I)^{-1}Q^T\vec{b}$. These steps can be repeated as often as needed to find an optimal $\lambda$. Since $T$ is a symmetric tridiagonal matrix (solving s.p.d. tridiagonal systems is $O(n)$, see p. 89f. of [DR08]), its inverse can be explicitly calculated in $O(n^2)$ by solving $T\vec{x} = \vec{e}_i$ for the $i$-th column of $T^{-1}$, $i \in \{1, 2, \dots, n\}$. After that, three matrix-vector multiplications have to be performed to obtain $\vec{\alpha}$, each of which is $O(n^2)$. Solving the decomposed system prior to any adaptivity procedures is therefore $O(n^2)$. Note that for solving the system $QTQ^T\vec{\alpha} = \vec{b}$, $T$ does not have to be explicitly inverted; nevertheless, solving the decomposed system remains in $O(n^2)$. Suggestions concerning optimization are contained in Chapter 6.

Inversion of Tridiagonal Matrix

    T⁻¹ ← storage space for the inverse of the tridiagonal matrix
    diag ← diagonal of the tridiagonal matrix T
    subdiag ← sub- and super-diagonal of the tridiagonal matrix T

    function invert_tridiagonal_matrix(diag, subdiag)
        for k ← 1 to n
            column k of T⁻¹ ← solve(diag, subdiag, k-th unit vector)
        end for
    end function
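The column-wise inversion of $(T + \lambda I)$ can be sketched with gsl_linalg_solve_symm_tridiag as follows; the $\lambda$-shift of the diagonal also shows where the regularization parameter can be retuned without a new decomposition. This is a sketch under the signatures given in the GSL manual, not the exact SG++ implementation.

```cpp
#include <gsl/gsl_linalg.h>
#include <gsl/gsl_matrix.h>
#include <gsl/gsl_vector.h>

// Build (T + lambda*I)^{-1} column by column by solving (T + lambda*I) x = e_i
// with GSL's symmetric tridiagonal solver. diag/subdiag come from the decomposition.
void invertTridiagonal(const gsl_vector* diag, const gsl_vector* subdiag,
                       double lambda, gsl_matrix* Tinv) {
    const size_t n = diag->size;
    gsl_vector* shifted = gsl_vector_alloc(n);        // diag(T) + lambda
    for (size_t i = 0; i < n; ++i)
        gsl_vector_set(shifted, i, gsl_vector_get(diag, i) + lambda);

    gsl_vector* e = gsl_vector_alloc(n);
    gsl_vector* col = gsl_vector_alloc(n);
    for (size_t i = 0; i < n; ++i) {
        gsl_vector_set_zero(e);
        gsl_vector_set(e, i, 1.0);                    // i-th unit vector
        gsl_linalg_solve_symm_tridiag(shifted, subdiag, e, col);
        for (size_t j = 0; j < n; ++j)                // store as i-th column of T^{-1}
            gsl_matrix_set(Tinv, j, i, gsl_vector_get(col, j));
    }
    gsl_vector_free(shifted);
    gsl_vector_free(e);
    gsl_vector_free(col);
}
```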

3.2 Online Phase

In the online phase, the goal is to process new data as fast as possible without losing too much accuracy. The new data can, for example, come from an online data stream. In the context of sparse grid density estimation, adaptivity information is received, i.e. which points to refine or coarsen, alongside the right-hand side $\vec{b} \in \mathbb{R}^n$ of the system to solve. In this subsection, the matrix of the offline phase is assumed to be square of size $n \in \mathbb{N}_{>0}$ and already stored in the decomposed and inverted form $Q(T + \lambda I)^{-1}Q^T$.

3.2.1 Refining

Refining one point in the sparse grid will not only add the point to the grid, but also generate new basis functions according to the construction of the hierarchical subspace $W_{\vec{\ell}}$. This changes the system matrix by adding a row and a column of the corresponding $L_2$-inner-products. With $n + 1$ being the number of the new point, the system matrix $R$ grows into a bigger matrix:

$$R = \begin{pmatrix} \langle\varphi_1,\varphi_1\rangle & \cdots & \langle\varphi_1,\varphi_n\rangle \\ \vdots & \ddots & \vdots \\ \langle\varphi_n,\varphi_1\rangle & \cdots & \langle\varphi_n,\varphi_n\rangle \end{pmatrix} \;\xrightarrow{\text{refine}}\; \begin{pmatrix} & & & \langle\varphi_1,\varphi_{n+1}\rangle \\ & R & & \vdots \\ & & & \langle\varphi_n,\varphi_{n+1}\rangle \\ \langle\varphi_{n+1},\varphi_1\rangle & \cdots & \langle\varphi_{n+1},\varphi_n\rangle & \langle\varphi_{n+1},\varphi_{n+1}\rangle \end{pmatrix}.$$


The goal now is to express the adding of rows and columns as rank-one updates, while simultaneously resizing the matrix. Let $\vec{x} \in \mathbb{R}^{n+1}$ be the $i$-th row/column which is added through the update, which can be stated as

$$\begin{pmatrix} R & \vec{0} \\ \vec{0}^T & 0 \end{pmatrix} + \vec{x}\vec{e}_i^T + \vec{e}_i\vec{x}^T,$$

where $\vec{e}_i$ is the $i$-th unit vector. This approach does not yet consider the regularization parameter $\lambda$, nor does it satisfy the necessary assumptions of the Sherman-Morrison formula, because the matrix is not invertible prior to the update. Also, the $(n+1)$-th component of $\vec{x}$ is added twice to the new matrix, once when updating the row and once when updating the column. To fix the invertibility of the matrix to be updated, it is extended by the $(n+1)$-th unit vector, appended as a row/column. This results in the inverse also being extended by just this unit vector as a row/column, which can be expressed by setting the $(n+1)$-th diagonal element of the additive component $B$ to 1. The remaining problems can be solved by changing the $(n+1)$-th component of $\vec{x}$: to even out the added 1 of the additive component, it has to be subtracted from the row/column which is updated; the $\lambda$ can simply be added; finally, the value must be divided by 2 to compensate for adding it twice. This leads to a specific transformation of the $(n+1)$-th component of $\vec{x}$ prior to refinement. Taking these changes into account, refining a point in the sparse grid corresponds to two rank-one updates. As derived in Chapter 2, for calculating the inverse of the twice rank-one updated matrix, it suffices to calculate the additive component for each of the two updates. The general procedure for refining the $k$-th grid point $\vec{x}_k$ consists of three steps.

1. Adapt the $(n+k)$-th component of $\vec{x}_k$:

$$x_{n+k} = 0.5 \cdot (x_{n+k} + \lambda - 1).$$

2. Calculate the first updated inverse:

$$\tilde{B}_k = B_{k-1} - \frac{(QT^{-1}Q^T + B_{k-1})\,\vec{x}_k\,\vec{e}_{n+k}^T\,(QT^{-1}Q^T + B_{k-1})}{1 + \vec{e}_{n+k}^T\,(QT^{-1}Q^T + B_{k-1})\,\vec{x}_k}.$$

3. Calculate the second updated inverse:

$$B_k = \tilde{B}_k - \frac{(QT^{-1}Q^T + \tilde{B}_k)\,\vec{e}_{n+k}\,\vec{x}_k^T\,(QT^{-1}Q^T + \tilde{B}_k)}{1 + \vec{x}_k^T\,(QT^{-1}Q^T + \tilde{B}_k)\,\vec{e}_{n+k}}.$$


Note that the order of $\vec{x}_k$ and $\vec{e}_{n+k}$ has been swapped in the second update. Also, as the matrix $QT^{-1}Q^T$ is of dimension $n$, it has to be filled up with zeroes until it fits the dimension of $B$. The reduced dimension of $QT^{-1}Q^T$ and the symmetry of the matrices allow for some optimization. The formula of the first rank-one update can be expanded to

$$\tilde{B}_k = B_{k-1} - \frac{(QT^{-1}Q^T\vec{x}_k + B_{k-1}\vec{x}_k) \cdot (\vec{e}_{n+k}^T QT^{-1}Q^T + \vec{e}_{n+k}^T B_{k-1})}{1 + \vec{e}_{n+k}^T QT^{-1}Q^T\vec{x}_k + \vec{e}_{n+k}^T B_{k-1}\vec{x}_k}.$$

Since the dimension of $QT^{-1}Q^T$ is always $n$ and multiplication with a unit vector can be treated as a storage access to the corresponding index, the terms $\vec{e}_{n+k}^T QT^{-1}Q^T$ and $\vec{e}_{n+k}^T QT^{-1}Q^T\vec{x}_k$ are always zero, which reduces the first update to

$$\tilde{B}_k = B_{k-1} - \frac{(QT^{-1}Q^T\vec{x}_k + B_{k-1}\vec{x}_k) \cdot (\vec{e}_{n+k}^T B_{k-1})}{1 + \vec{e}_{n+k}^T B_{k-1}\vec{x}_k}.$$

Furthermore, because $Q$ is orthogonal and $T^{-1}$ is symmetric, it holds that $QT^{-1}Q^T\vec{x}_k = (\vec{x}_k^T QT^{-1}Q^T)^T$, so this term does not have to be recalculated in the second update, leaving only $\vec{x}_k^T\tilde{B}_k$ to be computed there. With these optimizations in mind, the runtime for both updates can be obtained. In the first update, $QT^{-1}Q^T\vec{x}_k$ has to be explicitly calculated, which is $O(n^2)$. Also, $B_{k-1}\vec{x}_k$ has to be calculated in the first update and $\tilde{B}_k\vec{x}_k$ in the second, both of which are in $O((n+k)^2)$. All other terms can be obtained by accessing the storage of the already calculated terms. This gives both updates together a runtime of $O(n^2 + (n+k)^2) = O((n+k)^2)$, where $n$ is the initial grid size and $k - 1$ is the number of points refined before the current one. To conclude, if $m$ is the current grid size, then refining a grid point is $O(m^2)$.

3.2.2 Coarsening

When using the Sherman-Morrison formula, coarsening can only be done for points whose information is stored in the additive component $B$, i.e. previously refined points. Because removing rows/columns of the initial system matrix $R + \lambda I$ leads to removing rows/columns from the decomposition $QTQ^T$, coarsening initial grid points with rank-one updates would require not only a new decomposition, but also recomputing the additive component $B$ for every grid point refined or coarsened before. In this section, let $A \in \mathbb{R}^{(n+k)\times(n+k)}$ be the updated system matrix, which results from the initial system matrix $R + \lambda I$ by refining $k$ grid points.


Sherman-Morrison Rank-One Update

    B ← additive component of the Sherman-Morrison formula
    Q ← orthogonal matrix of the decomposition
    T⁻¹ ← inverse of the tridiagonal matrix with regularization parameter
    x ← row/column to add to the original system matrix
    i ← index of the row/column to update

    function rank_one_update(B, Q, T⁻¹, x, i)
        x_term ← Q T⁻¹ Qᵀ x        // Q and T⁻¹ filled with zeroes to match x
        b_term ← B x
        divisor ← 1 + b_term_i
        prod ← ⟨(x_term + b_term), B_i⟩ / divisor
        B̃ ← B − prod
        b_term ← B̃ x
        divisor ← 1 + b_term_i
        prod ← ⟨(x_term + b_term), B̃_i⟩ / divisor
        B ← B̃ − prod
    end function

Refinement of one Point

    B ← additive component of the Sherman-Morrison formula
    Q ← orthogonal matrix of the decomposition
    T⁻¹ ← inverse of the tridiagonal matrix with regularization parameter
    λ ← regularization parameter
    m ← current matrix size

    function refine_point()
        for k ← 1 to m + 1
            x_k ← ⟨φ_k, φ_{m+1}⟩
        end for
        x_{m+1} ← 0.5 (x_{m+1} − 1 + λ)
        B ← rank_one_update(B, Q, T⁻¹, x, m + 1)
    end function

In the previous section, it was shown that the inverse of $A$ can then be written as

$$A^{-1} = QT^{-1}Q^T + B_k,$$

where $QT^{-1}Q^T$ is resized and filled with zeroes to fit the size of the additive component $B$. Similar to refining, coarsening can be performed on the previously refined system matrix by executing rank-one updates. In this case, to remove a row/column, the update has to be subtracted and the numbering index $i$ of the refined point has to be bigger than the a priori grid size $n$. With the same adjustments as in refining, only this time applied to the $i$-th component of the row/column vector,

$$x_i = 0.5 \cdot (x_i + \lambda - 1),$$

the rank-one updates for coarsening can be stated as

$$A - \vec{x}\vec{e}_i^T - \vec{e}_i\vec{x}^T.$$

Due to saving all refinement information in the additive component $B$, the calculations corresponding to coarsening one point are just two further rank-one updates and can be stated with the Sherman-Morrison formula. With the adaptations to the removed row/column, the updated matrix will have only zero entries in that row/column, except for the $i$-th diagonal element, which will be 1. The resizing in the coarsening case takes place after the Sherman-Morrison updates are done, by simply removing the corresponding rows/columns. Doing so reduces the size of the current system matrix. Since the procedure in the coarsening case equals the refining case, all optimizations concerning the Sherman-Morrison formula can be applied in the same way. Thus, with exactly the same reasoning as in the refinement case, the runtime for coarsening one point from a refined grid with $k$ added points is in $O((n+k)^2)$, with $n + k$ being the current grid size.

Coarsening of one Point

    B ← additive component of the Sherman-Morrison formula
    Q ← orthogonal matrix of the decomposition of the system matrix
    T⁻¹ ← inverse of the tridiagonal matrix with regularization parameter
    λ ← regularization parameter
    m ← current matrix size
    i ← index of the point to coarsen, i > initial grid size

    function coarsen_point()
        for k ← 1 to m
            x_k ← ⟨φ_k, φ_i⟩
        end for
        x_i ← 0.5 (x_i − 1 + λ)
        B ← rank_one_update(B, Q, T⁻¹, −x, i)
    end function


3.2.3 Solving Resulting System

After adaptivity is done, the inverse of the current system matrix of size $n + k$ is available as

$$\begin{pmatrix} QT^{-1}Q^T & \vec{0} \\ \vec{0}^T & 0 \end{pmatrix} + B_k,$$

where $n$ is the size of the a priori grid and $k$ is the number of new grid points after refining and coarsening. With $B_k$ being the additive component that holds the information of $k$ refined points, it has size $n + k$. Given a right-hand-side vector $\vec{b} \in \mathbb{R}^{n+k}$, the system can be solved simply by multiplying the inverse of the system matrix with $\vec{b}$, thus

$$\vec{\alpha} = \left( \begin{pmatrix} QT^{-1}Q^T & \vec{0} \\ \vec{0}^T & 0 \end{pmatrix} + B_k \right)\vec{b} = \begin{pmatrix} QT^{-1}Q^T & \vec{0} \\ \vec{0}^T & 0 \end{pmatrix}\vec{b} + B_k\vec{b}.$$

The multiplication in the left addend consists of three matrix-vector multiplications, which is $O(n^2)$. The right addend is one matrix-vector multiplication of size $n + k$, which is $O((n+k)^2)$. Therefore, the total runtime for solving the system after adaptivity is $O((n+k)^2)$, with $n + k$ being the current grid size.
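A minimal dense sketch of this solving step, with the zero-padded $QT^{-1}Q^T$ applied only to the first $n$ components of $\vec{b}$ and the additive component $B$ applied to all $n + k$ components, could look as follows (illustrative types and names, no parallelization):

```cpp
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;  // row-major dense matrix

// Dense matrix-vector product helper.
Vec matVec(const Mat& M, const Vec& v) {
    Vec r(M.size(), 0.0);
    for (std::size_t i = 0; i < M.size(); ++i)
        for (std::size_t j = 0; j < v.size(); ++j) r[i] += M[i][j] * v[j];
    return r;
}

// alpha = padded(Q T^{-1} Q^T) * b + B * b, with Q and Tinv of the initial size n
// and B of the current size m = n + k.
Vec solveRefinedSystem(const Mat& Q, const Mat& Tinv, const Mat& B, const Vec& b) {
    const std::size_t n = Q.size();
    Vec qtb(n, 0.0);
    for (std::size_t i = 0; i < n; ++i)              // Q^T * (first n components of b)
        for (std::size_t j = 0; j < n; ++j) qtb[i] += Q[j][i] * b[j];
    Vec y = matVec(Tinv, qtb);                        // T^{-1} * (Q^T b)
    Vec alpha = matVec(B, b);                         // B * b, size n + k
    for (std::size_t i = 0; i < n; ++i)               // add Q * y to the first n entries
        for (std::size_t j = 0; j < n; ++j) alpha[i] += Q[i][j] * y[j];
    return alpha;
}
```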

4 Implementation

This chapter gives an overview of the implementations done in SG++ as part of this thesis¹. The created classes are listed with their respective functionality inside the offline/online splitting procedure and their subroutines. The following UML diagram shows the embedding of the new classes into the legacy system of SG++; the base diagram is taken from [Let17].

¹ SG++ is a project developed at the chair for scientific computing at the Technical University of Munich and at the chair for simulation of large systems at the University of Stuttgart, sgpp.sparsegrids.org.


• Class DBMatOfflineOrthoAdapt
  The offline part of the offline/online splitting is implemented in this class. The overwritten function decomposeMatrix() decomposes the initial grid's system matrix into $QT^{-1}Q^T$ and then inverts $T$. The size of the matrices is denoted by $n \in \mathbb{N}_{>0}$.
  – Decomposing into a Hessenberg matrix: The orthogonal decomposition of the initial system matrix into a Hessenberg matrix is accomplished with the GSL functions² gsl_linalg_symmtd_decomp and gsl_linalg_symmtd_unpack. The resulting matrices $Q$ and $T$ are saved as member variables of the offline object.
  – Inverting the symmetric tridiagonal matrix: Currently, the inversion takes place by solving $T\vec{x} = \vec{e}_i$ for each $i \in \{1, 2, \dots, n\}$ and storing the result as a column of $T^{-1}$. This is done by the function gsl_linalg_solve_symm_tridiag. As mentioned in Chapter 3, the inverse of $T$ does not have to be calculated explicitly, but doing so does not change the runtime by a significant amount.

• Class DBMatOnlineDEOrthoAdapt
  The online part of the splitting handles the adaptivity calculations. The $L_2$-products of the new points' basis functions are calculated in a subroutine, and the resulting vectors are stored in a member variable. This allows for coarsening of points which were refined before.
  – Refining points: After calculating the $L_2$-products of the new grid points' basis functions, the Sherman-Morrison formula is used to update the additive component $B$. To avoid frequent resizing of matrices, the refining is done for a set of points instead of one point at a time. The matrix $B$ gets resized to size $n + k$ if $k$ points are to be refined. To avoid complexity scaling with the bigger matrix, sub-matrices of size $n + i$ are used for $i \in \{1, 2, \dots, k\}$.
  – Coarsening points: Similar to refining, the Sherman-Morrison formula is used for coarsening. The row/column to remove from the additive component was calculated in a previous refinement step and can be taken out of the container member variable of the online object. After coarsening, the entries of the corresponding row/column are set to zero. As in refining, a set of points gets coarsened, which results in a block matrix at the end of the coarsening function.

² Link to the manual: https://www.gnu.org/software/gsl/manual/html_node/Linear-Algebra.html


  – Resizing the matrix after coarsening: To resize the coarsened matrix $B$, the indices of the coarsened points can be used to skip the rows/columns which were set to zero. The other values are copied into a smaller allocated matrix of size $n + k - r$, with $k$ refined points and $r$ coarsened points, $r \le k$. Due to allocating and moving values, the resizing after coarsening takes additional time in $O((n + k - r)^2)$.

• Class DBMatDMSOrthoAdapt
  – Solving the resulting system: To fit the design of the legacy system, this class was created to solve the resulting system $\vec{\alpha} = QT^{-1}Q^T\vec{b} + B\vec{b}$ using the previously obtained matrices $Q$, $T^{-1}$, $B$ and a given vector $\vec{b}$ as the right-hand side of the system. When computing the product $QT^{-1}Q^T\vec{b}$, the components of $\vec{b}$ with an index bigger than the initial system matrix size are ignored.

5 Evaluation

Runtime evaluations of the algorithm in its currently implemented version are done on an Intel Core i7-4710HQ CPU with a 2.5 GHz clock frequency. As its functionality differs from adaptive sparse grid density estimation based on the Cholesky decomposition, a given comparison metric may yield different results for other setups. The matrix updates do not directly depend on the grid's level and dimension, but only on the resulting grid size. Therefore, the considered parameter is the size of the grid's system matrix. The following table contains the considered combinations of grid dimension and grid level along with the resulting matrix size.

level       2     3     4     5      6      7      8      9     10
dim = 2     5    17    49   129    321    769   1793   4097   9217
dim = 3     7    31   111   351   1023   2815   7423      -      -
dim = 4     9    49   209   769   2561   7937      -      -      -

Also note that the speed of the algorithm is data-independent. Even when solving the system, the data itself is not relevant, as only the size of $\vec{b}$ impacts the speed. Therefore it is sufficient to consider only the size of the system. After the runtime evaluations of the individual subroutines, an example of density estimation and classification on a data set is given.


5.1 Offline Phase

Given an already initialized system matrix $R + \lambda I$, the different decomposition methods are measured. As Cholesky-based sparse grid density estimation can also deal with adaptivity, it is interesting to compare the Cholesky decomposition with the orthogonal decompositions, which currently are the Eigenvalue decomposition and the decomposition into a Hessenberg matrix. Although all three decompositions are $O(n^3)$, the scaling factors differ immensely. The Cholesky decomposition is faster than the orthogonal ones in general, but cannot be used to efficiently tweak the regularization parameter $\lambda$, which takes up to $O(n^3)$. Figure 5.1 shows a speed comparison of the mentioned decomposition types.

Figure 5.1: Runtime (in seconds) of the different decompositions for the offline phase over the grid size: decomposition into Hessenberg matrix, Cholesky decomposition, and Eigenvalue decomposition.


5.2 Online Phase

So far, the only other algorithm capable of adaptivity transformations during the online phase is the Cholesky-based one. Due to the underlying orthogonal decomposition of the new algorithm, the additive component can only store refined points, and only previously refined points can be coarsened. This constraint does not hold for the Cholesky-based algorithm, where it is possible to coarsen every grid point, which is an advantage.

5.2.1 Refining

Both the Cholesky-based and the orthogonal-decomposition-based online phases have to compute the corresponding $L_2$-products of the basis functions before refining a point. In the following graph, one refinement step includes the computation of the $L_2$-products as well as the necessary rank-one update of the system matrix. For the runtime measurement it does not matter whether a single point or a set of points gets refined, because the refinement process depends only on the current matrix size. Figure 5.2 shows the times for refining one point for the Sherman-Morrison-based and the Cholesky-based refinement.

Figure 5.2: Runtimes (in seconds) for refining one grid point over the grid size: Sherman-Morrison-based refinement vs. Cholesky-modification-based refinement.


5.2.2 Coarsening

Although the refining and coarsening processes in the new algorithm are both based on the Sherman-Morrison formula, they differ slightly in runtime. Both adaptivity procedures have to resize the matrix, but only after coarsening does the matrix have to be not only resized but also reassembled from the block-shaped sub-matrices. This takes additional runtime in $O(n^2)$. Coarsening a point in the Cholesky-based decomposition of the system matrix depends on the index of the coarsened point; for details, see [Sie16]. Figure 5.3 contains runtimes averaged over arbitrary coarsening indices. In the case of the orthogonal decomposition, only allowed indices were considered.

Figure 5.3: Runtimes (in seconds) for coarsening one arbitrary grid point over the grid size: Sherman-Morrison-based coarsening vs. Cholesky-modification-based coarsening.


5.2.3 Solving the System

The complexity of solving the system after adaptivity is in $O((n+k)^2)$, where $n$ is the initial grid size and $k$ is the number of points added. When considering adaptivity transformations of the matrix, only the current size is relevant. But when solving without prior refinement, the additive component $B$ is filled with zeroes, so the calculation of $B\vec{b}$ can be pruned. For Figure 5.4, one refinement step has taken place before solving in order to measure the added runtime of processing the additive component.

Figure 5.4: Runtime (in seconds) of solving the resulting system over the grid size: Sherman-Morrison-based solve vs. Cholesky-based solve.


5.3 Example: Ripley Dataset

In this subsection, the new algorithm is used to perform density estimation and classification on the Ripley dataset from [RH95], scaled to $[0,1]^d$ to match the sparse grid setting. The Ripley data is generated by mixing two Gaussian distributions and contains two class labels, which cannot be separated linearly. The training data contains 250 samples, whereas the test data contains 1000 samples, see Figure 5.5.

Figure 5.5: Training data samples (left) and test data samples (right) with two classes denoted by blue and red points.

The following computations are done with an initial sparse grid of level 5, which means the initial size of the system matrix is 129. The 250 training samples are learned in 25 batches of size 10 each. After each batch is processed, one point is considered for refinement, which can grow the matrix size by more than one. To ensure a fair accuracy comparison with the Cholesky-based algorithm, no coarsening is done throughout the learning, because the new algorithm can only coarsen points which were refined before, whereas the Cholesky-based algorithm can coarsen every grid point. Figure 5.6 shows visualizations of the computed density estimation functions of both classes, and Figure 5.7 shows the predicted classes. Classification with the specified settings yields an accuracy of 89.5% for both the new method and the Cholesky-based approach. Changing the learning setup in general yields different accuracies, but in all performed learning procedures the accuracies of both algorithms are close to each other.


Figure 5.6: Computed density functions of the red class (left) and the blue class (right).

Figure 5.7: Predicted classes based on density estimation.

6 Possible Improvement

As this thesis is a proof of concept, the algorithm is not optimized to its full extent. This chapter lists possible improvements.

• Parallelization:
Profiling all the functions of the algorithm reveals that almost all of the computation time of the online phase's subroutines goes into matrix-vector products. Computing the $L_2$-inner-products can be parallelized, as the values are available at all times and are independent of each other. For adaptivity, the two updates using the Sherman-Morrison formula together compute five matrix-vector products and two matrix-matrix subtractions, each of which can be parallelized. The solving subroutine also uses matrix-vector products and thus can also be parallelized.

• Using a tridiagonal solver:
Besides parallelizing the solving subroutine, further optimization can be achieved by using a solver for tridiagonal systems, although only for the smaller-sized calculation of $QT^{-1}Q^T\vec{b}$. Instead of calculating three matrix-vector products, which has a runtime of $3n^2$, it is possible to calculate only two matrix-vector products and solve one tridiagonal system, which would be $2n^2 + 2n$. Using a tridiagonal solver, the solving procedure is

$$QTQ^T\vec{\alpha} = \vec{b} \;\Leftrightarrow\; TQ^T\vec{\alpha} = Q^T\vec{b} \;\Leftrightarrow\; Q^T\vec{\alpha} = \operatorname{solve}(T, Q^T\vec{b}) \;\Leftrightarrow\; \vec{\alpha} = Q \cdot \operatorname{solve}(T, Q^T\vec{b}).$$

• Trading offline/online time:
To be able to regularize efficiently, the initial system $(R + \lambda I)\vec{\alpha} = \vec{b}$ has to be solved. Using the Eigenvalue decomposition instead of the decomposition into a Hessenberg matrix allows for a faster inversion of $T$, as $T$ is then a diagonal matrix, and thus for faster online phases. The tradeoff, however, is a significant amount of offline time, as shown in Figure 5.1. The tradeoff might still be worth it, because not only the solving for regularization gets a boost, but also the adaptivity processes. When using the decomposition into a Hessenberg matrix, the inverse of $T$ is, in general, a full matrix. Using the Eigenvalue decomposition, however, yields a diagonal inverse and thus speeds up later matrix-vector multiplications. Both when solving $QT^{-1}Q^T\vec{\alpha} = \vec{b}$ and when multiplying $QT^{-1}Q^T\vec{x}$, this changes the runtime from $3n^2$ to $2n^2 + n$.

7 Conclusion

Both using orthogonal decompositions for faster regularization processes and using the Sherman-Morrison formula for adaptivity updates on the system matrix are promising approaches for adaptive sparse grid density estimation. Although orthogonal decompositions in general take longer than the LU or Cholesky decomposition, they allow for fast computation of the inverse system matrix. Changing the regularization parameter $\lambda$ is therefore faster than with the known methods, and can be optimized even further, as explained in Chapter 6. Adaptivity based on Sherman-Morrison rank-one updates is a new approach and yields promising results, with good runtime orders and acceptable speeds in the current implementation. Even more promising is the possible parallelization of the numerous matrix-vector multiplications used in both refinement and coarsening. The solving subroutine in its current implementation is already faster than the corresponding one in the Cholesky-based approach. Solving in the orthogonal-decomposition-based approach also allows for better parallelization than in the Cholesky-based one, because it only consists of matrix-vector products, whereas the Cholesky-based approach has to solve via forward and backward substitution.

8 Appendix

Dimension  Level  Grid Size   Cholesky   Hessenberg   Eigenvalue
2          4          49        0.000        0.000        0.000
4          4         209        0.002        0.010        0.026
3          5         351        0.008        0.071        0.108
2          7         769        0.074        0.710        1.978
3          6        1023        0.481        2.077        5.946
2          8        1793        1.016       12.419       41.597
3          7        2815        4.930       61.556      147.053
2          9        4097       11.935      228.177      474.967
4          7        7937       80.999     1683.156     3343.404
2         10        9217      126.780     2758.517         nope

Table 8.1: The runtimes (in Seconds) of the different decomp. types (sorted by grid size), Cholesky decomposition (Col. 4), orthogonal decomp. into Hessenberg matrix (Col. 5), Eigenvalue decomp. (Col. 6), as seen in Figure 5.1.

Dimension  Level  Grid Size   Cholesky   Sherman-Morrison
2          7         769        0.000        0.007
3          6        1023        0.001        0.015
2          8        1793        0.005        0.044
3          7        2815        0.034        0.098
2          9        4097        0.067        0.199
4          7        7937        0.255        0.717
2         10        9217        0.419        1.024

Table 8.2: The runtimes (in Seconds) of the different refinement types, Cholesky based (Col. 4), Sherman-Morrison based (Col. 5), as seen in Figure 5.2. The rows are sorted by Grid size.


Dimension  Level  Grid Size   Cholesky   Sherman-Morrison
2          7         769        0.005        0.007
3          6        1023        0.012        0.015
2          8        1793        0.039        0.047
3          7        2815        0.091        0.107
2          9        4097        0.184        0.216
4          7        7937        0.673        0.782
2         10        9217        0.906        1.140

Table 8.3: The runtimes (in Seconds) of the different coarsening types, Cholesky based (Col. 4), Sherman-Morrison based (Col. 5), as seen in Figure 5.3. The rows are sorted by Grid size.

Dimension  Level  Grid Size   Cholesky   Sherman-Morrison
2          7         769        0.002        0.002
3          6        1023        0.003        0.004
2          8        1793        0.014        0.013
3          7        2815        0.035        0.032
2          9        4097        0.074        0.064
4          7        7937        0.264        0.197
2         10        9217        0.361        0.258

Table 8.4: The runtimes (in Seconds) of the different solving subroutines, Cholesky based (Col. 4), Sherman-Morrison based (Col. 5), as seen in Figure 5.4. The rows are sorted by Grid size.

Bibliography

[Bar51] M. S. Bartlett. "An Inverse Matrix Adjustment Arising in Discriminant Analysis." In: Annals of Mathematical Statistics 22.1 (1951), pp. 107–111.
[DR08] W. Dahmen and A. Reusken. Numerik für Ingenieure und Naturwissenschaftler. Second edition. Springer, 2008.
[HTF09] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Second edition. Springer, 2009.
[Kre16] S. Kreisel. "Spatial Refinement for Sparse Grid Classifiers." Bachelor's Thesis. 2016.
[Let17] M. Lettrich. "Iterative Incomplete Cholesky Decomposition for Datamining using Sparse Grids." 2017.
[Peh13] B. Peherstorfer. "Model Order Reduction of Parametrized Systems with Sparse Grid Learning Techniques." PhD thesis. 2013.
[Pfl10] D. Pflüger. "Spatially Adaptive Sparse Grids for High-Dimensional Problems." PhD thesis. 2010.
[RH95] B. D. Ripley and N. L. Hjort. Pattern Recognition and Neural Networks. Cambridge University Press, 1995.
[Sie16] A. Sieler. "Refinement and Coarsening of Online-Offline Data Mining Methods with Sparse Grids." Bachelor's Thesis. 2016.
