Czech Technical University in Prague
Faculty of Electrical Engineering
Department of Cybernetics

Master’s Thesis The Linear Direct Sparse Solver on GPU for Bundle Adjustment Method

Bc. Ondrej Ivančík

Supervisor: Ing. Ivan Šimeček, Ph.D.

Study Programme: Open Informatics

Field of Study: Computer Vision and Image Processing

May 11, 2012

Acknowledgements

I would like to thank my supervisor Ivan Šimeček, who enabled me to deal with a very interesting topic, and prof. Olaf Hellwich and Cornelius Wefelscheid, who allowed me to work on my thesis within an individual project at TU Berlin.

Declaration

I hereby declare that I have completed this thesis independently and that I have listed all the literature and publications used. I have no objection to usage of this work in compliance with the act §60 Zákon č. 121/2000Sb. (copyright law), and with the rights connected with the copyright act including the changes in the act.

Prague, May 11, 2012

Abstract

The thesis deals with solving sparse, linear, positive definite systems. It implements the Cholesky decomposition on the CPU, utilizing the CRS format for sparse matrices, a fast AMD ordering, and a symbolic factorization. It analyses the possibilities of parallelizing the Cholesky decomposition for sparse diagonal-based linear systems and for the bundle adjustment problem, where sparse matrices of a specific structure arise. A Cholesky decomposition exploiting the Schur complement is implemented on both the CPU and the GPU side.

Abstrakt

The thesis deals with solving sparse, linear, positive definite systems. It implements the Cholesky decomposition on the CPU using the CRS sparse matrix format, a fast AMD permutation, and a symbolic factorization. It analyses the possibilities of parallelizing the Cholesky decomposition for sparse diagonal-based linear systems and for the bundle adjustment problem, where sparse matrices of a specific structure arise. It proposes and implements the computation of the Cholesky decomposition on the GPU and the CPU by means of the Schur complement.

Contents

1 Introduction
  1.1 Motivation

2 Solving Linear Systems
  2.1 System of Linear Equations
  2.2 Direct Methods for Solving Linear Systems
    2.2.1 Cramer’s Rule
    2.2.2 Forward and Backward Substitution
    2.2.3 Gaussian Elimination
    2.2.4 Gauss-Jordan Elimination
    2.2.5 LU Decomposition
    2.2.6 Cholesky Decomposition
  2.3 Iterative Methods for Solving Linear Systems

3 Sparse Matrices
  3.1 Ordering Methods
    3.1.1 Arrowhead Matrix Example
    3.1.2 Graph Representation
    3.1.3 Bottom-up Ordering Methods
    3.1.4 Top-down Ordering Methods
  3.2 Symbolical Factorization

4 Bundle Adjustment
  4.1 Unconstrained Optimization
    4.1.1 Search Methods
    4.1.2 Levenberg–Marquardt

5 Overview of NVIDIA CUDA
  5.1 The CUDA Execution Model
  5.2 GPU Memory

6 Analysis of the Problem
  6.1 Structure of Linear Systems in BA
  6.2 Block Cholesky Decomposition for BA

7 Implementation
  7.1 Used Framework
  7.2 Compressed Row Storage Format
  7.3 Cholesky decomposition on CPU
  7.4 Ordering for CPU solver
  7.5 Block Matrix Format for GPU
  7.6 Block Cholesky decomposition on GPU
  7.7 Ordering for GPU solver

8 Testing
  8.1 Octave solvers
  8.2 CPU solver
  8.3 GPU solver
  8.4 CUSP solvers

9 Conclusion

A List of Abbreviations

B User Manual
  B.1 Requirements
  B.2 Usage

C Contents of the Attached CD

List of Figures

3.1 The dependence of the reordering of a sparse matrix on the fill-in count
3.2 Ordering example

4.1 Reprojection error

5.1 Block diagram of a GF100 GPU
5.2 Streaming multiprocessor of a GF100 (Fermi) GPU
5.3 Bandwidth of various GPU memory

6.1 An example of a modestly sized Hessian in BA

7.1 Sample of a symmetric positive definite sparse matrix 6 × 6 with 22 nonzero elements
7.2 Performing k-way ordering on diagonal-based matrix ’Wathen 10 × 10’
7.3 Performing k-way ordering on diagonal-based matrix ’Poisson 30’

8.1 Test of Octave solvers
8.2 Test of iterative CUSP solvers. Max. error is the maximal difference with Octave’s reference solution

Chapter 1

Introduction

Finding the solution of a system of linear algebraic equations (2.1) is the most basic task in linear algebra and the heart of many engineering problems. It has been studied for many years, not only for its applications in many branches of scientific computing, but also for its high computational complexity and for the wide variety of methods and approaches that help to solve linear systems of different types faster and more accurately.

Finding a solution of a system of nonlinear algebraic equations can be achieved using iterative solvers whose keystone is solving a linear system in each iteration step in order to approach a sufficiently accurate solution. A linear solver therefore forms a crucial part of a nonlinear solver and, at the same time, its bottleneck.

A widely used optimization method in 3D reconstruction is bundle adjustment. As a nonlinear iterative optimization method, it needs to solve a sparse, often very large linear system of a specific structure many times. Studying a suitable linear solver for bundle adjustment is the main part of my thesis.

1.1 Motivation

One particular and promising approach to speeding up the process of solving systems of linear equations is parallel computation. In the case of dense direct solvers, the parallelization is more straightforward and gives better performance results than for sparse direct solvers. Iterative methods, mostly used for solving large sparse linear systems, are efficiently parallelizable because they use only sparse matrix-vector multiplications and vector additions.


In the last decade, there has been growing interest in general-purpose computation on graphics processing units (GPGPU). Several libraries have been developed which implement basic linear algebra subroutines or even linear solvers for dense matrices (NVIDIA cuBLAS, MAGMA, CULA Dense) and sparse matrices (NVIDIA cuSparse, NVIDIA CUSP, CULA Sparse). At the present time, no implementation of a linear direct solver for general sparse matrices on a GPU exists. The main cause is the problematic fine-grain parallelization and the thread divergence on a GPU.

Sparse matrices consisting of many small independent full blocks on the diagonal, with some dependent parts on the borders, are formed during the computation of bundle adjustment. It seems possible to eliminate these blocks effectively in a parallel manner even on a GPU. The question is which type of solver is more suitable: direct or iterative? My thesis aims to answer it.

Chapter 2

Solving Linear Systems

2.1 System of Linear Equations

Definition 1. A system of m linear equations in n unknowns consists of a set of algebraic relations of the form

$$\sum_{j=1}^{n} a_{ij}x_j = b_i, \qquad i = 1,\dots,m, \qquad (2.1)$$

where $x_j$ are the unknowns, $a_{ij}$ are the coefficients of the system and $b_i$ are the components of the right-hand side. System (2.1) can be more conveniently written in matrix form as

$$\mathbf{A}\mathbf{x} = \mathbf{b}, \qquad (2.2)$$

where $\mathbf{A} = (a_{ij}) \in \mathbb{C}^{m\times n}$ denotes the coefficient matrix, $\mathbf{b} = (b_i) \in \mathbb{C}^{m}$ the right-hand side vector and $\mathbf{x} = (x_i) \in \mathbb{C}^{n}$ the unknown vector, respectively. A solution of (2.2) is any n-tuple of values $x_i$ which satisfies (2.1).

Remark 1. The existence and uniqueness of the solution of (2.2) are ensured if one of the following (equivalent) hypotheses holds:
1. A is invertible,
2. rank(A) = n,
3. the homogeneous system Ax = 0 admits only the null solution.

In the next chapters I will deal with numerical methods for finding the solution of real-valued square systems of order n, that is, systems of the form (2.2) with A ∈ R^{n×n} and x, b ∈ R^n. Such linear systems arise frequently in any

1 This chapter is based on [20] and [21].

branch of science, including bundle adjustment. These numerical methods can generally be divided into two classes. In the absence of roundoff errors, direct methods yield the exact solution in a finite number of steps. Iterative methods require (theoretically) an infinite number of steps to find the exact solution.

2.2 Direct Methods for Solving Systems of Linear Equations

2.2.1 Cramer’s Rule

The solution of system (2.2) is formally provided by Cramer’s rule

$$x_j = \frac{\det(\mathbf{A}_j)}{\det(\mathbf{A})}, \qquad j = 1,\dots,n, \qquad (2.3)$$

where A_j is the matrix obtained by substituting the j-th column of A with the right-hand side b. If the determinants are evaluated by the recursive Laplace rule, the method based on Cramer’s rule turns out to be unacceptable even for small dimensions of A because of its computational cost of (n + 1)! flops. However, Habgood and Arel [11] have recently shown that Cramer’s rule can be implemented in O(n^3) time, which is comparable to more common methods of solving systems of linear equations.

2.2.2 Forward and Backward Substitution

Definition 2. A square matrix with zero entries above the main diagonal (a_{ij} = 0 for i < j) is called lower triangular; a square matrix with zero entries below the main diagonal (a_{ij} = 0 for i > j) is called upper triangular. A lower (upper) triangular matrix is strictly lower (upper) triangular when its entries on the main diagonal are zeros, too.

Example 1. Lower (upper) triangular systems can be easily solved using forward (backward) substitution. For example, the nonsingular 3 × 3 upper triangular system

$$\begin{bmatrix} u_{11} & u_{12} & u_{13} \\ 0 & u_{22} & u_{23} \\ 0 & 0 & u_{33} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix}$$

can be solved in sequence as follows:

$$x_3 = b_3/u_{33},$$
$$x_2 = (b_2 - u_{23}x_3)/u_{22},$$
$$x_1 = (b_1 - u_{12}x_2 - u_{13}x_3)/u_{11}.$$

For a nonsingular upper triangular system of order n (n ≥ 2), the solution can be expressed generally in the form

$$x_n = \frac{b_n}{u_{nn}}, \qquad x_i = \frac{1}{u_{ii}}\Big(b_i - \sum_{j=i+1}^{n} u_{ij}x_j\Big), \quad i = n-1,\dots,1. \qquad (2.4)$$

Analogously, the solution of a nonsingular lower triangular system of order n (n ≥ 2) has the form

$$x_1 = \frac{b_1}{l_{11}}, \qquad x_i = \frac{1}{l_{ii}}\Big(b_i - \sum_{j=1}^{i-1} l_{ij}x_j\Big), \quad i = 2,\dots,n. \qquad (2.5)$$

The number of multiplications and divisions for forward/backward substitution is equal to n(n + 1)/2, while the number of additions and subtractions is n(n − 1)/2. The total operation count for (2.4) and (2.5) is thus n^2.
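The two formulas translate directly into short routines. The following C sketch is illustrative only (dense row-major storage is assumed, with nonzero diagonal entries); it is not code from the thesis implementation.

    /* Forward substitution: solve L x = b for a dense lower triangular L (n x n, row-major).
       Backward substitution: solve U x = b for a dense upper triangular U. */
    void forward_subst(int n, const double *L, const double *b, double *x)
    {
        for (int i = 0; i < n; i++) {
            double s = b[i];
            for (int j = 0; j < i; j++)
                s -= L[i*n + j] * x[j];      /* subtract contributions of known unknowns */
            x[i] = s / L[i*n + i];
        }
    }

    void backward_subst(int n, const double *U, const double *b, double *x)
    {
        for (int i = n - 1; i >= 0; i--) {
            double s = b[i];
            for (int j = i + 1; j < n; j++)
                s -= U[i*n + j] * x[j];
            x[i] = s / U[i*n + i];
        }
    }

The two nested loops visit roughly n^2/2 entries each, matching the operation count above.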

2.2.3 Gaussian Elimination

Let A be a square nonsingular matrix. A linear system Ax = b can be transformed into an equivalent (lower or upper) triangular system Tx = b̂ that has the same solution, using three elementary row operations. The solution of the system is invariant to
1. the multiplication of a row by a nonzero scalar,
2. the addition of one row to another,
3. the swapping of two rows.

The basic idea is to multiply the i-th equation by a nonzero constant and subtract the first equation from it, so that the first unknown in the i-th equation is zeroed out. This is done for all equations from 2 to n. Then the second equation is taken as the reference and all the corresponding unknowns in equations from 3 to n are zeroed. The procedure ends when the system has the form Tx = b̂, with the right-hand side b̂ transformed accordingly. Finally, the solution is obtained by forward substitution (if T is a lower triangular matrix) or backward substitution (if T is an upper triangular matrix).

To complete Gaussian elimination, 2(n − 1)n(n + 1)/3 + n(n − 1) flops are required. To solve the linear system, about 2n^3/3 + 2n^2 flops are needed (with n^2 flops to backsolve the triangular system). Neglecting the lower order terms, the Gaussian elimination process has a cost of 2n^3/3 flops.

2.2.4 Gauss-Jordan Elimination

Gauss-Jordan elimination is slightly different from Gaussian elimination. The transformation of the system using the three elementary row operations is repeated until each equation contains only one of the unknowns, thus giving an immediate solution. The principal deficiencies of this method are that 1. it requires all the right-hand sides to be stored and manipulated at the same time, and 2. it is three times slower than the alternative solvers when the inverse of A is not desired.

2.2.5 LU Decomposition

Suppose that it is possible to write the matrix A as a product of two matrices, A = LU, where L is lower triangular and U is upper triangular. This decomposition can be used to solve the linear system

Ax =(LU)x = L(Ux)= b (2.6) by first solving (by forward substitution) for the vector y such that

Ly = b (2.7) and then solving (by backward substitution) for the vector x such that

Ux = y. (2.8)

Theorem 1. Let A ∈ R^{n×n}. The LU decomposition of A with l_{ii} = 1 for i = 1,...,n exists and is unique iff the principal submatrices A_i of A of order i = 1,...,n − 1 are nonsingular.

The LU decomposition is usually performed in place, to avoid copying and wasting memory when storing the triangular matrices L and U separately, as shown in Algorithm 1. At the end (here only for presentational purposes), the result is stored in the L and U matrices.

2.2.6 Cholesky Decomposition

Theorem 2. Let A ∈ R^{n×n} be a symmetric and positive definite matrix. Then there exists a unique lower triangular matrix L with positive diagonal entries such that

$$\mathbf{A} = \mathbf{L}\mathbf{L}^{\top}. \qquad (2.9)$$

Algorithm 1 LU Decomposition
Require: A square matrix A.
Ensure: A lower triangular matrix L with ones on the main diagonal and an upper triangular matrix U such that LU = A.

    function [L, U] = lu2(A)
      [n,n] = size(A);
      for k = 1:n
        A(k+1:n,k) = A(k+1:n,k) / A(k,k);
        A(k+1:n,k+1:n) = A(k+1:n,k+1:n) - A(k+1:n,k) * A(k,k+1:n);
      end
      L = tril(A,-1) + eye(n);   % ones on the diagonal
      U = triu(A);
    end

The computational cost of the Cholesky decomposition halves, with respect to the LU decomposition, to about n^3/3 flops, because the input matrix A is symmetric. An implementation example of the Cholesky decomposition is coded in Algorithm 2.

Algorithm 2 Cholesky Decomposition
Require: A square positive definite matrix A.
Ensure: A lower triangular matrix L such that LL⊤ = A.

    function [L] = chol2(A)
      [n,n] = size(A);
      for k = 1:n
        A(k,k) = sqrt(A(k,k));
        A(k,k+1:n) = A(k,k+1:n) / A(k,k);
        for i = k+1:n
          A(i,i:n) = A(i,i:n) - A(k,i:n) * A(k,i);
        end
      end
      L = triu(A)';
    end
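The same in-place computation can be sketched in C (the language used later for the implementation). This is an illustrative dense routine, not the thesis code; it reports failure when a non-positive pivot indicates that A is not positive definite.

    /* In-place dense Cholesky: on success the lower triangle of A holds L with A = L*L^T.
       A is n x n, row-major. Returns 0 on success, -1 if A is not positive definite. */
    #include <math.h>

    int chol_dense(int n, double *A)
    {
        for (int k = 0; k < n; k++) {
            double d = A[k*n + k];
            for (int j = 0; j < k; j++)
                d -= A[k*n + j] * A[k*n + j];
            if (d <= 0.0)
                return -1;                       /* not positive definite */
            A[k*n + k] = sqrt(d);
            for (int i = k + 1; i < n; i++) {
                double s = A[i*n + k];
                for (int j = 0; j < k; j++)
                    s -= A[i*n + j] * A[k*n + j];
                A[i*n + k] = s / A[k*n + k];
            }
        }
        return 0;
    }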

2.3 Iterative Methods for Solving Systems of Linear Equations

Iterative methods formally yield the solution x of a linear system after an infinite number of steps. At each step they require the computation of the residual of the system. For full matrices, their computational cost is of the order of n^2 operations per iteration, to be compared with an overall cost of the order of 2n^3/3 operations needed by direct methods. Iterative methods can therefore become competitive with direct methods, provided that the required number of iterations to converge is either independent of n or scales sublinearly with respect to n. The basic idea of iterative methods is to construct a sequence of vectors x^{(k)} that enjoy the property of convergence

$$\mathbf{x} = \lim_{k\to\infty} \mathbf{x}^{(k)},$$

where x is the solution to (2.2). In practice, the iterative process is stopped at the minimum value of k such that ||x^{(k)} − x|| < ε.
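A minimal illustration of this idea is the Jacobi iteration, sketched below in C for a dense matrix. It is only meant to show the structure of an iterative solver and its residual-based stopping criterion; it is not one of the solvers used later in the thesis, and convergence requires a suitable A (e.g. strictly diagonally dominant).

    /* Jacobi iteration for Ax = b (dense, row-major). Stops when the 2-norm of the
       residual r = b - Ax drops below tol or after maxit sweeps. xnew is caller-provided
       workspace of length n. Illustrative sketch only. */
    #include <math.h>

    void jacobi(int n, const double *A, const double *b, double *x,
                double *xnew, int maxit, double tol)
    {
        for (int it = 0; it < maxit; it++) {
            for (int i = 0; i < n; i++) {
                double s = b[i];
                for (int j = 0; j < n; j++)
                    if (j != i)
                        s -= A[i*n + j] * x[j];
                xnew[i] = s / A[i*n + i];
            }
            double rnorm = 0.0;
            for (int i = 0; i < n; i++) {        /* residual of the new iterate */
                double r = b[i];
                for (int j = 0; j < n; j++)
                    r -= A[i*n + j] * xnew[j];
                rnorm += r * r;
                x[i] = xnew[i];
            }
            if (sqrt(rnorm) < tol)
                break;
        }
    }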

Chapter 3

Sparse Matrices

Many engineering problems have to deal with large and sparse matrices. A sparse matrix is a matrix that allows special techniques to take advantage of its large number of zero elements. This definition helps to decide 'how many' zeros a matrix needs in order to be 'sparse'. The answer is that it depends on the structure of the matrix and on what it is being used for. For example, a randomly generated sparse n × n matrix with cn entries scattered randomly throughout the matrix is not sparse in the sense of Wilkinson (for direct methods), since it takes O(n^3) time to factorize (with high probability and for large enough c [9]) [3].

Example 2. Using one of the sparse formats to store 'real' sparse matrices can result in significant computational and storage savings. Consider, for instance, a tridiagonal square matrix with 1,000,000 rows. Storing its 3 million nonzero elements in double precision, together with the row and column index data, consumes approximately 40 MB. Storing the same matrix as a full matrix would consume more than 7 TB. Similarly large differences can be expected in execution times.
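The numbers can be checked directly; assuming a CRS-like layout (see Section 7.2) with 8-byte values and 4-byte indices, which is an assumption made here for the arithmetic:

$$\underbrace{3\cdot 10^{6}\cdot 8\,\mathrm{B}}_{\text{values}} + \underbrace{3\cdot 10^{6}\cdot 4\,\mathrm{B}}_{\text{column indices}} + \underbrace{10^{6}\cdot 4\,\mathrm{B}}_{\text{row pointers}} \approx 40\,\mathrm{MB}, \qquad
10^{6}\cdot 10^{6}\cdot 8\,\mathrm{B} = 8\cdot 10^{12}\,\mathrm{B} \approx 7.3\,\mathrm{TB}.$$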

3.1 Ordering Methods

An unfavourable property of sparse matrices lies in the process of elimination. Some zero values of the input matrix become nonzero during the elimination (fill-ins), and their positions must be precomputed in advance. Reordering techniques try to minimize the amount of fill-in by finding a suitable permutation of the rows and columns of the input matrix. Finding an optimal permutation is, however, an NP-complete problem [26] and could be more time consuming than solving the original linear system; therefore, heuristic approaches that often give near-optimal results are applied.


3.1.1 Arrowhead Matrix Example

Example 3. The operation counts required for the solution of two linear systems Ax = b will be examined. The input matrices are shown in figure 3.1. Even though both matrices have the same number of nonzero elements, a significant reduction in computation is obtained simply by permuting rows and columns.

[Figure 3.1 shows the sparsity patterns of four matrices:]

(a) Left-up arrowhead matrix  (b) Left-up arrowhead matrix after LU  (c) Right-down arrowhead matrix  (d) Right-down arrowhead matrix after LU

Figure 3.1: The dependence of the reordering of a sparse matrix on the fill-in count. • represents nonzero elements of the input matrix, ⋆ fill-ins and empty space zero elements

For the left-up arrowhead matrix 3.1a, the number of multiplications and divisions required by the forward elimination is α = 40 and by the back substitution β = 25. The total number of operations is α + β = 65, and the input sparse matrix becomes full. For the right-down matrix 3.1c, the number of multiplications and divisions required by the forward elimination is α = 8 and by the back substitution β = 13. The total number of operations is α + β = 21, and the input sparse matrix remains sparse.

There are many recent works about ordering schemes. This is because specific problems construct specific types of sparse matrices (band-diagonal, block triangular, block tridiagonal, ...) [20, p. 77]. Below, the most used methods are described. They can be divided into two categories, according to how the elimination tree is built. Most state-of-the-art ordering schemes for sparse matrices are a hybrid of a bottom-up method such as minimum degree and a top-down scheme such as George’s nested dissection.

3.1.2 Graph Representation of Sparse Matrices

To explain the ordering methods, it is convenient to introduce a graph representation of sparse matrices. They are then represented as undirected graphs (the sparse matrix has the structure of an adjacency matrix of this graph). All schemes are described for the undirected graph G = (V, E), E ⊂ V × V, associated with the symmetric matrix S. Let v be a vertex of G. The set of vertices that are adjacent to v is denoted by adj_G(v).

3.1.3 Bottom-up Ordering Methods

Bottom-up methods build the elimination tree from the leaves up to the root. In each iteration k a greedy heuristic is applied to G_{k−1} to select a vertex for elimination. This section briefly describes two of the most popular bottom-up algorithms, the minimum degree and the minimum deficiency ordering heuristics.

Minimum Degree Ordering. As mentioned above, at each iteration k the minimum degree algorithm eliminates a vertex v that minimizes the number of adjacent vertices deg_{G_{k−1}}(v) = |adj_{G_{k−1}}(v)|. The algorithm is a symmetric variant of the Markowitz scheme [15] and was first applied to sparse symmetric factorization by Tinney and Walker [22]. Over the years many enhancements have been proposed to the basic algorithm that have greatly improved its efficiency.

Minimum Deficiency Fill. A less popular bottom-up scheme is the minimum deficiency or minimum local fill heuristic. The exact amount of fill is used to select a vertex for elimination. The minimum deficiency algorithm has received much less attention because of its prohibitive runtime.

3.1.4 Top-down Ordering Methods

The most popular top-down scheme is George’s nested dissection algorithm [7, 8]. The basic idea of this approach is to find a subset of vertices S in G whose removal partitions G into two subgraphs G(B) and G(W) with V = S ∪ B ∪ W and |B|, |W| ≤ α|V| for some 0 < α < 1. Such a partition of G is denoted by (S, B, W). The set S is called a vertex separator of G. If we order the vertices in S after the (black) vertices in B and the (white) vertices in W, no fill-edge can occur between B and W. Typically, the columns corresponding to S constitute a full off-diagonal block in the Cholesky factor. Therefore, S is supposed to be small. Once S has been found, the algorithm is recursively applied to each connected component of G(B) and G(W) until a component consists of a single vertex or a clique. In this way the elimination tree is built from the root down to the leaves.

Graph partitioning heuristics are usually divided into construction and improvement heuristics. A construction heuristic takes the graph as input and computes an initial separator from scratch. An improvement heuristic tries to minimize the size of a separator through a sequence of elementary steps.

Some ordering methods are implemented in MATLAB as standard functions (colperm, symrcm, colamd, symamd, amd, dmperm), and I have tested some of them (see figure 3.2).

3.2 Symbolical Factorization

Symbolical factorization is a step executed before the numerical factorization. It precomputes the positions of the fill-ins (see also 3.1) that appear during the factorization process when one row is added to another. It can be seen on the Cholesky or LU factors that they are often much denser than the original matrices (see figure 3.2). The CRS format stores only nonzero elements, and therefore the space needed for fill-ins must be allocated before the numerical factorization. The naïve solution is to run a slightly changed numerical factorization and store the new nonzero entries. Since the symbolical factorization works only with indices to determine the structure of the Cholesky or LU factors, it can be computed much faster than the full numerical factorization. When implementing my symbolical factorization I have used a great information source [13].
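The naïve variant mentioned above can be pictured with the following toy C sketch, which simulates the elimination on a dense boolean nonzero pattern and records every position that becomes nonzero. It is an illustration only; the thesis routine works on the CRS index arrays instead of a dense pattern.

    /* Naive symbolic factorization on a dense boolean pattern (1 = structurally nonzero).
       Eliminating row/column k makes position (i,j) nonzero whenever (i,k) and (k,j)
       are nonzero (no pivoting assumed). Returns the number of new nonzero positions. */
    int symbolic_fill(int n, unsigned char *P)   /* P is n x n, row-major, modified in place */
    {
        int fill = 0;
        for (int k = 0; k < n; k++)
            for (int i = k + 1; i < n; i++)
                if (P[i*n + k])
                    for (int j = k + 1; j < n; j++)
                        if (P[k*n + j] && !P[i*n + j]) {
                            P[i*n + j] = 1;      /* a fill-in appears here */
                            fill++;
                        }
        return fill;
    }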

[Figure 3.2 consists of six 300 × 300 sparsity plots of the LU factors obtained with different orderings: no ordering: fill-ins = 13309; colperm: fill-ins = 30627; symrcm: fill-ins = 13040; colamd: fill-ins = 9569; symamd: fill-ins = 6681; amd: fill-ins = 6583.]

Figure 3.2: Applying different ordering methods and displaying the LU factors. Nonzeros are in black, fill-ins in gray color

Chapter 4

Bundle Adjustment

Three-dimensional (3D) reconstruction is a problem that appears often in many computer vision tasks. 3D reconstruction can be defined as the problem of using 2D measurements arising from a set of images depicting the same scene from different viewpoints, aiming to derive information related to the scene geometry as well as the relative motion and the optical characteristics of the camera(s) employed to acquire these images. Bundle adjustment (BA) is almost invariably used as the last step of every feature-based 3D reconstruction algorithm [14, p. 1–2].

Bundle adjustment is the problem of refining a visual reconstruction to produce jointly optimal 3D structure and viewing parameter (camera pose and/or calibration) estimates. Optimal means that the parameter estimates are found by minimizing some cost function that quantifies the model fitting error, and jointly that the solution is simultaneously optimal with respect to both structure and camera variations. The name refers to the ‘bundles’ of light rays leaving each 3D feature and converging on each camera centre, which are ‘adjusted’ optimally with respect to both feature and camera positions. Equivalently, unlike independent model methods, which merge partial reconstructions without updating their internal structure, all of the structure and camera parameters are adjusted together ‘in one bundle’ [23].

BA boils down to minimizing the reprojection error (4.1) between the observed and predicted image points, which is expressed as the sum of squares of a large number of nonlinear, real-valued functions. Thus, the minimization is achieved using nonlinear least-squares algorithms [4], of which Levenberg–Marquardt has proven to be the most successful due to its ease of implementation and its use of an effective damping strategy that lends it the ability to converge quickly from a wide range of initial guesses [12].


Figure 4.1: Reprojection error [17]

4.1 Unconstrained Optimization

The aim of unconstrained optimization is to find x* such that

$$\mathbf{x}^{*} = \arg\min_{\mathbf{x}\in\mathbb{R}^n} f(\mathbf{x}). \qquad (4.1)$$

The point x* is called a global minimizer of f if f(x*) ≤ f(x) for all x ∈ R^n, while x* is called a local minimizer of f if a neighborhood N of x* exists such that f(x*) ≤ f(x) for all x ∈ N. The vector of first partial derivatives of the function f (which must be continuously differentiable) with respect to the vector x is denoted by

$$\nabla f(\mathbf{x}) = \Big(\frac{\partial f}{\partial x_1}(\mathbf{x}),\dots,\frac{\partial f}{\partial x_n}(\mathbf{x})\Big)^{\!\top}$$

and called the gradient of f at a point x. If d is a non-null vector in R^n, then the directional derivative of f with respect to d is

$$\frac{\partial f}{\partial \mathbf{d}}(\mathbf{x}) = \lim_{\alpha\to 0}\frac{f(\mathbf{x}+\alpha\mathbf{d}) - f(\mathbf{x})}{\alpha}$$

and satisfies ∂f(x)/∂d = [∇f(x)]⊤d. Moreover, denoting by (x, x + αd) the segment in R^n joining the points x and x + αd, with α ∈ R, Taylor’s expansion ensures that there exists ξ ∈ (x, x + αd) such that

$$f(\mathbf{x}+\alpha\mathbf{d}) - f(\mathbf{x}) = \alpha\nabla f(\xi)^{\top}\mathbf{d}.$$

1 This chapter is based on [21].

If f is twice continuously differentiable, we denote by H(x) (or ∇²f(x)) the Hessian matrix of f evaluated at a point x, whose entries are

$$h_{ij}(\mathbf{x}) = \frac{\partial^2 f(\mathbf{x})}{\partial x_i \partial x_j}, \qquad i,j = 1,\dots,n.$$

In such a case it can be shown that, if d ≠ 0, the second-order directional derivative exists and

$$\frac{\partial^2 f}{\partial \mathbf{d}^2}(\mathbf{x}) = \mathbf{d}^{\top}\mathbf{H}(\mathbf{x})\mathbf{d}.$$

For a suitable ξ ∈ (x, x + d) we also have

$$f(\mathbf{x}+\mathbf{d}) - f(\mathbf{x}) = \nabla f(\mathbf{x})^{\top}\mathbf{d} + \frac{1}{2}\mathbf{d}^{\top}\mathbf{H}(\xi)\mathbf{d}.$$

Existence and uniqueness of a solution of (4.1) is not guaranteed in R^n. Nevertheless, it can be proved that the gradient of f at a local minimizer x* equals the null vector. This condition is necessary for optimality to hold. It also becomes sufficient if f is a convex function on R^n, i.e., such that for all x, y ∈ R^n and for any α ∈ [0, 1]

$$f[\alpha\mathbf{x} + (1-\alpha)\mathbf{y}] \le \alpha f(\mathbf{x}) + (1-\alpha)f(\mathbf{y}).$$

4.1.1 Search Methods

Analytical methods can be used only for simple problems (the brachistochrone problem, univariate minimization). Numerical methods must be used for most engineering optimization problems, which are too large and complex to solve analytically. Numerical methods can be divided into two classes:

Gradient-based methods are efficient for many variables and for a smooth objective function. Their drawback is only local convergence.

Derivative-free methods are suitable for problems where gradients are not available, the objective function is not differentiable, or a global minimizer is sought.

Gradient-based descent methods compute a direction d^{(k)} and a positive parameter (step length) α^{(k)} at each iteration k with the help of the gradient and the Hessian. Algorithm 3 shows the skeleton of this method. The way the direction d^{(k)} and the step length α^{(k)} are computed defines a specific descent method.

Algorithm 3 Descent method

Require: ∇f(x), H(x) and a starting point x0. Ensure: A local minimizer x∗.

1: k ← 0
2: while (not converged) do
3:   compute direction d^{(k)} and step length α^{(k)}
4:   x^{(k+1)} ← x^{(k)} + α^{(k)} d^{(k)}
5:   k ← k + 1
6: end while
7: return x^{(k)}

Newton’s method computes

d(k) = −H−1(x(k))∇f(x(k)),

where H is positive definite within a sufficiently large neighborhood of point x∗; inexact Newton’s method

d(k) = −B−1(x(k))∇f(x(k)),

where B(x(k)) is a suitable approximation of H(x(k)); gradient (steepest descent) method

d(k) = −∇f(x(k)); conjugate gradient method

d(k) = −∇f(x(k))+ β(k)d(k−1),

where β(k) is a scalar to be suitably selected in such a way that the directions d(k) turn out to be mutually orthogonal with respect to a suitable scalar product.

4.1.2 Levenberg–Marquardt Algorithm

The Levenberg–Marquardt (LM) algorithm, also known as the damped least-squares method, provides a numerical solution to the problem of minimizing a function, generally nonlinear, over a space of parameters of the function. It can be thought of as a combination of Gauss–Newton and the steepest descent method. When the current solution is far from a local minimum, the algorithm behaves like a steepest descent method: slow, but guaranteed to converge. When the current solution is close to a local minimum, it becomes a Gauss–Newton method and exhibits fast convergence. For these reasons, the LM algorithm is mostly used in bundle adjustment.

Let f be an assumed functional relation which maps a parameter vector p ∈ R^m to an estimated measurement vector x̂ = f(p), x̂ ∈ R^n. An initial parameter estimate p_0 and a measured vector x are provided, and it is desired to find the vector p* that best satisfies the functional relation f locally, that is, minimizes the squared distance ε⊤ε with ε = x − x̂ for all p within a sphere having a certain, small radius. The basis of the LM algorithm is an affine approximation to f in the neighborhood of p. For a small ||δ_p||, f is approximated by (see [5, p. 75])

$$f(\mathbf{p} + \delta_{\mathbf{p}}) \approx f(\mathbf{p}) + \mathbf{J}\delta_{\mathbf{p}},$$

where J is the Jacobian of f. At each iteration, it is required to find the step δ_p that minimizes the quantity ||x − f(p + δ_p)|| ≈ ||x − f(p) − Jδ_p|| = ||ε − Jδ_p||. The minimum is attained when Jδ_p − ε is orthogonal to the column space of J. This leads to J⊤(Jδ_p − ε) = 0, which yields δ_p as the solution of the so-called normal equations [10]:

$$\mathbf{J}^{\top}\mathbf{J}\,\delta_{\mathbf{p}} = \mathbf{J}^{\top}\boldsymbol{\epsilon}. \qquad (4.2)$$

The matrix J⊤J in the above equation is the first order approximation to the Hessian of ½ε⊤ε [16], and δ_p is the Gauss–Newton step. J⊤ε corresponds to the steepest descent direction, since the gradient of ½ε⊤ε is −J⊤ε. The LM algorithm actually solves a slight variation of Equation (4.2), known as the augmented normal equations

$$\mathbf{N}\,\delta_{\mathbf{p}} = \mathbf{J}^{\top}\boldsymbol{\epsilon}, \qquad \text{with } \mathbf{N} \equiv \mathbf{J}^{\top}\mathbf{J} + \mu\mathbf{I},\ \mu > 0. \qquad (4.3)$$

The strategy of altering the diagonal elements of J⊤J is called damping, and µ is referred to as the damping term. It is decreased when the updated parameter vector p + δ_p, with δ_p computed from Equation (4.3), leads to a reduction in the error ε⊤ε; otherwise it is increased, the augmented normal equations are solved again, and this process iterates until a value of δ_p that decreases the error is found.
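The damping strategy can be summarized by the following C sketch. All helper functions and the constant NP are hypothetical placeholders standing in for problem-specific code; they are not part of the thesis implementation, and the factors 0.1 and 10 for adapting µ are only a common choice.

    /* Sketch of the Levenberg-Marquardt damping loop (illustration only). */
    #include <string.h>

    #define NP 64   /* number of parameters; illustrative constant */

    extern double error_at(const double *p);                  /* ||x - f(p)||^2 (hypothetical)      */
    extern void   build_normal_eqs(const double *p,
                                   double *JtJ, double *Jte); /* forms J^T J and J^T eps at p       */
    extern void   solve_augmented(const double *JtJ, const double *Jte,
                                  double mu, double *dp);     /* solves (J^T J + mu I) dp = J^T eps */

    void lm(double p[NP], int maxit, double mu)
    {
        static double JtJ[NP * NP];
        double Jte[NP], dp[NP], trial[NP];
        double err = error_at(p);

        for (int it = 0; it < maxit; it++) {
            build_normal_eqs(p, JtJ, Jte);
            for (;;) {
                solve_augmented(JtJ, Jte, mu, dp);    /* augmented normal equations (4.3) */
                for (int i = 0; i < NP; i++)
                    trial[i] = p[i] + dp[i];
                double trial_err = error_at(trial);
                if (trial_err < err) {                /* error reduced: accept, relax damping */
                    memcpy(p, trial, sizeof trial);
                    err = trial_err;
                    mu *= 0.1;
                    break;
                }
                mu *= 10.0;                           /* error grew: increase damping, retry */
            }
        }
    }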

Chapter 5

Overview of NVIDIA CUDA

By introducing CUDA (Compute Unified Device Architecture), NVIDIA has given programmers the opportunity to capitalize on inexpensive, generally available, massively parallel computing hardware. Teraflop computing is now within the economic reach of most people around the world. The impact of GPGPU (General-Purpose Graphics Processing Unit) technology spans all aspects of computation, from cell phones to the largest supercomputers. Programmable GPUs are deployed in areas of scientific computing, cloud computing, computer visualization, simulations, games, and more.

Programming for GPGPU requires a basic knowledge of the GPU architecture, because even small changes in data structures or in the program can make significant differences in performance. Modern GPUs belong in principle to the SIMD class of Flynn’s taxonomy. That means that GPUs are capable of doing the same operation on multiple data simultaneously. The restriction to one operation at a time reduces the set of problems worth parallelizing on a GPU. On the other hand, well-vectorized problems are able to achieve an acceleration of two or more orders of magnitude over multi-core processors2.

To ensure the best performance of a GPGPU, the following three rules should be met.

1. Get the data on the GPGPU and keep it there. GPGPUs are separate devices plugged into the PCI Express bus of the host computer, which is very slow compared to the GPGPU memory system (20 to 28 times slower).

2. Give the GPGPU enough work to do. CUDA-enabled GPUs deliver teraflop performance and they are fast enough to complete small problems faster than the host processor can start kernels. Each thread should perform as many instructions as possible to hide this latency.

1 This chapter is based on [6].
2 Top 100 NVIDIA CUDA application showcase speedups (min 100, max 2600, median 1350), published May 9, 2011.


3. Focus on data reuse within the GPGPU to avoid memory bandwidth limitations. All high-performance CUDA applications exploit internal resources on the GPU (registers, shared memory) to bypass global memory bottlenecks.

5.1 The CUDA Execution Model

The heart of CUDA performance lies in the execution model and the simple partitioning of a computation into fixed-sized blocks of threads in the execution configuration. CUDA naturally maps the parallelism within an application to the massive parallelism of the GPGPU hardware. The result is compatibility across older and future generations of GPUs.

GPU hardware parallelism is achieved through replication of a common architectural building block called a streaming multiprocessor (SM). Figure 5.1 illustrates the 16 SMs on a GF100 (Fermi) series GPGPU. The software abstraction of a thread block translates into a natural mapping of the kernel onto an arbitrary number of SMs on a GPGPU. Each SM can be scheduled (by the GigaThread global scheduler) to run one or more thread blocks. Thread blocks are therefore independent and cannot be synchronized during kernel execution3. A thread block also acts as a container for thread cooperation, as only threads in the same thread block can share data. Threads in a thread block can utilize the high-speed memory inside the SM, called shared memory, for data sharing.

Figure 5.2b depicts the composition of one of the 16 streaming multiprocessors in a GF100 GPU. SIMD cores require less power and space than non-SIMD cores. As a result, GPGPUs have a high flop-per-watt ratio compared to conventional CPUs [25]. The threads running on a multiprocessor are partitioned into groups in which all threads execute the same instruction simultaneously. On the CUDA architecture, these groups are called warps, each warp has 32 threads, and this execution model is referred to as SIMT (Single Instruction Multiple Threads) [18].

GPGPUs are not true SIMD machines (but SIMT), since only the individual streaming multiprocessors are SIMD and different multiprocessors may be running different instructions. Conditionals (if statements) can decrease performance inside an SM because each branch of each conditional must be evaluated. This can cause a slowdown of 2^n for n nested conditionals.

3 Atomic operations are an exception; they allow threads of different blocks to communicate. This approach should be used only in justified situations, as using atomic operations may introduce scalability and performance issues.

Figure 5.1: Block diagram of a GF100 (Fermi) GPU [2]

5.2 GPU Memory

For the highest performance of applications developed for a GPU, data inside the SM must be reused. The reason is that the on-board global memory (DRAM in figure 5.2a) is not fast enough when all SMs want to perform read/write operations. CUDA provides configurable caches for each SM to give the opportunity for data reuse. Awareness of the difference between on-board (GPU) and on-chip (SM) memory is the key to achieving the highest performance that a GPGPU can provide.

The fastest and most scalable memory is the on-chip SM memory. However, it is limited to a few KB. The on-board global memory is accessible by all the SMs across the GPU and is measured in GB. The significant bandwidth gaps between on-board and on-chip memories can be seen in figure 5.3. Although the bandwidth of shared memory can greatly accelerate applications, global memory is too slow to achieve peak performance [24].

Example 4. Computing a simple element-wise vector product

    for( i = 0; i < N; i++ )
        c[i] = a[i] * b[i];

(a) Memory hierarchy [1] (b) Block diagram [1]

Figure 5.2: Streaming multiprocessor of a GF100 (Fermi) GPU

Register memory   ≈ 8000 GB/s
Shared memory     ≈ 1600 GB/s
Global memory       177 GB/s
Mapped memory       ≈ 8 GB/s

Figure 5.3: Bandwidth of various GPU memory [6, p. 111]

Running this loop on a GPU utilizing only global memory gives limited performance. When 4-byte floating-point values are used, a 1 Tflop GPU would require 12 TB/s of memory bandwidth. A GPU with 177 GB/s of memory bandwidth can only deliver about 14 Gflop (1.4% of the potential 1 Tflop performance).
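The arithmetic behind these figures (each loop iteration performs one flop and touches three 4-byte values):

$$3 \cdot 4\,\mathrm{B} = 12\,\mathrm{B/flop}, \qquad 10^{12}\,\mathrm{flop/s} \cdot 12\,\mathrm{B/flop} = 12\,\mathrm{TB/s}, \qquad \frac{177\,\mathrm{GB/s}}{12\,\mathrm{B/flop}} \approx 14.75\,\mathrm{Gflop/s}.$$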

When programming for a GPU, it is necessary to reuse data within the SM (to exploit data locality). GPGPUs support two types of data locality: temporal locality means that recently accessed data is likely to be used again in the near future (the idea behind LRU, Least Recently Used, caching), and spatial locality means that neighbouring data is cached because it is likely to be used in the future.

For compute capability 2.0 or higher, constant or texture memory, traditionally used for efficient broadcasting of data to all threads, is outperformed by the global memory. This is because compute capability 2.0 devices contain SMs with an L1 cache and a unified L2 cache that speed up access to the global memory.

Chapter 6

Analysis of the Problem

As I have mentioned in the Introduction, finding the solution of a linear system is the most compute-demanding part of the problem of solving a nonlinear system. At each iteration, a linear system Ax = b must be solved. Bundle adjustment (BA), as a least squares problem, works with sparse linear systems of a special structure (doubly bordered block diagonal). A similar structure can be obtained by applying nested dissection ordering to a diagonal-based matrix A (band-diagonal, block tridiagonal, ...). The implemented GPU solver can be used for BA, when the information about the structure of the matrix A is provided by the BA configuration, or for a diagonal-based matrix, when the information about the structure is provided by the ordering function.

6.1 Structure of Linear Systems in BA

The system of augmented normal equations (4.3) arises in BA and is solved at each iteration of the Levenberg–Marquardt algorithm. The matrix J is the Jacobian and N is the first order approximation of the Hessian. The structure of N can be exactly determined from the input parameters of the BA problem.

Example 5. [14, p. 9] Consider that we want to optimize the parameters of 3 cameras and 4 3D points visible in all cameras. The measurement vector X = (x11⊤, x12⊤, x13⊤, x21⊤, x22⊤, x23⊤, x31⊤, x32⊤, x33⊤, x41⊤, x42⊤, x43⊤)⊤ is made up of the measured image point coordinates across all cameras (x_ij being the projection of the i-th point in the j-th camera). The parameter vector P = (a1⊤, a2⊤, a3⊤, b1⊤, b2⊤, b3⊤, b4⊤)⊤ is defined by all the parameters describing the 3 projection matrices and the 4 3D points. Let A_ij and B_ij denote ∂x̂_ij/∂a_j and ∂x̂_ij/∂b_i, respectively; note that ∂x̂_ij/∂a_k = 0 for all j ≠ k and ∂x̂_ij/∂b_k = 0 for all i ≠ k. Employing

this notation, the Jacobian can be written as

$$\mathbf{J} = \frac{\partial\mathbf{X}}{\partial\mathbf{P}} = \begin{bmatrix}
\mathbf{A}_{11} & \mathbf{0} & \mathbf{0} & \mathbf{B}_{11} & \mathbf{0} & \mathbf{0} & \mathbf{0} \\
\mathbf{0} & \mathbf{A}_{12} & \mathbf{0} & \mathbf{B}_{12} & \mathbf{0} & \mathbf{0} & \mathbf{0} \\
\mathbf{0} & \mathbf{0} & \mathbf{A}_{13} & \mathbf{B}_{13} & \mathbf{0} & \mathbf{0} & \mathbf{0} \\
\mathbf{A}_{21} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{B}_{21} & \mathbf{0} & \mathbf{0} \\
\mathbf{0} & \mathbf{A}_{22} & \mathbf{0} & \mathbf{0} & \mathbf{B}_{22} & \mathbf{0} & \mathbf{0} \\
\mathbf{0} & \mathbf{0} & \mathbf{A}_{23} & \mathbf{0} & \mathbf{B}_{23} & \mathbf{0} & \mathbf{0} \\
\mathbf{A}_{31} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{B}_{31} & \mathbf{0} \\
\mathbf{0} & \mathbf{A}_{32} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{B}_{32} & \mathbf{0} \\
\mathbf{0} & \mathbf{0} & \mathbf{A}_{33} & \mathbf{0} & \mathbf{0} & \mathbf{B}_{33} & \mathbf{0} \\
\mathbf{A}_{41} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{B}_{41} \\
\mathbf{0} & \mathbf{A}_{42} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{B}_{42} \\
\mathbf{0} & \mathbf{0} & \mathbf{A}_{43} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{B}_{43}
\end{bmatrix}. \qquad (6.1)$$

Then the approximation of the Hessian (the matrix N from Equation (4.3)) has the form

$$\begin{bmatrix}
\mathbf{U}_1 & \mathbf{0} & \mathbf{0} & \mathbf{W}_{11} & \mathbf{W}_{21} & \mathbf{W}_{31} & \mathbf{W}_{41} \\
\mathbf{0} & \mathbf{U}_2 & \mathbf{0} & \mathbf{W}_{12} & \mathbf{W}_{22} & \mathbf{W}_{32} & \mathbf{W}_{42} \\
\mathbf{0} & \mathbf{0} & \mathbf{U}_3 & \mathbf{W}_{13} & \mathbf{W}_{23} & \mathbf{W}_{33} & \mathbf{W}_{43} \\
\mathbf{W}_{11}^{\top} & \mathbf{W}_{12}^{\top} & \mathbf{W}_{13}^{\top} & \mathbf{V}_1 & \mathbf{0} & \mathbf{0} & \mathbf{0} \\
\mathbf{W}_{21}^{\top} & \mathbf{W}_{22}^{\top} & \mathbf{W}_{23}^{\top} & \mathbf{0} & \mathbf{V}_2 & \mathbf{0} & \mathbf{0} \\
\mathbf{W}_{31}^{\top} & \mathbf{W}_{32}^{\top} & \mathbf{W}_{33}^{\top} & \mathbf{0} & \mathbf{0} & \mathbf{V}_3 & \mathbf{0} \\
\mathbf{W}_{41}^{\top} & \mathbf{W}_{42}^{\top} & \mathbf{W}_{43}^{\top} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{V}_4
\end{bmatrix}
\begin{bmatrix} \delta_{\mathbf{a}_1} \\ \delta_{\mathbf{a}_2} \\ \delta_{\mathbf{a}_3} \\ \delta_{\mathbf{b}_1} \\ \delta_{\mathbf{b}_2} \\ \delta_{\mathbf{b}_3} \\ \delta_{\mathbf{b}_4} \end{bmatrix}
=
\begin{bmatrix} \boldsymbol{\epsilon}_{\mathbf{a}_1} \\ \boldsymbol{\epsilon}_{\mathbf{a}_2} \\ \boldsymbol{\epsilon}_{\mathbf{a}_3} \\ \boldsymbol{\epsilon}_{\mathbf{b}_1} \\ \boldsymbol{\epsilon}_{\mathbf{b}_2} \\ \boldsymbol{\epsilon}_{\mathbf{b}_3} \\ \boldsymbol{\epsilon}_{\mathbf{b}_4} \end{bmatrix}. \qquad (6.2)$$

Denoting the upper left, lower right, and upper right parts of the matrix in Equation (6.2) by U, V and W, respectively, allows us to rewrite the augmented normal equations (4.3) compactly as

$$\begin{bmatrix} \mathbf{U}^{*} & \mathbf{W} \\ \mathbf{W}^{\top} & \mathbf{V}^{*} \end{bmatrix} \begin{bmatrix} \delta_{\mathbf{a}} \\ \delta_{\mathbf{b}} \end{bmatrix} = \begin{bmatrix} \boldsymbol{\epsilon}_{\mathbf{a}} \\ \boldsymbol{\epsilon}_{\mathbf{b}} \end{bmatrix}, \qquad (6.3)$$

where * designates the augmentation of the diagonal elements of U and V. Now, let us compare the structure of the Hessian in Equation (6.2) with the Hessian of a bigger BA problem (figure 6.1). The upper left part (U) corresponds to the approximation of the second derivatives with respect to the camera parameters, the lower right part (V) to the approximation of the second derivatives with respect to the 3D points, and the upper right part (W) to the mixed derivatives involving both camera and point parameters.

6.2 Block Cholesky Decomposition for BA

Lourakis and Argyros [14] suggest solving the augmented normal equations (6.3) arising in BA in two steps (first for δ_a and then for δ_b) as follows.

(a) Original input matrix  (b) Rotated by 180 degrees, with marked parts (see also figure 7.1 for comparison)

Figure 6.1: An example of a modestly sized Hessian in BA. This is the sparsity pattern of a 992 × 992 set of normal equations (i.e. the approximate Hessian). Black regions correspond to nonzero elements [14, p. 27]

Left multiplication of Equation (6.3) by the block matrix

$$\begin{bmatrix} \mathbf{I} & -\mathbf{W}\mathbf{V}^{*-1} \\ \mathbf{0} & \mathbf{I} \end{bmatrix} \qquad (6.4)$$

results in

$$\begin{bmatrix} \mathbf{U}^{*} - \mathbf{W}\mathbf{V}^{*-1}\mathbf{W}^{\top} & \mathbf{0} \\ \mathbf{W}^{\top} & \mathbf{V}^{*} \end{bmatrix} \begin{bmatrix} \delta_{\mathbf{a}} \\ \delta_{\mathbf{b}} \end{bmatrix} = \begin{bmatrix} \boldsymbol{\epsilon}_{\mathbf{a}} - \mathbf{W}\mathbf{V}^{*-1}\boldsymbol{\epsilon}_{\mathbf{b}} \\ \boldsymbol{\epsilon}_{\mathbf{b}} \end{bmatrix}.$$

Since the top right block of the above left-hand matrix is zero, δ_a can be determined from its top half, which is

$$(\mathbf{U}^{*} - \mathbf{W}\mathbf{V}^{*-1}\mathbf{W}^{\top})\,\delta_{\mathbf{a}} = \boldsymbol{\epsilon}_{\mathbf{a}} - \mathbf{W}\mathbf{V}^{*-1}\boldsymbol{\epsilon}_{\mathbf{b}}. \qquad (6.5)$$

The matrix S ≡ U* − WV*^{-1}W⊤ is the Schur complement of V* in the left-hand side matrix of (6.3) and is also positive definite [19]. Linear system (6.5) is solved for δ_a using the Cholesky decomposition of S. Then δ_b is computed by solving

$$\mathbf{V}^{*}\delta_{\mathbf{b}} = \boldsymbol{\epsilon}_{\mathbf{b}} - \mathbf{W}^{\top}\delta_{\mathbf{a}}.$$

This approach has a big advantage: an absence of fill-ins during the computation. The approach explained in the next example is slightly different [21, p. 102].

Example 6. Let A ∈ R^{n×n} be a symmetric positive definite matrix that can be divided into 4 submatrices A11, A12, A21 and A22. Then, according to Theorem 2, the Cholesky decomposition A = LL⊤ exists, where L is a lower triangular matrix with strictly positive diagonal entries. If the matrix A consists of 4 submatrices, the equation A = LL⊤ can be rewritten as

$$\mathbf{A} = \begin{bmatrix} \mathbf{A}_{11} & \mathbf{A}_{21}^{\top} \\ \mathbf{A}_{21} & \mathbf{A}_{22} \end{bmatrix} = \begin{bmatrix} \mathbf{L}_{11} & \mathbf{0} \\ \mathbf{L}_{21} & \mathbf{L}_{22} \end{bmatrix} \begin{bmatrix} \mathbf{L}_{11}^{\top} & \mathbf{L}_{21}^{\top} \\ \mathbf{0} & \mathbf{L}_{22}^{\top} \end{bmatrix}.$$

The aim of the block Cholesky decomposition is to compute the values in the submatrices L11, L21, L22 (or L11⊤, L21⊤, L22⊤, respectively). The whole process can be divided into the following steps:

1. A11 = L11 L11⊤ (Cholesky decomposition),
2. L21⊤ = L11^{-1} A21⊤ from A21⊤ = L11 L21⊤, or equivalently L21 = A21 L11^{-⊤} from A21 = L21 L11⊤,
3. A22 − L21 L21⊤ = L22 L22⊤ (Cholesky decomposition).

During the decomposition process, the first two steps can be done simultaneously. The last step updates the submatrix A22 with the matrix A22^S, called the Schur complement of A11 in the matrix A, which can be expressed as

$$\mathbf{A}_{22}^{S} = \mathbf{A}_{22} - \mathbf{A}_{21}\mathbf{A}_{11}^{-1}\mathbf{A}_{21}^{\top}
= \mathbf{A}_{22} - \mathbf{L}_{21}\mathbf{L}_{11}^{\top}(\mathbf{L}_{11}\mathbf{L}_{11}^{\top})^{-1}\mathbf{L}_{11}\mathbf{L}_{21}^{\top}
= \mathbf{A}_{22} - \mathbf{L}_{21}(\mathbf{L}_{11}^{\top}\mathbf{L}_{11}^{-\top})(\mathbf{L}_{11}^{-1}\mathbf{L}_{11})\mathbf{L}_{21}^{\top}
= \mathbf{A}_{22} - \mathbf{L}_{21}\mathbf{L}_{21}^{\top}. \qquad (6.6)$$
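These steps can be sketched compactly in C for dense blocks. The sketch below is illustrative only: it reuses the dense chol_dense and forward_subst routines sketched in Chapter 2, stores every block as a separate dense row-major array, and is not the block format or code used later in the thesis.

    /* Block Cholesky of A = [A11 A21^T; A21 A22] via the Schur complement.
       A11 is n1 x n1, A21 is n2 x n1, A22 is n2 x n2 (all dense, row-major).
       On return A11 holds L11 (lower triangle), A21 holds L21, A22 holds L22. */
    int  chol_dense(int n, double *A);                                   /* from Chapter 2 */
    void forward_subst(int n, const double *L, const double *b, double *x);

    int block_chol(int n1, int n2, double *A11, double *A21, double *A22)
    {
        /* step 1: A11 = L11 L11^T */
        if (chol_dense(n1, A11) != 0)
            return -1;

        /* step 2: L21 = A21 L11^{-T}, i.e. solve L11 * (row of L21)^T = (row of A21)^T */
        for (int i = 0; i < n2; i++) {
            double tmp[64];                           /* assumes n1 <= 64 in this sketch */
            forward_subst(n1, A11, &A21[i*n1], tmp);  /* uses only the lower triangle of A11 */
            for (int j = 0; j < n1; j++)
                A21[i*n1 + j] = tmp[j];
        }

        /* step 3: Schur complement A22 <- A22 - L21 L21^T, then A22 = L22 L22^T */
        for (int i = 0; i < n2; i++)
            for (int j = 0; j <= i; j++) {
                double s = 0.0;
                for (int k = 0; k < n1; k++)
                    s += A21[i*n1 + k] * A21[j*n1 + k];
                A22[i*n2 + j] -= s;
                if (i != j)
                    A22[j*n2 + i] -= s;               /* keep the stored block symmetric */
            }
        return chol_dense(n2, A22);
    }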

Example 7. This method allows parallel computation when the diagonal blocks are independent, as for example in the linear system (6.7). Blocks A11 and A22 have no mutually dependent elements (A12 and A21 are zero matrices).

$$\begin{bmatrix} \mathbf{A}_{11} & \mathbf{0} & \mathbf{A}_{13} \\ \mathbf{0} & \mathbf{A}_{22} & \mathbf{A}_{23} \\ \mathbf{A}_{13}^{\top} & \mathbf{A}_{23}^{\top} & \mathbf{A}_{33} \end{bmatrix} \begin{bmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \\ \mathbf{x}_3 \end{bmatrix} = \begin{bmatrix} \mathbf{b}_1 \\ \mathbf{b}_2 \\ \mathbf{b}_3 \end{bmatrix} \qquad (6.7)$$

After the first step, the blocks A11, A13, A13⊤, A22, A23, A23⊤ and the parts b1 and b2 of the right-hand side are updated in parallel, and the system has the following form:

$$\begin{bmatrix} \mathbf{L}_{11}^{\top} & \mathbf{0} & \mathbf{L}_{11}^{-1}\mathbf{A}_{13} \\ \mathbf{0} & \mathbf{L}_{22}^{\top} & \mathbf{L}_{22}^{-1}\mathbf{A}_{23} \\ \mathbf{0} & \mathbf{0} & \mathbf{A}_{33} \end{bmatrix} \begin{bmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \\ \mathbf{x}_3 \end{bmatrix}
= \begin{bmatrix} \mathbf{L}_{11}^{\top} & \mathbf{0} & \mathbf{L}_{13} \\ \mathbf{0} & \mathbf{L}_{22}^{\top} & \mathbf{L}_{23} \\ \mathbf{0} & \mathbf{0} & \mathbf{A}_{33} \end{bmatrix} \begin{bmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \\ \mathbf{x}_3 \end{bmatrix}
= \begin{bmatrix} \mathbf{L}_{11}^{-1}\mathbf{b}_1 \\ \mathbf{L}_{22}^{-1}\mathbf{b}_2 \\ \mathbf{b}_3 \end{bmatrix}.$$

The next step is to update the block A33 with the Schur complement A33^S of the matrix $\begin{bmatrix} \mathbf{A}_{11} & \mathbf{0} \\ \mathbf{0} & \mathbf{A}_{22} \end{bmatrix}$ in the matrix A, which according to (6.6) is

$$\mathbf{A}_{33} - \begin{bmatrix} \mathbf{L}_{13} \\ \mathbf{L}_{23} \end{bmatrix}^{\top} \begin{bmatrix} \mathbf{L}_{13} \\ \mathbf{L}_{23} \end{bmatrix},$$

and to update the vector b3 to b3^S, which equals

$$\mathbf{b}_3 - \begin{bmatrix} \mathbf{L}_{13} \\ \mathbf{L}_{23} \end{bmatrix}^{\top} \begin{bmatrix} \mathbf{L}_{11}^{-1}\mathbf{b}_1 \\ \mathbf{L}_{22}^{-1}\mathbf{b}_2 \end{bmatrix}.$$

Next, the linear system

$$\mathbf{A}_{33}^{S}\mathbf{x}_3 = \mathbf{b}_3^{S}$$

is transformed using Gaussian elimination to

$$\mathbf{L}_{33}^{S\top}\mathbf{x}_3 = (\mathbf{L}_{33}^{S})^{-1}\mathbf{b}_3^{S}$$

and solved for x3 using back substitution. Finally, the remaining parts of the vector x (x1 and x2) in the transformed system

$$\begin{bmatrix} \mathbf{L}_{11}^{\top} & \mathbf{0} & \mathbf{L}_{13} \\ \mathbf{0} & \mathbf{L}_{22}^{\top} & \mathbf{L}_{23} \\ \mathbf{0} & \mathbf{0} & \mathbf{L}_{33}^{S\top} \end{bmatrix} \begin{bmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \\ \mathbf{x}_3 \end{bmatrix} = \begin{bmatrix} \mathbf{L}_{11}^{-1}\mathbf{b}_1 \\ \mathbf{L}_{22}^{-1}\mathbf{b}_2 \\ (\mathbf{L}_{33}^{S})^{-1}\mathbf{b}_3^{S} \end{bmatrix}$$

are computed using only back substitution.

Chapter 7

Implementation

This chapter describes the chosen framework and implementation details, such as the data structures, functions and data types used in the practical output of the thesis: the linear direct solver (LDS).

7.1 Used Framework

The whole application was developed in a Linux environment (Xubuntu 12.04 for 64-bit PC and Debian 6.0 for 32-bit PC). The host code (for the CPU side) was written in ANSI C, the device code (for the GPU side) in CUDA (CUDA Driver 4.0). All object files were linked together into an executable file (ldsexam) using the NVCC compiler; no static or dynamic libraries were created (see my makefile).

7.2 Compressed Row Storage Format

Many formats for sparse matrices exist. One of the most general is the compressed row storage (CRS) format. It makes no assumptions about the sparsity pattern and stores only indices and nonzero elements. On the other hand, it is not very efficient, because it needs an indirect addressing step for every scalar operation in a matrix-vector product. I have decided on this format for my CPU-side solver because it can be utilized effectively in the Cholesky decomposition.

The CRS format needs three vectors: nozval of floating-point numbers, and rowptr and colind of integers. The nozval vector stores the values of the nonzero elements of the matrix, as they are traversed in a row-wise fashion. The colind vector stores the column indices of the elements in the nozval vector.


That is, if nozval(k) = a_ij then colind(k) = j. The rowptr vector stores the locations in the nozval vector that start a row, that is, if nozval(k) = a_ij then rowptr(i) ≤ k < rowptr(i+1). By convention, rowptr(n) = nnz (0-based indexing is used), where nnz is the number of all nonzeros.

Example 8. Consider the sparse symmetric matrix in figure 7.1.

        0  1  2  3  4  5
    0 [ 7  .  .  .  .  1 ]
    1 [ .  8  1  .  2  . ]
    2 [ .  1  8  .  3  2 ]
    3 [ .  .  .  9  3  2 ]
    4 [ .  2  3  3  9  3 ]
    5 [ 1  .  2  2  3  9 ]

Figure 7.1: Sample of a symmetric positive definite sparse matrix 6 × 6 with 22 nonzero elements

The CRS representation of this matrix has the following attributes: n = 6, nnz = 22,

    rowptr = [ 0, 2, 5, 9, 12, 17, 22 ]
    colind = [ 0, 5,  1, 2, 4,  1, 2, 4, 5,  3, 4, 5,  1, 2, 3, 4, 5,  0, 2, 3, 4, 5 ]
    nozval = [ 7, 1,  8, 1, 2,  1, 8, 3, 2,  9, 3, 2,  2, 3, 3, 9, 3,  1, 2, 2, 3, 9 ]
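As an illustration of how the three arrays work together, a matrix-vector product y = Ax over a CRS matrix can be written as follows. This is a generic sketch with illustrative type and field names, not the data structures used in the thesis sources.

    /* Generic CRS matrix and y = A*x; 0-based indices as in the example above. */
    typedef struct {
        int     n;        /* number of rows                 */
        int     nnz;      /* number of stored nonzeros      */
        int    *rowptr;   /* size n+1, rowptr[n] == nnz     */
        int    *colind;   /* size nnz                       */
        double *nozval;   /* size nnz                       */
    } crs_t;

    void crs_spmv(const crs_t *A, const double *x, double *y)
    {
        for (int i = 0; i < A->n; i++) {
            double s = 0.0;
            for (int k = A->rowptr[i]; k < A->rowptr[i + 1]; k++)
                s += A->nozval[k] * x[A->colind[k]];   /* indirect access via colind */
            y[i] = s;
        }
    }

The inner loop shows the indirect addressing mentioned above: every value access goes through colind.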

7.3 Cholesky decomposition on CPU

The implementation of the sparse Cholesky decomposition (functions CRS_chol and CRS_chol_subs) was quite straightforward. Before these functions are called, a symbolical factorization must be performed, which determines the indices of the fill-ins and allocates space for them. For the purpose of the Cholesky decomposition, it is sufficient to store only the lower or upper triangular part of the matrix. This fact was exploited by skipping all elements from the beginning of each row up to the main diagonal, which is done by CRS_shifted_rows. Another difference in the decomposition of sparse matrices lies in the necessity of altering the beginning of each row during the factorization; for that purpose I have worked with the temporary arrays rowbeg and rowend.

7.4 Ordering for CPU solver

In my solver, I have utilized the approximate minimum degree (AMD) ordering by Tim Davis, which can also be found in MATLAB's amd function. It minimizes the number of fill-ins very effectively and quickly (see figure 3.2). For the BA problem, an even faster ordering (but with more fill-ins) can be used: a simple rotation by 180 degrees.

7.5 Block Matrix Format for GPU

There are 3 different parts in the matrix: the full diagonal blocks, the sparse border, and the almost dense tail (light, middle and dark gray in figure 7.1). After analyzing the properties of these parts and the CUDA architecture, I have suggested the following matrix data structure (MXBF).

Blocks: As there are many (from thousands to millions) full but small diagonal blocks (V_i), they can be stored in one array (data) in a row-wise manner. In BA the blocks have the same size, but when using the METIS k-way ordering, the blocks do not have the same size. Because of that, the size of each block must be stored (blksz). When iterating over the blocks, it is efficient to have an index saying where the data of the i-th block start (blkp). Only the upper part of each block is stored, but memory is allocated for the full block to avoid awkward indexing.

Border: This part holds the majority of the nonzero elements. Therefore, it must be stored as a sparse matrix; I have chosen the CRS format. Since the input matrix is symmetric, it is sufficient to store only one side of the border.

Tail: After computing the Schur complement, this part becomes almost dense. Consequently, it is stored as a full matrix. Only the upper triangle is stored, but memory is allocated for the full matrix, as in the case of the blocks. The data for this part are stored in the data array as well, and tail points to the location where the data for the tail start.

The MXBF structure of the matrix from Example 8 (figure 7.1) has these attributes: n = 6, tail = 5 (where the data for the tail start in the array data), tailsz = 3, ndata = 14 (number of elements in the blocks and the tail), brd_nnz = 4 (number of nonzeros in the border),

    blksz = [ 1, 2 ]
    blkp  = [ 0, 1 ]
    data  = [ 7,  8, 1, 0, 8,  9, 3, 2, 0, 9, 3, 0, 0, 9 ]

    brd_rowptr = [ 0, 1, 2, 4 ]
    brd_colind = [ 2, 1, 1, 2 ]
    brd_nozval = [ 1, 2, 3, 2 ]
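A possible C declaration of such a structure, matching the fields listed above, could look as follows. This is a sketch of the layout only; the exact definition in the thesis sources may differ.

    /* Sketch of the MXBF block matrix layout (field names follow the text above). */
    typedef struct {
        int     n;           /* order of the matrix                                   */
        int     nblocks;     /* number of full diagonal blocks                        */
        int    *blksz;       /* size of each diagonal block                           */
        int    *blkp;        /* offset of each block inside data[]                    */
        double *data;        /* block entries (row-wise) followed by the tail entries */
        int     ndata;       /* number of elements stored in data[]                   */
        int     tail;        /* offset in data[] where the tail starts                */
        int     tailsz;      /* order of the (almost dense) tail                      */
        int     brd_nnz;     /* nonzeros in the sparse border                         */
        int    *brd_rowptr;  /* border stored in CRS: row pointers                    */
        int    *brd_colind;  /* column indices relative to the tail                   */
        double *brd_nozval;  /* border values                                         */
    } mxbf_t;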

7.6 Block Cholesky decomposition on GPU

Consider the block matrix

$$\begin{bmatrix} \mathbf{A}_{11} & \mathbf{0} & \mathbf{A}_{13} \\ \mathbf{0} & \mathbf{A}_{22} & \mathbf{A}_{23} \\ \mathbf{A}_{13}^{\top} & \mathbf{A}_{23}^{\top} & \mathbf{A}_{33} \end{bmatrix} \begin{bmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \\ \mathbf{x}_3 \end{bmatrix} = \begin{bmatrix} \mathbf{b}_1 \\ \mathbf{b}_2 \\ \mathbf{b}_3 \end{bmatrix},$$

where A11 and A22 are called 'blocks', A13 and A23 'borders' and A33 is the 'tail'. The block Cholesky decomposition consists of four main parts:

1. Eliminating the blocks (A11 ← L11⊤ and A22 ← L22⊤), updating the corresponding borders (A13 ← L11^{-1} A13 and A23 ← L22^{-1} A23), and updating the corresponding parts of the right-hand side of the linear system (b1 ← L11^{-1} b1 and b2 ← L22^{-1} b2). All these operations are done simultaneously (within the elimination loops). Each thread eliminates one block (in the test matrix it has the size 3 × 3) and updates its own part of the border and of the b vector. As the border part is sparse and can have an arbitrary number of nonzero elements, I store and access this data in global memory. (A CPU-side sketch of this per-block step is given after this list.)

2. Computing the Schur complement

$$\mathbf{A}_{33} \leftarrow \mathbf{A}_{33}^{S} = \mathbf{A}_{33} - \begin{bmatrix} \mathbf{L}_{11}^{-1}\mathbf{A}_{13} \\ \mathbf{L}_{22}^{-1}\mathbf{A}_{23} \end{bmatrix}^{\top} \begin{bmatrix} \mathbf{L}_{11}^{-1}\mathbf{A}_{13} \\ \mathbf{L}_{22}^{-1}\mathbf{A}_{23} \end{bmatrix}.$$

The problem here is that the updated border part is stored in a row-wise manner and the transposed matrix is not available. Therefore, using a dot product for the matrix-matrix multiplication was not possible. I had to loop through the rows of the matrix and update the elements of the A33 matrix at every multiplication. This is only possible when using atomic operations (atomicAdd). Moreover, atomicAdd can be used only for single-precision floats and only on devices of compute capability 2.0 or higher. I am aware of this restriction of the proposed approach.

3. Eliminating the tail (A33^S ← L33^{S⊤}). This part surely has the biggest potential to exploit the full power of a GPU. Unfortunately, it was postponed due to lack of time; I had planned to call a function from the MAGMA library that is able to solve a dense linear system. In my solver, this part is performed on the CPU side.

4. Back substitution. Performed on the CPU side, first for the dense part L33^{S⊤} and then for the sparse borders and the full blocks.
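The per-block work of step 1 can be pictured with the following CPU-side C sketch; one call corresponds to what a single GPU thread does for its block. It is an illustration only, not the kernel code: it reuses the dense chol_dense and forward_subst sketches from Chapter 2, works with the lower factor instead of the stored upper one, and for clarity treats the block's border rows as a dense strip of width tailsz (the implementation keeps them in the CRS border arrays).

    int  chol_dense(int n, double *A);                                   /* from Chapter 2 */
    void forward_subst(int n, const double *L, const double *b, double *x);

    /* blk    : order of the diagonal block, stored dense row-major in B (blk x blk)
       border : the block's border rows, shown as a dense blk x tailsz strip
       rhs    : the block's part of the right-hand side (length blk)
       Assumes blk <= 8 (blocks in the BA test matrix are 3 x 3). */
    int eliminate_block(int blk, int tailsz, double *B, double *border, double *rhs)
    {
        double col[8], sol[8], tmp[8];

        if (chol_dense(blk, B) != 0)        /* B now holds L_ii in its lower triangle */
            return -1;

        /* border <- L_ii^{-1} * border : forward-substitute each column of the strip */
        for (int c = 0; c < tailsz; c++) {
            for (int r = 0; r < blk; r++)
                col[r] = border[r*tailsz + c];
            forward_subst(blk, B, col, sol);
            for (int r = 0; r < blk; r++)
                border[r*tailsz + c] = sol[r];
        }

        /* rhs <- L_ii^{-1} * rhs */
        forward_subst(blk, B, rhs, tmp);
        for (int r = 0; r < blk; r++)
            rhs[r] = tmp[r];
        return 0;
    }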

7.7 Ordering for GPU solver

A requirement of my GPU solver is that the input matrix can be partitioned into the structure that appears in the approximate Hessians of the BA problem (see the matrix in Equation (6.2)). This can be achieved by applying nested dissection ordering recursively. The METIS k-way ordering was used for partitioning the input matrix into an independent block structure for the GPU solver. Figures 7.2 and 7.3 illustrate the structure of matrices from the MATLAB gallery reordered by the k-way ordering. As BA has this structure implicitly, and the size and number of independent blocks are known from the BA configuration, it needs only a rotation by 180 degrees to get a structure like the one in figure 6.1b.

(a) Original matrix, nz = 4861  (b) Reordered into 5 independent blocks, nz = 4861

Figure 7.2: Performing k-way ordering on the diagonal-based matrix ’Wathen 10 × 10’

(a) Original matrix, nz = 4380  (b) Reordered using k-way ordering into 10 independent blocks, nz = 4380

Figure 7.3: Performing k-way ordering on the diagonal-based matrix ’Poisson 30’

Chapter 8

Testing

Testing was performed on the following configuration: Intel i7-2600 CPU @ 3.40 GHz, 4 GB RAM, GeForce GT570, Debian 6.0 for 32-bit PC, CUDA Driver 4.0. The applications were compiled using GCC (version 4.3.5) and NVCC with -use-fast-math and the -O3 optimization mode.

To check the accuracy of my solvers, I have used Octave to get the reference x vector. The solutions from Octave and from my solver were printed into files (x_octave.vec and x_result.vec) and the differences were compared with another Octave function (vec_ck).

The main testing input matrix was the approximation of the Hessian from a BA problem optimizing 3 parameters for each of 11049 3D points and 7 parameters for each of 22 cameras. The matrix is of size 33,301 × 33,301 and has 1,817,521 nonzero elements, saved in the 'Matrix Market coordinate' format (data/jTj_mue.mtx).

8.1 Octave solvers

In Octave, I have tested the direct solver (the left division operator ’\’), the Preconditioned Conjugate Gradient solver (pcg) and the Preconditioned Conjugate Residuals solver (pcr). The iterative solvers were set to terminate after reaching 200 iterations or a residual norm less than 10^-6. Figure 8.1 shows the results. The Preconditioned Conjugate Residuals solver terminated after 45 iterations, but the result was wrong.


Method                    Time       Res. norm    Iterations
Left division operator    695 ms     1.283e-13    –
Conjugate gradient        1440 ms    4.128e-5     75
Conjugate residuals       1386 ms    NaN          45

Figure 8.1: Test of Octave solvers

8.2 CPU solver

After executing the CPU solver from the lds directory with the command

    ./bin/ldscpuexam data/jtj_mueI.mtx data/g.mtx

the following information is printed:

    load matrix:    1070 ms
    load vector:      10 ms
    symamd ord.:      80 ms
    mat. reorder:    390 ms
    symbolic:        500 ms
    CRS_symbolic:    1834461 nnz
    CPU CRS chol:     50 ms
    all:            2120 ms

The number of nonzeros has not increased much (from 1,817,521 to 1,834,461), which means that there are very few fill-ins (less than 1%). It can be seen that my implemented functions for reordering the matrix and for symbolic factorization are not very efficient. The reason may be that the reordering is performed by transforming the CRS format into the triplet (COO) format, which is then reordered, sorted, and transformed back, requiring a lot of data moves. Although finding the ordering takes more time than solving the whole linear system, without it (try commenting it out in ldscpuexam.c) the computation takes more than several minutes. Execution of all the functions required to find the solution takes about 1 second.

The command

    octave -q --eval="vec_ck( ’x_octave.vec’, ’x_result.vec’ );"

outputs the residual norm of the difference from the reference Octave solution and finds where the biggest difference is:

    max err: 0.0000000228 at 138th element
    res nrm: 0.0000000000

8.3 GPU solver

To check the correctness of the GPU solver, I have first implemented the GPU algorithm on the CPU side (to use this, the constant BLOCK_CHOLESKY_CPU must be uncommented and the project recompiled with make). Then ldsgpuexam is performed on the CPU. Calling

    ./bin/ldsgpuexam data/jtj_mueI.mtx data/g.mtx

gives these results:

    load matrix:    1060 ms
    load vector:      20 ms
    kway ord.:        20 ms
    mat. reorder:    370 ms
    symbolic:        500 ms
    CRS_symbolic:    1834083 nnz
    MXBF_from_crs:   11049 blocks
                     858522 border nnz
                     123157 block and tail data
    block matrix:     10 ms
    elim. blks:       10 ms
    tail update:      30 ms
    elim tail:         0 ms
    back subs:         0 ms
    CPU block chol:   40 ms
    all:            2020 ms

This solver, which exploits the special structure of BA, runs faster than the general CPU solver (40 ms vs. 50 ms). Checking the residual norm gives:

    max err: 0.0000221960 at 59th element
    res nrm: 0.0000000010
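The steps labelled 'elim. blks' and 'tail update' above amount, for each 3×3 point block D with border B (the columns coupling that point to the m camera unknowns), to the Schur-complement update T := T - B D^{-1} B^T of the camera tail T. The following is a simplified dense sketch of that single step (hypothetical names, explicit 3×3 inverse, fully dense m×m tail); it is not the code from mxbf.cu or mxbf_chol.cu:

    static void inv3x3(const double D[9], double Di[9])
    {
        /* cofactor-based inverse of a 3x3 matrix, D assumed nonsingular */
        double det =  D[0]*(D[4]*D[8] - D[5]*D[7])
                    - D[1]*(D[3]*D[8] - D[5]*D[6])
                    + D[2]*(D[3]*D[7] - D[4]*D[6]);
        double id = 1.0 / det;
        Di[0] =  (D[4]*D[8] - D[5]*D[7]) * id;
        Di[1] = -(D[1]*D[8] - D[2]*D[7]) * id;
        Di[2] =  (D[1]*D[5] - D[2]*D[4]) * id;
        Di[3] = -(D[3]*D[8] - D[5]*D[6]) * id;
        Di[4] =  (D[0]*D[8] - D[2]*D[6]) * id;
        Di[5] = -(D[0]*D[5] - D[2]*D[3]) * id;
        Di[6] =  (D[3]*D[7] - D[4]*D[6]) * id;
        Di[7] = -(D[0]*D[7] - D[1]*D[6]) * id;
        Di[8] =  (D[0]*D[4] - D[1]*D[3]) * id;
    }

    /* B is m x 3 row-major, T is m x m row-major:
     * perform T -= B * D^{-1} * B^T for one point block */
    static void schur_update(int m, const double *B, const double D[9], double *T)
    {
        double Di[9];
        inv3x3(D, Di);

        for (int r = 0; r < m; ++r) {
            /* w = (row r of B) * D^{-1}, a 1 x 3 row vector */
            double w[3];
            for (int c = 0; c < 3; ++c)
                w[c] = B[r*3+0]*Di[0*3+c] + B[r*3+1]*Di[1*3+c] + B[r*3+2]*Di[2*3+c];

            /* T(r, :) -= w * B^T */
            for (int s = 0; s < m; ++s)
                T[r*m + s] -= w[0]*B[s*3+0] + w[1]*B[s*3+1] + w[2]*B[s*3+2];
        }
    }

After all point blocks have been eliminated this way, the dense tail T is factorized with an ordinary Cholesky decomposition and the point unknowns are recovered by back-substitution.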

Output of the real GPU solver:

    load matrix:    1070 ms
    load vector:      10 ms
    kway ord.:        20 ms
    mat. reorder:    380 ms
    symbolic:        500 ms
    CRS_symbolic:    1834083 nnz
    MXBF_from_crs:   11049 blocks, 858522 border nnz, 123157 block and tail data
    block matrix:     10 ms
    elim on GPU:
      elim without copy: 15.1688 ms
      elim with copy:    20.0004 ms
    elim blocks + tail update: 420 ms
    elim tail:         0 ms
    back subs:         0 ms
    GPU block chol:  430 ms
    all:            2430 ms

with residual norm:

    max err: 0.0000072417 at 103th element
    res nrm: 0.0000000003

The GPU solver must run in single-precision floating point because of the atomicAdd operations (see the sketch below). 'elim without copy' is the time needed for the elimination of the blocks and the tail update (computing the Schur complement).
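The single-precision restriction comes from the hardware: on Fermi-class GPUs of this generation, atomicAdd exists for 32-bit floats but not for doubles, and the tail update accumulates contributions from many point blocks into the same global-memory matrix concurrently. The CUDA sketch below illustrates only this accumulation pattern; the kernel name, the precomputed per-block contributions, and the dense tail layout are simplifications, not the kernel from mxbf_chol.cu.

    /* One CUDA block per point block: subtract its precomputed
     * m x m contribution B * D^{-1} * B^T from the shared tail
     * matrix in global memory.  Concurrent writes to the same tail
     * entries are resolved with atomicAdd, which is why the solver
     * works in single precision on this hardware. */
    __global__ void tail_update(const float *contrib,  /* per-block m x m updates */
                                float *tail,           /* dense m x m tail        */
                                int m)
    {
        int blk = blockIdx.x;                           /* which point block */
        const float *c = contrib + (size_t)blk * m * m;

        /* each thread accumulates a strided subset of the m*m entries */
        for (int idx = threadIdx.x; idx < m * m; idx += blockDim.x)
            atomicAdd(&tail[idx], -c[idx]);             /* tail -= contribution */
    }

A launch such as tail_update<<<num_point_blocks, 256>>>(d_contrib, d_tail, m) then lets all point blocks update the tail concurrently; a double-precision variant would have to fall back on the usual atomicCAS-based emulation.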

8.4 CUSP solvers

CUSP is a C++ template library that implements parallel algorithms for sparse matrix and graph computations. It provides a variety of iterative solvers such as Conjugate Gradient (CG), Biconjugate Gradient (BiCG), Biconjugate Gradient Stabilized (BiCGstab), Generalized Minimum Residual (GMRES), Multi-mass Conjugate Gradient (CG-M), and Multi-mass Biconjugate Gradient Stabilized (BiCGstab-M). I tested two of them with the maximum number of iterations set to 200 and a relative tolerance of 10^-6; a minimal usage sketch is shown below, and Table 8.2 summarizes the results.
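The CG test can be driven with only a few lines of CUSP code. The following sketch shows roughly how such a test looks; it is not the exact benchmark program used here, and the right-hand side is filled with ones instead of being read from data/g.mtx.

    #include <cusp/csr_matrix.h>
    #include <cusp/array1d.h>
    #include <cusp/io/matrix_market.h>
    #include <cusp/krylov/cg.h>
    #include <cusp/monitor.h>

    int main(void)
    {
        // load the test matrix from Matrix Market format into CSR on the device
        cusp::csr_matrix<int, float, cusp::device_memory> A;
        cusp::io::read_matrix_market_file(A, "data/jtj_mueI.mtx");

        // unknowns and a (synthetic) right-hand side
        cusp::array1d<float, cusp::device_memory> x(A.num_rows, 0.0f);
        cusp::array1d<float, cusp::device_memory> b(A.num_rows, 1.0f);

        // stop after 200 iterations or when the relative residual drops below 1e-6
        cusp::default_monitor<float> monitor(b, 200, 1e-6f);

        // run Conjugate Gradient: solves A x = b
        cusp::krylov::cg(A, x, b, monitor);

        return 0;
    }

CUSP is header-only, so this compiles with nvcc and needs no extra libraries besides CUDA itself.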

Method     Time    Max. error   Iterations
CG         50 ms   3.8e-8       77
BiCGstab   90 ms   2.3e-8       76

Figure 8.2: Test of iterative CUSP solvers. Max. error is the maximal difference from Octave's reference solution

Chapter 9

Conclusion

The aim of this thesis was to study linear direct solvers and then implement a linear direct GPU solver for the BA problem. The implementation of a GPU solver was, of course, preceded by studying the mathematical background of linear direct solvers. First, the CPU solver had to be implemented, and other important concepts of direct sparse solvers had to be mastered, such as symbolic factorization, working with the CRS matrix format, and applying ordering techniques. I can say that my CPU solver is fast and reliable when solving positive definite linear systems. This work was done in the first half of the academic year.

In the second half of the year, I started experimenting with the METIS k-way ordering and with how to utilize it for solving general sparse systems in parallel. Although this approach is fully usable, it has drawbacks such as the slow computation of the ordering, a relatively large tail part, and independent blocks of different sizes. Simultaneously, I analysed the BA problem and the structure of its linear sparse systems in the Levenberg-Marquardt algorithm. Since the structure of the BA systems and of the returned k-way ordering was the same, I tried to write a solver that could be general (the needed information about the block matrix is given by the k-way ordering) and specific at the same time (in that case the information about the block matrix is provided by the BA configuration). The general solver on the GPU is not finished (a special symbolic factorization is missing). The GPU solver specialized for BA was implemented, but it provides only very small speedups in comparison with the CPU solver; the reason is that only global memory on the GPU was used for all computations. In the testing phase I found out that iterative solvers have great potential to solve these linear systems very fast. An advantage of iterative solvers is their configurable accuracy, which can be sufficient for iterative nonlinear solvers.


Even when used with a preconditioner, the solution should be found very fast. On the other hand, when using direct solvers, the symbolic factorization is performed only once in the LM algorithm, and direct solvers generally give more accurate results. Based on my experiments, I suggest using a direct solver on the CPU combined with a dense GPU solver for factorizing the Schur complement. I am aware that a detailed study of the SBA (Sparse Bundle Adjustment) package is missing, as well as testing of the practical utilization of the GPU solvers in that package.

Bibliography

[1] Unknown author. NVIDIA GeForce GTX 680 s čipem GK104: Herní Kepler detailně [NVIDIA GeForce GTX 680 with the GK104 chip: the gaming Kepler in detail]. CD-R server s.r.o., URL http://diit.cz/clanek/unifikovane-jadro-a-rizeni-cipu, 2012. Cited in page 25.

[2] O. Coles. NVIDIA GF100 GPU Fermi graphics architecture. Benchmark Reviews, URL http://benchmarkreviews.com, 2010. Cited in page 24.

[3] T. Davis. Sparse matrix. From MathWorld — A Wolfram Web Resource, URL http://mathworld.wolfram.com/SparseMatrix.html, 2012. Retrieved April 2012. Cited in page 10.

[4] J. Dennis. Nonlinear least squares. State of the Art in Numerical Analysis, pages 269–312, 1977. Cited in page 16.

[5] J. Dennis and R. Schnabel. Numerical Methods for Unconstrained Optimization and Nonlinear Equations. SIAM Publications, 1996. Cited in page 20.

[6] R. Farber. CUDA Application Design and Development. Morgan Kaufmann, Waltham, MA, 2011. 2 citations in pages 22 and 25.

[7] A. George and J. W. H. Liu. An automatic nested dissection algorithm for irregular finite element problems. SIAM Journal on Numerical Analysis, 15(5):1053–1069, 1978. Cited in page 12.

[8] A. George and J. W. H. Liu. A fast implementation of the minimum degree algorithm using quotient graphs. ACM Transactions on Mathematical Software, 6:337–358, 1980. Cited in page 12.

[9] J. R. Gilbert, C. Moler, and R. Schreiber. Sparse matrices in MATLAB: Design and implementation. SIAM Journal on Matrix Analysis and Applications, pages 333–356, 1992. Cited in page 10.

[10] G. Golub and C. van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, MD, 3rd edition, 1996. Cited in page 20.

[11] K. Habgood and I. Arel. A condensation-based application of Cramer's rule for solving large-scale linear systems. Journal of Discrete Algorithms, 10:98–109, 2012. Cited in page 5.

[12] K. Hiebert. An evaluation of mathematical software that solves nonlinear least squares problems. ACM Transactions on Mathematical Software, 7(1):1–16, 1981. Cited in page 16.

[13] J.-Y. L'Excellent and B. Ucar. Elimination tree. URL http://graal.ens-lyon.fr/~bucar/CR07, 2010. Cited in page 13.

[14] M. I. A. Lourakis and A. A. Argyros. SBA: A software package for generic sparse bundle adjustment. ACM Transactions on Mathematical Software, 36(1), 2007. 4 citations in pages 16, 28, 29, and 30.

[15] H. M. Markowitz. The elimination form of the inverse and its application to linear programming. Management Science, 3:255–269, 1957. Cited in page 12.

[16] J. Nocedal and S. Wright. Numerical Optimization. Springer, New York, NY, 1999. Cited in page 20.

[17] F. Ntawiniga. Bundle adjustment technique. URL http://archimede.bibl.ulaval.ca/archimede/fichiers/25229/ch06.html, 2008. Retrieved April 2012. Cited in page 17.

[18] NVIDIA. OpenCL Programming for the CUDA Architecture, 2009. Cited in page 23.

[19] V. Prasolov. Problems and Theorems in Linear Algebra. American Mathematical Society, Providence, RI, 1994. Cited in page 30.

[20] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes: The Art of Scientific Computing. Cambridge University Press, 3rd edition, 2007. 2 citations in pages 4 and 11.

[21] A. Quarteroni, R. Sacco, and F. Saleri. Numerical Mathematics. Springer, 2000. 3 citations in pages 4, 17, and 30.

[22] W. F. Tinney and J. W. Walker. Direct solutions of sparse network equations by optimally ordered triangular factorization. In Proceedings of the IEEE, volume 55, pages 1801–1809, 1967. Cited in page 12.

[23] B. Triggs, P. McLauchlan, R. Hartley, and A. Fitzgibbon. Bundle adjustment – a modern synthesis. Proceedings of the International Workshop on Vision Algorithms: Theory and Practice, pages 298–372, 1999. Cited in page 16.

[24] V. Volkov. Better performance at lower occupancy. GPU Technology Conference 2010 (GTC 2010), 2010. URL http://www.cs.berkeley.edu/~volkov. Cited in page 24.

[25] R. Vuduc. Analysis and tuning case study. TeraGrid Conference, URL http://hpcgarage.org/tg10--gpu-tutorial, 2010. Retrieved May 2012. Cited in page 23.

[26] M. Yannakakis. Computing the minimum fill-in is NP-complete. SIAM Journal on Algebraic and Discrete Methods, 2:77–79, 1981. Cited in page 10.

Appendix A

List of Abbreviations

3D     Three-Dimensional
BA     Bundle Adjustment
CPU    Central Processing Unit
CUDA   Compute Unified Device Architecture
CRS    Compressed Row Storage
GPGPU  General-Purpose Computing on Graphics Processing Unit
GPU    Graphics Processing Unit
LDS    Linear Direct Solver (output of this thesis)
LM     Levenberg-Marquardt (algorithm)
LRU    Least Recently Used
SBA    Sparse Bundle Adjustment
SIMD   Single Instruction Multiple Data
SIMT   Single Instruction Multiple Threads
SM     Streaming Multiprocessor

Appendix B

User Manual

B.1 Requirements

All code was written in ANSI C and CUDA and tested on 64-bit Linux (Xubuntu distribution) with GCC 4.4.6. For successful compilation, the package libscotchmetis-dev is required, and the install paths in the makefile for CUDA and for the METIS include files must be set properly. After compilation, the executables ldscpuexam and ldsgpuexam are created in the bin directory.

B.2 Usage

    ldscpuexam A.mtx b.vec
    ldsgpuexam A.mtx b.vec

For the tested matrix, run

    bin/ldscpuexam data/jtj_mueI.mtx data/g.mtx

or

    bin/ldsgpuexam data/jtj_mueI.mtx data/g.mtx

A.mtx is a symmetric positive definite matrix of size n × n stored in Matrix Market format, and b.vec is the right-hand side n × 1 vector of the equation system, also stored in Matrix Market format. Some timing information is printed to stdout and the solution is stored in a file named x_result.vec. To test the correctness of the solution, the Octave function vec_ck can be called from the command line:

    octave -q --eval 'vec_ck( "x_result.vec", "x_octave.vec" );'

Appendix C

Contents of the Attached CD

.
+-- lds
|   +-- bin
|   +-- data
|   |   +-- g.mtx
|   |   +-- jtj_mueI.mtx
|   |   +-- test_thesis.mtx
|   +-- makefile
|   +-- obj
|   +-- octave
|   |   +-- matrix_load.m
|   |   +-- octave_solver.m
|   |   +-- spy_print.m
|   +-- README.txt
|   +-- src
|   |   +-- colamd.c
|   |   +-- colamd_global.c
|   |   +-- colamd.h
|   |   +-- crs.c
|   |   +-- crs.h
|   |   +-- etree.c
|   |   +-- etree.h
|   |   +-- ldscpuexam.c
|   |   +-- ldsgpuexam.c
|   |   +-- mxbf.cu
|   |   +-- mxbf.h
|   |   +-- mxbf_chol.cu
|   |   +-- ord.c
|   |   +-- ord.h
|   |   +-- UFconfig.c
|   |   +-- UFconfig.h
|   |   +-- uni.c
|   |   +-- uni.h
|   |   +-- vec.c
|   |   +-- vec.h
|   +-- vec_ck.m
|   +-- x_octave.vec
+-- text
    +-- Ivancik_thesis_2012.pdf

7 directories, 31 files
