HIGH PERFORMANCE COMPUTING WITH SPARSE MATRICES AND GPU ACCELERATORS

By

SENCER NURI YERALAN

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA
2014

© 2014 Sencer Nuri Yeralan

For Sencer, Helen, Seyhun

ACKNOWLEDGMENTS

I thank my research advisor and committee chair, Dr. Timothy Alden Davis. His advice, support, academic and spiritual guidance, countless one-on-one meetings, and coffee have shaped me into the researcher that I am today. He has also taught me how to effectively communicate and work well with others. Perhaps most importantly, he has taught me how to be uncompromising in matters of morality. It is truly an honor to have been his student, and I encourage any of his future students to take every lesson to heart, even when they involve poetry.

I thank my supervisory committee member, Dr. Sanjay Ranka, for pressing me to strive for excellence in all of my endeavors, both academic and personal. He taught me how to conduct myself professionally and how to properly interface with and understand the dynamics of university administration.

I thank my supervisory committee member, Dr. Alin Dobra, for centering me and teaching me that oftentimes the simple solutions are also, surprisingly, the most efficient.

I also thank my supervisory committee members, Dr. My Thai and Dr. William Hager, for their feedback and support of my research. They challenged me to look deeper into the problems to expose their underlying structures.

I thank Dr. Meera Sitharam, Dr. Beverly Sanders, Dr. Richard Newman, Dr. Jih-Kwon Peir, Dr. Alireza Entezari, and Dr. Jörg Peters for challenging me in the most interesting courses I have ever had the pleasure of taking. Their guidance and advice throughout the years have been invaluable.

I thank Dean Cammy Abernathy for teaching me that life is not just about research. She taught me that we all have to sometimes stop ourselves from being distracted by minutia and focus on matters that truly have an impact on each other's lives. She was the primary inspiration for me to become active in campus politics. As a result I learned how to cultivate and foster lasting coalitions and raise awareness for Computer Science in the Gator Nation.

I thank my mother and father for always being there for me and lavishing me with love, advice, and support. My parents have always been proud of me, but this is my opportunity to mention how proud I am of them. Without my parents, completing graduate school would not have been possible.

I thank Sean and Jeanne Conner for providing me a home away from home in Tampa, Florida, without judgements, reservations, pocket aces, or a 28-point crib.

I thank my friends Dr. Joonhao Chuah, John Corring, Diego Rivera-Gutierrez, Michael Borish, Paul Accisano, Braden Dunn, Peter Dobbins, Kristina Sapp, Joan Crisman, Dr. Jose Roberto Soto, Graham Picklesimer, Rob Short, Kelsey Antle, Patrick Battoe, Nick Kesler, Lance Boverhof, Anthony LaRocca, Kevin Babcock, Luke McLeod, Cale Flage, and Evan James Ferl J.D. for challenging, supporting, and humoring me throughout the years. These are the best friends anyone could ever ask for, and I'm truly lucky to have you all in my life.

TABLE OF CONTENTS

page

ACKNOWLEDGMENTS ...... 4

LIST OF TABLES ...... 8

LIST OF FIGURES ...... 9

ABSTRACT ...... 10

CHAPTER

1 INTRODUCTION ...... 11

2 REVIEW OF LITERATURE ...... 15
  2.1 Fill-Reducing Orderings ...... 15
    2.1.1 Minimum Degree ...... 15
    2.1.2 Nested Dissection ...... 17
    2.1.3 Graph Partitioning ...... 19
  2.2 Sparse QR Factorization ...... 23
    2.2.1 Early Work For QR Factorization ...... 23
    2.2.2 Multifrontal QR Factorization ...... 23
    2.2.3 Parallelizing Multifrontal QR Factorization ...... 23
  2.3 GPU Computing ...... 25
  2.4 Communication-Avoiding QR Factorization ...... 25
  2.5 Sparse Multifrontal Methods on GPU Accelerators ...... 26

3 HYBRID COMBINATORIAL-CONTINUOUS GRAPH PARTITIONER ...... 28

  3.1 Problem Description ...... 28
    3.1.1 Definition ...... 28
    3.1.2 Applications ...... 28
    3.1.3 Outline of this chapter ...... 29
  3.2 Multilevel Edge Separators ...... 29
  3.3 Related Work ...... 31
  3.4 Hybrid Combinatorial-Quadratic Programming Approach ...... 32
    3.4.1 Preprocessing ...... 32
    3.4.2 Coarsening ...... 33
    3.4.3 Initial Partitioning ...... 35
    3.4.4 Refinement ...... 36
  3.5 Results ...... 38
    3.5.1 Hybrid Performance ...... 38
    3.5.2 Power Law Graphs ...... 40
  3.6 Future Work ...... 41
  3.7 Summary ...... 41

4 SPARSE MULTIFRONTAL QR FACTORIZATION ON GPU ...... 43
  4.1 Problem Description ...... 43
    4.1.1 Main Contributions ...... 43
    4.1.2 Chapter Outline ...... 44
  4.2 Preliminaries ...... 45
    4.2.1 Multifrontal Sparse QR Factorization ...... 45
      4.2.1.1 Ordering phase ...... 45
      4.2.1.2 Analysis phase ...... 46
      4.2.1.3 Factorization phase ...... 49
      4.2.1.4 Solve phase ...... 50
    4.2.2 GPU architecture ...... 50
  4.3 Related Work ...... 54
  4.4 Parallel ...... 55
    4.4.1 Dense QR Scheduler (the Bucket Scheduler) ...... 57
    4.4.2 Computational Kernels on the GPU ...... 63
      4.4.2.1 Factorize kernel ...... 64
      4.4.2.2 Apply kernel ...... 68
      4.4.2.3 Apply/Factorize kernel ...... 73
    4.4.3 Sparse QR Scheduler ...... 73
    4.4.4 Staging for Large Trees ...... 77
  4.5 Experimental Results ...... 80
  4.6 Future Work ...... 84
  4.7 Summary ...... 86

5 CONCLUSION ...... 87

REFERENCES ...... 89

BIOGRAPHICAL SKETCH ...... 96

LIST OF TABLES

Table page

3-1 Selected problems from the University of Florida Sparse Matrix Collection [19] ...... 40

3-2 Performance results for the 3 matrices listed in Table 3-1 ...... 40

4-1 Performance of 1x16 “short and fat” dense rectangular problems ...... 80
4-2 Performance of 16x1 “tall and skinny” dense rectangular problems ...... 81

4-3 Five selected matrices from the UF Sparse Matrix Collection [19]...... 81

4-4 Performance results for the 5 matrices listed in Table 4-3...... 82

LIST OF FIGURES

Figure page

3-1 Multilevel graph partitioning ...... 30

3-2 Graph coarsening using heavy edge matching ...... 33

3-3 Graph coarsening using heavy edge and Brotherly matching ...... 34
3-4 Graph coarsening using heavy edge, Brotherly, and adoption matching ...... 34

3-5 Graph coarsening using heavy edge, Brotherly, and community matching ... 35

3-6 Comparison of cut quality and timing between combinatorial, continuous, and our hybrid methodology on 1550 problems from the UF Collection [19] ..... 39

3-7 Comparison of cut quality and timing between METIS 5 and our hybrid methodology on 1550 problems from the UF Collection [19] ...... 39

3-8 Comparison of cut quality and timing between METIS 5 and our hybrid methodology on 25 power law problems from the UF Collection [19] ...... 41

4-1 A sparse matrix A, its factor R, and its column elimination tree ...... 48

4-2 Assembly and factorization of a leaf frontal matrix ...... 49

4-3 Assembly and factorization of a frontal matrix with three children ...... 50
4-4 High Level GPU Architecture [74] ...... 53

4-5 A 256-by-160 dense matrix blocked into 32-by-32 tiles and the corresponding state of the bucket scheduler ...... 58
4-6 Factorization of a 256-by-160 matrix in 12 kernel launches ...... 60

4-7 Pipelined factorization of the 256-by-160 matrix in 7 kernel launches ...... 62

4-8 Assembly tree for a sparse matrix with 68 fronts ...... 75

4-9 Stage 1 of an assembly tree for a sparse matrix with 68 fronts ...... 78

4-10 Stage 2 of an assembly tree for a sparse matrix with 68 fronts ...... 79
4-11 Force-directed renderings of 5 selected problems from the UF Collection ...... 82

4-12 GPU-accelerated speedup over the CPU-only algorithm versus arithmetic intensity on a logarithmic scale...... 83

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

HIGH PERFORMANCE COMPUTING WITH SPARSE MATRICES AND GPU ACCELERATORS

By

Sencer Nuri Yeralan

December 2014

Chair: Timothy Alden Davis
Major: Computer Engineering

Sparse matrix factorization relies on high quality fill-reducing orderings to limit the number of flops and the amount of memory required to compute and store the resulting sparse factor. In turn, fill-reducing orderings, such as nested dissection, require efficient graph partitioning techniques. We present a novel multilevel hybrid combinatorial-continuous graph partitioner that combines traditional multilevel combinatorial techniques with a quadratic programming formulation. Alternating between the two methods yields a cut quality superior to either method alone, and the cut quality is superior to that of contemporary graph partitioners for the power-law graphs that arise in social networks.

Sparse matrix factorization also involves a mix of regular and irregular computation, which is a particular challenge when trying to obtain high performance on the highly parallel general-purpose computing cores available on graphics processing units (GPUs). We present a GPU-accelerated sparse multifrontal QR factorization method which is up to ten times faster than a highly optimized method on a multicore CPU. Our method is unique compared with prior methods, since it factorizes many frontal matrices in parallel and keeps all the data transmitted between frontal matrices on the GPU. A novel bucket scheduler algorithm extends communication-avoiding QR factorization for dense matrices by exploiting more parallelism and by exploiting the staircase form present in the frontal matrices of a sparse multifrontal method.

CHAPTER 1
INTRODUCTION

Problems in modern applications grow increasingly complex, considering more and more unknowns representing any number of physical phenomena, including predicting the position and/or momentum of subatomic particles, simulating the position of cells within the bloodstream, and even representing wind tunnel air velocity readings taken from sensors positioned along the airfoil of a helicopter designed entirely in MATLAB®. Despite the increasing complexity of such linear systems, one trend remains simple: as problem size increases, the number of correlated relationships between unknowns rarely maintains the O(n²) growth necessary for the problem to remain dense. As the number of variables in a linear system increases, the sparser the system becomes.

The general strategy for solving linear systems using direct methods involves factoring the system prior to performing a forward solve or backsolve with one or more right-hand sides. For sparse linear systems, the order in which variables are annihilated plays a significant role in the runtime performance and memory requirements of the solution strategy. Selecting a poor pivot may lead to significant fill-in, thereby increasing the number of flops and the amount of memory required to compute and store the sparse factor. Conversely, selecting an ordering in which this fill-in is minimized pays dividends during the remainder of the factorization process. More importantly, once such a fill-reducing ordering is computed, it can be reused, provided the nonzero pattern of the initial problem remains the same.

Early work on fill-reducing orderings established the link between graph theory and sparse linear systems. The minimum degree algorithm produces a fill-reducing ordering by simply selecting the vertex of minimum degree, removing it, and replacing it with a clique of its neighbors. Strategies for tie-breaking and for performance and quality optimizations quickly followed, but the true insight was the realization that graph theory had a natural role to play in sparse linear systems.

The nested dissection method followed this early work. Nested dissection recursively partitions the system into disjoint yet equal-sized parts. If a perfect partitioning is possible, the fill-reducing ordering produces completely independent problems which can be operated on in parallel with no loss of fidelity. For most problems, however, such a perfect partitioning is not possible. Instead, the problem is reordered into two disjoint subproblems with common vertices in the vertex separator. The kernel of the nested dissection method is finding high quality vertex separators. Graph partitioning, specifically the vertex separator problem and its dual, the edge separator problem, has been studied from a combinatorial standpoint for many years, and many heuristics and optimizations have been developed. Recent work establishes continuous quadratic programming formulations for these problems, and blending combinatorial multilevel methods with these new continuous methods is a new research space waiting to be explored. Efficient and high quality graph partitioners are required for good fill-reducing orderings, which in turn are required to extract the most parallelism from sparse direct methods, such as QR factorization.

Once nested dissection recursively produces a fill-reducing ordering, it is possible to factorize each subproblem, followed by their common vertex separator, in a bottom-up postorder fashion. When the depth of recursion is significant, as it is for large sparse problems, a large degree of natural parallelism is exposed, waiting to be exploited. Sparse direct solvers that exploit parallelism whenever possible tend to outperform their single-threaded counterparts. However, as problem sizes continue to increase, a new breed of parallel algorithms implemented on parallel hardware platforms must be developed.

An increase in the degree of exploitable parallelism requires powerful parallel hardware architectures and algorithms. Graphics processing units (GPUs) provide high-performance computing with many floating-point cores. GPU devices exchange flexibility for raw computing power, and algorithms designed for a superscalar environment are ill-suited for high performance on a GPU. Instead of focusing on control units, branch predictors, and translation look-aside buffers, the GPU transistor budget heavily favors arithmetic logic units (ALUs). The GPU contains many cores, each of which executes code in a single program, multiple data (SPMD) fashion. Threads are grouped into logical units, called warps, that operate in a lock-step fashion. The programmer is responsible for managing his or her own memory hierarchy, and poor memory management is often a pitfall for many codes. Nevertheless, GPU-accelerated sparse methods are actively being researched and developed as the problems engineers face today become increasingly large.

Combining combinatorial and continuous formulations of the edge separator problem leads to an improvement in cut quality and partition balance ratio for large modern power-law graphs. Converting these improved solutions to the edge separator problem into solutions to the vertex separator problem can result in higher quality fill-reducing orderings via nested dissection. High quality fill-reducing orderings in turn lead to sparse direct methods that are more efficient in both time and memory.

When considering sparse QR factorization, a multifrontal QR factorization method implemented on GPU accelerators is more efficient than contemporary CPU-based multifrontal QR factorization methods because many dense fronts can be tiled and factorized simultaneously while avoiding costly memory transfers between the host machine and the GPU device.

To explore this thesis, a novel multilevel hybrid combinatorial-continuous graph partitioner was developed. Additionally, a novel GPU-accelerated sparse multifrontal QR factorization algorithm was designed and implemented. We measure the cut quality and runtime performance of the hybrid graph partitioner against the METIS 5.0.2 package for a wide variety of problems from the University of Florida Sparse Matrix Collection. We also measure the runtime performance of the GPU-accelerated sparse multifrontal QR factorization algorithm on a wide variety of problems from the same collection.

CHAPTER 2
REVIEW OF LITERATURE

Researchers have studied strategies for achieving high performance in sparse direct solvers for many years. Extracting high performance from sparse direct solvers relies on good fill-reducing matrix permutations that reduce the number of flops and memory required to factorize the system. In this space, several algorithms have been developed that exploit the graph-theoretic properties of sparse factorization. Empowered with good fill-reducing orderings, researchers turn to parallelization to extract additional performance from sparse direct methods, such as QR factorization. These parallelization efforts include investigating multifrontal methods, methods that avoid communication, and multicore, manycore, and heterogeneous computing.

2.1 Fill-Reducing Orderings

Research into fill-reducing orderings began with the observation that the order in which variables are eliminated plays a significant role in the performance of sparse direct solvers. The minimum fill-in problem is defined as reordering the rows and columns of a sparse symmetric matrix with the goal of making its triangular factor as sparse as possible. Following Garey and Johnson's seminal treatment of complexity and intractability [29], Yannakakis proved that computing the minimum fill-in is NP-complete [89]. Since then, researchers have focused on developing efficient heuristics to compute fill-reducing orderings. Rose showed that the fill-in problem can be alternatively defined as the problem of finding the smallest set of edges for an undirected graph such that the addition of these edges makes the graph chordal [82]. Computing fill-reducing orderings from a graph-theoretic foundation became the focus for many researchers.

2.1.1 Minimum Degree

The minimum degree algorithm is an early example of leveraging graph theory to produce a fill-reducing ordering for use in solving sparse linear systems. The approach traces its roots to Markowitz's efforts to permute the rows and columns of non-symmetric linear programming problems so as to minimize the number of off-diagonal entries in the pivot row and column [70]. Tinney and Walker noted potential problems with Markowitz's approach. They proposed a symmetric version in which rows along with their corresponding columns are permuted at the same time, such that pivot elements are always selected from the diagonal of the system [87].

Rose expanded on this early work to derive a graph-theoretic version in which the elimination is simulated symbolically [81]. Rose named this strategy the minimum degree algorithm. Rose's minimum degree algorithm takes a graph G, selects a vertex vmin of minimum degree, eliminates vmin from G, and forms a clique of vmin's adjacent vertices. This operation continues until G is empty. The order in which vertices are deleted becomes the fill-reducing elimination ordering. Degrees of vertices in the system may fluctuate as vertices are deleted and replaced by cliques, and Rose noted that resolving ties is an important consideration in the quality of the final result.
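To make Rose's procedure concrete, the following is a minimal Python sketch of the elimination loop just described. It is illustrative only: the function name and the dictionary-of-sets graph representation are our own, and production codes use quotient graphs and approximate degrees (as in AMD, below) rather than explicit cliques.

```python
def minimum_degree_ordering(adj):
    """Rose's minimum degree loop: repeatedly eliminate a minimum degree
    vertex and connect its neighbors into a clique (modeling fill-in).
    adj: dict mapping each vertex to the set of its neighbors."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}  # work on a copy
    order = []
    while adj:
        # Select a vertex of minimum degree; ties broken by vertex id,
        # although Rose notes tie-breaking deserves more care than this.
        vmin = min(adj, key=lambda v: (len(adj[v]), v))
        nbrs = adj.pop(vmin)
        order.append(vmin)
        for u in nbrs:  # form a clique of vmin's former neighbors
            adj[u].discard(vmin)
            adj[u] |= nbrs - {u}
    return order

# A 4-cycle: eliminating vertex 0 adds the fill edge (1, 3).
print(minimum_degree_ordering({0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}))
```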

Subsequent research focused on improving either the performance of minimum degree or the quality of the ordering. George and McIntyre worked on mass elimination: identifying vertices which could be immediately eliminated following the elimination of particular vertices in finite-element problems [36]. George and Liu exploited indistinguishable nodes by clustering sets of vertices whose adjacencies (along with the vertex itself) are identical and eliminating them in a single step [34]. Speelpenning developed the generalized element method, which uses an efficient elimination graph data structure, the quotient graph, that tracks cliques and elements instead of edges [84]. Speelpenning's approach allowed for a compact in-memory representation whose bound on the amount of required memory is the amount required to store the original problem [84]. Duff and Reid used the generalized element method to propose element absorption, a strategy whereby cliques are allowed to merge in Speelpenning's representation [24, 25]. In an effort to improve the quality of the minimum degree ordering, Liu investigated the external degree of vertices [65]. In this context, the external degree is defined for a vertex v as the number of adjacent vertices which are not indistinguishable from v itself. Liu's idea stemmed from George and McIntyre's work, in which many vertices in the clique resulting from the elimination of a vertex will be subsequently and immediately eliminated.

Amestoy, Davis, and Duff developed an approximate minimum degree (AMD) algorithm [1]. In AMD, the minimum degree is computed on the quotient graph, and the algorithm makes use of mass eliminations, clique merging, external degree, and indistinguishable nodes. Instead of using the exact external degree for vertex updates, AMD computes and uses an upper bound on the external degree. AMD tends to produce a higher quality fill-reducing ordering, and with greater speed, than its true-degree counterparts. Davis, Gilbert, Larimore, and Ng developed a column approximate minimum degree (COLAMD) ordering algorithm which computes the column preordering directly from the sparsity structure of the input problem A rather than from AᵀA [18]. COLAMD uses the symbolic LU factorization method and carefully selects pivot columns so as to reduce fill-in in the resulting factors. Both AMD and COLAMD are used in contemporary commercial software packages.

2.1.2 Nested Dissection

Alan George introduced nested dissection of a regular finite element mesh [30]. In nested dissection, George realizes the system as an undirected graph whose vertices correspond to the system’s rows and columns and whose edges correspond to nonzero entries. George first recursively finds vertex separators for the graph and the subgraphs generated via partitioning. George then computes the symbolic Cholesky factorization, ordering the variables corresponding to the vertices in each partition before those in the separator at each level. George showed that the amount of fill-in is bounded by the square of the size of the vertex separators at each level of recursion.
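George's recursion is compact enough to sketch. Below is a minimal illustration in which find_separator is a hypothetical stand-in for a vertex separator routine returning disjoint parts A and B plus a separator S; each part is ordered before its separator, exactly as described above.

```python
def nested_dissection_order(vertices, adj, find_separator, min_size=4):
    """Order each recursive part before its separator at every level."""
    if len(vertices) <= min_size:
        return list(vertices)  # small base case: order directly
    A, B, S = find_separator(vertices, adj)
    return (nested_dissection_order(A, adj, find_separator, min_size)
            + nested_dissection_order(B, adj, find_separator, min_size)
            + list(S))  # separator vertices are numbered last
```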

Lipton, Rose, and Tarjan use separators in a modified version of nested dissection to guarantee bounds of O(n log n) on fill-in for two-dimensional finite element mesh problems [63]. The authors called their method generalized nested dissection. The algorithm takes an integer parameter a along with a graph G that has √n vertex separators for its subgraphs. The algorithm first numbers vertices and arranges vertices into sets A, B, C. It then forms subgraphs G1 and G2 as graphs induced by the vertices in A ∪ C and B ∪ C, respectively, while eliminating edges in both subgraphs whose vertices are in C. The algorithm then recurses on G1 and G2. Vertices are renumbered along the way, except for those in C, as C is the separator at each level and is ultimately permuted to the end of the submatrix at each level.

Gilbert and Tarjan expanded on the work of Lipton, Rose, and Tarjan by finding fill-in upper bounds of O(n log n) for planar graphs, two-dimensional finite element meshes, graphs with fixed genus, and graphs of bounded degree that have vertex separators of size √n [41]. In Gilbert and Tarjan's work, they start with generalized nested dissection but at each level of recursion intentionally neglect vertices in C. Also in contrast with the prior method, Gilbert and Tarjan's algorithm recurses up to k times on the entire system, as opposed to once per subgraph at each level of recursion. Later, Gilbert proved that every graph with a fixed bound on vertex degree has a nested dissection ordering with fill-in within a factor of O(log n) of the minimum [38].

Natanzon, Shamir, and Sharan found the first polynomial approximation algorithm for the minimum fill-in problem by introducing a novel partitioning strategy [73]. They use an approximation to a graph partitioning strategy with an extra input parameter, k, defined as follows. First, define partitions A and B with all of the vertices initially in B. Next, move vertices on independent chordless cycles from B into A. Then move vertices on related chordless cycles that have independent paths from B into A. Then define a k-essential edge as an edge connecting the start and end of a chordless cycle of length k whose vertices are in A but have a path through B. Finally, either add k-essential edges into A or transfer vertices along chordless cycle paths from B into A. The trio of authors approximate the algorithm by setting k for the first two steps to be ∞, and they set k for the final step to be 0. They add a fourth step, namely finding a minimal triangulation of A. The authors prove that this is the first polynomial approximation algorithm for the minimum fill-in problem, and that it approximates to within 8x of the optimal minimum fill.

The nested dissection algorithm's fill-reducing ordering quality relies on being able to quickly compute high quality vertex separators. The vertex separator problem and its dual, the edge separator problem, are discussed next.

2.1.3 Graph Partitioning

Graph partitioning, specifically the edge separator problem and the vertex separator problem, is crucial to computing high quality fill-reducing orderings because these problems form the kernel of the nested dissection method. The edge separator problem is defined as taking a graph G = (V, E) with vertex set V and edge set E and arranging the vertices into partitions such that the sum of edge weights crossing partition boundaries is minimized and the size of the partitions is uniform. The dual problem, the vertex separator problem, seeks to partition the graph into three sets, A, B, and S, with S being a special set called the vertex separator. In the vertex separator problem, vertices in A and B may only have internal edges or edges into S. Additionally, the size of the vertex separator, S, should be minimized.

Kernighan and Lin at Bell Labs developed the first edge separator package for use at Bell Systems [60]. Their algorithm seeks to separate the graph into two partitions, A and B, such that the number of edges between A and B is minimized. Their approach uses the number of edges that lie across partitions as a heuristic, and their algorithm considers all pairs of vertices. The algorithm swaps two vertices va and vb if performing such an exchange results in a reduction of the number of edges that cross the boundary between partitions. Kernighan and Lin's initial algorithm operates in O(n³) time with n being the number of vertices to consider.

Fidducia and Mattheyes introduced a linear time heuristic, improving on Kernighan-Lin's swapping strategy, by ranking vertices using a metric called the "gain" of a vertex [28]. Fidducia and Mattheyes define the gain of a vertex as the sum of external edge weights minus the sum of internal edge weights. The initial Fidducia-Mattheyes algorithm constrains edge weights to integers and then uses a bin sort to order vertices by their gains in descending order. Fidducia-Mattheyes then swaps the partition of a vertex in order from greatest gain to smallest gain while updating the gain values for those neighbors of vertices that change their partition membership. Vertices are allowed to swap partitions once per application of the algorithm. Since each edge weight is considered exactly once per bin sort and moves are made until zero-weight gains are the only remaining moves, this is considered a linear time heuristic approach to the edge separator problem.
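To make the gain metric concrete, here is a minimal sketch of the gain computation and one FM-style pass. The vertex-locking rule follows the description above, but the bucket (bin sort) structure that makes the real algorithm linear time is replaced by a simple scan, and balance constraints are ignored; the weight map is an assumed symmetric dictionary.

```python
def gain(v, part, adj, weight):
    """FM gain: external edge weight minus internal edge weight."""
    return sum(weight[v, u] if part[u] != part[v] else -weight[v, u]
               for u in adj[v])

def fm_pass(part, adj, weight):
    """One pass: every vertex moves at most once, best gain first."""
    locked = set()
    while len(locked) < len(adj):
        free = [v for v in adj if v not in locked]
        best = max(free, key=lambda v: gain(v, part, adj, weight))
        if gain(best, part, adj, weight) <= 0:
            break  # only zero- or negative-gain moves remain
        part[best] = 1 - part[best]  # flip to the other partition
        locked.add(best)
    return part

adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
weight = {(a, b): 1 for a in adj for b in adj[a]}
print(fm_pass({0: 0, 1: 0, 2: 1, 3: 1}, adj, weight))  # moves vertices 2 and 3
```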

Hendrickson introduced multilevel partitioning with his CHACO package [50]. In multilevel edge separation, the input graph G is first coarsened until it is deemed small enough to apply expensive heuristics to form an initial partition before refining the graph back to the size of the original input. Karypis and Kumar later introduced the concept of the partition boundary to their algorithm [55, 56, 58]. Karypis and Kumar introduced strategies for matching vertices during graph coarsening, including Heavy Edge Matching (HEM), Sorted Heavy Edge Matching (SHEM), and Heavy Clique Matching (HCM) [59]. In HEM, vertices are selected in pseudorandom order, and each vertex matches to its heaviest unmatched neighbor. In SHEM, edges are sorted by decreasing edge weight, and vertices are selected in order of descending edge weight for matching. In HCM, Karypis and Kumar search the graph for 3-cliques, sort these cliques by sum of edge weights, and perform 3-way matchings of their constituent vertices.
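A minimal sketch of HEM as just described (pseudorandom visit order, heaviest unmatched neighbor); the symmetric edge-weight dictionary is an assumption of this illustration.

```python
import random

def heavy_edge_matching(adj, weight, seed=0):
    """HEM sketch: visit vertices in pseudorandom order; each unmatched
    vertex matches its heaviest unmatched neighbor, if any."""
    match = {}
    vertices = list(adj)
    random.Random(seed).shuffle(vertices)  # pseudorandom visit order
    for v in vertices:
        if v in match:
            continue
        candidates = [u for u in adj[v] if u not in match]
        if candidates:
            u = max(candidates, key=lambda u: weight[v, u])
            match[v], match[u] = u, v  # (v, u) becomes one coarse vertex
    return match
```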

Karypis and Kumar proposed a variety of region-growing techniques to develop an initial partition [55, 56, 58]. During refinement, Karypis and Kumar use iterations of Fidducia-Mattheyes to make adjustments to the partition along the partition boundary. Karypis and Kumar note that using this scheme does not, in general, reduce the asymptotic complexity, but it has been demonstrated that for 2D mesh problems there are an expected √n boundary vertices, which in turn can reduce the time complexity to O(n log n) [54].

Gupta later proposed Heavy Triangle Matching (HTM) [43], which is an expensive technique that seeks to match three vertices together. Unlike the Karypis-Kumar HCM, HTM treats the absence of an edge between two vertices as a 0-weight edge. HTM is similar to HEM in that both are local greedy heuristics that do not rely on a global sorting.

Early work on attacking the vertex separator problem came from George and Liu when developing an automatic nested dissection algorithm for irregular finite element problems [33]. Later, Lipton and Tarjan found an algorithm that finds vertex separators of size √n for planar graphs and two-dimensional finite element meshes [64].

Although the edge separator problem is different from the vertex separator problem, Liu [66], and later Pothen and Fan [78], noticed similarities between the two. In related work, the aforementioned authors showed that there exists a simple conversion from a solution to the edge separator problem to a solution to the vertex separator problem [66, 78]: by computing a minimum vertex cover for the edges in the cut set of the edge separator problem, the vertex cover set is the vertex separator for an equivalent vertex separator problem. However, even should an optimal solution to the edge separator problem be found, simply computing a minimum vertex cover does not guarantee that the resulting solution to the vertex separator problem is optimal.
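The conversion itself is simple to sketch. The version below substitutes the classic greedy 2-approximation for the minimum vertex cover that the cited work computes (via matching), so it only illustrates the shape of the conversion, not its quality guarantees.

```python
def separator_from_cut(cut_edges):
    """Convert an edge separator into a vertex separator via a cover.
    Greedy 2-approximation stands in for the minimum vertex cover used
    in the literature: take both endpoints of any uncovered edge."""
    cover = set()
    for u, v in cut_edges:
        if u not in cover and v not in cover:
            cover.update((u, v))
    return cover  # removing these vertices separates the two parts

# Edges crossing the partition boundary of some 2-way cut:
print(separator_from_cut([(1, 5), (2, 5), (3, 6)]))  # {1, 3, 5, 6}
```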

Hendrickson and Rothberg proposed a multilevel vertex separator algorithm combining aspects of Fidducia-Mattheyes, Minimum Weight Vertex Cover, and Approximate Minimum Degree [52, 83]. The algorithm first compresses the graph by merging indistinguishable nodes. The algorithm then applies nested dissection until each recursive component has fewer than n/32 vertices, where n is the number of vertices in the original graph. The algorithm then performs multilevel coarsening until each component discovered by nested dissection has fewer than 50 vertices. The algorithm's initial partitioning adds every vertex to the vertex separator and uses a modification of Fidducia-Mattheyes suited for the vertex separator problem. During every other refinement phase, the algorithm again applies the vertex separator version of Fidducia-Mattheyes, followed by finding a minimum weight vertex cover. In a final pass, the algorithm orders vertices remaining in the separator using constrained approximate minimum degree. The authors argue that finding an edge separator first, followed by a conversion to an equivalent vertex separator, produces a solution where the quantities being minimized are only indirectly related by the dual nature of the two problems.

Recently, continuous formulations of the edge separator problem and the vertex separator problem have been given by Hager and Krylyuk [45], Park [75], and Hager and Hungerford [46, 47]. The continuous formulations realize the edge separator and vertex separator problems as quadratic optimization problems. The authors produced software to compute continuous solutions to each using gradient projection methods, Lanczos ball optimizations, and continuous equivalents of Kernighan-Lin's swapping strategy.

Powerful fill-reducing ordering techniques are often critical to extracting high performance from sparse factorization algorithms. Spending computational time up front can pay dividends if several numerical factorizations occur with the same initial nonzero pattern. Research into the optimization of sparse factorization methods is discussed next.

2.2 Sparse QR Factorization

2.2.1 Early Work For QR Factorization

Early QR factorization methods ([10], [13], [35], [37], [32], [49], [48]) perform the factorization either as a series of Givens rotations, or as Householder factorization followed by Householder applications across the matrix (single vector and block applies). For sparse matrices, these methods fail to achieve high performance due to irregular memory accesses and a failure to optimize for cache-based memory hierarchies.

2.2.2 Multifrontal QR Factorization

In the multifrontal method, the factorization is completed as a series of dense subproblems. A fill-reducing ordering is first computed for the original problem and applied to the rows and columns of the system. Rows are grouped into supernodes in a strategy similar to the indistinguishable node approach of the minimum degree algorithm. Once the supernodes are found, the fill-reducing ordering is used to compute the frontal elimination tree, such that each node of the tree is a dense subproblem with a set of pivotal rows and a set of rows comprising the contribution block, which is data that must be passed to the node's parent in the tree. A multifrontal factorization proceeds in a bottom-up fashion, starting with the leaves of the frontal elimination tree and working its way to the root node. A parent may not begin until it has the contribution blocks from each of its children. The strategy was pioneered for symmetric indefinite matrices but was later extended to sparse LU ([25], [16], [12]) and sparse Cholesky factorizations [67]. Puglisi [79] first extended the multifrontal approach to sparse QR factorization. Other implementations include [3], [26], [68], [71], [72], [76], [85].

2.2.3 Parallelizing Multifrontal QR Factorization

Many opportunities exist for parallelizing the multifrontal QR factorization method, provided the child-parent data dependence is maintained. Algorithms may exploit parallelism in level order or decompose the frontal elimination tree into a forest which can be executed in parallel across multiple CPU cores or across multiple compute nodes in an MPI fashion.
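A minimal sketch of this tree-level task parallelism, with factorize_front standing in (hypothetically) for the per-front dense factorization and assembly; the dictionary-based tree representation is our own.

```python
from concurrent.futures import ThreadPoolExecutor

def factorize_forest(children, roots, factorize_front):
    """Tree-level task parallelism sketch for a multifrontal method.
    children maps a front to its child fronts; factorize_front(front,
    child_blocks) is assumed to assemble the children's contribution
    blocks, factorize the front, and return the front's own
    contribution block for its parent."""
    def run(front):
        kids = children.get(front, [])
        with ThreadPoolExecutor() as pool:  # independent subtrees overlap
            child_blocks = list(pool.map(run, kids))
        # A parent runs only after all of its children have returned.
        return factorize_front(front, child_blocks)

    with ThreadPoolExecutor() as pool:
        return list(pool.map(run, roots))
```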

Davis developed a CPU-based multithreaded multifrontal sparse QR factorization algorithm and an associated commercial-quality implementation: SuiteSparseQR [15]. SuiteSparseQR handles both structurally and numerically rank-deficient problems. Only one other prior sparse multifrontal QR factorization method considers rank deficiency ([76], based on [8]). This other method is numerically accurate but not well-suited to preserving Householder coefficients (stored in Q), whereas Davis' method can return Q as a compact set of sparse Householder vectors.

Davis' SuiteSparseQR symbolic analysis phase is modeled after his CHOLMOD package, specifically the Cholesky factorization of AᵀA. However, unlike every prior sparse QR method, SuiteSparseQR never explicitly forms AᵀA unless the user specifies an ordering method that requires it. Instead, Davis' method can order the matrix using COLAMD ([17], [18]), requiring O(|A|) memory. After the fill-reducing ordering, the row counts of R and the multifrontal structures are computed in nearly O(|A| + |R|) time [39] and in O(|A|) memory.

CPU-based multifrontal sparse QR factorizations can reveal and exploit a high degree of parallelism in a variety of ways. First, consider the frontal elimination tree. Davis exploits task-based parallelism at this level using Intel's Threading Building Blocks library (TBB [80]) to build tasks from independent subtrees gathered from the elimination tree. Because memory requirements are determined in the symbolic analysis phase, either exactly or as an upper bound, the factorization can be guaranteed to always succeed or always fail in a deterministic manner. Davis' method also exploits Level-3 BLAS [23] calls for block Householder applications across an LAPACK panel within a frontal matrix factorization. All recent high-performance BLAS implementations have the capability to exploit a shared-memory multicore architecture.

When introduced, Davis' method outperformed contemporary methods. SuiteSparseQR outperformed MATLAB®'s previous single-threaded sparse QR factorization method [40], which used the Givens-based methods of [31], [49]. Davis' method also outperformed MA49 ([3], [79]), a multifrontal sparse QR using AMD as its default ordering ([1], [2]). Davis' SuiteSparseQR also leverages modern multilevel graph partitioning-based fill-reducing ordering methods, namely the METIS [57] package. MATLAB® 7.8 used AMD and COLAMD for LU and Cholesky, but it used COLMMD for QR, including (x = A\b) when A is rectangular. COLMMD tends to result in more fill-in than the COLAMD or AMD ordering methods. When introduced, the performance improvements over previous methods were dramatic (sometimes as much as a 2000x speedup compared with MATLAB® 7.8) [15]. MATLAB® 7.9 and later switched to Davis' method in backslash (x = A\b) and qr.

2.3 GPU Computing

NVIDIA provides a CUDA BLAS for dense matrix operations such as matrix-matrix multiply; see http://www.culatools.com. It includes a dense QR factorization which can achieve 370 GFlops (single precision) on one C2050 Fermi GPU (130 GFlops in double precision). Li, Ranka, and Sahni developed algorithms for the same architecture that provide up to 3% improvement over the CUDA BLAS [62].

NVIDIA has also developed an efficient sparse-matrix-vector multiplication algorithm [7], which achieves 36 GFlops in single precision on a GeForce GTX 280 (with a peak performance of 933 GFlops). The performance of sparse-matrix-vector multiplication is limited by the GPU memory bandwidth, since it computes only 2 floating-point operations per nonzero in A.

2.4 Communication-Avoiding QR Factorization

Demmel et al. [20] have considered how to exploit the orthogonal properties of QR factorization to reduce communication costs in parallel methods for dense matrices. They called their method Communication-Avoiding QR (CAQR). CAQR realizes the contribution blocks as a communication overhead during factorization, as data needs to be transferred from child to parent. CAQR seeks to reduce this overhead by strategically performing redundant computations at each level in the tree. They argued that because of increasing memory latency, performing redundant computations on real hardware sometimes removes the dependence on memory units, thereby avoiding the memory latency problem altogether. Additionally, for tall and skinny QR (TSQR), they regard the column annihilation as a k-way reduction, where k is the number of cooperating processing units. Block rows are divided among k processors. Each annihilates its subproblem, and in a subsequent cleanup step, one processor performs the annihilation across subproblems. They apply these methods to many parallel computing environments, including GPU-based computing. However, their GPU-based methodology simply uses the GPU for block Householder applies while the annihilations are computed on the CPU.
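The k-way TSQR reduction is easy to state concretely. Below is a minimal NumPy sketch (the names and block count k are our own; the methods above operate on distributed blocks and GPU tiles rather than NumPy arrays). Each block row is factorized independently, and one cleanup QR annihilates across the stacked local R factors.

```python
import numpy as np

def tsqr(A, k=4):
    """TSQR k-way reduction sketch for a tall-and-skinny A.
    Only the final R is returned; recovering Q takes extra bookkeeping."""
    blocks = np.array_split(A, k, axis=0)             # divide block rows
    Rs = [np.linalg.qr(B, mode='r') for B in blocks]  # local factorizations
    return np.linalg.qr(np.vstack(Rs), mode='r')      # cleanup reduction

A = np.random.rand(1024, 16)
R = tsqr(A)
# R matches a direct QR of A up to the signs of its rows.
print(np.allclose(np.abs(R), np.abs(np.linalg.qr(A, mode='r'))))
```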

Anderson et al. provide a CAQR algorithm that runs on a single GPU [5]. Anderson's algorithm considers only dense problems, with a focus on providing good performance for tall and skinny problems. The algorithm implements Demmel et al.'s k-way reduction strategy for block annihilations on the GPU, and it performs block Householder updates on the GPU. The algorithm is able to achieve up to 15x speedup over the previous GPU implementation of CAQR. However, Anderson's algorithm does not consider large sparse matrices in which multiple dense QR factorizations occur on the GPU simultaneously.

2.5 Sparse Multifrontal Methods on GPU Accelerators

Krawezik and Poole [61], Lucas et al. [69], Pierce et al. [77], and Vuduc et al. [88] have worked on multifrontal factorization methods for GPUs. All four methods use the GPU as a BLAS accelerator, by transferring one frontal matrix at a time to the GPU and then retrieving the results. The assembly operations are done on the CPU. None of the methods regard QR factorization as a set of tasks associated with multiple frontal matrices all residing on the GPU simultaneously.

CHAPTER 3
HYBRID COMBINATORIAL-CONTINUOUS GRAPH PARTITIONER

3.1 Problem Description

3.1.1 Definition

The binary edge separator problem is an NP-complete problem defined as taking an undirected input graph, G = (V, E), and removing edges until the graph breaks into two disjoint subgraphs. The set of edges deleted in this manner is known as the "cut set." When partitioning a graph, we seek to minimize the number of edges in the cut set while maintaining a target balance in the ratio of vertices in each component. When the input graph has weighted vertices and edges, we generalize the problem definition by seeking to minimize the sum of edge weights for edges in the cut set rather than simply the number of edges. Further, the partition balance ratio is determined by considering the sum of vertex weights in each partition rather than the number of vertices in each partition. If weights are absent from vertices or edges, we assume a weight of 1.
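Under these definitions, evaluating a candidate 2-way partition is straightforward. A minimal sketch with the defaulting-to-1 weight convention above (representation choices are our own):

```python
def cut_weight_and_balance(part, adj, vweight, eweight):
    """Evaluate a 2-way partition: cut-set weight and balance ratio.
    part maps each vertex to 0 or 1; missing vertex or edge weights
    default to 1, matching the problem definition above."""
    cut = sum(eweight.get((u, v), 1)
              for u in adj for v in adj[u]
              if u < v and part[u] != part[v])
    w0 = sum(vweight.get(v, 1) for v in adj if part[v] == 0)
    total = sum(vweight.get(v, 1) for v in adj)
    return cut, w0 / total

# Path graph 0-1-2-3 split down the middle: one cut edge, perfect balance.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
print(cut_weight_and_balance({0: 0, 1: 0, 2: 1, 3: 1}, adj, {}, {}))  # (1, 0.5)
```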

In the k-way edge separator problem, edges are deleted until there are k disjoint subgraphs. When k is a power of two, the k-way edge separator problem can be solved recursively by solving the edge separator problem on each of the resulting disjoint components.

The vertex separator problem is a related NP-complete problem in which vertices are removed instead of edges until the graph breaks into two partitions. The set of removed vertices is the vertex separator. Once partitioned, vertices in one partition do not have edges to the other.

3.1.2 Applications

Graph partitioning arises in a variety of contexts including VLSI circuit design, dynamic scheduling algorithms, and fill-reducing orderings for sparse direct methods.

In VLSI circuit design, integrated circuit components must be arranged to allow uniform power demands across each silicon layer while simultaneously reducing the manufacturing costs by minimizing the required number of layers. Graph partitioning is used to determine when conductive material needs to be cut through to the next layer.

In the dynamic scheduling domain, task-based parallelism models dependencies using directed acyclic graphs. Graph partitioning is used to extract the maximum amount of parallelism for a set of nodes while maintaining a uniform workload, maximizing system utilization, and promoting high throughput.

Sparse direct methods use graph partitioning when computing fill-reducing orderings. The nested dissection method recursively computes vertex separators to minimize dependencies between subregions of the matrix to reduce fill-in during factorization.

3.1.3 Outline of this chapter

A brief discussion of multilevel edge separators appears in Section 3.2. Related work is discussed in Section 3.3. The main components of the project and their relationship are given in Section 3.4; results are provided in Section 3.5.

3.2 Multilevel Edge Separators

Multilevel edge separators seek to simplify the input graph in an effort to apply expensive partitioning techniques on a smaller problem. The motivation for such a strategy is that memory and computational resources are too limited to apply a swath of combinatorial analysis techniques directly to large input problems. By reducing the size of the input, more advanced techniques can be applied and their results carried back up to the original input problem.

The process whereby an input graph is simplified is known as “graph coarsening.” In graph coarsening, the original input graph is reduced through a series of vertex matching operations to an acceptable size [50]. Vertices are merged together using strategies that exploit geometric and topological features of the problem.

Figure 3-1. Multilevel graph partitioning

High degree vertices that arise in irregular graphs, particularly social networks, impede graph coarsening by reducing the maximum number of matches that can be made per coarsening phase. When the number of coarsening phases becomes proportional to the degree of a vertex, we say that coarsening has "stalled." Coarsening can stall in the presence of high degree vertices when the coarsening method considers only a vertex's immediate neighbors. High degree vertices often arise in power law graphs, such as graphs representing social networks.

Once the input graph is coarsened to a size suitable for more aggressive algorithms, an initial guess partitioning algorithm is used. Initial partitioning strategies accumulate a number of vertices into one partition such that the desired partition balance is satisfied.

Karypis and Kumar demonstrated that region-growing techniques, such as applying a breadth-first search from random start vertices, tend to find higher quality initial partitions than random guesses or first/last half guesses [58].

Once a satisfactory guess partition is achieved at the coarsest level, transmitting the partition back to the original input graph requires the inverse operation of graph coarsening, known as graph refinement. In graph refinement, vertices expand back into their original representations at the finer level. The partition choice for each coarse vertex is applied to all of the vertices that participated in the matching used during graph coarsening. Higher quality partitions, in terms of cut quality, can be achieved during refinement by changing the choices propagated from the coarse representation.

Because graph coarsening provides few guarantees about the uniformity of vertex and edge weights, traditional graph partitioning strategies are used to improve the initial guess partition as the graph is refined back to its original size.

3.3 Related Work

Kernighan and Lin at Bell Labs developed the first graph partitioning package for use at Bell Systems [60]. Their algorithm considers all pairs of vertices and makes swaps when a net gain in edge weights is detected.

Fidducia and Mattheyes improved upon the Kernighan-Lin swapping strategy by ranking vertices by using a metric called the “gain” of a vertex [28]. The Fidducia-Mattheyes algorithm constrains edge weights to integers and computes gains in linear time. The algorithm swaps the partitions of vertices in order from greatest to least gain while updating the gains of its neighbors. Vertices are allowed to swap partitions once per application of the algorithm.

Karypis and Kumar considered constraining swap candidates to those vertices lying in the partition boundary [55][56][58]. They show that although employing this strategy does not generally reduce the asymptotic complexity, it can reduce the time complexity for partitioning 2D meshes to O(n log n) [54].

Coarsening strategies, which dramatically affect the resulting cut quality and partition balance, arise in multilevel graph partitioning. Karypis and Kumar considered Heavy Edge Matching (HEM), Sorted Heavy Edge Matching (SHEM), and Heavy Clique Matching (HCM) [59]. Gupta considered Heavy Triangle Matching (HTM) [43].

Hager and Krylyuk demonstrated the quadratic programming formulations of the edge separator and vertex separator problems. Hager applied gradient projection and Lanczos ball optimizations to find local minimizers of the respective problems [44][45].

3.4 Hybrid Combinatorial-Quadratic Programming Approach

We use a multilevel approach that blends combinatorial methods with continuous edge separator strategies. Our algorithm for the edge separator problem occurs as a sequence of four parts: preprocessing, coarsening, initial partitioning, and refinement. Each of these components is discussed below.

3.4.1 Preprocessing

The algorithm verifies that the input graph is undirected, free from self-edges, and that vertex and edge weights (if provided) are positive.

The algorithm then uses a breadth-first search to discover the number of connected components comprising the input graph. As the algorithm visits each edge of the graph while computing the BFS, it also computes a preliminary matching using Heavy Edge Matching, and it computes the sum of edge weights, the sum of vertex weights, the average degree of the vertices, and the number of edges and vertices per component.

Once the connected components are discovered, they are sorted by the sum of vertex weights in descending order. The algorithm removes singleton connected components and loosens the desired partition balance accordingly. Singleton elements are added at the end of the algorithm on an as-needed basis to satisfy balance constraints. As an extension, the algorithm attempts to additionally exploit connected components with up to four vertices, but only if the input graph is unweighted. If the total number of non-exploitable connected components is fewer than 10, the algorithm checks every possible partitioning by connected component in a brute force manner. If the brute force approach cannot be used or fails to find a feasible solution, we try a greedy binary bin packing algorithm to place all but the largest connected component. The largest connected component is then the component passed to our hybrid multilevel partitioner.

3.4.2 Coarsening

Our algorithm's coarsening phase uses a new matching algorithm to prevent the coarsening operation from stalling. The details of this operation follow.

When the partitioner coarsens the graph, it iterates over the entire adjacency structure of the fine graph and constructs the set union of edges for the coarse representation. The algorithm seizes this opportunity to simultaneously perform heavy edge matchings for the coarse graph as it is being built. We call this strategy Jumpstart Heavy Edge Matching. Upon completion of this preliminary matching, if a vertex is unmatched, all of its neighbors are in a matching. Figure 3-2 illustrates the preliminary Heavy Edge Matching.

Figure 3-2. Graph coarsening using heavy edge matching

After the coarse graph has been constructed, the algorithm considers vertices which remain unmatched after the jumpstart matching. We perform a round of what we call stall-free matching. The description of stall-free matching follows.

First, the algorithm finds a suitable pivot neighbor. The unmatched vertex scans its adjacency list to find the neighboring matched vertex with maximum edge weight. We call this neighbor Vpivot.

Then the algorithm attempts to resolve unmatched neighbors of Vpivot pairwise, as illustrated in Figure 3-3. Since Vpivot has at least one unmatched neighbor, namely Vunmatched, the algorithm shifts its focus to resolve all the unmatched neighbors of Vpivot, with the hope that Vunmatched is not its only unmatched neighbor. The algorithm matches the unmatched neighbors of Vpivot pairwise. Although the vertices matched in this manner do not share an edge, they are topologically close in the graph.

Figure 3-3. Graph coarsening using heavy edge and Brotherly matching

Then the algorithm adopts any remaining unmatched neighbor, as illustrated in Figure 3-4. If there is an odd number of unmatched neighbors of Vpivot, the pairwise matching strategy leaves one neighbor unmatched. Instead, Vpivot includes this unmatched neighbor in its matching.

Figure 3-4. Graph coarsening using heavy edge, Brotherly, and adoption matching

This adoption creates at least a 3-way match, but it could result in a 4-way match, which our algorithm prevents using the following technique. The first time Vpivot's matching adopts a leftover unmatched vertex, it creates a 3-way matching. However, Vpivot is not the only participant in its matching. Its match partner may have already adopted a vertex from an earlier stall-free matching. In this case, by performing the adoption, Vpivot would create a 4-way matching. Instead, Vpivot creates a new match consisting of its would-be adoptee and the vertex its matching previously adopted, as illustrated in Figure 3-5. This strategy prevents 4-way matches from occurring while guaranteeing coarsening progress.

Figure 3-5. Graph coarsening using heavy edge, Brotherly, and community matching

We call this matching strategy Stall-Free Matching because it guarantees that, in a single pass over the unmatched vertices after a jumpstart matching, every vertex in the graph participates in a topologically relevant matching of at most 3 vertices. Stall-free matching also guarantees that the coarse graph has at most half the number of vertices of its predecessor.
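A minimal sketch of one stall-free round follows, under an assumed bookkeeping scheme (match groups stored as shared lists). It illustrates the pairwise "brotherly" rule and the adoption rule above, not the partitioner's actual implementation.

```python
def stall_free_round(adj, weight, match):
    """One stall-free round after jumpstart HEM. match maps each matched
    vertex to its match group (a shared list). Unmatched vertices pick a
    heaviest matched neighbor as pivot; the pivot's unmatched neighbors
    are paired off, and at most one leftover is adopted, never growing
    any group past 3 vertices."""
    for v in [u for u in adj if u not in match]:
        if v in match:
            continue  # already resolved via an earlier pivot
        matched = [u for u in adj[v] if u in match]
        if not matched:
            continue  # after jumpstart HEM this should not happen
        pivot = max(matched, key=lambda u: weight[v, u])
        orphans = [u for u in adj[pivot] if u not in match]
        for a, b in zip(orphans[0::2], orphans[1::2]):  # brotherly pairs
            match[a] = match[b] = [a, b]
        if len(orphans) % 2:  # odd leftover: adoption step
            group, leftover = match[pivot], orphans[-1]
            if len(group) >= 3:
                # Group adopted before: pair the leftover with the prior
                # adoptee instead, preventing a 4-way match.
                prior = group.pop()
                match[prior] = match[leftover] = [prior, leftover]
            else:
                group.append(leftover)
                match[leftover] = group
    return match
```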

3.4.3 Initial Partitioning

Our algorithm's initial partitioner uses region growing from a pseudoperipheral node, followed by an iteration of the boundary Fidducia-Mattheyes vertex swapping algorithm, and finally an iteration of Hager's gradient projection.

The algorithm uses breadth-first searching to find a pseudoperipheral node, and it assigns vertices in BFS order to the first partition until it exceeds the target partition balance. The remaining vertices are assigned to the second partition. Vertices with neighbors assigned to opposite partitions are added to the partition boundary vertex set.

The algorithm performs a round of Fidducia-Mattheyes but considers only those vertices in the boundary set. As swaps are made, vertices enter and leave the boundary set as the swaps place them near to or far from the boundary, respectively.

These partition choices for vertices are then used as an initializer for gradient projection. Because gradient projection is a continuous method, it computes the affinity of a vertex as a floating point value between 0 and 1. Our algorithm quantizes this result and interprets values of ≤ 0.5 as the first partition and values > 0.5 as the second partition. Hager’s method ensures that at most one vertex has a fractional affinity.
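A minimal sketch of the region-growing assignment and the quantization step just described (the pseudoperipheral start vertex is assumed to be given, and the names are our own):

```python
from collections import deque

def grow_initial_partition(adj, vweight, start, target=0.5):
    """Assign vertices to partition 0 in BFS order from the start vertex
    until the target vertex-weight fraction is exceeded; all remaining
    vertices go to partition 1."""
    total = sum(vweight.get(v, 1) for v in adj)
    part, grown = {}, 0.0
    queue, seen = deque([start]), {start}
    while queue and grown / total <= target:
        v = queue.popleft()
        part[v] = 0
        grown += vweight.get(v, 1)
        for u in adj[v]:
            if u not in seen:
                seen.add(u)
                queue.append(u)
    return {v: part.get(v, 1) for v in adj}

def quantize(affinity):
    """Round gradient projection's fractional affinities (in [0, 1])."""
    return {v: 0 if x <= 0.5 else 1 for v, x in affinity.items()}
```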

3.4.4 Refinement

Partition refinement for our algorithm consists of two parts working in tandem.

First, our algorithm uses a variation of the Fidducia-Mattheyes algorithm that exploits the boundary optimization first used with the Boundary Kernighan-Lin strategy. This implementation of Fidducia-Mattheyes maintains one heap per partition for boundary vertices. Vertices contained in these heaps represent swap candidates. Vertices in the heap are keyed by a sequential vertex identifier and are backed by the Fidducia-Mattheyes gain value. The top three entries of the heap are considered, and their heuristic values are computed. The vertex with the maximum heuristic gain swaps partitions.

We also introduce a balance-aware heuristic that considers the effect of performing a swap with respect to the target balance constraints. This heuristic value is derived from the following formulation.

We define the gain of a vertex Va as the sum of edge weights in the adjacency of Va that lie in the cut set minus the sum of edge weights in the adjacency of Va that do not lie in the cut set. These quantities are effectively the sum of edge weights from Va to vertices in its adjacency that lie in the other partition minus those in the same partition. Without loss of generality, suppose Va currently lies in P0 and a flip operation would transfer Va to P1. The definitions follow from this construction.

gain_a = ∑_{i ∈ P1} E_{a,i} − ∑_{j ∈ P0} E_{a,j}

We then define a scalar imbalance resulting from a swap of vertex Va as the ratio of the sum of vertex weights in the left partition after the move to the total sum of vertex weights W, minus the user's desired target balance ratio τ:

Imbalance_a = ( ∑_{i ∈ Pleft ∪ {Va}} V_i ) / W − τ

We finally define a heuristic value that combines the derived values, considering the vertex gains as well as the impact to the partition balance ratio should the vertex be swapped. In a way, this heuristic value considers both edge weights and vertex weights in one value:

Heuristic_a = gain_a + (2W · Imbalance_a if Imbalance_a ≥ τ, and 0 otherwise)

If the current cut is balanced, the penalty term contributes nothing to the heuristic value. If the current cut is imbalanced, then we impose a balance penalty of the measure of imbalance times twice the sum of node weights. The algorithm explores suboptimal moves after all obvious moves have been made.
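To make the heuristic concrete, the following sketch evaluates it for one candidate vertex. It is illustrative only: the function and variable names (gain, left_weight_after, and so on) are hypothetical, and it assumes the gain and the post-move left-partition weight have already been computed.

    /* Illustrative sketch of the balance-aware heuristic; all names are
     * hypothetical. gain is Gain_a; left_weight_after is the sum of vertex
     * weights in the left partition if vertex a were moved there; W is the
     * total vertex weight; target is the desired balance ratio. */
    double balance_aware_heuristic(double gain, double left_weight_after,
                                   double W, double target)
    {
        double imbalance = left_weight_after / W - target;
        double heuristic = gain;
        if (imbalance >= 0)        /* cut would be imbalanced: penalize */
            heuristic -= 2.0 * W * imbalance;
        return heuristic;
    }

The 2W factor makes the penalty dominate any achievable edge-weight gain once the balance constraint is violated, which matches the behavior described above.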

Our implementation of Fiduccia-Mattheyses may make several net-zero moves, which shift the partition boundary in an attempt to locate vertices with positive heuristic gains.

Because the algorithm is balance-aware, such exploratory moves do not significantly disrupt the target partition balance.

By combining the boundary optimization with a balance-aware heuristic, the algorithm is able to consider moves that imbalance the problem. However, it commits to such moves only if it discovers that doing so results in an extraordinary reduction in the weight of the cut set.

Following an application of our Balance-Aware Boundary Fiduccia-Mattheyses, discussed above, we use the discrete partition choices for each vertex as an initializer for gradient projection. Because the quadratic programming formulation ignores the combinatorial notion of boundary, it is capable of identifying vertices to swap that do not lie on the boundary. Gradient projection also adheres to strict balancing, and its local minimizers result in cuts with better balance than those of our Fiduccia-Mattheyses implementation.

3.5 Results

The experiments in this section were run on a single CPU core of a machine equipped with 24 AMD Opteron™ 6168 processors and 64 GB of shared memory.

3.5.1 Hybrid Performance

We examined the impact of using the hybrid combinatorial-quadratic programming algorithm against the combinatorial method and the quadratic programming method in isolation. In Figure 3-6, we plot the relative performance of each method in terms of cut quality and runtime, respectively. The plot considers how each algorithm performs on the same subset of problems. For each problem, we first compute the relative performance metrics of the three algorithms triplet-wise. We then sort each algorithm’s relative performance to determine the number of times the algorithm produces the best result (or ties for the best result). Values to the right of 1 indicate that the method is at least a factor of the x-value away from the best method.

Figure 3-6 (a) illustrates that using the combinatorial method alone (red) results in an inferior cut quality. Using the quadratic programming formulation (blue) in a multilevel setting tends to dominate the combinatorial method. Our hybrid approach (green) dominates either method by itself. Figure 3-6 (b) illustrates the relative speed of each method. The combinatorial method (red) dominates when run by itself, followed by the quadratic programming method (blue). The slowest is our hybrid approach (green). However, Figure 3-7 (b) compares the timing of this hybrid approach (green) against METIS 5 (blue). The figure suggests that no appreciable loss in speed is experienced when using the hybrid algorithm. Additionally, as Figure 3-7 (a) suggests, the hybrid approach’s cut quality (green) ties METIS 5 (blue).

Figure 3-6. Comparison of cut quality and timing between combinatorial, continuous, and our hybrid methodology on 1550 problems from the UF Collection [19]

Figure 3-7. Comparison of cut quality and timing between METIS 5 and our hybrid methodology on 1550 problems from the UF Collection [19]

Table 3-1. Selected problems from the University of Florida Sparse Matrix Collection [19]

Problem Name              Description                Vertices    Edges
SNAP/soc-Slashdot0811     Slashdot social network    77360       905468
Barabasi/NotreDame_www    Notre Dame web graph       325729      929849
SNAP/web-Stanford         Stanford web graph         281903      2312497

Table 3-2. Performance results for the 3 matrices listed in Table 3-1

Problem Name              Cut Edges (METIS)    Cut Edges (Hybrid)    Speedup
SNAP/soc-Slashdot0811     110745               43402                 1.26
Barabasi/NotreDame_www    9233                 1677                  1.20
SNAP/web-Stanford         6252                 1694                  1.40

3.5.2 Power Law Graphs

We examined our hybrid combinatorial-quadratic programming algorithm on power law graphs that arise in social networking and Internet networks. In Table 3-1, we describe a sample problem set composed of three power law graphs.

Table 3-2 suggests that our hybrid approach sometimes results in a significant improvement in cut quality over METIS 5. However, the time taken to partition the graph is not always less than that of METIS 5. We expanded our analysis to 25 power law problems found in the University of Florida Sparse Matrix Collection [19]. Figure 3-8 (b) suggests that the hybrid approach (green) requires time comparable to METIS 5 (blue).

Surprisingly, Figure 3-8 (a) suggests that the hybrid approach (green) finds superior graph cuts for power law graphs compared to METIS 5 (blue). We believe this is due to the following factors:

1. Our Coarsening Strategy is able to prevent stalling during coarsening while preserving topological features.

2. Algorithmic Cooperation. The combinatorial algorithm provides the quadratic programming formulation a guess partition that gradient projection can improve on. Conversely, the quadratic programming formulation exchanges vertices that are not necessarily on the partition boundary, overcoming a limitation of our combinatorial partitioner.

Figure 3-8. Comparison of cut quality and timing between METIS 5 and our hybrid methodology on 25 power law problems from the UF Collection [19]

3.6 Future Work

Opportunities for expanding this work include leveraging Hager’s formulation of the vertex separator problem to investigate the interplay between continuous edge separators, combinatorial edge separators, combinatorial vertex separators, and continuous vertex separators. Additionally, investigating the analogous principle in a multilevel vertex separator problem setting could produce similar benefits.

3.7 Summary

We demonstrated that combining combinatorial and continuous formulations of the edge separator problem leads to an improvement in cut quality and partition balance ratio for large power-law graphs. Our hybrid multilevel combinatorial-continuous graph partitioner ties METIS 5 for many problems. However, for power-law graphs that arise in social networking modalities, our graph partitioner finds higher quality graph cuts with similar runtime performance. More importantly, the hybrid approach finds higher quality cuts than either method acting in isolation.

Because graph partitioning lies at the heart of the nested dissection algorithm, finding good edge separators and translating those solutions into vertex separators by computing the minimum vertex cover leads to better fill-reducing orderings. With better fill-reducing orderings, we are able to reorder sparse linear systems prior to applying a direct method, such as QR factorization, in an effort to reveal, extract, and exploit additional parallelism from these sparse multifrontal direct methods.

CHAPTER 4
SPARSE MULTIFRONTAL QR FACTORIZATION ON GPU

4.1 Problem Description

QR factorization is an essential kernel in many problems in computational science. It can be used to solve sparse linear systems, sparse linear least squares problems, eigenvalue problems, rank and null-space determination, and many other mathematical problems [42]. Although QR factorization and other sparse direct methods form the backbone of many applications in computational science, the methods are not keeping pace with advances in heterogeneous computing architectures, in which systems are built with multiple general-purpose cores in the CPU coupled with one or more General Purpose Graphics Processing Units (GPGPUs), each with hundreds of simple yet fast computational cores. The challenge for computational science is for these algorithms to adapt to this changing landscape.

The computational workflow of sparse QR factorization [3, 14] is structured as a tree, where each node is the factorization of a dense submatrix (a frontal matrix [25]). The edges represent an irregular data movement in which the results from a child node are assembled into the frontal matrix of the parent. Each child node can be computed independently. An assembly phase after the children are executed precedes the factorization of their parent. In this chapter, we present a GPU-efficient algorithm for multifrontal sparse QR factorization that uses this tree structure and relies on a novel pipelined multifrontal algorithm for QR factorization that exploits the architectural features of a GPU. Our algorithm leverages dense QR factorization at multiple levels of the tree to achieve high performance.

4.1.1 Main Contributions

We developed a novel sparse QR factorization method that exploits the GPU by factorizing multiple frontal matrices at the same time, while keeping all the data on the GPU. The result of one frontal matrix (a contribution block) is assembled into the parent frontal matrix on the GPU, with no data transfer to/from the CPU.

We developed a novel scheduler algorithm that extends Communication-Avoiding QR factorization [21], where multiple panels of the matrix can be factorized simultaneously, thereby increasing parallelism and reducing the number of kernel launches on the GPU. The algorithm is flexible in the number of threads/SMs used for concurrently executing multiple dense QRs (of potentially different sizes). At or near the leaves of the tree, each SM works on its own frontal matrix. Further up the tree, multiple SMs collaborate to factorize a frontal matrix. The scheduling algorithm and software do not assume that the entire problem will fit in the memory of a single GPU. Rather, we move subtrees into the GPU, factorize them, and then move the resulting contribution block (of the root of the subtree) and the resulting factor (just R, since we discard Q) out of the GPU. This data movement between the CPU RAM and the GPU RAM is expensive, since it moves across the relatively slow PCI bus. We double-buffer this data movement, so that we can be moving data to/from the GPU for one subtree while the GPU is working on another.

For large sparse matrices, the GPU-accelerated algorithm offers up to 10x speedup over CPU-based QR factorization methods. It achieves up to 80 GFlops as compared to a peak of 15 GFlops for the same algorithm on a multicore CPU (with 24 cores).

4.1.2 Chapter Outline

Section 4.2 presents the background of sparse QR factorization and the GPU computing model. The main components of the parallel QR factorization algorithm are given in Section 4.4. In Section 4.5, we compare the performance of our GPU-accelerated sparse QR to Davis’ SuiteSparseQR package on a large set of problems from the UF Sparse Matrix Collection [19]. SuiteSparseQR is the sparse QR factorization in MATLAB® [14]. It uses LAPACK [4] for panel factorization and block Householder updates, whereas our GPU-accelerated code uses our own GPU compute kernels for both the panel factorization and the update step. Future work on this algorithm is discussed in Section 4.6. An overview of related work and a summary of this work are presented in Sections 4.3 and 4.7.

4.2 Preliminaries

An efficient sparse QR factorization is an essential kernel in many problems in computational science. Application areas that can exploit our GPU-enabled parallel sparse QR factorization are manifold. In our widely used and actively growing University of Florida Sparse Matrix Collection [19], we have problems from structural engineering, computational fluid dynamics, model reduction, electromagnetics, semiconductor devices, thermodynamics, materials, acoustics, computer graphics/vision, robotics/kinematics, optimization, circuit simulation, economic and financial modeling, theoretical and quantum chemistry, chemical process simulation, mathematics and statistics, power networks, social networks, text/document networks, web-hyperlink networks, and many other discretizations, networks, and graphs. Although only some of these domains specifically require QR factorization, most require a sparse direct or iterative solver. We view our QR factorization method as the first of many sparse direct methods for the GPU, since QR factorization is representative of many other sparse direct methods with both irregular coarse-grain parallelism and regular fine-grain parallelism.

In the next section, we briefly describe the multifrontal sparse QR factorization method and explain why we have selected it as our target for a GPU-based method. We then give an overview of the GPU computing landscape, which provides a framework for understanding the challenges we addressed as we developed our algorithm.

4.2.1 Multifrontal Sparse QR Factorization

4.2.1.1 Ordering phase

The first step in solving a sparse system of equations Ax = b or solving a least squares problem is to permute the matrix A so that the resulting factors have fewer nonzeros than the factors of the unpermuted matrix. This step is NP-hard, but many efficient heuristics are available. In particular, Davis’ COLAMD is very effective at reducing fill-in and takes time proportional to the number of nonzeros in A, on average and in practice [17, 18].

We can also take advantage of graph-partitioning based methods (METIS [57], SCOTCH [11], CHACO [51], etc.). These methods take very little time compared with the numerical factorization, and need only be done once in the (common) case where multiple matrices A with identical pattern but different values need to be factorized [13]. These ordering techniques are very irregular in their computation and are thus best suited to stay on the CPU.

4.2.1.2 Analysis phase

The second step is to analyze the matrix to set up the parallel multifrontal numerical factorization. This step determines the elimination tree, the (related) multifrontal assembly tree, the nonzero pattern of the factors, and the sizes of each frontal matrix. The row counts of R and the multifrontal structures are found in nearly O(|A| + |R|) time [39] and in O(|A|) memory, where |A| denotes the number of nonzeros in A. This can be much less than the time to form A^T A, particularly when m ≪ n. From this information, the coarse-grain parallelism (based on the tree) can be determined. In a multifrontal method, the data flows only from child to parent in the tree, which makes the tree suitable for exploiting coarse-grain parallelism, where independent subtrees are handled on widely separated processors. When a subtree is completed, it sends just a single message to its parent. This analysis phase takes time that is no worse than (nearly) proportional to the number of integers required to represent the nonzero pattern of the factors, plus the number of nonzeros in A. This can be much less than the number of nonzeros in the factors themselves. Like the ordering phase, this step can be done just once in the common case of factorizing a sequence of matrices with identical nonzero pattern. This step, too, is very irregular in nature and thus ill-suited to computation on the GPU.

The ordering and analysis steps are based on our existing multifrontal sparse QR method (SuiteSparseQR) [14]. Unlike all other prior methods, SuiteSparseQR can order and analyze the matrix in O(|A|) memory. In this method, each node in the tree represents one or more nodes in the column elimination tree. The latter tree is defined purely by the nonzero pattern of R, where the parent of node i is j if j > i is the smallest row index for which r_{ij} is nonzero. There is one node in the column elimination tree for each column of A.
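The parent relation can be implemented directly from this definition. The sketch below is purely illustrative (it assumes a dense 0/1 pattern of R for clarity, and its names are hypothetical); production codes such as SuiteSparseQR derive the tree from A in nearly O(|A|) time without ever forming R.

    /* Illustrative only: parent(i) in the column elimination tree is the
     * smallest j > i with r_ij nonzero. R_pattern is a dense n-by-n,
     * row-major 0/1 array here purely for clarity. */
    void column_etree_from_R(int n, const int *R_pattern, int *parent)
    {
        for (int i = 0; i < n; i++)
        {
            parent[i] = -1;               /* -1 marks a root of the tree */
            for (int j = i + 1; j < n; j++)
            {
                if (R_pattern[i * n + j]) /* first nonzero right of the diagonal */
                {
                    parent[i] = j;
                    break;
                }
            }
        }
    }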

A multifrontal assembly tree is obtained by merging nodes in the column elimination tree. A parent j and child j − 1 are merged if the two corresponding rows of R have identical nonzero pattern (excluding the diagonal entry in the child). In general, this requirement is relaxed, so that a parent and child can be merged if their patterns are similar but not necessarily identical (this is called relaxed amalgamation [6]). Figure 4-1 gives an example of both trees, and the related A and R matrices. Each x is a nonzero in A, each dot is an entry that will become nonzero as the matrix is factorized, and each r is a nonzero in R. Each node of the tree is a column of A or row of R, and nodes are grouped together when adjacent rows of R have the same nonzero pattern. The rows of A are sorted according to the column index of the leftmost nonzero in each row, so as to clarify the next step, which is the assembly of rows of A into the frontal matrices.

These row and column permutations need not be explicit, but can be represented with the original A plus a row and column permutation vector. However, if they are made explicit prior to passing the matrix to the GPU, then the GPU need not be saddled with the burden of computing the permutations as it accesses the matrix A. This is essential because the GPU can incur significant data access overheads when accessing memory through a permutation vector. Prepermuting the matrix avoids this problem and reduces the amount of data sent to the GPU. We refer to the permuted matrix as S in Section 4.4.3.

Figure 4-1. A sparse matrix A, its factor R, and its column elimination tree

In the assembly process, the incoming data for a frontal matrix is concatenated together to construct the frontal matrix. No flops are performed. Each row of A is assembled into a single frontal matrix. If the leftmost nonzero of row i is in column j, then row i is assembled into the frontal matrix containing node j of the elimination tree. Figure 4-2 illustrates a leaf frontal matrix with no children in the assembly tree. It is the first frontal matrix, and contains nodes 1 and 2 of the column elimination tree. Six rows of A have leftmost nonzeros in columns 1 or 2. These are concatenated together to form a 6-by-5 frontal matrix, held as a 6-by-5 dense matrix. Note that the dimensions of a frontal matrix are typically much less than the dimensions of A itself.

The frontal matrix does include some explicit zero entries, in the first column. This is due to the amalgamation of the two nodes into the front.

4.2.1.3 Factorization phase

This is where the bulk of the floating-point operations are performed (the remainder are done in the next step, the solve phase). All of the flops are computed within small dense frontal matrices (small relative to the dimensions of A, to be precise; the frontal matrices can be quite large). These computations are very regular, very compute-intensive (relative to the memory traffic requirements), and thus well-suited to be executed on one or more GPUs.

Continuing the example in Figure 4-1, after the 6 rows of A are assembled into the front, we compute its QR factorization, reflected in the matrix on the right side of Figure 4-2. In the figure, each r is a nonzero in R, each h is a nonzero Householder coefficient, and each c is an entry in the contribution block.

Figure 4-2. Assembly and factorization of a leaf frontal matrix

Figure 4-3 illustrates what happens in the factorization of a frontal matrix that is not a leaf in the tree. Three prior contribution blocks are concatenated and interleaved together, along with all rows of A whose leftmost nonzero falls in columns 5, 6, or 7 (the fully-assembled columns of this front). The QR factorization of this rectangular matrix is then computed.

Figure 4-3. Assembly and factorization of a frontal matrix with three children

Computation across multiple CPU cores and multiple GPUs can be obtained by splitting the tree in a coarse-grain fashion. At the root of each subtree, a single contribution block would need to be sent, or distributed, to the CPU/GPU cores that handle the parent node of the tree. Our current method exploits only a single GPU, but to handle very large problems, it splits the tree into subtrees that fit in the global memory of the GPU.

4.2.1.4 Solve phase

Once the system is factorized, the factors typically need to be used to solve a linear system or least-squares problem. These computations are regular in nature, but they perform a number of flops proportional to the number of nonzeros in the factors. The ratio of flop count per memory reference is quite low. Thus, we perform this computation on the CPU, and leave a GPU implementation for future work.

4.2.2 GPU Architecture

Before we consider the design of algorithms suitable for use on a GPU, we must consider an overview of GPU architecture. A GPU provides the promise of high-performance computing with many floating-point cores, but the cores in a GPU are not as flexible as the cores in a CPU. The transistor budget for GPUs heavily favors arithmetic logic units (ALUs) rather than control units, branch predictors, translation look-aside buffers, etc.

A GPU consists of a set of SMs (Streaming Multiprocessors), each with a set of cores that operate in a lock-step manner. The memory available on each SM is very limited (32K to 64K bytes) and must be shared among multiple blocks of threads.

Minimizing the memory traffic between slow GPU global memory and very fast shared memory is crucial to achieve high performance. Dense QR factorization has a very good compute-to-data-movement ratio and can achieve high performance even under these limitations.

Code written for execution on a GPU is called a GPU kernel, and kernels can be written using NVIDIA CUDA-C or OpenCL. When designing a kernel, the programmer imagines how the work can be divided into subparts that can execute in parallel. When launching a kernel, the programmer specifies the number of thread blocks and the number of threads per block the GPU should commit to the kernel launch. These kernel launch parameters describe how the work is intended to be divided by the GPU among its SMs, and GPUs have varying upper bounds on the allowed parameter values. Regardless of the number of available SMs or cores per SM for a particular GPU, the GPU’s scheduler assigns thread blocks to SMs and executes the kernel until each thread block completes its execution. The GPU scheduler organizes threads into collections of 32, called a warp, and the threads constituting a warp execute code in a Single Instruction, Multiple Thread (SIMT) fashion within an SM [74].

Each thread on the GPU has access to a small number of registers, which cannot be shared with any other thread.¹ The GPU in our experiments, an NVIDIA Tesla C2070, provides up to 31 double-precision floating-point registers for each thread.

Each SM has a small amount of shared memory that can be accessed and shared by all threads on the SM, but which is not accessible to other SMs. That is, the shared memory acts like a user-programmable cache for the SM, and there is no cache coherency across multiple SMs. The shared memory is arranged in banks, and bank conflicts occur if multiple threads attempt to access different entries in the same bank at the same time. Matrices are padded to avoid bank conflicts, so that a row-major matrix of size m-by-n is held in an m-by-(n + 1) array. Shared memory can be accessed in random order without penalty, so long as bank conflicts are avoided. The NVIDIA Tesla C2070 provides 48K of shared memory for each SM.

To communicate with other thread blocks, global memory on the GPU card must be used. This global memory is large, but its bandwidth is much smaller than the bandwidth of shared memory, and the latency is higher. Global memory is also limited in that it does not perform well if accessed in a permuted or irregular fashion, rather than a contiguous fashion. In contrast, the shared memory can be accessed in irregular order by the different threads in a thread block. Global memory can be read by all SMs, but must be read with stride-one access for best performance. If 32 threads in a warp access global memory locations i, i + 1, i + 2, ..., i + 31, then all of their memory accesses are coalesced into a single memory transaction. The NVIDIA Tesla C2070 has 6GB of global memory.

The GPU provides hardware support for fast warp-level context switching on an SM, and the GPU scheduler attempts to overlap global memory transactions with computation. The scheduler identifies and swaps warps performing global memory reads/writes with warps ready to perform arithmetic operations. While a memory transaction for one warp is pending, the SM executes another warp whose memory transactions are ready. Thus it is advantageous for programmers to design kernels that launch with multiple thread blocks and enough threads per block, so the GPU scheduler can leverage this memory latency hiding.

¹ New GPUs allow for some sharing of register data amongst the threads in a single warp, but our current algorithm does not exploit this feature.

Figure 4-4. High Level GPU Architecture [74]

All three layers of memory (global, shared, and register) must be carefully and explicitly managed for best performance, with multiple memory transactions between each layer “in flight” at the same time, with many warps, so that computation can proceed in one warp while another warp is waiting for its memory transaction to complete. If a matrix is to be accessed both by row and column order, it is best to copy it into shared memory first. Matrices accessed only in a single manner (by row, in particular, if the matrix is stored in row-major format) can be more easily loaded directly from global memory into register, bypassing shared memory.

The NVIDIA Tesla C2070 (Fermi) has 448 double-precision floating point cores. It operates at up to 515 GFlops, and at twice that speed in single precision. The 448 cores are partitioned into 14 thread groups called Streaming Multiprocessors, or SMs, of 32 cores each. Within an SM, the cores operate in lock-step fashion. A single SM of 32 cores has access to 64KB of shared RAM, typically configured as 16K of L1 cache and 48K of addressable shared memory for sharing data between threads in a block. The SM has 32K of register space, but these registers are partitioned amongst the threads, with no sharing of registers between threads. All 14 SMs share an L2 cache of 768KB. Sharing between SMs is done via the 6GB of global memory. Up to 16 kernels can be active concurrently on the Fermi architecture, on different SMs. At launch, the cost of this device was about $2200, or about $5 per GFlop. This is much less expensive than the same performance on multicore CPUs, and the power consumption of GPUs is a small fraction (1/20th) of that of multicore CPUs. These metrics provide a strong rationale for the development of algorithms that exploit general-purpose GPU computing.

4.3 Related Work

Anderson et al. [5] and Demmel et al. [20] consider how to exploit the orthogonal properties of QR factorization to reduce communication costs in parallel methods for dense matrices. They apply these methods to many parallel computing environments, including GPU-based computing. Their results are particularly important for our work, since our bucket scheduler is an extension of this idea. They do not consider the sparse case, nor the staircase form of our frontal matrices. They do not consider multiple factorizations and multiple assembly operations simultaneously active on the same GPU, as we must do in our sparse QR factorization.

NVIDIA provides a CUDA BLAS for dense matrix operations such as matrix-matrix multiply, and the CULA library (see http://www.culatools.com) includes a dense QR factorization that can achieve 370 GFlops (single precision) on one C2070 Fermi GPU (130 GFlops in double precision).

54 NVIDIA has also developed an efficient sparse-matrix-vector multiplication algorithm [7], which achieves 36 GFlops, in single precision, on a GeForce GTX 280 (with a peak performance of 933 GFlops). The performance of sparse-matrix-vector multiplication is limited by the GPU memory bandwidth, since it computes only 2 floating-point operations per nonzero in A. Sparse matrix multiplication has similarities to the irregular assembly step in the sparse multifrontal QR factorization.

Krawezik and Poole [61], Lucas et al. [69], Pierce et al. [77], and Vuduc et al. [88] have worked on multifrontal factorization methods for GPUs. All four methods exploit the GPU by transferring one frontal matrix at a time to the GPU and then retrieving the results. The assembly operations are done in the CPU. As far as we know, no one has yet considered a GPU-based method for multifrontal sparse QR factorization, and no one has considered a GPU-based multifrontal method (LU, QR, or Cholesky) where an entire subtree is transferred to the GPU, as is done in the work reported here.

4.4 Parallel Algorithm

The computational workflow of QR factorization is structured as a tree, where each node is the factorization of a dense submatrix. The edges represent an irregular data movement in which the results from a child node are assembled into the frontal matrix of the parent. Each child node can be computed independently. However, an assembly phase after the children are executed precedes the factorization of their parent. The dense QR factorization of each frontal matrix has a very good compute-to-data-movement ratio and can achieve high performance even under the memory constraints described in Section 4.2.2. Our algorithm is flexible in the number of threads/SMs used for concurrently executing multiple dense QRs (of potentially different sizes). At or near the leaves of the tree, each SM in a GPU works on its own frontal matrix. Further up the tree, multiple SMs collaborate to factorize a frontal matrix.

The scheduling algorithm does not assume that the entire problem will fit in the memory of a single GPU. Rather, the algorithm moves subtrees to the GPU, factorizes them, and moves the resulting contribution blocks and their resulting R factors off of the GPU. This data movement between CPU RAM and GPU RAM is expensive, since it moves across the relatively slow PCI bus. We double-buffer this data movement, so that we can be moving data to/from the GPU for one subtree while the GPU is working on another. Although hardware instructions are provided for atomic operations and intra-SM thread synchronization, GPU devices offer poor support for inter-SM synchronization.

Our execution model uses a master-slave paradigm where the CPU is responsible for building a list of tasks, sending the list to the GPU, synchronizing the device, and launching the kernel. Since a GPU has poor inter-SM synchronization, we construct the set of tasks so that they have no dependencies between them at all. The GPU receives the list of tasks from the CPU for each kernel launch, performing the operations described by each task. The kernel implementation is monolithic, inspecting the task descriptor to execute the appropriate device function. In this manner, a single kernel launch simultaneously computes the results for tasks in many different stages of the factorization pipeline. The CPU arranges tasks such that there are no data dependencies within the task list for a particular kernel launch. This software design pattern is called the überkernel [86]. Thus factorization of the matrix may take several kernel launches to complete. We launch each kernel asynchronously using the NVIDIA CUDA events and streams model. While one kernel is executing within a CUDA stream, the CPU builds the list of tasks for the next kernel launch. We use another stream to send the next list of tasks asynchronously. The CPU is responsible for synchronizing the device prior to launching the next kernel in order to ensure that the task data has arrived and that the previous kernel launch has completed.
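The überkernel pattern reduces to a single monolithic kernel that dispatches on a per-task descriptor. The sketch below is illustrative: the task types, descriptor fields, and device functions are hypothetical, not the actual data structures of our implementation.

    // Sketch of the uberkernel dispatch pattern (types and names
    // hypothetical). One kernel launch executes a heterogeneous,
    // dependency-free task list, with one task per thread block.
    enum TaskType { TASK_FACTORIZE, TASK_APPLY, TASK_APPLY_FACTORIZE, TASK_ASSEMBLY };

    struct TaskDescriptor
    {
        TaskType type;      // selects the device function for this task
        double  *F;         // frontal matrix (or tiles) the task operates on
        int      params[4]; // task-specific parameters (tile indices, ranges)
    };

    __device__ void factorize_task(const TaskDescriptor &t);        // hypothetical
    __device__ void apply_task(const TaskDescriptor &t);            // hypothetical
    __device__ void apply_factorize_task(const TaskDescriptor &t);  // hypothetical
    __device__ void assembly_task(const TaskDescriptor &t);         // hypothetical

    __global__ void uberkernel(const TaskDescriptor *tasks)
    {
        // Each thread block inspects its own descriptor and dispatches.
        TaskDescriptor t = tasks[blockIdx.x];
        switch (t.type)
        {
            case TASK_FACTORIZE:       factorize_task(t);       break;
            case TASK_APPLY:           apply_task(t);           break;
            case TASK_APPLY_FACTORIZE: apply_factorize_task(t); break;
            case TASK_ASSEMBLY:        assembly_task(t);        break;
        }
    }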

The details of the algorithm are presented in the next several subsections.

56 4.4.1 Dense QR Scheduler (the Bucket Scheduler)

Our CPU-based Dense QR Scheduler is comprised of a data structure representing the factorization state of the matrix together with an algorithm for scheduling tasks to be performed by the GPU. We call this algorithm and its data structure the bucket scheduler.

The algorithm partitions the input matrix into 32-by-32 submatrices that we call tiles. We use this term because the term block is so heavily overloaded already, and using it for our bucket scheduler algorithm would lead to confusion.² The choice of tile size reflects the thread geometry of the GPU and the amount of shared memory that each SM can access.

All tiles in a single row are collectively called a row tile, so that a row tile in a 256-by-160 matrix is a submatrix of size 32-by-160 containing 5 column tiles; a column tile is just a single tile. In our scheduler, we refer to a row tile by a single integer, its row tile index. The leftmost column tile in a row tile is the nonzero column tile with the least column tile index; all column tiles to its left are entirely zero. Each row tile has a flag indicating whether or not its leftmost column tile is in upper triangular form. The goal of the QR factorization is to reduce the matrix so that the kth row tile has leftmost column tile k in upper triangular form.

² The term block is often used in the context of block matrix algorithms that operate on submatrices, rather than vectors and scalars [27]. Furthermore, in the context of GPU computing, a block may also refer to a thread block executing on an SM [74]. Another usage of the term occurs in the commonly-used phrase “block Householder” [9]. We use the term block yet again in our GPU kernels, in this chapter, to refer to a very small submatrix operated on by a single thread (typically 4-by-4, 8-by-1, or 4-by-2, called the bitty block of a thread). Fortunately, we do not use the term “blocking” in the context of synchronizing threads in a parallel algorithm, but the phrase is often used that way by other authors [22, 53]. So to avoid confusion, the term tile is used here, exclusively, for the 32-by-32 submatrices operated on by our bucket scheduler algorithm.

57 All 32 rows in a row tile are contiguous. A set of two or more row tiles with the same leftmost column tile can be placed in a bundle, where the row tiles in a bundle need not be contiguous.

We place row tiles into column buckets, where row tile i with leftmost column tile j is placed into column bucket j. During factorization, row tiles move from their initial positions in the column buckets to the right, until each column bucket contains exactly one row tile with its flag set to indicate that it is upper triangular. Figure 4-5 shows a 256-by-160 matrix and its corresponding buckets after initialization. In the figure, tiles (7,1) and (8,1) are numerically all zero.

Figure 4-5. A 256-by-160 dense matrix blocked into 32-by-32 tiles and the corresponding state of the bucket scheduler.

The CPU is responsible for manipulating row tiles within bundles, filling a queue of work for the GPU to perform, and advancing row tiles across column buckets until exactly one row tile remains in each column bucket. Each round of factorization builds a set of tasks by iterating over the column buckets and symbolically manipulating their constituent row tiles.
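One plausible encoding of this bookkeeping, given purely for illustration (the type and field names are not taken from the implementation), is:

    /* Illustrative sketch of the bucket scheduler's CPU-side state. */
    typedef struct
    {
        int leftmost;     /* index of the leftmost nonzero column tile      */
        int triangular;   /* flag: leftmost column tile is upper triangular */
    } RowTile;

    typedef struct
    {
        int tiles[8];     /* row tile indices in this bundle, topmost first */
        int count;        /* number of row tiles in the bundle              */
    } Bundle;

    /* bucket[j] holds the row tiles whose leftmost column tile is j; the
     * factorization is complete when every bucket contains exactly one row
     * tile with its triangular flag set. */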

All row tiles in a bundle have the same leftmost column tile prior to factorization of the bundle. After factorization, the leftmost column tile of the topmost row tile is placed into upper triangular form, and the leftmost column tiles of the remaining row tiles are all zeroed out. The bundle now represents a set of row tiles to which a block Householder update must be applied, to all column tiles to the right of the leftmost column tile of the bundle.

We iterate over the column buckets, and for each column bucket we perform the operations described below, building a set of tasks to be executed by the GPU at each kernel launch. Referring to Figure 4-6, each image represents the tasks at each kernel launch and is color-coded by bundle. Gray tiles are unmodified by their respective kernel launches. White tiles are finished. The following operations continue until only one upper triangular row tile appears in each column bucket:

1. Generate bundles and build apply tasks on the CPU: Row tiles that are unassociated with a bundle become new bundles ready for factorization. Figure 4-6b illustrates this by grouping tiles into bundles of size 3 (red), 3 (magenta), and 2 (green). Factorize tasks are created for such bundles.

2. Launch the kernel with its current set of non-uniform tasks: Launch the GPU kernel and perform any queued Factorize and Apply tasks. Factorize tasks factorize the leftmost column tile in each bundle, and Apply tasks apply a block Householder update from a prior Factorize task in the previous kernel launch.

3. Advance the bundles on the CPU: Next we advance the bundles, leaving the topmost row tile in upper triangular form, as shown in Figure 4-6c for column bucket 1. These advancing bundles move to the next column bucket and represent a pending block Householder apply from the previous factorization step.

The bucket scheduler can further exploit parallelism and decrease the number of idle tiles (shown as gray in Figure 4-6) by following a block Householder update with an immediate factorization of the same bundle. In a modification of the procedure described above, we regard advancing bundles as candidates for pipelining. This pipelining approach occurs in two different scenarios.

Figure 4-6. Factorization of a 256-by-160 matrix in 12 kernel launches

1. The first scenario involves adding idle tiles to preexisting bundles. A row tile may become idle if its bundle has just been factorized and it is the only member of its bundle following bundle advancement. The kernel launch between Figure 4-6b and Figure 4-6c leaves tile (7,2) upper triangular and idle following bundle advancement (green bundle). Instead of leaving the tile idle, a new bundle could be formed, consisting of tiles (5,2), (6,2), and (7,2). This new bundle could be factorized immediately after the block Householder apply of the magenta bundle. We call this strategy bundle growth, and we call the set of newly added idle row tiles the bundle’s delta. Although the bundle delta does not participate in its host bundle’s block Householder apply, it does participate in a subsequent factorization of the host bundle. In the example discussed above, tile (7,2) would be the new bundle’s delta. This blue bundle can then be pipelined with the preexisting magenta bundle, where the magenta task that applies a Householder update to tiles (4,2), (5,2), and (6,2) can immediately follow this with the factorization of the blue bundle, instead of waiting until the next kernel launch. Comparing Figure 4-6c and Figure 4-7c illustrates the addition of this tile into the blue bundle in Figure 4-7c. With pipelining, the blue bundle is factorized in the second kernel launch (Figure 4-7c), rather than waiting until the third kernel launch (Figure 4-6d) when pipelining is not exploited.

2. The second scenario involves bundles that do not undergo bundle growth and are scheduled only for block Householder applies. The red bundle in Figure 4-6c is an example of this. When such bundles perform their block Householder applies, we can pipeline the factorization of the bundle’s leftmost column tile. Comparing Figure 4-6c to Figure 4-7c demonstrates that a new yellow bundle has been created, representing the pipelined task of performing the red bundle’s block Householder apply followed by a fresh factorization of the tiles contained therein. We call tasks representing this pipelined approach Apply/Factorize tasks. With pipelining, the yellow bundle with tiles (2,2) and (3,2) is factorized in the second kernel launch (Figure 4-7c), rather than waiting until the third kernel launch (Figure 4-6d) when pipelining is not exploited.

Without pipelining, the bundle performs its apply and factorize on the next kernel launch, resulting in a portion of the matrix remaining idle during that launch. For example, in Figure 4-6d, row tiles 2, 3, 5, 6, and 7 are mostly gray, except for a factorization of their leftmost column tiles. By comparison, these same row tiles are fully active in Figure 4-7d.

Figure 4-7. Pipelined factorization of the 256-by-160 matrix in 7 kernel launches

4.4.2 Computational Kernels on the GPU

In this section, we provide details of the key computational kernels required for the simultaneous dense QR factorization of multiple frontal matrices on the GPU.

The Factorize task factorizes the leading tiles of a bundle, producing a block Householder update (the V and T matrices) and an upper triangular factor R in the top row tile. The Apply task uses the V and T matrices to apply the block Householder to the remaining column tiles, to the right in this bundle.

Our tile size is selected to be a 32-by-32 submatrix, so that the row and column dimensions match the size of a warp (32 threads). Six tiles can fit exactly into the 48K of shared RAM available on each SM, but with padding this drops to five tiles. The V matrix is lower trapezoidal and T is a single upper triangular tile, so three tiles (plus one row) are set aside for V and T. Three tiles are used for V, and the upper triangular part of its topmost tile (where V is lower triangular) holds the matrix T. Two tiles hold a temporary matrix C.

4.4.2.1 Factorize kernel

The Factorize task requires a description of the bundle, the column tile, and a memory address to store T. The GPU factorizes the leftmost column tile of the bundle into upper triangular form (called A in this task) and overwrites it with the Householder vectors, V, and the upper triangular T matrix, which are then written back into GPU global memory. Saving V and T is necessary because on the next GPU kernel launch, they are involved in the block Householder application across the remaining columns in this bundle. The lower triangular portion of the topmost tile of V is stored together with T. Because both V and T contain diagonal values, the resulting memory space is a 33-by-32 tile with V offset by 1. We call this combined structure a VT tile. This in turn leaves the first tile in the bundle upper triangular, and the other tiles in the bundle contain V.

This description assumes a bundle of three tiles and a kernel launch with 384 threads per task, but we have other kernels for different bundle sizes.

1. Load A from global memory. All 384 threads in the task cooperate to load the bundle’s tiles from global memory (the A matrix of size m-by-n) into a single shared memory array of size m-by-(n + 1), where m = 96 and n = 32. No computation is performed while A is being loaded. After A is loaded, each thread loads into register the 8 entries of A along a single column for which it is responsible. It keeps these entries in register for the entire factorization. We call this 8-by-1 submatrix operated on by a single thread a bitty block.

2. Compute σ for the first column, where \( \sigma = \sum_{i=2}^{m} a_{i,1}^2 \) (the sum of squares of entries 2 through m). This computation is composed of two reduction operations. First, the 12 threads responsible for the first column of A compute the sum of squares using an 8-way fused multiply-add reduction in register memory, saving the final result into shared memory. Threads in the thread block synchronize to ensure they have all finished the first phase before entering the second phase of the reduction. We designate the first thread in the thread block to be the master thread. The master thread completes the computation with a 12-way summation reduction, reading from shared memory into register memory. The master thread retains σ in register during the factorization loop.

3. The main factorization loop iterates over the columns of A, performing the following operations (a) through (f) in sequence.

(a) Write the kth column of A back into shared memory. The threads responsible for the kth column write their A values from register back into shared memory. The SM then synchronizes all 384 threads before proceeding, maintaining memory consistency. Note that once the diagonal value is computed, the kth column becomes the kth Householder vector. Additionally, values above the diagonal are the kth column of R.

(b) Compute the kth diagonal. The master thread is responsible for computing the kth diagonal entry of R, the kth diagonal entry of V, and τ:

\[ s = \sqrt{a_{kk}^2 + \sigma} \]

\[ v_{kk} = \begin{cases} a_{kk} - s & \text{if } a_{kk} \leq 0 \\ \dfrac{-\sigma}{a_{kk} + s} & \text{if } a_{kk} > 0 \end{cases} \]

\[ \tau = \dfrac{-1}{s \, v_{kk}} \]

If σ is very small, the square root is skipped, and v_{kk} along with τ are set to 0. Because all threads will need v_{kk} and τ, the remaining threads synchronize with the master thread. (A scalar sketch of this computation appears after this step list.)

(c) Compute an intermediate z vector. All threads cooperate to perform a matrix-vector multiply to produce an intermediate vector z, which is used to compute the V and T matrices. The vector z is defined as

\[ z = -\tau \, v^T A_{k:m,\,1:n} \]

where v is the Householder vector, held in A_{k:m,k}. To compute z, each thread loads the entries of the v vector it requires from shared memory into register memory, from A_{k:m,k}. The thread’s bitty block of A is already in register (the 8 entries of A that the thread operates on). The calculation is done in two parts. First, each thread performs an 8-way partial dot product reduction using fused multiply-adds in register memory. The threads store the partial result in a 12-by-32 region of shared memory. The threads synchronize to guarantee that they have completed the operation before proceeding with the second phase of the calculation. The second phase of the calculation involves only a single warp. This warp performs the final 12-way summation reduction into shared memory, completing the calculation of z.

(d) Update A in register memory. All threads responsible for columns to the right of the kth column of A participate in updating A by summing values with the outer product of v and z:

\[ A_{k:m,\,k+1:n} = A_{k:m,\,k+1:n} + v \, z_{k+1:n} \]

Threads involved in the outer product computation already have v in register memory, and they need only load z from shared memory once to update their values of A. Each thread needs to load only a single value of z, since each bitty block is 8-by-1, in a single column of A.

(e) Compute the next σ value. Some threads participating in updating A in register memory may also begin to compute the next σ value if they are responsible for the (k + 1)st column of A. These threads participate in computing σ, and the process is the same as the computation of σ for the first column.

(f) Construct the kth column of T. The matrix T is augmented by appending the result of the matrix-vector multiply

\[ T_{1:k-1,\,k} = T_{1:k-1,\,1:k-1} \, z_{1:k-1}^T \]

Threads 1 to k − 1 are assigned to compute the kth column of T, where the ith thread performs the inner product to compute t_{ik}. Threads load values of T and z from shared memory, accumulating the result in register memory. Finally, the participating threads each write their scalar result t_{ik} from register memory into shared memory. The master thread writes t_{kk} = τ.

4. Store A, V, and T back into global memory. All 384 threads in the task cooperate to store the bundle’s tiles back into global memory. Since the first tile of V and T are held in a single VT tile in global memory, they are stored together to maintain coalesced global memory transactions. Once the VT tile is stored, the three tiles that hold A are stored back into global memory in the frontal matrix being factorized. The first tile of A is the upper triangular matrix R, and the remaining tiles are the second and third tiles of the Householder vectors, V.

Following a Factorize task, the corresponding bundle’s topmost tile contains R. The remaining leftmost column tiles contain V, which is used in the subsequent block Householder applies. The first tile of V (which is lower triangular) is stored together with the upper triangular T matrix in a separate 33-by-32 global memory space (the VT tile), since the top left tile in the frontal matrix now holds R. The VT tile remains only until the next kernel launch, when the block Householder update is applied to the column tiles to the right, in this bundle. At that point, the space is freed to hold another VT tile, from another bundle in this frontal matrix or in another one being factorized at the same time.

4.4.2.2 Apply kernel

Each Apply task involves a bundle, an originating column tile, a column tile range, and the location of the VT tile. The GPU loads the VT tile and iterates over the column tile range, performing the block Householder update: (1) C = V^T A, (2) C = T^T C, and (3) A = A − VC.

Since the V and T matrices are used repeatedly, and since V is accessed in both row and column order (V and V^T), they are loaded from global memory by the SM and held in shared memory until the Apply task completes. The temporary C matrix is also held in shared memory or register. The A matrix remains only in global memory, and is staged into a shared memory buffer and then into register, one chunk at a time. The algorithm is as follows:

• Load V and T. All 384 threads in the task cooperate to load the V and T matrices from global memory into a single shared memory array of size 97-by-32. Since the first tile of V and T are held in a single VT tile in global memory, they are loaded together to maintain coalesced global memory accesses. No computation is performed while V and T are being loaded.

• Apply the block Householder: A = A − V T^T V^T A. The A matrix is 3-by-t tiles in size in global memory, and represents a portion of the frontal matrix being factorized. Since V and T take up three tiles of shared memory, two tiles remain for a temporary matrix (C) required to apply the block Householder update. Registers also limit the size of C and the submatrix of A that can be operated on. If held in register, each thread can operate on at most a 4-by-4 submatrix of A or C (its bitty block). With 384 threads, this results in a submatrix of A of size 96-by-64, or 3-by-2 tiles. The same column dimension governs the size of C, which is 2-by-1 tiles in size.

Thus, the t column tiles of A are updated two at a time. Henceforth, to simplify the discussion, A refers to the 96-by-64 submatrix (3-by-2 tiles) updated in each iteration across the t column tiles. The block update is computed in three phases as (1) C = V^T A, (2) C = T^T C, and (3) A = A − VC, as follows:

1. Load A and compute C = V^T A. This work is done in steps of 16 rows each (a halftile), in a pipelined manner, where the data for the next halftile is loaded from global memory into shared memory while the current halftile is being computed. This enables the memory/computation overlap required for best performance.

69 The C matrix is held in register, so the 2 tiles of shared memory (for C) are used to buffer the A matrix. This 32-by-64 matrix is split into two buffers B0 and B1, each of size 16-by-64 (two halftiles).

All threads prefetch the first halftile (p = 0) into register, which is the topmost 16-by-64 submatrix of A. This starts the pipeline going. Next, C = V^T A is computed across six halftiles, one halftile (p) at a time:

for p = 1 to 6
    a. Write this halftile (p) of A from register into shared buffer B_{p mod 2}.
    b. syncthreads.
    c. Prefetch the next halftile (p + 1) of A from global memory into register.
    d. Compute C = V^T A, where A is in buffer B_{p mod 2}.
end for

In step (b), all threads must wait until all threads reach this step, since there is a dependency between steps (a) and (d). However, steps (c) and (d) can occur simultaneously since they operate on different halftiles. In step (d), each thread computes a 4-by-2 bitty block of C, held in register for phases 1 and 2 (only the first 256 threads do step (d); the other 128 threads remain idle and are only used for memory transactions in this phase). The global memory transactions for a warp are scheduled in step (c), but the warp does not need to wait for them to be completed before computing step (d) (they are not needed until step (d) of iteration p + 1). Likewise, no synchronization is required between step (d) of iteration p and step (a) of the next iteration p + 1.

Since steps (c) and (d) (for iteration p) can overlap with step (a) (for iteration p + 1), this algorithm keeps all parts of the SM busy at the same time: computation (step (d)), global memory (step (c)), and shared memory (steps (a) and (d)).

2. Compute C = T^T C. All matrices are now in shared memory. Each thread operates on the same 4-by-2 bitty block of C it operated on in phase 1, above, and now writes its bitty block into the two tiles of shared memory. These are no longer needed for the buffer B, but now hold C instead. Only the first 256 threads take part in this computation.

3. Compute A = A − VC, where V and C are in shared memory but A remains in global memory. The A matrix had already been loaded in from global memory once, in phase 1, but it was discarded since the limited shared memory is already exhausted by holding V, T, and the C/B buffer. Each of the 384 threads updates a 4-by-4 bitty block of A.

The layout of the bitty blocks of A and C is an essential component of the algorithm. Proper design of the bitty blocks avoids bank conflicts and ensures that A is accessed with coalesced global memory accesses. Both the A and C bitty blocks are spread across the matrices; they are not contiguous submatrices of A and C. The C matrix is 32-by-64 and is operated on by threads 0 to 255. Using 0-based notation, the 4-by-2 bitty block for thread i is defined as

\[ C[i] = \begin{bmatrix} c_{(i \bmod 8),\,(\lfloor i/8 \rfloor)} & c_{(i \bmod 8),\,(32+\lfloor i/8 \rfloor)} \\ c_{(8+i \bmod 8),\,(\lfloor i/8 \rfloor)} & c_{(8+i \bmod 8),\,(32+\lfloor i/8 \rfloor)} \\ c_{(16+i \bmod 8),\,(\lfloor i/8 \rfloor)} & c_{(16+i \bmod 8),\,(32+\lfloor i/8 \rfloor)} \\ c_{(24+i \bmod 8),\,(\lfloor i/8 \rfloor)} & c_{(24+i \bmod 8),\,(32+\lfloor i/8 \rfloor)} \end{bmatrix} \]

where c_{0,0} is the top left entry of C. For example, the bitty blocks of threads 0 and 1 are, respectively:

\[ C[0] = \begin{bmatrix} c_{0,0} & c_{0,32} \\ c_{8,0} & c_{8,32} \\ c_{16,0} & c_{16,32} \\ c_{24,0} & c_{24,32} \end{bmatrix}, \qquad C[1] = \begin{bmatrix} c_{1,0} & c_{1,32} \\ c_{9,0} & c_{9,32} \\ c_{17,0} & c_{17,32} \\ c_{25,0} & c_{25,32} \end{bmatrix} \]

The 4-by-4 bitty block of A for thread i is defined very differently than the C bitty block:

\[ A[i] = \begin{bmatrix} a_{(\lfloor i/16 \rfloor),\,(i \bmod 16)} & \cdots & a_{(\lfloor i/16 \rfloor),\,(48+i \bmod 16)} \\ a_{(24+\lfloor i/16 \rfloor),\,(i \bmod 16)} & \cdots & a_{(24+\lfloor i/16 \rfloor),\,(48+i \bmod 16)} \\ a_{(48+\lfloor i/16 \rfloor),\,(i \bmod 16)} & \cdots & a_{(48+\lfloor i/16 \rfloor),\,(48+i \bmod 16)} \\ a_{(72+\lfloor i/16 \rfloor),\,(i \bmod 16)} & \cdots & a_{(72+\lfloor i/16 \rfloor),\,(48+i \bmod 16)} \end{bmatrix} \]

(the hidden columns lie at offsets 16 + i mod 16 and 32 + i mod 16), so that thread 0 owns a_{0,0} and thread 1 owns a_{0,1}. When used in our algorithm, these layouts of the C and A bitty blocks ensure that all global memory accesses are coalesced, that no memory bank conflicts occur, and that no significant register spilling occurs in our kernels. (The index arithmetic is illustrated in the sketch at the end of this subsection.)

With a 4-by-4 bitty block for A, each thread loads in 8 values from shared memory (a 4-by-1 column vector of V and a 1-by-4 row vector of C), and then performs 32 floating point operations (a rank-1 outer product update of its 4-by-4 bitty block). This gives a flops-per-memory-transfer ratio of 4, which is essential because the floating point units for this particular GPU are 4 times faster than the register bandwidth. The 4-by-2 bitty block for C requires 6 loads for 16 operations, a ratio of 16/6 ≈ 2.67. Since this is less than 4, it is sub-optimal, but unavoidable in the context of the entire block Householder update.
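As noted above, the index arithmetic behind these layouts is simple. The following device function is an illustrative sketch rather than the actual kernel code; it computes the rows and columns of the C and A bitty blocks owned by thread i (0-based), directly from the definitions above.

    // Sketch of the bitty block index mapping (0-based). Thread i owns,
    // for C (32-by-64, threads 0..255), rows {i%8, 8+i%8, 16+i%8, 24+i%8}
    // and columns {i/8, 32+i/8}; for A (96-by-64, threads 0..383), rows
    // {i/16, 24+i/16, 48+i/16, 72+i/16} and columns {i%16, 16+i%16,
    // 32+i%16, 48+i%16}.
    __device__ void bitty_block_indices(int i,
                                        int c_rows[4], int c_cols[2],
                                        int a_rows[4], int a_cols[4])
    {
        for (int k = 0; k < 4; k++)
            c_rows[k] = 8 * k + (i % 8);
        c_cols[0] = i / 8;
        c_cols[1] = 32 + i / 8;

        for (int k = 0; k < 4; k++)
        {
            a_rows[k] = 24 * k + (i / 16);   // rows spread 24 apart
            a_cols[k] = 16 * k + (i % 16);   // columns spread 16 apart
        }
    }

With this A mapping, consecutive threads touch consecutive columns within a row, which is what allows the global memory accesses to coalesce.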

72 4.4.2.3 Apply/Factorize kernel

In an effort to reduce global memory traffic on the GPU, the Apply/Factorize pipelined task avoids superfluous global memory loads and stores for tiles of the matrix modified in both the Apply segment and the Factorize segment. For example, the last step of the Apply task performs a read-modify-write operation on the A matrix in global memory. For the Apply/Factorize task, however, A may instead be read from global memory, modified, and stored into shared memory, priming the immediate Factorize. This modification saves two global memory operations. Because the completion of a kernel launch synchronizes the device, we regard it as an expensive barrier synchronization. The completion of a kernel launch wipes data from shared memory, and the execution of our dense QR kernels ceases. When the bucket scheduler adds Apply/Factorize tasks to the GPU work list, it reduces the number of kernel launches (i.e., barriers), with the goal that reducing the number of kernel launches increases GPU occupancy and throughput. We discuss the performance impact of the Apply/Factorize pipelined tasks in Section 4.5.

4.4.3 Sparse QR Scheduler

The CPU-based Sparse QR Scheduler represents the factorization state of each dense front using a finite state machine, and it uses the Bucket Scheduler for the simultaneous factorization of each of those dense fronts. In other words, many bucket schedulers are active at the same time. The Sparse QR Scheduler manages both assembly and factorization kernel launches, coalescing the schedules of tasks from many assembly operations and many dense QR bucket schedulers into a single kernel launch.
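A host-side sketch of such a per-front state machine (the state names are ours, chosen to mirror the phases described in this section):

// Per-front finite state machine, advanced as each phase's tasks complete.
enum FrontState
{
    ALLOCATED,       // space reserved on the GPU
    S_ASSEMBLY,      // scattering input entries into the front
    CHILD_WAIT,      // waiting for all children's contribution blocks
    FACTORIZE,       // dense QR via a per-front bucket scheduler
    PACK_ASSEMBLY,   // pushing the contribution block into the parent
    DONE             // R rows transferred off the GPU
};

struct Front
{
    FrontState state;
    int parent;              // index of parent front, -1 for a root
    int children_remaining;  // countdown to leave CHILD_WAIT
};

void advance(Front &f)
{
    switch (f.state)
    {
    case ALLOCATED:     f.state = S_ASSEMBLY;    break;
    case S_ASSEMBLY:    f.state = CHILD_WAIT;    break;
    case CHILD_WAIT:    if (f.children_remaining == 0) f.state = FACTORIZE; break;
    case FACTORIZE:     f.state = PACK_ASSEMBLY; break;
    case PACK_ASSEMBLY: f.state = DONE;          break;
    case DONE:          break;
    }
}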

Figure 4-8 illustrates the assembly tree for a sparse matrix with 68 fronts. Arrows point in the direction of contribution block data flow from child to parent. Fronts with no children have been activated and are performing S-Assembly; they are identified as light blue leaves. The size of each node reflects the size of the corresponding frontal matrix.

Fronts that are leaves in the assembly tree have no children and are activated for factorization first, as illustrated in Figure 4-8. The scheduler builds S-Assembly tasks for each front. Once values from the input problem are in place within the dense front, the front must wait for the contribution blocks from its children to be assembled into it. Once every child of a frontal matrix completes, the scheduler advances that front into factorization, and the Bucket Scheduler is invoked to factorize the dense matrix.

Once dense factorization of the front completes, its rows of the result, the R factor, are ready to be transferred off the GPU. Further, its contribution block rows are ready to be assembled into its parent front. The scheduler builds Pack Assembly tasks to perform this operation. A front is finished when its rows of R are transferred off the GPU and its contribution block rows have been assembled into its parent.

We build tasks and execute kernels using a strategy similar to the Bucket Scheduler.

Using CUDA events and streams, the Sparse QR Scheduler builds a list of tasks to be completed by a kernel while the previous kernel executes on the GPU. This strategy affords us additional benefits. We are able to hide the latency of memory traffic between the GPU device and the CPU host. We perform a transfer of the R factor in a non-blocking fashion by initiating an asynchronous memory transfer on a CUDA stream and marking an event to record when the transfer completes. Furthermore, the R factor may become available before factorization completes. This occurs when the remaining factorization tasks involve only contribution block rows.
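A minimal sketch of the non-blocking R transfer using the CUDA runtime API (the buffer names are ours; R_host must refer to pinned memory, e.g. allocated with cudaMallocHost, for the copy to be truly asynchronous):

#include <cuda_runtime.h>

// Start an asynchronous device-to-host copy of R on a stream, and record an
// event so the host can later poll for completion instead of blocking.
void transfer_R_async(double *R_host, const double *R_device, size_t bytes,
                      cudaStream_t stream, cudaEvent_t done)
{
    cudaMemcpyAsync(R_host, R_device, bytes, cudaMemcpyDeviceToHost, stream);
    cudaEventRecord(done, stream);   // marks when the copy completes
}

bool transfer_finished(cudaEvent_t done)
{
    // cudaEventQuery returns cudaSuccess once all work preceding the event
    // on its stream (including our copy) has completed.
    return cudaEventQuery(done) == cudaSuccess;
}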

We choose to exploit the zeroes in the dense front by not transferring them to the GPU. Instead, we copy values from the input problem into their corresponding dense fronts on the GPU. We do not exploit the zeroes in the lower triangular portion of the R factor: instead of packing the R factor on the GPU, we simply transfer R as is.

In addition to the compute kernels used in the dense QR factorization, sparse QR factorization employs two kernels responsible for data movement:

Figure 4-8. Assembly tree for a sparse matrix with 68 fronts

• S-Assembly refers to scattering values from the sparse input problem, S, into the dense frontal matrices residing on the GPU. The CPU packs all S entries for fronts within a stage into a list of index-value tuples, and describes to the GPU where each front can find its S entries. Each value is copied within global memory to a frontal matrix at the location referred to by the index field of its tuple. The data movement is embarrassingly parallel, since multifrontal QR factorization relies on concatenation of the children's contribution blocks. This is in contrast to multifrontal LU or Cholesky factorization, where the contribution blocks of multiple children must be summed, not concatenated. We select a granularity with which to build S-Assembly tasks; in our implementation, each thread is responsible for moving 4 values into position. S-Assembly may occur concurrently with children pushing their contribution blocks into the front.

• Pack Assembly refers to scattering values from a front's contribution block into its parent. The CPU builds and sends two maps to the GPU that describe the correspondence between a front's row and column indices and its parent's row and column indices. We call these two maps Rimap and Rjmap, respectively. When a front completes its factorization step, the values in its contribution block are copied into its parent front. The CPU describes to the GPU where the front's contribution block begins, where its parent resides in GPU memory, the number of values to copy, and the locations of Rimap and Rjmap. The GPU reads Rimap and Rjmap into shared memory and uses shared memory as a cache for fast index translations. The data movement is embarrassingly parallel, as with S-Assembly, and we select a granularity that best suits GPU shared memory limits per streaming multiprocessor: a maximum Pack Assembly tile size of 2048 entries each of Rimap and Rjmap. A sketch of this scatter appears after this list.
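A sketch of how a Pack Assembly scatter could look as a CUDA kernel (a simplification under our own names; column-major storage with leading dimensions is assumed, and a tile's Rimap/Rjmap are assumed to fit the 2048-entry limit described above):

// Scatter a child's contribution block into its parent front, translating
// child-local row/column indices through Rimap/Rjmap cached in shared memory.
__global__ void pack_assembly(double *parent, int parent_ld,
                              const double *cblock, int cb_ld,
                              int cb_rows, int cb_cols,
                              const int *Rimap, const int *Rjmap)
{
    __shared__ int imap[2048], jmap[2048];
    for (int k = threadIdx.x; k < cb_rows; k += blockDim.x) imap[k] = Rimap[k];
    for (int k = threadIdx.x; k < cb_cols; k += blockDim.x) jmap[k] = Rjmap[k];
    __syncthreads();

    int n = cb_rows * cb_cols;
    for (int k = blockIdx.x * blockDim.x + threadIdx.x; k < n;
         k += gridDim.x * blockDim.x)
    {
        int i = k % cb_rows, j = k / cb_rows;       // child-local indices
        parent[imap[i] + jmap[j] * parent_ld] =     // parent-global location
            cblock[i + j * cb_ld];                  // copy, not sum: QR concatenates
    }
}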

4.4.4 Staging for Large Trees

During symbolic analysis, the CPU may discover that the amount of memory required to store the frontal matrices and assembly data on the GPU exceeds the total amount of memory available to the device. When this occurs, we switch to a strategy where we divide the assembly tree and perform the factorization in stages.

During symbolic analysis we compute a postordering of the fronts. We keep a list of stages to be executed by the GPU; each entry in the staging list is an index into the postordered list. As we iterate over the postordering, we keep a running summation of the memory required by each front. The memory required by a front in a stage is the sum of the number of entries in the front, the number of entries in its children's contribution blocks, and the number of entries of the original sparse input matrix that are to be assembled into the front. As we traverse the fronts in this postordered manner, a new stage is created whenever adding the next front would exceed the memory limit of the GPU.
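A host-side sketch of this staging computation, under our own names and assuming each individual front fits on the device by itself:

#include <cstddef>
#include <vector>

// Walk the postordered fronts, accumulate each front's memory requirement,
// and open a new stage when the running total would exceed GPU capacity.
std::vector<std::size_t> build_stages(const std::vector<std::size_t> &front_bytes,
                                      std::size_t gpu_capacity)
{
    std::vector<std::size_t> stage_start = {0};  // indices into the postorder
    std::size_t used = 0;
    for (std::size_t f = 0; f < front_bytes.size(); f++)
    {
        // front_bytes[f]: entries of the front, of its children's
        // contribution blocks, and of the input entries assembled into it
        if (used + front_bytes[f] > gpu_capacity)
        {
            stage_start.push_back(f);            // front f opens a new stage
            used = 0;
        }
        used += front_bytes[f];
    }
    return stage_start;
}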

Executing a staged sparse factorization uses the CPU-based Sparse QR Scheduler for each front in the stage. We transfer the relevant values from the original input problem and the assembly mappings, and we allocate space on the GPU for each front participating in the stage. We then invoke the Sparse QR Scheduler, flagging fronts whose parents are in subsequent stages, which signals the Sparse QR Scheduler to bypass the Pack Assembly phase. Such fronts are roots of the subtrees in Figure 4-9.

Figure 4-10 illustrates the second stage of a multi-stage factorization. Arrows point in the direction of contribution block data flow from child to parent. Fronts with no children have been activated and are performing either Pack Assembly (if the front was in stage 1) or S-Assembly (if it is new to this stage). Children performing Pack Assembly are identified as yellow leaves, and children performing S-Assembly are identified as light blue leaves.

When crossing staging boundaries, the contribution block must be marshalled into the next stage. We perform this marshalling at the end of a stage, when pulling rows of R from the GPU.

Figure 4-9. Stage 1 of an assembly tree for a sparse matrix with 68 fronts

Figure 4-10. Stage 2 of an assembly tree for a sparse matrix with 68 fronts

In addition to the rows of R, we also pull the contribution block rows into a temporary location in CPU memory. As we build the data for the next stage, we send the contribution block back to the GPU. When invoking the Sparse QR Scheduler for the next stage, we flag fronts whose only data is contribution blocks; those fronts begin factorization at the Pack Assembly phase, as illustrated in Figure 4-10.

We exploit zeroes during the marshalling of the contribution block by transferring the contribution block from the GPU into a temporary CPU workspace. We then perform a submatrix copy of just the contribution block, ignoring the zeroes in the pivotal columns. We do not exploit the zeroes in the lower triangular portion of the contribution block.
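One way to realize the submatrix extraction is a single pitched transfer; a sketch assuming column-major fronts with leading dimension ld (all names are ours):

#include <cuda_runtime.h>

// Pull a cb_rows-by-cb_cols contribution block, starting at (row0, col0) of
// the front, into a dense CPU workspace: one pitched copy instead of moving
// the entire front (the skipped pivotal columns hold only zeroes below R).
void marshal_contribution_block(double *cpu_work, const double *d_front,
                                size_t ld, size_t row0, size_t col0,
                                size_t cb_rows, size_t cb_cols)
{
    cudaMemcpy2D(cpu_work, cb_rows * sizeof(double),      // dst, dst pitch
                 d_front + col0 * ld + row0,              // src (submatrix origin)
                 ld * sizeof(double),                     // src pitch
                 cb_rows * sizeof(double),                // width in bytes (one column)
                 cb_cols,                                 // number of columns
                 cudaMemcpyDeviceToHost);
}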

4.5 Experimental Results

Experimental results were obtained on a single shared-memory system equipped with two 12-core AMD Opteron 6168 processors, 64 GB of shared memory, and an NVIDIA Tesla C2070 with 14 streaming multiprocessors (SMs) of 32 cores each. The Tesla C2070 has a total of 6 GB of memory, 4 GB of which is available as global device memory and 2 GB as texture memory.

We measured the performance of each of our compute kernels individually. Apply tasks are able to achieve up to 183.3 GFlops. Factorize tasks are able to achieve up to 23.62 GFlops. When a frontal matrix is small enough that it can be factorized by a single task, the VT tile need not be computed; in this case, factorize tasks are able to achieve up to 34.80 GFlops on a 72-by-64 problem. Factorize tasks suffer from a hefty initial serial fraction computing σ for the first column.

We also measured the performance of QR factorization for dense square and rectangular matrices, presented in Tables 4-1 and 4-2. In the tables, Canonical GFlops reflects the Golub and Van Loan flop count for factoring dense matrices [42], while GPU GFlops is based on the number of flops actually performed by the GPU device. The algorithm is able to achieve up to 31.83% of the Tesla C2070's peak theoretical double-precision performance.

Table 4-1. Performance of 1x16 "short and fat" dense rectangular problems.
Rows  Cols  Canonical GFlops  GPU GFlops
 128  2048             88.90       89.60
 256  4096            110.30      150.09
 384  6144            118.97      159.17

We compare our GPU-accelerated Sparse QR to Davis' SuiteSparseQR package on 650 problems from the UF Sparse Matrix Collection [19].

Table 4-2. Performance of 16x1 "tall and skinny" dense rectangular problems.
Rows  Cols  Canonical GFlops  GPU GFlops
2048   128             29.42       45.47
4096   256             47.40       69.80
6144   384             60.31       87.03

Table 4-3. Five selected matrices from the UF Sparse Matrix Collection [19].
Problem Name           Problem Type        Rows    Cols   Nonzeros  Intensity  Flopcount
Bomhof/circuit_2       circuit simulation  4510    4510   21199     7.1        0.02
LPnetlib/lp_cre_d      linear programming  8926    73948  246614    203.5      11.33
Dattorro/EternityII_A  optimization        150638  7362   782087    425.5      43.18
GHS_indef/olesnik0     2D/3D               88263   88263  744216    192.5      338.98
Qaplib/lp_nug20        linear programming  15240   72600  304800    2110.4     6947.30

SuiteSparseQR uses LAPACK for panel factorization and block Householder applies, while our GPU-accelerated code uses our GPU compute kernels to accomplish the same. In Table 4-3, we describe a sample problem set representing a variety of domains.

Intensity refers to arithmetic intensity, the number of floating point operations required to factorize the matrix divided by the amount of memory (in bytes) required to represent the matrix. Flopcount refers to the number of billions of floating point operations needed to factorize the matrix (a canonical count, not what the GPU actually performs). Table 4-4 shows the results for these 5 matrices on our CPU and our GPU, and the relative speedup obtained on the GPU.

Figure 4-11 illustrates each of the 5 matrices using a force-directed rendering scheme developed by Yifan Hu [19]. The force-directed rendering realizes each matrix as an undirected graph, revealing the complexity latent within each problem.

Force-directed renderings of this kind treat the vertices as point masses in space and the edges as springs, and the rendering algorithm attempts to assign geometric coordinates to the point masses in order to minimize the force exerted upon the springs.
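To make the mechanism concrete, a toy sketch of a single spring-relaxation step follows (far simpler than Hu's actual algorithm; all names are ours):

#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

struct Point { double x, y; };

// One relaxation step: each edge acts as a spring with rest length `rest`;
// each vertex moves a small step along its net spring force.
void spring_step(std::vector<Point> &pos,
                 const std::vector<std::pair<int, int>> &edges,
                 double rest = 1.0, double step = 0.01)
{
    std::vector<Point> force(pos.size(), Point{0.0, 0.0});
    for (const auto &e : edges)
    {
        int u = e.first, v = e.second;
        double dx = pos[v].x - pos[u].x, dy = pos[v].y - pos[u].y;
        double d = std::sqrt(dx * dx + dy * dy) + 1e-12;
        double f = (d - rest) / d;               // Hooke's law along the edge
        force[u].x += f * dx; force[u].y += f * dy;
        force[v].x -= f * dx; force[v].y -= f * dy;
    }
    for (std::size_t i = 0; i < pos.size(); i++)
    {
        pos[i].x += step * force[i].x;
        pos[i].y += step * force[i].y;
    }
}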

Figure 4-12 shows the speedup of our GPU-accelerated code over the SuiteSparseQR code as a function of arithmetic intensity for all 650 test matrices, on a logarithmic scale. Each dot represents an individual matrix in the collection.

Table 4-4. Performance results for the 5 matrices listed in Table 4-3.
Problem Name           CPU Time (s)  GPU Time (s)  CPU GFlops  GPU GFlops  Speedup
Bomhof/circuit_2       0.05          0.20          0.4         0.1         0.25
LPnetlib/lp_cre_d      4.05          0.61          2.8         18.5        6.68
Dattorro/EternityII_A  13.97         1.37          3.1         31.5        10.19
GHS_indef/olesnik0     42.36         5.20          8.0         65.2        8.14
Qaplib/lp_nug20        171.26        54.65         40.6        127.1       3.13


Figure 4-11. Force-directed renderings of 5 selected problems from the UF Collection

The green horizontal line indicates no speedup. Dots appearing above the green line experienced speedup when using the GPU-accelerated code; dots below the green line experienced slowdown.

Many problems experience significant speedup of up to 10x over the CPU-based method. Speedup is limited by two factors:

Figure 4-12. GPU-accelerated speedup over the CPU-only algorithm versus arithmetic intensity on a logarithmic scale.

1. Available parallel flops: Dense QR factorization offers O(n^3) flops for O(n^2) memory storage. As arithmetic intensity increases, the algorithm is able to exploit more parallelism than the CPU-based method. However, for small problems such as Bomhof/circuit_2 in Table 4-3, the algorithm is unable to exploit enough parallelism; as a result, the time to factorize small problems is dominated by memory transfer costs.

2. Hardware resources on the GPU: Current GPU devices offer several cores arranged into SMs, along with small amounts of fast shared memory per SM. Our algorithm is designed to flood the GPU device with many parallel tasks. However, as problem size grows with arithmetic intensity, we reach a performance asymptote as the amount of available GPU hardware resources begins to limit the performance of our algorithm.

We examined the impact of the pipelined factorization method described in Section 4.4.1, in which a bundle may be factorized immediately following a block Householder apply.

Pipelining reduces the number of kernel launches required to factorize the problem by nearly a factor of 2, and it increases the amount of parallel work sent to the GPU per kernel launch. Pipelining also ensures that nearly every tile of the matrix is modified at each kernel launch, and it leads to a significantly more uniform workload per kernel launch. However, in the context of our current GPU, the pipelined strategy requires 5% more time to factorize large problems. We anticipate that as GPU devices continue to add more SMs, or as we move to a multi-GPU algorithm with many GPUs, the pipelined factorization will eventually outperform the non-pipelined strategy.

4.6 Future Work

There are several opportunities for future work on sparse QR factorization and related problems. QR factorization is representative of many other sparse direct methods, with both irregular coarse-grain parallelism and regular fine-grain parallelism, and any methodologies developed here will be very relevant to these other methods.

Pipelined Factorization: Our pipelining strategy for factorizing a matrix affords more parallelism than our non-pipelined strategy. However, the pipelined factorization performs worse than the non-pipelined factorization on current hardware. We think this could be attributed to non-uniform task size granularity between Apply/Factorize pipelined tasks and Apply tasks: Apply tasks perform block Householder applies across several column tiles, while Apply/Factorize tasks apply to only a single column tile. We intend to investigate this in future work.

Leverage Multiple GPU Devices: Our current GPU-based sparse multifrontal QR factorization algorithm assumes the CPU scheduler interacts with a single GPU accelerator. We intend to investigate exploiting additional parallelism by using our staging strategy in a multi-GPU environment, where the CPU scheduler divides a stage into multiple units of work that can execute on multiple GPUs concurrently. The coordinating responsibilities of the CPU increase because the contribution blocks from child fronts on one GPU could assemble into parent fronts on another GPU across different stages.

Heterogeneous Computing: Our current algorithm regards the CPU as a master scheduler, responsible for managing bucket schedulers for each front, assigning tasks to the GPU, and transferring data between the host environment and the device. We imagine a more heterogeneous computing model in which, in addition to these responsibilities, the CPU also participates in the factorization itself. We will develop a performance model and adapt Davis' multicore sparse multifrontal QR factorization algorithm accordingly.

Distributed Memory Computing: We intend to extend our CPU scheduler to a distributed memory model using MPI. The scheduler would construct a staging scheme similar to the multi-GPU setting, in which front data and tasks are distributed to MPI nodes and executed concurrently.

Increasing Bundle Size: As GPU devices become more sophisticated, we expect the amount of shared memory available to each streaming multiprocessor to increase. Our current implementation limits our bundle and panel size to 3 to maximize the ratio of computation to memory traffic while remaining within the limits of available shared memory on the GPU. As GPU streaming multiprocessors gain more shared memory, we would increase our bundle and panel size limits accordingly. This would increase the amount of work described by each GPU task and reduce the total number of kernel launches required to factorize the matrix. We expect that pipelining in the bucket scheduler will be more effective with larger bundle sizes.

4.7 Summary

In this chapter, we presented a novel sparse QR factorization method tailored for use on GPU-accelerated systems. The algorithm is able to factorize multiple frontal matrices simultaneously, while limiting costly memory transfers between CPU and GPU.

The algorithm uses the master-slave paradigm where the CPU serves as the master and the GPU as the slave. We extend the Communication-Avoiding QR factorization [21] strategy using our bucket scheduler, exploiting a large degree of parallelism and reducing the overall number of GPU kernel launches required to factorize the problem.

The algorithm uses the überkernel design pattern, allowing many different tasks for many different fronts to be computed simultaneously in a single kernel launch.

Additionally, the algorithm schedules two flavors of assembly tasks that move data between memory spaces on the GPU. These assembly tasks are responsible for transferring data from a packed input into frontal matrices prior to factorization, as well as transferring data from child fronts to parent fronts. As fronts are factorized, their rows of R are asynchronously transferred off the GPU using the CUDA events and streams model.

For large sparse problems whose frontal matrices cannot simultaneously fit on the GPU, our algorithm examines the frontal matrix assembly tree and divides the fronts into stages of execution. The algorithm then moves data in stages to the GPU, factorizes the fronts within each stage, and transfers the results off the GPU. Contribution blocks are then passed back to the GPU, ready for push assembly.

For large sparse matrices, the GPU-accelerated code offers up to 10x speedup over CPU-based QR factorization methods, and achieves up to 80 GFlops as compared to a peak of 15 GFlops for the same algorithm on a multicore CPU (with 24 cores).

Our code is available at http://www.suitesparse.com.

CHAPTER 5
CONCLUSION

In this dissertation, we claimed that combining combinatorial and continuous formulations of the edge separator problem leads to an improvement in cut quality and partition balance for modern large power-law graphs. We also claimed that a sparse multifrontal QR factorization method implemented on GPU accelerators is more efficient than contemporary CPU-based multifrontal QR factorization methods.

Our hybrid multilevel combinatorial-continuous graph partitioner ties METIS 5 on many problems. However, for the power-law graphs that arise in social networking modalities, our graph partitioner finds higher quality graph cuts with similar runtime performance. More importantly, the hybrid approach finds higher quality cuts than either method acting in isolation. For our GPU-accelerated sparse multifrontal QR factorization, many problems experience significant speedup of up to 10x over the CPU-based method. In these cases, speedup is limited by the amount of parallel flops in the given problem as well as the available hardware resources on the current generation of GPUs.

This dissertation examined a hybrid multilevel approach to the edge separator problem using combinatorial methods together with continuous optimization techniques, specifically a quadratic programming formulation. A similar quadratic programming formulation exists for the vertex separator problem, as do analogs of the Fiduccia-Mattheyses algorithm in a multilevel vertex separator setting; examining the difference in fill-reducing orderings generated by these two methods remains unexplored. This dissertation also examined a GPU implementation of sparse multifrontal QR factorization tailored for execution on a single GPU device. Examining the computational speedup for multiple GPU devices on a single machine, for multiple GPU devices spread across multiple machines, and for heterogeneous computing where the CPU participates in the factorization remains unexplored. Algorithmically, investigating why the pipelined factorization is slower despite exposing more parallelism, and the effects of increasing the bundle size, remain unexplored. Finally, extending this work to sparse Cholesky and sparse LU factorization methods remains unexplored.

The broader impact of this work is to validate the value of intermixing continuous methods with combinatorial methods. By alternating between methods, our hybrid graph partitioner is able to find higher quality cuts than either method acting alone. With respect to GPU-accelerated sparse multifrontal QR factorization, this work illustrates the advantages of tailoring an algorithm to a hardware architecture focused on parallelism. Instead of simply reimplementing existing superscalar algorithms for GPU devices, a reimagining of the essential algorithm may be necessary to exploit all of its latent parallelism.

REFERENCES

[1] Amestoy, P. R., Davis, T. A., and Duff, I. S. "An approximate minimum degree ordering algorithm." SIAM J. Matrix Anal. Appl. 17 (1996).4: 886–905.

[2] ———. "Algorithm 837: AMD, an approximate minimum degree ordering algorithm." ACM Trans. Math. Softw. 30 (2004).3: 381–388.

[3] Amestoy, P. R., Duff, I. S., and Puglisi, C. "Multifrontal QR factorization in a multiprocessor environment." Numer. Linear Algebra Appl. 3 (1996).4: 275–300. URL http://dx.doi.org/10.1002/(SICI)1099-1506(199607/08)3:4<275::AID-NLA83>3.0.CO;2-7

[4] Anderson, E., Bai, Z., Bischof, C. H., Blackford, S., Demmel, J. W., Dongarra, J. J., Du Croz, J., Greenbaum, A., Hammarling, S., McKenney, A., and Sorensen, D. C. LAPACK Users' Guide. Philadelphia, PA: SIAM, 1999, 3rd ed.

[5] Anderson, Michael, Ballard, Grey, Demmel, James, and Keutzer, Kurt. "Communication-Avoiding QR Decomposition for GPUs." Tech. Rep. UCB/EECS-2010-131, EECS Dept., UC Berkeley, 2010.

[6] Ashcraft, C. C. and Grimes, R. "The Influence of Relaxed Supernode Partitions on the Multifrontal Method." ACM Trans. Math. Softw. 15 (1989).4: 291–309.

[7] Bell, N. and Garland, M. "Efficient sparse matrix-vector multiplication on CUDA." Tech. Rep. NVR-2008-004, NVIDIA, Santa Clara, CA, 2008. URL http://www.nvidia.com/object/nvidia_research_pub_001.html

[8] Bischof, C. H., Lewis, J. G., and Pierce, D. J. "Incremental Condition Estimation for Sparse Matrices." SIAM J. Matrix Anal. Appl. 11 (1990).4: 644–659.

[9] Bischof, Christian and Van Loan, Charles. "The WY representation for products of Householder matrices." SIAM J. Sci. Stat. Comput. 8 (1987).1: 2–13.

[10] Björck, Å. Numerical Methods for Least Squares Problems. Philadelphia, PA: SIAM, 1996.

[11] Chevalier, C. and Pellegrini, F. "PT-SCOTCH: a tool for efficient parallel graph ordering." Parallel Computing 34 (2008).6–8: 318–331.

[12] Davis, T. A. "A column pre-ordering strategy for the unsymmetric-pattern multifrontal method." Tech. Rep. TR-02-001, CISE Dept., Univ. of Florida, Gainesville, FL, 2002. (www.cise.ufl.edu; to appear in ACM Trans. Math. Softw.)

[13] ———. Direct Methods for Sparse Linear Systems. Philadelphia, PA: SIAM, 2006.

[14] ———. "Algorithm 915: SuiteSparseQR, a multifrontal multithreaded sparse QR factorization package." ACM Trans. Math. Softw. 38 (2011).1.

[15] ———. "Multifrontal multithreaded rank-revealing sparse QR factorization." ACM Trans. Math. Softw. (2011). To appear.

[16] Davis, T. A. and Duff, I. S. "An unsymmetric-pattern multifrontal method for sparse LU factorization." Tech. Rep. TR-93-018, CISE Dept., Univ. of Florida, Gainesville, FL, 1993. (Appeared in SIAM J. Matrix Anal. Appl., Jan. 1997.)

[17] Davis, T. A., Gilbert, J. R., Larimore, S. I., and Ng, E. G. "Algorithm 836: COLAMD, a column approximate minimum degree ordering algorithm." ACM Trans. Math. Softw. 30 (2004).3: 377–380.

[18] ———. "A column approximate minimum degree ordering algorithm." ACM Trans. Math. Softw. 30 (2004).3: 353–376.

[19] Davis, T. A. and Hu, Y. "The University of Florida sparse matrix collection." ACM Trans. Math. Softw. 38 (2011).1: 1:1–1:25.

[20] Demmel, J. W., Grigori, L., Hoemmen, M., and Langou, J. "Communication-avoiding parallel and sequential QR factorizations." Tech. rep., EECS Dept., UC Berkeley, 2008. URL http://techreports.lib.berkeley.edu/accessPages/EECS-2008-74.html

[21] ———. "Communication-optimal Parallel and Sequential QR and LU Factorizations." SIAM J. Sci. Comput. 34 (2012).1: 206–239.

[22] Dijkstra, E. W. "Solution of a problem in concurrent programming control." Commun. ACM 8 (1965).9: 569.

[23] Dongarra, J., Du Croz, J., Duff, I. S., and Hammarling, S. "Algorithm 679: A Set of Level 3 Basic Linear Algebra Subprograms." ACM Trans. Math. Softw. 16 (1990): 1–17, 18–28.

[24] Duff, I. S. and Reid, J. K. "A Comparison of Sparsity Orderings for Obtaining a Pivotal Sequence in Gaussian Elimination." J. Inst. Math. Appl. 14 (1974): 281–291.

[25] ———. "The Multifrontal Solution of Indefinite Sparse Symmetric Linear Equations." ACM Trans. Math. Softw. 9 (1983).3: 302–325.

[26] Edlund, O. "A software package for sparse orthogonal factorization and updating." ACM Trans. Math. Softw. 28 (2002).4: 448–482.

[27] Eves, Howard. Elementary Matrix Theory. New York: Dover Publications, 1980.

[28] Fiduccia, C. M. and Mattheyses, R. M. "A linear-time heuristic for improving network partitions." Proc. 19th Design Automation Conf. Las Vegas, NV, 1982, 175–181.

[29] Garey, Michael R. and Johnson, David S. Computers and Intractability: A Guide to the Theory of NP-Completeness. New York, NY: W. H. Freeman & Co., 1979.

[30] George, A. "Nested Dissection of a Regular Finite Element Mesh." SIAM J. Numer. Anal. 10 (1973).2: 345–363.

[31] George, A. and Heath, M. T. "Solution of Sparse Linear Least Squares Problems Using Givens Rotations." Linear Algebra Appl. 34 (1980): 69–83.

[32] George, A., Heath, M. T., and Ng, E. G. "Solution of sparse underdetermined systems of linear equations." SIAM J. Sci. Statist. Comput. 5 (1984).4: 988–997.

[33] George, A. and Liu, J. W. H. "An Automatic Nested Dissection Algorithm for Irregular Finite Element Problems." SIAM J. Numer. Anal. 15 (1978): 1053–1069.

[34] ———. Computer Solution of Large Sparse Positive-Definite Systems. Englewood Cliffs, NJ: Prentice-Hall, 1981.

[35] George, A., Liu, J. W. H., and Ng, E. G. "A Data Structure for Sparse QR and LU Factorizations." SIAM J. Sci. Statist. Comput. 9 (1988).1: 100–121.

[36] George, A. and McIntyre, D. R. "On the application of the minimum degree algorithm to finite element systems." SIAM J. Numer. Anal. 15 (1978): 90–111.

[37] George, Alan, Heath, Michael T., and Ng, Esmond. "A Comparison of Some Methods for Solving Sparse Linear Least-Squares Problems." SIAM J. Sci. Statist. Comput. 4 (1983).2: 177–187.

[38] Gilbert, J. R. "Some nested dissection order is nearly optimal." Information Processing Letters 26 (1988).6: 325–328.

[39] Gilbert, J. R., Li, X. S., Ng, E. G., and Peyton, B. W. "Computing row and column counts for sparse QR and LU factorization." BIT 41 (2001).4: 693–710.

[40] Gilbert, J. R., Moler, C., and Schreiber, R. "Sparse Matrices in MATLAB: Design and Implementation." SIAM J. Matrix Anal. Appl. 13 (1992).1: 333–356.

[41] Gilbert, John R. and Tarjan, R. E. "The analysis of a nested dissection algorithm." Numerische Mathematik 50 (1986): 377–404.

[42] Golub, Gene Howard and Van Loan, Charles F. Matrix Computations. Johns Hopkins Studies in the Mathematical Sciences. Baltimore: The Johns Hopkins University Press, 1996. URL http://opac.inria.fr/record=b1103116

[43] Gupta, A. "Fast and effective algorithms for graph partitioning and sparse matrix ordering." Tech. Rep. RC 20496 (90799), IBM Research Division, Yorktown Heights, NY, 1996.

[44] Hager, W. W. and Krylyuk, Y. "Graph partitioning and continuous quadratic programming." Tech. rep., Dept. of Mathematics, Univ. of Florida, Gainesville, FL, 1998. (See [45].)

[45] ———. "Graph partitioning and continuous quadratic programming." SIAM J. Disc. Math. 12 (1999): 500–523.

[46] Hager, William W. and Hungerford, James T. "A Continuous Quadratic Programming Formulation of the Vertex Separator Problem." Tech. rep., Univ. of Florida, 2012. URL http://www.math.ufl.edu/~hager/papers/GP/vertex.pdf

[47] ———. "Optimality conditions for maximizing a function over a polyhedron." Mathematical Programming (2013): 1–20.

[48] Heath, M. T. and Sorensen, D. C. "A Pipelined Givens Method for Computing the QR Factorization of a Sparse Matrix." Linear Algebra Appl. 77 (1986): 189–203.

[49] Heath, Michael T. "Some Extensions of an Algorithm for Sparse Linear Least Squares Problems." SIAM J. Sci. Statist. Comput. 3 (1982).2: 223–237. URL http://link.aip.org/link/?SCE/3/223/1

[50] Hendrickson, B. and Leland, R. "A multilevel algorithm for partitioning graphs." Tech. Rep. SAND93-1301, Sandia National Laboratories, 1993.

[51] ———. "An improved spectral graph partitioning algorithm for mapping parallel computations." SIAM J. Sci. Comput. 16 (1995).2: 452–469.

[52] Hendrickson, B. and Rothberg, E. "Improving the runtime and quality of nested dissection ordering." Tech. rep., Sandia National Laboratories, Albuquerque, NM, 1997.

[53] Herlihy, Maurice, Luchangco, Victor, and Moir, Mark. "Obstruction-Free Synchronization: Double-Ended Queues as an Example." ICDCS '03: Proceedings of the 23rd International Conference on Distributed Computing Systems. Washington, DC: IEEE Computer Society, 2003. URL http://portal.acm.org/citation.cfm?id=850929.851942

[54] Karypis, G. and Kumar, V. "Analysis of multilevel graph partitioning." Tech. Rep. TR-95-037, Computer Science Dept., Univ. of Minnesota, Minneapolis, MN, 1995.

[55] ———. "METIS: unstructured graph partitioning and sparse matrix ordering system." Tech. rep., Dept. of Computer Science, Univ. of Minnesota, 1995.

[56] ———. "Multilevel graph partitioning and sparse matrix ordering." Proc. Intl. Conf. on Parallel Processing, 1995.

[57] ———. "A fast and high quality multilevel scheme for partitioning irregular graphs." SIAM J. Sci. Comput. 20 (1998): 359–392.

[58] ———. "METIS: A software package for partitioning unstructured graphs, partitioning meshes, and computing fill-reducing orderings of sparse matrices." Tech. rep., Computer Science Dept., Univ. of Minnesota, Minneapolis, MN, 1998.

[59] Karypis, George and Kumar, Vipin. "Multilevel graph partitioning schemes." Proceedings of the International Conference on Parallel Processing 10 (1995).10: 113–122. URL http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.90.8807&rep=rep1&type=pdf

[60] Kernighan, B. W. and Lin, S. "An efficient heuristic procedure for partitioning graphs." Bell System Tech. J. 49 (1970): 291–307.

[61] Krawezik, Geraud and Poole, Gene. "Accelerating the ANSYS Direct Sparse Solver with GPUs." Proc. Symposium on Application Accelerators in High Performance Computing (SAAHPC). Urbana-Champaign, IL: NCSA, 2009. URL http://saahpc.ncsa.illinois.edu/09

[62] Li, J., Ranka, S., and Sahni, S. "GPU Matrix Multiplication." Handbook on Multicore Computing, ed. S. Rajasekaran. Chapman Hill, 2011 (to appear).

[63] Lipton, R. J., Rose, D. J., and Tarjan, R. E. "Generalized Nested Dissection." SIAM J. Numer. Anal. 16 (1979): 346–358.

[64] Lipton, R. J. and Tarjan, R. E. "A separator theorem for planar graphs." SIAM J. Appl. Math. 36 (1979): 177–189.

[65] Liu, J. W. H. "Modification of the Minimum-Degree Algorithm by Multiple Elimination." ACM Trans. Math. Softw. 11 (1985).2: 141–153.

[66] ———. "A graph partitioning algorithm by node separators." ACM Trans. Math. Softw. 15 (1989).3: 198–219.

[67] ———. "The multifrontal method and paging in sparse Cholesky factorization." ACM Trans. Math. Softw. 15 (1989).4: 310–325.

[68] Lu, S. M. and Barlow, J. L. "Multifrontal computation with the orthogonal factors of sparse matrices." SIAM J. Matrix Anal. Appl. 17 (1996).3: 658–679.

[69] Lucas, Robert, Wagenbreth, Gene, Davis, Dan, and Grimes, Roger. "Multifrontal Computations on GPUs and Their Multi-core Hosts." VECPAR'10: Proc. 9th Intl. Meeting on High Performance Computing for Computational Science, 2010. URL http://vecpar.fe.up.pt/2010/papers/5.php

[70] Markowitz, H. M. "The Elimination Form of the Inverse and Its Application to Linear Programming." Management Sci. 3 (1957): 255–269.

[71] Matstoms, P. "Sparse QR factorization in MATLAB." ACM Trans. Math. Softw. 20 (1994).1: 136–159.

[72] ———. "Parallel sparse QR factorization on shared memory architectures." Parallel Computing 21 (1995).3: 473–486.

[73] Natanzon, Assaf, Shamir, Ron, and Sharan, Roded. "A polynomial approximation algorithm for the minimum fill-in problem." Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, STOC '98, 1998, 41–47.

[74] NVIDIA Corporation. NVIDIA CUDA C Programming Guide, 2011.

[75] Park, S. C. A Continuous Quadratic Programming Approach to Two-Set Graph Partitioning. PhD thesis, Dept. of Mathematics, Univ. of Florida, 1999.

[76] Pierce, D. J. and Lewis, J. G. "Sparse Multifrontal Rank Revealing QR Factorization." SIAM J. Matrix Anal. Appl. 18 (1997).1: 159–180.

[77] Pierce, Dan, Hung, Y., Liu, C.-C., Tsai, Y.-H., Wang, W., and Yu, D. "Sparse multifrontal performance gains via NVIDIA GPU." Workshop on GPU Supercomputing. Taipei: National Taiwan University, 2009. URL http://cqse.ntu.edu.tw/cqse/gpu2009.html

[78] Pothen, A. and Fan, C. "Computing the Block Triangular Form of a Sparse Matrix." ACM Trans. Math. Softw. 16 (1990).4: 303–324.

[79] Puglisi, C. QR Factorization of Large Sparse Overdetermined and Square Matrices Using a Multifrontal Method in a Multiprocessor Environment. PhD thesis, Institut National Polytechnique de Toulouse, 1993. CERFACS report TH/PA/93/33.

[80] Reinders, J. Intel Threading Building Blocks: Outfitting C++ for Multi-core Processor Parallelism. Sebastopol, CA: O'Reilly Media, 2007.

[81] Rose, D. J. Symmetric Elimination on Sparse Positive Definite Systems and the Potential Flow Network Problem. PhD thesis, Applied Math., Harvard Univ., 1970.

[82] ———. "A Graph-Theoretic Study of the Numerical Solution of Sparse Positive Definite Systems of Linear Equations." Graph Theory and Computing, ed. R. C. Read. New York: Academic Press, 1973, 183–217.

[83] Rothberg, E. and Hendrickson, B. "Sparse matrix ordering methods for interior point linear programming." Tech. rep., Silicon Graphics, Inc., Mountain View, CA, 1996.

[84] Speelpenning, B. "The generalized element method." Tech. Rep. UIUCDCS-R-78-946, Dept. of Computer Science, Univ. of Illinois, Urbana, IL, 1978.

[85] Sun, Chunguang. "Parallel Sparse Orthogonal Factorization on Distributed-Memory Multiprocessors." SIAM J. Sci. Comput. 17 (1996).3: 666–685. URL http://link.aip.org/link/?SCE/17/666/1

[86] Tatarinov, A. and Kharlamov, A. "Alternative rendering pipelines using NVIDIA CUDA." Talk at SIGGRAPH 2009.

[87] Tinney, W. F. and Walker, J. W. "Direct Solutions of Sparse Network Equations by Optimally Ordered Triangular Factorization." Proc. IEEE 55 (1967): 1801–1809.

[88] Vuduc, Richard, Chandramowlishwaran, Aparna, Choi, Jee, Guney, Murat, and Shringarpure, Aashay. "On the limits of GPU acceleration." Proceedings of the 2nd USENIX Conference on Hot Topics in Parallelism, HotPar'10. Berkeley, CA: USENIX Association, 2010, 13–13. URL http://portal.acm.org/citation.cfm?id=1863086.1863099

[89] Yannakakis, M. "Computing the Minimum Fill-In is NP-Complete." SIAM J. Alg. Disc. Meth. 2 (1981): 77–79.

BIOGRAPHICAL SKETCH

Nuri grew up in Gainesville and worked professionally for seven years in software development while concurrently earning a bachelor's degree in computer engineering. Prior to graduate school, he worked at MindSolve Technologies full-time as a web developer for five years and at Sage Software as a software architect for two years. He was accepted into the CISE PhD program in 2008 and graduated in 2014.

He loved anything involving adrenaline and the outdoors, including diving, camping, riverboarding, sailing, fishing, mtb, downhill, sky, and big air. He participated in the 2008 Ekstremsportveko in Voss, Norway. He loved to travel, considered himself a "foodie," and had a passion for all-grain homebrewing. He was instrumental in saving his department from being dissolved during the 2011 budget crisis at the University of Florida. As a result, he became interested in campus politics. He became a senior organizer for the Graduate Assistants United, served as the 2012-2013 President of ASCIE, UF's CISE graduate student organization, and served on the 2012-2013 UF Mission Statement Task Force to rewrite UF's mission statement with a focus on 21st century learning objectives.

He was awarded a Graduate Teaching Award for the 2012-2013 academic year for excellence in teaching COP4331, Object-Oriented Programming.

Following his graduate school days, he took a position at Microsoft in Redmond, Washington where he worked on global static code analysis.
