HIGH PERFORMANCE COMPUTING with SPARSE MATRICES and GPU ACCELERATORS by SENCER NURI YERALAN a DISSERTATION PRESENTED to the GRAD

HIGH PERFORMANCE COMPUTING WITH SPARSE MATRICES AND GPU ACCELERATORS By SENCER NURI YERALAN A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY UNIVERSITY OF FLORIDA 2014 ⃝c 2014 Sencer Nuri Yeralan 2 For Sencer, Helen, Seyhun 3 ACKNOWLEDGMENTS I thank my research advisor and committee chair, Dr. Timothy Alden Davis. His advice, support, academic and spiritual guidance, countless one-on-one meetings, and coffee have shaped me into the researcher that I am today. He has also taught me how to effectively communicate and work well with others. Perhaps most importantly, he has taught me how to be uncompromising in matters of morality. It is truly an honor to have been his student, and I encourage any of his future students to take every lesson to heart - even when they involve poetry. I thank my supervisory committee member, Dr. Sanjay Ranka, for pressing me to strive for excellence in all of my endeavors, both academic and personal. He taught me how to conduct myself professionally and properly interface and understand the dynamics of university administration. I thank my supervisory committee member, Dr. Alin Dobra, for centering me and teaching me that often times the simple solutions are also, surprisingly, the most efficient. I also thank my supervisory committee members, Dr. My Thai and Dr. William Hager for their feedback and support of my research. They challenged me to look deeper into the problems to expose their underlying structures. I thank Dr. Meera Sitharam, Dr. Beverly Sanders, Dr. Richard Newman, Dr. Jih-Kwon Peir, Dr. Alireza Entezari, and Dr. Jorg¨ Peters for challenging me in the most interesting courses I have ever had the pleasure of taking. Their guidance and advice throughout the years have been invaluable. I thank Dean Cammy Abernathy for teaching me that life is not just about research. She taught me that we all have to sometimes stop ourselves from being distracted by minutia and focus on matters that truly have an impact on each others’ lives. She was the primary inspiration for me to become active in campus politics. As a result I learned 4 how to cultivate and foster lasting coalitions and raise awareness for Computer Science in the Gator Nation. I thank my mother and father for always being there for me and lavishing me with love, advice, and support. My parents have always been proud of me, but this is my opportunity to mention how proud I am of them. Without my parents, completing graduate school would not have been possible. I thank Sean and Jeanne Conner for providing me a home away from home in Tampa, Florida without judgements, reservations, pocket aces, or a 28-point crib. I thank my friends Dr. Joonhao Chuah, John Corring, Diego Rivera-Gutierrez, Michael Borish, Paul Accisano, Braden Dunn, Peter Dobbins, Kristina Sapp, Joan Crisman, Dr. Jose Roberto Soto, Graham Picklesimer, Rob Short, Kelsey Antle, Patrick Battoe, Nick Kesler, Lance Boverhof, Anthony LaRocca, Kevin Babcock, Luke McLeod, Cale Flage, and Evan James Ferl J.D. for challenging, supporting, and humoring me throughout the years. These are the best friends anyone could ever ask for, and I’m truly lucky to have you all in my life. 5 TABLE OF CONTENTS page ACKNOWLEDGMENTS .................................. 4 LIST OF TABLES ...................................... 8 LIST OF FIGURES ..................................... 9 ABSTRACT ......................................... 10 CHAPTER 1 INTRODUCTION ................................... 11 2 REVIEW OF LITERATURE ............................. 15 2.1 Fill-Reducing Orderings ............................ 15 2.1.1 Minimum Degree ............................ 15 2.1.2 Nested Dissection ........................... 17 2.1.3 Graph Partitioning ........................... 19 2.2 Sparse QR Factorization ............................ 23 2.2.1 Early Work For QR Factorization ................... 23 2.2.2 Multifrontal QR Factorization ..................... 23 2.2.3 Parallelizing Multifrontal QR Factorization .............. 23 2.3 GPU Computing ................................ 25 2.4 Communication-Avoiding QR Factorization ................. 25 2.5 Sparse Multifrontal Methods on GPU Accelerators ............. 26 3 HYBRID COMBINATORIAL-CONTINUOUS GRAPH PARTITIONER ...... 28 3.1 Problem Description .............................. 28 3.1.1 Definition ................................ 28 3.1.2 Applications ............................... 28 3.1.3 Outline of this chapter ......................... 29 3.2 Multilevel Edge Separators .......................... 29 3.3 Related Work .................................. 31 3.4 Hybrid Combinatorial-Quadratic Programming Approach .......... 32 3.4.1 Preprocessing .............................. 32 3.4.2 Coarsening ............................... 33 3.4.3 Initial Partitioning ............................ 35 3.4.4 Refinement ............................... 36 3.5 Results ..................................... 38 3.5.1 Hybrid Performance .......................... 38 3.5.2 Power Law Graphs ........................... 40 3.6 Future Work ................................... 41 3.7 Summary .................................... 41 6 4 SPARSE MULTIFRONTAL QR FACTORIZATION ON GPU ............ 43 4.1 Problem Description .............................. 43 4.1.1 Main Contributions ........................... 43 4.1.2 Chapter Outline ............................. 44 4.2 Preliminaries .................................. 45 4.2.1 Multifrontal Sparse QR Factorization ................. 45 4.2.1.1 Ordering phase ....................... 45 4.2.1.2 Analysis phase ........................ 46 4.2.1.3 Factorization phase ..................... 49 4.2.1.4 Solve phase ......................... 50 4.2.2 GPU architecture ............................ 50 4.3 Related Work .................................. 54 4.4 Parallel Algorithm ................................ 55 4.4.1 Dense QR Scheduler (the Bucket Scheduler) ............ 57 4.4.2 Computational Kernels on the GPU .................. 63 4.4.2.1 Factorize kernel ....................... 64 4.4.2.2 Apply kernel ......................... 68 4.4.2.3 Apply/Factorize kernel .................... 73 4.4.3 Sparse QR Scheduler ......................... 73 4.4.4 Staging for Large Trees ........................ 77 4.5 Experimental Results ............................. 80 4.6 Future Work ................................... 84 4.7 Summary .................................... 86 5 CONCLUSION .................................... 87 REFERENCES ....................................... 89 BIOGRAPHICAL SKETCH ................................ 96 7 LIST OF TABLES Table page 3-1 Selected problems from the University of Florida Sparse Matrix Collection [19] 40 3-2 Performance results for the 3 matrices listed in Table 4-3 ............. 40 4-1 Performance of 1x16 “short and fat” dense rectangular problems. ........ 80 4-2 Performance of 16x1 “tall and skinny” dense rectangular problems. ....... 81 4-3 Five selected matrices from the UF Sparse Matrix Collection [19]. ....... 81 4-4 Performance results for the 5 matrices listed in Table 4-3. ............ 82 8 LIST OF FIGURES Figure page 3-1 Multilevel graph partitioning ............................. 30 3-2 Graph coarsening using heavy edge matching .................. 33 3-3 Graph coarsening using heavy edge and Brotherly matching .......... 34 3-4 Graph coarsening using heavy edge, Brotherly, and adoption matching .... 34 3-5 Graph coarsening using heavy edge, Brotherly, and community matching ... 35 3-6 Comparison of cut quality and timing between combinatorial, continuous, and our hybrid methodology on 1550 problems from the UF Collection [19] ..... 39 3-7 Comparison of cut quality and timing between METIS 5 and our hybrid methodology on 1550 problems from the UF Collection [19] ................... 39 3-8 Comparison of cut quality and timing between METIS 5 and our hybrid methodology on 25 power law problems from the UF Collection [19] .............. 41 4-1 A sparse matrix A, its factor R, and its column elimination tree ......... 48 4-2 Assembly and factorization of a leaf frontal matrix ................ 49 4-3 Assembly and factorization of a frontal matrix with three children ........ 50 4-4 High Level GPU Architecture [74] .......................... 53 4-5 A 256-by-160 dense matrix blocked into 32-by-32 tiles and the corresponding state of the bucket scheduler. ............................ 58 4-6 Factorization of a 256-by-160 matrix in 12 kernel launches ............ 60 4-7 Pipelined factorization of the 256-by-160 matrix in 7 kernel launches ...... 62 4-8 Assembly tree for a sparse matrix with 68 fronts ................. 75 4-9 Stage 1 of an assembly tree for a sparse matrix with 68 fronts .......... 78 4-10 Stage 2 of an assembly tree for a sparse matrix with 68 fronts .......... 79 4-11 Force-directed renderings of 5 selected problems from the UF Collection ... 82 4-12 GPU-accelerated speedup over the CPU-only algorithm versus arithmetic intensity on a logarithmic scale. ................................ 83 9 Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy HIGH PERFORMANCE COMPUTING WITH SPARSE MATRICES AND GPU ACCELERATORS By Sencer Nuri Yeralan December 2014 Chair: Timothy Alden Davis Major: Computer Engineering Sparse matrix factorization relies on high quality fill-reducing orderings in order to limit the

HIGH PERFORMANCE COMPUTING with SPARSE MATRICES and GPU ACCELERATORS by SENCER NURI YERALAN a DISSERTATION PRESENTED to the GRAD

Recursive Approach in Sparse Matrix LU Factorization

Schoolhouse Migrates to Norway

A Survey of Direct Methods for Sparse Linear Systems

A Fast Implementation of the Minimum Degree Algorithm Using Quotient Graphs

State-Of-The-Art Sparse Direct Solvers

On the Performance of Minimum Degree and Minimum Local Fill

Tiebreaking the Minimum Degree Algorithm For

Ekstremsport, Business Og Bulyst

Components of a Successful Strategy Execution Process in an Adventure Tourism Destination Therese Lundqvist

Sport Futura Awards 22-23 April, Göteborgs Stadsmuseum Syftet Med Priset?

An Approximate Minimum Degree Ordering Algorithm

On Computing Min-Degree Elimination Orderings