Unstructured Computations on Emerging Architectures

Dissertation by
Mohammed A. Al Farhan

In Partial Fulfillment of the Requirements
For the Degree of Doctor of Philosophy

King Abdullah University of Science and Technology
Thuwal, Kingdom of Saudi Arabia
May 2019

EXAMINATION COMMITTEE PAGE

The dissertation of M. A. Al Farhan is approved by the examination committee.

Dissertation Committee:
David E. Keyes, Chair, Professor, King Abdullah University of Science and Technology
Edmond Chow, Associate Professor, Georgia Institute of Technology
Mikhail Moshkov, Professor, King Abdullah University of Science and Technology
Markus Hadwiger, Associate Professor, King Abdullah University of Science and Technology
Hakan Bagci, Associate Professor, King Abdullah University of Science and Technology

©May 2019
Mohammed A. Al Farhan
All Rights Reserved

ABSTRACT

This dissertation describes detailed performance engineering and optimization of an unstructured computational aerodynamics software system with irregular memory accesses on various multi- and many-core emerging high performance computing scalable architectures, which are expected to be the building blocks of energy-austere exascale systems, and on which algorithmic- and architecture-oriented optimizations are essential for achieving worthy performance. We investigate several state-of-the-practice shared-memory optimization techniques applied to key kernels for the important problem class of unstructured meshes. We illustrate for a broad spectrum of emerging microprocessor architectures as representatives of the compute units in contemporary leading supercomputers, identifying and addressing performance challenges without compromising the floating-point numerics of the original code.
While the linear algebraic kernels are bottlenecked by memory bandwidth for even modest numbers of hardware cores sharing a common address space, the edge-based loop kernels, which arise in the control volume discretization of the conservation law residuals and in the formation of the preconditioner for the Jacobian by finite-differencing the conservation law residuals, are compute-intensive and effectively exploit contemporary multi- and many-core processing hardware. We therefore employ low- and high-level algorithmic- and architecture-specific code optimizations and tuning in light of thread- and data-level parallelism, with a focus on strong thread scaling at the node-level. Our approaches are based upon novel multi-level hierarchical workload distribution mechanisms of data across different compute units (from the address space down to the registers) within every hardware core. We analyze the demonstrated aerodynamics application on specific computing architectures to develop certain performance metrics and models to bespeak the upper and lower bounds of the performance. We present significant full application speedup relative to the baseline code, on a succession of many-core processor architectures, i.e., Intel Xeon Phi Knights Corner (5.0x) and Knights Landing (2.9x). In addition, the performance of Knights Landing outperforms, at significantly lower power consumption, Intel Xeon Skylake with nearly twofold speedup. These optimizations are expected to be of value for many other unstructured mesh partial differential equation-based scientific applications as multi- and many-core architecture evolves.

To my family
My parents: Aminah and Ahmed
My fiancée: Wijdan
My siblings: Riyadh, Huda, Adi, Iman, Zahra, and Qassim
For their constant love and support
For enduring my absence while working on my PhD research
For always believing in me even when I do not
I LOVE YOU!
ACKNOWLEDGMENTS

"As we express our gratitude, we must never forget that the highest appreciation is not to utter words, but to live by them."
John F. Kennedy

Over the course of seven years of my PhD experience, so many people have been helping me at King Abdullah University of Science and Technology (KAUST) and elsewhere. I therefore would like to take this opportunity to express my deepest appreciation to those people who joined me throughout this wonderful and enjoyable journey. Your presence has enriched my life in many incredible ways. Thank you very much!

First and foremost, I would like to thank my advisor, Professor David E. Keyes, for the guidance and intellectual challenges he has tirelessly given me. I am especially fortunate to work with him and benefit from his perspectives. In addition, I am very grateful to my PhD committee members for their feedback, in particular, Professor Edmond Chow of Georgia Institute of Technology, for his insightful comments and suggestions that helped me to augment and improve the dissertation.

Many thanks and appreciation go to my friends and colleagues in the KAUST Extreme Computing Research Center (ECRC) and KAUST Supercomputing Laboratory (KSL), in particular, Mustafa Abduljabbar, for providing a spectacular environment for learning, sharing ideas, and conducting world-class research in high performance computing.

Support in the form of computing resources was provided by ECRC, KSL, KAUST Information Technology Research Division, Intel Parallel Computing Centers, Isambard Project at University of Bristol, CUDA Center of Excellence at KAUST, Blue Waters Supercomputer at University of Illinois at Urbana-Champaign, and Cray Center of Excellence at KAUST.

I am very thankful to my family for reaching out with encouragement and love. It really means more than you know. Thanks to my parents, Aminah and Ahmed, for being so patient and for bearing with me over the last couple of years.
Allah bless them for everything they have done for me. I am really honored to have parents like them. Furthermore, a lot of gratitude and appreciation go to my wonderful fiancée, Wijdan. No words can ever express how grateful I am to her, how honored I am to have her in my life, and how much I appreciate her consistent and unwavering kindness, love, and support. Also, many thanks to my siblings, Riyadh, Huda, Adi, Iman, Zahra, and Qassim, who have been extremely understanding and supportive of my studies. I feel very lucky to have a family that shares my enthusiasm for academic pursuits.

Last but not least, I want to thank all of my friends at KAUST and elsewhere for their support and kindness during this incredible journey.

Yours Sincerely,
Mohammed A. Al Farhan

TABLE OF CONTENTS

Examination Committee Page
Copyright
Abstract
Dedication
Acknowledgments
List of Figures
List of Tables

I Preliminaries

1 Introduction
  1.1 Dissertation Overview
    1.1.1 Statement of Contributions
    1.1.2 Summary of Results
    1.1.3 Dissertation Structure
  1.2 Other Research Projects
    1.2.1 Optimizing FMM Kernels on Emerging Architecture
    1.2.2 Extreme Scale FMM-accelerated BIE Solver for Wave Scattering

2 Background
  2.1 Unstructured Computations
    2.1.1 Fully Unstructured Navier-Stokes in 3 Dimensions
    2.1.2 Pseudo-transient Newton-Krylov-Schwarz (NKS)
    2.1.3 Indirect Addressing
  2.2 Emerging Architectures
    2.2.1 The Golden Age of Microprocessor Architecture
    2.2.2 Intel® Xeon® Phi™
    2.2.3 Intel® Xeon®

3 Related Work
  3.1 Unstructured Computations
    3.1.1 Porting PETSc-FUN3D to Shared-memory Parallelism
    3.1.2 Emerging Unstructured CFD Research Code
  3.2 Emerging Architectures
  3.3 Our Contributions to the State-of-the-art Many-core Optimizations

II Optimizing the Unstructured Grid Motif

4 Porting PETSc-FUN3D to Knights Corner
  4.1 Highlights of the Contributions
  4.2 Thread Affinity Control: Pinning and Binding
  4.3 Thread-level Parallelism
  4.4 Experimental Setup
    4.4.1 Platforms Used for Experiments
    4.4.2 Software Stacks
    4.4.3 Input Data Sets
  4.5 Performance Results and Analysis
    4.5.1 Offload Baseline Model
    4.5.2 Native Baseline Model
    4.5.3 Performance Results with the Coarse Mesh
    4.5.4 Performance Results with the Fine Mesh
    4.5.5 Comparison of Optimized KNC Performance to CPU Performance
    4.5.6 Large-scale Strong Scalability Study

5 Porting PETSc-FUN3D to Knights Landing
  5.1 Highlights of the Contributions
  5.2 PETSc-FUN3D Computational Routines
    5.2.1 Preprocessing and Setup Phase
    5.2.2 NKS Kernels Phase
  5.3 Data-level Parallelism
    5.3.1 Vectorizing Edge-based Loop Kernels
    5.3.2 Fine-grained Data Partitioning
  5.4 Experimental Setup
    5.4.1 Platforms Used for Experiments
    5.4.2 Software Stacks
    5.4.3 Input Data Sets
  5.5 Performance Results and Analysis
    5.5.1 Performance of the Flux Routine
    5.5.2 Performance of the Gradient Routine
    5.5.3 Vectorization Efficiency of the Edge-based Loop
    5.5.4 Performance of Explicit Vectorization
    5.5.5 Strong Thread Scalability Study
    5.5.6 Memory Bandwidth and Flop/s Performance
    5.5.7 Roofline Model
    5.5.8 Performance on Multi/Many-core Hardware

III Summary and Reflections

6 Concluding Remarks and Future Outlook
  6.1 Broader Impact
  6.2 Lessons Learned
  6.3 Future Directions

References

LIST OF FIGURES

2.1 The surface triangulation of the ONERA M6 wing. The wing surface triangulation is shown in green; the symmetry root plane in red; and the far-field boundary in blue.
2.2 Tetrahedral mesh edge-based loop kernel.
2.3 Baseline performance analysis of PETSc-FUN3D application code. The edge-based loops phase contains the flux evaluation kernel (≥ 45%), gradient kernel using weighted least squares that applies Gram-Schmidt (≥ 10%), and Jacobian matrix construction (≥ 7%), whereas the sparse recurrences phase includes the Incomplete LU factorization (≥ 16%) and the Sparse Triangular Solve (≥ 17%).
