UNIVERSIDAD POLITÉCNICA DE MADRID
ESCUELA TÉCNICA SUPERIOR DE INGENIEROS DE TELECOMUNICACIÓN

COMPILER AND RUNTIME SUPPORT FOR THE EXECUTION OF SCIENTIFIC CODES WITH UNSTRUCTURED DATASETS ON HETEROGENEOUS PARALLEL ARCHITECTURES

Doctoral Thesis

Pablo Barrio López-Cortijo
Ingeniero en Informática
M.S. in High Performance Computing
2017

DEPARTAMENTO DE INGENIERÍA ELECTRÓNICA

ESCUELA TÉCNICA SUPERIOR DE INGENIEROS DE TELECOMUNICACIÓN

UNIVERSIDAD POLITÉCNICA DE MADRID

COMPILER AND RUNTIME SUPPORT FOR THE EXECUTION OF SCIENTIFIC CODES WITH UNSTRUCTURED DATASETS ON HETEROGENEOUS PARALLEL ARCHITECTURES

Doctoral Thesis

Author: Pablo Barrio López-Cortijo
Ingeniero en Informática
M.S. in High Performance Computing

Advisor: Carlos Carreras Vaquer
Profesor Titular del Dpto. de Ingeniería Electrónica

2017

Contents


Abstract

Resumen

Acknowledgements

List of Figures

List of Tables

1 Introduction
   1.1 Motivation
      1.1.1 FPGA acceleration of codes with unstructured-mesh datasets
      1.1.2 Design automation
   1.2 Challenges
      1.2.1 Memory bandwidth limitations
      1.2.2 Irregular data accesses
      1.2.3 Code transformations for distributed systems
      1.2.4 Automated insertion of communications
   1.3 Structure of the thesis
   1.4 Contributions

2 Memory optimization for unstructured meshes
   2.1 Background and related work
      2.1.1 Unstructured meshes
      2.1.2 The memory wall
      2.1.3 Improving data locality
      2.1.4 Reducing data size
      2.1.5 Memory systems for FPGA accelerators
   2.2 Locality optimization with mesh sorting
      2.2.1 The BFsort breadth-first algorithm


      2.2.2 Window size results
      2.2.3 Seed selection
   2.3 A streaming methodology for pipelined kernels
      2.3.1 Hybrid sequential/random access
      2.3.2 Coding and compression with BFsort
      2.3.3 Implementation details
   2.4 Performance results
      2.4.1 Data locality improvements on a multicore
      2.4.2 Streaming methodology on an FPGA accelerator
   2.5 Conclusions

3 Refactoring of sequential codes for distributed systems
   3.1 Related work
      3.1.1 Code refactoring
      3.1.2 Distributed control flow
   3.2 High-level view of the compiler toolchain
   3.3 Block outlining
      3.3.1 Initiator search
      3.3.2 Updating the control structures
      3.3.3 Data structures
   3.4 Remote Block Branches (RBBs)
      3.4.1 Compiler passes
      3.4.2 The runtime library
   3.5 Results
      3.5.1 Results of refactoring with vectorization on a multicore
      3.5.2 Remote Block Branches on synthetic benchmarks
      3.5.3 2-dimensional matrix multiplication
   3.6 Conclusions
      3.6.1 Future work

4 Automated pipelining with the inspector-executor paradigm
   4.1 Background and Related Work
      4.1.1 Skeletons
      4.1.2 The ROSE compiler
      4.1.3 Inspector-Executor
      4.1.4 IEC
   4.2 Modifications to IEC for pipelining
      4.2.1 Load partitioning in two dimensions
      4.2.2 Data dependencies
      4.2.3 Changes to the inspector
      4.2.4 Changes to the executor


   4.3 Performance investigation
      4.3.1 Performance results of the pipelined quake
      4.3.2 Parallel profiling of the pipelined code
   4.4 Conclusions

5 Conclusions and future work
   5.1 Summary of the conclusions of the Thesis
   5.2 Future lines of work

Bibliography

Glossary


Abstract

Simulation codes based on the discretization of time and space for solving systems of partial differential equations are nowadays used in relevant industrial sectors. These include, for example, the finite volume and finite element methods typically used in Computational Fluid Dynamics (CFD). These simulation applications are widely used in industries such as aeronautics, automotive or weather prediction. More recently, they have been applied to novel problems such as the simulation of blood streams for medical diagnosis, or the design of wind turbines for energy generation. The complexity of these simulations and the sheer size of the datasets used often require a considerable amount of computing power in order to obtain results in a reasonable time. In recent years, the optimization of these applications has been a priority for many companies, and even real-time performance is now often considered as a stretch goal for optimization attempts.

High Performance Computing (HPC) systems have been used for some time now to run these simulations, achieving reasonable speedups by partitioning the datasets and running the simulation on several processors. However, the need to run them in as little time as possible, the increase in dataset sizes to allow for higher accuracy, and limitations in the multi-process scalability of CFD simulations have resulted in joint worldwide efforts to achieve the ever-increasing performance and problem-size objectives by means of newer, disruptive technologies.

This thesis approaches the problem of optimizing the execution of these codes with the help of heterogeneous HPC systems. These systems differ from standard, homogeneous HPC systems in that the architectures of the processing elements used as building blocks differ from each other. A special interest of this work is to analyze the feasibility of using Field-Programmable Gate Arrays (FPGAs) in these systems as accelerators for scientific simulations. Because these devices are essentially reconfigurable hardware, they allow a finer-grain parallelism than general-purpose processors, which translates into a higher throughput when computational kernels are ported to them. Additionally, FPGAs achieve levels of power efficiency that are currently unparalleled by any existing mainstream computing device. Unfortunately, the novelty of this approach implies that the programming effort required to implement such solutions is still high compared to other well-known accelerating technologies such as General-Purpose Graphics Processing Units (GPGPUs).

FPGAs also show limitations in external memory throughput. Due to the characteristics of data dependencies in the datasets used for these codes, this problem might render FPGAs useless for the optimization of such applications. A deep optimization of the data transfer mechanisms is vital to ensure that the high-performance capabilities of these reconfigurable devices are not overshadowed by their lower input/output capabilities as compared to other existing processing devices.

The algorithms, methodologies and software libraries introduced throughout this Thesis improve data transfer for the aforementioned codes in FPGA-based parallel heterogeneous systems, while also reducing the development effort required to implement them. Specifically, we propose a methodology to reduce the size of the datasets and transfer them efficiently to the FPGA, as well as two compiler and runtime techniques to automate the parallelization of the codes for heterogeneous systems, one focused on control-flow distribution and another based on pipelining of loop sequences. The latter are techniques at the system level, and are therefore independent of the architectures used in the heterogeneous system. Although the original purpose of this work was to optimize CFD simulations, it has become clear that the proposed techniques are applicable to a more general set of applications. In the future, HPC systems could use FPGAs in the same way that they now benefit from GPGPU technologies, but with the added benefits of reconfigurable hardware.

Resumen

Las simulaciones computacionales basadas en la discretización temporal y espacial para la resolución de sistemas de ecuaciones en derivadas parciales se utilizan hoy en día en sectores industriales de gran relevancia económica. Un ejemplo significativo de dichas aplicaciones son las simulaciones de dinámica de fluidos. Estas simulaciones se utilizan constantemente en industrias tales como la aeronáutica o la automoción. Más recientemente, se han identificado nuevos campos de aplicación como la simulación del torrente sanguíneo para el diagnóstico médico, o el diseño de turbinas eólicas para la generación de energía. La complejidad de dichas simulaciones y el propio tamaño de los conjuntos de datos usados hacen que requieran de una capacidad de procesamiento considerable si se quieren obtener resultados en un tiempo razonable. En los últimos años, la optimización de estas aplicaciones se ha convertido en una prioridad para diversos organismos, hasta el punto de considerar aplicaciones en tiempo real como un objetivo a largo plazo.

Estas simulaciones han venido usando recursos de Computación de Altas Prestaciones (HPC, por sus siglas en inglés), los cuales obtienen aceleraciones razonables mediante el particionado de datos y su ejecución en múltiples procesadores. Aun así, la necesidad de ejecutarlas en el menor tiempo posible, el crecimiento de los conjuntos de datos para incrementar la exactitud de los resultados, y las limitaciones de escalabilidad de los códigos de CFD, han motivado la aparición de iniciativas conjuntas entre varios países, con el objetivo de conseguir los objetivos de rendimiento y tamaño, siempre en crecimiento, a través de tecnologías novedosas e incluso disruptivas.

La presente tesis aborda el tema de la optimización de dichos códigos con la ayuda de sistemas heterogéneos de altas prestaciones. Dichos sistemas se diferencian de los sistemas homogéneos en el hecho de que las arquitecturas de sus elementos de procesamiento son distintas entre sí. Se pone un especial énfasis en el análisis de viabilidad del uso de FPGAs en dichos sistemas como aceleradores para simulaciones científicas. Al estar estos dispositivos basados en hardware reconfigurable, permiten un grado más fino de paralelismo que los procesadores de propósito general, lo cual se traduce en un incremento del rendimiento de las porciones de código más relevantes cuando éstas se asignan a la FPGA. Por otra parte, las FPGAs obtienen unos niveles de eficiencia energética muy por encima de cualquier otro dispositivo de cómputo disponible hoy en día. Por desgracia, la novedad de este tipo de propuestas también significa que el esfuerzo de desarrollo requerido para implementarlas es todavía significativamente superior al de otras tecnologías más establecidas, como por ejemplo las GPUs de propósito general (GPGPUs, por sus siglas en inglés).

Las FPGAs también muestran limitaciones a la hora de manejar datos procedentes de memorias externas. Debido a las características de las dependencias en los conjuntos de datos utilizados en muchas de las aplicaciones CFD, este problema podría suponer la no idoneidad de las FPGAs como alternativa para optimizar dichas aplicaciones. Por ello, es de vital importancia asegurarse de optimizar las transferencias de datos en profundidad, de forma que la alta capacidad de cómputo de la FPGA no se vea limitada por su capacidad de transferencia, relativamente pobre si se compara con otros dispositivos de procesamiento.

Los algoritmos, metodologías y librerías de software presentados a lo largo de la Tesis están enfocados a la mejora de la transferencia de datos de códigos CFD en arquitecturas paralelas heterogéneas con aceleradores FPGA, tratando de reducir al mismo tiempo el coste de implementación de dichos sistemas. Específicamente, se propone una metodología de reducción del tamaño de los conjuntos de datos y de transferencia desde y hacia la FPGA, así como dos propuestas de automatización para la paralelización de códigos enfocada a sistemas heterogéneos y basadas en compiladores con la asistencia de librerías de comunicación: la primera de ellas está basada en la distribución del flujo de control, mientras que la segunda se basa en el pipelining de secuencias de bucles. Estas últimas son técnicas a nivel de sistema, y por tanto son independientes de las arquitecturas presentes en el sistema heterogéneo.

A pesar de que el propósito original del trabajo era la optimización de simulaciones de dinámica de fluidos, se ha observado que las técnicas propuestas son aplicables a un conjunto de problemas más general. En el futuro, muchos sistemas HPC podrían beneficiarse de la tecnología disponible en las FPGAs de la misma forma en la que se benefician actualmente de las GPUs, con los beneficios adicionales de rendimiento y consumo energético que ofrece el hardware reconfigurable.

Acknowledgements

First and foremost, this Thesis would have never been possible without Carlos Carreras. This Thesis belongs to him as much as to me, from the initial ideas to the results. From him I have learned the importance of research standards and the value of analyzing ideas down to the last detail. I would like to thank him for his support and perseverance over the last months.

I have been very lucky to work with the people at the Integrated Systems Laboratory (LSI). I would like to thank Roberto Sierra, Enrique Sedano, José Antonio Fernández, Ángel Fernández and Juan Antonio López for the valuable suggestions and numerous proofreading iterations prior to the publication of the articles that gave birth to this work. I would also like to acknowledge the computer resources and technical assistance provided by the Centro de Supercomputación y Visualización de Madrid (CeSViMa).

I should express my gratitude to the reviewers of the manuscript, who have kindly agreed to provide me with valuable feedback and be part of the board of examiners: José Manuel Moya, Gabriel Caffarena, Francisco Gómez, Christian Plessl and Manuel Hermenegildo. I have had the honor of meeting them personally during my academic life, and they are all fantastic professionals. I hope to continue learning from them in the future.

Throughout the years, I have seen people in the team come and go. This work has a bit of all of them. In particular, Pablo Ituero provided me with a LaTeX template, dispossessing me of any excuse not to deal with the moment of writing the first page, feared by many. Fernando García has proven to be an invaluable asset over the last stages of the writing, when I was thousands of miles away from Madrid, and greatly improved my presentation skills. With him and Javier Agustín, I traveled far in pursuit of the perfect hamburger.

And because gastronomy is my religion, I would like to thank my beloved colleagues of the Peña Gastronómica Colegas de Valdeón for the numerous occasions on which they brought me out of my imprisonment to enjoy some time among friends and food. I am lucky to have met them, especially Eduardo, Ignacio and Germán, fellow PhD students, with whom I have shared many thoughts about science and research.


My parents deserve half of the credit for my achievements. They have always supported my projects, even when these meant being far from each other. A big thanks to my sisters Ana and Inés for being there for me even when we could not see each other. Hard things look easy with a family like this.

Last but not least, I would like to thank my partner Carol for her unconditional support, especially over the last months when I gradually transformed into a grunting mass of hatred. Or perhaps she should thank me for all the new useful concepts that she learnt in the process, such as the many possible layouts of a communication network, or the difference between SysVinit and Systemd. I am sure she will put that to good use in her everyday job as a publicist.

List of Figures

2.1 Structured and unstructured meshes. Only a) is unstructured. Courtesy of Adrian Lange, Wikipedia.
2.2 A two-dimensional mesh and its associated graphs. The dual graph has an extra edge for each cell that connects to the imaginary cell surrounding the mesh.
2.3 Manifoldness. Mesh a is a manifold. Mesh b is not a manifold, since it does not meet requirement 2. Mesh c is not a manifold, since it does not meet requirement 1. Courtesy of Prof. Mario Botsch, U. Bielefeld.
2.4 Triangle strips. The first is the original depth-first traversal; the second is the breadth-first traversal as used by Mitra and Chiueh.
2.5 Sorting the mesh appropriately results in smaller windows. Figures 2.5a and 2.5b show the adjacency matrices. Figure 2.5c shows the subsets of vertices that belong to two windows: vertex 8 with a sequence based on vertex identifiers, and vertex 6 with a Cuthill-McKee sorting.
2.6 Difference between bandwidth and window due to edge processing
2.7 Three sorting methods applied to a mesh. For each method, the upper line represents the sequence generated by the sorting algorithm, whereas the lower line represents the state of the intermediate buffer during the computation.
2.8 Customizable BFsort
2.9 Random and clockwise tie resolution strategies.
2.10 Breadth-first evolution by cells (upper line) as proposed by Mitra, and by vertices (lower line) as proposed in this work, together with the FIFO queues used to build the order. The lightly coloured vertices and cells are visited but not closed. The dark-coloured vertices are finished.
2.11 Complete and zoomed views of the NACA test meshes. The lightly coloured vertices belong to the maximum frontier.


2.12 (Continues Figure 2.11) Complete and zoomed views of the NACA test meshes.
2.13 Maximum vertex window with breadth-first sequences in terms of vertices and cells.
2.14 Required size of the FPGA buffer in percentage over the complete mesh for various vertex-based sortings.
2.15 Maximum frontiers obtained with the best seed (left) and worst seed (right). The region of interest is shown in the main figures, whereas the boxed figures show the complete mesh.
2.16 Bipartition seed selection
2.17 Execution of the bipartition algorithm on the NACA 0012 mesh.
2.18 Maximum window size obtained by seed selection algorithms.
2.19 Execution time of seed selection algorithms.
2.20 A traditional cache system (left) compared to a custom hybrid sequential/random system (right) on an FPGA.
2.21 Frame coding without ID compression and with ID compression.
2.22 Target architecture showing the data movements between host and accelerator.
2.23 Interface between the memory and the kernel for the read-write part of the mesh. At each iteration, the read and write memories are swapped.
2.24 Left: edge-based read-only frame. Right: vertex-based read-write frame. Dark blocks represent vertices; light blocks represent edges in the connectivity.
2.25 Compression ratio (CR) of the encoded connectivity data with different sorting methods and with or without quantization.
2.26 Compression ratio of the algorithmic data
2.27 Overall compression results in percentage of the original mesh. The first bar represents the original mesh, while the second bar represents the compressed mesh after applying the discussed techniques except for sorting. The last bar includes sorting.

3.1 Proposed compilation toolchain.
3.2 RSD description and graph representation of a machine with two quad-core processors, 2.4 GHz and 4 GB main memory.
3.3 Graph representing the architecture defined by the Resource and Service Description (RSD) file shown in Figure 3.2.
3.4 Code generation steps


3.5 CFG of a function at different stages of the refactoring. The figure on the left shows the device partitioning, where the Basic Blocks (BBs) in the highlighted boxes are mapped to one device and the BBs in the regular boxes are mapped to another. The center figure shows the resulting initiators, and the right figure shows the newly-partitioned functions.
3.6 The initiator search finds the entry blocks for future functions and marks BBs with its root initiator.
3.7 Updating a phi node
3.8 Transformation of branch instructions into a distributed application with Remote Block Branches.
3.9 Startup, RBB call and shutdown process. A closed lock appears when a thread is put into sleep. An open lock represents the wakeup condition.
3.10 Performance of the partitioned codes with different combinations of partitioning and vectorization.
3.11 seq benchmark: three basic blocks in sequence.
3.12 nested benchmark: two nested loops.
3.13 improper benchmark: an improper region consisting of a single loop with two possible entry and exit points.
3.14 Combined startup and shutdown time of the RBB system compared to the MVAPICH MPI distribution.
3.15 Performance of chained Remote Block Branches as generated by a loop.
3.16 RBB performance variation with different parameter sizes.
3.17 2mm kernel in C.
3.18 Performance of the 2mm kernel with RBBs

4.1 Pseudocode structure of the codes that can be considered for parallelization.
4.2 Creation of inspector and executor
4.3 Pseudocode example of inspector generation
4.4 Example of an IEC hypergraph
4.5 Partial diagram of a seismic simulation
4.6 Two approaches to pipeline dependency communication
4.7 A blocked loop in the pipeline as a consequence of a dependency in an unrelated loop
4.8 perf analysis of the sequential code
4.9 perf analysis of the pipelined code
4.10 Distribution of execution time per thread among different functions
4.11 Detailed profile in process 0


4.12 Detailed profile in process 7
4.13 Detailed profile in process 14

List of Tables

2.1 Characteristics of the datasets
2.2 Frequency and area results, RO frame
2.3 Frequency and area results, RW frame
2.4 Timings and speedups of the test code with two vertex sequences
2.5 Data-L1 and L2 cache misses, measured as a percentage of the total accesses to the memory system.
2.6 Differential compression variation in variables from the CFD application. Sizes given as percentage of the original mesh.
2.7 Bit-widths for different quantization choices
2.8 Compression ratio with two tie resolution strategies

3.1 Benchmark timings (in seconds) with three implementation choices and different locality of communications.
3.2 Timings of the 2mm original and partitioned code with different data sizes, and calculated overhead.

4.1 Characteristics of the datasets used with quake
4.2 Execution time comparison between the original and the pipelined quake


Chapter 1

Introduction

HPC is becoming popular as a consequence of the increasing computational needs of scientific and industrial applications. A broad spectrum of algorithms demand a variety of computational resources that traditional general-purpose architectures find hard to cater for. Also, many of these applications manipulate massive datasets that require memory systems of considerable size. These facts, in conjunction with power efficiency requirements, have motivated a growing interest in heterogeneous distributed systems [BDH+10][DMM+10][SKS+07].

Despite their obvious advantages, heterogeneous distributed architectures require a considerable effort from application developers. A sequential application must undergo several analysis and transformation steps, such as performance estimation, parallelization, dataset partitioning and code refactoring, in order to accommodate the code to the different architectures available in the system. Most of these development stages are currently done manually, and are often more complex than the design of the serial code itself [SQJ11]. There is a general consensus that the compiler should perform the above steps with minimal human participation and, hence, automation by means of intelligent compilers is an active field of research [WQZ+13][BGA12][Ozt11].

As opposed to shared memory systems, distributed systems require distributing the data structures into many address spaces. This implies that the data transfer between different parts of the system must be made explicit in the code. This communication paradigm is known as message passing. Analyzing the application and its data access patterns, and inserting such communications, is a difficult process to automate, and further complicates the portability of existing legacy codes.

Heterogeneity in HPC clusters mainly comes in the form of accelerators. GPGPUs are currently the most popular. However, it has been proved that FPGAs outperform the former for several types of applications, while achieving better power efficiency in terms of Floating-point Operations per Second (FLOPS) per Watt; this will be discussed in greater detail in Section 1.1.1.


Since power efficiency is increasingly important as the computing power of clusters and data centers grows (see the works by Salinas-Hilburg et al. [SHZRM+16] or Zapater et al. [ZTM+15]), FPGAs are initially good candidates to build more cost-effective systems. On the other hand, the use of FPGAs is still under research for many other types of applications, especially when complex data dependencies are involved. This is the case of applications using unstructured meshes, which remain one of the big challenges of HPC.

This thesis approaches some of the problems that arise when parallelizing programs that represent their data with unstructured meshes. The applicability of heterogeneous systems and, in particular, of FPGA clusters to unstructured-mesh applications has received little attention in industry and academia. As a means to analyze its feasibility, this work focuses on two different but related topics: FPGA accelerators, and distributed heterogeneous systems at the system level. However, many of the solutions presented are applicable beyond the scope of the thesis.

The structure of this introductory chapter is as follows. Section 1.1 introduces the motivations of the thesis. Section 1.2 describes the specific problems that will be analysed in this thesis. Section 1.3 explains the structure of the book and the ideas that glue the chapters together. Finally, Section 1.4 enumerates the contributions of the work, which have resulted in a number of project reports, publications and a granted patent.

1.1 Motivation

1.1.1 FPGA acceleration of codes with unstructured-mesh datasets

GPGPUs are the predominant accelerators in the market. However, FPGAs have recently attracted considerable attention. In the last years, many studies have compared the performance of both accelerators with typical scientific and industry codes. None of them obtains conclusive evidence of either of these accelerators outperforming the other in a broad range of applications. The following is a representative selection of such comparative studies.

Che et al. [CLS+08] compare the performance of GPGPUs vs FPGAs on three benchmarks. A qualitative comparison suggests that GPGPUs are most efficient with codes with few data dependencies and significant coarse-grain parallelism, while FPGAs can take advantage of data streaming and pipelining for more complex dependencies and fine-grain parallelism. Also, this work measures the development effort of the benchmarks in terms of Lines Of Code (LOC). The results show that the implementations for FPGAs require between 1.3 and 2.8 times more LOC than their GPGPU counterparts.

Kapre and DeHon [KD09] analyze the performance of a SPICE simulation on FPGAs, GPGPUs, the Cell Broadband Engine (CBE) processor and multi-core processors. They report speedups of up to 132x for GPGPUs and 182x in the case of FPGAs. While the latter obtain a higher top speedup, their performance degrades with the problem size due to their limited resources.

Monte Carlo simulations obtain considerable speedup as well as energy efficiency when running on FPGAs. Tian and Benkrid [TB10] report speedups of over two orders of magnitude in a financial simulation compared to a general-purpose processor, while being 336 times more energy efficient. The same implementation obtains a 3x performance speedup against a GPGPU, while also being 16 times more energy efficient.

Kestur et al. [KDW10] undertake a performance study of the Basic Linear Algebra Subroutines (BLAS) on GPGPU and FPGA accelerators. BLAS is a library of algebra routines extensively used in the scientific community. The results show that FPGAs obtain performance comparable to GPGPUs while being 2.7 to 293 times more energy efficient, depending on the subroutine and input data.

Technology keeps improving every year. A more recent comparison between GPGPUs and FPGAs can be found in the white paper by the Berten company [Ber16]. The conclusions are still similar to the earlier papers: development effort and cost are the main drawbacks preventing wide adoption of FPGAs, and power efficiency is the main advantage over other types of accelerators.

The comparative works to date can be summarized in the following ideas. First, neither of the two accelerators has a performance advantage for all types of applications. Second, FPGAs seem to obtain better results in power efficiency in all the surveyed works. Third, GPGPUs are easier to program. The last point is the main reason that FPGAs are still a niche market, in contrast to the market growth experienced by GPGPU accelerators.

Acceleration of unstructured-mesh codes with FPGAs has drawn little attention so far [AIO+12][AIO+11][SRSLB+11]. These datasets are widely used in scientific and industry simulations. In particular, many codes based on Partial Differential Equations (PDEs) use them. CFD solvers, such as Euler and Navier-Stokes, and earthquake simulations are some examples. This lack of previous research is due in no small part to the design complexity and the demanding memory requirements. In fact, even in the existing Central Processing Unit (CPU) codes, optimizing memory traffic for unstructured meshes is of utmost importance due to their unpredictable data access patterns. FPGA accelerators are prone to show performance problems due to data access performance, as they do not have a default memory hierarchy to hide transfer latencies. Hence, solving or at least alleviating the memory bottleneck is one of the main motivations of this thesis.


1.1.2 Design automation

Much research work has been published in the world of embedded systems about the development of systems where some parts of the specification are implemented directly in hardware, and others are implemented in software. What is commonly known as hardware/software co-design [MOS16][Sch10] consists of partitioning a specification into several hardware and software components, developing both in parallel and accounting for tradeoffs such as performance, power and development effort. Automating this process has been under heavy investigation in the last years [JVTB17][KZC16][YGR16][JNH16]. While embedded hardware platforms are often heterogeneous, they show notable differences when compared to HPC systems:

• The hardware platform in an embedded system is developed ad hoc. In contrast, HPC hardware is general-purpose, and multiple jobs must typically co-exist in a system.

• The software for embedded systems is also fully ad hoc. In contrast, the software in HPC systems is only partially developed for the application: at least an operating system is placed between the hardware and the application software, which gives runtime support for the application. A runtime system for HPC must be very lightweight in order to scale well up to the size of the machine.

• Embedded systems usually consist of a few computing cores (< 10), whereas current HPC systems are massively parallel (> 10^4 cores) and parallelism must be extracted accordingly.

• Because embedded systems are small, they allow implementing shared memory schemes more easily. Communication is efficient, and moving data between devices is as simple as accessing the same memory locations. Massively-parallel HPC systems are based on distributed memory, with opposite results on efficiency and ease of use.

Despite the previous differences, the trend is to adapt some of the co-design principles to HPC. For example, GPGPU accelerators require partitioning the application into kernels and host code, which must be preceded by a careful analysis of the memory, communication and computing power tradeoffs between the GPGPU and the host processor. Some programming languages, such as CUDA or OpenCL, allow the programmer to specify such a partitioning. They also let the programmer decide whether to optimize some details by hand or keep the compiler's default behaviour, such as the mapping of variables to different levels of the memory hierarchy.


Traditionally, FPGA accelerators have been programmed in hardware description languages such as VHDL or Verilog. However, in the last decade, there has been an interest in C-like languages for hardware synthesis [WMH+15][Edw06]. These languages have specific extensions to allow hardware constructs that would be hard to specify in plain C. At the same time, they show a C-like syntax that is familiar to most programmers in order to improve the learning curve. Most importantly, because they use a C syntax, integrating the FPGA and the host components is easier compared to the traditional approach, where CPUs were programmed in a high-level language and FPGAs were programmed in a hardware description language. Previous studies [EAMEG11][XSAH10] have analyzed the performance of various C-like languages against VHDL and Verilog, reporting that, although VHDL still beats the former from a performance perspective, the programming effort is several times higher.

As of today, co-design has not given a complete solution to the problem of distributed machines. Previous works dealt with performance estimation and simulation [SQJ11][SAA+08][ABDS02]. Distributed systems, such as HPC clusters or the Massively Parallel Processing (MPP) systems shown in the Top500 list, require a deep understanding of parallel architectures and communications from the application programmer. This adds up to the expertise required to use accelerators. Automating the process of application development on such systems is therefore the second main motivation of this thesis. Although implementation details may vary between the different architectures used in the system, some problems can be dealt with from the system perspective, making them independent of the architectures used in the heterogeneous system.

1.2 Challenges

The previous motivational topics are too broad and complex to be addressed at once. This section introduces four specific problems that will be analysed throughout this work.

1.2.1 Memory bandwidth limitations

One frequent observation in accelerated HPC codes is that the movement of data between the accelerator and its memory becomes the bottleneck of the system well before the accelerator is used at maximum speed [NFMM13][KSY10]. This effect is caused by the memory wall [WM95], which states that the faster evolution of processor speed compared to memory makes computing power effectively bound by data access. The problem is worsened by the iterative nature of many scientific simulations, where datasets are transferred back and forth between processor and memory multiple times.

The memory wall has been hitting the performance of General-Purpose Processors (GPPs) for some time now. Because accelerators achieve faster processing speeds than GPPs, they would be more affected by the memory wall should equal memory systems be used. For that reason, modern GPGPUs have dedicated memories arranged in custom hierarchies. For comparison purposes, a modern Intel Core i7 processor supports a maximum bandwidth of up to 50 GB/s, while the Nvidia Kepler K80 obtains 480 GB/s, which is almost an order of magnitude faster. This is not the case for FPGA accelerators, where the trend is to use technologies similar to those of general-purpose processors, or even a step behind. In addition, some accelerator boards implement all or part of the memory controller as a soft core inside the FPGA in order to reduce cost. This in turn leads to slower memories, due to the inherent limitations in clock frequency. Furthermore, the programmer must manually implement part of the memory hierarchy in order to take advantage of the on-chip memory. Reducing the traffic between the bandwidth-limited off-chip memory and the FPGA is central to this thesis.

On the other hand, FPGA accelerators have on-chip memory resources. These are distributed over the whole area of the chip in multiple blocks, and each of these blocks has a separate input/output interface. The theoretical bandwidth surpasses that of GPGPUs by an order of magnitude. However, this is not achieved in practice, and giving an exact bandwidth figure is not possible, since the application design affects the frequency of the buses, and the designer chooses the bus widths as well. In any case, an appropriate memory hierarchy that successfully integrates the big and bandwidth-limited off-chip memory with the fast and size-limited internal memories is key to obtaining speedup in FPGA-accelerated codes. Chapter 2 will use a custom streaming approach where a preprocessing stage sorts the mesh by analyzing the uses of each vertex and edge in the code, and serializes the result so that it can be transferred as a stream at run time.
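As a rough illustration of how bandwidth bounds the iteration time of such iterative codes, the following back-of-envelope sketch estimates the time spent just moving the dataset across a bandwidth-limited link; the dataset size and bandwidth figures are illustrative placeholders, not measurements from this thesis.

#include <stdio.h>

/* Lower bound on the time per simulation timestep when the whole mesh must
 * cross a bandwidth-limited link twice per timestep (read the data, write the
 * results back). The figures below are placeholders for illustration only. */
int main(void) {
    const double mesh_bytes = 2.0e9;   /* hypothetical 2 GB unstructured mesh */
    const double bandwidth  = 50.0e9;  /* ~50 GB/s, CPU-class memory bandwidth */

    double t_per_step = 2.0 * mesh_bytes / bandwidth;  /* seconds, transfers only */
    printf("transfer-only time per timestep: %.3f s\n", t_per_step);

    /* No matter how fast the compute kernel is, a timestep cannot finish faster
     * than this bound unless the traffic itself is reduced, which is what the
     * sorting and compression techniques of Chapter 2 target. */
    return 0;
}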

1.2.2 Irregular data accesses

Maximum FPGA acceleration requires creating custom memory hierarchies that adapt to the algorithm and the data structures. General memory hierarchies and generators have been proposed for FPGAs [CHM11][KPN+09][DVCS08][NS05], but they still rely upon spatial and temporal locality, or at least predictable access patterns. Unstructured meshes generate unpredictable accesses to memory because these are defined by the graph connectivity, that is, the connections between vertices as defined by the edges. The application only knows the structure of the dataset after reading it at runtime. Consequently, the aforementioned general and automated solutions are not likely to obtain good performance figures. Improving spatial locality and predicting memory access patterns will be surveyed.

As the exact memory accesses are hard to predict, we will give methods that at least bound these accesses to specific regions of the mesh. These regions should be small enough to be stored temporarily inside the FPGA. This is tightly coupled to the task of reducing data transfers between off-chip and on-chip memory, and to a wise allocation and layout of the on-chip memory.
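To make the irregularity concrete, the sketch below shows the indirect, connectivity-driven access pattern of a typical edge-based kernel; the array names and the per-edge update are illustrative placeholders, not code from the CFD application used in this thesis.

/* Edge-based traversal of an unstructured mesh. The vertices touched at each
 * step are dictated by the connectivity arrays, so the addresses accessed in
 * x[] and flux[] are only known once the mesh has been read at runtime. */
void edge_kernel(int n_edges, const int *edge_v0, const int *edge_v1,
                 const double *x, double *flux)
{
    for (int e = 0; e < n_edges; e++) {
        int a = edge_v0[e];              /* indirect, data-dependent index */
        int b = edge_v1[e];
        double f = 0.5 * (x[a] - x[b]);  /* placeholder per-edge computation */
        flux[a] += f;                    /* scattered writes follow the graph */
        flux[b] -= f;
    }
}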

1.2.3 Code transformations for distributed systems

Porting a piece of code to a heterogeneous distributed architecture involves mapping parts of the code to components of the hardware system with computational power. We refer to these hardware components as processors. Examples of processors are a general-purpose processor, a graphics coprocessor or an FPGA. Each processor is expected to perform differently, and different parts of the code are supposed to be mapped to the processor where their performance will be maximal. In this mapping, not only the computational power must be taken into account, but also the communications generated by splitting up the code. In heterogeneous systems, the communication network is usually heterogeneous as well; this also needs to be taken into account in the mapping.

Distributed applications consist of many processes that execute on different processors and memory spaces. Once the mapping is decided, a transformation stage must rewrite the original sequential code so that the behaviour of the distributed code is equivalent to the original. In the case of heterogeneous distributed systems, these transformations move segments of code to different functions or even new executables that will run on different processors.

Typically, the transformation of the code is done manually. CUDA [SK10] and OpenCL [GHK+12] are programming frameworks for heterogeneous computing that make this process easier; for instance, moving parts of the code to the accelerator in OpenCL is done by marking functions with the kernel keyword. Another relevant example of a code transformation mechanism in distributed systems is the Remote Procedure Call (RPC) [BN84], which has been extensively used in client-server applications for the past thirty years. Cardoso et al. [CDW10] enumerate several code transformations in the context of reconfigurable computing (such as FPGAs). These are classified in several categories according to the level of granularity: bit-level transformations, instruction-level, loop-level and so on. All the previous works on the subject require manual development to different extents.

In this thesis, we first propose an extension to the compiler's internal representation and an algorithm that allows the compiler to partition the code and automatically distribute control flow. On top of it, code partitioning strategies can be implemented. Second, we modify an existing parallelizing compiler that distributes data among processors. With our modifications and with the help of our custom communications library, the compiler is also able to partition sequences of loops into multiple processors. Consequently, the compiler is able to distribute the workload in two dimensions: functional and data.
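As a simple, source-level illustration of the kind of outlining transformation described above (the compiler proposed in Chapter 3 actually operates on basic blocks of the intermediate representation, not on C source), a region of a sequential function can be extracted into a new function whose body may later be invoked on a different processor; the function and variable names here are hypothetical.

/* Before: one sequential function containing a compute-heavy region. */
void simulate(double *data, int n) {
    /* ... setup ... */
    for (int i = 0; i < n; i++)          /* region selected for another processor */
        data[i] = data[i] * 0.5 + 1.0;
    /* ... postprocessing ... */
}

/* After outlining: the region becomes a standalone function. The call site can
 * later be replaced by a remote invocation (an RPC-like mechanism) that
 * transfers control, and any live-in/live-out values, to another device. */
static void region_kernel(double *data, int n) {
    for (int i = 0; i < n; i++)
        data[i] = data[i] * 0.5 + 1.0;
}

void simulate_outlined(double *data, int n) {
    /* ... setup ... */
    region_kernel(data, n);              /* placeholder for a remote control transfer */
    /* ... postprocessing ... */
}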

1.2.4 Automated insertion of communications

Communications between partitions are really a consequence of the previous challenge. Apart from taking advantage of the heterogeneous nature of the system, speedups are often obtained by splitting the problem across multiple processors, thus making use of the parallelism available in the system. In both cases, processors must communicate at certain points of the program, either to exchange data generated and consumed in different parts of the system, or to synchronize the execution between processors. The behaviour of the distributed application must obviously be kept identical to the original.

Developers have been using the Message Passing Interface (MPI) [The12], OpenMP [CJdP08] and similar communication standards for a long time. A comprehensive classification of parallel programming languages can be found in the survey by Diaz et al. [DMnCN12]. Each of them has tradeoffs between scalability, programmability and performance. For example, OpenMP has a simpler programming interface than MPI. However, it is only valid for shared memory systems, which usually scale up to some tens of processors at most. MPI can be scaled up to thousands of processors, but it results in cumbersome code and requires a considerable level of expertise even for the parallelization of simple code.

Automating the parallelization in distributed systems without affecting performance is still an unresolved problem. Past efforts resulted in parallel programming languages such as OpenCL [SGS10], Julia [BCK+14] or Sequoia [FHK+06], and automated compiler frameworks such as HArtes [BSY+10], Polly [GGL12] or Graphite [PCB+06]. At the same time, heterogeneous systems came into play with the popularization of GPGPUs. Still, the problem of distributed memory is hard to handle automatically, due to the difficulty of mapping data flow to multiple address spaces. Partitioned Global Address Space (PGAS) languages such as Co-Array Fortran [NR98] or Unified Parallel C (UPC) [EGS06] partially solve the problem by making address spaces visible within the programming API, but they still require a considerable understanding of the memory structure from developers.

The last part of this thesis will propose a technique to split sequences of loops that iterate over unstructured meshes into different processors in a pipelined fashion. Sequences of loops are a common pattern in scientific simulations such as PDE solvers, where an outer loop iterates over time and a sequence of inner loops iterates over the mesh. The proposed technique is based on the inspector-executor paradigm [SCMB90] and combines static compiler analysis and a runtime communication library to calculate the communications required between processors.
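The loop-sequence pattern targeted by the pipelining technique has roughly the following structure (compare Figure 4.1); this is a schematic C sketch with placeholder loop bodies, not code from an actual solver. Pipelining assigns different inner loops to different processors so that consecutive timesteps of different loops can overlap.

/* Schematic structure of a mesh-based PDE solver: an outer time loop runs a
 * fixed sequence of inner loops over the mesh. */
void solve(int timesteps, int n_edges, int n_vertices,
           const int *edge_v0, const int *edge_v1,
           double *u, double *flux)
{
    for (int t = 0; t < timesteps; t++) {
        /* Stage 1: edge-based loop (placeholder flux computation). */
        for (int e = 0; e < n_edges; e++) {
            double f = 0.5 * (u[edge_v0[e]] - u[edge_v1[e]]);
            flux[edge_v0[e]] += f;
            flux[edge_v1[e]] -= f;
        }
        /* Stage 2: vertex-based loop consuming the results of stage 1. */
        for (int v = 0; v < n_vertices; v++) {
            u[v] += 0.1 * flux[v];       /* placeholder explicit update */
            flux[v] = 0.0;
        }
    }
}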


1.3 Structure of the thesis

After this introduction to the subject, Chapter 2, Memory optimization for unstructured meshes, introduces several techniques to optimize the data transfer between processors in a distributed system. First, it presents a preprocessing algorithm to sort the meshes so that the access patterns are more efficient and suitable for streaming. Second, it defines a complete methodology to implement streaming mechanisms between processors. After an initial test with shared-memory processors, these techniques are applied to an especially complex and resource-demanding case where one of the processors is an FPGA. This chapter tackles the challenges of irregular data accesses and memory bandwidth limitations.

Chapter 3, Refactoring of sequential codes for distributed systems, proposes a compiler toolchain for the automatic distribution of sequential codes. As this is a very broad subject, this work focuses on the low-level mechanisms necessary for the compiler to efficiently support transformations for distributed applications, not on the actual decisions about which parts of the code or data are mapped to each processor. First, an algorithm for automatic function outlining is proposed; this falls within the challenge of automatic code transformations. Subsequently, a communication mechanism is introduced that allows the compiler to represent distributed control transfer very similarly to how it represents the sequential one. This chapter tackles the challenges of code transformations and automated communications.

Chapter 4, Automated pipelining with the inspector-executor paradigm, focuses again on the specific family of algorithms that use irregular data structures, such as unstructured meshes. It builds upon an existing parallelizing compiler and adds the possibility of pipelining sequences of multiple loops across several processors. In this way, the compiler is capable of parallelising two dimensions of the problem: the data structures, that is, the mesh, and the software specification, that is, the loops. The compiler takes advantage of the benefits of the well-known inspector-executor paradigm to process the unstructured datasets prior to the execution of the main algorithm. This chapter mainly addresses the challenge of automated communications.

1.4 Contributions

This thesis advances the state of the art with the following contributions:

• A breadth-first method to improve reference locality for general two- and three-dimensional unstructured meshes in scientific simulations. This method improves the position of vertices in a streamed mesh so that the data dependencies are placed close together in the stream, thus allowing for fast eviction of unneeded data and reducing the amount of memory required inside the FPGA. Evidence shows that this technique is also beneficial for general-purpose processors and, potentially, for a broad set of architectures.

• A methodology to design high-performance mesh formats that accounts for the hardware design of the encoders and decoders, aimed at hardware accelerators based on FPGAs. This methodology allows calculating the requirements of the hardware components to achieve maximum speed.

• An algorithm to distribute arbitrary sections of code across several processors in a parallel architecture, where the communications for control transfer can be inserted automatically by means of a compiler.

• A communication library, based on the concept of RPC, that allows passing control between sections of code.

• A runtime communication library that lets a compiler split sequences of loops across several processors in a pipelined fashion, and which the compiler can apply easily based on the inspector-executor paradigm.

The findings of this thesis have been presented in several publications at international conferences and in journals:

P. Barrio, C. Carreras, J. López, O. Robles, R. Jevtic, and R. Sierra. Memory optimization in FPGA-accelerated scientific codes based on unstructured meshes. Journal of Systems Architecture, 60(7):579–591, August 2014.

P. Barrio, C. Carreras, R. Sierra, T. Kenter, and C. Plessl. Turning control flow graphs into function calls: Code generation for heterogeneous architectures. In 2012 Int'l Conf. on High Performance Computing and Simulation (HPCS), pages 559–565, July 2012.

P. Barrio and C. Carreras. Turning Control Flow Graphs into Function Call Graphs: Transformation of Partitioned Codes for Execution in Heterogeneous Architectures. In 2012 European Conference on the LLVM Compiler Infrastructure, http://llvm.org/devmtg/2012-04-12, 2012.

P. Barrio and C. Carreras. Mesh traversal and sorting for efficient memory usage in scientific codes. In Proc. of the 30th IEEE Int'l Performance Computing and Communications Conf. (IPCCC), Nov. 2011.


Also, part of the contents of Chapter 2 resulted in a granted U.S. patent:

P. Barrio, C. Carreras, R. Sierra, J. A. López, G. Caffarena, E. Sedano, J. A. Fernández, and R. Jevtic. Systems and methods for improving the execution of computational algorithms. Patent number: US 9,311,433. Country: US. Owners: Airbus España and Universidad Politécnica de Madrid.

These research activities have been undertaken in the context of the following national and international projects:

Fusim-E DOVRES: Design Optimization by Virtual REality Simulation. Main researcher: Carlos Carreras. Contractor: Airbus España S. L., Universidad Politécnica de Madrid, contract number A8213093G. January 2008 to December 2010.

AMURA: Sistemas UWB Multielemento y Reconfigurables para Aplicaciones Avanzadas de Señal - Diseño Microelectrónico. Main researcher: Carlos Carreras. Contractor: National Research and Development Plan, Ministerio de Ciencia e Innovación, contract number TEC2009-14219-C03-02. January 2010 to December 2012.

FastCFD: Desarrollo de técnicas avanzadas para aceleración de las simulaciones CFD para aplicaciones aeronáuticas. Main researcher: Juan Antonio López. Contractor: ETSI Telecomunicación. January 2010 to December 2010.

AMIGA 5. El Gas en las Galaxias y en su Entorno. Preparación Científica y Tecnológica para el SKA. Procesado de Datos en Hardware. Main researchers: Carlos Carreras y Juan Antonio López. Contractor: Ministerio de Economía y Competitividad, Programa Estatal de I+D+i Orientada a los Retos de la Sociedad, project AYA2014-52013-C2-2-R. January 2015 to December 2016. (External collaborator)

AMIGA 6. El Gas en el Interior y en el Entorno de las Galaxias. Preparación Científica para SKA y Contribución al Diseño del Flujo de Datos. Procesado de Datos en Hardware. Main researchers: Carlos Carreras y Juan Antonio López. Contractor: Ministerio de Economía y Competitividad, Programa Estatal de I+D+i Orientada a los Retos de la Sociedad, project AYA2015-65973-C3-3-R. January 2016 to December 2018. (External collaborator)


Chapter 2

Memory optimization for unstructured meshes

Simulations based on mesh data structures have been identified as major drivers of High Performance Computing (HPC) [ADK+09]. The typical applications based on meshes are numerical simulations using finite element or finite volume methods. The mesh describes the state of a physical system, and the algorithm computes the evolution of the system over time by iteratively processing the mesh in a series of timesteps. Due to the high complexity of the requirements in terms of precision, algorithm convergence and dataset size, the computational demands of such simulations are among the highest in HPC.

Meshes may be either structured or unstructured, as shown in Figure 2.1. Unlike structured meshes, the impact of unstructured meshes on the memory system has been seldom studied in the context of high-performance numerical simulations. This chapter proposes a series of memory optimizations for unstructured mesh codes, suitable for both general-purpose processors and custom FPGA computational kernels. Background concepts are introduced in Section 2.1 together with relevant work that has been previously published on the subject. Section 2.2 proposes a general optimization to improve data locality by rearranging the order in which the vertices and edges of a mesh are used in the computations and stored in memory. This is followed, in Section 2.3, by a complete methodology to improve memory traffic in FPGA accelerator boards. This methodology reuses the previous optimization to improve the efficiency of data compression techniques, and allows efficient streaming access from the FPGA to the mesh stored off-chip. The chapter closes, in Section 2.5, with an analysis of the lessons learned from the work and the results achieved.

One of the main users of such unstructured mesh simulations is the CFD community. Therefore, the proposed techniques have been tested on meshes regularly used in the industry, such as the National Advisory Committee for Aeronautics (NACA) airfoils (explained in [JWP33]) and datasets gathered from DLR Germany, ONERA France and Airbus Spain.


Figure 2.1: Structured and unstructured meshes (panels a-d). Only a) is unstructured. Courtesy of Adrian Lange, Wikipedia.

The results are very promising: up to 63% of the L1 cache misses are removed on a general-purpose CPU just by sorting the mesh, and we have obtained speedups of up to 2.58x in memory access time. Additionally, applying the complete methodology reduces the memory traffic of CFD meshes down to an average of 34% of the original traffic on an FPGA accelerator board.

2.1 Background and related work

2.1.1 Unstructured meshes

Graph algorithms are among the most memory-demanding in general computing. In the last years, major HPC facilities and data centers have started to measure their performance with benchmarks based on graphs. This has materialized in Graph 500 [gra16], a list of highest-performing systems similar to the classic Top500 list [top13]. The benchmark used in this case is the homonymous Graph 500 benchmark [MWBA10], which is composed of two kernels: the first creates a general undirected graph with a parametrizable degree (i.e. number of edges per vertex), and the second runs a breadth-first search over the graph.

Meshes are based on the mathematical concept of simplicial complexes [Har94]. Unlike graphs G(V,E), where V is the set of vertices and E is the set of edges, meshes also contain elements, sometimes called cells, which are the simplices of the simplicial complex. Some applications associate data to these cells. As an example, in graphics applications, the cells are surfaces; a "colour" property could be associated to the faces so that the rendering processor can print out the mesh with colours.

Assigning data to cells is also common in HPC applications. In many cases, when cell data are required, the data structures are made of two complementary graphs: the primal and the dual graph, as shown in Figure 2.2.


Figure 2.2: A two-dimensional mesh and its associated graphs: (a) original mesh, (b) superposed graphs, (c) primal graph, (d) dual graph. The dual graph has an extra edge for each cell that connects to the imaginary cell surrounding the mesh.

The dual graph represents the cells as vertices and the adjacency between cells as edges. In addition, it is often possible to rewrite the computational algorithm so that the cell properties are remapped to vertices and edges of the primal graph. For instance, a CFD code used throughout this research revealed that the dual graph was only used in a preprocessing stage to calculate geometry data associated with the primal graph. The computational kernel then used the primal graph only. Hence, the optimization techniques discussed in the following sections apply only to a single graph, that is, with no cells involved. From this point onwards and for simplicity, the terms "mesh" and "graph" are used interchangeably unless otherwise noted.

Two vertices connected by an edge are said to be neighbours. If the shortest path between two vertices of a mesh involves k edges, we refer to them as k-distance neighbours or, in short, k-neighbours¹. k-dependencies mean that data from k-neighbours are required in the computations of each vertex.

¹Since we refer to this concept extensively throughout the text, this short form avoids the traditional and longer "vertices at a distance k".

k-dependencies mean that data from k-neighbours are required in the computations of each vertex. It is also common to encounter additional dependencies at distances k−1 down to 1. A high k results in a dependency explosion that greatly limits data locality, thus hiding the effect of memory optimizations.

We classify the mesh data into two types. First, the connectivity (or topology) information defines the relationships between vertices and edges, which in turn define the shape of the mesh. Second, the algorithmic data represent all the information relevant to the particular mesh algorithm. Strictly speaking, if changing the value of a variable in the dataset results in an isomorphic graph (i.e. the graph does not change its structure), then the variable is part of the algorithmic data, and vice versa.

The connectivity in unstructured meshes is commonly described either as an adjacency list or as an adjacency matrix [CLR+01][SLL01]. In this work we use adjacency lists, but equivalent optimizations are possible for adjacency matrices. Hence, vertex and edge lists are used to keep the data required by the computational algorithm. Each vertex specifies its associated algorithmic data, a vertex identifier (ID) and a variable number of identifiers that point to the edges that the vertex connects to. Similarly, each edge specifies its associated data, an edge ID and the two vertex IDs of its endpoints. The IDs of vertices and edges can be made implicit, for example as their index in the list in case these lists are implemented as arrays (e.g. the vertex in position 3 has ID = 3).

Typically, a mesh-based scientific code processes the mesh iteratively, either by vertices or by edges. Processing in terms of edges only requires the edge list to act as the adjacency list by keeping the neighbouring vertex IDs. Similarly, processing in terms of vertices requires the vertex list to act as the adjacency list by keeping the IDs of all incident edges. For the sake of simplicity, we assume edge-based processing for the rest of the thesis, meaning that the processing order will be given by a sequence of edges; obviously, vertices are still computed in the process. It is also possible to reformulate the optimizations for vertex-based processing.
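As an illustration, the sketch below shows one possible in-memory layout of these vertex and edge lists with implicit IDs (the position in the array acts as the ID), together with a schematic edge-based sweep. The field names and the placeholder update are illustrative assumptions, not the exact data structures or kernels of the codes discussed later; note that each endpoint is read once per incident edge, so a vertex of degree d is accessed d times per sweep.

#include <cstddef>
#include <vector>

// Vertex and edge lists with implicit IDs: the index of a vertex (or edge)
// in its array acts as its ID, so no explicit ID field is stored.
struct Vertex {
    std::vector<double> data;    // algorithmic data (e.g. flow variables)
};

struct Edge {
    std::size_t v0, v1;          // IDs (array indices) of the two endpoints
    std::vector<double> data;    // algorithmic data attached to the edge
};

struct Mesh {
    std::vector<Vertex> vertices;
    std::vector<Edge> edges;     // the edge list doubles as the adjacency list
};

// Schematic edge-based sweep: both endpoints of every edge are read, so a
// vertex of degree d is accessed d times per sweep. The numerical update is
// a placeholder and assumes at least two variables per vertex.
void edge_sweep(Mesh& mesh) {
    for (const Edge& e : mesh.edges) {
        Vertex& a = mesh.vertices[e.v0];
        Vertex& b = mesh.vertices[e.v1];
        double flux = a.data[0] - b.data[0];   // placeholder computation
        a.data[1] += flux;
        b.data[1] -= flux;
    }
}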

Manifold meshes

A mesh in the three-dimensional space is a 2-manifold [Lee10] if it meets these two requirements:

1. Each edge is incident to only one or two faces.

2. The faces incident to a vertex form a closed or open fan.

The faces are the surfaces that delimit cells in three-dimensional meshes. Figure 2.3 gives some examples of manifold and non-manifold meshes.


Figure 2.3: Manifoldness. Mesh a is a manifold. Mesh b is not a manifold, since it does not meet requirement 2. Mesh c is not a manifold, since it does not meet requirement 1. Courtesy of Prof. Mario Botsch, U. Bielefeld.

For the rest of this work, we simply refer to 2-manifolds as just manifolds. Mathematically speaking, other manifolds are possible in higher-dimensional spaces, but they are rare in scientific computing applications.

One of the contributions of the proposed optimizations is that they are valid for non-manifold meshes. Previous optimizations of unstructured meshes were targeted at graphics processing, where the meshes are surfaces. In practice, a meshed surface results in a manifold mesh, with a very small chance that non-manifold edges and faces appear. In contrast, scientific datasets often model solid bodies or spaces, which result in a high percentage (over 95% in our test meshes) of the edges being non-manifold.

2.1.2 The memory wall

Working with large datasets is a common requirement in HPC codes. Since the stress put on the memory system increases with the size of the dataset, memory traffic is a major concern when porting codes to HPC systems.

Modern CPUs include a complex hierarchy with many levels of caches, which effectively hides the latency of the upper memory levels and prevents processor stalls to some extent. However, although they certainly make life easier for the programmer, cache mechanisms are not an efficient solution for every code, but a fairly good one that works for all. Caches do not account for access sequences specific to each code, but expect a considerable amount of temporal and spatial locality in order to work efficiently. In addition, some novel devices used as accelerators, such as FPGAs, are typically best suited to sequential memory access rather than cache-like random access.

Unfortunately, the vast majority of existing codes require random access, as they were originally developed for memory systems with caches.

Cope et al. [CCLH10] analyze various architectural issues that impact performance on GPU and FPGA accelerators. The study concludes that, while FPGA accelerators are better suited to massive and regular memory access, GPUs perform better with variable data reuse. However, while the bandwidth available at the FPGA internal block RAMs is huge, FPGAs are typically embedded into a PCIe board with on-board memories. Their controllers are often implemented as soft cores that do not achieve a clock speed comparable to their CPU and GPU counterparts, and thus memory optimizations are more critical.

Structured- and unstructured-mesh codes have similar patterns of computation and communication [ADK+09]. In the past, many techniques have been developed to optimize the memory accesses for these datasets, such as different loop and data transformations [LS04][RT98][MCT96]. However, unstructured meshes are more challenging. While the regular connectivity of structured meshes easily allows locality optimizations, unstructured meshes do not allow assumptions about the mesh connectivity. This results in unpredictable access patterns that do not map well to the linear memory model. Automated generation of optimized meshes for complex interdisciplinary problems is still an active field of research [Shi06][DSS07][KdA07][GR09a]. The mesh generation process is currently guided by the requirements of the computational algorithm in terms of convergence and precision rather than computational efficiency, and therefore there is no general solution to automatically increase locality.

Interestingly, meshes result in sparse graphs: each vertex is connected to a small number of vertices compared to the total number of vertices. The number of neighbours of a vertex is known as its degree². In our 2D test meshes, the maximum degree oscillates between 7 and 11. Our 3D meshes occasionally reach 20, but degrees above that number are rare. This enables runtime memory optimizations based on spatial data locality since, although the exact memory operations within a vertex computation are unknown, their number is bounded by a small amount.

There is a growing interest in solving the memory bottleneck in FPGA accelerators with mesh codes. Very recently, Sano et al. [SHY14] proposed a streaming memory approach together with a massively pipelined FPGA implementation in order to optimize general stencil computations. The approach is similar to that proposed in Section 2.3. However, the former proposal has two major drawbacks with respect to ours: first, it only targets structured meshes; second, the massively parallel implementation inside a single FPGA is only feasible for small computational kernels due to resource limitations.

²Strictly speaking, the degree of a vertex is the number of edges incident to the vertex, with loops counted twice. However, typical simulation meshes allow two simplifications. First, there are no loops, and therefore no edge is counted twice. Second, the graph is undirected, hence there is no distinction between in-degree and out-degree.

Our memory optimization has proved to work with an industrial CFD kernel of considerable size. This code uses the computational kernels from the DLR-TAU code [GGF+], a fluid dynamics simulation created by the German Aerospace Centre (DLR, for its German acronym) and widely used in the industry.

2.1.3 Improving data locality

The data dependencies of a vertex are determined by that vertex and its neighbours. To a lesser extent, they may also access data from the neighbours of these first-level neighbours and consecutive neighbourhood levels, which we call k-dependencies. Assuming k = 1 and a classical unoptimized access scheme, when a vertex V0 is processed, its data will be loaded into the computing element together with all of its neighbours V1 to VN. Every time that a vertex Vi, i ∈ [1,N], is computed, V0 will be requested from memory again, since it belongs to the neighbourhood of Vi. Hence, the scheme forces the transfer of each vertex as many times as its number of neighbours plus one. Because the maximum degree of a mesh is small due to its sparse connectivity, it is possible to improve spatial locality by placing neighbours close to each other in memory. In caches, this has the effect of saving vertex transfers because some of the neighbours are transferred together in the same cache line. In other memory systems, spatial locality is also an advantage, as will be shown in Section 2.3.

A common approach for triangular meshes in graphics applications is to divide them into triangle strips [ESV96] that determine a particular triangle sequence, which is then sequentially fetched into the graphics pipeline. Strip processing greatly reduces bus traffic to the GPU compared to a random triangle-by-triangle approach, since the vertices are reused inside the GPU. Virtually all rendering APIs, such as OpenGL, as well as graphics hardware, such as Nvidia GPUs, have native support for triangle strips. The process of generating a sequence of vertices or edges that represents the mesh is typically referred to as mesh sorting.

Mitra and Chiueh [MC98] propose an efficient mesh representation based on a breadth-first traversal. With the same motivation as the triangle strips, they exploit a different traversal method in order to maximize the usage of each vertex in several triangles. This is represented in Figure 2.4. The breadth-first approach generates strips, although these wrap around the initial cell. Hence, the vertices belong to more than two triangles inside the strip, allowing a wider use of the vertices in the future. This improvement comes at the cost of a larger buffer capable of storing more reusable vertices. A hardware implementation is proposed in [MC02]. Derived from research on graphics, Isenburg et al. [ILGS03] generalize mesh sorting to a wider set of computing applications with their processing sequences.


Figure 2.4: Triangle strips. The first is the original depth-first traversal; the second is the breadth-first traversal as used by Mitra and Chiueh.

Their goal is to force the algorithm to compute the dataset in a particular sequence so that only a small portion is stored in the lower levels of the memory, similarly to the techniques that will be proposed in Section 2.3. In [IL05], the approach is applied to out-of-core mesh simplification on a CPU.

Sorting the vertices and edges of a mesh can be done by traversing the graph in a preprocessing stage. This creates a spanning tree and generates a vertex and edge sequence in the visiting order. Graph traversal algorithms are intensively used in other computing disciplines such as artificial intelligence. Korf [Kor95] analyzes the behaviour of several graph traversal algorithms with respect to memory. Zhou and Hansen [ZH06] observe the growing size and shape of the frontier, which is the part of the graph with vertices that have been visited by the traversal algorithm but have not been closed. The study concludes that breadth-first search (BFS) is more memory-aware than other popular graph search algorithms such as A* search. Since the vertices and edges are computed in the order defined by the sorting process, the memory access patterns are identical at the traversal stage and in the computational kernel. Therefore, a memory-efficient traversal algorithm also optimizes memory access at compute time.

Vertex windows

In the context of a mesh sorted into a sequence of vertices or edges, we define the term "window" as the portion of the sequence between the first use of a vertex and its last use. With 1-dependencies, the dependent vertices are only the neighbours. Note that, by definition, it is guaranteed that no vertex will be used outside of its window. We use the concept of window to evaluate spatial locality. In Sections 2.2 and 2.3, we use window minimization techniques to reduce memory traffic both in caches and in a custom FPGA memory hierarchy.

Some of the abovementioned works use concepts equivalent to the vertex windows.

Sano et al. [SHY14] use a buffer to keep vertices that will be used in the future inside the FPGA. Since the target is a structured mesh for stencil computations, the buffer size is constant and equal to 2y plus the active vertex, where y is the width or height of the structured mesh, depending on the direction of the traversal. On the contrary, unstructured meshes lead to windows of various sizes across vertices. In the case of Mitra and Chiueh [MC98], a breadth-first traversal iteratively grows a frontier and separates it into three components (previous, current and next), which behave similarly to a propagating wave. When a certain vertex is being executed, the state of these three sub-frontiers represents the window as defined above. However, they define a concept called active windows that allows further partitioning of the frontiers into smaller parts that can be prefetched from memory on demand. Hence, our definition of window is more similar to their definition of frontiers than to that of active windows.

When the graph is represented as an adjacency matrix instead of an adjacency list, window optimization is similar to bandwidth reduction. This is illustrated in Figure 2.5. This problem is NP-complete [Fei00], and therefore only solutions based on heuristics exist. Nagy et al. [NNH+12] use the Gibbs-Poole-Stockmeyer algorithm for that purpose, resulting in potential bandwidth reductions. However, they do not report bandwidth results for the memory accesses, but only the speedups of an FPGA-accelerated Euler solver on a synthetic sample mesh. Breadth-first sorting is similar to the Cuthill-McKee algorithm, as in Figure 2.5b. A discussion of matrix bandwidth reduction algorithms can be found in [Eve79].

There is a key difference between bandwidth reduction and window optimization. In the example of Figure 2.5b, the traversal algorithm is assumed to iterate through the mesh in terms of vertices, as described by the sequence on the X or Y axis. For instance, vertex 1 is required by vertex 7 for the first time, therefore it is not needed in memory before processing vertex 7; and it is required for the last time by vertex 12, therefore it can be evicted from memory after processing vertex 12. However, this assumes that the algorithm iterates over the graph by vertices. If the algorithm iterates through the edges instead, the window of each vertex grows.

To illustrate the problem, Figure 2.6 introduces an adjacency matrix where the vertices after vertex 8 have dependencies only with vertices placed earlier in the sequence. Algorithms based on bandwidth reduction compute the bandwidth of vertex 9 as 8 (vertices 1 to 9), which is the number of vertices that should be loaded in memory at that time; here we are assuming that we cannot discard vertices between the start and end of the window of vertex 9. Unfortunately, because the processing is in terms of edges, later vertices are needed earlier on because of their edges. For example, vertices 16 and 8 have an edge in common and, because 8 is processed before 9, 16 will already be loaded when vertex 9 is processed. Therefore, vertex 16 is part of the window of vertex 9.


(a) Sorting according to vertex ID (BW = 13). (b) Cuthill-McKee sorting (BW = 9).


(c) Original mesh with superposed windows.

Figure 2.5: Sorting the mesh appropriately results in smaller windows. Figures 2.5a and 2.5b show the adjacency matrices. Figure 2.5c shows the subsets of vertices that belong to two windows: vertex 8 with a sequence based on vertex identifiers, and vertex 6 with a Cuthill-McKee sorting.

Because of this, and considering other vertices required by those preceding 9, the effective window of vertex 9 is now 16 vertices, from 1 to 16. The parts of the adjacency matrix in red show how the original bandwidth calculations span to the end of the sequence because of this edge between 8 and 16.

It is still possible to calculate the window using the adjacency matrix for edge-based processing.



Figure 2.6: Difference between bandwidth and window due to edge processing.

The original window calculation for a vertex V expands to the right (later in the sequence) up to the latest vertex L that is needed by a previous vertex P; in the previous case, V = 9, P = 8 and L = 16. Similarly, the window will expand to the left (earlier in the sequence) up to the earliest vertex E that is needed by a later vertex T. The example has no cases of this situation, as can be seen in the bottom-left part of the adjacency matrix. Correctly calculating the window sizes will become more important when streaming data transfers are introduced in Section 2.3.
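As a concrete illustration of this calculation, the sketch below computes, for an edge-based processing sequence, each vertex's window (its first and last use) and the peak number of simultaneously open windows, which is the on-chip storage a streaming scheme would need. It is an illustrative reimplementation of the idea, not the exact tool used to produce the results reported later.

#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// edgeSeq[i] = (v0, v1) is the i-th edge of the processing sequence.
std::size_t peak_live_vertices(
        const std::vector<std::pair<std::size_t, std::size_t>>& edgeSeq,
        std::size_t numVertices) {
    const std::size_t NONE = static_cast<std::size_t>(-1);
    std::vector<std::size_t> firstUse(numVertices, NONE), lastUse(numVertices, NONE);

    // A vertex's window spans from the first edge that touches it to the last.
    for (std::size_t pos = 0; pos < edgeSeq.size(); ++pos) {
        for (std::size_t v : {edgeSeq[pos].first, edgeSeq[pos].second}) {
            if (firstUse[v] == NONE) firstUse[v] = pos;
            lastUse[v] = pos;
        }
    }

    // Sweep the sequence, opening a window at firstUse and closing it after lastUse.
    std::vector<long> delta(edgeSeq.size() + 1, 0);
    for (std::size_t v = 0; v < numVertices; ++v) {
        if (firstUse[v] == NONE) continue;   // isolated vertex, never used
        ++delta[firstUse[v]];
        --delta[lastUse[v] + 1];
    }
    long live = 0, peak = 0;
    for (long d : delta) { live += d; peak = std::max(peak, live); }
    return static_cast<std::size_t>(peak);
}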

2.1.4 Reducing data size

Gotsman et al. [GGK01] group mesh compression techniques into three types: mesh decimation, connectivity coding, and geometric data compression. Decimation changes the resolution of the mesh; hence it is a lossy compression technique, and it requires a deep understanding of the numerical stability and precision of the algorithms. On the contrary, coding can be easily applied without affecting the original results, and lossless data compression techniques also exist. Section 2.3 takes advantage of connectivity coding and data compression to reduce bandwidth use.

In the context of this work, encoding refers to the process of packing the different components of a mesh description in a common frame where all the connectivity redundancies are removed. Data compression refers to representing data values, either algorithmic or related to connectivity, using fewer bits than the original data.


Connectivity coding

Mesh coding has been widely studied in the field of graphics processing and visualization. However, scientific datasets bear little resemblance to their graphics counterparts. Graphics techniques usually require either triangle or tetrahedral meshes to achieve good compression rates, whereas scientific codes often use a variety of polygons and polyhedra as cells in order to guarantee convergence of the algorithms. Also, quasi-manifold meshes are assumed in the graphics datasets, since they typically represent surfaces, while scientific datasets model solid bodies or spaces, which result in non-manifold meshes.

Mitra and Chiueh [MC98] design an encoding for triangle meshes, which are extensively used in computer graphics. Similarly, Isenburg et al. [IG03] encode the mesh with a sequence of commands, followed by Huffman and run-length compressions. The command sequence relies on the edges being manifold, and a more expensive encoding mechanism is provided for non-manifold cases. While they report non-manifoldness of 4% in their worst-case dataset, our 3D CFD meshes show a non-manifoldness close to 100%, making such approaches unfeasible. The Edgebreaker scheme proposed by Rossignac [Ros99] uses a sequence of commands to encode the mesh iteratively from a starting cell. However, the initial approach is only valid for planar triangulated graphs with no boundaries, although an extension is proposed for boundary encoding that requires extra storage. Such graphs are popular in graphics applications, but rarely used in three-dimensional scientific simulations. More recently, Gurung et al. [GR09b] proposed a compact representation scheme for tetrahedral meshes. Again, the encoding is not independent of the cell types, but tied to tetrahedra.

Lossless data compression

As opposed to connectivity coding, lossless data compression has been widely studied in the field of numerical simulations. Although general-purpose compression algorithms can be applied to scientific data (e.g. GZIP, BZIP2, etc.), Engelson et al. [EFF00] state that specific properties can be used to design data-specific compression schemes that result in better compression ratios. By taking advantage of the smooth changes of physical variables in time and space, they compress sequences of floating-point numbers by converting them to integers and computing their delta values (n-th order differences). Katahira et al. [KSY10] further evolve floating-point compression to predict future values with polynomial functions, resulting in better compression ratios. A hardware implementation on an FPGA device is also presented and evaluated. The compression scheme used in this work, explained in Section 2.3.2, results in a slightly worse compression ratio. In return, it only uses a small fraction of the


FPGA resources required by the abovementioned implementation.

2.1.5 Memory systems for FPGA accelerators

The memory in an FPGA accelerator is either inside the FPGA (on-chip) or outside (off-chip, on-board). The application programmer is free to arrange it in a wide variety of memory architectures. This freedom allows deep optimization of the memory for the data structures, but comes at the cost of increased development complexity. Previous works on the optimization of hardware accelerator memories mainly deal with the automated mapping of regular data structures (i.e. arrays) onto the FPGA memories, or with the generation of cache-based memory hierarchies for irregular data structures.

Gokhale and Stone [GS99] develop an algorithm to map arrays onto the on-chip and off-chip memories of an FPGA accelerator board. The aim is to minimize the execution time of certain loops. This algorithm is integrated into a compiler so that the mapping process is automated. However, each array must be entirely mapped to a single memory, since partitioning is not allowed.

Nalabalapu and Sass [NS05] generate application-specific caches by analyzing the memory access patterns in loops. The induction variables are recognized and used to determine the data reuse patterns, which in turn are used to address a data cache with a special address generator core, so that redundant memory accesses are saved. Similarly, Devos et al. [DVCS08] analyze the induction variables of loops to prefetch parts of arrays so that they do not generate a cache miss. Because these approaches require an analysis of the induction variables of loops, they are not appropriate for unstructured meshes, since the memory access patterns are not given by the loop index but by the contents of the memory itself.

Adler et al. [AFP+11] propose a higher-level abstraction called LEAP scratchpads, where LEAP stands for Logic-based Environment for Application Programming. By instantiating several scratchpads, the FPGA programmer can create multiple memories for different purposes without having to manage the different memory levels (i.e. on-chip, off-chip). The scratchpads use traditional cache techniques such as block transfers and associativity, and they use the internal memory blocks and the external RAM under the hood.

Other works handle software issues related to memory performance, such as pointers. Park and Diniz [PD11] use the FPGA as a sophisticated accelerator for memory operations. By analyzing the memory access patterns at runtime, they are capable of anticipating the necessary data accesses and reordering the data structures so that the traversal over the data is more efficient. Unlike the previous works, they are capable of handling access to pointer-based data structures rather than stride-based access only. Chung et al. [CHM11] expose the FPGA memory organization to the programmer by means of a higher-level programming API called CoRAM.


RAND: A B D C L G H M P J E F I K O
DFS: A B D C F I H E J K O N Q M P
BFS: A B D C E F I H G J K N M L O

Figure 2.7: Three sorting methods applied to a mesh. For each method, the upper line represents the sequence generated by the sorting algorithm, whereas the lower line represents the state of the intermediate buffer during the computation.

This abstraction is easier to use than the original memory interface, but the memory organization must still be designed by hand.

2.2 Locality optimization with mesh sorting

This Section presents an optimization of spatial locality based on sorting the dataset according to a graph traversal algorithm. A thorough explanation of graph traversal algorithms can be found in the book by Gross and Yellen [GY06]. We introduce three sorting methods: random, depth-first sort (DFS) and breadth-first sort (BFS). Figure 2.7 depicts the initial processing sequence and the state of an intermediate buffer for these algorithms. The buffer has a capacity of 15 vertices, the algorithm defines 1-dependencies, and all sequences start from vertex A.

The computation of A triggers the transfer of its neighbours B, C and D from the main memory to the buffer. In the random approach, the sequence may continue with any vertex. Let us assume that the visiting order starts with the sequence A-L-J-Q. When L is selected, G, H, M and P are also transferred. Subsequently, J is computed and E, F, I, K and O are written to the buffer. At this point the buffer is full, and the next vertex, Q, cannot be computed unless some vertices are discarded in order to bring in its absent neighbour, N, and Q itself. Only three vertices (A, L and J) are computed up to this point.

We begin the sequence of the DFS approach with A-D-F-J-O-N-M, one of many valid DFS sequences. A generates the same transfers as the random approach, but the second vertex, D, takes advantage of spatial locality and only transfers three vertices, since D itself and its neighbour C have already been transferred with the previous vertex. At the end, with the same buffer size, 20 vertex transfers are saved. Before vertex M forces vertices to be discarded from the buffer, six vertices are processed. This sequence is the vertex-centered equivalent of the triangle strips introduced in Figure 2.4.

In the case of BFS, we observe 11 computed vertices and 46 saved transfers, requiring only 25% of the memory transfers generated by the random sequence. Observe that the window of vertex C spans from vertex A (its first neighbour) to vertex G (its last neighbour), adding up to a window size of ten vertices. After computing G, vertex C can be removed from the intermediate buffer.

Our breadth-first algorithm, called BFsort for convenience, traverses the mesh starting from a seed vertex and building a spanning tree level by level. The neighbours of the seed form the first level, the second-generation neighbours form the second level, and so forth. We call these levels frontiers to use the same naming as earlier works on the subject. When a vertex is discovered, only two frontiers are open: the frontier that the vertex in process belongs to, and the next frontier containing some of its neighbours. Any lower frontier has been completely processed by the algorithm, and successive ones are not yet started. We take advantage of this property to reduce vertex windows as defined in Section 2.1.3.

The BFsort algorithm is explained in Section 2.2.1. This is followed by a discussion of the window sizes in Section 2.2.2. Finally, the problem of seed selection is introduced in Section 2.2.3. Seed selection is the process of selecting the vertex from where to start the mesh traversal. To our knowledge, we are the first to study this problem, evaluating its impact on the overall quality of the resulting sequences and proposing several algorithms with different tradeoffs between seed quality and execution time.
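The following toy simulator reproduces this kind of bookkeeping: given an adjacency list, a visit order and a buffer capacity, it counts how many vertex transfers occur when a vertex and its 1-distance neighbours must be resident before the vertex can be computed. The FIFO eviction policy is an assumption made only for illustration; the example above does not prescribe one, so the exact counts may differ.

#include <cstddef>
#include <deque>
#include <unordered_set>
#include <vector>

// Counts main-memory-to-buffer transfers for a given visit order. The buffer
// capacity is assumed to be at least 1.
std::size_t count_transfers(const std::vector<std::vector<std::size_t>>& adj,
                            const std::vector<std::size_t>& order,
                            std::size_t capacity) {
    std::deque<std::size_t> buffer;              // FIFO of resident vertices
    std::unordered_set<std::size_t> resident;
    std::size_t transfers = 0;

    auto load = [&](std::size_t v) {
        if (resident.count(v)) return;           // already resident: no transfer
        if (buffer.size() == capacity) {         // evict the oldest vertex
            resident.erase(buffer.front());
            buffer.pop_front();
        }
        buffer.push_back(v);
        resident.insert(v);
        ++transfers;
    };

    for (std::size_t v : order) {
        load(v);                                 // the vertex itself
        for (std::size_t n : adj[v]) load(n);    // its 1-distance neighbours
    }
    return transfers;
}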

2.2.1 The BFsort breadth-first algorithm

The pseudocode of BFsort is given in Figure 2.8. The algorithm starts from a seed vertex, visiting all of its neighbours.


1:  for every vertex do   {Initialize}
2:      stat(vertex) ← UNDISCOVERED
3:  end for
4:  seq ← 0
5:  seed ← select_seed()
6:  queue ← insert(seed)
    {Loop over the mesh}
7:  while queue ≠ empty do
8:      central ← extract(queue)
9:      order(central) ← seq
10:     seq ← seq + 1
        {Insert neighbors in the queue}
11:     while at least one neighbor is not visited do
12:         neighbor ← select_neighbor(central)
13:         if stat(neighbor) = UNDISCOVERED then
14:             queue ← insert(neighbor)
15:             stat(neighbor) ← VISITED
16:         end if
17:     end while
18:     stat(central) ← FINISHED
19: end while

Figure 2.8: Customizable BFsort.

Once the seed is finished, we take the first processed neighbour and look for neighbouring vertices that have not yet been added; these are the second-level neighbours of the seed. The algorithm continues visiting and discovering vertices until the complete graph is traversed. Two functions serve as customization points: select_seed() in line 5 and select_neighbor() in line 12. Seed selection is covered in Section 2.2.3. Neighbour selection, also called tie resolution, occurs when several vertices can be selected as the next in the sequence. Two tie resolution strategies have been tested in this work: random and clockwise. Figure 2.9 shows how applying these strategies affects the resulting sequences. We have found that different tie resolution strategies do not have an effect on vertex window sizes, but some of them improve differential compression schemes [GGK01] suitable for particular target applications. These results are introduced in Section 2.4.2.

Figure 2.10 shows the evolution of the cell-based traversal, as proposed by Mitra and Chiueh [MC98][MC02], and of the vertex-based breadth-first traversal that we propose. In short, the cell-based scheme takes an initial triangle and adds its vertices to the sequence; here we follow a clockwise vertex selection to choose the vertex priority inside a triangle. Subsequently, further neighbouring triangles are chosen and continue filling the sequence with new vertices. Observe that a FIFO queue is used to track vertices that have been discovered but not completely processed. The vertices that reside in the queue reflect the window of the vertex in process.



Figure 2.9: Random and clockwise tie resolution strategies.

Similarly, the vertex-based scheme proceeds as follows. At each step, we define the central vertex as the one that triggers unvisited neighbours to be added to the sequence. We also define the seed as the first central vertex in the traversal, which is also the first vertex to be added to the sequence. This example arbitrarily takes vertex 0 as the seed. Starting from vertex 0, all of its neighbours are visited next and added to the sequence; for clarity, clockwise tie resolution is used. Next, a neighbour is selected (for example, vertex 6) and proceeds as the new central. This process iterates until every vertex is visited, resulting in the sequence 0-6-1-2-3-4-5-17-18-7-8-10-11-12-13-14-15-16-19.

The states of the vertices in the BFsort preprocessing stage have a meaning in the subsequent computational stage. At any given moment, an undiscovered vertex in BFsort translates into a vertex that has not yet been requested by the kernel. A visited vertex implies that its window is open, and therefore it must be loaded into the lower memory levels. When the vertex is finished, it is no longer required in the computations.

The cell-based approach is used by Mitra and Chiueh to improve vertex reuse inside graphics pipelines. Nonetheless, we propose the latter, vertex-based scheme as more efficient for scientific codes. The point where vertices are finished is reached earlier in a vertex-based traversal. If the mesh is transferred sequentially, as in stream access, a random order may require storing the complete mesh in the worst case. That is, if the first and last vertices are neighbours, they will be brought together to the low-level memory, and the first will not be eligible for deletion until the end of the traversal, because the last vertex belongs to the window of the first vertex. With BFsort, at any point, the open frontier is a small percentage of the mesh.
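A compact software sketch of this vertex-based traversal is shown below. It follows the structure of Figure 2.8, with the tie-resolution point reduced to the order in which neighbours happen to be stored in the adjacency list; it also assumes the mesh is connected, so a single seed reaches every vertex.

#include <cstddef>
#include <queue>
#include <vector>

enum class Stat { UNDISCOVERED, VISITED, FINISHED };

// Returns the new processing sequence: order[i] is the vertex in position i.
std::vector<std::size_t> bfsort(const std::vector<std::vector<std::size_t>>& adj,
                                std::size_t seed) {
    std::vector<Stat> stat(adj.size(), Stat::UNDISCOVERED);
    std::vector<std::size_t> order;
    order.reserve(adj.size());

    std::queue<std::size_t> fifo;              // discovered but not finished
    fifo.push(seed);
    stat[seed] = Stat::VISITED;

    while (!fifo.empty()) {
        std::size_t central = fifo.front();
        fifo.pop();
        order.push_back(central);              // assign the next sequence number
        for (std::size_t n : adj[central]) {   // customization point: tie resolution
            if (stat[n] == Stat::UNDISCOVERED) {
                fifo.push(n);
                stat[n] = Stat::VISITED;
            }
        }
        stat[central] = Stat::FINISHED;        // its window can now close
    }
    return order;
}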



Figure 2.10: Breadth-first evolution by cells (upper line), as proposed by Mitra and Chiueh, and by vertices (lower line), as proposed in this work, together with the FIFO queues used to build the order. The lightly coloured vertices and cells are visited but not closed. The dark-coloured vertices are finished.

2.2.2 Window size results

The techniques proposed in this Chapter have been applied to a set of benchmark meshes used in Computational Fluid Dynamics (CFD). Table 2.1 shows relevant features of these datasets; Figures 2.11 and 2.12 show general views and a zoomed view of the modeled object. We pay special attention to the edge-to-vertex ratio (δ), which represents how dense the connectivity is, and to the percentage of connectivity data with respect to algorithmic data (% connect.). δ determines how effective the locality optimization will be, since a high connectivity density results in bigger vertex windows. The percentage of connectivity with respect to algorithmic data is important for the methodology described in Section 2.3, since the techniques applied to the two types of data obtain different compression ratios.

Most of these meshes are publicly available standard NACA airfoils. Only the 2D "turbulent" and "fine" and the 3D "wing" and "plane" meshes have been gathered from Airbus Spain, ONERA France and DLR Germany; they are not in the public domain and hence are not shown in the figures.


      Mesh            Vertices      Edges      δ    % connect.   Preprocess (s)
2D    NACA 2412           3070       8869   2.89        31%            0.034
      NACA inviscid       5233      15449   2.95        31%            0.059
      NACA 0012           7908      23296   2.95        31%            0.087
      NACA NS            25651      57429   2.24        28%            0.240
      NACA laminar       31204      92706   2.97        31%            0.379
      turbulent          54165     138791   2.56        30%            0.693
      fine              494128    1478960   2.99        32%            5.898
3D    NACA               83888     292410   3.49        23%            2.360
      wing              108396     710525   6.55        31%            5.931
      plane             275561    1785732   6.48        31%           10.301

Table 2.1: Characteristics of the datasets.

We consider this group of meshes to be general enough to extrapolate the results to the meshes used in other scientific codes, due to their varying cell shapes (triangles, rectangles, tetrahedra, etc.), resolutions and manifoldness along the mesh.

Figure 2.13 shows the maximum window results for three sequences (original, cell-based breadth-first and vertex-based breadth-first) and five two-dimensional test meshes. Three-dimensional meshes are not analyzed here, since the available cell-based sorting is currently valid only for surface cells. Three-dimensional cell-based sorting is possible, but it has been discarded due to the better results obtained with vertex-based sorting on two-dimensional meshes. The window size and the mesh size are not necessarily proportional, which in turn makes it difficult to set upper bounds on the sizes of the input meshes. Breadth-first sorting achieves smaller frontier sizes than the original order, reducing the window at least by a factor of 5 and in some cases by a factor of 20. Additionally, the vertex-based approach obtains smaller windows than the cell-based one, as Figure 2.10 predicted. Both vertex- and cell-based algorithms work well compared to the original sequence; however, the former improves the results of the latter by 1% to 7% of the mesh size. Hence, if the computational kernel does not make use of cell-centered data, a sequence based on vertices is more memory efficient.

Figure 2.14 shows the window results for several sortings compared to the original sequence of vertices, which preserves the order given by the input file. BFS and DFS sequences are generated by breadth-first and depth-first sortings, where the "rand" and "clockw" BFS versions differ in their tie resolution algorithm, either random or clockwise. Random sequences are not shown, since their results are almost equal to those of DFS.


(a) NACA 2412

(b) NACA inviscid

(c) NACA 0012

Figure 2.11: Complete and zoomed views of the NACA test meshes. The lightly coloured vertices belong to the maximum frontier.

(a) NACA NS

(b) NACA laminar

(c) NACA (3D)

Figure 2.12: (Continues Figure 2.11.) Complete and zoomed views of the NACA test meshes.

Figure 2.13: Maximum vertex window with breadth-first sequences in terms of vertices and cells.

Some meshes appear to be originally memory-aware, such as "NACA NS" or, to a lesser extent, "turbulent". Nonetheless, BFsort still achieves window sizes of 8% of the original sequence in those cases. In other words, only the sequences generated by BFsort are suitable for streaming; otherwise it would be necessary to store at least 25% of the mesh inside the FPGA. It also becomes clear that the behaviour of the original sequences in memory is not predictable. Clockwise resolution has a subtle positive effect on 2D meshes, whereas it may even increase the maximum windows of 3D meshes. Overall, tie resolution does not affect window sizes significantly.

2.2.3 Seed selection

Seed selection is the process of choosing the vertex to be used as the starting point of BFsort. Seed selection is a problem in itself; the BF algorithm will work regardless of the seed chosen. However, choosing specific seeds has proven to reduce the maximum vertex window. To our knowledge, this is the first work that considers seed selection as a means to minimize the memory footprint of a breadth-first-sorted mesh. This section analyzes the impact of selecting different seeds and suggests several algorithms to find a cost-effective solution.

In order to compare the possible variations between different seeds, we applied brute-force BFsort to all possible seeds and measured the maximum vertex window obtained for each.


Figure 2.14: Required size of the FPGA buffer, as a percentage of the complete mesh, for various vertex-based sortings.

Brute force gives the best results, but it is not a realistic approach. Even for small test meshes such as NACA 2412 and turbulent, the time to find the best seed is higher than the computing time of a complex CFD simulation kernel. As the number of vertices grows, the time to compute all the seeds increases dramatically. The standard Breadth-First Search (BFS) has an algorithmic complexity of O(V + E), where V is the number of vertices and E is the number of edges. Since brute-force search adds an outer loop that runs BFS from every vertex, its complexity is O(V² + VE). A quadratic complexity quickly becomes intractable as the number of vertices and edges grows. On the other hand, selecting the wrong seed may increase the memory usage by a factor of two. Hence, a low-cost heuristic is needed.

For all test meshes, brute force has helped us identify where the good and bad seeds are located. These are unstructured meshes and, as such, they show regions with different concentrations of vertices, making up a representative general test set. An appropriate seed selection is less important for structured meshes, since the maximum window sizes are similar for all seeds.

As an example, Figure 2.15 shows the meshed profile of the NACA 0012 airfoil, where the longest frontiers are marked for the cases where the best and the worst seed have been used.


Figure 2.15: Maximum frontiers obtained with the best seed (left) and the worst seed (right). The region of interest is shown in the main figures, whereas the boxed figures show the complete mesh.

The frontier obtained with the best seed gives a good idea of how the size of the maximum vertex window can be diminished:

1. Some areas of the mesh show higher concentration points. These areas, located at the left and right sides of the wing, must be reached early in the traversal, since the frontier will be small at that point. As an example, the figure shows that the best seed is located near the concentration point at the right of the image.

2. The best seed is displaced to the left in order to start with one concentration point early in the execution. Besides, the maximum frontier occurs immediately after the right branch of the frontier has passed through that concentration point, but right before the left branch passes through another concentration point. The conclusion is that dealing with various concentration areas simultaneously may result in an early growth of the frontier, which is not desirable.

On the contrary, the worst seed reaches the concentration areas late in the breadth-first traversal. Furthermore, both areas are reached simultaneously, generating a frontier that is twice the size of the frontier for the best seed.

The next paragraphs propose two methods to select a seed for BFsort. We measure the quality of these methods in terms of two metrics: the execution time and the quality of the selected seed compared to the best and worst seeds.

Center of mass

This approach is based on the vertex coordinates, and takes its name from the calculation of the center of mass of a discrete solid body.

The center of mass is calculated assuming that the weight of each vertex is 1. Equation 2.1 calculates the center of mass \vec{CM} based on the position vector \vec{r}_i of each vertex. Since the coordinates of the CM may not match up with any of the vertices, the vertex closest to the CM in Euclidean distance is selected as the seed.

\vec{CM} = \frac{\sum_{i=1}^{N} \vec{r}_i}{N}    (2.1)

Some improvements are observed when the weight of each vertex varies according to a given property. In particular, a vertex can be assigned a high weight if its neighbours are close to it, thus giving more overall weight to the concentration areas. The individual weight of a vertex receives the name of vertex mass. The calculation of the center of mass with variable vertex masses is given by Equation 2.2.

\vec{CM} = \frac{\sum_{i=1}^{N} m_i \vec{r}_i}{\sum_{i=1}^{N} m_i}    (2.2)

CM works well for simple meshes with no more than two concentration areas. The CM method will force the seed to be close to these areas, allowing them to be reached early in the BF-traversal process. Additionally, the CM will be displaced towards the most concentrated area, which will be computed first. More complex meshes cause the CM method to fail, especially when the vertices do not concentrate around points, such as the wing front and rear in Figure 2.15, but around lines or curves. In addition, CM requires that each vertex has a position vector or coordinates, which is typical in scientific simulations but not necessary in more general graphs.
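A small sketch of this selection, covering both the unweighted (Equation 2.1, with all masses equal to 1) and the weighted variant (Equation 2.2), is given below; the 3D coordinate type and the function name are illustrative assumptions.

#include <array>
#include <cstddef>
#include <limits>
#include <vector>

using Point = std::array<double, 3>;

// Returns the vertex closest to the (weighted) center of mass. Assumes
// coords and mass have the same size and that the total mass is non-zero.
std::size_t center_of_mass_seed(const std::vector<Point>& coords,
                                const std::vector<double>& mass) {
    Point cm = {0.0, 0.0, 0.0};
    double total = 0.0;
    for (std::size_t i = 0; i < coords.size(); ++i) {
        for (int d = 0; d < 3; ++d) cm[d] += mass[i] * coords[i][d];
        total += mass[i];
    }
    for (int d = 0; d < 3; ++d) cm[d] /= total;

    // The CM rarely coincides with a vertex: pick the closest one as the seed.
    std::size_t best = 0;
    double bestDist = std::numeric_limits<double>::max();
    for (std::size_t i = 0; i < coords.size(); ++i) {
        double dist = 0.0;
        for (int d = 0; d < 3; ++d) {
            double diff = coords[i][d] - cm[d];
            dist += diff * diff;           // squared Euclidean distance
        }
        if (dist < bestDist) { bestDist = dist; best = i; }
    }
    return best;
}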

Bipartition and N-partition

Bipartition seed selection is based on the bipartition algorithm used to search in a sorted array. Figure 2.16 shows its pseudocode: the algorithm generates a temporary sequence and applies bipartition, searching through the vertex sequence for an appropriate seed. A general bipartition search algorithm must operate on ordered sequences, since it is based on key comparisons of type "greater than" or "less than" to decide where to look next, to the right or to the left of the sequence. Likewise, the bipartition seed selection algorithm requires an initial temporary sort, as shown in line 2 of the algorithm. The key to be found is always the minimum value of the maximum window, which can only be known by computing the corresponding BF traversal, which is very costly. The bipartition algorithm reduces the number of required traversals from V to log_2(V), where V is the number of vertices in the mesh.


1:  seed ← random(size(mesh))
2:  ordMesh ← BF_order(seed)
3:  startPt ← 0
4:  endPt ← size(ordMesh)
5:
6:  while endPt − startPt > 1 do
7:      midPt ← (startPt + endPt)/2
8:      {Get max window for start and end points as seed}
9:      ordMeshAux ← BF_order(startPt)
10:     startWindow ← get_max_window(ordMeshAux)
11:     ordMeshAux ← BF_order(endPt)
12:     endWindow ← get_max_window(ordMeshAux)
13:     {Pick new interval to look for seed}
14:     if startWindow < endWindow then
15:         endPt ← midPt
16:     else
17:         startPt ← midPt
18:     end if
19: end while
20: return startPt

Figure 2.16: Bipartition seed selection.

Figure 2.17a shows the initial stages of the partitioning algorithm on the NACA 0012 mesh. The starting search interval is initialized to the entire mesh, taking points 1, 2 and 3 as probes. The first iteration executes a complete breadth-first traversal at each probe, measuring the maximum vertex window. Because point 1 obtains better results than point 2, the interval between points 1 and 3 is selected. A new search is started with points 1, 3 and the middle point 4 as the probe set. The process continues until the search interval comprises only one point, which is selected as the best seed.

Figure 2.17b shows a zoomed view of the last stages of bipartitioning on the NACA NS mesh. As the search intervals get smaller, the vertex windows of neighbouring probes become closer. At these stages it is easier for the algorithm to make the right decisions, since the information is very local: the slope is easy to follow, and bipartition will make downhill movements until it finds a local minimum. On the contrary, during the first stages of the algorithm it is easy to make the wrong choices, as the probe points are not close to each other and thus do not take information about the vicinity into account.

Because the quality of the temporary sequence is low, bipartition occasionally selects bad seeds. In order to minimize the chances of selecting a bad seed, an N-partition scheme may be used.


(a) Taking the complete mesh as a start. (b) Taking a small subset of the vertices as a start. Both plots show the maximum vertex window (y axis) obtained for each vertex index used as the seed (x axis).

Figure 2.17: Execution of the bipartition algorithm on the NACA 0012 mesh.

Instead of partitioning the initial sequence in two, the N-partition algorithm divides it up into N chunks. As there are more probing points where the maximum window is calculated, the seed is selected with better knowledge of the search space. However, N-partition requires executing N · log_N(M) BF traversals. Since N · log_N(M) > N′ · log_{N′}(M) for N > N′, we conclude that a high N in an N-partition scheme gives more accuracy than a low N at the cost of a higher computing time.
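The sketch below shows one possible realization of this N-partition search. It assumes the bfsort() helper from the earlier sketch and a hypothetical max_window() function that returns the maximum vertex window of a given sequence; scoring each chunk by probing its left boundary is a simplification of the scheme described above, not the exact implementation used in the experiments.

#include <cstddef>
#include <vector>

// Assumed helpers: bfsort() from the earlier sketch and a hypothetical
// max_window() that returns the maximum vertex window of a sequence.
std::vector<std::size_t> bfsort(const std::vector<std::vector<std::size_t>>& adj,
                                std::size_t seed);
std::size_t max_window(const std::vector<std::vector<std::size_t>>& adj,
                       const std::vector<std::size_t>& order);

std::size_t npartition_seed(const std::vector<std::vector<std::size_t>>& adj,
                            std::size_t n /* chunks per step, n >= 2 */) {
    std::vector<std::size_t> tmp = bfsort(adj, /*seed=*/0);   // temporary sequence
    std::size_t lo = 0, hi = tmp.size() - 1;

    while (hi - lo > n) {
        std::size_t step = (hi - lo) / n;
        // Score each chunk by the maximum vertex window obtained when the
        // vertex at its left boundary is used as the seed, then keep the
        // best-scoring chunk as the new search interval.
        std::size_t bestLo = lo;
        std::size_t bestWin = static_cast<std::size_t>(-1);
        for (std::size_t k = 0; k < n; ++k) {
            std::size_t probe = lo + k * step;
            std::size_t win = max_window(adj, bfsort(adj, tmp[probe]));
            if (win < bestWin) { bestWin = win; bestLo = probe; }
        }
        lo = bestLo;
        hi = bestLo + step;
    }
    // The remaining interval is small enough to stop; return its first vertex.
    return tmp[lo];
}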


Figure 2.18: Maximum window size obtained by seed selection algorithms.

Finally, it is possible to save calculations at the start of the seed selection. In Figure 2.17b, we can identify a pattern that repeats periodically. If we choose a search interval as small as the period of that pattern, the search space is reduced compared with the complete mesh. In addition, the N-partition algorithm will be guided correctly to the best local seed. Furthermore, as the local minima do not vary greatly along the mesh (Figure 2.17a), the choice of the particular interval within the complete mesh is unimportant as long as it covers the identified pattern period.

Performance results

Figures 2.18 and 2.19 compare several approaches to seed selection. Obviously, brute force always finds the best seed, but it takes several orders of magnitude longer than the CM method to execute. On the other hand, selecting a default seed such as vertex 0 gives potentially bad seeds, and the resulting maximum window size is not predictable.

The CM method, while keeping the execution time low, selects a good seed compared to the optimal selection, and bounds the maximum window size to a small percentage of the complete mesh. However, its effectiveness depends on the type of mesh: if there are multiple vertex concentration areas, or if they cluster


Figure 2.19: Execution time of seed selection algorithms.

around a line or more complex shapes, CM will select seeds farther from the optimal.

N-partition finds better solutions than CM and, because it does not rely on geometric information (i.e. coordinates), it works for all meshes. However, higher computing times are required, since the calculation of the maximum vertex windows requires executing a graph traversal for each probe point. Generally, we would consider N-partition only if the subsequent processing stage is long. Also, if the mesh sorting application is independent of the HPC code, it is possible to sort the mesh once and reuse it for several simulations. In the particular case of our CFD code, the mesh is processed iteratively, and the number of iterations ranges between the hundreds and the tens of thousands depending on the convergence requirements of the partial differential equations. In such a case, using bipartition or N-partition is a good choice that does not increase the execution time noticeably, but obtains vertex windows close to their lower bound.


2.3 A streaming methodology for pipelined kernels

The previous Section proposed a mesh sorting method to improve data locality so that the transfers between memory levels are minimized. In this Section, sorting is combined with other techniques in a full methodology that addresses the memory bottleneck of FPGA-accelerated HPC algorithms processing unstructured meshes.

Regardless of whether the computing device is a CPU or an FPGA, the memory hierarchy is organized around two physical levels. In the case of a CPU, these are the off-chip main memory and the on-chip caches; modern CPU caches also have up to three sub-levels. Similarly, FPGA accelerator boards have a main memory as the off-chip level, plus a wide variety of memory resources inside the FPGA as the on-chip levels. While, for CPUs, these physical levels are arranged in a standard memory hierarchy with traditional cache mechanisms (block transfers with different block sizes, associativity, etc.), the memory resources of FPGAs allow greater flexibility. We propose a hybrid hierarchy combining sequential and random accesses at different levels which, in combination with a sorted mesh and further compression techniques, reduces memory traffic significantly.

The methodology is based on the combined application of BFsort, coding and compression techniques. Sorting allows efficient stream transfers between the memory and the FPGA, improving data locality and avoiding redundant data requests. Coding achieves a compact representation of the mesh connectivity. Differential compression reduces the size of the mesh data. This complete methodology has been validated by results on a hardware system implemented in our research group.

2.3.1 Hybrid sequential/random access Figure 2.20 compares a pure random-access cache system with our custom mem- ory system. Caches work with block transfers, also a form of size-limited burst accesses. For example, a request for vertex 3 will trigger the transfer of block {2-6}. Further requests for vertices 2 to 6 will not generate a cache miss. How- ever, cache controllers require complex logic to implement techniques such as the replacement policies or the associativity. Sequential mode is more efficient in the available FPGA memory controllers, specially in the typical situation where memory controllers are implemented as soft cores that use part of the FPGA resources. Because the application (partial differential equations) and the dataset struc- ture (meshes) are well known, it is possible to stream the mesh in a predefined

42 Chapter 2. Memory optimization for unstructured meshes


Figure 2.20: A traditional cache system (left) compared to a custom hybrid sequential/random system (right) on an FPGA.

This is the sequential part of the hybrid approach. The kernel can then randomly access the data inside the FPGA, as long as the data requests are satisfied locally; that is, data requests must be addressed to the small portion of the mesh that resides on-chip. Because the mesh is transferred sequentially to the lower memory level, an attempt to randomly access missing parts of the mesh results in an indefinite kernel stall. To ensure that the computational kernel is served all data requests, each vertex is kept on-chip until all the computations involving its data are finished. The interval during which a vertex is kept at the lower memory levels is given by its vertex window. Hence, BFsort greatly reduces the required size of the on-chip memory by reducing the maximum vertex window.
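The following sketch illustrates this retention policy in software terms: vertices arrive as a sequential stream in sorted order, are held in a small table while their window is open, and are evicted once their precomputed last use has passed. It is a behavioural model for illustration, with invented names, not the hardware buffer actually implemented on the FPGA.

#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

struct StreamedVertex { std::size_t id; std::vector<double> data; };

class WindowBuffer {
public:
    // lastUse[v] is the position in the processing sequence where vertex v
    // is used for the last time (the end of its window), computed offline.
    explicit WindowBuffer(std::vector<std::size_t> lastUse)
        : lastUse_(std::move(lastUse)) {}

    // Called when the next vertex of the stream arrives (sequential access).
    void admit(StreamedVertex v) {
        std::size_t id = v.id;
        resident_.emplace(id, std::move(v));
    }

    // Random access from the kernel; valid only while the window is open.
    const StreamedVertex& get(std::size_t id) const { return resident_.at(id); }

    // Called after processing position 'pos' of the sequence: evict every
    // resident vertex whose window closed at or before that position.
    void retire(std::size_t pos) {
        for (auto it = resident_.begin(); it != resident_.end(); ) {
            if (lastUse_[it->first] <= pos) it = resident_.erase(it);
            else ++it;
        }
    }

private:
    std::vector<std::size_t> lastUse_;                       // per-vertex window end
    std::unordered_map<std::size_t, StreamedVertex> resident_;
};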

2.3.2 Coding and compression with BFsort

BFsort aims at improving spatial locality by creating a processing sequence that places neighbouring vertices as close as possible to each other. BFsort is applied as a preprocessing stage; it can even be part of the mesh generation process. Frame coding packs the description of the vertices and edges into a single compact frame that removes the redundant connectivity information of the mesh described as vertex and edge lists; it applies to transfers involving the complete mesh. A differential compression scheme is also applied to both the connectivity and the algorithmic data associated to vertices and edges.

It will be shown that the proposed encoding and data compression schemes benefit from the improved data locality that results from the sorting stage.

The coding and compression techniques should be simple yet effective. Simple, because the hardware implementation must operate at the maximum possible clock frequency in conjunction with pipelined kernels, and also due to the limited hardware resources inside the FPGA. Effective, because the resulting compression factors determine the maximum throughput (i.e. vertices per second) of the decoder, which is bound in the first place by the maximum bandwidth of the DDR memories.

Connectivity coding

When the mesh description is based on edge and vertex lists, the redundancies relate to the appearance of vertex IDs in the edge list and of edge IDs in the vertex list. Here we assume processing in terms of edges, where each edge stores two vertex IDs, one for each endpoint; in such a case, the vertices do not store the edge IDs. Consequently, each vertex ID appears as many times as there are edges connecting to it. If the list of edges is arranged in a way that all edges sharing a common endpoint are grouped together, it is possible to combine both lists into a single frame with no redundant vertex IDs. Upon arrival at the FPGA, the frame is decoded and both lists are recovered.

The proposed frame is organized as a sequence of blocks, arranged in subsequences starting with a single block representing a central vertex and followed by a variable number of edge blocks. Each edge block contains the vertex ID of one endpoint, while the preceding central block contains the second endpoint ID. For example, the sequence of IDs {n0 n1 n2 n3} describes the central vertex n0 and the edges n0n1, n0n2 and n0n3. A flag must be included to distinguish between central vertex and edge blocks; central vertices are written in bold. Additionally, only the first appearance of any vertex ID in the frame is followed by the algorithmic data associated to the corresponding vertex. This is possible because the next uses necessarily fall within the window of that vertex, so the data will already be loaded onto the FPGA. We represent new IDs as underlined; these are marked with another flag. Finally, edge data are always included after the ID of the edge block. In directed graphs, edge data include a third flag to indicate the direction.

Figure 2.21a shows an example of a small mesh, sorted according to the original labeling of vertices, and the frame describing it. In this case, we follow a tie resolution strategy based on increasing IDs as given by the original sequence. Each edge is included only in the subsequence where its first endpoint appears as central (i.e. it does not appear when the central vertex corresponds to its second endpoint).


(a) Random sequence (b) BFsort sequence

Figure 2.21: Frame coding without ID compression and with ID compression.

Connectivity information can be compressed to reduce the size of the frame, as in the second frame of Figure 2.21a. On the one hand, central vertices appear in order, so their IDs are not needed: they can be restored with a counter, and only the central flags are required (marked $). On the other hand, non-central IDs use a simple difference compression with respect to the previous ID. A consecutive flag (marked +) is used when adjacent IDs are consecutive. This connectivity compression reduces the number of IDs in the edge blocks from 38 to 14.

In general, vertex labeling is decided during mesh generation. It is not necessarily random (e.g. IDs in Figure 2.21a are assigned approximately from top to bottom), but it is certainly unpredictable. By applying BFsort to the mesh in Figure 2.21a with vertex 8 as seed, the labels are reassigned as in Figure 2.21b. The first frame corresponds to the proposed coding strategy without ID compression. In the second frame, the original 38 IDs are reduced to 7 by sorting the mesh appropriately.
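Under our reading of this scheme (central IDs dropped and restored with a counter, non-central IDs replaced by the consecutive flag when they follow the previous ID and kept in full otherwise), the ID compression can be sketched as follows. The Block structure is the same as in the previous sketch, and the string tokens only make the output readable; the hardware uses single-bit flags.

#include <cstdint>
#include <string>
#include <vector>

struct Block { bool isCentral; uint32_t vertexId; };   // as in the previous sketch

// Compress the connectivity IDs of a frame. Central blocks keep only the '$'
// flag (the decoder restores their IDs with a counter, since centrals appear
// in order). A non-central ID is replaced by '+' when it is consecutive with
// the previous ID in the frame, and kept in full otherwise.
std::vector<std::string> compressIds(const std::vector<Block>& frame) {
    std::vector<std::string> out;
    int64_t  prev = -1;            // last ID seen (restored central or explicit)
    uint32_t centralCounter = 0;   // mirrors the decoder-side counter
    for (const Block& b : frame) {
        if (b.isCentral) {
            out.push_back("$");
            prev = centralCounter++;
        } else {
            if (int64_t(b.vertexId) == prev + 1)
                out.push_back("+");                        // consecutive: flag only
            else
                out.push_back(std::to_string(b.vertexId)); // full ID kept
            prev = b.vertexId;
        }
    }
    return out;
}

// A subsequence with central 0 and edges 1, 2, 3 compresses to "$ + + +".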

Data compression

In addition to connectivity information, the frame includes algorithmic data associated to vertices and edges. A simple yet efficient approach is to apply differential compression between values corresponding to the same parameter of consecutive vertices or edges. Each value may be transferred in the compressed or the original form depending on a threshold. By analyzing the values of the variables during execution for a subset of the datasets, we set a threshold for the differences and their corresponding reduced representation. The threshold value is a trade-off between having a small representation and allowing a high number of values to be represented.

Thresholds are set at compile time. Dynamic thresholds are impractical so far because they require costly mesh preprocessing and dynamic management of the memory interface modules. Each variable requires a flag in order to switch between both representations.

Differential compression works best if consecutive vertices or edges in the frame hold similar data values on a per-variable basis [KSY10][EFF00]. The proposed BFsort algorithm places adjacent vertices in the mesh consecutively in the frame, thus favoring data similarity. In addition, different tie-resolution strategies (line 12 in Figure 2.8) affect the order of the vertices and thus the efficiency of differential compression. Among the possible tie-resolution strategies, those that also preserve the adjacency between the neighbours, such as clockwise or counterclockwise resolution, improve the compression of some parameters (e.g. vertex data dependent on the vertex location), but not necessarily all of them (e.g. edge data dependent on the edge direction). The effect of random and clockwise tie-resolution strategies on the connectivity and algorithmic data is analyzed in Section 2.4.2.
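A minimal sketch of the per-variable differential compression, assuming a single compile-time threshold per variable, is shown below. The Coded structure, the float payload and the function name are illustrative; the actual implementation transmits fixed-point values with the bit-widths discussed in Section 2.4.2 (Table 2.7).

#include <cmath>
#include <vector>

// One output item per value: either the original representation or a short
// differential one, selected by a per-value flag.
struct Coded {
    bool  differential;   // flag transmitted with every value
    float payload;        // difference if differential, original value if not
};

// Differentially compress one stream of per-vertex (or per-edge) values of the
// same variable. Differences below 'threshold' fit in the reduced representation.
std::vector<Coded> compressVariable(const std::vector<float>& values,
                                    float threshold) {
    std::vector<Coded> out;
    float prev  = 0.0f;
    bool  first = true;
    for (float v : values) {
        if (!first && std::fabs(v - prev) < threshold)
            out.push_back({true, v - prev});   // short differential form
        else
            out.push_back({false, v});         // fall back to the original form
        prev  = v;
        first = false;
    }
    return out;
}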

2.3.3 Implementation details

The architecture of the processing element (PE) is composed of a host and an accelerator, as shown in Figure 2.22. The host contains one general-purpose processor and its main memory. The accelerator board contains one or more FPGAs and enough on-board memory to store the mesh processed by the PE. The current implementation is restricted to a single PE. Amdahl's law implies that a significant part of the algorithm must be executed in the accelerator. The host applies the compaction techniques to the mesh, sends it to the accelerator memory at the beginning of the execution (1), collects the updated mesh at the end (4) and restores it to the original format. At each iteration of the iterative computational kernel, the on-board memory pushes the complete streamed mesh into the FPGA memory, and the results are saved back (2). As explained in Section 2.3.1, the FPGA cannot request a random memory position from the on-board memory; it can only randomly access the portion of the mesh that resides in the FPGA at any given moment of the execution.

The board used as demonstrator for this work has three DDR memories. We take advantage of this fact by storing read-write (RW) data in one memory and writing the results into another. After each iteration, the read and write memories are swapped (3). We use the third DDR memory to store read-only (RO) data. This technique allows a 3× faster memory access, since the three memories can work in parallel. However, the programmer must manually specify which data are part of the RO and RW data. This is discussed later in this Section.



Figure 2.22: Target architecture showing the data movements between host and accelerator.

Read subsystem: memory -> block decoder -> data decompressor -> kernel (interfaces A to E, with A at the memory output and E feeding the kernel).
Write subsystem: kernel -> data compressor -> block encoder -> memory (interfaces F to J).
Dual-clock FIFOs separate the modules.

Figure 2.23: Interface between the memory and the kernel for the read-write part of the mesh. At each iteration, the read and write memories are swapped.


The hardware coding system is divided into four modules: block decoder, data decompressor, data compressor, and block encoder. Figure 2.23 shows the top-level architecture of the read-write system. The RO system is identical except that there is no encoder or compressor, since the RO data are not written back to memory. The decoder reads and writes encoded memory words. Since the mesh is processed in terms of edges, the decompressor must feed the computational kernel with one edge per cycle. The blocks are separated by dual-clock FIFOs to decouple the clock domains of each module and allow fine performance tuning. All the blocks are synchronized by a custom handshake mechanism that makes use of the full and empty flags of the FIFOs and synchronizes the pipelined kernel with the memory interface.

Mitra and Chiueh [MC02] implement a similar interface with a custom mesh encoding. They further reduce on-chip memory size by prefetching only a small fraction of the window, at the cost of duplicating transfers. The prefetching points can be calculated at preprocessing time, so it is assumed that the extra transfer overhead is hidden by the prefetching mechanism. However, the iterative nature of mesh-based scientific kernels, which changes the data across iterations, would force us to have a slower and more complex output encoder if we implemented prefetching on our system. Output data would have to be kept inside the FPGA until they were added to all the prefetching points in the mesh stream, greatly increasing the on-chip storage requirements.

Analytical model

The following model is used to evaluate the performance of the coding and compression system. The performance is given by the number of edges per second that the system is capable of serving to the kernel. The system must serve the edges at least at the same rate as the memory. If BW_A is the effective bandwidth of the memory (arrow A in Figure 2.23) and S_A is the average block size in the frame, then Equation 2.3 gives the number of blocks served to the decoder per second.

\[ |Blk|_A = \frac{BW_A}{S_A} \quad \text{(blocks/s)} \qquad (2.3) \]

Only the edge blocks translate into an edge served to the kernel at interface E. The central blocks are necessary for edge reconstruction purposes, but are cleanly discarded at the decoder. Therefore, the speed in edges/s at interface E is equal to the speed in blocks/s at interface A (Equation 2.3) weighted by the proportion of edge blocks (non-centrals):

\[ |Edg|_E = |Blk|_A \times \frac{E}{E + V} \quad \text{(edges/s)} \qquad (2.4) \]


δ is the ratio of edges to vertices in the mesh (E/V), and a simple algebraic manipulation gives the following expression:

\[ \frac{E}{E + V} = \frac{\delta}{\delta + 1} \qquad (2.5) \]

Substituting Equations 2.5 and 2.3 into Equation 2.4 results in Equation 2.6.

\[ |Edg|_E = \frac{BW_A}{S_A} \times \frac{\delta}{\delta + 1} \quad \text{(edges/s)} \qquad (2.6) \]

Assuming that the decompressor outputs one edge per cycle, the minimum frequency of the decompressor is |Edg|_E (Hz).
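The model reduces to a couple of divisions, as the following sketch shows. The function names are illustrative; the numbers in main() are the RO-frame figures quoted later in this Section (24 Gb/s memory, 76-bit average block size, δ = 3 for the 2D meshes) and reproduce the 316 MHz decoder requirement discussed there.

#include <cstdio>

// Equations 2.3 and 2.6 of the analytical model.
double blocksPerSecond(double bwA, double sA) {
    return bwA / sA;                                          // Eq. 2.3: |Blk|_A
}

double edgesPerSecond(double bwA, double sA, double delta) {
    return blocksPerSecond(bwA, sA) * delta / (delta + 1.0);  // Eq. 2.6: |Edg|_E
}

int main() {
    double blk = blocksPerSecond(24e9, 76.0);       // ~316e6 blocks/s
    double edg = edgesPerSecond(24e9, 76.0, 3.0);   // ~237e6 edges/s
    // 'blk' bounds the decoder frequency; with one edge output per cycle,
    // 'edg' is the minimum decompressor frequency in Hz.
    std::printf("%.2e blocks/s, %.2e edges/s\n", blk, edg);
    return 0;
}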

Split-frame optimization

Figure 2.22 showed multiple memories for read-only and read-write data. This effectively multiplies the available bandwidth to/from the FPGA, but requires splitting the data into two frames. The following paragraphs explain how to split the mesh according to the previous analytical model. With several memories, balancing is central to obtaining maximum speedup.

Figure 2.24 shows the format of a streamed mesh, as explained in Section 2.3.2, split into RO and RW frames. Note that the left frame contains the connectivity as subsequences of a central block plus edges, while the right frame only contains vertices. The processing sequence is given by the frame on the left. The CTRL field determines the type of block. An all-zero value represents a central vertex, while the remaining values represent different combinations of vertex and edge data, as well as absolute/differential representations of the variables.

Two streaming memories are balanced if it takes the same amount of time to transfer their frames. For example, for two frames where one is edge-based (subindex e) and the other is vertex-based (subindex v),

\[ \frac{F_e}{BW_e} = \alpha \, \frac{F_v}{BW_v} \qquad (2.7) \]

where F is the total size of the frame (either edge- or vertex-based) and α is the imbalance factor, meaning that the edge-based frame requires α times the time needed to transfer the vertex-based frame. Obviously, α = 1 implies two perfectly balanced frames.

Because F_e and F_v vary widely across meshes, it is easier to use δ instead: if V is the number of vertices, E is the number of edges and δ = E/V, then F_v = V·S_v and F_e = (V + E)·S_e. Dividing both sides by V, we obtain the following expression:

\[ (1 + \delta) \, \frac{S_e}{BW_e} = \alpha \, \frac{S_v}{BW_v} \qquad (2.8) \]


where S_e and S_v are the average block sizes of each frame. If we assume memories with the same bus widths and frequencies, the imbalance factor can be calculated from Equation 2.9.

\[ \alpha = (1 + \delta) \, \frac{S_e}{S_v} \qquad (2.9) \]

If both frames were vertex-based or both edge-based, the (1 + δ) factor would not apply. In sum, memory balancing requires evaluating two aspects:

• The average size of each variable, accounting for differential compression effects. Each variable is assigned to one of the frames, increasing its size and affecting the average block size S.

• Vertex- and edge-based frames. If the kernel processes by edges, at least one frame must be edge-based in order to give the processing sequence and the connectivity. Vertex data may be placed in either vertex-based frames, or edge-based frames if they are attached to the first edge incident to the vertex. A (1 + δ) factor must be applied between vertex- and edge-based frames to compensate for the different number of blocks.
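As a small illustration of the balancing check, the imbalance factor of Equations 2.8 and 2.9 can be computed as follows; the function names and any input values are hypothetical and only show how the two aspects above enter the calculation.

// Imbalance factor between an edge-based and a vertex-based frame (Eq. 2.8
// solved for alpha). Block sizes are in bits, bandwidths in bits/s, delta = E/V.
double imbalance(double se, double bwe, double sv, double bwv, double delta) {
    return ((1.0 + delta) * se / bwe) / (sv / bwv);
}

// Same-bandwidth case of Eq. 2.9: alpha = (1 + delta) * Se / Sv. The (1 + delta)
// factor compensates for the edge-based frame having V + E blocks instead of V.
double imbalanceSameMemory(double se, double sv, double delta) {
    return (1.0 + delta) * se / sv;
}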

Reference implementation

The methodology has been implemented on a machine composed of an Intel Core i7 at 2.66 GHz with a Gidel ProcStarIII FPGA-based accelerator board attached via PCI Express. The FPGAs are Altera Stratix III EP3SE260, speed grade 3. Three DDR2 memories are available on-board: two SODIMMs with 4 GB each at 24 Gb/s and one SODIMM with 512 MB at 16 Gb/s [Gid12].

The hardware implementation has been developed within our research group to evaluate the results of this methodology in the context of an industrial project. As it is not work undertaken by the author, its details are not included here. We refer the reader to the complete publication [BCL+14] for a more complete description of the hardware blocks. However, we show the frequency and area results obtained in order to apply the aforementioned analytical model. We use the two-dimensional CFD meshes; 3D CFD and other algorithm datasets require similar interfaces, although the exact balancing, frequency and area numbers might change, as well as the kernel implementation. The RO frame uses one 186.7 MHz memory and the RW frame uses two memories at 125 MHz in the configuration shown in Figure 2.22. The frequency and area results of the reference implementation for the RO and RW memory interfaces are shown in Tables 2.2 and 2.3, respectively.


RO frame blocks carry the CTRL field, edge flags, edge normals and edge IDs, plus first-appearance vertex data such as type and volume; RW frame blocks carry the CTRL field, differential flags, vertex coordinates, velocities and densities.

Figure 2.24: Left: edge-based read-only frame. Right: vertex-based read-write frame. Dark blocks represent vertices; light blocks represent edges in the connec- tivity.

The edge normals and the vertex volumes are stored in the RO frame. The vertex coordinates, velocities, scalar densities and density × energy are stored in the RW frame. The average RO block size is 76 bits and the memory throughput is BW_A = 24 Gb/s. We assume that δ = 3, which holds for all our 2D meshes. Applying Equation 2.8 with these values gives an α value of 1.07, which means that the RO frame takes 7% more time to transfer than the RW frame. Since it is not possible to achieve perfect balancing, the real speedup of the split-frame implementation with respect to the single frame is 1.93 instead of the theoretical 2.

                     Decoder    Decompressor
Logic Usage          < 1%       < 1%
Comb. LUTs           1036       819
Memory LUTs          20         40
Registers            1492       2200
Block mem. bits      3904       1672
Max. clock freq.     302 MHz    304 MHz

Table 2.2: Frequency and area results, RO frame


                     Encoder    Decoder    Compressor    Decompressor
Logic Usage          3%         3%         < 1%          1%
Comb. LUTs           3326       3597       1005          954
Memory LUTs          80         80         500           260
Registers            4836       4768       1646          2772
Block mem. bits      70544      4968       1635          1032
Max. clock freq.     295 MHz    318 MHz    299 MHz       293 MHz

Table 2.3: Frequency and area results, RW frame

Replacing the values in Equation 2.3 gives a required frequency of 316 MHz. In this case, since the mesh is reduced to 0.35 times its original size, the performance of the memory system improves by a factor of 2.85 with respect to an unoptimized streaming memory system. According to Table 2.2, the RO decoder reaches a frequency of 302 MHz, which is 14 MHz (5%) below the required frequency. This reduces the speedup to 2.7. Together with the split-frame implementation, this makes up a theoretical 5.2× speedup in memory performance with respect to a random-access cache system. The real speedup is higher, since random accesses are more expensive than sequential ones in current memory controllers.

The RW frame uses the 16 Gb/s memories and has an average block size of 141 bits. The same calculations give a required frequency of 113 MHz. Since the maximum frequencies of the encoder and compressor more than double this requirement, the RW frame does not affect performance with the current setup.

2.4 Performance results

This Section gives two performance evaluations of the proposed techniques. First, Section 2.4.1 shows the results of mesh sorting on a CPU with caches. These results confirm that, with random access as offered by traditional cache systems, the processor takes advantage of implicit prefetching each time a cache line is requested, minimizing cache misses. This implicit prefetching is affected by spatial locality, and therefore the proposed sorting mechanism improves the overall performance of the scientific simulation. Section 2.4.2 completes the formal analysis of the model in Section 2.3.3 with measurements of compression on the test meshes. Apart from showing that the methodology is useful, these results show that the combined application of sorting, coding and compression obtains better results than each of these techniques applied separately.


2.4.1 Data locality improvements on a multicore

A detailed profiling has been carried out with traditional time analysis on a CPU and Cachegrind [Cac], counting the L1 and L2 cache misses generated by every function of the code. A simple test application has been coded; it converts the mesh into three position arrays, each storing one coordinate of the vertices, and one adjacency array, storing the connections between vertices. The computation involves, for each vertex, adding up all of its coordinates and also the coordinates of its neighbours. This computation stage, although extremely simple, generates nearest-neighbour dependencies, the same memory access pattern as many common scientific codes.

Two instances of the program have been run. While the first instance computes the vertices in the sequence given by the input file, the second instance preprocesses the data structures, using the adjacency array to traverse the mesh breadth-first and rearranging the data structures accordingly. Measurements for the second code already include the cost of rearranging the data structures, which involves a considerable amount of memory operations.
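A sketch of the test application in C++ follows: the kernel with nearest-neighbour dependencies and the breadth-first reordering applied by the second instance. Array and function names are illustrative, and the reordering shown is a plain BFS without the seed-selection and tie-resolution refinements of BFsort.

#include <cstddef>
#include <queue>
#include <vector>

// Nearest-neighbour kernel: for every vertex, add its own coordinates and those
// of its neighbours. The adjacency is stored as offsets + neighbour indices.
double kernel(const std::vector<double>& x, const std::vector<double>& y,
              const std::vector<double>& z,
              const std::vector<std::size_t>& offsets,
              const std::vector<std::size_t>& neighbours) {
    double acc = 0.0;
    for (std::size_t v = 0; v + 1 < offsets.size(); ++v) {
        acc += x[v] + y[v] + z[v];
        for (std::size_t i = offsets[v]; i < offsets[v + 1]; ++i) {
            std::size_t n = neighbours[i];
            acc += x[n] + y[n] + z[n];
        }
    }
    return acc;
}

// Breadth-first ordering of the vertices starting from 'seed'. The second test
// instance permutes the position and adjacency arrays with this order before
// running the kernel (the cost of doing so is included in its timings).
std::vector<std::size_t> bfsOrder(const std::vector<std::size_t>& offsets,
                                  const std::vector<std::size_t>& neighbours,
                                  std::size_t seed) {
    std::vector<std::size_t> order;
    std::vector<bool> visited(offsets.size() - 1, false);
    std::queue<std::size_t> q;
    q.push(seed);
    visited[seed] = true;
    while (!q.empty()) {
        std::size_t v = q.front();
        q.pop();
        order.push_back(v);
        for (std::size_t i = offsets[v]; i < offsets[v + 1]; ++i) {
            std::size_t n = neighbours[i];
            if (!visited[n]) { visited[n] = true; q.push(n); }
        }
    }
    return order;
}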

                            Execution time (s)
Mesh      # vertices    Original      Vertex BF       S
NACA1     3070          0.080         0.060           1.33
NACA2     7908          0.145         0.139           1.04
NACA3     25651         0.795         0.791           1.01
HILIFT    54165         1.968         1.709           1.15
VFINE     494128        2.854         1.108           2.58

Table 2.4: Timings and speedups of the test code with two vertex sequences

                            % D-L1 misses       % L2 misses
Mesh      # accesses        Orig.     BF        Orig.    BF
NACA1     1.4 × 10^8        5.7       5.1       0.1      0.2
NACA2     3.6 × 10^8        5.9       5.0       0.0      0.2
NACA3     9.5 × 10^8        6.9       7.5       5.1      5.1
HILIFT    2.2 × 10^9        16.1      7.3       5.0      5.3
VFINE     6.1 × 10^9        4.9       1.8       3.2      1.4

Table 2.5: Data-L1 and L2 cache misses, measured as a percentage of the total accesses to the memory system.

Table 2.4 shows the timings for both codes on an Intel Core2 Quad Q8400 at 2.66 GHz. There is little to no improvement for the smaller datasets, partly because of the extra cost of sorting the data in main memory. However, the biggest mesh obtains a considerable speedup of 2.58.

Not only has the processing sequence changed, but the whole data structures have also been physically reordered so that the cache block mechanisms work well. The breadth-first sort is expected to have a greater impact on bigger datasets, since these not only use more memory, but also higher levels of the hierarchy, which slow down the execution more than the lower levels. The speedup also depends on the quality of the original sequence.

Table 2.5 compares the cache performance of both test codes as measured with Cachegrind. Except for NACA3, all meshes improve the L1-cache performance with BFsort. For the L2 cache, only VFINE achieves a considerable improvement. Accounting for the fact that the BF code includes the cost of rearranging the memory structures, and by the effect of Amdahl's law, we claim that BFsort will perform better with codes that iterate many times over the mesh, as is usually the case in scientific codes.

2.4.2 Streaming methodology on an FPGA accelerator

Compaction of the connectivity data

The connectivity data are around 30% of the total data in a CFD mesh. Figure 2.25 shows the results of applying four different combinations of the techniques described in Section 2.3.2. We use the compression ratio, as defined by Salomon [Sal04], to measure the difference in size between the original and optimized streams:

\[ CR = \frac{\text{new size}}{\text{original size}} \]

According to this definition, if a compressed mesh occupies 60% of its original size, the compression ratio is 0.6.

The first strategy encodes the mesh from a random sequence, using a general-purpose width of 32 bits for the vertex IDs. Then, the same encoding is applied with a breadth-first sequence. The last two versions further reduce the ID size to the minimum. Identifying the minimum bit-width required for correct representation is done as part of a preprocessing stage.

The coding strategy alone with a random sort obtains a CR of 0.57 on average. BFsort reduces it to an average of 0.28. This improvement comes exclusively from removing consecutive IDs of adjacent vertices. With compressed IDs, the CR is further reduced down to 0.21. However, just by improving the locality of IDs with BFsort, the required bit width is further reduced by five bits. This, together with the increased amount of removed consecutive IDs, achieves a CR of 0.10.


Figure 2.25: Compression ratio (CR) of the encoded connectivity data with different sorting methods and with or without quantization.

Compression of the algorithmic data

In the CFD meshes under consideration, the algorithmic data associated to the vertices comprise velocities, volume, coordinates and two densities. The edges contain their normals. The algorithmic data are not affected by the encoding strategy, since the encoded representation keeps one variable of each type per vertex or edge, the same as the original mesh. However, by improving spatial locality, the values may be close to those of previous vertices or edges, due to the smooth changes of the physical variables mentioned in Section 2.3.2. For example, the velocities and the pressure values of neighbours are likely to be very similar. This does not necessarily apply to all variables. Table 2.6 illustrates the varying impact of differential compression on the algorithmic data of the CFD application. The importance of compressing the algorithmic data depends on several factors:

• The ratio of algorithmic to connectivity data.

• The physical nature of each variable, which affects the efficiency of differential compression.

• The range of values that each variable may take, which allows defining suitable thresholds.


                         original (%)    compressed (%)
Vertex data
  coordinates            7.18            2.97
  volume                 2.39            1.09
  densities              4.79            1.76
  velocities             7.18            3.10
Edge data
  normals                47.07           22.32

Table 2.6: Differential compression variation in variables from the CFD application. Sizes given as a percentage of the original mesh.

                       SD    FC    DFC
coordinate X           32    27    15
coordinate Y           32    28    20
coordinate Z           32    28    20
normal X               64    33    25
normal Y               64    33    25
normal Z               64    33    25
volume                 64    36    20
density                32    32    24
density × energy       32    29    21
velocity X             32    32    24
velocity Y             32    32    24
velocity Z             32    32    24

SD. Mixed single/double precision
FC. Full-custom precision
DFC. Additional width for two-level compression

Table 2.7: Bit-widths for different quantization choices


Applications developed for fixed architectures (e.g. CPUs, GPGPUs) are restricted to a few data formats, such as single- or double-precision floating point. However, FPGAs allow the programmer to fine-tune the data widths. Table 2.7 introduces three choices for CPU and FPGA scenarios where compression has been applied. The original CPU version declares all the variables in double precision. An improvement over the original code is presented in column SD, where some variables are reduced from double to single precision. A second improvement is obtained by applying quantization techniques that determine the exact fixed-point data widths required to keep the precision and overall stability of the algorithm. The quantization results have been obtained by means of the fast quantization techniques presented by Caffarena et al. [CCLF10] and Sedano et al. [SLC12], and are shown under column FC. These quantization values are given as an input to the compressor and are not central to this work. Quantization is the only compaction technique that implies precision loss in the results. The rest of them (differential compression, coding scheme, etc.) are lossless compression techniques. In the case of our CFD algorithm, the quantization process has been driven by precision requirements imposed by the end users of the application on certain error measurement variables: 10^-4 in the drag and 10^-3 in the lift variables. Since these are global mesh variables, we have further imposed a maximum error of 10^-2 on each solution variable per vertex at each iteration. In general, modern quantization techniques make it possible to reduce the data size up to the precision loss allowed by the end user. The quantized and accelerated code is then validated against the original floating-point implementation. Finally, column DFC presents the quantization results for the two-level differential compression explained in Section 2.3.2.

Figure 2.26 compares the algorithmic data compression against the original code with full double precision. By using a hybrid single/double precision solution, the compression ratio is 0.78 on average. With full-custom precision as given in Table 2.7 (FC), this ratio is reduced to an average of 0.5. The two-level custom precision with differential compression further reduces the compression ratio by 8% for the 2D meshes and 3% in the case of 3D meshes. Surprisingly, BFsort only improves the differential compression by an additional 2% for 2D meshes, and 1% for 3D meshes. This limited improvement is due to the wide dynamic ranges of the variables in these datasets.

Overall compression

The total compression is the sum of connectivity and algorithmic data, accounting for the presence of each type of data in the mesh, plus minor low-level effects such as the padding in the streamed mesh. Figure 2.27 shows the compression with and without sorting as generated by the preprocessing stage.


Figure 2.26: Compression ratio of the algorithmic data

Figure 2.27: Overall compression results as a percentage of the original mesh. The first bar represents the original mesh, while the second bar represents the compressed mesh after applying the discussed techniques except for sorting. The last bar includes sorting.

The average compression ratio is 0.34. All meshes stay close to the average, although 3D meshes tend to deviate more than 2D meshes. Comparing the second and third bars of each dataset shows that, in addition to the intermediate buffer optimization discussed in Section 2.2.2, sorting improves the performance of the other compaction techniques by 3.3% and 1.8%. With a fixed quantization of 24 bits for the algorithmic data, Isenburg et al. [IG03] obtain an average CR of 0.11. The datasets presented in this work obtain worse results in terms of required bits per vertex, but the gap is smaller in CR since the connectivity is denser and the size in terms of vertices is less significant. Tie resolution also has an effect on compression: we have measured differences of 2% between the clockwise and random strategies. The compression results of both choices on the CFD NACA 0012 dataset are given in Table 2.8.

                  Random    Clockwise
Connectivity      0.13      0.10
Algorithm         0.41      0.41
TOTAL             0.33      0.31

Table 2.8: Compression ratio with two tie resolution strategies

Overhead analysis

The previous techniques introduce two main sources of overhead. The first is the extra time required by the preprocessing stage, which sorts and creates the mesh streams. According to Table 2.1, the most demanding mesh takes around 10 seconds to preprocess. The processing stage in CFD codes has a variable length, since it depends on convergence requirements that differ greatly across meshes, so it is hard to give the overhead as a percentage. However, 10 seconds is negligible compared to standard runs, where the algorithm iterates over the mesh thousands of times with wallclock times in the range of hours at best.

The second source of overhead is a result of the hardware implementation, which uses FPGA resources that cannot be used for the computational kernel. Tables 2.2 and 2.3 show that the resources used are very low compared to the available resources (3% or below). Another possible source of overhead is the initial transfer of the streams from the host to the on-board memory. This is not caused by the proposed techniques, but by the use of any kind of accelerator. In any case, accounting for current PCIe transfer speeds, this overhead is also negligible compared to the simulation runtimes.


2.5 Conclusions

In this Chapter, we have introduced the problem of memory performance in scientific codes that process unstructured meshes. We have also proposed several optimizations aimed at general-purpose processors as well as FPGA-based hardware accelerators.

First, we have presented a method to sort the mesh and improve reference locality. The reduced memory footprint obtained allows us to reduce cache misses. Our experiments show that this method can reduce the memory usage to a small percentage of the mesh, as opposed to the unpredictable behaviour of traditional memory-oblivious computing sequences. Any algorithm that operates on meshes will benefit from a BF-sorted mesh, as long as the dependencies between vertices are similar to nearest-neighbour. The effect of changing the processing sequence and the order of the mesh in memory has also been analyzed.

Also, seed selection has been identified as a problem worth solving in order to optimize memory performance. We approach the problem through two algorithms: center of mass, which offers a fast seed selection, and N-partition, which gives more accurate results, particularly with complex meshes.

Next, we propose a methodology to address the memory bottleneck in FPGA-based accelerators. Standard compaction techniques are combined with a mesh coding strategy that removes redundant connectivity information. Furthermore, sorting the mesh greatly improves spatial locality and the effects of coding and, to a lesser extent, compression. The methodology is applied to a heterogeneous system composed of a host CPU and an FPGA-based PCIe accelerator board. Together with the streaming approach and the sorting method, the traffic between the FPGA and the board memory is greatly reduced. This comes at the affordable cost of a minimal preprocessing stage at the host CPU and a small coding system to interface between the FPGA computational kernel and the board memory.

The results show that an efficient sequence is critical for the streaming approach to work. Of all the tested sorting methods, BFsort achieves the lowest memory footprint inside the FPGA. The sequence generated by BFsort also improves the efficiency of the coding strategy by a factor of two. This compression is sustained regardless of the target algorithm. Furthermore, it is also independent of the memory and FPGA technology, and is therefore able to sustain the improvements as technology advances.

Chapter 3

Refactoring of sequential codes for distributed systems

Porting a piece of code to a heterogeneous distributed architecture involves partitioning the code and mapping each partition to a processor. Here, a processor is defined as a hardware component with computational capabilities, such as a general-purpose processor, a GPGPU or an FPGA. Each processor is expected to perform differently, and the different parts of the code are supposed to be mapped to the processor where maximum performance can be achieved, while keeping the communications between processors as low as possible. After mapping, a refactoring stage rewrites the code and inserts communications so that the behaviour of the distributed code is equivalent to the original.

Extracting parallelism (partitioning, mapping) is a separate problem from modifying the code to run on a distributed-memory machine (refactoring). When refactoring is done by hand, the programmer uses a technique known as function outlining (or procedurization) that separates partitions into different functions and redesigns the control and data flow. In some parallel systems, functions are called between processors in a client-server fashion using a remote invocation framework. Some examples are ONC-RPC [Blo92], GridRPC [SNM+02], or CUDA and OpenCL kernels [SK10][GHK+12]. Control flow between functions is restricted to jumping to the callee (server) and returning control back to the caller (client). In contrast, compilers work with Control Flow Graphs (CFGs), which are divided into BBs. A BB is a chunk of code with single entry and exit points: if one instruction is executed, all other instructions in the BB are guaranteed to be executed.1 Control flow to/from a BB is less restrictive than to/from a function.

1We assume that exceptions are not allowed or can be handled in a different way. This is a reasonable assumption in high-performance codes, as exceptions are generally costly in terms of performance and therefore are typically avoided.

Function-based control flow restricts the applicability of automatic optimizations, especially when the original code contains large functions that require distribution among several processors. Although it is possible to redesign the control flow in an intelligent compiler, automatically dealing with non-canonical loops and improper regions becomes problematic. Section 3.3 further elaborates on function- and block-based control flow, explaining their properties and differences. In sum, a control flow mechanism that simulates the behaviour of BBs is likely to be more efficient than existing RPCs, both in terms of performance of the executables and compile time.

This Chapter introduces techniques to address the refactoring of sequential applications in order to distribute them over a parallel system. We present an algorithm for the compiler to partition CFGs. Subsequently, we introduce the concept of RBBs, an improvement over traditional RPCs adapted to the requirements of the internal representation of modern compilers in Static Single Assignment (SSA) form. RBBs allow the compiler to functionally partition the program into a concurrent application based on the parallelism information, with minimal effort and without seriously compromising performance. The contributions of this Chapter can be summarized as follows:

• RBBs, an alternative control flow mechanism to that of traditional RPCs, with flexible and peer-to-peer distributed control flow, suitable for automatic code refactoring.

• An algorithm for the compiler to split codes, represented as CFGs, into smaller portions of code based on partitioning information, which can be transformed into RBBs.

• A communications Application Programming Interface (API) and its CPU library implementation that allows RBBs to communicate with each other.

• Results showing that the proposed system can be used to generate parallel executables from sequential code with efficient communications to pass control flow information.

Section 3.1 gives a brief overview of previous related work and enumerates the key differences of our proposal. Section 3.2 briefly introduces the complete compilation toolchain that sets the context of our work. Section 3.3 discusses function and basic block control flow in the context of distributed systems and presents an algorithm to refactor the sequential code into separate modules that target different processors. Section 3.4 introduces the concept of RBBs, a mechanism similar to RPCs that is suited for compiler transformations and allows passing control flow in a distributed system.

Section 3.5 shows that RBBs obtain good communications performance, evaluates the efficiency of the current implementation and presents a case study with a standard benchmark. Finally, in Section 3.6 we summarize the work and contributions, and propose future work.

3.1 Related work

3.1.1 Code refactoring

Code refactoring has been used in the past for multiple reasons, such as simplifying legacy code or improving performance. Lakhotia and Deprez [LD98] restructure large, non-cohesive functions into smaller, cohesive ones to reduce code complexity. An algorithm, Tuck, is presented that works on the CFG of a program. Starting from a set of seed statements inside the function, the transformations generate functions where each seed statement is the entry of a smaller function. The algorithm has some similarities with the algorithm that we will present later: for example, we also identify starting points for our functions. However, we focus on CFGs in SSA form, and the function is split in terms of BBs rather than single statements. We also cluster the blocks together into smaller functions so as to minimize control communication rather than to simplify the code, which should result in fewer functions being generated.

Zhao and Amaral [ZA07] analyze the performance of oversized functions that cannot be easily inlined, identifying portions of cold code that do not affect performance significantly and extracting them away from the hot code and into small cold functions. The result is a set of smaller hot functions that are called more often than the cold ones, and that are suitable for inlining after the refactoring. The search for hot and cold code is done over conditional constructs such as if/then/else or switch statements, as these are good candidates for finding cold code in the branches that are not taken often. Therefore, the outlining stage avoids having to deal with loops and improper regions. Our refactoring algorithm allows any general outlining scheme as long as the BBs do not need to be split.

Liao et al. [LQVP10] also use outlining to break a program into small kernels for individual optimization. The work is based on the ROSE source-to-source compiler, which we will use in Chapter 4. The algorithm has several steps in common with ours, but the transformations are done over the Abstract Syntax Tree (AST) of a high-level programming language. Our algorithm works with the lower-level control and data flow graphs of LLVM's Intermediate Representation. This allows us to disregard some high-level details, such as classes or public/private encapsulation, which are already resolved by the front-end of the compiler.


3.1.2 Distributed control flow

Passing control between processes in distributed systems has been extensively studied. The seminal work by Birrell and Nelson [BN84] presents the first RPC implementation, defining call semantics such as argument-passing rules, binding between local and remote functions, and data packing/unpacking. A similar approach for object-oriented software is introduced by Wollrath et al. [WRW96]. A sophisticated version of RPC, Remote Method Invocation (RMI), is used to communicate objects, allowing the creation of object stubs and even implementing distributed reference-counting garbage collection.

Chang et al. [CCE99] show one of the first RPC implementations for Multiple Program Multiple Data (MPMD) distributed systems. Instead of implementing the system on top of message passing, they use Active Messages (AM). Because AM is restricted to Single Program Multiple Data (SPMD) programs, a minimal multithreaded runtime system extends AM with MPMD capabilities. Seymour et al. [SNM+02] apply the concept of RPCs to Grid Computing by adding asynchronous parallel tasking features to standard RPCs. Asynchronous tasks are also important for HPC, since they allow processes to continue working if there are no dependencies between tasks, instead of blocking to wait for the called remote function. Other popular RPC variants are Ninf-G [TNS+03] and Thrift [SAK07].

The approaches discussed so far are intended for manual development of distributed applications. Several steps are automated, such as the generation of stubs for remote process invocation or parameter marshalling and unmarshalling. However, using these frameworks requires manual insertion of library calls and analysis of the data transfers between processes, usually by means of the remote function parameters.

Bertrand et al. [BBB+06] propose a data and control distribution for a particular problem called MxN, intended for manual development. They define a small set of Parallel Remote Method Invocation (PRMI) types, decomposing the data structures differently across processors depending on the type. Each type of invocation depends on the communication requirements of the decomposed application, and having multiple types allows them to choose more efficient ones when possible, while still allowing more general communication styles with a performance handicap when the efficient ones cannot be used. Four challenges of PRMI are also presented, namely Delivery of arguments, Process participation, Concurrency and Parallel consistency. In our work, we handle the first one in the compiler, leaving the remaining three to the multi-threaded communication library.

Cooper et al. [CDD+10] extend the C language with an offload keyword that can be used to annotate functions for execution on different processors. This work focuses on the Cell BE processor [PAB+05], which features two different types of cores working together, and is therefore a heterogeneous system.

At compile time, the sections to be migrated are identified, as well as other sections that must be implicitly migrated, such as functions called from the offloaded functions themselves. The compiler also manages data movement between host and accelerator, and the insertion of start-up and clear-down code.

Another relevant contribution in the field of programming languages is the X10 programming language [CGS+05] and its concept of Asynchronous PGAS (APGAS). Its main contribution is the concept of activities and the introduction of an async keyword to create them. Activities are asynchronous threads laid out in a thread hierarchy so that parent threads can wait on a child but not vice-versa. This avoids deadlocks between threads waiting on each other to finish. Since the language is based on the concept of PGAS languages introduced in Section 1.2.4, data communications are inserted by the compiler as long as the programmer uses the PGAS language extensions to address data structures. This language still requires considerable knowledge and effort from the programmer, but the concept of asynchronous threads is heavily used in the solutions proposed in this Chapter. The X10 language is aimed at homogeneous systems. Our proposal is more automated in the sense that no language constructs must be manually added to partition the specification, and it adds data prefetching primitives to replace the semi-automated data communications offered by PGAS languages. Additionally, it can be retargeted to multiple architectures and heterogeneous systems.

3.2 High-level view of the compiler toolchain

The refactoring tool is part of a compiler built on top of LLVM [LA04], a well-known, production-ready compiler. The modified compiler is organised around a set of passes, which can be of two types: analysis or transform. Figure 3.1 shows the complete compilation flow. The input to the compiler is a high-level specification composed of many modules. In C, a module comprises one source file plus its included header files. Each module is processed by the front-end and translated into the LLVM intermediate representation (LLVM IR). Since our transformations operate on the IR, we can handle not only C but also other procedural languages such as Fortran by using the existing front-ends to translate the original code. An initial set of sequential optimizations is performed afterwards, so that the resulting code is tidier, simpler and more efficient than the original. All the processed modules are linked together into a single module that contains the complete sequential code, apart from external libraries.


(Flow: original serial code, Module 1 ... Module N -> front end -> LLVM optimization -> linked module -> profiling with lli and hardware analysis from the RSD file -> estimation -> partition & map -> refactor -> Module 1 ... Module M with communications -> back-ends 1..M -> executables 1..M: the output distributed application.)

Figure 3.1: Proposed compilation toolchain.

The hardware architecture is given in a Resource and Service Description (RSD) file [BRV99]. This file defines the nodes, interconnect and memory structures of the target distributed system and is parsed into a BGL graph [SLL01]. The architecture graph also contains parameters such as the clock frequency of each processor, the bandwidth of each communication link, or the size of each memory space. The RSD can be extended with different processor types and parameters, which simplifies the development of estimation models for new processors or the improvement of existing ones. A sample RSD file is shown in Figure 3.2.

There are three possible node types: “Comm”, “Memory” and “Device”. “Memory” and “Device” nodes can only have edges to “Comm” nodes. Communications are modeled as nodes instead of edges because the latter only allow point-to-point link modelling. The “Device” type models different processor subtypes. Figure 3.3 shows a graph representation of the system.


ROOTNODE compute{
    CONST NCPU = 2;

    NODE RAM{
        TYPE = "Memory";
        SIZE = 4;
        PORT to_FSB;
    };

    NODE FSB{
        TYPE = "Comm";
        SUBTYPE = "Bus";
        FREQ = 1.066;
        PORT to_RAM;
        for icpu = 0; icpu < NCPU; icpu = icpu + 1 do
            PORT to_CPU$(icpu);
        od
    };

    for icpu = 0; icpu < NCPU; icpu = icpu + 1 do
        NODE CPU_$(icpu){
            TYPE = "Device";
            SUBTYPE = "CPU";
            NCORE = 4;
            FREQ = 2.4;
            PORT to_FSB;
        };

        EDGE FSB_CPU_$(icpu){
            NODE CPU_$(icpu) PORT to_FSB <=> NODE FSB PORT to_CPU$(icpu);
        };
    od

    EDGE FSB_RAM{
        NODE RAM PORT to_FSB <=> NODE FSB PORT to_RAM;
    };
};

Figure 3.2: RSD description and graph representation of a machine with two quad-core processors, 2.4 GHz and 4 GB main memory.


(Nodes RAM, FSB, CPU_0 and CPU_1, connected by the edges FSB_RAM, FSB_CPU_0 and FSB_CPU_1.)

Figure 3.3: Graph representing the architecture defined by the RSD file shown in Figure 3.2.
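A minimal sketch of how such an architecture graph can be represented with the Boost Graph Library is given below. The property layout is an assumption made for illustration; the actual RSD parser may attach different attributes to the vertices and edges.

#include <boost/graph/adjacency_list.hpp>
#include <string>

struct NodeProps {
    std::string name;   // e.g. "RAM", "FSB", "CPU_0"
    std::string type;   // "Memory", "Comm" or "Device"
    double      freq;   // GHz; 0 if not applicable
};

using ArchGraph = boost::adjacency_list<boost::vecS, boost::vecS,
                                        boost::undirectedS, NodeProps>;

// Builds the graph of Figure 3.3: RAM and the two CPUs connect to the FSB.
ArchGraph buildExampleArchitecture() {
    ArchGraph g;
    auto ram  = boost::add_vertex(NodeProps{"RAM",   "Memory", 0.0},   g);
    auto fsb  = boost::add_vertex(NodeProps{"FSB",   "Comm",   1.066}, g);
    auto cpu0 = boost::add_vertex(NodeProps{"CPU_0", "Device", 2.4},   g);
    auto cpu1 = boost::add_vertex(NodeProps{"CPU_1", "Device", 2.4},   g);
    // "Memory" and "Device" nodes only have edges to "Comm" nodes.
    boost::add_edge(ram,  fsb, g);   // FSB_RAM
    boost::add_edge(cpu0, fsb, g);   // FSB_CPU_0
    boost::add_edge(cpu1, fsb, g);   // FSB_CPU_1
    return g;
}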

The linked module is the input to the estimation stage, which also reads the hardware specification and optional profiling information obtained by running data and functional profiling passes using the LLVM interpreter, lli. The profiling pass can collect runtime information, such as the number of times a particular branch is taken or the data size, for particular training datasets. A general configurable profiler is available as part of the LLVM infrastructure.

An initial estimation of the execution time and other metrics, based on the architecture graph and the profile information, must be passed to the partitioner/mapper. During partitioning, movements of code blocks between processors are simulated at different granularity levels, updating the estimations. At the end, the partitioner assigns each code portion to a partition, and the mapper assigns each partition to a processor. This stage of the compiler is not currently implemented, but partitioning and mapping are two well-known codesign problems.

So far, the linked module remains unchanged. It is the refactoring stage that effectively creates one new module for each processor, distributes the code, and inserts additional instructions so that the modules can be executed together as a single application. This Chapter focuses on the refactoring stage of the compiler. Finally, each output module is processed by its corresponding back-end, which outputs binary code for a specific processor. At present, back-ends exist for various processor architectures such as x86 or ARM, and experimental back-ends exist for various accelerators such as GPUs.

The following sections present the concepts and implementation details of a refactoring stage based on RBBs. At present, some stages of the abovementioned compilation flow are not fully automated. Hereafter, we assume that the mapping information is supplied to the refactoring stage.


3.3 Block outlining

The partitioning information is the input to the refactoring stage. At this point, each section of the code is marked for execution on one device. The code generator must refactor the code into one module per device, and the resulting modules must be able to link together and execute with the same results as the original code. The refactoring approach explained in the following paragraphs was introduced in [BCS+12].

The finest grain that our outliner can handle is the BB. In the control flow graph (CFG) of a function, each BB is represented by a vertex, and the directed edges between vertices represent control flow between BBs. Many control flow constructs exist, such as function calls, exceptions, loops, switches or gotos but, at the assembly level, control flow is always implemented by some kind of jump.

We want to turn BB-level control flow, as given by the CFG, into function-level control flow, represented by the function-call graph. Figure 3.4 shows the transformation steps. First, constant uses are tracked along the code, duplicating them in the modules where their user BBs are mapped. Next, global variables are distributed between the modules according to the partitioning information. Subsequently, each function is split up if necessary. Sections 3.3.1 and 3.3.2 elaborate on the splitting process. Finally, each module is written to a file, ready for back-end compilation.

1: duplicate constants
2: distribute globals
{Process each function in the original code}
3: for every original function f do
4:     initiatorList ← find initiators(f)
5:     create new functions(f, initiatorList)
6: end for
7: write modules

Figure 3.4: Code generation steps

The aim of our refactoring algorithm is to make control transfers between multiple devices easy. The traditional view is to consider devices other than general-purpose CPUs as accelerators. In this way, CPUs can handle code sections with complex control flow, while the accelerators can focus on pure computation. In our approach, every device is equally important: the model of communication is more similar to peer-to-peer networks than to client-server ones. This approach makes sharing of computational resources easier in a heterogeneous platform, as each resource is just a processor of a certain type.


As a comparison between Offload [CDD+10] and our refactoring tool, if a section A of the target code is marked as suitable for acceleration, Offload moves it from the main program to a separate function. If another section B is called from A, Offload automatically moves it to the same accelerated function. This decision removes the need to manage complex control flow between CPU and accelerator, but also forces the two sections to coexist on the same device, even if each section would execute more efficiently on a different device. In contrast, our approach allows different mappings for A and B based on performance estimations. It does so by splitting the original functions into several smaller functions, so that remote function calls can be requested between devices.

3.3.1 Initiator search

By definition, functions have a single entry point. When a function is split up, each of the new functions has an entry BB that corresponds to a BB in the original function, which may or may not be the original entry point. We call these BBs the initiators of the original function. New functions are initially generated by a control transfer between devices. In other words, if the partitioning results in a BB “a” mapped to device “A”, and this BB branches to another BB “b” mapped to device “B”, then “b” is an initiator. Initiators such as “b”, directly motivated by a control transfer between devices, are called direct initiators. An initiator may also be indirect, if it is required by previous initiators mapped to the same device. An example is given in Figure 3.5, where the partitioning on the left shows bold and regular BBs; the former are assigned to processor A, while the latter are assigned to B. The center figure depicts four direct initiators (those with bold solid frames) and one indirect initiator (with a bold dotted frame).

The entry block of the original function, entry, is always a direct initiator. Secondly, every BB with at least one predecessor in another device is also a direct initiator. In this example, for.cond.preheader and for.cond18.loopexit are direct initiators. These are different entry points into processor B, hence the need to refactor them into different functions. The dotted initiator for.body20 is indirect: although its two predecessors are mapped to the same device, they have different parent initiators, and thus belong to different functions. Hence, the dotted BB must result in a new function to honor the requirement of a single entry block per function.

Figure 3.6 shows the algorithm to find the initiators. This algorithm annotates the BBs so that the next stages of the code generation can assign the BBs to the new functions, as well as the required variables to be used inside them. It is also possible to construct the function prototypes from the initiator information. The algorithm first looks for the direct initiators. Subsequently, a BFS is started from each of them, marking each basic block with its parent initiator.


Figure 3.5: CFG of a function at different stages of the refactoring. The figure on the left shows the device partitioning, where the BBs in the highlighted boxes are mapped to one device and the BBs in the regular boxes are mapped to another. The center figure shows the resulting initiators, and the right figure shows the newly-partitioned functions.

The algorithmic complexity is high. Finding direct initiators first requires processing all the BBs of a function, although the number of predecessors for each of them is typically small (<10). Also, finding indirect initiators requires one partial BFS traversal per initiator, ending at the next initiator, and one additional BFS traversal for each indirect initiator encountered. However, this is only a problem in the case of very deep functions with complex control flow and a fragmented partitioning that results in many initiators, which rarely happens in HPC codes. It is the task of the partitioner, not the code generator, to prevent such complex scenarios.
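For reference, the direct-initiator phase can be written almost directly against the LLVM API; a minimal sketch (assuming a deviceOf callback that returns the device a basic block is mapped to, which in the real compiler comes from the partitioning information) is the following.

#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/CFG.h"
#include "llvm/IR/Function.h"
#include <set>

using namespace llvm;

// Direct initiators: the entry block plus every BB with at least one
// predecessor mapped to a different device (first phase of Figure 3.6).
template <typename MapFn>
std::set<BasicBlock *> findDirectInitiators(Function &F, MapFn deviceOf) {
  std::set<BasicBlock *> initiators;
  initiators.insert(&F.getEntryBlock());   // always a direct initiator
  for (BasicBlock &BB : F) {
    for (BasicBlock *Pred : predecessors(&BB)) {
      if (deviceOf(&BB) != deviceOf(Pred)) {
        initiators.insert(&BB);            // control arrives from another device
        break;
      }
    }
  }
  return initiators;
}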

3.3.2 Updating the control structures

Once the initiators are found, the new functions must be generated accordingly. One function is created for each initiator and inserted into the module corresponding to the device where its BBs are mapped. Then, the BBs are moved from the original to the new function. Subsequently, the new function prototypes must be created: if any of the moved BBs requires data computed at a predecessor BB that is under a different initiator, this information must be added as a function parameter.

If a former branch crosses the boundary between two functions, it must be turned into a function call. In order to be able to recover the parameter values that must be passed in a call to a new function, we maintain mappings from the original values to the newly created function parameters.


Ensure: initList is empty
{Find direct initiators}
1:  for all BBs in f do
2:      BB ← next BB in f
3:      for all predecessors of BB do
4:          predecessor ← next predecessor of BB
5:          if mapping(BB) ≠ mapping(predecessor) & BB not in initList then
6:              insert BB into initList
7:              break inner loop
8:          end if
9:      end for
10: end for
{Entry block is always initiator but has no predecessors}
11: entryBlk ← get entry block(f)
12: insert entryBlk into initList
{Find indirect initiators}
13: q ← initList
14: while q ≠ ∅ do
15:     init ← pop(q)
16:     root(init) ← init
        {First BFS: start breadth-first search from the initiator}
17:     for all descendents of init in BFS order do
18:         child ← next descendent(init)
19:         if child is initiator then
20:             break this BFS branch
21:         else if root(child) = ∅ then
22:             root(child) ← init
23:         else {another initiator previously visited child}
24:             insert child into initList
25:             push child into q
                {Old root must be revisited to update children}
26:             push root(child) into q
                {Second BFS: erase root for all children of child}
27:             for all descendents of child in BFS order do
28:                 child2 ← next descendent of child
29:                 if child2 is initiator then
30:                     break this BFS branch
31:                 else
32:                     root(child2) ← ∅
33:                 end if
34:             end for
35:         end if
36:     end for
37: end while

Figure 3.6: The initiator search finds the entry blocks for future functions and marks BBs with their root initiator.


[Figure: the outlined functions func( ), bb2(parm1) and bb3( ); block "2" of bb2( ) still contains the phi node phi("entry"->a, "2"->c, "4"->b) inherited from the original function.]

(a) The phi node in BB "2" is illegal, as there is no BB "entry" in the new bb2() function, and some of the variables it references are no longer available.

[Figure: the same functions after the fix; bb2( ) gains a new entry block "entry2" that jumps to "2", and the phi node becomes phi("entry2"->parm1, "2"->c).]

(b) A new entry block is created and the phi node is updated accordingly. Observe that the mapping from block "4" disappears from the phi node.

Figure 3.7: Updating a phi node.

In order to recover the parameter values that must be passed in a call to a new function, we maintain mappings from the original values to the newly created function parameters. Sometimes, at the start of a BB, there are special instructions called phi nodes that select variables depending on the parent BB that jumped into the current BB. A new function may have outdated phi nodes, where the possible values are now function parameters instead of values in previous BBs. In addition, phi nodes in an entry block are illegal. This is shown in Figure 3.7. We overcome these problems by inserting an empty BB as a new entry block and updating the phi nodes so that, if all of their values have been turned into one function parameter, the phi node disappears and the assignment is done directly. This is the case when all the BBs branching to the initiator are outside the new function. On the other hand, if there was a loop in the CFG, a BB inside the new function may still branch to the initiator, and the phi node is still needed to select between the value coming from that branching BB and the parameter.


The resulting phi node will have one value for each BB that branches to the initiator from inside the new function, and a single value for all the branches from outside the function. Note that the purpose of a phi node is to distinguish between values coming from different branches; when the new function is created, the branches from external BBs are turned into function calls. Each external caller calls the new function with a different parameter value, hence the need for only one value in the phi node to represent the parameter, instead of one value per external caller.
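As an illustration of this phi update, the following sketch uses the LLVM PHINode API to redirect the incoming edges that come from external BBs to the new entry block and its function parameter, removing the phi altogether when no internal edge remains. It is a simplified rendering of the idea, not the code of the actual pass:

#include "llvm/IR/Function.h"
#include "llvm/IR/Instructions.h"   // PHINode

using namespace llvm;

// Rewrite a phi node that was moved into a new function. NewEntry is the
// freshly inserted entry block, which is assumed to branch to the phi's
// block; Parm is the function parameter that carries the external value.
void updatePhi(PHINode &Phi, Function &NewF, BasicBlock *NewEntry,
               Argument *Parm) {
  bool HasExternal = false;
  for (unsigned I = 0; I < Phi.getNumIncomingValues(); ++I) {
    if (Phi.getIncomingBlock(I)->getParent() != &NewF) {
      if (!HasExternal) {
        // First external edge: the value now arrives through the parameter.
        Phi.setIncomingBlock(I, NewEntry);
        Phi.setIncomingValue(I, Parm);
        HasExternal = true;
      } else {
        // Collapse further external edges into the single parameter edge.
        Phi.removeIncomingValue(I--, /*DeletePHIIfEmpty=*/false);
      }
    }
  }
  if (Phi.getNumIncomingValues() == 1) {
    // Only the parameter edge is left: the phi is no longer needed.
    Phi.replaceAllUsesWith(Phi.getIncomingValue(0));
    Phi.eraseFromParent();
  }
}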

3.3.3 Data structures

Data partitioning is essential to obtain performance improvements in heterogeneous systems, all the more since the aim is to support distributed memory. For the sake of simplicity, we have assumed shared memory in this work in order to test the code generator for correctness.

Constants are copied to the modules where they are used. This is easy to implement by using the use-def chains provided by LLVM. For now, non-constant global variables are mapped to the initial device but, since the modules are linked together after partitioning (see Section 3.2), they can be made visible to other modules by marking them as external (similarly to the extern keyword in C). Internal data structures are passed by means of pointers as function parameters.

Our goal is to automatically partition data structures, so that we can assign portions of them to modules at compile time. Previous works such as [BSY+10] use centralized arbiters for data management, but we consider these solutions to be poorly scalable to distributed-memory machines. Another problem is that of pointers with distributed memory, where memory spaces are different on each device and thus pointer addresses are not preserved across the execution environment. Cooper et al. [CDD+10] further explain the problem and introduce the concept of outer pointers to automatically manage data transfers between devices. Automated data partitioning techniques for distributed systems are discussed in Chapter 4.
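For instance, the external marking of a non-constant global in a consumer module reduces to emitting a declaration with external linkage. The following is a minimal sketch of that step using the LLVM API, assuming the defining module keeps the original definition:

#include "llvm/IR/GlobalVariable.h"
#include "llvm/IR/Module.h"

using namespace llvm;

// Make a global defined in the initial device's module visible to a
// consumer module by emitting an external declaration (no initializer),
// analogous to the C 'extern' keyword. The linker resolves it later.
GlobalVariable *declareExternal(Module &Consumer, const GlobalVariable &Def) {
  return new GlobalVariable(Consumer, Def.getValueType(), Def.isConstant(),
                            GlobalValue::ExternalLinkage,
                            /*Initializer=*/nullptr, Def.getName());
}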

3.4 Remote Block Branches (RBBs)

The previously discussed outlining mechanism splits functions and places them in separate modules depending on the target processor. In this Section, we introduce the transformations and the runtime support required to make control flow between these functions behave in a similar manner to local control flow. In the same way that RPCs are the distributed equivalent to functions, we introduce a distributed equivalent to BBs that we call RBBs.


The original RPCs implement a client-server communication mechanism, where the client must wait for the return of the server. Newer implementations such as Seymour [SNM+02] or Tanaka [TNS+03] allow for asynchronous communications, so that the client may keep working while waiting for the server to finish. However, at some point, the client must synchronize with the server in order to receive its return value and check that the server function has finished. This limitation does not meet the requirements of BB control flow, since the server should not return control but pass it to another superblock. Since the behaviour of the functions generated by outlining is different from that of regular functions in terms of control flow, we refer to them as superblocks.

RBBs follow a peer-to-peer approach where the superblocks control the program for a limited amount of time. All the information is passed between the superblocks by means of either the RBB parameters or a parallel data flow system. The first is used to transfer virtual registers of the LLVM IR, which is in SSA form. Since a virtual register is only assigned a value once, all the parameters of a RBB are read-only and may be passed by value. In the case of references such as dynamically allocated memory or arrays, these cannot be passed by value (that is, the pointer value), since the memory spaces associated to different processes are not equivalent. We use prefetching techniques instead, where a superblock that produces data passes it to the superblocks that consume that data. Prefetching primitives also serve as synchronization points between producer and consumer superblocks. At present, prefetching must be inserted by the programmer, but it is our goal to automate it in the future with the help of pointer analysis, which is quickly evolving in the LLVM community. Chapter 4 discusses a more specific approach for the subset of programs discussed in Chapter 2.

When a basic block passes control to a remote superblock, it issues a RBB call and passes the virtual registers required by the following blocks as parameters. A key contribution of RBBs is that the calls are purely non-blocking: right after the RBB is issued, the call returns. No further synchronization is required: no RBB_Wait(), RBB_Probe(), etc. It is possible to implement such a system with standard RPCs; for instance, the Linux glibc (based on the Open Network Computing RPC (ONC-RPC) and consequently on [BN84]) allows the user to set timeouts in the client. If the timeout is zero, the call is non-blocking. Bloomer [Blo92] refers to this technique as "support for one-way batch and broadcast". However, the timeouts must be set server-wide, rather than locally for each remote function, and there is no tool for full automation of distributed programs that combines RPCs with compiler transformation passes. In addition, the performance is lower than VMRPC [CSS+13], and consequently at least an order of magnitude lower than the RBB implementation proposed in this Thesis, as per the results shown in Section 3.5.


func( ):
  "entry":  a = parm1 % 2
            ...
            cond = cmp parm1, 0
            branch cond, "1", "2"
  "2":      b = phi("entry"->0, "3"->c)
            c = b + a + 1
  "3":      cond2 = cmp c, 1000
            branch cond2, "exit", "2"

(a) Initial code

[Figure: the outlined code; func( ) keeps "entry" and ends with a caller block that issues RBB(bb2, {a}); bb2{ } contains block "2" and ends with RBB(bb3, {a, c}); bb3{ } contains block "3" and branches either to "exit" or to a caller block that issues RBB(bb2, {a}).]

(b) Outlined code with RBBs

Figure 3.8: Transformation of branch instructions into a distributed application with Remote Block Branches.


Figure 3.8 shows a program with three basic blocks that have been outlined and where RBBs have been inserted. When a RBB is issued, the caller specifies the identifier of the superblock to be activated and a list of the parameters corresponding to the virtual registers to be transferred. For example, the value of register a is handed over to bb2{} because it is used inside block 2, and is also passed to bb3{} because successive calls to bb2{} must receive its value again. In this example, all superblocks die after issuing a RBB, because this is their last instruction. For the rest of this Chapter, we use curly braces to name superblocks, similarly to the parentheses used in function names (e.g. a{} is a superblock, while a() is a function). If no parentheses or curly braces are appended to the name, the element is a plain basic block.

Deep CFGs often have long-lived virtual registers that must be passed along. LLVM maintains the list of all the uses of every register in structures called use-def and def-use chains. Some optimization passes analyze these chains (what is called a reaching-definitions analysis) and move assignments of virtual registers close to their uses, thus reducing their live ranges and therefore the number of parameters of the RBBs. However, Section 3.5 demonstrates that the performance of the current RBB implementation does not degrade significantly when the number of parameters increases.

When a superblock is callable both locally and remotely, the code can be replicated so that the local version is accessed using the traditional branching mechanism and the remote version is accessed by means of RBBs. This avoids the overheads introduced by the RBB implementation when invoking local code.

Chang et al. [CCE99] identify five primary issues when designing RPCs. The first is method name resolution. RPCs require costly distributed databases to register methods, and lookup tables to hide the latency of name resolution. Since, in the case of our compiler, we know the possible calling sites at compile time, the compiler is able to fill in static lookup tables at each processor's module. These tables allow for name resolution with very little overhead.

The second issue is argument marshalling. Our refactoring stage inserts all the parameter marshalling by analyzing the prototypes of functions and superblocks. In contrast to the dynamic methods used in typical RPC distributions, this approach results in lower runtime overhead with only a slight increase in the executable size.

Data transfer is the third issue. RPCs receive all the required data by means of their parameters. OpenCL and CUDA, which use evolved remote function mechanisms, also allow data prefetching that must be inserted by the application programmer. This is solved in part with the transfer of virtual registers as the RBB parameters. Additionally, the communication API has prefetching primitives that allow handling dynamic memory and static data.

Control transfer, the fourth issue, is the core of the RBB implementation itself: non-blocking and distributed. The last issue is message reception and dispatch. Traditional RPCs implement message transfer by means of polling threads or interrupts. The current RBB implementation defers message handling to the lower communication layer, currently MPI [The12]. Fault tolerance and security also depend on MPI.

The RBB implementation is divided into two components: the compiler and the communications library. In order to maintain high performance and compiler automation, we place most of the implementation (that is, the static decisions) inside the compiler instead of the runtime system. As a side effect, this also results in platform-independent LLVM IR code. Each module is then linked against a platform-specific communications library before back-end compilation.

3.4.1 Compiler passes

The refactoring stage is divided into three LLVM passes called RefactoringPass, CommMarkerPass and PartitionWriterPass. These are aided by two helper classes, PartitionInfo and CommunicationInfo, which store the temporary results between passes.

RefactoringPass performs superblock outlining according to the mappings supplied as an input. First, it finds the initiators, as explained in Section 3.3. Then, it creates the superblocks from these and transforms the old branches into explicit superblock calls. Afterwards, since the LLVM IR is an SSA language, the previous steps result in outdated phi nodes that are simplified or fixed appropriately. Finally, global variables are distributed and constants are copied to all the user modules. The output of this stage is the original code partitioned into a number of LLVM modules.

CommMarkerPass inserts the calls to the communication API. Specifically, startup and shutdown calls are inserted at the main function of each module. RBBs are inserted where each caller requests a remote function or passes control to a remote superblock. The compiler detects whether a return is required (that is, when the called superblock corresponds to the entry point of a function in the original code) and inserts return sends and receives. For each remote calling site, the pass creates a client stub with parameter marshalling code. Similarly, for each server function corresponding to a superblock, a server stub is generated with unmarshalling code. As the stubs are being generated, pointers to them are inserted into a static table at the server module, and the compiler identifies them with a number together with the process ID of the owning process. This allows for a very lightweight method resolution. Other approaches require a distributed database for registration and method lookup; here, the problem is solved at compile time. CommMarkerPass also inserts the program terminator: a unique program exit block to which all previous exit points are redirected.

Although LLVM does not require a unique exit block for the program, the RBB system does require one in order to orchestrate a clean shutdown. sendShutdownACK() is inserted into this unique block.

Function prototypes, server and client stubs, parameter marshalling and insertion of communication primitives are all done automatically by the compiler. This is achieved by traversing the use-def and def-use chains offered by the LLVM IR, allowing the refactoring pass to analyze the access patterns to virtual registers and elicit the required data transfers between superblocks. Other popular libraries and frameworks require, at the very least, specifying the prototypes of the functions, either by means of a protocol compiler such as rpcgen (e.g. ONC-RPC) or by manually outlining portions of code (e.g. CUDA, OpenCL). At present, the RBB system requires prefetching calls to be inserted by hand.

Finally, PartitionWriterPass instantiates a standard LLVM PrintModulePass for each module and writes out the code in LLVM IR format.
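To picture the kind of server-side code that CommMarkerPass emits, the following C++ sketch shows a server stub with unmarshalling for a single-parameter superblock and the compiler-filled static table used for name resolution. The names, the marshalling format and the handler signature are assumptions made for this sketch; the real output is LLVM IR rather than C++:

#include <cstdint>
#include <cstring>

// Assumed signature of a compiler-generated RBB handler: it receives the
// marshalled parameter buffer sent by the caller's client stub.
typedef void (*rbb_handler_t)(const uint8_t *buf, int len);

// Superblock produced by outlining (body omitted in this sketch).
void superblock_bb2(int64_t a) { (void)a; /* ... */ }

// Server stub emitted for bb2{}: unmarshal the single virtual register
// and invoke the local superblock.
static void stub_bb2(const uint8_t *buf, int len) {
  (void)len;
  int64_t a;
  std::memcpy(&a, buf, sizeof(a));
  superblock_bb2(a);
}

// Static name-resolution table filled in at compile time: an RBB message
// only needs to carry an index into this table, so no distributed
// registration or runtime lookup service is required.
static const rbb_handler_t rbb_table[] = {
  /* superblock id 0 */ stub_bb2,
  /* superblock id 1 */ nullptr,   // stub_bb3, etc.
};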

3.4.2 The runtime library

The communication API is a small set of functions inserted by the compiler into the target code. Each target processor architecture must have its own implementation of the API. In order to support functional partitioning into multiple processes, the communication system must support Multiple Program, Multiple Data (MPMD) execution. Many of the potentially suitable communication standards, such as Active Messages [vECGS92] or GASNet [Bon02], do not support MPMD. MPI, on the contrary, supports MPMD, has implementations for most of the existing architectures and is sufficiently high-level so that programmability is not an issue. Building on top of lower-level interfaces such as Infiniband [Pen02] is more efficient yet less portable. We have tested the communication library with OpenMPI [Ope] and MPICH [MPI] in a multicore system, and with MVAPICH [MVA] in a 16-core Infiniband cluster.

The implementation is intended to run in conditions where each process owns its processor completely. The processors run stripped-down kernels where multitasking is minimal and there is almost no contention for local resources, which is the typical case of a HPC cluster. The reason for this is performance; the generated executables will run correctly in other scenarios, but the performance will suffer due to the multiple threads required by the RBB runtime system. The implementation runs several threads inside each process as follows (a minimal listener sketch is given after the list):

• Main thread: This corresponds to the entry point of each process. Among all the processes running in different processors, only one serves as the real


entry point of the program. The other processes only call the main thread for startup and shutdown purposes.

• Listener: This thread is forked when the main thread calls the startup function. It listens for a termination signal, RBB requests and data prefetch messages coming from other processes. When a RBB request arrives, it receives the parameters and invokes the corresponding handler.

• RBB handlers: Each handler manages a superblock. A handler starts with a server stub that unmarshalls the parameters, then calls the local superblock. Handlers are invoked from the listener thread with the help of the compiler- generated static tables for name resolution.
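The listener logic can be pictured with the following minimal MPI sketch, which handles only RBB requests and the termination signal (prefetch messages are handled analogously). The two-message layout, the tags and the rbb_table are assumptions consistent with the description in this Chapter, not the actual implementation:

#include <mpi.h>
#include <cstdint>
#include <vector>

enum Tag { TAG_RBB = 1, TAG_PAYLOAD = 2, TAG_SHUTDOWN = 3 };  // assumed tags

typedef void (*rbb_handler_t)(const uint8_t *buf, int len);
extern const rbb_handler_t rbb_table[];     // filled in by the compiler

// Listener thread body: wait for incoming requests and dispatch them.
void listener(MPI_Comm comm) {
  for (;;) {
    MPI_Status st;
    MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &st);
    if (st.MPI_TAG == TAG_SHUTDOWN) {
      int dummy;
      MPI_Recv(&dummy, 1, MPI_INT, st.MPI_SOURCE, TAG_SHUTDOWN, comm,
               MPI_STATUS_IGNORE);
      break;                                // no superblock is running any more
    }
    // An RBB request: the header carries { superblock id, payload size }.
    int header[2];
    MPI_Recv(header, 2, MPI_INT, st.MPI_SOURCE, TAG_RBB, comm,
             MPI_STATUS_IGNORE);
    std::vector<uint8_t> params(header[1]);
    MPI_Recv(params.data(), header[1], MPI_BYTE, st.MPI_SOURCE, TAG_PAYLOAD,
             comm, MPI_STATUS_IGNORE);
    rbb_table[header[0]](params.data(), header[1]);   // invoke the handler
  }
}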

OpenMPI, MPICH and MVAPICH all have full multithreading support. We use the Pthreads library [BFN13] and compile the MPI distributions with thread support. We use Pthreads due to its flexibility and portability compared to other multithreading environments such as OpenMP [CJdP08] or Intel Threading Building Blocks [Rei10]. The standard communication API contains the following minimal functionality:

• startup: Initializes the runtime system.

• shutdown: Allows clean shutdown of the application and auxiliary runtime systems.

• sendShutdownACK: Triggers shutdown mechanism at the unique exit block.

• RBB: Request for Remote Block Branch.

• sendReturn: Send function return to a remote caller. This is inserted at the superblocks that correspond to exit points of the original functions.

• recvReturn: Wait for the return of a remote function. This is inserted at some point after a remote call has been made to a superblock that corre- sponds to the entry point of any of the original functions.

In addition to the RBB calls, we add support for data prefetching with prefetch, getPrefetch, freePrefetch and a higher-level prefetch2DArray, the latter of which allows prefetching parts of a 2D array. The combination of these prefetching functions allows data producer and consumer relationships between the superblocks: prefetch and getPrefetch are used for data transfer between the producer and the consumer and operate in a non-blocking manner, while freePrefetch signals that a received prefetch will not be used anymore. The latter can be used as a data-driven synchronization mechanism, since it manipulates a series of bits that validate or invalidate the reception of new prefetches.
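The intended producer/consumer use of these primitives can be sketched as follows. The thesis does not fix the prototypes of the prefetching functions, so the signatures below are only illustrative:

// Hypothetical signatures for the prefetching primitives.
void prefetch(int destProcess, int tag, const void *data, int bytes);
void *getPrefetch(int srcProcess, int tag);   // blocks until the data arrives
void freePrefetch(void *data);                // consumer is done with the data

// Producer superblock: push one block of results to the consumer.
void produceRow(const double *row, int n, int rowIdx) {
  prefetch(/*destProcess=*/1, /*tag=*/rowIdx, row, n * (int)sizeof(double));
}

// Consumer superblock: pull the block, use it, then release it so that
// the runtime can accept further prefetches (data-driven synchronization).
double consumeRow(int n, int rowIdx) {
  double *row = (double *)getPrefetch(/*srcProcess=*/0, /*tag=*/rowIdx);
  double acc = 0.0;
  for (int k = 0; k < n; ++k)
    acc += row[k];
  freePrefetch(row);
  return acc;
}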


[Diagram: startup, RBB call and shutdown sequence for two processes; each process starts MPI with MPI_THREAD_MULTIPLE, forks a joinable listener thread, exchanges RBBs, and terminates after the shutdown acknowledgement reaches all listeners.]

Figure 3.9: Startup, RBB call and shutdown process. A closed lock appears when a thread is put to sleep. An open lock represents the wakeup condition.

The resulting application is multi-process and multithreaded, and executes seamlessly in a distributed environment. Figure 3.9 illustrates the execution model for the case of an application partitioned into two processes. All processes call startup to initialize MPI. A private communicator is then created, where each process is assigned a new rank as its process ID. The ranks are preset by the compiler as a parameter to the startup calls when these are inserted. Finally, the listener thread is forked. The main thread must be able to wait on the listener's finish (a joinable thread, in Pthreads jargon, as opposed to a detached thread). The main thread of the entry process continues executing the application. At some point, a RBB will be issued to another process. The listener of the second process captures the request, looks for the function pointer of the requested superblock in the constant function table and calls it.

Only one of the superblocks in the program results in the end of the program. Before exiting, this superblock calls sendShutdownACK, which sends a signal to the listener of its own process. Once this listener exits the listening loop, it sends an MPI shutdown message to all other listeners and terminates. The termination signal implies that no other superblock is being executed. At some point, the main thread of each process calls shutdown. This function waits until the listener thread finishes, then shuts down MPI, tidies up all other structures, and terminates the process.

Using MPI together with Pthreads requires a careful analysis of the communication patterns between threads inside the same process.


We have used ParaProf and TAU [SM06] to profile and trace the tests and find hot spots. The ideal situation in the RBB system would be that the listener in a process only runs when there are no pending handlers (colloquially speaking, when there is no useful work). Isolated scheduling within a process is possible according to the Pthreads standard by starting the threads with PTHREAD_SCOPE_PROCESS. Unfortunately, Linux does not support this part of the standard, and throws a not-supported error if it is attempted. Another possibility would be to use the default PTHREAD_SCOPE_SYSTEM and give higher priorities to the handlers, but this affects the general OS scheduling and can lead to starvation in kernel threads. For that reason, Linux only allows changing thread priorities with root access, which is unreasonable in a shared HPC cluster, as it would introduce critical security issues.

We have therefore implemented a custom thread scheduling system embedded in the communication library. By using Pthreads condition variables, the RBB handlers always take precedence over the listener, except when they must wait for some event. This improves the performance of the library by more than an order of magnitude. Other operating systems, such as FreeBSD, fully support Pthreads scheduling, in particular PTHREAD_SCOPE_PROCESS, and therefore should not require the custom scheduler, although we have not tested this.
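The gist of this custom scheduler can be pictured with a condition-variable gate like the one below, assuming the handlers run in threads of their own: the listener only resumes probing when no handler is active. This is an illustration of the policy described above, not the library's actual code:

#include <pthread.h>

// Cooperative gate: the listener only proceeds when no RBB handler is
// active, giving handlers effective priority without touching OS-level
// thread priorities.
static pthread_mutex_t gate_mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  gate_cv  = PTHREAD_COND_INITIALIZER;
static int active_handlers = 0;

void handlerEnter() {
  pthread_mutex_lock(&gate_mtx);
  ++active_handlers;
  pthread_mutex_unlock(&gate_mtx);
}

void handlerExit() {
  pthread_mutex_lock(&gate_mtx);
  if (--active_handlers == 0)
    pthread_cond_signal(&gate_cv);    // wake the listener up
  pthread_mutex_unlock(&gate_mtx);
}

// Called by the listener before probing for new messages.
void listenerWaitForIdle() {
  pthread_mutex_lock(&gate_mtx);
  while (active_handlers > 0)
    pthread_cond_wait(&gate_cv, &gate_mtx);
  pthread_mutex_unlock(&gate_mtx);
}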

3.5 Results

The experiments with refactoring and the RBB system are split into three subsections. Subsection 3.5.1 tests the refactoring algorithm by using vectorization on a multicore system. At this point, no RBBs are used. Subsection 3.5.2 performs an initial evaluation of the RBB system with a set of synthetic benchmarks based on typical control flow constructs; it contains a general evaluation of the whole system, as well as a more detailed analysis of the communication library. Finally, Subsection 3.5.3 evaluates the system with a matrix multiplication kernel that is typically used as an HPC benchmark.

3.5.1 Results of refactoring with vectorization on a multicore

The correctness of the code generation has been checked with various test codes, both control- and data-intensive. The transformations required to insert function calls result in code whose behavior is identical to the original in all the tested benchmarks. On the other hand, using the LLVM tools to measure the execution time of the passes, the result is that between 40% and 57% of the compilation time,

not including profiling, is spent on code generation, with absolute numbers between 5 and 30 milliseconds depending on the benchmark. These are reasonable numbers; the aim of the heterogeneous compilation tool is to reduce development cycles that are measured in months.

This subsection does not attempt to prove the feasibility of speeding up the application; this is achieved by using the autovectorization capabilities of gcc/dragonegg. In order to investigate the overhead introduced by the extra function calls, the toolchain has been tested in the following scenario. Assume that the target architecture contains two processors, one of them with support for vector execution and another one with a faster memory access but no vector capabilities. Slowing down the memory access of the vector processor is necessary to force the partitioner to assign blocks to both devices; otherwise, all the code ends up in the SIMD device. For the purpose of this experiment, we have developed an estimator that feeds the partitioner with the partitioning decisions. As the real architecture does not have different memory features, the estimator is not accurate, but it serves the purpose of testing the code generator. In other words, we are simulating a heterogeneous architecture by using a homogeneous one and modeling parts of it differently.

The first stage is to link all the source files together into a single IR file. This file is then passed through dragonegg, and a vectorized IR file is obtained. After vectorization, the partitioner processes the vectorized IR and outputs two LLVM modules, labeled "CPU.part" and "CPU_SIMD.part". Although the whole IR has been vectorized by dragonegg, it is possible to select which modules to vectorize by using the regular x86 back-end or a SIMD-enabled one. Since all the used back-ends generate code for 64-bit x86 machines, the resulting assembled modules can be linked together into a single executable, where only some sections have been vectorized. There is no automated insertion of communications at this point, hence it is not possible to generate one executable per device. Also note that, in this experiment, communication between outlined functions is free apart from the function call overhead, since there is only one processor executing both modules.

The results of the experiment with four benchmarks are shown in Figure 3.10. The first benchmark, labeled ctrl, is a test program with heavy control- and data-intensive sections. arrayops executes a series of array operations similar to those encountered in linear algebra. The remaining benchmarks are computational kernels extracted from PolyBench [Pou]: adi is a stencil computation code, and floyd-warshall finds shortest paths in a weighted graph. Each of them has been compiled with five configurations. These are the original code (1), the original code vectorized by dragonegg (2), and the partitioned code in three variants: non-vectorized (3), fully vectorized (4), and partially vectorized (5), where only "CPU_SIMD.part" is vectorized. All the partitioned variants use the same estimator, but they are compiled with different back-ends as explained above.


[Bar chart: execution time in seconds of ctrl, arrayops, adi and floyd-warshall for the five configurations (1. original, 2. vectorized, 3. partitioned, 4. partitioned + full vectorization, 5. partitioned + partial vectorization).]

Figure 3.10: Performance of the partitioned codes with different combinations of partitioning and vectorization.

The measurements are taken from a round of 15 executions for each case, taking the median in order to minimize the effect of variability. The benchmarks experience speedups between 1.08x and 1.9x from vectorization. Also, two of the benchmarks improve execution time by 5% between the original (1) and the non-vectorized partitioned code (3). The same applies to the vectorized original (2) and the vectorized partitioned code (4). This implies that the generated code is more efficient, even though we initially suspected that the function calls would add some overhead. An in-depth examination of the generated assembly code unveiled that the back-end optimizations benefit from phi node simplification and the addition of BBs, which together give more margin for instruction scheduling. In the case of arrayops, the code refactoring allowed the LLVM hoisting pass to move four instructions out of a loop with a high trip count, resulting in millions of instructions saved at run time. In the case of the floyd-warshall benchmark, the partial vectorization results (5) indicate that the partitioner does not map all the vectorizable code to the SIMD processor: it performs as poorly as the non-vectorized code (3). This is partly due to the poor estimation provided.

3.5.2 RBBs on synthetic benchmarks

We have presented an automated distributed control flow compiler based on RBBs with two components: a set of compiler passes and a communication runtime

library. This Section evaluates the performance of both. First, we measure the improvement of the RBB refactoring as compared to a traditional RPC-style refactoring. Next, we compare the performance of the RBB communication library against the MVAPICH library in order to find the overhead of our RBB implementation.

The platform used for benchmarking is a three-node cluster managed by a Rocks Mamba 6.0 cluster operating system [PKB03]. One node acts as the front-end for interactive work, and the remaining two nodes make up the compute back-end. Each back-end node has two quad-core Intel Xeon processors at 1.86 GHz with access to 32 GB of shared memory. The available interconnects are 1GbE connecting both nodes to the front-end of the cluster and a 40 Gb Infiniband network private to the back-end. The computing power adds up to 64 GB of distributed memory and 16 computing cores.

Overall performance of the RBB system

Figures 3.11, 3.12 and 3.13 show the CFG and the timeline of the three synthetic benchmarks used for evaluation (seq, nested, improper). Each of the figures contains a first subfigure that shows the original CFG; the next subfigures represent the timeline, where the execution starts at the top and ends at the bottom. Although these benchmarks have been coded by hand for the purpose of benchmarking the RBBs, they are representative of the control flow structures present in common HPC applications, and they are considered the minimal building blocks of control flow [Muc97].

[Figure panels: (a) Original CFG; (b) RPC style and primitives; (c) RBB style, RPC primitives; (d) Full RBB style and primitives.]

Figure 3.11: seq benchmark: three basic blocks in sequence.


[Figure panels: (a) Original CFG; (b) RPC style and primitives; (c) RBB style, RPC primitives; (d) Full RBB style and primitives.]

Figure 3.12: nested benchmark: two nested loops.


[Figure panels: (a) Original CFG; (b) RPC style and primitives; (c) RBB style, RPC primitives; (d) Full RBB style and primitives.]

Figure 3.13: improper benchmark: an improper region consisting of a single loop with two possible entry and exit points.


Each benchmark has been distributed with three different techniques. The first is a pure master-slave RPC approach, where the master process assumes control of the application and calls other processes remotely. This is widely used in systems with hardware accelerators, where the host processor assumes control flow management and requests slave functions from the accelerators.

The second is a hybrid technique where the CFGs are rearranged in a similar way to our RBB refactoring pass. However, the communications are still implemented via RPC. The results of these implementations give an idea of the improvement of the peer-to-peer approach over a master-slave one. Note in Figure 3.13 that improper regions must be removed by replicating blocks 2 and 3. This is a standard procedure explained in [Muc97]. The if...else constructs in the Figure select one of the two overlapped functions (e.g. f4() and f5() in Figure 3.13c).

The last technique is a pure peer-to-peer, non-blocking RBB implementation, obtained by compiling the C source code with the LLVM compiler and the RBB passes, and linking against the RBB communication library. As opposed to the first two techniques, where the benchmarks have been manually implemented, the benchmarks implemented with the full RBB approach are generated by the compiler.

Table 3.1 gives the timings of these benchmarks for the case of intra-node communications, where all the processes are mapped to the same compute node of the target architecture, and the case where communications are conducted over a system-wide network. An additional benchmark (seq + vregs) is included, where the first benchmark is extended so that data from virtual registers is communicated between superblocks, resulting in 256 bytes of parameters to be marshalled and transmitted with each RBB. For each experiment, five measures have been taken and the median used as the final number reported.

The intra-node hybrid RPC-RBB versions of the benchmarks save, respectively, 3%, 9%, 51% and 1% of the RPC execution time. The 9% improvement of the sequential program with virtual register transfers is explained by the fact that, sometimes, the registers have to be transferred deep down the CFG path. In a pure master-slave approach, they are transferred at each RPC return to the master and again to each slave in case the next remote superblock (or remote function) requires them. In the hybrid approach, all the RPCs corresponding to remote superblocks have empty returns, unlike remote functions. This saves communication costs.

The 51% improvement of the nested benchmark is explained by the extra RPC required at each inner iteration of the master-slave program. As basic blocks 2 and 3 are located in the same partition, the peer-to-peer approach simulates the control flow by making the call to f5() a tail call, so that the return of f3() results in the program arriving at the for loop located in f2().


                    Intra-node                     Over the network
              RPC      RPC-RBB    RBB        RPC      RPC-RBB    RBB
seq           0.844    0.821      0.041      1.642    1.573      0.126
seq + vregs   0.917    0.839      0.052      1.837    1.633      0.188
nested        0.835    0.411      0.041      1.536    0.689      0.132
improper      0.420    0.416      0.057      0.759    0.727      0.142

Table 3.1: Benchmark timings (in seconds) with three implementation choices and different locality of communications.

Since f3() is called intensively and does not have to be called in the hybrid version, the performance is greatly improved. These results prove that a peer-to-peer approach is generally not slower than a master-slave one and, in some cases, it is far more efficient.

The extra speedups of the full RBB versions range between 7x and 20x. Apart from peer-to-peer control flow, the MPI-based RBB library takes advantage of the QDR Infiniband available and an implementation with very low overhead on top of MPI. This results in lower latency compared to the 1GbE network used by the glibc RPC mechanism. The results of the benchmarks distributed over different nodes are obviously handicapped by the network performance, but lead to similar conclusions (speedups between 5x and 13x).

Detailed performance of the communication library

The communication library has been implemented over MPI due to its unparalleled advantages in terms of performance, portability, flexibility (at the cost of ease of use) and variety of implementations. This comparative study seeks to demonstrate the scalability of the communication library with the size of the system.

Figure 3.14 compares the startup and shutdown time of the RBB system against MPI_Init() and MPI_Finalize(). The overhead for the first 2 to 8 processes is close to zero. The step between 8 and 9 processes is due to the use of two nodes (with 8 cores per node), between which a network must be used that is slower than the shared memory. Most of the extra activities required by the RBB system, such as creating the extra communicator and starting up the listener threads, do not incur significant overhead. However, from 9 processes onwards, the RBB system starts deviating from MVAPICH. This is because, at shutdown, the listener thread of the terminator process must send one termination message to the listeners of all other processes. This broadcasting behaviour could be improved by implementing the notifications efficiently in a tree (although this might already be part of the MPI implementation of MPI_Bcast()), or clustered by groups of processes residing in a single node.

This has not yet been implemented, but an in-depth analysis should be done if the approach is to be used on machines with high processor counts.

[Plot: combined program time in seconds versus number of processes (2 to 16) for the RBB system and MVAPICH.]

Figure 3.14: Combined startup and shutdown time of the RBB system compared to the MVAPICH MPI distribution.

Figure 3.15 shows the single RBB execution time with CFGs of different depth. We compare against two synchronous MPI calls, instead of one, because each RBB is currently implemented with two MPI messages. The results show that the extra activities required by an RBB, including memory allocation for the parameters, marshalling / unmarshalling and lookup of the requested superblock, are negligible compared to the cost of the raw MPI communications involved.
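For reference, the caller side of such a two-message RBB could look like the sketch below: a fixed-size header with the superblock identifier followed by the marshalled-parameter payload. The tags and the message layout are assumptions, matching the listener sketch of Section 3.4.2:

#include <mpi.h>
#include <cstdint>

// Client-side RBB: issue the request and return immediately; no wait,
// probe or return value is expected afterwards. A real implementation
// may use non-blocking sends instead of MPI_Send.
void RBB(MPI_Comm comm, int destProcess, int superblockId,
         const uint8_t *params, int paramBytes) {
  int header[2] = { superblockId, paramBytes };
  MPI_Send(header, 2, MPI_INT, destProcess, /*TAG_RBB=*/1, comm);
  MPI_Send(params, paramBytes, MPI_BYTE, destProcess, /*TAG_PAYLOAD=*/2, comm);
}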

[Plot: RBB time in microseconds versus number of processes (2 to 16) for a Remote Block Branch with no parameters and for two MPI synchronous sends.]

Figure 3.15: Performance of chained Remote Block Branches as generated by a loop.

Figure 3.15 runs the same experiment as that reported in the original VMRPC paper [CSS+13].


[Plot: RBB time in microseconds versus parameter size (0 to 250 bytes) for RBBs over shared memory and over Infiniband.]

Figure 3.16: RBB performance variation with different parameter sizes.

The latency obtained with our RBB implementation for an intra-node null RPC is 2.137 µs. This is a speedup of 7x over the VMRPC measures. Also, a RBB with a 256-byte message takes 2.470 µs. We can apply a simple data transfer model, t = L + S/B, where t is the message transfer time, L is the latency, S is the message size and B is the bandwidth in MB/s. From this model and the measures for a 100-byte RBB and a null RBB, we obtain the maximum throughput, which is 769 MB/s. The best results of VMRPC reach 70 MB/s; that is, RBBs obtain a speedup above 10x. In VMRPC, several software layers increase the communication overhead.

Finally, Figure 3.16 shows the variation of RBB performance with the size of the RBB parameters, that is, the size of the virtual registers to be transferred. Up to 250 bytes of parameters, the performance of the RBB does not degrade significantly. Real CFGs do not usually propagate the use of their virtual registers deep along the basic block chain, so the parameter list of the superblocks is likely to remain small enough for the cost of an RBB to be affordable. The scalability of the implemented communication library is bound to the scalability of MPI, which is known to scale reasonably well to thousands of processors.
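As a sanity check on the model above, plugging in the null-RBB latency as L together with the 256-byte measurement quoted earlier reproduces the reported throughput:

\[ B = \frac{S}{t - L} = \frac{256\ \text{bytes}}{(2.470 - 2.137)\ \mu\text{s}} \approx 769\ \text{MB/s} \]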

3.5.3 2-dimensional matrix multiplication

Together with data prefetching, the RBB system is able to implement functional and data parallelism in distributed machines. By calling RBBs concurrently, many superblocks can work in parallel if there are no data dependencies. If dependencies are present, the prefetching mechanism automatically synchronizes the threads based on the data flow. Also, data parallelism is easy to generate by replicating superblocks and distributing data among them with prefetching primitives.


This section applies RBBs to the 2-dimensional matrix multiplication kernel shown in Figure 3.17, which is part of the PolyBench [Pou] benchmark suite. The kernel calculates the formula D = alpha × A × B × C + beta × D, where we define A, B, C and D as square matrices to simplify the analysis of the results. In order to analyze performance depending on the computational load, we have added the loop in line 5, which repeats the calculation a parameterized number of times, WL_MULTIPLIER.

1   for (i = 0; i < ni; i++){
2     for (j = 0; j < nj; j++){
3       tmp[i][j] = 0;
4       for (k = 0; k < nk; ++k){
5         for (wl = 0; wl < WL_MULTIPLIER; wl++)
6           wlTmp = alpha * A[i][k] * B[k][j];
7         tmp[i][j] += wlTmp;
        } } }
8   for (i = 0; i < ni; i++)
9     for (j = 0; j < nl; j++){
10      D[i][j] *= beta;
11      for (k = 0; k < nj; ++k)
12        D[i][j] += tmp[i][k] * C[k][j];
      }

Figure 3.17: 2mm kernel in C.

The kernel has two main loops (lines 1 and 8) with subsequent nested loops. The compiler has been instructed to map the first loop onto one processor and the second onto another processor. This decision generates two remote RBB calls: one from the main function to the first loop, and another from the end of the first loop to the header of the second one. RBBs allow both superblocks to start almost at the same time. Since the tmp[][] array imposes data dependencies, prefetching primitives have been added manually. The result is that the second superblock consumes the values of tmp that the first superblock produces in a pipelined manner.

This is only one of many types of parallelism that can be exploited with RBBs. Thus, an important question is whether the RBB system with prefetching is able to hide the extra communication costs in the distributed program.
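Schematically, the partitioned kernel behaves like the sketch below: the first loop nest (process 0) pushes each completed row of tmp to the second loop nest (process 1), which consumes the rows as they arrive. The prefetching signatures are the same hypothetical ones used in the sketch of Section 3.4.2, and the workload loop is omitted for brevity; the actual partitioned code is generated in LLVM IR by the compiler:

enum { N = 1000 };   // square matrices, as in the largest experiment

// Hypothetical prefetching primitives (see Section 3.4.2).
void prefetch(int destProcess, int tag, const void *data, int bytes);
void *getPrefetch(int srcProcess, int tag);
void freePrefetch(void *data);

// Superblock on process 0: first loop nest of 2mm, producing tmp by rows.
void firstLoop(double alpha, double (*A)[N], double (*B)[N],
               double (*tmp)[N]) {
  for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
      tmp[i][j] = 0.0;
      for (int k = 0; k < N; ++k)
        tmp[i][j] += alpha * A[i][k] * B[k][j];
    }
    prefetch(/*destProcess=*/1, /*tag=*/i, tmp[i], N * (int)sizeof(double));
  }
  // Control transfer between the superblocks is done with the
  // compiler-inserted RBB calls (not shown).
}

// Superblock on process 1: second loop nest, consuming rows of tmp as
// they arrive, so both processes work in a pipelined fashion.
void secondLoop(double beta, double (*C)[N], double (*D)[N]) {
  for (int i = 0; i < N; i++) {
    double *row = (double *)getPrefetch(/*srcProcess=*/0, /*tag=*/i);
    for (int j = 0; j < N; j++) {
      D[i][j] *= beta;
      for (int k = 0; k < N; ++k)
        D[i][j] += row[k] * C[k][j];
    }
    freePrefetch(row);
  }
}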

Order    Original (s)    Partitioned (s)    Speedup
10       4.6 × 10⁻⁵      6.0 × 10⁻³         0.08
100      0.018           0.052              0.33
1000     44.061          58.706             0.75

Table 3.2: Timings of the 2mm original and partitioned code with different data sizes, and calculated overhead.

The results in Table 3.2 show the wallclock times of the original (sequential) and partitioned programs for different data sizes, and the resulting speedup of the latter against the former. Speedups below 1 represent an overhead. All three data sizes result in a slowdown because we have introduced no data parallelism; that is, we are measuring the overhead of the RBB system. These results are due to the fact that the analyzed kernel has a very low computational load compared to the data to be transferred between superblocks. However, it is clear that a higher data payload results in the RBB/prefetch system being more efficient, that is, hiding the transfer latencies more effectively.

Figure 3.18 shows the speedup of the partitioned and parallelized program against the original code as a function of the workload multiplier. Since the 2mm kernel has a very low computational load as is, the experiments with the workload multiplier set to 1 are dominated by the communications, leading to slowdowns. For the small dataset, it is necessary to raise the multiplier to 2000 for a speedup of 1 (no slowdown), while the bigger datasets only require a factor between 10 and 20 to achieve the same result. That is, the prefetching and parallelization in this example counteract the communication cost for large datasets and a kernel with enough computational load. This is usually the case, for example, in high-performance simulation codes, where taking advantage of Gustafson's law is critical.

While the lines for the 10 × 10 and 100 × 100 matrices have a similar shape, the 1000 × 1000 matrix shows a more gradual slope. This is due to the fact that the datasets in the first two cases fit in the 2048 KB L2 cache (the 100 × 100 matrices add up to roughly 400 KB), whereas the bigger case takes 40 MB of main memory space. This closes the gap between communication and memory access times, thus reducing the difference between the original code, which only performs memory accesses, and the parallelized and partitioned one, which performs both memory accesses and communications.
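Making the cache argument explicit (counting the five double-precision arrays A, B, C, D and tmp; this breakdown is ours, not stated in the text):

\[ 5 \times 100 \times 100 \times 8\ \text{bytes} \approx 400\ \text{KB} < 2048\ \text{KB}, \qquad 5 \times 1000 \times 1000 \times 8\ \text{bytes} = 40\ \text{MB}. \]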


[Plot: speedup of the partitioned 2mm kernel versus WL_MULTIPLIER (10⁰ to 10⁶) for matrix orders 10, 100 and 1000.]

Figure 3.18: Performance of the 2mm kernel with RBBs

3.6 Conclusions

This Chapter has introduced a refactoring stage within a compiler toolchain to improve automated partitioning in heterogeneous HPC environments. In particular, we present a novel code generation technique based on function outlining. The approach is general enough to suit a wide variety of devices such as GPUs or FPGAs, provided that an appropriate LLVM backend is available. In some cases, the compiler generates more efficient code due to outlining small blocks of code, counteracting the negative effects of the increase in function calls.

We combine these LLVM analysis and transformation passes over the CFG with a high-performance communication library. The current implementation of the communication library is based on MPI, but other implementations should be possible. Regardless of the speeds obtained by Infiniband, building on top of MPI also ensures portability, base performance and scalability. The resulting applications are distributed over the target architecture according to the partitioning information.

The RBB concept borrows some of its ideas from RPCs, a long-time standard. However, the results obtained indicate that RBBs are more efficient for compiler-based automated generation of distributed applications. While virtually all existing work targets shared-memory systems, our work proposes a general solution for distributed memory, effectively hiding the typical latency of interprocess communication when the data size and the computational intensity of the kernel are large enough.

3.6.1 Future work

We have focused on the code generation, and therefore the toolchain is not complete. In order to create a fully working toolchain, integrating the partitioning

and RBB approach with a functional partitioning step is the major missing piece. Many such algorithms have already been published in the field of codesign for embedded systems.

RBBs are a low-level concept upon which many types of parallelism can be built. Implementing a streaming and pipelining approach on top of RBBs would show the benefits for systems based on FPGAs.


Chapter 4

Automated pipelining with the inspector-executor paradigm

The previous Chapter analyzed control transfer between different parts of an application that were assigned to different processors. A refactoring algorithm and a communication library were introduced so that communications for distributed control flow could be automatically inserted by a parallelizing compiler. This Chapter investigates the challenge of automated insertion of communications, this time from a data transfer standpoint. The following Sections investigate how to split and communicate the data between processes that run different parts of the code.

HPC systems with distributed memory typically perform communications between their processors by means of message passing, which is usually implemented in parallel applications by means of a communication library. In order to parallelize an existing application, calls to the library must be manually inserted by the developer, resulting in a tedious refactoring process. Sometimes, even a complete redesign of the application is required. No parallelism pattern is suitable for all applications, and in the past years there have been numerous efforts to classify forms of parallelism. One of the most famous attempts is the work by Asanovic et al. [ADK+09], sometimes known as the Berkeley dwarfs or motifs. In this work, a study of classic scientific and industrial applications is conducted, resulting in classifications according to the data structures used and the nature of the problem to be solved. Other sub-classifications have appeared, such as the Algorithmic Species by Nugteren et al. [NCC13], which categorize codes with affine loops according to the polyhedral model.

Automated extraction of parallelism is a complex problem and many solutions have been attempted in the past. A general solution might even require implementing artificial intelligence techniques in the parallelization tools in order to identify parallel patterns in the target code.

1:  {Time iteration loop}
2:  while not end condition() do
3:    {Loop traversing a mesh by vertices}
4:    for firstVertex to lastVertex do
5:      calculate1()
6:    end for
7:    {Loop traversing a mesh by edges}
8:    for firstEdge to lastEdge do
9:      calculate2()
10:   end for
11:   {Other loops traversing the mesh}
12:   ...
13: end while

Figure 4.1: Pseudocode structure of the codes that can be considered for parallelization.

However, as usual, constraining the scope to a small category of codes allows more specific techniques to be applied. For example, polyhedral tools such as Polly [GGL12] or Graphite [PCB+06] are capable of automatically finding a suitable loop schedule for parallelization, where different iterations of the same loop can run concurrently. In this case, the approach requires that the application contains affine loops, and other parts of the code are left unoptimized.

We restrict the work in this Chapter to codes using irregular data structures that implement algorithms as shown in Figure 4.1. A relevant example of such codes is the CFD code analyzed in Chapter 2. The layout of these algorithms is as follows:

• An outer or “main” loop iterates until some condition is met.

• Several inner loops iterate over the data structure of interest.

The described structure is very common in simulations dealing with unstructured meshes and, generally, in many other scientific simulations. In the simplest case, the condition of the main loop is a count for a certain number of iterations. In the case of physical simulations such as CFD codes, however, the exit condition is usually a function that calculates whether the simulation has reached a stable state. The outer loop is also colloquially referred to as the time loop, and each iteration is referred to as a timestep. Note that this does not necessarily represent time physically, but rather the evolution of the simulation along a timeline. In the case of CFD for aeronautics, the data structure traversed by the inner loops is the unstructured mesh that represents the fluid surrounding an airfoil, such as a wing or the fuselage of an airplane.


A classic approach to parallelizing the code above is to first partition the data structures with a partitioning software tool such as METIS [LPS+15][KK98] or Zoltan [CBD+07]. The code is then replicated in several processes, and each data partition is assigned to one of these processes. Depending on the algorithm of interest, the boundaries between partitions may introduce communication requirements, for example from a process that writes data in the boundaries to processes that only read that data.

Another option is to assign the inner loops to different processes so that they can operate in a pipelined fashion. While data partitioning is very common in HPC, pipelining different loops onto several processors has seldom been investigated. Being able to assign different parts of the code to different processes is of vital importance in heterogeneous systems, especially for those with FPGAs: in many cases, the area constraints in these accelerators make it impossible to map the whole computational kernel onto one of them. Also, the relatively low memory performance of these devices implies that the memory-intensive loops may perform better in a general-purpose processor.

This Chapter analyzes the problem of automatically assigning the loops to different processors and inserting the required data communications so that they operate in a pipelined fashion. A pipeline communication library has been designed and developed so that it can be automatically inserted at compile time, even for codes that use unstructured datasets, which can only be fully analyzed at runtime. To the best of our knowledge, this is the first attempt to automate the parallelization of sequences of loops by means of a pipeline. As such, and due to the complexity of the task, the main goal of this work is to demonstrate the feasibility of the approach, thus starting a research line in the parallelization of sequences of loops through pipelining.

The rest of the Chapter is organized as follows. Section 4.1 introduces previous works that are foundational to this Chapter. Section 4.2 introduces the communications library, as well as the modifications that have been performed in an existing compiler to allow parallelization in two dimensions: code and data. Section 4.3 analyzes the performance of a scientific code after such parallelization. Finally, Section 4.4 presents some concluding remarks taken from the experience of building the library and the results obtained.

4.1 Background and Related Work

The first relevant investigations on parallelizing compilers were undertaken around the 1990s [HAA+96][BCG+95][WDS+95][AFMP95][Tse89].

Nowadays, the complexity of manual optimizations and the increasing parallel opportunities in heterogeneous, multi-core, massively-parallel architectures have motivated a growing body of research in automated parallelization tools. A survey of the most important compiler automation works for procedural languages has already been presented in Section 1.2.4. A relevant example for this Chapter is the work by Scandolo et al. [SKH09], where the concept of Synchronized Pipelining is introduced to analyze the dependencies between different loops and determine which iterations of both loops can be run in parallel. Similarly to this Chapter, the approach is targeted at the parallelization of sequences of loops. Although the given implementation assumes shared memory and, contrary to our work, does not seem to consider non-affine accesses, the mathematical background allows parallelization of sequences of loops in a wide range of applications. It is also able to handle nested loops, which we do not consider.

The Scala programming language [Scaa] introduces the concept of parallel collections to allow a certain degree of parallelization without user intervention. The language allows applying typical parallel operations such as map, fold or filter to these data structures, resulting in an automated parallelization. Graph for Scala [Scab] is a library targeting general graphs that provides parallel collections for graphs. Although this is a step forward in the search for automated parallelization for graphs, the functionality out of the box is limited, requiring manual intervention to parallelize custom codes based on unstructured meshes, hence the difficulty of comparing that approach with the present work.

4.1.1 Skeletons

Because the data dependencies generated by unstructured meshes cannot be fully characterized at compile time, the compiler can only perform partial optimizations on them. It is necessary to add further steps at run time in order to obtain a full picture of the data dependencies. One popular way to analyze properties of programs is by means of skeletons. A skeleton is generated from a piece of code by stripping parts of it in order to make it simpler and faster to run. Skeletons are useful when it is necessary to study certain properties of the code that can only be measured at run time; therefore, those parts of the code that do not affect such properties are removed, resulting in a lightweight version of the application. We use the concept of skeletons extensively throughout this Chapter.

Skeletons are not new in Computer Science. Sodhi et al. [SSX08] use skeletons to estimate application performance in grid computing. In this case, they define five types of operations that affect performance: CPU, memory, I/O, communication and synchronization and, finally, application phase transitions. Similarly to our approach, the skeleton is generated automatically. However, instead of a compiler, an ad-hoc generator is used that collects traces of the original program with the help of a profiling library, and then selects the operations of interest. This skeleton can then be used to estimate the performance of the program when

100 Chapter 4. Automated pipelining with the inspector-executor paradigm the platform (e.g. network conditions, CPU workload) is not fixed. Ketterlin and Clauss [KC11] optimize memory tracing by using skeletons. By keeping only the operations that affect memory, they obtain a lightweight skele- ton that can be run side by side with the original code. This avoids having to instrument the original code, obtaining an average speedup of 1.7 with respect to an instrumented version of the code throughout a set of 28 benchmarks from the SPEC2006 benchmark suite. Their approach is similar to ours in that the skeleton contains the memory operations, which we need to characterize the dataset de- pendencies and generate the communications between loops assigned to different processes. The work by Diniz [Din03] is an application of the concept to HPC. A list of performance models is proposed first, and a compiler maps each BB in the application to one of the models from that list, generating the skeleton of the application by replacing the BBs with instances of the models. The performance data obtained by running the skeleton can then be used to estimate the per- formance of the application with real datasets without having to run the real application, signaling possible bottlenecks.
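As an illustration of the concept (this sketch is not taken from any of the tools above; the kernel, the idx array and the printed trace are made up), the following code shows a loop and the skeleton that would be derived from it by keeping only the statements that determine memory accesses:

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // Original kernel: the value computation and the access pattern are mixed.
    void kernel(std::vector<double>& a, const std::vector<int>& idx) {
        for (std::size_t i = 0; i < idx.size(); ++i) {
            double r = 2.0 * static_cast<double>(i) + 1.0;  // placeholder computation
            a[idx[i]] += r;                                 // indirect write: only idx decides the access
        }
    }

    // Skeleton: the computation of r is stripped; only the statements that
    // determine which memory locations are touched remain, so it is much
    // cheaper to run ahead of time (here it simply records the access trace).
    void kernel_skeleton(const std::vector<int>& idx) {
        for (std::size_t i = 0; i < idx.size(); ++i) {
            std::printf("iteration %zu writes a[%d]\n", i, idx[i]);
        }
    }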

4.1.2 The ROSE compiler

We will be using the ROSE compiler as a starting point to implement the parallelization tool. ROSE [LQLM12][RDLQ12][Qui00] is a compiler infrastructure designed to apply source-to-source transformations. Several cutting-edge parallel optimization tools have already been implemented using ROSE, which makes it an ideal candidate for our tool. For example, Liao et al. [LYS+13] present one of the first implementations of the accelerator directives from the OpenMP standard. Similarly, Sottile et al. [SRW+13] introduce a compiler that identifies data-parallel structures in Fortran and parallelizes them onto accelerators. Liao et al. [LQWP09] propose an automated approach to parallelize codes for multicore processors. While these three works tackle the problem of automated parallelization, they all target shared-memory systems.

4.1.3 Inspector-Executor

The inspector-executor paradigm (or I/E) is a technique to create parallel implementations of algorithms that was introduced by Saltz et al. [SCMB90]. It uses a runtime preprocessing stage, the inspector, prior to the actual computation of the algorithm, called the executor. The two parts of the code were originally called preprocessor and executer, and had to be manually implemented for each application. This approach has been used in the HPC community ever since to calculate data partitions and communications between processes for loops whose data dependencies are only entirely defined at runtime.

Figure 4.2: Creation of inspector and executor
Inspectors are typically created by duplicating the sequential algorithm and stripping from the copy the lines of code that do not result in accesses to data. This copy was called a template in the original work, and we extensively refer to it as a skeleton due to its similarity to the works discussed in Section 4.1.1. Data access patterns can be analysed at run time by the skeleton in a preprocessing stage. This later allows efficient data partitioning, and therefore pre-calculating the chunks of data that the executor will need to communicate between processes.

Figure 4.2 illustrates the process of creating the inspector and executor. The first step is to copy the body of the kernel of interest and identify which parts of the code access the relevant data. In this case, the first block accesses array a in the order defined by the loop induction variable. The second block also accesses a, but in this case the traversal is given by a function whose result, to illustrate the case, depends on two other arrays that will be received as input to the program. At this stage, the parts of the code that do not affect memory accesses are removed, such as the computation of r. In the context of this work, a line of code that affects memory is a line that generates either a read or a write to memory, or a calculation used later to decide the location of a read or write. The second stage in the Figure illustrates the analysis performed by the inspector and the computation of the accesses. This allows the executor to index array a by accessing another array instead of calling a function that depends on two more arrays. The first block in the inspector is also removed, since the compiler can detect that the array is indexed with the induction variable and does not require modifying the executor. This is a simple example; the real strength of the I/E approach is more visible when analysing complex implications of memory accesses such as communications, and will become clearer in the following paragraphs.
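As an illustration of the transformation in Figure 4.2, the following sketch shows a possible original code, the inspector derived from it, and the resulting executor. Here f() and g() are placeholders for the input-dependent index and value computations, and acc_a is the access array filled by the inspector; the array sizes are assumed to be consistent.

    #include <cstddef>
    #include <vector>

    // Placeholder index and value functions; in a real code f() would depend on
    // other input arrays, which is why it cannot be analyzed at compile time.
    static int    f(int i) { return (7 * i + 3) % 16; }
    static double g(int i) { return 0.5 * i; }

    // Inspector: only the access-related statements are kept, and the indexes
    // produced by f() are recorded into the access array acc_a.
    std::vector<int> inspect(int n) {
        std::vector<int> acc_a(n);
        for (int i = 0; i < n; ++i)
            acc_a[i] = f(i);                 // precompute where each iteration writes
        return acc_a;
    }

    // Executor: the original computation, but indexing a through the access
    // array instead of re-evaluating f(). The zero-initialization loop needs no
    // change because it is indexed directly by the induction variable.
    void execute(std::vector<double>& a, const std::vector<int>& acc_a) {
        for (std::size_t i = 0; i < a.size(); ++i)
            a[i] = 0.0;
        for (std::size_t i = 0; i < acc_a.size(); ++i) {
            double r = g(static_cast<int>(i));
            a[acc_a[i]] = r;                 // indirect write resolved by the inspector
        }
    }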


The I/E paradigm is only suitable for loops whose data dependencies are completely known before the execution of the main algorithm. That is, the dependencies must not change during the computations to be parallelized; they may only depend on pre-computations or data structures that are fixed once the kernel code starts. Nevertheless, this is a step forward with respect to static optimizations, such as the ones implemented by the programmer or those attempted by the compiler by merely inspecting the structure of the code. These characteristics make the I/E approach a perfect fit for simulations that use unstructured meshes.

The work by Strout et al. [SLC+16] has a few similarities with our proposal. The polyhedral model has already been introduced in Section 1.2.4. These frameworks analyze the data accesses in a program and represent them as a polyhedron, which can later be used to reorganize the data accesses while honoring the dependencies between them. However, because the polyhedral model is restricted to affine data accesses and loops, it is not possible to apply it to graphs in general. The reason is that graphs are typically traversed by means of indirect accesses to an array that depend on the contents of a second array (i.e. A[B[x]] instead of just A[x]). These accesses are not affine, and therefore this approach cannot be applied to the applications analyzed in our work. In order to solve that problem, Strout et al. take advantage of the inspector/executor paradigm by generating a preprocessing stage with data accesses that is fed into an executor that performs polyhedral reorderings at run time. The resulting proposal is known as the Sparse Polyhedral Framework. Similarly to this work, we use an inspector/executor approach to analyze the data access patterns of unstructured graphs. However, instead of using the information collected by the inspector to reorder the data accesses, we use it to find an efficient parallel partitioning and distribute the program across several processors.

4.1.4 IEC

The Inspector-Executor Compiler, or IEC, is an automation tool to parallelize programs for distributed-memory systems using the inspector-executor paradigm. The target code must be annotated by the application developer with simple pragmas specifying the arrays that participate in the parallel computations. IEC then statically analyzes the code at compile time, and automatically generates the inspector in order to defer decisions to run time when necessary. It also wraps the original code into an executor, and inserts calls to communication directives. The tool was presented by Ravishankar et al. [REP+12] and is the basis of the work in this Chapter.

In Chapter 2, we introduced several techniques to optimize data access in codes that use unstructured meshes. Suppose a simple example where the mesh is represented as an array of vertices and an adjacency matrix giving the connections between the vertices. At any given iteration, a loop in the code visits a particular vertex or edge, and performs computations that require data from its neighbour vertices, or its endpoints in the case of edges. This requires looking into the connectivity matrix, which is typically given as an input to the program and therefore cannot be analysed at compile time. A more complex example would be the case where the loop traverses the mesh in a breadth-first manner, for example if we apply the optimizations mentioned in Chapter 2. In that case, each time the loop advances to the next iteration, the connectivity matrix must be accessed in order to select one of the neighbours of a previous vertex. IEC is able to handle these cases by creating a skeleton of the application in the inspector.

Types of memory accesses to analyze

Memory accesses are typically triggered from the code when variables are read or written. Variables can be of many types but, for the purpose of this work, we restrict the analysis to arrays, as these are very common in scientific codes. Other types of variables, such as basic datatype variables or structures, are simpler to handle in a distributed context. Arrays are categorised into two groups:

• Indirection arrays are used to access other arrays. The connectivity matrix in the previous example is one such array, since its contents are used as indexes to access the vertex array. These arrays cannot change while the executor runs; otherwise the information calculated by the inspector becomes obsolete.

• The rest are referred to as data arrays. They contain data used for calculations and their contents may be modified during the processing, such as the vertex array in the previous example.

As mentioned above, IEC does not allow the contents of indirection arrays to change during the execution of the algorithm. For example, if the target algorithm changes the connectivity matrix, it changes the mesh topology, and the inspector is not able to analyse the access patterns in advance. Some codes dynamically modify the topology of the mesh during the execution of the simulation. For example, a mesh that represents the chassis of a car in a crash-test simulation will deform, and extra vertices may need to be added in certain regions for improved accuracy. In theory, it would be possible to generate an inspector that traverses the mesh multiple times in order to analyse the changes in the mesh. In the worst case, the mesh would be traversed as many times as in the original algorithm. This would increase the algorithmic complexity of the inspector to that of the target algorithm, rendering it useless in terms of performance.

Generation of the inspector

Typical automated parallelization tools target affine loops, where the loop bounds, conditionals and expressions used to access arrays are affine functions of iterators and parameters. When IEC analyzes the code at compile time, it searches for partitionable loops. These are a superset of affine loops where values stored in indirection arrays can also be part of the functions used to access arrays. Skeletons of these loops are added to the inspector, with expressions that result in data accesses being replaced by calls to a library that populates a hypergraph. This hypergraph is then partitioned, and each partition is assigned to a process.

Figure 4.3 illustrates the generation of the skeleton for an inspector with a small pseudocode application that works on a graph. The outer loop checks an exit condition; scientific codes typically check that the algorithm has reached a stable state, or bail out when a certain number of iterations is reached. The inner loop traverses the vertices of the mesh and, for each vertex, performs some calculations in which its neighbours are involved, hence the accesses to the adjacency matrix.

The inspector skeleton is almost a copy of the original code, with some exceptions as the example in Figure 4.2 anticipated. First, the calls to the compute*() functions have been removed. This avoids unnecessary work that does not affect accesses to data structures. Second, the outer time iteration loop is not necessary: the inspector assumes that the data access layout does not change across time iterations, and therefore analysing the mesh once per inner loop suffices. Third, register_iter(loop_id, iteration) creates initial data structures to represent this iteration of the loop as a vertex in a hypergraph. Observe that the inner loop has a unique identifier (which is more relevant when the code has several inner loops); this identifier allows later stages of the inspector and executor to retrieve the appropriate analysis information for each loop. Similarly, each iteration is assigned an identifier that is unique among other iterations in the same loop. Fourth, mark_used(array, index, loop_id, iteration) populates this hypergraph: each array index is represented by a hypergraph net, which is connected by this function to the vertex (loop iteration) that it is used in. At the end of the inspector execution, each vertex may be connected to an arbitrary number of neighbours, and the net used to connect them links all the iterations in all inner loops that use this array index.



Figure 4.3: Pseudocode example of inspector generation
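The structure of the listing in Figure 4.3 can be sketched as follows; register_iter() and mark_used() are the IEC runtime calls described above, but their exact prototypes, as well as the compressed adjacency representation (adj, adj_offsets), are assumptions made here for illustration only.

    // Hypothetical prototypes of the IEC runtime calls named in the text.
    void register_iter(int loop_id, int iteration);
    void mark_used(int array_id, int index, int loop_id, int iteration);

    const int LOOP_ID    = 0;   // unique identifier of this inner loop
    const int VERTEX_ARR = 0;   // identifier of the vertex data array

    // Inspector skeleton for the mesh loop of Figure 4.3. The outer time loop is
    // dropped because the access pattern is assumed identical in every outer
    // iteration, and the compute_*() calls of the original loop are removed.
    void inspector_skeleton(const int* adj, const int* adj_offsets, int n_vertices) {
        for (int j = 0; j < n_vertices; ++j) {
            register_iter(LOOP_ID, j);                       // iteration j becomes a hypergraph vertex
            mark_used(VERTEX_ARR, j, LOOP_ID, j);            // this iteration touches vertex j
            for (int k = adj_offsets[j]; k < adj_offsets[j + 1]; ++k) {
                mark_used(VERTEX_ARR, adj[k], LOOP_ID, j);   // ... and its neighbours
            }
        }
    }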

Hypergraph partitioning

With all the information gathered by running the inspector, a standard hypergraph partitioning algorithm can later be used to:

• Optimize load balancing by assigning the same number of vertices to each processor.

• Minimize communications by assigning iterations linked by the same nets (array indexes) to the same processor.

Figure 4.4 gives an example hypergraph for a code snippet with two arrays and two loops. The iterations of the first and second loops are represented by nodes i_x and j_y respectively. The nets represent each index in arrays a and b. Since loop i has three iterations, the hypergraph shows vertices i0, i1 and i2. Similarly, the six iterations of loop j are represented by the six vertices j0 to j5.


Now, a[0] is used in iterations i0, j0 and j3; a[1] is used in iterations i1, j1 and j4; and so on. Suppose that the application runs with two processes, and that the partitioner decides to assign i0 and i1 to the first process and i2 to the second. In order to minimize the nets that cross partitions (i.e. communications), the partitioning of loop j should assign j0, j1 and j3 to the first process. j2 and j5 should definitely be assigned to the second process, and j4 could be assigned to either of them, but for the sake of load balancing it will be assigned to the second. At the end, process 1 is assigned five iterations and process 2, four. In practice, iterations of distinct loops may affect load balancing differently.

Figure 4.4: Example of an IEC hypergraph
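For reference, the code snippet depicted in Figure 4.4 has roughly the following shape; this reconstruction is approximate, and the exact loop bounds and constants may differ from the original figure.

    // Approximate reconstruction of the snippet of Figure 4.4: two arrays and
    // two loops, where loop j reads the values written by loop i through a[j % 3].
    int a[3];
    int b[7];

    void two_loops() {
        for (int i = 0; i <= 2; ++i) {       // iterations i0..i2
            a[i]     = 2 * i;                // net a[i] links this iteration ...
            b[i]     = 0;
            b[3 + i] = 6;
        }
        for (int j = 0; j <= 5; ++j) {       // iterations j0..j5
            b[j] = b[j] + a[j % 3];          // ... to the j iterations reading a[j % 3]
        }
    }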

A key observation is that, in IEC, the iteration space of each loop is split among all processors. For codes where several loops are partitionable, they are all computed by all processors. This is the reason why a single hypergraph is used even in cases where several inner loops are available for parallelization: an array may be used in more than one loop, and the partitioning choices have an effect on the amount of communication required when the processes finish one loop and start with the next one. Also, because processes communicate between the end of a loop and the start of the next loop, they must all be synchronized by means of a barrier at the end of each loop. As a side effect, this guarantees that communications take place only at the start or end of the loops. In IEC, the barrier is implicit, as it uses MPI blocking collective communications that prevent processes from continuing until all of them have the data required. In sum, the IEC inspector performs the following actions:

1. A global rank is assigned to each process.

2. The inspector is created and initialized with the above information, together with global information such as the number of loops, data and indirection arrays.


3. Array objects (both data and indirection) are given an identifier, initialized with the relevant information such as size, and initially block-partitioned among processes.

4. Loops are initialized with information about the used arrays. Each data array is marked as read-only or read-write.

5. Hypergraphs are created by running the loop skeletons.

6. The hypergraph is partitioned. IEC partitions one single hypergraph that holds information about the uses of all data arrays in all loops.

7. Sections of the arrays are assigned to each process.

4.2 Modifications to IEC for pipelining

The fact that all processes participate in the computation of every partitionable loop is a problem for heterogeneous systems. Because processors have different computational capabilities, it is not clear that load-balancing the dataset equally will yield the best results. Also, there may be cases where a processor is not suitable for computing some of the loops. For example, an FPGA may not have enough resources to contain kernels for all the loops. Furthermore, the implicit barrier at the end of the loops implies that, if load-balancing is not carefully tuned, some processors will be idle while others struggle with their computational load.

Another challenge in IEC is the scalability of the solution. A high number of processors, such as when executing in a high-performance cluster, is subject to the law of diminishing returns: the data partitions become smaller, and the frontiers between partitions become bigger in comparison. Therefore, the ratio of computation to communication decreases, and the solution is less efficient.

This Section introduces the implementation of a solution based on IEC that also splits the set of partitionable loops into different processor teams. The teams are arranged in a pipeline so that computations in one loop can be started as soon as the data required from previous loops is ready for each iteration, without having to wait for the predecessor loops to complete. We refer to this solution as IEC/pipeline throughout the text in order to distinguish it from the original IEC.

The IEC/pipeline approach should be, in theory, friendlier to heterogeneous systems than the original IEC implementation. Firstly, it allows more flexibility when mapping parts of the code to processors. For example, FPGAs yield the best results when dealing with long pipelines with a small amount of input and output data. On the other hand, traditional CPUs can handle a high number of input/output data thanks to their cache memories, but do not perform well with large loops due to limited fine-grain parallelism and the register spills required by small register banks. In such cases, IEC/pipeline would allow allocating different loops to the most adequate set of processors. Additionally, standard loop splitting techniques could be used to obtain a finer granularity with more loops. Therefore, the code can be arranged in the way that is most efficient for the set of available processors. Second, IEC/pipeline removes the synchronization points at the end of the loops, allowing groups of processes to run in a loosely-coupled manner. As soon as the data are ready in one loop, they are transferred to the processors computing the next loop, instead of waiting for the loop to finish in all processors. Third, partitioning both the dataset and the inner loops allows increasing the size of data partitions with respect to IEC on the same number of processors, since each loop is computed by a smaller group of processors. This reduces the borders between partitions, thus reducing halo-swapping between processors assigned to the same loop. In fact, it can be entirely removed if the iterations in an inner loop only need data from the previous loops. This is a typical case in mesh-based scientific codes, where the inner loops process the data at time iteration N from the data at iteration N-1. In other words, assuming that the inner loops traverse the dataset, vertices and edges require data from neighbouring vertices from the previous outer iteration, but not from the current one.

There is a potential pitfall, however. In IEC, only a small part of the data needs to be communicated between processors: the borders of the data partitions. Other parts of the arrays are local to the processors, which process each loop on the same portion of data. In the case of a pipeline approach, teams of processors now need to communicate all the data throughout the pipeline. If a loop uses data that a previous loop writes to, it needs to receive the complete chunk of the array, since the receiver processor does not have any part of the array. In sum, the communications between teams may completely overshadow the performance gains obtained by splitting the set of loops. A smart implementation of the communications is vital to avoid this.

The library has been implemented using message passing, in particular MPI. This allows the compiler to target massively parallel distributed systems interconnected by a wide variety of networks such as Ethernet, Infiniband, and shared-memory buffers when available.

4.2.1 Load partitioning in two dimensions

IEC/pipeline partitions the load in two dimensions: these will be called the data dimension and the loop dimension; they are not to be mistaken for the dimensions of a mesh (i.e. two- or three-dimensional meshes). In the data dimension, partitioning is done in a similar way to the original IEC: all positions in data arrays are turned into nets in a hypergraph, which is then partitioned, resulting in the arrays being partitioned as well among processes. The loop dimension covers the set of inner loops: processes are grouped into teams, and each team is assigned to a single loop. The initial proof-of-concept implementation has a simple partitioning scheme in this dimension: each loop is a unit of work that is assigned to a single team of processes, with the arrays involved in the loop computation partitioned among the processes of the team.

A separate but related problem is creating the teams. Here we also take a simple approach where all teams have the same number of processes. Therefore, if the code has three loops, we need at least three processes (team size = 1), increasing in multiples of three (6 for a team size of 2, 9 for a team size of 3, etc.). The team size also needs to be taken into account when partitioning in the data dimension. A minimal sketch of this team-creation scheme is given at the end of this Section.

Figure 4.5 shows a diagram where loops are shown together with the data structures read or written in them. Each array in a loop may be read-only (R), write-only (W), read/write (R/W) or an indirection array (I). Indirection arrays are read-only by definition, as explained before; this requirement allows the compiler to insert an inspector with O(N) complexity, where N is the size of the mesh. The diagram represents part of a seismic simulation [BBG+96] widely used as part of benchmarking suites, which we will use later to analyze the performance of IEC/pipeline.

Each loop is a stage in the pipeline, which is computed by exactly one team of processes. Every team that generates data required by a later stage is called a producer. In this example, all teams are producers except for the last one. Conversely, every team that requires data from an earlier stage is called a consumer. Every team except for the first one is a consumer. For example, loop 1 consumes array disp0 from loop 0; that is, loop 1 is a consumer of loop 0, and loop 0 is a producer for loop 1. At the end of each iteration of the outer loop, there is a special case where the last team acts as a producer to the first team, although the data produced by the last team at iteration i turns into the data received by the first team at iteration i + 1 of the outer loop. We generally use the terms producer and consumer only when they apply to communications within the same outer loop iteration.
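The team-creation scheme described above can be sketched as follows; the mapping from ranks to teams is an assumption made for illustration, and MPI_Init() is expected to have been called beforehand.

    #include <mpi.h>

    // With n_loops inner loops and n_loops * team_size processes, consecutive
    // ranks are grouped into teams; each team is assigned one inner loop.
    void create_team(int n_loops, int* team_id, MPI_Comm* team_comm) {
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        int team_size = size / n_loops;      // assumes size is a multiple of n_loops
        *team_id = rank / team_size;         // which inner loop this process executes
        MPI_Comm_split(MPI_COMM_WORLD, *team_id, rank, team_comm);  // team communicator
    }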

4.2.2 Data dependencies

The data dependencies between loops define the communications between teams. Data dependencies are commonly categorized into three types: read-after-write (RAW) or true dependencies, write-after-read (WAR) or anti-dependencies, and write-after-write (WAW) or output dependencies.

Figure 4.5: Partial diagram of a seismic simulation

RAW dependencies between two loops effectively mean that there is a producer/consumer relationship between the writer and the reader loop. Therefore, they translate directly into a communication need in the pipeline. The two other dependencies are innocuous in the context of a pipeline and can be completely disregarded. In a situation with a WAR dependency, a processor W writes to some array and is placed in the pipeline after processor R, which reads from the same array. W does not affect the values read by R since it is placed later in the pipeline, and R obviously does not need W's values since it is placed before. The reasoning is similar for WAW dependencies. WAR and WAW dependencies are relevant in the context of scheduling, for example when generating kernels targeting an FPGA that comprise a single loop, when fusing loops together, or in problems related to resource sharing such as register allocation. However, they do not translate into communications across pipeline teams and are therefore irrelevant in the context of this thesis.
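As a small illustration of why a WAR dependency across stages requires no pipeline communication (the loop bodies below are made up), consider an earlier stage that reads an array followed by a later stage that overwrites it:

    // Stage R (earlier in the pipeline) reads array a; stage W (later) overwrites
    // it. W runs in a different team on its own copy of the partition, so R never
    // observes W's values and no transfer of a between the two stages is needed.
    void stage_R(const double* a, double* c, int n) {
        for (int i = 0; i < n; ++i) c[i] = 2.0 * a[i];   // read a
    }
    void stage_W(double* a, int n) {
        for (int i = 0; i < n; ++i) a[i] = 0.0;          // write a afterwards
    }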


Figure 4.6: Two approaches to pipeline dependency communication

In the general case, there may be data dependencies between loops that are not placed one immediately after the other. This is illustrated in Figure 4.6, where an array a is written in the first loop and read in the third loop. There are two options to handle this requirement in the communications library. The first one would be to allow teams to send data to all of their consumer teams. This is the case of the diagram in the center. The second would be to propagate data from producers to consumers along the pipeline; that is, if process M produces data for process M + N (where N is the separation between both processes), M would send the data to process M + 1, which would in turn send it to M + 2, and so on until the data reaches process M + N. This is depicted in the diagram to the right, where a shadow dependency on array a has been added to the second loop.

The first option implies that the communication library must be able to synchronise sends and receives between multiple processes, which requires a more loosely-coupled form of communication that is not trivial to optimize. The latter, in contrast, would allow restricting communications to two other processes: the immediate predecessor and successor in the pipeline. Sadly, the fact that data must go through multiple teams between producer and consumer could increase the amount of communication by O(pipeline length) in the worst case. This disadvantage is the reason why it has been decided to implement the communications library so that processes can communicate with all their consumers. This will be discussed more in depth in the following sections.

4.2.3 Changes to the inspector

The modifications in IEC/pipeline mostly consist of a library of classes and functions that can be easily inserted by the source-to-source compiler into the application code, in the same way as the original IEC does. The aim of this initial demonstrator is to investigate the feasibility of the pipelining approach in a homogeneous cluster, for which a simple partitioning suffices: the processes are split equally among as many teams as loops.

In IEC/pipeline, each process is assigned a team at the beginning of the inspector. All processes join a global MPI communicator global_iec_communicator, and processes in each team join an additional team communicator. The original IEC has a function set_inspector() that creates and initializes a singleton inspector. The initialization comprises the setup of the global communicator, registration of data arrays into global data objects, and allocation of buffers for the communications. IEC/pipeline additionally registers loops individually in global loop objects and sets up the team communicators. Each loop holds information about the arrays that are used in it, which of them are read, written, or both, a flag indicating if the loop is the last one in the pipeline that writes to each of the arrays, and an equivalent flag to indicate if the loop reads the arrays without any previous loop writing to them. The last two flags are later used to decide which teams need to feed data to the next iteration of the outer loop, and which ones need to receive data from the previous iteration. The loop structures are initialized and given a loop ID.

Once the inspector, loops, and data and indirection arrays have been registered and initialized, the skeletons for each loop are run. These are similar to the ones in IEC: calls to add_vertex() and add_pin_to_net() initialize the hypergraph. The difference with IEC is that each team creates its own hypergraph. A team only runs the skeletons for the loop that it owns, that is, the one that will be run in the executor, and for its consumer and producer loops. This allows each team to request information from consumers and producers in order to know exactly which data needs to be communicated and when. After this, the hypergraphs are partitioned and chunks of each array are assigned to the processes in each team where they are used.

The information gathered from the partitioning can then be used to calculate all the information about communications. At this point, the library code knows which parts of the arrays are needed in which processor, and at what iteration of the inner loop. A call to pipe_calculate_info() follows. This function calculates which indexes are needed for each array that is read, which ones must be sent in the case of arrays that are written, and who are the producers and consumers of the current process.

pipe_calculate_info() also calculates the data that must be fed back into other processes for the next iteration of the outer loop. For that, two groups of arrays are populated at compile time and given to each team: lastWriteX and readNoPreviousWriteY, where X and Y are loop numbers. The rationale behind this calculation is as follows: if a team X is the last one to write to an array A, its processes will need to pass data to the teams that read it for the next outer-loop iteration. This is recorded as lastWriteX[A] = 1. Similarly, if a team Y reads array A, it will need to obtain that data from somewhere. If a previous team Z (in the same outer iteration) writes to array A, then Y will receive A from the processes in loop Z. However, if there is no previous team that writes to A, team Y will need to receive A from the last writer team X of the previous outer iteration. When a team reads an array and there is no previous writer team, this is recorded as readNoPreviousWriteY[A] = 1. It is then trivial to match consumer and producer teams for outer-iteration communications. With this and the information about the array partitioning, the inspector can calculate which processes are senders or receivers, which parts of the array must be communicated, and to/from which processes they need to communicate at the end of each outer iteration.

The communications between teams in the same outer iteration vary with each inner iteration, as each inner loop iterates through the data structures and therefore each iteration uses different parts of the arrays. This requires a considerable amount of data to be calculated by the inspector. Conversely, the communications between outer iterations are identical across iterations, as the outer loop iterates through the whole arrays, and the partitioning and data structures are immutable. The inspector generated by IEC/pipeline proceeds as follows:

1. A global rank, team ID and team rank are assigned to each process.

2. The inspector is created and initialized with the above information, together with global information such as the number of loops, data and indirection arrays.


3. Array objects (both data and indirection) are given an identifier, initialized with the relevant information such as size, and initially block-partitioned among processes.

4. Loops are initialized with information about the used arrays. Where IEC only needed to know whether an array was read-only or read-write, IEC/pipeline keeps separate information about read and write properties, as it affects the layout of RAW dependencies (a loop can access an array as write-only, in which case it acts only as a producer for that array).

5. Hypergraphs are created by running the loop skeletons.

6. Hypergraphs are partitioned. Although IEC/pipeline has as many hypergraphs as loops, they are smaller because they only need information local to the loop instead of the global knowledge required in IEC.

7. Sections of the arrays are assigned to each process.

8. Data required for pipeline communications is calculated.

Stages 3, 5 and 7 are identical to IEC. Stages 1 and 2 require some additional information related to the pipeline approach, such as team IDs. The remaining stages are specific to IEC/pipeline and provide the data needed to calculate the inter-loop communications performed during the executor.

4.2.4 Changes to the executor

IEC inserts a call to setup_executor() before the actual computational kernel. Each process receives the parts of the data required for its computations according to the partitioning. Additionally, IEC/pipeline allocates send and receive buffers for the pipeline communications. In order to perform the pipeline transfers, the compiler will have inserted calls to pipe_receive() and pipe_send() at the start and end of each iteration of the inner loops, respectively.
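The resulting structure of an executor inner loop is sketched below; loop_id, n_local_iters and compute_iteration() are illustrative placeholders, and the actual signatures of pipe_receive() and pipe_send() in the library may differ.

    // Provided by the IEC/pipeline runtime (signatures assumed for illustration).
    void pipe_receive(int loop_id);
    void pipe_send(int loop_id);
    void compute_iteration(int iter);   // placeholder for the original loop body

    // Shape of one pipelined inner loop in the executor of a given team.
    void run_loop(int loop_id, int n_local_iters) {
        for (int iter = 0; iter < n_local_iters; ++iter) {
            pipe_receive(loop_id);      // wait until enough producer messages have arrived
            compute_iteration(iter);    // original loop body on the locally owned data
            pipe_send(loop_id);         // ship array entries that reached their final value
        }
    }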

Collective communications, performance and deadlocks

Whenever a group of processes communicates similar data or communicates at the same time, collective communications can be used. Collectives are typically faster than point-to-point communications: they save startup time and, depending on the implementation, often feature improved message scheduling. In addition, some systems have dedicated tree networks for collective communications, such as the BlueGene/L [AAA+02]. IEC/pipeline might appear as an ideal case for collective operations. However, as discussed in Section 4.2.2, all producers and consumers communicate with each other. In this context, collective communications are not the best choice.

It is common to have pipeline producers that need to multicast the same data to multiple consumers, which could be easily achieved with a call to MPI_Bcast() on an MPI communicator comprising the sender and the receivers. One broadcast would be required for each array being transferred, since the set of consumer processes may be different for each of them. Similarly, since consumers receive data from multiple producers, one broadcast call would be required for each received array. Assuming blocking communications, each broadcast must succeed in order for the next one to start. However, the order in which a producer sends the data may not be the order in which the consumers need it. Furthermore, given a single producer, each consumer may need the data in a different order, for example if one of them uses an access array to index the data array, and another one indexes it by means of the loop induction variable. To make things worse, each consumer will most likely depend on an entirely different set of producers. This situation is prone to deadlocks and performance degradation.

Figure 4.7: A blocked loop in the pipeline as a consequence of a dependency in an unrelated loop

An example is depicted in Figure 4.7. Loop 1 broadcasts some data to the first iteration in loop 2, and to the second iteration of loop 3. Because the broadcast must wait until all consumers receive the data, this broadcast keeps waiting because loop 3 cannot proceed until it gets the data from loop 2. Now, because the broadcast has not finished, loop 2 will not initiate its own broadcast and the whole program deadlocks.

There are non-blocking versions of collectives in MPI. If we use a non-blocking broadcast, loop 2 successfully receives the data from loop 1, it initiates its broadcast and eventually loop 3 receives the first broadcast from loop 1. However, if both broadcasts in loop 1 belong to different iterations, and we restrict the length of the non-blocking broadcast to a single iteration (i.e. we wait for the communication to finish at the end of the iteration), loop 4 will be blocked for some time until loop 3 finishes iteration 2. In other words, loop 4 incidentally blocks because of loop 3 even when there is no dependency between them. Limiting the length of the non-blocking broadcast to a single iteration is a sensible restriction, since otherwise the library would have to maintain a considerable number of handles for open connections (a group of MPI_Request handles, in MPI jargon). The problem with broadcasts, and collective communications in general, is that they impose an artificial synchronization point among their participants: all processes block until the whole communication is finished, even when there is no real dependency between the consumers.

Loosely-coupled communications

The solution to deadlocks is to avoid these artificial constraints by using non-blocking point-to-point communications. We use the term loosely-coupled communications here to stress the fact that producers may send data at different times than expected by the consumers, and these need to be able to correctly identify the data and update the appropriate local data structures. In order to achieve this, the compiler inserts calls to the IEC/pipeline communication library, making this loose coupling transparent to the application code.

At the end of each iteration, pipe_send() searches in the arrays for the data that have a final value after the current iteration, and packs a message to each consumer. This message is sent through a call to MPI_Issend(). After sending to all the consumers, the producer calls MPI_Waitall() to make sure that all processes received the data.

Similarly, each consumer calls pipe_receive() before starting an iteration. This function first checks whether the consumer has received all the data required to start the computations in the current iteration by calling pipe_ready(). In order to do this check, it keeps track of how many messages have been received from each producer, and compares it with a "safe" value that changes for each producer and inner-loop iteration. This value is calculated by the inspector based on an analysis of the producers, and therefore every consumer knows that it is safe to proceed with the calculations when the required data have been received. For example, if loop A is about to start iteration 7 and the safe value for producer B is 3, this means that A must have received at least 3 messages from B in order to start the iteration safely.

If the consumer is "not ready", it calls MPI_Waitany() to wait for messages from any producer. Each message carries information about the sender so that the receiver can look up the mapping between received data and local data structures. These tables contain, for each sender, information about the order in which indexes in arrays are received, and are calculated by the inspector. Once the mappings are known, the data are copied from the MPI message into the local data structures. Afterwards, a new MPI_Irecv() is issued to the producer that the data were received from. Finally, the consumer checks again whether it is ready to start the iteration. The cycle is repeated until all data required for the iteration have been received.

This form of communication implies that library calls in the consumers are responsible for the correct synchronization, whereas the producers send data as soon as they are ready. The inspector populates tables for each array holding information about when data are ready. This information is computed based on the read-write information of the array: an array position is ready at the last iteration of the inner loop that writes to it.
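The consumer-side logic just described can be condensed into the following sketch; the Producer bookkeeping, update_local_arrays() and the safe values are stand-ins for the tables computed by the inspector, and error handling and buffer sizing are omitted.

    #include <mpi.h>
    #include <cstddef>
    #include <vector>

    struct Producer {
        int rank;                   // MPI rank of the producer process
        std::vector<char> buffer;   // receive buffer sized by the inspector
        int received = 0;           // messages received in the current outer iteration
    };

    void update_local_arrays(const Producer& p);   // unpack using the inspector tables

    // True when at least safe_iter[p] messages have arrived from every producer p
    // for the inner-loop iteration that is about to start.
    bool pipe_ready(const std::vector<Producer>& prods, const std::vector<int>& safe_iter) {
        for (std::size_t p = 0; p < prods.size(); ++p)
            if (prods[p].received < safe_iter[p]) return false;
        return true;
    }

    void pipe_receive(std::vector<Producer>& prods, std::vector<MPI_Request>& reqs,
                      const std::vector<int>& safe_iter) {
        while (!pipe_ready(prods, safe_iter)) {
            int idx;
            MPI_Status status;
            // Block until a message from any producer with an outstanding receive arrives.
            MPI_Waitany(static_cast<int>(reqs.size()), reqs.data(), &idx, &status);
            Producer& p = prods[idx];
            p.received++;
            update_local_arrays(p);                 // copy the payload into the local arrays
            // Re-arm the non-blocking receive for this producer.
            MPI_Irecv(p.buffer.data(), static_cast<int>(p.buffer.size()), MPI_CHAR,
                      p.rank, 0, MPI_COMM_WORLD, &reqs[idx]);
        }
    }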

Communication between iterations of the outer loop

At the end of each outer-loop iteration, there is a global synchronization point. Processes that wrote arrays for the last time communicate the results to the first teams that read them. Since these communications do not suffer from the problems discussed in the previous Sections, they have been implemented with collective communications. Specifically, a call to MPI_Alltoallv() is used to trigger an all-to-all communication where each process can define different sizes for the data to be sent or received, and for different processes.
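A minimal sketch of such an exchange is shown below; in the library, the counts and displacements would be filled from the inspector's partitioning tables rather than passed in by the caller.

    #include <mpi.h>
    #include <vector>

    // Each process sends a possibly different number of doubles to every other
    // process of the communicator; counts and displacements are per destination
    // (for sends) and per source (for receives).
    void outer_iteration_exchange(const std::vector<double>& sendbuf,
                                  const std::vector<int>& sendcounts,
                                  const std::vector<int>& sdispls,
                                  std::vector<double>& recvbuf,
                                  const std::vector<int>& recvcounts,
                                  const std::vector<int>& rdispls,
                                  MPI_Comm comm) {
        MPI_Alltoallv(sendbuf.data(), sendcounts.data(), sdispls.data(), MPI_DOUBLE,
                      recvbuf.data(), recvcounts.data(), rdispls.data(), MPI_DOUBLE,
                      comm);
    }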

Required data structures

This is a summary of the data structures related to the pipeline communications that form the interface between the inspector and the executor. They are populated by the inspector using the partitioning information, and used by the executor to perform the pipeline communications and synchronize the process execution. For each data structure, we first explain its purpose and then give its C++ prototype (template parameters are omitted here).

usedArrays: Each inner loop holds, for each iteration, the identifiers of the arrays that are read. pipe_receive() iterates through this structure, and uses the identifiers to query each array (represented by a local array object) about the data that needs to be received from the producers.
std::map< ... > usedArrays

computedArrays: Similarly, each inner loop holds, for each iteration, the identifiers of the arrays that are written. pipe_send() iterates through this structure, and uses the identifiers to query each array about the data that needs to be sent to the consumers.
std::map< ... > computedArrays

pipeSendIndexes: Local arrays are related to a loop, and therefore to a team. Each local array holds, for each iteration, all the indexes that need to be transferred to each of the consumers. pipe_send() queries the computedArrays for this information.
std::map< ... > pipeSendIndexes

pipeRecvIndexes: Similarly, each local array holds, for each iteration, all the indexes that need to be received from each of the producers in order to start the next iteration. pipe_receive() queries the usedArrays for this information.
std::map< ... > pipeRecvIndexes

safeIter: Each loop holds a vector indexed by local iteration number which contains, for each producer, the minimum number of messages that the loop needs to receive in order to be certain that all the data required for the iteration have been received.
std::vector< ... > safeIter
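Since the template parameters of the prototypes above are not reproduced, the following declarations show one plausible instantiation; the element types are assumptions chosen to match the textual descriptions, not the actual IEC/pipeline sources.

    #include <map>
    #include <set>
    #include <vector>

    // iteration -> identifiers of the arrays read (or written) in that iteration.
    std::map<int, std::set<int> > usedArrays;
    std::map<int, std::set<int> > computedArrays;

    // iteration -> (consumer or producer process -> array indexes to transfer).
    std::map<int, std::map<int, std::vector<int> > > pipeSendIndexes;
    std::map<int, std::map<int, std::vector<int> > > pipeRecvIndexes;

    // local iteration -> minimum number of messages required from each producer.
    std::vector<std::vector<int> > safeIter;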

4.3 Performance investigation

The performance of the IEC/pipeline communication library has been tested with a real-world earthquake simulation code called quake [BBG+96]. This application has some characteristics that make it ideal for the performance evaluation: similarly to the CFD code used in the first part of the thesis, it is based on unstructured meshes. Also, the computational kernel is a single outer loop that iterates over a number of timesteps, with 15 inner loops that iterate over the mesh data structure. The reason for using this benchmark instead of the CFD code analyzed in Chapter 2 is mere convenience: it is included in the original IEC as one of the demonstration codes, and should allow easy comparison of the results between both approaches. quake is also part of the well-known benchmark suite SPEC2000 [SPE] and, although the complete suite is a commercial product, the benchmark itself is publicly available.

The experiments have been carried out on two machines. The first one is a multicore system with shared memory; the second is an HPC cluster. The multicore system has an Intel Core i5-4200U processor with four cores and 16 GB of main memory. Since the limitations of this system for HPC are obvious, it has been used only for debugging and gathering of profile data.

The second system is Magerit [Mag], the HPC cluster run by the Supercomputing and Visualization Center of Madrid (CeSViMa). The computer is in fact split into two clusters, one based on the IBM Power7 architecture and another based on the Intel Xeon one. The results shown throughout this Section were taken on the Intel cluster, which contains 82 octa-core Intel Xeon E5-2670 processors. Each processor has a peak performance of 332 GFLOPS, which makes a total peak performance of 13.64 TFLOPS. The system has 2.624 TB of main memory, thus allowing 32 GB of RAM shared among the eight cores of each processor. Each processor is a separate node, and different nodes are connected through an Infiniband QDR network. This network can be accessed by the parallel applications through message passing, specifically through OpenMPI 1.6.5.

4.3.1 Performance results of the pipelined quake

This Section presents the performance results on Magerit, and compares them to the original sequential code. quake has been re-implemented following the inspector-executor model with the help of the compiler, manually adding the calls to the pipelining communication library. The current state of the compiler only allows processes to be part of one team, that is, each process is assigned to a single inner loop. So far, there is only one process per loop, and therefore the partitioning is in one dimension only. This implies that 15 processes are used.

Three meshes have been used for the experiments: test, small and large. Table 4.1 shows the sizes of the meshes, and Table 4.2 gives the runtime results of both the sequential and pipelined versions. The times shown are only for the application kernel in order to discard less meaningful startup and shutdown serial time, such as reading the input file. In order to avoid pathological cases, each number is the median of five runs. Such pathological cases would involve, for example, ideal or particularly disadvantageous mappings between processes and processors in the cluster, or a run being unreasonably affected by the operating system evicting the parallel processes from the processors.


Mesh     Vertices    Edges      δ
test           20       26   1.23
small        7294    35025   4.80
large       30169   151173   5.01

Table 4.1: Characteristics of the datasets used with quake

Mesh     Original (s)    Pipelined (s)    Time increase
test     3.88 × 10^-4     8.88 × 10^-3            22.9x
small            0.17             3.08            18.1x
large              61             1619            26.5x

Table 4.2: Execution time comparison between the original and the pipelined quake

There could be various reasons for the slowdown in the parallel version. Since the slowdown is relatively similar for all meshes, we undertake a more complete performance study on the smallest dataset for simplicity. Figures 4.8 and 4.9 show the profiling of the sequential and pipelined codes with perf stat.

Figure 4.8 shows that the sequential code has no CPU migrations and a small number of context switches. A context switch occurs when the currently executing process is evicted from the processor by the system scheduler in favour of another process that is waiting. These events are highly disruptive, as they require saving the registers of the CPU, loading the registers for the next process, and jumping to the next instruction of the new process, which often has negative performance side effects in the level-1 caches. A CPU migration occurs after the context switch of a process, when such a process is later assigned to a different processor than the one it was previously being executed on. When a process is context-switched and CPU-migrated, the performance penalty resulting from the local caches is higher than in the case where it is not migrated, since the local instruction cache does not hold any of the instructions from that process. The processors used have separate instruction and data caches, and a similar phenomenon occurs with the data cache.

Figure 4.9 reveals that the pipelined version has about 891 times more context switches than the sequential version, which explains why it also has so many CPU migrations.


[IEC]: OriginalTime: 0.000374

Performance counter stats for 'orig_quake test.in':

       1,628602  task-clock                #   0,141 CPUs utilized
              7  context-switches          #   0,004 M/sec
              0  CPU-migrations            #   0,000 M/sec
            202  page-faults               #   0,124 M/sec
        4851271  cycles                    #   2,979 GHz
        2826307  stalled-cycles-frontend   #  58,26% frontend cycles idle
        1982447  stalled-cycles-backend    #  40,86% backend cycles idle
        5486165  instructions              #   1,13  insns per cycle
                                           #   0,52  stalled cycles/insn
         733173  branches                  # 450,185 M/sec
          14545  branch-misses             #   1,98% of all branches

    0,011564538 seconds time elapsed

Figure 4.8: perf analysis of the sequential code

[IEC]: ExecutorTime: 0.008880

Performance counter stats for './iec_quake_block test.in':

     149,981809  task-clock                #   0,089 CPUs utilized
           6237  context-switches          #   0,042 M/sec
             68  CPU-migrations            #   0,000 M/sec
           3430  page-faults               #   0,023 M/sec
      405697244  cycles                    #   2,705 GHz
      301365340  stalled-cycles-frontend   #  74,28% frontend cycles idle
      264438201  stalled-cycles-backend    #  65,18% backend cycles idle
      228071078  instructions              #   0,56  insns per cycle
                                           #   1,32  stalled cycles/insn
       51463569  branches                  # 343,132 M/sec
         865378  branch-misses             #   1,68% of all branches

    1,680210206 seconds time elapsed

Figure 4.9: perf analysis of the pipelined code

Since the pipelined code takes about 23 times longer than the sequential code, this would explain to some extent why there are more context switches, for example due to kernel processes. However, the high numbers are an indication that the quake concurrent processes become blocked very often, and therefore the scheduler decides to evict them from the CPU. This is backed by the high percentage of cycles during which the CPU frontend and backend are idle in the pipelined version. It is most likely caused by the overhead introduced by the communications between processes in the pipeline but, as perf is designed to measure single processes, it is hard to analyze the exact behaviour of parallel applications.

4.3.2 Parallel profiling of the pipelined code

More detailed information about the parallel program has been gathered using the TAU parallel profiler [SM06][TAU]. This profiling utility can instrument the executables by using binary instrumentation, where recompilation is not required, or by using an automatic instrumentation library at compile time. It also allows manual instrumentation of the code. Due to technical problems, the following experiments were not run on the cluster but on the multicore system.

Figure 4.10 shows the results of a profiled run of the pipelined code with the test input dataset. As before, the profile only captures the computational kernel. It can be immediately noticed that the vast majority of the time is spent in the MPI communication primitives. For reference:

• MPI_Waitany() is called by consumers waiting for data from the producers.

• MPI_Alltoallv() is used by all processes to receive data on the borders between partitions before starting the kernel execution (commonly known as halo swapping), and at the end of each iteration of the outer loop of the kernel to communicate the data back to the first processes for the next iteration.

• MPI_Waitall() is used by the producers to wait for all the sends to the consumers to complete at the end of each inner loop's iteration.

The figure clearly shows two groups of processes: 0, 5 and 10 do not use MPI_Waitany(), while the rest do. This is due to the fact that, in this computational kernel, the loops assigned to the processes in the first group are not consumers, only producers. An interesting observation is that the weight of this function increases with the node identifier: nodes 1 and 2 spend considerably less time in this function than the later nodes. This is because the node IDs are assigned in order to the inner loops in the kernel.


Figure 4.10: Distribution of execution time per thread among different functions

Therefore, node 14 is only a consumer and spends most of its time waiting for results from its producers. This effect comes from the fact that there is a synchronization point after each iteration of the external loop in the form of an MPI_Barrier(). The reason for placing this barrier is to limit the amount of information that must be held in order to keep processors in sync. In the absence of this barrier, processors with a lower computational load could overtake other processors and start computing later iterations. It should be possible to implement this with a less aggressive approach that allows processes to continue working without overtaking others by more than one iteration, but this is not currently implemented. Such an implementation would reduce the impact of MPI_Waitany() in the profile (and possibly MPI_Waitall()), thus improving performance.

Figures 4.11, 4.12 and 4.13 show the distribution of execution time across different functions in processes 0, 7 and 14. For each process, only the ten most time-consuming functions are shown. The values are exclusive, meaning that the time measured for each function excludes the time spent in other functions called from the measured function. As in Figure 4.10, these figures show that the vast majority of the time is spent in MPI communication primitives.

At the early stages of the design, one of the expected bottlenecks was that accessing the control data structures created by the inspector could affect performance. These data structures include those needed to know the array indexes that need to be sent at each iteration to each consumer, and likewise the indexes that must be received from the producers. They also include the synchronization variables that decide when a process has received enough data to carry on with the next iteration. Accessing these data structures does not seem to have a considerable impact according to the profile data.


Phase: Executor   Metric: Time   Value: Exclusive   Units: seconds
MPI_Alltoallv()
MPI_Waitall()
MPI_Alltoall()
MPI_Barrier()
void Inspector::pipe_send(int)
int local_data_double::pipe_populate_send_buf(int, int, char*)
void Inspector::GetBufferSize()
void pipe_send(int) C
MPI_Issend()
void setup_executor() C

Figure 4.11: Detailed profile in process 0

Phase: Executor   Metric: Time   Value: Exclusive   Units: seconds
MPI_Waitany()
MPI_Alltoallv()
MPI_Waitall()
MPI_Alltoall()
MPI_Barrier()
void Inspector::pipe_receive(int)
void Inspector::pipe_send(int)
void local_data_double::pipe_update(int, int, char*)
int local_data_double::pipe_populate_send_buf(int, int, char*)
bool Inspector::internal_issue_recv(int, int)

Figure 4.12: Detailed profile in process 7

Phase: Executor Metric: Time Value: Exclusive Units: seconds

MPI_Waitany() MPI_Alltoall() MPI_Alltoallv() MPI_Barrier() void Inspector::pipe_receive(int) void local_data_double::pipe_update(int, int, char*) bool Inspector::internal_issue_recv(int, int) void Inspector::GetBufferSize() bool Inspector::pipe_ready(int, int&) MPI_Irecv() Figure 4.13: Detailed profile in process 14

Barring the MPI primitives, the next functions with a noticeable effect on the timing are pipe_send(), pipe_receive(), pipe_populate_send_buf() and pipe_update(). The third function packs the data to be sent to the consumers from a specific local array into a send buffer. Similarly, the last one updates the local arrays with the data received from the producers in a receive buffer. The first two functions encapsulate the last two, applying them to all the arrays to be sent and carrying out the actual communications with MPI primitives. The profiles show that the time spent in these functions is between five and ten times smaller than the execution time of the original code, so they do not have a significant impact on the overall performance of the library.
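Conceptually, the packing and unpacking steps performed by pipe_populate_send_buf() and pipe_update() are gather/scatter operations over the index lists computed by the inspector. The following sketch shows the idea; the free functions, argument types and buffer layout are simplifying assumptions for illustration and do not reproduce the actual signatures of the library.

// Sketch of the gather (pack) and scatter (unpack) steps.
#include <cstddef>
#include <vector>

// Gather the array elements that a given consumer needs in this iteration
// into a contiguous send buffer (send_indexes was computed by the inspector).
int populate_send_buf(const std::vector<double>& local_array,
                      const std::vector<int>& send_indexes,
                      char* send_buf) {
    double* out = reinterpret_cast<double*>(send_buf);
    for (std::size_t i = 0; i < send_indexes.size(); ++i)
        out[i] = local_array[send_indexes[i]];
    return static_cast<int>(send_indexes.size() * sizeof(double));
}

// Scatter the values received from a producer back into the local array
// positions recorded by the inspector (recv_indexes).
void update_local_array(std::vector<double>& local_array,
                        const std::vector<int>& recv_indexes,
                        const char* recv_buf) {
    const double* in = reinterpret_cast<const double*>(recv_buf);
    for (std::size_t i = 0; i < recv_indexes.size(); ++i)
        local_array[recv_indexes[i]] = in[i];
}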

4.4 Conclusions

In this chapter, the feasibility of parallelizing the execution of sequences of loops so that they operate in a pipelined fashion has been demonstrated. This opens a new line of research that will allow the computational load of HPC applications to be partitioned automatically, not only in the data dimension but also in the loop dimension. This line has a promising future, as it provides additional flexibility when mapping processes to processors in the most efficient way, and also when using FPGAs in the heterogeneous system, since they are most effective with pipelined architectures but have limited resources.

We have presented a communication library that enables this style of pipelining in scientific simulations with unstructured datasets. The design of the library uses the inspector-executor approach to analyze the dataset dependencies efficiently at runtime. The approach has been tested in an HPC cluster with a widely used scientific application based on unstructured meshes, and the conclusion is that the implementation of the library and its integration with the Inspector/Executor approach do not have a significant impact on the performance of parallel simulations.

However, the performance analysis in Section 4.3 has shown important limitations in the basic implementation of the proposed approach, mainly due to the time spent in communications. The evidence collected suggests that the approach could be improved by the following actions:

• Apply the approach to a code with longer inner loops. The kernel in quake is split across 15 loops with varying computational loads. Some of the loops only initialize arrays, whereas others carry out computationally expensive operations, but in the end none of the loops is particularly intensive in terms of CPU time. An application with longer loops would increase the computation-to-communication ratio, thereby minimizing the cost of the pipeline transfers between the loops and taking more advantage of IEC/pipeline. This is similar to the effect of the dataset size when partitioning the data instead of the code, which was first pointed out by Gustafson [Gus88] in response to the pessimistic view of Amdahl's law with regard to parallel processing.

• Pack the messages of several iterations of the inner loops into one (see the sketch after this list). The messages transmitted at the end of each inner-loop iteration may be small, in which case they are dominated by the network latency rather than by its bandwidth. In theory, packing the messages of 10 iterations could divide the communication cost by a factor of 10; in practice, this factor will be smaller due to the network's throughput limitations.

• Apply loop fusion to remove some loops, decreasing the pipeline depth while increasing the computational load of the remaining loops. The optimal number of loops should be determined empirically but, at the moment, it is clear that quake has too many loops, compared to the overall amount of code inside its kernel, to make pipelining viable. Loop fusion is a standard optimization that can be carried out by any modern compiler, but it is driven by heuristic functions that would need specific tuning for IEC/pipeline.

• Remove the barrier at the end of each iteration of the external loop, as already explained earlier in this section. This would allow processes that have finished the current iteration to start with the next one, minimizing the waiting experienced by the processes situated at the end of the pipeline. However, this action would have a greater effect in algorithms where the inner loops are better load-balanced among each other.
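As a reference for the second bullet, the message-packing idea could be realized roughly as in the following sketch, where a producer accumulates the packed data of several inner-loop iterations and issues a single send per group. All names and the fixed aggregation factor are illustrative assumptions; in practice, the per-iteration sizes and offsets, which the inspector already computes, would also have to travel with the message so that the consumer can unpack it.

// Sketch of aggregating the messages of several iterations into one send.
#include <mpi.h>
#include <vector>

const int AGG = 10;  // iterations aggregated per message (tuning parameter)

struct AggregatedSend {
    std::vector<char> staging;  // packed data of up to AGG iterations
    int pending = 0;            // iterations accumulated so far

    void add(const char* packed, int bytes, int consumer, int tag,
             MPI_Comm comm) {
        staging.insert(staging.end(), packed, packed + bytes);
        if (++pending == AGG) flush(consumer, tag, comm);
    }

    void flush(int consumer, int tag, MPI_Comm comm) {
        if (pending == 0) return;
        // One message carries the traffic of 'pending' iterations, so the
        // per-message latency is paid once instead of 'pending' times.
        MPI_Send(staging.data(), static_cast<int>(staging.size()), MPI_BYTE,
                 consumer, tag, comm);
        staging.clear();
        pending = 0;
    }
};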

Future work should focus on implementing these performance improvements. In addition, making the compiler insert the calls to the IEC/pipeline library automatically would allow the approach to be applied to other benchmarks and real applications with minimal cost. Finally, integrating IEC/pipeline with existing sequential optimizations such as loop fusion or splitting, together with heuristics to aid load balancing, would allow the compiler to automatically rearrange the code across the loops and perform load balancing in the code itself, similarly to how it is already done with data partitioners.

An important remark is that, while the compiler does not currently insert the calls to the IEC/pipeline library, this is not a technical challenge. The compiler already inserts the calls to the original IEC library, and the new library only requires data already computed by IEC, plus some additional information that can be easily obtained at compile time. One example is the set of data dependencies between loops which, in LLVM and other popular compilers, can be obtained by means of def-use chains and the existing pointer-aliasing analyses.
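As an example of how such information could be collected, the following hedged sketch uses the LLVM C++ API to detect whether a later loop may depend on an earlier one, combining def-use chains for SSA values with the alias-analysis interface for dependencies that flow through memory. It is illustrative only: it is not the pass implemented in this work, and the exact API details vary between LLVM versions.

// Sketch of inter-loop dependence detection with def-use chains and alias analysis.
#include "llvm/Analysis/AliasAnalysis.h"
#include "llvm/Analysis/LoopInfo.h"
#include "llvm/Analysis/MemoryLocation.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

// Returns true if loop 'Later' may depend on values produced in 'Earlier'.
static bool loopsMayDepend(Loop *Earlier, Loop *Later, AAResults &AA) {
  for (BasicBlock *BB : Earlier->blocks()) {
    for (Instruction &I : *BB) {
      // Register (SSA) dependencies: a value defined in 'Earlier' that is
      // used by an instruction belonging to 'Later'.
      for (User *U : I.users())
        if (auto *UI = dyn_cast<Instruction>(U))
          if (Later->contains(UI))
            return true;

      // Memory dependencies: a store in 'Earlier' that may alias a load
      // executed in 'Later'.
      if (auto *St = dyn_cast<StoreInst>(&I))
        for (BasicBlock *LB : Later->blocks())
          for (Instruction &LI : *LB)
            if (auto *Ld = dyn_cast<LoadInst>(&LI))
              if (!AA.isNoAlias(MemoryLocation::get(St),
                                MemoryLocation::get(Ld)))
                return true;
    }
  }
  return false;
}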


Chapter 5

Conclusions and future work

This thesis has investigated the optimization of scientific codes based on unstructured datasets in heterogeneous systems. It has analyzed three problems that were found to be hindering the appearance of automated approaches. The thesis is better understood in the global context of the projects that it is part of, which aim to produce a series of tools that allow the efficient execution of codes with unstructured datasets in HPC clusters equipped with FPGA accelerator boards. There is little previous work on this subject, partly due to the high computing power, memory and communication demands of this type of application, and partly due to the tremendous effort required to implement them in reconfigurable hardware compared to CPUs or GPGPUs. Because of this, there are still multiple problems to be solved. This work has investigated three of them and proposed solutions that could be easily integrated into a complete parallelization tool.

5.1 Summary of the conclusions of the Thesis

Each chapter of the thesis has ended with a section that collected in detail the conclusions derived from the work described in it. Here, a summary of the major conclusions of the thesis is provided.

Chapter 2 presented a series of techniques to improve memory transfers between the FPGA and other parts of the system, mainly general-purpose CPUs. The limited bandwidth of FPGAs makes these techniques very valuable for the kernels assigned to the reconfigurable hardware, as they allow stream processing of the mesh with sequential memory access, which maximizes performance on FPGAs. We have also surveyed in great detail the problem of seed selection in traversal and sorting algorithms, and its effect on memory locality. The application of seed-selection algorithms gives us an extra edge over similar bandwidth-reduction algorithms.


A combination of compression and encoding techniques has been proposed for general unstructured meshes, together with a method to analyze the performance of the hardware required to implement these approaches. Experimental evidence showed that the proposed techniques can reduce the bandwidth needed to approximately 30% of that required before the optimizations.

Chapter 3 dealt with the problem of automated functional partitioning. In order to automate the development of concurrent programs distributed across different parts of an HPC system, the compiler must know how to group parts of the code into partitions and map these onto different processing elements. We have proposed an algorithm that allows the compiler to cluster different parts of the code according to the processing element they have been assigned to, and to insert the communications required to mimic the control flow of the original application. The algorithm has been implemented on top of LLVM, a production-grade optimizing compiler, and it is capable of automatically distributing the code of an application across multiple processors while preserving its runtime behaviour. It works on the internal representation used by the compiler, making it applicable to multiple programming languages and processor architectures. Additionally, a communication library has been developed. The experiments showed that the library introduces very little overhead compared to existing communication approaches such as various implementations of Remote Procedure Calls.

Chapter 4 brought the problem of parallelization automation closer to the mesh-based applications discussed in Chapter 2. Again, a compiler is used to partition a program and distribute it across a parallel architecture, in this case the ROSE compiler infrastructure and, more specifically, IEC, which already implements data partitioning techniques in order to automate parallelization. The computational kernels of scientific simulations based on unstructured meshes typically consist of an outer loop with a sequence of loops inside. The compiler is capable of inserting a skeleton of the code before the kernel so that data dependencies can be analyzed at runtime, using the inspector-executor paradigm. These dependencies can be automatically translated into communications between loops in the kernel by means of a communication library, so that different loops can be put in a pipeline. Code partitioning in this manner is compatible with the more traditional data partitioning, providing greater flexibility. Although the results showed that the overhead introduced by this approach heavily penalizes overall performance, the work proves that the approach is feasible and obtains the same functional behaviour. Moreover, the small overhead introduced by the inspector and the executor suggests that the single major problem is the communication overhead. A series of improvements to the communication library have been proposed, and the initial estimation of the performance uplift that could be obtained with them suggests that the approach could become viable.


In Chapter 1, we enumerated four challenges: memory bandwidth limitations, irregular data accesses, code transformations for distributed systems and automated insertion of communications. These challenges glue the different chapters together. As an example, the CFD application used throughout the text has long loops with computationally expensive sections in some of them. These expensive loops could be put in a pipeline using the techniques proposed in Chapter 4. Additionally, if only part of a loop is computationally intensive or more suitable for an FPGA, or if the loop does not fit in the FPGA, it could be further split using the methods suggested in Chapter 3. At the same time, the data transfer requirements of the code assigned to the FPGA can be further optimized using the streaming techniques introduced in Chapter 2. We claim that these techniques, put together, will improve the current state of the art when evaluated against the four challenges mentioned above.

Integrating all these techniques into a single compiler infrastructure would require a considerable effort; at a minimum, a complete toolchain would require solving the problem of cost estimation for the different partitioning solutions. It is therefore out of the scope of this work.

5.2 Future lines of work

While the ultimate objective of this work is to create a fully automated toolchain for heterogeneous systems with FPGA accelerators, such an objective is too ambitious for a single thesis. There are still too many problems to be solved, and the state of existing tools and techniques must be greatly improved to make it a reality. Furthermore, it is still unclear nowadays whether full automation of application parallelization can be achieved even if FPGA accelerators are removed from the problem and only general-purpose CPUs are considered. The survey of previous related work has shown several examples of the challenges involved.

Several proposals for future work have been given in the conclusion sections of the previous chapters. Additionally, in this section we give a series of ideas for future work that could follow from the work already done in this thesis. These proposals are substantial enough to constitute separate lines of work.

One of the problems left out of this work is the extension of the streaming approach to a cluster with multiple FPGAs. So far, we have assumed that the complete mesh can be processed in a single FPGA. If multiple FPGAs were to be used, the format of the streamed mesh should account for the required communications between accelerators and for how these affect the vertex windows. Being able to partition the mesh would further improve the scalability of the solution towards larger distributed systems and gigantic meshes for which the memory inside one accelerator cannot hold the entire mesh.


Although we have presented two different approaches to partitioning the specification in a compiler, the decision of how to create the partitions is still made manually in Chapter 3, or in a simple and inflexible way in Chapter 4. An estimation algorithm, possibly driven by heuristics, should be able to guide the partitioning towards better performance results and further automate the process. Due to the technical complexity of creating a suitable algorithm, its development was out of the scope of this work from the beginning.

For practical reasons, the two automation approaches have been implemented on top of two different compilers. One desirable line of work would be to unite them in a single toolchain, probably LLVM due to its production-grade quality, its popularity, and the strong community behind it. Additionally, some ongoing projects from the Integrated Systems Laboratory (LSI) at the Universidad Politécnica de Madrid are related to the work proposed in this thesis. For example, a compiler-based approach for the compilation of computational kernels on FPGAs is being developed at the time of writing. Integrating these works under a common infrastructure would make the task of porting existing applications to FPGA-based systems easier.

Finally, since the automation techniques presented in the later chapters have only been tested on homogeneous systems, they have been implemented with some technologies not currently applicable to FPGAs. A hardware implementation of a subset of the MPI standard would allow the developed libraries to be used with relatively low effort. This would not only benefit the current work, but could also be applied to other applications to be ported to FPGAs.

Bibliography

[AAA+02] N. R. Adiga, et al. An Overview of the BlueGene/L Supercomputer. In Supercomputing, ACM/IEEE 2002 Conference, pages 60–60, November 2002.

[ABDS02] V. Adve, R. Bagrodia, E. Deelman, and R. Sakellariou. Compiler-optimized simulation of large-scale applications on high performance architectures. Journal of Parallel and Distributed Computing, 62(3):393–426, March 2002.

[ADK+09] K. Asanovic, et al. A view of the parallel computing landscape. Commun. ACM, 52(10):56–67, October 2009.

[AFMP95] F. André, M. Le Fur, Y. Maho, and J. Pazat. The Pandore data-parallel compiler and its portable runtime. In High-Performance Computing and Networking, Lecture Notes in Computer Science, pages 176–183. Springer Berlin Heidelberg, May 1995.

[AFP+11] M. Adler, K. E. Fleming, A. Parashar, M. Pellauer, and J. Emer. Leap scratchpads: automatic memory and cache management for reconfigurable logic. In Proc. 19th ACM/SIGDA Int'l Symp. Field Programmable Gate Arrays, pages 25–28, 2011.

[AIO+11] T. Akamine, K. Inakagata, Y. Osana, N. Fujita, and H. Amano. An implementation of out-of-order execution system for acceleration of computational fluid dynamics on FPGAs. ACM SIGARCH Computer Architecture News, 39(4):50–55, 2011.

[AIO+12] T. Akamine, K. Inakagata, Y. Osana, N. Fujita, and H. Amano. Reconfigurable out-of-order mechanism generator for unstructured grid computation in computational fluid dynamics. In Field Programmable Logic and Applications (FPL), 2012 22nd International Conference on, pages 136–142. IEEE, 2012.


[BBB+06] F. Bertrand, R. Bramley, D. E. Bernholdt, J. A. Kohl, A. Sussman, J. W. Larson, and K. B. Damevski. Data redistribution and remote method invocation for coupled components. J. Parallel Distrib. Comput., 66(7):931–946, July 2006.

[BBG+96] H. Bao, J. Bielak, O. Ghattas, L. F. Kallivokas, D. R. O’Hallaron, J. R. Shewchuk, and J. Xu. Earthquake Ground Motion Modeling on Parallel Computers. In Proceedings of the 1996 ACM/IEEE Conference on Supercomputing, Supercomputing ’96, Washington, DC, USA, 1996. IEEE Computer Society.

[BC11] P. Barrio and C. Carreras. Mesh traversal and sorting for efficient memory usage in scientific codes. In Proc. of the 30th IEEE Int'l Performance Computing and Communications Conf. (IPCCC), Nov. 2011.

[BC12] P. Barrio and C. Carreras. Turning Control Flow Graphs into Function Call Graphs: Transformation of Partitioned Codes for Execution in Heterogeneous Architectures. In 2012 European Conference on the LLVM Compiler Infrastructure, http://llvm.org/devmtg/2012-04-12, 2012.

[BCG+95] P. Banerjee, J. A. Chandy, M. Gupta, E. W. Hodges, J. G. Holm, A. Lain, D. J. Palermo, S. Ramaswamy, and E. Su. The Paradigm compiler for distributed-memory multicomputers. Computer, 28(10):37–47, October 1995.

[BCK+14] J. Bezanson, J. Chen, S. Karpinski, V. Shah, and A. Edelman. Array Operators Using Multiple Dispatch: A Design Methodology for Array Implementations in Dynamic Languages. In Proceedings of ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming, ARRAY'14, pages 56:56–56:61, New York, NY, USA, 2014. ACM.

[BCL+14] P. Barrio, C. Carreras, J. López, O. Robles, R. Jevtic, and R. Sierra. Memory optimization in FPGA-accelerated scientific codes based on unstructured meshes. Journal of Systems Architecture, 60(7):579–591, August 2014.

[BCS+] P. Barrio, C. Carreras, R. Sierra, J. A. López, G. Caffarena, E. Sedano, J. A. Fernández, and R. Jevtic. Systems and methods for improving the execution of computational algorithms. Patent number: US 9,311,433, Country: US, Owners: Airbus España and Universidad Politécnica de Madrid.

[BCS+12] P. Barrio, C. Carreras, R. Sierra, T. Kenter, and C. Plessl. Turning control flow graphs into function calls: Code generation for heterogeneous architectures. In 2012 Int'l Conf. on High Performance Computing and Simulation (HPCS), pages 559–565, July 2012.

[BDH+10] A. R. Brodtkorb, C. Dyken, T. R. Hagen, J. M. Hjelmervik, and O. O. Storaasli. State-of-the-art in heterogeneous computing. Scientific Programming, 18(1):1–33, January 2010.

[Ber16] Berten. GPU vs FPGA Performance Comparison. Technical report, Berten Digital Signal Processing, May 2016.

[BFN13] D. Buttlar, J. Farrell, and B. Nichols. PThreads Programming: A POSIX Standard for Better Multiprocessing. O’Reilly Media, Inc., January 2013.

[BGA12] A. Baldassin, F. Goldstein, and R. Azevedo. A transactional runtime system for the Cell/BE architecture. J. Parallel Distrib. Comput., 72(12):1535–1546, December 2012.

[Blo92] J. Bloomer. Power Programming with RPC. O'Reilly & Associates, Inc., 1992.

[BN84] A. D. Birrell and B. J. Nelson. Implementing remote procedure calls. ACM Trans. Comput. Syst., 2(1):39–59, February 1984.

[Bon02] D. Bonachea. GASNet specification, v1.1. Technical Report CSD-02-1207, University of California, Berkeley, October 2002.

[BRV99] M. Brune, A. Reinefeld, and J. Varnholt. A resource description environment for distributed computing systems. In The Eighth International Symposium on High Performance Distributed Computing, 1999. Proceedings, pages 279–286. IEEE, 1999.

[BSY+10] K. Bertels, et al. HArtes: Hardware-software codesign for heterogeneous multicore platforms. IEEE Micro, 30(5):88–97, October 2010.

[Cac] Cachegrind: a cache and branch-prediction profiler. http:// valgrind.org/docs/manual/cg-manual.html.


[CBD+07] U. V. Catalyurek, E. G. Boman, K. D. Devine, D. Bozdag, R. Heaphy, and Lee Ann Riesen. Hypergraph-based Dynamic Load Balancing for Adaptive Scientific Computations. In 2007 IEEE International Parallel and Distributed Processing Symposium, pages 1–11, March 2007.

[CCE99] C. Chang, G. Czajkowski, and T. Von Eicken. MRPC: a high performance RPC system for MPMD parallel computing. Softw. Pract. Exper., 29(1):43–66, January 1999.

[CCLF10] G. Caffarena, C. Carreras, J. A. López, and A. Fernández. SQNR estimation of fixed-point DSP algorithms. EURASIP J. Adv. Signal Process, 2010:21:1–21:12, February 2010.

[CCLH10] B. Cope, P.Y.K. Cheung, W. Luk, and L. Howes. Performance comparison of graphics processors to reconfigurable logic: A case study. IEEE Trans. on Computers, 59(4):433–448, Apr. 2010.

[CDD+10] P. Cooper, U. Dolinsky, A. F. Donaldson, A. Richards, C. Riley, and G. Russell. Offload: automating code migration to heterogeneous multicore systems. In High Perf. Embedded Archs. and Compilers, volume 5952, pages 337–352. Springer Berlin Heidelberg, 2010.

[CDW10] J. M. P. Cardoso, P. C. Diniz, and M. Weinhardt. Compiling for reconfigurable computing: A survey. ACM Comput. Surv., 42(4):13:1–13:65, June 2010.

[CGS+05] P. Charles, C. Grothoff, V. Saraswat, C. Donawa, A. Kielstra, K. Ebcioglu, C. von Praun, and V. Sarkar. X10: An Object-oriented Approach to Non-uniform Cluster Computing. In Proceedings of the 20th Annual ACM SIGPLAN Conference on Object-oriented Programming, Systems, Languages, and Applications, OOPSLA '05, pages 519–538, New York, NY, USA, 2005. ACM.

[CHM11] E. S. Chung, J. C. Hoe, and K. Mai. CoRAM: an in-fabric memory architecture for FPGA-based computing. In Proc. of the 19th ACM/SIGDA Int'l Symp. on Field Programmable Gate Arrays, pages 97–106, 2011.

[CJdP08] B. Chapman, G. Jost, and R. Van der Pas. Using OpenMP: Portable Shared Memory Parallel Programming. MIT Press, 2008.


[CLR+01] T. H. Cormen, C. E. Leiserson, R. L. Rivest, C. Stein, et al. Introduction to algorithms, volume 2. MIT Press, Cambridge, 2001.

[CLS+08] S. Che, J. Li, J. W. Sheaffer, K. Skadron, and J. Lach. Accelerating compute-intensive applications with GPUs and FPGAs. In Symp. on Application Specific Processors (SASP’08), pages 101–107, Jun. 2008.

[CSS+13] H. Chen, L. Shi, J. Sun, K. Li, and L. He. A fast RPC system for virtual machines. IEEE Trans. Parallel Distrib. Syst., 24(7):1267– 1276, 2013.

[Din03] P. C. Diniz. A Compiler Approach to Performance Prediction Using Empirical-Based Modeling. In Computational Science ICCS 2003, pages 916–925. Springer, Berlin, Heidelberg, June 2003.

[DMM+10] A. Danalis, G. Marin, C. McCurdy, J. S. Meredith, P. C. Roth, K. Spafford, V. Tipparaju, and J. S. Vetter. The scalable heterogeneous computing (SHOC) benchmark suite. In Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, GPGPU '10, pages 63–74, New York, NY, USA, 2010. ACM.

[DMnCN12] J. Diaz, C. Muñoz Caro, and A. Niño. A Survey of Parallel Programming Models and Tools in Multi and Many-Core Era. IEEE Transactions on Parallel and Distributed Systems, 23(8):1369–1386, August 2012.

[DSS07] D. Szczerba, R. McGregor, and G. Székely. High quality surface mesh generation for multi-physics bio-medical simulations. Lecture Notes in Computer Science, Springer, 4487:906–913, 2007.

[DVCS08] H. Devos, J. Van Campenhout, and D. Stroobandt. Building an application-specific memory hierarchy on FPGA. In 2nd HiPEAC Workshop on Reconfigurable Computing, page 53, 2008.

[EAMEG11] E. El-Araby, S. G. Merchant, and T. El-Ghazawi. A framework for evaluating high-level design methodologies for high-performance reconfigurable computers. IEEE Trans. on Parallel and Distributed Sys., 22:33–45, 2011.

[Edw06] S. A. Edwards. The challenges of synthesizing hardware from C-like languages. IEEE Design & Test of Computers, 23(5):375–386, 2006.


[EFF00] V. Engelson, D. Fritzson, and P. Fritzson. Lossless compression of high-volume numerical data from simulations. In Proceedings of the Conference on Data Compression, DCC ’00, page 574, Washington, DC, USA, 2000. IEEE Computer Society.

[EGS06] T. El-Ghazawi and L. Smith. UPC: Unified Parallel C. In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, SC '06, New York, NY, USA, 2006. ACM.

[ESV96] F. Evans, S. Skiena, and A. Varshney. Optimizing triangle strips for fast rendering. In Proc. of IEEE Symp. on Visualization, pages 319–326, 1996.

[Eve79] G. C. Everstine. A comparison of three resequencing algorithms for the reduction of matrix profile and wavefront. International Journal for Numerical Methods in Engineering, 14(6):837–853, 1979.

[Fei00] U. Feige. Coping with the NP-Hardness of the graph bandwidth problem. In Proceedings of the 7th Scandinavian Workshop on Algorithm Theory, SWAT '00, pages 10–19, London, UK, 2000. Springer-Verlag.

[FHK+06] K. Fatahalian, et al. Sequoia: Programming the Memory Hierarchy. In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, SC '06, New York, NY, USA, 2006. ACM.

[GGF+] T. Gerhold, M. Galle, O. Friedrich, and J. Evans. Calculation of complex three-dimensional configurations employing the DLR-TAU code. In 35th Aerospace Sciences Meeting and Exhibit. American Institute of Aeronautics and Astronautics. DOI: 10.2514/6.1997-167.

[GGK01] C. Gotsman, S. Gumhold, and L. Kobbelt. Simplification and compression of 3D meshes. In European Summer School on Principles of Multiresolution in Geometric Modelling, page 44, Munich, 2001.

[GGL12] T. Grosser, A. Groesslinger, and C. Lengauer. Polly: performing polyhedral optimizations on a low-level intermediate representation. Parallel Processing Letters, 22(04):1250010, December 2012.

[GHK+12] B. Gaster, L. Howes, D. R. Kaeli, P. Mistry, and D. Schaa. Heterogeneous Computing with OpenCL: Revised OpenCL 1.2 Edition. Newnes, December 2012.


[Gid12] Gidel. Gidel PROCStarIII, 2012.

[GR09a] C. Geuzaine and J. Remacle. Gmsh: A 3-D finite element mesh generator with built-in pre- and post-processing facilities. International Journal for Numerical Methods in Engineering, 79(11):1309–1331, 2009.

[GR09b] T. Gurung and J. Rossignac. SOT: compact representation for tetrahedral meshes. In 2009 SIAM/ACM Joint Conference on Geometric and Physical Modeling, SPM '09, pages 79–88, New York, NY, USA, 2009. ACM.

[gra16] November 2016 — graph 500. http://www.graph500.org/results_nov_2016, November 2016.

[GS99] M. B. Gokhale and J. M. Stone. Automatic allocation of arrays to memories in FPGA processors with multiple memory banks. In Proc. IEEE Symp. Field-Programmable Custom Computing Machines, page 63, 1999.

[Gus88] J. L. Gustafson. Reevaluating Amdahl’s law. Commun. ACM, 31(5):532–533, May 1988.

[GY06] J. Gross and J. Yellen. Graph Theory and its Applications. Chapman & Hall/CRC, 2006.

[HAA+96] M. W. Hall, J. M. Anderson, S. P. Amarasinghe, B. R. Murphy, S. Liao, E. Bugnion, and M. S. Lam. Maximizing multiprocessor performance with the SUIF compiler. Computer, 29(12):84–89, December 1996.

[Har94] F. Harary. Graph Theory. Westview Press, October 1994.

[IG03] M. Isenburg and S. Gumhold. Out-of-core compression for gigantic polygon meshes. ACM Trans. Graph., 22(3):935–942, July 2003.

[IL05] M. Isenburg and P. Lindstrom. Streaming meshes. In IEEE Visualization, 2005. VIS 05, pages 231–238, October 2005.

[ILGS03] M. Isenburg, P. Lindstrom, S. Gumhold, and J. Snoeyink. Large mesh simplification using processing sequences. In Proc. of IEEE Symp. on Visualization, pages 465–472, 2003.


[JNH16] B. Janßen, M. Naserddin, and M. Hübner. A Hardware/Software Co-Design Approach for Control Applications with Static Real-Time Reallocation. In 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pages 241–246, May 2016.

[JVTB17] P. Jääskeläinen, T. Viitanen, J. Takala, and H. Berg. HW/SW Co-design Toolset for Customization of Exposed Datapath Processors. In Waqar Hussain, Jari Nurmi, Jouni Isoaho, and Fabio Garzia, editors, Computing Platforms for Software-Defined Radio, pages 147–164. Springer International Publishing, 2017.

[JWP33] E. N. Jacobs, K. E. Ward, and R. M. Pinkerton. The characteristics of 78 related airfoil sections from tests in the variable-density wind tunnel. Technical report, 1933.

[KC11] A. Ketterlin and P. Clauss. Efficient memory tracing by program skeletonization. In (IEEE ISPASS) IEEE International Symposium on Performance Analysis of Systems and Software, pages 97–106, April 2011.

[KD09] N. Kapre and A. DeHon. Performance comparison of single-precision SPICE model-evaluation on FPGA, GPU, Cell, and multi-core processors. In Int'l Conf. Field Programmable Logic and Applications (FPL'09), pages 65–72, Aug. 2009.

[KdA07] A. Khamayseh and V. de Almeida. Adaptive hybrid mesh refinement for multiphysics applications. J. Physics: Conf. Series, 78, 2007.

[KDW10] S. Kestur, J. D. Davies, and O. Williams. BLAS comparison on FPGA, CPU and GPU. In IEEE Annual Symp. on VLSI, pages 288–293, Jul. 2010.

[KK98] G. Karypis and V. Kumar. A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs. SIAM Journal on Scientific Computing, 20(1):359–392, January 1998.

[Kor95] R. Korf. Space-efficient search algorithms. ACM Computing Surveys, 27(3):337–339, 1995.

[KPN+09] G. Kalokerinos, V. Papaefstathiou, G. Nikiforos, S. Kavadias, M. Katevenis, D. Pnevmatikatos, and Xiaojun Yang. FPGA implementation of a configurable cache/scratchpad memory with virtualized user-level RDMA capability. In International Symposium on Systems, Architectures, Modeling, and Simulation, 2009. SAMOS '09, pages 149–156, July 2009.

[KSY10] K. Katahira, K. Sano, and S. Yamamoto. FPGA-based lossless compressors of floating-point data streams to enhance memory bandwidth. In 2010 21st IEEE International Conference on Application-specific Systems Architectures and Processors (ASAP), pages 246–253, 2010.

[KZC16] W. Kemmerer, W. Zuo, and D. Chen. Parallel Code-specific CPU Simulation with Dynamic Phase Convergence Modeling for HW/SW Co-design. In Proceedings of the 35th International Conference on Computer-Aided Design, ICCAD '16, pages 79:1–79:8, New York, NY, USA, 2016. ACM.

[LA04] C. Lattner and V. Adve. LLVM: A compilation framework for lifelong program analysis & transformation. In International Symposium on Code Generation and Optimization, 2004. CGO 2004, pages 75–86. IEEE, March 2004.

[LD98] A. Lakhotia and J. Deprez. Restructuring programs by tucking statements into functions. Information and Software Technology, 40(11–12):677–689, December 1998.

[Lee10] J. Lee. Introduction to topological manifolds, volume 202. Springer, 2010.

[LPS+15] D. LaSalle, M. M. A. Patwary, N. Satish, N. Sundaram, P. Dubey, and G. Karypis. Improving Graph Partitioning for Modern Graphs and Architectures. In Proceedings of the 5th Workshop on Irregular Applications: Architectures and Algorithms, IA3 '15, pages 14:1–14:4, New York, NY, USA, 2015. ACM.

[LQLM12] J. Lidman, D. J. Quinlan, C. Liao, and S. A. McKee. ROSE::FTTransform - A source-to-source translation framework for exascale fault-tolerance research. In IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012), pages 1–6, June 2012.

[LQVP10] C. Liao, D. J. Quinlan, R. Vuduc, and T. Panas. Effective source-to-source outlining to support whole program empirical optimization. In Guang R. Gao, Lori L. Pollock, John Cavazos, and Xiaoming Li, editors, Languages and Compilers for Parallel Computing, number 5898 in Lecture Notes in Computer Science, pages 308–322. Springer Berlin Heidelberg, January 2010.

[LQWP09] C. Liao, D. J. Quinlan, J. J. Willcock, and T. Panas. Extending Automatic Parallelization to Optimize High-Level Abstractions for Multicore. In Matthias S. Müller, Bronis R. de Supinski, and Barbara M. Chapman, editors, Evolving OpenMP in an Age of Extreme Parallelism, Lecture Notes in Computer Science, pages 28–41. Springer Berlin Heidelberg, June 2009.

[LS04] Z. Li and Y. Song. Automatic tiling of iterative stencil loops. ACM Trans. Program. Lang. Syst., 26(6):975–1028, November 2004.

[LYS+13] C. Liao, Y. Yan, B. R. Supinski, D. J. Quinlan, and B. Chapman. Early Experiences with the OpenMP Accelerator Model. In Alistair P. Rendell, Barbara M. Chapman, and Matthias S. Müller, editors, OpenMP in the Era of Low Power Devices and Accelerators, Lecture Notes in Computer Science, pages 84–98. Springer Berlin Heidelberg, September 2013.

[Mag] Magerit - nodes / Intel Xeon. http://www.cesvima.upm.es/services/hpc/magerit.

[MC98] T. Mitra and T. Chiueh. A Breadth-First Approach to Efficient Mesh Traversal. In Proc of the 1998 SIGGRAPH Workshop, pages 31–38, 1998.

[MC02] T. Mitra and T. Chiueh. An FPGA Implementation of Triangle Mesh Decompression. In Proc. of the IEEE Symp. on Field-Programmable Custom Computing Machines, FCCM'02, pages 22–31, 2002.

[MCT96] K. S. McKinley, S. Carr, and C. Tseng. Improving data locality with loop transformations. ACM Trans. Program. Lang. Syst., 18(4):424–453, July 1996.

[MOS16] I. Mhadhbi, S. B. Othman, and S. B. Saoud. A Comprehensive Survey on Hardware/Software Partitioning Process in Co-Design. International Journal of Computer Science and Information Security; Pittsburgh, 14(3):263–279, 2016.

[MPI] MPICH — High-performance Portable MPI. http://www.mpich.org/.


[Muc97] S. Muchnick. Advanced Compiler Design and Implementation, chapter 7. Morgan Kaufmann, August 1997.

[MVA] MVAPICH: MPI over InfiniBand, 10GigE/iWARP and RoCE. http://mvapich.cse.ohio-state.edu.

[MWBA10] R. C. Murphy, K. B. Wheeler, B. W. Barrett, and J. A. Ang. Introducing the graph 500. Cray Users Group (CUG), 2010.

[NCC13] C. Nugteren, P. Custers, and H. Corporaal. Algorithmic Species: A Classification of Affine Loop Nests for Parallel Programming. ACM Trans. Archit. Code Optim., 9(4):40:1–40:25, January 2013.

[NFMM13] M. Naylor, P. J. Fox, A. T. Markettos, and S. W. Moore. Managing the FPGA memory wall: Custom computing or vector processing? In 2013 23rd International Conference on Field Programmable Logic and Applications (FPL), pages 1–6, 2013.

[NNH+12] Z. Nagy, C. Nemes, A. Hiba, A. Kiss, A. Csik, and P. Szolgay. FPGA based acceleration of computational fluid flow simulation on unstructured mesh geometry. In 2012 22nd International Conference on Field Programmable Logic and Applications (FPL), pages 128–135, 2012.

[NR98] R. W. Numrich and J. Reid. Co-array Fortran for Parallel Programming. SIGPLAN Fortran Forum, 17(2):1–31, August 1998.

[NS05] P. Nalabalapu and R. Sass. Bandwidth management with a reconfigurable data cache. In Proc. of the 19th IEEE Int'l Parallel and Distributed Processing Symp., Apr. 2005.

[Ope] OpenMPI: Open source high performance computing. http://www.open-mpi.org.

[Ozt11] O. Ozturk. Data locality and parallelism optimization using a constraint-based approach. J. Parallel Distrib. Comput., 71(2):280–287, February 2011.

[PAB+05] D. Pham, et al. The design and implementation of a first-generation CELL processor. In ISSCC. 2005 IEEE International Digest of Technical Papers. Solid-State Circuits Conference, 2005., pages 184–592 Vol. 1, February 2005.


[PCB+06] S. Pop, A. Cohen, C. Bastoul, S. Girbal, G.-A. Silber, and N. Vasilache. GRAPHITE: polyhedral analyses and optimizations for GCC. In Proceedings of the 4th GCC Developers Summit, June 2006.

[PD11] J. Park and P. C. Diniz. Data reorganization and prefetching of pointer-based data structures. IEEE Design and Test of Computers, 28:38–47, 2011.

[Pen02] O. Pentakalos. An Introduction to the Infiniband Architecture. O’Reilly Media, Inc., 2002.

[PKB03] P. M. Papadopoulos, M. J. Katz, and G. Bruno. NPACI rocks: tools and techniques for easily deploying manageable linux clusters. Concurr. & Comp.: Pract. Exper., 15(7-8):707–725, 2003.

[Pou] L. Pouchet. PolyBench homepage. http://www.cse.ohio-state.edu/~pouchet/software/polybench/.

[Qui00] D. Quinlan. ROSE: Compiler support for object-oriented frameworks. Parallel Processing Letters, 10(02n03):215–226, June 2000.

[RDLQ12] S. Royuela, A. Durán, C. Liao, and D. J. Quinlan. Auto-scoping for OpenMP Tasks. In Barbara M. Chapman, Federico Massaioli, Matthias S. Müller, and Marco Rorro, editors, OpenMP in a Heterogeneous World, Lecture Notes in Computer Science, pages 29–43. Springer Berlin Heidelberg, June 2012. DOI: 10.1007/978-3-642-30961-8_3.

[Rei10] J. Reinders. Intel Threading Building Blocks: Outfitting C++ for Multi-core Processor Parallelism. O’Reilly Media, Inc., March 2010.

[REP+12] M. Ravishankar, J. Eisenlohr, L. Pouchet, J. Ramanujam, A. Rountev, and P. Sadayappan. Code Generation for Parallel Execution of a Class of Irregular Loops on Distributed Memory Systems. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC '12, pages 72:1–72:11, Los Alamitos, CA, USA, 2012. IEEE Computer Society Press.

[Ros99] J. Rossignac. Edgebreaker: Connectivity compression for triangle meshes. IEEE Trans. on Visualization and Computer Graphics, 5(1):47–61, 1999.


[RT98] G. Rivera and C. Tseng. Data transformations for eliminating conflict misses. In Proceedings of the ACM SIGPLAN 1998 Conference on Programming Language Design and Implementation, PLDI '98, pages 38–49, New York, NY, USA, 1998. ACM.

[SAA+08] R. Susukita, et al. Performance prediction of large-scale parallell system and application using macro-level simulation. In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, SC '08, pages 20:1–20:9, Piscataway, NJ, USA, 2008. IEEE Press.

[SAK07] M. Slee, A. Agarwal, and M. Kwiatkowski. Thrift: Scalable cross-language services implementation. Facebook White Paper, 2007.

[Sal04] D. Salomon. Data Compression: The Complete Reference. Springer, 2004.

[Scaa] The Scala programming language. http://www.scala-lang.org/.

[Scab] Graph for Scala. http://www.scala-graph.org/.

[Sch10] P. R. Schaumont. A Practical Introduction to Hardware/Software Codesign. Springer, September 2010.

[SCMB90] J. Saltz, K. Crowley, R. Mirchandaney, and H. Berryman. Run-time scheduling and execution of loops on message passing machines. Journal of Parallel and Distributed Computing, 8(4):303–312, April 1990.

[SGS10] J. E. Stone, D. Gohara, and G. Shi. OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems. Computing in Science Engineering, 12(3):66–73, May 2010.

[Shi06] K. Shimada. Current trends and issues in automatic mesh generation. Computer-Aided Design and Appl., 3(6):741–758, 2006.

[SHY14] K. Sano, Y. Hatsuda, and S. Yamamoto. Multi-FPGA accelerator for scalable stencil computation with constant memory bandwidth. IEEE Transactions on Parallel and Distributed Systems, 25(3):695– 705, March 2014.

[SHZRM+16] J. C. Salinas-Hilburg, M. Zapater, J. L. Risco-Martín, J. M. Moya, and J. L. Ayala. Unsupervised Power Modeling of Co-allocated Workloads for Energy Efficiency in Data Centers. In Proceedings of the 2016 Conference on Design, Automation & Test in Europe, DATE '16, pages 1345–1350, San Jose, CA, USA, 2016. EDA Consortium.

[SK10] J. Sanders and E. Kandrot. CUDA by Example: An Introduction to General-Purpose GPU Programming. Addison-Wesley Professional, July 2010.

[SKH09] L. Scandolo, C. Kunz, and M. Hermenegildo. Program Parallelization Using Synchronized Pipelining. In Logic-Based Program Synthesis and Transformation, pages 173–187. Springer, Berlin, Heidelberg, September 2009.

[SKS+07] R. Sass, W. V. Kritikos, A. G. Schmidt, S. Beeravolu, and P. Beeraka. Reconfigurable computing cluster (RCC) project: Investigating the feasibility of FPGA-Based petascale computing. In 15th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, 2007. FCCM 2007, pages 127–140. IEEE, April 2007.

[SLC12] E. Sedano, J. A. López, and C. Carreras. A fast interpolative wordlength optimization method for DSP systems. In 2012 VIII Southern Conf. on Programmable Logic (SPL), pages 1–6, March 2012.

[SLC+16] M. M. Strout, A. LaMielle, L. Carter, J. Ferrante, B. Kreaseck, and C. Olschanowsky. An approach for code generation in the Sparse Polyhedral Framework. Parallel Computing, 53:32–57, April 2016.

[SLL01] J. G. Siek, L. Lee, and A. Lumsdaine. The Boost Graph Library: User Guide and Reference Manual. Pearson Education, 2001.

[SM06] S. S. Shende and A. D. Malony. The TAU parallel performance system. International Journal of High Performance Computing Applications, 20(2):287–311, May 2006.

[SNM+02] K. Seymour, H. Nakada, S. Matsuoka, J. Dongarra, C. Lee, and H. Casanova. Overview of GridRPC: a remote procedure call API for grid computing. In Manish Parashar, editor, Grid Computing GRID 2002, volume 2536 of Lecture Notes in Computer Science, pages 274–278. Springer Berlin / Heidelberg, 2002.

[SPE] SPEC CPU2000 benchmark. https://www.spec.org/cpu2000.


[SQJ11] J. Shalf, D. Quinlan, and C. Janssen. Rethinking Hardware-Software codesign for exascale systems. Computer, 44(11):22–30, November 2011.

[SRSLB+11] D. Sánchez-Román, G. Sutter, S. López-Buedo, I. González, F. Gómez-Arribas, and J. Aracil. An Euler solver accelerator in FPGA for computational fluid dynamics applications. In Programmable Logic (SPL), 2011 VII Southern Conference on, pages 149–154. IEEE, 2011.

[SRW+13] M. J. Sottile, C. E. Rasmussen, W. N. Weseloh, R. W. Robey, D. Quinlan, and J. Overbey. ForOpenCL: Transformations Exploiting Array Syntax in Fortran for Accelerator Programming. Int. J. Comput. Sci. Eng., 8(1):47–57, February 2013.

[SSX08] S. Sodhi, J. Subhlok, and Q. Xu. Performance prediction with skeletons. Cluster Computing, 11(2):151–165, 2008.

[TAU] TAU - Tuning and Analysis Utilities. https://www.cs.uoregon.edu/research/tau/home.php.

[TB10] X. Tian and K. Benkrid. High-performance quasi-Monte Carlo financial simulation: FPGA vs. GPP vs. GPU. ACM Trans. Reconfigurable Technol. Syst., 3:26:1–26:22, Nov. 2010.

[The12] The Message Passing Interface Forum. MPI: A Message-Passing Interface Standard, Version 3.0. High Performance Computing Center Stuttgart (HLRS), 2012.

[TNS+03] Y. Tanaka, H. Nakada, S. Sekiguchi, T. Suzumura, and S. Matsuoka. Ninf-g: A reference implementation of RPC-based programming middleware for grid computing. Journal of Grid Computing, 1(1):41–51, 2003.

[top13] June 2013 — Top500 Sites. https://www.top500.org/lists/2016/11/, June 2013.

[Tse89] P. Tseng. A parallelizing compiler for distributed-memory parallel computers. Technical report, Carnegie-Mellon Univ., Pittsburgh, PA (USA), January 1989.

[vECGS92] T. von Eicken, D. E. Culler, S. C. Goldstein, and K. E. Schauser. Active messages: a mechanism for integrated communication and computation. In Proc. 19th Annu. Int'l Symp. Comput. Arch., ISCA '92, pages 256–266, New York, NY, USA, 1992. ACM.


[WDS+95] J. Wu, R. Das, J. Saltz, H. Berryman, and S. Hiranandani. Distributed memory compiler design for sparse problems. IEEE Transactions on Computers, 44(6):737–753, June 1995.

[WM95] W. A. Wulf and S. A. McKee. Hitting the memory wall: Implications of the obvious. SIGARCH Comput. Archit. News, 23(1):20–24, March 1995.

[WMH+15] S. Windh, X. Ma, R. J. Halstead, P. Budhkar, Z. Luna, O. Hussaini, and W. A. Najjar. High-Level Language Tools for Reconfigurable Computing. Proceedings of the IEEE, 103(3):390–408, March 2015.

[WQZ+13] H. Wei, M. Qin, W. Zhang, J. Yu, D. Fan, and G. R. Gao. StreamTMC: Stream compilation for tiled multi-core architectures. J. Parallel Distrib. Comput., 73(4):484–494, April 2013.

[WRW96] A. Wollrath, R. Riggs, and J. Waldo. A distributed object model for the Java system. In Proceedings of the 2nd conference on USENIX Conference on Object-Oriented Technologies (COOTS) - Volume 2, pages 17–17, Berkeley, CA, USA, 1996. USENIX Association.

[XSAH10] J. Xu, N. Subramanian, A. Alessio, and S. Hauck. Impulse C vs. VHDL for Accelerating Tomographic Reconstruction. In 2010 18th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), pages 171–174, 2010.

[YGR16] S. Yousuf and A. Gordon-Ross. An Automated Hardware/Software Co-Design Flow for Partially Reconfigurable FPGAs. In 2016 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), pages 30–35, July 2016.

[ZA07] P. Zhao and J. N. Amaral. Ablego: a function outlining and partial inlining framework. Software: Practice and Experience, 37(5):465–491, 2007.

[ZH06] R. Zhou and E. Hansen. A breadth-first approach to memory-efficient graph search. In Proc. of National Conf. on Artificial Intelligence, volume 21, pages 1695–1699, 2006.

[ZTM+15] M. Zapater, A. Turk, J. M. Moya, J. L. Ayala, and A. K. Coskun. Dynamic workload and cooling management in high-efficiency data centers. In 2015 Sixth International Green and Sustainable Computing Conference (IGSC), pages 1–8, December 2015.

Glossary

APGAS Asynchronous PGAS. 65

API Application Programming Interface. 62

AST Abstract Syntax Tree. 63

BB Basic Block. xiii, 61–63, 69–75, 84, 101

BFS Breadth-First Search. 35, 70

BLAS Basic Linear Algebra Subroutines. 3

CBE Cell Broadband Engine. 3

CFD Computational Fluid Dynamics. v, vi, 3, 13, 98, 120, 131

CFG Control Flow Graph. 61–63, 73, 85

cluster Group of connected computers that can be viewed as a single system. 1

CPU Central Processing Unit. 3, 5, 14, 17, 18, 20, 42, 52, 53, 57, 60, 62, 69, 70, 100, 101, 108, 129, 131

FLOPS Floating-point Operations per Second. 1, 120

FPGA Field-Programmable Gate Array. v, vi, 1–3, 5–7, 61, 95, 99, 108, 129, 131, 132

GPGPU General-Purpose Graphics Processing Unit. vi, 1–4, 6, 8, 57, 61, 129

GPP General-Purpose Processor. 6

HPC High Performance Computing. v, vi, 1, 2, 5, 64, 82, 97, 99, 101, 126, 129, 130


LOC Lines Of Code. 2, 3

MPI Message Passing Interface. 8, 78–81, 89–91, 94, 107, 109, 113, 116–118, 123, 124, 126, 132

MPMD Multiple Program Multiple Data. 64

MPP Massively Parallel Processing. 5

NACA National Advisory Committee for Aeronautics. 13, 30

PDE Partial Differential Equation. 3, 8

PGAS Partitioned-Global Address Space. 8, 65

RBB Remote Block Branch. ii, xiii, 62, 68, 74–82, 84, 85, 88–94

RMI Remote Method Invocation. 64

RPC Remote Procedure Call. 7, 10, 62, 64, 74, 75

RSD Resource and Service Description. xii, 68

soft-core Hardware design wholly implemented using logic synthesis. 6

SPMD Single Program Multiple Data. 64

SSA Static Single Assignment. 62, 63, 75, 78

UPC Unified Parallel C. 8
