Cross-Loop Optimization of Arithmetic Intensity for Finite Element Local Assembly

Fabio Luporini, Imperial College London
Ana Lucia Varbanescu, University of Amsterdam
Florian Rathgeber, Imperial College London
Gheorghe-Teodor Bercea, Imperial College London
J. Ramanujam, Louisiana State University
David A. Ham, Imperial College London
Paul H. J. Kelly, Imperial College London

We study and systematically evaluate a class of composable code transformations that improve arithmetic intensity in local assembly operations, which represent a significant fraction of the execution time in finite element methods. Their performance optimization is indeed a challenging issue. Even though affine loop nests are generally present, the short trip counts and the complexity of mathematical expressions, which vary among different problems, make it hard to determine an optimal sequence of successful transformations. Our investigation has resulted in the implementation of a compiler for local assembly kernels, fully integrated with a framework for developing finite element methods. The compiler manipulates abstract syntax trees generated from a domain-specific language by introducing domain-aware optimizations for instruction-level parallelism and register locality. Eventually, it produces C code including vector SIMD intrinsics. Experiments using a range of real-world finite element problems of increasing complexity show that significant performance improvement is achieved. The generality of the approach and the applicability of the proposed code transformations to other domains are also discussed.

Categories and Subject Descriptors: G.1.8 [Numerical Analysis]: Partial Differential Equations - Finite element methods; G.4 [Mathematical Software]: Parallel and vector implementations

General Terms: Design, Performance

Additional Key Words and Phrases: Finite element integration, local assembly, optimizations, SIMD vectorization

ACM Reference Format: Fabio Luporini, Ana Lucia Varbanescu, Florian Rathgeber, Gheorghe-Teodor Bercea, J. Ramanujam, David A. Ham, and Paul H. J. Kelly. 2014. Cross-Loop Optimization of Arithmetic Intensity for Finite Element Local Assembly. ACM Trans. Arch. & Code Opt. 11, 4, Article 57 (January 2015), 25 pages. DOI: http://dx.doi.org/10.1145/2687415

1. INTRODUCTION

In many fields, such as computational fluid dynamics, computational electromagnetics and structural mechanics, phenomena are modelled by partial differential equations

This research is partly funded by the MAPDES project, by the Department of Computing at Imperial College London, by EPSRC through grants EP/I00677X/1, EP/I006761/1, and EP/L000407/1, by NERC grants NE/K008951/1 and NE/K006789/1, by the U.S. National Science Foundation through grants 0811457, 0926687 and 1059417, by the U.S. Army through contract W911NF-10-1-000, by Louisiana State University and by a HiPEAC collaboration grant. The authors would like to thank Carlo Bertolli, Lawrence Mitchell and Francis Russell for their invaluable suggestions and their contribution to the Firedrake project. We thank the ACM TACO and HiPEAC'15 reviewers for their suggestions that led to a significantly improved version.
Authors' addresses: Fabio Luporini, Florian Rathgeber, Gheorghe-Teodor Bercea, and Paul H. J. Kelly, Department of Computing, Imperial College London; Ana Lucia Varbanescu, Informatics Institute, University of Amsterdam; David A. Ham, Department of Computing and Department of Mathematics, Imperial College London; J. Ramanujam, Center for Computation and Technology and the School of Elec. Engr. & Comp. Sci., Louisiana State University.
© 2015 ACM 1539-9087/2015/01-ART57. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in ACM Transactions on Architecture and Code Optimization, http://dx.doi.org/10.1145/2687415.


(PDEs). Numerical techniques, such as the finite volume method and the finite element method, are widely employed to approximate solutions of these PDEs. Unstructured meshes are often used to discretize the computational domain, since they allow an accurate representation of complex geometries. The solution is sought by applying suitable numerical operations, described by computational kernels, to all entities of a mesh, such as edges, vertices, or cells. On standard clusters of multicores, a kernel is typically executed sequentially by a thread, while parallelism is achieved by partitioning the mesh and assigning each partition to a different node or thread. Such an execution model, with minor variations, is adopted, for example, by Markall et al. [2013], Logg et al. [2012], AMCG [2010], and DeVito et al. [2011].
The time required to execute the numerical kernels is a major issue, since the equation domain needs to be discretized into an extremely large number of cells to obtain a satisfactory approximation of the PDE, possibly of the order of trillions, as in Rossinelli et al. [2013]. For example, it is well established that mesh resolution is critical to the accuracy of numerical weather forecasts. However, operational forecast centers have a strict time limit in which to produce a forecast (60 minutes in the case of the UK Met Office). Producing efficient kernels has a direct scientific payoff in higher resolution, and therefore more accurate, forecasts. Computational cost is a dominant problem in computational science simulations, especially for those based on finite elements, which are the subject of this paper. In particular, we address the well-known problem of optimizing the local assembly phase of the finite element method [Russell and Kelly 2013; Ølgaard and Wells 2010; Knepley and Terrel 2013; Kirby et al. 2005], which can be responsible for a significant fraction of the overall computation run-time, often in the range of 30-60%. With respect to these studies, we propose a novel set of composable code transformations targeting, for the first time, cross-loop arithmetic intensity, with emphasis on instruction-level parallelism, redundant computation, and register locality. We also discuss how such transformations generalize to domains other than finite element local assembly.
During the assembly phase, the solution of the PDE is approximated by executing a problem-specific kernel over all cells, or elements, in the discretized domain. In this work, we focus on relatively low order finite element methods, in which an assembly kernel's working set is usually small enough to fit in the L1 cache of traditional CPU architectures. Low order methods are by no means exotic: they are employed in a wide variety of fields, including climate and ocean modeling, computational fluid dynamics, and structural mechanics. The efficient assembly of high order methods such as the spectral element method [Vos et al. 2010] requires a significantly different loop nest structure. High order methods are therefore excluded from our study.
We aim to maximize our impact on the platforms that are realistically used for finite element applications, so we target conventional CPU architectures rather than GPUs. The key limiting factor for execution on GPUs is the stringent memory requirements: only relatively small problems fit in GPU memory, and support for distributed GPU execution in general-purpose finite element frameworks is minimal.
There has been some research on adapting local assembly to GPUs, which we mention in Section 7; these works differ from ours in several ways, including: (i) not optimizing for cross-loop arithmetic intensity (the goal is rather effective multi-thread parallelization), and (ii) not relying on automated code generation from a domain-specific language (explained next); moreover, they have been tested only on order-1 methods, while we focus on the order range 1-4. In addition, our code transformations would drastically impact the GPU parallelization strategy, for example by increasing a thread's working set. For all these reasons, a study on extending our work to GPU architectures is beyond the scope of this paper. In Section 6, however, we provide some intuitions about this research direction.


An assembly kernel is characterized by the presence of an affine, often non-perfect loop nest, in which individual loops are rather small: their trip count rarely exceeds 30, and may be as low as 3 for low order methods. In the innermost loop, a problem-specific compute-intensive expression evaluates a two-dimensional array, representing the result of local assembly in an element of the discretized domain. With such kernel and working set structures, traditional transformations for cache locality (e.g., blocking) are not helpful; instead, we focus on aspects such as the optimization of floating-point operations, register locality, and instruction-level parallelism, especially vectorization.
Achieving high performance on CPUs is non-trivial. The complexity of the mathematical expressions, often characterized by a large number of operations on constants and small matrices, makes it hard to determine an optimal sequence of transformations that is successfully applicable to different problems (equations). Loop trip counts are typically small and can vary significantly, which further exacerbates the issue. A compiler-based approach is, therefore, the only reasonable option to obtain close-to-peak performance in a wide range of different local assembly kernels. Optimizations like padding, generalized loop-invariant code motion, vector-register tiling, and expression splitting, as well as their composition, are essential, but their support in state-of-the-art vendor and polyhedral compilers, if any, is very limited. BLAS routines could theoretically be employed, although a fairly complicated control and data flow analysis would be required to automate the identification and extraction of matrix-matrix multiplications. In addition, as detailed in Section 5.2.8, the small dimension of the matrices involved and the potential loss in data locality can limit or eliminate the performance gain of this approach.
In order to overcome the constraints of the available compilers and specialized linear algebra libraries, we have automated a set of generic and model-driven code transformations in COFFEE¹, a compiler for optimizing local assembly kernels. COFFEE is integrated with Firedrake [Firedrake contributors 2014], a system for solving PDEs through the finite element method based on the PyOP2 abstraction [Rathgeber et al. 2012; Markall et al. 2013]. All problems expressible with this framework are supported by COFFEE, including equations that can be found at the core of real-world simulations, such as those used in our performance evaluation. In evaluating our code transformations for a range of relevant problems, we vary two key parameters that impact solution accuracy and kernel cost: the polynomial order of the method (we investigate from p = 1 to p = 4) and the geometry of elements in the discretized domain (2D triangle, 3D tetrahedron, 3D prism). From the point of view of the generated code, these two parameters directly impact the size of both loop nests and mathematical expressions, as elaborated in Section 5.2.1.
Our experiments show that the original generated code for non-trivial assembly kernels, despite following state-of-the-art techniques, remains suboptimal in the context of modern multicore architectures. Our domain-aware, cost-model-driven sequence of code transformations, aimed at improving SIMD vectorization and register data locality, can result in performance improvements of up to 4.4× over the original kernels.
Around 70% of the test cases obtain a speedup greater than 2×. Summarizing, the contributions of this paper are:

(1) An optimization strategy for finite element local assembly that exploits domain knowledge and goes beyond the limits of both vendor and research compilers.
(2) The design and implementation of a compiler that automates the proposed code transformations for any problem expressible in Firedrake.

¹ COFFEE stands for COmpiler For Fast Expression Evaluation.


(3) A systematic analysis using a suite of examples of real-world importance that provides evidence of significant performance improvements on two Intel architectures, a Sandy Bridge CPU and the Xeon Phi.
(4) A full application evaluation to demonstrate that the optimization strategy greatly impacts the whole finite element problem.
Our contributions also include a theoretical study involving:
(5) An analysis of the generality of the proposed code transformations and an investigation of their applicability to different computational domains.
The paper is organized as follows. In Section 2, we provide some background on local assembly, show code generated by Firedrake, and emphasize the critical computational aspects. Section 3 describes the various code transformations, highlighting when and how domain knowledge has been exploited. The design and implementation of our compiler are discussed in Section 4. Section 5 shows performance results. The generality of our optimization strategy and its applicability to other domains are discussed in Section 6. Related work is described in Section 7. Finally, Section 8 reviews our contributions in the light of our results and identifies priorities for future work.

2. BACKGROUND AND MOTIVATING EXAMPLES

Local assembly is the computation of the contributions of a specific cell in the discretized domain to the linear system that yields the PDE solution. The process consists of numerically evaluating problem-specific integrals to produce a matrix and a vector [Ølgaard and Wells 2010; AMCG 2010], whose sizes depend on the order of the method. This operation is applied to all cells in the discretized domain. In this work we focus on local matrices, or element matrices, which are more costly to compute than element vectors.
Given a finite element description of the input problem expressed through the domain-specific Unified Form Language (UFL) [Alnæs et al. 2014], Firedrake employs the FEniCS form compiler (FFC) [Kirby and Logg 2006] to generate an abstract syntax tree of a kernel implementing assembly using numerical quadrature. This kernel can be applied to any element in the mesh, which follows from a mathematical property of the finite element method: the evaluation of a local matrix can be reduced to integration on a fixed reference element (a special element that does not belong to the domain) after a suitable change of coordinates. Firedrake triggers the compilation of an assembly kernel using an available vendor compiler, and manages its parallel execution over all elements in the mesh. As already explained, the subject of this paper is to enhance this execution model by adding an optimization stage prior to the generation of C code.
The structure of a local assembly kernel is shown in Figure 1. The inputs are a zero-initialized two-dimensional array used to store the element matrix, the element's coordinates in the discretized domain, and coefficient fields, for instance indicating the values of velocity or pressure in the element. The output is the evaluated element matrix. The kernel body can be logically split into three parts.

Input: element matrix (2D array, initialized to 0), coordinates (array), coefficients (array, e.g. velocity)
Output: element matrix (2D array)
- Compute Jacobian from coordinates
- Define basis functions
- Compute element matrix in an affine loop nest

Fig. 1: Structure of a local assembly kernel


LISTING 1: Local assembly code generated by Firedrake for a Helmholtz problem on a 2D triangular mesh using Lagrange p = 1 elements.

void helmholtz(double A[3][3], double **coords) {
  // K, det = Compute Jacobian (coords)

  static const double W[3] = {...};
  static const double X_D10[3][3] = {{...}};
  static const double X_D01[3][3] = {{...}};

  for (int i = 0; i < 3; i++)
    for (int j = 0; j < 3; j++)
      for (int k = 0; k < 3; k++)
        A[j][k] += ((Y[i][k]*Y[i][j] +
                    ((K1*X_D10[i][k] + K3*X_D01[i][k])*(K1*X_D10[i][j] + K3*X_D01[i][j])) +
                    ((K0*X_D10[i][k] + K2*X_D01[i][k])*(K0*X_D10[i][j] + K2*X_D01[i][j]))) *
                   det*W[i]);
}

(1) Calculation of the Jacobian matrix, its determinant and its inverse, required for the aforementioned change of coordinates from the reference element to the one being computed.
(2) Definition of the basis functions used to interpolate fields at the quadrature points in the element. The choice of basis functions is expressed in UFL directly by users. In the generated code, they are represented as global read-only two-dimensional arrays (i.e., using static const in C) of double-precision floats.
(3) Evaluation of the element matrix in an affine loop nest in which the integration is performed.

Table I shows the variable names we will use in the upcoming code snippets to refer to the various kernel objects.
The actual complexity of a local assembly kernel depends on the finite element problem being solved. In simpler cases, the loop nest is perfect, has short trip counts (in the range 3-15), and the computation reduces to a summation of a few products involving basis functions. An example is provided in Listing 1, which shows the assembly kernel for a Helmholtz problem using Lagrange basis functions on 2D elements with polynomial order p = 1. In other scenarios, for instance when solving the Burgers equation, the number of arrays involved in the computation of the element matrix can be much larger. The assembly code is given in Listing 2 and accesses 14 different arrays, where the same array can be referenced multiple times within the same expression. This may also require the evaluation of constants in outer loops (called F in the code) to act as scaling factors of arrays. Trip counts grow in proportion to the order of the method, and arrays may be block-sparse. In addition to a larger number of operations, more complex cases like the Burgers equation are characterized by high register pressure.

Table I: Type and variable names used in the various listings to identify local assembly objects.

Object name                          Type        Variable name(s)
Determinant of the Jacobian matrix   double      det
Inverse of the Jacobian matrix       double      K1, K2, ...
Coordinates                          double**    coords
Fields (coefficients)                double**    w
Numerical integration weights        double[]    W
Basis functions (and derivatives)    double[][]  X, Y, X1, ...
Element matrix                       double[][]  A


LISTING 2: Local assembly code generated by Firedrake for a Burgers problem on a 3D tetrahedral mesh using Lagrange p = 1 elements.

void burgers(double A[12][12], double **coords, double **w) {
  // K, det = Compute Jacobian (coords)

  static const double W[5] = {...};
  static const double X1_D001[5][12] = {{...}};
  static const double X2_D001[5][12] = {{...}};
  // 11 other basis function definitions
  ...
  for (int i = 0; i < 5; i++) {
    double F0 = 0.0;
    // 10 other declarations (F1, F2, ...)
    ...
    for (int r = 0; r < 12; r++) {
      F0 += (w[r][0]*X1_D100[i][r]);
      // 10 analogous statements (F1, F2, ...)
      ...
    }
    for (int j = 0; j < 12; j++)
      for (int k = 0; k < 12; k++)
        A[j][k] += (..((K5*F9) + (K8*F10))*Y1[i][j]) +
          (((K0*X1_D100[i][k]) + (K3*X1_D010[i][k]) + (K6*X1_D001[i][k]))*Y2[i][j]))*F11) +
          (..((K2*X2_D100[i][k]) + ... + (K8*X2_D001[i][k]))*((K2*X2_D100[i][j]) + ... + (K8*X2_D001[i][j]))..) +
          ..) *
          det*W[i];
  }
}

The Helmholtz and Burgers equations exemplify a large class of problems and, as such, will constitute our benchmark problems, along with the Diffusion equation. We carefully motivate this choice in Section 5.2.1.
Despite the infinite variety of assembly kernels that Firedrake can generate, it is still possible to identify common domain-specific traits that can be exploited for effective code transformations and SIMD vectorization. These include: 1) memory accesses along the three loop dimensions are always unit stride; 2) the j and k loops are interchangeable, whereas interchanges involving the i loop require pre-computation of values (e.g. the F values in Burgers) and the introduction of temporary arrays, as explained in Section 3; 3) depending on the problem being solved, the j and k loops could iterate over the same iteration space; 4) most of the sub-expressions on the right-hand side of the element matrix computation depend on just two loops (either i-j or i-k). In Section 3 we show how to exploit these observations to define a set of systematic, composable optimizations.

3. CODE TRANSFORMATIONS

The code transformations presented in this section are applicable to all finite element problems that can be formulated in Firedrake. Their generality and their potential applicability in other domains are discussed in Section 6.
As already emphasized, the variations in the structure of mathematical expressions and in loop trip counts, although typically limited to tens of iterations, render the optimization process challenging, requiring distinct sets of transformations to bring performance closest to the machine peak in different problems. For example, the Burgers problem in Listing 2, given the large number of arrays accessed, suffers from high register pressure, whereas the Helmholtz problem in Listing 1 does not; this intuitively suggests that the two problems require a different treatment, based on an in-depth analysis of both data and iteration spaces. Furthermore, domain knowledge enables transformations that a general-purpose compiler could not apply, making the optimization space even larger.

In this context, our goal is to understand the relationship between distinct code transformations, their impact on cross-loop arithmetic intensity, and to what extent their composability is effective in a class of problems and architectures.

3.1. Padding and Data Alignment

The absence of stencils renders the element matrix computation easily auto-vectorizable by a vendor compiler. Nevertheless, auto-vectorization is not efficient if data are not aligned to cache-line boundaries and if the length of the innermost loop is not a multiple of the vector length VL, especially when the loops are small, as in local assembly. Data alignment is enforced in two steps. Firstly, all arrays are allocated at addresses that are multiples of VL. Then, two-dimensional arrays are padded by rounding the number of columns to the nearest multiple of VL. For instance, assume the original size of a basis function array is 3×3 and VL = 4 (e.g. an AVX processor, with 256-bit vector registers and 64-bit double-precision floats). In this case, the padded version of the array will have size 3×4. The compiler is explicitly told about data alignment using suitable pragmas; for example, in the case of the Intel compiler, the annotation #pragma vector aligned is added before the loop (as shown in later listings) to state that all of the memory accesses in the loop body will be properly aligned. This allows the compiler to issue aligned load and store instructions, which are notably faster than unaligned ones.
Padding all two-dimensional arrays involved in the evaluation of the element matrix also allows the loop trip count to be safely rounded to the nearest multiple of VL. This avoids the introduction of a remainder (scalar) loop by the compiler, which would render vectorization less efficient. The extra iterations only write to the padded region of the element matrix, and therefore have no side effects on the final result.
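The fragment below is a minimal sketch of these two steps, assuming an AVX machine (VL = 4 doubles) and the 3×3 example above; the array names and the trivial loop body are illustrative only and do not correspond to COFFEE's output.

#define VL 4   /* doubles per AVX register (assumption: 256-bit registers) */

/* 3x3 basis-function table padded to 3x4 and aligned to a 32-byte boundary */
static const double X[3][VL] __attribute__((aligned(32))) = {{0.0}};

void padded_loop(double A[3][VL])   /* element matrix, also padded to 3x4 */
{
  for (int j = 0; j < 3; j++) {
    #pragma vector aligned          /* Intel-specific: promises aligned accesses */
    for (int k = 0; k < VL; k++)    /* trip count rounded up from 3 to VL = 4 */
      A[j][k] += X[j][k];           /* illustrative body; column 3 is padding */
  }
}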

3.2. Generalized Loop-Invariant Code Motion

From an inspection of the code in Listings 1 and 2, it can be noted that the computation of A involves evaluating many sub-expressions that only depend on two iteration variables. Since the symbols in most of these sub-expressions are read-only variables, there is ample room for loop-invariant code motion. Vendor compilers apply this technique, although not in the systematic way we need for our assembly kernels. We want to overcome two deficiencies that both the Intel and GNU compilers exhibit. First, they only identify sub-expressions that are invariant with respect to the innermost loop. This is an issue for sub-expressions depending on i-k, which are not automatically lifted in the loop order ijk. Second, the hoisted code is scalar and therefore not subjected to auto-vectorization.
We work around these limitations with source-level loop-invariant code motion. In particular, we pre-compute all values that an invariant sub-expression assumes along its fastest varying dimension. This is implemented by introducing a temporary array per invariant sub-expression and by adding a new loop to the nest. At the price of extra memory for storing temporaries, the gain is that lifted terms can be auto-vectorized as part of an inner loop. Given the short trip counts of our loops, it is important to achieve auto-vectorization of hoisted terms in order to minimize the percentage of scalar instructions, which could otherwise be significant. It is also worth noting that, in some problems, for instance Helmholtz, invariant sub-expressions along j are identical to those along k, and both loops iterate over the same iteration space, as anticipated in Section 2. In these cases, we safely avoid redundant pre-computation. Listing 3 shows the Helmholtz assembly code after the application of loop-invariant code motion, padding, and data alignment.


LISTING 3: Local assembly code for the Helmholtz problem in Listing 1 after application of padding, data alignment, and loop-invariant code motion, for an AVX architecture. In this example, sub-expressions invariant to j are identical to those invariant to k, so they can be pre-computed once in the r loop.

void helmholtz(double A[3][4], double **coords) {
  #define ALIGN __attribute__((aligned(32)))
  // K, det = Compute Jacobian (coords)

  static const double W[3] ALIGN = {...};
  static const double X_D10[3][4] ALIGN = {{...}};
  static const double X_D01[3][4] ALIGN = {{...}};

  for (int i = 0; i < 3; i++) {
    double LI_0[4] ALIGN;
    double LI_1[4] ALIGN;
    for (int r = 0; r < 4; r++) {
      LI_0[r] = ((K1*X_D10[i][r]) + (K3*X_D01[i][r]));
      LI_1[r] = ((K0*X_D10[i][r]) + (K2*X_D01[i][r]));
    }
    for (int j = 0; j < 3; j++)
      #pragma vector aligned
      for (int k = 0; k < 4; k++)
        A[j][k] += (Y[i][k]*Y[i][j] + LI_0[k]*LI_0[j] + LI_1[k]*LI_1[j])*det*W[i];
  }
}

3.3. Model-driven Vector-register Tiling

One notable problem of assembly kernels concerns register allocation and register locality. The critical situation occurs when the loop trip counts and the variables accessed are such that the vector-register pressure is high. Since the kernel's working set fits in the L1 cache, it is particularly important to optimize register management. Standard optimizations, such as interchange, unroll, and unroll-and-jam, can be employed to deal with this problem. In COFFEE, these optimizations are supported either by means of explicit code transformations (interchange, unroll-and-jam) or indirectly by delegation to the compiler through standard pragmas (unroll). Tiling at the level of vector registers is an additional feature of COFFEE. Based on the observation that the evaluation of the element matrix can be reduced to a summation of outer products along the j and k dimensions, a model-driven vector-register tiling strategy can be implemented. If we consider the code snippet in Listing 3 and ignore the presence of the operation det*W[i], the computation of the element matrix is abstractly expressible as

    A_{jk} = \sum_{x \in B' \subseteq B} \sum_{y \in B'' \subseteq B} x_j \cdot y_k, \qquad j, k = 0, \ldots, 2        (1)

where B is the set of all basis functions (or temporary variables, e.g., LI_0) accessed in the kernel, whereas B' and B'' are generic problem-dependent subsets. Regardless of the specific input problem, by abstracting from the presence of all variables independent of both j and k, the element matrix computation is always reducible to this form. Figure 2 illustrates how we can evaluate 16 entries (j, k = 0, ..., 3) of the element matrix using just 2 vector registers, which represent a 4×4 tile, assuming |B'| = |B''| = 1. The values in a register are shuffled each time a product is performed. Standard compiler auto-vectorization in both the GNU and Intel compilers, instead, executes 4 broadcast operations (i.e., "splat" of a value over all of the register locations) along the outer dimension to perform the calculation. In addition to incurring a larger number of cache accesses, it needs to keep between f = 1 and f = 3 extra registers to perform the same 16 evaluations when unroll-and-jam is used, with f being the unroll-and-jam factor.
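As a concrete instance (this instantiation follows directly from Listing 3), the innermost statement of Listing 3 matches (1) with B' = B'' = {Y[i], LI_0, LI_1} and the factor det*W[i] ignored:

    A_jk += Y[i][j] · Y[i][k] + LI_0[j] · LI_0[k] + LI_1[j] · LI_1[k]

that is, the accumulation is a sum of three j-k outer products, one per term.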


LISTING 4: Local assembly code generated by Firedrake for the Helmholtz problem after application of op-vect on top of the optimizations shown in Listing 3. In this example, the unroll-and-jam factor is 1.

void helmholtz(double A[4][4], double **coords) {
  // K, det = Compute Jacobian (coords)
  // Declaration of basis function matrices
  for (int i = 0; i < 3; i++) {
    // Do generalized loop-invariant code motion
    for (int j = 0; j < 4; j += 4)
      for (int k = 0; k < 4; k += 4) {
        // load and set intrinsics
        // Compute A[0,0], A[1,1], A[2,2], A[3,3]
        // One permute_pd intrinsic per k-loop load
        // Compute A[0,1], A[1,0], A[2,3], A[3,2]
        // One permute2f128_pd intrinsic per k-loop load
        // ...
      }
    // Remainder loop (from j = 4 to j = 6)
  }
  // Restore the storage layout:
  for (int j = 0; j < 4; j += 4) {
    __m256d r0, r1, r2, r3, r4, r5, r6, r7;
    for (int k = 0; k < 4; k += 4) {
      r0 = _mm256_load_pd(&A[j+0][k]);
      // Load A[j+1][k], A[j+2][k], A[j+3][k]
      r4 = _mm256_unpackhi_pd(r1, r0);
      r5 = _mm256_unpacklo_pd(r0, r1);
      r6 = _mm256_unpackhi_pd(r2, r3);
      r7 = _mm256_unpacklo_pd(r3, r2);
      r0 = _mm256_permute2f128_pd(r5, r7, 32);
      r1 = _mm256_permute2f128_pd(r4, r6, 32);
      r2 = _mm256_permute2f128_pd(r7, r5, 49);
      r3 = _mm256_permute2f128_pd(r6, r4, 49);
      _mm256_store_pd(&A[j+0][k], r0);
      // Store A[j+1][k], A[j+2][k], A[j+3][k]
    }
  }
}

[Figure 2 diagram: registers x = (0 1 2 3) and y, shown in four panels: the initial configuration and the states after the first, second, and third permutations of y (0 1 2 3 → 1 0 3 2 → 2 3 0 1 → 3 2 1 0); the element-wise products x × y fill the entries A(0,0),(1,1),(2,2),(3,3), then (0,1),(1,0),(2,3),(3,2), then (0,2),(1,3),(2,0),(3,1), then (0,3),(1,2),(2,1),(3,0).]

Fig. 2: Outer-product vectorization by permuting values in a vector register.
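The AVX fragment below is a minimal, self-contained sketch of the idea in Figure 2, not COFFEE-generated code: assuming |B'| = |B''| = 1 and 32-byte-aligned inputs, a 4×4 tile of A is accumulated with a single register for the j-dependent values (x) and a single register for the k-dependent values (y), permuting y in place between multiplications. The permutation order differs slightly from Figure 2, but the same 16 products are covered; the tile is held along rotated diagonals, so the storage layout must be restored afterwards (Figure 3).

#include <immintrin.h>

/* acc[0..3] hold the four rotated diagonals of one 4x4 tile of A. */
static inline void op_vect_tile(const double *x, const double *y, __m256d acc[4])
{
  __m256d vx = _mm256_load_pd(x);                          /* x0 x1 x2 x3 */
  __m256d vy = _mm256_load_pd(y);                          /* y0 y1 y2 y3 */

  acc[0] = _mm256_add_pd(acc[0], _mm256_mul_pd(vx, vy));   /* A00 A11 A22 A33 */
  vy = _mm256_permute_pd(vy, 0x5);                         /* y1 y0 y3 y2 */
  acc[1] = _mm256_add_pd(acc[1], _mm256_mul_pd(vx, vy));   /* A01 A10 A23 A32 */
  vy = _mm256_permute2f128_pd(vy, vy, 0x1);                /* y3 y2 y1 y0 */
  acc[2] = _mm256_add_pd(acc[2], _mm256_mul_pd(vx, vy));   /* A03 A12 A21 A30 */
  vy = _mm256_permute_pd(vy, 0x5);                         /* y2 y3 y0 y1 */
  acc[3] = _mm256_add_pd(acc[3], _mm256_mul_pd(vx, vy));   /* A02 A13 A20 A31 */
}

For the Helmholtz code of Listing 3, for example, such a routine would be called three times per i iteration, once for each of Y[i], LI_0 and LI_1 (with y = x and det*W[i] folded into one of the operands).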

The storage layout of A, however, is incorrect after the application of this outer-product-based vectorization (op-vect, in the following). It can be efficiently restored with a sequence of vector shuffles following the pattern highlighted in Figure 3, executed once outside of the ijk loop nest. The generated pseudo-code for the simple Helmholtz problem when using op-vect is shown in Listing 4.

[Figure 3 diagram: four 4×4 grids of (row, column) index pairs showing the initial configuration, the state after the first set of shuffles, the state after the second set of shuffles, and the fully restored layout.]

Fig. 3: Restoring the storage layout after op-vect. The figure shows how 4×4 elements in the top-left block of the element matrix A can be moved to their correct positions. Each rotation, represented by a group of three same-colored arrows, is implemented by a single shuffle intrinsic.


LISTING 5: Local assembly code generated by Firedrake for the Helmholtz problem in which split has been applied on top of the optimizations shown in Listing 3. In this example, the split factor is 2.

void helmholtz(double A[3][4], double **coords) {
  // Same code as in Listing 3 up to the j loop
  for (int j = 0; j < 3; j++)
    #pragma vector aligned
    for (int k = 0; k < 4; k++)
      A[j][k] += (Y[i][k]*Y[i][j] + LI_0[k]*LI_0[j])*det*W[i];
  for (int j = 0; j < 3; j++)
    #pragma vector aligned
    for (int k = 0; k < 4; k++)
      A[j][k] += LI_1[k]*LI_1[j]*det*W[i];
}

3.4. Expression Splitting

In complex kernels, like Burgers in Listing 2, and on certain architectures, achieving effective register allocation can be challenging. If the number of variables independent of the innermost-loop dimension is close to or greater than the number of available CPU registers, poor register reuse is likely. This usually happens when the number of basis function arrays, temporaries introduced by generalized loop-invariant code motion, and problem constants is large. For example, applying loop-invariant code motion to Burgers on a 3D mesh requires 24 temporaries for the ijk loop order. This can make hoisting the invariant loads out of the k loop inefficient on architectures with a relatively low number of registers. One potential solution to this problem consists of suitably "splitting" the computation of the element matrix A into multiple sub-expressions; an example, for the Helmholtz problem, is given in Listing 5. The transformation can be regarded as a special case of classic loop fission, in which the associativity of the sum is exploited to distribute the expression across multiple loops. To the best of our knowledge, expression splitting is not supported by available compilers.
Splitting an expression (henceforth split) has, however, several drawbacks. Firstly, it increases the number of accesses to A in proportion to the "split factor," which is the number of sub-expressions produced. Also, depending on how splitting is done, it can lead to redundant computation. For example, the number of times the product det*W[i] is performed is proportional to the number of sub-expressions, as shown in the code snippet. Further, it increases loop overhead, for example through additional branch instructions. Finally, it might affect register locality: for instance, the same array could be accessed in different sub-expressions, requiring a proportional number of loads to be performed; this is not the case for the Helmholtz example, though. Nevertheless, the performance gain from improved register reuse can still be greater if suitable heuristics are used. Our approach consists of traversing the expression tree and recursively splitting it into multiple sub-expressions as long as the number of variables independent of the innermost loop exceeds a certain threshold, as sketched below. This is elaborated in the next section, and validated against empirical search in Section 5.2.6.
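The following is a rough C sketch of that recursion, under our own simplifying assumptions: the expression is modelled as a binary tree of '+' nodes whose subtrees are annotated with the number of innermost-loop-independent symbols they read, and the function only counts how many sub-expressions (loop nests) the split would produce. COFFEE itself operates on its kernel ASTs, so this is purely illustrative.

#include <stddef.h>

struct expr {                   /* '+' node when left/right are set, a term otherwise */
  struct expr *left, *right;
  int n_invariant;              /* innermost-loop-independent symbols read in this subtree */
};

/* A subtree is kept as a single j-k loop nest once its invariant-symbol
 * count fits the register threshold; otherwise the sum is cut and both
 * halves are examined recursively. Returns the number of sub-expressions. */
static int count_subexpressions(const struct expr *e, int threshold)
{
  if (e->left == NULL || e->n_invariant <= threshold)
    return 1;
  return count_subexpressions(e->left, threshold)
       + count_subexpressions(e->right, threshold);
}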

4. OVERVIEW OF COFFEE

Firedrake users employ the Unified Form Language to express problems in a notation resembling mathematical equations. At run-time, the high-level specification is translated by a modified version of the FEniCS Form Compiler (FFC) [Kirby and Logg 2006] into an abstract syntax tree (AST) representation of one or more finite element assembly kernels. The ASTs are then passed to COFFEE to apply the transformations described in Section 3. The output of COFFEE, C code, is eventually provided to PyOP2 [Rathgeber et al. 2012; Markall et al. 2013], where just-in-time compilation and execution over the discretized domain take place.

The flow and the compiler structure are outlined in Figure 4.
Some of the transformations presented in the previous section, as well as other typical compiler optimizations used in COFFEE, like loop interchange and unrolling, are parametric; for instance, expression splitting and unrolling depend on the split and the unroll factors, respectively. Since all transformations are composable, the optimization space can be as large as thousands of variants for local assembly kernels in which the mathematical expressions are moderately complex and the iteration space is not particularly small, so a strategy to prune it is required. For this, COFFEE uses heuristics and a simple cost model to select the most suitable optimization strategy for a given problem. Auto-tuning could also be used, although it would significantly increase the run-time overhead, since the generation of ASTs occurs at run-time, as soon as problem-specific data are available.
The compiler applies an ordered sequence of optimization steps to the ASTs received from FFC. Application of loop-invariant code motion (or licm) must precede padding and data alignment, due to the introduction of temporary arrays. These transformations are always performed because they are likely to improve run-time performance, as demonstrated by our results in Section 5.2. Based on a cost model (described later), loop interchange, split, and op-vect may be introduced. Their implementation is centered on analysis and manipulation of the kernel AST. When op-vect is selected, the compiler outputs AVX or AVX-512 intrinsics code. Possible corner cases are handled: for example, if op-vect is to be applied but the size of the iteration space is not a multiple of the vector length, then a remainder loop, amenable to auto-vectorization, is inserted.

Loop interchange. All loops are interchangeable, provided that temporaries are introduced if the nest is not perfect. For the employed storage layout, the loop permutations ijk and ikj are likely to maximize performance. Conceptually, this is motivated by the fact that if the i loop were in an inner position, then a significantly higher number of load instructions would be required at every iteration. We tested this hypothesis on manually crafted kernels and found that the performance loss is greater than the gain due to the possibility of accumulating increments in a register, rather than memory, along the i loop. The choice between ijk and ikj depends on the number of load instructions that can be hoisted out of the innermost dimension. Our compiler chooses as outermost the loop along which the number of invariant loads is smaller, so that more registers remain available to carry out the computation of the element matrix.

[Figure 4 diagram: the toolchain UFL → FFC → COFFEE → PyOP2, taking the UFL specification to a kernel AST, then to kernel C code, and finally to run-time compilation and (parallel) execution over the mesh. COFFEE comprises generalized loop-invariant code motion, padding and data alignment, loop interchange, expression splitting, and vector-register tiling, driven by a cost model.]

Fig. 4: High-level view of Firedrake. COFFEE is at the core, receiving ASTs from a modified version of the FEniCS Form Compiler and producing optimized C code kernels.


1  Input: n_outer_arrays, n_inner_arrays, n_consts, n_regs
2  Output: uaj_factor, split_factor
3  n_outer_regs = n_regs / 2
4  split_factor = 0
5  // Compute splitting factor
6  while n_outer_arrays > n_outer_regs
7    n_outer_arrays = n_outer_arrays / 2
8    split_factor = split_factor + 1
9  // Compute unroll-and-jam factor for op-vect
10 n_regs_avail = n_regs - (n_outer_arrays + n_consts)
11 uaj_factor = n_regs_avail / n_inner_arrays
12 // Estimate the benefit of permuting loops
13 permute = n_outer_arrays > n_inner_arrays
14 return

Fig. 5: The cost model is employed by the compiler to estimate the most suitable unroll-and-jam (when op-vect is used) and split factors, avoiding the overhead of auto-tuning.

Loop unrolling. Loop unroll (or unroll-and-jam of outer loops) is fundamental to the exposure of instruction-level parallelism, and tuning unroll factors is particularly important.
We first observe that manual full (or extensive) unrolling is unlikely to be effective, for two reasons. Firstly, the ijk loop nest would need to be small enough that the unrolled instructions do not exceed the instruction cache, which is rarely the case: it is true that in a local assembly kernel the minimum size of the ijk loop nest is 3×3×3 (triangular mesh and polynomial order 1), but this increases rapidly with the polynomial order of the method and the discretization employed (e.g. tetrahedral meshes imply larger loop nests than triangular ones), so sizes greater than 10×10×10, for which extensive unrolling would already be harmful, are in practice very common. Secondly, manual unrolling is dangerous because it may compromise compiler auto-vectorization, by either removing loops (most compilers search for vectorizable loops) or losing spatial locality within a vector register.
By comparison with implementations characterized by manually unrolled loops, we note that recent versions of compilers like GNU's and Intel's estimate close-to-optimal unroll factors when the loops are affine and their bounds are relatively small and known at compile-time, which is the case for our kernels. Our choice, therefore, is to leave the backend compiler in charge of selecting unroll factors. This also simplifies COFFEE's cost model, described next. The only situation in which we explicitly unroll-and-jam a loop is when op-vect is used, since the transformed code prevents the Intel compiler from applying this optimization, even if specific pragmas are added.

Cost model. The cost model is shown in Figure 5. It takes into account the number of available logical vector registers, n_regs, and the number of unique variables accessed: n_consts counts variables independent of both the j and k loops and temporary registers, n_outer_arrays counts j-dependent variables, and n_inner_arrays counts k-dependent variables, assuming the ijk loop order. These values are used to estimate the unroll-and-jam and split factors for op-vect and split. If a factor is 0, then the corresponding transformation is not applied. The split transformation is triggered whenever the number of hoistable terms is larger than the registers available along the outer dimension (lines 3-8), which is approximated as half of the total (line 3). A split factor of n means that the assembly expression should be "cut" into n sub-expressions. Depending on the structure of the assembly expression, each sub-expression might end up accessing a different number of arrays; the cost model is simplified by assuming that all sub-expressions are of the same size. The unroll-and-jam factor for the op-vect transformation is determined as a function of the available logical registers, i.e.,

those not used for storing hoisted terms (lines 9-11). Finally, the profitability of loop interchange is evaluated (line 13).
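For reference, the pseudocode of Figure 5 translates almost line for line into C; the sketch below is our transcription (the struct and function names are ours), not COFFEE's actual implementation.

struct cm_result { int uaj_factor; int split_factor; int permute; };

struct cm_result cost_model(int n_outer_arrays, int n_inner_arrays,
                            int n_consts, int n_regs)
{
  struct cm_result r = { 0, 0, 0 };
  int n_outer_regs = n_regs / 2;        /* registers assumed usable for j-dependent terms (line 3) */

  /* Compute splitting factor (lines 5-8) */
  while (n_outer_arrays > n_outer_regs) {
    n_outer_arrays = n_outer_arrays / 2;
    r.split_factor = r.split_factor + 1;
  }

  /* Compute unroll-and-jam factor for op-vect (lines 9-11) */
  int n_regs_avail = n_regs - (n_outer_arrays + n_consts);
  r.uaj_factor = n_regs_avail / n_inner_arrays;

  /* Estimate the benefit of permuting loops (line 13) */
  r.permute = n_outer_arrays > n_inner_arrays;
  return r;
}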

5. PERFORMANCE EVALUATION

5.1. Experimental Setup

Experiments were run on a single core of two Intel architectures: a Sandy Bridge (i7-2600 CPU, running at 3.4GHz, with 32KB L1 cache and 256KB L2 cache) and a Xeon Phi (5110P, running at 1.05GHz in native mode, with 32KB L1 cache and 512KB L2 cache). We chose these two architectures because of the differences in the number of logical registers and SIMD lanes (16 256-bit registers in the Sandy Bridge, and 32 512-bit registers in the Xeon Phi), which can impact the optimization strategy. The icc 13.1 compiler was used. We selected the best optimization levels for each platform: -xAVX for auto-vectorization and -O2 on the Sandy Bridge, and -O3 on the Xeon Phi.
We present two studies. Firstly, in Section 5.2, we analyze the impact of the proposed code transformations, and their interplay, on three real-world representative equations. We quantify the speedups achievable at the level of the local assembly kernel, by excluding the other stages of the computation from the measurements. Then, in Section 5.3, we provide a full application investigation. Here, we use another equation, bringing the total number of examined problems to four, to demonstrate that COFFEE's optimizations allow a significant reduction in the overall computation run-time to be achieved, which practically motivates our research.

5.2. In-depth Performance Analysis of the Code Transformations and their Interplay

5.2.1. Choice and Properties of the Benchmarks. Our code transformations were evaluated on three real-world problems based on the following PDEs: (i) Helmholtz, (ii) Advection-Diffusion, and (iii) Burgers.
The three chosen benchmarks are real-life kernels and comprise the core differential operators in some of the most frequently encountered finite element problems in scientific computing. This is of crucial importance because distinct problems, possibly arising in completely different fields, may employ (subsets of) the same differential operators as our benchmarks, which implies similarities and redundant patterns in the generated code. Consequently, the proposed code transformations have a domain of applicability that goes far beyond that of the three analyzed equations. The Helmholtz and Diffusion kernels are archetypal second order elliptic operators. They are complete and unsimplified examples of the operators used to model diffusion and viscosity in fluids, and for imposing pressure in compressible fluids. As such, they are both extensively used in climate and ocean modeling. Very similar operators, for which the same optimizations are expected to be equally effective, apply to elasticity problems, which are at the base of computational structural mechanics. The Burgers kernel is a typical example of a first order hyperbolic conservation law, which occurs in real applications whenever a quantity is transported by a fluid (the momentum itself, in our case). We chose this particular kernel since it applies to a vector-valued quantity, while the elliptic operators apply to scalar quantities; this impacts the generated code, as explained next. The operators we have selected are characteristic of both the second and first order operators that dominate fluids and solids simulations.
The benchmarks were written in UFL (code available at [Luporini 2014a]) and executed over real unstructured meshes through Firedrake. The Helmholtz code has already been shown in Listing 1. For Advection-Diffusion, the Diffusion equation, which uses the same differential operators as Helmholtz, is considered. In the Diffusion kernel code, the main differences with respect to Helmholtz are the absence of the Y array and the presence of additional constants for computing the element matrix.


Table II: Performance improvement due to generalized loop-invariant code motion over the original non-optimized code.

                                   Sandy Bridge                Xeon Phi
problem      shape         p1    p2    p3    p4      p1    p2    p3    p4
Helmholtz    triangle      1.05  1.46  1.68  1.67    1.49  1.06  1.05  1.17
Helmholtz    tetrahedron   1.36  2.10  2.64  2.27    1.28  1.29  2.05  1.73
Helmholtz    prism         2.16  2.28  2.45  2.06    1.04  2.26  1.93  1.64
Diffusion    triangle      1.09  1.68  1.97  1.64    1.07  1.06  1.18  1.16
Diffusion    tetrahedron   1.30  2.20  3.12  2.60    1.00  1.38  2.02  1.74
Diffusion    prism         2.15  1.82  2.71  2.32    1.11  2.16  1.85  2.83
Burgers      triangle      1.53  1.81  2.68  2.46    1.21  1.42  2.34  2.97
Burgers      tetrahedron   1.61  2.24  1.69  1.59    1.01  2.55  0.98  1.21
Burgers      prism         2.11  2.20  1.66  1.32    1.39  1.56  1.18  1.04

Burgers is a non-linear problem employing differential operators different from those of Helmholtz and relying on vector-valued quantities, which has a major impact on the generated assembly code (see Listing 2), where a larger number of basis function arrays (X1, X2, ...) and constants (F0, F1, ..., K0, K1, ...) are generated.
These problems were studied varying both the shape of the mesh elements and the polynomial order p of the method, whereas the element family, Lagrange, is fixed. As might be expected, the larger the element shape and p, the larger the iteration space. Triangles, tetrahedra, and prisms were tested as element shapes. For instance, in the case of Helmholtz with p = 1, the size of the j and k loops for the three element shapes is, respectively, 3, 4, and 6. Moving to bigger shapes has the effect of increasing the number of basis function arrays, since, intuitively, the behaviour of the equation has to be approximated also along a third axis. On the other hand, the polynomial order only affects the problem size (the three loops i, j, and k, and, as a consequence, the size of the X and Y arrays). A range of polynomial orders from p = 1 to p = 4 was tested; higher polynomial orders are excluded from the study because of current Firedrake limitations. In all these cases, the size of the element matrix rarely exceeds 30×30, with a peak of 105×105 in Burgers with prisms and p = 4.

5.2.2. Loop interchange. In the following, only results for the loop order ijk are shown. For the reasons discussed in Section 4, loop interchanges placing the i loop in an inner position caused slowdowns; also, interchanging the j and k loops while keeping i as the outermost loop did not provide any benefits.

5.2.3. Impact of Generalized Loop-Invariant Code Motion. Table II illustrates the performance improvement obtained when loop-invariant code motion (or licm) is applied. In general, the speedups are notable. The main reasons were anticipated in Section 3.2: in the original code, 1) sub-expressions invariant to outer loops are not automatically hoisted, while 2) sub-expressions invariant to the innermost loop are hoisted, but their execution is not auto-vectorized. These observations come from inspection of the assembly code generated by the compiler.
The gain tends to grow with the computational cost of the kernels: bigger loop nests (i.e., larger element shapes and polynomial orders) usually benefit from the reduction in redundant computation, even though extra memory for the temporary arrays is required. Some discrepancies from this trend are due to less effective auto-vectorization. For instance, on the Sandy Bridge, the improvement at p = 3 is larger than that at p = 4 because, in the latter case, the size of the innermost loop is not a multiple of the vector length, and a remainder scalar loop is introduced at compile time. Since the loop nest is small, the cost of executing the extra scalar iterations can have a significant impact. The remainder loop overhead is more pronounced on the Xeon Phi, where the vector length is twice as long, which leads to proportionately larger scalar remainder loops.


Table III: Performance improvement due to generalized loop-invariant code motion, data alignment, and padding over the original non-optimized code.

                                   Sandy Bridge                Xeon Phi
problem      shape         p1    p2    p3    p4      p1    p2    p3    p4
Helmholtz    triangle      1.32  1.88  2.87  4.13    1.50  2.41  1.30  1.96
Helmholtz    tetrahedron   1.35  3.32  2.66  3.27    1.41  1.50  2.79  2.81
Helmholtz    prism         2.63  2.74  2.43  2.75    2.38  2.47  2.15  1.71
Diffusion    triangle      1.38  1.99  3.07  4.28    1.08  1.88  1.20  1.97
Diffusion    tetrahedron   1.41  3.70  3.18  3.82    1.05  1.51  2.76  3.00
Diffusion    prism         2.55  3.13  2.73  2.69    2.41  2.52  2.05  2.48
Burgers      triangle      1.56  2.28  2.61  2.77    2.84  2.26  3.96  4.27
Burgers      tetrahedron   1.61  2.10  1.60  1.78    1.48  3.83  1.55  1.29
Burgers      prism         2.19  2.32  1.64  1.42    2.18  2.82  1.24  1.25

5.2.4. Impact of Padding and Data Alignment. Table III shows the cumulative impact of loop-invariant code motion (or licm), data alignment, and padding over the original code. In the following, this version of the code is referred to as licm-ap. Padding, which avoids the introduction of a remainder loop as described in Section 5.2.3, as well as data alignment, enhance the quality of auto-vectorization. Occasionally, the execution time of licm-ap is close to that of licm, since the non-padded element matrix size is already a multiple of the vector length. Rarely, licm-ap is slower than licm (e.g. in Burgers p = 3 on the Sandy Bridge). One possible explanation is that the number of aligned temporaries introduced by licm is so large that it induces cache associativity conflicts.

5.2.5. Impact of Vector-register Tiling. In this section, we evaluate the impact of vector-register tiling. We compare two versions: the baseline, licm-ap, and vector-register tiling on top of licm-ap, which in the following is referred to simply as op-vect.
Figures 6 and 7 illustrate the speedup achieved by op-vect over licm-ap in the Helmholtz and Diffusion kernels, respectively. As explained in Section 4, the op-vect version requires the unroll-and-jam factor to be explicitly set. To distinguish between the two ways this parameter was determined, for each problem instance (equation, element shape, polynomial order) we report two bars: one shows the best speedup obtained after all feasible unroll-and-jam factors were tried; the other shows the speedup when the unroll-and-jam factor was retrieved via the cost model. In the plot legends, cost model bars are suffixed with "CM".
It is worth noticing that, in most cases, the cost model successfully determines how to transform a kernel to maximize its performance. This is chiefly because assembly kernels fit in the L1 cache, so, within a certain degree of confidence, it is possible to predict how to obtain a fast implementation by simply reasoning about the register pressure. For each problem, the cost model indicated using the default loop permutation, applying a particular unroll-and-jam factor, and not performing expression splitting, which, as explained in Section 5.2.6, only deteriorates performance in Helmholtz and Diffusion.
The rationale behind these results is that the effect of op-vect is significant in problems in which the assembly loop nest is relatively big. When the loops are short, since the number of arrays accessed at every loop iteration is rather small (between 4 and 8 temporaries, plus the element matrix itself), there is no need for vector-register tiling; extensive unrolling is sufficient to improve register reuse and, therefore, to maximize performance. However, as the iteration space becomes larger, op-vect leads to improvements of up to 1.4× on the Sandy Bridge (Diffusion, prismatic mesh, p = 4, increasing the overall speedup from 2.69× to 3.87×), and up to 1.4× on the Xeon Phi (Helmholtz, tetrahedral mesh, p = 3, bringing the overall speedup from 1.71× to 2.42×).


[Figure 6: two bar charts, (a) Sandy Bridge and (b) Xeon Phi, plotting the performance improvement over licm-ap against the polynomial order (p1-p4) for triangles, tetrahedra, and prisms, with and without the cost model ("CM").]
Fig. 6: Performance improvement over licm-ap obtained by op-vect in the Helmholtz kernel. Bars suffixed with "CM" indicate that the cost model was used to transform the kernel.

Using the Intel Architecture Code Analyzer tool [Intel Corporation 2012] on the Sandy Bridge, we confirmed that the speedups are a consequence of increased register reuse. In Helmholtz p = 4, for example, the tool showed that, when using op-vect, the number of clock cycles to execute one iteration of the j loop decreases by roughly 17%, and that this is a result of the relieved pressure on both of the data (cache) ports available in the core.
On the Sandy Bridge, we have also measured the performance of individual kernels in terms of floating-point operations per second. The theoretical peak on a single core, with the Intel Turbo Boost technology activated, is 30.4 GFlop/s. In the case of Diffusion using a prismatic mesh and p = 4, we achieved a maximum of 21.9 GFlop/s with op-vect enabled, whereas 16.4 GFlop/s was obtained when only licm-ap is used. This result is in line with expectations: analysis of the assembly code showed that, in the jk loop nest, which in this problem represents the bulk of the computation, 73% of the instructions are actually floating-point operations.
Application of op-vect to the Burgers problem induces significant slowdowns due to the large number of temporary arrays that need to be tiled, which exceeds the available logical registers on the underlying architecture. Expression splitting can be used in combination with op-vect to alleviate this issue; this is discussed in the next section.
5.2.6. Impact of Expression Splitting. Expression splitting relieves the register pressure when the element matrix evaluation needs to read from a large number of basis function arrays. As detailed in Section 3.4, the price to pay for this optimization is an increased number of accesses to the element matrix and, potentially, redundant computation. Similarly to the analysis of vector-register tiling, we compare two versions: the baseline, licm-ap, and expression splitting on top of licm-ap, which, for simplicity, in the following is referred to as split.
For the Helmholtz and Diffusion kernels, in which only between 4 and 8 temporaries are read at every loop iteration, split tends to slow down the computation, because of the aforementioned drawbacks. Slowdowns of up to 1.4× and up to 1.6× were observed, respectively, on the Sandy Bridge and the Xeon Phi. Note that the cost model prevents the adoption of the transformation: the while statement in Figure 5 is never entered.
In the Burgers kernels, between 12 and 24 temporaries are accessed at every loop iteration, so split plays a key role on the Sandy Bridge, where the number of available logical registers is only 16. Figure 8 shows the performance improvement achieved by split over licm-ap.

ACM Transactions on Architecture and Code Optimization, Vol. 11, No. 4, Article 57, Publication date: January 2015. Cross-Loop Optimization of Arithmetic Intensity for Finite Element Local Assembly 57:17

Impact of op-vect on the Diffusion kernel (Sandy Bridge) Impact of op-vect on the Diffusion kernel (Xeon Phi) 1.6 1.6 Triangles Tetrahedrons-CM Triangles Tetrahedrons-CM Triangles-CM Prisms Triangles-CM Prisms Tetrahedrons Prism-CM Tetrahedrons Prism-CM 1.4 1.4

1.2 1.2

1 1

0.8 0.8

Performance improvement over licm-ap improvement Performance 0.6 over licm-ap improvement Performance 0.6 p1 p2 p3 p4 p1 p2 p3 p4 Polynomial Order Polynomial Order (a) Sandy Bridge (b) Xeon Phi Fig. 7: Performance improvement over licm-ap obtained by op-vect in the Diffusion kernel. Bars suffixed with “CM” indicate that the cost model was used to transform the kernel.

Impact of split on the Burgers kernel (Sandy Bridge) Impact of split on the Burgers kernel (Xeon Phi) 1.6 1.6 Triangles Tetrahedrons-CM Triangles Tetrahedrons-CM Triangles-CM Prisms Triangles-CM Prisms Tetrahedrons Prisms-CM Tetrahedrons Prisms-CM 1.4 1.4

1.2 1.2

1 1

0.8 0.8

Performance improvement over licm-ap improvement Performance 0.6 over licm-ap improvement Performance 0.6 p1 p2 p3 p4 p1 p2 p3 p4 Polynomial Order Polynomial Order (a) Sandy Bridge (b) Xeon Phi Fig. 8: Performance improvement over licm-ap obtained by split in the Burgers kernel. Bars suffixed with “CM” indicate that the cost model was used to transform the kernel. split over licm-ap. In almost all cases, a split factor of 1, meaning that the original expression was divided into two parts, ensured close-to-peak perforance. The trans- formation negligibly affected register locality, so speed ups up to 1.5× were observed. For instance, on the Sandy Bridge, when p = 4 and a prismatic mesh is employed, the overall performance improvement (i.e., the one over the original code) increases from 1.44× to 2.11×. On the Xeon Phi, the impact of split is only marginal, since register spilling is limited by the presence of 32 logical vector units. On the Sandy Bridge, the performance of the Burgers kernel on a prismatic mesh was 20.0 GFlop/s from p = 1 to p = 3, while it was 21.3 GFlop/s in the case of p = 4. These values are notably close to the peak performance of 30.4 GFlop/s. Disabling split makes the performance drop to 17.0 GFlop/s for p = 1, 2, 18.2 GFlop/s for p = 3, and 14.3 GFlop/s for p = 4. These values are in line with the speedups shown in Figure 8. The split transformation was also tried in combination with op-vect (split-op-vect), al- though the cost model prevents its adoption on both platforms. Despite improvements up to 1.22×, split-op-vect never outperforms split. This is motivated by two factors: for small split factors, such as 1 and 2, the data space to be tiled is still too big, and reg- ister spilling affects run-time; for higher ones, sub-expressions become so small that, as explained in Section 5.2.5, extensive unrolling already allows to achieve a certain degree of register re-use.
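To make the transformation concrete, the following minimal sketch (hypothetical array names and sizes, not COFFEE's generated code) shows an element-matrix expression that reads four temporaries being split with a split factor of 1, that is, into two sub-expressions evaluated by separate loop nests:

/* Minimal sketch of expression splitting (split factor 1); array names and
 * sizes are hypothetical, not COFFEE's generated code.
 * Original expression, evaluated by a single loop nest:
 *   A[j][k] += f0[i][j]*g0[i][k] + f1[i][j]*g1[i][k];
 * After splitting, each loop nest keeps fewer temporaries live in registers,
 * at the price of reading and writing A twice. */
enum { NQ = 6, NB = 12 };   /* quadrature points, basis functions (assumed) */

void split_factor_1(double A[NB][NB],
                    const double f0[NQ][NB], const double g0[NQ][NB],
                    const double f1[NQ][NB], const double g1[NQ][NB])
{
  for (int i = 0; i < NQ; i++)          /* first sub-expression  */
    for (int j = 0; j < NB; j++)
      for (int k = 0; k < NB; k++)
        A[j][k] += f0[i][j] * g0[i][k];

  for (int i = 0; i < NQ; i++)          /* second sub-expression */
    for (int j = 0; j < NB; j++)
      for (int k = 0; k < NB; k++)
        A[j][k] += f1[i][j] * g1[i][k];
}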


5.2.7. Comparison with the FEniCS Form Compiler's Built-in Optimizations. We have modified the FEniCS Form Compiler (FFC) to return an abstract syntax tree representation of a local assembly kernel, rather than plain C++ code, so as to enable code transformations in COFFEE. Besides Firedrake, FFC is used in the FEniCS project [Logg et al. 2012]. In FEniCS, FFC can apply its own model-driven optimizations to local assembly kernels [Ølgaard and Wells 2010], which mainly consist of loop-invariant code motion and elimination, at code generation time, of floating-point operations involving zero-valued entries in basis function arrays.

The FEniCS Form Compiler's loop-invariant code motion differs from COFFEE's. It is based on expansion of arithmetic operations, for example applying distributivity and associativity to products and sums at code generation time, to identify terms that are invariant with respect to the whole loop nest. Depending on the way expansion is performed, the operation count may not decrease significantly.

Elimination of zero-valued terms, which are the result of using vector-valued quantities in the finite element problem, has the effect of introducing indirection arrays in the generated code. This kind of optimization is currently under development in COFFEE, although it will differ from FEniCS' by avoiding non-contiguous memory accesses, which would otherwise affect vectorization, at the price of removing fewer zero-valued contributions.

Table IV summarizes the performance achieved by COFFEE over the fastest FEniCS (FFC) implementation on the Sandy Bridge for the Burgers, Helmholtz and Diffusion kernels. Burgers' slow downs occur in the presence of a small iteration space (triangular mesh, p ∈ [1, 2]; tetrahedral mesh, p ∈ [1, 2]; prismatic mesh, p = 1). The result shown represents the worst slow down, which was obtained with a triangular mesh and p = 1. This is a consequence of removing zero-valued entries in FEniCS' basis function arrays: some operations are avoided, but indirection arrays prevent auto-vectorization, which significantly impacts performance as soon as the element matrix becomes bigger than 12×12. However, with the forthcoming zero-removal optimization in COFFEE, we expect to outperform FEniCS in all problems. In the cases of Helmholtz and Diffusion, the minimum improvements are, respectively, 1.10× and 1.18× (2D mesh, p = 1), and they tend to increase with polynomial order and element shape up to the values reported in the table.

Table IV: Performance comparison between FEniCS and COFFEE on Sandy Bridge.

Problem      Max slow down    Max speed up
Helmholtz    -                4.14×
Diffusion    -                4.28×
Burgers      2.24×            1.61×
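To illustrate the indirection arrays mentioned above, the following minimal sketch (hypothetical names and sizes, not FFC's or COFFEE's actual output) restricts an inner loop to the non-zero columns of a basis function table; the gather through nz[] is precisely the kind of non-contiguous access that inhibits auto-vectorization:

/* Zero-removal sketch: only the columns listed in nz[] are updated. */
enum { NB = 12, NZ = 6 };
static const int nz[NZ] = {0, 2, 4, 6, 8, 10};   /* assumed non-zero columns */

void zero_removal_sketch(double A[NB][NB],
                         const double phi[NB], const double psi[NB], double w)
{
  for (int j = 0; j < NB; j++)
    for (int z = 0; z < NZ; z++)                 /* skip zero-valued columns */
      A[j][nz[z]] += w * phi[j] * psi[nz[z]];
}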

5.2.8. Comparison with hand-made BLAS-based implementations. For the Helmholtz problem on a tetrahedral mesh, manual implementations based on Intel MKL BLAS were tested on the Sandy Bridge. This particular kernel can easily be reduced to a sequence of four matrix-matrix multiplies, which can be computed via calls to BLAS dgemm. In the case of p = 4, where the element matrix is of size 35×35, the computation was almost twice as slow as the version in which licm-ap was used, with the slow down being even worse for smaller problem sizes. These experiments suggest that the question of the extent to which linear algebra libraries can improve performance cannot be trivially answered. This is due to a combination of issues: the potential loss in data locality, as discussed in Section 3.4; the actual effectiveness of these libraries when the arrays are relatively small; and the problem, inherent to assembly kernels, of extracting matrix-matrix multiplies from static analysis of the kernel's code. A comprehensive study of these aspects will be addressed in future work.
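For illustration, one of those matrix-matrix multiplies could be expressed through the CBLAS interface roughly as follows; the operand shapes, the quadrature size NQ and the array names are assumptions made for this sketch, not taken from the actual kernel:

#include <mkl_cblas.h>        /* or <cblas.h> with any CBLAS implementation */

/* Sketch: accumulate into the 35x35 element matrix A (p = 4, tetrahedra) the
 * product of a 35 x NQ table B of tabulated basis functions and an NQ x 35
 * table W of basis functions pre-scaled by quadrature weights and geometry
 * terms. NQ is an assumed quadrature size. */
enum { N = 35, NQ = 84 };

void helmholtz_dgemm(double *A, const double *B, const double *W)
{
  cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
              N, N, NQ,            /* A is N x N, reduction over NQ   */
              1.0, B, NQ,          /* B: N x NQ, leading dimension NQ */
              W, N,                /* W: NQ x N, leading dimension N  */
              1.0, A, N);          /* beta = 1.0: accumulate into A   */
}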

5.2.9. On the compilation time. Firedrake and its various software components (see Figure 4) are implemented in Python and Cython, whereas the generated code is pure C. COFFEE has been written in Python so that it integrates naturally with Firedrake. Among the various Firedrake modules, COFFEE is, in general, the one in which the least amount of time is spent. The process of transforming the abstract syntax tree usually takes on the order of milliseconds. For instance, for the Helmholtz equation with p = 4 on a tiny tetrahedral mesh, it took 0.0009s to transform the abstract syntax tree, while the whole assembly process needed 0.384s. Being a domain-specific compiler, COFFEE expects C code with the structure outlined in Figure 1, so typical compiler functionalities such as dependency tracking are simpler than in a general-purpose compiler, which implies a clear advantage in compilation time and implementation complexity.

5.3. Full application study

In this section, we investigate the performance gain for an entire finite element computation developed in Firedrake, a static linear elasticity problem. The source code is available at [Luporini 2014b]. The equation is used to simulate the deformation of an object caused by pre-established loading conditions.

The execution time of a steady finite element problem is dominated by two factors: assembly and solve. The evaluation of all local element matrices and vectors, and their insertion, respectively, into a “global” sparse matrix and a global vector, compose the “assembly” phase. The global matrix and vector form a linear system, usually solved by an iterative method; this is the “solve” phase. The percentage of time spent in assembly and solve varies from problem to problem. As reiterated in the available literature, for example in [Ølgaard and Wells 2010], the computational cost of local assembly grows with the complexity of the partial differential equation (e.g. because of larger loops and more complex expressions), while the solve time increases with polynomial order and mesh discretization. The complexity of an equation depends on several factors, including the number of derivatives and coefficient functions (i.e. additional known functions that characterize the equation).

We study two versions of the static linear elasticity problem: 1) with one coefficient function, f = 1, and 2) with two coefficient functions, f = 2. More coefficient functions are also plausible; however, as we will see, the assembly process is already particularly expensive with f = 1. For each of these two versions, we examine the cases of polynomial order p = 1 and p = 2. To run our experiments, we use a single core of the Sandy Bridge architecture described in Section 5. The application runs on a tetrahedral mesh composed of 196608 elements.

In Figure 9, we show the execution times for the four test cases, split over two plots (one for each polynomial order), without and with optimizations enabled. A stacked bar captures the time spent in the assembly phase (assembly), in solving the linear system (solve), and in the other parts of the program (other; for example, setup of the problem and initialization of the coefficient functions). The non-opt and opt bars correspond, respectively, to the cases in which no optimizations and a combination of optimizations have been applied. The optimizations applied are generalized loop-invariant code motion, alignment and padding, and expression splitting. We recall that the cost of the insertion of the computed local element matrices (and vectors) into the global matrix (vector) is incorporated in assembly.

Fig. 9: Performance improvement over non-optimized code for the static linear elasticity equation on a single core of the Sandy Bridge architecture. Stacked bars report the execution time (secs) spent in assembly, solve and other, for (a) polynomial order 1 and (b) polynomial order 2, each with f = 1 and f = 2 coefficient functions, without (no-opt) and with (opt) optimizations.

We first notice that, in the scenario (f = 1, p = 1), the assembly is dominated by matrix insertion: despite the application of several transformations, only a minimal performance gain is achieved. However, if either p or f is increased, then the cost of assembly becomes larger with respect to solve, and the matrix insertion cost becomes negligible. In such cases, the transformations automatically applied by COFFEE dramatically decrease the cost of assembly. This results in a significant overall speed up and a simulation that is now dominated by solve time. It is interesting to note that generalized loop-invariant code motion was particularly invasive in this case, with 23 temporaries generated and several redundancies discovered (see Section 3.2).

In these experiments, we observe a maximum performance improvement of 1.47× over the non-optimized local assembly code, obtained in the case (f = 2, p = 2). However, we reiterate that full application speed ups increase proportionately with the amount of time spent in assembly and, therefore, with the complexity of the equation. By increasing the polynomial order and the number of coefficient functions, or by simply studying a different, more complex equation, it is our experience that performance gains become increasingly more relevant. The static linear elasticity equation was chosen to show that even relatively simple problems can be characterized by a large proportion of execution time spent in assembly.

6. GENERALITY OF THE APPROACH AND APPLICABILITY TO OTHER DOMAINS

We have demonstrated that our cross-loop optimizations for arithmetic intensity are effective in the context of automated code generation for finite element local assembly. In this section, we discuss their applicability in other computational domains and, in general, their integrability within a general-purpose compiler.

COFFEE was developed as a separate, self-contained software module, with clear input/output interfaces, rather than being incorporated within PyOP2. This choice was motivated by two critical aspects that characterize the generality of our research.

Separation of concerns. We believe that in domain-specific frameworks there must be a clear, logical separation of roles reflecting the various levels of abstraction, where domain specialists are completely separated from performance optimization. In Firedrake, for instance, COFFEE decouples the mathematical specification of a finite element method, captured by the Unified Form Language and the FEniCS Form Compiler, from code optimization. This is of fundamental importance to maximize productivity by allowing scientists to focus only on their area of expertise. Practically speaking, from the perspective of the domain-specific language and compiler designers, our optimization strategy represents an incentive to produce extremely simple representations of the code (e.g. fully-inlined mathematical expressions in the form of an abstract syntax tree, in the case of Firedrake) so as to make the architecture-aware code optimizer completely responsible for choosing and applying the most suitable set of transformations.

Generalizability to other domains. There are neither conceptual nor technical reasons which prevent our compiler from being used in applications other than Firedrake. For example, integration with the popular FEniCS framework, the pioneer of automated code generation for finite element local assembly, would be relatively easy to achieve. It is more challenging to assess the generality of the optimization strategy: the extent to which COFFEE and its transformations are transferable to other computational domains, perhaps other DSLs, and to what extent this would be helpful for improving full application performance. To answer these questions, we first need to go back to the origins of our compiler. The starting point of our work was the mathematical formulation of a local assembly operation, expressible as follows:

\[
\forall i,j \quad A^K_{ij} = \sum_{q=1}^{n_1} \sum_{k=1}^{n_2} \alpha_{k,q}(a', b', c', \ldots)\; \beta_{q,i,j}(a, b, c, d, \ldots)\; \gamma_q(w_K, z_K) \qquad (2)
\]
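Rendered as a loop nest, (2) has roughly the following shape; in this sketch α, β and γ are stored in tabulated arrays purely for illustration, whereas in the generated kernels they are inlined, problem-specific sub-expressions, and all sizes are assumed values:

enum { N1 = 6, N2 = 3, NI = 12, NJ = 12 };   /* assumed sizes */

void eq2_loop_nest(double A[NI][NJ], const double alpha[N2][N1],
                   const double beta[N1][NI][NJ], const double gamma[N1])
{
  for (int q = 0; q < N1; q++)               /* quadrature points */
    for (int i = 0; i < NI; i++)
      for (int j = 0; j < NJ; j++) {
        double r = 0.0;
        for (int k = 0; k < N2; k++)         /* inner reduction   */
          r += alpha[k][q];
        /* beta and gamma are invariant in k: precisely the structure that
         * generalized loop-invariant code motion exploits. */
        A[i][j] += r * beta[q][i][j] * gamma[q];
      }
}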

The expression represents the numerical evaluation of an integral at n1 points in the mesh element K, computing the local element matrix A. The functions α, β and γ are problem-specific and can be intricately complex, involving for example the evaluation of derivatives. We can, however, abstract from the inherent structure of α, β and γ to highlight a number of aspects.

— Optimizing mathematical expressions. Expression manipulation (e.g. simplification, decomposition into sub-expressions) opens multiple semantically equivalent code generation opportunities, characterized by different trade-offs in parallelism, redundant computation, and data locality. The basic idea is to exploit properties of arithmetic operators, such as associativity and commutativity, to re-schedule the computation suitably for the underlying architecture. Loop-invariant code motion and expression splitting follow this principle, so they can be re-adapted or extended to any domain involving the numerical evaluation of complex mathematical expressions (e.g. electronic structure calculations in physics and quantum chemistry relying on tensor contractions [Hartono et al. 2009]). In this context, we highlight three notable points.
  (1) In Equation (2), the summations correspond to reduction loops, whereas the loops over indices i and j are fully parallel. Throughout the paper we assumed that a kernel will be executed by a single thread, which is likely to be the best strategy for standard multi-core CPUs. On the other hand, we note that for certain architectures (for example GPUs) this could be prohibitive due to memory requirements. Intra-kernel parallelization is one possible solution: a domain-specific compiler such as COFFEE could map mathematical quantifiers and operators to different parallelization schemes and generate distinct variants of multi-threaded kernel code. Based on our experience, we believe this is the right approach to achieve performance portability.
  (2) The various sub-expressions in β only depend on (i.e. iterate along) a subset of the enclosing loops. In addition, some of these sub-expressions might reduce to the same values when iterating along certain iteration spaces. This code structure motivated the generalized loop-invariant code motion technique. The intuition is that whenever sub-expressions invariant with respect to different sets of affine loops can be identified, the question arises of whether, where and how to hoist them while minimizing redundant computation. Pre-computation of invariant terms also increases memory requirements due to the need for temporary arrays, so it is possible that on certain architectures the transformation could actually cause slowdowns (e.g. whenever the available per-core memory is small).
  (3) Associative arithmetic operators are the prerequisite for expression splitting. In essence, this transformation concerns resource-aware execution. In the context of COFFEE, expression splitting has been successfully applied to reduce register pressure. However, the underlying idea of re-scheduling (re-associating) operations to optimize for some generic parameters is far more general. It could be used, for example, as a starting point for kernel fission; that is, splitting a kernel into multiple parts, each characterized by less stringent memory requirements (a variant of this idea for non-affine loops in unstructured mesh applications has been adopted in Bertolli et al. [2013]). In Equation (2), for instance, not only can any of the functions α, β and γ be split (assuming they include associative operators), but α could be completely extracted and evaluated in a separate kernel. This would reduce the working set size of each of the kernel functions, an option which is particularly attractive for many-core architectures in which the available per-core memory is much smaller than in traditional CPUs.
— Code generation and applicability of the transformations. All array sizes and loop bounds, for example n1 and n2 in Equation (2), are known at code generation time. This means that “good” code can be generated. For example, loop bounds can be made explicit, arrays can be statically initialized, and pointer aliasing is easily avoidable. Further, all of these factors contribute to the applicability and the effectiveness of some of our code transformations. For instance, knowing loop bounds allows both generation of correct code when applying vector-register tiling (see Section 3.3) and discovery of redundant computation opportunities (see Section 3.2). Padding and data alignment are special cases, since they could be performed at run-time if some values were not known at code generation time. Theoretically, they could also be automated by a general-purpose compiler through profile-guided optimization, provided that some sort of data-flow analysis is performed to ensure that the extra loop iterations over the padded region do not affect the numerical results.
— Multi-loop vectorization. Compiler auto-vectorization has become increasingly effective on a variety of codes. However, to the best of our knowledge, multi-loop vectorization, involving the loading and storing of data along a subset of the loops characterizing the iteration space (rather than just along the innermost loop), is not supported by available general-purpose and polyhedral compilers. The outer-product vectorization technique presented in this paper shows that two-loop vectorization can outperform standard auto-vectorization; a sketch of the idea follows this list. In addition, we expect the performance gain to scale with the number of vectorized loops and the vector length (as demonstrated in the Xeon Phi experiments). Although the automation of multi-loop vectorization in a general-purpose compiler is far from straightforward, especially if stencils are present, we believe that this could be more easily achieved in specific domains. The intuition is to map the memory access pattern onto vector registers, and then to exploit in-register shuffling to minimize the traffic between memory and processor. By demonstrating the effectiveness of multi-loop vectorization in a real scenario, our research represents an incentive for studying this technique in a broader and more systematic way.
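The following sketch conveys the idea on AVX; it is an illustration only, not COFFEE's generated code. A 4×4 tile of the element matrix is kept in registers while a rank-1 update x[j]*y[k] is applied, so the values loaded from x and y are reused across both loops. Unaligned accesses are used here for simplicity, whereas COFFEE's padding and data alignment would permit aligned ones.

#include <immintrin.h>

/* Two-loop ("outer-product") vectorization sketch: update a 4x4 tile of the
 * element matrix A (row stride lda) with the rank-1 term x[j]*y[k]. */
void op_vect_tile(double *A, int lda, const double *x, const double *y)
{
  __m256d yv = _mm256_loadu_pd(y);                /* y[0..3] along k */
  for (int j = 0; j < 4; j++) {
    __m256d xv = _mm256_broadcast_sd(&x[j]);      /* replicate x[j]  */
    __m256d av = _mm256_loadu_pd(&A[j * lda]);    /* tile row j      */
    av = _mm256_add_pd(av, _mm256_mul_pd(xv, yv));
    _mm256_storeu_pd(&A[j * lda], av);
  }
}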

7. RELATED WORK

The finite element method is extensively used to approximate solutions of PDEs. Well-known frameworks and applications include Nek5000 [Fischer et al. 2008], the FEniCS project [Logg et al. 2012], Fluidity [AMCG 2010], and of course Firedrake. Numerical integration based on quadrature, as in Firedrake, is usually employed to implement the local assembly phase. The recent introduction of domain-specific languages (DSLs) to decouple the finite element specification from its underlying implementation has facilitated the development of novel approaches. Methods based on tensor contraction [Kirby and Logg 2006] and symbolic manipulation [Russell and Kelly 2013] have been implemented. Nevertheless, it has been demonstrated that quadrature-based integration remains the most efficient choice for a wide class of problems [Ølgaard and Wells 2010], which motivates our work on COFFEE.

Optimization of quadrature-based local assembly for CPU architectures has been addressed in FEniCS [Ølgaard and Wells 2010]. The comparison between COFFEE and this work has been presented in Section 5.2.7. In Markall et al. [2010], and more recently in Knepley and Terrel [2013], the same problem has been studied for GPU architectures. In Kruel and Bana [2013], variants of the standard numerical integration algorithm have been specialized and evaluated for the PowerXCell processor, but an exhaustive study from the compiler viewpoint, like ours, is missing, and none of the optimizations presented in Section 3 are mentioned. Among these efforts, to the best of our knowledge, COFFEE is the first work targeting low-level optimizations through a real compiler approach.

The code transformations presented are inspired by standard compiler optimizations and exploit domain properties. Our loop-invariant code motion technique identifies invariant sub-expressions and redundant computation by analyzing all loops in an iteration space, which is a generalization of the algorithms often implemented by general-purpose compilers. Expression splitting is an abstract variant of loop fission based on properties of arithmetic operators. The outer-product vectorization is an implementation of tiling at the level of vector registers; tiling, or “loop blocking”, is commonly used to improve data locality, especially for caches. Padding has been used to achieve data alignment and to improve the effectiveness of vectorization. A standard reference for the compilation techniques re-adapted in this work is Aho et al. [2007].

Our compiler-based optimization approach is made possible by the top-level DSL, which enables automated code generation. DSLs have been proven successful in auto-generating optimized code for other domains: Spiral [Püschel et al. 2005] for numerical algorithms in digital signal processing, LGen [Spampinato and Püschel 2014] for dense linear algebra, and Pochoir [Tang et al. 2011] and SDSL [Henretty et al. 2013] for image processing and finite difference stencils. Similarly, PyOP2 is used by Firedrake to express iteration over unstructured meshes in scientific codes. COFFEE improves automated code generation in Firedrake.

Many code generators, like those based on the polyhedral model [Bondhugula et al. 2008] and those driven by domain knowledge [Stock et al. 2011], make use of cost models. The alternative of using auto-tuning to select the best implementation for a given problem on a certain platform has been adopted by Nek5000 [Shin et al. 2010] for small matrix-matrix multiplies, by the ATLAS library [Whaley and Dongarra 1998], and by FFTW [Frigo and Johnson 2005] for fast Fourier transforms. In both cases, pruning the implementation space is fundamental to reducing complexity and overhead. Likewise, COFFEE uses a cost model and heuristics (Section 4) to steer the optimization process.

8. CONCLUSIONS

In this paper, we have presented the study and systematic performance evaluation of a class of composable cross-loop optimizations for improving arithmetic intensity in finite element local assembly kernels, and their integration in a novel compiler, COFFEE. In the context of automated code generation for finite element local assembly, COFFEE is the first compiler capable of introducing low-level optimizations to maximize instruction-level parallelism, register locality and SIMD vectorization.

Assembly kernels have particular characteristics. Their iteration space is usually very small, with the size depending on aspects such as the degree of accuracy one wants to reach (the polynomial order of the method) and the mesh discretization employed. The data space, in terms of the number of arrays and scalars required to evaluate the element matrix, grows proportionately with the complexity of the finite element problem. COFFEE has been developed taking into account all of these degrees of freedom, based on the idea that reducing the problem of local assembly optimization to a fixed sequence of transformations is far too simplistic if close-to-peak performance is to be reached. The various optimizations overcome limitations of current vendor and research compilers. The exploitation of domain knowledge allows some of them to be particularly effective, as demonstrated by our experiments on two state-of-the-art Intel platforms.

Further work includes a comprehensive study of the feasibility of, and constraints on, transforming kernels into a sequence of calls to external linear algebra libraries. As suggested by the performance evaluation, an exploration of more sophisticated approaches to expression splitting also looks promising. COFFEE is already integrated with Firedrake and supports all of the problems expressible in Firedrake. In addition, we have discussed the generality and the applicability of the proposed code transformations to other domains.

REFERENCES
A.V. Aho, M.S. Lam, R. Sethi, and J.D. Ullman (Eds.). 2007. Compilers: principles, techniques, and tools (second ed.). Pearson/Addison Wesley, Boston, MA, USA. xxiv + 1009 pages. http://www.loc.gov/catdir/toc/ecip0618/2006024333.html
M. S. Alnæs, A. Logg, K. B. Ølgaard, M. E. Rognes, and G. N. Wells. 2014. Unified Form Language: A domain-specific language for weak formulations of partial differential equations. ACM Trans. Math. Software 40, 2, Article 9 (2014), 9:1–9:37 pages. DOI:http://dx.doi.org/10.1145/2566630
AMCG. 2010. Fluidity Manual (version 4.0-release ed.). Applied Modelling and Computation Group, Department of Earth Science and Engineering, South Kensington Campus, Imperial College London, London, SW7 2AZ, UK. Available at http://hdl.handle.net/10044/1/7086.
C. Bertolli, A. Betts, N. Loriant, G.R. Mudalige, D. Radford, D.A. Ham, M.B. Giles, and P.H.J. Kelly. 2013. Compiler Optimizations for Industrial Unstructured Mesh CFD Applications on GPUs. In Languages and Compilers for Parallel Computing, Hironori Kasahara and Keiji Kimura (Eds.). Lecture Notes in Computer Science, Vol. 7760. Springer Berlin Heidelberg, 112–126. DOI:http://dx.doi.org/10.1007/978-3-642-37658-0_8
U. Bondhugula, A. Hartono, J. Ramanujam, and P. Sadayappan. 2008. A Practical Automatic Polyhedral Parallelizer and Locality Optimizer. In Proceedings of the 2008 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '08). ACM, New York, NY, USA, 101–113. DOI:http://dx.doi.org/10.1145/1375581.1375595
Z. DeVito, N. Joubert, F. Palacios, S. Oakley, M. Medina, M. Barrientos, E. Elsen, F. Ham, A. Aiken, K. Duraisamy, E. Darve, J. Alonso, and P. Hanrahan. 2011. Liszt: A Domain Specific Language for Building Portable Mesh-based PDE Solvers. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC '11). ACM, New York, NY, USA, Article 9, 12 pages. DOI:http://dx.doi.org/10.1145/2063384.2063396
Firedrake contributors. 2014. The Firedrake Project. http://www.firedrakeproject.org.
P.F. Fischer, J.W. Lottes, and S.G. Kerkemeier. 2008. nek5000 Web page. http://nek5000.mcs.anl.gov.
M. Frigo and S.G. Johnson. 2005. The design and implementation of FFTW3. In Proceedings of the IEEE. 216–231.
A. Hartono, Q. Lu, T. Henretty, S. Krishnamoorthy, H. Zhang, G. Baumgartner, D.E. Bernholdt, M. Nooijen, R. Pitzer, J. Ramanujam, and P. Sadayappan. 2009. Performance Optimization of Tensor Contraction Expressions for Many-Body Methods in Quantum Chemistry. The Journal of Physical Chemistry A 113, 45 (2009), 12715–12723. DOI:http://dx.doi.org/10.1021/jp9051215 PMID: 19888780.
T. Henretty, R. Veras, F. Franchetti, L.-N. Pouchet, J. Ramanujam, and P. Sadayappan. 2013. A Stencil Compiler for Short-vector SIMD Architectures. In Proceedings of the 27th International ACM Conference on International Conference on Supercomputing (ICS '13). ACM, New York, NY, USA, 13–24. DOI:http://dx.doi.org/10.1145/2464996.2467268
Intel Corporation. 2012. Intel Architecture Code Analyzer (IACA). [Online]. Available: http://software.intel.com/en-us/articles/intel-architecture-code-analyzer/.


R.C. Kirby, M. Knepley, A. Logg, and L.R. Scott. 2005. Optimizing the Evaluation of Finite Element Matrices. SIAM J. Sci. Comput. 27, 3 (Oct. 2005), 741–758. DOI:http://dx.doi.org/10.1137/040607824
R.C. Kirby and A. Logg. 2006. A Compiler for Variational Forms. ACM Trans. Math. Softw. 32, 3 (Sept. 2006), 417–444. DOI:http://dx.doi.org/10.1145/1163641.1163644
M. Knepley and A. Terrel. 2013. Finite Element Integration on GPUs. ACM Trans. Math. Softw. 39, 2, Article 10 (Feb. 2013), 13 pages. DOI:http://dx.doi.org/10.1145/2427023.2427027
F. Kruel and K. Bana. 2013. Vectorized OpenCL Implementation of Numerical Integration for Higher Order Finite Elements. Comput. Math. Appl. 66, 10 (Dec. 2013), 2030–2044. DOI:http://dx.doi.org/10.1016/j.camwa.2013.08.026
A. Logg, K.-A. Mardal, and G. N. Wells (Eds.). 2012. Automated Solution of Differential Equations by the Finite Element Method. Lecture Notes in Computational Science and Engineering, Vol. 84. Springer. DOI:http://dx.doi.org/10.1007/978-3-642-23099-8
F. Luporini. 2014a. Helmholtz, Advection-Diffusion, and Burgers UFL code. https://github.com/firedrakeproject/firedrake/tree/pyop2-ir-perf-eval/tests/perf-eval.
F. Luporini. 2014b. Static linear elasticity code. https://github.com/firedrakeproject/firedrake-bench/tree/experiments/elasticity.
G.R. Markall, D.A. Ham, and P.H.J. Kelly. 2010. Towards generating optimised finite element solvers for GPUs from high-level specifications. Procedia Computer Science 1, 1 (2010), 1815–1823. DOI:http://dx.doi.org/10.1016/j.procs.2010.04.203 ICCS 2010.
G. R. Markall, F. Rathgeber, L. Mitchell, N. Loriant, C. Bertolli, D. A. Ham, and P. H. J. Kelly. 2013. Performance portable finite element assembly using PyOP2 and FEniCS. In Proceedings of the International Supercomputing Conference (ISC) '13 (Lecture Notes in Computer Science), Vol. 7905. In press.
K. B. Ølgaard and G. N. Wells. 2010. Optimizations for quadrature representations of finite element tensors through automated code generation. ACM Trans. Math. Softw. 37, 1, Article 8 (Jan. 2010), 23 pages. DOI:http://dx.doi.org/10.1145/1644001.1644009
M. Püschel, J.M.F. Moura, J. Johnson, D. Padua, M. Veloso, B. Singer, J. Xiong, F. Franchetti, A. Gacic, Y. Voronenko, K. Chen, R.W. Johnson, and N. Rizzolo. 2005. SPIRAL: Code Generation for DSP Transforms. Proceedings of the IEEE, special issue on “Program Generation, Optimization, and Adaptation” 93, 2 (2005), 232–275.
F. Rathgeber, G.R. Markall, L. Mitchell, N. Loriant, D.A. Ham, C. Bertolli, and P.H.J. Kelly. 2012. PyOP2: A High-Level Framework for Performance-Portable Simulations on Unstructured Meshes. In High Performance Computing, Networking, Storage and Analysis (SCC), 2012 SC Companion. 1116–1123. DOI:http://dx.doi.org/10.1109/SC.Companion.2012.134
D. Rossinelli, B. Hejazialhosseini, P. Hadjidoukas, C. Bekas, A. Curioni, A. Bertsch, S. Futral, S.J. Schmidt, N.A. Adams, and P. Koumoutsakos. 2013. 11 PFLOP/s Simulations of Cloud Cavitation Collapse. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC '13). ACM, New York, NY, USA, Article 3, 13 pages. DOI:http://dx.doi.org/10.1145/2503210.2504565
F.P. Russell and P.H.J. Kelly. 2013. Optimized Code Generation for Finite Element Local Assembly Using Symbolic Manipulation. ACM Trans. Math. Software 39, 4 (2013).
J. Shin, M. W. Hall, J. Chame, C. Chen, P. F. Fischer, and P. D. Hovland. 2010. Speeding Up Nek5000 with Autotuning and Specialization. In Proceedings of the 24th ACM International Conference on Supercomputing (ICS '10). ACM, New York, NY, USA, 253–262. DOI:http://dx.doi.org/10.1145/1810085.1810120
D.G. Spampinato and M. Püschel. 2014. A Basic Linear Algebra Compiler. In International Symposium on Code Generation and Optimization (CGO).
K. Stock, T. Henretty, I. Murugandi, P. Sadayappan, and R. Harrison. 2011. Model-Driven SIMD Code Generation for a Multi-resolution Tensor Kernel. In Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium (IPDPS '11). IEEE Computer Society, Washington, DC, USA, 1058–1067. DOI:http://dx.doi.org/10.1109/IPDPS.2011.101
Y. Tang, R.A. Chowdhury, B.C. Kuszmaul, C.-K. Luk, and C.E. Leiserson. 2011. The Pochoir Stencil Compiler. In Proceedings of the Twenty-third Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA '11). ACM, New York, NY, USA, 117–128. DOI:http://dx.doi.org/10.1145/1989493.1989508
P.E.J. Vos, S.J. Sherwin, and R.M. Kirby. 2010. From H to P Efficiently: Implementing Finite and Spectral/Hp Element Methods to Achieve Optimal Performance for Low- and High-order Discretisations. J. Comput. Phys. 229, 13 (July 2010), 5161–5181. DOI:http://dx.doi.org/10.1016/j.jcp.2010.03.031
R.C. Whaley and J.J. Dongarra. 1998. Automatically Tuned Linear Algebra Software. In Proceedings of the 1998 ACM/IEEE Conference on Supercomputing (Supercomputing '98). IEEE Computer Society, Washington, DC, USA, 1–27. http://dl.acm.org/citation.cfm?id=509058.509096
