THESIS

PRESENTED TO

THE UNIVERSITÉ DE BORDEAUX

ÉCOLE DOCTORALE DE MATHÉMATIQUES ET D'INFORMATIQUE

by Christopher Haine

TO OBTAIN THE DEGREE OF

DOCTOR

SPECIALITY: COMPUTER SCIENCE

Kernel optimization by layout restructuring

Defense date: 3 July 2017

Before the examination committee composed of:

Henri-Pierre CHARLES, Directeur de Recherche, CEA (Reviewer)
Allen MALONY, Professor, University of Oregon (Reviewer)
Emmanuel JEANNOT, Directeur de Recherche, Inria (Examiner)
Pascale ROSSE-LAURENT, Ingénieur, Atos (Examiner)
Olivier AUMAGE, Chargé de Recherche, Inria (Advisor)
Denis BARTHOU, Professeur des Universités, Bordeaux INP (Thesis Director)

2017

Abstract

Careful data layout design is crucial for achieving high performance, as nowadays processors waste a considerable amount of time stalled on memory transactions; in particular, spatial and temporal locality have to be optimized. However, data layout transformation is an area left largely unexplored by state-of-the-art compilers, due to the difficulty of evaluating the possible performance gains of transformations. Moreover, optimizing data layout by hand is time-consuming and error-prone, and layout transformations are too numerous to be tried by hand in the hope of discovering a high performance version. We propose to guide application programmers through data layout restructuring with extensive feedback, firstly by providing a comprehensive multidimensional description of the initial layout, built via analysis of memory traces collected from the application binary, ultimately aiming at pinpointing problematic strides at the instruction level, independently of the input language. We choose to focus on layout transformations translatable to a C-like formalism to aid user understanding, which we apply and assess on a case study composed of two representative multithreaded real-life applications, a cardiac wave simulation and a lattice QCD simulation, with different inputs and parameters. The performance prediction of the different transformations matches (within 5%) the performance of hand-optimized layout code.

Keywords: Performance profiling, Layout restructuring, Vectorization, Binary rewriting

Résumé

Careful data structuring is paramount for obtaining high performance, as current processors waste a considerable amount of time waiting for the completion of memory transactions. In particular, the spatial and temporal locality of data have to be optimized. However, data structure transformations are not properly explored by compilers, because of the difficulty of evaluating the performance of potential transformations. Moreover, optimizing data structures is time-consuming, error-prone, and the transformations to consider are too numerous to be implemented by hand in the hope of finding an efficient code version. We propose to guide programmers through the data restructuring process with extensive user feedback, first by giving a multidimensional description of the initial data structure, obtained through an analysis of memory traces collected from the user application binary, in order to localize stride problems at the instruction level, independently of the input language. We choose to focus our study on data structure transformations translatable into a C-like formalism to foster user understanding, which we apply and evaluate on two case studies that are real applications, namely a cardiac wave simulation and a lattice quantum chromodynamics simulation, with different input sets. The performance prediction of the different transformations matches the hand-rewritten versions to within 5%.

Keywords: Performance profiling, Data restructuring, Vectorization, Binary rewriting

Contents

Introduction
    Data layout design for high performance
    Contributions
    Outline

1 Context and Problem
    1.1 The memory layout/pattern issue
        1.1.1 Modern architectures complexity
        1.1.2 The locality issue
        1.1.3 Layout abstractions
    1.2 The complicated relationship between compilers and data layouts
        1.2.1 Static approaches
        1.2.2 Dynamic approaches
        1.2.3 Data restructuring techniques
        1.2.4 Vectorization
    1.3 The lack of user feedback
        1.3.1 Compiler feedback
        1.3.2 Tools
        1.3.3 The importance of quantification
    1.4 Towards a data layout restructuring framework proposal
        1.4.1 Proposition overview
        1.4.2 Data restructuring evaluation

2 Vectorization
    2.1 Layout Transformations To Unleash Compiler Vectorization
    2.2 Hybrid Static/Dynamic Dependence Graph
        2.2.1 Static, Register-Based Dependence Graph
        2.2.2 Dynamic Dependence Graph
    2.3 SIMDization Analysis
        2.3.1 Vectorizable Dependence Graph
        2.3.2 Code Transformation Hints
    2.4 Conclusion

3 Layout/pattern analysis
    3.1 Data layout formalism
    3.2 Layout detection
    3.3 Delinearization
        3.3.1 General Points
        3.3.2 Properties
        3.3.3 Characterization
        3.3.4 Case Study
    3.4 Conclusion

4 Transformations
    4.1 Layout Operations
        4.1.1 Permutation
        4.1.2 Splitting
        4.1.3 Compression
    4.2 Exploration
        4.2.1 Basic Constraints
        4.2.2 Locality Constraints
        4.2.3 Parallelism Constraints
    4.3 Case Study
    4.4 Conclusion

5 Code Rewriting and User Feedback
    5.1 Systematic Code Rewriting
        5.1.1 On locality
        5.1.2 Formalism Interpretation
        5.1.3 Copying
        5.1.4 Remapping
    5.2 User Feedback
        5.2.1 Layout issues pinpointing
        5.2.2 Hinting the rewriting
    5.3 Low-level implementation
        5.3.1 Loop kernel rewriting
        5.3.2 SIMDization
    5.4 Conclusion

6 Transformations Evaluation
    6.1 Evaluation methodology
        6.1.1 Principle of in-vivo evaluation
        6.1.2 Automatic mock-up generation and vectorization
        6.1.3 Current state of implementation
    6.2 Experimental results
        6.2.1 TSVC
        6.2.2 Lattice QCD benchmark without preconditioning
        6.2.3 Lattice QCD benchmark with even/odd preconditioning
        6.2.4 Lattice QCD application without preconditioning
        6.2.5 2D cardiac wave propagation simulation application
    6.3 Conclusion

Conclusion and Future Challenges
    6.4 Summary
    6.5 Perspectives


Introduction

Data layout design for high performance

Application performance is a major concern in many domains: constraints from embedded systems for mobile technologies, graphics rendering constraints for video processing, or scientific constraints such as precision and problem size, for instance in the area of scientific computing – or High Performance Computing –, all require a significant focus on the efficient utilization of computing resources. Some applications are so demanding that they require supercomputers, that is, clusters of computers working together on the same scientific problem, to even address the most computationally challenging tasks. In such a context, performance is key to minimizing the time to solution; in particular, some problems are intractable without high performance.

In computer science there are many vectors of performance to address. One of them is memory usage, as modern processors waste a considerable amount of time stalled on the completion of memory transactions. Intrinsic memory technology considerations prevent proper synergy between processor and memory, and memory caches have proven necessary to partially circumvent this issue. Application code performance is therefore highly sensitive to cache usage; however, optimal cache usage is difficult to achieve and poorly approximable.

Pushed by the needs of specialized domains, many architectures have developed with distinguishing features, such as Graphics Processing Units (GPUs), which in particular use large vectors to tackle imaging problems. Also, the tremendous demand on computer resources fuels systematic architectural changes, creating a wide landscape of computer architectures. In high performance computing, for instance, it is common to schedule tasks on heterogeneous systems, that is, systems composed of different flavours of processors such as CPU/GPU pairs. Given that performance is intimately tied to each given architecture, it becomes difficult to produce efficient code.

In particular, cache memories themselves differ from one architecture to another. This implies that memory access optimizations for one given architecture may impair performance on another. Memory accesses are typically performed on distinct memory sections – or layouts – with access patterns that have a direct impact on cache usage: regular coalesced accesses tend to improve cache usage, while strided patterns tend to cause performance issues. Moreover, the advent of SIMD units on general purpose processors (CPUs) allows further performance improvement if data layout accesses are contiguous. Layout design should in theory be adapted to each targeted architecture in order to reach high performance.

As a result, hardware is generally underutilized, as complicated access patterns on data layouts do not use the cache memories optimally or near-optimally, and vectorization over SIMD units often fails for various reasons, such as a lack of contiguity in particular. To efficiently take advantage of computing resources, automated tools are required, both to minimize programming time

and to rapidly find near-optimal solutions.

A natural candidate for automatic code optimization would be compilers, which are equipped with state-of-the-art techniques to produce quality binaries for a variety of target architectures. However, compilers face numerous difficulties, particularly in the context of layout transformations, which considerably hinder their capabilities and prevent them from reaching peak performance. Levels of abstraction in the code typically hide semantic specificities useful for compiler optimizations; the order in which given compiler optimizations are applied matters; the evaluation of transformations, in particular data layout transformations, relies on basic heuristics and theoretical models that are considerably inaccurate; and the lack of information at compile time on data layouts in general, in particular pointer expressions and indirections, is challenging. Besides, compilers by themselves do not advance in step with architectural evolution. As a result, the performance gap between compiled application code and expert programmer code [93] is significant. Finally, compilers provide no feedback as to why data restructuring is not applied, what the actual issues are, and how to resolve them.

Contributions

There is another class of programs, commonly referred to as tools, that are used specifically to pinpoint issues in applications, so that the programmer can modify the code to optimize for performance. MAQAO [8] and TAU [98], for instance, are such performance profiling tools. However, most tools merely pinpoint problems; no optimizations are suggested to the programmer to improve the code, leaving the programmer with the burden of inventing and testing layout restructuring strategies by hand and for each targeted architecture. Assessing the benefit of transformations beforehand would represent valuable information for the programmer, indicating which transformation is opportune and how much gain is expected. Tools need more sophistication to address this topic; in particular, quality user feedback has to be provided in order to tackle the data layout restructuring issue. To address compiler limitations relative to layout issues and the lack of feedback from both compilers and tools, we propose a novel approach to data restructuring, which consists in collecting memory traces from the user application binary in order to analyse, transform, and evaluate new layout possibilities. We make the following contributions:

– Data layout formalism. Data layouts are defined not only by the layout itself but also by the layout access patterns. Such a complete definition is particularly verbose in code, especially when the number of access patterns is significant. We propose a concise formalization of layouts and patterns, which not only eases discussions throughout this document and with the user via elaborate feedback, but also allows the systematic expression of layout transformations. This contribution was published in [38].

– Initial layout normal form. Layout restructuring implies analysing the initial, suboptimal layout in the first place, then expressing it in our data layout formalism with the aim of exploring transformations on it. The analysis consists in a delinearization of the initial layout, based on the flat layout definition given by memory traces of the user application, in order to turn the flat access definition into a multidimensional entity with the particular property of constituting a normal form of the layout. This means in particular that the initial layout expression in our formalism is independent of any compiler optimizations

on memory accesses such as unrolling, for instance, which creates multiple instructions for a single initial access instruction.

– In-vivo evaluation. Given the importance of providing quality user feedback, an evaluation technique is indispensable to quantify the potential benefit of each transformation. Considering that compiler static models and simulation approaches are typically inaccurate and incomplete, we propose to evaluate transformations in-vivo, that is, in the context of the user application run, so as to obtain a realistic performance estimation. The main idea is to perform a checkpoint at the application kernel of interest, and then to re-execute the kernel with different layout flavours using mock-up codes. This contribution was published in [39].

– SIMD detection and application. The introduction of SIMD units in general purpose processors offers a significant performance gain, provided they are actually used. Not all kernels are candidates for vectorization over SIMD units; in particular, dependence issues may hinder vectorization. Therefore, determining intrinsic kernel vectorizability is essential, and our SIMD detection technique was published in [6]. Furthermore, for all vectorizable kernels we implement SIMDization in our evaluation codes, as well as transformations that are designed to take advantage of this particular feature.

These contributions provide the application programmer with elaborate feedback. First, a performance issue report pinpoints data layout issues as a whole as well as instruction by instruction. Second, identified transformation strategies are proposed, helping to copy the data from the initial layout to the new one as well as to change the mapping of each individual memory instruction. Third, the evaluation step quantifies every single restructuring strategy, allowing the user to make an informed restructuring choice. This approach has several benefits. First of all, unlike many optimizations that are architecture-dependent and obfuscate the code, our transformation suggestions simply alter the layout mapping. Second, layout restructuring is intrinsically independent of control flow optimizations; it can therefore be performed alongside any control transformations and has no negative impact on them. Finally, the approach is not architecture-dependent and can be implemented on any system; the evaluation step measures actual layout performance, encompassing fluctuations due to specific architectural effects.

Outline

Chapter 1 presents the context and problem that lead to the need for data layout restructuring. Chapter 2 explains how to determine kernel vectorizability from both static and dynamic analysis of the application binary. Chapter 3 shows the steps to reconstruct the application's original problematic layout from memory traces into a formal representation. Chapter 4 presents the transformation space exploration heuristic, which manipulates the formal layout representation. Chapter 5 explains how to rewrite the code based on the new layout mappings selected by the heuristic; this allows generating evaluation code versions as well as user feedback. Chapter 6 details the data restructuring evaluation methodology and experimental results. Finally, the last chapter concludes this work and presents perspectives and future challenges.


Chapter 1

Context and Problem

1.1 The memory layout/pattern issue
    1.1.1 Modern architectures complexity
    1.1.2 The locality issue
    1.1.3 Layout abstractions
1.2 The complicated relationship between compilers and data layouts
    1.2.1 Static approaches
    1.2.2 Dynamic approaches
    1.2.3 Data restructuring techniques
    1.2.4 Vectorization
1.3 The lack of user feedback
    1.3.1 Compiler feedback
    1.3.2 Tools
    1.3.3 The importance of quantification
1.4 Towards a data layout restructuring framework proposal
    1.4.1 Proposition overview
    1.4.2 Data restructuring evaluation

Nowadays computer architectures are complex, numerous, and differ noticeably from one another. Thus, code, and especially the memory accesses to data layouts, must be strongly architecture-dependent in order to achieve high performance. The notion of a good layout is vague to programmers, as there is no such thing as performance portability. Moreover, the different layers of data structure abstraction form a gap between the logical and the physical layout, so writing efficient code has an unacceptable complexity for the user; an automated approach is therefore needed. Given the limits of compilers in the context of data restructuring and the lack of feedback from both compilers and tools, we propose a framework aiming at guiding the programmer through the layout restructuring process. First, we pinpoint the layout issues with a comprehensive report; then we propose layout transformations with both a high-level representation and a precise instruction-level modification report. Finally, we quantify each of the transformations in terms of performance gain, in order to direct the user towards the potentially most profitable layout.


Figure 1.1: Memory performance versus CPU performance [82]

1.1 The memory layout/pattern issue

1.1.1 Modern architectures complexity

In recent years, a lot of research effort has been devoted to computer architecture design to address the ever-growing computational need. Semiconductor manufacturers, while watching clock speeds reach a plateau, have repeatedly applied transistor miniaturization, one of the main vectors of processor design, increasing single-processor complexity to keep obtaining better performance this way. This has made processor performance grow steadily, following Moore's law [75]. The trend is that the number of transistors on a microprocessor chip doubles every 18 months, making architectural design specificities change on a frequent, regular basis. However, memory technology development has not seen such growth, as shown by Figure 1.1; in fact, since the 1980s the performance gap between processors and memory has widened steadily. Current memory technology is intrinsically unable to keep up with processor performance, as first discussed in [116], prompting the search for architectural solutions to circumvent the so-called memory wall: memory performance limitations mask the significant performance improvements made on processors, as codes are slowed down because processors spend too much time stalled on memory requests that take too long to complete.

Memory has been forced off the microprocessor given its growing size and space requirements, which induces an increased latency with respect to processor registers. This physical constraint has been addressed by the introduction of cache memory into the microarchitecture, creating a memory hierarchy from the registers to the main memory. It allows frequently accessed data to be closer to the processor and thus improves latencies. When a data element is requested by the processor, the cache is first checked for the presence of said element. If it is present, the latency to fetch the element is lower than that of a direct main memory access. If it is not, the element is fetched from the main memory and brought into the cache, in place of older data, speculating that it may be used again (temporal locality) in the near future. Given the success of cache memory, this approach has been further developed to continue optimizing memory access performance. Today,

different levels of cache are commonly found in most microprocessors.

Caches take advantage of the so-called spatial locality of data accesses to improve memory usage. Indeed, given that the memory is far from the processor, caches request blocks – or cache lines – of memory elements from the RAM instead of single elements, under the speculation that consecutive elements may be used in the near future, as is the case, for instance, for large array or vector accesses. The cache line size is typically 64 bytes on x86 CPU architectures like Sandy Bridge, which corresponds to 16 single precision floating point elements or 8 double precision elements. This is an arguably small number of elements, and on typical compute-intensive kernels operating on large arrays, one may be tempted to pre-emptively load even larger blocks of memory into the cache to further optimize memory access performance. In turn, both hardware and software prefetching are implemented to anticipate the spatial locality of data accesses even further, by prefetching multiple cache lines. Consequently, performing contiguous accesses is essential, as the hardware is designed to take advantage of spatial locality to cover for the inherent inability of the memory technology to provide the processor with data sufficiently rapidly.

Because it has been assigned the difficult task of bridging the gap between processor and memory performance, cache design choices have a direct impact on the memory access performance of applications. For instance, the respective cache level sizes alone make performance vary for a single given application, as a dataset fitting entirely or almost entirely in the cache can greatly improve its performance. For any dataset size, and especially for significantly large datasets, blocking can help, by choosing block sizes that fit the cache memory. It consists in arranging the memory accesses such that temporal and spatial locality in particular are maximized. The idea is that the blocks loaded from the main memory into the cache have to be filled with elements that all need to be used quickly by the program, so they do not get evicted from the cache before they are used, which would provoke cache misses and force reloads. To ensure a good temporal locality property, an element that must be accessed several times has to be re-accessed quickly, so it does not get evicted from the cache. Cache evictions are indeed ruled by a Least Recently Used (LRU) policy in most modern architectures, which states that the least recently used cache line is replaced by a newly requested cache line on a cache miss.
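As an illustration, the following is a minimal sketch of blocking applied to a transposed matrix copy (hypothetical code; B is an assumed tunable tile size). The tiles confine the strided accesses to a working set small enough to stay in cache, so each loaded cache line is fully consumed before it can be evicted:

    #define N 1024
    #define B 64   /* hypothetical tile size, tuned so a B x B tile fits in cache */

    /* Blocked transposed copy: strided accesses to src stay within one
     * B x B tile at a time, maximizing reuse of loaded cache lines. */
    void transpose_blocked(float dst[N][N], const float src[N][N])
    {
        for (int ii = 0; ii < N; ii += B)
            for (int jj = 0; jj < N; jj += B)
                for (int i = ii; i < ii + B; i++)
                    for (int j = jj; j < jj + B; j++)
                        dst[i][j] = src[j][i];
    }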

Another cache design choice concerns the associativity of the respective cache levels. Within the cache, memory references are segregated into different sets, which prevents too many inopportune evictions. Each set contains n blocks, where n is the given number in n-way associativity, commonly varying from 4 to 16 depending on the architecture and the cache level. For instance, the Intel Sandy Bridge L1 and L2 caches are 8-way associative, while the L3 is 16-way associative, whereas the AMD Bulldozer L1 cache is 4-way associative and its L2 cache is 16-way associative. The number of blocks in each set allows leeway in the eviction policy, as the least recently used entry among the n blocks is the one evicted, so intensive memory accesses do not thwart one another. Nevertheless, with a significant number of arrays with respect to n, which depends on the application, evictions, and thus cache misses, can happen more often, sinking the application performance. As a result, because of varying implementations of cache designs

considering cache size, associativity, energy consumption, latencies and other important parameters, programmers cannot assume that their application data layout performance is constant across different architectures: performance is not portable. Conversely, taking these parameters into account when writing efficient code can have a great impact on application performance.

In the beginning of the 2000s, the first multi-core processors appeared with the IBM Power 4. Having several cores on a single chip increases processor performance while avoiding raising the processor frequency, and benefits from thread-level parallelism, as many applications are able to take advantage of it. In this configuration, typically, the L1 cache and sometimes the L2 cache are private to each core, as if we considered two separate processors. However, the Last-Level Cache (LLC), typically the L3 cache and sometimes the L2, is often shared among the cores on the same chip, causing contention issues, even in a simple case where threads, each placed on a different core, do not access the same pieces of data at all. First of all, the simple fact of sharing a cache means that a cache miss caused by one given thread may evict a cache line loaded by another thread. Second, we observe so-called true sharing effects when two threads on two separate cores share the same element on the same cache line: a synchronization step has to happen between the two concerned cores to ensure the validity of the data. Third, false sharing happens when two threads on different cores access different data elements of the same cache line. The first thread that modifies its data in said cache line triggers cache coherency protocols that proceed to invalidate the cache line in the other threads' respective caches, forcing them to reload the wanted piece of data from the main memory. The second thread, once it modifies the data on the same cache line, invalidates in turn the same cache line in the other cores' caches, and so on. Even though spatial/temporal locality prevails when it comes to data layout performance, false sharing may in some cases interfere, as shown in [105].
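The sketch below (hypothetical code using POSIX threads; the 64-byte line size and all names are assumptions) reproduces the false sharing pattern just described, along with the usual padding remedy:

    #include <pthread.h>

    #define LINE 64            /* assumed cache line size in bytes */
    #define ITERS 100000000L

    /* c0 and c1 share one cache line: two threads incrementing them on
     * two cores keep invalidating each other's copy (false sharing). */
    static struct { long c0, c1; } shared;

    /* Padding gives each counter its own cache line, removing the traffic. */
    static struct { long c; char pad[LINE - sizeof(long)]; } padded[2];

    static void *worker(void *arg)
    {
        long *ctr = arg;
        for (long i = 0; i < ITERS; i++)
            (*ctr)++;
        return NULL;
    }

    int main(void)
    {
        pthread_t t0, t1;
        /* false-sharing version */
        pthread_create(&t0, NULL, worker, &shared.c0);
        pthread_create(&t1, NULL, worker, &shared.c1);
        pthread_join(t0, NULL); pthread_join(t1, NULL);
        /* padded version: typically noticeably faster */
        pthread_create(&t0, NULL, worker, &padded[0].c);
        pthread_create(&t1, NULL, worker, &padded[1].c);
        pthread_join(t0, NULL); pthread_join(t1, NULL);
        return 0;
    }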

Nowadays microprocessors feature Single Instruction Multiple Data (SIMD) vector units, potentially providing substantial performance improvement by concurrently applying the same instruction to all the elements of a vector. Rich and complex APIs of SIMD instructions have been developed on multiple architectures (such as AVX for Intel or NEON for ARM). Thus, the performance of a code is highly dependent on the use of SIMD instructions, assuming the algorithm is vectorizable. Efficient vectorization, or SIMDization, is usually achieved over contiguous arrays, as it involves loading small contiguous vectors and applying arithmetic operations to them without performance overhead. In a sense, the use of SIMD rewards spatial locality. To load packed data more efficiently, many ISAs require the base pointer to be aligned on a cache line size boundary for aligned accesses; unaligned accesses can also be performed with special instructions, at the cost of a performance penalty.
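As an illustration, a hand-vectorized addition of contiguous arrays might look as follows with Intel AVX intrinsics (a sketch assuming an x86 CPU with AVX, 32-byte aligned arrays, and a length that is a multiple of 8):

    #include <immintrin.h>

    /* a, b, c assumed 32-byte aligned and n a multiple of 8;
     * _mm256_load_ps requires aligned addresses, otherwise the unaligned
     * _mm256_loadu_ps must be used at some performance cost. */
    void vec_add(float *a, const float *b, const float *c, long n)
    {
        for (long i = 0; i < n; i += 8) {
            __m256 vb = _mm256_load_ps(&b[i]);
            __m256 vc = _mm256_load_ps(&c[i]);
            _mm256_store_ps(&a[i], _mm256_add_ps(vb, vc));
        }
    }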

1.1.2 The locality issue

Because the spatial locality of layouts is determinant for application performance and intrinsically dependent on the underlying architecture, the programmer has an important responsibility in the design of his application code. The algorithmic needs of the application impose the creation of elaborate data structures, expressed by often-combined array and structure types, and complicated patterns, expressed by loop nest iterations. Good spatial locality is a consequence of simple patterns that skim through the arrays in a contiguous manner, accessing the array elements consecutively.


    float a[n][n], b[n][n], c[n][n];

    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            a[i][j] += b[i][j] * c[j][i];
    }}

Figure 1.2: Spatial locality issue caused by a strided pattern on the c array, even though c is allocated in a simple contiguous fashion

Now, if a given layout is a contiguous array, non-contiguous access patterns alter spatial locality and cause performance issues. Non-contiguous accesses with a fixed stride are called strided accesses, the stride being the distance between the element accessed at a given iteration step a[i] and the element accessed at the next iteration step a[i+1]. Large strides may imply performance issues. If a[i] and a[i+1] are separated by a stride s > 1 but lie on the same cache line, there is no performance issue beyond alignment and SIMD considerations. Most SIMD instructions require memory accesses to be aligned on a 64-byte boundary and the data to be packed, which means there must be no non-unit stride between elements, or a penalty occurs. In this case, multiple loads may need to be performed, and additional assembly instructions need to be added in order to form a packed vector, that is, a vector full of contiguous elements. If a[i] and a[i+1] are located on different cache lines, other penalties occur, as the cache has to find the other cache line and possibly fetch it from the main memory. Consequently, if the stride is at least one cache line long, an access to the main memory is required for each distinct access to the array. Similarly, the larger the stride, the more TLB misses occur, which are costly in terms of cycles. Translation Look-aside Buffer (TLB) misses occur when the requested data element is located on a memory page not currently mapped by the TLB. This causes an extra performance penalty, as the right physical page address needs to be mapped to a virtual address and the corresponding entry has to be added to the TLB. Additionally, these potential performance issues due to poor spatial locality have another serious downside if the elements located in between a[i] and a[i+1] are not consumed quickly by the computations: in the meantime, the corresponding cache line or page entry may be evicted from their respective caches, causing systematic cache misses or TLB misses where applicable. As an example, Figure 1.2 exhibits a C source code where stride issues occur. The three arrays a, b, c are all allocated in a contiguous manner. The given loop nest expresses patterns on these arrays: arrays a and b are properly accessed, with no stride issue, as each element is separated from the next by a unit stride, which means contiguous accesses. However, the innermost loop happens to navigate through the outer dimension of the array c. Because the memory space is linear, multidimensional structures are linearized in memory, which means consecutive elements along the outer dimension are distant by a stride n in the given example, n being the inner dimension size. Consequently, the c array exhibits a stride n between accesses, implying poor spatial locality and performance issues that are accentuated if n is significantly large.

The programmer is also in charge of the choice of data structures, which is influenced by the application algorithms he implements. Among the classical data structures, the struct type is an object constituted of a collection of elements with different logical meanings, and is typically

accessed by distinct instructions on distinct fields, the structure elements. This implies that each instruction accessing an array of structures has an intrinsic stride, the size of the structure, which has several implications for layout performance. Firstly, if there are unused fields, that is, fields not accessed by a given code fragment, or cold fields, which are fields less intensively used than others, then performance issues due to bad spatial locality may arise from the strided accesses provoked by the structure, since unused fields may be brought into the cache uselessly. Secondly, even if all the fields are hot, that is, all intensively accessed, the structure achieves good spatial locality, because the elements are all quickly used and the structures are accessed in a contiguous manner. However, the nature of structure accesses prevents efficient vectorization. Indeed, the efficient use of SIMD instructions requires packed loads, so the elements have to be consecutive in memory, which is not the case for structure fields, which are interleaved; therefore, on an Array of Structures (AoS) access, the next access to the same field is distant by a stride equal to the structure size. Also, because field accesses are typically performed by distinct instructions, it may be tempting to use different threads to handle them; however, that would not be efficient if different fields lie on the same cache line, which is the case when the structure elements are single elements such as single or double precision floating point values, because false sharing would occur.

For these reasons, Structures of Arrays (SoA) may be preferred. In this scenario, single instructions access distinct contiguous arrays, allowing allegedly good spatial locality, and consequently allowing packed loads and stores and efficient vectorization where applicable. Nevertheless, in the case of large arrays inside the SoA, additional TLB misses occur, especially if there is a large number of structure fields, which degrades the spatial locality property. In the general case, it is not obvious whether the performance gain due to vectorization is enough to mask the TLB miss penalties. There is a way of shaping the data structure that theoretically benefits from both good spatial locality and vectorization, primarily intended for general CPUs/many-cores with SIMD extensions, and referred to as Arrays of Structures of Arrays (AoSoA). The AoSoA construction consists in splitting the outer dimension into an innermost dimension, such that a few consecutive elements belonging to the same field become contiguous; typically, the innermost size is chosen to match the vector size so as to enable vectorization. As a result, since whole vectors of fields are accessed at once by the instructions accessing the several fields, spatial locality is good, and all elements are used quickly enough not to be frequently evicted from the cache.

Figure 1.6 shows code examples highlighting the different patterns caused by three different structure topologies. Any access to the structure is described by a different pattern; therefore the whole application accesses, that is, both the individual access instructions and the loop nests, are different from one implementation to the other. Thus, passing from one data structure to another necessitates a considerable programming effort. Consequently, the patterns are dependent on the layout's shape, or topology; the topology is a property of the layout and not of the pattern, and is therefore independent of scheduling considerations. Multidimensional structures intrinsically impose additional pattern issues, as shown by Figure 1.7. Here, the array datarr is 4-dimensional, and the two innermost dimensions are actually either explicit structure fields, or at least analogous to structure fields, as each instruction accesses a distinct structure field. There are several spatial locality issues here. First of all, the innermost structure-like dimension has one hot field out of two total fields, producing a useless stride


    struct s { float a, b, c, d; };
    struct s A[N];

    for (i = 0; i < N; i++) {
        A[i].b = ...
    }

Figure 1.3: Array of Structures (AoS)

    struct s { float a[N], b[N], c[N], d[N]; };
    struct s A;

    for (i = 0; i < N; i++) {
        A.b[i] = ...
    }

Figure 1.4: Structure of Arrays (SoA)

    struct s { float a[p], b[p], c[p], d[p]; };
    struct s A[N/p]; // assuming N%p==0

    for (i = 0; i < N/p; i++) {
        for (j = 0; j < p; j++) {
            A[i].b[j] = ...
    }}

Figure 1.5: Array of Structures of Arrays (AoSoA)

Figure 1.6: Three data layout versions of A implementing classical ways of handling structures. Choosing the best implementation with the best parameter p is not straightforward, as their respective performance is highly dependent on the architecture.
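To illustrate why the AoSoA shape of Figure 1.5 suits SIMD units, the sketch below (hypothetical code, assuming AVX and an inner block size p = 8 single precision floats) loads a whole field block as one packed vector, which is impossible with the interleaved fields of the AoS version:

    #include <immintrin.h>

    #define P 8   /* AoSoA inner block = AVX vector width for floats */
    struct s8 { float a[P], b[P], c[P], d[P]; };

    /* Each field block A[i].b is contiguous, so one packed (unaligned)
     * load fetches 8 consecutive instances of field b. */
    void scale_b(struct s8 *A, long nblocks, float factor)
    {
        __m256 vf = _mm256_set1_ps(factor);
        for (long i = 0; i < nblocks; i++) {
            __m256 vb = _mm256_loadu_ps(A[i].b);
            _mm256_storeu_ps(A[i].b, _mm256_mul_ps(vb, vf));
        }
    }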

Figure 1.7: Example of cardiac wave simulation [112] (code listing omitted; the kernel iterates over the grid with nested Xstep/Ystep loops). The 4-dimensional array datarr is actually used as an array of structures and imposes complex patterns with sub-optimal spatial locality.

of 2 elements. Secondly, the second innermost dimension is composed of constant fields, while the two outermost dimensions are processed by the nested loop, implying a stride equal to the product of the sizes of the two structure-like dimensions between consecutive elements, which hinders vectorization. While there are intuitively certain ways of optimizing this particular layout, the application context has to be considered, as the layout may be used in other intensive nested loops. Also, the choice between AoS-like, SoA-like or AoSoA-like transformations is not straightforward and depends on the underlying architecture; choosing the best restructuring is therefore difficult. This code excerpt is part of a real life application, namely the cardiac wave simulation published in [112]. Its optimization is further discussed in this document.

1.1.3 Layout abstractions

In order to relieve the programmer of the task of explicitly writing optimal code, especially when it comes to maximizing data layout performance for any targeted architecture, numerous abstraction layers have been proposed. Many languages, in particular object-oriented languages, propose so-called abstract data types which, as opposed to data structures, allow a logical representation of the layout by behavior rather than by physical aspect. This behavior is defined by the application of high-level functions to iterate over, access and manipulate data layouts without explicitly referencing any physical layout. This abstraction layer is also provided by libraries such as ASX [100] or Cyme [32], hiding in particular the complexity of AoSoA layouts with SIMDization from the user.
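As a schematic illustration of such an abstraction layer (hypothetical code, not taken from the cited libraries), a logical accessor can hide which physical layout is compiled in:

    #define N 1024

    #ifdef USE_SOA
    /* Structure of Arrays behind the interface */
    static struct { float a[N], b[N]; } D;
    static inline float *get_b(long i) { return &D.b[i]; }
    #else
    /* Array of Structures behind the same interface */
    static struct elem { float a, b; } D[N];
    static inline float *get_b(long i) { return &D[i].b; }
    #endif

    /* Client code is written once, against the logical view only. */
    void scale(float f)
    {
        for (long i = 0; i < N; i++)
            *get_b(i) *= f;
    }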

The GPGPU programming model is unusual for most programmers; one of the most important differences with CPU programming is that memory accesses have to be coalesced in order to get the most out of the significantly fine-grained thread-level parallelism. Therefore, AoS to SoA transformations need to be expressed when passing from the host (CPU) to the device (GPU), as SoA layouts tend to naturally take advantage of GPUs. The Thrust CUDA library [9] provides such an abstraction, with functions and iterators to transform and process the SoA layout on the device. The programming of systems composed of both CPUs and GPUs is also known as heterogeneous programming and is challenging for many reasons. One of them is that the heterogeneous context may imply transferring data between devices, in particular between CPUs and GPUs, which require differently shaped layouts in order to achieve high performance. The Dymaxion++ API [17] lets the programmer annotate his code with pragmas in order to optimize memory patterns by remapping the data, especially from the CPU main memory onto the GPU main memory, and also from the GPU main memory to the GPU scratchpad memory. The remapping is performed on the GPU; the API proposes a few simple transformations such as transposition and non-unit stride removal. Similarly, both DL [102] and Kokkos [27] propose layout transformations in a heterogeneous context and address the AoS to SoA/AoSoA transformations, with Kokkos addressing manycore architectures as well.

Explicit SIMD programming resorts to low-level functions (e.g. Intel intrinsics) close to the actual SIMD assembly instructions, making them both impractical to use and typically not portable. To mask the difficulty of SIMD programming from users, many abstraction layers have been proposed to cope with this matter. The OpenMP shared memory multiprocessing API has recently released OpenMP 4.0 SIMD [104].


    for (k = 0; k < N; k++) {
        for (i = 0; i < N; i++) {
            for (j = 0; j < N; j++) {
                ...
                a[k] += a[k] * c[j][i]
                      + b[k];
    }}}

Figure 1.8: Kernel with spatial locality issues. (Synthetic benchmark)

    for (k = 0; k < N; k++) {
        for (j = 0; j < N; j++) {
            for (i = 0; i < N; i++) {
                ...
                a[k] += a[k] * c[j][i]
                      + b[k];
    }}}

Figure 1.9: Loop interchange

    for (k = 0; k < N; k++) {
        transpose(t, c);
        for (j = 0; j < N; j++) {
            for (i = 0; i < N; i++) {
                ...
                a[k] += a[k] * t[i][j]
                      + b[k];
    }}}

Figure 1.10: Data restructuring


Figure 1.11: Loop interchange or data restructuring? The difficulty of performing a proper evaluation of data restructuring on the compiler side prevents it from making the most profitable decision; compilers often choose control optimizations over both data restructuring opportunities and combined loop and data transformation opportunities.

Similarly to the original pragmas governing multithreaded execution, OpenMP 4.0 uses pragma clauses to guide loop SIMDization, independently of the underlying architecture. Meanwhile, Intel has proposed Intel Cilk Plus SIMD, which features C/C++ language extensions to express vectorization. These include a colon-based array notation which, much like in the Fortran language, allows specifying array sections and writing vector operations on a single code line, that is, without using a loop structure. Another advantage of this notation is that it indicates potential vectorization opportunities to the compiler. The Boost.SIMD C++ library [29] abstracts SIMD instructions behind higher-level functions, independently of the ISA, saving the programmer the effort of rewriting low-level code at each ISA change; the library provides hundreds of already vectorized mathematical functions. The Cyme library [32] uses abstract data types to create an AoSoA layout out of an AoS to enable efficient vectorization, sparing the user the tedious code rewriting needed to perform the AoSoA transformation. Additionally, both semiconductor manufacturers Intel and AMD propose math libraries with vectorized functions, respectively the Intel MKL [46] and the AMD ACML [5]. The Intel MKL library, for instance, provides pre-vectorized BLAS, LAPACK and FFT routines, which comprise many intensively used functions.
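As an example of the OpenMP 4.0 SIMD construct mentioned above, the following sketch (a hypothetical kernel, assuming non-aliasing, 64-byte aligned arrays) requests vectorization without any architecture-specific intrinsics:

    /* Compile with e.g. -fopenmp. The simd pragma asserts that the loop
     * is safe to vectorize; the aligned clause conveys the alignment
     * guarantee to the compiler. */
    void saxpy(float *restrict y, const float *restrict x, float a, long n)
    {
        #pragma omp simd aligned(x, y : 64)
        for (long i = 0; i < n; i++)
            y[i] += a * x[i];
    }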

1.2 The complicated relationship between compilers and data layouts

While they offer numerous optimizations, particularly loop transformations, compilers do not sufficiently explore data restructuring with the aim of maximizing spatial locality and performance. Figure 1.11 illustrates one of the challenges of data restructuring: should only control transformations be performed, only data transformations, or both?


    for (i1 = 0; i1 < N; i1++) {
        for (i0 = 0; i0 < N0; i0++) {
            A[i0][i1] = ...
    }}

Figure 1.12: Transpose case

1.2.1 Static approaches

Numerous loop transformation techniques have been proposed. The idea is to optimize spatial locality by modifying the loop nest, and therefore the access pattern, without altering the data structure. The transformation is consequently local to the loop nest, whereas data structure transformations would require a complicated global analysis, as the transformation gain has to be assessed for the whole application, since a data structure is usually accessed in more than one loop. Early works like the one by Carr et al. [15] proposed to integrate such loop transformations in compilers, focusing particularly on spatial and temporal locality by minimizing accesses to the main memory. A set of simple transformations such as loop permutation, loop reversal, loop fusion and distribution are studied. Loop permutation, for instance, is the transformation that allows switching between row-major and column-major traversal in particular. The transformations are chosen by a cost model that computes the cost of each loop, in terms of number of cache lines, as if it were in the innermost position. As it has been argued that loop transformations are well understood and, as opposed to data layout transformations, also improve temporal locality, few works have focused exclusively on array restructuring. Nevertheless, a first benefit of the approach is that it is intrinsically independent of loop control considerations. Leung et al. [61] propose to optimize data locality by array restructuring via index transformation matrices. The proposed cost model has only a local view; the global application benefit is not assessed, which is a common difficulty with data transformations.

To the best of our knowledge, Cierniak et al. [21] were among the first to propose unifying data layout and control transformations. However, optimizing for both control and data is difficult, as the transformation space search is expensive, especially for an arbitrarily high number of loops and arrays to assess; the authors thus propose a heuristic to cope with the issue, with a pre-selected pool of transformations. The formalism to express loop and data transformations is quite similar. The stride vector, expressing the strides between element accesses, is retrieved statically; the idea is to push a loop with stride 1 to the inner part of the loop nest. Loop nest transformations are formally described by an algebraic representation. Thus, simple loop transformations such as loop inversion can be expressed this way:

\begin{equation}
\begin{pmatrix} i_1' \\ i_0' \end{pmatrix} = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} i_0 \\ i_1 \end{pmatrix}
\tag{1.1}
\end{equation}

where the input stride vector $\begin{pmatrix} i_0 \\ i_1 \end{pmatrix}$ is derived from the original loop nest given in Figure 1.12. In an analogous manner to Equation 1.1, simple layout transformations can also be expressed with the same formalism:

\begin{equation}
T_A = T A \iff \begin{pmatrix} i_1' \\ i_0' \end{pmatrix} = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} i_0 \\ i_1 \end{pmatrix}
\end{equation}

where $A$ is the initial access matrix and $T_A$ is the access matrix of the restructured array after performing the transpose by applying the transformation matrix $T$.
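Concretely, applying such a transpose transformation $T$ to the access of Figure 1.12 amounts to remapping A[i0][i1] onto a transposed array, so that the innermost loop becomes unit-stride; a schematic sketch (the bounds N0, N1 and the placeholder computation are assumptions):

    /* Original access (Figure 1.12): A[i0][i1], whose innermost loop
     * over i0 walks the outer dimension with stride N1.
     * Transposed layout TA[i1][i0]: the same iteration order now has
     * unit stride in the innermost loop. */
    void transposed_fill(long N0, long N1, float TA[N1][N0])
    {
        for (long i1 = 0; i1 < N1; i1++)
            for (long i0 = 0; i0 < N0; i0++)
                TA[i1][i0] = 0.0f; /* placeholder for the original computation */
    }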


Kandemir et al. [49] are under the same constraint as [21], as detailed above, regarding the difficulty of finding the best unified control and data layout transformations, and developed a different heuristic for that matter. One of the problems is that a layout transformation has to satisfy multiple nests; the heuristic focuses on optimizing the nest that takes the most time, and transforms the others accordingly. They also propose other choices of loop nests to optimize, selected by profiling or user-specified. In [48] the authors reinforced the control/data transformation study by adding interprocedural locality optimization, using the call graph to propagate the cost model results. [26] proposes a new, purely static mathematical model to optimize locality, with a focus on limiting multithread contention.

The polyhedral model, introduced in [33], intended to leverage static loop transformations with a new formalism. Although controversial, mostly because of its expensive compile-time cost, polytope model implementations have found their way into compilers such as GCC [107] and LLVM [36]. The polyhedral model expresses nested loop transformations, where both the iteration domain and the dependences are represented by Z-polyhedra, allowing convenient code validity verification, for instance. Transformations apply to the iteration domain D, defined by a Z-polyhedron since loop counters are integers:

D = {x | Dx + d ≥ 0}
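For the loop nest of Figure 1.12, for instance, the iteration domain would instantiate (assuming the bounds shown there) as:

\[
\mathcal{D} = \left\{ (i_1, i_0) \in \mathbb{Z}^2 \;\middle|\; 0 \le i_1 \le N-1,\; 0 \le i_0 \le N_0-1 \right\}
\]

which is indeed of the form $Dx + d \ge 0$ with $x = (i_1, i_0)^T$.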

For instance, Loechner et al. [65] implemented the polyhedral model to improve the locality of nested loops, with a special focus on avoiding TLB misses, as opposed to most similar works, which focus on regular cache misses only. Although the polyhedral model was initially intended for control transformations, Lu et al. [66] proposed to assess locality via a polyhedral analysis identifying potential for data transformation, with a special focus on inter-thread contention to target CMPs (Chip Multi-Processors). Two of the studied transformations are strip-mining and permutation. Strip-mining consists in reshaping, for instance, an array dimension of size n × p into a 2-dimensional array with dimensions n and p respectively, analogously to loop tiling. The permutation transformation consists in permuting array dimensions, similarly to the row-major to column-major transformation. The choice of transformation is driven by the polyhedral analysis. Other works, such as [63], also build on the power of the polyhedral model for data layout transformations.
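For instance, strip-mining can be seen as a pure index remapping; a schematic sketch (hypothetical function, using C99 variable-length array parameters):

    /* Strip-mining: the flat array A of size n*p is viewed as a
     * 2-dimensional array A2 with A2[i][j] = A[i*p + j],
     * analogously to loop tiling. */
    void strip_mine(long n, long p, const float *A, float A2[n][p])
    {
        for (long i = 0; i < n; i++)
            for (long j = 0; j < p; j++)
                A2[i][j] = A[i * p + j];
    }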

1.2.2 Dynamic approaches

Compilers face challenges that are difficult to overcome. In particular, as far as data restructuring is concerned, it has been shown [85] that optimal data placement on a partitioned cache is NP-hard and poorly approximable. Therefore, building an accurate cache performance model is not simple. Moreover, the optimization scope is the entire program source code; as said before, finding the optimal restructuring strategy given an arbitrary number of loops and arrays is also difficult. Nevertheless, as shown in the previous section, a lot of research effort on static analysis has been devoted to heuristics that cope with these challenges. However, most layout optimizations are still left unexplored by compilers, as supplementary hurdles limit their power. First of all, the lack of information on memory dependences at compile time is challenging; indeed, inferring


    for (iL = 0; iL < L/2; iL += 1) {
        for (j = 0; j < 4; j++) {
            r0  = U[idn[4*iL]][0] * tmp[j];
            r0 += U[idn[4*iL]][1] * tmp[n2+j];
            r0 += U[idn[4*iL]][2] * tmp[2*n2+j];
            r1  = U[idn[4*iL]][3] * tmp[j];
            r1 += U[idn[4*iL]][4] * tmp[n2+j];
            r1 += U[idn[4*iL]][5] * tmp[2*n2+j];
            r2  = U[idn[4*iL]][6] * tmp[j];
            r2 += U[idn[4*iL]][7] * tmp[n2+j];
            r2 += U[idn[4*iL]][8] * tmp[2*n2+j];
            ID2[j]      += r0;
            ID2[n2+j]   += r1;
            ID2[2*n2+j] += r2;
        }
    }

Figure 1.13: Example of Lattice QCD simulation. The 2D array U uses an indirection. All elements are complex double values. The space iterated by the outer loop is a 4-D space, linearized, and the indirection is used to walk through the white elements of a 4-D checkerboard.

memory behavior is risky, and in general compiler conservativeness prevents it from performing most restructuring strategies, to avoid jeopardizing program correctness. Similarly, pointer references in general are difficult to handle. The presence of aliasing, for instance, limits compiler capabilities. Also, all data structures referenced by pointers are in general hard to recognize and optimize, namely arrays of pointers and classic trees, graphs and lists. Indirections happen when arrays are indexed by another array's elements, which is the case in the sample code given in Figure 1.13 for the U array, indexed by the idn array. This is typically ignored by compilers. Libraries loaded by the program are out of the compiler's reach, which implies that compilers cannot take library routines into consideration when exploring transformations. Compilers also still fail in many cases involving complicated index computations and induction variables.
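The aliasing hurdle can be illustrated with the C99 restrict qualifier (a schematic example; shift_add is a hypothetical kernel): without the qualifier, the compiler must assume that dst may overlap src and act conservatively:

    /* Without restrict, dst could alias src+1, creating a possible
     * loop-carried dependence; the compiler must then renounce
     * vectorization or emit a runtime overlap check.
     * src is assumed to hold n+1 elements. */
    void shift_add(float *restrict dst, const float *restrict src, long n)
    {
        for (long i = 0; i < n; i++)
            dst[i] = src[i] + src[i + 1];
    }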

Nevertheless, it is possible to assist the compiler with profiling. It consists in dynamically assessing certain program behaviors, often on a given program portion rather than the whole program, because the approach can be costly in terms of profiling time. This dynamic approach is often used to complement compiler analysis, rather than in a standalone way, where the optimization scope is limited and lacks general program semantics. Profiling is useful in several ways, such as gathering information on memory accesses: patterns can be established by running and analysing some hot functions of the program. Many works take advantage of dynamic information in addition to compiler optimizations. Ding et al. [24] perform array restructuring at runtime, with two different transformations, locality grouping and data packing, that improve temporal locality in particular. Locality grouping reorders computations by array affinity, while data packing is a data layout restructuring technique that gathers elements with close temporal locality into the same cache line. Shen et al. [97] propose an affinity analysis technique, which consists in regrouping in memory arrays that are always accessed together in the program. The authors argue that symbolic analysis by the compiler suffices and that no profiling is really needed; however, more complex codes with complicated pointer references may challenge such static analysis. Cho et al. [20] use profile information on memory access patterns to effectively use scratchpad memory.

Figure 1.14: Classical cache-conscious restructuring techniques (field reordering, splitting, and interleaving within a cache line). The common aim is to segregate hot and cold fields.

Another approach to improving compiler optimizations is so-called iterative compilation [11] [53]. It consists in exploring the compiler optimization space by iteratively compiling and measuring the effectiveness of given optimizations. One major concern with the approach is that it is time consuming; as a result, most works focus on one particular aspect to optimize.

1.2.3 Data restructuring techniques

Several well-known strategies have been studied for array restructuring with the aim of improving cache utilization, also called cache-conscious strategies. In this context, profiling is an advantage, as it helps characterize the hot and the cold fields of a structure. Hot fields are typically the most intensively accessed, as opposed to the cold fields, which interfere with the hot fields through various cache effects.

Field reordering When data structures are larger than a cache line, reordering structure fields [18] [52] [108] may allow better affinity between fields, that is, segregating hot fields and cold fields on different cache lines where applicable. Indeed, in the programmer's source code, fields are initially ordered logically; profiling information on locality is therefore useful to reorder them by frequency of usage, in order to limit cache pollution.
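A schematic C example of field reordering (the record and its fields are hypothetical): hot fields are packed first, so that a traversal touching only them loads fewer cache lines:

    /* Original, logically ordered record: hot fields are interleaved
     * with cold bookkeeping data. */
    struct particle_orig {
        double x, y, z;         /* hot */
        char   name[48];        /* cold */
        int    flags;           /* hot */
        double last_checkpoint; /* cold */
    };

    /* Reordered by access frequency: the hot fields now fit together
     * at the start of the record, on as few cache lines as possible. */
    struct particle_reordered {
        double x, y, z;         /* hot */
        int    flags;           /* hot */
        char   name[48];        /* cold */
        double last_checkpoint; /* cold */
    };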

Class Splitting Class splitting [18] is a technique similar to field reordering; the difference lies in the fact that class splitting targets structures that are typically smaller than the cache line size. Padding, for instance, can be inserted to segregate hot fields from cold fields.

Field interleaving Field interleaving [108] is an alternative to class splitting that proposes filling the cache line with interleaved hot field instances instead of padding, as depicted in Figure 1.14.

Coloring Depending on associativity, an arbitrarily small number of concurrent accesses to the same cache set can occur without generating conflicts. However, when the number of array

27 1.2. The complicated relationship between compilers and data layouts ences is significant, conflicts can greatly affect the code performance. Based on cache associativity knowledge, the idea of coloring [19] is to segregate – by color/temperature – hot and cold fields so as to partition the virtual space in a way that maps hot fields in the same cache set, and the cold fields in another cache set so as to avoid hot field cache lines eviction caused by substitution by a cold field. Padding is employed if necessary to properly partition the data structure on the virtual space.

Clustering Clustering [19] [69] is a technique that consists in dynamically remapping recursive data structures, such as linked lists, to improve spatial locality by packing data fields into cache lines. The aim is to gather two temporally close nodes on the same cache line, as is the case for two consecutive elements linked via the "next" pointer, for instance. Also, by properly packing cache lines, fewer cache lines are needed, which minimizes inopportune hot cache line replacement by the cache replacement policy. However, such structures have to be fixed or change very little, because of the dynamic restructuring overhead.
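A minimal sketch of the idea, assuming a singly linked list whose nodes are created in traversal order (the names pool and build_list are illustrative): drawing nodes from a contiguous pool instead of individual allocations packs consecutive nodes into the same cache lines.

#include <stdlib.h>

struct node { int val; struct node *next; };

/* Nodes allocated back-to-back from one pool; two nodes consecutive
 * via the "next" pointer are also adjacent in memory, so one cache
 * line holds several of them. */
struct node *build_list(size_t n) {
    struct node *pool = malloc(n * sizeof *pool);
    if (!pool) return NULL;
    for (size_t i = 0; i < n; i++) {
        pool[i].val = (int)i;
        pool[i].next = (i + 1 < n) ? &pool[i + 1] : NULL;
    }
    return pool;   /* head of the list */
}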

Memory segmentation Memory segmentation [95] is a technique which separates highly referenced objects, ideally gathered on a small set of pages, from short-lived and rarely referenced objects. This way the method limits TLB misses. One drawback is the potential fragmentation of the heap.

1.2.4 Vectorization

With the advent of short vector SIMD instructions in modern processors, SIMDizability and automated SIMDization have been the topic of several research efforts. Indeed, making use of these SIMD units was recognized early as key for performance [59]. Compilers such as those from Intel, IBM or PGI, as well as GNU GCC [80, 79, 78, 107], have received much auto-vectorization effort to enable and extend their capabilities [56, 28, 4]. Scout [58] has been designed to vectorize at a higher level, by translating scalar statements to vectorized statements using SIMD intrinsics. However, Maleki et al. showed in their SIMDization tests [73] on the TSVC suite that many potentially vectorizable constructs are left unaddressed by state-of-the-art compilers, either because some known theoretical techniques have not yet been implemented, because no known theoretical techniques exist for codes with complex structure, or because the compiler must act conservatively owing to a lack of available information at compile-time [83].

A hybrid compile-time/run-time approach is proposed by Nuzman et al. [77] using a two-step vectorization scheme. The compile-time step performs expensive analysis operations. The run-time specialization step is performed by an embedded just-in-time compiler. This approach enables some degree of adaptiveness to the hardware at run-time. However, the run-time step does not alter the vectorizability status using dynamic dependency information. Adaptiveness is also explored by Park et al. [30], who guide the application of optimizations in a predictive manner through a machine-learning approach, and by Tournavitis et al. [106] in a technique associating profile-driven auto-parallelization with machine learning.

Multiple approaches have been followed in the attempt to group isomorphic elementary computations together. Loop-level approaches have been explored for several decades. Back in 1992, Hanxleden and Kennedy [110] made proposals for balancing loop nest iterations over multiple lanes of a SIMD machine. MMX and similar instruction sets started to gather interest with works

such as Krall's [57], proposing to apply vectorization techniques to generate code for SIMD extensions. Larsen introduced the concept of Superword-Level Parallelism (SLP) [59], which groups isomorphic statements together when they are potentially packable and executable in parallel. More recently, Nuzman et al. [81] explored SIMDization at the outer-loop level. Much work has also been devoted to employing polyhedral methods [12]. Several works have been conducted to allow compilers to accept more complex code and data structures, such as alignment mismatch [28], flow control [99], non-contiguous data accesses [79] or minimizing in-register permutations [90], for example. SIMDizing in the context of irregular data structures is also being studied [89]. All these works have in common that they accept unmodified source code as input and attempt to generate the best SIMDized binary code for the target hardware. They do not involve the programmer in their attempt to produce good SIMD code and usually give little information back to the programmer when their attempt fails.

1.3 The lack of user feedback

Compilers have many limitations in the context of data layout optimizations, and progress in compilers is slow with respect to architectural advances. Therefore codes stay sub-optimized as far as data structures are concerned. At this point, the user is left with little to no clue on how to address his performance issues. Automatic tools are needed to guide the programmer through performance optimization, first to locate performance issues and second to apply solutions where applicable, or at least hint at them. Numerous tools have been developed over the years to help the programmer locate performance issues in application code, but very few give programming recommendations and strategies for actually fixing the performance problems.

1.3.1 Compiler feedback

Compiler shortcomings in the context of data restructuring are difficult to address, as little to no feedback is given to the programmer to help understand compiler choices relative to data restructuring. Nevertheless, efforts have been made in this direction in the context of vectorization. Compilers often fail to vectorize a number of candidate kernels that are actually intrinsically vectorizable, and compilers provide a compilation option (e.g. -ftree-vectorizer-verbose=[N] with GCC, -vec-report[N] with ICC) to produce a vectorization report, usually a file containing information on the failure reasons.

for (int i = 0; i < LEN / 2; i++) {
    a[2*i] = c[i] * b[i] + d[i] * b[i] +
             c[i] * c[i] + d[i] * b[i] + d[i] * c[i];
}

Analyzing loop at tsc.c:838

tsc.c:838: note: misalign = 0 bytes of ref c[i_42]
tsc.c:838: note: misalign = 0 bytes of ref d[i_42]
tsc.c:838: note: misalign = 0 bytes of ref b[i_42]
tsc.c:838: note: misalign = 0 bytes of ref a[_9]
tsc.c:838: note: not consecutive access a[_9] = _25;
tsc.c:838: note: not vectorized: complicated access pattern.
tsc.c:838: note: bad data access.

Figure 1.15: GCC 4.8.1 vectorization report excerpt on TSVC s1111 benchmark


LOOP BEGIN at userroutine.c(77,2)
   remark #15344: loop was not vectorized: vector dependence prevents vectorization
   remark #15346: vector dependence: assumed OUTPUT dependence between ID2 line 95 and ID2 line 95
   remark #15346: vector dependence: assumed OUTPUT dependence between ID2 line 95 and ID2 line 95

   LOOP BEGIN at userroutine.c(82,3)
      remark #15516: loop was not vectorized: cost model has chosen vectorlength of 1 -- maybe possible to override via pragma/directive with vectorlength clause
      remark #15387: vectorization support: scalar type occupies entire vector
      remark #15475: --- begin vector loop cost summary ---
      remark #15476: scalar loop cost: 499
      remark #15477: vector loop cost: 213.000
      remark #15478: estimated potential speedup: 2.180
      remark #15488: --- end vector loop cost summary ---
   LOOP END
LOOP END

Figure 1.16: ICC vectorization report excerpt on Lattice QCD simulation code (see Figure 1.13)

Typically, non-unit strided patterns are a common cause of vectorization failure, as reported by GCC in Figure 1.15. In this example, there is only one loop with mostly perfectly contiguous array accesses, except for one access on array a that is labelled a complicated access pattern. Two issues are at stake here. First, the lack of contiguity prevents efficient vectorization. Second, assuming the array base address is aligned on a vector boundary, and given the stride 2, half of the elements are unaligned with respect to vector boundaries, which among other things prevents packed loads.

In principle, it is possible to vectorize using partial load assembly instructions, gathering scalars into a single vector by using additional assembly instructions to reshuffle the SIMD vectors. Another solution is to restructure the array a, to compress unused elements and gather hot elements in order to achieve contiguous accesses and enable efficient vectorization. However, the potential vectorization gain remains uncertain for both solutions, as other factors, such as the application context, for instance the localities of the whole loop nest, have to be taken into account for performance assessment.
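The restructuring option can be sketched as follows, under the assumption that the odd-indexed elements of a are dead in the rest of the application (a_hot is a hypothetical compressed array, not part of the original benchmark):

#define LEN 32000

/* Original: the stride-2 store on a hinders packed stores. */
void s1111_orig(float *a, const float *b, const float *c, const float *d) {
    for (int i = 0; i < LEN / 2; i++)
        a[2 * i] = c[i] * b[i] + d[i] * b[i] + c[i] * c[i]
                 + d[i] * b[i] + d[i] * c[i];
}

/* Restructured: hot elements compressed into a_hot, making the store
 * contiguous and unit-strided, hence amenable to packed SIMD stores. */
void s1111_packed(float *a_hot, const float *b, const float *c,
                  const float *d) {
    for (int i = 0; i < LEN / 2; i++)
        a_hot[i] = c[i] * b[i] + d[i] * b[i] + c[i] * c[i]
                 + d[i] * b[i] + d[i] * c[i];
}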

Lack of dynamic information forces compilers into conservative dependence management; indeed, without profile information, handling data dependences is risky. Unfortunately, this also causes many potentially advantageous vectorizations to fail, as shown in Figure 1.16. The innermost loop has a reduction-like dependence on ID2 accesses, preventing compiler vectorization. This is unfortunate because the outer loop is vectorizable and intrinsically parallel. The outer loop has not been vectorized because the compiler preferred to ignore vectorization opportunities, since a complex double element fills the entire vector. However, outer loop vectorization can be achieved by splitting the real and imaginary parts of the complex numbers. Whether or not to segregate real and imaginary parts is not contemplated here; both options would make suitable candidates for performance assessment.
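A sketch of the real/imaginary split on a hypothetical complex multiply loop (the names u_re, u_im, w_re, w_im are illustrative and not taken from the QCD code):

/* SoA version: with real and imaginary parts in separate arrays,
 * consecutive iterations touch consecutive doubles, so the loop can
 * be packed into full SIMD vectors instead of one complex per vector. */
void cmul_soa(double *u_re, double *u_im,
              const double *w_re, const double *w_im, int n) {
    for (int i = 0; i < n; i++) {
        /* (a+ib)(c+id) = (ac - bd) + i(ad + bc), on split arrays */
        double re = u_re[i] * w_re[i] - u_im[i] * w_im[i];
        double im = u_re[i] * w_im[i] + u_im[i] * w_re[i];
        u_re[i] = re;
        u_im[i] = im;
    }
}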

1.3.2 Tools

Profiling is a method to analyze programs dynamically, which is particularly appropriate for identifying performance bottlenecks and runtime behaviors that conventional compilation techniques cannot grasp. It generally consists in instrumenting the code prior to execution, in order to insert probes into the code to prepare runtime measurements, and then running the probed code to collect runtime information. It provides, among other things, time measurements, values of particular hardware counters, memory traces, and numbers of function calls. Memory traces, for instance, are records of the program's memory accesses, useful to notice sub-optimal memory access patterns responsible for bad performance. Different approaches to profiling exist.

GNU gprof [34] is a well-known performance profiling tool; it relies on program interruptions to allow sampling, a light form of instrumentation that uses approximations to deduce code profiles. Gprof requires the user to recompile and relink his application. Many vendor-specific profiling tools are available, such as Intel VTune, AMD CodeAnalyst, ARM Streamline and the Nvidia visual profiler. Valgrind [76], PIN [67] and VTune [88] are profiling tools that perform instrumentation via Just-In-Time (JIT) recompilation, with different motivations. Valgrind JIT recompilation does not directly target the physical architecture; rather, it simulates on-the-fly a simplified, RISC-like ISA. PIN, on the other hand, replaces original binary instructions on-the-fly with new instructions, such as user probes, code portion by code portion, and performs analysis on-the-fly as well. TAU [98] is another profiling tool that proposes different solutions for instrumentation, one of them being an API for source code instrumentation. In each case the binary is rewritten, either at runtime or offline. TAU has the advantage of supporting heterogeneous architectures. Vampir [54] is a profiling tool that also targets CPU/GPU; instrumentation is performed via a compiler wrapper, and the user code needs recompilation and relinking, much like with gprof. Instrumentation overheads are typically on the order of 10 to 100 times, which can be significant on real applications. A technique implemented in many tools is sampling, a statistical method that extrapolates performance results from a smaller subset of probing instances. Most profiling tools offer performance analysis to help; however, the user is left with the task of improving, implementing and evaluating potential optimizations.

Some works have followed the path of acting on the source code side. Frameworks such as the Scout [58] source-to-source compiler enable the programmer to annotate the source code with pragma directives that are subsequently translated to SIMDized code. A recent work of Evans et al. [31] presents a method to analyze vector potential, based on source-annotated code and an on-the-fly dependence graph analysis using the PIN framework. Other works aim at specialized fields and produce efficient code for a selected class of applications, e.g. stencil computations [47, 42], or allow auto-tuning of specific kernels [111]. Profiling tools such as Intel VTune [45] or the Valgrind suite, for instance, can diagnose code efficiency and pinpoint issues such as memory access patterns with bad locality. Intel VTune may even suggest that the programmer resort to SIMD intrinsics. However, such an action is not always desirable for code readability and maintainability. More elaborate works focus on specific code optimizations [111][56]; other works [42] suggest data restructuring, but are limited to very specific cases such as stencils. VP3 [114] is a tool for pinpointing SIMDization-related performance bottlenecks. It tries to predict the performance gains obtainable by changing the memory access pattern or instructions. However, it does not propose high-level restructuring.


for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        a[i][j] = a[i][j] * c[j][i] + b[i][j];
    }
}

[Barplot: speedup prediction (0 to 4) for the basis-unrolled, restructured and restructured-unrolled versions, on x86 and ARMv7]

Figure 1.17: Kernel s1115 from TSVC suite, showing the need for quantified optimization strategy assessment.

Similarly, ArrayTool [64] can be used to regroup arrays to gain locality, but there is no deeper change in data layouts.

1.3.3 The importance of quantification

Although user hints are important in the sense that they may provide valuable feedback to the programmer, there is no guarantee that they are correct. The listing in Figure 1.17 shows a kernel extracted from the TSVC benchmark suite [14, 73], a suite of codes for the evaluation of the SIMDization capabilities of compilers. This kernel is part of the function s1115. Array c is stored row-major but accessed column-major, which hinders SIMDization. A possible strategy is to transpose the c array. Another classical strategy is to perform partial loop unrolling. A third strategy is to combine both data transposition and loop unrolling. The barplot in Figure 1.17 shows the relative speed-ups of these strategies compared to the original version. Comparisons are made both for an x86 architecture, an Intel Sandy Bridge E5-2650 @2GHz using icc 13.0.1, and an ARM architecture, the ST-Ericsson Snowball (ARMv7 Cortex-A9 @800MHz) using gcc 4.6.3. The two architectures clearly show very dissimilar behaviors on this example, to the point that the two strategies showing good results on x86 perform poorly on the ARM architecture. Consequently, quantification is useful and would be particularly relevant for user feedback.
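The transposition strategy amounts to the following sketch (ct is a hypothetical transposed copy of c, and C99 variable-length array parameters are used for brevity; the one-off transposition cost must be amortized over enough kernel executions to pay off):

/* Transpose c once, then run the kernel with unit-stride accesses on
 * every array, removing the column-major access that hinders SIMDization. */
void s1115_transposed(int n, double a[n][n], const double b[n][n],
                      const double c[n][n], double ct[n][n]) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            ct[i][j] = c[j][i];
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            a[i][j] = a[i][j] * ct[i][j] + b[i][j];
}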

1.4 Towards a data layout restructuring framework proposal

The plethora of computer architectures and their frequent replacement make programming more complicated, despite the numerous abstractions proposed. Using these abstractions means less control over performance on the programmer's side; often it is not possible to obtain decent performance without obfuscating the code too much. The non-portability of performance imposes layout design choices of unacceptable complexity on the user. Moreover, it is generally difficult to estimate the potential gain of a given restructuring strategy. Given the intrinsic limits of compilers in the domain of array restructuring and the insufficient feedback reported to programmers in general, we propose an original framework to overcome these difficulties.


[Diagram: the application binary is instrumented and run to produce memory traces; analysis of the traces drives transformations of the binary, producing binaries with transformed kernels (V1, V2, ...) that are run in turn, yielding hints to the user]

Figure 1.18: Big picture

1.4.1 Proposition overview

We propose a tool providing elaborate user feedback on data layout restructuring; the aim is to guide the user through the steps of the performance optimization process. First, it pinpoints initial layout/pattern issues via a clear report giving a high-level vision of the issues, as well as a precise report identifying which individual instructions are responsible for the performance issues. Then, the tool suggests a reasonable set of relevant, although possibly ambitious, transformations, through both a high-level representation and the individual memory accesses to modify in order to implement the newly suggested layout mapping. Finally, each of the suggested transformations is provided with a proper quantification of its benefits, giving the user a good sense of the gain to expect and, from there, of which transformation to choose. The principle of the tool is illustrated in Figure 1.18; it works on the original user application binary. Instrumentation is applied in order to probe all the memory accesses performed by the code, generating a set of memory traces. An analysis step is then applied to the collected profile, to characterize the memory patterns and the layout structures. Given the potential suboptimality of the data structures, a heuristic chooses a restricted set of appropriate transformation candidates for optimization, which are finally automatically generated by rewriting the application binary, thus constructing the so-called code mock-ups. These mock-ups are binary codes built with the sole purpose of mimicking the behavior the actual user code would have, were the considered transformation applied. Mock-ups are timed in order to deduce the potential performance gain, or lack thereof, and finally permit reporting elaborate user feedback: first, a proper description of each transformation is given; in particular, all the needed code modifications are pinpointed to the programmer; second, each of the said transformations is quantified, meaning that a performance gain estimation is provided to the user, so as to guide his implementation choices to improve application layout performance.

The benefits of this approach are manifold. The user feedback we provide is exploitable, in the sense that it is a reliable guide to performance optimization. Unlike most tools that offer performance issue reports, our approach accompanies the user through the next steps of performance optimization, that is, providing clear restructuring strategy options, and then giving a way of weighing the mentioned options by reporting performance estimations for each of them. Besides, inherent code complexity, especially with high-level abstractions, may even conceal otherwise obvious optimizations from the user. An automatic tool allows, among other benefits, traversing a reasonably large space of transformation possibilities, which includes a significant number of transformations the user would potentially not consider. As the user is left the choice of array restructuring, some transformations may require deeper modifications than others. Indeed, the tool works on a smaller scope than the whole application, therefore its visibility is limited. The decision of

restructuring is left to the user, who knows his application and can decide how to apply the modifications, for instance whether the array should be restructured for the totality of the application rather than for one particular kernel. Finally, some users may prefer transformations that are easier to apply even if they do not bring near-optimal performance, as opposed to users who need the greatest performance improvement they can obtain. While the proposal helps portability, the user can also utilize the feedback to write highly optimized code depending on his requirements. As a matter of fact, even if none of the transformations were beneficial, the information would still be useful to the programmer, as he would know that array restructuring is not a viable option and would therefore not waste time trying to optimize the code this way.

Furthermore, layout mapping modifications do not obfuscate the code. This approach helps bridge the important gap [94] between typical compiler output and expert programmer hand-written output. Moreover, data layout restructuring is intrinsically independent of control flow optimization. This means any control optimizations can be applied in any order without interfering with any data layout optimization. The proposed method is also architecture-independent, as the layout remapping does not assume any particular architectural feature. The proposed framework operates on a small portion of the application code, usually the hottest kernel, making its visibility somewhat restricted. Besides, the use of memory traces implies that the method applies to a given set of run parameters, and can be seen as tailored to that run. Therefore code validity cannot be ensured, as a number of parameters, such as array sizes, are indeterminate. However, we allow ourselves to aggressively explore more transformations than compilers consider; some may not be applicable for a given application, but the user is notified and can make an appropriate choice. On the other side, compiler products need to be operational for all parameters, due to semantics preservation and to information that is not specific enough. For that matter, the tool/user interaction we propose is useful, as it allows the user to rewrite his code, perhaps in a simpler way than it was, potentially enabling more compiler optimizations.

This work encompasses a number of contributions. First of all, a new formalism is proposed in order to express data layouts and data patterns. Its concise form simplifies discussions around layouts and patterns, and allows transformations to be applied in an efficient, systematic way. Apart from the formalism presented in this document, other original formalisms have been introduced recently to express data structures at a certain level of abstraction. ArrayOL [13] is a language specific to signal processing, where arrays and patterns in particular are represented as vectors, and transformations as matrices. Other works like Torch7 [22] and TensorFlow [2] choose instead to represent data arrays by tensors.

Second, we use the formalism to express the initial, troublesome application layout in a normal form, independent of compiler optimizations, which corresponds to the delinearized version of the flat accesses as given by the memory traces. This is particularly useful in our binary approach, as the input binary is the product of prior compiler optimizations and memory instructions can be expressed in different ways, as is the case for unrolling transformations for instance. Delinearization is the first analysis to perform on the compiler side in order to be able to restructure the layout. Parametric delinearization, for some particular codes, has been proposed by Grosser et al. [35]. Specifically for stencil codes, using the polyhedral model, Henretty et al. [41] propose a complete restructuring of the layout for SIMDization. This would not apply to the Lattice QCD code given in Figure 1.13, because of the indirection. Berdine et al. [10] propose to analyze the shape of more complex data structures, such as recursive data structures with linked lists in particular.


Figure 1.19: MAQAO architecture

1.4.2 Data restructuring evaluation

There are essentially three main techniques for data restructuring evaluation, necessary to choose a near-optimal solution. First of all, there are theoretical performance models, which can be incorporated in compiler static analysis and assume a simplified view of the architecture. Then there is the simulation approach, which consists in tracing memory patterns and using a simplified, thus lighter, architecture environment, typically a cache simulator, to perform measurements on the considered transformations; it is usually used to assess one particular parameter, such as the cache miss rate. However, to assess memory access performance, several parameters have to be taken into account: the cache miss rate alone, while relevant, cannot properly model all cache considerations. Our approach consists in measuring the impact of restructuring transformations in-vivo, meaning that code is executed on the actual user architecture. The measurements are therefore more accurate than cache models or simulations, as they are real measurements with all architectural considerations included. Both simulation and in-vivo evaluation imply running only a portion of the program, usually its hotspot kernel, because both rely on memory traces, which are costly in time to obtain; instrumentation is thus rarely performed on the whole application. Several previous works have proposed techniques to extract pieces of code for evaluation [60, 84], and have shown that such an operating mode is viable for performance measurements [3].

We use the MAQAO profiling tool [8] to instrument and to patch the application binary. As shown in Figure 1.19, MAQAO is a tool that operates at the binary level. It extracts the code representation from the user application binary, on which it performs static analysis, and which it can instrument in order to perform dynamic analysis on collected memory traces. Compared

to PIN [68], a tool with similar functionalities, MAQAO performs static rewriting-based instrumentation from binary to binary and analyzes the collected information in a post-mortem fashion. PIN, on the contrary, dynamically rewrites binary codes while they execute, and performs analysis on the fly. As most of MAQAO's work is done offline, the overall cost of analyzing a binary with MAQAO is much smaller than with PIN. With MAQAO, it is possible to capture memory streams by instrumenting all instructions that access memory, on a specified code fragment (such as a function or loop). For each instrumented instruction, the flow of captured addresses is compressed on-the-fly using a lossless algorithm, NLR, designed by Ketterlin and Clauss [51]. This compressed trace represents the accessed regions with a union of polyhedra, capturing access strides and multidimensional array indexing (when possible). The purpose of tools based on dynamic dependency graph analysis [71], such as our MAQAO approach, is therefore to explore SIMDizability from an entirely different point of view. As such, MAQAO is an analyser complementary to vectorizing compilers, for application programmers to investigate where and how their code could be restructured to enable or improve vectorization by the compiler. Holewinski et al. propose a similar approach [43], which is, to the best of our knowledge, the closest to our own work. However, their implementation does not preserve structural information about the application, such as iterated memory references inside loop nests. This limits the amount of detail that can be reported to the programmer, and also limits the richness of the information that could be injected back into a vectorizing compiler to improve its output. Our approach instead preserves such key information, which distinguishes it from prior efforts.
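The flavor of such a compressed trace can be conveyed by a simplified, single-level run compression; the actual NLR algorithm builds nested loops and recognizes multidimensional regularities, whereas this sketch only folds constant innermost strides:

#include <stdio.h>
#include <stdint.h>

/* Fold an address stream into (base, stride, count) runs, printed in
 * the loop-like format used by the trace figures of Chapter 3.
 * Illustration only: no nesting, no polyhedral union. */
void compress(const uintptr_t *addr, size_t n) {
    size_t i = 0;
    while (i < n) {
        uintptr_t base = addr[i];
        size_t count = 1;
        intptr_t stride = 0;
        if (i + 1 < n) {
            stride = (intptr_t)(addr[i + 1] - addr[i]);
            while (i + count < n &&
                   (intptr_t)(addr[i + count] - addr[i + count - 1]) == stride)
                count++;
        }
        printf("for k = 0 to %zu: val 0x%lx + %ld*k\n",
               count - 1, (unsigned long)base, (long)stride);
        i += count;
    }
}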

Many works have addressed the problem of data layout restructuring. Rubin et al. [92] regrouped many of the existing array restructuring techniques we detailed earlier in this document, and assessed their impact using simulation, allowing combinations of the different transformations. They search for a good layout solution by iteratively applying the transformations and simulating them. Hundt et al. [44] proposed both a compiler method for data layout transformation and an advisory tool. The profile-based restructuring heuristic they present for array splitting is based on the hot field/cold field separation. They provide both non-profile and profile profitability analyses. However, the number of transformations they investigate is low, because in many of their test cases the transformations are invalid. In the most difficult cases, the advisory tool is used instead to suggest more effective structure designs. A later paper [87] expands their work to multithreaded considerations, with a special focus on the false sharing issue. Mannarswamy et al. [74] implemented a compiler method using profile information to perform data transformations. Given the difficulty of performing whole-program transformations, they proposed to apply the layout transformation on a smaller region, performing selective data copies for local region optimization. Wu et al. [115] proposed algorithms to reorganize data on GPU with the aim of improving data coalescing, which is crucial for GPU performance. Kofler et al. [55] also focused on AoS to SoA transformations on GPU; the idea is first to separate the AoS by fields using hardware parameters, then to select a good transformation, which can be a SoA or something else. Annotations and specific data layout optimizations with compiler support have been proposed by Sharma et al. [96]. Their source-to-source transformation requires describing the desired array interleaving in a separate file. Similarly, the array unification described by Kandemir [50] and Inter-Array Data Regrouping [25] propose to merge different arrays at compile-time in order to gain locality. The POLCA semantics-aware transformation toolchain is a Haskell framework offering numerous transformation operators using programmer-inserted pragma annotations [103]. None of these approaches provides an assessment of the performance gains to guide the user's restructuring or hint

generation, and these compile-time approaches cannot handle indirections. Li et al. [62] used data layout transformations on the compiler side to enable vectorization in the context of loops with data structure accesses. The StructSlim profiler [91] helps programmers with data restructuring through structure splitting. On GPU, copies are performed at transfer time, and data layout changes are also performed at this step [109, 101]; code analysis is performed statically, on OpenCL for instance. The same approach has been explored for heterogeneous architectures [70], assessing affinity between fields and clustering them, and devising multi-phase AoS vs. SoA data layouts and transforms. Finally, Gupta et al. [37] proposed a new formal way of quantifying both spatial and temporal locality, with an approach based on memory traces.


Chapter 2

Vectorization

2.1 Layout Transformations To Unleash Compiler Vectorization ...... 39 2.2 Hybrid Static/Dynamic Dependence Graph ...... 40 2.2.1 Static, Register-Based Dependence Graph...... 40 2.2.2 Dynamic Dependence Graph...... 41 2.3 SIMDization Analysis ...... 42 2.3.1 Vectorizable Dependence Graph...... 42 2.3.2 Code Transformation Hints...... 43 2.4 Conclusion ...... 44

Using SIMD instructions is essential for high performance computing on modern processor architectures. Compilers' automatic vectorization shows limited efficiency in general, due to conservative dependence analysis, complex control flow or indexing. Typically, poorly designed layouts and strided patterns not only constitute performance issues by themselves, without any SIMD consideration; they may also hinder SIMD optimizations, costing even more performance in a SIMD context. This chapter details a technique to determine whether a given kernel is intrinsically vectorizable. In particular, we notice that data layout issues are a common cause of vectorization failure, which can be solved by performing data layout transformations.

2.1 Layout Transformations To Unleash Compiler Vectorization

For modern multi-core architectures, the Single Instruction, Multiple Data (SIMD) instructions are essential in order to reach high levels of performance. With the increase of vector width (up to 16 floats for Intel Xeon Phi SIMD vectors, for instance), SIMD instructions are real performance multipliers. Several options are given to the application developer in order to vectorize a code: explicit vectorization through assembly instructions, intrinsics, GCC vector extensions or other language extensions (such as the Intel SPMD Program compiler, for instance [86]), or implicit vectorization through the automatic vectorizer of the compiler.

Explicit vectorization, however, is complex and time consuming; assembly and intrinsics-based approaches are also not portable, and GCC extensions only offer a limited subset of arithmetic operations. Consequently, the vectorization effort for most applications is delegated to the compiler, which may not entirely succeed, or may even completely fail to meet the programmer's expectations, depending on the code structure and complexity. Indeed, an over-conservative dependence analysis, incomplete static information concerning the control flow, or a strided data layout are among the main reasons why the compiler may not generate vector code, even though the code actually is vectorizable. Therefore, determining whether the code is vectorizable, independently of any compiler limitations, as well as pinpointing the issues that may hinder vectorization and the transformations required to enable it, are critical capabilities for the developer.

Our approach is based on a twofold static and dynamic dependence analysis of the binary code, and generates vectorization hints for the user. By combining a static and a dynamic dependence analysis, our method identifies the limiting factors for vectorization, such as misaligned or non-contiguous data, together with opportunities to vectorize, obtained through rescheduling, loop transformations, reduction rewriting and vectorizable idiom rewriting. The runtime analysis is essential for capturing dependences in the presence of complex control flow or structure indexing. This approach is complementary to what compiler optimization reports can provide, bringing a more detailed analysis of vectorization opportunities, in particular of the opportunities missed by the compiler.
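For reference, the explicit and implicit options differ roughly as below: a saxpy-like loop written once with AVX intrinsics and once as plain scalar code left to the auto-vectorizer. This assumes AVX hardware and n a multiple of 8; it is a generic illustration, not code from the thesis toolchain.

#include <immintrin.h>

/* Explicit: 8 floats per __m256 vector, hand-scheduled. */
void saxpy_avx(float *restrict y, const float *restrict x, float a, int n) {
    __m256 va = _mm256_set1_ps(a);
    for (int i = 0; i < n; i += 8) {
        __m256 vy = _mm256_add_ps(_mm256_mul_ps(va, _mm256_loadu_ps(x + i)),
                                  _mm256_loadu_ps(y + i));
        _mm256_storeu_ps(y + i, vy);
    }
}

/* Implicit: the same computation, left to the compiler's vectorizer. */
void saxpy_scalar(float *restrict y, const float *restrict x, float a, int n) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}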

2.2 Hybrid Static/Dynamic Dependence Graph

The dependence analysis we propose is a combination of a static dependence analysis for registers and a dynamic dependence analysis for memory dependences. The static dependence analysis on registers is already implemented in MAQAO and corresponds to an SSA analysis. Memory dependences are obtained by tracing with MAQAO all memory accesses within the innermost loops, and then computing dependence distances.

2.2.1 Static, Register-Based Dependence Graph.

The dependence graph on registers results from an SSA (static single assignment form [23]) analysis, provided by MAQAO. Besides, MAQAO handles special cases for dependences on x86:

• xor instructions applied twice to the same register set the value of this register to 0. While the operation reads a register, this is not considered as a read access.

• SIMD instructions can operate on only the lower or higher part of a SIMD register. Operations that work on different parts of a register are not considered in dependence.

In addition to the existing analysis, we tag all dependences where a register is read for an address computation. The graph is then partitioned according to these edges (cutting the graph through these edges), usually in two parts: instructions preceding these edges are address computation instructions (such as index computations or updates of address registers), while instructions after these edges are actual computations (memory accesses, floating point operations, ...) for which SIMDization may be applied. When an indirection occurs, the dependence graph has a path with two tagged edges and can therefore be partitioned into three or more subgraphs. The partition of instructions following all tagged edges is said to be the computational part of the graph, while the

other instructions are part of the address computation part of the graph. In the rare cases where it is not possible to cut the graph following tagged edges, we assume there is no computational part.

2.2.2 Dynamic Dependence Graph.

Dynamic dependences are essential to capture what the compiler may have missed, concerning the control flow or the way data structures are indexed. The dynamic dependence graph is built from the memory trace, for each read and write instruction in innermost loops.

Algorithm 1 describes how dependence distances are computed. w.trace denotes the trace captured for an instruction w. For each pair of a write and a read access in a loop, we first perform an interval test based on their traces (line 2), and then compute a dependence distance. The dependence distance is defined as the number of loop iterations between two instructions accessing the same memory location. When two traces have the same loop structure, the subtraction between the traces (line 4) subtracts the expressions that are at the same position in the trace. If the result is not the same constant value for all subtractions, then ∗ is returned; otherwise the constant value is returned. The special ∗ dependence distance between two instructions denotes the fact that their dependence distance is not constant during the execution of the program. Note that only uniform dependences are captured this way, but as far as SIMDization is concerned, this captures all vectorization opportunities that do not require non-local code transformations.

for(i=1; i<100; i++)
    C[i] = C[i - 1] + B[2 * i];

(a) Code example

for i0 = 0 to 98
    write 0x2ba1a3bd442c + 4 * i0

(b) Compressed trace for C[i]

for i0 = 0 to 98
    read 0x2ba1a3bd4428 + 4 * i0

(c) Compressed trace for C[i-1]

for i0 = 0 to 98
    read 0x2ba1a4000000 + 8 * i0

(d) Compressed trace for B[2 * i]

Figure 2.1: Example of trace compression using NLR. For the code in (a), one compressed trace per memory access is produced, (b), (c) and (d).

Algorithm 1: Dynamic dependence computation for an innermost loop L

1  for w, a write, and r, a read in L do
2      if w.trace ∩ r.trace ≠ ∅ then
3          if loops of w.trace = loops of r.trace then
4              return r.trace − w.trace;
5          else
6              return ∗;
7      else
8          return 0;

For the example in Figures 2.1.b and 2.1.c, both traces have the same structure and the same strides, and the difference between the read and the write addresses is an offset of −4. Then we evaluate how this difference can be compensated by a variation in the loop indices (here i0). We find a unique solution within the loop bounds, 1, and this shows that the dependence distance between the write and the read is 1. In general, finding the vector of iteration counters that compensates for the offset between the read and the write leads to a multi-dimensional dependence vector. Only read-after-write dependences are evaluated, and the sequential order of the assembly code is used to compare relative positions of reads and writes.

for (int i = 1; i < LEN2; i++) {
    for (int j = 1; j < LEN2; j++) {
        bb[j][i] = bb[j][i-1] + cc[j][i];
    }
}

[Dependence graph: the computational part links MOVSS 0x849094(%RCX,%RAX,1),%XMM0 (stride: 1024, 4), ADDSS 0x76cc94(%RSI,%RAX,1),%XMM0 (stride: 1024, 4) and MOVSS %XMM0,0x76cc94(%RCX,%RAX,1) (stride: 1024, 4), with a memory dependence of distance 1, 0 on bb; the address computation and control part holds ADD $0x4,%RAX, CMP $0x3fc,%RAX and JNE 0x4077d8]

Figure 2.2: Code and dependence graph for one loop of function s2233 in TSVC benchmark.

Note that all distances for register dependences correspond in this case to innermost-loop-carried dependences. Figure 2.2 presents the s2233 function from TSVC and its dependence graph, combining both the static and dynamic graphs. The nodes each represent an assembly instruction, along with its strides when it is a memory access. The dashed edges in red represent dependences for registers used in address computation. Cutting the graph along these edges separates the computational part from the address computation and control part. The bold blue edge, labelled with 1, 0, represents the memory dependence corresponding to bb, directly computed from the trace. The strides denoted on the edges have two values: 1024 and 4. The first corresponds to the stride for the innermost loop, j, and the second to the i loop. This shows that here, none of the accesses have good spatial locality.

2.3 SIMDization Analysis

The SIMDization analysis proceeds in two steps: first, we determine whether the code has parallelism compatible with SIMDization, independently of any data layout or control limitations (such as large strides). Then, we refine the analysis to detect special cases and accordingly guide the programmer towards enabling and improving the vectorizability of the code.

2.3.1 Vectorizable Dependence Graph.

The dependence graph (static and dynamic) is first partitioned according to address computation edges, as described previously. The graph is said to be vectorizable if one of the following three conditions applies to the computational part of the graph:

• There is no cycle.

• There is a cycle, with a cumulative weight greater than the width of SIMD vectors.

• There is a cycle, with a cumulative weight smaller than the width of SIMD vectors, and the instructions of the cycle are all of one of the following types: add, mul, max, min. The cycle corresponds to a reduction.

A code with a vectorizable dependence graph may require transformations in order to be SIMDizable. This is detailed in the following section.

2.3.2 Code Transformation Hints.

We propose to identify a number of transformations required for the SIMDization of the code, depending on the dependence graph, on the stride expressions, and on the control flow graph.

Data alignment: When the graph is vectorizable without a cycle, misaligned data is detected by comparing the starting address of all memory streams with the width of SIMD vectors. In the simpler case, the user can either change the memory allocation of heap-allocated data structures, or use pragmas for aligning stack-allocated data. When, for instance, A[i] and A[i+1] occur in the same code, one of the two accesses is misaligned. This would require shuffle instructions, or unaligned loads/stores whenever they exist.

Rescheduling: When the graph is vectorizable without a cycle, there may still exist some dependences with non-null distance. Because the analysis is performed on assembly code, this may require some modifications of the source code, such as rescheduling of loads/stores and computations, or splitting some larger instructions. A template of the vector code is generated by our tool (an example is given in the following section), giving a correct instruction schedule after SIMDization.

Loop transformations: Loop interchange is proposed when all accesses within the loop have a large innermost loop stride, and another loop counter in the expression, corresponding to the same outer loop, has a stride of 4 (for floats and ints). Interchanging these two loops would result in better locality and enable SIMDization.

Loop reversal: Traces with negative stride expressions lead to this hint. Note that if other reads, for instance, have a positive stride, the reversal is not beneficial any more.

Data reshaping required: This is a fallback hint for large innermost loop strides, and for codes with indirections (detected on the static dependence graph). On the Sandy Bridge and Xeon Phi architectures, instructions for loading or storing non-consecutive elements into/from SIMD vectors have been added to the ISA (GATHER on Xeon Phi and Sandy Bridge, SCATTER on Xeon Phi). Using these instructions, through assembly code or intrinsics, is an alternative to data reshaping.

Versioning required: The static analysis of the code may lead to a conclusion different from the trace analysis. For instance, the trace may find a regular stride for a memory stream whereas, statically, this stream results from an indirection or depends on a parameter. Similarly, the dynamic control flow (the real path taken) may be a subset of a more complex static control flow. In these cases, the trace may have captured only one behavior of the code, for a particular input. Vectorization may only be possible in this case through versioning of the loop, depending on the values of the array or of a parameter.

Idiom Recognition: When the code is vectorizable, with 0 dependence distances or with reductions, the dependence graph can match a predefined dependence graph representing a well-known computation. The shape of the dependence graph and the instructions themselves are matched against the predefined graphs. In this case, the user can call a library function instead of trying to

vectorize the actual code. The predefined functions considered so far are: dot product, daxpy, copy, and sparse copy (copy with an indirection either in the load or in the store), but more complex functions can be added with ease.

All these hints help the user rewrite his code to enable SIMDization; in particular, this document explores in depth the problem of data reshaping, by proposing quality SIMDizing layout transformations.
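Acting on the data alignment hint, for instance, can be as simple as the following sketch, using C11 aligned_alloc and the GCC/Clang aligned attribute; the 32-byte boundary matches AVX vectors and is an assumption of this example, not a tool output:

#include <stdlib.h>

/* Stack/static data: alignment attribute (GCC/Clang syntax). */
static float b[1024] __attribute__((aligned(32)));

/* Heap data: C11 aligned_alloc; the size is rounded up because strict
 * C11 requires it to be a multiple of the alignment. */
float *alloc_floats(size_t n) {
    size_t bytes = ((n * sizeof(float) + 31) / 32) * 32;
    return aligned_alloc(32, bytes);
}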

2.4 Conclusion

We presented in this chapter a novel technique to highlight vectorization issues, providing the programmer with detailed feedback on vectorization hurdles in his application. Some issues correspond to control optimizations that are simple and clear enough for the user to apply; some can be handled by state-of-the-art techniques as well. But while some issues can be overcome via control optimizations, data layout issues remain a major problem that compiler analysis is too limited to resolve: deep code modifications are required, and the user is still clueless as to how the modifications should be made and how much he can expect to gain by applying them. In the following chapters, we propose to study data layout transformations extensively; in particular, the data reshaping hint is quantified, and the deep code modifications are pinpointed to the user.

Chapter 3

Layout/pattern analysis

3.1 Data layout formalism ...... 45 3.2 Layout detection ...... 48 3.3 Delinearization ...... 50 3.3.1 General Points ...... 52 3.3.2 Properties ...... 54 3.3.3 Characterization ...... 55 3.3.4 Case Study ...... 57 3.4 Conclusion ...... 59

In order to apply opportune data layout transformations, knowledge of the initial memory structure is required beforehand. This chapter details how memory access patterns are formalized, regardless of their method of acquisition, be it compiler analysis or memory trace analysis. We then explain and detail our choice of analysing memory traces for that matter, that is, how we sort out the initial structure, and how we delinearize it to obtain an intermediate representation that serves two purposes: firstly, it gives a consistent approach to properly apply transformations; secondly, the handiness of the IR allows us to provide quality feedback to the programmer. We also detail how intrinsic kernel SIMDizability can be determined from said memory traces, combining both static and dynamic analysis to obtain valuable information in the aim of achieving high performance.

3.1 Data layout formalism

Discussing data layout transformations is not as straightforward as it may sound. Not only must the structure design be abstracted, but also the patterns that access it, because more than the layout itself, it is the data access patterns that dictate code performance. Indeed, on the one hand, the layout expression lies in the programming language definition, which can be verbose, especially if the structure is complex: the structure itself is composed of different fields that can be arrays or structures themselves, and so on. On the other hand, the data access patterns are described in the source code by loop nests, potentially addressed by multiple complex expressions, with the

possible presence of induction variables, pointers, or index tables. A new level of abstraction is required to allow reasoning on data layouts and access patterns in a systematic manner.

We consider data layouts as any combination of arrays and structures, of any length. A layout is the description of this combination and of the elements that are accessed in it. This description is done for each function, for each different structure. Implicitly, all layouts are considered for a given program fragment. A syntactic memory access expression in the code defines a set of memory address values. This can be denoted as:

$base + J$

where $base$ is the base address and $J$ is a set of positive integers including 0, representing memory offsets. All addresses are within the range $[base, base + d - 1]$, where $d$ is the diameter of $J$. The set of offsets $J$ can be represented as a function:

$S_{J,d} : [0, d-1] \to \{0,1\}$, with $x \mapsto 1$ if $x \in J$, and $0$ otherwise.

This function characterizes the set $J$, since $S_{J,d}^{-1}(1) = J$. An example illustrates this point, $S_{\{0,4,8,12\},16}$:

1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0
(elements 0 to 15)

which represents a structure of 16 elements, of which only 4 are actually accessed. Intuitively, this notation especially allows expressing strides, which are factors indicating offsets between two consecutive memory accesses. In the particular case where $J$ is an interval, that is, the structure elements are accessed consecutively, the structure is denoted $A_d$, or $A_{J,d}$ if $d > \#J$, meaning there is an offset before the array, and possibly padding after it as well. This kind of layout will be referred to as an array layout in the following. Example: $A_{16}$:

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
(elements 0 to 15)

By default, if $S_{J,d}$ is not an array layout, we will call it a structure layout. Now, this flat definition of memory accesses does not account for the data structure itself: arrays of structures (AoS), structures of arrays (SoA) or any other combination can be written in this manner. To build a multidimensional data structure, we define the product operator ⊗ on such functions:

$S_{J,d} \otimes S_{J',d'} : [0, d-1] \times [0, d'-1] \to \{0,1\}$, with $(x, y) \mapsto S_{J,d}(x) \cdot S_{J',d'}(y)$

The product ⊗ defines a multidimensional operator, of domain $[0, d-1] \times [0, d'-1]$. Example: $A_4 \otimes S_{\{0\},4}$:

1
1
1
1
⊗ 1 0 0 0 =
1 0 0 0
1 0 0 0
1 0 0 0
1 0 0 0

This represents the product of two flat structures, one array layout and one structure layout; the obtained result is a two-dimensional array of structures. Note that the formal description corresponds to the intuitive representation of the data, as an AoS corresponds to the combination of the two types of layouts, described by $A_{d'} \otimes S_{J,d}$ for some values of $d$, $d'$ and $J$. For any two layouts expressed as any product of unidimensional layouts, we define a sum operator, corresponding to the union between the elements accessed by each operand: if $L_1$ and $L_2$ are layouts with the same domain $J$:

$L_1 \oplus L_2 : J \to \{0,1\}$, with $x \mapsto L_1(x) + L_2(x)$

where the addition on integers is a saturated addition. Example: $(A_4 \otimes S_{\{0\},4}) \oplus (A_4 \otimes S_{\{1,2\},4})$:

1 0 0 0     0 1 1 0     1 1 1 0
1 0 0 0  ⊕  0 1 1 0  =  1 1 1 0
1 0 0 0     0 1 1 0     1 1 1 0
1 0 0 0     0 1 1 0     1 1 1 0

This operation is useful to merge different patterns that access the same structure into a single expression. The same factorization identities exist with ⊕ and ⊗ as with integers; thus some simplifications are possible between expressions involving both operators:

$(S_{I,d} \otimes L) \oplus (S_{J,d} \otimes L) = S_{I \cup J,d} \otimes L$   (3.1)

$(L \otimes S_{I,d}) \oplus (L \otimes S_{J,d}) = L \otimes S_{I \cup J,d}$   (3.2)

Intuitively, the union between structures is applicable only if they have the same coordinates: the generic layout $L$ must be equal in both operands; otherwise it would mean we are trying to unify two structures that are far apart in memory, which makes no sense and erases useful information.

Also, by definition, if a given structure $L$ is repeated $n$ times in a contiguous manner, then we can write it as an array of structures:

$\bigoplus_{k=1}^{n} L = A_n \otimes L$   (3.3)
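To make the operators concrete, the following toy encoding represents unidimensional layouts as 0/1 vectors and implements ⊗ and ⊕ on them. This is purely illustrative; the thesis manipulates these objects symbolically, and the fixed 4x4 dimensions are an assumption of the sketch.

#include <stdio.h>

#define D1 4
#define D2 4

/* (S1 ⊗ S2)(x, y) = S1(x) * S2(y) */
void product(const int s1[D1], const int s2[D2], int out[D1][D2]) {
    for (int x = 0; x < D1; x++)
        for (int y = 0; y < D2; y++)
            out[x][y] = s1[x] * s2[y];
}

/* (L1 ⊕ L2)(x, y): saturated (boolean) addition */
void sum(const int l1[D1][D2], const int l2[D1][D2], int out[D1][D2]) {
    for (int x = 0; x < D1; x++)
        for (int y = 0; y < D2; y++)
            out[x][y] = (l1[x][y] + l2[x][y]) ? 1 : 0;
}

int main(void) {
    int A4[D1]  = {1, 1, 1, 1};   /* A_4 */
    int S0[D2]  = {1, 0, 0, 0};   /* S_{{0},4} */
    int S12[D2] = {0, 1, 1, 0};   /* S_{{1,2},4} */
    int l1[D1][D2], l2[D1][D2], u[D1][D2];
    product(A4, S0, l1);
    product(A4, S12, l2);
    sum(l1, l2, u);               /* = A_4 ⊗ S_{{0,1,2},4} */
    for (int x = 0; x < D1; x++) {
        for (int y = 0; y < D2; y++) printf("%d ", u[x][y]);
        printf("\n");
    }
    return 0;
}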

3.2 Layout detection

Performing data layout transformations requires analyzing the initial data structures in the first place. This is important because each array has specific patterns, and thus specific transformations; identifying the array affiliation of each pattern is therefore a necessary step. On the source code, finding whether two memory accesses correspond to the same array region is performed by an alias analysis. Delinearization can be used in some simple cases to retrieve the multidimensional structure associated with the addresses. Because indirections or complex operations can be involved in the address computations, we propose to resort to memory traces. The code fragment is executed, and all memory accesses generate a trace. This trace is compacted on-the-fly with the NLR method [51] in order to find possible regularities in strides. The memory addresses accessed by each load/store of the code fragment are compacted as shown below:

for (i = 0; i < 1000; i++)
    for (j = 0; j < 50000; j++)
        A[j] = ...

Figure 3.1: C snippet

for i0 = 0 to 999
    for i1 = 0 to 49999
        val 0x6c4560 + 4*i1
    endfor
endfor

Figure 3.2: Trace snippet, given for an array A of 4-byte elements

Translation from memory traces to the formalism is immediate: the memory access pattern described in Figure 3.1 by instruction A[j] gives $S_{[0,49999],50000}$, a structure whose elements are accessed consecutively, which can also be written as $A_{50000}$. The stride of 4 in the trace in Figure 3.2 corresponds to the size of the element in bytes, a single-precision float here.

Data layouts are said to be distinct if their memory accesses are in intervals that do not intersect. The distinct data layouts must be detected; then each data structure is restructured into a multidimensional layout. This has been described in a paper [39]: the idea is to merge the different memory accesses according to the intersection of their intervals of addresses. We can determine a base address common to the data layouts found (the minimum address), and a maximum diameter for the layout (the maximum address value minus the base). This analysis consists in determining the layout of the data accessed by each assembly instruction. The data layout detection focuses on finding out how many arrays are accessed, with how many dimensions, and with which element structure (that is, how many fields per array element). This assumes that each load/store assembly instruction of the studied function accesses exactly one single array. The memory addresses accessed by each load/store of the code fragment are compacted as shown in Figure 3.4. Traces are here represented by loops iterating over the successive address values taken during the execution. The first step of the analysis consists in identifying the loads/stores that access the same array. To this end, each region accessed by a load/store is converted to a simpler representation, a strided interval (lower address, upper address, access stride).


for (nl = 0; nl < ntimes; nl++) {
    for (i = 0; i < N; i += 2) {
        a[i] = a[i-1] + b[i];
    }
}

Figure 3.3: C source code of function s111 from TSVC benchmarks [72].

# Trace for access a[i-1]
for i0 = 0 to 999
    for i1 = 0 to 1535
        val 0x1525d40 + 8*i1
    endfor
endfor
Strided interval for a[i-1]: [0x1525d40; 0x1525d40 + 8*999; 8]

# Trace for access b[i]
for i0 = 0 to 999
    for i1 = 0 to 1535
        val 0x1528d84 + 8*i1
    endfor
endfor
Strided interval for b[i]: [0x1528d84; 0x1528d84 + 8*999; 8]

# Trace for access a[i]
for i0 = 0 to 999
    for i1 = 0 to 1535
        val 0x1525d44 + 8*i1
    endfor
endfor
Strided interval for a[i]: [0x1525d44; 0x1525d44 + 8*999; 8]

Figure 3.4: Trace example on function s111 from TSVC benchmarks [72]. Each trace is compacted with the NLR algorithm into loops in this simple example, iterating over successive addresses. A simplified representation with strided intervals is used.

For clarity, the name of the array is shown in comments in the trace of Figure 3.4. Formally, let $I$ denote the set of assembly instructions accessing memory. We define a relation $\equiv_{array}$ between instructions $i_1, i_2 \in I$ as: $i_1 \equiv_{array} i_2$ iff $i_1$ and $i_2$ access the same array. $\equiv_{array}$ is an equivalence relation, whose classes represent the different arrays. The idea is that two instructions with overlapping accessed memory regions are equivalent. Algorithm 2 finds the different arrays by merging overlapping regions. Its complexity is $O(N \log N)$, due to the sorts.

Lemma 1 Algorithm 2 finds the sets of instructions that access the same arrays. It computes I/≡array.

For example, the analysis of the three regions from Figure 3.4 builds the sets L = {0x1525d40, 0x1525d44, 0x1528d84} and U = {0x1525d40 + 8×999, 0x1525d44 + 8×999, 0x1528d84 + 8×999}. The first two regions are found to be accesses to the same array, while the third one is not, since 0x1528d84 > 0x1525d44 + 8×999. Now, the second step of the analysis consists in finding load/store instructions accessing the same element field within an array of structures (e.g. t[0].x, t[1].x, ..., t[n].x). Among the instructions accessing the same array of structures, we define a relation ≡field between each pair of instructions accessing the same field in the same array. Formally, for each two instructions i1, i2 accessing the same array (i1 ≡array i2), the relation i1 ≡field i2 is verified iff i1 ≡array i2 and:

lower_1 ≡ lower_2 (mod gcd_{i ∈ [i1]_{≡array}}(stride_i))

Algorithm 2: Identifying distinct arrays from access traces.

Data: I = list of load/store triplets [lower_i, upper_i, stride_i], i = 1..N
Result: OUT = I/≡array, the set of instructions grouped by the array they access
L = {lower_i, i = 1..N}
U = {upper_i, i = 1..N}
sort L by increasing addresses
sort U by increasing addresses
CLASS = {I_1}
for k = 2..N do
    if L_k > U_{k-1} then
        OUT = OUT ∪ {CLASS}
        CLASS = ∅
    CLASS = CLASS ∪ {I_k}
OUT = OUT ∪ {CLASS}
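As an illustration, the following C sketch implements one possible reading of Algorithm 2, merging overlapping strided intervals into array classes; the type and function names (region_t, group_regions) are hypothetical and not part of the actual tool.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

/* A memory region accessed by one load/store instruction. */
typedef struct { uint64_t lower, upper, stride; int insn; } region_t;

static int by_lower(const void *a, const void *b) {
    const region_t *ra = a, *rb = b;
    return (ra->lower > rb->lower) - (ra->lower < rb->lower);
}

/* Group instructions whose address intervals overlap (one group per array).
 * class_of[k] receives the array class of r[k]. Returns the class count. */
static int group_regions(region_t *r, int n, int *class_of) {
    qsort(r, n, sizeof *r, by_lower);          /* O(N log N), as in the text */
    int cls = 0;
    uint64_t max_upper = r[0].upper;
    class_of[0] = 0;
    for (int k = 1; k < n; k++) {
        if (r[k].lower > max_upper) {          /* gap: start a new array class */
            cls++;
            max_upper = r[k].upper;
        } else if (r[k].upper > max_upper)
            max_upper = r[k].upper;
        class_of[k] = cls;
    }
    return cls + 1;
}

int main(void) {
    /* The three strided intervals of Figure 3.4. */
    region_t r[] = {
        { 0x1525d40, 0x1525d40 + 8*999, 8, 1 },
        { 0x1528d84, 0x1528d84 + 8*999, 8, 2 },
        { 0x1525d44, 0x1525d44 + 8*999, 8, 3 },
    };
    int class_of[3];
    int n = group_regions(r, 3, class_of);
    printf("%d distinct arrays\n", n);         /* prints: 2 distinct arrays */
    for (int k = 0; k < 3; k++)
        printf("insn %d -> array %d\n", r[k].insn, class_of[k]);
    return 0;
}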

The gcd of the strides of all accesses on the array corresponds to the size in bytes of the structure. The values of lower_1 and lower_2 modulo this size correspond to the offset of the field within a structure. Then, the fields of each partition of I/≡array can be sorted according to their lower value modulo the gcd of the strides. Determining the field layout of an array of structures can thus be done with an O(N log N) complexity. In the previous example, the two strided intervals [0x1525d40; 0x1525d40 + 8×999; 8] and [0x1525d44; 0x1525d44 + 8×999; 8] are not found equivalent, since 0x1525d40 ≢ 0x1525d44 (mod 8). Therefore the two instructions access different fields in an array of structures. Note that in the initial C code, there is no structure. However, the stride 2 on the loop counter entails that all loads on a are on even indices while the stores are on odd indices, a behavior similar to an access to a 2-field structure. To sum up, Figure 3.8 shows how we recover the array affiliation of each pattern. Finally, the set of traces on each array has the form:

⊕_{k=1}^{n} S_{I_k,d}    (3.4)

which is simply the union of all the patterns affiliated to a given initial array.
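Before moving on, here is a minimal C sketch of the field-detection step described above: the structure size is the gcd of the strides on the array, and two accesses land in the same field iff their lower addresses are congruent modulo that gcd. It replays the example above; the names are illustrative.

#include <stdio.h>
#include <stdint.h>

/* gcd of two strides */
static uint64_t gcd(uint64_t a, uint64_t b) {
    while (b) { uint64_t t = a % b; a = b; b = t; }
    return a;
}

int main(void) {
    /* Two accesses already known to hit the same array (Figure 3.4). */
    uint64_t lower[]  = { 0x1525d40, 0x1525d44 };
    uint64_t stride[] = { 8, 8 };

    /* Structure size = gcd of all strides on the array. */
    uint64_t size = stride[0];
    for (int i = 1; i < 2; i++) size = gcd(size, stride[i]);
    printf("structure size: %lu bytes\n", (unsigned long)size);  /* 8 */

    /* Field offset of each access = lower address modulo the structure size:
     * distinct offsets (0 and 4) mean distinct fields of a 2-field structure. */
    for (int i = 0; i < 2; i++)
        printf("access %d -> field offset %lu\n",
               i, (unsigned long)(lower[i] % size));
    return 0;
}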

3.3 Delinearization

Raw memory accesses have a complicated expression with a potentially large number of terms. It is necessary to reduce it to a simple, compact expression to allow systematic operations. Along with mathematical simplifications, delinearization is a solution to simplify the flat access definition, by exposing the multidimensional aspect of the layout. Such a multidimensional representation allows us to conveniently express the layout specificities through our formalism.


for (nl = 0; nl < ntimes; nl++) {
    for (i = 0; i < N; i += 2) {
        a[i] = a[i-1] + b[i];
        //  A3     A1     A2
    }
}

Figure 3.5: Source code

for i0 = 0 to 999
    for i1 = 0 to 1535
        val 0x1525d40 + 8*i1    # A1
    endfor
endfor
for i0 = 0 to 999
    for i1 = 0 to 1535
        val 0x1528d84 + 8*i1    # A2
    endfor
endfor
for i0 = 0 to 999
    for i1 = 0 to 1535
        val 0x1525d44 + 8*i1    # A3
    endfor
endfor

[Figure 3.7 shows the memory space: A1 spans [L1 = 0x1525d40; U1 = 0x1528d38], A3 spans [L3 = 0x1525d44; U3 = 0x1528d3c], interleaved with A1, while A2 spans [L2 = 0x1528d84; U2 = 0x152bd7c], disjoint from the first two.]

Figure 3.6: Associated memory traces considering 4-byte elements; the factor 8 therefore corresponds to a 2-element stride.

Figure 3.7: Memory space

Figure 3.8: Layout detection: we have here three distinct access patterns in the source code, captured by the memory traces. Figure 3.7 highlights how they are separated or aggregated into the same array, by computing their intersections, as detailed in Algorithm 2.

3.3.1 General Points

The specificity of an initial data layout lies in its patterns – or strides – in the regular case (i.e. regular strides). Essentially, strides are factors indicating offsets between two consecutive memory accesses, as said before. Factorization allows us to exhibit this specificity in the formalism we defined, while simplifying the initial expression and making it easy to manipulate and transform. As we defined it, this factorization constitutes a delinearization of the initial structure, as ⊗ is multidimensional, and it can be applied to any strided unidimensional layout. In the general case, the memory accesses are given as linearized addresses, with no multidimensional information. The objective is to discover the different multidimensional layouts used in the code fragment considered. The following describes how to delinearize a single given layout out of the series of layouts obtained by the array detection detailed before; the operation is the same for each of them. The general formulation of the layout, after array detection, is a sum of n structure layouts, each corresponding to a different memory access:

base + ⊕_{k=1}^{n} S_{I_k,d}    (3.5)

where d is the diameter of the layout and n the number of patterns, potentially large. However, factorization may enable some simplifications as defined previously, reducing the number of terms of the sum by performing the union operation. The transformation of this layout into a multidimensional layout resorts to several rewriting rules, creating additional dimensions. These rules are the following:

S_{I,p×n} → S_{J,p} ⊗ (A_n, i_k)   if I = {j × n + k, j ∈ J, k ∈ [0,n[}    (3.6)

S_{I,p×n} → S_{J,p} ⊗ (A_{[ω,ω+k[,n}, i_k)   if I = {j × n + k + ω, j ∈ J, ω < k, k ∈ [0,n[}    (3.7)

S_{I,p×n} → (A_n, i_k) ⊗ S_{J,p}   if I = {k × p + j, j ∈ J, k ∈ [0,n[}, p > #J − 1    (3.8)

S_{p×I+j,p×n} → S_{I,n} ⊗ S_{{j},p}    (3.9)

where p is a scalar. These rules represent different ways of factorizing by p. Tuples (A, i) represent arrays with their associated iterator, as retrieved from the trace, which is of importance for the next chapters. Intuitively, Rule 3.8 consists in finding an AoS in the initial layout, that is a consecutively repeated stride p, with possibly a small offset j. The offset determines the position of the element – its field number – in the internal structure. We give a full example to get a better grasp of Rule 3.8 in particular, and of the rules in general. Example:

struct T { float a, b, c, d; };
struct T t[4];
for (j = 0; j < 4; j++) {
    t[j].a = ...
    t[j].c = ...
}

for i1 = 0 to 3
    val 0x6c4560 + 16*i1
endfor
// stride 16 must be divided by 4 (sizeof(float))

Figure 3.9: Code snippet

Figure 3.10: Trace snippet

Here we are dealing with a structure of 4 fields, resulting in an access pattern with a stride of 4 within the single loop expressed. Therefore the layout can be expressed as base + [0,3]×4 + 2, or again S_{{[0,3]×4+2},16}. Figure 3.11 below shows how delinearization operates on this example, focusing on the access to the c field as depicted in Figure 3.9:

S_{{[0,3]×4+2},16} →(Rule 3.8) A_4 ⊗ S_{{2},4}

[Figure: each row 0 0 1 0 marks the single accessed field (c) in a 4-field structure; the four such rows of the flat layout are regrouped into the two-dimensional layout A_4 ⊗ S_{{2},4}]

Figure 3.11: Graphic representation in the formalism of an application of Rule 3.8

In this case, the stride p indicates the presence of a structure, while the offset j indicates the position of the field inside the structure: we recognize here the AoS, which can be written as a 2-dimensional layout. More generally, note that after delinearization, S functions are never turned into arrays A: even after union simplifications, and even when newly found full structures – in the sense that all the fields are accessed – are discovered, an S is never viewed as an A, despite the mathematical equivalence. The rewriting rules show without ambiguity that they are reconstructed structures, which still implies strided patterns, not contiguous access patterns. Another benefit of the factorization is the isolation of arrays, id est contiguous memory chunks, that we may want to reorder to our convenience at the time of transformation choices, as we will discuss in the next chapters. Now this set of rules is used to systematically transform the flat initial layout into an equivalent multidimensional layout. Algorithm 3 explains how to repeatedly apply the rewriting rules in order to find the simplest multidimensional layout.

Algorithm 3: Delinearization
Data: Initial unidimensional layout: base + ⊕_{k=1}^{n} S_{I_k,d}
Result: A multidimensional layout, if possible
repeat
    foreach rule among (3.6), (3.7), (3.8), (3.9) do
        if the rule can apply, apply it to all terms of the sum ⊕, at the same position of the ⊗
until no rule applies

There is no constraint on the order in which to apply the rules: as soon as a rule applies, it is applied right away on the input layout formal expression. This property of the rewriting system is

called confluence: there is no preferred order to get to the unique solution. Because the topology is the same regardless of the pattern, when a rule is applied on a given term of a given operand of ⊕, it must be applied as well on all the other terms. Moreover, because all the rules imply diminishing layout sizes, the algorithm is guaranteed to terminate. As a final word on that topic, if a given layout is intrinsically unidimensional, no rule is applicable and its structure stays in its original form.
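To illustrate the mechanics of one rule, the sketch below factors a common stride p out of an index set, in the spirit of Rule 3.9 (S_{p×I+j,p×n} → S_{I,n} ⊗ S_{{j},p}); it is a simplified stand-alone illustration, not the actual delinearization engine, and the names are hypothetical.

#include <stdio.h>

static unsigned gcd(unsigned a, unsigned b) {
    while (b) { unsigned t = a % b; a = b; b = t; }
    return a;
}

/* Factor the index set {p*i + j} as in Rule 3.9:
 * p is the gcd of consecutive differences, j the common residue. */
static void factor_rule_3_9(const unsigned *idx, int n,
                            unsigned *p_out, unsigned *j_out) {
    unsigned p = 0;
    for (int k = 1; k < n; k++)
        p = gcd(p, idx[k] - idx[k-1]);   /* assumes idx sorted increasing */
    *p_out = p;
    *j_out = idx[0] % p;
}

int main(void) {
    /* Indices 4*[0,4[ + 2 = {2, 6, 10, 14}, as in Figures 3.9/3.10. */
    unsigned idx[] = { 2, 6, 10, 14 };
    unsigned p, j;
    factor_rule_3_9(idx, 4, &p, &j);
    /* Prints p = 4, j = 2: S_{4*I+2,16} -> S_{I,4} (x) S_{{2},4} */
    printf("extracted stride p = %u, field offset j = %u\n", p, j);
    return 0;
}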

3.3.2 Properties

We want to show that the result of the rewriting rules is a normal form.

Termination Proof

Rewriting rules can only produce A or S functions from a given S function. Consequently, as no rule applies to A functions, they cannot be rewritten further. Structures in the general sense, on the other hand, are defined such that their diameter is always strictly greater than 1, as a structure of size one is the neutral element of ⊗. We want to show that the rewriting rules always produce simpler objects from a structure S_{I,d}: either structures S_{I′,d′} equipped with a set I′ such that #I′ < #I, or structures that have a strictly smaller diameter, that is d′ < d; this implies that the system terminates. The first three rules (3.6, 3.7, 3.8) all similarly divide a structure of size p × n into two separate structures of respective sizes p and n. As stated before, both p and n are necessarily strictly greater than 1, therefore both produced structures are smaller than the initial one. The fourth rule (3.9) produces two structures. One of them has a size of p < p × n, which implies it cannot be rewritten infinitely. The other has a set I necessarily smaller than the initial one p × I + j, which implies it cannot be rewritten infinitely either. Finally, we have shown that no rule can be applied infinitely, thus the rewriting system terminates.

Confluence Proof

We want to show that the rules commute two by two. Rules 3.6 and 3.8:

• If J = ∅ and p ≠ 1, Rule 3.6 is inapplicable.

• If J = ∅ and p = 1, then I is an interval [0, n[ and thus S_{I,d} is actually written as A_n.

• If J ≠ ∅: in Rule 3.6, ∀j ∈ J, ∀k, j × p > k, while in Rule 3.8, ∀j′ ∈ J′, ∀k, k × p > j′. Therefore J × p and J′ are distinct, meaning the rules have distinct domains of application and can thus be applied in arbitrary order.

Rules 3.8 and 3.9:

• Rule 3.9 is a generalization of Rule 3.8 that applies to structures in the general sense rather than only to arrays. The key is that both rules extract a factor p out of a structure/array, therefore:

– if Rule 3.9 is applied after Rule 3.8, p has already been taken out, as it would have been using Rule 3.9;


– if Rule 3.8 is applied after Rule 3.9, either S_{I,n} = A_n, or the array represented by an interval is left untouched by Rule 3.9 and can still be properly transformed by Rule 3.8.

Rules 3.6 and 3.7 are mutually exclusive, as the scalar ω defined in Rule 3.7 differentiates the two rules. Therefore, by transitivity, the rewriting system is confluent. Finally, the termination and confluence properties together imply that the rewriting system entails a normal form.

3.3.3 Characterization

We want to shed light on some particular and well-known access patterns that are correctly handled by our delinearization technique.

Structure fields with different sizes

The proposed formalism does not take into account element size information – or element type. Structures should not be made of differently sized elements, first because of the induced stride, second because of the potential alignment issues, depending on the ordering of the different fields in the structure. Implicitly, here, all elements have the same type, which is determined beforehand, at the array detection step. We distinguish instructions by their respective types; this way we segregate memory traces and therefore layouts, and the transformed layout will be of only one data type. Now, structure fields can still be arrays of different sizes. If the structure size is not divisible by the array size, it may be because we are dealing with a structure with arrays of different sizes. First of all, if there is a single lonely element in the structure, it must be extracted just as described before for the heterogeneous data type structures case. Second, the presence of such a structure is explicitly handled by Rule 3.7, as shown in the following example. Example:

struct s { float a[2], b[m0], c[m1]; };
struct s A[N*M];
...
for (j = 0; j < ...

Figure 3.12: Synthetic example of a case of structure of arrays with different sizes

Which gives, after tracing the two patterns:

S_{{M×[0,N[+[0,n0[+2},N×M}
⊕ S_{{M×[0,N[+[0,n1[+2+m0},N×M}

Rule 3.8 applies:

A_N ⊗ S_{{[0,n0[+2},M}
⊕ A_N ⊗ S_{{[0,n1[+2+m0},M}

Here, the classical SoA Rule 3.6 is inapplicable, due to the offsets within the structure. These are actually structures of different arrays, treated by Rule 3.7:

A_N ⊗ A_{[2,n0+2[,M}
⊕ A_N ⊗ A_{[2+m0,n1+2+m0[,M}

This depicts how the formalism expresses structures of arrays of different sizes: they are represented as incomplete arrays, as here M = 2 + m0 + m1. The structure fields b and c in Figure 3.12 are materialized here by the sum of two layouts.

Scattered structure

The rewriting system also handles arrays of pointers, a class of layouts that can arise from poorly allocated arrays of arrays. They can also correspond to fixed trees or graphs; were those still in their building phase, our topological approach to layout transformations would not be useful. Arrays of pointers correspond to the case where the trace of a given instruction is separated into several equivalent chunks. That means multidimensional elements can be separated by arbitrary offsets ω_k; this does not affect delinearization. A corresponding access pattern has the form:

base + (ω_0 = 0) + S_{I,d} + ω_1 + S_{I,d} + ··· + ω_{n−1} + S_{I,d}

which is also

base + ⊕_{k=0}^{n−1} (ω_k + S_{I,d})

or again

base + ⊕_{k=0}^{n−1} (S_{∅,ω_k} + S_{I,d})    (3.10)

Now, the NLR algorithm by itself is not able to detect such a pattern in the trace. However, this problem is close to the longest repeated substring problem, the key being to match the pattern of Equation 3.10, where the longest repeated substring corresponds to the term S_{I,d}. The initial string s is formed by the distances between consecutive accesses in the trace elements (or addresses) a, such that:

(s_0, s_1, ..., s_{n−1}) = (a_1, a_2, ..., a_n) − (a_0, a_1, ..., a_{n−1})

From s we build the suffix tree [113], which allows efficient resolution of the longest repeated substring problem. Finally, we have actually found an array-of-pointers pattern if the distance between each pair of substrings is exactly the substring size plus 1, which corresponds to the offset ω_k, whose value is irrelevant. Note that the need for finding an offset value between the substrings automatically excludes any substring overlapping. The formalized array of pointers described by Equation 3.10 does not fall under the definition of an array of structures as we defined it, because all the elements here are different due to their respective offsets. However, as we will see in the next section, offsets – chunks of memory where no element is accessed – are removed by compression, which allows us to retrieve the natural array definition. This gives:


base + ⊕_{k=0}^{n−1} S_{I,d}

which is, using Property 3.3:

base + A_n ⊗ S_{I,d}
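To give an idea of the detection step described above, the following sketch builds the difference string from a trace and finds its longest repeated non-overlapping substring with a simple dynamic program; the thesis relies on a suffix tree [113] for efficiency, so this quadratic version is only an illustrative stand-in, with hypothetical names and data.

#include <stdio.h>

#define NMAX 16

int main(void) {
    /* Addresses of a scattered structure: two equivalent chunks of
     * stride-8 accesses separated by an arbitrary offset. */
    long a[] = { 0, 8, 16, 24, 1000, 1008, 1016, 1024 };
    int n = sizeof a / sizeof a[0];

    /* Difference string: s[k] = a[k+1] - a[k]. */
    long s[NMAX];
    for (int k = 0; k < n - 1; k++) s[k] = a[k+1] - a[k];
    int m = n - 1;

    /* Longest repeated non-overlapping substring via DP:
     * dp[i][j] = length of the common suffix of s[0..i-1] and s[0..j-1],
     * capped so the two occurrences do not overlap. */
    int dp[NMAX][NMAX] = {0}, best = 0, end = 0;
    for (int i = 1; i <= m; i++)
        for (int j = i + 1; j <= m; j++)
            if (s[i-1] == s[j-1] && dp[i-1][j-1] < j - i) {
                dp[i][j] = dp[i-1][j-1] + 1;
                if (dp[i][j] > best) { best = dp[i][j]; end = i; }
            }

    printf("repeated chunk of %d strides starting at s[%d]:", best, end - best);
    for (int k = end - best; k < end; k++) printf(" %ld", s[k]);
    printf("\n");   /* expected: the stride-8 run { 8 8 8 } */
    return 0;
}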

Non distinct patterns

Rule 3.9 allows us to extract unnecessary strides out of the array expression (i.e. factorization). However, incompressible strides complicate the array expression and may limit transformation opportunities. Example:

for (i0 = 0; i0 < n0; i0++) {
    a[4*i0] = a[28*i0];
    ...
}

Figure 3.13: Synthetic example of imperfectly overlapping patterns

Which gives, after delinearization:

S_{{4×[0,n0[},28×n0}
⊕ S_{{28×[0,n0[},28×n0}

Rule 3.9 applies:

S_{{[0,n0[},7×n0} ⊗ S_{{0},4}
⊕ S_{{7×[0,n0[},7×n0} ⊗ S_{{0},4}

Rule 3.6 and Rule 3.8 apply:

S_{{0},7} ⊗ A_{n0} ⊗ S_{{0},4}
⊕ A_{n0} ⊗ S_{{0},7} ⊗ S_{{0},4}

The stride 7 here is incompressible: the expression cannot be simplified any further. This kind of supplementary stride will limit our transformation power – as it should.

3.3.4 Case Study

Let us consider the example code in Figure 3.14, which belongs to a Lattice QCD application. We want to apply Algorithm 3 to restructure the example trace below:


for (iL = 0; iL < L/2; iL += 1) {
    for (j = 0; j < 4; j++) {
        r0  = U[idn[4*iL]][0] * tmp[j];
        r0 += U[idn[4*iL]][1] * tmp[n2+j];
        r0 += U[idn[4*iL]][2] * tmp[2*n2+j];
        r1  = U[idn[4*iL]][3] * tmp[j];
        r1 += U[idn[4*iL]][4] * tmp[n2+j];
        r1 += U[idn[4*iL]][5] * tmp[2*n2+j];
        r2  = U[idn[4*iL]][6] * tmp[j];
        r2 += U[idn[4*iL]][7] * tmp[n2+j];
        r2 += U[idn[4*iL]][8] * tmp[2*n2+j];
        ID2[j]      += r0;
        ID2[n2+j]   += r1;
        ID2[2*n2+j] += r2;
    }
}

Figure 3.14: Example of Lattice QCD simulation. The 2D array U uses an indirection. All elements are complex double values. The space iterated by the outer loop is a 4-D space, linearized, and the indirection is used to process the white elements of a 4-D checkerboard.

for i0 = 0 to 255
    for i1 = 0 to 255
        val 0x00001000 + 16384*i0 + 32*i1
        val 0x00001008 + 16384*i0 + 32*i1
    endfor
    for i1 = 0 to 255
        val 0x00003010 + 16384*i0 + 32*i1
        val 0x00003018 + 16384*i0 + 32*i1
    endfor
endfor

The trace is given as a for..loop enumerating addresses; for simplification, it only corresponds to the trace generated by the memory accesses to matrix U in the first statement, with the outer dimension removed. Removing the scheduling information, we can build the following initial structure corresponding to the values of the trace. Essentially, the iterated domains correspond to the domains of the structures:

U + S_{{2048×[0,255]+4×[0,255]},d}
⊕ S_{{2048×[0,255]+4×[0,255]+1},d}
⊕ S_{{2048×[0,255]+4×[0,255]+1024+2},d}
⊕ S_{{2048×[0,255]+4×[0,255]+1024+3},d}

Applying Rule 3.9 removes the stride 4:

U + S_{{512×[0,255]+[0,255]},d} ⊗ S_{{0},4}
⊕ S_{{512×[0,255]+[0,255]},d} ⊗ S_{{1},4}
⊕ S_{{512×[0,255]+256+[0,255]},d} ⊗ S_{{2},4}
⊕ S_{{512×[0,255]+256+[0,255]},d} ⊗ S_{{3},4}

This can be simplified by merging the first two lines and the last two:

U + S_{{512×[0,255]+[0,255]},d} ⊗ S_{{0,1},4}
⊕ S_{{512×[0,255]+256+[0,255]},d} ⊗ S_{{2,3},4}

Applying Rule 3.8 transforms the structure into an array of structures:

U + A_{256} ⊗ S_{{[0,255]},512} ⊗ S_{{0,1},4}
⊕ A_{256} ⊗ S_{{256+[0,255]},512} ⊗ S_{{2,3},4}

Finally, applying Rule 3.6 splits the first structure into a structure of arrays:

U + A_{256} ⊗ S_{{0},2} ⊗ A_{256} ⊗ S_{{0,1},4}
⊕ A_{256} ⊗ S_{{1},2} ⊗ A_{256} ⊗ S_{{2,3},4}

This corresponds to an AoSoAoS: an array of 2 kinds of lines, even lines and odd lines. Even lines have 256 elements that are structures of 4 doubles, using only the first 2; odd lines have 256 elements of 4 doubles, using only the last 2.


3.4 Conclusion

This chapter focused on expressing data layouts in a novel formalism. We proposed an algorithm to detect distinct initial arrays based on memory traces, and a rewriting system that allows any given layout to be expressed in a normal form, which is a delinearized expression of the initial layout. We also showed on a case study that regular structures such as checkerboard patterns can be expressed using our formalism and are in fact AoS of AoS. The next step after array delinearization is the exploration of layout transformations.


Chapter 4

Transformations

4.1 Layout Operations
    4.1.1 Permutation
    4.1.2 Splitting
    4.1.3 Compression
4.2 Exploration
    4.2.1 Basic Constraints
    4.2.2 Locality Constraints
    4.2.3 Parallelism Constraints
4.3 Case Study
4.4 Conclusion

Finding the best layout transformation is difficult: runtime randomness and ever-changing computer architectures, along with the unpredictability of data structures, make it hard to determine such a layout a priori. However, it is possible to propose a narrow selection of educated guesses as to which layout transformations are thought to potentially improve performance. This chapter presents the transformation space exploration, which means exploring the most promising layouts in terms of locality in the first place, locality being the one performance judge we can rely on and compute through our proposed formalism. The exploration is also shaped to preferably address layout designs that take advantage of different levels of parallelism.

4.1 Layout Operations

We define in this section basic operations on data layouts. Formally, all correct transformations L′ of L are such that:

|L′⁻¹(1)| = |L⁻¹(1)|    (4.1)

where L′ and L are layouts as we defined them in Section 3.1. This states the first constraint on data layout transformations: the initial layout elements must be preserved. Indeed, the memory-traces

approach ensures that all the layout elements are actually used for the computations, since the layout representation is based exclusively on data actually processed, whereas with a compiler-side approach, elements that are possibly never used cannot be spotted and evicted. However, it is more likely that the possible gaps in a layout are artifacts of the memory traces, in the sense that they are meant to be used, but just not for the particular run the memory traces capture. This entails uncertainty on the appropriate transformation to apply: should the gaps be preserved? Should they be erased by default? One more note on the formulation of correct transformations (Equation 4.1): redundant layouts – i.e. layouts composed of some duplicated elements – are not considered for the moment. Now, actual transformation operations need to be properly characterized. Indeed, it is necessary to define systematic operations allowing a wide set of new layouts to be constructed. Moreover, it is the sequence of transformations that defines the data mapping onto the new layout, which is essential for transformation code generation. For this matter, we introduce in Table 4.1 a rewriting system describing the layout operations we want to perform.

Permutation    L ⊗ L′ → L′ ⊗ L
Splitting      A_{n×m} → A_n ⊗ A_m
Compression    S_{I,d} → S_{I′,d′}, if #I′ = #I and d′ < d

Table 4.1: Layout transformations rewriting system

This rewriting system reflects that our approach focuses on layout topology changes, consistently with the delinearization system proposed in Section 3.3, where the layout topology is defined by the number and the size of the respective layout dimensions. It particularly means that transformations which swap individual array elements are not considered, as we assume all the structure fields are hot, so the swapping would have no impact on actual layout performance. Let us now expand more thoroughly on the mentioned operations.

4.1.1 Permutation

High performance can only be achieved through locality in the data access patterns. Therefore we need an explicit way of properly altering locality, meaning reordering memory blocks, in the hope of optimizing memory accesses. To do so, we define the permutation operation as follows:

L ⊗ L′ → L′ ⊗ L    (4.2)

A lot of locality optimizations can be expressed as layout permutations. An intuitive example is the matrix transposition. Let M be a 2-dimensional matrix of size n × m; it can be expressed as:

M = A_n ⊗ A_m

The transposed matrix is the permutation of the two terms, which gives a new matrix M^T:

M^T = A_m ⊗ A_n
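As a minimal concrete reading of this permutation, the copy below materializes M^T from M; the matrix names and sizes are purely illustrative.

#include <stdio.h>

#define N 2
#define M 3

int main(void) {
    /* M = A_n (x) A_m: n rows of m contiguous elements. */
    float A[N][M] = { {0, 1, 2}, {3, 4, 5} };
    float T[M][N];   /* M^T = A_m (x) A_n */

    /* The permutation of the two dimensions is realized by a copy. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++)
            T[j][i] = A[i][j];

    for (int j = 0; j < M; j++, printf("\n"))
        for (int i = 0; i < N; i++)
            printf("%4.0f", T[j][i]);
    return 0;
}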

This is also the operation that formally expresses the common AoS (Array of Structures) to SoA (Structure of Arrays) transformation, crucial on today's architectures and thus the subject of numerous research efforts (see Chapter 1). Consider the following excerpt:


struct s { float a, b, c, d; };
struct s A[4];
...
for (i = 0; i < 4; i++) {
    A[i].b = ...
}

Then the layout obtained via delinearization is an AoS expressed as:

A_4 ⊗ S_{{1},4}

where the outer dimension is the explicitly defined 4-element array of structures A. The inner dimension is a structure s that has only one element accessed: the field named b. The transformation to SoA is a permutation that optimizes locality, as graphically shown below:

A_4 ⊗ S_{{1},4} →(permutation) S_{{1},4} ⊗ A_4

[Figure: each row 0 1 0 0 marks the accessed field b in a 4-field structure; after the permutation, the four accessed elements form a single contiguous row 1 1 1 1]

where the horizontal layers represent contiguous chunks of memory. The layout expression after permutation is the following:

S_{{1},4} ⊗ A_4

And the resulting transformed code can be written as:

struct s { float a[4], b[4], c[4], d[4]; };
struct s A;
...
for (i = 0; i < 4; i++) {
    A.b[i] = ...
}
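Putting the two versions side by side, the sketch below performs the AoS-to-SoA restructuring by a copy, in the spirit of the copy-in operation discussed in Chapter 5; it is a self-contained illustration, and the helper name copy_in is hypothetical.

#include <stdio.h>

#define N 4

/* AoS: A_4 (x) S_{{1},4} -- only field b is accessed by the kernel. */
struct aos { float a, b, c, d; };
/* SoA: S_{{1},4} (x) A_4 -- each field becomes a contiguous array. */
struct soa { float a[N], b[N], c[N], d[N]; };

/* Copy-in: remap the data from the old layout to the new one. */
static void copy_in(struct soa *dst, const struct aos *src) {
    for (int i = 0; i < N; i++) {
        dst->a[i] = src[i].a;
        dst->b[i] = src[i].b;
        dst->c[i] = src[i].c;
        dst->d[i] = src[i].d;
    }
}

int main(void) {
    struct aos A[N] = { {0,1,2,3}, {4,5,6,7}, {8,9,10,11}, {12,13,14,15} };
    struct soa B;
    copy_in(&B, A);

    /* The kernel now touches a contiguous, vectorizable array. */
    for (int i = 0; i < N; i++)
        B.b[i] = 2.0f * B.b[i];
    for (int i = 0; i < N; i++)
        printf("%5.1f", B.b[i]);   /* 2 10 18 26 */
    printf("\n");
    return 0;
}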

4.1.2 Splitting

Another way of optimizing locality is to carefully change the sizes of the different array dimensions, in order to take advantage of different levels of parallelism, such as SIMD-level or thread-level parallelism. Indeed, if applicable, one may want to reshape the initial array to have a small contiguous innermost dimension to allow efficient SIMDization, and similarly have an outer dimension large enough to be conveniently parallelized with threads. Moreover, reshaping layout dimensions according to knowledge of the architecture, such as cache sizes, can also have a massive impact on performance. The splitting operation allows us to divide a given dimension into smaller ones:

A_{n×m} → A_n ⊗ A_m

Note that this is a parameterized transformation, in the sense that n and m have to be chosen among the values that divide n × m, as each array size must be an integer greater than 1. A concrete example is the transformation from AoS to AoSoA (Array of Structures of Arrays). The idea behind AoSoAs is to take advantage of recent architectures such as the Intel MIC¹, or more generally any architecture with SIMD units.

A_{256} ⊗ S_{[0,3],4}
→(splitting) A_{64} ⊗ A_4 ⊗ S_{[0,3],4}
→(permutation) A_{64} ⊗ S_{[0,3],4} ⊗ A_4

The resulting layout now has an inner dimension suitable for SIMD parallelization. We define the inverse operation to splitting, the merger operation, which, even though it does not physically affect the topology, allows mathematical simplification of the layout formulation:

S_{I,n} ⊗ S_{J,m} → S_{I′,n×m}, if #I′ = #I × #J
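To make the split-then-permute path above concrete, the sketch below lays out 256 4-field structures as an AoSoA with vector blocks of v = 4; the index arithmetic mirrors A_{64} ⊗ S_4 ⊗ A_4, and all names and sizes are assumptions for the example.

#include <stdio.h>

#define L 256   /* number of structures */
#define F 4     /* fields per structure */
#define V 4     /* SIMD vector length */

int main(void) {
    /* AoS: element (i, f) lives at i*F + f. */
    static float aos[L * F];
    for (int i = 0; i < L; i++)
        for (int f = 0; f < F; f++)
            aos[i*F + f] = (float)(i*F + f);

    /* AoSoA: A_{L/V} (x) S_F (x) A_V --
     * element (i, f) lives at (i/V)*(F*V) + f*V + (i%V). */
    static float aosoa[L * F];
    for (int i = 0; i < L; i++)
        for (int f = 0; f < F; f++)
            aosoa[(i/V)*(F*V) + f*V + (i%V)] = aos[i*F + f];

    /* Field 0 of structures 0..3 is now one contiguous vector. */
    for (int k = 0; k < V; k++) printf("%6.0f", aosoa[k]);
    printf("\n");   /* 0 4 8 12 */
    return 0;
}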

4.1.3 Compression

The initial layout detected may be composed of unused elements, which we may want to erase in order to improve locality. Structure compression is defined as follows:

S_{I,d} → S_{I′,d′}, #I′ = #I, d′ < d

where the elements of I′ are remapped so as not to leave unused space in the structure. Example:

A_4 ⊗ S_{{0},4} →(compression) A_4

[Figure: each row 1 0 0 0 marks the single used field in a 4-field structure; compression packs the four used elements into one contiguous row 1 1 1 1]

as an AoS with only one structure field used is inherently just one array. We define the inverse operation to compression, the expansion, aiming at performing padding, which is basically filling structures with blanks in order to make them fit better in the cache:

S_{I,d} → S_{I′,d′}, #I′ = #I, d′ > d

¹ https://software.intel.com/en-us/articles/memory-layout-transformations

4.2 Exploration

The transformation rewriting system of Table 4.1 is in itself not terminating. This is due to the fact that infinite sequences such as the following exist:

L →(permutation) L′ →(permutation) L →(permutation) L′ ...

This highlights the need for an algorithm to constrain transformation sequences. Moreover, while the number of different transformed layouts {L″} of L is finite, we can notice that:

• The number of permutations of a finite set of elements is finite

• The number of divisors of an integer is finite

• The number of zeros in a structure is finite

This set still remains too large for evaluation purposes: the number of permutations may be large, the number of splittings as well, and numerous equivalent or even useless transformations need to be pruned. In practice, the rewriting system of Table 4.1 is already constrained by the original layout topology, because that topology may not allow certain transformations. As an example, S_{{0,2,1,3},4} does not allow any transformation at all. Let us assume they are all applicable: this is the worst case scenario for the exploration, and the best from the user point of view, because it means there are a lot of restructuring possibilities.

4.2.1 Basic Constraints

It is necessary to narrow down the exploration space; therefore evicting useless transformations and equivalent transformations is important. Indeed, the consecutive application of transformations from Table 4.1 constitutes a path. In fact, there are potentially several ways to pass from a layout A to a layout B, as depicted in Figure 4.1. Different paths induce different code transformations and different element distributions. Algorithm 4 details how we reduce the exploration space to a reasonable set. We allow only one split operation for each structure/dimension of the layout, with fixed parameters. The split operation is not meant to be performed by itself, because the initial and resulting topologies are equivalent, as shown in the example below:

A_N ⊗ A_M → A_{NM}

Indeed, there is no topological difference between a 2-dimensional array layout and a unidimensional array layout: what makes the topology is the structure/array alternation. It is the permutations that allow actual topological reshaping. Now, it is necessary to limit the number of splits, because having a large number of array/dimension fragments is not beneficial, and a lot of combinations entail equivalent layouts, by merger simplification. As an example, let us consider the following transformation:


[Figure: starting from A_8 holding elements 0..7, a direct split gives A_2 ⊗ A_4 (rows 0 1 2 3 / 4 5 6 7), while a split into A_4 ⊗ A_2 followed by a permutation gives A_2 ⊗ A_4 with a different element ordering (rows 0 2 4 6 / 1 3 5 7)]

Figure 4.1: Several paths lead to the same topology, with a different element ordering. Element distribution is irrelevant in the context of our study, as we assume all the fields are hot (therefore swapping consecutive elements does not influence performance); longer paths mean more complex code transformations.


A_{LNM}
→(split) A_L ⊗ A_{NM}
→(split) A_L ⊗ A_N ⊗ A_M
→(permutation) A_L ⊗ A_M ⊗ A_N
→(merger) A_L ⊗ A_{MN}
→(merger) A_{LMN}

This shows a useless path induced by too many splits: at each step, the layouts are equivalent to one another in a topological sense. Moreover, splits create new dimensions, which increases the combinatorics. The split operation is mainly intended to enhance parallelism, by constructing an adequately sized array for SIMD or thread-level parallelism for instance; thus we judge it unnecessary to try an extensive set of possibilities. Compression is performed once per dimension, and we only consider the case where all zeros are removed, as we assume for now that more zeros mean less performance. We do not deal with the padding case. This decreases the number of paths and useless transformations significantly as well. Finally, that leaves the combinatorial analysis on permutations only, which depends on the number of dimensions of the layout.

Algorithm 4: Basic Layout Transformation Space Exploration
Data: L = S_n ⊗ ··· ⊗ S_1: initial multidimensional layout; S_λ: structures to split, with fixed parameters
Result: {L′}: set of transformed layouts from L
foreach term S_k in L do
    compression(S_k)
foreach term S_λ in L do
    split(S_λ)
{L′} ← {permutation(L)}

4.2.2 Locality Constraints

We have bounded the exploration with basic constraints; however, the set of potential transformed layouts remains too large, mostly because of the permutation combinatorics, and needs to be shortened. To do so, we introduce supplementary constraints to help build a reduced, relevant set of solutions. Locality is the key notion when it comes to data layout optimization. Although the perfect layout is not known in advance, due to the various sources of noise discussed earlier, we present here the intuition that can help us make educated guesses on which layouts may improve performance. Let us now elaborate on the three locality-related aspects at stake here:


• The notion of access order, s⃗_0

• The inter-iteration stride, s⃗_h

• The intra-iteration stride, s⃗_v

The objective is to minimize the triplet σ = (s⃗_0, s⃗_h, s⃗_v), i.e. a given transformed layout has a near-optimal locality property if σ → 1. This triplet allows us to rate any given layout in terms of locality; the idea is then to select an arbitrarily restricted number of the best-rated layouts to evaluate for performance.

Access Order

More than the topology itself, the order in which elements are accessed must be taken into account when designing an optimized layout. Indeed, consecutively accessed elements need to be in close proximity to ensure high performance. This order is explicitly given by the memory traces. As an example, let us consider the following trace:

for i1 = 0 to 999
    for i0 = 0 to 49999
        val 0x6c4560 + 4*i1 + 200000*i0
    endfor
endfor

which, after delinearization, entails:

(A_{1000}, i0) ⊗ (A_{50000}, i1)

Here the performance issue lies in the fact that the outer dimension is accessed first, given that i0 is the deepest iterator, which causes each element to be accessed with a penalty. An intuitive transformation is the matrix transpose:

(A_{50000}, i1) ⊗ (A_{1000}, i0)

where the deepest iterator i0 is assigned to the deepest dimension, meaning elements are now accessed in order. Consequently, our aim is to transform layouts in a way that preserves the natural access order, that is, to sort the dimensions by iterator number. Finally, let us make s⃗_0 explicit:

s_{0,j} = { 1                            if i_k ≤ j
          { #i_{j−1} × ··· × #i_1        else

Note that loop bounds can be parametric. Let us now consider a couple of examples. Example 1:

(A_{N3}, i3) ⊗ (A_{N2}, i2) ⊗ (A_{N1}, i1) ⊗ (A_{N0}, i0)

That makes:

s⃗_0 = (1, 1, 1, 1)

which represents a perfect access order. Example 2:

(A_{N0}, i0) ⊗ (A_{N1}, i1) ⊗ (A_{N3}, i3) ⊗ (A_{N2}, i2)

That makes:

s⃗_0 = (1, 1, N3 × N2, N3 × N2 × N1)

which implies the considered layout has a costly alteration of the access order, induced by large strides.

Intra-iteration Stride

This is the distance between the elements accessed within the same iteration. To optimize locality, this distance must be small.

s_{v,j} = { 1                            if j = 1 or A_{j−1} does not exist
          { #A_{j−1} × 2 / P(j−1)        else if j − 1 = 2 and innermost = S_c
          { #A_{j−1} / P(j−1)            otherwise

where S_c represents the complex-number structure and where P(dimension) expresses the level of parallelism on the given dimension: it is the scalar equal to the number of elements processed simultaneously. For instance, if on a given architecture the SIMD vector size is v, then P(1) = v, where 1 represents the innermost dimension, to which the SIMD instructions apply. Intuitively, if arrays inside structures are too large to be entirely processed in parallel, they will degrade locality, especially if they are significantly large. Such a non-optimal locality property is typically the case on CPUs, where vectors are small compared to the arrays of a SoA, and where consequently the intra-iteration strides are significant.

Inter-iteration Stride

This is the distance between two memory addresses accessed by a given instruction at iterations n and n + 1. If all addresses between the two are accessed in the meantime, there is no locality issue. Gaps in the layout tend to increase this number dramatically; therefore it is important to have an indicator of such an effect in the context of our transformation space exploration.

s_{h,j} = d_{j−1} − #I_{j−1}

s⃗_h allows us to measure locality breaches over a given layout.

4.2.3 Parallelism Constraints

Because of the complexity of architectures, more constraints are highly relevant to our space exploration, in particular with regard to parallelism. We detail basic performance requirements for data layout transformations in multithreaded and SIMD contexts.

Multithread

Locality constraints also permit avoiding a low-performing multithreaded work share. Indeed, it is crucial not to accidentally produce a layout inappropriate for thread-level parallelism, as the performance variation may be significant.

Figure 4.2: A given thread t accessing only a block of the layout (green elements). There is a penalty for passing from one line to the other, as locality is breached (red elements are unused)

Careless loop parallelisation may induce thread-blocking effects, for instance when the parallelisation is not applied on the highest-level dimension. Thread blocking constitutes a locality breach, as shown in Figure 4.2 on layout L. The thread affiliation of each memory access is unambiguously made available by the memory traces. The layout chunk L_T of thread t in the initial layout L is given as:

L_T = S_{{1},Y} ⊗ A_y ⊗ S_{{1},X} ⊗ A_x

where the dimensions of L are X × Y and the dimensions of L_T are x × y. While it is conceivable to perform customized transformations tailored for each thread and a given thread pool, we choose not to deal with thread-level layout parameterizability, and also to offer less painful transformations for the user, by viewing the initial layout as one chunk, which means disregarding multithread affiliation. Therefore, while we are able to report thread-blocking effects to the user and suggest revising the loop parallelisation, we assume it is not an inherent layout issue and do not make it a special case. However, for the sake of parallel distribution, we will prioritize layouts that exhibit a large outer dimension, that is, layouts of the form A_X ⊗ L where X is arbitrarily large.

SIMD

SIMD itself brings another constraint to our space exploration, because to be efficient it requires the transformed layout to be of the form L ⊗ A_v, where v is the vector size. However, the different layouts accessed within the kernel of interest may be treated separately in the context of vectorization. Indeed, to benefit the most from vectorization, layouts may need to be transformed accordingly,

to avoid using additional assembly instructions to reorder the values into the right registers and the right memory places. In fact, it is difficult to determine beforehand whether it would be advantageous to also transform the other layouts accordingly; therefore, it may be interesting to evaluate both approaches. Let us now expand on how we transform the other layouts L_o according to the transformation of the layout of interest L. First of all, if a layout L_o is read-only, then it is not useful to modify it: a couple of assembly instructions such as unpacks will deal with it conveniently. Otherwise, we have a few more restructuring options. Outer loop vectorization can be a powerful ally, in particular when the innermost loop iterations can be executed in parallel. However, operations on vectors accessed exclusively by the innermost loop may be tricky to perform with SIMD instructions; it may be smart to restructure them in order to effectively vectorize the kernel. Consider the following example:

L   = (A_L, i1) ⊗ (S_{dmc}, i0)
L_o = (A_{12}, i0)

As it stands, L_o cannot be directly manipulated with packed instructions. A possible transformation would be:

L′   = (A_{L/v}, i1) ⊗ (S_{dmc}, i0) ⊗ (A_v, i1)
L′_o = (A_{12}, i0) ⊗ (A_v, i1)

where the second array has been allocated v times more space. This is the classic computation-versus-memory compromise, which we propose to address with our profiling methodology. Disposing of more memory can be decisive for performance, especially if the kernel is significantly large, as such a secondary array can notably serve as a temporary buffer.

4.3 Case Study

We now detail the application of transformations on layouts from real-life codes. Their respective performance improvements are addressed in Chapter 6.

Lattice QCD Layout Transformation

In quantum physics, Lattice QCD is an approach to compute the strong interaction as modeled by quantum chromodynamics (QCD) in the standard model of physics, where quarks and gluons interact via a color charge, analogous to the electromagnetic charge. The lattice is the discretization of spacetime into a 4-dimensional grid. We study here codes expressed through Qiral [7], a Domain Specific Language (DSL) that eases the expression of parallelism in Lattice QCD simulations. We focus here on the main array U (see code in Figure 4.3) – the lattice – in the sense that it is large enough, intensely accessed, and a significant amount of large-stride issues are reported; therefore it constitutes an interesting candidate for data restructuring. However, since we study transformations on U and only U, the two other arrays this kernel operates on (CRNEAp, CRNEp), worth mentioning due to their size – both 1/8th of U – are left untouched. As a consequence, full kernel parallelism cannot be achieved. Nevertheless, significant performance gain opportunities remain, as shown by the results presented in Section 6.2. U is read-only, therefore a

layout transformation limited to the scope of the function would require a copy-in, but no copy-out. The copy-in is the operation that consists in copying the data from the old layout to the new one, while the copy-out is the inverse operation. In this case, the copy cost impact is minimized, since the copy-in can be performed at the actual array declaration. We study two different numerical schemes, generating different memory access patterns.

// double complex* U[L][d][m];

for (iL = 0; iL < L; iL++)
{
    xgemm(0,1,3,4,4,1,&CRNEp[sup(iL,dx)*12],li8,0,tmp);
    xgemmfast(0,0,3,4,3,1,U[uup(iL,dx)],tmp,0,ID2);
    ...
    xcopy(12, ID1, ID2);
    xscal(12, -kappa, ID1);
    xgemm(0,1,3,4,4,1,&CRNEp[sdn(iL,dx)*12],li2,0,tmp);
    xgemmfast(0,0,3,4,3,1,U[udn(iL,dx)],tmp,0,ID2);
    ...
    xcopy(12, ID15, ID2);
    xscal(12, -kappa, ID15);
    xgemm(0,1,3,4,4,1,&CRNEp[iL*12],li5,0,ID29);
    xcopy(12, &CRNEAp[iL*12], ID29);

    xaxpy(12,1,ID15,&CRNEAp[iL*12]);
    xaxpy(12,1,ID1,&CRNEAp[iL*12]);
}

Figure 4.3: Code excerpt from the Lattice QCD simulation. We study layout transformation on the main array U.

Qiral benchmark without preconditioning

According to the memory trace analysis, the accesses to U are given by A_L ⊗ S_{{1},d} ⊗ S_m ⊗ S_c, where L = 131072, d = 8, m = 9, c = 2. The innermost structure corresponds to complex double numbers. A quick glimpse at the general structure shape of U highlights two main performance issues. First of all, the size of the 3-dimensional inner multi-level structure implies significant (= d × m × c) strides, thus we can assume spatial locality may be improved by restructuring the layout. Second, the AoS-fashion general structure prevents efficient SIMDization; working on contiguity might enable automatic SIMDization, or at least allow efficient SIMDization. We can also notice that only 1 out of 8 elements is accessed in the superstructure. For reference, let us make σ explicit to characterize the non-optimality of the locality of U.

• s⃗_0 = 1, as there is only one array layout dimension and the loop is perfectly nested.

• s⃗_v = 1, as the only array layout is on the outer dimension.

• s⃗_h = (7, 1, 1), because of the gap on the last structure layout dimension.

This shows that locality, apart from the significant gap provoked by the not fully utilized structure layout S_{{1},d}, can be viewed as satisfactory. However, the inner multi-level structure that inhibits vectorization is an issue the transformation should definitely address.

Each transformation we propose removes the memory gaps formed by unused elements – in the context of the instance evaluated – therefore the resulting layouts are 1/8th the size of U. Consequently, s⃗_h = 1. We study four specific layout transformations, implementing different flavours of SoAs and AoSoAs: AoSoA-cplx and SoA-cplx, which preserve the complex number structure, id est they keep the real and imaginary parts of a given complex number contiguous, whereas AoSoA-dbl and SoA-dbl are layouts oblivious to complex numbers. We make sure they all guarantee perfect access order, i.e. s⃗_0 = 1, by enforcing the lexicographic order of iterators. Thus the only judge of locality left to decide between them is s⃗_v, which we make explicit for each transformation. The AoSoA-dbl transformation is given by:

A_L ⊗ S_{{1},d} ⊗ S_m ⊗ S_c
→(compression) A_L ⊗ S_m ⊗ S_c
→(merger) A_L ⊗ S_{m×c}
→(reshaping) A_{L/v} ⊗ A_v ⊗ S_{m×c}
→(permutation) A_{L/v} ⊗ S_{m×c} ⊗ A_v

which gives: s⃗_v = (v/P(v)) = (1). The AoSoA-cplx transformation is given by:

A_L ⊗ S_{{1},d} ⊗ S_m ⊗ S_c
→(compression) A_L ⊗ S_m ⊗ S_c
→(reshaping) A_{L/v} ⊗ A_v ⊗ S_m ⊗ S_c
→(permutation) A_{L/v} ⊗ S_m ⊗ A_v ⊗ S_c

which gives: s⃗_v = (1, v × c/P(v)) = (1, 2). The SoA-dbl transformation is given by:

A_L ⊗ S_{{1},d} ⊗ S_m ⊗ S_c
→(compression) A_L ⊗ S_m ⊗ S_c
→(merger) A_L ⊗ S_{m×c}
→(permutation) S_{m×c} ⊗ A_L

which gives: s⃗_v = (L/P(v)). The SoA-cplx transformation is given by:

A_L ⊗ S_{{1},d} ⊗ S_m ⊗ S_c
→(compression) A_L ⊗ S_m ⊗ S_c
→(permutation) S_m ⊗ A_L ⊗ S_c

which gives: s⃗_v = (L × 2/P(v)). To sum up, for CPU vectorization in particular, the degree of parallelism P(v) is typically small compared to L, which implies s⃗_v ≫ 1. Thus the AoSoA transformations are expected to perform better than the SoA ones on this kind of architecture.
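As an illustration of the AoSoA-dbl mapping, the sketch below copies the single used d-slot of each lattice site into the compressed A_{L/v} ⊗ S_{m×c} ⊗ A_v layout; the flattened indexing and the names (copy_in_aosoa_dbl, uidx) are assumptions for the example, not the generated code.

#include <stdlib.h>

/* Initial layout: A_L (x) S_{{1},d} (x) S_m (x) S_c, flattened;
 * only d-slot 1 of each site is accessed by the kernel. */
enum { L = 131072, D = 8, M = 9, C = 2, V = 4 };

/* uidx: offset of (site iL, element e) in the original layout,
 * e in [0, M*C[ covering the m x c complex-double components. */
static size_t uidx(size_t iL, size_t e) { return (iL*D + 1)*(M*C) + e; }

/* Copy-in to A_{L/v} (x) S_{m*c} (x) A_v: the v sites of a block
 * become the innermost, contiguous, SIMD-friendly dimension. */
static void copy_in_aosoa_dbl(double *newU, const double *U) {
    for (size_t iL = 0; iL < L; iL++)
        for (size_t e = 0; e < M*C; e++)
            newU[(iL/V)*(M*C*V) + e*V + (iL % V)] = U[uidx(iL, e)];
}

int main(void) {
    double *U    = calloc((size_t)L * D * M * C, sizeof *U);
    double *newU = calloc((size_t)L * M * C, sizeof *newU); /* 1/8th of U */
    if (!U || !newU) return 1;
    copy_in_aosoa_dbl(newU, U);
    free(U); free(newU);
    return 0;
}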

Qiral benchmark with even/odd preconditioning

The preconditioning considered iterates over one out of 2 values of the lattice, but in a checkerboard fashion: only the white positions of a 4-D checkerboard are visited. According to the memory trace analysis described before and applied to this application, the memory accesses to U are given by:

⊕_{x,y,z,t ≡ 0 [2]} ⊗_{k=x,y,z,t} (A_l ⊗ S_{k,2})

Here, what initially appeared to be a 1-dimensional array of L elements on the outermost dimension is in fact a 4-dimensional array, sparsely processed by its patterns, as only one element out of two is accessed. Indeed, its base addresses form a 4-dimensional checkerboard-style pattern, so half of the initial layout is used in this case. Again, because there is only one base on the d-elements dimension, 1/8th of this half is actually processed. Since 16⁴ = L/2, the access function can also be written as:

A_{L/2} ⊗ S_{I,2} ⊗ S_{{1},d} ⊗ S_m ⊗ S_c
→(compression) A_{L/2} ⊗ S_{{1},d} ⊗ S_m ⊗ S_c

Therefore, the proposed transformations for this benchmark are the same as the ones before.

Qiral application without preconditioning

According to the memory trace analysis described before and applied to Qiral, the access is given by:

A_L ⊗ S_d ⊗ S_m ⊗ S_c

which encounters similar difficulties as the benchmarks studied before, as the inner superstructure still forms a SIMDization-challenging gap. We study two transformations. The AoSoA-fashion transformation:

A_L ⊗ S_d ⊗ S_m ⊗ S_c
→(merger) A_L ⊗ S_d ⊗ S_{m×c}
→(reshaping) A_{L/v} ⊗ A_v ⊗ S_d ⊗ S_{m×c}
→(permutations) A_{L/v} ⊗ S_d ⊗ S_{m×c} ⊗ A_v

SoA-fashion transformation:

A_L ⊗ S_d ⊗ S_m ⊗ S_c
→(merger) A_L ⊗ S_d ⊗ S_{m×c}
→(permutations) S_{m×c} ⊗ S_d ⊗ A_L


for (Xstep = 1; Xstep < ...

Figure 4.4: Cardiac wave simulation [112]

2D cardiac wave propagation simulation application

We focus on an application [112] whose main kernel code is given in Figure 4.4. The kernel we chose to study uses two arrays, an input U and an input/output O; most of the accesses are performed in a stencil fashion on U, which is the array we examine here. According to the memory trace analysis described before and applied to this benchmark, the access is given by:

A_X ⊗ A_X ⊗ S_{{1},a} ⊗ S_{{1},s}

where a = 15, s = 2, and where the output matrix O is initially encapsulated in the original layout – at the 14th field of the a-element structure, while the rest of the data lies in the 0th field – therefore suffering the same stride issues. We restructure it as well into a distinct, similar array. The stencil layout actually hides an inner superstructure of size a × s, which implies a large stride and possibilities of performance improvement. The parameter X varies with the dataset, and its value is one of {256, 512, 1024} for the datasets we chose to study. The copy-in can be placed as early as the array initialization. We study one transformation:

A_X ⊗ A_X ⊗ S_{{1},a} ⊗ S_{{1},s}
→(compression) A_X ⊗ A_X ⊗ S_{{1},a}
→(compression) A_X ⊗ A_X
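A possible reading of this double compression as a copy-in is sketched below: only the single accessed (a, s) slot of each grid point is kept, yielding a dense X × X matrix. The slot indices and the names are assumptions for illustration.

#include <stdlib.h>

enum { X = 256, A = 15, S = 2 };

/* Copy-in for the double compression:
 * A_X (x) A_X (x) S_{{1},a} (x) S_{{1},s}  ->  A_X (x) A_X.
 * One (a, s) slot is accessed per grid point; here we assume
 * slot (1, 1), mirroring the S_{{1},a} (x) S_{{1},s} pattern. */
static void compress(float *Unew, const float *U) {
    for (int x = 0; x < X; x++)
        for (int y = 0; y < X; y++)
            Unew[x*X + y] = U[((x*X + y)*A + 1)*S + 1];
}

int main(void) {
    float *U    = calloc((size_t)X * X * A * S, sizeof *U);
    float *Unew = calloc((size_t)X * X, sizeof *Unew);
    if (!U || !Unew) return 1;
    compress(Unew, U);   /* the stencil kernel then runs on Unew */
    free(U); free(Unew);
    return 0;
}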

4.4 Conclusion

We covered the different existing layout transformations – some of them popular and already acknowledged for their benefits – explaining their uses and how they are expressed in our data layout formalism. The transformation space exploration is driven by two fundamental constraints. The first, locality, is meant to ensure all data is used in an efficient way. The second is parallelism: we guide the exploration so as to exhibit SIMD-friendly layouts, in order to obtain the most of the computers' performance. We also ensure that the multithreaded distribution is appropriately done, so that the locality property is not degraded and the layout is processed efficiently over multiple cores. For better understanding, we also detailed examples of transformations on case studies, using two real-life application codes, and we characterized their relevance in the context of the transformation space exploration. By nature, said formalism can be conveniently manipulated to express transformation paths, which encode the information of how the related user source code transformations should be written; this is the topic of the following chapter.

Chapter 5

Code Rewriting and User Feedback

5.1 Systematic Code Rewriting
    5.1.1 On locality
    5.1.2 Formalism Interpretation
    5.1.3 Copying
    5.1.4 Remapping
5.2 User Feedback
    5.2.1 Layout issues pinpointing
    5.2.2 Hinting the rewriting
5.3 Low-level implementation
    5.3.1 Loop kernel rewriting
    5.3.2 SIMDization
5.4 Conclusion

The formal abstraction introduced in the previous chapters allows us to rewrite the initial application code and seek high-performance improved versions. Considering a given formal transformation, two generations are applied separately. First, we propose pseudo-C code snippets of the proposed transformations, providing valuable feedback. Second, we create mock-up codes serving evaluation purposes via binary rewriting. Evaluation results complement the user feedback by quantifying each given transformation, providing a reliable gain estimation for the programmer to make an appropriate decision in terms of application code rewriting.

5.1 Systematic Code Rewriting

This chapter discusses the implementation of the restructuring strategies detailed in the previous chapters. Each of the selected layout transformations has to be generated for evaluation of its potential; but also, as it is automatically generated, it is possible to notify the programmer about the aspects of the transformation, meaning not only transmitting the main idea behind it but also the precise rewriting of individual memory instructions.

The formalism we defined in the previous chapters, which abstracts data structures, gives a complete overview of the stride issues, and can be translated to pseudo-C in a straightforward manner. As for the evaluation code versions themselves, they are implemented through mock-up codes, which are binary versions rewritten from the original user application binary and are not totally semantically equivalent to the initial code. The sole purpose of their existence is to predict the impact of given layout transformations on kernel performance. Preservation of semantics cannot be guaranteed, as the approach is based on memory traces, which reflect only the memory accesses performed on a given run, or more precisely, a given instance of a given function. However, our goal is to approximate the behavior and performance of the kernel as actually rewritten by the user, so we try to stay close to the initial version's semantics.

5.1.1 On locality

The code transformation, and the evaluation that comes with it, is local. Indeed, memory trace collection is costly in terms of elapsed time: it generally entails a 100× slowdown factor for MAQAO instrumentation for instance, which operates directly on the binary file and is the least intrusive method for probing memory accesses. That is why instrumentation is usually performed on a restricted scope, namely a given function of interest. Consequently, our visibility is constrained to the function scope only, as opposed to compiler visibility, which covers the entire program, minus the runtime information the memory traces contain.

Another difference with the traditional compiler transformation approach is that we perform code rewriting, as opposed to compiler code generation. The difference lies in the fact that the compiler disposes of the immaculate source code and a wide set of optimizations to apply, while in our case, the input of our framework is a binary file already featuring numerous compiler optimizations. Therefore, the degree of freedom is smaller, as we keep the initial binary code skeleton and rewrite inside it. Besides, the aforementioned compiler optimizations can constitute noise for our approach, as they render the binary less readable, in the sense that it may be harder to analyse; an example is unrolling, which multiplies instructions within the considered loop nest and complicates both the analysis and the generation, which prefer factorized code. Another example is SIMDization: compilers try to vectorize suboptimal layout accesses as best they can, sometimes performing partial loads and the like, requiring the analysis to first unvectorize the code before performing actual layout transformations, and possibly revectorize afterwards if applicable. As said before, our binary-based approach implies using code already optimized by a compiler. On the other hand, layout transformation is independent of control optimizations, and as we operate well after a mixture of compiler control optimizations, we are not concerned by the pass ordering issue, namely the fact that the best order of application of compiler passes is unknown in the general case. Therefore the proposed codes are better or worse than the initial code depending only on the layout proposed.

We distinguish three major aspects of code modification, as depicted in Figure 5.1. First of all, we need to retrieve the data from the original layout and map it onto the new layout: as our scope is only as wide as the function of interest, we cannot assume we have control over the original layout allocation, which will often happen earlier in the program. One way to cope with that is to copy the data into a newly allocated structure. This operation is named copy-in, while the reverse operation, which consists in storing the new data back into the original layout, is called copy-out. The latter permits getting back to the normal application functioning as we leave the function and consequently our scope of optimization. Second, the loop nest structure has to be


// Original hotspot loop nest
for (i0 = 0; i0 < n0; i0 += 1)
    ...

// (1) Copy of initial layout into optimized structure
copy_in(newA, A);

// (2) Loop nest rewriting
for (it0 = 0; it0 < ...

Figure 5.1: The three main steps of loop nest rewriting – copy of the initial layout, the new loop nest itself, the new mapping of each individual memory instruction.

rearranged according to the respective restructuring strategies contemplated, which may induce nest-breaking transformations such as loop permutations or loop fusions. Finally, each individual memory access instruction has to be modified in order to take the new memory mapping into account.

5.1.2 Formalism Interpretation

According to Chapter 3, we can write a generic n-dimensional layout as:

$$\begin{Bmatrix} A_{d_{n-1}} \\ S_{I_{n-1},\,d_{n-1}} \end{Bmatrix} \otimes \cdots \otimes \begin{Bmatrix} A_{d_0} \\ S_{I_0,\,d_0} \end{Bmatrix}$$

where each dimension is either an array $A$ or a structure $S$.

Even though, as formally defined, $S_d = A_d$, there is a difference between the two notations that is important to mention for code generation purposes. As given by the delinearization described in Chapter 3, $A$ functions actually represent memory chunks accessed wholly by any assembly instruction probed for memory traces – much like the array represented by an A in the traditional notation AoS – while $S$ functions usually represent small memory chunks where only one element is accessed per instruction, much like structure fields in the C programming language. This distinction allows us to make deductions from the formal description of the layout. The $A$ elements are consecutive and need to be processed by loop structures, thus each dimension written as an $A$ represents a loop level. Said otherwise, there are as many loops as there are $A$s needed to process the entire layout.

Example – Lattice QCD: Consider the following layout transformation:

$$A_L \otimes S_{\{2\},d} \otimes S_m \otimes S_c \xrightarrow{\text{AoSoA-dbl}} A_{L/v} \otimes S_{m \times c} \otimes A_v$$

We can deduce the corresponding loop nests:

    // initial layout: one A, hence one loop
    for (i0 = 0; i0 < L; i0 += 1)
      ...

    // transformed layout: two As, hence two loops
    for (i0 = 0; i0 < L/v; i0 += 1)
      for (i1 = 0; i1 < v; i1 += 1)
        ...

Here, the initial layout only has one $A$, which implies one loop in the code. However, the second layout has an additional $A$, a consequence of an AoS to AoSoA transformation. In terms of code, it means another loop needs to be written.

The formalism also hints at the loop bounds. Loop bounds are — outermost to innermost — the $\{d_{p-1}, \ldots, d_1\}$ as given in Section 5.1.2, where $p$ is the number of arrays $A$, and the set elements their respective sizes. Indeed, there is no loop on a given structure $S$, since only one element is accessed by the pattern; rather it becomes part of the strides. The loop increment on dimension $j$ is given by:

$$s_j = \prod_{k=0}^{j} d_k$$

which is the linearized stride on dimension $j \in [0, p-1]$.

Example – Lattice QCD: Consider the following layout transformation:

$$A_L \otimes S_{\{2\},d} \otimes S_m \otimes S_c \xrightarrow{\text{AoSoA-dbl}} A_{L/v} \otimes S_{m \times c} \otimes A_v$$

Therefore the memory accesses have the form given in Figure 5.2.

    // original loop nest                 // double newU[L/v][m*c][v];
    for (iL = 0; iL < L; iL += 1) {       for (i0 = 0; i0 < L/v; i0++) {
      ...                                   for (i1 = 0; i1 < m*c; i1++)
    }                                         ... newU[i0][i1][...] ...
                                          }

Figure 5.2: Memory access example on Lattice QCD layout transformation

Given here in C, the 3-dimensional newU therefore has 2 non-unit inter-dimensional strides (m × c × v and v) that are made explicit through the formalism.

Now that the intuition is given, the problem is the following: the loop nest as given by the formalism is different from the one in the initial code. There are several reasons behind this.

First of all, the loop nest in the trace is different from the corresponding loop nest in the code. Compilers may transform the loop nest as initially expressed in the source code; many optimizations (loop fusion, strip mining, useless loop removal, etc.) themselves provoke important modifications impacting the binary, and thus the memory traces. There are also indirections that may conceal some information from the compiler, such as a regular strided multi-dimensional access that can be rightfully interpreted as loops by the NLR algorithm that compresses the memory traces. Another point is that delinearization may result in a different layout structure from the one expressed in the source code. Indeed, it is not meant to match the source code representation; rather it permits obtaining a normal form. Moreover, the formalism does not comprise all the loops in the initial code fragment considered, just the ones the layout of interest depends on, as they are reported in the memory traces.

Second, the formalism we obtain after a number of different reshaping strategies implies a different intuitive loop nest from the one in the trace. An example is the AoS to AoSoA transformation, where another loop is injected into the original loop nest as given by the memory traces.
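To make the injected loop concrete, here is a minimal C sketch of what the AoS to AoSoA change looks like at the source level; the struct fields, the size L and the vector size V are hypothetical examples, not taken from the studied applications.

    enum { L = 1024, V = 2 };               /* V: vector size, e.g. 2 doubles with SSE2 */

    struct aos   { double re, im; };        /* AoS element */
    struct aosoa { double re[V], im[V]; };  /* AoSoA block: V elements per field */

    void sketch(struct aos *A, struct aosoa *B)
    {
        /* AoS: one A in the formalism, one loop in the code */
        for (long i = 0; i < L; i++)
            A[i].re += 1.0;

        /* AoSoA: an additional A, hence an injected innermost loop */
        for (long i = 0; i < L / V; i++)
            for (int k = 0; k < V; k++)
                B[i].re[k] += 1.0;
    }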


    for (iL = 0; iL < L; iL += 1) {
      for (j = 0; j < ...; j++)   // unrolled by the compiler; U does not depend on j
        ...
    }

Figure 5.3: The loop nest as it appears in the memory traces is different from the loop nest in the source code – here the loop on j has been unrolled by the compiler, and U does not depend on it, thus it does not appear in the trace

We explain in this chapter how the layout transformations are smoothly integrated into the original binary loop nest for code mock-up generation, and how loop nest restructuring can be hinted to the user. Multiple inner loops may have different expressions in the formalism. We will first assume they have the same expression, that is, there is one layout transformation to apply per layout.

5.1.3 Copying

As discussed before, because our scope is local, it is likely that the allocation of the layout of interest is unreachable, meaning that performing a copy to pass from the original suboptimal layout to the new layout is inevitable. The programmer however, as he has a global view of his source code, can decide to apply the copy earlier in the code, or even skip the copy by directly allocating the layout according to the transformation suggestions, if applicable.

Still, one way to perform the copy is to copy each trace one by one. However, this would result in a lot of instructions, possibly costly patterns and certainly poor performance. Copying time can be critical if the copy is close to the innermost kernel of interest, therefore it is important to have an efficient and fast copy code, and the formalism can help us do so. Moreover, the two arrays, the old one and the new one, cannot both be accessed in a contiguous manner, because otherwise their topology would be identical, which would be useless in our case. The key here is to guarantee that one of them is accessed contiguously, for performance.

Here again, the notion of iterator is central. Indeed, we need to be able to perform the mapping between the old and the new layout, and the iterators, as given by the memory traces, guarantee the order of access to the old layout. The splitting transformation in particular, unlike permutation and compression, induces the creation of new iterators out of the original ones, and it is important to be able to track back to the initial iterator to ensure a correct mapping between the old structure and the new structure.

Let $\overrightarrow{\times}$ be the non-commutative multiplication; the associated division operators $\overrightarrow{/}$ and $\overleftarrow{/}$ are defined by:

$$c = a \overrightarrow{\times} b \iff a = c \overrightarrow{/}\, b \iff b = c \overleftarrow{/}\, a \qquad (5.1)$$

We add the notion of iterators to the transformation rewriting system described in 4.1.

$$\begin{aligned}
(L, i_1) \otimes (L', i_0) &\to (L', i_0) \otimes (L, i_1) \\
(A_{nm},\, i_1' \overrightarrow{\times} i_0') &\to (A_n, i_1') \otimes (A_m, i_0') \\
(S_{I,n}, i_1) \otimes (S_{J,m}, i_0) &\to (S_{I',nm},\, i_1 \overrightarrow{\times} i_0), && \text{if } \#I' = \#I \times \#J \\
(S_{I,d}, i) &\to (S_{I',d'}, i), && \text{if } \#I = \#I',\; d' \leq d
\end{aligned} \qquad (5.2)$$

Because the new layout is gapless by default, we propose to perform the copy in a manner that allows contiguous access to this layout. The general expression of the copy can be written as:

$$new[i_n'] \ldots [i_1'] = old[e(i_m)] \ldots [e(i_1)]$$

where the $i'$ iterators correspond to the copy loop nest iterators, and where the $e(i)$s are expressions that depend on at least one of the $i'$s and that can be inferred from the results of the transformations described by (5.2), in order to have only expressions of $i'$s in the copy code. We express $e$ as a rewriting system:

$$\begin{aligned}
i_1' = i_0 \overrightarrow{/}\, i_0' &\to i_0 / \#i_0' \\
i_0' = i_0 \overleftarrow{/}\, i_1' &\to i_0 \,\%\, \#i_0' \\
i = i_1' \overrightarrow{\times} i_0' &\to i_1' \times \#i_0' + i_0'
\end{aligned} \qquad (5.3)$$
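As a sanity check of these rules, the split of an iterator into a quotient/remainder pair can be verified directly; the snippet below is a hypothetical illustration, with Z standing for $\#i_0'$.

    #include <assert.h>

    int main(void)
    {
        const long Z = 8;                     /* #i0', arbitrary example size */
        for (long i0 = 0; i0 < 10 * Z; i0++) {
            long i1p = i0 / Z;                /* i1' = i0 -/> i0'  (rule 5.3) */
            long i0p = i0 % Z;                /* i0' = i0 </- i1'             */
            assert(i0 == i1p * Z + i0p);      /* i0  = i1' -x> i0'            */
        }
        return 0;
    }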

If any iterator is not an interval, that is, a structure with unused fields, then the copy instruction model is duplicated and we use all the combinations of the scalars instead.

Example: Consider the following transformation:

$$(A_X, i_2) \otimes (S_W, i_1) \otimes (A_{YZ}, i_0)$$

$$\begin{aligned}
&\xrightarrow{\;\text{split}\;|\; i_0 = i_1' \overrightarrow{\times} i_0'\;} (A_X, i_2) \otimes (S_W, i_1) \otimes (A_Y, i_1') \otimes (A_Z, i_0') \\
&\xrightarrow{\;\text{permutation}\;} (A_X, i_2) \otimes (A_Y, i_1') \otimes (S_W, i_1) \otimes (A_Z, i_0') \\
&\xrightarrow{\;\text{merge}\;|\; i_2' = i_2 \overrightarrow{\times} i_1'\;} (A_{XY}, i_2') \otimes (S_W, i_1) \otimes (A_Z, i_0')
\end{aligned} \qquad (5.4)$$

Therefore the copy code has the form:

$$new[i_2'][i_1][i_0'] = old[e(i_2)][i_1][e(i_0)]$$

Let us now express each of the $e(i)$s in terms of the $i'$s:

$$\begin{aligned}
i_2' = i_2 \overrightarrow{\times} i_1' &\overset{(5.1)}{\iff} i_2 = i_2' \overrightarrow{/}\, i_1' \\
&\overset{(5.3)}{\iff} e(i_2) = i_2' / \#i_1' \\
&\iff e(i_2) = i_2' / Y \\[4pt]
i_1' = i_0 \overrightarrow{/}\, i_0' &\overset{(5.1)}{\iff} i_0 = i_1' \overrightarrow{\times} i_0' \\
&\overset{(5.1)}{\iff} i_0 = (i_1' = i_2' \overleftarrow{/}\, i_2) \overrightarrow{\times} i_0' \\
&\overset{(5.3)}{\iff} e(i_0) = (i_2' \,\%\, \#i_1') \times \#i_0' + i_0' \\
&\iff e(i_0) = (i_2' \,\%\, Y) \times Z + i_0'
\end{aligned}$$


Let i1 = {0, 1, 3}. Finally, the copy code is the following:

    for (ip2 = 0; ip2 < X*Y; ip2++)
      for (ip0 = 0; ip0 < Z; ip0++) {
        new[ip2][0][ip0] = old[ip2/Y][0][(ip2%Y)*Z + ip0];
        new[ip2][1][ip0] = old[ip2/Y][1][(ip2%Y)*Z + ip0];
        new[ip2][3][ip0] = old[ip2/Y][3][(ip2%Y)*Z + ip0];
      }
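Assembled into a self-contained C sketch, the generated copy could look as follows. The sizes X, Y, Z, W are example values; the framework duplicates the copy statement per field, which the sketch expresses with a small field table for brevity.

    enum { X = 4, Y = 3, Z = 8, W = 4 };   /* example sizes; field 2 of W is unused */

    void copy(double (*newA)[W][Z], double (*old)[W][Y * Z])
    {
        static const int fields[3] = { 0, 1, 3 };     /* i1 = {0, 1, 3} */
        for (long ip2 = 0; ip2 < X * Y; ip2++)
            for (long ip0 = 0; ip0 < Z; ip0++)
                for (int f = 0; f < 3; f++) {
                    int i1 = fields[f];
                    newA[ip2][i1][ip0] = old[ip2 / Y][i1][(ip2 % Y) * Z + ip0];
                }
    }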

5.1.4 Remapping

Each memory access in the original loop nest needs to be modified, as a new mapping is given after data restructuring. A solution would be to rewrite the entire loop nest in order to access the new layout efficiently; however, such deep code modifications would possibly require more static analysis to ensure the mock-up code is close enough to a user-rewritten version. We propose instead to access the new layout elements using a NEXT function that theoretically recalculates, at every iteration, the new position to reach; in practice such a function would in most cases be optimized away by the compiler through the use of induction variables. We want to express the new layout iterators $i'$ in terms of the original layout iterators $i$, and generate the expressions needed to properly calculate the new accesses:

$$new[e(i_n')] \ldots [e(i_1')]$$

This is done again thanks to (5.2). However, the $i$ iterators are given by the memory traces, and are thus different from the source code iterators. We need to inject them into the original loop nest in order to correctly access the new layout. Then, in order not to make the address calculation rely on costly / and %, we make a first pass that removes these operators by systematically adding a new iterator:

$$(i_k'/X,\; i_k'\,\%\,X) \to (i_k', i_k''), \qquad \#i_k'' = X \qquad (5.5)$$
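The effect of this pass can be sketched as follows in C; use() and the bounds are placeholders, the point being that the quotient/remainder pair carried by an iterator is replaced by two plain counters maintained by the loop.

    extern void use(long q, long r);   /* placeholder consumer */

    void before(long Y, long Z)
    {
        for (long i0 = 0; i0 < Y * Z; i0++)
            use(i0 / Z, i0 % Z);       /* costly / and % at every access */
    }

    void after(long Y, long Z)
    {
        long i0 = 0, j0 = 0;           /* i0 takes the role of i0/Z; #j0 = Z */
        for (long n = 0; n < Y * Z; n++) {
            use(i0, j0);
            if (++j0 == Z) { j0 = 0; i0++; }
        }
    }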

Then, for each combination of multi-level structure fields, we write a function, given in Figure 5.4, that computes the addresses to be accessed by any memory instruction on the layout of interest, independently of the original control, and that we insert prior to any of said memory instructions.

Example: Based on (5.4), all the memory instructions in the transformed loop nest have the form:

$$new[e(i_2')][i_1][e(i_0')]$$


    NEXT(r0)
    {
      if (jn > #in) return;                    // unreachable
      if (jn-1 > #in-1) { jn-1 = 0; jn++; }
      ...
      if (j1 > #i1) { j1 = 0; j2++; }
      r0 = base + e(in)   * (#in-1 * ... * #i1)
                + e(in-1) * (#in-2 * ... * #i1)
                ...
                + e(i1);
      j1++;
    }

Figure 5.4: NEXT function

We need to express the $e(i')$s in terms of the $i$s, since the loop iterators here are the $i$s:

$$\begin{aligned}
i_2' = i_2 \overrightarrow{\times} i_1' &\overset{(5.1)}{\iff} e(i_2') = i_2 \times \#i_1' + i_1' \\
&\iff e(i_2') = i_2 \times Y + (i_1' = i_0 \overrightarrow{/}\, i_0') \\
&\overset{(5.1)}{\iff} e(i_2') = i_2 \times Y + i_0 / \#i_0' \\
&\iff e(i_2') = i_2 \times Y + i_0 / Z \\[4pt]
i_0' = i_0 \overleftarrow{/}\, i_1' &\iff e(i_0') = i_0 \,\%\, \#i_0' \\
&\iff e(i_0') = i_0 \,\%\, Z
\end{aligned}$$

Therefore all the new memory accesses have the form:

$$new[i_2 \times Y + i_0/Z][i_1][i_0 \,\%\, Z]$$

Then, after the division-removal pass (5.5), it becomes:

$$new[i_2 \times Y + i_0][i_1][j_0]$$

We deduce the following NEXT function that calculates the individual memory accesses.

    NEXT(r0, i1)
    {
      if (i0 > Y) { i0 = 0; i2++; }
      if (j0 > Z) { j0 = 0; i0++; }
      r0 = base + (i2 * Y + i0) * (#i1 * Z)
                + i1 * Z
                + j0;
      j0++;
    }

where i1 is a structure field and depends on the memory instruction base address itself. Then, the compiler can use induction variables to remove the unnecessary ifs and duplications, and generate efficient code.
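Once the compiler has turned the NEXT bookkeeping into induction variables, the address computation essentially degenerates into nested loops with per-level pointer strides, roughly as in this hypothetical sketch (shapes and names are illustrative):

    void kernel(double *base, long X, long Y, long Z, long W)
    {
        /* new[X*Y][W][Z]: NEXT's if-chain becomes loop headers, and the e(i)
           products become per-level pointer strides kept in induction variables */
        for (long i2 = 0; i2 < X * Y; i2++) {
            double *plane = base + i2 * (W * Z);
            for (long i1 = 0; i1 < W; i1++) {
                double *field = plane + i1 * Z;
                for (long j0 = 0; j0 < Z; j0++)
                    field[j0] += 1.0;   /* contiguous, unit-stride innermost access */
            }
        }
    }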

5.2 User Feedback

One of our main contributions is the feedback we can provide, as to which layout transformations to apply and how to proceed. Now that we have explained how the layout remapping is done, we

can report to the user. Here we detail how automatic feedback is generated, thanks to both the formalism and the memory traces. This section exclusively deals with the remapping feedback problem, as the copy feedback subject has been treated previously and the quantified feedback is the next chapter's topic.

5.2.1 Layout issues pinpointing

The first feedback we must provide the user with is the actual layout issues occurring in his application. Gaps and SIMD-inhibiting structures can be reported to the user in a very concise and convenient way. Because the formalism itself is naturally not to be used raw for feedback, we propose to resort to the NumPy [1] notation to immediately translate the formalism into a more user-friendly version.

Example – Lattice QCD benchmark: Consider the following layout:

$$A_{131072} \otimes S_{\{2\},8} \otimes S_9 \otimes S_2$$

Let us use a C declaration to express the layout structure:

a[131072][8][9][2]

Now mixed with NumPy semantics to give a hint of the access patterns:

a[0:131072, '2:3', '0:9', '0:2']

where 2:3 actually means an access to element 2 only, while 3 represents the unreached upper bound. Here we used the quote notation, reserved for data field accesses, to hint at the presence of a structure on the dimension where it appears. This notation is originally meant to refer to a single data field name, but since we do not have any, we assume the names would be the field positions, potentially several – if not all – of them. Finally, we have translated all the potential layout issues held by the formalism representation into more conventional notations for better user understanding.

Furthermore, we can also report issues down to the individual memory instruction level. This is done by translating the memory traces into a C representation of the formalism notation, as shown in Figure 5.5. Here the user can notice that the innermost loop accesses the outermost dimension of the layout, which is bad for performance and SIMDization.

5.2.2 Hinting the rewriting

Having reported the layout issues to the programmer, we now want to propose possible solutions, as to which layout may improve the user application performance.

Example – Lattice QCD benchmark: Consider the following transformation:

$$A_L \otimes S_{\{2\},d} \otimes S_m \otimes S_c \xrightarrow{\text{AoSoA-dbl}} A_{L/v} \otimes S_{m \times c} \otimes A_v$$

Table 5.1 reports how the formalism is translated for the user, using C/NumPy notations, where v is a given SIMD vector size. A description of both the initial and the transformed layout is shown,


    for i1 = 0 to 131071                    //double A[131072][8][9][2];
      val 0x7f845baa2168 + 1152*i1          for (i0 = 0; i0 < 131072; i0++)
    endfor                                  {
    for i1 = 0 to 131071                      A[i0][2][0][0]
      val 0x7f845baa2170 + 1152*i1            A[i0][2][0][1]
    endfor                                    A[i0][2][1][0]
    for i1 = 0 to 131071                      A[i0][2][1][1]
      val 0x7f845baa2178 + 1152*i1            A[i0][2][2][0]
    endfor                                    ...
    ...                                       A[i0][2][8][1]
                                            }

Figure 5.5: Layout issues pinpointed to individual instruction level

Representation   | Initial Layout                                              | AoSoA-dbl Transformation
Formalism        | $A_{131072} \otimes S_{\{2\},8} \otimes S_9 \otimes S_2$    | $A_{131072/v} \otimes S_{9 \times 2} \otimes A_v$
C declaration    | A[131072][8][9][2]                                          | A[131072/v][9*2][v]
NumPy            | A[:, '2:3', ':', ':']                                       | A[:, ':', :]

Table 5.1: View of given layout transformation through C/NumPy

aiming at highlighting the global idea behind the proposed transformation, so the user gets the big picture of how his code should be transformed to comply with the given transformation. It is possible to go further, down to the individual instruction level, to suggest corrections in order to rewrite the memory accesses according to the hinted transformation. Thanks to the formalism and to the knowledge of the instruction base addresses given by the memory traces, it is possible to give a C representation of what the new accesses should resemble for a given layout transformation, in order to give a sense of how the accesses should be performed. Figure 5.6 shows how the initial accesses can be rewritten to map onto the new structure. The difficulty here lies in the fact that it might not be straightforward to pass from the initial loop nest to the new one; the loop design remains the user's responsibility.

5.3 Low-level implementation

We have detailed an approach to redesign the memory mapping. In this section we explain how the modifications are applied at a lower level in order to actually rewrite the user application binary code.

5.3.1 Loop kernel rewriting

The first step in rewriting the loop kernel of interest in the user application binary is to retrieve said kernel. We use MAQAO to disassemble the binary and to perform basic loop analysis, so as to retrieve the control flow graph and the dependence graph. This way we obtain the kernel assembly listing as well as its address in the binary, so it is possible to patch in a new modified kernel. The patch consists in editing the binary kernel: the idea is to replace the original kernel by NOP instructions – instructions with no effect other than padding the binary – so as not to compromise the hardcoded binary offsets in the code, and to add jump instructions to and from a safe area


    //double A[131072][8][9][2];            // double newU[131072/v][18][v];
    for (i0 = 0; i0 < 131072; i0++)         for (i0 = 0; i0 < 131072/v; i0++)
    {                                       {
      ...                                     for (i1 = 0; i1 < ...; i1++)
                                                ...

Figure 5.6: Loop nest rewriting

at the end of the binary file, where the new kernel code is added. We choose to encapsulate this new kernel in a function body, which has a few consequences. For instance, we ought to be careful with stack usage, which is perturbed by the insertion of a new function in place of the old hot spot. Instructions using offsets relative to the stack pointer therefore become wrong; a simple fix is to retrieve the pointed data prior to the function call itself and store it into newly allocated structures. Likewise, instructions using the IP register need to be treated carefully, and we may want to suggest the use of other registers instead.

In order not to manage the registers ourselves, we choose to resort to compiler register reallocation. This means the assembly code we rewrite is written as C inline assembly, and then recompiled before being patched onto a new binary, as a function call.

Every memory instruction that accesses the initial suboptimal layout has to be changed, so the data is accessed exclusively via the newly generated layout. In the case where the new structure is composed of more dimensions than the initial layout, code generation may or may not use additional integer registers; the decision is made by compiler register allocation. Furthermore, the new layout we allocate is well aligned, which makes the use of aligned instructions possible. As for alignment, all the initial hardcoded offsets are therefore removed from the memory instructions. Other instructions, such as arithmetic operations, are left untouched in a non-SIMD context.

As we assume fixed-stride instructions, indirections do not happen anymore. Indeed, in some application codes, patterns mapped by an index table are in fact regular patterns, but suffer a performance deficit due to the indirections. The difficulty is transparent in this specific case, as the new layout does not require an index table, and the indirection leftovers in the code are evicted through dead code elimination.
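The patching step itself can be sketched as follows; this is an illustrative C fragment under stated assumptions (a flat in-memory copy of the file, illustrative offsets), not the actual MAQAO patching code.

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical patcher sketch: redirect the old kernel to code appended at
       the end of the binary, padding the remainder with NOPs so that hardcoded
       offsets further down the file remain valid. */
    void patch_kernel(unsigned char *image, size_t kernel_off, size_t kernel_len,
                      size_t new_code_off)
    {
        /* x86 "jmp rel32": the displacement is relative to the next instruction */
        int32_t rel = (int32_t)(new_code_off - (kernel_off + 5));
        image[kernel_off] = 0xE9;
        memcpy(image + kernel_off + 1, &rel, sizeof rel);
        memset(image + kernel_off + 5, 0x90, kernel_len - 5);   /* 0x90 = NOP */
    }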

5.3.2 SIMDization

The vectorization analysis described in Chapter 2 allows us to determine whether vectorization is actually applicable to a given innermost loop kernel. If it is, then a lot of code modifications are needed to feature the new access patterns alongside the use of SIMD instructions and other SIMD-related adjustments.

We have to substitute the original hot loop with the new one, which is shorter by a factor equal to the architecture vector size. The initial number of iterations is retrieved from the memory traces,

therefore the mock-up loop is bound to the run captured. The increment on the new layout is modified to be vector-sized, and if pointers are updated within the loop before their first use, as revealed by static dependence analysis – what we name pre-incrementation – the original instruction has to be placed right before the loop to set the pointer correctly.

                                        add    $0x4,%rax
    loop: ...                     loop: ...
          add    $0x4,%rax              movapd (%rax),%xmm7
          movsd  (%rax),%xmm7           add    $0x10,%rax
          ...                           ...
          jmp    loop                   jmp    loop

Figure 5.7: Pre-incrementation fix – assuming vector size is 0x10 bytes

As shown in Figure 5.7, since the new iterator has a wider stride, pre-incrementation becomes incorrect, as the base address is shifted. Most of the instruction mnemonics need to be substituted with their SIMD equivalents. We use packed instructions, and exclusively scalar instructions, such as a movsd with a three-component address, have to be replaced by their equivalent SIMD instructions.

    movsd 0x40(%r10,%r14,1),%xmm7         movapd (%r10),%xmm7

Figure 5.8: Non-SIMD instructions substitution

Figure 5.8 shows an example of an instruction that needs to be replaced by an equivalent SIMD instruction. The particular instruction shown does not have a direct equivalent but can still be effectively substituted, as we use exactly one register for the new layout; it is then up to register reallocation to reassign registers if needed. One of the difficulties of binary code rewriting is that compiler optimizations may complexify the loop instructions. However, some compiler optimizations can be untangled; that is the case for partial loads, which are replaced by a single packed load operation.

    movsd  (%r10),%xmm6                   movapd (%r10),%xmm6
    movhpd 0x8(%r10),%xmm6

Figure 5.9: Partial load eviction

Figure 5.9 shows an example of a pair of partial loads that can be merged into one instruction, which can be advantageous if the two instructions are far apart. Reductions are spotted through SIMD analysis, thus we are able to tackle them using horizontal operations. We can also spot read-only arrays or constants and unpack them.
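As an illustration of such a horizontal operation, a sum reduction can be folded with SSE2 as sketched below. This is an illustrative intrinsics version (the framework itself rewrites at the binary level); SSE2 has no horizontal add, so the two lanes are folded with a shuffle and a scalar add at loop exit.

    #include <emmintrin.h>   /* SSE2 */

    double sum(const double *a, long n)           /* n assumed even */
    {
        __m128d acc = _mm_setzero_pd();
        for (long i = 0; i < n; i += 2)
            acc = _mm_add_pd(acc, _mm_loadu_pd(a + i));   /* vertical adds */
        __m128d hi = _mm_unpackhi_pd(acc, acc);   /* move the upper lane down */
        return _mm_cvtsd_f64(_mm_add_sd(acc, hi));/* horizontal fold */
    }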

5.4 Conclusion

We presented in this chapter how, based on the data layout formalism, we are able to rewrite the user application binary, and to notify the user with comprehensive feedback on how to apply the code modifications needed to implement the suggested transformations. The formalism allows us to limit the copy overhead required to pass from the initial suboptimal layout to a given transformed layout, and to remap each of the loop kernel accesses individually. We also gave details of the low-level implementation, where binary and assembly tricks are useful to process the input user

binary and produce an efficient output. The assessment of the transformations and of the transformed binary at the same time is presented in the next chapter.


Chapter 6

Transformations Evaluation

6.1 Evaluation methodology
    6.1.1 Principle of in-vivo evaluation
    6.1.2 Automatic mock-up generation and vectorization
    6.1.3 Current state of implementation
6.2 Experimental results
    6.2.1 TSVC
    6.2.2 Lattice QCD benchmark without preconditioning
    6.2.3 Lattice QCD benchmark with even/odd preconditioning
    6.2.4 Lattice QCD application without preconditioning
    6.2.5 2D cardiac wave propagation simulation application
6.3 Conclusion

To assess the performance of the proposed layout transformations, we introduce a novel technique: in-vivo evaluation. The idea is to directly measure the transformed kernel performance in the context of the user application, so that the prediction is reliable, as the numbers correspond to the real use case. We show how we accelerate this potentially costly evaluation with a methodology based on checkpoint/restart that only replays the kernels of interest within the user application. Then we present the final results of our framework on sets of benchmarks, as well as on two real-life multithreaded applications, namely a lattice QCD simulation and a cardiac wave propagation simulation. We show that the performance predictions are reliable enough for the user to make opportune data layout restructuring decisions in order to reach a significant performance gain.

6.1 Evaluation methodology

6.1.1 Principle of in-vivo evaluation

High accuracy of predictions can only be reached if the mock-ups behave closely to the user-restructured version. To do so, we perform dynamic evaluation in the context of the application, which we call in-vivo evaluation, thus tending to preserve the application execution conditions.

Evaluation of our different restructuring strategies resorts to a checkpoint/restart technique: the original binary is patched with a checkpoint function call, then run until the checkpoint, placed right before the hot function call for this matter. This produces a context file, which serves as a base context for instrumentation and for all mock-up evaluations. The binary code is instrumented in order to collect the memory trace and restarted from this context. Several layout transformations are applied to the initial code, generating as many versions, and for each of these versions the application is again restarted from the same context. Note that contrary to the common use of checkpoint/restart, here the code checkpointed is different from the codes restarted (with modified layout or instrumentation added). The context is also used for a reference timing, using a binary containing the initial kernel of interest, in order to deduce speedups from the mock-up timing measurements. The context file also ensures that the memory addresses collected by the instrumentation are always valid through re-runs. Indeed, the restructuring strategy implementations rely on the knowledge of memory areas, as we need to retrieve data from the original layout to be copied into our optimized layout. The one parameter our approach does not preserve is the cache state. Cache warm-up may be a solution, but goes beyond the scope of this document. Re-runs are stopped at the exit of the function of interest, and the function timing is deduced at this point.

For checkpoint/restart, we resort to the BLCR library [40]. Restarting a context file with binaries other than the original one is not a use case properly handled by the checkpoint/restart software, so we need to adapt. Being a snapshot of the original process, the context file features information such as memory data, binary instructions and library information, among others. When the restart utility is run, it requires the initial binary, as well as the libraries used, to be exactly the same as they were at checkpoint time. Now, as said before, code mock-ups are different binary versions that need to be restarted using the context of a real user application run. Therefore the original code section that is subject to modification has to be concealed, and one way to do so is to encapsulate the section in a new library, dynamically loaded after the restart point, in order to be absent from the context file and for the restart to be safely executed, even though the initial library is replaced by mock-up libraries, which incorporate many modifications. The aforementioned section is simply the function of interest, and it is called right after the restart point, as the initial application was stopped right before its call.

This approach can be generalized to evaluate multiple instances, id est the same callee but different callers and parameter values, and to evaluate multiple functions. It is important to note that instrumentation is performed only for a given set of functions, and never on the whole application, as that is overly costly in terms of both instrumentation time and memory usage. The estimation of the global impact of layout transformations can be obtained by adding the impacts on the different points of interest. One of the consequences of restarting at the functions of interest, and stopping at their respective exits, is the minimization of the total elapsed time of the whole set of evaluations.
Indeed, even though the hot kernels represent a large proportion of the overall application time, the rest of the application execution is not negligible, especially when it is re-run multiple times.
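The restart-side logic could resemble the following sketch; checkpoint_here() is a placeholder standing for the BLCR checkpoint request (whose API is not shown here), while the dlopen/dlsym part is standard POSIX, and the library and symbol names are hypothetical.

    #include <dlfcn.h>
    #include <stdio.h>
    #include <time.h>

    extern void checkpoint_here(void);   /* placeholder for the BLCR checkpoint request */

    double run_mockup(const char *lib, double *data, long n)
    {
        checkpoint_here();                        /* every version restarts from this point */

        /* loaded only after the restart, hence absent from the context file */
        void *h = dlopen(lib, RTLD_NOW);
        if (!h) { fprintf(stderr, "%s\n", dlerror()); return -1.0; }

        void (*kernel)(double *, long) =
            (void (*)(double *, long))dlsym(h, "hot_function");

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        kernel(data, n);                          /* timing stops at function exit */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        dlclose(h);
        return (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
    }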

6.1.2 Automatic mock-up generation and vectorization

Not all loop kernels are intrinsically SIMDizable, thus we first have to determine whether the loop kernels of interest are actually SIMDizable before effectively attempting SIMD optimizations. We rely on MAQAO, which is able to determine loop kernel SIMDizability [6], checking dependences in particular. SIMDizability of loops is not an absolute necessity in the context of mock-up generation,

as performance improvement is still achievable through locality optimization alone.

Mock-up libraries are copies of the reference hotspot library, with a deeply modified critical area. The modification has two aspects, relative to layout and to SIMDization. Copying the initial library skeleton requires a few binary operations. First, the old kernel must be substituted, and second, a function call to the mock-up kernel is added instead. Two other function insertions also have to be performed in the new binary, corresponding to the copy-in and copy-out to/from the new layout.

Performing the copy-in essentially consists in retrieving data from the old array, rearranged into a newly allocated array in order to improve performance through contiguous accesses. The mapping depends on the restructuring strategy considered and shall be reused in the copy-out function. Data retrieval is possible since memory trace analysis provides the array base address, or at least the lowest array address accessed during the function execution, which might differ since memory traces are profile-dependent. The new array size is defined by the memory trace analysis; consequently only the data accessed during a given run are collected here. It may not be the actual size of the initial array and may lack elements, although these are not used in this very computation.

For the sake of our evaluation, we need to find suitable spots to perform the copy-in and copy-out, which are proposed to the user. The difficulty is to optimize hot loop performance, id est pushing the copies away from the kernel to minimize their impact, while avoiding the cache pollution potentially induced by an inopportune insertion of the copy functions. Optimizing the copies placement for the whole function is difficult; rather, we focus on kernel optimization on a smaller scope. The idea is to move the copy up to the beginning of the function if applicable: the farther away from the innermost kernel, the better for its performance.

6.1.3 Current state of implementation

The framework we discussed is currently a prototype, written in Lua. The input user application can originally be written in C, C++, Fortran or any language, as long as there is a binary version for the framework to analyse. The framework is currently not fully automatic; there are still some tasks the user has to perform himself to use it properly. The hotspot function detection has to be done by the user, which is feasible with any profiling tool. Also, once said function is spotted, it has to be moved into an external dynamic library, so it can be swapped with the mock-up libraries – the libraries that each contain a different mock-up function – and re-run without disturbing the functioning of the checkpoint/restart software.

From then on, all the framework steps are automatic, except for the transformation selection, which has to be user-specified: the transformation space exploration is currently not implemented. Nevertheless, our framework scripts automatically process the user binary to add the actual checkpoint function call before the hot function call, as well as the dlopen system call that permits opening the dynamic mock-up libraries right after the restarts. Then the instrumentation, trace analysis, transformation implementations and the evaluation are automatic. Instrumentation is performed by MAQAO in our implementation, which is also the case for the disassembly step to retrieve the original loop kernel assembly code.

As for vectorization, our method to determine kernel intrinsic vectorizability is implemented in MAQAO, which tells us when to perform vectorization on our restructured kernels. Our kernel vectorization is a naive implementation and uses SSE2 instructions. It basically consists in substituting the original instructions by SIMD instructions, plus the few other operations described in Chapter 5.

Finally, the user feedback itself is not currently implemented.

6.2 Experimental results

The objective of this section is to show how relevant the speedup hints are, in the sense that they provide useful advice to the programmer as to which layout transformation best suits his needs. To do so, we compare our framework's speedup predictions with the performance of actual user C-level code restructuring, following the transformation hints automatically proposed. All the code mock-ups produced and analyzed in this section implement the transformations explained in Section 4.3. Each of them is vectorized by default and accompanied by another mock-up intended to serve as a reference non-SIMD version, in order to observe the sole effect of layout restructuring on performance.

We used an Intel(R) Xeon(R) CPU E5-2650, a 2.00GHz 2*8-core processor with SSE2 features, which implies 2-element vectors, as all the codes studied use double precision numbers. The rewritten C codes following our framework guidelines are compiled with icc 15.0.0 and gcc 5.3.1, both with the -O3 flag.

6.2.1 TSVC

We assess the accuracy of our mockup-driven predictions on a suite of benchmarks, TSVC [72]. TSVC consists of 151 functions intended to explore the typical difficulties a compiler can meet in the context of vectorization. Out of these 151 benchmarks, 31 match data layout access issues, our primary focus in this study. The others correspond to control issues, mostly already well handled by compilers. In Figure 6.1, we report the speedups obtained by mockup kernels and by manually restructured (correct) kernels over the basis kernel compiled with icc 13.0.1, on an Intel Sandy Bridge E5-2650 @2GHz. These kernels are a subset of the data layout issue category.

Non-contiguous strided accesses are an obstacle to vectorization: compilers may see the opportunity for vectorization but consider it not efficient enough to vectorize. This is the case for benchmarks s111 and s128, which perform accesses with stride 2. Their respective mockups show significant gains to expect from data restructuring. When 2-dimensional arrays are accessed column-wise (in C) or row-wise (in Fortran), accesses with large strides are performed, and one may resort to data transposition prior to massive computations in order to allow vectorization. The code mockup here predicts significant gains to expect from the restructured kernel, which are effectively perfectly reached.

One big challenge for compiler autovectorization is brought by rescheduling issues, that is, codes where the compiler sees vector dependences it cannot resolve, although such dependences could be fixed by permuting instructions or by loop peeling. The s241, s243, s211, s212, s1213, s244 and s1244 benchmarks all have rescheduling issues and are not vectorized by the compiler. Here, the dynamic dependence graph enables finding a correct schedule for SIMD code.

In some cases, a non-contiguous data pattern may not cause performance issues, as it is already well handled by the compiler and/or the architecture. Here, the benchmarks showing no speedup over the basis kernel (s1111, s131, s121, s151) correspond to alignment issues, which can be solved by unaligned accesses or vector permutations. On this architecture, unaligned accesses do not produce a significant performance overhead, therefore data restructuring will not bring better performance.

For all these measurements, the time to restructure data (using a copy) is not included. Indeed, the benchmarks are small functions and a copy is, with a few exceptions, not amortized.



Figure 6.1: TSVC Mockup Prediction on x86

Code                     | Initial Layout                                                                                                   | Transformed Layouts [short name]
Qiral excerpt            | $A_L \otimes S_{\{1\},d} \otimes S_m \otimes S_c$                                                                | $A_{L/v} \otimes S_{m \times c} \otimes A_v$ [AoSoA-dbl]
Qiral precond. excerpt   | $\bigoplus_{x,y,z,t \equiv 0[2]} \big(A_l \otimes \bigotimes_{k=x,y,z,t} S_{k,2}\big) \otimes S_{\{1\},d} \otimes S_m \otimes S_c$ | $A_{L/v} \otimes S_m \otimes A_v \otimes S_c$ [AoSoA-cplx]
                         |                                                                                                                  | $S_{m \times c} \otimes A_L$ [SoA-dbl]
                         |                                                                                                                  | $S_m \otimes A_L \otimes S_c$ [SoA-cplx]
Qiral application        | $A_L \otimes S_d \otimes S_m \otimes S_c$                                                                        | $A_{L/v} \otimes S_d \otimes S_{m \times c} \otimes A_v$ [AoSoA]
                         |                                                                                                                  | $S_d \otimes S_m \otimes A_L$ [SoA]
Cardiac Wave             | $A_X \otimes A_X \otimes S_{\{0\},a} \otimes S_{\{0\},s}$                                                        | $A_X \otimes A_X \ldots$

Table 6.1: Studied Transformations, obtained as described in Section 4.3

Performance of the mockup is in most cases close to that of the real transformed code. When there are differences, this is explained by the fact that the mockup results from a binary transformation, while the correct hand-tuned code results from a source-to-source transformation. Hence, the binary code resulting from hand-tuning the source code may not be exactly the same as the mockup code, due to compiler optimizations.

6.2.2 Lattice QCD benchmark without preconditioning

The copy-in can be inserted anywhere prior to the computations, as already established in Section 4.3. As for the kernel itself, the initial layout and the transformations applied are given in Table 6.1. The reference version operating on the initial layout is not automatically vectorized by either icc or gcc. The hand-rewritten versions are not autovectorized either, although the complex multiply optimization is applied by icc. The cplx versions use the C complex type, as originally written in the Qiral application, while the dbl versions bypass the complex multiply optimization with macros, so that the entire kernel is viewed by the compiler as using only double precision elements.



Figure 6.2: Lattice QCD Benchmark without Preconditioning (left), with Even/Odd Preconditioning (right) Speedup, single thread.

Figure 6.2 shows the performance prediction results obtained for each transformation, along with the comparison with the hand-optimized version. While the code mock-ups are transformed binary versions, the hand-optimized version remains in C, potentially taking advantage of further compiler optimizations (icc and gcc). The whole set of mock-up performances constitutes a part of the user feedback we propose.

All mock-ups predict a performance improvement for each of the four transformations presented, which is backed by the user-rewritten versions, with an average relative error of 16%. Globally, the mock-ups approach the actual rewritten kernel performance, except for AoSoA-cplx, where the performance prediction is somewhat pessimistic, although reasonably reliable. The gap between prediction and reality in this specific case is in fact due to the icc complex multiply optimization, which indisputably outperforms the mock-up's overly naive SIMDization. The mock-up basically uses partial loads to vectorize over the complex type, while icc uses vector operations to perform a fast complex multiply. On the other hand, gcc performance collapses when it comes to complex computations: its compilation entailed neither SIMDization nor complex multiply optimization. Moreover, the loop parallelization proposed by the mock-ups takes over when it comes to the SoA transformations, as it slightly outperforms the rewritten versions. We can also notice that the icc compiler's work on complex numbers in the SoA-cplx version pushes it towards the mock-up prediction, with respect to the SoA-dbl version, while the general SoA loss of spatial locality leads to lower performance here on a traditional CPU.

6.2.3 Lattice QCD benchmark with even/odd preconditioning

We now propose to assess mock-up prediction on another benchmark, whose initial layout is defined in Section 4.3. The data patterns are beyond compiler understanding here, and once again the initial kernel is not vectorized. Even the properly restructured kernels are not vectorized despite apparent feasibility, except for the icc complex multiply optimization.

Overall, the predictions are reliable, with a 13% average relative error, as shown in Figure 6.2. Aside from the excellent predictions on both the AoSoA-dbl and SoA-dbl restructurings – roughly 0% average relative error on both – the slightly pessimistic AoSoA-cplx and SoA-cplx illustrate here again the gap between the mock-ups' naive SIMDization and the compiler's complex multiply optimization. This benchmark and the previous one both typically illustrate how binary-to-binary transformations cannot compete with full compiler optimizing power; however, the gap is not big enough to mislead the user, who is provided a realistic gain to expect.



Figure 6.3: Lattice QCD Application Restructuring+SIMDization Speedup — with respect to the reference using an equal number of threads — average relative error is roughly 4%

6.2.4 Lattice QCD application without preconditioning

We focus on the Qiral application, whose initial layout definition is given in Section 4.3. The stride-forming and SIMDization-hindering structures are processed entirely, and the whole layout is used for the computations. We study two different transformations, as defined in the aforementioned section, which are basically one SoA and one AoSoA, really close to the fundamental ones, as the complex structure is split; thus the innermost dimension is contiguous, unlike in the AoSoA-cplx and SoA-cplx transformations we addressed before.

The initial conditions remain close to the benchmarks as far as compilation is concerned, since no SIMDization occurred, whether for the initial layout or the restructured ones. However, we resort to intrinsics to vectorize the kernel of interest, which may be an interesting solution for the programmer to overcome compiler SIMDization failure, and possibly allows improving a still suboptimal code/layout, which is precisely what the proposed mock-ups are meant to highlight. Indeed, if the programmer's modified kernel performance is clearly below the predicted gain and the kernel has not been SIMDized, the predicted gain should be achievable with intrinsics.

The predictions for SoA and AoSoA are reliable, with an average relative error of 4%, as shown in Figure 6.3, as the mock-up SIMDization tends closely to the user-restructured kernel with intrinsics. With a packed thread policy and hyper-threading disabled, the multithreaded context does not disrupt the mock-up prediction, since here the kernel is parallel and compute-bound.

Another takeaway is that AoSoA indeed outperforms the SoA transformations; the experiments made corroborate the relevance of the locality characterization made by our transformation space exploration, meaning we have an acceptable sense of which transformation would perform well.

6.2.5 2D cardiac wave propagation simulation application

We focus on a layout and a set of transformations defined in Section 4.3. For the sake of these experiments, we rewrote the reference application code so that the original array is allocated in a linearized manner, in order to provide a fair reference for performance evaluation. The application kernel of interest was not vectorized, but was successfully vectorized after data layout restructuring; consequently no intrinsics are used in the rewritten user codes. The thread placement policy is packed with no hyper-threading allowed, and the mock-up copy-in is multithreaded as well. We study the impact of layout restructuring on performance on three different datasets, corresponding to three different layout sizes. The initial layout and the applied transformations are described in Section 4.3.

As one could intuitively expect, a significant gain occurs after data restructuring, as high as roughly 2.4× on average on Dataset-256, as shown in Figure 6.4, and almost constant independently of the number of threads involved. Furthermore, the gain of SIMDization alone is roughly 2×, which is substantial using 2-element vectors, even though performance decreases slightly when two NUMA nodes have to be allocated when using more than 16 threads. The mock-up prediction average relative error is 9%, too optimistic, in this experiment. This overoptimism can be explained by the effect of overly hot data on the kernel performance: in the mock-up implementation, the copy is performed right before the kernel, which may constitute an unnatural warm-up in the context of the application.

The second kernel we study here depends on Dataset-512, whose layout size is augmented by a factor 4 with respect to Dataset-256. In this new configuration, the restructuring gain is dramatically higher than before, as shown in Figure 6.4, increasing with the number of threads and reaching up to roughly 14× with 8 threads, as the application manages to take full advantage of all the private L2 caches. Moreover, the prediction remains consistently slightly overoptimistic, as the memory cache may be warmer before kernel execution than in the actual application, while still being accurate, with an average relative error as low as 5%.

The Dataset-1024 layout is 4 times the size of Dataset-512, and Figure 6.5 shows the application reaches peak performance with 8 threads, similarly to Dataset-512, at a total speedup of roughly 28×. Prediction accuracy is reliable, with an average relative error of roughly 10%.

Finally, the mock-ups mimic the actual application tendencies well, independently of the multithreading and dataset constraints. This highlights the crucial importance of data structure design, showing how advantageous data restructuring can be for the programmer. In this very case, data layout restructuring is sufficient to enable autovectorization with an excellent gain – up to 28× – also saving the user the time of hand vectorization via intrinsics, for instance. In the end, from the programmer's point of view, after the code restructuring is applied to exhibit a proper stencil, state-of-the-art stencil optimization techniques can be contemplated for further performance improvement.

6.3 Conclusion

We have presented a methodology to evaluate data layout transformations, the in-vivo evaluation, which permits directly assessing the gain, or lack of gain, of the given transformations in the context of the user application. We implemented this method at the user application binary level. We explained how, using a checkpoint/restart technique, we minimize the cost of such an approach, as only the kernel of interest is replayed.

The performance prediction of multiple transformations matches within 5% the performance of hand-transformed layout code, which shows our evaluation technique is a reliable tool for the user to select an appropriate new layout design out of a set of quantified transformations.



Figure 6.4: 2D Wave Propagation Application Restructuring+SIMDization Speedup on Dataset-256 (left) and Dataset-512 (right) — with respect to the reference using an equal number of threads — average relative error is 9% (left), 5% (right)


Figure 6.5: 2D Wave Propagation Application Restructuring+SIMDization Speedup on Dataset-1024 — with respect to the reference using an equal number of threads — average relative error is 10%


Conclusion and Future Challenges

Memory accesses are costly on modern architectures, as processor performance significantly exceeds memory performance. Applications are slowed down by suboptimal memory access patterns; therefore specific attention to data layout design is required when writing high performance code. A given layout that performs well on a given architecture is not guaranteed to perform similarly well on another architecture. Automated programs are needed to tackle this issue of appropriate layout design choice. However, compilers typically have tremendous difficulties when addressing the layout restructuring issue, particularly because of the lack of information at compile time on data structures, pointers, indirections and the like. Tools, on the other side, help programmers locate performance issues in their applications, but do not provide performance optimization guidelines with different transformations and their potential gain; as a result the user is left clueless about which layout restructuring to perform. In turn, both compiler and tool feedback are insufficient. We proposed a framework for data restructuring that provides the programmer with elaborate feedback: a selection of transformations with detailed steps to apply them to the source code, along with a quantification of each transformation, in order to help the user make a restructuring choice based on the expected gain.

6.4 Summary

Tackling the issue of layout restructuring involves several steps, namely the analysis of problematic layouts and patterns, then the transformation space exploration, the resulting code rewriting and user feedback generation, and finally the evaluation of the transformations. First, we detailed a novel formalism to express data layouts and patterns, in order to simplify the discussions on data layout transformations and to express transformations systematically. We explained how we retrieve the initial layout/pattern definition based on memory traces collected from the instrumented binary of the user application, and how we can delinearize the obtained layout in order to get a multidimensional representation in a normal form, independent of any compiler optimizations. Using our formalism, we described a set of transformations to reshape the initial layout, by permuting layout dimensions or removing gaps for instance. In particular, common transformations such as AoS to SoA and AoS to AoSoA were conveniently expressed. We then detailed our transformation space exploration, applying constraints to reduce the number of possibilities, as the set of selected transformations has to be evaluated in-vivo, which is costly in terms of elapsed time. We also constrain the exploration space to preferentially select transformations suitable for SIMDization and multithreading, as significant performance improvements can be obtained through these different levels of parallelism; in particular, the layout has to be properly designed

so as not to be badly divided among threads, which would cause additional performance issues.

Transformed layouts then have to be implemented through binary rewriting, where several major code modifications have to be applied. First of all, the initial layout must be copied into the new layout. The copy code is automatically generated and evaluated, and can be proposed to the user as such for comprehension purposes. Also, all the individual memory accesses have to be remapped to properly use the new layout. We then detailed how we generate automatic user feedback to clearly notify the user of the performance issues in his original application, and of how to modify the code to incorporate the suggested modifications with respect to each transformation proposed.

In the last chapter, we detailed our evaluation methodology, relying on a checkpoint/restart technique, in order to quantify each transformation by a potential gain estimation. The idea of the in-vivo approach is to implement each transformation in a separate binary and replay the kernel of interest with each layout version in the context of the application for performance assessment. To do so, a checkpoint call is placed right before the kernel of interest and the application is run for reference timing. Then the user application binary is instrumented using probes for each memory instruction, and the kernel is re-run to obtain memory traces. After analysis, transformation and binary rewriting, each binary version substitutes the original kernel and is evaluated. This approach has led to reliable predictions, below 5% average relative error, on two real-life applications, implementations of Lattice QCD and cardiac wave simulations respectively, with different preconditioners and different datasets, showing the quantified hints are trustworthy indicators for the programmer to make layout restructuring decisions.

6.5 Perspectives

As proposed, the feedback of our data layout restructuring approach allows the programmer to select an appropriate transformation with a potentially high performance gain, saving him the time of performing the code modifications by hand for each of the proposed transformations only to determine afterwards which is best. However, in the end, following our framework's suggestions, the user still has to perform one transformation by hand, even though he is well guided in the process. Indeed, our approach being based on memory traces, it is consequently tailored to a given run, and therefore the optimizations cannot be safely performed automatically. The user can still supervise the output code in order to make adjustments if needed to make the new code valid. User supervision is needed in all cases, because neither compilers nor tools nor frameworks have full visibility on the code, and especially on the input parameters or datasets, which can greatly influence the resulting appropriate layout designs. In the end, the application developer is the only one who has full knowledge of the data layouts. An idea would be to inject more semantics into the source code to specify a class of layouts susceptible to be used. This would allow a more general approach to layout transformations. An alternative would be to run several instrumentations to test different datasets/preconditioners or, more generally, runtime-dependent parameters.

102 6. Transformations Evaluation further, indeed sometimes indirections are used in the code but the corresponding memory access patterns are in fact regular, so one may think of a way to specify such patterns unambiguously to the compiler via pragmas for instance, to guarantee regular behavior through the indirection instructions. Our in-vivo approach uses code instrumentation for memory traces and therefore is usually applied on a small portion of the code, usually the kernel of interest which is typically the most time consuming kernel. This constrains us to perfom a copy, which implies that the original sub- optimal layout has to be copied in a new hopefully optimized layout. The placement of the copy itself is a non trivial question. Ideally, for the user, the best case scenario would be to restructure the layout from the very definition/allocation of it, this way no overhead could be accounted dur- ing the run of the program. Alas, for our evaluation purposes in particular it is not doable, we must find a suitable spot within the kernel of interest. Theoretically, the farther from the innermost loops using the layout the better since it is where its influence is the lowest on kernel performance. However, cache effects being extremely difficult to model and therefore to predict, nothing guar- antees the copy will not have a negative impact on performance, especially on the other kernels. Inversely, performing the copy close to the innermost may have benefits since the copy brings the layout elements into the cache, and thus they may already be available for use in the innermost kernels. Besides, our approach focuses on one hotspot, usually the function which monopolizes the ex- ecution time. In some cases however, execution time is mainly shared between multiple hotspots. In this context, the local aspect of our approach is challenged, as the evaluation of the global im- pact of the respective hotspot is difficult. Indeed, applying a given transformation on a given hotspot may be profitable locally but nothing guarantees in theory the other hotspots would ben- efit from it or at least not be impaired and that the benefit would be global. Therefore, the space exploration of transformations of multiple hotspot accordingly is challenging. Because the cost of instrumentation for memory traces is high especially when multiple hotspots are involved, it may be useful to determine beforehand if the hotspots perform computations on the same data lay- outs. In the case where the hotspots does not use the same memory areas, transformations can be thought independently. A lighter alternative to memory traces is the page traces, where instead of tracing each of the individual memory access, we detect page faults and note which pages are ac- cessed by which hostpots, as it has been done in [16]. It allows to pre-emptively select candidates before tracing all memory accesses. Similarly on a multithreaded context, we can also characterize which memory areas are accessed by which thread, and identify if given layouts are accessed by multiple threads. Still, the global evaluation is challenging, in particular when multiple instances of a given hotspot are performed, for instance with different parameters, or different layouts. The operations we defined are simple sums and cartesian products of layouts. One may con- template more elaborate operations in some specific domains such as tensor operations for QCD or deep learning. 
Common patterns such as convolutions may be analysed and expressed with an enriched formalism, with specific transformations applied according to the new operators. For this matter, layout duplication may be considered: it might help alleviate some dependences in the kernels, and it might also permit better pattern alignment and better vectorization, as stencils for instance use neighbor elements at each step of the computation, which causes alignment issues. There is a performance compromise here with the supplementary memory space allocated, which must be evaluated. Similarly, diagonal patterns or triangular matrices and their operations may also be formalised to enrich our transformation power.
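As an illustration of this trade-off, the C sketch below duplicates the input of a one-dimensional three-point stencil into shifted copies, so that every operand is read as an independent unit-stride stream; the buffer sizes and the duplication scheme are assumptions made for the example.

    #include <string.h>

    /* a holds n + 2 elements; left and right are caller-provided buffers
     * of n elements each, receiving shifted duplicates of a. The memory
     * footprint grows, but each operand of the stencil is now read at
     * the same offset i, easing alignment for vectorization. */
    void stencil3(const double *a, double *out, size_t n,
                  double *left, double *right)
    {
        memcpy(left,  a,     n * sizeof *a);   /* left[i]  = a[i]     */
        memcpy(right, a + 2, n * sizeof *a);   /* right[i] = a[i + 2] */

        for (size_t i = 0; i < n; i++)
            out[i] = left[i] + a[i + 1] + right[i];
    }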


Bibliography

[1] NumPy, library for scientific computing in Python. https://github.com/numpy/numpy.

[2] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.

[3] Chadi Akel, Yuriy Kashnikov, Pablo de Oliveira Castro, and William Jalby. Is source-code isolation viable for performance characterization? In Intl. Workshop on Parallel Software Tools and Tool Infrastructures, 2013.

[4] R. Allen and K. Kennedy. Optimizing Compilers for Modern Architectures: A Dependence-Based Approach. Morgan Kaufmann, 2002.

[5] AMD. Core Math Library (ACML). URL http://developer.amd.com/acml.jsp, page 25, 2012.

[6] Olivier Aumage, Denis Barthou, Christopher Haine, and Tamara Meunier. Detecting SIMDization opportunities through static/dynamic dependence analysis. In Workshop on Productivity and Performance (PROPER), 2013.

[7] Denis Barthou, Gilbert Grosdidier, Michael Kruse, Olivier Pene, and Claude Tadonki. QIRAL: A High Level Language for Lattice QCD Code Generation. In Programming Language Approaches to Concurrency and Communication-centric Software Workshop, Tallinn, Estonia, 2012. arXiv:1208.4035.

[8] Denis Barthou, Andres Charif Rubial, William Jalby, Souad Koliai, and Cédric Valensi. Performance tuning of x86 OpenMP codes with MAQAO. In Tools for High Performance Computing. Springer Berlin Heidelberg, 2010.

[9] Nathan Bell and Jared Hoberock. Thrust: A productivity-oriented library for CUDA. GPU Computing Gems Jade Edition, 2:359–371, 2011.

[10] Josh Berdine, Cristiano Calcagno, Byron Cook, Dino Distefano, Peter W O'Hearn, Thomas Wies, and Hongseok Yang. Shape analysis for composite data structures. In International Conference on Computer Aided Verification, pages 178–192. Springer, 2007.

[11] François Bodin, Toru Kisuki, Peter Knijnenburg, Mike O'Boyle, and Erven Rohou. Iterative compilation in a non-linear optimisation space. In Workshop on Profile and Feedback-Directed Compilation, 1998.

[12] Uday Bondhugula, Albert Hartono, J. Ramanujam, and P. Sadayappan. A practical automatic polyhedral parallelizer and locality optimizer. In ACM SIGPLAN Conf. on Prog. Lang. Design and Implementation, 2008.

[13] Pierre Boulet. Array-OL revisited, multidimensional intensive signal processing specification. PhD thesis, INRIA, 2007.

[14] D. Callahan, J. Dongarra, and D. Levine. Vectorizing compilers: a test suite and results. In Conf. on Supercomputing, 1988.

[15] Steve Carr, Kathryn S McKinley, and Chau-Wen Tseng. Compiler optimizations for improving data locality, volume 29. ACM, 1994.

[16] Pablo De Oliveira Castro, Chadi Akel, Eric Petit, Mihail Popov, and William Jalby. CERE: LLVM-based codelet extractor and replayer for piecewise benchmarking and optimization. ACM Transactions on Architecture and Code Optimization (TACO), 12(1):6, 2015.

[17] Shuai Che, Jiayuan Meng, and Kevin Skadron. Dymaxion++: A directive-based API to optimize data layout and memory mapping for heterogeneous systems. In Parallel & Distributed Processing Symposium Workshops (IPDPSW), 2014 IEEE International, pages 916–924. IEEE, 2014.

[18] Trishul M Chilimbi, Bob Davidson, and James R Larus. Cache-conscious structure definition. In ACM SIGPLAN Notices, volume 34, pages 13–24. ACM, 1999.

[19] Trishul M Chilimbi, Mark D Hill, and James R Larus. Cache-conscious structure layout. In ACM SIGPLAN Notices, volume 34, pages 1–12. ACM, 1999.

[20] Doosan Cho, Sudeep Pasricha, Ilya Issenin, Nikil Dutt, Yunheung Paek, and SunJun Ko. Compiler driven data layout optimization for regular/irregular array access patterns. In ACM Sigplan Notices, volume 43, pages 41–50. ACM, 2008.

[21] Michał Cierniak and Wei Li. Unifying data and control transformations for distributed shared-memory machines, volume 30. ACM, 1995.

[22] Ronan Collobert, Koray Kavukcuoglu, and Clément Farabet. Torch7: A Matlab-like environment for machine learning. In BigLearn, NIPS Workshop, number EPFL-CONF-192376, 2011.

[23] Ron Cytron, Jeanne Ferrante, Barry K. Rosen, Mark N. Wegman, and F. Kenneth Zadeck. Efficiently computing static single assignment form and the control dependence graph. ACM Transactions on Prog. Lang. and Systems, 1991.

[24] Chen Ding and Ken Kennedy. Improving cache performance in dynamic applications through data and computation reorganization at run time. In ACM SIGPLAN Notices, volume 34, pages 229–241. ACM, 1999.

[25] Chen Ding and Ken Kennedy. Inter-array data regrouping. In Proceedings of the 12th International Workshop on Languages and Compilers for Parallel Computing, LCPC '99, pages 149–163, London, UK, 2000. Springer-Verlag.

[26] Wei Ding and Mahmut Kandemir. Improving last level cache locality by integrating loop and data transformations. In Computer-Aided Design (ICCAD), 2012 IEEE/ACM International Conference on, pages 65–72. IEEE, 2012.

[27] H.C. Edwards and C.R. Trott. Kokkos: Enabling performance portability across manycore architectures. In Extreme Scaling Workshop (XSW), 2013, pages 18–24, Aug 2013.

[28] Alexandre E. Eichenberger, Peng Wu, and Kevin O'Brien. Vectorization for SIMD architectures with alignment constraints. In ACM SIGPLAN Conf. on Prog. Lang. Design and Implementation, 2004.

[29] Pierre Estérie, Joel Falcou, Mathias Gaunard, and Jean-Thierry Lapresté. Boost.SIMD: generic programming for portable SIMDization. In Proceedings of the 2014 Workshop on Programming Models for SIMD/Vector Processing, pages 1–8. ACM, 2014.

[30] Eunjung Park, Louis-Noël Pouchet, John Cavazos, Albert Cohen, and P. Sadayappan. Predictive modeling in a polyhedral optimization space. In ACM/IEEE Intl. Conf. on Code Generation and Optimization, 2011.

[31] G. Carl Evans, Seth Abraham, Bob Kuhn, and David A. Padua. Vector Seeker: A tool for finding vector potential. In Workshop on Prog. Models for SIMD/Vector Processing, pages 41–48, New York, NY, USA, 2014. ACM.

[32] Timothée Ewart, Fabien Delalondre, and Felix Schürmann. Cyme: A library maximizing SIMD computation on user-defined containers. In Julian Martin Kunkel, Thomas Ludwig, and Hans Werner Meuer, editors, Supercomputing, volume 8488 of Lecture Notes in Computer Science, pages 440–449. Springer International Publishing, 2014.

[33] Paul Feautrier. Automatic parallelization in the polytope model. In The Data Parallel Pro- gramming Model, pages 79–103. Springer, 1996.

[34] Susan L Graham, Peter B Kessler, and Marshall K Mckusick. Gprof: A call graph execution profiler. In ACM Sigplan Notices, volume 17, pages 120–126. ACM, 1982.

[35] Tobias Grosser, J. Ramanujam, Louis-Noel Pouchet, P. Sadayappan, and Sebastian Pop. Optimistic delinearization of parametrically sized arrays. In Proceedings of the 29th ACM on International Conference on Supercomputing, ICS '15, pages 351–360, New York, NY, USA, 2015. ACM.

[36] Tobias Grosser, Hongbin Zheng, Raghesh Aloor, Andreas Simbürger, Armin Größlinger, and Louis-Noël Pouchet. Polly: polyhedral optimization in LLVM. In Proceedings of the First International Workshop on Polyhedral Compilation Techniques (IMPACT), volume 2011, 2011.

[37] Saurabh Gupta, Ping Xiang, Yi Yang, and Huiyang Zhou. Locality principle revisited: A probability-based quantitative approach. Journal of Parallel and Distributed Computing, 73(7):1011–1027, 2013.

[38] Christopher Haine, Olivier Aumage, and Denis Barthou. Rewriting system for profile-guided data layout transformations on binaries. Euro-Par, 2017 (to appear).

[39] Christopher Haine, Olivier Aumage, Enguerrand Petit, and Denis Barthou. Exploring and evaluating array layout restructuring for SIMDization. In James Brodman and Peng Tu, editors, Intl. Workshop on Languages and Compilers for Parallel Computing (LCPC), pages 351–366, Cham, 2015. Springer International Publishing.

[40] Paul H Hargrove and Jason C Duell. Berkeley Lab Checkpoint/Restart (BLCR) for Linux clusters. Journal of Physics: Conf. Series, 46(1):494, 2006.

[41] Tom Henretty, Kevin Stock, Louis-Noël Pouchet, Franz Franchetti, J. Ramanujam, and P. Sadayappan. Data layout transformation for stencil computations on short-vector SIMD architectures. In Proceedings of the 20th Intl. Conf. on Compiler Construction: Part of the Joint European Conferences on Theory and Practice of Software, CC'11/ETAPS'11, pages 225–245, Berlin, Heidelberg, 2011. Springer-Verlag.

[42] Tom Henretty, Kevin Stock, Louis-Noël Pouchet, Franz Franchetti, J. Ramanujam, and P. Sadayappan. Data layout transformation for stencil computations on short-vector SIMD architectures. In Intl. Conference on Compiler Construction: Part of the Joint European Conferences on Theory and Practice of Software, 2011.

[43] Justin Holewinski, Ragavendar Ramamurthi, Mahesh Ravishankar, Naznin Fauzia, Louis-Noël Pouchet, Atanas Rountev, and P. Sadayappan. Dynamic trace-based analysis of vectorization potential of applications. In ACM SIGPLAN Conf. on Prog. Lang. Design and Implementation, 2012.

[44] Robert Hundt, Sandya Mannarswamy, and Dhruva Chakrabarti. Practical structure layout optimization and advice. In Proceedings of the International Symposium on Code Generation and Optimization, pages 233–244. IEEE Computer Society, 2006.

[45] Intel. Vtune, 2014. http://software.intel.com/en-us/intel-vtune-amplifier-xe.

[46] Intel. Intel Math Kernel Library (MKL), 2007.

[47] Julien Jaeger and Denis Barthou. Automatic efficient data layout for multithreaded stencil codes on cpus and gpus. In IEEE Intl. High Performance Computing Conf., pages 1–10, Pune, India, December 2012. IEEE Computer Society.

[48] Mahmut Kandemir, Alok Choudhary, J Ramanujam, and Prithviraj Banerjee. A framework for interprocedural locality optimization using both loop and data layout transformations. In Parallel Processing, 1999. Proceedings. 1999 International Conference on, pages 95–102. IEEE, 1999.

[49] Mahmut Kandemir, J Ramanujam, and Alok Choudhary. Improving cache locality by a combination of loop and data transformations. IEEE Transactions on Computers, 48(2):159–167, 1999.

[50] Mahmut Taylan Kandemir. Array unification: A locality optimization technique. In Reinhard Wilhelm, editor, Compiler Construction, volume 2027 of Lecture Notes in Computer Science, pages 259–273. Springer Berlin Heidelberg, 2001.

[51] Alain Ketterlin and Philippe Clauss. Prediction and trace compression of data access addresses through nested loop recognition. In ACM/IEEE Intl. Conf. on Code Generation and Optimization, pages 94–103, New York, NY, USA, 2008. ACM.

[52] Thomas Kistler and Michael Franz. Automated data-member layout of heap objects to improve memory-hierarchy performance. ACM Transactions on Programming Languages and Systems (TOPLAS), 22(3):490–505, 2000.

[53] Peter MW Knijnenburg, Toru Kisuki, and Michael FP O’Boyle. Iterative compilation. In Embedded processor design challenges, pages 171–187. Springer, 2002.

[54] Andreas Knüpfer, Holger Brunst, Jens Doleschal, Matthias Jurenz, Matthias Lieber, Holger Mickler, Matthias S Müller, and Wolfgang E Nagel. The Vampir performance analysis toolset. In Tools for High Performance Computing, pages 139–155. Springer, 2008.

[55] Klaus Kofler, Biagio Cosenza, and Thomas Fahringer. Automatic data layout optimizations for gpus. In European Conference on Parallel Processing, pages 263–274. Springer, 2015.

[56] Martin Kong, Richard Veras, Kevin Stock, Franz Franchetti, Louis-Noël Pouchet, and P. Sadayappan. When polyhedral transformations meet SIMD code generation. In ACM SIGPLAN Conf. on Prog. Lang. Design and Implementation, 2013.

[57] Andreas Krall and Sylvain Lelait. Compilation techniques for multimedia processors. Intl. J. of Parallel Programming, 2000.

[58] Olaf Krzikalla, Kim Feldhoff, Ralph Müller-Pfefferkorn, and Wolfgang E. Nagel. Scout: a source-to-source transformator for SIMD-optimizations. In Workshop on Productivity and Performance (PROPER), 2011.

[59] Samuel Larsen and Saman Amarasinghe. Exploiting superword level parallelism with multimedia instruction sets. In ACM SIGPLAN Conf. on Prog. Lang. Design and Implementation, 2000.

[60] Yoon-Ju Lee and Mary Hall. A code isolator: Isolating code fragments from large programs. In Languages and Compilers for High Performance Computing, 2004.

[61] Shun-Tak Leung and John Zahorjan. Optimizing data locality by array restructuring. Depart- ment of Computer Science and Engineering, University of Washington, 1995.

[62] Peng-yuan Li, Qing-hua Zhang, Rong-cai Zhao, and Hai-ning Yu. Data layout transformation for structure vectorization on SIMD architectures. In Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD), 2015 16th IEEE/ACIS International Conference on, pages 1–7. IEEE, 2015.

[63] Jun Liu, Yuanrui Zhang, Ohyoung Jang, Wei Ding, and Mahmut Kandemir. A compiler framework for extracting superword level parallelism. In Proceedings of the 33rd ACM SIGPLAN Conf. on Programming Language Design and Implementation, PLDI '12, pages 347–358, New York, NY, USA, 2012. ACM.

[64] Xu Liu, Kamal Sharma, and John Mellor-Crummey. ArrayTool: A lightweight profiler to guide array regrouping. In Proceedings of the 23rd International Conference on Parallel Architectures and Compilation, PACT '14, pages 405–416, New York, NY, USA, 2014. ACM.

[65] Vincent Loechner, Benoît Meister, and Philippe Clauss. Precise data locality optimization of nested loops. The Journal of Supercomputing, 21(1):37–76, 2002.

[66] Qingda Lu, Christophe Alias, Uday Bondhugula, Thomas Henretty, Sriram Krishnamoorthy, Jagannathan Ramanujam, Atanas Rountev, Ponnuswamy Sadayappan, Yongjian Chen, Haibo Lin, et al. Data layout transformation for enhancing data locality on NUCA chip multiprocessors. In Parallel Architectures and Compilation Techniques, 2009. PACT'09. 18th International Conference on, pages 348–357. IEEE, 2009.

[67] Chi-Keung Luk, Robert Cohn, Robert Muth, Harish Patil, Artur Klauser, Geoff Lowney, Steven Wallace, Vijay Janapa Reddi, and Kim Hazelwood. Pin: building customized program analysis tools with dynamic instrumentation. In ACM SIGPLAN Notices, volume 40, pages 190–200. ACM, 2005.

[68] Chi-Keung Luk, Robert Cohn, Robert Muth, Harish Patil, Artur Klauser, Geoff Lowney, Steven Wallace, Vijay Janapa Reddi, and Kim Hazelwood. Pin: building customized program analysis tools with dynamic instrumentation. In ACM SIGPLAN Conf. on Prog. Lang. Design and Implementation, 2005.

[69] Chi-Keung Luk and Todd C Mowry. Compiler-based prefetching for recursive data structures. In ACM SIGOPS Operating Systems Review, volume 30, pages 222–233. ACM, 1996.

[70] Deepak Majeti, Kuldeep S. Meel, Rajkishore Barik, and Vivek Sarkar. ADHA: Automatic data layout framework for heterogeneous architectures. In Proceedings of the 23rd International Conference on Parallel Architectures and Compilation, PACT '14, pages 479–480, New York, NY, USA, 2014. ACM.

[71] Jonathan Mak and Alan Mycroft. Limits of parallelism using dynamic dependency graphs. In Intl. Workshop on Dynamic Analysis, 2009.

[72] S. Maleki, Yaoqing Gao, M.J. Garzaran, T. Wong, and D.A. Padua. An evaluation of vectorizing compilers. In Parallel Architectures and Compilation Techniques (PACT), 2011 Intl. Conf. on, pages 372–382, Oct 2011.

[73] Saeed Maleki, Yaoqing Gao, Maria J. Garzarán, Tommy Wong, and David A. Padua. An evaluation of vectorization compilers. In Intl. Conf. on Parallel Architectures and Compilation Techniques (PACT), 2011.

[74] Sandya S Mannarswamy, Ramaswamy Govindarajan, and Rishi Surendran. Region based structure layout optimization by selective data copying. In Parallel Architectures and Compilation Techniques, 2009. PACT'09. 18th International Conference on, pages 338–347. IEEE, 2009.

[75] Gordon E Moore et al. Cramming more components onto integrated circuits. Proceedings of the IEEE, 86(1):82–85, 1998.

[76] Nicholas Nethercote and Julian Seward. Valgrind: a framework for heavyweight dynamic binary instrumentation. In ACM SIGPLAN Notices, volume 42, pages 89–100. ACM, 2007.

[77] Dorit Nuzman, Sergei Dyshel, Erven Rohou, Ira Rosen, Kevin Williams, David Yuste, Albert Cohen, and Ayal Zaks. Vapor SIMD: Auto-vectorize once, run everywhere. In ACM/IEEE Intl. Conf. on Code Generation and Optimization, 2011.

[78] Dorit Nuzman and Richard Henderson. Multi-platform auto-vectorization. In ACM/IEEE Intl. Conf. on Code Generation and Optimization, 2006.

[79] Dorit Nuzman, Ira Rosen, and Ayal Zaks. Auto-vectorization of interleaved data for SIMD. In ACM SIGPLAN Conf. on Prog. Lang. Design and Implementation, 2006.

[80] Dorit Nuzman and Ayal Zaks. Autovectorization in GCC – two years later. In Proceedings of the GCC Developers' Summit, 2006.

[81] Dorit Nuzman and Ayal Zaks. Outer-loop vectorization: revisited for short SIMD architectures. In ACM/IEEE Intl. Conf. on Parallel Architectures and Compilation Techniques (PACT), 2008.

[82] David A Patterson. Computer architecture: a quantitative approach. Elsevier, 2011.

[83] Paul M. Petersen and David A. Padua. Static and dynamic evaluation of data dependence analysis. In Intl. Conf. on Supercomputing, 1993.

[84] Eric Petit, Franc¸ois Bodin, Guillaume Papaure, and Florence Dru. ASTEX: a hot path based thread extractor for distributed memory system on a chip. In HiPEAC Industrial Workshop, 2006.

[85] Erez Petrank and Dror Rawitz. The hardness of cache conscious data placement. In ACM SIGPLAN Notices, volume 37, pages 101–112. ACM, 2002.

[86] Matt Pharr and William R. Mark. ispc: A SPMD compiler for high performance CPU programming. In Conf. InPar, 2012.

[87] Easwaran Raman, Robert Hundt, and Sandya Mannarswamy. Structure layout optimization for multithreaded programs. In Code Generation and Optimization, 2007. CGO’07. International Symposium on, pages 271–282. IEEE, 2007.

[88] James Reinders. VTune performance analyzer essentials, volume 14. Intel Press, 2005.

[89] Bin Ren, Gagan Agrawal, James R. Larus, Todd Mytkowicz, Tomi Poutanen, and Wolfram Schulte. SIMD parallelization of applications that traverse irregular data structures. In ACM/IEEE Intl. Conf. on Code Generation and Optimization, 2013.

[90] Gang Ren, Peng Wu, and David Padua. Optimizing data permutations for SIMD devices. In ACM SIGPLAN Conf. on Prog. Lang. Design and Implementation, 2006.

[91] Probir Roy and Xu Liu. StructSlim: A lightweight profiler to guide structure splitting. In ACM/IEEE Intl. Conf. on Code Generation and Optimization, 2016.

[92] Shai Rubin, Rastislav Bodík, and Trishul Chilimbi. An efficient profile-analysis framework for data-layout optimizations. In ACM SIGPLAN Notices, volume 37, pages 140–153. ACM, 2002.

[93] Nadathur Satish, Changkyu Kim, Jatin Chhugani, Hideki Saito, Rakesh Krishnaiyer, Mikhail Smelyanskiy, Milind Girkar, and Pradeep Dubey. Can traditional programming bridge the ninja performance gap for parallel computing applications? In Proceedings of the 39th Annual International Symposium on Computer Architecture, ISCA '12, pages 440–451, Washington, DC, USA, 2012. IEEE Computer Society.

[94] Nadathur Satish, Changkyu Kim, Jatin Chhugani, Hideki Saito, Rakesh Krishnaiyer, Mikhail Smelyanskiy, Milind Girkar, and Pradeep Dubey. Can traditional programming bridge the ninja performance gap for parallel computing applications? In ACM SIGARCH Computer Architecture News, volume 40, pages 440–451. IEEE Computer Society, 2012.

[95] Matthew L Seidl and Benjamin G Zorn. Segregating heap objects by reference behavior and lifetime. In ACM SIGPLAN Notices, volume 33, pages 12–23. ACM, 1998.

[96] Kamal Sharma, Ian Karlin, Jeff Keasler, James R. McGraw, and Vivek Sarkar. Data layout optimization for portable performance. In Jesper Larsson Träff, Sascha Hunold, and Francesco Versaci, editors, Euro-Par, pages 250–262, Berlin, Heidelberg, 2015. Springer Berlin Heidelberg.

[97] Xipeng Shen, Yaoqing Gao, Chen Ding, and Roch Archambault. Lightweight reference affinity analysis. In Proceedings of the 19th Annual International Conference on Supercomputing, pages 131–140. ACM, 2005.

[98] Sameer S Shende and Allen D Malony. The TAU parallel performance system. The International Journal of High Performance Computing Applications, 20(2):287–311, 2006.

[99] Jaewook Shin, Mary Hall, and Jacqueline Chame. Superword-level parallelism in the presence of control flow. In ACM/IEEE Intl. Conf. on Code Generation and Optimization, 2005.

[100] Robert Strzodka. Abstraction for AoS and SoA layout in C++. GPU Computing Gems: Jade Edition, 2011.

[101] I-Jui Sung, G.D. Liu, and W.-M.W. Hwu. DL: A data layout transformation system for heterogeneous computing. In Innovative Parallel Computing (InPar), 2012, pages 1–11, May 2012.

[102] I-Jui Sung, Geng Daniel Liu, and Wen-Mei W Hwu. DL: A data layout transformation system for heterogeneous computing. In Innovative Parallel Computing (InPar), 2012, pages 1–11. IEEE, 2012.

[103] Salvador Tamarit, Julio Mariño, Guillermo Vigueras, and Manuel Carro. Towards a semantics-aware transformation toolchain for heterogeneous systems. In Program Transformation for Programmability in Heterogeneous Architectures (PROHA) Workshop, 2016.

[104] Xinmin Tian and Bronis R. de Supinski. Explicit vector programming with OpenMP 4.0 SIMD extensions. Primeur Magazine, 2014.

[105] Josep Torrellas, Monica S. Lam, and John L. Hennessy. False sharing and spatial locality in multiprocessor caches. IEEE Transactions on Computers, 43(6):651–663, 1994.

[106] Georgios Tournavitis, Zheng Wang, Björn Franke, and Michael F.P. O'Boyle. Towards a holistic approach to auto-parallelization: integrating profile-driven parallelism detection and machine-learning based mapping. In ACM SIGPLAN Conf. on Prog. Lang. Design and Implementation, 2009.

[107] Konrad Trifunovic, Dorit Nuzman, Albert Cohen, Ayal Zaks, and Ira Rosen. Polyhedral-model guided loop-nest auto-vectorization. In Parallel Architectures and Compilation Techniques, 2009. PACT'09. 18th International Conference on, pages 327–337. IEEE, 2009.

[108] Dan N Truong, François Bodin, and André Seznec. Improving cache behavior of dynamically allocated data structures. In Parallel Architectures and Compilation Techniques, 1998. Proceedings. 1998 International Conference on, pages 322–329. IEEE, 1998.

[109] Ying-Yu Tseng, Yu-Hao Huang, Bo-Cheng Charles Lai, and Jiun-Liang Lin. Automatic data layout transformation for heterogeneous many-core systems. In Ching-Hsien Hsu, Xuanhua Shi, and Valentina Salapura, editors, Network and Parallel Computing, volume 8707 of Lecture Notes in Computer Science, pages 208–219. Springer Berlin Heidelberg, 2014.

[110] Reinhard v. Hanxleden and Ken Kennedy. Relaxing simd control flow constraints using loop transformations. In ACM SIGPLAN Conf. on Prog. Lang. Design and Implementation, 1992.

[111] Brice Videau, Vania Marangozova-Martin, Luigi Genovese, and Thierry Deutsch. Optimizing 3D convolutions for wavelet transforms on CPUs with SSE units and GPUs. In Intl. Euro-Par Conference on Parallel Processing, 2013.

[112] Wei Wang, Lifan Xu, John Cavazos, Howie H. Huang, and Matthew Kay. Fast acceleration of 2d wave propagation simulations using modern computational accelerators. PLoS ONE, 9(1):1–10, 01 2014.

[113] Peter Weiner. Linear pattern matching algorithms. In Switching and Automata Theory, 1973. IEEE Conference Record of 14th Annual Symposium on, pages 1–11. IEEE, 1973.

[114] David C. Wong, David J. Kuck, David Palomares, Zakaria Bendifallah, Mathieu Tribalat, Emmanuel Oseret, and William Jalby. VP3: A vectorization potential performance prototype. In Workshop on Programming Models for SIMD/Vector Processing, Feb 2015.

[115] Bo Wu, Zhijia Zhao, Eddy Zheng Zhang, Yunlian Jiang, and Xipeng Shen. Complexity analysis and algorithm design for reorganizing data to minimize non-coalesced memory accesses on gpu. In ACM SIGPLAN Notices, volume 48, pages 57–68. ACM, 2013.

[116] Wm A Wulf and Sally A McKee. Hitting the memory wall: implications of the obvious. ACM SIGARCH computer architecture news, 23(1):20–24, 1995.
