Improving Tiling, Reducing Compilation Time, and Extending the Scope of Polyhedral Compilation Mohamed Riyadh Baghdadi

Improving tiling, reducing compilation time, and extending the scope of polyhedral compilation Mohamed Riyadh Baghdadi To cite this version: Mohamed Riyadh Baghdadi. Improving tiling, reducing compilation time, and extending the scope of polyhedral compilation. Programming Languages [cs.PL]. Université Pierre et Marie Curie - Paris VI, 2015. English. NNT : 2015PA066368. tel-01270558 HAL Id: tel-01270558 https://tel.archives-ouvertes.fr/tel-01270558 Submitted on 8 Feb 2016 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. THESE DE DOCTORAT DE L’UNIVERSITE PIERRE ET MARIE CURIE École Doctorale Informatique, Télécommunications et Électronique de paris Improving Tiling, Reducing Compilation Time, and Extending the Scope of Polyhedral Compilation Présentée par Riyadh BAGHDADI Codirigée par Albert COHEN et Sven VERDOOLAEGE soutenue le 25/09/2015 devant le jury composé de : Rapporteurs M. Cédric Bastoul Professor, University of Strasbourg M. Paul H J Kelly Professor, Imperial College London Examinateurs M. Stef Graillat Professor, University Pierre et Marie Curie (Paris 6) M. Cédric Bastoul Professor, University of Strasbourg M. Paul H J Kelly Professor, Imperial College London M. J. Ramanujam Professor, Louisiana State University M. Albert Cohen Senior Research Scientist, INRIA, France M. Sven Verdoolaege Associate Researcher, Polly Labs and KU Leuven Contents List of Figures vii List of Tables ix Nomenclature xi 1 Introduction and Background 1 1.1 Introduction................................... 1 1.2 BackgroundandNotations . 5 1.2.1 MapsandSets ............................. 5 1.2.2 AdditionalDefinitions . 7 1.2.3 PolyhedralRepresentationofPrograms . ... 8 1.2.4 DataLocality.............................. 13 1.2.5 LoopTransformations . 14 1.3 Outline ..................................... 15 2 Tiling Code With Memory-Based Dependences 19 2.1 Introductionandrelatedwork. ... 19 2.2 SourcesofFalseDependences . 21 2.3 MotivatingExample .............................. 23 2.4 LiveRangeNon-Interference . .. 24 2.4.1 Liveranges ............................... 24 2.4.2 Constructionofliveranges . 25 2.4.3 Liverangenon-interference . 26 2.4.4 Live range non-interference and tiling: a relaxed permutability criterion 27 2.4.5 Illustrativeexamples . 29 2.4.6 Parallelism ............................... 31 2.5 Effectof3AContheTilabilityofaLoop. .... 32 2.6 ExperimentalEvaluation . 34 iv Contents 2.7 ExperimentalResults . .. .. .. .. .. .. .. .. 36 2.7.1 Evaluating the Effect of Ignoring False Dependences on Tiling . 38 2.7.2 Performance evaluation for Polybench-3AC . .... 43 2.7.3 Reasonsforslowdowns. 45 2.7.4 Performance evaluation for Polybench-PRE . .... 45 2.8 ConclusionandFutureWork . 46 3 Scheduling Large Programs 49 3.1 Introduction................................... 49 3.2 Definitions.................................... 51 3.3 ExampleofStatementClustering. ... 51 3.4 ClusteringAlgorithm . .. .. .. .. .. .. .. .. 53 3.5 Using Offline Clustering with the Pluto Affine Scheduling Algorithm. 54 3.5.1 Correctness of Loop Transformations after Clustering ......... 56 3.6 ClusteringHeuristics . 56 3.6.1 ClusteringDecisionValidity . 56 3.6.2 SCCClustering............................. 57 3.6.3 Basic-blockClustering . 59 3.7 Experiments................................... 60 3.8 RelatedWork .................................. 65 3.9 ConclusionandFutureWork . 67 4 The PENCIL Language 69 4.1 Introduction................................... 69 4.2 PENCIL Language................................ 70 4.2.1 Overviewof PENCIL .......................... 70 4.2.2 PENCIL DefinitionasaSubsetofC99 . 73 4.2.3 PENCIL ExtensionstoC99. .. .. .. .. .. .. 77 4.3 DetailedDescription .............................. 77 4.3.1 , and ..................... 77 4.3.2 DescriptionoftheMemoryAccessesofFunctions . .... 78 4.3.3 FunctionAttribute. 83 4.3.4 PragmaDirectives ........................... 83 4.3.5 BuiltinFunctions .. .. .. .. .. .. .. .. 88 4.4 Examplesof PENCIL and non-PENCIL Code.................. 93 4.4.1 BFSExample.............................. 93 4.4.2 ImageResizingExample. 95 Contents v 4.4.3 Recursive Data Structures and Recursive Function Calls . 96 4.5 PolyhedralCompilationofPENCILcode . ... 97 4.6 RelatedWork .................................. 100 5 Evaluating PENCIL 103 5.1 Introduction................................... 103 5.2 Benchmarks................................... 105 5.2.1 ImageProcessingBenchmarkSuite . 105 5.2.2 RodiniaandSHOCBenchmarkSuites . 109 5.2.3 VOBLADSLforLinearAlgebra . 110 5.2.4 SpearDEDSLforData-StreamingApplications . .... 114 5.3 DiscussionoftheResults . 114 5.4 ConclusionandFutureWork . 115 6 Conclusions and Perspectives 117 6.1 Contributions .................................. 117 6.2 FutureDirections ................................ 118 Bibliography 121 A EBNF Grammar for PENCIL 131 B Personal Publications 135 Index 136 List of Figures 1.1 Graphicalrepresentationofaset . 6 1.2 Graphicalrepresentationofamap . ... 7 1.3 Exampleofakernel............................... 9 1.4 Iteration domain of Ë½ .............................. 9 2.1 Acodewithfalsedependencesthatpreventtiling . ........ 20 2.2 Gemm kernel .................................. 22 2.3 Gemm kernelafterapplyingPRE. 22 2.4 Threenestedloopswithloop-invariantcode . ....... 22 2.5 Three nested loops after loop-invariantcode motion . .......... 22 2.6 Tiledversion of the gemm kernel......................... 24 2.7 Oneloopfromthe mvt kernel.......................... 25 2.8 Examples where {S1 → S2} and {S3 → S4} donotinterfere . 26 2.9 Examples where {S1 → S2} and {S3 → S4} interfere ............. 26 2.10 Anexampleillustratingtheadjacencyconcept . ......... 27 ℄ 2.11 Live ranges for Ø and in mvt ....................... 30 2.12 Examplewheretilingisnotpossible . ..... 31 2.13 Live ranges for the example in Figure 2.12 ................... 31 2.14 Comparing the number of private copies of scalars and arrays created to enable parallelization,tilingorboth . 32 2.15 Original gesummv kernel............................. 35 2.16 gesummv in3ACform. ............................. 35 2.17 Original gemm kernel. ............................. 35 2.18 gemm afterapplyingPRE. ........................... 35 2.19 2mm-3AC (2mm in3ACform).......................... 36 2.20 Expanded version of 2mm-3AC. ........................ 36 2.21 gesummv kernelafterapplyingPRE. 36 2.22 Theeffectof3ACtransformationontiling . ....... 38 viii List of Figures 2.23 Speedup against non-tiled sequential code for Polybench—3AC ....... 39 2.24 2mm-3AC optimized while false dependences are ignored. .......... 40 2.25 Optimizingtheexpandedversionof2mm-3AC. ...... 41 2.26 Speedup against non-tiled sequential code for Polybench—PRE ....... 42 2.27 Innermost parallelism in the expanded version of trmm............. 46 3.1 Illustrativeexampleforstatementclustering . .......... 51 3.2 Originalprogramdependencegraph . ... 52 3.3 Dependencegraphafterclustering . .... 52 3.4 Flowdependencegraph. .. .. .. .. .. .. .. .. 57 3.5 Illustrativeexamplefor basic-blockclustering . ........... 59 3.6 Speedup in scheduling time obtained after applying statementclustering . 63 3.7 Execution time when clustering is enabled normalized to the execution time whenclusteringisnotenabled(lowermeansfaster) . 65 4.1 High level overview of our vision for the use of PENCIL ........... 72 4.2 General2Dconvolution. 91 4.3 Code extracted from Dilate (imageprocessing) . 99 4.4 Example of code containing a Ö statement................. 99 A.1 PENCIL syntaxasanEBNF. .......................... 131 A.1 PENCIL syntaxasanEBNFcontinuedoverleaf. 132 A.1 PENCIL syntaxasanEBNFcontinuedoverleaf. 133 List of Tables 2.1 Atableshowingwhich3ACkernelsweretiled. 37 2.2 AtableshowingwhichPREkernelsweretiled. ..... 37 3.1 The factor of reduction in the number of statements and dependences after applyingstatementclustering. 62 5.1 Statistics about patterns of non static-affine code in the image processing benchmark.................................... 106 5.2 Effect of enabling support for individual PENCIL features on the ability to generatecodeandongainsinspeedups . 106 5.3 Speedups of the OpenCL code generated by ÈÈ overOpenCV . 107 5.4 SelectedbenchmarksfromRodiniaandSHOC. .... 109 5.5 PENCIL Features that are useful for SHOC and Rodinia benchmarks . 109 5.6 Speedups for the OpenCL code generated by ÈÈ for the selected Rodinia andSHOCbenchmarks............................. 110 5.7 Gains in performance obtained when support for the ××ÙÑ builtin is enabled inVOBLA.................................... 113 5.8 Speedups obtained with ÈÈ overhighlyoptimizedBLAS libraries . 113 Nomenclature Roman Symbols 3AC Three-Address Code ABF Adaptive Beamformer APIs Application Program Interface AST Abstract Syntax Tree BFS Breadth-First Search C11 C Standard 2011 C99 C Standard 1999 CUDA Compute Unified Device Architecture DSL Domain-Specific Language EBNF Extended Backus-Naur Form GCC GNUCompilerCollection GFLOPS Giga FLoating-point Operations Per Second GPU Graphics Processing Unit HMPP Hybrid Multicore Parallel Programming

Load more