Department of Computer Science, "SoFt" Doctoral School, UFR de Sciences

Optimisation Itérative de Bibliothèques de Calculs par Division Hiérarchique de Codes

Iterative Optimization of Performance Libraries by Hierarchical Division of Codes

THESIS

presented and publicly defended on 14 September 2007

for the degree of

Doctorat de l'Université de Versailles Saint-Quentin (specialty: Computer Science)

by Sébastien Donadio

Composition of the jury
Advisor: William Jalby
President: Albert Cohen
Reviewers: Jean-François Collard, Boris Sabanin
Examiners: Denis Barthou, David Padua, Michel Guillemet

Typeset with the thloria class.

Acknowledgements

These three years of doctoral work on code optimization for scientific computing allowed me to meet many talented people. This work was funded by the company BULL, the Association Nationale de la Recherche Technique, the ITACA laboratory and the Commissariat à l'Énergie Atomique, Département des Applications Militaires. I particularly wish to thank my thesis advisor, William Jalby, whose commitment and motivation taught me great rigor in my work. His availability, his involvement and his help will never let me forget how lucky I was to work with him. I also want to express all my gratitude to my supervisor, Denis Barthou, who guided me throughout this work; his own work and his explanations taught me a great deal about the researcher's craft. The same goes for Albert Cohen, who co-supervised this thesis, someone with whom it was a pleasure to work and who, through his extreme motivation, introduced me to fascinating topics. I also thank, in another language, Professor David Padua, who welcomed me into his team at the University of Illinois at Urbana-Champaign, as well as Claude Camozzi, who followed my work at BULL during these three years, and Michel Guillemet, who took part in the thesis jury. I particularly wish to thank BULL which, through its CIFRE grant, allowed me to work in very good conditions while providing technical support and access to state-of-the-art technology (machines, compilers, operating systems). I also thank my two reviewers, Jean-François Collard and Boris Sabanin, for the many comments they gave me during the final stages of writing this manuscript. I also want to thank the team I worked in, which always brought me a lot. I particularly enjoyed working with Patrick Carribault, who was an excellent colleague throughout these years. Thanks also to Christophe Lemuet for letting me use a very fine program, to Jean-Thomas Acquaviva, Sid Touati, Henri-Pierre Charles, Jean Papadopoulo and Stéphane Zuckerman for carrying on the work with several cores hard at work, and to Marc Perache, Lamia Djoudi, Minhaj Khan, Emmanuel Oseret, Alexandre Duchateau, Julien Jaeger and Souad Koliai. I also thank the members of the Cryptography team, Aurélie Bauer, Joana Treger and Sorina Ionica, as well as other very good colleagues from the PRiSM laboratory such as Xiaohui Xue, Tao Wan, Veronika Peralta and Amir Djouama. Finally, I especially thank my family, who always trusted me, and all my friends.

Résumé

The growing complexity of computer architectures does not make it any easier for compilers to generate efficient code, despite their ever increasing number of optimization phases. Library generators such as ATLAS, FFTW and SPIRAL have managed to cope with this difficulty by using iterative search, which generates different versions of a program and selects the best one. This thesis explores an automatic solution for adapting compute-intensive applications to complex machine architectures. Building on well-known optimizations, we show that a generative approach can be a useful tool for implementing a new hierarchical compilation approach for the generation of efficient code. This method relies on off-the-shelf compilers. Unlike ATLAS, the approach is not specific to any application domain: it can be applied to fairly general loop structures, which it divides into code fragments that are easier for a compiler to optimize. Building on these code kernels, we then propose a new approach to the generation of high-performance computing libraries. It relies on recomposing these codes with a very simplified model, which allows us to compete strongly with existing commercial libraries, in particular BLAS implementations.

Keywords: Optimization, Compiler, High-Performance Computing, Transformation, Generation, Iterative Search

Abstract

The increasing complexity of hardware features incorporated in modern processors makes high-performance code generation very challenging. Library generators such as ATLAS, FFTW and SPIRAL overcome this issue by empirically searching the space of possible program versions for the one that performs best. This thesis explores a fully automatic solution to adapt a compute-intensive application to the target architecture. By mimicking complex sequences of transformations useful to optimize real codes, we show that generative programming is a practical tool to implement a new hierarchical compilation approach for the generation of high-performance code relying on state-of-the-art compilers. As opposed to ATLAS, this approach is not application-dependent and can be applied to fairly generic loop structures. Our approach relies on the decomposition of the original loop nest into simpler kernels. These kernels are much easier to optimize and, furthermore, using such codes makes the performance trade-off problem much simpler to express and to solve. Finally, we propose a new approach for the generation of performance libraries based on this decomposition method. We show that our method generates high-performance libraries, in particular for BLAS.

Keywords: Optimization, Compiler, High Performance Computing, Transformation, Generation, Iterative Search

Contents

List of Figures

Glossary

1 Introduction
  1.1 Context
  1.2 Historical background
  1.3 Cache memory
  1.4 Translation and transformations
  1.5 Domain-specific code generators
  1.6 Contributions
  1.7 Outline

2 Meta-programming Languages for High-Performance Computing
  2.1 Motivation
  2.2 Features of a meta-programming language for high-performance computing
  2.3 MetaOCaml, a purely generative approach
  2.4 Prerequisites of MetaOCaml
  2.5 Generative Strategies for Loop Transformations
    2.5.1 Primitive Transformations
    2.5.2 Composition of Transformations
    2.5.3 Generative Implementation of Complex Optimizations
    2.5.4 Safe Meta-Programming in C
    2.5.5 Conclusion
  2.6 A generation language: X-Language
    2.6.1 Macro-languages
    2.6.2 X-Language pragma use
    2.6.3 Implementation
    2.6.4 Experimental results
    2.6.5 Bibliography
  2.7 Conclusion

    2.7.1 Comparison of two approaches
    2.7.2 Limitations

3 using Kernel Decomposition
  3.1 Introduction
  3.2 Why is it important to divide the problem?
    3.2.1 X-language Framework
  3.3 Hierarchical decomposition in kernels
    3.3.1 Loop Tiling
    3.3.2 Loop Transformations
    3.3.3 Data-Layout Optimization
    3.3.4 Kernel Micro-optimization and Execution
    3.3.5 Putting Kernels to Work
  3.4 Experimental results
    3.4.1 Implementation
    3.4.2 Experimental Environment
    3.4.3 A few 1D kernels
    3.4.4 Kernels for DGEMM
  3.5 Conclusion

4 Kernel Recomposition
  4.1 Library Generation Scheme
    4.1.1 Performance Modeling
  4.2 Recomposition algorithm
    4.2.1 With only one kernel
    4.2.2 Extension for different kernels
  4.3 Code generation from constraint systems
  4.4 Experimental results
    4.4.1 Matrix-vector multiply on Itanium 2
    4.4.2 A real example: dot-product library generation
  4.5 Decision tree for DGEMV and DGEMM
    4.5.1 Results compared to vendor libraries
    4.5.2 LAPACK potrs
  4.6 Method extension
    4.6.1 Kernel tests
    4.6.2 Model for an accurate performance prediction

5 Conclusion
  5.1 Contributions
  5.2 Method limitations
  5.3 Future work
    5.3.1 GPU processors
    5.3.2 Multi-core processors
  5.4 Summary

Bibliography

List of Figures

1.1 Moore's law (left) and hardware evolution (right)
1.2 The cache memory of an Itanium
1.3 FFT compiled by a general-purpose compiler vs. a domain-specific generator
1.4 Overview of sequential code optimizations

2.1 Programming adaptive library generators
2.2 Iterative search
2.3 Influence of parameter selection with the tile size NB
2.4 Unroll and Jam
2.5 Full unrolling example
2.6 Partial unrolling
2.7 Optimizing galgel (base171s)
2.8 Composition of loop transformations
2.9 Strip-mining
2.10 Factoring strip-mining and unrolling
2.11 Simplified matrix product template
2.12 Array of code and loop iteration combinators
2.13 Boxed scalars without cross-stage persistence
2.14 Scalar promotion: loop template
2.15 More efficient variant without cross-stage persistence
2.16 Scalar promotion: loop body
2.17 Scalar promotion: generated code
2.18 Some support functions for C generation
2.19 Some generation primitives for C
2.20 Example of C generation
2.21 Support for scalar promotion
2.22 Revisited matrix product template
2.23 Loop unroll using macro statements
2.24 (a) Matrix-multiply code. (b) Tiled matrix-multiply code with macros
2.25 Example in X of loop unroll. (a) Pragmas to name the loop and specify the unroll of 4. (b) Generated code
2.26 Example in X of stripmine. (a) Pragmas to name loops and specify the stripmine transformation. (b) Generated code
2.27 Example in X of the scalarize-in transformation. (a) Pragmas for scalarize-in. (b) Code after scalarize-in of array a in l1
2.28 Example in X of the scalarize-out and lift transformations. (a) Pragmas for scalarize-out and lift. (b) Generated code
2.29 Example of split. (a) Pragmas for split. (b) Generated code

2.30 Example of shift for software pipelining. (a) Pragmas for shift. (b) Generated code, including full unroll
2.31 (a) mini-mmm code in X. (b) Code after transformation with MU = 4, NU = 1
2.32 Preliminary results comparing ATLAS to naive code with pragmas for DGEMM
2.33 Related works

3.1 DGEMM performance and L2 and L3 behavior for ATLAS and MKL on Itanium 2
3.2 Optimized tile code generation
3.3 Tiled DGEMM
3.4 Several transformed mini-mmm and their corresponding 1D kernel. The first is a kernel of 2 daxpys, the second is 16 dot products and the third is the same as the first (modulo commutativity of the multiplication and renaming)
3.5 (a) 1D Kernel: dot product k,l; (b) 1D Kernel: daxpy k,l; (c) 2D Kernel: dgemv k; (d) 2D Kernel: outer product k
3.6 DAXPY examples
3.7 DAXPY kernel performance with different unrolling factors
3.8 Dot-product kernel performance with different unrolling factors
3.9 Dot product-k,k kernel on Itanium 2 with k = 1, 2, 4, 8
3.10 Dot product-k,k kernel on Pentium 4 with k = 1, 2, 4, 8
3.11 Outer product-k kernel on Itanium 2 with k = 1, 2, 4, 8
3.12 Outer product-k kernel performance on Pentium 4 with k = 1, 2, 4, 8
3.13 DGEMV-kernel performance on Itanium 2 with k = 1, 2, 4, 8
3.14 DGEMV-kernel performance on Pentium 4 with k = 1, 2, 4, 8

4.1 Decomposition and recomposition of a code
4.2 Recomposition, step 1
4.3 Recomposition, step 2
4.4 Recomposition, step 3
4.5 Kernel recomposition
4.6 (a) Matrix-vector composed with dot-product kernels. (b) Performance in cycles/FMA superposed with the number of L2 and L3 cache misses/FMA
4.7 (a) Matrix-vector composed with DAXPY kernels. (b) Performance in cycles/FMA superposed with the number of L2 and L3 cache misses/FMA
4.8 Performance in cycles/FMA superposed with the number of L2 and L3 cache misses/FMA of (a) daxpy kernels, (b) dot-product kernels, (c) dot-product kernels using copy kernels
4.9 Performance in cycles/FMA superposed with the number of L2 and L3 cache misses/FMA of (a) dot-product kernels, (b) copy kernels, (c) dot-product kernels using copy kernels, on Pentium 4
4.10 (a) Kernel list used to build the dot-product library, with size and performance. (b) Library performance in cycles/FMA for the dot-product library superposed with the number of L2 and L3 cache misses/FMA
4.11 PIP output code after translation
4.12 Library performance of dot-product with 11 dot-product kernels
4.13 (a) Kernel list used to build the DGEMV library, with size and performance. (b) Library performance in cycles/FMA for the dot-product library superposed with the number of L2 and L3 cache misses/FMA
4.14 (a) Kernel list used to build the DGEMM library, with size and performance. (b) Library performance in cycles/FMA for the dot-product library superposed with the number of L2 and L3 cache misses/FMA

4.15 Results of solver
4.16 DGEMM kernel performance on Itanium 2 (left) and Pentium 4 (right)
4.17 1D convolution performance on Itanium 2 (left) and Pentium 4 (right) with n = 10
4.18 DGEMV-kernel performance on Itanium 2 with k = 1, 2, 4, 8
4.19 LAPACK potrs on Itanium 2 (left) and Pentium 4 (right) with N = 200
4.20 Matrix-vector code

Glossary

ATLAS: Automatically Tuned Linear Algebra Software
Caml: Categorical Abstract Machine Language
CPU: Central Processing Unit
GCC: GNU Compiler Collection
FFTW: Fastest Fourier Transform in the West
GPU: Graphics Processing Unit
ICC: Intel C Compiler
ILP: Instruction Level Parallelism
IPP: Integrated Performance Primitives
MKL: Math Kernel Library
MMM: Matrix-Matrix Multiply
OCaml: Objective Caml
PROLOG: PROgrammation LOGique
TCC: Tiny C Compiler

Chapter 1

Introduction

1.1 Context

Weather forecasting, climate studies, molecular modeling (computing the structures and properties of chemical compounds, etc.), physical simulations (aerodynamic simulations, material strength computations, aircraft design, simulation of nuclear weapon explosions, studies of nuclear fusion, etc.) and cryptanalysis are all examples of high-performance computing applications. Not only do civilian and military research institutions work on these problems today, but the growing need for computation also reaches individual users, with ever more demanding multimedia applications and increasingly realistic, computation-heavy games. Numerical simulation is now the third pillar of science, alongside theory and experimentation. High-performance computing, which makes these simulations possible, is therefore becoming a market segment far broader than its traditional niche, making it one of the few growing sectors of the computing industry.

1.2 Historical background

Since 1960, with the birth of the first supercomputer designed by Seymour Cray, performance has kept increasing. We went from one computation per second with the ILLIAC, built at the University of Illinois in 1952, to 280 trillion computations per second with IBM's Blue Gene/L machine in 2005. These remarkable levels of performance are nevertheless not easy to obtain. Over these decades, the race for performance has been driven by technological innovations in every area. Mastering silicon lithography techniques for transistors is one example of the techniques that revolutionized this field. Progress in memory and disk capacities and the increase of network transfer rates also enabled this advance. The technological effort on machines has been a major element in the race for performance. These improvements were not sufficient on their own, however: architectures had to become more complex, with increasingly sophisticated hardware mechanisms and new techniques (parallelization, vectorization, prediction, and so on). Talking to machines directly is an exercise that requires a rather deep knowledge of the hardware characteristics together with fluency in the human-machine dialect (someone with the abilities of C-3PO). Thus, over the years, programming languages appeared to simplify the programmers' lives and to increase portability across machines.

These languages are a code enabling communication between humans and machines. They simplify as much as possible the control of the computation units, the memory and the communications of an architecture. In the early 1940s, programmers had to write in the machine language of the computer, whose vocabulary consisted of binary numbers representing memory addresses and operation codes. These languages, although very powerful, were not within reach of everyone who needed to perform large computations. After about fifteen years of such practice, John Backus at IBM decided to create a higher-level language, Fortran (for Formula Translation). Its statements looked like mathematical formulas, which made the language primarily suited to people doing numerical computations. After this language came Cobol, one of the precursors, used by many programs; its syntax was close to plain English, but it was too verbose. Then came Basic, the C language and Pascal. The advantage of C was that it allowed a more abstract programming style while staying close to the machine language. Together with Fortran, C became the standard for programming high-performance applications. Machines could of course not understand these high-level languages (this is still true today). Code translators therefore had to be built to convert the syntax of codes understandable by humans into codes for computers. Thus was born one of the chains that remains the most important today: the code translation chain. This translation is critical for generating well-adapted code. Producing code that exercises all the units of an architecture after several translation steps is one of the most actively researched areas today. Starting from a language available on every machine, generating machine-specific code requires a compiler that adapts to each architecture. This language, however, is very general-purpose; it contains no information that could help the translation mechanism. All the work therefore falls to the compilers, which are in charge of transforming the code as well as possible into a dialect matching exactly that of a given machine. During the 1970s there were also attempts to build machines dedicated to a high-level language: the hardware was then optimized for the language rather than the other way around. The opposite has become the rule since the advent of Moore's law, which put the whole profession at the service of silicon technology.

Fig. 1.1 – Moore's law (left) and hardware evolution (right).

Additional performance could be expected from Moore's law (Figure 1.1): one could simply wait for processors built from ever smaller transistors, so that clock frequencies would rise and the computation units would get faster. Today, however, clock frequency has hit physical limits. New methods must therefore be found that do not depend on these limited frequencies. This is the case of parallelism, which consists in performing several tasks at the same time. In high-performance computing, parallelism has helped make each component independent and able to operate concurrently with the others. To this end, machine instructions were divided into several atomic operations so that one instruction can execute at the same time as another. Whether through pipelined instructions or vector instructions, these methods made codes executable in parallel. The new processors launched over the last two years follow this principle: they contain several computing cores, which increases the degree of parallelism. As Figure 1.1 shows, hardware had to metamorphose to gain performance, making the task of code generators more and more complex. In this thesis we essentially address single-core models, which will serve as a basis for moving to multi-core models in future work. The other aspect of the problem concerns the volume of data and the time needed to access it. A vacuum-tube memory created by IBM in 1948, with a capacity of 150 words and an access time of several seconds, has become a memory of several gigawords with access times of a few tens of nanoseconds. Ever larger volumes of data have been stored on disks whose capacity was only 5 megabytes with the IBM 350 in 1956 and reaches 2.5 terabytes today. Disks have always been limited mechanically; their access time has always been about a million times slower than that of main memory. To avoid waiting too long for data to be loaded from disk, a hierarchical system was created. This system brings data as quickly as possible close to the units that will need it for a computation. It is based on a hierarchical representation going from the disk, the largest and slowest level, to the memories, the fastest and smallest. To further increase the speed of data loading, memory itself was divided into different parts: memories close to the computation units and memories serving as staging areas between the disks. Cache memory was thus born in the 1960s in a laboratory in Grenoble. The cache holds a copy of the original data when that data is expensive (in terms of access time) to fetch or to recompute relative to the cache's own access time. Once the data is stored in the cache, any future use of it can be satisfied by accessing the cached copy rather than fetching or recomputing the data, which lowers the average access time.

1.3 Cache memory

We pause in this historical account to detail how such a component works; it will indeed be at the center of several optimizations discussed later. Why is the cache so important? First of all, this memory is divided into several levels:
– the first-level cache (L1), inside the processor (the data cache is usually separate from the instruction cache),
– the second-level cache (L2), in some processors (it may be located off-chip),
– more rarely, the third-level cache (L3) (on the motherboard).
The diagram in Figure 1.2 shows the hardware complexity of an Itanium and the organization of its cache memory. It is divided into 3 levels, and the access time grows with the depth of the cache level.

For instance, on some architectures an addition takes on the order of one or two cycles, while a memory access can cost up to 200 cycles. It is therefore very important to know how to manage these different levels so as to keep data access times minimal. In programming, the size of the cache is of particular interest: to benefit from the acceleration provided by this very fast memory, the working set of each program part must fit in it as much as possible. Since the cache size varies from one processor to another, this optimization role is often left to the compiler. Consider for example the following code:

    int i;
    double s;
    double tab[1000];
    s = 0.;
    for (i = 0; i < 1000; ++i)
        s += tab[i];

The data accessed by this code will stay in the cache if the code is executed repeatedly and if the cache can hold 1000 double-precision values. Once in the cache, the data is accessed faster. We have seen that the cache serves as a temporary area for reading and writing values; let us now see how these transfers take place. The smallest unit of data that can be transferred between the cache and the next memory level is called the cache line; the word is the smallest unit of transfer between the processor and the cache. Since this area changes over time, the first time the processor accesses a given piece of data a cache miss occurs. Cache misses will come up repeatedly in this manuscript, because a miss increases the time needed to access a datum, which degrades performance.

Fig. 1.2 – The cache memory of an Itanium

Accessing a datum for the first time is not the only way to produce a cache miss. When the data set needed by the program exceeds the cache size, capacity misses occur in order to bring the missing data into the cache. There are also conflict misses, when two distinct addresses of the higher memory level are mapped to the same location in the cache, and coherence misses, caused by the invalidation of cache lines in order to maintain coherence between the caches of the different processors in a multiprocessor system. Coherence protocols are in charge of guaranteeing data integrity across all caches: multiprocessor systems are an aggregate of processors, each possibly with its own cache, sharing a common memory, hence the need for a protocol regulating data coherence between the various actors of the computation. How is data placed in the cache? Since the cache is limited, a policy must define at which cache address a line of main memory is to be written. Several placement schemes exist:
– fully associative caches, where any line of the higher memory level can be written at any address of the cache;
– direct-mapped caches, where each line of main memory can be stored at only one cache address. This creates many conflict misses if the program accesses data mapped to the same cache addresses. The line where a datum is stored is usually obtained as: line = (address) mod (number of lines);
– N-way set-associative caches, a compromise between the two previous solutions. The cache is divided into N equal sets of cache lines. A line of the higher memory level is assigned to one set and can then be written into any of the ways of that set, which avoids many conflict misses; within a set, placement is fully associative. In general, the set is selected as: set = (memory address) mod (number of sets).
The cache can hold both data and instructions, and one may ask whether these two quite different elements need to be separated. Taking the example of loads, it is beneficial to split the two entities; a unified cache often costs performance in a system. It does have some advantages, however: in a split cache, as mentioned, coherence between instructions and data must be maintained. Nowadays the most widespread solution is to separate the caches, because among other things it allows optimizations specific to each cache, the behavior of instructions and of data being very different (branch prediction illustrates this well). Cache coherence protocols are numerous; they govern the ordering of data in the caches. We will not dwell further on the subject, as it increasingly concerns the hardware side of the architecture.
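To make the two mapping formulas above concrete, here is a minimal sketch in C; the cache parameters are invented for the example, and a real processor would also strip the byte-offset bits inside a line before indexing.

    #include <stdint.h>
    #include <stdio.h>

    #define LINE_SIZE 64     /* bytes per cache line (assumed)            */
    #define NUM_LINES 512    /* total lines in the cache (assumed)        */
    #define NUM_SETS  128    /* sets of a 4-way set-associative cache     */

    int main(void)
    {
        uintptr_t address = 0x4008A340;               /* some data address   */
        uintptr_t line_number = address / LINE_SIZE;  /* drop the offset bits */

        /* Direct-mapped placement: line = address mod (number of lines). */
        unsigned direct_slot = (unsigned)(line_number % NUM_LINES);

        /* N-way set-associative placement: set = address mod (number of sets);
           the line may then go into any of the N ways of that set. */
        unsigned set_index = (unsigned)(line_number % NUM_SETS);

        printf("direct-mapped slot: %u, set index: %u\n", direct_slot, set_index);
        return 0;
    }

Two addresses whose line numbers differ by a multiple of NUM_LINES (or NUM_SETS) compete for the same slot (or set), which is exactly how the conflict misses described above arise.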

1.4 Translation and transformations

The computation units, the memory accesses and the architectural peculiarities are all characteristics that the compiler must take into account to generate efficient code. To adapt to them, it must transform the code so that it fits the architecture. For example, on architectures such as the Itanium, using paired loads is preferred over two separate loads (for contiguous data, it is faster to use one instruction that performs the load equivalent to two instructions). In this section we therefore introduce the notion of code transformation. First, let us ask why such work is necessary at all. The need for ever faster computation no longer has to be demonstrated, but why should we care about how well compilers apply transformations to the code? The curve in Figure 1.3 shows the performance of a code generated by a general-purpose compiler, compared with the same compiler driven by a method that steers its transformations according to a specific domain. Here we show the example of a Fourier transform generated by SPIRAL against that of an ordinary compiler. As can be seen, the compiler reaches only about 10% of the performance of the domain-specific generator. The same can be done for a matrix multiplication: the code generated naively by the compiler is four to five times slower. The performance gap between code produced by a compiler and by a domain-specific generator is very large, which is why it is worth knowing how to drive the compiler's transformations.


Fig. 1.3 – FFT compiled by a general-purpose compiler vs. a domain-specific generator (performance in MFlops as a function of the transform size)

We have examined cache memory in detail, but many other components are just as important. This is the case of registers, which are in a sense the innermost cache of a processor. They are very limited in number: 128 on the Itanium, 8 on the Pentium, 16 on the Opteron (there are other registers on these architectures, but their number is small). Managing the instruction pipeline is also essential: as mentioned above, splitting instructions into several atomic stages enables instruction-level parallelism, which can speed up the code. The number of functional units must likewise be taken into account. Compilers are in charge of several optimizations:
– dead code elimination, which reduces program size: due to careless programming, some code portions are unreachable, and the compiler takes care of removing them,
– data-flow optimizations, such as common subexpression elimination or constant propagation,
– induction variable recognition, together with the construction of the SSA representation,
– analysis of pointers referencing arrays,
– more complex optimizations on loop nests, which we will see in the next chapter.
Transformations must always preserve the semantics of a program. Following [7], a transformation is legal if the transformed program gives the same result as the original code. On top of this legality issue comes the notion of dependences. Programs impose constraints on the execution order of instructions; transformations may change this order and thereby violate these dependences. Control dependences arise between two instructions when the execution of one depends on the result of the other. The other kind of dependence concerns data; there are three of them:
– read after write (RAW), when an instruction tries to read a datum before it has been written by a previously issued instruction,
– write after read (WAR), when an instruction tries to write to a memory location before it has been read by an earlier instruction,
– write after write (WAW), when an instruction tries to write to the same location as a previous instruction.
Consider for example the unrolling transformation, which we will detail later. This transformation duplicates the code of a loop body as follows:

    (initial code)
    for (int x = 0; x < 100; x++)
        a += r[x];

    (code after unrolling)
    for (int x = 0; x < 100; x += 5) {
        a += r[x];
        a += r[x+1];
        a += r[x+2];
        a += r[x+3];
        a += r[x+4];
    }

The parameter of this transformation is the unrolling factor, 5 here. Depending on its value, the unrolled loop body will be larger or smaller. In the example above, the unrolled code will use more registers on the target machine. Since the number of registers is limited, on an architecture such as the Itanium, if the unrolling factor exceeds the 128 available registers the compiler starts inserting spill/fill instructions, which hurts performance. A search over the parameter values is therefore needed. The path from a scientific algorithm to actual performance is thus rather complex. We have just seen what the compiler must take into account to make a code efficient. These optimizations and their combinations are numerous. As we will see throughout this thesis, it is very difficult to find the sequence of optimization phases that will perform well. The difficulty lies in designing a performance model that could predict the effect of a transformation on a code, which would make the search for the transformations to apply almost immediate. One is therefore forced to search for the best optimization sequence; the number of combinations is enormous, and so is the cost of finding it [81]. Compilers have mechanisms, based on heuristics, that decide whether or not to apply a sequence of transformations to a code. The optimizations for a high-performance computing application are often tied to its domain. Much work has been done to let compilers adapt to these different domains, as we will see in the next section.
The compilers' task is nonetheless limited, in a first step, to code transformation: algorithmic transformations cannot be performed. Yet it is often this kind of transformation that can change the complexity of algorithms and thus speed up execution time. Significant work has been done on algorithm recognition [2], showing how to recognize an algorithm and replace it by calls to library functions.

1.5 Domain-specific code generators

We begin by mentioning one of the IEEE journals, which devoted its entire volume 93 of February 2005 to this kind of generator. Indeed, triggering the compilation options that lead to performance requires a good knowledge of the architecture, of the compiler and of the scientific algorithm. What these generators offer is to steer the compiler's search space in order to find good optimization sequences. In this thesis we are particularly interested in the optimization of the BLAS [57, 56, 29]. The BLAS are routines implementing linear algebra operations such as matrix or vector multiplication. They are used to build entire libraries such as LAPACK [5] (a scientific library for numerical computation). These routines are optimized for high-performance computing and for different architectures. The BLAS are divided into three levels (naive sketches of a level-1 and a level-3 routine are given after this paragraph):
– level 1 contains operations on vectors only, such as dot products or vector norms; the computations have the form y ← α × x + y,
– level 2 concerns matrix-vector operations of the form y ← α × A × x + β × y,
– level 3 is the level of matrix multiplication and, more generally, of computations on matrices of the form C ← α × A × B + β × C.
These BLAS are at the heart of many scientific or economic computations, and the performance of those numerous applications depends on their optimization. From an architectural point of view these routines are very interesting, because computing on matrices requires optimizing both the loading of data into memory and the computations themselves. They were first optimized by hand-tuning assembly code. As always, the resulting performance was high, but the great drawback of this method was the portability of the optimizations to other architectures. Machines, however, keep evolving, with different cache sizes, different functional units, and so on. Research therefore turned to optimizing the BLAS on all architectures, giving birth to domain-specific code generators. Several projects followed one another; one of the best known to date for BLAS optimization remains ATLAS [86].
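As an illustration of the three levels listed above, here are deliberately naive C loop nests for a level-1 and a level-3 operation. This is only a sketch: production BLAS implementations are heavily tiled and unrolled, as discussed later in this thesis.

    /* Level 1: y <- alpha*x + y (the classic "daxpy"). */
    void daxpy(int n, double alpha, const double *x, double *y)
    {
        for (int i = 0; i < n; i++)
            y[i] += alpha * x[i];
    }

    /* Level 3: C <- alpha*A*B + beta*C, square n x n row-major matrices. */
    void dgemm_naive(int n, double alpha, const double *A, const double *B,
                     double beta, double *C)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double acc = 0.0;
                for (int k = 0; k < n; k++)
                    acc += A[i*n + k] * B[k*n + j];
                C[i*n + j] = alpha * acc + beta * C[i*n + j];
            }
    }

The level-1 routine performs O(n) operations on O(n) data, so its speed is bounded by memory traffic, whereas the level-3 routine performs O(n^3) operations on O(n^2) data, which is precisely why tiling for the cache hierarchy pays off there.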

ATLAS (Automatically Tuned Linear Algebra Software) is a linear algebra library providing an implementation of the BLAS for C and Fortran. One of the strong points of this tool is its ability to generate efficient code on different architectures. ATLAS bases its optimization on three ingredients:
– parameterized transformations: it can vary the software pipeline latency, the unrolling degrees and the tile sizes (the result of the tiling transformation we will see in this thesis),
– knowledge of the target application, which gives ATLAS code fragments tailored to matrix multiplication,
– adaptation to different architectures: one finds SIMD codes using SSE or ALTIVEC instructions depending on the architecture.
ATLAS is characterized by an exhaustive search over all the code fragments it owns, keeping the best ones for different sizes. The various tiling factors adapt the volume of code and data accessed to the different cache levels.

GotoBLAS is a more recent project, much less robust than the previous one, that builds libraries for the same computations. It is based on constructing computation functions from kernels written in assembly code.

SPIRAL optimizes Fourier transform computations on different architectures. It exploits domain-specific mathematical knowledge by performing algorithmic transformations together with an iterative search. The method relies on searching for and learning techniques that exploit the structure of the algorithm to guide exploration and optimization. Unlike ATLAS, the optimization is performed on the algorithm rather than on the code.

FFTW [38] (the "Fastest Fourier Transform in the West") is also a tool generating libraries for Fourier transform computation. It was developed by Matteo Frigo and Steven Johnson at MIT. The technique is to choose an algorithm among many possible ones by evaluating and measuring performance. It divides the problem into several smaller problems and makes particular use of codes operating on small sizes, since these are more efficient. The different algorithm decompositions for a given size, together with the execution plan, can be stored and reused later. These tools reflect the interest of the field in conveying domain knowledge to compilers and in making optimizations portable. With these four tools we have seen two optimization approaches: the first two optimize the code, the last two the algorithm. Which of the two matters most? The answer is obvious: both must be taken into account. Complexity determines the asymptotic speed of an algorithm; yet if we take very simple algorithms such as quicksort and merge sort, both have n log n complexity, and still, on a real machine, quicksort turns out to be much more delicate to implement and, above all, to bring to good performance. The main drawback of these methods is the time needed to find a solution. As we have seen, they must test many code versions, with a rather large number of transformation combinations, and these numerous tests imply a very long computation time. Many methods have therefore been proposed to restrict the search space. To constrain this space, models have been used to drive the transformations of a code according to the machine characteristics captured in the model; for instance, Yotov et al. [90] managed to reach the same performance as ATLAS while avoiding the program-testing step. Machine learning mechanisms [40, 1] also make it possible to adapt to the architecture while keeping greater control over the experiments and their meaning.

1.6 Contributions

In this thesis we create a language for expressing code transformations, together with the search for transformation sequences and their parameters. We then build a library optimization method similar to ATLAS, using the language we created, with an iterative approach that only considers code fragments. To this end we rely on a performance model described at the end of this manuscript.

1.7 Outline

We start with a general study of the different code optimization approaches, summarized in Figure 1.4. We do not consider codes whose optimization is parallel. Two types of code can be distinguished: regular and irregular codes. Regular codes are codes whose execution flow does not depend on the program inputs; irregular codes are the opposite. For regular codes, a polyhedral technique makes it possible to compute on iteration domains [36], abstracting away from the code and transforming it afterwards [21]. For both types of code the optimizations are the same. The first is, of course, assembly-level optimization, which consists in taking part of a code and optimizing the function in machine-specific code. Iterative optimizations adapt to chosen parameters and take care of applying the changes. There are different sub-types of iterative search. One belongs to the exhaustive branch, where all the parameters of a program and all the possible combinations of transformations are tested. Machine learning [1, 40] and models [90, 4, 37] change the time complexity of the latter by reducing the search time for the transformations, either by having more information about the architecture or about the effect of the transformations on it. Our work therefore lies in the iterative branch, with an exhaustive part and a model-based part towards the end of the thesis.

Fig. 1.4 – Overview of sequential code optimizations: source code is either regular or irregular; the optimization approaches are polyhedral (low-level transformations impossible), assembly-level (specific to the architecture) and iterative; iterative search itself splits into machine learning (nothing is learned about the architecture itself), exhaustive search (long running time) and model-based approaches.

In this introduction we have seen the significant work required to communicate with machines so that they produce efficient code for scientific applications. Setting up specialized tools is a necessary step to reach the peak performance of an architecture. Code transformations, and the search for their best parameters, are essential to adapt the code to the machine. This is what the domain-specific generators seen above do. Their approach, however, although portable, is too limited to their application domain: the user cannot interact with these tools to apply other transformations or use them on other codes. To fill these gaps, we first propose an approach that allows transformations to be applied in a general way to a given program. This rather simple idea is enriched by the study of a purely generative method, which lets us examine the feasibility of such an approach and show its advantages as well as its limits. Pragmatically, we then use a more structured method that enables transformations which are not possible with a purely generative one. Once this goal is reached, namely a language that is functional and useful for expressing transformations and searching for their parameters, the road to performance is still long. We must therefore move towards a more efficient solution that can turn a naive code into an efficient one. To this end, in a second step, we use the language we created to develop an optimization method that relies on dividing the code into several fragments so as to make the compiler's task easy. We will see that this original method, which is both portable (through source-to-source transformations) and general (no target domain), achieves performance comparable to libraries whose optimizations are hand-written in assembly code. We conclude this thesis with the possible improvements to this method and a discussion of future work.

Chapter 2

Meta-programming Languages for High-Performance Computing

The quality of compiler-optimized code for high-performance applications is far behind what optimization and domain experts can achieve by hand. Although it may seem surprising at first glance, the performance gap has been widening over time, due to the tremendous complexity increase in microprocessor and memory architectures, and to the rising level of abstraction of popular programming languages and styles. This chapter explores in-between solutions, neither fully automatic nor fully manual ways to adapt a compute-intensive application to the target architecture. By mimicking complex sequences of transformations useful to optimize real codes, we show that generative programming is a practical means to implement architecture-aware optimizations for high-performance applications. Code written in a general-purpose programming language is too difficult to compile well: a tool is needed because architectures are too complex and too poorly documented to generate efficient code directly, and the compiler has neither architecture information nor knowledge of the program domain. Using a domain-specific language could be one way to improve performance, but the drawback of this choice is that it remains tied to one application rather than to general code, much like the use of specialized libraries. In our approach, we prefer to offer the scientific community a more general language that makes it easier to express transformations as well as to search for their parameters. In this chapter we study the possibility of obtaining efficient programs from a programming language that can generate code matching the characteristics of the target machine. We present two languages that allow optimizations to be expressed and different combinations of transformations to be tested: MetaOCaml and X-Language. This work shows that meta-programming methods can reach the performance of hand-optimized code.

2.1 Motivation

Different approaches tackle the improvement of code performance. Indeed, computation takes an ever larger place in software. There are two ways to optimize these codes: improve the architecture's ability to adapt itself to the codes, or improve the code to fit the machine's features. We chose the latter. A compilation method adapted to high-performance programs is therefore required, and it has to take the architectural features into account. Empirical search has been studied in the context of compiler transformations [51] and library generators. Thus, ATLAS [86], a linear algebra library generator, searches the space of possible forms of matrix-matrix multiplication routines. The different forms vary in the size of tiles, degree of unrolling, and schedule of operations. The SPIRAL [71] and FFTW [38] signal processing library generators search a space consisting of implementations of different formulas representing the transform to be implemented. In the case of library generators, empirical search leads to performance improvements of an order of magnitude over good generic libraries that have not been tuned for a particular machine. Empirical search can also be applied manually by a programmer. The idea is for the programmer to write the application in terms of several parameters whose best values for a particular target machine are determined by empirical search. Parameters can specify values such as the degree of unrolling of a given loop or a tile size; they can also represent completely different ways of carrying out a computation or part of it, by numbering the different strategies and making this number one of the parameters whose value is to be identified (a small sketch of such a parameterized kernel is given after Figure 2.1). Library generators have managed to generate code coming close to peak performance by adapting to the machine's features; this is the case of ATLAS [86], whose goal is to generate efficient code for linear algebra computations by producing several versions of a program and selecting the best, rather than relying on naive code. FFTW [38] and SPIRAL [72] use the same method for signal processing applications. Although effective and portable, their operation is limited to their application domain. Figure 2.1 shows the general structure of such library generation tools.

Figure 2.1: Programming adaptive library generators
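As a minimal illustration of the parameters mentioned above, here is a hypothetical version-parameterized kernel in C. The macro name NB and its default value are ours, not those of any particular generator; an external driver would simply recompile and time the code for each candidate value, which mirrors the tile-size search shown later in Figure 2.3.

    /* Hypothetical sketch of an empirically tuned kernel: a driver recompiles
       this file with different -DNB=... values, times each version, and keeps
       the fastest one. */
    #ifndef NB
    #define NB 60              /* candidate tile (block) size */
    #endif

    void mmm_blocked(int n, const double *A, const double *B, double *C)
    {
        for (int ii = 0; ii < n; ii += NB)
          for (int jj = 0; jj < n; jj += NB)
            for (int kk = 0; kk < n; kk += NB)
              /* mini-MMM on one NB x NB x NB block */
              for (int i = ii; i < ii + NB && i < n; i++)
                for (int j = jj; j < jj + NB && j < n; j++) {
                    double acc = C[i*n + j];
                    for (int k = kk; k < kk + NB && k < n; k++)
                        acc += A[i*n + k] * B[k*n + j];
                    C[i*n + j] = acc;
                }
    }

In this toy setting, a simple script looping over candidate NB values (compile, run, time, keep the best) already plays the role of the empirical search engine.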

Indeed, the path from research prototypes to production-quality optimizers has been more difficult than expected, and advanced loop-nest and interprocedural optimizations are still performed manually by application programmers. The main reasons are the following:

• driving and selecting profitable optimizations is increasingly difficult, due to the complexity and dynamic behavior of modern processors [52, 24, 68];

• domain-specific knowledge unavailable to the compiler can be required to prove optimization legality or profitability [12, 60];

13 • hard-to-drive transformations are not available in compilers, including transformations whose profitability is difficult to assess or whose risk of degrading performance is high, e.g., speculative optimizations [6, 73];

• complex loop transformations do not compose well, due to syntactic constraints and code size increase [21];

• some optimizations are in fact algorithm replacements, where the selection of the most appropriate code may depend on the architecture and input data [61].

Figure 2.2: Iterative search. An iterative driver repeatedly generates, compiles and executes versions of an application kernel template, applying optimizations 1 to n; execution feedback lets the driver select the best parameter values.

Figure 2.3: Influence of parameter selection with the tile size NB (performance in MFLOPS as NB varies from 50 to 250).

Application-specific solutions. It is well known that manual optimizations degrade portability: the performance of a C or Fortran code on a given platform does not say much about its performance on different architectures. Several works have successfully addressed this issue, not by improving the compiler, but through the design of application-specific program generators, a.k.a. active libraries [85]. Such generators often rely on feedback-directed optimization to select the best generation strategy [76], but not exclusively [90]. The most popular examples are ATLAS [86] for dense matrix operations and FFTW [38] for the fast Fourier transform. Such generators follow an iterative optimization scheme, as depicted in Figure 2.2. In the case of ATLAS, an external control loop generates multiple versions of the optimized code, varying the optimization parameters, and an empirical search engine drives the selection of the best parameters. E.g., Figure 2.3 shows the influence of the tile size of the blocked matrix product on performance on the AMD Athlon MP processor; as expected from the two-level cache hierarchy of the processor, three main intervals can be identified, corresponding to the temporal reuse of cached array elements; it is harder to predict and statically model the pseudo-periodic variations within each interval, due to alignment and associativity conflicts in caches [90]. Most transformations applied in these generators have been previously proposed for traditional compilation frameworks, but existing compilers fail to apply them for the aforementioned reasons. Conversely, optimizations often involve domain-specific knowledge, from the specialization and interprocedural optimization of library functions [26, 16] to application-specific optimizations such as algorithm selection [61]. Recently, the SPIRAL project [72] investigated a domain-specific extension of such program generators, operating on a domain-specific language of digital signal processing formulas. This project is one step forward to bridge the gap between application-specific generators and generic compiler-based approaches, and to improve the portability of application performance. In this chapter we study two ways of using meta-programming for high-performance computing. In the first part, we show the use of MetaOCaml as a multi-stage language allowing the generation and transformation of multi-stage codes. We then see that such a language brings some limitations, and we introduce a macro-language that can express high-level loop transformations.

2.2 Features of a meta-programming language for high-performance computing

In this part, we study the different features of a language for expressing transformations and the ways of using them.

1. Elementary transformations. The first features that come to mind are constructs to generate multiple versions of a statement by applying elementary transformations to a statement. Elementary transformations are widely used transformations that cannot be conveniently cast in terms of other simpler transformations. For program optimization, the targets of the transformations are usually compound statements, and the transformations typically manipulate the order of execution and the control structure of the components. For sequences of assignment statements, typical elementary transformations are statement reordering, replication, and deletion. Loop transformations include unrolling, interchanging, stripmining, fusion, fission, and scalar replacement. We also consider loop tiling an elementary transformation, although in theory it can be represented as a combination of stripmining and interchanging. Some loop scheduling transformations, such as software pipelining, are also to be considered as elementary transformations. The reason is that, although scheduling can be represented as a sequence of simpler transformations, it is usually difficult to do so.

loop fission [3] breaks a loop into multiple loops over the same index range, each taking only a part of the original loop body. The goal is to break a large loop body into smaller ones to achieve better locality of reference. It is the reverse of loop fusion.

(before)
int i, a[100], b[100];
for (i = 0; i < 100; i++) {
  a[i] = 1;
  b[i] = 2;
}

(after)
int i, a[100], b[100];
for (i = 0; i < 100; i++) {
  a[i] = 1;
}
for (i = 0; i < 100; i++) {
  b[i] = 2;
}

loop fusion [3] attempts to reduce loop overhead. When two adjacent loops would iterate the same number of times (whether or not that number is known at compile time), their bodies can be combined as long as they make no reference to each other's data. It also makes the compiler's work easier.

(before)
int i, a[100], b[100];
for (i = 0; i < 100; i++) {
  a[i] = 1;
}
for (i = 0; i < 100; i++) {
  b[i] = 2;
}

(after)
int i, a[100], b[100];
for (i = 0; i < 100; i++) {
  a[i] = 1;
  b[i] = 2;
}

loop interchange [3] improves the cache performance of array accesses. Cache misses occur when successively accessed array elements within the loop do not lie on the same cache line; loop interchange can help prevent this. The effectiveness of loop interchange depends on, and must be assessed in light of, the cache model of the underlying hardware and the array layout model used by the compiler.

(before)
for (i = 0; i < 10; i++)
  for (j = 0; j < 20; j++)
    a[j][i] = i + j;

(after)
for (j = 0; j < 20; j++)
  for (i = 0; i < 10; i++)
    a[j][i] = i + j;

loop unrolling [3] duplicates the body of the loop multiple times, in order to decrease the number of times the loop condition is tested and the number of jumps, which hurt performance by impairing the instruction pipeline. Completely unrolling a loop eliminates all loop overhead, but requires that the number of iterations be known at compile time.

(before)
for (int x = 0; x < 100; x++)
{
  f(x);
}

(after)
for (int x = 0; x < 100; x += 5)
{
  f(x);
  f(x+1);
  f(x+2);
  f(x+3);
  f(x+4);
}

loop tiling / loop blocking partitions a loop's iteration space into smaller chunks or blocks, so as to help ensure that the data used in the loop stays in the cache until it is reused. The partitioning of the iteration space leads to a partitioning of large arrays into

smaller blocks, thus fitting the accessed array elements into the cache, enhancing cache reuse and reducing the cache capacity requirements.

(before)
for (i = 0; i < 1024; i++)
{
  C[i] = A[i]*B[i];
}

(after)
for (i = 0; i < 1024; i+=4)
{
  for (ii = 0; ii < 4; ii++)
  {
    C[i+ii] = A[i+ii]*B[i+ii];
  }
}

vectorization performs several scalar operations with a single vector instruction. The iterations of the resulting code are independent and can therefore be executed with vector instructions. In the following example, we assume vectors of size 4:

(before)
for (i = 0; i < 1024; i++)
{
  C[i] = A[i]*B[i];
}

(after)
for (i = 0; i < 1024; i+=4)
{
  C[i:i+3] = A[i:i+3]*B[i:i+3];
}

Even when the execution time of a vector instruction is taken into account, the vector version is faster than the corresponding sequence of scalar instructions.

scalar promotion [3] replaces the use of an array element by the use of a scalar. The goal is to avoid redundant loads from the array.

(before)
while (j < maximum - 1)
{
  j = j + (4+array[k])*pi+5;
}

(after)
int calcval = (4+array[k])*pi+5;
while (j < maximum - 1)
{
  j = j + calcval;
}

software pipelining [55, 67] is a transformation that reschedules the instructions of a loop. It is an alternative technique for scheduling VLIW processors. In software pipelining, iterations of a loop in a source program are continuously initiated at constant intervals, without having to wait for preceding iterations to complete. That is, multiple iterations, in different stages of their computations, are in progress simultaneously. The steady state of this pipeline constitutes the loop body of the object code. The advantage of software pipelining is that optimal performance can be achieved with compact object code. The concept of software pipelining can be illustrated by the following example. Suppose we wish to add a constant to a vector of data. Assuming that the addition is one-stage pipelined, the most compact sequence of instructions for a single iteration is:

1   Read
2   Add
3
4   Write

Different iterations can proceed in parallel to take advantage of the parallelism in the data path. In this example, an iteration can be initiated every cycle, and this optimal throughput can be obtained with the following piece of code:

1      Read
2      Add     Read
3              Add     Read
4  L:  Write   Add     Read    CJump L
5      Write   Add
6      Write
7      Write

Instructions 1 to 3 are called the prolog: a new iteration is initiated every instruction cycle and executes concurrently with all previously initiated iterations. The steady state is reached in cycle 4, and this state is repeated until all iterations have been initiated. In the steady state, four iterations are in progress at the same time, with one iteration starting up and one finishing off every cycle. (The operation CJump L branches back to label L unless all iterations have been initiated.) On leaving the steady state, the iterations currently in progress are completed in the epilog, instructions 5 through 7. The software-pipelined loop in this example executes at the optimal throughput rate of one iteration per instruction cycle, which is four times the speed of the original program. The potential gain of the technique is even greater for data paths with higher degrees of pipelining and parallelism.

Each of these transformations is understandable by the user. No new transformation is created here; we only define a new way to combine existing ones. Reordering statements in a code requires taking control dependences and data dependences into account. Dependence analysis [66] determines whether it is safe to apply a transformation to the source code.

2. Composition of transformations. Usually, the best version of a statement is the result of applying several elementary transformations. Thus, for example, ATLAS applies interchanging, tiling, unrolling and scheduling to the triply nested matrix-matrix multiplication loop during its empirical search for an optimal form of the loop. Therefore, our language should allow the application of multiple transformations to a single statement. An example of a composite transformation is unroll and jam, shown in Figure 2.4. This transformation can be implemented by applying an outer unroll followed by fusion of the two inner loops. Alternatively, unroll&jam can be implemented by first stripmining the outer loop, then interchanging the inner loop with the newly generated loop, and finally unrolling the innermost loop. An important form of transformation composition is conditional composition, where a condition is used to select the transformation or the parameter value of a transformation. For example, consider a loop that is to be first stripmined and then the resulting inner loop unrolled. We may want to fully unroll the inner loop, but only when the size of the strip is less than a certain threshold, and partially unroll it otherwise (a small sketch of such a conditional composition is given after this list).

3. Procedural Abstraction. For composite transformations, it is convenient to have procedural abstractions to encapsulate new transformations and to avoid having to rewrite sequences of transformations that are applied more than once.

4. A mechanism to define new transformations. This extension mechanism enables the user to add new transformations that cannot be represented as a composition of elementary transformations. In particular, programmers should be able to generate application-dependent transformations that take into account the semantics of the computation. The simplest way to represent a transformation is using transformation rules, which are adequate to

represent many transformations. The transformation rules consist of a code template followed by the form resulting after modification by the transformation. For instance, a stripmine transformation with a tile of size 4 could be defined as follows:

for (i = 0; i < N; i++) { }
->
for (ii = 0; ii < (N/4)*4; ii += 4)
  for (i = ii; i < ii+4; i++) { }
for (i = (N/4)*4; i < N; i++) { }

Transforming the top code template into the bottom code is the stripmine transformation, where the loop body placeholder (left empty in the template) stands for the body of the loop to be stripmined. As the example illustrates, transformation rules are quite convenient. However, since transformation rules are not universal, some transformations must be represented as a program written in, for example, a conventional programming language. In this case, the interface between the source language and the transformation routines must be clearly specified. This interface should contain the abstract syntax tree of the code to be transformed and perhaps other related information such as dependence graphs.

5. A mechanism to name statements. When applying a sequence of transformations, it is often necessary to apply one of the transformations to one of the components of the resulting code. For example, to implement unroll&jam, unrolling is applied to the innermost loop resulting from stripmining. Therefore, the ability to name components and subcomponents of statements is necessary to enable the composition of transformations.
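As announced in item 2, here is a minimal sketch of a conditional composition; it assumes code generators full_unroll and partial_unroll in the style of those defined in Section 2.5.1, and the function unroll_strip and its threshold are illustrative only.

(* Conditional composition: after stripmining, fully unroll the inner strip
   only when the strip size is below a threshold; otherwise unroll partially. *)
let unroll_strip threshold strip_size lb ub body =
  if strip_size <= threshold
  then full_unroll lb ub body          (* small strip: full unrolling *)
  else partial_unroll lb ub 4 body     (* large strip: partial unrolling by 4 *)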

Transforming a code is an effective way to reach high performance, but the right sequence of transformations is hard to find. The aim of such a language is to provide a way to express transformations without making the code unreadable: for a user, understanding the transformations is easier than reading the code that results from applying them. The goals of this language are the following:

[Code listing damaged in extraction; it illustrated unroll and jam, showing the original loop nest, the nest after unrolling the outer loop, and the nest after fusing (jamming) the two copies of the inner loop (see item 2 above).]

Figure 2.4: Unroll and Jam

• source-to-source transformations

• express iterative search

• transformation readability

• processor efficiency compared with hand-coded software

• cost of designing and implementing

These goals contrast with those of Domain-Specific Languages (DSLs). A DSL is created specifically to solve problems in a particular domain and is not intended to solve problems outside of it, whereas general-purpose programming languages are created to solve problems in many domains.

Language Constraints. In this work, we treat only regular codes with static control and perfectly nested loops. The language has no if-conditions and only for-loops. For MetaOCaml, the original code must be rewritten as MetaOCaml code. For the other approaches, pragmas can be added to the initial code to drive its transformation; the input is then C code. Our approach performs source-to-source transformations; to implement them, we choose to generate the transformations from a program, so a generation tool is needed.

2.3 MetaOCaml, a purely generative approach

We choose to use a metaprogramming language to create our transformation framework for loops. In this part, we have decided to study a purely generative approach. This work explores the promises of generative programming languages and techniques for the high-performance computing expert. We show that complex, architecture-specific optimizations can be implemented in a type-safe, purely generative framework. We also show that peak performance is achievable through the careful combination of a high-level, multistage evaluation language — MetaOCaml [14] — with low-level code generation techniques. Yet, the applicability of generative approaches is balanced with technical caveats and practical implementation barriers. More specifically:

• On the “positive” side, we show that program generation offers hope for major productivity improvements in the semi-automatic/semi-manual optimization of performance-critical applications. Indeed, large gains can be obtained without more general reflexive meta-programming (with introspection, rewriting, etc.).

• On the “negative” side, we (re)discover intrinsic limitations of generative approaches, and we identify some weaknesses specific to type-safe multistage evaluation. In most cases, we show that these weaknesses are not fundamental limitations, but reduce the gains in programmer productivity, or make the approach accessible to functional programming or compilation experts only.

• From these mixed results, we explore possible improvements and motivate further research, using MetaOCaml as a reference multistage language. One of the research directions consists in combining high-level, type-safe program generation with a back-end imperative code generation stage to eliminate abstraction overhead; this approach is similar to offshoring [30].

Section 2.5 surveys typical transformations for high-performance computing, their implementation in a multistage evaluation framework, and the impact of their design on programmer productivity. Section 2.5.3 revisits a simple but instructive and realistic example, the matrix-matrix product. To achieve competitive performance with this abstract program generation approach, Section 2.5.4 describes a simple set of combinators to eliminate the abstraction penalty and generate C code from MetaOCaml with strong safety guarantees.

2.4 Prerequisites of MetaOCaml

We do not need to present Xavier Leroy's language, OCaml [74], at length. Caml is a general-purpose programming language, designed with program safety and reliability in mind. It is very expressive, yet easy to learn and use. Caml supports functional, imperative, and object-oriented programming styles. It has been developed and distributed by INRIA, France's national research institute for computer science, since 1985.
MetaOCaml [30] is a multi-stage extension of the OCaml programming language. It provides three basic constructs called Brackets, Escape, and Run for building, combining, and executing future-stage computations, respectively. MetaOCaml is a compiled dialect of MetaML. The three constructs are used as follows:

Brackets (written .<...>.) can be inserted around any expression to delay its execution. MetaOCaml implements delayed expressions by dynamically generating source code at runtime. While using the source code representation is not the only way of implementing MSP languages, it is the simplest. The following short interactive MetaOCaml session illustrates the behavior of brackets:

let a = 1+2;;
val a : int = 3
let a = .<1+2>.;;
val a : int code = .<1+2>.

Lines that start with let are what is entered by the user, and the following lines are what is printed back by the system. Without the brackets around 1+2, the addition is performed right away. With the brackets, the result is a piece of code representing the program 1+2. This code fragment can either be used as part of another, bigger program, or it can be compiled and executed. In addition to delaying the computation, Brackets are also reflected in the type. The type in the last declaration is int code. The type of a code fragment reflects the type of the value that such code should produce when it is executed. Statically determining the type of the generated code allows us to avoid writing generators that produce code that cannot be typed. The code type constructor distinguishes delayed values from other values and prevents the user from accidentally attempting unsafe operations (such as 1 + .<5>.).

Escape (written .~...) allows the combination of smaller delayed values to construct larger ones. This combination is achieved by “splicing in” the argument of the Escape in the context of the surrounding brackets:

let b = .< .~a * .~a >. ;;
val b : int code = .<(1 + 2) * (1 + 2)>.

This declaration binds b to a new delayed computation (1+2)*(1+2).

Run (written .!...) allows us to compile and execute the dynamically generated code without going outside the language:

let c = .! b;;
val c : int = 9

MetaOCaml is thus a meta-programming language: with these three constructs it builds, combines, and runs code. As we said previously, the language we need must be able to generate code; MetaOCaml meets this requirement.
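To see the three constructs working together, here is a small, classical staging example (ours, not one of the thesis experiments): a power function specialized on its exponent at generation time.

(* Staged power: the exponent n is known at generation time,
   the base x is a future-stage value. *)
let rec power n x =
  if n = 0 then .< 1 >.
  else .< .~x * .~(power (n-1) x) >.

(* Brackets and Escape build a specialized cube function,
   equivalent to fun x -> x * (x * (x * 1)) ... *)
let cube_code = .< fun x -> .~(power 3 .< x >.) >.

(* ... and Run compiles and executes it. *)
let cube = .! cube_code
let v = cube 5   (* v = 125 *)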

2.5 Generative Strategies for Loop Transformations

The main program transformations for high performance target regular loop nests operating over arrays. Most of them have been defined for an imperative, intraprocedural setting. Although more and more interprocedural program analyses are integrated into some modern compilers, few advanced interprocedural optimizations have been proposed. Advanced optimizations often fail to be applied by the compiler, if at all implemented, and some important optimizations would require too much domain-specific knowledge or miss important hidden information (in libraries or input data structures) [16]. From a compiler perspective, there seem to be two means to improve the situation: either the programmer must be given a way to teach the compiler new analyses and optimizations, how to drive them, and possibly when to apply them (overriding static analysis), or the programmer must implement a generator for a class of programs to produce optimized code automatically [60]. Multistage evaluation primarily supports the second direction, and we will survey how typical transformations can be revisited in a multistage setting.

Very close to our work, the TaskGraph active library [11] provides multistage evaluation and loop transformation support for adaptive optimization of high-performance codes (numerical and image processing, in particular). We share similar goals with this work, but we impose an additional constraint on the meta-programming that may occur during program generation: we aim for a purely generative approach, where code is only produced through multistage evaluation (brackets, escape, run), whereas the TaskGraph approach implements loop transformations as high-level transformations of an abstract code representation. As a consequence, the TaskGraph library embeds a full restructuring compiler in the generator, where we only require back-end code generation. In addition, our purely generative approach avoids pattern mismatches and syntactic limitations that are likely to happen on the TaskGraph IR [11].

More fundamentally, the availability of type systems (with inference) is critical for program debugging and verification; such type systems exist for purely generative languages, but seem hard to extend to reflexive meta-programming [79]. It is also important for a program manipulation language to support equational reasoning, hence to facilitate the design of advanced analyses and optimizations [79]. For these practical and fundamental reasons, we choose to stick with a purely generative framework in the following. The main goal is to understand the impact of this choice on expressiveness and programmer productivity, in the context of program manipulations for high-performance computing.

2.5.1 Primitive Transformations

In Figure 2.5, we recall some basic syntactic features of MetaOCaml: .< expression >. denotes the code expression value of type code that evaluates to expression; the notation .~value used within a code expression is the escape syntax to embed value (defined in the generator environment) into the code expression; the run syntax .!code (to execute a code expression)

let rec full_unroll lb ub body =
  if lb > ub then .< () >.
  else if lb = ub then body .< lb >.
  else .< begin .~(body .< lb >.); .~(full_unroll (lb+1) ub body) end >.

val full_unroll : int -> int -> (('a, int) code -> ('b, unit) code) -> ('b, unit) code = <fun>

let a = Array.make 100 0

let body = fun i -> .< a.(.~i) <- .~i >.

full_unroll 0 3 body

.< begin a.(0) <- 0; a.(1) <- 1; a.(2) <- 2; a.(3) <- 3 end >.

Figure 2.5: Full unrolling example.

is not used in this example; the rest of the syntax is pure OCaml. Also, vertical bars on the left delimit values returned by the toplevel. For the sake of clarity, cross-stage persistent values are denoted by the variable referring to them in the generator code (this was always possible for the examples in this chapter), hiding the (* cross-stage persistent value *) comment introduced by MetaOCaml.
Let us consider a simple example: loop unrolling. This standard transformation — also called partial loop unrolling — can be decomposed into a bound and stride recomputation step and a body duplication step. The second step is called full unrolling and is a straightforward application of multistage evaluation. The code in Figure 2.5 is a recursive multistage implementation of full loop unrolling; body is a function whose single argument is the value of the loop counter for the loop being unrolled, and lb and ub are the loop bounds. To enable substitution with an arbitrary expression, notice that argument i is a code expression, not a plain integer. Although quite simple, implementing this transformation has two important caveats.

1. Free variables in code expressions, such as i in this example, are important for specialization (or versioning). In MetaOCaml, all variables must eventually be bound, e.g., as function arguments. Partial application handles practical cases where several generation steps with multiple specializations are needed (see the sketch after this list).

2. The arguments of full_unroll carry structured information (loop bounds and body), as opposed to its “flat” return value (code expression). Since code expressions are immutable in a purely generative approach, it will not be easy to compose such a generation function with further transformation steps, unless subsequent transformations leave the duplicated loop bodies invariant.
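For the first caveat, here is a minimal sketch of staged specialization by partial application; it reuses the array a and the full_unroll generator of Figure 2.5, and the strided assignment is purely illustrative.

(* Two generation steps: the stride is fixed first, the loop counter
   is spliced in later by full_unroll. *)
let strided_body stride = fun i -> .< a.(.~i * stride) <- .~i >.
let body4 = strided_body 4            (* first specialization step: stride = 4 *)
let code  = full_unroll 0 3 body4     (* second step: full unrolling *)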

The second issue highlights a fundamental problem with program generation, when code expressions produced by a program generator may not evolve beyond the predefined set of arguments of a function or the predefined escape points in the expression. Many optimizations

built of sequences of simpler transformations cannot be implemented as a composition of generators. This is a major difficulty since transformation composition is key to practical optimization strategies for high performance. Indeed, function composition is key to the building of complex code generators from primitive ones.

However, this is not the end of the road for our study, since there are alternatives to function composition to reuse the code of primitive generators. For example, to achieve partial unrolling, we only have to convert the original loop bounds into adjusted bounds with a strided interval, then we can call full_unroll to perform the actual body duplication; see Figure 2.6. Section 2.5.2 will discuss some practical means to reuse code generators in spite of the lack of compositionality.

let partial_unroll lb ub factor body = let number = (ub-lb)/factor in let bound = number*factor in .< begin for ii = 0 to number-1 do .~(full_unroll 0 (factor-1) (fun i -> body .< ii*factor+lb + .~i >.)) done; for i = bound+lb to ub do .~(body .< i >.) done end >.

val partial_unroll : int -> int -> int -> (('a, int) code -> ('a, unit) code) -> ('a, unit) code = <fun>

let body = fun i -> .< a.(.~i) <- .~i >. in partial_unroll 1 14 4 body

.< begin
     for ii_2 = 0 to (3 - 1) do
       begin
         a.(((ii_2 * 4) + 1) + 0) <- ((ii_2 * 4) + 1) + 0;
         a.(((ii_2 * 4) + 1) + 1) <- ((ii_2 * 4) + 1) + 1;
         a.(((ii_2 * 4) + 1) + 2) <- ((ii_2 * 4) + 1) + 2;
         a.(((ii_2 * 4) + 1) + 3) <- ((ii_2 * 4) + 1) + 3
       end
     done;
     for i_1 = (12 + 1) to 14 do a.(i_1) <- i_1 done
   end >.

Figure 2.6: Partial unrolling

In the high-performance computing context, this simple example and function inlining could be good advocates for multistage evaluation. However, neither inlining nor unrolling causes serious compilation problems today (beyond choosing when and how much to use them).
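As an aside, generation-time inlining is indeed immediate in this setting: a generator-level function applied inside brackets leaves no call in the generated code. A small hypothetical sketch, reusing the conventions of Figure 2.5 (the helper axpy_step and the arrays x and y are ours):

(* The call to axpy_step is evaluated by the generator, so its body is
   inlined at each unrolled iteration; no function call remains at run time. *)
let axpy_step a x y i = .< y.(.~i) <- a * x.(.~i) + y.(.~i) >.
let x = Array.make 100 1 and y = Array.make 100 0
let unrolled_axpy = full_unroll 0 3 (axpy_step 2 x y)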

2.5.2 Composition of Transformations

Figure 2.7 describes a real optimization sequence for the galgel SPEC CPU2000 benchmark [77] (borrowed from [68]). This sequence of 23 transformations on the same program region (a complex loop nest) was manually applied, following an optimization methodology for feedback-directed optimization [68]. The experimental platform is an HP AlphaServer ES45, 1 GHz Alpha 21264C EV68 with 8 MB L2 cache. Each analysis and transformation phase is depicted as a gray box, showing the time difference when executing the full benchmark (in seconds, a negative number is a performance improvement); the base execution time for the benchmark is 171 s. This sequence is out of reach of current compiler technology. Although particularly complex, it is representative of real optimizations performed by some (rare) compilers [47] and some (courageous) programmers [68]. Beyond compositionality, this example also shows how important extensibility (provisions for implementing new transformations) and debugging support (static and/or generation-time and/or dynamic) are.

[Figure 2.7 (diagram, damaged in extraction): the chain of analysis and transformation phases applied to galgel — fusions, fissions, array copy propagation, scalar promotion, strip-mining, loop shifting, unroll and jam, register promotion, instruction splitting and hoisting, interchange — each annotated with its execution-time difference in seconds (A1: -14 s, A2: +24 s, A3: -24 s, A4: -5 s, A5: -6 s).]

Figure 2.7: Optimizing galgel (base 171 s)

Simpler example. Figure 2.8 shows a simpler example where 4 classical loop transformations convert two simple loop nests at the top of the figure into the bloated code fragment below (partially shown). These 4 loop transformations are, in application order, loop interchange, double loop fusion and software pipelining or shifting [3]. Multistage evaluation may clearly help the programmer to write a template for the code below, lifting several parameters and generation phases to automatically customize the code for a target architecture. Unfortunately, this template would not feature much code reuse for other loop optimizations.
The previous study makes clear that the lack of reusability limits the applicability of multistage evaluation for implementing advanced program transformations. Indeed, successful generative approaches should not only enable application programmers to implement a generator for one single program. In practice, code reuse and portability can be achieved using an abstract intermediate representation, from higher-order skeletons — see e.g. [43, 44] — to monadic program composition approaches [62], to more expressive domain-specific languages — see e.g. [83, 72] and the survey by Consel in [60].

Composition of generators. The loop unrolling example shows that the design of a complex code generator from primitive ones is not straightforward: code expressions produced by a generator may not evolve beyond the predefined set of arguments of a function. Any code fragment within a code expression is an invariant for further generation steps. As a result, many optimizations built of sequences of simpler transformations cannot be implemented as a composition of generators. There are of course solutions, consisting in the composition of“generators of code generators” or in using “generators of an abstract intermediate syntax”; these solutions and their impact on productivity will be studied later in this chapter.

Original nests.

for j = 1 to n do for i = 1 to m do a.(i) <- a.(i) + b.(i).(j) * c.(j) done done; for k = 1 to m do for l = 1 to n do d.(k) <- d.(k) + e.(l).(k) * c.(l) done done

After 4 loop transformations. let mn = min(m-4, n) in for x = 1 to mn do a.(1) <- a.(1) + b.(1).(x) * c.(x); a.(2) <- a.(2) + b.(2).(x) * c.(x); a.(3) <- a.(3) + b.(3).(x) * c.(x); a.(4) <- a.(4) + b.(4).(x) * c.(x); for y = 1 to mn do a.(y+4) <- a.(y+4) + b.(y+4).(x) * c.(x); d.(x) <- d.(x) + e.(y).(x) * c.(y) done; d.(x) <- d.(x) + e.(mn-3).(x) * c.(mn-3); d.(x) <- d.(x) + e.(mn-2).(x) * c.(mn-2); d.(x) <- d.(x) + e.(mn-1).(x) * c.(mn-1); d.(x) <- d.(x) + e.(mn).(x) * c.(mn); for y = mn+1 to m-4 do a.(y+4) <- a.(y+4) + b.(y+4).(x) * c.(x); d.(y) <- d.(y) + e.(y).(x) * c.(y) done; d.(m-3) <- d.(m-3) + e.(mn-3).(x) * c.(mn-3); d.(m-2) <- d.(m-2) + e.(mn-2).(x) * c.(mn-2); d.(m-1) <- d.(m-1) + e.(mn-1).(x) * c.(mn-1); d.(m) <- d.(m) + e.(mn).(x) * c.(mn) done; for x = mn+1 to n do for y = 1 to m-4 do ...

Figure 2.8: Composition of loop transformations

Let us first study this limitation on loop unrolling again. Since partial loop unrolling can be decomposed into strip-mining and full unrolling of the inner strip-mined loop, we may implement the strip-mine generator shown in Figure 2.9, hoping that some composition mechanism will allow us to define partial unrolling from strip_mine and full_unroll. Unfortunately, this does not work as smoothly, because full_unroll does not operate on a closed code expression but on a pair of integer arguments (the bounds) and a function on code expressions (the body). Now, the higher-order generators strip_mine and partial_unroll look very similar: leaving aside the recursive implementation of full_unroll, partial unrolling seems almost like a lifted version of strip-mining where the code expression has been extended from the loop body to

let strip_mine lb ub factor body =
  let number = (ub-lb)/factor in
  .< begin
       for ii = 0 to number-1 do
         for i = 0 to factor-1 do .~(body .< ii*factor+lb+i >.) done
       done;
       for i = number*factor+lb to ub do .~(body .< i >.) done
     end >.

val strip_mine : int -> int -> int -> (('a, int) code -> ('a, 'b) code) -> ('a, unit) code = <fun>

let body = fun i -> .< a.(.~i) <- .~i >. in strip_mine 1 14 4 body

.< begin for ii_8 = 0 to (3 - 1) do for i_9 = 0 to (4 - 1) do a.(((ii_8 * 4) + 1) + i_9) <- ((ii_8 * 4) + 1) + i_9 done done; for i_7 = (12 + 1) to 14 do a.(i_7) <- i_7 done end >.

Figure 2.9: Strip-mining

the whole inner loop. Based on these similarities, it is indeed possible to reuse some code, by factoring the common template of strip_mine and partial_unroll. This is the purpose of the generalized strip-mining generator in Figure 2.10. Thanks to the new g_strip_mine generator, it is easy to implement both plain strip-mining and partial unrolling, by composition; see Figure 2.10 again. It is even possible to compose multiple strip-mining and unrolling steps, without ordering constraints, with partial applications of g_strip_mine (a small sketch is given after Figure 2.10).

Beyond composition: code reuse. This brings us to the most important question in evaluating generative programming in our context. Function composition is the most natural means to compose loop transformations, and is not applicable to most of our generators. But is it the only way to achieve our goal? Since we are primarily interested in productivity, we should consider the wider concept of generator code reuse. To overcome the fundamental asymmetry between the structured arguments of generator functions and the flat code expressions they produce, one way is to try to generalize the g_strip_mine higher-order generator — also called combinator — approach. Indeed, when the set of transformations to be composed is known for a given application domain, one may aim for a domain-specific set of combinators, where each of them captures a core generative construct: e.g., a skeleton, a control structure or a data structure. In other words, this approach consists in defining "generators of code generators" and composing these higher-order generators (combinators).

let g_strip_mine g_loop lb ub factor body =
  let number = (ub-lb)/factor in
  .< begin
       for ii = 0 to number-1 do
         .~(g_loop 0 (factor-1) (fun i -> (body .< ii*factor+lb+ .~i >.)))
       done;
       for i = number*factor+lb to ub do .~(body .< i >.) done
     end >.

let g_loop_gen lw ub body = .< for i = lw to ub do .~(body .< i >.) done >.

(* Plain strip-mining *) g_strip_mine g_loop_gen 1 14 4 body

(* Partial unrolling *) g_strip_mine full_unroll 1 14 4 body

Figure 2.10: Factoring strip-mining and unrolling
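As announced above, here is a small sketch of such a composition by partial application, reusing body, full_unroll and g_strip_mine from Figures 2.5 and 2.10; the bounds and factors are illustrative.

(* Two levels of strip-mining, with the innermost strip fully unrolled:
   the inner g_strip_mine, partially applied on full_unroll and factor 4,
   plays the role of the loop generator of the outer one. *)
let body = fun i -> .< a.(.~i) <- .~i >.
let doubly_tiled =
  g_strip_mine (fun lb ub b -> g_strip_mine full_unroll lb ub 4 b) 0 63 16 body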

This higher-order generator approach requires some agility in passing continuations until the very last composition step, where code expressions are generated and assembled. Also, our intuition is that the larger and more expressive the set of transformations, the less semantically rich the associated set of combinators, hence the smaller the productivity gains for the domain developer.
Alternatively, to design domain-specific generators, a general approach is to build higher-level constructs from primitive code combinators that operate on a suitable intermediate representation. The difference seems very thin compared with the implementation of a domain-specific language compiler, a daunting task with low productivity for the programmer. Yet one may expect many benefits from domain-specific knowledge, from the static checking of MetaOCaml, and from the support of polymorphic and functional code expressions. For example, building an abstract intermediate representation for loop transformations may be as simple as a polyvariant tree of code expressions and loop triplets (bounds and stride).
After carrying out this survey, we come to the conclusion that, although theoretically sufficient, a compositional approach based on multistage evaluation and abstract intermediate representations would hardly be satisfactory for most programmers. Indeed, the effort to master the language may not be worth its safety and abstraction. In addition, given the complexity and variety of program transformations for high-performance computing, some implementation errors may not even be avoided by such an approach (including array dependences). The next section develops our argumentation, proposing alternative ways to improve compositionality.

Alternative Ideas

The polytope model has been proposed to represent loop nests and loop transformations in a unified fashion [36, 87, 48]. Although constrained by regularity assumptions on control structures, it is applicable to many loop nests [10] in scientific applications and captures loop-specific information as systems of affine inequalities. Recent advances in polyhedral code generation [9]

and a new approach to enable compositional transformations [21] fit very well with our search for a generative and compositional optimization framework. Fortunately, there exists a powerful OCaml interface [13] to the two most effective libraries to operate on polyhedra [33, 64, 70]. Together with multistage evaluation, it seems very easy and efficient to design a polyhedral code generator and to couple it with lower-level back-end optimizations, including scalar promotion. The use of the OCaml language should also facilitate the implementation of the symbolic legality-checking and profitability analysis involved in polyhedral techniques.
Besides the polytope model, alternative ways to achieve better compositionality would include some form of reification (to further transform closed code expressions). This may take the form of an extension of OCaml's pattern-matching construct, with type-safe term-rewriting rules (type-invariant substitution, polymorphic type specialization, etc.). We believe it is unlikely that a general-purpose, user-extensible transformation framework can be built with multi-stage evaluation as the only meta-programming primitive. Allowing some user control on the generated code is thus a pragmatic solution to consider. The ability to post-process generated code with pattern-based substitution is a powerful tool, yet transformations operating at a more abstract, intensional level — like the polyhedral one — are undoubtedly more elegant.
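As a purely illustrative reminder of what the polytope model captures (this small example is ours, not taken from the cited works): for a triangular loop nest where i ranges over 0..N-1 and j over 0..i, the set of executed iterations is the polytope

D = { (i, j) ∈ Z² | 0 ≤ i ≤ N−1, 0 ≤ j ≤ i },

and a transformation such as loop interchange is simply the unimodular mapping (i, j) → (j, i) applied to D, code generation then rebuilding loops that scan the transformed polytope.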

2.5.3 Generative Implementation of Complex Optimizations

To experiment with a realistic example, we reimplemented part of the ATLAS code generator for matrix product [86, 90].

Revisiting a Custom Generator

Figure 2.11 shows a pseudo-code for the "generative" matrix product (OCaml-like syntax). In ATLAS, the generator spans more than 3500 lines of C; it is cluttered with printfs, obscure index computations, and string substitutions (see file emit.c in DGEMM).1 The first nest, executed by the generator, computes indirection arrays to build the names of all temporary scalar variables in the innermost computation block. The four nested loops on ii, jj, i and j implement the tiled (a.k.a. blocked) matrix product, with a square tile of size B. These tiles are further decomposed into smaller blocks of size MU and NU; the loops iterating on these blocks are fully unrolled, and all temporary results are stored into scalar variables. In the first part, the values of the result array c are loaded into the registers named c_m_n. The second part loads one row of a and one column of b into the sets of registers a_m and b_n, then operates on these registers to compute partial sums for this pair of rows and columns. Loads are decoupled from the actual computations, and additions are decoupled from multiplications, to hide the latency of the L2 cache and of the multiplication. In the third part, the results are stored back into array c. Indirection arrays i1 and i2 hold the automatically generated names for a large collection of scalar variables; these are working arrays holding the precomputed values for ⌊h/NU⌋ and h mod NU for all 0 ≤ h < MU × NU. Moreover, the strange (illegal) syntax t_i1.(m)_i2.(m) denotes the "synthesized" scalar variable t_5_7 when i1.(m) = 5 and i2.(m) = 7. The inner loops on m and n are fully unrolled and operate on these scalar variables.
Starting from the naive three nested loops of the matrix product algorithm, the high-performance computing expert applies four transformations to obtain the code in Figure 2.11. One of those is loop unrolling, a well-known application of multistage evaluation. The three other transformations are more original.

1For the sake of clarity, our pseudo-code and the following experiments operate on integers, whereas DGEMM operates on double-precision floats.

(* executed by the generator *)
h := 0;
for m=0 to MU-1 do
  for n=0 to NU-1 do
    i1.(!h) <- m; i2.(!h) <- n; h := !h+1
  done
done

(* Template of the generated code *) for ii=0 to N-1 step B do for jj=0 to M-1 step B do for i=ii to ii+B-NU step NU do for j=jj to jj+B-MU step MU do (* Two fully unrolled loops to load the values from matrix block *) (* c[i..i+MU-1][j..j+NU-1] into promoted scalar variables c_m_n *) for m=0 to MU-1 do for n=0 to NU-1 do c_m_n := c.(i+m).(j+n) done done; for k=0 to B-1 do (* All fully unrolled loops to load one pair of rows and columns of matrices a and b *) (* then compute their dot product in a pipelined fashion *) for m=0 to MU-1 do a_m := a.(i+m).(k) done; for n=0 to NU-1 do b_n := b.(k).(j+n) done; for m=0 to latency-1 do t_i1.(m)_i2.(m) := !a_i1.(m) * !b_i2.(m) done; for m=0 to MU*NU-latency-1 do c_i1.(m)_i2.(m) := !c_i1.(m)_i2.(m) + !t_i1.(m)_i2.(m); n := m + latency; t_i1.(n)_i2.(n) := !a_i1.(n) * !b_i2.(n) done; for m=MU*NU-latency to MU*NU-1 do c_i1.(m)_i2.(m) := !c_i1.(m)_i2.(m) + !t_i1.(m)_i2.(m) done done; (* Two fully unrolled loops to store the values from scalars *) (* c_m_n back into matrix block c[i..i+MU-1][j..j+NU-1] *) for m=0 to MU-1 do for n=0 to NU-1 do c.(i+m).(j+n) <- !c_m_n done done done (* Postlude... *) done (* Postlude... *) done (* Postlude... *) done (* Postlude... *)

Figure 2.11: Simplified matrix product template

Loop tiling. The outer loops of the matrix product are rescheduled to compute the blocked product for better cache locality [3]. This transformation is usually seen as a composition of strip-mining (making two nested loops out of one) and loop interchange.

None of these transformations makes much use of multistage evaluation. It is possible to implement bound and stride operations and loop generation for strip-mining in the same way as for partial unrolling, but this does not compose with loop interchange at the code expression level. Instead, we had to write an ad-hoc generator for the tiled template of the matrix product. We still get all the benefits of using MetaOCaml instead of strings or a syntax tree (as in ATLAS), but we are not able to reuse the loop tiling code for further applications of the technique on other loop nests.
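For illustration, here is a minimal sketch of what such an ad-hoc tiled-template generator can look like; the function tiled_2d is ours (it is not the thesis generator), the tile size b is fixed at generation time, and the loop bounds remain dynamic.

(* Generate a 2D tiled loop template; body receives the code of the two
   innermost indices and produces the code of the tile body. *)
let tiled_2d b body =
  .< fun n m ->
       for ii = 0 to (n / b) - 1 do
         for jj = 0 to (m / b) - 1 do
           for i = ii * b to ii * b + b - 1 do
             for j = jj * b to jj * b + b - 1 do
               .~(body .< i >. .< j >.)
             done
           done
         done
       done >.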

Scalar promotion. After a second tiling step, the innermost loops are fully unrolled (this is also called unroll and jam [3]) and array accesses in the large resulting blocks are promoted to scalars to enable register reuse (the whole transformation is called register tiling [3]).

This transformation should definitely fit a generative framework since it falls back to straightforward code substitution. However, one may not explicitly craft new variable names in MetaOCaml, since identifiers in let or fun bindings are not first-class citizens: let .~name = ... is not valid syntax. We know two methods to get around this limitation.
The first method assumes dynamic single assignment arrays [32], i.e., arrays whose elements are only assigned once during the execution. Such arrays can be replaced by fresh scalar variables, whose names are automatically generated by the MetaOCaml system, following a monadic continuation-passing style [50]. The interested reader should refer to the latter paper for details. This approach has the advantage of directly generating efficient scalar code (unboxed), but it has most of the cons of programming with monads or with explicit continuations, an unnatural style for programmers of high-performance numerical applications.2

Yet it is not difficult to extend the approach of [50] to more general array access patterns that do not require the single assignment property, i.e., allowing an individual array element to be overwritten multiple times. It is indeed possible to simulate the mutable array elements of matrices c and t with a finite set of boxed variables of type int ref. This solution may produce efficient code if the back-end OCaml compiler is able to unbox the scalar references; otherwise, it has no impact on the effective number of memory operations performed over the matrix block (i.e., it does not perform any scalar promotion). An example and analysis of this approach will be developed in the next section.

Instruction scheduling. To better hide memory and floating point operation latencies, some instructions in the innermost loops are rescheduled in the loop body and possibly delayed or advanced by a few iterations. This optimization improves the effects of the scheduling heuristic in the backend compiler.

At first glance, instruction scheduling and software pipelining seem inadequate for generative languages. Yet, these optimizations fit in a typical scheduling strategy where instructions are extracted from a priority list and generated in order. It is even possible to write a generic list-scheduler [25] on a dependence graph of code expressions and to extend it to modulo-scheduling [63].
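A minimal sketch of such a list scheduler over code expressions is given below; the stmt record, its priorities and the dependence representation are illustrative assumptions, not the thesis implementation.

(* Statements are emitted in priority order once all their predecessors
   in the dependence graph have been emitted. *)
type 'c stmt = { id : int; preds : int list; prio : int; code : ('c, unit) code }

let list_schedule stmts =
  let emitted = Hashtbl.create 17 in
  let ready s = List.for_all (fun p -> Hashtbl.mem emitted p) s.preds in
  let rec loop acc remaining =
    match List.filter ready remaining with
    | [] -> List.rev acc                      (* done, or cyclic dependences *)
    | candidates ->
        let best =
          List.fold_left (fun x y -> if y.prio > x.prio then y else x)
            (List.hd candidates) candidates in
        Hashtbl.add emitted best.id ();
        loop (best.code :: acc) (List.filter (fun s -> s.id <> best.id) remaining)
  in
  loop [] stmts

Modulo scheduling would additionally constrain the chosen cycles by an initiation interval, as in [63].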

The next section describes the MetaOCaml implementation of a (shorter) type-safe generator for the compute kernel in the tiled, scalar promoted and software-pipelined matrix product.

2Also, syntactic sugar like the monadic do notation helps hide the conceptual complexity from the programmer [50].

Type-Safe Generator for the Matrix Product

The greatest challenge is to use the meta-programming facilities of MetaOCaml to express the compute-intensive kernel in a style in which array accesses are replaced by references to single variables. We expect that an advanced native-code compiler for MetaOCaml will use processor registers for the reference variables. We define the three combinators in Figure 2.12.

let rec makeCodeRefs n =
  if n = 0 then []
  else let ref0 = ref 0 in (.< ref0 >.) :: makeCodeRefs (n-1)

val makeCodeRefs : int -> ('a, int ref) code list

let dynamicFor low high step f =
  .< let i = ref .~low in
     while !i <= .~high do (.~f) (!i); i := !i + .~step done >.

val dynamicFor : ('a, int) code -> ('a, int) code -> ('a, int) code -> ('a, int -> 'b) code -> ('a, unit) code

let staticFor low high step f =
  let rec gen a =
    if a > high then .< () >.
    else .< begin .~(f a); .~(gen (a+step)) end >.
  in gen low

val staticFor : int -> int -> int -> (int -> ('a, unit) code) -> ('a, unit) code

Figure 2.12: Array of code and loop iteration combinators

let makeCodeRefs' n cont =
  let rec loop n acc cont =
    if n = 0 then cont acc
    else .< let r = ref 0 in .~(loop (n-1) (.< r >. :: acc) cont) >.
  in loop n [] cont

Figure 2.13: Boxed scalars without cross-stage persistence

• makeCodeRefs n generates a list of n code expressions, each one being a distinct value of type int ref code, initialized to 0. Each expression will be given a name by the MetaOCaml system at run time.3 The first implementation of makeCodeRefs relies on

3The names for the reference variables are constructed internally by the MetaOCaml system (version 3.07).

cross-stage persistence to make each code expression in the list evaluate to a distinct reference at code generation time. This is achieved by binding a local variable ref0 to ref 0 before staging its value and appending the result to the list; appending .< ref 0 >. multiple times would not achieve the expected result: every code expression would evaluate into a unique dynamic reference. Figure 2.13 provides an alternative, less intuitive implementation of makeCodeRefs: to avoid cross-stage persistence, distinct references in the generated code are explicitly bound within the continuation function cont operating on them; this implementation leads to more efficient code due to (current) inefficiencies in the implementation of cross-stage persistence in MetaOCaml.

• dynamicFor creates a for-loop environment in which loops can have a stride greater than or equal to 1; it is based on OCaml while-loops. dynamicFor takes the lower bound, upper bound, stride and a function which abstracts the loop body in the iteration variable. All arguments of dynamicFor are code expressions since they depend on dynamic values.

• staticFor expresses loops that are to be fully unrolled in the program. staticFor takes static lower and upper bounds, a static stride, and a function for the loop body which maps the (static) index to the code expression of the loop body.

Figure 2.14 shows the scalar-promoted loop program for the matrix multiplication, without the body of the loop nest. The static arguments are the unrolling factors mu and nu, and the software pipeline latency. The dynamic arguments are listed in the lambda-abstraction fun a b c nn mm nb. For each of the arrays a, b, t and c we apply the makeCodeRefs combinator and define functions aa, bb, tt and cc which translate each access to an array into an access to the appropriate register.
To use the improved variant of makeCodeRefs, it is sufficient to replace the four lines defining aregs, bregs, tregs and cregs by the code in Figure 2.15 (and to close the 4 additional parentheses at the end). This does not really complicate the code for a programmer familiar with continuation-passing style, but it is not as straightforward as the first, cross-stage persistent version (which was less efficient).
The loop nest body is depicted in Figure 2.16. In the first part, the values of the result array c are loaded into the register set accessed by cc. The second part operates on these registers to compute partial sums for every pair of rows in a and columns in b. In the third part, the results are stored back into the array. The second part is the most interesting; it consists of a dynamic loop on k, whose body consists of sequences of statements, expressed here in terms of our combinator staticFor. First, elements from the arrays a and b are loaded into the register sets indexed through functions aa and bb, respectively. The temporary register set indexed through function tt is filled with initial products; then the pipelined dot product of every pair of rows and columns is computed. At last, when all products have been computed, the results are committed into the register set indexed through function cc.
Figure 2.17 shows a snippet of the code generated by the matmult generator (the optimized version without cross-stage persistence). One may easily recognize part of the initialization code for the temporary variables in the fully unrolled block of the matrix product.

Evaluation and Feedback from the Case Study

At the time of the experiments, no stable native-code MetaOCaml compiler was available. Therefore we did not try to measure the performance of the code directly, since the interpretation overhead is at least of the same order of magnitude as the architectural phenomena we are trying to optimize for.

33 let matmult mu nu latency = let i1 = Array.make (mu*nu) 0 and i2 = Array.make (mu*nu) 0 and h = ref 0 in for m=0 to mu-1 do for n=0 to nu-1 do i1.(!h) <- m; i2.(!h) <- n; h := !h+1 done done; let aregs = makeCodeRefs mu and bregs = makeCodeRefs nu and tregs = makeCodeRefs (mu*nu) and cregs = makeCodeRefs (mu*nu) in let aa i = nth aregs i and bb i = nth bregs i and tt i j = nth tregs (i*nu+j) and cc i j = nth cregs (i*nu+j) in .< fun a b c nn mm nb -> .~(dynamicFor .<0>. .. .. . .~(dynamicFor .<0>. .. .. . .~(dynamicFor .. .. .. . .~(dynamicFor .. .. .. . // Loop nest body in Figure~2.16 >.) >.) >.) >.) >.

val matmult : int -> int -> int -> (’a, int array array -> int array array -> int array array -> int -> int -> int -> int -> unit) code

Figure 2.14: Scalar promotion — loop template

(* ... *)
makeCodeRefs' mu (fun aregs ->
makeCodeRefs' nu (fun bregs ->
makeCodeRefs' (mu*nu) (fun tregs ->
makeCodeRefs' (mu*nu) (fun cregs ->
(* ... *)

Figure 2.15: More efficient variant without cross-stage persistence

However, the generated code expressions "look" very much like the C code emitted by ATLAS. We thus bet that, provided the abstraction penalty of executing MetaOCaml programs can be eliminated, our type-safe, purely generative implementation can match the performance of the custom, string-based generator of ATLAS. The next section will show our early solution to completely eliminate the abstraction penalty and confirm this analysis.
Besides performance, an important issue is code readability and debugging. Static type-checking is a great asset for developing robust meta-programs. Nevertheless, writing a generative template for the tiled, unrolled, scalar-promoted and pipelined matrix product can be a trial-and-error experience. Assembling code expression combinators with escapes, cross-stage persistence, and thorough use of partial application makes the code hard to follow, especially to high-

34 .~(staticFor 0 (mu-1) 1 (fun m -> staticFor 0 (nu-1) 1 (fun n -> .< .~(cc m n) := c.(i+m).(j+n) >.))); for k=0 to nb-1 do .~(staticFor 0 (mu-1) 1 (fun m -> .< .~(aa m) := a.(i+m).(k) >.)); .~(staticFor 0 (nu-1) 1 (fun n -> .< .~(bb n) := b.(k).(j+n) >.)); .~(staticFor 0 (latency-1) 1 (fun m -> .< .~(tt (i1.(m)) (i2.(m))) := !(.~(aa (i1.(m)))) * !(.~(bb (i2.(m)))) >.)); .~(staticFor 0 (mu*nu-latency-1) 1 (fun m -> let n = m+latency in .< begin .~(cc (i1.(m)) (i2.(m))) := !(.~(cc (i1.(m)) (i2.(m)))) + !(.~(tt (i1.(m)) (i2.(m)))); .~(tt (i1.(n)) (i2.(n))) := !(.~(aa (i1.(n)))) * !(.~(bb (i2.(n)))) end >.)); .~(staticFor (mu*nu-latency) (mu*nu-1) 1 (fun m -> .< .~(cc (i1.(m)) (i2.(m))) := !(.~(cc (i1.(m)) (i2.(m)))) + !(.~(tt (i1.(m)) (i2.(m)))) >.)) done; .~(staticFor 0 (mu-1) 1 (fun m -> staticFor 0 (nu-1) 1 (fun n -> .< c.(i+m).(j+n) <- !(.~(cc m n)) >.)))

Figure 2.16: Scalar promotion — loop body

// ... let i_53 = (ref jj_49) in while ((! i_53) <= ((jj_49 + mb_47) - 4)) do ((fun j_51 -> begin begin t_40 := (c_43.(i_50 + 0)).(j_51 + 0); t_39 := (c_43.(i_50 + 0)).(j_51 + 1); t_38 := (c_43.(i_50 + 0)).(j_51 + 2); // ... t_24 := ((! t_4) * (! t_8)); t_23 := ((! t_4) * (! t_7)); t_22 := ((! t_4) * (! t_6)); // ...

Figure 2.17: Scalar promotion — generated code

performance computing experts. We did not investigate interprocedural optimizations beyond classical applications of generative languages: specialization, cloning and inlining. More complex transformations may even combine loop and function transformations [16]. Yet addressing the issues raised by matrix-

matrix product is a necessary step, and we believe it will contribute to understanding how to support interprocedural transformations.
We did not study the legality of the code transformations either. This issue is of course important, although not as critical as in a stand-alone compiler, where the decision to apply a program transformation has to be fully automatic. If we were to check for array or scalar dependences [3] or to evaluate the global impact of data layout transformations, the OCaml type system would be insufficient. Coupling a multistage generator with a static analysis framework seems an interesting research direction.

2.5.4 Safe Meta-Programming in C

There exist several two-stage evaluation extensions of the C language, from the standard C preprocessor to C++ template meta-programming [84, 11], and to the ‘C project [69] based on a fast run-time code generation framework. None of these tools provide the higher order functions on code expressions, cross-stage persistence and static checking features of MetaOCaml. It seems possible to design a multistage C extension with most of these features, probably with some restrictions on the language constructs allowed at the higher stages, but that would be a long-term project. Using MetaOCaml as a generator for C programs seems more pragmatic, at least as a research demonstrator:

1. this may either involve an OCaml-to-C translator, taking advantage of the imperative features of OCaml to ease the generation of efficient C code;

2. or one may design a small preprocessor to embed C code fragments into the lower stages of a MetaOCaml program, relying on a higher level abstract syntax built on code generation combinators [82] to assemble these fragments.

The first solution is very cumbersome when aiming at the generation of standalone "C only" programs. It is unlikely that such an OCaml-to-C translator can ever be designed without simultaneously lacking robustness and performance. In the context of multistage evaluation, offshoring is an original, very promising alternative, whose development is conducted within MetaOCaml itself [30]. Instead of a pure translation, offshoring maps code expressions in an imperative subset of OCaml to a subset of C, and homogeneously binds OCaml and C object code. We plan to study this offshoring approach in the future, but we currently follow the second direction, for two reasons:

Generative programming: using a higher-level abstract syntax avoids transforming generated code, following a pure generative strategy;

Programmer control: an abstract syntax can provide full C expressiveness with no hidden overhead (operating on OCaml values, marshaling and unmarshaling) and no generation-time errors (unsupported syntax).

The rest of this section is not the main point of this chapter, but a technical step to make the MetaOCaml code generators eventually produce machine code without any abstraction penalty. The following abstract combinators support a subset of the C language, and we cannot guarantee safety with respect to all syntactic and semantic requirements of C compilation. Rather, these abstract combinators are useful to translate all imperative constructs needed in our experiments into pure C code.

Safe Generation of C code in OCaml

Our goal is to design a set of C code generation primitives such that an OCaml program calling these primitives may only generate syntactically sound and type-safe C code. This goal is less ambitious than the offshoring approach [30], since we do not aim at transparent communication of OCaml values to staged C code and vice-versa (which involves marshaling in [30]). Instead, we assume complete separation of the OCaml and C worlds: only the code generation part is written in MetaOCaml, the target application being a C-only (or Fortran-only) program. Of course, instead of calling generation primitives explicitly, it would be natural to design a preprocessor converting embedded C syntax into the proper OCaml declarations and calls. This does not bear any fundamental difficulty and is left for future development work. Our current solution does not aim at generating the whole C syntax (it has very limited support for function declarations and calls), and it is not even guaranteed to be fully safe: it should rather be seen as a partially coercive environment for generating safe code, with the ability for a hacker to circumvent the soundness restrictions, but limiting the risk of bugs being unwillingly introduced. A safer solution would necessarily include abstract types and heavy usage of the module system, but we preferred to postpone any work in that direction until a module-aware release of MetaOCaml comes through.

type environment = {
  mutable txt : string;   (* C text being produced *)
  mutable cnt : int;      (* Private counter for alpha-renaming *)
  ind : string            (* Pretty printing (indentation) *)
}

(* Some syntactic elements of C *)
type 'a c_rvalue = RValue of string * 'a
type 'a c_lvalue = LValue of string * 'a
type 'a c_loop   = Cloop of ('a c_rvalue) * (bool c_rvalue) * ('a c_rvalue) * (environment -> unit)

let get_nam var = match var with LValue (n, _) -> n
let get_def var = match var with LValue (_, d) -> d
let get_txt exp = match exp with RValue (t, _) -> t
let get_val exp = match exp with RValue (_, v) -> v
let get_init var = match var with Cloop (n, _, _, _) -> n
let get_cond var = match var with Cloop (_, c, _, _) -> c
let get_iter var = match var with Cloop (_, _, i, _) -> i
let get_body var = match var with Cloop (_, _, _, b) -> b

(* Support function: turns an lvalue into an rvalue *)
let lvrv var = RValue (get_nam var, get_def var)

(* Support function: build a loop object *)
let create_loop (init : 'a c_rvalue) (cond : bool c_rvalue)
                (iterator : 'a c_rvalue) block =
  Cloop (init, cond, iterator, block)

Figure 2.18: Some support functions for C generation

Technical overview. Practically, we mirror the C grammar productions as generator functions operating on specific polyvariant types that represent meaningful C fragments. Each C variable is embedded into an OCaml pair of a string (its name) and a dummy value matching the type of the C variable; these variables can only be declared through the explicit usage of OCaml variables.

(* Output the declaration of <nam> to <env> *)
let gen_int_decl nam cont env =
  let var = nam ^ "_" ^ (string_of_int env.cnt) in
  let txt = sprintf "int %s;\n" var in
  env.txt <- env.txt ^ env.ind ^ txt;
  env.cnt <- env.cnt + 1;
  cont (LValue (var, 0))

(* Output an instruction built of expression <exp> to <env> *)
let gen_inst exp env =
  let txt = sprintf "%s;\n" (get_txt exp) in
  env.txt <- env.txt ^ env.ind ^ txt

(* Output a block <block> to <e>. Notice <block> is a function on environments;
   this allows to defer the evaluation of the generators in the block *)
let gen_block block e =
  let env = {txt = e.ind ^ "{\n"; cnt = 0; ind = e.ind ^ "  "} in
  block env;
  env.txt <- env.txt ^ e.ind ^ "}\n\n";
  e.txt <- e.txt ^ env.txt

(* The following productions are self-explanatory *)

let gen_loop loop env =
  env.txt <- env.txt ^ env.ind ^ "for (" ^ (get_txt (get_init loop)) ^ "; "
             ^ (get_txt (get_cond loop)) ^ "; "
             ^ (get_txt (get_iter loop)) ^ ")\n";
  gen_block (get_body loop) env

let gen_if_then_else (cond : bool c_rvalue) b_then b_else env =
  env.txt <- env.txt ^ "if (" ^ (get_txt cond) ^ ")\n";
  gen_block b_then env;
  env.txt <- env.txt ^ "else\n";
  gen_block b_else env

let gen_int_cst (cst : int) =
  let txt = sprintf "%d" cst in
  RValue (txt, cst)

let gen_assign (var : 'a c_lvalue) (exp : 'a c_rvalue) =
  let txt = sprintf "%s = %s" (get_nam var) (get_txt exp) in
  RValue (txt, exp)

let gen_lt (op1 : 'a c_rvalue) (op2 : 'a c_rvalue) =
  let txt = sprintf "(%s < %s)" (get_txt op1) (get_txt op2) in
  RValue (txt, true)

let gen_add (op1 : 'a c_rvalue) (op2 : 'a c_rvalue) =
  let txt = sprintf "(%s + %s)" (get_txt op1) (get_txt op2) in
  RValue (txt, (get_val op1) + (get_val op2))

Figure 2.19: Some generation primitives for C

Figures 2.18 and 2.19 show the main types, support functions and generation primitives (for intraprocedural constructs only), and Figure 2.20 shows a code generation example. By design, the grammar productions guarantee syntactic correctness. It is more challenging to deal with the scope of C variable declarations and to enforce type safety. We choose to represent a C block (the only placeholder for variable declarations) as a function on an environment. Interestingly, environments both record the code being generated and serve to delay the evaluation of the generation primitives until after the production of the surrounding C block syntax. This continuation-passing style is the key to the embedding of the C scoping and typing rules into the OCaml ones. For example, an integer variable x is simultaneously declared to OCaml and generated to C code when evaluating let x = gen_int_decl "x" env in <continuation>.

Generated C code is appended to the environment env. Further assignments to x may call gen_assign directly, and reads of x must first turn it into an "rvalue" through the lvrv function. Technically, we make use of partial evaluation to η-reduce the environment argument when composing generators. This leads to a syntax closer to the syntax tree these generators reflect,4 reducing the chances of corrupting the generated code by mixing multiple environments. A more thorough application of this η-reduction approach would require rewriting all generators as environment-to-environment transformers; this is not a trivial task since many generators already have a return value whose type is key to the safety enforcement of C code generation. We will further develop this approach when we design a fully safe C code generation library for OCaml.

Generating New Variable Names

Following this scheme, it is easy to support the generation of new variable names, without explicit usage of monads or continuation-passing style [50], and avoiding the possible overhead of the solution proposed in Section 2.5.3.0. Figure 2.21 shows a class to operate on scalar-promoted arrays with generation-time array bound checking. Of course, this checking cannot be done statically in general (although specific cases could be handled, with the proper encoding in the OCaml type system): scalar-promoting an array with out-of-bound accesses yields out-of-bound exceptions. For example, a scalar-promoted array a is simultaneously declared to OCaml and generated to C code when evaluating let a = new sp_int_array "a" 10, and it may further be referenced through the method idx.
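Concretely, the C side of such a scalar promotion is just one declaration per promoted element. A minimal sketch of the output for let a = new sp_int_array "a" 3 (the name_index scheme follows Figure 2.21; the final assignment is an illustrative use of a#idx):

int a_0;
int a_1;
int a_2;
/* a reference through (a#idx 2) would then appear in generated statements as, e.g.: */
a_2 = 5;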

Application to ATLAS

Figure 2.22 highlights the most significant parts of the implementation of a generator template for the matrix-matrix product. In this simple implementation, instruction scheduling is done by hand through an explicit split into three separate unrolled loops. Also, empirical search strategies to drive the optimization (e.g., simulated annealing) are not included. This MetaOCaml generator, provided with the manually identified optimal parameters, produces C code that meets the peak performance of the (integer) matrix-matrix product, for small and medium matrix sizes (less than 192). These results were obtained on two different architectures: an AMD Athlon XP 2800+ (Barton) at 2.08GHz with 512KB L2 cache, and an AMD Athlon 64 3400+ (ClawHammer) at 2.2GHz with 1MB L2 cache.5

Interprocedural Extensions

Currently, support for function declarations and calls is awkward. As for scalar types, specific generator functions must be implemented for each function type. Three OCaml functions are needed per C function type: one for the function prototype, one for the declaration and one for the call. These generators operate on a polyvariant c_function type and enforce the proper prototype/declaration/call ordering of the C language (assuming the generation of a single-file program). It is still unclear whether more polymorphism can be achieved; possibly one triple of generators is required for each given number of arguments.
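For intuition, the ordering these generators enforce is simply the usual C ordering of prototype, definition and call site. A small sketch for a hypothetical two-argument integer function (the name dot2 is illustrative and not part of the generator library):

/* prototype, emitted first */
int dot2(int x, int y);

/* definition */
int dot2(int x, int y) { return x * y; }

/* call site, only legal once the prototype has been emitted */
int r = dot2(3, 4);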

4. Kindly suggested by a reviewer.
5. Failure to reach peak performance on larger matrices is due to the missing page-copying step when iterating across tiles, an important optimization to reduce the pressure on the translation look-aside buffer (TLB). There is no difficulty in implementing this additional step in our framework.

let global = {txt=""; cnt=0; ind=""};;

let generated_block =
  let block env =
    let x = gen_int_decl "x" env
    and y = gen_int_decl "y" env in
    gen_loop
      (create_loop
         (gen_assign x (gen_int_cst 1))
         (gen_lt (lvrv x) (gen_int_cst 10))
         (gen_assign x (gen_add (lvrv x) (gen_int_cst 1)))
         (gen_loop
            (create_loop
               (gen_assign y (gen_int_cst 2))
               (gen_lt (lvrv y) (gen_int_cst 20))
               (gen_assign y (gen_add (lvrv y) (gen_int_cst 2)))
               (fun env ->
                  let z = gen_int_decl "z" env in
                  gen_inst (gen_assign z (gen_int_cst 5)) env))))
      env;
    gen_inst (gen_assign y (gen_add (lvrv x) (gen_int_cst 3))) env;
    gen_if_then_else
      (gen_lt (gen_int_cst 2) (lvrv x))
      (gen_inst (gen_assign x (gen_int_cst 5)))
      (gen_inst (gen_assign x (gen_int_cst 3)))
      env
  in gen_block block global

let _ = print_string global.txt

{
  int x_0;
  int y_2;
  for (x_0 = 1; (x_0 < 10); x_0 = (x_0 + 1))
  {
    for (y_2 = 2; (y_2 < 10); y_2 = (y_2 + 2))
    {
      int z_0;
      z_0 = 5;
    }
  }
  y_2 = (x_0 + 3);
  if ((2 < x_0))
  {
    x_0 = 5;
  }
  else
  {
    x_0 = 3;
  }
}

Figure 2.20: Example of C generation

The next step is of course to write the preprocessor for OCaml-embedded C code, to reduce the burden of writing code like the block generator in Figure 2.20.

2.5.5 Conclusion

We revisited some classical optimizations in the domain of high-performance computing, discussing the benefits and caveats of multistage evaluation. Since most advanced transformations are currently applied manually by domain and optimization experts, the potential benefit of generative approaches is very high.

class sp_int_array =
  fun nam siz ini env ->
    let _ =
      for i = 0 to siz-1 do
        env.txt <- env.txt ^ (sprintf "int %s_%d;\n" nam i)
      done
    in object
      val name = nam
      val size = siz
      method idx x =
        if (0 <= x && x < siz) then
          let var = sprintf "%s_%d" name x in
          LValue (var, 0)
        else
          failwith ("Illegal reference to " ^ nam ^ "[" ^ (string_of_int x) ^ "]")
    end

Figure 2.21: Support for scalar-promotion

(* ... *)

let q = new sp_int_array "q" 20 env
and r = new sp_int_array "r" 20 env
and t = new sp_int_array_array "t" 20 20 env

(* ... *)

in let block_mab1 it env =
  gen_block
    (gen_inst
       (gen_assign (t#idx i1.(it) i2.(it))
          (gen_mul (lvrv (q#idx i1.(it)))
                   (lvrv (r#idx i2.(it))))))
    env

(* ... *)

in let multiply env =
  for m = 0 to mu-1 do
    for n = 0 to nu-1 do
      i1.(!h) <- m; i2.(!h) <- n; h := !h + 1
    done
  done;
  .! full_unroll 0 (latency - 1)
     (fun i -> .< block_mab1 .~i env >.);
  .! full_unroll 0 (mu * nu - latency - 1)
     (fun i -> .< block_mab2 .~i env >.);
  .! full_unroll (mu * nu - latency) (mu * nu - 1)
     (fun i -> .< block_mab3 .~i env >.)

(* ... *)

Figure 2.22: Revisited matrix product template

Indeed, several projects have followed ad-hoc string-based generation strategies to build adaptive libraries with self-tuned computation kernels. We took one of these projects as a running example. We show that MetaOCaml is suitable to implement this kind of application-specific generator, taking full advantage of its static checking properties to increase the productivity and portability of manual optimizations. But our results also show that severe reusability and compositionality issues arise for complex optimizations, when multiple transformation steps are applied to a given code fragment. In addition, we show that static type checking in MetaOCaml — in the current state of the system — somewhat restricts the usage of important optimizations like the scalar promotion of arrays. Nevertheless,

sum=0;
for (i=0; i<256; i+=%d) {
  %for (k=0; k<=%d-1; k++)
    sum = sum + a[i+%k];
}
(before)

Figure 2.23: Loop unroll using macro statements.

we also show that on all our examples, this difficulty can be overcome in practice, at the price of additional coding complexity — monadic style — or by relying on a smarter backend compiler — unboxing and offshoring.

Let us summarize our findings. Although multistage evaluation can be used in a simple and effective way to implement program generators for portable high-performance computing applications, it does not relieve the programmer from reimplementing the main generator parts for each target application. The only way to improve code reuse in the generator is to base its design on a custom intermediate representation, which may be almost as convoluted for application programmers as designing their own compiler. These academic results are confronted with a concrete example, the matrix-matrix product in the ATLAS adaptive library. We show that competitive performance is achievable with a high-level, type-safe program generation approach, provided the domain expert has a good technical command of functional programming and of the optimizations needed to lower the abstraction penalty. However, we were not able — up to now — to fully eliminate the abstraction penalty with MetaOCaml alone. We had to resort to a heterogeneous multistage approach to eventually produce lower-level C code. While bringing safe meta-programming to low-level imperative programming, this approach also helps lift some restrictions of MetaOCaml, but delays some safety checks to the generation of the C program. Given this study and the complexity of using such a language to produce multi-level code, the second part of this chapter presents a new, simpler approach.

2.6 A generation language: X-Language

In this section, we study a new means of multi-level programming. We created a language that allows the expression of source-to-source transformations, together with the ability to build a search space in which to find the best transformation sequence and its parameters.

2.6.1 Macro-languages

Perhaps the simplest approach to implement X would be to use a macro language. Assuming that the macro language statements are C-like statements preceded by the character % and that references to macro language variables are also preceded by %, Figure 2.23 shows an example where the %for statement produces the body of a loop unrolled %d times. That is, when the %for loop is executed, it produces the sequence of assignments sum=sum+a[i+0]; sum=sum+a[i+1]; ...; sum=sum+a[i+%d-1]. In this example we assumed that %d is a sub-multiple of 256 and, for that reason, did not include the clean-up code needed to correctly handle the remainder of the 256 iterations of the original loop. Notice that %d in Figure 2.23 will be assigned a value at compile time, and will usually be assigned several values in successive compilations during an empirical search for the best version of the program.
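For instance, with %d instantiated to 4, the macro expansion would produce code along the following lines (a sketch; the clean-up code for non-multiple trip counts is still omitted):

sum = 0;
for (i = 0; i < 256; i += 4) {
  sum = sum + a[i+0];
  sum = sum + a[i+1];
  sum = sum + a[i+2];
  sum = sum + a[i+3];
}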

An implementation based on a macro language would produce a system that relies on generation rather than transformation. Thus, the construct of Figure 2.23 does not transform an initial loop but generates a loop with the body unrolled %d times. If the macro language includes procedures, it would be possible to write generation routines that accomplish the same objectives as any transformation. For example, we could conceivably develop an %unroll-loop routine that accepts the body of the loop, the index variable, and the degree of unrolling as parameters. These generation routines could be a convenient way to extend the base language with new parameterized statements. In some cases it is preferable to use the generation approach so that the programmer can produce exactly the transformed code that he desires. For this reason, X includes a macro language. However, we have found that the generation approach has two disadvantages:

1. The generative approach leads to code that is difficult to develop and understand. If we want to optimize an existing program, it is necessary to modify the original code, which may introduce errors. Furthermore, code containing generative statements is difficult to write and read. Therefore, the generative approach has disadvantages even when the parameterized code is written from scratch.

2. Complexity when composing transformations. Since the programmer is directly manipulating source text, when two or more transformations are applied to a statement, the macro statements can become complicated. For instance, tiling the three loops of the matrix-matrix multiplication code in Figure 2.24-(a) with square tiles of size tile results in the code shown in Figure 2.24-(b). The variable %tile will be instantiated at compile time, so that versions of matrix-matrix multiplication with different tile sizes can be generated by just changing the value of the %tile variable. The code in Figure 2.24-(b) shows the remainder loops needed when K is not a multiple of %tile, and outlines the additional code that should be written to generate the remainders of M and N. A programmer who needs to write all this additional code is likely to make mistakes. This problem would be less severe if the macro language contained procedures, but then a procedure would be needed for each combination of transformations, or procedures with cumbersome parameter lists. In any case, tiling can be obtained by composing loop stripmine and loop interchange. Unfortunately, the programmer using macro statements cannot take advantage of this.

2.6.2 X-Language pragma use

In this section, we describe the X language that we have designed, taking into account the features described in Section 2.2. X uses #pragmas to name loops or portions of code and to specify the transformations to apply. The syntax of the #pragmas used to name loops or code sections has the form:

#pragma xlang name <id>
{ ... }

The {} are only necessary when naming a set of statements, but they are not required to name a single statement. These pragmas need to be placed right before the code section to be named. The syntax of the #pragmas to specify transformations has the form:

#pragma xlang transform <keyword> <arguments>

Figure 2.24: (a) Matrix-multiply code. (b) Tiled matrix-multiply code with macros.
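As a rough sketch of what such a macro-tiled matrix multiply could look like (assuming the macro conventions of Figure 2.23, that M and N are multiples of %tile, and that only the k-dimension remainder is written out; array and bound names are illustrative):

int kk;
for (i = 0; i < M; i += %tile)
  for (j = 0; j < N; j += %tile) {
    for (kk = 0; kk + %tile <= K; kk += %tile)
      for (ii = i; ii < i + %tile; ii++)
        for (jj = j; jj < j + %tile; jj++)
          for (k = kk; k < kk + %tile; k++)
            c[ii][jj] = c[ii][jj] + a[ii][k] * b[k][jj];
    /* remainder of the k loop, written by hand */
    for (ii = i; ii < i + %tile; ii++)
      for (jj = j; jj < j + %tile; jj++)
        for (k = kk; k < K; k++)
          c[ii][jj] = c[ii][jj] + a[ii][k] * b[k][jj];
  }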

(a):
sum=0;
#pragma xlang name l1
for (i=0; i<256; i++) {
  sum = sum + a[i];
}
#pragma xlang transform unroll l1 4

(b):
sum=0;
#pragma xlang name l1
for (i=0; i<256; i+=4) {
  sum = sum + a[i];
  sum = sum + a[i+1];
  sum = sum + a[i+2];
  sum = sum + a[i+3];
}

Figure 2.25: Example in X of loop unroll. (a) Pragmas to name the loop and specify an unroll by 4. (b) Generated code.

Figure 2.26: Example in X of stripmine. (a) Pragmas to name loops and specify the stripmine transformation. (b) Generated code.
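A minimal sketch of this stripmine example, assuming the running sum loop of Figure 2.25 as input (the inner index name is illustrative; since 256 is a multiple of 4, the remainder loop l1rem is empty here):

/* (a) pragmas */
#pragma xlang name l1
for (i = 0; i < 256; i++) {
  sum = sum + a[i];
}
#pragma xlang transform stripmine l1 4 l3 l1rem

/* (b) generated code, sketched */
#pragma xlang name l1
for (i = 0; i < 256; i += 4) {
#pragma xlang name l3
  for (ii = i; ii < i + 4; ii++) {
    sum = sum + a[ii];
  }
}
#pragma xlang name l1rem
for (; i < 256; i++) {
  sum = sum + a[i];
}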

The original source code only needs to be modified with the name #pragmas. The transform #pragmas can be in the same file as the source code or in a different one. In X, the loop unrolling transformation of Figure 2.23 is specified as shown in Figure 2.25. #pragma xlang name l1 is used to name the loop right after it, while #pragma xlang transform unroll l1 4 specifies that l1 should be unrolled 4 times.

The stripmine transformation is specified in X with #pragma xlang transform stripmine l1 4 l3 l1rem, as shown in Figure 2.26-(a). This transformation stripmines the l1 loop using a tile size of 4. The generated code is shown in Figure 2.26-(b). The new loop that results from the stripmine transformation is named l3. To name the remainder loop, the example uses l1rem. Using this postfix notation we can apply the same transformation to l1 and l1rem by simply using l1*.

Another transformation that X includes is array scalarization. The syntax for this transformation is #pragma xlang transform scalarize-<func> <array> in <loop>, where func can be in, out, in&out, or be omitted. scalarize-in is used when copy-in is needed, that is, when the initial values in the array have to be loaded into the scalar variables. scalarize-out is used when copy-out is needed, that is, when the scalar values need to be written back to memory to the corresponding array locations. scalarize-in&out is used when both in and out are required. scalarize is used when neither in nor out is necessary. The programmer must determine which scalarize transformation is appropriate so that the generated code is correct.

Figure 2.27-(a) shows an example where the scalarize-in transformation is used to scalarize the array a in l1. The generated code is shown in Figure 2.27-(b). It contains the declaration of the new scalar variables a0 and a1, and two new pragmas that name certain statements of the generated code. #pragma xlang name l1.loads names the statements that load the array values into the scalars. #pragma xlang name l1.body names the statements where the array references have been replaced with scalars. Notice that these #pragmas are generated automatically after a scalarize transformation is applied, without the programmer specifying anything. In the case of a scalarize-out transformation, an additional #pragma naming l1.stores would have been generated. Naming these loop sections allows the programmer to apply new transformations on the generated code. For example, Figure 2.28-(a) shows an example where the load statements of the copy-in phase are moved before l1 and the store statements of the copy-out phase are moved outside l1, as shown in Figure 2.28-(b).

(a):
sum=0;
#pragma xlang name l1
for (i=0; i<256; i+=2){
  sum = sum + a[i];
  sum = sum + a[i+1];
}
#pragma xlang transform scalarize-in a in l1

(b):
double a0,a1;
sum=0;
#pragma xlang name l1
for (i=0; i<256; i+=2){
#pragma xlang name l1.loads
  { a0 = a[i];
    a1 = a[i+1]; }
#pragma xlang name l1.body
  { sum = sum + a0;
    sum = sum + a1; }
}

Figure 2.27: Example in X of the scalarize-in transformation. (a) Pragmas for scalarize-in. (b) Code after scalarize-in of array a in l1.

In this new example, we have used #pragma xlang transform lift l1.loads before l1 and #pragma xlang transform lift l1.stores after l1; the general syntax of this transformation is #pragma xlang transform lift <section> before|after <loop>.

X also includes transformations for software pipelining. One difference between software pipelining and the loop transformations is that software pipelining operates on statements instead of loops. The lower granularity of software pipelining transformations makes them more complex, since the programmer needs to deal with the movement of individual statements. The two transformations used for software pipelining are split and shift. The split transformation is not necessarily a software pipelining transformation: it is used to separate atomic instructions. Figure 2.29 shows how an instruction combining a load and an operation is broken into two statements, one to compute the right-hand side and the other to assign the computed value to the left-hand side. Figure 2.30 shows how to software-pipeline a loop with the shift transformation. We have used #pragma xlang transform shift l1.1 2. The first argument, l1.1, corresponds to the first statement of loop l1; in general, the loop.n notation designates the sequence of the first n statements in the body of loop loop. In the example, the first statement is shifted with respect to the remaining statements with a latency of 2, given by the second argument. Application of the shift transformation creates a pipeline with multiple stages. The example shows the resulting code, with a prolog and an epilog loop. Notice that these loops can be unrolled using the pragma fullunroll, as shown in Figure 2.30-(b). Defining transformations with respect to existing ones provides a procedural abstraction to the X language. We describe this in Section 2.6.3.

2.6.3 Implementation

In this section, we describe the implementation of the X language translator and present how transformations are encoded.

Code translation

The X language is translated in two steps. The frontend performs several tasks before passing the result to the backend. First, the frontend parses the annotated C program and builds the associated abstract syntax tree.

Figure 2.28: Example in X of the scalarize-out and lift transformations. (a) Pragmas for scalarize-out and lift. (b) Generated code.

Figure 2.29: Example of split. (a) Pragmas for split. (b) Generated code.
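A minimal sketch of the effect of split on a single statement, following the description above (the temporary name is illustrative):

/* before split */
sum = sum + a[i];

/* after split: the right-hand side is computed first,
   then assigned to the left-hand side */
tmp = sum + a[i];
sum = tmp;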


Figure 2.30: Example of shift for software pipelining. (a) Pragmas for shift. (b) Generated code, including fullunroll.

Next, a tree-walk identifies the loops and transformations specified by the X language directives. The marked loops are then rewritten as series of library calls that represent the loops inside the backend. Transformation directives are also translated into library calls that perform the appropriate transformations on the annotated loops. After all the annotations of the C program have been translated, the remaining code is transformed using a multistage language similar to the one described in Section 2.6.1. Our multistage language also resembles ‘C [69], which is a generalization of a macro language with arbitrary recursion, where a program may generate another program and execute it, having multiple program levels cooperate and share data, possibly at run time. The final translated program is then ready to be processed by the backend.

In the second step, this program is executed: it reads a separate file describing the optimizations, performs the optimizations and produces the final optimized C code. The macro language is used to manipulate code expressions and to write some optimizations (such as unroll) in a compact way. Partial evaluation of expressions that contain only % variables and constants is done in this step: as presented in Section 2.6.1, variable names such as c_%i are then expanded into c_0, c_1, ... in the resulting code. Finally, all unoptimized code (not prefixed by pragmas) is printed out without any modification in the final code.

Defining New Transformations

The definition of transformations in X can use pattern rewriting rules and macro code. A pattern rewriting rule contains two patterns: the first pattern is for matching and the second one is for rewriting. When an input code matches the first pattern, the code is rewritten as indicated by the second pattern. If the pattern rewriting rule is not expressive enough, the user can define the code using macro code directly. Thus an X program could

contain both pragmas and macro statements. In fact, it is possible to define a code generator associated with a pattern of code. In the current implementation, no dependence analysis is integrated yet, so no validity check is performed for the transformations. We envision that, contrary to a compiler, validity checks in X would only raise warnings to the user, since the user is assumed to know what he is doing and validity checks may be too conservative.

Procedural abstraction enables the writing of complex transformations from simpler ones. It is an important feature in the definition of transformations. The destination pattern can contain some transform pragmas. For instance, a line such as #pragma xlang transform fullunroll l1rem could be added to the destination pattern of stripmine and would fully unroll the remainder loop. Here is an example to illustrate the use of this language.
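Only a sketch of this tiling rule is given below: the pragma header follows the text, while the loop bound N, the BODY placeholder and the exact rewriting syntax (with %% separating the matching pattern from the rewriting pattern) are illustrative assumptions based on the description in the next paragraph.

#pragma xlang transform tile(i,II,STRIDE)
for (i = 0; i < N; i++)
  BODY(i)
%%
for (II = 0; II < N; II += STRIDE)
  for (i = 0; i < STRIDE; i++)
    BODY(i + II)
%%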

This example describes loop tiling. The first part, before the %% symbol, is the code pattern that must be found in the code. The second part is the loop nest that replaces the matched loop: a loop with stride 1 and lower bound 0 is transformed into two loops, the outer one having a stride equal to the tiling factor. Occurrences of i in the loop body are rewritten to i + II to complete the tiling.

Implementation in TCC/PROLOG

To achieve a better search efficiency and to express transformations, we chose to use Prolog: it has a built-in search engine that we can use to generate different code versions. X-Language is divided into three steps:

• the lexical analyzer: to lexically analyze the code, we used a C99 lexer, the Tiny C compiler (TCC). This compiler provides the grammar needed to analyze C code, and its structure is easy to learn, which makes it well suited to this step. This step builds a syntax tree that represents the code; the tree covers only the code enclosed by the pragmas. The output of this step is Prolog code: basically, we chose to represent the program as Prolog predicates.

• the code transformer: this step is carried out in Prolog. Using the predicates generated by TCC, Prolog [23] applies the transformation rules written by the user to transform the program.

• the code generator: after each transformation, Prolog generates the different versions, which are then compiled and tested.

The advantages of Prolog are:

• term rewriting,

• it is a general-purpose language, so we can add analysis tools or a cost model,

• a natural search mechanism,

• it can use solvers such as Constraint Handling Rules [39] to interact with models.

A usage example of this language:

#pragma xlang begin
for (i=0; i<1339; i++)
  for (j=0; j<1339; j++)
    for (k=0; k<1339; k++)
      z[i][j] = z[i][j] + x[i][k] * y[k][j];
#pragma xlang end
...

The begin and end pragmas delimit the part of the code to be optimized. Generation is driven by the part that produces the plan of transformations. The tiling takes an argument STRIDE that varies from 16 to 128 with a step of 32; this transformation is applied to the three enclosed loops, and an exhaustive search is performed over these different argument values. A Prolog program is generated; it contains the transformation expressions together with their constraints. We obtain the loop representation as Prolog expressions and apply the transformations to it. The code is represented as follows:

for(i,0,1339,1,                 % Loop declaration
  for(j,0,1339,1,
    for(k,0,1339,1,
      array(z,[i,j]) += array(x,[i,k]) * array(y,[k,j]))))   % Loop body declaration

From this representation, we apply transformations in Prolog and obtain numerous program versions with different arguments. In the following illustration, there are two versions, using unrolling factors of 1 and 2.

int loop0; int loop1; int loop2;
double scalar3;
...
for (i=0; i<1339; i+=16)
  for (j=0; j<1339; j+=16)
    for (k=0; k<1339; k+=16)
      for (loop0=i; loop0 ...

Numerous versions are generated in an exhaustive way; they are compiled and tested. The last pragma in this example, findmemcopy, makes it possible to find where copying by blocks can be used.

The creation of new transformations is illustrated in the following example:

/* Strip mine */
sm(I,_,1,for(I,LB,UB,S,B),for(I,LB,UB,S,B)).

sm(I,NEW,S,for(I,LB,UB,1,B1),for(I,LB,UB,S,for(NEW,I,I+S,1,B2))) :-
    integer(UB), integer(LB), integer(S),
    0 =:= (UB-LB) mod S,
    new_name(loop,NEW),
    recordz(symtable,int(NEW)),
    substitute([I/NEW],B1,B2), !.

sm(I,NEW,S,for(I,LB,UB,1,B1),for(I,LB,UB,S,for(NEW,I,min(I+S,UB),1,B2))) :-
    new_name(loop,NEW),
    recordz(symtable,int(NEW)),
    substitute([I/NEW],B1,B2).

/* Interchange */
inter(I1,I2,for(I1,LB1,UB1,S1,for(I2,LB2,UB2,S2,Z)),
            for(I2,LB2,UB2,S2,for(I1,LB1,UB1,S1,Z))).

/* Fusion */
fusion(I1,I2,(for(I1,LB,UB,S,Z1);for(I2,LB,UB,S,Z2)),
             for(I1,LB,UB,S,(Z1;Z3))) :-
    substitute([I2/I1],Z2,Z3).

/* Fission */
fission(I,for(I,LB,UB,S,(B1;B2)),
          (for(I,LB,UB,S,B1);for(I,LB,UB,S,B2))).

In this code, we define four transformations: loop strip-mining, loop interchange, loop fusion and loop fission.

2.6.4 Experimental results

In this section, we study a matrix-matrix multiplication and its optimization with the X language. Starting from a very simple implementation, the goal is to mimic ATLAS by performing the same transformations with X. For this preliminary experiment, the platform used is a NovaScale 4020 server from Bull featuring two 1.3 GHz Itanium 2 (Madison) processors, with a 256KB level-2 cache and a 1.5MB level-3 cache. The quality of the compiled code is the key to performance on Itanium because of its explicitly parallel assembly and its in-order execution: scheduling problems cannot be smoothed out by hardware mechanisms. All codes (including ATLAS) are compiled with the Intel C compiler (icc) version 9.0 using the -O3 -fno-alias flags.

Pragmas for MMM

The initial code for matrix-matrix multiply is a triple-nested loop whose inner loop contains one floating-point multiply-add operation. Blocking the code for the L2 and L3 caches is key to obtaining high performance. Therefore each loop is tiled three times using X pragmas in order to perform the multiplication with blocks fitting into registers and into the L2 and L3 caches. Figure 2.31-(a) shows the mini-MMM code tailored for the L2 cache, with the pragmas to generate register blocking. Note that we do not perform software pipelining because the compiler handles this optimization better than we can at the source level in this case.

(a):
#pragma xlang name iloop
for (i = 0; i < NB; i++)
#pragma xlang name jloop
  for (j = 0; j < NB; j++)
#pragma xlang name kloop
    for (k = 0; k < NB; k++) {
      c[i][j] = c[i][j] + a[i][k]*b[k][j];
    }
#pragma xlang transform stripmine iloop NU NUloop
#pragma xlang transform stripmine jloop MU MUloop
#pragma xlang transform interchange kloop MUloop
#pragma xlang transform interchange jloop NUloop
#pragma xlang transform interchange kloop NUloop
#pragma xlang transform fullunroll NUloop
#pragma xlang transform fullunroll MUloop
#pragma xlang transform scalarize_in b in kloop
#pragma xlang transform scalarize_in a in kloop
#pragma xlang transform scalarize_in&out c in kloop
#pragma xlang transform lift kloop.loads before kloop
#pragma xlang transform lift kloop.stores after kloop

(b):
#pragma xlang name iloop
for (i = 0; i < NB; i++){
#pragma xlang name jloop
  for (j = 0; j < NB; j += 4){
#pragma xlang name kloop.loads
    {c_0_0 = c[i+0][j+0];
     c_0_1 = c[i+0][j+1];
     c_0_2 = c[i+0][j+2];
     c_0_3 = c[i+0][j+3];}
#pragma xlang name kloop
    for (k = 0; k < NB; k++){
      {a_0 = a[i+0][k];
       a_1 = a[i+0][k];
       a_2 = a[i+0][k];
       a_3 = a[i+0][k];}
      {b_0 = b[k][j+0];
       b_1 = b[k][j+1];
       b_2 = b[k][j+2];
       b_3 = b[k][j+3];}
      {c_0_0 = c_0_0 + a_0*b_0;
       c_0_1 = c_0_1 + a_1*b_1;
       c_0_2 = c_0_2 + a_2*b_2;
       c_0_3 = c_0_3 + a_3*b_3;}
      ...
    }
#pragma xlang name kloop.stores
    {c[i+0][j+0] = c_0_0;
     c[i+0][j+1] = c_0_1;
     c[i+0][j+2] = c_0_2;
     c[i+0][j+3] = c_0_3;}
  }}
...
// Remainder code

Figure 2.31: (a) Mini-MMM code in X. (b) Code after transformation with MU = 4, NU = 1.

Likewise, basic block scheduling is correctly handled by the compiler. We have used two stripmine and three interchange transformations to tile the two nested loops iloop and jloop. Figure 2.31-(b) shows a fragment of the resulting code when the blocking values are 1 for iloop and 4 for jloop. For the L2 and L3 tilings, copies of a, b and c are made in order to have all the elements of the submatrices in a contiguous memory block.

Optimization Tuning

Expressing the optimization is only one step towards high-performance code. The other important step consists of finding the right values for the parameters. Many search strategies can be applied, such as the search employed by ATLAS. For DGEMM, we performed an exhaustive search for the appropriate tile sizes around the expected values. Comparison with the naive code shows a speed-up of 80 (for matrices of size 600 × 600). Figure 2.32 shows that code optimized with the X language outperforms ATLAS

for all matrix sizes when coupling it with a custom memory copy routine called dcopy. This routine was automatically produced by a specialized assembly generator, the Xemsys Library Generator [89], using hardware performance counters and static analysis of the assembly code [27]. Coupling our code with the less specialized copy routine of the Intel Math Kernel Library (MKL) yields performance on par with ATLAS on average, and using the plain memcpy routine of the C library degrades performance slightly.

[Plot: DGEMM NxN on Itanium 2. GFlops (0 to 7) versus matrix size N (100 to 1500), for ATLAS, X-Language and MKL.]

Figure 2.32: Preliminary results comparing ATLAS to naive code with pragmas for DGEMM.

These results are very encouraging. Yet the peak architectural performance for matrix-matrix product on Itanium is 0.5 cycle per FMA operation, and the MKL implementation of dgemm does achieve 0.55 cycle per FMA on average, which is 10% to 15% faster than ATLAS and the X-language implementation (depending on array alignment). Our future work includes the continuation of our X-language experiment to fully reproduce or outperform the MKL, showing that the added productivity in adaptive library development can translate into added performance as well (with respect to manual designs like ATLAS).

2.6.5 Bibliography

It is well known that manual optimizations degrade portability: the performance of a C or Fortran code on a given platform does not say much about its performance on different architectures. Several works have successfully addressed this issue, not by improving the compiler, but through the design of application-specific program generators, a.k.a. active libraries [85]. Such generators often rely on feedback-directed optimization to select the best generation strategy [76], but not exclusively [90]. The most popular examples are ATLAS for dense matrix operations and FFTW for the fast Fourier transform. Such generators follow an iterative optimization scheme. Most optimizations performed by these generators are classical loop transformations; some of them involve domain knowledge, from the specialization and interprocedural optimization of library functions [16, 26], to application-specific optimizations such as algorithm selection [61]. Recently, the SPIRAL project [72] pioneered the extension of this application-specific approach to a whole domain of programs: digital signal processing. This project is one step forward to bridge the gap between application-specific generators and generic compiler-based approaches, and to improve the portability of application performance.

Beyond application-specific generators, iterative optimization techniques prove useful to drive complex transformations in traditional compilers. They use the feedback from real executions of the optimized program to explore the optimization search space using operations research algorithms [52], machine learning [61], and empirical experience [68]. In theory, iterative optimization is fully disconnected from the technical implementation of program optimizations. Yet generative approaches such as multistage evaluation avoid the pattern-matching limitations of syntactic transformation systems, which improves the structure of the search space and the applicability of empirical techniques. Indeed, systematic exploration techniques require a higher degree of flexibility in program manipulation than traditional compiler frameworks [20]. We thus advocate a framework that would allow the domain expert to design and express his own transformations, and to meta-program the search for optimal performance through iterative optimization [19]. This goal is similar to the one of telescoping languages [16, 49], a compiler approach to reduce the overhead of calling generic library functions and to enable aggressive interprocedural optimizations, by making the semantic information about these libraries available to the compiler. Beyond libraries, similar ideas have been proposed for domain-specific optimizations [60]. These works highlight the increased need for researchers and developers in the field of high-performance computing to meta-program their optimizations in a portable fashion.

Another alternative is multistage evaluation. Most programming languages support macro expansion, where the macro language allows a limited amount of control (not recursive, in general) on code parts. Yet multistage evaluation denotes the syntactic and semantic support allowing a program to generate another program and execute it, having multiple program levels cooperate and share data. String-based multistage languages support true recursion and cooperation between levels, but offer no syntactic guarantees on the generated code; the most widely used are the various shell interpreters, and the current version of the X language is also of this kind. To increase productivity, structured multistage languages enforce syntactic correctness of the generated code: e.g., C++ expression templates [84], ‘C [69] and Jumbo [46]. To further increase productivity and ease debugging, a few multistage languages guarantee that the generated code will not produce any compilation error (syntax, definition and initialization errors, type checking): e.g., MetaML and its successor MetaOCaml [14, 78]. The added safety is very valuable to increase the productivity of program generator designers, but the associated constraints may also complicate the meta-programming of specific optimizations [19]. Up to now, the multistage language and meta-programming community has mostly focused on general-purpose transformations like in partial evaluation, specialization and simplification. These transformations are useful, in particular to lower the abstraction penalty, but far from sufficient to adapt a compute-intensive application to a complex architecture. As a matter of fact, research on generative programming and multistage evaluation has not greatly influenced the design of high-performance applications and compilers, most application-specific adaptive libraries being ad-hoc string-based program generators.
The TaskGraph library [11] is closely related to the X language. It combines a structured multistage evaluation layer built on top of C++ expression templates with run-time generation and compilation, and with a transformation toolkit based on SUIF (1.3) [42] and/or ROSE [75]. It is not a language per se, but a set of C++ templates and classes associated with customizable source-to-source transformation capabilities. As such, it should be understood as the underlying infrastructure on which a general-purpose multiversioning language such as X could be built. We preferred to redesign our own infrastructure for multistage evaluation and source-to-source transformation, for the sake of simplicity, to avoid the memory and code overhead of C++ templates, and because we do not currently aim for run-time code generation.

In Figure 2.33, we position our approach among different existing approaches. We chose to

present the different kinds of optimization approaches along three axes:

• transformation versus generation: transformation tools like SPIRAL own an abstract representation on which transformations can be applied, while generation tools like ATLAS directly generate optimized code;

• the degree of user interaction: tools like Tick C require the user to do the compiler's work, as opposed to black-box tools that operate automatically;

• domain-specific tools like ATLAS and SPIRAL versus general-purpose tools like compilers.

ATLAS proceeds only by generation with automatic transformations, SPIRAL by transformations, and ‘C (Tick C) is a general-purpose method. The goal of X-Language is to combine techniques from these different tools (generation, and transformations expressed manually). It is a general-purpose language which can also be made domain-specific (the shaded area in Figure 2.33). Unlike XLG, it generates source code.

[Diagram: related tools positioned along three axes (transformation versus generation, automatic versus manual, general-purpose versus domain-specific), placing Compiler, XLG, Spiral, Tick C, ATLAS and X-Language.]

Figure 2.33: Related works

2.7 Conclusion

To conclude, we summarize the main features of these two approaches together with their limitations.

2.7.1 Comparison of two approaches

The following table gathers, for the two meta-programming languages whose goal is to transform a program, the specific points introduced previously. In a purely generative implementation, combining generators to compose transformations often results in very large generator code, and the legality of the generated code is easily lost. Despite better type checking, the code is rather specific to the transformed program, so reusing functions with MetaOCaml proves somewhat perilous. The last row of the table reports the performance obtained with the two methods for a matrix multiplication code on an Itanium 2.

Features                        MetaOCaml                        X-Language
Elementary transformations      All elementary transformations   All elementary transformations
Composition of transformations  Composition of generators        Composition by call of functions
Functions creation              Domain-specific functions        Portability
New transformations             Difficult for users              Writing rules
Code naming                     Automatic                        Manual
Type safety                     Done                             More difficult
DGEMM Itanium performance       5.2 GFlops                       5.8 GFlops

2.7.2 Limitations

In these two approaches, we built a language that is not specific to one application and that can express transformations and the search over them. However, these tools require the programmer to do the compiler's work: the transformations to apply to the code must be specified step by step, and the search itself must also be specified, with parameter intervals as well as the search method. Fully automatic optimization is still out of reach, and the search time needed to find an optimal sequence of transformations can reach several months; the combinatorics are already very large for a code of a few lines. Given how much search time and work is demanded of the user, beyond the semantics of the program itself, the approach can easily be questioned. The expression of transformations is often obscure: the user can easily be disconcerted by loops created by transformations that have not been applied yet, that is, by code that does not exist yet. It is therefore necessary to anticipate the effect of transformations in order to apply further transformations to potentially generated code. The readability of transformation expressions is often low, and the main difficulty is to anticipate the code that the transformations will generate.

To solve these problems, the following chapter introduces a high-level method built on these languages. This method generates the pragma directives in place of the user, and provides a tool that reduces the time needed to find good sequences of transformations by targeting only code fragments. The meta-programming approach is a way to let the user handle optimizations; its weakness is the large amount of work left to the programmer. Architecture knowledge is not needed, but finding a solution is very complex and takes a long time because of the program size. In the next chapter, we will see that there is a reasonable way to balance the compiler's task and the user's work using code fragments.

Chapter 3

Loop Optimization using Kernel Decomposition

Given the limitations of the previous approach, we decided to create another optimization technique. The main problems were the complexity of the tool for the user and the execution time needed to optimize the code. In this chapter, we use X-Language to create code fragments that are easier for compilers to optimize. The method is more user-friendly: the user only declares the region to optimize, and the system performs the optimization automatically.

The increasing complexity of hardware features in recent processors makes high-performance code generation very challenging. In particular, several optimization targets have to be pursued simultaneously (minimizing L1/L2/L3/TLB misses and maximizing instruction-level parallelism). Very often, these optimization goals impose different and contradictory constraints on the transformations to be applied. We propose a new hierarchical compilation approach for the generation of high-performance code relying on the use of state-of-the-art compilers. This approach is not application-dependent and does not require any assembly hand-coding. It relies on the decomposition of the original loop nest into simpler kernels, typically 1D to 2D loops, which are much simpler to optimize. We successfully applied this approach to optimize dense matrix multiply primitives (not only for the square case but also for the more general rectangular cases) and convolution. The performance of the optimized codes on Itanium 2 and Pentium 4 architectures outperforms ATLAS and, in most cases, matches hand-tuned vendor libraries (e.g. MKL).

3.1 Introduction

As we saw previously, the complexity of the architecture and of its optimization mechanisms (prefetching, code vectorization, cache hierarchy, ...) makes the compiler's work very hard. Compilers run into the difficulty of applying one optimization without undoing another applied beforehand. To obtain efficient code, the computations must be organized so as to use the appropriate functional units, and the loads must be issued ahead of these computations while exploiting cache reuse as much as possible. Numerous parameters therefore have to be taken into account to achieve performance. Moreover, it is important that these parameters be optimized globally, so that the optimization of one does not interfere with the others. Compilers have to take this last remark into account: a code must be optimized as a whole. A code whose cache reuse is optimized but that lacks ILP optimization cannot be efficient. The optimization sequence is all the more complex as ILP is handled in the back-end while the other optimizations operate at a higher level; it is necessary to act at several compilation levels.

The matrix-matrix multiply example illustrates these remarks well. Although the code is very easy to understand, no compiler manages to generate an efficient code that reaches peak performance.

To deal with this problem, Whaley et al. [86] have developed a specialized and iterative code generator (ATLAS). ATLAS-generated libraries outperform the codes produced by most compilers. Recently, the cost of the iterative compilation in ATLAS has been reduced by replacing the iterative search by a cost model, while still generating codes with nearly the same performance [91]. But even with these recent improvements, vendor [31, 65] or hand-tuned BLAS3 [41] libraries still outperform ATLAS-generated codes. So what are ATLAS and compilers still missing in order to reach this level of performance?

In this chapter, we propose an automatic method to close this performance gap. The starting point is to decouple the two issues: locality optimization and ILP optimization. Tiling is first performed to produce a tile code operating on sub-arrays that fit in cache. The tiling constraints, which ensure that the sub-arrays accessed within the tile fit in cache, do not define a single tile size but rather a set of tile sizes. We use this degree of freedom to search for the best performing tile code and choose the tile size according to the constraints on this code. Then, to optimize the tile code, we use a bottom-up approach combined with systematic exploration. In general, the multiply-nested loop structure of the tile code will still be too complex to be correctly optimized by a compiler, even if the operands are in cache. Therefore, from the multiply-nested loop, we systematically generate several kernels, using interchange, strip mining, and partial unrolling. These kernels have a simpler control structure (1- or 2-dimensional loops), the loop body containing several statements resulting from unrolling surrounding loops. Additionally, to simplify the compiler's task, loop trip counts are set to constants, and multidimensional array accesses are simplified. The main constraint on these kernels is to be simple enough so that a decent compiler can generate high-performance code. Then, the performance of all of these kernels is systematically measured. From these kernels, different variants of the original tile code can easily be rebuilt. Finally, taking into account the tile size constraints, a specific version of the tile code is selected and the whole code is produced. The main contributions of the proposed approach are:

• Automating the process of generating high-performance code, optimizing ILP and data locality simultaneously. The main contribution here is that our approach makes it possible to find the best tradeoff between these two optimization targets.

• Achieving performance similar to hand-coded routines.

• Relying on the flexibility (different versions) of the generated code to match varying data locality properties. For example, the real difficulty is not generating a matrix multiply code achieving high performance on square matrices, but a general matrix multiply code that obtains high performance for arbitrary rectangular-shaped matrices (where the locality properties of each array can be very different).

• Reducing the cost of the optimization phase. Most of the search and experimentation phases are done on the kernels (which are mostly one-dimensional loops) and not on the whole code.

The proposed approach is demonstrated on BLAS3 codes but can be applied to other codes. Unlike ATLAS, we did not a priori select a given code to be further tuned. On the contrary, we consider a large number of variants, which are automatically produced.
Each of these variants corresponds to the application of a given set of transformations/optimizations to the original tile code. Generation and exploration of these optimization sequences and of their parameters are achieved with a meta-compilation language, X-language. The approach described in this chapter applies to regular linear algebra codes. More specifically, the codes considered are static control programs [35]: loop bounds have to depend linearly

on other loop iteration counters or on read-only variables. Array indices also have to depend linearly on loop indices.
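For instance, the following loop nest falls into this class (N is a read-only parameter; bounds and subscripts are affine in the loop counters):

for (i = 0; i < N; i++)
  for (j = i; j < N; j++)        /* lower bound depends linearly on i */
    a[2*i + j] = b[i][j - i];    /* subscripts are linear in i and j */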

3.2 Why is it important to divide the problem?

The histograms in Figure 3.1 compare the performance of two versions of a square matrix multiply primitive (DGEMM): one is generated by ATLAS (grey bars) and the other one is from the Intel MKL library (white bars). The target machine is an Itanium 2 running at 1.575 GHz with a peak performance of 6.3 GFlops. The MKL version clearly outperforms the ATLAS version and gets performance numbers very close to peak. Trying to understand the performance gap, we measured L2 and L3 misses for each code. The results in Figure 3.1 represent the average number of bytes fetched outside of L2 (resp. L3) per FMA (Fused Multiply Add). These metrics are simply computed as (128 × number of L2 misses) / number of FMAs (resp. (128 × number of L3 misses) / number of FMAs). Surprisingly enough, MKL performs on average 3 times more L2 misses than ATLAS, and between 1.5 and 2 times more L3 misses (for matrices larger than 900 × 900). We checked that the number of prefetch instructions is similar in both cases, and we also looked at the impact of TLB misses: again, in both cases the impact is negligible, meaning that both ATLAS and MKL have done an excellent job at minimizing TLB misses. It should be noted that although the stress imposed by MKL routines on the L2 bandwidth (resp. on the memory bandwidth) is much higher, it is still below the sustained bandwidth achievable by the L2 cache, 24 GB/s (resp. by the memory system, 6.4 GB/s). Similar experiments performed on the Pentium 4 lead to similar observations. What conclusions should be drawn from this example?

• Minimizing L2, L3 and TLB misses should not be the only goals;

• Furthermore, increasing L2, L3 and TLB miss rate (but still staying under the hardware bounds) can be the right path to reach peak performance;

• The real key to achieve peak performance is to achieve the right tradeoffs between ILP optimization and data locality optimization.

It should be noted that our results are fundamentally different from those obtained by Chen et al. [17], who focus essentially on data locality optimization across all levels of the memory hierarchy. The models and heuristics limit the search to a small number of candidate implementations, and the empirical results provide the most accurate information to the compiler to select among candidates and to tune optimization parameter values. This work is based on the matrix multiply algorithm. In the rest of this chapter, we will show how our approach succeeds in achieving a well-balanced optimization of all of these metrics, resulting in performance levels close to, and even superior to, MKL.

3.2.1 X-language Framework

Using X-language [28], defined in the previous chapter, multiple versions of a code are generated. Source-to-source transformations are expressed in this language and are applied to the code after the first compilation stage. X-language pragmas make it possible to:

Figure 3.1: DGEMM performance and L2 and L3 behavior for ATLAS and MKL on Itanium 2

• Specify code fragments (scope) on which X-language transformations apply, using #pragma xlang begin and #pragma xlang end directives around selected code;

• Trigger source-to-source transformations on code fragments using pragma directives, such as:
  #pragma xlang transform tile(i,II,STRIDE)
  #pragma xlang transform unroll(II,UNROLL)

The first directive tiles the loop i with a stride STRIDE into a new loop II. The second one partially unrolls II by a factor of UNROLL. Available transformations include unrolling, tiling, fission, fusion, interchange, scalar promotion, etc. The rule-based transformation engine of X-language enables further transformations to be defined.
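As an illustration of these two directives (a sketch only: the loop, its bounds and the parameter values STRIDE=4 and UNROLL=2 are arbitrary, and the exact code produced depends on the X-language backend):

/* annotated source */
#pragma xlang transform tile(i,II,4)
#pragma xlang transform unroll(II,2)
for (i = 0; i < 256; i++)
  sum = sum + a[i];

/* possible result: i tiled with stride 4 into II, then II unrolled by 2 */
for (i = 0; i < 256; i += 4)
  for (II = i; II < i + 4; II += 2) {
    sum = sum + a[II];
    sum = sum + a[II+1];
  }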

The main advantage of X-language is that the user can apply precisely the desired source-to-source transformations. However, long optimization sequences are tedious to write with pragmas. Furthermore, there is no real search mechanism over optimization sequences and there is no feedback from the compiler.

3.3 Hierarchical decomposition in kernels

This section presents the detailed method for hierarchical kernel decomposition. The algorithm in Figure 3.2 sums up the main steps of the approach, further detailed in the following sections. The code is first tiled for data locality. At this stage, the exact tile sizes are not selected yet, they are still parameters (Step 1 of the algorithm in Figure 3.2), but we assume that tile sizes are such that the whole array regions accessed in the tile fit into cache. The code associated with a tile is called the tile code. On the resulting tile code, various code transformations are applied, resulting in a large number of versions (Step 2). Then the loops of these versions are decomposed into simpler computational kernels: for example, for a multiply-nested do loop, tiling the innermost loop for 100 iterations generates a simple kernel with one loop of 100 iterations. More complex kernels can be obtained by tiling the two innermost loops, and so on (Step 3a). We choose to bound the exploration of these new tile sizes to some given constants.

Algorithm: Optimized tile code generation
Input: A linear algebra code P
Output: A set T of optimized tile codes and a set K of kernels.

1. Tile P for data reuse. Tile sizes are kept parametric. Let T denote the tile code. If there is no reuse, T = P.

2. Apply various code transformations to generate multiple versions of T.

3. For each version V generated in Step 2, for each loop nest L of V:

   (a) Tile any number of inner loops of L with constant loop sizes. The tile code is denoted K.
   (b) Add the following constraint to V: the tile size of V must be a multiple of the size of K (no tail code).
   (c) Apply scalar promotion to array accesses in K, adding possible copy-in/out of array sections into/from array temporaries.
   (d) Hoist all copies to the entry/exit of T.
   (e) Add the resulting tile code K to K and the resulting copies to K.

   Add the tiled V into V, for all possible tilings of Step 3a.

4. For each K ∈ K: measure the performance of K for all array alignments.

5. For each V ∈ V: estimate the performance of V by multiplying the performance of its kernels/copies by their surrounding loop trip counts. Let p(V) denote the performance estimation of V. If p(V) > max_{U s.t. U < V} p(U), then add V to T.

Figure 3.2: Optimized tile code generation

Note that this tiling level is not directly linked to memory reuse concerns. Its purpose is to define some code fragments, independent of the application, that will be executed and their performance measured. In order to avoid any unnecessary tail code, the loop must have an iteration count multiple of the tile size (Step 3b), for example a multiple of 100 iterations. Array accesses can then be optimized by scalar promotion (boiling down to loop invariant removal): array regions are copied into temporary scalars or arrays before the newly created tile code, and then copied out from these temporaries after the tile (Step 3c). The new tile code itself is called a kernel. An important point to note here is that when these copies are hoisted at the entry and exit of the tile code (Step 3d), as the working set fits into the cache, these copy operations fill the cache with the data used by kernels and these kernels will execute with data already in cache. Moreover, as there is a copy for each array accessed by a kernel, it is possible to choose precisely the array alignments that minimize cache bank conflicts inside kernel code. Thus, the performance of kernels is evaluated independently of the application context, with a working set already in cache and for different values of array alignment (Step 4). Note that the execution step is limited to kernel execution.

Original matrix multiply:

    for (i = 0; i < N; i++)
      for (j = 0; j < N; j++)
        for (k = 0; k < N; k++)
          c[i][j] += a[i][k] * b[k][j];

Tiled version:

    // copy B into b by blocks of width NJ
    for (i = 0; i < N; i += NI)
      // copy A into a by blocks of width NK
      for (j = 0; j < N; j += NJ)
        // copy C into c
        for (k = 0; k < N; k += NK)
          // Tile for memory reuse
          for (ii = 0; ii < NI; ii++)
            for (jj = 0; jj < NJ; jj++)
              for (kk = 0; kk < NK; kk++)
                c[ii][jj] += a[ii][kk] * b[kk][jj];

Figure 3.3: Tiled DGEMM.

As we can choose to consider only kernels with one single loop, independently of the application, this makes the execution step rather cheap compared to the execution of the whole tile code. Copies are considered as kernels and are benchmarked too (data comes from memory in this case). Finally, the best tile code is composed of kernels and copies. All different versions of tile codes are evaluated with a simple cost model relying on measured kernel performance (Step 5); a small sketch of this estimate is given below. There is no execution involved in this performance evaluation step. If a tile code U has size constraints that are implied by those of another tile code V, for example when V must have a loop trip count multiple of 200 and U a loop trip count multiple of 100, then we note U < V. When U < V and U outperforms V, then we keep only U (Step 5). For example, if the performance of U is 3 GFlops and its tile size must be a multiple of 100, whereas the performance of V reaches 2 GFlops and its tile size must be a multiple of 200, then V is outperformed by U, whatever the matrix or vector sizes. On the contrary, if the performance of U is 1.5 GFlops instead, then V is chosen for sizes multiple of 200 and U for the other sizes multiple of 100. According to the size constraints coming from each kernel, the best tile code adapted to the input matrix/vector size is easily found. Each step is presented in detail in the following sections. We will use the standard matrix multiply code given in Figure 3.3 to illustrate our method step by step.
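As a minimal sketch of the Step 5 estimate mentioned above (the data structure and function below are illustrative assumptions, not the actual implementation), the cost of a tile version can be accumulated from the tabulated kernel and copy measurements:

    /* Illustrative sketch: estimate p(V) for a tile version by summing,
       for each kernel or copy it contains, the measured in-cache cost of
       one call multiplied by its surrounding loop trip count.            */
    typedef struct {
        double cycles_per_call;   /* tabulated cost of one call           */
        long   trip_count;        /* calls performed by the tile code     */
    } kernel_use;

    double estimate_tile_cycles(const kernel_use *uses, int n_uses)
    {
        double total = 0.0;
        for (int i = 0; i < n_uses; i++)
            total += uses[i].cycles_per_call * (double)uses[i].trip_count;
        return total;             /* no execution of the tile is needed   */
    }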

3.3.1 Loop Tiling

The main goal of loop tiling [88, 22, 54, 53] is to reduce memory traffic by enhancing data reuse. Upon entry in the tile, the data layout (the array organization) is restructured, in particular to reduce TLB misses. Indeed, all copies are hoisted at the entry/exit of the tile by Step 3d, to generate contiguous array sections which will minimize cache interferences and TLB misses. At this stage, the only constraint imposed on tile sizes is that they should be chosen such that the working set used in the tile fits into the cache. For matrix multiply, computing the working set is fairly obvious: it amounts to summing up the sizes of the three sub-arrays accessed by the tile code. In more general cases, the exact evaluation of the working set can be done but can be fairly complex [18]. For our purpose, in most cases a simple method based on rectangular array subsections (derived from the tile sizes) is sufficient. In particular, we do not limit our search to square tiles. While using general rectangular tiles increases the number of parameters to deal with, it provides many more degrees of freedom. First, it allows coping more efficiently with the case where the original loop iteration space is rectangular (allowing for example to deal with the general rectangular matrix multiply problem). Second, it even allows dealing efficiently with more arbitrarily shaped iteration spaces such as triangular ones.
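For the matrix multiply tile, the rectangular-subsection estimate reduces to the sum of three block sizes; a minimal sketch of the check follows (the cache capacity and element size are assumptions, to be replaced by the values of the target machine):

    /* Sketch: working set of an NI x NJ x NK DGEMM tile, assuming double
       precision data and rectangular array subsections.                  */
    static int tile_fits_in_cache(long NI, long NJ, long NK, long cache_bytes)
    {
        long ws_bytes = (NI * NK + NK * NJ + NI * NJ) * (long)sizeof(double);
        return ws_bytes <= cache_bytes;    /* e.g. 256 KB L2 on Itanium 2 */
    }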

The tiling is classically obtained through strip mining of all loops, followed by a search using loop permutation, skewing, reversal and loop distribution. The tiling applied on DGEMM is presented in Figure 3.3. The tile code (the 3 innermost loops) corresponds to a mini-MMM according to ATLAS terminology. The copy-out of c is not included.

3.3.2 Loop Transformations

The loop transformations we considered in Step 2 are loop interchange, partial unroll, and loop fusion. The goal of these transformations is to increase the parallelism inside some loops (by unrolling or fusion), to give opportunities for higher instruction parallelism and vectorization. The optimization space is described by the range of unrolling factors for each loop, through X-language pragmas. The optimization sequences considered are all possible interchanges, followed by any combination of other transformations. Note that after these source-to-source transformations, the compiler performs other loop transformations on its own: unrolling, loop vectorization, software pipelining, versioning (for alignment). If the number of unrolling factors used for each loop is U, a perfect loop nest with n fully permutable loops generates n! × U^n different versions (for instance, 3! × 4^3 = 384 versions for a depth-3 nest with 4 unrolling factors per loop). This is not an issue since the maximum loop depth is usually lower than 4.

3.3.3 Data-Layout Optimization

Loop tiling followed by scalar promotion is used to generate kernels (Steps 3a, 3c). The kernels are the tile codes generated by this tiling step. Tiling is applied only to inner loops, from the innermost loop (for 1D kernels) up to all loops. The complexity of the kernels depends highly on the number of loops considered for this tiling. Kernels with only one loop (1D kernels) have several advantages: compilers usually generate good quality code for single loops, and the experimental step takes linear time with respect to the number of experiments performed (Step 4). In our study, we also consider kernels with up to 3 loops. Data-layout transformations are applied to simplify the array structures accessed by a kernel. The optimization is focused on:

• locality optimization: higher locality and regular strides give more opportunities for the compiler to generate high performance memory accesses (better prefetch, fewer instructions to describe the address stream, vectorization, ...);

• register usage optimization.

These are standard scalar promotion/blocking techniques. The first goal can be reached by transforming the arrays so that they contain only the working set accessed by the kernel. This can be achieved by using well known techniques [59, 8, 80]. Our implementation is based on scalar promotion techniques, and all arrays are resized to the size of kernel loops. Array dimensions that are only indexed by constants or loop counters not present in the kernel are removed (and arrays are renamed correspondingly). This may require the use of memory copy operations. Figure 3.4 presents three mini-MMM optimizations achieved by our method. The first column gives the optimization sequence used and the resulting code is in the second column. From these codes, the kernels in the third column are generated by tiling of the innermost loop and scalar promotion. Scalar promotion creates new arrays. For the first example, the mapping between array elements of the first optimized version and array elements used in its 1D kernel is the following: c1=c[ii], a1=a[ii][kk], b1=b[kk], c2=c[ii+1] and a2=a[ii+1], where

all arrays of the kernel have ni elements. In C, this implies that there is no need for any copy at all. For the second optimized mini-MMM of Figure 3.4, the mapping is: c1=c[ii][jj], a1=a[ii], b1[.]=b[.][jj], c2=c[ii][jj+1], b2[.]=b[.][jj+1], etc. Note that array b has to be transposed before entering this kernel. For the third optimized mini-MMM, the mapping is: c1[.]=c[.][jj], a1[.]=a[.][kk], b1=b[kk][jj], c2[.]=c[.][ii+1] and b2=b[kk][jj+1]. This means that the arrays a and c have to be transposed. Compared with the first version, even if the kernel is the same, this decomposition is likely to be less efficient.
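As an illustration of the copy implied by the second mapping (the function, its arguments and the leading dimension are illustrative assumptions), two columns of b can be gathered into the contiguous vectors b1 and b2 before the kernel is entered:

    /* Sketch: transposing copy-in so that the 1D kernel reads b1 and b2
       with stride one; b is stored row-major with leading dimension ldb. */
    void copy_in_b(int nk, int jj, int ldb, const double *b,
                   double *b1, double *b2)
    {
        for (int kk = 0; kk < nk; kk++) {
            b1[kk] = b[kk * ldb + jj];       /* b1[.] = b[.][jj]   */
            b2[kk] = b[kk * ldb + jj + 1];   /* b2[.] = b[.][jj+1] */
        }
    }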

Optimization: interchange jj,kk; partial unroll ii.
Optimized mini-MMM:

    for (ii = 0; ii < NI; ii += 2)
      for (kk = 0; kk < NK; kk++)
        for (jj = 0; jj < NJ; jj++) {
          c[ii][jj]   += a[ii][kk]   * b[kk][jj];
          c[ii+1][jj] += a[ii+1][kk] * b[kk][jj];
        }

1D kernel:

    for (i = 0; i < ni; i++) {
      c1[i] += a1 * b1[i];
      c2[i] += a2 * b1[i];
    }

Optimization: partial unroll jj, factor 4; partial unroll ii, factor 4.
Optimized mini-MMM:

    for (ii = 0; ii < NI; ii += 4)
      for (jj = 0; jj < NJ; jj += 4)
        for (kk = 0; kk < NK; kk++) {
          c[ii][jj]     += a[ii][kk]   * b[kk][jj];
          c[ii][jj+1]   += a[ii][kk]   * b[kk][jj+1];
          ...
          c[ii+3][jj+3] += a[ii+3][kk] * b[kk][jj+3];
        }

1D kernel:

    for (i = 0; i < ni; i++) {
      c1  += a1[i] * b1[i];
      c2  += a1[i] * b2[i];
      ...
      c16 += a4[i] * b4[i];
    }

Optimization: interchange ii,kk; partial unroll jj.
Optimized mini-MMM:

    for (kk = 0; kk < NK; kk++)
      for (jj = 0; jj < NJ; jj += 2)
        for (ii = 0; ii < NI; ii++) {
          c[ii][jj]   += a[ii][kk] * b[kk][jj];
          c[ii][jj+1] += a[ii][kk] * b[kk][jj+1];
        }

1D kernel:

    for (i = 0; i < ni; i++) {
      c1[i] += a1[i] * b1;
      c2[i] += a1[i] * b2;
    }

Figure 3.4: Several transformed mini-mmm and their corresponding 1D kernel. The first is a kernel of 2 daxpys, the second is 16 dot products and the third is the same as the first (modulo commutativity of the multiplication and renaming).

Concerning the mini-MMM code for DGEMM, enumerating all optimizations and generating all kernels leads to 5 different kernels, 4 of which are presented in Figure 3.5. The remaining 3D kernel is the DGEMM itself. The values k, l correspond to the unrolling factors of the surrounding loops. For instance, daxpy k,l corresponds to daxpys accumulating into k different vectors, with l daxpys sharing the same destination vector.

The method described here corresponds to a systematic enumeration of all possible optimization sequences (as opposed to a selective search). A bound on the total number of kernels studied (which in fact corresponds to the size of the search space) can easily be assessed. For perfect loop nests of depth n, for a given tile size, there are n! × U^n possible kernels consisting of exactly p loops, taking loop transformations into consideration. This is an upper bound since in practice some dependences can forbid the use of some permutations; moreover, the same kernel can be obtained by different transformations. Therefore an upper bound on the total number of kernels with at most p loops, exploring t different tile size values, is n! × U^n × p × t^p, where n is the depth of the initial loop nest and U the number of different unroll factors explored. Using existing search methods will be required when other transformations are considered.

When some variations are not handled (such as commutativity, as expressed in Figure 3.4), there are more kernels generated. The compilation step helps to detect these similarities, as explained in the following section.

(a) 1D kernel, dot product k,l:

    for (i = 0; i < ni; i++) {
      c11 += a1[i] * b1[i];
      ...
      c1l += a1[i] * bl[i];
      c21 += a2[i] * b1[i];
      ...
      ckl += ak[i] * bl[i];
    }

(b) 1D kernel, daxpy k,l:

    for (i = 0; i < ni; i++) {
      c1[i] += a11 * b1[i];
      ...
      c1[i] += a1l * bl[i];
      c2[i] += a2l * b1[i];
      ...
      ck[i] += akl * bl[i];
    }

(c) 2D kernel, dgemv k:

    for (i = 0; i < ni; i++)
      for (j = 0; j < nj; j++) {
        c1[j] += a1[j] * b[i][j];
        ...
        ck[j] += ak[j] * b[i][j];
      }

(d) 2D kernel, outer product k:

    for (i = 0; i < ni; i++)
      for (j = 0; j < nj; j++) {
        c[i][j] += a1[i] * b1[j];
        ...
        c[i][j] += ak[i] * bk[j];
      }

Figure 3.5: (a) - 1D Kernel: dot product k,l, (b) - 1D Kernel: daxpy k,l, (c) - 2D Kernel: dgemv k, (d) - 2D Kernel: outer product k

3.3.4 Kernel Micro-optimization and Execution

Once kernels have been generated, they are optimized, compiled and evaluated separately, outside of the application context (Step 4 of the algorithm in Figure 3.2). While standard iterative compilation usually compiles and measures the performance of the whole code, our approach focuses on smaller code fragments (the kernels) which are systematically benchmarked. This leads to substantial savings in computational time: for example, testing a whole matrix multiply of size n costs n^3 operations while testing a simple daxpy 10,1 of size n costs 10 × n operations. Moreover, kernel optimizations and evaluations can easily be reused from one code to the other, when the same kernels appear. For the evaluation of kernels, two key parameters are explored:

• Loop bounds: they correspond to tile sizes. First, loop bounds directly impact the working set, possibly forcing the use of other levels of cache. Second, mechanisms such as prefetching may be influenced by the actual value of the bound and by loop overheads. Finally, pipelines with a large makespan or large unrolling factors can take advantage of larger iteration counts. The span of the loop bound sampling can be user-defined through X-language pragmas.

• Array alignments: the code generated may be unstable with respect to the alignment of the array starting addresses. Important performance gains can be obtained by finding the best alignment [45]. Testing the different possible alignments reveals performance stability; a small benchmarking sketch is given after the list below. If stability is an issue, it is then possible to copy the parts of the arrays necessary for the tile with the specific alignments that enable the best performance.

The quality of the final code depends on the capacity of the compiler to take advantage of the parallelism expressed in the kernel. In particular, the code generator has to perform the following operations appropriately:

• Dependence analysis: failure to detect independence of statements degrades schedule latencies and ILP, and impacts many other optimizations.

• Register allocation: depending upon the architecture, this is more or less critical; failure to correctly allocate registers introduces dependencies, impacting latencies. Source-to-source transformations can help the compiler by using single assignment form codes.

• Vectorization: this can be memory access vectorization (Itanium, Pentium for instance) or computation vectorization (Pentium with SSE instruction set). Some code generators rely on pattern matching in order to decide whether or not to perform vectorization. Enumerating different versions of kernels helps in finding the appropriate code that can be matched by these rules.

• Constant propagation and use of static information: the compiler can take advantage, for instance, of the loop bound values (which are explicitly provided) for the computation of the prefetch distance. Failure to do this means that the compiler optimizes loops of different sizes in the same manner.
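A minimal sketch of the alignment exploration mentioned in the list above (the kernel, the cycle counter and the offsets are assumptions standing in for the actual harness):

    #include <stdio.h>
    #include <stdlib.h>

    /* assumed externally provided: a 1D kernel and a timing primitive */
    extern void   daxpy_kernel(int n, const double *a, const double *b, double *c);
    extern double read_cycle_counter(void);

    void benchmark_alignments(int n)
    {
        double *a = calloc(n + 64, sizeof(double));
        double *b = calloc(n + 64, sizeof(double));
        double *c = calloc(n + 64, sizeof(double));
        for (int off = 0; off < 16; off++) {              /* shift start addresses  */
            daxpy_kernel(n, a + off, b + off, c + off);   /* warm-up: data in cache */
            double t0 = read_cycle_counter();
            daxpy_kernel(n, a + off, b + off, c + off);
            double t1 = read_cycle_counter();
            printf("offset %2d doubles: %.3f cycles/element\n", off, (t1 - t0) / n);
        }
        free(a); free(b); free(c);
    }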

The exploration space and the number of executions can be limited by static evaluation and comparison of the assembly codes. Tools such as MAQAO [58] can detect inefficient codes from the assembly and compare different versions. Indeed, the compiler sometimes generates the same assembly code from two different source codes, or may fail to perform some key optimization (such as vectorization for instance).

3.3.5 Putting Kernels to Work

The final step consists of choosing the best kernel decomposition from the available kernels. For each decomposition, the tile code is written as a combination of memory copies and kernel codes. All memory copies are hoisted, and the performance of the tile is then evaluated as the sum, over the kernels and memory operations, of their individual performance multiplied by their surrounding loop trip counts. All kernel measurements are performed in the same context as the one in which they are used in the application (same alignment, same cache level). All tiles in T correspond to the best tiles, given some constraints on tile sizes. In particular, very thin tiles are taken into account by our approach and may require special kernels.

3.4 Experimental results

First, details on our implementation are presented, followed by a description of the different architectures, compilers and libraries used. Then, kernel performance is described and analyzed. Finally, kernel decomposition and recombination on whole codes is presented, together with performance numbers.

3.4.1 Implementation

Kernel decomposition is implemented by a new pragma inside X-language. Compared to the version presented in [28], the version we have developed is based on a C99 front-end parser, Tiny C Compiler [69], and relies on a Prolog engine for source-to-source transformations. Most of the transformations, with their pragmas, are still the same. Unlike the previous version, codes can be decomposed automatically into primitives. The advantage of Prolog is its embedded search engine. The main contributions of the new X-language version are the possibility to:

• Generate multiple versions by defining search intervals, such as

    #pragma xlang parameter STRIDE [32:128:32]
    #pragma xlang parameter UNROLL [1:8:1]

These directives define that STRIDE can take any value that is a multiple of 32 between 32 and 128, and that UNROLL ranges from 1 to 8. X-language then automatically generates all versions of the code fragment with these optimization parameters;

• Trigger the decomposition of a code fragment into kernels:

#pragma xlang decompose i

This directive decomposes loop i into kernels, as explained in Section 3.3.2. This step corresponds to the tiling into computation kernels. X-language generates one file per kernel found.
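For instance (an illustrative fragment; the placement of the directive is an assumption based on the description above), the jj loop of a mini-MMM could be marked for decomposition, so that X-language writes the corresponding 1D kernel into its own file:

    for (ii = 0; ii < NI; ii++)
      for (kk = 0; kk < NK; kk++) {
    #pragma xlang decompose jj
        for (jj = 0; jj < NJ; jj++)
          c[ii][jj] += a[ii][kk] * b[kk][jj];
      }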

Micro-optimization of these kernels currently still requires another compilation step using X-language. Generation of the code testing array alignment stability is automatic. The selection of the best tile is not yet automatic.

3.4.2 Experimental Environment

Three different codes from dense linear algebra have been targeted for validating our approach: Matrix Matrix Multiply (DGEMM), Matrix Vector Multiply (DGEMV) and 1D Convolution. On the hardware side, two different architectures have been used:

Itanium 2 A BULL NovaScale server featuring a 1.6 GHz Itanium 2 processor with 3 cache levels was used. Out of these 3 levels, only the 2nd level (256 KB, unified) and the 3rd level (9 MB, unified) can contain floating point values. The processor offers 128 floating point registers and can issue up to 6 instructions per cycle.

Pentium 4 A PC equipped with an Intel Pentium 4 Prescott 2.80 GHz processor with two cache levels, L1 D-cache (16 KB) and L2 (1 MB, unified), was used. The processor can issue up to 2 instructions per cycle and supports the SSE2 extensions. The SSE2 instructions allow vectorizing most of the standard floating point operations (up to 2 double precision words can be packed in a single instruction). On this Pentium version, the number of registers available for SSE2 is limited to 8, making register allocation a sensitive issue. Our kernels and codes were compiled using the Intel ICC compiler v9.0 on both platforms. The code generated with our approach was compared with the MKL library (version 8.02) and the ATLAS library generator (version 3.6), the same release numbers being used for both machines.

3.4.3 A few 1D kernels

From the matrix-matrix multiply code, different DAXPY kernels were generated with different unrolling factors. We implemented DAXPY k,1. Figure 3.6 presents two examples: DAXPY 2,1 for a size of 200 and DAXPY 4,1 for a size of 800. We sum up the results in Figure 3.7; the presented performance is in cycles/FMA. The number k of DAXPY k represents the number of arrays involved in the kernel. For each k we have different unrolling factors U1, U2, U3 and U4. As we can see, the results are not stable, ranging from 0.5 cycle/FMA to 1.5 cycle/FMA depending on the unrolling factors. The efficient codes are those with numerous computations, like DAXPY 12, which we will reuse later with a performance of 0.509 cycle/FMA. This DAXPY 12 kernel is built from 24 FMAs, 24 loads and 24 stores.

DAXPY 2,1 of size 200:

    for (i = 0; i < 200; i++) {
      c1[i] += a11 * b1[i];
      c2[i] += a21 * b1[i];
    }

DAXPY 4,1 of size 800:

    for (i = 0; i < 800; i++) {
      c1[i] += a11 * b1[i];
      c2[i] += a21 * b1[i];
      c3[i] += a31 * b1[i];
      c4[i] += a41 * b1[i];
    }

Figure 3.6: DAXPY examples

Still from the DGEMM code, dot-product kernels are generated and transformed. The results are presented in Figure 3.8. Performance depends on the unrolling factors.

3.4.4 Kernels for DGEMM

Dotproduct-k, k A dotproduct-k, k kernel of length n has a fairly small working set: 2 × k × n + k × k. If there is enough register space, it also offers a very good memory/arithmetic ratio: 2 × k loads for k × k multiply-adds. Increasing k decreases the stress on load instruction scheduling but at the same time increases register pressure, up to the point where the register allocator has to insert expensive spill/fill instructions or to reload some of the operands. Results for the Itanium are presented in Figure 3.9. For all kernels (and vector length ranges), the working set is small enough so that all of the data is contained in the L2 cache (for instance, with k = 4 and n = 200 it amounts to 1616 doubles, about 12.6 KB). Increasing k improves performance because kernels become more and more floating point dominated. Dotproduct-4,4 offers peak performance around 6.2 GFlops but this requires a vector length of at least 200 (using larger vector lengths, however, does not generate any performance loss). However, for k = 8, the register pressure is too high: the compiler starts inserting spill/fill instructions and, furthermore, the loop body becoming too complex, the compiler can no longer software pipeline the loop, which explains the much lower performance level obtained by this kernel. On the Pentium 4 (see Figure 3.10), the L1 size being much smaller, increasing the vector length pushes some of the operands out of L1. This explains the general trend on all curves: performance quickly decreases for larger vector lengths. Now, since there are only 8 SSE2 registers, starting with k = 2 the compiler can no longer keep data in registers but instead reloads them. From this angle, increasing k increases data reuse within L1 and therefore performance improves with larger values of k. However, since many reload instructions have to be issued (even if some can be combined with arithmetic), the overall performance of all dotproduct kernels (under 1.8 GFlops) is fairly disappointing.

Outerproduct-k, k An outerproduct-k kernel of length n has a larger working set: 2 × k × n + n × n. The memory/arithmetic ratio is good: 2 × n × k loads for n × n stores and n × n multiply-adds, but it requires large temporary storage (much larger than the register space). However, it should be noted that register pressure increases quickly as a function of k but also as a function of n; therefore data has to be reloaded (fortunately most of the time from cache). On Itanium 2 (see Figure 3.11), the increase in working set requirement explains the drop-off in performance around n = 900. For n larger than 900, the operands can no longer be kept in the L3 cache and some of them come from main memory. A smaller drop-off occurs between 100 and 200, because there the operands are no longer contained in the L2 cache. However, it should be noted that the L3 organization is such that very quickly (n = 400) performance levels are high again, although now a large fraction of the operands come from L3. This explains why minimizing the number of L2 misses is not such an important issue.


Figure 3.7: DAXPY kernel performance with different unrolling factors.


Figure 3.8: Dot-product kernel performance with different unrolling factors.


Figure 3.9: Dotproduct-k, k kernel on Itanium 2 with k = 1, 2, 4, 8.


Figure 3.10: Dotproduct-k, k kernel on Pentium 4 with k = 1, 2, 4, 8.


Figure 3.11: Outerproduct-k kernel on Itanium 2 with k = 1, 2, 4, 8.

Increasing k improves performance (not because of register reuse but due to reuse in L2) and allows reaching performance levels close to peak, although slightly lower than the dotproduct-k kernel. On Pentium 4 (see Figure 3.12), the performance drop-off appears much sooner: for n = 300, the working set exceeds the L2 cache size. Increasing k improves operand reuse in L1 but, since L1 is fairly small, already k = 8 exceeds the L1 capacity. However, it should be noted that the outerproduct-4 kernel on Pentium performs much better than all of the dotproduct-k, k kernels.

DGEMV-k A DGEMV-k kernel of length n has a working set of 2 × k × n + n × n. The memory/arithmetic ratio is good: k × n + n × n loads for k × n stores and k × n × n multiply-adds, but it requires large temporary storage (much larger than the register space). This type of tile corresponds to a 2D kernel performing k vector-matrix products (named dgemv). For this type of tile, the fastest DGEMM uses a kernel of dotproduct 1,1 unrolled 10 times, requiring a matrix transposition of b. Performance results are displayed in Figure 3.13 (Itanium 2) and Figure 3.14 (Pentium 4). A 50% speedup is obtained w.r.t. ATLAS and performance follows that of the MKL library. Performance drops around N = 800 because the tile size exceeds the cache size. At this point, outer tiling or the use of another kernel of our library can correct this degradation.

3.5 Conclusion

This chapter proposed a new automated approach for generating highly optimized code, addressing ILP issues as well as data locality issues. This approach relies on state-of-the-art compilers and does not require any hand coding. It has been successfully validated on Itanium 2 and Pentium 4 architectures for BLAS3 routines, outperforming ATLAS and being very competitive with the highly tuned MKL routines.


Figure 3.12: Outerproduct-k kernel performance on Pentium 4 with k = 1, 2, 4, 8.


Figure 3.13: DGEMV-kernel performance on Itanium 2 with k = 1, 2, 4, 8.


Figure 3.14: DGEMV-kernel performance on Pentium 4 with k = 1, 2, 4, 8.

It is clear that we rely in a critical manner on a "good" compiler to achieve high performance on our building blocks/kernels. As a test, we used our approach replacing ICC with GCC. On the Itanium 2, the performance results were much lower, while on the Pentium the performance gap was smaller. This clearly shows the necessity of having a very good compiler for the kernels. However, since the kernels used in our approach are fairly simple, a specialized "compiler" for such kernels can be developed. A good example of such a compiler is the XLG [89] tool developed by CAPS Enterprise. This code generator is fairly specialized in the sense that it only targets vector loops, but for such loops it is capable of applying very aggressive optimizations: for example, it evaluates the performance and resource impact of several loop unrolling degrees and in the end generates several versions, selecting the best one depending on the vector length. Unlike in Chapter 2, the division into kernels makes the compiler's task easier than transforming the whole code with X-language: the search space is smaller and performance is very close to peak. In this chapter, we did not discuss the recomposition of kernels into the original code. This point is crucially important to preserve the performance of the tested kernels. In the next part, we tackle the study of these kernels and the problem of their recomposition across the different levels of the cache hierarchy.

Chapter 4

Kernel Recomposition

This chapter proposes a new approach for the generation of performance libraries. This approach is not application-specific and is based on the decomposition method of Chapter 3, which generates highly tuned kernels. The library is produced through a selection and combination of the best kernels. The combination is found by solving an integer linear program. We show that our method generates high-performance libraries, in particular for BLAS. We compare the results obtained with vendor libraries and automatic generators (ATLAS [86] and MKL [65]). The contribution is to show how to generate an efficient code from these different kernels and from the optimized codes using these kernels. The key idea, presented in this chapter, is to resort to a performance model (no additional benchmarking) in order to evaluate the performance of each decomposition, according to the input size.

4.1 Library Generation Scheme

The objective of the library generation is to choose the best performing code among the different possible decompositions into kernels, depending on the input size and on the individual kernel performance measures. Kernel performance measurements are performed while data is in cache. This implies that memory copies (or prefetches) are necessary in order to reproduce the same performance in the library function. One of the keys to obtaining performance is to keep the number of copies small, exploiting reuse whenever possible. Working sets and iteration spaces (depending on input sizes) are used in order to determine whether some copies can be hoisted, removed or coalesced. A decision tree selects, according to the parametric input sizes, the most suited code. The general scheme of this method is represented in Figure 4.1. The recomposition method is presented in Figures 4.2, 4.3 and 4.4.

Decomposition (Figure 4.2) Starting from an unoptimized function code, several transformations are applied in order to generate different optimized versions. We resort to a meta-compilation language in order to explore a large set of possible optimized codes (Chapter 2). The transformations applied on the source code are all well known: loop unrolling, tiling, strip-mining, scalar promotion, among others. The legality of the transformations is checked, but the exploration of transformation sequences and of the parameter space is not guided by a model or some heuristic. The user defines, using pragmas, ranges for the optimization parameters and defines a set of transformations to consider specifically for some particular code fragment. The parameters correspond for instance to unrolling factors or tiling sizes. This step has been presented in Chapter 3.

[Figure 4.1 gives the general scheme: the initial code goes through transformation, kernel identification and kernel benchmarking; the library generation then produces a decision tree selecting code according to the input parameters.]

Figure 4.1: Decomposition and Recomposition of a code

Kernel results are stored in a file: names, sizes and performances are written. The performance of computation kernels is expressed differently from that of copy kernels: for the former it is in cycles/FMA and for copy kernels it is in cycles per copied element. The results are normalized to easily compare kernel performances; a sketch of such a record is given below. From each generated version, the innermost loops are extracted out of their application context and the data structures are simplified accordingly (array dimensions indexed by counters of loops that are no longer in the kernel are projected out). The kernels generated are simpler to compile, and take less time to execute than the whole application. As they meet the static control program requirements, their execution time depends only on the input sizes, on the location of the input data in the cache hierarchy and on the alignment of the arrays. Their execution time does not depend on the input values. This means in particular that the kernels generated can be extensively benchmarked while the input data is in cache, and the performance of kernels can be considered as tabulated. This approach relies on the compiler to generate good code and in particular to handle register allocation, instruction scheduling (exhibiting ILP) and SIMD vectorization if possible. State-of-the-art compilers are able to take care of these transformations.
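A minimal sketch of one entry of this result file (the field names and fixed sizes are assumptions, not the actual file format):

    /* Sketch: one record of the kernel result file. Computation kernels
       are stored in cycles/FMA and copy kernels in cycles per copied
       element, so entries can be compared and summed by the recomposition. */
    typedef struct {
        char   name[32];     /* e.g. a dot-product or copy kernel name   */
        int    size[3];      /* loop bounds of the kernel (up to 3 dims) */
        double perf_L2;      /* measured cost with data resident in L2   */
        double perf_L3;      /* measured cost with data resident in L3   */
        double perf_mem;     /* measured cost with data in main memory   */
    } kernel_record;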

The decomposition step produces a table of kernel measurements:

    kernels   Size1   Size2   ...   L2 Perf   L3 Perf    (unit: cycle/FMA)
    k1          10      50    ...     0.5       0.7
    k2          15      50    ...     0.6       0.8
    k3          20      60    ...     0.55      0.75
    ...
                                                          (unit: cycle/copied element)
    copy1       20      60    ...     0.2       0.2
    copy2       15      50    ...     0.25      0.25
    ...

Figure 4.2: Recomposition - step 1

Constraints generation (Figure 4.3) Using the input data sizes of the problem, constraints are generated from the previously benchmarked kernels. The initial code is used to guide the recomposition.

In Figure 4.3, an objective function is created from the kernel performances; its unit is the cycle. As we will see in this chapter, predicting the code performance is feasible with a static evaluation. Constraints on the problem sizes are also generated: N1 and N2 are the problem sizes, and the numbers of kernels of each size, weighted by these sizes, must add up to them. Xi is the number of calls to kernel i.

The constraint generation step takes the kernel table of Figure 4.2 and produces a cost function and size constraints of the form:

    Min( X1 × 0.5 × 10 + ... + X1 × 0.5 × 50 + copy1 × 20 × 0.2 + ... )
    10 X1 + ... + 15 X15 = N1
    50 X1 + ... + 50 X15 = N2

Figure 4.3: Recomposition - step 2

Library generation (Figure 4.4) Using these constraints, the solver finds the number of kernels to use in each dimension and for the different sizes. It generates a decision tree: each node is a condition on the problem sizes, and each leaf corresponds to kernel calls, together with the number of calls for each kernel.

[Figure 4.4 shows the last step: the solver takes the initial code, the tile parameters N1 and N2 and the minimization problem, and generates the library functions as a decision tree whose leaves call the selected kernels (k1, k2, k3, copy2, ...).]

Figure 4.4: Recomposition - step 3

Usually, kernels are enclosed in surrounding loops. These loops are perfectly nested and the body code is regular. Kernel codes access arrays that depend on one loop of the kernel. Accurately predicting the performance cost depends on the data location. We consider two cases:

Data in cache A code whose data is in cache is not memory-bound. No transformation is needed. However, each kernel has a specific mapping for its data. For instance, if we use a dot-product

77 kernel for the recomposition of a matrix-matrix multiply, we have to transpose one matrix among them. So, the copy can change the data mapping. For the recomposition, we will use Xi kernels of Sizei for each dimension n corresponding to the surrounding loops.

Data out of cache If the data is outside the cache, we have to bring it into the cache. For that, we use loop tiling (in the following parts, we only consider sizes that are multiples of the stride). Therefore, we have to use a copy to bring the data into the cache and to change the data mapping.
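Such a copy is nothing more than a block transfer into a contiguous temporary; a minimal sketch (the block shape and argument names are illustrative assumptions):

    /* Sketch: bring a rectangular block of a large row-major array into a
       contiguous, cache-resident temporary before calling the kernels.   */
    void copy_block(int rows, int cols, int lda, const double *a, double *tmp)
    {
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++)
                tmp[i * cols + j] = a[i * lda + j];
    }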

4.1.1 Performance Modeling

The decomposition produces kernels that take a large enough number of cycles, so that out-of-order execution and other mechanisms that enhance parallelism at the low level no longer have any impact (there is no overlap due to parallelism between two successive kernels). When data stays in cache, the performance of two successive kernel executions (of the same kernel or not) adds up. This is the same principle as for library calls. Depending on cache effects, performance may vary: if there is some reuse among kernels, the second execution may take advantage of this reuse and perform better (interaction between kernels). Prefetch, likewise, may impact performance if the first kernel prefetches data of the second. As they are compiled separately and their iteration domains are constant, there is little chance that the compiler inserts many prefetches that are useless for the current kernel. Array padding inside a kernel is already taken into account by the different measures. Between different executions of kernels, data locality is dismissed. Prefetch effects from one kernel to the other are ignored; data reuse is taken into account provided a whole input of a kernel is reused (a vector, a matrix, ...). If only part of a vector is reused, no reuse at all is considered. Moreover, we consider the cache as fully associative. This is a bit unrealistic, but the cache size is evaluated by benchmarks instead of being taken from the documentation. Hence performance is additive, and at worst additivity provides an upper bound on the real performance.

4.2 Recomposition algorithm

After the decomposition step, the recomposition combines kernels to obtain optimized code. The number Xi used previously carries two indices in the remainder of this chapter. A kernel has a cost C associated with each cache hierarchy level and a Size. The number of kernels of each kind is represented by the variable Xij, with i being the name of the kernel and j the dimension of the kernel.

4.2.1 With only one kernel

Consider a decomposition where one code uses a single kernel K1; the code has the following shape:

    for (i1 = 0; i1 < n1; i1 += Size1)
      ...
        for (iq = 0; iq < nq; iq += Sizeq)
          K1(...);

The constraint system is then built as follows.

1. A constraint on the footprint compels the data to remain in cache: Card(Footprint(i1,...,iq)) < CacheSize. This constraint comes from the cache size; with it, the kernel performs within a specific cache level (Card is the number of elements).

2. We define the cost function to minimize, which computes the total number of cycles: min( Σ_{k=1..p} Cost1k × X1k ), where Cost1k is the cost of kernel 1 with dimension k.

3. We then consider the constraints on the kernel dimensions. A kernel has different dimensions, and this constraint bounds the number of kernels of each kind. The size constraint is the following: S : Σ_{k=1..p} X1k × Size1k = ⌈n1⌉, with X1k ≥ 0.

4.2.2 Extension for different kernels

From the previous constraints, we extend the constraint system to q kernels.

1. min( Σ_{i=1..q} Σ_{k=1..p} Costki × Xki )

2. S :
   Σ_{k=1..p} X1k × Size1k = ⌈n1⌉, with X1k ≥ 0
   ...
   Σ_{k=1..p} Xpk × Sizepk = ⌈np⌉, with Xpk ≥ 0

The decomposition produces kernels with performance measured for the different cache levels, as well as copy kernels. Each kernel cost is stored in a file. The results are presented as the following table:

    Kernel      Dim1     ...   Dimp      Cache1    ...   Cachez
    kernel 1    Size11   ...   Sizep1    Cost11    ...   Cost1z
    ...
    kernel k    Size1k   ...   Sizepk    Costk1    ...   Costkz

As mentioned in the previous part, kernel tests are done on an architecture with a cache hierarchy. For example, to reproduce the performance of a kernel tested with data located in cache level N, the data must be in this cache level N. To move data into a lower cache level, two methods exist:

1. The first method is the use of prefetch instructions. The prediction nevertheless depends on the architecture implementation. The problem of this method is the scheduling of prefetch instructions (the Itanium 2 reschedules the prefetching mechanisms and can even remove them).

2. The second method is the use of copies. Unlike the previous method, the copy cost is higher because of the loads and stores, but it is more predictable: the loads and stores are not deleted the way prefetches can be.

Both mechanisms have a cost that will be added to the cost of the kernel during the recomposition. To achieve the performance prediction, we will have to demonstrate that measuring kernels outside of their application context is possible. Therefore, we will show that the number of cycles needed to recompose kernels is the sum of the number of cycles of each kernel tested separately. We have to show that the sum of performances remains linear, just like that of the copy mechanisms. Once the kernel performance is measured for each size, the most efficient one is used.

Algorithm: Kernel recomposition
Input: V kernels generated by the algorithm in Figure 3.2
Output: Library generation.

1. ∀i ∈ V and ∀j ∈ Cache Level, measure kernel cost Costij with p dimensions Size1i,..., Sizepi

2. ∀i ∈ V, ∀p ∈ Dimensioni of each kernel i, measure copy costs Costij of size p with its multiples

3. Generate constraint system from the initial code and generate cost minimization function.

4. Generate solution for each size of the problem and minimize cost function with solver.

5. Generate library code:

(a) Produce the value for each size of the tile code K with the solver.
(b) Solve the footprint problem by testing different values for the blocking sizes.
(c) Create a decision tree for each size.
(d) Generate loop code for the kernel codes, with a copy for each node of this tree (if needed).

Figure 4.5: Kernel recomposition
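To make Steps 3 and 4 concrete, here is a small worked instance (the kernel sizes and costs are invented for illustration). Suppose two 1D kernels of sizes 100 and 200 with in-cache costs of 0.6 and 0.55 cycle/FMA, and a one-dimensional problem of size n1 = 700:

    minimize    0.6 × 100 × X100 + 0.55 × 200 × X200
    subject to  100 × X100 + 200 × X200 = 700,  X100, X200 ≥ 0 and integer

The optimum is X200 = 3 and X100 = 1 (390 cycles), so the generated code would call the size-200 kernel three times and the size-100 kernel once.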

We use a linear solver that resolves this system in a parametric way, in order to obtain a code that performs well for all sizes. From the solution of this solver (PIP [34]), we generate a library code where all the kernels are integrated with their numbers of calls. The decision tree selects the adapted kernels for a given code with a given input size. This part is the basis of the construction of the function libraries that will be built from the kernels. This tree is built from the experimental measures for each size. We have to pay attention that the kernel working set does not exceed the size of the innermost cache holding floating point data, i.e. the L2 cache on Itanium 2 and the L1 cache on Pentium 4. We can take this step because we know that kernel performance is linear for each cache level.

Linear solver With the constraints generated in Step 3, the solver generates a solution for each multiple size of the problem in Step 4. The solver takes into account the copies needed between the different cache levels. For instance, if the best performance for a kernel is obtained in cache Cache1 and the data is in Cache2, we have to use a copy from Cache2 to Cache1. Using this method, we have a simple model for the recombination. The reuse factor helps to determine which kernels can participate in the recombination. From the resulting values of the solver, we generate a code in Step 5; this code contains the decision-tree code with the different loops of kernel calls, with the right number of calls for each size of the problem. To minimize the performance cost function, the cost of the kernel recombination must be linear. In the following section, we show this linearity for the use of these kernels.

4.3 Code generation from constraint systems

In Figure 4.3, the constraint generation step produces a system with the following parameters: tile sizes, kernel numbers, performances.

Library generation affects the generation method. Indeed, using a library implies creating different functions that perform well for different sizes. We have to build libraries with the opportunity to use versioned kernels for each value; therefore, we have to use a generator able to find a solution for all sizes. We set aside numerical solvers like CPLEX: although powerful, this kind of tool cannot solve the problem in a general way, since the solver would have to be called for each size and the system can only be solved for one size at a time. We have to choose a tool that can solve the constraint problem for every size of the problem at once. We use the PIP tool, which finds a solution for each size: it solves the problem in a parametric way. For a library, the difference between these two kinds of tools is that, in the first case, the solver has to be called each time a function is called with a given size. PIP is a parametric integer linear programming solver. It finds the lexicographic minimum (or maximum) in the set of integer points belonging to a convex polyhedron. Unlike integer programming tools like lp_solve or CPLEX, the polyhedron may depend linearly on one or more integer parameters. If the user asks for a non-integral solution, PIP can give the exact solution as an integral quotient. The PIP core is the parameterized Gomory's cuts algorithm followed by the parameterized dual simplex method. For example, let a constraint be:

    D2(k) = { (i, j) | i ≤ m, j ≤ n, k ≤ i + j }

We encode this expression; the PIP input file and the corresponding output solution file have the following form (both listings are recovered from the original two-column layout):

    (input file)
    ( (Lower bound on j after loop inversion
       (unknowns j i)
       (parameters k m n)
       2 3 3 1 -1 1
       ( #[ 0 -1  0  0  1  0]
         #[-1  0  0  0  0  1]
         #[ 1  1  0 -1  0  0] )
       ( #[-1  1  1  0] ) ) )

    (output solution file)
    ( (Lower bound on j after loop inversion
       (unknowns j i)
       (parameters k m n))
      (if #[ -1  1  0  0]
          (list #[ 0  0  0  0]
                #[ 1  0  0  0] )
          (list #[ 1 -1  0  0]
                #[ 0  1  0  0] ) ) )

To express this solution, no new parameter had to be introduced. The form associated with the first conditional is −1 × k + 1 × m + 0 × n + 0 × 1 = m − k, so the test should be read as m − k ≥ 0. If this inequality holds, then the solution is <0, k>; otherwise, the solution is <k − m, m>. To sum things up, the lexicographical minimum of D2 is:

    if m − k ≥ 0 then <0, k> else <k − m, m>

Hence the lower bound on the first coordinate is: if m − k ≥ 0 then 0 else k − m. Using this solution, we transform the matrix form into the library code with a simple pattern-matching method.

4.4 Experimental results

At the beginning of this chapter, the problem constraints were established. A requirement for dealing with this problem is kernel performance linearity. In the following parts, we present experiments showing that it is possible to make kernel performance linear. Different examples of computations are treated, particularly dense linear algebra computations.

Our kernels and codes were compiled using the Intel ICC compiler v9.0 on both platforms. The code generated with our approach was compared with the MKL library (version 9.0), the ATLAS library generator (version 3.6) or Intel IPP (version 5.1), the same release numbers being used for both machines. In this chapter, the experiments are performed in the following way: we call the kernels 20 times without flushing the cache, we modify the alignment of the input arrays, and the data is double precision.

4.4.1 Matrix-vector multiply on Itanium 2

We will show that the kernels used to recompose the matrix-vector multiply have linear performance. From these codes, matrix-vector multiply, DAXPY and dot-product kernels are generated. We particularly study the recombination using DAXPY and dot-product kernels. We begin with the recomposition of the matrix-vector multiply, after evaluating these kernels. We found the best kernels for each size with different unrolling factors. We used a kernel size of 16000 floating point values in order to fit in the L2 cache. The SIZE parameter has been varied for the 2 loops of Figure 4.6. In Figure 4.6-(b), the matrix-vector multiply performance is given in cycles/FMA, superposed with the number of L2 and L3 misses per FMA. Green lines show the L2 cache limit, then the L3 one. The dotted line represents the predicted performance while the solid line is the measured performance; the prediction is within 1% of the experimental curve. The recomposition performance is predictable except at the L3 cache limit. The figure presents the recomposition as a function of SIZE, together with the location of the data in the memory hierarchy. For the L2 cache, performance is the same as for the kernels tested in vitro: 0.51 cycle/FMA. We have a linear zone that goes up to about 6.5 MB; then L3 misses appear and degrade performance beyond the L3 cache limit. This value corresponds to the limit where TLB misses become so important that they decrease performance.

    for (i = 0; i < 16000/SIZE; i++) {
      transpose(a+i);
      kernel_dotproduct(a, b, c);
    }

    kernel_dotproduct(a, b, c) {
      #pragma unroll(4)
      for (j = 0; j < SIZE; j++)
        c += b[j] * a[j];
    }

Figure 4.6: (a)-Matrix-Vector composed with Dot-product kernels (b)-Performance in cycle/FMA superposed with the number of cache L2 and L3 misses/FMA

In this second experiment, a DAXPY kernel unrolled twice, of size 400, has been used to recompose the matrix-vector multiply. This kernel (called DAXPY 12 because the external loop is unrolled twelve times) performs at around 0.51 cycle/FMA. In Figure 4.7, this kernel is used for the recomposition.

    for (i = 0; i < 16000/SIZE; i += 12) {
      transpose(a+i);
      kernel_daxpy(a + i*SIZEA, b, c);
    }

    kernel_daxpy(a, b, c) {
      int i;
      double *A1, ..., *A12;
      double alpha1, ..., alpha12;
      A1 = a;
      A2 = a + SIZEA;
      ...
      A12 = a + 11*SIZEA;
      alpha1 = b[0];
      ...
      alpha12 = b[11];
      for (i = 0; i < 400; i += 2) {
        c[i]   += alpha1*A1[i]   + alpha2*A2[i]   + ... + alpha12*A12[i];
        c[i+1] += alpha1*A1[i+1] + alpha2*A2[i+1] + ... + alpha12*A12[i+1];
      }
    }

Figure 4.7: (a)-Matrix-Vector composed with DAXPY kernels (b)-Performance in cycle/FMA superposed with the number of cache L2 and L3 misses/FMA

In Figure 4.8, the DAXPY recomposition using DAXPY kernels and the dot-product recomposition using dot-product kernels are represented. On each plot, the L2 cache limit corresponds to the outbreak of L2 cache misses, but without any performance modification; however, at the L3 cache limit, L3 cache misses begin to decrease performance. The predicted plot is within 1% of the experimental plot. In Figure 4.8-(c), the use of a copy for the dot-product kernels shows that copy performance is linear. In Figure 4.9, kernels on Pentium 4 are represented. These kernels use the SSE instructions. The dot-product recomposition using dot-product kernels is represented; in Figure 4.9-(c), the use of a copy for the dot-product kernels again shows that copy performance is linear. The common point between these results is the cache limit for the outermost cache memory. The explanation is linked to the memory page size and TLB misses. Indeed, the operating system, a Red Hat distribution with a 2.6.7 kernel, uses 64 KB pages. For a cache of 9 MB, we can use only 6 MB. That is why the performance collapse occurs far from the theoretical limit of the cache.

4.4.2 A real example: dot-product library generation

The dot-product kernel on Itanium 2 is reused. Different versions of kernels are generated for the following code:

    for (i = 0; i < N; i++)
      c += a[i] * b[i];


Figure 4.8: Performance in cycle/FMA superposed with the number of cache L2 and L3 misses/FMA of (a)-Daxpy kernels (b)-Dot-product kernels (c)-Dot-product kernels using copy kernels

These kernels are transformed by blocking for the outermost cache. 90 kernels are tested, with a partial unrolling factor U on loop i. A copy is added so that the kernel runs in the cache level where it was benchmarked. The blocked code has the following shape:

    for (i = 0; i < ... ; i += ...) {
      // copy a0, a1, ..., aU, b0, b1, ..., bU
      for (ii = 0; ii < ... ; ii++)
        ...
    }

For instance, with U = 2 and a kernel size of 1400:

    // copy a0, a1, b0, b1
    for (ii = 0; ii < 1400; ii++) {
      *c0 += a0[ii] * b0[ii];
      *c1 += a1[ii] * b1[ii];
    }

We are particularly interested in the recomposition of the different kernels generated by this method. The table in Figure 4.10-(a) lists the best kernel for each size; for every kernel, its name, its size and its performance in cycle/FMA are given. The memory-level performance is measured with a memory flush before each measurement. The solver computes the number of instances of each kernel, and the repetition loops calling the kernels are then created by the program. For instance, one kernel of size 200 gives one loop with a stride of 200, whose iteration count is the value returned by the solver. The generated loop has the following form:


Figure 4.9: Performance in cycle/FMA superposed with the number of cache L1 and L2 misses/FMA of (a)-Dot-product kernels (b)-Copy kernels (c)-Dot-product kernels using copy kernels, on Pentium 4

Name       Size   L2 Perf   L3 Perf   Mem Perf
dp1 1         1      2         3        5.14
dp2 25       25      1.4       1.56     5.14
dp2 50       50      1.20      1.34     5.14
dp4 100     100      1.09      1.20     5.14
dp8 200     200      0.77      0.93     5.14
dp6 300     300      0.74      0.90     5.14
dp2 400     400      0.66      0.84     5.14
dp2 500     500      0.67      0.85     5.14
dp2 600     600      0.63      0.82     5.14
dp2 700     700      0.64      0.81     5.14
dp2 800     800      0.61      0.77     5.14
dp2 900     900      0.62      0.80     5.14
dp2 1000   1000      0.60      0.77     5.14
dp2 1100   1100      0.61      0.78     5.14
dp2 1200   1200      0.60      0.77     5.14
dp2 1300   1300      0.60      0.76     5.14
dp2 1400   1400      0.59      0.74     5.14
dp2 1500   1500      0.60      0.76     5.14

Figure 4.10: (a)-Kernel list used to build the dot-product library with size and performance (b)-Library performance in cycle/FMA for the dot-product library superposed with the number of L2 and L3 cache misses/FMA

for (i=0; i<200*nb_dp8_200; i+=200) {
  dp8_200(A+i, B+i, C);
}

The library results are shown in Figure 4.10-(b). The performance curve predicted by our linear solver is within 1% of the curve representing the real performance. The kernel performances are used by the solver to set up the constraints seen in the previous part; the goal of this solver is to compute the number of instances of each kernel to use for the given problem dimensions.
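The exact constraints given to the solver are not reproduced here; as an illustration only, the decomposition problem it has to solve has roughly the following shape, where $s_k$ denotes the size of kernel $k$, $c_k$ its measured cost in cycles per FMA, $n_k$ the number of times it is called and $N$ the problem size:

\[ \min_{n_k \in \mathbb{N}} \; \sum_k n_k \, s_k \, c_k \quad \text{subject to} \quad \sum_k n_k \, s_k = N . \]

A parametric integer programming solver such as PIP can handle this kind of problem with $N$ left as a parameter, which is what yields the decision tree and the closed-form kernel counts shown below.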

For instance, if the initial code is a dot-product of size 1000, different kernel combinations can be used: 10 kernels of 100, 5 kernels of 200, 2 kernels of 500, or 1 kernel of 500 and 5 kernels of 100, etc. In Figure 4.11, we present the decision tree for the dot-product. Using the conditions Cn, we can find the best way to use the most efficient kernels for a given problem size.
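As a concrete illustration with the L2 Perf column of Figure 4.10-(a): 10 kernels of size 100 (dp4 100) run at about 1.09 cycle/FMA, 5 kernels of size 200 (dp8 200) at about 0.77 cycle/FMA, and 2 kernels of size 500 (dp2 500) at about 0.67 cycle/FMA, so the solver naturally favours decompositions built from the largest, most efficient kernels.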

(Decision tree: starting from Init, the conditions C1 to C12 are tested in sequence; each path through the tree ends with a call to one of the kernel combinations A, B, C or D.)

Figure 4.11: PIP Output code after translation

A code corresponding to leaf (A) is the following:

{
  current_size = 0;
  p1 = (24 * L) / 25;
  p2 = (7 * p1) / 24;
  p3 = (1 * p2) / 7;
  p4 = (59 * p3) / 60;
  p5 = (3 * p3) / 4;
  p6 = (1 * p5) / 3;
  nb_dp1      = 200 * L - 45 * p4 - 19819 * p6;
  nb_dp2_25   = 1 * L - 100 * p6;
  nb_dp2_1400 = -15 * p4 + 59 * p6;
  nb_dp2_1500 = 14 * p4 - 55 * p6;

  for (i = current_size; i < current_size + 1500*nb_dp2_1500; i += 1500)
    dp2_1500(A+i, B+i, C);
  current_size += 1500*nb_dp2_1500;

  for (i = current_size; i < current_size + 1400*nb_dp2_1400; i += 1400)
    dp2_1400(A+i, B+i, C);
  current_size += 1400*nb_dp2_1400;

  for (i = current_size; i < current_size + 25*nb_dp2_25; i += 25)
    dp2_25(A+i, B+i, C);
  current_size += 25*nb_dp2_25;

  for (i = current_size; i < current_size + nb_dp1; i += 1)
    dp1(A+i, B+i, C);
  current_size += nb_dp1;
}

This code gives the number of kernels of each kind to use, under the conditions C1, C2 and C3, which test whether the input size is a multiple of the kernel sizes. Since we have such conditions, with their associated kernels, for every multiple, we can generate a library for all problem sizes. The library results are shown in Figure 4.12; the performance curve predicted by our linear solver is close to the curve representing the real performance.
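As an illustration of how such leaves can be assembled into one library entry point, here is a minimal dispatch skeleton in C; the conditions and the leaf bodies are placeholders, the real ones being the PIP conditions C1 to C12 and kernel mixes such as the fragment for leaf (A) above.

/* Hypothetical dispatch skeleton: the real conditions and kernel counts
   are those produced by the solver, not the placeholders used here. */
#define C1(L) ((L) % 25 == 0)      /* illustrative condition */
#define C2(L) ((L) >= 1500)        /* illustrative condition */

void dotproduct_lib(double *A, double *B, double *C, long L)
{
    if (C1(L)) {
        if (C2(L)) {
            /* leaf A: compute nb_dp2_1500, nb_dp2_1400, nb_dp2_25, nb_dp1
               as above, then run the corresponding kernel loops */
        } else {
            /* leaf B */
        }
    } else {
        /* leaves C and D, guarded by the remaining conditions */
    }
}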


Figure 4.12: Library performance of dot-product with 11 dot-product kernels

4.5 Decision tree for DGEMV and DGEMM

The library generation for the matrix-vector product is built from the previously tested kernels, and the results are recorded in the following tables. We also added the kernels of the matrix-matrix multiply, and the same steps as before are applied to these two codes. The performances of the kernels that will compose the matrix-vector multiply and the matrix-matrix multiply are therefore given in the tables of Figure 4.13 and Figure 4.14. In the table concerning DGEMM only, the kernel performances are measured in L2: a copy is done before the computation of each kernel, so the data are in the L2 cache. In the table of Figure 4.15, we give the results returned by the solver for different sizes: the 'Comput. Predicted' column is the performance predicted for each problem size with the chosen kernels, the 'Copy Predicted' column is the predicted cost of the copy kernels, and the last column is the sum of these two costs.

4.5.1 Results compared to vendor libraries

If kernels are used without recomposition, they are efficient for only one cache level. They can however be improved by recomposition, which allows them to perform well across different cache levels and different sizes. In the following parts, we present the results of the recomposition for a square matrix-matrix multiply, a matrix-matrix multiply with rectangular matrices, a matrix-vector multiply, a LAPACK function and a 1D convolution algorithm.

Kernel Name       T1     T2    CL2   CL3   CMem
dgemv 160 4 4     100    160    53    75    590
dgemv 80 4 4      200     80    53    75    590
dgemv 40 8 2      400     40    52    74    590
dgemv 32 8 1      500     32    52    74    590
dgemv 20 5 2      800     20    52    74    590
dgemv 16 4 4     1000     16    51    72    590
dgemv 10 5 1     1600     10    51    72    590
dgemv 8 4 4      2000      8    51    72    590
dgemv 4 4 2      4000      4    50    71    590
dgemv 2 2 5      8000      2    63    83    590

Figure 4.13: (a)-Kernel list used to build the DGEMV library with size and performance (b)-Performance in cycle/FMA for the DGEMV library superposed with the number of L2 and L3 cache misses/FMA

Kernel Name        T1    T2    T3    CL2
nkkn 100 1 100    100     1   100     90
nkkn 200 1 200    200     1   200    102
nkkn 300 1 300    300     1   300    107
nkkn 400 1 400    400     1   400    106
......
knnk 1 100 1        1   100     1    510
knnk 1 200 1        1   200     1    437
knnk 1 300 1        1   300     1    445
knnk 1 400 1        1   400     1    425
......
knnn 1 100 100      1   100   100    123
knnn 1 200 200      1   200   200    133
knnn 1 300 300      1   300   300    133
knnn 1 400 400      1   400   400    133
knnn 1 500 500      1   500   500    130
knnn 1 600 600      1   600   600    132
knnn 1 700 700      1   700   700    128

Figure 4.14: (a)-Kernel list used to build the DGEMM library with size and performance (b)-Performance in cycle/FMA for the DGEMM library superposed with the number of L2 and L3 cache misses/FMA

DGEMM with MKL and ATLAS

DGEMM can be decomposed in several ways, leading to different kinds of kernels: dot product, daxpy, dgemv or outer product. For square matrices, the kernel performance analysis and the combination phase selected the dotproduct 4,4 kernel on the Itanium 2 and the daxpy-4,1 kernel on the Pentium 4. The resulting performance on the whole square matrix multiply is presented in Figure 4.16. It is interesting to note that on the Pentium 4 the performance obtained still lags behind MKL; the main explanation is that we did not succeed in making the compiler make the best use of the 8 SSE2 registers. Different kernels are used for different sizes. For the rectangular matrix multiply, our results are much better than the Intel MKL library, because our kernels are adapted to small sizes.

SIZE   Num    Kernel           Comput. Predicted   Copy Predicted   Total
100     25    nkkn 100 4 100        0.53               0.02          0.55
200     50    knnk 4 200 4          0.52               0.01          0.53
300     50    knnk 6 300 6          0.51               0.02          0.53
400    100    knnk 4 200 4          0.52               0.01          0.53
500    125    knnk 4 500 4          0.51               0.014         0.52
600    100    knnk 6 300 6          0.51               0.015         0.53
700    175    knnk 4 700 4          0.51               0.02          0.53
800    200    knnk 4 800 4          0.51               0.027         0.54

Figure 4.15: Results of solver

If we compare these last results with the results of Chapter 2, performance shows a 10% speedup. It is easier for compilers to obtain good code from decomposed codes.
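To make the decomposition concrete, here is a minimal sketch of a square DGEMM recomposed from calls to a fixed-size dot-product kernel; the kernel length KB, the function names and the use of a pre-transposed copy of B are illustrative choices, not the exact output of the generator.

#include <stddef.h>

#define KB 400   /* dot-product kernel length (illustrative) */

/* fixed-size dot-product kernel: the unit tuned by the iterative search */
static double dp_kernel(const double *a, const double *b)
{
    double s = 0.0;
    for (int k = 0; k < KB; k++)
        s += a[k] * b[k];
    return s;
}

/* C (n x n, row-major) += A * B, where Bt is B already transposed
   (row-major, Bt[j*n+k] == B[k*n+j]); n is assumed to be a multiple of KB */
void dgemm_from_kernels(const double *A, const double *Bt, double *C, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k += KB)
                C[(size_t)i*n + j] += dp_kernel(&A[(size_t)i*n + k], &Bt[(size_t)j*n + k]);
}

The transposed copy plays the same role as the copy kernels of the previous sections: it makes both operands of each dot-product call contiguous.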

(Plots: GFlops as a function of N for ATLAS, our kernel-based version (KERNEL/KNL) and MKL; panels: DGEMM NxN on Itanium 2 and on Pentium 4 (top), DGEMM (6xN)(Nx6) on Itanium 2 and DGEMM (Nx2)(2xN) on Pentium 4 (bottom).)

Figure 4.16: DGEMM kernel performance on Itanium 2 (left) and Pentium 4 (right).

1D Convolution

The code presented below is an example of how to reuse kernel micro-optimizations for other codes. Indeed, this code can be decomposed, after tiling, into daxpy and dot-product kernels again. Performance is 10% better than the Intel IPP library and twice as good as the naive code (Figure 4.17).

for (i=0; i<N; i++)
  for (j=0; j<n; j++)
    c[i] += a[i+j] * b[j];
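One way to expose such kernels, sketched below under the assumption that c is zero-initialized and that a holds at least N+n-1 elements, is to interchange the loops so that each filter tap becomes one DAXPY kernel call (names are illustrative):

/* tuned daxpy kernel: y[i] += alpha * x[i] for i in [0, N) */
static void daxpy_kernel(double alpha, const double *x, double *y, int N)
{
    for (int i = 0; i < N; i++)
        y[i] += alpha * x[i];
}

/* 1D convolution c[i] = sum_{j<n} a[i+j]*b[j], recomposed from daxpy calls */
void conv1d(const double *a, const double *b, double *c, int N, int n)
{
    for (int j = 0; j < n; j++)
        daxpy_kernel(b[j], a + j, c, N);   /* one kernel call per filter tap */
}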

(Plots: GFlops as a function of N for the NAIVE, KERNEL and IPP versions, on Itanium 2 and on Pentium 4.)

Figure 4.17: 1D convolution performance on Itanium 2 (left) and Pentium 4 (right) with n=10.

DGEMV on Itanium 2

In Figure 3.13, the results of the kernel decomposition for DGEMV were lower than the Intel MKL library: only one DGEMV kernel was used to perform the matrix-vector multiply. In the results of Figure 4.18, DAXPY kernels were used, in addition to the others, to build the DGEMV function, and the performance is visibly better (a sketch of this DAXPY-based recomposition is given after the list below). There are three curves:

• KNL_rec represents the performance with different kernels, such as DAXPY 10 or DGEMV kernels,

• KNL represents performance of Figure 3.13,

• MKL is Intel MKL library performance.
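A minimal sketch of what the DAXPY-based recomposition of DGEMV looks like is given below; the kernel width (two columns per call), the names and the column-major layout are illustrative assumptions, not the exact kernels selected by the search.

#include <stddef.h>

/* fused 2-column daxpy kernel: y += a0*c0 + a1*c1 over m elements */
static void daxpy2(double a0, const double *c0,
                   double a1, const double *c1, double *y, int m)
{
    for (int i = 0; i < m; i++)
        y[i] += a0 * c0[i] + a1 * c1[i];
}

/* y += A*x, A being m x n in column-major order */
void dgemv_from_daxpy(const double *A, const double *x, double *y, int m, int n)
{
    int j = 0;
    for (; j + 2 <= n; j += 2)        /* two columns handled per kernel call */
        daxpy2(x[j], A + (size_t)j*m, x[j+1], A + (size_t)(j+1)*m, y, m);
    for (; j < n; j++)                /* remainder column */
        for (int i = 0; i < m; i++)
            y[i] += x[j] * A[(size_t)j*m + i];
}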


Figure 4.18: DGEMV-kernel performance on Itanium 2 with k = 1, 2, 4, 8.

4.5.2 LAPACK potrs

In this example, potrs, a routine from the LAPACK library, is optimized with our approach. It solves a Hermitian positive definite system of linear equations AX = B using the Cholesky factorization. This code (from the potrs function) has been fissioned so as to be decomposed into DAXPY kernels. The main point of this code is that it has loops which depend on other loops. The following code is an extract of the potrs code. In Figure 4.19, the kernel versions show around a 20% speed-up.

for (j = N; j >= 1; --j) {
  for (k = M; k >= 1; --k) {
    tB[k][j] = alpha * tB[k][j];
    tB[k][j] /= tA[k][k];
    for (i = 0; i < k-1; i++) {
      tB[i][j] -= tA[i][k] * tB[k][j];
    }
  }
}
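The innermost i loop of this extract is exactly a (strided) DAXPY, which is what the fission exposes; a sketch of the substitution is shown below, where the strided kernel, the leading dimensions LDA and LDB and the call site are illustrative, not the generator's exact output.

/* strided daxpy kernel: y[i*incy] += alpha * x[i*incx] for i in [0, len) */
static void daxpy_s(int len, double alpha,
                    const double *x, int incx, double *y, int incy)
{
    for (int i = 0; i < len; i++)
        y[i*incy] += alpha * x[i*incx];
}

/* For row-major arrays tA[?][LDA] and tB[?][LDB], the loop
     for (i = 0; i < k-1; i++) tB[i][j] -= tA[i][k] * tB[k][j];
   can be replaced by a single kernel call:
     daxpy_s(k-1, -tB[k][j], &tA[0][k], LDA, &tB[0][j], LDB);  */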

(Plots: GFlops as a function of N for ATL, KNL and MKL; panels: Potrs with M=200 (top) and M=600 (bottom), each on Itanium 2 (left) and Pentium 4 (right).)

Figure 4.19: LAPACK Potrs on Itanium 2 (left) and on Pentium 4 (right) with M=200 and M=600.

4.6 Method extension

In these last two chapters, we created an optimization framework. The weakness of this tool is the number of kernels to test, which can take a long time. To tackle this problem, a tool that statically analyzes the assembly code can give useful feedback to choose a kernel among a huge number of candidates without running any test. The second thing to improve is to use a more accurate model in order to get a better prediction.

4.6.1 Kernel tests

In the kernel optimization, one step of the method is really expensive in CPU time: the kernel tests. Many transformations are applied to each kernel and the resulting variants are exhaustively tested on the target architecture, which can take several minutes. Decreasing this time means testing fewer kernels. To keep the number of tests small, a model can be used to predict performance without any experiment. MAQAO [58], a tool for assembly code analysis, helps to bypass the kernel testing step: code features can be obtained statically.

In Figure 4.20, kernels are sorted according to the static complexity of the loop (in cycles/iteration) and the ratio between the issues and the bound. We will see that the measured performance is strongly tied to these static features. We are particularly interested in a study of the dot-product kernel. In this experiment, the results are produced by varying the unrolling factors U1 and U2 of the following code, and we compare them to the static analysis of MAQAO. The experimental results are given in the table below, where U1 and U2 are the unrolling factors of loops i and j. The size T is the iteration count of loop j, and 16000/T that of loop i; this size is chosen so that the code remains in the L2 cache. The columns MIN, MAX and AVG respectively give the performance of each kernel in cycles/FMA. The columns C1 and C2 give the static performance provided by MAQAO for the most used loop and for the second most used loop; the performance found by MAQAO and the measured performance are close. The last column, Ratio, is the ratio between the issues and the bound: it measures the quality of the assembly code, a code being really efficient when a bundle is completely full (the value is 1). MAQAO results are reliable because the real performance and the estimated performance match.

#pragma unroll(U1)
for (j = 0; j < T; j++) {
  #pragma unroll(U2)
  for (i = 0; i < 16000/T; i++)
    c[i] += a[j][i] * b[j];
}

Figure 4.20: Matrix-vector code

Kernel Name      U1   U2   16000/T    T    MIN     MAX     AVG    C1         C2         ratio
DGEMV 8 4 4       4    4     2000     8   0.507   0.508   0.508   0.5 x N    0.5 x N    1
DGEMV 4 4 2       4    2     4000     4   0.504   0.588   0.509   0.63 x N   0.5 x N    1.2
DGEMV 4 4 1       4    1     4000     4   0.504   0.587   0.509   0.63 x N   0.5 x N    1.2
DGEMV 8 8 2       8    2     2000     8   0.507   0.538   0.510   0.5 x N    0.5 x N    1
DGEMV 8 8 1       8    1     2000     8   0.507   0.538   0.511   0.5 x N    0.5 x N    1
DGEMV 10 5 1      5    1     1600    10   0.509   0.558   0.512   0.5 x N    0.5 x N    1
DGEMV 10 5 2      5    2     1600    10   0.509   0.558   0.512   0.5 x N    0.5 x N    1
DGEMV 4 4 4       4    4     4000     4   0.505   0.599   0.512   0.5 x N    0.5 x N    1
DGEMV 8 8 4       8    4     2000     8   0.508   0.539   0.513   0.5 x N    0.5 x N    1
DGEMV 16 4 4      4    4     1000    16   0.513   0.514   0.513   0.5 x N    0.5 x N    1
DGEMV 10 10 10   10   10     1600    10   0.593   0.608   0.513   0.69 x N   0.52 x N   1
DGEMV 16 4 2      4    2     1000    16   0.513   0.537   0.514   0.63 x N   0.5 x N    1.2
DGEMV 16 4 1      4    1     1000    16   0.513   0.538   0.514   0.63 x N   0.5 x N    1.2
DGEMV 16 12 10   12   10     1000    16   1.158   1.189   1.164   1.13 x N   2 x N      2.25
......
DGEMV 20 12 6    12    6      800    20   1.165   1.231   1.173   2 x N      0.57 x N   2
DGEMV 8 5 5       5    5     2000     8   1.134   1.285   1.176   2 x N      0.6 x N    2
DGEMV 10 6 12     6   12     1600    10   1.167   1.242   1.177   2 x N      0.61 x N   2
DGEMV 20 12 8    12    8      800    20   1.176   1.242   1.181   2 x N      0.85 x N   2
DGEMV 10 6 10     6   10     1600    10   1.172   1.244   1.182   2 x N      0.62 x N   2
DGEMV 32 12 10   12   10      500    32   1.180   1.197   1.187   1.13 x N   2 x N      2.25
DGEMV 32 10 12   10   12      500    32   1.205   1.225   1.211   2 x N      0.8 x N    2
DGEMV 10 6 5      6    5     1600    10   1.262   1.328   1.273   2 x N      0.6 x N    2
DGEMV 20 12 5    12    5      800    20   1.306   1.374   1.312   2 x N      0.58 x N   2

In this table, the last kernels are the least efficient. We notice that their ratio is close to 2, which means that the code is not well optimized: the static complexity of the loop (2 x N) is too high to reach good performance. The first kernels have a ratio around 1 and a complexity around 0.5 x N, which corresponds to well-optimized code. Finally, the exhaustive search of our method can be replaced by an assembly code analysis, which improves the execution time of the library generation.
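A quick sanity check of these static figures, assuming the Itanium 2 peak of two FMAs per cycle: a loop whose N FMAs are scheduled in 0.5 x N cycles keeps both FMA units busy on every cycle, which is why those kernels measure close to 0.5 cycle/FMA with a ratio of 1, whereas a 2 x N-cycle schedule leaves most issue slots of that loop unused, which is consistent with the last kernels running more than twice slower.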

4.6.2 Model for an accurate performance prediction

For the recomposition, the approach is not fully accurate: the data locality between kernels is not taken into account in the performance estimation. When a code accesses an array element, it loads this element together with the contiguous data, which then reside in the cache. If another kernel subsequently operates on data already in cache, the performance evaluation overestimates the real cost. A model would be required to predict data locality; such a model could also specify whether the copy mechanism is useful or not.
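As a sketch of the idea only (the footprint estimate and the choice of thresholds are illustrative assumptions), such a model could select which measured cost to charge for a kernel call depending on the data it shares with the previous call:

#include <stddef.h>

/* pick the cost level (L2, L3 or memory) for a kernel call from a rough
   estimate of the data reused from the previous call and of the total footprint */
double predicted_cost(double cost_L2, double cost_L3, double cost_mem,
                      size_t reused_bytes, size_t total_footprint,
                      size_t l2_size, size_t l3_size)
{
    if (reused_bytes > 0 && total_footprint <= l2_size) return cost_L2;
    if (reused_bytes > 0 && total_footprint <= l3_size) return cost_L3;
    return cost_mem;   /* no reuse, or working set larger than the caches */
}

The same kind of estimate could also decide whether inserting a copy kernel is worth its cost.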

Chapter 5

Conclusion

5.1 Contributions

Due to complex interactions between hardware mechanisms and software dynamic behaviors, high-performance optimization at compile time is an extremely difficult task. State-of-the-art compilers are still challenged to achieve high performance independently of runtime or micro-architectural parameters. To address part of this problem, scientific code sections are organized around sets of highly optimized, standardized functions that are provided by vendors as stand-alone libraries. These libraries are either hand-tuned or generated by an automatic tuning system, and they are specifically dedicated to the optimization of a single set of kernels. This thesis advocates an empirical approach for the automatic tuning of the loops which account for a large fraction of the execution time in most scientific applications. We focus on the portability of optimizations through high-level source-to-source transformations.

First, we studied a purely generative approach with meta-programming languages like MetaOCaml, and we created a language expressing transformations and the way to search their parameters. We demonstrated that such languages can address the problem of obtaining high performance. Their main drawback is that they remain difficult for a user to master: when transformations are applied to source code, their legality becomes part of the optimization process, transformation sequences may be created or rejected, and users have to predict the code resulting from each transformation. The creation of a specific language for transformations has not solved the problem of transformation expressivity.

An automatic method was then created, using meta-programming techniques, for library generation in the spirit of ATLAS. It is a general-purpose approach, unlike domain-specific code generators, and unlike ATLAS it only uses source code. With all potential code transformations, the number of test cases in an iterative search is really huge. The optimization process therefore first relies on the decomposition of critical code sections (vector loops) into simpler codes. Several variants of these codes are generated with an iterative search, and a systematic micro-benchmarking step explores a large parameter space. As the search operates on code fragments, the search space is smaller. These measurements make it possible to find the set of parameters for which the variants give optimal performance. Next, the original code sequences are replaced by versions rebuilt from those simple, optimized kernels. Thus, the system incrementally generates a library which contains the detected code sequences and their optimized versions, so that the optimization of a given application may reuse already optimized kernels instead of redoing a full optimization process. A simple performance model has been created to recompose these kernels. Throughout this study, relying on existing compilers was one of our goals, in order to keep this optimization process portable.

Finally, the main results are close to those of domain-specific libraries. Compared with ATLAS, our approach is always better: a 15% speedup for the matrix-matrix multiply, 20% for the matrix-vector multiply, 50% for Potrs and more than 200% for the matrix-matrix multiply with non-square matrices. Compared with the Intel MKL library, the results are nearly the same for the matrix-matrix multiply but 30% better for non-square matrices, and similar for the 1D convolution and Potrs functions. Our results are generated automatically and are efficient on two architectures. Compared with the Intel IPP library, our results are 10% better on Itanium 2 and 1% better on Pentium 4. This approach is not only efficient for general-purpose applications, it is also portable across different architectures.

5.2 Method limitations

Our approach allows us to obtain good results with regular codes, and particularly with perfectly nested loops and static control. To sum up the limitations, we use only:

• for-loops with static controls. The kernel construction needs to have constant loop-trip counts.

• computation-bound codes with a large amount of accessed data. On codes already divided into kernels, like FFTW, the use of such a method is useless: the kernels are so simple that compilers already generate quite efficient code. The major limitation concerns the algorithm itself: our approach does not deal with algorithmic complexity. It cannot transform algorithms, it can only improve the generated code.

• 4 elementary transformations.

• an efficient compiler.

However, these limitations are not so strict. We can work around them with different transformations or techniques:

• for while-loops, we will apply a tiling which transforms loops into for-loops with constant loop-trip counts.

• for loops with non-static controls, we apply versioning and the Deep Jam algorithm [15] which applies unrolling and jam on these loops.

• we must have a low number of elementary transformations. If it is not the case, the time to find a solution is so huge that we have to use machine learning and models to decrease this time. Testing kernels takes a long time. The execution can be bypassed using a model which can predict the performance or the kernel selection can be improved by machine learning models.

5.3 Future work

Even if library generation is the main target of our method, the kernels generated by this method are also data for a knowledge database: we can reuse them as building blocks for different programs.

5.3.1 GPU processors

Inspired by the attractive Flops/dollar ratio and the incredible growth in the speed of modern graphics processing units (GPUs), using a cluster of GPUs for high-performance scientific computing becomes more and more frequent. A Graphics Processing Unit (also occasionally called Visual Processing Unit or VPU) is a dedicated graphics rendering device for a personal computer, workstation, or game console. Modern GPUs are very efficient at manipulating and displaying computer graphics, and their highly parallel structure makes them more effective than typical CPUs for a range of complex algorithms (matrix-matrix multiply for instance). As more and more GPUs are used for high-performance computing, we have to adapt our method to use them. We can use the compiler specific to the GPU to generate codes; with the X-Language, we can directly use assembly code in order to produce code specific to this processor. We can then measure the performance of these functions on the GPU and apply the recomposition method to obtain high performance.

5.3.2 Multi-core processors

A multi-core microprocessor (or chip-level multiprocessor, CMP) is one that combines two or more independent CPU cores into a single package, often a single integrated circuit. A dual-core device contains two independent microprocessors and a quad-core device contains four. A multi-core microprocessor implements multiprocessing in a single physical package. The latest CPUs for high-performance computing are multi-core: Intel Montecito, AMD 64, Pentium 4 Woodcrest. This kind of architecture is more and more used in the world of high performance, and we need to adapt our method to computations on these machines. A core behaves like a single processor: we can optimize a code for one core and then parallelize it. The kernel decomposition divides codes into smaller code fragments which perform well on a single core, and the same principle applies to multi-core processors. We use this method with parallel architectures, and research is currently under way to test the validity of the approach, relying on a parallelization layer such as OpenMP. We have already tested this solution for a kNNN kernel, and the performance was twice as high with two processors. This last case requires delicate trade-offs between locality and inter- and intra-processor parallelism. Bandwidth is also a limitation: kernels on different cores compete for memory access.
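As a sketch of this direction (and not the code used for the preliminary kNNN experiment), independent kernel instances can be distributed over the cores with OpenMP; the kernel itself is a placeholder here:

#include <omp.h>
#include <stddef.h>

/* placeholder tuned kernel: dot product on one block of 'block' elements */
static double dp_kernel_block(const double *a, const double *b, int block)
{
    double s = 0.0;
    for (int i = 0; i < block; i++)
        s += a[i] * b[i];
    return s;
}

/* recompose a large dot product from nb independent kernel calls,
   distributed over the available cores */
double dotproduct_parallel(const double *a, const double *b, int nb, int block)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int k = 0; k < nb; k++)
        sum += dp_kernel_block(a + (size_t)k * block, b + (size_t)k * block, block);
    return sum;
}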

5.4 Summary

Finally, metaprogramming was an excellent tool to generate kernels; its main problem was the difficulty of writing the code. The kernel decomposition fixed this problem of metaprogramming languages, but its iterative search can be as long as with the previous method. The main advantage of our technique is that it is general and efficient on different architectures. To compare our work with a basic iterative search, we use the following table:

Approach           Specificity         User-friendly   Time consuming   Performance
Iterative search   General-purpose     +               +++              +
Meta-programming   Customized search   ++              ++               ++
Kernels            Automatic           +++             +                +++

Bibliography

[1] F. Agakov, E. Bonilla, J. Cavazos, B. Franke, G. Fursin, M. O'Boyle, J. Thomson, M. Toussaint, and C. Williams. Using machine learning to focus iterative optimization. In Proceedings of the 4th Annual International Symposium on Code Generation and Optimization (CGO), 2006.

[2] Christophe Alias and Denis Barthou. Algorithm recognition based on demand-driven data- flow analysis. In WCRE ’03: Proceedings of the 10th Working Conference on Reverse Engineering, page 296, Washington, DC, USA, 2003. IEEE Computer Society.

[3] R. Allen and K. Kennedy. Optimizing Compilers for Modern Architectures. Morgan and Kaufman, 2002.

[4] G. Almasi, L. DeRose, B. Fraguela, J. Moreira, and D. Padua. Programming for Locality and Parallelism with Hierarchically Tiled Arrays. In Proc. of 16th International Workshop on Languages and Compilers for Parallel Computing (LCPC). Also in Languages and Compilers for Parallel Computing, LNCS 2958, editor Lawrence Rauchwerger, pages 374-389, Springer-Verlag, ISBN 3-540-21199-3, October 2003.

[5] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. LAPACK Users’ Guide. Society for Industrial and Applied Mathematics, Philadelphia, PA, third edition, 1999.

[6] D. I. August, D. A. Connors, S. A. Mahlke, J. W. Sias, K. M. Crozier, B.-C. Cheng, P. R. Eaton, Q. B. Olaniran, and W.-M. Hwu. Integrated predicated and speculative execution in the IMPACT EPIC architecture. In Proceedings of the 25th Intl. Symp. on Computer Architecture, pages 227–237, July 1998.

[7] David F. Bacon, Susan L. Graham, and Oliver J. Sharp. Compiler transformations for high-performance computing. ACM Computing Surveys, 26(4):345–420, 1994.

[8] D. Barthou, A. Cohen, and J.-F. Collard. Maximal static expansion. In 25th ACM Symp. on Principles of Programming Languages (PoPL’98), pages 98–106, San Diego, California, January 1998.

[9] C. Bastoul. Code generation in the polyhedral model is easier than you think. In Parallel Architectures and Compilation Techniques (PACT’04), pages 7–16, Juan-les-Pins, septem- ber 2004.

[10] C. Bastoul, A. Cohen, S. Girbal, S. Sharma, and O. Temam. Putting polyhedral loop transformations to work. In Workshop on Languages and Compilers for Parallel Computing (LCPC’03), LNCS, pages 23–30, College Station, Texas, October 2003.

[11] O. Beckmann, A. Houghton, P. H. J. Kelly, and M. Mellor. Run-time code generation in C++ as a foundation for domain-specific optimisation. In Proceedings of the 2003 Dagstuhl Workshop on Domain-Specific Program Generation, 2003.

[12] A. J. C. Bik, M. Girkar, P. M. Grey, and X. Tian. Automatic intra-register vectorization for the Intel architecture. Int. J. of Parallel Programming, 30(2):65–98, 2002.

[13] P. Boulet and X. Redon. SPPoC: Symbolic parameterized polyhedral calculator. http://www.lifl.fr/west/sppoc.

[14] C. Calcagno, W. Taha, L. Huang, and X. Leroy. Implementing multi-stage languages using ASTs, Gensym, and reflection. In ACM SIGPLAN/SIGSOFT Intl. Conf. Generative Programming and Component Engineering (GPCE’03), pages 57–76, 2003.

[15] P. Carribault, A. Cohen, and W. Jalby. Deep Jam: Conversion of coarse-grain parallelism to instruction-level and vector parallelism for irregular applications. In Parallel Architec- tures and Compilation Techniques (PACT’05), St-Louis, Missouri, September 2005. IEEE Computer Society Press. To appear.

[16] A. Chauhan and K. Kennedy. Optimizing strategies for telescoping languages: procedure and procedure vectorization. In ACM Int. Conf. on Supercomputing (ICS’04), pages 92–101, June 2001.

[17] C. Chen, J. Chame, and M. W. Hall. Combining models and guided empirical search to optimize for multiple levels of the memory hierarchy. In CGO ’05, pages 111–122, 2005.

[18] P. Clauss. Counting solutions to linear and nonlinear constraints through Ehrhart polyno- mials: Applications to analyze and transform scientific programs. In ACM Int. Conf. on Supercomputing, pages 278–295. ACM Press, 1996.

[19] A. Cohen, S. Donadio, M.-J. Garzaran, D. Padua, and C. Herrmann. In search for a pro- gram generator to implement generic transformations for high-performance computing. In 1st MetaOCaml Workshop (associated with GPCE), Vancouver, British Columbia, October 2004.

[20] A. Cohen, S. Girbal, D. Parello, M. Sigler, O. Temam, and N. Vasilache. Facilitating the search for compositions of program transformations. In ACM Int. Conf. on Supercomputing (ICS’05), Boston, Massachusetts, June 2005. To appear.

[21] A. Cohen, S. Girbal, and O. Temam. A polyhedral approach to ease the composition of program transformations. In Euro-Par’04, number 3149 in LNCS, pages 292–303, Pisa, Italy, August 2004. Springer-Verlag.

[22] S. Coleman and K. S. McKinley. Tile size selection using cache organization and data layout. In PLDI ’95, pages 279–290, 1995.

[23] A. Colmerauer and P. Roussel. The birth of prolog. pages 331–367, 1996.

[24] K. D. Cooper, D. Subramanian, and L. Torczon. Adaptive optimizing compilers for the 21st century. J. of Supercomputing, 23(1):7–22, 2002.

[25] A. Darte, Y. Robert, and F. Vivien. Scheduling and Automatic Parallelization. Birkhäuser, Boston, 2000.

[26] L. De Rose and D. Padua. Techniques for the translation of MATLAB programs into Fortran 90. ACM Trans. on Programming Languages and Systems, 21(2):286–323, 1999.

[27] L. Djoudi, D. Barthou, P. Carribault, C. Lemuet, J.-T. Acquaviva, and W. Jalby. A new tool for assembler analysis and optimization on epic architecture. In Proc. of the Epic Workshop (in conjunction with CGO’05), 2005.

[28] S. Donadio, J. Brodman, K.Yotov, T. Roeder, D. Barthou, A. Cohen, M. Garzaran, D. Padua, and K. Pingali. A language for the Compact Representation of Multiple Program Versions. In LCPC ’05, Hawthorne, New York, October 2005.

[29] J. J. Dongarra, J. Du Croz, S. Hammarling, and R. J. Hanson. An extended set of FOR- TRAN Basic Linear Algebra Subprograms. ACM Transactions on Mathematical Software, 14(1):1–17, March 1988.

[30] J. Ekhardt, R. Kaiabachev, and K. Swadi. Offshoring: Representing C and Fortran90 in OCaml. In 1st MetaOCaml workshop, page 11, October 2004.

[31] Engineering and scientific subroutine library. Guide and Reference. IBM.

[32] P. Feautrier. Array expansion. In ACM Int. Conf. on Supercomputing, pages 429–441, St. Malo, France, July 1988.

[33] P. Feautrier. Parametric integer programming. RAIRO Recherche Op´erationnelle, 22:243– 268, September 1988.

[34] P. Feautrier. Parametric integer programming. RAIRO Recherche Op´erationnelle, 22(3):243–268, 1988.

[35] P. Feautrier. Dataflow analysis of scalar and array references. Int. J. of Parallel Program- ming, 20(1):23–53, February 1991.

[36] P. Feautrier. Some efficient solutions to the affine scheduling problem, part II, multidimen- sional time. Int. J. of Parallel Programming, 21(6):389–420, December 1992. See also Part I, one dimensional time, 21(5):315–348.

[37] B. Fraguela, J. Guo, G. Bikshandi, M.J. Garzar´an, G. Almasi, J. Moreira, and D. Padua. The Hierarchically Tiled Arrays Programming Approach. In Proc. of Seventh Workshop on Languages, Compilers and Run-Time Support for Scalable Systems., October 2004.

[38] M. Frigo and S. G. Johnson. FFTW: An adaptive software architecture for the FFT. In Proc. of the ICASSP Conf., volume 3, pages 1381–1384, 1998.

[39] T. Fruhwirth. Theory and practice of constraint handling rules. Journal of Logic Program- ming, Special Issue on Constraint Logic Programming, 37(1-3):95–138, October 1998.

[40] G. Fursin, M. O'Boyle, and P. Knijnenburg. Evaluating iterative compilation. In 11th Workshop on Languages and Compilers for Parallel Computing, LNCS, Washington DC, July 2002. Springer-Verlag.

[41] K. Goto and R. van de Geijn. On reducing tlb misses in matrix multiplication. Technical report, The University of Texas at Austin, Department of Computer Sciences, 2002.

[42] M. Hall et al. Maximizing multiprocessor performance with the SUIF compiler. IEEE Computer, 29(12):84–89, December 1996.

[43] C. A. Herrmann and C. Lengauer. Parallelization of divide-and-conquer by translation to nested loops. J. of Functional Programming, 9(3):279–310, 1999.

[44] C. A. Herrmann and C. Lengauer. HDC: A higher-order language for divide-and-conquer. Parallel Processing Letters, 10(2/3):239–250, 2000.

[45] W. Jalby, C. Lemuet, and X. Le Pasteur. Wbtk: a new set of microbenchmarks to explore memory system performance for scientific computing. Int. J. High Perform. Comput. Appl., 18(2):211–224, 2004.

[46] S. Kamin, L. Clausen, and A. Jarvis. Jumbo: run-time code generation for Java and its applications. In ACM Conf. on Code Generation and Optimization (CGO'03), pages 48–56, 2003.

[47] KAP C/OpenMP for Tru64 UNIX and KAP DEC Fortran for Digital UNIX. http://www.hp.com/techsevers/software/kap.html.

[48] W. Kelly. Optimization within a unified transformation framework. Technical Report CS-TR-3725, University of Maryland, 1996.

[49] K. Kennedy. Telescoping languages: A compiler strategy for implementation of high-level domain-specific programming systems. In Proc. Intl. Parallel and Distributed Processing Symposium (IPIPS'00), pages 297–304, 2000.

[50] O. Kiselyov, K. N. Swadi, and W. Taha. A methodology for generating verified combinatorial circuits. In Embedded Software Conf. (EMSOFT'04), pages 249–258, Pisa, Italy, September 2004.

[51] P. Kisubi, P.M.W. Knijnenburg, and M.F.P. O'Boyle. The Effect of Cache Models on Iterative Compilation for Combined Tiling and Unrolling. In Proc. of the International Conference on Parallel Architectures and Compilation Techniques, pages 237–246, 2000.

[52] T. Kisuki, P. Knijnenburg, M. O'Boyle, and H. Wijshoff. Iterative compilation in program optimization. In Proc. CPC'10 (Compilers for Parallel Computers), pages 35–44, 2000.

[53] I. Kodukula, N. Ahmed, and K. Pingali. Data-centric multi-level blocking. In ACM Symp. on Programming Language Design and Implementation (PLDI'97), pages 346–357, Las Vegas, Nevada, June 1997.

[54] I. Kodukula and K. Pingali. Transformations for imperfectly nested loops. In Supercomputing '96, page 12, 1996.

[55] M. Lam. Software pipelining: an effective scheduling technique for VLIW machines. In PLDI '88: Proceedings of the ACM SIGPLAN 1988 conference on Programming Language Design and Implementation, pages 318–328, New York, NY, USA, 1988. ACM Press.

[56] C. L. Lawson, Richard J. Hanson, D. R. Kincaid, and Fred T. Krogh. Algorithm 539: Basic linear algebra subprograms for Fortran usage [F1]. ACM Trans. Math. Softw., 5(3):324–325, 1979.

[57] C. L. Lawson, Richard J. Hanson, D. R. Kincaid, and Fred T. Krogh. Basic linear algebra subprograms for Fortran usage. ACM Trans. Math. Softw., 5(3):308–323, 1979.

[58] L. Djoudi, D. Barthou, P. Carribault, C. Lemuet, J.-T. Acquaviva, and W. Jalby. Exploring application performance: a new tool for a static/dynamic approach. In Proceedings of the 6th LACSI Symposium, Santa Fe, NM, October 2005.

[59] V. Lefebvre and P. Feautrier. Automatic storage management for parallel programs. Parallel Computing, 24(3):649–671, 1998.

[60] C. Lengauer, D. Batory, C. Consel, and M. Odersky, editors. Domain-Specific Program Generation. Number 3016 in LNCS. Springer-Verlag, 2003.

[61] X. Li, M.-J. Garzaran, and D. Padua. A dynamically tuned sorting library. In ACM Conf. on Code Generation and Optimization (CGO’04), pages 111–124, San Jose, CA, March 2004.

[62] S. Liang, P. Hudak, and M. Jones. Monad transformers and modular interpreters. In ACM Symp. on Principles of Programming Languages (PoPL’95), pages 333–343, 1995.

[63] J. Llosa, A. Gonzalez, E. Ayguade, and M. Valero. Swing modulo scheduling: A lifetime- sensitive approach. In Parallel Architectures and Compilation Techniques (PACT’96), pages 80–87, 1996.

[64] V. Loechner and D. Wilde. Parameterized polyhedra and their vertices. Int. J. of Parallel Programming, 25(6), December 1997. http://icps.u-strasbg.fr/PolyLib.

[65] Intel math kernel library (intel mkl). Intel.

[66] S. S. Muchnick. Advanced Compiler Design & Implementation. Morgan Kaufmann, 1997.

[67] Q. Ning and G. R. Gao. A novel framework of register allocation for software pipelining. In ACM Symp. on Principles of Programming Languages (PoPL’93), pages 29–42, January 1993.

[68] D. Parello, O. Temam, A. Cohen, and J.-M. Verdun. Towards a systematic, pragmatic and architecture-aware program optimization process for complex processors. In ACM Supercomputing’04, page 15, Pittsburgh, Pennsylvania, November 2004.

[69] M. Poletto, W. C. Hsieh, D. R. Engler, and M. F. Kaashoek. ‘C and tcc: A language and compiler for dynamic code generation. ACM Trans. on Programming Languages and Systems, 21(2):324–369, March 1999.

[70] W. Pugh. A practical algorithm for exact array dependence analysis. Communications of the ACM, 35(8):27–47, August 1992.

[71] M. Puschel, J. Moura, J. Johnson, D. Padua, M. Veloso, B. Singer, J. Xiong, F. Franchetti, A. Gacic, Y. Voronenko, K. Chen, R. W. Johnson, and N. Rizzolo. SPIRAL: Code Gen- eration for DSP Transforms. Proceedings of the IEEE, To appear 2005. Special issue on “Program Generation, Optimization, and Adaptation”.

[72] M. Puschel, B. Singer, J. Xiong, J. M. F. Moura, J. Johnson, D. Padua, M. M. Veloso, and R. W. Johnson. SPIRAL: A Generator for Platform-Adapted Libraries of Signal Processing Algorithms. Journal of High Performance Computing and Applications, special issue on Automatic Performance Tuning, 18(1):21–45, 2004.

[73] L. Rauchwerger and D. Padua. The LRPD test: Speculative run–time parallelization of loops with privatization and reduction parallelization. IEEE Transactions on Parallel and Distributed Systems, Special Issue on Compilers and Languages for Parallel and Distributed Computers, 10(2):160–180, 1999.

[74] D. Remy. Using, understanding, and unraveling the OCaml language: from practice to theory and vice versa.

[75] M. Schordan and D. J. Quinlan. A source-to-source architecture for user-defined optimiza- tions. In Joint Modular Languages Conference (JMLC’03), volume 2789 of LNCS, pages 214–223. Springer-Verlag, August 2003.

[76] M. D. Smith. Overcoming the challenges to feedback-directed optimization. In ACM SIGPLAN Workshop on Dynamic and Adaptive Compilation and Optimization, pages 1– 11, 2000. (Keynote Talk).

[77] Standard performance evaluation corporation. http://www.spec.org.

[78] W. Taha. Multi-Stage Programming: Its Theory and Applications. PhD thesis, Oregon Graduate Institute of Science and Technology, November 1999.

[79] W. Taha. A sound reduction semantics for untyped CBN multi-stage computation. Or, the theory of MetaML is non-trivial. In Proc. of the ACM workshop on Partial Evaluation and semantics-based Program Manipulation (PEPM'00), pages 34–43, Boston, Massachusetts, 2000.

[80] W. Thies, F. Vivien, J. Sheldon, and S. Amarasinghe. A unified framework for schedule and storage optimization. In ACM Symp. on Programming Language Design and Imple- mentation (PLDI’01), pages 232–242, 2001.

[81] S. Triantafyllis, M. Vachharajani, N. Vachharajani, and D. I. August. Compiler optimization-space exploration. In CGO ’03: Proceedings of the international symposium on Code generation and optimization, pages 204–215, Washington, DC, USA, 2003. IEEE Computer Society.

[82] D. A. Turner. A new implementation technique for applicative languages. Software – Practice and Experience, 9(1):31–49, January 1979.

[83] A. Van Deursen, P. Klint, and J. Visser. Domain-specific languages: An annotated bibli- ography. ACM SIGPLAN Notices, 35(6):26–36, 2000.

[84] T. Veldhuizen. Using C++ template metaprograms. C++ Report, 7(4):36–43, 1995.

[85] T. Veldhuizen and D. Gannon. Active libraries: Rethinking the roles of compilers and libraries. In SIAM Workshop on Object Oriented Methods for Inter-operable Scientific and Engineering Computing, pages 21–23, October 1998.

[86] R. Clint Whaley, Antoine Petitet, and Jack J. Dongarra. Automated Empirical Optimization of Software and the ATLAS Project. Parallel Computing, 27(1–2):3–35, 2001. Also available as University of Tennessee LAPACK Working Note #147, UT-CS-00-448, 2000 (www.netlib.org/lapack/lawns/lawn147.ps).

[87] M. E. Wolf. Improving Locality and Parallelism in Nested Loops. PhD thesis, Stanford University, August 1992. Published as CSL-TR-92-538.

[88] M. Wolfe. Iteration space tiling for memory hierarchies. In Proceedings of the Third SIAM Conference on Parallel Processing for Scientific Computing, pages 357–361, 1989.

[89] Caps entreprise. http://www.caps-entreprise.com.

[90] K. Yotov, X. Li, G. Ren, M. Cibulskis, G. DeJong, M. Garzarán, D. Padua, K. Pingali, P. Stodghill, and P. Wu. A Comparison of Empirical and Model-driven Optimization. In Proceedings of the ACM SIGPLAN 2003 Conference on Programming Language Design and Implementation, pages 63–76. ACM Press, 2003.

[91] K. Yotov, X. Li, G. Ren, M. Garzaran, D. Padua, K. Pingali, and P. Stodghill. Is search really necessary to generate high-performance blas, 2005.
