Memory-aware Algorithms and Scheduling Techniques for Matrix Computations

Julien Herrmann

To cite this version:

Julien Herrmann. Memory-aware Algorithms and Scheduling Techniques for Matrix Computations. Distributed, Parallel, and Cluster Computing [cs.DC]. Ecole normale supérieure de lyon - ENS LYON, 2015. English. ￿NNT : 2015ENSL1043￿. ￿tel-01241485￿

HAL Id: tel-01241485 https://tel.archives-ouvertes.fr/tel-01241485 Submitted on 10 Dec 2015

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

Number assigned by the library: 2015ENSL1043

ÉCOLE NORMALE SUPÉRIEURE DE LYON - Laboratoire de l'Informatique du Parallélisme - UMR 5668 - LIP

THESIS

submitted in fulfillment of the requirements for the degree of Doctor of the University of Lyon, awarded by the École Normale Supérieure de Lyon. Specialty: Computer Science

under the École Doctorale Informatique et Mathématiques (Doctoral School of Computer Science and Mathematics)

publicly presented and defended on 25 November 2015 by

Julien HERRMANN

Memory-aware Algorithms and Scheduling Techniques for Matrix Computations

Thesis advisor: Yves ROBERT
Co-advisor: Loris MARCHAL

Examination committee:
Luc GIRAUD, Examiner
Pierre MANNEBACK, Examiner
Loris MARCHAL, Co-advisor
Yves ROBERT, Advisor
Oliver SINNEN, Reviewer
Denis TRYSTRAM, Reviewer

Contents

Introduction

I Linear Algebra

1 Mixing LU and QR factorization algorithms
  1.1 Introduction
  1.2 Hybrid LU-QR algorithms
    1.2.1 LU step
    1.2.2 QR step
    1.2.3 LU step variants
    1.2.4 Comments
  1.3 Robustness criteria
    1.3.1 Max criterion
    1.3.2 Sum criterion
    1.3.3 MUMPS criterion
    1.3.4 Extending the MUMPS criterion to tiles
    1.3.5 Complexity
  1.4 Implementation
  1.5 Experiments
    1.5.1 Experimental framework
    1.5.2 Results for random matrices
    1.5.3 Results for different processor grid sizes
    1.5.4 Results for special matrices
    1.5.5 Results for varying the condition number
    1.5.6 Assessment of the three criteria
  1.6 Related work
    1.6.1 LUPP
    1.6.2 LU NoPiv
    1.6.3 LU IncPiv
    1.6.4 CALU
    1.6.5 Summary
  1.7 Conclusion


2 Cholesky factorization on heterogeneous platforms
  2.1 Introduction
  2.2 Context
    2.2.1 Cholesky factorization
    2.2.2 Multiprocessor scheduling
  2.3 Tools and libraries
    2.3.1 StarPU runtime system
    2.3.2 Chameleon dense linear algebra library
    2.3.3 Simgrid simulation engine
  2.4 Lower bounds
    2.4.1 Linear Programming formulation
    2.4.2 Constraint Programming formulation
    2.4.3 Upper bounds on performance
  2.5 Experiments and Results
    2.5.1 Schedulers
    2.5.2 Experimental Setup
    2.5.3 Results
  2.6 Discussion
    2.6.1 dmda vs dmdas scheduler
    2.6.2 Mapping from Constraint Programming solution
    2.6.3 Constraint Programming schedule in actual execution
  2.7 Conclusion

II Scheduling Under Memory Constraints

3 Memory-aware list scheduling for hybrid platforms
  3.1 Introduction
  3.2 Related work
    3.2.1 Task graph scheduling
    3.2.2 Pebble game and its variants
    3.2.3 Scheduling with memory constraints
    3.2.4 Hybrid computing
  3.3 Model and framework
    3.3.1 Flow and resources constraints
    3.3.2 Memory constraints
    3.3.3 Optimization problem
  3.4 Tree traversal with pre-assigned tasks
    3.4.1 Complexity results on trees
    3.4.2 Depth-first traversals
    3.4.3 Heuristics
    3.4.4 Experiments
  3.5 General problem
    3.5.1 ILP formulation
    3.5.2 Heuristics
    3.5.3 Simulation results
  3.6 Conclusion

4 Assessing the cost of redistribution
  4.1 Introduction
  4.2 Related work
    4.2.1 Communication model
    4.2.2 General data redistribution
    4.2.3 Array redistribution
  4.3 Model and framework
    4.3.1 Definitions
    4.3.2 Cost of a redistribution
    4.3.3 Optimization problems
  4.4 Redistribution
    4.4.1 Total communication volume
    4.4.2 Number of parallel communication steps
    4.4.3 Evaluation of optimal vs. arbitrary redistributions
  4.5 Coupling redistribution and stencil computations
    4.5.1 Application model
    4.5.2 Complexity
  4.6 Experiments
    4.6.1 Setup
    4.6.2 Stencil
    4.6.3 QR factorization
  4.7 Conclusion

III Auto-adjoint computations

5 Optimal multistage algorithms for adjoint computation
  5.1 Introduction
  5.2 Framework
    5.2.1 The AC problem
    5.2.2 Algorithm model
  5.3 Solution of PROB∞(l, cm, wd, rd)
    5.3.1 Properties of an optimal algorithm
    5.3.2 Optimal algorithm
  5.4 Simulations
    5.4.1 Stumm and Walther's algorithm (SWA⋆(l, cm, wd, rd))
    5.4.2 Simulation setup
    5.4.3 Simulation results
  5.5 Conclusion and future work

Conclusion

Bibliography

Publications

Introduction

The processing power of computers has always been inextricably linked to the characteristics of their memory. Since the early 1940s, many different technologies have been designed to store information. Built between 1943 and 1945, the world's first electronic digital computer, the ENIAC, was composed of twenty vacuum tube accumulators, each of them able to store one 10-digit decimal number. Vacuum tubes had a considerable power consumption and frequently overheated, leaving the ENIAC non-functional most of the time. Instead of storing data in individual bits, the next generation of computer memory, delay line memory, turned electrical pulses into sound waves and sent them through a medium that slowed them down. This significantly increased the storage capacity and the reliability of computational systems. Delay line memory was sequential-access, as opposed to modern random-access memory, and required power to maintain the information. In the late 1940s, magnetic-core memory became the first non-volatile random-access memory, and it remained the main form of computer memory from the mid 1950s to the mid 1970s. It was gradually replaced by semiconductor-based memory, which is still used as primary storage in today's computers. The cost of this memory has led computer designers to introduce a secondary storage that is cheaper but slower and not directly accessible by the CPU. Since their introduction in 1956, hard disk drives have usually served as secondary storage in modern computers. With the rise of storage capacity, the memory architecture has become more and more convoluted. In the mid 1960s, the concept of virtual memory was introduced to give operating systems the ability to deal with these complex hierarchies of memory.

The need to properly manage the memory usage of computational systems increased with the introduction of multiprocessor architectures. The Cray X-MP computer, released in 1982, was one of the first shared-memory parallel vector processors. It housed two CPUs embedded within a single machine that could concurrently access the same shared memory, for a peak system performance of 400 million floating-point operations per second (400 Megaflops). In the race for performance, the number of processors per machine kept on growing but faced technical issues. Instead of endlessly adding computational units within a single computer, multiprocessor machines were interconnected to form computer clusters. In such systems, the processors hosted on different machines work together following a distributed-memory pattern, significantly complicating the memory architecture. In the 1990s, supercomputers with thousands of processors began to appear, and ASCI Red was the first system to break through the 1 Teraflops barrier in 1996. Twelve years later, the IBM Roadrunner was the first supercomputer to reach Petascale performance. As of today, the Tianhe-2 supercomputer, released in 2013, is ranked number one in the TOP500 list of supercomputers [1] at the speed of 33.86 Petaflops on the HPL Linpack benchmark [3]. Pushed by the increasing need of scientific applications for processing power, reaching the Exascale performance mark has become a common goal for computer designers. Recently, on 29 July 2015, U.S. President Obama established the National Strategic Computing Initiative (NSCI) to accelerate the development of an Exascale system [5].
It is an effort "to create a cohesive, multi-agency strategic vision and federal investment strategy in High Performance Computing


(HPC)". Such HPC systems are essential in science, engineering, technology, and industry. They are used in a wide range of computationally intensive tasks, like large data analysis or simulation of complex physical systems. Many scientific applications rely on supercomputers, such as the simulation of seismic activity, evolving climate, brain activity, nuclear fusion, among others. Most of these applications use subroutines to perform basic matrix computations. Developing efficient implementations of these linear algebra subroutines has become a crucial objective since the mid 1970’s. In 1979, a specification for these common kernel operations using scalars and vectors was published: the level-1 Basic Linear Algebra Subroutines (BLAS). With the advent of vector processors, BLAS was augmented with level- 2 kernel operations that concerned vector-matrix operations, such as vector-, or linear triangular system resolution. Level-2 BLAS was designed to get the most of vector processors that were operating on one-dimensional arrays of data. In 1987, level-3 BLAS introduced kernel operations that concerned matrix-matrix operations, such as matrix-matrix multiplication, or dense linear system solvers. These level-3 BLAS operations partition the matrices into blocks, keeping data manipulations localized, allowing for a better usage of the cache. Many linear algebra libraries are available. Among them, we can highlight LAPACK [2], released in 1992, that provides a well-tuned BLAS implementation to exploit the caches on modern cache-based architectures. LAPACK is written in FORTRAN and is the successor of LINPACK that is still used as a benchmark for modern supercomputers. Indeed, linear algebra problems are computationally intensive jobs and represent a good way to compare the processing power of HPC systems. The TOP500 list ranks the supercomputers twice a year based on their performance on solving a dense linear system using an LU factorization with partial pivoting and a 64 bit floating point arithmetic. Optimizing these linear algebra libraries for tomorrow’s supercomputers is one of the main concern in the HPC community. To built the next generation of supercomputers, new scientific breakthroughs are required to cope with extreme scale computing. In February 2014, the Advanced Scientific Computing Advisory Com- mittee (ASCAC) published the top ten Exascale research challenges [6]. This list suggests that a good memory management in those more and more complex architectures is a sine-qua-non condition to achieve the development of an Exascale system. Indeed, there is a growing disparity between the speed of the CPU and the main memory throughput, mainly due to the limited communication bandwidth. From 1986 to 2000, processor speed improved at an average annual rate of 55% while memory speed only improved at 10% [109]. One of the main consequences is that the processors may found them- selves lacking in work and waiting for data to arrive from memory. To limit the number of memory accesses, processors are connected to an ever-increasing amount of high-speed cache memory. With the increasing size of the applications, multi-processor systems tend to provide a separate dedicated memory for each processor. This is the Non-Uniform Memory Access (NUMA) architecture. On most supercom- puters, processors are equipped with accelerators, such as GPUs, FPGAs or the more recent Xeon Phi. 
The CPU offloads compute-intensive portions of the code to these coprocessors, increasing the overall performance of the system. Altogether, with the increasing complexity of memory architectures and the general heterogeneity at every level, modern supercomputers offer an intricate challenge for software designers. Such systems are hard to model and lead to numerous algorithmic problems.

In this thesis, we focus on designing memory-aware algorithms and new scheduling techniques suited to modern storage architectures, for various applications. In the different chapters, we take data communications and memory usage into account while designing new algorithms for HPC systems. We also exploit the heterogeneity of these systems to improve the performance of generic matrix computations. Different approaches can be taken, and we classify them into three main parts in this thesis. In Chapters 1 and 2, we design new linear algebra solvers that squeeze the most out of future HPC architectures. We also improve existing schedulers to better deal with the heterogeneity of modern HPC architectures. In Chapters 3 and 4, we opt for a more general technique by adding memory-awareness to generic scheduling techniques. Indeed, the ordering of the computations and the mapping of the tasks onto the computational resources deeply impact the memory usage when computing a workflow application. Finally, in Chapter 5, we target another scientific application that deals with large data files. In this context, memory is extremely limited and it can be necessary to recompute some tasks whose output has not been stored because of memory constraints. Throughout the thesis, we delve into methods to improve memory management, using numerous techniques and targeting various scientific applications. The main contributions of the different chapters are sketched in the following paragraphs.

Chapter 1: Mixing LU and QR factorization algorithms to design high-performance dense linear algebra solvers

In this first chapter, we exhibit a new numerical algorithm to solve dense linear systems. Finding the solution of a linear system, and linear algebra problems in general, is a common issue in the HPC community. The HPL Linpack benchmark uses the performance of linear algebra routines to compare the computational power of supercomputers. To solve a linear system, one usually factors the matrix using an LU or a QR factorization. These two methods both have advantages and drawbacks. The generic LU factorization is highly parallel and involves only broadcasts of information, which are efficient on modern distributed architectures. At the same time, the accuracy of the computed solution may be subject to deterioration depending upon the condition number of the initial matrix. The QR factorization is twice as costly as the LU factorization and is slightly less parallel, since it involves pipelined communications when factorizing the panel. However, the computed solution is more accurate than when using an LU factorization. Here, we design a hybrid LU-QR algorithm that alternates LU and QR factorization steps to take advantage of the best characteristics of each method. The choice of which routine to run at each step is conducted by a stability criterion that detects, before the actual computation of the step, whether performing an LU factorization step is safe from a stability point of view, or whether we should rather perform a QR step. We propose different stability criteria based on the numerical properties of the panel, and implement them in the PARSEC runtime. The PARSEC runtime is well suited to exploit the parallelism of large clusters of multiprocessor machines. It deals with task scheduling and communications, allowing us to concentrate only on algorithm design. We run an extensive set of experiments to analyse both performance and stability compared to state-of-the-art algorithms. The experiments are run on a comprehensive set of random matrices, and on ill-conditioned matrices coming from Higham's Matrix Computation Toolbox. The work presented in this chapter has been done in collaboration with Jack Dongarra and Mathieu Faverge from the University of Tennessee, Knoxville, USA, and Julien Langou and Bradley R. Lowery from the University of Colorado, Denver, USA.

Chapter 2: Bridging the gap between experimental performance and theoretical bounds for the Cholesky factorization on heterogeneous platforms

While the previous chapter dealt with clusters of multiprocessor machines, in the second chapter of this thesis we target shared-memory multiprocessors accelerated with GPUs. In this context, specific schedulers have to be designed to exploit the heterogeneity of the platform. The StarPU runtime takes care of mapping tasks efficiently onto the computational resources, using well-known generic dynamic scheduling policies from the literature, while optimizing data transfers using prefetching and overlapping, in particular. Although these purely dynamic schedulers are efficient in practice, they are not guaranteed to perform optimally, mostly due to their local vision of the workflow at any moment. In this chapter, we explore how adding static rules to the purely dynamic schedulers implemented in StarPU can improve their performance. We focus on another classical dense linear algebra routine, namely the

Cholesky factorization. We introduce theoretical bounds on the best achievable performance on a fully heterogeneous platform made of CPUs and GPUs. We investigate how adding general information about the problem can help bridge the gap between these bounds and the actual performance. This work has been conducted in collaboration with Emmanuel Agullo, Olivier Beaumont, Lionel Eyraud-Dubois, Suraj Kumar, and Samuel Thibault from INRIA Bordeaux, Talence, France.

Chapter 3: Memory-aware list scheduling for hybrid platforms

In this chapter, we study the complexity of scheduling task-graph workflows whose tasks require large I/O files. We target a heterogeneous architecture with two resource types, each with a different memory, such as a multicore node equipped with a dedicated accelerator (FPGA or GPU). Each task has an unrelated processing time on either resource and can be processed only if all its input and output files can be stored in the corresponding memory. The amount of memory of each type used at a given execution step strongly depends upon the ordering in which the tasks are executed, and upon when communications between both memories are scheduled. The objective is to determine an efficient schedule taking into account the memory constraints of each type. First, we establish the complexity of minimizing the memory usage when scheduling tree-shaped workflows whose tasks are already mapped onto the resources. Even for this simplified version of the problem, we provide NP-completeness proofs and inapproximability results. Then, we design memory-aware heuristics that provide schedules achieving good completion times without violating the memory constraints. These heuristics are evaluated on a comprehensive set of graphs, including random graphs as well as graphs arising in the context of matrix factorizations.

Chapter 4: Assessing the cost of redistribution followed by a computational kernel

In linear algebra applications, when dealing with a distributed computational platform, the data layout usually follows the owner-computes rule: each block of data is updated by the processor that hosts it. This strategy requires that data blocks be distributed onto the computational resources in a workload-balanced manner. If the initial data distribution is not suitable for the computational kernel, a data redistribution may be required before starting the computation. Usually, for a given application, a specific data distribution is known to be optimal, or close to optimal. Most of the time, this data distribution is not unique; in particular, if all the processors are identical, a processor permutation will not change the performance of a given data distribution. In this chapter, given an initial data distribution, we find, among all the optimal (or close to optimal) data partitions, the one that minimizes the cost of redistribution. We also show the NP-hardness of finding the optimal data partition and processor permutation that minimize the cost of a redistribution followed by a simple computational kernel. Our new redistribution algorithms are experimentally validated using the PARSEC environment on dense linear algebra routines. The work presented in this chapter has been done with the help of Thomas Hérault from the University of Tennessee, Knoxville, USA.

Chapter 5: Optimal multistage algorithms for adjoint computation

In this last chapter, we target another scientific application, namely adjoint computation. In this context, the task graphs to schedule deal with large I/O files and provide no parallelism across tasks. To cope with the memory constraints, some output files have to be deleted and recomputed later when needed. The objective of the scheduling algorithm is to determine which output files to store, in order to minimize the total processing time of the task graph. As in Chapter 3, we deal with a bi-memory model, but this time the two memories differ: one has no size constraint but takes a significant time to write and read data, while the other is limited in volume but provides free writing and reading. We design the first optimal algorithm in the literature to solve this problem, and compare its performance with the state-of-the-art heuristics. This work has been conducted in collaboration with Guillaume Aupy, another PhD student of my research team, and Paul Hovland from the Argonne National Laboratory, Illinois, USA.

Part I

Linear Algebra


Chapter 1

Mixing LU and QR factorization algorithms to design high-performance dense linear algebra solvers

As said in the introduction of this thesis, optimizing linear algebra libraries is a main concern for the HPC community. Many scientific applications rely on the numerical algorithms provided by these libraries. It is thus important to design new numerical algorithms that exploit the complex hierarchical architecture of the next generation of supercomputers. One of the main applications of linear algebra libraries is to solve dense linear systems of the form Ax = b; it is indeed the benchmark used to assess the computational power of modern supercomputers. To solve a dense linear system Ax = b, one usually uses either an LU factorization or a QR factorization of the matrix A. LU elimination steps can be very efficiently parallelized, and are twice as cheap, in terms of floating-point operations, as QR steps. However, LU steps are not necessarily stable, while QR steps always are. In this first chapter, we introduce hybrid LU-QR algorithms for solving dense linear systems. Throughout a matrix factorization, these algorithms dynamically alternate LU steps with local pivoting and QR elimination steps, based upon some robustness criterion: they execute a QR step when the criterion detects some risk of instability, and an LU step otherwise. The choice between LU and QR steps must have a small computational overhead and must provide a satisfactory level of stability with as few QR steps as possible. In this chapter, we introduce several robustness criteria and we establish upper bounds on the growth factor of the norm of the updated matrix incurred by each of these criteria. In addition, we describe the implementation of the hybrid algorithms through an extension of the PaRSEC software that allows for dynamic choices during execution. Finally, we analyze both stability and performance results compared to state-of-the-art linear solvers on parallel distributed multicore platforms. A comprehensive set of experiments shows that hybrid LU-QR algorithms provide a continuous range of trade-offs between stability and performance.

1.1 Introduction

Consider a dense linear system Ax = b to solve, where A is a square tiled matrix, with n tiles per row and column. Each tile is a block of nb-by-nb elements, so that the actual size of A is N = n × nb. Here, nb is a parameter tuned to squeeze the most out of arithmetic units and memory hierarchy. To solve the linear system Ax = b, with A a general matrix, one usually applies a series of transformations, premultiplying A by several elementary matrices. There are two main approaches: LU factorization, where one uses lower unit triangular matrices, and QR factorization, where one uses orthogonal Householder

matrices. To the best of our knowledge, this chapter is the first study to propose a mix of both approaches during a single factorization. The LU factorization update is based upon matrix-matrix multiplications, a kernel that can be very efficiently parallelized, and whose library implementations typically achieve close to peak CPU performance. Unfortunately, the efficiency of LU factorization is hindered by the need to perform partial pivoting at each step of the algorithm, to ensure numerical stability. On the contrary, the QR factorization is always stable, but requires twice as many floating-point operations, and a more complicated update step that is not as parallel as a matrix-matrix product. Tiled QR algorithms [27, 28, 90] greatly improve the parallelism of the update step since they involve no pivoting, but they are based upon more complicated kernels whose library implementations require twice as many operations as LU.

The main objective of this chapter is to explore the design of hybrid algorithms that combine the low cost and high CPU efficiency of the LU factorization with the numerical stability of the QR approach. In a nutshell, the idea is the following: at each step of the elimination, we perform a robustness test to know whether the diagonal tile can be stably used to eliminate the tiles beneath it using an LU step. If the test succeeds, then we go for an elimination step based upon LU kernels, without any further pivoting involving sub-diagonal tiles in the panel. Technically, this is very similar to a step of a block LU factorization [37]. Otherwise, if the test fails, then we go for a step with QR kernels. At one extreme, if all tests succeed throughout the algorithm, we implement an LU factorization without pivoting. At the other extreme, if all tests fail, we implement a QR factorization. On average, some of the tests will fail and some will succeed. If the fraction of tests that fail remains small enough, we will reach a CPU performance close to that of LU without pivoting. Of course, the challenge is to design a test that is accurate enough (and not too costly) so that LU kernels are applied only when it is numerically safe to do so.

Implementing such a hybrid algorithm on a state-of-the-art distributed-memory platform, whose nodes are themselves equipped with multiple cores, is a programming challenge. Within a node, the architecture is a shared-memory machine, running many parallel threads on the cores. But the global architecture is a distributed-memory machine, and requires MPI communication primitives for inter-node communications. A slight change in the algorithm, or in the matrix layout across the nodes, might call for a time-consuming and error-prone process of code adaptation. For each version, one must identify, and adequately implement, inter-node versus intra-node kernels. This dramatically complicates the task of the programmers if they rely on a manual approach. We solve this problem by relying on the PaRSEC software [23, 22, 20], so that we can concentrate on the algorithm and forget about MPI and threads. Once we have specified the algorithm at the task level, the PaRSEC software will recognize which operations are local to a node (and hence correspond to shared-memory accesses), and which are not (and hence must be converted into MPI communications).
Previous experiments show that this approach is very powerful, and that the use of a higher-level framework does not prevent our algorithms from achieving the same performance as state-of-the-art library releases [42]. However, implementing a hybrid algorithm requires the programmer to implement a dynamic task graph of the computation. Indeed, the task graph of the hybrid factorization algorithm is no longer known statically (contrary to a standard LU or QR factorization). At each step of the elimination, we use either LU-based or QR-based tasks, but not both. This requires the algorithm to dynamically fork upon the outcome of the robustness test, in order to apply the selected kernels. The solution is to prepare a graph that includes both types of tasks, namely LU and QR kernels, to select the adequate tasks on the fly, and to discard the useless ones. We also have to join both potential execution flows at the end of each step, symmetrically. Most of this mechanism is transparent to the user. We discuss this extension of PaRSEC in more detail in Section 1.4.

The major contributions of this chapter are the following:

• The introduction of new LU-QR hybrid algorithms;

• The design of several robustness criteria, with bounds on the induced growth factor;
• The extension of PaRSEC to deal with dynamic task graphs;
• A comprehensive experimental evaluation of the best trade-offs between performance and numerical stability.

The rest of the chapter is organized as follows. First, we explain the main principles of LU-QR hybrid algorithms in Section 1.2. Then we describe the robustness criteria in Section 1.3. Next, we detail the implementation within the PaRSEC framework in Section 1.4. We report experimental results in Section 1.5. We discuss related work in Section 1.6. Finally, we provide concluding remarks and future directions in Section 1.7.

1.2 Hybrid LU-QR algorithms

In this section, we describe hybrid algorithms to solve a dense linear system Ax = b, where A = (A_{i,j})_{(i,j) ∈ J1..nK²} is a square tiled matrix, with n tiles per row or column. Each tile is a block of nb-by-nb elements, so that A is of order N = n × nb.

The common goal of a classical one-sided factorization (LU or QR) is to triangularize the matrix A through a succession of elementary transformations. Consider the first step of such an algorithm. We partition A by block such that

$$A = \begin{pmatrix} A_{11} & C \\ B & D \end{pmatrix}.$$

In terms of tiles, A_{11} is 1-by-1, B is (n − 1)-by-1, C is 1-by-(n − 1), and D is (n − 1)-by-(n − 1). The first block column $\begin{pmatrix} A_{11} \\ B \end{pmatrix}$ is the panel of the current step.

Traditional algorithms (LU or QR) perform the same type of transformation at each step. The key observation of this chapter is that any type of transformation (LU or QR) can be used for a given step, independently of what was used for the previous steps. The common framework of a step is the following:

$$\begin{pmatrix} A_{11} & C \\ B & D \end{pmatrix} \;\xrightarrow[\text{eliminate, update}]{\text{factor, apply}}\; \begin{pmatrix} U_{11} & C' \\ 0 & D' \end{pmatrix}. \qquad (1.1)$$

First, A_{11} is factored and transformed into the upper triangular tile U_{11}. Then, the transformation of the factorization of A_{11} is applied to C. Then A_{11} is used to eliminate B. Finally, D is updated accordingly. Recursively factoring D' with the same framework will complete the factorization to an upper triangular matrix. At each step, we can choose between an LU step and a QR step. The operation count for each kernel is given in Table 1.1.

                LU step, var (A1)      QR step
factor A        2/3 GETRF              4/3 GEQRT
eliminate B     (n−1) TRSM             2(n−1) TSQRT
apply C         (n−1) TRSM             2(n−1) TSMQR
update D        2(n−1)^2 GEMM          4(n−1)^2 UNMQR

Table 1.1: Computational cost of each kernel. The unit is nb^3 floating-point operations.

Generally speaking, QR transformations are twice as costly as their LU counterparts. The bulk of the computations takes place in the update of the trailing matrix D, which obviously favors LU update kernels. In addition, the LU update kernels are fully parallel and can be applied independently on the (n − 1)² trailing tiles. Unfortunately, LU updates (using GEMM) are stable only when $\|A_{11}^{-1}\|^{-1}$ is larger than $\|B\|$ (see Section 1.3). If this is not the case, we have to resort to QR kernels. Not only are these twice as costly, but they also suffer from enforcing more dependencies: all columns can still be processed (apply and update kernels) independently, but inside a column, the kernels must be applied in sequence.

Algorithm 1: Hybrid LU-QR algorithm

for k = 1 to n do
    Factor: compute a factorization of the diagonal tile, either with LU partial pivoting or QR;
    Check: compute some robustness criterion (see Section 1.3) involving only the tiles A_{i,k}, k ≤ i ≤ n, in the elimination panel;
    Apply, Eliminate, Update:
        if the criterion succeeds then perform an LU step;
        else perform a QR step;

Algorithm 2: Step k of an LU step, variant (A1)

Factor: A_{k,k} ← GETRF(A_{k,k});
for i = k + 1 to n do
    Eliminate: A_{i,k} ← TRSM(A_{k,k}, A_{i,k});
for j = k + 1 to n do
    Apply: A_{k,j} ← SWPTRSM(A_{k,k}, A_{k,j});
for i = k + 1 to n do
    for j = k + 1 to n do
        Update: A_{i,j} ← GEMM(A_{i,k}, A_{k,j}, A_{i,j});

The hybrid LU-QR algorithm uses the standard 2D block-cyclic distribution of tiles along a virtual p-by-q cluster grid. The 2D block-cyclic distribution nicely balances the load across resources for both LU and QR steps. Thus at step k of the factorization, the panel is split into p domains of approximately (n − k + 1)/p tile rows. Domains will be associated with physical memory regions, typically one domain per node in a distributed-memory platform. Thus an important design goal is to minimize the number of communications across domains, because these correspond to non-local communications between nodes. At each step k of the factorization, the domain of the node owning the diagonal tile A_{k,k} is called the diagonal domain. The hybrid LU-QR algorithm applies LU kernels when it is numerically safe to do so, and QR kernels otherwise. Coming back to the first elimination step, the sequence of operations is described in Algorithm 1.
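To make the control flow concrete, the sketch below simulates Algorithm 1 sequentially on a plain NumPy matrix. It is a toy model only: the function name, the default threshold, the explicit inverses, and the unblocked QR of the whole panel are our own simplifications, not the PaRSEC implementation; the test shown is the Max criterion of Section 1.3.1.

```python
import numpy as np

def hybrid_lu_qr(A, nb, alpha=10.0):
    """Toy sequential sketch of Algorithm 1 (hypothetical helper): triangularize
    A, choosing at each panel step between an unpivoted block LU elimination
    and a Householder QR step, based on the Max criterion."""
    N = A.shape[0]
    n = N // nb
    for k in range(n):
        diag = slice(k * nb, (k + 1) * nb)      # diagonal tile A_kk
        below = slice((k + 1) * nb, N)          # tiles under the diagonal
        Akk, B = A[diag, diag], A[below, diag]
        # Max criterion (Eq. 1.2): alpha * ||Akk^{-1}||_1^{-1} >= max_i ||A_ik||_1
        pivot_quality = 1.0 / np.linalg.norm(np.linalg.inv(Akk), 1)
        col_max = max((np.linalg.norm(B[t * nb:(t + 1) * nb], 1)
                       for t in range(n - k - 1)), default=0.0)
        if alpha * pivot_quality >= col_max:
            # LU step: eliminate B with Akk, then Schur update of the trailing
            # matrix (the L factors are simply discarded in this sketch)
            L = B @ np.linalg.inv(Akk)
            A[below, below] -= L @ A[diag, below]
            A[below, diag] = 0.0
        else:
            # QR step: Householder QR of the whole panel, apply Q^T to the
            # trailing matrix (always stable, but twice the flops)
            panel = slice(k * nb, N)
            Q, R = np.linalg.qr(A[panel, diag], mode="complete")
            A[panel, below] = Q.T @ A[panel, below]
            A[panel, diag] = 0.0
            A[diag, diag] = R[:nb]
    return A  # upper triangular, up to the diagonal tiles of the LU steps
```

Lowering alpha forces more QR steps (more stability, more flops); raising it lets more LU steps through, which is exactly the trade-off the chapter explores.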

1.2.1 LU step

We assume that the criterion validates an LU step (see Section 1.3). We describe the variant (A1) of an LU step, given in Algorithm 2. The kernels for the LU step are the following:

Algorithm 3: Step k of the HQR factorization

for i = k + 1 to n do
    elim(i, eliminator(i, k), k);

• Factor: A_{k,k} ← GETRF(A_{k,k}) is an LU factorization with partial pivoting: P_{k,k}A_{k,k} = L_{k,k}U_{k,k}; the output matrices L_{k,k} and U_{k,k} are stored in place of the input A_{k,k}.
• Eliminate: A_{i,k} ← TRSM(A_{k,k}, A_{i,k}) solves in place the upper triangular system A_{i,k} ← A_{i,k}U_{k,k}^{-1}, where U_{k,k} is stored in the upper part of A_{k,k}.
• Apply: A_{k,j} ← SWPTRSM(A_{k,k}, A_{k,j}) solves the unit lower triangular system A_{k,j} ← L_{k,k}^{-1}P_{k,k}A_{k,j}, where L_{k,k} is stored in the (strictly) lower part of A_{k,k}.
• Update: A_{i,j} ← GEMM(A_{i,k}, A_{k,j}, A_{i,j}) is a general matrix-matrix multiplication A_{i,j} ← A_{i,j} − A_{i,k}A_{k,j}.

In terms of parallelism, the factorization of the diagonal tile is followed by the TRSM kernels, which can be processed in parallel; then every GEMM kernel can be processed concurrently. These highly parallelizable updates constitute one of the two main advantages of the LU step over the QR step. The second main advantage is halving the number of floating-point operations.

During the factor step, one variant is to factor the whole diagonal domain instead of only factoring the diagonal tile. Considering Algorithm 2, the difference lies in the first line: rather than calling GETRF(A_{k,k}), thereby searching for pivots only within the diagonal tile A_{k,k}, we implemented a variant where we extend the search for pivots across the diagonal domain (the Apply step is modified accordingly). Working on the diagonal domain instead of the diagonal tile increases the smallest singular value of the factored region and therefore increases the likelihood of an LU step. Since all tiles in the diagonal domain are local to a single node, extending the search to the diagonal domain is done without any inter-domain communication. The stability analysis of Section 1.3 applies to both scenarios: the one where A_{k,k} is factored in isolation, and the one where it is factored with the help of the diagonal domain. In the experimental section, we will use the variant which factors the diagonal domain.
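As an illustration, the four kernels of variant (A1) map onto standard dense operations. The sketch below mimics one step of Algorithm 2 on a list-of-lists of NumPy tiles using SciPy; it is our own toy transcription, not the PaRSEC kernels. Note that scipy.linalg.lu returns A = P L U, so the algorithm's pivoted factorization P_{k,k}A_{k,k} = L_{k,k}U_{k,k} corresponds here to using P^T.

```python
import numpy as np
from scipy.linalg import lu, solve_triangular

def lu_step_A1(T, k, n):
    """One LU step, variant (A1), on tiles T[i][j] (toy version of Algorithm 2)."""
    # Factor (GETRF): scipy gives T[k][k] = P @ L @ U, i.e. P^T A_kk = L U
    P, L, U = lu(T[k][k])
    for i in range(k + 1, n):
        # Eliminate (TRSM): A_ik <- A_ik U^{-1}, solved as U^T X^T = A_ik^T
        T[i][k] = solve_triangular(U.T, T[i][k].T, lower=True).T
    for j in range(k + 1, n):
        # Apply (SWPTRSM): A_kj <- L^{-1} P^T A_kj
        T[k][j] = solve_triangular(L, P.T @ T[k][j],
                                   lower=True, unit_diagonal=True)
    for i in range(k + 1, n):
        for j in range(k + 1, n):
            # Update (GEMM): A_ij <- A_ij - A_ik A_kj
            T[i][j] = T[i][j] - T[i][k] @ T[k][j]
```

The two outer update loops are independent for all (i, j), which is the "embarrassingly parallel" property the text refers to.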

1.2.2 QR step

If the decision to process a QR step is taken by the criterion, the LU decomposition of the diagonal domain is dropped, and the factorization of the panel starts over. This step of the factorization is then processed using orthogonal transformations. Every tile below the diagonal (matrix B in Equation (1.1)) is zeroed out using a triangular tile, or eliminator tile. In a QR step, the diagonal tile is factored (with a GEQRT kernel) and used to eliminate all the other tiles of the panel (with a TSQRT kernel). The trailing submatrix is updated, respectively, with UNMQR and TSMQR kernels. To further increase the degree of parallelism of the algorithm, it is possible to use several eliminator tiles inside a panel, typically one (or more) per domain. The only condition is that concurrent elimination operations must involve disjoint tile pairs (the unique eliminator of tile A_{i,k} will be referred to as A_{eliminator(i,k),k}). Of course, in the end, there must remain only one non-zero tile on the panel diagonal, so all eliminators except the diagonal tile must be eliminated later on (with a TTQRT kernel on the panel and TTMQR updates on the trailing submatrix), using a reduction tree of arbitrary shape. This reduction tree will involve inter-domain communications. In our hybrid LU-QR algorithm, the QR step is processed following an instance of the generic hierarchical QR factorization HQR [42] described in Algorithms 3 and 4.

Algorithm 4: Elimination elim(i, eliminator(i, k), k)

(a) With TS kernels:
    A_{eliminator(i,k),k} ← GEQRT(A_{eliminator(i,k),k});
    A_{i,k}, A_{eliminator(i,k),k} ← TSQRT(A_{i,k}, A_{eliminator(i,k),k});
    for j = k + 1 to n − 1 do
        A_{eliminator(i,k),j} ← UNMQR(A_{eliminator(i,k),j}, A_{eliminator(i,k),k});
        A_{i,j}, A_{eliminator(i,k),j} ← TSMQR(A_{i,j}, A_{eliminator(i,k),j}, A_{i,k});

(b) With TT kernels:
    A_{eliminator(i,k),k} ← GEQRT(A_{eliminator(i,k),k});
    A_{i,k} ← GEQRT(A_{i,k});
    for j = k + 1 to n − 1 do
        A_{eliminator(i,k),j} ← UNMQR(A_{eliminator(i,k),j}, A_{eliminator(i,k),k});
        A_{i,j} ← UNMQR(A_{i,j}, A_{i,k});
    A_{i,k}, A_{eliminator(i,k),k} ← TTQRT(A_{i,k}, A_{eliminator(i,k),k});
    for j = k + 1 to n − 1 do
        A_{i,j}, A_{eliminator(i,k),j} ← TTMQR(A_{i,j}, A_{eliminator(i,k),j}, A_{i,k});

Each elimination elim(i, eliminator(i, k), k) consists of two sub-steps: first, in column k, tile (i, k) is zeroed out (or killed) by tile (eliminator(i, k), k); then, in each following column j > k, tiles (i, j) and (eliminator(i, k), j) are updated; all these updates are independent and can be triggered as soon as the elimination is completed. The algorithm is entirely characterized by its elimination list, which is the ordered list of all the eliminations elim(i, eliminator(i, k), k) that are executed. The orthogonal transformation elim(i, eliminator(i, k), k) uses either a TTQRT kernel or a TSQRT kernel depending upon whether the tile to eliminate is triangular or square. In our hybrid LU-QR algorithm, any combination of the reduction trees of the HQR algorithm described in [42] is available. It is then possible to use an intra-domain reduction tree to locally eliminate many tiles without inter-domain communication. A unique triangular tile is left on each node, and the reductions across domains are then performed following a second level of reduction tree.
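The elimination list is fully determined by the eliminator(i, k) function. As an illustration (these helpers are ours, not part of the HQR code), here are two classical choices: a flat tree, where the diagonal tile eliminates every tile of the panel with TS kernels, and a binomial tree, where tiles are reduced pairwise with TT kernels in logarithmically many rounds.

```python
def eliminator_flat(i, k):
    """Flat tree: the diagonal tile kills every tile below it (TS kernels)."""
    return k

def eliminator_binomial(i, k):
    """Binomial tree (TT kernels): tile i is killed by the tile obtained by
    clearing the lowest set bit of its offset in the panel, so the panel is
    reduced in ceil(log2(panel height)) parallel rounds."""
    d = i - k
    return k + (d & (d - 1))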

1.2.3 LU step variants

In the following, we describe several other variants of the LU step.

1.2.3.1 Variant (A2)

It consists of first performing a QR factorization of the diagonal tile, and proceeds pretty much as (A1) thereafter.

• Factor: A_{k,k} ← GEQRF(A_{k,k}) is a QR factorization A_{k,k} = Q_{k,k}U_{k,k}, where Q_{k,k} is never constructed explicitly; we instead store the Householder reflectors V_{k,k}. The output matrices V_{k,k} and U_{k,k} are stored in place of the input A_{k,k}.
• Eliminate: A_{i,k} ← TRSM(A_{k,k}, A_{i,k}) solves in place the upper triangular system A_{i,k} ← A_{i,k}U_{k,k}^{-1}, where U_{k,k} is stored in the upper part of A_{k,k}.
• Apply: A_{k,j} ← ORMQR(A_{k,k}, A_{k,j}) performs A_{k,j} ← Q_{k,k}^T A_{k,j}, where Q_{k,k}^T is applied using V_{k,k} stored in the (strictly) lower part of A_{k,k}.
• Update: A_{i,j} ← GEMM(A_{i,k}, A_{k,j}, A_{i,j}) is a general matrix-matrix multiplication A_{i,j} ← A_{i,j} − A_{i,k}A_{k,j}.

The Eliminate and Update steps are exactly the same as in (A1). The (A2) variant has the same data dependencies as (A1) and therefore the same level of parallelism. A benefit of (A2) over (A1) is that if the criterion test decides that the step is a QR step, then the factorization of A_{k,k} is not discarded but rather used to continue the QR step. A drawback of (A2) is that the Factor and Apply steps are twice as expensive as those of (A1).

1.2.3.2 Variants (B1) and (B2)

Another option is to use the so-called block LU factorization [37]. The result of this formulation is a factorization where the U factor is block upper triangular (as opposed to upper triangular), and the diagonal tiles of the L factor are identity tiles. The Factor step can either be done using an LU factorization (variant (B1)) or a QR factorization (variant (B2)). The Eliminate step is A_{i,k} ← A_{i,k}A_{k,k}^{-1}. There is no Apply step. And the Update step is A_{i,j} ← A_{i,j} − A_{i,k}A_{k,j}. The fact that row k is not updated provides two benefits: (i) A_{k,k} does not need to be broadcast to these tiles, simplifying the communication pattern; (ii) the stability of the LU step can be determined by considering only the growth factor in the Schur complement of A_{k,k}. One drawback of (B1) and (B2) is that the final matrix is not upper triangular but only block upper triangular, which complicates the use of these methods to solve a linear system of equations. The stability of (B1) and (B2) has been analyzed in [37].

We note that (A2) and (B2) use a QR factorization during the Factor step. Yet, we still call these LU steps, because all four LU variants use the Schur complement to update the trailing sub-matrix. The mathematical operation is A_{i,j} ← A_{i,j} − A_{i,k}A_{k,k}^{-1}A_{k,j}. In practice, the Update step for all four variants looks like A_{i,j} ← A_{i,j} − A_{i,k}A_{k,j}, since A_{k,k}^{-1} is somehow applied to A_{i,k} and A_{k,j} during the preliminary eliminate and apply steps. The Schur update dominates the cost of an LU factorization, and therefore all variants are more efficient than a QR step. We also have the same level of parallelism for the update step: embarrassingly parallel. In terms of stability, all variants would closely follow the analysis of Section 1.5.4. We do not further consider variants (A2), (B1), and (B2) in this chapter, since they are all very similar, and only study Algorithm 2, variant (A1).
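For contrast with the (A1) transcription above, a block LU step (variant (B1)) skips the Apply step entirely and leaves U block upper triangular. A toy sketch of our own, using an explicit inverse for brevity:

```python
import numpy as np

def block_lu_step_B1(T, k, n):
    """One block LU step, variant (B1): no Apply step, row k is not updated."""
    Akk_inv = np.linalg.inv(T[k][k])
    for i in range(k + 1, n):
        T[i][k] = T[i][k] @ Akk_inv             # Eliminate: A_ik <- A_ik A_kk^{-1}
    for i in range(k + 1, n):
        for j in range(k + 1, n):
            T[i][j] = T[i][j] - T[i][k] @ T[k][j]   # Schur update
```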

1.2.4 Comments

1.2.4.1 Solving systems of linear equations

To solve systems of linear equations, we augment A with the right-hand side b to get Ã = (A, b) and apply all transformations to Ã. Then an N-by-N triangular solve is needed. This is the approach we used in our experiments. We note that, at the end of the factorization, all needed information about the transformations is stored in place of A, so, alternatively, one can apply the transformations on b during a second pass.

1.2.4.2 No restriction on N

In practice, N does not have to be a multiple of nb. We keep this restriction for the sake of simplicity. The algorithm can accommodate any N and nb with some clean-up code, which we have written.

1.2.4.3 Relation with threshold pivoting

The LU-QR algorithm can be viewed as a tile version of standard threshold pivoting. Standard threshold pivoting works on 1-by-1 tiles (scalars, i.e., matrix elements). It checks whether the diagonal element passes a certain threshold. If the diagonal element passes the threshold, then elimination is done without pivoting, using this diagonal element as the pivot. If the diagonal element does not pass the threshold, then standard pivoting is done and elimination is performed with the largest element in absolute value in the column. Our hybrid LU algorithm can be seen as a variant with tiles: it checks whether the diagonal tile passes a certain threshold. If the diagonal tile passes the threshold, then elimination is done without pivoting, using this diagonal tile as the pivot. If the diagonal tile does not pass the threshold, then a stable elimination is performed. The fact that our algorithm works on tiles as opposed to scalars leads to two major differences: (1) new criteria for declaring a diagonal tile safe had to be developed; (2) in the event that a diagonal tile is declared not safe, we need to resort to a tile QR step.
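For reference, the scalar test that the tile criteria generalize can be sketched as follows (a hypothetical helper; tau plays the role of the threshold):

```python
import numpy as np

def threshold_pivot(col, k, tau=0.1):
    """Scalar threshold pivoting at step k: keep the diagonal entry as pivot
    when it is at least tau times the largest entry in the column, otherwise
    swap in that largest entry."""
    p = k + int(np.argmax(np.abs(col[k:])))
    if abs(col[k]) >= tau * abs(col[p]):
        return k      # diagonal passes the threshold: no pivoting needed
    return p          # standard pivoting with the largest entry
```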

1.3 Robustness criteria

The decision to process an LU or a QR step is made dynamically during the factorization, and constitutes the heart of the algorithm. Indeed, the decision criterion has to detect a potentially "large" stability deterioration (according to a threshold) due to an LU step before its actual computation, in order to preventively switch to a QR step. As explained in Section 1.2, in our hybrid LU-QR algorithm, the diagonal tile is factored using an LU decomposition with partial pivoting. At the same time, some data (like the norms of non-local tiles belonging to other domains) are collected and exchanged (using Bruck's all-reduce algorithm [26]) between all nodes hosting at least one tile of the panel. Based upon this information, all nodes decide whether to continue the LU factorization step or to drop the LU decomposition of the diagonal tile and process a full QR factorization step. The decision is broadcast to the other nodes not involved in the panel factorization within the next data communication. The cost of the decision process depends on the choice of the criterion and must not imply a large computational overhead compared to the factorization cost. A good criterion will detect only the "worst" steps and will provide a good stability result with as few QR steps as possible. In this section, we present three criteria, going from the most elaborate (but also most costly) to the simplest ones. The stability of a step is determined by the growth of the norm of the updated matrix. If a criterion detects the potential for unacceptable growth due to an LU step, then a QR step is used instead. A QR step is stable, as there is no growth in the 2-norm since it is a unitary transformation. Each criterion depends on a threshold α that allows us to tighten or loosen the stability requirement, and thus to influence the number of LU steps that we can afford during the factorization. In Section 1.5.4, we experiment with different choices of α for each criterion.

1.3.1 Max criterion

LU factorization with partial pivoting chooses the largest element of a column as the pivot element. Partial pivoting is accepted as being numerically stable. However, pivoting across nodes is expensive. To avoid this pivoting, we generalize the criterion to tiles and determine whether the diagonal tile is an acceptable pivot. A step is an LU step if

$$\alpha \times \|(A_{k,k}^{(k)})^{-1}\|_1^{-1} \;\geq\; \max_{i>k} \|A_{i,k}^{(k)}\|_1. \qquad (1.2)$$

For the analysis, we make no assumption as to how the diagonal tile is factored. We only assume that the diagonal tile is factored in a stable way (LU with partial pivoting or QR are acceptable). Note that, for the variant using pivoting in the diagonal domain (see Section 1.2.1), which is the variant we experiment with in Section 1.5, A_{k,k}^{(k)} represents the diagonal tile after pivoting among tiles in the diagonal domain.

To assess the growth of the norm of the updated matrix, consider the update of the trailing sub-matrix. For all i, j > k we have:

$$\begin{aligned}
\|A_{i,j}^{(k+1)}\|_1 &= \|A_{i,j}^{(k)} - A_{i,k}^{(k)} (A_{k,k}^{(k)})^{-1} A_{k,j}^{(k)}\|_1 \\
&\leq \|A_{i,j}^{(k)}\|_1 + \|A_{i,k}^{(k)}\|_1 \|(A_{k,k}^{(k)})^{-1}\|_1 \|A_{k,j}^{(k)}\|_1 \\
&\leq \|A_{i,j}^{(k)}\|_1 + \alpha \|A_{k,j}^{(k)}\|_1 \\
&\leq (1+\alpha) \max\left( \|A_{i,j}^{(k)}\|_1, \|A_{k,j}^{(k)}\|_1 \right) \\
&\leq (1+\alpha) \max_{i \geq k} \|A_{i,j}^{(k)}\|_1.
\end{aligned}$$

The growth of any tile in the trailing sub-matrix is bounded by 1 + α times the largest tile in the same column. If every step satisfies (1.2), then we have the following bound:

$$\frac{\max_{i,j,k} \|A_{i,j}^{(k)}\|_1}{\max_{i,j} \|A_{i,j}\|_1} \;\leq\; (1+\alpha)^{n-1}.$$

The expression above is a growth factor on the norm of the tiles. For α = 1, the growth factor of 2^{n−1} is analogous to the classical result for LU factorization with partial pivoting (scalar case) [67]. Finally, note that we can obtain this bound by generalizing the standard example for partial pivoting. The following matrix matches the bound above:

$$A = \begin{pmatrix}
\alpha^{-1} & 0 & 0 & 1 \\
-1 & \alpha^{-1} & 0 & 1 \\
-1 & -1 & \alpha^{-1} & 1 \\
-1 & -1 & -1 & 1
\end{pmatrix}.$$
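A quick scalar experiment (nb = 1, so tiles are matrix entries and 1-norms are absolute values) confirms that this family of matrices attains the bound. The generalization of the 4-by-4 example and the growth measurement below are our own illustration:

```python
import numpy as np

def worst_case(alpha, n):
    """n-by-n generalization of the example above: alpha^{-1} on the diagonal,
    -1 below the diagonal, 1 in the last column."""
    A = np.tril(-np.ones((n, n)), -1) + np.diag(np.full(n, 1.0 / alpha))
    A[:, -1] = 1.0
    return A

def growth(A):
    """Element growth of Gaussian elimination without pivoting."""
    A = A.astype(float).copy()
    top = g = np.abs(A).max()
    for k in range(A.shape[0] - 1):
        A[k + 1:, k + 1:] -= np.outer(A[k + 1:, k] / A[k, k], A[k, k + 1:])
        A[k + 1:, k] = 0.0
        g = max(g, np.abs(A).max())
    return g / top

# growth(worst_case(1.0, 10)) evaluates to 2**9, matching (1 + alpha)**(n - 1)
```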

1.3.2 Sum criterion

A stricter criterion is to compare the diagonal tile to the sum of the off-diagonal tiles:

$$\alpha \times \|(A_{k,k}^{(k)})^{-1}\|_1^{-1} \;\geq\; \sum_{i>k} \|A_{i,k}^{(k)}\|_1. \qquad (1.3)$$

Again, for the analysis, we only assume that A_{k,k} is factored in a stable way. For α ≥ 1, this criterion (and the Max criterion) is satisfied at every step if A is block diagonally dominant [67]. That is, a general matrix A ∈ R^{n×n} is block diagonally dominant by columns with respect to a given partitioning A = (A_{i,j}) and a given norm ‖·‖ if:

$$\forall j \in \llbracket 1, n \rrbracket, \quad \|A_{j,j}^{-1}\|^{-1} \;\geq\; \sum_{i \neq j} \|A_{i,j}\|.$$
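For instance, the following snippet (a helper of our own) checks block diagonal dominance by columns in the 1-norm on a list-of-lists of nb-by-nb tiles:

```python
import numpy as np

def block_diag_dominant(T, n):
    """True if, for every column j, ||A_jj^{-1}||_1^{-1} >= sum_{i != j} ||A_ij||_1."""
    for j in range(n):
        diag = 1.0 / np.linalg.norm(np.linalg.inv(T[j][j]), 1)
        off = sum(np.linalg.norm(T[i][j], 1) for i in range(n) if i != j)
        if diag < off:
            return False
    return True
```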

Again we need to evaluate the growth of the norm of the updated trailing sub-matrix. For all i, j > k, we have

$$\begin{aligned}
\sum_{i>k} \|A_{i,j}^{(k+1)}\|_1 &= \sum_{i>k} \|A_{i,j}^{(k)} - A_{i,k}^{(k)} (A_{k,k}^{(k)})^{-1} A_{k,j}^{(k)}\|_1 \\
&\leq \sum_{i>k} \|A_{i,j}^{(k)}\|_1 + \|A_{k,j}^{(k)}\|_1 \|(A_{k,k}^{(k)})^{-1}\|_1 \sum_{i>k} \|A_{i,k}^{(k)}\|_1 \\
&\leq \sum_{i>k} \|A_{i,j}^{(k)}\|_1 + \alpha \|A_{k,j}^{(k)}\|_1.
\end{aligned}$$

Hence, the growth of the updated matrix can be bounded in terms of an entire column rather than just an individual tile. The only growth in the sum is due to the norm of a single tile. For α = 1, the inequality becomes

$$\sum_{i>k} \|A_{i,j}^{(k+1)}\|_1 \;\leq\; \sum_{i \geq k} \|A_{i,j}^{(k)}\|_1.$$

If every step of the algorithm satisfies (1.3) (with α = 1), then by induction we have:

$$\sum_{i>k} \|A_{i,j}^{(k+1)}\|_1 \;\leq\; \sum_{i \geq 1} \|A_{i,j}\|_1,$$

for all i, j, k. This leads to the following bound:

$$\frac{\max_{i,j,k} \|A_{i,j}^{(k)}\|_1}{\max_{i,j} \|A_{i,j}\|_1} \;\leq\; n.$$

From this we see that the criterion eliminates the potential for exponential growth due to the LU steps. Note that for a diagonally dominant matrix, the bound on the growth factor can be reduced to 2 [67].

1.3.3 MUMPS criterion

In LU decomposition with partial pivoting, the largest element of the column is used as the pivot. This method is stable experimentally, but searching for the maximum and pivoting require a lot of communications in distributed memory. Thus, in an LU step of the LU-QR algorithm, the LU decomposition with partial pivoting is limited to the local tiles of the panel (i.e., to the diagonal domain). The idea behind the MUMPS criterion is to estimate the quality of the pivot found locally compared to the rest of the column. The MUMPS criterion is one of the strategies available in MUMPS, although there it is used for symmetric indefinite matrices (LDL^T) [45], and Amestoy et al. [12] provided us with their scalar criterion for the LU case. At step k of the LU-QR algorithm, let L^{(k)}U^{(k)} be the LU decomposition of the diagonal domain and A_{i,j}^{(k)} be the value of tile A_{i,j} at the beginning of step k. Let local_max_k(j) be the largest element of column j of the panel in the diagonal domain, away_max_k(j) be the largest element of column j of the panel off the diagonal domain, and pivot_k be the list of pivots used in the LU decomposition of the diagonal domain:

$$\begin{aligned}
\mathrm{local\_max}_k(j) &= \max_{\text{tiles } A_{i,k} \text{ on the diagonal domain}} \; \max_l \; |(A_{i,k})_{l,j}|, \\
\mathrm{away\_max}_k(j) &= \max_{\text{tiles } A_{i,k} \text{ off the diagonal domain}} \; \max_l \; |(A_{i,k})_{l,j}|, \\
\mathrm{pivot}_k(j) &= |U_{j,j}^{(k)}|.
\end{aligned}$$

pivot_k(j) represents the largest local element of column j at step j of the LU decomposition with partial pivoting on the diagonal domain. Thus, we can express the growth factor of the largest local element of column j at step j as growth_factor_k(j) = pivot_k(j) / local_max_k(j). The idea behind the MUMPS criterion is to estimate whether the largest element outside the local domain would have grown the same way. Thus, we define a vector estimate_max_k, initialized to away_max_k and updated at each step i of the LU decomposition with partial pivoting as estimate_max_k(j) ← estimate_max_k(j) × growth_factor_k(i). We consider that the LU decomposition with partial pivoting of the diagonal domain can be used to eliminate the rest of the panel if and only if all pivots are larger than the estimated maximum of the column outside the diagonal domain times a threshold α. Thus, the MUMPS criterion (as we implemented it) decides that step k of the LU-QR algorithm will be an LU step if and only if:

$$\forall j, \quad \alpha \times \mathrm{pivot}_k(j) \;\geq\; \mathrm{estimate\_max}_k(j). \qquad (1.4)$$
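The sketch below is our own interpretation of this test on the stacked local rows of the panel; the function name, argument layout, and the exact moment at which each column is tested are assumptions, not the MUMPS code.

```python
import numpy as np

def mumps_criterion(local_panel, away_max, alpha=1.0):
    """Sketch of the scalar MUMPS-style test (Eq. 1.4).

    local_panel: rows of the panel owned by the diagonal domain, stacked;
    away_max[j]: largest |entry| of column j of the panel outside the domain.
    """
    nb = local_panel.shape[1]
    local_max = np.abs(local_panel).max(axis=0)       # local_max_k(j)
    estimate_max = np.asarray(away_max, dtype=float).copy()
    A = local_panel.astype(float).copy()
    for j in range(nb):
        p = j + int(np.argmax(np.abs(A[j:, j])))      # local partial pivoting
        A[[j, p]] = A[[p, j]]
        pivot = abs(A[j, j])                          # pivot_k(j) = |U_jj|
        if alpha * pivot < estimate_max[j]:
            return False                              # fall back to a QR step
        estimate_max *= pivot / local_max[j]          # apply growth_factor_k(j)
        A[j + 1:, j:] -= np.outer(A[j + 1:, j] / A[j, j], A[j, j:])
    return True                                       # LU step accepted
```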

1.3.4 Extending the MUMPS criterion to tiles

In this section, we extend the MUMPS criterion [12, 45] to tiles. We did not implement this extension in software. We present the main idea here in the context of the Max criterion; it is possible to adapt it to the Sum criterion as well. The main idea is to maintain a local upper bound on the maximum norm of the tiles below the diagonal. The goal for MUMPS is to spare a search in the pivot column to see if the current pivot is acceptable. Our goal is to spare a search in the tile column to know whether we are going to apply an LU step or a QR step. In both cases, if the criterion is satisfied, the search does not happen, hence (1) this avoids the communication necessary for the search, and (2) this avoids synchronizing the processes in the column. In other words, if the initial matrix is such that the criterion is always satisfied at each step of the algorithm, the synchronization of the panel goes away and a pipeline is naturally instantiated. Following our general framework, we have a matrix A partitioned into n tiles per row and column. At the start of the algorithm, all processors have a vector r of size n. If a process holds column j, j = 1, . . . , n, then it also holds r_j such that

$$r_j^{(1)} = \max_{i=2,\dots,n} \|A_{i,j}\|_1.$$

If the process does not hold column j, then it does not have (and will not need) a value for rj. If

$$\alpha \times \|(A_{11}^{(1)})^{-1}\|_1^{-1} \;\geq\; \max_{i>1} \|A_{i,1}^{(1)}\|_1,$$

we know that an LU step will be performed according to the Max criterion (see Equation (1.2)). This condition is guaranteed if

$$\alpha \times \|(A_{11}^{(1)})^{-1}\|_1^{-1} \;\geq\; r_1^{(1)}.$$

We assume that this condition is satisfied, so that an LU step is decided and our first step is an LU step. The pivot process can decide so without any communication or synchronization. An LU step is therefore initiated. Now, following the MUMPS criterion [12, 45], we update the vector r as follows:

$$r_j^{(2)} = r_j^{(1)} + r_1^{(1)} \|(A_{11}^{(1)})^{-1}\|_1 \|A_{1,j}^{(1)}\|_1, \qquad j = 2, \dots, n.$$

We note that ‖(A_{11}^{(1)})^{-1}‖_1 can be broadcast along with the L and U factors of A_{11}^{(1)}, and that A_{1,j}^{(1)} is broadcast for the update to all processes holding r_j; therefore, all processes holding r_j can update it without any extra communication. Now we see that

$$r_j^{(2)} \;\geq\; \max_{i=3,\dots,n} \|A_{i,j}^{(2)}\|_1, \qquad j = 2, \dots, n.$$

Indeed, let j = 2, . . . , n and i = 3, . . . , n; we have

$$\begin{aligned}
\|A_{i,j}^{(2)}\|_1 &= \|A_{i,j}^{(1)} - A_{i,1}^{(1)} (A_{1,1}^{(1)})^{-1} A_{1,j}^{(1)}\|_1 \\
&\leq \|A_{i,j}^{(1)}\|_1 + \|A_{i,1}^{(1)}\|_1 \|(A_{1,1}^{(1)})^{-1}\|_1 \|A_{1,j}^{(1)}\|_1 \\
&\leq \max_{i=2,\dots,n} \|A_{i,j}^{(1)}\|_1 + \left( \max_{i=2,\dots,n} \|A_{i,1}^{(1)}\|_1 \right) \|(A_{1,1}^{(1)})^{-1}\|_1 \|A_{1,j}^{(1)}\|_1 \\
&\leq r_j^{(1)} + r_1^{(1)} \|(A_{1,1}^{(1)})^{-1}\|_1 \|A_{1,j}^{(1)}\|_1.
\end{aligned}$$

Hence, for j = 2, . . . , n, we have

$$\max_{i=3,\dots,n} \|A_{i,j}^{(2)}\|_1 \;\leq\; r_j^{(1)} + r_1^{(1)} \|(A_{1,1}^{(1)})^{-1}\|_1 \|A_{1,j}^{(1)}\|_1,$$

so that, as claimed,

$$\max_{i=3,\dots,n} \|A_{i,j}^{(2)}\|_1 \;\leq\; r_j^{(2)}.$$

Therefore, at step 2, the process holding A_{22} can evaluate locally (without communication or synchronization) the condition

$$\alpha \times \|(A_{2,2}^{(2)})^{-1}\|_1^{-1} \;\geq\; r_2^{(2)},$$

and decide whether an LU or a QR step is appropriate.
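In code, the bound refresh after an LU step at panel k is one AXPY-like line per column (a hypothetical helper mirroring the r_j recurrence above; inv_norm stands for ‖(A_{k,k}^{(k)})^{-1}‖_1 and row_norms[j] for ‖A_{k,j}^{(k)}‖_1):

```python
def update_column_bounds(r, k, inv_norm, row_norms):
    """Refresh the local upper bounds after an LU step at panel k:
    r_j <- r_j + r_k * ||A_kk^{-1}||_1 * ||A_kj||_1 for every j > k."""
    for j in range(k + 1, len(r)):
        r[j] += r[k] * inv_norm * row_norms[j]
```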

1.3.5 Complexity

All criteria require the reduction of information from the off-diagonal tiles to the diagonal tile. Criteria (1.2) and (1.3) require the norm of each tile to be computed locally (our implementation uses the 1-norm) and then reduced to the diagonal tile. Both criteria also require computing \|A_{k,k}^{-1}\|. Since the LU factorization of the diagonal tile is computed, the norm can be approximated from the L and U factors by an iterative method in O(nb^2) floating-point operations. The overall complexity for both criteria is O(n × nb^2). Criterion (1.4) requires the maximum of each column to be computed locally and then reduced to the diagonal tile. The complexity of the MUMPS criterion is also O(n × nb^2) comparisons.

The Sum criterion is the strictest of the three criteria. It also provides the best stability, with linear growth in the norm of the tiles in the worst case. The other two criteria have similar worst-case bounds. The growth factor of both is bounded by the growth factor of partial (threshold) pivoting. The Max criterion has a bound for the growth factor on the norm of the tiles that is analogous to partial pivoting. The MUMPS criterion does not operate at the tile level, but rather on scalars. If the estimated growth factor computed by the criterion is a good estimate, then the growth factor is no worse than that of partial (threshold) pivoting.

1.4 Implementation

As discussed in Section 1.1, we have implemented the LU-QR Algorithm on top of the PaRSEC runtime. There are two major reasons for this choice: (i) it allows for easily targeting distributed architectures while concentrating only on the algorithm, and not on implementation details such as data distribution and communications; (ii) previous implementations of the HQR algorithm [42] can be reused for QR elimination steps, and they include efficient reduction trees to reduce the critical path of these steps. Another advantage of using such a runtime is that it provides an efficient look-ahead scheme without additional programming burden. This is illustrated by Figure 1.2, which shows the first steps of the factorization with the LU-QR Algorithm. QR steps (in green) and LU steps (in orange/red) are interleaved automatically by the runtime, and a panel factorization does not wait for the previous step to be finished before starting.

However, this choice implied major difficulties due to the parameterized task graph representation exploited by the PaRSEC runtime. This representation being static, a solution had to be developed to allow for dynamism in the graph traversal. To solve this issue, we decided to fully describe both the LU and QR algorithms statically in the parameterized task graph. A layer of selection tasks has then been inserted between iterations, to collect the data from the previous step and to propagate it to the correct following step. These tasks, which perform no computation, are only executed once they have received a control flow after the criterion selection has been made. Thus, they delay the decision to send the data to the next elimination step until a choice has been made, in order to guarantee that the data follow the correct path. These are the Propagate tasks in Figure 1.1. It is important to note that these tasks do not delay the computations, since no updates are available as long as the panel is not factorized. Furthermore, these tasks, as well as the Backup Panel tasks, can receive the same data from two different paths, which could create conflicts. In the PaRSEC runtime, tasks are created only when one of their dependencies is resolved; by graph construction, they are enabled only when the previous elimination step has already started, hence they will receive their data only from the correct path. It also means that tasks belonging to the neglected path will never be triggered, and so never be created. This implies that only one path will forward the data to the Propagate tasks.

Figure 1.1: Dataflow of one step of the algorithm.

Figure 1.1 describes the connection between the different stages of one elimination step of the algorithm. These stages are described below:

BACKUP PANEL
This is a set of tasks that collect the tiles of the panel from the previous step. Since an LU factorization will be performed in place for the criterion computation, it is necessary to back up the modified data in case the criterion fails the test on numerical accuracy. Only tiles of the current panel belonging to the node holding the diagonal row are copied, and sent directly to the Propagate tasks in case a QR elimination step is needed. On the other nodes, nothing is done. Then, all original tiles belonging to the panel are forwarded to the LU On Panel tasks.

LU ON PANEL
Once the backup is done, the criterion is computed. Two kinds of work are performed in these tasks. On the first node, the U matrix related to this elimination step is computed. This can be done through an LU factorization with or without pivoting. We decided to exploit the multi-threaded recursive-LU kernel from the PLASMA library to enlarge the pivot search space while keeping good efficiency [43]. On all other nodes, the information required for the criterion is computed (see Section 1.3). Then, an all-reduce operation is performed to exchange the information, so that every node can take and store the decision in a local array. This all-reduce operation is directly implemented within the parameterized task graph with Bruck's algorithm [26] to optimize its cost. Once the decision is known by the nodes on the panel, it is stored in a global array by each process in order to give every worker thread access to the information. The information is then broadcast by row to all other nodes so that everyone knows which kind of update to apply, and a control flow per process triggers all the local Propagate tasks, which can now access the decision and release the correct path in the dataflow.

PROPAGATE
These tasks, one per tile, receive the decision from the previous stage through a control flow, and are responsible for forwarding the data to the computational tasks of the selected factorization. The tasks belonging to the panel (assigned to the first nodes) have to restore the data back to their previous state if QR elimination is chosen. In all cases, the backup is destroyed upon exit of these tasks. We are now ready to complete the description of each step:

a) LU STEP
If the numerical criterion is met, the update step is performed. On the nodes holding the diagonal row, a task per panel is generated to apply the row permutation computed by the factorization, and then the triangular solve is applied to the diagonal to compute the U part of the matrix. The result is broadcast per column to all other nodes, and a block LU algorithm is used to perform the update. This means that the panel is updated with TRSM tasks, and the trailing sub-matrix is updated with GEMM tasks. This avoids the row pivoting between the nodes usually performed by the classical LU factorization algorithm with partial pivoting, or by tournament pivoting algorithms [61]. Here this exchange is made within a single node only.

b) QR STEP
If the numerical criterion is not met, a QR factorization has to be performed. Many solutions could be used for this elimination step. We chose to exploit the HQR method implementation presented in [42]. This allowed us to experiment with different kinds of reduction trees, so as to find the solution best adapted to our problem. The goal is to reduce the inter-node communications to the minimum while keeping the critical path short. In [42], we have shown that the FLATTREE tree is very efficient for a good pipeline of the operations on square matrices, while FIBONACCI, GREEDY, or BINARYTREE are good for tall and skinny matrices because they reduce the length of the critical path. In this algorithm, our default tree (which we use in all of our experiments) is a hierarchical tree made of GREEDY reduction trees inside nodes and a FIBONACCI reduction tree between the nodes. The goal is to perform as few QR steps as possible, so a FIBONACCI tree between nodes has been chosen for its short critical path and its good pipelining of consecutive trees if multiple QR steps are performed in sequence. Within a node, the GREEDY reduction tree is favored for similar reasons (see [42] for more details on the reduction trees). A two-level hierarchical approach is natural when considering multicore parallel distributed architectures, and those choices could be reconsidered according to the matrix size and numerical properties.

To implement the LU-QR Algorithm within the PaRSEC framework, two extensions had to be implemented within the runtime. The first extension allows the programmer to generate data during the execution with the OUTPUT keyword. This data is then inserted into the tracking system of the runtime to follow its path in the dataflow. This is what has been used to generate the backup on the fly, and to limit the memory peak of the algorithm. A second extension has been made for the end detection of the algorithm. Due to its distributed nature, PaRSEC detects the end of an algorithm by counting the remaining tasks to execute. At algorithm submission, PaRSEC loops over the whole domain space of each type of task of the algorithm and uses a predicate, namely the owner-computes rule, to decide whether a task is local or not.
Local tasks are counted, and the end of the algorithm is detected when all of them have been executed. As explained previously, to statically describe the dynamism of the LU-QR Algorithm, both LU and QR tasks exist in the parameterized graph. The size of the domain space is then larger than the number of tasks that will actually be executed. Thus, a function to dynamically increase/decrease the number of local tasks has been added, so that the Propagate tasks decrease the local counter of each node by the number of update tasks associated with the non-selected algorithm. The implementation of the LU-QR Algorithm is publicly available in the latest DPLASMA release (1.2.0).

1.5 Experiments

The purpose of this section is to present numerical experiments for the hybrid LU-QR Algorithm, and to highlight the trade-offs between stability and performance that can be achieved by tuning the threshold α in the robustness criterion (see Section 1.3).

1.5.1 Experimental framework

We used Dancer, a parallel machine hosted at the Innovative Computing Laboratory (ICL) in Knoxville, to run the experiments. This cluster has 16 multi-core nodes, each equipped with 8 cores, and an Infiniband interconnection network with 10 GB/s bandwidth (MT25208 cards). The nodes feature two Intel Westmere-EP E5606 CPUs at 2.13 GHz. The system runs the Linux 64-bit operating system, version 3.7.2-x86_64. The software was compiled with the Intel Compiler Suite 2013.3.163. BLAS kernels were provided by the MKL library, and OpenMPI 1.4.3 was used for the MPI communications by the PaRSEC runtime. Each computational thread is bound to a single core using the HwLoc 1.7.1 library. Unless mentioned otherwise, we use all 16 nodes and the data is distributed according to a 4-by-4 2D block-cyclic distribution. In all our experiments, this distribution performs better than an 8-by-2 or a 16-by-1 distribution for our hybrid LU-QR Algorithm, as expected, since square grids are known to perform better for LU and QR factorizations applied to square matrices [28]. The theoretical peak performance of the 16 nodes is 1091 GFLOP/sec.

For each experiment, we consider a square tiled matrix A of size N-by-N, where N = n × nb. The tile size nb has been fixed to 240 for the whole experiment set, because this value was found to achieve good performance for both LU and QR steps. We evaluate the backward stability by computing the

Figure 1.2: Execution of the first steps of the LU-QR Algorithm on a matrix of size N = 5000 with nb = 200, and a criterion that alternates between LU and QR steps. A 2-by-2 grid of processes with 4 threads each is used. The green tasks are the QR steps (dark for the panel factorization, light for the trailing submatrix update); the orange and red tasks are the LU steps (red for the panel update, orange for the trailing submatrix update); the blue tasks are the LU panel factorizations performed at every step; and the small black tasks, which do not show up in the figure because their duration is too small, are the criterion selections and the Propagate tasks.

HPL3 accuracy test of the High-Performance Linpack benchmark [44]:

HPL3 = \|Ax − b\|_∞ / (\|A\|_∞ × \|x\|_∞ × ε × N),

where b is the right-hand side of the linear system, x is the computed solution, and ε is the machine precision. Each test is run with double-precision arithmetic. In all our experiments, the right-hand side of the linear system is generated using the DPLASMA_dplrnt routine. It generates a random matrix (or a random vector) with each element uniformly taken in [−0.5, 0.5].

For performance, we point out that the number of floating-point operations executed by the hybrid algorithm depends on the number of LU and QR steps performed during the factorization. Thus, for a fair comparison, we assess the efficiency by reporting the normalized GFLOP/sec performance computed as

GFLOP/sec = (2/3) N^3 / EXECUTION TIME,

where (2/3) N^3 is the number of floating-point operations for LU with partial pivoting and EXECUTION TIME is the execution time of the algorithm. With this formula, the QR factorization will only achieve half of the performance, due to the (4/3) N^3 floating-point operations of the algorithm. Note that in all our experiments, the right-hand side b of the linear system is a vector. Thus, the cost of applying the transformations on b to solve the linear system is negligible, which is not necessarily the case for multiple right-hand sides.

Figure 1.3: Stability, performance, and percentage of LU steps obtained by the three criteria and by random choices, for random matrices, on the Dancer platform (4x4 grid).
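The HPL3 measure defined above is easy to reproduce; below is a small NumPy sketch, with np.linalg.solve standing in for the factorization under test and the input distribution mimicking dplrnt (this is an illustration, not the benchmark code).

import numpy as np

def hpl3(A, x, b):
    """HPL3 backward-error measure of the High-Performance Linpack benchmark."""
    eps = np.finfo(A.dtype).eps                   # machine precision
    N = A.shape[0]
    return (np.linalg.norm(A @ x - b, np.inf)
            / (np.linalg.norm(A, np.inf) * np.linalg.norm(x, np.inf) * eps * N))

rng = np.random.default_rng(42)
N = 500
A = rng.uniform(-0.5, 0.5, (N, N))                # same distribution as dplrnt
b = rng.uniform(-0.5, 0.5, N)
x = np.linalg.solve(A, b)                         # stand-in for the tested solver
print("HPL3 =", hpl3(A, x, b))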

1.5.2 Results for random matrices

We start with the list of the algorithms used for comparison with the LU-QR Algorithm. All these methods are implemented within the PaRSEC framework:
• LU NoPiv, which performs pivoting only inside the diagonal tile but no pivoting across tiles (known to be both efficient and unstable);
• LU IncPiv, which performs incremental pairwise pivoting across all tiles in the elimination panel [28, 90] (still efficient but not stable either);
• several instances of the hybrid LU-QR Algorithm, for different values of the robustness parameter α. Recall that the algorithm performs pivoting only across the diagonal domain, hence involving no remote communication nor synchronization;
• HQR, the Hierarchical QR factorization [42], with the same configuration as in the QR steps of the LU-QR Algorithm: GREEDY reduction trees inside nodes and FIBONACCI reduction trees between the nodes.
For reference, we also include a comparison with PDGETRF: this is the LUPP algorithm (LU with partial pivoting across all tiles of the elimination panel) from the reference ScaLAPACK implementation [32].

Algorithm    α      Time   % LU steps   Fake GFLOP/sec   True GFLOP/sec   Fake % Peak Perf.   True % Peak Perf.
LU NoPiv     -      6.29   100.0        848.6            848.6            77.8                77.8
LU IncPiv    -      9.25   100.0        576.4            576.4            52.9                52.9
LUQR (Max)   ∞      7.87   100.0        677.7            677.7            62.1                62.1
LUQR (Max)   13000  7.99   94.1         667.7            707.4            61.2                64.9
LUQR (Max)   9000   8.62   83.3         619.0            722.2            56.8                66.2
LUQR (Max)   6000   10.95  61.9         486.9            672.4            44.6                61.7
LUQR (Max)   4000   12.43  51.2         429.0            638.4            39.3                58.5
LUQR (Max)   1400   13.76  35.7         387.6            636.9            35.5                58.4
LUQR (Max)   900    16.39  11.9         325.4            612.0            29.8                56.1
LUQR (Max)   0      18.05  0.0          295.5            590.9            27.1                54.2
HQR          -      16.01  0.0          333.1            666.1            30.5                61.1
LUPP         -      15.30  100.0        348.6            348.6            32.0                32.0

Table 1.2: Performance obtained by each algorithm, for N = 20,000, on the Dancer platform (4 × 4 grid). We only show the results for the LU-QR Algorithm with the Max criterion; the other criteria have similar performance. In column Fake GFLOP/sec, we assume all algorithms perform (2/3)N^3 floating-point operations. In column True GFLOP/sec, we take the number of floating-point operations to be ((2/3) f_LU + (4/3)(1 − f_LU)) N^3, where f_LU is the fraction of the steps that are LU steps (column 4).
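The two GFLOP/sec columns in Table 1.2 follow directly from these formulas; a tiny helper (illustrative) makes the normalization explicit.

def gflops(N, time_s, f_lu=1.0):
    """Fake GFLOP/sec uses the LU flop count (2/3)N^3; the true count weights
    LU and QR steps by the fraction f_lu of LU steps (cf. Table 1.2)."""
    fake = (2.0 / 3.0) * N**3 / time_s / 1e9
    true = ((2.0 / 3.0) * f_lu + (4.0 / 3.0) * (1.0 - f_lu)) * N**3 / time_s / 1e9
    return fake, true

# The LUQR (Max, alpha = 9000) row of Table 1.2:
print(gflops(N=20000, time_s=8.62, f_lu=0.833))   # approximately (619, 722)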

Figure 1.3 summarizes all results for random matrices. The random matrices are generated using the DPLASMA_dplrnt routine. The figure is organized as follows: each of the first three rows corresponds to one criterion. Within a row:
• the first column shows the relative stability (ratio of the HPL3 value divided by the HPL3 value for LUPP);
• the second column shows the GFLOP/sec performance;
• the third column shows the percentage of LU steps during execution.
The fourth row corresponds to a random choice between LU and QR at each step, and is intended to assess the performance obtained for a given ratio of LU vs QR steps. Plotted results are average values obtained on a set of 100 random matrices (we observe a very small standard deviation, less than 2%).

For each criterion, we experimentally chose a set of values of α that provides a representative range of ratios for the number of LU and QR steps. As explained in Section 1.3, for each criterion, the smaller α is, the tighter the stability requirement. Thus, the numerical criterion is met less frequently and the hybrid algorithm processes fewer LU steps. A current limitation of our approach is that we do not know how to auto-tune the best range of values for α, which seems to depend heavily upon the matrix size and the available degree of parallelism. In addition, the range of useful α values is quite different for each criterion.

For random matrices, we observe in Figure 1.3 that the stability of LU NoPiv and LU IncPiv is not satisfactory. We also observe that, for each criterion, small values of α result in better stability, to the detriment of performance. For α = 0, the LU-QR Algorithm processes only QR steps, which leads to exactly the same stability as the HQR Algorithm and almost the same performance. The difference between the performance of the LU-QR Algorithm with α = 0 and HQR comes from the cost of the decision-making steps (saving the panel, computing the LU factorization with partial pivoting on the diagonal domain, computing the choice, and restoring the panel). Figure 1.3 shows that the overhead due to the decision-making process is approximately 10% for the three criteria. This overhead, computed when QR eliminations are performed at every step, is primarily due to the backup/restore steps added to the critical path when QR is chosen. The performance impact of the criterion computation itself is negligible, as one can see by comparing the performance of the random criterion to the MUMPS and Max criteria.

The LU-QR Algorithm with α = ∞ and LU NoPiv both process only LU steps. The only difference between the two algorithms in terms of error analysis is that LU NoPiv seeks a pivot in the diagonal tile, while the LU-QR Algorithm with α = ∞ seeks a pivot in the diagonal domain. This difference has a considerable impact in terms of stability, in particular on random matrices. The LU-QR Algorithm with α = ∞ has a stability slightly inferior to that of LUPP and significantly better than that of LU NoPiv. When the matrix size increases, the relative stability of the LU-QR Algorithm with α = ∞ tends to 1, which means that, on random matrices, processing an LU factorization with partial pivoting on a diagonal domain followed by a direct elimination without pivoting for the rest of the panel is almost as stable as an LU factorization with partial pivoting on the whole panel. A hand-waving explanation would go as follows. The main instabilities are proportional to the small pivots encountered during a factorization.
Using diagonal pivoting, as the factorization of the diagonal tile proceeds, one is left with fewer and fewer choices for a pivot in the tile. Ultimately, for the last entry of the tile, in position (nb, nb), one is left with no choice at all. When working on random matrices, after having performed several successive diagonal factorizations, one is bound to have encountered a few small pivots. These small pivots lead to bad stability. Using a domain (made of several tiles) for the factorization significantly increases the number of choices for the pivot, and it is no longer likely to encounter a small pivot. Consequently, diagonal-domain pivoting significantly increases the stability of the LU-QR Algorithm with α = ∞. When the local domain gets large enough (while remaining significantly smaller than N), the stability obtained on random matrices is about the same as with partial pivoting. When α = ∞, our criterion is deactivated and our algorithm always performs LU steps. We note that, when α is reasonable (as opposed to α = ∞), the algorithm is stable whether we use a diagonal domain or a diagonal tile. However, using a diagonal domain increases the chance of a well-behaved pivot tile for the elimination; therefore, using a diagonal domain (as opposed to a diagonal tile) increases the chances of an LU step.

Using random choices leads to results comparable to those obtained with the three criteria. However, since we are using random matrices in this experiment set, we need to be careful before drawing any conclusion on the stability of our algorithms. If an algorithm is not stable on random matrices, this is clearly bad. However, we cannot draw any definitive conclusion if an algorithm is stable for random matrices.

Table 1.2 displays detailed performance results for the Max criterion with N = 20,000. In column Fake GFLOP/sec, we assume all algorithms perform (2/3)N^3 floating-point operations. In column True GFLOP/sec, we take the number of floating-point operations to be ((2/3) f_LU + (4/3)(1 − f_LU)) N^3, where f_LU is the fraction of the steps that are LU steps (column 4). For this example, we see that the LU-QR Algorithm reaches a peak performance of 677.7 GFLOP/sec (62.1% of the theoretical peak) when
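The pivot-scarcity argument above can be illustrated by a quick Monte Carlo experiment (ours, not part of the original experiment set): compare the smallest pivot produced by partial pivoting restricted to a single nb × nb tile with that produced on a taller diagonal domain.

import numpy as np
import scipy.linalg as sla

def smallest_pivot(m, nb, trials=200, seed=0):
    """Smallest |U[j,j]| over all pivots when LU with partial pivoting
    searches an m x nb block (m = nb: single tile; m > nb: domain)."""
    rng = np.random.default_rng(seed)
    worst = np.inf
    for _ in range(trials):
        _, _, U = sla.lu(rng.uniform(-0.5, 0.5, (m, nb)))
        worst = min(worst, np.min(np.abs(np.diag(U))))
    return worst

nb = 32
print("tile   (nb x nb):  ", smallest_pivot(nb, nb))
print("domain (8nb x nb): ", smallest_pivot(8 * nb, nb))  # pivots stay much larger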


No.  Matrix     Description                                                      Condition number (N = 4000)
1    house      Householder matrix, A = eye(n) − β ∗ v ∗ v^H.                    1
2    parter     Parter matrix, a Toeplitz matrix with most of its singular       4.8
                values near π. A(i,j) = 1/(i − j + 0.5).
3    ris        Ris matrix, with elements A(i,j) = 0.5/(n − i − j + 1.5).        4.8
                The eigenvalues cluster around −π/2 and π/2.
4    condex     Counter-example matrix to condition estimators.                  1.0 × 10^2
5    circul     Circulant matrix.                                                2.2 × 10^4
6    hankel     Hankel matrix, A = hankel(c, r), where c = randn(n, 1),          3.8 × 10^4
                r = randn(n, 1), and c(n) = r(1).
7    compan     Companion matrix (sparse), A = compan(randn(n + 1, 1)).          4.1 × 10^6
8    lehmer     Lehmer matrix, a symmetric positive definite matrix such that    1.7 × 10^7
                A(i,j) = i/j for j ≥ i. Its inverse is tridiagonal.
9    dorr       Dorr matrix, a diagonally dominant, ill-conditioned,             1.6 × 10^11
                tridiagonal matrix (sparse).
10   demmel     A = D ∗ (eye(n) + 10^−7 ∗ rand(n)), where                        9.8 × 10^20
                D = diag(10^(14∗(0:n−1)/n)).
11   chebvand   Chebyshev Vandermonde matrix based on n equally spaced           2.2 × 10^19
                points on the interval [0, 1].
12   invhess    Its inverse is an upper Hessenberg matrix.                       -
13   prolate    Prolate matrix, an ill-conditioned Toeplitz matrix.              9.4 × 10^18
14   cauchy     Cauchy matrix.                                                   1.9 × 10^21
15   hilb       Hilbert matrix, with elements 1/(i + j − 1). A = hilb(n).        7.5 × 10^21
16   lotkin     Lotkin matrix, the Hilbert matrix with its first row altered     2.1 × 10^23
                to all ones.
17   kahan      Kahan matrix, an upper trapezoidal matrix.                       1.4 × 10^27
18   orthogo    Symmetric eigenvector matrix:                                    1
                A(i,j) = sqrt(2/(n + 1)) ∗ sin(i ∗ j ∗ π/(n + 1)).
19   wilkinson  Matrix attaining the upper bound of the growth factor of GEPP.   9.4 × 10^3
20   foster     Matrix arising from using the quadrature method to solve a       2.6 × 10^3
                certain Volterra integral equation.
21   wright     Matrix with an exponential growth factor when Gaussian           6.5
                elimination with partial pivoting is used.

Table 1.3: Special matrices in the experiment set.


1.5.6 Assessment of the three criteria

With respect to stability, while the three criteria behave similarly on random matrices, we observe different behaviors for special matrices. The MUMPS criterion provides good results for most of the tested matrices, but not for all. If stability is the key concern, one may prefer to use the Max criterion (or the Sum criterion), which performs well for all special matrices (which means that the upper bound of (1 + α)^(n−1) on the growth is quite pessimistic). With respect to performance, we observe very comparable results, which means that the overhead induced by computing the criterion at each step is of the same order of magnitude for all criteria. The overall conclusion is that all criteria bring significant improvement over LUPP in terms of stability, and over HQR in terms of performance. Tuning the value of the robustness parameter α enables the exploration of a wide range of stability/performance trade-offs.

1.6 Related work

State-of-the-art QR factorizations use multiple eliminators per panel in order to dramatically reduce the critical path of the algorithm. These algorithms are unconditionally stable, and their parallelization has been fairly well studied on shared-memory systems [27, 90, 24] and on parallel distributed systems [42]. The idea of mixing Gaussian transformations and orthogonal transformations has been considered once before. Irony and Toledo [73] present an algorithm for reducing a banded symmetric indefinite matrix to diagonal form; it uses symmetric Gaussian transformations and Givens rotations to maintain the banded symmetric structure, and achieves stability similar to that of partial symmetric pivoting. The reason for using LU kernels instead of QR kernels is performance: (i) LU performs half the number of floating-point operations of QR; (ii) LU kernels rely on GEMM kernels, which are very efficient, while QR kernels are more complex and much less tuned, hence not as efficient; and (iii) the LU update is much more parallel than the QR update. All in all, LU is much faster than QR (as observed in the performance results of Section 1.5). Because of the large number of communications and synchronizations induced by pivoting in the reference LUPP algorithm, communication-avoiding variants of LUPP have been introduced [36], but they have proven much more challenging to design because of stability issues. In the following, we review several approaches:

1.6.1 LUPP

LU with partial pivoting is not a communication-avoiding scheme, and its performance in a parallel distributed environment is low (see Section 1.5). However, the LUPP algorithm is stable in practice, and we use it as a reference for stability.

1.6.2 LU NoPiv

The most basic communication-avoiding LU algorithm is LU NoPiv. This algorithm is stable for block diagonally dominant matrices [67, 37], but breaks down if it encounters a nearly singular diagonal tile, or loses stability if it encounters a diagonal tile whose smallest singular value is too small. Baboulin et al. [16] propose to apply a random transformation to the initial matrix, in order to use LU NoPiv while maintaining stability. This approach gives about the same performance as LU NoPiv, since the preprocessing and postprocessing costs are negligible. It is hard to be fully satisfied with this approach [16] because, for any matrix which is rendered stable by it (i.e., LU NoPiv is stable), there exists a matrix which is rendered unstable. Nevertheless, in practice, this proves to be a valid approach.

1.6.3 LU IncPiv

LU IncPiv is another communication-avoiding LU algorithm [28, 90]. Incremental pivoting is also called pairwise pivoting. The stability of the algorithm [28] is not sufficient and degrades as the number of tiles in the matrix increases (see our experimental results on random matrices). The method also suffers from some of the same performance degradations as QR factorizations with multiple eliminators per panel, namely lower-performing kernels and some dependencies in the update phase.

1.6.4 CALU

CALU [61] is a communication-avoiding LU algorithm. It uses tournament pivoting, which has been proven to be stable in practice [61]. CALU shares the (good) properties of our LU steps: (i) a low number of floating-point operations; (ii) the use of efficient GEMM kernels; and (iii) an embarrassingly parallel update. The advantage of CALU over our algorithm is essentially that it performs only LU steps, while our algorithm might need to perform some (more expensive) QR steps. The disadvantage is that, at each step, CALU needs to perform global pivoting on the whole panel, which then needs to be applied during the update phase to the whole trailing submatrix. There is no publicly available implementation of parallel distributed CALU, so it was not possible to compare stability or performance. CALU is known to be stable in practice [60, 40]. Performance results of CALU on a parallel distributed platform are presented in [60], and performance results on a single multicore node in [40].

1.6.5 Summary

Table 1.4 provides a summary of the key characteristics of the algorithms discussed in this section.

ALGORITHM                 CA    KERNELS EFF.      PIPELINED UPDATE   #FLOPS   STABLE
LU NoPiv                  YES   GEMM-EFFICIENT    YES                1x       NOT AT ALL
LU IncPiv                 YES   LESS EFFICIENT    YES                1x       SOMEWHAT
CALU                      YES   GEMM-EFFICIENT    NO                 1x       PRACTICALLY
LUQR (α), LU steps only   YES   GEMM-EFFICIENT    NO                 1x       PRACTICALLY
LUQR (α), QR steps only   YES   LESS EFFICIENT    NO                 2x       UNCONDITIONALLY
HQR                       YES   LESS EFFICIENT    YES                2x       UNCONDITIONALLY
LUPP                      NO    GEMM-EFFICIENT    NO                 1x       PRACTICALLY

Table 1.4: A summary of key characteristics of each algorithm. CA in column 2 stands for Communication Avoiding. Other column titles are self-explanatory.

1.7 Conclusion

Linear algebra software designers have been struggling for years to improve the parallel efficiency of LUPP (LU with partial pivoting), the de-facto method of choice for solving dense linear systems. The search for good pivots throughout the elimination panel is the key to stability (and indeed both NoPiv and IncPiv fail to provide acceptable stability), but it induces several short-length communications that dramatically decrease the overall performance of the factorization.

Communication-avoiding algorithms are a recent alternative which proves very relevant on today's architectures. For example, in our experiments, our HQR factorization [42] based on QR kernels achieves performance similar to that of ScaLAPACK LUPP, while performing twice as many floating-point operations, using slower sequential kernels, and having a less parallel update phase. In this chapter, stemming from the key observation that LU steps and QR steps can be mixed during a factorization, we present the LU-QR Algorithm, whose goal is to accelerate the HQR algorithm by introducing some LU steps whenever these do not compromise stability. The hybrid algorithm represents dramatic progress in a long-standing research problem. By restricting pivoting to the diagonal domain, i.e., locally, but by doing so only when the robustness criterion forecasts that it is safe (and going to a QR step otherwise), we improve performance while guaranteeing stability. And we provide a continuous range of trade-offs between LU NoPiv (efficient but only stable for diagonally dominant matrices) and QR (always stable but twice as costly and with lower performance). For some classes of matrices (e.g., tile diagonally dominant), the LU-QR Algorithm will only perform LU steps.

This work opens several research directions. First, as already mentioned, the choice of the robustness parameter α is left to the user, and it would be very interesting to be able to auto-tune a possible range of values as a function of the problem and platform parameters. Second, there are many variants and extensions of the hybrid algorithm that can be envisioned. Several have been mentioned in Section 1.2, and many others could be tried. In particular, the tile extension of the MUMPS criterion looks promising and deserves to be implemented in software in future work. Another goal would be to derive LU algorithms with several eliminators per panel (just as for HQR) to decrease the critical path, provided the availability of a reliable robustness test to ensure stability.

Chapter 2

Bridging the gap between experimental performance and theoretical bounds for the Cholesky factorization on heterogeneous platforms

Most linear algebra libraries rely on runtime systems to deal with resource allocation and low-level data management, such as MPI communications or shared-memory accesses. These runtimes allow developers to focus on their algorithms at the task level. In the previous chapter, we designed a new dense linear solver relying on the dynamic scheduler provided by PaRSEC. To increase the performance of linear solvers, it is thus important to optimize the resource allocation performed by these runtime systems. In this chapter, we consider the problem of allocating and scheduling a dense linear algebra application on a fully heterogeneous platform made of CPUs and GPUs. More specifically, we focus on the Cholesky factorization, since it is the simplest factorization algorithm that exhibits the main problems encountered in heterogeneous scheduling. Indeed, the relative performance of CPUs and GPUs highly depends on the sub-routine: GPUs are, for instance, much more efficient at processing regular kernels, such as matrix-matrix multiplications, than at irregular kernels, such as matrix factorizations. In this context, one solution consists in relying on dynamic scheduling and resource allocation mechanisms such as the ones provided by PaRSEC or StarPU. In this chapter, we analyze the performance of dynamic schedulers based on both actual executions and simulations, and we investigate how adding static rules, based on an offline analysis of the problem, to their decision process can improve their performance, up to reaching some improved theoretical performance bounds which we introduce.

2.1 Introduction

Linear algebra operations are at the basis of many scientific applications. Our objective is to optimize the performance of one of them (namely the Cholesky decomposition) on a hybrid computing platform. The use of GPUs and other accelerators such as the Xeon Phi is a common way to increase the computational power of computers at a limited cost. The large computing power available on such accelerators for regular computations makes them unavoidable for linear algebra operations. However, optimizing a complex computation on such a hybrid platform is very difficult, and manual optimization seems out of reach given the wide variety of hybrid configurations. Thus, several runtime systems have been proposed to dynamically schedule a computation on hybrid platforms, by mapping parts of the computation to each processing element, either cores or accelerators. Among other successful projects,

we may cite StarPU [14] from INRIA Bordeaux (France), Quark [110] and PaRSEC [21] from ICL, Univ. of Tennessee Knoxville (USA), Supermatrix [30] from the University of Texas (USA), StarSs [87] from the Barcelona Supercomputing Center (Spain), and KAAPI [50] from INRIA Grenoble (France). Usually, the structure of the computation has to be described as a task graph, where vertices represent tasks and edges represent dependencies between them. Most of these tools enable, up to a certain extent, to schedule an application described as a task graph onto a parallel platform, by mapping individual tasks onto computing resources and by performing data movements between memories when needed.

There is an abundant literature on the problem of scheduling task graphs on parallel processors. This problem is known to be NP-complete [49]. Lower bounds based either on the length of the critical path (the longest path from an entry vertex to an output vertex) or on the overall workload (assuming ideal parallelism) have been proposed, and simple list-scheduling algorithms are known to provide a (2 − 1/m)-approximation on homogeneous platforms, at least when communication times are negligible [56]. Several scheduling heuristics have also been proposed, and among them the best known certainly is HEFT [104], which inspired some dynamic scheduling strategies used in the above-mentioned runtimes. However, there remains a large gap between the theoretical lower bounds and the actual performance of dynamic HEFT-like heuristics. Another way to assess the quality of a scheduling strategy is to compare the actual performance to the machine peak performance of the computing platform, computed as the sum of the performance of its individual computational units. Rather than this machine peak performance, which is known to be unreachable, one usually considers the GEMM peak, obtained by running matrix multiplication kernels (GEMMs). For large matrices, the task graph of a Cholesky factorization exhibits a sufficient amount of parallelism, and a sufficient number of GEMM calls, for this bound to be reasonable. However, on small and medium size matrices, there is still a large gap between the GEMM peak performance and the best achievable Cholesky performance.

In this chapter, we optimize the dynamic scheduling of the Cholesky decomposition of a dense, symmetric, positive-definite double-precision matrix A into the product LL^T, where L is lower triangular and has positive diagonal elements, using one runtime system, StarPU, and we provide better bounds to prove the quality of our schedules. The contributions of the chapter are:
• better lower bounds on the processing time of a Cholesky factorization on a parallel hybrid platform;
• better dynamic schedules, based not only on HEFT but also on a hybridization of static and dynamic task assignments;
• a very efficient schedule for a simple hybrid platform model, achieved by constraint programming;
• numerous experiments and simulations to assess the performance of our schedules using the StarPU runtime.
Note that what is done here using StarPU could have been done with other runtimes, provided that we are able to control their mapping and scheduling policies. Similarly, we could have chosen another dense linear algebra factorization, such as the QR or LU decomposition.

2.2 Context

2.2.1 Cholesky factorization

The Cholesky factorization (or Cholesky decomposition) is mainly used to solve a system of linear equations Ax = b, where A is an N × N symmetric positive-definite matrix, b is a vector, and x is the

Algorithm 5: Pseudocode of the tile Cholesky factorization

for k = 0 to n − 1 do
    A[k][k] ← POTRF(A[k][k])
    for i = k + 1 to n − 1 do
        A[i][k] ← TRSM(A[k][k], A[i][k])
    for j = k + 1 to n − 1 do
        A[j][j] ← SYRK(A[j][k], A[j][j])
        for i = j + 1 to n − 1 do
            A[i][j] ← GEMM(A[i][k], A[j][k], A[i][j])

unknown solution vector to be computed. Such systems often arise in physics applications, especially when looking for numerical solutions of partial differential equations, where A is positive-definite due to the nature of the modeled physical phenomenon. One way to solve such a linear system is first to compute the Cholesky factorization A = LLT , where L (referred to as the Cholesky factor) is a N × N real lower triangular matrix with positive diagonal elements. The solution vector x can then be computed by solving the two following triangular systems: Ly = b and LT x = y. To take advantage of modern highly parallel architectures, state-of-the-art numerical algebra libraries implement tiled Cholesky factorizations. The matrix A = (Aij)0≤i,j≤n is divided into n × n tiles (or blocks) of nb × nb elements, and the tiled Cholesky algorithm can then be seen as a sequence of tasks that operate on small portions of the matrix. This approach greatly improves the parallelism of the algorithm and mostly involves BLAS3 kernels whose library implementations are really fast on modern architectures. The benefits of such an approach on parallel multicore systems have already been discussed in the past [63, 28, 89]. Following the BLAS and LAPACK terminology, the tiled algorithm for Cholesky factorization is based on the following set of kernel subroutines:

• POTRF: This LAPACK subroutine is used to perform the Cholesky factorization of a symmetric positive definite tile Akk of size nb × nb producing a lower triangular tile Lkk of size nb × nb.

• TRSM: This BLAS subroutine is used to apply the transformation computed by POTRF to an Aik tile by means of a triangular system solve.

• GEMM: This BLAS subroutine computes a matrix-matrix multiplication of two tiles Aik, Ajk and subtracts the result from the tile Aij. The old value of Aij is overwritten by the new one.

• SYRK: This BLAS subroutine executes a symmetric rank-k update on a diagonal tile Akk.

Note that no extra memory area is needed to store the Lij tiles, since they can overwrite the corresponding Aij tiles of the original matrix. The tiled Cholesky algorithm can be written as in Algorithm 5. In this sequential pseudocode, we can notice that some kernel subroutines depend on each other, while others can be processed in parallel. Such an algorithm is commonly represented by its task graph (or DAG), which depicts its actual dependencies. In this well-established model, each vertex of the graph represents a call to one of the four kernel subroutines presented above. The edges between two task vertices represent a direct data dependency between the tasks. Figure 2.1 depicts the task graph for the Cholesky decomposition of a 5 × 5 tiled matrix.
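To make the loop structure concrete, here is a minimal sequential NumPy sketch of Algorithm 5, with np.linalg.cholesky and scipy.linalg.solve_triangular standing in for the POTRF and TRSM kernels (illustrative only; the libraries discussed here run these tasks in parallel under a runtime system).

import numpy as np
from scipy.linalg import solve_triangular

def tile_cholesky(A, nb):
    """In-place tiled Cholesky of a symmetric positive-definite matrix A
    (lower triangle), following the loop structure of Algorithm 5."""
    n = A.shape[0] // nb
    t = lambda i, j: A[i*nb:(i+1)*nb, j*nb:(j+1)*nb]
    for k in range(n):
        t(k, k)[:] = np.linalg.cholesky(t(k, k))                 # POTRF
        for i in range(k + 1, n):
            t(i, k)[:] = solve_triangular(t(k, k), t(i, k).T,    # TRSM:
                                          lower=True).T          # Aik <- Aik Lkk^{-T}
        for j in range(k + 1, n):
            t(j, j)[:] -= t(j, k) @ t(j, k).T                    # SYRK
            for i in range(j + 1, n):
                t(i, j)[:] -= t(i, k) @ t(j, k).T                # GEMM
    return np.tril(A)

rng = np.random.default_rng(0)
M = rng.standard_normal((8 * 16, 8 * 16))
A = M @ M.T + 8 * 16 * np.eye(8 * 16)                            # SPD test matrix
L = tile_cholesky(A.copy(), nb=16)
print(np.allclose(L @ L.T, A))                                   # True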


Figure 2.1: Task graph for the Cholesky decomposition of a 5 × 5 tiled matrix.

2.2.2 Multiprocessor scheduling

2.2.2.1 Static Mapping

In such an algorithm, the way the matrix tiles are distributed over the processors and the way the tasks are mapped onto the computing resources have a strong impact on performance and scalability. In the distributed context, the ScaLAPACK library [39] uses the standard 2D block-cyclic distribution of tiles along a virtual p-by-q homogeneous grid, together with the owner-compute strategy. In this layout, the p-by-q top-left tiles of the matrix are mapped topologically onto the processor grid, and the remaining tiles are distributed onto the processors in a round-robin manner. This layout has been incorporated in the High Performance Fortran standard [79]. Once the tiles of the matrix are distributed onto the processors, the mapping of the tasks follows the owner-compute strategy: each subroutine overwriting a tile Aij is executed on the processor hosting Aij. The only freedom during the execution lies in the scheduling of the tasks within a processor and in the communications between processors. This layout ensures a good load balance for homogeneous computing resources and an equal memory usage among processors [39]. However, for heterogeneous resources, this layout is no longer an option, and dynamic scheduling is a widespread practice. These ideas also make sense in a shared-memory environment in order to take advantage of locality. For instance, the Plasma library provides an option for relying on such static schedules on multicore chips.

2.2.2.2 Dynamic Scheduling Dynamic strategies have been developed in order to design methods flexible enough to cope with un- predictable performance of resources, especially in the context of real time systems, where on-line and adaptive scheduling strategies are required [31, 82]. More recently, the design of dynamic schedulers 2.3. TOOLS AND LIBRARIES 33 received a lot of attention, since on modern heterogeneous and possibly shared systems, the actual pre- diction of either execution and communication times is very hard, thus justifying the design of ad-hoc tools that will be described in Section 2.3.

2.2.2.3 Task graph scheduling heuristics

As presented earlier, many scheduling heuristics have been proposed for DAGs, since this problem is NP-complete. Most of these heuristics are list-scheduling heuristics: they sort tasks according to some criterion and then schedule them greedily. This makes them good candidates to be turned into dynamic scheduling heuristics. The best-known list-scheduling heuristic for DAGs on heterogeneous platforms is certainly HEFT [104]. It consists in sorting tasks by decreasing bottom-level, which is the weight of the longest path from a task to an exit task (a task without successors). In a heterogeneous environment, the weight of a task (or communication) is computed as the average computation (or communication) time over the whole platform. Then, each task is considered in turn and scheduled on the resource on which it will finish the earliest. HEFT turns out to be an efficient heuristic for heterogeneous processors. Other approaches have been proposed to avoid data movement when taking communications into account, such as clustering tasks into larger-granularity tasks before scheduling them [93].
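As an illustration of the principle (not the StarPU code), the following sketch ranks tasks by bottom-level computed from average costs, then greedily assigns each task to the machine that finishes it earliest; the task graph, machine names, and costs are made up.

from functools import lru_cache

def heft(tasks, succ, cost, machines):
    """tasks: list of ids; succ[i]: successors of i; cost[i][m]: time of task i
    on machine m; machines: list of machine ids. Returns (makespan, mapping)."""
    avg = {i: sum(cost[i].values()) / len(cost[i]) for i in tasks}

    @lru_cache(maxsize=None)
    def bottom_level(i):                           # longest avg-weighted path to exit
        return avg[i] + max((bottom_level(j) for j in succ[i]), default=0.0)

    ready_at = {m: 0.0 for m in machines}          # machine availability
    finish, mapping = {}, {}
    for i in sorted(tasks, key=bottom_level, reverse=True):
        est = max((finish[p] for p in tasks if i in succ[p]), default=0.0)
        m = min(machines, key=lambda m: max(est, ready_at[m]) + cost[i][m])
        start = max(est, ready_at[m])
        finish[i] = ready_at[m] = start + cost[i][m]
        mapping[i] = m
    return max(finish.values()), mapping

# Toy diamond DAG on one CPU and one GPU (costs are made up)
tasks = [0, 1, 2, 3]
succ = {0: [1, 2], 1: [3], 2: [3], 3: []}
cost = {0: {'cpu': 3, 'gpu': 2}, 1: {'cpu': 4, 'gpu': 1},
        2: {'cpu': 4, 'gpu': 1}, 3: {'cpu': 3, 'gpu': 2}}
print(heft(tasks, succ, cost, ['cpu', 'gpu']))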

2.3 Tools and libraries

For this study, we used the Chameleon [4] implementation of the Cholesky factorization, running on top of the StarPU runtime system. We performed real executions on the target platform, and we additionally used the Simgrid [29] simulator, in order to reduce the experimentation time, improve reproducibility of the experiments, and also be able to modify the execution platform.

2.3.1 StarPU runtime system

StarPU [14] is a runtime system aiming to allow programmers to exploit the computing power of the available CPUs and GPUs, while relieving them from the need to specifically adapt their programs to the target machine and processing units. The StarPU runtime supports a task-based programming model. Applications submit computational tasks, forming a task graph, with CPU and/or GPU implementations, and StarPU schedules these tasks and the associated data transfers on the available CPUs and GPUs. The data that a task manipulates is automatically transferred between the local memory of the accelerators and the main memory, so that application programmers are freed from the scheduling issues and technical details associated with these transfers. In particular, StarPU takes care of scheduling tasks efficiently, using well-known generic dynamic and task-graph scheduling policies from the literature (see Section 2.2.2), and of optimizing data transfers using prefetching and overlapping. In addition, it allows scheduling experts, such as compiler or computational library developers, to implement custom scheduling policies in a portable fashion. In this study, we specialize the StarPU scheduling algorithms to include a mixture of static and dynamic task assignments, based on the knowledge of the Cholesky task graph, to improve performance on small and medium size matrices. In the following, we call "small" a matrix with fewer than 10 × 10 tiles, "medium" a matrix with between 10 × 10 and 20 × 20 tiles, and "large" a matrix with more than 20 × 20 tiles.

2.3.2 Chameleon dense linear algebra library

To cope with the increased degree of parallelism, a new class of dense linear algebra algorithms has been proposed, often referred to as tile algorithms in the literature [28, 89]. These algorithms led to the design of new libraries in the past five years, such as Plasma [28], Flame [72] and DPlasma [19]. Although both static and dynamic versions of the algorithms have been initially implemented, the dynamic codes are now predominant, since they proved to provide more flexibility. These dynamic codes rely on runtime systems (Quark [110], Supermatrix [30], PaRSEC [21]) that have been specifically designed for the purpose of the numerical software (in the case of Plasma, Flame and DPlasma, respectively). The advantage of relying on specialized runtime systems is that they can be optimized for both the numerical algorithm and the target architecture. On the other hand, designing and maintaining a runtime system is a highly time-consuming task, which makes it difficult to design a fully-featured specialized runtime system. The Chameleon library is based on the Plasma tile algorithms and code, but relies on the StarPU generic runtime system instead of the specialized Quark runtime system. One advantage is that it allows for handling heterogeneous architectures (whereas Plasma and Quark were initially designed for multicore chips). Another advantage, when aiming at focusing on the impact of scheduling strategies, is that it allows for running in simulation mode with the field-proven combination [99] of StarPU and Simgrid.

2.3.3 Simgrid simulation engine Simgrid [29] is a versatile simulation toolkit initially designed to study the behavior of large-scale dis- tributed systems like grids, clouds, or peer-to-peer systems. It builds on fluid network models that have been proven as a reasonable alternative to both simple analytic models and expensive, difficult-to- instantiate packet-level simulations. The Simgrid version of StarPU [99] uses Simgrid to simulate the execution of an application within a single machine. The idea is to run the application normally, except that data transfers and computation kernel calls are replaced by a simple procedure accounting for the time they are expected to take, and gathered coherently by Simgrid. StarPU models each execution unit (CPUs and GPUs) by defining the time taken by each execution unit on each possible task/kernel [14]. It also models the PCI buses between them, using offline bus bandwidth measurements, and relies on Simgrid to compute the interferences on PCI buses between the different transfers. The resulting simulated times are very close to actual measurements on the real platforms [99], and properly reproduce the various behaviors that can be observed for the various schedulers. This allows one to run experiments with the Simgrid version of StarPU, which provides several advantages:

• The time to simulate an execution is reduced, since no actual computation or data transfer is done. The Simgrid simulator itself is not parallel, so the whole execution gets serialized, but several simulations can be run in parallel, e.g., for various matrix sizes or schedulers, and one then gets all the results in parallel.

• The experiments do not depend on the availability of the platform, both in terms of quotas and in terms of versions of the installed software, thus allowing reproducible experiments. This proved useful while performing the experiments for this very chapter, since the platform became unavailable for a couple of weeks due to air conditioning issues.

• The platform can be modified, for instance to change the available PCI bandwidth, the execution times of the kernels, etc. In Section 2.5.3.2, we use this feature in order to build a virtual "related" heterogeneous platform.

2.4 Lower bounds

Performance results for linear algebra computations are often accompanied by an upper bound in terms of FLOP/s, in order to assess the achieved efficiency. Since the theoretical peak performance is usually unreachable, particularly with GPUs, the common bound being used is the performance of a simple matrix multiplication (GEMM), since this is the most efficient dense linear algebra operation, thus providing a good hint of the achievable performance. This bound takes into account the heterogeneity of the platform by summing up the GFLOP/s obtained on the various processing elements. It does not, however, take into account the heterogeneity of the application, which is particularly important for small and medium matrices, for which a fair amount of the tasks are not GEMMs but much less efficient tasks such as POTRFs, especially on accelerators.

We here propose much more accurate bounds that take into account both the heterogeneity of the computation resources and that of the application kernels, by taking as input the execution time of any kernel on any type of resource. They also, to a certain extent, take into account the task graph itself, in terms of task dependencies.

2.4.1 Linear Programming formulation

The lower bound computation is based on a relaxation of the scheduling problem, in which almost all precedence constraints are ignored. This formulation focuses on the number n_{rt} of tasks of each type t (GEMM, SYRK, TRSM, POTRF) which are executed on each resource type r (CPU, GPU, ...). From the Cholesky task graph, we know the number N_t of tasks of each type t that need to be performed, and from the platform we know the number M_r of processing elements of each type r available to schedule the tasks. For each task type t and resource type r, the calibration mechanisms inside StarPU provide the execution time T_{rt} of these tasks on this resource type. The basic area bound is obtained by solving the following linear program:

minimize l such that

∀t, \sum_r n_{rt} = N_t        (all N_t tasks of type t get executed)
∀r, \sum_t n_{rt} T_{rt} ≤ l × M_r        (resources of type r complete their tasks)
∀r, ∀t, n_{rt} ∈ ℕ

It is clear that the optimal value l* of this linear program is a lower bound on the total execution time of the task graph, since any execution needs to execute all tasks. Ignoring the task graph precedences in this bound allows one to handle all tasks of the same type with a couple of variables (one per resource type), instead of having one variable per task in the graph, thus limiting the number of variables and reducing symmetries in the solution space. StarPU is able, without any input from the application beyond the normal task submission, to automatically generate this program and solve it on the fly, right after the application execution, which thus allows one to print this theoretical bound alongside the measured performance in the application output.

Due to the actual timings of the different task types, this linear program always decides that all POTRF tasks should be executed on CPUs, since all other task types make much more efficient use of the GPU resources. However, in practice all POTRF tasks are on the critical path of the Cholesky graph, and hence the resulting lower bound is too optimistic for small matrix sizes, since it does not take dependencies into account. This interesting feature of the Cholesky task graph, to contain a path with all n POTRF tasks, can be used to strengthen the bound without adding other variables to the linear program. In addition to the n POTRF tasks, this path contains n − 1 of the n(n−1)/2 TRSM tasks, and n − 1 of the n(n−1)/2 SYRK tasks. We can thus add the following constraint, which states that the execution time is necessarily larger than the time to execute all these tasks in sequence:

\sum_r n_{rP} T_{rP} + (n − 1) × T_T^* + (n − 1) × T_S^* ≤ l

In this constraint, T_T^* and T_S^* denote the fastest execution times of TRSM and SYRK tasks, respectively: we do not model exactly on which resources these TRSM and SYRK tasks are executed, and thus underestimate their completion times, ignoring which resource they actually run on [1]. The resulting lower bound is called the mixed bound in the rest of the chapter. This linear program has a very small number of variables and constraints (in particular, they are independent of the matrix size), and it can thus be solved very quickly.
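As an illustration, the area and mixed bounds can be reproduced with a small linear program; the sketch below uses scipy.optimize.linprog on made-up task counts and timings (the continuous relaxation drops the integrality of the n_{rt}, which can only lower, and hence not invalidate, the bound).

import numpy as np
from scipy.optimize import linprog

# Illustrative inputs: 2 resource types, 4 task types, n x n tiles
n = 20
R, Ttypes = ['cpu', 'gpu'], ['POTRF', 'TRSM', 'SYRK', 'GEMM']
M = {'cpu': 8, 'gpu': 3}                          # processing units per type
N = {'POTRF': n, 'TRSM': n*(n-1)//2, 'SYRK': n*(n-1)//2,
     'GEMM': n*(n-1)*(n-2)//6}                    # task counts for tiled Cholesky
T = {('cpu', 'POTRF'): 5.0, ('gpu', 'POTRF'): 12.0,   # made-up timings (ms)
     ('cpu', 'TRSM'): 9.0,  ('gpu', 'TRSM'): 1.5,
     ('cpu', 'SYRK'): 9.0,  ('gpu', 'SYRK'): 1.0,
     ('cpu', 'GEMM'): 12.0, ('gpu', 'GEMM'): 1.2}

idx = {(r, t): k for k, (r, t) in enumerate([(r, t) for r in R for t in Ttypes])}
nv = len(idx) + 1                                 # last variable is the makespan l
c = np.zeros(nv); c[-1] = 1.0                     # minimize l

A_eq = np.zeros((len(Ttypes), nv)); b_eq = np.array([N[t] for t in Ttypes], float)
for j, t in enumerate(Ttypes):
    for r in R:
        A_eq[j, idx[r, t]] = 1.0                  # every task gets executed

A_ub = np.zeros((len(R) + 1, nv)); b_ub = np.zeros(len(R) + 1)
for j, r in enumerate(R):                         # area constraints
    for t in Ttypes:
        A_ub[j, idx[r, t]] = T[r, t]
    A_ub[j, -1] = -M[r]
for r in R:                                       # critical-path (mixed) constraint
    A_ub[-1, idx[r, 'POTRF']] = T[r, 'POTRF']
A_ub[-1, -1] = -1.0
b_ub[-1] = -(n - 1) * (min(T[r, 'TRSM'] for r in R) + min(T[r, 'SYRK'] for r in R))

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq)
print("mixed lower bound on the makespan:", res.x[-1])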

2.4.2 Constraint Programming formulation

In addition to this lower bound computation, we have used a Constraint Programming formulation of the scheduling problem, in order to obtain good feasible solutions. These solutions provide both a comparison point for StarPU schedules and a limit on the possible improvements of the lower bound. The formulation contains one boolean variable b_{ir} for each task i and each resource type r (only one of them can be true for a given task), and one integer variable s_i for each task i, representing the starting time of the task. The constraints are the following:

minimize l such that

∀i, OnlyOne(b_{i1}, …, b_{iR})        (only one type of resource executes task i)

∀i, s_i + \sum_r b_{ir} T_{ir} ≤ l        (task i completes)

∀r, ∀t, |{ i : b_{ir} = 1 and s_i ≤ t < s_i + T_{ir} }| ≤ M_r        (at any time t, the M_r resources of type r execute at most M_r tasks)

∀i → j, s_i + \sum_r b_{ir} T_{ir} ≤ s_j        (dependency i → j is respected)

We have implemented this constraint programming formulation using CP Optimizer v12.4. The first constraint is expressed using the alternative constraint, and the third constraint uses the concept of cumulative functions to express the number of tasks which use resources of type r at time t. The other constraints are simple linear constraints and are easily expressed. The solver explores the solution space with an exhaustive search and backtracking, using constraint propagation to reduce the search space as much as possible. Furthermore, providing the result of a HEFT heuristic as an initial solution allows the solver to explore good solutions more rapidly. We let the solver optimize for 23 hours and keep the best solution found in this duration.
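For illustration, the same model can be sketched with a generic CP solver; the sketch below uses OR-Tools CP-SAT as a stand-in for the CP Optimizer model actually used in this work, and the tiny task graph, durations and resource counts are hypothetical placeholders.

```python
# Sketch of the CP scheduling model (OR-Tools CP-SAT stand-in).
# Tasks, dependencies, durations and resource counts are illustrative.
from ortools.sat.python import cp_model

tasks = ["A", "B", "C"]                        # hypothetical task graph
deps = [("A", "B"), ("A", "C")]                # dependencies i -> j
M = {"CPU": 2, "GPU": 1}                       # M_r resources per type
T = {"CPU": {"A": 5, "B": 8, "C": 8}, "GPU": {"A": 4, "B": 2, "C": 2}}
horizon = sum(max(T[r][i] for r in M) for i in tasks)

model = cp_model.CpModel()
l = model.NewIntVar(0, horizon, "makespan")
s, e, b = {}, {}, {}
intervals = {r: [] for r in M}
for i in tasks:
    s[i] = model.NewIntVar(0, horizon, f"s_{i}")
    e[i] = model.NewIntVar(0, horizon, f"e_{i}")
    for r in M:
        b[i, r] = model.NewBoolVar(f"b_{i}_{r}")
        # optional interval: present iff task i runs on a resource of type r
        intervals[r].append(model.NewOptionalIntervalVar(
            s[i], T[r][i], e[i], b[i, r], f"iv_{i}_{r}"))
    model.Add(sum(b[i, r] for r in M) == 1)    # one resource type per task
    model.Add(e[i] <= l)                       # task i completes
for r in M:                                    # at most M_r concurrent tasks of type r
    model.AddCumulative(intervals[r], [1] * len(intervals[r]), M[r])
for i, j in deps:
    model.Add(e[i] <= s[j])                    # dependency i -> j is respected

model.Minimize(l)
solver = cp_model.CpSolver()
solver.parameters.max_time_in_seconds = 10.0
if solver.Solve(model) in (cp_model.OPTIMAL, cp_model.FEASIBLE):
    print("makespan:", solver.Value(l))
```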

The obtained solutions are quite good compared to what is obtained with the other heuristics, but the solver is unable to prove optimality. Because it would otherwise be extremely costly to solve (we also wrote a version of the constraint programming formulation which takes data transfer times into account, but could not obtain results at the scale of interest for this chapter), this formulation does not take data transfers into account. With the usual platforms and the dense linear algebra operation being studied (the Cholesky factorization), data transfers are indeed not a concern: computation is dense enough for transfers to be largely overlapped with kernel computation.

2.4.3 Upper bounds on performance

Lower bounds on execution time also give upper bounds on performance. Therefore, we have plotted different theoretical performance upper bounds of the Cholesky factorization in Figure 2.2, based on real execution timings of the different tasks on the Mirage machine (described in Section 2.5.2).


Figure 2.2: Heterogeneous theoretical performance upper bounds

The critical path bound is calculated from the critical path of the Cholesky task graph. While calculating the critical path, we take into consideration the fastest execution time of each task among the different resources. The area bound and mixed bound calculations are based on the descriptions given in subsection 2.4.1. Since GEMM is the fastest kernel of the Cholesky factorization algorithm, we have also plotted the GEMM peak. This plot shows that the mixed bound is the tightest of all upper bounds, and we will therefore compare the performance of our experiments only with the mixed bound in the experiment section. The performance of the constraint programming bound described in section 2.4.2 will be discussed in section 2.5.3.3.
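As an aside, the critical path bound itself is straightforward to compute: it is the longest path in the DAG when each task is weighted by its fastest execution time over all resource types. A toy sketch, with a hypothetical four-task graph and arbitrary timings:

```python
# Critical-path bound: heaviest path through the DAG, each task weighted
# by its fastest execution time. Graph and timings are illustrative.
from functools import lru_cache

succ = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}   # hypothetical DAG
fastest = {"A": 4, "B": 2, "C": 2, "D": 4}                  # min over resource types

@lru_cache(maxsize=None)
def longest_path_from(i):
    # length of the heaviest path starting at task i
    return fastest[i] + max((longest_path_from(j) for j in succ[i]), default=0)

critical_path_bound = max(longest_path_from(i) for i in succ)
print(critical_path_bound)  # 4 + 2 + 4 = 10
```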


2.5 Experiments and Results

2.5.1 Schedulers

We have experimented with a few schedulers of StarPU, which are representative of state-of-the-art heuristics. The random scheduler assigns tasks randomly over all the computation resources. It uses an estimation of the relative performance of the resources to balance the randomness, so that GPUs will be assigned more tasks, according to their average acceleration ratio. This is thus representative of classical partitioning heuristics, which take into account the heterogeneity of the platform, but not the heterogeneity of the tasks. The dmda (deque model data aware) and dmdas (deque model data aware sorted) schedulers use the minimum completion time heuristic to assign tasks to computational resources: each task is assigned to the processing resource which is estimated to complete it first, taking into account both the estimated computation time on the target resource and the possibly required data transfer time. The difference between dmda and dmdas is that dmdas schedules tasks in order of their priorities, thus making it representative of the state-of-the-art HEFT heuristic [104, 14]. The Cholesky factorization is a structured application, so we can estimate some extra information in advance by analyzing the task graph with the help of different tools. This information could be an exact schedule, priorities for some specific tasks, the mapping of some tasks onto a particular worker or resource type, etc. In the following sections, we inject more or less of this extra information as static knowledge, to influence the scheduling decisions and get better performance.
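The core of the dmda decision rule can be illustrated in a few lines. The sketch below keeps a next-free-time estimate per worker and assigns each ready task to the worker minimizing its estimated completion time; the worker set and per-kernel costs are hypothetical, and the data-transfer term that real dmda adds to the estimate is omitted for brevity.

```python
# Toy sketch of the minimum-completion-time rule behind dmda/dmdas.
# Worker speeds and task costs are illustrative placeholders.
workers = {"cpu0": 0.0, "cpu1": 0.0, "gpu0": 0.0}   # next free time per worker
cost = {("gemm", "cpu"): 110.0, ("gemm", "gpu"): 3.8,
        ("potrf", "cpu"): 20.0, ("potrf", "gpu"): 10.0}

def assign(task, ready_time):
    # pick the worker minimizing the estimated completion time of `task`
    def completion(w):
        kind = "gpu" if w.startswith("gpu") else "cpu"
        return max(workers[w], ready_time) + cost[task, kind]
    best = min(workers, key=completion)
    workers[best] = completion(best)
    return best, workers[best]

print(assign("potrf", 0.0))   # ('gpu0', 10.0): the GPU finishes first
print(assign("gemm", 0.0))    # ('gpu0', 13.8): GEMM strongly prefers the GPU
```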

2.5.2 Experimental Setup

We have used a machine called Mirage to run and simulate our experiments. It has two hexa-core Westmere Intel® Xeon® X5650 processors and 3 Nvidia Tesla M2070 GPUs. In actual executions, we used only 9 CPU cores of the Mirage machine, so that the remaining 3 CPU cores can be used to fully exploit the critical resource (the GPUs) of the system. To make the performance comparable, we stick to 9 CPU cores in all of our experiments.

2.5.3 Results

We have divided our experiments into two categories based on the type of configuration used. The first one is the homogeneous category, where we run and simulate the performance behavior with 9 homogeneous CPU cores; the second one is the heterogeneous category, where we use 9 CPU cores and 3 GPUs to run the tasks. Past literature reports that maximum performance in the heterogeneous case is obtained with a tile size of 960 [7], which is why we kept this tile size throughout all our experiments. For actual executions, we provide the average and standard deviation of 10 runs in the plots. In simulation mode, results are deterministic for all schedulers except the random scheduler, which relies on random allocation choices. The simulated plots therefore provide average and standard deviation values of 10 simulations with various seeds for the random scheduler.

2.5.3.1 Homogeneous case

For the homogeneous case, we provide the results of real execution runs of the Cholesky factorization with the three different StarPU schedulers: random, dmda and dmdas.


Figure 2.3: Homogeneous actual performance

From Figure 2.3, it is clear that the random scheduler does not perform well. This happens because it does not take into account the already assigned workload of the workers, and just selects a worker among all workers with equal probability. This shows that the scheduler needs to take scheduling decisions in some smarter way. The other two schedulers, which are based on data-aware and earliest-finish-time strategies, perform much better than the random scheduler. Figure 2.3 also shows that dmdas slightly under-performs compared to dmda for smaller numbers of tiles. This is due to the fact that dmdas is biased towards the longest path (the path with more work) and chooses some tasks in the beginning which do not generate a sufficient level of parallelism. But as time progresses, dmdas starts choosing tasks which release a higher number of tasks, because these are the critical ones, which improves the overall performance of the execution. We are also interested in knowing the upper bound on performance, in order to determine how far these results are from that bound with the different schedulers. Since actual executions add some runtime overhead that affects performance, we compared the bound with simulated performance to mitigate this overhead. Figure 2.4 shows that the behavior is very similar to the original execution, with a slight increase in performance, since the communication cost was removed from the system. It also shows that the gap between the mixed bound and achieved performance is significant for small matrices.

Table 2.1: GPU relative performance

         POTRF   TRSM   SYRK   GEMM
GPU/CPU  ≃2×     ≃11×   ≃26×   ≃29×



Figure 2.4: Homogeneous simulated performance

2.5.3.2 Heterogeneous case

In this subsection, we consider all the processing units of the Mirage machine: 9 CPU cores and 3 GPUs are used for the execution of tasks, while the remaining 3 CPU cores are used as drivers for the 3 GPUs. Table 2.1 shows the GPU performance for each kernel relative to the CPU performance, e.g., GEMM is 29 times faster on GPU than on CPU. We divide our work into two parts. In the first part, we consider the impact of the heterogeneity of resources by considering a heterogeneous platform with related performance. More specifically, we designed a fictitious hardware configuration, where the execution time of each kernel on GPU is exactly K times faster than the CPU execution time; we call this case the heterogeneous related case. The common acceleration factor K is an average computed as follows:

$$K = \frac{N_P \cdot a_P + N_T \cdot a_T + N_S \cdot a_S + N_G \cdot a_G}{\text{Total Number of Tasks}}$$

where $N_P$, $N_T$, $N_S$ and $N_G$ are the total numbers of POTRF, TRSM, SYRK and GEMM tasks, and $a_P$, $a_T$, $a_S$ and $a_G$ are the acceleration factors of POTRF, TRSM, SYRK and GEMM on GPU, respectively.
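A small helper makes this computation explicit, using the standard task counts of the tiled Cholesky DAG ($N_P = n$, $N_T = N_S = n(n-1)/2$, $N_G = n(n-1)(n-2)/6$ for n × n tiles) and the acceleration factors of Table 2.1:

```python
# Common acceleration factor K for an n x n tile Cholesky.
# Per-kernel acceleration factors are those of Table 2.1.
def acceleration_factor(n, a_P=2.0, a_T=11.0, a_S=26.0, a_G=29.0):
    N_P = n                              # POTRF tasks
    N_T = n * (n - 1) // 2               # TRSM tasks
    N_S = n * (n - 1) // 2               # SYRK tasks
    N_G = n * (n - 1) * (n - 2) // 6     # GEMM tasks
    total = N_P + N_T + N_S + N_G
    return (N_P * a_P + N_T * a_T + N_S * a_S + N_G * a_G) / total

for n in (4, 16, 48):
    print(n, round(acceleration_factor(n), 1))   # 17.3, 25.4, 27.7
# K grows with n, since GEMM tasks (the most accelerated) dominate large graphs.
```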

Here, the acceleration factor depends on the numbers of tasks, which in turn depend on the number of tiles. Therefore, we get different acceleration factors with different numbers of tiles. In the second part of our work, we show the performance achieved on the actual hardware with the help of both actual and simulated executions; we call this case the heterogeneous unrelated case.

We use the mixed bound (as explained in Section 2.4.1) to compare the performance. The bounds do not take communication constraints into account. Therefore, to be fair in the comparison, we have used the simulated performance, where communication costs have been removed by modifying the platform file of our machine (one of the convenient features of the Simgrid version of the StarPU runtime system).

Heterogeneous related case. Figure 2.5 shows the simulated performance with the different schedulers on the fictitious heterogeneous platform. Here, we can see that the random scheduler performs very poorly because it assigns tasks randomly to the workers without knowing their already assigned workload, which limits the number of ready tasks in the system and introduces significant idle time on our critical resource (the GPUs). We have also computed the mixed bound for this fictitious platform. The difference between simulated performance and the mixed bound is once again significant for small and medium size matrices.


Figure 2.5: Heterogeneous related simulated performance

Heterogeneous unrelated case. First we compare the performance of the different schedulers in actual execution, and then the simulated performance against the mixed bound. As shown in Figure 2.6, in actual executions, the random scheduler does not perform well because it does not take data movement into account while making scheduling decisions: it assigns a worker randomly for each task, which may select different resource types for data-dependent tasks and result in a lot of data movement from CPU memory to GPU memory and vice-versa. In addition, it does not take the affinity of tasks to resources into account (e.g., GEMM and SYRK are more suitable for execution on GPU), which degrades the overall performance of the system. The other two schedulers perform comparatively better than the random scheduler because they take data transfers into account when assessing completion time in their HEFT-like scheduling strategies. Here we can also see that dmda outperforms dmdas for smaller matrices, for the same reason as in the homogeneous case (choosing the critical tasks versus tasks which generate a high level of parallelism).



Figure 2.6: Heterogeneous unrelated actual performance

We are now again interested in determining how far we are from the peak performance of the application. Thus, we performed the simulation with different numbers of tiles. Figure 2.7 illustrates the comparison between bounds and achieved performance in simulation. Here we can also see that the performance difference between the best scheduler and the mixed bound is significant for small and medium size matrices.


Figure 2.7: Heterogeneous unrelated simulated performance

Comparison between heterogeneous related and unrelated cases. In order to determine the impact of the heterogeneity of task speed-ups on performance, we present a comparison between related and unrelated heterogeneous simulations. To this end, we scaled the mixed bound of the related case such that it perfectly matches the mixed bound of the unrelated case, and also scaled all the performance values of the related case by the same factor. The obtained results are given in Figure 2.8, which can now be compared with the unrelated case of Figure 2.7.


Figure 2.8: Heterogeneous related simulated scaled performance

Here we can see that unrelated speed-ups make the problem harder: this is why the gap between the performance of state-of-the-art schedulers and the mixed bound is larger in Figure 2.7 than in Figure 2.8. It is also clear that there is room for improvement for small and medium size matrices in the heterogeneous case.

2.5.3.3 Scheduling with static knowledge

The significant gap between the performance of the StarPU schedulers and the theoretical bound (the mixed bound) for small and medium size matrices in Figure 2.7 highlights the following possibilities:

• Either the dynamic schedulers of StarPU return a schedule that can be improved for small matrices;

• Or the theoretical bound is not tight enough;

• Or both.

Indeed, the dmda and dmdas schedulers take only dynamic decisions to map the ready tasks onto the processors, depending on the state of resources and estimations of execution and communication times (plus priorities among ready tasks in dmdas), without taking into account the overall task graph. These local choices may lead to bad decisions when the parallelism in the task graph is limited. We did some experiments to improve the overall performance (in simulation mode) with static information in the heterogeneous unrelated case. Since the GEMM and SYRK kernels are well suited for execution on GPUs, we enforced the execution of these kernels on GPUs as static information given to the StarPU runtime system. This strategy improves the overall performance, as shown in Figure 2.10.



Figure 2.10: Heterogeneous unrelated simulated performance with static knowledge

We also tried to inject the CP schedule in actual execution for small matrices; however, we did not get the same performance improvement as in simulation. The CP formulation indeed does not account for data transfers since, as described in Section 2.4.2, solving a CP with data transfers has proven intractable for the purpose at stake. Actual execution with the CP schedule thus adds a lot of idle time on resources during data transfers, and consequently does not reproduce in actual execution the performance observed in simulation. The simulated execution has however allowed us to show, at least in the case without data transfers, that some heuristics get relatively close to an achievable CP solution.


Figure 2.11: Heterogeneous actual performance with static knowledge

2.6 Discussion

2.6.1 dmda vs dmdas scheduler

We were expecting that dmdas would always perform better than the dmda scheduler, because it also takes the HEFT priorities into account while making scheduling decisions. Nevertheless, we found a few cases where dmda outperforms dmdas. We investigated the trace files generated with the dmda and dmdas schedulers in order to determine the reasons for this behavior, and we found that dmdas puts emphasis on the critical path rather than on parallelism, since it selects some tasks in the beginning which are critical but do not generate a sufficient level of parallelism. That introduces some idle time on the critical resource (the GPUs) and degrades the overall performance of the system, which is a known defect of the HEFT scheduler in general. Figure 2.12 and Figure 2.13 show traces with the dmda and dmdas schedulers.


Figure 2.12: GPU trace for 8 × 8 tiles with dmda scheduler


Figure 2.13: GPU trace for 8 × 8 tiles with dmdas scheduler

2.6.2 Mapping from Constraint Programming solution

We did some experiments by injecting only the mapping information (i.e., only the CPU/GPU assignment, not the exact task order) of the feasible solution statically obtained by constraint programming, and letting the scheduler decide the precise ordering and worker dynamically. This extra information about resource allocation did not improve the performance of the system compared to the performance obtained by the dmda and dmdas schedulers, which indicates that the quality of the feasible solution is highly dependent on the precise ordering chosen by constraint programming. This shows that the heuristics required to achieve this performance are probably very complex, probably even beyond mere backfilling.

2.6.3 Constraint Programming schedule in actual execution

We did some experiments by injecting the schedule obtained by constraint programming into actual execution for smaller matrices, but the performance improvement was not significant compared to the state-of-the-art schedulers. After looking into the traces, we found that resources are idle for a significant portion of time during data transfers. One prominent way to minimize this idle time is to prefetch data in computational order, but prefetching very early also adds significant idle time on resources. Consider two data-dependent tasks (the second task depends on data of the first task) scheduled on two different workers with different memories: after the execution of the first task, the second task becomes ready and initiates its data transfer request, but this request can be served only when all the data transfers already initiated by the second worker (some of which may correspond to tasks which will be executed much later) are completed. One heuristic to minimize idle time on resources is to use limited prefetching, but even this strategy will not solve the problem completely. Since the performance of the CP schedule is highly dependent on the task order of the whole schedule, adding idle time on one of the resources may create idle time on other resources as well and degrade the overall performance dramatically.

2.7 Conclusion

In this work, we have bridged the gap between theoretical performance bounds and actually achieved performance on the dense Cholesky factorization. On the former side, we have proposed improved bounds which take into account both resource and task heterogeneity, as well as critical paths. On the latter side, we have introduced some static information into the dynamic task scheduler of StarPU, which brought the performance closer to the theoretical bounds, and very close to what a statically-optimized schedule can achieve. We have also shown that the performance achieved by such a statically-optimized schedule depends on a precise, non-intuitive task ordering, which thus cannot be reached by simple list-scheduling heuristics, even with backfilling. We plan to verify these results on other hardware platforms, and to apply the same methodology to other dense linear algebra algorithms, but we also plan to try other classes of applications, notably less regular applications such as sparse linear algebra or FMM. More generally, this work opens a bridge to closer interaction between applications and task schedulers. We have shown that while generic heuristics such as HEFT achieve very good performance, application-specific scheduling hints can noticeably improve performance. We aim at generalizing and formalizing this kind of information, so that scheduling experts can easily analyze achieved performance, optimize the schedule statically, and inject more or less application-specific scheduling hints into the scheduler, such as "this proportion of TRSM tasks should be run on CPUs" or "these specific TRSM tasks should be run on CPUs". The code produced for the purpose of this study will be integrated into the StarPU runtime system and the Chameleon dense linear algebra library.

Part II

Scheduling Under Memory Constraints


Chapter 3

Memory-aware list scheduling for hybrid platforms

In the first part of this thesis, we took two courses of action to improve the performance of linear solvers. We first designed a new hybrid LU-QR factorization suited for distributed platforms. This algorithm was implemented in the PARSEC framework, relying on its dynamic scheduler to take care of resource management. We then focused on the dynamic scheduler itself by investigating how adding static information to the StarPU dynamic scheduler can improve its performance. We saw in Chapter 2 that the StarPU runtime system uses the HEFT strategy by default. Although the classical HEFT heuristic performs well on heterogeneous resources in terms of completion time, it does not take memory consumption into account. In this chapter, we dig further into the scheduling module of runtime systems by designing new memory-aware alternatives to the classical HEFT heuristic. It is indeed important to minimize the maximum memory usage during a computation to avoid out-of-core memory accesses. Much work has been done on minimizing the memory consumption while traversing a tree-shaped or a general task graph workflow whose tasks require large I/O files, and many results are available in the literature when a single memory is available. But most of today's machines are heterogeneous, as we saw in the previous chapter: multicores are associated with a dedicated accelerator, such as an FPGA or a GPU, and these computational units work on their own dedicated memory. In this chapter, we consider the problem of scheduling workflows on dual-memory clusters, where two independent memories are available. The amount of memory of each type used at a given execution step strongly depends upon the ordering in which the tasks are executed, and upon when communications between both memories are scheduled. At first, we assess the complexity of this problem by focusing on the simple case where the task graph is tree-shaped and the mapping of the tasks onto the resources is already given. We establish the complexity of this two-memory scheduling problem, and provide inapproximability results. We then consider the general problem of scheduling an arbitrary workflow where each task can be mapped onto either resource and has a different processing time for each. The memory-aware heuristics provided in this chapter outperform the reference heuristics HEFT and MINMIN used in runtime systems on a wide variety of problem instances.

3.1 Introduction

Modern computing platforms are heterogeneous: a typical node is composed of a multi-core processor equipped with a dedicated accelerator, such as an FPGA or a GPU. These two computational units (cores and accelerator) are strongly heterogeneous. To complicate matters, each unit comes with its dedicated memory. Altogether, such an architecture with two computational resources and two memory

types, which we call a dual-memory system hereafter, leads to new challenges when scheduling scientific workflows on such platforms. The nodes of the workflow correspond to tasks, and the edges correspond to the dependencies among the tasks. The dependencies are in the form of input and output files: each node accepts a (potentially large) file as input, and produces a set of files, each of them to be processed by a different successor. We consider in this chapter that we have two different processing units at our disposal, such as a CPU and a GPU. For the sake of generality, we designate them by a color (namely blue and red). To execute a task on a given resource, the input file and all the output files of the task must fit within the corresponding memory. As the workflow is traversed, tasks are processed on different memories, and capacity constraints on both memory types must be met. In addition, when a task is executed on a different memory than its predecessor, say for example a task executed on the blue memory has a successor executed on the red memory, a communication from the blue memory to the red memory must be scheduled before the successor can be processed (and again, the input file and all output files of this successor must fit within the red memory). All these constraints require us to carefully orchestrate the scheduling of the tasks, as well as the communications between memories, in order to minimize the maximum amount of each memory that is needed throughout the traversal. In the first part of this chapter (Section 3.4), we consider the restricted problem where the workflow is tree-shaped and each task in the workflow is best suited to a given resource type (say a core or a GPU), and is colored accordingly. In this context, the objective is to determine an efficient traversal that minimizes the maximum amount of memory of each type needed to traverse the whole tree. This work mainly builds upon the pioneering work of Liu, who has studied tree traversals that minimize the peak amount of memory used on a homogeneous system, hence with a single memory type. Liu first restricted his study to depth-first traversals in [80], before providing an optimal algorithm for arbitrary traversals in [81]. In many situations, the optimal traversal is a depth-first traversal, but this is not always the case. An assessment of the relative performance of depth-first traversals versus optimal traversals is proposed in [74]. The main objective of this part is to extend these results to colored trees with two memory types, and tasks belonging to a given type. Clearly, the traversal, i.e., the order chosen to execute the tasks and to perform the communications, plays a key role in determining the amount of each memory needed for a successful execution of the whole tree. The interplay between both memories dramatically complicates the scheduling: it is no surprise that the complexity of the problem, which was polynomial with a unique memory, now becomes NP-complete. In this first part, we concentrate on this difficult bi-criteria optimization problem, and derive several complexity results: NP-completeness of the problem, and inapproximability within a constant (α, β) factor pair of both absolute minimum memory amounts. Here the absolute minimum memory of a given type is computed when assuming an infinite amount of memory of the other type. We also provide a study of depth-first traversals and related variants.
We show how to extend Liu’s algorithm to compute the best depth-first traversal, which simultaneously minimizes both memory usages. However, while depth-first traversals were natural algorithms with a single memory, they severely constrain the acti- vation of communication nodes with two memories. We show that the optimization problem is still NP-complete when relaxing the firing of communication nodes in depth-first traversal, which leads us to go beyond depth-first traversals and to introduce general heuristics. These heuristics extends Liu’s optimal algorithm along various (greedy) decision criteria to trade-off the usage of both memory types. Finally, we assess the performance of all these heuristics using both randomly generated trees, and actual elimination trees that arise from the multifrontal factorization of sparse linear systems. In the second part of this chapter (Section 3.5), we study a more realistic model with several compli- cations: (i) tasks are not pre-assigned but can be dynamically assigned to either resource; (ii) task graphs are general DAGs rather then trees; and (iii) one aims at optimizing total execution time (or makespan) while minimizing memory usage. Here, the objective is makespan minimization, while enforcing that 3.2. RELATED WORK 53 memory capacities of each type are not exceeded. Given the negative results derived in the first part, there is little hope to design approximation algorithms. We lower our ambition and aim at designing effi- cient heuristics for this problem, which we validate through an extensive set of simulations for a variety of scientific benchmarks. However, one major theory-oriented contribution of this section is the deriva- tion of an Integer Linear Program (ILP) formulation for the general problem. This linear program turns out very intricate, due to expressing all constraints related to memory usage, and it has a large number of variables and constraints. Still, it enables us to determine the optimal solution for small-size problems, up to 30 tasks, and thereby to assess the optimal performance of our heuristics for small instances. HEFT[103] is widely used for scheduling scientific workflows on heterogeneous resources. It is an extension of critical-path list-scheduling that schedules the current ready task on the resource that will complete its execution as soon as possible (given already taken scheduling decisions). By considering task completion instead of task initiation, HEFT is able to take CPU speed heterogeneity into account. However, HEFT has no provision to optimize memory usage, even for a single-memory system, and a fortiori for a dual-memory one. Another main contribution of this chapter is to introduce a memory- aware variant of HEFT for dual-memory systems. Similarly, we design a memory-aware variant of MINMIN [25], another reference heuristic for DAGs where the next task to be executed is selected dynamically (rather than according to some static criteria as in HEFT): MINMIN picks the ready task which has the smallest completion time and executes it on the best available processor. The rest of the chapter is organized as follows. We start with a brief overview of related work in Section 3.2. Then we detail the model and framework in Section 3.3. Section 3.4 is devoted to the complexity study of the restricted problem where the workflow is tree-shaped and the task are initially colored. We also study of depth-first traversals, a first class of (widely-used) heuristics, and introduce additional heuristics. 
In Section 3.5, we tackle the general problem, by expressing an optimal schedule as the solution of a complex ILP, and by introducing new heuristics. Finally, we provide concluding remarks in Section 3.6.

3.2 Related work

3.2.1 Task graph scheduling

Computations with dependencies are naturally modeled through task graphs, where nodes represent computational tasks and edges represent dependencies. Task graph scheduling has been the subject of a wide literature, ranging from theoretical studies to practical ones. On the theoretical side, the most used techniques are list scheduling [33], clustering [86], and task duplication [9]. On the practical side, task graphs have been widely used to model complex workflows in grid computing [48]. Scheduling task graphs on grids is the subject of a wide literature, and many tools exist to manage and schedule such workflows, such as MOTEUR [55]. These tools usually include scheduling heuristics to map workflow tasks onto available resources. These heuristics were often inherited from the task graph scheduling literature, and were more or less adapted to cope with the intrinsic heterogeneity of grid environments. The most famous task graph scheduling algorithm for grids and heterogeneous platforms is HEFT [103], which we use and adapt to our dual-memory context.

3.2.2 Pebble game and its variants

On the more theoretical side, this work builds upon the many papers that have addressed the pebble game and its variants. Scheduling a graph on one processor with a minimal amount of memory amounts to revisiting the I/O pebble game with pebbles of arbitrary sizes that must be loaded into main memory before firing (executing) the task. The pioneering work of Sethi and Ullman [96] deals with a variant of the pebble game that translates into the simplest instance of the problem, with a unique memory and where all files have weight 1. The concern in [96] was to minimize the number of registers that are needed to compute an arithmetic expression. The problem of determining whether a general DAG can be traversed with a given number of pebbles has been shown NP-hard by Sethi [95] if no vertex is pebbled more than once (the general problem allowing recomputation, that is, re-pebbling a vertex which has been pebbled before, has been proven PSPACE-complete [52]). However, this problem has polynomial complexity for tree-shaped graphs [96]. Recently, still in the context of a single memory type, an extension of these results to parallel machines has been proposed in [83].

3.2.3 Scheduling with memory constraints

The problem of scheduling a task graph under memory constraints appears in the processing of scientific workflows whose tasks require large I/O files. Such workflows arise in many scientific fields, such as image processing, genomics or geophysical simulations. The problem of task graphs handling large data has been identified in [91], which proposes some simple heuristic solutions. Most existing theoretical studies are restricted to tree-shaped task graphs, which arise in some application domains such as the factorization of sparse matrices using the multifrontal method [81, 80]. We refer the interested reader to our recent paper [65] for an extended bibliography on adding memory constraints to the problem of scheduling tree-shaped task graphs.

3.2.4 Hybrid computing

Hybrid computing consists in the simultaneous use of CPUs and GPUs to optimize performance for high performance computing. Since CPUs and GPUs are powerful for specific and different tasks, it is natural to schedule tasks on their "favorite" resource, that is, the resource where their execution time is minimal. This has successfully been used to increase performance in linear algebra libraries [8, 70]. There also exist software tools that schedule an application composed of tasks with both CPU and GPU implementations on hybrid platforms: for instance, StarPU [15] optimizes the execution time of an application by scheduling its tasks on multiple kinds of resources, based on predictions of execution and data transfer times.

3.3 Model and framework

As stated above, we deal with general task graph traversals on a dual-memory system, where each task can be executed on either of the two processing units, that is, with its associated data in either memory. Dependencies are in the form of input and output files: each task accepts a set of files as input from each of its predecessors in the DAG, and produces a set of files to be consumed by each successor node. We start this section by formally writing all the constraints that need to be satisfied during a traversal, and we state the target optimization problem in Section 3.3.3.

3.3.1 Flow and resource constraints

We consider, in this chapter, a dual-memory heterogeneous platform with $P_1$ identical processors which share the first memory and $P_2$ identical processors which share the second memory. For clarity, in the rest of the chapter, the first memory will be referred to as the blue memory and the $P_1$ processors sharing it will be called the blue processors. Similarly, the second memory and its processors will be associated with the color red, as depicted in Figure 3.1.


Figure 3.1: Description of the dual-memory platform.

The application is described by a Directed Acyclic Graph D = (V, E) composed of |V| = n nodes, or tasks, numbered from 1 to n. We let Succ(i) = {j ∈ V s.t. (i, j) ∈ E} denote the set of successors of i and Pred(i) = {j ∈ V s.t. (j, i) ∈ E} denote the set of predecessors of i. Dependencies imply a topological order, where a node has to be processed before its successors. Here are some definitions:

• Each task i in the DAG requires a processing time of $W_i^{(1)}$ on one of the blue processors and a processing time of $W_i^{(2)}$ on one of the red processors.

• Each communication (i, j) ∈ E is instantaneous if nodes i and j are executed on processors that belong to the same memory. Otherwise, the file produced by node i and needed as input by node j has to be sent from one memory to the other. This transfer takes $C_{i,j}$ time units.

For example, consider the toy example DAG $D_{ex}$ depicted in Figure 3.2. Task $T_1$ can be processed in $W_1^{(1)} = 3$ time units on a blue processor and in $W_1^{(2)} = 1$ time unit on a red processor. If tasks $T_1$ and $T_2$ are not executed on the same memory, the communication $(T_1, T_2)$ will take $C_{1,2} = 1$ time unit to be processed. We point out that all communication times are set arbitrarily to 1 in this example (e.g., to account for a high start-up cost). Of course, an affine formula (such as $C_{i,j} = \alpha + \beta F_{i,j}$), or even arbitrary values, can be used in the model.

[Figure 3.2 shows the four-task DAG $D_{ex}$: edges $T_1 \to T_2$, $T_1 \to T_3$, $T_2 \to T_4$, $T_3 \to T_4$; processing times $W_1^{(1)}=3$, $W_1^{(2)}=1$, $W_2^{(1)}=2$, $W_2^{(2)}=2$, $W_3^{(1)}=6$, $W_3^{(2)}=3$, $W_4^{(1)}=4$, $W_4^{(2)}=1$; file sizes $F_{1,2}=1$, $F_{1,3}=2$, $F_{2,4}=1$, $F_{3,4}=2$; and communication times $C_{i,j}=1$ on every edge.]

Figure 3.2: Description of Dex.

Given an application DAG, our goal is to determine where each task should be executed (the allocation) and at what time each task and communication may be started (the starting times). The allocation is described by a function proc : V → ⟦1, $P_1 + P_2$⟧ where, for all i ∈ V, proc(i) represents the index of the processor that processes task i; proc(i) ≤ $P_1$ denotes a blue processor while proc(i) > $P_1$ denotes a red processor. The starting times are expressed as two functions σ : V → $\mathbb{R}^+$ and τ : E → $\mathbb{R}^+$ where, for all i ∈ V, σ(i) represents the starting time of task i and, for all (i, j) ∈ E, τ(i, j) represents the starting time of communication (i, j). Let $W_i$ be the actual processing time of task i in the schedule s = (σ, τ, proc):

$$W_i = \begin{cases} W_i^{(1)} & \text{if } proc(i) \le P_1 \\ W_i^{(2)} & \text{otherwise} \end{cases}$$

We denote by $COMM_{i,j}$ the actual time taken by communication (i, j) in the schedule s = (σ, τ, proc):

$$COMM_{i,j} = \begin{cases} 0 & \text{if } proc(i) \le P_1 \text{ and } proc(j) \le P_1 \\ 0 & \text{if } proc(i) > P_1 \text{ and } proc(j) > P_1 \\ C_{i,j} & \text{otherwise} \end{cases}$$

A schedule s = (σ, τ, proc) of D is a valid schedule if it respects:

• flow dependencies, for all (i, j) ∈ E:
$$\sigma(i) + W_i \le \tau(i,j) \quad \text{and} \quad \tau(i,j) + COMM_{i,j} \le \sigma(j)$$

• resource constraints, for all (i, j) ∈ V² with i ≠ j:
$$proc(i) = proc(j) \implies \sigma(i) + W_i \le \sigma(j) \ \text{ or } \ \sigma(j) + W_j \le \sigma(i)$$

The makespan of the schedule is the finish time of the last task:
$$\text{Makespan} = \max_{i \in V} \left( \sigma(i) + W_i \right)$$

Back to the example $D_{ex}$: on a dual-memory platform with one blue processor and one red processor ($P_1 = P_2 = 1$), consider the following schedule $s_1$, depicted in Figure 3.3:

$$\sigma_1(T_1) = 0, \quad \sigma_1(T_2) = 2, \quad \sigma_1(T_3) = 1, \quad \sigma_1(T_4) = 5$$
$$\tau_1(T_1, T_2) = 1, \quad \tau_1(T_2, T_4) = 4$$
$$proc_1(T_1) = 2, \quad proc_1(T_2) = 1, \quad proc_1(T_3) = 2, \quad proc_1(T_4) = 2$$

Schedule $s_1 = (\sigma_1, \tau_1, proc_1)$ is a valid schedule for $D_{ex}$, with Makespan = 6.

Figure 3.3: Representation of schedule s1 for Dex.

3.3.2 Memory constraints

As stated above, in our model, the dependencies are in the form of input and output files. Each node i in the DAG has an input file of size $F_{j,i}$ for each j ∈ Pred(i). If i is not an input node, its input files are produced by its predecessors; if i is an input node, then Pred(i) = ∅ and its input files may be of null size, or it may receive input from the outside world. Each non-terminal node i in the DAG, when executed, produces a file of size $F_{i,j}$ for each j ∈ Succ(i). If i is a terminal node, then Succ(i) = ∅ and i produces a file of null size (we consider that terminal data produced by terminal nodes are directly sent to the outside world). During the processing of a task i on one of the processors, the memory on which this processor operates must contain all the input and output files. The amount of memory MemReq(i) that is needed for this processing is thus:

$$MemReq(i) = \sum_{j \in Pred(i)} F_{j,i} + \sum_{j \in Succ(i)} F_{i,j}$$

For instance, in $D_{ex}$, $MemReq(T_3) = F_{1,3} + F_{3,4} = 4$. Note that the memory needed for the execution of the task itself can easily be accounted for by adding a fictitious predecessor. After task i has been processed, its input files are discarded, while its output files are kept in memory until the processing of its successors. Thus, for a schedule s = (σ, τ, proc) of D, if a node i is processed by a blue processor, the actual amount of blue memory used to process the node i is:

$$BlueMemUsed(s, i) = \sum_{j \in Succ(i)} F_{i,j} + \sum_{e \in S_{blue}} F_e$$

where $S_{blue}$ denotes the set of files (represented by the edges of D) stored in the blue memory when the scheduler decides to execute task i. Note that $S_{blue}$ must contain the input files of task i. After the processing of node i, we have:

$$S_{blue} \leftarrow \bigl( S_{blue} \setminus \{(j,i),\ j \in Pred(i)\} \bigr) \cup \{(i,j),\ j \in Succ(i)\}$$

Of course, the same holds for RedMemUsed and $S_{red}$ if i happens to be processed by a red processor. Initially, the input file of the root is arbitrarily located in $S_{blue}$. Consider the schedule $s_1$ depicted in Figure 3.3. The execution of task $T_1$ uses $RedMemUsed(s_1, T_1) = F_{1,2} + F_{1,3} = 3$ units of red memory. The execution of task $T_2$ uses $BlueMemUsed(s_1, T_2) = F_{1,2} + F_{2,4} = 2$ units of blue memory. The execution of task $T_3$ uses $RedMemUsed(s_1, T_3) = F_{1,2} + F_{1,3} + F_{3,4} = 5$ units of red memory. And the execution of task $T_4$ uses $RedMemUsed(s_1, T_4) = F_{2,4} + F_{3,4} = 3$ units of red memory.

Each time there is a data dependence between two tasks assigned to different memories, the output file of the source task has to be loaded from one memory into the other. During the processing of the communication (i, j), both memories contain the file of size $F_{i,j}$ being copied. Thus, for instance, if i has been assigned to a blue processor and j has been assigned to a red processor, the amounts of blue and red memory needed for this processing are:

$$BlueMemReq(i,j) = F_{i,j}, \qquad RedMemReq(i,j) = F_{i,j}$$

After the communication has been processed, the input file from the blue memory is discarded, while the output file is kept in the red memory until the processing of j. Thus, for a schedule s = (σ, τ, proc) of D, the actual amounts of memory used to process the communication (i, j) are:

$$BlueMemUsed(s, (i,j)) = F_{i,j} + \sum_{e \in S_{blue} \setminus \{(i,j)\}} F_e$$
$$RedMemUsed(s, (i,j)) = F_{i,j} + \sum_{e \in S_{red}} F_e$$

Note that $S_{blue}$ must contain the file (i, j). After the processing of the communication (i, j), we have:

$$S_{blue} \leftarrow S_{blue} \setminus \{(i,j)\}, \qquad S_{red} \leftarrow S_{red} \cup \{(i,j)\}$$

It is important to state that communication (i, j) does not need to be fired right after the execution of task i. The only constraint is that the processing of communication (i, j) must follow the execution of i and precede the execution of j. This flexibility in the schedule severely complicates the search for efficient traversals.

3.3.3 Optimization problem

As stated above, we face an optimization problem under memory constraints. The memory peak is the maximum usage of each memory over the whole schedule s = (σ, τ, proc) of the DAG D, and is defined for the blue and the red memory by:

$$M^{s}_{blue}(D) = \max_i BlueMemUsed(s, i), \qquad M^{s}_{red}(D) = \max_i RedMemUsed(s, i)$$

In practical settings, the amount of memory at disposal is limited. Let $M^{(blue)}$ and $M^{(red)}$ denote the bounds on the blue and red memories. We aim at finding the optimal schedule $s_{opt}(M^{(blue)}, M^{(red)})$ of the DAG D, defined as the schedule with minimal makespan among all schedules s that do not require more memory than available, i.e., that enforce the bounds on memory peaks: $M^{s}_{blue}(D) \le M^{(blue)}$ and $M^{s}_{red}(D) \le M^{(red)}$. Back to the schedule $s_1$ described in Figure 3.3, assume a dual-memory platform with one blue processor and one red processor. We compute that $s_1$ uses $M^{s_1}_{blue}(D_{ex}) = 2$ units of blue memory and $M^{s_1}_{red}(D_{ex}) = 5$ units of red memory. If we set the memory bounds $M^{(blue)} = M^{(red)} = 5$, it is clear that $s_1$ is the optimal schedule. But if we set $M^{(blue)} = M^{(red)} = 4$, $s_1$ is no longer an acceptable schedule. In this case, the optimal schedule for $D_{ex}$ is $s_2$, the schedule depicted in Figure 3.4. Schedule $s_2$ has a smaller memory peak than $s_1$ but a larger Makespan = 7. This small example illustrates the necessary tradeoff between memory and makespan.

3.4 Tree traversal with pre-assigned tasks

In this section, we focus on the restricted problem where the workflow is tree-shaped and each task is pre-assigned onto a resource. Determining a memory-efficient tree traversal is very important in sparse numerical linear algebra. The elimination tree is a graph-theoretical model that represents the storage requirements, and computational dependencies and requirements, in the Cholesky and LU factorization of sparse matrices. Memory minimization is still a concern in modern multifrontal solvers when dealing with large matrices. In this section, we want to assess the complexity of minimizing the maximum memory usage when traversing a tree, without makespan consideration. Thus, we consider one resource of each type, and the mapping of the tasks onto the resources is already given. For convenience, we will talk about colored trees where each task has the color of the resource it is assigned to. This coloring thus enforces the mapping proc of the schedule s = (σ, τ, proc), and the optimization problem becomes finding the starting time of every task and communication such that the memory peaks are minimized. It is interesting to note that there is a complete equivalence between top-down traversals of out-trees (the problem addressed in this section) and bottom-up traversals of in-trees (as used in sparse matrix factorization). In a nutshell, one only needs to reverse the direction of the edges, and to execute the schedule backwards, to move from one variant to the other (this equivalence has been formally proven in [74] for single-memory platforms, and it is straightforward to extend the proof to two-memory systems). In fact, the literature deals with both variants. The seminal paper of Liu [80] originally deals with post-order bottom-up traversals for in-trees, while we speak of depth-first top-down traversals for out-trees in this section, but there is no actual difference. Complexity results for the bi-objective problem are derived in Section 3.4.1. Section 3.4.2 is devoted to the study of depth-first traversals, a first class of (widely-used) heuristics. Then we introduce additional heuristics in Section 3.4.3. The experimental evaluation of all the heuristics is conducted in Section 3.4.4.
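The out-tree/in-tree equivalence is literally a matter of reversing edges and reading the schedule backwards, as the following one-line sketch (with a hypothetical edge-list encoding) illustrates:

```python
# Out-tree/in-tree equivalence: reverse every edge and execute the
# traversal backwards. The edge-list encoding is an illustrative choice.
def reverse_instance(edges, traversal):
    """edges: list of (i, j) pairs; traversal: task order for the out-tree.
    Returns the in-tree instance and the corresponding bottom-up order."""
    return [(j, i) for (i, j) in edges], list(reversed(traversal))
```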

3.4.1 Complexity results on trees

This section presents several important complexity results. We start with the NP-completeness of the two-memory minimization problem on colored trees in Section 3.4.1.1. Next we show in Section 3.4.1.2 that the problem reduces to traversing uncolored trees when one memory is unbounded. Finally, we prove in Section 3.4.1.3 that it is impossible to approximate both minimum memories within arbitrary constant factors.


3.4.1.1 Hardness of the problem

Our first result assesses the complexity of the problem, as formulated in the following definition.

Definition 3.1 (TWOMEMORYTRAVERSAL). Given a tree T with n nodes, and two fixed memory amounts $M_{red}$ and $M_{blue}$, does there exist a traversal σ of the tree such that $M^{\sigma}_{blue}(T) \le M_{blue}$ and $M^{\sigma}_{red}(T) \le M_{red}$?

Theorem 3.1. The TWOMEMORYTRAVERSAL problem is NP-complete.

Proof. The problem clearly belongs to NP, and the certificate is the ordered list of tasks (of both colors, and including uncolored communication nodes) executed by the schedule; it is easy to maintain the amount of each memory required by the schedule, and to check that neither $M_{red}$ nor $M_{blue}$ is exceeded. To establish the completeness, we use a reduction from the 2-Partition problem [49]. Consider an instance $Inst_1$ of the 2-Partition problem, with n integers $\{a_1, a_2, \ldots, a_n\}$ such that $\sum_{i=1}^{n} a_i = S$. Consider an instance $Inst_2$ of TWOMEMORYTRAVERSAL, consisting of the tree depicted in Figure 3.5. We set the bounds $M_{red} = 3S$ for the red memory and $M_{blue} = 2S$ for the blue memory. The construction of $Inst_2$ is polynomial in the size of $Inst_1$. Assume first that $Inst_2$ has a solution. Any traversal must start with the root $B_{root}$. After it has been processed, 2S units of the blue memory are occupied, which means that it is full. Without loss of generality (by symmetry), assume that C is the next node to be executed. Then, we observe that if $C^{(2)}$ were the third executed node, we could never process $R_{root}$ nor $R^{(2)}_{root}$ without violating the $M_{red}$ bound for the red memory. Thus, the third executed node has to be $R_{root}$.

• We observe that the red tasks $R_{big}$ and $R_{free}$, and each communication task $C_i$, all have to be processed before $R^{(2)}_{root}$: otherwise, since the execution of $R^{(2)}_{root}$ requires 3S units of memory, the $M_{red}$ bound for the red memory would be violated. Besides, since we cannot execute $R^{(2)}_{root}$ before $R_{big}$, if $C^{(2)}$ were processed before $R_{big}$, there would be at least S units of occupied red memory, and the execution of $R_{big}$ (which requires $\frac{5}{2}S$ units of the red memory) would violate the $M_{red}$ bound. Thus, the node $R_{big}$ has to be processed before $C^{(2)}$.

• Besides, let $i_0$ be the index of the first processed task $B_i$ in the traversal. Its execution requires $a_{i_0} + \frac{3}{2}S$ units of the blue memory, which implies that it cannot be processed before $C^{(2)}$ without violating the $M_{blue}$ bound for the blue memory. Thus, the node $C^{(2)}$ has to be processed before $B_{i_0}$.

According to the previous arguments, the only tasks that can be processed right after $R_{root}$ and before $R_{big}$ are the communication tasks $C_i$. Let I be the set of the indices of the tasks $C_i$ executed after $R_{root}$ and before $R_{big}$.

→ After the execution of $R_{root}$, there are 2S units occupied in the red memory and S units in the blue memory. Thus, to execute $R_{big}$ without violating the $M_{red}$ bound, the amount of red memory to free is at least $\frac{S}{2}$. This means that $\sum_{i \in I} a_i \ge \frac{S}{2}$.

→ Besides, if $\sum_{i \in I} a_i > \frac{S}{2}$, the execution of $B_{i_0}$ (which requires at least $\sum_{i \in I} a_i + \frac{3}{2}S$ units of the blue memory) would violate the $M_{blue}$ bound.


Figure 3.5: Tree used in the proof of Theorem 3.1

Thus, $\sum_{i \in I} a_i = \frac{S}{2}$, which implies that $Inst_1$ has a solution. Suppose now that $Inst_1$ has a solution I. According to the previous reasoning, the sequence of nodes $B_{root}$; C; $R_{root}$; $C_i$ for all i ∈ I; $R_{big}$ and $R_{free}$ can be executed without violating the bounds on memories. After this sequence, there are $\frac{3}{2}S$ units occupied in the blue memory and the red one is empty. The node $C^{(2)}$ can be processed to load S units from the blue memory to the red one. Now, one of the blue nodes $B_{i_0}$ with $i_0 \in I$ can be executed without violating the $M_{blue}$ bound, followed by $B'_{i_0}$. Moreover, we can process every $B_i$ and $B'_i$ for all i ∈ I to free the blue memory. Then, it is possible to execute every branch from $C_i$ down to $B'_i$ for all i ∉ I. From this point on, we can process the sub-tree rooted at the node $R^{(2)}_{root}$ using the same pattern, which means that $Inst_2$ has a solution, and concludes the proof. □

3.4.1.2 When one memory is unbounded

In this section, we focus on the computation of $M^{opt}_{red}(T)$ (or $M^{opt}_{blue}(T)$), which represents the minimal peak memory reachable when there is no constraint on the other memory. We show that the computation of $M^{opt}_{red}(T)$ and $M^{opt}_{blue}(T)$ for a bi-colored tree T reduces to the computation of the minimal peak memory for an uncolored tree.

Definition 3.2. Given a bi-colored tree T, we construct the corresponding uncolored (or, for convenience, single-colored) tree $T_{blue}$ by turning every communication node and red node into a blue node, and by turning every red edge of weight $f_i$ into a blue edge of weight 0, as depicted in Figure 3.6. We construct the single-colored tree $T_{red}$ in a similar way. We let $M^{\infty}_{blue}$ denote the minimal amount of memory needed to process $T_{blue}$ (and similarly, $M^{\infty}_{red}$ for $T_{red}$). The following result is straightforward.


Figure 3.6: A bi-colored tree T and its corresponding single-colored trees $T_{blue}$ and $T_{red}$ from Definition 3.2.
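The transformation of Definition 3.2 is purely local and can be sketched in a few lines; the tree encoding below (each node carrying the weight and color of its incoming edge) and the helper name are illustrative assumptions:

```python
# Sketch of the Definition 3.2 transformation: T_blue keeps the shape,
# zeroes the weight of every red edge, and recolors every node blue.
def build_T_blue(f, edge_color):
    """f: dict node -> incoming file weight; edge_color: node -> 'blue'/'red'."""
    f_blue = {i: (0 if edge_color[i] == "red" else w) for i, w in f.items()}
    color_blue = {i: "blue" for i in f}          # every node becomes blue
    return f_blue, color_blue

# Toy usage: a red leaf under a blue root loses its edge weight in T_blue.
f, col = {1: 3, 2: 2}, {1: "blue", 2: "red"}
print(build_T_blue(f, col))   # ({1: 3, 2: 0}, {1: 'blue', 2: 'blue'})
```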

Theorem 3.2. For any bi-colored tree T, we have $M^{\infty}_{red} = M^{opt}_{red}(T)$ and $M^{\infty}_{blue} = M^{opt}_{blue}(T)$.

Proof. Given a bi-colored tree T with n nodes, consider $T_{blue}$ and $M^{\infty}_{blue}$ as in Definition 3.2. We show here that $M^{\infty}_{blue} = M^{opt}_{blue}(T)$. The proof for $M^{\infty}_{red} = M^{opt}_{red}(T)$ is similar. First, T and $T_{blue}$ have the same shape. The only differences between T and $T_{blue}$ are some edge values and the color of some vertices and edges. Thus, to any feasible traversal σ of T, we can associate the corresponding feasible traversal $\sigma_{blue}$ of $T_{blue}$, and reciprocally. For any node i ∈ ⟦1, n⟧ of T, its corresponding node in $T_{blue}$ will be referred to as $i_{blue}$ ∈ ⟦1, n⟧; thus $\sigma(i) = \sigma_{blue}(i_{blue})$. Moreover, we show that $BlueMemUsed(\sigma, i) = MemUsed(\sigma_{blue}, i_{blue})$ for each node i ∈ ⟦1, n⟧:

• If color(i) = blue, node i is not changed in Tblue as described in Definition 3.2. Thus, BlueMemReq(i) = MemReq(iblue) and the size of the files stored in the memory after iblue has been processed is the same that the files stored in the blue memory after i has been processed.

• If color(i) = red, then BlueMemReq(i) = 0 and no file is stored in the blue memory after i has

been processed. Besides, for the corresponding node iblue in Tblue, we have fiblue = 0 and fj = 0 for each j ∈ Succ(iblue). Thus MemReq(iblue) = 0 and no file is stored in the memory after iblue has been processed.

• If i is uncolored (communication node), then BlueMemReq(i) = fi. There are two sub-cases: - If i is a communication node from a blue node to a red node, its processing will store no file in the blue memory. According to the Definition 3.2, if jblue denotes the successor of iblue,

we have fiblue = fi and fjblue = 0. Thus, MemReq(iblue) = fi and no file is stored in the memory after iblue has been processed. 3.4. TREE TRAVERSAL WITH PRE-ASSIGNED TASK 63

- If i is a communication node from a red node to a blue node, its processing will store a file of size fi in the blue memory. According to Definition 3.2, if jblue denotes the successor of iblue, we have fiblue = 0 and fjblue = fi. Thus, MemReq(iblue) = fi and a file of size fi is stored in the memory after iblue has been processed.

During the whole process, BlueMemReq(σ, i) = MemReq(σblue, iblue). Besides, the size of the files stored in the blue memory after i has been processed and the size of the files stored in the memory after iblue has been processed are equal. Thus BlueMemUsed(σ, i) = MemUsed(σblue, iblue), and Mblue^opt(T) = Mblue^∞. □

3.4.1.3 Joint minimization of both objectives

Since the traversal problem is NP-complete, it is natural to wonder whether it is possible to get a schedule with guaranteed blue and red peak memories, compared to the optimal ones. In this section, we show that a trade-off must be enforced between these two objectives: indeed, if one wants a strong guarantee on one memory (blue or red), then the produced schedule may be arbitrarily bad for the other memory. More specifically, we prove that there exists no schedule that can simultaneously approximate both minimum memories Mblue^opt(T) and Mred^opt(T) within arbitrary constant factors, for any bi-colored tree T. Since the (usually unfeasible) point of the Pareto diagram with coordinates (Mblue^opt(T), Mred^opt(T)) is sometimes called the Zenith in multi-objective optimization [46], this result amounts to proving that there exists no Zenith-approximation.

Definition 3.3. Given a bi-colored tree T, we can construct the corresponding uncolored tree Tunco by turning every colored node of T into an uncolored node, as depicted in Figure 3.7. We let Munco^opt(Tunco) be the minimal amount of memory needed to process Tunco.


Figure 3.7: A bi-colored tree T and its corresponding uncolored tree Tunco in Definition 3.3.

The following lemma is helpful to prove the inapproximability theorem.

Lemma 3.1. Given a bi-colored tree T with n nodes, consider an arbitrary traversal σ of T that requires an amount of red memory equal to Mred^σ(T) and an amount of blue memory equal to Mblue^σ(T). Then necessarily:

Mred^σ(T) + Mblue^σ(T) ≥ Munco^opt(Tunco)

Proof. Let Tunco be the uncolored tree corresponding to T as described in Definition 3.3. We observe that T and Tunco have the same tasks, hence to any feasible traversal σ of T, we can associate the corresponding feasible traversal σu of Tunco, and reciprocally. For any node i ∈ ⟦1, n⟧ of T, its corresponding node in Tunco will be referred to as iu ∈ ⟦1, n⟧; thus σ(i) = σu(iu). We will show that

∀i ∈ ⟦1, n⟧, BlueMemUsed(σ, i) + RedMemUsed(σ, i) = MemUsed(σu, iu)

We proceed along the following case analysis:

• If color(i) = blue, then BlueMemReq(i) = MemReq(iu) and RedMemReq(i) = 0. Besides, no file is stored in the red memory after i has been processed; also, the size of the files stored in the blue memory after i has been processed is the same as that of the files stored in the memory after iu has been processed.

• If color(i) = red, then RedMemReq(i) = MemReq(iu) and BlueMemReq(i) = 0. Besides, no file is stored in the blue memory after i has been processed; also, the size of the files stored in the red memory after i has been processed is the same as that of the files stored in the memory after iu has been processed.

• If i is uncolored (communication node), then BlueMemReq(i) + RedMemReq(i) = 2 × fi = MemReq(iu). Besides, a file of size fi will be stored in one of the two memories after i has been processed, and a file of size fi will be stored in the memory after iu has been processed.

During the whole traversal, we thus have BlueMemReq(σ, i) + RedMemReq(σ, i) = MemReq(σu, iu). The sum of the size of the files stored in the blue memory and of the size of the files stored in the red memory after i has been processed is always equal to the size of the files stored in the memory after iu has been processed. Thus BlueMemUsed(σ, i) + RedMemUsed(σ, i) = MemUsed(σu, iu). This means that:

Munco^opt(Tunco) ≤ Munco^σu(Tunco) = max_i MemUsed(σu, i)
                                   = max_i {BlueMemUsed(σ, i) + RedMemUsed(σ, i)}
                                   ≤ max_i {BlueMemUsed(σ, i)} + max_i {RedMemUsed(σ, i)}
                                   = Mred^σ(T) + Mblue^σ(T),

which concludes the proof. □

Theorem 3.3. Given two constants α and β, there exists no algorithm that is both an α-approximation for blue memory peak minimization and a β-approximation for red memory peak minimization, when scheduling bi-colored trees.

Proof. To establish this result, we proceed by contradiction. We therefore assume that there exist an integer α, an integer β, and an algorithm A that processes any bi-colored tree T using a blue peak memory that is not greater than α times the optimal blue peak memory Mblue^opt(T), and using a red peak memory that is not greater than β times the optimal red peak memory Mred^opt(T). To derive the contradiction, we use the family of trees (Tn)n∈N depicted in Figure 3.8. Tn is defined recursively using Tn−1. To help the reader visualize Tn, Figure 3.9 represents T2.


Figure 3.8: Recursive definition of Tn in the proof of Theorem 3.3

• ∀n ≥ 2, Mblue^opt(Tn) = 3. Consider the traversal σblue that processes Tn as follows:

  - If n = 0, σblue processes the node B0.
  - If n > 0, σblue processes the nodes Bn and Cn. Then Tn−1^(left) is processed recursively. Nodes Rn and C′n follow. Finally, Tn−1^(right) is processed recursively.

At each step of this process, the traversal σblue does not use more than 3 units of blue memory. Since BlueMemReq(Bn−1) = 3, this proves that Mblue^opt(Tn) = 3.

• ∀n ≥ 1, Mred^opt(Tn) = 2. Consider the traversal σred that processes Tn as follows. At step k:

  - If k = 0, σred processes the node B0.
  - If k > 0, σred processes the node Bk. Then Tk−1^(left) is processed recursively. Nodes Ck, Rk and C′k follow. Finally, Tk−1^(right) is processed recursively.

At each step of this process, the traversal σred does not use more than 2 units of red memory. Since RedMemReq(Rn) = 2, this proves that Mred^opt(Tn) = 2.

• Let Tn^unco be the uncolored tree corresponding to Tn as described in Definition 3.3, and Munco^opt(Tn^unco) the minimum amount of memory required to execute it. T2^unco is depicted in Figure 3.9. We now prove by induction that Munco^opt(Tn^unco) = n + 2 for n ≥ 2. As shown in [81], post-order traversals are optimal for peak memory minimization of uncolored trees with unit costs. Besides, all post-order traversals of Tn^unco require the same amount of memory. Thus Munco^opt(Tn^unco) = Munco^opt(Tn−1^unco) + 1 for n ≥ 2. Since Munco^opt(T1^unco) = 2, we have the result.

By hypothesis, algorithm A can process any Tn with Mblue^A(Tn) ≤ α·Mblue^opt(Tn) = 3α and Mred^A(Tn) ≤ β·Mred^opt(Tn) = 2β. Let n0 = ⌈3α + 2β⌉; we have:

Mblue^A(Tn0) + Mred^A(Tn0) ≤ 3α + 2β < ⌈3α + 2β⌉ + 2 = Munco^opt(Tn0^unco)

This contradicts Lemma 3.1, which means that such an algorithm A cannot exist. □


Figure 3.9: T2 and T2^unco in the proof of Theorem 3.3.

3.4.2 Depth-first traversals

In this section, we study depth-first traversals, which are the equivalent of post-order traversals for in-trees. In the context of single-memory trees, depth-first traversals are known to be sub-optimal [81]: worse, their memory usage can be arbitrarily high as compared to that of the optimal solution [74]. Clearly, these negative results remain true in a two-memory framework (simply assume that one memory is infinite). Still, depth-first traversals are a natural heuristic for traversing tree graphs, and they enjoy a simple implementation and memory management. As such, they are the most commonly used traversals in actual sparse solvers like MUMPS [10, 11]. We show how to compute the optimal depth-first traversal in Section 3.4.2.1. It turns out that this traversal is optimal for both memory usages (among all depth-first traversals). However, depth-first traversals give no freedom on scheduling communication nodes. If we allow a communication node to be processed not immediately before its sub-tree, the ordering of the processing of the sub-trees and of the communication nodes creates a trade-off between both memory usages and makes it possible to decrease them. This leads us to define sloppy depth-first traversals, which we study in Section 3.4.2.2.

3.4.2.1 Strict depth-first traversals

Definition 3.4. A depth-first traversal is a feasible traversal that processes all nodes of a tree T by processing the root and, then, recursively processing all sub-trees. Hence, in a depth-first traversal, after processing a node i, the whole sub-tree rooted at i is completely processed before any other node that does not belong to this sub-tree. Formally, a feasible traversal σ of the tree T with n nodes is a depth-first traversal if and only if for each node r ∈ T, with two successors i ∈ Succ(r) and j ∈ Succ(r), we have: σ(i) < σ(j) ⇒ (∀u ∈ Ti, σ(u) < σ(j)), where Ti is the sub-tree rooted at the node i.

As recalled in the introduction of this section, depth-first traversals are sub-optimal for a single memory, but they are simple, widely used in actual sparse solvers, and remain a natural heuristic in the two-memory framework. Algorithm 6 computes the optimal depth-first traversal: when it encounters a blue node (respectively a red node), it applies the rule for minimizing the blue (resp. red) memory in depth-first traversals, which does not impact the amount of red (resp. blue) memory. It turns out that this traversal is optimal among all depth-first traversals for both memory usages.

Theorem 3.4. Algorithm 6 returns the best depth-first traversal σ of T for both the blue and the red memories, together with the amounts of memory Mblue and Mred used by σ.

Algorithm 6: BestDepthFirstTraversal(T)
output: schedule σ with peak blue memory Mblue and peak red memory Mred

root ← the root of T
CurrentMem ← 0
(σ, Mblue, Mred) ← ([root], 0, 0)
for i ∈ Succ(root) do
    (σi, Mblue_i, Mred_i) ← BestDepthFirstTraversal(Ti)
    CurrentMem ← CurrentMem + fi
if color(root) = blue then
    for i ∈ Succ(root), in increasing order of Mblue_i − fi, do
        σ ← [σ; σi]
        CurrentMem ← CurrentMem − fi
        Mblue ← max(Mblue, CurrentMem + Mblue_i)
    Mred ← max_{i∈Succ(root)} Mred_i
if color(root) = red then
    for i ∈ Succ(root), in increasing order of Mred_i − fi, do
        σ ← [σ; σi]
        CurrentMem ← CurrentMem − fi
        Mred ← max(Mred, CurrentMem + Mred_i)
    Mblue ← max_{i∈Succ(root)} Mblue_i
if root is an uncolored communication node then
    i ← the unique successor of root
    σ ← [σ; σi]
    if color(i) = blue then
        Mblue ← Mblue_i
        Mred ← max(fi, Mred_i)
    if color(i) = red then
        Mred ← Mred_i
        Mblue ← max(fi, Mblue_i)
return (σ, Mblue, Mred)
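For illustration, here is a runnable Python sketch of Algorithm 6 (our own code, reusing the hypothetical Node representation introduced after Definition 3.2):

    # Sketch of Algorithm 6: returns (schedule, blue peak, red peak).
    def best_depth_first(node):
        results = [(best_depth_first(c), c.f) for c in node.children]
        if node.color in ("blue", "red"):
            key = 0 if node.color == "blue" else 1   # memory optimized here
            other = 1 - key
            # Subtrees in increasing order of (optimized peak - f_i),
            # as in the pseudocode above.
            results.sort(key=lambda r: r[0][1 + key] - r[1])
            schedule, peaks = [node], [0, 0]
            current = sum(f for _, f in results)     # unprocessed subtree files
            for (sub_sched, sub_blue, sub_red), f in results:
                sub_peaks = (sub_blue, sub_red)
                current -= f
                schedule += sub_sched
                peaks[key] = max(peaks[key], current + sub_peaks[key])
                peaks[other] = max(peaks[other], sub_peaks[other])
            return schedule, peaks[0], peaks[1]
        # Uncolored communication node: a unique successor.
        (sub_sched, m_blue, m_red), f = results[0]
        if node.children[0].color == "blue":
            return [node] + sub_sched, m_blue, max(f, m_red)
        return [node] + sub_sched, max(f, m_blue), m_red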

Proof. Finding the best depth-first traversal of T amounts to finding the best ordering in which to process every sub-tree. We prove that the order of the recursive processing at each step in Algorithm 6 is the best for both memories.

• At a given step, if the root of the sub-tree is a communication node, we have no choice, and we recursively process the sub-tree rooted at its unique successor.

• At a given step, if the root r of the sub-tree is blue, then the amount of red memory used to process this sub-tree does not depend on the order of the recursive processing of the sub-trees. Indeed, for each i ∈ Succ(r), after the recursive processing of Ti, Sblue ← Sblue \ {i} and Sred is unchanged. Then, independently of the order of the recursive processing of every Ti, the amount of red memory required to process T with a depth-first traversal will be RedMemReq(T) = max_{i∈Succ(root)} RedMemReq(Ti). Thus, at this step, we can only optimize the amount of blue memory. To do so, we use the optimal post-order traversal for uncolored trees provided by Liu [80]. This post-order traversal leads to the best depth-first traversal for the blue memory at this step, and, thus, to the best depth-first traversal for both memories.

• At a given step, if the root of the sub-tree is red, the proof is similar. 

3.4.2.2 Sloppy depth-first traversals

As explained in the previous section, the order of the sub-trees processed in a strict depth-first traversal does not influence the maximum usage of red memory for a tree rooted at a blue node, and vice versa. Thus, in a strict depth-first traversal, both memory usages are independent. This comes from the fact that strict depth-first traversals give no freedom on communications. If we allow a communication node to be processed not immediately before its sub-tree, the ordering of the processing of the sub-trees and of the communication nodes will create a trade-off between both memory usages. This leads us to define sloppy depth-first traversals.

Definition 3.5. A sloppy depth-first traversal is a feasible traversal similar to a depth-first traversal except that, after processing a communication node i, the whole sub-tree rooted at i is not necessarily processed immediately. We define SloppyChildren(i) as the set of the red and blue successors of i, together with the successors of the uncolored successors (these represent the set of the computational successors of i). Formally, a feasible traversal σ of the tree T with n nodes is a sloppy depth-first traversal if and only if for each node r ∈ T, and for any two nodes i ∈ Succ(r) and j ∈ SloppyChildren(r) ∪ Succ(r), we have:

σ(i) < σ(j) ⇒ (∀u ∈ Ti, σ(u) < σ(j)) where Ti is the sub-tree rooted at the node i.

Definition 3.6 (TWOMEMORYSLOPPYDEPTHFIRSTTRAVERSAL). Given a tree T with n nodes, and two fixed amounts of memory Mred and Mblue, is there a sloppy depth-first traversal of the tree that needs an amount of red memory not exceeding Mred and an amount of blue memory not exceeding Mblue?

Theorem 3.5. The TWOMEMORYSLOPPYDEPTHFIRSTTRAVERSAL problem is NP-complete.

Proof. The problem clearly belongs to NP, and the certificate is the ordered list of tasks (of both colors, and including communication nodes) executed by the schedule. To establish the completeness, we use a reduction from the 2-Partition problem [49]. Consider an instance Inst1 of the 2-Partition problem, with n integers {a1, a2, ..., an} such that Σi ai = S. Consider an instance Inst2 of the decision problem, consisting of the tree depicted in Figure 3.10. We set Mred = 2S for the red tasks and Mblue = 2S for the blue tasks. The construction of Inst2 is polynomial in the size of Inst1.

Assume first that Inst2 has a solution. Any sloppy depth-first traversal must start with the root Broot. After it has been processed, 2S units of the blue memory are occupied, which means that this memory is full.


Figure 3.10: Tree corresponding to Inst 2 in the proof of Theorem 3.5

Let i0 be the index of the first red task Ri to be executed. We observe that Bbig and Bfree have to be processed before Ri0, otherwise the processing of C′i0 (which occurs right after the processing of Ri0 in a sloppy depth-first traversal) would violate the Mblue bound on the blue memory. Thus, the only tasks that can be processed right after Broot and before Bbig are the communication tasks Ci. Let I be the set of the indices of the tasks Ci executed before Bbig.

• If Σ_{i∈I} ai < S/2, when the scheduler decides to execute Bbig, the blue memory is filled with Σ_{i∉I} ai + S units. Thus the processing of Bbig will use BlueMemUsed(Bbig) = Σ_{i∉I} ai + 3S/2 > 2S units of blue memory, which violates the Mblue bound.

• If Σ_{i∈I} ai > S/2, when the scheduler decides to execute Ri0, the red memory is filled with at least Σ_{i∈I} ai > S/2 units. Thus the processing of Ri0 will use at least RedMemUsed(Ri0) ≥ Σ_{i∈I} ai + 3S/2 > 2S units of red memory, which violates the Mred bound.

Thus, Σ_{i∈I} ai = S/2, which implies that Inst1 has a solution.

Suppose now that Inst1 has a solution I. According to the previous reasoning, the sequence of nodes Broot; Ci for all i ∈ I; Bbig and Bfree can be executed without violating the bounds on memories. After this sequence, there are S/2 units occupied in the blue memory and in the red one. Now, one of the red nodes Ri0 with i0 ∈ I can be executed without violating the Mred bound, followed by C′i0 and Bi0. Moreover, we can process every Ri, C′i and Bi for all i ∈ I. Then, one is able to execute every branch of Ci down to Bi for all i ∉ I, which means that Inst2 has a sloppy depth-first solution and concludes our proof. □

3.4.3 Heuristics

In addition to depth-first traversals, in this section we present three traversal heuristics which aim at minimizing both the blue and red memories. All three heuristics are based on the seminal work by Liu [81], who considers a single memory. We propose different adaptations for two memories. We start with the simplest heuristic and then proceed to more elaborate ones.

Working with the uncolored tree: LIUUNCOLORED We have shown that the problem TWOMEMORYTRAVERSAL of finding a tree traversal that minimizes both memories is NP-complete. However, when a single memory is considered, the problem becomes polynomial. It is thus natural to adapt the optimal algorithm for the single-memory problem proposed by Liu [81] to bi-colored trees. The simplest adaptation amounts to considering the tree as uncolored, that is, as if all tasks were processed on the same computing unit with a single memory. On this uncolored tree, illustrated in Figure 3.7, we apply Liu's optimal algorithm. This heuristic computes an optimal traversal for the sum of the blue and the red memories. The intuition is that minimizing the sum of both memories will lead to a good memory usage for each of them. This heuristic is referred to as LIUUNCOLORED in the following.
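For reference, the recoloring this heuristic applies (Definition 3.3, all weights kept) can be sketched with the same hypothetical Node class as above:

    # Sketch of the LIUUNCOLORED preprocessing: forget all colors, keep all
    # weights, then run Liu's single-memory algorithm (not shown) on the result.
    def to_uncolored(node):
        return Node(color=None, f=node.f,
                    children=[to_uncolored(c) for c in node.children])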

Refining the sum with weights: LIUWEIGHTEDSUM One problem with the previous heuristic is that both memories may not be equivalent. For example, it may well be the case that the (input and output) files used by red tasks are much larger than those used by blue tasks. In such a case, minimizing the sum may lead to a much larger amount of blue memory than would be needed, for example, in an optimal traversal for the blue memory. This behavior is not desirable, and we can slightly change the heuristic to (try to) avoid this. We first compute the optimal amount of blue (respectively red) memory that is needed to traverse the tree, as described in Section 3.4.1.2, and we denote this amount by Mblue^∞ (resp. Mred^∞). Then, we normalize the memory weights of edges as follows: the memory weight fi of the input edge of node i becomes fi/Mblue^∞ if this edge is blue, and fi/Mred^∞ if it is red. Then, the corresponding uncolored tree is considered and Liu's optimal algorithm is applied, as in the previous heuristic. This heuristic is called LIUWEIGHTEDSUM in the following. A possible implementation of this renormalization is sketched below.
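The sketch uses the same illustrative Node class as before; the unbounded optima are assumed to be computed as in Section 3.4.1.2 (e.g., by running Liu's algorithm on Tblue and Tred) and are passed in as plain numbers:

    # Sketch of the LIUWEIGHTEDSUM renormalization. The blue/red side of an
    # edge is decided with the same rule as in the sketch of Definition 3.2.
    def normalized_copy(node, m_blue_inf, m_red_inf, parent=None):
        in_blue = (parent is None
                   or parent.color == "blue"
                   or (parent.color is None and node.color == "blue"))
        scale = m_blue_inf if in_blue else m_red_inf
        return Node(color=None,              # the result is treated as uncolored
                    f=node.f / scale,
                    children=[normalized_copy(c, m_blue_inf, m_red_inf, node)
                              for c in node.children])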

LIUWEIGHTEDMAX In the previous heuristics, when applying Liu's algorithm to modified trees, we minimize the sum (or the weighted sum) of both memory amounts. However, to get closer to the Zenith point, we would like to minimize the maximum, or rather the weighted maximum, of both memories. It is possible to modify Liu's algorithm for this new goal. Of course, the resulting algorithm is not optimal anymore (which is coherent with the NP-completeness of the TWOMEMORYTRAVERSAL problem), but it can be used as a heuristic. Liu's algorithm is a recursive algorithm which, at each node r, combines optimal traversals for the subtrees rooted at the successors of r into an optimal traversal for the whole tree rooted in r. The combination relies on the definition of "hill-valley" segments: segments are defined by splitting a subtree schedule at different local minima (the "valleys"). These segments are then sorted by non-increasing "hill" minus "valley" values (the hill being the local peak memory of the segment). Liu [81] proves that such a combination of optimal subtree schedules leads to a global optimal schedule. In this heuristic, we replace the memory criterion used in the definition of the segments by the maximum weighted memory: max(BlueMemUsed(σ,i)/Mblue^opt(T), RedMemUsed(σ,i)/Mred^opt(T)); we keep the same algorithm for combining subtree schedules. Of course, the proof of optimality does not hold for this new metric. This heuristic is called LIUWEIGHTEDMAX in the following.

3.4.4 Experiments

In this section, we experimentally compare the memory usage of the heuristics proposed in the previous sections for TWOMEMORYTRAVERSAL. For each heuristic among BESTDEPTHFIRST, LIUUNCOLORED, LIUWEIGHTEDSUM and LIUWEIGHTEDMAX, we compute the amount of blue and red memory needed by the traversal. These values are compared to the minimum amount of blue (respectively red) memory needed when the red (resp. blue) memory is unbounded, as described in Section 3.4.1.2.

All heuristics have been implemented in C. The optimal traversal for a single memory is computed using Liu's algorithm [81], written as a recursive code. Source code for all the algorithms, heuristics and experiments is publicly available at http://perso.ens-lyon.fr/julien.herrmann/.

3.4.4.1 Data Sets

We use four different sets of trees, ranging from actual trees arising in computations to random trees. We first describe the data set of uncolored trees which serves as a basis for our realistic colored trees.

Real uncolored trees for Cholesky factorization The UNCOLOREDREALTREES data set contains assembly trees for a set of sparse matrices obtained from the University of Florida Sparse Matrix Collection (http://www.cise.ufl.edu/research/sparse/matrices/). The chosen matrices satisfy the following assertions: not binary, not corresponding to a graph, square, having a symmetric pattern, a number of rows between 20,000 and 2,000,000, a number of non-zeros per row at least equal to 2.5, and a total number of non-zeros at most equal to 5,000,000; and each chosen matrix has the largest number of non-zeros among the matrices in its group satisfying the previous assertions. At the time of testing, there were 76 matrices satisfying these properties. We first order the matrices using MeTiS [77] (through the MeshPart toolbox [53]) and amd (available in Matlab), and then build the corresponding elimination trees using the symbfact routine of Matlab. We also perform a relaxed node amalgamation on these elimination trees to create assembly trees. We have created a large set of instances by allowing 1, 2, 4, and 16 (if more than 1.6 × 10^5 nodes) relaxed amalgamations per node. At the end we compute memory weights and processing times to accurately simulate the matrix factorization: we compute the memory weight ni of a node as η² + 2η(µ − 1), where η is the number of nodes amalgamated, and µ is the number of non-zeros in the column of the Cholesky factor of the matrix which is associated with the highest node (in the starting elimination tree); the processing cost wi of a node is defined as (2/3)η³ + η²(µ − 1) + η(µ − 1)² (these terms correspond to one Gaussian elimination, two multiplications of a triangular η × η matrix with an η × (µ − 1) matrix, and one multiplication of a (µ − 1) × η matrix with an η × (µ − 1) matrix). Edge weights fi are computed as (µ − 1)². The resulting 644 trees contain from 2,000 to 1,000,000 uncolored nodes. Their depths range from 12 to 70,000, and their maximum degrees range from 2 to 175,000.

Real colored trees for Cholesky factorization The REALTREES data set is obtained by coloring every tree in UNCOLOREDREALTREES in a meaningful way. Every tree node in UNCOLOREDREALTREES represents a step of a (η + µ − 1) × (η + µ − 1) matrix factorization, with a panel of size η. In practice, at each step of the factorization, we aim at processing the GEMM routine (which corresponds to the multiplication of the (µ − 1) × η matrix with the η × (µ − 1) matrix) on the GPU. Indeed, GEMMs can reach up to 99% of the GPU's theoretical peak performance. Thus, we split every node into two tasks: a red one corresponding to the GEMM routine, and a blue one corresponding to the rest of the factorization.

Real trees with random colors The RANDOMCOLOREDREALTREES data set is obtained by randomly coloring every node of every tree in UNCOLOREDREALTREES with an equiprobable choice in the set {red, blue}. Then, communication nodes are added between nodes of different colors.

Real trees with random weights and colors The RANDOMWEIGHTEDREALTREES data set is obtained by randomly coloring every node of every tree in UNCOLOREDREALTREES with an equiprobable choice in the set {red, blue}, and by randomly changing the node and edge weights. Every ni is set to a random integer value in ⟦1, N/500⟧, where N is the size of the tree, and every fi is set to a random integer value in ⟦1, N⟧. Then, communication nodes are added between nodes of different colors.

Random trees The RANDOMTREES data set is a set of 500 trees with random structure, random weights and random colors. Each tree has been generated as follows: the tree size N is randomly chosen in ⟦1, 32767⟧. Then, for each node i ∈ ⟦1, N⟧, its predecessor is randomly chosen in ⟦1, i − 1⟧. The values of its ni and fi are uniformly chosen in ⟦1, 3276⟧, and its color is randomly chosen between red and blue. Then, communication nodes are added between nodes of different colors.
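A sketch of this generator (our own code; the pass inserting communication nodes between nodes of different colors is omitted):

    # Sketch of the RANDOMTREES generator. Node 1 is the root; node i picks
    # its predecessor among nodes 1..i-1, matching the description above.
    import random

    def random_tree_instance():
        n = random.randint(1, 32767)
        parent = {i: random.randint(1, i - 1) for i in range(2, n + 1)}
        n_weight = {i: random.randint(1, 3276) for i in range(1, n + 1)}
        f_weight = {i: random.randint(1, 3276) for i in range(1, n + 1)}
        color = {i: random.choice(["red", "blue"]) for i in range(1, n + 1)}
        return parent, n_weight, f_weight, color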

3.4.4.2 Results

In this section, we evaluate the performance of the four heuristics introduced above in terms of memory requirement. For every tree T in the data sets, and for every traversal σ returned by the heuristics, we compute the maximum relative overhead of each memory compared to the optimal value:

MaxRelativeOverhead(σ, T) = max( (Mblue^σ(T) − Mblue^opt(T)) / Mblue^opt(T), (Mred^σ(T) − Mred^opt(T)) / Mred^opt(T) )

As explained in Section 3.3, the optimum for both memories (also called the Zenith) is a theoretical bound that may not be reachable. Thus, for a tree T, there does not necessarily exist a traversal σ such that MaxRelativeOverhead(σ, T) = 0. Detailed statistics for the four heuristics are given in Table 3.1. We make the following observations:

• For the REALTREES data set, BESTDEPTHFIRST statistically gives the best results, with an average relative overhead equal to 6.3%; it reaches the Zenith for 55.6% of the trees. This comes from the particular structure of the assembly trees. Indeed, most nodes in these assembly trees have an input file smaller than the sum of their output files: fi ≤ Σ_{j∈Succ(i)} fj. This means that when we execute a node, it is more likely to be profitable to execute the whole subtree straightaway. This is why BESTDEPTHFIRST turns out to be the best heuristic for the REALTREES and the RANDOMCOLOREDREALTREES data sets. Besides, LIUUNCOLORED is very close to the BESTDEPTHFIRST performance on the REALTREES data set, with an average relative overhead equal to 6.6%. On the contrary, LIUWEIGHTEDMAX appears to be ill-suited to the structure of the assembly trees in REALTREES, with an average relative overhead equal to 8.4%; it can require up to 2.16 times the optimal memory for some trees.

• The structure of the trees in the RANDOMCOLOREDREALTREES data set is close to that of the trees in REALTREES, and the results are similar. BESTDEPTHFIRST statistically gives the best results with an average relative overhead equal to 3.8%, and LIUUNCOLORED provides the second best results with an average relative overhead equal to 5.2%.

• For the RANDOMWEIGHTEDREALTREES data set, file sizes are randomized, and BESTDEPTHFIRST is no longer adapted to such trees; it provides an average relative overhead equal to 20.9%. Much worse, LIUUNCOLORED can require up to 5.13 times the optimal memory for some trees in RANDOMWEIGHTEDREALTREES. On the contrary, LIUWEIGHTEDMAX appears to be well-designed for the trees in RANDOMWEIGHTEDREALTREES, with an average relative overhead two times lower than that of BESTDEPTHFIRST.

• The results for the RANDOMTREES data set confirm that LIUWEIGHTEDMAX is the best of the four heuristics when dealing with trees with random structure. It gives the best results with an average relative overhead equal to 3.4%, and exhibits a relative overhead below 10% for 92% of the random trees.

Figures 3.12, 3.13 and 3.14 provide complete results of the simulations. In each figure, a point represents one scenario (one heuristic executed on one tree of the data set). To better visualize the distribution, we also plot a "cross" for each heuristic: the center of this cross is the average result, while the branches represent the scope of each objective between the 10th and 90th percentiles of the distribution. For the REALTREES data set, as explained above, we colored in red the nodes corresponding to the GEMM routine, and in blue the other nodes. Thus, every red node has a communication node as predecessor, and a unique communication node as successor. With this structure, all of our heuristics give the optimal memory usage for the red memory. This specification fits well with practice, where one aims at not overloading the GPU memory. Figure 3.11 provides the detailed distribution of the blue memory usage for the heuristics.

These figures exhibit the same trends for average values as observed in Table 3.1. For the RANDOMCOLOREDREALTREES data set in Figure 3.12, and for the RANDOMTREES data set in Figure 3.14, we see that many traversals returned by the heuristics are optimal for at least one of the two memories, whereas for the RANDOMWEIGHTEDREALTREES data set in Figure 3.13, many more of the returned traversals are non-optimal for either memory. We also observe that LIUUNCOLORED can require around 5 times the optimal red memory in two scenarios. These results show that the performance of the heuristics is strongly related to the structure of the trees. While BESTDEPTHFIRST achieves nice results for the realistic assembly trees, LIUWEIGHTEDMAX appears to be a better solution when dealing with more random structures.

Data set                     Algorithm         Avg.    Max.     Std. Dev.  Frac. of Opt.  Frac. ≤ 10%
REALTREES                    BESTDEPTHFIRST    6.3%    64.4%    8.0%       55.6%          73.7%
                             LIUWEIGHTEDMAX    8.4%    116.5%   9.9%       49.8%          68.3%
                             LIUWEIGHTEDSUM    7.5%    76.0%    9.1%       52.8%          70.6%
                             LIUUNCOLORED      6.6%    60.0%    8.3%       55.0%          73.8%
RANDOMCOLOREDREALTREES       BESTDEPTHFIRST    3.8%    44.0%    5.4%       67.2%          83.9%
                             LIUWEIGHTEDMAX    6.0%    52.3%    7.2%       51.4%          75.5%
                             LIUWEIGHTEDSUM    5.9%    52.6%    7.3%       54.1%          75.8%
                             LIUUNCOLORED      5.2%    52.6%    6.9%       59.7%          78.0%
RANDOMWEIGHTEDREALTREES      BESTDEPTHFIRST    20.9%   90.3%    18.6%      28.3%          44.6%
                             LIUWEIGHTEDMAX    10.2%   88.2%    13.6%      39.8%          72.7%
                             LIUWEIGHTEDSUM    13.4%   107.5%   16.3%      37.7%          65.2%
                             LIUUNCOLORED      15.4%   413.1%   17.0%      26.5%          60.2%
RANDOMTREES                  BESTDEPTHFIRST    4.5%    28.2%    4.3%       33.4%          83.4%
                             LIUWEIGHTEDMAX    3.4%    23.5%    3.2%       26.0%          92.0%
                             LIUWEIGHTEDSUM    4.4%    21.4%    3.7%       20.6%          86.0%
                             LIUUNCOLORED      6.8%    32.9%    4.8%       14.6%          72.6%

Table 3.1: Statistics on the maximum relative overhead for each memory required by the four heuristics (comparison with the Zenith). Frac. of Opt. (respectively Frac. ≤ 10%) gives the fraction of cases in which the heuristic achieves the Zenith (resp. has a degradation not larger than 10%).

3.5 General problem

The simplified model used in the previous section was useful to assert the intrinsic difficulty of the problem: it is NP-complete to decide whether there exists a tree traversal that satisfies bounds on each memory usage; worse, it is impossible to approximate both absolute minimum memory amounts within any constant factor pair. All these results, although negative, laid the foundations of scheduling for dual-memory systems. In this section, we envision the fully general framework introduced in Section 3.2, where tasks have different execution times for each resource type (instead of being tied to a given resource as in the previous section), and where concurrent execution of several tasks on each resource type is possible (instead of the fully sequential processing of the task graph that is assumed in the previous section).

Figure 3.11: Percentage distribution of the blue memory usage for the REALTREES data set.

Figure 3.12: Distribution of each memory usage for the RANDOMCOLOREDREALTREES data set.

Section 3.5.1 is devoted to expressing an optimal schedule in terms of the solution of a complex ILP. We introduce the new heuristics in Section 3.5.2, and assess their performance through an extensive set of simulations in Section 3.5.3.

Figure 3.13: Distribution of each memory usage for the RANDOMWEIGHTEDREALTREES data set.

Figure 3.14: Distribution of each memory usage for the RANDOMTREES data set.

3.5.1 ILP formulation

In this section, we describe how to compute an optimal schedule σopt through a computationally expensive ILP (Integer Linear Program). The objective is twofold: (i) to provide an optimal solution for small instances and (ii) to compare the heuristics presented in the following section with the optimal schedule, to evaluate their absolute quality. Our approach is motivated by the successful attempt to derive such an ILP formulation for several variants of the DAG scheduling problem, such as [105, 35]. However, to the best of our knowledge, none of the existing ILPs handles the memory usage of the schedule. A major contribution of this chapter is the introduction of additional constraints that enforce memory constraints, as those described in Section 3.3.2. The variables used by our linear program are listed in Figure 3.15. The ti and τij variables represent the starting times of the tasks and of the communications. M is the makespan value to minimize. The pi and bi variables describe the allocation of task i on the resources and are used to compute the value of the wi variables, which represent the actual computing time of task i. The εij and δij variables are used to enforce resource constraints. Finally, to compute the amount of memory used by the schedule at any time, we need to know the order in which all tasks and communications are processed. This is achieved through the variables σij, σ′kij, mij, m′kij, cijk, c′ijkp, dijk and d′ijkp. These numerous variables are needed to ensure that the schedule is properly defined, and that we precisely know which tasks are processed and which data are present in a given memory at any time, to ensure that the memory usage is kept below the prescribed bound.

M: makespan of the corresponding schedule
ti: starting time of task i
τij: starting time of communication (i, j)
pi: index of the processor where task i is to be executed
bi: equal to 0 if task i is executed on the red memory, and 1 if it is executed on the blue memory
wi: actual computing time of task i in the corresponding schedule
εij: equal to 1 if the processor index of task i is strictly less than that of task j, and 0 otherwise
δij: equal to 1 if task i and task j are executed on the same memory, and 0 otherwise
σij: equal to 1 if task i finishes before task j starts, and 0 otherwise
σ′kij: equal to 1 if task k finishes before communication (i, j) starts, and 0 otherwise
mij: equal to 1 if task i starts before task j starts, and 0 otherwise
m′kij: equal to 1 if task k starts before communication (i, j) starts, and 0 otherwise
cijk: equal to 1 if communication (i, j) starts before task k starts, and 0 otherwise
c′ijkp: equal to 1 if communication (i, j) starts before communication (k, p) starts, and 0 otherwise
dijk: equal to 1 if communication (i, j) finishes before task k starts, and 0 otherwise
d′ijkp: equal to 1 if communication (i, j) finishes before communication (k, p) starts, and 0 otherwise

Figure 3.15: Variables of the linear program

Due to the numerous variables that describe a schedule, the linear program contains a large number of constraints to ensure that these variables correspond to their definition given in Figure 3.15. For the sake of completeness, we give the whole linear program in Figure 3.16, and we detail the most significant constraints below. Constraints (1) to (25) describe a schedule of the DAG onto the heterogeneous platform, and have nothing to do with memory constraints. They also ensure that communication times are respected when a piece of data needs to be moved from one memory to another.

min_{t,p,σ,ε} M, subject to:

∀i ∈ V: ti + wi ≤ M (1)
∀(i, j) ∈ E: ti + wi ≤ τij (2)
∀(i, j) ∈ E: τij + (1 − δij)Cij ≤ tj (3)
∀i ≠ j ∈ V: tj − ti − mij·Mmax ≤ 0 (4a)
∀i ≠ j ∈ V: tj − ti + (1 − mij)·Mmax ≥ 0 (4b)
∀k ∈ V, ∀(i, j) ∈ E: τij − tk − m′kij·Mmax ≤ 0 (5a)
∀k ∈ V, ∀(i, j) ∈ E: τij − tk + (1 − m′kij)·Mmax ≥ 0 (5b)
∀i ≠ j ∈ V: tj − ti − wi − σij·Mmax ≤ 0 (6a)
∀i ≠ j ∈ V: tj − ti − wi + (1 − σij)·Mmax ≥ 0 (6b)
∀k ∈ V, ∀(i, j) ∈ E: τij − tk − wk − σ′kij·Mmax ≤ 0 (7a)
∀k ∈ V, ∀(i, j) ∈ E: τij − tk − wk + (1 − σ′kij)·Mmax ≥ 0 (7b)
∀k ∈ V, ∀(i, j) ∈ E: tk − τij − cijk·Mmax ≤ 0 (8a)
∀k ∈ V, ∀(i, j) ∈ E: tk − τij + (1 − cijk)·Mmax ≥ 0 (8b)
∀(k, p) ≠ (i, j) ∈ E: τkp − τij − c′ijkp·Mmax ≤ 0 (9a)
∀(k, p) ≠ (i, j) ∈ E: τkp − τij + (1 − c′ijkp)·Mmax ≥ 0 (9b)
∀k ∈ V, ∀(i, j) ∈ E: tk − τij − (1 − δij)Cij − dijk·Mmax ≤ 0 (10a)
∀k ∈ V, ∀(i, j) ∈ E: tk − τij − (1 − δij)Cij + (1 − dijk)·Mmax ≥ 0 (10b)
∀(k, p) ≠ (i, j) ∈ E: τkp − τij − (1 − δij)Cij − d′ijkp·Mmax ≤ 0 (11a)
∀(k, p) ≠ (i, j) ∈ E: τkp − τij − (1 − δij)Cij + (1 − d′ijkp)·Mmax ≥ 0 (11b)
∀i, j ∈ V: pj − pi − εij·|P| ≤ 0 (12a)
∀i ≠ j ∈ V: pj − pi − 1 + (1 − εij)·|P| ≥ 0 (12b)
∀i ∈ V: pi − |P0| − |P|·bi ≤ 0 (13a)
∀i ∈ V: pi − |P0| − 1 + (1 − bi)(|P| + 1) ≥ 0 (13b)
∀i, j ∈ V: mij + mji ≥ 1 (14)
∀i, j ∈ V: σij + σji ≤ 1 (15)
∀(i, j) ∈ E, ∀k ∈ V: m′kij + cijk ≥ 1 (16)
∀(i, j), (k, p) ∈ E: c′ijkp + c′kpij ≥ 1 (17)
∀(i, j), (k, p) ∈ E: d′ijkp + d′kpij ≤ 1 (18)
∀i, k ∈ V: mik ≥ σik (19)
∀(i, j) ∈ E, ∀k ∈ V: σik ≥ cijk (20)
∀(i, j) ∈ E, ∀k ∈ V: cijk ≥ dijk (21)
∀(i, j) ∈ E, ∀k ∈ V: dijk ≥ mjk (22)
∀i, j ∈ V: δij ≤ 1 + bi − bj, δij ≤ 1 + bj − bi, δij ≥ bi + bj − 1 and δij ≥ 1 − bi − bj (23)
∀i ∈ V: wi ≥ bi·Wi^(2) + (1 − bi)·Wi^(1) (24a)
∀i ∈ V: wi ≤ bi·Wi^(2) + (1 − bi)·Wi^(1) (24b)
∀i ≠ j ∈ V: σij + σji + εij + εji ≥ 1 (25)
∀i ∈ V: Σ_{(k,p)∈E} (δik(mki − dkpi) + δip(ckpi − σpi))·Fkp ≤ bi·Mblue + (1 − bi)·Mred (26)
∀(i, j) ∈ E: Σ_{(k,p)∈E} (δkj(m′kij − d′kpij) + δpj(c′kpij − σ′pij))·Fkp ≤ bj·Mblue + (1 − bj)·Mred + δij·Mmax (27)

Figure 3.16: Constraints of the ILP.

Here is a short description of these constraints:

• Constraint (1) ensures that the variable M representing the makespan is larger than or equal to the completion time of the last task.

• Constraint (2) ensures that communication (i, j) starts after the completion of task i.

• Constraint (3) ensures that task j starts after the completion of every possible communication (i, j). We can note that, since δij = 1 if and only if tasks i and j are executed on the same memory, (1 − δij)Cij is the actual processing time of communication (i, j).

• In Constraint (4a) and some of the following ones, we need an upper bound Mmax on the possible value of M. This bound is set arbitrarily to Mmax = Σ_{i∈V} Wi^(1) + Σ_{i∈V} Wi^(2) + Σ_{(i,j)∈E} Ci,j. Constraints (4a), (4b) and (14) ensure that mi,j and mj,i are correctly defined: mi,j = 1 if tj > ti, mi,j = 0 if tj < ti, and if tj = ti, at least one of mij and mji is equal to 1. This is important when computing the amount of memory in Constraint (26).

• Similarly, Constraints (5a) to (18) ensure that the m′kij, σij, σ′kij, cijk, c′ijkp, dijk, d′ijkp, εij and bi variables are well defined.

• Constraint (19) ensures that task ordering is defined consistently, even for tasks with zero processing time (such tasks will appear when pipelining communications in Section 3.5.3).
• Constraint (20) ensures that if communication (i, j) starts before task k starts, then task i must finish before task k starts. Constraints (21) and (22) ensure that the linear program defines a valid schedule for communications and tasks.

• Constraint (23) ensures that the δij variables are well defined, i.e., δij = 1 if and only if bi = bj.

• Constraints (24a) and (24b) ensure that the wi variables are well defined, i.e., wi = Wi^(1) if and only if bi = 0, and wi = Wi^(2) if and only if bi = 1.

• Constraint (25) represents resource constraints as seen in Section 3.3.1: if two tasks are running at the same time, they are not on the same processor.

Finally, Constraint (26) deals with memory constraints, and ensures that the model defined in Section 3.3.2 is observed at the beginning of each task i. Specifically, bi·Mblue + (1 − bi)·Mred is the bound on the memory on which task i is executed. When i is started, we ensure that the sum of the sizes of the files stored in the corresponding memory is smaller than this bound. We claim that, for all (k, p) ∈ E, the file of size Fkp will be in the corresponding memory when task i starts if and only if either "task i and task k are in the same memory and we started task k but communication (k, p) is not finished yet" or "task i and task p are in the same memory and we started communication (k, p) but task p is not finished yet". This explains Constraint (26). Similarly, Constraint (27) ensures that the memory constraint is respected at the beginning of every communication (i, j).

Constraints (26) and (27) are not linear. However, they can be linearized using the technique presented in [105, 35]. To do so, we introduce the variables αkpi = δik(mki − dkpi), βkpi = δip(ckpi − σpi), α′kpij = δkj(m′kij − d′kpij) and β′kpij = δpj(c′kpij − σ′pij). Constraints (26) and (27) are then replaced by the constraints in Figure 3.17. For an arbitrary DAG D = (V, E) with |V| = n nodes and |E| = m edges, the ILP has O(m² + mn) variables and O(m² + mn) constraints.

3.5.2 Heuristics

Given the complexity of optimizing the makespan under memory constraints, we propose two heuristics in this section, MEMHEFT and MEMMINMIN. The key idea is to add memory awareness to the design of traditional scheduling heuristics.

∀i ∈ V: Σ_{(k,p)∈E} (αkpi + βkpi)·Fkp ≤ bi·Mblue + (1 − bi)·Mred (26)
∀i ∈ V, ∀(k, p) ∈ E: αkpi ≥ δik + mki − dkpi − 1 (26a)
∀i ∈ V, ∀(k, p) ∈ E: 2αkpi ≤ δik + mki − dkpi (26b)
∀i ∈ V, ∀(k, p) ∈ E: βkpi ≥ δip + ckpi − σpi − 1 (26c)
∀i ∈ V, ∀(k, p) ∈ E: 2βkpi ≤ δip + ckpi − σpi (26d)
∀(i, j) ∈ E: Σ_{(k,p)∈E} (α′kpij + β′kpij)·Fkp ≤ bj·Mblue + (1 − bj)·Mred + δij·Mmax (27)
∀(i, j) ∈ E, ∀(k, p) ∈ E: α′kpij ≥ δkj + m′kij − d′kpij − 1 (27a)
∀(i, j) ∈ E, ∀(k, p) ∈ E: 2α′kpij ≤ δkj + m′kij − d′kpij (27b)
∀(i, j) ∈ E, ∀(k, p) ∈ E: β′kpij ≥ δpj + c′kpij − σ′pij − 1 (27c)
∀(i, j) ∈ E, ∀(k, p) ∈ E: 2β′kpij ≤ δpj + c′kpij − σ′pij (27d)

Figure 3.17: Linearization of the last two constraints of the ILP.
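This linearization pattern is straightforward to reproduce with any ILP modeller. Here is a toy sketch with PuLP (our choice of tool, not the one used in the thesis) for a single α-type variable:

    # Toy sketch of the linearization pattern (26a)-(26b): the product
    # alpha = delta * (m - d) of binaries is replaced by two linear constraints.
    from pulp import LpProblem, LpVariable, LpMinimize

    prob = LpProblem("linearization_demo", LpMinimize)
    delta = LpVariable("delta", cat="Binary")
    m = LpVariable("m", cat="Binary")
    d = LpVariable("d", cat="Binary")
    alpha = LpVariable("alpha", cat="Binary")

    prob += alpha >= delta + m - d - 1   # forces alpha = 1 when delta = 1 and m - d = 1
    prob += 2 * alpha <= delta + m - d   # forces alpha = 0 in all other cases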

3.5.2.1 The MEMHEFT algorithm

MEMHEFT is based on HEFT (Heterogeneous Earliest Finish Time) [103]. The HEFT algorithm is highly competitive and widely used to schedule static DAGs on heterogeneous platforms with a low time complexity. HEFT has two major phases: a task prioritizing phase for computing the priorities of all tasks, and a processor selection phase for allocating each task (in the order of their priorities) to its best processor, defined as the one which minimizes the task finish time.

The MEMHEFT algorithm follows the same pattern as HEFT. In our model, there are only two processor types, hence each selected task will be mapped on one of two candidates, namely the processors with earliest available time in each type. In other words, the processor selection phase can be renamed as the memory selection phase. In addition, MEMHEFT checks memory usage, as explained below.

Task prioritizing phase. This phase is the same as in HEFT and requires the priority of each task to be set with the upward rank value, rank(i), which is based on mean computation and mean communication costs:

∀i ∈ V, rank(i) = (Wi^(red) + Wi^(blue))/2 + max_{j∈Succ(i)} { rank(j) + Ci,j/2 },

where Succ(i) denotes the immediate successors of task i. The task list is generated by sorting the tasks by non-increasing order of rank(i). Tie-breaking is done randomly.

Memory selection phase. For each selected task i and for each memory µ ∈ {red, blue}, we have to compute EST^(µ)(i), the earliest execution start time of task i on memory µ (derived from a given partial schedule). This earliest execution start time has to take into account (i) resource, (ii) precedence, and (iii) memory constraints. From a resource perspective, task i cannot be executed on memory µ before one of the processors operating on memory µ is available. Thus resource_EST^(µ)(i), the earliest start time of task i on memory µ from a resource point of view, can be expressed as:

resource_EST^(µ)(i) = min_{proc operating on memory µ} { avail[proc] },

where avail[proc] is the finish time of the last task assigned to proc in the partial schedule. From a precedence perspective, all immediate predecessors j ∈ Pred(i) of task i must have been scheduled. Thus precedence_EST^(µ)(i), the earliest start time of task i on memory µ from a precedence point of view, is expressed as:

precedence_EST^(µ)(i) = max_{j∈Pred(i)} { AFT(j) + δj^(µ)·Cj,i }

where δj^(µ) = 0 if task j is executed on memory µ, and 1 otherwise, and AFT(j) is the actual finish time of task j in the partial schedule. From a memory perspective, we have to keep track of the memory consumption of our schedule to ensure that it does not violate the memory constraints. Thus, the MEMHEFT algorithm maintains for each memory µ the function free_mem^(µ)(t), which represents the amount of the µ memory available at time t in the partial schedule. Here free_mem^(µ) is a staircase function (its definition space R can be partitioned into a finite number of intervals on which free_mem^(µ) is constant) that can be stored as a list of couples [(x1, val1), ..., (xℓ, valℓ)] such that:

∀i ∈ [1, ℓ − 1], ∀t ∈ [xi, xi+1[, free_mem^(µ)(t) = vali

and ∀t ≥ xℓ, free_mem^(µ)(t) = valℓ. Note that valℓ can be non-zero, since the partial schedule may keep some files Fi,j stored in the memories if task i has been scheduled but task j has not. Thus, to process task i on memory µ at time t without violating memory constraints, there must be enough available memory to store all the input files of task i that were not stored on memory µ yet, and all its output files. Thus, the earliest start time of task i on memory µ from the memory point of view can be expressed as:

task_mem_EST^(µ)(i) = min { t such that ∀t′ ≥ t, free_mem^(µ)(t′) ≥ Σ_{j∈Pred(i)} (1 − δj^(µ))·Fj,i + Σ_{j∈Succ(i)} Fi,j }

If free_mem^(µ) is stored as a list of size ℓ, task_mem_EST^(µ)(i) can be computed in time O(ℓ). The MEMHEFT algorithm enforces that, when a task i is assigned to the memory µ, every communication (j, i) ∈ E such that δj^(µ) = 0 starts as late as possible, and they all have a processing time Ci^(µ) = max_{(j,i)∈E} { (1 − δj^(µ))·Cj,i }. Thus, to process task i on memory µ, the earliest start time of every communication (j, i) ∈ E from the memory point of view can be expressed as:

comm_mem_EST^(µ)(i) = min { t such that ∀t′ ≥ t, free_mem^(µ)(t′) ≥ Σ_{j∈Pred(i)} (1 − δj^(µ))·Fj,i }

If free_mem^(µ) is stored as a list of size ℓ, comm_mem_EST^(µ)(i) can be computed in time O(ℓ). Finally, the earliest execution start time of task i on memory µ is expressed as:

EST^(µ)(i) = max { resource_EST^(µ)(i), precedence_EST^(µ)(i), task_mem_EST^(µ)(i), comm_mem_EST^(µ)(i) + Ci^(µ) }

The selected task i is assigned to the memory µmin that minimizes its earliest finish time EFT^(µ)(i) = EST^(µ)(i) + Wi^(µ), and then to the processor proc that minimizes the idle time EST^(µmin)(i) − avail[proc].
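Both the task_mem_EST and comm_mem_EST computations reduce to the same O(ℓ) scan of the staircase profile; here is a minimal illustrative sketch (our own code, with the profile stored as described above):

    # Earliest t such that free_mem(t') >= need for all t' >= t, on a staircase
    # profile stored as couples [(x1, v1), ..., (xl, vl)]: value v_k holds on
    # [x_k, x_{k+1}) and v_l from x_l onward. Runs in O(l), as stated above.
    def earliest_start(profile, need):
        if profile[-1][1] < need:
            return None                    # infeasible: never enough memory
        start = profile[0][0]
        for k in range(len(profile) - 1, -1, -1):
            if profile[k][1] < need:
                start = profile[k + 1][0]  # exists, since the last value fits
                break
        return start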

3.5.2.2 The MEMMINMIN algorithm

The MEMMINMIN algorithm does not include a task prioritizing phase but dynamically decides the order in which tasks are mapped onto resources. It is the memory-aware counterpart of the MINMIN heuristic [25]. Indeed, at each step, MEMMINMIN maintains the set available_tasks representing the tasks whose predecessors have already been scheduled. Then it selects the task imin in available_tasks and the memory µmin ∈ {red, blue} that minimize EFT^(µ)(i), as defined in Section 3.5.2.1 (computed from a partial schedule). For a DAG D with |V| = n nodes and |E| = m edges, both heuristics have a worst-case complexity of O(n²(n + m)). Their pseudo-code is available in [66].
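For illustration, the selection loop of MEMMINMIN can be sketched as follows (our own code; eft and commit are hypothetical hooks into the partial schedule, and preds maps each task to the set of its predecessors):

    # Sketch of the MEMMINMIN selection loop.
    def mem_min_min(tasks, preds, eft, commit):
        done, order = set(), []
        while len(done) < len(tasks):
            ready = [i for i in tasks if i not in done and preds[i] <= done]
            # Pick the (task, memory) pair with the globally smallest EFT.
            i, mu = min(((i, mu) for i in ready for mu in ("red", "blue")),
                        key=lambda pair: eft(*pair))
            commit(i, mu)                  # update the partial schedule
            done.add(i)
            order.append((i, mu))
        return order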

3.5.3 Simulation results

In this section, we conduct several simulations to compare the two heuristics MEMHEFT and MEMMINMIN proposed in Section 3.5.2, and to assess their absolute performance w.r.t. the (optimal) ILP solution (Section 3.5.1). For each heuristic, we compute its makespan for various amounts of available blue and red memory. The heuristics have been implemented in Python 2.7. Source code for all the algorithms, heuristics and simulations is publicly available at http://perso.ens-lyon.fr/julien.herrmann/. The optimal makespan for small graphs has been computed by solving the ILP using the IBM ILOG CPLEX Interactive Optimizer 12.5.0.0.

3.5.3.1 Experimental setup

We use four different sets of DAGs: (i) two synthetic sets (randomly generated) of different sizes, SMALLRANDSET and LARGERANDSET; and (ii) two applicative sets (from linear algebra benchmarks), LUSET and CHOLESKYSET.

Random task graphs The first and second sets are random DAGs, generated using the Directed Acyclic Graph GENerator (DAGGEN)2. DAGGEN uses four popular parameters to define the shape of the DAG: size, width, density and jumps.

• The size determines the number of nodes in the DAG. Nodes are organized in levels.
• The width determines the maximum parallelism in the DAG, that is, the number of tasks in the largest level. A small value leads to "chain" graphs and a large value to "fork-join" graphs.
• The density denotes the number of edges between two levels of the DAG, with a low value leading to few edges and a large value to many edges.
• Finally, random edges are added that go from level l to levels l + 1, ..., l + jumps.

The first two parameters take values between 0 and 1. This DAG generation procedure is similar to the one used in [85].

SMALLRANDSET is a set of 50 randomly generated DAGs using the values size = 30, width = 0.3, density = 0.5 and jumps = 5. Then, for each node, the values Wi^(1) and Wi^(2) are randomly chosen between 1 and 20 and, for each edge, the values Ci,j and Fi,j are randomly chosen between 1 and 10. One graph of SMALLRANDSET is depicted in Figure 3.18. LARGERANDSET is a set of 100 randomly generated DAGs using the values size = 1000, width = 0.3, density = 0.5 and jumps = 5. Then, for each node and each edge, the values Wi^(1), Wi^(2), Ci,j and Fi,j are randomly chosen between 1 and 100. One graph of LARGERANDSET is depicted in Figure 3.19.

2The code for the generator is publicly available at https://github.com/frs69wq/daggen.
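For concreteness, the per-node and per-edge cost randomization applied to each generated DAG (here with the SMALLRANDSET ranges) can be sketched as follows; the container names are our own:

    # Sketch of the cost randomization of SMALLRANDSET (hypothetical
    # containers: a list of node ids and a list of edge pairs).
    import random

    def randomize_costs(nodes, edges):
        w1 = {i: random.randint(1, 20) for i in nodes}     # W_i^(1)
        w2 = {i: random.randint(1, 20) for i in nodes}     # W_i^(2)
        comm = {e: random.randint(1, 10) for e in edges}   # C_{i,j}
        files = {e: random.randint(1, 10) for e in edges}  # F_{i,j}
        return w1, w2, comm, files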

3.6 Conclusion

In particular, we have shown that it is impossible to approximate both memories by any pair of constant factors. In addition, we have determined the optimal depth-first traversal, which turns out to minimize both memories simultaneously. This depth-first traversal achieves nice results for realistic assembly trees. We have also proposed several heuristics, based upon extensions of Liu's optimal algorithm for the one-memory problem. These heuristics provide very good solutions when dealing with arbitrary tree graphs.

Given the NP-hardness of the restricted problem, we have proposed several approaches for the general case. We have first provided an exact resolution through the design of an intricate ILP which is able to compute an optimal schedule for medium-size instances (up to 30 tasks). Then, we have proposed two memory-aware heuristics for larger instances, which are the counterparts of the classical HEFT and MINMIN algorithms. We have studied the performance of these new heuristics through extensive simulations on different task graphs, and compared them to the optimal solution for small instances.

An interesting future work would be to include some of the proposed heuristics in an actual runtime toolkit for hybrid platforms such as StarPU [15]. It would also be of interest to adapt the heuristics to more complex platforms, such as hybrid platforms with several types of accelerators, and/or including more than two memories.

Chapter 4

Assessing the cost of redistribution followed by a computational kernel

When dealing with a distributed computational platform, the data layout has a strong impact on the overall performance. As said in the introduction of this thesis, modern linear algebra libraries use level-3 BLAS kernels that partition the matrices into blocks. We saw in Chapter 1 that, in a distributed memory context, the blocks are mapped onto the resources using a 2D block-cyclic distribution, which nicely balances the workload across processors. When factorizing a matrix, we always considered that the blocks were initially mapped onto the distributed platform with a 2D block-cyclic pattern, allowing an efficient owner-compute strategy during the factorization. In this chapter, we consider the case where the matrix may have an arbitrary initial distribution, requiring a redistribution. We do not restrict ourselves to targeting a 2D block-cyclic distribution, but rather tackle the general redistribution problem.

The classical redistribution problem aims at optimally scheduling communications when reshuffling from an initial data distribution to a target data distribution. This target data distribution is usually chosen to optimize some objective for the algorithmic kernel under study (good computational balance, or low communication volume or cost), and therefore to provide high efficiency for that kernel. However, the choice of a distribution minimizing the target objective is not unique. This leads to generalizing the redistribution problem as follows: find a re-mapping of data items onto processors such that the data redistribution cost is minimal, and the operation remains as efficient. This chapter studies the complexity of this generalized problem. We compute optimal solutions and evaluate, through simulations, their gain over classical redistribution. We also show the NP-hardness of the problem of finding the optimal data partition and processor permutation (defined by new subsets) that minimize the cost of redistribution followed by a simple computational kernel. Finally, experimental validation of the new redistribution algorithms is conducted on a multicore cluster, for both a 1D-stencil kernel and a more compute-intensive dense linear algebra routine.

4.1 Introduction

In parallel computing systems, data locality has a strong impact on application performance. To achieve good locality, a redistribution of the data may be needed between two different phases of the application, or even at the beginning of the execution, if the initial data layout is not suitable for performance. Data redistribution algorithms are critical to many applications, and therefore have received considerable attention. The data redistribution problem can be stated informally as follows: given N data items that are currently distributed across P processors, redistribute them according to a different target layout. Consider for instance a dense square matrix A = (aij)0≤i,j<n whose elements are initially distributed across the processors at random, and that must be redistributed into square blocks across a p × p 2D-grid layout.

random, and that must be redistributed into square blocks across a p × p 2D-grid layout. A scenario for this problem is that the matrix has been generated by a Monte-Carlo method and is now needed for some matrix product C ← C + AB. Assume for simplicity that p divides n, and let r = n/p. In this example, N = n², P = p², and the redistribution will gather a block of r × r data elements of A on each processor, as illustrated in Figure 4.1. More precisely, all the elements of block Ai,j = (ak,ℓ), where ri ≤ k < r(i + 1) and rj ≤ ℓ < r(j + 1), must be sent to processor Pi,j. This example illustrates the classical redistribution problem. Depending upon the cost model for communications, various optimization objectives have been considered, such as the total volume of data that is moved from one processor to another, or the total time for the redistribution, if several communications can take place simultaneously. We detail classical cost models in Section 4.2, which is devoted to related work.
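To make the introductory example concrete, here is a minimal Python sketch (our own illustration, not part of the original text; names and toy sizes are arbitrary) of the full-block target mapping:

def target_processor(k, l, r):
    # Full-block target mapping: element a[k][l] belongs to block A_{i,j}
    # with r*i <= k < r*(i+1) and r*j <= l < r*(j+1), hence goes to P_{i,j}.
    return (k // r, l // r)

# Toy instance matching Figure 4.1: n = 12, p = 3, r = n / p = 4.
n, p = 12, 3
r = n // p
assert target_processor(5, 10, r) == (1, 2)  # a[5][10] must be sent to P_{1,2}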

Figure 4.1: Example of matrix redistribution with N = 12² data blocks (n = 12, r = 4) and P = 3² processors: (a) processors holding the data; (b) initial data distribution; (c) target data partition; (d) final data distribution. Each color in the data distributions corresponds to a processor, e.g., all red data items reside on processor A.

Modern computing platforms are equipped with interconnection switches and routing mechanisms mapping the most usual interconnection graphs onto the physical network with reduced (or even negligible) dilation and contention. Continuing with the example, the p × p 2D-grid will be virtual, i.e., an overlay topology mapped into the physical topology, forcing the interconnection switch to emulate a 2D-grid. Notwithstanding, the layout of the processors in the grid remains completely flexible. For instance, the processors labeled P1,1, P1,2 and P2,1 can be any processors in the platform, and we have the freedom to choose which three processors will indeed be labeled as the top-left corner processors of the virtual grid. Now, to describe the matrix product on the 2D-grid, we say that data will be sent horizontally between P1,1 and P1,2, and vertically between P1,1 and P2,1, but this actually means that these messages will be routed by the actual network, regardless of the physical position of the three processors in the platform. This leads us to revisit the redistribution problem, adding the flexibility to select the best assignment of data on the processors (according to the cost model). The problem can be formulated as mapping a partition of the initial data onto the resources: there are P data subsets (the blocks in the example) to be assembled onto P processors, with a huge (exponential) number, namely P!, of possible mappings. An intuitive view of the problem is to assign the same color to all data items that initially reside on the same processor, and to look for a coloring of the virtual grid that will minimize the redistribution cost. For instance, in Figure 4.1, most data items of the block allocated to the virtual processor P1,1 are initially colored red (they reside on the red processor A), so we decided to map P1,1 on processor A to avoid moving these items. One major goal of this chapter is to assess the complexity of the problem of finding the best processor mapping for a given initial data distribution and a target data partition. This amounts to determining the processor assignment that minimizes the cost of redistributing the data according to the partition. There are P! possible redistributions, and we aim at finding the one minimizing a predefined cost function. In this chapter, we use the two most widely-used criteria in the literature to compute the cost of a redistribution:

• Total volume. In this model, the platform is not dedicated, and the objective is to minimize the total communication volume, i.e., the total amount of data sent from one processor to another. Minimizing this volume makes it less likely to disrupt the other applications running on the platform, and is expected to decrease network contention, hence redistribution time. Conceptually, this is equivalent to assuming that the network is a bus, globally shared by all computing resources.

• Number of parallel steps. In this model, the platform is dedicated to the application, and several communications can take place in parallel, provided that they involve different processor pairs. This is the one-port bi-directional model used in [68, 71]. The quantity to minimize is the number of parallel steps, where a step is a collection of unit-size messages that involve different processor pairs.

One major contribution of this chapter is the design of an algorithm solving this optimization problem for either criterion.
We also provide various experiments to quantify the gain that results from choosing the optimal mapping rather than a canonical mapping where processors are labeled arbitrarily, and independently of the initial data distribution. As mentioned earlier, a redistribution is usually motivated by the need to efficiently execute in parallel a subsequent computational kernel. In most cases, there may well be several data partitions that are suitable for the efficient execution of this kernel. The optimal partition also depends upon the initial data distribution. Coming back to the introductory example, where the redistribution is followed by a matrix product, we may ask whether a full block partition is absolutely needed. If the original data is distributed along a suitable, well-balanced distribution, a simple solution is to compute the product in place, using the owner-computes rule, that is, we let the processor holding Ci,j compute all Ai,kBk,j products. This means that elements of A and B will be communicated during the computation, when needed. On the contrary, if the original distribution has a severe imbalance, with some processors holding significantly more data than others, a redistribution is very likely needed. But in this latter case, do we really need a perfect full block partition? In fact, the optimization problem is the following: given an initial data distribution, what is the best data partition, and the best mapping of this partition onto the processors, to minimize the total execution time, defined as the sum of the redistribution time and of the execution time of the kernel. Another major contribution of this chapter is to assess the complexity of this intricate problem. Finding the optimal partition and mapping becomes NP-complete when coupling the redistribution with a simple computational kernel such as an iterative 1D-stencil kernel. Here the optimization objective is the sum of the redistribution time (computed using either of the two criteria above, with all communications serialized or with communications organized in parallel steps), and of the parallel execution time of a few steps of the stencil. Intuitively, this confirms that determining the optimal data partition and its mapping is a difficult task. Stencil computations naturally favor block distributions, in order to communicate only block frontiers at each iteration. But this has to be traded off with the cost of moving the data from the initial distribution, with the number of iterations, and with the possible imbalance of the final distribution that is chosen (whose own impact depends upon the communication-to-computation ratio of the machine). Altogether, it is no surprise that all these possibilities lead to a hard combinatorial problem. Finally, this chapter provides an experimental validation of the new redistribution algorithms, conducted on a multicore cluster. We first experiment with the 1D-stencil algorithm and obtain performance improvements in total execution time that strongly depend on the initial distributions. Different data configurations have been tested to assess this gain. For a more compute-intensive dense linear algebra routine, such as QR factorization, redistributing the data items can also be necessary. The 2D block-cyclic partition is known to offer a good trade-off between the amount of communications during the QR factorization and the load balancing among processors.
Using the algorithms to determine the best distribution compatible with the 2D block-cyclic partition provides significant improvement in the completion time. The rest of the chapter is organized as follows. We survey related work in Section 4.2. We detail the model and formally state the optimization problems in Section 4.3. We deal with the problem of finding the best redistribution for a given data partition in Section 4.4. Sections 4.4.1 and 4.4.2 provide algorithms computing the optimal solution, while Section 4.4.3 reports simulation results showing the gain over redistributing to an arbitrary compatible distribution. In Section 4.5, we couple the redistribution with a stencil kernel, and show that finding the optimal data partition, together with the corresponding redistribution, is NP-complete. Experiments conducted on a multicore cluster are reported in Section 4.6. Section 4.6.1 is devoted to the experimental setup. Section 4.6.2 provides results when redistribution is followed by a stencil kernel, while Section 4.6.3 deals with QR factorization. We provide final remarks and directions for future work in Section 4.7.

4.2 Related work

4.2.1 Communication model

The macro-dataflow model has been widely used in the scheduling literature (see the survey papers [84, 97, 33, 47] and the references therein). In this model, the cost to communicate L bytes is α + Lβ, where α is a start-up cost and β is the inverse of the bandwidth. In this chapter, we consider large, same-sized data items, so we can safely restrict to unit communications that involve a single data item; we integrate the start-up cost into the cost of a unit communication. In the macro-dataflow model, communication delays from one task to its successor are taken into account, but communication resources are not limited. First, a processor can send (or receive) any number of messages in parallel, hence an unlimited number of communication ports is assumed (this explains the name macro-dataflow for the model). Second, the number of messages that can simultaneously circulate between processors is not bounded, hence an unlimited number of communications can simultaneously occur on a given link. In other words, the communication network is assumed to be contention-free, which of course is not realistic as soon as the processor number exceeds a few units. A much more realistic communication model is the one-port bidirectional model where, at a given time-step, any processor can communicate with at most one other processor in both directions: sending to and receiving from. Thus, communications can occur in parallel, provided that they involve disjoint pairs of sending/receiving processors. The one-port model was introduced by Hollermann et al. [68], and Hsu et al. [71]. It has been widely used since, both for homogeneous and heterogeneous platforms [17, 18].

4.2.2 General data redistribution

The complexity of scheduling data redistribution in distributed architectures strongly depends on the network model. When the network has a general graph topology, achieving the minimal completion time for a set of communications is NP-complete, even when the time required to move a data item along any link is constant [92]. In this context, several variants of the one-port bidirectional model have been considered. The first variant is a unidirectional one-port model, where a processor can participate in only one communication at a time (either as a sender or as a receiver); with this variant, the redistribution problem becomes NP-complete [78]. A second variant consists of assuming that each processor p has a number of ports v(p) that represents the maximum number of simultaneous transfers that this processor can be involved in [34]. Finally, in a third variant [13], processors have memory constraints that must be enforced during the redistribution process.

4.2.3 Array redistribution

A specific class of redistribution problems has received considerable attention, namely the redistribution of arrays that are already distributed in a block-cyclic fashion over a multidimensional processor grid. This interest was originally motivated by the HPF [79] programming style, in which scientific applications are decomposed into phases. At each phase, there is an optimal distribution of the data arrays onto the processor grid. Typically, arrays are distributed according to a CYCLIC(r) pattern¹ along one or several dimensions of the grid. The best value of the distribution parameter r depends on the characteristics of the algorithmic kernel as well as on the communication-to-computation ratio of the target machine [41]. Because the optimal value of r changes from phase to phase and from one machine to another (think of a heterogeneous environment), run-time redistribution turns out to be a critical operation, as stated in [76, 106, 108, 102] (among others). Communications are scheduled into parallel steps, which involve different processor pairs. The model comes in two variants, synchronous and asynchronous. In the synchronous variant, the cost of a parallel step is the maximal size of a message and the objective is to minimize the sum of the costs of the steps [106, 38]. In the asynchronous model, some overlap is allowed between communication steps [62]. Finally, the ScaLAPACK library provides a set of routines to perform array redistribution [88]. A total exchange is organized between processors, which are arranged as a (virtual) caterpillar. The total exchange is implemented as a succession of synchronous steps. We point out that all the works referenced in Section 4.2.2 and in this one deal with a fixed target distribution. To the best of our knowledge, this work is the first to consider target data partitions rather than target data distributions, thereby allowing to choose the best data redistribution among P! candidates, where P is the number of enrolled processors. Also, this work is the first to study the cost of coupling a redistribution with a computational kernel, which is a very important problem in practice.

4.3 Model and framework

This section details the framework and formally states the optimization problems. We start with a few definitions.

4.3.1 Definitions

Consider a set of N data items (numbered from 0 to N − 1) distributed onto P processors (numbered from 0 to P − 1).

Definition 4.1 (Data distribution). A data distribution D defines the mapping of the elements onto the processors: for each data item x, D(x) is the processor holding it.

¹ The definition is the following: let an array X[0...M − 1] be distributed according to a block-cyclic distribution CYCLIC(r) onto a linear grid of P processors. Then element X[i] is mapped onto processor p = ⌊i/r⌋ mod P, 0 ≤ p ≤ P − 1.
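The CYCLIC(r) pattern of the footnote is easily expressed in Python; the following one-liner (a hypothetical helper of ours, not from the thesis) may clarify it:

def cyclic_owner(i, r, P):
    # CYCLIC(r): element X[i] is mapped onto processor floor(i/r) mod P.
    return (i // r) % P

# With r = 2 and P = 3, blocks of 2 consecutive elements cycle over processors.
assert [cyclic_owner(i, 2, 3) for i in range(8)] == [0, 0, 1, 1, 2, 2, 0, 0]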

Definition 4.2 (Data partition). A data partition P associates to each data item x an index P(x) (0 ≤ P(x) ≤ P − 1) so that two data items with the same index reside on the same processor (not necessarily processor P(x)). The jth component of the data partition P is the subset of the data items x such that P(x) = j.

It is straightforward to see that a data distribution D defines a single corresponding data partition P = D. However, a given data partition does not define a unique data distribution. On the contrary, any of the P! permutations of {0,...,P − 1} can be used to map a data partition onto the processors.

Definition 4.3 (Compatible distribution). A data distribution D is compatible with a data partition P if and only if there exists a permutation σ of {0,...,P − 1} such that, for each data item x, D(x) = σ(P(x)).
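The following Python sketch (our own illustration; we store a distribution and a partition as plain lists indexed by data item) checks Definition 4.3 by trying to build the permutation σ:

def is_compatible(D, Pt):
    # D is compatible with Pt iff D(x) = sigma(Pt(x)) for some permutation
    # sigma, i.e., the induced map component -> processor is well defined
    # and injective (it then extends to a permutation of {0, ..., P-1}).
    sigma = {}
    for proc, comp in zip(D, Pt):
        if sigma.setdefault(comp, proc) != proc:
            return False                 # component mapped to two processors
    return len(set(sigma.values())) == len(sigma)

# A partition with two components and two of its compatible distributions:
assert is_compatible([0, 0, 1, 1], [0, 0, 1, 1])      # sigma = identity
assert is_compatible([1, 1, 0, 0], [0, 0, 1, 1])      # sigma = transposition
assert not is_compatible([0, 1, 1, 1], [0, 0, 1, 1])  # no valid sigma exists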

4.3.2 Cost of a redistribution

In this section, we formally state the two metrics for the cost of a redistribution, namely the total volume and the number of parallel steps. Both metrics assume that the communication of one data item from one processor to another takes the same amount of time, regardless of the item and of the location of the source and target processors. Indeed, data items can be anything from single elements to matrix tiles, columns or rows, so that our approach is agnostic of the granularity of the redistribution. As already mentioned, many modern interconnection networks are fully-connected switches, and they can implement any (same-length) communication in the same amount of time. Note that with asymmetric networks, it is always possible to use the worst-case communication time between any processor pair as the unit time for a communication.

4.3.2.1 Total volume

For this metric, we simply count the number of data items that are sent from one processor to another. This metric may be pessimistic if some parallelism is possible, but it provides an interesting measure of the overhead of the redistribution, especially if the platform is not dedicated. Given an initial data distribution Dini and a target distribution Dtar, for 0 ≤ i, j ≤ P − 1, let qi,j be the number of data items that processor i must send to processor j: qi,j is the number of data items x such that Dini(x) = i and Dtar(x) = j. For a given processor i, let si (respectively ri) be the total number of data items that processor i must send (respectively receive) during the redistribution. We have si = ∑j≠i qi,j and ri = ∑j≠i qj,i. The total communication volume of the redistribution is defined as

RedistVol(Dini → Dtar) = ∑i si = ∑i ri.

4.3.2.2 Number of parallel steps

With this metric, some communications can take place in parallel, provided that each of them involves a different processor pair (sender and receiver). This communication model is the bidirectional one-port model introduced in [68, 71] and accounts for contention when communications take place simultaneously. We define a parallel step as a set of unit-size communications (one data item each) such that all senders are different, and all receivers are different (the set of senders and the set of receivers are not necessarily disjoint). With this definition, a processor can send and receive a data item at the same time, but cannot send (respectively receive) a data item to (respectively from) more than one processor during the same communication step. Given an initial data distribution Dini and a target distribution Dtar, we define RedistSteps(Dini → Dtar) as the minimal number of parallel steps that are needed to perform the redistribution.
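Both metrics are straightforward to evaluate for a given pair (Dini, Dtar). The sketch below (our own helper names) anticipates the closed form RedistSteps(Dini → Dtar) = maxi max(si, ri) that is proved in Section 4.4.2:

from collections import Counter

def redist_vol(D_ini, D_tar):
    # Total volume: number of data items x with D_ini(x) != D_tar(x).
    return sum(1 for a, b in zip(D_ini, D_tar) if a != b)

def redist_steps(D_ini, D_tar):
    # Parallel steps in the bidirectional one-port model, using the
    # closed form max_i max(s_i, r_i) established in Section 4.4.2.
    s, r = Counter(), Counter()
    for a, b in zip(D_ini, D_tar):
        if a != b:
            s[a] += 1   # item leaves processor a
            r[b] += 1   # item enters processor b
    return max(max(s.values(), default=0), max(r.values(), default=0))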

4.3.3 Optimization problems

Here, we formally introduce the optimization problems that we study in Sections 4.4 and 4.5 below.

4.3.3.1 Best redistribution compatible with a given partition

In the optimization problems of Section 4.4, the data partition is given, and we aim at finding the best compatible target distribution (among P! possible ones). More precisely, given an initial data distribution Dini and a target data partition Ptar, we aim at finding a data distribution Dtar that is compatible with Ptar and such that the redistribution cost from Dini to Dtar is minimal. Since we have two cost metrics, we define two problems:

Definition 4.4 (VOLUMEREDISTRIB). Given Dini and Ptar , find Dtar compatible with Ptar such that RedistVol(Dini → Dtar ) is minimized.

Definition 4.5 (STEPREDISTRIB). Given Dini and Ptar , find Dtar compatible with Ptar such that RedistSteps(Dini → Dtar ) is minimized.

We show in Section 4.4 that both problems have polynomial complexity.

4.3.3.2 Best partition and best compatible redistribution

In the optimization problems of Section 4.5, the data partition is no longer fixed. Given an initial data distribution Dini , we aim at executing some computational kernel whose cost Tcomp(Ptar ) depends upon the data partition Ptar that will be selected. Note that this computational kernel will have the same execution cost for any distribution Dtar compatible with Ptar , because of the symmetry of the target platform. However, the redistribution cost from Dini to Dtar will itself depend upon Dtar . We model the total cost as the sum of the time of the redistribution and of the computation. Letting τcomm denote the time to perform a communication, the time to execute the redistribution is either RedistVol(Dini → Dtar )×τcomm or RedistSteps(Dini → Dtar )×τcomm , depending upon the communication model. This leads to the following two problems:

Definition 4.6 (VOLPART&REDISTRIB). Given Dini , find Ptar , and Dtar compatible with Ptar , such that Ttotal = RedistVol(Dini → Dtar ) × τcomm + Tcomp(Ptar ) is minimized.

Definition 4.7 (STEPPART&REDISTRIB). Given Dini , find Ptar , and Dtar compatible with Ptar , such that Ttotal = RedistSteps(Dini → Dtar ) × τcomm + Tcomp(Ptar ) is minimized.

Note that both problems require that we are able to compute Tcomp(Ptar) for any target data partition Ptar. This is realistic only for very simple computational kernels. In Section 4.5, we consider such a kernel, namely the 1D-stencil. We show the NP-completeness of both VOLPART&REDISTRIB and STEPPART&REDISTRIB for this kernel, thereby assessing the difficulty of coupling redistribution and computations.

4.4 Redistribution

This section deals with the VOLUMEREDISTRIB and STEPREDISTRIB problems: given a data partition Ptar and an initial data distribution Dini, find one target distribution Dtar, among all P! possible compatible target distributions, that minimizes the cost of the redistribution, either expressed in total volume or in number of parallel steps. We show that both problems have polynomial complexity. As a side note, we point out that these results directly extend to the case where we have different numbers of processors for the source and target distributions.

4.4.1 Total communication volume

Theorem 4.1. Given an initial data distribution Dini and a target data partition Ptar, Algorithm 7 computes a data distribution Dtar compatible with Ptar such that RedistVol(Dini → Dtar) is minimized, and its complexity is O(NP² + P³).

Proof. Using the definition of si and ri from Section 4.3.2.1, the total volume of communications during the redistribution phase from the initial distribution to the target distribution is

RedistVol(Dini → Dtar) = ∑0≤i≤P−1 si = ∑0≤i≤P−1 ri.

Solving VOLUMEREDISTRIB amounts to finding a one-to-one perfect matching between each component of the target data partition and the processors, so that the total volume of communications is minimized. Algorithm 7 builds the complete bipartite graph where the two sets of vertices represent the P processors and the P components of the target data partition. Each edge (i, j) of this graph is weighted with the amount of data that processor Pi would have to receive if matched to component j of the data partition. Computing the weight of the edges can be done with complexity O(NP²). The complexity of finding a minimum-weight perfect matching in a bipartite graph with n vertices and m edges is O(n(m + n log n)) (see Corollary 17.4a in [94]). Here n = P and m = P², hence the overall complexity of Algorithm 7 is O(NP² + P³).

Note that, in Algorithm 7, the complexity of computing edge weights may easily be reduced to O(NP + P²): (i) we first initialize all weights to 0 (in O(P²)); (ii) then, for each data item x and each i ≠ Dini(x), the weight of edge (i, Ptar(x)) is incremented (in O(NP)). With this optimization, the complexity of Algorithm 7 can be reduced to O(NP + P³).

4.4.2 Number of parallel communication steps

The second metric is the number of parallel communication steps in the bidirectional one-port model. Note that this objective is quite different from the total communication volume: consider for instance a processor which has to send and/or receive much more data than the others; all the communications involving this processor will have to be performed sequentially, creating a bottleneck.

Theorem 4.2. Given an initial data distribution Dini and a target data partition Ptar, Algorithm 8 computes a data distribution Dtar compatible with Ptar such that RedistSteps(Dini → Dtar) is minimized, and its complexity is O(NP² + P^(9/2)).

Algorithm 7: BESTDISTRIBFORVOLUME
Data: initial data distribution Dini and target data partition Ptar
Result: a data distribution Dtar compatible with the given data partition, such that RedistVol(Dini → Dtar) is minimized
A ← {0,...,P − 1} (set of processors)
B ← {0,...,P − 1} (set of data partition indices)
G ← complete bipartite graph (V, E) where V = A ∪ B
for each edge (i, j) in E do
    weight(i, j) ← |{x s.t. Ptar(x) = j and Dini(x) ≠ i}|
M ← minimum-weight perfect matching of G
for each (i, j) ∈ M do
    for each x s.t. Ptar(x) = j do
        Dtar(x) ← i
return Dtar
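As a quick way to prototype Algorithm 7, the minimum-weight perfect matching can be delegated to SciPy's linear_sum_assignment solver. The sketch below is our own illustration under that assumption (it also uses the optimized O(NP + P²) weight computation mentioned above), not the implementation evaluated in the thesis:

import numpy as np
from scipy.optimize import linear_sum_assignment

def best_distrib_for_volume(D_ini, P_tar, P):
    # weight[i, j] = items of component j NOT already on processor i,
    # i.e., what processor i would have to receive if matched to j.
    on = np.zeros((P, P), dtype=int)       # on[i, j]: items of comp j on proc i
    for i, j in zip(D_ini, P_tar):
        on[i, j] += 1
    weight = on.sum(axis=0)[np.newaxis, :] - on    # |comp j| - on[i, j]
    procs, comps = linear_sum_assignment(weight)   # min-weight perfect matching
    proc_of = {j: i for i, j in zip(procs, comps)}
    return [proc_of[j] for j in P_tar]             # D_tar(x) = proc_of[P_tar(x)]

The assignment solver runs in time cubic in P, consistent with the O(NP + P³) bound above.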

Proof. First, given an initial data distribution Dini and a target distribution Dtar , we can compute RedistSteps(Dini → Dtar ) as

RedistSteps(Dini → Dtar) = max0≤i≤P−1 max(si, ri).

This well-known result [38] is a direct consequence of König's theorem (see Theorem 20.1 in [94]), stating that the edge-coloring number of a bipartite multigraph is equal to its maximum degree. Algorithm 8 builds the complete bipartite graph G where the two sets of vertices represent the P processors and the P components of Ptar. Each edge (i, j) of the complete bipartite graph is weighted with the maximum between the amount ri,j of data that processor i would have to receive if matched to component j of the data partition, and the amount of data that it would have to send in the same scenario. A one-to-one matching between the two sets of vertices whose maximal edge weight is minimal represents an optimal solution to STEPREDISTRIB. We denote by Mopt such a matching and by mopt its maximal edge weight. Since there are P processors and P components in Ptar, the one-to-one matching Mopt is a matching of size P. Algorithm 8 prunes an edge with maximum weight from G until it is not possible to find a matching of size P, and it returns the last matching of size P. We denote by Mret this matching and by mret its maximum edge weight. Using a proof by contradiction, we first assume that mret > mopt. Then matching Mopt only contains edges with weight strictly smaller than mret. Since Algorithm 8 prunes edges starting from the heaviest ones, these edges are still in G when Algorithm 8 returns Mret. Thus we can remove the edges with maximal weight mret in Mret and still have a matching of size P. This contradicts the stopping condition of Algorithm 8. Thus mret = mopt and the matching returned by Algorithm 8 is a solution to STEPREDISTRIB. Again, computing edge weights can be done with complexity O(NP² + P²). Algorithm 8 uses the Hopcroft–Karp algorithm [69] to find the maximum cardinality matching of a bipartite graph G = (V, E) in time O(|E|√|V|). There are no more than P² iterations of the while loop, and Algorithm 8 has a worst-case complexity of O(NP² + P^(9/2)).

Note that, as in the previous section, the complexity of Algorithm 8 can easily be reduced to O(NP + P^(9/2)) with an optimized weight computation.

Algorithm 8: BESTDISTRIBFORSTEPS
Data: initial data distribution Dini and target data partition Ptar
Result: a data distribution Dtar compatible with the given data partition, such that RedistSteps(Dini → Dtar) is minimized
A ← {0,...,P − 1} (set of processors)
B ← {0,...,P − 1} (set of data partition indices)
G ← complete bipartite graph (V, E) where V = A ∪ B
for each edge (i, j) in E do
    ri,j ← |{x s.t. Ptar(x) = j and Dini(x) ≠ i}|
    si,j ← |{x s.t. Ptar(x) ≠ j and Dini(x) = i}|
    weight(i, j) ← max(ri,j, si,j)
M ← maximum cardinality matching of G (using the Hopcroft–Karp algorithm)
while |M| = P do
    Msave ← M
    suppress all edges of G with maximum weight
    M ← maximum cardinality matching of G (using the Hopcroft–Karp algorithm)
return Msave
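Algorithm 8 can be prototyped in the same spirit with NetworkX's Hopcroft–Karp matching; the pruning loop below mirrors the pseudocode. Again, this is an illustrative sketch with our own names, not the thesis implementation:

import networkx as nx
from networkx.algorithms.bipartite import hopcroft_karp_matching

def best_distrib_for_steps(D_ini, P_tar, P):
    on = [[0] * P for _ in range(P)]        # on[i][j]: items of comp j on proc i
    for i, j in zip(D_ini, P_tar):
        on[i][j] += 1
    csize = [sum(on[i][j] for i in range(P)) for j in range(P)]  # |comp j|
    hosted = [sum(on[i]) for i in range(P)]                      # items on proc i
    G = nx.Graph()
    for i in range(P):
        for j in range(P):
            r_ij = csize[j] - on[i][j]      # items to receive if i hosts comp j
            s_ij = hosted[i] - on[i][j]     # items to send in the same scenario
            G.add_edge(("p", i), ("c", j), weight=max(r_ij, s_ij))
    top = [("p", i) for i in range(P)]
    saved = None
    while G.number_of_edges() > 0:
        m = hopcroft_karp_matching(G, top)  # dict with both endpoints as keys
        if len(m) < 2 * P:                  # no perfect matching left: stop
            break
        saved = m                           # remember the last perfect matching
        w_max = max(d["weight"] for _, _, d in G.edges(data=True))
        G.remove_edges_from([(u, v) for u, v, d in G.edges(data=True)
                             if d["weight"] == w_max])
    return [saved[("c", j)][1] for j in P_tar]   # D_tar(x) = matched processor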

4.4.3 Evaluation of optimal vs. arbitrary redistributions

In this section, we conduct several simulations to illustrate the interest of the two algorithms introduced above. In particular, we show that in many cases it is important to optimize the mapping rather than resorting to an arbitrary mapping, which could induce many more communications. Source code for the algorithms and simulations is publicly available at http://perso.ens-lyon.fr/julien.herrmann/.

4.4.3.1 Random balanced initial data distribution

First we consider a random balanced initial data distribution Dini, where each processor initially hosts D data items, and each data item has the same probability to reside on any processor. Most parallel applications require perfect load balancing to achieve good performance, and thus a balanced data partition. Therefore, we consider here a balanced target data partition Ptar (each of its P components includes D data items). We denote by Dcan the canonical data distribution (compatible with partition Ptar) which maps its jth component onto processor j.

As seen in Section 4.3, the volume of communication involved during the redistribution from Dini to Dcan is RedistVol(Dini → Dcan) = ∑0≤j≤P−1 |{x s.t. Ptar(x) = j and Dini(x) ≠ j}|. Since each component of Ptar has cardinal D, and Dini(x) is equal to j with probability 1/P for any processor j and any data item x, we can compute the expected volume of communication: E(RedistVol(Dini → Dcan)) = D(P − 1). Thus, picking an arbitrary target distribution leads to an average volume of communications linear in P. Each processor hosts D data items at the beginning and at the end of the redistribution phase. Thus, according to Section 4.4.2, the number of steps required to schedule the redistribution phase is equal to D if and only if one of the P processors has to send its complete initial data set during the redistribution.
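The expectation E(RedistVol(Dini → Dcan)) = D(P − 1) is easy to confirm empirically. The Monte-Carlo sketch below (our own, with arbitrary parameter values) averages the volume towards the canonical target over random balanced initial distributions:

import random

def average_canonical_volume(P=8, D=50, trials=1000):
    total = 0
    for _ in range(trials):
        owners = [p for p in range(P) for _ in range(D)]
        random.shuffle(owners)    # owners[x] = D_ini(x): balanced by design
        # Canonical target: component j = items j*D .. (j+1)*D - 1 -> proc j.
        total += sum(1 for x, o in enumerate(owners) if o != x // D)
    return total / trials         # converges to D * (P - 1) = 350 here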

4.5 Coupling redistribution and stencil computations

In this section, we couple the redistribution with a simple computational kernel, the 1D-stencil. We first describe the application model before establishing the NP-completeness of minimizing the cost of a redistribution followed by the execution of the kernel.

4.5.1 Application model

We consider here a three-point stencil with circular arrangement of the data. More precisely, to compute the value x(i, t) of the data at position i at step t, we need its value and those of its left and right neighbors at the previous step, namely x(i, t − 1), x(i − 1 mod N, t − 1), and x(i + 1 mod N, t − 1). If the neighbors are not stored on the same processor, their values have to be received from the processors hosting them. Thus, each iteration of the stencil algorithm consists of two phases: the communication phase, when the value of each data item is sent to the processors hosting its neighbors, and the computation phase, when each data item is updated according to a given kernel using these values (see Algorithm 9). The update kernel depends on the application.

Algorithm 9: One iteration of the unidimensional stencil algorithm
Data: N data items numbered from 0 to N − 1 and their distribution D on P processors
for 0 ≤ x ≤ N − 1 in parallel do
    ℓx ← (x − 1) mod N; rx ← (x + 1) mod N
    if D(ℓx) ≠ D(x) then processor D(x) receives data item ℓx from processor D(ℓx)
    if D(rx) ≠ D(x) then processor D(x) receives data item rx from processor D(rx)
for 0 ≤ x ≤ N − 1 in parallel do
    processor D(x) updates data item x using ℓx and rx
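For reference, here is a direct sequential emulation of one iteration of Algorithm 9 (our own sketch; the update kernel is passed as a function, since it depends on the application):

def stencil_iteration(values, D, update):
    # One iteration of the circular three-point stencil: count the item
    # transfers induced by distribution D, then update every item.
    N = len(values)
    comms = sum(int(D[(x - 1) % N] != D[x]) + int(D[(x + 1) % N] != D[x])
                for x in range(N))
    new_values = [update(values[(x - 1) % N], values[x], values[(x + 1) % N])
                  for x in range(N)]
    return new_values, comms

# Full-block distribution of 8 items on 2 processors: only 4 transfers per
# iteration (one in each direction at each of the two circular frontiers).
vals, sent = stencil_iteration(list(range(8)), [0] * 4 + [1] * 4,
                               lambda l, c, r: (l + c + r) / 3)
assert sent == 4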

Given a data partition Ptar, let Nij be the number of data items sent by the processor hosting the ith component of Ptar to the processor hosting the jth component during one communication phase of the stencil algorithm: Nij is the number of data items in the ith component that are left or right neighbors of data items in the jth component. Formally:

Nij = |{0 ≤ x ≤ N − 1 s.t. Ptar(x) = i and (Ptar(x − 1 mod N) = j or Ptar(x + 1 mod N) = j)}|.

The workload ℓi of the processor hosting the ith component of Ptar is:

ℓi = |{0 ≤ x ≤ N − 1 s.t. Ptar (x) = i}|.

Given a data partition Ptar, the running time of the stencil algorithm depends on the communication model, but not on the actual data distribution, provided that it is compatible with Ptar. Let τcomm be the time needed to perform one communication (see Section 4.3.3), and let τcalc be the time needed to perform one data update for the considered stencil application. The processing time for K iterations of the stencil with the two communication models is the following (using the notations of Section 4.3.3):

• Total volume: For problem VOLPART&REDISTRIB, Tcomp(Ptar) = K × Tvol^iter(Ptar), where

Tvol^iter(Ptar) = τcomm × ∑0≤i≤P−1 ∑j≠i Nij + τcalc × max0≤i≤P−1 ℓi.

The first term corresponds to the serialization of all communications, and the second one to the parallel processing of the updates.

• Number of parallel steps: For problem STEPPART&REDISTRIB, Tcomp(Ptar) = K × Tsteps^iter(Ptar), where

Tsteps^iter(Ptar) = τcomm × max0≤i≤P−1 max(∑j≠i Nij, ∑j≠i Nji) + τcalc × max0≤i≤P−1 ℓi.

Here the first term corresponds to the time needed to perform the required number of communication steps, and the second term is unchanged.
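Both per-iteration costs are easy to evaluate for a candidate partition. Here is a sketch (our own helper) computing Nij, ℓi and the two formulas above:

def stencil_costs(P_tar, P, K, tau_comm, tau_calc):
    # Returns (K * Tvol^iter, K * Tsteps^iter) for the partition P_tar.
    N = len(P_tar)
    Nij = [[0] * P for _ in range(P)]
    load = [0] * P                                  # l_i: workload
    for x in range(N):
        i = P_tar[x]
        load[i] += 1
        # "left or right neighbor in component j": count x once per j != i.
        for j in {P_tar[(x - 1) % N], P_tar[(x + 1) % N]} - {i}:
            Nij[i][j] += 1
    calc = tau_calc * max(load)
    t_vol = tau_comm * sum(map(sum, Nij)) + calc    # serialized communications
    t_steps = tau_comm * max(max(sum(Nij[i]),
                                 sum(Nij[j][i] for j in range(P)))
                             for i in range(P)) + calc
    return K * t_vol, K * t_steps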

4.5.2 Complexity

Assume without loss of generality that N is a multiple of P. There is a well-known optimal data partition for the 1D-stencil kernel, namely the full block partition (data item i is assigned to component ⌊iP/N⌋). This canonical partition Pcan minimizes the duration of the communication phase (only two items are sent/received per component of the partition) and the computation phase is perfectly balanced. Starting from an initial data distribution Dini, we can use either Algorithm 7 or Algorithm 8 to find a target distribution Dtar which is compatible with the full-block partition Pcan and whose redistribution cost is minimal. However, redistributing from Dini to Dtar may induce a large overhead on the total execution time, which is fully justified only when the number of iterations K is large enough. It may be useful to avoid a costly redistribution for small values of K, and to find a target distribution which offers a trade-off between redistribution time and processing time. Actually, finding such a trade-off distribution is an NP-complete problem for both communication models. We define the two decision problems associated with VOLPART&REDISTRIB and STEPPART&REDISTRIB:

Definition 4.8 (DECISIONVOLPART&REDISTRIB). Given a number of processors P, elementary communication and computation times τcomm and τcalc, a number of steps K, an initial data distribution Dini and a bound TMAX, are there a partition Ptar and a distribution Dtar compatible with Ptar such that: Ttotal(Dini, Dtar) = RedistVol(Dini → Dtar) × τcomm + Tcomp(Ptar) ≤ TMAX?

Definition 4.9 (DECISIONSTEPPART&REDISTRIB). Given a number of processors P, elementary communication and computation times τcomm and τcalc, a number of steps K, an initial data distribution Dini and a bound TMAX, are there a partition Ptar and a distribution Dtar compatible with Ptar such that: Ttotal(Dini, Dtar) = RedistSteps(Dini → Dtar) × τcomm + Tcomp(Ptar) ≤ TMAX?

Theorem 4.3. The DECISIONVOLPART&REDISTRIB problem with the 1D-stencil kernel is strongly NP-complete.

Proof. We first prove that DECISIONVOLPART&REDISTRIB belongs to NP. Given Dtar, it is possible to compute in polynomial time the redistribution time RedistVol(Dini → Dtar) × τcomm and the cost of the K iterations of the stencil algorithm Tcomp(Ptar), and thus to check whether Ttotal is smaller than TMAX or not. Thus, DECISIONVOLPART&REDISTRIB is in NP. To establish the completeness, we use a reduction from the 3-Partition problem, which is known to be NP-complete in the strong sense [49]. We consider the following instance Inst0 of the 3-Partition problem: let a⁰i, 1 ≤ i ≤ 3m, be 3m integers and B⁰ an integer such that ∑i a⁰i = mB⁰. We enforce the additional (usual) constraint that ∀i, B⁰/4 < a⁰i < B⁰/2 [49]. To solve Inst0, we need to answer the following question: is there a partition of the a⁰i's into m subsets S1, ..., Sm such that, ∀Sk, ∑i∈Sk a⁰i = B⁰? Note that if there is a solution, then each subset will contain exactly 3 elements, due to the additional constraint. We then transform (in polynomial time) Inst0 into another instance Inst1 of the 3-Partition problem as follows: let ai = 385m × a⁰i for 1 ≤ i ≤ 3m and B = 385m × B⁰. Obviously, Inst1 has a solution if and only if Inst0 has a solution. This new instance of the 3-Partition problem has the following properties:

• ∀i, B/4 < ai < B/2, since ∀i, B⁰/4 < a⁰i < B⁰/2;
• B² > 5m × B + 96m, since (B² − 5m × B)/m = 385mB⁰(385B⁰ − 5) > 96 because m ≥ 1 and B⁰ ≥ 1;
• B > 384m, since B⁰ ≥ 1.

Given Inst1, we build the following instance Inst2 of the DECISIONVOLPART&REDISTRIB problem, illustrated in Figure 4.4. In Inst2, we set the number of processors to P = 12m, the number of 1D-stencil steps to K = 1, and elementary communication and computing times τcomm = 1 and τcalc = B². We also set the time bound to TMAX = 96m + 5mB + 8B³. Finally, Figure 4.4 represents the initial data distribution Dini of 96mB elements on the 12m different processors. To clarify the proof, we split the 12m processors into 4 different groups. There are 3m processors in group 1, m processors in group 2, 4m processors in group 3 and 4m processors in group 4. Processors in group k are denoted by Pi(k). Figure 4.4 depicts the initial data distribution Dini. For example, the 2B first consecutive elements are stored on P1(1), the first processor in group 1. The next 2B elements are stored on P1(4), the first processor in group 4. Note that in the fifth set of 3m block values (a1, 2B, a2, 2B, ..., a3m, 2B), the 3m blocks of size 2B are distributed on the m group-4 processors P3m+1, ..., P4m in a round-robin way (the first block goes to P3m+1, the second one to P3m+2, ..., the m-th block goes to P4m, the (m+1)-th goes to P3m+1, etc.). The construction of Inst2 is polynomial in the size of Inst1, and thus in the size of Inst0. We show that Inst2 has a solution if and only if Inst1 has a solution. We first assume that Inst2 has a solution and we let Dtar denote the final distribution of data. A connected component of processor p is defined as a set of consecutive items hosted on processor p. A maximal connected component of processor p is a connected component of processor p which is not strictly included in another connected component of processor p. For instance, each group-1 processor has 5 maximal connected components in Dini.
Let Cp be the number of maximal connected components of processor p in the distribution Dtar. At each stencil step, each processor has to send only the two items at the borders of each of its maximal connected components. Thus, with lp = |{x s.t. Dtar(x) = p}| being the workload of processor p as introduced in Section 4.5.1, Tvol^iter(Ptar) = 2 × ∑p Cp + B² × maxp lp, and Ttotal(Dini, Dtar) = RedistVol(Dini → Dtar) + 2 × ∑p Cp + B² × maxp lp. Since Dtar is a solution to Inst2, we know that:

RedistVol(Dini → Dtar) + 2 × ∑p Cp + B² × maxp lp ≤ 96m + 5mB + 8B³. (4.1)

Let us first show that ∀p, lp = 8B:

• maxp lp ≤ 8B because otherwise, we would have:

Tvol^stencil(Dini, Dtar) ≥ B² × maxp lp ≥ B² × (8B + 1) > TMAX,

since B² > 5m × B + 96m.

• There are a total of 96mB elements of data and 12m processors, thus ∀p, lp = 8B.

Figure 4.4: The initial distribution Dini and the distribution Dsol used in the proof of Theorems 4.3 and 4.4.

Thus maxp lp = 8B and Equation 4.1 becomes:

RedistVol(Dini → Dtar) + 2 × ∑p Cp ≤ 96m + 5mB. (4.2)

For each processor Pi(k), let Si(k) (respectively Ri(k)) be the number of elements sent (respectively received) by processor Pi(k) during the redistribution phase. We naturally have

∑k,i Si(k) = ∑k,i Ri(k) = RedistVol(Dini → Dtar).

Let us show that ∑p Cp ≥ 48m. Initially, in Dini, there are 52m maximal connected components among all the processors. There are only two different ways to decrease the global number of maximal connected components: merging two existing connected components by receiving all the data between them, or sending one entire maximal connected component to one of the processors that host a maximal connected component next to it. We first consider the first option. In Dini, two maximal connected components hosted by the same processor are separated by more than 6mB elements. Thus, to merge two existing maximal connected components of Dini, a processor would have to receive more than 6mB elements during the redistribution phase, which is not possible according to Equation 4.2, since B > 384m. We now consider the second option (sending one entire maximal connected component).

• Let us assume that a processor Pi(1) sends one of its entire maximal connected components to one of its neighbors. The only neighbors of Pi(1) are group-4 processors. This means that a processor Pj(4) would receive at least B/4 elements from Pi(1) during the redistribution phase, since ∀i, B/4 < ai. However, at the end of the redistribution phase, Pj(4) can only host 8B elements, thus it would have to send at least (5/4)B elements during the redistribution phase:

RedistVol(Dini → Dtar) = ∑(k,p) s.t. (k,p)≠(4,j) Sp(k) + Sj(4) ≥ 5mB + B/4, which is not possible according to Equation 4.2, since B > 384m.

• Let us assume that a processor Pi(2) sends one of its entire maximal connected components to one of its neighbors. This means that processor Pi(2) will send at least 7B elements during the redistribution phase and thus we would have

RedistVol(Dini → Dtar) = ∑(k,p) s.t. (k,p)≠(2,i) Sp(k) + Si(2) ≥ 5mB + 7B, which again is not possible according to Equation 4.2, since B > 384m.

• Let us assume that a processor Pi(3) sends one of its entire maximal connected components to one of its neighbors. This means that Pi(3) will send at least B elements during the redistribution phase and thus we would have

RedistVol(Dini → Dtar) = ∑(k,p) s.t. (k,p)≠(3,i) Sp(k) + Si(3) ≥ 5mB + B, which, again, is not possible.

Thus, the only remaining option to decrease the number of maximal connected components is that some processor Pi(4) sends at least one entire connected component to one of its neighbors. Assume that it sends at least two of its entire maximal connected components to one of its neighbors. This means that Pi(4) will send at least 3B elements during the redistribution phase and thus we would have

RedistVol(Dini → Dtar) = ∑(k,p) s.t. (k,p)≠(4,i) Sp(k) + Si(4) ≥ 5mB + 2B, which, again, is not possible according to Equation 4.2, since B > 384m. Thus each processor Pi(4) can send only one of its entire maximal connected components to one of its neighbors. There are 4m processors in group 4, so we can reduce the number of maximal connected components by only 4m. Since there are 52m maximal connected components in Dini, we have in Dtar:

∑p Cp ≥ 48m. (4.3)

Then, we show that RedistVol(Dini → Dtar ) = 5mB.

• Equations 4.2 and 4.3 lead to: RedistVol(Dini → Dtar) = ∑k,i Ri(k) ≤ 5mB.

• Initially, in Dini, each processor in group 2 or group 3 hosts 7B elements of data and, since ∀p, lp = 8B in Dtar, the group-2 and group-3 processors each have to receive at least B elements of data during the redistribution phase (∀i, Ri(2) ≥ B and Ri(3) ≥ B). Thus at least 5mB elements of data have to be communicated during the redistribution phase:

RedistVol(Dini → Dtar) = ∑k,i Ri(k) ≥ 5mB.

Thus RedistVol(Dini → Dtar ) = 5mB and Equation 4.2 becomes:

∑p Cp = 48m. (4.4)

We now bound the number of elements sent and received by the processors in groups 2, 3 and 4.

• Each group-2 processor Pi(2) and each group-3 processor Pi(3) hosts 7B elements in the initial distribution Dini, and 8B elements in the final distribution Dtar. Thus, they each have to receive at least B elements of data. There are 5m of them, so they can receive only B elements of data each, and no other processor can receive any data. More formally, we have ∀i: Ri(1) = 0, Ri(2) = B, Ri(3) = B and Ri(4) = 0.

• Each group-1 processor Pi(1) hosts 8B + ai elements in Dini, and 8B elements in Dtar. Each group-4 processor Pi(4) hosts 9B elements in Dini, and 8B elements in Dtar. Again, this means that each group-1 processor Pi(1) can send only ai elements of data, each group-4 processor Pi(4) can send only B elements of data, and no other processor can send any data. That is, ∀i: Si(1) = ai, Si(2) = 0, Si(3) = 0 and Si(4) = B.

Since ∑p Cp = 48m, each group-4 processor has to send one and only one of its entire maximal connected components to one of its neighbors (as we have seen earlier, there is no other way to decrease the global number of maximal connected components). Since ∀i, Si(4) = B, group-4 processors can only send their maximal connected component of size B. The only neighbors of these maximal connected components are some group-3 processors. Thus, each group-4 processor will send its entire maximal connected component of size B to a group-3 processor, and group-3 processors cannot receive anything else from any other processor. Gathering all the results shown above, we can state that group-1 processors can only send their ai elements to group-2 processors during the redistribution phase. If a processor Pi(1) split its ai consecutive elements and sent them to two different group-2 processors, this would create an extra maximal connected component on the group-2 processors. Thus, each group-1 processor has to send its ai elements to the same group-2 processor. Let Ak be the set of the sizes of the maximal connected components received by Pk(2) during the redistribution phase. The Ak sets represent a partition of the ai's and ∀k, ∑ai∈Ak ai = Rk(2) = B. Hence the Ak sets are a solution of Inst1. Suppose now that Inst1 has a solution. Let Ak be the 3-Partition of the integers ai and consider the distribution Dsol described in Figure 4.4. To perform the redistribution from Dini to Dsol, each group-2 and group-3 processor has to send or receive B elements of data, which means that RedistVol(Dini → Dsol) = 5mB. In addition, in Dsol, there are 48m maximal connected components. Thus, Ttotal(Dini, Dsol) = 96m + 5mB + 8B³ ≤ TMAX, which means that Inst2 has a solution. This concludes the proof.

Theorem 4.4. The DECISIONSTEPPART&REDISTRIB problem with the 1D-stencil kernel is strongly NP-complete.

Proof. The proof of Theorem 4.4 is similar to that of Theorem 4.3. We consider the same instances Inst0 and Inst1 of the 3-Partition problem [49]. We build the instance Inst2 depicted in Figure 4.4 for the DECISIONSTEPPART&REDISTRIB problem, as in the previous proof, except that we now set TMAX = 8 + B + 8B³. We want to show that Inst2 has a solution if and only if Inst1 has a solution. Assume first that Inst2 has a solution and use the same notations as above. We have the inequality:

Ttotal(Dini, Dtar) = RedistSteps(Dini → Dtar) + 2 × maxp Cp + B² × maxp lp ≤ 8 + B + 8B³. (4.5)

As in the previous proof, we can easily show that ∀p, lp = 8B. Thus, Equation 4.5 becomes:

RedistSteps(Dini → Dtar) + 2 × maxp Cp ≤ 8 + B. (4.6)

Then we show that RedistSteps(Dini → Dtar) = max(p,k) max(Sp(k), Rp(k)) = B:

• Initially, in Dini, the processor P1(4) hosts 9B elements of data; since maxp lp = 8B in Dtar, the processor P1(4) has to send at least B elements of data during the redistribution phase. Thus, S1(4) ≥ B and RedistSteps(Dini → Dtar) ≥ B.

• If one processor sent B + 1 elements of data during the redistribution phase, we would have RedistSteps(Dini → Dtar) ≥ B + 1 and 2 × maxp Cp ≤ 7, so maxp Cp ≤ 3. Initially, in Dini, processor P1(1) hosts 5 maximal connected components and could have at most 3 maximal

connected components in Dtar. There are only two different ways to decrease the number of maximal connected components of a processor: sending one entire maximal connected component to another processor, or merging two existing connected components by receiving all the data between them. Both options are impossible in this case, because processor P1(1) would have to send or receive more than 2B elements during the redistribution phase, which is impossible according to Equation 4.5.

Thus RedistSteps(Dini → Dtar ) = B and Equation 4.6 becomes:

maxp Cp ≤ 4. (4.7)

We now bound the number of elements sent and received by the processors in groups 2, 3 and 4. We naturally have ∀Pi(k), max(Si(k), Ri(k)) ≤ RedistSteps(Dini → Dtar) = B.

• Each group-2 processor Pi(2) hosts 7B elements in the initial distribution Dini, and 8B elements in the final distribution Dtar. This means that Ri(2) − Si(2) = B. Since max(Si(2), Ri(2)) ≤ B, we have Ri(2) = B and Si(2) = 0.

• Each group-3 processor Pi(3) hosts 7B elements in Dini, and 8B elements in Dtar. Again, this means that Ri(3) − Si(3) = B and, since max(Si(3), Ri(3)) ≤ B, we necessarily have Ri(3) = B and Si(3) = 0.

• Each group-4 processor Pi(4) hosts 9B elements in Dini, and 8B elements in Dtar. Again, Si(4) − Ri(4) = B and we have Ri(4) = 0 and Si(4) = B.

From these results, using the same reasoning as in the previous proof, we can show that:

• group-1 and group-2 processors do not send or receive any data to or from a group-3 or group-4 processor.

• Each group-1 processor does not keep any data received during the redistribution phase.

• Each group-1 processor has to send its ai elements to the same group-2 processor.

Let Ak be the set of the sizes of the maximal connected components received by Pk(2) during the redistribution phase: we have shown that the Ak sets are a solution of Inst1.

Suppose now that Inst1 has a solution. As in the previous proof, we can show that the distribution Dsol described in Figure 4.4 is a solution for Inst2, which concludes the proof.

4.6 Experiments

The algorithms designed in Section 4.4 find the optimal target distribution according to different models for the redistribution time. These algorithms may be sub-optimal for minimizing the total processing time when the processing of an arbitrary application is taken into account. Section 4.5 proved that there is no polynomial-time optimal algorithm to minimize this total processing time (unless P=NP), even for a simple application like the 1D-stencil algorithm, which motivates the use of low-complexity sub-optimal heuristics. In this section, we show that the redistribution algorithms introduced in Section 4.4 are good enough to provide performance improvements in real-life applications. The experiments are conducted on a multicore cluster for the 1D-stencil kernel and, then, for a more compute-intensive dense linear algebra routine, namely the QR factorization.

4.6.1 Setup

We have implemented the 1D-stencil kernel of Section 4.5 on top of the PARSEC runtime [23, 22]. In addition, we have also implemented a QR factorization algorithm on top of PARSEC, in order to experiment with a widely used computation-intensive numerical linear algebra routine. The PARSEC runtime deals with computational threads and MPI communications. It allows the user to define the initial distribution of the data onto the platform, as well as the target distribution for the computations. Data items are first moved from their initial data distribution to the target data distribution. Then computations take place, and finally data items are moved back to their initial position. It is important to stress that the PARSEC runtime will overlap the initial communications due to the redistribution with the processing of the computational kernel (either 1D-stencil or QR), so that the total execution time does not strictly obey the simplified model of the previous sections. However, choosing a good data partition (leading to an efficient implementation of the computational kernel), and an efficient compatible data distribution (leading to fewer communications during the redistribution) is still important to achieve high performance. Experiments have been conducted on Dancer, a small cluster hosted at the Innovative Computing Laboratory (ICL) in Knoxville, TN. This cluster has 16 multi-core nodes, each equipped with 8 cores, and an InfiniBand 10G interconnection network. Each node features two Intel Westmere-EP E5606 CPUs at 2.13 GHz. The system is running the Linux 64-bit operating system, version 3.7.2-x86_64. The software was compiled with the Intel Compiler Suite 2013.3.163. BLAS kernels were provided by the MKL library, and OpenMPI 1.4.3 was used for MPI communications by the PARSEC runtime version 1.1. Each computational thread is bound to a single core using the HWLOC 1.7.1 library. We use all 16 nodes, whose aggregated theoretical peak performance is 1,091 GFLOP/sec.

4.6.2 Stencil

The stencil algorithm described in Algorithm 9 can use diverse patterns to update the data, depending upon the target application. In our experiments, the data items operated upon are blocks of 1.6 × 10⁶ double-precision floats. Experimentally, we observe that the average communication time for such blocks between two nodes of Dancer is 100 milliseconds. The data items are initially distributed on the 16 processors according to a random balanced distribution as described in Section 4.4.3. We used a set of 30 randomly generated initial data distributions. Figure 4.5 depicts the performance of the 1D-stencil algorithm when the update kernel takes on average 100 milliseconds to compute the new value of one data item, so that the communication-to-computation ratio is τcomm/τcalc = 1. Each sub-figure represents a different number of stencil iterations (K = 0 to 9). In each sub-figure, we have executed K stencil iterations with 4 different strategies. In the owner-computes strategy, the data items are not moved and the stencil algorithm is applied on the initial distribution. In the other strategies, we redistribute the data items towards three target distributions, each compatible with the canonical data partition Pcan described in Section 4.5.2: (i) the distribution Dcan = Pcan with the original (arbitrary) labeling of the processors; (ii) the distribution that minimizes the volume of communications Dvol; and (iii) the distribution that minimizes the number of redistribution steps Dsteps. We compute the processing time of the redistribution followed by the K stencil iterations. The time needed to compute the target distributions depends on the number of processors and data items but does not depend on the size of the data items. Usually the size of the data items is large enough for the computation time of the algorithms presented in Section 4.4 to be negligible, therefore it is not included in the figures. Each cross shows the performance for one of the 30 initial data distributions, and the plain lines show the average performance over the 30 initial data distributions. The first observation is that the standard deviation of the processing time over all the initial data distributions is very small. Moreover, in all sub-figures, we observe that the performances for target distributions Dvol and Dsteps are indistinguishable. This is in accordance with the results in Section 4.4.3 showing that, on random balanced initial distributions, Algorithm 7 and Algorithm 8 provide similar performances for both metrics. We observe that the processing time of the three redistribution strategies slightly increases with the number of stencil iterations, i.e., one stencil iteration is very fast (roughly 400 milliseconds for 16 data items per processor) when processed on the optimal data partition described in Section 4.5.2. However, the owner-computes strategy is less efficient as soon as we have to process more than one stencil iteration. In the top-left sub-figure, K = 0, so that no iteration is executed. Data items are moved from their initial processor to their target processor and then moved back to their initial position. It thus depicts the performance of two consecutive redistributions. The owner-computes strategy has a processing time close to zero, which corresponds to the overhead of the PARSEC runtime. Both redistribution strategies computed by Algorithm 7 and Algorithm 8 provide a 20% improvement over Dcan.
This improvement decreases when the number of iterations increases. Indeed, the only difference between the performances of Dcan, Dvol and Dsteps comes from the redistribution phase: the heavier the computation, the less significant the redistribution phase.

Figure 4.6 depicts the performance of the 1D-stencil algorithm when the update kernel is less expensive, so that τcomm/τcalc = 10. Hence, in this experiment, the cost of communicating a data element is greater than the computation time, and special care must be taken with the redistribution. With a faster computing kernel, the overall computation time is lower than the one in Figure 4.5, but the owner computes strategy is still less efficient than the redistributing strategies as soon as we have to do more than one iteration. The difference between the performances of Dcan, Dvol and Dsteps is smaller in percentage than in Figure 4.5.

Altogether, the experiments show that (i) redistributing towards a better data distribution is more suitable than performing the algorithm in place with the random initial distribution, as soon as the computational cost is non-negligible; and (ii) redistributing towards a data distribution that minimizes the cost of the redistribution phase rather than towards an arbitrary one does improve the performance, especially when the time to communicate a data item is significant.

4.6.3 QR factorization

In this section, we deal with a more compute-intensive kernel, namely the QR factorization, which consists of decomposing a square matrix A into the product of two matrices Q × R such that Q is an orthogonal matrix and R is an upper triangular matrix. QR factorization is a widely used linear algebra algorithm for solving linear systems and linear least squares problems.

4.6.3.1 Framework

To optimize performance, the matrix is usually stored in tiled form: A has n tiles per row or column and each tile is a block of nb × nb floating point numbers. The matrix is then factored with a tiled QR factorization algorithm using orthogonal Householder matrices, as in [28, 90].

The N = n² matrix tiles are the data items of the application. Initially, the tiles are arbitrarily distributed, and this initial distribution may not be suitable for the QR factorization. We aim to redistribute the N data items towards a better data partition. However, as opposed to the 1D-stencil algorithm, the QR factorization algorithm has a complex workflow, and it is impossible to predict its processing time accurately: given a data partition P, we cannot easily compute Tcomp(P). However, even though there is no explicit model for the cost of a QR factorization performed on a specific data partition, some distributions are known to be well-suited. A widely-used data partition consists of mapping the tiles onto the processors following a 2D block-cyclic partition [32]. The P processors (numbered from 0 to P − 1) are arranged in a p × q grid where p × q = P. Matrix tile Ai,j is then mapped onto processor (i mod p) × q + (j mod q). In the following, this data partition will be referred to as Ptar, and the objective is to redistribute the N data items towards a distribution compatible with Ptar.

Similarly to Section 4.6.2, we compare 4 redistribution strategies. In the owner computes strategy, data items are not moved and the QR factorization is performed in place. In the other strategies, we redistribute data items towards three target distributions compatible with Ptar: (i) the distribution Dcan = Ptar with the original (arbitrary) labeling of the processors; (ii) the distribution that minimizes the volume of communications Dvol; and (iii) the distribution that minimizes the number of redistribution steps Dsteps.
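As an illustration, the 2D block-cyclic mapping above fits in a one-line helper (the function name and the worked example are ours, not part of the DPLASMA API):

```python
def tile_owner(i: int, j: int, p: int, q: int) -> int:
    """Rank (in 0..p*q-1) of the processor owning tile A_{i,j} on a p x q grid."""
    return (i % p) * q + (j % q)

# Example: on a 2 x 2 grid, tile A_{3,5} is owned by rank (3 % 2) * 2 + (5 % 2) = 3.
```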

4.6.3.2 Setup

A highly optimized version of the QR factorization implemented on top of the PARSEC runtime is available in the DPLASMA library [20]. We have modified this implementation to deal with different data distributions. We use a wide range of matrix sizes, with tiles of size nb × nb = 200 × 200 double-precision floating point numbers. Our objective is to highlight the impact of the target data distribution on the performance of the QR algorithm, and a tile size of 200 × 200 is reasonable as it ensures near peak performance on the execution platform. As already mentioned, real-life distributions are not random. We conduct experiments on 2 different sets of initial distributions for the matrix tiles, one artificially generated and one modeling an Earth Science application [100]:

• SkewedSet: Matrix tiles are first distributed following an arbitrary 2D block-cyclic distribution (used as reference) and, then, half of the tiles are randomly moved onto another processor. The processor with index i ∈ ⟦0, P − 1⟧ has a probability 2i/(P(P − 1)) to receive each tile. Thus, the workload among processors is likely to be imbalanced (a generation sketch is given after this list). The redistribution strategies toward Dvol and Dsteps should find the 2D block-cyclic distribution used as reference and move only half of the tiles, while the redistribution towards the arbitrary distribution Dcan can potentially move all of them.

• ChunkSet: This distribution set comes from an Earth Science application [100]. Astronomy telescopes collect data over days of observations and process them into a 2D or 3D coordinate system, which is usually best modeled as a matrix. Then, linear algebra routines such as QR factorization must be applied to the resulting matrix. The collected data are stored on a set of processors in a round-robin manner, ensuring spatial locality of data that are sampled at close time-steps. If a certain region of Earth is observed twice, the latest data overwrites the previous one. We generated a set of initial distributions fitting the telescope behavior. Figure 4.7 depicts the data distribution of a matrix in ChunkSet where matrix tiles of the same color are initially stored on the same processor.
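The following sketch shows how a SkewedSet instance could be generated (hypothetical code of ours, not the generator used in the experiments); since the weights 2i sum to P(P − 1) over i ∈ ⟦0, P − 1⟧, a moved tile lands on processor i with probability 2i/(P(P − 1)):

```python
import random

def skewed_set(n, p, q):
    """Owner map for an n x n tiled matrix: 2D block-cyclic reference
    distribution on a p x q grid, then half of the tiles moved at random."""
    P = p * q
    owner = {(i, j): (i % p) * q + (j % q) for i in range(n) for j in range(n)}
    moved = random.sample(sorted(owner), len(owner) // 2)  # half of the tiles
    weights = [2 * i for i in range(P)]                    # sum = P * (P - 1)
    for tile in moved:
        owner[tile] = random.choices(range(P), weights=weights)[0]
    return owner
```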

Figure 4.7: The initial distribution of a tiled matrix in ChunkSet.

4.6.3.3 Results

Table 4.1 presents the results of the experiments for initial distributions in SkewedSet. Each line corresponds to the average results on 50 matrices with n × n tiles. Columns 1 to 4 give the volume of tiles communicated during the redistribution phase for the four strategies. As expected, redistributing towards the arbitrary distribution Dcan requires moving almost every tile, while redistributing towards Dvol or Dsteps involves almost half as many communications. Columns 5 to 8 present the number of redistribution steps required to schedule the redistribution for the four strategies. We observe that Dvol and Dsteps are identical, since Algorithm 7 and Algorithm 8 manage to find the 2D block-cyclic distribution used as reference when building SkewedSet. Columns 9 and 10 present the total volume of tiles communicated during the QR factorization. It appears that redistributing towards a 2D block-cyclic partition divides the amount of communications involved in the QR factorization by more than 3. The gain obtained by redistributing the data according to a suitable partition is significant, and can be seen in the total completion times shown in columns 11 to 14. Columns 15 to 17 present the percentage of improvement provided by the redistribution strategies over the owner computes strategy.

Table 4.2 presents the results of the experiments for initial distributions in ChunkSet. Each line corresponds to the average results on 50 matrices, as before. The three redistribution strategies perform similarly, with around 90% of the tiles moved during the redistribution phase. Contrary to the previous case, it appears that redistributing towards a 2D block-cyclic partitioning does not lead to a reduction of the volume of communication involved during the QR factorization. Indeed, the owner computes strategy requires fewer communications than the other strategies for larger matrices in ChunkSet, due to the chunk distribution of the tiles. However, it does not lead to better performance results. Indeed, the three redistribution strategies require more communications to ensure a better load balancing, which leads to a 10-15% improvement on the total completion times compared to the owner computes strategy on large matrices.

In summary, we conclude that redistributing towards a suitable data partition for the QR factorization leads to significant improvement, compared to not redistributing the data as with the owner computes strategy. Initial distributions in SkewedSet are a good example where redistributing data is essential. Sometimes, as in ChunkSet, redistribution strategies involve a larger amount of communications during the QR factorization but lead to a better load balancing across processors, which is enough to be profitable in the end.
[Table 4.1: Results for the different initial distributions in SkewedSet (averages over 50 matrices with n × n tiles, n = 16, 34, 52, 70, 88), reporting, for the owner, can, vol, and step strategies: the volume of communications and the number of steps in the redistribution phase, the volume of communications in the QR factorization, and the total completion time.]

[Table 4.2: Results for the different initial distributions in ChunkSet (same layout).]

4.7 Conclusion

In this chapter, we have studied the problem of finding the best data redistribution, given a target data partition. We have used two cost metrics, the total volume of communications and the number of parallel redistribution steps. We have provided algorithms computing the optimal solution for both metrics, and shown through simulations that they achieve significant gain over redistributing to an arbitrary fixed distribution. We have also proved that finding the optimal data partition that minimizes the completion time of the redistribution followed by a 1D-stencil kernel is NP-complete. Altogether, these results lay the theoretical foundations of the data partition problem on modern computers.

Admittedly, the platform model used in this chapter will only be a coarse approximation of actual parallel performance, because state-of-the-art runtimes use intensive prefetching and overlap communications with computations. Therefore, experimental validation of the algorithms on a multicore cluster has been presented for a 1D-stencil kernel and a dense linear algebra routine. The new redistribution strategies presented in this chapter lead to better performance in all cases, and the improvement is significant when the initial data distribution is not well-suited for the computational kernel. Future work will be devoted to further investigating the Earth Science application. We have restricted ourselves to redistributing data towards the canonical 2D block-cyclic partition, but more experiments are needed to determine the best partition, given the initial distributions that typically arise for this application.

Part III

Auto-adjoint computations


Chapter 5

Optimal multistage algorithms for adjoint computation

When scheduling task graph workflows with large I/O files, we saw in Chapter 3 that the ordering in which the tasks are executed strongly impacts the memory consumption. We designed new memory-aware scheduling techniques for such workflows. But other techniques exist to deal with memory limitations. In some contexts, when the memory constraints are really tight, it may be necessary to delete some output files during the processing and recompute them later when needed. In this chapter, we target a scientific application, namely the adjoint computation, that provides no parallelism across tasks and where the recomputation technique is necessary to handle the memory usage. The goal is to decide which output files to store in the memory to process the workflow in minimal time, while respecting the memory constraints. Griewank and Walther provided an optimal algorithm for adjoint computation when the output files can be stored in a single limited memory. Here, we reexamine the work of Stumm and Walther and provide an optimal algorithm for this problem when there are two levels of checkpoints, in memory and on disk. Previously, optimal algorithms for adjoint computations were known only for a single level of checkpoints with no writing and reading costs; a well-known example is the binomial checkpointing algorithm of Griewank and Walther [58]. Stumm and Walther [101] extended that binomial checkpointing algorithm to the case of two levels of checkpoints, but they did not provide any optimality results. We bridge the gap by designing the first optimal algorithm in this context. We experimentally compare our optimal algorithm with that of Stumm and Walther to assess the difference in performance.

5.1 Introduction

The need to efficiently compute the derivatives of a function arises frequently in many areas of scientific computing, including mathematical optimization, uncertainty quantification, and nonlinear systems of equations. When the first derivatives of a scalar-valued function are desired, so-called adjoint computations can compute the gradient at a cost equal to a small constant times the cost of the function itself, without regard to the number of independent variables. This adjoint computation can arise from discretizing the continuous adjoint of a partial differential equation [75, 54] or from applying the so-called reverse or adjoint mode of algorithmic (also called automatic) differentiation to a program for computing the function [59]. In either case, the derivative computation applies the chain rule of differential calculus starting with the dependent variables and propagating back to the independent variables. Thus, it reverses the flow of the original function evaluation. In general, intermediate function values are not available at the time they are needed for partial derivative computation and must be stored or recomputed [51].


A popular storage or recomputation strategy for functions that have some sort of natural “time step” is to save (checkpoint) the state at each time step during the function computation (forward sweep) and use this saved state in the derivative computation (reverse sweep). If the storage is inadequate for all states, one can checkpoint only some states and recompute the unsaved states as needed. Griewank and Walther prove that given a fixed number of checkpoints, the schedule that minimizes the amount of recomputation is a binomial checkpointing strategy [57, 58]. The problem formulation they used implicitly assumes that reading and writing checkpoints are essentially free, but the number of available checkpoints is limited (see Problem 1 below). In [101], Stumm and Walther consider the case where checkpoints can be written to either memory or disk. The number of checkpoints to disk is effectively unlimited but the time to read or write a checkpoint can no longer be ignored (see Problem 2). We consider the same situation. In contrast to Stumm and Walther, however, we do not restrict ourselves to a single binomial schedule but instead prove that there exists a time-optimal schedule possessing certain key properties (including no checkpoints written to disk after the first checkpoint to memory has been written), and we provide a polynomial time algorithm for determining an optimal schedule.

The rest of this chapter is organized as follows. Section 5.2 introduces terminology and the general problem framework. Section 5.3 establishes several properties that must hold true for some optimal schedule and provides an algorithm for identifying this schedule. Section 5.4 compares our checkpointing schedule with that of Stumm and Walther. We conclude with some thoughts on future research directions.

5.2 Framework

5.2.1 The AC problem


Figure 5.1: The AC dependence graph.

Definition 5.1 (Adjoint Computation (AC) [58, 101]). An adjoint computation (AC) with l time steps can be described by the following set of equations:

Fi(xi) = xi+1 for 0 ≤ i < l (5.1)

F̄i(xi, x̄i+1) = x̄i for 0 ≤ i ≤ l (5.2)

The dependencies between these operations¹ are represented by the graph G = (V, E) depicted in Figure 5.1.

The F computations are called forward steps. The F̄ computations are called backward steps. If x̄l+1 is initialized appropriately, then at the conclusion of the adjoint computation, x̄0 will contain the gradient with respect to the initial state (x0).

1 In the original approach by Griewank [57], an extra Fl operation was included. It is not difficult to take this extra operation into account.

Definition 5.2 (Platform). We consider a platform with three storage locations:

• Buffers: there are two buffers, the top buffer and the bottom buffer. The top buffer is used to store a value xi for some i, while the bottom buffer is used to store a value x̄j for some j. For a computation (F or F̄) to be executed, its input values have to be stored in the buffers. Let B⊤ and B⊥ denote the content of the top and bottom buffers. In order to start the execution of the graph, x0 must be stored in the top buffer and x̄l+1 in the bottom buffer. Hence, without loss of generality, we assume that at the beginning of the execution, B⊤ = {x0} and B⊥ = {x̄l+1}.

• Memory: there are cm slots of memory where the content of a buffer can be stored. The time to write from buffer to memory is wm. The time to read from memory to buffer is rm. Let M be the set of xi and x¯i values stored in the memory. The memory is empty at the beginning of the execution (M = ∅).

• Disks: there are cd slots of disk where the content of a buffer can be stored. The time to write from buffer to disk is wd. The time to read from disk to buffer is rd. Let D be the set of xi and x̄i values stored on the disk. The disk is empty at the beginning of the execution (D = ∅).

Memory and disk are generic terms for a two-level storage system, modeling any platform with a dual memory system, including (i) a cheap-to-access first-level memory, of limited size; and (ii) a costly-to-access second-level memory, whose size is very large in comparison with the first-level memory. The pair (memory, disk) can be replaced by (cache, memory) or (disk, tape) or any relevant hardware combination.

Intuitively, the core of the AC problem is the following. After the execution of a forward step, its output is kept in the top buffer only. If it is not saved in memory or disk before the next forward step, it is lost and will have to be recomputed when needed for the corresponding backward step. When no disk storage is available, the problem is to minimize the number of recomputations in the presence of limited (but cheap-to-access) memory slots. When disk storage is added, the problem becomes even more challenging: saving data on disk can avoid some recomputation, and a trade-off must be found between the cost of disk accesses and that of recomputations. The problem with only memory and no disk (Problem 1: PROB(l, cm) below) has been solved by Griewank and Walther [58], using a binomial checkpointing algorithm called REVOLVE. In accordance with the scheduling literature, we use the term makespan for the total execution time.

Problem 1 (PROB(l, cm)). We want to minimize the makespan of the AC problem with the following parameters:

Initial state:
• AC graph: size l; steps: uf, ub
• Memory: cm slots, wm = rm = 0, Mini = ∅
• Disk: cd = 0
• Buffers: B⊤ini = {x0}, B⊥ini = {x̄l+1}

In this chapter we consider the problem with limited memory and infinite disk (Problem PROB∞(l, cm, wd, rd) below). The main goal of this chapter is to provide the first optimal algorithm for PROB∞(l, cm, wd, rd).

Problem 2 (PROB∞(l, cm, wd, rd)). We want to minimize the makespan of the AC problem with the following parameters:

Initial state:
• AC graph: size l; steps: uf, ub
• Memory: cm slots, wm = rm = 0, Mini = ∅
• Disk: cd = +∞, wd, rd, Dini = ∅
• Buffers: B⊤ini = {x0}, B⊥ini = {x̄l+1}

5.2.2 Algorithm model

We next detail the elementary operations that an algorithm can perform.

Fi  Execute one forward computation Fi (for i ∈ {0, . . . , l − 1}). Note that by definition, for Fi to occur, xi should be in the top buffer before (i.e., B⊤ = {xi}) and xi+1 will be in the top buffer after (i.e., B⊤ ← {xi+1}). This operation takes a time time(Fi) = uf.

F̄i  Execute the backward computation F̄i (i ∈ {0, . . . , l}). Note that by definition, for F̄i to occur, xi should be in the top buffer and x̄i+1 in the bottom buffer (i.e., B⊤ = {xi} and B⊥ = {x̄i+1}), and x̄i will be in the bottom buffer after (i.e., B⊥ ← {x̄i}). This operation takes a time time(F̄i) = ub.

Wi^m  Write the value xi of the top buffer into the memory. Note that by definition, for Wi^m to occur, xi should be in the top buffer (i.e., B⊤ = {xi}) and there should be enough space for xi in the memory (i.e., |M| < cm); xi will be in the memory after (i.e., M ← M ∪ {xi}). This operation takes a time time(Wi^m) = wm.

Di^m  Discard the value xi from the memory (i.e., M ← M \ {xi}). This operation takes a time time(Di^m) = 0. This operation is introduced only to clarify the proofs, since a Dm-type operation is always immediately followed by a Wm-type operation. In other words, all write operations overwrite the content of some memory slot, and we simply decompose an overwrite operation into a discard operation followed by a write operation.

Ri^m  Read the value xi in the memory, and put it into the top buffer. Note that by definition, for Ri^m to occur, xi should be in the memory (i.e., xi ∈ M) and xi will be in the top buffer after (i.e., B⊤ ← {xi}). This operation takes a time time(Ri^m) = rm.

Wi^d  Write the value xi of the top buffer into the disk. Note that by definition, for Wi^d to occur, xi should be in the top buffer (i.e., B⊤ = {xi}) and xi will be in the disk after (i.e., D ← D ∪ {xi}). This operation takes a time time(Wi^d) = wd.

Ri^d  Read the value xi in the disk and put it into the top buffer. Note that by definition, for Ri^d to occur, xi should be in the disk (i.e., xi ∈ D) and xi will be in the top buffer after (i.e., B⊤ ← {xi}). This operation takes a time time(Ri^d) = rd.

Di^d  Discard the value xi from the disk (i.e., D ← D \ {xi}). This operation takes a time time(Di^d) = 0. The same comment as for Di^m operations holds: all disk writes, just as memory writes, are overwrite operations, which we decompose as indicated above. Both discard operations are introduced for the clarity of the proofs.
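To make these semantics concrete, here is a small illustrative sketch (our own code, with hypothetical names, not part of the thesis artifacts) that replays a sequence of operations while checking the preconditions above and accumulating the execution time:

```python
class ACMachine:
    """Model of the platform of Definition 5.2: integer i stands for the value
    x_i held in the top buffer, the memory M, or the disk D; the bottom buffer
    tracks the current backward value (i stands for the value with a bar)."""

    def __init__(self, l, cm, uf, ub, wm, rm, wd, rd):
        self.l, self.cm = l, cm
        self.uf, self.ub, self.wm, self.rm, self.wd, self.rd = uf, ub, wm, rm, wd, rd
        self.top, self.bottom = 0, l + 1    # B⊤ = {x0}, B⊥ = {x̄_{l+1}}
        self.mem, self.disk = set(), set()  # M = ∅, D = ∅ (disk unbounded here)
        self.time = 0.0

    def F(self, i):     # forward step F_i: turns x_i into x_{i+1} in the top buffer
        assert self.top == i and 0 <= i < self.l
        self.top = i + 1
        self.time += self.uf

    def Fbar(self, i):  # backward step: needs x_i (top) and x̄_{i+1} (bottom)
        assert self.top == i and self.bottom == i + 1
        self.bottom = i
        self.time += self.ub

    def Wm(self, i):    # write x_i from the top buffer into the memory
        assert self.top == i and len(self.mem) < self.cm
        self.mem.add(i)
        self.time += self.wm

    def Rm(self, i):    # read x_i from the memory into the top buffer
        assert i in self.mem
        self.top = i
        self.time += self.rm

    def Dm(self, i):    # discard x_i from the memory (free)
        self.mem.remove(i)

    def Wd(self, i):    # write x_i from the top buffer onto the disk
        assert self.top == i
        self.disk.add(i)
        self.time += self.wd

    def Rd(self, i):    # read x_i from the disk into the top buffer
        assert i in self.disk
        self.top = i
        self.time += self.rd
```

For instance, the l = 1 schedule W0^m · F0 · F̄1 · R0^m · F̄0 replays as follows and, with wm = rm = 0, yields a makespan of uf + 2ub:

```python
m = ACMachine(l=1, cm=1, uf=1.0, ub=1.0, wm=0.0, rm=0.0, wd=5.0, rd=5.0)
m.Wm(0); m.F(0); m.Fbar(1); m.Rm(0); m.Fbar(0)
print(m.time)  # 3.0
```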

For conciseness, we let Fi→i′ denote the sequence Fi · Fi+1 · ... · Fi′ . Because of the shape of the AC dependence graph (see Figure 5.1), any algorithm solving Problem 2 will have the following unique structure,

Sl · F̄l · Sl−1 · F̄l−1 · ... · S0 · F̄0, (5.3)

where for all i, Si is a sequence of operations that does not contain F̄i (hence the F̄i following Si is the first occurrence of F̄i).

Definition 5.3 (iteration i). Given an algorithm that solves the AC problem, we let iteration i (for i = l . . . 0) be the sequence of operations Si · F̄i. Let li be the execution time of iteration i.

With this definition, the makespan of an algorithm is ∑_{i=0}^{l} li.

5.3 Solution of PROB∞(l, cm, wd, rd)

In this section we show that we can compute in polynomial time the solution to PROB∞(l, cm, wd, rd). To do so, we start by showing some properties on an optimal algorithm.

• We show that iteration l starts by writing some values xi on disk checkpoints before doing any memory checkpoints (Lemma 5.6).

• Once this is done, we show that we can partition the initial AC graph into connected subgraphs by considering the different subgraphs between consecutive disk checkpoints (Proposition 5.1). Each of these subgraphs can be looked at (and solved) independently.

• We give some details on how to solve the problem on all subgraphs. In particular, we show that (i) we do not write any additional values to disk in order to solve them (Lemma 5.2); and (ii) similarly to the first iteration, the algorithm writes some values to memory checkpoints, and we can partition these subgraphs by considering the different subgraphs between memory checkpoints (Lemma 5.3).

• To solve these subgraphs, we introduce new problems (Problems 3 and 4) that inherit the proper- ties of the general problem.

• We show how to compute the size of the different connected subgraphs through a dynamic pro- gramming algorithm (§ 5.3.2).

5.3.1 Properties of an optimal algorithm

We show here some dominance properties. That is, there exist optimal algorithms that obey these properties, even though not all optimal algorithms do.

Lemma 5.1. There exists an optimal solution that has the following structure, Sl · F¯l · Sl−1 · F¯l−1 · ... · S0 · F¯0, and that satisfies (P0):

(i) There are no Dd-type operations (we do not discard from disks).

(ii) Each Dm-type operation is immediately followed by a Wm-type operation (we discard a value from memory only to overwrite it).

(iii) Each R-type operation is not immediately followed by another R-type operation.

(iv) Each Wm-type operation (resp. Wd-type operation) is not immediately followed by another Wm-type operation (resp. Wd-type operation).

(v) Each Wm-type operation is not immediately followed by a Dm-type operation.


Figure 5.2: After executing F¯i+1, xi+1 (blue) is in the top buffer, and x¯i+1 (red) is in the bottom buffer. For the remainder of the execution, the algorithm will not need the grey area anymore; hence it will need to fetch xji (green) from a checkpoint slot.

(vi) There are no F¯-type operations in any Si sequence (backward steps are not recomputed);

(vii) During Si (i < l), there are no Fi to Fl−1 operations (nor actions involving xi+1 to xl);

(viii) In particular, for all i < l, the first operation of sequence Si is an R-type operation; in other words, there exist ji and s ∈ {m, d} such that Si = Rji^s · S̃i;

(ix) For all i, there is at least one Ri^m operation between a Wi^m and a Di^m operation.

(x) If l > 0, the first operation of the algorithm is a W-type operation.

Proof. Some of the intuitions of this lemma can be grasped from Figure 5.2. Note that removing an operation from the optimal solution cannot increase the makespan.

(i) We have an infinite number of disk slots available: there is no need to discard any value from the disk.

(ii) Discarding a value from the memory is useless if the memory is not full and if we do not need to write a new value in it.

(iii) If we had two consecutive reads in the sequence, then the only action of the first read would be to put some value xi in the top buffer. However, the second read would immediately overwrite this value, making the first read unnecessary. Thus, the first read can be removed.

(iv) It is useless to write the same value in the same storage twice in a row.

(v) Similar to the previous point: from (ii), a Dm-type operation is immediately followed by a Wm-type operation; hence this would be writing twice in the same storage in a row.

(vi) The reason there are no F¯-type operations in all Si is that we have a dedicated buffer for the x¯i values. The only operations that use the x¯i values are F¯-type operations. Also, to execute F¯i, we need only the value of x¯i+1 that is already stored in the bottom buffer at the beginning of Si. Hence, removing the additional F¯-type operations from Si can only improve the execution time.

(vii) Operations involving Fi to Fl−1 (or their output values) during Si would be useless; see Figure 5.2.

(viii) After the execution of F̄i, the content of the top buffer is xi. The value xi is useless for F̄i−1 (see the previous point). Hence, at the beginning of Si−1, we need to read the content of a storage slot before executing any F-type, F̄-type, or W-type operation. Furthermore, because of property (ii), doing a D-type operation will not permit an R-type operation before the next W-type operation. Hence, the first operation of sequence Si is necessarily an R-type operation.

(ix) Assume that there exists i such that there are no Ri^m operations between a Wi^m and a Di^m operation. It is useless to write the value xi in the memory and discard it without reading it in between. Thus the Wi^m and Di^m operations can be removed at no additional time delay.

(x) The first operation of the solution cannot be an R-type or a D-type operation since at the beginning of the execution the memory and the disk are empty. If l > 0, the forward step F0 has to be executed before the backward step F̄l. Thus the first operation cannot be an F̄-type operation. Now assume that the first operation is an F-type operation. It then has to be F0. After the execution of F0, the value x1 is in the top buffer, and the value x0 is not stored anywhere. There is no way to recompute the value x0, thus to execute F̄0, which would then prevent computing x̄0 (absurd). Thus, at the beginning of the execution, we have to store the value x0 either in the memory or in the disk, and the first operation of the algorithm is a W-type operation.

Lemma 5.2. There exists an optimal solution to Problem 2 that satisfies (P0) and (P1):

(i) All disk checkpoints are executed during the first iteration Sl.

(ii) For all i, 1 ≤ i ≤ l − 1, Wi^d operations are executed before Wi^m operations.

Proof. Suppose S is an optimal solution that satisfies (P0). We will show that we can transform it into a solution that satisfies (P0) and (P1):

(i) Iteration l passes through all forward computations. If a value xi is saved on disk later on during the algorithm, we could as well save it after the first execution of Fi−1 (in Sl) with no additional time delay, since we have an infinite number of slots on disk.

(ii) Let us assume that there exists i, 0 ≤ i ≤ l − 1, such that Wi^m is executed before Wi^d in S. By definition, when Wi^m is executed, the value xi is stored in the top buffer. Thus we can execute Wi^d right before Wi^m, instead of later, at no additional time delay, since the amount of available disk slots is infinite. This new solution still satisfies (P0).



The following lemma and its proof are inspired by Lemma 3.1 by Walther [107].

Lemma 5.3 (Memory Checkpoint Persistence). There exists an optimal solution to Problem 2 that satisfies (P0), (P1), and (P2):

(i) Let i < l; if Wi^m is executed, then there are no Di^m operations until after the execution of F̄i (that is, until the beginning of iteration i − 1).

(ii) Moreover, until that time, no operation involving F0 to Fi−1 or values x0 to xi−1 is taken.

The intuition behind this result is that if we were to discard the value xi before executing F̄i+1, then a better solution would have stored xi+1 in the first place. Furthermore, because rm = 0, we can show that until F̄i, all actions involving computations F0 to Fi−1 or values x0 to xi−1 do not impact the actions that lead to an F̄j operation, j ≥ i, and thus can be moved to a later time at no additional time delay (and potentially reducing the makespan of the algorithm).

Proof. Let S be an optimal solution that satisfies (P0) and (P1) but not (P2). We can transform it into a solution that satisfies (P0), (P1), and (P2). We iteratively transform S to increase the number of i values respecting (P2), without increasing the makespan of the schedule. This transformation can be applied as many times as necessary to reach a schedule that satisfies (P2). Assume, first, that S does not satisfy (P2(i)). Let xi be the first value such that at some point during S, xi is stored in the memory and discarded before executing F̄i. Thus we can write

S = S0 · Wi^m · S1 · Di^m · S2 · F̄i · S3,

where S0, S1, S2, and S3 are sequences of operations that do not include F̄i. Since S satisfies (P0(ix)), there is at least one Ri^m operation in S1. Let us prove that all these Ri^m operations are immediately followed by an Fi operation.

• The Ri^m operations cannot be immediately followed by an F̄-type operation, because the only F̄-type operation allowed after an Ri^m operation is F̄i (since the value xi would be in the top buffer) and there are no F̄i in S1.

• According to (P0(iii)), the Ri^m operations cannot be immediately followed by another R-type operation.

• The Ri^m operations cannot be immediately followed by a Dm-type operation because, according to (P0(ii)), the next operation would be a Wm-type operation. However, since the value xi would be in the top buffer, the only Wm-type operation allowed would be Wi^m, which is useless since the value xi is already in the memory.

• The Ri^m operations cannot be immediately followed by a Dd-type operation because, according to (P0(i)), there are no Dd-type operations in S.

• The Ri^m operations cannot be immediately followed by a Wm-type operation, because the only Wm-type operation allowed after an Ri^m operation is Wi^m (since the value xi would be in the top buffer), which is useless since the value xi is already in the memory.

• The Ri^m operations cannot be immediately followed by a Wd-type operation, because the only Wd-type operation allowed after an Ri^m operation is Wi^d (since the value xi would be in the top buffer), which is impossible according to (P1(ii)) since a Wi^m has already been executed before S1.

So, all these Ri^m operations are immediately followed by an F-type operation; since the value xi is in the top buffer, this operation is Fi. Thus, any Ri^m in S1 is followed by Fi.

Let us now focus on the first operation in S1. It cannot be an F̄-type (the only possible F̄-type is F̄i), Wm-type ((P0(iv))), Wd-type ((P1(ii))), or D-type ((P0(v))) operation. Hence, the first operation in S1 is either an R-type operation or an F-type operation (in which case, it is Fi since B⊤ = {xi} at the beginning of S1).

• Assume that S1 = Fi · S1′. Let S1′′ be the sequence S1′ where every occurrence of Ri^m · Fi has been replaced by Ri+1^m. We know that the schedule S′ = S0 · Fi · Wi+1^m · S1′′ · Di+1^m · S2 · F̄i · S3 is correct and has a makespan at least as good as S (since there is at least one occurrence of Ri^m · Fi in S1).

• Assume that there exist j and s (either equal to m or d) such that S1 = Rj^s · S1′. Let S1′′ be the sequence S1′ where every occurrence of Ri^m · Fi has been replaced by Ri+1^m. We know that the schedule S′ = S0 · Fi · Wi+1^m · Rj^s · S1′′ · Di+1^m · S2 · F̄i · S3 is correct and has a makespan at least as good as S (since there is at least one occurrence of Ri^m · Fi in S1).


Figure 5.3: Consider a subsequence of S1 comprised between two consecutive R-type operations Rj^s and Rj′^s′. If j ≥ i, then by definition the subsequence will activate only parts of the right area (and not overwrite any checkpoint from the left area). If j < i, then we have shown that the subsequence will activate only parts of the left area (no forward sweep through Fi−1).

Hence we were able to transform S into S′ without increasing the makespan, so that the number of values i, 0 ≤ i < l, that do not respect (P2(i)) decreases. We repeat this transformation until the new schedule satisfies (P2(i)).

Let us now consider (P2(ii)). Let S be an optimal solution that satisfies (P0), (P1), and (P2(i)). Let i, 0 ≤ i < l, be such that there exists Wi^m in S. We can write

S = S0 · Wi^m · S1 · F̄i · S2 · Di^m · S3,

where S1 is a sequence of operations that does not include Di^m. There are no Fi−1 in S1 because their only effect is to put the value xi in the top buffer, which could be done with Ri^m at no time delay. Consider now two consecutive R-type operations of S1. Because there are no Fi−1 operations in S1, we know that between these two R-type operations, and with the definitions of A and B in Figure 5.3, either only elements of A are activated (and no element of B) or only elements of B are activated (and no element of A).

Consider now the last R-type operation Rj^s of S1 such that j < i (s being equal to m or d). All W-type operations written after this operation and before the next R-type operation of S1 involve some values in A. Hence by (P2(i)), they are not discarded until after F̄i. Furthermore, because Rj^s is the last such operation, we know that they are not used in S1 either. Hence we can move this sequence of operations (the sequence between Rj^s and the next R-type operation) right after F̄i at no additional time delay. This operation can be repeated until there are no more such operations in S1. We then proceed with these operations recursively in the appearance order of the Wi^m. This shows that we can construct an optimal schedule that satisfies (P2(ii)).

Lemma 5.4. There exists an optimal algorithm for Problem 2 that satisfies (P0), (P1), (P2), and (P3): There is only one R-type operation (the first one) in every iteration Si, where i < l.

Proof. Let S be an optimal schedule that satisfies (P0), (P1), and (P2). We show that we can transform it into an optimal schedule that satisfies (P0–3). To show this result, for any i ≥ 0, we inductively show that if we have a solution such that the property is true in iterations l to i + 1, then we can transform it into a solution such that this property is true in iterations l to i. Assume that S does not satisfy (P3). Let Si be the first iteration of S that includes more than one R-type operation. If i = 0, then all R-type operations are R0 by (P2). Hence we can remove any of them until there is only one.

S̃(2) = R0^s F1 W1^m F2 F3 F4 F5 W5^m F6 W6^m
S̃(3) = R3^s′ F4 F5 F6 W6^m F7 F8 F9 W9^m
S̃(4) = R0^s F1 W1^m F2 F3 F4 F5 W5^m F6 W6^m F7 F8 F9 W9^m

Figure 5.4: Example of the merging operation.

Otherwise assume that i ≥ 1. Let Rj1^s1 and Rj2^s2 (where s1, s2 ∈ {m, d}) be the last two R-type operations in Si. According to (P0(vii)), we know that step i does not involve xi+1 to xl, so j1 ≤ i and j2 ≤ i. We can write

Si = S(1) · Rj1^s1 · S(2) · Rj2^s2 · S(3),

where S(2) and S(3) are sequences of operations that do not include any R-type operation. According to (P0(vi)), there are no F̄-type operations in S(2) and S(3) either. Since the value xj2 is in the top buffer at the beginning of S(3), we know that the first F-type operation of S(3) has to be Fj2. We know that the first operation after Si is F̄i. Thus the last F-type operation of S(3) is Fi−1. Since there are no R-type operations in S(3), we know that the sequence S(3) includes all F-type operations from Fj2 to Fi−1. Similarly, the sequence S(2) includes all F-type operations from Fj1 to Fjmax, with jmax ≤ i − 1.

• If jmax ≥ j2, we note jmin = min(j1, j2) and smin the corresponding value of s1 or s2. Iteration Si includes each F-type operation from Fjmin to Fi (possibly twice). Let us build the sequence S(4) of operations from the sequence Fjmin→i where each operation Fk is immediately followed by Wk^m if Wk^m is present in either S(2) or S(3) (see Figure 5.4 for this transformation). Thus we know that the sequence Si′ = S(1) · Rjmin^smin · S(4) will have the exact same impact on the memory as Si without increasing the makespan. Transforming iteration Si into sequence Si′ reduces by one the number of readings in iteration i.

• If jmax < j2, the sequences of operations S(2) and S(3) are disjoint. Thus S(2) has no impact on iteration i and can be moved to the beginning of the next iteration. Thus transforming iteration Si into Si′ = S(1) · Rj2^s2 · S(3) and moving Rj1^s1 · S(2) to the beginning of iteration Si−1 will not increase the makespan of S and will reduce by one the number of readings in iteration i.

Hence we have shown that until there is only one R-type operation in iteration i, we can reduce by one the number of R-type operations in iteration i (and leave as they were iterations l to i + 1). Thus, if the property is true in iterations l to i + 1, then we can transform the solution so that this property is true in iterations l to i. This concludes the proof.

Corollary 5.1 (Description of iteration i < l). Given an optimal algorithm for Problem 2 that satisfies (P0–3), each iteration i < l can be written as

Si = Rji^si · S̃i (5.4)

for some ji and some si ∈ {m, d}, where S̃i is composed only of F-type, Wm-type, and Dm-type operations (possibly empty). Furthermore, it goes through all operations Fj for j from ji to i − 1.

Lemma 5.5 (Description of iteration l). There exists an optimal algorithm for Problem 2 that satisfies (P0–3) and (P4): There are no R-type operations in iteration Sl. Hence, every F-type operation is executed once and only once in iteration Sl.

Proof. Let S be an optimal schedule that satisfies (P0–3). We can write

S = Sl · F¯l · Sl−1 · F¯l−1 · ... · S0 · F¯0.

We know that at the end of the execution of Sl, the top buffer contains the value xl and the bottom buffer contains the value x̄l+1. Let Ml and Dl be the state of the memory and the disk at the end of the execution of Sl. Let Sl′ be the sequence of operations F0→l−1 where every operation Fi is immediately followed by (i) Wi^d Wi^m if xi ∈ Ml ∩ Dl; (ii) else, Wi^m if xi ∈ Ml; (iii) else Wi^d if xi ∈ Dl. At the end of the execution of Sl′, the memory and the disk will be in the states Ml and Dl. Furthermore, if xi ∈ Ml (resp. if xi ∈ Dl), the sequence Sl contains the operation Wi^m (resp. Wi^d). Thus all W-type operations of Sl′ are in Sl. Moreover, Sl has to contain all the forward steps from F0 to Fl−1. Thus, every operation in Sl′ is included in Sl. Hence, the sequence S′ = Sl′ · F̄l · Sl−1 · F̄l−1 · ... · S0 · F̄0 is valid and has a makespan not larger than S. S′ is then optimal and satisfies (P0–4).

Lemma 5.6. There exists an optimal algorithm for Problem 2 that satisfies (P0–4) and (P5): Given i, j, 0 ≤ i, j ≤ l − 1, all Wi^d operations are executed before any Wj^m operation.

Hence, during the first iteration Sl, we first assign the disk checkpoints before assigning the memory checkpoints. The idea is that the farther away in the graph a checkpoint is set, the more times it is going to be read during the execution.

Proof. Let S be an optimal algorithm that satisfies (P0–4), but not (P5). We show that we can transform it into an optimal algorithm that also satisfies (P5) without increasing the makespan. By contradiction, assume that there exist i and j such that Wi^m is executed before Wj^d. According to (P1(i)), every write on the disk occurs during the first iteration Sl. Thus Wi^m also occurs in iteration Sl. According to (P4), the F-type operations are not re-executed in iteration Sl. Thus, necessarily, i < j. According to (P0(ix)), since the algorithm wrote the value xi in the memory and the value xj on the disk, they are read later in the schedule. Let im be a step when Ri^m occurs and id a step when Rj^d occurs. Then we have: i ≤ im by (P0(vii)) and j ≤ id by (P0(vii)). Finally, there is only one R-type operation per step (Corollary 5.1); thus im ≠ id.

Assume first that im > id. From Corollary 5.1, the first operation of Sim is Ri^m. Thus the value xi is in the top buffer at the beginning of Sim. Furthermore, by definition of the steps, the value xim has to be in the top buffer at the end of Sim. Since there are no other R-type operations in Sim, all forward steps from Fi to Fim are executed in Sim. In particular Fj−1 is executed in Sim. Let n be the number of consecutive F-type operations right before Fj−1. Thus Sim has the shape Sim = Ri^m · S1 · F(j−n)→(j−1) · S2, where the last operation of S1 is not an F-type operation. Recall that the time for a disk read is rd and for a forward step is uf.

• Assume that (n + 1)uf > rd. Then the sequence Ri^m · S1 · Rj^d · S2 has a smaller execution time than Sim, contradicting the optimality.

• Assume that (n + 1)uf ≤ rd. According to Corollary 5.1, if S1 is not empty, then its last operation is Wj−n−1^m (if it is empty, then i = j − n − 1). (P2) ensures that xj−n−1 will not be discarded until after xj has become useless (after F̄j). Since (n + 1)uf ≤ rd, we could replace all future instances of Rj^d by Rj−n−1^m · F(j−n)→j. Hence Wj^d would be useless, which would contradict the optimality of the schedule.

Hence, im < id. In particular, this is true for any im and id. Then we can show with similar arguments that this is true until F̄i. Hence, there are no Rj^d until after F̄i. Since j > i, this means that there will not be any more Rj^d operations at all. Finally, this shows that the execution of Wj^d is useless and can be removed.

Proposition 5.1 (Disk Checkpoint Persistence). Given an optimal algorithm for Problem 2 that satisfies (P0–5), then after any operation Wi^d, there are no Fj operations for 0 ≤ j ≤ i − 1 (nor actions involving the values xj for 0 ≤ j ≤ i − 1) until after the execution of F̄i.

Proof. Let S be an optimal algorithm that satisfies (P0–5). Assume by contradiction that S includes Wi^d and there exist j ≥ i and i′ < i such that iteration j involves Fi′. In particular, according to Corollary 5.1, it involves Fi−1. According to Corollary 5.1, there exist k, s ∈ {m, d}, and n maximum such that

Sj = Rk^s · S1 · F(i−1−n)→(i−1) · S2.

We can first show that rd > nuf. Indeed, otherwise, Rk^s · S1 · Ri^d · S2 has a smaller execution time than Sj for the same result, contradicting the optimality. Let us now show that we can remove all appearances of Ri^d in the schedule, hence decreasing the execution time, which would contradict the optimality of the algorithm.

• Consider the occurrence of Ri^d after Sj. By maximality of n, if S1 is not empty, then the last operation of S1 is Wi−1−n^m (Corollary 5.1). If S1 is empty, then k = i − 1 − n and s = m (otherwise we could replace Rk^s F(i−1−n)→(i−1) by Ri^d, which would contradict the optimality of the algorithm). Thus, in both cases the value xi−1−n is stored in the memory during Sj, and (P2) ensures that it will not be discarded until after xi−1−n has become useless (after F̄i). Because rd > nuf, we can replace all later appearances of Ri^d by Ri−1−n^m F(i−1−n)→(i−1) at no additional time delay.

• Let us now consider the possible occurrences of Ri^d anterior to iteration j. Necessarily, all Ri^d anterior to Sj are followed by Fi (otherwise one of them is followed by Wi^m, which is not permitted by the memory checkpoint persistence property (P2)). Hence, we can store the value xi+1 in the disk instead of the value xi during Sl (Sl goes through all forward operations according to (P4)). Then, it is possible to replace all Ri^d Fi operations by Ri+1^d, reducing the execution time.

Hence, we can decrease the execution time by removing all appearances of Ri^d in the schedule, which shows the contradiction.

5.3.2 Optimal algorithm

We construct here a dynamic program that solves Problem 2 optimally. To do so we introduce two auxiliary dynamic programs during the construction. The time complexity of our optimal algorithm is O(l²).

Definition 5.4 (Opt0(l, cm)). Let l ∈ ℕ and cm ∈ ℕ; Opt0(l, cm) is the execution time of an optimal solution to PROB(l, cm).

Note that Opt0(l, cm) is the execution time of the routine REVOLVE(l, cm) from Griewank and Walther [58].

Definition 5.5 (Opt∞(l, cm, wd, rd)). Let l ∈ ℕ, cm ∈ ℕ, wd ∈ ℝ and rd ∈ ℝ; Opt∞(l, cm, wd, rd) is the execution time of an optimal solution to PROB∞(l, cm, wd, rd).
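For completeness, when wm = rm = 0 the quantity Opt0(l, cm) can be computed by a simple dynamic program. The sketch below is our own formulation of binomial checkpointing, not the REVOLVE code itself; the base cases (l = 1, and cm = 1, where the schedule keeps only x0 checkpointed and replays the forward sweep before every backward step) can be checked by hand, and cm = 0 with l ≥ 1 is infeasible.

```python
from functools import lru_cache

def make_opt0(uf, ub):
    """Return a memoized opt0(l, c): a sketch of Opt0(l, c) for wm = rm = 0."""
    @lru_cache(maxsize=None)
    def opt0(l, c):
        if l == 0:
            return ub                       # only the backward step F̄0 remains
        assert c >= 1, "with no memory slot, l >= 1 is infeasible"
        if l == 1:
            return uf + 2 * ub              # W0^m, F0, F̄1, R0^m, F̄0
        if c == 1:
            # only x0 stays checkpointed: replay the forward sweep each time
            return l * (l + 1) // 2 * uf + (l + 1) * ub
        # store x0, advance j steps, solve the top l - j steps with c - 1 free
        # slots, then return to x0 (free read) and solve j - 1 steps with c slots
        return min(j * uf + opt0(l - j, c - 1) + opt0(j - 1, c)
                   for j in range(1, l))
    return opt0
```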

To compute Opt∞(l, cm, wd, rd), we first focus on the variant of Problem 2 where the input value x0 is initially in both the top buffer and the disk:

Problem 3 (PROB∞^(d)(l, cm, wd, rd)). We want to minimize the makespan of the AC problem with the following parameters:

Initial state:
• AC graph: size l; steps: uf, ub
• Memory: cm slots, wm = rm = 0, Mini = ∅
• Disk: cd = +∞, wd, rd, Dini = {x0}
• Buffers: B⊤ini = {x0}, B⊥ini = {x̄l+1}

Definition 5.6 (PROB∞^(d)(l, cm, wd, rd)). Let l ∈ ℕ, cm ∈ ℕ, wd ∈ ℝ and rd ∈ ℝ; PROB∞^(d)(l, cm, wd, rd) is the subproblem of Problem 3 where the space of solutions is restricted to the schedules that satisfy the following properties (P6):

(i) (P0(i)), (P0(ii)), (P0(iii)), (P0(iv)), and (P0(v)) are matched.

(ii) Given i, there are no operations Fj for i ≤ j ≤ l − 1 after the execution of F̄i.

(iii) (Memory and disk checkpoint persistence) Let s ∈ {m, d} and i < l; if Wi^s is executed, then there are no Di^s until after the execution of F̄i. Moreover, no operations involving F0 to Fi−1 or values x0 to xi−1 are taken until after the execution of F̄i.

(iv) Let 0 ≤ i, j ≤ l − 1. All Wi^d operations are executed before any Wj^m operation. Moreover, all Wi^d operations are executed during iteration l.

Note that (P6) is a subset of (P0–5). Let Opt∞^(d)(l, cm, wd, rd) be the execution time of an optimal solution to PROB∞^(d)(l, cm, wd, rd).

Theorem 5.1 (Optimal solution to PROB∞(l, cm, wd, rd)). Let l ∈ ℕ, cm ∈ ℕ, wd ∈ ℝ, and rd ∈ ℝ.

Opt∞(l, cm, wd, rd) = min { Opt0(l, cm) ; wd + Opt∞^(d)(l, cm, wd, rd) }

Proof. Let

A = Opt∞(l, cm, wd, rd)

B = min { Opt0(l, cm) ; wd + Opt∞^(d)(l, cm, wd, rd) }

Let us show that A ≤ B. Every solution to PROB(l, cm) is also a solution to PROB∞(l, cm, wd, rd). Hence,

Opt∞(l, cm, wd, rd) ≤ Opt0(l, cm).

Let S̄∞^(d) be a solution to PROB∞^(d)(l, cm, wd, rd). Then the sequence W0^d · S̄∞^(d) is a solution to PROB∞(l, cm, wd, rd). Indeed, the only difference between PROB∞^(d)(l, cm, wd, rd) and PROB∞(l, cm, wd, rd) is that in PROB∞^(d)(l, cm, wd, rd), the value x0 is initially stored on the disk. Thus

Opt∞(l, cm, wd, rd) ≤ wd + Opt∞^(d)(l, cm, wd, rd)

and Opt∞(l, cm, wd, rd) ≤ min{Opt0(l, cm); wd + Opt∞^(d)(l, cm, wd, rd)}.

Let us show that A ≥ B. According to § 5.3.1, there exists at least one optimal algorithm S∞ solving PROB∞(l, cm, wd, rd) that satisfies (P0–5). According to (P0(x)), the first operation of S∞ is a W-type operation.

• If it is a Wm-type operation, according to (P5), there are no Wd-type operations in S∞. Hence the disk is not used at all in S∞, and S∞ is also a solution to PROB(l, cm). Thus

Opt∞(l, cm, wd, rd) ≥ Opt0(l, cm).

• If it is a Wd-type operation, S∞ has the shape S∞ = W0^d · S∞′, where S∞′ is a solution to PROB∞^(d)(l, cm, wd, rd). Since S∞ satisfies (P0), (P1), (P2), (P3), (P4), and (P5), S∞′ satisfies (P6). Hence S∞′ is a solution to PROB∞^(d)(l, cm, wd, rd). Thus

Opt∞(l, cm, wd, rd) ≥ wd + Opt∞^(d)(l, cm, wd, rd).

We get that Opt∞(l, cm, wd, rd) ≥ min{Opt0(l, cm); wd + Opt∞^(d)(l, cm, wd, rd)}, which concludes the proof.

To compute Opt∞^(d)(l, cm, wd, rd), we need to consider the problem with only one disk slot containing x0 at the beginning of the execution:

Problem 4 (PROB1^(d)(l, cm, wd, rd)). We want to minimize the makespan of the AC problem with the following parameters:

Initial state:
• AC graph: size l; steps: uf, ub
• Memory: cm slots, wm = rm = 0, Mini = ∅
• Disk: cd = 1, wd, rd, Dini = {x0}
• Buffers: B⊤ini = {x0}, B⊥ini = {x̄l+1}

Definition 5.7 (PROB1^(d)(l, cm, wd, rd)). Let l ∈ ℕ, cm ∈ ℕ, wd ∈ ℝ and rd ∈ ℝ; PROB1^(d)(l, cm, wd, rd) is the subproblem of Problem 4 where the space of solutions is restricted to the schedules that satisfy (P6) and that do not contain any Wd-type operation (and, therefore, the value x0 is never discarded from the disk).

Let Opt1^(d)(l, cm, wd, rd) be the execution time of an optimal solution to PROB1^(d)(l, cm, wd, rd).

Theorem 5.2 (Optimal solution to PROB∞^(d)(l, cm, wd, rd)). Given l ∈ ℕ, cm ∈ ℕ, wd ∈ ℝ and rd ∈ ℝ: If l = 0, then

Opt∞^(d)(0, cm, wd, rd) = ub

else

Opt∞^(d)(l, cm, wd, rd) = min { Opt1^(d)(l, cm, wd, rd) ; min_{1 ≤ j ≤ l−1} ( juf + Opt∞(l − j, cm, wd, rd) + rd + Opt1^(d)(j − 1, cm, wd, rd) ) }

Proof. For l = 0, the result is immediate. Let us prove the result for l ≥ 1. Let

A = Opt∞^(d)(l, cm, wd, rd)

B = min { Opt1^(d)(l, cm, wd, rd) ; min_{1 ≤ j ≤ l−1} ( juf + Opt∞(l − j, cm, wd, rd) + rd + Opt1^(d)(j − 1, cm, wd, rd) ) }

Let us show that A ≤ B. Every solution to PROB1^(d)(l, cm, wd, rd) is also a solution to PROB∞^(d)(l, cm, wd, rd). Hence,

Opt∞^(d)(l, cm, wd, rd) ≤ Opt1^(d)(l, cm, wd, rd).

Given j, 1 ≤ j ≤ l − 1, let S1 be an optimal solution to PROB1^(d)(j − 1, cm, wd, rd). Let S∞ be an optimal solution to PROB∞(l − j, cm, wd, rd) that satisfies (P0–5). Let S∞′ be the sequence S∞ where every index of the operations is increased by j (Fi becomes Fi+j, Wi^m becomes Wi+j^m, . . . ). S∞′ is still valid and has the same makespan as S∞. Then, the sequence F0→j · S∞′ · R0^d · S1 is a solution to PROB∞^(d)(l, cm, wd, rd). By construction this sequence also satisfies (P6). Its execution time is juf + Opt∞(l − j, cm, wd, rd) + rd + Opt1^(d)(j − 1, cm, wd, rd). Thus, for all 1 ≤ j ≤ l − 1:

Opt∞^(d)(l, cm, wd, rd) ≤ juf + Opt∞(l − j, cm, wd, rd) + rd + Opt1^(d)(j − 1, cm, wd, rd).

In particular it is smaller than the minimum over all j, hence the result.

Let us show that A ≥ B. Let S∞^(d) be an optimal solution to PROB∞^(d)(l, cm, wd, rd). S∞^(d) satisfies (P6) and its makespan is Opt∞^(d)(l, cm, wd, rd).

Assume first that there is at least one W-type operation in S∞^(d). Consider the first one. We can prove that it occurs before the first F̄-type operation.

• If it is a Wd-type, then it occurs before the first F̄-type according to (P6(iv)).

• If it is a Wm-type, then obviously cm > 0. If no Wm-operation occurred during iteration l, then at the beginning of iteration l − 1, R0^d is executed. Hence a better solution would start with W0^m and D0^m at the beginning of iteration l − 1, which contradicts the optimality of S∞^(d).

Hence, the first W-type operation in S∞^(d) occurs before the first F̄-type operation. Then two possibilities exist.

• The first operation in S∞^(d) is W0^m. Since S∞^(d) satisfies (P6(iv)), there are no Wd-type operations in S∞^(d). Thus no other value than x0 will be stored in the disk during the execution. Furthermore, because x0 is stored in memory, it will not be read from the disk (otherwise S∞^(d) would not be optimal). Thus S∞^(d) is also a solution to PROB1^(d)(l, cm, wd, rd) and

Opt∞^(d)(l, cm, wd, rd) ≥ Opt1^(d)(l, cm, wd, rd).

• Otherwise, the first operation in S∞^(d) is not W0^m. Consider the first W-type operation. Because it occurs before the first F̄-type operation, it cannot be W0^m (otherwise it would be the first operation in S∞^(d)), nor W0^d (x0 is already stored on disk). Let Wj^s (s ∈ {m, d}) be the first W-type operation in S∞^(d); then j > 0.

⋆ We proved that there are no F̄-type operations before Wj^s in S∞^(d). By definition, there are no W-type operations before Wj^s in S∞^(d). Since before Wj^s the memory is empty, there are no D-type operations in S∞^(d). Since the only value in one of the storages is x0 (in the disk), the only possible R-type operation before Wj^s would be R0^d. The only reason to execute R0^d would be to perform F0. However, F0 can be executed at the beginning of S∞^(d) at no cost, since the value x0 is already in the top buffer. Thus, there are only F-type operations before Wj^s in S∞^(d) (P6(i)).

⋆ According to (P6(iii)), after Wj^s, there are no operations involving values x0 to xj−1 until after the operation F̄j.

⋆ According to (P6(ii)), there are no operations involving values xj to xl after the operation F̄j.

⋆ Since after the operation F̄j the content of the top buffer is useless, the first operation after F̄j has to be an R-type operation.

⋆ Moreover, since the only value from x0 to xj−1 in one of the storages after F̄j is x0 (in the disk), it has to be R0^d. Thus, based on all these considerations, S∞^(d) has the following shape:

S∞^(d) = F_{0→j} · W^s_j · S_1 · R^d_0 · S_2

where (i) no operations involve values x_0 to x_{j−1} in S_1, and (ii) no operations involve values x_j to x_l in S_2.
Let S_1′ be the sequence W^s_j · S_1, where every operation index is decreased by j (F_i becomes F_{i−j}, W^m_i becomes W^m_{i−j}, . . . ). Then S_1′ is a solution to PROB∞(l − j, cm, wd, rd), whose makespan is necessarily not smaller than Opt∞(l − j, cm, wd, rd).

On the other hand, the sequence S_2 executes all the F̄-type operations from F̄_{j−1} down to F̄_0 with no operations involving values x_j to x_l. Furthermore, S_2 does not use disk slots, except the one already used by value x_0. Since S∞^(d) satisfies (P0–5), S_2 satisfies (P6). Hence S_2 is a solution to PROB1^(d)(j − 1, cm, wd, rd), and its makespan is not smaller than Opt1^(d)(j − 1, cm, wd, rd). Finally, we have

Opt∞^(d)(l, cm, wd, rd) ≥ j·uf + Opt∞(l − j, cm, wd, rd) + rd + Opt1^(d)(j − 1, cm, wd, rd);

in particular, it is not smaller than B.

Note that if there is no W-type operation in S∞^(d), then, since no additional disk slot is used, S∞^(d) is also a solution to PROB1^(d)(l, cm, wd, rd) and Opt∞^(d)(l, cm, wd, rd) ≥ Opt1^(d)(l, cm, wd, rd). This shows that A ≥ B and concludes the proof. □
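Combining the two directions A ≤ B and A ≥ B, the recurrence established by this proof can be written compactly as:

Opt∞^(d)(l, cm, wd, rd) = min ( Opt1^(d)(l, cm, wd, rd), min_{1≤j≤l−1} ( j·uf + Opt∞(l − j, cm, wd, rd) + rd + Opt1^(d)(j − 1, cm, wd, rd) ) ).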

Theorem 5.3 (Optimal solution to PROB1^(d)(l, cm, wd, rd)). Let l ∈ ℕ, cm ∈ ℕ, wd ∈ ℝ and rd ∈ ℝ. If l = 0, then Opt1^(d)(0, cm, wd, rd) = ub; else

Opt1^(d)(l, cm, wd, rd) = min ( Opt0(l, cm), min_{1≤j≤l−1} ( j·uf + Opt0(l − j, cm) + rd + Opt1^(d)(j − 1, cm, wd, rd) ) ).

Proof. For l = 0, the result is immediate. Let us prove the result for l ≥ 1.

Let

A = Opt1^(d)(l, cm, wd, rd)

B = min ( Opt0(l, cm), min_{1≤j≤l−1} ( j·uf + Opt0(l − j, cm) + rd + Opt1^(d)(j − 1, cm, wd, rd) ) )

Let us show that A ≤ B. Every solution to PROB(l, cm) is also a solution to PROB1^(d)(l, cm, wd, rd). Hence, Opt1^(d)(l, cm, wd, rd) ≤ Opt0(l, cm).

Given j, 1 ≤ j ≤ l − 1, let S_2 be an optimal solution to PROB1^(d)(j − 1, cm, wd, rd), and let S_1 be an optimal solution to PROB(l − j, cm). Let S_1′ be the sequence S_1 where every operation index is increased by j (F_i becomes F_{i+j}, W^m_i becomes W^m_{i+j}, . . . ). S_1′ is still valid and has the same makespan as S_1. Then, the sequence F_{0→j} · S_1′ · R^d_0 · S_2 is a solution to PROB1^(d)(l, cm, wd, rd). By construction this sequence does not contain any W^d-type operation, since neither S_1′ nor S_2 does. Its execution time is j·uf + Opt0(l − j, cm) + rd + Opt1^(d)(j − 1, cm, wd, rd). Thus, for all 1 ≤ j ≤ l − 1:

Opt1^(d)(l, cm, wd, rd) ≤ j·uf + Opt0(l − j, cm) + rd + Opt1^(d)(j − 1, cm, wd, rd).

Since this inequality holds for every j, Opt1^(d)(l, cm, wd, rd) is at most the minimum over all j, hence A ≤ B.

Let us show that A ≥ B. Let S_1^(d) be an optimal solution to PROB1^(d)(l, cm, wd, rd). S_1^(d) satisfies (P6) and does not contain any W^d-type operation. Its makespan is Opt1^(d)(l, cm, wd, rd).

First, note that if cm > 0, then there is a W^m-type operation in iteration l of S_1^(d): otherwise, if no W^m-operation occurred during iteration l, then at the beginning of iteration l − 1, R^d_0 is executed, and it would be better to start with W^m_0 and D^m_0 at the beginning of iteration l − 1, which contradicts the optimality of S_1^(d).

Hence, the first W-type operation in S_1^(d) occurs before the first F̄-type operation. Then there are two possibilities.

• The first operation in S_1^(d) is W^m_0. Because x_0 is stored in memory, it will not be read from the disk (otherwise S_1^(d) would not be optimal). Thus S_1^(d) is also a solution to PROB(l, cm) and Opt1^(d)(l, cm, wd, rd) ≥ Opt0(l, cm).

• Otherwise, the first operation in S_1^(d) is not W^m_0. Consider the first W-type operation. It occurs before the first F̄-type operation and hence cannot be W^m_0 (otherwise it would be the first operation in S_1^(d)). Let W^m_j (there are no W^d-type operations) be the first W-type operation in S_1^(d), with j > 0.
⋆ We proved that there are no F̄-type operations before W^m_j in S_1^(d). By definition, there are no W-type operations before W^m_j in S_1^(d). Since the only value in one of the storage slots is x_0 (on the disk), the only possible R-type operation before W^m_j would be R^d_0. The only reason to execute R^d_0 would be to perform F_0. However, F_0 can be executed at the beginning of S_1^(d) at no cost, since the value x_0 is already in the top buffer. Thus, there are only F-type operations before W^m_j in S_1^(d) (P6(i)).
⋆ According to (P6(iii)), after W^m_j, there are no operations involving values x_0 to x_{j−1} until after

the operation F̄_j.
⋆ According to (P6(ii)), there are no operations involving values x_j to x_l after the operation F̄_j.
⋆ Since after the operation F̄_j the content of the top buffer is useless, the first operation after F̄_j has to be an R-type operation.
⋆ Moreover, since the only value from x_0 to x_{j−1} in one of the storage slots after F̄_j is x_0 (on the disk), it has to be R^d_0. Based on all these considerations, S_1^(d) has the following shape:

S_1^(d) = F_{0→j} · W^m_j · S_1 · R^d_0 · S_2

where (i) no operations involve values x_0 to x_{j−1} in S_1, and (ii) no operations involve values x_j to x_l in S_2.
Let S_1′ be the sequence W^m_j · S_1, where every operation index is decreased by j (F_i becomes F_{i−j}, W^m_i becomes W^m_{i−j}, . . . ). Then S_1′ is a solution to PROB(l − j, cm), whose makespan is necessarily not smaller than Opt0(l − j, cm).

On the other hand, the sequence S_2 executes all the F̄-type operations from F̄_{j−1} down to F̄_0 with no operations involving values x_j to x_l. Furthermore, S_2 does not use disk slots, except the one already used by value x_0. Thus S_2 is a solution to PROB1^(d)(j − 1, cm, wd, rd), and its makespan is not smaller than Opt1^(d)(j − 1, cm, wd, rd). Finally, we have

Opt1^(d)(l, cm, wd, rd) ≥ j·uf + Opt0(l − j, cm) + rd + Opt1^(d)(j − 1, cm, wd, rd);

in particular, it is not smaller than B.

Note that if there is no W-type operation in S_1^(d), then, since no additional storage slot is used, S_1^(d) is also a solution to PROB(l, cm) and Opt1^(d)(l, cm, wd, rd) ≥ Opt0(l, cm). This shows that A ≥ B and concludes the proof. □
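The recurrence of Theorem 5.3 translates directly into a memoized dynamic program. The sketch below is ours, not code from the thesis: it assumes an oracle opt0(l, cm) returning the optimal makespan of the memory-only problem PROB(l, cm) (e.g., as computed by REVOLVE), together with the costs uf, ub and rd; all names are illustrative.

    from functools import lru_cache

    def make_opt1(opt0, uf, ub, rd):
        """Memoized evaluation of the recurrence of Theorem 5.3.

        opt0(l, cm) -- assumed oracle for the memory-only problem PROB(l, cm)
        uf, ub      -- costs of one forward and one backward step
        rd          -- cost of reading a checkpoint back from the disk
        """
        @lru_cache(maxsize=None)
        def opt1(l, cm):
            if l == 0:
                return ub              # base case: Opt1^(d)(0, cm, wd, rd) = ub
            best = opt0(l, cm)         # first branch: never touch the disk again
            # second branch: advance j forward steps, solve the last l-j steps
            # in memory only, read x_0 back from the disk (cost rd), then
            # recurse on the first j-1 steps
            for j in range(1, l):
                best = min(best, j * uf + opt0(l - j, cm) + rd + opt1(j - 1, cm))
            return best
        return opt1

Note that wd does not appear in the recurrence: a solution to PROB1^(d) never writes to the disk, since the only disk checkpoint, x_0, is already there.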

5.4 Simulations

In this section we compare our optimal algorithm with the only algorithm for multilevel checkpointing that we are aware of, introduced by Stumm and Walther [101].

5.4.1 Stumm and Walther's algorithm (SWA⋆(l, cm, wd, rd))

Stumm and Walther [101] solve Problem 2 using a variant of REVOLVE [107]. REVOLVE takes as arguments l, the size of the AC graph, and s, the number of storage slots, and returns an optimal solution to Problem 1. Stumm and Walther show that in REVOLVE(l, s), some storage slots are used less than others. They design the SWA algorithm, which takes as arguments l, the size of the AC graph, cm, the number of memory slots, and cd, the number of disk slots, and returns the solution REVOLVE(l, cd + cm) in which the cd least-used storage slots are considered as disk slots and all the others are considered as memory slots. To solve Problem 2, SWA⋆ returns the best solution among the solutions returned by SWA(l, cm, cd, wd, rd), that is,

SWA⋆(l, cm, wd, rd) = min_{cd=0,...,l−cm} SWA(l, cm, cd, wd, rd)

(having more than l storage slots is useless).

5.4.2 Simulation setup

For the simulations we have tested our algorithm and Stumm and Walther's algorithm on AC graphs of size up to 20,000, with different numbers of memory checkpoints. In global ocean circulation modeling [64], a graph of size 8,640 represents one year of results with an hourly timestep.

In the experiments, we normalize all time values by setting uf to 1. We take ub = 2.5 as a representative value [64]. Here we present results for cm ∈ {2, 5, 10, 25} and wd = rd ∈ {1, 2, 5, 10}.

In Figure 5.5 we reproduce Stumm and Walther's results in order to study the behavior of SWA. We plot the execution time of SWA as a function of cd for a fixed graph size (note that we use a logarithmic scale for the horizontal axis for better readability). We compare it with the optimal solution

Opt∞(l, cm, wd, rd).

In Figures 5.6 and 5.7 we plot the ratio between SWA⋆ and Opt∞(l, cm, wd, rd) as a function of the size of the AC graph, for different values of cm, wd and rd.

5.4.3 Simulation results

[Figure 5.5: Makespan of SWA on an AC graph of size 10,000 as a function of cd, for cm = 5 and wd = rd ∈ {2, 5}. SWA⋆(l, cm, wd, rd) and Opt∞(l, cm, wd, rd) are also plotted for comparison. Vertical axis: computational time (×1000); horizontal axis: number of disk slots cd (logarithmic scale).]

First we observe the behavior of SWA as a function of the number of available disk slots, for different disk access costs. The two plots in Figure 5.5 are representative of the two behaviors we observed for SWA during our experiments. We can see that the makespan of SWA(l, cm, cd, wd, rd) is always strictly higher than the optimal makespan for an infinite number of disk slots, Opt∞(l, cm, wd, rd) (see the dotted line). For all values studied, the evolution of the execution time of SWA when the amount of storage increases follows a specific pattern that can be divided into three phases.
1. A very fast decrease with the first additional disk slots.

2. A succession of small increases and decreases.
3. A slow but steady decrease until all steps are stored (remember that the horizontal axis has a logarithmic scale for better readability).

In all our experiments, the minimum for SWA is reached either at the end of phase 1, with a very low number of disk checkpoints (wd = rd = 5 in Figure 5.5), or at the end of phase 3, with a very high number of disk checkpoints (wd = rd = 2 in Figure 5.5), depending on the disk access costs and the number of memory slots. Given the general shape of the SWA performance curves, we assume that when the graph is large enough, the minimum value is always reached at the end of phase 3, when every output of the AC graph is stored in one of the storage slots.

Note that Stumm and Walther observed a fourth phase [101], where the computational time increases again when the number of disk checkpoints gets close to the total number of steps. They explained it by noting that when the volume of data stored on the disk reaches a threshold, the cost of a disk access increases, which in turn increases the computational time. We do not observe such a fourth phase because we plot the computational time obtained by giving the model parameters as input to SWA (where the cost of a disk access remains constant).

The time complexity of SWA⋆ does not allow us to solve large instances. To be able to plot SWA⋆ for large AC graphs, we use a faster version of SWA⋆ that takes the previous remarks into account, namely, that the minimum is reached either at the end of phase 1 (for small values of cd) or at the end of phase 3. More precisely, we consider that the end of phase 1 is reached before cd = 200 disk slots (for the problem sizes considered in this chapter), and we plot the faster version of SWA⋆(l, cm, wd, rd):

SWA⋆(l, cm, wd, rd) = min ( min_{cd=0,...,200} SWA(l, cm, cd, wd, rd), SWA(l, cm, l − cm, wd, rd) ).

This assumption allows us to compute SWA⋆ for large values of l and to compare it with the optimal computational time for an infinite number of disk slots.
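A direct transcription of this faster variant could look as follows. This is our own sketch, not thesis code: swa is assumed to be an available implementation of SWA(l, cm, cd, wd, rd) (for instance, a simulator that runs REVOLVE(l, cd + cm) with the cd least-used slots relabeled as disk slots), and the cutoff of 200 disk slots mirrors the remark above.

    def swa_star_fast(swa, l, cm, wd, rd, cutoff=200):
        """Fast variant of SWA*(l, cm, wd, rd) described above.

        Only two regions are searched: the end of phase 1 (cd <= cutoff)
        and the end of phase 3 (cd = l - cm, every forward output stored).
        """
        end_phase1 = min(swa(l, cm, cd, wd, rd)
                         for cd in range(0, min(cutoff, l - cm) + 1))
        end_phase3 = swa(l, cm, l - cm, wd, rd)
        return min(end_phase1, end_phase3)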

[Figure 5.6: Ratio SWA⋆(l, cm, wd, rd)/Opt∞(l, cm, wd, rd) as a function of l, the size of the AC graph, for cm ∈ {2, 5, 10, 25}; (a) wd = rd = 1, (b) wd = rd = 5.]

Figures 5.6 and 5.7 depict the overhead of using SWA⋆ compared with the optimal solution Opt∞(l, cm, wd, rd) that we designed in §5.3.2. For unlimited disk slots, SWA⋆ returns the best solution among the solutions returned by SWA, and this solution is always greater than Opt∞(l, cm, wd, rd).

[Figure 5.7: Ratio SWA⋆(l, cm, wd, rd)/Opt∞(l, cm, wd, rd) as a function of l, the size of the AC graph, for cm = 5 and wd = rd ∈ {1, 2, 5, 10}.]

We observe that the ratio increases until the size of the graph reaches a threshold, after which the ratio remains constant. This is particularly visible in Figure 5.6a, where we can see that the value of this threshold increases with the number of memory slots cm. In practice, the threshold marks the moment when the number of disk slots used by SWA⋆ jumps from a relatively small number (when the minimum for SWA(l, cm, cd, wd, rd) is reached at the end of phase 1) to cd = l − cm (when the minimum for SWA(l, cm, cd, wd, rd) is reached when all forward steps are checkpointed, at the end of phase 3).

We are interested in the limit ratio reached after the threshold, because we are considering the problem for very large graphs. In Figures 5.6a and 5.7 we can see that this ratio increases when cm increases, or when wd and rd increase. When rd = wd = 1, the limit ratio for cm = 2 is approximately 1.14, which means that SWA⋆ is 14% slower than the optimal algorithm we designed in §5.3.2. For a memory of size cm = 10, this overhead increases to 20% for large AC graphs. When rd = wd = 5, the limit ratio is not reached for AC graphs of size smaller than 20,000; but since the ratio for cm = 2 will be higher than 1.6, we can state that SWA⋆ will perform at least 60% slower than the optimal algorithm on large AC graphs, for any memory size.

5.5 Conclusion and future work

In this chapter we have provided optimal algorithms for the adjoint checkpointing problem with two storage locations: a bounded number of memory slots with zero access cost, and an infinite number of disk slots with given write and read costs. We have compared our optimal solution with existing work, showing that our solution yields significantly better execution times.

We have identified applications in computational fluid dynamics and earth systems modeling that could benefit from our approach. We will examine whether the theoretical benefits of the optimal multistage schedule can be realized in practice. Future theoretical directions include the solution of the online AC problem (where the size l of the AC graph is not known before execution), within the same framework. Another possible extension would be to solve the same problem as in this chapter but with a limited number of disk checkpoints. Large-scale platforms are failure-prone, and checkpointing for resilience in addition to checkpointing for performance will lead to challenging algorithmic problems. As an intermediate step, we will examine the problem of maximizing progress during a fixed time period. This situation arises in practice when the job scheduler limits the maximum duration of jobs (limits such as 12 hours are common).

Conclusion

Throughout this thesis, we have designed memory-aware algorithms and scheduling techniques suited to modern memory architectures, with a particular interest in improving the performance of matrix computations at multiple levels. At a high level, we have introduced new numerical algorithms for solving linear systems on large distributed platforms. Most of the time, these linear solvers rely on runtime systems to handle resource allocation and data management. We also focused on improving the dynamic schedulers embedded in these runtime systems by adding static information to their decision process, and we proposed new memory-aware dynamic heuristics to schedule workflows that could be implemented in such runtime systems. Altogether, we have dealt with multiple state-of-the-art factorization algorithms used to solve linear systems, like the LU, QR and Cholesky factorizations. We targeted different platforms ranging from multicore processors to distributed-memory clusters, and worked with several reference runtime systems tailored for these architectures, such as PARSEC and StarPU. On the theoretical side, we took special care in modeling convoluted hierarchical memory architectures. We have classified the problems that arise when dealing with these storage platforms, and we have designed many efficient polynomial-time heuristics for general problems that had previously been shown to be NP-complete. Our main contributions are summarized in the following paragraphs.

Chapter 1: Mixing LU and QR factorization algorithms to design high-performance dense linear algebra solvers

The first contribution of this thesis is the introduction of a new linear solver that offers an alternative to state-of-the-art solvers on distributed platforms. We pointed out that LU steps and QR steps can be mixed during a factorization. The resulting hybrid LU-QR factorization accelerates the classical QR algorithm by introducing some LU steps whenever these do not compromise stability. We decided to implement the hybrid LU-QR algorithm using the HQR algorithm for the QR steps and the LUPP algorithm for the LU steps. The HQR algorithm offers a higher degree of parallelism than the classical QR factorization by introducing several eliminator tiles inside a panel: all tiles hosted on the same processor are eliminated by the same tile, and these eliminator tiles are then zeroed out by the diagonal tile using a reduction tree across processors. The LUPP algorithm is the classical LU factorization with partial pivoting, which finds the largest element of each column of the panel and uses it as pivot. Theoretically, the hybrid factorization concept can be extended, and any variant of the LU and QR factorizations could be used. We could even alternate between two different variants of the LU algorithm, such as the LU factorization without pivoting, which offers a high level of parallelism, and the LU factorization with partial pivoting, which provides better stability in practice. This chapter opens the way to new factorization methods with high potential.

The choice between performing an LU step and a QR step is handled by a robustness criterion. In this chapter, we introduced three robustness criteria, all of them motivated by a numerical study. The criteria rely on a threshold α that allows one to tighten or loosen the stability requirement. Moreover, we

have compared the effective performance of all criteria on actual distributed clusters, by conducting an extensive set of experiments. We have discussed how the size of these platforms impacts performance and stability. We saw that the three criteria provide a wide range of stability/performance trade-offs on random matrices, thanks to the tunable threshold α. They also manage to detect stability deterioration on most tested ill-conditioned matrices, on which LUPP fails because of large growth factors. The Max criterion appeared to be the best choice, since it did not fail on any pathological matrix of Higham's Matrix Computation Toolbox [67].

Chapter 2: Bridging the gap between experimental performance and theoretical bounds for the Cholesky factorization on heterogeneous platforms

In this chapter, we have investigated to what extent the performance of dynamic schedulers for the Cholesky factorization on heterogeneous platforms can be improved. Runtime systems usually use generic dynamic heuristics to allocate tasks onto available resources. The HEFT heuristic has the advantage of being simple to implement and of providing good performance in practice. We investigated how its local vision of the task graph to schedule may result in bad local decisions that affect the overall performance, and we studied how adding static information to dynamic schedulers can prevent this phenomenon. We analytically computed a set of TRSM tasks that could be forced to execute on CPUs, even if their processing time would have been shorter on the GPUs. We selected these tasks based on their distance to the critical path of the application. Forcing the dynamic scheduler of the StarPU runtime system to make these non-intuitive decisions led to performance improvements in both simulations and actual experiments. To quantify the room for improvement between these results and the best feasible solution, we derived upper bounds on the best achievable performance. We proposed to consider three bounds: one involving the critical path of the Cholesky factorization, one based on the maximum workload that the resources can process, and a third one computed by a linear program using both notions. Experiments showed that we managed to raise the StarPU scheduler performance close to its best achievable result.

Chapter 3: Memory-aware list scheduling for hybrid platforms

This chapter has addressed the problem of scheduling general workflows on a heterogeneous architecture with two memories. We provided an extensive complexity study for the restricted problem where the workflow is tree-shaped and the tasks are already mapped on the resources. We showed the NP-hardness of determining whether there exists a traversal of the tree that does not violate the memory constraints when both memories are limited. Worse, given two constants α and β, it is impossible in the general case to find a traversal that is both an α-approximation for the minimization of one memory peak and a β-approximation for the minimization of the other memory peak. However, we provided the optimal depth-first traversal for an arbitrary tree, which turns out to minimize both memories simultaneously.

In addition, we considered the general problem of scheduling an arbitrary workflow where each task can be mapped onto either resource, and has a different processing time for each type. We designed two dynamic memory-aware algorithms, MEMHEFT and MEMMINMIN, based on the reference heuristics HEFT and MINMIN. MEMHEFT (respectively MEMMINMIN) behaves like HEFT (respectively MINMIN) when the memory constraints are not critical, and provides valid schedules with a reasonable makespan overhead when the reference heuristics fail to respect the memory constraints. These memory-aware heuristics could be implemented in runtime systems like PARSEC or StarPU when one wants to enforce a memory usage bound to avoid out-of-core memory accesses, for instance.

Chapter 4: Assessing the cost of redistribution followed by a computational kernel

In this chapter, we took into consideration the fact that sometimes, in the context of distributed computing, the initial data distribution is not suited to the computational kernel that we want to process. In such situations, the data elements have to be redistributed across the resources. We assumed that an optimal (or close to optimal) data partition for the computational kernel is known. We provided polynomial algorithms to compute the best redistribution, given the target data partition, for two metrics: minimizing the total volume of communications during the redistribution, and minimizing the number of parallel redistribution steps. However, we showed that, when the redistribution is followed by a computational kernel, redistributing the data elements toward this target data partition is not always the best option in terms of overall completion time. Actually, finding the optimal data partition that minimizes the completion time of the redistribution followed by a computational kernel is NP-complete, even for a simple computational kernel like the 1D-stencil algorithm. We considered four strategies to tackle this problem and tested them in the PARSEC framework, for both the 1D-stencil kernel and the QR factorization algorithm. Our experiments showed that the new redistribution strategies presented in this chapter lead to better performance in all cases, for arbitrary initial data distributions as well as for data distributions arising from Earth Science applications.

Chapter 5: Optimal multistage algorithm for adjoint computation

The last contribution of this thesis concerns a different scientific application, namely adjoint computation. The workflow arising in adjoint computation provides no parallelism across tasks and, in order to meet the memory usage constraints, we needed a different strategy than in the rest of this thesis. Indeed, since there is not enough space in memory to store every output file of the workflow, we need to decide which files will be saved, and which files will be deleted and recomputed later when needed. The case where only one limited memory is available had already been solved in the literature. In this chapter we considered the case where two different storage locations are available: the memory and the disk. Reading and writing checkpoints in memory is assumed to be free in terms of time, but the number of available slots is limited. On the contrary, the number of slots on the disk is unlimited, but the time to read or write a file can no longer be ignored. We provided the first polynomial-time algorithm that computes the best schedule for adjoint computation in this dual-storage context. Our algorithm relies on a convoluted dynamic program, and we proved the optimality of the computed solution. We also assessed the performance of existing heuristics for this problem, compared with our optimal solution, through a full set of simulations.

Perspectives

Throughout this thesis, we mentioned at the end of each chapter some future work and research directions that could be investigated. Here, we outline multiple possible extensions to our work, along with more general, long-term research directions.

Linear algebra

In this thesis we have laid the foundation for hybrid linear solvers that alternate steps of different factorization algorithms. The main quality of our hybrid LU-QR algorithm is that it restricts the search for pivots to the diagonal domain when doing so is numerically safe, thereby avoiding unnecessary communications across processors. But many variants of the hybrid LU-QR algorithm introduced in this thesis could be implemented. An interesting application could be to derive LU algorithms with several eliminators per panel. Such "multi-killer" algorithms are well suited to modern distributed architectures. Indeed, at each step of the factorization, every tile of the panel except one per domain is zeroed out using a local tile hosted on the same processor; these operations do not involve communications across processors. Then, the remaining tiles are zeroed out using a reduction tree across domains. This approach has been proven effective for the QR factorization [42]. However, multi-killer LU factorizations have always turned out to be unstable so far. A hybrid algorithm coupling a multi-killer LU factorization with another, more stable LU algorithm, like the LU factorization with partial pivoting, could be a pioneering method to develop a stable and effective LU algorithm with several eliminators per panel. Another course of study could be to focus on designing new factorization algorithms that are intrinsically communication-avoiding. This approach has been shown to be effective, and the communication-avoiding LU factorization (CALU [61]) is known to be stable and efficient on both parallel distributed platforms and single multicore nodes.

Scheduling under memory constraints

Due to a lack of time, the memory-aware scheduling techniques introduced in this thesis have not been implemented in actual runtime systems. We designed memory-aware alternatives to the classical dynamic schedulers HEFT and MINMIN that would deserve to be tried in real-world conditions. These workflow schedulers ensure that the peak memory usage is smaller than a given bound, and they can prove useful in runtime systems like StarPU that operate on shared-memory parallel platforms: they can help avoid out-of-core memory accesses and better handle convoluted hierarchical memory layouts.

Many other phenomena can be taken into account to model data communications. In this thesis, we only considered platforms with the bidirectional one-port model (where every communication can be processed in parallel as long as all senders and all receivers are different), or the bus model (where every communication has to be processed sequentially). These assumptions correspond to a model with no contention between communications, and to a model with infinite contention, respectively. We could consider alternative cases where communications can be processed in parallel but their processing time depends on the number of communications taking place at the same time. This communication model can be considered more realistic and would lead to interesting new algorithmic problems.

Long-term perspectives

In this thesis, we devoted our efforts to improving resource and memory management on modern architectures, one necessary step toward reaching the Exascale performance mark. But many other challenges need to be faced to this end. Among them, with the increasing number of components, the reliability of the entire computational system is becoming an issue. Even if each component is quite reliable individually, the mean time between failures in next-generation supercomputers is expected to drop drastically. Large scientific applications on these platforms will have to deal with a higher frequency of failures and errors in the future. There are many possible approaches to remedy this problem. In matrix computations, we can mention algorithm-based fault-tolerance techniques, using checksums that allow one to detect at each operation whether a soft error occurred during the computation. It would be interesting to develop other low-overhead methods to detect soft errors. It could also be interesting to investigate how to efficiently schedule a workflow knowing that resources are likely to fail at some point. New models of memory subject to failures can be introduced. It would be interesting, in adjoint computations for instance, to consider the fact that one memory may lose every stored output file at some point, and to see how the probability of failures impacts the optimal algorithm introduced in this thesis.

Bibliography

[1] “Top500 list of June 2015,” http://www.top500.org/lists/2015/06/.

[2] “LAPACK: Linear algebra package,” 1992. [Online]. Available: http://www.netlib.org/lapack/

[3] “HPL: A portable implementation of the high-performance linpack benchmark for distributed- memory computers,” 2008. [Online]. Available: http://www.netlib.org/benchmark/hpl/

[4] “Chameleon, a dense linear algebra software for heterogeneous architectures,” 2014. [Online]. Available: https://project.inria.fr/chameleon

[5] “The White House, Office of the Press Secretary, executive order: Creating a national strategic computing initiative,” 2015, https://www.whitehouse.gov/the-press-office/2015/07/29/executive-order-creating-national-strategic-computing-initiative.

[6] Advanced Scientific Computing Advisory Committee (ASCAC), “Ten technical approaches to address the challenges of Exascale computing,” http://science.energy.gov/~/media/ascr/ascac/pdf/meetings/20140210/Top10reportFEB14.pdf.

[7] E. Agullo, C. Augonnet, J. Dongarra, H. Ltaief, R. Namyst, S. Thibault, and S. Tomov, “Faster, cheaper, better - a hybridization methodology to develop linear algebra software for GPUs,” LAPACK Working Note 230, 2010.

[8] E. Agullo, J. Demmel, J. Dongarra, B. Hadri, J. Kurzak, J. Langou, H. Ltaief, P. Luszczek, and S. Tomov, “Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects,” Journal of Physics: Conference Series, vol. 180, no. 1, p. 012037, 2009.

[9] I. Ahmad and Y.-K. Kwok, “On exploiting task duplication in parallel program scheduling,” IEEE Trans. Parallel Distrib. Syst., vol. 9, no. 9, pp. 872–892, 1998.

[10] P. R. Amestoy, I. S. Duff, J. Koster, and J.-Y. L’Excellent, “A fully asynchronous multifrontal solver using distributed dynamic scheduling,” SIAM Journal on Matrix Analysis and Applications, vol. 23, no. 1, pp. 15–41, 2001.

[11] P. R. Amestoy, A. Guermouche, J.-Y. L’Excellent, and S. Pralet, “Hybrid scheduling for the parallel solution of linear systems,” Parallel Computing, vol. 32, no. 2, pp. 136–156, 2006.

[12] P. Amestoy, J.-Y. L’Excellent, and S. Pralet, “Personal communication,” Jan. 2013.

[13] E. Anderson, J. Hall, J. Hartline, M. Hobbes, A. Karlin, J. Saia, R. Swaminathan, and J. Wilkes, “Algorithms for data migration,” Algorithmica, vol. 57, no. 2, pp. 349–380, 2010.


[14] C. Augonnet, S. Thibault, and R. Namyst, “Automatic Calibration of Performance Models on Heterogeneous Multicore Architectures,” in Proceedings of the International Euro-Par Workshops 2009, HPPC’09, ser. Lecture Notes in Computer Science, vol. 6043. Delft, The Netherlands: Springer, Aug. 2009, pp. 56–65.

[15] C. Augonnet, S. Thibault, R. Namyst, and P.-A. Wacrenier, “Starpu: a unified platform for task scheduling on heterogeneous multicore architectures,” Concurrency and Computation: Practice and Experience, vol. 23, no. 2, pp. 187–198, 2011.

[16] M. Baboulin, J. J. Dongarra, J. Herrmann, and S. Tomov, “Accelerating linear system solutions using randomization techniques,” ACM Trans. Mathematical Software, vol. 39, no. 2, pp. 8:1–8:13, 2013.

[17] O. Beaumont, V. Boudet, and Y. Robert, “A realistic model and an efficient heuristic for scheduling with heterogeneous processors,” in HCW’2002, the 11th Heterogeneous Computing Workshop. IEEE Computer Society Press, 2002.

[18] P. Bhat, C. Raghavendra, and V. Prasanna, “Efficient collective communication in distributed heterogeneous systems,” Journal of Parallel and Distributed Computing, vol. 63, pp. 251–263, 2003.

[19] G. Bosilca, A. Bouteiller, A. Danalis, M. Faverge, A. Haidar, T. Herault, J. Kurzak, J. Langou, P. Lemarinier, H. Ltaief, P. Luszczek, A. YarKhan, and J. Dongarra, “Distributed dense numerical linear algebra algorithms on massively parallel architectures: DPLASMA,” ICL, UTK, Tech. Rep., 2010.

[20] G. Bosilca, A. Bouteiller, A. Danalis, M. Faverge, A. Haidar, T. Herault, J. Kurzak, J. Langou, P. Lemarinier, H. Ltaief, P. Luszczek, A. YarKhan, and J. Dongarra, “Flexible development of dense linear algebra algorithms on massively parallel architectures with DPLASMA,” in 12th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC’11), 2011.

[21] G. Bosilca, A. Bouteiller, A. Danalis, M. Faverge, T. Herault, and J. J. Dongarra, “PaRSEC: Exploiting heterogeneity to enhance scalability,” IEEE Computing in Science and Engineering, vol. 15, no. 6, pp. 36–45, 2013.

[22] G. Bosilca, A. Bouteiller, A. Danalis, T. Herault, P. Lemarinier, and J. Dongarra, “DAGuE: A generic distributed DAG engine for high performance computing,” in 16th International Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS’11), 2011.

[23] ——, “DAGuE: A generic distributed DAG engine for high performance computing,” Parallel Computing, vol. 38, no. 1, pp. 37–51, 2012.

[24] H. Bouwmeester, M. Jacquelin, J. Langou, and Y. Robert, “Tiled QR factorization algorithms,” in Proc. ACM/IEEE SC11 Conference. ACM Press, 2011.

[25] T. D. Braun, H. J. Siegel, N. Beck, L. L. Bölöni, M. Maheswaran, A. I. Reuther, J. P. Robertson, M. D. Theys, B. Yao, D. Hensgen, and R. F. Freund, “A comparison of eleven static heuristics for mapping a class of independent tasks onto heterogeneous distributed computing systems,” J. Parallel Distrib. Comput., vol. 61, no. 6, pp. 810–837, 2001.

[26] J. Bruck, C.-T. Ho, E. Upfal, S. Kipnis, and D. Weathersby, “Efficient algorithms for all-to-all communications in multiport message-passing systems,” IEEE Trans. Parallel Distributed Systems, vol. 8, no. 11, pp. 1143–1156, 1997.

[27] A. Buttari, J. Langou, J. Kurzak, and J. Dongarra, “Parallel tiled QR factorization for multicore architectures,” Concurrency and Computation: Practice and Experience, vol. 20, no. 13, pp. 1573–1590, 2008.

[28] ——, “A class of parallel tiled linear algebra algorithms for multicore architectures,” Parallel Computing, vol. 35, no. 1, pp. 38–53, 2009.

[29] H. Casanova, A. Legrand, and M. Quinson, “SimGrid: a Generic Framework for Large-Scale Distributed Experiments,” in Proceedings of the 10th IEEE International Conference on Computer Modeling and Simulation (UKSim), Apr. 2008.

[30] E. Chan, F. G. Van Zee, P. Bientinesi, E. S. Quintana-Orti, G. Quintana-Orti, and R. Van de Geijn, “SuperMatrix: a multithreaded runtime scheduling system for algorithms-by-blocks,” in Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2008, pp. 123–132.

[31] H. Chetto, M. Silly, and T. Bouchentouf, “Dynamic scheduling of real-time tasks under precedence constraints,” Real-Time Systems, vol. 2, no. 3, pp. 181–194, 1990.

[32] J. Choi, J. Demmel, I. Dhillon, J. Dongarra, S. Ostrouchov, A. Petitet, K. Stanley, D. Walker, and R. Whaley, “ScaLAPACK: a portable linear algebra library for distributed memory computers — design issues and performance,” Computer Physics Communications, vol. 97, no. 1–2, pp. 1–15, 1996.

[33] P. Chrétienne, E. G. Coffman Jr., J. K. Lenstra, and Z. Liu, Eds., Scheduling Theory and its Applications. John Wiley and Sons, 1995.

[34] E. G. Coffman, M. R. Garey, D. S. Johnson, and A. S. LaPaugh, “Scheduling file transfers,” SIAM J. Comput., vol. 14, pp. 744–780, 1985.

[35] T. Davidović, L. Liberti, N. Maculan, and N. Mladenović, “Towards the optimal solution of the multiprocessor scheduling problem with communication delays,” in MISTA Proceedings, 2007.

[36] J. Demmel, L. Grigori, M. Hoemmen, and J. Langou, “Communication-optimal parallel and sequential QR and LU factorizations,” SIAM J. Scientific Computing, vol. 34, no. 1, pp. A206–A239, 2012.

[37] J. W. Demmel, N. J. Higham, and R. S. Schreiber, “Block LU factorization,” Numerical Linear Algebra with Applications, vol. 2, no. 2, pp. 173–190, 1995.

[38] F. Desprez, J. Dongarra, A. Petitet, C. Randriamaro, and Y. Robert, “Scheduling block-cyclic array redistribution,” IEEE Trans. Parallel Distributed Systems, vol. 9, no. 2, pp. 192–205, 1998.

[39] C. D. Dhillon, J. Choi, J. Demmel, I. Dhillon, J. Dongarra, S. Ostrouchov, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley, “LAPACK Working Note 95, ScaLAPACK: A portable linear algebra library for distributed memory computers - design issues and performance,” 1995.

[40] S. Donfack, J. Dongarra, M. Faverge, M. Gates, J. Kurzak, P. Luszczek, and I. Yamazaki, “On algorithmic variants of parallel Gaussian elimination: Comparison of implementations in terms of performance and numerical properties,” LAPACK Working Note 280, Jul. 2013.

[41] J. J. Dongarra and D. W. Walker, “Software libraries for linear algebra computations on high performance computers,” SIAM Review, vol. 37, no. 2, pp. 151–180, 1995.

[42] J. Dongarra, M. Faverge, T. Hérault, M. Jacquelin, J. Langou, and Y. Robert, “Hierarchical QR factorization algorithms for multi-core cluster systems,” Parallel Computing, vol. 39, no. 4-5, pp. 212–232, 2013.

[43] J. Dongarra, M. Faverge, H. Ltaief, and P. Luszczek, “Achieving numerical accuracy and high performance using recursive tile LU factorization with partial pivoting,” Concurrency and Computation: Practice and Experience, 2013, available online.

[44] J. J. Dongarra, P. Luszczek, and A. Petitet, “The LINPACK benchmark: Past, present, and future,” Concurrency and Computation: Practice and Experience, vol. 15, pp. 803–820, 2003.

[45] I. S. Duff and S. Pralet, “Strategies for scaling and pivoting for sparse symmetric indefinite problems,” SIAM J. Matrix Anal. Appl., vol. 27, no. 2, pp. 313–340, 2005.

[46] P.-F. Dutot, K. Rzadca, E. Saule, D. Trystram et al., “Multiobjective scheduling,” in Introduction to Scheduling. CRC Press, 2010.

[47] H. El-Rewini, H. H. Ali, and T. G. Lewis, “Task scheduling in multiprocessing systems,” Computer, vol. 28, no. 12, pp. 27–37, 1995.

[48] I. Foster, The Grid 2: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, 2004.

[49] M. R. Garey and D. S. Johnson, Computers and Intractability, a Guide to the Theory of NP-Completeness. W.H. Freeman and Company, 1979.

[50] T. Gautier, X. Besseron, and L. Pigeon, “Kaapi: A thread scheduling runtime system for data flow computations on cluster of multi-processors,” in Proceedings of the 2007 International Workshop on Parallel Symbolic Computation, ser. PASCO ’07. New York, NY, USA: ACM, 2007, pp. 15–23.

[51] R. Giering and T. Kaminski, “Recomputations in reverse mode AD,” in Automatic Differentiation: From Simulation to Optimization, ser. Computer and Information Science, G. Corliss, C. Faure, A. Griewank, L. Hascoët, and U. Naumann, Eds. New York: Springer, 2002, ch. 33, pp. 283–291.

[52] J. R. Gilbert, T. Lengauer, and R. E. Tarjan, “The pebbling problem is complete in polynomial space,” SIAM J. Comput., vol. 9, no. 3, pp. 513–524, 1980.

[53] J. R. Gilbert, G. L. Miller, and S.-H. Teng, “Geometric mesh partitioning: Implementation and experiments,” SIAM Journal on Scientific Computing, vol. 19, no. 6, pp. 2091–2110, 1998.

[54] M. B. Giles and N. A. Pierce, “An introduction to the adjoint approach to design,” Flow, Turbulence and Combustion, vol. 65, pp. 393–415, 2000.

[55] T. Glatard, J. Montagnat, D. Lingrand, and X. Pennec, “Flexible and efficient workflow deployment of data-intensive applications on grids with MOTEUR,” Int. Journal of High Performance Computing and Applications, 2008.

[56] R. L. Graham, “Bounds for certain multiprocessing anomalies,” Bell System Technical Journal, vol. 45, no. 9, pp. 1563–1581, 1966.

[57] A. Griewank, “Achieving logarithmic growth of temporal and spatial complexity in reverse automatic differentiation,” Optimization Methods and Software, vol. 1, no. 1, pp. 35–54, 1992.

[58] A. Griewank and A. Walther, “Algorithm 799: Revolve: an implementation of checkpointing for the reverse or adjoint mode of computational differentiation,” ACM Transactions on Mathematical Software (TOMS), vol. 26, no. 1, pp. 19–45, 2000.

[59] ——, Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, 2nd ed., ser. Other Titles in Applied Mathematics. Philadelphia, PA: SIAM, 2008, no. 105.

[60] L. Grigori, J. W. Demmel, and H. Xiang, “Communication avoiding Gaussian elimination,” in Proc. ACM/IEEE SC08 Conference, 2008.

[61] ——, “CALU: a communication optimal LU factorization algorithm,” SIAM Journal on Matrix Analysis and Applications, vol. 32, no. 4, pp. 1317–1350, 2011.

[62] M. Guo and Y. Pan, “Improving communication scheduling for array redistribution,” J. Parallel Distrib. Comput., vol. 65, no. 5, 2005.

[63] F. Gustavson, “High-performance linear algebra algorithms using new generalized data structures for matrices,” IBM Journal of Research and Development, vol. 47, no. 1, pp. 31–55, Jan 2003.

[64] P. Heimbach, C. Hill, and R. Giering, “An efficient exact adjoint of the parallel MIT general circulation model, generated via automatic differentiation,” Future Generation Computer Systems, vol. 21, no. 8, pp. 1356–1371, 2005.

[65] J. Herrmann, L. Marchal, and Y. Robert, “Model and complexity results for tree traversals on hybrid platforms,” in Euro-Par 2013 - Parallel Processing. Springer Verlag, 2013.

[66] ——, “Memory-aware list scheduling for hybrid platforms,” INRIA, Research Report 8641, Jan. 2014.

[67] N. J. Higham, Accuracy and Stability of Numerical Algorithms. SIAM Press, 2002.

[68] L. Hollermann, T. S. Hsu, D. R. Lopez, and K. Vertanen, “Scheduling problems in a practical allocation model,” J. Combinatorial Optimization, vol. 1, no. 2, pp. 129–149, 1997.

[69] J. E. Hopcroft and R. M. Karp, “An n^(5/2) algorithm for maximum matchings in bipartite graphs,” SIAM Journal on Computing, vol. 2, no. 4, pp. 225–231, 1973.

[70] M. Horton, S. Tomov, and J. Dongarra, “A class of hybrid LAPACK algorithms for multicore and GPU architectures,” in Application Accelerators in High-Performance Computing (SAAHPC), 2011 Symposium on, July 2011, pp. 150–158.

[71] T. S. Hsu, J. C. Lee, D. R. Lopez, and W. A. Royce, “Task allocation on a network of processors,” IEEE Trans. Computers, vol. 49, no. 12, pp. 1339–1353, 2000.

[72] F. D. Igual, E. Chan, E. S. Quintana-Ortí, G. Quintana-Ortí, R. A. van de Geijn, and F. G. V. Zee, “The FLAME approach: From dense linear algebra algorithms to high-performance multi-accelerator implementations,” Journal of Parallel and Distributed Computing, 2012.

[73] D. Irony and S. Toledo, “The snap-back pivoting method for symmetric banded indefinite matrices,” SIAM Journal on Matrix Analysis and Applications, vol. 28, no. 2, pp. 398–424, 2006.

[74] M. Jacquelin, L. Marchal, Y. Robert, and B. Ucar, “On optimal tree traversals for sparse matrix factorization,” IPDPS’11, 2011.

[75] A. Jameson, “Aerodynamic shape optimization using the adjoint method,” Lectures at the Von Karman Institute, Brussels, 2003.

[76] E. T. Kalns and L. M. Ni, “Processor mapping techniques towards efficient data redistribution,” IEEE Trans. Parallel Distributed Systems, vol. 6, no. 12, pp. 1234–1247, 1995.

[77] G. Karypis and V. Kumar, MeTiS: A Software Package for Partitioning Unstructured Graphs, Partitioning Meshes, and Computing Fill-Reducing Orderings of Sparse Matrices Version 4.0, U. of Minnesota, Dpt. of Comp. Sci. and Eng., Army HPC Research Center, Minneapolis, 1998.

[78] Y.-A. Kim, “Data migration to minimize the total completion time,” J. Algorithms, vol. 55, no. 1, pp. 42–57, 2005.

[79] C. H. Koelbel, D. B. Loveman, R. S. Schreiber, G. L. S. Jr., and M. E. Zosel, The High Performance Fortran Handbook. The MIT Press, 1994.

[80] J. W. H. Liu, “On the storage requirement in the out-of-core multifrontal method for sparse factorization,” ACM Trans. Math. Software, vol. 12, no. 3, pp. 249–264, 1986.

[81] ——, “An application of generalized tree pebbling to sparse matrix factorization,” SIAM J. Algebraic Discrete Methods, vol. 8, no. 3, 1987.

[82] G. Manimaran and C. S. R. Murthy, “An efficient dynamic scheduling algorithm for multiprocessor real-time systems,” Parallel and Distributed Systems, IEEE Transactions on, vol. 9, no. 3, pp. 312–319, 1998.

[83] L. Marchal, O. Sinnen, and F. Vivien, “Scheduling tree-shaped task graphs to minimize memory and makespan,” INRIA, Research report 8082, 2012, accepted for publication in IPDPS’13, 27th International Parallel and Distributed Processing Symposium.

[84] M. G. Norman and P. Thanisch, “Models of machines and computation for mapping in multicomputers,” ACM Computing Surveys, vol. 25, no. 3, pp. 103–117, 1993.

[85] T. N’Takpé, F. Suter, and H. Casanova, “A comparison of scheduling approaches for mixed-parallel applications on heterogeneous platforms,” in ISPDC, 2007, pp. 250–257.

[86] M. A. Palis, J.-C. Liou, and D. S. L. Wei, “Task clustering and scheduling for distributed memory parallel architectures,” IEEE Trans. Parallel Distrib. Syst., vol. 7, no. 1, pp. 46–55, 1996.

[87] J. Planas, R. M. Badia, E. Ayguadé, and J. Labarta, “Hierarchical task-based programming with StarSs,” International Journal of High Performance Computing Applications, vol. 23, no. 3, pp. 284–299, 2009.

[88] L. Prylli and B. Tourancheau, “Efficient block-cyclic data redistribution,” in EuroPar’96, ser. Lecture Notes in Computer Science, vol. 1123. Springer Verlag, 1996, pp. 155–164.

[89] G. Quintana-Ortí, E. S. Quintana-Ortí, R. A. Geijn, F. G. V. Zee, and E. Chan, “Programming matrix algorithms-by-blocks for thread-level parallelism,” ACM Transactions on Mathematical Software (TOMS), vol. 36, no. 3, p. 14, 2009.

[90] G. Quintana-Ortí, E. S. Quintana-Ortí, R. A. van de Geijn, F. G. Van Zee, and E. Chan, “Programming matrix algorithms-by-blocks for thread-level parallelism,” ACM Trans. Math. Softw., vol. 36, no. 3, pp. 1–26, 2009.

[91] A. Ramakrishnan, G. Singh, H. Zhao, E. Deelman, R. Sakellariou, K. Vahi, K. Blackburn, D. Meyers, and M. Samidi, “Scheduling data-intensive workflows onto storage-constrained distributed resources,” in CCGRID’07. IEEE, 2007.

[92] P. Rivera-Vega, R. Varadarajan, and S. Navathe, “Scheduling data redistribution in distributed databases,” in Proc. Sixth Int. Conf. Data Engineering, 1990, pp. 166–173.

[93] V. Sarkar, Partitioning and scheduling parallel programs for multiprocessors. MIT press, 1989.

[94] A. Schrijver, Combinatorial Optimization: Polyhedra and Efficiency, ser. Algorithms and Combinatorics. Springer-Verlag, 2003, vol. 24.

[95] R. Sethi, “Complete register allocation problems,” in STOC’73: Proceedings of the fifth annual ACM symposium on Theory of computing. ACM Press, 1973, pp. 182–195.

[96] R. Sethi and J. Ullman, “The generation of optimal code for arithmetic expressions,” J. ACM, vol. 17, no. 4, pp. 715–728, 1970.

[97] B. A. Shirazi, A. R. Hurson, and K. M. Kavi, Scheduling and load balancing in parallel and distributed systems. IEEE CS Press, 1995.

[98] G. Smith, Numerical Solutions of Partial Differential Equations: Finite Difference Methods. Clarendon Press, Oxford, 1985.

[99] L. Stanisic, S. Thibault, A. Legrand, B. Videau, and J.-F. Méhaut, “Modeling and Simulation of a Dynamic Task-Based Runtime System for Heterogeneous Multi-Core Architectures,” in Euro-par - 20th International Conference on Parallel Processing. Porto, Portugal: Springer-Verlag, Aug. 2014.

[100] M. Stonebraker, J. Duggan, L. Battle, and O. Papaemmanouil, “SciDB DBMS research at M.I.T,” IEEE Data Eng. Bull., vol. 36, no. 4, pp. 21–30, 2013.

[101] P. Stumm and A. Walther, “Multistage approaches for optimal offline checkpointing,” SIAM Journal on Scientific Computing, vol. 31, no. 3, pp. 1946–1967, 2009.

[102] R. Thakur, A. Choudhary, and G. Fox, “Runtime array redistribution in HPF programs,” in Proceedings of the Scalable High-Performance Computing Conference, May 1994, pp. 309–316.

[103] H. Topcuoglu, S. Hariri, and M. Y. Wu, “Performance-effective and low-complexity task scheduling for heterogeneous computing,” IEEE Trans. Parallel Distributed Systems, vol. 13, no. 3, pp. 260–274, 2002.

[104] H. Topcuoglu, S. Hariri, and M.-y. Wu, “Performance-effective and low-complexity task scheduling for heterogeneous computing,” IEEE Transactions on Parallel and Distributed Systems, vol. 13, no. 3, pp. 260–274, 2002.

[105] S. Venugopalan and O. Sinnen, “Optimal linear programming solutions for multiprocessor scheduling with communication delays,” in ICA3PP (1), 2012, pp. 129–138.

[106] D. W. Walker and S. W. Otto, “Redistribution of block-cyclic data distributions using MPI,” Concurrency: Practice and Experience, vol. 8, no. 9, pp. 707–728, 1996.

[107] A. Walther, “Program reversal schedules for single- and multi-processor machines,” Ph.D. dissertation, Institute of Scientific Computing, Technical University Dresden, Germany, 1999.

[108] L. Wang, J. M. Stichnoth, and S. Chatterjee, “Runtime performance of parallel array assignment: an empirical study,” in 1996 ACM/IEEE Supercomputing Conference, 1996.

[109] W. A. Wulf and S. A. McKee, “Hitting the memory wall: Implications of the obvious,” SIGARCH Comput. Archit. News, vol. 23, no. 1, pp. 20–24, Mar. 1995.

[110] A. YarKhan, J. Kurzak, and J. Dongarra, QUARK Users’ Guide: QUeueing And Runtime for Kernels, UTK ICL, 2011.

Publications

Articles in international refereed journals

[J1] Marc Baboulin, Jack Dongarra, Julien Herrmann, and Stanimire Tomov. Accelerating linear system solutions using randomization techniques. ACM Transactions on Mathematical Software (TOMS), 39(2):8, 2013.

[J2] Julien Herrmann, Loris Marchal, and Yves Robert. Memory-aware tree traversals with pre-assigned tasks. Journal of Parallel and Distributed Computing (JPDC), 75:53–66, 2015.

[J3] Julien Herrmann, George Bosilca, Thomas Hérault, Loris Marchal, Yves Robert, and Jack Don- garra. Assessing the cost of redistribution followed by a computational kernel: complexity and performance results. Parallel Computing, 2015.

[J4] Mathieu Faverge, Julien Herrmann, Julien Langou, Bradley Lowery, Yves Robert, and Jack Dongarra. Mixing LU and QR factorization algorithms to design high-performance dense linear algebra solvers. Journal of Parallel and Distributed Computing (JPDC), 85C:32–46, 2015.

Articles in international refereed conferences

[C1] Julien Herrmann, Loris Marchal, and Yves Robert. Model and complexity results for tree traver- sals on hybrid platforms. In Proc. of Euro-Par Parallel Processing, pages 647–658. Springer, 2013.

[C2] Julien Herrmann, Loris Marchal, Yves Robert, et al. Memory-aware list scheduling for hybrid platforms. In Workshop on Advances in Parallel and Distributed Computational Models (APDCM), 2014.

[C3] Mathieu Faverge, Julien Herrmann, Julien Langou, Bradley Lowery, Yves Robert, and Jack Dongarra. Designing LU-QR hybrid solvers for performance and stability. In Proc. of IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2014.

[C4] Thomas Herault, Julien Herrmann, Loris Marchal, and Yves Robert. Determining the optimal redistribution for a given data partition. In Proc. of IEEE International Symposium on Parallel and Distributed Computing (ISPDC), pages 95–102. IEEE, 2014.

[C5] Emmanuel Agullo, Olivier Beaumont, Lionel Eyraud-Dubois, Julien Herrmann, Suraj Kumar, Loris Marchal, and Samuel Thibault. Bridging the gap between performance and bounds of Cholesky factorization on heterogeneous platforms. In Proc. of Heterogeneity in Computing Workshop (HCW), 2015.


Research reports

[RR1] Julien Herrmann, Loris Marchal, and Yves Robert. Memory-aware list scheduling for hybrid platforms. Research Report RR-8461, INRIA, February 2014.

[RR2] Julien Herrmann, Loris Marchal, and Yves Robert. Tree traversals with task-memory affinities. Research Report RR-8226, INRIA, February 2013.

[RR3] Thomas Hérault, Julien Herrmann, Loris Marchal, and Yves Robert. Determining the optimal redistribution. Research Report RR-8499, INRIA, March 2014.

[RR4] Guillaume Aupy, Julien Herrmann, Paul Hovland, and Yves Robert. Optimal Multistage Algorithm for Adjoint Computation. Research Report RR-8721, INRIA, April 2014.