UNIVERSITAT JAUME I DE CASTELLÓ
ESCOLA DE DOCTORAT DE LA UNIVERSITAT JAUME I
Multithreaded Dense Linear Algebra on Asymmetric Multi-core Processors
Castelló de la Plana, November 2017
Ph.D. Thesis
Presented by: Sandra Catalán Pallarés
Supervised by: Enrique S. Quintana Ortí
Rafael Rodríguez Sánchez
Programa de Doctorat en Informàtica
Escola de Doctorat de la Universitat Jaume I
Multithreaded Dense Linear Algebra on Asymmetric Multi-core Processors
Memòria presentada per Sandra Catalán Pallarés per optar al grau de doctor/a per la Universitat Jaume I.
Sandra Catalán Pallarés Enrique S. Quintana Ortí, Rafael Rodríguez Sánchez
Castelló de la Plana, desembre de 2017
Agraïments Institucionals
Esta tesi ha estat finançada pel projecte FP7 318793 "EXA2GREEN" de la Comissió Europea, l'ajuda predoctoral de la Universitat Jaume I i per l'ajuda predoctoral FPU del Ministeri de Ciència i Educació.

Contents
1 Introduction
  1.1 Motivation
  1.2 Contributions
  1.3 Organization

2 Background
  2.1 Basic Linear Algebra Subprograms (BLAS)
    2.1.1 Levels of BLAS
    2.1.2 The BLAS library election
  2.2 Linear Algebra PACKage (LAPACK)
  2.3 Target architectures
    2.3.1 Power consumption
    2.3.2 Asymmetric multi-core processors
      2.3.2.1 ARM big.LITTLE
      2.3.2.2 Odroid
      2.3.2.3 Juno
    2.3.3 Intel
  2.4 Measuring power consumption
    2.4.1 PMLib and big.LITTLE platforms
  2.5 Summary

3 Basic Linear Algebra Subprograms (BLAS)
  3.1 Blas-like Library Instantiation Software (BLIS)
  3.2 BLIS-3
    3.2.1 General matrix-matrix multiplication (gemm)
    3.2.2 Generalization to BLAS-3
    3.2.3 Triangular system solve (trsm)
  3.3 BLIS-3 on ARM big.LITTLE
  3.4 Optimizing symmetric BLIS gemm on big.LITTLE
    3.4.1 Cache optimization for the big and LITTLE cores
    3.4.2 Multi-threaded Evaluation of BLIS gemm on the big and LITTLE clusters
    3.4.3 Asymmetric BLIS GEMM on big.LITTLE
      3.4.3.1 Static-asymmetric scheduling (sas)
      3.4.3.2 Cache-aware static-asymmetric scheduling (ca-sas)
      3.4.3.3 Cache-aware dynamic-asymmetric scheduling (ca-das)
  3.5 Asymmetric BLIS-3 Evaluation
    3.5.1 Square operands in gemm
    3.5.2 gemm with rectangular operands
    3.5.3 Other BLIS-3 kernels with rectangular operands
    3.5.4 BLIS-3 in 64-bit AMPs
  3.6 BLIS-2
    3.6.1 General matrix-vector multiplication (gemv)
    3.6.2 Symmetric matrix-vector multiplication (symv)
  3.7 BLIS-2 on big.LITTLE
    3.7.1 Level-2 BLAS micro-kernels for ARMv7 architectures
    3.7.2 Asymmetry-aware symv and gemv for ARM big.LITTLE AMPs
    3.7.3 Cache optimization for the big and LITTLE cores
  3.8 Evaluation of Asymmetric BLIS-2
    3.8.1 symv and gemv with square operands
    3.8.2 gemv with rectangular operands
  3.9 Summary

4 Factorizations
  4.1 Cholesky factorization
  4.2 LU
  4.3 Look-ahead in the LU factorization
    4.3.1 Basic algorithms and Block-Data Parallelism (BDP)
    4.3.2 Static look-ahead and nested Task-Parallelism+Block-Data Parallelism (TP+BDP)
    4.3.3 Advocating for Malleable Thread-Level Linear Algebra Libraries
      4.3.3.1 Worker sharing (WS): Panel factorization less expensive than update
      4.3.3.2 Early termination: Panel factorization more expensive than update
      4.3.3.3 Relation to adaptive look-ahead via a runtime
    4.3.4 Experimental Evaluation
      4.3.4.1 Optimal block size
      4.3.4.2 Performance comparison of the variants with static look-ahead
      4.3.4.3 Performance comparison with OmpSs
      4.3.4.4 Multi-socket performance comparison with OmpSs
  4.4 Malleable LU on big.LITTLE
    4.4.1 Mapping of threads for the variant with static look-ahead
      4.4.1.1 Thread mapping of GEMM
      4.4.1.2 Configuration of TRSM
      4.4.1.3 The relevance of the partial pivoting
    4.4.2 Dynamic look-ahead with OmpSs
    4.4.3 Global comparison
  4.5 Summary

5 Reduction to Condensed Forms
  5.1 General Structure of the Reduction to Condensed Forms
  5.2 General Optimization of the TSOR Routines
    5.2.1 Selection of the core configuration
  5.3 Reduction to Tridiagonal Form (sytrd)
    5.3.1 The impact of the algorithmic block size in sytrd
    5.3.2 Performance of the optimized sytrd
  5.4 Reduction to Bidiagonal Form (gebrd)
    5.4.1 The impact of the algorithmic block size in the Reduction to Bidiagonal Form (gebrd)
    5.4.2 Performance of the optimized gebrd
  5.5 Reduction to Upper Hessenberg Form (gehrd)
    5.5.1 The impact of the algorithmic block size in the Reduction to Upper Hessenberg Form (gehrd)
    5.5.2 Performance of the gehrd routine
  5.6 Modeling the Performance of the TSOR Routines
  5.7 Summary

6 Conclusions
  6.1 Conclusions and Main Contributions
    6.1.1 BLAS-3 kernels
    6.1.2 BLAS-2 kernels
    6.1.3 TSOR routines on asymmetric multicore processors (AMPs) and models
    6.1.4 LU and Cholesky factorization on big.LITTLE
    6.1.5 Thread-level malleability
  6.2 Related Publications
    6.2.1 Directly related publications
      6.2.1.1 Chapter 3. Basic Linear Algebra Subprograms (BLAS)
      6.2.1.2 Chapter 4. Factorizations
      6.2.1.3 Chapter 5. Reductions
    6.2.2 Indirectly related publications
    6.2.3 Other publications
  6.3 Open Research Lines

7 Conclusiones
  7.1 Conclusiones y Contribuciones Principales
    7.1.1 Kernels BLAS-3
    7.1.2 Kernels BLAS-2
    7.1.3 Rutinas TSOR en AMPs y modelos
    7.1.4 Factorizaciones LU y Cholesky en big.LITTLE
    7.1.5 Maleabilidad a nivel de thread
  7.2 Publicaciones relacionadas
    7.2.1 Publicaciones directamente relacionadas
      7.2.1.1 Chapter 3. Basic Linear Algebra Subprograms (BLAS)
      7.2.1.2 Chapter 4. Factorizaciones
      7.2.1.3 Chapter 5. Reducciones
    7.2.2 Publicaciones indirectamente relacionadas
    7.2.3 Otras publicaciones
  7.3 Líneas abiertas de investigación
List of Figures

2.1 Exynos 5422 block diagram
2.2 ARM Juno SoC block diagram
2.3 Intel Xeon block diagram
2.4 PMLib diagram

3.1 High performance implementation of gemm in BLIS
3.2 Data movement involved in the BLIS implementation of gemm
3.3 Standard partitioning of the iteration space and assignment to threads/cores for a multi-threaded BLIS implementation
3.4 Performance and energy efficiency of the reference BLIS gemm
3.5 BLIS optimal cache configuration parameters mc and kc for the ARM Cortex-A15 and Cortex-A7 in the Samsung Exynos 5422 SoC
3.6 Performance and energy efficiency of the BLIS gemm using exclusively one type of core, for a varying number of threads
3.7 Power consumption of the BLIS gemm for a matrix size of 4,096 using exclusively one type of core, for a varying number of threads
3.8 Static-asymmetric partitioning of the iteration space and assignment to threads/cores for a multi-threaded BLIS implementation
3.9 Performance and energy-efficiency of the sas version of BLIS gemm
3.10 Performance and energy-efficiency of the sas and ca-sas versions of BLIS gemm
3.11 Performance and energy-efficiency of the ca-sas version of BLIS gemm
3.12 Performance and energy-efficiency of the ca-das and das versions of BLIS gemm
3.13 Performance of (general) matrix multiplication for different dimensions: gemm, gepp, gemp, gepm
3.14 Execution traces of gepm using the parallelization strategy D3S4 for a problem of dimension n = k = 2000 and m = 256
3.15 Performance of gepm for different cache configuration parameters mc
3.16 Performance of two rectangular cases of symm, trmm, and trsm
3.17 Performance of a rectangular case of syrk and syr2k
3.18 Performance of BLAS-3 on Juno SoC
3.19 High performance implementation of the gemv kernel in BLIS
3.20 Implementation of symv in BLIS
3.21 Impact of the use of architecture-aware micro-kernels on the performance of symv and gemv on the Exynos 5422 SoC
3.22 Dynamic workload distribution of Loop 1 in symv between one big core and one LITTLE core
3.23 Parallel performance of symv and gemv on the Exynos E5422 SoC
3.24 Parallel performance of gemv and symv with square operands on the Exynos E5422 SoC
3.25 Parallel performance of gemv with rectangular operands on the Exynos E5422 SoC

4.1 Performance and energy efficiency of potrf for the solution of s.p.d. linear systems
4.2 Performance and energy efficiency of getrf for the solution of general linear systems
4.3 Unblocked and blocked RL algorithms for the LU factorization
4.4 Exploitation of BDP in the blocked RL LU parallelization
4.5 Execution trace of the first four iterations of the blocked RL LU factorization with partial pivoting (n = 10,000)
4.6 Blocked RL algorithm enhanced with look-ahead for the LU factorization
4.7 Exploitation of TP+BDP in the blocked RL LU parallelization with look-ahead
4.8 Execution trace of the first four iterations of the blocked RL LU factorization with partial pivoting, enhanced with look-ahead (n = 10,000)
4.9 Execution trace of the first four iterations of the blocked RL LU factorization with partial pivoting, enhanced with look-ahead (n = 2,000)
4.10 Distribution of the workload among t = 3 threads when Loop 4 of BLIS gemm is parallelized
4.11 Exploitation of TP+BDP in the blocked RL LU parallelization with look-ahead and WS
4.12 Execution trace of the first four iterations of the blocked RL LU factorization with partial pivoting, enhanced with look-ahead and malleable BLIS (n = 10,000)
4.13 Outer vs inner LU and use of algorithmic block sizes
4.14 Exploitation of TP+BDP in the blocked RL LU parallelization with look-ahead and ET
4.15 Execution trace of the first four iterations of the blocked RL LU factorization with partial pivoting, enhanced with look-ahead and early termination (n = 2,000)
4.16 GFLOPS attained with gemm (left) and ratio of flops performed in the panel factorizations normalized to the total cost (right)
4.17 Optimal block size of the blocked RL algorithms for the LU factorization
4.18 Performance comparison of the blocked RL algorithms for the LU factorization (except LU OS)
4.19 Performance comparison between the OmpSs implementation and the blocked RL algorithm for the LU factorization with look-ahead, malleable BLIS and ET
4.20 Optimal block size of the blocked ET and OmpSs algorithms for the LU factorization
4.21 Performance comparison between the OmpSs implementation and the blocked RL algorithm for the LU factorization with look-ahead, malleable BLIS and ET
4.22 Performance of the conventional parallelization of the blocked RL algorithm with static look-ahead enhanced with MTL and WS/ET for different gemm configurations
4.23 Performance of the conventional parallelization of the blocked RL algorithm with static look-ahead enhanced with MTL and WS/ET for different trsm configurations
4.24 Performance of the conventional parallelization of the blocked RL algorithm with different laswp parallelizations and static look-ahead enhanced with MTL and WS/ET
4.25 Performance of the OmpSs parallelization of the blocked RL algorithm for LU (dynamic look-ahead)
4.26 Performance of the best parallel configuration for each variant: conventional blocked RL algorithm, static look-ahead version enhanced with MTL and WS/ET, and runtime-based implementation

5.1 Performance of Level-2 and Level-3 BLAS on the Exynos 5422 SoC
5.2 Performance of Level-2 and Level-3 BLAS on the Exynos 5422 SoC
5.3 Performance of sytrd on a single Cortex-A15 core within the Exynos 5422 SoC using different algorithmic block sizes
5.4 Profile of execution time spent by sytrd on a single ARM Cortex-A15 core embedded in the Exynos 5422 SoC
5.5 Performance of sytrd on the Exynos 5422 SoC
5.6 Performance of gebrd on a single Cortex-A15 core within the Exynos 5422 SoC using different algorithmic block sizes
5.7 Profile of execution time spent by gebrd on a single ARM Cortex-A15 core embedded in the Exynos 5422 SoC
5.8 Performance of gebrd on the Exynos 5422 SoC
5.9 Performance of gehrd on a single Cortex-A15 core within the Exynos 5422 SoC using different algorithmic block sizes
5.10 Profile of execution time spent by gehrd on a single ARM Cortex-A15 core embedded in the Exynos 5422 SoC
5.11 Performance of gehrd on the Exynos 5422 SoC
5.12 Model-driven estimation of the relative execution time and actual performance
List of Tables

2.1 Size of NEON operations on ARM architectures

3.1 Kernels of BLIS-3
3.2 Kernels of BLIS-2
3.3 Parameters for optimal performance of the Level-3 kernels in BLIS on the Exynos 5422 SoC using real single-precision ieee arithmetic

4.1 Performance of matrix factorizations for the solution of s.p.d. and general linear systems (potrf)
4.2 Performance of matrix factorizations for the solution of general linear systems (getrf)

5.1 Parameters for optimal performance of the Level-3 kernels in BLIS on the ARMv7 big.LITTLE embedded in the Exynos 5422 SoC using real single-precision ieee arithmetic
5.2 Parameters for optimal performance of the Level-2 kernels in BLIS on the Exynos 5422 SoC using real single-precision ieee arithmetic
5.3 Differences of time for sytrd between executions using the optimal block size determined by the model and the real optimal value detected
I'm not telling you it is going to be easy, I'm telling you it's going to be worth it.

Nicholas Negroponte

Summary
Nowadays, there exists a large variety of scientific, industrial and engineering applications that require high computational power and storage, and their demands continue to grow; in order to obtain more precise solutions in these applications, scientists need to elaborate and work with increasingly sophisticated and complex physical and mathematical models. Consequently, the capacity of new data processing systems and High Performance Computing (HPC) centers is saturated shortly after their set-up [2, 17, 44, 50]. Nonetheless, these resources do not go to waste: scientific computation (or computational science, that is, the elaboration of mathematical models and the use of computers to analyze and solve scientific problems) is an effective tool for scientific discovery, complementary to the more traditional methods based on theory and experimentation [44, 50].

Large-scale HPC systems are major energy consumers, requiring power both for the computing resources themselves and for the auxiliary systems needed to operate them [44, 49, 51, 74]. This consumption has a direct impact on the operational and maintenance costs of computing centers. However, the electricity cost is not the only problem; in general, energy consumption turns into carbon emissions that are dangerous to the environment and public health, and the heat reduces the reliability of the hardware [51]. The situation requires additional measures: studying the Green500 list of June 2017 [1] we can see that, nowadays, the most efficient HPC systems in terms of power consumption attain 14,110 MFLOPS per Watt (MFLOPS/W). A simple calculation reveals that reaching the EXAFLOPS rate with the current technology would require approximately 70.9 MW, with an approximate cost of 70.9 million dollars per year in electricity alone. Although the EXAFLOPS challenge will undoubtedly unleash innovative scientific discoveries, it is also true that more efficient hardware and software technologies are required from the energy point of view [6, 13, 18].

The pressure from HPC centers has forced hardware manufacturers to improve their designs to increase energy efficiency: the Central Processing Unit (CPU), memory and disks (three of the largest energy consumers in computing systems, with the remaining ones being the interconnection network and the power supply) integrate energy-saving strategies, based on the transition of the system to low-power states or on the dynamic reduction of frequency and voltage (DVFS, or Dynamic Voltage and Frequency Scaling). On the other hand, software systems, communication libraries and, especially, the computational libraries and application codes running in HPC centers have been, in general, oblivious to energy consumption. The Top500 [4] list is a good example. Computers in this list are classified according to the sustained performance (in floating-point arithmetic operations per second (FLOPS)) that they attain with the Linpack benchmark (basically, the solution of a dense linear system of large dimension). However, the numerical method behind this test, the LU factorization, is far from representative of the real performance attained by most scientific codes [18].

Even though this is a mature topic in other segments, the development of energy-aware solutions for HPC applications, which optimize both the execution time and the energy consumption, is still in its early stages, despite the huge benefits that it may produce [6, 51]. The HPC community is now aware of the energy costs, as demonstrated by the creation of the Green500 [1] list.

As a response to this situation, the general objective of this thesis is the study, design, development and experimental analysis of energy-aware solutions for the execution of scientific and engineering numerical applications on low-power architectures, more specifically on asymmetric platforms. With the aim of demonstrating the benefits of these contributions, we selected diverse dense linear algebra operations that arise in very different areas, such as image processing, molecular dynamics simulation, and big data analytics, among others.
Resumen
En la actualidad existe una gran variedad de aplicaciones científicas, industriales y de ingeniería que requieren un alto poder computacional y de almacenamiento, y su demanda continúa creciendo; para obtener soluciones más precisas en estas aplicaciones, los científicos necesitan elaborar y trabajar con modelos físicos y matemáticos mucho más sofisticados. Como consecuencia, los nuevos equipos de procesamiento de datos y los centros de Computación de Altas Prestaciones (CAP) se saturan a las pocas semanas de haber sido puestos en funcionamiento [2, 17, 44, 50]. Todos estos recursos no se desaprovechan: la computación científica (o ciencias de la computación, es decir, la elaboración de modelos matemáticos y el uso de computadoras para analizar y resolver problemas científicos) es una herramienta eficaz para el descubrimiento científico, complementaria a los métodos más tradicionales basados en la teoría y la experimentación [44, 50].

Los sistemas de CAP de gran escala son grandes consumidores de energía, que emplean los recursos de computación y los sistemas auxiliares para funcionar [44, 49, 51, 74]. Este consumo tiene un impacto directo en los costes de funcionamiento y mantenimiento de los centros de computación, poniendo en peligro su existencia y perjudicando la puesta en marcha de nuevas instalaciones. Sin embargo, el coste de la electricidad no es el único problema; en general, el consumo de energía resulta en emisiones de dióxido de carbono que son un peligro para el medio ambiente y la salud pública, y el calor reduce la fiabilidad de los componentes de hardware [51]. La situación requiere medidas adicionales: estudiando la lista Green500 de junio de 2017 [1] se puede observar que, a día de hoy, los sistemas de CAP más eficientes en cuanto a potencia consumida ofrecen 14.110 MFLOPS por vatio (MFLOPS/W). Así, un simple cálculo revela que alcanzar la tasa de EXAFLOPS con la tecnología actual requeriría, aproximadamente, 70,9 MW, con un coste también aproximado de 70,9 millones de dólares por año solo en electricidad. Aunque el desafío del EXAFLOPS hará, sin duda, que se produzcan descubrimientos científicos innovadores, también es cierto que se necesita que la tecnología hardware y el software del sistema sean más eficientes desde el punto de vista energético [6, 13, 18].

La presión de los centros de CAP ha obligado a los fabricantes de hardware a mejorar sus diseños para conseguir una mejora en la eficiencia energética: Unidad Central de Procesamiento (CPU), memoria y discos (tres de los grandes consumidores de energía en los sistemas informáticos, siendo los otros dos la red de interconexión y la fuente de alimentación) cuentan con algunas estrategias de ahorro, basadas en la transición del sistema a algún estado de bajo consumo o en la reducción de la frecuencia y el voltaje de forma dinámica (DVFS o Dynamic Voltage and Frequency Scaling). Por otro lado, los sistemas software, las bibliotecas de comunicación y, en especial, las bibliotecas computacionales y los códigos de aplicaciones usados en los centros de CAP han sido, en general, insensibles al consumo de energía. La lista Top500 [4] es un claro ejemplo. Los computadores de esta lista quedan clasificados en función del rendimiento sostenido (en operaciones aritméticas en coma flotante por segundo (FLOPS)) que alcanzan en el test Linpack (básicamente, la solución de un sistema lineal denso de dimensión escalable). Sin embargo, el método numérico que hay detrás de este test, la factorización LU, es poco representativo del rendimiento real conseguido por la mayoría de códigos científicos [18].

Si bien es un tema maduro en otros segmentos, el desarrollo de soluciones conscientes de la energía para las aplicaciones de CAP, que optimizan tanto el tiempo de ejecución como la conservación de la energía, se encuentra todavía en sus inicios, a pesar de los enormes beneficios que puede producir [6, 51]. La comunidad que trabaja en CAP ha empezado a tomar consciencia de los costes energéticos, y así se ha creado una clasificación como la lista Green500 [1], que utiliza este tipo de métricas para comparar y clasificar los supercomputadores del mundo de acuerdo a su eficiencia energética.

En respuesta a esta situación, el objetivo general de esta tesis es el estudio, diseño, desarrollo y análisis experimental de soluciones conscientes de la proporcionalidad energética para aplicaciones científicas y de ingeniería numéricas sobre arquitecturas de bajo consumo, concretamente en plataformas asimétricas. Con el propósito de demostrar los beneficios de estas aportaciones, se han escogido distintas operaciones de álgebra lineal presentes en aplicaciones pertenecientes a campos tan distintos como el procesado de imágenes, la simulación de dinámica molecular o el análisis de grandes volúmenes de datos.
Agradecimientos
El tiempo pasa volando y prueba de ello han sido los últimos años que han dado lugar a esta tesis doctoral. Ha sido un periodo intenso, en el que el esfuerzo, la dedicación y el sacrificio han sido claves; pero todo ha sido posible gracias a la motivación y satisfacción de trabajar en lo que más me gusta. Ha sido una etapa llena de retos y experiencias, tanto a nivel personal como profesional, lo que me ha permitido disfrutar de la novedad de las circunstancias. Sin embargo, nada de esto hubiese sido posible sin la ayuda de las personas que me han brindado la oportunidad de trabajar con ellas, y las que, desinteresadamente, me han acompañado en cada momento de esta aventura. A todas ellas, me gustaría expresar mi más sincero agradecimiento.
A mis directores, Enrique S. Quintana Ortí y Rafael Rodríguez Sánchez. A Enrique, por su apoyo, por ser una fuente infinita de conocimientos y trabajador incansable. A Rafa, por su dedicación y su ánimo en mis malos momentos.
A las personas del grupo HPC&A de la Universitat Jaume I, a las que en algún momento han estado en él y a sus visitantes, José, José Manuel, Sergio B., Asun, Maribel, Juan Carlos, Germán L., Germán F., Merche, Rafa M., Alfredo, Manel, Toni, Fran, José Antonio, Maria, Rocío, Sergio I., Héctor, Adrián, Sisco, Sonia, Andrés, Goran, José Ramón y Pedro. Gracias por el buen ambiente y por vuestra colaboración. También quiero agradecer su ayuda a los técnicos del Departamento de Ingeniería y Ciencia de los Computadores, Gustavo y Vicente.
A Costas Bekas, por darme la oportunidad de conocer de primera mano la investigación en el entorno empresarial. A Cristiano, por su ayuda, guía y atención. También a los internos de verano de IBM, que hicieron de la estancia una experiencia inolvidable. A Robert van de Geijn y su grupo, por ofrecerme la oportunidad de integrarme en su entorno de trabajo y por su contribución en la tesis. Asimismo, me gustaría dar las gracias por su hospitalidad a los compañeros del grupo Programming Models del BSC, especialmente a Vicenç.
A mis amigos, por las conversaciones infinitas dándome apoyo y sabios consejos.
A mi familia y muy especialmente a mis padres, Manolo y María José, por su apoyo, comprensión y paciencia, por estar ahí siempre.
A Rubén, por hacer de mi proyecto parte del suyo. Por sus ánimos en los momentos de frustración y por compartir los logros.
– ¡Gracias! · Gràcies! · Thanks! · Danke! –
CHAPTER 1
Introduction
For many decades, the evolution of computer technologies and architectures has enlarged the performance gap between the throughput of processors and the memory access rate [63]. While this difference is no longer growing at a significant rate, the scenario that we face nowadays is that, in order to deliver very high performance, programmers have to explicitly take into account the cost of data movements. This is particularly the case for dense linear algebra (DLA) operations, for which there exist highly tuned implementations for almost any architecture, from past vector processors to current multicore designs, graphics processing units, and co-processors such as the Intel Xeon Phi.

This dissertation targets two important problems. The first one is the design of low-level DLA kernels for architectures comprising two (or more) classes of cores. The main question we have to address here is how to attain a balanced distribution of the computational workload among the heterogeneous cores while taking into account that some of the resources, in particular cache levels, are either shared or private. The second problem is partially related to the first one. Concretely, this dissertation explores an alternative to runtime-based systems in order to extract "sufficient" parallelism from complex DLA operations while making an efficient use of the cache hierarchy of the architecture.
1.1 Motivation
Asymmetry-aware BLAS. DLA is at the bottom of the "food chain" for many scientific and engineering applications, which require kernels to tackle linear systems, linear least squares problems or eigenvalue computations, among others [38]. To address this, the scientific community has reacted by creating the Basic Linear Algebra Subprograms (BLAS) [47] in order to define standard domain-specific interfaces for a few dozens of fundamental DLA operations (such as the vector norm, the matrix-vector product, and the matrix-matrix multiplication) and improve performance portability across a wide range of computer architectures.

Highly-tuned implementations of the BLAS (for example, Intel MKL [66], IBM ESSL [64], NVIDIA cuBLAS [81], GotoBLAS [57, 58], OpenBLAS [97], ATLAS [96] or BLIS [95]) aim to deliver high performance by carefully orchestrating the movement of data across the memory hierarchy to hide its overhead. To attain this goal, these implementations interleave data packing operations with computations to maximize the access to data that is closer to the arithmetic units and favour the use of vector instructions. When the target architecture comprises multiple cores, these instances of the BLAS also exploit loop-parallelism to distribute the workload in a balanced manner, ensuring that the cores make an efficient use of their private cache memories, and collaborate in the access to the shared cache levels instead of competing for them.

Asymmetric processors are heterogeneous architectures in principle conceived as low-power designs for embedded systems. However, as we progress along the final steps of Moore's Law, the constraints imposed by the end of Dennard's scaling [40] and the high energy proportionality [17] of asymmetric processors are promoting them into a building block to assemble large-scale facilities for high performance computing [3]. Asymmetry can also appear in a conventional multicore processor when some of the cores are forced to operate at a lower frequency, for example due to constraints imposed by the processor's power budget.

From the point of view of a parallel BLAS, asymmetric processors present the difficulty of distributing the work among a heterogeneous collection of resources (in practice, two classes). In order to attain high performance, the scheduling has to take into consideration the distinct computational capabilities of the cores (e.g., due to a different width/dimension of the floating-point units or a different core frequency). In addition, for processors that are physically asymmetric, the parallelization also has to consider the distinct cache configurations. As we will show in this dissertation, even for simple and regular operations such as those composing the BLAS, the challenge is far from trivial when the goal is to squeeze the last few drops of floating-point performance.
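To make the scheduling problem concrete, the following is a minimal sketch of a static, asymmetry-aware partitioning of a parallelized loop: the iterations are split between the two clusters in proportion to an assumed big-to-LITTLE speed ratio. The core counts and the ratio are illustrative assumptions, not the values or the final scheduling policy developed in this thesis.

    /* Hedged sketch: split n loop iterations between two core classes in
     * proportion to their relative speed. N_BIG, N_LITTLE and BIG_SPEEDUP
     * are illustrative assumptions.                                        */
    #define N_BIG       4     /* number of big cores (assumption)           */
    #define N_LITTLE    4     /* number of LITTLE cores (assumption)        */
    #define BIG_SPEEDUP 3.0   /* one big core ~ 3x one LITTLE core (assumed) */

    /* Returns how many of the n iterations go to the big cluster; the
     * LITTLE cluster is assigned the remaining ones.                        */
    int iterations_for_big_cluster(int n) {
      double total_speed = N_BIG * BIG_SPEEDUP + N_LITTLE * 1.0;
      int n_big = (int)(n * (N_BIG * BIG_SPEEDUP) / total_speed + 0.5);
      return n_big > n ? n : n_big;
    }

For instance, with these placeholder values, n = 4,000 iterations would be split into 3,000 for the big cluster and 1,000 for the LITTLE one; a cache-aware scheme must additionally constrain the per-cluster chunks to match the distinct cache configurations, as discussed later in this dissertation.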
Cache-aware matrix factorizations. The Linear Algebra PACKage (LAPACK) [11] specifies a collection of DLA operations, built on top of the BLAS, with a functionality that is closer to the application level. For example, LAPACK includes driver routines for the solution of linear systems, the calculation of the singular value decomposition, the computation of the eigenvalues of a symmetric matrix, etc.

For multicore processors, the conventional approach to exploit parallelism from LAPACK has relied, for many years, on the use of a multi-threaded instance of the BLAS. This solution exerts a strict control over the data movements and can be expected to make an extremely efficient use of the cache memories. Unfortunately, for complex DLA operations, exploiting parallelism only within the BLAS constrains the concurrency that can be leveraged by imposing an artificial fork-join model of execution. Specifically, with this solution, parallelism does not extend across multiple invocations of BLAS kernels even if they are independent and, therefore, could be executed in parallel.

The increase in hardware concurrency of multicore processors in recent years, at the pace dictated by Moore's Law, has led to the development of parallel versions of some DLA operations that exploit task-parallelism via a runtime. (See, for example, the efforts with OmpSs [48], PLASMA-Quark [5], StarPU [91] and libflame-SuperMatrix [52].) In more detail, these task-parallel approaches decompose a DLA operation into a collection of fine-grained tasks with dependencies, and issue the execution of each task to a single core, simultaneously executing independent tasks on different cores. The runtime-based solutions are better equipped to tackle the increasing number of cores of current and future architectures, because they leverage the natural concurrency that is present in the application. However, with this type of solution, the cores ferociously compete for the shared memory resources and may not completely amortize the overheads of the BLAS (e.g., for packing) due to the use of fine-grained tasks.

In the mixed scenario described in the above paragraphs, this dissertation investigates whether it is possible to extract nested task- and loop-parallelism to feed all the cores of a current multi-socket architecture. The complementary objective is to make an efficient use of the cache hierarchy. In the framework of our study, the target is obviously a DLA operation.
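To illustrate the task-parallel style referred to above, the following is a hedged sketch (written with plain OpenMP task dependences rather than OmpSs, PLASMA-Quark, StarPU or SuperMatrix) of how a blocked, Cholesky-like factorization is decomposed into fine-grained tasks that a runtime can schedule onto different cores. The tile kernels are empty placeholders standing in for sequential BLAS/LAPACK calls, and the tile count and size are assumptions.

    #include <omp.h>

    enum { NT = 4, B = 256 };   /* tiles per dimension and tile size (assumptions) */

    /* Placeholder tile kernels; in practice they would wrap sequential
     * LAPACK/BLAS calls (e.g., dpotrf, dtrsm, dsyrk, dgemm) on B x B tiles. */
    static void tile_factor(double *Akk) { (void)Akk; }
    static void tile_solve (const double *Akk, double *Aik) { (void)Akk; (void)Aik; }
    static void tile_update(const double *Aik, const double *Ajk, double *Aij)
    { (void)Aik; (void)Ajk; (void)Aij; }

    /* Each tile operation becomes a task; the runtime tracks the data
     * dependencies and executes independent tasks on different cores.      */
    void blocked_factorization(double A[NT][NT][B * B]) {
      #pragma omp parallel
      #pragma omp single
      for (int k = 0; k < NT; k++) {
        #pragma omp task depend(inout: A[k][k][0])
        tile_factor(A[k][k]);                       /* panel/diagonal factorization */
        for (int i = k + 1; i < NT; i++) {
          #pragma omp task depend(in: A[k][k][0]) depend(inout: A[i][k][0])
          tile_solve(A[k][k], A[i][k]);             /* triangular solves             */
        }
        for (int i = k + 1; i < NT; i++) {
          #pragma omp task depend(in: A[i][k][0]) depend(inout: A[i][i][0])
          tile_update(A[i][k], A[i][k], A[i][i]);   /* diagonal (syrk-like) update   */
          for (int j = k + 1; j < i; j++) {
            #pragma omp task depend(in: A[i][k][0], A[j][k][0]) depend(inout: A[i][j][0])
            tile_update(A[i][k], A[j][k], A[i][j]); /* off-diagonal (gemm-like) update */
          }
        }
      }
    }

The fork-join alternative criticized above would instead invoke a multi-threaded BLAS for each operation in program order; the nested strategy proposed in this dissertation combines both sources of parallelism.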
1.2 Contributions
This dissertation makes the following contributions to advance the state-of-the-art on the high performance implementation of DLA operations:
• We introduce efficient multi-threaded implementations of some kernels of the Level-2 BLAS and the complete Level-3 BLAS for asymmetric architectures composed of two types of cores with the following properties:
– Our solution leverages the multi-threaded implementation of the matrix-matrix multiplication and the sequential implementation of the matrix-vector product kernels in the BLIS library [95], which decompose the operation into a collection of nested loops around a micro-kernel. Starting from these reference codes, we propose a modification of the loop stride configuration and scheduling to distribute the workload comprised by certain loops among the two core types while taking into account their distinct computational performance and cache organization. – We demonstrate the generality of the approach by applying the same parallelization principles to develop tuned versions of the BLAS for 32-bit and 64-bit ARM big.LITTLE processors consisting, respectively, of 4+4 and 4+2 (slow+fast) cores. The kernels not only distinguish between different operations, paying special care to the parallelization of the triangular system solve, but also take into consideration the operands' dimensions (shapes). – The correctness of the new asymmetry-aware BLAS is validated by integrating them with the reference implementation of LAPACK from the netlib public repository, as well as with some new routines for the reduction to compact band forms involved in the solution of symmetric eigenvalue problems and the computation of the singular value decomposition (SVD).
• We propose a novel nested parallelization strategy for dense matrix factorizations that applies a static look-ahead to expose a certain degree of task concurrency in combination with the use of a multi-threaded implementation of the BLAS in order to attain an efficient use of the cache system. In more detail, our solution makes the following specific contributions:
– We demonstrate the benefits of this nested approach by targeting the LU factorization with partial pivoting, an operation that is representative of many other matrix decompositions which are key to the solution of DLA problems. Furthermore, our solution can be expected to carry over to any matrix factorization that is composed of a loop that iterates over the columns/rows of a matrix, with a loop body that comprises a "sequential" panel factorization and a highly parallel trailing update. This is the case of the blocked right-looking variants for the LU, QR, Cholesky and LDL^T factorizations and, to some extent, also of some decompositions for the reduction to compact forms that are employed in the solution of symmetric eigenvalue problems and the computation of the singular value decomposition (SVD). – We introduce a malleable thread-level implementation of BLIS that makes it possible to change the number of threads that participate in the execution of a BLAS kernel at execution time. This technique departs from the current, inflexible solution adopted in all implementations of the BLAS, which presuppose that, once issued, the number of threads participating in the execution of a BLAS kernel will not vary. – In case the panel factorization is less expensive than the update of the trailing submatrix, we leverage the malleable instance of the BLAS to improve workload balancing and performance, by allowing the thread team in charge of the panel factorization to be re-allocated to the execution of the trailing update. – To tackle the opposite case, where the panel factorization is more expensive than the update of the trailing submatrix, we design an early termination (ET) mechanism that allows the thread team in charge of the trailing update to communicate this event to the other team. This alert forces an early termination of the panel factorization and the advance of the factorization into the next iteration.
1.3 Organization
This chapter presents the motivation and main contributions of this thesis. In addition, the structure of the whole document is detailed next.

Chapter 2 reviews the current state of the Dense Linear Algebra (DLA) libraries BLAS and LAPACK. The evolution, structure and main features of those libraries are presented in this chapter. The systems used in this thesis, the multi-core Intel Xeon E5-2603 and the asymmetric multi-core platforms Odroid-XU3 and Juno, are described next. That chapter also includes the description of the PMLib framework (used for the power consumption measurements) and its interaction with tracing and visualization tools.

In Chapter 3, the design, implementation and evaluation of BLAS-2 and BLAS-3 routines for asymmetric multi-core processors are carried out. Routines gemm and gemv are studied as the reference kernels for the corresponding levels, BLAS-3 and BLAS-2 respectively. Moreover, we identify the flaws of the initial asymmetry-aware implementation and, after a thorough performance analysis, present the selected solution in each case.

Chapter 4 analyzes several advanced DLA operations. The results of the direct migration of the Cholesky and LU factorizations relying on the asymmetry-aware version of BLAS-3 are presented. That chapter includes the proposal of a new technique, thread-level malleability, which allows a more efficient use of the resources. The implementation of the technique and the evaluation of its benefits are presented there.

In Chapter 5, the study of LAPACK routines in combination with the asymmetry-aware implementation of BLIS is extended in order to expose the benefits of the BLAS-2 asymmetry-aware version. To this end, we evaluate three routines that perform two-sided orthogonal reductions (TSOR) of a dense matrix to a condensed form. In addition, we present a model to select specific parameters that may increase the performance of the TSOR routines. The study presented in this chapter completes the validation of the new asymmetry-aware BLAS, started in Chapter 3 for BLAS-3, by integrating the new version of the library with the reference implementation of LAPACK.

Finally, in Chapter 6, we present the general conclusions of the thesis, the main results obtained as a list of publications, and a collection of open research lines.
CHAPTER 2
Background
In this chapter we offer a short review of the DLA libraries, tools and platforms used in this thesis. DLA operations are classified into two categories: basic and advanced. Basic DLA operations are implemented in the Basic Linear Algebra Subprograms (BLAS) library, while LAPACK supports more sophisticated functionality. The first part of this chapter elaborates on the history of both libraries, their structure and main features. Then, a description of the platforms used in the thesis is given. Finally, the PMLib framework, used to collect power samples in our tests, is presented.
2.1 Basic Linear Algebra Subprograms (BLAS)
Fundamental problems of linear algebra, such as the solution of linear (equation) systems or the computation of eigenvalues or singular values, underlie a large number of scientific and engineering applications. These problems are found in very different fields such as structural computations, automatic control, integrated circuit design, and chemistry simulations, to name only a few. In the same vein, while solving these problems, a small set of basic operations are often used, e.g., a scalar-vector multiplication, a triangular system solve or a matrix-matrix multiplication. In this scenario, the BLAS specifies a set of routines that perform these basic DLA operations, defining the functionality and interface of each routine of the library. The specification of BLAS was carried out by experts from different fields [47], which makes it especially valuable as the result of an interdisciplinary collaboration. BLAS is a compromise between functionality and simplicity: on one hand, it defines a reasonable number of routines with a limited number of parameters; on the other hand, it provides a wide functionality.

There exists a reference implementation of BLAS published on the Internet [78], but the main strength of BLAS lies in its different implementations for specific architectures. Since the very beginning, the development of optimized implementations of BLAS routines has been expected of processor manufacturers. As a result, there are proprietary implementations from AMD (AMD Core Math Library (ACML)), Intel (MKL), IBM (ESSL) and Nvidia (cuBLAS). In addition, there are also open source implementations such as GotoBLAS, OpenBLAS, ATLAS, and BLIS. In general, these libraries implement each routine taking into account the organization of the target architecture in order to exploit its resources and optimize performance.
The first implementation of BLAS [72] was made in the 1970s and it has been relevant in the solution of dense linear algebra problems ever since. Due to its reliability, robustness and efficiency, several libraries were designed in order to use BLAS routines internally. In addition to reliability and efficiency, BLAS presents the following advantages:
• Code readability: the name of the routines express their functionality.
• Portability and efficiency: when migrating the code to a new platform, the efficiency will remain reasonable as long as a specific version of BLAS for the new architecture is used.
• Documentation: all BLAS routines are thoroughly documented.
The most common implementations of BLAS are written in C and Fortran. However, a limited amount of assembly code is sometimes present in order to generate more efficient code.
2.1.1 Levels of BLAS

At the beginning of the development of BLAS, in the early 1970s, the most powerful computers integrated vector processors. The original BLAS were designed with this type of processor as the target, creating a small set of operations that worked on vectors (Level-1 BLAS or BLAS-1). The main objective of that specification was to foster the development of optimized implementations depending on the target architecture. In this way, from the very beginning, the BLAS creators defended the importance of "outsourcing" the implementation of efficient basic operations to the architecture designers, who were able to generate more robust and efficient code [71].

The initial set of routines comprised by the BLAS-1 provided a reduced functionality, since it only supported a limited number of vector operations. In 1987, the BLAS were extended with a new set of routines that constitute the second level of BLAS (BLAS-2). Those new routines implement matrix-vector operations, involving a quadratic amount of arithmetic operations and data. A public reference implementation of the BLAS-2 was released [46]. From that implementation it is worth highlighting that matrix storage is done by columns, following the storage convention of Fortran. As was done for BLAS-1, manufacturers were encouraged to create specific implementations and, to this end, some advice was given, such as guidelines to adapt the inner loop of each routine to the specific architecture, the use of assembly code, or compiler directives for the target architecture.

Eventually, the growing disparity between the processor frequency and the memory bandwidth resulted in the design of architectures with multiple levels of cache memory (memory hierarchy). With the new memory structure, the BLAS-1 and BLAS-2 were not able to attain a reasonable level of performance, since in both cases the ratio between the number of operations and the data movement is O(1), while the gap between the processor and memory throughput is much higher. Consequently, the performance of BLAS-1 and BLAS-2 is constrained by the speed of the data transfer from memory.

To tackle this problem, a third set of routines was defined in 1989, the Level-3 BLAS (BLAS-3) [47], that implemented operations with computations of cubic order versus data movements of quadratic order. The difference between the number of computations and memory data transfers, when using a properly designed algorithm, makes it possible to exploit the principle of locality in architectures with a hierarchical organization of the memory, hiding the memory access latency and attaining performance closer to the processor peak. From the algorithmic point of view, this is attained via so-called blocked algorithms. These algorithms divide the matrix into submatrices (or blocks) and carefully orchestrate the movement of these blocks across the memory hierarchy in order to increase the probability of finding the data in those memory levels that are closer to the processor. Hence, blocked algorithms attain higher performance by making a better use of the memory hierarchy; a small didactic sketch of this idea is given at the end of this subsection.

As was the case in the previous levels of BLAS, the third level of BLAS presents a compromise between complexity and functionality. An example of simplicity is that there are no specific routines for trapezoidal matrices, since this would increase the number of parameters in the routines. However, in order to increase functionality, implicitly transposed operands are considered because, otherwise, the user would have to transpose the matrix beforehand, an operation that may be highly expensive due to the amount of memory accesses.

In summary, the BLAS routines are organized into three levels, named according to the order of computations performed in each one:
• Level 1: Operations with vectors (linear order of computations).
• Level 2: Matrix-vector operations (quadratic order of computations).
• Level 3: Matrix-matrix operations (cubic order of computations).

Regarding performance, the differences among the three levels are due to the ratio between the number of computations and the amount of data. This ratio is crucial in architectures with a hierarchical organization of memory since, when the number of operations is higher than the amount of data accessed, it is possible to perform several operations per memory access and, consequently, increase the productivity. Therefore, the three levels are also defined according to the ratio between the number of computations and the data needed by the operation (a brief illustration with one representative routine per level follows the list below):
• Level 1: The number of computations and data grows linearly with the problem size.
• Level 2: The number of computations and data grows quadratically with the problem size.
• Level 3: The number of computations grows cubically with the problem size, while the amount of data grows quadratically only.
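As a brief illustration, the following sketch exercises one representative double-precision routine of each level through the CBLAS interface (one common C binding of the BLAS); the operands are stored in column-major order, as discussed above, and the sizes are left as parameters.

    #include <cblas.h>

    /* One representative BLAS routine per level, on double-precision data. */
    void blas_levels_example(int n, double alpha, double beta,
                             const double *x, double *y,        /* vectors of length n */
                             const double *A, const double *B,  /* n x n matrices      */
                             double *C) {                       /* n x n matrix        */
      /* Level 1: vector-vector, y := alpha*x + y  (O(n) flops, O(n) data).        */
      cblas_daxpy(n, alpha, x, 1, y, 1);

      /* Level 2: matrix-vector, y := alpha*A*x + beta*y  (O(n^2) flops and data). */
      cblas_dgemv(CblasColMajor, CblasNoTrans, n, n, alpha, A, n, x, 1, beta, y, 1);

      /* Level 3: matrix-matrix, C := alpha*A*B + beta*C  (O(n^3) flops, O(n^2)
         data), the level where blocked algorithms and multi-threading pay off.   */
      cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, n, n, n,
                  alpha, A, n, B, n, beta, C, n);
    }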
Multi-threaded versions of BLAS for multiprocessors with shared memory deserve special attention. Those versions implement a multi-threaded code that distributes the computations among the processor cores in order to make an efficient use of the resources. Such implementations are especially useful for BLAS-3 routines. The blocked BLAS-3 routines have two additional advantages [45]: on one hand, they allow different processors to work on distinct blocks simultaneously; on the other hand, within the same block, operations on different scalars or vectors may be performed simultaneously. From the performance point of view, the following conclusions can be extracted:
• The performance of BLAS-1 and BLAS-2 is constrained by the memory bandwidth.
• BLAS-3 is the most efficient level, due to the fact that a higher number of computations can be performed per memory access. In addition, the BLAS-3 are more suitable for parallelization, providing significant improvements on multiprocessor architectures with shared memory.
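To make the notion of a blocked algorithm concrete, the following is a didactic sketch (plain C, not the actual implementation of any of the libraries mentioned above) of a blocked matrix-matrix multiplication C := C + A*B: the three outer loops advance with a block stride so that the tiles touched by the inner loops can remain in cache and be reused. The block size NB is an assumption to be tuned per architecture; production libraries such as BLIS additionally pack the tiles and use vectorized micro-kernels, as described in Chapter 3.

    /* Didactic blocked gemm sketch: C := C + A*B, column-major storage with
     * leading dimensions lda, ldb, ldc. NB is an assumed, tunable block size. */
    #define NB 128

    static int imin(int a, int b) { return a < b ? a : b; }

    void gemm_blocked(int m, int n, int k,
                      const double *A, int lda,
                      const double *B, int ldb,
                      double *C, int ldc) {
      for (int jc = 0; jc < n; jc += NB)        /* blocked loops: one tile of C  */
        for (int pc = 0; pc < k; pc += NB)      /* is updated with tiles of A, B */
          for (int ic = 0; ic < m; ic += NB)    /* small enough to stay in cache */
            for (int j = jc; j < imin(jc + NB, n); j++)
              for (int p = pc; p < imin(pc + NB, k); p++)
                for (int i = ic; i < imin(ic + NB, m); i++)
                  C[i + j * ldc] += A[i + p * lda] * B[p + j * ldb];
    }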
2.1.2 The BLAS library election

Among the great variety of BLAS implementations upon which to build, in this dissertation we selected the BLAS-like Library Instantiation Software Framework (BLIS) [95]. This instance of the BLAS, which will be described in detail in Chapter 3, is an open-source framework developed at The University of Texas at Austin for implementing BLAS-like operations.
The flexibility of this library is one of the main reasons behind its election, along with the full exposure of the source code, which allows the library to be modified at different points. Moreover, in terms of performance, BLIS is competitive with other commercial and open-source libraries. Due to these features, BLIS is the most suitable library, among those considered, to carry out the work towards the objectives of this dissertation.
2.2 LAPACK
LAPACK [79] is a library that provides routines to solve fundamental linear algebra problems using the current state of the art in numerical methods. In the same vein as BLAS, LAPACK provides support for dense matrices. However, it tackles more complex problems, for example, systems of linear equations as well as linear least squares (LLS), eigenvalue and singular value problems.

LAPACK emerged as the result of a project started in the late 1980s. The objective was to obtain a library that gathered the functionality and improved the performance of EISPACK [54] and LINPACK [39]. Those libraries, designed for vector processors, do not provide acceptable performance on current high performance processors, with pipelined execution units and hierarchical memories. The main reason for their inefficiency is that they are based on BLAS-1 operations, which do not make an optimal use of the hierarchical memory for the reasons exposed earlier in this chapter. The performance improvements of LAPACK arise from a reorganization of the algorithms in order to make an intensive use of the efficient Level-3 BLAS. Regarding the integration of new algorithms, improvements were introduced in almost all linear algebra problems supported by the library, with those applied to the solution of eigenvalue problems being especially relevant.

For multiprocessor systems, LAPACK extracts parallelism from within a parallel version of BLAS. In other words, LAPACK routines do not include any type of explicit parallelism in the code, but rely on a multithreaded implementation of BLAS for this purpose.
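As a hedged illustration of this layering, the sketch below solves a dense linear system through LAPACKE, the C interface to LAPACK; the underlying dgesv driver casts most of its work in terms of Level-3 BLAS, so linking against a multi-threaded BLAS is enough to execute it in parallel. The storage layout and problem sizes are the caller's choice and are assumptions here.

    #include <stdlib.h>
    #include <lapacke.h>

    /* Solve A*X = B for a dense n x n system via LU with partial pivoting
     * (dgesv); column-major storage with leading dimension n. Returns the
     * LAPACK info code (0 on success).                                    */
    int solve_dense_system(lapack_int n, lapack_int nrhs, double *A, double *B) {
      lapack_int *ipiv = malloc((size_t)n * sizeof *ipiv);  /* pivot indices */
      if (!ipiv) return -1;
      lapack_int info = LAPACKE_dgesv(LAPACK_COL_MAJOR, n, nrhs, A, n, ipiv, B, n);
      free(ipiv);
      return (int)info;
    }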
2.3 Target architectures
2.3.1 Power consumption

Supercomputers are key nowadays to solving challenging problems from very different areas (from engineering design and financial analysis to disaster prediction); however, they also incur a huge power consumption, not only for computation, but also for cooling. For decades, performance and price/performance have been the natural targets of the supercomputing community, looking for the fastest computer (in terms of FLOPS) as advocated by the TOP500 [4] list. In this sense, vast efforts have been made and supercomputers' performance has grown from 59.7 GFLOPS (billions of floating-point arithmetic operations per second) in 1993 to 93,014.6 TFLOPS in 2017.

The performance gains in supercomputers have been largely attained via the combined increase of the number of processors in the system, the number of transistors per processor, and the processor frequency. Nevertheless, power consumption was not considered a main factor to take into account until the middle of the last decade. The reliable operation of supercomputers depends on the continuous cooling of the machine, which results in enormous costs due to electrical power consumption. For example, the Chinese Sunway TaihuLight, ranked as the top supercomputer in the June 2017 TOP500 list, consumes 15 MW. Considering that one MW costs approximately one million dollars per year, the cost of operating and cooling down that system is around 15 million dollars per year.
Energy consumption came officially into the picture a decade ago, when the Green500 [1] list was created. Since that moment, different approaches have been taken in order to improve the energy efficiency of supercomputers, and the improvement has been impressive over the last ten years, moving from 0.4 GFLOPS/W in 2008 to 11.1 GFLOPS/W in 2017. Part of this improvement is due to heterogeneity via hardware accelerators, as the introduction of Graphics Processing Units (GPUs) and manycore processors (e.g., the Intel Xeon Phi) added more energy-efficient resources to the traditional multi-core systems through the combination of cores of different nature. However, the supercomputing community is still looking for new alternatives to reduce energy consumption, and for that reason a new approach based on low-power architectures has been adopted. Fujitsu recently announced the use of ARMv8-based cores for their first exascale supercomputer, the Post-K system [53]. Along the same lines, the MareNostrum 4 supercomputer [32], from the Barcelona Supercomputing Center, will include the same technology as an update of the current supercomputer.

The Mont-Blanc project [3] is the reference example of the low-power architecture trend in supercomputing. This project, started in 2011, advocates for the assembly of supercomputers from low-power processors, more specifically mobile systems-on-chips (SoCs). Several prototypes have been created as an output of the project, starting with Tibidabo in 2011, which features Nvidia Tegra2 SoCs with two ARM Cortex-A9 cores at 1 GHz. The next prototype was Pedraforca (2013), a moderate-scale prototype with 78 Tegra3 nodes with 4 Cortex-A9 cores running at 1.3 GHz. The last prototype is Mont-Blanc (2015), which contains 1080 Exynos 5250 SoCs organized as 72 compute blades over two racks.
2.3.2 Asymmetric multi-core processors

Nowadays, diversity is the new trend in computing systems. Workloads of different nature run on a wide variety of systems (from mobile devices to datacenters) and require resources with distinct capabilities. Against this background, traditional multi-core systems are not enough; their sophisticated and highly optimized cores are usually underutilized by most workloads, yielding a waste of energy. On the other hand, many-core systems that feature simpler cores with low power consumption may not be sufficient for workloads that require high performance. Heterogeneity emerged as a response to this situation, combining cores of different nature on the same platform in order to provide a variety of resources that meet the needs of distinct workloads.

A specific type of heterogeneity is that provided by asymmetric multi-core processors (AMPs), which combine two (or more) types of cores in the same processor. These platforms provide more flexibility thanks to core specialization, which allows them to target either performance or energy efficiency. However, the complexity of AMPs is increasing for different reasons, such as the variety of cores or the use of different instruction set architectures (ISAs) and, consequently, they are becoming harder to program. Moreover, diversity has made the terminology about AMPs confusing, as many authors use different names to refer to the same type of platform. According to [77], our definition of an AMP matches the physical asymmetry in [90] and the performance asymmetry in [69]; that is, two sets of cores with the same ISA but different microarchitectures.
2.3.2.1 ARM big.LITTLE

All the experimentation carried out throughout this thesis has been performed on ARM big.LITTLE platforms. This technology, designed by ARM, integrates LITTLE (slower and energy-saving) processor cores with big (more powerful and power-hungry) cores. Both types of processors are memory coherent and share the same ISA, but feature different microarchitectures in order to deliver the performance and power characteristics required by the big.LITTLE concept. The big cores are more complex, apply out-of-order execution, and have a multi-issue pipeline. The LITTLE cores are simpler, their execution is in-order, and they have a simple multistage pipeline.

The big.LITTLE platforms used in our tests include ARM NEON [12] technology, a Single Instruction Multiple Data (SIMD) architecture extension. In this technology, registers are treated as vectors of elements (of the same data type) that are processed simultaneously, increasing the performance of the processor. The data types and the architectures that support them are presented in Table 2.1 (values marked with * refer to the ARMv8.2-A architecture).

                    Armv7-A/R              Armv8-A/R (AArch32)           Armv8-A (AArch64)
    Floating-point  32-bit                 16-bit*/32-bit                16-bit*/32-bit/64-bit
    Integer         8-bit/16-bit/32-bit    8-bit/16-bit/32-bit/64-bit    8-bit/16-bit/32-bit/64-bit

    Table 2.1: Size of NEON operations on ARM architectures.
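As a small, hedged illustration of the SIMD style enabled by NEON (not code from the BLIS micro-kernels discussed later), the following routine performs an AXPY-like update y := alpha*x + y on 32-bit floats, processing four elements per instruction through NEON intrinsics; it applies to both AArch32 and AArch64 targets that provide <arm_neon.h>.

    #include <arm_neon.h>

    /* y := alpha*x + y on single precision, four elements at a time. */
    void saxpy_neon(int n, float alpha, const float *x, float *y) {
      float32x4_t valpha = vdupq_n_f32(alpha);   /* broadcast alpha into a vector */
      int i = 0;
      for (; i + 4 <= n; i += 4) {
        float32x4_t vx = vld1q_f32(&x[i]);       /* load 4 elements of x          */
        float32x4_t vy = vld1q_f32(&y[i]);       /* load 4 elements of y          */
        vy = vmlaq_f32(vy, valpha, vx);          /* vy := vy + valpha*vx          */
        vst1q_f32(&y[i], vy);                    /* store the 4 updated elements  */
      }
      for (; i < n; i++)                         /* scalar remainder              */
        y[i] += alpha * x[i];
    }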
2.3.2.2 Odroid
ODROID refers to a series of single-board computers manufactured by Hardkernel Co., Ltd. that include implementations of the big.LITTLE design announced by ARM in 2011. The first ODROID generation implementing the big.LITTLE architecture (ODROID-XU) was launched in 2013 and combined a Cortex-A15 quad-core cluster and a Cortex-A7 quad-core cluster (Samsung Exynos 5410). This platform did not allow the simultaneous operation of both clusters, a limitation that was later removed in the ODROID-XU3. For the tests presented in this thesis on an ODROID platform, we used either an ODROID-XU3 or an ODROID-XU4, both furnished with a Samsung Exynos 5422 SoC. The ODROID-XU4 is the evolution of the ODROID-XU3, with similar performance and energy efficiency but a smaller board. Moreover, the ODROID-XU3 includes power sensors that are not present on the ODROID-XU4 board. Hereafter, ODROID will be used to refer indistinctly to either of these platforms. The ODROID SoC comprises an ARM Cortex-A15 quad-core processing cluster (big) plus an ARM Cortex-A7 quad-core processing cluster (LITTLE), both implementing the 32-bit ARMv7-A architecture. Each core has its own private 32-Kbyte L1 (data) cache. The four ARM Cortex-A15 cores share a 2-Mbyte L2 cache, and the four ARM Cortex-A7 cores share a smaller 512-Kbyte L2 cache; see Figure 2.1. Moreover, the two clusters access a common 2-Gbyte DDR3 RAM. Additionally, the Cortex-A7 cores may operate from 200 MHz to 1.4 GHz, while the Cortex-A15 cores may work from 200 MHz to 2 GHz.
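Experiments on this kind of SoC typically pin threads to one cluster or the other. The following minimal sketch shows how a thread can be bound to a cluster on Linux; the core numbering used here (cores 0-3 for the Cortex-A7/LITTLE cluster, cores 4-7 for the Cortex-A15/big cluster) is an assumption for illustration only and should be verified on the actual board (e.g., via lscpu or /sys/devices/system/cpu).

```c
/* Minimal sketch: pinning the calling thread to one big.LITTLE cluster.
 * The core numbering (4-7 = assumed big cluster) is illustrative only. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

static int pin_to_cluster(int first_core, int num_cores) {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    for (int c = first_core; c < first_core + num_cores; c++)
        CPU_SET(c, &mask);                  /* add each core of the cluster */
    /* pid 0 = apply to the calling thread */
    return sched_setaffinity(0, sizeof(mask), &mask);
}

int main(void) {
    if (pin_to_cluster(4, 4) != 0)          /* assumed big cluster: cores 4-7 */
        perror("sched_setaffinity");
    else
        printf("Thread pinned to the (assumed) big cluster\n");
    return 0;
}
```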
2.3.2.3 Juno
Juno is the first ARMv8-A 64-bit development board, featuring a Cortex-A57 dual-core processing cluster and a Cortex-A53 quad-core processing cluster. Each Cortex-A57 core has a private 48+32-Kbyte L1 (instruction+data) cache, while each Cortex-A53 core has a private 32+32-Kbyte L1 (instruction+data) cache. The two Cortex-A57 cores share a 2-Mbyte L2 cache and the four Cortex-A53 cores share a smaller 1-Mbyte L2 cache; both clusters access a common 8-Gbyte DDR3 RAM (Figure 2.2). Furthermore, the Cortex-A53 cluster may operate from 450 MHz to 850 MHz and the Cortex-A57 cluster from 450 MHz to 1.1 GHz.
Figure 2.1: Exynos 5422 block diagram.
Figure 2.2: ARM Juno SoC block diagram.
2.3.3 Intel
For the experiments with the thread-level malleability technique, we have used an Intel Xeon E5-2603 v3 server. This platform features two homogeneous sockets with 6 cores each. Each core may run in the frequency range 1.2-2.4 GHz and has a private 32-Kbyte L1 cache and a private 256-Kbyte L2 cache. All the cores in the same socket share a 20-Mbyte L3 cache. Moreover, the platform is furnished with 64 Gbytes of DDR3 RAM (Figure 2.3). Although this platform is homogeneous (symmetric), it was chosen because it allows us to illustrate the advantages of thread-level malleability in a simpler setting. The initial approach tested on this platform was then adapted and ported to the ODROID platforms as part of this dissertation.
2.4 Measuring power consumption
Obtaining power dissipation measurements has been key to analyzing the behavior of the target DLA library on the different platforms. To do so, we have used PMLib, a portable framework developed at Universitat Jaume I and designed to interact with power measurement units.
Figure 2.3: Intel Xeon block diagram.
The current implementation of PMLib provides an interface to access internal/external Wattmeters and a number of tracing tools. Power measurements are retrieved by the application through a collection of routines that allow the user to query the power measurement units, create counters associated with a device where the power data is stored, start/continue/terminate power sampling, etc. All this information is managed by the PMLib server, which is in charge of acquiring data from the devices and sending back the appropriate answers to the invoking client application via the corresponding PMLib routines. The PMLib server is implemented as two different types of daemons in the framework: the external-PMLib server and the internal-PMLib server. The former runs on a separate system and collects power samples from one or more Wattmeters attached to the target platform where the application is running. In this case, interferences from the running daemon are avoided and noise is removed from the power measurements. However, some information is only accessible locally (e.g., power measurements from local sensors). In those situations, an internal-PMLib daemon running on each node of the target platform is used instead. Figure 2.4 illustrates the PMLib framework and its interaction with the target application and different tools, such as a performance tracing suite and a visualization tool. The starting point is a scientific application, instrumented with PMLib, that runs on the target platform. Attached to the application node(s) there are different internal/external power measurement devices (managed by the external-PMLib daemon). The user's code running on the target platform makes calls to the PMLib Application Programming Interface (API) to instruct the external-PMLib server [14] to perform distinct operations (e.g., start/stop collecting the data captured by the power measurement devices). Once the user's application has executed, and thanks to the smooth integration of the PMLib traces with different tracing suites (e.g., Extrae [87], TAU [88], VampirTrace [61], HDTrace [70]), the power trace can be inspected along with the performance traces obtained from the tracing suites using an appropriate visualization tool (e.g., Paraver [82]).
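To make the measurement workflow concrete, the sketch below shows how an instrumented region of a client application might interact with the PMLib daemon: connect to the server, create a counter bound to a measurement device, bracket the region of interest, and retrieve the collected samples. The routine names, signatures, header, device name and port are approximations reconstructed from the PMLib publications, not the definitive API of the installed library; they should be checked against the actual PMLib headers.

```c
/* Sketch of a typical PMLib client workflow. All identifiers below
 * (header, types, routine names, device name, server address/port) are
 * approximate/hypothetical and must be checked against the real library. */
#include "pmlib.h"                 /* hypothetical client header */

extern void compute_kernel(void);  /* the code region whose power we sample */

int main(void) {
    server_t  server;
    counter_t counter;
    line_t    lines;

    /* 1. Connect to the (internal or external) PMLib daemon. */
    pm_set_server("192.168.1.10", 6526, &server);

    /* 2. Create a counter bound to one measurement device of that server. */
    pm_set_lines("0-7", &lines);
    pm_create_counter("INA231", lines, 1 /*aggregate*/, 0 /*default freq*/,
                      server, &counter);

    /* 3. Bracket the region of interest with start/stop calls. */
    pm_start_counter(&counter);
    compute_kernel();
    pm_stop_counter(&counter);

    /* 4. Retrieve the samples collected by the daemon (to be dumped or
     *    merged with a performance trace), then release the counter. */
    pm_get_counter_data(&counter);
    pm_finalize_counter(&counter);
    return 0;
}
```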
2.4.1 PMLib and big.LITTLE platforms
As explained in the previous section, PMLib is composed of two daemons. The external-PMLib daemon is in charge of all the hardware measuring devices, which can be either internal Direct Current (DC) or external Alternating Current (AC) Wattmeters.
Figure 2.4: PMLib diagram.
On the other hand, the internal-PMLib server manages information that can only be retrieved locally; this includes the C-states of the processor and power measurements from energy sensors (e.g., RAPL [65], NVML counters [80], ARM current sensors [62]). Power samples on the big.LITTLE ODROID-XU3 platform are collected in this thesis using the internal-PMLib daemon and the current/voltage sensors present in this SoC. The ODROID-XU3 features four sensors that measure the power consumption of the Cortex-A15 cluster, the Cortex-A7 cluster, the GPU and the DRAM, respectively, every 250 ms. The PMLib framework thus retrieves power consumption samples from this platform at 4 Hz, and dumps them to the standard output or to a text/Paraver file.
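Given such a discrete power trace, the energy consumed by a code region is obtained by integrating the samples over time. A minimal sketch, assuming a fixed 250 ms sampling period (matching the 4 Hz rate above) and a simple rectangle rule, is shown below; the trace values are illustrative only.

```c
/* Minimal sketch: energy (in Joules) of a power trace sampled at a fixed
 * period, using a rectangle rule: E = sum_i P_i * dt. */
#include <stdio.h>

static double energy_joules(const double *power_w, int nsamples, double period_s) {
    double energy = 0.0;
    for (int i = 0; i < nsamples; i++)
        energy += power_w[i] * period_s;
    return energy;
}

int main(void) {
    /* Example power trace of one cluster (values are illustrative only). */
    double trace[] = { 2.1, 3.4, 3.5, 3.3, 2.0 };
    printf("Energy: %.3f J\n", energy_joules(trace, 5, 0.25));
    return 0;
}
```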
2.5 Summary
In this chapter, we presented the BLAS and LAPACK DLA libraries, which are used throughout this thesis. We also described the platforms used in our experiments: two ARM big.LITTLE platforms based on different architectures (32-bit ARMv7 and 64-bit ARMv8) and an Intel Xeon server. Finally, we reviewed PMLib, a framework designed to collect power samples from a wide variety of Wattmeters.
CHAPTER 3
Basic Linear Algebra Subprograms (BLAS)
In this chapter, we provide a detailed description of two of the three BLAS levels (BLAS-2 and BLAS-3) as implemented in the BLIS library. This specific library is used for the study because its structure allows us to adopt a low-level approach and extract detailed information about the source of the cost of each operation, something that is not possible with other solutions that follow a black-box approach, e.g., MKL. An adaptation of the library for AMPs is presented, along with a detailed study of the performance and energy consumption gains derived from this new version. That implementation includes a fully asymmetry-aware BLAS-3 and two asymmetry-aware routines (symv and gemv) from BLAS-2. The asymmetry-aware version of the library is tested on two recent ARM big.LITTLE SoCs, both implementing a particular class of heterogeneous architecture that combines a few high-performance (but power-hungry) cores with a collection of energy-efficient (though slower) cores. In order to carry out a complete study, different operand shapes and precisions are used in the tests.
3.1 Blas-like Library Instantiation Software (BLIS)
BLIS is a framework for instantiating the functionality of BLAS that increases the productivity of developers when implementing BLAS and BLAS-like operations. In the same vein as BLAS, BLIS is organized in three levels (hereinafter referred to as BLIS-x), plus 4 sublevels in BLIS-1. These levels provide the same functionality as BLAS but, thanks to its open-source philosophy, BLIS also offers additional options that allow the user to implement tuned operations:
• Level-1 is composed of 4 sublevels that tackle operations with operands of different nature:
– Level-1v targets operations on vectors.
– Level-1d tackles element-wise operations only on matrix diagonals.
– Level-1m performs element-wise operations on matrices.
– Level-1f focuses on fused operations on multiple vectors, meaning that one call to the routine performs the specific operation on f vectors simultaneously.
• Level-2 is in charge of operations with one matrix and (at least) one vector operand.
• Level-3 deals with operations that take one or more matrices as operands.
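As a concrete illustration of the level-3 functionality, the sketch below invokes a general matrix-matrix product through the standard Fortran BLAS interface (dgemm), which BLIS also exposes via its BLAS compatibility layer. The column-major layout and the dgemm signature follow the BLAS standard; the matrix sizes and fill values are arbitrary example choices.

```c
/* Minimal sketch: C = alpha*A*B + beta*C through the standard BLAS dgemm
 * interface (provided by BLIS's BLAS compatibility layer, or any other BLAS).
 * Matrices are stored in column-major order, as required by BLAS. */
#include <stdio.h>
#include <stdlib.h>

/* Fortran BLAS prototype (trailing underscore is the usual name mangling). */
void dgemm_(const char *transa, const char *transb,
            const int *m, const int *n, const int *k,
            const double *alpha, const double *A, const int *lda,
            const double *B, const int *ldb,
            const double *beta, double *C, const int *ldc);

int main(void) {
    int m = 4, n = 3, k = 5;                 /* example sizes */
    double alpha = 1.0, beta = 0.0;
    double *A = calloc((size_t)m * k, sizeof(double));  /* m x k */
    double *B = calloc((size_t)k * n, sizeof(double));  /* k x n */
    double *C = calloc((size_t)m * n, sizeof(double));  /* m x n */
    for (int i = 0; i < m * k; i++) A[i] = 1.0;
    for (int i = 0; i < k * n; i++) B[i] = 2.0;

    dgemm_("N", "N", &m, &n, &k, &alpha, A, &m, B, &k, &beta, C, &m);

    printf("C(0,0) = %.1f\n", C[0]);         /* expected: 1.0*2.0*k = 10.0 */
    free(A); free(B); free(C);
    return 0;
}
```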
BLIS follows the GotoBLAS approach [57] but further modularizes the operations (or kernels) in order to isolate a micro-kernel. This micro-kernel performs the targeted operation and is written in assembly or using vector intrinsics, extracting the maximum performance from the underlying architecture. Thus, to guarantee portability, the developer only needs to implement the micro-kernel in order to obtain an implementation for a different architecture. In addition, if high performance is desired, a few configuration parameters need to be adjusted to optimize the micro-kernel and the loops around it.
The configuration parameters of BLIS are mr and nr, referred to as register configuration parameters, and mc, kc and nc, identified as cache configuration parameters. The register configuration parameters are essential in the implementation of the micro-kernel, and their values depend on the throughput and latency of the SIMD units and on the number of available vector registers. Moreover, the cache configuration parameter kc is also crucial for the micro-kernel, since it defines the number of updates that the micro-kernel performs. kc, together with mc and nc, dictates the strides of the loops around the micro-kernel, and their values depend on the size and associativity degree of the cache levels. These cache configuration parameters are also used by the packing routines, which are in charge of arranging data in memory in order to minimize cache misses. Given an input matrix and the cache configuration parameters, a packing routine produces as output a submatrix (or micro-panel) of the dimension dictated by the cache parameters, with its values arranged properly. The goal is that the micro-panels produced by the packing routines fit in the appropriate cache levels, so that these transfers are fully amortized by enough computation from within the micro-kernel, as explained in [73]. Moreover, BLIS offers multiple levels of multithreading for nearly all level-3 operations via OpenMP or POSIX threads. This multi-level parallelism allows the partitioning of the matrices along several dimensions to attain high performance on multi-core architectures. In order to encode the information about the logical thread topology, the execution of each routine is driven by a control-tree, a recursive data structure that holds all the information necessary to combine the basic building blocks offered by BLIS. The control-tree for a given BLAS-3 operation governs, among others, which combination of loops has to be executed to complete the operation (that is, the exact algorithmic variant to execute at each level of the general algorithm), the stride of each loop (specific to each target architecture), and the exact points at which packing must occur. In BLIS, there is a single control-tree per operation, composed of as many recursive levels as loops comprise the operation. Thus, when executing a BLIS-3 kernel, at a given level of the algorithm, the control-tree indicates the loop stride of the current loop; whether any of the input operands has to be packed at that point; the parallelism degree of the current loop; whether any of the input operands needs to be transposed; and the next step of the algorithm (the next loop or the micro-kernel instantiation).
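To make the role of mr, nr and kc concrete, the following is a minimal, unoptimized C sketch of a gemm micro-kernel operating on packed micro-panels: it performs kc rank-1 updates on an mr x nr block of C held in a local accumulator. A real BLIS micro-kernel is written with assembly or vector intrinsics and follows BLIS's exact calling convention; the function name, parameters and packing layout used here are illustrative only.

```c
/* Illustrative (scalar) micro-kernel: C_block += A_panel * B_panel, where
 * A_panel is mr x kc (packed column by column), B_panel is kc x nr (packed
 * row by row), and C_block is an mr x nr tile of C with strides rs_c/cs_c.
 * MR/NR are the register-blocking parameters; kc is the loop length. */
#define MR 4
#define NR 4

static void ugemm_ref(int kc, const double *A_panel, const double *B_panel,
                      double *C, int rs_c, int cs_c) {
    double ab[MR * NR] = {0.0};          /* accumulator tile ("in registers") */

    for (int p = 0; p < kc; p++)         /* kc rank-1 (outer product) updates */
        for (int j = 0; j < NR; j++)
            for (int i = 0; i < MR; i++)
                ab[j * MR + i] += A_panel[p * MR + i] * B_panel[p * NR + j];

    for (int j = 0; j < NR; j++)         /* accumulate the tile into C */
        for (int i = 0; i < MR; i++)
            C[i * rs_c + j * cs_c] += ab[j * MR + i];
}
```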
3.2 BLIS-3
In this section, we present the level-3 BLAS in BLIS (hereafter BLIS-3) and a detailed description of its multi-threaded implementation. Given that all BLIS-3 kernels rely on the general matrix multiplication (gemm), this operation is explained first. Next, we describe how the ideas presented for gemm extend to the rest of the BLIS-3 operations. Finally, we describe the triangular system solve (trsm) separately, due to its specific features.
3.2.1 General matrix-matrix multiplication (gemm)
gemm is a crucial operation for the optimization of the Level-3 BLAS [47], as portable and highly tuned versions of the remaining Level-3 kernels are in general built on top of gemm. As shown in [68], all BLAS-3 kernels can be implemented by means of gemm provided a suitable partitioning of the matrices is performed, since gemm is applied to some of these partitions, and the "non-gemm-wise" partitions can be handled by Level-1 and/or Level-2 kernels. Modern high-performance implementations of gemm for general-purpose architectures, including BLIS and OpenBLAS, follow the design pioneered by Goto [59]. BLIS in particular implements the gemm C = α·A·B + β·C, where the sizes of A, B, C are respectively m×k, k×n, m×n, as three nested loops around a macro-kernel plus two packing routines (see Loops 1-3 in Figure 3.1). The macro-kernel is in turn implemented in terms of two additional loops around a micro-kernel (Loops 4 and 5 in Figure 3.1). In BLIS, the micro-kernel is typically encoded as a loop around a rank-1 (i.e., outer product) update using assembly or vector intrinsics, while the remaining five loops and the packing routines are implemented in C. Figure 3.2 illustrates how the loop ordering, together with the packing routines and an appropriate choice of the BLIS cache configuration parameters, orchestrates a regular pattern of data transfers through the levels of the memory hierarchy. In practice, the cache parameters nc, kc and mc, and the register parameters nr and mr, dictate the strides of the five outermost loops with the goal of streaming a kc × nr micro-panel of Bc, say Br, and the mc × kc macro-panel Ac into the FPUs from the L1 and L2 caches, respectively, while the kc × nc macro-panel Bc resides in the L3 cache (if present).
Loop 1   for jc = 0, . . . , n − 1 in steps of nc
Loop 2     for pc = 0, . . . , k − 1 in steps of kc
             B(pc : pc + kc − 1, jc : jc + nc − 1) → Bc                      // Pack into Bc
Loop 3       for ic = 0, . . . , m − 1 in steps of mc
               A(ic : ic + mc − 1, pc : pc + kc − 1) → Ac                    // Pack into Ac
Loop 4         for jr = 0, . . . , nc − 1 in steps of nr                     // Macro-kernel
Loop 5           for ir = 0, . . . , mc − 1 in steps of mr
                   Cc(ir : ir + mr − 1, jr : jr + nr − 1)                    // Micro-kernel
                      += Ac(ir : ir + mr − 1, 0 : kc − 1) · Bc(0 : kc − 1, jr : jr + nr − 1)
                 endfor
               endfor
             endfor
           endfor
         endfor
Figure 3.1: High performance implementation of gemm in BLIS. In the code, Cc ≡ C(ic : ic + mc − 1, jc : jc + nc − 1) is just a notation artifact, introduced to ease the presentation of the algorithm, while Ac, Bc correspond to actual buffers that are involved in data copies.
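As a back-of-the-envelope illustration of how the cache configuration parameters relate to the memory hierarchy (the values below are hypothetical and chosen only to make the arithmetic concrete; they are not the tuned BLIS parameters for any of the platforms in this thesis), consider double precision (8 bytes per element) on a core with a 32-Kbyte L1 data cache and a 2-Mbyte L2 cache:

```latex
% Illustrative sizing only: kc = 256, nr = 4, mc = 512 are hypothetical values.
\begin{align*}
\text{micro-panel } B_r:&\quad k_c \cdot n_r \cdot 8~\text{bytes} = 256 \cdot 4 \cdot 8 = 8~\text{Kbytes}
  \quad (\text{one quarter of the L1 cache}),\\
\text{macro-panel } A_c:&\quad m_c \cdot k_c \cdot 8~\text{bytes} = 512 \cdot 256 \cdot 8 = 1~\text{Mbyte}
  \quad (\text{one half of the L2 cache}).
\end{align*}
```

Choosing the parameters so that each panel occupies only a fraction of its cache level leaves room for the other operands (e.g., the Br micro-panel and the Cc micro-tile) that must coexist in that level during the computation.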
The multi-threaded version of gemm integrated in BLIS exploits the concurrency available in the five-loop nested organization at one or more levels (i.e., loops). Furthermore, the approach takes into account the cache organization of the target platform (e.g., the presence of multiple sockets, which cache levels are shared/private, etc.), while discarding the parallelization of loops that would incur race conditions as well as those that exhibit too fine a granularity. The insights gained from the analyses carried out in [94, 89] about the loop(s) to be parallelized in a multi-threaded implementation of gemm can be summarized as follows:
• Loop 5 (indexed by ir). With this option, different threads execute independent instances of the micro-kernel, while accessing the same micro-panel Br in the L1 cache. The amount of