Enabling the Use of Low Power Mobile and Embedded Technologies For
Total Page:16
File Type:pdf, Size:1020Kb
Enabling the Use of Embedded and Mobile Technologies for High-Performance Computing Author: Advisor: Nikola Rajovic´ Alex Ramirez A THESIS SUBMITTED IN FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor per la Universitat Polit`ecnicade Catalunya Departament d’Arquitectura de Computadors Barcelona, Spring 2017 To my family . Моjоj породици . Acknowledgements During my PhD studies I received help and support from many people - it would be strange if I did not. Thus I would like to thank professors Mateo Valero and Veljko Milutinovi´cfor recognizing my interest in HPC and introducing me to my advisor, professor Alex Ramirez. I would like to thank him for trusting in me and giving me an opportunity to be a pioneer of HPC on mobile ARM platforms, for guiding and shaping my work last five years, for helping me understand what real priorities are, and for being pushy when it was really needed. In addition, thank you Carlos and Puzo for helping me to have a smooth start of my PhD and for filling the gap when Alex was too busy. Last two years of my PhD I have been working with Alex remotely, and I would like to thank the local crew who helped me: professor Eduard Ayguade for providing me with some very hard-to-get manuscripts and accepting to be my "ponente" at the University; professor Jesus Labarta for thorough dis- cussions about parallel performance issues and know-how lessons on demand; Alex Rico for helping me deliver work for the Mont-Blanc project and for friendly advices every so little; Filippo Mantovani for making sure I could use the prototypes without outages and for filtering-out internal bureaucracy issues. In addition, I would like to acknowledge support from BSC Tools depart- ment, especially Harald Servat and Judit Gimenez, who always managed to find a free slot for me in their busy agenda. Gabriele Carteni and Lluis Vilanova shared with me their Linux Guru experience, which was extremely helpful during deployment of my first proto- type - Tibidabo. Thank you guys! In addition, thank you Dani for bypassing ticketing system and fixing issues related to our internal prototypes as soon as I trigger them. Thanks to prelectura committee and the constructive comments I re- ceived, the quality of my PhD thesis manuscript was significantly improved. Professor Nacho Navarro used to cheer me up many times with his down- to-earth attitude, ideas and advices - I am sad that he cannot see me defend- ing my thesis. v I shared my office with a lot of people, and I would like to thank them for all the good and bad times. I would like to thank all my Serbian friends I lived, used to hang around, played basketball, or shared office with - thank you Браћала, Боки, Бране, Радуловача, Уги, Зоки, Зуки, Влаjко, Мрџи, Павле, Вуjке, Перо, Пузо ... Thanks to Rajovi´cheadquarters and their unconditional support for my ideas, my PhD adventure finally ends with the writing of this Acknowledge- ment. Knowing you are with me helped a lot! Last but not least I would like to thank my darling, my Ivana, for the patience, understanding and endless support. You are the true hero . and thank you Ugljeˇsa,my son, for inspiring me with your smile . Author My graduate work leading to these results was supported by the PRACE project (European Community funding under grants RI-261557 and RI-283493), Mont-Blanc project (European Community’s Seventh Framework Programme [FP7/2007-2013] under grant agreement no 288777), the Spanish Ministry of Science and Technology through Computacion de Altas Prestaciones (CICYT) VI (TIN2012-34557), and the Spanish Government through Programa Severo Ochoa (SEV-2011-0067). Author has been financially supported throughout his studies with a grant from Barcelona Supercomputing Center. vi Abstract In the late 1990s, powerful economic forces led to the adoption of commodity desktop processors in High-Performance Computing (HPC). This transfor- mation has been so effective that the November 2016 TOP500 list is still dominated by x86 architecture. In 2016, the largest commodity market in computing is not PCs or servers, but mobile computing, comprising smart- phones and tablets, most of which are built with ARM-based Systems on Chips (SoC). This suggests that once mobile SoCs deliver sufficient perfor- mance, mobile SoCs can help reduce the cost of HPC. This thesis addresses this question in detail. We analyze the trend in mobile SoC performance, comparing it with the similar trend in the 1990s. Through development of real system prototypes and their performance anal- ysis we assess the feasibility of building an HPC system based on mobile SoCs. Through simulation of the future mobile SoC, we identify the missing features and suggest improvements that would enable the use of future mo- bile SoCs in HPC environment. Thus, we present design guidelines for future generations mobile SoCs, and HPC systems built around them, enabling the new class of cheap supercomputers. vii Contents Front matteri Dedication . iii Acknowledgements . .v Abstract . vii Contents . xi List of figures . xvi List of tables . xviii 1 Introduction1 1.1 Microprocessors in Supercomputing . .1 1.2 Energy Efficiency . .3 1.3 Mobile Processors Evolution . .5 1.3.1 ARM Processors . .6 1.3.2 Embedded GPUs . .8 1.4 Contributions . .8 2 Related Work 11 3 Methodology 17 3.1 Hardware platforms . 17 3.2 Single core, CPU, and node benchmarks . 18 3.2.1 Mont-Blanc benchmarks . 18 3.3 System benchmarks and workloads . 20 3.4 Power measurements . 22 3.5 Simulation methodology . 22 3.6 Tools . 24 3.6.1 Extrae . 24 3.6.2 Paraver . 25 3.6.3 Clustering tools . 25 3.6.4 Basic analysis tool . 25 3.6.5 GA tool . 25 ix CONTENTS 3.7 Reporting . 26 4 ARM Processors Performance Assessment 27 4.1 Floating-Point Support Issue . 27 4.2 Compiler Flags Exploration . 28 4.2.1 Compiler Maturity . 29 4.3 Achieving Peak Floating-Point Performance . 30 4.3.1 Algebra Backend . 31 4.4 Comparison Against a Contemporary x86 Processor . 32 4.4.1 Results . 33 4.4.2 Discussion . 34 5 Tibidabo, The First Mobile HPC Cluster 37 5.1 Architecture . 37 5.2 Software Stack . 39 5.3 Evaluation . 40 5.3.1 Methodology . 40 5.3.2 Cluster Performance . 40 5.3.3 Interconnect . 44 5.4 Comparison Against an X86-Based Cluster . 45 5.4.1 Reference x86 System . 47 5.4.2 Applications . 47 5.4.3 Power Acquisition . 47 5.4.4 Input Configurations . 48 5.4.5 Results . 48 5.5 Projections . 51 5.6 Interconnect requirements . 57 5.6.1 Lessons Learned and Next Steps . 58 5.7 Conclusions . 61 6 Mobile Developer Kits 63 6.1 Evaluation Methodology . 64 6.2 CARMA Kit: a Mobile SoC and a Discrete GPU . 64 6.2.1 Evaluation Results . 65 6.3 Arndale Kit: Improved CPU Core IP and On-Chip GPU . 68 6.3.1 ARM Mali-T604 GPU IP . 68 6.3.2 Evaluation Results . 69 6.4 Putting It All Together . 70 6.4.1 Comparison Against a Contemporary x86 Architecture 70 6.5 Conclusions . 75 x CONTENTS 7 The Mont-Blanc Prototype 77 7.1 Architecture . 77 7.1.1 Compute Node . 78 7.1.2 The Mont-Blanc Blade . 78 7.1.3 The Mont-Blanc System . 79 7.1.4 The Mont-Blanc Software Stack . 80 7.1.5 Power Monitoring Infrastructure . 82 7.1.6 Performance Summary . 83 7.2 Compute Node Evaluation . 84 7.2.1 Core Evaluation . 85 7.2.2 Node Evaluation . 85 7.2.3 Node Power Profiling . 87 7.3 Interconnection Network Tuning and Evaluation . 88 7.4 Overall System Evaluation . 90 7.4.1 Applications Scalability . 90 7.4.2 Comparison With Traditional HPC . 95 7.5 Scalability Projection . 97 7.6 Conclusions . 99 8 Mont-Blanc Next-Generation 101 8.1 Methodology . 101 8.1.1 Description . 102 8.1.2 Benchmarks . 104 8.1.3 Applications . 105 8.1.4 Base and Target Architectures . 105 8.1.5 Validation . 105 8.2 Performance Projections . 106 8.2.1 Mont-Blanc Prototype . 108 8.2.2 NVIDIA Jetson . 109 8.2.3 ARM Juno . 109 8.2.4 NG Node . 110 8.3 Power Projections . 111 8.3.1 Methodology . 111 8.4 Conclusions . 117 9 Conclusions 119 9.1 Future Work . 120 Back matter 121 List of publications . 121 Bibliography . 123 xi List of Figures 1.1 Development of CPU architectures share in supercomputers from TOP500 list. .2 1.2 Vector and commodity processors peak floating-point perfor- mance developmenmt . .3 1.3 Server and mobile processors peak floating-point performance development. .5 3.1 Methodology: power measurement setup. 22 3.2 An example of a Dimemas simulation where each row presents the activity of a single processor: it is either in a computation phase (grey) or in MPI communication (black). 24 4.1 Comparison of different compilers for ARM Cortex-A9 with Dhrystone benchmark. 29 4.2 Comparison of different compilers for ARM Cortex-A9 with LINPACK1000x1000 benchmark. 30 4.3 Exploration of the ARM Cortex-A9 double-precision floating- point pipeline for FADD and FMAC instructions with mi- crobenchmarks. 31 4.4 Performance of HPL on ARM Cortex-A9 for different input matrix and block sizes. 32 4.5 Comparison between Intel Core i7-64M and ARM Cortex-A9 with SPEC CPU2006 benchmark suite. 35 5.1 Tibidabo prototype: physical view of the node card and the node motherboard . ..