Politecnico di Torino PhD School

High Performance (Scientific) Computing: from your desktop PC to parallel supercomputers
A user-friendly introduction

Michele Griffa
PhD student, XIX cycle, Dept. of Physics, Politecnico di Torino, corso Duca degli Abruzzi 24, 10129 Torino (Italy)
and Bioinformatics and High Performance Computing Laboratory, Bioindustry Park of Canavese, via Ribes 5, 10010 Colleretto Giacosa (Italy)
E-mail: [email protected]
Personal Web site: http://www.calcolodistr.altervista.org
16/10/06

Table of contents

1 Introduction
2 High Performance Computing (HPC): what does it mean and why do we need it?
3 Computer architectures
3.1 The Von Neumann architecture and its bottlenecks
3.2 CPU built-in parallelism: technological improvements
3.2.1 Bit-parallel memory and arithmetic
3.2.2 Implementation of I/O-dedicated processors
3.2.3 Memory interleaving
3.2.4 Hierarchies of memories
3.2.5 Multitasking
3.2.6 Instruction look-ahead
3.2.7 Multiple functional units
3.2.8 Pipelining
3.2.9 Floating-point arithmetic: SSE technology
3.2.10 CPU design "philosophies": past and future scenarios
3.2.10.1 CISC and RISC architectures
3.2.10.2 VLIW (Very Long Instruction Word) CPUs: 64-bit architectures
3.2.10.3 SMP (Symmetric Multiprocessor) and multi-core computers
3.3 Parallel (multiprocessor) computers
3.3.1 SIMD systems
3.3.2 MIMD systems
3.3.2.1 Shared-memory architectures
3.3.2.2 Distributed-memory architectures
3.3.3 Interconnection issues
3.4 Grid Computing: the future of HPC?
4 1st level of HPC: optimization of scalar codes (squeezing the hardware of your desktop PC!)
4.1 Profiling your programs
4.1.1 Example of software for performance analysis
4.2 Source-code level optimization
4.3 Assembly-code level optimization by compilers
4.4 The Intel Fortran-C/C++ Linux compiler for x86/x86_64 architectures as an example of an optimizing compiler for Intel architectures
5 2nd level of HPC: parallel programming for parallel computers
5.1 Parallel programming paradigms
5.1.1 Message-passing paradigm
5.1.1.1 MPI libraries
5.1.2 Shared-memory paradigms
5.1.2.1 OpenMP
6 Concluding remarks and acknowledgements
7 Bibliography
8 List of useful Web sites

[Group photograph] Group picture of the teachers (red numbers) and participants in the 13th Cineca Summer School on Parallel Computing. Instructors: (1) Sigismondo Boschi, (2) Carlo Cavazzoni, (3) Cristiano Calonaci, (4) Claudio Gheller, (5) Gerardo Ballabio. Giovanni Erbacci and Chiara Marchetto (the other teachers) are not in this photo. Green arrow (and green t-shirt): me. Many thanks to C. Calonaci and the Cineca Web administrators for having recovered this photograph.

[Group photograph] Group picture of the teachers (red numbers) and participants in the 2nd Caspur Summer School on Advanced HPC. Instructors: (1) Federico Massaioli, (2) Giorgio Amati, (3) Gabriele Mambrini, (4) Roberto Verzicco, (5) Jack Dongarra, (6) Elena ?????. The other teachers of the School are not in this photo. Green arrow (but white t-shirt, this time): me.

1. Introduction

This document is a brief and incomplete report on the art of High Performance (Scientific) Computing (HPC), targeted at Scientists (certainly not at Computer Scientists or Computational Scientists).
Its main objective is to give the interested non-specialist reader an overview of how to improve the performance of the computing codes he/she uses in everyday scientific work. Performance does not mean only speed of execution: high performance is concerned with the trade-off between speed, machine workload, and the occupation of volatile and mass-storage memory. This report should show that high-performance hardware is neither necessary nor sufficient for improving the performance of one's scientific computing code: algorithm design, the methods and approaches used to implement the code in a programming language, and the customization of the code to the features of the hardware are just as important.

Section 2 presents a brief introduction to the world of HPC, with an example of a real-world computing problem whose solution requires computing resources that are still not available today. It is an example of a grand-challenge problem, useful for showing the importance of HPC in everyday human activities.

Section 3 is entirely dedicated to computer architectures, with a logical flow that starts from the basic ingredients of the Von Neumann architecture (the basic logical model of a sequential electronic computer) and arrives at the main architectures of today's parallel supercomputers. The intermediate path covers, in chronological and logical order, the most important steps that have led from single desktop PCs to parallel architectures, with particular attention to the hardware solutions that have introduced parallelism into the operation of a single CPU.

Section 3.1 is dedicated to the Von Neumann architecture of a serial computer with a single processing unit (CPU) and to its bottlenecks. Section 3.2 is a tentative and incomplete presentation of the evolution of the CPU architectures and technologies that have improved processor performance, with a stress on the technological solutions that implement parallelism within a single processing unit. This choice of focus contrasts with the usual presentations in non-specialized mass media, which tend to emphasize the great increase of CPU speed in terms of clock cycles per unit of time (clock frequency). As Section 3.2 shows, the impressive improvements in CPU performance are based not only on increases in CPU speed but also on other technical solutions that exploit parallelism. The famous Moore's law, which has seemed to govern the development of microelectronics over the last three decades and to underlie every leap forward in computing power, is not the whole story! This fact has become very clear to hardware and software manufacturers in recent years: physical limits to device integration at the micron scale have been reached, although nanotechnologies promise to push those limits further. Parallelism seems nowadays to be the only viable way to keep computing power growing.

Section 3.3 is dedicated to multiprocessor computers, with a brief overview of the different types of architectures and of the technological solutions used to realize them. This part concludes with a brief presentation of Grid Computing as a kind of parallel computing solution, with its promises and its limits.
Sections 4 and 5 are entirely dedicated to HPC from the point of view of a user who wants to obtain the best performance from the facilities he/she has, be they a single commodity PC or a big parallel supercomputer.

Section 4 consists of a collection of techniques and tricks to improve the performance of one's computing codes on a single desktop PC, trying to show that off-the-shelf CPUs have many resources that can be exploited to obtain great results in scientific computing. HPC can be practiced, first of all, on serial codes running on off-the-shelf desktop PCs: the report considers some aspects of code optimization that use all the resources of a serial PC. In Section 4.2, some general rules are considered for improving the performance of serial code according to features common to the CPU architectures available today on the desktop market. I treat here only the x86 architectures, also known as IA_32 (32-bit architectures), of modern CPUs. Since 2004, x86_64 (64-bit) CPUs have been available on the high-end desktop market; nowadays they are commonly used on all categories of computers, from desktops to mobile systems, from servers to clusters. 64-bit CPUs compatible with the IA_32 instruction set (first introduced by AMD, then by Intel, which implemented them in the Pentium 4 starting from the Prescott version) improve performance in fixed-point and floating-point arithmetic and, more importantly, can address more than 4 GiB of RAM, which is the main objective they were designed for. Although x86_64 CPUs are the standard in the design of today's multiprocessor supercomputers, the optimization of code running on them requires additional knowledge of hardware features which is beyond the scope of this report. Anyway, some of the considerations presented below for x86 architectures remain valid for x86_64 ones.

In Section 4.3, other types of optimization operations and methods are described, which are realized not at the source-code level but at the Assembly-code level, using specific option flags implemented in modern compilers. There, and throughout the whole report, I consider the GNU/Linux OS as the standard working environment and GCC (GNU Compiler Collection, version 4.1.1) as the standard compiler suite. This choice is motivated by the fact that GNU/Linux OSs and software are the de facto standard tools for Scientific Computing, particularly for cluster computing. Section 4.4 introduces some features of the Intel Fortran-C/C++ compiler, which Intel releases freely under a non-commercial license for personal use. This compiler is optimized for Intel CPU architectures and is a good example of a tool for improving the performance of serial codes according to the specific hardware features of the CPU in use.

When scaling issues imply that the resources of a single PC are not sufficient for a real-world execution of the code, it is necessary to implement a parallel version of the code, to be run on multiprocessor computers. The short code sketches that follow illustrate, in a schematic way, some of the tools and paradigms discussed in Sections 4 and 5.
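As a concrete illustration of the profiling step of Section 4.1, here is a minimal sketch using gprof, the GNU profiler distributed alongside GCC (one common choice of performance-analysis software; the file and function names are illustrative only):

    /* profile_me.c - a minimal, self-contained target for gprof.
       The helper is deliberately expensive so that it dominates
       the flat profile. */
    #include <stdio.h>

    static double harmonic(int n)
    {
        int i;
        double s = 0.0;
        for (i = 1; i <= n; i++)
            s += 1.0 / (double) i;
        return s;
    }

    int main(void)
    {
        int k;
        double total = 0.0;
        for (k = 0; k < 1000; k++)
            total += harmonic(100000);
        printf("total = %f\n", total);
        return 0;
    }

Compiling with "gcc -O2 -pg profile_me.c -o profile_me" instruments the executable; running it produces a gmon.out file, and "gprof profile_me gmon.out" then prints a flat profile and a call graph showing where the run time was actually spent.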
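A classic instance of the source-level rules of Section 4.2 is matrix traversal order. The sketch below (array size arbitrary) exploits the fact that C stores two-dimensional arrays row by row, so the first version moves through memory contiguously and makes good use of the cache hierarchy, while the second jumps N doubles ahead at every access:

    #define N 1024
    static double a[N][N];

    /* Cache-friendly: the innermost loop walks one row,
       i.e. consecutive memory locations (unit stride). */
    void scale_by_rows(void)
    {
        int i, j;
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                a[i][j] *= 2.0;
    }

    /* Cache-hostile: the innermost loop walks one column,
       i.e. locations N*sizeof(double) bytes apart. */
    void scale_by_columns(void)
    {
        int i, j;
        for (j = 0; j < N; j++)
            for (i = 0; i < N; i++)
                a[i][j] *= 2.0;
    }

Note that Fortran uses the opposite (column-major) layout, so the favourable loop order is reversed there. Optimizations of this kind also interact with the compiler flags of Section 4.3: with GCC, levels such as -O2 or -O3, possibly combined with an architecture flag like -march=pentium4, let the compiler schedule and, where possible, vectorize such loops.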
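For the message-passing paradigm of Sections 5.1.1 and 5.1.1.1, here is a minimal MPI sketch (an assumed two-process example, not taken from the sections themselves): process 0 sends one double to process 1. It is typically compiled with an MPI wrapper such as mpicc and launched with "mpirun -np 2":

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank;
        double value = 0.0;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* which process am I? */

        if (rank == 0) {
            value = 3.14;
            /* send 1 double to rank 1, with message tag 0 */
            MPI_Send(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     &status);
            printf("rank 1 received %f\n", value);
        }

        MPI_Finalize();
        return 0;
    }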
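For the shared-memory paradigm of Section 5.1.2, a minimal OpenMP sketch: a single directive asks the compiler to split the loop iterations among threads, with the reduction clause combining the per-thread partial sums safely. (OpenMP support entered GCC only with version 4.2, via the -fopenmp flag; with the GCC 4.1.1 referred to in this report, an OpenMP-aware compiler such as the Intel one of Section 4.4 is needed.)

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        const int n = 1000000;
        double sum = 0.0;
        int i;

        /* iterations are distributed among threads; each thread keeps
           a private partial sum, merged at the end of the loop */
        #pragma omp parallel for reduction(+:sum)
        for (i = 1; i <= n; i++)
            sum += 1.0 / (double) i;

        printf("harmonic(%d) = %f (up to %d threads)\n",
               n, sum, omp_get_max_threads());
        return 0;
    }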