Introduction to OpenMP
Introduction to OpenMP
HW & SW for High Performance Computing
Master's Degree in Advanced Computing for Science and Engineering
FIM - 2012/13
Vicente Martín
v0.0

Contents
● Overview: Architectures and Tools.
● OpenMP.
– The place of OpenMP among other HPC programming paradigms.
– Memory and execution model.
– Environment.
– Directives.

Taxonomy (diagram): OpenMP, UPC, HPF, MPI.

Example: cc-NUMA
Silicon Graphics Origin 2000: two processors sharing a common hierarchical memory. Caches are held coherent (cc). The global shared memory space is unique. Access to memory banks physically located on remote boards goes through an interconnection network. The worst-case access time to local, on-board memory is 318 ns. If the machine has 64 nodes, the worst-case access time is 1067 ns.

Example: Multicomputer
Tarzan: Nodes 5-6: quad-processor PPC 604e SMPs. Nodes 1-4: Power2 SC uniprocessors.
Each node of the IBM SP2 is a workstation with its own Operating System controlling its separate address space. Access to a memory position belonging to another node requires the communication of two independent OSs through explicit message passing, in this case helped by specialised HW: the High Performance Switch.
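To make the contrast between the two previous examples concrete, here is a minimal sketch, not taken from the slides, of what the shared-memory model means for the programmer: every OpenMP thread reads and writes the same address space, so a loop is parallelized just by annotating it. The array and variable names are illustrative only.

#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void)
{
    static double a[N];   /* one array, visible to every thread */
    double sum = 0.0;     /* shared accumulator                  */

    /* All threads write into the same array 'a': it lives in the
       single shared address space of the SMP/cc-NUMA machine.    */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * i;

    /* The reduction clause makes the runtime combine the
       per-thread partial sums into the shared 'sum' safely.      */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("max threads available: %d, sum = %f\n",
           omp_get_max_threads(), sum);
    return 0;
}

On an SP2-style multicomputer, where each node only addresses its own memory, the same reduction would need explicit messages between processes (e.g. MPI_Reduce); OpenMP alone is not enough there.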
Machines available for practicals.
● Magerit.
– Multicomputer. Cluster architecture made up of 245 nodes with 16 Power7 cores each. Linux SLE. 8 TB RAM / 200 TB disk under GPFS.
● Triqui 1-2-3-4. Intel SMP, 8 CPUs (2 quad-cores). Linux (main machine for OpenMP).

CeSViMa: Magerit
MPI practicals (limited OpenMP). Sustained performance >70 Tflops.
Multicomputer: Linux SLE
– Infiniband QDR: MPI inter-process communication.
– Gigabit Ethernet: file system (GPFS) + management.
245 compute nodes
– 16 Power7 cores and 32 GB RAM per node.
– 3920 cores, ~8 TB of RAM and 147 TB of local disk.
Disks
– 192 TB under GPFS.
– 256 SATA HDs x 750 GB.
– Distributed (16 Power5 servers).
– Fault tolerant: RAID5 + hot spare.
Interactive nodes:
– ssh [email protected]
Computational nodes:
– Under SLURM-Moab.
Compilers:
– IBM XL C/C++ (xlc, xlC), Fortran 77/90/95 (xlf, xlf90, xlf95) and thread-safe versions (_r commands).
– GNU compilers (gcc, g++, g77).
– MPI wrappers (mpicc, mpiCC, mpif77, mpif90). The backends are the IBM XL compilers by default.
– MPI: Lamm, Glenn Messages.
A job definition file is needed in order to submit a job.
– (see Magerit docs, http://static.cesvima.upm.es/doc/manual/Magerit-GuiaUsuarios.pdf)
– [email protected]

SLURM: Simple Linux Utility for Resource Management
– A scalable cluster management and job scheduling system for Linux clusters.
– A SLURM daemon (slurmd) runs on every node, under the control of a central daemon (slurmctld).
– Useful commands: sacct, salloc, sattach, sbatch, sbcast, scancel, scontrol, sinfo, smap, squeue, srun, strigger and sview.
– QoS based.

SLURM-Moab basic commands
– jobcancel: deletes a job from the execution queue.
– jobcheck: shows detailed information about a queued job.
– jobq: shows the state of the user's jobs in the system.
– jobstart: gives an estimate of the starting time of the job.
– jobsubmit: sends a job to the system for its execution.
A job definition file is needed in order to submit a job (see Magerit docs):

#!/bin/bash
#----------------------- Start job description -----------------------
#@ group = [project_id]
#@ class = [class_name]
#@ initialdir = /gpfs/projects/[project_id]/[data_dir]
#@ output = res/[programoutfile].out
#@ error = res/[programerrfile].err
#@ total_tasks = [number of tasks]
#@ wall_clock_limit = [hh:mm:ss]
#------------------------ End job description ------------------------
#-------------------------- Start execution --------------------------
# Run our program
srun ./[myprogram]
#--------------------------- End execution ---------------------------

● Triqui 1-2-3-4:
– 8 cores per node.
– Intel compilers (icc, ifort): use the -openmp switch for OpenMP.
– Located in /opt/intel/Compiler/11.1/069
● Look for the exact place: it changes with compiler versions.
● There is a Documentation directory.
● You have to source the files (source or . commands): iccvars.sh (C language and Bourne shell; there is also a .csh version) and ifortvars.sh (idem, Fortran version), in /opt/intel/Compiler/11.1/069/bin, with the argument intel64.
– Place the commands in .bash_profile if you don't want to repeat them each time.
– GNU compilers: use the -fopenmp switch.

● Triqui 1-2-3-4:
● Examples: Intel compiler.
– ifort -free -openmp sourcefile.f -o compiled
– icc -openmp sourcefile.c -o compiled
● Examples: GNU compiler.
– gcc -fopenmp sourcefile.c -o compiled
– gfortran -fopenmp -ffree-form sourcefile.f -o compiled

Compilers available in:
● Magerit:
– XL compilers (xlf, xlc): use the switch -qsmp=omp and the thread-safe libraries, calling the compilers through the scripts xlf_r and xlc_r.
– GNU compilers: -fopenmp (check the OpenMP support level for each architecture).
– NOTE: Magerit nodes are 16-core SMPs, hence the ideal speedup is a factor of 16.
● Examples: IBM XL compilers.
– xlc_r -qsmp=omp sourcefile.c -o compiled
– xlf_r -qsmp=omp sourcefile.f -o compiled
– xlf77_r, xlf90_r ...

Multiprocessor examples (figures): shared memory with NUMA access. SGI Origin 2000 [Laudon 98]. Cray T3D.

Parallel programs always have overheads:
● Communications and synchronizations: typically this is the most important overhead.
● Non-optimal algorithm: the parallel algorithm may not be as efficient as the sequential one.
● Parallel SW overhead: extra cost of the parallel implementation (e.g. calculations associated with domain decomposition).
● Load balancing: task migration, context switches, etc. needed to balance the load among the different threads/processors/nodes.
These contributions, correctly expressed, are additive.

● Speedup factor S: the ratio between the time taken by a sequential machine and the time taken for the same calculation on a parallel one:
S = T_seq / T_conc
● Defining T_conc(N) as the time taken by a parallel machine with N nodes, usually we will be interested in:
S(N) = T_conc(1) / T_conc(N)
● Ideally, the value of S is N.
– This is linear speedup.
– There can be superlinear speedup (usually an effect of the increased aggregate cache in a parallel machine).
● Efficiency: speedup per processor, S/N. Its ideal value would be 1.

Two limiting situations for the speedup factor (figure): speedup saturation due to parallel overhead (usually communications), with curves for fixed subdomain (grain) size and for fixed problem size.
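As a hands-on illustration of the speedup and efficiency definitions above, the following minimal sketch, not part of the original slides, times the same loop once with a single thread and once with all available processors using omp_get_wtime(), and prints S(N) and S(N)/N. The workload and the file name (speedup.c) are arbitrary; compile it as in the examples above, for instance gcc -fopenmp speedup.c -o speedup -lm.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>

#define N 20000000L

/* Dummy workload: an element-wise transcendental evaluation,
   parallelized over the loop iterations.                      */
static double work(const double *x, long n)
{
    double s = 0.0;
    #pragma omp parallel for reduction(+:s)
    for (long i = 0; i < n; i++)
        s += sin(x[i]) * cos(x[i]);
    return s;
}

int main(void)
{
    double *x = malloc(N * sizeof *x);
    if (!x) return 1;
    for (long i = 0; i < N; i++)
        x[i] = (double)i / N;

    /* Sequential reference run: force a single thread.            */
    omp_set_num_threads(1);
    double t0 = omp_get_wtime();
    double s1 = work(x, N);
    double t_seq = omp_get_wtime() - t0;

    /* Parallel run with as many threads as there are processors.  */
    int nproc = omp_get_num_procs();
    omp_set_num_threads(nproc);
    t0 = omp_get_wtime();
    double s2 = work(x, N);
    double t_conc = omp_get_wtime() - t0;

    printf("check values: %g %g\n", s1, s2);
    printf("T_seq = %.3f s  T_conc = %.3f s  S(N) = %.2f  efficiency = %.2f\n",
           t_seq, t_conc, t_seq / t_conc, (t_seq / t_conc) / nproc);

    free(x);
    return 0;
}

On an 8-core Triqui node or a 16-core Magerit node the ideal result would be S(N) close to 8 or 16; the overheads listed above (communications/synchronization, non-optimal algorithm, parallel SW overhead, load balancing) are what keep the measured value below that.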
Domain Decomposition and Communications Overhead.
● How the problem domain is decomposed affects the parallel overhead through communications.
● Example: balanced first-neighbours algorithm in 2D.
– Two-dimensional square domain with nN = √(nN) x √(nN) nodal points.
– Sets of n nodal points (the grain size) are distributed to each of the N processors.
– The algorithm only needs to know the values at the neighbouring points. How do we divide the problem domain?
● The amount of communications will be proportional to the length of the border between subdomains.
● Consider two possibilities:
– N rectangles of √(nN) x (√(nN)/N).
– N squares of √n x √n.
● The length of the border, for big enough N, will be:
– Rectangular case: l_rec = N √(nN)
– Square case: l_sqr = 2N √n
● The ratio of communications between them:
l_rec / l_sqr = √N / 2
– Using a rectangular domain decomposition we will perform approximately √N/2 times more communications than in the square case: it will be less efficient and less scalable.

Amdahl, Gustafson and Scalability
● In 1967 Amdahl argued that if a program has an intrinsically sequential part, s, and a parallel one, p (s + p = 1), then using N processors the maximum speedup would be:
S = (s + p) / (s + p/N) = 1 / (s + p/N)
– According to this, if we have 1024 processors and the sequential part is just 0.5%, the maximum speedup would be about 168!... A sequential part of 5% would limit it to less than 20...
(Amdahl's law plot: Wikipedia.)
● In 1988 Gustafson argued that s and p are not constants of the program itself, and that the correct values to use are s' and p', those measured on the parallel system. A sequential system substituting the parallel one would then take a time s' + p'N, so the speedup would be:
S = (s' + p'N) / (s' + p') = N + (1 - N) s'
– This resembles much more the fixed-subdomain-size curves we saw before...
● Scalability has to be studied for a given implementation on a given machine.

MPI
● De facto standard for the message passing model. Agreed among companies, universities and other research centers.
● It is a multicomputer-oriented model (though there are implementations on shared memory machines).
● Version 1.0 in 1994, 2.0 in 1997: parallel I/O, dynamic creation and destruction of tasks, one-sided communication, etc.
● The programmer has to take care of all the parallel details: MPI solves only the part closest to the HW communications, not the message-passing logic.
● Bindings for F77, C, C++, f90 (with caveats...), even Java, Python...

OpenMP
● First specification in 1998. Version 2.0 (f90) in 2000, C/C++ in 2002. Version 2.5 (f90/C/C++) in 2005. Agreed mainly by HW vendors but also users. Version 3.0 in May 2008 (major change: tasks).
● API for shared memory computers. Focused on getting reasonable efficiency with a minimal investment (no parallel-specific source code).
● The original, serial, source code is "annotated" with compiler directives and conditional compilation statements. Also environment variables. (A minimal annotated example is sketched at the end of this section.)
● F77, f90, C and C++ bindings.
● Allows for an "incremental" parallelization. An intermediate level between the HPF "high level" and the MPI "low level".
http://www.openmp.org, www.compunity.org

UPC
● ANSI C extension with PGAS ideas
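To illustrate the "annotate the serial code" approach described in the OpenMP item above, here is a minimal sketch, not taken from the slides: a serial program whose loops are parallelized with directives, with a conditional-compilation guard on the _OPENMP macro and the thread count controlled by the OMP_NUM_THREADS environment variable. File and variable names are illustrative.

#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>   /* only needed when compiling with OpenMP support */
#endif

#define N 1000

int main(void)
{
    double x[N], norm2 = 0.0;

    /* A compiler without OpenMP support simply ignores the unknown
       pragma, so the very same source still builds and runs serially. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        x[i] = 1.0 / (i + 1);

    #pragma omp parallel for reduction(+:norm2)
    for (int i = 0; i < N; i++)
        norm2 += x[i] * x[i];

#ifdef _OPENMP
    /* Conditional compilation: this block exists only in the OpenMP build. */
    printf("OpenMP build, up to %d threads (set OMP_NUM_THREADS to change it)\n",
           omp_get_max_threads());
#endif
    printf("norm2 = %f\n", norm2);
    return 0;
}

Compiled as in the examples above (gcc -fopenmp, icc -openmp, or xlc_r -qsmp=omp on Magerit) the directives take effect; compiled without the switch the program is exactly the original serial code, which is the "incremental parallelization" the slide refers to.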