IBM BG/P Workshop
Lukas Arnold, Forschungszentrum Jülich, 14.-16.10.2009
contact: [email protected]

aim of this workshop contribution

- give a brief introduction to the IBM BG/P (software + hardware)
- guide intensively through two aspects
- spend most of the time with hands-on exercises

- this is not a complete reference talk, as many of those already exist
- aimed at HPC beginners

contents

- part I - Introduction to FZJ/BGP
  - systems at FZJ
  - IBM Blue Gene/P architecture overview
- part II - jugene Usage
  - compiler, submission system
  - hands-on: “Hallo (MPI) World!”
- part III - PowerPC 450
  - ASIC, internal structure, compiler optimization
  - hands-on: “Matrix-Matrix-Multiplication, a.k.a. dgemm”
- part IV - 3D Torus Network
  - torus network strategy, linkage and usage, DMA engine
  - hands-on: “Simple Hyperbolic Solver” and “communication and computation overlap”

PART I INTRODUCTION TO FZJ/BGP

Forschungszentrum Jülich (FZJ)

- one of the 15 Helmholtz Research Centers in Germany
- Europe’s largest multi-disciplinary research center
- area 2.2 km², 4400 employees, 1300 scientists

Jülich Supercomputing Center (JSC) @ FZJ

- operation of the supercomputers, user support, R&D work in the field of computer and computational science, education and training, 130 employees
- peer-reviewed provision of computer time to national and European computational science projects (NIC, John von Neumann Institute for Computing)

research fields of current projects

user support at JSC

simulation laboratories

systems @ JSC

jugene, just, hpc-ff, juropa

- total power consumption: 2.5 MW (jugene) + 0.3 MW (just) + 1.5 MW (hpc-ff + juropa) + 0.9 MW (cooling) ≈ 5 MW
- total performance: 1000 TF/s (jugene) + 300 TF/s (hpc-ff + juropa) ≈ 1300 TF/s = 1.3 PF/s
- total storage: 0.3 PB (Lustre-FS) + 2.2 PB (GPFS @ 34 GB/s) + 2.5 PB (archive) ≈ 5 PB

hpc-ff + juropa

- 3288 compute nodes in total
- 2 Intel Xeon X5570 (Nehalem-EP) quad-core processors per node
- 2.93 GHz and Hyperthreading
- 3 GB per physical core
- installed at JSC in April-June 2009
- 308 TFlop/s peak performance
- 274.8 TFlop/s LINPACK performance
- No. 10 in TOP500 of June 2009

jugene

- IBM BlueGene/P system
- 72 racks (294,912 cores)
- installed at JSC in April/May 2009
- 1 PFlop/s peak performance
- 825.5 TFlop/s LINPACK performance
- No. 3 in TOP500 of June 2009
- No. 1 system in Europe

jugene setup in 60 seconds

jugene building blocks

- chip: 4 processors, 13.6 GF/s
- compute card: 1 chip, 13.6 GF/s, 2.0 GB DDR2 (4.0 GB optional)
- node card: 32 chips (4x4x2), 32 compute cards, 0-2 I/O cards, 435 GF/s, 64 GB
- rack: 32 node cards, cabled 8x8x16, 13.9 TF/s, 2 TB
- jugene system: 72 racks, 72x32x32, 1 PF/s, 144 TB

BG/P compute and node card

- Blue Gene/P compute ASIC: 4 cores, 8 MB cache, Cu heatsink
- SDRAM-DDR2: 2 GB memory
- node card connector: network, power

BG/P in numbers

node properties:
- processors: 4x PowerPC 450
- processor frequency: 0.85 GHz
- coherency: SMP
- L3 cache size (shared): 8 MB
- main store: 2 GB
- main store bandwidth (1:2 pclk): 13.6 GB/s
- peak performance: 13.9 GF/node

torus network:
- bandwidth: 6 x 2 x 425 MB/s = 5.1 GB/s
- hardware latency (nearest neighbour): 100 ns (32 B packet), 800 ns (256 B packet)
- hardware latency (worst case): 3.2 µs (64 hops)

tree network:
- bandwidth: 2 x 0.85 GB/s = 1.7 GB/s
- hardware latency (worst case): 3.5 µs

system properties (72k nodes):
- area: 160 m²
- peak performance: ~1 PF
- total power: ~2.3 MW

system access

[diagram: system access - Blue Gene/P (73728 compute nodes, 600 I/O nodes), control system / service node (DB2), front-end nodes (SSH login, mpirun), fileserver JUST (RAID)]

system access (cont.)

- compute nodes: dedicated to running the user application and almost nothing else - a simple compute node kernel (CNK)
- I/O nodes: run Linux and provide a more complete range of OS services - files, sockets, process launch, signalling, debugging, and termination
- service node: performs system management services (e.g., partitioning, heartbeating, error monitoring) - transparent to application software

BG/P compute node software

- Compute Node Kernel (CNK)
  - minimal kernel
  - handles signals; function-ships system calls to the I/O nodes; starts/stops jobs and threads
  - not much else
  - very “Linux-like”, uses glibc
  - missing some system calls (mostly fork())
  - limited support for mmap(), execve()
  - but most apps that run on Linux work out of the box on BG/P

BG/P I/O node software

- I/O Node Kernel, Mini-Control Program (MCP)
- Linux
  - port of the Linux kernel, GPL/LGPL licensed
  - Linux version 2.6.16
  - very minimal distribution
- only connection from compute nodes to the outside world
- handles syscalls (e.g. fopen()) and I/O requests
- file system support: NFS, PVFS, GPFS, Lustre FS

BG/P networks

- 3D torus network
  - only for point-to-point communication between compute nodes
  - hardware latency: 0.5-5 µs; MPI latency: 3-10 µs
  - bandwidth: 6 x 2 x 425 MB/s = 5.1 GB/s (per compute node)
  - direct memory access (DMA) unit; communication and computation overlap
- collective network
  - one-to-all and reduction functionality (compute and I/O nodes)
  - one-way tree traversal latency: 1.3 µs; MPI: 5 µs
  - bandwidth: 850 MB/s per link

BG/P networks (cont.)

- barrier network
  - hardware latency for full system: 0.65 µs; MPI: 1.6 µs
- 10 Gb network
  - I/O nodes only
  - file I/O, all external communication
- 1 Gb network
  - control network (boot, debug, monitor)
  - compute and I/O nodes

BG/P architectural features

- low area footprint (4k cores per rack)
- high energy efficiency (2.5 kW per 1 TF/s)
- no network hierarchy, scalable up to the full system
- easy programming based on MPI
- high reliability
- balanced system

comparison to other architectures (approximation)

- core LINPACK performance
  - BG/P: 3 GF/s
  - XT5/PWR6/x86: 7 / 12.5 / 12 GF/s
- triad memory bandwidth per core [related to GF/s]
  - BG/P: 4.4 GB/s [1.5 byte/flop]
  - XT5/PWR6/x86: 2.5 / 3.3 / (8) GB/s [0.3 / 0.25 / 0.7]
- all-to-all performance, two nodes [related to GF/s]
  - BG/P: 1 GB/s [0.08 byte/flop]
  - XT5/PWR6/x86: 3 / 3 / 2 GB/s [0.05 / 0.004 / 0.01]
- energy efficiency
  - BG/P: 300 MF/J
  - XT5/PWR6/x86: 150 / 85 / 200 MF/J

BG/P cons

- only 512 MB memory per core
- low core performance; 5 to 10 times more cores needed (compared to today's general-purpose CPUs)
- torus network might not perform well for unstructured communication patterns
- cross compilation
- CNK (compute node kernel) is not a full Linux system

application scaling example

[plot: PEPC performance - time in inner loop [s] vs. number of cores (512-4096) for IBM BG/P (jugene), Intel Nehalem (juropa), Cray XT5 (louhi), IBM Power6 (huygens)]

application scaling example (cont.)

[plot: PEPC performance - time in inner loop [s] vs. partition performance [TF/s] for the same four systems]

practical information

- contact me (now or tomorrow) for a private key
- accounts will be valid until 18.10.2009
- common passphrase: (WS-kra09)
- make sure you are able to log in on jugene (the ssh command is given on the login slide in part II)
- have a brief look at our documentation and user info: http://www.fz-juelich.de/jsc/jugene/
- you will be able to submit jobs on 16./17.10.2009

PART II JUGENE USAGE

login

- use the individually distributed private key
- login via: ssh -i ssh_key [email protected]
- automatically distributed to two different login nodes: jugene3 and jugene4

see: http://www.fz-juelich.de/jsc/jugene/usage/logon/

available compilers

- need to cross-compile

- compilers for the front end (Power6) only
  - GNU: gcc, gfortran, ...
  - IBM XL: xlc, xlf90, ...
- and for jugene (PowerPC 450), with MPI wrappers
  - GNU: mpicc, mpif90, ...
  - IBM XL: mpixlc, mpixlf90, ...
- thread-safe versions available (*_r)

FZJ info: http://www.fz-juelich.de/jsc/jugene/usage/tuning/
IBM XL documentation: http://publib.boulder.ibm.com/infocenter/compbgpl/v9v111/index.jsp
BG/P redbook: http://www.fz-juelich.de/jsc/datapool/jugene/bgp_appl_sg247287_V1.4.pdf

XL compiler options (optimization)

- -O2
  - default optimization level
  - eliminates redundant code
  - basic loop optimization
  - can structure code to take advantage of -qarch and -qtune settings
- -O3
  - in-depth memory access analysis
  - better loop scheduling
  - high-order loop analysis and transformations
  - inlining of small procedures within a compilation unit by default
  - pointer aliasing improvements to enhance other optimizations
  - ...

XL compiler options (optimization, cont.)

- -O4
  - propagation of global and parameter values between compilation units
  - inlining of code from one compilation unit to another
  - reorganization or elimination of global data structures
  - an increase in the precision of aliasing analysis
  - equal to -O3 -qipa -qhot -qarch=auto -qtune=auto -qcache=auto
- -O5
  - most aggressive optimizations available
  - makes full use of loop optimizations and IPA
  - equal to -O4 -qipa=level=2
- for more information, search the compiler documentation for "optimization": http://publib.boulder.ibm.com/infocenter/compbgpl/v9v111/index.jsp

useful XL compiler options

- additional compiler output
  - -qreport, generates *.lst report files
  - -qxflag=diagnostic, prints simdization information to stdout
- PowerPC 450 specific instructions
  - -qtune=450
  - -qarch=450 (single FPU)
  - -qarch=450d (double FPU)

LoadLeveler

- batch submission system on jugene
- main LL commands:
  - llq [-u username], lists jobs [of user]
  - llqx, detailed information on queue status
  - llsubmit file, submits the job in file
  - llcancel jobid, cancels the job with jobid
- FZJ extensions:
  - llview, GUI for system status
  - llrun, interactive launch on the interactive partition, e.g.:
    llrun [-np #ranks] [-mode nodemode] execname execargs
    (llrun -h provides the complete list)

batch job scripts

- file structure
  - LL variable lines: # @ name = value
  - last LL line: # @ queue
  - remaining part: shell script to be executed
- start the parallel job with mpirun; main arguments:
  - -np #ranks, start job with #ranks MPI ranks
  - -mode SMP,DUAL,VN, set the execution mode
  - -env NAME=VALUE, pass environment variables
  - -verbose 0,1,2,3,4, verbosity level
  - -h, extended help (run at the front-end)

batch job scripts (cont.)

- main jugene-specific LoadLeveler variables
  - BG_CONNECTION, sets the topology
  - BG_SIZE, requested number of nodes
  - BG_SHAPE, requested torus shape in midplanes
  - BG_ROTATE, allow the requested shape to be rotated
  - RES_ID, specifies the reservation to run on

for more keywords, see: http://www.fz-juelich.de/jsc/jugene/usage/loadl/keywords/

batch job scripts (example)

# @ job_name = LoadL_Sample_1
# @ comment = "BGP Job by Size"
# @ error = $(job_name).$(jobid).out
# @ output = $(job_name).$(jobid).out
# @ environment = COPY_ALL;
# @ wall_clock_limit = 00:20:00
# @ notification = error
# @ notify_user = [email protected]
# @ job_type = bluegene
# @ bg_size = 32
# @ queue
mpirun -exe myprogp.rts -mode VN -np 128 -verbose 2 -args "-t 1"

for more examples, see: http://www.fz-juelich.de/jsc/jugene/usage/loadl/jobfile/

scheduling policy

job class   max wall time   base partition
m144        24h             on demand only
m128        24h             on demand only
m064        24h             all
m032        24h             all
m016        24h             R5-R8
m008        24h             R5-R8
m004        24h             R5-R8
m002        24h             R5-R8
m001        24h             R5-R8
small       30 min          R87
nocsmall    30 min          R87

llview (written by Wolfgang Frings)

- available on all FZJ machines
- current partitioning
- queue preview
- usage: llview

how to get computing time

- applications are accepted twice a year by the John von Neumann Institute for Computing (NIC)
- project run time is one year
- steering committee
- scaling approval for jugene needed

see the NIC page for more information: http://www.fz-juelich.de/nic/index-e.html

hands-on examples repository

- all examples are located in the subversion repository (username + password: guest); please update every day

https://svn.version.fz-juelich.de/sc-geegaw/bgp-workshop-krakow-2009

- contains this talk, the exercise descriptions and example code:
  - "hello MPI world" (mpi-, bgp-personality-version)
  - "dgemm test"
  - "comm./comp. overlap test" (base-, full-version)
  - "simple hyperbolic solver" (base-, simd-, mpi_cart-, mapping-, libhpm-, scalasca-version)
- use the following command to check out (see workshop-repository.txt):
  $> svn co --username guest REPOSITORY

exercise 1: “Hallo (MPI) World!”

- thursday
  - use Fortran/C/C++ and MPI, preferably C
  - use the MPI compiler wrappers
  - try out the interactive and batch queuing
  - (a minimal starting point is sketched below)
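As a minimal starting point for the thursday part, an illustrative sketch (not the reference solution from the repository):

    /* hello.c - minimal MPI hello world */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this task's rank */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of ranks */

        printf("Hallo MPI world from rank %d of %d\n", rank, size);

        MPI_Finalize();
        return 0;
    }

compile with an MPI wrapper and run interactively, e.g.: mpixlc -o hello hello.c; llrun -np 4 -mode VN hello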

- friday
  - print the position on the torus network (see BGP personality; C only)
  - observe the MPI mapping (BG_MAPPING)
  - see redbook appendix A for node hardware naming

PART III THE POWERPC 450 DFPU

PowerPC 450 schematics

PowerPC 450 chip

- IBM Cu-08 90 nm CMOS ASIC process technology
- die size 173 mm²
- clock frequency 850 MHz
- transistor count 208M
- power dissipation 16 W

execution modes

- Virtual Node (VN) mode: 4 MPI tasks, 1 thread / task
- DUAL mode: 2 MPI tasks, 1-2 threads / task
- SMP mode: 1 MPI task, 1-4 threads / task

usage: [llrun,mpirun] -mode [VN,DUAL,SMP]

double FPU

- SIMD instructions over both register files
- floating-point multiply-add (FMA) operations on double precision data
- more general operations available with cross and replicated operands
- useful for complex arithmetic, matrix multiply, FFT
- not two independent units
- parallel (quadword) loads/stores
  - fastest way to transfer data between processors and memory
  - data needs to be 16-byte aligned
  - load/store with swap order available

simdization criteria

- data must be 16-byte aligned
- loop boundaries must be known at compile time
- no function calls inside the loop
- no branches inside the loop
- disjoint pointers
- high trip count
- stride-one access
- (a loop meeting all criteria is sketched below)
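For illustration, a made-up loop that satisfies all of the above criteria (global arrays are already 16-byte aligned by the XL compiler, see the alignment-hints slide):

    #define N 1024

    /* global arrays: aligned by the compiler */
    double a[N], b[N], c[N];

    void triad(void)
    {
        int i;
        /* compile-time trip count, stride-one access,
           no calls or branches in the loop body */
        for (i = 0; i < N; i++)
            a[i] = b[i] + 2.0 * c[i];
    }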

compiler options for simdization

- SIMD instructions are only generated with the -qarch=450d and -qhot=simd compiler options
- at least -O2 optimization level; -O4 contains both options above
- use -qreport and/or -qxflag=diagnostic for additional information on SIMD instrumentation

data alignment hints

- already aligned structures (IBM XL)
  - heap and global variables
  - not true for common blocks in Fortran
- tell the compiler that arrays are aligned:
  - __alignx in C; ALIGNX in Fortran
  - check/make sure they really are
- use disjoint pointers in C
  - restrict keyword
  - #pragma disjoint

see: section 8.12.3 in the BG/P redbook, and the sketch below
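A sketch combining these hints (function and variable names are made up; see section 8.12.3 of the redbook for the authoritative examples):

    void scale(double *x, double *y, int n)
    {
    #pragma disjoint(*x, *y)   /* promise: x and y never overlap */
        int i;

        __alignx(16, x);       /* promise: both arrays are 16-byte aligned */
        __alignx(16, y);

        for (i = 0; i < n; i++)
            y[i] = 2.0 * x[i];
    }

Note that these hints are promises to the compiler: if the data is in fact unaligned or aliased, the program may silently compute wrong results.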

compiler simd hints, example

- use -qreport to generate output to *.lst; it contains pseudo-code and simd info
- simdization done:
  ...
  1586-534 (I) Loop (loop index 2) at dgemmtest.f was not SIMD vectorized because the loop is not the innermost loop.
  1586-538 (I) Loop (loop index 2) at dgemmtest.f was not SIMD vectorized because it contains unsupported loop structure.
  1586-542 (I) Loop (loop index 3 with nest-level 1 and iteration count 200) at dgemmtest.f was SIMD vectorized.
  1586-534 (I) Loop (loop index 4) at dgemmtest.f was not SIMD vectorized because the loop is not the innermost loop.
  1586-542 (I) Loop (loop index 7 with nest-level 1 and iteration count 200) at dgemmtest.f was SIMD vectorized.
  1586-543 (I) Total number of the innermost loops considered <"2">. Total number of the innermost loops SIMD vectorized <"2">.
  ...
- simdization prevented by unaligned memory:
  ...
  1586-534 (I) Loop (loop index 2) at dgemmtest.f was not SIMD vectorized because the loop is not the innermost loop.
  1586-538 (I) Loop (loop index 2) at dgemmtest.f was not SIMD vectorized because it contains unsupported loop structure.
  1586-550 (I) Loop (loop index 3) at dgemmtest.f was not SIMD vectorized because it is not profitable to vectorize.
  1586-536 (I) Loop (loop index 3) at dgemmtest.f was not SIMD vectorized because it contains memory references ((char *)&data + -1608 + (1600)*($.CIVD * 2 + 2) + (8)*($.CIVC + 1)) with non-vectorizable alignment.
  ...

exercise 2: matrix-matrix-multiplication

- consider two cases
  - the not-matrix-matrix-multiplication:
    $c_{i,j} \leftarrow c_{i,j} + \sum_k a_{k,i} \, b_{k,j}$
  - the correct matrix-matrix-multiplication:
    $c_{i,j} \leftarrow c_{i,j} + \sum_k a_{i,k} \, b_{k,j}$
- follow the optimization instructions in exercise_2.txt
- (both variants are written out in C below)
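The exercise code itself is Fortran (dgemmtest.f), but written out in C the two variants look as follows (an illustrative sketch; array size and names are made up):

    #define N 200
    double a[N][N], b[N][N], c[N][N];

    /* not-matrix-matrix-multiplication: c_ij += sum_k a_ki * b_kj;
       note the transposed access to a */
    void not_mmm(void)
    {
        int i, j, k;
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                for (k = 0; k < N; k++)
                    c[i][j] += a[k][i] * b[k][j];
    }

    /* correct matrix-matrix-multiplication: c_ij += sum_k a_ik * b_kj */
    void mmm(void)
    {
        int i, j, k;
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                for (k = 0; k < N; k++)
                    c[i][j] += a[i][k] * b[k][j];
    }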

pseudo-dgemm, performance

- not a matrix-matrix multiplication
- performance in MF/s
- run in SMP mode

                  -qarch=450   -qarch=450d
  -O2                     30            30
  -O3 -qstrict           485           511
  -O3 -qhot              526           700
  -O4                    488           694
  -O5                    530           695

dgemm, performance

- performance in MF/s
- run in SMP mode

                  -qarch=450d
  -O2                     130
  -O3 -qstrict            344
  -O3 -qhot               380
  -O4                     380
  -O5                     380
  -O5 -qessl             2530

PART IV THE TORUS NETWORK

BG/P messaging framework

- MPI 2.1 standard, see http://www.mpi-forum.org/mpi2_1/index.htm
- derived from MPICH-2, see http://www.mcs.anl.gov/research/projects/mpich2/
- deep computing messaging framework (DCMF), see the doxygen documentation at http://dcmf.anl-external.org/wiki

BG/P networks

- 3D torus network
  - only for point-to-point communication between compute nodes
  - hardware latency: 0.5-5 µs; MPI latency: 3-10 µs
  - bandwidth: 6 x 2 x 425 MB/s = 5.1 GB/s (per compute node)
  - direct memory access (DMA) unit; communication and computation overlap
- collective network
  - one-to-all and reduction functionality (compute and I/O nodes)
  - one-way tree traversal latency: 1.3 µs; MPI: 5 µs
  - bandwidth: 850 MB/s per link

BG/P networks (cont.)

- barrier network
  - hardware latency for full system: 0.65 µs; MPI: 1.6 µs
- 10 Gb network
  - I/O nodes only
  - file I/O, all external communication
- 1 Gb network
  - control network (boot, debug, monitor)
  - compute and I/O nodes

torus network constraints

- exclusive cable usage by an application
  - always the same maximal network performance
  - restrictions for system partitioning
- cable lengths must be nearly equal
  - unintuitive partitioning
  - modularly extendable
- the torus network is used only for point-to-point communication

1D torus network

- each node has 6 bidirectional connections
- simple 1D torus, a.k.a. a ring

1D torus network (cont.)

- torus cabling strategy

3D torus cabling

- same strategy as in 1D
- the midplane is the elementary network unit in BG/P (512 nodes)
- shape is measured in midplanes

Z-links in the BG/P torus

- connect four midplanes in the z-direction
- jugene row: 16 midplanes, 8 racks, 4 z-linked groups

Y-linkage in the BG/P torus

- connects the four z-linked groups

partitioning example of a jugene row

- exclusive usage of cabling limits the possible partitions

X-linkage of the BG/P torus

- X-cable (top) and X-split-cable (bottom)

full jugene torus linkage (Z- and Y-links)

full jugene torus linkage (X-links)

full jugene torus linkage (X-split-links)

two jugene rows partitioning example

get your own position in the torus

- use the bg_personality struct
- main functions
  - data struct: _BGP_Personality_t pers
  - get the data: Kernel_GetPersonality(&pers, sizeof(pers))
  - torus extension in x: xsize = pers.Network_Config.Xnodes
  - x coordinate: torus_x = pers.Network_Config.Xcoord
- (a sketch putting these together follows below)

see: appendix B in the BG/P redbook
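Putting the personality calls together, a minimal sketch (the header names follow the BG/P redbook and should be verified there; the y/z field names are assumed analogous to the x fields shown above):

    #include <stdio.h>
    #include <spi/kernel_interface.h>     /* assumption: Kernel_GetPersonality() */
    #include <common/bgp_personality.h>   /* assumption: _BGP_Personality_t */

    void print_torus_position(int rank)
    {
        _BGP_Personality_t pers;

        Kernel_GetPersonality(&pers, sizeof(pers));

        printf("rank %d at torus position (%d,%d,%d) in a (%d,%d,%d) torus\n",
               rank,
               pers.Network_Config.Xcoord, pers.Network_Config.Ycoord,
               pers.Network_Config.Zcoord,
               pers.Network_Config.Xnodes, pers.Network_Config.Ynodes,
               pers.Network_Config.Znodes);
    }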

MPI rank mapping

- use the BG_MAPPING environment variable
  - usage: mpirun -env BG_MAPPING=[TXYZ,XYZT,...]
  - XYZT: first place one MPI rank on each node, then start distributing the second MPI rank, and so on; the default mapping
  - TXYZ: first fill the first node (4 ranks in VN mode), then proceed to the next node
- explicit map file (a tiny example is shown below)
  - usage: mpirun -mapfile filename
  - each line contains the torus (4D: x y z t) coordinates of one rank; line 1 sets the coordinates of rank 0, line 2 of rank 1, ...
  - must specify BG_SHAPE=XxYxZ and BG_ROTATE=FALSE
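A tiny, entirely made-up map file: the following four lines place ranks 0-3 along the x-axis of the torus, one rank per node, when started via mpirun -mapfile mymap together with BG_SHAPE and BG_ROTATE=FALSE as noted above.

    0 0 0 0
    1 0 0 0
    2 0 0 0
    3 0 0 0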

mapping of a 1d system (slithering snake)

mapping of a 2d system (folded paper)

libhpc (former libhpm)

- instrument code for hardware counters
- HPCT dir: /bgsys/local/hpct
- main routines (see example_4.txt; a sketch follows below)
  - initialize
  - start measurement
  - end measurement
  - finalize
- include the header file (libhpc.h): -I$(HPCT)/include
- link with the following libraries:
  $(HPCT)/lib/libhpc.a $(HPCT)/lib/fake_dlfcn.o $(HPCT)/lib/average.o $(HPCT)/lib/liblicense.a
- user guide: $(HPCT)/doc/hpm/HPM_ug.pdf
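A sketch of an instrumented region; the routine names follow the older libhpm convention (hpmInit/hpmStart/hpmStop/hpmTerminate) and should be checked against the user guide, since the libhpc interface may differ:

    #include "libhpc.h"

    void instrumented_region(int rank)
    {
        hpmInit(rank, "shs");          /* initialize */
        hpmStart(1, "integrator");     /* start measurement of section 1 */

        /* ... the floating-point intensive part goes here ... */

        hpmStop(1);                    /* end measurement */
        hpmTerminate(rank);            /* finalize and write the counter report */
    }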

scalasca

- code instrumentation tool, including visualization
- three steps to analysis
  - (load the scalasca module: module load scalasca)
  - instrument: put skin before the compiler/linker command
  - run experiment: put scan before the mpirun/llrun command
  - analyse experiment: square EPIKFOLDER
- documentation
  - quick guide: http://www.fz-juelich.de/jsc/datapool/scalasca/QuickReference.pdf
  - full user guide: http://www.fz-juelich.de/jsc/datapool/scalasca/UserGuide.pdf
  - web page: www.scalasca.org

exercise 3: "overlap comm. and comp."

- very simple structure
  - even ranks send to odd ranks
  - non-blocking: send/receive, compute, wait
  - blocking: send/receive, wait, compute
- use the DCMF_INTERRUPT variable, set it to 1 (default is 0)
- overlap might be defined as

  overlap = 1 - [t(nonblocking) - t(work)] / [t(blocking) - t(work)]

- (the non-blocking pattern is sketched below)
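A sketch of the non-blocking variant (buffer size and the compute() routine are placeholders; assumes an even number of ranks):

    #include <mpi.h>

    #define N 100000
    static double sendbuf[N], recvbuf[N];

    extern void compute(void);   /* placeholder for the work phase */

    void exchange_nonblocking(int rank)
    {
        MPI_Request req[2];
        /* pair even rank 2i with odd rank 2i+1 */
        int partner = (rank % 2 == 0) ? rank + 1 : rank - 1;

        /* post receive and send first ... */
        MPI_Irecv(recvbuf, N, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &req[0]);
        MPI_Isend(sendbuf, N, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &req[1]);

        /* ... compute while the DMA engine moves the data ... */
        compute();

        /* ... and wait at the end */
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    }

Run once with mpirun -env DCMF_INTERRUPT=1 and once with the default to see the effect on the measured overlap.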

results for exercise 3, "overlap"

[plot: measured overlap (-0.1 to 1) vs. the ratio of communication to computation time]

simple hyperbolic solver (shs)

- solves the Euler equations
- simple one-step explicit, finite difference scheme
- problem setup:
  - 2D computational domain
  - each MPI rank has a local grid of size nx times ny
  - weak scaling
- communication
  - only neighbour communication
  - px times py rank distribution

shs (cont.)

- 2D domain decomposition

[diagram: px x py MPI ranks, each with a local domain of nx x ny grid points; boundary MPI ranks at the domain edges]

shs implementation

- files of minor interest
  - boundary.[ch], output.[ch], defs.c
- compilation
  - makefile, no need for modification
  - makefile.defs, adapt the compiler options
- the main loop is in shs.c
- MPI rank to physical partition mapping
  - mpi_distribution() in setup.c
- integrator, a.k.a. the floating-point operation part
  - just modify calculateFluxes() in integrator.c
- parameter file: param
- usage: shs param

exercise 4 - shs

- compile and run the example
- scaling: does the code scale up to 128 cores?
- single core performance: get the compiler to use SIMD instructions
- network performance
  - naive MPI mapping
  - MPI_Cart_create (a sketch follows below)
  - own mapping file
- measure the performance using libhpc
- use scalasca to demonstrate the torus benefits
- shs parameters:
  - problem size: nx = ny = 1000; nt = 10
  - partitioning: px, py need to be adapted to the partition size
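For the network-performance part, a sketch of creating the 2D cartesian communicator (px and py are the shs partitioning parameters from above; reorder = 1 lets MPI place the ranks to fit the torus):

    #include <mpi.h>

    MPI_Comm create_cart(int px, int py)
    {
        MPI_Comm cart;
        int dims[2]    = { px, py };
        int periods[2] = { 1, 1 };   /* periodic boundaries in both directions */

        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart);
        return cart;
    }

The neighbour ranks for the halo exchange then come from MPI_Cart_shift on this communicator.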

shs, single core performance

- performance (setFluxes function, MF/s): impact of execution mode and optimization level

                                   VN   DUAL   SMP
  -O2                             155    155   155
  -O3                             180    180   180
  -O3 -qhot=simd -qarch=450d      180    460   460

- note: shs is memory-bound, no memory locality

shs, network performance

- px=16; py=128; time per communication step in ms

                              MESH (XYZT)   TORUS (XYZT)   MESH (TXYZ)   TORUS (TXYZ)
  naive                              1.9            1.1           0.62           0.53
  MPI_Cart                          0.59           0.28            1.1           0.68
  mapfile                           0.33           0.24           0.33           0.24
  not well fitting mapfile          0.63           0.40           0.63           0.40

- benefits of the order of 5, even for a small torus
- in general, MPI 2D cartesian communicators are not bad

shs, network performance, mapping of the MPI 2D cartesian communicator

[plot: communication time [s] (0 to 0.0025) vs. problem partitioning px/py (1/2048 to 32/64) for MESH (TXYZ), MESH (XYZT), TORUS (TXYZ), TORUS (XYZT)]

shs scalasca instrumentation, MESH network

shs scalasca instrumentation, TORUS network

shs, libhpc instrumentation

- shs internal performance estimation (nt=10, i.e. total FLOPs are 271 GF)

  * approx. total (per step) FLOPs: 2.71e-02 [GF], performance: 8.31e-02 GF/s

- libhpc output

  Total floating point operations : 352.947 M
  Algebraic / user time : 93.318 Mflop/s

- the main difference is due to the unequal measurement scope
  - internal: just the integrator part, i.e. without the boundary
  - libhpc: the full internal loop, i.e. integrator and boundary treatment
