IBM BG/P Workshop
Lukas Arnold, Forschungszentrum Jülich, 14.-16.10.2009
contact: [email protected]

aim of this workshop contribution

- give a brief introduction to the IBM BG/P (software + hardware)
- guide intensively through two aspects
- spend most of the time with hands-on exercises

- this is not a complete reference talk, as many of those already exist
- aimed at HPC beginners

contents

- part I - Introduction to FZJ/BGP
  - systems at FZJ
  - IBM Blue Gene/P architecture overview
- part II - jugene Usage
  - compiler, submission system
  - hands-on: “Hallo (MPI) World!”
- part III - PowerPC 450
  - ASIC, internal structure, compiler optimization
  - hands-on: “Matrix-Matrix-Multiplication, a.k.a. dgemm”
- part IV - 3D Torus Network
  - torus network strategy, linkage and usage, DMA engine
  - hands-on: “Simple Hyperbolic Solver” and “communication and computation overlap”

PART I INTRODUCTION TO FZJ/BGP

Forschungszentrum Jülich (FZJ)

- one of the 15 Helmholtz Research Centers in Germany
- Europe’s largest multi-disciplinary research center
- area 2.2 km², 4400 employees, 1300 scientists

Jülich Supercomputing Center (JSC) @ FZJ

- operation of the supercomputers, user support, R&D work in the field of computer and computational science, education and training, 130 employees
- peer-reviewed provision of computer time to national and European computational science projects (NIC, John von Neumann Institute for Computing)

research fields of current projects

user support at JSC

simulation laboratories

systems @ JSC

jugene, just, hpc-ff, juropa

- total power consumption: 2.5 MW (jugene) + 0.3 MW (just) + 1.5 MW (hpc-ff + juropa) + 0.9 MW (cooling) ≈ 5 MW
- total performance: 1000 TF/s (jugene) + 300 TF/s (hpc-ff + juropa) ≈ 1300 TF/s = 1.3 PF/s
- total storage: 0.3 PB (Lustre-FS) + 2.2 PB (GPFS @ 34 GB/s) + 2.5 PB (archive) ≈ 5 PB

hpc-ff + juropa

- 3288 compute nodes in total
- 2 Intel Xeon X5570 (Nehalem-EP) quad-core processors per node
- 2.93 GHz and Hyperthreading
- 3 GB per physical core
- installed at JSC in April-June 2009
- 308 TFlop/s peak performance
- 274.8 TFlop/s LINPACK performance
- No. 10 in TOP500 of June 2009

jugene

- IBM BlueGene/P system
- 72 racks (294,912 cores)
- installed at JSC in April/May 2009
- 1 PFlop/s peak performance
- 825.5 TFlop/s LINPACK performance
- No. 3 in TOP500 of June 2009
- No. 1 system in Europe

jugene setup in 60 seconds

jugene building blocks

- chip: 4 processors, 13.6 GF/s
- compute card: 1 chip, 13.6 GF/s, 2.0 GB DDR2 (4.0 GB optional)
- node card: 32 chips (4x4x2), 32 compute cards, 0-2 I/O cards, 435 GF/s, 64 GB
- rack: 32 node cards, cabled 8x8x16, 13.9 TF/s, 2 TB
- jugene system: 72 racks, 72x32x32, 1 PF/s, 144 TB

BG/P compute and node card

- Blue Gene/P compute ASIC: 4 cores, 8 MB cache, Cu heatsink
- SDRAM-DDR2: 2 GB memory
- node card connector: network, power

BG/P in numbers

node properties:
- processors: 4x PowerPC 450
- processor frequency: 0.85 GHz
- coherency: SMP
- L3 cache size (shared): 8 MB
- main store: 2 GB
- main store bandwidth (1:2 pclk): 13.6 GB/s
- peak performance: 13.9 GF/node

torus network:
- bandwidth: 6 x 2 x 425 MB/s = 5.1 GB/s
- hardware latency (nearest neighbour): 100 ns (32 B packet), 800 ns (256 B packet)
- hardware latency (worst case): 3.2 µs (64 hops)

tree network:
- bandwidth: 2 x 0.85 GB/s = 1.7 GB/s
- hardware latency (worst case): 3.5 µs

system properties (72k nodes):
- area: 160 m²
- peak performance: ~1 PF
- total power: ~2.3 MW

system access

[diagram: system access - Blue Gene/P (73728 compute nodes, 600 I/O nodes), control system / service node (DB2), front-end nodes (SSH login, mpirun), fileserver JUST (RAID)]

system access (cont.)

- compute nodes: dedicated to running the user application and almost nothing else - a simple compute node kernel (CNK)
- I/O nodes: run Linux and provide a more complete range of OS services - files, sockets, process launch, signalling, debugging, and termination
- service node: performs system management services (e.g., partitioning, heartbeating, error monitoring) - transparent to application software

BG/P compute node software

- Compute Node Kernel (CNK)
  - minimal kernel
  - handles signals; function-ships system calls to the I/O nodes; starts/stops jobs and threads
  - not much else
  - very “Linux-like”, uses glibc
  - missing some system calls (mostly fork())
  - limited support for mmap(), execve()
  - but most apps that run on Linux work out of the box on BG/P

BG/P I/O node software

- I/O Node Kernel, Mini-Control Program (MCP)
- Linux
  - port of the Linux kernel, GPL/LGPL licensed
  - Linux version 2.6.16
  - very minimal distribution
- only connection from compute nodes to the outside world
- handles syscalls (e.g. fopen()) and I/O requests
- file system support: NFS, PVFS, GPFS, Lustre FS

BG/P networks

- 3D torus network
  - only for point-to-point communication between compute nodes
  - hardware latency: 0.5-5 µs; MPI latency: 3-10 µs
  - bandwidth: 6 x 2 x 425 MB/s = 5.1 GB/s (per compute node)
  - direct memory access (DMA) unit; communication and computation overlap
- collective network
  - one-to-all and reduction functionality (compute and I/O nodes)
  - one-way tree traversal latency: 1.3 µs; MPI: 5 µs
  - bandwidth: 850 MB/s per link

BG/P networks (cont.)

- barrier network
  - hardware latency for full system: 0.65 µs; MPI: 1.6 µs
- 10 Gb network
  - I/O nodes only
  - file I/O, all external communication
- 1 Gb network
  - control network (boot, debug, monitor)
  - compute and I/O nodes

BG/P architectural features

- low area footprint (4k cores per rack)
- high energy efficiency (2.5 kW per 1 TF/s)
- no network hierarchy, scalable up to the full system
- easy programming based on MPI
- high reliability
- balanced system

comparison to other architectures (approximation)

- core LINPACK performance
  - BG/P: 3 GF/s
  - XT5/PWR6/x86: 7 / 12.5 / 12 GF/s
- triad memory bandwidth per core [related to GF/s]
  - BG/P: 4.4 GB/s [1.5 byte/flop]
  - XT5/PWR6/x86: 2.5 / 3.3 / (8) GB/s [0.3 / 0.25 / 0.7]
- all-to-all performance, two nodes [related to GF/s]
  - BG/P: 1 GB/s [0.08 byte/flop]
  - XT5/PWR6/x86: 3 / 3 / 2 GB/s [0.05 / 0.004 / 0.01]
- energy efficiency
  - BG/P: 300 MF/J
  - XT5/PWR6/x86: 150 / 85 / 200 MF/J

BG/P cons

- only 512 MB memory per core
- low core performance; 5 to 10 times more cores needed (compared to today's general-purpose CPUs)
- torus network might not perform well for unstructured communication patterns
- cross compilation
- CNK (compute node kernel) is not a full Linux system

application scaling example

[plot: PEPC performance - time in inner loop [s] vs. number of cores (512-4096) for IBM BG/P (jugene), Intel Nehalem (juropa), Cray XT5 (louhi), IBM Power6 (huygens)]

application scaling example (cont.)

[plot: PEPC performance - time in inner loop [s] vs. partition performance [TF/s] for the same four systems]

practical information

- contact me (now or tomorrow) for a private key
- accounts will be valid until 18.10.2009
- common passphrase: (WS-kra09)
- make sure you are able to log in on jugene (the ssh command is given on the login slide in part II)
- have a brief look at our documentation and user info: http://www.fz-juelich.de/jsc/jugene/
- you will be able to submit jobs on 16./17.10.2009

PART II JUGENE USAGE

login

- use the individually distributed private key
- login via: ssh -i ssh_key [email protected]
- automatically distributed to two different login nodes: jugene3 and jugene4

see: http://www.fz-juelich.de/jsc/jugene/usage/logon/

available compilers

- need to cross-compile

- compilers for the front end (Power6) only
  - GNU: gcc, gfortran, ...
  - IBM XL: xlc, xlf90, ...
- and for jugene (PowerPC 450), with MPI wrappers
  - GNU: mpicc, mpif90, ...
  - IBM XL: mpixlc, mpixlf90, ...
- thread-safe versions available (*_r)

FZJ info: http://www.fz-juelich.de/jsc/jugene/usage/tuning/
IBM XL documentation: http://publib.boulder.ibm.com/infocenter/compbgpl/v9v111/index.jsp
BG/P redbook: http://www.fz-juelich.de/jsc/datapool/jugene/bgp_appl_sg247287_V1.4.pdf

XL compiler options (optimization)

- -O2
  - default optimization level
  - eliminates redundant code
  - basic loop optimization
  - can structure code to take advantage of -qarch and -qtune settings
- -O3
  - in-depth memory access analysis
  - better loop scheduling
  - high-order loop analysis and transformations
  - inlining of small procedures within a compilation unit by default
  - pointer aliasing improvements to enhance other optimizations
  - ...

XL compiler options (optimization, cont.)

- -O4
  - propagation of global and parameter values between compilation units
  - inlining of code from one compilation unit to another
  - reorganization or elimination of global data structures
  - an increase in the precision of aliasing analysis
  - equal to -O3 -qipa -qhot -qarch=auto -qtune=auto -qcache=auto
- -O5
  - most aggressive optimizations available
  - makes full use of loop optimizations and IPA
  - equal to -O4 -qipa=level=2
- for more information, search the compiler documentation for "optimization": http://publib.boulder.ibm.com/infocenter/compbgpl/v9v111/index.jsp

useful XL compiler options

- additional compiler output
  - -qreport, generates *.lst report files
  - -qxflag=diagnostic, prints simdization information to stdout
- PowerPC 450 specific instructions
  - -qtune=450
  - -qarch=450 (single FPU)
  - -qarch=450d (double FPU)

LoadLeveler

- batch submission system on jugene
- main LL commands:
  - llq [-u username], lists jobs [of user]
  - llqx, detailed information on queue status
  - llsubmit file, submits the job in file
  - llcancel jobid, cancels the job with jobid
- FZJ extensions:
  - llview, GUI for system status
  - llrun, interactive launch on the interactive partition, e.g.:
    llrun [-np #ranks] [-mode nodemode] execname execargs
    (llrun -h provides the complete list)

batch job scripts

- file structure
  - LL variable lines: # @ name = value
  - last LL line: # @ queue
  - remaining part: shell script to be executed
- start the parallel job with mpirun; main arguments:
  - -np #ranks, start job with #ranks MPI ranks
  - -mode SMP,DUAL,VN, set the execution mode
  - -env NAME=VALUE, pass environment variables
  - -verbose 0,1,2,3,4, verbosity level
  - -h, extended help (run at the front-end)

batch job scripts (cont.)

- main jugene-specific LoadLeveler variables
  - BG_CONNECTION, sets the topology
  - BG_SIZE, requested number of nodes
  - BG_SHAPE, requested torus shape in midplanes
  - BG_ROTATE, allow the requested shape to be rotated
  - RES_ID, specifies the reservation to run on

for more keywords, see: http://www.fz-juelich.de/jsc/jugene/usage/loadl/keywords/

batch job scripts (example)

# @ job_name = LoadL_Sample_1
# @ comment = "BGP Job by Size"
# @ error = $(job_name).$(jobid).out
# @ output = $(job_name).$(jobid).out
# @ environment = COPY_ALL;
# @ wall_clock_limit = 00:20:00
# @ notification = error
# @ notify_user = [email protected]
# @ job_type = bluegene
# @ bg_size = 32
# @ queue
mpirun -exe myprogp.rts -mode VN -np 128 -verbose 2 -args "-t 1"

for more examples, see: http://www.fz-juelich.de/jsc/jugene/usage/loadl/jobfile/

scheduling policy

job class   max wall time   base partition
m144        24h             on demand only
m128        24h             on demand only
m064        24h             all
m032        24h             all
m016        24h             R5-R8
m008        24h             R5-R8
m004        24h             R5-R8
m002        24h             R5-R8
m001        24h             R5-R8
small       30 min          R87
nocsmall    30 min          R87

llview (written by Wolfgang Frings)

- available on all FZJ machines
- current partitioning
- queue preview
- usage: llview

how to get computing time

- applications are accepted twice a year by the John von Neumann Institute for Computing (NIC)
- project run time is one year
- steering committee
- scaling approval for jugene needed

see the NIC page for more information: http://www.fz-juelich.de/nic/index-e.html

hands-on examples repository

- all examples are located in the subversion repository (username + password: guest); please update every day

https://svn.version.fz-juelich.de/sc-geegaw/bgp-workshop-krakow-2009

- contains this talk, the exercise descriptions and example code:
  - "hello MPI world" (mpi-, bgp-personality-version)
  - "dgemm test"
  - "comm./comp. overlap test" (base-, full-version)
  - "simple hyperbolic solver" (base-, simd-, mpi_cart-, mapping-, libhpm-, scalasca-version)
- use the following command to check out (see workshop-repository.txt):
  $> svn co --username guest REPOSITORY

exercise 1: “Hallo (MPI) World!”

- thursday
  - use Fortran/C/C++ and MPI, preferably C
  - use the MPI compiler wrappers
  - try out the interactive and batch queuing
  - (a minimal starting point is sketched below)
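As a minimal starting point for the thursday part, an illustrative sketch (not the reference solution from the repository):

    /* hello.c - minimal MPI hello world */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this task's rank */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of ranks */

        printf("Hallo MPI world from rank %d of %d\n", rank, size);

        MPI_Finalize();
        return 0;
    }

compile with an MPI wrapper and run interactively, e.g.: mpixlc -o hello hello.c; llrun -np 4 -mode VN hello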

- friday
  - print the position on the torus network (see BGP personality; C only)
  - observe the MPI mapping (BG_MAPPING)
  - see redbook appendix A for node hardware naming

PART III THE POWERPC 450 DFPU

PowerPC 450 schematics

PowerPC 450 chip

- IBM Cu-08 90 nm CMOS ASIC process technology
- die size 173 mm²
- clock frequency 850 MHz
- transistor count 208M
- power dissipation 16 W

execution modes

- Virtual Node (VN) mode: 4 MPI tasks, 1 thread / task
- DUAL mode: 2 MPI tasks, 1-2 threads / task
- SMP mode: 1 MPI task, 1-4 threads / task

usage: [llrun,mpirun] -mode [VN,DUAL,SMP]

double FPU

- SIMD instructions over both register files
- floating-point multiply-add (FMA) operations on double precision data
- more general operations available with cross and replicated operands
- useful for complex arithmetic, matrix multiply, FFT
- not two independent units
- parallel (quadword) loads/stores
  - fastest way to transfer data between processors and memory
  - data needs to be 16-byte aligned
  - load/store with swap order available

simdization criteria

- data must be 16-byte aligned
- loop boundaries must be known at compile time
- no function calls inside the loop
- no branches inside the loop
- disjoint pointers
- high trip count
- stride-one access
- (a loop meeting all criteria is sketched below)
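For illustration, a made-up loop that satisfies all of the above criteria (global arrays are already 16-byte aligned by the XL compiler, see the alignment-hints slide):

    #define N 1024

    /* global arrays: aligned by the compiler */
    double a[N], b[N], c[N];

    void triad(void)
    {
        int i;
        /* compile-time trip count, stride-one access,
           no calls or branches in the loop body */
        for (i = 0; i < N; i++)
            a[i] = b[i] + 2.0 * c[i];
    }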

compiler options for simdization

- SIMD instructions are only generated with the -qarch=450d and -qhot=simd compiler options
- at least -O2 optimization level; -O4 contains both options above
- use -qreport and/or -qxflag=diagnostic for additional information on SIMD instrumentation

data alignment hints

- already aligned structures (IBM XL)
  - heap and global variables
  - not true for common blocks in Fortran
- tell the compiler that arrays are aligned:
  - __alignx in C; ALIGNX in Fortran
  - check/make sure they really are
- use disjoint pointers in C
  - restrict keyword
  - #pragma disjoint

see: section 8.12.3 in the BG/P redbook, and the sketch below
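A sketch combining these hints (function and variable names are made up; see section 8.12.3 of the redbook for the authoritative examples):

    void scale(double *x, double *y, int n)
    {
    #pragma disjoint(*x, *y)   /* promise: x and y never overlap */
        int i;

        __alignx(16, x);       /* promise: both arrays are 16-byte aligned */
        __alignx(16, y);

        for (i = 0; i < n; i++)
            y[i] = 2.0 * x[i];
    }

Note that these hints are promises to the compiler: if the data is in fact unaligned or aliased, the program may silently compute wrong results.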

compiler simd hints, example

- use -qreport to generate output to *.lst; it contains pseudo-code and simd info
- simdization done:
  ...
  1586-534 (I) Loop (loop index 2) at dgemmtest.f was not SIMD vectorized because the loop is not the innermost loop.
  1586-538 (I) Loop (loop index 2) at dgemmtest.f was not SIMD vectorized because it contains unsupported loop structure.
  1586-542 (I) Loop (loop index 3 with nest-level 1 and iteration count 200) at dgemmtest.f was SIMD vectorized.
  1586-534 (I) Loop (loop index 4) at dgemmtest.f was not SIMD vectorized because the loop is not the innermost loop.
  1586-542 (I) Loop (loop index 7 with nest-level 1 and iteration count 200) at dgemmtest.f was SIMD vectorized.
  1586-543 (I) Total number of the innermost loops considered <"2">. Total number of the innermost loops SIMD vectorized <"2">.
  ...
- simdization prevented by unaligned memory:
  ...
  1586-534 (I) Loop (loop index 2) at dgemmtest.f was not SIMD vectorized because the loop is not the innermost loop.
  1586-538 (I) Loop (loop index 2) at dgemmtest.f was not SIMD vectorized because it contains unsupported loop structure.
  1586-550 (I) Loop (loop index 3) at dgemmtest.f was not SIMD vectorized because it is not profitable to vectorize.
  1586-536 (I) Loop (loop index 3) at dgemmtest.f was not SIMD vectorized because it contains memory references ((char *)&data + -1608 + (1600)*($.CIVD * 2 + 2) + (8)*($.CIVC + 1)) with non-vectorizable alignment.
  ...

exercise 2: matrix-matrix-multiplication

- consider two cases
  - the not-matrix-matrix-multiplication:
    $c_{i,j} \leftarrow c_{i,j} + \sum_k a_{k,i} \, b_{k,j}$
  - the correct matrix-matrix-multiplication:
    $c_{i,j} \leftarrow c_{i,j} + \sum_k a_{i,k} \, b_{k,j}$
- follow the optimization instructions in exercise_2.txt
- (both variants are written out in C below)
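The exercise code itself is Fortran (dgemmtest.f), but written out in C the two variants look as follows (an illustrative sketch; array size and names are made up):

    #define N 200
    double a[N][N], b[N][N], c[N][N];

    /* not-matrix-matrix-multiplication: c_ij += sum_k a_ki * b_kj;
       note the transposed access to a */
    void not_mmm(void)
    {
        int i, j, k;
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                for (k = 0; k < N; k++)
                    c[i][j] += a[k][i] * b[k][j];
    }

    /* correct matrix-matrix-multiplication: c_ij += sum_k a_ik * b_kj */
    void mmm(void)
    {
        int i, j, k;
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                for (k = 0; k < N; k++)
                    c[i][j] += a[i][k] * b[k][j];
    }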

pseudo-dgemm, performance

- not a matrix-matrix multiplication
- performance in MF/s
- run in SMP mode

                  -qarch=450   -qarch=450d
  -O2                     30            30
  -O3 -qstrict           485           511
  -O3 -qhot              526           700
  -O4                    488           694
  -O5                    530           695

dgemm, performance

- performance in MF/s
- run in SMP mode

                  -qarch=450d
  -O2                     130
  -O3 -qstrict            344
  -O3 -qhot               380
  -O4                     380
  -O5                     380
  -O5 -qessl             2530

PART IV THE TORUS NETWORK

BG/P messaging framework

- MPI 2.1 standard, see http://www.mpi-forum.org/mpi2_1/index.htm
- derived from MPICH-2, see http://www.mcs.anl.gov/research/projects/mpich2/
- deep computing messaging framework (DCMF), see the doxygen documentation at http://dcmf.anl-external.org/wiki

BG/P networks

- 3D torus network
  - only for point-to-point communication between compute nodes
  - hardware latency: 0.5-5 µs; MPI latency: 3-10 µs
  - bandwidth: 6 x 2 x 425 MB/s = 5.1 GB/s (per compute node)
  - direct memory access (DMA) unit; communication and computation overlap
- collective network
  - one-to-all and reduction functionality (compute and I/O nodes)
  - one-way tree traversal latency: 1.3 µs; MPI: 5 µs
  - bandwidth: 850 MB/s per link

BG/P networks (cont.)

- barrier network
  - hardware latency for full system: 0.65 µs; MPI: 1.6 µs
- 10 Gb network
  - I/O nodes only
  - file I/O, all external communication
- 1 Gb network
  - control network (boot, debug, monitor)
  - compute and I/O nodes

torus network constraints

- exclusive cable usage by an application
  - always the same maximal network performance
  - restrictions for system partitioning
- cable lengths must be nearly equal
  - unintuitive partitioning
  - modularly extendable
- the torus network is used only for point-to-point communication

1D torus network

- each node has 6 bidirectional connections
- simple 1D torus, a.k.a. a ring

1D torus network (cont.)

- torus cabling strategy

3D torus cabling

- same strategy as in 1D
- the midplane is the elementary network unit in BG/P (512 nodes)
- shape is measured in midplanes

Z-links in the BG/P torus

- connect four midplanes in the z-direction
- jugene row: 16 midplanes, 8 racks, 4 z-linked groups

Y-linkage in the BG/P torus

- connects the four z-linked groups

partitioning example of a jugene row

- exclusive usage of cabling limits the possible partitions

X-linkage of the BG/P torus

- X-cable (top) and X-split-cable (bottom)

full jugene torus linkage (Z- and Y-links)

full jugene torus linkage (X-links)

full jugene torus linkage (X-split-links)

two jugene rows partitioning example

get your own position in the torus

- use the bg_personality struct
- main functions
  - data struct: _BGP_Personality_t pers
  - get the data: Kernel_GetPersonality(&pers, sizeof(pers))
  - torus extension in x: xsize = pers.Network_Config.Xnodes
  - x coordinate: torus_x = pers.Network_Config.Xcoord
- (a sketch putting these together follows below)

see: appendix B in the BG/P redbook
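Putting the personality calls together, a minimal sketch (the header names follow the BG/P redbook and should be verified there; the y/z field names are assumed analogous to the x fields shown above):

    #include <stdio.h>
    #include <spi/kernel_interface.h>     /* assumption: Kernel_GetPersonality() */
    #include <common/bgp_personality.h>   /* assumption: _BGP_Personality_t */

    void print_torus_position(int rank)
    {
        _BGP_Personality_t pers;

        Kernel_GetPersonality(&pers, sizeof(pers));

        printf("rank %d at torus position (%d,%d,%d) in a (%d,%d,%d) torus\n",
               rank,
               pers.Network_Config.Xcoord, pers.Network_Config.Ycoord,
               pers.Network_Config.Zcoord,
               pers.Network_Config.Xnodes, pers.Network_Config.Ynodes,
               pers.Network_Config.Znodes);
    }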

MPI rank mapping

- use the BG_MAPPING environment variable
  - usage: mpirun -env BG_MAPPING=[TXYZ,XYZT,...]
  - XYZT: first place one MPI rank on each node, then start distributing the second MPI rank, and so on; the default mapping
  - TXYZ: first fill the first node (4 ranks in VN mode), then proceed to the next node
- explicit map file (a tiny example is shown below)
  - usage: mpirun -mapfile filename
  - each line contains the torus (4D: x y z t) coordinates of one rank; line 1 sets the coordinates of rank 0, line 2 of rank 1, ...
  - must specify BG_SHAPE=XxYxZ and BG_ROTATE=FALSE
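A tiny, entirely made-up map file: the following four lines place ranks 0-3 along the x-axis of the torus, one rank per node, when started via mpirun -mapfile mymap together with BG_SHAPE and BG_ROTATE=FALSE as noted above.

    0 0 0 0
    1 0 0 0
    2 0 0 0
    3 0 0 0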

mapping of a 1d system (slithering snake)

mapping of a 2d system (folded paper)

libhpc (former libhpm)

- instrument code for hardware counters
- HPCT dir: /bgsys/local/hpct
- main routines (see example_4.txt; a sketch follows below)
  - initialize
  - start measurement
  - end measurement
  - finalize
- include the header file (libhpc.h): -I$(HPCT)/include
- link with the following libraries:
  $(HPCT)/lib/libhpc.a $(HPCT)/lib/fake_dlfcn.o $(HPCT)/lib/average.o $(HPCT)/lib/liblicense.a
- user guide: $(HPCT)/doc/hpm/HPM_ug.pdf
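A sketch of an instrumented region; the routine names follow the older libhpm convention (hpmInit/hpmStart/hpmStop/hpmTerminate) and should be checked against the user guide, since the libhpc interface may differ:

    #include "libhpc.h"

    void instrumented_region(int rank)
    {
        hpmInit(rank, "shs");          /* initialize */
        hpmStart(1, "integrator");     /* start measurement of section 1 */

        /* ... the floating-point intensive part goes here ... */

        hpmStop(1);                    /* end measurement */
        hpmTerminate(rank);            /* finalize and write the counter report */
    }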

scalasca

- code instrumentation tool, including visualization
- three steps to analysis
  - (load the scalasca module: module load scalasca)
  - instrument: put skin before the compiler/linker command
  - run experiment: put scan before the mpirun/llrun command
  - analyse experiment: square EPIKFOLDER
- documentation
  - quick guide: http://www.fz-juelich.de/jsc/datapool/scalasca/QuickReference.pdf
  - full user guide: http://www.fz-juelich.de/jsc/datapool/scalasca/UserGuide.pdf
  - web page: www.scalasca.org

exercise 3: "overlap comm. and comp."

- very simple structure
  - even ranks send to odd ranks
  - non-blocking: send/receive, compute, wait
  - blocking: send/receive, wait, compute
- use the DCMF_INTERRUPT variable, set it to 1 (default is 0)
- overlap might be defined as

  overlap = 1 - [t(nonblocking) - t(work)] / [t(blocking) - t(work)]

- (the non-blocking pattern is sketched below)
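A sketch of the non-blocking variant (buffer size and the compute() routine are placeholders; assumes an even number of ranks):

    #include <mpi.h>

    #define N 100000
    static double sendbuf[N], recvbuf[N];

    extern void compute(void);   /* placeholder for the work phase */

    void exchange_nonblocking(int rank)
    {
        MPI_Request req[2];
        /* pair even rank 2i with odd rank 2i+1 */
        int partner = (rank % 2 == 0) ? rank + 1 : rank - 1;

        /* post receive and send first ... */
        MPI_Irecv(recvbuf, N, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &req[0]);
        MPI_Isend(sendbuf, N, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &req[1]);

        /* ... compute while the DMA engine moves the data ... */
        compute();

        /* ... and wait at the end */
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    }

Run once with mpirun -env DCMF_INTERRUPT=1 and once with the default to see the effect on the measured overlap.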

results for exercise 3, "overlap"

[plot: measured overlap (-0.1 to 1) vs. the ratio of communication to computation time]

simple hyperbolic solver (shs)

- solves the Euler equations
- simple one-step explicit, finite difference scheme
- problem setup:
  - 2D computational domain
  - each MPI rank has a local grid of size nx times ny
  - weak scaling
- communication
  - only neighbour communication
  - px times py rank distribution

shs (cont.)

- 2D domain decomposition

[diagram: px x py MPI ranks, each with a local domain of nx x ny grid points; boundary MPI ranks at the domain edges]

shs implementation

- files of minor interest
  - boundary.[ch], output.[ch], defs.c
- compilation
  - makefile, no need for modification
  - makefile.defs, adapt the compiler options
- the main loop is in shs.c
- MPI rank to physical partition mapping
  - mpi_distribution() in setup.c
- integrator, a.k.a. the floating-point operation part
  - just modify calculateFluxes() in integrator.c
- parameter file: param
- usage: shs param

exercise 4 - shs

- compile and run the example
- scaling: does the code scale up to 128 cores?
- single core performance: get the compiler to use SIMD instructions
- network performance
  - naive MPI mapping
  - MPI_Cart_create (a sketch follows below)
  - own mapping file
- measure the performance using libhpc
- use scalasca to demonstrate the torus benefits
- shs parameters:
  - problem size: nx = ny = 1000; nt = 10
  - partitioning: px, py need to be adapted to the partition size
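For the network-performance part, a sketch of creating the 2D cartesian communicator (px and py are the shs partitioning parameters from above; reorder = 1 lets MPI place the ranks to fit the torus):

    #include <mpi.h>

    MPI_Comm create_cart(int px, int py)
    {
        MPI_Comm cart;
        int dims[2]    = { px, py };
        int periods[2] = { 1, 1 };   /* periodic boundaries in both directions */

        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart);
        return cart;
    }

The neighbour ranks for the halo exchange then come from MPI_Cart_shift on this communicator.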

shs, single core performance

- performance (setFluxes function, MF/s): impact of execution mode and optimization level

                                   VN   DUAL   SMP
  -O2                             155    155   155
  -O3                             180    180   180
  -O3 -qhot=simd -qarch=450d      180    460   460

- note: shs is memory-bound, no memory locality

shs, network performance

- px=16; py=128; time per communication step in ms

                              MESH (XYZT)   TORUS (XYZT)   MESH (TXYZ)   TORUS (TXYZ)
  naive                              1.9            1.1           0.62           0.53
  MPI_Cart                          0.59           0.28            1.1           0.68
  mapfile                           0.33           0.24           0.33           0.24
  not well fitting mapfile          0.63           0.40           0.63           0.40

- benefits of the order of 5, even for a small torus
- in general, MPI 2D cartesian communicators are not bad

shs, network performance, mapping of the MPI 2D cartesian communicator

[plot: communication time [s] (0 to 0.0025) vs. problem partitioning px/py (1/2048 to 32/64) for MESH (TXYZ), MESH (XYZT), TORUS (TXYZ), TORUS (XYZT)]

shs scalasca instrumentation, MESH network

shs scalasca instrumentation, TORUS network

shs, libhpc instrumentation

- shs internal performance estimation (nt=10, i.e. total FLOPs are 271 GF)

  * approx. total (per step) FLOPs: 2.71e-02 [GF], performance: 8.31e-02 GF/s

- libhpc output

  Total floating point operations : 352.947 M
  Algebraic / user time : 93.318 Mflop/s

- the main difference is due to the unequal measurement scope
  - internal: just the integrator part, i.e. without the boundary
  - libhpc: the full internal loop, i.e. integrator and boundary treatment
