How to use FX10 (Oakleaf-FX) - Parallel Numerical Algorithms 2016 (2016/05/30)

Satoshi OHSHIMA (Assistant Professor, Supercomputing Research Division, Information Technology Center, The University of Tokyo)

1. Introduction of FX10 (Oakleaf-FX)
   – Introduction of SCD/ITC, UTokyo
   – System overview (hardware, software, and services)
2. How to use Oakleaf-FX
   – First step: login
   – How to use the "job management system"
3. Optimization Techniques

Q&A

• Oakleaf-FX? FX10?
  – Product name: FUJITSU PRIMEHPC FX10
    • the commercial version of the "K" computer
  – Nickname: Oakleaf-FX
• Oakleaf-FX is installed at the Information Technology Center, The University of Tokyo (ITC, UTokyo).

[Map of UTokyo campuses: Kashiwa Campus (the 「柏の葉」 / "Kashiwanoha" area; "kashiwa no ha" means "oak leaf", the origin of the system nicknames), Hongo Campus, and Komaba Campus, showing the locations of Oakleaf-FX, Oakbridge-FX, and Yayoi.]

• Campus-wide and nation-wide services on information infrastructure for research & education
• Established in 1999, with four divisions
  – Campus-wide Communication & Computation Division
  – Digital Library/Academic Information Science Division
  – Network Division
  – Supercomputing Division
• Core institute of the Joint Usage/Research Center for Interdisciplinary Large-scale Information Infrastructures (JHPCN) since 2010
• Key institute of HPCI (HPC Infrastructure)

http://www.cc.u-tokyo.ac.jp/
• 11 faculty members + 8 technical staff
  – system software, numerical libraries, applications, GPU, etc.
• History
  – Supercomputing Center, UTokyo (1965-1999)
    • the oldest academic supercomputer center in Japan
    • a nation-wide, joint-use facility
  – Information Technology Center (1999-), 4 divisions
    • services & operations, research, education

http://www.cc.u-tokyo.ac.jp/
• Collaboration with users
  – linear solvers, parallel visualization, performance tuning
• Research projects
  – FP3C (collaboration with French institutes) (FY.2010-2013): Tsukuba, Tokyo Tech, Kyoto
  – Feasibility Study of Advanced HPC in Japan (towards the Japanese Exascale Project) (FY.2012-2013): 1 of 4 teams, general-purpose processors / latency cores
  – ppOpen-HPC (FY.2011-2015)
  – Post-K with RIKEN AICS (FY.2014-)
  – ESSEX-II (FY.2016-2018): German-Japanese collaboration
• International collaborations
  – Lawrence Berkeley National Laboratory (USA)
  – National Taiwan University (Taiwan)
  – National Central University (Taiwan)
  – Intel Parallel Computing Center
  – ESSEX-II/SPPEXA/DFG (Germany)

[History of SCD/ITC systems, FY2005-FY2019:]
• Hitachi SR11000/J2: 18.8 TFLOPS, 16.4 TB
• Hitachi SR16000/M1 (based on IBM Power-7): 54.9 TFLOPS, 11.2 TB
  – fat nodes with large memory; our last SMP, to be switched to MPP
• Hitachi HA8000 (T2K): 140 TFLOPS, 31.3 TB
  – (flat) MPI, good communication performance
• Fujitsu PRIMEHPC FX10 (based on SPARC64 IXfx): 1.13 PFLOPS, 150 TB (in operation today)
  – turning point to the hybrid parallel programming model
• Post T2K: 25+ PFLOPS (initial plan)
• 2 big systems, 6-year cycle; peta-scale era alongside the 京 (= K) computer

Yayoi (Hitachi SR16000/M1)
  – Total peak performance: 54.9 TFLOPS
  – Total number of nodes: 56
  – Total memory: 11200 GB
  – Peak performance / node: 980.48 GFLOPS
  – Main memory per node: 200 GB
  – Disk capacity: 556 TB
  – IBM POWER7 3.83 GHz

T2K-Todai (Hitachi HA8000-tc/RS425, retired March 2014)
  – Total peak performance: 140 TFLOPS
  – Total number of nodes: 952
  – Total memory: 32000 GB
  – Peak performance / node: 147.2 GFLOPS
  – Main memory per node: 32 GB, 128 GB
  – Disk capacity: 1 PB
  – AMD Quad-Core Opteron 2.3 GHz

Oakleaf-fx (Fujitsu PRIMEHPC FX10)
  – Total peak performance: 1.13 PFLOPS
  – Total number of nodes: 4800
  – Total memory: 150 TB
  – Peak performance / node: 236.5 GFLOPS
  – Main memory per node: 32 GB
  – Disk capacity: 1.1 PB + 2.1 PB
  – SPARC64 IXfx 1.848 GHz

Oakbridge-fx
  – small-size FX10 for long-time job execution
  – 136.2 TFLOPS, 576 nodes

Total users > 2,000

                       FX10 (Oakleaf-FX)        SMP (Yayoi)             HA8000 (T2K, retired)
                       PRIMEHPC FX10            SR16000/M1
CPU                    Fujitsu SPARC64 IXfx     IBM Power7 3.83 GHz     AMD Quad-Core Opteron
                       1.848 GHz                                        2.3 GHz
Total # of cores       76,800                   1,792                   15,232
Total peak FLOPS       1.13 PFLOPS              54.9 TFLOPS             140 TFLOPS
Total # of nodes       4,800                    56                      952
Total memory           150 TB                   11,200 GB               32 TB
# of cores / node      16                       32                      16
Peak FLOPS / node      236.5 GFLOPS             980.5 GFLOPS            147.2 GFLOPS
Memory / node          32 GB                    200 GB                  32 GB, 128 GB
Network                Tofu 6D mesh/torus       hierarchical,           Myrinet 10G,
                                                full-bisection          full-bisection
Storage                1.1 PB + 2.1 PB          556 TB                  1 PB

• Well-balanced system
  – peak performance: 1.13 PFLOPS, 398 TB/sec memory bandwidth
  – max. power consumption < 1.40 MW (< 2.00 MW with A/C)
    • a strict requirement after March 11, 2011
  – 1.043 PFLOPS for Linpack with 1.177 MW (excluding A/C)
• 6-dimensional mesh/torus interconnect
  – highly scalable Tofu interconnect
  – 5.0 x 2 GB/sec/link, 6 TB/sec bisection bandwidth
• High-performance file system
  – FEFS (Fujitsu Exabyte File System), based on Lustre
• Flexible switching between full and partial operation
• K-compatible (16 cores/node; K: 8 cores/node)
• Open-source libraries and applications
• Highly scalable for both flat MPI and hybrid (OpenMP + MPI)

System configuration:
• Compute nodes and interactive nodes: PRIMEHPC FX10 x 50 racks (4,800 compute nodes + 300 I/O nodes)
  – peak performance: 1.13 PFLOPS
  – memory capacity: 150 TB
  – interconnect: 6D mesh/torus "Tofu"
  – aggregate memory bandwidth: 398 TB/sec
• Management servers (job management, operation management, authentication): PRIMERGY RX200 S6 x 16
• Log-in nodes: PRIMERGY RX300 S6 x 8
• Local file system (for staging): PRIMERGY RX300 S6 x 2 (MDS), ETERNUS DX80 S2 x 150 (OST)
  – storage capacity: 1.1 PB (RAID-5), aggregate I/O performance: 131 GB/sec
• Shared file system (for storing data): PRIMERGY RX300 S6 x 8 (MDS), PRIMERGY RX300 S6 x 40 (OSS), ETERNUS DX80 S2 x 4 (MDT), ETERNUS DX410 S2 x 80 (OST)
  – storage capacity: 2.1 PB (RAID-6), 136 GB/sec
• External file system: 3.6 PB, reached through an external connection router
• Networks: Ethernet, InfiniBand, FibreChannel; connection to the campus LAN and end users

                             SPARC64 IXfx (Oakleaf-FX)    SPARC64 VIIIfx ("K" computer)
Clock frequency              1.848 GHz                    2.000 GHz
Number of cores / node       16                           8
Size of L2 cache / node      12 MB                        6 MB
Peak performance / node      236.5 GFLOPS                 128.0 GFLOPS
Memory / node                32 GB                        16 GB
Memory bandwidth / node      85 GB/sec (DDR3-1333)        64 GB/sec (DDR3-1000)

• Enhanced instruction set based on the SPARC-V9 instruction set architecture
  – high-performance & power-aware
• Extended number of registers
  – FP registers: 32 → 256
• Software-controllable cache: "sector cache"
  – for keeping reusable data sets in the cache
• High performance and efficiency
  – optimized FP functions
  – conditional operations

• A "system board" carries 4 nodes.
• A "rack" holds 24 system boards (= 96 nodes).
• The full system has 50 racks, i.e. 4,800 nodes.

• Node group
  – 12 nodes = 1 group
  – A/C-axis: within a system board; B-axis: across 3 system boards
• 6D: (X, Y, Z, A, B, C)
  – ABC 3D mesh: connects the 12 nodes of each node group
  – XYZ 3D mesh: connects the "ABC 3D mesh" groups

                   Compute/Interactive nodes                Login nodes
OS                 Special OS (XTCOS)                       Red Hat Enterprise Linux
Compiler           Fujitsu Fortran 77/90, C/C++             Fujitsu Fortran 77/90, C/C++ (cross compiler)
                   GNU GCC, g95                             GNU GCC, g95 (cross compiler)
Library            Fujitsu: SSL II (Scientific Subroutine Library II), C-SSL II, SSL II/MPI
                   Open source: BLAS, LAPACK, ScaLAPACK, FFTW, SuperLU, PETSc, METIS, SuperLU_DIST, Parallel NetCDF
Applications       OpenFOAM, ABINIT-MP, PHASE, FrontFlow/blue, FrontSTR, REVOCAP
File system        FEFS (based on Lustre)
Free software      bash, tcsh, zsh, emacs, autoconf, automake, bzip2, cvs, gawk, gmake, gzip, make, less, sed, tar, vim, etc.
NO ISV/commercial applications (e.g. NASTRAN, ABAQUS, STAR-CD, etc.)

[Chart: monthly utilization rate (%) of Oakleaf-FX, Oakbridge-FX, and Yayoi; FY.2014 average for Oakleaf-FX + Oakbridge-FX: 83.6%]

[Chart: breakdown by research field: Engineering, Earth/Space, Material, Energy/Physics, Information Sci., Education, Industry, Bio, Economics]

[Chart: applications, by the same research-field categories (Oakleaf-FX + Oakbridge-FX)]

[Chart: breakdown by user program category (Oakleaf-FX + Oakbridge-FX): General Group Users, HPCI, JHPCN, Industry, Education, HPC-Challenge, Personal Users, Young Researcher]

• Not free
• Service fee = cost of electricity (system + A/C)
  – 2M USD for Oakleaf-FX (2 MW)
  – 1M USD for T2K (1 MW) (until March 2014)

• Originally, only academic users were allowed to access our supercomputer systems.
• Since FY.2008, we have provided services for industry
  – support for starting large-scale computing for future business
  – we do not compete with private data centers, cloud services, etc.
  – basically, results must be open to the public
  – at most 10% of the total computational resources are open for use by industry
  – special qualification processes and a special (higher) usage fee
• Currently Oakleaf-FX is open to industry
  – normal usage (more expensive than for academic users)
    • 3-4 groups per year, fundamental research
  – trial usage at a discount rate
  – research collaboration at the academic rate (e.g. Taisei)
  – open-source/in-house codes only (NO ISV/commercial applications)

• 2-day hands-on tutorials for parallel programming by faculty members of SCD/ITC (free)
  – Fundamental MPI (3 times per year)
  – Advanced MPI (2 times per year)
  – OpenMP for multicore architectures (2 times per year)
  – participants from industry are accepted
• Graduate/undergraduate classes using the supercomputer system (free)
  – we encourage faculty members to introduce hands-on supercomputer tutorials into graduate/undergraduate classes
  – up to 12 nodes (192 cores) of Oakleaf-FX
  – proposal-based
  – not limited to classes of the University of Tokyo (2-3 of 10)
• RIKEN AICS Summer/Spring School (2011-)

• Proposal-based research project
• Each group with an accepted proposal can use the full system of Oakleaf-FX (4,800 nodes) for 24 hours.
• Held once per month.
• Open to the public.

• First step: login
• How to use the "job management system"

We cannot log in to the compute nodes!

Users work on the login nodes ("we are here") and have to go through the job management system to utilize the compute nodes.

• Preparation for the first login
  1. get an account (ID and password)
  2. generate an ssh key pair (ssh-keygen, PuTTYgen, etc.)
  3. register the public key on the User Portal web site
     • https://oakleaf-www.cc.u-tokyo.ac.jp/
• Login after preparation
  – ssh login using the ssh key
    • ssh (your account)@oakleaf-fx.cc.u-tokyo.ac.jp
  – from a Windows PC, use Cygwin, PuTTY, Tera Term, etc.
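For reference, a minimal sketch of the key setup and login from a Linux/macOS terminal; the key file name and the account name t00000 are placeholders, and the host name should be checked against the user guide:

  # 1. generate an ssh key pair, protected by a passphrase (never leave it empty)
  ssh-keygen -t rsa -b 4096 -f ~/.ssh/id_rsa_oakleaf

  # 2. register the PUBLIC key (~/.ssh/id_rsa_oakleaf.pub) on the User Portal
  #    https://oakleaf-www.cc.u-tokyo.ac.jp/

  # 3. log in with the private key (t00000 is a placeholder account name)
  ssh -i ~/.ssh/id_rsa_oakleaf t00000@oakleaf-fx.cc.u-tokyo.ac.jp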

• Password
  – the printed password is NOT the correct password
    • explained at the lecture
  – the password is only used at the portal site
    • required for viewing documents and using the profiler
    • it can be changed at the portal site
  – caution (security risk!)
    • DO NOT use an ssh key without a passphrase
    • DO NOT write the real (correct) password on paper
• Further details (mainly for Windows users)
  – read the documents (user guide) on the web

• How do we use the login nodes and the job management system?
• The login nodes are commodity x86_64 Linux machines
  – not the SPARC64 architecture
  – common GNU tools etc. are available
  – if some program is not installed, you have to build ("make") the software yourself
  – DO NOT run parallel programs or other heavy workloads on the login nodes
• We cannot log in to the compute nodes!
  – we have to write the commands in a job script and submit it to the job management system

• A job script is a simple shell-script file
  – simple case (for beginners)
    • just write the job information and the commands with a text editor (emacs, vi, etc.)
  – advanced case (for professionals)
    • the many useful functions of shell scripts (csh, bash, etc.) can be used

• Easy to use
  – but there are many options to accommodate various requests
  – check the details, options, and advanced information with the man command or the user guide

• Submit a job: pjsub jobscript.sh
• Confirm the status: pjstat
• Cancel a job: pjdel job_ID
• After execution, the results (stdout, stderr) are stored in files in the directory from which the job was submitted.
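A minimal sketch of the whole cycle on a login node; the job ID and output file names are illustrative only (pjsub prints the real job ID, and the exact file-name convention is described in the user guide):

  pjsub jobscript.sh      # submit; pjsub prints the assigned job ID
  pjstat                  # check whether the job is queued or running
  pjdel 123456            # cancel it if necessary (123456 = job ID printed by pjsub)

  # after the job finishes, look for the stdout/stderr files in the
  # submission directory (illustrative name pattern):
  ls jobscript.sh.o*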

• The job management system manages the jobs.
  – it controls the number of running programs
    • it assigns the submitted jobs considering the number of nodes, the time limit, etc.
    • jobs consume "TOKEN" (important, but not covered in this class)
  – on Oakleaf-FX, only one job is assigned and executed at a time on one Tofu unit (one job occupies 12 nodes)
  – jobs wait (are queued) if the queue is crowded
• Tips
  – most large computer systems in the world use job management systems
  – there are several different management systems and specially customized systems
  – the same job script cannot be used everywhere
    • read the corresponding documents (user guide, etc.)

• Oakleaf-FX provides FUJITSU's optimized compilers and the GNU compilers.
• By the way …
  – compute nodes: SPARC64 architecture
  – login nodes: x86_64 architecture
  – binaries compiled & linked for the login nodes cannot be executed on the compute nodes
• Oakleaf-FX therefore has 2 kinds of compilers
  – cross compilers: compile on the login nodes, execute on the compute nodes
    • frtpx, mpifrtpx, fccpx, mpifccpx, FCCpx, mpiFCCpx, xpfrtpx
  – own compilers: compile and execute on the compute nodes
    • frt, mpifrt, fcc, mpifcc, FCC, mpiFCC, xpfrt
  – "~px" is the suffix of the cross compilers
  – to build for the compute nodes, use a cross compiler on a login node or an own compiler on the compute nodes
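As a quick illustration of the cross-compiler idea (hello.c is a hypothetical source file): building on the x86_64 login node with fccpx produces a SPARC64 binary, which file(1) should report as such and which therefore cannot be started on the login node itself:

  # on the login node (x86_64): build with the cross compiler (~px suffix)
  fccpx -o hello.out hello.c

  # the result is a SPARC64 binary and will not run here ...
  file hello.out          # reports a SPARC ELF executable, not x86-64

  # ... it has to be executed on the compute nodes via the job
  # management system (pjsub), never directly on the login node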

• On a login node, the simplest cases are
  – fccpx program.c          (C)
  – mpifrtpx program.f90     (Fortran with MPI)
• There are many optimization options
  – explained later
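For example, the binaries used in the job scripts on the following pages might be built as sketched below (the source file names are hypothetical; -Kfast and -Kfast,openmp are the recommended options discussed in the optimization section):

  fccpx    -Kfast,openmp -o hello_omp.out    hello_omp.c      # OpenMP program (C)
  mpifrtpx -Kfast        -o hello_mpi.out    hello_mpi.f90    # MPI program (Fortran)
  mpifccpx -Kfast,openmp -o hello_hybrid.out hello_hybrid.c   # hybrid MPI + OpenMP (C)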

Execute a 1-node OpenMP program (X threads, up to 16 threads)

job.sh:

#!/bin/bash
#PJM -L "rscgrp=lecture"
#PJM -L "node=1"
#PJM -L "elapse=10:00"
#PJM -g gt10
#PJM -j

export OMP_NUM_THREADS=16
./hello_omp.out

• Information of this job (the #PJM lines): name of the queue (rscgrp), number of nodes, maximum elapsed time, project code, etc.; -j merges stderr into stdout.
• Commands to execute: set environment variables, then run the program.
• Some special and useful environment variables (PJM_~) are provided inside the job.
• Try running the env command in your job script to see them.
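Putting the pieces together, a typical session for this OpenMP example might look like the following sketch, run on a login node (the output file name pattern is illustrative; check the user guide for the exact convention):

  fccpx -Kfast,openmp -o hello_omp.out hello_omp.c   # cross-compile the OpenMP program
  pjsub job.sh                                       # submit the script shown above
  pjstat                                             # check until the job has finished
  cat job.sh.o*                                      # read the merged stdout/stderr (illustrative name)

  # adding the following line to job.sh lists the PJM_~ variables mentioned above:
  #   env | grep '^PJM'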

Execute an X-node MPI program (1 process on each node)

job.sh:

#!/bin/bash
#PJM -L "rscgrp=lecture"
#PJM -L "node=X"
#PJM -L "elapse=10:00"
#PJM -g xx1
#PJM -j

mpiexec ./hello_mpi.out

• Information of this job (the #PJM lines): name of the queue, number of nodes, maximum elapsed time, project code, etc.; -j merges stderr into stdout.
• A nodefile (hostfile) is not required.
• During MPI execution, additional environment variables (FLIB_~ or OMPI_~) are provided.

Execute an X-node MPI program (Y threads per node, Z processes in total)

job.sh:

#!/bin/bash
#PJM -L "rscgrp=lecture"
#PJM -L "node=X"
#PJM --mpi "proc=Z"
#PJM -L "elapse=10:00"
#PJM -g xx1
#PJM -j

export OMP_NUM_THREADS=Y
mpiexec ./hello_hybrid.out

• Information of this job (the #PJM lines): name of the queue, number of nodes, total number of MPI processes (--mpi "proc=Z"), maximum elapsed time, project code, etc.; -j merges stderr into stdout.
• The number of nodes, the total number of processes, and the number of threads per node must all be given.
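As a concrete (hypothetical) instance of the X/Y/Z relation on the 16-core nodes: X = 4 nodes, Z = 8 processes, and Y = 8 threads give 2 processes per node and 2 x 8 = 16 threads per node, so all cores are used:

#!/bin/bash
# X = 4 nodes, Z = 8 MPI processes in total (2 per node), Y = 8 threads per process
#PJM -L "rscgrp=lecture"
#PJM -L "node=4"
#PJM --mpi "proc=8"
#PJM -L "elapse=10:00"
#PJM -g xx1
#PJM -j

# 2 processes x 8 threads = 16 threads per node, filling all 16 cores
export OMP_NUM_THREADS=8
mpiexec ./hello_hybrid.out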

• Oakleaf-FX has some special aspects (hardware & software)
  – SPARC64 IXfx
  – FUJITSU's compilers & libraries
  – Tofu network
  – hierarchical storage
• By taking these special conditions into account, there are chances to obtain higher performance
  – compiler options
  – arrangement of nodes
  – performance analyzer
  – other features and techniques
• Today, a brief overview of some of them is given.
• Read the tuning manual (「チューニングマニュアル」) on the portal site.

• Each compiler has its own special options.
• Recommended options
  – -Kfast (without OpenMP)
  – -Kfast,openmp (with OpenMP)
  – good for almost all programs; try these options first
• For SIMD-friendly programs
  – -Ksimd=2
  – try this option when the target program is suited to SIMD calculation but the compiler does not generate SIMD instructions
• For advanced prefetching
  – -Kprefetch_indirect
  – good for programs with indirect memory access, but does NOT always improve performance
    • direct access: value = A[i]
    • indirect access: value = A[index[i]]
    • sparse numerical calculations require indirect memory access
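A few hedged examples of how these options might be combined on the command line (the source and output file names are hypothetical; as noted above, -Ksimd=2 and -Kprefetch_indirect should be verified by measurement):

  fccpx -Kfast          -o app.out      app.c        # recommended starting point (no OpenMP)
  fccpx -Kfast,openmp   -o app_omp.out  app_omp.c    # recommended starting point (OpenMP)

  # stronger SIMD generation for loops the compiler does not vectorize by itself
  fccpx -Kfast,openmp,simd=2 -o app_simd.out app_omp.c

  # software prefetch for indirect accesses such as value = A[index[i]]
  frtpx -Kfast,prefetch_indirect -o spmv.out spmv.f90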

• How can we confirm whether the compiler uses SIMD instructions or not?
• The -Nsrc option outputs information about the optimization process.
• If the target regions are not optimized, try to modify the code or use compiler options.
  – some options encourage the compiler to apply its optimization features

$ fccpx -Kfast,openmp -Nsrc omp_2.c
Fujitsu C/C++ Version 1.2.0  Mon Jun 16 10:26:07 2014
Compilation information
  Current directory : /PATH_TO_TARGET_DIRECTORY/omptest
  Source file       : omp_2.c
(line-no.)(optimize)
     1             #include
     2             #include
     3
     4             int main()
     5             {
     6               int i;
     7               double D1[10];
     8               double D2[10];
     9
    10   fv          for(i=0; i<10; i++){
    11   fv            D1[i] = (double)i;
    12   fv            D2[i] = 0.0;
    13   fv          }
    14               #pragma omp parallel for
    15
    16  p 8v         for(i=0; i<10; i++){
    17  p 8v           D2[i] = D1[i] * 2.0;
    18  p 8v         }
    19               return 0;
    20             }

• This is a very simple OpenMP parallel-for program; #pragma omp parallel for is written in the source code.
• p : parallelized
• v : SIMD optimized (vectorized)

• 1 Tofu group = 12 nodes
• When using a large number of nodes, the construction of the Tofu network has to be considered
  – users can specify the 3-D node assignment explicitly (see the sketch below)
  – some parallel programs require 2^n nodes, but such numbers may not fit the Tofu network well
  – be careful when running very large jobs
• However, education users can use only 12 nodes.
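For reference, a sketch of how a larger job might request an explicit 3-D node shape instead of a plain node count; the resource-group name and the "node=XxYxZ" syntax are assumptions based on the Fujitsu Technical Computing Suite and should be checked in the user guide (education users are limited to 12 nodes anyway):

#!/bin/bash
# request 24 nodes as an explicit 3-D shape (4 x 3 x 2); syntax is an assumption
#PJM -L "rscgrp=regular"
#PJM -L "node=4x3x2"
#PJM --mpi "proc=24"
#PJM -L "elapse=30:00"
#PJM -j

mpiexec ./app.out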

• When optimizing a program, it is important to know the detailed execution-time breakdown.
• It is not always easy to find the bottleneck of the target program.

• Oakleaf-FX provides a profiler and an analyzer
  – basic profiler (CUI tool), similar to gprof
    • the fipppx command
  – detailed analyzer (GUI tool), similar to VTune
    • install it from the portal site
    • an analyzer (Java) and a visualizer (Excel macro)
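A rough sketch of how the basic profiler is typically driven on Fujitsu systems: data are collected inside the job with the companion fipp command and the text report is produced afterwards with fipppx; fipp itself is not mentioned on the slide, so treat the options below as assumptions to be confirmed in the tuning manual:

  # inside the job script: run the program under the sampling collector
  fipp -C -d ./prof_data mpiexec ./app.out    # writes profiling data into ./prof_data

  # afterwards, on a login node: print a routine-level time breakdown
  fipppx -A -d ./prof_data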

• Sector cache
  – users can control the behavior of the cache
  – specific arrays can be kept in the cache
  – #pragma statement cache_subsector_* directives
• SIMD instructions
  – specific SIMD instructions are provided
  – not the same as the GNU/Intel/MS SSE instructions, but some instructions have aliases (almost the same names as GNU/Intel/MS)

• In addition, the tuning guide (see the user portal) contains much more information and many examples.

Of course, much of this knowledge about FX10 will remain useful in the future.

[Roadmap of SCD/ITC systems, FY2008-FY2022:]
• Hitachi SR11K/J2 (IBM Power-5+): 18.8 TFLOPS, 16.4 TB
• Yayoi: Hitachi SR16000/M1 (IBM Power-7): 54.9 TFLOPS, 11.2 TB
• Hitachi HA8000 (T2K) (AMD Opteron): 140 TFLOPS, 31.3 TB
• Oakleaf-FX: Fujitsu PRIMEHPC FX10 (SPARC64 IXfx): 1.13 PFLOPS, 150 TB
• Oakbridge-FX: 136.2 TFLOPS, 18.4 TB
• Reedbush (SGI, Broadwell + Pascal): 1.80-1.93 PFLOPS, the integrated supercomputer system for data analyses & scientific simulations
• Oakforest-PACS (Post T2K; Fujitsu, Intel KNL): 25 PFLOPS, 919.3 TB, operated by JCAHPC (U.Tsukuba & U.Tokyo)
• Post FX10: 50+ PFLOPS (?)
• Era markers: Peta, K, Post-K

If you have any questions, please send an e-mail to [email protected].

If you would like to use the system more, please attend the hands-on parallel programming tutorials with trial accounts (「お試しアカウント付き並列プログラミング講習会」): http://www.cc.u-tokyo.ac.jp/support/kosyu/