Performance migration from Westmere to Intel through Advanced Vector Extensions (AVX)
Nagarajan Kathiresan, IBM India
Presented by Giri Prabhakar
Contact: [email protected] [email protected]

© 2012 IBM Corporation

Source: Intel MMX, SSE and AVX

· “I must have the Intel compiler, it has sped up our application by two.” - A customer when moving from version 9.1 to version 10 of the Intel compiler

Source: Intel

Source: Intel AVX

Source: Intel SSE & AVX

Source: Intel Compiler tunings

The following figure illustrates the data types used in the SSE and Intel® AVX instructions. Roughly, Intel® AVX allows any multiple of a 32-bit or 64-bit floating-point type that adds up to 128 or 256 bits, as well as multiples of any integer type that add up to 128 bits.
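The packing rule above is simple arithmetic: the number of elements per register is the register width divided by the element width. The following is a minimal sketch of that arithmetic, not a description of any Intel API. The integer widths listed (8, 16, 32 and 64 bits) are an assumption based on the usual SSE integer types; the slide only says "any integer type".

```python
# Element widths in bits. The integer widths are an assumption (the usual
# SSE integer types); the slide itself only says "any integer type".
FLOAT_WIDTHS = {"float32": 32, "float64": 64}
INT_WIDTHS = {"int8": 8, "int16": 16, "int32": 32, "int64": 64}


def lanes(register_bits, element_bits):
    """Number of elements of a given width that exactly fill one register."""
    if register_bits % element_bits != 0:
        raise ValueError("element width must divide the register width")
    return register_bits // element_bits


def avx_packings():
    """List (type, register_bits, lane_count) combinations per the rule above."""
    combos = []
    # Floating-point types may fill either a 128-bit or a 256-bit register.
    for name, bits in FLOAT_WIDTHS.items():
        for reg in (128, 256):
            combos.append((name, reg, lanes(reg, bits)))
    # Integer types are limited to 128 bits in this generation of AVX.
    for name, bits in INT_WIDTHS.items():
        combos.append((name, 128, lanes(128, bits)))
    return combos


if __name__ == "__main__":
    for name, reg, n in avx_packings():
        print(f"{name:8s} x {n:2d} in a {reg}-bit register")
```

For example, a 256-bit register holds eight 32-bit floats or four 64-bit doubles, which is twice what a 128-bit SSE register holds.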

Source: Intel MMX, SSE and AVX

About AVX Performance - Summary

· AVX doubles the 128-bit SSE registers to 256 bits.
· It introduces an entirely new instruction encoding (VEX).
· The new encoding switches from 2-operand instructions to 3-operand instructions, allowing the destination register to be different from the source registers. Example:
  addps r0, r1 # (r0 = r0 + r1)
  vs.
  vaddps r0, r1, r2 # (r0 = r1 + r2)

This new encoding is used not only for the new 256-bit instructions, but also for the 128-bit AVX versions of all the old SSE instructions. This means that existing SSE code can be improved without requiring a switch to 256-bit registers.

· Switching to AVX is very easy; simply recompile with -mavx. In addition to using -mavx
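The difference between the 2-operand and 3-operand forms can be modelled in a few lines. This is only an illustrative sketch: registers are plain Python lists in a dictionary, and the helper names sum_two_operand and sum_three_operand are hypothetical. It shows why a destructive 2-operand add needs an extra register copy when both inputs must survive.

```python
def addps(regs, dst, src):
    """Two-operand SSE style: dst = dst + src (dst is overwritten)."""
    regs[dst] = [a + b for a, b in zip(regs[dst], regs[src])]


def vaddps(regs, dst, src1, src2):
    """Three-operand VEX style: dst = src1 + src2 (sources are preserved)."""
    regs[dst] = [a + b for a, b in zip(regs[src1], regs[src2])]


def sum_two_operand(regs):
    """Compute r0 = r1 + r2 using only the destructive form.

    r1 must first be copied into r0, otherwise it would be overwritten.
    Returns the number of modelled instructions issued.
    """
    regs["r0"] = list(regs["r1"])  # extra register-to-register move
    addps(regs, "r0", "r2")
    return 2


def sum_three_operand(regs):
    """Compute r0 = r1 + r2 with a single non-destructive instruction."""
    vaddps(regs, "r0", "r1", "r2")
    return 1


if __name__ == "__main__":
    regs = {"r0": [0.0] * 4, "r1": [1.0, 2.0, 3.0, 4.0], "r2": [10.0, 20.0, 30.0, 40.0]}
    print(sum_two_operand(dict(regs)), "modelled instructions with addps")
    print(sum_three_operand(dict(regs)), "modelled instruction with vaddps")
```

Both routes give the same r0; the 3-operand form simply avoids the copy.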

Source: Compiling for AVX, Intel

Intel and GNU compiler for AVX

· The Intel 12.1 compiler uses OpenMP standard 3.1, while the CP2K source code uses OpenMP standard 2.5.
· Some OpenMP classes could not be compiled with the Intel compiler.
· The GNU compiler is open source, and appears to be more 'in step' with the CP2K source.
· However, it is “difficult” to get the system admin of a very large installation to make a root installation of a later version of the GNU compiler (4.3+).
· Therefore, experiments were tried with a local build of GNU (Gfortran).
· While -mavx does “work”, i.e., the code compiles, it doesn't “AVX vectorize”. The flags -march=corei7-avx -mtune=corei7-avx were found to be necessary to enable AVX.
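One way to check whether a build actually "AVX vectorizes" is to look at the generated assembly (for example, the output of the compiler's -S option) for the 256-bit ymm registers. The sketch below assumes AT&T-syntax assembly text; the function name count_ymm_instructions and the sample strings are hypothetical and are not taken from the CP2K build.

```python
import re

# 256-bit AVX registers are named ymm0, ymm1, ...; 128-bit ones are xmm0, ...
YMM = re.compile(r"%?ymm\d+", re.IGNORECASE)


def count_ymm_instructions(asm_text):
    """Count assembly lines that reference at least one ymm register."""
    count = 0
    for line in asm_text.splitlines():
        code = line.split("#", 1)[0]  # ignore trailing '#' comments
        if YMM.search(code):
            count += 1
    return count


# Hypothetical samples: a 128-bit SSE add and two 256-bit AVX instructions.
SAMPLE_SSE = "\taddps\t%xmm1, %xmm0\n"
SAMPLE_AVX = "\tvaddps\t%ymm1, %ymm0, %ymm0\n\tvmovups\t%ymm0, (%rdi)\n"

if __name__ == "__main__":
    print("SSE sample, ymm lines:", count_ymm_instructions(SAMPLE_SSE))
    print("AVX sample, ymm lines:", count_ymm_instructions(SAMPLE_AVX))
```

A count of zero on a hot loop suggests the compiler emitted at most 128-bit code for it, which matches the behaviour reported above for -mavx alone.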

How to build the Gfortran compiler locally

· Gfortran dependent libraries
  – GNU Multiple Precision Library (GMP)
  – MPFR Library (http://www.mpfr.org/)
  – MPC Library (http://www.multiprecision.org/)
  – Parma Polyhedra Library (PPL)
  – CLooG-PPL or CLooG (ftp://gcc.gnu.org/pub/gcc/infrastructure/ as cloog-ppl-0.15.tar.gz)

Gfortran Local Build

[Diagram: GFORTRAN and the libraries it is built on: MPFR, MPC, GMP, PPL and ClooG(-PPL).]

FFTW_INC = /user/naga/hybrid/Endeavor/fftw/include
FFTW_LIB = /user/naga/hybrid/Endeavor/fftw/lib
CC       = gcc
CPP      =
FC       = mpif90
LD       = mpif90
AR       = ar -r
CPPFLAGS =
DFLAGS   = -D__GFORTRAN -D__FFTSG -D__LIBINT -D__parallel -D__SCALAPACK -D__BLACS -D__FFTW3 -D__MAX_CONTR=3 -D__GRID_CORE=2
FCFLAGS  = -I$(FFTW_INC) -O3 -fopenmp -ffast-math -march=corei7-avx -mtune=corei7-avx -funroll-loops -ftree-vectorize -march=native -ffree-form $(DFLAGS)
LDFLAGS  = $(FCFLAGS)
LIBS     = /user/naga/hybrid/Endeavor/libint_cpp_wrapper.o \
           /user/naga/hybrid/Endeavor/libint/lib/libderiv.a \
           /user/naga/hybrid/Endeavor/libint/lib/libint.a \
           /user/naga/hybrid/Endeavor/libs/libscalapack.a \
           /user/naga/hybrid/Endeavor/libs/blacs_MPI-LINUX-0.a \
           /user/naga/hybrid/Endeavor/libs/blacsCinit_MPI-LINUX-0.a \
           /user/naga/hybrid/Endeavor/libs/blacsF77init_MPI-LINUX-0.a \
           /user/naga/hybrid/Endeavor/libs/blacs_MPI-LINUX-0.a \
           /user/naga/hybrid/Endeavor/libs/lapack_LINUX.a \
           /user/naga/hybrid/Endeavor/libs/blas_LINUX.a \
           $(FFTW_LIB)/libfftw3.a \
           -lstdc++ -lpthread
OBJECTS_ARCHITECTURE = machine_gfortran.o

CP2K Build

[Diagram: CP2K and the libraries it is built on: BLACS, FFTW, BLAS, LAPACK and SCALAPACK.]

CP2K Execution time

[Figure: bar chart of total execution time (as a ratio, lower is better) for SNB GF 4.5, SNB GF 4.7, SNB GF 4.7 OPT and WSM GF 4.5.]

MPI Synchronization time

[Figure: bar chart of MPI synchronization time (as a ratio, lower is better) by category: SNB GF 4.5, SNB GF 4.7, SNB GF 4.7 Opt and WSM GF 4.5.]

MPI PERFORMANCE

[Figure: bar chart of performance in MB/s (higher is better) per MPI routine (MP_Bcast, MP_ISendRecv, MP_ISend, MP_IRecv, MP_Recv) for SDB Gfortran 4.5, SDB Gfortran 4.7, SDB Gfortran 4.7 Optimized and WSM Gfortran 4.5.]

Acknowledgements / Technical advisory

Swamy Kandadai

Raj Panda

Luigi Brochard

Gfortran Local Build

MPFR Install

export CC=gcc
export CXX=g++
export F77=gfortran
export FC=gfortran
export F90=gfortran
export CFLAGS="-m64 -O2 "
export CXXFLAGS="-m64 -O2 "
export FFLAGS="-m64 -O2 "
export FCFLAGS="-m64 -O2 "
export LDFLAGS="-m64 -O2 "
./configure --prefix=/user/naga/4.7.0/dlibs \
    --with-gmp=/user/naga/4.7.0/dlibs 2>&1 \
    | tee config.naga-64bit.log
make -j8 2>&1 | tee make.naga-64bit.log
make install

MPC Install

export CC=gcc
export CXX=g++
export F77=gfortran
export FC=gfortran
export F90=gfortran
export CFLAGS="-m64 -O2 "
export CXXFLAGS="-m64 -O2 "
export FFLAGS="-m64 -O2 "
export FCFLAGS="-m64 -O2 "
export LDFLAGS="-m64 -O2 "
./configure --prefix=/user/naga/4.7.0/dlibs \
    --with-mpfr=/user/naga/4.7.0/dlibs \
    --with-gmp=/user/naga/4.7.0/dlibs 2>&1 \
    | tee config.naga-64bit.log
make -j8 2>&1 | tee make.naga-64bit.log
make install

PPL Install

export CC=gcc
export CXX=g++
export F77=gfortran
export FC=gfortran
export F90=gfortran
export CFLAGS="-m64 -O2 "
export CXXFLAGS="-m64 -O2 "
export FFLAGS="-m64 -O2 "
export FCFLAGS="-m64 -O2 "
export LDFLAGS="-m64 -O2 "
./configure --prefix=/user/naga/4.7.0/dlibs \
    --with-libgmp-prefix=/user/naga/4.7.0/dlibs/lib \
    2>&1 | tee config.naga-64bit.log
make -j8 2>&1 | tee make.naga-64bit.log
make install

cloog-ppl-0.15.11

export CC=gcc
export CXX=g++
export F77=gfortran
export FC=gfortran
export F90=gfortran
export CFLAGS="-m64 -O2 "
export CXXFLAGS="-m64 -O2 "
export FFLAGS="-m64 -O2 "
export FCFLAGS="-m64 -O2 "
export LDFLAGS="-m64 -O2 "
./configure --prefix=/user/naga/4.7.0/dlibs \
    --with-ppl=/user/naga/4.7.0/dlibs \
    --with-gmp=/user/naga/4.7.0/dlibs
make -j8 2>&1 | tee make.naga-64bit.log
make install
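The --with-gmp, --with-mpfr and --with-ppl options in the install steps above imply an order in which the libraries have to be built. The sketch below derives one valid order with a topological sort, using graphlib from the Python standard library (Python 3.9 or later). The dependency table is read off those configure options; treating gfortran as depending on all five libraries is an assumption based on the dependent-libraries list, not something the slides state explicitly.

```python
from graphlib import TopologicalSorter

# Each key depends on the libraries in its set. The entries for mpfr, mpc,
# ppl and cloog-ppl follow the --with-* configure options shown above; the
# gfortran entry is an assumption based on the dependent-libraries list.
DEPENDS_ON = {
    "gmp": set(),
    "mpfr": {"gmp"},
    "mpc": {"gmp", "mpfr"},
    "ppl": {"gmp"},
    "cloog-ppl": {"gmp", "ppl"},
    "gfortran": {"gmp", "mpfr", "mpc", "ppl", "cloog-ppl"},
}


def build_order(deps=DEPENDS_ON):
    """Return one valid build order in which dependencies always come first."""
    return list(TopologicalSorter(deps).static_order())


if __name__ == "__main__":
    print(" -> ".join(build_order()))
```

Several orders are valid (for instance, PPL can be built before or after MPFR); the only hard constraints are that GMP comes first and the compiler last.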

Sandy Bridge vs Westmere

Sandy Bridge:
· 32 kB data + 32 kB instruction L1 cache (3 clocks) and 256 kB L2 cache (8 clocks) per core
· Shared L3 cache includes the processor graphics (LGA 1155)
· 64-byte cache line size
· Two load/store operations per CPU cycle for each memory channel
· Decoded micro-operation cache and enlarged, optimized branch predictor
· Improved performance for transcendental mathematics, AES encryption (AES instruction set), and SHA-1 hashing
· 256-bit/cycle ring bus interconnect between cores, graphics, cache and System Agent Domain
· Advanced Vector Extensions (AVX) 256-bit instruction set with wider vectors, new extensible syntax and rich functionality
· Intel Quick Sync Video, hardware support for video encoding and decoding
· Up to 8 physical cores or 16 logical cores through Hyper-threading

Westmere:
· Native six-core (Gulftown) and ten-core (Westmere-EX) processors.
· A new set of instructions that gives over 3x the encryption and decryption rate of Advanced Encryption Standard (AES) processes compared to before.
· Delivers seven new instructions (AES instruction set or AES-NI) that will be used by the AES algorithm. Also an instruction called PCLMULQDQ (see CLMUL instruction set) that will perform carry-less multiplication for use in cryptography. These instructions will allow the processor to perform hardware-accelerated encryption, not only resulting in faster execution but also protecting against software targeted attacks.
· Integrated graphics, added into the processor package (dual core Arrandale and Clarkdale only).
· Improved virtualization latency.
· New virtualization capability: "VMX Unrestricted mode support," which allows 16-bit guests to run (real mode and big real mode).
· Support for "Huge Pages" of 1 GB in size.

Source: Wikipedia

CP2K Build


BLAS Install

Modify the make.inc

FORTRAN  = gfortran
OPTS     = -O3 -ffast-math -funroll-loops -ftree-vectorize -march=corei7-avx -mtune=corei7-avx
OPTS     = -O3
DRVOPTS  = $(OPTS)
NOOPT    =
LOADER   = gfortran
LOADOPTS =

make
make install
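The make.inc above assigns OPTS twice. In make, a later plain assignment to a variable replaces the earlier one, so if both lines are left uncommented the second value is the one used. The sketch below is a deliberately simplified reader for plain NAME = value lines only (no +=, :=, ?=, includes or line continuations); the function name effective_make_vars is hypothetical. It can help confirm which set of flags a given make.inc selects, for example the AVX-optimized one or the plain -O3 one.

```python
def effective_make_vars(text):
    """Return the last value assigned to each plain 'NAME = value' variable."""
    values = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop make comments
        if "=" not in line:
            continue
        # Split on the first '=' only, so flags such as -march=... stay intact.
        name, value = line.split("=", 1)
        values[name.strip()] = value.strip()  # a later assignment overwrites
    return values


# The assignments from the make.inc shown above.
MAKE_INC = """\
FORTRAN = gfortran
OPTS = -O3 -ffast-math -funroll-loops -ftree-vectorize -march=corei7-avx -mtune=corei7-avx
OPTS = -O3
DRVOPTS = $(OPTS)
"""

if __name__ == "__main__":
    print("OPTS in effect:", effective_make_vars(MAKE_INC)["OPTS"])
```

To build the AVX-optimized variant, the plain OPTS = -O3 line would have to be removed or commented out; the same applies to the paired CCFLAGS, CFLAGS and FFLAGS settings in the other build files.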

Modify the Bmake.inc file

BTOPdir         = /user/naga/hybrid/Endeavor/BLACS
BLACSdir        = $(BTOPdir)/LIB
BLACSDBGLVL     = 0
BLACSFINIT      = $(BLACSdir)/blacsF77init_$(COMMLIB)-$(PLAT)-$(BLACSDBGLVL).a
BLACSCINIT      = $(BLACSdir)/blacsCinit_$(COMMLIB)-$(PLAT)-$(BLACSDBGLVL).a
BLACSLIB        = $(BLACSdir)/blacs_$(COMMLIB)-$(PLAT)-$(BLACSDBGLVL).a
MPIdir          = /opt/intel/impi/4.0.3.008/intel64
MPILIBdir       = $(MPIdir)/lib
MPIINCdir       = $(MPIdir)/include
MPILIB          = -L$(MPILIBdir) -lmpich
F77             = mpif90
F77NO_OPTFLAGS  =
F77FLAGS        = $(F77NO_OPTFLAGS) -O
F77LOADER       = $(F77)
F77LOADFLAGS    =
CC              = mpicc
CCFLAGS         = -O4 -ffast-math -funroll-loops \
                  -ftree-vectorize -march=corei7-avx -mtune=corei7-avx
CCFLAGS         = -O4
CCLOADER        = $(CC)
CCLOADFLAGS     =

fftw-3.2.2

export CC=gcc
export CFLAGS="-O3 -ffast-math -funroll-loops -ftree-vectorize \
    -ffree-form -march=corei7-avx -mtune=corei7-avx"
export CFLAGS="-O3"
export MPICC=mpicc
export F77=gfortran
export FFLAGS="-O3 -ffast-math -funroll-loops -ftree-vectorize \
    -ffree-form -march=corei7-avx -mtune=corei7-avx"
export FFLAGS="-O3"
./configure --prefix=/user/naga/4.7.0/cp2k-dlibs \
    --enable-mpi 2>&1 | tee config.naga.log

Install scalapack-2.0.1

Modify SLmake.inc file

FC          = mpif90
CC          = mpicc
NOOPT       = -O0
FCFLAGS     = -O3 -march=corei7-avx -mtune=corei7-avx
CCFLAGS     = -O3 -march=corei7-avx -mtune=corei7-avx
FCLOADER    = $(FC)
CCLOADER    = $(CC)
FCLOADFLAGS = $(FCFLAGS)
CCLOADFLAGS = $(CCFLAGS)
BLASLIB     = /user/naga/hybrid/Endeavor/BLAS/blas_LINUX.a
LAPACKLIB   = /user/naga/hybrid/Endeavor/lapack-3.4.0/liblapack.a
LIBS        = $(LAPACKLIB) $(BLASLIB)
