HPC-AI Competition NAMD Benchmark Guideline

1 About the applications and benchmarks

1.1 About NAMD

NAMD is a parallel molecular dynamics code designed for high-performance simulation of large biomolecular systems. It is based on Charm++ parallel objects. NAMD uses the popular molecular graphics program VMD for simulation setup and trajectory analysis, but is also file-compatible with AMBER, CHARMM, and X-PLOR.

For more information about NAMD, please refer to http://www.ks.uiuc.edu/Research/namd/

1.2 About Charm++

Charm++ is a generalized approach to writing parallel programs; it is an alternative to the likes of MPI, UPC, GA, etc. Charm++ represents:

• The style of writing parallel programs
• The runtime system
• And the entire ecosystem that surrounds it

Its three design principles are Overdecomposition, Migratability, and Asynchrony.

For more information about Charm++, please refer to http://charm.cs.uiuc.edu/research/charm

1.3 About UCX

UCX is a framework (a collection of libraries and interfaces) that provides an efficient and relatively easy way to construct widely used HPC protocols: MPI tag matching, RMA operations, rendezvous protocols, stream, fragmentation, remote atomic operations, etc.

For more information about OpenUCX, please refer to https://www.openucx.org/

1.4 About the STMV benchmark

Developing biomolecular model inputs for Petascale simulations is an extensive intellectual effort in itself, often involving experimental collaboration. By using synthetic benchmarks for performance measurement, James C. Phillips and his team have made a known stable simulation that is freely distributable, allowing others to replicate their work. Two synthetic benchmarks were assembled by replicating a fully solvated 1.06M-atom satellite tobacco mosaic virus (STMV) model with a cubic periodic cell of dimension 216.832 Å. The 20stmv benchmark is a 5 × 2 × 2 grid containing 21M atoms, representing the smaller end of Petascale simulations. The 210stmv benchmark is a 7 × 6 × 5 grid containing 224M atoms, representing the largest NAMD simulations. Both simulations employ a 2 fs timestep, enabled by a rigid water model and constrained lengths for bonds involving hydrogen, and a 12 Å cutoff. PME full electrostatics is evaluated every three steps and pressure control is enabled.

Reported timings of the STMV benchmark cases are the median of the last five NAMD "Benchmark time" outputs, which are generated at 120-step intervals after initial load balancing.
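For reference, that median can be computed from a run log with standard shell tools. The following is only a sketch: the log file name is a placeholder, and the field position assumes the usual "Info: Benchmark time: N CPUs X s/step Y days/ns ..." output format.

# Median of the last five "Benchmark time" s/step values in a NAMD log (placeholder file name).
grep "Info: Benchmark time:" stmv-run.log | tail -n 5 | awk '{print $6}' \
    | sort -g | awk '{v[NR]=$1} END {print "median s/step:", v[int((NR+1)/2)]}'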


2 How to get NAMD codes and dependency files

2.1 Clone the code gits, get tar files

To build the NAMD binary files, source code from multiple dependency projects is required, including:

• Charm++
• NAMD
• UCX
• OpenMPI / Intel MPI

2.1.1 Clone code gits

Charm++ git:

git clone --bare https://github.com/UIUC-PPL/charm.git \
    $HOME/github/charm.git

NAMD git:

git clone --bare https://charm.cs.illinois.edu/gerrit/namd.git \
    $HOME/github/namd.git

2.1.2 Get tar files

FFTW3 tar file:

wget http://www.fftw.org/fftw-3.3.8.tar.gz \
    -O $HOME/code/fftw-3.3.8.tar.gz

HPC-X 2.6 tar file:

wget http://content.mellanox.com/hpc/hpc-x/v2.6/hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64.tbz \
    -O $HOME/code/hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64.tbz

2.2 Check out Charm++ and NAMD code; untar HPC-X OpenMPI

2.2.1 Charm++

Check Charm++ branches and tags:

CODE_NAME=charm \
GIT_DIR=$HOME/github/$CODE_NAME.git \
bash -c '
git --bare --git-dir=$GIT_DIR \
    fetch --all --prune;
git --bare --git-dir=$GIT_DIR \
    show HEAD FETCH_HEAD --quiet --pretty=format:%H%n%cd;
git --bare --git-dir=$GIT_DIR \
    branch --list --all;
git --bare --git-dir=$GIT_DIR \
    tag
'

Checkout Charm++ v6.10.1 (proven workable) or FETCH_HEAD codes:

CODE_NAME=charm \
CODE_GIT_TAG=FETCH_HEAD \
CODE_GIT_TAG=v6.10.1 \
GIT_DIR=$HOME/github/$CODE_NAME.git \
GIT_WORK_TREE=$HOME/cluster/thor/code \
CODE_DIR=$GIT_WORK_TREE/$CODE_NAME-$CODE_GIT_TAG-$(date +%y-%m-%d) \
bash -c '
git --bare --git-dir=$GIT_DIR \
    fetch --all --prune;
git --bare --git-dir=$GIT_DIR --work-tree=$GIT_WORK_TREE \
    reset --mixed $CODE_GIT_TAG;
git --bare --git-dir=$GIT_DIR --work-tree=$CODE_DIR \
    clean -fxdn;
git --bare --git-dir=$GIT_DIR --work-tree=$GIT_WORK_TREE \
    checkout-index --force --all --prefix=$CODE_DIR/
'

Additional optimization options (not limited by this):

Code: Old release version of Charm++ code

2.2.2 NAMD

Check NAMD branches and tags:

CODE_NAME=namd \
GIT_DIR=$HOME/github/$CODE_NAME.git \
CODE_DIR=$GIT_WORK_TREE/$CODE_NAME-$CODE_GIT_TAG \
bash -c '
git --bare --git-dir=$GIT_DIR \
    fetch --all --prune;
git --bare --git-dir=$GIT_DIR \
    show HEAD FETCH_HEAD --quiet --pretty=format:%H%n%cd;
git --bare --git-dir=$GIT_DIR \
    branch --list --all;
git --bare --git-dir=$GIT_DIR \
    tag
'

Checkout NAMD 2.13:

CODE_NAME=namd \
CODE_GIT_TAG=FETCH_HEAD \
GIT_DIR=$HOME/github/$CODE_NAME.git \
GIT_WORK_TREE=$HOME/cluster/thor/code \
CODE_DIR=$GIT_WORK_TREE/$CODE_NAME-$CODE_GIT_TAG-$(date +%y-%m-%d) \
bash -c '
git --bare --git-dir=$GIT_DIR \
    fetch --all --prune;
git --bare --git-dir=$GIT_DIR --work-tree=$GIT_WORK_TREE \
    reset --mixed $CODE_GIT_TAG;
git --bare --git-dir=$GIT_DIR --work-tree=$CODE_DIR \
    clean -fxdn;
git --bare --git-dir=$GIT_DIR --work-tree=$GIT_WORK_TREE \
    checkout-index --force --all --prefix=$CODE_DIR/
'

Additional optimization options (not limited by this):

Code: NAMD FETCH_HEAD code

2.2.3 Untar HPC-X 2.6

APP_MPI_PATH=$HOME/cluster/thor/application/mpi \
HPCX_TAR=$HOME/code/hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64.tbz \
bash -c ' tar xf $HPCX_TAR -C $APP_MPI_PATH '


3 How to Build NAMD executable files

3.1 Build FFTW3

Build optimized FFTW3 libraries with the GNU C and Intel C compilers:

CODE_NAME=fftw \
CODE_TAG=3.3.8 \
PSXE_DIR=/global/software/centos-7/modules/langs/intel/2020.1.217 \
ICC_DIR=$PSXE_DIR/compilers_and_libraries_2020.1.217 \
INTEL_LICENSE_FILE+=:[email protected] \
CODE_BASE_DIR=$HOME/cluster/thor/code \
CODE_DIR=$CODE_BASE_DIR/$CODE_NAME-$CODE_TAG \
INSTALL_DIR=$HOME/cluster/thor/application/libs/fftw \
CMAKE_PATH=/global/software/centos-7/modules/tools/cmake/3.16.4/bin/cmake \
GCC_PATH=/global/software/centos-7/modules/langs/gcc/8.4.0/bin/gcc \
ICC_PATH=$ICC_DIR/linux/bin/intel64/icc \
NATIVE_GCC_FLAGS='"-march=native -mtune=native -mavx2 -msse4.2 -O3 -DNDEBUG"' \
GCC_FLAGS='"-march=broadwell -mtune=broadwell -mavx2 -msse4.2 -O3 -DNDEBUG"' \
ICC_FLAGS='"-xBROADWELL -axBROADWELL,CORE-AVX2,SSE4.2 -O3 -DNDEBUG"' \
bash -c '

CMD_REBUILD_CODE_DIR="rm -fr $CODE_DIR \
    && tar xf $HOME/code/$CODE_NAME-$CODE_TAG.tar.gz -C $CODE_BASE_DIR"

### To build shared (single precision) with GNU Compiler
BUILD_LABEL=$CODE_TAG-shared-gcc840-avx2-broadwell \
CMD_BUILD_SHARED_GCC=" \
    mkdir $CODE_DIR/build-$BUILD_LABEL; \
    cd $CODE_DIR/build-$BUILD_LABEL \
    && $CMAKE_PATH .. \
        -DBUILD_SHARED_LIBS=ON -DENABLE_FLOAT=ON \
        -DENABLE_OPENMP=OFF -DENABLE_THREADS=OFF \
        -DCMAKE_C_COMPILER=$GCC_PATH -DCMAKE_CXX_COMPILER=$GCC_PATH \
        -DENABLE_AVX2=ON -DENABLE_AVX=ON \
        -DENABLE_SSE2=ON -DENABLE_SSE=ON \
        -DCMAKE_INSTALL_PREFIX=$INSTALL_DIR/$BUILD_LABEL \
        -DCMAKE_C_FLAGS_RELEASE=$GCC_FLAGS \
        -DCMAKE_CXX_FLAGS_RELEASE=$GCC_FLAGS \
    && time -p make VERBOSE=1 V=1 install -j \
    && cd $INSTALL_DIR/$BUILD_LABEL && ln -s lib64 lib | tee $BUILD_LABEL.log "

### To build shared library (single precision) with Intel C Compiler
BUILD_LABEL=$CODE_TAG-shared-icc20-avx2-broadwell \
CMD_BUILD_SHARED_ICC=" \
    mkdir $CODE_DIR/build-$BUILD_LABEL; \
    cd $CODE_DIR/build-$BUILD_LABEL \
    && $CMAKE_PATH .. \
        -DBUILD_SHARED_LIBS=ON -DENABLE_FLOAT=ON \
        -DENABLE_OPENMP=OFF -DENABLE_THREADS=OFF \
        -DCMAKE_C_COMPILER=$ICC_PATH -DCMAKE_CXX_COMPILER=$ICC_PATH \
        -DENABLE_AVX2=ON -DENABLE_AVX=ON \
        -DENABLE_SSE2=ON -DENABLE_SSE=ON \
        -DCMAKE_INSTALL_PREFIX=$INSTALL_DIR/$BUILD_LABEL \
        -DCMAKE_C_FLAGS_RELEASE=$ICC_FLAGS \
        -DCMAKE_CXX_FLAGS_RELEASE=$ICC_FLAGS \
    && time -p make VERBOSE=1 V=1 install -j \
    && cd $INSTALL_DIR/$BUILD_LABEL && ln -s lib64 lib | tee $BUILD_LABEL.log "

eval $CMD_REBUILD_CODE_DIR;
eval $CMD_BUILD_SHARED_GCC &
eval $CMD_BUILD_SHARED_ICC &
wait
echo $CMD_REBUILD_CODE_DIR;
echo $CMD_BUILD_SHARED_GCC
echo $CMD_BUILD_SHARED_ICC

' | tee fftw3buildlog 2>&1

Run a small FFTW benchmark with and without SIMD:

[pengzhiz@thor001 fftw-3.3.8]$ ./build-3.3.8-shared-icc20-avx2-broadwell/bench -o patient -o nosimd 10240
Problem: 10240, setup: 5.07 s, time: 91.15 us, ``mflops'': 7483.2
[pengzhiz@thor001 fftw-3.3.8]$ ./build-3.3.8-shared-gcc840-avx2-broadwell/bench -o patient -o nosimd 10240
Problem: 10240, setup: 5.29 s, time: 93.61 us, ``mflops'': 7286.5
[pengzhiz@thor001 fftw-3.3.8]$ ./build-3.3.8-shared-icc20-avx2-broadwell/bench -o patient 10240
Problem: 10240, setup: 8.97 s, time: 18.33 us, ``mflops'': 37219
[pengzhiz@thor001 fftw-3.3.8]$ ./build-3.3.8-shared-gcc840-avx2-broadwell/bench -o patient 10240
Problem: 10240, setup: 8.84 s, time: 18.26 us, ``mflops'': 37362

Additional optimization options (not limited by this):

SIMD: avx avx2 avx512 SSE4.2 FMA

3.2 Build Charm++

CODE_NAME=charm \
CODE_GIT_TAG=FETCH_HEAD \
CODE_GIT_TAG=v6.10.1 \
GIT_DIR=$HOME/github/$CODE_NAME.git \
GIT_WORK_TREE=$HOME/cluster/thor/code \
CHARM_CODE_DIR=$GIT_WORK_TREE/$CODE_NAME-$CODE_GIT_TAG-$(date +%y-%m-%d) \
CHARM_DIR=$CHARM_CODE_DIR \
APP_MPI_PATH=$HOME/cluster/thor/application/mpi \
HPCX_FILES_DIR=$APP_MPI_PATH/hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64 \
HPCX_MPI_DIR=$HPCX_FILES_DIR/ompi \
HPCX_UCX_DIR=$HPCX_FILES_DIR/ucx \
UCX_DIR=$SELF_BUILT_DIR \
UCX_DIR=$HPCX_UCX_DIR \
GCC_DIR=/global/software/centos-7/modules/langs/gcc/8.4.0/bin \
NATIVE_GCC_FLAGS="-march=native -mtune=native -mavx2 -msse4.2 -O3 -DNDEBUG" \
GCC_FLAGS="-static-libstdc++ -static-libgcc -march=broadwell -mtune=broadwell -mavx2 -msse4.2 -O3 -DNDEBUG" \
ICC_FLAGS="-static-intel -xBROADWELL -axBROADWELL,CORE-AVX2,SSE4.2 -O3 -DNDEBUG" \
PSXE_DIR=/global/software/centos-7/modules/langs/intel/2020.1.217 \
INTEL_LICENSE_FILE+=:[email protected] \
INTEL_COMPILER_DIR=$PSXE_DIR/compilers_and_libraries_2020.1.217/linux/bin \
bash -c '

CMD_REBUILD_BUILD_DIR="rm -fr $CHARM_DIR/built && mkdir $CHARM_DIR/built;"

### To build UCX with HPC-X OpenMPI + GCC8.4.0
CMD_BUILD_UCX_CHARM_GCC="
    module purge && module load gcc/8.4.0 \
    && cd $CHARM_DIR/built \
    && time -p ../build charm++ ucx-linux-x86_64 ompipmix \
        -j --with-production \
        --basedir=$HPCX_MPI_DIR \
        --basedir=$UCX_DIR \
        gcc gfortran $GCC_FLAGS \
    && module purge;"

### To build MPI executables with HPC-X OpenMPI + GCC8.4.0
CMD_BUILD_MPI_CHARM_GCC="
    module purge && module load gcc/8.4.0 \
    && . $HPCX_FILES_DIR/hpcx-mt-init-ompi.sh \
    && hpcx_load \
    && cd $CHARM_DIR/built \
    && time -p ../build charm++ mpi-linux-x86_64 \
        -j --with-production \
        --basedir=$HPCX_MPI_DIR \
        gcc gfortran $GCC_FLAGS \
    && hpcx_unload && module purge;"

### To build UCX executables with HPC-X OpenMPI + ICC20u1
CMD_BUILD_UCX_CHARM_ICC="
    . $INTEL_COMPILER_DIR/compilervars.sh -arch intel64 -platform linux \
    && cd $CHARM_DIR/built \
    && time -p ../build charm++ ucx-linux-x86_64 ompipmix \
        -j --with-production \
        --basedir=$HPCX_MPI_DIR \
        --basedir=$UCX_DIR \
        icc ifort $ICC_FLAGS;"

### To build MPI executables with HPC-X OpenMPI + ICC20u1
CMD_BUILD_MPI_CHARM_ICC="
    . $INTEL_COMPILER_DIR/compilervars.sh -arch intel64 -platform linux \
    && . $HPCX_FILES_DIR/hpcx-mt-init-ompi.sh \
    && hpcx_load \
    && cd $CHARM_DIR/built \
    && time -p ../build charm++ mpi-linux-x86_64 \
        -j --with-production \
        --basedir=$HPCX_MPI_DIR \
        icc ifort $ICC_FLAGS \
    && hpcx_unload;"

eval $CMD_REBUILD_BUILD_DIR;
eval $CMD_BUILD_UCX_CHARM_GCC &
eval $CMD_BUILD_MPI_CHARM_GCC &
eval $CMD_BUILD_UCX_CHARM_ICC &
eval $CMD_BUILD_MPI_CHARM_ICC &
wait
echo $CMD_REBUILD_BUILD_DIR;
echo $CMD_BUILD_UCX_CHARM_GCC;
echo $CMD_BUILD_MPI_CHARM_GCC;
echo $CMD_BUILD_UCX_CHARM_ICC;
echo $CMD_BUILD_MPI_CHARM_ICC;

' | tee charmbuildlog 2>&1

Additional optimization options (not limited by this):

Targets: mpi-linux-x86_64 ucx-linux-x86_64 ompipmix
Compiler: icc ifort
SIMD: avx avx2 avx512


3.3 Build NAMD

CHARM_ARCH_UCX_GCC=ucx-linux-x86_64-gfortran-ompipmix-gcc \
CHARM_ARCH_UCX_ICC=ucx-linux-x86_64-ifort-ompipmix-icc \
CHARM_ARCH_MPI_GCC=mpi-linux-x86_64-gfortran-gcc \
CHARM_ARCH_MPI_ICC=mpi-linux-x86_64-ifort-icc \
CODE_NAME=charm \
CODE_GIT_TAG=FETCH_HEAD \
CODE_GIT_TAG=v6.10.1 \
GIT_WORK_TREE=$HOME/cluster/thor/code \
CHARM_CODE_DIR=$GIT_WORK_TREE/$CODE_NAME-$CODE_GIT_TAG-$(date +%y-%m-%d) \
CHARM_BASE=$CHARM_CODE_DIR/built \
FFTW3_LIB_DIR=$HOME/cluster/thor/application/libs/fftw \
GCC_FFTW3_LIB_DIR=$FFTW3_LIB_DIR/3.3.8-shared-gcc840-avx2-broadwell \
ICC_FFTW3_LIB_DIR=$FFTW3_LIB_DIR/3.3.8-shared-icc20-avx2-broadwell \
APP_MPI_DIR=$HOME/cluster/thor/application/mpi \
HPCX_FILES_DIR=$APP_MPI_DIR/hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64 \
PSXE_DIR=/global/software/centos-7/modules/langs/intel/2020.1.217 \
INTEL_LICENSE_FILE+=:[email protected] \
INTEL_COMPILER_DIR=$PSXE_DIR/compilers_and_libraries_2020.1.217/linux/bin \
MKL_DIR=$PSXE_DIR/compilers_and_libraries_2020.1.217/linux/mkl \
ICC_DIR=$PSXE_DIR/compilers_and_libraries_2020.1.217 \
ICC_PATH="$INTEL_COMPILER_DIR/intel64/icc" \
ICPC_PATH='"$INTEL_COMPILER_DIR/intel64/icpc -std=c++11"' \
ICC_FLAGS='"-static-intel -xBROADWELL -axBROADWELL,CORE-AVX2,SSE4.2 -O3 -DNDEBUG"' \
GCC_DIR=/global/software/centos-7/modules/langs/gcc/8.4.0 \
GCC_PATH='"$GCC_DIR/bin/gcc "' \
GXX_PATH='"$GCC_DIR/bin/g++ -std=c++0x"' \
NATIVE_GCC_FLAGS='"-static-libstdc++ -static-libgcc -march=native -mtune=native -mavx2 -msse4.2 -O3 -DNDEBUG"' \
GCC_FLAGS='"-static-libstdc++ -static-libgcc -march=broadwell -mtune=broadwell -mavx2 -msse4.2 -O3 -DNDEBUG"' \
CODE_NAME=namd \
CODE_GIT_TAG=FETCH_HEAD \
GIT_DIR=$HOME/github/$CODE_NAME.git \
GIT_WORK_TREE=$HOME/cluster/thor/code \
NAMD_CODE_DIR=$GIT_WORK_TREE/$CODE_NAME-$CODE_GIT_TAG-$(date +%y-%m-%d) \
NAMD_DIR=$NAMD_CODE_DIR \
bash -c '
cd $NAMD_DIR;

### To build NAMD with Charm++ HPC-X UCX + GCC8.4.0 + FFTW3
CMD_BUILD_UCX_NAMD_GCC_FFTW3="
    PATH=$GCC_DIR/bin:$PATH \
    module purge && module load gcc/8.4.0 && \
    ./config Linux-x86_64-g++ --with-memopt \
        --charm-base $CHARM_BASE --charm-arch $CHARM_ARCH_UCX_GCC \
        --with-fftw3 --fftw-prefix $GCC_FFTW3_LIB_DIR \
        --cc $GCC_PATH --cc-opts $GCC_FLAGS \
        --cxx $GXX_PATH --cxx-opts $GCC_FLAGS \
    && cd Linux-x86_64-g++ && time -p make -j \
    && cd $NAMD_DIR && mv Linux-x86_64-g++ Linux-x86_64-g++-ucx-fftw3 \
    && module purge"

### To build NAMD with Charm++ HPC-X UCX + GCC8.4.0 + MKL
CMD_BUILD_UCX_NAMD_GCC_MKL="
    PATH=$GCC_DIR/bin:$PATH \
    module purge && module load gcc/8.4.0 && \
    ./config Linux-x86_64-g++ --with-memopt \
        --charm-base $CHARM_BASE --charm-arch $CHARM_ARCH_UCX_GCC \
        --with-mkl --mkl-prefix $MKL_DIR \
        --cc $GCC_PATH --cc-opts $GCC_FLAGS \
        --cxx $GXX_PATH --cxx-opts $GCC_FLAGS \
    && cd Linux-x86_64-g++ && time -p make -j \
    && cd $NAMD_DIR && mv Linux-x86_64-g++ Linux-x86_64-g++-ucx-mkl \
    && module purge"

### To build NAMD with Charm++ HPC-X OpenMPI + GCC8.4.0 + FFTW3
CMD_BUILD_MPI_NAMD_GCC_FFTW3="
    module purge && module load gcc/8.4.0 && \
    . $HPCX_FILES_DIR/hpcx-mt-init-ompi.sh && hpcx_load \
    && PATH=$GCC_DIR/bin:$PATH \
    ./config Linux-x86_64-g++ --with-memopt \
        --charm-base $CHARM_BASE --charm-arch $CHARM_ARCH_MPI_GCC \
        --with-fftw3 --fftw-prefix $GCC_FFTW3_LIB_DIR \
        --cc $GCC_PATH --cc-opts $GCC_FLAGS \
        --cxx $GXX_PATH --cxx-opts $GCC_FLAGS \
    && cd Linux-x86_64-g++ && time -p make -j \
    && cd $NAMD_DIR && mv Linux-x86_64-g++ Linux-x86_64-g++-mpi-fftw3 \
    && hpcx_unload && module purge"

### To build NAMD with Charm++ HPC-X OpenMPI + GCC8.4.0 + MKL
CMD_BUILD_MPI_NAMD_GCC_MKL="
    module purge && module load gcc/8.4.0 && \
    . $HPCX_FILES_DIR/hpcx-mt-init-ompi.sh && hpcx_load \
    && PATH=$GCC_DIR/bin:$PATH \
    ./config Linux-x86_64-g++ --with-memopt \
        --charm-base $CHARM_BASE --charm-arch $CHARM_ARCH_MPI_GCC \
        --with-mkl --mkl-prefix $MKL_DIR \
        --cc $GCC_PATH --cc-opts $GCC_FLAGS \
        --cxx $GXX_PATH --cxx-opts $GCC_FLAGS \
    && cd Linux-x86_64-g++ && time -p make -j \
    && cd $NAMD_DIR && mv Linux-x86_64-g++ Linux-x86_64-g++-mpi-mkl \
    && hpcx_unload && module purge"

### To build NAMD with Charm++ HPC-X UCX + ICC20u1 + FFTW3
CMD_BUILD_UCX_NAMD_ICC_FFTW3="
    . $INTEL_COMPILER_DIR/compilervars.sh -arch intel64 -platform linux \
    && ./config Linux-x86_64-icc --with-memopt \
        --charm-base $CHARM_BASE --charm-arch $CHARM_ARCH_UCX_ICC \
        --with-fftw3 --fftw-prefix $ICC_FFTW3_LIB_DIR \
        --cc $ICC_PATH --cc-opts $ICC_FLAGS \
        --cxx $ICPC_PATH --cxx-opts $ICC_FLAGS \
    && cd Linux-x86_64-icc && time -p make -j \
    && cd $NAMD_DIR && mv Linux-x86_64-icc Linux-x86_64-icc-ucx-fftw3;"

### To build NAMD with Charm++ HPC-X UCX + ICC20u1 + MKL
CMD_BUILD_UCX_NAMD_ICC_MKL="
    . $INTEL_COMPILER_DIR/compilervars.sh -arch intel64 -platform linux \
    && ./config Linux-x86_64-icc --with-memopt \
        --charm-base $CHARM_BASE --charm-arch $CHARM_ARCH_UCX_ICC \
        --with-mkl --mkl-prefix $MKL_DIR \
        --cc $ICC_PATH --cc-opts $ICC_FLAGS \
        --cxx $ICPC_PATH --cxx-opts $ICC_FLAGS \
    && cd Linux-x86_64-icc && time -p make -j \
    && cd $NAMD_DIR && mv Linux-x86_64-icc Linux-x86_64-icc-ucx-mkl;"

### To build NAMD with Charm++ HPC-X OpenMPI + ICC20u1 + FFTW3
CMD_BUILD_MPI_NAMD_ICC_FFTW3="
    . $INTEL_COMPILER_DIR/compilervars.sh -arch intel64 -platform linux \
    && . $HPCX_FILES_DIR/hpcx-mt-init-ompi.sh && hpcx_load \
    && ./config Linux-x86_64-icc --with-memopt \
        --charm-base $CHARM_BASE --charm-arch $CHARM_ARCH_MPI_ICC \
        --with-fftw3 --fftw-prefix $ICC_FFTW3_LIB_DIR \
        --cc $ICC_PATH --cc-opts $ICC_FLAGS \
        --cxx $ICPC_PATH --cxx-opts $ICC_FLAGS \
    && cd Linux-x86_64-icc && time -p make -j \
    && cd $NAMD_DIR && mv Linux-x86_64-icc Linux-x86_64-icc-mpi-fftw3 \
    && hpcx_unload"

### To build NAMD with Charm++ HPC-X OpenMPI + ICC20u1 + MKL
CMD_BUILD_MPI_NAMD_ICC_MKL="
    . $INTEL_COMPILER_DIR/compilervars.sh -arch intel64 -platform linux \
    && . $HPCX_FILES_DIR/hpcx-mt-init-ompi.sh && hpcx_load \
    && ./config Linux-x86_64-icc --with-memopt \
        --charm-base $CHARM_BASE --charm-arch $CHARM_ARCH_MPI_ICC \
        --with-mkl --mkl-prefix $MKL_DIR \
        --cc $ICC_PATH --cc-opts $ICC_FLAGS \
        --cxx $ICPC_PATH --cxx-opts $ICC_FLAGS \
    && cd Linux-x86_64-icc && time -p make -j \
    && cd $NAMD_DIR && mv Linux-x86_64-icc Linux-x86_64-icc-mpi-mkl \
    && hpcx_unload"

eval $CMD_BUILD_UCX_NAMD_GCC_FFTW3
eval $CMD_BUILD_MPI_NAMD_GCC_FFTW3
eval $CMD_BUILD_UCX_NAMD_ICC_FFTW3
eval $CMD_BUILD_MPI_NAMD_ICC_FFTW3
eval $CMD_BUILD_UCX_NAMD_GCC_MKL;
eval $CMD_BUILD_MPI_NAMD_GCC_MKL;
eval $CMD_BUILD_UCX_NAMD_ICC_MKL;
eval $CMD_BUILD_MPI_NAMD_ICC_MKL;
wait
echo $CMD_BUILD_UCX_NAMD_GCC_FFTW3
echo $CMD_BUILD_MPI_NAMD_GCC_FFTW3
echo $CMD_BUILD_UCX_NAMD_ICC_FFTW3
echo $CMD_BUILD_MPI_NAMD_ICC_FFTW3
echo $CMD_BUILD_UCX_NAMD_GCC_MKL;
echo $CMD_BUILD_MPI_NAMD_GCC_MKL;
echo $CMD_BUILD_UCX_NAMD_ICC_MKL;
echo $CMD_BUILD_MPI_NAMD_ICC_MKL;

' | tee namdbuildlog 2>&1

Additional optimization options (not limited by this):

SIMD: avx avx2 avx512

3.4 Check the NAMD executable files

Carefully check the shared library dependencies of the built executables.

$ ldd */namd2 | grep -e mkl -e fft -e libuc -e mpi -e libopen
Linux-x86_64-g++-mpi-fftw3/namd2:
    libfftw3f.so.3.5.7 => not found
    libmpi.so.40 => not found
Linux-x86_64-g++-mpi-mkl/namd2:
    libmkl_intel_lp64.so => not found
    libmkl_sequential.so => not found
    libmkl_core.so => not found
    libmpi.so.40 => not found
Linux-x86_64-g++-ucx-fftw3/namd2:
    libucp.so.0 => not found
    libuct.so.0 => not found
    libucs.so.0 => not found
    libucm.so.0 => not found
    libopen-pal.so.40 => not found
    libopen-rte.so.40 => not found
    libfftw3f.so.3.5.7 => not found
Linux-x86_64-g++-ucx-mkl/namd2:
    libucp.so.0 => not found
    libuct.so.0 => not found
    libucs.so.0 => not found
    libucm.so.0 => not found
    libopen-pal.so.40 => not found
    libopen-rte.so.40 => not found
    libmkl_intel_lp64.so => not found
    libmkl_sequential.so => not found
    libmkl_core.so => not found
Linux-x86_64-icc-mpi-fftw3/namd2:
    Linux-x86_64-icc-mpi-fftw3/namd2: /lib64/libstdc++.so.6: version `CXXABI_1.3.8' not found (required by Linux-x86_64-icc-mpi-fftw3/namd2)
    Linux-x86_64-icc-mpi-fftw3/namd2: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by Linux-x86_64-icc-mpi-fftw3/namd2)
    Linux-x86_64-icc-mpi-fftw3/namd2: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by Linux-x86_64-icc-mpi-fftw3/namd2)
    libfftw3f.so.3.5.7 => not found
    libmpi.so.40 => not found
Linux-x86_64-icc-mpi-mkl/namd2:
    Linux-x86_64-icc-mpi-mkl/namd2: /lib64/libstdc++.so.6: version `CXXABI_1.3.8' not found (required by Linux-x86_64-icc-mpi-mkl/namd2)
    Linux-x86_64-icc-mpi-mkl/namd2: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by Linux-x86_64-icc-mpi-mkl/namd2)
    Linux-x86_64-icc-mpi-mkl/namd2: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by Linux-x86_64-icc-mpi-mkl/namd2)
    libmkl_intel_lp64.so => not found
    libmkl_sequential.so => not found
    libmkl_core.so => not found
    libmpi.so.40 => not found
Linux-x86_64-icc-ucx-fftw3/namd2:
    Linux-x86_64-icc-ucx-fftw3/namd2: /lib64/libstdc++.so.6: version `CXXABI_1.3.8' not found (required by Linux-x86_64-icc-ucx-fftw3/namd2)
    Linux-x86_64-icc-ucx-fftw3/namd2: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by Linux-x86_64-icc-ucx-fftw3/namd2)
    Linux-x86_64-icc-ucx-fftw3/namd2: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by Linux-x86_64-icc-ucx-fftw3/namd2)
    libucp.so.0 => not found
    libuct.so.0 => not found
    libucs.so.0 => not found
    libucm.so.0 => not found
    libopen-pal.so.40 => not found
    libopen-rte.so.40 => not found
    libfftw3f.so.3.5.7 => not found
Linux-x86_64-icc-ucx-mkl/namd2:
    Linux-x86_64-icc-ucx-mkl/namd2: /lib64/libstdc++.so.6: version `CXXABI_1.3.8' not found (required by Linux-x86_64-icc-ucx-mkl/namd2)
    Linux-x86_64-icc-ucx-mkl/namd2: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by Linux-x86_64-icc-ucx-mkl/namd2)
    Linux-x86_64-icc-ucx-mkl/namd2: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by Linux-x86_64-icc-ucx-mkl/namd2)
    libucp.so.0 => not found
    libuct.so.0 => not found
    libucs.so.0 => not found
    libucm.so.0 => not found
    libopen-pal.so.40 => not found
    libopen-rte.so.40 => not found
    libmkl_intel_lp64.so => not found
    libmkl_sequential.so => not found
    libmkl_core.so => not found

Because "-static-intel" and "-static-libstdc++ -static-libgcc" were used when building and linking namd2, no GNU or Intel compiler runtime libraries need to be loaded to run the built executables. The MKL, FFTW, UCX, and MPI shared libraries are still needed at runtime because these components are not statically linked into the generated namd2 executable.

Note that at runtime, some of the shared libraries (in the above case, the OpenMPI and UCX libraries) depend on newer versions of libstdc++ and libgcc even though the main body of the project is statically built. In this case, newer releases of these libraries must be loaded at runtime to provide an appropriate environment for such shared libraries. A very simple way to do this is to load the relevant environment module files before running the application, because a module load prepares and exports the LD_LIBRARY_PATH variable for the running bash shell and its child processes. The command is as follows; the module name needs to be adjusted to the configuration of your cluster:

module load gcc/8.4.0

For performance results to be reproducible, the runtime library environment, such as LD_LIBRARY_PATH, must be clearly documented. Alternatively, build every component of the final executable statically.

More information about static builds of libraries such as MKL, FFTW, and MPI can be found in the software vendors' user manuals.


3.5 Optional operations

3.5.1 Build UCX from the latest master source code on GitHub

Clone UCX git:

git clone --bare https://github.com/openucx/ucx.git \
    $HOME/github/ucx.git

Check UCX branches and tags:

GIT_DIR=$HOME/github/ucx.git \
GIT_WORK_TREE=$HOME/github/ucx \
bash -c '
git --bare --git-dir=$GIT_DIR --work-tree=$GIT_WORK_TREE \
    branch --list --all;
git --bare --git-dir=$GIT_DIR --work-tree=$GIT_WORK_TREE \
    tag
'

Fetch and checkout UCX master:

GIT_DIR=$HOME/github/ucx.git \
GIT_WORK_TREE=$HOME/cluster/thor/code \
CODE_DIR=$GIT_WORK_TREE/FETCH_HEAD \
UCX_GIT_TAG=FETCH_HEAD \
bash -c '
git --bare --git-dir=$GIT_DIR \
    fetch --all --prune;
git --bare --git-dir=$GIT_DIR --work-tree=$GIT_WORK_TREE \
    reset --mixed $UCX_GIT_TAG;
git --bare --git-dir=$GIT_DIR --work-tree=$CODE_DIR \
    clean -fxdn;
git --bare --git-dir=$GIT_DIR --work-tree=$GIT_WORK_TREE \
    checkout-index --force --all --prefix=$CODE_DIR/
'

Build UCX master code:

CODE_PATH=$HOME/cluster/thor/code/ \
UCX_CODE_PATH=$CODE_PATH/ucx170 \
UCX_LIB_PATH=$HOME/cluster/thor/application/libs/ucx \
bash -c 'cd $UCX_CODE_PATH;
./autogen.sh
./configure CC=gcc CXX=gcc \
    --disable-logging --disable-debug --disable-assertions \
    --disable-params-check --enable-devel-headers \
    --without-java --with-knem --enable-mt --with-avx --with-march=avx2 \
    --prefix=$UCX_LIB_PATH/FETCH_HEAD-gcc485-mt-avx
time -p make -j install;
make clean;
' | tee ucx170buildlog 2>&1

Additional optimization options (not limited by this):

Compiler: icc/gcc
Build type: static/shared build
SIMD: avx avx2 avx512

4 How to run NAMD simulations and benchmarks

4.1 Download the example simulations

mkdir $HOME/benchmarks; cd $HOME/benchmarks


wget https://www.ks.uiuc.edu/Research/namd/utilities/apoa1.tar.gz
wget https://www.ks.uiuc.edu/Research/namd/utilities/f1atpase.tar.gz
wget https://www.ks.uiuc.edu/Research/namd/utilities/stmv.tar.gz
wget https://www.ks.uiuc.edu/Research/namd/utilities/stmv_sc14.tar.gz
wget https://www.ks.uiuc.edu/Research/namd/utilities/tiny.tar.gz
wget https://www.ks.uiuc.edu/Research/namd/utilities/ramd-5-examples.tar.gz
wget https://www.ks.uiuc.edu/Research/namd/utilities/ramd-4.1-examples.tar.gz
wget https://www.ks.uiuc.edu/Research/namd/utilities/bpti_imd.tar.gz
wget https://www.ks.uiuc.edu/Research/namd/utilities/er-gre.tar.gz
wget https://www.ks.uiuc.edu/Research/namd/utilities/alanin.tar.gz
wget https://www.ks.uiuc.edu/Research/namd/utilities/tclforces.tar.gz
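After downloading, the archives can be unpacked in place. A minimal sketch, assuming everything was downloaded into $HOME/benchmarks:

# Unpack all downloaded benchmark archives; each extracts into its own subdirectory (e.g. apoa1/).
cd $HOME/benchmarks
for f in *.tar.gz; do
    tar xzf "$f"
done
ls -d */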

4.2 Run in batch files

The most advisable way to run NAMD on an HPC cluster is to use a batch script. Here is a SLURM script for running the NAMD apoa1 simulation.

[pengzhiz@login01 ~]$ cat run.sh
#!/bin/bash
#SBATCH -J NAMD_apoa1
#SBATCH -N 8
#SBATCH --tasks-per-node=32
#SBATCH -o namd-apoa1-8n256T-%j.out
#SBATCH -t 01:00:00
#SBATCH -p thor
#SBATCH --exclusive
#SBATCH -d singleton

cd $SLURM_SUBMIT_DIR

APP_MPI_PATH=$HOME/cluster/thor/application/mpi
HPCX_FILES_DIR=$APP_MPI_PATH/hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64
HPCX_MPI_DIR=$HPCX_FILES_DIR/ompi
PSXE_PATH=/global/software/centos-7/modules/langs/intel/2020.0.166
MKL_PATH=$PSXE_PATH/compilers_and_libraries_2020.0.166/linux/mkl
NAMD_DIR=$HOME/cluster/thor/code/namd$(date +%y-%m-%d)
UCX_NAMD_DIR=$NAMD_DIR/Linux-x86_64-g++-ucx
BENCHMARK_DIR=$HOME/benchmarks/
BENCHMARK_INPUT=$BENCHMARK_DIR/apoa1/apoa1.namd

. $MKL_PATH/bin/mklvars.sh intel64;
. $HPCX_FILES_DIR/hpcx-mt-init-ompi.sh;
hpcx_load;

echo "Running on $SLURM_JOB_NODELIST" echo "Nnodes = $SLURM_JOB_NUM_NODES" echo "Ntasks = $SLURM_NTASKS" echo "Launch command: time -p $HPCX_MPI_DIR/bin/mpirun -np $SLURM_NTASKS -- map-by core -report-bindings -x UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1 $UCX_NAMD_DIR/namd2 $BENCHMARK_INPUT"

time -p $HPCX_MPI_DIR/bin/mpirun -np $SLURM_NTASKS --map-by core -report-bindings \
    -x UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1 $UCX_NAMD_DIR/namd2 $BENCHMARK_INPUT
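The script is then submitted to SLURM in the usual way; for example:

# Submit the batch script; output lands in namd-apoa1-8n256T-<jobid>.out as set by #SBATCH -o.
sbatch run.sh
squeue -u $USER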

4.3 Run in a "bash CLI"

On a SLURM-managed cluster, it is strongly recommended (and sometimes mandatory) to run applications by submitting batch scripts with "sbatch" to the cluster manager. If a "bash command line interface" style of execution is still preferred, here is a rough, runnable example that loops small NAMD simulations over different NAMD target binaries.

for NAMD_TARGET in Linux-x86_64-icc-ucx Linux-x86_64-icc-mpi; do
for NAMD_CASE in apoa1 stmv tiny; do
for NUM_OF_NODES in 1 2 3 4; do
NAMD_CASE=${NAMD_CASE} \
NAMD_TARGET=${NAMD_TARGET} \
PSXE_PATH=/global/software/centos-7/modules/langs/intel/2020.0.166 \
MKL_PATH=$PSXE_PATH/compilers_and_libraries_2020.0.166/linux/mkl \
HPCX_FILES_DIR=~/cluster/helios/application/mpi/hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ \
MPI_OPT="\
    --report-bindings \
    --map-by core --rank-by core --bind-to core \
    -mca io ompio -x UCX_NET_DEVICES=mlx5_0:1,mlx5_2:1" \
MPI_EXE=\$HPCX_MPI_DIR/bin/mpirun \
APP_EXE=/global/home/users/pengzhiz/cluster/helios/code/namd20-03-27/${NAMD_TARGET}/namd2 \
APP_OPT=/global/home/users/pengzhiz/benchmarks/namd/allinone/${NAMD_CASE}.mod.namd \
CMD="$MPI_EXE $MPI_OPT $APP_EXE $APP_OPT" \
bash -c 'echo "#!/bin/bash
. $HPCX_FILES_DIR/hpcx-mt-init-ompi.sh;
. $MKL_PATH/bin/mklvars.sh intel64;
hpcx_load;
$CMD 2>&1 | tee ~/ascii/\${SLURM_JOB_PARTITION}-namd/${NAMD_CASE}-${NAMD_TARGET}-\${SLURM_JOB_PARTITION}-\${SLURM_JOB_NUM_NODES}.\${SLURM_JOB_ID}.output
"' | \
sbatch \
    --partition=helios \
    --output ${NAMD_CASE}-${NAMD_TARGET}-%j.out \
    --open-mode=truncate \
    --exclusive --requeue \
    --job-name=mpiceshi \
    --nodes=${NUM_OF_NODES}-${NUM_OF_NODES} \
    --sockets-per-node=2 \
    --cores-per-socket=20 \
    --threads-per-core=1 \
    --extra-node-info=2-2:10-10:1-1 \
    --mincpus=20 \
    --cpus-per-task=1 \
    --ntasks-per-core=1 \
    --ntasks-per-node=40 \
    --ntasks-per-socket=20

done; done; done;

4.4 Run the simulations using UCX NAMD + HPC-X OpenMPI

APP_MPI_PATH=$HOME/cluster/thor/application/mpi \
HPCX_FILES_DIR=$APP_MPI_PATH/hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64 \
HPCX_MPI_DIR=$HPCX_FILES_DIR/ompi \
PSXE_PATH=/global/software/centos-7/modules/langs/intel/2020.0.166 \
MKL_PATH=$PSXE_PATH/compilers_and_libraries_2020.0.166/linux/mkl \
NAMD_DIR=$HOME/cluster/thor/code/namd$(date +%y-%m-%d) \
UCX_NAMD_DIR=$NAMD_DIR/Linux-x86_64-g++-ucx \
BENCHMARK_DIR=$HOME/benchmarks/ \
BENCHMARK_INPUT=$BENCHMARK_DIR/apoa1/apoa1.namd \
bash -c '
cd $NAMD_DIR;
. $MKL_PATH/bin/mklvars.sh intel64;
. $HPCX_FILES_DIR/hpcx-mt-init-ompi.sh;
hpcx_load;
time -p $HPCX_MPI_DIR/bin/mpirun -n $(nproc) \
    $UCX_NAMD_DIR/namd2 $BENCHMARK_INPUT
'

Output:

Info: Benchmark time: 32 CPUs 0.0247267 s/step 0.286189 days/ns 761.227 MB memory
TIMING: 500 CPU: 13.6592, 0.0245156/step Wall: 13.668, 0.0245319/step, 0 hours remaining, 761.226562 MB of memory in use.
======WallClock: 14.876899 CPUTime: 14.818383 Memory: 762.234375 MB
[Partition 0][Node 0] End of program
real 16.50
user 426.86
sys 84.39

Info: Benchmark time: 20 CPUs 0.0387181 s/step 0.448126 days/ns 746.914 MB memory
TIMING: 500 CPU: 20.3969, 0.0385311/step Wall: 20.4048, 0.0385447/step, 0 hours remaining, 746.914062 MB of memory in use.

4.5 How to evaluate the simulation benchmarks

As the "numsteps 500" and "outputtiming 20" settings in apoa1.namd indicate, the apoa1 example case finishes after 500 simulation steps, and the days/ns performance is printed every 20 steps.

At the end of the simulation, the timing information for the whole benchmark run is printed to the runtime log.

In the apoa1 example above, the untuned HPC-X OpenMPI run finished in 14.876899 seconds, while the wall time for the same simulation launched with charmrun was 15.038658 seconds.

For both the WallClock and days/ns (days per nanosecond) performance metrics, lower values are better.
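These values can be pulled out of the run log with standard shell tools. The following is only a sketch: the log file name is a placeholder, and the awk field positions assume the "Info: Benchmark time: N CPUs X s/step Y days/ns ..." format shown above.

# Print the final WallClock line and the last five days/ns and s/step readings from a NAMD log.
LOG=namd-apoa1-8n256T-12345.out       # placeholder log file name
grep "WallClock:" "$LOG" | tail -n 1
grep "Info: Benchmark time:" "$LOG" | tail -n 5 | awk '{print $8, "days/ns,", $6, "s/step"}'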

4.6 About configuration and evaluation of the stmv_sc14 benchmark cases

For the STMV benchmark cases, including 1stmv2fs.namd, 20stmv2fs.namd, and 210stmv2fs.namd, full information on how to run them, how to evaluate them, and the memopt requirements is given at the STMV link below:

https://www.ks.uiuc.edu/Research/namd/utilities/stmv_sc14/

When tuning code performance, there is no need to run an overly long benchmark. Adjust numsteps to a small value such as 1, 2, 3, 4, 5, 10, 30, 60, or 120 to save time for more valuable tuning and optimization ideas.
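For example, numsteps can be overridden in a working copy of the benchmark input. This is only a sketch; the file names are examples and assume the input contains a plain "numsteps N" line.

# Make a short test copy of the benchmark input with numsteps reduced to 120 (example file names).
cp 20stmv2fs.namd 20stmv2fs.short.namd
sed -i 's/^numsteps .*/numsteps 120/' 20stmv2fs.short.namd
grep "^numsteps" 20stmv2fs.short.namd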

When using these files for benchmarking the code, it will likely be necessary to adjust "numsteps" to obtain a sufficiently long, but not overly long, simulation. There is a fixed startup cost before the actual simulation begins, due to reading input files and building internal data structures. Also, within the first 500 steps the measurement-based load balancer is executed, which might slightly redistribute the computational workload to improve performance; after that, the timings should begin to show representative performance.

In practice, we advise adjusting "numsteps" so that the entire execution takes around 300 seconds (5 minutes) of wall-clock time, shown as "WallClock:" at the very end of the log. Note that this total timing includes the startup cost. Simulation performance should be taken from the final "TIMING: Wall:" value, which measures the time for just the simulation; the startup cost can be assessed by subtracting the final "TIMING: Wall:" value from the final "WallClock:" value.
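The subtraction can be done directly from the run log; a minimal sketch follows (the log file name is a placeholder, and the field handling assumes the output format shown in section 4.4).

# Startup cost = final "WallClock:" value minus final "TIMING: ... Wall:" value (placeholder log name).
LOG=stmv-run.out
WALLCLOCK=$(awk '{for (i=1;i<=NF;i++) if ($i ~ /WallClock:$/) w=$(i+1)} END {print w}' "$LOG")
SIMWALL=$(awk '/^TIMING:/ {for (i=1;i<=NF;i++) if ($i=="Wall:") w=$(i+1)} END {gsub(",","",w); print w}' "$LOG")
echo "WallClock=$WALLCLOCK  simulation Wall=$SIMWALL  startup=$(echo "$WALLCLOCK - $SIMWALL" | bc -l) s"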
