Overview of Cluster GPU Software Stack

GPU Login

● ssh glogin.dragon.kaust.edu.sa
● First login auto-generates keys & ssh config
  – .ssh/config

      # GPU login nodes
      Host glogin
          Hostname glogin.dragon.kaust.edu.sa
          User $USER
          IdentityFile ~/.ssh/ksl-internal
          StrictHostKeyChecking no
          ForwardX11 yes
          ForwardX11Trusted yes

wiki.dragon.kaust.edu.sa/wiki/Tutorial0200LoggingIn

GPU Software: Modules

● Modules
  – Customized to login node (GPU, Intel, AMD)
  – New & improved GPU App Stack is being built
    ● Expect changes. Make requests. Stay connected.
  – Prefer newest modules
    ● legacy will be deprecated
    ● Some modules might not be GPU optimized

      module avail
      module load module/version

GPU Software: Modules

● CUDA
    MATLAB, anaconda, gromacs, ansys, tensorflow, anaconda-R, relion,
    avizo, beagle-lib, schrodinger, cst, biobuilds, torch, lammps, CST,
    vmd, mathematica, cuda, medea

GPU Software: Modules

● OpenGL / EGL*
    vis/ParaView*, adf, caffe, pymol, adf, ansys, MATLAB, anaconda, CST,
    python-canopy, atk, anaconda-base, eman2, qiime, avizo, anaconda-R,
    GAUSSVIEW, R, comsol, anaconda3, genometools, rstudio, cst, gamma,
    ATK, gnuplot, schrodinger, mathematica, bandage, molden,
    smrtanalysis, medea, baps, openbabel, virtualgl, molcas, biobuilds,
    vmd, tecplot, turbomole, bluefish, vesta, xcrysden

  * NVIDIA EGL support coming to Cluster in future rollout...
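
The "prefer newest modules" advice can be applied mechanically to a module listing. The sketch below is one possible approach, not a KSL-provided tool. It assumes one `name/version` entry per line, as a terse listing such as `module avail -t` typically prints. The helper names `version_key` and `newest_modules`, the directory header, and the older versions in the sample (`cuda/7.5.18`, `gcc/4.9.3`) are hypothetical and only serve as illustration.

```python
import re


def version_key(version):
    """Turn a version string into a sortable key.

    Digit runs compare numerically and letter runs compare as text,
    so "10.0" sorts after "9.2" (plain string comparison would not).
    """
    tokens = re.findall(r"\d+|[A-Za-z]+", version)
    return [(0, int(t)) if t.isdigit() else (1, t.lower()) for t in tokens]


def newest_modules(terse_lines):
    """Map each module name to its highest version found in the listing."""
    newest = {}
    for line in terse_lines:
        entry = line.strip().replace("(default)", "")
        # Skip blank lines and directory headers such as "/path/modulefiles:".
        if not entry or entry.endswith(":") or "/" not in entry:
            continue
        name, version = entry.split("/", 1)
        if name not in newest or version_key(version) > version_key(newest[name]):
            newest[name] = version
    return newest


# Hypothetical listing: only cuda/8.0.44, gcc/6.4.0 and tensorflow/1.3.0
# appear in the slides; the header and the older versions are made up.
SAMPLE_LISTING = [
    "/hypothetical/modulefiles:",
    "cuda/7.5.18",
    "cuda/8.0.44",
    "gcc/4.9.3",
    "gcc/6.4.0",
    "tensorflow/1.3.0",
]

for name, version in sorted(newest_modules(SAMPLE_LISTING).items()):
    print("module load {}/{}".format(name, version))
```

On many installations the `module` command writes its listing to stderr, so capturing it for a script like this may require redirecting stderr to stdout first. Variant builds such as `cuda/8.0.44-cudNN5.1` would sort after the plain `8.0.44` with this key, which may or may not be the module wanted.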

GPU Jobs + Constraints

● sinfo --partition=batch --format="%n %f" | fgrep gpu
● Output (node name, then feature list):

      dgpu501-22-r  cpu_intel_e5_2670,gpu,...,tesla_k40m
      dgpu502-01-l  cpu_intel_e5_2670,gpu,...,tesla_k20m
      dgpu702-16    cpu_intel_e5_2699_v3,gpu,...,gtx1080ti
      dgpu703-01    cpu_intel_e5_2699_v3,gpu,...,p100
      dgpu703-25    cpu_intel_e5_2699_v3,gpu,...,p6000

wiki.dragon.kaust.edu.sa/wiki/FAQConstraints

GPU Jobs + Constraints

● srun --pty --time=1:00 --gres=gpu:p100:2 bash -l
● sbatch --time=1:00:00 --gres=gpu:1 --constraint="[p100|p6000]" runjob.sbat

wiki.dragon.kaust.edu.sa/wiki/FAQConstraints#GPUs

GPU Jobs + Constraints

● sbatch --time=1:00:00 runjob.sbat
● runjob.sbat

      #SBATCH --job-name=gpujob
      #SBATCH --gres=gpu:gtx1080ti:4
      #SBATCH --constraint="[local_500G]"
      #SBATCH --nodes=2 --ntasks-per-node=2

wiki.dragon.kaust.edu.sa/wiki/FAQConstraints#GPUs

GPU Software: Modules & Compilers

● CMake
  – module load cmake
● C++
  – System default: GCC v4.8.5
  – module load gcc/6.4.0
  – module load legacy intel/2017

GPU Software: Modules & Compilers

● CUDA
  – module load cuda/8.0.44
  – nvcc -std=c++11 -o example example.cu
● CUDNN
  – module load applications-extra
    module load cuda/8.0.44-cudNN5.1
  – nvcc -std=c++11 -o example example.cu

GPU Apps

● tensorflow/1.3.0
  – cudatoolkit=8.0, cudnn6.0.21, python=3.6.2
  – module load tensorflow/1.3.0
  – python
    >>> import tensorflow as tf

GPU Tools

● General Information (not scalable)
  – nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26                 Driver Version: 375.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TIT...   On  | 0000:0D:00.0     Off |                  N/A |
| 37%   56C    P2   153W / 189W |    135MiB /  6081MiB |     86%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX TIT...   On  | 0000:0E:00.0     Off |                  N/A |
| 31%   47C    P8    34W / 189W |      2MiB /  6082MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     72633    C   ../../build.cudnntraining.teneen/trainlenet    133MiB |
+-----------------------------------------------------------------------------+

KSL provides profiling training...
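
The boxed `nvidia-smi` report above is meant for reading, not for scripts. As a minimal sketch (an illustration, not part of the KSL tooling), the same per-GPU numbers can be requested in CSV form through the `--query-gpu` and `--format` options of `nvidia-smi` and parsed in Python. The function names, the 5% idle threshold, and the sample text are assumptions; the sample reuses the values from the report above, including the GPU name exactly as truncated there.

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi, in the order they come back.
QUERY_FIELDS = ["index", "name", "utilization.gpu", "memory.used", "memory.total"]


def parse_gpu_csv(text):
    """Parse `--format=csv,noheader,nounits` output into a list of dicts.

    Assumes every numeric field is a plain integer; a field reported as
    unsupported by the driver would need extra handling.
    """
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(record) != len(QUERY_FIELDS):
            continue  # skip blank or malformed lines
        index, name, util, used, total = (field.strip() for field in record)
        rows.append({
            "index": int(index),
            "name": name,
            "util_percent": int(util),
            "mem_used_mib": int(used),
            "mem_total_mib": int(total),
        })
    return rows


def query_gpus():
    """Run nvidia-smi on the current node and return the parsed rows."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    result = subprocess.run(cmd, check=True, stdout=subprocess.PIPE,
                            universal_newlines=True)
    return parse_gpu_csv(result.stdout)


def idle_gpus(rows, max_util=5):
    """Return indices of GPUs at or below max_util percent utilization."""
    return [row["index"] for row in rows if row["util_percent"] <= max_util]


# Sample text built from the report above; the name is truncated there,
# so it is kept truncated here rather than guessed.
SAMPLE_CSV = (
    "0, GeForce GTX TIT..., 86, 135, 6081\n"
    "1, GeForce GTX TIT..., 0, 2, 6082\n"
)

rows = parse_gpu_csv(SAMPLE_CSV)
print(idle_gpus(rows))
```

On a GPU node, `query_gpus()` would replace the sample text. Like the report itself, this inspects one node at a time, so it shares the "not scalable" caveat.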