GPU benchmarking & workloads

Techlab team

09/2/2018 Benchmarking WG

CPU vs GPU

GPU:
• High throughput oriented
• Compute cores occupy MOST of the die
• Simple cores (RISC)
• Stalls avoided using more compute resources

CPU:
• Low latency oriented
• Compute cores occupy a SMALL part of the die
• Complex cores (CISC)
• Stalls avoided with many control techniques

GPUs introduced the Single Instruction Multiple Threads (SIMT) concept: a single stream of instructions drives multiple threads.

[Figure: stream multiprocessor]

Why are GPUs extensively used for ML workloads?
- Machine Learning relies massively on Linear Algebra operations.
- GPUs are throughput oriented.
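The SIMT idea above (one stream of instructions driving many threads) can be sketched in plain Python. This is only an illustration under stated assumptions: `saxpy_kernel` and `launch` are hypothetical names, not part of any GPU API, and the "threads" run one after another here, whereas a GPU would run them in parallel groups.

```python
# Illustrative SIMT emulation in plain Python (hypothetical names, not a GPU API).

def saxpy_kernel(thread_id, a, x, y, out):
    """Kernel body: every thread runs these same instructions on its own index."""
    if thread_id < len(x):  # bounds guard, as is customary in GPU kernels
        out[thread_id] = a * x[thread_id] + y[thread_id]


def launch(kernel, n_threads, *args):
    """Sequential stand-in for a GPU launch: one instruction stream, many thread ids."""
    for thread_id in range(n_threads):
        kernel(thread_id, *args)


x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
out = [0.0] * len(x)

# Launch more threads than elements; the guard leaves the surplus threads idle.
launch(saxpy_kernel, 8, 2.0, x, y, out)
print(out)  # [12.0, 24.0, 36.0, 48.0]
```

The only thing that differs between threads is `thread_id`, which is why linear algebra operations with many independent elements map well onto this model.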

AMD vs Nvidia

Nvidia, available at Techlab:
Pascal architecture based:
-Consumer: GeForce GTX 1080 Ti
-Workstation: Tesla P100

AMD, available at Techlab:
Vega 14nm based architecture:
-Radeon Vega Frontier Edition

[Figure: AMD roadmap timeline, 2017 to 2020]

*The latest Nvidia architecture, Volta, includes ML-specific Tensor Cores.

FP precision

Two types of GPUs are used in Techlab:
• Consumer card GPUs, e.g.: Nvidia GTX 1080
• Workstation computing ones, e.g.: Nvidia P100

[Figure: Max FLOPS (GFLOPS), Single Precision and Double Precision, for GTX 1080 and P100]

New generation computing units:
• TPU (Tensor Processing Unit) by Google
• DPU (DNN Processing Unit) by Microsoft
  https://www.nextplatform.com/2017/08/24/drilling-microsofts-brainwave-soft-deep-leaning-chip/
• XPU (Baidu’s computing unit)
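The Max FLOPS figures compared on this slide are theoretical peaks. A minimal sketch of how such a peak is usually estimated follows, assuming each core retires one fused multiply-add (2 FLOPs) per cycle. The core count, clock and double-precision ratio below are hypothetical placeholders, not the specifications of the GTX 1080 or the P100.

```python
def peak_gflops(cores, clock_ghz, flops_per_core_per_cycle=2):
    """Theoretical peak in GFLOPS; 2 FLOPs per cycle assumes one fused multiply-add."""
    return cores * clock_ghz * flops_per_core_per_cycle


# Hypothetical device, chosen only to make the arithmetic easy to follow.
HYPOTHETICAL_CORES = 1000
HYPOTHETICAL_CLOCK_GHZ = 1.5
HYPOTHETICAL_DP_RATIO = 1 / 32  # assumed double/single precision throughput ratio

sp_peak = peak_gflops(HYPOTHETICAL_CORES, HYPOTHETICAL_CLOCK_GHZ)
dp_peak = sp_peak * HYPOTHETICAL_DP_RATIO

print(sp_peak, dp_peak)  # 3000.0 93.75
```

The double-precision ratio is the parameter that separates the two card types: consumer cards typically ship with a much lower double-precision rate than compute cards.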

[Figure: four benchmark charts comparing W8100, K20, K80, GTX 1080 and P100]
• Max Flops (GFLOPS): Single Precision and Double Precision.
• SpMV (GFLOPS): Single Precision and Double Precision. SpMV is Sparse Matrix-Vector multiplication, a memory bound benchmark.
• S3D (GFLOPS): Single Precision and Double Precision. S3D is a computational chemistry application optimized for GPUs.
• MD5 Hash (GHash/sec).

• All results were obtained with the Scalable HeterOgeneous Computing (SHOC) benchmark suite: https://github.com/vetter/shoc
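To see why SpMV is memory bound, here is a minimal sketch of the operation on a matrix stored in CSR (compressed sparse row) form, followed by a rough FLOPs-per-byte estimate. This is not the SHOC implementation; the function name, the small example matrix and the byte counts (8-byte values, 4-byte column indices) are assumptions made for illustration.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a matrix A stored in CSR form."""
    y = []
    for row in range(len(row_ptr) - 1):
        acc = 0.0
        # Non-zeros of this row live in values[row_ptr[row]:row_ptr[row + 1]].
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


# Example matrix: [[4, 0, 1],
#                  [0, 2, 0],
#                  [3, 0, 5]]
values = [4.0, 1.0, 2.0, 3.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
x = [1.0, 2.0, 3.0]

y = spmv_csr(values, col_idx, row_ptr, x)
print(y)  # [7.0, 4.0, 18.0]

# Rough arithmetic intensity, ignoring row_ptr reads, writes to y and caching:
# each non-zero costs 2 FLOPs (multiply + add) and reads a value (8 bytes),
# a column index (4 bytes) and one element of x (8 bytes).
nnz = len(values)
flops = 2 * nnz
bytes_read = nnz * (8 + 4 + 8)
print(flops / bytes_read)  # 0.1
```

At roughly 0.1 FLOP per byte, performance is set by how fast memory can deliver data rather than by the compute peak, which is consistent with the SpMV chart sitting far below the Max Flops chart.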

ML workloads

HEP workloads

GAN (Generative Adversarial Networks):
• cosmoGAN: https://github.com/MustafaMustafa/cosmoGAN
• …

CNN (Convolutional Neural Networks):
• hep_cnn_benchmark: https://github.com/NERSC/hep_cnn_benchmark
*This was used in an interesting study about ML workloads (such as on TensorFlow or Caffe): http://www.nersc.gov/users/data-analytics/data-analytics-2/deep-learning/deep-networks-for-hep/

General workloads

CNN and RNN: https://github.com/David-Levinthal/machine-learning

Industry standards (not HEP workloads):
• ResNet
• InceptionV3
• VGG16
Showcased in the framework benchmark area: https://www.tensorflow.org/performance/benchmarks
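Benchmarks of this kind are commonly compared by throughput, such as samples processed per second. The sketch below shows one way such a number could be measured. `measure_throughput` and `dummy_step` are hypothetical names, and the dummy step merely stands in for one training or inference step of a real model; none of the repositories above is assumed to work this way.

```python
import time


def measure_throughput(step, batch_size, n_steps, warmup=2, clock=time.perf_counter):
    """Return samples per second for `step`, excluding warm-up iterations."""
    for _ in range(warmup):  # warm-up runs absorb one-off start-up costs
        step()
    start = clock()
    for _ in range(n_steps):
        step()
    elapsed = clock() - start
    return batch_size * n_steps / elapsed


def dummy_step():
    """Stand-in for one training step on a batch (pure CPU busy work)."""
    return sum(i * i for i in range(10000))


rate = measure_throughput(dummy_step, batch_size=32, n_steps=20)
print(f"{rate:.1f} samples/sec")
```

Excluding the warm-up steps matters on GPUs, where the first iterations often include initialization and memory allocation that would otherwise distort the average.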

Nvidia driver investigation

• Nvidia proprietary
  -Currently v390.25
• Open source
  -Nouveau https://nouveau.freedesktop.org/wiki/

• In late 2017, Nvidia added a statement to their GeForce (consumer card) product license. See the last statement in section 2.1.3 of: http://www.nvidia.com/content/DriverDownload-March2009/licence.php?lang=us&type=GeForce
• An investigation comparing the open-source and proprietary drivers will start soon.
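When comparing the two drivers, each benchmark run should record which one was active. A minimal sketch follows, assuming a Linux host where loaded kernel modules are listed in `/proc/modules` with the module name as the first field. `detect_gpu_driver` is a hypothetical helper and the sample text is made up for illustration.

```python
def detect_gpu_driver(modules_text):
    """Return which Nvidia kernel driver appears in /proc/modules-style text."""
    loaded = {line.split()[0] for line in modules_text.splitlines() if line.strip()}
    if "nvidia" in loaded:
        return "proprietary (nvidia)"
    if "nouveau" in loaded:
        return "open source (nouveau)"
    return None


# Sample text in the /proc/modules layout (module name is the first field).
sample = (
    "nouveau 1716224 1 - Live 0x0000000000000000\n"
    "ttm 106496 1 nouveau, Live 0x0000000000000000\n"
)
print(detect_gpu_driver(sample))  # open source (nouveau)
```

On a real machine the function would be fed the contents of `/proc/modules`, for example `detect_gpu_driver(open("/proc/modules").read())`.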
