
Supercomputing Hackathon Advances Scientific Codes

Supercomputing Wales, a consortium of four Welsh universities, is celebrating the results of its first hackathon, a design-sprint-style event that brought programmers and mentors together to develop software and improve its performance.

In the three-day event, regional software engineers from across Cardiff, Swansea, Aberystwyth and Bangor universities, supported by mentors from NVIDIA, Atos and Dell Technologies, achieved performance breakthroughs on six scientific codes.

The code hackathon, organized by Dell Technologies, was part of the Atos–Dell Community Benefits plan for Supercomputing Wales, which promotes university collaboration with industry.

Research Software Engineers (RSEs): Ed Bennett (Swansea University), Walter Columbo (Cardiff University), Mark Dawson (Swansea University), Pablo Ouro (Cardiff University), Iakov Polyak (Cardiff University), Michele Mesiti (Swansea University), Sachin Nanavati (Cardiff University), Colin Sauzé (Aberystwyth University), Ben Thorpe (Swansea University)

Guests: Sergio Chaves (Swansea University), Vladimir Khodygo (Aberystwyth University)

Mentors: Nick Allsopp (Atos), Gilles Civario (Dell Technologies), Paul Graham (NVIDIA), Martyn Foster (Atos)

TABLE OF CONTENTS

DARKNET: Background; Method; Results; Conclusions and Future Work

HIREP: Background; Optimisation approaches; Headline results

HYDRO3D: Background; Optimisation approaches; Results

MAXWELL NEFEM: Background; Optimisation approaches

MOLPRO: Background; Optimisation

The information in this publication is provided “as is.” Dell Inc. makes no representations or warranties of any kind with respect to the information in this publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose.

Use, copying and distribution of any software described in this publication require an applicable software license.

Copyright © 2020 Dell Inc. or its subsidiaries. All Rights Reserved. Dell, Dell Technologies, EMC and other trademarks are trademarks of Dell Inc. or its subsidiaries. Intel, Intel Logo are trademarks of Intel Corporation in the U.S. and/or other countries. Other trademarks may be the property of their respective owners.

Dell Technologies believes the information in this document is accurate as of its publication date. The information is subject to change without notice.

Published in the USA 7/20.

DARKNET
Colin Sauzé (Supercomputing Wales RSE), Vladimir Khodygo (Aberystwyth Ph.D. student)

BACKGROUND
At the performance hackathon we decided to try to optimise the Darknet1 neural network framework. Darknet was written by Joseph Redmon for use with his YOLO (You Only Look Once) object-detection neural network, a leading network architecture for automatic object recognition. Darknet can run on the CPU alone or use the CUDA/cuDNN libraries on NVIDIA GPUs.

Darknet has been used for a robotic vision project at Aberystwyth University, and several other projects might potentially use it. While working with it we had observed some performance anomalies, most notably that GPU utilisation was frequently below 100%, suggesting there was room for improvement.

METHOD
On the GPU version of the code we used the NVIDIA Visual Profiler (nvvp) to measure the time spent in various tasks, and we introduced tags into the code to show how much time various functions were taking. For the CPU version we analysed the program with Intel® VTune™.
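The paper does not reproduce the tagging code itself; the sketch below shows the usual way of doing this with NVTX ranges, which appear as named spans on the nvvp timeline. The wrapped function is a hypothetical stand-in, not Darknet's actual code.

    /* Minimal sketch of NVTX tagging for nvvp; link with -lnvToolsExt.
     * forward_pass() is a hypothetical stand-in for the timed work. */
    #include <nvToolsExt.h>

    void forward_pass(void);

    void forward_pass_tagged(void)
    {
        nvtxRangePushA("forward_pass");  /* opens a named range in nvvp */
        forward_pass();                  /* the work being measured */
        nvtxRangePop();                  /* closes the range */
    }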


RESULTS
From the nvvp results we found that a lot of time was spent transferring data to the GPU and that the processing elements of the GPU were underutilised. We found that the code for copying data to the GPU was already multi-threaded, but we had been running with only one CPU core allocated, believing that the GPU was doing most of the work. Increasing the number of CPU cores allocated raised performance from 2.74 images per second with a single thread to 18.04 images per second with 16 threads, as shown in Figure 2.

1 https://github.com/pjreddie/darknet

Figure 2. Performance from increasing the thread count: images per second (more is better) against the number of threads (1, 2, 4, 8, 16).

We noticed that GPU RAM utilisation was peaking at just under 8 GB with our example dataset and decided to modify the code to double the amount of data being processed in each training batch to fill our 16 GB of GPU memory. Although specific to this particular example, a more general solution could be implemented by scaling the amount of data loaded to match the available memory. By doubling the data loaded we were able to slightly more than double the processing speed from 18.04 to 39.66 images per second. Through our mentor we tried a 32-GB V100 located at NVIDIA, which was running an older CPU architecture but allowed us to quadruple the amount of data. This increased throughput to 59 images per second.
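A general, memory-aware version of this batch scaling might query the free GPU memory at start-up and size the batch to fit. A minimal sketch, where the per-image memory estimate and the helper name are illustrative assumptions rather than Darknet values:

    /* Sketch: choose a batch size that fills the available GPU memory.
     * bytes_per_image is an assumed, workload-specific estimate. */
    #include <cuda_runtime.h>

    size_t pick_batch_size(size_t bytes_per_image, size_t reserve_bytes)
    {
        size_t free_b = 0, total_b = 0;
        cudaMemGetInfo(&free_b, &total_b);   /* free/total device memory */
        if (free_b <= reserve_bytes)
            return 1;
        return (free_b - reserve_bytes) / bytes_per_image;
    }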

Overall, these changes gave us a 21.5-fold speed improvement when using the 32-GB GPU or 14.5-fold speed improvement on the Supercomputing Wales 16-GB GPU. Figure 3 shows a graph of all these results.

Figure 3. Performance from increasing the amount of data processed in each batch of the neural network training: images per second (more is better) against the amount of data per batch (1, 2, 4, 8; the higher values are from a different system with 32 GB of GPU memory).

In the CPU version we found that Darknet used its own Basic Linear Algebra Subprograms (BLAS) implementation, so we replaced it with the Intel® Math Kernel Library (Intel® MKL). We also experimented with different compiler optimisation options.
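Darknet's convolutions reduce to a general matrix multiply, so the substitution amounts to routing that multiply through MKL's CBLAS interface. A sketch of the idea, with an illustrative wrapper rather than Darknet's actual gemm function:

    /* Sketch: delegate C = alpha*A*B + beta*C (row-major, single
     * precision) to Intel MKL instead of a hand-written triple loop. */
    #include <mkl_cblas.h>

    void gemm_mkl(int M, int N, int K, float alpha,
                  const float *A, const float *B, float beta, float *C)
    {
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    M, N, K, alpha, A, K, B, N, beta, C, N);
    }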

CONCLUSIONS AND FUTURE WORK
We were able to increase the performance of Darknet on a GPU 14.5-fold using Supercomputing Wales equipment, and to 21.5-fold using a larger-memory GPU.

We found that Darknet implements its own BLAS routines instead of using the NVIDIA library versions; additional performance gains could be made by using the NVIDIA library. Near the end of the hackathon we found an optimised fork of Darknet; due to differences in the way it operates, it was difficult to compare performance directly, but in the future we would like to compare it with the original Darknet. We would also like to compare the performance of other neural network frameworks, such as TensorFlow and PyTorch.

HIREP
Ed Bennett (Supercomputing Wales RSE), Michele Mesiti (Supercomputing Wales RSE), Sergio Chaves (Swansea Ph.D. student)

BACKGROUND
HiRep is a set of programs for performing calculations in the framework of Lattice Gauge Theory, a superset of Lattice QCD. Written in plain C99 with no compiler- or architecture-specific optimisations, it incorporates a Perl- and C++-based code generator that writes header files of macros for the matrix-matrix and matrix-vector operations specific to a particular theory (i.e., set of allowed matrix and vector dimensions). Generated macros typically have fully unrolled loops.
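As a flavour of the generated-macro style (a schematic reconstruction, not HiRep's actual generated header), a fully unrolled 2×2 matrix-vector product might look like:

    /* Schematic of a generated, fully unrolled macro in the HiRep style.
     * Names and dimensions are illustrative, not taken from HiRep. */
    #include <complex.h>

    #define MATVEC_2x2(out, m, v)                          \
        do {                                               \
            (out)[0] = (m)[0]*(v)[0] + (m)[1]*(v)[1];      \
            (out)[1] = (m)[2]*(v)[0] + (m)[3]*(v)[1];      \
        } while (0)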

For the hackathon, we chose to focus on the HMC algorithm (the most computationally demanding application) for the Sp(4) theory, which is currently a research target at Swansea University. While matrices in this theory are 4×4 complex, only 2×4 complex matrices are stored, the lower two rows being permutations of the upper two. Currently the code is compiled with icc and Intel® MPI, and it has been shown to scale very well with MPI.

OPTIMISATION APPROACHES
The first approach was a scan of the available compilers. We found that gcc 4.8.5 gave performance 10% faster than icc 2019.5 on the benchmark problem.

Profiling results showed that almost no operations were being vectorised. The roofline diagram reported by Intel® Advisor showed that the problem fit in the L3 cache and was limited by the bandwidth to that cache, with an arithmetic intensity of 0.25, but had a potential 20% speedup from vectorisation.

One reason for the poor vectorisation, identified by Intel Advisor, was a lack of trip-count information. Using profile-guided optimisation decreased the benchmark time of the Intel-compiled version by 3%, although the gcc version was unaffected.

A second reason, identified by the mentors, was the absence of loops. Since the code generator removes all loops from the generated code (historically, to avoid branching and keep the pipeline full), this was seen as a good target for optimisation. Adjusting the code generator to produce explicit loops for some simple loop macros that were previously unrolled provided a 6% performance boost to the Intel-compiled version, and a 0.5% boost to the gcc version.
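The re-rolling replaces unrolled expansions like the macro sketched earlier with an explicit loop the compiler can vectorise; schematically (again illustrative, not HiRep's generated code):

    /* Schematic of the re-rolled form: explicit loops with known trip
     * counts that the compiler is free to vectorise. */
    #include <complex.h>

    static inline void matvec_n(double complex *out,
                                const double complex *m,
                                const double complex *v, int n)
    {
        for (int i = 0; i < n; i++) {
            double complex acc = 0.0;
            for (int j = 0; j < n; j++)
                acc += m[i*n + j] * v[j];   /* rolled loop body */
            out[i] = acc;
        }
    }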

Turning our attention to the symplectic matrix-vector multiplier, we took two approaches. Firstly, we tried to take the generated unrolled code and re-roll it into a compact, more easily vectorisable loop. By the end of the hackathon this was able to deliver a 1% performance increase on the Intel-compiled version of the benchmark problem, although it doubled the execution time of the gcc-compiled version. The second approach, pursued in parallel, was to take the mathematical definition of the matrix multiplication and hand-port it to Intel AVX-512 intrinsic instructions. This currently performs equivalently to the unrolled version, but with significantly fewer instructions, and there are remaining potential improvements from prefetching.
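For flavour, such a hand-ported kernel is built from pieces like the following elementwise complex multiply, the standard AVX-512 idiom for interleaved complex data (a sketch only, not the actual symplectic multiplier from the hackathon):

    /* Sketch: multiply four double-precision complex numbers (stored
     * interleaved re,im) per __m512d vector. Standard idiom only; the
     * hackathon's symplectic kernel is more involved. */
    #include <immintrin.h>

    static inline __m512d cmul4(__m512d x, __m512d y)
    {
        __m512d x_re = _mm512_movedup_pd(x);        /* duplicate re parts */
        __m512d x_im = _mm512_permute_pd(x, 0xFF);  /* duplicate im parts */
        __m512d y_sw = _mm512_permute_pd(y, 0x55);  /* swap re/im pairs */
        /* even lanes: re*re - im*im; odd lanes: re*im + im*re */
        return _mm512_fmaddsub_pd(x_re, y, _mm512_mul_pd(x_im, y_sw));
    }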

We found a more recent branch of the code, which uses C99 complex numbers rather than a bespoke complex.h, to be 8% faster in a different but related theory (using 4×4 complex matrices); work is ongoing to port the Sp(2N) implementation to this newer branch.

Figure 4. Comparison of the progressive optimisations made to HiRep during the hackathon: time in seconds (lower is better) for the baseline (Intel), profile-guided optimization (Intel), rolled loops (Intel), gcc 4.8.5 (Intel MPI), and rolled loops (gcc).

HEADLINE RESULTS
An overall speedup of over 10% on the benchmark problem was achieved by switching compiler from icc 2019.5 to gcc 4.8.5 and rolling previously unrolled loops. Additional rolling brought the Intel performance approximately level with gcc's, and the Intel intrinsic instructions also give approximately competitive performance, with greater room for further improvement.

HYDRO3D
Pablo Ouro (Supercomputing Wales RSE), Ben Thorpe (Supercomputing Wales RSE)

BACKGROUND
Hydro3D is a finite-difference Navier-Stokes solver that permits accurate and efficient Large Eddy Simulation (LES) of turbulent flows, developed for use in the simulation of environmental fluid dynamics.

For this work we focused exclusively on two benchmarks currently being used for a journal publication in which we compare Intel Skylake with ARM ThunderX2. Benchmark one is a lid-driven cavity flow that adopts central differences to compute the fluxes; benchmark two is the Taylor-Green vortex, in which high-order schemes are adopted.

OPTIMISATION APPROACHES
Over the three days of the event, our efforts to optimise Hydro3D initially focused on two parts: (i) the 5th-order WENO scheme used to compute advection, and (ii) MPI communication.

As a first step, we compiled the code and ran the benchmarks using the TAU profiler, in order to identify the subroutines that consumed most of the simulation time. While doing this, we observed that a few very widely used subroutines were taking more time than originally thought. Supported by the mentors, we realised that the implementation of a few loops was highly inefficient, owing to IF statements inside loops, incorrect index ordering in DO loops, and one sub-optimally coded function, as sketched below.
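Hydro3D is Fortran, where arrays are column-major and the innermost loop should run over the first index; the same two pitfalls and their fixes can be sketched in C (row-major, so the fast index is the last one). This illustrates the general pattern, not Hydro3D's actual code:

    /* Illustration: hoist a loop-invariant IF out of the nest and order
     * the loops so the innermost index is the contiguous one.
     * (C is row-major; in Fortran the fast index is the first one.) */

    /* Before: invariant branch tested n*n times, strided inner access. */
    void update_slow(double *a, const double *b, int n, int add_b)
    {
        for (int j = 0; j < n; j++)
            for (int i = 0; i < n; i++)
                if (add_b)                      /* invariant test in loop */
                    a[i*n + j] += b[i*n + j];   /* stride-n access */
    }

    /* After: branch hoisted, unit-stride inner loop. */
    void update_fast(double *a, const double *b, int n, int add_b)
    {
        if (!add_b) return;                     /* decided once */
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                a[i*n + j] += b[i*n + j];       /* contiguous access */
    }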

These changes took all of our time, so no optimisation was done on the MPI communication, where we had aimed to remove the blocking MPI_SEND directives.

RESULTS
With these changes we achieved the following results:
• 2nd-order Central Differences scheme for convection: reduction of 60-70%.
• 2nd-order Central Differences scheme for diffusion: reduction of 70-80%.
• 4th-order Central Differences scheme for convection: reduction of 40-50%.
• 4th-order Central Differences scheme for diffusion: reduction of 60-70%.
• 5th-order WENO for convection: reduction of 20-40%.
• Pressure solver with multi-grid: reduction of 15-40%.

These increases in performance were tested on up to four nodes.

Overall, the benchmarks ran between 17% and 50% faster when comparing the version before the hackathon with the latest one. We can therefore say that we accomplished our initial objectives for the compute part of the code by a large margin.

The optimised performance has large implications for the Hydro3D community, as this code is one of the most heavily used on Supercomputing Wales. For instance, during 2019 we simulated the performance of tidal-stream turbine arrays using more than 12 array configurations under four flow conditions, each combination of which had to be simulated at an expense of 18,000 CPU-hours. From now on we can perform the same simulations about 25% faster, at approximately 14,000 CPU-hours each, which will help reduce the electricity bill for Supercomputing Wales.

We hope to see more events like this hackathon organised in 2020. Among the parts that still need attention are those related to fluid-structure interaction, MPI communication, the immersed boundary method, Lagrangian particle tracking and the level-set method. Moreover, we are keen to implement MPI-GPU and MPI-OpenMP schemes for the WENO subroutine and netCDF for I/O operations, and to explore performance with other compilers and compilation flags, e.g. gnu.

MAXWELL NEFEM
Mark Dawson (Supercomputing Wales RSE)

BACKGROUND
The code being considered was a discontinuous Galerkin finite element code. The code is highly flexible, allowing a large number of different types of elements. One of the challenges inherent in this code is ensuring that each processor is kept busy. Previously this was done by estimating, a priori, the length of time required to complete the processing for each element. This means, however, that a processor which finishes its work early is idle while the other processors are still busy. In addition, the code made only minimal use of AVX-512 vector instructions.

OPTIMISATION APPROACHES
The code was profiled with Arm Forge, and previously profiled runs with Scalasca and Score-P were also reviewed.

Since the code is often run either with a large number of smaller jobs or with a large number of long-running jobs, it was desirable to have some method of load-balancing operations more appropriately, to minimise processor idle time. This was done by implementing threading over the critical parts of the code using OpenMP, as sketched below.
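The load balancing comes from OpenMP's dynamic scheduling, which hands the next element to whichever thread is free instead of pre-assigning work. A minimal sketch, with a hypothetical per-element routine:

    /* Sketch: dynamic scheduling over elements of varying cost, so no
     * thread waits while others still have elements queued.
     * process_element() is a hypothetical stand-in. */
    #include <omp.h>

    void process_element(int e);

    void process_all_elements(int n_elems)
    {
        #pragma omp parallel for schedule(dynamic)
        for (int e = 0; e < n_elems; e++)
            process_element(e);   /* cost varies with element type/order */
    }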

Figure 5. Comparison of the implementations using atomic operations and OpenMP reductions for elements of order 3.

Firstly, OpenMP pragmas were added to the relevant sections, and the code was tested to ensure results were reproduced. Three approaches were then used, and the results compared for the different types of problems of relevance.
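Two of the approaches compared, accumulating shared results with atomics versus with an OpenMP reduction, can be sketched as follows (illustrative of the technique, not the NEFEM code itself):

    /* Sketch: two ways to accumulate per-element contributions. */
    #include <omp.h>

    /* (a) atomic update: threads serialise on each increment */
    double assemble_atomic(const double *contrib, int n)
    {
        double total = 0.0;
        #pragma omp parallel for
        for (int e = 0; e < n; e++) {
            #pragma omp atomic
            total += contrib[e];
        }
        return total;
    }

    /* (b) reduction: private partial sums, combined once at loop end */
    double assemble_reduction(const double *contrib, int n)
    {
        double total = 0.0;
        #pragma omp parallel for reduction(+:total)
        for (int e = 0; e < n; e++)
            total += contrib[e];
        return total;
    }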

Several examples were run for different element approximations and mesh sizes. One of the benchmarking runs is shown in Figure 5 for elements of order 3.

Similar benchmarks were run for orders 1 to 12. A good speedup was achieved using the implementation based on atomic operations for these elements. Scaling improved with larger approximation orders and was less effective at lower orders. For problems of interest, approximation order 3 is considered the most relevant case.

Many of the performance-intensive parts of the code made very minimal use of AVX-512. Achieving optimal utilisation of AVX-512 operations for this problem wasn't practical in the three-day period, as it would have involved rewriting large portions of the code and possibly sacrificing flexibility (without extensive work). Nevertheless, vectorisation was increased to almost 20% in some sections of the code that previously had none.

Following the OpenMP implementation, work was started on GPU acceleration using OpenACC pragmas. This work was not finalised during the hackathon, so performance metrics are not yet available. The work is expected to continue in early 2020, with a view to running hybrid simulations that combine MPI, OpenMP and GPU acceleration, with higher approximation orders computed on the CPU and lower orders on the GPU.
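OpenACC expresses the same loop-level parallelism as the OpenMP version but targets the GPU; a minimal sketch of the pragma style, with a placeholder loop body:

    /* Sketch: OpenACC offload of an element loop to the GPU.
     * Compile with an OpenACC compiler, e.g. nvc -acc. */
    void scale_field_acc(double *field, int n)
    {
        #pragma acc parallel loop copy(field[0:n])
        for (int e = 0; e < n; e++)
            field[e] *= 2.0;   /* placeholder for per-element work */
    }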

About Supercomputing Wales
Supercomputing Wales provides researchers across Wales with access to powerful computing facilities for science and innovation projects. The programme's facilities are used by research groups at Cardiff, Swansea, Bangor and Aberystwyth universities, along with companies and other partners working on collaborative projects. At the heart of the Supercomputing Wales initiative is a Supercomputing Centre of Excellence spearheaded by two global leaders in digital transformation, Atos and Dell Technologies. This Centre of Excellence provides Welsh researchers with a full suite of leading-edge HPC equipment, software and services. With its high-powered HPC resources, Supercomputing Wales is facilitating a step change in supercomputing activity across strategically important sectors of the Welsh economy, including nano-scale materials and advanced engineering; energy and the environment; and life sciences and health.

MOLPRO
Iakov Polyak (Supercomputing Wales RSE)

BACKGROUND
The original goal for the hackathon was to better understand and, if possible, improve the scaling of one specific subroutine in a quantum-chemistry program, Molpro. The target subroutine constitutes a core part of a larger module that calculates full configuration interaction molecular energies, using an iterative solver to diagonalize huge sparse square matrices (with typical dimensions in the tens of millions). Specifically, it performs a matrix-vector multiplication in which the sparse matrix is divided into multiple small blocks, allowing for highly load-balanced, task-based distributed-memory parallelisation. Within each task, the algorithm includes several loops in which simple operations are performed on non-zero elements generated on the fly, plus one small matrix-matrix multiplication done by a LAPACK (DGEMM) subroutine. The baseline scaling of the subroutine was considered too low on a single node, with efficiency dropping to 60% on 40 cores, given the near-total lack of communication between the parallel tasks. This was attributed to hitting the bandwidth limit, owing to the apparently memory-bound nature of the most time-consuming operations.
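The structure described, many independent tasks each ending in one small dense multiply, can be caricatured as follows (a sketch of the pattern only, with cblas_dgemm standing in for the LAPACK/BLAS DGEMM call; nothing here is Molpro's code):

    /* Sketch: the dense tail of one task in the blocked sparse
     * matrix-vector product: v_out += Hblk(m x n) * v_in. */
    #include <mkl_cblas.h>

    void apply_block(const double *Hblk, int m, int n,
                     const double *v_in, double *v_out)
    {
        /* one column on the right-hand side, so N = 1 */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    m, 1, n, 1.0, Hblk, n, v_in, 1, 1.0, v_out, 1);
    }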

OPTIMISATION
During the hackathon, under the supervision of Martyn Foster, the algorithm was re-analysed and the conclusion drawn that, while indeed memory-bound, it is latency-limited rather than bandwidth-limited; this is primarily reflected in the speed-up growing monotonically (but not linearly) with the number of cores on a given node. The code was studied to see whether it could be restructured accordingly (e.g., via data prefetching and cache blocking) to achieve better performance. While this turned out not to be straightforward, owing to the nature of the underlying physics, Martyn made several important suggestions, which I am going to try to implement in the future.
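Cache blocking, one of the restructurings considered, reorganises a traversal so each block of data is reused while it is still cache-resident; the classic illustration is a blocked transpose (a generic example, not Molpro code):

    /* Sketch: cache-blocked matrix transpose. Each BLK x BLK tile of
     * src and dst stays cache-resident while it is worked on. */
    #define BLK 64   /* tile size; tune to the cache level targeted */

    void transpose_blocked(double *dst, const double *src, int n)
    {
        for (int ib = 0; ib < n; ib += BLK)
            for (int jb = 0; jb < n; jb += BLK)
                for (int i = ib; i < ib + BLK && i < n; i++)
                    for (int j = jb; j < jb + BLK && j < n; j++)
                        dst[j*n + i] = src[i*n + j];
    }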

Importantly, though, it was noted that a significant part of the drop in efficiency can be attributed to the CPU using its turbo frequency when a small number of cores is active, a boost that is gradually reduced as the number of active cores increases. Switching turbo mode off indeed resulted in a much slower decrease in efficiency, which in that case only drops to 80% on 40 cores. This further shows that the code is not heavily bandwidth-bound and in fact displays rather satisfactory performance on one node. Further improvements will, however, be sought according to the recommendations obtained.

To learn more about Dell Technologies HPC & AI Centers of Excellence, visit DellTechnologies.com/coe.
