
Supercomputing Hackathon Advances Scientific Codes

Supercomputing Wales, a consortium of four Welsh universities, is celebrating the results of its first hackathon, a design-sprint-style event that brought programmers and mentors together to develop software and improve its performance.

In the three-day event, regional software engineers from across Cardiff, Swansea, Aberystwyth and Bangor universities, supported by mentors from NVIDIA, Atos and Dell Technologies, achieved performance breakthroughs on six scientific codes.

The code hackathon, organized by Dell Technologies, was part of the Atos–Dell Community Benefits plan for Supercomputing Wales, which promotes university collaboration with industry.

Research Software Engineers (RSEs): Ed Bennett (Swansea University), Walter Columbo (Cardiff University), Mark Dawson (Swansea University), Pablo Ouro (Cardiff University), Iakov Polyak (Cardiff University), Michele Mesiti (Swansea University), Sachin Nanavati (Cardiff University), Colin Sauzé (Aberystwyth University), Ben Thorpe (Swansea University)

Guests: Sergio Chaves (Swansea University), Vladimir Khodygo (Aberystwyth University)

Mentors: Nick Allsopp (Atos), Gilles Civario (Dell Technologies), Paul Graham (NVIDIA), Martyn Foster (Atos)

TABLE OF CONTENTS

DARKNET: Background; Method; Results; Conclusions and Future Work

HIREP: Background; Optimisation approaches; Headline results

HYDRO3D: Background; Optimisation approaches; Results

MAXWELL NEFEM: Background; Optimisation approaches

MOLPRO: Background; Optimisation

The information in this publication is provided “as is.” Dell Inc. makes no representations or warranties of any kind with respect to the information in this publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose.

Use, copying and distribution of any software described in this publication require an applicable software license.

Copyright © 2020 Dell Inc. or its subsidiaries. All Rights Reserved. Dell, Dell Technologies, EMC and other trademarks are trademarks of Dell Inc. or its subsidiaries. Intel, Intel Logo are trademarks of Intel Corporation in the U.S. and/or other countries. Other trademarks may be the property of their respective owners.

Dell Technologies believes the information in this document is accurate as of its publication date. The information is subject to change without notice.

Published in the USA 7/20.

DARKNET
Colin Sauzé (Supercomputing Wales RSE), Vladimir Khodygo (Aberystwyth Ph.D. student)

BACKGROUND
At the performance hackathon we decided to try to optimise the Darknet1 neural network framework. Darknet was written by Joseph Redmon for use with his YOLO (You Only Look Once) object-detection neural network, a leading network architecture for automatic object recognition. Darknet can run on the CPU alone or use the CUDA/cuDNN libraries on NVIDIA GPUs.

Darknet has been used for a robotic vision project at Aberystwyth University, and several other projects might potentially use it. While working with it we had observed some performance anomalies, most notably that GPU utilisation was frequently below 100%, suggesting there was room for improvement.

METHOD
On the GPU version of the code we used the NVIDIA Visual Profiler (nvvp) to measure the time spent in various tasks, and we introduced tags into the code to show how much time various functions were taking. For the CPU version we analysed the program with Intel® VTune™.
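The paper does not reproduce the tagging code itself; the sketch below shows the usual way of doing this with NVTX ranges, which appear as named spans on the nvvp timeline. The wrapped function is a hypothetical stand-in, not Darknet's actual code.

    /* Minimal sketch of NVTX tagging for nvvp; link with -lnvToolsExt.
     * forward_pass() is a hypothetical stand-in for the timed work. */
    #include <nvToolsExt.h>

    void forward_pass(void);

    void forward_pass_tagged(void)
    {
        nvtxRangePushA("forward_pass");  /* opens a named range in nvvp */
        forward_pass();                  /* the work being measured */
        nvtxRangePop();                  /* closes the range */
    }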


RESULTS
From the nvvp results we found that a lot of time was spent transferring data to the GPU and that the processing elements of the GPU were underutilised. We found that the code for copying data to the GPU was already multi-threaded, but we had been running with only one CPU core allocated, believing that the GPU was doing most of the work. Increasing the number of CPU cores allocated raised performance from 2.74 images per second with a single thread to 18.04 images per second with 16 threads, as shown in Figure 2.

1 https://github.com/pjreddie/darknet

Figure 2. Performance from increasing the thread count: images per second (more is better) against the number of threads (1, 2, 4, 8, 16).

We noticed that GPU RAM utilisation was peaking at just under 8 GB with our example dataset and decided to modify the code to double the amount of data being processed in each training batch to fill our 16 GB of GPU memory. Although specific to this particular example, a more general solution could be implemented by scaling the amount of data loaded to match the available memory. By doubling the data loaded we were able to slightly more than double the processing speed from 18.04 to 39.66 images per second. Through our mentor we tried a 32-GB V100 located at NVIDIA, which was running an older CPU architecture but allowed us to quadruple the amount of data. This increased throughput to 59 images per second.
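A general, memory-aware version of this batch scaling might query the free GPU memory at start-up and size the batch to fit. A minimal sketch, where the per-image memory estimate and the helper name are illustrative assumptions rather than Darknet values:

    /* Sketch: choose a batch size that fills the available GPU memory.
     * bytes_per_image is an assumed, workload-specific estimate. */
    #include <cuda_runtime.h>

    size_t pick_batch_size(size_t bytes_per_image, size_t reserve_bytes)
    {
        size_t free_b = 0, total_b = 0;
        cudaMemGetInfo(&free_b, &total_b);   /* free/total device memory */
        if (free_b <= reserve_bytes)
            return 1;
        return (free_b - reserve_bytes) / bytes_per_image;
    }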

Overall, these changes gave us a 21.5-fold speed improvement when using the 32-GB GPU or 14.5-fold speed improvement on the Supercomputing Wales 16-GB GPU. Figure 3 shows a graph of all these results.

Figure 3. Performance from increasing the amount of data processed in each batch of the neural network training: images per second (more is better) against the amount of data per batch (1, 2, 4, 8; the higher values are from a different system with 32 GB of GPU memory).

In the CPU version we found that Darknet used its own Basic Linear Algebra Subprograms (BLAS) implementation, so we replaced it with the Intel® Math Kernel Library (Intel® MKL). We also experimented with different compiler optimisation options.
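Darknet's convolutions reduce to a general matrix multiply, so the substitution amounts to routing that multiply through MKL's CBLAS interface. A sketch of the idea, with an illustrative wrapper rather than Darknet's actual gemm function:

    /* Sketch: delegate C = alpha*A*B + beta*C (row-major, single
     * precision) to Intel MKL instead of a hand-written triple loop. */
    #include <mkl_cblas.h>

    void gemm_mkl(int M, int N, int K, float alpha,
                  const float *A, const float *B, float beta, float *C)
    {
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    M, N, K, alpha, A, K, B, N, beta, C, N);
    }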

CONCLUSIONS AND FUTURE WORK
We were able to increase the performance of Darknet on a GPU 14.5-fold using Supercomputing Wales equipment, and to 21.5-fold using a larger-memory GPU.

We found that Darknet implements its own BLAS routines instead of using the NVIDIA library versions; additional performance gains could be made by using the NVIDIA library. Near the end of the hackathon we found an optimised fork of Darknet; due to differences in the way it operates, it was difficult to compare performance directly, but in the future we would like to compare it with the original Darknet. We would also like to compare the performance of other neural network frameworks, such as TensorFlow and PyTorch.

HIREP
Ed Bennett (Supercomputing Wales RSE), Michele Mesiti (Supercomputing Wales RSE), Sergio Chaves (Swansea Ph.D. student)

BACKGROUND
HiRep is a set of programs for performing calculations in the framework of Lattice Gauge Theory, a superset of Lattice QCD. Written in plain C99 with no compiler- or architecture-specific optimisations, it incorporates a Perl- and C++-based code generator that writes header files of macros for the matrix-matrix and matrix-vector operations specific to a particular theory (i.e., set of allowed matrix and vector dimensions). Generated macros typically have fully unrolled loops.
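As a flavour of the generated-macro style (a schematic reconstruction, not HiRep's actual generated header), a fully unrolled 2×2 matrix-vector product might look like:

    /* Schematic of a generated, fully unrolled macro in the HiRep style.
     * Names and dimensions are illustrative, not taken from HiRep. */
    #include <complex.h>

    #define MATVEC_2x2(out, m, v)                          \
        do {                                               \
            (out)[0] = (m)[0]*(v)[0] + (m)[1]*(v)[1];      \
            (out)[1] = (m)[2]*(v)[0] + (m)[3]*(v)[1];      \
        } while (0)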

For the hackathon, we chose to focus on the HMC algorithm (the most computationally demanding application) for the Sp(4) theory, which is currently a research target at Swansea University. While matrices in this theory are 4×4 complex, only 2×4 complex matrices are stored, the lower two rows being permutations of the upper two. Currently the code is compiled with icc and Intel® MPI, and it has been shown to scale very well with MPI.

OPTIMISATION APPROACHES
The first approach was a scan of the available compilers. We found that gcc 4.8.5 gave performance 10% faster than icc 2019.5 on the benchmark problem.

Profiling results showed that almost no operations were being vectorised. The roofline diagram reported by Intel® Advisor showed that the problem fit in the L3 cache and was limited by the bandwidth to that cache, with an arithmetic intensity of 0.25, but had a potential 20% speedup from vectorisation.

One reason for the poor vectorisation, identified by Intel Advisor, was a lack of trip-count information. Using profile-guided optimisation decreased the benchmark time of the Intel-compiled version by 3%, although the gcc version was unaffected.

A second reason, identified by the mentors, was the absence of loops. Since the code generator removes all loops from the generated code (historically, to avoid branching and keep the pipeline full), this was seen as a good target for optimisation. Adjusting the code generator to produce explicit loops for some simple loop macros that were previously unrolled provided a 6% performance boost to the Intel-compiled version, and a 0.5% boost to the gcc version.
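The re-rolling replaces unrolled expansions like the macro sketched earlier with an explicit loop the compiler can vectorise; schematically (again illustrative, not HiRep's generated code):

    /* Schematic of the re-rolled form: explicit loops with known trip
     * counts that the compiler is free to vectorise. */
    #include <complex.h>

    static inline void matvec_n(double complex *out,
                                const double complex *m,
                                const double complex *v, int n)
    {
        for (int i = 0; i < n; i++) {
            double complex acc = 0.0;
            for (int j = 0; j < n; j++)
                acc += m[i*n + j] * v[j];   /* rolled loop body */
            out[i] = acc;
        }
    }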

Turning our attention to the symplectic matrix-vector multiplier, we took two approaches. Firstly, we tried to take the generated unrolled code and re-roll it into a compact, more easily vectorisable loop. By the end of the hackathon this was able to deliver a 1% performance increase on the Intel-compiled version of the benchmark problem, although it doubled the execution time of the gcc-compiled version. The second approach, pursued in parallel, was to take the mathematical definition of the matrix multiplication and hand-port it to Intel AVX-512 intrinsic instructions. This currently performs equivalently to the unrolled version, but with significantly fewer instructions, and there are remaining potential improvements from prefetching.
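For flavour, such a hand-ported kernel is built from pieces like the following elementwise complex multiply, the standard AVX-512 idiom for interleaved complex data (a sketch only, not the actual symplectic multiplier from the hackathon):

    /* Sketch: multiply four double-precision complex numbers (stored
     * interleaved re,im) per __m512d vector. Standard idiom only; the
     * hackathon's symplectic kernel is more involved. */
    #include <immintrin.h>

    static inline __m512d cmul4(__m512d x, __m512d y)
    {
        __m512d x_re = _mm512_movedup_pd(x);        /* duplicate re parts */
        __m512d x_im = _mm512_permute_pd(x, 0xFF);  /* duplicate im parts */
        __m512d y_sw = _mm512_permute_pd(y, 0x55);  /* swap re/im pairs */
        /* even lanes: re*re - im*im; odd lanes: re*im + im*re */
        return _mm512_fmaddsub_pd(x_re, y, _mm512_mul_pd(x_im, y_sw));
    }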

We found a more recent branch of the code, which uses C99 complex numbers rather than a bespoke complex.h, to be 8% faster in a different but related theory (using 4×4 complex matrices); work is ongoing to port the Sp(2N) implementation to this newer branch.

Figure 4. Comparison of the progressive optimisations made to HiRep during the hackathon: time in seconds (lower is better) for the baseline (Intel), profile-guided optimization (Intel), rolled loops (Intel), gcc 4.8.5 (Intel MPI), and rolled loops (gcc).

HEADLINE RESULTS
An overall speedup of over 10% on the benchmark problem was achieved by switching compiler from icc 2019.5 to gcc 4.8.5 and rolling previously unrolled loops. Additional rolling brought the Intel performance approximately level with gcc's, and the Intel intrinsic instructions also give approximately competitive performance, with greater room for further improvement.

HYDRO3D
Pablo Ouro (Supercomputing Wales RSE), Ben Thorpe (Supercomputing Wales RSE)

BACKGROUND
Hydro3D is a finite-difference Navier-Stokes solver that permits accurate and efficient Large Eddy Simulation (LES) of turbulent flows, developed for use in the simulation of environmental fluid dynamics.

For this work we focused exclusively on two benchmarks currently being used for a journal publication in which we compare Intel Skylake with ARM ThunderX2. Benchmark one is a lid-driven cavity flow that adopts central differences to compute the fluxes; benchmark two is the Taylor-Green vortex, in which high-order schemes are adopted.

OPTIMISATION APPROACHES
Over the three days of the event, our efforts to optimise Hydro3D initially focused on two parts: (i) the 5th-order WENO scheme used to compute advection, and (ii) MPI communication.

As a first step, we compiled the code and ran the benchmarks using the TAU profiler, in order to identify the subroutines that consumed most of the simulation time. While doing this, we observed that a few very widely used subroutines were taking more time than originally thought. Supported by the mentors, we realised that the implementation of a few loops was highly inefficient, owing to IF statements inside loops, incorrect index ordering in DO loops, and one sub-optimally coded function, as sketched below.
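Hydro3D is Fortran, where arrays are column-major and the innermost loop should run over the first index; the same two pitfalls and their fixes can be sketched in C (row-major, so the fast index is the last one). This illustrates the general pattern, not Hydro3D's actual code:

    /* Illustration: hoist a loop-invariant IF out of the nest and order
     * the loops so the innermost index is the contiguous one.
     * (C is row-major; in Fortran the fast index is the first one.) */

    /* Before: invariant branch tested n*n times, strided inner access. */
    void update_slow(double *a, const double *b, int n, int add_b)
    {
        for (int j = 0; j < n; j++)
            for (int i = 0; i < n; i++)
                if (add_b)                      /* invariant test in loop */
                    a[i*n + j] += b[i*n + j];   /* stride-n access */
    }

    /* After: branch hoisted, unit-stride inner loop. */
    void update_fast(double *a, const double *b, int n, int add_b)
    {
        if (!add_b) return;                     /* decided once */
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                a[i*n + j] += b[i*n + j];       /* contiguous access */
    }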

These changes took all of our time, so no optimisation was done on the MPI communication, where we had aimed to remove the blocking MPI_SEND directives.

RESULTS
With these changes we achieved the following results:
• 2nd-order Central Differences scheme for convection: reduction of 60-70%.
• 2nd-order Central Differences scheme for diffusion: reduction of 70-80%.
• 4th-order Central Differences scheme for convection: reduction of 40-50%.
• 4th-order Central Differences scheme for diffusion: reduction of 60-70%.
• 5th-order WENO for convection: reduction of 20-40%.
• Pressure solver with multi-grid: reduction of 15-40%.

These increases in performance were tested on up to four nodes.

Overall, the benchmarks ran between 17% and 50% faster when comparing the version before the hackathon with the latest one. We can therefore say that we accomplished our initial objectives for the compute part of the code by a large margin.

The optimised performance has large implications for the Hydro3D community, as this code is one of the most heavily used on Supercomputing Wales. For instance, during 2019 we simulated the performance of tidal-stream turbine arrays using more than 12 array configurations under four flow conditions, each combination of which had to be simulated at an expense of 18,000 CPU-hours. From now on we can perform the same simulations about 25% faster, at approximately 14,000 CPU-hours each, which will help reduce the electricity bill for Supercomputing Wales.

We hope to see more events like this hackathon organised in 2020. Among the parts that still need attention are those related to fluid-structure interaction, MPI communication, the immersed boundary method, Lagrangian particle tracking and the level-set method. Moreover, we are keen to implement MPI-GPU and MPI-OpenMP schemes for the WENO subroutine and netCDF for I/O operations, and to explore performance with other compilers and compilation flags, e.g. gnu.

MAXWELL NEFEM
Mark Dawson (Supercomputing Wales RSE)

BACKGROUND
The code being considered was a discontinuous Galerkin finite element code. The code is highly flexible, allowing a large number of different types of elements. One of the challenges inherent in this code is ensuring that each processor is kept busy. Previously this was done by estimating, a priori, the length of time required to complete the processing for each element. This means, however, that a processor which finishes its work early is idle while the other processors are still busy. In addition, the code made only minimal use of AVX-512 vector instructions.

OPTIMISATION APPROACHES
The code was profiled with Arm Forge, and previously profiled runs with Scalasca and Score-P were also reviewed.

Since the code is often run either with a large number of smaller jobs or with a large number of long-running jobs, it was desirable to have some method of load-balancing operations more appropriately, to minimise processor idle time. This was done by implementing threading over the critical parts of the code using OpenMP, as sketched below.
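The load balancing comes from OpenMP's dynamic scheduling, which hands the next element to whichever thread is free instead of pre-assigning work. A minimal sketch, with a hypothetical per-element routine:

    /* Sketch: dynamic scheduling over elements of varying cost, so no
     * thread waits while others still have elements queued.
     * process_element() is a hypothetical stand-in. */
    #include <omp.h>

    void process_element(int e);

    void process_all_elements(int n_elems)
    {
        #pragma omp parallel for schedule(dynamic)
        for (int e = 0; e < n_elems; e++)
            process_element(e);   /* cost varies with element type/order */
    }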

Figure 5. Comparison of the implementations using atomic operations and OpenMP reductions for elements of order 3.

Firstly, OpenMP pragmas were added to the relevant sections, and the code was tested to ensure results were reproduced. Three approaches were then used, and the results compared for the different types of problems of relevance.
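Two of the approaches compared, accumulating shared results with atomics versus with an OpenMP reduction, can be sketched as follows (illustrative of the technique, not the NEFEM code itself):

    /* Sketch: two ways to accumulate per-element contributions. */
    #include <omp.h>

    /* (a) atomic update: threads serialise on each increment */
    double assemble_atomic(const double *contrib, int n)
    {
        double total = 0.0;
        #pragma omp parallel for
        for (int e = 0; e < n; e++) {
            #pragma omp atomic
            total += contrib[e];
        }
        return total;
    }

    /* (b) reduction: private partial sums, combined once at loop end */
    double assemble_reduction(const double *contrib, int n)
    {
        double total = 0.0;
        #pragma omp parallel for reduction(+:total)
        for (int e = 0; e < n; e++)
            total += contrib[e];
        return total;
    }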

Several examples were run for different element approximations and mesh sizes. One of the benchmarking runs is shown in Figure 5 for elements of order 3.

Similar benchmarks were run for orders 1 to 12. A good speedup was achieved using the implementation based on atomic operations for these elements. Scaling improved with larger approximation orders and was less effective at lower orders. For problems of interest, approximation order 3 is considered the most relevant case.

Many of the performance-intensive parts of the code made very minimal use of AVX-512. Achieving optimal utilisation of AVX-512 operations for this problem wasn't practical in the three-day period, as it would have involved rewriting large portions of the code and possibly sacrificing flexibility (without extensive work). Nevertheless, vectorisation was increased to almost 20% in some sections of the code that previously had none.

Following the OpenMP implementation, work was started on GPU acceleration using OpenACC pragmas. This work was not finalised during the hackathon, so performance metrics are not yet available. The work is expected to continue in early 2020, with a view to running hybrid simulations that combine MPI, OpenMP and GPU acceleration, with higher approximation orders computed on the CPU and lower orders on the GPU.
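OpenACC expresses the same loop-level parallelism as the OpenMP version but targets the GPU; a minimal sketch of the pragma style, with a placeholder loop body:

    /* Sketch: OpenACC offload of an element loop to the GPU.
     * Compile with an OpenACC compiler, e.g. nvc -acc. */
    void scale_field_acc(double *field, int n)
    {
        #pragma acc parallel loop copy(field[0:n])
        for (int e = 0; e < n; e++)
            field[e] *= 2.0;   /* placeholder for per-element work */
    }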

About Supercomputing Wales
Supercomputing Wales provides researchers across Wales with access to powerful computing facilities for science and innovation projects. The programme's facilities are used by research groups at Cardiff, Swansea, Bangor and Aberystwyth universities, along with companies and other partners working on collaborative projects. At the heart of the Supercomputing Wales initiative is a Supercomputing Centre of Excellence spearheaded by two global leaders in digital transformation, Atos and Dell Technologies. This Centre of Excellence provides Welsh researchers with a full suite of leading-edge HPC equipment, software and services. With its high-powered HPC resources, Supercomputing Wales is facilitating a step change in supercomputing activity across strategically important sectors of the Welsh economy, including nano-scale materials and advanced engineering; energy and the environment; and life sciences and health.

MOLPRO
Iakov Polyak (Supercomputing Wales RSE)

BACKGROUND
The original goal for the hackathon was to better understand and, if possible, improve the scaling of one specific subroutine in a quantum-chemistry program, Molpro. The target subroutine constitutes a core part of a larger module that calculates full configuration interaction molecular energies, using an iterative solver to diagonalize huge sparse square matrices (with typical dimensions in the tens of millions). Specifically, it performs a matrix-vector multiplication in which the sparse matrix is divided into multiple small blocks, allowing for highly load-balanced, task-based distributed-memory parallelisation. Within each task, the algorithm includes several loops in which simple operations are performed on non-zero elements generated on the fly, plus one small matrix-matrix multiplication done by a LAPACK (DGEMM) subroutine. The baseline scaling of the subroutine was considered too low on a single node, with efficiency dropping to 60% on 40 cores, given the near-total lack of communication between the parallel tasks. This was attributed to hitting the bandwidth limit, owing to the apparently memory-bound nature of the most time-consuming operations.
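The structure described, many independent tasks each ending in one small dense multiply, can be caricatured as follows (a sketch of the pattern only, with cblas_dgemm standing in for the LAPACK/BLAS DGEMM call; nothing here is Molpro's code):

    /* Sketch: the dense tail of one task in the blocked sparse
     * matrix-vector product: v_out += Hblk(m x n) * v_in. */
    #include <mkl_cblas.h>

    void apply_block(const double *Hblk, int m, int n,
                     const double *v_in, double *v_out)
    {
        /* one column on the right-hand side, so N = 1 */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    m, 1, n, 1.0, Hblk, n, v_in, 1, 1.0, v_out, 1);
    }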

OPTIMISATION
During the hackathon, under the supervision of Martyn Foster, the algorithm was re-analysed and the conclusion drawn that, while indeed memory-bound, it is latency-limited rather than bandwidth-limited; this is primarily reflected in the speed-up growing monotonically (but not linearly) with the number of cores on a given node. The code was studied to see whether it could be restructured accordingly (e.g., via data prefetching and cache blocking) to achieve better performance. While this turned out not to be straightforward, owing to the nature of the underlying physics, Martyn made several important suggestions, which I am going to try to implement in the future.
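Cache blocking, one of the restructurings considered, reorganises a traversal so each block of data is reused while it is still cache-resident; the classic illustration is a blocked transpose (a generic example, not Molpro code):

    /* Sketch: cache-blocked matrix transpose. Each BLK x BLK tile of
     * src and dst stays cache-resident while it is worked on. */
    #define BLK 64   /* tile size; tune to the cache level targeted */

    void transpose_blocked(double *dst, const double *src, int n)
    {
        for (int ib = 0; ib < n; ib += BLK)
            for (int jb = 0; jb < n; jb += BLK)
                for (int i = ib; i < ib + BLK && i < n; i++)
                    for (int j = jb; j < jb + BLK && j < n; j++)
                        dst[j*n + i] = src[i*n + j];
    }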

Importantly, though, it was noted that a significant part of the drop in efficiency can be attributed to the CPU using its turbo frequency when a small number of cores is active, a boost that is gradually reduced as the number of active cores increases. Switching turbo mode off indeed resulted in a much slower decrease in efficiency, which in that case only drops to 80% on 40 cores. This further shows that the code is not heavily bandwidth-bound and in fact displays rather satisfactory performance on one node. Further improvements will, however, be sought according to the recommendations obtained.

To learn more about Dell Technologies HPC & AI Centers of Excellence, visit DellTechnologies.com/coe.
