Vector Speedup of mkFit: Effects of Different SIMD Options & Turbo

Steve Lantz

4/17/2020

What’s the Point?

• Is it really best to measure mkFit’s vectorization scaling by increasing MPT_SIZE with the code optimized for AVX-512?
  – It means that the “serial” (MPT_SIZE=1) code is still being vectorized by the compiler, to the extent that it can do so
  – Previously we found that turning off vectorization entirely increases the serial time by about 15%
  – To make scaling tests more consistent, maybe we should also match the intermediate MPT_SIZEs (4, 8) to the right ISA extensions (SSE, AVX2)?
• There is also a question of whether AVX-512 is really faster than AVX2 when Turbo Boost is enabled and all cores are active

Compiling for the Test Runs on phi3

• The benchmark script and Makefile.config were altered to use different “vOpts” at compile time, depending on MPT_SIZE:
  -DMPT_SIZE=1  -g -O3 -qopenmp -no-vec -qno-openmp-simd
  -DMPT_SIZE=2  -g -O3 -qopenmp -march=core2    # = -xssse3
  -DMPT_SIZE=4  -g -O3 -qopenmp -march=core2
  -DMPT_SIZE=8  -g -O3 -qopenmp -march=haswell  # = -xcore-avx2
  -DMPT_SIZE=16 -g -O3 -qopenmp -xHost -qopt-zmm-usage=high
• Surprising (?) result: MPT_SIZE=4 became anomalously slow
  – Hypothesis: this is due to the lack of an FMA instruction in SSE
  – Tests were repeated with the -no-fma option for all MPT_SIZE values
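The per-size option sets above amount to a simple lookup keyed on MPT_SIZE. A minimal Python sketch of such a lookup (the names VOPTS and vopts are hypothetical, not from the actual benchmark script; the flag strings are copied from the list above):

```python
# Hypothetical lookup mirroring the vOpts table above (flag strings from the slide).
VOPTS = {
    1:  "-g -O3 -qopenmp -no-vec -qno-openmp-simd",
    2:  "-g -O3 -qopenmp -march=core2",    # = -xssse3
    4:  "-g -O3 -qopenmp -march=core2",
    8:  "-g -O3 -qopenmp -march=haswell",  # = -xcore-avx2
    16: "-g -O3 -qopenmp -xHost -qopt-zmm-usage=high",
}

def vopts(mpt_size, no_fma=False):
    """Compile options for a given MPT_SIZE; no_fma appends the -no-fma rerun option."""
    flags = f"-DMPT_SIZE={mpt_size} {VOPTS[mpt_size]}"
    return flags + " -no-fma" if no_fma else flags

print(vopts(4, no_fma=True))
```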

Vectorization Scaling Test Results on phi3

Reasons Not to Publish That Figure

• The Amdahl fit becomes worse, because FMAs add about 3% to the performance of AVX2 and AVX-512
  – Not something we want to have to explain!
• There are probably other shortcuts and improvements in the later instruction sets as well
  – The performance improvement from using them goes beyond the degree of vectorization
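The “Amdahl fit” above models vectorization speedup as Amdahl’s law with vector width in place of processor count. A minimal sketch of that model (the vectorizable fraction p = 0.7 is an illustrative value, not a fitted mkFit number):

```python
def amdahl_speedup(w, p):
    """Amdahl's-law speedup at vector width w when a fraction p of the work vectorizes."""
    return 1.0 / ((1.0 - p) + p / w)

# Illustrative curve for widths matching MPT_SIZE = 1, 4, 8, 16 (p = 0.7 is made up)
for w in (1, 4, 8, 16):
    print(w, round(amdahl_speedup(w, p=0.7), 2))
```

Gains that come from FMA or other ISA improvements fall outside this model, which is why enabling them only for the wider widths degrades the fit.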

Overall Vectorization Scaling Test Results

[Figure: two speedup plots. Before: AVX-512 (or AVX) options for all MPT_SIZEs. After: options matched to MPT_SIZE.]


Reasons Not to Publish That Figure Either

• KNL suddenly looks like it has superpowers with AVX-512
  – The real reason: only 1 of its 2 VPUs can execute AVX2 and earlier instructions
  – Again, not something we want to have to explain!
• SNB sticks out like a sore thumb due to its flat speedup at size 8
  – SNB predates AVX2, so AVX must be used for MPT_SIZE=8
  – Therefore SNB has no real FMA (though it can MUL and ADD)
  – We have never gotten more than a 2x vectorization speedup on SNB
  – To save the apology, Giuseppe and I opted to eliminate SNB from the mkFit paper

Turbo Boost for AVX2 vs. AVX-512 on phi3

Turbo  ISA ext.  Events  MEIF  Threads  GHz range  Time, s  Evt. loop, s
ON     AVX2        640    32     64     2.2-2.4      1.12      1
ON     AVX-512     640    32     64     1.9-2.0      1.13      1.02
off    AVX2        640    32     64     2.0-2.1      1.26      1.11
off    AVX-512     640    32     64     1.9-1.95     1.22      1.08
ON     AVX2       6400    32     64     2.2-2.4      8.21      8.06
ON     AVX-512    6400    32     64     1.9-2.0      8.16      8.02
off    AVX2       6400    32     64     2.0-2.1      8.86      8.69
off    AVX-512    6400    32     64     1.9-1.95     8.44      8.24
ON     AVX2       6400    32     32     2.2-2.4      9.43      9.25
ON     AVX-512    6400    32     32     1.9-1.95     9.78      9.65
off    AVX2       6400    32     32     2.0-2.1     10.6      10.44
off    AVX-512    6400    32     32     1.9-1.95     9.98      9.83

Conclusion: when all cores run mkFit, Turbo lets AVX2 perform as well as AVX-512 – intuitively, if vectors are narrower, clocks run faster, and the work gets done just as fast.
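A quick sanity check of that conclusion, using values copied from the Turbo ON, 6400-event, 64-thread rows of the table:

```python
# Midpoints of the observed clock ranges with Turbo ON and all 64 threads active
avx2_ghz = (2.2 + 2.4) / 2     # 2.3 GHz
avx512_ghz = (1.9 + 2.0) / 2   # 1.95 GHz

clock_advantage = avx2_ghz / avx512_ghz   # AVX2 clocks ~18% faster under Turbo
loop_ratio = 8.06 / 8.02                  # event-loop times, AVX2 vs. AVX-512

print(round(clock_advantage, 2), round(loop_ratio, 3))
```

The ~18% clock advantage that AVX2 enjoys under Turbo roughly cancels the residual benefit of 512-bit vectors, leaving the event-loop times within about 1% of each other.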

Backup

Results for Serial Baseline on phi3, 1 core

Options for serial run         Build time for 20 events, s
-xHost -qopt-zmm-usage=high    1.531
-no-simd -no-vec               1.738
-qno-openmp-simd -no-vec       1.760

– The more restrictive options lead to ~15% slower times
– Speedups are nearly the same if the first event is discarded
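The ~15% figure can be checked directly from the build times in the table above:

```python
t_vec = 1.531          # -xHost -qopt-zmm-usage=high
t_no_simd = 1.738      # -no-simd -no-vec
t_no_omp_simd = 1.760  # -qno-openmp-simd -no-vec

# Percentage slowdown of each restricted build relative to the vectorized baseline
slowdowns = [(t / t_vec - 1) * 100 for t in (t_no_simd, t_no_omp_simd)]
print([round(s, 1) for s in slowdowns])   # roughly 13.5% and 15.0%
```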

Speedup Curves with Different Baselines
