Intel Haswell vs. IBM Power8 review
Martin Čuma, CHPC, University of Utah

In the spring of 2016, we received a demo of an IBM Power S822LC (8355-GTA) server. This machine features two Power8 CPUs with 8 cores each clocked at 3.32 GHz, 256 GB of 1333 MHz DDR3 RAM and two NVidia Tesla K80 accelerators. It is marketed mainly for GPU accelerated technical computing, while its counterpart without the GPUs is marketed for data analytics. Details on the server are in the IBM Redpaper at http://www.redbooks.ibm.com/redpapers/pdfs/redp5283.pdf.

Each Tesla K80 consists of two Kepler GK210 GPUs, so to the user the machine appears to have four accelerators. Each GPU has 2496 CUDA cores, a base clock of 560 MHz, a top clock of 875 MHz and 12 GB of GDDR5 RAM; the full K80 board has a top double and single precision performance of 1864 and 5591 GFlops, respectively.
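As a back-of-the-envelope check (assuming 2 FLOPs per FMA per CUDA core and the GK210's 1:3 double-to-single precision rate), the peak numbers for a full K80 board at the 560 MHz base clock work out to:

    $$ \mathrm{SP_{peak}} = 4992 \times 0.560\,\mathrm{GHz} \times 2 \approx 5591\ \mathrm{GFlops}, \qquad \mathrm{DP_{peak}} = \mathrm{SP_{peak}}/3 \approx 1864\ \mathrm{GFlops} $$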

The Power8 CPU is manufactured on a 22 nm SOI process and contains 8 cores. For a good description and review of the processor, see http://www.anandtech.com/show/9567/the-power-8-review-challenging-the-intel-xeon-. The processor features several innovations, including in memory access (intelligent L3 cache, L4 cache and transactional memory) and the Coherent Accelerator Processor Interface (CAPI), which allows for more direct incorporation of accelerators. However, as we will see later, in the current version this does not include NVidia's GPUDirect; NVLink support will come with the Pascal series of GPUs.

We compare the Power8 CPU to the previous generation Intel Xeon, code named Haswell-EP. It represents the "tock" phase of Intel's tick-tock cycle, in which new features and a new microarchitecture are introduced on an existing manufacturing process; it is made on the 22 nm process. Just recently (3/31/2016), Intel released the follow-up CPU, code named Broadwell-EP, which is a 14 nm process shrink of Haswell with a few new features. According to reviews (http://www.anandtech.com/show/10158/the-intel-xeon-e5-v4-review), the performance of Broadwell is not very different from Haswell, so using Haswell for comparison is still appropriate. We use the 12 core Intel Xeon E5-2680 v3 @ 2.50 GHz in a dual socket configuration, as featured on the CHPC Kingspeak cluster.

The table below compares the Power8 with the Xeon E5-2680 v3.

                                     Power8                    Haswell-EP
Core count                           8                         12
Base clock speed                     3.758 GHz                 2.5 GHz
Turbo clock speed                    4.123 GHz                 3.3 GHz
Max sustained instructions per core  8                         6 (4)
Vector unit size                     128 bit                   256 bit
SMT (hyperthreads) per core          8                         2
L1-I/L1-D cache                      32 kB/64 kB               32 kB/32 kB
L2 cache                             512 kB/core               256 kB/core
L3 cache                             8 MB eDRAM/core           2.5 MB SRAM/core
L4 cache                             16 MB eDRAM per module    none
Theoretical memory bandwidth         204 GB/s                  102 GB/s
Power (TDP)                          190 W / 247 W (Turbo)     120 W

Leaving aside the finer architectural details, the table suggests that the Power8's strengths lie in its clock speed, its memory subsystem (caches and memory bandwidth), and its greater simultaneous multithreading (SMT) capability. The Haswell's strengths are its wider vector units, higher core count and lower power consumption.
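As a rough yardstick for the HPL results later on, the table translates into per-socket peak double precision rates roughly as follows (a sketch assuming 16 DP FLOPs/cycle on the Haswell, i.e. two AVX2 FMA units of 4 doubles each, and 8 DP FLOPs/cycle on the Power8, i.e. two 2-wide VSX FMA pipes; sustained clocks under full vector load are somewhat lower):

    $$ \mathrm{Haswell{-}EP:}\ 12 \times 2.5\,\mathrm{GHz} \times 16 = 480\ \mathrm{GFlops}, \qquad \mathrm{Power8:}\ 8 \times 3.758\,\mathrm{GHz} \times 8 \approx 240\ \mathrm{GFlops} $$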

Price-wise, according to the Anandtech article referenced above, the Power8 CPUs should be cheaper than the Xeons, but the choice of memory makes the Power8 systems overall more expensive than the Intel machines - at least for the machines coming from IBM. The article argues that Power8 systems from vendors that choose more standard memory may be priced the same as or cheaper than the Intel systems.

Operating system and software stack

We installed CentOS 7 ppc64le (little endian) on the system, which to an ordinary user behaves very similarly to CentOS 7 on x86-64. On the software stack side, we were given a trial license for the IBM XL compilers and the ESSL library. We also installed the free IBM Advance Toolchain (AT) 8.0 and 9.0, which is a port of the GNU toolchain to PowerLinux together with optimizations for the platform, and NVidia CUDA 7.5.

Simple programs and libraries, such as MPICH, HPL and the NAS benchmarks, built fairly straightforwardly with the AT GNU or XL compilers. Mixing in CUDA revealed a minor problem: the latest AT 9.0 uses GNU 5.3.1, while CUDA 7.5 only supports GNU up to 4.9. We therefore also had to install AT 8.0, which uses GNU 4.9.4, in order to use it with CUDA. More complex codes gave more trouble. Also note that we did not spend a lot of time trying different compiler optimization options; rather, we used options that generally give good performance on the Haswell, and those suggested by IBM's Optimization and Tuning Techniques for the Power8, accessible at http://www.redbooks.ibm.com/redbooks/pdfs/sg248171.pdf.

We did not succeed in building the AMBER molecular dynamics code with the ESSL library for BLAS/LAPACK, since ESSL does not include the full LAPACK. This was a surprise, since these days vendor libraries such as MKL or ACML package the full LAPACK. Furthermore, it was not possible to link the reference LAPACK against ESSL, since ESSL is also missing certain BLAS utility functions, such as "xerbla". This could be overcome by hand-including the missing functions from the reference BLAS sources, but at that point we figured it was not worth the time and went with the full reference LAPACK and BLAS as supplied by AMBER.
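For completeness, the workaround we decided against is small: the missing utility routines could be supplied as stubs (or taken from the netlib reference sources) and linked alongside ESSL. A minimal sketch of such a stub in C - hypothetical, not what we ultimately used, and note that the hidden string-length argument convention varies between Fortran compilers:

    /* xerbla_stub.c - minimal replacement for the reference BLAS error handler,
     * for linking codes that expect xerbla_ against a library (here ESSL) that
     * does not provide it.  Sketch only. */
    #include <stdio.h>
    #include <stdlib.h>

    /* Fortran passes the routine name and the parameter number by reference,
     * plus a hidden length argument for the character string. */
    void xerbla_(const char *srname, const int *info, int srname_len)
    {
        fprintf(stderr, "** On entry to %.*s parameter number %d had an illegal value\n",
                srname_len, srname, *info);
        exit(1);
    }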

An additional problematic piece was the fact that the IBM XL compilers do not add an underscore suffix to Fortran symbols by default, something that most Linux compilers do. There is a flag, -qextname, which tells the compiler to add the underscore, but it turned out to be complicated to make sure it is applied only to Fortran calls and not to C code (e.g., it would require changes in the Amber configure scripts to build NetCDF-C without underscoring and NetCDF-F with it). Using no underscoring was not an option either, since Amber has several functions implemented in both C and Fortran with the same name (e.g. min), differentiated only by the trailing underscore, which broke the compilation.
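To illustrate the mangling issue, consider a C routine meant to be called from Fortran (a minimal sketch with a hypothetical name):

    /* c_helper.c - a C routine intended to be called from Fortran as
     * "call get_value(x)".  gfortran (and xlf with -qextname) resolves that
     * call to the symbol get_value_, while xlf without -qextname looks for
     * get_value - so every C and Fortran object in a mixed-language package
     * such as Amber must agree on one convention. */
    void get_value_(double *x)
    {
        *x = 42.0;   /* Fortran passes arguments by reference */
    }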

At that point we abandoned the AMBER build with the XL compilers and used only GCC (AT 8.0 = GCC 4.9.4).

After this experience, we opted to use the GNU compilers for the other real applications that we benchmarked.

Raw and synthetic performance benchmarks

STREAM benchmarks

As usual, we first take a look at the memory throughput, which gains importance as the number of cores per CPU increases. The STREAM benchmark tests the bandwidth from the CPU to the main memory by performing four different operations on large sequential data arrays. We compiled STREAM with the Intel 2016.2 compiler on the Haswell, and with GCC 5.3.1 (AT 9.0) and XLC 13.1.4 on the Power8. Figure 1 shows the results. We also add a few previous Intel CPU generations to the mix: the Westmere (CHPC's Ember cluster) and the Sandy Bridge (CHPC's early Kingspeak cluster).
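For reference, the core of the benchmark is a set of simple loops over arrays much larger than the caches; the Triad kernel reported below is essentially the following (a minimal sketch with OpenMP threading, omitting the timing and validation that STREAM itself performs):

    /* Minimal sketch of the STREAM Triad kernel, a[i] = b[i] + q*c[i].
     * STREAM_N must be much larger than the last-level caches; here three
     * double arrays of 40 million elements take about 0.96 GB. */
    #define STREAM_N 40000000L

    static double a[STREAM_N], b[STREAM_N], c[STREAM_N];

    void triad(double q)
    {
        #pragma omp parallel for
        for (long i = 0; i < STREAM_N; i++)
            a[i] = b[i] + q * c[i];
    }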

Figure 1a. STREAM Copy results (bandwidth in MB/s vs. number of cores; Westmere, Sandy Bridge, Haswell and Power8).

Figure 1b. STREAM Triad results (bandwidth in MB/s vs. number of cores; Westmere, Sandy Bridge, Haswell and Power8).

Due to the busy nature of the many-core results, we only show plots for two of the four STREAM benchmarks; the other two behave similarly to Triad. The Haswell chip provides a significant improvement over its predecessors in memory access speed. Between the two current CPUs the picture is fairly mixed: at lower thread counts the Power8 leads, in the middle range the Haswell shows better performance, and at higher thread counts - which is where most applications would run, utilizing all the physical cores - the performance is about the same.

We therefore do not see the twofold memory bandwidth advantage of the Power8 that the references above claim; either they quote a different characteristic or the STREAM benchmark does not capture it.

High Performance Computing Challenge (HPCC) benchmark

The HPCC benchmark is a synthetic benchmark suite aimed at assessing HPC performance from different angles. It consists of seven main benchmarks that stress various computer subsystems, such as raw floating point performance, memory access and communication. For a detailed description of the benchmark see http://icl.cs.utk.edu/hpcc/.

We built HPCC with MPICH 3.2 and the Intel 2016.2 compilers and MKL on the Haswell, and with either GCC 5.3.1 or xlc 13.1.4/xlf 15.1.4 and ESSL on the Power8. The GCC compiler on the Power8 performed the same as or better than the XL compilers, so we only report the GCC results.

In Table 1 we compare the results; note that the table is split since it would otherwise be too wide for the page. There are several angles from which to look at the results.

The first is obviously the comparison of the two CPUs. The Haswell leads in every aspect except the MPIFFT. I speculate that the MPIFFT lead is due to the better memory cache structure of the P8.

The second is the effect of simultaneous multithreading (SMT). The Intel chip allows 2 logical threads per physical core, while the Power8 allows 8. SMT is useful for hiding memory latencies, so it helps most when the application's memory access is irregular. Most of the HPCC benchmarks exhibit a regular memory access pattern at least to some degree, so only the MPIFFT and PTRANS benchmarks seem to show some benefit from SMT, and that mainly on the Haswell CPU. I have to admit that I do not understand the behavior of PTRANS on the Power8, where it performs very poorly at 16 processes, both with the GCC and XL compilers. The bottom line is that, at least for HPCC, SMT provides little or no benefit.

Finally, we look at the maximum raw performance. Benchmarks with cache-friendly memory behavior, such as HPL or DGEMM, perform best at a process count equal to the number of physical cores. Those with more irregular memory access, such as the FFT, benefit from SMT. Table 3 summarizes these best results. The HPL shows the Haswell to be significantly (about 70%) faster. Most of the other benchmarks show more comparable performance.

                          Hasw 48p   Hasw 24p   Hasw 12p
HPL_Tflops                  0.6701     0.7299     0.3654
StarDGEMM_Gflops           15.121     31.834     32.748
SingleDGEMM_Gflops         23.117     41.721     41.783
PTRANS_GBs                 10.267      7.392      2.685
MPIRandomAccess_GUPs        0.022      0.027      0.016
StarRandomAccess_GUPs       0.007      0.026      0.025
SingleRandomAccess_GUPs     0.040      0.078      0.066
StarSTREAM_Triad            1.719      2.547      2.111
SingleSTREAM_Triad         12.814     12.928     13.063
StarFFT_Gflops              0.848      1.530      1.264
SingleFFT_Gflops            1.899      2.384      2.418
MPIFFT_Gflops              14.854      8.530      5.166

                          P8 128p    P8 48p     P8 16p     P8 8p
HPL_Tflops                  0.2166     0.2497     0.4253     0.2118
StarDGEMM_Gflops            1.813      7.606     27.509     27.639
SingleDGEMM_Gflops          5.080     11.323     28.377     28.049
PTRANS_GBs                 13.060     13.444      1.888      4.634
MPIRandomAccess_GUPs        0.013      0.030      0.032      0.017
StarRandomAccess_GUPs       0.003      0.007      0.014      0.016
SingleRandomAccess_GUPs     0.013      0.015      0.034      0.034
StarSTREAM_Triad            1.072      3.408     11.303     11.312
SingleSTREAM_Triad          3.559      5.265     31.910     31.881
StarFFT_Gflops              0.347      0.883      1.660      1.781
SingleFFT_Gflops            0.764      1.041      2.593      2.548
MPIFFT_Gflops              16.251     13.403     16.976     10.774

Table 1. HPCC results, the higher the value the better

HPL_Tflops               High Performance Linpack benchmark - the one used for the Top500 -
                         measures the floating point rate of execution for solving a linear
                         system of equations.
StarDGEMM_Gflops         Parallel DGEMM - measures the floating point rate of execution of
                         double precision real matrix-matrix multiplication.
SingleDGEMM_Gflops       Serial DGEMM - on a single processor.
PTRANS_GBs               Parallel matrix transpose - exercises the communications where pairs
                         of processors communicate with each other simultaneously. It is a
                         useful test of the total communications capacity of the network.
MPIRandomAccess_GUPs     MPI parallel random access.
StarRandomAccess_GUPs    UPC parallel random access - measures the rate of integer random
                         updates of memory (GUPS).
SingleRandomAccess_GUPs  Serial random access.
StarSTREAM_Triad         Parallel STREAM - a simple synthetic benchmark that measures
                         sustainable memory bandwidth (in GB/s) and the corresponding
                         computation rate for a simple vector kernel.
SingleSTREAM_Triad       Serial STREAM.
StarFFT_Gflops           Parallel FFT - measures the floating point rate of execution of a
                         double precision complex one-dimensional Discrete Fourier Transform
                         (DFT).
SingleFFT_Gflops         Serial FFT.
MPIFFT_Gflops            MPI FFT.

Table 2. HPCC benchmark descriptions

                          Haswell value   Power8 value   Hasw/P8
HPL_Tflops                   0.730          0.425         1.716
StarDGEMM_Gflops            32.748         27.639         1.185
SingleDGEMM_Gflops          41.783         28.377         1.472
PTRANS_GBs                  10.267         13.444         0.764
MPIRandomAccess_GUPs         0.027          0.032         0.843
StarRandomAccess_GUPs        0.026          0.016         1.566
SingleRandomAccess_GUPs      0.078          0.034         2.270
StarSTREAM_Triad             2.547         11.312         0.225
SingleSTREAM_Triad          13.063         31.910         0.409
StarFFT_Gflops               1.530          1.781         0.859
SingleFFT_Gflops             2.418          2.593         0.933
MPIFFT_Gflops               14.854         16.976         0.875

Table 3. HPCC highest performance comparison.

HP Linpack

To assess the value of the commercial IBM software tools, we also looked at open source BLAS alternatives to the IBM ESSL library. We compiled the HPL benchmark separately with ESSL, OpenBLAS and ATLAS, the latter two being open source libraries with Power8 support. Finally, we also tried the stock BLAS that comes with CentOS 7. Table 4 summarizes the performance we obtained. We set the problem size to N = 20,000, which corresponds to 3.2 GB of used RAM, less than the 64 GB we used above for the HPCC, hence the slightly lower result for ESSL (379.4 vs. 425.3 GFlops).
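For reference, the memory footprint follows from the N x N double precision matrix that HPL factors:

    $$ \mathrm{mem} \approx 8N^2\ \mathrm{bytes} = 8 \times 20000^2 = 3.2\ \mathrm{GB}, \qquad N \approx \sqrt{64 \times 10^9 / 8} \approx 89{,}000 \ \text{for a 64 GB run.} $$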

The open source alternatives, ATLAS and OpenBLAS, provide only about half the performance of IBM's ESSL library, so they do not appear to be that well tuned for the Power8 CPU. The stock CentOS 7 BLAS (essentially the reference BLAS with no CPU architecture tuning) reaches about 10% of ESSL, which illustrates why it should not be considered for any HPC application.

BLAS Library           GFlops
ESSL                   379.4
ATLAS                  191.1
OpenBLAS               190.5
Stock CentOS BLAS       35.7

Table 4. HPL performance with different BLAS libraries

NAS Parallel Benchmarks

The NAS Parallel Benchmarks are a set of programs derived from computational fluid dynamics (CFD) applications. Some basic information about the benchmarks is at https://en.wikipedia.org/wiki/NAS_Parallel_Benchmarks. Each benchmark can be run with different problem sizes: Class A is a small problem, Class B is medium, Class C is large, and Class D is very large (needing about 12 GB of RAM). There are also even larger Classes E and F, but neither of them fit into the system's memory. We ran Classes A-C and present results for Class C. In general, we used fairly aggressive and mostly comparable optimization flags to build the NAS codes. On the Haswell, we used the Intel 2016.2 compiler with flags "-O3 -ipo -openmp -axCORE-AVX2". With the XL compiler on the Power8 we used "-qsmp=omp -qarch=pwr8 -O3 -qhot -qipa -qtune=pwr8:balanced", and with GCC we used "-O3 -fomit-frame-pointer -mtune=native -fstrict-aliasing -fno-schedule-insns -ffast-math -mcpu= -mpower8-fusion -mpower8-vector -mdirect-move -mvsx -fopenmp".

All the NAS benchmark plots compare the performance in Mops/sec or Mops/sec/thread. Since we are interested in the maximum performance of the whole multi-core machine, and also in evaluating the SMT capabilities, below we report Mops/sec; the higher the Mops/sec, the better. We split the benchmarks into two graphs by their Mops/sec magnitude for easier comparison.

First, in Figures 5 and 6, we look at the SMT performance. On the Haswell (Figure 5), the SMT results are quite mixed, ranging from a modest improvement for FT, IS and EP all the way to a significant slowdown for LU. On the Power8, the situation is quite different: SMT provides at least some speedup in all cases except IS, and oftentimes the speedup is considerable.

In Figure 6 we also compare the performance of the XL and GNU compilers. Except for a few cases (SP, BT, IS), the XL compiler produces somewhat faster executables, although we have to admit that we did not spend any time adjusting the compilation flags, which might have made at least some difference to the performance.

Figure 5a. NAS FT, MG, SP, LU and BT benchmarks on the Haswell at 16, 24 and 48 threads (Mops/sec).

Figure 5b. NAS IS, EP, CG and UA benchmarks on the Haswell at 16, 24 and 48 threads (Mops/sec).

Figure 6a. NAS FT, MG, SP, LU, BT and CG benchmarks on the P8 at 8-128 threads, XL and GNU compilers (Mops/sec).

Figure 6b. NAS IS, EP and UA benchmarks on the P8 at 8-128 threads, XL and GNU compilers (Mops/sec).

Next we turn to a direct comparison of the P8 and the Haswell. In Figure 7 we do that for the largest SMT thread count (128 on the P8, 48 on the Haswell), for the physical core count (16 on the P8, 24 on the Haswell), and on one core, to evaluate single core performance. On the P8 we use the XL compiler as it gave better results overall. When considering SMT, the Power8 does quite well here: it beats the Haswell on all benchmarks except BT, IS and EP. Per-core performance of the P8 is also better; I would speculate that this is thanks to faster memory access on a single core and a smaller ratio of FMA instructions (after examining some of the Haswell binaries in the Intel Advisor XE vectorization tool, which shows decent vectorization).

Figure 7a. NAS FT, MG, SP, LU and BT benchmarks on the P8 (XL) and the Haswell (Intel) at 1 thread, the physical core count and the maximum SMT thread count (Mops/sec).

Figure 7b. NAS IS, EP, CG and UA benchmarks on the P8 (GNU) and the Haswell (Intel) at 1 thread, the physical core count and the maximum SMT thread count (Mops/sec).

Synthetic benchmarks conclusion

The synthetic benchmarks suggest that the Intel chip outperforms the P8 at dense linear algebra, while for other workloads the picture is more mixed. Staying purely open source on the P8 is possible without a big reduction in performance, although especially for the Fortran based NAS benchmarks the XL compiler tends to outperform gfortran. Also, for dense linear algebra using BLAS and LAPACK, IBM's ESSL significantly outperforms the open source alternatives.

Real application benchmarks

VASP

VASP is a plane wave electronic structure program that is widely used in solid state physics and materials science, and CHPC has several heavy users of it. We compiled VASP with the Intel 2016 compilers on the Haswell and with GNU 4.9.4 on the Power8. Note that since 2015 VASP also has a GPU version, but I did not succeed in building it on the Power8 with the GNU compilers due to the compiler complaining about missing references, probably because of inappropriate preprocessor flags, as the VASP distribution does not come with build options for GNU with CUDA. The CHPC production systems have the GPU version built with the Intel compilers.

We present two benchmarks of semiconductor based systems, Si and SiO, the SiO being several times larger. The smaller system is slowly becoming less relevant as both the hardware and the software improve, so in our discussion we focus on the larger problem. As with the HPCC, we include in Table 5 results we obtained on previous processor generations, though beware that the older CPUs were run with an older VASP version that was potentially less optimized. The results are runtimes in seconds, so the smaller the number the better.

VASP is very memory and compute intensive and it uses linear algebra libraries underneath, which lends it a fair amount of vectorization. The Intel Haswell chip is 2.35x faster per core than the Westmere-EP from three generations back, and 4.2x faster when accounting for all the cores. The Power8 is 33% slower on one core and 25% slower for the whole node, which is reasonable for a code that is not as completely vectorized as the HPL (where the Haswell is about 75% faster). So, overall, albeit slower, the Power8 is not doing too badly here. SMT does not help; the performance peaks at the physical core count.

(Si 12 layer, 24 at., 16 kpts, 60 bnds)
                     1 CPU    2 CPU    4 CPU    8 CPU   12 CPU   16 CPU   24 CPU   32 CPU   48 CPU
Westmere-EP 2.8 *   233.49   123.05    68.79    51.73    47.13    57.08    65.75      -        -
Haswell 2.5         118.02    56.70    34.58    22.13    20.48    15.74    27.06      -      15.12
Power8 3.x          197.69   103.84    57.90    36.77    27.67    24.56    32.36    43.10      -

(Si192+O, 4 kpts, 484 bnds)
                     1 CPU    2 CPU    4 CPU    8 CPU   12 CPU   16 CPU   24 CPU   32 CPU   48 CPU
Westmere-EP 2.8 *   999.36   514.66   330.20   210.14   175.22   199.77   210.59     -        -
Haswell 2.5         424.72   187.93   116.83    76.69    66.32    57.79    41.52     -      55.89
Power8 3.x          565.68   277.84   150.44    84.35    61.88    52.10    58.96    70.07     -

* run with VASP 4.x

Table 5. VASP performance (time in seconds, lower is better).

AMBER

AMBER is a molecular dynamics program used extensively by many research groups. It has both CPU and GPU support, and of particular interest was how well AMBER performs in the multi-GPU configuration of the Power8 server. The Amber build problems were described in the initial section of this document; here we present results of the official Amber benchmarks for Amber 14 built with GCC 4.9.3 from the IBM Advance Toolchain 8.0 and MPICH 3.2. On the Haswell, we used our production build with the Intel 2015 compilers and Intel MPI.

In Table 6 we show selected Amber benchmarks, with results in nanoseconds of simulation per day (ns/day), i.e. higher is better. First, we look at the CPU performance, comparing the speed of the 24 core Haswell node to that of the 16 core Power8 server. The Power8 does not do very well here; we presume this is due to hand optimizations of the Amber kernels for x86-64 vectorization, and potentially a less optimized FFTW on the Power8 (ESSL contains limited FFTW compatibility, but, as mentioned above, we were not successful in adding ESSL to the Amber build). The lack of an optimized BLAS should not play a major role, as the main kernels should not use it; the same applies to MPI, which should perform very similarly on a shared memory machine.

Amber runs faster on GPUs, and since the Power8 server has a good GPU configuration, its GPU performance was of particular interest. We compare the Power8 server's performance to the results published at http://ambermd.org/gpus/benchmarks.htm#Benchmarks for the NVidia Tesla K80 accelerator. On a single GPU, the performance is quite comparable - slightly lower on the Power8, probably because we did not turn on the clock boost. On two and more GPUs, we see a dramatic performance decrease. This is because the Power8 system does not support GPU Direct communication between the GPUs.

Amber 14 benchmark                             HW CPU  P8 CPU  P8 1 GPU  1x K80 publ  P8 2 GPU  2x K80 publ  P8 4 GPU  4x K80 publ
JAC_PRODUCTION_NVE - 23,558 atoms PME 4fs       69.42   15.02    223.19       229.29    109.58       334.05    186.17       423.69
JAC_PRODUCTION_NPT - 23,558 atoms PME 4fs       69.99   16.27    219.35       221.76    176.14       319.30     25.91       395.74
FACTOR_IX_PRODUCTION_NVE - 90,906 atoms PME      9.77    2.06     33.96        32.99     32.49        51.56      3.92        62.49
FACTOR_IX_PRODUCTION_NPT - 90,906 atoms PME      9.49    1.87     31.96        31.94     20.07        49.15      9.49        58.40
CELLULOSE_PRODUCTION_NVE - 408,609 atoms PME     1.98    0.55      7.34         7.61      3.58        12.18      5.79        16.41
CELLULOSE_PRODUCTION_NPT - 408,609 atoms PME     1.98    0.50      7.28         7.37      5.55        11.48      1.44        15.17

Table 6. Selected Amber 14 benchmarks on the Haswell (HW), the Power8 server (P8) and the published K80 results (K80 publ), in ns/day; higher is better.
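The lack of peer-to-peer transfers between the GPUs (the basis of intra-node GPU Direct) can be confirmed with a small query of the CUDA runtime; a minimal sketch of such a check, using standard CUDA runtime API calls, is below:

    /* p2p_check.c - query whether each pair of GPUs can access each other's
     * memory directly (CUDA peer-to-peer). */
    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        int ndev = 0;
        cudaGetDeviceCount(&ndev);
        for (int i = 0; i < ndev; i++) {
            for (int j = 0; j < ndev; j++) {
                if (i == j) continue;
                int can = 0;
                cudaDeviceCanAccessPeer(&can, i, j);
                printf("GPU %d -> GPU %d : peer access %s\n", i, j,
                       can ? "supported" : "NOT supported");
            }
        }
        return 0;
    }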

Another problem we had when running the Amber benchmarks was occasional segmentation faults, which tended to occur after several hours of running at full load. Restarting the same calculation then ran without an error. We therefore suspect potential hardware issues, possibly overheating, although the GPU temperatures seemed to stay in the appropriate range. A more detailed investigation would be needed to find the cause of these problems.

GRMAG3DTOPO

GRMAG3DTOPO is a geophysical inversion program that CHPC develops in collaboration with the Consortium for Electromagnetic Modeling and Inversion (CEMI). The program inverts gravitational and magnetic field data. All the data are recomputed at every iteration in a series of heavily nested loops, with a case statement at the very center of these loops that decides which of ca. 60 different computational kernels to use, depending on what kind of data is being inverted. Due to this case statement, the code does not vectorize well. On the other hand, there is very little communication, and the deeply nested computational kernel can be threaded at a computationally efficient level, so the code scales very well both across MPI processes and OpenMP threads.
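Schematically, the compute structure resembles the sketch below (illustrative only, with hypothetical names and made-up kernel formulas; the real code has ca. 60 kernel variants and additional loop levels). The switch in the inner loop is what defeats vectorization, while the outer loop over data points parallelizes cleanly:

    /* Illustrative sketch of a GRMAG3DTOPO-style inversion loop structure
     * (hypothetical names; only two of the ~60 kernel variants shown). */
    #include <math.h>

    static double gravity_kernel(double cell, double obs)  { return cell / (1.0 + obs * obs); }
    static double magnetic_kernel(double cell, double obs) { return cell * exp(-fabs(obs)); }

    double invert_step(int ndata, int ncells, const int *data_type,
                       const double *cell, const double *obs)
    {
        double misfit = 0.0;                     /* accumulation kept in double precision */
        #pragma omp parallel for reduction(+:misfit)
        for (int d = 0; d < ndata; d++) {        /* loop over data points - OpenMP/MPI */
            double pred = 0.0;
            for (int c = 0; c < ncells; c++) {   /* loop over model cells */
                switch (data_type[d]) {          /* kernel choice at the loop center
                                                    prevents effective vectorization */
                case 0: pred += gravity_kernel(cell[c], obs[d]);  break;
                case 1: pred += magnetic_kernel(cell[c], obs[d]); break;
                /* ... ca. 60 kernel variants in the real code ... */
                }
            }
            misfit += (pred - obs[d]) * (pred - obs[d]);
        }
        return misfit;
    }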

Runtime [s]    Haswell/PGI     P8/GNU
SP-DP              1597.14    8902.26
DP                 1549.95   10580.46

Table 7. GRMAG3DTOPO runtimes for mixed single-double precision (SP-DP) and double precision (DP).

In Table 7 we list runtimes for our benchmark model, a gravity gradient survey of the Vredefort crater in South Africa, which contains ca. 5.6 million cells and 160,000 data points. The code supports mixed single-double precision, calculating the kernels in single precision but doing the accumulation in double precision. Since the code does not vectorize, the scalar FPU is used on the Intel platform, which results in no speedup from the SP-DP approach there. We do see a 40-60% speedup in the GPU implementation, which we cannot build on the Power8 as we are not aware of any OpenACC compiler for PowerLinux; however, the Portland Group reports that they are planning a beta version of their compiler for July 2016.

On the Power8 system, the SP-DP version performs 10% faster than the DP version, which suggests some differences in the scalar FPU compared to the Intel processor. However, disappointingly, the Power8 server is 5.5x slower overall. Without profiling the code with advanced tools, such as Intel's VTune, it is hard to speculate on the reason.

Conclusions

The Power8 CPU is billed as a major challenger to Intel's CPU dominance, and our comparison with the second most recent Intel CPU backs up that claim to a certain extent, at least in the synthetic benchmarks. Unfortunately, with the production applications the picture for the P8 is less rosy. We suspect this is largely due to hand optimizations of the codes for the Intel platform, which are lacking for the Power8. Overall, the limited software stack support may be the largest detriment to deployment of the Power platform. More available tools (e.g. the PGI compilers) and increased penetration of the platform should help with this.

Another disappointment was the lack of GPU Direct support in the tested Power8 server. For a machine marketed as a GPU accelerated server, this is a serious issue. The upcoming Power8 servers with NVidia Pascal GPUs are supposed to include GPU Direct, which should be a major improvement.

While a good start, we hope that this first generation of Power CPUs and servers aimed at the commodity Linux market will provide feedback for the next generations of products and improve their competitiveness with the x86-64 world.

Acknowledgements

I would like to thank Guy Adams and the rest of the systems team for setting up the test machine.