SequenceL Optimization Results on IBM POWER8
June 2016

This report compiles the final performance results from Texas Multicore Technologies (TMT) optimizing the auto-parallelizing SequenceL multicore programming tool set for the IBM POWER8 platform.

SequenceL is a compact, powerful language and auto-parallelizing tool set that quickly and easily converts algorithms to robust, massively parallel code. SequenceL was designed to work in concert with existing programming languages and legacy code, making it a practical tool for software engineers modernizing code for current platforms. Customers have nicknamed it “MATLAB on steroids” because it lets scientists and engineers easily explore different algorithms and innovations, then quickly convert them to robust, performant production code that runs on a multitude of modern hardware platforms.

This effort included:
• Performance optimizations in both the SequenceL runtime environment and the code generated by the SequenceL compiler to utilize SIMD (AltiVec, VSX) vectorization and improve cache behavior on POWER8 systems.
• Ensuring SequenceL runs with recent open source library versions that include POWER8 optimizations, including gcc 4.9+ and glibc 2.20+.
• Testing performance and correctness with the Advanced Toolchain for PowerLinux v8 and v9.
• Running and testing SequenceL on multiple POWER8 Linux distributions, including:
  o RHEL 7.2
  o CentOS 7
  o Ubuntu 14.04, 15.10, and 16.04
• Running TMT performance benchmarks on multiple IBM POWER8 and Intel platforms.

Comparisons of POWER8 to Intel x86 Xeon v3

After completing the optimization work on IBM POWER8, TMT ran its suite of heatmap programs on two POWER8 configurations and three Intel x86 “Haswell” configurations for comparison and verification purposes. TMT uses these programs, which are written in SequenceL, to stress the hardware platforms and ensure the SequenceL tools operate optimally on them. An overview of the tested server configurations is below; results follow on subsequent pages.

Tested server configurations (all running RHEL 7.2 Linux):
• IBM S822: POWER8E 3.6 GHz, 20 cores (160 threads), 256 GB memory, 1 TB disk
• IBM S824: POWER8E 4.32 GHz, 16 cores (128 threads), 512 GB memory, 780 GB SSD
• Dell R730: Intel Xeon E5-2687W v3 3.10 GHz, 20 cores (40 threads), 256 GB memory, 200 GB SSD
• Dell R730: Intel Xeon E5-2699 v3 2.3 GHz, 36 cores (72 threads), 256 GB memory, 200 GB SSD
• Dell R730: Intel Xeon E5-2650 v3 2.3 GHz, 20 cores (40 threads), 256 GB memory, 200 GB SSD


Final Results, IBM POWER8 vs. Intel x86

Matrix Multiply

[Chart: Matrix Multiply run time in seconds vs. thread count (1, 2, 4, 8, 16, 20, 40, 80, 140, 160) for the 3.6 GHz POWER8, 4.3 GHz POWER8, Xeon 2687 v3, Xeon 2699 v3, and Xeon 2650 v3 systems]

Matrix Multiplication of two 2000 X 2000 matrices of double precision floating point numbers. Less time is better.
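The report does not include the SequenceL source for the benchmarks, so as a rough illustration of the computation being timed, here is a minimal sequential C++ sketch of a dense double-precision matrix multiply. The 2000 X 2000 size comes from the caption; the row-major layout and function name are assumptions, and the SequenceL-generated code is parallelized and vectorized rather than a plain triple loop.

```cpp
#include <vector>

// Hypothetical reference kernel: C = A * B for N x N matrices stored
// row-major in flat vectors. This only illustrates the arithmetic being
// measured; it is not the SequenceL-generated parallel code.
void matmul(const std::vector<double>& A, const std::vector<double>& B,
            std::vector<double>& C, int N) {
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
            double sum = 0.0;
            for (int k = 0; k < N; ++k)
                sum += A[i * N + k] * B[k * N + j];
            C[i * N + j] = sum;
        }
}
```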

Game of Life

[Chart: Game of Life run time in seconds vs. thread count (1–160) for the two POWER8 and three Xeon v3 systems]

Game of Life calculation on a 2000 X 2000 board. Stresses the memory system and integer arithmetic. Less time is better.
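For readers unfamiliar with the benchmark, the following is a hedged C++ sketch of one Game of Life generation on an N X N board; the boundary rule (cells outside the board treated as dead) is an assumption, as the report does not state it.

```cpp
#include <vector>

// Hypothetical reference step for Conway's Game of Life: 1 = alive, 0 = dead.
// Cells outside the board are assumed dead. The benchmark board is 2000 x 2000.
void life_step(const std::vector<int>& cur, std::vector<int>& next, int N) {
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
            int neighbors = 0;
            for (int di = -1; di <= 1; ++di)
                for (int dj = -1; dj <= 1; ++dj) {
                    if (di == 0 && dj == 0) continue;
                    int ni = i + di, nj = j + dj;
                    if (ni >= 0 && ni < N && nj >= 0 && nj < N)
                        neighbors += cur[ni * N + nj];
                }
            int alive = cur[i * N + j];
            next[i * N + j] = (neighbors == 3 || (alive == 1 && neighbors == 2)) ? 1 : 0;
        }
}
```

The integer neighbor counts and the two-array read/write access pattern are what exercise integer arithmetic and the memory system, as noted in the caption.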


FFT

[Chart: FFT run time in seconds vs. thread count (1–160) for the two POWER8 and three Xeon v3 systems]

Two-dimensional Fast Fourier Transform of a 1024 X 1024 matrix of double precision floating point numbers. Less time is better.
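As an illustration of the underlying transform (not the benchmark's actual implementation), here is a sequential radix-2 Cooley-Tukey FFT sketch in C++; a two-dimensional FFT of the 1024 X 1024 matrix would apply a 1D transform of this kind to every row and then to every column.

```cpp
#include <complex>
#include <vector>

using cd = std::complex<double>;

// Hypothetical 1D radix-2 Cooley-Tukey FFT; n must be a power of two.
// The 2D benchmark transform applies a 1D FFT to each row, then each column.
std::vector<cd> fft(const std::vector<cd>& a) {
    const std::size_t n = a.size();
    if (n == 1) return a;
    std::vector<cd> even(n / 2), odd(n / 2);
    for (std::size_t i = 0; i < n / 2; ++i) {
        even[i] = a[2 * i];
        odd[i]  = a[2 * i + 1];
    }
    std::vector<cd> fe = fft(even), fo = fft(odd);
    std::vector<cd> out(n);
    const double pi = 3.14159265358979323846;
    for (std::size_t k = 0; k < n / 2; ++k) {
        cd t = std::polar(1.0, -2.0 * pi * double(k) / double(n)) * fo[k];
        out[k]         = fe[k] + t;
        out[k + n / 2] = fe[k] - t;
    }
    return out;
}
```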

Quick Sort

[Chart: Quick Sort run time in seconds vs. thread count (1–160) for the two POWER8 and three Xeon v3 systems]

Sorts a list of 350,000 double precision floating point numbers. Less time is better.
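A hedged C++ sketch of the sort being measured is below; the pivot choice and partition scheme are assumptions, and a parallel version can evaluate the two recursive calls concurrently since they touch disjoint ranges.

```cpp
#include <utility>
#include <vector>

// Hypothetical reference quicksort over doubles (Lomuto partition,
// last element as pivot). Call as quicksort(a, 0, (int)a.size() - 1).
void quicksort(std::vector<double>& a, int lo, int hi) {
    if (lo >= hi) return;
    double pivot = a[hi];
    int i = lo;
    for (int j = lo; j < hi; ++j)
        if (a[j] < pivot) std::swap(a[i++], a[j]);
    std::swap(a[i], a[hi]);
    quicksort(a, lo, i - 1);   // both halves are independent work
    quicksort(a, i + 1, hi);
}
```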


Substring Search

[Chart: Substring Search run time in seconds vs. thread count (1–160) for the two POWER8 and three Xeon v3 systems]

Search for a substring within a list of 124,000,000 characters. Less time is better.
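For illustration only, a naive sequential C++ search over the character list might look like the sketch below; splitting the scan across threads (with overlap at chunk boundaries) is what makes the benchmark parallelizable, though the exact SequenceL formulation is not shown in this report.

```cpp
#include <cstddef>
#include <string>

// Hypothetical reference search: index of the first occurrence of `pattern`
// in `text`, or -1 if it does not occur. The benchmark text is ~124,000,000
// characters, so the scan is dominated by memory traffic.
long find_substring(const std::string& text, const std::string& pattern) {
    if (pattern.size() > text.size()) return -1;
    for (std::size_t i = 0; i + pattern.size() <= text.size(); ++i) {
        std::size_t j = 0;
        while (j < pattern.size() && text[i + j] == pattern[j]) ++j;
        if (j == pattern.size()) return static_cast<long>(i);
    }
    return -1;
}
```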

Barnes Hut

[Chart: Barnes Hut run time in seconds vs. thread count (1–160) for the two POWER8 and three Xeon v3 systems]

The Barnes-Hut calculation approximates an N-body simulation. Less time is better. Barnes-Hut relies heavily on vector instructions, so the Intel x86 systems do better due to their 256-bit SIMD width vs. 128-bit on POWER8.


Matrix Inverse

[Chart: Matrix Inverse run time in seconds vs. thread count (1–160) for the two POWER8 and three Xeon v3 systems]

Performs a matrix inverse calculation on a 1000 X 1000 matrix of double precision floating point numbers. Less time is better.
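The report does not say which inversion algorithm the benchmark uses; as one plausible sketch, the C++ below inverts an N X N matrix by Gauss-Jordan elimination with partial pivoting (the function name and singularity tolerance are assumptions).

```cpp
#include <cmath>
#include <utility>
#include <vector>

// Hypothetical inverse via Gauss-Jordan elimination with partial pivoting
// on a row-major N x N matrix (N = 1000 in the benchmark). Returns false
// if the matrix looks numerically singular.
bool invert(std::vector<double> A, std::vector<double>& inv, int N) {
    inv.assign(static_cast<std::size_t>(N) * N, 0.0);
    for (int i = 0; i < N; ++i) inv[i * N + i] = 1.0;        // start from the identity
    for (int col = 0; col < N; ++col) {
        int pivot = col;                                      // largest pivot in this column
        for (int r = col + 1; r < N; ++r)
            if (std::fabs(A[r * N + col]) > std::fabs(A[pivot * N + col])) pivot = r;
        if (std::fabs(A[pivot * N + col]) < 1e-12) return false;
        for (int c = 0; c < N; ++c) {                         // swap pivot row into place
            std::swap(A[col * N + c], A[pivot * N + c]);
            std::swap(inv[col * N + c], inv[pivot * N + c]);
        }
        double d = A[col * N + col];                          // normalize the pivot row
        for (int c = 0; c < N; ++c) { A[col * N + c] /= d; inv[col * N + c] /= d; }
        for (int r = 0; r < N; ++r) {                         // eliminate the column elsewhere
            if (r == col) continue;
            double f = A[r * N + col];
            for (int c = 0; c < N; ++c) {
                A[r * N + c]   -= f * A[col * N + c];
                inv[r * N + c] -= f * inv[col * N + c];
            }
        }
    }
    return true;
}
```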

Sparse Compression

[Chart: Sparse Compression run time in seconds vs. thread count (1–160) for the two POWER8 and three Xeon v3 systems]

Converts a 7000 X 7000 matrix into a sparse matrix. This is generally a memory-bound problem. Less time is better.
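The report does not state which sparse storage format is produced; assuming compressed sparse row (CSR) purely for illustration, the compression step amounts to the C++ sketch below.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical compression of a dense N x N row-major matrix into CSR form.
// CSR is an assumption; the actual SequenceL benchmark may use another format.
struct Csr {
    std::vector<double> values;   // nonzero values, row by row
    std::vector<int>    cols;     // column index of each nonzero
    std::vector<int>    row_ptr;  // start of each row in values/cols (size N + 1)
};

Csr compress(const std::vector<double>& dense, int N) {
    Csr m;
    m.row_ptr.push_back(0);
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            double v = dense[static_cast<std::size_t>(i) * N + j];
            if (v != 0.0) { m.values.push_back(v); m.cols.push_back(j); }
        }
        m.row_ptr.push_back(static_cast<int>(m.values.size()));
    }
    return m;
}
```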


Sparse Multiplication

[Chart: Sparse Multiplication run time in seconds vs. thread count (1–160) for the two POWER8 and three Xeon v3 systems]

Multiplies a 5000 X 5000 sparse matrix of double precision floating point values by a vector of 5000 double precision floating point values. This is a memory-bound problem. Less time is better.
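Continuing the CSR assumption from the compression sketch above, the operation being timed is essentially a sparse matrix-vector product: each output element is a short dot product over that row's nonzeros, which is cheap arithmetically but streams a lot of memory.

```cpp
#include <vector>

// Hypothetical CSR sparse matrix-vector product y = A * x, reusing the Csr
// struct from the compression sketch. Rows are independent, so the loop over
// i is the natural parallel dimension, but performance is bandwidth bound.
std::vector<double> spmv(const Csr& A, const std::vector<double>& x) {
    const int n = static_cast<int>(A.row_ptr.size()) - 1;
    std::vector<double> y(n, 0.0);
    for (int i = 0; i < n; ++i) {
        double sum = 0.0;
        for (int k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
            sum += A.values[k] * x[A.cols[k]];
        y[i] = sum;
    }
    return y;
}
```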

Sparse Decompression

[Chart: Sparse Decompression run time in seconds vs. thread count (1–160) for the two POWER8 and three Xeon v3 systems]

Converts a 5000 X 5000 sparse matrix of double precision floating point values back into a dense matrix. This is a memory-bound problem. Less time is better.
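Again assuming the CSR layout used in the sketches above, decompression simply scatters each stored nonzero back into a zero-initialized dense matrix, which is why the benchmark is dominated by memory traffic.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical decompression of the CSR form back into a dense row-major
// N x N matrix. Almost no arithmetic; essentially a scatter of nonzeros.
std::vector<double> decompress(const Csr& A, int N) {
    std::vector<double> dense(static_cast<std::size_t>(N) * N, 0.0);
    for (int i = 0; i < N; ++i)
        for (int k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
            dense[static_cast<std::size_t>(i) * N + A.cols[k]] = A.values[k];
    return dense;
}
```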


Autodesk

[Chart: Autodesk run time in seconds vs. thread count (1–160) for the two POWER8 and three Xeon v3 systems]

Customer problem. Tests threading capability and FPU performance. Less time is better.

Find Pi

[Chart: Find Pi run time in seconds vs. thread count (1–160) for the two POWER8 and three Xeon v3 systems]

Calculate the value of Pi. Tests floating point arithmetic. Less time is better.
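The report does not say how Pi is computed; one common floating-point-heavy formulation is midpoint-rule integration of 4/(1+x^2) over [0, 1], sketched below (the step count and function name are assumptions).

```cpp
// Hypothetical Pi calculation by midpoint-rule numerical integration of
// 4/(1+x^2) on [0,1]. Each slice is independent, so the loop parallelizes
// cleanly; the work is almost entirely floating point arithmetic.
double find_pi(long steps) {
    const double h = 1.0 / static_cast<double>(steps);
    double sum = 0.0;
    for (long i = 0; i < steps; ++i) {
        double x = (static_cast<double>(i) + 0.5) * h;   // midpoint of slice i
        sum += 4.0 / (1.0 + x * x);
    }
    return sum * h;
}
```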


Mandelbrot

[Chart: Mandelbrot run time in seconds vs. thread count (1–160) for the two POWER8 and three Xeon v3 systems]

Use Monte Carlo sampling to calculate the Mandelbrot set area. This problem is bound by CPU performance. Less time is better.
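As an illustration of the Monte Carlo approach named in the caption (the bounding box, seed, and iteration limit below are assumptions, not taken from the report), a sequential C++ sketch might look like this:

```cpp
#include <complex>
#include <random>

// Hypothetical Monte Carlo estimate of the Mandelbrot set area: sample
// points c in a box containing the upper half of the set, iterate
// z = z^2 + c, and count samples that have not escaped after max_iter
// iterations. The factor of 2 uses the set's symmetry about the real axis.
double mandelbrot_area(long samples, int max_iter) {
    std::mt19937_64 rng(12345);
    std::uniform_real_distribution<double> re(-2.0, 0.5), im(0.0, 1.25);
    long inside = 0;
    for (long s = 0; s < samples; ++s) {
        std::complex<double> c(re(rng), im(rng));
        std::complex<double> z(0.0, 0.0);
        int it = 0;
        while (it < max_iter && std::norm(z) <= 4.0) { z = z * z + c; ++it; }
        if (it == max_iter) ++inside;                 // treated as "in the set"
    }
    double box_area = (0.5 - (-2.0)) * (1.25 - 0.0);
    return 2.0 * box_area * static_cast<double>(inside) / static_cast<double>(samples);
}
```

Per-sample cost is dominated by the complex-multiply iteration, which is consistent with the caption's note that the problem is CPU bound.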

Spectral Norm

[Chart: Spectral Norm run time in seconds vs. thread count (1–160) for the two POWER8 and three Xeon v3 systems]

Numerical analysis calculations. Bound by a combination of memory and CPU performance. Less time is better.


Jacobi

[Chart: Jacobi throughput in cells per second vs. thread count (1–160) for the two POWER8 and three Xeon v3 systems]

Jacobi iteration on a 3-dimensional array of 250 X 250 X 250 double precision floating point numbers. This follows a pattern that is very common in HPC fields. It is generally memory bound, although there are also many floating point calculations. More cells per second is better.
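To make the access pattern concrete, a hedged C++ sketch of one Jacobi sweep over an N X N X N grid is below; the boundary handling and six-point averaging stencil are assumptions about the exact formulation.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical Jacobi sweep on an N x N x N grid (N = 250 in the benchmark):
// each interior cell of v becomes the average of its six neighbors in u.
// Six loads per store with little reuse is what makes this memory bound.
void jacobi_step(const std::vector<double>& u, std::vector<double>& v, int N) {
    auto idx = [N](int i, int j, int k) {
        return (static_cast<std::size_t>(i) * N + j) * N + k;
    };
    for (int i = 1; i < N - 1; ++i)
        for (int j = 1; j < N - 1; ++j)
            for (int k = 1; k < N - 1; ++k)
                v[idx(i, j, k)] = (u[idx(i - 1, j, k)] + u[idx(i + 1, j, k)] +
                                   u[idx(i, j - 1, k)] + u[idx(i, j + 1, k)] +
                                   u[idx(i, j, k - 1)] + u[idx(i, j, k + 1)]) / 6.0;
}
```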

Discrete Cosine Transform

[Chart: Discrete Cosine Transform run time in seconds vs. thread count (1–160) for the two POWER8 and three Xeon v3 systems]

Discrete cosine transformation of a 4096 X 4096 matrix of double precision floating point numbers. This calculation is commonly used in compression algorithms. Less time is better.


Semblance

[Chart: Semblance run time in seconds vs. thread count (1–160) for the two POWER8 systems (each with and without limited SMT) and the three Xeon v3 systems]

A customer problem from the Oil & Gas industry. A memory-bound problem that also includes many floating point calculations and tests a machine's threading capability. Includes results with reduced SMT settings on POWER8 (1@16 cores, 1@20 cores, 2@40 cores, 4@80 cores, 7@140 cores). Less time is better.

WirelessHART

[Chart: WirelessHART run time in seconds vs. thread count (1–160) for the two POWER8 systems (each with and without limited SMT) and the three Xeon v3 systems]

Customer problem that creates a schedule for a network gateway. Includes results with reduced SMT settings on POWER8 (1@16 cores, 1@20 cores, 2@40 cores, 4@80 cores, 7@140 cores). Less time is better.
