SequenceL Optimization Results on IBM POWER8 with Linux
This report compiles the final performance results from TMT's work optimizing the auto-parallelizing SequenceL multicore programming tool set for the IBM POWER8 platform.
SequenceL is a compact, powerful functional programming language and auto-parallelizing tool set that quickly and easily converts algorithms to robust, massively parallel code. SequenceL was designed to work in concert with existing programming languages and legacy code, making it a powerful tool for software engineers modernizing code for today's platforms. Customers have nicknamed it "Matlab on steroids" because it enables scientists and engineers to easily explore different algorithms and innovations, then quickly convert them to robust, performant production code that runs on a multitude of modern hardware platforms.
This effort included:
- Performance optimizations to utilize SIMD (AltiVec, VSX) vectorization and perform cache optimization on POWER8 systems, in both the SequenceL runtime environment and the code generated by the SequenceL compiler.
- Ensuring SequenceL is able to run with recent open source library versions that have had optimizations added for POWER8, including gcc 4.9+ and glibc 2.20+.
- Testing performance and correctness with the Advanced Toolchain for PowerLinux v8 and v9.
- Running and testing SequenceL on multiple POWER8 Linux distributions, including:
  o Red Hat RHEL 7.2
  o CentOS 7
  o Ubuntu 14, 15.10, and 16.04
- Running TMT performance benchmarks on multiple IBM POWER8 and Intel x86 platforms.
Comparisons of POWER8 to Intel x86 Xeon v3
After completing the optimization work on IBM POWER8, TMT ran its suite of heatmap programs on two POWER8 configurations and three Intel x86 "Haswell" configurations for comparison and verification purposes. TMT uses these programs, which are written in SequenceL, to stress the hardware platforms and ensure the SequenceL tools operate optimally on them. An overview of the tested server configurations is below; results are on subsequent pages.
Tested server configurations (all using RHEL 7.2 Linux):
- IBM S822: POWER8E 3.6GHz, 20 cores (160 threads), 256GB memory, 1TB disk
- IBM S824: POWER8E 4.32GHz, 16 cores (128 threads), 512GB memory, 780GB SSD
- Dell R730: Intel Xeon E5-2687W v3 3.10GHz, 20 cores (40 threads), 256GB memory, 200GB SSD
- Dell R730: Intel Xeon E5-2699 v3 2.3GHz, 36 cores (72 threads), 256GB memory, 200GB SSD
- Dell R730: Intel Xeon E5-2650 v3 2.3GHz, 20 cores (40 threads), 256GB memory, 200GB SSD
June 2016
Final Results, IBM POWER8 vs. Intel x86
Matrix Multiply
[Chart: execution time in seconds vs. thread count (1–160) for 3.6 GHz P8, 4.3 GHz P8, Xeon 2687 v3, Xeon 2699 v3, and Xeon 2650 v3; fastest result marked.]
Matrix Multiplication of two 2000 X 2000 matrices of double precision floating point numbers. Less time is better.
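For reference, the computation this benchmark performs can be sketched in plain Python. This is an illustrative triple-loop implementation at a small size, not the SequenceL benchmark source; the name `matmul` is ours.

```python
def matmul(a, b):
    """Naive dense matrix product: c[i][j] = sum_k a[i][k] * b[k][j]."""
    n, m, p = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

# The benchmark runs this on two 2000 X 2000 double-precision matrices;
# a 2 X 2 example keeps the sketch readable.
a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c = matmul(a, b)  # [[19.0, 22.0], [43.0, 50.0]]
```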
Game of Life
[Chart: execution time in seconds vs. thread count (1–160) for 3.6 GHz P8, 4.3 GHz P8, Xeon 2687 v3, Xeon 2699 v3, and Xeon 2650 v3; fastest result marked.]
Game of life calculation on a 2000 X 2000 board. Stresses memory system and integer arithmetic. Less time is better.
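The update rule being benchmarked can be sketched as follows; this is an illustrative Python reference (dead cells beyond the border), not the SequenceL source, and `life_step` is our name.

```python
def life_step(board):
    """One Game of Life generation on a dead-border grid.
    A live cell survives with 2 or 3 live neighbors; a dead cell
    becomes live with exactly 3."""
    n, m = len(board), len(board[0])
    def neighbors(i, j):
        return sum(board[x][y]
                   for x in range(max(0, i - 1), min(n, i + 2))
                   for y in range(max(0, j - 1), min(m, j + 2))
                   if (x, y) != (i, j))
    return [[1 if (board[i][j] and neighbors(i, j) in (2, 3))
             or (not board[i][j] and neighbors(i, j) == 3) else 0
             for j in range(m)] for i in range(n)]

# A horizontal blinker oscillates to vertical and back.
blinker = [[0, 0, 0], [1, 1, 1], [0, 0, 0]]
```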
FFT
[Chart: execution time in seconds vs. thread count (1–160) for 3.6 GHz P8, 4.3 GHz P8, Xeon 2687 v3, Xeon 2699 v3, and Xeon 2650 v3; fastest result marked.]
Two dimensional Fast Fourier Transformation of a 1024 X 1024 matrix of double precision floating point numbers. Less time is better.
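A 2-D FFT is typically computed as 1-D FFTs over the rows and then the columns. The sketch below shows that structure with a textbook radix-2 Cooley-Tukey transform; it is illustrative Python, not the SequenceL benchmark, and `fft`/`fft2d` are our names.

```python
import cmath

def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    tw = [cmath.exp(-2j * cmath.pi * k / n) * odd[k] for k in range(n // 2)]
    return ([even[k] + tw[k] for k in range(n // 2)] +
            [even[k] - tw[k] for k in range(n // 2)])

def fft2d(m):
    """2-D FFT: 1-D FFTs over rows, then over columns.
    The benchmark does this on a 1024 X 1024 matrix."""
    rows = [fft(r) for r in m]
    cols = [fft(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]
```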
Quick Sort
[Chart: execution time in seconds vs. thread count (1–160) for 3.6 GHz P8, 4.3 GHz P8, Xeon 2687 v3, Xeon 2699 v3, and Xeon 2650 v3; fastest result marked.]
Sorts a list of 350,000 double precision floating point numbers. Less time is better.
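The divide-and-conquer structure of quicksort parallelizes naturally, since the two partitions can be sorted independently. A functional-style Python sketch (illustrative only, not the SequenceL source):

```python
def quicksort(xs):
    """Functional-style quicksort: partition around a pivot, recurse on
    the smaller-than and larger-than partitions."""
    if len(xs) <= 1:
        return xs
    pivot = xs[len(xs) // 2]
    return (quicksort([x for x in xs if x < pivot])
            + [x for x in xs if x == pivot]
            + quicksort([x for x in xs if x > pivot]))
```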
Substring Search
[Chart: execution time in seconds vs. thread count (1–160) for 3.6 GHz P8, 4.3 GHz P8, Xeon 2687 v3, Xeon 2699 v3, and Xeon 2650 v3; fastest result marked.]
Search for a substring within a list of 124,000,000 characters. Less time is better.
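The workload amounts to scanning a long character sequence for a pattern; the independent starting offsets are what makes it parallelizable. A naive illustrative Python version (`find_substring` is our name, not the benchmark's):

```python
def find_substring(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1.
    Naive scan: compare the pattern at every starting offset."""
    n, m = len(text), len(pattern)
    for i in range(n - m + 1):
        if text[i:i + m] == pattern:
            return i
    return -1
```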
Barnes-Hut
[Chart: execution time in seconds vs. thread count (1–160) for 3.6 GHz P8, 4.3 GHz P8, Xeon 2687 v3, Xeon 2699 v3, and Xeon 2650 v3; fastest result marked.]
The Barnes-Hut calculation is an approximation of an N-Body simulation. Less time is better. Barnes-Hut relies heavily on vector instructions, so Intel x86 does better due to its 256-bit SIMD width vs. 128-bit on POWER8.
Matrix Inverse
[Chart: execution time in seconds vs. thread count (1–160) for 3.6 GHz P8, 4.3 GHz P8, Xeon 2687 v3, Xeon 2699 v3, and Xeon 2650 v3; fastest result marked.]
Performs a matrix inverse calculation on a 1000 X 1000 matrix of double precision floating point numbers. Less time is better.
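One standard way to compute a matrix inverse is Gauss-Jordan elimination on the matrix augmented with the identity. The sketch below is an illustrative Python reference at a small size; the benchmark's actual algorithm in SequenceL may differ, and `invert` is our name.

```python
def invert(m):
    """Invert a square matrix by Gauss-Jordan elimination with
    partial pivoting, using an augmented [M | I] matrix."""
    n = len(m)
    a = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(m)]
    for col in range(n):
        # Pivot: swap in the row with the largest entry in this column.
        piv = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        p = a[col][col]
        a[col] = [v / p for v in a[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and a[r][col] != 0.0:
                f = a[r][col]
                a[r] = [v - f * w for v, w in zip(a[r], a[col])]
    return [row[n:] for row in a]
```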
Sparse Compression
[Chart: execution time in seconds vs. thread count (1–160) for 3.6 GHz P8, 4.3 GHz P8, Xeon 2687 v3, Xeon 2699 v3, and Xeon 2650 v3; fastest result marked.]
Converts a 7000 X 7000 matrix into a sparse matrix. This is generally a memory bound problem; less time is better.
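Sparse compression typically means storing only the nonzero entries. As an illustration (the report does not specify the format), here is a conversion to the common compressed sparse row (CSR) layout; `to_csr` is our name.

```python
def to_csr(dense):
    """Compress a dense matrix to CSR form: a flat list of nonzero values,
    their column indices, and per-row pointers into those lists."""
    values, cols, rowptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0.0:
                values.append(v)
                cols.append(j)
        rowptr.append(len(values))  # end of this row's entries
    return values, cols, rowptr
```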
Sparse Multiplication
[Chart: execution time in seconds vs. thread count (1–160) for 3.6 GHz P8, 4.3 GHz P8, Xeon 2687 v3, Xeon 2699 v3, and Xeon 2650 v3; fastest result marked.]
Multiply a sparse matrix of 5000 X 5000 double precision floating point values with a vector of 5000 double precision floating point values. This is a memory bound problem; less time is better.
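Sparse matrix-vector multiply touches each stored nonzero exactly once, which is why memory bandwidth dominates. An illustrative Python sketch assuming a CSR layout (our assumption; the report does not name the format):

```python
def csr_matvec(values, cols, rowptr, x):
    """y = A @ x for a CSR matrix: each output element is the sum of
    value * x[column] over that row's stored nonzeros."""
    return [sum(values[k] * x[cols[k]] for k in range(rowptr[i], rowptr[i + 1]))
            for i in range(len(rowptr) - 1)]

# CSR encoding of [[0, 5], [3, 0]] times the vector [1, 2].
y = csr_matvec([5.0, 3.0], [1, 0], [0, 1, 2], [1.0, 2.0])  # [10.0, 3.0]
```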
Sparse Decompression
[Chart: execution time in seconds vs. thread count (1–160) for 3.6 GHz P8, 4.3 GHz P8, Xeon 2687 v3, Xeon 2699 v3, and Xeon 2650 v3; fastest result marked.]
Convert a 5000 X 5000 sparse matrix of double precision floating point values into a dense matrix. This is a memory bound problem; less time is better.
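Decompression is the inverse of the compression step: scatter the stored nonzeros back into a zero-filled dense matrix. An illustrative sketch, again assuming CSR (`from_csr` is our name):

```python
def from_csr(values, cols, rowptr, ncols):
    """Expand a CSR matrix back into a dense row-major matrix."""
    dense = []
    for i in range(len(rowptr) - 1):
        row = [0.0] * ncols
        for k in range(rowptr[i], rowptr[i + 1]):
            row[cols[k]] = values[k]  # scatter nonzeros into place
        dense.append(row)
    return dense
```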
Autodesk
[Chart: execution time in seconds vs. thread count (1–160) for 3.6 GHz P8, 4.3 GHz P8, Xeon 2687 v3, Xeon 2699 v3, and Xeon 2650 v3; fastest result marked.]
Customer problem. Tests threading capability and FPU performance. Less time is better.
Find Pi
[Chart: execution time in seconds vs. thread count (1–160) for 3.6 GHz P8, 4.3 GHz P8, Xeon 2687 v3, Xeon 2699 v3, and Xeon 2650 v3.]
Calculate the value of Pi. Tests floating point arithmetic. Less time is better.
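The report does not say which method the benchmark uses; as one example of a floating-point-heavy Pi calculation whose terms can be summed in parallel, here is the Leibniz series in Python:

```python
def find_pi(n):
    """Approximate Pi with n terms of the Leibniz series:
    Pi/4 = 1 - 1/3 + 1/5 - 1/7 + ...
    Each term is independent, so the sum parallelizes trivially."""
    return 4.0 * sum((-1.0) ** k / (2 * k + 1) for k in range(n))
```

The series converges slowly (error on the order of 1/n), which makes it a compute benchmark rather than a practical way to obtain Pi.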
Mandelbrot
[Chart: execution time in seconds vs. thread count (1–160) for 3.6 GHz P8, 4.3 GHz P8, Xeon 2687 v3, Xeon 2699 v3, and Xeon 2650 v3; fastest result marked.]
Use Monte Carlo sampling to calculate the Mandelbrot set area. This problem is bound by CPU performance. Less time is better.
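Monte Carlo area estimation draws random points in a box covering the set and counts the fraction that stay bounded under the iteration z -> z^2 + c. An illustrative Python sketch (the bounding box, iteration cap, and function name are our choices, not the benchmark's):

```python
import random

def mandelbrot_area(samples, max_iter=200, seed=42):
    """Estimate the Mandelbrot set's area by Monte Carlo sampling."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(samples):
        # Sample the upper half-plane only; the set is symmetric about
        # the real axis, so the result is doubled below.
        c = complex(rng.uniform(-2.0, 0.5), rng.uniform(0.0, 1.25))
        z = 0j
        for _ in range(max_iter):
            z = z * z + c
            if abs(z) > 2.0:  # escaped: not in the set
                break
        else:
            inside += 1
    box_area = 2.5 * 1.25
    return 2.0 * box_area * inside / samples
```

The true area is about 1.506; the estimate tightens as the sample count and iteration cap grow.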
Spectral Norm
[Chart: execution time in seconds vs. thread count (1–160) for 3.6 GHz P8, 4.3 GHz P8, Xeon 2687 v3, Xeon 2699 v3, and Xeon 2650 v3; fastest result marked.]
Numerical analysis calculations. Bound by a combination of memory and CPU performance. Less time is better.
Jacobi (Cells/Sec)
[Chart: cells per second vs. thread count (1–160) for 3.6 GHz P8, 4.3 GHz P8, Xeon 2687 v3, Xeon 2699 v3, and Xeon 2650 v3; fastest result marked.]
Jacobi iteration on a 3 dimensional array of 250 X 250 X 250 double precision floating point numbers. This follows a pattern that is very common in HPC fields. It is generally memory bound, although there are also many floating point calculations. The most cells/sec is best.
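The stencil pattern behind this benchmark replaces each interior cell with the average of its six face neighbors on every sweep. A small illustrative Python version (the benchmark's grid is 250 X 250 X 250; `jacobi_step` is our name):

```python
def jacobi_step(u):
    """One Jacobi sweep on a cubic 3-D grid: each interior cell becomes
    the average of its six face neighbors; boundary cells are held fixed."""
    n = len(u)
    v = [[row[:] for row in plane] for plane in u]  # copy, keep boundaries
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            for k in range(1, n - 1):
                v[i][j][k] = (u[i - 1][j][k] + u[i + 1][j][k] +
                              u[i][j - 1][k] + u[i][j + 1][k] +
                              u[i][j][k - 1] + u[i][j][k + 1]) / 6.0
    return v
```

Because every output cell depends only on the previous sweep, all cells can be updated in parallel, which is why the pattern is so common in HPC codes.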
Discrete Cosine Transform
[Chart: execution time in seconds vs. thread count (1–160) for 3.6 GHz P8, 4.3 GHz P8, Xeon 2687 v3, Xeon 2699 v3, and Xeon 2650 v3; fastest result marked.]
Discrete cosine transformation of a 4096 X 4096 matrix of double precision floating point numbers. This calculation is commonly used in compression algorithms. Less time is better.
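For reference, a direct (unnormalized, O(n^4)) 2-D DCT-II can be written as below; production codecs and presumably the benchmark use faster factored forms, so this is only an illustrative definition, with `dct2d` as our name.

```python
import math

def dct2d(x):
    """Naive 2-D DCT-II (no normalization):
    X[u][v] = sum_{i,j} x[i][j] cos(pi(2i+1)u/2n) cos(pi(2j+1)v/2m)."""
    n, m = len(x), len(x[0])
    return [[sum(x[i][j]
                 * math.cos(math.pi * (2 * i + 1) * u / (2 * n))
                 * math.cos(math.pi * (2 * j + 1) * v / (2 * m))
                 for i in range(n) for j in range(m))
             for v in range(m)] for u in range(n)]
```

A constant input concentrates all energy in the DC coefficient, which is the property compression algorithms exploit.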
Semblance
[Chart: execution time in seconds vs. thread count (1–160) for 3.6 GHz P8, 3.6 GHz P8 (limited SMT), 4.3 GHz P8, 4.3 GHz P8 (limited SMT), Xeon 2687 v3, Xeon 2699 v3, and Xeon 2650 v3; fastest result marked.]
Customer problem from the Oil & Gas industry. A memory bound problem that also includes many floating point calculations and tests the threading capability of a machine. Includes results of reduced SMT settings on POWER8 (1@16 cores, 1@20 cores, 2@40 cores, 4@80 cores, 7@140 cores). Less time is better.
WirelessHART
[Chart: execution time in seconds vs. thread count (1–160) for 3.6 GHz P8, 3.6 GHz P8 (limited SMT), 4.3 GHz P8, 4.3 GHz P8 (limited SMT), Xeon 2687 v3, Xeon 2699 v3, and Xeon 2650 v3; fastest result marked.]
Customer problem that creates a schedule for a network gateway. Includes results of reduced SMT settings on POWER8 (1@16 cores, 1@20 cores, 2@40 cores, 4@80 cores, 7@140 cores). Less time is better.