SequenceL Optimization Results on IBM POWER8 with Linux
This report is a compilation of the final performance results of TMT optimizing the auto-parallelizing SequenceL multicore programming tool set for the IBM POWER8 platform.

SequenceL is a compact, powerful functional programming language and auto-parallelizing tool set that quickly and easily converts algorithms to robust, massively parallel code. SequenceL was designed to work in concert with existing programming languages and legacy code, making it a powerful tool for software engineers modernizing code for current platforms. Customers have nicknamed it "Matlab on steroids" because it lets scientists and engineers easily explore different algorithms and innovations, then quickly convert them to robust, performant production code that runs on a multitude of modern hardware platforms.

This effort included:
- Performance optimizations to utilize SIMD (AltiVec, VSX) vectorization and perform cache optimization on POWER8 systems, in both the SequenceL runtime environment and the code generated by the SequenceL compiler.
- Ensuring SequenceL is able to run with recent open source library versions that have had optimizations added for POWER8, including gcc 4.9+ and glibc 2.20+.
- Testing performance and correctness with the Advanced Toolchain for PowerLinux v8 and v9.
- Running and testing SequenceL on multiple POWER8 Linux distributions, including:
  o Red Hat RHEL 7.2
  o CentOS 7
  o Ubuntu 14, 15.10, and 16.04
- TMT performance testing benchmarks on multiple IBM POWER8 and Intel x86 platforms.

Comparisons of POWER8 to Intel x86 Xeon v3

After completing the optimization work on IBM POWER8, TMT ran its suite of heatmap programs on two POWER8 configurations and three Intel x86 "Haswell" configurations for comparison and verification purposes. TMT uses these programs, which are written in SequenceL, to stress the hardware platforms and ensure the SequenceL tools operate optimally on them.
An overview of the tested server configurations is below; results are on subsequent pages. Tested server configurations (all using RHEL 7.2 Linux):
- IBM S822: POWER8E 3.6 GHz, 20 cores (160 threads), 256 GB memory, 1 TB disk
- IBM S824: POWER8E 4.32 GHz, 16 cores (128 threads), 512 GB memory, 780 GB SSD
- Dell R730: Intel Xeon E5-2687W v3 3.10 GHz, 20 cores (40 threads), 256 GB memory, 200 GB SSD
- Dell R730: Intel Xeon E5-2699 v3 2.3 GHz, 36 cores (72 threads), 256 GB memory, 200 GB SSD
- Dell R730: Intel Xeon E5-2650 v3 2.3 GHz, 20 cores (40 threads), 256 GB memory, 200 GB SSD

June 2016 Final Results, IBM POWER8 vs. Intel x86

Matrix Multiply
[Chart: execution time in seconds vs. thread count (1 to 160) for the two POWER8 and three Xeon v3 systems.]
Matrix multiplication of two 2000 x 2000 matrices of double precision floating point numbers. Less time is better.

Game of Life
[Chart: execution time in seconds vs. thread count.]
Game of Life calculation on a 2000 x 2000 board. Stresses the memory system and integer arithmetic. Less time is better.

FFT
[Chart: execution time in seconds vs. thread count.]
Two-dimensional Fast Fourier Transform of a 1024 x 1024 matrix of double precision floating point numbers. Less time is better.

Quick Sort
[Chart: execution time in seconds vs. thread count.]
Sorts a list of 350,000 double precision floating point numbers. Less time is better.

Substring Search
[Chart: execution time in seconds vs. thread count.]
Search for a substring within a list of 124,000,000 characters. Less time is better.
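To make the Matrix Multiply benchmark concrete, here is a minimal sketch in plain Python with NumPy. This is an illustration only, not TMT's SequenceL code: the report's runs used SequenceL-generated parallel code on 2000 x 2000 inputs, while the small size and function name here are hypothetical stand-ins.

```python
import time
import numpy as np

def matmul_benchmark(n, seed=0):
    """Multiply two n x n matrices of doubles and report elapsed seconds.

    A sketch of the report's Matrix Multiply benchmark; the real tests
    used SequenceL-generated code on 2000 x 2000 double precision inputs.
    """
    rng = np.random.default_rng(seed)
    a = rng.random((n, n))   # double precision (float64) by default
    b = rng.random((n, n))
    start = time.perf_counter()
    c = a @ b                # the timed kernel
    elapsed = time.perf_counter() - start
    return c, elapsed

if __name__ == "__main__":
    c, secs = matmul_benchmark(200)  # small stand-in for 2000 x 2000
    print(f"200 x 200 matmul took {secs:.4f} s")
```

As in the charts, the interesting quantity is how `elapsed` scales as the same kernel is spread across more hardware threads.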
Barnes Hut
[Chart: execution time in seconds vs. thread count.]
The Barnes-Hut calculation is an approximation of an N-Body simulation. Less time is better. Barnes-Hut relies heavily on vector instructions, so Intel x86 does better due to its 256-bit SIMD width vs. 128-bit on POWER8.

Matrix Inverse
[Chart: execution time in seconds vs. thread count.]
Performs a matrix inverse calculation on a 1000 x 1000 matrix of double precision floating point numbers. Less time is better.

Sparse Compression
[Chart: execution time in seconds vs. thread count.]
Converts a 7000 x 7000 matrix into a sparse matrix. Generally a memory bound problem; less time is better.

Sparse Multiplication
[Chart: execution time in seconds vs. thread count.]
Multiplies a sparse matrix of 5000 x 5000 double precision floating point values with a vector of 5000 double precision floating point values. This is a memory bound problem; less time is better.

Sparse Decompression
[Chart: execution time in seconds vs. thread count.]
Converts a 5000 x 5000 sparse matrix of double precision floating point values into a dense matrix. This is a memory bound problem; less time is better.

Autodesk
[Chart: execution time in seconds vs. thread count.]
Customer problem. Tests threading capability and FPU performance. Less time is better.
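The three sparse-matrix benchmarks (compression, multiplication, decompression) can be illustrated with a toy compressed sparse row (CSR) implementation. This is a hedged sketch, not the report's code: the report does not state which sparse format TMT used, and the function names below are illustrative.

```python
def csr_compress(dense):
    """Convert a dense matrix (list of lists) to CSR form.

    Toy analogue of the Sparse Compression benchmark (which compressed
    a 7000 x 7000 double precision matrix).
    """
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0.0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))  # cumulative nonzero count per row
    return values, col_idx, row_ptr

def csr_matvec(values, col_idx, row_ptr, x):
    """Sparse matrix-vector product (Sparse Multiplication analogue)."""
    y = []
    for r in range(len(row_ptr) - 1):
        s = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            s += values[k] * x[col_idx[k]]
        y.append(s)
    return y

def csr_decompress(values, col_idx, row_ptr, ncols):
    """Rebuild the dense matrix (Sparse Decompression analogue)."""
    dense = []
    for r in range(len(row_ptr) - 1):
        row = [0.0] * ncols
        for k in range(row_ptr[r], row_ptr[r + 1]):
            row[col_idx[k]] = values[k]
        dense.append(row)
    return dense
```

All three operations stream through the index arrays with little arithmetic per element, which is why the report describes these benchmarks as memory bound.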
Find Pi
[Chart: execution time in seconds vs. thread count.]
Calculates the value of Pi. Tests floating point arithmetic. Less time is better.

Mandelbrot
[Chart: execution time in seconds vs. thread count.]
Uses Monte Carlo sampling to calculate the area of the Mandelbrot set. This problem is bound by CPU performance. Less time is better.

Spectral Norm
[Chart: execution time in seconds vs. thread count.]
Numerical analysis calculations. Bound by a combination of memory and CPU performance. Less time is better.

Jacobi
[Chart: cells/sec vs. thread count; more cells/sec is better.]
Jacobi iteration on a 3-dimensional array of 250 x 250 x 250 double precision floating point numbers. This follows a pattern that is very common in HPC fields. It is generally memory bound, although there are also many floating point calculations. The most cells/sec is best.

Discrete Cosine Transform
[Chart: execution time in seconds vs. thread count.]
Discrete cosine transform of a 4096 x 4096 matrix of double precision floating point numbers. This calculation is commonly used in compression algorithms. Less time is better.

Semblance
[Chart: execution time in seconds vs. thread count, including POWER8 runs with limited SMT.]
Customer problem in the Oil & Gas industry. A memory bound problem that also includes many floating point calculations and tests the threading capability of a machine.
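The Mandelbrot benchmark's approach can be sketched as follows: sample random points in a box containing the set, iterate z = z^2 + c to classify each point, and scale the hit fraction by the box area. The sample count, iteration cap, and bounding box below are illustrative guesses; the report does not state the parameters TMT used.

```python
import random

def mandelbrot_area(samples=20000, max_iter=200, seed=42):
    """Estimate the area of the Mandelbrot set by Monte Carlo sampling.

    Sketch of the report's Mandelbrot benchmark; parameters are
    illustrative, not TMT's actual settings.
    """
    rng = random.Random(seed)
    # Bounding box known to contain the Mandelbrot set.
    xmin, xmax, ymin, ymax = -2.0, 0.5, -1.25, 1.25
    box_area = (xmax - xmin) * (ymax - ymin)
    inside = 0
    for _ in range(samples):
        c = complex(rng.uniform(xmin, xmax), rng.uniform(ymin, ymax))
        z = 0j
        for _ in range(max_iter):
            z = z * z + c
            if abs(z) > 2.0:   # point escaped: not in the set
                break
        else:
            inside += 1        # never escaped within max_iter
    return box_area * inside / samples

if __name__ == "__main__":
    print(f"Estimated Mandelbrot area: {mandelbrot_area():.3f}")
```

Each sample is independent, which is why this benchmark parallelizes cleanly and is bound purely by CPU performance, as the report notes.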
Includes results of reduced SMT settings on POWER8 (1@16 cores, 1@20 cores, 2@40 cores, 4@80 cores, 7@140 cores). Less time is better.

WirelessHART
[Chart: execution time in seconds vs. thread count, including POWER8 runs with limited SMT.]
Customer problem that creates a schedule for a network gateway.