Levels of Parallelism

Hans Pabst, Application Engineer

Acknowledgements: Martyn Corden, Steve “Dr. Fortran” Lionel, and others

Contents

• Parallelism on Intel Architecture
• Vectorization of code: overview and terminology
• Introduction to SIMD ISA for Intel® processors
  • Intel® AVX, Intel® AVX2, Intel® AVX-512
  • SIMD instruction set switches for Intel® compilers
• SIMD vectorization

Opening statement and Motivation

“Parallelism == Performance” … which leads to:

Optimization – making sure the above statement is true!

Speedup using Parallelism: Four-Step Development
1. Analyze: Amplifier XE (event-based sampling/EBS, hotspots)
2. Implement: Composer XE (compiler, OpenMP, Cilk Plus) plus libraries (MKL, TBB, IPP)
3. Debug: Inspector XE (threads, memory)
4. Tune: Amplifier XE (concurrency, waits & locks)

Speedup by (re-)using Libraries

▪ Highly optimized, threaded, and vectorized math functions that maximize performance on each processor family
▪ Utilizes industry-standard C and Fortran APIs for compatibility with popular BLAS, LAPACK, and FFTW functions: no code changes required
▪ Dispatches optimized code for each processor automatically, without the need to branch code

What’s New in the 2018 edition
▪ Improved small matrix multiplication performance in GEMM & LAPACK
▪ Improved ScaLAPACK performance for distributed computation
▪ 24 new vector math functions
▪ Simplified license for easier adoption & redistribution
▪ Additional distributions via YUM, APT-GET, & Conda

More about Intel MKL: software.intel.com/mkl
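As an illustration of the “no code changes required” point, here is a minimal sketch of calling the standard BLAS routine DGEMM from Fortran (the matrix size and names are illustrative); MKL supplies the optimized implementation when you link against it, e.g. via the compiler’s -mkl option:

program gemm_example
  implicit none
  integer, parameter :: n = 512
  real(8) :: a(n,n), b(n,n), c(n,n)
  call random_number(a)
  call random_number(b)
  ! Standard BLAS interface: C = alpha*A*B + beta*C; MKL dispatches the
  ! best code path for the host processor at run time.
  call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
  print *, 'c(1,1) =', c(1,1)
end program gemm_example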

Seven Levels of Parallelism

#1 – Node-level Parallelism

Levels of Parallelism: Node

What can I do? Increase per-node performance, identify scalability issues (Intel® Trace Analyzer & Collector), employ communication-avoiding algorithms, etc. (or simply use more nodes ;-)

#2 – Socket-level Parallelism

Levels of Parallelism: Node, Socket

Do I have NUMA issues? Check with ‘numactl -i all‘, or Intel® VTune™.

What can I do? Perform data initialization in parallel, separate data structures, take control with a NUMA allocator (libnuma).
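On a system with a first-touch page-placement policy, one way to “perform data initialization in parallel” is to touch the data with the same OpenMP thread layout the compute loops will later use; a minimal sketch (array size and names are illustrative):

program first_touch
  implicit none
  integer, parameter :: n = 64*1024*1024
  real, allocatable :: a(:)
  integer :: i
  allocate(a(n))            ! virtual allocation; pages are not mapped yet
  !$omp parallel do schedule(static)
  do i = 1, n
    a(i) = 0.0              ! first touch maps each page near the touching thread
  end do
  !$omp parallel do schedule(static)
  do i = 1, n
    a(i) = 2.0 * a(i)       ! same static schedule: threads mostly hit local pages
  end do
end program first_touch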

2S / 4S / 8S Configurations

[Diagram: 2S, 4S and 8S platform topologies built from Intel® Xeon® Scalable (SKL) sockets connected by Intel® UPI links, with LBG chipsets attached via DMI, 3x16 PCIe* lanes per socket, and optional Intel® OP Fabric (1x100G); the 2S-2UPI & 2S-3UPI and 4S-2UPI & 4S-3UPI variants are shown.]

The Intel® Xeon® Scalable processor supports up to 8 sockets without the need for an additional node controller.

#3 – Core-level Parallelism

Levels of Parallelism: Node, Socket, Core/Thread-Level (Core 1 … Core 4)

#4 – Thread-level Parallelism (with Hyperthreading, aka SMT)

Levels of Parallelism: Node, Socket, Core/Thread-Level (Hyperthreading): each core (Core 1 … Core 4) runs two hardware threads (Thread 1, Thread 2).

Hyperthreading

• Buffers keep the architectural state (Architectural State A / B)
• The execution blocks are shared between the two threads
• Enabled by BIOS settings

Background
• Extracts instruction-level parallelism (ILP) over a single execution block
• Complements out-of-order execution
• On Intel Core, intra-core slowdowns due to HT are eliminated, BUT slowdowns may still happen due to missing thread affinization or because of synchronization (locks).
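In practice, thread affinization can be controlled with standard OpenMP environment variables; an illustrative setup pinning one thread per physical core (the core count and executable name are placeholders):

$ export OMP_PLACES=cores      # one place per physical core
$ export OMP_PROC_BIND=close   # pack threads onto neighboring cores
$ export OMP_NUM_THREADS=28    # e.g., one thread per core of a 28-core socket
$ ./my_app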

Parallelism on Intel® Architecture (Xeon Server Cores)

Model:    64-bit  5100  5500  5600  SNB  IVB  HSW  BDW  SKX   KNC   KNL
Cores:         1     2     4     6    8   10   18   22   28    61    72
Threads:       2     2     8    12   16   20   36   44   56   244   288

(From early 64-bit Intel® Xeon® through Intel® Xeon® E5 to the Intel® Xeon® Scalable family; KNC is the Intel® Xeon Phi™ coprocessor, KNL the 2nd-generation Intel® Xeon Phi™ processor.)

SNB: Sandy Bridge, IVB: Ivy Bridge, HSW: Haswell, BDW: Broadwell, SKL: Skylake (client), SKX: Skylake Server, KNC: Knights Corner, KNL: Knights Landing. The SKX server core refines the SKL client core significantly.

#5 – GPU-CPU Parallelism

Levels of Parallelism: Node, Socket, Core/Thread-Level (Hyperthreading), GPU-CPU

[Diagram: integrated processor graphics across 2nd, 3rd, 4th and 5th Gen Intel® Core™ processors (Intel® HD Graphics, with eDRAM on later generations).]

Lots of compute power for data-parallel (client) applications

#6 – Instruction-level Parallelism

Levels of Parallelism: Node, Socket, Core/Thread-Level (Hyperthreading), GPU-CPU, Instruction

Execution Units on Haswell/Broadwell

[Diagram: a unified scheduler issues to Ports 0–7. Ports 0/1/5/6 host the ALUs, the FMA/FP MUL and FMA/FP ADD/MUL units, vector integer ALU/MUL, shuffle, logic, FDIV and branch units; Ports 2/3 handle loads and store addresses (STA), Port 4 store data (STD), Port 7 a new store AGU; vector lanes 0–255 bits; memory control and L1 data cache below.]

Newly introduced:
• New branch unit on Port 6 reduces Port 0 conflicts; a 2nd EU helps branch-heavy code
• 2x FMA doubles peak FLOPS; two FP MUL units benefit legacy code
• 4th ALU: great for integer workloads, frees Ports 0 & 1 for vector work
• New AGU for stores leaves Ports 2 & 3 for loads

• The race to higher frequencies has slowed down
• New logic is introduced with each new generation, leading to higher complexity
• Fused multiply-add (FMA) for performance
• Separate address generation unit (AGU) for address calculations

Execution Units on Skylake Server

#7 – Data-level Parallelism

Levels of Parallelism: Node, Socket, Core/Thread-Level (Hyperthreading), GPU-CPU, Instruction, Data (Vectorization)

Contents

• Parallelism on Intel Architecture
• Vectorization of code: overview and terminology
• Introduction to SIMD ISA for Intel® processors
  • Intel® AVX, Intel® AVX2, Intel® AVX-512
  • SIMD instruction set switches for Intel® compilers
• SIMD vectorization

Vectorization of code

• Transform sequential code to exploit the SIMD processing capabilities of Intel® processors
  • By calling a vectorized library
  • Automatically, by tools like a compiler
  • Manually, by explicit syntax

for(i = 0; i <= MAX; i++)
   c[i] = a[i] + b[i];

[Diagram: one vector add computes c[i] … c[i+7] = a[i] … a[i+7] + b[i] … b[i+7] in a single instruction.]

Vectorization terminology

• Single Instruction Multiple Data (SIMD)
  • Processes a whole vector with a single operation
  • Provides data-level parallelism (DLP)
  • More efficient than scalar processing due to DLP
• Vector
  • Consists of more than one element
  • Elements are of the same scalar data type (e.g., floats, integers, …)
  • Vector length (VL): the number of elements in the vector

[Diagram: scalar processing computes one C = A + B at a time; vector processing computes VL elements Ci = Ai + Bi in one operation.]

Contents

• Parallelism on Intel Architecture
• Vectorization of code: overview and terminology
• Introduction to SIMD ISA for Intel® processors
  • Intel® AVX, Intel® AVX2, Intel® AVX-512
  • SIMD instruction set switches for Intel® compilers
• SIMD vectorization

History of SIMD ISA extensions*

Intel® Pentium® processor (1993)

MMX™ (1997)

Intel® Streaming SIMD Extensions (Intel® SSE in 1999 to Intel® SSE4.2 in 2008)

Intel® Advanced Vector Extensions (Intel® AVX in 2011 and Intel® AVX2 in 2013)

Intel® AVX-512 in 2016

* Illustrated in the original slide with the number of 32-bit data elements processed by one “packed” instruction: 1 (scalar), 2 (MMX™), 4 (SSE), 8 (AVX), 16 (AVX-512).

Intel® AVX

• A 256-bit vector extension to SSE:
  • SSE uses dedicated 128-bit registers called XMM (16 for Intel® 64)
  • AVX extends all XMM registers to 256 bit, called YMM
  • The lower 128 bit of each YMM register are mapped/shared with XMM
• AVX works on either
  • The whole 256 bit, or
  • The lower 128 bit (zeroing the higher 128 bit)
• AVX counterparts exist for almost all existing SSE instructions; the initial generation (Intel® AVX) provides full 256-bit vectors for FP only

[Diagram: XMM, 128 bits (1999); YMM, 256 bits (2010).]

Intel® AVX2

• Basically the same as Intel® AVX, with the following additions:
  • Doubles the width of integer vector instructions to 256 bits
  • Floating-point fused multiply-add (FMA)
  • Bit Manipulation Instructions (BMI)
  • Gather instructions
  • Any-to-any permutes
  • Vector-vector shifts

Processor Family                                  Instruction Set   SP FLOPs/Clock   DP FLOPs/Clock
Pre-2nd-generation Intel® Core™ processors        SSE 4.2                 8                4
2nd/3rd-generation Intel® Core™ processors        AVX                    16                8   (2x)
4th-generation Intel® Core™ processors            AVX2                   32               16   (4x)
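A typical beneficiary is a multiply-add kernel like the following sketch (names are illustrative); with AVX2 the compiler can fuse the multiply and the add into FMA instructions:

subroutine axpy(n, a, x, y)
  integer, intent(in) :: n
  real, intent(in) :: a, x(n)
  real, intent(inout) :: y(n)
  integer :: i
  do i = 1, n
    y(i) = a * x(i) + y(i)   ! multiply-add: a candidate for one vfmadd per vector
  end do
end subroutine axpy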

Intel® AVX and AVX2 vector types

Intel® AVX:
• 8x single-precision FP
• 4x double-precision FP

Intel® AVX2:
• 32x 8-bit integer
• 16x 16-bit integer
• 8x 32-bit integer
• 4x 64-bit integer

plus plain 256-bit operations

Intel® AVX-512 for Intel® CPUs

• Intel® Xeon Phi™ and Intel® Xeon® processors share a large set of instructions
• The instruction sets are not identical
• Subsets are represented by individual feature flags (CPUID)

[Diagram: common instruction set stack per platform. NHM: SSE. Intel® Xeon® E5/E5v2: SSE, AVX. Intel® Xeon® E5v3/E5v4: SSE, AVX, AVX2. Intel® Xeon Phi™ processor family: SSE, AVX, AVX2, AVX-512F, AVX-512CD, AVX-512ER, AVX-512PF. Intel® Xeon® Scalable processor family: SSE, AVX, AVX2, AVX-512F, AVX-512CD, AVX-512BW, AVX-512DQ, AVX-512VL, plus MPX, SHA, …]

Intel® AVX-512 (1/3)

• The Intel® AVX-512 instruction set is split into different subsets
• Intel® AVX-512 Foundation (AVX-512F):
  • Extension of the known AVX instruction sets, including mask registers
  • Available in all products supporting Intel® AVX-512

Double/Quadword Integer Arithmetic: including gather/scatter with double/quad-word indices
Math Support: IEEE division and square root; DP FP transcendental primitives; new transcendental support instructions
New Permutation Primitives: two-source shuffles; compress & expand
Bit Manipulation: vector rotate; universal ternary logical operation; new mask instructions

Intel® AVX-512 (2/3)

• Intel® AVX-512 Conflict Detection (AVX-512CD):
  • Checks for identical values inside a vector (for 32- or 64-bit integers)
  • Used for finding colliding indices (32 or 64 bit) before a gather-operation-scatter sequence
  • Likely to be available in the future for both Intel® Xeon Phi™ and Intel® Xeon® processors
• Intel® AVX-512 Vector Length Extension (AVX-512VL):
  • Freely select the vector length: 512 bit, 256 bit or 128 bit
  • Orthogonal extension, but planned for future Intel® Xeon® processors only
• Intel® AVX-512 Byte/Word (AVX-512BW) and Doubleword/Quadword (AVX-512DQ):
  • Two groups, planned for future Intel® Xeon® processors:
    • 8- and 16-bit integers (BW)
    • 32- and 64-bit integers and FP (DQ)
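The classic pattern AVX-512CD targets is a histogram-style update, where several lanes of one vector may address the same bin; a minimal sketch (names are illustrative):

subroutine histogram(n, nbins, bin, h)
  integer, intent(in) :: n, nbins, bin(n)
  integer, intent(inout) :: h(nbins)
  integer :: i
  ! Without conflict detection the compiler must assume bin(i) values can
  ! collide within a vector; AVX-512CD detects duplicates at run time so
  ! this gather-update-scatter loop can still be vectorized.
  do i = 1, n
    h(bin(i)) = h(bin(i)) + 1
  end do
end subroutine histogram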

Intel® AVX-512 (3/3)

• Intel® AVX-512 Exponential & Reciprocal Instructions (AVX-512ER):
  • Higher-accuracy HW-based square root and reciprocal (28 bit) and exponential function (22 bit)
  • Likely only for future Intel® Xeon Phi™ processors
• Intel® AVX-512 Prefetch Instructions (AVX-512PF):
  • Manage data streams for higher throughput (incl. gather and scatter)
  • Likely only for future Intel® Xeon Phi™ processors
• More information: https://software.intel.com/en-us/blogs/additional-avx-512-instructions

Intel® AVX-512 registers

• Extended VEX encoding (EVEX) introduces another prefix
• Extends the previous AVX and SSE registers to 512 bit:
  • 32-bit mode: 8 ZMM registers (same count as YMM/XMM)
  • 64-bit mode: 32 ZMM registers (2x the YMM/XMM count)
  • 32 FP* and 8 K/mask registers (K0 is special)

[Diagram: XMM0-15 (128 bit) and YMM0-15 (256 bit) are available in 32- and 64-bit mode; ZMM0-31 (512 bit) and K0-7 mask registers in 64-bit mode.]

• No penalty when switching between XMM, YMM and ZMM!

* Twice as many FP registers (when compared to AVX/AVX2). In fact, AVX2 code can use 32 registers as well (via EVEX).

Intel® AVX-512 vector types

Intel® AVX-512:
• 16x single-precision FP
• 8x double-precision FP
• 64x 8-bit integer
• 32x 16-bit integer
• 16x 32-bit integer
• 8x 64-bit integer

plus plain 512-bit operations and 64-bit masks

• Includes AVX and AVX2

Intel® Turbo Boost and AVX*

• The amount of turbo frequency achieved depends on: the type of workload, the number of active cores, estimated current and power consumption, and processor temperature
• Due to this workload dependency, separate AVX base and turbo frequencies are defined for Intel® Xeon® processors with the 4th generation Intel® Core™ architecture and later

[Diagram: frequency ranges. Previous generations: one shared AVX/rated base and AVX/rated turbo. 4th generation Intel® Core™ architecture and later: a lower AVX base and AVX turbo below the rated base and rated turbo.]

*AVX refers to Intel® AVX, Intel® AVX2 or Intel® AVX-512

Contents

• Parallelism on Intel Architecture
• Vectorization of code: overview and terminology
• Introduction to SIMD ISA for Intel® processors
  • Intel® AVX, Intel® AVX2, Intel® AVX-512
  • SIMD instruction set switches for Intel® compilers
• SIMD vectorization

SIMD instruction set switches (1/3) for Intel® compilers

• Linux*, OS X*: -x<feature>, Windows*: /Qx<feature>
  • Might enable Intel processor-specific optimizations
  • A processor check is added to the “main” routine: the application errors out, with an appropriate/informative message, if the SIMD feature is missing or a non-Intel processor is detected
• Linux*, OS X*: -ax<feature>, Windows*: /Qax<feature>
  • Multiple code paths: baseline and optimized/processor-specific
  • Optimized code paths for Intel processors as defined by <feature>
  • Multiple SIMD features/paths are possible, e.g.: -axSSE2,CORE-AVX2
  • The baseline code path defaults to -msse2 (/arch:sse2)
  • The baseline code path can be modified by -m<feature> or -x<feature> (/arch:<feature> or /Qx<feature>)
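Illustrative command lines (the source file name is a placeholder):

$ ifort -xCORE-AVX2 prog.f90               # Intel-specific path + run-time processor check
$ ifort -axCORE-AVX512,CORE-AVX2 prog.f90  # SSE2 baseline plus two optimized code paths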

SIMD instruction set switches (2/3) for Intel® compilers

• Linux*, OS X*: -m<feature>, Windows*: /arch:<feature>
  • Neither a processor check nor Intel-specific optimizations: the application is optimized for both Intel and non-Intel processors for the selected SIMD feature
  • The missing check can cause the application to fail if the extension is not available
  • Default for Linux*: -msse2, Windows*: /arch:sse2
    • Activated implicitly
    • Implies the need for a target processor with at least Intel® SSE2
  • Default for OS X*: -xsse3 (IA-32), -xssse3 (Intel® 64)

SIMD instruction set switches (3/3) for Intel® compilers

• Special: Linux*, OS X*: -xHost, Windows*: /QxHost
  • The compiler checks the SIMD features of the host processor (where the build runs) and makes use of the latest SIMD feature available
  • The code only executes on processors with the same SIMD feature (or later) as the build host
  • As with -x or /Qx: if the “main” routine is built with -xHost or /QxHost, the final executable only runs on Intel processors
• Disabling vectorization: Linux*, OS X*: -no-vec, Windows*: /Qvec-
  • Disables vectorization for the compile unit
  • The compiler can still use some SIMD features
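Again illustrative:

$ ifort -xHost prog.f90    # target the best SIMD feature of the build machine
$ ifort -no-vec prog.f90   # keep other optimizations, but do not vectorize loops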

SIMD feature set names (1/2) for Intel® compilers

AVX
  May generate Intel® Advanced Vector Extensions (Intel® AVX), SSE4.2, SSE4.1, SSE3, SSE2, SSE and Intel SSSE3 instructions.

ATOM_SSE4.2
  May generate MOVBE instructions for Intel processors (depending on the setting of -minstruction or /Qinstruction). May also generate Intel® SSE4.2, SSE3, SSE2 and SSE instructions for Intel processors. Optimizes for Intel® Atom™ processors that support Intel® SSE4.2 and MOVBE instructions.

SSE4.2
  May generate Intel® SSE4.2, SSE4.1, SSE3, SSE2, SSE and Intel SSSE3 instructions.

SSE4.1
  May generate Intel® SSE4.1, SSE3, SSE2, SSE and Intel SSSE3 instructions.

ATOM_SSSE3
  May generate MOVBE instructions for Intel processors (depending on the setting of -minstruction or /Qinstruction). May also generate Intel® SSSE3, SSE3, SSE2 and SSE instructions for Intel processors. Optimizes for Intel® Atom™ processors that support Intel® SSSE3 and MOVBE instructions. (Deprecated aliases: SSE3_ATOM & SSSE3_ATOM.)

SSSE3
  May generate Intel® SSSE3, SSE3, SSE2 and SSE instructions.

SSE3
  May generate Intel® SSE3, SSE2 and SSE instructions.

SSE2
  May generate Intel® SSE2 and SSE instructions.

SIMD feature set names (2/2) for Intel® compilers

CORE-AVX512
  May generate Intel® Advanced Vector Extensions 512 (Intel® AVX-512) Foundation instructions, Intel® AVX-512 Conflict Detection instructions, and the other AVX-512 subsets that will be available on future Intel® Xeon® architectures. Optimizes for Intel® processors that support Intel® AVX-512 instructions. Sets -qopt-zmm-usage=low by default.

MIC-AVX512
  May generate Intel® AVX-512 Foundation instructions, Intel® AVX-512 Conflict Detection instructions, Intel® AVX-512 Exponential and Reciprocal instructions, Intel® AVX-512 Prefetch instructions for Intel® processors, and the instructions enabled with CORE-AVX2. Optimizes for Intel® processors that support Intel® AVX-512 instructions.

COMMON-AVX512
  May generate Intel® AVX-512 Foundation instructions and Intel® AVX-512 Conflict Detection instructions. Optimizes for Intel® processors that support Intel® AVX-512 instructions. Sets -qopt-zmm-usage=high by default.

CORE-AVX2
  May generate Intel® Advanced Vector Extensions 2 (Intel® AVX2), Intel® AVX, SSE4.2, SSE4.1, SSE3, SSE2, SSE and Intel SSSE3 instructions.

CORE-AVX-I
  May generate Intel® Advanced Vector Extensions (Intel® AVX), including instructions introduced with 3rd generation Intel® Core™ processors, plus Intel® SSE4.2, SSE4.1, SSE3, SSE2, SSE and Intel SSSE3 instructions.

* AVX2: -mavx2 -mfma; AVX-512: -mavx512f -mavx512cd; AVX-512/Core: -mavx512f -mavx512cd -mavx512dq -mavx512bw -mavx512vl

Compatibility (HW & SW)

Compiler compatibility:
• Linux*: GCC
• Windows*: Microsoft* Visual Studio
• MacOS*: Apple XCode*

Will my code run on a non-Intel CPU?
• Yes (32-/64-bit compatibles)
• The Intel compiler is a plug-&-play replacement for the platform compiler (MS/GCC), but with improved code performance: object and binary compatibility, comparable option flags

Compatibility (cont.)

The compiler can generate code that is:
• Intel-friendly: control using -x (Linux) / /Qx (Windows); will only run on Intel CPUs
• Compatible-friendly: control using -m (Linux) / /arch (Windows); will run on Intel and non-Intel CPUs

Contents

• Parallelism on Intel Architecture
• Vectorization of code: overview and terminology
• Introduction to SIMD ISA for Intel® processors
  • Intel® AVX, Intel® AVX2, Intel® AVX-512
  • SIMD instruction set switches for Intel® compilers
• SIMD vectorization

Vectorization software architecture

Vector options, ordered from ease of use (top) to fine control (bottom):

• Intel® Math Kernel Library
• Auto vectorization
• Semi-auto vectorization: #pragma (vector, ivdep, …)
• Array notation: Fortran, Intel® Cilk™ Plus
• C/C++ vector classes (F32vec16, F64vec8)
• OpenCL*
• Intrinsics

Overview of vector code types

• Auto-vectorization

  DO i = 1, N
    A(i) = B(i) + C(i)
  END DO

• Array notation

  A(:) = B(:) + C(:)

• OpenMP SIMD construct

  !$omp simd
  DO i = 1, N
    A(i) = B(i) + C(i)
  END DO

• OpenMP SIMD function

  real function ef(a, b)
    real :: a, b
    !$omp declare simd linear(ref(a,b))
    ef = a + b
  end function ef

  !$omp simd
  DO i = 1, N
    A(i) = ef(B(i), C(i))
  END DO

Automatic vectorization

• The compiler vectorizer works similarly for SSE, AVX, AVX2 and AVX-512 (C/C++, Fortran)
• Enabled by default at optimization level -O2
• Some ISA features, such as vector masks, gather/scatter instructions and fused multiply-add (FMA), enable better vectorization of code
• Vectorized loops may be recognized by:
  • Compiler vectorization and optimization reports: -qopt-report-phase=vec -qopt-report=5
  • Looking at the assembly code: -S
  • Using Intel® VTune™ or Advisor

Optimization Report Example

• Example novec.f90:

  1: subroutine fd(y)
  2:   integer :: i
  3:   real, dimension(10), intent(inout) :: y
  4:   do i=2,10
  5:     y(i) = y(i-1) + 1
  6:   end do
  7: end subroutine fd

$ ifort novec.f90 -qopt-report=5
ifort: remark #10397: optimization reports are generated in *.optrpt files in the output location

$ cat novec.optrpt
…
LOOP BEGIN at novec.f90(4,5)
  remark #15344: loop was not vectorized: vector dependence prevents vectorization
  remark #15346: vector dependence: assumed FLOW dependence between y line 5 and y line 5
  remark #25436: completely unrolled by 9
LOOP END
…

Reasons why automatic vectorization fails

• The compiler prioritizes code correctness
• Compiler heuristics estimate vectorization efficiency
• Vectorization could lead to incorrect or inefficient code due to:
  • Data dependencies
  • Alignment
  • Function calls in the loop block
  • Complex control flow / conditional branches
  • Mixed data types
  • Non-unit stride between elements
  • Loop body too complex (register pressure)
  • ...
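When a dependence is only assumed rather than real, the programmer can assert its absence; a sketch using the Intel-specific !dir$ ivdep directive (names are illustrative, and the assertion is only valid if idx(:) contains no duplicate values):

subroutine scatter_add(n, idx, a, b)
  integer, intent(in) :: n, idx(n)
  real, intent(inout) :: a(:)
  real, intent(in) :: b(n)
  integer :: i
  !dir$ ivdep   ! promise to the compiler: no two idx(i) are equal
  do i = 1, n
    a(idx(i)) = a(idx(i)) + b(i)
  end do
end subroutine scatter_add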

Factors preventing code vectorization

1. Loop-carried dependencies

  DO i = 1, N
    A(i + M) = A(i) + B(i)
  END DO

2. Pointer aliasing (compiler-specific)

  SUBROUTINE scale(a, b, z)
    REAL, POINTER, CONTIGUOUS :: a(:), b(:)
    REAL :: z
    INTEGER :: i
    DO i = 1, N
      b(i) = z * a(i)
    END DO
  END SUBROUTINE

3. Function calls (incl. indirect)

  DO i = 1, NX
    x = x0 + i * h
    sumx = sumx + func(x, y, xp)
  END DO

4. Loop structure, boundary condition

  struct _x { int d; int bound; };
  void doit(int *a, struct _x *x) {
    for(int i = 0; i < x->bound; i++)
      a[i] = 0;
  }

5. Outer vs. inner loops

  DO j = 1, JMAX
    DO i = 1, IMAX
      D(i,j) = D(i,j) + 1
    END DO
  END DO

6. Cost-benefit (compiler-specific)

7. And others...

Factors slowing down vectorized code

1. Indirect memory access

  DO i = 1, N
    A(B(i)) = C(i) * D(i)
  END DO

2. Memory sub-system latency / throughput

  SUBROUTINE scale(n, j, a, b, c)
    INTEGER :: n, j, i
    REAL, CONTIGUOUS :: a(:,:), b(:), c(:)
    DO i = 1, VERY_BIG
      c(i) = z * a(i,j)
      b(i) = z * a(i,i)
    END DO
  END SUBROUTINE

3. Serialized or “sub-optimal” function calls

  DO i = 1, NX
    sumx = sumx + serialized_func_call(x, y, xp)
  END DO

4. Small trip counts that are not a multiple of VL

  SUBROUTINE doit(a, b, unknown_small_value)
    REAL, CONTIGUOUS :: a(:), b(:)
    INTEGER :: unknown_small_value, i
    DO i = 1, unknown_small_value
      a(i) = z * b(i)
    END DO
  END SUBROUTINE

5. Branchy codes

  DO i = 1, MAX
    IF (D(i) < N) THEN
      call do_this(D(i))
    ELSE IF (D(i) > M) THEN
      call do_that(D(i))
    END IF
  END DO

6. MANY others: spill/fill, FP accuracy trade-offs, FMA, DIV/SQRT, unrolling, even AVX throttling...

Preparing code for SIMD

[Flowchart:]
1. Identify hotspots
2. Integer or FP? If FP, precision is important: can you convert to single precision? (A change to SP impacts the SIMD width.)
3. Re-layout data for SIMD efficiency
4. Align data structures
5. Convert code to SIMD form, following the SIMD coding guidelines
6. Optimize memory access patterns and prefetch (if appropriate)
7. Further optimization

Data layout – why it is important

• Instruction-level
  • Hardware is optimized for contiguous loads/stores
  • Support for non-contiguous accesses differs with hardware (e.g., AVX2/AVX-512 gather)
• Memory-level
  • Contiguous memory accesses are cache-friendly
  • The number of memory streams can place pressure on the prefetchers

Data layout – common layouts

Array-of-Structs (AoS): x y z x y z x y z …
• Pros: good locality of {x, y, z}, 1 memory stream
• Cons: potential for gather/scatter

Struct-of-Arrays (SoA): x x x … / y y y … / z z z …
• Pros: contiguous load/store
• Cons: poor locality of {x, y, z}, 3 memory streams

Hybrid (AoSoA): x x y y z z x x y y z z …
• Pros: contiguous load/store, 1 memory stream
• Cons: not a “normal” layout
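In Fortran the two main layouts might look as follows (a sketch; type and component names are illustrative):

module layouts
  implicit none
  type point_aos                            ! AoS: x y z interleaved per element
    real :: x, y, z
  end type point_aos
  type points_soa                           ! SoA: each component is contiguous
    real, allocatable :: x(:), y(:), z(:)
  end type points_soa
end module layouts
! type(point_aos) :: p(1000)  stores  x y z x y z ...
! a points_soa value stores   x x x ... / y y y ... / z z z ...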

Data alignment – why it is important

[Diagram: elements 0-3 sit entirely within cache line 0; elements 6-9 span cache lines 0 and 1.]

▪ Aligned load: the address is aligned; one cache line; one instruction
▪ Unaligned load: the address is not aligned; potentially multiple cache lines; potentially multiple instructions

Data alignment – sample applications

• 1) Align the memory

  REAL, ALLOCATABLE :: array(:,:)
  !dir$ attributes align:64::array

• 2) Access the memory in an aligned way

  DO J=1,M
    DO I=1,N
      ARRAY(I,J)=…

• 3) Tell the compiler

  !dir$ assume_aligned (array, 64)
  ! or
  !$omp simd aligned(array:64)
  !dir$ assume (mod(N,16) .eq. 0)  ! important for rank > 1

Alignment impact – example

• Both cases compiled using -xCORE-AVX512

• (Potentially) unaligned access:

  subroutine mult(n, a, b, c)
    integer, intent(in) :: n
    real, intent(in) :: a(n), b(n)
    real, intent(out) :: c(n)
    integer :: i
    !$omp simd
    do i=1,n
      c(i) = a(i) * b(i)
    end do
  end subroutine mult

  LOOP BEGIN at mult.F90(9,3)
    remark #15389: vectorization support: reference c(i) has unaligned access [ mult.F90(10,6) ]
    remark #15389: vectorization support: reference a(i) has unaligned access [ mult.F90(10,13) ]
    remark #15389: vectorization support: reference b(i) has unaligned access [ mult.F90(10,20) ]
    remark #15381: vectorization support: unaligned access used inside loop body
    remark #15305: vectorization support: vector length 16
    remark #15309: vectorization support: normalized vectorization overhead 1.778
    remark #15301: OpenMP SIMD LOOP WAS VECTORIZED
    remark #15450: unmasked unaligned unit stride loads: 2
    remark #15451: unmasked unaligned unit stride stores: 1
    remark #15475: --- begin vector cost summary ---
    remark #15476: scalar cost: 6
    remark #15477: vector cost: 0.560
    remark #15478: estimated potential speedup: 8.780
    remark #15488: --- end vector cost summary ---
  LOOP END

• Aligned access:

  subroutine amult(n, a, b, c)
    integer, intent(in) :: n
    real, intent(in) :: a(n), b(n)
    real, intent(out) :: c(n)
    integer :: i
    !$omp simd aligned(a,b,c:64)
    do i=1,n
      c(i) = a(i) * b(i)
    end do
  end subroutine amult

  LOOP BEGIN at amult.F90(9,3)
    remark #15388: vectorization support: reference c(i) has aligned access [ amult.F90(10,6) ]
    remark #15388: vectorization support: reference a(i) has aligned access [ amult.F90(10,13) ]
    remark #15388: vectorization support: reference b(i) has aligned access [ amult.F90(10,20) ]
    remark #15305: vectorization support: vector length 16
    remark #15301: OpenMP SIMD LOOP WAS VECTORIZED
    remark #15448: unmasked aligned unit stride loads: 2
    remark #15449: unmasked aligned unit stride stores: 1
    remark #15475: --- begin vector cost summary ---
    remark #15476: scalar cost: 6
    remark #15477: vector cost: 0.310
    remark #15478: estimated potential speedup: 18.000
    remark #15488: --- end vector cost summary ---
  LOOP END

Data alignment – real-life applications

[Diagram, built up over several animation steps: a 1-D data array (indices 0 1 2 3 4 5 6 7 8 9 …) surrounded by halo cells, with padding added at the end; the padding is not strictly necessary.]

Recommendation: Data Alignment

• Adopt LAPACK/BLAS-like interfaces that allow “leading dimension(s)”.
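A sketch of what such an interface looks like (routine and argument names are hypothetical): the leading dimension lda decouples the logical row count m from the allocated row count, so arrays can be padded for alignment without changing the call.

subroutine scale_matrix(m, n, a, lda, z)
  integer, intent(in) :: m, n, lda      ! lda >= m allows padded rows
  real, intent(in) :: z
  real, intent(inout) :: a(lda, n)
  integer :: i, j
  do j = 1, n
    do i = 1, m                         ! only the m logical rows are touched
      a(i, j) = z * a(i, j)
    end do
  end do
end subroutine scale_matrix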

Summary

• SIMD is (yet) another level of parallelism available in the CPU
• Compiler flags can be used to target a specific ISA
  • This is unrelated to the optimization level (-O<n>)
  • Performance := optimization level + target flag
• Data layout is essential for efficient SIMD

Improved Floating Point Consistency

Three Metrics of Interest: Accuracy, Performance, Reproducibility
• Usually in conflict with each other!
• Careful use of compiler options can help control the trade-offs

Compiler Option -fp-model <name>
• Lots of options, but they can be confusing: fast=1|2, double/extended/source, precise, except, strict, fma(-)
• The options span a spectrum from speed (fast=2, fast, …) to FP precision & reproducibility (…, strict)

Use these two switches – it’s easier

-fp-model=consistent
• A good portable compromise (17.0+)
• Cost: ~12% performance

-fimf-use-svml=true
• Eliminates inconsistency between scalar code and vectorized code
• Can recover performance, with a slight loss of accuracy

SVML: Short Vector Math Library. One can still use -fp-model along with -fimf-use-svml and -fimf-arch-consistency. See https://software.intel.com/en-us/articles/consistency-of-floating-point-results-using-the-intel-compiler
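An illustrative invocation combining the two switches (mirroring the spelling used above; the source file name is a placeholder):

$ ifort -fp-model=consistent -fimf-use-svml=true prog.f90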

Benefits of using -fp-model=consistent

Without this switch:
• Aggressive optimizations
• Changes operation order for the specific architecture
• Favours performance over accuracy
• Uses optimized math libraries (imf)

With this switch:
• Robust, conservative optimizations
• Preserves accuracy across architectures
• Uses portable math libraries

Benefits of using -fimf-use-svml=true

Without this switch (and with -fp-model=consistent):
• Scalar code uses LIBM, vectorised code may use SVML
• Different algorithms are used (scalar vs. vector)
• Leads to different results

With this switch:
• Scalar code uses SVML, vectorised code uses SVML
• The same algorithms are used
• Leads to similar results

Additionally, loops containing math function calls can be vectorised (often the compiler cannot vectorise loops that have function calls within them).

Conditional Numerical Reproducibility (CNR) in Intel® Math Kernel Library