Intel® Architecture and Tools
Klaus-Dieter Oertel, Intel-SSG-Developer Products Division
FZ Jülich, 22-05-2017

The “Free Lunch” is over, really

Processor clock rate growth halted around 2005.
Source: © 2014, James Reinders, Intel, used with permission
Software must be parallelized to realize all the potential performance.

Optimization Notice. Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.

Changing Hardware Impacts Software

More cores, more threads, wider vectors:

Processor                                              Up to Core(s)   Up to Threads   SIMD Width   Vector ISA
Intel® Xeon® Processor 64-bit                          1               2               128          Intel® SSE3
Intel® Xeon® Processor 5100 series                     2               2               128          Intel® SSE3
Intel® Xeon® Processor 5500 series                     4               8               128          Intel® SSE4.1
Intel® Xeon® Processor 5600 series                     6               12              128          Intel® SSE4.2
Intel® Xeon® Processor E5-2600 v2 series               12              24              256          Intel® AVX
Intel® Xeon® Processor E5-2600 v3 series, v4 series    18-22           36-44           256          Intel® AVX2
Future Intel® Xeon® Processor (1)                      TBD             TBD             512          Intel® AVX-512
Intel® Xeon Phi™ x100 Coprocessor (KNC)                61              244             512          IMCI 512
Intel® Xeon Phi™ x200 Processor & Coprocessor (KNL)    72              288             512          Intel® AVX-512
Future Intel® Xeon Phi™ (KNH)                          TBD             TBD             TBD          TBD

Product specification for launched and shipped products available on ark.intel.com.
(1) Not launched or in planning.
High performance software must be both:
• Parallel (multi-thread, multi-process)
• Vectorized

Untapped Potential Can Be Huge!

Threaded + Vectorized can be much faster than either one alone.
The difference is growing with each new generation of hardware.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance
Configurations at the end of this presentation.

Performance Scaling from the Core, to Multicore, to Many Core and Beyond – to Cluster

Extracting performance from the computing resources:
• Core: vectorization, prefetching, cache utilization
• Multi-/many-core (processor/socket) level parallelization
• Multi-socket (node) level parallelization
• Cluster scaling

Diagram labels: Sequential Intel® MKL; MKL + OpenMP; Many Core Intel® Xeon Phi™ Coprocessor; MKL + Intel® MPI

Moore’s “Law”

“The number of transistors on a chip will double approximately every two years.” [Gordon Moore]

Figure: Moore's Law graph, 1965. Source: http://en.wikipedia.org/wiki/Moore's_law

Desktop, Mobile & Server: Tick/Tock Model

Tock: Innovate (new microarchitecture). Tick: Shrink (new process technology).

Codename              Brand                           Change                    Process   Step
Nehalem (2008)        Intel® Core™                    New Microarchitecture     45nm      Tock
Westmere (2010)       Intel® Core™                    New Process Technology    32nm      Tick
Sandy Bridge (2011)   2nd Generation Intel® Core™     New Microarchitecture     32nm      Tock
Ivy Bridge (2012)     3rd Generation Intel® Core™     New Process Technology    22nm      Tick
Haswell (2013)        4th Generation Intel® Core™     New Microarchitecture     22nm      Tock
Broadwell (2014)                                      New Process Technology    14nm      Tick
Skylake                                               New Microarchitecture     14nm      Tock
Future                                                New Process Technology    11nm      Tick
Future                                                New Microarchitecture     11nm      Tock

Desktop, Mobile & Server: Your Source for Intel® Product Information

Naming schemes:
• Desktop & Mobile:
  - Intel® Core™ i3/i5/i7 processor family
  - 4 generations, e.g.: 4th Generation Intel® Core™ i7-XXXX
• Server:
  - Intel® Xeon® E3/E5/E7 processor family
  - 3 generations, e.g.: Intel® Xeon® Processor E3-XXXX v3

Information about available Intel products can be found here: http://ark.intel.com/

Haswell Processor at JURECA: E5-2680v3

See ark.intel.com for more details.

# Cores                        12
Non-AVX Reference Frequency    2500 MHz
Non-AVX Max Turbo Frequency    3300 MHz
AVX Reference Frequency        2100 MHz
AVX Max Turbo Frequency        3100 MHz
L3 Cache Size                  30 MB
QPI                            9.6 GT/s

E5-2680v3: Turbo bins in GHz for number of cores being used (see here for more)

Cores     1-2   3     4     5     6     7     8     9     10    11+
Non-AVX   3.3   3.1   3.0   2.9   2.9   2.9   2.9   2.9   2.9   2.9
AVX       3.1   2.9   2.8   2.8   2.8   2.8   2.8   2.8   2.8   2.8

Desktop, Mobile & Server: Performance

• Following Moore’s Law:

Microarchitecture   Instruction Set           SP FLOPs per     DP FLOPs per     L1 Cache Bandwidth           L2 Cache Bandwidth
                                              Cycle per Core   Cycle per Core   (bytes/cycle)                (bytes/cycle)
Nehalem             SSE (128-bits)            8                4                32 (16B read + 16B write)    32
Sandy Bridge        Intel® AVX (256-bits)     16               8                48 (32B read + 16B write)    32
Haswell             Intel® AVX2 (256-bits)    32               16               96 (64B read + 32B write)    64

• Example of theoretical peak FLOP rates:
  - Intel® Core™ i7-2710QE (Sandy Bridge): 2.1 GHz * 16 SP FLOPs * 4 cores = 134.4 SP GFLOPs
  - Intel® Core™ i7-4765T (Haswell): 2.0 GHz * 32 SP FLOPs * 4 cores = 256 SP GFLOPs
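The peak-rate arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name `peak_gflops` and the dictionary `FLOPS_PER_CYCLE` are hypothetical names chosen here, and the FLOPs-per-cycle values are copied from the table on this slide. The last line applies the same formula to the E5-2680v3 at its AVX reference frequency; that number is an extrapolation from the two tables and is not stated in the slides.

```python
def peak_gflops(freq_ghz: float, flops_per_cycle: int, cores: int) -> float:
    """Theoretical peak in GFLOPs: clock (GHz) * FLOPs per cycle per core * cores."""
    return freq_ghz * flops_per_cycle * cores


# FLOPs per cycle per core, copied from the performance table (hypothetical name).
FLOPS_PER_CYCLE = {
    "Nehalem": {"sp": 8, "dp": 4},
    "Sandy Bridge": {"sp": 16, "dp": 8},
    "Haswell": {"sp": 32, "dp": 16},
}

if __name__ == "__main__":
    # The two examples from the slide.
    sandy = peak_gflops(2.1, FLOPS_PER_CYCLE["Sandy Bridge"]["sp"], 4)
    haswell = peak_gflops(2.0, FLOPS_PER_CYCLE["Haswell"]["sp"], 4)
    print(f"i7-2710QE (Sandy Bridge): {sandy:.1f} SP GFLOPs")
    print(f"i7-4765T (Haswell):       {haswell:.1f} SP GFLOPs")

    # Assumption: E5-2680v3 with all 12 cores at the 2.1 GHz AVX reference clock.
    jureca_dp = peak_gflops(2.1, FLOPS_PER_CYCLE["Haswell"]["dp"], 12)
    print(f"E5-2680v3 (estimate):     {jureca_dp:.1f} DP GFLOPs")
```

Real code rarely reaches these values, since the formula assumes every core issues full-width vector fused multiply-add instructions in every cycle.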
Desktop, Mobile & Server: Caches

Cache hierarchy (diagram): a processor with Core 1, Core 2, ..., Core n; each core has its own L1 instruction (I) and data (D) caches and its own L2 cache; one L3 cache serves all cores.

Level                               Latency (cycles)                  Bandwidth (per core per cycle)   Size
L1-D                                4                                 2x 16 bytes                      32KiB
L2 (unified)                        12                                1x 32 bytes                      256KiB
L3 (LLC)                            26-31                             1x 32 bytes                      varies (≥ 2MiB per core)
L2 and L1 D-Cache in other cores    43 (clean hit), 60 (dirty hit)

Example for Haswell

Processor Architecture Basics: Core vs. Uncore

• Core: the processor core’s logic:
  - Execution units
  - Core caches (L1/L2)
  - Buffers & registers
  - …
• Uncore: everything outside a processor core:
  - Memory controller/channels (MC) and Intel® QuickPath Interconnect (QPI)
  - L3 cache shared by all cores
  - Type of memory
  - Power management and clocking
  - Optionally: integrated graphics

Only the uncore differs within the same processor family!

Diagram labels: Processor; Core 1, Core 2, ..., Core n; DDR3 or DDR4; L3 Cache; Uncore: MC, QPI, Graphics, Clock & Power

Processor Architecture Basics: NUMA - Memory, Bandwidth & Latency

Memory allocation:
• Differentiate: implicit vs. explicit memory allocation
• Explicit allocation with NUMA-aware libraries, e.g. libnuma (Linux*) or tbbmalloc
• Bind memory to (SW) thread, and (SW) thread to processor
• More information on optimizing for performance: https://software.intel.com/de-de/articles/optimizing-applications-for-numa

Diagram: Socket 1 (Processor 1) and Socket 2 (Processor 2), each with its own memory controller (MC) and attached memory, connected through QPI. Access to a processor's own memory is local; access to the other socket's memory is remote.

Performance:
• Remote memory access latency ~1.7x greater than local memory
• Local memory bandwidth can be up to ~2x greater than remote
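To get a feel for what the ~1.7x latency figure means for an application, the following Python sketch estimates the average memory latency when only part of the accesses are remote. It is a deliberately simple model and an assumption of this example, not something the slides state: it treats average latency as a linear mix of local and remote accesses and ignores caches, bandwidth limits and contention. The function name `mean_latency_factor` is hypothetical.

```python
def mean_latency_factor(remote_fraction: float, remote_penalty: float = 1.7) -> float:
    """Average memory latency relative to an all-local run (1.0 means all local).

    remote_fraction: share of memory accesses served by the other socket (0..1).
    remote_penalty:  remote latency divided by local latency (~1.7 per the slide).
    """
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0 and 1")
    # Linear mix (assumption): local accesses cost 1.0, remote ones cost remote_penalty.
    return (1.0 - remote_fraction) + remote_fraction * remote_penalty


if __name__ == "__main__":
    for fraction in (0.0, 0.25, 0.5, 1.0):
        factor = mean_latency_factor(fraction)
        print(f"{fraction:.0%} remote accesses -> {factor:.3f}x local latency")
```

Under this model, a thread whose data is spread evenly over both sockets sees roughly 1.35x the local latency, which is one reason to bind memory and threads to the same socket.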
Processor Architecture Basics: NUMA - Thread Affinity & Enumeration

Non-NUMA: Thread affinity might be beneficial (e.g. cache locality) but is not required.

NUMA: Thread affinity is required:
• Improve accesses to local memory vs. remote memory
• Ensure 3rd party components support affinity mapping, e.g.:
  - Intel® OpenMP* via $KMP_AFFINITY or $OMP_PLACES
  - Intel® MPI via $I_MPI_PIN_DOMAIN (default may be OK); the tool cpuinfo provides additional information
  - …

Documentation

Intel® 64 and IA-32 Architectures Software Developer Manuals: https://software.intel.com/en-us/articles/intel-sdm
• Intel® 64 and IA-32 Architectures Software Developer’s Manuals
  - Volume 1: Basic Architecture
  - Volume 2: Instruction Set Reference
  - Volume 3: System Programming Guide
• Software Optimization Reference Manual
• Related Specifications, Application Notes, and White Papers

Intel® Processor Numbers (how type names are encoded): http://www.intel.com/products/processor_number