Intel Technology Platform for HPC Intel® Xeon® and Intel® Xeon Phi™ Processor Update

Intel technology platform for HPC Intel® Xeon® and Intel® Xeon Phi™ processor update Manel Fernández Intel HPC Software Workshop Series 2016 HPC Code Modernization for Intel® Xeon and Xeon Phi™ February 17th 2016, Barcelona Today’s Intel solutions for HPC The multi- and many-core era Multi-core Many integrated core (MIC) C/C++/Fortran, OMP/MPI/Cilk+/TBB C/C++/Fortran, OMP/MPI/Cilk+/TBB Bootable, native execution model PCIe coprocessor, native and offload execution models Up to 18 cores, 3 GHz, 36 threads Up to 61 cores, 1.2 GHz, 244 threads Up to 768 GB, 68 GB/s, 432 GFLOP/s DP Up to 16 GB, 352 GB/s, 1.2 TFLOP/s DP 256-bit SIMD, FMA, gather (AVX2) 512-bit SIMD, FMA, gather/scatter, EMU (IMCI) Targeted at general purpose applications Targeted at highly parallel applications Single thread performance (ILP) High parallelism (DLP, TLP) Memory capacity High memory bandwidth 2 Processor Core 1 Core 2 Core n Core L3 Cache DDR3 Clock & MC QPI Graphics or Power DDR4 Intel® CoreTM architecture 3 Intel® CoreTM architecture roadmap The “Tick-Tock” roadmap model Intel® Core™ 2nd 3rd 4th 5th 6th “Nehalem” microarchitecture generation generation generation generation generation Sandy Ivy Nehalem Westmere Haswell Broadwell Skylake Bridge Bridge 45nm 32nm 22nm 14nm 2008 2009 2011 2012 2013 Sep 2014 2015 ADX and 3 other MPX, SGX SSE4.2 AES AVX RNRAND, etc. AVX2 new instructions AVX-512 (Xeon only) Tick (shrink, new process technology) Tock (innovate, new microarchitecture) 4 Haswell execution unit overview Unified Reservation Station Port 4 Port Port 2 Port Port 1 Port Port 0 Port Port 6 Port Port 5 Port Port 7 Port Port 3 Port Integer IntegerInteger LoadLoad && Store Integer Integer Store ALU & Shift ALUALU && LEALEA StoreStore AddressAddress Data ALU & LEA ALU & Shift Address FMA FMA FP Mult Vector FP Multiply FP Add 2xFMA Shuffle th Vector Int VectorVector IntInt • Doubles peak FLOPs Vector Int 4 ALU • Multiply ALUALU • Two FP multiplies ALU Great for integer workloads • Frees Port0 & 1 for vector Vector VectorVector benefits legacy Vector Logicals LogicalsLogicals Logicals Branch Branch Divide Vector Additional Branch Unit Shifts • Reduces Port0 Conflicts Additional AGU for Stores • 2nd EU for high branch code • Leaves Port 2 & 3 open for Loads 6 Microarchitecture buffer sizes Extract more parallelism on every generation Nehalem Sandy Bridge Haswell Skylake Out-of-order 128 168 192 224 Window In-flight Loads 48 64 72 72 In-flight Stores 32 36 42 56 Scheduler Entries 36 54 60 97 Integer Register N/A 160 168 180 File FP Register File N/A 144 168 168 Allocation Queue 28/thread 28/thread 56 64/thread 7 Fused Multiply and Add (FMA) instruction Example: polynomial evaluation Micro- SP FLOPs DP FLOPs Architecture Instruction Set per cycle per cycle Nehalem SSE (128-bits) 8 4 Sandy Bridge AVX (256-bits) 16 8 Haswell AVX2 (FMA) (256-bits) 32 16 2x peak FLOPs/cycle (throughput) Latency Xeon Xeon Ratio (clocks) E5 v2 E5 v3 (lower is better) MulPS, PD 5 5 AddPS, PD 3 3 Mul+Add /FMA 8 5 0.625 >37% reduced latency (5-cycle FMA latency same as an FP multiply) Improves accuracy and performance for commonly used class of algorithms 9 Broadwell: 5th generation Intel® Core™ architecture Microarchitecture changes FP instructions performance improvements • Decreased latency and increased throughput for most divider (radix-1024) uops • Pseudo-double bandwidth for scalar divider uops • Vector multiply latency decrease (from 5 to 3 cycles) STLB improvements • Native, 16-entry 1G STLB array • Increased size of STLB (from 1kB to 1.5kB) Enabled two simultaneous page miss walks Other ISA performance improvements • ADC, CMOV – 1 uop flow • PCLMULQDQ – 2 uop/7 cycles to 1 uop/5 cycles • VCVTPS2PH (mem form) – 4 uops to 3 uops 10 Skylake: 6th generation Intel® Core™ architecture Dedicated server and client IP configurations Improved microarchitecture • Higher capacity front-end (up to 6 instr/cycle) • Improved branch predictor • Deeper Out-of-Order buffers • More execution units, shorter latencies • Deeper store, fill, and write-back buffers • Smarter prefetchers • Improved page miss handling • Better L2 cache miss bandwidth • Improved Hyper-Threading • Performance/watt enhancements New instructions supported • Software Guard Extensions (SGX) • Memory Protection Extensions (MPX) • AVX-512 (Xeon versions only) 11 Processor Core 1 Core 2 Core n Core L3 Cache DDR3 Clock & MC QPI Graphics or Power DDR4 Intel® Xeon® processor architecture 12 Intel® Xeon® processors and platforms Intel® Xeon® E3 Intel® Xeon® E5 Intel® Xeon® E7 E3 E5 2x QPI E7 3x QPI Memory Memory Memory PCIe3 PCIe3 PCIe3 Intel® Xeon® Intel® Xeon® Intel® Xeon® Intel® Xeon® Intel® Xeon® E5-1xxx E5-2xxx E5-4xxx E7-4xxx E7-xxxx E3-1xxx CPU/Socket QPI >4S E5-2600 v3 (Haswell EP) E5-2600 v4 (Broadwell EP) 13 Intel® Xeon® E5-2600 v3 processor overview Power Management 22nm Process (Tock) Per Core P-State (PCPS) PCIe3.0 DMI Uncore Frequency Scaling (UFS) PCI Express 3.0 Energy Efficient Turbo (EET) EP: 40 Lanes System Agent IMC Memory Technology: Intel® Hyper-Threading Core LLC Socket R3 Technology (2 threads/core) Core LLC 4xDDR4 channels Core LLC 1333, 1600, 1866, 2133 MTS Intel® Turbo Boost Core LLC Technology . ~2.5 MB Last Level Cache/Core . Up to 45 MB total LLC Up to 18 Cores . ® Core LLC Intel AVX 2.0 / Haswell New Instruction (HNI) Core LLC Integrated Voltage Regulator Intel® QuickPath Interface Intel® QuickPath Interface (x2) 9.6GT/s New Feature Existing Feature 14 Haswell EP die configurations 14-18 Core (HCC) 10-12 Core (MCC) 4-8 Core (LCC) QPI QPI PCI-E PCI-E PCI-E PCI-E QPI QPI PCI-E PCI-E PCI-E PCI-E QPI QPI PCI-E PCI-E PCI-E PCI-E Link Link X16 X16 X8 X4 (ESI) Link Link X16 X16 X8 X4 (ESI) Link Link X16 X16 X8 X4 (ESI) UBox PCU UBox PCU UBox PCU CB DMA CB DMA CB DMA R3QPI R2PCI R3QPI R2PCI R3QPI R2PCI IOAPIC IOAPIC IOAPIC QPI Agent IIO QPI Agent IIO QPI Agent IIO UP UP UP UP UP DN DN DN DN DN U D D U U D D U LLC Cache D U IDI Core P N QPII N P P N N P / Core 2.5MB Bo N P Bo SAD IDI CBO IDI IDI IDI I I D D I I I I IDI IDI IDI I / / / P / / I I D QPII QPII QPII Q Core Cache LLC LLC Cache Core Core U D Cache LLC LLC Cache D U Core Core U D Cache LLC LLC Cache D U Core Core U D Q Cache LLC Core U D Cache LLC LLC Cache D U Core QPII QPII QPII Q D D I / / / / P P IDI IDI Core Core Core IDI Core Core Core I Core Core Core I Bo Bo Bo Bo Bo P N Bo Bo N P Bo Bo P N Bo Bo I N P Bo Bo P N Bo Bo P N Bo Bo N P Bo I 2.5MB 2.5MB 2.5MB 2.5MB 2.5MB 2.5MB I 2.5MB 2.5MB 2.5MB D I I IDI IDI IDI SAD SAD SAD SAD SAD SAD I SAD SAD SAD CBO CBO CBO CBO CBO CBO CBO CBO CBO IDI IDI IDI I I D D I I I I IDI IDI IDI I P / / / / / I I D QPII QPII QPII Q Core Cache LLC LLC Cache Core Core U D Cache LLC LLC Cache D U Core Core U D Cache LLC LLC Cache D U Core Core U D Q Cache LLC Core U D Cache LLC LLC Cache D U Core QPII QPII QPII Q D D I / / / / P P IDI IDI I I Core Core Core Core Core Core Core Core IDI Core Bo Bo Bo Bo Bo P N Bo Bo N P Bo Bo P N Bo Bo I N P Bo Bo P N Bo Bo P N Bo Bo N P Bo I 2.5MB 2.5MB 2.5MB 2.5MB 2.5MB 2.5MB I 2.5MB 2.5MB 2.5MB D I I IDI IDI I SAD SAD SAD SAD SAD SAD SAD SAD SAD IDI CBO CBO CBO CBO CBO CBO CBO CBO CBO I I IDI IDI IDI D D I I I I I IDI IDI IDI P / / / / / I I D Q Q QPII QPII Core Cache LLC LLC Cache Core Core U D Cache LLC LLC Cache D U Core Core U D Cache LLC LLC Cache D U Core Core U D Cache LLC Core U D QPII Cache LLC LLC Cache D U Core Q QPII QPII QPII D D I / / / P P / I I IDI Core Core Core IDI Core Core Core Core Core IDI Core Bo Bo Bo Bo Bo P N Bo Bo N P Bo Bo P N Bo Bo I N P Bo Bo P N Bo Bo P N Bo Bo N P Bo I 2.5MB 2.5MB 2.5MB 2.5MB 2.5MB 2.5MB I 2.5MB 2.5MB 2.5MB D I I I IDI IDI SAD SAD SAD SAD SAD SAD SAD SAD SAD IDI CBO CBO CBO CBO CBO CBO CBO CBO CBO I I IDI IDI IDI D D I I I I I IDI IDI P IDI / / / / / I I D Q Q QPII QPII Core Cache LLC LLC Cache Core Core U D Cache LLC LLC Cache D U Core Core U D Cache LLC LLC Cache D U Core Core U D Cache LLC Core U D QPII Cache LLC LLC Cache D U Core Q QPII QPII QPII D D I / / / P P / I Core Core Core I IDI Core Core Core IDI Core IDI Bo Bo Bo Bo Bo P N Bo Bo N P Bo Bo P N Bo Bo I N P Bo Bo P N Bo Core Bo Bo Core I 2.5MB 2.5MB 2.5MB 2.5MB 2.5MB 2.5MB I 2.5MB Bo P N 2.5MB 2.5MB N P Bo D I I I IDI IDI SAD SAD SAD SAD SAD SAD SAD SAD SAD IDI CBO CBO CBO CBO CBO CBO CBO CBO CBO LLC Cache D U IDI Core U D QPII D U U D / Core D U 2.5MB Bo N P Bo P N SAD IDI N P P N N P CBO UP DN UP DN DN UP DN UP Home Agent Home Agent Home Agent Home Agent Home Agent DDR DDR DDR DDR DDR DDR DDR DDR DDR DDR Mem Ctlr Mem Ctlr Mem Ctlr Mem Ctlr Mem Ctlr Chop Columns Home Agents Cores Power (W) Transitors (B) Die Area (mm2) HCC 4 2 14-18 110-145 5.69 662 MCC 3 2 6-12 65-160 3.84 492 LCC 2 1 4-8 55-140 2.60 354 15 Cluster on die (COD) mode Supported on 1S/2S HSW-EP SKUs with 2 Home Agents (10+ cores) COD Mode for 18C E5-2600 v3 Cluster0 QPI In memory directory bits and directory cache IIO Cluster1 used on 2S to reduce coherence traffic and 0/1 Cbo Core cache-to-cache transfer latencies Sbo LLC Cbo Cbo Cbo Cbo Targeted at NUMA optimized workloads Core Core Core Core LLC LLC LLC LLC where latency is more important than Cbo Cbo Cbo Cbo Core Core Core Core sharing across Caching Agents (Cbo) LLC LLC LLC LLC Cbo • Reduces average LLC hit and local memory Cbo Cbo Cbo Core Core Core Core LLC latencies LLC LLC LLC Cbo Core • HA sees most requests from reduced set of threads Cbo Cbo Cbo LLC Core Core Core which can lead to higher memory bandwidth LLC LLC LLC Cbo Core OS/VMM own NUMA and process affinity Sbo LLC decisions HA0 HA1 • Exported as 2 NUMA nodes 16 Intel® Xeon® E5-2600 v3 product family Intel® Xeon® E5-2699 v3 (18C, 2.3GHz) vs.

Intel Technology Platform for HPC Intel® Xeon® and Intel® Xeon Phi™ Processor Update

IEEE Paper Template in A4

Computer Science 246 Computer Architecture Spring 2010 Harvard University

Dynamic Scheduling

United States Patent (19) 11 Patent Number: 5,680,565 Glew Et Al

Identifying Bottlenecks in a Multithreaded Superscalar

Advanced Processor Designs

Computer Architecture Out-Of-Order Execution

CSE 586 Computer Architecture Lecture 4

Superscalar Instruction Dispatch with Exception Handling

A Fast, Verified, Cross-Platform Cryptographic Provider

MIPS Architecture with Tomasulo Algorithm [12]

Extracting and Mapping Industry 4.0 Technologies Using Wikipedia