4th Generation Intel® Core™ Processors

Instruction Set Extensions and Developer Product Support

April 2013

© 2013, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.

Intel "Tick-Tock" Roadmap – Part I

Codename      Year  Step  Process  New ISA Extensions
Merom         2006  Tock  65nm     SSSE3    (Intel® Core™ microarchitecture)
Penryn        2007  Tick  45nm     SSE4.1
Nehalem       2008  Tock  45nm     SSE4.2   (codename "Nehalem" microarchitecture)
Westmere      2009  Tick  32nm     AES
Sandy Bridge  2011  Tock  32nm     AVX      (2nd Generation Intel® Core™ microarchitecture)
Ivy Bridge    2012  Tick  22nm     RDRAND etc. (3rd Generation Intel® Core™ microarchitecture)

Tock = new microarchitecture; Tick = new process technology.

Intel "Tick-Tock" Roadmap – Part II

Tick + Tock = Shrink + Innovate

Codename   Year     Step  Process  New ISA Extensions
Haswell    2013     Tock  22nm     AVX2 etc. (4th Generation Intel® Core™ microarchitecture)
Broadwell  >= 2014  Tick  14nm     TBD
TBD        ???      Tock  14nm     TBD
TBD        ???      Tick  10nm     TBD
TBD        ???      Tock  10nm     TBD
TBD        ???      Tick  7nm      TBD

Instruction Set Architecture (ISA) Extensions

199x  MMX, CMOV,      Multiple new instruction sets added to the initial 32-bit
      PAUSE, XCHG, …  instruction set of the Intel® 386 processor
1999  Intel® SSE      70 new instructions for 128-bit single-precision FP support
2001  Intel® SSE2     144 new instructions adding 128-bit integer and double-precision FP support
2004  Intel® SSE3     13 new 128-bit DSP-oriented math and thread synchronization instructions
2006  Intel® SSSE3    16 new 128-bit instructions including fixed-point multiply and horizontal instructions
2007  Intel® SSE4.1   47 new instructions improving media, imaging and 3D workloads
2008  Intel® SSE4.2   7 new instructions improving text processing and CRC
2010  Intel® AES-NI   7 new instructions to speed up AES
2011  Intel® AVX      256-bit FP support, non-destructive (3-operand) forms
2012  Ivy Bridge NI   RNG, 16-bit FP
2013  Haswell NI      AVX2, TSX, FMA, Gather, Bit NI

A long history of ISA extensions!

New Instructions in Haswell

Group                  Description                                                Count *
SIMD Integer           Vector integer instructions promoted to 256 bits           170 / 124
Gather                 Load elements using a vector of indices; a key
                       vectorization enabler for AVX2
Shuffling / Data       Blend, element shift and permute instructions
  Rearrangement
FMA                    Fused Multiply-Add operation forms (FMA3)                   96 / 60
Bit Manipulation and   Improved performance for bit stream manipulation and        15 / 15
  Cryptography         decode, large integer arithmetic and hashes
TSX = RTM + HLE        Transactional synchronization extensions                     4 / 4
Others                 MOVBE: load and store of big-endian forms                    2 / 2
                       INVPCID: invalidate processor context ID

* Total instructions / distinct mnemonics

Haswell Execution Unit Overview

Unified Reservation Station, eight dispatch ports:

Port 0  Integer ALU & Shift; FMA / FP Multiply; Vector Int Multiply; Vector Logicals; Branch; Divide; Vector Shifts
Port 1  Integer ALU & LEA; FMA / FP Multiply / FP Add; Vector Int ALU; Vector Logicals
Port 2  Load & Store Address
Port 3  Load & Store Address
Port 4  Store Data
Port 5  Integer ALU & LEA; Vector Int ALU; Vector Logicals; Vector Shuffle
Port 6  Integer ALU & Shift; Branch
Port 7  Store Address

2x FMA                    Doubles peak FLOPs; two FP multiplies benefit legacy code
4th integer ALU (Port 6)  Great for integer workloads; frees Ports 0 and 1 for vector work
New branch unit (Port 6)  Reduces Port 0 conflicts; 2nd unit for branch-heavy code
New store AGU (Port 7)    Leaves Ports 2 and 3 open for loads

Intel® Microarchitecture (Haswell)

Core Cache Size/Latency/Bandwidth

Metric                 Nehalem           Sandy Bridge      Haswell

L1 Instruction Cache   32K, 4-way        32K, 8-way        32K, 8-way
L1 Data Cache          32K, 8-way        32K, 8-way        32K, 8-way
Fastest load-to-use    4 cycles          4 cycles          4 cycles
Load bandwidth         16 Bytes/cycle    32 Bytes/cycle    64 Bytes/cycle
                                         (banked)
Store bandwidth        16 Bytes/cycle    16 Bytes/cycle    32 Bytes/cycle

L2 Unified Cache       256K, 8-way       256K, 8-way       256K, 8-way
Fastest load-to-use    10 cycles         11 cycles         11 cycles
Bandwidth to L1        32 Bytes/cycle    32 Bytes/cycle    64 Bytes/cycle

L1 Instruction TLB     4K: 128, 4-way    4K: 128, 4-way    4K: 128, 4-way
                       2M/4M: 7/thread   2M/4M: 8/thread   2M/4M: 8/thread
L1 Data TLB            4K: 64, 4-way     4K: 64, 4-way     4K: 64, 4-way
                       2M/4M: 32, 4-way  2M/4M: 32, 4-way  2M/4M: 32, 4-way
                       1G: fractured     1G: 4, 4-way      1G: 4, 4-way
L2 Unified TLB         4K: 512, 4-way    4K: 512, 4-way    4K+2M shared:
                                                           1024, 8-way

All caches use 64-byte lines.
Intel® Microarchitecture (Haswell); Intel® Microarchitecture (Sandy Bridge); Intel® Microarchitecture (Nehalem)

FMA & Peak FLOPS

              Instruction Set   SP FLOPs/cycle  DP FLOPs/cycle
Nehalem       SSE (128-bit)      8               4
Sandy Bridge  AVX (256-bit)     16               8
Haswell       AVX2 & FMA        32              16

• 2 new FMA units provide 2x the peak FLOPs/cycle of the previous generation
• 2x cache bandwidth to feed the wide vector units
  – 32-byte load/store for L1
  – 2x L2 bandwidth
• 5-cycle FMA latency, same as an FP multiply
  – Latency of MUL + ADD down from 8 cycles to 5

[Chart: peak cache bytes/clock vs. peak FLOPs/clock for Banias, Merom, Sandy Bridge and Haswell]

All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice.
Intel® Microarchitecture (Haswell); Intel® Microarchitecture (Sandy Bridge); Intel® Microarchitecture (Merom); Intel® Microarchitecture (Banias)

Intel® Advanced Vector Extensions (Intel® AVX)

Extends all 16 XMM registers to 256-bit "YMM" registers; instructions work on either
– the whole 256 bits – for FP instructions, or
– the lower 128 bits (like existing Intel® SSE instructions)
  • A drop-in replacement for all existing scalar/128-bit SSE instructions
  • The upper part of the register is zeroed out

Provides non-destructive functionality
– Three-operand instructions with two sources and a different destination

The Sandy Bridge (code name) architecture introduced the Intel® AVX instruction set
• 360 instructions (245 mnemonics)

YMM registers are 256 bits wide (FP only in AVX), with the low 128 bits corresponding to the existing XMM registers (128 bits, Intel® SSE; XMM0 is the low half of YMM0).
Intel® AVX2: 256-bit Integer Vectors

• Extends Intel® AVX to cover integer operations
• Uses the same AVX (256-bit) register set
• Nearly all 128-bit integer vector instructions are "promoted" to 256 bits
  – Including Intel® SSE2, Intel® SSSE3, Intel® SSE4
  – Exceptions: GPR moves (MOVD/Q); inserts and extracts < 32b; specials (STTNI instructions, AES, PCLMULQDQ)
• New 256-bit integer vector operations (not present in Intel® SSE)
  – Cross-lane element permutes
  – Element variable shifts
  – Gather
• Haswell doubles the cache bandwidth
  – Two 256-bit loads per cycle
  – Helps both Intel® AVX2 and legacy Intel® AVX performance

Intel® AVX2 completes the 256-bit extensions started with Intel® AVX: provides 256-bit integer, cross-lane permutes, gather…

Integer Instructions Promoted to 256 Bits

VMOVNTDQA VPHADDSW VPSUBQ VPMOVZXDQ VPMULHW VPABSB VPHADDW VPSUBSB VPMOVZXWD VPMULLD VPABSD VPHSUBD VPSUBSW VPMOVZXWQ VPMULLW VPABSW VPHSUBSW VPSUBUSB VPSHUFB VPMULUDQ VPADDB VPHSUBW VPSUBUSW VPSHUFD VPSADBW VPADDD VPMAXSB VPSUBW VPSHUFHW VPSLLD VPADDQ VPMAXSD VPACKSSDW VPSHUFLW VPSLLDQ VPADDSB VPMAXSW VPACKSSWB VPUNPCKHBW VPSLLQ VPADDSW VPMAXUB VPACKUSDW VPUNPCKHDQ VPSLLW VPADDUSB VPMAXUD VPACKUSWB VPUNPCKHQDQ VPSRAD VPADDUSW VPMAXUW VPALIGNR VPUNPCKHWD VPSRAW VPADDW VPMINSB VPBLENDVB VPUNPCKLBW VPSRLD VPAVGB VPMINSD VPBLENDW VPUNPCKLDQ VPSRLDQ VPAVGW VPMINSW VPMOVSXBD VPUNPCKLQDQ VPSRLQ VPCMPEQB VPMINUB VPMOVSXBQ VPUNPCKLWD VPSRLW VPCMPEQD VPMINUD VPMOVSXBW VMPSADBW VPAND VPCMPEQQ VPMINUW VPMOVSXDQ VPCMPGTQ VPANDN VPCMPEQW VPSIGNB VPMOVSXWD VPMADDUBSW VPOR VPCMPGTB VPSIGND VPMOVSXWQ VPMADDWD VPXOR VPCMPGTD VPSIGNW VPMOVZXBD VPMULDQ VPMOVMSKB VPCMPGTW VPSUBB VPMOVZXBQ VPMULHRSW VPHADDD VPSUBD VPMOVZXBW VPMULHUW

Gather Instruction(s)

"Gather" is a fundamental building block for vectorizing indirect memory accesses.

int a[] = {1,2,3,4, …, 99,100};
int b[] = {0,4,8,12,16,20,24,28};   // indices

for (i = 0; i < 8; i++)
    dst[i] = a[b[i]];               // non-adjacent, indirect loads

Haswell introduces a set of GATHER instructions to allow automatic vectorization of such loops with non-adjacent, indirect memory accesses
– HSW GATHER is not always faster – software can make optimizing assumptions

Gather Instruction – Sample

VPGATHERDQ ymm1, [ymm9*8 + eax + 22], ymm2

P      integer data (no FP data)
D      index size; here doubleword (32-bit)
Q      data size; here quadword (64-bit) integer
ymm1   destination register
ymm9   index register; here 4-byte indices
8      scale factor
eax    general register containing the base address
22     offset added to the base address
ymm2   mask register; only the most significant bit of each element is used

Gather: Fundamental building block for nonadjacent, indirect memory accesses for either integer or floating point data
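The per-element semantics can be written out as a scalar loop; the sketch below is a plain-C model (function name hypothetical) of a VPGATHERDD-style masked gather, not the instruction itself — only elements whose mask MSB is set are loaded, and the mask register ends up zeroed:

```c
#include <stdint.h>

/* Scalar model of a dword-index/dword-data masked gather:
 * dst[i] = base[idx[i]] where the mask element's most significant
 * bit is set; other destination elements keep their old values.
 * The hardware clears the mask as elements complete; modeled here
 * by zeroing it at the end of each step. */
static void gather_dd_model(int32_t dst[8], const int32_t *base,
                            const int32_t idx[8], int32_t mask[8])
{
    for (int i = 0; i < 8; i++) {
        if (mask[i] < 0)        /* MSB set for a signed int32 */
            dst[i] = base[idx[i]];
        mask[i] = 0;            /* mask register is zeroed */
    }
}
```

The mask/merge behavior is what creates the destination dependency discussed on the next slide: when the mask is not known to be all 1s, the old destination values are live inputs.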

Would My Code Benefit From Using Gather?

• When to use Gather depends on factors such as
  – Number of elements: what index/data size
  – Mask value: known to be all 1s, or unknown mask?
  – Data reuse: will gathered data be reused several times?
  – Index reuse: will the same index be reused by several gathers?
  – Mask and destination dependency: if the mask is not all 1s, there is a dependency on the old values in the destination
• Gather can help or hurt performance (or be neutral) and should be used with caution
  – The Haswell Optimization Manual will have detailed recommendations on when to use Gather and when to avoid it

Be aware of performance implications when using Gather

New Data Move Instructions

Permutes & Blends            VPERMQ/PD imm; VPERMD/PS var; VPBLENDD imm
Element-Based Vector Shifts  VPSLLVQ, VPSRLVQ – quadword variable vector shift (left & right logical)
                             VPSLLVD, VPSRLVD, VPSRAVD – doubleword variable vector shift (left & right logical, right arithmetic)
New Broadcasts               VPBROADCASTB/W/D/Q xmm & mem; VBROADCASTSS/SD xmm – register broadcasts requested by software developers
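Unlike the legacy shifts, which apply one count to every element, the variable shifts take a per-element count. A plain-C model of the VPSLLVD semantics (function name hypothetical; this is a sketch of the definition, not the instruction):

```c
#include <stdint.h>

/* Scalar model of VPSLLVD: each dword is shifted left by its own
 * count. Counts of 32 or more produce zero, per the instruction's
 * definition -- unlike C's << operator, where such shifts are
 * undefined behavior. */
static void psllvd_model(uint32_t dst[8], const uint32_t src[8],
                         const uint32_t cnt[8])
{
    for (int i = 0; i < 8; i++)
        dst[i] = (cnt[i] < 32) ? (src[i] << cnt[i]) : 0;
}
```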

Shuffling / Data Rearrangement

• Traditional IA focus was lowest possible latency
  – Many very specialized shuffles: unpacks, packs, in-lane shuffles
  – Shuffle controls were not data driven
• New strategy
  – Any-to-any, data-driven shuffles at slightly higher latency

[Diagram: an 8-element control register selects, for each destination element 7..0, any of the 8 source elements]

Element Width  Vector Width  Instruction       Launch
BYTE           128           PSHUFB            SSSE3
DWORD          256           VPERMD / VPERMPS  AVX2 – New!
QWORD          256           VPERMQ / VPERMPD  AVX2 – New!
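The any-to-any behavior is easy to state as a scalar loop; the sketch below is a plain-C model of the VPERMD definition (function name hypothetical), showing that the selection is fully data driven and crosses the 128-bit lane boundary:

```c
#include <stdint.h>

/* Scalar model of VPERMD: dst[i] = src[ctrl[i] & 7]. The low 3 bits
 * of each control element pick any of the 8 source dwords across
 * the whole 256-bit register -- both 128-bit lanes. */
static void permd_model(uint32_t dst[8], const uint32_t src[8],
                        const uint32_t ctrl[8])
{
    for (int i = 0; i < 8; i++)
        dst[i] = src[ctrl[i] & 7];
}
```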

FMA: Fused Multiply-Add

• Computes (a×b)±c with only one rounding
  – The a×b intermediate result is not rounded before the add/subtract
• Can speed up and improve the accuracy of many FP computations, e.g.
  – Matrix multiplication (SGEMM, DGEMM, etc.)
  – Dot products
  – Polynomial evaluation
• Performs 8 single-precision or 4 double-precision FMA operations per 256-bit vector per FMA unit
  – Increases FLOPs capacity over Intel® AVX
  – Maximum throughput of two FMA operations per cycle

FMA can provide improved accuracy and performance

20 FMA Varieties Supported

vFMAdd     a×b + c                           fused mul-add           ps, pd, ss, sd
vFMSub     a×b – c                           fused mul-sub           ps, pd, ss, sd
vFNMAdd    –(a×b + c)                        fused negative mul-add  ps, pd, ss, sd
vFNMSub    –(a×b – c)                        fused negative mul-sub  ps, pd, ss, sd
vFMAddSub  a[i]×b[i] + c[i] on odd indices,  fused mul-add/sub       ps, pd
           a[i]×b[i] – c[i] on even indices
vFMSubAdd  a[i]×b[i] – c[i] on odd indices,  fused mul-sub/add       ps, pd
           a[i]×b[i] + c[i] on even indices

Multiple FMA varieties support both data types and eliminate additional negation instructions.
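The AddSub interleave above is the pattern used by complex multiplication kernels. A plain-C model of the vFMAddSub element selection (function name hypothetical; the model shows only the add/sub alternation, not the fused rounding):

```c
/* Scalar model of vFMAddSub on a packed-single vector:
 * odd-indexed elements get a[i]*b[i] + c[i],
 * even-indexed elements get a[i]*b[i] - c[i].
 * (The real instruction fuses the multiply and add/sub into one
 * rounding; this model rounds twice.) */
static void fmaddsub_model(float r[8], const float a[8],
                           const float b[8], const float c[8])
{
    for (int i = 0; i < 8; i++)
        r[i] = (i & 1) ? a[i] * b[i] + c[i]
                       : a[i] * b[i] - c[i];
}
```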

FMA Operand Ordering Convention

• 3 orderings: VFMADD132, VFMADD213, VFMADD231
• VFMADDabc: operand1 = (operand_a × operand_b) + operand_c
  – The two multiply sources are numerically symmetric (interchangeable)
  – Operand 1 must be a register and is always the destination
  – Operand 2 must be a register
  – Operand 3 can be memory or register
  – Any of the three positions can be used in the multiply or the add
• Example:
  VFMADD231 xmm8, xmm9, mem256    ; xmm8 = (xmm9 × mem256) + xmm8
• Notes
  – The combination of 20 varieties with 3 operand orderings results in 60 new mnemonics
  – In AT&T* format, read operands right to left: Src3, Src2, Src1

FMA operand ordering allows complete flexibility in the selection of the memory operand and the destination.

FMA operand ordering allows complete flexibility in selection of memory operand and destination

FMA Latency

• Due to out-of-order execution, FMA latency is not always better than separate multiply and add instructions
  – Add latency: 3 cycles
  – Multiply and FMA latencies: 5 cycles
  – Will likely differ in later CPUs

(a×b) + c:       separate MUL then ADD: latency 5 + 3 = 8 cycles; single FMA: latency 5 cycles
(a×b) + (c×d):   two parallel MULs then ADD: latency 5 + 3 = 8 cycles; MUL then FMA: latency 5 + 5 = 10 cycles

FMA can improve or hurt performance, depending on the dependency structure of the code.

Haswell Bit Manipulation Instructions

• 15 instructions overall, operating on general purpose registers (GPRs):
  – Leading and trailing zero bit counts
  – Trailing bit manipulations and masks
  – Arbitrary bit field extract/pack
  – Improved long-precision multiplies and rotates
• Narrowly focused instruction set architecture (ISA) extension
  – Partly driven by direct customer requests
  – Allows for additional performance in highly optimized code
• Beneficial for applications with:
  – Hot spots doing bit-level operations; universal (de)coding (Golomb, Rice, Elias Gamma codes); bit field pack/extract
  – Crypto algorithms using arbitrary precision arithmetic or rotates, e.g. RSA and the SHA family of hashes

Bit Manipulation Instructions

CPUID bit                   Name    Operation
CPUID.EAX=80000001H:        LZCNT   Leading Zero Count (previously introduced by AMD*)
  ECX.LZCNT [bit 5]
CPUID.(EAX=07H, ECX=0H):    TZCNT   Trailing Zero Count
  EBX.BMI1 [bit 3]          ANDN    Logical And Not: Z = ~X & Y
                            BLSR    Reset Lowest Set Bit
                            BLSMSK  Get Mask Up to Lowest Set Bit
                            BLSI    Isolate Lowest Set Bit
                            BEXTR   Bit Field Extract (compatibility only; use BZHI+SHRX instead)
CPUID.(EAX=07H, ECX=0H):    BZHI    Zero High Bits Starting with Specified Position
  EBX.BMI2 [bit 8]          SHLX    Z = X << Y  (variable shifts: non-destructive, with
                            SHRX    Z = X >> Y   Load+Operation forms, no implicit CL
                            SARX    Z = X >> Y   dependency, no flags effect; SARX is signed)
                            PDEP    Parallel Bit Deposit
                            PEXT    Parallel Bit Extract
                            RORX    Rotate Without Affecting Flags
                            MULX    Unsigned Multiply Without Affecting Flags

Note: software needs to check the CPUID bits of all groups it uses instructions from.

Key Common Properties

Eliminate legacy ISA limitations:

Destination register can differ from the source registers
• a.k.a. "non-destructive": ANDN rax, rbx, rcx

No implicit register parameter requirement
• like CL for shifts, or RDX:RAX for MUL
• more flexibility for the compiler's register allocation

No useless, "anti-out-of-order" flags clobbering
• A modern microarchitecture cannot efficiently handle legacy requirements like "preserve flags if the shift count in CL was 0" (an input flags dependency)

Previously missing Load+Operation forms added
• Adds fusion opportunities, reduces register pressure

Allow for higher Front End efficiency and better utilization of HSW wide execution resources
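The BMI1 trailing-bit primitives each compute a simple two-operation expression in a single uop; their plain-C equivalents (function names hypothetical, semantics as listed in the table above):

```c
#include <stdint.h>

/* Plain-C equivalents of the BMI1 trailing-bit primitives; the
 * instructions compute exactly these expressions in one uop each. */
static uint64_t blsr(uint64_t x)   { return x & (x - 1); }  /* reset lowest set bit   */
static uint64_t blsmsk(uint64_t x) { return x ^ (x - 1); }  /* mask up to lowest bit  */
static uint64_t blsi(uint64_t x)   { return x & (0 - x); }  /* isolate lowest set bit */
static uint64_t andn(uint64_t x, uint64_t y) { return ~x & y; }
```

Compilers recognize these idioms and emit the corresponding instructions when BMI1 code generation is enabled, so the C spellings double as a portable fallback.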

Fast Variable Bit Field Extract Primitives

• The two most frequent bit field usages:
  – Extract based on START and END bit positions, or
  – Extract based on START bit position and LENGTH

Example:

bit_field = _bzhi_u64(buf_ui64, end_pos) >> start_pos;
// use _shrx_u64() instead of ">>" with GCC or MSVC

With end_pos in RDX (= 6), start_pos in RCX (= 2) and the source in RBX, the extract dst = src[end_pos-1 : start_pos] compiles to:

bzhi rax, rbx, rdx    ; clear bits 63..6
shrx rax, rax, rcx    ; shift bits 5..2 down to bit 0

Unary Decoding – Where BMI Helps

• A unary encoded number is:
  – A sequence of '1' bits terminated by a '0' bit
  – The number of '1' bits equals the integer encoded
• Unary is a "prefix-free" code
  – The sequence can be reconstructed from a stream
• Mostly used as a part of other codes, like Elias Gamma or Golomb
  – Relevant for compression algorithms

Dec  Binary  Traditional Unary  Alternative
0    00000   0                  1
1    00001   10                 01
2    00010   110                001
3    00011   1110               0001
4    00100   11110              00001
5    00101   111110             000001
6    00110   1111110            0000001
7    00111   11111110           00000001
8    01000   111111110          000000001
…

The new bit-manipulation instructions accelerate in particular the conversion between unary and binary encoding.
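Decoding the "alternative" form reduces to a single trailing-zero count. A sketch in plain C (function name hypothetical), using GCC's `__builtin_ctzll` as a stand-in for TZCNT and assuming the stream is packed LSB first:

```c
#include <stdint.h>

/* Decode one number from the "alternative" unary code: a run of
 * '0' bits terminated by a '1', least significant bit first.
 * The value is simply the trailing-zero count of the stream word;
 * __builtin_ctzll stands in for the TZCNT instruction here.
 * Precondition: stream must be nonzero (the terminator is present). */
static unsigned unary_decode(uint64_t stream, unsigned *consumed)
{
    unsigned n = (unsigned) __builtin_ctzll(stream);
    *consumed = n + 1;    /* value bits plus the terminating '1' */
    return n;
}
```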

Variable Bit Field Extract: "Before and After"

Original:

res = (input >> pos) & (ALLONES >> (bitsize - length));

subq %rsi, %r9
movq $-1, %rax
movl %r9d, %ecx
movq (%rdx), %r11
shrq %cl, %r11
movq %rsi, %rcx
negq %rcx
shrq %cl, %rax
andq %rax, %r11

9 instructions, 13 uops
3/4-cycle throughput
4-cycle latency*

BMI2:

res = _bzhi_u64(input >> pos, length);

shrx %r8, (%rdx), %rax
bzhi %rcx, %rax, %r9

2 instructions, 2 uops
1-cycle throughput
2-cycle latency*

* not including the load

BMI2 achieves 1 extract / cycle

HSW Improvements for Threading

Sample Code: Computing Pi with Windows Threads

#include <windows.h>
#include <stdio.h>
#define NUM_THREADS 2

HANDLE thread_handles[NUM_THREADS];
CRITICAL_SECTION hUpdateMutex;
static long num_steps = 100000;
int threadArg[NUM_THREADS];
double step;
double global_sum = 0.0;

void Pi(void *arg)
{
    int i, start;
    double x, sum = 0.0;

    start = *(int *) arg;
    step = 1.0 / (double) num_steps;

    for (i = start; i <= num_steps; i = i + NUM_THREADS) {
        x = (i - 0.5) * step;
        sum = sum + 4.0 / (1.0 + x * x);
    }
    EnterCriticalSection(&hUpdateMutex);
    global_sum += sum;
    LeaveCriticalSection(&hUpdateMutex);
}

void main()
{
    double pi;
    int i;
    DWORD threadID;

    InitializeCriticalSection(&hUpdateMutex);
    for (i = 0; i < NUM_THREADS; i++) {
        threadArg[i] = i + 1;
        thread_handles[i] = CreateThread(0, 0,
            (LPTHREAD_START_ROUTINE) Pi, &threadArg[i], 0, &threadID);
    }
    WaitForMultipleObjects(NUM_THREADS, thread_handles, TRUE, INFINITE);

    pi = global_sum * step;
    printf(" pi is %f \n", pi);
}

Locks can be a key bottleneck – even when there is no actual conflict.

Intel® Transactional Synchronization Extensions (Intel® TSX)

• Intel® TSX = HLE + RTM
• HLE (Hardware Lock Elision) is a hint inserted in front of a LOCK operation to indicate a region is a candidate for lock elision
  – XACQUIRE (0xF2) and XRELEASE (0xF3) prefixes
  – The lock is not actually acquired; the region executes speculatively
  – Hardware buffers loads and stores, and checkpoints registers
  – Hardware attempts to commit atomically without locks
  – If it cannot, it restarts and executes non-speculatively

• RTM (Restricted Transactional Memory) adds three new instructions (XBEGIN, XEND, XABORT)
  – Similar operation to HLE (except no locks; new ISA)
  – If the region cannot commit atomically, control goes to the handler indicated by XBEGIN
  – Provides software additional capabilities over HLE

HLE: Two New Prefixes

mov eax, 0
mov ebx, 1
F2 lock cmpxchg $semaphore, ebx   ; XACQUIRE prefix hint on the instruction (lock inc,
…                                 ; dec, cmpxchg, etc.) that starts the critical section;
                                  ; the prefix is ignored on non-HLE systems
mov eax, $value1                  ;
mov $value2, eax                  ; speculated critical section
mov $value3, eax                  ;
…
F3 mov $semaphore, 0              ; XRELEASE prefix hint on the instruction (e.g. a store)
                                  ; that ends the critical section

RTM: Three New Instructions

Name         Operation
XBEGIN       Transaction Begin
XEND         Transaction End
XABORT arg8  Transaction Abort

XBEGIN: starts a transaction in the pipeline. Causes a checkpoint of the register state; all following memory transactions are buffered. Its rel16/32 operand is the fallback handler. Nesting is supported up to a depth of 7 XBEGINs; depth 8 aborts. All abort rollback is to the outermost region.

XEND : Causes all buffered state to be atomically committed to memory. LOCK ordering semantics (even for empty transactions)

XABORT : Causes all buffered state to be discarded and register checkpoint to be recovered. Will jump to the XBEGIN labeled fallback handler. Takes imm8 argument.

There is also an XTEST instruction, usable with both HLE and RTM, to query whether execution is taking place inside a transactional region.

Intel Compilers – Haswell Support

• Compiler options
  • -Qxcore-avx2, -Qaxcore-avx2 (Windows*)
  • -xcore-avx2, -axcore-avx2 (Linux*)
  • -march=core-avx2
• Separate options for FMA:
  • -Qfma, -Qfma- (Windows)
  • -fma, -no-fma (Linux)
• AVX2 – 256-bit integer instructions
  • Asm, intrinsics, automatic vectorization
• Gather
  • Asm, intrinsics, automatic vectorization
• BMI – bit manipulation instructions
  • All supported through asm/intrinsics
  • Some through automatic code generation
• INVPCID – invalidate processor context ID
  • Asm only

Manual Code Dispatch

• Intel® TSX is an optional feature for HSW
  – Not all models will have this feature
  – Thus two "guard" names are/will be available:
    core_4th_gen_avx
    core_4th_gen_avx_tsx

__declspec(cpu_specific(core_4th_gen_avx))
void dispatch_func() {
    printf("Code here can assume HSW NI but not TSX\n");
}

__declspec(cpu_specific(core_4th_gen_avx_tsx))
void dispatch_func() {
    printf("Code here can assume TSX to be available\n");
}

Other Compilers – Haswell Support

• Windows: Microsoft Visual Studio* 12 offers support similar to the Intel compiler for Windows*
• The GNU Compiler (4.7 experimental, 4.8 final) will offer support similar to the Intel compiler
  • In particular the switch -march=core-avx2 and the same intrinsics

Intel® Software Development Emulator

• Emulates execution of instruction set extensions, including those available only in future processors
• The current version supports the new instruction set of 4th Generation Intel® Core™ processors (Haswell)
• Available for free, for Linux and Windows
• Uses the PIN tool for instrumentation (www.pintool.org)
• Offers many helpful features, such as a disassembler
• Can, e.g., be used to investigate in detail the Haswell code generated by the Intel® Compilers
  • Easy way to get an instruction mix
  • The preferred way – much more precise than performance counters!

Target Software for Haswell New Instructions

Workload                                                                  AVX  AVX2
HPC linear solvers which are vectorizable                                  V    V
Video encoders with nonstandard bitstreams, or very high
  quality requirements                                                     X    V
Audio (floating point) digital signal processing (FFTs, FIRs, IIRs)        V    V
Video effects, transitions, pre-processing, including using OpenCL*        X    V
Games leveraging the CPU                                                   V    V

Workload                                      Bit Manipulation New Instructions
Bit-level encoded data processing/streaming   V
Arbitrary precision arithmetic                V
Hashes                                        V

Intel ® AVX2 with FMA benefits compute bound vectorizable workloads; BMI benefits key integer algorithms

References

Intel® Advanced Vector Extensions 2 (Intel® AVX2); Intel® Advanced Vector Extensions (Intel® AVX); Intel® Microarchitecture code name Haswell

• ISA documentation for the Haswell new instructions
  – Intel® Architecture Instruction Set Extensions Programming Reference (PDF)
  – Intel® 64 and IA-32 Architectures Software Developer Manuals
• Forum on the Haswell new instructions
  – http://software.intel.com/en-us/blogs/2011/06/13/haswell-new-instruction-descriptions-now-available/
• Software Development Emulator (SDE)
  – Emulate new instructions before hardware is available
  – Intel® Software Development Emulator (Intel® SDE)
• Intel® Architecture Code Analyzer
  – Code analysis for new instructions before hardware is available
• Intel® Compiler
  – Version 13.0 supports Intel® TSX
  – Intel® C++ Compiler
• Intel® VTune™ analyzer
  – A new release will support Haswell PerfMON shortly after shipment


Ivy Bridge New Instructions 3rd Generation Intel® Core™ Processors

7 new instructions:

3 x RDRAND – 16-, 32- and 64-bit          Digital random number generation
2 x 16-bit FP support – load and store    Conversion between 32- and 16-bit FP
2 x FS and GS segment register access     Fast implementation of thread-local storage
1 x Supervisor Mode Execution Protection  Prevents certain security attacks

• Compiler support by intrinsics
  – Available since the 12.x compiler versions
  – No implicit code generation
• New extension name core-avx-i for the code selection/vectorization switches, to enable intrinsic support and manual code dispatching for Ivy Bridge
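The 16-bit FP support converts between half and single precision. The core of the conversion is just a rebias of the exponent and a widening of the fraction; a plain-C sketch for normal numbers only (function name hypothetical; subnormals, infinities and NaNs are deliberately omitted, where the hardware handles all cases):

```c
#include <stdint.h>
#include <string.h>

/* Minimal model of one element of a half-to-single conversion
 * (VCVTPH2PS direction), normal numbers only: keep the sign,
 * rebias the 5-bit exponent (bias 15) to 8 bits (bias 127), and
 * widen the 10-bit fraction to 23 bits. */
static float half_to_float(uint16_t h)
{
    uint32_t sign = (uint32_t)(h & 0x8000) << 16;
    uint32_t exp  = (h >> 10) & 0x1F;   /* biased by 15 */
    uint32_t frac = h & 0x3FF;
    uint32_t bits = sign | ((exp - 15 + 127) << 23) | (frac << 13);
    float f;
    memcpy(&f, &bits, sizeof f);        /* bit-exact reinterpret */
    return f;
}
```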
