
Intrinsics Cheat Sheet
Jan Finis ([email protected])

Bit Operations

Boolean Logic
- Bool AND [SSE2] and: si128, ps[SSE], pd
  mi and_si128(mi a, mi b)
- Bool AND NOT [SSE2] andnot: si128, ps[SSE], pd. Computes !a & b.
  mi andnot_si128(mi a, mi b)
- Bool OR [SSE2] or: si128, ps[SSE], pd
  mi or_si128(mi a, mi b)
- Bool XOR [SSE2] xor: si128, ps[SSE], pd
  mi xor_si128(mi a, mi b)

Shifting & Rotation
- Arithmetic Shift Right [SSE2] sra[i]: epi16-32. Shifts elements right while shifting in sign bits. The i version takes an immediate as count; the version without i takes the count from the lower 64 bits of an SSE register.
  mi srai_epi32(mi a, ii32 i); mi sra_epi32(mi a, mi b)
- Logic Shift Left/Right [SSE2] sl/rl[i]: epi16-64. Shifts elements left/right while shifting in zeros. The i version takes an immediate as count; the version without i takes the count from the lower 64 bits of an SSE register.
  mi slli_epi32(mi a, ii i); mi sll_epi32(mi a, mi b)
- Rotate Left/Right [x86] _[l]rot[w]l/r (≤64). Rotates a left/right by a number of bits specified in i. The l version is for 64-bit ints and the w version for 16-bit ints.
  u64 _lrotl(u64 a, i i)
- Variable Logic Shift [AVX2] sl/rlv: epi32-64. Shifts elements left/right while shifting in zeros. Each element in a is shifted by an amount specified by the corresponding element in b.
  mi sllv_epi32(mi a, mi b)
- Variable Arithmetic Shift [AVX2] srav: epi32. Shifts elements right while shifting in sign bits. Each element in a is shifted by an amount specified by the corresponding element in b.
  mi srav_epi32(mi a, mi b)

Selective Bit Moving (change the position of selected bits, zeroing out remaining ones)
- Bit Scatter (Deposit) [BMI2] _pdep: u32-64 (≤64). Deposits contiguous low bits from a to dst at the locations specified by mask m; all other bits in dst are set to zero. See the worked sketch after this section.
  u64 _pdep_u64(u64 a, u64 m)
- Bit Gather (Extract) [BMI2] _pext: u32-64 (≤64). Extracts bits from a at the locations specified by mask m to contiguous low bits in dst; the remaining upper bits in dst are set to zero.
  u64 _pext_u64(u64 a, u64 m)
- Movemask [SSE2] movemask: epi8, pd, ps[SSE]. Creates a bitmask from the most significant bit of each element.
  i movemask_epi8(mi a)
- Extract Bits [BMI1] _bextr: u32-64 (≤64). Extracts l bits from a starting at bit s.
  u64 _bextr_u64(u64 a, u32 s, u32 l)  // (a >> s) & ((1 << l) - 1)

Bit Masking (≤64)
- Zero High Bits [BMI2] _bzhi: u32-64. Zeros all bits in a higher than and including the bit at index i.
  u64 _bzhi_u64(u64 a, u32 i)  // dst := a; IF (i[7:0] < 64) dst[63:i] := 0
- Reset Lowest 1-Bit [BMI1] _blsr: u32-64. Returns a but with the lowest 1-bit set to 0.
  u64 _blsr_u64(u64 a)  // (a-1) & a
- Mask Up To Lowest 1-Bit [BMI1] _blsmsk: u32-64. Returns an int that has all bits set up to and including the lowest 1-bit in a, or no bit set if a is 0.
  u64 _blsmsk_u64(u64 a)  // (a-1) ^ a
- Isolate Lowest 1-Bit [BMI1] _blsi: u32-64. Returns an int that has only the lowest 1-bit in a set, or no bit set if a is 0.
  u64 _blsi_u64(u64 a)  // (-a) & a

Bit Counting (≤64)
- Count 1-Bits (Popcount) [POPCNT] popcnt: u32-64. Counts the number of 1-bits in a.
  u32 popcnt_u64(u64 a)
- Count Leading Zeros [LZCNT] _lzcnt: u32-64. Counts the number of leading zeros in a.
  u64 _lzcnt_u64(u64 a)
- Count Trailing Zeros [BMI1] _tzcnt: u32-64. Counts the number of trailing zeros in a.
  u64 _tzcnt_u64(u64 a)
- Bit Scan Forward/Reverse [x86] _bit_scan_forward/reverse: (i32), ≤64. Returns the index of the lowest/highest bit set in the 32-bit int a. Undefined if a is 0.
  i32 _bit_scan_forward(i32 a)

Conversions

Reinterpret Casts
- 128-bit Cast [SSE2] castX_Y: si128, ps[SSE], pd. Reinterpret casts from X to Y. No operation is generated.
  md castsi128_pd(mi a)
- 256-bit Cast [AVX] castX_Y: ps, pd, si256 (256). Reinterpret casts from X to Y. No operation is generated.
  m256 castpd_ps(m256d a)
- Cast 128 ↔ 256 [AVX] castX_Y: pd128↔pd256, ps128↔ps256, si128↔si256 (256). Reinterpret casts from X to Y. If the cast is from 128 to 256 bit, the upper bits are undefined! No operation is generated.
  m128d castpd256_pd128(m256d a)

Packed Conversions (convert all elements in a packed SSE register)
- Convert [SSE2] cvt[t]X_Y: epi32, ps/d. Converts packed elements from X to Y. If pd is used as source or target type, then 2 elements are converted, otherwise 4. The t version is available only when converting to int and performs a truncation instead of rounding.
  md cvtepi32_pd(mi a)
- Pack With Saturation [SSE2] pack[u]s: epi16, epi32. Packs ints from two input registers into one register, halving the element bitwidth. Overflows are handled using saturation; the u version saturates to the unsigned type.
  mi packs_epi32(mi a, mi b)
- Sign Extend [SSE4.1] cvtX_Y: epi8-32 → epi16-64. Sign extends each element from X to Y. Y must be longer than X.
  mi cvtepi8_epi32(mi a)
- Zero Extend [SSE4.1] cvtX_Y: epu8-32 → epi16-64. Zero extends each element from X to Y. Y must be longer than X.
  mi cvtepu8_epi32(mi a)
- Float 16 bit ↔ 32 bit [CVT16] cvtX_Y: ph ↔ ps. Converts between 4x 16-bit floats and 4x 32-bit floats (8 in 256-bit mode). For the 32-to-16-bit conversion, a rounding mode r must be specified.
  m cvtph_ps(mi a); mi cvtps_ph(m a, i r)

Rounding (see also: conversion to si32, which performs rounding implicitly)
- Round Up (Ceiling) [SSE4.1] ceil: ps/d, ss/d
  md ceil_pd(md a)
- Round Down (Floor) [SSE4.1] floor: ps/d, ss/d
  md floor_pd(md a)
- Round [SSE4.1] round: ps/d, ss/d. Rounds according to a rounding parameter r.
  md round_pd(md a, i r)

Single Element Conversions (convert a single element in the lower part of an SSE register)
- Single Conversion [SSE2] cvtX_Y: si32-64, ss/d → ss/d. Converts a single element from X to Y; the remaining bits of the result are copied from a.
  md cvtsi64_sd(md a, i64 b)  // double(b[63:0]) | (a[127:64] << 64)
- Single SSE Float to Int [SSE2] cvt[t]X_Y: ss/d → si32-64. Converts a single SSE float from X to Y; the result is a normal int, not an SSE register! The t version truncates instead of rounding.
  i cvtss_si32(m a)
- Int Conversion [SSE2] cvtX_Y: si32-128. Converts a single integer from X to Y. Either of the types must be si128. If the new type is longer, the integer is zero extended.
  mi cvtsi32_si128(i a)
- SSE Float to Normal Float [SSE2] cvtX_Y: ss → f32, sd → f64. Converts a single float from X (SSE float type) to Y (normal float type).
  f cvtss_f32(m a)
- Old Float/Int Conversion [SSE] cvt[t]_X2Y: pi ↔ ps, si ↔ ss. Converts between int and float. The p_ versions convert 2-packs from b and fill the result with the upper bits of operand a. The s_ versions do the same for one element in b. The t version performs a truncation instead of rounding and is available only when converting from float to int.
  m cvt_si2ss(m a, i b)
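As a worked example for the Selective Bit Moving group above, the following minimal sketch round-trips the low nibble of every byte through _pext_u64/_pdep_u64 (the variable names are illustrative; assumes a BMI2-capable CPU, compiled with -mbmi2):

    #include <immintrin.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint64_t m = 0x0F0F0F0F0F0F0F0FULL;                    /* mask: where the interesting bits live */
        uint64_t packed = _pext_u64(0x1234567812345678ULL, m); /* gather masked bits into the low end */
        uint64_t spread = _pdep_u64(packed, m);                /* scatter them back to the mask positions */
        printf("%016llx -> %016llx\n",
               (unsigned long long)packed, (unsigned long long)spread);
        return 0;
    }

_pdep_u64 is the inverse of _pext_u64 under the same mask, which is why the second value reproduces the masked bits of the original input.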

Introduction (Version 2.1f)

This cheat sheet displays most x86 intrinsics supported by Intel processors. The following intrinsics were omitted:
· obsolete or discontinued instruction sets like MMX and 3DNow!
· AVX-512, as it will not be available for some time and would blow up the space needed due to the vast amount of new instructions
· intrinsics not supported by Intel but only by AMD, like parts of the XOP instruction set (maybe they will be added in the future)
· intrinsics that are only useful for operating systems, like the _xsave intrinsic to save the CPU state
· the RdRand intrinsics, as it is unclear whether they provide real random numbers without enabling kleptographic loopholes

Each family of intrinsics is depicted by a box as described in the Overview below; the intrinsics are grouped as meaningfully as possible. Most information is taken from the Intel Intrinsics Guide (http://software.intel.com/en-us/articles/intel-intrinsics-guide). Let me know ([email protected]) if you find any wrong or unclear content.

When not stated otherwise, it can be assumed that each vector intrinsic performs its operation vertically on all elements which are packed into the input SSE registers. E.g., the add instruction has no description; thus, it can be assumed that it performs a vertical add of the elements of the two input registers, i.e., the first element from register a is added to the first element in register b, the second to the second, and so on. In contrast, a horizontal add would add the first element of a to the second element of a, the third to the fourth, and so on.

To use the intrinsics, include the x86intrin.h header and make sure that you set the target architecture to one that supports the intrinsics you want to use (using the -march=X flag).
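As a minimal concrete example of a vertical operation (the program is illustrative; it assumes SSE2, which every x86-64 compiler enables by default):

    #include <emmintrin.h>   /* SSE2 */
    #include <stdio.h>

    int main(void) {
        __m128i a = _mm_set_epi32(4, 3, 2, 1);   /* set stores its first argument in the highest lane */
        __m128i b = _mm_set_epi32(40, 30, 20, 10);
        __m128i c = _mm_add_epi32(a, b);         /* vertical add: lane i of a plus lane i of b */
        int out[4];
        _mm_storeu_si128((__m128i *)out, c);
        printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);  /* prints: 11 22 33 44 */
        return 0;
    }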
Overview: Anatomy of a Box

Each family of intrinsics is depicted by a box with the following fields. The Float Compare box, [SSE2] cmp[n]Z: ps/d, ss/d with signature md cmpeq_pd(md a, md b), serves as the running example.

- Name: Human-readable name of the operation.
- Instruction Set: Specifies the instruction set which is necessary for this operation. If more than one instruction set is given, then different flavors of this operation require different instruction sets.
- Available Bitwidth: If no bitwidth is specified, the operation is available for 128-bit and 256-bit SSE registers. Use the _mm_ prefix for the 128-bit and the _mm256_ prefix for the 256-bit flavors. Otherwise, the following restrictions apply:
  256: only available for 256-bit SSE registers (always use the _mm256_ prefix)
  128: only available for 128-bit SSE registers (always use the _mm_ prefix)
  ≤64: does not operate on SSE registers but on usual 64-bit registers (always use the _mm_ prefix)
- Sequence Indicator (SEQ!): If this indicator is present, the intrinsic will generate a sequence of assembly instructions instead of only one. Thus, it may not be as efficient as anticipated.
- Name(s) of the intrinsic: The names of the various flavors of the intrinsic. To assemble the final name to be used, one must add a prefix and a suffix; the suffix determines the data type (see next field). Concerning the prefix, all intrinsics, except the ones which are prefixed with a green underscore _, need to be prefixed with _mm_ for 128-bit versions or _mm256_ for 256-bit versions. Blue letters in brackets like [n] indicate that adding this letter leads to another flavor of the function. Blue letters separated with a slash like l/r indicate that either letter can be used and leads to a different flavor. A red letter like Z indicates that various different strings can be inserted here, which are stated in the notes section.
- List of available data type suffixes: The suffix chosen determines the data type on which the intrinsic operates; consult the suffix table for further information. It must be added to the intrinsic name, separated by an underscore, so a possible name for the data type pd in the example would be _mm_cmpeq_pd. If a suffix is followed by an instruction set in brackets, then this instruction set is required for that suffix; all suffixes without an explicit instruction set are available in the instruction set specified at the left of the box. If no type suffixes are shown, the method is type independent and must be used without a suffix. If the suffixes are in parentheses, they must not be appended and are encoded into the name of the intrinsic in another way (see the notes of the box for further information).
- Signature and Pseudocode: Shows one possible signature of the intrinsic in order to depict the parameter types and the return type. Note that only one flavor and data type suffix is shown; the signature has to be adjusted for other suffixes and flavors. The data types are displayed in shortened form; consult the data types table for more information. In addition to the signature, the pseudocode for some intrinsics is shown; the special variable dst depicts the destination register of the intrinsic, which will be returned.
- Notes: Explains the semantics of the intrinsic, the various flavors, and other important information.
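For instance, assembling names according to the scheme above (a hedged sketch; the wrapper function names are illustrative, and AVX is assumed to be enabled with -mavx):

    #include <immintrin.h>

    __m128  add4(__m128 a, __m128 b)   { return _mm_add_ps(a, b); }     /* _mm_ prefix + ps suffix */
    __m256  add8(__m256 a, __m256 b)   { return _mm256_add_ps(a, b); }  /* _mm256_ prefix (AVX)    */
    __m128d neq2(__m128d a, __m128d b) { return _mm_cmpneq_pd(a, b); }  /* cmp[n]Z with n and Z=eq */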
Instruction Sets
Each intrinsic is only available on machines which support the corresponding instruction set. This list depicts the instruction sets and the first Intel and AMD CPUs that supported them.

I.Set | Intel (Year) | AMD (Year) | Description
x86 | all | all | x86 base instructions
SSE | Pentium III ('99) | K7 Palomino ('01) | Streaming SIMD Extensions
SSE2 | Pentium 4 ('01) | K8 ('03) | Streaming SIMD Extensions
SSE3 | Prescott ('04) | K9 ('05) | Streaming SIMD Extensions
SSSE3 | Woodcrest ('06) | Bulldozer ('11) | Streaming SIMD Extensions
SSE4.1 | Penryn ('07) | Bulldozer ('11) | Streaming SIMD Extensions
SSE4.2 | Nehalem ('08) | Bulldozer ('11) | Streaming SIMD Extensions
AVX | Sandy Bridge ('11) | Bulldozer ('11) | Advanced Vector Extensions
AVX2 | Haswell ('13) | - | Advanced Vector Extensions
FMA | Haswell ('13) | Bulldozer ('11) | Fused Multiply and Add
AES | Westmere ('10) | Bulldozer ('11) | Advanced Encryption Standard
CLMUL | Westmere ('10) | Bulldozer ('11) | Carryless Multiplication
CVT16 | Ivy Bridge ('12) | Bulldozer ('11) | 16-bit Floats
TSX | Haswell ('13) | - | Transactional Sync. Extensions
POPCNT | Nehalem ('08) | K10 ('07) | Population Count
LZCNT | Haswell ('13) | K10 ('07) | Leading Zero Count
BMI1 | Sandy Bridge ('11) | Bulldozer ('11) | Bit Manipulation Instructions
BMI2 | Haswell ('13) | - | Bit Manipulation Instructions

Data Type Suffixes
Most intrinsics are available for various suffixes which depict different data types. The table depicts the suffixes used and the corresponding SSE register types (see Data Types for descriptions).

Suffix | Type | Description
ph | mi | Packed half float (16-bit float)
ps/pd | m/md | Packed single float / packed double float
epiX | mi | Packed X-bit signed integer
epuX | mi | Packed X-bit unsigned integer
siX | mi | Single X-bit signed integer. If there is an si128 version, the same function for 256 bits is usually suffixed with si256.
suX | mi | Single X-bit unsigned integer in an SSE register
uX | - | Single X-bit unsigned integer (not in an SSE register)
ss/sd | m/md | Single single float / single double float; the remaining bits are copied. Operations involving these types often have different signatures: an extra input register s is used, and all bits which are not used by the single element are copied from this register. The signatures are not depicted here; see the manual for exact signatures.
m128/m128d/m128i | m128/m128d/m128i | AVX sometimes uses this suffix for 128-bit float/double/int operations.

Data Types
The following data types are used in the signatures of the intrinsics. Note that most types depend on the used type suffix; only one example suffix is shown in each signature.

Type | Description
i | Signed 32-bit integer (int)
iX | Signed X-bit integer (intX_t)
uX | Unsigned X-bit integer (uintX_t)
ii | Immediate signed 32-bit integer: parameters of this type must be a compile-time constant
f | 32-bit float (float)
d | 64-bit float (double)
mi | Integer SSE register, i.e., __m128i or __m256i, depending on the bitwidth used
m | 32-bit float SSE register, i.e., __m128 or __m256, depending on the bitwidth used
md | 64-bit float (double) SSE register, i.e., __m128d or __m256d, depending on the bitwidth used
m128/m128d/m128i | The 128-bit register types, used explicitly in some AVX signatures
v | void
X* | Pointer to X (X*)
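When a binary may run on machines that lack some of these instruction sets, availability must be checked at runtime. A hedged sketch using the GCC/Clang builtin __builtin_cpu_supports (a compiler builtin, not an intrinsic from these headers):

    #include <stdio.h>

    int main(void) {
        if (__builtin_cpu_supports("avx2"))
            puts("AVX2 path");
        else if (__builtin_cpu_supports("sse4.1"))
            puts("SSE4.1 path");
        else
            puts("scalar fallback");
        return 0;
    }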
Arithmetics

Basic Arithmetics
- Add [SSE2] add: epi8-64, ps[SSE]/d, ss[SSE]/d
  mi add_epi16(mi a, mi b)
- Add with Saturation [SSE2] adds: epi8-16, epu8-16. Adds elements with saturation.
  mi adds_epi16(mi a, mi b)
- Horizontal Add [SSSE3] hadd: epi16-32, ps/d. Adds adjacent pairs of elements.
  mi hadd_epi16(mi a, mi b)
- Horizontal Add with Saturation [SSSE3] hadds: epi16. Adds adjacent pairs of elements with saturation.
  mi hadds_epi16(mi a, mi b)
- Subtract [SSE2] sub: epi8-64, ps[SSE]/d, ss[SSE]/d
  mi sub_epi16(mi a, mi b)
- Subtract with Saturation [SSE2] subs: epi8-16, epu8-16. Subtracts elements with saturation.
  mi subs_epi16(mi a, mi b)
- Horizontal Subtract [SSSE3] hsub: epi16-32, ps/d. Subtracts adjacent pairs of elements.
  mi hsub_epi16(mi a, mi b)
- Horizontal Subtract with Saturation [SSSE3] hsubs: epi16
  mi hsubs_epi16(mi a, mi b)
- Alternating Add and Subtract [SSE3] addsub: ps/d
  m addsub_ps(m a, m b)
  // FOR j := 0 to 3: i := j*32
  //   IF (j is even) dst[i+31:i] := a[i+31:i] - b[i+31:i]
  //   ELSE           dst[i+31:i] := a[i+31:i] + b[i+31:i]

Multiplication
- Mul [SSE2] mul: epi32[SSE4.1], epu32, ps[SSE]/d, ss[SSE]/d. Multiplies vertically. NOTE: the epi32 and epu32 versions multiply only 2 ints instead of 4!
  mi mul_epi32(mi a, mi b)
- Mul Low [SSE2] mullo: epi16, epi32[SSE4.1]. Multiplies vertically and writes the low 16/32 bits into the result. See the sketch after this section.
  mi mullo_epi16(mi a, mi b)
- Mul High [SSE2] mulhi: epi16, epu16. Multiplies vertically and writes the high 16 bits into the result.
  mi mulhi_epi16(mi a, mi b)
- Mul High with Round & Scale [SSSE3] mulhrs: epi16. Treats the 16-bit words in registers a and b as signed 15-bit fixed-point numbers between −1 and 1 (e.g. 0x4000 is treated as 0.5 and 0xa000 as −0.75) and multiplies them together.
  mi mulhrs_epi16(mi a, mi b)
- Carryless Mul [CLMUL] clmulepi64: si128 (128). Performs a carry-less multiplication of two 64-bit ints, selected from a and b according to imm (the result is a 127-bit int).
  mi clmulepi64_si128(mi a, mi b, ii imm)

Div/Sqrt/Reciprocal
- Div [SSE2] div: ps/d, ss/d
  m div_ps(m a, m b)
- Square Root [SSE2] sqrt: ps/d, ss/d
  m sqrt_ps(m a)
- Approx. Reciprocal [SSE] rcp: ps, ss. Approximates 1.0/x.
  m rcp_ps(m a)
- Approx. Reciprocal Sqrt [SSE] rsqrt: ps, ss. Approximates 1.0/sqrt(x).
  m rsqrt_ps(m a)
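Mul Low and Mul High are commonly paired with Interleave (unpack, see Byte Manipulation) to obtain full-width products. A hedged sketch assuming SSE2 (mul16_widen is an illustrative name, not from the sheet):

    #include <emmintrin.h>   /* SSE2 */

    /* Widen the products of eight signed 16-bit lanes into eight 32-bit lanes. */
    static void mul16_widen(__m128i a, __m128i b, __m128i *lo4, __m128i *hi4) {
        __m128i lo = _mm_mullo_epi16(a, b);   /* low 16 bits of each 32-bit product      */
        __m128i hi = _mm_mulhi_epi16(a, b);   /* high 16 bits (signed) of each product   */
        *lo4 = _mm_unpacklo_epi16(lo, hi);    /* interleave: full products of lanes 0..3 */
        *hi4 = _mm_unpackhi_epi16(lo, hi);    /* full products of lanes 4..7             */
    }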
Sign Modification
- Absolute [SSSE3] abs: epi8-32
  mi abs_epi16(mi a)
- Negate or Zero [SSSE3] sign: epi8-32. For each element pair in a and b, sets the result element to a if b is positive, to -a if b is negative, and to 0 if b is 0.
  mi sign_epi16(mi a, mi b)

Min/Max/Avg
- Min [SSE] min: SSE: ps, ss; SSE2: epu8, epi16, pd, sd; SSE4.1: epi8-32, epu8-32
  mi min_epu16(mi a, mi b)
- Max [SSE] max: same suffixes as Min
  mi max_epu16(mi a, mi b)
- Average [SSE2] avg: epu8-16
  mi avg_epu16(mi a, mi b)
- Horizontal Min [SSE4.1] minpos: epu16 (128). Computes the horizontal min of one input vector of 16-bit uints. The min is stored in the lower 16 bits and its index in the following 3 bits.
  mi minpos_epu16(mi a)

Composite Int Arithmetics
- Multiply and Horizontal Add [SSSE3] maddubs: epi16. Multiplies vertical 8-bit ints, producing 16-bit ints, and adds horizontal pairs with saturation, producing 8x 16-bit ints. The first input is treated as unsigned and the second as signed; results and intermediaries are signed.
  mi maddubs_epi16(mi a, mi b)
- Multiply and Horizontal Add [SSE2] madd: epi16. Multiplies 16-bit ints, producing 32-bit ints, and adds horizontal pairs, producing 4x 32-bit ints.
  mi madd_epi16(mi a, mi b)
- Sum of Absolute Differences [SSE2] sad: epu8. Computes the absolute differences of packed unsigned 8-bit integers in a and b, then horizontally sums each consecutive 8 differences to produce two unsigned 16-bit integers, packed into the low 16 bits of the 64-bit elements of dst.
  mi sad_epu8(mi a, mi b)
- Sum of Absolute Differences 2 [SSE4.1] mpsadbw: epu8. Computes the sum of absolute differences (SADs) of quadruplets of unsigned 8-bit integers in a compared to those in b. Eight SADs are performed using one quadruplet from b and eight quadruplets from a. One quadruplet is selected from b starting at the offset specified in imm; the eight quadruplets are formed from sequential 8-bit integers selected from a starting at the offset specified in imm.
  mi mpsadbw_epu8(mi a, mi b, ii imm)
  // a_offset := imm[2]*32; b_offset := imm[1:0]*32
  // FOR j := 0 to 7: i := j*8; k := a_offset+i; l := b_offset
  //   dst[i+15:i] := ABS(a[k+7:k]-b[l+7:l]) + ABS(a[k+15:k+8]-b[l+15:l+8])
  //                + ABS(a[k+23:k+16]-b[l+23:l+16]) + ABS(a[k+31:k+24]-b[l+31:l+24])
- Dot Product [SSE4.1] dp: ps/d. Computes the dot product of two vectors. A mask is used to state which components are to be multiplied and where the result is stored. See the sketch after this section.
  m dp_ps(m a, m b, ii imm)
  // FOR j := 0 to 3: i := j*32
  //   tmp[i+31:i] := imm[4+j] ? a[i+31:i] * b[i+31:i] : 0
  // sum := tmp[127:96] + tmp[95:64] + tmp[63:32] + tmp[31:0]
  // FOR j := 0 to 3: dst[i+31:i] := imm[j] ? sum : 0

Fused Multiply and Add
- FM-Add [FMA] f[n]madd: ps/d, ss/d. Computes (a*b)+c for each element. The n version computes -(a*b)+c.
  m fmadd_ps(m a, m b, m c)
- FM-Sub [FMA] f[n]msub: ps/d, ss/d. Computes (a*b)-c for each element. The n version computes -(a*b)-c.
  m fmsub_ps(m a, m b, m c)
- FM-AddSub [FMA] fmaddsub: ps/d. Computes (a*b)-c for elements with even index and (a*b)+c for elements with odd index.
  m fmaddsub_ps(m a, m b, m c)
- FM-SubAdd [FMA] fmsubadd: ps/d. Computes (a*b)+c for elements with even index and (a*b)-c for elements with odd index.
  m fmsubadd_ps(m a, m b, m c)
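A worked example for the Dot Product box above (dot4 is an illustrative name; assumes SSE4.1, compiled with -msse4.1):

    #include <smmintrin.h>   /* SSE4.1 */

    /* Dot product of two 4-float vectors.
       imm 0xF1: high nibble selects all four lanes for multiplication,
       low nibble stores the sum in lane 0 only. */
    static float dot4(__m128 a, __m128 b) {
        return _mm_cvtss_f32(_mm_dp_ps(a, b, 0xF1));
    }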
Register I/O
Loading data into an SSE register or storing data from an SSE register.

Set (inserting data into a register without loading from memory)
- Set [AVX] set: epi8-64x, ps[SSE]/d, m128/m128d/m128i[AVX] (SEQ!). Sets and returns an SSE register with input values; e.g., the epi8 version takes 16 input parameters. The first input gets stored in the highest bits. For epi64, use the epi64x suffix.
  mi set_epi32(i a, i b, i c, i d)
- Set Reversed [AVX] setr: epi8-64x, ps[SSE]/d, m128/m128d/m128i[AVX] (SEQ!). Like set, but the order in the register is reversed, i.e., the first input gets stored in the lowest bits.
  mi setr_epi32(i a, i b, i c, i d)
- Replicate [SSE2] set1: epi8-64x, ps[SSE]/d (SEQ!). Broadcasts one input element into all slots of an SSE register.
  mi set1_epi32(i a)
- Insert [SSE4.1] insert: epi8-64, ps; epi16[SSE2] (128). Inserts an element i at a position p into a.
  mi insert_epi16(mi a, i i, ii p)  // dst := a; sel := p[2:0]*16; dst[sel+15:sel] := i[15:0]
- Get Undefined Register [AVX] undefined: ps/d, si128-256. Returns an SSE register with undefined contents.
  mi undefined_si128()

Load (loading data into an SSE register from memory)

Aligned loads: the memory address must be 16-byte aligned (or 32-byte for 256-bit instructions).
- Aligned Load [SSE2] load: si128, pd, ps[SSE], si256[AVX]. Loads 128 bits from memory into a register. The memory address must be aligned!
  md load_pd(d* ptr)
- Aligned Load Reversed [SSE2] loadr: pd, ps (SEQ!, 128). Loads 128 bits from memory into a register; the elements are loaded in reversed order. The memory address must be aligned!
  m loadr_ps(f* ptr)
  // dst[31:0]  := *(ptr)[127:96]; dst[63:32]  := *(ptr)[95:64]
  // dst[95:64] := *(ptr)[63:32];  dst[127:96] := *(ptr)[31:0]
- Stream Load [SSE4.1] stream_load: si128, si256[AVX2]. Loads 128 bits from memory into a register; the memory is fetched at once from a USWC device without going through the cache hierarchy. The memory address must be aligned!
  mi stream_load_si128(mi* ptr)

Unaligned loads: the memory address does not have to be aligned to a specific boundary.
- Unaligned Load [SSE2] loadu: pd, ps, si16-si128. Loads 128 bits (or less for the si16-64 versions) from memory into the low bits of a register.
  md loadu_pd(d* ptr)
- Fast Unaligned Load [SSE3] lddqu: si128. Loads 128-bit integer data into an SSE register; faster than loadu if the value crosses a cache line boundary. Should be preferred over loadu.
  mi lddqu_si128(mi* ptr)
- Load Single [SSE2] load: ss[SSE], sd, epi64 (128). Loads a single element from memory and zeros the remaining bytes. For epi64, the command is loadl_epi64!
  md load_sd(d* ptr)
- Load High/Low [SSE2] loadh/l: pd, pi. Loads a value from memory into the high/low half of a register and fills the other half from a.
  md loadl_pd(md a, d* ptr)  // dst[63:0] := *(ptr)[63:0]; dst[127:64] := a[127:64]
- Broadcast Load [SSE2] load1: pd, ps (SEQ!, 128). Loads a float from memory into all slots of the 128-bit register. For pd, there is also loaddup [SSE3], which may perform faster than load1.
  md load1_pd(d* ptr); md loaddup_pd(d* ptr)
- Broadcast [AVX] broadcast: ss/d, ps/d (m128 source). Broadcasts one input element (ss/d) or 128 bits (ps/d) from memory into all slots of an SSE register. All suffixes but ss are only available in 256-bit mode!
  m broadcast_ss(f* a)
- Dual 128-bit Load [AVX] loadu2: m128, m128d, m128i (SEQ!, 256). Loads two 128-bit elements from two memory locations.
  m loadu2_m128(f* p1, f* p2)  // dst[127:0] := *p2; dst[255:128] := *p1
- Masked Load [AVX] maskload: ps/d, epi32-64[AVX2]. Loads an element from memory only if the highest bit of the corresponding element in the mask is set; otherwise 0 is used.
  mi maskload_epi64(i64* ptr, mi mask)
  // FOR j := 0 to 1: i := j*64
  //   dst[i+63:i] := mask[i+63] ? *(ptr+i) : 0
- Gather [AVX2] i32/i64gather: epi32-64, ps/d. Gathers elements from memory using 32-bit/64-bit indices. The elements are loaded from addresses starting at ptr and offset by each 32-bit/64-bit element in a (each index is scaled by the factor s, which must be 1, 2, 4, or 8). The number of gathered elements is limited by the minimum of the type to load and the type used for the offset. See the sketch after this section.
  mi i64gather_epi32(i* ptr, mi a, ii s)
  // FOR j := 0 to 1: i := j*32; m := j*64
  //   dst[i+31:i] := *(ptr + a[m+63:m]*s)
- Mask Gather [AVX2] mask_i32/i64gather: epi32-64, ps/d. Same as gather but takes an additional mask and src register. Each element is only gathered if the highest bit of the corresponding element in the mask is set; otherwise it is copied from src.
  mi mask_i64gather_epi32(mi src, i* ptr, mi a, mi mask, ii s)
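A hedged sketch of the gather boxes above (gather8 is an illustrative name; assumes AVX2, compiled with -mavx2; note that the scale must be a compile-time constant):

    #include <immintrin.h>   /* AVX2 */

    /* Gather eight floats scattered at base[idx[0..7]]; scale 4 = sizeof(float). */
    static __m256 gather8(const float *base, __m256i idx) {
        return _mm256_i32gather_ps(base, idx, 4);
    }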


Store (storing data from an SSE register into memory)

Aligned stores: the memory address must be 16-byte aligned (or 32-byte for 256-bit instructions).
- Aligned Store [SSE2] store: si128, pd, ps[SSE], si256[AVX]. Stores 128 bits from the register to memory. The memory address must be aligned!
  v store_pd(d* ptr, md a)
- Aligned Store Reversed [SSE2] storer: pd, ps[SSE] (SEQ!, 128). Stores the elements from the register to memory in reverse order. The memory address must be aligned!
  v storer_pd(d* ptr, md a)
- Broadcast Store [SSE2] store1: pd, ps[SSE] (SEQ!, 128). Stores a float to memory, replicating it two (pd) or four (ps) times. The memory address must be aligned!
  v store1_pd(d* ptr, md a)
- Stream Store [SSE2] stream: si32-128, pd, ps[SSE], si256[AVX]. Stores 128 bits (or 32-64 bits for the si32-64 versions) into memory using a non-temporal hint to minimize cache pollution. The memory address must be aligned! See the sketch after this section.
  v stream_si128(mi* p, mi a); v stream_si32(i* p, i a)
- Masked Store [AVX] maskstore: ps/d, epi32-64[AVX2]. Stores an element from a into memory at p if the corresponding element in mask m has its highest bit set.
  v maskstore_ps(f* p, mi m, m a)

Unaligned stores: the memory address does not have to be aligned to a specific boundary.
- Unaligned Store [SSE2] storeu: si16-128, pd, ps[SSE], si256[AVX]. Stores 128 bits (or less for the si16-64 versions) into memory.
  v storeu_pd(d* ptr, md a)
- Store High [SSE2] storeh: pd (128). Stores the high 64 bits into memory.
  v storeh_pd(d* ptr, md a)
- Store Single [SSE2] store[l]: ss[SSE], sd, epi64 (128). Stores the low element into memory. The l version must be used for epi64 and can be used for pd.
  v store_sd(d* ptr, md a)
- Masked Byte Store (Scatter) [SSE2] maskmoveu: si128 (128). Stores bytes into memory if the corresponding byte in mask has its highest bit set.
  v maskmoveu_si128(mi a, mi mask, c* ptr)
- Dual 128-bit Store [AVX] storeu2: m128, m128d, m128i (SEQ!, 256). Stores two 128-bit elements into two memory locations.
  v storeu2_m128(f* p1, f* p2, m a)  // *p2 := a[127:0]; *p1 := a[255:128]

Fences
- Store Fence [SSE] sfence. Guarantees that every store instruction that precedes, in program order, is globally visible before any store instruction which follows the fence in program order.
  v sfence()
- Load Fence [SSE2] lfence. Guarantees that every load instruction that precedes, in program order, is globally visible before any load instruction which follows the fence in program order.
  v lfence()
- Memory Fence [SSE2] mfence. Guarantees that every memory access that precedes, in program order, the memory fence instruction is globally visible before any memory instruction which follows the fence in program order.
  v mfence()

Misc I/O
- Prefetch [SSE] prefetch. Fetches the line of data from memory that contains address ptr to a location in the cache hierarchy specified by the locality hint i.
  v prefetch(c* ptr, i i)
- Cache Line Flush [SSE2] clflush. Flushes the cache line that contains p from all cache levels.
  v clflush(v* ptr)
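A hedged sketch combining Stream Store with Store Fence (fill_nt is an illustrative name; assumes SSE2 and a 16-byte-aligned destination):

    #include <emmintrin.h>   /* SSE2 */
    #include <stddef.h>

    /* Fill a buffer without polluting the cache, then fence. */
    static void fill_nt(__m128i *dst, __m128i v, size_t n /* count of __m128i */) {
        for (size_t i = 0; i < n; i++)
            _mm_stream_si128(dst + i, v);   /* non-temporal store; dst must be aligned */
        _mm_sfence();                       /* make the streaming stores globally visible */
    }

The trailing sfence matters because non-temporal stores are weakly ordered relative to ordinary stores.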
Byte Manipulation
Change the order of bytes in a register, duplicate or zero bytes selectively, or mix the bytes of two registers.

Extraction and 128-bit Insertion
- Extract [SSE4.1] extract: epi8-64, ps; epi16[SSE2] (128). Extracts one element from a register and returns it as a normal int (no SSE register!).
  i extract_epi16(mi a, ii i)  // (a >> (i[2:0]*16))[15:0]
- 256-bit Extract [AVX] extractf128: si256, ps/d (256). Extracts 128 bits and returns them as a 128-bit SSE register. For AVX2, there is only si256 and the operation is named extracti128.
  m128 extractf128_ps(m256 a, ii i)  // (a >> (i*128))[127:0]
- 256-bit Insert [AVX] insertf128: si256, ps/d (256). Inserts 128 bits at a position specified by i. For AVX2, there is only si256 and the operation is named inserti128.
  m insertf128_ps(m a, m128 b, ii i)  // dst := a; sel := i*128; dst[sel+127:sel] := b[127:0]

Mix Registers
- Move Single Element [SSE] move: ss[SSE], sd. Moves the lowest element from the second input and fills the remaining elements from the first.
  m move_ss(m a, m b)  // dst[31:0] := b[31:0]; dst[127:32] := a[127:32]
- Move High↔Low [SSE] movelh/hl: ps (128). The lh version moves the lower half of b into the upper half of the result and fills the rest from a; the hl version moves the upper half of b to the lower half.
  m movehl_ps(m a, m b)
- Concatenate and Byteshift (Align) [SSSE3] alignr: epi8. Concatenates a and b and shifts the result right by c bytes.
  mi alignr_epi8(mi a, mi b, i c)  // ((a << 128) | b) >> c*8
- Blend [SSE4.1] blend[v]: epi8-16, epi32[AVX2], ps/d. Mixes elements from two registers: blend selects via an immediate mask, blendv via a register mask (the highest bit of each mask element selects).
  mi blend_epi16(mi a, mi b, ii imm)
  // FOR j := 0 to 7: i := j*16
  //   dst[i+15:i] := imm[j] ? b[i+15:i] : a[i+15:i]
- Interleave (Unpack) [SSE2] unpackhi/lo: epi8-64, ps[SSE], pd. Interleaves elements from the high/low halves of two input registers into a target register.
  mi unpackhi_epi32(mi a, mi b)
  // dst[31:0]  := a[95:64];  dst[63:32]  := b[95:64]
  // dst[95:64] := a[127:96]; dst[127:96] := b[127:96]

Byte Shuffling (change the byte order using a control mask)
- 32-bit Int Shuffle [SSE2] shuffle: epi32. Shuffles the 32-bit ints in a using an immediate control mask.
  mi shuffle_epi32(mi a, ii i)
  // S(s, mask): CASE(mask[1:0]) 0: s[31:0]; 1: s[63:32]; 2: s[95:64]; 3: s[127:96]
  // dst[31:0]  := S(a, i[1:0]); dst[63:32]  := S(a, i[3:2])
  // dst[95:64] := S(a, i[5:4]); dst[127:96] := S(a, i[7:6])
- Shuffle High/Low 16 bit [SSE2] shufflehi/lo: epi16. Shuffles the high/low half of the register using an immediate control mask; the rest of the register is copied from the input.
  mi shufflehi_epi16(mi a, ii i)
  // dst[63:0]   := a[63:0]
  // dst[79:64]  := (a >> (i[1:0]*16))[79:64]; dst[95:80]   := (a >> (i[3:2]*16))[79:64]
  // dst[111:96] := (a >> (i[5:4]*16))[79:64]; dst[127:112] := (a >> (i[7:6]*16))[79:64]
- Byte Shuffle [SSSE3] shuffle: epi8. Shuffles packed 8-bit integers in a according to the shuffle control mask in the corresponding 8-bit element of b. See the sketch after this section.
  mi shuffle_epi8(mi a, mi b)
  // FOR j := 0 to 15: i := j*8
  //   IF b[i+7] == 1: dst[i+7:i] := 0
  //   ELSE: k := b[i+3:i]*8; dst[i+7:i] := a[k+7:k]
- Float Shuffle [SSE] shuffle: ps[SSE], pd. Shuffles floats from two registers: the lower part of the result receives values from the first register, the upper part from the second. The shuffle mask is an immediate! The 256-bit version performs the same operation on two 128-bit lanes.
  md shuffle_pd(md a, md b, ii i)
  // dst[63:0]   := i[0] ? a[127:64] : a[63:0]
  // dst[127:64] := i[1] ? b[127:64] : b[63:0]
- Shuffle 128-bit Chunks [AVX] permute2f128: ps/d, si256 (256). Takes 2 registers and shuffles 128-bit chunks to a target register; each chunk can also be cleared by a bit in the mask. For AVX2, there is only si256, renamed to permute2x128.
  md permute2f128_pd(md a, md b, ii i)
  // S(s1, s2, control): CASE(control[1:0]) 0: s1[127:0]; 1: s1[255:128]; 2: s2[127:0]; 3: s2[255:128]
  //   IF control[3]: RETURN 0
  // dst[127:0] := S(a, b, i[3:0]); dst[255:128] := S(a, b, i[7:4])
- Single-Register Float Shuffle [AVX] permute[var]: ps/d. The 128-bit version is the same as the int shuffle for floats. The normal version uses an immediate; the var version uses a register (the lowest bits in each element of b form the mask for that element in a).
  m permutevar_ps(m a, mi b)
- 4x64-bit Shuffle [AVX2] permute4x64: epi64, pd (256). Same as the float shuffle, but no [var] version is available; in addition, the shuffle is performed over the full 256 bits instead of two lanes of 128 bits.
  md permute4x64_pd(md a, ii i)
- 8x32-bit Shuffle [AVX2] permutevar8x32: epi32, ps (256). Same as the 4x64-bit shuffle but with 8x32-bit elements; only a [var] version taking a register instead of an immediate is available.
  m permutevar8x32_ps(m a, mi b)

Broadcast (replicate one element in a register to fill the entire register)
- Broadcast [AVX2] broadcastX: epi8-64, ps/d, si256. Broadcasts the lowest element into all slots of the result register; the si256 version broadcasts one 128-bit element! The letter X must be the following: epi8: b, epi16: w, epi32: d, epi64: q, ps: s, pd: d, si256: si128.
  mi broadcastb_epi8(mi a)
- Duplicate Low Half [SSE3] movedup: pd. Duplicates the lower half of a.
  md movedup_pd(md a)  // a[63:0] | (a[63:0] << 64)
- Duplicate Even/Odd Floats [SSE3] movel/hdup: ps. Duplicates 32-bit floats into the neighboring slot of each 64-bit pair: the l version duplicates the even-indexed elements (bits [31:0] of each pair), the h version the odd-indexed ones (bits [63:32]).
  m moveldup_ps(m a)

Byte Zeroing
- Move and Zero High [SSE2] move: epi64. Moves the lower half of the input into the result and zeros the remaining bytes.
  mi move_epi64(mi a)  // dst[63:0] := a[63:0]
- Zero Register [SSE2] setzero: ps[SSE]/d, si128. Returns a register with all bits zeroed.
  mi setzero_si128()
- Zero All Registers [AVX] zeroall. Zeros all SSE/AVX registers.
  v zeroall()
- Zero High Halves [AVX] zeroupper. Zeros bits [MAX:256] of all SSE/AVX registers.
  v zeroupper()

Byte Movement
- 128-bit Byteshift [SSE2] [b]sl/rli: si128. Shifts the input register left/right by i bytes while shifting in zeros. There is no difference between the b and the non-b version.
  mi bsrli_si128(mi a, ii i)
- Byte Swap [x86] _bswap[64]: (i32, i64), ≤64. Swaps the bytes in the 32-bit int or 64-bit int.
  i64 _bswap64(i64 a)
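A worked example for the Byte Shuffle box above (reverse_bytes is an illustrative name; assumes SSSE3, compiled with -mssse3):

    #include <tmmintrin.h>   /* SSSE3 */

    /* Reverse the 16 bytes of a register with a constant control mask. */
    static __m128i reverse_bytes(__m128i v) {
        const __m128i ctrl = _mm_setr_epi8(15, 14, 13, 12, 11, 10, 9, 8,
                                            7,  6,  5,  4,  3,  2, 1, 0);
        return _mm_shuffle_epi8(v, ctrl);
    }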
Comparisons
Elements that compare true receive 1s in all bits, otherwise 0s.

- Float Compare [SSE2] cmp[n]Z: ps/d, ss/d. Z can be one of: ge/le/lt/gt/eq. The n version is a not version, e.g., neq computes not equal.
  md cmpeq_pd(md a, md b)
- Float Compare NaN [SSE2] cmp[un]ord: ps/d, ss/d. cmpord sets the result bits to 1 if both elements are not NaN, otherwise 0; cmpunord sets the bits if at least one is NaN.
  md cmpord_pd(md a, md b)
- AVX Float Compare [AVX] cmp: ps/d, ss/d. Compares packed or single elements based on imm; possible values are 0-31 (check the documentation).
  md cmp_pd(md a, md b, ii imm)
- Compare Single Float [SSE2] [u]comiZ: ss/d (128). Z can be one of: eq/ge/gt/le/lt/neq. Returns a single int that is either 1 or 0. The u version does not signal an exception for QNaNs and is not available for 256 bits!
  i comieq_sd(md a, md b)
- Int Compare [SSE2] cmpZ: epi8-32, epi64[SSE4.1]. Z can be one of: lt/gt/eq.
  mi cmpeq_epi8(mi a, mi b)

Specials

AES
- AES KeyGen Assist [AES] aeskeygenassist: si128 (128). Assists in expanding the AES cipher key by computing steps towards generating a round key for encryption, using data from a and an 8-bit round constant specified in i.
  mi aeskeygenassist_si128(mi a, ii i)
- AES Inverse Mix Columns [AES] aesimc: si128 (128). Performs the inverse mix columns transformation on the input.
  mi aesimc_si128(mi a)
- AES Encrypt [AES] aesenc[last]: si128 (128). Performs one round of an AES encryption flow on the state in a using the round key; the last version performs the last round.
  mi aesenc_si128(mi a, mi key)
- AES Decrypt [AES] aesdec[last]: si128 (128). Performs one round of an AES decryption flow on the state in a using the round key; the last version performs the last round.
  mi aesdec_si128(mi a, mi key)

Cyclic Redundancy Check
- CRC32 [SSE4.2] crc32: u8-u64 (≤64). Starting with the initial value in crc, accumulates a CRC32 value for the unsigned X-bit integer v, and stores the result in dst.
  u crc32_u32(u crc, u v)
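A hedged sketch of accumulating a checksum over a buffer with the CRC32 box above (crc32c is an illustrative name; the instruction implements the CRC-32C/Castagnoli polynomial, and the initial/final XOR with all-ones follows the common CRC-32C convention; assumes SSE4.2, compiled with -msse4.2):

    #include <nmmintrin.h>   /* SSE4.2 */
    #include <stddef.h>
    #include <stdint.h>

    static uint32_t crc32c(const uint8_t *p, size_t n) {
        uint32_t crc = 0xFFFFFFFFu;
        for (size_t i = 0; i < n; i++)
            crc = _mm_crc32_u8(crc, p[i]);   /* fold one byte into the running CRC */
        return crc ^ 0xFFFFFFFFu;
    }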
Tests
Perform a bitwise test and check whether all resulting bits are 0s.

- Test [SSE4.1/AVX] testc/z: si128[SSE4.1], si256, ps/d. Computes the bitwise AND of a and b and sets ZF to 1 if the result is zero, otherwise 0; computes the bitwise AND NOT of a and b and sets CF to 1 if the result is zero, otherwise 0. The c version returns CF and the z version ZF. For 128 bit, there is also test_all_zeros, which does the same as testz_si128.
  i testc_si128(mi a, mi b)
  // ZF := ((a & b) == 0); CF := ((a & !b) == 0); RETURN CF
- Test Not All Zeros/Ones [SSE4.1/AVX] test_ncz: si128[SSE4.1], si256, ps/d. Returns !CF && !ZF for the flags defined above. For 128 bit, there is also test_mix_ones_zeros, which does the same.
  i test_mix_ones_zeros(mi a, mi b)
  // ZF := ((a & b) == 0); CF := ((a & !b) == 0); RETURN !CF && !ZF
- Test All Ones [SSE4.1] test_all_ones (128, SEQ!). Returns true iff all bits in a are set. Needs two instructions and may be slower than native implementations.
  i test_all_ones(mi a)  // (~a) == 0

Misc
- Pause [SSE2] pause. Provides a hint to the processor that the code sequence is a spin-wait loop. This can help improve the performance and power consumption of spin-wait loops.
  v pause()
- Monitor [SSE3] monitor. Arms the address monitoring hardware using the address specified in ptr. A store to an address within the specified address range triggers the monitoring hardware. Specify optional extensions in e and optional hints in h.
  v monitor(v* ptr, u e, u h)
- Monitor Wait [SSE3] mwait. Hint to the processor that it can enter an implementation-dependent optimized state while waiting for an event or store operation to the address range specified by MONITOR.
  v mwait(u ext, u hints)
- Get MXCSR Register [SSE] getcsr. Gets the content of the MXCSR register.
  u getcsr()
- Set MXCSR Register [SSE] setcsr. Sets the MXCSR register with a 32-bit int.
  v setcsr(u a)

Transactional Memory (TSX)
- Begin Transaction [TSX] _xbegin. Specifies the start of an RTM code region. If the logical processor was not already in transactional execution, it transitions into transactional execution. On an RTM abort, the logical processor discards all architectural register and memory updates performed during the RTM execution, restores architectural state, and starts execution beginning at the fallback address computed from the outermost XBEGIN instruction.
  u _xbegin()
- Commit Transaction [TSX] _xend. Specifies the end of an RTM code region. If this corresponds to the outermost scope, the logical processor will attempt to commit the logical processor state atomically. If the commit fails, the logical processor will perform an RTM abort.
  v _xend()
- Abort Transaction [TSX] _xabort. Forces an RTM abort. The EAX register is updated to reflect that an XABORT instruction caused the abort, and the imm parameter will be provided in bits [31:24] of EAX. Following an RTM abort, the logical processor resumes execution at the fallback address computed through the outermost XBEGIN instruction.
  v _xabort(ii imm)
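A hedged sketch of the RTM boxes above (increment is an illustrative name; assumes RTM support and -mrtm; note that TSX is fused off or disabled by microcode on many CPUs, so the fallback path is mandatory in practice; __sync_fetch_and_add is a GCC/Clang builtin standing in for a real lock):

    #include <immintrin.h>   /* RTM intrinsics */

    static long counter;

    static void increment(void) {
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            counter++;                         /* runs transactionally */
            _xend();                           /* commit */
        } else {
            /* aborted: status encodes the cause; fall back to an atomic RMW */
            __sync_fetch_and_add(&counter, 1);
        }
    }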
String Compare (SSE4.2)

Each string compare operation has an i and an e version:
  cmpistrX(mi a, mi b, ii i) compares all elements;
  cmpestrX(mi a, i la, mi b, i lb, ii i) compares up to specific lengths la and lb.
The immediate value i for all these comparisons consists of bit flags. Exactly one flag per group must be present:

Data type specifier:
- _SIDD_UBYTE_OPS: unsigned 8-bit chars
- _SIDD_UWORD_OPS: unsigned 16-bit chars
- _SIDD_SBYTE_OPS: signed 8-bit chars
- _SIDD_SWORD_OPS: signed 16-bit chars

Compare mode specifier:
- _SIDD_CMP_EQUAL_ANY: for each character c in a, determine whether any character in b is equal to c.
- _SIDD_CMP_RANGES: for each character c in a, determine whether b0 <= c <= b1 or b2 <= c <= b3, and so on.
- _SIDD_CMP_EQUAL_EACH: check for string equality of a and b.
- _SIDD_CMP_EQUAL_ORDERED: search substring b in a; each byte where b begins in a is treated as a match.

Polarity specifier:
- _SIDD_POSITIVE_POLARITY: a match is indicated by a 1-bit.
- _SIDD_NEGATIVE_POLARITY: negation of the resulting bitmask.
- _SIDD_MASKED_NEGATIVE_POLARITY: negation of the resulting bitmask, except for bits that have an index larger than the size of a or b.

Operations:
- String Compare Mask [SSE4.2] cmpi/estrm. Compares strings a and b and returns the comparison mask. If _SIDD_BIT_MASK is used, the resulting mask is a bit mask; if _SIDD_UNIT_MASK is used, the result is a byte mask which has ones in all bits of the matching bytes.
  mi cmpistrm(mi a, mi b, ii i); mi cmpestrm(mi a, i la, mi b, i lb, ii i)
- String Compare Index [SSE4.2] cmpi/estri. Compares strings in a and b and returns the index of the first byte that matches. Otherwise Maxsize is returned (either 8 or 16, depending on the data type).
  i cmpistri(mi a, mi b, ii i); i cmpestri(mi a, i la, mi b, i lb, ii i)
- String Compare Truth [SSE4.2] cmpi/estrc. Compares strings in a and b and returns true iff the resulting mask is not zero, i.e., if there was a match.
  i cmpistrc(mi a, mi b, ii i); i cmpestrc(mi a, i la, mi b, i lb, ii i)
- String Compare with Nullcheck [SSE4.2] cmpi/estra. Compares strings in a and b and returns true iff the resulting mask is zero and there is no null character in b.
  i cmpistra(mi a, mi b, ii i); i cmpestra(mi a, i la, mi b, i lb, ii i)
- String Nullcheck [SSE4.2] cmpi/estrs/z. Compares two strings a and b and returns whether a (s version) or b (z version) contains a null character.
  i cmpistrs(mi a, mi b, ii i); i cmpestrs(mi a, i la, mi b, i lb, ii i)
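A hedged sketch of cmpistri in ranges mode (first_digit is an illustrative name; it follows Intel's operand convention where the first operand holds the range pairs and the implicit-length version treats a zero byte as end of string; assumes SSE4.2, compiled with -msse4.2):

    #include <nmmintrin.h>   /* SSE4.2 */

    /* Index of the first ASCII digit in a 16-byte chunk, or 16 if none. */
    static int first_digit(__m128i chunk) {
        const __m128i digits = _mm_setr_epi8('0', '9', 0, 0, 0, 0, 0, 0,
                                              0,   0,  0, 0, 0, 0, 0, 0);
        return _mm_cmpistri(digits, chunk, _SIDD_UBYTE_OPS | _SIDD_CMP_RANGES);
    }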