Intrinsics Cheat Sheet Jan Finis [email protected]

Introduction

This cheat sheet displays most x86 intrinsics supported by current processors. The following intrinsics were omitted:

· obsolete or discontinued instruction sets like MMX and 3DNow!
· AVX-512, as it will not be available for some time and would blow up the space needed due to the vast amount of new instructions
· intrinsics not supported by Intel but only by AMD, like parts of the XOP instruction set (maybe they will be added in the future)
· intrinsics that are only useful for operating systems, like the _xsave intrinsic to save the CPU state
· the RdRand intrinsics, as it is unclear whether they provide real random numbers without enabling kleptographic loopholes

Each family of intrinsics is depicted by a box as described below. The intrinsics are grouped meaningfully where possible. Most information is taken from the Intel Intrinsics Guide (http://software.intel.com/en-us/articles/intel-intrinsics-guide). Let me know ([email protected]) if you find any wrong or unclear content.

When not stated otherwise, it can be assumed that each vector intrinsic performs its operation vertically on all elements that are packed into the input SSE registers. For example, the add intrinsic carries no description; thus, it can be assumed that it performs a vertical add of the elements of the two input registers, i.e., the first element from register a is added to the first element in register b, the second to the second, and so on. In contrast, a horizontal add would add the first element of a to the second element of a, the third to the fourth, and so on.
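A minimal sketch of the difference (assuming SSE3 for the horizontal variant; the values and the printing are purely illustrative):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);    /* elements: [1, 2, 3, 4]     */
    __m128 b = _mm_set_ps(40.f, 30.f, 20.f, 10.f);    /* elements: [10, 20, 30, 40] */

    __m128 v = _mm_add_ps(a, b);    /* vertical:   [1+10, 2+20, 3+30, 4+40]        */
    __m128 h = _mm_hadd_ps(a, b);   /* horizontal: [1+2, 3+4, 10+20, 30+40] (SSE3) */

    float vf[4], hf[4];
    _mm_storeu_ps(vf, v);
    _mm_storeu_ps(hf, h);
    printf("vertical:   %g %g %g %g\n", vf[0], vf[1], vf[2], vf[3]);
    printf("horizontal: %g %g %g %g\n", hf[0], hf[1], hf[2], hf[3]);
    return 0;
}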

To use the intrinsics, include the appropriate header (e.g., <immintrin.h>) and make sure that you set the target architecture to one that supports the intrinsics you want to use (using the -march=X compiler flag).
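For instance, a 256-bit flavor of the addition above needs the _mm256_ prefix and an AVX-capable target; a small sketch (compile with something like gcc -O2 -mavx file.c or -march=native; the function name add4 is only illustrative):

#include <immintrin.h>

/* _mm256_ prefix + pd suffix: adds 4 packed doubles (requires AVX). */
__m256d add4(__m256d a, __m256d b) {
    return _mm256_add_pd(a, b);
}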

Legend

Available Bitwidth: If no bitwidth is specified, the operation is available for 128bit and 256bit SSE registers. Use the _mm_ prefix for the 128bit and the _mm256_ prefix for the 256bit flavors. Otherwise, the following restrictions apply:

256: Only available for 256bit SSE registers (always use the _mm256_ prefix)
128: Only available for 128bit SSE registers (always use the _mm_ prefix)
≤64: Operation does not operate on SSE registers but on usual 64bit registers (always use the _mm_ prefix)

Each intrinsic box contains the following fields:

Name: Human readable name of the operation.

Sequence Indicator (SEQ!): If this indicator is present, the intrinsic will generate a sequence of assembly instructions instead of only one. Thus, it may not be as efficient as anticipated.

Instruction Set: Specifies the instruction set which is necessary for this operation. If more than one instruction set is given, then different flavors of this operation require different instruction sets.

Name(s) of the intrinsic: The names of the various flavors of the intrinsic. To assemble the final name to be used for an intrinsic, one must add a prefix and a suffix. The suffix determines the data type (see next field). Concerning the prefix, all intrinsics, except the ones prefixed with a green underscore _, need to be prefixed with _mm_ for 128bit versions or _mm256_ for 256bit versions. Blue letters in brackets like [n] indicate that adding this letter leads to another flavor of the function. Blue letters separated with a slash like l/r indicate that either letter can be used and leads to a different flavor. A red letter like Z indicates that various different strings can be inserted here, which are stated in the notes section.

List of available data type suffixes: The suffix chosen determines the data type on which the intrinsic operates; consult the suffix table for further information about the various possible suffixes. The suffix must be added to the intrinsic name separated by an underscore, so a possible name for the data type pd in the example below would be _mm_cmpeq_pd. If a suffix is followed by an instruction set in brackets, then this instruction set is required for that suffix. All suffixes without an explicit instruction set are available in the instruction set specified at the left. If no type suffixes are shown, then the method is type independent and must be used without a suffix.

Notes: Explains the semantics of the intrinsic, the various flavors, and other important information.

Signature and Pseudocode: Shows one possible signature of the intrinsic in order to depict the parameter types and the return type. Note that only one flavor and data type suffix is shown; the signature has to be adjusted for other suffixes and flavors. The data types are displayed in shortened form; consult the data types table for more information. In addition to the signature, the pseudocode for some intrinsics is shown. The special variable dst depicts the destination register of the method which will be returned by the intrinsic.

Example box (Float Compare, SSE2, 128): cmp[n]Z with suffixes ps/d, ss/d, si128[SSE4.1]. Z can be one of ge/le/lt/gt/eq; the n version is a not version, e.g., cmpneq computes not equal. Signature: md cmpeq_pd(md a, md b).

Instruction Sets: Each intrinsic is only available on machines which support the corresponding instruction set. This list depicts the instruction sets and the first Intel and AMD CPUs that supported them (release year behind the CPU).

I.Set    Intel              AMD               Description
x86      all                all               x86 base instructions
SSE      Pentium III 99     K7 Palomino 01    Streaming SIMD Extensions
SSE2     Pentium 4 01       K8 03
SSE3     Prescott 04        K9 05
SSSE3    Woodcrest 06       Bulldozer 11
SSE4.1   Penryn 07          Bulldozer 11
SSE4.2   Nehalem 08         Bulldozer 11
AVX      Sandy Bridge 11    Bulldozer 11      Advanced Vector Extensions
AVX2     Haswell 13         -
FMA      Haswell 13         Bulldozer 11      Fused Multiply and Add
AES      Westmere 10        Bulldozer 11      Advanced Encryption Standard
CLMUL    Westmere 10        Bulldozer 11      Carryless Multiplication
CVT16    Ivy Bridge 12      Bulldozer 11      16-bit Floats
TSX      Haswell 13         -                 Transactional Sync. Extensions
POPCNT   Nehalem 08         K10 07
LZCNT    Haswell 13         K10 07
BMI1     Sandy Bridge 11    Bulldozer 11      Bit Manipulation Instructions
BMI2     Haswell 13         -                 Bit Manipulation Instructions

Data Type Suffixes: Most intrinsics are available for various suffixes which depict different data types. The table depicts the suffixes used and the corresponding SSE register types (see the Data Types table for descriptions).

Suffix   Reg.Type   Description
ph       mi         Packed half float (16 bit float)
ps/pd    m/md       Packed single float / packed double float
epiX     mi         Packed X bit signed integer
epuX     mi         Packed X bit unsigned integer
siX      mi         Single X bit signed integer. If there is a si128 version, the same function for 256 bits is usually suffixed with si256.
suX      mi         Single X bit unsigned integer in SSE register
uX       -          Single X bit unsigned integer (not in SSE register)
ss/sd    m/md       Single single float / double float, remaining bits copied. Operations involving these types often have different signatures: an extra input register is used, and all bits which are not used by the single element are copied from this register. The signatures are not depicted here; see the manual for exact signatures.

Data Types: The following data types are used in the signatures of the intrinsics. Note that most types depend on the used type suffix; only one example suffix is shown in the signature.

Type                 Description
i                    Signed 32 bit integer (int)
iX                   Signed X bit integer (intX_t)
uX                   Unsigned X bit integer (uintX_t)
ii                   Immediate signed 32 bit integer: parameters of this type must be a compile time constant
f                    32-bit float (float)
d                    64-bit float (double)
m                    Float SSE register, i.e., __m128 or __m256, depending on the bitwidth used
md                   64-bit float (double) SSE register, i.e., __m128d or __m256d, depending on the bitwidth used
mi                   Integer SSE register, i.e., __m128i or __m256i, depending on the bitwidth used
m128/m128d/m128i     AVX sometimes uses these for 128bit single/double/int operations
v                    void
X*                   Pointer to X

Comparisons

Float Compare (SSE2, 128): cmp[n]Z — ps/d, ss/d
  Z can be one of ge/le/lt/gt/eq. Elements that compare true receive 1s in all bits, otherwise 0s. The n version is a not version, e.g., cmpneq computes not equal.
  md cmpeq_pd(md a, md b)

Compare Not NaN (SSE2): cmp[un]ord — ps/d, ss/d
  For each element pair, cmpord sets the result bits to 1 if both elements are not NaN, otherwise 0. cmpunord sets the bits if at least one is NaN.
  md cmpord_pd(md a, md b)

Compare (AVX): cmp — ps/d, ss/d
  Compares packed or single elements based on the third operand, which is an immediate. Possible values are 0-31; check the documentation for their semantics. Elements that compare true receive 1s in all bits, otherwise 0s.
  md cmp_pd(md a, md b, ii imm)

Compare Single (SSE2): [u]comiZ — ss/d
  Z can be one of eq/ge/gt/le/lt/neq. Returns a single int that is either 1 or 0. The u version does not signal an exception for QNaNs and is not available for 256 bits.
  i comieq_sd(md a, md b)

Int Compare (SSE2): cmpZ — epi8-32, epi64[SSE4.1]
  Z can be one of lt/gt/eq. Elements that compare true receive 1s in all bits, otherwise 0s.
  mi cmpeq_epi8(mi a, mi b)

Bit Compare: perform a bitwise operation and check whether all bits are 0s afterwards.

Test (SSE4.1, AVX): testc/z — si128[SSE4.1], si256, ps/d
  Compute the bitwise AND of a and b and set ZF to 1 if the result is zero, otherwise to 0. Compute the bitwise AND of b with NOT a and set CF to 1 if the result is zero, otherwise to 0. The c version returns CF and the z version ZF. For 128 bit, there is also the operation test_all_zeros, which does the same as testz_si128.
  i testc_si128(mi a, mi b)
  ZF := ((a & b) == 0); CF := ((~a & b) == 0); RETURN CF

Test Mix Ones And Zeros (SSE4.1, AVX): testnzc — si128[SSE4.1], si256, ps/d
  Computes ZF and CF as above and returns !CF && !ZF. For 128 bit, there is also the operation test_mix_ones_zeros, which does the same.
  i test_mix_ones_zeros(mi a, mi b)

Test All Ones (SSE4.1): test_all_ones
  Returns true iff all bits in a are set. Needs two instructions and may be slower than native implementations.
  i test_all_ones(mi a)
  (~a) == 0

String Compare (SSE4.2, SEQ!, 128): cmpi/estrX
  cmpistrX(mi a, mi b, ii imm) / cmpestrX(mi a, i la, mi b, i lb, ii imm)
  Each operation has an i and an e version. The i version compares all elements; the e version compares only up to the specified lengths la and lb. The immediate value for all these comparisons consists of bit flags; exactly one flag per group must be present:

  Data type specifier:
  _SIDD_UBYTE_OPS   unsigned 8-bit chars
  _SIDD_UWORD_OPS   unsigned 16-bit chars
  _SIDD_SBYTE_OPS   signed 8-bit chars
  _SIDD_SWORD_OPS   signed 16-bit chars

  Compare mode specifier:
  _SIDD_CMP_EQUAL_ANY       For each character c in a, determine whether any character in b is equal to c.
  _SIDD_CMP_RANGES          For each character c in a, determine whether b0 <= c <= b1 or b2 <= c <= b3 ...
  _SIDD_CMP_EQUAL_EACH      Check for string equality of a and b.
  _SIDD_CMP_EQUAL_ORDERED   Search substring b in a. Each byte where b begins in a is treated as a match.

  Polarity specifier:
  _SIDD_POSITIVE_POLARITY         A match is indicated by a 1-bit.
  _SIDD_NEGATIVE_POLARITY         Negation of the resulting bitmask.
  _SIDD_MASKED_NEGATIVE_POLARITY  Negation of the resulting bitmask, except for bits that have an index larger than the size of a or b.

  The X in the name selects what is returned:

  cmpi/estrm: Returns the comparison mask. If _SIDD_BIT_MASK is used, the mask is a bit mask; if _SIDD_UNIT_MASK is used, the result is a byte mask which has ones in all bits of the matching bytes.
    mi cmpistrm(mi a, mi b, ii imm); mi cmpestrm(mi a, i la, mi b, i lb, ii imm)
  cmpi/estri: Returns the index of the first byte that matches; otherwise Maxsize is returned (either 8 or 16, depending on the data type).
    i cmpistri(mi a, mi b, ii imm)
  cmpi/estrc: Returns true iff the resulting mask is not zero, i.e., if there was a match.
    i cmpistrc(mi a, mi b, ii imm)
  cmpi/estra: Returns true iff the resulting mask is zero and there is no null character in b.
    i cmpistra(mi a, mi b, ii imm)
  cmpi/estrs/z: Returns whether a (s version) or b (z version) contains a null character.
    i cmpistrs(mi a, mi b, ii imm)
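A small sketch (assuming SSE2) that combines an integer compare with movemask to locate a byte in a 16-byte block; the helper name find_byte16 and the bit-scan loop are only illustrative:

#include <immintrin.h>

/* Returns the index of the first occurrence of needle in p[0..15], or -1. */
int find_byte16(const unsigned char *p, unsigned char needle) {
    __m128i block = _mm_loadu_si128((const __m128i *)p);      /* unaligned load */
    __m128i eq    = _mm_cmpeq_epi8(block, _mm_set1_epi8((char)needle));
    int mask      = _mm_movemask_epi8(eq);                    /* bit i set => p[i] matched */
    if (mask == 0) return -1;
    int idx = 0;
    while (!(mask & 1)) { mask >>= 1; ++idx; }                /* index of lowest set bit */
    return idx;
}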
AES and CRC32

AES KeyGen Assist (AES, 128): aeskeygenassist — si128
  Assist in expanding the AES cipher key by computing steps towards generating a round key for an encryption cipher, using data from a and an 8-bit round constant specified in imm.
  mi aeskeygenassist_si128(mi a, ii imm)

AES Inverse Mix Columns (AES, 128): aesimc — si128
  Performs the inverse mix columns transformation on the input.
  mi aesimc_si128(mi a)

AES Encrypt (AES, 128): aesenc[last] — si128
  Perform one round of an AES encryption flow on data (state) in a using the round key. The last version performs the last round.
  mi aesenc_si128(mi a, mi key)

AES Decrypt (AES, 128): aesdec[last] — si128
  Perform one round of an AES decryption flow on data (state) in a using the round key. The last version performs the last round.
  mi aesdec_si128(mi a, mi key)

Cyclic Redundancy Check (SSE4.2, ≤64): crc32 — u8-u64
  Starting with the initial value in crc, accumulates a CRC32 value for the unsigned X-bit integer v and stores the result in dst.
  u crc32_u32(u crc, u v)

Transactional Memory (TSX)

Begin Transaction (TSX): _xbegin
  Specify the start of an RTM code region. If the logical processor was not already in transactional execution, it transitions into transactional execution. On an RTM abort, the logical processor discards all architectural register and memory updates performed during the RTM execution, restores architectural state, and starts execution beginning at the fallback address computed from the outermost XBEGIN instruction.
  u _xbegin()

Commit Transaction (TSX): _xend
  Specify the end of an RTM code region. If this corresponds to the outermost scope, the logical processor will attempt to commit the logical processor state atomically. If the commit fails, the logical processor will perform an RTM abort.
  v _xend()

Abort Transaction (TSX): _xabort
  Force an RTM abort. The EAX register is updated to reflect that an XABORT instruction caused the abort, and the imm parameter will be provided in bits [31:24] of EAX. Following an RTM abort, the logical processor resumes execution at the fallback address computed through the outermost XBEGIN instruction.
  v _xabort(ii imm)

Specials

Pause (SSE2): pause
  Provide a hint to the processor that the code sequence is a spin-wait loop. This can help improve the performance and power consumption of spin-wait loops.
  v pause()

Monitor Memory (SSE3): monitor
  Arm address monitoring hardware using the address specified in ptr. A store to an address within the specified address range triggers the monitoring hardware. Specify optional extensions in ext and optional hints in hints.
  v monitor(v* ptr, u ext, u hints)

Monitor Wait (SSE3): mwait
  Hint to the processor that it can enter an implementation-dependent optimized state while waiting for an event or a store operation to the address range specified by MONITOR.
  v mwait(u ext, u hints)

Get MXCSR Register (SSE): getcsr
  Get the content of the MXCSR control and status register.
  u getcsr()

Set MXCSR Register (SSE): setcsr
  Set the MXCSR control and status register with a 32bit int.
  v setcsr(u a)

Conversions

Packed Conversion (SSE2): cvt[t]X_Y — epi32, ps/d
  Converts all packed elements from X to Y. If pd is used as source or target type, then 2 elements are converted, otherwise 4. The t version is only available when converting to int and performs a truncation instead of rounding.
  md cvtepi32_pd(mi a)

Convert 16bit Float ↔ 32bit Float (CVT16): cvtX_Y — ph↔ps
  Converts between 4x 16 bit floats and 4x 32 bit floats (or 8 in 256 bit mode). For the 32 to 16 bit conversion, a rounding mode must be specified.
  m cvtph_ps(mi a); mi cvtps_ph(m a, i rounding)

Pack With Saturation (SSE2): pack[u]s — epi16, epi32
  Packs ints from two input registers into one register, halving the bitwidth. Overflows are handled using saturation. The u version saturates to the unsigned integer type.
  mi packs_epi32(mi a, mi b)

Sign Extend (SSE4.1): cvtX_Y — epi8-32 → epi16-64
  Sign extends each element from X to Y. Y must be longer than X.
  mi cvtepi8_epi32(mi a)

Zero Extend (SSE4.1): cvtX_Y — epu8-32 → epi16-64
  Zero extends each element from X to Y. Y must be longer than X.
  mi cvtepu8_epi32(mi a)

128bit Reinterpret Cast (SSE2): castX_Y — si128, ps/d
  Reinterpret casts from X to Y. No operation is generated.
  md castsi128_pd(mi a)

256bit Reinterpret Cast (AVX): castX_Y — ps, pd, si256
  Reinterpret casts from X to Y. No operation is generated.
  m256 castpd_ps(m256d a)

Cast 128 ↔ 256 (AVX): castX_Y — pd128↔pd256, ps128↔ps256, si128↔si256
  Reinterpret casts from X to Y. If the cast is from 128 to 256 bit, the upper bits are undefined! No operation is generated.
  m128d castpd256_pd128(m256d a)

Round Up / Ceiling (SSE4.1): ceil — ps/d, ss/d
  md ceil_pd(md a)

Round Down / Floor (SSE4.1): floor — ps/d, ss/d
  md floor_pd(md a)

Round (SSE4.1): round — ps/d, ss/d
  Rounds according to a rounding parameter.
  md round_pd(md a, i rounding)

Single Element Conversion (SSE2): cvtX_Y — si32-64, ss/d → ss/d
  Converts a single element in the lower bits of an SSE register from X to Y. The remaining bits of the result are copied from operand a.
  md cvtsi64_sd(md a, i64 b)
  dst := double(b[63:0]) | (a[127:64] << 64)

Single Float to Int Conversion (SSE): cvt[t]X_Y — ss/d → si32-64
  Converts a single element from X (SSE float type) to Y. The result is a normal int, not an SSE register. The t version performs a truncation instead of rounding.
  i cvtss_si32(m a)

Single Conversion to 128bit Int (SSE2): cvtX_Y — si32-128
  Converts a single integer from X to Y. Either of the types must be si128. If the new type is longer, the integer is zero extended.
  mi cvtsi32_si128(i a)

Single SSE Float to Normal Float (SSE2): cvtX_Y — ss→f32, sd→f64
  Converts a single SSE float from X (SSE float type) to Y (normal float type).
  f cvtss_f32(m a)

Old Float/Int Conversion (SSE): cvt[t]_X2Y — pi↔ps, si↔ss
  Converts between int and float. The p_ versions convert 2-packs from b and fill the result with the upper bits of operand a. The s_ versions do the same for one element in b. The t version truncates instead of rounding and is available only when converting from float to int.
  m cvt_si2ss(m a, i b)
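A sketch (assuming SSE2) contrasting the rounding conversion with its truncating t version, plus a reinterpret cast; convert_demo and the chosen values are only illustrative:

#include <immintrin.h>

void convert_demo(void) {
    __m128 f = _mm_set_ps(-1.5f, 2.5f, 2.7f, 1.2f);

    __m128i rounded   = _mm_cvtps_epi32(f);    /* rounds according to MXCSR (default: to nearest) */
    __m128i truncated = _mm_cvttps_epi32(f);   /* t version: truncates toward zero */

    __m128i bits = _mm_castps_si128(f);        /* reinterpret cast: no instruction generated */
    (void)rounded; (void)truncated; (void)bits;
}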
Arithmetics

Mul (SSE2): mul — epi32[SSE4.1], epu32, ps/d, ss/d
  Multiplies vertically. The epi32 and epu32 versions multiply only 2 ints instead of 4, producing 64-bit results.
  mi mul_epi32(mi a, mi b)

Mul Low (SSE2): mullo — epi16, epi32[SSE4.1]
  Multiplies vertically and writes the low 16/32 bits into the result.
  mi mullo_epi16(mi a, mi b)

Mul High (SSE2): mulhi — epi16, epu16
  Multiplies vertically and writes the high 16 bits into the result.
  mi mulhi_epi16(mi a, mi b)

Mul High with Round & Scale (SSSE3): mulhrs — epi16
  Treat the sixteen-bit words in registers a and b as signed 15-bit fixed-point numbers between -1 and 1 (e.g., 0x4000 is treated as 0.5 and 0xa000 as -0.75), and multiply them together with correct rounding.
  mi mulhrs_epi16(mi a, mi b)

Carryless Mul (CLMUL, 128): clmulepi64 — si128
  Perform a carry-less multiplication of two 64-bit integers, selected from a and b according to imm, and store the result in dst (the result is a 127-bit int).
  mi clmulepi64_si128(mi a, mi b, ii imm)

Add (SSE2): add — epi8-64, ps[SSE]/d, ss[SSE]/d
  mi add_epi16(mi a, mi b)

Add with Saturation (SSE2): adds — epi8-16, epu8-16
  Adds elements with saturation.
  mi adds_epi16(mi a, mi b)

Subtract (SSE2): sub — epi8-64, ps[SSE]/d, ss[SSE]/d
  mi sub_epi16(mi a, mi b)

Subtract with Saturation (SSE2): subs — epi8-16, epu8-16
  Subtracts elements with saturation.
  mi subs_epi16(mi a, mi b)

Alternating Add and Subtract (SSE3): addsub — ps/d
  Subtracts elements with even index and adds elements with odd index.
  m addsub_ps(m a, m b)

Horizontal Add (SSSE3; ps/d: SSE3): hadd — epi16-32, ps/d
  Adds adjacent pairs of elements.
  mi hadd_epi16(mi a, mi b)

Horizontal Add with Saturation (SSSE3): hadds — epi16
  Adds adjacent pairs of elements with saturation.
  mi hadds_epi16(mi a, mi b)

Horizontal Subtract (SSSE3; ps/d: SSE3): hsub — epi16-32, ps/d
  Subtracts adjacent pairs of elements.
  mi hsub_epi16(mi a, mi b)

Horizontal Subtract with Saturation (SSSE3): hsubs — epi16
  Subtracts adjacent pairs of elements with saturation.
  mi hsubs_epi16(mi a, mi b)

Div (SSE2): div — ps/d, ss/d
  m div_ps(m a, m b)

Square Root (SSE2): sqrt — ps/d, ss/d
  m sqrt_ps(m a)

Approx. Reciprocal (SSE): rcp — ps, ss
  Approximates 1.0/x.
  m rcp_ps(m a)

Approx. Reciprocal Sqrt (SSE): rsqrt — ps, ss
  Approximates 1.0/sqrt(x).
  m rsqrt_ps(m a)

Absolute (SSSE3): abs — epi8-32
  mi abs_epi16(mi a)

Negate or Zero (SSSE3): sign — epi8-32
  For each element pair, set the result element to a if b is positive, to -a if b is negative, or to 0 if b is 0.
  mi sign_epi16(mi a, mi b)

Min (SSE4.1; SSE: ps; SSE2: epu8, epi16, pd): min — epi8-32, epu8-32, ps/d, ss/d
  mi min_epu16(mi a, mi b)

Max (SSE4.1; SSE: ps; SSE2: epu8, epi16, pd): max — epi8-32, epu8-32, ps/d, ss/d
  mi max_epu16(mi a, mi b)

Average (SSE2): avg — epu8-16
  mi avg_epu16(mi a, mi b)

Horizontal Min (SSE4.1, 128): minpos — epu16
  Computes the horizontal min of one input vector of 16bit uints. The min is stored in the lower 16 bits and its index is stored in the following 3 bits.
  mi minpos_epu16(mi a)

Bit Operations

Boolean Logic (SSE2): and, andnot, or, xor — si128, ps[SSE], pd
  andnot computes !a & b.
  mi and_si128(mi a, mi b); mi andnot_si128(mi a, mi b); mi or_si128(mi a, mi b); mi xor_si128(mi a, mi b)

Logic Shift Left/Right (SSE2): sl/rl[i] — epi16-64
  Shifts elements left/right while shifting in zeros. The i version takes an immediate as the count; the version without i takes the lower 64 bits of an SSE register as the count.
  mi slli_epi32(mi a, ii imm); mi sll_epi32(mi a, mi b)

Arithmetic Shift Right (SSE2): sra[i] — epi16-32
  Shifts elements right while shifting in sign bits. The i version takes an immediate as the count; the version without i takes the lower 64 bits of an SSE register as the count.
  mi srai_epi32(mi a, ii imm); mi sra_epi32(mi a, mi b)

Variable Logic Shift (AVX2): sl/rlv — epi32-64
  Shifts elements left/right while shifting in zeros. Each element in a is shifted by the amount specified by the corresponding element in b.
  mi sllv_epi32(mi a, mi b)

Variable Arithmetic Shift (AVX2): srav — epi32
  Shifts elements right while shifting in sign bits. Each element in a is shifted by the amount specified by the corresponding element in b.
  mi srav_epi32(mi a, mi b)

Rotate Left/Right (x86, ≤64): _[l]rot[w]l/r — u16-64
  Rotates bits left/right. The l version is for 64-bit ints and the w version for 16-bit ints.
  u64 _lrotl(u64 a)

Extract Bits (BMI1, ≤64): _bextr — u32-64
  Extracts l bits from a starting at bit s.
  u64 _bextr_u64(u64 a, u32 s, u32 l)
  (a >> s) & ((1 << l) - 1)

Bit Scatter / Deposit (BMI2, ≤64): _pdep — u32-64
  Deposits contiguous low bits from a into dst at the bit locations specified by mask; all other bits in dst are set to zero.
  u64 _pdep_u64(u64 a, u64 mask)

Bit Gather / Extract (BMI2, ≤64): _pext — u32-64
  Extracts bits from a at the bit locations specified by mask into contiguous low bits of dst; the remaining upper bits in dst are set to zero.
  u64 _pext_u64(u64 a, u64 mask)

Zero High Bits (BMI2, ≤64): _bzhi — u32-64
  Zeros all bits in a higher than and including the bit at index i.
  u64 _bzhi_u64(u64 a, u32 i)

Reset Lowest 1-Bit (BMI1, ≤64): _blsr — u32-64
  Returns a with the lowest 1-bit set to 0.
  u64 _blsr_u64(u64 a)
  (a-1) & a

Mask Up To Lowest 1-Bit (BMI1, ≤64): _blsmsk — u32-64
  Returns an int that has all bits set up to and including the lowest 1-bit in a.
  u64 _blsmsk_u64(u64 a)
  (a-1) ^ a

Extract Lowest 1-Bit (BMI1, ≤64): _blsi — u32-64
  Returns an int that has only the lowest 1-bit in a set, or no bit set if a is 0.
  u64 _blsi_u64(u64 a)
  (-a) & a

Count 1-Bits / Popcount (POPCNT, ≤64): popcnt — u32-64
  Counts the number of 1-bits in a.
  u32 popcnt_u64(u64 a)

Count Leading Zeros (LZCNT, ≤64): _lzcnt — u32-64
  u64 _lzcnt_u64(u64 a)

Count Trailing Zeros (BMI1, ≤64): _tzcnt — u32-64
  u64 _tzcnt_u64(u64 a)

Bit Scan Forward/Reverse (x86, ≤64): _bit_scan_forward/reverse — i32
  Returns the index of the lowest/highest bit set in the 32bit int a. Undefined if a is 0.
  i32 _bit_scan_forward(i32 a)

Movemask (SSE2): movemask — epi8, pd, ps[SSE]
  Creates an int bitmask from the most significant bits of each element.
  i movemask_epi8(mi a)

Register I/O (inserting data into a register without loading from memory)

Set (SSE2, AVX): set — epi8-64x, ps[SSE]/d, ss/d, m128/m128d/m128i[AVX]
  Sets and returns an SSE register with input values; e.g., the epi8 version takes 16 input parameters. The first input gets stored in the highest bits. The 128bit epi64 version takes two MMX registers; for 256 bit there is only an epi64x version which takes usual ints.
  mi set_epi32(i a, i b, i c, i d)

Set Reversed (SSE2, AVX): setr — epi8-64x, ps[SSE]/d, m128/m128d/m128i[AVX]
  Like set, but the order is reversed: the first input gets stored in the lowest bits.
  mi setr_epi32(i a, i b, i c, i d)

Replicate (SSE2): set1 — epi8-64x, ps[SSE]/d
  Broadcasts one input element into all slots of an SSE register.
  mi set1_epi32(i a)

Insert (SSE4.1; epi16: SSE2): insert — epi8-64, ps
  Inserts an element at a specified position.
  mi insert_epi16(mi a, i i, ii imm)

Get Undefined Register (AVX): undefined — ps/d, si128-256
  Returns an SSE register with undefined contents.
  mi undefined_si128()

Store Fence (SSE): sfence
  Guarantees that every store instruction that precedes, in program order, is globally visible before any store instruction which follows the fence in program order.
  v sfence()

Load Fence (SSE2): lfence
  Guarantees that every load instruction that precedes, in program order, is globally visible before any load instruction which follows the fence in program order.
  v lfence()

Memory Fence (SSE2): mfence
  Guarantees that every memory access that precedes, in program order, is globally visible before any memory instruction which follows the fence in program order.
  v mfence()

Prefetch (SSE): prefetch
  Fetch the line of data from memory that contains address ptr to a location in the cache hierarchy specified by the locality hint i.
  v prefetch(c* ptr, i i)

Cacheline Flush (SSE2): clflush
  Flushes the cache line that contains ptr from all cache levels.
  v clflush(v* ptr)
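The BMI bit-scatter/gather pair is easiest to see on a small example; a sketch assuming BMI2 is available (compile with something like -mbmi2), with made-up values:

#include <immintrin.h>
#include <assert.h>

int main(void) {
    unsigned a    = 0xABCDu;
    unsigned mask = 0x0F0Fu;                     /* select the low nibble of each byte */

    unsigned packed = _pext_u32(a, mask);        /* gather selected bits into the low bits  -> 0xBD   */
    unsigned spread = _pdep_u32(packed, mask);   /* scatter them back to the mask positions -> 0x0B0D */

    assert(packed == 0xBDu);
    assert(spread == 0x0B0Du);
    return 0;
}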
Load (loading data into an SSE register)

Aligned Load (the memory address must be 16-byte aligned, or 32-byte for 256bit instructions):

Load Aligned (SSE2): load — pd, ps[SSE], si128, si256[AVX]
  Loads 128 bits from memory into a register. The memory address must be aligned!
  md load_pd(d* ptr)

Load Reversed (SSE2): loadr — pd, ps
  Loads 128 bits from memory into a register; the elements are loaded in reversed order. The memory address must be aligned!
  m loadr_ps(f* ptr)

Stream Load (SSE4.1): stream_load — si128, si256[AVX]
  Loads 128 bits from memory into a register. The memory is fetched at once from an USWC device without going through the cache hierarchy. The memory address must be aligned!
  mi stream_load_si128(mi* ptr)

Masked Load (AVX): maskload — ps/d, epi32-64[AVX2]
  Loads an element from memory only if the highest bit of the corresponding mask element is set; otherwise 0 is used for that element.
  mi maskload_epi64(i64* ptr, mi mask)

Unaligned Load (the memory address does not have to be aligned):

Load Unaligned (SSE2): loadu — pd, ps[SSE], si16-si128, si256[AVX]
  Loads 128 bits or less (for the si16-64 versions) from memory into the low bits of a register.
  md loadu_pd(d* ptr)

Fast Unaligned Load (SSE3): lddqu — si128
  Loads 128bit integer data into an SSE register. Is faster than loadu if the value crosses a cache line boundary; should be preferred over loadu.
  mi lddqu_si128(mi* ptr)

Load Single (SSE2): load — ss, sd, epi64
  Loads a single element from memory and zeros the remaining bytes. For epi64, the command is loadl_epi64!
  md load_sd(d* ptr)

Load High/Low (SSE2): loadh/l — pd, pi
  Loads a 64-bit float from memory into the high/low half of the register and fills the other half from the input register.
  md loadl_pd(md a, d* ptr)

Broadcast Load (SSE2): load1 — pd, ps
  Loads a float from memory into all slots of the 128-bit register. For pd, there is also loaddup in SSE3, which may perform faster than load1.
  md load1_pd(d* ptr); md loaddup_pd(d* ptr)

Broadcast (AVX): broadcast — ss/d, ps/d
  Broadcasts one element (ss/d) or 128 bits (ps/d) from memory into all slots of an SSE register. All suffixes but ss are only available in 256 bit mode.
  m broadcast_ss(f* a)

128bit Pseudo Gather (AVX, SEQ!, 256): loadu2 — m128, m128d, m128i
  Loads two 128bit elements from two memory locations which do not need to be aligned.
  m loadu2_m128(f* p1, f* p2)

Gather (AVX2): i32/i64gather — epi32-64, ps/d
  Gathers elements from memory using 32-bit/64-bit indices: the elements are loaded from addresses starting at ptr and offset by each 32-bit/64-bit element in a, with each index scaled by the factor s (s should be 1, 2, 4, or 8). The number of gathered elements is limited by the minimum of the type to load and the type used for the offset. The 128bit versions zero the upper part of the register. Memory does not need to be aligned.
  mi i64gather_epi32(i* ptr, mi a, i s)

Masked Gather (AVX2): mask_i32/i64gather — epi32-64, ps/d
  Same as gather, but takes an additional mask and src register. Each element is only gathered if the highest corresponding bit in the mask is set; otherwise it is copied from src. Memory does not need to be aligned.
  mi mask_i64gather_epi32(mi src, i* ptr, mi a, mi mask, i32 s)

Byte Manipulation (change the byte order using a control mask, duplicate or zero bytes selectively, or mix the bytes of two registers)

Byte Shuffle (SSSE3): shuffle — epi8
  Shuffles the packed 8-bit integers in a according to the shuffle control mask in the corresponding 8-bit element of b.
  mi shuffle_epi8(mi a, mi b)

32-bit Int Shuffle (SSE2): shuffle — epi32
  Shuffles the 32-bit elements of the register using an immediate control mask.
  mi shuffle_epi32(mi a, ii imm)

16-bit High/Low Shuffle (SSE2): shufflehi/lo — epi16
  Shuffles the high/low half of the register using an immediate control mask; the rest of the register is copied from the input.
  mi shufflehi_epi16(mi a, ii imm)

Float Shuffle (SSE2; ps: SSE): shuffle — ps, pd
  Shuffles floats from two registers: the lower part of the result receives values from the first register, the upper part from the second register. The shuffle mask is an immediate.
  md shuffle_pd(md a, md b, ii imm)

Concatenate and Byteshift / Align (SSSE3): alignr — epi8
  ((a << 128) | b) >> c*8
  mi alignr_epi8(mi a, mi b, i c)

Move Element with Fill (SSE2; ss: SSE): move — ss, sd
  Copies the lowest element from b and the remaining elements from a.
  m move_ss(m a, m b)

Move High↔Low (SSE): movelh/hl — ps
  The lh version moves the lower half of b into the upper half of the result; the rest is filled from a. The hl version moves the upper half to the lower half.
  m movehl_ps(m a, m b)

Interleave / Unpack (SSE2): unpackhi/lo — epi8-64, ps[SSE], pd
  Interleaves elements from the high/low half of two input registers into a target register.
  mi unpackhi_epi16(mi a, mi b)

Blend (SSE4.1): blend[v] — epi8-16, epi32[AVX2], ps/d
  Mixes the elements of two registers. The v version uses a register mask (the highest bit of each element selects); blend uses an immediate mask.
  mi blend_epi16(mi a, mi b, ii imm)

256bit Insert (AVX): insertf128 — si256, ps/d
  Inserts 128 bits at a position specified by imm. For AVX2, there is only si256 and the operation is named inserti128.
  m insertf128_ps(m a, m128 b, ii i)

128-bit Dual Register Shuffle (AVX, 256): permute2f128 — ps/d, si256
  Takes 2 registers and shuffles 128 bit chunks to a target register. Each chunk can also be cleared by a bit in the mask. For AVX2, there is only the si256 version, which is renamed to permute2x128.
  md permute2f128_pd(md a, md b, ii imm)

Float Permute (AVX): permute[var] — ps/d
  The 128-bit version is the same as the float shuffle; the 256-bit version performs the same operation on two 128-bit lanes. The normal version uses an immediate; the var version takes a register b (the lowest bits in each element of b are used as the mask for that element).
  m permutevar_ps(m a, mi b)

4x64bit Shuffle (AVX2, 256): permute4x64 — epi64, pd
  Same as the float shuffle, but with 4x64bit elements and performed over the full 256 bits instead of two lanes of 128 bits. No var version is available.
  md permute4x64_pd(md a, ii imm)

8x32bit Shuffle (AVX2, 256): permutevar8x32 — epi32, ps
  Same as the 4x64bit shuffle, but with 8x32bit elements and only a var version taking a register instead of an immediate.
  m permutevar8x32_ps(m a, mi b)
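A small sketch (assuming SSE2) of an immediate shuffle: reversing the four 32-bit lanes with a mask built by the _MM_SHUFFLE macro; reverse_lanes is only an illustrative name:

#include <immintrin.h>

__m128i reverse_lanes(__m128i a) {
    /* _MM_SHUFFLE(3,2,1,0) would be the identity; (0,1,2,3) reverses the lanes. */
    return _mm_shuffle_epi32(a, _MM_SHUFFLE(0, 1, 2, 3));
}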
Composite Arithmetics (perform more than one operation at once)

Dot Product (SSE4.1): dp — ps/d
  Computes the dot product of two vectors. A mask is used to state which components are to be multiplied and stored.
  m dp_ps(m a, m b, ii imm)

Byte Multiply and Saturated Horizontal Add (SSSE3): maddubs — epi16
  Multiplies vertical 8 bit ints producing 16 bit ints and adds horizontal pairs with saturation, producing 8x16 bit ints. The first input is treated as unsigned and the second as signed; results and intermediaries are signed.
  mi maddubs_epi16(mi a, mi b)

Multiply and Horizontal Add (SSE2): madd — epi16
  Multiplies vertical 16 bit ints producing 32 bit ints and adds horizontal pairs, producing 4x32 bit ints.
  mi madd_epi16(mi a, mi b)

Sum of Absolute Differences (SSE2): sad — epu8
  Computes the absolute differences of packed unsigned 8-bit integers in a and b, then horizontally sums each consecutive 8 differences to produce two unsigned 16-bit integers, stored in the low 16 bits of the 64-bit elements of dst.
  mi sad_epu8(mi a, mi b)

Multiple Sum of Absolute Differences (SSE4.1): mpsadbw — epu8
  Computes the sum of absolute differences (SADs) of quadruplets of unsigned 8-bit integers in a compared to those in b, and stores the 16-bit results in dst. Eight SADs are performed using one quadruplet from b and eight quadruplets from a. The quadruplet from b is selected by the offset specified in imm; the eight quadruplets from a are formed from sequential 8-bit integers starting at the offset specified in imm.
  mi mpsadbw_epu8(mi a, mi b, ii imm)

Store (storing data from an SSE register into memory or registers)

Aligned Store (the memory address must be 16-byte aligned, or 32-byte for 256bit instructions):

Store (SSE2): store — pd, ps[SSE], si128, si256[AVX]
  Stores 128 bits into memory. The memory location must be aligned!
  v store_pd(d* ptr, md a)

Store Reversed (SSE2): storer — pd, ps[SSE]
  Stores the elements from the register to memory in reverse order. The memory location must be aligned!
  v storer_pd(d* ptr, md a)

Broadcast Store (SSE2): store1 — pd, ps[SSE]
  Stores a float to memory, replicating it two (pd) or four (ps) times. The memory location must be aligned!
  v store1_pd(d* ptr, md a)

Aligned Stream Store (SSE2): stream — si32-128, pd, ps[SSE], si256[AVX]
  Stores 128 bits (or 32-64 bits for the si32-64 versions) into memory using a non-temporal hint to minimize cache pollution. The memory location must be aligned!
  v stream_si128(mi* ptr, mi a); v stream_si32(i* ptr, i a)

Masked Store (AVX): maskstore — ps/d, epi32-64[AVX2]
  Stores an element from a into memory at p if the corresponding mask element in m has its highest bit set. The memory location must be aligned!
  v maskstore_ps(f* p, mi m, m a)

Extract (SSE4.1; epi16: SSE2): extract — epi8-64, ps
  Extracts one element and returns it as a normal int (no SSE register!).
  i extract_epi16(mi a, ii imm)
  (a >> (imm[2:0]*16))[15:0]

256bit Extract (AVX): extractf128 — si256, ps/d
  Extracts 128 bits and returns it as a 128bit SSE register. For AVX2, there is only si256 and the operation is named extracti128.
  m128 extractf128_ps(m256 a, ii i)
  (a >> (i * 128))[127:0]
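A sketch of a streaming (non-temporal) store followed by the store fence that orders it; the buffer, the name fill_ones, and the assumption that n is a multiple of 4 floats are all illustrative:

#include <immintrin.h>
#include <stdalign.h>

/* Writes 1.0f to n floats (n a multiple of 4) without polluting the cache. */
void fill_ones(float *dst_aligned, int n) {
    __m128 ones = _mm_set1_ps(1.0f);
    for (int i = 0; i < n; i += 4)
        _mm_stream_ps(dst_aligned + i, ones);   /* requires 16-byte alignment */
    _mm_sfence();                               /* make the streaming stores globally visible */
}

alignas(16) float buffer[64];
/* usage: fill_ones(buffer, 64); */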

Unaligned Store (the memory address does not have to be aligned):

Store Unaligned (SSE2): storeu — pd, ps[SSE], si16-128, si256[AVX]
  Stores 128 bits (or less for the si16-64 versions) into memory. Memory does not need to be aligned.
  v storeu_pd(d* ptr, md a)

Store High (SSE2): storeh — pd
  Stores the high 64 bits into memory. Memory does not need to be aligned.
  v storeh_pd(d* ptr, md a)

Single Element Store (SSE2; ss: SSE): store[l] — ss, sd, epi64
  Stores the low element into memory. Memory does not need to be aligned. The l version must be used for epi64 and can be used for pd.
  v store_sd(d* ptr, md a)

Masked Byte Store (SSE2): maskmoveu — si128
  Stores bytes into memory if the corresponding byte in mask has its highest bit set. Memory does not need to be aligned.
  v maskmoveu_si128(mi a, mi mask, c* ptr)

128bit Pseudo Scatter (AVX, SEQ!, 256): storeu2 — m128, m128d, m128i
  Stores two 128bit elements into two memory locations which do not need to be aligned.
  v storeu2_m128(f* p1, f* p2, m a)

Register Zeroing and Broadcast

Zero Register (SSE2): setzero — ps[SSE]/d, si128
  Returns a register with all bits zeroed.
  mi setzero_si128()

Zero All Registers (AVX): zeroall
  Zeros all SSE/AVX registers.
  v zeroall()

Zero High of All Registers (AVX): zeroupper
  Zeros all bits in [MAX:256] of all SSE/AVX registers.
  v zeroupper()

Zero High Half (SSE2): move — epi64
  Moves the lower half of the input into the result and zeros the remaining bytes.
  mi move_epi64(mi a)

64-bit Broadcast (SSE3): movedup — pd
  Duplicates the lower half of a.
  md movedup_pd(md a)

Register Broadcast (AVX2): broadcastX — epi8-64, ps/d, si256
  Broadcasts the lowest element into all slots of the result register. The si256 version broadcasts one 128bit element. The letter X must be: epi8: b, epi16: w, epi32: d, epi64: q, ps: s, pd: d.
  mi broadcastb_epi8(mi a)

Byteshift Left/Right (SSE2): [b]sl/rli — si128
  Shifts the input register left/right by imm bytes while shifting in zeros. There is no difference between the b and the non-b version.
  mi bsrli_si128(mi a, ii imm)

Fused Multiply and Add

FM-Add (FMA): f[n]madd — ps/d, ss/d
  Computes (a*b)+c for each element. The n version computes -(a*b)+c.
  m fmadd_ps(m a, m b, m c)

FM-Sub (FMA): f[n]msub — ps/d, ss/d
  Computes (a*b)-c for each element. The n version computes -(a*b)-c.
  m fmsub_ps(m a, m b, m c)

FM-AddSub (FMA): fmaddsub — ps/d
  Computes (a*b)-c for elements with even index and (a*b)+c for elements with odd index.
  m fmaddsub_ps(m a, m b, m c)

FM-SubAdd (FMA): fmsubadd — ps/d
  Computes (a*b)+c for elements with even index and (a*b)-c for elements with odd index.
  m fmsubadd_ps(m a, m b, m c)
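A one-line sketch of the fused multiply-add (assuming the FMA instruction set, e.g. compiled with -mfma; axpy8 is only an illustrative name):

#include <immintrin.h>

/* dst = a*x + y, fused into one rounded operation per element. */
__m256 axpy8(__m256 a, __m256 x, __m256 y) {
    return _mm256_fmadd_ps(a, x, y);   /* fnmadd_ps would compute -(a*x)+y */
}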
Duplicate Even/Odd Elements (SSE3): movel/hdup — ps
  Duplicates 32-bit elements into the neighboring slot: the l version duplicates bits [31:0] (the even elements), the h version duplicates bits [63:32] (the odd elements).
  m moveldup_ps(m a)

Byte Swap (x86, ≤64): _bswap[64] — i32, i64
  Swaps the bytes in the 32-bit int or 64-bit int.
  i64 _bswap64(i64 a)

Version 1.0