
Intrinsics Cheat Sheet
Jan Finis ([email protected])

Bit Operations

Boolean Logic
- Bool AND [SSE2] and: si128, ps[SSE], pd
  mi and_si128(mi a, mi b)
- Bool AND NOT [SSE2] andnot: si128, ps[SSE], pd. Computes !a & b.
  mi andnot_si128(mi a, mi b)
- Bool OR [SSE2] or: si128, ps[SSE], pd
  mi or_si128(mi a, mi b)
- Bool XOR [SSE2] xor: si128, ps[SSE], pd
  mi xor_si128(mi a, mi b)

Shifting & Rotation
- Arithmetic Shift Right [SSE2] sra[i]: epi16-32. Shifts elements right while shifting in sign bits. The i version takes an immediate as count; the version without i takes the count from the lower 64 bits of an SSE register.
  mi srai_epi32(mi a, ii32 i); mi sra_epi32(mi a, mi b)
- Logic Shift Left/Right [SSE2] sl/rl[i]: epi16-64. Shifts elements left/right while shifting in zeros. The i version takes an immediate as count; the version without i takes the count from the lower 64 bits of an SSE register.
  mi slli_epi32(mi a, ii i); mi sll_epi32(mi a, mi b)
- Rotate Left/Right [x86] _[l]rot[w]l/r (≤64). Rotates a left/right by a number of bits specified in i. The l version is for 64-bit ints and the w version for 16-bit ints.
  u64 _lrotl(u64 a, i i)
- Variable Logic Shift [AVX2] sl/rlv: epi32-64. Shifts elements left/right while shifting in zeros. Each element in a is shifted by an amount specified by the corresponding element in b.
  mi sllv_epi32(mi a, mi b)
- Variable Arithmetic Shift [AVX2] srav: epi32. Shifts elements right while shifting in sign bits. Each element in a is shifted by an amount specified by the corresponding element in b.
  mi srav_epi32(mi a, mi b)

Selective Bit Moving (change the position of selected bits, zeroing out remaining ones)
- Bit Scatter (Deposit) [BMI2] _pdep: u32-64 (≤64). Deposits contiguous low bits from a to dst at the locations specified by mask m; all other bits in dst are set to zero. See the worked sketch after this section.
  u64 _pdep_u64(u64 a, u64 m)
- Bit Gather (Extract) [BMI2] _pext: u32-64 (≤64). Extracts bits from a at the locations specified by mask m to contiguous low bits in dst; the remaining upper bits in dst are set to zero.
  u64 _pext_u64(u64 a, u64 m)
- Movemask [SSE2] movemask: epi8, pd, ps[SSE]. Creates a bitmask from the most significant bit of each element.
  i movemask_epi8(mi a)
- Extract Bits [BMI1] _bextr: u32-64 (≤64). Extracts l bits from a starting at bit s.
  u64 _bextr_u64(u64 a, u32 s, u32 l)  // (a >> s) & ((1 << l) - 1)

Bit Masking (≤64)
- Zero High Bits [BMI2] _bzhi: u32-64. Zeros all bits in a higher than and including the bit at index i.
  u64 _bzhi_u64(u64 a, u32 i)  // dst := a; IF (i[7:0] < 64) dst[63:i] := 0
- Reset Lowest 1-Bit [BMI1] _blsr: u32-64. Returns a but with the lowest 1-bit set to 0.
  u64 _blsr_u64(u64 a)  // (a-1) & a
- Mask Up To Lowest 1-Bit [BMI1] _blsmsk: u32-64. Returns an int that has all bits set up to and including the lowest 1-bit in a, or no bit set if a is 0.
  u64 _blsmsk_u64(u64 a)  // (a-1) ^ a
- Isolate Lowest 1-Bit [BMI1] _blsi: u32-64. Returns an int that has only the lowest 1-bit in a set, or no bit set if a is 0.
  u64 _blsi_u64(u64 a)  // (-a) & a

Bit Counting (≤64)
- Count 1-Bits (Popcount) [POPCNT] popcnt: u32-64. Counts the number of 1-bits in a.
  u32 popcnt_u64(u64 a)
- Count Leading Zeros [LZCNT] _lzcnt: u32-64. Counts the number of leading zeros in a.
  u64 _lzcnt_u64(u64 a)
- Count Trailing Zeros [BMI1] _tzcnt: u32-64. Counts the number of trailing zeros in a.
  u64 _tzcnt_u64(u64 a)
- Bit Scan Forward/Reverse [x86] _bit_scan_forward/reverse: (i32), ≤64. Returns the index of the lowest/highest bit set in the 32-bit int a. Undefined if a is 0.
  i32 _bit_scan_forward(i32 a)

Conversions

Reinterpret Casts
- 128-bit Cast [SSE2] castX_Y: si128, ps[SSE], pd. Reinterpret casts from X to Y. No operation is generated.
  md castsi128_pd(mi a)
- 256-bit Cast [AVX] castX_Y: ps, pd, si256 (256). Reinterpret casts from X to Y. No operation is generated.
  m256 castpd_ps(m256d a)
- Cast 128 ↔ 256 [AVX] castX_Y: pd128↔pd256, ps128↔ps256, si128↔si256 (256). Reinterpret casts from X to Y. If the cast is from 128 to 256 bit, the upper bits are undefined! No operation is generated.
  m128d castpd256_pd128(m256d a)

Packed Conversions (convert all elements in a packed SSE register)
- Convert [SSE2] cvt[t]X_Y: epi32, ps/d. Converts packed elements from X to Y. If pd is used as source or target type, then 2 elements are converted, otherwise 4. The t version is available only when converting to int and performs a truncation instead of rounding.
  md cvtepi32_pd(mi a)
- Pack With Saturation [SSE2] pack[u]s: epi16, epi32. Packs ints from two input registers into one register, halving the element bitwidth. Overflows are handled using saturation; the u version saturates to the unsigned type.
  mi packs_epi32(mi a, mi b)
- Sign Extend [SSE4.1] cvtX_Y: epi8-32 → epi16-64. Sign extends each element from X to Y. Y must be longer than X.
  mi cvtepi8_epi32(mi a)
- Zero Extend [SSE4.1] cvtX_Y: epu8-32 → epi16-64. Zero extends each element from X to Y. Y must be longer than X.
  mi cvtepu8_epi32(mi a)
- Float 16 bit ↔ 32 bit [CVT16] cvtX_Y: ph ↔ ps. Converts between 4x 16-bit floats and 4x 32-bit floats (8 in 256-bit mode). For the 32-to-16-bit conversion, a rounding mode r must be specified.
  m cvtph_ps(mi a); mi cvtps_ph(m a, i r)

Rounding (see also: conversion to si32, which performs rounding implicitly)
- Round Up (Ceiling) [SSE4.1] ceil: ps/d, ss/d
  md ceil_pd(md a)
- Round Down (Floor) [SSE4.1] floor: ps/d, ss/d
  md floor_pd(md a)
- Round [SSE4.1] round: ps/d, ss/d. Rounds according to a rounding parameter r.
  md round_pd(md a, i r)

Single Element Conversions (convert a single element in the lower part of an SSE register)
- Single Conversion [SSE2] cvtX_Y: si32-64, ss/d → ss/d. Converts a single element from X to Y; the remaining bits of the result are copied from a.
  md cvtsi64_sd(md a, i64 b)  // double(b[63:0]) | (a[127:64] << 64)
- Single SSE Float to Int [SSE2] cvt[t]X_Y: ss/d → si32-64. Converts a single SSE float from X to Y; the result is a normal int, not an SSE register! The t version truncates instead of rounding.
  i cvtss_si32(m a)
- Int Conversion [SSE2] cvtX_Y: si32-128. Converts a single integer from X to Y. Either of the types must be si128. If the new type is longer, the integer is zero extended.
  mi cvtsi32_si128(i a)
- SSE Float to Normal Float [SSE2] cvtX_Y: ss → f32, sd → f64. Converts a single float from X (SSE float type) to Y (normal float type).
  f cvtss_f32(m a)
- Old Float/Int Conversion [SSE] cvt[t]_X2Y: pi ↔ ps, si ↔ ss. Converts between int and float. The p_ versions convert 2-packs from b and fill the result with the upper bits of operand a. The s_ versions do the same for one element in b. The t version performs a truncation instead of rounding and is available only when converting from float to int.
  m cvt_si2ss(m a, i b)
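As a worked example for the Selective Bit Moving group above, the following minimal sketch round-trips the low nibble of every byte through _pext_u64/_pdep_u64 (the variable names are illustrative; assumes a BMI2-capable CPU, compiled with -mbmi2):

    #include <immintrin.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint64_t m = 0x0F0F0F0F0F0F0F0FULL;                    /* mask: where the interesting bits live */
        uint64_t packed = _pext_u64(0x1234567812345678ULL, m); /* gather masked bits into the low end */
        uint64_t spread = _pdep_u64(packed, m);                /* scatter them back to the mask positions */
        printf("%016llx -> %016llx\n",
               (unsigned long long)packed, (unsigned long long)spread);
        return 0;
    }

_pdep_u64 is the inverse of _pext_u64 under the same mask, which is why the second value reproduces the masked bits of the original input.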

Introduction (Version 2.1f)

This cheat sheet displays most x86 intrinsics supported by Intel processors. The following intrinsics were omitted:
· obsolete or discontinued instruction sets like MMX and 3DNow!
· AVX-512, as it will not be available for some time and would blow up the space needed due to the vast amount of new instructions
· intrinsics not supported by Intel but only by AMD, like parts of the XOP instruction set (maybe they will be added in the future)
· intrinsics that are only useful for operating systems, like the _xsave intrinsic to save the CPU state
· the RdRand intrinsics, as it is unclear whether they provide real random numbers without enabling kleptographic loopholes

Each family of intrinsics is depicted by a box as described in the Overview below; the intrinsics are grouped as meaningfully as possible. Most information is taken from the Intel Intrinsics Guide (http://software.intel.com/en-us/articles/intel-intrinsics-guide). Let me know ([email protected]) if you find any wrong or unclear content.

When not stated otherwise, it can be assumed that each vector intrinsic performs its operation vertically on all elements which are packed into the input SSE registers. E.g., the add instruction has no description; thus, it can be assumed that it performs a vertical add of the elements of the two input registers, i.e., the first element from register a is added to the first element in register b, the second to the second, and so on. In contrast, a horizontal add would add the first element of a to the second element of a, the third to the fourth, and so on.

To use the intrinsics, include the x86intrin.h header and make sure that you set the target architecture to one that supports the intrinsics you want to use (using the -march=X flag).
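As a minimal concrete example of a vertical operation (the program is illustrative; it assumes SSE2, which every x86-64 compiler enables by default):

    #include <emmintrin.h>   /* SSE2 */
    #include <stdio.h>

    int main(void) {
        __m128i a = _mm_set_epi32(4, 3, 2, 1);   /* set stores its first argument in the highest lane */
        __m128i b = _mm_set_epi32(40, 30, 20, 10);
        __m128i c = _mm_add_epi32(a, b);         /* vertical add: lane i of a plus lane i of b */
        int out[4];
        _mm_storeu_si128((__m128i *)out, c);
        printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);  /* prints: 11 22 33 44 */
        return 0;
    }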
Overview: Anatomy of a Box

Each family of intrinsics is depicted by a box with the following fields. The Float Compare box, [SSE2] cmp[n]Z: ps/d, ss/d with signature md cmpeq_pd(md a, md b), serves as the running example.

- Name: Human-readable name of the operation.
- Instruction Set: Specifies the instruction set which is necessary for this operation. If more than one instruction set is given, then different flavors of this operation require different instruction sets.
- Available Bitwidth: If no bitwidth is specified, the operation is available for 128-bit and 256-bit SSE registers. Use the _mm_ prefix for the 128-bit and the _mm256_ prefix for the 256-bit flavors. Otherwise, the following restrictions apply:
  256: only available for 256-bit SSE registers (always use the _mm256_ prefix)
  128: only available for 128-bit SSE registers (always use the _mm_ prefix)
  ≤64: does not operate on SSE registers but on usual 64-bit registers (always use the _mm_ prefix)
- Sequence Indicator (SEQ!): If this indicator is present, the intrinsic will generate a sequence of assembly instructions instead of only one. Thus, it may not be as efficient as anticipated.
- Name(s) of the intrinsic: The names of the various flavors of the intrinsic. To assemble the final name to be used, one must add a prefix and a suffix; the suffix determines the data type (see next field). Concerning the prefix, all intrinsics, except the ones which are prefixed with a green underscore _, need to be prefixed with _mm_ for 128-bit versions or _mm256_ for 256-bit versions. Blue letters in brackets like [n] indicate that adding this letter leads to another flavor of the function. Blue letters separated with a slash like l/r indicate that either letter can be used and leads to a different flavor. A red letter like Z indicates that various different strings can be inserted here, which are stated in the notes section.
- List of available data type suffixes: The suffix chosen determines the data type on which the intrinsic operates; consult the suffix table for further information. It must be added to the intrinsic name, separated by an underscore, so a possible name for the data type pd in the example would be _mm_cmpeq_pd. If a suffix is followed by an instruction set in brackets, then this instruction set is required for that suffix; all suffixes without an explicit instruction set are available in the instruction set specified at the left of the box. If no type suffixes are shown, the method is type independent and must be used without a suffix. If the suffixes are in parentheses, they must not be appended and are encoded into the name of the intrinsic in another way (see the notes of the box for further information).
- Signature and Pseudocode: Shows one possible signature of the intrinsic in order to depict the parameter types and the return type. Note that only one flavor and data type suffix is shown; the signature has to be adjusted for other suffixes and flavors. The data types are displayed in shortened form; consult the data types table for more information. In addition to the signature, the pseudocode for some intrinsics is shown; the special variable dst depicts the destination register of the intrinsic, which will be returned.
- Notes: Explains the semantics of the intrinsic, the various flavors, and other important information.
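For instance, assembling names according to the scheme above (a hedged sketch; the wrapper function names are illustrative, and AVX is assumed to be enabled with -mavx):

    #include <immintrin.h>

    __m128  add4(__m128 a, __m128 b)   { return _mm_add_ps(a, b); }     /* _mm_ prefix + ps suffix */
    __m256  add8(__m256 a, __m256 b)   { return _mm256_add_ps(a, b); }  /* _mm256_ prefix (AVX)    */
    __m128d neq2(__m128d a, __m128d b) { return _mm_cmpneq_pd(a, b); }  /* cmp[n]Z with n and Z=eq */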
Instruction Sets
Each intrinsic is only available on machines which support the corresponding instruction set. This list depicts the instruction sets and the first Intel and AMD CPUs that supported them.

I.Set | Intel (Year) | AMD (Year) | Description
x86 | all | all | x86 base instructions
SSE | Pentium III ('99) | K7 Palomino ('01) | Streaming SIMD Extensions
SSE2 | Pentium 4 ('01) | K8 ('03) | Streaming SIMD Extensions
SSE3 | Prescott ('04) | K9 ('05) | Streaming SIMD Extensions
SSSE3 | Woodcrest ('06) | Bulldozer ('11) | Streaming SIMD Extensions
SSE4.1 | Penryn ('07) | Bulldozer ('11) | Streaming SIMD Extensions
SSE4.2 | Nehalem ('08) | Bulldozer ('11) | Streaming SIMD Extensions
AVX | Sandy Bridge ('11) | Bulldozer ('11) | Advanced Vector Extensions
AVX2 | Haswell ('13) | - | Advanced Vector Extensions
FMA | Haswell ('13) | Bulldozer ('11) | Fused Multiply and Add
AES | Westmere ('10) | Bulldozer ('11) | Advanced Encryption Standard
CLMUL | Westmere ('10) | Bulldozer ('11) | Carryless Multiplication
CVT16 | Ivy Bridge ('12) | Bulldozer ('11) | 16-bit Floats
TSX | Haswell ('13) | - | Transactional Sync. Extensions
POPCNT | Nehalem ('08) | K10 ('07) | Population Count
LZCNT | Haswell ('13) | K10 ('07) | Leading Zero Count
BMI1 | Sandy Bridge ('11) | Bulldozer ('11) | Bit Manipulation Instructions
BMI2 | Haswell ('13) | - | Bit Manipulation Instructions

Data Type Suffixes
Most intrinsics are available for various suffixes which depict different data types. The table depicts the suffixes used and the corresponding SSE register types (see Data Types for descriptions).

Suffix | Type | Description
ph | mi | Packed half float (16-bit float)
ps/pd | m/md | Packed single float / packed double float
epiX | mi | Packed X-bit signed integer
epuX | mi | Packed X-bit unsigned integer
siX | mi | Single X-bit signed integer. If there is an si128 version, the same function for 256 bits is usually suffixed with si256.
suX | mi | Single X-bit unsigned integer in an SSE register
uX | - | Single X-bit unsigned integer (not in an SSE register)
ss/sd | m/md | Single single float / single double float; the remaining bits are copied. Operations involving these types often have different signatures: an extra input register s is used, and all bits which are not used by the single element are copied from this register. The signatures are not depicted here; see the manual for exact signatures.
m128/m128d/m128i | m128/m128d/m128i | AVX sometimes uses this suffix for 128-bit float/double/int operations.

Data Types
The following data types are used in the signatures of the intrinsics. Note that most types depend on the used type suffix; only one example suffix is shown in each signature.

Type | Description
i | Signed 32-bit integer (int)
iX | Signed X-bit integer (intX_t)
uX | Unsigned X-bit integer (uintX_t)
ii | Immediate signed 32-bit integer: parameters of this type must be a compile-time constant
f | 32-bit float (float)
d | 64-bit float (double)
mi | Integer SSE register, i.e., __m128i or __m256i, depending on the bitwidth used
m | 32-bit float SSE register, i.e., __m128 or __m256, depending on the bitwidth used
md | 64-bit float (double) SSE register, i.e., __m128d or __m256d, depending on the bitwidth used
m128/m128d/m128i | The 128-bit register types, used explicitly in some AVX signatures
v | void
X* | Pointer to X (X*)
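When a binary may run on machines that lack some of these instruction sets, availability must be checked at runtime. A hedged sketch using the GCC/Clang builtin __builtin_cpu_supports (a compiler builtin, not an intrinsic from these headers):

    #include <stdio.h>

    int main(void) {
        if (__builtin_cpu_supports("avx2"))
            puts("AVX2 path");
        else if (__builtin_cpu_supports("sse4.1"))
            puts("SSE4.1 path");
        else
            puts("scalar fallback");
        return 0;
    }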
Arithmetics

Basic Arithmetics
- Add [SSE2] add: epi8-64, ps[SSE]/d, ss[SSE]/d
  mi add_epi16(mi a, mi b)
- Add with Saturation [SSE2] adds: epi8-16, epu8-16. Adds elements with saturation.
  mi adds_epi16(mi a, mi b)
- Horizontal Add [SSSE3] hadd: epi16-32, ps/d. Adds adjacent pairs of elements.
  mi hadd_epi16(mi a, mi b)
- Horizontal Add with Saturation [SSSE3] hadds: epi16. Adds adjacent pairs of elements with saturation.
  mi hadds_epi16(mi a, mi b)
- Subtract [SSE2] sub: epi8-64, ps[SSE]/d, ss[SSE]/d
  mi sub_epi16(mi a, mi b)
- Subtract with Saturation [SSE2] subs: epi8-16, epu8-16. Subtracts elements with saturation.
  mi subs_epi16(mi a, mi b)
- Horizontal Subtract [SSSE3] hsub: epi16-32, ps/d. Subtracts adjacent pairs of elements.
  mi hsub_epi16(mi a, mi b)
- Horizontal Subtract with Saturation [SSSE3] hsubs: epi16
  mi hsubs_epi16(mi a, mi b)
- Alternating Add and Subtract [SSE3] addsub: ps/d
  m addsub_ps(m a, m b)
  // FOR j := 0 to 3: i := j*32
  //   IF (j is even) dst[i+31:i] := a[i+31:i] - b[i+31:i]
  //   ELSE           dst[i+31:i] := a[i+31:i] + b[i+31:i]

Multiplication
- Mul [SSE2] mul: epi32[SSE4.1], epu32, ps[SSE]/d, ss[SSE]/d. Multiplies vertically. NOTE: the epi32 and epu32 versions multiply only 2 ints instead of 4!
  mi mul_epi32(mi a, mi b)
- Mul Low [SSE2] mullo: epi16, epi32[SSE4.1]. Multiplies vertically and writes the low 16/32 bits into the result. See the sketch after this section.
  mi mullo_epi16(mi a, mi b)
- Mul High [SSE2] mulhi: epi16, epu16. Multiplies vertically and writes the high 16 bits into the result.
  mi mulhi_epi16(mi a, mi b)
- Mul High with Round & Scale [SSSE3] mulhrs: epi16. Treats the 16-bit words in registers a and b as signed 15-bit fixed-point numbers between −1 and 1 (e.g. 0x4000 is treated as 0.5 and 0xa000 as −0.75) and multiplies them together.
  mi mulhrs_epi16(mi a, mi b)
- Carryless Mul [CLMUL] clmulepi64: si128 (128). Performs a carry-less multiplication of two 64-bit ints, selected from a and b according to imm (the result is a 127-bit int).
  mi clmulepi64_si128(mi a, mi b, ii imm)

Div/Sqrt/Reciprocal
- Div [SSE2] div: ps/d, ss/d
  m div_ps(m a, m b)
- Square Root [SSE2] sqrt: ps/d, ss/d
  m sqrt_ps(m a)
- Approx. Reciprocal [SSE] rcp: ps, ss. Approximates 1.0/x.
  m rcp_ps(m a)
- Approx. Reciprocal Sqrt [SSE] rsqrt: ps, ss. Approximates 1.0/sqrt(x).
  m rsqrt_ps(m a)
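Mul Low and Mul High are commonly paired with Interleave (unpack, see Byte Manipulation) to obtain full-width products. A hedged sketch assuming SSE2 (mul16_widen is an illustrative name, not from the sheet):

    #include <emmintrin.h>   /* SSE2 */

    /* Widen the products of eight signed 16-bit lanes into eight 32-bit lanes. */
    static void mul16_widen(__m128i a, __m128i b, __m128i *lo4, __m128i *hi4) {
        __m128i lo = _mm_mullo_epi16(a, b);   /* low 16 bits of each 32-bit product      */
        __m128i hi = _mm_mulhi_epi16(a, b);   /* high 16 bits (signed) of each product   */
        *lo4 = _mm_unpacklo_epi16(lo, hi);    /* interleave: full products of lanes 0..3 */
        *hi4 = _mm_unpackhi_epi16(lo, hi);    /* full products of lanes 4..7             */
    }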
Sign Modification
- Absolute [SSSE3] abs: epi8-32
  mi abs_epi16(mi a)
- Negate or Zero [SSSE3] sign: epi8-32. For each element pair in a and b, sets the result element to a if b is positive, to -a if b is negative, and to 0 if b is 0.
  mi sign_epi16(mi a, mi b)

Min/Max/Avg
- Min [SSE] min: SSE: ps, ss; SSE2: epu8, epi16, pd, sd; SSE4.1: epi8-32, epu8-32
  mi min_epu16(mi a, mi b)
- Max [SSE] max: same suffixes as Min
  mi max_epu16(mi a, mi b)
- Average [SSE2] avg: epu8-16
  mi avg_epu16(mi a, mi b)
- Horizontal Min [SSE4.1] minpos: epu16 (128). Computes the horizontal min of one input vector of 16-bit uints. The min is stored in the lower 16 bits and its index in the following 3 bits.
  mi minpos_epu16(mi a)

Composite Int Arithmetics
- Multiply and Horizontal Add [SSSE3] maddubs: epi16. Multiplies vertical 8-bit ints, producing 16-bit ints, and adds horizontal pairs with saturation, producing 8x 16-bit ints. The first input is treated as unsigned and the second as signed; results and intermediaries are signed.
  mi maddubs_epi16(mi a, mi b)
- Multiply and Horizontal Add [SSE2] madd: epi16. Multiplies 16-bit ints, producing 32-bit ints, and adds horizontal pairs, producing 4x 32-bit ints.
  mi madd_epi16(mi a, mi b)
- Sum of Absolute Differences [SSE2] sad: epu8. Computes the absolute differences of packed unsigned 8-bit integers in a and b, then horizontally sums each consecutive 8 differences to produce two unsigned 16-bit integers, packed into the low 16 bits of the 64-bit elements of dst.
  mi sad_epu8(mi a, mi b)
- Sum of Absolute Differences 2 [SSE4.1] mpsadbw: epu8. Computes the sum of absolute differences (SADs) of quadruplets of unsigned 8-bit integers in a compared to those in b. Eight SADs are performed using one quadruplet from b and eight quadruplets from a. One quadruplet is selected from b starting at the offset specified in imm; the eight quadruplets are formed from sequential 8-bit integers selected from a starting at the offset specified in imm.
  mi mpsadbw_epu8(mi a, mi b, ii imm)
  // a_offset := imm[2]*32; b_offset := imm[1:0]*32
  // FOR j := 0 to 7: i := j*8; k := a_offset+i; l := b_offset
  //   dst[i+15:i] := ABS(a[k+7:k]-b[l+7:l]) + ABS(a[k+15:k+8]-b[l+15:l+8])
  //                + ABS(a[k+23:k+16]-b[l+23:l+16]) + ABS(a[k+31:k+24]-b[l+31:l+24])
- Dot Product [SSE4.1] dp: ps/d. Computes the dot product of two vectors. A mask is used to state which components are to be multiplied and where the result is stored. See the sketch after this section.
  m dp_ps(m a, m b, ii imm)
  // FOR j := 0 to 3: i := j*32
  //   tmp[i+31:i] := imm[4+j] ? a[i+31:i] * b[i+31:i] : 0
  // sum := tmp[127:96] + tmp[95:64] + tmp[63:32] + tmp[31:0]
  // FOR j := 0 to 3: dst[i+31:i] := imm[j] ? sum : 0

Fused Multiply and Add
- FM-Add [FMA] f[n]madd: ps/d, ss/d. Computes (a*b)+c for each element. The n version computes -(a*b)+c.
  m fmadd_ps(m a, m b, m c)
- FM-Sub [FMA] f[n]msub: ps/d, ss/d. Computes (a*b)-c for each element. The n version computes -(a*b)-c.
  m fmsub_ps(m a, m b, m c)
- FM-AddSub [FMA] fmaddsub: ps/d. Computes (a*b)-c for elements with even index and (a*b)+c for elements with odd index.
  m fmaddsub_ps(m a, m b, m c)
- FM-SubAdd [FMA] fmsubadd: ps/d. Computes (a*b)+c for elements with even index and (a*b)-c for elements with odd index.
  m fmsubadd_ps(m a, m b, m c)
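A worked example for the Dot Product box above (dot4 is an illustrative name; assumes SSE4.1, compiled with -msse4.1):

    #include <smmintrin.h>   /* SSE4.1 */

    /* Dot product of two 4-float vectors.
       imm 0xF1: high nibble selects all four lanes for multiplication,
       low nibble stores the sum in lane 0 only. */
    static float dot4(__m128 a, __m128 b) {
        return _mm_cvtss_f32(_mm_dp_ps(a, b, 0xF1));
    }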
Register I/O
Loading data into an SSE register or storing data from an SSE register.

Set (inserting data into a register without loading from memory)
- Set [AVX] set: epi8-64x, ps[SSE]/d, m128/m128d/m128i[AVX] (SEQ!). Sets and returns an SSE register with input values; e.g., the epi8 version takes 16 input parameters. The first input gets stored in the highest bits. For epi64, use the epi64x suffix.
  mi set_epi32(i a, i b, i c, i d)
- Set Reversed [AVX] setr: epi8-64x, ps[SSE]/d, m128/m128d/m128i[AVX] (SEQ!). Like set, but the order in the register is reversed, i.e., the first input gets stored in the lowest bits.
  mi setr_epi32(i a, i b, i c, i d)
- Replicate [SSE2] set1: epi8-64x, ps[SSE]/d (SEQ!). Broadcasts one input element into all slots of an SSE register.
  mi set1_epi32(i a)
- Insert [SSE4.1] insert: epi8-64, ps; epi16[SSE2] (128). Inserts an element i at a position p into a.
  mi insert_epi16(mi a, i i, ii p)  // dst := a; sel := p[2:0]*16; dst[sel+15:sel] := i[15:0]
- Get Undefined Register [AVX] undefined: ps/d, si128-256. Returns an SSE register with undefined contents.
  mi undefined_si128()

Load (loading data into an SSE register from memory)

Aligned loads: the memory address must be 16-byte aligned (or 32-byte for 256-bit instructions).
- Aligned Load [SSE2] load: si128, pd, ps[SSE], si256[AVX]. Loads 128 bits from memory into a register. The memory address must be aligned!
  md load_pd(d* ptr)
- Aligned Load Reversed [SSE2] loadr: pd, ps (SEQ!, 128). Loads 128 bits from memory into a register; the elements are loaded in reversed order. The memory address must be aligned!
  m loadr_ps(f* ptr)
  // dst[31:0]  := *(ptr)[127:96]; dst[63:32]  := *(ptr)[95:64]
  // dst[95:64] := *(ptr)[63:32];  dst[127:96] := *(ptr)[31:0]
- Stream Load [SSE4.1] stream_load: si128, si256[AVX2]. Loads 128 bits from memory into a register; the memory is fetched at once from a USWC device without going through the cache hierarchy. The memory address must be aligned!
  mi stream_load_si128(mi* ptr)

Unaligned loads: the memory address does not have to be aligned to a specific boundary.
- Unaligned Load [SSE2] loadu: pd, ps, si16-si128. Loads 128 bits (or less for the si16-64 versions) from memory into the low bits of a register.
  md loadu_pd(d* ptr)
- Fast Unaligned Load [SSE3] lddqu: si128. Loads 128-bit integer data into an SSE register; faster than loadu if the value crosses a cache line boundary. Should be preferred over loadu.
  mi lddqu_si128(mi* ptr)
- Load Single [SSE2] load: ss[SSE], sd, epi64 (128). Loads a single element from memory and zeros the remaining bytes. For epi64, the command is loadl_epi64!
  md load_sd(d* ptr)
- Load High/Low [SSE2] loadh/l: pd, pi. Loads a value from memory into the high/low half of a register and fills the other half from a.
  md loadl_pd(md a, d* ptr)  // dst[63:0] := *(ptr)[63:0]; dst[127:64] := a[127:64]
- Broadcast Load [SSE2] load1: pd, ps (SEQ!, 128). Loads a float from memory into all slots of the 128-bit register. For pd, there is also loaddup [SSE3], which may perform faster than load1.
  md load1_pd(d* ptr); md loaddup_pd(d* ptr)
- Broadcast [AVX] broadcast: ss/d, ps/d (m128 source). Broadcasts one input element (ss/d) or 128 bits (ps/d) from memory into all slots of an SSE register. All suffixes but ss are only available in 256-bit mode!
  m broadcast_ss(f* a)
- Dual 128-bit Load [AVX] loadu2: m128, m128d, m128i (SEQ!, 256). Loads two 128-bit elements from two memory locations.
  m loadu2_m128(f* p1, f* p2)  // dst[127:0] := *p2; dst[255:128] := *p1
- Masked Load [AVX] maskload: ps/d, epi32-64[AVX2]. Loads an element from memory only if the highest bit of the corresponding element in the mask is set; otherwise 0 is used.
  mi maskload_epi64(i64* ptr, mi mask)
  // FOR j := 0 to 1: i := j*64
  //   dst[i+63:i] := mask[i+63] ? *(ptr+i) : 0
- Gather [AVX2] i32/i64gather: epi32-64, ps/d. Gathers elements from memory using 32-bit/64-bit indices. The elements are loaded from addresses starting at ptr and offset by each 32-bit/64-bit element in a (each index is scaled by the factor s, which must be 1, 2, 4, or 8). The number of gathered elements is limited by the minimum of the type to load and the type used for the offset. See the sketch after this section.
  mi i64gather_epi32(i* ptr, mi a, ii s)
  // FOR j := 0 to 1: i := j*32; m := j*64
  //   dst[i+31:i] := *(ptr + a[m+63:m]*s)
- Mask Gather [AVX2] mask_i32/i64gather: epi32-64, ps/d. Same as gather but takes an additional mask and src register. Each element is only gathered if the highest bit of the corresponding element in the mask is set; otherwise it is copied from src.
  mi mask_i64gather_epi32(mi src, i* ptr, mi a, mi mask, ii s)
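A hedged sketch of the gather boxes above (gather8 is an illustrative name; assumes AVX2, compiled with -mavx2; note that the scale must be a compile-time constant):

    #include <immintrin.h>   /* AVX2 */

    /* Gather eight floats scattered at base[idx[0..7]]; scale 4 = sizeof(float). */
    static __m256 gather8(const float *base, __m256i idx) {
        return _mm256_i32gather_ps(base, idx, 4);
    }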


Store (storing data from an SSE register into memory)

Aligned stores: the memory address must be 16-byte aligned (or 32-byte for 256-bit instructions).
- Aligned Store [SSE2] store: si128, pd, ps[SSE], si256[AVX]. Stores 128 bits from the register to memory. The memory address must be aligned!
  v store_pd(d* ptr, md a)
- Aligned Store Reversed [SSE2] storer: pd, ps[SSE] (SEQ!, 128). Stores the elements from the register to memory in reverse order. The memory address must be aligned!
  v storer_pd(d* ptr, md a)
- Broadcast Store [SSE2] store1: pd, ps[SSE] (SEQ!, 128). Stores a float to memory, replicating it two (pd) or four (ps) times. The memory address must be aligned!
  v store1_pd(d* ptr, md a)
- Stream Store [SSE2] stream: si32-128, pd, ps[SSE], si256[AVX]. Stores 128 bits (or 32-64 bits for the si32-64 versions) into memory using a non-temporal hint to minimize cache pollution. The memory address must be aligned! See the sketch after this section.
  v stream_si128(mi* p, mi a); v stream_si32(i* p, i a)
- Masked Store [AVX] maskstore: ps/d, epi32-64[AVX2]. Stores an element from a into memory at p if the corresponding element in mask m has its highest bit set.
  v maskstore_ps(f* p, mi m, m a)

Unaligned stores: the memory address does not have to be aligned to a specific boundary.
- Unaligned Store [SSE2] storeu: si16-128, pd, ps[SSE], si256[AVX]. Stores 128 bits (or less for the si16-64 versions) into memory.
  v storeu_pd(d* ptr, md a)
- Store High [SSE2] storeh: pd (128). Stores the high 64 bits into memory.
  v storeh_pd(d* ptr, md a)
- Store Single [SSE2] store[l]: ss[SSE], sd, epi64 (128). Stores the low element into memory. The l version must be used for epi64 and can be used for pd.
  v store_sd(d* ptr, md a)
- Masked Byte Store (Scatter) [SSE2] maskmoveu: si128 (128). Stores bytes into memory if the corresponding byte in mask has its highest bit set.
  v maskmoveu_si128(mi a, mi mask, c* ptr)
- Dual 128-bit Store [AVX] storeu2: m128, m128d, m128i (SEQ!, 256). Stores two 128-bit elements into two memory locations.
  v storeu2_m128(f* p1, f* p2, m a)  // *p2 := a[127:0]; *p1 := a[255:128]

Fences
- Store Fence [SSE] sfence. Guarantees that every store instruction that precedes, in program order, is globally visible before any store instruction which follows the fence in program order.
  v sfence()
- Load Fence [SSE2] lfence. Guarantees that every load instruction that precedes, in program order, is globally visible before any load instruction which follows the fence in program order.
  v lfence()
- Memory Fence [SSE2] mfence. Guarantees that every memory access that precedes, in program order, the memory fence instruction is globally visible before any memory instruction which follows the fence in program order.
  v mfence()

Misc I/O
- Prefetch [SSE] prefetch. Fetches the line of data from memory that contains address ptr to a location in the cache hierarchy specified by the locality hint i.
  v prefetch(c* ptr, i i)
- Cache Line Flush [SSE2] clflush. Flushes the cache line that contains p from all cache levels.
  v clflush(v* ptr)
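A hedged sketch combining Stream Store with Store Fence (fill_nt is an illustrative name; assumes SSE2 and a 16-byte-aligned destination):

    #include <emmintrin.h>   /* SSE2 */
    #include <stddef.h>

    /* Fill a buffer without polluting the cache, then fence. */
    static void fill_nt(__m128i *dst, __m128i v, size_t n /* count of __m128i */) {
        for (size_t i = 0; i < n; i++)
            _mm_stream_si128(dst + i, v);   /* non-temporal store; dst must be aligned */
        _mm_sfence();                       /* make the streaming stores globally visible */
    }

The trailing sfence matters because non-temporal stores are weakly ordered relative to ordinary stores.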
Byte Manipulation
Change the order of bytes in a register, duplicate or zero bytes selectively, or mix the bytes of two registers.

Extraction and 128-bit Insertion
- Extract [SSE4.1] extract: epi8-64, ps; epi16[SSE2] (128). Extracts one element from a register and returns it as a normal int (no SSE register!).
  i extract_epi16(mi a, ii i)  // (a >> (i[2:0]*16))[15:0]
- 256-bit Extract [AVX] extractf128: si256, ps/d (256). Extracts 128 bits and returns them as a 128-bit SSE register. For AVX2, there is only si256 and the operation is named extracti128.
  m128 extractf128_ps(m256 a, ii i)  // (a >> (i*128))[127:0]
- 256-bit Insert [AVX] insertf128: si256, ps/d (256). Inserts 128 bits at a position specified by i. For AVX2, there is only si256 and the operation is named inserti128.
  m insertf128_ps(m a, m128 b, ii i)  // dst := a; sel := i*128; dst[sel+127:sel] := b[127:0]

Mix Registers
- Move Single Element [SSE] move: ss[SSE], sd. Moves the lowest element from the second input and fills the remaining elements from the first.
  m move_ss(m a, m b)  // dst[31:0] := b[31:0]; dst[127:32] := a[127:32]
- Move High↔Low [SSE] movelh/hl: ps (128). The lh version moves the lower half of b into the upper half of the result and fills the rest from a; the hl version moves the upper half of b to the lower half.
  m movehl_ps(m a, m b)
- Concatenate and Byteshift (Align) [SSSE3] alignr: epi8. Concatenates a and b and shifts the result right by c bytes.
  mi alignr_epi8(mi a, mi b, i c)  // ((a << 128) | b) >> c*8
- Blend [SSE4.1] blend[v]: epi8-16, epi32[AVX2], ps/d. Mixes elements from two registers: blend selects via an immediate mask, blendv via a register mask (the highest bit of each mask element selects).
  mi blend_epi16(mi a, mi b, ii imm)
  // FOR j := 0 to 7: i := j*16
  //   dst[i+15:i] := imm[j] ? b[i+15:i] : a[i+15:i]
- Interleave (Unpack) [SSE2] unpackhi/lo: epi8-64, ps[SSE], pd. Interleaves elements from the high/low halves of two input registers into a target register.
  mi unpackhi_epi32(mi a, mi b)
  // dst[31:0]  := a[95:64];  dst[63:32]  := b[95:64]
  // dst[95:64] := a[127:96]; dst[127:96] := b[127:96]

Byte Shuffling (change the byte order using a control mask)
- 32-bit Int Shuffle [SSE2] shuffle: epi32. Shuffles the 32-bit ints in a using an immediate control mask.
  mi shuffle_epi32(mi a, ii i)
  // S(s, mask): CASE(mask[1:0]) 0: s[31:0]; 1: s[63:32]; 2: s[95:64]; 3: s[127:96]
  // dst[31:0]  := S(a, i[1:0]); dst[63:32]  := S(a, i[3:2])
  // dst[95:64] := S(a, i[5:4]); dst[127:96] := S(a, i[7:6])
- Shuffle High/Low 16 bit [SSE2] shufflehi/lo: epi16. Shuffles the high/low half of the register using an immediate control mask; the rest of the register is copied from the input.
  mi shufflehi_epi16(mi a, ii i)
  // dst[63:0]   := a[63:0]
  // dst[79:64]  := (a >> (i[1:0]*16))[79:64]; dst[95:80]   := (a >> (i[3:2]*16))[79:64]
  // dst[111:96] := (a >> (i[5:4]*16))[79:64]; dst[127:112] := (a >> (i[7:6]*16))[79:64]
- Byte Shuffle [SSSE3] shuffle: epi8. Shuffles packed 8-bit integers in a according to the shuffle control mask in the corresponding 8-bit element of b. See the sketch after this section.
  mi shuffle_epi8(mi a, mi b)
  // FOR j := 0 to 15: i := j*8
  //   IF b[i+7] == 1: dst[i+7:i] := 0
  //   ELSE: k := b[i+3:i]*8; dst[i+7:i] := a[k+7:k]
- Float Shuffle [SSE] shuffle: ps[SSE], pd. Shuffles floats from two registers: the lower part of the result receives values from the first register, the upper part from the second. The shuffle mask is an immediate! The 256-bit version performs the same operation on two 128-bit lanes.
  md shuffle_pd(md a, md b, ii i)
  // dst[63:0]   := i[0] ? a[127:64] : a[63:0]
  // dst[127:64] := i[1] ? b[127:64] : b[63:0]
- Shuffle 128-bit Chunks [AVX] permute2f128: ps/d, si256 (256). Takes 2 registers and shuffles 128-bit chunks to a target register; each chunk can also be cleared by a bit in the mask. For AVX2, there is only si256, renamed to permute2x128.
  md permute2f128_pd(md a, md b, ii i)
  // S(s1, s2, control): CASE(control[1:0]) 0: s1[127:0]; 1: s1[255:128]; 2: s2[127:0]; 3: s2[255:128]
  //   IF control[3]: RETURN 0
  // dst[127:0] := S(a, b, i[3:0]); dst[255:128] := S(a, b, i[7:4])
- Single-Register Float Shuffle [AVX] permute[var]: ps/d. The 128-bit version is the same as the int shuffle for floats. The normal version uses an immediate; the var version uses a register (the lowest bits in each element of b form the mask for that element in a).
  m permutevar_ps(m a, mi b)
- 4x64-bit Shuffle [AVX2] permute4x64: epi64, pd (256). Same as the float shuffle, but no [var] version is available; in addition, the shuffle is performed over the full 256 bits instead of two lanes of 128 bits.
  md permute4x64_pd(md a, ii i)
- 8x32-bit Shuffle [AVX2] permutevar8x32: epi32, ps (256). Same as the 4x64-bit shuffle but with 8x32-bit elements; only a [var] version taking a register instead of an immediate is available.
  m permutevar8x32_ps(m a, mi b)

Broadcast (replicate one element in a register to fill the entire register)
- Broadcast [AVX2] broadcastX: epi8-64, ps/d, si256. Broadcasts the lowest element into all slots of the result register; the si256 version broadcasts one 128-bit element! The letter X must be the following: epi8: b, epi16: w, epi32: d, epi64: q, ps: s, pd: d, si256: si128.
  mi broadcastb_epi8(mi a)
- Duplicate Low Half [SSE3] movedup: pd. Duplicates the lower half of a.
  md movedup_pd(md a)  // a[63:0] | (a[63:0] << 64)
- Duplicate Even/Odd Floats [SSE3] movel/hdup: ps. Duplicates 32-bit floats into the neighboring slot of each 64-bit pair: the l version duplicates the even-indexed elements (bits [31:0] of each pair), the h version the odd-indexed ones (bits [63:32]).
  m moveldup_ps(m a)

Byte Zeroing
- Move and Zero High [SSE2] move: epi64. Moves the lower half of the input into the result and zeros the remaining bytes.
  mi move_epi64(mi a)  // dst[63:0] := a[63:0]
- Zero Register [SSE2] setzero: ps[SSE]/d, si128. Returns a register with all bits zeroed.
  mi setzero_si128()
- Zero All Registers [AVX] zeroall. Zeros all SSE/AVX registers.
  v zeroall()
- Zero High Halves [AVX] zeroupper. Zeros bits [MAX:256] of all SSE/AVX registers.
  v zeroupper()

Byte Movement
- 128-bit Byteshift [SSE2] [b]sl/rli: si128. Shifts the input register left/right by i bytes while shifting in zeros. There is no difference between the b and the non-b version.
  mi bsrli_si128(mi a, ii i)
- Byte Swap [x86] _bswap[64]: (i32, i64), ≤64. Swaps the bytes in the 32-bit int or 64-bit int.
  i64 _bswap64(i64 a)
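A worked example for the Byte Shuffle box above (reverse_bytes is an illustrative name; assumes SSSE3, compiled with -mssse3):

    #include <tmmintrin.h>   /* SSSE3 */

    /* Reverse the 16 bytes of a register with a constant control mask. */
    static __m128i reverse_bytes(__m128i v) {
        const __m128i ctrl = _mm_setr_epi8(15, 14, 13, 12, 11, 10, 9, 8,
                                            7,  6,  5,  4,  3,  2, 1, 0);
        return _mm_shuffle_epi8(v, ctrl);
    }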
Comparisons
Elements that compare true receive 1s in all bits, otherwise 0s.

- Float Compare [SSE2] cmp[n]Z: ps/d, ss/d. Z can be one of: ge/le/lt/gt/eq. The n version is a not version, e.g., neq computes not equal.
  md cmpeq_pd(md a, md b)
- Float Compare NaN [SSE2] cmp[un]ord: ps/d, ss/d. cmpord sets the result bits to 1 if both elements are not NaN, otherwise 0; cmpunord sets the bits if at least one is NaN.
  md cmpord_pd(md a, md b)
- AVX Float Compare [AVX] cmp: ps/d, ss/d. Compares packed or single elements based on imm; possible values are 0-31 (check the documentation).
  md cmp_pd(md a, md b, ii imm)
- Compare Single Float [SSE2] [u]comiZ: ss/d (128). Z can be one of: eq/ge/gt/le/lt/neq. Returns a single int that is either 1 or 0. The u version does not signal an exception for QNaNs and is not available for 256 bits!
  i comieq_sd(md a, md b)
- Int Compare [SSE2] cmpZ: epi8-32, epi64[SSE4.1]. Z can be one of: lt/gt/eq.
  mi cmpeq_epi8(mi a, mi b)

Specials

AES
- AES KeyGen Assist [AES] aeskeygenassist: si128 (128). Assists in expanding the AES cipher key by computing steps towards generating a round key for encryption, using data from a and an 8-bit round constant specified in i.
  mi aeskeygenassist_si128(mi a, ii i)
- AES Inverse Mix Columns [AES] aesimc: si128 (128). Performs the inverse mix columns transformation on the input.
  mi aesimc_si128(mi a)
- AES Encrypt [AES] aesenc[last]: si128 (128). Performs one round of an AES encryption flow on the state in a using the round key; the last version performs the last round.
  mi aesenc_si128(mi a, mi key)
- AES Decrypt [AES] aesdec[last]: si128 (128). Performs one round of an AES decryption flow on the state in a using the round key; the last version performs the last round.
  mi aesdec_si128(mi a, mi key)

Cyclic Redundancy Check
- CRC32 [SSE4.2] crc32: u8-u64 (≤64). Starting with the initial value in crc, accumulates a CRC32 value for the unsigned X-bit integer v, and stores the result in dst.
  u crc32_u32(u crc, u v)
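A hedged sketch of accumulating a checksum over a buffer with the CRC32 box above (crc32c is an illustrative name; the instruction implements the CRC-32C/Castagnoli polynomial, and the initial/final XOR with all-ones follows the common CRC-32C convention; assumes SSE4.2, compiled with -msse4.2):

    #include <nmmintrin.h>   /* SSE4.2 */
    #include <stddef.h>
    #include <stdint.h>

    static uint32_t crc32c(const uint8_t *p, size_t n) {
        uint32_t crc = 0xFFFFFFFFu;
        for (size_t i = 0; i < n; i++)
            crc = _mm_crc32_u8(crc, p[i]);   /* fold one byte into the running CRC */
        return crc ^ 0xFFFFFFFFu;
    }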
Tests
Perform a bitwise test and check whether all resulting bits are 0s.

- Test [SSE4.1/AVX] testc/z: si128[SSE4.1], si256, ps/d. Computes the bitwise AND of a and b and sets ZF to 1 if the result is zero, otherwise 0; computes the bitwise AND NOT of a and b and sets CF to 1 if the result is zero, otherwise 0. The c version returns CF and the z version ZF. For 128 bit, there is also test_all_zeros, which does the same as testz_si128.
  i testc_si128(mi a, mi b)
  // ZF := ((a & b) == 0); CF := ((a & !b) == 0); RETURN CF
- Test Not All Zeros/Ones [SSE4.1/AVX] test_ncz: si128[SSE4.1], si256, ps/d. Returns !CF && !ZF for the flags defined above. For 128 bit, there is also test_mix_ones_zeros, which does the same.
  i test_mix_ones_zeros(mi a, mi b)
  // ZF := ((a & b) == 0); CF := ((a & !b) == 0); RETURN !CF && !ZF
- Test All Ones [SSE4.1] test_all_ones (128, SEQ!). Returns true iff all bits in a are set. Needs two instructions and may be slower than native implementations.
  i test_all_ones(mi a)  // (~a) == 0

Misc
- Pause [SSE2] pause. Provides a hint to the processor that the code sequence is a spin-wait loop. This can help improve the performance and power consumption of spin-wait loops.
  v pause()
- Monitor [SSE3] monitor. Arms the address monitoring hardware using the address specified in ptr. A store to an address within the specified address range triggers the monitoring hardware. Specify optional extensions in e and optional hints in h.
  v monitor(v* ptr, u e, u h)
- Monitor Wait [SSE3] mwait. Hint to the processor that it can enter an implementation-dependent optimized state while waiting for an event or store operation to the address range specified by MONITOR.
  v mwait(u ext, u hints)
- Get MXCSR Register [SSE] getcsr. Gets the content of the MXCSR register.
  u getcsr()
- Set MXCSR Register [SSE] setcsr. Sets the MXCSR register with a 32-bit int.
  v setcsr(u a)

Transactional Memory (TSX)
- Begin Transaction [TSX] _xbegin. Specifies the start of an RTM code region. If the logical processor was not already in transactional execution, it transitions into transactional execution. On an RTM abort, the logical processor discards all architectural register and memory updates performed during the RTM execution, restores architectural state, and starts execution beginning at the fallback address computed from the outermost XBEGIN instruction.
  u _xbegin()
- Commit Transaction [TSX] _xend. Specifies the end of an RTM code region. If this corresponds to the outermost scope, the logical processor will attempt to commit the logical processor state atomically. If the commit fails, the logical processor will perform an RTM abort.
  v _xend()
- Abort Transaction [TSX] _xabort. Forces an RTM abort. The EAX register is updated to reflect that an XABORT instruction caused the abort, and the imm parameter will be provided in bits [31:24] of EAX. Following an RTM abort, the logical processor resumes execution at the fallback address computed through the outermost XBEGIN instruction.
  v _xabort(ii imm)
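A hedged sketch of the RTM boxes above (increment is an illustrative name; assumes RTM support and -mrtm; note that TSX is fused off or disabled by microcode on many CPUs, so the fallback path is mandatory in practice; __sync_fetch_and_add is a GCC/Clang builtin standing in for a real lock):

    #include <immintrin.h>   /* RTM intrinsics */

    static long counter;

    static void increment(void) {
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            counter++;                         /* runs transactionally */
            _xend();                           /* commit */
        } else {
            /* aborted: status encodes the cause; fall back to an atomic RMW */
            __sync_fetch_and_add(&counter, 1);
        }
    }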
String Compare (SSE4.2)

Each string compare operation has an i and an e version:
  cmpistrX(mi a, mi b, ii i) compares all elements;
  cmpestrX(mi a, i la, mi b, i lb, ii i) compares up to specific lengths la and lb.
The immediate value i for all these comparisons consists of bit flags. Exactly one flag per group must be present:

Data type specifier:
- _SIDD_UBYTE_OPS: unsigned 8-bit chars
- _SIDD_UWORD_OPS: unsigned 16-bit chars
- _SIDD_SBYTE_OPS: signed 8-bit chars
- _SIDD_SWORD_OPS: signed 16-bit chars

Compare mode specifier:
- _SIDD_CMP_EQUAL_ANY: for each character c in a, determine whether any character in b is equal to c.
- _SIDD_CMP_RANGES: for each character c in a, determine whether b0 <= c <= b1 or b2 <= c <= b3, and so on.
- _SIDD_CMP_EQUAL_EACH: check for string equality of a and b.
- _SIDD_CMP_EQUAL_ORDERED: search substring b in a; each byte where b begins in a is treated as a match.

Polarity specifier:
- _SIDD_POSITIVE_POLARITY: a match is indicated by a 1-bit.
- _SIDD_NEGATIVE_POLARITY: negation of the resulting bitmask.
- _SIDD_MASKED_NEGATIVE_POLARITY: negation of the resulting bitmask, except for bits that have an index larger than the size of a or b.

Operations:
- String Compare Mask [SSE4.2] cmpi/estrm. Compares strings a and b and returns the comparison mask. If _SIDD_BIT_MASK is used, the resulting mask is a bit mask; if _SIDD_UNIT_MASK is used, the result is a byte mask which has ones in all bits of the matching bytes.
  mi cmpistrm(mi a, mi b, ii i); mi cmpestrm(mi a, i la, mi b, i lb, ii i)
- String Compare Index [SSE4.2] cmpi/estri. Compares strings in a and b and returns the index of the first byte that matches. Otherwise Maxsize is returned (either 8 or 16, depending on the data type).
  i cmpistri(mi a, mi b, ii i); i cmpestri(mi a, i la, mi b, i lb, ii i)
- String Compare Truth [SSE4.2] cmpi/estrc. Compares strings in a and b and returns true iff the resulting mask is not zero, i.e., if there was a match.
  i cmpistrc(mi a, mi b, ii i); i cmpestrc(mi a, i la, mi b, i lb, ii i)
- String Compare with Nullcheck [SSE4.2] cmpi/estra. Compares strings in a and b and returns true iff the resulting mask is zero and there is no null character in b.
  i cmpistra(mi a, mi b, ii i); i cmpestra(mi a, i la, mi b, i lb, ii i)
- String Nullcheck [SSE4.2] cmpi/estrs/z. Compares two strings a and b and returns whether a (s version) or b (z version) contains a null character.
  i cmpistrs(mi a, mi b, ii i); i cmpestrs(mi a, i la, mi b, i lb, ii i)
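A hedged sketch of cmpistri in ranges mode (first_digit is an illustrative name; it follows Intel's operand convention where the first operand holds the range pairs and the implicit-length version treats a zero byte as end of string; assumes SSE4.2, compiled with -msse4.2):

    #include <nmmintrin.h>   /* SSE4.2 */

    /* Index of the first ASCII digit in a 16-byte chunk, or 16 if none. */
    static int first_digit(__m128i chunk) {
        const __m128i digits = _mm_setr_epi8('0', '9', 0, 0, 0, 0, 0, 0,
                                              0,   0,  0, 0, 0, 0, 0, 0);
        return _mm_cmpistri(digits, chunk, _SIDD_UBYTE_OPS | _SIDD_CMP_RANGES);
    }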