x86 Intrinsics Cheat Sheet Jan Finis [email protected]
Bit Operations
--------------

Boolean Logic
  Bool XOR: SSE2 xor (si128, ps[SSE], pd)
    mi xor_si128(mi a, mi b)
  Bool AND: SSE2 and (si128, ps[SSE], pd)
    mi and_si128(mi a, mi b)
  Bool AND NOT: SSE2 andnot (si128, ps[SSE], pd)
    NOTE: Computes !a & b.
    mi andnot_si128(mi a, mi b)
  Bool OR: SSE2 or (si128, ps[SSE], pd)
    mi or_si128(mi a, mi b)

Bit Shifting & Rotation
  Arithmetic Shift Right: SSE2 sra[i] (epi16-64)
    NOTE: Shifts elements right while shifting in sign bits. The i version takes an
    immediate as count; the version without i takes the lower 64 bits of an SSE register.
    mi srai_epi32(mi a, ii32 i)
    mi sra_epi32(mi a, mi b)
  Logic Shift Left/Right: SSE2 sl/rl[i] (epi16-64)
    NOTE: Shifts elements left/right while shifting in zeros. The i version takes an
    immediate as count; the version without i takes the lower 64 bits of an SSE register.
    mi slli_epi32(mi a, ii i)
    mi sll_epi32(mi a, mi b)
  Rotate Left/Right (<=64 bit): x86 _[l]rot[w]l/r
    NOTE: Rotates bits in a left/right by a number of bits specified in i. The l
    version is for 64-bit ints and the w version for 16-bit ints.
    u64 _lrotl(u64 a)
  Variable Arithmetic Shift: AVX2 srav (epi32)
    NOTE: Shifts elements right while shifting in sign bits. Each element in a is
    shifted by the amount specified by the corresponding element in b. (Only a right
    shift exists for the variable arithmetic case.)
    mi srav_epi32(mi a, mi b)
  Variable Logic Shift: AVX2 sl/rlv (epi32-64)
    NOTE: Shifts elements left/right while shifting in zeros. Each element in a is
    shifted by the amount specified by the corresponding element in b.
    mi sllv_epi32(mi a, mi b)

Selective Bit Moving (change the position of selected bits, zeroing out the remaining ones)
  Bit Scatter (Deposit, <=64 bit): BMI2 _pdep (u32-64)
    NOTE: Deposits contiguous low bits from a to dst at the bit locations specified
    by mask m; all other bits in dst are set to zero.
    u64 _pdep_u64(u64 a, u64 m)
  Bit Gather (Extract, <=64 bit): BMI2 _pext (u32-64)
    NOTE: Extracts bits from a at the bit locations specified by mask m to contiguous
    low bits in dst; the remaining upper bits in dst are set to zero.
    u64 _pext_u64(u64 a, u64 m)
  Movemask: SSE2 movemask (epi8, pd, ps[SSE])
    NOTE: Creates a bitmask from the most significant bit of each element.
    i movemask_epi8(mi a)
  Extract Bits (<=64 bit): BMI1 _bextr (u32-64)
    NOTE: Extracts l bits from a starting at bit s:
      (a >> s) & ((1 << l) - 1)
    u64 _bextr_u64(u64 a, u32 s, u32 l)

Bit Masking (set or reset a range of bits, or isolate the lowest 1-bit)
  Zero High Bits (<=64 bit): BMI2 _bzhi (u32-64)
    NOTE: Zeros all bits in a higher than and including the bit at index i.
    u64 _bzhi_u64(u64 a, u32 i)
      dst := a
      IF (i[7:0] < 64) dst[63:i] := 0
  Reset Lowest 1-Bit (<=64 bit): BMI1 _blsr (u32-64)
    NOTE: Returns a but with the lowest 1-bit set to 0.
    u64 _blsr_u64(u64 a)
      (a-1) & a
  Mask Up To Lowest 1-Bit (<=64 bit): BMI1 _blsmsk (u32-64)
    NOTE: Returns an int that has all bits set up to and including the lowest 1-bit
    in a, or no bit set if a is 0.
    u64 _blsmsk_u64(u64 a)
      (a-1) ^ a
  Find Lowest 1-Bit (<=64 bit): BMI1 _blsi (u32-64)
    NOTE: Returns an int that has only the lowest 1-bit in a set, or no bit set if
    a is 0.
    u64 _blsi_u64(u64 a)
      (-a) & a

Bit Counting
  Count 1-Bits (Popcount, <=64 bit): POPCNT popcnt (u32-64)
    NOTE: Counts the number of 1-bits in a.
    u32 popcnt_u64(u64 a)
  Count Leading Zeros (<=64 bit): LZCNT _lzcnt (u32-64)
    NOTE: Counts the number of leading zeros in a.
    u64 _lzcnt_u64(u64 a)
  Count Trailing Zeros (<=64 bit): BMI1 _tzcnt (u32-64)
    NOTE: Counts the number of trailing zeros in a.
    u64 _tzcnt_u64(u64 a)

Bit Scan Forward/Reverse: x86 _bit_scan_forward/reverse (i32)
  NOTE: Returns the index of the lowest/highest bit set in the 32-bit int a.
  Undefined if a is 0.
  i32 _bit_scan_forward(i32 a)

Conversions
-----------

Packed Conversions (convert all elements in a packed SSE register)
  Convert Float <-> 16-bit Float: CVT16 cvtX_Y (ph <-> ps)
    NOTE: Converts between 4x 16-bit floats and 4x 32-bit floats (8 for 256-bit
    mode). For the 32-to-16-bit conversion, a rounding mode r must be specified.
    m cvtph_ps(mi a)
    mi cvtps_ph(m a, i r)
  Pack With Saturation (16 bit <-> 32 bit): SSE2 pack[u]s (epi16, epi32)
    NOTE: Packs ints from two input registers into one register, halving the
    bitwidth. Overflows are handled using saturation. The u version saturates to the
    unsigned integer type.
    mi packs_epi32(mi a, mi b)
  Sign Extend: SSE4.1 cvtX_Y (epi8-32 -> epi16-64)
    NOTE: Sign extends each element from X to Y. Y must be longer than X.
    mi cvtepi8_epi32(mi a)
  Zero Extend: SSE4.1 cvtX_Y (epu8-32 -> epi16-64)
    NOTE: Zero extends each element from X to Y. Y must be longer than X.
    mi cvtepu8_epi32(mi a)
  Conversion: SSE2 cvt[t]X_Y (epi32, ps/d)
    NOTE: Converts packed elements from X to Y. If pd is used as source or target
    type, then 2 elements are converted, otherwise 4. The t version is only
    available when converting to int and performs a truncation instead of rounding.
    md cvtepi32_pd(mi a)
  128-bit Reinterpret Cast: SSE2 castX_Y (si128, ps/d)
    NOTE: Reinterpret casts from X to Y. No operation is generated.
    md castsi128_pd(mi a)
  Cast 128 <-> 256: AVX castX_Y (pd128 <-> pd256, ps128 <-> ps256, si128 <-> si256)
    NOTE: Reinterpret casts from X to Y. If the cast is from 128 to 256 bit, the
    upper bits are undefined! No operation is generated.
    m128d castpd256_pd128(m256d a)
  256-bit Reinterpret Cast: AVX castX_Y (ps, pd, si256)
    NOTE: Reinterpret casts from X to Y. No operation is generated.
    m256 castpd_ps(m256d a)

Rounding (see also: conversion to int, which performs rounding implicitly)
  Round Up (ceiling): SSE4.1 ceil (ps/d, ss/d)
    md ceil_pd(md a)
  Round Down (floor): SSE4.1 floor (ps/d, ss/d)
    md floor_pd(md a)
  Round: SSE4.1 round (ps/d, ss/d)
    NOTE: Rounds according to a rounding parameter r.
    md round_pd(md a, i r)

Single Element Conversion (convert a single element in the lower bytes of an SSE register)
  Int to Float Conversion: SSE2 cvtX_Y (si32-64 -> ss/d)
    NOTE: Converts a single element from X to Y; the remaining bits of the result
    are copied from a.
    md cvtsi64_sd(md a, i64 b)
      double(b[63:0]) | (a[127:64] << 64)
  Single Float to Int Conversion: SSE2 cvt[t]X_Y (ss/d -> si32-64)
    NOTE: Converts a single element from X (float) to Y (int). The result is a
    normal int, not an SSE register! The t version truncates instead of rounding and
    is available only when converting from float to int.
    i cvtss_si32(m a)
  Int Conversion: SSE2 cvtX_Y (si32-128)
    NOTE: Converts a single integer from X to Y. Either of the types must be si128.
    If the new type is longer, the integer is zero extended.
    mi cvtsi32_si128(i a)
  Single SSE Float to Normal Float: SSE2 cvtX_Y (ss -> f32, sd -> f64)
    NOTE: Converts a single float from X (SSE float type) to Y (normal float type).
    f cvtss_f32(m a)
  Old Float/Int Conversion with Fill: SSE cvt[t]_X2Y (pi <-> ps, si <-> ss)
    NOTE: Converts between int and float. The p versions convert 2-packs from b and
    fill the result with the upper bits of operand a. The s versions do the same for
    one element in b. The t version performs a truncation instead of rounding.
    m cvt_si2ss(m a, i b)
Overview
--------

Version 2.1f

This cheat sheet displays most x86 intrinsics supported by Intel processors. The
following intrinsics were omitted:
  - obsolete or discontinued instruction sets like MMX and 3DNow!
  - AVX-512, as it will not be available for some time and would blow up the space
    needed due to the vast amount of new instructions
  - intrinsics not supported by Intel but only by AMD, like parts of the XOP
    instruction set (maybe they will be added in the future)
  - intrinsics that are only useful for operating systems, like the _xsave intrinsic
    to save the CPU state
  - the RdRand intrinsics, as it is unclear whether they provide real random numbers
    without enabling kleptographic loopholes

Each family of intrinsics is depicted by a box as described below, and the
intrinsics are grouped as meaningfully as possible. Boxes marked SEQ! generate a
sequence of instructions rather than a single instruction. Most information is taken
from the Intel Intrinsics Guide
(http://software.intel.com/en-us/articles/intel-intrinsics-guide). Let me know
([email protected]) if you find any wrong or unclear content.

When not stated otherwise, it can be assumed that each vector intrinsic performs its
operation vertically on all elements which are packed into the input SSE registers.
E.g., the add instruction has no description; thus, it can be assumed that it
performs a vertical add of the elements of the two input registers, i.e., the first
element from register a is added to the first element in register b, the second to
the second, and so on. In contrast, a horizontal add would add the first element of
a to the second element of a, the third to the fourth, and so on.

To use the intrinsics, include the corresponding header (on current compilers,
immintrin.h covers most of them).

Arithmetics
-----------

Multiplication
  Carryless Mul: CLMUL clmulepi64 (si128)
    NOTE: Performs a carry-less multiplication of two 64-bit integers, selected from
    a and b according to imm, and stores the result in dst (the result is a 127-bit
    int).
    mi clmulepi64_si128(mi a, mi b, ii imm)
  Mul: SSE2 mul (epi32[SSE4.1], epu32, ps/d, ss/d)
    NOTE: The epi32 and epu32 versions multiply only 2 ints instead of 4!
    mi mul_epi32(mi a, mi b)
  Mul Low: SSE4.1 mullo (epi16[SSE2], epi32[SSE4.1])
    NOTE: Multiplies vertically and writes the low 16/32 bits into the result.
    mi mullo_epi16(mi a, mi b)
  Mul High: SSE2 mulhi (epi16, epu16)
    NOTE: Multiplies vertically and writes the high 16 bits into the result.
    mi mulhi_epi16(mi a, mi b)
  Mul High with Round & Scale: SSSE3 mulhrs (epi16)
    NOTE: Treats the 16-bit words in registers a and b as signed 15-bit fixed-point
    numbers between -1 and 1 (e.g. 0x4000 is treated as 0.5 and 0xa000 as -0.75),
    and multiplies them together.
    mi mulhrs_epi16(mi a, mi b)

Register I/O (loading data into an SSE register or storing data from an SSE register)

Set (inserting data into a register without loading from memory)
  Set Register: AVX set (epi8-64x, ps[SSE]/d, m128/m128d/i[AVX])  SEQ!
  Set Reversed: AVX setr (epi8-64x, ps[SSE]/d)  SEQ!
  Replicate: set1 (epi8-64x, epi16[SSE2], ps[SSE]/d)  SEQ!
    NOTE: Broadcasts one input element into all slots of the register.
  Insert: SSE4.1 insert (epi8-64, ps)
    NOTE: Inserts one element at the offset specified in imm.
  Extract: SSE2 extract (epi16)
    i extract_epi16(mi a, ii i)
      (a >> (i[2:0]*16))[15:0]

Fences
  Store Fence: SSE sfence
    NOTE: Guarantees that every store instruction that precedes it, in program
    order, is globally visible before any store instruction which follows it.

Misc I/O
  Prefetch: SSE prefetch
    NOTE: Fetches the line of data from memory that contains address ptr to a
    location in the cache hierarchy specified by the locality hint i.
Store (storing data to memory; "aligned!" means the address must be aligned to the register size)
  Aligned Store: SSE2 store (si128, pd, ps[SSE])
    NOTE: Stores a full register into memory. Memory location must be aligned!
    v store_pd(d* ptr, md a)
  Store One: SSE2 store1 (pd, ps[SSE])
    NOTE: Stores the low element into all slots of the memory location. Memory must
    be aligned!
    v store1_pd(d* ptr, md a)
  Store Reversed: SSE2 storer (pd, ps[SSE])
    NOTE: Stores the elements in reversed order. Memory must be aligned!
    v storer_pd(d* ptr, md a)
  Unaligned Store: SSE2 storeu (si16-128, pd, ps[SSE], si256[AVX])
    NOTE: Stores 128 bits (or less for the si16-64 versions) into memory. The
    address does not have to be aligned.
    v storeu_pd(v* ptr, m a)
  Store High: SSE2 storeh (pd)
    NOTE: Stores the high 64 bits into memory.
    v storeh_pd(v* ptr, m a)
  Unaligned Single Element Store: SSE2 store[l] (ss[SSE], sd, epi64)
    NOTE: Stores the low element into memory. The l version must be used for epi64
    and can be used for pd.
    v store_sd(d* ptr, md a)
  128-bit Masked Store: SSE2 maskmoveu (si128)
    NOTE: Stores bytes into memory if the corresponding byte in mask has its highest
    bit set.
    v maskmoveu_si128(mi a, mi mask, c* ptr)
  Masked Store: AVX maskstore (epi32-64[AVX2], ps/d)
    NOTE: Stores elements into memory if the corresponding mask element has its
    highest bit set.
    v maskstore_ps(f* p, mi m, m a)
  Unaligned Dual 128-bit Store: AVX storeu2 (m128, m128d, m128i)  SEQ!
    NOTE: Stores the two 128-bit halves of a 256-bit register into two memory
    locations.
    v storeu2_m128(f* p, f* q, m a)
      *p = a[127:0]; *q = a[255:128];
  Stream (non-temporal store): SSE2 stream
    NOTE: Stores with a non-temporal hint to minimize cache pollution. Memory must
    be aligned for the register versions!
    v stream_si32(i* p, i a)
    v stream_si128(mi* p, mi a)

Register Shuffle
  Move Single: SSE move (ss)
    NOTE: Moves the lowest element of b into the result; the upper bits are taken
    from a.
    m move_ss(m a, m b)
      b[31:0] | a[127:32]
  Move High to Low: SSE movehl (ps)
    NOTE: Moves the upper half of b into the lower half of the result; the upper
    half is taken from a.
    m movehl_ps(m a, m b)
  Shuffle: SSE2 shuffle (epi32, pd, ps[SSE])
    NOTE: Shuffles elements according to an immediate. The epi32 version selects all
    four 32-bit result elements from a; the ps/pd versions fill the lower part of
    the result from the first register and the upper part from the second.
    mi shuffle_epi32(mi a, ii i)
      dst[31:0]   := a[i[1:0]]
      dst[63:32]  := a[i[3:2]]
      dst[95:64]  := a[i[5:4]]
      dst[127:96] := a[i[7:6]]
    md shuffle_pd(md a, md b, ii i)
      dst[63:0]   := (i[0] == 0) ? a[63:0] : a[127:64]
      dst[127:64] := (i[1] == 0) ? b[63:0] : b[127:64]
  Shuffle High Words: SSE2 shufflehi (epi16)
    NOTE: Shuffles the four 16-bit words of the high half of a according to an
    immediate; the low half is copied unchanged. (shufflelo does the same for the
    low half.)
    mi shufflehi_epi16(mi a, ii i)
      dst[63:0]    := a[63:0]
      dst[79:64]   := (a >> (i[1:0]*16))[79:64]
      dst[95:80]   := (a >> (i[3:2]*16))[79:64]
      dst[111:96]  := (a >> (i[5:4]*16))[79:64]
      dst[127:112] := (a >> (i[7:6]*16))[79:64]
  Byte Shuffle: SSSE3 shuffle (epi8)
    NOTE: Each result byte is selected from a by the low 4 bits of the corresponding
    byte in b; if the highest bit of that byte in b is set, the result byte is
    zeroed.
    mi shuffle_epi8(mi a, mi b)
      FOR j := 0 to 15
        i := j*8
        IF b[i+7] == 1
          dst[i+7:i] := 0
        ELSE
          k := b[i+3:i]*8
          dst[i+7:i] := a[k+7:k]
  Interleave (Unpack): SSE2 unpackhi/lo (epi8-64, ps[SSE], pd)
    NOTE: Interleaves elements from the high/low half of two input registers into a
    target register.
    mi unpackhi_epi32(mi a, mi b)
      dst[31:0]   := a[95:64]
      dst[63:32]  := b[95:64]
      dst[95:64]  := a[127:96]
      dst[127:96] := b[127:96]
  Blend: SSE4.1 blend[v] (epi8-16, epi32[AVX2], ps/d)
    NOTE: Blends elements from two registers. blend uses an immediate mask; blendv
    uses a register mask (the highest bit in each element of the mask selects a or b
    for that element).
    mi blend_epi16(mi a, mi b, ii imm)
      FOR j := 0 to 7
        i := j*16
        dst[i+15:i] := imm[j] ? b[i+15:i] : a[i+15:i]
  Dual Register 128-bit Shuffle: AVX permute2f128 (ps/d, si256)
    NOTE: Takes 2 registers and shuffles 128-bit chunks into a target register. Each
    chunk can also be cleared by a bit in the mask. For AVX2, there is only the
    si256 version, which is renamed to permute2x128.
    md permute2f128_pd(md a, md b, ii i)
      S(s1, s2, control) {
        IF control[3]: RETURN 0
        CASE (control[1:0]) OF
          0: RETURN s1[127:0]
          1: RETURN s1[255:128]
          2: RETURN s2[127:0]
          3: RETURN s2[255:128]
      }
      dst[127:0]   := S(a, b, i[3:0])
      dst[255:128] := S(a, b, i[7:4])
  256-bit Insert: AVX insertf128 (ps/d, si256)
    NOTE: Inserts a 128-bit register into a 256-bit register at the position
    selected by the immediate. For AVX2, there is only the si256 version, renamed to
    inserti128.
    m insertf128_ps(m a, m128 b, ii i)
      dst[255:0] := a[255:0]
      sel := i*128
      dst[sel+127:sel] := b[127:0]
  256-bit Extract: AVX extractf128 (si256, ps/d)
    NOTE: Extracts 128 bits and returns them as a 128-bit SSE register. For AVX2,
    there is only the si256 version, renamed to extracti128.
    m128 extractf128_ps(m256 a, ii i)
      (a >> (i * 128))[127:0]
  Permute: AVX permute[var] (ps/d)
    NOTE: The 128-bit version is the same as the int shuffle, but for floats; the
    256-bit version performs the same operation on two 128-bit lanes. The normal
    version uses an immediate, the var version a register (the lowest bits in each
    element of b are used as the mask for that element of a).
    m permutevar_ps(m a, mi b)
  Shuffle 4x64-bit: AVX2 permute4x64 (epi64, pd)
    NOTE: Shuffles the four 64-bit elements of a across the full 256 bits. The
    shuffle mask is an immediate!
    md permute4x64_pd(md a, ii i)
  Shuffle 8x32-bit: AVX2 permutevar8x32 (epi32, ps)
    NOTE: Same as the 4x64-bit shuffle but with 8x32-bit elements; the shuffle is
    performed over the full 256 bits instead of two lanes of 128 bit. Only a [var]
    version taking a register instead of an immediate is available.
    m permutevar8x32_ps(m a, mi b)

Multiply-Add and Sum of Absolute Differences
  Multiply and Horizontal Add: SSE2 madd (epi16)
    NOTE: Multiplies 16-bit elements vertically, then horizontally adds adjacent
    pairs of the 32-bit intermediate products.
    mi madd_epi16(mi a, mi b)
      FOR j := 0 to 3
        i := j*32
        dst[i+31:i] := a[i+31:i+16]*b[i+31:i+16] + a[i+15:i]*b[i+15:i]
  Multiply Unsigned-Signed and Add: SSSE3 maddubs (epi16)
    NOTE: Vertically multiplies unsigned 8-bit ints of a with signed 8-bit ints of
    b, then horizontally adds adjacent pairs of products with signed saturation.
    mi maddubs_epi16(mi a, mi b)
  Sum of Absolute Differences: SSE2 sad (epu8)
    NOTE: Computes the absolute differences of packed unsigned 8-bit ints, then
    horizontally sums each group of 8 differences.
    mi sad_epu8(mi a, mi b)
  Multiple Sum of Absolute Differences: SSE4.1 mpsadbw (epu8)
    NOTE: Computes eight SADs of 4-byte groups; the offsets of the groups in a and b
    are selected by the immediate.
    mi mpsadbw_epu8(mi a, mi b, ii imm)
      a_offset := imm[2]*32
      b_offset := imm[1:0]*32
      FOR j := 0 to 7
        i := j*8
        k := a_offset + i
        l := b_offset
        dst[i*2+15:i*2] := ABS(a[k+7:k]-b[l+7:l]) + ABS(a[k+15:k+8]-b[l+15:l+8])
                         + ABS(a[k+23:k+16]-b[l+23:l+16]) + ABS(a[k+31:k+24]-b[l+31:l+24])

Fused Multiply and Add
  FM-Add: FMA f[n]madd (ps/d, ss/d)
    NOTE: Computes (a*b)+c for each element. The n version computes -(a*b)+c.
    m fmadd_ps(m a, m b, m c)
  FM-Sub: FMA f[n]msub (ps/d, ss/d)
    NOTE: Computes (a*b)-c for each element. The n version computes -(a*b)-c.
    m fmsub_ps(m a, m b, m c)
  FM-AddSub: FMA fmaddsub (ps/d)
    NOTE: Computes (a*b)-c for elements with even (0-based) index and (a*b)+c for
    elements with odd index.
    m fmaddsub_ps(m a, m b, m c)
  FM-SubAdd: FMA fmsubadd (ps/d)
    NOTE: Computes (a*b)+c for elements with even (0-based) index and (a*b)-c for
    elements with odd index.
    m fmsubadd_ps(m a, m b, m c)
Byte Zeroing
  Zero Register: SSE2 setzero (ps[SSE]/d, si128)
    NOTE: Returns a register with all bits zeroed.
    mi setzero_si128()
  Zero All Registers: AVX zeroall
    NOTE: Zeros all SSE/AVX registers.
    v zeroall()
  Zero High of All Registers: AVX zeroupper
    NOTE: Zeros bits [MAX:256] of all SSE/AVX registers.
    v zeroupper()
  Zero High Half: SSE2 move (epi64)
    NOTE: Moves the lower half of the input into the result and zeros the remaining
    bytes.
    mi move_epi64(mi a)
      a[63:0]

Byte Movement
  Byteshift Left/Right: SSE2 [b]sl/rli (si128, epi128[256])
    NOTE: Shifts the input register left/right by i bytes while shifting in zeros.
    There is no difference between the b and the non-b version.
    mi bsrli_si128(mi a, ii i)
  Byte Swap (<=64 bit): x86 _bswap[64] (i32, i64)
    NOTE: Swaps the bytes in the 32-bit or 64-bit int.
    i64 _bswap64(i64 a)

Broadcast (replicating one element in a register to fill the entire register)
  64-bit Broadcast: SSE3 movedup (pd)
    NOTE: Duplicates the lower half of a.
    md movedup_pd(md a)
      a[63:0] | (a[63:0] << 64)
  Broadcast: AVX2 broadcastX (epi8-64, ps/d, si256)
    NOTE: Broadcasts the lowest element into all slots of the result register. The
    si256 version broadcasts one 128-bit element! The letter X must be:
    epi8: b, epi16: w, epi32: d, epi64: q, ps: s, pd: d, si256: si128.
    mi broadcastb_epi8(mi a)
  32-bit Broadcast High/Low: SSE3 movel/hdup (ps)
    NOTE: Duplicates 32 bits within each 64-bit half of the register: the l version
    duplicates the low 32 bits of each half, the h version the high 32 bits.
    m moveldup_ps(m a)

Special Algorithms
  AES KeyGen Assist: AES aeskeygenassist (si128)
    NOTE: Assists in expanding the AES cipher key by computing steps towards
    generating a round key for encryption, using data from a and an 8-bit round
    constant specified in i.
    mi aeskeygenassist_si128(mi a, ii i)
  AES Inverse Mix Columns: AES aesimc (si128)
    NOTE: Performs the inverse mix columns transformation on the input.
    mi aesimc_si128(mi a)
  AES Encrypt: AES aesenc[last] (si128)
    NOTE: Performs one round of an AES encryption flow on the state in a using the
    round key. The last version performs the last round.
    mi aesenc_si128(mi a, mi key)
  AES Decrypt: AES aesdec[last] (si128)
    NOTE: Performs one round of an AES decryption flow on the state in a using the
    round key. The last version performs the last round.
    mi aesdec_si128(mi a, mi key)
  Cyclic Redundancy Check (CRC32, <=64 bit): SSE4.2 crc32 (u8-u64)
    NOTE: Starting with the initial value in crc, accumulates a CRC32 value for the
    unsigned X-bit integer v and stores the result in dst.
    u crc32_u32(u crc, u v)

Transactional Memory
  Commit Transaction: TSX _xend
    NOTE: Specifies the end of an RTM code region. If this corresponds to the
    outermost scope, the logical processor will attempt to commit the logical
    processor state atomically. If the commit fails, the logical processor will
    perform an RTM abort.
    v _xend()
Comparisons
  Float Compare: SSE2 cmp[n]Z (ps/d, ss/d)
    NOTE: Z can be one of: ge/le/lt/gt/eq. The n version is a not version, e.g., neq
    computes not equal. Elements that compare true receive 1s in all bits, otherwise
    0s.
    md cmpeq_pd(md a, md b)
  Float Compare NaN: SSE2 cmp[un]ord (ps/d, ss/d)
    NOTE: cmpord sets the result bits to 1 if both elements are not NaN, otherwise
    0. cmpunord sets the bits if at least one element is NaN.
    md cmpord_pd(md a, md b)
  Compare 256: AVX cmp (ps/d, ss/d)
    NOTE: Compares packed or single elements based on imm. Possible values are 0-31
    (check the documentation).
    md cmp_pd(md a, md b, ii imm)
  Single Float Compare: SSE2 [u]comiZ (ss/d)
    NOTE: Z can be one of: eq/ge/gt/le/lt/neq. Returns a single int that is either 1
    or 0. The u version does not signal an exception for QNaNs and is not available
    for 256 bits!
    i comieq_sd(md a, md b)
  Int Compare: SSE2 cmpZ (epi8-32, epi64[SSE4.1])
    NOTE: Z can be one of: lt/gt/eq. Elements that compare true receive 1s in all
    bits, otherwise 0s.
    mi cmpeq_epi8(mi a, mi b)

Miscellaneous
  Pause: SSE2 pause
    NOTE: Provides a hint to the processor that the code sequence is a spin-wait
    loop. This can help improve the performance and power consumption of spin-wait
    loops.
    v pause()
  Monitor: SSE3 monitor
    NOTE: Arms the address monitoring hardware using the address specified in ptr. A
    store to an address within the specified address range triggers the monitoring
    hardware. Specify optional extensions in e and optional hints in h.
    v monitor(v* ptr, u e, u h)
  Monitor Wait: SSE3 mwait
    NOTE: Hints to the processor that it can enter an implementation-dependent
    optimized state while waiting for an event or a store operation to the address
    range specified by MONITOR.
    v mwait(u ext, u hints)
  Get MXCSR Register: SSE getcsr
    NOTE: Gets the content of the MXCSR register.
    u getcsr()
  Set MXCSR Register: SSE setcsr
    NOTE: Sets the MXCSR register with a 32-bit int.
    v setcsr(u a)

Transactional Memory
  Abort Transaction: TSX _xabort
    NOTE: Forces an RTM abort. The EAX register is updated to reflect that an XABORT
    instruction caused the abort, and the imm parameter is provided in bits [31:24]
    of EAX. Following an RTM abort, the logical processor resumes execution at the
    fallback address computed through the outermost XBEGIN instruction.
    v _xabort(ii imm)
  Begin Transaction: TSX _xbegin
    NOTE: Specifies the start of an RTM code region. If the logical processor was
    not already in transactional execution, it transitions into transactional
    execution. On an RTM abort, the logical processor discards all architectural
    register and memory updates performed during the RTM execution, restores
    architectural state, and starts execution at the fallback address computed from
    the outermost XBEGIN instruction.
    u _xbegin()

Bit Compare (perform a bitwise operation and check whether all bits are 0s afterwards)
  Test: SSE4.1/AVX testc/z (si128[SSE4.1], si256, ps/d)
    NOTE: Computes the bitwise AND of a and b and sets ZF to 1 if the result is
    zero, otherwise sets ZF to 0. Also computes the bitwise AND of NOT a and b and
    sets CF to 1 if the result is zero, otherwise sets CF to 0. The c version
    returns CF and the z version ZF. For 128 bit, there is also test_all_zeros,
    which does the same as testz_si128.
    i testc_si128(mi a, mi b)
      ZF := ((a & b) == 0)
      CF := ((~a & b) == 0)
      RETURN CF
  Test Mix Ones And Zeros: SSE4.1/AVX test_ncz (si128[SSE4.1], si256, ps/d)
    NOTE: Computes ZF and CF as above and returns !CF && !ZF. For 128 bit, there is
    also test_mix_ones_zeros, which does the same.
    i test_mix_ones_zeros(mi a, mi b)
      ZF := ((a & b) == 0)
      CF := ((~a & b) == 0)
      RETURN !CF && !ZF
  Test All Ones: SSE4.1 test_all_ones
    NOTE: Returns true iff all bits in a are set. Needs two instructions and may be
    slower than native implementations.
    i test_all_ones(mi a)
      (~a) == 0

String Compare
  Each operation has an i and an e version:
    cmpistrX(mi a, mi b, ii i)
    cmpestrX(mi a, i la, mi b, i lb, ii i)
  The i version compares all elements; the e version compares only up to the
  specified lengths la and lb. The immediate value i for all these comparisons
  consists of bit flags. Exactly one flag per group must be present:

  Data type specifier:
    _SIDD_UBYTE_OPS   unsigned 8-bit chars
    _SIDD_UWORD_OPS   unsigned 16-bit chars
    _SIDD_SBYTE_OPS   signed 8-bit chars
    _SIDD_SWORD_OPS   signed 16-bit chars

  Compare mode specifier:
    _SIDD_CMP_EQUAL_ANY      For each character c in a, determine iff any character
                             in b is equal to c.
    _SIDD_CMP_RANGES         For each character c in a, determine whether
                             b0 <= c <= b1 or b2 <= c <= b3 ...
    _SIDD_CMP_EQUAL_EACH     Check for string equality of a and b.
    _SIDD_CMP_EQUAL_ORDERED  Search substring b in a. Each byte where b begins in a
                             is treated as a match.

  Polarity specifier:
    _SIDD_POSITIVE_POLARITY         Match is indicated by a 1-bit.
    _SIDD_NEGATIVE_POLARITY         Negation of the resulting bitmask.
    _SIDD_MASKED_NEGATIVE_POLARITY  Negation of the resulting bitmask, except for
                                    bits that have an index larger than the size of
                                    a or b.

  String Compare Mask: SSE4.2 cmpi/estrm
    NOTE: Compares strings a and b and returns the comparison mask. If
    _SIDD_BIT_MASK is used, the resulting mask is a bit mask. If _SIDD_UNIT_MASK is
    used, the result is a byte mask which has ones in all bits of the matching
    bytes.
    mi cmpistrm(mi a, mi b, ii i)
    mi cmpestrm(mi a, i la, mi b, i lb, ii i)
  String Compare Index: SSE4.2 cmpi/estri
    NOTE: Compares strings in a and b and returns the index of the first byte that
    matches. Otherwise, Maxsize is returned (either 8 or 16 depending on the data
    type).
    i cmpistri(mi a, mi b, ii i)
    i cmpestri(mi a, i la, mi b, i lb, ii i)
  String Compare: SSE4.2 cmpi/estrc
    NOTE: Compares strings in a and b and returns true iff the resulting mask is not
    zero, i.e., if there was a match.
    i cmpistrc(mi a, mi b, ii i)
    i cmpestrc(mi a, i la, mi b, i lb, ii i)
  String Compare with Nullcheck: SSE4.2 cmpi/estra
    NOTE: Compares strings in a and b and returns true iff the resulting mask is
    zero and there is no null character in b.
    i cmpistra(mi a, mi b, ii i)
    i cmpestra(mi a, i la, mi b, i lb, ii i)
  String Nullcheck: SSE4.2 cmpi/estrs/z
    NOTE: Compares two strings a and b and returns whether a (s version) or b (z
    version) contains a null character.
    i cmpistrs(mi a, mi b, ii i)
    i cmpestrs(mi a, i la, mi b, i lb, ii i)