4Th Generation Intel® Core™ Processors
Total Page:16
File Type:pdf, Size:1020Kb
4th Generation Intel® Core™ Processors Instruction Set Extensions and Developer Product Support ) April 2013 © 2013, Intel Corporation. All righ ts reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. 1 and/or other countries. *Other names and brands may be claimed as the property of others. Entdecken Sie weitere interessante Artikel und News zum Thema auf all-electronics.de! Hier klicken & informieren! Intel “Tick-Tock” Roadmap – Part I 2nd Generation 3nd Generation Intel ® Core ™ Micro Architecture Intel® Core™ Intel® Core™ MicroArchitecture Codename “Nehalem” Micro Micro Architecture Architecture Merom Penryn Nehalem Westmere Sandy Bridge Ivy Bridge NEW NEW NEW NEW NEW NEW Micro architecture Process Technology Micro architecture Process Technology Micro architecture Process Technology 65nm 45nm 45nm 32nm 32nm 22nm TOCK TICK TOCK TICK TOCK TICK 2006 2007 2008 2009 2011 2012 SSSE-3 SSE4.1 SSE4.2 AES AVX RDRAND etc © 2013, Intel Corporation. All righ ts reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. Intel “Tick-Tock” Roadmap – Part II 4nth Generation Intel® Core™ Micro TBD TBD TBD TBD TBD Architecture TICK + TOCK = SHRINK + INNOVATE Haswell Broadwell TBD TBD TBD TBD NEW NEW NEW NEW NEW NEW Micro architecture Process Technology Micro architecture Process Technology Micro architecture Process Technology 22nm 14nm 14nm 10nm 10nm 7nm TICK TOCK TICK TOCK TICK TOCK 2013 >= 2014 ??? ??? ??? ??? AVX-2 © 2013, Intel Corporation. All righ ts reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. Instruction Set Architecture (ISA) Extensions 199x MMX, CMOV, Multiple new instruction sets added to the initial 32bit instruction PAUSE, XCHG, set of the Intel® 386 processor … 1999 Intel® SSE 70 new instructions for 128-bit single-precision FP support 2001 Intel® SSE2 144 new instructions adding 128-bit integer and double-precision FP support 2004 Intel® SSE3 13 new 128-bit DSP-oriented math instructions and thread synchronization instructions 2006 Intel SSSE3 16 new 128-bit instructions including fixed-point multiply and horizontal instructions 2007 Intel® SSE4.1 47 new instructions improving media, imaging and 3D workloads 2008 Intel® SSE4.2 7 new instructions improving text processing and CRC 2010 Intel® AES-NI 7 new instructions to speedup AES 2011 Intel® AVX 256-bit FP support, non-destructive (3-operand) 2012 Ivy Bridge NI RNG, 16 Bit FP 2013 Haswell NI AVX-2, TSX, FMA, Gather, Bit NI A long history of ISA Extensions ! © 2013, Intel Corporation. All righ ts reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. 4 and/or other countries. *Other names and brands may be claimed as the property of others. New Instructions in Haswell Group Description Count *** SIMD Integer Adding vector integer operations to 256-bit Instructions promoted to 256 bits Gather Load elementsusing a vector of indices, vectorization 170 / 124 AVX2 enabler Shuffling / Data Blend, element shift and permute instructions Rearrangement FMA Fused Multiply-Add operation forms ( FMA-3) 96/ 60 Bit Manipulation and Improving performance of bit stream manipulation and 15 / 15 Cryptography decode, large integer arithmetic and hashes TSX=RTM+HLE Transactional Memory 4/4 Others MOVBE:Load and Store of Big Endian forms 2 / 2 INVPCID: Invalidate processor context ID * Total instructions / different mnemonics © 2013, Intel Corporation. All righ ts reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. Haswell Execution Unit Overview Unified Reservation Station Port 4 Port Port 2 Port Port 1 Port Port 0 Port Port 7 Port Port 6 Port Port 5 Port Port 3 Port Integer Integer Load & Store Integer Integer Store ALU & Shift ALU & LEA Store Address Data ALU & LEA ALU & Shift Address FMA FMA FP Mult 2xFMA Vector FP Multiply FP Add • Doubles peak FLOPs Shuffle Vector Int Vector Int • Two FP multiplies benefits Vector Int Multiply ALU legacy ALU Vector Vector Vector Logicals Logicals Logicals 4th ALU Branch Branch • Great for integer workloads • Frees Port0 & 1 for vector Divide Vector Shifts New Branch Unit New AGU for Stores • Reduces Port0 Conflicts • Leaves Port 2 & 3 open • 2nd EU for high branch code for Loads Intel ® Microarchitecture (Haswell) © 2013, Intel Corporation. All righ ts reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. Core Cache Size/Latency/Bandwidth Metric Nehalem Sandy Bridge Haswell L1 Instruction Cache 32K, 4-way 32K, 8-way 32K, 8-way L1 Data Cache 32K, 8-way 32K, 8-way 32K, 8-way Fastest Load-to-use 4 cycles 4 cycles 4 cycles 32 Bytes/cycle Load bandwidth 16 Bytes/cycle 64 Bytes/cycle (banked) Store bandwidth 16 Bytes/cycle 16 Bytes/cycle 32 Bytes/cycle L2 Unified Cache 256K, 8-way 256K, 8-way 256K, 8-way Fastest load-to-use 10 cycles 11 cycles 11 cycles Bandwidth to L1 32 Bytes/cycle 32 Bytes/cycle 64 Bytes/cycle 4K: 128, 4-way 4K: 128, 4-way 4K: 128, 4-way L1 Instruction TLB 2M/4M: 7/thread 2M/4M: 8/thread 2M/4M: 8/thread 4K: 64, 4-way 4K: 64, 4-way 4K: 64, 4-way L1 Data TLB 2M/4M: 32, 4-way 2M/4M: 32, 4-way 2M/4M: 32, 4-way 1G: fractured 1G: 4, 4-way 1G: 4, 4-way 4K+2M shared: L2 Unified TLB 4K: 512, 4-way 4K: 512, 4-way 1024, 8-way All caches use 64-byte lines © 2013, Intel Corporation. All righ ts reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. Intel ® Microarchitecture (Haswell); Intel ® Microarchitecture (Sandy Bridge); Intel ® Microarchitecture (Nehalem) FMA & Peak FLOPS Instruction Set SP FLOPs DP FLOPs • 2 new FMA units provide 2x per cycle per cycle peak FLOPs/cycle of previous Nehalem SSE (128-bits) 8 4 Sandy generation AVX (256-bits) 16 8 Bridge Haswell AVX2 & FMA32 3216 16 • 2X cache bandwidth to feed wide vector units – 32-byte load/store for L1 70 – 2x L2 bandwidth 60 Haswell 50 • 5-cycle FMA latency same as an 40 FP multiply 30 Sandy Bridge – Latency of MUL + ADD down from 20 8 cycles to 5 Merom 10 Peak Peak Cache bytes/clock Banias 0 0 5 10 15 20 25 30 35 Peak FLOPS/clock All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice. © 2013, Intel Corporation. All righ ts reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. Intel ® Microarchitecture (Haswell); Intel ® Microarchitecture (Sandy Bridge); Intel ® Microarchitecture (Meron); Intel ® Microarchitecture (Banias) Intel ® Advanced Vector Extensions (Intel® AVX) Extends all 16 XMM registers to 256bits “YMM” registers; works on either – The whole 256-bits – for FP instructions – The lower 128-bits (like existing Intel ® SSE instructions) • A drop-in replacement for all existing scalar/128-bit SSE instructions • The upper part of the register is zeroed out Provides non-destructive functionality – Three-operand instructions with two sources and a different destination 128 bits (Intel® SSE) Sandy Bridge (code name) architecture introduced “Intel® AVX” instructions set XMM0 • 360 instructions (245 mnemonics) YMM0 256 bits (Intel® AVX) YMM regs are 256 bits (only for FP), with low 128 bits corresponding to existing XMM registers © 2013, Intel Corporation. All righ ts reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. 9 and/or other countries. *Other names and brands may be claimed as the property of others. Intel ® AVX2: 256-bit Integer Vector • Extends Intel® AVX to cover integer operations • Uses same AVX (256-bit) register set • Nearly all 128-bit integer vector instructions are ‘promoted’ to 256 – Including Intel ® SSE2, Intel ® SSSE3, Intel ® SSE4 – Exceptions: GPR moves (MOVD/Q) ; Insert and Extracts <32b, Specials (STTNI instructions, AES, PCLMULQDQ) • New 256b Integer vector operations (not present in Intel ® SSE) – Cross-lane element permutes – Element variable shifts – Gather • Haswell doubles the cache bandwidth • Two 256-bit loads per cycle • Helps both Intel® AVX2 and legacy Intel® AVX performance Intel® AVX2 completes the 256-bit extensions started with Intel® AVX: provides 256-bit integer, cross-lane permutes, gather… © 2013, Intel Corporation. All righ ts reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. 10 and/or other countries. *Other names and brands may be claimed as the property of others. Integer Instructions Promoted to 256 VMOVNTDQA VPHADDSW VPSUBQ VPMOVZXDQ VPMULHW VPABSB VPHADDW VPSUBSB VPMOVZXWD VPMULLD VPABSD VPHSUBD VPSUBSW VPMOVZXWQ VPMULLW VPABSW VPHSUBSW VPSUBUSB VPSHUFB VPMULUDQ VPADDB VPHSUBW VPSUBUSW VPSHUFD VPSADBW VPADDD VPMAXSB VPSUBW VPSHUFHW VPSLLD VPADDQ VPMAXSD VPACKSSDW VPSHUFLW VPSLLDQ VPADDSB VPMAXSW VPACKSSWB VPUNPCKHBW VPSLLQ VPADDSW VPMAXUB VPACKUSDW VPUNPCKHDQ VPSLLW VPADDUSB VPMAXUD VPACKUSWB VPUNPCKHQDQ VPSRAD VPADDUSW VPMAXUW VPALIGNR VPUNPCKHWD VPSRAW VPADDW VPMINSB VPBLENDVB VPUNPCKLBW VPSRLD VPAVGB VPMINSD VPBLENDW VPUNPCKLDQ VPSRLDQ VPAVGW VPMINSW VPMOVSXBD VPUNPCKLQDQ VPSRLQ VPCMPEQB VPMINUB VPMOVSXBQ VPUNPCKLWD VPSRLW VPCMPEQD VPMINUD VPMOVSXBW VMPSADBW VPAND VPCMPEQQ VPMINUW VPMOVSXDQ VPCMPGTQ VPANDN VPCMPEQW VPSIGNB VPMOVSXWD VPMADDUBSW VPOR VPCMPGTB VPSIGND VPMOVSXWQ VPMADDWD VPXOR VPCMPGTD VPSIGNW VPMOVZXBD VPMULDQ VPMOVMSKB VPCMPGTW VPSUBB VPMOVZXBQ VPMULHRSW VPHADDD VPSUBD VPMOVZXBW VPMULHUW © 2013, Intel Corporation.