
ECE 4750 Computer Architecture: Intel Skylake

Christopher Batten
School of Electrical and Computer Engineering, Cornell University
http://www.csl.cornell.edu/courses/ece4750

Intel Skylake • App Req vs Tech Constraints • Skylake System Overview • Skylake Processor • Skylake Memory • Skylake Network • Skylake System Manager

Intel Tick/Tock Product Releases

[Figure: Intel tick/tock product releases, 2009-2016, ending with Kaby Lake (14nm, "Optimizations").]

Application Requirements: Low-Power, High-Performance, Scalable Design
Technology Constraints: Power Consumption

Intel Transistor Leadership

Technology Constraints: New Devices

[Figure: Intel process-node timeline, 2003-2011 — 90 nm (invented SiGe strained silicon), 65 nm (2nd-gen SiGe strained silicon), 45 nm (invented gate-last high-k metal gate), 32 nm (2nd-gen gate-last high-k metal gate), 22 nm (first to implement tri-gate).]

Key device innovations: Strained Silicon, High-k Metal Gate, Tri-Gate

Traditional Planar Transistors vs 22 nm Tri-Gate Transistors

[Figure: cross-sections of a traditional planar transistor and a 22 nm tri-gate transistor — source, drain, gate, high-k dielectric, oxide, silicon substrate.]

Traditional 2-D planar transistors form a conducting channel in the silicon region under the gate electrode when in the "on" state. 3-D tri-gate transistors form conducting channels on three sides of a vertical fin structure, providing "fully depleted" operation. Transistors have now entered the third dimension!

[Figure: die photos comparing 32 nm planar transistors with 22 nm tri-gate transistors.]

Intel Skylake • App Req vs Tech Constraints • Skylake System Overview • Skylake Processor • Skylake Memory • Skylake Network • Skylake System Manager

Skylake System

SKYLAKE SPEEDSHIFTS TO NEXT GEAR
Hardware Power Management, Wider CPU Improve Efficiency

By Linley Gwennap (September 7, 2015) ......

In one of its largest product introductions ever, Intel rolled out 48 new PC processors based on its new Skylake CPU design. Skylake uses the same 14nm process as Broadwell but offers several improvements. Perhaps the most significant is a move to hardware power management, a feature Intel calls SpeedShift. The sixth-generation CPU has instruction-set extensions dubbed SGX and MPX that work to improve security, and AVX512 doubles the processing speed for vector operations. For the first time, Intel is developing different versions of the new CPU, omitting AVX512 for PCs but including it for servers. The latter version of Skylake will appear in Xeon E5 and E7 server processors in 2016.

Owing to delays in the 14nm ramp, only the Y-Series Broadwell launched in 2014; mainstream notebook models appeared in January of this year, and a few high-end desktop models rolled out a few months ago. Now that the 14nm process is in full production, Skylake is launching with a broad set of notebook and desktop chips, including dual- and quad-core versions. This debut marks the first update in two years for mainstream desktop processors, which entirely skipped the Broadwell generation. For systems that are moving up from the 22nm Haswell, Skylake provides significant gains in either clock speed or power.

Compared with Broadwell, however, the gains are minimal; these products see little or no clock-speed increase. We estimate the enhancements to the CPU design yield about 5% more performance per clock on most PC applications. Skylake's advances are mainly outside the CPU: the processor includes a new image-processing engine (ISP) and adds DDR4 and PCI Express 3.0 interfaces, as Figure 1 shows. Revisions to the GPU and video engine boost performance per watt. Most of the newly announced products are available immediately, but the Iris models will ship in the fourth quarter, along with new vPro products. By early 2016, the architecture will extend all the way to the low-end brand as well as the high-end Iris Pro line.

Improving the PC

Intel's vision of the future includes PCs with no wires. The company's WiDi technology, which allows a PC to connect wirelessly to a TV or monitor via Wi-Fi, is starting to gain users. A new technology called WiGig delivers up to 7Gbps using the 60GHz band (see MPR 5/19/14, "Millimeter Wave Seeks Best Home"); some Skylake notebooks will connect to a docking station using this technology, eliminating a wired connection. The docking station then links to a monitor, keyboard, and other peripherals.

[Figure 1. Intel Skylake processor: display and PCIe 3.0 graphics card attach to the north bridge, which includes a memory controller with two DDR3/DDR4 channels; a DMI/OPI link connects to the 100 Series south bridge with USB 3.1, Wi-Fi + Bluetooth, Ethernet, Thunderbolt, and WiGig; four Skylake CPU cores each have an L3 cache slice; the chip also integrates a 64MB or 128MB eDRAM option, the HD 500 graphics unit, an ISP, and a video engine. In addition to CPU and graphics improvements, the design includes hardware power management, a new ISP, and DDR4 capability.]

© The Linley Group • Microprocessor Report, September 2015

Intel Skylake i7-6700K, Fall 2015
• ~1.7B transistors in ~122 mm² in a 14nm tri-gate process
• Four out-of-order cores, each with two SMT threads, running at 4.0-4.2 GHz
• Three-level cache hierarchy with last-level on-chip cache capacity of 8MB
• Max TDP of 91W
• 2 DDR4 DRAM memory controllers, 34.1 GB/s max memory bandwidth
• Integrated 3D graphics processor running at 350 MHz to 1.15 GHz
• Pipelined bus on-chip network connecting cores, last-level cache banks, and GPU

Intel Skylake: Block Diagram

Design-Time Modularity to Meet Scalability Application Requirement: Scalable Design in Haswell
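The 34.1 GB/s maximum memory bandwidth quoted in the bullets above can be back-checked with a quick calculation, assuming two channels of DDR4-2133 (each channel 64 bits, i.e. 8 bytes, wide — standard for DDR4, though the specific speed grade is our assumption):

```python
# Sanity-check the "34.1 GB/s max memory bandwidth" slide bullet,
# assuming 2 channels of DDR4-2133 with a 64-bit (8-byte) data path each.

channels = 2
bytes_per_transfer = 8       # 64-bit DDR channel
transfers_per_sec = 2133e6   # DDR4-2133: 2133 MT/s

peak_bw = channels * bytes_per_transfer * transfers_per_sec
print(peak_bw / 1e9)  # GB/s
assert round(peak_bw / 1e9, 1) == 34.1  # matches the slide's figure
```

The exact match suggests the slide's number is simply the theoretical peak of two DDR4-2133 channels, not a measured figure.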

Intel Haswell i7-4770K: 85W @ 3.5 GHz vs Intel Haswell 3560Y: 6W @ 880 MHz

Intel Skylake • App Req vs Tech Constraints • Skylake System Overview • Skylake Processor • Skylake Memory • Skylake Network • Skylake System Manager

Growth in Instruction Sets Over Time

[OCR-damaged excerpt from a paper on x86 ISA aging. The recoverable points: the x86 instruction set has grown steadily in both the number of instructions and the average instruction length as each ISA extension is added; new instructions tend to accumulate extra fields to differentiate them from old ones; and because instruction length varies (1-15 bytes), an instruction's length must be determined before the next instruction can even be identified, imposing undesirable sequentiality on decoding. Figures in the original show the cumulative number of instructions per x86 ISA extension and the growth in average instruction size over time.]

Three key additions in Skylake:
• AVX512: 512-bit SIMD Extensions
• SGX: Software Guard Extensions
• MPX: Memory Protection Extensions

AVX512: 512-bit SIMD Extensions
SGX: Software Guard Extensions
MPX: Memory Protection Extensions

Processor Block Diagram

From the Intel 64 and IA-32 Architectures Optimization Reference Manual, the Skylake pipeline divides into:

• In-order (IO) Fetch (~5 cycles)
• In-order (IO) Decode
• OOO Issue and Late Commit (~14 cycles)
• Integer/FP Functional Units with OOO Writeback
• Load/Store Execution

IO Fetch and Decode

• Complex CISC instructions are broken into much simpler, almost RISC-like, micro-ops (uops)
• Predecoder handles the variable-length encoding (1-15B), finds instruction boundaries, and inserts instructions into the instruction queue
• Parallel decoders transform x86 instructions into uops; each cycle they can decode either five "simple" x86 instructions (each decoding into 1 uop) or one "complex" x86 instruction (decoding into 1-4 fused uops)
• Very complex instructions fall back to a microcoded control unit
• uop cache acts as a kind of L0 instruction cache that holds decoded uops and enables much of the front-end to be shut down to save power
• uop decode queue can be used as a special loop cache

IO Fetch and Decode
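The predecoder's job described above can be sketched in a few lines. This is a toy model, not real x86 decoding: the opcode-to-length table is invented for illustration, and real x86 lengths depend on prefixes, ModRM, SIB, and immediates. The point it demonstrates is the sequential dependence: instruction N+1 cannot be located until instruction N's length is known.

```python
# Toy predecoder sketch (NOT real x86): scan a byte stream, determine
# each instruction's length, and emit boundaries into an instruction
# queue. The opcode->length table below is hypothetical.

LENGTH_TABLE = {0x90: 1, 0xB8: 5, 0xE9: 5}  # invented for illustration

def predecode(byte_stream):
    """Return (start, length) boundaries for each instruction."""
    queue = []
    pc = 0
    while pc < len(byte_stream):
        opcode = byte_stream[pc]
        length = LENGTH_TABLE.get(opcode, 1)  # default: 1-byte op
        queue.append((pc, length))
        pc += length  # next boundary depends on this length -> sequential
    return queue

# A 1-byte op, a 5-byte op, then another 1-byte op:
print(predecode(bytes([0x90, 0xB8, 1, 2, 3, 4, 0x90])))
# [(0, 1), (1, 5), (6, 1)]
```

Hardware predecoders speculate on lengths in parallel and discard wrong guesses; a fixed-length RISC ISA avoids the problem entirely.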

• Skylake's branch predictor has changed, but little is known about it; more is known about the Sandy Bridge predictor (two generations old)
• Sandy Bridge predictor has a misprediction latency of ~15 cycles for branches in the uop cache
• Sandy Bridge predictor uses a "two-level predictor with 32b global history buffer and a history pattern table of unknown size"
• Sandy Bridge uses a BTB for both the L1 I$ and the uop cache; "conditional jumps are less efficient if there are more than 3 branch instructions per 16 bytes of code"
• Sandy Bridge uses a return address stack predictor with 16 entries

OOO Issue and Late Commit

• Integer/FP Registers are the physical registers used for register renaming
• Load Buffer and Store Buffer are the finished load/store buffers
• Branch Order Buffer is used to store snapshots of the rename tables to recover from mispredicted branches
• Unified Scheduler is a centralized issue queue
• Can rename and insert into the IQ up to six fused uops per cycle; can commit up to four fused uops per cycle; since a fused uop can encode two uops, peak commit throughput is eight uops/cycle

Functional Units with OOO Writeback

• Can "issue" (dispatch) up to eight uops per cycle to eight "dispatch ports"; a dispatch port is just several arithmetic units collected into a functional unit
• "Every cycle, the 8 oldest, non-conflicting uops that are ready for execution are sent from the unified scheduler to the dispatch ports."

• "Execution units are arranged into stacks: integer, SIMD integer, and floating point. ... Each stack has different data types, different result forwarding networks, and potentially different registers."
• "Note that the divider on port 0 is not fully pipelined and is shared by all types of uops (integer, SIMD integer, and floating point)"

...six micro-ops per cycle, a 50% improvement in the front-end bandwidth. Whereas Haswell added two function units and two dispatch ports to the OOO section (see MPR 9/24/12, "Intel's Haswell Cuts Core Power"), Skylake's execution resources remain much the same. The scheduler can dispatch eight micro-ops per cycle to an array of four integer ALUs (two with branch units), two floating-point units, and two load/store units. The CPU can simultaneously execute instructions from two different threads using Intel's Hyper-Threading technology.

The wider front end floods the OOO core with more micro-ops per cycle. To compensate, Skylake increases the reorder-buffer (ROB) size from 192 to 224 entries, providing a larger window in which the CPU can seek out parallel instructions.

[Figure 2. Skylake microarchitecture (simplified): branch predictor, I-TLB, and 32KB instruction cache feed a predecoder (>20 bytes, 5 instr/cycle); one complex and four simple decoders produce up to 5 uops/cycle; a 1.5K-entry micro-op (L0) cache supplies 2x 4 uops; 6 uops/cycle enter the unified scheduler (8 ports), which dispatches 8 uops/cycle to load, load/store, ALU/branch, ALU/divide, ALU/branch, integer ALU, FP/AVX multiply, and FP/AVX add units backed by 64-bit integer and FP+AVX register files; D-TLB and 32KB data cache connect to the L2 cache. The new CPU widens the front end with a fifth decoder and adds bandwidth to support it. Red indicates new features in Skylake. *256 bits wide in PC version.]
To account for extra store instructions in this window, Intel enlarged the store buffer by 14 entries, as Table 1 shows. Skylake also increases the number of integer registers available for renaming.

As in Haswell, Skylake implements a unified scheduler that can emit eight micro-ops per cycle to the function units, which are grouped into eight ports (not shown). The total number of queue entries is considerably larger than in Haswell, giving the scheduler more instructions to choose from when some instructions are stalled. Skylake splits the unified allocation queue in Haswell on a per-thread basis, returning to the model that Intel used in Sandy Bridge and Nehalem. The total number of entries is more than double Haswell's. Micro-ops are now retired at four per thread, or a total of eight per cycle.

The new CPU increases the load/store bandwidth to 128 bytes per cycle. This bandwidth requires using the AVX512 instructions, which support 64-byte (512-bit) loads and stores; executing two AVX512 loads or stores in one cycle achieves the peak bandwidth. Since the PC version of Skylake omits AVX512, it remains limited to 64 bytes per cycle, as in Haswell. Server processors, however, will benefit from the greater memory bandwidth.

Skylake uses the same pipeline as its predecessors and thus will operate at about the same clock speed. Some minor tuning raises the clock speed slightly at a given power level, although only a few products actually realize this increase. The changes to the microarchitecture also boost instructions per cycle (IPC), but only by a small amount. The effect varies depending on the application, but we expect most PC programs to gain about 5% relative to Broadwell. Server processors will see much greater improvement on applications that take advantage of the AVX512 extensions.

Faster GPU and System I/O

Compared with the Broadwell GPU, Skylake's new Intel HD 500 graphics unit improves performance by 20-40%, depending on the version. This comparison is based on 3DMark, so results on other benchmarks will vary. The new GPU supports leading-edge APIs such as DirectX 12, OpenCL 2.0, and OpenGL 4.4, and it will support Vulkan once that standard is ratified. It includes a new SFC (scalar and format conversion) engine and a Quick Sync video-transcoding engine. (We'll dive into more details of the new GPU and multimedia engines in a future article.)

As in previous processors, Skylake implements a ring that connects the CPU cores with the L3 cache, GPU, and north bridge (shown in Figure 1). The new ring design...

Size of Data Structures

                  Nehalem     Sandy Bridge  Haswell     Skylake
x86 Decoders      4 instr     4 instr       4 instr     5 instr
Max Instr/Cycle   4 ops       6 ops         8 ops       8 ops
Reorder Buffer    128 ops     168 ops       192 ops     224 ops
Load Buffer       48 loads    64 loads      72 loads    72 loads
Store Buffer      32 stores   36 stores     42 stores   56 stores
Scheduler         36 entries  54 entries    60 entries  97 entries
Integer Rename    In ROB      160 regs      168 regs    180 regs
FP Rename         In ROB      144 regs      168 regs    168 regs
Allocation Queue  28/thread   28/thread     56 total    64/thread

Table 1. CPU microarchitecture comparison. To improve performance, Skylake increases the number of micro-ops and the number of stores that can be in flight at any given time. (Source: Intel)

• Reorder Buffer is the number of entries in the reorder buffer (ROB)
• Load Buffer is the number of entries in the finished load buffer (FLB)
• Store Buffer is the number of entries in the finished store buffer (FSB)
• Scheduler is the number of entries in the centralized issue queue (IQ)
• Integer/FP Rename is the number of physical integer and floating point registers
• Allocation Queue is a decoupling queue between front-end and back-end
• Nehalem to Sandy Bridge transitioned from value- to pointer-based register renaming

Macro-Op Fusion / Micro-Op Fusion

Macro-Op Fusion

Combines a compare x86 instruction and a jump x86 instruction into a single micro-op for the entire pipeline:

cmp eax, ecx
jl loop

• Only works for specific versions of comparison and jump instructions
• There can be no other instructions between the compare and jump instructions
• Both instructions must be in a single 16-byte aligned block

Micro-Op Fusion

Combines two micro-ops (load plus integer op, split stores) so that they take only a single ROB and IQ entry, but the fused micro-op is split such that two micro-ops are issued to two different execution units:

mov [esi], eax ; 1 fused uop
add eax, [esi] ; 1 fused uop
add [esi], eax ; 2 uops + 1 fused

• Decoding becomes more efficient, because instructions that generate one fused uop can use the simple decoders
• Reduces pressure on register renaming and commit pipeline stages
• Effective capacity of IQ and ROB is increased since a fused uop uses only one entry

Move Elimination / Zero Idiom
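The macro-op fusion pairing rules above can be illustrated with a toy sketch. This is not Intel's actual fusion logic: the sets of fusible mnemonics are simplified stand-ins, and real hardware checks more conditions, but the sketch captures two of the rules from the slide (adjacent compare+jump, both in one 16-byte aligned block).

```python
# Toy sketch of macro-op fusion pairing (illustrative, not Intel's rules):
# fuse a compare immediately followed by a conditional jump, provided
# both instructions fall in the same 16-byte aligned block.

FUSIBLE_FIRST = {"cmp", "test"}       # simplified candidate set
FUSIBLE_SECOND = {"jl", "je", "jne"}  # simplified candidate set

def fuse(insts):
    """insts: list of (address, mnemonic). Returns list of micro-ops."""
    uops, i = [], 0
    while i < len(insts):
        addr, mnem = insts[i]
        if (mnem in FUSIBLE_FIRST and i + 1 < len(insts)
                and insts[i + 1][1] in FUSIBLE_SECOND
                and addr // 16 == insts[i + 1][0] // 16):  # same 16B block
            uops.append(f"{mnem}+{insts[i + 1][1]}")  # one fused micro-op
            i += 2
        else:
            uops.append(mnem)
            i += 1
    return uops

print(fuse([(0, "cmp"), (3, "jl"), (5, "mov")]))  # ['cmp+jl', 'mov']
print(fuse([(14, "cmp"), (17, "jl")]))            # not fused: crosses a 16B block
```

The payoff is that the fused pair consumes one slot in every downstream structure, which is why the fusion rules are worth the extra decode logic.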

Move Elimination

Moving a value from one register to another (e.g., mov r3, r1) does not require any "real" work. Simply update the rename table so r3 now points to the same physical register as r1. Only perform move elimination if r1 is ready.

Zero Idiom

Zero'ing out a register is very common but requires very little "real" work:

xor eax, eax

Allocate a fresh destination register in the rename stage, but then immediately clear the value in this destination register to zero.
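The rename-table trick behind move elimination can be sketched directly. This is a minimal illustration, not a real renamer: the register names and table are invented, and a real implementation also needs reference counting so a physical register is freed only when no architectural register maps to it.

```python
# Minimal sketch of move elimination in a rename stage (illustrative).
# Architectural registers map to physical registers; a reg-reg move is
# "executed" entirely in the rename table, using no execution resources.

def rename_move(rename_table, ready, dst, src):
    """Eliminate `mov dst, src` by aliasing dst to src's physical reg.

    Returns True if the move was eliminated, False if it must instead be
    dispatched to an execution unit (e.g., src's value is not ready yet).
    """
    preg = rename_table[src]
    if not ready[preg]:          # only eliminate if src is ready
        return False
    rename_table[dst] = preg     # dst now names the same physical reg
    return True

table = {"r1": "p7", "r3": "p2"}
ready = {"p7": True, "p2": True}
rename_move(table, ready, "r3", "r1")
print(table)  # {'r1': 'p7', 'r3': 'p7'} -- r3 aliases r1's physical reg
```

After the table update the move is architecturally complete: no ALU ever sees it, which is exactly the "no execution resources" claim on the slide.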

Both techniques enable specific instructions to use no execution resources!

Multithreading & SIMD

• SMT enables two threads to share much of the OOO pipeline, although some data structures are statically partitioned between the two threads

• Subword-SIMD can process 512 bits of integer or floating-point data with a single instruction, where this data is carved into 64x8b, 32x16b, 16x32b, or 8x64b elements

Intel Skylake • App Req vs Tech Constraints • Skylake System Overview • Skylake Processor • Skylake Memory • Skylake Network • Skylake System Manager

Memory System
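The subword-SIMD idea from the Multithreading & SIMD slide above — one wide value carved into independent lanes — can be modeled in plain Python. This is an illustration of the lane semantics, not AVX512 itself: a 512-bit integer stands in for a vector register, and the key behavior is that a carry never crosses a lane boundary.

```python
# Sketch of subword SIMD (illustrative, not AVX512 semantics): a single
# 512-bit value is carved into fixed-width lanes, and one "instruction"
# applies the same operation to every lane independently.

def simd_add(a, b, lane_bits, total_bits=512):
    """Lane-wise modular add of two total_bits-wide integers."""
    mask = (1 << lane_bits) - 1
    result = 0
    for i in range(total_bits // lane_bits):
        lane_a = (a >> (i * lane_bits)) & mask
        lane_b = (b >> (i * lane_bits)) & mask
        # carry out of a lane is discarded, never propagated to lane i+1
        result |= ((lane_a + lane_b) & mask) << (i * lane_bits)
    return result

# 64 lanes of 8 bits: 0xFF + 0x01 wraps to 0x00 *within* its lane,
# rather than carrying into the neighboring lane as an ordinary add would.
print(simd_add(0xFF, 0x01, 8))  # 0
print(0xFF + 0x01)              # 256 (carry escapes the low 8 bits)
```

Hardware gets this lane isolation nearly for free by simply cutting the carry chain of a wide adder at each lane boundary.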

• Skylake can sustain two loads and one store of 512b per cycle
• Uses split stores: the store address-generation uop is sent to the store AGU, and the store data is sent to a separate execution unit
• L1 DTLB: "There are 64, 32, and 4 entries respectively for 4KB, 2MB, and 1GB pages; all the translation arrays are still 4-way associative."
• "Misses in the L1 DTLB are serviced by the unified L2 TLB," which has 1024 entries and is 8-way associative

Memory System
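The L1 DTLB entry counts above directly determine how much address space can be translated without a TLB miss (the "TLB reach"), which is a quick calculation:

```python
# L1 DTLB reach implied by the entry counts above:
# 64 x 4KB, 32 x 2MB, and 4 x 1GB pages.

KB, MB, GB = 1024, 1024**2, 1024**3

reach_4k = 64 * 4 * KB   # 256 KB of address space via 4KB pages
reach_2m = 32 * 2 * MB   # 64 MB via 2MB pages
reach_1g = 4 * 1 * GB    # 4 GB via 1GB pages

print(reach_4k // KB, "KB;", reach_2m // MB, "MB;", reach_1g // GB, "GB")
```

The 4KB-page reach (256KB) is tiny relative to server-class working sets, which is why large pages matter so much for TLB performance.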

• "A dedicated store AGU is slightly less expensive than a more general AGU. Store uops only need to write the address (and eventually data) into the store buffer. In contrast, load uops must write into the load buffer and also probe the store buffer to check for any forwarding or conflicts."
• L2 can sustain refilling a complete 64B cache line into the L2 per cycle
• L2 is private to the core and is "neither inclusive nor exclusive of the L1 data cache."
• L2 is non-blocking and can sustain up to 16 outstanding misses

Memory System Parameters / Last-Level Cache / In-Package Embedded DRAM L4 Cache

Intel Skylake • App Req vs Tech Constraints • Skylake System Overview • Skylake Processor • Skylake Memory • Skylake Network • Skylake System Manager

Pipelined Bus Interconnect
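Before moving on to the interconnect, the memory-system slides' bandwidth figures are worth putting in perspective. Assuming the i7-6700K's 4.0 GHz base clock (the slide figures are "two loads and one store of 512b per cycle"; note PC parts without AVX512 see half these numbers):

```python
# Rough peak L1 data-cache bandwidth per core implied by "two 512b loads
# and one 512b store per cycle", assuming a 4.0 GHz core clock.

freq_hz = 4.0e9
load_bytes = 2 * 64    # two 512-bit (64B) loads per cycle
store_bytes = 1 * 64   # one 512-bit (64B) store per cycle

peak_load_bw = load_bytes * freq_hz    # 512 GB/s of loads per core
peak_store_bw = store_bytes * freq_hz  # 256 GB/s of stores per core
print(peak_load_bw / 1e9, peak_store_bw / 1e9)
```

Compare these per-core numbers with the ~34 GB/s of DRAM bandwidth for the whole chip: the cache hierarchy must filter the overwhelming majority of accesses for the core to stay fed.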

IVY BRIDGE-EX UPDATES XEON E7 LINE More Than Just Another Brickland in Intel’s Wall

By David Kanter (February 24, 2014) ......

Although high-end server products are released infrequently, the impact of a new platform is great. Intel's unveiling of the Brickland platform is a perfect example. Like its predecessor, Brickland supports glueless four-socket and eight-socket servers, and it can be extended to larger systems using proprietary node controllers from third parties.

The cornerstone of Brickland is Ivy Bridge-EX, which will be billed as the Xeon E7 v2 family. Compared with the previous generation, Ivy Bridge-EX takes advantage of a new processor core and north-bridge microarchitecture, new 22nm technology, and a new platform to deliver 1.7x to 2.3x higher performance on a variety of industry-standard benchmarks and customer applications. It also integrates PCIe 3.0 on the die and only uses up to 15% more power per socket.

Intel's mainstream server processors for one- and two-socket systems follow a fairly rapid cadence, but high-end and scalable server processors move at a slower march. For example, the company has introduced two new mainstream platforms and four different server processors since 2009. In contrast, the existing Boxboro-EX platform for high-end servers made its debut in 2010 along with the Xeon 7500 series (code-named Nehalem-EX) but only saw one upgrade: the Xeon E7 (code-named Westmere-EX) in 2Q11.

The Brickland platform brings comprehensive changes to high-end servers, as Figure 1 shows. The memory is still DDR3, but the bandwidth increases by 25% and, more importantly, the capacity triples. Intel changed the cache-coherence protocol from source snooping to a more complex and scalable home-snooping design to improve performance in four- and eight-socket systems. True to Moore's Law, Brickland shifts I/O from discrete to integrated PCIe links, substantially reducing system power, cost, and complexity. The processor and platform power...

Intel Brickland Platform

[Figure 1. Intel's Brickland and Boxboro-EX platforms: four Xeon E7 v2/v3 sockets fully connected by QPI links, each socket with four scalable memory buffers (SMBs) driving DDR3/DDR4 SDRAM DIMMs, 32 lanes of integrated PCIe 3.0, and a DMI 2.0 link (4 lanes) to the C602J (ICH10) south bridge with USB2 and SATA; the older Boxboro-EX platform instead routed I/O through discrete 7500 IOH (Boxboro) chips with 36 lanes of PCIe 2.0. Brickland improves memory capacity and substantially increases I/O capabilities while eliminating two discrete components.]

• Supports up to 1.5TB DRAM per socket (6TB total)
• 18 cores per socket (72 cores, 144 threads total)
• 115-165W per socket, 800W+ for a four-socket system
• Fully connected inter-socket network

© The Linley Group • Microprocessor Report, February 2014

Haswell-EX Brings TSX to Servers

As Figure 1(a) shows, Haswell-EX consists of 18 cores and 18 cache slices arranged in four columns along with two memory controllers, three QPI controllers, and other north-bridge components, all connected using Intel's coherent fabric. Compared with Ivy Bridge-EX, shown in Figure 1(b), the major changes are the larger core and cache-slice counts as well as the structure of the on-die fabric. As with all of Intel's recent server designs, each ring includes buses for request, snoop, acknowledge, and 32-byte data. The Haswell-EX fabric has switches that buffer data between the two pairs of unidirectional rings, reducing latency.

Since each core comes with a 2.5MB L3 cache slice, the total capacity increases from 37.5MB to 45MB. Although the data rates in the cores have doubled, the L3 cache bandwidth per slice is constant; the peak aggregate L3 read bandwidth is 1.44TB/s, but the fabric delivers only 0.5TB/s because of traffic patterns and routing limitations.

Intel Buffers Against DRAM Changes

One hallmark of Intel's Xeon E7 line is massive memory capacity: 1.5TB per socket, twice that of the Xeon E5. Haswell-EX has two memory controllers, each driving two SMI2 (scalable memory interconnect) links to a total of four scalable memory buffers. In turn, the memory buffers drive a total of eight memory channels.

Memory is one of the main areas where Intel's design goals (specifically, platform longevity and capacity) conflict with the industry's overall trajectory. The Brickland platform debuted in 1Q14, before the 2Q14 release of DDR4 DRAM. The heightened validation requirements of four- and eight-socket server systems mean that waiting for the availability of DDR4 would have entailed at least a two-quarter delay. Moreover, customers may prefer to use more-affordable DDR3 memory rather than paying the premium associated with a new (and hence lower-volume) memory type.

The scalable memory buffers decouple the memory controller from the memory interface and enable the Brickland platform to weather the DDR4 transition smoothly, resolving the tension between platform longevity and the DRAM-industry roadmap. The company's existing C10x (Jordan Creek) buffers use DDR3 memory and work with the Xeon E7v3 for customers that prefer the older memories. The new C11x (Jordan Creek 2) buffers support DDR4 DIMMs, including 64GB LRDIMMs.

DDR4 brings a number of benefits. In performance mode, the SMI2 link operates at 3.2GT/s and the DDR4 memory operates at 1.6GT/s, a 20% bandwidth improvement over the previous generation. The active power consumption should decrease, since the DDR4 core voltage is 1.2V compared with 1.35V for DDR3L; idle power also decreases, since DDR4 terminates at ground instead of half VDDQ. In lockstep mode, the SMI2 link and DDR4 memory interface operate at 1.866GT/s, a 17% increase over DDR3.

From a reliability standpoint, DDR4 adds CRC and link-level retry to the command and address bus. The Haswell-EX memory controllers include new reliability features such as address-based mirroring, which allows the firmware or OS to specify protected address ranges and thereby reduce mirroring overhead. The memory controller and firmware can also fail over an entire memory rank to a spare rank in the same controller.

Haswell-EX also includes several features that improve cache coherence. The QPI links can run at 9.6GT/s, a modest improvement over Ivy Bridge-EX's 8GT/s.

Intel Haswell-EX Processor
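The DDR4 transfer-rate claims above can be sanity-checked arithmetically. The 8-byte per-channel width used below is the standard 64-bit DDR data path (an assumption layered on top of the article's GT/s figures):

```python
# Checking the lockstep-mode claim above: DDR4 at 1.866 GT/s vs.
# DDR3 at 1.6 GT/s, quoted as "a 17% increase".

ddr3_gts = 1.600
ddr4_lockstep_gts = 1.866

increase = ddr4_lockstep_gts / ddr3_gts - 1.0
print(round(increase * 100))  # 17 -- matches the article

# Per-channel bandwidth at a 64-bit (8-byte) DDR interface width:
ddr4_bw_gbs = ddr4_lockstep_gts * 8
print(round(ddr4_bw_gbs, 1))  # ~14.9 GB/s per channel
```

With eight channels per socket, these per-channel figures are what add up to the platform's large aggregate memory bandwidth.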
Each • Two bidirectional PCI Express 2xQPI QPI rings with (32 lanes) PCIe QPI QPI PCIintermediate Express 2xQPI DMI 2.0 Control Control Control QPI (4 lanes) (32 lanes) PCIe QPI QPI DMIswitches 2.0 Control Control Agent (4 lanes) 2.5MB HSW Switch • Each ring has 32B L3$ CPU IVB(256b) 2.5MBchannelsIVB 2.5MB 2.5MB IVB 2.5MB HSW 2.5MB 2.5MB HSW HSW 2.5MB HSW CPU L3$ CPU L3$ L3$ CPU CPU L3$ L3$ CPU CPU L3$ L3$ CPU • Total L3 capacity is IVB 2.5MB IVB 2.5MB 2.5MB IVB HSW 2.5MB 2.5MB HSW HSW 2.5MB 2.5MB HSW 45MB CPU L3$ CPU L3$ L3$ CPU CPU L3$ L3$ CPU CPU L3$ L3$ CPU HSW 2.5MB 2.5MB HSW HSW 2.5MB 2.5MB HSW • IVBRing and2.5MB L3 IVB 2.5MB 2.5MB IVB CPU L3$ L3$ CPU CPU L3$ L3$ CPU CPUoperate L3$at coreCPU L3$ L3$ CPU HSW 2.5MB 2.5MB HSW HSW 2.5MB 2.5MB HSW IVBfrequency2.5MB IVB 2.5MB 2.5MB IVB CPU L3$ L3$ CPU CPU L3$ L3$ CPU CPU(2.5GHz)L3$ CPU L3$ L3$ CPU 2.5MB HSW IVB 2.5MB IVB 2.5MB 2.5MB IVB Switch L3$ CPU • CPUEntire systemL3$ isCPU L3$ L3$ CPU completely cache Memory Memory coherent DRAM DRAM (a) Control Control (b) Control Control Supports 2x SMI2 DDR3 2x SMI2 DDR3/4 2x SMI2 DDR3/4 • 2x SMI2 DDR3 transactional Figure 1. Block diagram of (a) Haswell-EX and (b) Ivy Bridge-EX. Both devicesmemory share the same platform and I/O configuration. The internal fabric is similar, but Haswell-EX scales it to account for the greater number of cores and L3 cache slices. • $7,175!

© The Linley Group • Microprocessor Report, May 2015

Haswell-EX Brings TSX to Servers

memory controller also contains a 14KB sectored directory cache with eight ways and 256 sets that tracks the sharing status of a cache line across up to eight different sockets. The on-die cache reduces directory updates and eliminates snoop traffic, since any directory-cache hit will send a directed snoop instead of a broadcast.

Same Socket, Different Power Delivery

A more complicated aspect of the Brickland platform is power delivery. The Ivy Bridge generation relies on conventional VRMs to deliver power, whereas the Haswell generation uses an integrated voltage regulator (see MPR 7/29/13, "Haswell's FIVR Extends Battery Life"). It was not immediately clear how Intel would resolve such a midstream platform-level change or whether the E7 line would simply forgo the FIVR.

For Brickland, Intel specified that the current carried over the main voltage rail would supply the entire chip through the FIVR. As a result, Haswell-EX can reap some benefits of the FIVR, such as fast and fine-grain voltage control. For example, each core can operate at a separate voltage/frequency point, and the L3 cache and ring can be separately optimized. But the full board-level cost reduction and compaction will remain elusive until Skylake, which will probably have a different power-delivery architecture.

The Xeon E7v3 and Xeon E5v3 derive from the same physical chip, which contains a superset of functions from the two designs. For example, the full die has three QPI links (one of which is disabled in the E5), 40 PCIe 3.0 lanes (8 of which are disabled in the E7), and a memory interface that supports both SMI2 and DDR4.

As Figure 2 shows, the cores and L3 cache slices are distributed unevenly across the die. The left three columns contain only four cores each, whereas the rightmost column contains six. The memory controllers and PHYs occupy the bottom-left portion of the chip, and the die's top strip comprises QPI and PCI Express controllers and PHYs, as well as assorted logic such as power management.

Intel chose this physical configuration to easily create three different physical die for the high-volume E5 line simply by removing vertical slices of the design. The midrange version, for example, eliminates the rightmost column, and the low-end version cuts the die nearly in half. The Xeon E7 only has a single physical instantiation, which it shares with the largest Xeon E5; pinouts, packaging, and fused-off functions differentiate the two. This extensive sharing reduces Intel's development costs and validation burden for the Xeon E7.

[Figure 2. One Die for Both Haswell-EP and -EX: die micrograph of Haswell-EX/EP. Both the EX and EP versions use the same 662mm² chip; the high-core-count variants of Haswell-EP (8, 12, and 18 cores) employ the same silicon as Haswell-EX, but with different I/O configurations. The latter chip enables more QPI links and fewer PCIe lanes, and it uses SMI2 memory interfaces. (Source: Intel)]

Intel Kills Off the Models

Over the last several years, Intel's marketing department has energetically produced myriad model numbers at a pace that threatens to outstrip that of Moore's Law. The Xeon E7v1 (Westmere-EX) family contains 18 different models, and the E7v2 (Ivy Bridge-EX) family increases that number to 20. Oddly enough, the company has now reversed course: as Table 1 shows, the Xeon E7v3 family has only eight regular models and four specialty models (two database, one power efficient, and one HPC).

The E7-8890v3 is the highest-throughput model, sporting 18 cores operating at a 2.5GHz base frequency and 2.9GHz peak frequency as well as 45MB of L3 cache and a 165W TDP rating. The specialty models come in several flavors. The E7-8893v3 and 8891 are optimized for per-core performance to reduce the overall solution cost for licensed software running on the core (e.g., SAP or Oracle). The 8893 delivers 43% higher SPECint_rate2006 per core than the 8890. The Xeon E7 family isn't low power in an absolute sense, as the memory buffers alone consume 36W per socket, but the E7-8880L is the most power-efficient model: it reduces frequency by 20% in exchange for 30% lower power relative to the 8890. The 8867 has only 16 cores because certain HPC applications experience significant performance losses for non-power-of-two core counts.

The four E7-88x0v3 models are intended for up to eight-socket systems, and they largely differ in frequency, core count, and TDP. The E7-48xxv3 models are limited to just four-socket systems and enable customers to trade in performance for lower price.

Although prices for the Xeon E7v3 family are higher than for the Xeon E7v2, the increase is quite modest. Comparing similar model numbers (e.g., E7-8890v2 versus E7-8890v3), the greatest price difference was 6%, and the average a mere 3%. Considering that for those models, SPECint_rate2006 performance increases by 18–56%, the price/performance gains are compelling. After removing several models from the family, the price for the least-expensive eight-socket processor jumps by about $1,000 (or 33%), but the least expensive four-socket processor, the …
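The directory-cache geometry quoted above (14KB, eight ways, 256 sets, up to eight sockets) can be sanity-checked with a little arithmetic. The sketch below is illustrative only: the article gives total size, associativity, and set count, but the per-entry field layout in the comments is an assumption, not Intel's documented format.

```python
# Cross-check the Haswell-EX on-die sectored directory cache described
# above: 14KB total, 8 ways, 256 sets, tracking cache-line sharing
# status across up to 8 sockets.

SETS = 256
WAYS = 8
TOTAL_BYTES = 14 * 1024
SOCKETS = 8

entries = SETS * WAYS                    # distinct cache lines tracked
bytes_per_entry = TOTAL_BYTES / entries  # storage budget per entry

print(entries)          # 2048 tracked lines
print(bytes_per_entry)  # 7.0 bytes each -- room for a tag plus an
                        # 8-socket presence vector (illustrative layout,
                        # not Intel's actual encoding)

# Why a directory-cache hit helps: a hit sends one directed snoop
# instead of probing every other socket.
broadcast_snoops = SOCKETS - 1   # probe all 7 remote sockets
directed_snoops = 1              # directory names the sharer
print(broadcast_snoops, directed_snoops)
```

The 7-byte-per-entry budget is consistent with the stated 14KB capacity, which is one quick way to check that the set/way numbers were extracted correctly.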


Intel Skylake • App Req vs Tech Constraints • Skylake System Overview • Skylake Processor • Skylake Memory • Skylake Network • Skylake System Manager

improves scheduling and arbitration, reducing overhead and raising the effective bandwidth. The processor continues support for embedded DRAM (eDRAM), which caches both graphics and CPU data (see MPR 9/9/13, "Iris Pro Takes On Discrete GPUs").

Skylake is Intel's first PC processor to handle DDR4 DRAM, reaching data rates of 2,133MHz compared with 1,600MHz for Broadwell's DDR3 interface. DDR4 memory is still more expensive than DDR3, but its price is falling rapidly. Servers are already shifting to DDR4, and we expect the faster memory to be popular in Skylake PCs, particularly in 2016. For those OEMs that don't want to upgrade, the processor also supports DDR3L. Skylake additionally upgrades to 16 lanes of PCI Express 3.0, which doubles bandwidth relative to version 2.0. For PCs, this change improves the connection to the external graphics card. The processor supports up to three high-resolution (4K) displays. The new south bridge adds support for USB 3.1 and the new reversible Type-C connector.

Power Management Gets Hard

Like Haswell, the Skylake PC processor has four major voltage domains: the CPU subsystem, the GPU shaders (which Intel calls the "slice"), the GPU fixed-function logic (e.g., video engine), and the north bridge ("system agent"). Each domain's voltage can be adjusted independently of the others, or the entire domain can be gated off when not in use. Within each domain, the power to unused sections can be gated off. For example, Skylake can power-gate sections of a CPU (e.g., AVX units), full CPU cores, individual sub-slices in the GPU, the L3 cache, and the ring. In addition, clock gating provides finer granularity for power management, but unlike power gating, it doesn't eliminate static power.

Traditionally, the operating system sets the P-state of each core; the processor then applies the appropriate frequency and voltage. This approach works well for predictable and steady workloads, but the OS P-state control loop only makes adjustments every 30ms. SpeedShift allows the processor to work with the OS to implement finer-grain power management. Skylake's integrated power control unit (a small microcontroller) monitors the CPU load and determines the most effective clock speed. The P-state control loop for SpeedShift is dramatically faster than the OS: it can change states every 1ms. As a result, the processor can better match the CPU or GPU speed to the workload, even when the load changes rapidly or in ways that are difficult for the OS to anticipate.

In particular, the PCU continually checks on-chip test circuits to determine the most power-efficient operating frequency, which is not necessarily the lowest frequency, as Figure 3 illustrates. To maximize battery life, the processor operates at that frequency until the task is complete, then it enters sleep mode. (Intel calls this approach "energy-aware race to halt," or Earth.) For critical applications, however, the OS can request higher performance, in which case the PCU will run the CPU at the maximum possible frequency within thermal constraints, up to the Turbo Boost limit or some other limit that the OEM can preset.

These power-management changes have little effect on the processor's idle power, nor do they affect TDP, which represents a thermal limit achieved when the processor is running at full speed for an extended time. Instead, SpeedShift saves power by briefly turning off portions of the chip when they are unneeded. The benefits of this approach are difficult to quantify, as they vary widely depending on the application.

The initial Skylake products exclude the integrated voltage regulator (FIVR) found in Haswell and Broadwell (see MPR 7/29/13, "Haswell's FIVR Extends Battery Life"). Intel gave no specific reason for eliminating the FIVR, which reduces system power, but sources indicate that the task of bringing so many Skylake products to market at once left insufficient time in the schedule to validate them with that feature. To duplicate the function of the FIVR, Skylake motherboards require multiple expensive discrete voltage regulators, increasing board area and system cost. The FIVR's benefits are greater at higher TDPs, so the technology may reappear in Skylake server products.

[Figure 3. Trading Off Energy vs Performance: autonomous power management. Total energy = compute energy (~f²) + SoC and system energy (~1/f); performance ~ f. Every millisecond, the processor determines its most power-efficient operating frequency on the basis of workload and system characteristics.]

Desktops Leap Two Generations

For desktop PCs, Intel rolled out new S-Series and K-Series products in the Core i7, Core i5, Core i3, and Pentium lines. In addition, it offers new low-power T-Series products in the same lines; these processors are popular where cooling is constrained, such as in mini-PC and all-in-one designs. The quad-core S-Series chips have a 65W TDP, whereas the dual-core versions run at 35W. Overall, this launch updates nearly the entire desktop line in one fell swoop, omitting only the ultra-low-cost Celeron products. It excludes vPro models; these products, which target corporate PCs, are due in the fourth quarter.
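The tradeoff in Figure 3 can be made concrete with a toy model. For a fixed amount of work, compute energy grows roughly as f² (power scales superlinearly with frequency while runtime shrinks as 1/f), and SoC/system energy grows as 1/f (fixed platform power integrated over the runtime), so total energy is minimized at an intermediate frequency, which is the "most efficient" point a PCU would target before racing to halt. The coefficients below are made-up illustrative values, not Intel data, and this is a sketch of the curve's shape, not the actual SpeedShift algorithm.

```python
# Toy model of Figure 3's energy-aware race-to-halt tradeoff.
# Compute energy ~ a*f^2; platform/system energy ~ b/f (fixed system
# power times runtime, where runtime ~ 1/f). a and b are illustrative.

def total_energy(f, a=1.0, b=8.0):
    """Total energy to finish a fixed task at normalized frequency f."""
    return a * f**2 + b / f

# Analytical minimum: dE/df = 2*a*f - b/f^2 = 0  =>  f* = (b/(2*a))**(1/3)
f_star = (8.0 / (2.0 * 1.0)) ** (1 / 3)

# The most efficient frequency is neither the lowest nor the highest:
for f in [0.5, 1.0, f_star, 2.0, 3.0]:
    print(f"f = {f:.3f}  E = {total_energy(f):.3f}")
```

Running the sweep shows energy rising on both sides of f*: at low frequency the task takes so long that fixed system power dominates, while at high frequency the f² compute term dominates, matching the U-shaped total-energy curve in the figure.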

© The Linley Group • Microprocessor Report September 2015
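The memory and I/O upgrades discussed in this article (dual-channel DDR4-2133 versus Broadwell's DDR3-1600, and 16 lanes of PCIe 3.0 to the graphics card) lend themselves to quick peak-bandwidth estimates. These are standard back-of-envelope interface calculations, not figures quoted by the article; "2,133MHz" is treated as 2,133 MT/s with an 8-byte (64-bit) channel, and PCIe 3.0 as 8 GT/s per lane with 128b/130b encoding.

```python
# Back-of-envelope peak bandwidths for the Skylake memory system.
# Assumptions: 64-bit (8-byte) DDR channels, transfer rates in MT/s,
# PCIe 3.0 at 8 GT/s per lane with 128b/130b encoding (per direction).

# Two channels of DDR4-2133: 2133e6 transfers/s * 8 bytes per channel.
ddr4_gbs = 2 * 2133e6 * 8 / 1e9                    # ~34.1 GB/s

# Broadwell's dual-channel DDR3-1600, for comparison.
ddr3_gbs = 2 * 1600e6 * 8 / 1e9                    # ~25.6 GB/s

# PCIe 3.0 x16 link to the external graphics card.
pcie3_x16_gbs = 16 * 8e9 * (128 / 130) / 8 / 1e9   # ~15.8 GB/s

print(round(ddr4_gbs, 1), round(ddr3_gbs, 1), round(pcie3_x16_gbs, 1))
# -> 34.1 25.6 15.8
```

The jump from DDR3-1600 to DDR4-2133 is the same 2133/1600 = 1.33x ratio as the data rates, while PCIe 3.0's 128b/130b encoding is why its usable bandwidth is very nearly double that of PCIe 2.0's 5 GT/s with 8b/10b encoding.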

SKYLAKE SPEEDSHIFTS TO NEXT GEAR
Hardware Power Management, Wider CPU Improve Efficiency

By Linley Gwennap (September 7, 2015)

In one of its largest product introductions ever, Intel rolled out 48 new PC processors based on its new Skylake CPU design. Skylake uses the same 14nm process as Broadwell but offers several improvements. Perhaps the most significant is a move to hardware power management, a feature Intel calls SpeedShift. The sixth-generation CPU has instruction-set extensions dubbed SGX and MPX that work with Windows 10 to improve security, and AVX512 doubles the processing speed for vector operations.

Owing to delays in the 14nm ramp, only the Y-Series Broadwell launched in 2014; mainstream notebook models appeared in January of this year, and a few high-end desktop models rolled out a few months ago. Now that the 14nm process is in full production, Skylake is launching with a broad set of notebook and desktop chips, including dual- and quad-core versions. This debut marks the first update in two years for mainstream desktop processors, which entirely skipped the Broadwell generation.

For systems that are moving up from the 22nm Haswell, Skylake provides significant gains in either clock speed or power. Compared with Broadwell, however, the gains are minimal; these products see little or no clock-speed increase. We estimate the enhancements to the CPU design yield about 5% more performance per clock on most PC applications. Skylake's advances are mainly outside the CPU: the processor includes a new image-processing engine (ISP) and adds DDR4 and PCI Express 3.0 interfaces, as Figure 1 shows. Revisions to the GPU and video engine boost performance per watt.

Most of the newly announced products are available immediately, but the Iris and Pentium models will ship in the fourth quarter, along with new vPro products. By early 2016, the architecture will extend all the way to the low-end Celeron brand as well as the high-end Iris Pro line. For the first time, Intel is developing different versions of the new CPU, omitting AVX512 for PCs but including it for servers. The latter version of Skylake will appear in Xeon E5 and E7 server processors in 2016.

Improving the PC

Intel's vision of the future includes PCs with no wires. The company's WiDi technology, which allows a PC to connect wirelessly to a TV or monitor via Wi-Fi, is starting to gain users. A new technology called WiGig delivers up to 7Gbps using the 60GHz band (see NWR 5/19/14, "Millimeter Wave Seeks Best Home"); some Skylake notebooks will connect to a docking station using this technology, eliminating a wired connection. The docking station then links to a monitor, keyboard, and other peripherals.

[Figure 1. Intel Skylake processor. In addition to CPU and graphics improvements, the design includes hardware power management, a new ISP, and DDR4 capability. The diagram shows four Skylake CPU cores with L3 cache slices, an HD 500 graphics unit with ISP and video engines, 64MB or 128MB of eDRAM, a north bridge with a two-channel DDR3/DDR4 memory controller, PCIe 3.0 to an external graphics card, up to three 4K displays, and a DMI/OPI link to the Intel 100 Series south bridge connecting USB 3.1, Wi-Fi + Bluetooth, Ethernet, Thunderbolt, and WiGig.]

ECE 4750 Concepts

Processors • Pipelining • Superscalar Execution • OOO Execution • Register Renaming • Memory Disambiguation • Branch Prediction & Spec Exec • SIMD Extensions • Multithreading
Memories • Multi-Level Caches • Private/Shared Caches • Consistency, Coherence • Translation/Protection TLB
Networks • On-Chip Ring, Inter-Socket All-to-All • Routing, Buffering