Haswell: The Fourth-Generation Intel Core Processor

Haswell, the fourth-generation processor architecture, delivers a range of client parts, a converged core for the client and server, and technologies used across many products. It uses an optimized version of Intel 22-nm process technology. Haswell provides enhancements in power-performance efficiency, power management, form factor and cost, core and uncore microarchitecture, and the core's instruction set.

Per Hammarlund, Alberto J. Martinez, Atiq A. Bajwa, David L. Hill, Erik Hallnor, Hong Jiang, Martin Dixon, Michael Derr, Mikal Hunsaker, Rajesh Kumar, Randy B. Osborne, Ravi Rajwar, Ronak Singhal, Reynold D'Sa, Robert Chappell, Shiv Kaushik, Srinivas Chennupaty, Stephan Jourdan, Steve Gunther, Tom Piazza, and Ted Burton, Intel

Haswell, the fourth-generation Intel Core Processor, delivers a family of processors with new innovations.1,2 Haswell delivers a range of client parts, a converged core for the client and server, and technologies used across many products. Many of Haswell's innovations are in the areas of improving power-performance efficiency and power management. Power-performance efficiency has been enhanced to increase the processor's operating range and improve its inherent performance in power-limited scenarios and its battery life. Improvements in power management include additional idle states, specifically the new active idle state S0ix, which enables a 20× reduction in idle power. One key enabler for the power-performance improvements is the fully integrated voltage regulator (FIVR), which also improves board space and cost. Performance improvements in the core and graphics come with corresponding improvements in the memory hierarchy; the first two cache levels have twice the bandwidth. For the top graphics configurations, Intel Iris Pro Graphics, Haswell also introduces a new fourth-level, 128-Mbyte on-package cache that enables a new level of integrated graphics performance.

Haswell is a "tock"—a significant microarchitecture change over the previous-generation Ivy Bridge. Haswell is built with an SoC design approach that allows fast and easy creation of derivatives and variations on the baseline. Graphics and media come with more scalability that lets designers build efficient configurations from the lowest to the highest end. The core comes with power-performance enhancements and a set of new instructions, such as floating-point fused multiply-add (FMA) and transactional synchronization extensions (TSX).

Haswell uses an enhanced version of Intel's 22-nm process technology, which has enhanced tri-gate transistors to reduce leakage current by a factor of 2 to 3 at the same frequency capability. Haswell's version of the 22-nm process has 11 metal interconnect layers, compared to nine for Ivy Bridge, to optimize for better performance, area, and cost.

Power efficiency and management

Current processors operate in power-constrained modes; they must maximize the performance they deliver inside a fixed power envelope. This power constraint is true for both server and mobile applications.

Figure 1. Power and performance voltage-frequency scaling improvements. The baseline (solid line) is improved (dashed line) by being lowered and by being extended for better burst and Turbo headroom.

One of the most important goals of a new processor generation is to dramatically improve power-performance efficiency. Figure 1 shows the basic nonlinear relationship between power and performance as the solid line. To improve power-performance efficiency across the voltage-frequency scaling range, we must achieve three goals, as shown in the dashed line:

- extending the operating range downward to allow the processor to go into smaller form factors that are even more power constrained;
- improving the basic power-performance efficiency of the processor by pushing each operating point to the right and down; and
- extending the operating range upward for more burst and Turbo headroom.

In Haswell, we employ multiple techniques to improve power-performance efficiency. We can describe them in three categories: low-level implementation, high-level architecture, and platform power management.

Examples of low-level implementation improvements include the following:

- Optimized manufacturing, process technology, and circuits, which help achieve all three goals just listed. These improvements are enabled by Intel's manufacturing capability and a deep collaboration across the different Intel teams.
- Optimized microarchitecture and algorithms. In each generation, we evaluate the microarchitecture and its algorithms for sufficient power-performance efficiency. Areas that fall below our goals will be reimplemented in ways that improve the power-performance efficiency.
- Optimization of design and implementation through continued focus on gating unused logic and using low-level low-power modes.

An example of a high-level architecture improvement in Haswell is extending the use of independent voltage-frequency domains. Figure 2 shows a conceptual block diagram of the different voltage-frequency domains. Cores, caches, graphics, and the system agent all run at dedicated, individually controlled voltage-frequency points. A power control unit (PCU) dynamically allocates the power budget among the domains to maximize performance. Prioritization based on runtime characteristics selects the domain with the highest performance return. For example, for a graphics-focused workload, most of the processor power is allocated to the graphics domain. Sufficient power is allocated to the rest of the blocks that the graphics domain depends on for performance, such as the system agent to provide memory bandwidth.

Figure 2. Conceptual block diagram of the Haswell processor showing the different independent voltage domains. The figure also shows Haswell's system agent and integrated memory controller (IMC), which features bandwidth, load-balancing, and DRAM-efficiency improvements.
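To make this allocation idea concrete, here is a toy greedy allocator in C. This is our sketch, not Haswell's PCU firmware: it gives each domain a floor, then grants the remaining package budget to the domain with the best predicted performance return. All domain names and numbers are invented for illustration.

```c
#include <stdio.h>

/* Invented model of a voltage-frequency domain for illustration. */
typedef struct {
    const char *name;
    double min_watts;      /* floor needed to stay functional        */
    double max_watts;      /* cap the domain can usefully consume    */
    double perf_per_watt;  /* predicted performance return (runtime) */
} domain_t;

int main(void) {
    domain_t d[] = {
        { "cores",        2.0, 20.0, 0.6 },
        { "graphics",     1.0, 18.0, 1.0 },  /* graphics-heavy workload  */
        { "system agent", 1.0,  4.0, 0.8 },  /* feeds memory bandwidth   */
    };
    enum { N = 3 };
    double budget = 15.0;                    /* package budget, watts    */
    double alloc[N];

    /* First give every domain its floor... */
    for (int i = 0; i < N; i++) {
        alloc[i] = d[i].min_watts;
        budget -= d[i].min_watts;
    }
    /* ...then hand out the remainder greedily by performance return. */
    while (budget > 1e-9) {
        int best = -1;
        for (int i = 0; i < N; i++)
            if (alloc[i] < d[i].max_watts &&
                (best < 0 || d[i].perf_per_watt > d[best].perf_per_watt))
                best = i;
        if (best < 0)
            break;                            /* everyone is at the cap */
        double grant = d[best].max_watts - alloc[best];
        if (grant > budget)
            grant = budget;
        alloc[best] += grant;
        budget -= grant;
    }
    for (int i = 0; i < N; i++)
        printf("%-12s %5.1f W\n", d[i].name, alloc[i]);
    return 0;
}
```

With these invented inputs, the graphics domain receives most of the budget while the system agent keeps enough power to supply memory bandwidth, mirroring the graphics-workload example above.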

At a platform level, we improved battery life to deliver "all-day experiences." To achieve this, we focused both on active workloads, such as media playback, and on idle power. Haswell achieves a 20× improvement in idle power. Haswell has evolutionary power-management improvements, such as improvements in C-states (CPU idle states). Haswell has both new, deeper C-states and improvements in the entry and exit latencies of C-states. These latency improvements let Haswell enter deep C-states more aggressively.

Haswell also has revolutionary power-management improvements—for example, the introduction of a new active idle-power state, S0ix. We leverage learnings from past phone and tablet development to deliver 20× improvements in idle power compared to the prior generation. This improvement enables significant improvements in realizable battery life. S0ix appears to software as an active state, while in actuality the hardware autonomously enters and exits deep idle states with low latency. The new power state is transparent to well-written software. Power management of platform components is continuous and fine grained; everything that is not needed is individually turned off.

Fully integrated voltage regulator

Power delivery to higher-performance processors comes with many conflicting requirements, such as the need for higher power for extended burst capability, a greater number of individually controlled voltage rails, and the need for a physically smaller footprint for new form factors. In response to these requirements, Haswell processors are powered by a 140-MHz, multiphase FIVR,2-4 the industry's first large-scale deployment of high-current switching regulators integrated into a VLSI die and package. FIVR is the enabling technology behind key Haswell improvements, including a 2× to 3× increase in peak available power (which converts into burst performance), a substantial battery-life increase, and a 70 to 80 percent platform footprint reduction.

Figure 3 gives an overview of FIVR. A first-stage voltage regulator (VR), which is on the motherboard, converts from the power supply or battery voltage (12 to 20 V) to approximately 1.8 V, and the second conversion stage is provided by parallel FIVRs (one for each major architectural domain). As illustrated, FIVR eliminates four VRs from the prior platform. To support the new Intel Iris Pro Graphics variants of Haswell, those platform VRs would have grown in both size and number. With FIVR, a platform-size reduction was achieved instead of what would have been substantial growth. That platform space can be used to add platform features, increase the battery size, and reduce the platform dimensions in many Haswell mobile products.

At the onset of the Haswell design, FIVR's expected benefits fell into half a dozen categories:

- Battery-life increase. FIVR's 140-MHz switching frequency enables several orders of magnitude less output decoupling and much lower input decoupling than the prior generation's voltage rails, allowing input and output voltages to be quickly reduced or powered off to save power, and quickly ramped back up for brief high-performance bursts.
- Increased available power for increased burst performance: FIVR can direct the entire package power to the unit that needs the most power, compared to separate VRs for separate units in the previous platform.
- Decreased power required for a given level of performance or, almost equivalently, increased performance for a given power consumed.
- Decreased platform cost and size from the removal of components and external power rails.
- Improved product flexibility and scalability; for example, new units can be added with little impact on the platform power-delivery systems.

FIVR delivered benefits in every category, some larger than expected.
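A back-of-envelope sketch of the two-stage conversion chain may help. The rail voltages, 90 percent efficiencies, and 20-A core load below are invented placeholders, not Haswell measurements; the point is only that the high-current path stays short when the final conversion happens on die.

```c
#include <stdio.h>

int main(void) {
    const double v_in      = 12.0;  /* battery/adapter: 12 to 20 V      */
    const double v_fivr_in = 1.8;   /* first-stage (motherboard) output */
    const double eff_board = 0.90;  /* assumed efficiency, not measured */
    const double eff_fivr  = 0.90;  /* assumed efficiency, not measured */

    const double v_core = 1.05, i_core = 20.0;   /* invented core rail  */
    const double p_core = v_core * i_core;       /* power at the load   */

    /* Power drawn from the 1.8 V rail, then from the input supply. */
    const double p_fivr_in  = p_core / eff_fivr;
    const double p_board_in = p_fivr_in / eff_board;

    printf("core load : %.1f W (%.1f A at %.2f V)\n", p_core, i_core, v_core);
    printf("1.8 V rail: %.1f W -> %.1f A into the package\n",
           p_fivr_in, p_fivr_in / v_fivr_in);
    printf("input     : %.1f W -> %.2f A at %.1f V\n",
           p_board_in, p_board_in / v_in, v_in);
    return 0;
}
```

Even with these placeholder numbers, a 20-A load at the core appears as only about 2 A at the platform input, which is why the board-level VRs and their decoupling can shrink so much.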


Figure 3. Example of possible platform improvements. Haswell’s fully integrated voltage regulator (FIVR), shown at the bottom, enables substantial board space and cost savings compared to the Ivy Bridge platform, shown at the top (a). Multiple voltage regulators (VRs) in the previous platform are combined into one VR (b).

Haswell Peripheral Controller Hub

Significant power reduction was achieved by the power-management improvements in the Peripheral Controller Hub (PCH). The Haswell PCH is responsible for providing I/O management, and its activity depends on direct-memory-access traffic to and from peripheral devices in the PCI, USB, Serial ATA, Wi-Fi, and audio-voice-speech subsystems.

A significant innovation is the introduction of exit-latency timers that are programmed with the latency demands of devices attached to the I/O links. The power-management controller (PMC) in the PCH uses these latencies to calculate and gracefully adapt the I/O subsystem from "high performance with low latency" to "low power with higher latency" by stopping clocks, shutting down phase-locked loops, and, finally, locally power-gating the I/O control modules of each subsystem independently. These changes allowed for a 40× power reduction in the I/O subsystem from the previous-generation PCH.
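The PMC policy just described can be sketched as: pick the lowest-power I/O state whose exit latency still meets the tightest latency demand programmed by attached devices. The state names, latencies, and power numbers below are invented for illustration; this is not Intel's PMC algorithm.

```c
#include <stddef.h>
#include <stdio.h>

/* Invented I/O power states, deepest last. */
typedef struct {
    const char *name;
    unsigned exit_latency_us;   /* time to return to full speed */
    unsigned relative_power;    /* smaller is better            */
} io_power_state_t;

static const io_power_state_t states[] = {
    { "active",            0, 100 },
    { "clocks-stopped",   10,  40 },
    { "pll-off",         100,  15 },
    { "power-gated",    1000,   5 },
};

/* Choose the lowest-power state whose exit latency fits the minimum
   latency tolerance reported across all attached I/O devices. */
static const io_power_state_t *choose_state(unsigned min_tolerance_us) {
    const io_power_state_t *best = &states[0];
    for (size_t i = 0; i < sizeof states / sizeof states[0]; i++)
        if (states[i].exit_latency_us <= min_tolerance_us &&
            states[i].relative_power < best->relative_power)
            best = &states[i];
    return best;
}

int main(void) {
    /* A device demanding <=150 us tolerates pll-off but not power-gating. */
    printf("chosen: %s\n", choose_state(150)->name);
    return 0;
}
```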

Haswell graphics and media

Haswell graphics and media are built to scale across a wide range of processor configurations, from low to very high integrated graphics and media performance. To achieve this scale, Haswell graphics were built from the start with scale in mind (see Figure 4). Haswell graphics and media are roughly split into six domains:

- Global assets, including the geometry front end up to setup.
- Slice common—shared functions including a rasterizer, a level-3 cache (internal to graphics), and the pixel back end.
- Subslice, including execution units (EUs), instruction caches, and texture samplers. These subslices are scalable in number to achieve the desired performance.
- Multiformat video codec engine.
- Video quality enhancement engine.
- Display pipelines.

Figure 4. Haswell graphics block diagram. Haswell graphics and media are built with a modular approach, which enables scale across a range of system configurations.

The graphics and media configurations can vary the number of subslices, video decoders, and samplers to vary power and performance profiles.

EUs are general-purpose programmable cores that support a rich instruction set that has been optimized to support various 3D API shading languages as well as media (primarily video) processing. Shared functions are hardware units that provide specialized supplemental functionality for the EUs. A shared function is implemented where the demand for a given specialized function is insufficient to justify the costs on a per-EU basis. Instead, a single instantiation of that specialized function is implemented as a stand-alone entity outside the EUs and shared among them.

The generic and traditional rendering pipeline—fetch, shade (vertex, hull, domain, geometry), rasterize, pixel shade—maps onto the global assets. A global (across graphics) cache provides storage and coherence between shared elements and allows specific cache configurations for GPGPU (general-purpose computing on GPUs) computing.

The media functions are woven into the graphics architecture and provide a scalable and programmable option for video encoding, decoding, and postprocessing. The functions include the following:

- fixed functions, such as multiformat video decoders sized for ultrahigh power efficiency;
- scalable assets, such as motion-estimation hardware; and
- programmable postprocessing filters.

Haswell graphics are designed to be latency tolerant, because they share the memory subsystem with latency-sensitive CPU cores. Graphics and media share the last-level cache with the CPU in a programmable fashion, allowing a software-tunable cache-sharing policy between graphics and CPU hardware for optimal performance. This is further enhanced in the presence of the larger embedded DRAM (eDRAM) cache, which gives graphics access to high bandwidth.

Cache hierarchy and the eDRAM cache

Haswell delivers substantial performance improvements in cores, media, and graphics, and needs a corresponding improvement in memory bandwidth. In addition to the traditional double-data-rate (DDR) memory-speed enhancements, Haswell has further improvements in cache-hierarchy pipelines, streaming write bandwidth, load balancing, and memory-scheduling efficiency.


We optimized all of the pipelines of the cache and memory hierarchy for efficiency improvements. Microarchitecture work improved pipeline efficiency. For example, optimizing the concurrency of requests—that is, handling different kinds of requests in separate pipelines—resulted in efficiency improvements of up to 40 percent. Furthermore, the cache hierarchy exploits the weakly ordered nature of write-combining stores to increase the maximum number of concurrent requests from a single Intel Architecture (IA) core from 10 to more than 40. This change greatly increased the streaming write bandwidth to memory for single-threaded workloads.

Load balancing between request agents was also improved. The credit-based bandwidth-management system was optimized to share resources efficiently: it balances reliable, low-latency access to resources against the possibility of high-bandwidth access for, for example, the graphics engine.

Haswell also has improvements in the memory schedulers for better write throughput. The Haswell memory controllers have deeper pending queues, more decoupling, and better scheduling algorithms.

These cache-hierarchy improvements benefit most of the Haswell configurations. For the Haswell top-end graphics configuration, Intel Iris Pro Graphics, we needed even more bandwidth and developed a dedicated solution.
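As a software-level illustration of the write-combining streaming stores mentioned above (our example, not the article's): non-temporal AVX stores are weakly ordered and combine into full-line writes that bypass the caches on the way to memory. The kernel assumes dst is 32-byte aligned and compilation with -mavx2.

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Fill dst with `value` using non-temporal (streaming) stores.
   dst must be 32-byte aligned. */
void stream_fill(int32_t *dst, int32_t value, size_t n) {
    const __m256i v = _mm256_set1_epi32(value);
    size_t i = 0;
    for (; i + 8 <= n; i += 8)
        _mm256_stream_si256((__m256i *)(dst + i), v); /* non-temporal */
    _mm_sfence();                /* order the weakly ordered WC stores */
    for (; i < n; i++)           /* scalar remainder                   */
        dst[i] = value;
}
```

Because these stores are weakly ordered, many of them can be outstanding at once, which is exactly the concurrency the enlarged request tracking (more than 40 requests per core) is designed to exploit.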

eDRAM memory bandwidth solution

Meeting Intel Iris Pro Graphics performance targets required much more bandwidth than the two channels of DDRx for clients. Additional external memory channels would have been very costly and would have had adverse effects on the physical size of the platform. The bandwidth solution we implemented is a 128-Mbyte L4 (fourth-level) cache providing 102 Gbytes/second peak bandwidth (51 Gbytes/second read and, simultaneously, up to 51 Gbytes/second write). The L4 cache data store is a discrete die made using Intel eDRAM process technology, providing both high-density memory and high-speed logic for high-bandwidth I/Os.5-7 The eDRAM die is on-package to exploit close proximity for a low-latency, low-power interconnect. The L4 cache architecture with eDRAM gives the following benefits compared to other candidates:

- superior bandwidth per watt over DDRx and GDDRx,
- a unified memory bandwidth solution for both IA and graphics,
- minimized motherboard real estate for small form factors, and
- an in-package memory solution to enable package-based performance upgrade opportunities.

Multichip package

The Intel Iris Pro Graphics CPU and eDRAM are in a multichip package (MCP) connected using a full-duplex on-package I/O (OPIO), as shown in Figure 5. The CPU hosts the L4 cache controller, the tags, and the enhancements needed in the power control unit. A low-power, high-bandwidth link connects the CPU to the L4 eDRAM cache die.

Figure 5. Intel Iris Pro Graphics multichip package ball grid array: photo of the package view (a) and block diagram (b).

The eDRAM was architected to be integrated with the existing cache hierarchy with minimal impact. The eDRAM acts primarily as a victim cache for the L3, being filled by evictions. Unlike the on-die L3 cache, the eDRAM is not inclusive of the core caches, allowing graphics data to be read and written directly without the need to fill into the on-die L3, saving the L3 storage for more latency-critical IA core accesses.

The eDRAM tags are stored in traditional static RAM (SRAM) on the processor die. To save power and area, each tag entry, called a superline, represents a 1-Kbyte region made up of sixteen 64-byte cache lines. All incoming requests look up the on-die L3 and eDRAM in parallel.

Intel Iris Pro Graphics adds several cache controls that help graphics use the L3 and eDRAM caches. Either cache can be partitioned between IA and graphics traffic to provide quality of service. Graphics can prevent surfaces from allocating in the caches if they would not benefit from caching. Graphics is also able to use the eDRAM to cache displayable surfaces for the first time.

OPIO exploits the short trace lengths within the MCP to simplify I/O and clocking circuits, significantly reducing power, area, and latency while providing high bandwidth.
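The superline tag arithmetic described above can be sketched as follows. Only the 1-Kbyte region and its sixteen 64-byte lines come from the article; the direct-mapped organization and field layout are simplifying assumptions of ours (the article does not give the real associativity).

```c
#include <stdbool.h>
#include <stdint.h>

#define LINE_BYTES       64u
#define LINES_PER_SUPER  16u                        /* 16 x 64 B = 1 KB */
#define SUPERLINE_BYTES  (LINE_BYTES * LINES_PER_SUPER)
#define NUM_SUPERLINES   (128u * 1024u * 1024u / SUPERLINE_BYTES) /* 128 MB */

/* One on-die SRAM tag entry tracks a whole 1-Kbyte eDRAM region. */
typedef struct {
    uint64_t region;      /* which 1-Kbyte region this entry tracks */
    uint16_t line_valid;  /* one valid bit per 64-byte line         */
} superline_tag_t;

static superline_tag_t tags[NUM_SUPERLINES]; /* direct-mapped for simplicity */

/* Is this 64-byte line present in the eDRAM cache? */
bool edram_lookup(uint64_t paddr) {
    uint64_t region = paddr / SUPERLINE_BYTES;
    unsigned line   = (unsigned)(paddr / LINE_BYTES) % LINES_PER_SUPER;
    const superline_tag_t *t = &tags[region % NUM_SUPERLINES];
    return t->region == region && ((t->line_valid >> line) & 1u);
}
```

The payoff of the superline is visible in the sizing: one tag per kilobyte instead of one per 64-byte line cuts the on-die tag storage by roughly a factor of 16.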


Figure 6. Performance for graphics workloads using the Intel Iris Pro Graphics eDRAM cache. The workloads include DirectX 10 and 11 games and benchmarks, such as 3DMark06, DOTA2, Left4Dead2, HAWX, Far Cry 2, Stone Giant, Resident Evil 5, Assassin's Creed, Alien vs. Predator, Civilization V, and PCMark Vantage, at resolutions from 1,920 × 1,200 to 2,560 × 1,600. (See the disclaimer in the Acknowledgments section.)

OPIO uses only 1 W of total power to deliver 102 Gbytes/second. This is 3× the bandwidth at one-tenth the total power compared to the 32-Gbytes/second DDR3.

The Intel Iris Pro Graphics eDRAM die is Intel's first product instantiation of eDRAM technology. The ability to combine dense eDRAM memory technology and high-speed logic gave the capability to have a single die with both large-capacity storage and high bandwidth.

The eDRAM die is implemented in Intel 22-nm technology. The eDRAM array is a total of 128 Mbytes and is architected for high bandwidth efficiency (also called low loaded latency). There are 128 banks, each with a row cycle time of 4 ns, minimizing both the probability of a bank conflict and the penalty when one occurs. The array operates at 1.6 GHz, processing a command per clock. The entire datapath to the array and within the array is full duplex, enabling write data transfers simultaneous with read data transfers. Each data transfer takes two clocks.

The Intel Iris Pro Graphics power control unit (PCU) dynamically manages the eDRAM device state depending on runtime evaluation of workload characteristics, performance goals, and energy-efficiency considerations. The PCU maintains the eDRAM subsystem in one of three states: on, off, or self-refresh. In the "on" state, both eDRAM and OPIO are at their active voltage, and the eDRAM controller sends refreshes to the eDRAM device across the OPIO link. In the "off" state, both eDRAM and OPIO clocks are stopped, and voltage is reduced to 0 V. In "self-refresh," the eDRAM is kept at its active voltage, but clocks in the OPIO domain are stopped and its voltage is dropped to 0 V. While the processor is active, the PCU can choose between on and off depending on the workload and the power and performance goals. While the processor is idle, the PCU can choose to turn the eDRAM off or put it into the self-refresh state. The PCU periodically evaluates the eDRAM's potential performance benefits and decides whether to power on the eDRAM subsystem.

eDRAM performance

The Haswell graphics performance improvements are the main bandwidth driver that motivates eDRAM. Figure 6 shows the performance gain for the 128-Mbyte eDRAM over a baseline of two-channel DDR3-1600 only, for a broad set of graphics benchmarks and game titles (including a beta version of DOTA2). On average, eDRAM can increase performance by 30 to 40 percent.
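As a quick check of the quoted peak bandwidth, assuming each two-clock transfer moves one 64-byte cache line (the transfer size is our inference from the line size given above): 1.6 GHz divided by two clocks, times 64 bytes, is 51.2 Gbytes/second per direction, and full duplex doubles it.

```c
#include <stdio.h>

int main(void) {
    const double clock_hz        = 1.6e9;  /* eDRAM array clock          */
    const double bytes_per_xfer  = 64.0;   /* assumed: one 64-byte line  */
    const double clocks_per_xfer = 2.0;    /* "each transfer two clocks" */

    double one_way = clock_hz / clocks_per_xfer * bytes_per_xfer; /* B/s */
    printf("read  : %.1f GB/s\n", one_way / 1e9);
    printf("write : %.1f GB/s\n", one_way / 1e9);
    printf("total : %.1f GB/s (full duplex)\n", 2 * one_way / 1e9);
    return 0;
}
```

This reproduces the article's 51 + 51 = 102 Gbytes/second figure.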


Figure 7. Loaded latency of the Intel Iris Pro Graphics eDRAM cache: latency (ns) versus read bandwidth (Gbytes/second) for Iris Pro (with eDRAM) and Iris (without eDRAM). (See the disclaimer in the Acknowledgments section.)

The 128-Mbyte eDRAM cache is large enough to exploit both inter- and intraframe data reuse. It's well known that graphics data, such as textures, are reused multiple times inside a single frame. Extensive presilicon simulation, confirmed by silicon measurements, showed that data can also be reused between frames.

In addition to providing high bandwidth power- and energy-efficiently to boost graphics performance, the eDRAM is also carefully designed to minimize latency, as shown by the outstandingly small latency sensitivity under sustained random-traffic load in Figure 7. This is especially important for eDRAM to benefit not only graphics but also general CPU workloads.

Haswell core

Haswell's core is the next major evolution in general-purpose out-of-order microarchitecture, delivering both higher performance and improved power efficiency across a broad range of workloads and form factors.

Like its predecessor, Ivy Bridge, the Haswell front-end pipeline supplies micro-operations (μops) for execution from two primary sources. The first source, a traditional instruction cache/decoder pipeline, supplies μops by decoding up to 16 bytes per cycle of complex instructions into up to four compound μops. The second source, a μop cache, stores decoded μops natively and supplies them at a rate equivalent to 32 bytes per cycle.

Up to four compound μops per cycle allocate resources for out-of-order execution and are split into simple μops. Up to eight simple μops per cycle can be executed by heterogeneous execution ports. Once complete, up to four compound μops can be retired per cycle. Each Haswell core shares its execution resources between two threads of execution via Intel Hyper-Threading.

Haswell's core contains several innovations for performance and power, including the following:

- Intelligent speculation. Haswell decouples branch prediction, instruction tag accesses, and instruction translation look-aside buffer (TLB) accesses from the supply of μops to execution. State-of-the-art advances in branch-prediction algorithms enable accurate fetch requests to "run ahead" of μop supply to hide instruction TLB and cache misses. A new data TLB prefetcher keeps page-walk latency from delaying many memory accesses.


Figure 8. Haswell’s execution ports and execution units. The figure shows which functional units are mapped onto which port, and which new units and ports have been added. (LEA: load effective address.)

- A large out-of-order window. Haswell maintains 192 μops in flight in its reorder buffer and the supporting structures, such as load buffers (72), store buffers (42), reservation stations (60), and physical register files (168). This is approximately a 15 percent increase over Ivy Bridge to extract more parallelism.
- Raw execution horsepower. Haswell provides eight heterogeneous execution ports, an increase over the six in Ivy Bridge. The two new ports and additions to existing ports combine to provide a fourth integer ALU, a second branch unit, a second floating-point multiplier, and a third store address-generation unit. The additional resources provide higher peak throughput and fewer false resource conflicts. See Figure 8 for details; a software-level illustration follows this list.
- Aggressive clock and data gating, and new dynamic power-saving modes. The Haswell core achieves power efficiency by aggressively gating idle logic. New power-saving modes adapt to workload phases to save power without sacrificing performance.

With these improvements, the Haswell core delivers improvements across a range of workloads.
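As the software-level illustration promised above (our example, not the article's): a reduction written as a single dependency chain can complete at most one add per cycle, whereas splitting it into independent accumulators lets the out-of-order core issue adds on several integer ALUs in parallel. The realized speedup depends on memory bandwidth and the compiler.

```c
#include <stddef.h>
#include <stdint.h>

/* One dependency chain: each add waits for the previous one. */
uint64_t sum_one_chain(const uint64_t *a, size_t n) {
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four independent chains: the scheduler can issue up to one add
   per integer ALU per cycle (Haswell has four integer ALUs). */
uint64_t sum_four_chains(const uint64_t *a, size_t n) {
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)          /* remainder */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}
```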

Instruction set enhancements

Haswell delivers performance improvements on legacy and unchanged code, and also adds new instructions for even more performance on suitable workloads.

Advanced Vector Extensions 2

With Advanced Vector Extensions 2 (AVX2), Haswell adds instructions for FMA, 256-bit integer vector computation, full-width element permute, and vector gather to benefit high-performance computing, audio and video processing, and games.

Figure 9. Examples of performance improvements with new instructions in the Haswell core: AES-GCM (AES-NI and PCLMULQDQ), RSA-2048 (MULX multiply), and multibuffer SHA-256 (RORX rotates and wider AVX2), shown relative to Westmere and Ivy Bridge. (See the disclaimer in the Acknowledgments section.)

Each Haswell core provides up to 32 single-precision or 16 double-precision floating-point operations per cycle using AVX2's FMA instructions and Haswell's two FMA hardware units. FMA operations execute with the same latency as a floating-point multiply (five cycles) and are fully pipelined. This achieves a 1.6× latency reduction versus the previous generation. Haswell's FMA can be used both for computational throughput enhancement and for latency reduction.

The Haswell core complements the addition of AVX2 with doubled L1 and L2 cache bandwidth. Each L1 data-cache port—two loads and one store—natively supports full-width AVX2 operations at 32 bytes. The L2 cache can return a full cache line of data, 64 bytes, to the L1 data cache. Despite these bandwidth increases, the L1 and L2 cache sizes and latencies remain the same as in Ivy Bridge.

New instructions for hashing and cryptography

Haswell provides several packages of new instructions to benefit high-value workloads that rely on bit-field manipulation and arbitrary-precision arithmetic. These new instructions benefit many algorithms, ranging from fast indexing, to cyclical redundancy checking (CRC), to widespread encryption algorithms. Figure 9 summarizes the performance.
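As an illustration of exercising the FMA units from C (our example, not the article's), using the AVX2/FMA intrinsics from <immintrin.h>; compile with -mavx2 -mfma. Each __m256 holds eight single-precision lanes, so two FMA units × 8 lanes × 2 operations per FMA matches the 32 single-precision operations per cycle quoted above.

```c
#include <immintrin.h>
#include <stddef.h>

/* a[i] = a[i]*b[i] + c[i], eight floats at a time. */
void fma_arrays(float *a, const float *b, const float *c, size_t n) {
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);  /* L1 sustains two 32-byte */
        __m256 vb = _mm256_loadu_ps(b + i);  /* loads per cycle...      */
        __m256 vc = _mm256_loadu_ps(c + i);
        va = _mm256_fmadd_ps(va, vb, vc);    /* one of two FMA units    */
        _mm256_storeu_ps(a + i, va);         /* ...plus one store       */
    }
    for (; i < n; i++)                       /* scalar remainder        */
        a[i] = a[i] * b[i] + c[i];
}
```

Note that this kernel issues three loads and one store per iteration, so the doubled L1 bandwidth (two loads plus one store per cycle at 32 bytes each) is what keeps the FMA units fed.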
Intel Transactional Synchronization Extensions

With Intel Transactional Synchronization Extensions (Intel TSX), Haswell adds hardware support to improve the performance of lock-based synchronization commonly used in multithreaded applications. These applications take advantage of the increasing number of cores to improve performance. However, when writing such applications, programmers must use concurrency-control protocols to ensure that threads properly coordinate access to shared data. Otherwise, threads might observe unexpected data values, resulting in failures. These protocols ensure that threads serialize access to shared data, often by using a critical section. A software construct, referred to as a lock, protects access to the critical section.

Because serialized access to a critical section limits parallelism, programmers try to reduce the impact of critical sections, either by minimizing synchronization or by using fine-granularity locks, where multiple locks protect different data. Unfortunately, this is a difficult and error-prone process and a common source of bugs. Furthermore, programmers must use information available only at the time of program development to decide whether synchronization is required and when to serialize. This often leads programmers to use synchronization conservatively, thus limiting parallelism.

To address this challenge, Intel TSX provides hardware support for the processor to determine dynamically whether it should serialize critical-section execution and expose any hidden concurrency.

Overview

With Intel TSX, the processor executes critical sections, also referred to as transactional regions, transactionally, using a technique known as lock elision. Such an execution only reads the lock and does not acquire it. This makes the critical section available to other threads executing at the same time. However, because the lock no longer serializes access, hardware now ensures that threads access shared data correctly by maintaining the illusion of exclusive access to the data. It does so by tracking the memory addresses accessed and by buffering any memory updates performed within the transactional execution. The processor also checks these accesses against conflicting accesses from other threads. A conflicting access occurs if another thread reads a location that this transactional thread wrote, or another thread writes a location that was accessed (with either a read or a write) by the transactional thread. When such an access occurs, the hardware alone cannot maintain the illusion of exclusive access.

To commit a transactional execution, the hardware ensures that all memory operations performed during the execution appear to occur instantaneously when viewed from other threads. Furthermore, any memory updates during execution become visible to other threads only after a successful commit. Not all transactional executions can be successfully committed. For example, the execution could encounter conflicting data accesses from other threads. In that event, the processor performs a transactional abort. This process discards all updates to memory and registers made during the execution and makes it appear as if the transactional execution never occurred. The subsequent re-execution can retry lock elision or fall back to acquiring the lock.

Because a successful transactional execution ensures an atomic commit, the processor can execute the programmer-specified code section optimistically, without synchronizing through the lock. If synchronization was unnecessary, the execution can commit without any cross-thread serialization.

Programming interface

Intel TSX provides two programming interfaces to specify transactional regions. The first interface, called Hardware Lock Elision (HLE), is a pair of legacy-compatible prefixes called XACQUIRE and XRELEASE. These prefixes appear as NOPs to previous-generation cores. The second interface, called Restricted Transactional Memory (RTM), is a pair of new instructions called XBEGIN and XEND. Programmers who also want to run Intel TSX-enabled software on hardware without Intel TSX support would use the HLE interface to implement lock elision. Programmers who do not have legacy hardware requirements and who deal with more complex locking primitives would use the RTM interface to implement lock elision.

Programmers can use the HLE interface by adding prefixes to the instructions that perform the lock acquire and release. Programmers can use the RTM interface by providing an additional code path to the existing synchronization routines. In this path, instead of acquiring the lock, the routine uses the XBEGIN instruction and provides a fallback handler to execute if a transactional abort occurs. The code also tests the lock variable inside the transactional region to ensure that it's free and to enable the hardware to look for subsequent conflicts. These changes to enable lock elision are localized to the synchronization routines; the application itself does not need to be changed. This substantially eases the enabling of applications to take advantage of Intel TSX.

Intel TSX provides two additional instructions—XTEST and XABORT. The XTEST instruction tests whether a logical processor is executing transactionally, whereas the XABORT instruction can explicitly abort a transactional region.
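A minimal sketch of the RTM code path just described, using the _xbegin/_xend/_xabort/_xtest intrinsics from <immintrin.h> (compile with -mrtm). The spinlock and the unbounded retry policy are our invention for illustration; production code would bound retries and consult the abort status, as the optimization manual describes.

```c
#include <immintrin.h>
#include <stdatomic.h>

typedef struct { atomic_int locked; } spinlock_t;

static void spin_acquire(spinlock_t *l) {
    while (atomic_exchange_explicit(&l->locked, 1, memory_order_acquire))
        ;                                /* spin until the lock is free */
}
static void spin_release(spinlock_t *l) {
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}

/* Try to elide the lock: run the critical section transactionally and
   only fall back to really acquiring the lock if the transaction aborts. */
static void elided_acquire(spinlock_t *l) {
    if (_xbegin() == _XBEGIN_STARTED) {
        /* Read the lock into the transactional read set: if another
           thread later acquires it, this transaction aborts. */
        if (atomic_load_explicit(&l->locked, memory_order_relaxed) == 0)
            return;              /* lock free: proceed transactionally */
        _xabort(0xff);           /* lock already held: abort explicitly */
    }
    spin_acquire(l);             /* fallback path: take the lock        */
}

static void elided_release(spinlock_t *l) {
    if (_xtest())                /* still transactional? commit          */
        _xend();
    else
        spin_release(l);         /* we really hold the lock: release it */
}
```

As the article notes, only the synchronization routines change; callers of elided_acquire and elided_release look exactly like lock-based code.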

Implementation on Haswell

The first implementation of Intel TSX on the fourth-generation core processor uses the first-level 32-Kbyte data cache (L1) to track the memory addresses accessed (both read and written) during a transactional execution and to buffer any transactional updates performed to memory. The implementation makes these updates visible to other threads only on a successful commit. The implementation uses the cache-coherence protocol to detect conflicting accesses from other threads. Because hardware is finite, transactional regions that access excessive state can exceed the hardware buffering. Evicting a transactionally written line from the data cache will cause a transactional abort. However, evicting a transactionally read line does not immediately cause an abort; the hardware moves the line to a secondary structure for subsequent tracking.

The Intel 64 Architecture Software Developer Manual has a detailed specification for Intel TSX,8 and the Intel 64 Architecture Optimization Reference Manual provides detailed guidelines for program optimization with Intel TSX.9 The Intel TSX web resources site (www.intel.com/software/tsx) presents information on various tools and practical guidelines.

Haswell uses an optimized version of Intel 22-nm process technology to provide comprehensive enhancements in power-performance efficiency, power management, form factor and cost, core and uncore microarchitecture enhancements, and new instructions in the core. For example, the core delivers performance enhancements for high-performance computing with the new FMA instructions and for parallel workloads with the new Intel TSX synchronization primitives.10,11 During Haswell's design, we focused on performance, power, and form factors to ensure that Haswell delivers a compelling user experience in new form factors. MICRO

Acknowledgments

We thank Kevin Zhang, Fatih Hamzaoglu, Eric Wang, Ruth Brain and the Intel TMG Organization, Dave Dimarco, Manoj Lal, Steve Kulick, the Haswell architecture team, the entire Intel CCDO team, and Patty Kummrow and the SDG Org for their significant contributions to the success of eDRAM and the Intel Iris Pro Graphics product.

Disclaimer for Figures 6, 7, and 9: Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Results have been simulated and are provided for informational purposes only. Results were derived using simulations run on an architecture simulator or model. Any difference in system hardware or software design or configuration may affect actual performance. Requires a system with Intel Turbo Boost Technology. Intel Turbo Boost Technology and Intel Turbo Boost Technology 2.0 are only available on select Intel processors. Consult your system manufacturer. Performance varies depending on hardware, software, and system configuration. For more information, visit http://www.intel.com/go/turbo. Iris graphics is available on select systems. Consult your system manufacturer.

Optimization notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

References
1. P. Hammarlund, "Intel 4th Generation Core Processor (Haswell)," Hot Chips 25, 2013.
2. N. Siddique et al., "Haswell: A Family of IA 22nm Processors," to be published in Proc. IEEE Int'l Solid-State Circuits Conf., 2014.
3. E. Burton et al., "FIVR—Fully Integrated Voltage Regulators on 4th Generation Intel Core SoCs," to be published in Proc. IEEE Applied Power Electronics Conf. and Exposition, 2014.
4. D. Kanter, "Haswell's FIVR Extends Battery Life," Microprocessor Report, 30 July 2013.
5. R. Brain et al., "eDRAM Process in 22nm Technology," Proc. Symp. VLSI Circuits, 2013, pp. T16-T17.
6. Y. Wang et al., "Retention Time Optimization for eDRAM in 22nm Tri-Gate CMOS Technology," Proc. IEEE Int'l Electron Devices Meeting, 2013, pp. 240-243.
7. F. Hamzaoglu et al., "A 1Gb 2GHz Embedded DRAM in 22nm Tri-Gate CMOS Technology," to be published in Proc. IEEE Int'l Solid-State Circuits Conf., 2014.
8. Intel 64 and IA-32 Architectures Software Developer Manual, Intel, 2013.
9. Intel 64 and IA-32 Architectures Optimization Reference Manual, Intel, 2013.
10. R. Yoo et al., "Performance Evaluation of Intel Transactional Synchronization Extensions for High Performance Computing," Proc. Int'l Conf. High Performance Computing, Networking, Storage and Analysis, 2013, doi:10.1145/2503210.2503232.
11. T. Karnagel et al., "Improving In-Memory Database Index Performance with Intel Transactional Synchronization Extensions," to be published in Proc. 20th Int'l Symp. High-Performance Computer Architecture, 2014.

Per Hammarlund is an Intel Fellow. His research interests include performance modeling, microarchitecture, SMT, power-performance efficiency and modeling, integration, and SoC design. Hammarlund has a PhD in computer science from the Royal Institute of Technology (KTH), Stockholm.

Alberto J. Martinez is a senior principal engineer at Intel and the chief architect for the Embedded Subsystems and IP Group. His research focuses on embedded subsystems software and hardware and PC I/O architectures. Martinez has an MS in electrical engineering from Sacramento State University.

Atiq A. Bajwa is the director of microprocessor architecture at Intel. He is responsible for the architectural definition and development of microprocessors for Intel's mobile, desktop, workstation, and server computing segments. Bajwa has an MS in electrical engineering from Yale University.

David L. Hill is a senior principal engineer at Intel. His research interests include modular high-performance caches, coherent interconnects, and system memory technologies. His team was responsible for the modular uncore architecture sections of the Haswell and Broadwell product families. Hill has a BS in electrical engineering from the University of Minnesota.

Erik Hallnor is a cache and coherent fabric architect at Intel, focusing on client and SoC products. He was the lead architect for the Haswell coherent fabric, consisting of the ring interconnect, LLC, and eDRAM cache integration. Hallnor has a PhD in computer science and engineering from the University of Michigan.

Hong Jiang is an Intel Fellow, the chief media architect for the Platform Engineering Group, and the director of the Visual and Parallel Computing Group's Media Architecture Team at Intel. He leads the media architecture of processor graphics and its derivatives. Jiang has a PhD in electrical engineering from the University of Illinois at Urbana-Champaign.

Martin Dixon is a principal engineer in the Intel Product Development Group, where he's working to develop and enhance the overall instruction set and SoC architecture. Dixon has a BS in electrical and computer engineering from Carnegie Mellon University.

Michael Derr is a principal engineer at Intel. His work focuses on power managing PC I/O architectures. Derr has an MS in electrical engineering from the Georgia Institute of Technology.

Mikal Hunsaker is a senior principal engineer at Intel. His research interests focus on chipset high-speed serial I/O design, including PCI Express, SATA, and USB3. Hunsaker has an MS in electrical engineering from Utah State University.

Rajesh Kumar is a senior fellow, director of circuit and power technologies, and the lead interface to process technology at Intel. For Haswell, he guided the development of power-delivery integration (FIVR), on-package I/O, and the novel process-technology needs. Kumar has a master's degree in electrical engineering from the California Institute of Technology.

Randy B. Osborne is a principal engineer working to improve the performance of memory hierarchies at Intel. His research interests include memory controllers, memory interconnects, memory devices, and large caches, including the introduction of eDRAM into Intel products. Osborne has a PhD in electrical engineering from the Massachusetts Institute of Technology.

Ravi Rajwar is a principal engineer in the Intel Product Development Group, working on various aspects of SoC architecture and development. His research interests include the IA synchronization architecture. Rajwar has a PhD in computer science from the University of Wisconsin–Madison.

Ronak Singhal is a senior principal engineer at Intel. His research interests include server architecture development, ISA development, and performance analysis and modeling. Singhal has an MS in electrical and computer engineering from Carnegie Mellon University.

Reynold D'Sa is vice president of the Platform Engineering Group and general manager of the Devices Development Group at Intel. He leads the design engineering teams responsible for designing and developing SoC products for Intel's next-generation client and mobile platforms, including tablets and smartphones. D'Sa has an MS in electrical engineering from Cornell University.

Robert Chappell is a CPU architect at Intel, where he works to define the performance, power, reliability, and ISA features of the CPU cores used in various products ranging from phones to servers. He was the lead architect for the Haswell memory execution cluster, the Haswell core in its later stages, and the Haswell follow-on core (Broadwell). Chappell has a PhD in computer science from the University of Michigan.

Shiv Kaushik is a fellow at Intel, where he leads the Windows OS Division in Intel's Software and Services Group. His research interests include the design of platform hardware and firmware interfaces to operating systems and software for power management, scaling, performance, and reliability. Kaushik has a PhD in computer science and engineering from Ohio State University.

Srinivas Chennupaty is a CPU and SoC architect at Intel. His research interests include Intel microprocessor architecture and instruction set development. Chennupaty has an MS in computer engineering from the University of Texas at Austin.

Stephan Jourdan is a senior principal engineer at Intel, where he leads the architecture and engineering teams responsible for defining and developing SoCs for device products. He was the chief architect on HSW ULT. Jourdan has a PhD in computer science from the University of Toulouse.

Steve Gunther is a senior principal engineer and a lead power architect for Intel, where he leads a team responsible for defining the power-management architecture for Intel's microprocessor product line. His research interests include power analysis, power management, and power reduction. Gunther has a BS in electrical engineering from Oregon State University.

Tom Piazza is a senior fellow and director of graphics architecture at Intel. His research interests include computer graphics. Piazza has a BS in electrical engineering from the Pratt Institute.

Ted Burton is a senior principal engineer working on advanced technologies in the Devices Development Group at Intel. His research interests include power delivery. Burton has a BS in physics from Brigham Young University.

Direct questions and comments about this article to Per Hammarlund, Intel, 2111 NE 25th Ave., Hillsboro, OR 97124; per.[email protected].