
Intel® Architecture for Software Developers

Agenda

• Introduction
• Processor Architecture Basics
• Intel® Architecture
  . Intel® Core™ and Intel® Xeon®
  . Intel® Atom™
  . Intel® Xeon Phi™ Coprocessor
• Use Cases for Software Developers
  . Intel® Core™ and Intel® Xeon®
  . Intel® Atom™
  . Intel® Xeon Phi™ Coprocessor
• Summary

11/26/2014 Optimization Notice Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.

Processor Chips and Dies

• A chip is the package containing one or more dies (silicon)
• Major components of the die can be easily identified
• Example of a typical die:

[Figure: “Die shot” of Intel® Core™ with labeled components: four cores, graphics, system agent, L3 cache, memory controller, I/O]

Moore’s Law

“The number of transistors on a chip will double approximately every two years.” [Gordon Moore]

[Figure: Moore's Law graph, 1965. Source: http://en.wikipedia.org/wiki/Moore's_law]

Parallelism

Problem: Economical operating frequency of (CMOS) transistors is limited. → No free lunch anymore!

Solution: More transistors allow more gates/logic within the same die area and power envelope, improving parallelism:

. Thread level parallelism (TLP): Multi- and many-core

. Data level parallelism (DLP): Wider vectors (SIMD)

. Instruction level parallelism (ILP): improvements, e.g. threading, superscalarity, …

History Intel® 64 and IA-32 Architecture

• 1978: 8086 and 8088 processors
• 1982: Intel® 286 processor
• 1985: Intel386™ processor
• 1989: Intel486™ processor
• 1993: Intel® Pentium® processor
• 1995-1999: P6 family:
  . Intel Pentium Pro processor
  . Intel Pentium II [Xeon] processor
  . Intel Pentium III [Xeon] processor
  . Intel Celeron® processor
• 2000-2006: Intel® Pentium® 4 processor
• 2005: Intel® Pentium® processor Extreme Edition
• 2001-2007: Intel® Xeon® processor
• 2003-2006: Intel® Pentium® M processor
• 2006/2007: Intel® Core™ Duo and Intel® Core™ Solo processors
• 2006/2007: Intel® Core™2 processor
• 2008: Intel® Atom™ processor and Intel® Core™ i7 processor family
• 2010: Intel® Core™ processor family
• 2011: Second generation Intel® Core™ processor family
• 2012: Third generation Intel® Core™ processor family
• 2013: Fourth generation Intel® Core™ processor family
• 2013: Intel® Atom™ processor family based on Silvermont microarchitecture

Processor Architecture Basics Pipeline

Computation of instructions requires several stages:

[Figure: pipeline divided into front end and back end, with the stages Fetch, Decode, Execute, Load or Store, and Commit; instructions enter at Fetch, and the pipeline accesses registers and memory]

1. Fetch: Read instruction (bytes) from memory
2. Decode: Translate instruction (bytes) into the microarchitecture's internal operations
3. Execute: Perform the operation with a functional unit
4. Memory: Load (read) or store (write) data, if required
5. Commit: Retire instruction and update the architectural state

Processor Architecture Basics Naïve Pipeline: Serial Execution

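[Figure: instructions 1 to 6 waiting to enter the pipeline (Fetch, Decode, Execute, Load or Store, Commit) one at a time, with registers and memory; t_p marks the time for the entire pipeline]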

Characteristics of strict serial execution:
• Only one instruction for the entire pipeline (all stages)
• Low complexity
• Execution time: n_instructions * t_p (t_p: time for one instruction to pass through all stages)

Problem:
• Inefficient because only one stage is active at a time

Processor Architecture Basics Pipeline Execution

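[Figure: instructions 1 to 6 overlapping in the pipeline (Fetch, Decode, Execute, Load or Store, Commit), one per stage, with registers and memory; t_s marks the time of a single stage]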

Characteristics of pipeline execution:
• Multiple instructions for the entire pipeline (one per stage)
• Efficient because all stages are kept active at every point in time
• Execution time: approximately n_instructions * t_s once the pipeline is full (t_s: time of a single stage)
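The serial and pipelined execution-time estimates can be compared numerically. The following is a minimal sketch, assuming t_p is the time for one instruction to pass through all stages and t_s is the time of a single stage. The stage count, timings, and function names are illustration values, not taken from the slides, and the pipelined estimate adds the fill time of the first instruction, which the simple n * t_s formula neglects.

```python
def serial_time(n_instructions, t_p):
    """Strict serial execution: each instruction occupies the whole pipeline."""
    return n_instructions * t_p


def pipelined_time(n_instructions, t_s, n_stages):
    """Ideal pipelined execution without stalls.

    The first instruction needs n_stages * t_s to pass through; after that,
    one further instruction completes every t_s.
    """
    if n_instructions == 0:
        return 0
    return (n_stages + n_instructions - 1) * t_s


# Assumed example values: 5 stages of 1 cycle each.
N_STAGES = 5
T_S = 1                 # time per stage, in cycles
T_P = N_STAGES * T_S    # time for one instruction to pass all stages

n = 1000
t_serial = serial_time(n, T_P)
t_pipe = pipelined_time(n, T_S, N_STAGES)
print(f"serial:    {t_serial} cycles")
print(f"pipelined: {t_pipe} cycles")
print(f"speedup:   {t_serial / t_pipe:.2f}x")
```

For long instruction streams the speedup approaches the number of stages, which is why the fill time is usually ignored.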

Problem:
• Reality check: What happens if t_s is not constant?

Processor Architecture Basics Pipeline Stalls

1: mov $0x1, %rax
2: mov 0x1234, %rbx
3: add %rbx, %rax
4: …
5: …
6: …

[Figure: instructions 1 to 6 in the pipeline (Fetch, Decode, Execute, Load or Store, Commit) with registers and memory; the time axis is marked with t_p and t_s]

Pipeline stalls:
• Caused by pipeline stages that take longer than a cycle
• Caused by dependencies: order has to be maintained
• Execution time: n_instructions * t_avg with t_s ≤ t_avg ≤ t_p
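The effect of stalls on the average time per instruction can be illustrated with a toy model. This is only a sketch under stated assumptions: each instruction costs one stage time plus a number of extra stall cycles, and the stall values below (a 3-cycle wait for the load in instruction 2) are invented for illustration, not measured.

```python
def average_time_per_instruction(stall_cycles, t_s=1):
    """Average time per instruction in a toy in-order pipeline.

    stall_cycles[i] is the number of extra cycles during which
    instruction i holds up the pipeline (0 means no stall).
    """
    if not stall_cycles:
        raise ValueError("need at least one instruction")
    total = sum(t_s + stall for stall in stall_cycles)
    return total / len(stall_cycles)


# Six instructions as in the listing above; only the load from memory
# (instruction 2) is assumed to stall, for 3 extra cycles.
stalls = [0, 3, 0, 0, 0, 0]
T_S = 1   # time of a single stage, in cycles
T_P = 5   # assumed time for the whole pipeline (5 stages)

t_avg = average_time_per_instruction(stalls, T_S)
print(f"t_avg = {t_avg} cycles per instruction")
print(f"within bounds: {T_S <= t_avg <= T_P}")
```

With no stalls the model yields t_avg = t_s. If every instruction stalled for the full remaining pipeline time, it would yield t_avg = t_p.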

Problem:
• Stalls reduce pipeline throughput and leave stages idle.

Processor Architecture Basics Reducing Impact of Pipeline Stalls

Impact of pipeline stalls can be reduced by:
• Branch prediction
• Superscalarity + multiple issue fetch & decode
• Out of Order execution
• Cache
• Non-temporal stores
• Prefetching
• Line fill buffers
• Load/Store buffers
• Alignment
• Simultaneous Multithreading (SMT)

→ Characteristics of the architecture that might require user action!

Processor Architecture Basics Superscalarity

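[Figure: superscalar pipeline: instructions 1 to 6 pass a 2-issue Fetch/Decode stage and a Schedule stage, which feeds two ports (Port 0 and Port 1), each with Execute (Exe), Load/Store (LS) and Commit (Com)]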

Characteristics of a superscalar architecture:
• Improves throughput by covering latency (ports are independent)
• Ports can have different functionalities (floating point, integer, addressing, …)
• Requires multiple issue fetch & decode (here: 2 issue)
• Execution time: n_instructions * t_avg / n_ports

Problem:
• More complex and prone to stalls in case of dependencies
→ Solution: Out of Order Execution
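Why dependencies keep a superscalar design from reaching the ideal n_instructions * t_avg / n_ports can be shown with a toy in-order, multi-issue model. This is a sketch only: every instruction is assumed to take one cycle, and the function name and dependency encoding are hypothetical.

```python
def in_order_issue_cycles(deps, n_ports):
    """Count issue cycles for an in-order, multi-issue toy machine.

    deps[i] is the set of earlier instruction indices that instruction i
    depends on. Every instruction takes one cycle. An instruction cannot
    issue in the same cycle as an instruction it depends on, and issue
    stops at the first instruction that cannot go (in-order).
    """
    cycles = 0
    i = 0
    n = len(deps)
    while i < n:
        issued = set()
        # Fill the ports in program order until a dependency blocks.
        while i < n and len(issued) < n_ports and not (deps[i] & issued):
            issued.add(i)
            i += 1
        cycles += 1
    return cycles


# Six independent instructions on two ports: the ideal 6 * 1 / 2 = 3 cycles.
independent = [set() for _ in range(6)]
print(in_order_issue_cycles(independent, n_ports=2))

# A short dependency chain (1 -> 2 -> 3) at the front costs an extra cycle.
chained = [set(), {0}, {1}, set(), set(), set()]
print(in_order_issue_cycles(chained, n_ports=2))
```

The second case leaves a port idle in the first two cycles, although independent instructions are waiting further back in the stream.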

Processor Architecture Basics Out of Order Execution

1: mov $0x1, %rax
2: mov $0x2, %rbx
3: add %rbx, %rax
4: add $0x0, %r8
5: add $0x1, %r9
6: add $0x2, %r10

[Figure: instructions 1 to 6 in an out-of-order pipeline: Fetch, Decode, I-Queue, Dispatch, Execute, Load or Store, Reorder (ROB), Commit]

Characteristics of out of order (OOO) execution:
• Instruction queue (I-Queue) moves stalling instructions out of pipeline
• Reorder buffer (ROB) maintains correct order of committing instructions
• Reduces pipeline stalls, but not entirely!
• Speculative execution possible
• Opposite of OOO execution is in order execution
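The difference between in-order and out-of-order issue can be sketched with a small simulation. It is a toy model under loud assumptions: latencies are invented (instruction 2 is given 4 cycles purely to create a stall), the commit stage and ROB are not modeled, and the function name and encoding are hypothetical.

```python
def simulate(latency, deps, n_ports, out_of_order):
    """Return the cycle in which the last result becomes available.

    latency[i]: cycles until the result of instruction i is available (>= 1).
    deps[i]:    set of earlier instruction indices whose results i needs.
    Each cycle, up to n_ports ready instructions are issued, oldest first.
    In-order issue stops at the first instruction that is not ready;
    out-of-order issue skips it and tries younger instructions.
    """
    finish = [None] * len(latency)
    waiting = list(range(len(latency)))   # program order
    cycle = 0
    while waiting:
        issued = 0
        for i in list(waiting):
            if issued == n_ports:
                break
            ready = all(finish[d] is not None and finish[d] <= cycle
                        for d in deps[i])
            if ready:
                finish[i] = cycle + latency[i]
                waiting.remove(i)
                issued += 1
            elif not out_of_order:
                break   # a stalled instruction blocks all younger ones
        cycle += 1
    return max(finish, default=0)


# Instruction 3 needs the results of instructions 1 and 2 (indices 0 and 1);
# instructions 4 to 6 are independent.
latency = [1, 4, 1, 1, 1, 1]
deps = [set(), set(), {0, 1}, set(), set(), set()]

print("in order:    ", simulate(latency, deps, 2, out_of_order=False))
print("out of order:", simulate(latency, deps, 2, out_of_order=True))
```

Out-of-order issue hides the wait behind the independent instructions 4 to 6, but instruction 3 still has to wait for its operands, so the stall is reduced rather than removed.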

Processor Architecture Basics Cache

1: mov $0x1, %rax
2: mov 0x1234, %rbx
3: mov 0x1234, %rcx
4: …
5: …
6: …

[Figure: instructions 1 to 6 in the pipeline (Fetch, Decode, Execute, Load or Store, Commit) with registers, cache and memory; accesses to address 0x1234 are marked as cache miss and cache hit]

Cache:
• Small, some KiB
• Faster than main memory
• Cache hierarchy: L1, L2 & L3 (LLC)
• Reused data can remain in cache (here: 0x1234)

Problem:
• Maintain data coherency → Cache coherency protocol (e.g. MESI)

Processor Architecture Basics Cache-Hierarchy & Cache-Line

1: mov 0x1234, %rbx
2: mov 0x5678, %rcx
3: mov %rcx, 0x1234
4: …
5: …
6: …

[Figure: instructions 1 to 6 in the pipeline (Fetch, Decode, Execute, Load or Store, Commit) above a cache hierarchy of L1 caches (I and D), L2 cache, L3 cache and memory; accesses to addresses 0x1234 and 0x5678 are marked as cache misses and cache hits at the individual levels]

Cache-hierarchy:
• For data and instructions
• Usually inclusive caches
• Races for resources
• Can improve access speed
• Cache-misses & cache-hits

Cache-line:
• Always full 64 byte block
• Minimal granularity of every load/store
• Modifications invalidate entire cache-line (dirty bit)

Processor Architecture Basics Prefetching

Act ahead of time:
• Continuous data streams could stutter due to caching
• Two solutions to avoid this:
  . Hardware prefetch:
    • Processor memory management tries to detect patterns and automatically loads cache-lines before access.
    • As with branch prediction, the behavior depends on the underlying architecture and processor generation.
  . Software prefetch:
    • Programmer can manually add PREFETCH instructions to selectively load cache-lines before access.
    • Software prefetching also needs to be adjusted for each architecture and generation.

• Prefetching only helps to smooth the flow; it won’t increase bandwidth!

→ Consider tuning prefetching only as the very last step of application tuning!

Processor Architecture Basics Example: 4th Generation Intel® Core™

[Figure, from the Intel® 64 and IA-32 Architectures Optimization Reference Manual: block diagram of the core pipeline. Front end: Fetch & Pre-Decode with instruction cache (I), BTB and BPU; I-Queue; Decode & Dispatch. Back end: ROB and load/store (L/S) buffers; scheduler issuing to eight ports. Ports 0, 1 and 5: Exe [Int] [FP]; port 6: Exe [Int]; port 4: Store; ports 2 and 3: Load & Store Address; port 7: Store Address. Memory side: data cache (D), LFB, L2 cache]

Processor Architecture Basics Core vs. Uncore

• Core: Processor core’s logic:
  . Execution units
  . Core caches (L1/L2)
  . Buffers & registers
  . …
• Uncore: All outside a processor core:
  . Memory controller/channels (MC) and Intel® QuickPath Interconnect (QPI)
  . L3 cache shared by all cores
  . Type of memory
  . Power management and clocking
  . Optionally: Integrated graphics

[Figure: processor with cores 1 to n and an uncore containing L3 cache, MC, QPI, clock & power, and graphics; memory attached as DDR3 or DDR4]

→ Only the uncore differs within the same processor family!

Processor Architecture Basics UMA and NUMA

[Figure: two sockets (Processor 1 and Processor 2), each with its own memory attached via a memory controller (MC); the processors are linked to each other via QPI]

• UMA (aka. non-NUMA):
  . Uniform Memory Access (UMA)
  . Addresses interleaved across memory nodes by cache line
  . Accesses may or may not have to cross QPI link
  → Provides good portable performance without tuning

• NUMA:
  . Non-Uniform Memory Access (NUMA)
  . Addresses not interleaved across memory nodes by cache line
  . Each processor has direct access to contiguous block of memory
  → Provides peak performance but requires special handling

[Figure: system memory map for UMA (interleaved) and NUMA (contiguous blocks)]
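The two address layouts can be contrasted with a simplified mapping from a physical address to a memory node. This is a sketch, not how any particular chipset decodes addresses: the 64 byte cache line is taken from the slides, while the node count, the memory size per node and the function names are assumptions.

```python
CACHE_LINE_BYTES = 64   # cache line size stated in the slides


def uma_node(address, n_nodes):
    """UMA-style layout: consecutive cache lines alternate between nodes."""
    return (address // CACHE_LINE_BYTES) % n_nodes


def numa_node(address, n_nodes, bytes_per_node):
    """NUMA-style layout: each node owns one contiguous block of addresses."""
    node = address // bytes_per_node
    if node >= n_nodes:
        raise ValueError("address outside of installed memory")
    return node


# Assumed system: 2 sockets with 1 GiB of memory each.
N_NODES = 2
BYTES_PER_NODE = 1 << 30

for address in (0, 64, 128, 192):
    print(f"{address:#06x}: UMA node {uma_node(address, N_NODES)}, "
          f"NUMA node {numa_node(address, N_NODES, BYTES_PER_NODE)}")
```

In the interleaved layout a streaming access touches both nodes evenly regardless of where the thread runs. In the contiguous layout all four lines live on node 0, so a thread on socket 1 benefits only if it stays close to that memory.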


Processor Architecture Basics NUMA - Thread Affinity & Enumeration

Non-NUMA: Thread affinity might be beneficial (e.g. cache locality) but not required

NUMA: Thread affinity is required:
• Improve accesses to local memory vs. remote memory
• Ensure 3rd party components support affinity mapping, e.g.:
  . Intel® TBB via set_affinity()
  . Intel® OpenMP* via $OMP_PLACES
  . Intel® MPI via $I_MPI_PIN_DOMAIN
  . …
• Right way to get enumeration of cores: Intel® 64 Architecture Processor Topology Enumeration
  https://software.intel.com/en-us/articles/intel-64-architecture-processor-topology-enumeration
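Affinity can also be set directly through the operating system. The sketch below uses Python's `os.sched_setaffinity`, which is available on Linux only and is not one of the mechanisms listed above. The helper name `pin_current_process` is hypothetical, and the logical CPU numbers it accepts are OS indices, not the topology enumeration described in the linked article.

```python
import os


def pin_current_process(cpu):
    """Restrict the calling process to one logical CPU (Linux only).

    Returns the resulting affinity set, or None on platforms where the
    scheduler affinity calls are not available.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    allowed = os.sched_getaffinity(0)   # 0 = the calling process
    if cpu not in allowed:
        raise ValueError(f"CPU {cpu} is not in the allowed set {sorted(allowed)}")
    os.sched_setaffinity(0, {cpu})
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    if hasattr(os, "sched_getaffinity"):
        original = os.sched_getaffinity(0)
        print("before:", sorted(original))
        print("after: ", sorted(pin_current_process(min(original))))
        os.sched_setaffinity(0, original)   # restore the original mask
    else:
        print("scheduler affinity is not supported on this platform")
```

On a NUMA system the pinning only pays off when the memory the thread uses is also allocated on the node that owns the chosen CPU.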

Processor Architecture Basics NUMA - Memory, Bandwidth & Latency

Memory allocation:
• Differentiate: implicit vs. explicit memory allocation
• Explicit allocation with NUMA aware libraries, e.g. libnuma (Linux*)
• Bind memory to (SW) thread, and (SW) thread to processor
• More information on optimizing for performance:
  https://software.intel.com/de-de/articles/optimizing-applications-for-numa

[Figure: two sockets (Processor 1 and Processor 2), each with its own memory attached via MC and linked via QPI; access to the own memory is marked as local access, access to the other socket's memory as remote access]

Performance:
• Remote memory access latency ~1.7x greater than local memory
• Local memory bandwidth can be up to ~2x greater than remote

Clock Rate and Power Gating

Control power utilization and performance:
• Clock rate:
  . Idle components can be clocked lower: Save energy → Intel SpeedStep®
  . If thermal specification allows, components can be over-clocked: Higher performance → Intel® Turbo Boost Technology
  . Base frequency at P1; P0 might be turbo; the processor decides, depending on how many cores/GPU are active
  . Intel® Turbo Boost Technology 2.0: processor can decide whether turbo mode can exceed the TDP with higher frequencies

• Power gating: Turn off components to save power

• P-state: Processor state; low latency; combined with Intel SpeedStep®
• C-state: chip state; longer latency, more aggressive static power reduction

Documentation

Intel® 64 and IA-32 Architectures Software Developer Manuals:
• Intel® 64 and IA-32 Architectures Software Developer’s Manuals
  . Volume 1: Basic Architecture
  . Volume 2: Instruction Set Reference
  . Volume 3: System Programming Guide
• Software Optimization Reference Manual
• Related Specifications, Application Notes, and White Papers

https://www-ssl.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html?iid=tech_vt_tech+64-32_manuals

Intel® Processor Numbers (how type names are encoded): http://www.intel.com/products/processor_number/eng/

Desktop, Mobile & Server “Big Core”

Desktop, Mobile & Server Tick/Tock Model

Tock: Innovate (new microarchitecture)
Tick: Shrink (new process technology)

Microarchitecture   Year   Step   Change                   Process   Processor family
Nehalem             2008   Tock   New microarchitecture    45nm      Intel® Core™
Westmere            2010   Tick   New process technology   32nm      Intel® Core™
Sandy Bridge        2011   Tock   New microarchitecture    32nm      2nd Generation Intel® Core™
Ivy Bridge          2012   Tick   New process technology   22nm      3rd Generation Intel® Core™
Haswell             2013   Tock   New microarchitecture    22nm      4th Generation Intel® Core™

Future:
Broadwell           2014   Tick   New process technology   14nm
Skylake                    Tock   New microarchitecture    14nm
(not named)                Tick   New process technology   11nm
(not named)                Tock   New microarchitecture    11nm

Desktop, Mobile & Server Characteristics

• Processor core:
  . 4 issue
  . Superscalar out-of-order execution
  . Simultaneous multithreading: Intel® Hyper-Threading Technology with 2 HW threads per core

• Multi-core:
  . Intel® Core™ processor family: up to 4 cores (desktop & mobile)
  . Intel® Xeon® processor family: up to 15 cores (server)

• Caches:
  . Three level cache hierarchy – L1/L2/L3 (Nehalem and later)
  . 64 byte cache line
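The 64 byte cache line determines which addresses share a line and how many lines a single access touches. The following minimal sketch shows the arithmetic; the helper names are hypothetical, and the addresses are chosen only for illustration.

```python
CACHE_LINE_BYTES = 64   # cache line size stated above


def cache_line_base(address):
    """Start address of the cache line that contains the given address."""
    return address & ~(CACHE_LINE_BYTES - 1)


def lines_touched(address, size):
    """Number of cache lines covered by an access of `size` bytes."""
    first = address // CACHE_LINE_BYTES
    last = (address + size - 1) // CACHE_LINE_BYTES
    return last - first + 1


# 0x1234 lies 52 bytes into the line starting at 0x1200.
print(hex(cache_line_base(0x1234)))

# An 8 byte access at 0x1234 stays within one line ...
print(lines_touched(0x1234, 8))
# ... a 16 byte access at the same address crosses into the next line ...
print(lines_touched(0x1234, 16))
# ... and the same 16 bytes aligned to a line boundary need only one line.
print(lines_touched(0x1240, 16))
```

Accesses that straddle a line boundary move two full lines for a few bytes of data, which is one reason why alignment matters for load and store performance.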

Desktop, Mobile & Server Caches

Cache hierarchy:

[Figure: processor with cores 1 to n; each core has its own L1 instruction (I) and data (D) caches and its own L2 cache; all cores share the L3 cache]

Example for Haswell:

Level                              Latency (cycles)                  Bandwidth (per core per cycle)   Size
L1-D                               4                                 2x 16 bytes                      32KiB
L2 (unified)                       12                                1x 32 bytes                      256KiB
L3 (LLC)                           26-31                             1x 32 bytes                      varies (≥ 2MiB per core)
L2 and L1 D-Cache in other cores   43 (clean hit), 60 (dirty hit)

Desktop, Mobile & Server Performance

• Following Moore’s Law:

Microarchitecture   Instruction Set   SP FLOPs per Cycle   DP FLOPs per Cycle   L1 Cache Bandwidth            L2 Cache Bandwidth
                                      per Core             per Core             (bytes/cycle)                 (bytes/cycle)
Nehalem             SSE (128-bits)    8                    4                    32 (16B read + 16B write)     32
Sandy Bridge        AVX (256-bits)    16                   8                    48 (32B read + 16B write)     32
Haswell             AVX2 (256-bits)   32                   16                   96 (64B read + 32B write)     64

• Example of theoretical peak FLOP rates:
  . Intel® Core™ i7-2710QE (Sandy Bridge): 2.1 GHz * 16 SP FLOPs * 4 cores = 134.4 SP GFLOP/s
  . Intel® Core™ i7-4765T (Haswell): 2.0 GHz * 32 SP FLOPs * 4 cores = 256 SP GFLOP/s
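The peak-rate arithmetic above is frequency times FLOPs per cycle per core times number of cores. A minimal sketch that reproduces both examples follows; the per-cycle values come from the performance table in this section, while the function and dictionary names are hypothetical.

```python
# Single-precision FLOPs per cycle per core, as listed in the table above.
SP_FLOPS_PER_CYCLE = {
    "Nehalem": 8,
    "Sandy Bridge": 16,
    "Haswell": 32,
}


def peak_sp_gflops(freq_ghz, microarchitecture, n_cores):
    """Theoretical single-precision peak in GFLOP/s.

    Assumes all cores run at freq_ghz and issue the maximum number of
    floating-point operations in every cycle, which real code rarely does.
    """
    return freq_ghz * SP_FLOPS_PER_CYCLE[microarchitecture] * n_cores


print(f"i7-2710QE: {peak_sp_gflops(2.1, 'Sandy Bridge', 4):.1f} SP GFLOP/s")
print(f"i7-4765T:  {peak_sp_gflops(2.0, 'Haswell', 4):.1f} SP GFLOP/s")
```

The result is an upper bound. It ignores turbo frequencies, memory bandwidth and any instruction that is not a full-width floating-point operation.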

Thank you!

Legal Disclaimer & Optimization Notice

INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright © 2014, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.

Optimization Notice

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804
