Intel Architecture for Software Developers

Agenda
• Introduction
• Processor Architecture Basics
• Intel® Architecture
  . Intel® Core™ and Intel® Xeon®
  . Intel® Atom™
  . Intel® Xeon Phi™ Coprocessor
• Use Cases for Software Developers
  . Intel® Core™ and Intel® Xeon®
  . Intel® Atom™
  . Intel® Xeon Phi™ Coprocessor
• Summary

11/26/2014
Optimization Notice
Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.

Processor Chips and Dies
• A chip is the package containing one or more dies (silicon)
• Major components of the die can be easily identified
• Example of a typical die:
[Figure: “Die shot” of Intel® Core™ showing four cores, graphics, system agent, L3 cache, memory controller, and I/O]

Moore’s Law
“The number of transistors on a chip will double approximately every two years.” [Gordon Moore]
[Figure: Moore's Law graph, 1965]
Source: http://en.wikipedia.org/wiki/Moore's_law

Parallelism
Problem: Economical operation frequency of (CMOS) transistors is limited. No free lunch anymore!
Solution: More transistors allow more gates/logic in the same die space and power envelope, improving parallelism:
  . Thread level parallelism (TLP): Multi- and many-core
  . Data level parallelism (DLP): Wider vectors (SIMD)
  . Instruction level parallelism (ILP): Microarchitecture improvements, e.g. threading, superscalarity, …
History
Intel® 64 and IA-32 Architecture
• 1978: 8086 and 8088 processors
• 1982: Intel® 286 processor
• 1985: Intel386™ processor
• 1989: Intel486™ processor
• 1993: Intel® Pentium®
• 1995-1999: P6 family:
  . Intel Pentium Pro processor
  . Intel Pentium II [Xeon] processor
  . Intel Pentium III [Xeon] processor
  . Intel Celeron® processor
• 2000-2006: Intel® Pentium® 4 processor
• 2005: Intel® Pentium® processor Extreme Edition
• 2001-2007: Intel® Xeon® processor
• 2003-2006: Intel® Pentium® M processor
• 2006/2007: Intel® Core™ Duo and Intel® Core™ Solo processors
• 2006/2007: Intel® Core™2 processor
• 2008: Intel® Atom™ processor and Intel® Core™ i7 processor family
• 2010: Intel® Core™ processor family
• 2011: Second generation Intel® Core™ processor family
• 2012: Third generation Intel® Core™ processor family
• 2013: Fourth generation Intel® Core™ processor family
• 2013: Intel® Atom™ processor family based on Silvermont microarchitecture

Processor Architecture Basics
Pipeline
Computation of instructions requires several stages:
[Figure: pipeline stages Fetch, Decode, Execute, Load or Store, and Commit, split into a front end and a back end, operating on instructions, registers, and memory]
1. Fetch: Read instruction (bytes) from memory
2. Decode: Translate instruction (bytes) into the microarchitecture's internal operations
3. Execute: Perform the operation with a functional unit
4. Memory: Load (read) or store (write) data, if required
5. Commit: Retire instruction and update micro-architectural state
Processor Architecture Basics
Naïve Pipeline: Serial Execution
[Figure: instructions 1 to 6 passing one at a time through Fetch, Decode, Execute, Load or Store, and Commit; t_p spans the whole pipeline]
Characteristics of strict serial execution:
• Only one instruction in the entire pipeline (all stages) at a time
• Low complexity
• Execution time: n_instructions * t_p
Problem:
• Inefficient because only one stage is active at a time

Processor Architecture Basics
Pipeline Execution
[Figure: instructions 1 to 6 occupying the stages Fetch, Decode, Execute, Load or Store, and Commit simultaneously; t_s spans a single stage]
Characteristics of pipeline execution:
• Multiple instructions in the pipeline at once (one per stage)
• Efficient because all stages are kept active at every point in time
• Execution time: n_instructions * t_s
Problem:
• Reality check: What happens if t_s is not constant?

Processor Architecture Basics
Pipeline Stalls
  1: mov $0x1, %rax
  2: mov 0x1234, %rbx
  3: add %rbx, %rax
  4: …
  5: …
  6: …
[Figure: instructions 1 to 6 in the pipeline stages Fetch, Decode, Execute, Load or Store, and Commit, with t_p and t_s marked]
Pipeline stalls:
• Caused by pipeline stages that take longer than a cycle
• Caused by dependencies: order has to be maintained
• Execution time: n_instructions * t_avg with t_s ≤ t_avg ≤ t_p
Problem:
• Stalls slow down pipeline throughput and leave stages idle.
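The three execution-time formulas from the pipeline slides can be compared with a small model. This is a minimal sketch in Python, not taken from the slides: the function names, the example values, and the linear interpolation used for t_avg are illustrative assumptions. Like the slide formulas, it ignores the time needed to fill the pipeline.

```python
def serial_time(n_instructions, t_p):
    """Strict serial execution: every instruction takes the full pipeline time t_p."""
    return n_instructions * t_p


def pipelined_time(n_instructions, t_s):
    """Ideal pipelining: one instruction completes every stage time t_s."""
    return n_instructions * t_s


def stalled_time(n_instructions, t_s, t_p, stall_fraction):
    """Pipelining with stalls.

    Assumption (not from the slides): t_avg is interpolated linearly between
    t_s (stall_fraction = 0) and t_p (stall_fraction = 1), which keeps
    t_s <= t_avg <= t_p as the slide requires.
    """
    if not 0.0 <= stall_fraction <= 1.0:
        raise ValueError("stall_fraction must be between 0 and 1")
    t_avg = t_s + stall_fraction * (t_p - t_s)
    return n_instructions * t_avg


if __name__ == "__main__":
    n, t_s, t_p = 1000, 1, 5  # hypothetical: 5 stages of 1 cycle each
    print("serial:   ", serial_time(n, t_p))
    print("pipelined:", pipelined_time(n, t_s))
    print("stalled:  ", stalled_time(n, t_s, t_p, 0.25))
```

With these assumed numbers the stalled pipeline lands between the two extremes, which is the point of the t_s ≤ t_avg ≤ t_p bound.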
Processor Architecture Basics
Reducing Impact of Pipeline Stalls
Impact of pipeline stalls can be reduced by:
• Branch prediction
• Superscalarity + multiple issue fetch & decode
• Out of Order execution
• Cache
• Non-temporal stores
• Prefetching
• Line fill buffers
• Load/Store buffers
• Alignment
• Simultaneous Multithreading (SMT)
Characteristics of the architecture that might require user action!

Processor Architecture Basics
Superscalarity
[Figure: a 2-issue Fetch/Decode stage feeding a scheduler that distributes instructions 1 to 6 to Port 0 and Port 1, each with its own Execute, Load/Store, and Commit stages]
Characteristics of a superscalar architecture:
• Improves throughput by covering latency (ports are independent)
• Ports can have different functionalities (floating point, integer, addressing, …)
• Requires multiple issue fetch & decode (here: 2 issue)
• Execution time: n_instructions * t_avg / n_ports
Problem:
• More complex and prone to stalls when instructions depend on each other
Solution: Out of Order Execution

Processor Architecture Basics
Out of Order Execution
  1: mov $0x1, %rax
  2: mov $0x2, %rbx
  3: add %rbx, %rax
  4: add $0x0, %r8
  5: add $0x1, %r9
  6: add $0x2, %r10
[Figure: pipeline extended with Dispatch and Reorder stages, an instruction queue (I-Queue), and a reorder buffer (ROB)]
Characteristics of out of order (OOO) execution:
• Instruction queue (I-Queue) moves stalling instructions out of pipeline
• Reorder buffer (ROB) maintains correct order of committing instructions
• Reduces pipeline stalls, but not entirely!
• Speculative execution possible
• Opposite of OOO execution is in order execution
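The benefit of out of order execution over in order execution can be illustrated with a toy scheduler. The following Python sketch is an assumption-laden model, not the mechanism of any Intel processor: it issues at most one instruction per cycle, assumes register renaming (only true read-after-write dependencies count), does not model the ROB's in-order commit, and gives the load in instruction 2 a hypothetical latency of 4 cycles. The program combines the dependent `add` from the slides with independent instructions behind it.

```python
def producers(program):
    """For each instruction, list the indices of earlier instructions it reads from."""
    last_writer = {}
    deps = []
    for idx, (dest, srcs, _latency) in enumerate(program):
        deps.append([last_writer[s] for s in srcs if s in last_writer])
        last_writer[dest] = idx
    return deps


def in_order_cycles(program):
    """Issue strictly in program order; a waiting instruction blocks all later ones."""
    deps = producers(program)
    done = []  # cycle at which each instruction's result is available
    cycle = 0
    for idx, (_dest, _srcs, latency) in enumerate(program):
        issue = max([cycle] + [done[d] for d in deps[idx]])
        done.append(issue + latency)
        cycle = issue + 1
    return max(done)


def out_of_order_cycles(program):
    """Each cycle, issue the oldest instruction whose inputs are already available."""
    deps = producers(program)
    done = {}
    waiting = list(range(len(program)))
    cycle = 0
    while waiting:
        for idx in waiting:
            if all(d in done and done[d] <= cycle for d in deps[idx]):
                done[idx] = cycle + program[idx][2]
                waiting.remove(idx)
                break  # single issue: at most one instruction per cycle
        cycle += 1
    return max(done.values())


# (destination, sources, latency in cycles); latencies are assumed values.
PROGRAM = [
    ("rax", [], 1),              # 1: mov $0x1, %rax
    ("rbx", [], 4),              # 2: mov 0x1234, %rbx  (load, assumed slow)
    ("rax", ["rbx", "rax"], 1),  # 3: add %rbx, %rax    (waits for 1 and 2)
    ("r8", ["r8"], 1),           # 4: add $0x0, %r8     (independent)
    ("r9", ["r9"], 1),           # 5: add $0x1, %r9     (independent)
    ("r10", ["r10"], 1),         # 6: add $0x2, %r10    (independent)
]

if __name__ == "__main__":
    print("in order:    ", in_order_cycles(PROGRAM), "cycles")
    print("out of order:", out_of_order_cycles(PROGRAM), "cycles")
```

In this model the in-order version idles while instruction 3 waits for the load, whereas the out-of-order version fills those cycles with instructions 4 to 6. The stall caused by the load itself is still paid once, matching the slide's remark that OOO reduces stalls but not entirely.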
Processor Architecture Basics
Cache
  1: mov $0x1, %rax
  2: mov 0x1234, %rbx
  3: mov 0x1234, %rcx
  4: …
  5: …
  6: …
[Figure: instructions 1 to 6 in the pipeline, with a cache between registers and memory; accesses to 0x1234 are marked as cache-miss and cache-hit]
Cache:
• Small, some KiB
• Faster than main memory
• Cache hierarchy: L1, L2 & L3 (LLC)
• Reused data can remain in cache (here: 0x1234)
Problem:
• Maintain data coherency
Solution: Cache coherency protocol (e.g. MESI)

Processor Architecture Basics
Cache-Hierarchy & Cache-Line
  1: mov 0x1234, %rbx
  2: mov 0x5678, %rcx
  3: mov %rcx, 0x1234
  4: …
  5: …
  6: …
[Figure: instructions 1 to 6 accessing 0x1234 and 0x5678 through the L1 instruction (I) and data (D) caches, the L2 cache, the L3 cache, and memory, with cache-misses and cache-hits marked at each level]
Cache-hierarchy:
• For data and instructions
• Usually inclusive caches
• Races for resources
• Can improve access speed
• Cache-misses & cache-hits
Cache-line:
• Always full 64 byte block
• Minimal granularity of every load/store
• Modifications invalidate entire cache-line (dirty bit)

Processor Architecture Basics
Prefetching
Act ahead of time:
• Continuous data streams could stutter due to caching
• Two solutions to avoid this:
  . Hardware prefetch:
    • Processor memory management tries to detect patterns and automatically load cache-lines before access.
    • As with branch prediction, the behavior depends on the underlying architecture and processor generation.
  . Software prefetch:
    • Programmer can manually add PREFETCH instructions to selectively load cache-lines before access.
    • Software prefetching also needs to be adjusted for each architecture and generation.
• Prefetching only helps smooth the flow; it won’t increase bandwidth!
Consider tuning prefetching only as the very last step of application tuning!

Processor Architecture
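The cache-hit and cache-miss behavior and the 64-byte cache-line granularity from the cache slides can be sketched with a toy model. This Python example is only an illustration under stated assumptions: the class name `DirectMappedCache`, the direct-mapped organization, and the size of 8 lines are hypothetical simplifications and do not describe how the caches of any particular Intel processor are organized.

```python
LINE_SIZE = 64  # bytes per cache-line, as stated on the cache-line slide


class DirectMappedCache:
    """Toy single-level cache; direct mapping is an assumption of this sketch."""

    def __init__(self, num_lines=8):
        self.num_lines = num_lines
        self.tags = [None] * num_lines  # which memory line each slot holds
        self.hits = 0
        self.misses = 0

    def access(self, address):
        """Return True on a cache-hit, False on a cache-miss (the line is then loaded)."""
        line_number = address // LINE_SIZE     # all bytes of one line share this number
        index = line_number % self.num_lines   # slot the line maps to
        if self.tags[index] == line_number:
            self.hits += 1
            return True
        self.tags[index] = line_number         # load the full 64-byte line
        self.misses += 1
        return False


if __name__ == "__main__":
    cache = DirectMappedCache()
    for address in (0x1234, 0x1234, 0x1238, 0x5678, 0x1234):
        result = "hit" if cache.access(address) else "miss"
        print(hex(address), result)
    print("hits:", cache.hits, "misses:", cache.misses)
```

The first access to 0x1234 misses and loads the whole line, so the repeated access and the neighboring address 0x1238 both hit. This is why data reuse and contiguous access patterns pay off.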
