Intel Architecture for Software Developers

Agenda
• Introduction
• Processor Architecture Basics
• Intel® Architecture
  • Intel® Core™ and Intel® Xeon®
  • Intel® Atom™
  • Intel® Xeon Phi™ Coprocessor
• Use Cases for Software Developers
  • Intel® Core™ and Intel® Xeon®
  • Intel® Atom™
  • Intel® Xeon Phi™ Coprocessor
• Summary

11/26/2014. Optimization Notice. Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.

Processor Chips and Dies
• A chip is the package containing one or more dies (silicon)
• Major components of the die can be easily identified
• Example of a typical die:

[Figure: "Die shot" of Intel® Core™ with labeled components: Core (four times), Graphics, System Agent, L3 Cache, Memory Controller, I/O]

Moore's Law
"The number of transistors on a chip will double approximately every two years." [Gordon Moore]

[Figure: Moore's Law graph, 1965. Source: http://en.wikipedia.org/wiki/Moore's_law]

Parallelism
Problem: The economical operating frequency of (CMOS) transistors is limited. No free lunch anymore!
Solution: More transistors allow more gates/logic within the same die space and power envelope, improving parallelism:
• Thread level parallelism (TLP): Multi- and many-core
• Data level parallelism (DLP): Wider vectors (SIMD)
• Instruction level parallelism (ILP): Microarchitecture improvements, e.g. threading, superscalarity, …
History: Intel® 64 and IA-32 Architecture
• 1978: 8086 and 8088 processors
• 1982: Intel® 286 processor
• 1985: Intel386™ processor
• 1989: Intel486™ processor
• 1993: Intel® Pentium®
• 1995-1999: P6 family:
  • Intel Pentium Pro processor
  • Intel Pentium II [Xeon] processor
  • Intel Pentium III [Xeon] processor
  • Intel Celeron® processor
• 2000-2006: Intel® Pentium® 4 processor
• 2005: Intel® Pentium® processor Extreme Edition
• 2001-2007: Intel® Xeon® processor
• 2003-2006: Intel® Pentium® M processor
• 2006/2007: Intel® Core™ Duo and Intel® Core™ Solo processors
• 2006/2007: Intel® Core™2 processor
• 2008: Intel® Atom™ processor and Intel® Core™ i7 processor family
• 2010: Intel® Core™ processor family
• 2011: Second generation Intel® Core™ processor family
• 2012: Third generation Intel® Core™ processor family
• 2013: Fourth generation Intel® Core™ processor family
• 2013: Intel® Atom™ processor family based on Silvermont microarchitecture

Processor Architecture Basics: Pipeline
Computation of instructions requires several stages:

[Figure: pipeline diagram labeled Front End and Back End, with stages Fetch, Decode, Execute, Load or Store, Commit, connected to Instructions, Registers, and Memory]

1. Fetch: Read instruction (bytes) from memory
2. Decode: Translate instruction (bytes) into the operations used by the microarchitecture
3. Execute: Perform the operation with a functional unit
4. Memory: Load (read) or store (write) data, if required
5. Commit: Retire instruction and update architectural state
Processor Architecture Basics: Naïve Pipeline (Serial Execution)

[Figure: instructions 1-6 enter the pipeline (Fetch, Decode, Execute, Load or Store, Commit) one at a time; t_p spans one full pass through all stages; the pipeline is connected to Instructions, Registers, and Memory]

Characteristics of strict serial execution:
• Only one instruction occupies the entire pipeline (all stages) at a time
• Low complexity
• Execution time: n_instructions * t_p

Problem:
• Inefficient because only one stage is active at a time

Processor Architecture Basics: Pipeline Execution

[Figure: instructions 1-6 occupy consecutive pipeline stages simultaneously; t_s spans a single stage]

Characteristics of pipeline execution:
• Multiple instructions occupy the pipeline (one per stage)
• Efficient because all stages are kept active at every point in time
• Execution time: n_instructions * t_s

Problem:
• Reality check: What happens if t_s is not constant?

Processor Architecture Basics: Pipeline Stalls

  1: mov $0x1, %rax
  2: mov 0x1234, %rbx
  3: add %rbx, %rax
  4: …
  5: …
  6: …

[Figure: the instructions above in the pipeline (Fetch, Decode, Execute, Load or Store, Commit), annotated with t_p and t_s]

Pipeline stalls:
• Caused by pipeline stages that take longer than one cycle
• Caused by dependencies: order has to be maintained
• Execution time: n_instructions * t_avg, with t_s ≤ t_avg ≤ t_p

Problem:
• Stalls reduce pipeline throughput and leave stages idle.
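The three execution-time formulas from the pipeline slides can be compared with a small model. This is a minimal sketch, not a cycle-accurate simulator: it assumes all stages take the same time (so t_p = n_stages * t_s), ignores pipeline fill time as the slides do, and uses a hypothetical list of stall cycles per instruction.

```python
def serial_time(n_instructions, t_s, n_stages):
    """Strict serial execution: each instruction owns the whole pipeline."""
    t_p = n_stages * t_s  # assumption: equal-length stages
    return n_instructions * t_p


def pipelined_time(n_instructions, t_s):
    """Ideal pipelining in steady state: one instruction finishes per stage time."""
    return n_instructions * t_s


def stalled_time(n_instructions, t_s, stall_cycles):
    """Pipelining with stalls: every stall cycle adds one idle stage time."""
    return n_instructions * t_s + sum(stall_cycles) * t_s


n, t_s, n_stages = 6, 1, 5
# Hypothetical stalls: instruction 3 waits 3 cycles for the load in instruction 2.
stalls = [0, 0, 3, 0, 0, 0]

total_serial = serial_time(n, t_s, n_stages)
total_pipelined = pipelined_time(n, t_s)
total_stalled = stalled_time(n, t_s, stalls)
t_avg = total_stalled / n

print("serial:", total_serial)
print("pipelined:", total_pipelined)
print("stalled:", total_stalled, "t_avg:", t_avg)
```

In this model t_avg always lies between t_s (no stalls) and t_p (every instruction waits for the previous one to leave the pipeline), which matches the bound given on the stall slide.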
Processor Architecture Basics: Reducing Impact of Pipeline Stalls
Impact of pipeline stalls can be reduced by:
• Branch prediction
• Superscalarity + multiple issue fetch & decode
• Out of Order execution
• Cache
• Non-temporal stores
• Prefetching
• Line fill buffers
• Load/Store buffers
• Alignment
• Simultaneous Multithreading (SMT)

These are characteristics of the architecture that might require user action!

Processor Architecture Basics: Superscalarity

[Figure: two-issue superscalar pipeline; Fetch/Decode feeds a scheduler that distributes instructions 1-6 to Port 0 and Port 1, each with its own Execute (Exe), Load/Store (LS), and Commit (Com) stages]

Characteristics of a superscalar architecture:
• Improves throughput by covering latency (ports are independent)
• Ports can have different functionalities (floating point, integer, addressing, …)
• Requires multiple issue fetch & decode (here: 2 issue)
• Execution time: n_instructions * t_avg / n_ports

Problem:
• More complex and prone to stalls when instructions depend on each other

Solution: Out of Order Execution

Processor Architecture Basics: Out of Order Execution

  1: mov $0x1, %rax
  2: mov $0x2, %rbx
  3: add %rbx, %rax
  4: add $0x0, %r8
  5: add $0x1, %r9
  6: add $0x2, %r10

[Figure: pipeline with Fetch, Decode, Dispatch, Execute, Load or Store, Reorder, and Commit stages; an instruction queue (I-Queue) sits before execution and a reorder buffer (ROB) before commit]

Characteristics of out of order (OOO) execution:
• Instruction queue (I-Queue) moves stalling instructions out of the pipeline
• Reorder buffer (ROB) maintains the correct order of committing instructions
• Reduces pipeline stalls, but not entirely!
• Speculative execution possible
• The opposite of OOO execution is in order execution
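To illustrate why out of order execution reduces stalls "but not entirely", the following sketch schedules the six instructions from the slide on a two-issue machine, once in order and once out of order. It is a toy model under stated assumptions: the latencies are hypothetical (instruction 2 is treated as a slow 4-cycle load, like the `mov 0x1234, %rbx` on the stall slide), only read-after-write dependencies are tracked, and the function name `schedule` is made up for this example.

```python
def schedule(instrs, width, in_order):
    """Return the cycle at which the last result is available.

    instrs: list of (dest_register, source_registers, latency_in_cycles).
    width: maximum number of instructions issued per cycle.
    in_order: if True, a stalled instruction blocks all younger ones.
    """
    # Read-after-write dependencies: each source depends on its latest writer.
    last_writer, deps = {}, []
    for idx, (dest, srcs, _) in enumerate(instrs):
        deps.append({last_writer[s] for s in srcs if s in last_writer})
        last_writer[dest] = idx

    done_at = {}                      # instruction index -> cycle result is ready
    pending = list(range(len(instrs)))
    cycle = 0
    while pending:
        issued = 0
        for i in list(pending):       # oldest instruction first
            if issued == width:
                break
            ready = all(j in done_at and done_at[j] <= cycle for j in deps[i])
            if ready:
                done_at[i] = cycle + instrs[i][2]
                pending.remove(i)
                issued += 1
            elif in_order:
                break                 # younger instructions must wait
        cycle += 1
    return max(done_at.values())


# Instructions 1-6 from the slide; latencies are assumptions for illustration.
program = [
    ("rax", (), 1),               # 1: mov $0x1, %rax
    ("rbx", (), 4),               # 2: modeled as a slow load into %rbx
    ("rax", ("rbx", "rax"), 1),   # 3: add %rbx, %rax (needs 1 and 2)
    ("r8", ("r8",), 1),           # 4: add $0x0, %r8
    ("r9", ("r9",), 1),           # 5: add $0x1, %r9
    ("r10", ("r10",), 1),         # 6: add $0x2, %r10
]

in_order_cycles = schedule(program, width=2, in_order=True)
ooo_cycles = schedule(program, width=2, in_order=False)
print("in order:", in_order_cycles, "out of order:", ooo_cycles)
```

In this model the independent instructions 4 to 6 run while instruction 3 waits for the load, so the out of order schedule finishes earlier. The dependency chain 2 to 3 still bounds the total time, which is the part of the stall that reordering cannot hide.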
Processor Architecture Basics: Cache

  1: mov $0x1, %rax
  2: mov 0x1234, %rbx
  3: mov 0x1234, %rcx
  4: …
  5: …
  6: …

[Figure: pipeline (Fetch, Decode, Execute, Load or Store, Commit) with a cache between the pipeline and memory; accesses to address 0x1234 are labeled as cache miss and cache hit]

Cache:
• Small, some KiB
• Faster than main memory
• Cache hierarchy: L1, L2 & L3 (LLC)
• Reused data can remain in cache (here: 0x1234)

Problem:
• Data coherency must be maintained, which is the job of a cache coherency protocol (e.g. MESI)

Processor Architecture Basics: Cache-Hierarchy & Cache-Line

  1: mov 0x1234, %rbx
  2: mov 0x5678, %rcx
  3: mov %rcx, 0x1234
  4: …
  5: …
  6: …

[Figure: L1 instruction (I) and data (D) caches, L2 cache, L3 cache, and memory; cache misses and hits for addresses 0x1234 and 0x5678 are shown at each level]

Cache-hierarchy:
• For data and instructions
• Usually inclusive caches
• Races for resources
• Can improve access speed
• Cache-misses & cache-hits

Cache-line:
• Always full 64 byte block
• Minimal granularity of every load/store
• Modifications invalidate entire cache-line (dirty bit)

Processor Architecture Basics: Prefetching
Act ahead of time:
• Continuous data streams could stutter due to caching
• Two solutions to avoid this:
  • Hardware prefetch:
    • Processor memory management tries to detect patterns and automatically loads cache-lines before access.
    • As with branch prediction, the behavior depends on the underlying architecture and processor generation.
  • Software prefetch:
    • Programmer can manually add PREFETCH instructions to selectively load cache-lines before access.
    • Software prefetching also needs to be adjusted for each architecture and generation.
• Prefetching only helps smooth the flow of data; it won't increase bandwidth!

Consider tuning prefetching only as the very last step of application tuning!
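The 64-byte cache-line granularity from the cache-line slide also explains why prefetching cannot add bandwidth: every access moves whole lines, so the access pattern decides how much of the transferred data is actually used. The following is a minimal sketch of that arithmetic; the helper names and the access patterns are hypothetical and chosen only for illustration.

```python
CACHE_LINE_BYTES = 64  # line size stated on the cache-line slide


def lines_for_access(address, size):
    """Return the cache-line indices touched by one load/store of `size` bytes."""
    first = address // CACHE_LINE_BYTES
    last = (address + size - 1) // CACHE_LINE_BYTES
    return list(range(first, last + 1))


def useful_fraction(addresses, size):
    """Fraction of the transferred bytes that the accesses actually use."""
    touched = set()
    for address in addresses:
        touched.update(lines_for_access(address, size))
    return (len(addresses) * size) / (len(touched) * CACHE_LINE_BYTES)


# 1024 accesses of 8 bytes each: packed back to back, or one per cache line.
sequential = [i * 8 for i in range(1024)]
strided = [i * CACHE_LINE_BYTES for i in range(1024)]

print("0x1234 is in line", lines_for_access(0x1234, 8))
print("unaligned access at byte 60 touches lines", lines_for_access(60, 8))
print("sequential useful fraction:", useful_fraction(sequential, 8))
print("strided useful fraction:", useful_fraction(strided, 8))
```

The unaligned 8-byte access at byte 60 spans two lines, which is one reason alignment appears in the list of stall-reduction measures. The strided pattern uses only one eighth of every line it pulls in, and no amount of prefetching changes that ratio.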
