Computer Architecture

ESE 545 Computer Architecture

Limits to ILP, Thread-Level Parallelism (TLP), Data-Level Parallelism (DLP), and Single-Instruction-Multiple-Data (SIMD) Computing

Limits to ILP

- Conflicting studies of the amount of ILP
  - Benchmarks (vectorized Fortran FP vs. integer C programs)
  - Hardware sophistication
  - Compiler sophistication
- How much ILP is available using existing mechanisms with increasing HW budgets?
- Do we need to invent new HW/SW mechanisms to keep on the performance curve?
  - Intel MMX, SSE (Streaming SIMD Extensions): 64-bit ints
  - Intel SSE2: 128 bit, including 2 × 64-bit FP per clock
  - Motorola AltiVec: 128-bit ints and FPs
  - SuperSPARC multimedia ops, etc.

Limits to ILP

Initial HW model here: MIPS compilers. Assumptions for an ideal/perfect machine to start:

1. Register renaming - infinite virtual registers => all register WAW & WAR hazards are avoided
2. Branch prediction - perfect; no mispredictions
3. Jump prediction - all jumps perfectly predicted (returns, case statements)
   2 & 3 => no control dependencies; perfect speculation & an unbounded buffer of instructions available
4. Memory-address alias analysis - addresses known & a load can be moved before a store provided the addresses are not equal
   1 & 4 eliminate all but RAW hazards

Also: perfect caches; 1-cycle latency for all instructions (including FP multiply and divide); unlimited instructions issued per clock cycle.

Limits to ILP: HW Model Comparison

                             Model      Power 5
Instructions issued/clock    Infinite   4
Instruction window size      Infinite   200
Renaming registers           Infinite   48 integer + 40 FP
Branch prediction            Perfect    2% to 6% misprediction (tournament branch predictor)
Cache                        Perfect    64K I, 32K D, 1.92 MB L2, 36 MB L3
Memory alias analysis        Perfect    ??

Upper Limit to ILP: Ideal Machine

(Figure: instructions per clock on the ideal machine, by program)

  gcc 54.8   espresso 62.6   li 17.9   fpppp 75.2   doduc 118.7   tomcatv 150.1

  FP programs: 75 - 150 IPC; integer programs: 18 - 60 IPC

Limits to ILP: HW Model Comparison

                             New Model                     Model      Power 5
Instructions issued/clock    Infinite                      Infinite   4
Instruction window size      Infinite, 2K, 512, 128, 32    Infinite   200
Renaming registers           Infinite                      Infinite   48 integer + 40 FP
Branch prediction            Perfect                       Perfect    2% to 6% misprediction (tournament)
Cache                        Perfect                       Perfect    64K I, 32K D, 1.92 MB L2, 36 MB L3
Memory alias analysis        Perfect                       Perfect    ??

More Realistic HW: Window Impact

Change from infinite window to windows of 2048, 512, 128, and 32 entries.

(Figure: IPC for gcc, espresso, li, fpppp, doduc, tomcatv at each window size. FP programs range from about 150 IPC down to 9 as the window shrinks; integer programs from about 63 down to 8.)

Limits to ILP: HW Model Comparison

                             New Model                                Model      Power 5
Instructions issued/clock    64                                       Infinite   4
Instruction window size      2048                                     Infinite   200
Renaming registers           Infinite                                 Infinite   48 integer + 40 FP
Branch prediction            Perfect vs. 8K tournament vs. 512 2-bit  Perfect    2% to 6% misprediction
                             vs. profile vs. none                                (tournament branch predictor)
Cache                        Perfect                                  Perfect    64K I, 32K D, 1.92 MB L2, 36 MB L3
Memory alias analysis        Perfect                                  Perfect    ??

More Realistic HW: Branch Impact

Change from the infinite window to a 2048-entry window and a maximum issue of 64 instructions per clock cycle.

(Figure: IPC for gcc, espresso, li, fpppp, doduc, tomcatv under five branch-prediction schemes - perfect, selective/tournament predictor, standard 2-bit with a 512-entry BHT, static profile-based, and no prediction. FP: 15 - 45 IPC; integer: 6 - 12 IPC.)

Misprediction Rates

(Figure: branch misprediction rates for profile-based, 2-bit counter, and tournament predictors on tomcatv, doduc, fpppp, li, espresso, and gcc. Rates range from about 0% to 30%; the tournament predictor is generally the lowest on each benchmark.)

Limits to ILP: HW Model Comparison

                             New Model                            Model      Power 5
Instructions issued/clock    64                                   Infinite   4
Instruction window size      2048                                 Infinite   200
Renaming registers           Infinite v. 256, 128, 64, 32, none   Infinite   48 integer + 40 FP
Branch prediction            8K 2-bit                             Perfect    Tournament branch predictor
Cache                        Perfect                              Perfect    64K I, 32K D, 1.92 MB L2, 36 MB L3
Memory alias analysis        Perfect                              Perfect    Perfect

More Realistic HW: Renaming Register Impact (N int + N fp)

Change: 2048-instruction window, 64-instruction issue, 8K two-level prediction.

(Figure: IPC for gcc, espresso, li, fpppp, doduc, tomcatv with infinite, 256, 128, 64, 32, and no renaming registers. FP: 11 - 45 IPC; integer: 5 - 15 IPC.)

Limits to ILP: HW Model Comparison

                             New Model                             Model      Power 5
Instructions issued/clock    64                                    Infinite   4
Instruction window size      2048                                  Infinite   200
Renaming registers           256 Int + 256 FP                      Infinite   48 integer + 40 FP
Branch prediction            8K 2-bit                              Perfect    Tournament
Cache                        Perfect                               Perfect    64K I, 32K D, 1.92 MB L2, 36 MB L3
Memory alias analysis        Perfect v. Stack v. Inspect v. none   Perfect    Perfect

More Realistic HW: Memory Address Alias Impact

Change: 2048-instruction window, 64-instruction issue, 8K two-level prediction, 256 renaming registers.

(Figure: IPC for gcc, espresso, li, fpppp, doduc, tomcatv under four alias-analysis schemes - perfect; global/stack perfect with heap conflicts assumed; inspection of assembly; none. FP: 4 - 45 IPC (Fortran, no heap); integer: 4 - 9 IPC.)

Limits to ILP: HW Model Comparison

                             New Model                        Model      Power 5
Instructions issued/clock    64 (no restrictions)             Infinite   4
Instruction window size      Infinite vs. 256, 128, 64, 32    Infinite   200
Renaming registers           64 Int + 64 FP                   Infinite   48 integer + 40 FP
Branch prediction            1K 2-bit                         Perfect    Tournament
Cache                        Perfect                          Perfect    64K I, 32K D, 1.92 MB L2, 36 MB L3
Memory alias analysis        HW disambiguation                Perfect    Perfect

Realistic HW: Window Impact

Perfect disambiguation (HW), 1K selective prediction, 16-entry return stack, 64 renaming registers, issue as many instructions as the window allows.

(Figure: IPC for gcc, espresso, li, fpppp, doduc, tomcatv at window sizes infinite, 256, 128, 64, 32, 16, 8, and 4. FP: 8 - 45 IPC; integer: 6 - 12 IPC.)

HW vs. SW to Increase ILP

- Memory disambiguation: HW best
- WAR and WAW hazards through memory: register renaming eliminates WAW and WAR hazards for registers, but not for memory
  - Conflicts can arise via allocation of stack frames, as a called procedure reuses the memory addresses of a previous frame on the stack
- Speculation:
  - HW best when dynamic branch prediction is better than compile-time prediction
  - Exceptions are easier to handle in HW
  - HW doesn't need bookkeeping code or compensation code
  - Very complicated to get right
- Scheduling: SW can look further ahead to schedule better
- Compiler independence: HW does not require a new compiler or recompilation to run well

In Conclusion

- Limits to ILP (power efficiency, compilers, dependencies, ...) appear to limit practical designs to roughly 3- to 6-issue
- These are not laws of physics, just practical limits for today; they may yet be overcome via research
- Compiler and ISA advances could change the results

Performance Beyond Single-Thread ILP

- There can be much higher natural parallelism in some applications (e.g., database or scientific codes) => explicit thread-level parallelism or data-level parallelism (next lectures)
- Thread: process with its own instructions and data
  - A thread may be a process that is part of a parallel program of multiple processes, or it may be an independent program
  - Each thread has all the state (instructions, data, PC, register state, and so on) necessary to allow it to execute
- Data-level parallelism: perform identical operations on data, and lots of data

Multithreading

- Difficult to continue to extract instruction-level parallelism (ILP) from a single sequential thread of control
- Many workloads can make use of thread-level parallelism (TLP)
  - TLP from multiprogramming (run independent sequential jobs)
  - TLP from multithreaded applications (run one job faster using parallel threads)
- Multithreading uses TLP to improve utilization of a single processor

Pipeline Hazards

                  t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14
LW   r1, 0(r2)    F  D  X  M  W
LW   r5, 12(r1)      F  D  D  D  D  X  M  W
ADDI r5, r5, #12        F  F  F  F  D  D  D  D  X   M   W
SW   12(r1), r5                     F  F  F  F  D   D   D   D

- Each instruction may depend on the next

What is usually done to cope with this?
- Interlocks (slow), or
- Bypassing (needs hardware, doesn't help all hazards)

Multithreading

How can we guarantee there are no dependencies between instructions in a pipeline? One way is to interleave execution of instructions from different program threads on the same pipeline.

Interleave 4 threads, T1-T4, on non-bypassed 5-stage pipe

                      t0 t1 t2 t3 t4 t5 t6 t7 t8 t9
T1: LW   r1, 0(r2)    F  D  X  M  W
T2: ADD  r7, r1, r4      F  D  X  M  W
T3: XORI r5, r4, #12        F  D  X  M  W
T4: SW   0(r7), r5             F  D  X  M  W
T1: LW   r5, 12(r1)               F  D  X  M  W

The prior instruction in a thread always completes write-back before the next instruction in the same thread reads the register file.

CDC 6600 Peripheral Processors (Cray, 1964)

- First multithreaded hardware
- 10 "virtual" I/O processors
- Fixed interleave on a simple pipeline
- Pipeline has 100 ns cycle time
- Each virtual processor executes one instruction every 1000 ns
- Accumulator-based instruction set to reduce processor state

Simple Multithreaded Pipeline

(Figure: five-stage multithreaded pipeline datapath - per-thread PCs and GPR files feed the instruction cache, register read, execute (X/Y), and D$ stages, with a thread-select signal choosing which thread's state is used each cycle.)

- Have to carry thread select down the pipeline to ensure the correct state bits are read/written at each pipe stage
- Appears to software (including OS) as multiple, albeit slower, CPUs

Multithreading Costs

- Each thread requires its own user state
  - PC
  - GPRs
- Also needs its own system state
  - Virtual-memory page-table-base register
  - Exception-handling registers
- Other overheads:
  - Additional cache/TLB conflicts from competing threads
  - (or add larger cache/TLB capacity)
  - More OS overhead to schedule more threads

Thread Scheduling Policies

- Fixed interleave (CDC 6600 PPUs, 1964); see the sketch after this list
  - Each of N threads executes one instruction every N cycles
  - If a thread is not ready to go in its slot, insert a pipeline bubble
- Software-controlled interleave (TI ASC, with 8 virtual processors in the Peripheral Processor and 16 pipeline slots, 1971)
  - OS allocates 16 pipeline slots amongst the 8 threads
  - Hardware performs fixed interleave over the 16 slots, executing whichever thread is in each slot

- Hardware-controlled thread scheduling (HEP, 1982)
  - Hardware keeps track of which threads are ready to go
  - Picks the next thread to execute based on a hardware priority scheme
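A minimal C sketch of the fixed-interleave policy above; the function and array names are illustrative, not from the original slides:

/* Fixed-interleave scheduling (CDC 6600 style): thread (cycle mod N)
   owns the issue slot; if it is not ready, the slot becomes a bubble. */
#define N_THREADS 4

int pick_thread(int cycle, const int ready[N_THREADS]) {
    int slot = cycle % N_THREADS;    /* fixed round-robin ownership */
    return ready[slot] ? slot : -1;  /* -1 means insert a pipeline bubble */
}

Hardware-controlled scheduling (HEP-style) would instead pick any ready thread by priority rather than forcing a bubble.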

Denelcor HEP (Burton Smith, 1982)

First commercial machine to use hardware threading in main CPU

- 120 threads per processor
- 10 MHz clock rate
- Up to 8 processors
- Precursor to the Tera MTA (Multithreaded Architecture)

Tera MTA (1990-)

- Up to 256 processors
- Up to 128 active threads per processor
- Processors and memory modules populate a sparse 3D torus interconnection fabric
- Flat, shared main memory
  - No data cache
  - Sustains one main memory access per cycle per processor
- GaAs logic in prototype, 1 KW/processor @ 260 MHz
- Second version CMOS, MTA-2, 50 W/processor
- New version, XMT, fits into an AMD Opteron socket, runs at 500 MHz

MTA Pipeline

(Figure: MTA pipeline - instruction fetch feeds an issue pool; the W/M/A/C pipelines connect to write, memory, and retry pools through the interconnection network and memory pipeline.)

- Every cycle, one VLIW instruction from one active thread is launched into the pipeline
- The instruction pipeline is 21 cycles long
- Memory operations incur ~150 cycles of latency

Assuming a single thread issues one instruction every 21 cycles, and the clock rate is 260 MHz, what is single-thread performance? Effective single-thread issue rate is 260/21 = 12.4 MIPS.

Coarse-Grain Multithreading

The Tera MTA was designed for supercomputing applications with large data sets and low locality:

- No data cache
- Many parallel threads needed to hide large memory latency

Other applications are more cache friendly:

- Few pipeline bubbles if the cache mostly hits
- Just add a few threads to hide occasional cache-miss latencies
- Swap threads on cache misses

MIT Alewife (1990)

Modified SPARC chips

- Register windows hold different thread contexts

Up to four threads per node

Thread switch on local cache miss

IBM PowerPC RS64-IV (2000)

- Commercial coarse-grain multithreading CPU
- Based on PowerPC with a quad-issue, in-order, five-stage pipeline
- Each physical CPU supports two virtual CPUs
- On an L2 cache miss, the pipeline is flushed and execution switches to the second thread
  - Short pipeline minimizes flush penalty (4 cycles), small compared to memory access latency
  - Flushing the pipeline also simplifies exception handling

Oracle/Sun Niagara Processors

- Target is datacenters running web servers and databases, with many concurrent requests
- Provide multiple simple cores, each with multiple hardware threads: reduced energy per operation, though with much lower single-thread performance
- Niagara-1 [2004], 8 cores, 4 threads/core
- Niagara-2 [2007], 8 cores, 8 threads/core
- Niagara-3 [2009], 16 cores, 8 threads/core

Oracle/Sun Niagara-3, "Rainbow Falls" (2009)

Simultaneous Multithreading (SMT) for OoO Superscalars

- Techniques presented so far have all been "vertical" multithreading, where each pipeline stage works on one thread at a time
- SMT uses the fine-grain control already present inside an OoO superscalar to allow instructions from multiple threads to enter execution on the same clock cycle. This gives better utilization of machine resources.

Multithreaded & vector processors 35 For most apps, most execution units lie idle in an OoO superscalar For an 8-way superscalar.

From: Tullsen, Eggers, and Levy, “Simultaneous Multithreading: Maximizing On-chip Parallelism”, ISCA 1995. Multithreaded & vector processors 36 Superscalar Machine Efficiency

(Figure: issue slots over time for a 4-wide machine. A completely idle cycle is vertical waste; a partially filled cycle, i.e., IPC < 4, is horizontal waste.)

Vertical Multithreading

(Figure: issue slots over time with a second thread interleaved cycle-by-cycle; partially filled cycles, i.e., IPC < 4, remain as horizontal waste.)

- What is the effect of cycle-by-cycle interleaving?
  - Removes vertical waste, but leaves some horizontal waste

Chip Multiprocessing (CMP)

(Figure: issue slots over time divided between two narrower processors.)

- What is the effect of splitting into multiple processors?
  - Reduces horizontal waste,
  - leaves some vertical waste, and
  - puts an upper limit on the peak throughput of each thread.

Ideal Superscalar Multithreading [Tullsen, Eggers, Levy, UW, 1995]

(Figure: issue slots over time with instructions from all threads freely mixed in every cycle.)

- Interleave multiple threads onto multiple issue slots with no restrictions

O-o-O Simultaneous Multithreading [Tullsen, Eggers, Emer, Levy, Stamm, Lo, DEC/UW, 1996]

- Add multiple contexts and fetch engines, and allow instructions fetched from different threads to issue simultaneously
- Utilize the wide out-of-order issue queue to find instructions to issue from multiple threads
- The OOO instruction window already has most of the circuitry required to schedule from multiple threads
- Any single thread can utilize the whole machine

MARS-M: Truly First SMT System - Russia, 1988: demonstrated well before Tullsen's SMT paper, IBM, and Intel!

- Decoupled multithreaded architecture with Control, Address, and Execution multithreaded processors and a multi-ported, multi-bank memory subsystem
- Simultaneous multithreading in the Address and Execution VLIW processors
  - Up to 4 threads can issue VLIW instructions each cycle in each Address/Execution processor
- Coarse-grain multithreading in the Control VLIW processor, with 8 thread contexts
  - Thread switch on issuing load instructions (no cache)
- Three water-cooled cabinets
  - Memory-subsystem cabinet shown in the left photo
  - 739 boards with up to 100 MSI and LSI ECL ICs mounted on both sides

Dorozhevets, M.N. and Wolcott, P., "The El'brus-3 and MARS-M: Recent Advances in Russian High-performance Computing", The Journal of Supercomputing, USA, 6, pp. 5-48, 1992.

IBM Power 4

Single-threaded predecessor to Power 5. 8 execution units in out-of-order engine, each may issue an instruction each cycle.

Power 4 vs. Power 5

(Figure: Power 4 vs. Power 5 pipelines. Power 5 adds 2 fetch (PC) and 2 initial decode paths, one per thread, and 2 commit units with separate architected register sets.)

Power 5 data flow ...

Why only 2 threads? With 4, one of the shared resources (physical registers, cache, memory bandwidth) would be prone to bottleneck

Changes in Power 5 to Support SMT

- Increased associativity of the L1 instruction cache and the instruction address translation buffers
- Added per-thread load and store queues
- Increased size of the L2 (1.92 vs. 1.44 MB) and L3 caches
- Added separate instruction prefetch and buffering per thread
- Increased the number of virtual registers from 152 to 240
- Increased the size of several issue queues
- The Power5 core is about 24% larger than the Power4 core because of the addition of SMT support

Pentium-4 Hyperthreading (2002)

- First commercial SMT design (2-way SMT)
  - Hyperthreading == SMT
- Logical processors share nearly all resources of the physical processor
  - Caches, execution units, branch predictors
- Die area overhead of hyperthreading ~ 5%
- When one logical processor is stalled, the other can make progress
  - No logical processor can use all entries in the queues when two threads are active
- A processor running only one active software thread runs at approximately the same speed with or without hyperthreading
- Hyperthreading was dropped on the OoO P6-based follow-ons to Pentium-4 (Pentium-M, Core Duo, Core 2 Duo), until revived with the Nehalem generation machines in 2008
- Intel Atom (in-order x86 core) has two-way vertical multithreading

Initial Performance of SMT

- Pentium 4 Extreme SMT yields 1.01 speedup for the SPECint_rate benchmark and 1.07 for SPECfp_rate
  - Pentium 4 is dual-threaded SMT
  - SPECRate requires that each SPEC benchmark be run against a vendor-selected number of copies of the same benchmark
- Running each of the 26 SPEC benchmarks paired with every other on Pentium 4 (26^2 runs): speedups from 0.90 to 1.58; the average was 1.20
- Power 5, 8-processor server: 1.23x faster for SPECint_rate with SMT, 1.16x faster for SPECfp_rate
- Power 5 running 2 copies of each app: speedup between 0.89 and 1.41
  - Most gained some
  - Floating-point apps had the most cache conflicts and the least gains

SMT Adaptation to Parallelism Type

For regions with high thread-level parallelism (TLP), the entire machine width is shared by all threads. For regions with low TLP, the entire machine width is available for instruction-level parallelism (ILP).

(Figure: issue slots over time in the two regimes.)

Summary: Multithreaded Categories

(Figure: issue slots over time (processor cycles) for superscalar, fine-grained, coarse-grained, multiprocessing, and simultaneous multithreading; slots show threads 1-5 and idle slots.)

SIMD Architectures

- Data-Level Parallelism (DLP): executing one operation on multiple data streams
- Example: multiplying a coefficient vector by a data vector (e.g., in filtering): y[i] := c[i] × x[i], 0 ≤ i < n (see the C sketch after this list)
- Sources of performance improvement:
  - One instruction is fetched & decoded for the entire operation
  - The multiplications are known to be independent
  - Pipelining/concurrency in memory access as well
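A minimal C sketch of the filtering example above, assuming SSE intrinsics are available (the function names and the n-divisible-by-4 restriction are illustrative assumptions, not from the original slides):

#include <xmmintrin.h>  /* SSE intrinsics */

/* Scalar version: one multiply per loop iteration. */
void filter_scalar(float *y, const float *c, const float *x, int n) {
    for (int i = 0; i < n; i++)
        y[i] = c[i] * x[i];
}

/* SIMD version: one MULPS instruction multiplies four floats at once.
   Assumes n is a multiple of 4; unaligned loads used for safety. */
void filter_simd(float *y, const float *c, const float *x, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 vc = _mm_loadu_ps(&c[i]);  /* load 4 coefficients */
        __m128 vx = _mm_loadu_ps(&x[i]);  /* load 4 data elements */
        _mm_storeu_ps(&y[i], _mm_mul_ps(vc, vx));
    }
}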

Intel's SIMD Instructions

Example: SIMD Array Processing

SSE Instruction Categories for Multimedia Support

- Intel processors are CISC (vs. RISC MIPS)
- SSE-2+ supports wider data types to allow 16 × 8-bit and 8 × 16-bit operands (a minimal sketch follows)
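An illustration in C of the packed-integer operand widths, assuming SSE2 intrinsics (the function name is an assumption for illustration):

#include <emmintrin.h>  /* SSE2 intrinsics */

/* One PADDB instruction adds 16 x 8-bit integers at once;
   _mm_add_epi16 would instead give 8 x 16-bit adds. */
__m128i add_16_bytes(__m128i a, __m128i b) {
    return _mm_add_epi8(a, b);
}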

Intel Architecture SSE2+ 128-Bit SIMD Data Types

XMM Registers

SSE/SSE2 Floating Point Instructions

SSE/SSE2 Floating Point Instructions (continued)

Example: Add Single Precision FP Vectors
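The original slide's code is not preserved in this text; a minimal C sketch of what it demonstrates, assuming SSE intrinsics (the name and the n-multiple-of-4 restriction are illustrative):

#include <xmmintrin.h>

/* Add two single-precision FP vectors; ADDPS adds 4 floats per instruction. */
void vadd(float *sum, const float *a, const float *b, int n) {
    for (int i = 0; i < n; i += 4) {    /* assumes n is a multiple of 4 */
        __m128 va = _mm_loadu_ps(&a[i]);
        __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&sum[i], _mm_add_ps(va, vb));
    }
}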

Packed and Scalar Double-Precision Floating-Point Operations

Example: Image Converter

Intel SSE Intrinsics

Sample of SSE Intrinsics
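The slide's intrinsic listing is not preserved; a few representative SSE intrinsics in C for orientation (each compiles to roughly one instruction; the demo function name is an assumption):

#include <xmmintrin.h>

/* _mm_load_ps / _mm_store_ps   : MOVAPS, aligned 128-bit load/store
   _mm_loadu_ps / _mm_storeu_ps : MOVUPS, unaligned load/store
   _mm_add_ps                   : ADDPS, 4 packed SP adds
   _mm_mul_ps                   : MULPS, 4 packed SP multiplies */
__m128 axpy4(__m128 a, __m128 x, __m128 y) {
    return _mm_add_ps(_mm_mul_ps(a, x), y);  /* a*x + y in all 4 lanes */
}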

Example: 2 × 2 Matrix Multiply
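The worked example itself is lost from this extraction; a compact C sketch of a 2 × 2 double-precision matrix multiply in the same spirit, assuming SSE2 intrinsics and 16-byte-aligned, column-major matrices (the layout and names are assumptions):

#include <emmintrin.h>  /* SSE2: __m128d holds 2 doubles */

/* C = A * B for 2 x 2 matrices stored column-major. Each __m128d holds
   one column; column j of C is a0 * B[0][j] + a1 * B[1][j]. */
void mm_2x2(double *C, const double *A, const double *B) {
    __m128d a0 = _mm_load_pd(&A[0]);   /* column 0 of A (aligned load) */
    __m128d a1 = _mm_load_pd(&A[2]);   /* column 1 of A */
    for (int j = 0; j < 2; j++) {
        __m128d b0 = _mm_load1_pd(&B[2*j]);      /* broadcast B[0][j] */
        __m128d b1 = _mm_load1_pd(&B[2*j + 1]);  /* broadcast B[1][j] */
        __m128d cj = _mm_add_pd(_mm_mul_pd(a0, b0),
                                _mm_mul_pd(a1, b1));
        _mm_store_pd(&C[2*j], cj);     /* write column j of C */
    }
}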

Inner loop from gcc -O -S

Performance-Driven ISA Extensions

Architectures for Data Parallelism

- SONY/IBM PS3 Cell processor in 2006
- The current landscape: chips that deliver TeraOps/s in 2014, and how they differ:
  - E5-26xx v2: stretching the Xeon server approach for compute-intensive apps
  - GK110: nVidia's Kepler GPU, customized for compute applications
  - GM107: nVidia's Maxwell GPU, customized for energy efficiency

Sony/IBM Playstation PS3 Cell Chip - Released 2006

8 SPEs, 3.2 GHz clock, 200 GigaOps/s (peak)

Sony PS3 Cell Processor: SPE Floating-Point Unit (Single-Instruction Multiple-Data)

4 single-precision multiply-adds issue in lockstep (SIMD) per cycle, with 6-cycle latency.

At a 3.2 GHz clock, 4 multiply-adds = 8 FLOPs per cycle, so 3.2 GHz × 8 = 25.6 SP GFLOPS per SPE.

Intel Xeon Ivy Bridge E5-2697v2 (2013)

12-core Xeon Ivy Bridge: 0.52 TeraOps/s (130 W). Haswell: 1.04 TeraOps/s.

12 cores @ 2.7 GHz; each core can issue 16 single-precision operations per cycle (12 cores × 16 ops × 2.7 GHz ≈ 0.52 TeraOps/s).

$2,600 per chip.

Intel E5-2697v2 vs. Haswell

12 cores @ 2.7 GHz Each core can issue 16 single-precision operations per cycle.

Haswell cores issue 32 SP FLOPS/cycle.

How?

Advanced Vector Extension (AVX) Unit

Smaller than the L3 cache, but larger than the L2 cache. Its relative area has increased in Haswell.

Die closeup of one Sandy Bridge core

AVX: Not Just Single-Precision Floating-Point

AVX instruction variants interpret 128-bit registers as 4 floats, 2 doubles, 16 8-bit integers, etc.

256-bit version -> double-precision vectors of length 4
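For instance, in C with AVX intrinsics (a minimal illustration; the function name is an assumption):

#include <immintrin.h>

/* One VADDPD on a 256-bit YMM register adds 4 doubles at once. */
__m256d add4_doubles(__m256d a, __m256d b) {
    return _mm256_add_pd(a, b);
}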

Sandy Bridge, Haswell (2014)

Sandy Bridge extends the register set to 256 bits: vectors are twice the size. x86-64 AVX/AVX2 has 16 registers (IA-32: 8).

Haswell adds 3-operand fused multiply-add (FMA) instructions, a*b + c. Two EX units with FMA -> 2X increase in ops/cycle. (A sketch follows.)
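A minimal C sketch of the FMA form, assuming FMA3 intrinsics (compile with -mfma; the name and the n-multiple-of-8 restriction are illustrative):

#include <immintrin.h>  /* AVX2 / FMA intrinsics */

/* y[i] += a[i] * b[i], 8 floats at a time; _mm256_fmadd_ps
   computes a*b + c in a single fused instruction. */
void fma_accumulate(float *y, const float *a, const float *b, int n) {
    for (int i = 0; i < n; i += 8) {   /* assumes n is a multiple of 8 */
        __m256 va = _mm256_loadu_ps(&a[i]);
        __m256 vb = _mm256_loadu_ps(&b[i]);
        __m256 vy = _mm256_loadu_ps(&y[i]);
        _mm256_storeu_ps(&y[i], _mm256_fmadd_ps(va, vb, vy));
    }
}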

Out-of-Order Issue in Haswell (2014)

Haswell has two copies of the FMA engine, on separate ports. Haswell sustains 4 micro-op issues per cycle. One possibility: 2 for AVX, and 2 for loads, stores, and bookkeeping.

Graphics Processing Units (GPUs)

- Given the hardware invested to do graphics well, how can we supplement it to improve performance of a wider range of applications?

- Basic idea:
  - Heterogeneous execution model
    - CPU is the host, GPU is the device
  - Develop a C-like programming language for the GPU
  - Unify all forms of GPU parallelism as the CUDA thread
  - Programming model is "Single Instruction Multiple Thread" (SIMT)

Kepler GK110 nVidia GPU

5.12 TeraOps/s: 2880 MACs @ 889 MHz, single-precision multiply-adds (2880 × 2 × 0.889 GHz ≈ 5.12 TeraOps/s). $999 GTX Titan Black with 6 GB GDDR5 (and 1 GPU).

SIMD Summary

- Intel SSE SIMD instructions
  - One instruction fetch that operates on multiple operands simultaneously
  - 256/128/64-bit multimedia registers (XMM & YMM)
  - Embed the SSE machine instructions directly into C programs through the use of intrinsics

Acknowledgements

These slides contain material developed and copyright by:

- Morgan Kaufmann (Elsevier, Inc.)
- Arvind (MIT)
- Krste Asanovic (MIT/UCB)
- Joel Emer (Intel/MIT)
- James Hoe (CMU)
- John Kubiatowicz (UCB)
- David Patterson (UCB)
- Justin Hsia (UCB)
