ESE 545 Computer Architecture
Limits to ILP, Thread-Level Parallelism (TLP), Data-Level Parallelism (DLP), and Single-Instruction-Multiple-Data (SIMD) Computing
ILP: Limits to ILP, TLP, and DLP

Limits to ILP
Conflicting studies of the amount of available ILP, depending on:
Benchmarks (vectorized Fortran FP vs. integer C programs)
Hardware sophistication
Compiler sophistication
How much ILP is available using existing mechanisms with increasing HW budgets?
Do we need to invent new HW/SW mechanisms to stay on the processor performance curve?
Intel MMX, SSE (Streaming SIMD Extensions): 64-bit ints
Intel SSE2: 128-bit, including 2 64-bit Fl. Pt. per clock
Motorola AltiVec: 128-bit ints and FPs
SuperSPARC multimedia ops, etc.
Limits to ILP

Initial HW model here; MIPS compilers. Assumptions for an ideal/perfect machine to start:
1. Register renaming -- infinite virtual registers => all register WAW & WAR hazards are avoided
2. Branch prediction -- perfect; no mispredictions
3. Jump prediction -- all jumps perfectly predicted (returns, case statements)
   2 & 3 => no control dependencies; perfect speculation & an unbounded buffer of instructions available
4. Memory-address alias analysis -- addresses known & a load can be moved before a store provided the addresses are not equal
   1 & 4 eliminate all but RAW hazards
Also: perfect caches; 1-cycle latency for all instructions (including FP * and /); unlimited instructions issued per clock cycle
Limits to ILP: HW Model Comparison

Parameter                      Model      Power 5
Instructions issued per clock  Infinite   4
Instruction window size        Infinite   200
Renaming registers             Infinite   48 integer + 40 Fl. Pt.
Branch prediction              Perfect    2% to 6% misprediction (tournament branch predictor)
Cache                          Perfect    64K I, 32K D, 1.92MB L2, 36MB L3
Memory alias analysis          Perfect    ??
Upper Limit to ILP: Ideal Machine
[Bar chart: instructions per clock on the ideal machine]
gcc 54.8, espresso 62.6, li 17.9, fpppp 75.2, doduc 118.7, tomcatv 150.1
Integer programs: 18 - 60 IPC; FP programs: 75 - 150 IPC
Limits to ILP: HW Model Comparison

Parameter                      New Model                    Model     Power 5
Instructions issued per clock  Infinite                     Infinite  4
Instruction window size        Infinite, 2K, 512, 128, 32   Infinite  200
Renaming registers             Infinite                     Infinite  48 integer + 40 Fl. Pt.
Branch prediction              Perfect                      Perfect   2% to 6% misprediction (tournament branch predictor)
Cache                          Perfect                      Perfect   64K I, 32K D, 1.92MB L2, 36MB L3
Memory alias analysis          Perfect                      Perfect   ??
More Realistic HW: Window Impact

[Bar chart: IPC for gcc, espresso, li, fpppp, doduc, tomcatv with instruction window sizes Infinite, 2048, 512, 128, 32]
Change from infinite window to 2048, 512, 128, 32
Integer programs: 8 - 63 IPC; FP programs: 9 - 150 IPC
Limits to ILP: HW Model Comparison

Parameter                      New Model                                              Model     Power 5
Instructions issued per clock  64                                                     Infinite  4
Instruction window size        2048                                                   Infinite  200
Renaming registers             Infinite                                               Infinite  48 integer + 40 Fl. Pt.
Branch prediction              Perfect vs. 8K tournament vs. 512 2-bit vs. profile vs. none   Perfect   2% to 6% misprediction (tournament branch predictor)
Cache                          Perfect                                                Perfect   64K I, 32K D, 1.92MB L2, 36MB L3
Memory alias analysis          Perfect                                                Perfect   ??
More Realistic HW: Branch Impact

[Bar chart: IPC for gcc, espresso, li, fpppp, doduc, tomcatv with branch predictors: perfect, selective (tournament), standard 2-bit (512-entry BHT), static (profile-based), none]
Change from infinite window to a 2048-entry window and a maximum issue of 64 instructions per clock cycle
Integer programs: 6 - 12 IPC; FP programs: 15 - 45 IPC

Misprediction Rates
[Bar chart: misprediction rates of profile-based, 2-bit counter, and tournament predictors on tomcatv, doduc, fpppp, li, espresso, gcc; rates range from under 1% to 30%, with the tournament predictor consistently the most accurate]
Limits to ILP: HW Model Comparison

Parameter                      New Model                            Model     Power 5
Instructions issued per clock  64                                   Infinite  4
Instruction window size        2048                                 Infinite  200
Renaming registers             Infinite vs. 256, 128, 64, 32, none  Infinite  48 integer + 40 Fl. Pt.
Branch prediction              8K 2-bit                             Perfect   Tournament branch predictor
Cache                          Perfect                              Perfect   64K I, 32K D, 1.92MB L2, 36MB L3
Memory alias analysis          Perfect                              Perfect   Perfect
More Realistic HW: Renaming Register Impact (N int + N fp)

[Bar chart: IPC for gcc, espresso, li, fpppp, doduc, tomcatv with Infinite, 256, 128, 64, 32, and no renaming registers]
Change: 2048-instruction window, 64-instruction issue, 8K two-level prediction
Integer programs: 5 - 15 IPC; FP programs: 11 - 45 IPC
Limits to ILP: HW Model Comparison

Parameter                      New Model                                 Model     Power 5
Instructions issued per clock  64                                        Infinite  4
Instruction window size        2048                                      Infinite  200
Renaming registers             256 Int + 256 FP                          Infinite  48 integer + 40 Fl. Pt.
Branch prediction              8K 2-bit                                  Perfect   Tournament
Cache                          Perfect                                   Perfect   64K I, 32K D, 1.92MB L2, 36MB L3
Memory alias analysis          Perfect vs. Stack vs. Inspect vs. none    Perfect   Perfect
More Realistic HW: Memory Address Alias Impact

[Bar chart: IPC for gcc, espresso, li, fpppp, doduc, tomcatv with perfect, global/stack-perfect (heap conflicts assumed), inspection, and no alias analysis]
Change: 2048-instruction window, 64-instruction issue, 8K two-level prediction, 256 renaming registers
Integer programs: 4 - 9 IPC; FP programs: 4 - 45 IPC (Fortran, no heap)
Limits to ILP: HW Model Comparison

Parameter                      New Model                        Model     Power 5
Instructions issued per clock  64 (no restrictions)             Infinite  4
Instruction window size        Infinite vs. 256, 128, 64, 32    Infinite  200
Renaming registers             64 Int + 64 FP                   Infinite  48 integer + 40 Fl. Pt.
Branch prediction              1K 2-bit                         Perfect   Tournament
Cache                          Perfect                          Perfect   64K I, 32K D, 1.92MB L2, 36MB L3
Memory alias analysis          HW disambiguation                Perfect   Perfect
Realistic HW: Window Impact

[Bar chart: IPC for gcc, espresso, li, fpppp, doduc, tomcatv with window sizes Infinite, 256, 128, 64, 32, 16, 8, 4]
Perfect disambiguation (HW), 1K selective prediction, 16-entry return stack, 64 registers, issue as many instructions as the window allows
Integer programs: 6 - 12 IPC; FP programs: 8 - 45 IPC
HW vs. SW to Increase ILP
Memory disambiguation: HW best
WAR and WAW hazards through memory: register renaming eliminates WAW and WAR hazards for registers, but not for memory locations
Conflicts can arise through allocation of stack frames, as a called procedure reuses the memory addresses of a previous frame on the stack
Speculation:
HW best when dynamic branch prediction better than compile time prediction
Exceptions easier for HW
HW doesn’t need bookkeeping code or compensation code
Very complicated to get right
Scheduling: SW can look ahead to schedule better
Compiler independence: HW approaches do not require a new compiler or recompilation to run well
In Conclusion
Limits to ILP (power efficiency, compilers, dependencies, …) seem to restrict practical designs to 3- to 6-issue
These are not laws of physics; just practical limits for today, and perhaps overcome via research
Compiler and ISA advances could change results
Performance Beyond Single-Thread ILP

There can be much higher natural parallelism in some applications (e.g., database or scientific codes)
Explicit thread-level parallelism or data-level parallelism (next lectures)
Thread: process with own instructions and data
A thread may be a process that is part of a parallel program of multiple processes, or it may be an independent program
Each thread has all the state (instructions, data, PC, register state, and so on) necessary to allow it to execute
Data-level parallelism: perform identical operations on data, and lots of data
Multithreaded & vector processors

Multithreading
Difficult to continue to extract instruction-level parallelism (ILP) from a single sequential thread of control
Many workloads can make use of thread-level parallelism (TLP)
TLP from multiprogramming (run independent sequential jobs)
TLP from multithreaded applications (run one job faster using parallel threads)
Multithreading uses TLP to improve utilization of a single processor
Pipeline Hazards
                  t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14
LW   r1, 0(r2)    F  D  X  M  W
LW   r5, 12(r1)      F  D  D  D  D  X  M  W
ADDI r5, r5, #12        F  F  F  F  D  D  D  D  X   M   W
SW   12(r1), r5               F  F  F  F  D  D  D   D
Each instruction may depend on the result of the one before it
What is usually done to cope with this? – interlocks (slow) – or bypassing (needs hardware, doesn’t help all hazards)
Multithreading

How can we guarantee no dependencies between instructions in a pipeline?
One way is to interleave execution of instructions from different program threads on the same pipeline
Interleave 4 threads, T1-T4, on non-bypassed 5-stage pipe
                       t0 t1 t2 t3 t4 t5 t6 t7 t8 t9
T1: LW   r1, 0(r2)     F  D  X  M  W
T2: ADD  r7, r1, r4       F  D  X  M  W
T3: XORI r5, r4, #12         F  D  X  M  W
T4: SW   0(r7), r5              F  D  X  M  W
T1: LW   r5, 12(r1)                F  D  X  M  W

The prior instruction in a thread always completes write-back before the next instruction in the same thread reads the register file
CDC 6600 Peripheral Processors (Cray, 1964)

First multithreaded hardware
10 "virtual" I/O processors
Fixed interleave on a simple pipeline
Pipeline has 100ns cycle time
Each virtual processor executes one instruction every 1000ns
Accumulator-based instruction set to reduce processor state
Simple Multithreaded Pipeline

[Datapath figure: two PCs and two GPR files, with a thread-select signal choosing between them, feeding one shared instruction cache, register read, execute, and D$ pipeline]
Have to carry thread select down pipeline to ensure correct state bits read/written at each pipe stage
Appears to software (including OS) as multiple, albeit slower, CPUs
Multithreading Costs

Each thread requires its own user state:
- PC
- GPRs
Each thread also needs its own system state:
- Virtual-memory page-table-base register
- Exception-handling registers
Other overheads:
- Additional cache/TLB conflicts from competing threads (or add larger cache/TLB capacity)
- More OS overhead to schedule more threads
Thread Scheduling Policies

Fixed interleave (CDC 6600 PPUs, 1964)
Each of N threads executes one instruction every N cycles
If a thread is not ready to go in its slot, insert a pipeline bubble
Software-controlled interleave (TI ASC peripheral processor, 1971: 8 virtual processors, 16 pipeline slots)
OS allocates the 16 pipeline slots amongst the 8 threads
Hardware performs fixed interleave over the 16 slots, executing whichever thread is in each slot
Hardware-controlled thread scheduling (HEP, 1982) Hardware keeps track of which threads are ready to go Picks next thread to execute based on hardware priority scheme
Denelcor HEP (Burton Smith, 1982)
First commercial machine to use hardware threading in main CPU
120 threads per processor
10 MHz clock rate
Up to 8 processors
precursor to Tera MTA (Multithreaded Architecture)
Tera MTA (1990-)

Up to 256 processors
Up to 128 active threads per processor
Processors and memory modules populate a sparse 3D torus interconnection fabric
Flat, shared main memory
No data cache
Sustains one main memory access per cycle per processor
GaAs logic in prototype, 1KW/processor @ 260MHz
Second version CMOS, MTA-2, 50W/processor
New version, XMT, fits into an AMD Opteron socket, runs at 500MHz
MTA Pipeline

[Pipeline figure: instruction fetch feeds the issue, write, memory, and retry pools (W, M, A, C functional units), connected through the interconnection network and memory pipeline]
Every cycle, one VLIW instruction from one active thread is launched into the pipeline
The instruction pipeline is 21 cycles long
Memory operations incur ~150 cycles of latency
Assuming a single thread issues one instruction every 21 cycles, and the clock rate is 260 MHz, what is single-thread performance?
Effective single-thread issue rate is 260/21 = 12.4 MIPS
Coarse-Grain Multithreading

Tera MTA designed for supercomputing applications with large data sets and low locality
No data cache
Many parallel threads needed to hide large memory latency
Other applications are more cache friendly
Few pipeline bubbles if cache mostly has hits
Just add a few threads to hide occasional cache miss latencies
Swap threads on cache misses
MIT Alewife (1990)
Modified SPARC chips
register windows hold different thread contexts
Up to four threads per node
Thread switch on local cache miss
IBM PowerPC RS64-IV (2000)
Commercial coarse-grain multithreading CPU
Based on PowerPC with quad-issue in-order five-stage pipeline
Each physical CPU supports two virtual CPUs
On L2 cache miss, pipeline is flushed and execution switches to second thread
short pipeline minimizes flush penalty (4 cycles), small compared to memory access latency
flush pipeline to simplify exception handling
Oracle/Sun Niagara processors
Target is datacenters running web servers and databases, with many concurrent requests
Provide multiple simple cores, each with multiple hardware threads: reduced energy per operation, though much lower single-thread performance
Niagara-1 [2004], 8 cores, 4 threads/core
Niagara-2 [2007], 8 cores, 8 threads/core
Niagara-3 [2009], 16 cores, 8 threads/core
Oracle/Sun Niagara-3, “Rainbow Falls” (2009)
Simultaneous Multithreading (SMT) for OoO Superscalars
Techniques presented so far have all been “vertical” multithreading where each pipeline stage works on one thread at a time
SMT uses fine-grain control already present inside an OoO superscalar to allow instructions from multiple threads to enter execution on same clock cycle. Gives better utilization of machine resources.
For most apps, most execution units lie idle in an OoO superscalar

[Figure: breakdown of wasted issue slots for an 8-way superscalar]
From: Tullsen, Eggers, and Levy, “Simultaneous Multithreading: Maximizing On-chip Parallelism”, ISCA 1995.

Superscalar Machine Efficiency
[Figure: issue slots over time for a 4-wide superscalar; a completely idle cycle is vertical waste, a partially filled cycle (IPC < 4) is horizontal waste]
Vertical Multithreading

[Figure: issue slots over time with a second thread interleaved cycle-by-cycle; partially filled cycles (IPC < 4) remain as horizontal waste]
What is the effect of cycle-by-cycle interleaving?
It removes vertical waste, but leaves some horizontal waste
Chip Multiprocessing (CMP)

[Figure: issue slots over time split across two narrower processors]
What is the effect of splitting into multiple processors?
It reduces horizontal waste, leaves some vertical waste, and puts an upper limit on the peak throughput of each thread
Ideal Superscalar Multithreading [Tullsen, Eggers, Levy, UW, 1995]

[Figure: issue slots over time with instructions from multiple threads filling all slots]
Interleave multiple threads onto multiple issue slots with no restrictions
O-o-O Simultaneous Multithreading [Tullsen, Eggers, Emer, Levy, Stamm, Lo, DEC/UW, 1996]
Add multiple contexts and fetch engines and allow instructions fetched from different threads to issue simultaneously
Utilize wide out-of-order superscalar processor issue queue to find instructions to issue from multiple threads
OOO instruction window already has most of the circuitry required to schedule from multiple threads
Any single thread can utilize whole machine
MARS-M: Truly the First SMT System (Russia, 1988)

Demonstrated well before Tullsen's SMT paper, IBM, and Intel!
Decoupled multithreaded architecture with Control, Address, and Execution multithreaded processors and multi-ported multi-bank memory sub-system
Simultaneous multithreading in the Address and Execution VLIW processors
Up to 4 threads can issue VLIW instructions each cycle in each Address/Execution processor
Coarse-grain multithreading in Control VLIW processor with 8 thread contexts
thread switch on issuing load instructions (no cache)
Three water-cooled cabinets
memory-subsystem cabinet shown on the left photo
739 boards, each with up to 100 MSI and LSI ECL ICs mounted on both sides
Dorozhevets M.N. and Wolcott P., "The El'brus-3 and MARS-M: Recent Advances in Russian High-performance Computing", The Journal of Supercomputing, 6, pp. 5-48, 1992
IBM Power 4

Single-threaded predecessor to Power 5
8 execution units in out-of-order engine; each may issue an instruction each cycle

Power 4 vs. Power 5

[Figure: compared with Power 4, the Power 5 pipeline adds 2 fetch stages (one PC per thread), 2 initial decodes, and 2 commits (two architected register sets)]

Power 5 data flow ...
Why only 2 threads? With 4, one of the shared resources (physical registers, cache, memory bandwidth) would be prone to bottleneck
Changes in Power 5 to support SMT

Increased associativity of the L1 instruction cache and the instruction address translation buffers
Added per-thread load and store queues
Increased size of the L2 (1.92 vs. 1.44 MB) and L3 caches
Added separate instruction prefetch and buffering per thread
Increased the number of virtual registers from 152 to 240
Increased the size of several issue queues
The Power 5 core is about 24% larger than the Power 4 core because of the addition of SMT support
Pentium-4 Hyperthreading (2002)

First commercial SMT design (2-way SMT)
Hyperthreading == SMT
Logical processors share nearly all resources of the physical processor: caches, execution units, branch predictors
Die area overhead of hyperthreading ~ 5%
When one logical processor is stalled, the other can make progress
No logical processor can use all entries in the queues when two threads are active
A processor running only one active software thread runs at approximately the same speed with or without hyperthreading
Hyperthreading was dropped on the OoO P6-based follow-ons to Pentium-4 (Pentium-M, Core Duo, Core 2 Duo), until revived with the Nehalem generation machines in 2008
Intel Atom (in-order x86 core) has two-way vertical multithreading
Initial Performance of SMT
Pentium 4 Extreme SMT yields 1.01 speedup for SPECint_rate benchmark and 1.07 for SPECfp_rate
Pentium 4 is dual threaded SMT
SPECRate requires that each SPEC benchmark be run against a vendor-selected number of copies of the same benchmark
Running on the Pentium 4, each of the 26 SPEC benchmarks was paired with every other (26^2 runs): speed-ups from 0.90 to 1.58; average was 1.20
Power 5, 8-processor server 1.23 faster for SPECint_rate with SMT, 1.16 faster for SPECfp_rate
Power 5 running 2 copies of each app speedup between 0.89 and 1.41
Most gained some
Float.pt. apps had most cache conflicts and least gains
SMT adaptation to parallelism type

[Two figures: issue slots over time]
For regions with high thread-level parallelism (TLP), the entire machine width is shared by all threads
For regions with low thread-level parallelism (TLP), the entire machine width is available for instruction-level parallelism (ILP)
Summary: Multithreaded Categories

[Figure: issue slots over processor cycles for superscalar, fine-grained, coarse-grained, multiprocessing, and simultaneous multithreading; shading denotes threads 1-5 and idle slots]
SIMD Architectures
Data-Level Parallelism (DLP): Executing one operation on multiple data streams
Example: Multiplying a coefficient vector by a data vector (e.g., in filtering): y[i] := c[i] × x[i], 0 ≤ i < n
Sources of performance improvement:
One instruction is fetched & decoded for the entire operation
The multiplications are known to be independent
Pipelining/concurrency in memory access as well

CA: DLP and SIMD

Intel's SIMD Instructions

Example: SIMD Array Processing

SSE Instruction Categories for Multimedia Support

Intel processors are CISC (vs. the RISC MIPS)
SSE-2+ supports wider data types to allow 16 × 8-bit and 8 × 16-bit operands

Intel Architecture SSE2+ 128-Bit SIMD Data Types

XMM Registers

SSE/SSE2 Floating Point Instructions

Example: Add Single Precision FP Vectors

Packed and Scalar Double-Precision Floating-Point Operations

Example: Image Converter (1/5 - 5/5)

Intel SSE Intrinsics

Sample of SSE Intrinsics

Example: 2 × 2 Matrix Multiply

Inner loop from gcc -O -S

Performance-Driven ISA Extensions

Architectures for Data Parallelism

SONY/IBM PS3 Cell processor in 2006
The current landscape: chips that deliver TeraOps/s in 2014, and how they differ:
E5-26xx v2: stretching the Xeon server approach for compute-intensive apps
GK110: nVidia's Kepler GPU, customized for compute applications
GM107: nVidia's Maxwell GPU, customized for energy efficiency

Sony/IBM Playstation PS3 Cell Chip (released 2006)

8 SPEs, 3.2 GHz clock, 200 GigaOps/s (peak)

Sony PS3 Cell Processor SPE Floating-Point Unit

Single-Instruction Multiple-Data (SIMD): 4 single-precision multiply-adds issue in lockstep per cycle, with 6-cycle latency
3.2 GHz clock --> 25.6 SP GFLOPS

Intel Xeon Ivy Bridge E5-2697v2 (2013)

12-core Xeon Ivy Bridge: 0.52 TeraOps/s (130W); Haswell: 1.04 TeraOps/s
12 cores @ 2.7 GHz
Each core can issue 16 single-precision operations per cycle
$2,600 per chip

Intel E5-2697v2 vs. Haswell

12 cores @ 2.7 GHz; each E5-2697v2 core can issue 16 single-precision operations per cycle
Haswell cores issue 32 SP FLOPS/cycle. How?

Advanced Vector Extension (AVX) unit

[Die closeup of one Sandy Bridge core] The AVX unit is smaller than the L3 cache but larger than the L2 cache; its relative area has increased in Haswell

AVX: Not Just Single-Precision Floating-Point

AVX instruction variants interpret 128-bit registers as 4 floats, 2 doubles, 16 8-bit integers, etc.
The 256-bit version -> double-precision vectors of length 4

Sandy Bridge, Haswell (2014)

Sandy Bridge extends the register set to 256 bits: vectors are twice the size
In 64-bit mode, AVX/AVX2 has 16 registers (IA-32: 8)
Haswell adds 3-operand instructions a*b + c (fused multiply-add, FMA)
2 EX units with FMA --> 2X increase in ops/cycle

Out-of-Order Issue in Haswell (2014)

Haswell has two copies of the FMA engine, on separate ports
Haswell sustains 4 micro-op issues per cycle
One possibility: 2 for AVX, and 2 for loads, stores, and book-keeping

Graphical Processing Units

Given the hardware invested to do graphics well, how can we supplement it to improve performance of a wider range of applications?
Basic idea: heterogeneous execution model
CPU is the host, GPU is the device
Develop a C-like programming language for the GPU
Unify all forms of GPU parallelism as the CUDA thread
Programming model is "Single Instruction Multiple Thread"

Kepler GK110 nVidia GPU

5.12 TeraOps/s: 2880 single-precision multiply-add units @ 889 MHz
$999 GTX Titan Black with 6GB GDDR5 (and 1 GPU)

SIMD Summary

Intel SSE SIMD instructions: one instruction fetch that operates on multiple operands simultaneously
256/128/64-bit multimedia (XMM & YMM) registers
Embed the SSE machine instructions directly into C programs through the use of intrinsics

Acknowledgements

These slides contain material developed and copyrighted by:
Morgan Kaufmann (Elsevier, Inc.)
Arvind (MIT)
Krste Asanovic (MIT/UCB)
Joel Emer (Intel/MIT)
James Hoe (CMU)
John Kubiatowicz (UCB)
David Patterson (UCB)
Justin Hsia (UCB)