
CS252 Graduate Architecture
Lecture 20: Vector Processing => Multimedia
David E. Culler (many slides due to Christoforos E. Kozyrakis)
4/9/02

Vector Processors
• Initially developed for supercomputing applications, today important for multimedia
• Vector processors have high-level operations that work on linear arrays of numbers: "vectors"

       SCALAR                      VECTOR
    (1 operation)               (N operations)

     r1    r2                    v1    v2
       \  /                        \  /    vector
        +                           +      length
        r3                          v3

    add r3, r1, r2              vadd.vv v3, v1, v2

Properties of Vector Processors
• Single vector instruction implies lots of work (loop)
– Fewer instruction fetches
• Each result independent of previous result
– Multiple operations can be executed in parallel
– Simpler design, high clock rate
– Compiler (programmer) ensures no dependencies
• Reduces branches and branch problems in pipelines
• Vector instructions access memory with known pattern
– Effective prefetching
– Amortize memory latency over large number of elements
– Can exploit a high-bandwidth memory system
– No (data) caches required!

Styles of Vector Architectures
• Memory-memory vector processors
– All vector operations are memory to memory
• Vector-register processors
– All vector operations between vector registers (except vector load and store)
– Vector equivalent of load-store architectures
– Includes all vector machines since late 1980s
– We assume vector-register for rest of the lecture


Historical Perspective
• Mid-60s: fear that performance stagnates
• SIMD processor arrays actively developed during late 60's to mid 70's
– bit-parallel machines for image processing: PEPE, STARAN, MPP
– word-parallel for scientific: Illiac IV
• Cray develops fast scalar: CDC 6600, 7600
• CDC bets on vectors with Star-100
• Amdahl argues against vector

Cray-1 Breakthrough
• Fast, simple scalar processor
– 80 MHz!
– single-phase, latches
• Exquisite electrical and mechanical design
• Vector register concept
– vast simplification of instruction set
– reduced necessary memory bandwidth
• Tight integration of vector and scalar
• Piggy-back off 7600 stacklib
• Later vectorizing compilers developed
• Owned high-performance computing for a decade
– what happened then? VLIW competition


Components of a Vector Processor
• Scalar CPU: registers, datapaths, instruction fetch logic
• Vector register
– Fixed-length memory bank holding a single vector
– Typically 8-32 vector registers, each holding 1 to 8 Kbits
– Has at least 2 read and 1 write ports
– MM: can be viewed as array of 64b, 32b, 16b, or 8b elements
• Vector functional units (FUs)
– Fully pipelined, start new operation every clock
– Typically 2 to 8 FUs: integer and FP
– Multiple datapaths (pipelines) used for each unit to process multiple elements per cycle
• Vector load-store units (LSUs)
– Fully pipelined unit to load or store a vector
– Multiple elements fetched/stored per cycle
– May have multiple LSUs
• Cross-bar to connect FUs, LSUs, registers

Cray-1 Block Diagram
• Simple 16-bit RR instr
• 32-bit with immed
• Natural combinations of scalar and vector
• Scalar bit-vectors match vector length
• Gather/scatter M-R
• Cond. merge


Basic Vector Instructions

  Instr.   Operands   Operation             Comment
  VADD.VV  V1,V2,V3   V1=V2+V3              vector + vector
  VADD.SV  V1,R0,V2   V1=R0+V2              scalar + vector
  VMUL.VV  V1,V2,V3   V1=V2xV3              vector x vector
  VMUL.SV  V1,R0,V2   V1=R0xV2              scalar x vector
  VLD      V1,R1      V1=M[R1..R1+63]       load, stride=1
  VLDS     V1,R1,R2   V1=M[R1..R1+63*R2]    load, stride=R2
  VLDX     V1,R1,V2   V1=M[R1+V2i,i=0..63]  indexed ("gather")
  VST      V1,R1      M[R1..R1+63]=V1       store, stride=1
  VSTS     V1,R1,R2   M[R1..R1+63*R2]=V1    store, stride=R2
  VSTX     V1,R1,V2   M[R1+V2i,i=0..63]=V1  indexed ("scatter")

+ all the regular scalar instructions (RISC style)…

Vector Memory Operations
• Load/store operations move groups of data between registers and memory
• Three types of addressing
– Unit stride: fastest
– Non-unit (constant) stride
– Indexed (gather-scatter)
• Vector equivalent of register indirect
• Good for sparse arrays of data
• Increases number of programs that vectorize
• compress/expand variant also
• Support for various combinations of data widths in memory
– {.L, .W, .H, .B} x {64b, 32b, 16b, 8b}
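The indexed forms are easy to pin down in plain C. A minimal sketch of the gather (VLDX) and scatter (VSTX) semantics, with the vector length passed explicitly; the function names and signatures are illustrative, not from any real ISA:

```c
#include <assert.h>
#include <stddef.h>

/* Gather: V1[i] = M[R1 + V2[i]] for i = 0..vl-1 */
void vldx(double *v1, const double *mem, const size_t *idx, size_t vl) {
    for (size_t i = 0; i < vl; i++)
        v1[i] = mem[idx[i]];
}

/* Scatter: M[R1 + V2[i]] = V1[i] for i = 0..vl-1 */
void vstx(const double *v1, double *mem, const size_t *idx, size_t vl) {
    for (size_t i = 0; i < vl; i++)
        mem[idx[i]] = v1[i];
}
```

This is exactly the access pattern that makes sparse arrays vectorizable: the index vector carries the sparsity structure.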


Vector Code Example: Y[0:63] = Y[0:63] + a*X[0:63]

64 element SAXPY: scalar
        LD    R0,a
        ADDI  R4,Rx,#512
  loop: LD    R2,0(Rx)
        MULTD R2,R0,R2
        LD    R4,0(Ry)
        ADDD  R4,R2,R4
        SD    R4,0(Ry)
        ADDI  Rx,Rx,#8
        ADDI  Ry,Ry,#8
        SUB   R20,R4,Rx
        BNZ   R20,loop

64 element SAXPY: vector
        LD      R0,a      #load scalar a
        VLD     V1,Rx     #load vector X
        VMUL.SV V2,R0,V1  #vector mult
        VLD     V3,Ry     #load vector Y
        VADD.VV V4,V2,V3  #vector add
        VST     Ry,V4     #store vector Y

Vector Length
• A vector register can hold some maximum number of elements for each data width (maximum vector length, or MVL)
• What to do when the application vector length is not exactly MVL?
• Vector-length (VL) register controls the length of any vector operation, including a vector load or store
– E.g. vadd.vv with VL=10 is: for (I=0; I<10; I++) V1[I]=V2[I]+V3[I]
• VL can be anything from 0 to MVL
• How do you code an application where the vector length is not known until run-time?
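The vector sequence above computes, element by element, exactly this C loop; here `vl` stands in for the VL register (the helper name is ours, not the slide's):

```c
#include <assert.h>
#include <stddef.h>

/* SAXPY with an explicit vector length: one VMUL.SV + VADD.VV + VST,
   applied to elements 0..vl-1. */
void saxpy_vl(size_t vl, double a, const double *x, double *y) {
    for (size_t i = 0; i < vl; i++)
        y[i] = a * x[i] + y[i];
}
```

With vl < MVL, the same instruction sequence handles a short leftover vector with no code change — which is the point of the VL register.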


Strip Mining
• Suppose application vector length > MVL
• Strip mining
– Generation of a loop that handles MVL elements per iteration
– A set of operations on MVL elements is translated to a single vector instruction
• Example: vector saxpy of N elements
– First loop handles (N mod MVL) elements, the rest handle MVL

  VL = (N mod MVL);          // set VL = N mod MVL
  for (I=0; I<VL; I++)       // 1st loop: a single set of
    Y[I] = A*X[I] + Y[I];    //   vector instructions
  low = (N mod MVL);
  VL = MVL;                  // set VL to MVL
  for (I=low; I<N; I+=MVL)   // 2nd loop: N/MVL sets of
    for (J=I; J<I+MVL; J++)  //   vector instructions
      Y[J] = A*X[J] + Y[J];

Optimization 1: Chaining
• Suppose:
  vmul.vv V1,V2,V3
  vadd.vv V4,V1,V5   # RAW hazard
• Chaining
– Vector register (V1) is not treated as a single entity but as a group of individual registers
– Pipeline forwarding can work on individual vector elements
• Flexible chaining: allow a vector instruction to chain to any other active vector operation => more read/write ports

(figure: timing diagram; unchained, the vadd cannot start until the entire vmul completes; chained, the vadd starts as soon as the first vmul element is available)

• Cray X-MP introduces memory chaining
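The strip-mining pattern can be checked in plain C. This sketch assumes MVL = 64 and folds the two loops into one while loop whose first trip handles the short N mod MVL strip:

```c
#include <assert.h>
#include <stddef.h>

#define MVL 64  /* maximum vector length assumed for this sketch */

/* Strip-mined SAXPY for arbitrary n: the first strip handles
   n mod MVL elements, every later strip handles exactly MVL. */
void saxpy_stripmined(size_t n, double a, const double *x, double *y) {
    size_t vl = n % MVL;   /* VL = N mod MVL */
    size_t low = 0;
    while (low < n) {
        for (size_t i = low; i < low + vl; i++)  /* one set of vector instrs */
            y[i] = a * x[i] + y[i];
        low += vl;
        vl = MVL;          /* all remaining strips are full length */
    }
}
```

When n is a multiple of MVL the first strip is empty (vl = 0) and the loop falls straight through to full-length strips.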

Optimization 2: Multi-lane Implementation

(figure: a multi-lane vector unit; each lane holds a partition of the vector register file and pipelined slices of the functional units, connected to/from the memory system)

• Elements for vector registers interleaved across the lanes
• Each lane receives identical control
• Multiple element operations executed per cycle
• Modular, scalable design
• No need for inter-lane communication for most vector instructions

Chaining & Multi-lane Example

  instruction stream: vld; vmul.vv; vadd.vv; addu; vld; vmul.vv; vadd.vv; addu

• VL=16, 4 lanes, 2 FUs, 1 LSU, chaining -> 12 ops/cycle
• Just one new instruction issued per cycle!!!!

Optimization 3: Conditional Execution
• Suppose you want to vectorize this:
  for (I=0; I<N; I++)
    if (A[I] != B[I]) A[I] -= B[I];
• Solution: conditional (masked) vector execution
– Cray uses vector mask & merge

Two Ways to View Vectorization
• Inner loop vectorization (classic approach)
• Outer loop vectorization
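One way to model mask-controlled execution in C, assuming a compare that sets a flag register and a subtract that only writes elements whose flag is set. The helper names echo the slide's instruction style but are illustrative, not Cray syntax:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* vcmp.neq: set flag f[i] where the elements differ */
void vcmp_neq(bool *f, const int *v1, const int *v2, size_t vl) {
    for (size_t i = 0; i < vl; i++)
        f[i] = (v1[i] != v2[i]);
}

/* masked vsub: a[i] -= b[i] only for elements whose flag is set */
void vsub_masked(int *a, const int *b, const bool *f, size_t vl) {
    for (size_t i = 0; i < vl; i++)
        if (f[i])
            a[i] -= b[i];
}
```

Together the two calls vectorize the if-body without a branch: the data-dependent control flow becomes a data-dependent mask.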

Vectorizing Matrix Mult

  // Matrix-matrix multiply:
  // sum a[i][t] * b[t][j] to get c[i][j]
  for (i=1; i<n; i++)
    for (j=1; j<n; j++) {
      sum = 0;
      for (t=1; t<n; t++)
        sum += a[i][t] * b[t][j];  // loop-carried dependence on sum
      c[i][j] = sum;
    }

Parallelize Inner Product

(figure: sum of partial products; a tree of adders combines the partial products so the inner-product reduction takes log steps instead of a serial chain)


Outer-loop Approach

  // Outer-loop Matrix-matrix multiply:
  // sum a[i][t] * b[t][j] to get c[i][j]
  // 32 elements of the result calculated in parallel
  // with each iteration of the j-loop (c[i][j:j+31])
  for (i=1; i<n; i++)
    for (j=1; j<n; j+=32) {
      sum[0:31] = 0;
      for (t=1; t<n; t++)
        sum[0:31] += a[i][t] * b[t][j:j+31];  // scalar x vector
      c[i][j:j+31] = sum[0:31];
    }

Approaches to Mediaprocessing

(figure: design space for multimedia processing, spanning general-purpose processors with SIMD extensions, vector processors, DSPs, and ASICs/FPGAs)
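The outer-loop scheme above runs as plain C if the vector strip is made explicit. A sketch with strip width W (4 here instead of the slide's 32, purely to keep the example small); it assumes row-major flat arrays and that n is a multiple of W:

```c
#include <assert.h>
#include <stddef.h>

#define W 4  /* strip width standing in for the 32 elements on the slide */

/* Outer-loop matrix multiply: a strip sum[0..W-1] of the result row is
   kept live across the whole t-loop, so each t iteration is one
   scalar-times-vector multiply-add. */
void matmul_outer(size_t n, const double *a, const double *b, double *c) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j += W) {
            double sum[W] = {0};                 /* sum[0:W-1] = 0 */
            for (size_t t = 0; t < n; t++)
                for (size_t k = 0; k < W; k++)   /* scalar x vector */
                    sum[k] += a[i*n + t] * b[t*n + j + k];
            for (size_t k = 0; k < W; k++)
                c[i*n + j + k] = sum[k];
        }
}
```

Note there is no loop-carried dependence across the k lane: each of the W accumulators reduces independently, which is why this form vectorizes where the inner-product form does not.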


What is Multimedia Processing?
• Desktop:
– 3D graphics (games)
– Speech recognition (voice input)
– Video/audio decoding (mpeg-mp3 playback)
• Servers:
– Video/audio encoding (video servers, IP telephony)
– Digital libraries and media mining (video servers)
– Computer animation, 3D modeling & rendering (movies)
• Embedded:
– 3D graphics (game consoles)
– Video/audio decoding & encoding (set top boxes)
– Image processing (digital cameras)
– Signal processing (cellular phones)

The Need for Multimedia ISAs
• Why aren't general-purpose processors and ISAs sufficient for multimedia (despite Moore's law)?
• Performance
– A 1.2GHz Athlon can do MPEG-4 encoding at 6.4fps
– One 384Kbps W-CDMA channel requires 6.9 GOPS
• Power consumption
– A 1.2GHz Athlon consumes ~60W
– Power consumption increases with clock frequency and complexity
• Cost
– A 1.2GHz Athlon costs ~$62 to manufacture and has a list price of ~$600 (module)
– Cost increases with complexity, area, power, etc


Example: MPEG Decoding (load breakdown)
  Input stream
  -> Parsing (10%)
  -> Dequantization (20%)
  -> IDCT (25%)
  -> Block Reconstruction (30%)
  -> RGB->YUV (15%)
  -> Output to screen

Example: 3D Graphics (load breakdown)
  Display lists
  -> Geometry pipe: Transform (10%), Lighting (10%)
  -> Setup, Clipping (10%)
  -> Rendering pipe: Rasterization, Anti-aliasing (35%); Shading, fogging, texture mapping, alpha blending, Z-buffer, frame-buffer ops (55%)
  -> Output to screen

Characteristics of Multimedia Apps (1)
• Requirement for real-time response
– "Incorrect" result often preferred to slow result
– Unpredictability can be bad (e.g. dynamic execution)
• Narrow data-types
– Typical width of data in memory: 8 to 16 bits
– Typical width of data during computation: 16 to 32 bits
– 64-bit data types rarely needed
– Fixed-point arithmetic often replaces floating-point
• Fine-grain (data) parallelism
– Identical operation applied on streams of input data
– Branches have high predictability
– High instruction locality in small loops or kernels

Characteristics of Multimedia Apps (2)
• Coarse-grain parallelism
– Most apps organized as a pipeline of functions
– Multiple threads of execution can be used
• Memory requirements
– High bandwidth requirements but can tolerate high latency
– High spatial locality (predictable pattern) but low temporal locality
– Cache bypassing and prefetching can be crucial


Examples of Media Functions
• Matrix transpose/multiply (3D graphics)
• DCT/FFT (video, audio, communications)
• Motion estimation (video)
• Gamma correction (3D graphics)
• Haar transform (media mining)
• Median filter (image processing)
• Separable convolution (image processing)
• Viterbi decode (communications, speech)
• Bit packing (communications, cryptography)
• Galois-field arithmetic (communications, cryptography)
• …

SIMD Extensions for GPPs
• Motivation
– Low media-processing performance of GPPs
– Cost and lack of flexibility of specialized ASICs for graphics/video
– Underutilized datapaths and registers
• Basic idea: sub-word parallelism
– Treat a 64-bit register as a vector of 2 32-bit or 4 16-bit or 8 8-bit values (short vectors)
– Partition 64-bit datapaths to handle multiple narrow operations in parallel
• Initial constraints
– No additional architecture state (registers)
– No additional exceptions
– Minimum area overhead
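Sub-word parallelism can even be done by hand, without hardware support, by masking off inter-lane carries. A minimal sketch: wrapping (non-saturating) addition of eight 8-bit lanes packed in one 64-bit word, which is the operation SIMD extensions implement by physically partitioning the datapath:

```c
#include <assert.h>
#include <stdint.h>

/* Add eight packed 8-bit lanes; carries never cross lane boundaries. */
uint64_t padd8(uint64_t x, uint64_t y) {
    const uint64_t low7 = 0x7f7f7f7f7f7f7f7fULL;
    /* Add the low 7 bits of each lane: any carry stays inside the lane,
       landing in that lane's own top bit. */
    uint64_t sum = (x & low7) + (y & low7);
    /* Fix up each lane's top bit (xor of the operands' top bits and the
       carry-in), so nothing leaks into the neighboring lane. */
    return sum ^ ((x ^ y) & ~low7);
}
```

The appeal to hardware designers is visible here: a real partitioned adder just cuts the carry chain at lane boundaries, at almost zero area cost.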


Overview of SIMD Extensions

  Vendor    Extension    Year   # Instr       Registers
  HP        MAX-1 and 2  94,95  9,8 (int)     Int 32x64b
  Sun       VIS          95     121 (int)     FP 32x64b
  Intel     MMX          97     57 (int)      FP 8x64b
  AMD       3DNow!       98     21 (fp)       FP 8x64b
  Motorola  Altivec      98     162 (int,fp)  32x128b (new)
  Intel     SSE          98     70 (fp)       8x128b (new)
  MIPS      MIPS-3D      ?      23 (fp)       FP 32x64b
  AMD       E 3DNow!     99     24 (fp)       8x128b (new)
  Intel     SSE-2        01     144 (int,fp)  8x128b (new)

Summary of SIMD Operations (1)
• Integer arithmetic
– Addition and subtraction with saturation
– Fixed-point rounding modes for multiply and shift
– Sum of absolute differences
– Multiply-add, multiplication with reduction
– Min, max
• Floating-point arithmetic
– Packed floating-point operations
– Square root, reciprocal
– Exception masks
• Data communication
– Merge, insert, extract
– Pack, unpack (width conversion)
– Permute, shuffle
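Saturation is the signature integer operation in these extensions: a pixel sum that overflows should clamp to white, not wrap to black. A one-lane sketch of unsigned 8-bit saturating add (MMX's PADDUSB applies this per lane across the packed register):

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned 8-bit add that clamps at 255 instead of wrapping. */
uint8_t addu8_sat(uint8_t a, uint8_t b) {
    uint16_t s = (uint16_t)a + b;       /* widen so the overflow is visible */
    return s > 255 ? 255 : (uint8_t)s;  /* saturate */
}
```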


Summary of SIMD Operations (2)
• Comparisons
– Integer and FP packed comparison
– Compare absolute values
– Element masks and bit vectors
• Memory
– No new load-store instructions for short vectors
• No support for strides or indexing
– Short vectors handled with 64b load and store instructions
– Pack, unpack, shift, rotate, shuffle to handle alignment of narrow data-types within a wider one
– Prefetch instructions for utilizing temporal locality

Programming with SIMD Extensions
• Optimized shared libraries
– Written in assembly, distributed by vendor
– Need well-defined API for data format and use
• Language macros for variables and operations
– C/C++ wrappers for short vector variables and function calls
– Allows instruction scheduling and register allocation optimizations for specific processors
– Lack of portability, non-standard
• Compilers for SIMD extensions
– No commercially available compiler so far
– Problems
• Language support for expressing fixed-point arithmetic and SIMD parallelism
• Complicated model for loading/storing vectors
• Frequent updates
• Assembly coding


SIMD Performance

(figure: arithmetic and geometric mean speedup over the base architecture on Berkeley media benchmarks for Athlon, Alpha 21264, Pentium III, PowerPC G4, and UltraSparc IIi; most speedups fall between about 1.3x and 7.6x)

• Limitations
– Memory bandwidth
– Overhead of handling alignment and data width adjustments

A Closer Look at MMX/SSE

(figure: per-kernel speedups on a Pentium III (500MHz) with MMX/SSE, ranging up to 31.1x)

• Higher speedup for kernels with narrow data where 128b SSE instructions can be used
• Lower speedup for those with irregular or strided accesses

Choosing the Data Type Width
• Alternatives for selecting the width of elements in a vector register (64b, 32b, 16b, 8b)
• Separate instructions for each width
– E.g. vadd64, vadd32, vadd16, vadd8
– Popular with SIMD extensions for GPPs
– Uses too many opcodes
• Specify it in a control register
– Virtual-processor width (VPW)
– Updated only on width changes
• NOTE
– MVL increases when width (VPW) gets narrower
– E.g. with 2Kbits per register, MVL is 32, 64, 128, 256 for 64-, 32-, 16-, 8-bit data respectively
– Always pick the narrowest VPW needed by the application

Other Features for Multimedia
• Support for fixed-point arithmetic
– Saturation, rounding modes etc
• Permutation instructions on vector registers
– For reductions and FFTs
– Not general permutations (too expensive)
• Example: permutation for reductions
– Move 2nd half of a vector register into another one

    V0: | 0 … 15 | 16 … 63 |
    V1: | 0 … 15 | 16 … 63 |

– Repeatedly use with vadd to execute reduction
– Vector length halved after each step
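The halving reduction reads naturally as C. A minimal sketch in which each step plays the move-upper-half permutation followed by a vadd, so the live vector length halves until one element remains; it assumes vl is a power of two:

```c
#include <assert.h>
#include <stddef.h>

/* Sum-reduce v[0..vl-1] by repeated halving: each step adds the upper
   half of the live region into the lower half (the "permute + vadd"
   pair from the slide), so log2(vl) steps finish the reduction. */
int vreduce_sum(int *v, size_t vl) {
    while (vl > 1) {
        vl /= 2;                   /* vector length halves each step */
        for (size_t i = 0; i < vl; i++)
            v[i] += v[i + vl];     /* vadd with the moved half */
    }
    return v[0];
}
```

The serial loop would take vl-1 dependent adds; this form exposes vl/2 independent adds per step, which is what the permutation instruction buys.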


Designing a Vector Processor
• Changes to scalar core?
• How to pick the maximum vector length?
• How to pick the number of vector registers?
• Context switch overhead?
• Exception handling?
• Masking and flag instructions?

Changes to Scalar Core
• Decode vector instructions
• Send scalar registers to vector unit (vector-scalar ops)
• Synchronization for results back from vector register, including exceptions
• Things that don't run in vector mode don't have high ILP, so the scalar CPU can be kept simple


How to Pick Max. Vector Length?
• Vector length must keep all VFUs busy:

  vector length >= (# lanes) x (# VFUs) / (# vector instr. issued/cycle)

• Notes:
– Single instruction issue is always the simplest
– Don't forget you have to issue some scalar instructions as well
– Cray gets mileage from VL <= word length

How to Pick # of Vector Registers?
• More vector registers:
– Reduce vector register "spills" (save/restore)
– Allow aggressive scheduling of vector instructions: better compiling to take advantage of ILP
• Fewer vector registers:
– Fewer bits in instruction format (usually 3 fields)
• 32 vector registers are usually enough


Context Switch Overhead?
• The vector register file holds a huge amount of architectural state
– Too expensive to save and restore all of it on each context switch
– Cray: exchange packet
• Extra dirty bit per processor
– If vector registers are not written, don't need to save them on a context switch
• Extra valid bit per vector register, cleared on process start
– Don't need to restore on context switch until needed
– Save/restore vector state only if the new context needs to issue vector instructions

Exception Handling: Arithmetic
• Arithmetic traps are hard
• Precise interrupts => large performance loss
– Multimedia applications don't care much about arithmetic traps anyway
• Alternative model
– Store exception information in vector flag registers
– A set flag bit indicates that the corresponding element operation caused an exception
– Software inserts trap barrier instructions to check the flag bits as needed
– IEEE floating point requires 5 flag registers (5 types of traps)


Exception Handling: Page Faults
• Page faults must be precise
– Instruction page faults not a problem
– Data page faults harder
• Option 1: save/restore internal vector unit state
– Freeze pipeline, (dump all vector state), fix fault, (restore state and) continue vector pipeline
• Option 2: expand memory pipeline to check all addresses before sending them to memory
– Requires address and instruction buffers to avoid stalls during address checks
– On a page fault one only needs to save the state in those buffers
– Instructions that have cleared the buffer can be allowed to complete

Exception Handling: Interrupts
• Interrupts due to external sources
– I/O, timers etc
• Handled by the scalar core
• Should the vector unit be interrupted?
– Not immediately (no context switch)
– Only if it causes an exception or the interrupt handler needs to execute a vector instruction


Vector Power Consumption
• Can trade off parallelism for power
– Power = C * Vdd^2 * f
– If we double the lanes, peak performance doubles
– Halving f restores peak performance but also allows halving of Vdd
– Power_new = (2C) * (Vdd/2)^2 * (f/2) = Power/4
• Simpler logic
– Replicated control for all lanes
– No multiple issue or dynamic execution logic
• Simpler to gate clocks
– Each vector instruction explicitly describes all the resources it needs for a number of cycles
– Conditional execution leads to further savings

Why Vectors for Multimedia?
• Natural match to parallelism in multimedia
– Vector operations with VL the image or frame width
– Easy to efficiently support vectors of narrow data types
• High performance at low cost
– Multiple ops/cycle while issuing 1 instr/cycle
– Multiple ops/cycle at low power consumption
– Structured access pattern for registers and memory
• Scalable
– Get higher performance by adding lanes without architecture modifications
• Compact code size
– Describe N operations with 1 short instruction (v. VLIW)
• Predictable performance
– No need for caches, no dynamic execution
• Mature, developed compiler technology
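The power arithmetic on this slide is worth checking numerically. A tiny sketch, with `dyn_power` an illustrative helper for the dynamic-power formula, confirming that doubling capacitance (lanes) while halving both Vdd and f lands at one quarter of the original power:

```c
#include <assert.h>
#include <math.h>

/* Dynamic power model from the slide: Power = C * Vdd^2 * f */
double dyn_power(double c, double vdd, double f) {
    return c * vdd * vdd * f;
}
```

Usage: dyn_power(2*C, Vdd/2, f/2) evaluates to (2C)*(Vdd^2/4)*(f/2) = C*Vdd^2*f / 4, matching the slide's Power/4, while the doubled lanes at half frequency preserve peak element throughput.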


A Vector Media-Processor: VIRAM
• Technology: IBM SA-27E
– 0.18µm CMOS, 6 copper layers
• 280 mm² die area
– 158 mm² DRAM, 50 mm² logic
• Transistor count: ~115M
– 14 Mbytes DRAM
• Power supply & consumption
– 1.2V for logic, 1.8V for DRAM
– 2W at 1.2V
• Peak performance
– 1.6/3.2/6.4 Gops (64/32/16b ops)
– 3.2/6.4/12.8 Gops (with madd)
– 1.6 Gflops (single-precision)
• Designed by 5 graduate students

Performance Comparison

                      VIRAM   MMX
  iDCT                0.75    3.75 (5.0x)
  Color Conversion    0.78    8.00 (10.2x)
  Image Convolution   1.23    5.49 (4.5x)
  QCIF (176x144)      7.1M    33M  (4.6x)
  CIF (352x288)       28M     140M (5.0x)

• QCIF and CIF numbers are in clock cycles per frame
• All other numbers are in clock cycles per pixel
• MMX results assume no first-level cache misses


FFT (1): FFT (floating-point, 1024 points)

(figure: bar chart of execution time in µsec; bars for VIRAM, Wildstar, TigerSHARC, ADSP-21160, TMS320C6701, PPC 604E, and Pentium, spanning roughly 7 to 125 µsec)

FFT (2): FFT (fixed-point, 256 points)

(figure: bar chart of execution time in µsec; bars for VIRAM, Pathfinder-1, Pathfinder-2, Carmel, TigerSHARC, PPC 604E, and Pentium, spanning roughly 7 to 151 µsec)

SIMD Summary
• Narrow vector extensions for GPPs
– 64b or 128b registers as vectors of 32b, 16b, and 8b elements
• Based on sub-word parallelism and partitioned datapaths
• Instructions
– Packed fixed- and floating-point, multiply-add, reductions
– Pack, unpack, permutations
– Limited memory support
• 2x to 4x performance improvement over base architecture
– Limited by memory bandwidth
• Difficult to use (no compilers)

Vector Summary
• Alternative model for explicitly expressing data parallelism
• If code is vectorizable, then simpler hardware, more power efficient, and a better real-time model than out-of-order machines with SIMD support
• Design issues include number of lanes, number of functional units, number of vector registers, length of vector registers, exception handling, conditional operations
• Will multimedia popularity revive vector architectures?

