
CS252 Graduate Computer Architecture
Lecture 13: Vector Processing (Con’t), Intro to Multiprocessing
March 8th, 2010
John Kubiatowicz
Electrical Engineering and Computer Sciences, University of California, Berkeley
http://www.eecs.berkeley.edu/~kubitron/cs252

Recall: Vector Programming Model
• Scalar registers r0-r15 alongside vector registers v0-v15; each vector register holds elements [0], [1], [2], ..., [VLRMAX-1]
• Vector Length Register (VLR) sets how many elements a vector instruction operates on
• Vector arithmetic instructions, e.g. ADDV v3, v1, v2: element-wise add of v1 and v2 into v3 over elements [0] through [VLR-1]
• Vector load and store instructions, e.g. LV v1, r1, r2: load vector register v1 from memory starting at base address r1 with stride r2
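For concreteness, here is a minimal C sketch of these two instruction semantics; the element count vlr and the element-granularity stride are illustrative parameters, not part of any particular ISA.

    #include <stddef.h>

    /* LV v1, r1, r2: load vlr elements starting at base, stepping by stride
       (stride given in elements here; illustrative, not a real ISA encoding) */
    void lv(double *v1, const double *base, size_t stride, size_t vlr) {
        for (size_t i = 0; i < vlr; i++)
            v1[i] = base[i * stride];
    }

    /* ADDV v3, v1, v2: element-wise add of the first vlr elements */
    void addv(double *v3, const double *v1, const double *v2, size_t vlr) {
        for (size_t i = 0; i < vlr; i++)
            v3[i] = v1[i] + v2[i];
    }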

Vector Memory-Memory vs. Vector Register Machines
• Vector memory-memory instructions hold all vector operands in main memory
• The first vector machines, CDC Star-100 (‘73) and TI ASC (‘71), were memory-memory machines
• Cray-1 (’76) was first vector register machine
• Example source code:
      for (i=0; i<N; i++)
          C[i] = A[i] + B[i];
• Vector memory-memory code:
      ADDV C, A, B
  A vector register machine instead loads A and B into vector registers (LV), adds them (ADDV), and stores the result back (SV)

Recall: Vector Unit Structure
(Figure: a functional unit replicated across lanes; the vector register file is partitioned so one lane holds elements 0, 4, 8, ..., the next holds 1, 5, 9, ..., then 2, 6, 10, ..., then 3, 7, 11, ..., with all lanes sharing the memory subsystem)

Vector Stripmining
• Problem: Vector registers have finite length
• Solution: Break loops into pieces that fit into vector registers, “stripmining”
      ANDI R1, N, 63    # N mod 64: length of the first, odd-sized piece; the remaining pieces use the full vector length
  (a C sketch appears below)

Vector Instruction Parallelism
• Can overlap execution of multiple vector instructions
  – example machine has 32 elements per vector register and 8 lanes
(Figure: load unit, multiply unit, and add unit each busy with a different vector instruction in the same cycle)
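A minimal C sketch of stripmining for a C = A + B loop, assuming a maximum vector length of 64 elements; the MVL constant and function name are illustrative.

    #include <stddef.h>

    #define MVL 64                      /* assumed maximum vector length */

    void stripmined_add(double *C, const double *A, const double *B, size_t N) {
        size_t vl = N % MVL;            /* first piece: N mod 64 elements */
        if (vl == 0) vl = MVL;          /* no odd-sized piece if N divides evenly */
        for (size_t i = 0; i < N; ) {
            /* one strip: set VLR = vl, then LV, LV, ADDV, SV */
            for (size_t j = 0; j < vl; j++)
                C[i + j] = A[i + j] + B[i + j];
            i += vl;
            vl = MVL;                   /* all later pieces are full length */
        }
    }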

Vector Chaining
      LV    v1
      MULV  v3, v1, v2
      ADDV  v5, v3, v4
• With chaining, can start a dependent instruction as soon as the first result appears
(Figure: over time, element results flow from memory through the load unit, are chained into the multiply unit, and then into the add unit, so all three units overlap)


Vector Startup
• Two components of vector startup penalty:
  – functional unit latency (time through pipeline)
  – dead time or recovery time (time before another vector instruction can start down pipeline)
(Figure: per-element pipeline occupancy, R X X X W for each element of the first vector instruction, then 4 cycles of dead time before the second vector instruction can enter the pipeline)

Dead Time and Short Vectors
• T0, eight lanes: no dead time, 100% efficiency with 8-element vectors
• Cray C90, two lanes: 4-cycle dead time, 64 cycles active
  – Maximum efficiency 94% with 128-element vectors (128 elements / 2 lanes = 64 active cycles; 64 / (64 + 4) ≈ 94%)

Vector Scatter/Gather
• Want to vectorize loops with indirect accesses, e.g.
      for (i=0; i<N; i++)
          A[i] = B[i] + C[D[i]];   /* index vector D drives the access to C */
• Solution: an indexed (gather) vector load reads the index vector first, then loads elements from base + index; an indexed (scatter) store works the same way in reverse (see the C sketch below)

Vector Conditional Execution
• Problem: Want to vectorize loops with conditional code:
      for (i=0; i<N; i++)
          if (A[i] > 0) then A[i] = B[i];
• Solution: Add vector mask (or flag) registers
  – vector version of predicate registers, 1 bit per element
  – a vector operation becomes a NOP at elements where the mask bit is clear
• Code example:
      # a vector compare of A against 0 sets the per-element mask bits
      LV vA, rB   # Load B vector into A under mask
      SV vA, rA   # Store A back to memory under mask
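The following C sketch shows what the hardware is doing in both cases: a gather through an index vector and a masked (predicated) update. Both loop bodies are the textbook examples quoted above, and the function names are illustrative.

    /* Gather: A[i] = B[i] + C[D[i]] -- the load from C is indexed by vector D */
    void gather_add(double *A, const double *B, const double *C,
                    const int *D, int N) {
        for (int i = 0; i < N; i++)
            A[i] = B[i] + C[D[i]];
    }

    /* Masked execution: if (A[i] > 0) A[i] = B[i];
       The vector compare sets one mask bit per element; the masked load and
       store only touch elements whose mask bit is set. */
    void masked_copy(double *A, const double *B, int N) {
        for (int i = 0; i < N; i++) {
            int mask = (A[i] > 0);      /* compare sets the mask bit */
            if (mask)
                A[i] = B[i];            /* LV/SV performed under mask */
        }
    }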

Masked Vector Instructions
• Simple implementation
  – execute all N operations, turn off result writeback according to mask
• Density-time implementation
  – scan mask vector and only execute elements with non-zero masks
(Figure: on the left, the write-enable of the write data port is gated by mask bits M[7]..M[0]; on the right, only the enabled elements are streamed to the write data port)

Compress/Expand Operations
• Compress packs non-masked elements from one vector register contiguously at start of destination vector register
  – population count of mask vector gives packed vector length
• Expand performs inverse operation
• Used for density-time conditionals and also for general selection operations
(Figure: compress and expand of elements A[7]..A[0] under mask bits M[7]..M[0])

Administrivia
• Exam: one week from Wednesday (3/17)
• Location: 310 Soda
• Time: 6:00-9:00

Vector Reductions
• Problem: Loop-carried dependence on reduction variables
      sum = 0;
      for (i=0; i<N; i++)
          sum += A[i];        # loop-carried dependence on sum
• Solution: re-associate the operations, accumulating a vector of partial sums across stripmined chunks, then repeatedly halving the vector length and adding the upper half of the partials into the lower half until only one element remains (see the C sketch below)
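A C sketch of that re-association, with VL partial sums kept in what would be one vector register; VL, the helper name, and the assumption that N is a multiple of VL are all illustrative.

    #define VL 64                            /* assumed vector length */

    double vector_sum(const double *A, int N) {
        double partial[VL] = {0};            /* one vector register of partial sums */
        /* stripmined vector adds: partial[j] accumulates every VL-th element
           (this sketch assumes N is a multiple of VL) */
        for (int i = 0; i < N; i += VL)
            for (int j = 0; j < VL; j++)
                partial[j] += A[i + j];
        /* binary tree: halve the number of partials each step until one is left */
        for (int vl = VL / 2; vl >= 1; vl /= 2)
            for (int j = 0; j < vl; j++)
                partial[j] += partial[j + vl];
        return partial[0];
    }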


Novel Matrix Multiply Solution / Optimized Vector Example
• Consider the following:
      /* Multiply a[m][k] * b[k][n] to get c[m][n] */
      for (i=1; i<m; i++) {
          ...
      }
  (a reduction-free vectorization is sketched below)
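One reduction-free way to vectorize this multiply is to vectorize across columns of c, so each element of a vector register accumulates an independent dot product. Whether this matches the slide's exact code is an assumption; VL, the flattened row-major layout, and the requirement that n be a multiple of VL are illustrative choices.

    #define VL 32                                  /* assumed vector length */

    /* c[m][n] = a[m][k] * b[k][n], vectorizing across columns of c so each
       vector element holds an independent running dot product (no reduction).
       Arrays are row-major flat buffers; this sketch assumes n % VL == 0. */
    void matmul(int m, int k, int n,
                const double *a, const double *b, double *c) {
        for (int i = 0; i < m; i++)
            for (int j = 0; j < n; j += VL) {
                double acc[VL] = {0};              /* one vector register of sums */
                for (int t = 0; t < k; t++)
                    for (int v = 0; v < VL; v++)   /* scalar a[i][t] times a vector of b */
                        acc[v] += a[i*k + t] * b[t*n + j + v];
                for (int v = 0; v < VL; v++)       /* store the finished strip of row i */
                    c[i*n + j + v] = acc[v];
            }
    }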

Multimedia Extensions
• Very short vectors added to existing ISAs for microprocessors
• Usually 64-bit registers split into 2x32b, 4x16b, or 8x8b elements
• Newer designs have 128-bit registers (Altivec, SSE2)
• Limited instruction set:
  – no vector length control
  – no strided load/store or scatter/gather
  – unit-stride loads must be aligned to 64/128-bit boundary
• Limited vector register length:
  – requires superscalar dispatch to keep multiply/add/load units busy
  – loop unrolling to hide latencies increases register pressure
• Trend towards fuller vector support in microprocessors

“Vector” for Multimedia?
• Intel MMX: 57 additional 80x86 instructions (1st since 386)
  – similar to Intel 860, Mot. 88110, HP PA-7100LC, UltraSPARC
• 3 data types: 8x8-bit, 4x16-bit, 2x32-bit, packed in 64 bits
  – reuse 8 FP registers (FP and MMX cannot mix)
• Short vector: load, add, store 8 8-bit operands
• Claim: overall speedup 1.5 to 2X for 2D/3D graphics, audio, video, speech, comm., ...
  – use in drivers or added to library routines; no compiler support
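The core packed-arithmetic idea is easy to state in plain C: eight independent 8-bit adds carried out on one 64-bit word, each lane wrapping on its own. The function below is an illustrative emulation, not an actual MMX intrinsic.

    #include <stdint.h>

    /* Add eight packed 8-bit lanes of two 64-bit words; each lane wraps
       around independently (illustrative emulation of a packed add) */
    uint64_t packed_add_8x8(uint64_t x, uint64_t y) {
        uint64_t result = 0;
        for (int lane = 0; lane < 8; lane++) {
            uint8_t a = (uint8_t)(x >> (8 * lane));
            uint8_t b = (uint8_t)(y >> (8 * lane));
            result |= (uint64_t)(uint8_t)(a + b) << (8 * lane);
        }
        return result;
    }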


VLIW: Very Large Instruction Word (revisited)
• Each “instruction” has explicit coding for multiple operations
  – In IA-64, grouping called a “packet”
  – In Transmeta, grouping called a “molecule” (with “atoms” as ops)
• Tradeoff instruction space for simple decoding
  – The long instruction word has room for many operations
  – By definition, all the operations the compiler puts in the long instruction word are independent => execute in parallel
  – E.g., 2 integer operations, 2 FP ops, 2 memory refs, 1 branch
    » 16 to 24 bits per field => 7*16 or 112 bits to 7*24 or 168 bits wide
  – Need compiling technique that schedules across several branches

Recall: Loop Unrolling in VLIW
      Memory ref 1        Memory ref 2        FP operation 1      FP operation 2      Int. op/branch      Clock
      L.D  F0,0(R1)       L.D  F6,-8(R1)                                                                  1
      L.D  F10,-16(R1)    L.D  F14,-24(R1)                                                                2
      L.D  F18,-32(R1)    L.D  F22,-40(R1)    ADD.D F4,F0,F2      ADD.D F8,F6,F2                          3
      L.D  F26,-48(R1)                        ADD.D F12,F10,F2    ADD.D F16,F14,F2                        4
                                              ADD.D F20,F18,F2    ADD.D F24,F22,F2                        5
      S.D  0(R1),F4       S.D  -8(R1),F8      ADD.D F28,F26,F2                                            6
      S.D  -16(R1),F12    S.D  -24(R1),F16                                                                7
      S.D  -32(R1),F20    S.D  -40(R1),F24                                        DSUBUI R1,R1,#48        8
      S.D  -0(R1),F28                                                             BNEZ R1,LOOP            9
• Unrolled 7 times to avoid delays
• 7 results in 9 clocks, or 1.3 clocks per iteration (1.8X)
• Average: 2.5 ops per clock, 50% efficiency
• Note: Need more registers in VLIW (15 vs. 6 in SS)
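The schedule above is the unrolled form of a loop that adds a scalar (held in F2) to every element of an array of doubles, walking downward through memory via R1. A C sketch of that source loop, with an illustrative trip count:

    /* Scalar source of the unrolled VLIW schedule above: add the scalar s to
       each element of x, walking from high addresses down.  The bound of 1000
       is illustrative; the point is that the iterations are independent. */
    void add_scalar(double *x, double s) {
        for (int i = 1000; i > 0; i--)
            x[i] = x[i] + s;
    }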

Paper Discussion: VLIW and the ELI-512 (Joshua Fisher)
• Trace Scheduling:
  – Find common paths through code (“traces”)
  – Compact them
  – Build fixup code for trace exits (the “split” boxes)
  – Must not overwrite live variables: use extra variables to store results
• N+1 way jumps
  – Used to handle exit conditions from traces
• Software prediction of memory bank usage
  – Use it to avoid bank conflicts / deal with limited routing

Problems with 1st Generation VLIW
• Increase in code size
  – generating enough operations in a straight-line code fragment requires ambitiously unrolling loops
  – whenever VLIW instructions are not full, unused functional units translate to wasted bits in instruction encoding
• Operated in lock-step; no hazard detection HW
  – a stall in any functional unit pipeline caused the entire processor to stall, since all functional units must be kept synchronized
  – Compiler might predict functional unit latencies, but caches are hard to predict
• Binary code compatibility
  – Pure VLIW => different numbers of functional units and unit latencies require different versions of the code

Intel/HP IA-64 “Explicitly Parallel Instruction Computer (EPIC)”
• IA-64: instruction set architecture
  – 128 64-bit integer regs + 128 82-bit floating point regs
    » Not separate register files per functional unit as in old VLIW
  – Hardware checks dependencies (interlocks => binary compatibility over time)
• 3 instructions in 128-bit “bundles”; a field determines whether instructions are dependent or independent
  – Smaller code size than old VLIW, larger than x86/RISC
  – Groups can be linked to show independence > 3 instr
• Predicated execution (select 1 out of 64 1-bit flags)
  => 40% fewer mispredictions?
• Speculation support:
  – deferred exception handling with “poison bits”
  – speculative movement of loads above stores + check to see if incorrect
• Itanium™ was first implementation (2001)
  – Highly parallel and deeply pipelined hardware at 800 MHz
  – 6-wide, 10-stage pipeline at 800 MHz on 0.18 µ process
• Itanium 2™ is name of 2nd implementation (2005)
  – 6-wide, 8-stage pipeline at 1666 MHz on 0.13 µ process
  – Caches: 32 KB I, 32 KB D, 128 KB L2I, 128 KB L2D, 9216 KB L3

EPIC Design Maximizes SW-HW Synergy (Copyright: Intel at Hotchips ’00)
• Architecture features programmed by the compiler: register stack & rotation, branch hints, explicit parallelism, data & control speculation, predication, memory hints
• Micro-architecture features in hardware: fetch, issue, register handling, control, parallel resources, memory subsystem
(Figure: fast, simple 6-issue front end with instruction cache & branch predictors; 128 GR & 128 FR with register remap & stack engine; parallel resources of 4 integer + 4 MMX units, 2 FMACs (4 for SSE), 2 LD/ST units, 32-entry ALAT; bypasses & dependencies; three levels of cache: L1, L2, L3; speculation deferral management)

10 Stage In-Order Core Pipeline (Copyright: Intel at Hotchips ’00)
• Front end: pre-fetch/fetch of up to 6 instructions/cycle, hierarchy of branch predictors, decoupling buffer
• Instruction delivery: dispersal of up to 6 instructions on 9 ports, register remapping, register stack engine
• Operand delivery: register read + bypasses, register scoreboard, predicated dependencies
• Execution: 4 single-cycle ALUs, 2 ld/str, advanced load control, predicate delivery & branch, nat/exception/retirement
• Stages: IPG (instruction pointer generation), FET (fetch), ROT (rotate), EXP (expand), REN (rename), WLD (word-line decode), REG (register read), EXE (execute), DET (exception detect), WRB (write-back)

What is Parallel Architecture?
• A parallel computer is a collection of processing elements that cooperate to solve large problems
  – Most important new element: It is all about communication!
• What does the programmer (or OS or compiler) think about?
  – Models of computation:
    » PRAM? BSP? Sequential Consistency?
  – Resource allocation:
    » how powerful are the elements?
    » how much memory?
• What mechanisms must be in hardware vs software?
  – What does a single processor look like?
    » High performance general purpose processor
    » SIMD processor / vector processor
  – Data access, communication and synchronization
    » how do the elements cooperate and communicate?
    » how are data transmitted between processors?
    » what are the abstractions and primitives for cooperation?

Flynn’s Classification (1966)
Broad classification of parallel computing systems:
• SISD: Single Instruction, Single Data
  – conventional uniprocessor
• SIMD: Single Instruction, Multiple Data
  – one instruction stream, multiple data paths
  – distributed memory SIMD (MPP, DAP, CM-1&2, Maspar)
  – shared memory SIMD (STARAN, vector computers)
• MIMD: Multiple Instruction, Multiple Data
  – message passing machines (Transputers, nCube, CM-5)
  – non-cache-coherent shared memory machines (BBN Butterfly, T3D)
  – cache-coherent shared memory machines (Sequent, Sun Starfire, SGI Origin)
• MISD: Multiple Instruction, Single Data
  – Not a practical configuration

Examples of MIMD Machines
• Symmetric Multiprocessor
  – Multiple processors in a box with shared memory communication (figure: processors P on a bus to memory)
  – Current multicore chips are like this
  – Every processor runs a copy of the OS
• Non-uniform shared-memory with separate I/O through host
  – Multiple processors
    » Each with local memory
    » general scalable network
  – Extremely light “OS” on each node provides simple services
    » Scheduling/synchronization
  – Network-accessible host for I/O
• Cluster
  – Many independent machines connected with a general network
  – Communication through messages

Categories of Thread Execution
(Figure: issue slots over time for superscalar, fine-grained multithreading, coarse-grained multithreading, multiprocessing, and simultaneous multithreading; slots filled by Threads 1-5 or left idle)

Parallel Programming Models
• Programming model is made up of the languages and libraries that create an abstract view of the machine
• Control
  – How is parallelism created?
  – What orderings exist between operations?
  – How do different threads of control synchronize?
• Data
  – What data is private vs. shared?
  – How is logically shared data accessed or communicated?
• Synchronization
  – What operations can be used to coordinate parallelism?
  – What are the atomic (indivisible) operations?
• Cost
  – How do we account for the cost of each of the above?

Simple Programming Example
• Consider applying a function f to the elements of an array A and then computing its sum:
      Σ_{i=0..n-1} f(A[i])
• Questions:
  – Where does A live? All in single memory? Partitioned?
  – What work will be done by each processor?
  – They need to coordinate to get a single result, how?
• Sequential view:
      A = array of all data
      fA = f(A)
      s = sum(fA)
(Figure: A flows through f to produce fA, which is summed to give s)

Programming Model 1: Shared Memory
• Program is a collection of threads of control.
  – Can be created dynamically, mid-execution, in some languages
• Each thread has a set of private variables, e.g., local stack variables
• Also a set of shared variables, e.g., static variables, shared common blocks, or global heap.
  – Threads communicate implicitly by writing and reading shared variables.
  – Threads coordinate by synchronizing on shared variables
(Figure: a shared memory holding s and code like y = ..s.., with threads P0, P1, ..., Pn each holding a private i)


• Shared memory strategy: static int s = 0; n  1 – small number p << n=size(A) processors f ( A [ i ])  Thread 1 Thread 2 – attached to single memory i  0 • Parallel Decomposition: for i = 0, n/2-1 for i = n/2, n-1 – Each evaluation and each partial sum is a task. s = s + f(A[i]) s = s + f(A[i]) • Assign n/p numbers to each of p procs – Each computes independent “private” results and partial sum. – Collect the p partial sums and compute a global sum. • Problem is a race condition on variable s in the program •A race condition or data race occurs when: Two Classes of Data: - two processors (or two threads) access the same • Logically Shared variable, and at least one does a write. – The original n numbers, the global sum. - The accesses are concurrent (not synchronized) so • Logically Private they could happen simultaneously – The individual function evaluations. – What about the individual partial sums?
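A runnable pthreads version of this pseudocode, with f taken to be squaring; the two-thread split, the array contents, and all names here are illustrative. It exhibits exactly the race on s described above.

    #include <pthread.h>
    #include <stdio.h>

    #define N 8
    static int A[N] = {1, 2, 3, 4, 5, 6, 7, 8};   /* illustrative data */
    static int s = 0;                             /* shared, unprotected */

    static int f(int x) { return x * x; }         /* example: f is squaring */

    static void *worker(void *arg) {
        int lo = *(int *)arg;                     /* each thread sums half of A */
        for (int i = lo; i < lo + N / 2; i++)
            s = s + f(A[i]);                      /* racy read-modify-write of s */
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        int lo1 = 0, lo2 = N / 2;
        pthread_create(&t1, NULL, worker, &lo1);
        pthread_create(&t2, NULL, worker, &lo2);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("s = %d\n", s);                    /* may be wrong because of the race */
        return 0;
    }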


A Closer Look
• Assume A = [3,5], f is the square function, and s=0 initially
• For this program to work, s should be 34 at the end, but it may be 34, 9, or 25
• The atomic operations are reads and writes
  – Never see ½ of one number, but the += operation is not atomic
  – All computations happen in (private) registers

      static int s = 0;

      Thread 1                                  Thread 2
      ...                                       ...
      compute f(A[i]) and put in reg0     9     compute f(A[i]) and put in reg0     25
      reg1 = s                            0     reg1 = s                            0
      reg1 = reg1 + reg0                  9     reg1 = reg1 + reg0                  25
      s = reg1                            9     s = reg1                            25
      ...                                       ...

Improved Code for Sum
      static int s = 0;
      static lock lk;

      Thread 1                            Thread 2
      local_s1 = 0                        local_s2 = 0
      for i = 0, n/2-1                    for i = n/2, n-1
          local_s1 = local_s1 + f(A[i])       local_s2 = local_s2 + f(A[i])
      lock(lk);                           lock(lk);
      s = s + local_s1                    s = s + local_s2
      unlock(lk);                         unlock(lk);

• Since addition is associative, it’s OK to rearrange order
• Most computation is on private variables
  – Sharing frequency is also reduced, which might improve speed
  – But there is still a race condition on the update of shared s
  – The race condition can be fixed by adding locks (only one thread can hold a lock at a time; others wait for it); see the pthreads sketch below

What about Synchronization?
• All shared-memory programs need synchronization
• Barrier – global (/coordinated) synchronization
  – simple use of barriers -- all threads hit the same one
      work_on_my_subgrid();
      barrier;
      read_neighboring_values();
      barrier;
• Mutexes – mutual exclusion locks
  – threads are mostly independent and must access common data
      lock *l = alloc_and_init();   /* shared */
      lock(l);
      access data
      unlock(l);
• Need atomic operations bigger than loads/stores
  – Actually, Dijkstra’s algorithm can get by with only loads/stores, but this is quite complex (and doesn’t work under all circumstances)
  – Example: atomic swap, test-and-test-and-set
• Another option: Transactional memory
  – Hardware equivalent of optimistic concurrency
  – Some think that this is the answer to all parallel programming

Programming Model 2: Message Passing
• Program consists of a collection of named processes.
  – Usually fixed at program startup time
  – Thread of control plus local address space -- NO shared data.
  – Logically shared data is partitioned over local processes.
• Processes communicate by explicit send/receive pairs
  – Coordination is implicit in every communication event.
  – MPI (Message Passing Interface) is the most commonly used SW
(Figure: processes P0 ... Pn, each with a private memory holding its own s and i; P0 executes "send P1, s" while Pn executes "receive Pn, s"; the processes are connected only by a network)
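The pthreads version of “Improved Code for Sum” above: private partial sums plus one locked update of s, with pthread_mutex_t standing in for the slide’s lock lk (data and names again illustrative).

    #include <pthread.h>
    #include <stdio.h>

    #define N 8
    static int A[N] = {1, 2, 3, 4, 5, 6, 7, 8};   /* illustrative data */
    static int s = 0;
    static pthread_mutex_t lk = PTHREAD_MUTEX_INITIALIZER;

    static int f(int x) { return x * x; }

    static void *worker(void *arg) {
        int lo = *(int *)arg;
        int local_s = 0;                          /* private partial sum */
        for (int i = lo; i < lo + N / 2; i++)
            local_s += f(A[i]);
        pthread_mutex_lock(&lk);                  /* single protected update of s */
        s += local_s;
        pthread_mutex_unlock(&lk);
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        int lo1 = 0, lo2 = N / 2;
        pthread_create(&t1, NULL, worker, &lo1);
        pthread_create(&t2, NULL, worker, &lo2);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("s = %d\n", s);                    /* now always the correct sum */
        return 0;
    }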

Compute A[1]+A[2] on each processor
• First possible solution – what could go wrong?
      Processor 1                 Processor 2
      xlocal = A[1]               xlocal = A[2]
      send xlocal, proc2          send xlocal, proc1
      receive xremote, proc2      receive xremote, proc1
      s = xlocal + xremote        s = xlocal + xremote
• If send blocks until the matching receive is posted, both processors block in send and deadlock; if sends are buffered, the exchange completes, so the send/receive ordering and semantics matter

MPI – the de facto standard
• MPI has become the de facto standard for parallel computing using message passing
• Example: for(i=1;i ... (an MPI sketch of the two-processor exchange appears below)
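A minimal MPI sketch of that two-processor exchange; the tag, datatype, and use of MPI_Sendrecv (which avoids the deadlock of two blocking sends) are choices made here for illustration, not the slide’s code.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        if (nprocs != 2)                          /* this sketch assumes 2 ranks */
            MPI_Abort(MPI_COMM_WORLD, 1);

        double A[2] = {1.0, 2.0};                 /* illustrative data */
        double xlocal = A[rank], xremote;
        int other = 1 - rank;

        /* Combined send+receive avoids the deadlock of two blocking sends */
        MPI_Sendrecv(&xlocal, 1, MPI_DOUBLE, other, 0,
                     &xremote, 1, MPI_DOUBLE, other, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        printf("rank %d: s = %f\n", rank, xlocal + xremote);
        MPI_Finalize();
        return 0;
    }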

Example: Multidimensional Meshes and Tori
(Figure: a 2D grid, a 3D cube, and a 2D torus)
• n-dimensional array
  – N = k_{d-1} x ... x k_0 nodes
  – described by n-vector of coordinates (i_{n-1}, ..., i_0)
• n-dimensional k-ary mesh: N = k^n
  – k = N^(1/n)
  – described by n-vector of radix-k coordinates (see the C sketch below)
• n-dimensional k-ary torus (or k-ary n-cube)?

Links and Channels
(Figure: a transmitter turns the symbol stream "...ABC123" into a signal on the link; the receiver recovers the stream, shown as "...QR67", at the other end)
• transmitter converts stream of digital symbols into a signal that is driven down the link
• receiver converts it back
  – tran/rcv share physical protocol
• trans + link + rcv form a Channel for digital info flow between switches
• link-level protocol segments the stream of symbols into larger units: packets or messages (framing)
• node-level protocol embeds commands for the destination communication assist within the packet
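A small C sketch of the coordinate description for the k-ary mesh/torus above: converting a node number to its n-vector of radix-k coordinates and back (the routine names are illustrative).

    /* Node id <-> coordinate vector for an n-dimensional k-ary mesh or torus:
       node ids 0 .. k^n - 1 map to coordinates (i_{n-1}, ..., i_0), radix k. */
    void node_to_coords(int node, int k, int n, int coords[]) {
        for (int d = 0; d < n; d++) {
            coords[d] = node % k;          /* i_d is the d-th radix-k digit */
            node /= k;
        }
    }

    int coords_to_node(const int coords[], int k, int n) {
        int node = 0;
        for (int d = n - 1; d >= 0; d--)
            node = node * k + coords[d];
        return node;
    }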

3/8/2010 cs252-S10, Lecture 13 47 3/8/2010 cs252-S10, Lecture 13 48 Clock Synchronization? Conclusion • Receiver must be synchronized to transmitter • Vector is alternative model for exploiting ILP – To know when to latch data – If code is vectorizable, then simpler hardware, more energy efficient, and • Fully Synchronous better real-time model than Out-of-order machines – Same clock and phase: Isochronous – Design issues include number of lanes, number of functional units, number of vector registers, length of vector registers, exception handling, – Same clock, different phase: Mesochronous conditional operations » High-speed serial links work this way » Use of encoding (8B/10B) to ensure sufficient high-frequency • VLIW: Explicitly Parallel component for clock recovery – Trace Scheduling: Select primary “trace” to compress + fixup code • Fully Asynchronous • Itanium/EPIC/VLIW is not a breakthrough in ILP – No clock: Request/Ack signals – If anything, it is as complex or more so than a dynamic processor – Different clock: Need some sort of clock recovery? • Multiprocessing – Multiple processors connect together Transmitter Asserts Data – It is all about communication! Data • Programming Models: – Shared Memory – Message Passing Req • Networking and Communication Interfaces – Fundamental aspect of multiprocessing Ack t0 t1 t2 t3 t4 t5 3/8/2010 cs252-S10, Lecture 13 49 3/8/2010 cs252-S10, Lecture 13 50