CS252 Graduate Architecture
Lecture 11: Vector Processing

John Kubiatowicz
Electrical Engineering and Computer Sciences
University of California, Berkeley

http://www.eecs.berkeley.edu/~kubitron/cs252
http://www-inst.eecs.berkeley.edu/~cs252

Review: Simultaneous Multi-threading ...

[Figure: issue-slot diagrams comparing one thread vs. two threads on a machine with 8 units, cycles 1-9. M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes]

2/28/2007 cs252-S07, Lecture 11 2

Review: Multithreaded Categories

[Figure: issue slots over time (processor cycles) for Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing, and Simultaneous Multithreading; Threads 1-5 shown in distinct shades, plus idle slots]

Design Challenges in SMT

• Since SMT makes sense only with a fine-grained implementation, what is the impact of fine-grained scheduling on single-thread performance?
  – A preferred-thread approach sacrifices neither throughput nor single-thread performance?
  – Unfortunately, with a preferred thread, the processor is likely to sacrifice some throughput when the preferred thread stalls
• Larger register file needed to hold multiple contexts
• Maintaining clock cycle time, especially in:
  – Instruction issue - more candidate instructions need to be considered
  – Instruction completion - choosing which instructions to commit may be challenging
• Ensuring that cache and TLB conflicts generated by SMT do not degrade performance

Power 4

Single-threaded predecessor to Power 5. 8 execution units in out-of-order engine, each may issue an instruction each cycle.

[Figure: Power 4 pipeline]

Power 5

[Figure: Power 5 pipeline with SMT: 2 fetch (PC), 2 initial decodes, 2 commits (architected register sets)]

Power 5 data flow ...

[Figure: Power 5 instruction data flow]

Why only 2 threads? With 4, one of the shared resources (physical registers, cache, memory bandwidth) would be prone to bottleneck.

Power 5 thread performance ...

Relative priority of each thread controllable in hardware.

[Figure: performance of each thread vs. relative priority setting]

For balanced operation, both threads run slower than if they "owned" the machine.

Changes in Power 5 to support SMT

• Increased associativity of L1 instruction cache and the instruction address translation buffers
• Added per-thread load and store queues
• Increased size of the L2 (1.92 vs. 1.44 MB) and L3 caches
• Added separate instruction prefetch and buffering per thread
• Increased the number of virtual registers from 152 to 240
• Increased the size of several issue queues
• The Power5 core is about 24% larger than the Power4 core because of the addition of SMT support

Initial Performance of SMT

• Pentium 4 Extreme SMT yields 1.01 speedup for SPECint_rate benchmark and 1.07 for SPECfp_rate
  – Pentium 4 is dual-threaded SMT
  – SPECRate requires that each SPEC benchmark be run against a vendor-selected number of copies of the same benchmark
• Running on Pentium 4 each of 26 SPEC benchmarks paired with every other (26² runs): speed-ups from 0.90 to 1.58; average was 1.20
• Power 5, 8-processor server 1.23 faster for SPECint_rate with SMT, 1.16 faster for SPECfp_rate
• Power 5 running 2 copies of each app: speedup between 0.89 and 1.41
  – Most gained some
  – Fl.Pt. apps had most cache conflicts and least gains

Head to Head ILP competition

Processor               | Microarchitecture                                        | Fetch/Issue/Execute | FU          | Clock (GHz) | Transistors, Die size | Power
Intel Pentium 4 Extreme | Speculative dynamically scheduled; deeply pipelined; SMT | 3/3/4               | 7 int. 1 FP | 3.8         | 125 M, 122 mm²        | 115 W
AMD Athlon 64 FX-57     | Speculative dynamically scheduled                        | 3/3/4               | 6 int. 3 FP | 2.8         | 114 M, 115 mm²        | 104 W
IBM Power5 (1 CPU only) | Speculative dynamically scheduled; SMT; 2 CPU cores/chip | 8/4/8               | 6 int. 2 FP | 1.9         | 200 M, 300 mm² (est.) | 80 W (est.)
Intel Itanium 2         | Statically scheduled VLIW-style                          | 6/5/11              | 9 int. 2 FP | 1.6         | 592 M, 423 mm²        | 130 W

Performance on SPECint2000

[Figure: SPEC ratio bar chart (0-3500) for Itanium 2, Pentium 4, AMD Athlon 64, and Power 5 on gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, twolf]

Performance on SPECfp2000

[Figure: SPEC ratio bar chart (0-14000) for Itanium 2, Pentium 4, AMD Athlon 64, and Power 5 on wupwise, swim, mgrid, applu, mesa, galgel, art, equake, facerec, ammp, lucas, fma3d, sixtrack, apsi]

Normalized Performance: Efficiency

Rank (1 = best) | Itanium 2 | Pentium 4 | Athlon 64 | Power 5
Int/Trans       |     4     |     2     |     1     |    3
FP/Trans        |     4     |     2     |     1     |    3
Int/area        |     4     |     2     |     1     |    3
FP/area         |     4     |     2     |     1     |    3
Int/Watt        |     4     |     3     |     1     |    2
FP/Watt         |     2     |     4     |     3     |    1

[Figure: normalized performance bar chart (0-35): SPECInt and SPECFP per M transistors, per mm², and per Watt]


No Silver Bullet for ILP

• No obvious overall leader in performance
• The AMD Athlon leads on SPECInt performance, followed by the Pentium 4, Itanium 2, and Power5
• Itanium 2 and Power5, which perform similarly on SPECFP, clearly dominate the Athlon and Pentium 4 on SPECFP
• Itanium 2 is the most inefficient processor both for Fl. Pt. and integer code for all but one efficiency measure (SPECFP/Watt)
• Athlon and Pentium 4 both make good use of transistors and area in terms of efficiency
• IBM Power5 is the most effective user of energy on SPECFP and essentially tied on SPECINT

Limits to ILP

• Doubling issue rates above today's 3-6 instructions per clock, say to 6 to 12 instructions, probably requires a processor to:
  – issue 3 or 4 data memory accesses per cycle,
  – resolve 2 or 3 branches per cycle,
  – rename and access more than 20 registers per cycle, and
  – fetch 12 to 24 instructions per cycle.
• The complexities of implementing these capabilities are likely to mean sacrifices in the maximum clock rate
  – E.g., the widest-issue processor is the Itanium 2, but it also has the slowest clock rate, despite the fact that it consumes the most power!

Limits to ILP

• Most techniques for increasing performance increase power consumption
• The key question is whether a technique is energy efficient: does it increase power consumption faster than it increases performance?
• Multiple-issue processor techniques all are energy inefficient:
  1. Issuing multiple instructions incurs some overhead in logic that grows faster than the issue rate grows
  2. Growing gap between peak issue rates and sustained performance
• Number of transistors switching = f(peak issue rate), and performance = f(sustained rate); growing gap between peak and sustained performance ⇒ increasing energy per unit of performance

Administrivia

• Exam: Wednesday 3/14
  Location: TBA
  TIME: 5:30 - 8:30
• This info is on the Lecture page (has been)
• Meet at LaVal's afterwards for Pizza and Beverages
• CS252 Project proposal due by Monday 3/5
  – Need two people/project (although can justify three for right project)
  – Complete research project in 8 weeks
    » Typically investigate hypothesis by building an artifact and measuring it against a "base case"
    » Generate conference-length paper / give oral presentation
    » Often can lead to an actual publication.

Supercomputers

Definition of a supercomputer:
• Fastest machine in world at given task
• A device to turn a compute-bound problem into an I/O-bound problem
• Any machine costing $30M+
• Any machine designed by Seymour Cray

CDC 6600 (Cray, 1964) regarded as first supercomputer

Supercomputer Applications

Typical application areas:
• Military research (nuclear weapons, cryptography)
• Scientific research
• Weather forecasting
• Oil exploration
• Industrial design (car crash simulation)

All involve huge computations on large data sets

In 70s-80s, Supercomputer ≡ Vector Machine

Vector Supercomputers

Epitomized by Cray-1, 1976:

Scalar Unit + Vector Extensions
• Load/Store Architecture
• Vector Registers
• Vector Instructions
• Hardwired Control
• Highly Pipelined Functional Units
• Interleaved Memory System
• No Data Caches
• No Virtual Memory

Cray-1 (1976)

[Photo: the Cray-1]

Cray-1 (1976)

[Block diagram: 8 vector registers (V0-V7) of 64 elements each, with vector length and vector mask registers; scalar registers S0-S7 backed by 64 T registers; address registers A0-A7 backed by 64 B registers; functional units for FP Add, FP Mul, FP Recip, Int Add, Int Logic, Int Shift, Pop Cnt, Addr Add, Addr Mul; single-port memory of 16 banks of 64-bit words with 8-bit SECDED; 80 MW/sec data load/store; 320 MW/sec instruction buffer refill; 4 instruction buffers (64-bit x 16); memory bank cycle 50 ns; processor cycle 12.5 ns (80 MHz)]

Vector Programming Model

[Figure: vector registers v0-v15, each holding elements [0], [1], ..., [VLRMAX-1]; Vector Length Register VLR controls how many elements an instruction processes; a vector arithmetic instruction such as ADDV v3, v1, v2 adds v1 and v2 element-wise into v3; vector load/store instructions such as LV v1, r1, r2 move a vector between memory and a vector register, with base address in r1 and stride in r2]
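The strided load in this programming model can be sketched in plain C. This is a hypothetical model of LV v1, r1, r2 (base in r1, stride in r2) showing only the access pattern, not real Cray semantics; here the stride is counted in elements rather than words.

```c
/* Hypothetical C model of a strided vector load, LV v1, r1, r2:
 * load vlr elements starting at base, advancing by stride elements
 * per access. On the real Cray-1, stride is a register value in
 * words; this sketch just illustrates the access pattern. */
void lv_strided(double *v, const double *base, int stride, int vlr) {
    for (int i = 0; i < vlr; i++)
        v[i] = base[i * stride];   /* element i comes from base + i*stride */
}
```

With stride 1 this is the unit-stride load of the earlier examples; a stride equal to a matrix row length walks down a column.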

Vector Code Example

# C code
for (i=0; i<64; i++)
  C[i] = A[i] + B[i];

# Scalar Code
      LI R4, 64
loop: L.D F0, 0(R1)
      L.D F2, 0(R2)
      ADD.D F4, F2, F0
      S.D F4, 0(R3)
      DADDIU R1, 8
      DADDIU R2, 8
      DADDIU R3, 8
      DSUBIU R4, 1
      BNEZ R4, loop

# Vector Code
      LI VLR, 64
      LV V1, R1
      LV V2, R2
      ADDV.D V3, V1, V2
      SV V3, R3

Vector Instruction Set Advantages

• Compact
  – one short instruction encodes N operations
• Expressive, tells hardware that these N operations:
  – are independent
  – use the same functional unit
  – access disjoint registers
  – access registers in the same pattern as previous instructions
  – access a contiguous block of memory (unit-stride load/store)
  – access memory in a known pattern (strided load/store)
• Scalable
  – can run same object code on more parallel pipelines or lanes
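The five-instruction vector version can be modeled in C to make the semantics concrete. The helpers LV/ADDV/SV below are a hypothetical software model of the vector instructions, not real intrinsics; each one stands for a single instruction that performs VLR independent element operations.

```c
#define MVL 64  /* maximum vector length, as on a Cray-1-style machine */

/* Hypothetical model of one vector register */
typedef struct { double e[MVL]; } vreg;

static void LV(vreg *v, const double *base, int vlr) {
    for (int i = 0; i < vlr; i++) v->e[i] = base[i];      /* unit-stride load */
}
static void ADDV(vreg *d, const vreg *a, const vreg *b, int vlr) {
    for (int i = 0; i < vlr; i++) d->e[i] = a->e[i] + b->e[i];
}
static void SV(const vreg *v, double *base, int vlr) {
    for (int i = 0; i < vlr; i++) base[i] = v->e[i];      /* unit-stride store */
}

/* C[i] = A[i] + B[i] for 64 elements: four "instructions" replace the
 * 64-iteration scalar loop with its per-iteration branch and bookkeeping. */
void vec_add64(const double *A, const double *B, double *C) {
    vreg v1, v2, v3;
    LV(&v1, A, MVL);
    LV(&v2, B, MVL);
    ADDV(&v3, &v1, &v2, MVL);
    SV(&v3, C, MVL);
}
```

The compactness claim is visible here: one ADDV encodes 64 independent adds that the hardware may spread over any number of lanes.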


Vector Arithmetic Execution

• Use deep pipeline (=> fast clock) to execute element operations
• Simplifies control of deep pipeline because elements in vector are independent (=> no hazards!)

[Figure: six-stage multiply pipeline computing V3 <- V1 * V2, one element pair entering per cycle]

Vector Memory Subsystem

• Cray-1: 16 banks, 4-cycle bank busy time, 12-cycle latency
• Bank busy time: cycles between accesses to the same bank

[Figure: base and stride feed an address generator that spreads element accesses across memory banks 0-F and returns data to the vector registers]

Vector Instruction Execution

ADDV C, A, B

[Figure: execution using one pipelined functional unit (one C[i] completed per cycle) versus four pipelined functional units (elements 0,4,8,... / 1,5,9,... / 2,6,10,... / 3,7,11,... in parallel, four results per cycle)]

Vector Unit Structure

[Figure: the vector unit is divided into lanes; each lane holds one pipeline of each functional unit plus a vertical slice of the vector register file (lane 0 holds elements 0,4,8,..., lane 1 holds elements 1,5,9,..., and so on), and all lanes connect to the memory subsystem]


T0 Vector (1995)

[Die photo: vector register elements striped over lanes]

Vector Memory-Memory versus Vector Register Machines

• Vector memory-memory instructions hold all vector operands in main memory
• The first vector machines, CDC Star-100 ('73) and TI ASC ('71), were memory-memory machines
• Cray-1 ('76) was first vector register machine

Example source code:
for (i=0; i<N; i++)
  C[i] = A[i] + B[i];

Vector memory-memory code:
ADDV C, A, B

Vector Stripmining

Problem: Vector registers have finite length
Solution: Break loops into pieces that fit into vector registers, "Stripmining"

for (i=0; i<N; i++) ...
    ANDI R1, N, 63   # N mod 64
    ...

Vector Instruction Parallelism

Can overlap execution of multiple vector instructions
– example machine has 32 elements per vector register and 8 lanes
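The stripmining pattern can be sketched in C. This is a hypothetical scalar model, assuming a 64-element maximum vector length as in the slide's ANDI R1, N, 63 (N mod 64): the first strip handles the leftover N mod 64 elements, so every later strip is a full vector.

```c
#define MVL 64  /* maximum vector length, assumed from "N mod 64" */

/* Stripmined C[i] = A[i] + B[i] over an arbitrary N. Each inner loop
 * stands for one vector instruction's worth of work at length vlr. */
void stripmine_add(const double *A, const double *B, double *C, int N) {
    int i = 0;
    int vlr = N % MVL;            /* first, possibly short, strip */
    if (vlr == 0) vlr = MVL;      /* N is a multiple of MVL */
    while (i < N) {
        for (int j = 0; j < vlr; j++)   /* one strip: set VLR, LV/ADDV/SV */
            C[i + j] = A[i + j] + B[i + j];
        i += vlr;
        vlr = MVL;                /* all remaining strips are full length */
    }
}
```

Handling the short strip first means the VLR register is written only twice, no matter how large N is.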

Vector Chaining

LV   v1
MULV v3, v1, v2
ADDV v5, v3, v4

• With chaining, can start a dependent instruction as soon as its first result appears

[Figure: element streams chain over time from the load unit (v1) through the multiplier (v3) to the adder (v5), overlapping instead of waiting for each whole vector to complete]
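The benefit of chaining can be shown with a back-of-envelope timing model. The latencies below are illustrative assumptions, not Cray-1 figures: without chaining, each dependent vector instruction waits for the previous one to drain all n elements; with chaining, it starts as soon as the first element arrives, so only the pipeline latencies serialize.

```c
/* Idealized cycle counts for k dependent vector instructions of
 * length n, with per-unit pipeline latencies lat[0..k-1]. */
int unchained_cycles(int n, const int *lat, int k) {
    int t = 0;
    for (int i = 0; i < k; i++)
        t += lat[i] + n;          /* each unit drains its whole vector */
    return t;
}
int chained_cycles(int n, const int *lat, int k) {
    int t = 0;
    for (int i = 0; i < k; i++)
        t += lat[i];              /* latencies add, element streams overlap */
    return t + n;                 /* one n-element stream through the chain */
}
```

For example, three dependent instructions (load, multiply, add) with assumed latencies 12, 7, and 6 cycles on a 64-element vector take 217 cycles unchained but only 89 chained.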


Vector Startup

Two components of vector startup penalty:
– functional unit latency (time through pipeline)
– dead time or recovery time (time before another vector instruction can start down pipeline)

[Figure: R-X-X-X-W pipeline diagram: the elements of the first vector instruction stream through the functional unit, then 4 cycles of dead time pass before the second vector instruction can enter the pipeline]

Dead Time and Short Vectors

[Figure:
– T0, eight lanes: no dead time, 100% efficiency with 8-element vectors
– Cray C90, two lanes: 4-cycle dead time, 64 cycles active, so maximum efficiency 64/68 ≈ 94% with 128-element vectors]

Vector Scatter/Gather

Want to vectorize loops with indirect accesses (gather):
for (i=0; i<N; i++) ...

Scatter example:
for (i=0; i<N; i++) ...
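The two indirect access patterns can be sketched in C. These hypothetical functions model what an indexed vector load (gather) and indexed vector store (scatter) do with an index vector D; they are illustrations of the access pattern, not an ISA.

```c
/* Gather (indexed load): dst[i] = src[D[i]] -- read through indices */
void gather(double *dst, const double *src, const int *D, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = src[D[i]];
}

/* Scatter (indexed store): dst[D[i]] = src[i] -- write through indices */
void scatter(double *dst, const double *src, const int *D, int n) {
    for (int i = 0; i < n; i++)
        dst[D[i]] = src[i];
}
```

A hardware scatter must still respect ordering when D contains duplicate indices, which is one reason scatter/gather units are harder to build than unit-stride ports.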


Vector Conditional Execution

Problem: Want to vectorize loops with conditional code:
for (i=0; i<N; i++)
  if (A[i]>0) then A[i] = B[i];

Solution: Add vector mask (or flag) registers
– vector version of predicate registers, 1 bit per element
...and maskable vector instructions
– vector operation becomes NOP at elements where mask bit is clear

Code example:
CVM              # Turn on all elements
LV vA, rA        # Load entire A vector
SGTVS.D vA, F0   # Set bits in mask register where A>0
LV vA, rB        # Load B vector into A under mask
SV vA, rA        # Store A back to memory under mask

Masked Vector Instructions

[Figure:
– Simple implementation: execute all N operations, but turn off the result write (write enable) where the mask bit M[i] is clear
– Density-time implementation: scan the mask vector and only execute the elements with non-zero mask bits]
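The mask-register sequence can be modeled in C. This hypothetical sketch separates the two steps the slide's code performs: a compare that fills a mask (SGTVS.D), then an operation that is a NOP wherever the mask bit is clear (load/store under mask). It assumes n is at most the 64-element vector length.

```c
#define MVL 64  /* maximum vector length; assumes n <= MVL */

/* Vectorized "if (A[i] > 0) A[i] = B[i];" using an explicit mask. */
void masked_update(double *A, const double *B, int n) {
    int mask[MVL];
    for (int i = 0; i < n; i++)        /* compare step: set mask where A>0 */
        mask[i] = (A[i] > 0.0);
    for (int i = 0; i < n; i++)        /* masked op: NOP where bit is clear */
        if (mask[i])
            A[i] = B[i];
}
```

In the simple hardware implementation both loops still "run" over all n elements and only the write is suppressed; the density-time implementation skips the false elements entirely.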

Compress/Expand Operations

• Compress packs non-masked elements from one vector register contiguously at start of destination vector register
• Expand performs the inverse operation
• Used for density-time conditionals and also for general selection operations

Vector Reductions

Problem: Loop-carried dependence on reduction variables
sum = 0;
for (i=0; i<N; i++)
  sum += A[i];   # loop-carried dependence on sum

Solution: re-associate the operations if possible: accumulate VL partial sums in a vector register, then repeatedly halve VL and add the two halves while VL > 1.
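The re-association trick can be sketched in C. This hypothetical scalar model assumes a vector length of 64: the first loop builds 64 partial sums (each a vector add in hardware), and the second loop is the halving tree, a vector add at half the previous length on each step.

```c
#define MVL 64  /* vector length assumed for the partial-sum register */

/* Reduction sum = A[0] + ... + A[N-1] without a loop-carried scalar
 * dependence: 64 independent partial sums, then a log2(64)-step tree. */
double vec_sum(const double *A, int N) {
    double partial[MVL] = {0};
    for (int i = 0; i < N; i++)           /* strip accumulation phase */
        partial[i % MVL] += A[i];
    for (int len = MVL / 2; len >= 1; len /= 2)
        for (int i = 0; i < len; i++)     /* add upper half into lower half */
            partial[i] += partial[i + len];
    return partial[0];
}
```

Re-association changes the order of floating-point additions, so results can differ in the last bits from the scalar loop; for integer sums the transformation is exact.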

Novel Matrix Multiply Solution

• Consider the following:
/* Multiply a[m][k] * b[k][n] to get c[m][n] */
for (i=1; i< ...

Optimized Vector Example


MMX Instructions

• Move 32b, 64b
• Add, Subtract in parallel: 8 8b, 4 16b, 2 32b
  – opt. signed/unsigned saturate (set to max) if overflow
• Shifts (sll, srl, sra), And, And Not, Or, Xor in parallel: 8 8b, 4 16b, 2 32b
• Multiply, Multiply-Add in parallel: 4 16b
• Compare =, > in parallel: 8 8b, 4 16b, 2 32b
  – sets field to 0s (false) or 1s (true); removes branches
• Pack/Unpack
  – Convert 32b <–> 16b, 16b <–> 8b
  – Pack saturates (set to max) if number is too large

Vector Summary

• Vector is alternative model for exploiting ILP
• If code is vectorizable, then simpler hardware, more energy efficient, and better real-time model than out-of-order machines
• Design issues include number of lanes, number of functional units, number of vector registers, length of vector registers, exception handling, conditional operations
• Fundamental design issue is memory bandwidth
  – With virtual address translation and caching
• Will multimedia popularity revive vector architectures?
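The saturating arithmetic mentioned above is easy to show for one element. This sketch models a single unsigned 8-bit saturating add: on overflow the result clamps to 255 instead of wrapping, which is what you want for pixel data; an actual MMX parallel add performs eight of these at once on a 64-bit register.

```c
#include <stdint.h>

/* One lane of a parallel unsigned-saturate 8-bit add: clamp to 255
 * on overflow rather than wrapping around. */
uint8_t add_u8_saturate(uint8_t a, uint8_t b) {
    unsigned s = (unsigned)a + (unsigned)b;   /* widen to avoid wraparound */
    return (uint8_t)(s > 255u ? 255u : s);    /* saturate (set to max) */
}
```

With wrapping arithmetic, brightening a nearly-white pixel would turn it dark; saturation keeps it white, which is why the multimedia extensions make it an option.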
