CS252 Graduate Architecture
Lecture 11: Vector Processing

John Kubiatowicz
Electrical Engineering and Computer Sciences
University of California, Berkeley

http://www.eecs.berkeley.edu/~kubitron/cs252
http://www-inst.eecs.berkeley.edu/~cs252

Review: Simultaneous Multi-threading ...

[Figure: issue-slot diagrams comparing one thread vs. two threads on a machine with 8 units, cycles 1-9. M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes]

2/28/2007 cs252-S07, Lecture 11 2

Review: Multithreaded Categories

[Figure: issue slots over time (processor cycles) for Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing, and Simultaneous Multithreading; Threads 1-5 shown in distinct shades, plus idle slots]

Design Challenges in SMT

• Since SMT makes sense only with a fine-grained implementation, what is the impact of fine-grained scheduling on single-thread performance?
  – A preferred-thread approach sacrifices neither throughput nor single-thread performance?
  – Unfortunately, with a preferred thread, the processor is likely to sacrifice some throughput when the preferred thread stalls
• Larger register file needed to hold multiple contexts
• Maintaining clock cycle time, especially in:
  – Instruction issue - more candidate instructions need to be considered
  – Instruction completion - choosing which instructions to commit may be challenging
• Ensuring that cache and TLB conflicts generated by SMT do not degrade performance

Power 4

Single-threaded predecessor to Power 5. 8 execution units in out-of-order engine, each may issue an instruction each cycle.

[Figure: Power 4 pipeline]

Power 5

[Figure: Power 5 pipeline with SMT: 2 fetch (PC), 2 initial decodes, 2 commits (architected register sets)]

Power 5 data flow ...

[Figure: Power 5 instruction data flow]

Why only 2 threads? With 4, one of the shared resources (physical registers, cache, memory bandwidth) would be prone to bottleneck.

Power 5 thread performance ...

Relative priority of each thread controllable in hardware.

[Figure: performance of each thread vs. relative priority setting]

For balanced operation, both threads run slower than if they "owned" the machine.

Changes in Power 5 to support SMT

• Increased associativity of L1 instruction cache and the instruction address translation buffers
• Added per-thread load and store queues
• Increased size of the L2 (1.92 vs. 1.44 MB) and L3 caches
• Added separate instruction prefetch and buffering per thread
• Increased the number of virtual registers from 152 to 240
• Increased the size of several issue queues
• The Power5 core is about 24% larger than the Power4 core because of the addition of SMT support

Initial Performance of SMT

• Pentium 4 Extreme SMT yields 1.01 speedup for SPECint_rate benchmark and 1.07 for SPECfp_rate
  – Pentium 4 is dual-threaded SMT
  – SPECRate requires that each SPEC benchmark be run against a vendor-selected number of copies of the same benchmark
• Running on Pentium 4 each of 26 SPEC benchmarks paired with every other (26² runs): speed-ups from 0.90 to 1.58; average was 1.20
• Power 5, 8-processor server 1.23 faster for SPECint_rate with SMT, 1.16 faster for SPECfp_rate
• Power 5 running 2 copies of each app: speedup between 0.89 and 1.41
  – Most gained some
  – Fl.Pt. apps had most cache conflicts and least gains

Head to Head ILP competition

Processor               | Microarchitecture                                        | Fetch/Issue/Execute | FU          | Clock (GHz) | Transistors, Die size | Power
Intel Pentium 4 Extreme | Speculative dynamically scheduled; deeply pipelined; SMT | 3/3/4               | 7 int. 1 FP | 3.8         | 125 M, 122 mm²        | 115 W
AMD Athlon 64 FX-57     | Speculative dynamically scheduled                        | 3/3/4               | 6 int. 3 FP | 2.8         | 114 M, 115 mm²        | 104 W
IBM Power5 (1 CPU only) | Speculative dynamically scheduled; SMT; 2 CPU cores/chip | 8/4/8               | 6 int. 2 FP | 1.9         | 200 M, 300 mm² (est.) | 80 W (est.)
Intel Itanium 2         | Statically scheduled VLIW-style                          | 6/5/11              | 9 int. 2 FP | 1.6         | 592 M, 423 mm²        | 130 W

Performance on SPECint2000

[Figure: SPEC ratio bar chart (0-3500) for Itanium 2, Pentium 4, AMD Athlon 64, and Power 5 on gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, twolf]

Performance on SPECfp2000

[Figure: SPEC ratio bar chart (0-14000) for Itanium 2, Pentium 4, AMD Athlon 64, and Power 5 on wupwise, swim, mgrid, applu, mesa, galgel, art, equake, facerec, ammp, lucas, fma3d, sixtrack, apsi]

Normalized Performance: Efficiency

Rank (1 = best) | Itanium 2 | Pentium 4 | Athlon 64 | Power 5
Int/Trans       |     4     |     2     |     1     |    3
FP/Trans        |     4     |     2     |     1     |    3
Int/area        |     4     |     2     |     1     |    3
FP/area         |     4     |     2     |     1     |    3
Int/Watt        |     4     |     3     |     1     |    2
FP/Watt         |     2     |     4     |     3     |    1

[Figure: normalized performance bar chart (0-35): SPECInt and SPECFP per M transistors, per mm², and per Watt]


No Silver Bullet for ILP

• No obvious overall leader in performance
• The AMD Athlon leads on SPECInt performance, followed by the Pentium 4, Itanium 2, and Power5
• Itanium 2 and Power5, which perform similarly on SPECFP, clearly dominate the Athlon and Pentium 4 on SPECFP
• Itanium 2 is the most inefficient processor both for Fl. Pt. and integer code for all but one efficiency measure (SPECFP/Watt)
• Athlon and Pentium 4 both make good use of transistors and area in terms of efficiency
• IBM Power5 is the most effective user of energy on SPECFP and essentially tied on SPECINT

Limits to ILP

• Doubling issue rates above today's 3-6 instructions per clock, say to 6 to 12 instructions, probably requires a processor to:
  – issue 3 or 4 data memory accesses per cycle,
  – resolve 2 or 3 branches per cycle,
  – rename and access more than 20 registers per cycle, and
  – fetch 12 to 24 instructions per cycle.
• The complexities of implementing these capabilities are likely to mean sacrifices in the maximum clock rate
  – E.g., the widest-issue processor is the Itanium 2, but it also has the slowest clock rate, despite the fact that it consumes the most power!

Limits to ILP

• Most techniques for increasing performance increase power consumption
• The key question is whether a technique is energy efficient: does it increase power consumption faster than it increases performance?
• Multiple-issue processor techniques all are energy inefficient:
  1. Issuing multiple instructions incurs some overhead in logic that grows faster than the issue rate grows
  2. Growing gap between peak issue rates and sustained performance
• Number of transistors switching = f(peak issue rate), and performance = f(sustained rate); growing gap between peak and sustained performance ⇒ increasing energy per unit of performance

Administrivia

• Exam: Wednesday 3/14
  Location: TBA
  TIME: 5:30 - 8:30
• This info is on the Lecture page (has been)
• Meet at LaVal's afterwards for Pizza and Beverages
• CS252 Project proposal due by Monday 3/5
  – Need two people/project (although can justify three for right project)
  – Complete research project in 8 weeks
    » Typically investigate hypothesis by building an artifact and measuring it against a "base case"
    » Generate conference-length paper / give oral presentation
    » Often can lead to an actual publication.

Supercomputers

Definition of a supercomputer:
• Fastest machine in world at given task
• A device to turn a compute-bound problem into an I/O-bound problem
• Any machine costing $30M+
• Any machine designed by Seymour Cray

CDC 6600 (Cray, 1964) regarded as first supercomputer

Supercomputer Applications

Typical application areas:
• Military research (nuclear weapons, cryptography)
• Scientific research
• Weather forecasting
• Oil exploration
• Industrial design (car crash simulation)

All involve huge computations on large data sets

In 70s-80s, Supercomputer ≡ Vector Machine

Vector Supercomputers

Epitomized by Cray-1, 1976:

Scalar Unit + Vector Extensions
• Load/Store Architecture
• Vector Registers
• Vector Instructions
• Hardwired Control
• Highly Pipelined Functional Units
• Interleaved Memory System
• No Data Caches
• No Virtual Memory

Cray-1 (1976)

[Photo: the Cray-1]

Cray-1 (1976)

[Block diagram: 8 vector registers (V0-V7) of 64 elements each, with vector length and vector mask registers; scalar registers S0-S7 backed by 64 T registers; address registers A0-A7 backed by 64 B registers; functional units for FP Add, FP Mul, FP Recip, Int Add, Int Logic, Int Shift, Pop Cnt, Addr Add, Addr Mul; single-port memory of 16 banks of 64-bit words with 8-bit SECDED; 80 MW/sec data load/store; 320 MW/sec instruction buffer refill; 4 instruction buffers (64-bit x 16); memory bank cycle 50 ns; processor cycle 12.5 ns (80 MHz)]

Vector Programming Model

[Figure: vector registers v0-v15, each holding elements [0], [1], ..., [VLRMAX-1]; Vector Length Register VLR controls how many elements an instruction processes; a vector arithmetic instruction such as ADDV v3, v1, v2 adds v1 and v2 element-wise into v3; vector load/store instructions such as LV v1, r1, r2 move a vector between memory and a vector register, with base address in r1 and stride in r2]
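The strided load in this programming model can be sketched in plain C. This is a hypothetical model of LV v1, r1, r2 (base in r1, stride in r2) showing only the access pattern, not real Cray semantics; here the stride is counted in elements rather than words.

```c
/* Hypothetical C model of a strided vector load, LV v1, r1, r2:
 * load vlr elements starting at base, advancing by stride elements
 * per access. On the real Cray-1, stride is a register value in
 * words; this sketch just illustrates the access pattern. */
void lv_strided(double *v, const double *base, int stride, int vlr) {
    for (int i = 0; i < vlr; i++)
        v[i] = base[i * stride];   /* element i comes from base + i*stride */
}
```

With stride 1 this is the unit-stride load of the earlier examples; a stride equal to a matrix row length walks down a column.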

Vector Code Example

# C code
for (i=0; i<64; i++)
  C[i] = A[i] + B[i];

# Scalar Code
      LI R4, 64
loop: L.D F0, 0(R1)
      L.D F2, 0(R2)
      ADD.D F4, F2, F0
      S.D F4, 0(R3)
      DADDIU R1, 8
      DADDIU R2, 8
      DADDIU R3, 8
      DSUBIU R4, 1
      BNEZ R4, loop

# Vector Code
      LI VLR, 64
      LV V1, R1
      LV V2, R2
      ADDV.D V3, V1, V2
      SV V3, R3

Vector Instruction Set Advantages

• Compact
  – one short instruction encodes N operations
• Expressive, tells hardware that these N operations:
  – are independent
  – use the same functional unit
  – access disjoint registers
  – access registers in the same pattern as previous instructions
  – access a contiguous block of memory (unit-stride load/store)
  – access memory in a known pattern (strided load/store)
• Scalable
  – can run same object code on more parallel pipelines or lanes
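The five-instruction vector version can be modeled in C to make the semantics concrete. The helpers LV/ADDV/SV below are a hypothetical software model of the vector instructions, not real intrinsics; each one stands for a single instruction that performs VLR independent element operations.

```c
#define MVL 64  /* maximum vector length, as on a Cray-1-style machine */

/* Hypothetical model of one vector register */
typedef struct { double e[MVL]; } vreg;

static void LV(vreg *v, const double *base, int vlr) {
    for (int i = 0; i < vlr; i++) v->e[i] = base[i];      /* unit-stride load */
}
static void ADDV(vreg *d, const vreg *a, const vreg *b, int vlr) {
    for (int i = 0; i < vlr; i++) d->e[i] = a->e[i] + b->e[i];
}
static void SV(const vreg *v, double *base, int vlr) {
    for (int i = 0; i < vlr; i++) base[i] = v->e[i];      /* unit-stride store */
}

/* C[i] = A[i] + B[i] for 64 elements: four "instructions" replace the
 * 64-iteration scalar loop with its per-iteration branch and bookkeeping. */
void vec_add64(const double *A, const double *B, double *C) {
    vreg v1, v2, v3;
    LV(&v1, A, MVL);
    LV(&v2, B, MVL);
    ADDV(&v3, &v1, &v2, MVL);
    SV(&v3, C, MVL);
}
```

The compactness claim is visible here: one ADDV encodes 64 independent adds that the hardware may spread over any number of lanes.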


Vector Arithmetic Execution

• Use deep pipeline (=> fast clock) to execute element operations
• Simplifies control of deep pipeline because elements in vector are independent (=> no hazards!)

[Figure: six-stage multiply pipeline computing V3 <- V1 * V2, one element pair entering per cycle]

Vector Memory Subsystem

• Cray-1: 16 banks, 4-cycle bank busy time, 12-cycle latency
• Bank busy time: cycles between accesses to the same bank

[Figure: base and stride feed an address generator that spreads element accesses across memory banks 0-F and returns data to the vector registers]

Vector Instruction Execution

ADDV C, A, B

[Figure: execution using one pipelined functional unit (one C[i] completed per cycle) versus four pipelined functional units (elements 0,4,8,... / 1,5,9,... / 2,6,10,... / 3,7,11,... in parallel, four results per cycle)]

Vector Unit Structure

[Figure: the vector unit is divided into lanes; each lane holds one pipeline of each functional unit plus a vertical slice of the vector register file (lane 0 holds elements 0,4,8,..., lane 1 holds elements 1,5,9,..., and so on), and all lanes connect to the memory subsystem]


T0 Vector (1995)

[Die photo: vector register elements striped over lanes]

Vector Memory-Memory versus Vector Register Machines

• Vector memory-memory instructions hold all vector operands in main memory
• The first vector machines, CDC Star-100 ('73) and TI ASC ('71), were memory-memory machines
• Cray-1 ('76) was first vector register machine

Example source code:
for (i=0; i<N; i++)
  C[i] = A[i] + B[i];

Vector memory-memory code:
ADDV C, A, B

Vector Stripmining

Problem: Vector registers have finite length
Solution: Break loops into pieces that fit into vector registers, "Stripmining"

for (i=0; i<N; i++) ...
    ANDI R1, N, 63   # N mod 64
    ...

Vector Instruction Parallelism

Can overlap execution of multiple vector instructions
– example machine has 32 elements per vector register and 8 lanes
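The stripmining pattern can be sketched in C. This is a hypothetical scalar model, assuming a 64-element maximum vector length as in the slide's ANDI R1, N, 63 (N mod 64): the first strip handles the leftover N mod 64 elements, so every later strip is a full vector.

```c
#define MVL 64  /* maximum vector length, assumed from "N mod 64" */

/* Stripmined C[i] = A[i] + B[i] over an arbitrary N. Each inner loop
 * stands for one vector instruction's worth of work at length vlr. */
void stripmine_add(const double *A, const double *B, double *C, int N) {
    int i = 0;
    int vlr = N % MVL;            /* first, possibly short, strip */
    if (vlr == 0) vlr = MVL;      /* N is a multiple of MVL */
    while (i < N) {
        for (int j = 0; j < vlr; j++)   /* one strip: set VLR, LV/ADDV/SV */
            C[i + j] = A[i + j] + B[i + j];
        i += vlr;
        vlr = MVL;                /* all remaining strips are full length */
    }
}
```

Handling the short strip first means the VLR register is written only twice, no matter how large N is.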

Vector Chaining

LV   v1
MULV v3, v1, v2
ADDV v5, v3, v4

• With chaining, can start a dependent instruction as soon as its first result appears

[Figure: element streams chain over time from the load unit (v1) through the multiplier (v3) to the adder (v5), overlapping instead of waiting for each whole vector to complete]
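The benefit of chaining can be shown with a back-of-envelope timing model. The latencies below are illustrative assumptions, not Cray-1 figures: without chaining, each dependent vector instruction waits for the previous one to drain all n elements; with chaining, it starts as soon as the first element arrives, so only the pipeline latencies serialize.

```c
/* Idealized cycle counts for k dependent vector instructions of
 * length n, with per-unit pipeline latencies lat[0..k-1]. */
int unchained_cycles(int n, const int *lat, int k) {
    int t = 0;
    for (int i = 0; i < k; i++)
        t += lat[i] + n;          /* each unit drains its whole vector */
    return t;
}
int chained_cycles(int n, const int *lat, int k) {
    int t = 0;
    for (int i = 0; i < k; i++)
        t += lat[i];              /* latencies add, element streams overlap */
    return t + n;                 /* one n-element stream through the chain */
}
```

For example, three dependent instructions (load, multiply, add) with assumed latencies 12, 7, and 6 cycles on a 64-element vector take 217 cycles unchained but only 89 chained.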


Vector Startup

Two components of vector startup penalty:
– functional unit latency (time through pipeline)
– dead time or recovery time (time before another vector instruction can start down pipeline)

[Figure: R-X-X-X-W pipeline diagram: the elements of the first vector instruction stream through the functional unit, then 4 cycles of dead time pass before the second vector instruction can enter the pipeline]

Dead Time and Short Vectors

[Figure:
– T0, eight lanes: no dead time, 100% efficiency with 8-element vectors
– Cray C90, two lanes: 4-cycle dead time, 64 cycles active, so maximum efficiency 64/68 ≈ 94% with 128-element vectors]

Vector Scatter/Gather

Want to vectorize loops with indirect accesses (gather):
for (i=0; i<N; i++) ...

Scatter example:
for (i=0; i<N; i++) ...
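The two indirect access patterns can be sketched in C. These hypothetical functions model what an indexed vector load (gather) and indexed vector store (scatter) do with an index vector D; they are illustrations of the access pattern, not an ISA.

```c
/* Gather (indexed load): dst[i] = src[D[i]] -- read through indices */
void gather(double *dst, const double *src, const int *D, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = src[D[i]];
}

/* Scatter (indexed store): dst[D[i]] = src[i] -- write through indices */
void scatter(double *dst, const double *src, const int *D, int n) {
    for (int i = 0; i < n; i++)
        dst[D[i]] = src[i];
}
```

A hardware scatter must still respect ordering when D contains duplicate indices, which is one reason scatter/gather units are harder to build than unit-stride ports.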


Vector Conditional Execution

Problem: Want to vectorize loops with conditional code:
for (i=0; i<N; i++)
  if (A[i]>0) then A[i] = B[i];

Solution: Add vector mask (or flag) registers
– vector version of predicate registers, 1 bit per element
...and maskable vector instructions
– vector operation becomes NOP at elements where mask bit is clear

Code example:
CVM              # Turn on all elements
LV vA, rA        # Load entire A vector
SGTVS.D vA, F0   # Set bits in mask register where A>0
LV vA, rB        # Load B vector into A under mask
SV vA, rA        # Store A back to memory under mask

Masked Vector Instructions

[Figure:
– Simple implementation: execute all N operations, but turn off the result write (write enable) where the mask bit M[i] is clear
– Density-time implementation: scan the mask vector and only execute the elements with non-zero mask bits]
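The mask-register sequence can be modeled in C. This hypothetical sketch separates the two steps the slide's code performs: a compare that fills a mask (SGTVS.D), then an operation that is a NOP wherever the mask bit is clear (load/store under mask). It assumes n is at most the 64-element vector length.

```c
#define MVL 64  /* maximum vector length; assumes n <= MVL */

/* Vectorized "if (A[i] > 0) A[i] = B[i];" using an explicit mask. */
void masked_update(double *A, const double *B, int n) {
    int mask[MVL];
    for (int i = 0; i < n; i++)        /* compare step: set mask where A>0 */
        mask[i] = (A[i] > 0.0);
    for (int i = 0; i < n; i++)        /* masked op: NOP where bit is clear */
        if (mask[i])
            A[i] = B[i];
}
```

In the simple hardware implementation both loops still "run" over all n elements and only the write is suppressed; the density-time implementation skips the false elements entirely.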

Compress/Expand Operations

• Compress packs non-masked elements from one vector register contiguously at start of destination vector register
• Expand performs the inverse operation
• Used for density-time conditionals and also for general selection operations

Vector Reductions

Problem: Loop-carried dependence on reduction variables
sum = 0;
for (i=0; i<N; i++)
  sum += A[i];   # loop-carried dependence on sum

Solution: re-associate the operations if possible: accumulate VL partial sums in a vector register, then repeatedly halve VL and add the two halves while VL > 1.
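The re-association trick can be sketched in C. This hypothetical scalar model assumes a vector length of 64: the first loop builds 64 partial sums (each a vector add in hardware), and the second loop is the halving tree, a vector add at half the previous length on each step.

```c
#define MVL 64  /* vector length assumed for the partial-sum register */

/* Reduction sum = A[0] + ... + A[N-1] without a loop-carried scalar
 * dependence: 64 independent partial sums, then a log2(64)-step tree. */
double vec_sum(const double *A, int N) {
    double partial[MVL] = {0};
    for (int i = 0; i < N; i++)           /* strip accumulation phase */
        partial[i % MVL] += A[i];
    for (int len = MVL / 2; len >= 1; len /= 2)
        for (int i = 0; i < len; i++)     /* add upper half into lower half */
            partial[i] += partial[i + len];
    return partial[0];
}
```

Re-association changes the order of floating-point additions, so results can differ in the last bits from the scalar loop; for integer sums the transformation is exact.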

Novel Matrix Multiply Solution

• Consider the following:
/* Multiply a[m][k] * b[k][n] to get c[m][n] */
for (i=1; i< ...

Optimized Vector Example


MMX Instructions

• Move 32b, 64b
• Add, Subtract in parallel: 8 8b, 4 16b, 2 32b
  – opt. signed/unsigned saturate (set to max) if overflow
• Shifts (sll, srl, sra), And, And Not, Or, Xor in parallel: 8 8b, 4 16b, 2 32b
• Multiply, Multiply-Add in parallel: 4 16b
• Compare =, > in parallel: 8 8b, 4 16b, 2 32b
  – sets field to 0s (false) or 1s (true); removes branches
• Pack/Unpack
  – Convert 32b <–> 16b, 16b <–> 8b
  – Pack saturates (set to max) if number is too large

Vector Summary

• Vector is alternative model for exploiting ILP
• If code is vectorizable, then simpler hardware, more energy efficient, and better real-time model than out-of-order machines
• Design issues include number of lanes, number of functional units, number of vector registers, length of vector registers, exception handling, conditional operations
• Fundamental design issue is memory bandwidth
  – With virtual address translation and caching
• Will multimedia popularity revive vector architectures?
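The saturating arithmetic mentioned above is easy to show for one element. This sketch models a single unsigned 8-bit saturating add: on overflow the result clamps to 255 instead of wrapping, which is what you want for pixel data; an actual MMX parallel add performs eight of these at once on a 64-bit register.

```c
#include <stdint.h>

/* One lane of a parallel unsigned-saturate 8-bit add: clamp to 255
 * on overflow rather than wrapping around. */
uint8_t add_u8_saturate(uint8_t a, uint8_t b) {
    unsigned s = (unsigned)a + (unsigned)b;   /* widen to avoid wraparound */
    return (uint8_t)(s > 255u ? 255u : s);    /* saturate (set to max) */
}
```

With wrapping arithmetic, brightening a nearly-white pixel would turn it dark; saturation keeps it white, which is why the multimedia extensions make it an option.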
