Computer Architecture, Part 7

Part VII Advanced Architectures Feb. 2011 Computer Architecture, Advanced Architectures Slide 1 About This Presentation This presentation is intended to support the use of the textbook Computer Architecture: From Microprocessors to Supercomputers, Oxford University Press, 2005, ISBN 0-19-515455-X. It is updated regularly by the author as part of his teaching of the upper-division course ECE 154, Introduction to Computer Architecture, at the University of California, Santa Barbara. Instructors can use these slides freely in classroom teaching and for other educational purposes. Any other use is strictly prohibited. © Behrooz Parhami Edition Released Revised Revised Revised Revised First July 2003 July 2004 July 2005 Mar. 2007 Feb. 2011* * Minimal update, due to this part not being used for lectures in ECE 154 at UCSB Feb. 2011 Computer Architecture, Advanced Architectures Slide 2 VII Advanced Architectures Performance enhancement beyond what we have seen: • What else can we do at the instruction execution level? • Data parallelism: vector and array processing • Control parallelism: parallel and distributed processing Topics in This Part Chapter 25 Road to Higher Performance Chapter 26 Vector and Array Processing Chapter 27 Shared-Memory Multiprocessing Chapter 28 Distributed Multicomputing Feb. 2011 Computer Architecture, Advanced Architectures Slide 3 25 Road to Higher Performance Review past, current, and future architectural trends: • General-purpose and special-purpose acceleration • Introduction to data and control parallelism Topics in This Chapter 25.1 Past and Current Performance Trends 25.2 Performance-Driven ISA Extensions 25.3 Instruction-Level Parallelism 25.4 Speculation and Value Prediction 25.5 Special-Purpose Hardware Accelerators 25.6 Vector, Array, and Parallel Processing Feb. 2011 Computer Architecture, Advanced Architectures Slide 4 25.1 Past and Current Performance Trends Intel 4004: The first μp (1971) Intel Pentium 4, circa 2005 0.06 MIPS (4-bit processor) 10,000 MIPS (32-bit processor) 8008 80386 80486 8080 8-bit Pentium, MMX 8084 32-bit Pentium Pro, II 8086 8088 Pentium III, M 80186 80188 16-bit Celeron 80286 Feb. 2011 Computer Architecture, Advanced Architectures Slide 5 Architectural Innovations for Improved Performance Computer performance grew by a factor Available computing power ca. 2000: of about 10000 between 1980 and 2000 GFLOPS on desktop 100 due to faster technology TFLOPS in supercomputer center 100 due to better architecture PFLOPS on drawing board Architectural method Improvement factor 1. Pipelining (and superpipelining) 3-8 √ 2. Cache memory, 2-3 levels 2-5 √ 3. RISC and related ideas 2-3 √ discussed 4. Multiple instruction issue (superscalar) 2-3 √ Previously methods Established 5. ISA extensions (e.g., for multimedia) 1-3 √ 6. Multithreading (super-, hyper-) 2-5 ? 7. Speculation and value prediction 2-3 ? 8. Hardware acceleration 2-10 ? Part VII Newer methods 9. Vector and array processing 2-10 ? Covered in 10. Parallel/distributed computing 2-1000s ? Feb. 2011 Computer Architecture, Advanced Architectures Slide 6 Peak Performance of Supercomputers PFLOPS Earth Simulator × 10 / 5 years ASCI White Pacific TFLOPS ASCI Red TMC CM-5 Cray T3D Cray X-MP TMC CM-2 Cray 2 GFLOPS 1980 1990 2000 2010 Dongarra, J., “Trends in High Performance Computing,” Computer J., Vol. 47, No. 4, pp. 399-403, 2004. [Dong04] Feb. 2011 Computer Architecture, Advanced Architectures Slide 7 Energy Consumption is Getting out of Hand TIPS DSP performance Absolute per watt processor performance GIPS GP processor performance per watt Performance MIPS kIPS 1980 1990 2000 2010 Calendar year Figure 25.1 Trend in energy consumption for each MIPS of computational power in general-purpose processors and DSPs. Feb. 2011 Computer Architecture, Advanced Architectures Slide 8 25.2 Performance-Driven ISA Extensions Adding instructions that do more work per cycle Shift-add: replace two instructions with one (e.g., multiply by 5) Multiply-add: replace two instructions with one (x := c + a × b) Multiply-accumulate: reduce round-off error (s := s + a × b) Conditional copy: to avoid some branches (e.g., in if-then-else) Subword parallelism (for multimedia applications) Intel MMX: multimedia extension 64-bit registers can hold multiple integer operands Intel SSE: Streaming SIMD extension 128-bit registers can hold several floating-point operands Feb. 2011 Computer Architecture, Advanced Architectures Slide 9 Class Instruction Vector Op type Function or results Intel Register copy 32 bits Integer register ↔ MMX register Parallel pack 4, 2 Saturate Convert to narrower elements MMX Copy ISA Parallel unpack low 8, 4, 2 Merge lower halves of 2 vectors Parallel unpack high 8, 4, 2 Merge upper halves of 2 vectors Exten- Parallel add 8, 4, 2 Wrap/Saturate# Add; inhibit carry at boundaries sion Parallel subtract 8, 4, 2 Wrap/Saturate# Subtract with carry inhibition Parallel multiply low 4 Multiply, keep the 4 low halves Arith- metic Parallel multiply high 4 Multiply, keep the 4 high halves Parallel multiply-add 4 Multiply, add adjacent products* Parallel compare equal 8, 4, 2 All 1s where equal, else all 0s Parallel compare greater 8, 4, 2 All 1s where greater, else all 0s Parallel left shift logical 4, 2, 1 Shift left, respect boundaries Shift Parallel right shift logical 4, 2, 1 Shift right, respect boundaries Parallel right shift arith 4, 2 Arith shift within each (half)word Parallel AND 1 Bitwise dest ← (src1) ∧ (src2) Parallel ANDNOT 1 Bitwise dest ← (src1) ∧ (src2)′ Logic Parallel OR 1 Bitwise dest ← (src1) ∨ (src2) Parallel XOR 1 Bitwise dest ← (src1) ⊕ (src2) Memory Parallel load MMX reg 32 or 64 bits Address given in integer register Table access Parallel store MMX reg 32 or 64 bit Address given in integer register 25.1 Control Empty FP tag bits Required for compatibility$ Feb. 2011 Computer Architecture, Advanced Architectures Slide 10 MMX Multiplication and Multiply-Add a b d e a b d e e f g h e f g h e × h z v e × h v d × g y u d × g u b × f x t b × f t a × e w s a × e s add add s t u v s + t u + v (a) Parallel multiply low (b) Parallel multiply-add Figure 25.2 Parallel multiplication and multiply-add in MMX. Feb. 2011 Computer Architecture, Advanced Architectures Slide 11 MMX Parallel Comparisons 14 3 58 66 5 3 12 32 5 6 12 9 79 1 58 65 3 12 22 17 5 12 90 8 65 535 255 (all 1s) (all 1s) 0 0 0 0 0 0 0 0 (a) Parallel compare equal (b) Parallel compare greater Figure 25.3 Parallel comparisons in MMX. Feb. 2011 Computer Architecture, Advanced Architectures Slide 12 25.3 Instruction-Level Parallelism 3 30% 20% 2 10% Fraction of cycles of Fraction Speedup attained attained Speedup 0% 1 0 1 2 3 4 5 6 7 8 0 2 4 6 8 Issuable instructions per cycle Instruction issue width (a) (b) Figure 25.4 Available instruction-level parallelism and the speedup due to multiple instruction issue in superscalar processors [John91]. Feb. 2011 Computer Architecture, Advanced Architectures Slide 13 Instruction-Level Parallelism Figure 25.5 A computation with inherent instruction-level parallelism. Feb. 2011 Computer Architecture, Advanced Architectures Slide 14 VLIW and EPIC Architectures VLIW Very long instruction word architecture EPIC Explicitly parallel instruction computing General registers (128) Execution Execution . Execution unit unit unit Predi- Memory cates (64) Execution Execution . Execution Floating-point unit unit unit registers (128) Figure 25.6 Hardware organization for IA-64. General and floating- point registers are 64-bit wide. Predicates are single-bit registers. Feb. 2011 Computer Architecture, Advanced Architectures Slide 15 25.4 Speculation and Value Prediction spec load spec load ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- store store ---- ---- ---- ---- load check load load check load ---- ---- ---- ---- ---- ---- ---- ---- (a) Control speculation (b) Data speculation Figure 25.7 Examples of software speculation in IA-64. Feb. 2011 Computer Architecture, Advanced Architectures Slide 16 Value Prediction Mem o table Miss Mux 0 Mult/ Output Inputs Div 1 Done Inputs ready Control Output ready Figure 25.8 Value prediction for multiplication or division via a memo table. Feb. 2011 Computer Architecture, Advanced Architectures Slide 17 25.5 Special-Purpose Hardware Accelerators Data and Configuration program CPU memory memory FPGA-like unit on which Accel. 1 Accel. 3 accelerators can be formed via loading of Accel. 2 Unused configuration resources registers Figure 25.9 General structure of a processor with configurable hardware accelerators. Feb. 2011 Computer Architecture, Advanced Architectures Slide 18 Graphic Processors, Network Processors, etc. PE PE PE PE 0 1 2 3 PE PE PE PE PE 5 4 5 6 7 Input Output buffer PE PE PE PE buffer 8 9 10 11 PE PE PE PE 12 13 14 15 Feedback path Colu mn Colu mn Colu mn Colu mn me mor y me mor y me mor y me mor y Figure 25.10 Simplified block diagram of Toaster2, Cisco Systems’ network processor. Feb. 2011 Computer Architecture, Advanced Architectures Slide 19 25.6 Vector, Array, and Parallel Processing Single data Multiple data Shared Message stream streams variables passing SISD SIMD GMSV GMMP Uniprocessors Array or vector Shared-memory Rarely us ed Global Global stream me mor y Single instr Single processors multiprocessors MISD MIMD DMSV DMMP Rarely us ed Multiproc’s or Dis tributed Dis tr ib- me mor y streams streams me mor y mult icomputers Dis tributed shared memory multicomputers Multiple instr John so n’ s e x pa nsion Flynn’s categories Figure 25.11 The Flynn-Johnson classification of

Computer Architecture, Part 7

Comparative Performance Evaluation of Cache-Coherent NUMA and COMA Architectures Abstract 1 Introduction

Detailed Cache Coherence Characterization for Openmp Benchmarks

Analysis and Optimization of I/O Cache Coherency Strategies for Soc-FPGA Device

Chapter 5 Thread-Level Parallelism

Reducing Cache Coherence Traffic with a NUMA-Aware Runtime Approach

Parallel Processors and Cache Coherency

Comparative Modeling and Evaluation of CC-NUMA and COMA on Hierarchical Ring Rchitectures

18-742 Parallel Computer Architecture Lecture 5: Cache Coherence

Why On-Chip Cache Coherence Is Here to Stay

Evaluating Cache Coherent Shared Virtual Memory for Heterogeneous Multicore Chips

A CMOS Vector Processor with a Custom Streaming Cache Greg Faanes [email protected] Silicon Graphics

Mjb › Cs575 › Handouts › Cache.1Pp.Pdf Caching Issues in Multicore Performance