Lecture 26: Domain Specific Architectures Chapter 07, CAQA 6th Edition
CSCE 513 Computer Architecture
Department of Computer Science and Engineering Yonghong Yan [email protected] https://passlab.github.io/CSCE513 Copyright and Acknowledgements
§ Copyright © 2019, Elsevier Inc. All rights Reserved – Textbook slides § Machine Learning for Science” in 2018 and A Superfacility Model for Science” in 2017 By Kathy Yelic – https://people.eecs.berkeley.edu/~yelick/talks.html
2 CSE 564 Class Contents
§ Introduction to Computer Architecture (CA) § Quantitative Analysis, Trend and Performance of CA – Chapter 1 § Instruction Set Principles and Examples – Appendix A § Pipelining and Implementation, RISC-V ISA and Implementation – Appendix C, RISC-V (riscv.org) and UCB RISC-V impl § Memory System (Technology, Cache Organization and Optimization, Virtual Memory) – Appendix B and Chapter 2 – Midterm covered till Memory Tech and Cache Organization § Instruction Level Parallelism (Dynamic Scheduling, Branch Prediction, Hardware Speculation, Superscalar, VLIW and SMT) – Chapter 3 § Data Level Parallelism (Vector, SIMD, and GPU) – Chapter 4 § Thread Level Parallelism – Chapter 5 § Domain-Specific Architecture – Chapter 7
3 The Moore’s Law TrendSuperscalar/Vector/Parallel GPUs
1 PFlop/s (1015) IBM Parallel BG/L ASCI White ASCI Red 1 TFlop/s Pacific 12 (10 ) 2X Transistors/Chip TMC CM-5 Cray T3D Every 1.5 Years Vector TMC CM-2
1 GFlop/s Cray 2 Cray X-MP (109) Super Scalar Cray 1 1941 1 (Floating Point operations / second, Flop/s) 1945 100 CDC 7600 IBM 360/195 1949 1,000 (1 KiloFlop/s, KFlop/s) 1 MFlop/s Scalar 1951 10,000 (106) CDC 6600 1961 100,000 1964 1,000,000 (1 MegaFlop/s, MFlop/s) 1968 10,000,000 IBM 7090 1975 100,000,000 1987 1,000,000,000 (1 GigaFlop/s, GFlop/s) 1992 10,000,000,000 1993 100,000,000,000 1 KFlop/s 1997 1,000,000,000,000 (1 TeraFlop/s, TFlop/s) (103) UNIVAC 1 2000 10,000,000,000,000 EDSAC 1 2005 131,000,000,000,000 (131 Tflop/s) 4 1950 1960 1970 1980 1990 2000 2010 2020 An Overview of the GK110 Kepler Architecture
Kepler GK110 was built first and foremost for Tesla, and its goal was to be the highest performing parallel computing microprocessor in the world. GK110 not only greatly exceeds the raw compute horsepower delivered by Fermi,Recent but it does so efficiently, consuming significantly M less poweranycore and GPU processors generating much less heat output.
A full Kepler GK110 implementation includes 15 SMX units and six 64 bit memory controllers. Different products will use different configurations of GK110. For example, some products may deploy 13 or 14 SMXs. Key features of the architecture that will be discussed below in more depth include: Kepler Memory Subsystem / L1, L2, ECC The new SMX processor architecture Kepler&s