COEN-4710 Lecture 1 Computer Abstractions and Technology (Ch.1)

Cristinel Ababei Marquette University Department of Electrical and Computer Engineering

Credits: Slides adapted primarily from presentations from Mike Johnson, Morgan Kaufmann, and other online sources


Outline

❖ Historical Perspective
❖ Technology Trends
❖ The Computing Stack: Layers of Abstraction
❖ Performance Evaluation
❖ Benchmarking


Historical Perspective
❖ ENIAC of WWII: the first general-purpose computer
◆ Used for computing artillery firing tables
◆ 80 feet long by 8.5 feet high; used 18,000 vacuum tubes
◆ Performed 1,900 additions per second


Computing Devices Then…

Electronic Delay Storage Automatic Calculator (EDSAC), an early British computer
University of Cambridge, UK, 1949
Considered to be the first stored-program electronic computer


Computing Systems Today
❖ The world is a large parallel system
◆ Massive clusters and Gigabit Ethernet in everything
◆ Vast infrastructure behind them

[Figure: the Internet providing scalable, reliable, secure services (databases, remote storage, online games, commerce) with connectivity to information-collection devices: sensor nets, MEMS for sensor nets, routers, robots, cars, refrigerators, …]

Technology Trends

❖ Electronics technology evolution
◆ Increased capacity and performance
◆ Reduced cost

Year   Technology                      Relative performance/cost
1951   Vacuum tube                     1
1965   Transistor                      35
1975   Integrated circuit (IC)         900
1995   Very large scale IC (VLSI)      2,400,000
2005   Ultra large scale IC (ULSI)     6,200,000,000


Moore’s Law


Feature Size Decreasing

Source: International Technology Roadmap for Semiconductors (ITRS) 2009 Exec. Summary


Computation Speed Increasing


Memory Capacity Increasing


Power Trends

Power = Capacitive load × Voltage² × Frequency

◼ Over time: clock frequency up ×1000, supply voltage down from 5V → 1V, power up ×30

The power wall:

◼ Can’t reduce voltage further or remove more heat

◼ How else can we improve performance?
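The dynamic power relation above can be checked with a quick sketch; the capacitance and frequency values below are illustrative, not taken from the slide.

```python
# Dynamic power model from the slide: Power = Capacitive load x Voltage^2 x Frequency.
# All numeric inputs here are illustrative assumptions.

def dynamic_power(cap_load, voltage, frequency):
    """Dynamic power dissipated by switching CMOS logic, in watts."""
    return cap_load * voltage ** 2 * frequency

old = dynamic_power(cap_load=1e-9, voltage=5.0, frequency=10e6)  # 5 V, 10 MHz era
new = dynamic_power(cap_load=1e-9, voltage=1.0, frequency=10e9)  # 1 V, 10 GHz era

# Voltage fell 5x (25x less power per switch), but frequency rose 1000x,
# so net power still grows by 1000/25 = 40x in this illustrative case.
print(new / old)  # 40.0
```

This is why lowering voltage bought decades of frequency scaling, and why hitting the voltage floor creates the power wall.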

Uniprocessor Performance

Solution: increase the number of processors!


Types of Parallelism
❖ Instruction-Level Parallelism
◆ Multiple instructions executing within a single processor, e.g. pipelining, superscalar design
◆ Simultaneous and temporal multithreading
❖ Processor-Level Parallelism
◆ More than one processor per chip
◆ May be tightly or loosely coupled

Flynn’s Taxonomy:
                 Single Instruction   Multiple Instruction
Single Data      SISD                 MISD
Multiple Data    SIMD                 MIMD


Example: Intel i7
Intel i7 Nehalem architecture, 4 cores


Example: AMD Zen (family 17h)
• Over 1,300 sensors to monitor the state of the die over all critical paths
• 48 high-speed power supply monitors, 20 thermal diodes, and 9 high-speed droop detectors

[Images source: wikichip.org]

Example: ARM
• Neoverse N1 (codename Ares) is a high-performance ARM microarchitecture designed by ARM Holdings for the server market.

[Images source: wikichip.org]

Example: (ARM) Vulcan
• Vulcan is a 16 nm high-performance 64-bit ARM microarchitecture designed by Broadcom and later introduced by Cavium for the server market.

[Images source: wikichip.org]

Example: Nvidia V100
• The Tesla Volta V100 graphics processor has 5,120 CUDA/shader cores and is built from 21 billion transistors.


Example: (Tilera) Mellanox TILE-Gx8072

Powerful Processor Cores
• 72 cores operating at frequencies up to 1.2 GHz
• 22.5 MBytes total on-chip cache
• 32 KB L1 instruction cache and 32 KB L1 data cache per core
• 256 KB L2 cache per core
• 18 MBytes coherent L3 cache


Still the same basics!
1. Input
2. Control
3. Datapath
4. Memory
5. Output


Hardware Architectures
❖ Von Neumann Architecture
◆ Single memory for both Programs and Data

❖ Harvard Architecture
◆ Separate memories for Programs and Data

❖ Modified Harvard Architecture
◆ Single physical memory, but separate CPU pathways for Programs and Data


Datapath and Control


Instruction Set Architectures (ISAs)
❖ CISC: Complex Instruction Set Computer
◆ Large number of complex instructions
◆ Leads to larger, slower hardware
◆ Good for special-purpose processors (or necessary backwards compatibility, as with Intel)
❖ RISC: Reduced Instruction Set Computer
◆ Small number of general instructions
◆ Leads to more compact, faster hardware
◆ Good for general-purpose processors


The Computing Stack: Hierarchy
❖ Must focus on one “layer” or “abstraction” at a time

The stack, from top to bottom:
◆ Software: Application, Operating System, Compiler, Firmware
◆ Instruction Set Architecture (ISA): Instruction Set Processor, I/O system
◆ Hardware: Datapath & Control, Digital Design, Circuit Design, Layout


Manufacturing ICs

Yield: proportion of working dies per wafer
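The yield definition can be turned into a small cost sketch. The quadratic yield formula and all numeric inputs below are assumptions (the standard first-order model from the Patterson & Hennessy text), not figures from the slide.

```python
# First-order IC manufacturing cost model; all inputs are illustrative.
# Areas in cm^2, defects in defects per cm^2.

def dies_per_wafer(wafer_area, die_area):
    # Simplification: ignores dies lost at the wafer edge.
    return wafer_area // die_area

def wafer_yield(defects_per_area, die_area):
    # Yield: proportion of working dies per wafer (textbook approximation).
    return 1.0 / (1.0 + defects_per_area * die_area / 2.0) ** 2

def cost_per_die(wafer_cost, wafer_area, die_area, defects_per_area):
    return wafer_cost / (dies_per_wafer(wafer_area, die_area)
                         * wafer_yield(defects_per_area, die_area))

# Larger dies hurt twice: fewer dies per wafer AND lower yield per die.
print(cost_per_die(wafer_cost=5000.0, wafer_area=706.9,  # ~300mm wafer
                   die_area=2.0, defects_per_area=0.02))
```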


Intel Core i7 Wafer

❖ 300mm wafer, 280 chips, 32nm technology
❖ Each chip is 20.7 × 10.5 mm

❖ Gates constructed of transistor circuits
❖ Circuits etched directly into silicon
◆ Original Pentium: 4.5 million transistors
◆ Pentium IV: 42 million transistors
◆ Pentium IV : 592 million transistors

◆ i7: 731 million transistors

How to Measure Performance?
❖ Execution time (response time, latency)
— How long does it take for my job to run?
— How long does it take to execute a job?
— How long must I wait for the database query?
❖ Throughput
— How many jobs can the machine run at once?
— What is the average execution rate?
— How much work is getting done?
If we upgrade a machine with a new processor, what do we increase?
If we add a new machine to the lab, what do we increase?


Definition of Performance
❖ Performance (larger number means better performance)

performance(X) = 1 / execution time(X)

So “X is n times faster than Y” means

performance(X) / performance(Y) = execution time(Y) / execution time(X) = n
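The definition translates directly into code; the execution times below are made-up values for illustration.

```python
# Performance as the reciprocal of execution time (slide definition).

def performance(execution_time):
    return 1.0 / execution_time

time_x, time_y = 10.0, 15.0  # illustrative times, in seconds

# "X is n times faster than Y": ratio of performances,
# which equals the inverted ratio of execution times.
n = performance(time_x) / performance(time_y)
print(n)  # 1.5
```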


Measuring Execution Time
❖ Elapsed time
◆ Total response time, including all aspects
➢ Processing, I/O, OS overhead, idle time
◆ Determines system performance
❖ CPU time
◆ Time spent processing a given job
➢ Discounts I/O time, other jobs’ shares
◆ Comprises user CPU time and system CPU time
◆ Different programs are affected differently by CPU and system performance


CPU Clocking
❖ Operation of digital hardware governed by a constant-rate clock

[Clock waveform: each clock period consists of data transfer and computation, followed by a state update]

◼ Clock period (clock cycle time OR cycle time): duration of a clock cycle
➢ e.g., 250ps = 0.25ns = 250×10⁻¹² s

◼ Clock frequency (rate): cycles per second
➢ e.g., 4.0GHz = 4000MHz = 4.0×10⁹ Hz

◼ Clock period is the inverse of clock frequency


CPU Time (Execution Time)

CPU Time = CPU Clock Cycles × Clock Cycle Time = CPU Clock Cycles / Clock Rate
❖ Performance improved by
◆ Reducing number of clock cycles
◆ Increasing clock rate
◆ Hardware designer must often trade off clock rate against cycle count


CPU Time: Example 1
❖ Computer A: 2GHz clock rate, 10s CPU time
❖ Designing Computer B
◆ Aim for 6s CPU time
◆ Can do faster clock, but causes 1.2 × clock cycles
❖ How fast must Computer B clock be?

Clock Rate_B = Clock Cycles_B / CPU Time_B = (1.2 × Clock Cycles_A) / 6s
Clock Cycles_A = CPU Time_A × Clock Rate_A = 10s × 2GHz = 20×10⁹
Clock Rate_B = (1.2 × 20×10⁹) / 6s = (24×10⁹) / 6s = 4GHz
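The algebra in Example 1 can be reproduced numerically as a quick check:

```python
# Example 1: find the clock rate Computer B needs for a 6s CPU time.
clock_rate_a = 2e9    # 2 GHz
cpu_time_a   = 10.0   # seconds
cpu_time_b   = 6.0    # target, seconds
cycle_growth = 1.2    # the faster clock costs 1.2x the clock cycles

clock_cycles_a = cpu_time_a * clock_rate_a               # 20e9 cycles
clock_rate_b = cycle_growth * clock_cycles_a / cpu_time_b

print(clock_rate_b / 1e9)  # 4.0 (GHz)
```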


Instruction Count and CPI

CPU Clock Cycles = Instruction Count × Cycles Per Instruction (CPI)
CPU Time = Instruction Count × CPI × Clock Cycle Time = (Instruction Count × CPI) / Clock Rate

❖ Instruction Count for a program
◆ Determined by program, ISA and compiler
❖ Average CPI
◆ Determined by CPU hardware
◆ If different instructions have different CPI
➢ Average CPI affected by instruction mix


CPI: Example 2
❖ Computer A: Cycle Time = 250ps, CPI = 2.0
❖ Computer B: Cycle Time = 500ps, CPI = 1.2
❖ Same ISA
❖ Which is faster, and by how much?

CPU Time_A = Instruction Count × CPI_A × Cycle Time_A = I × 2.0 × 250ps = I × 500ps   ← A is faster…
CPU Time_B = Instruction Count × CPI_B × Cycle Time_B = I × 1.2 × 500ps = I × 600ps
CPU Time_B / CPU Time_A = (I × 600ps) / (I × 500ps) = 1.2   ← …by this much
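Example 2 can be checked the same way; the instruction count I cancels in the ratio, so any value works:

```python
# Example 2: same ISA, so the instruction count I is identical on both machines.
I = 1.0  # arbitrary; I drops out of the ratio

cpu_time_a = I * 2.0 * 250e-12  # CPI_A x cycle time_A = 500ps per instruction
cpu_time_b = I * 1.2 * 500e-12  # CPI_B x cycle time_B = 600ps per instruction

ratio = cpu_time_b / cpu_time_a
print(ratio)  # 1.2 -> A is 1.2x faster than B
```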


CPI in More Detail
❖ If different instruction classes take different numbers of cycles

Clock Cycles = Σ (i = 1 to n) CPIᵢ × Instruction Countᵢ

◼ Weighted average CPI:

CPI = Clock Cycles / Instruction Count = Σ (i = 1 to n) CPIᵢ × (Instruction Countᵢ / Instruction Count)

where Instruction Countᵢ / Instruction Count is the relative frequency of class i


CPI: Example 3
❖ Alternative compiled code sequences using instructions in classes A, B, C

Class              A   B   C
CPI for class      1   2   3
IC in sequence 1   2   1   2
IC in sequence 2   4   1   1

◼ Sequence 1: IC = 5
➢ Clock Cycles = 2×1 + 1×2 + 2×3 = 10
➢ Avg. CPI = 10/5 = 2.0
◼ Sequence 2: IC = 6
➢ Clock Cycles = 4×1 + 1×2 + 1×3 = 9
➢ Avg. CPI = 9/6 = 1.5
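The weighted-CPI computation for both sequences can be sketched as:

```python
# Example 3: weighted-average CPI over instruction classes A, B, C.
cpi = {"A": 1, "B": 2, "C": 3}  # cycles per instruction, by class

def avg_cpi(instr_counts):
    """Return (total clock cycles, weighted-average CPI) for a code sequence."""
    cycles = sum(cpi[cls] * count for cls, count in instr_counts.items())
    total_ic = sum(instr_counts.values())
    return cycles, cycles / total_ic

print(avg_cpi({"A": 2, "B": 1, "C": 2}))  # sequence 1: (10, 2.0)
print(avg_cpi({"A": 4, "B": 1, "C": 1}))  # sequence 2: (9, 1.5)
```

Sequence 2 executes more instructions yet fewer total cycles, which is why average CPI alone (or IC alone) can mislead.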


Performance Summary
The BIG Picture

CPU Time = (Instructions / Program) × (Clock Cycles / Instruction) × (Seconds / Clock Cycle)

❖ Performance depends on
◆ Algorithm: affects IC, possibly CPI
◆ Programming language: affects IC, CPI
◆ Compiler: affects IC, CPI
◆ Instruction set architecture: affects IC, CPI, Tc


SPEC CPU Benchmark
❖ Programs used to measure performance
◆ Supposedly typical of actual workload
❖ Standard Performance Evaluation Corp (SPEC)
◆ Develops benchmarks for CPU, I/O, Web, …
❖ SPEC CPU2006
◆ Elapsed time to execute a selection of programs
➢ Negligible I/O, so focuses on CPU performance
◆ Normalize relative to reference machine
◆ Summarize as geometric mean of performance ratios
➢ CINT2006 (integer) and CFP2006 (floating-point)

n n Execution time ratioi i=1


CINT2006 for Intel Core i7 920


What You Will Learn
❖ How programs are translated into machine language
◆ And how the hardware executes them
❖ The hardware/software interface
❖ What determines program performance
◆ And how it can be improved
❖ How hardware designers improve performance
❖ What parallel processing is


Eight Great Ideas Invented by Computer Architects

1. Design for Moore’s Law

2. Use abstraction to simplify design

3. Make the common case fast

4. Performance via parallelism

5. Performance via pipelining

6. Performance via prediction

7. Hierarchy of memories

8. Dependability via redundancy


Concluding Remarks
❖ Cost/performance is improving
◆ Due to underlying technology development
❖ Hierarchical layers of abstraction
◆ In both hardware and software
❖ Instruction Set Architecture (ISA)
◆ The hardware/software interface
❖ Execution time: the best performance measure
❖ Power is a limiting factor
◆ Use parallelism to improve performance

