Review of Basic Computer Architecture

What is Computer Architecture?

From Wikipedia, the free encyclopedia:
"In computer science and engineering, computer architecture refers to specification of the relationship between different hardware components of a computer system. It may also refer to the practical art of defining the structure and relationship of the subcomponents of a computer."
This article needs attention from an expert in computer science.

The material in this chapter briefly reviews the content of the undergraduate Computer Architecture course at Hadassah College. If this material is unfamiliar, it is recommended to consult https://cs.hac.ac.il/staff/martin/Architecture


What is Computer Architecture Theory — Definition 2.0

In computer science and engineering, computer architecture refers to the study of performance in computer systems. Performance = run-time speed. It also refers to the practical science of applying performance theory to specifying the structure and relationship of the subcomponents of a computer.
— From an expert in computer science

Computer Architecture Theory — Goals — Specification

Performance = run-time speed
  Run-time of what?
  Compared to what?
Requirements
  Word processing, number crunching, gaming, web server, real time
Specification
  Requirements + performance theory → component implementation

von Neumann Architecture

Computation components: CPU (ALU + memory + control), input, memory, output
[Figure: CPU containing the Arithmetic Logic Unit (ALU) and controller, connected to memory holding instructions and data; a data/instruction path and a control path link the components]

Requirements — Relative to Application

Fastest CPU — Intel Xeon E-2278G
  High-end version of x86-64 processor family
  IA-32 instruction set on enhanced micro-architecture
  Netburst → Ivy Bridge → Haswell → Broadwell → Skylake → Comet Lake
  Pentium II → Pentium III → ... → Multicore
Smartphones — ARM CPU
  Low power
  Higher performance / Watt than x86
Fastest supercomputer — Fujitsu Fugaku
  158,976 nodes × 48 cores/node = 7,630,848 cores
  Node = ARM v8.2-A CPU
  Total RAM = 4.85 PiB (1 PiB = 1024 × 1024 GiB)

Fundamental Architectural Abstractions

Digital computer
  Machine that can be programmed to process symbols
Data
  Symbol with no intrinsic meaning to machine
  User imposes meaning — integer, float, string, ...
Operation
  Symbol describing processing of data symbols
  Machine interprets meaning — transfer, ALU, control, OS, ...
Instruction
  Symbol describing operation on data
  Machine language = collection of legal instructions
Addressing mode
  Specifies data location as operand
  Source operand — data input to operation
  Destination operand — data output from operation


Stages in Computer Design

Instruction Set Architecture (ISA)
1. Define universe of problems to be solved
2. Study candidate operations at level of system programmer
   • Atomic — operations complete sequentially
   • General operation = combination of atomic operations
3. Specify instruction set for machine language
   • Choose minimum set of orthogonal operations
   • Not too many ways to solve same problem

Implementation
1. Design machine as implementation of ISA
2. Evaluate theoretical performance
3. Identify performance problem areas
4. Improve processor efficiency

Typical Operations

Data transfer: load (r ← m), store (m ← r), move (r/m ← r/m), convert data types
Arithmetic/Logical (ALU): integer arithmetic (+, –, compare, shift) and logical (AND, OR, NOR, XOR)
Decimal: integer arithmetic on decimal numbers
Floating point (FPU): floating-point arithmetic (+, –, sqrt, trig, exp, ...)
String: string move, string compare, string search
Control: conditional and unconditional branch, call/return, trap
Operating system: system calls, virtual memory management instructions
Graphics: pixel operations, compression/decompression operations

Memory Hierarchy

Level                     | Location                            | Contents                                                                  | Organized by
Long-term storage (disk)  | Memory location outside CPU and RAM | Stores "all" data and instructions of programs — all files               | OS
Main memory (RAM)         | Memory location outside CPU         | Stores data and instructions of "all" running programs                   | OS addresses
Cache                     | Memory location in or near CPU      | Fast access to important data and instructions — copy of RAM section (current instructions and data) | Cache controller
Registers                 | Memory location inside CPU          | Fast access to small amount of information — next few data               | CPU

CPU and Memory Hierarchy

CPU controller accesses L1:
if (L1 cache hit) {access performed in 1 clock cycle}
else {
    L1 cache miss — cache controller initiates access to L2 and main memory
    if (address in L2 cache) {controller copies contents to L1 from L2}
    else {controller copies location to L1 from main memory}
}

[Figure: CPU with ALU and registers, L1 instruction cache and L1 data cache, L2 cache, and cache controller; the cache controller connects to main memory (RAM) and long-term storage (disk) through the I/O system. Register and L1 access: request and access in 1 CC; L2 and main memory: access latency >> 1 clock cycle.]

Cache miss penalty: address not in L1 → delay in memory access
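The hit/miss logic above can be written out as a minimal Python sketch. The cache contents, addresses, and cycle counts below are illustrative assumptions, not values from the slides.

# Minimal sketch of the L1 -> L2 -> main-memory access logic described above.
# Cache contents and cycle counts are assumed for illustration.

L1 = {0x1000: 42}                               # address -> data cached in L1
L2 = {0x2000: 7}                                # address -> data cached in L2
RAM = {0x1000: 42, 0x2000: 7, 0x3000: 99}       # backing store

L1_HIT, L2_MISS_PENALTY, RAM_MISS_PENALTY = 1, 10, 100   # assumed cycles

def access(addr):
    """Return (data, clock_cycles) for a load from addr."""
    if addr in L1:                              # L1 cache hit: 1 clock cycle
        return L1[addr], L1_HIT
    if addr in L2:                              # L1 miss, L2 hit: copy to L1
        L1[addr] = L2[addr]
        return L1[addr], L1_HIT + L2_MISS_PENALTY
    L1[addr] = L2[addr] = RAM[addr]             # miss in both: copy from main memory
    return L1[addr], L1_HIT + RAM_MISS_PENALTY

print(access(0x1000))   # (42, 1)   -- L1 hit
print(access(0x3000))   # (99, 101) -- miss, filled from RAM
print(access(0x3000))   # (99, 1)   -- now an L1 hit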


Specifying Operands

Immediate
  Constant = IMM = numerical value coded into instruction
Register operands
  Register = a CPU storage location
  REGS[register name] = data stored in register
  Example: REGS[R3] = data stored in register R3 = 11223340
Memory operands
  Memory address = a memory storage location
  MEM[address] = data stored in memory
  Example: MEM[11223344] = data stored at address 11223344 = 45
Effective Address (EA) — pointer arithmetic
  REGS[R3] ← &(variable)
  Load instruction to data register:
  MEM[REGS[R3] + 4] = *(&(variable) + 4) = *(REGS[R3] + 4) = *(11223340 + 4) = 45

Addressing Modes

Mode                  Syntax       Memory Access                           Use
Register              R3           Regs[R3]                                Register data
Immediate             #3           3                                       Constant
Direct (absolute)     (1001)       Mem[1001]                               Static data
Register deferred     (R1)         Mem[Regs[R1]]                           Pointer
Displacement          100(R1)      Mem[100 + Regs[R1]]                     Local variable
Indexed               (R1 + R2)    Mem[Regs[R1] + Regs[R2]]                Array addressing
Memory indirect       @(R3)        Mem[Mem[Regs[R3]]]                      Pointer to pointer
Autoincrement         (R2)+        Mem[Regs[R2]]; Regs[R2] ← Regs[R2] + d  Stack access
Autodecrement         -(R2)        Regs[R2] ← Regs[R2] - d; Mem[Regs[R2]]  Stack access
Scaled                100(R2)[R3]  Mem[100 + Regs[R2] + Regs[R3]*d]        Indexing arrays
PC-relative           (PC)         Mem[PC + value]                         Load instruction to data register
PC-relative deferred  1001(PC)     Mem[PC + Mem[1001]]                     Load instruction to data register
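A small Python sketch of how a CPU might resolve a few of these addressing modes into an operand. The register and memory contents follow the example above (R3 = 11223340, Mem[11223344] = 45); the value stored at address 1001 and the function names are hypothetical, used only for illustration.

# Illustrative effective-address calculation for a few addressing modes
# from the table above. Names and the value at address 1001 are assumed.

REGS = {"R1": 1001, "R2": 2000, "R3": 11223340}
MEM = {1001: 55, 11223340: 11223344, 11223344: 45}

def displacement(disp, reg):            # disp(Rn): Mem[disp + Regs[Rn]]
    return MEM[disp + REGS[reg]]

def register_deferred(reg):             # (Rn): Mem[Regs[Rn]]
    return MEM[REGS[reg]]

def memory_indirect(reg):               # @(Rn): Mem[Mem[Regs[Rn]]]
    return MEM[MEM[REGS[reg]]]

print(displacement(4, "R3"))            # Mem[11223340 + 4] = 45
print(register_deferred("R1"))          # Mem[1001] = 55 (assumed value)
print(memory_indirect("R3"))            # Mem[Mem[11223340]] = Mem[11223344] = 45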

Commitment to State

Internal registers
  Temporary registers used in executing machine instructions
  Not visible to programs
Architectural state
  CPU registers visible to programs
System state
  All data resources visible to programs
  Architectural state + system memory
Commitment to state
  Update of system state
  Write to architectural state / system memory

Complex Instruction Set Computer (CISC)

Classic machine design
  300 instruction types
  15 addressing modes
  10 data types
  Complex machine implementations
Mainframes (1955 – 2000)
  Large, expensive, centralized computers for big business and government
  Manufacturers: IBM, Control Data, Burroughs, Honeywell
Minicomputers (1965 – 1990)
  Smaller computers for smaller organizations
  Manufacturers: Digital (PDP/VAX), Data General (Eclipse)
CISC (1979 – 1996)
  Motorola 6800 (1974) and Intel 8086 (1978) designed as tiny CISC on chip
  Apple II (1977) — 6502 (1975)
  IBM PC (1981) — 8088 (1979)
  Intel x86 for PC/Mac = last CISC ISA still manufactured


Why CISC?

Semantic gap argument
  Computer language should imitate natural language
  Large vocabulary + high redundancy → flexibility + power
Terrible compilers
  Limited optimization
  Limited error messaging
  Efficient code written or optimized in assembly language
Expensive memory
  RAM ~ $5000/MB wholesale in 1977
  RAM < $0.01/MB since 2015
Implications for machine language
  Design for user-friendly programming and small memory use
  Many highly specific instructions using many addressing modes
  Compact instruction codes that perform a lot of work

Physical Implementation of CISC — Generic Machine

[Figure: generic CISC machine. An ALU subsystem (registers feeding ALU inputs IN 1 and IN 2, the ALU operation, output OUT 3, and ALU result flags) connects over a system bus to the status/control logic, decoder, IR, PC, MAR, and MDR, which address and exchange words with main memory.]

PC - program counter    MAR - memory address register
IR - instruction register    MDR - memory data register

Decoding Machine Instructions

Machine language instruction
  SUB R1, R2, 100(R3)
Microcode instruction sequence (microprogram)
  ALU_IN ← R3
  ALU ← 100
  ADD
  MAR ← OUT
  READ
  ALU_IN ← MDR
  ALU ← R2
  SUB
  R1 ← OUT
  9 lines = 9 clock cycles
Microcode instruction = hardware-level atomic operation
(Executed on the generic machine above: ALU subsystem and registers, system bus, decoder, IR, PC, MAR, MDR, main memory)

Run Time and Clock Cycles

CPU is timed by periodic signal called clock (CLK)
Clock cycle (CC) time = seconds per cycle
Clock rate = cycles per second = Hz (Hertz)
Instruction requires 1 or more clock cycles to process

Run time = (clock cycles to run program) × (seconds per clock cycle)
         = (clock cycles to run program) / (clock cycles per second)

Higher clock rate → shorter run time
More clock cycles (at constant clock rate) → longer run time
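A quick worked example of the run-time relation, using the 9-cycle microprogram above; the 100 MHz clock rate is an assumed value, not from the slides.

# Run time = clock cycles / clock rate.
# 9 cycles come from the microprogram above; the 100 MHz clock is assumed.

cycles = 9                  # microcode lines for SUB R1, R2, 100(R3)
clock_rate = 100e6          # 100 MHz -> clock cycle time = 10 ns

cycle_time = 1 / clock_rate
run_time = cycles * cycle_time
print(f"{run_time * 1e9:.0f} ns")    # 90 ns for this one instruction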


Intel 386 — Last Pure CISC Intel CPU

Basic Performance Measures

Run time
  Elapsed time T from start to finish of a defined program task
Latency
  Excess response time — depends on context
Throughput
  Number of defined tasks performed per unit time
  Throughput ≤ 1 / (T + latency between tasks)
Enhancement
  Change to system → new run time T′
Speedup
  S = T / T′ ;  S > 1 when T′ < T
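A tiny sketch of these definitions in Python; the run times and latency below are made-up numbers for illustration.

# Speedup and throughput from the definitions above; numbers are assumed.

T_old, T_new = 10.0, 8.0            # run time before / after an enhancement (s)
latency = 0.5                       # idle time between tasks (s)

speedup = T_old / T_new             # S = T / T'  -> 1.25
throughput = 1 / (T_new + latency)  # tasks per second, <= 1 / (T + latency)
print(speedup, round(throughput, 3))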

Definitions

T = total run time of program
t_i = total run time of instructions in group i
IC_i = number of instructions in group i (IC = Instruction Count)
CPI_i = number of clock cycles to run 1 instruction in group i (Cycles Per Instruction)
N_i = number of clock cycles to run all instructions in group i
τ = seconds per clock cycle
R = clock rate = clock frequency = clock cycles per second = Hertz (Hz) = 1/τ
IC = total number of instructions in program
N = total number of clock cycles to run program
CPI = average number of clock cycles per instruction for the program
quantity′ = new value of quantity after architectural change

CPU Equation

Clock cycles to run all instructions of type i:
  N_i = IC_i × CPI_i

Total clock cycles to run all instructions in program:
  N = Σ_i N_i = Σ_i IC_i × CPI_i   (sum over all groups i)

Average number of clock cycles per instruction for program (weighted average):
  CPI = N / IC = (1/IC) Σ_i N_i = (1/IC) Σ_i IC_i × CPI_i = Σ_i (IC_i / IC) × CPI_i

Ratio IC_i / IC is the proportion (percent) of instructions in group i

CPU Run Time

Run time of one instruction of type i:
  CPI_i × τ   (clock cycles per instruction × seconds per clock cycle)

Run time for all instructions of type i:
  t_i = IC_i × CPI_i × τ

Total run time for program:
  T = Σ_i t_i = Σ_i IC_i × CPI_i × τ = (Σ_i (IC_i/IC) × CPI_i) × IC × τ = CPI × IC × τ
  (clock cycles per instruction × number of instructions × clock cycle time)

Amdahl Equation

F_i = t_i / T = relative run time of instructions in group i
S_i = speedup for instructions in group i

  t_i = F_i × T,   t_i′ = F_i × T / S_i

  S = T / T′ = Σ_i t_i / Σ_i t_i′ = Σ_i F_i T / Σ_i (F_i T / S_i) = 1 / Σ_i (F_i / S_i)

Enhancement to group e only:
  S = 1 / ((1 − F_e) + F_e / S_e)

Amdahl's "Law"
  Speedup limited by 1 − F_e:  as S_e → ∞,  S → 1 / (1 − F_e) = maximum
  Enhance maximum F_e
  Accept impairment to small 1 − F_e
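A minimal sketch of the single-enhancement Amdahl equation; the fraction and local speedup in the example calls are assumptions, not course data.

# Amdahl's law for one enhanced group: S = 1 / ((1 - Fe) + Fe / Se).

def amdahl(Fe, Se):
    """Overall speedup when fraction Fe of run time is sped up by factor Se."""
    return 1.0 / ((1.0 - Fe) + Fe / Se)

print(round(amdahl(0.40, 10), 3))             # about 1.56 -- 40% of time sped up 10x
print(round(amdahl(0.40, float("inf")), 3))   # 1.667 = 1 / (1 - Fe), the Amdahl limit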

Amdahl Equation in Parallel Processing

n processors
  Fraction F_P of work can be parallelized → runs with CPI_{n=1} / n
  Fraction 1 − F_P of work cannot be parallelized → keeps CPI_{n=1}

  CPI_{n processors} = (1 − F_P) × CPI_{n=1} + F_P × CPI_{n=1} / n

  S = (CPI_{n=1} × IC × τ) / (CPI_{n processors} × IC × τ) = 1 / ((1 − F_P) + F_P / n)

F_P = fraction of processing that can be performed independently
n = number of processing units
(A numerical sketch of this formula appears after the SPEC overview below.)

SPEC

Benchmark programs for system performance measurement + comparison
  Standard + repeatable
  Test system for realistic conditions
  Summary score for easy comparison
  Results posted at http://www.spec.org/
Specific test suites
  Cint — CPU integer instructions
  Cfp — CPU FP instructions
  Performance as file server, web server, mail server, graphics
Updated every few years to reflect realistic conditions
  Based on current statistical distributions of computing tasks
  Current CPU test version — 2017
  Previous version — 2006
Reports speedup
  Run time compared with a standard machine


How SPEC Works

User runs n programs on test machine
  Records run-time conditions
  Records program run-time in seconds: T_i^test, i = 1, 2, ..., n
SPEC provides run-times on reference machine: T_i^ref
  Sun Fire V490
  2100 MHz UltraSPARC-IV+ processor
  Powerful symmetric multiprocessing (SMP) server (2006 – 2014)
User calculates speedup for each program
  S_i = T_i^ref / T_i^test,  i = 1, 2, ..., n
User calculates geometric mean of speedups
  S(test machine on ref) = (Π_{i=1..n} T_i^ref / T_i^test)^(1/n)
  S(machine A compared to machine B) = S(machine A on ref) / S(machine B on ref)

Typical SPEC Report — 1

SPEC(R) CPU2017 Integer Speed Result
ASUSTeK Computer Inc.
ASUS RS700-E9(Z11PP-D24) Server System (2.70 GHz, Intel Xeon Gold 6150)

CPU2017 License: 9016                  Test date: Dec-2017
Test sponsor: ASUSTeK Computer Inc.    Hardware availability: Jul-2017
Tested by: ASUSTeK Computer Inc.       Software availability: Sep-2017

Benchmarks        Base Thrds  Base Run Time  Base Ratio  Peak Thrds  Peak Run Time  Peak Ratio
600.perlbench_s   72          286            6.22        72          239            7.42
602.gcc_s         72          423            9.42        72          413            9.65
605.mcf_s         72          426            11.1        72          421            11.2
620.omnetpp_s     72          257            6.35        72          248            6.58
623.xalancbmk_s   72          150            9.46        72          140            10.1
625.x264_s        72          150            11.8        72          150            11.8
631.deepsjeng_s   72          280            5.11        72          282            5.08
641.leela_s       72          393            4.34        72          392            4.36
648.exchange2_s   72          220            13.4        72          220            13.4
657.xz_s          72          280            22.1        72          277            22.3

SPECspeed2017_int_base = 8.87
SPECspeed2017_int_peak = 9.16

Base = standard configuration
Peak = specialist configuration
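A sketch of the geometric-mean calculation described above, applied to the Base ratios in the report (each ratio is already T_ref / T_test for one benchmark):

# Geometric mean of per-benchmark speedups S_i = T_ref / T_test.
# The ratios below are the Base Ratio column of the report above.
import math

base_ratios = [6.22, 9.42, 11.1, 6.35, 9.46, 11.8, 5.11, 4.34, 13.4, 22.1]

def geometric_mean(xs):
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

print(round(geometric_mean(base_ratios), 2))   # about 8.87, matching SPECspeed2017_int_base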

Typical SPEC Report — 2

HARDWARE
  CPU Name: Intel Xeon Gold 6150
  Max MHz: 3700
  Nominal: 2700
  Enabled: 36 cores, 2 chips
  Orderable: 1, 2 chip(s)
  Cache L1: 32 KB I + 32 KB D on chip per core
  Cache L2: 1 MB I+D on chip per core
  Cache L3: 24.75 MB I+D on chip per chip
  Other cache: None
  Memory: 768 GB (24 x 32 GB 2Rx4 PC4-2666V-R)
  Storage: 1 x 240 GB SATA SSD
  Other: None

SOFTWARE
  OS: Red Hat Enterprise Linux Server release 7.3 (x86_64), Kernel 3.10.0-514.el7.x86_64
  Compiler: C/C++: Version 18.0.0.128 of Intel C/C++ Compiler; Fortran: Version 18.0.0.128 of Intel Fortran Compiler
  Parallel: Yes
  Firmware: Version 0601 released Oct-2017
  File System: xfs
  System State: Run level 3 (multi-user)
  Base Pointers: 64-bit
  Peak Pointers: 32/64-bit
  Other: jemalloc memory allocator library V5.0.1

Some Cint2017 Results

Processor                      Clock (GHz)  Total Chips  Total Cores  Total Threads  Cint2017 Base
Intel Xeon E-2278G (Lenovo)    3.4          1            8            16             13.2
Intel Xeon E-2278G (Asus)      3.4          1            8            8              12.4
Intel Xeon E-2278G (Fujitsu)   3.4          1            8            16             12.1
Intel Xeon Gold 6250 (Lenovo)  3.9          2            16           32             12.6
Intel Xeon Gold 6256 (Lenovo)  3.6          2            24           48             12.6
i9-9900K                       3.6          1            6            6              11.2
Intel Core i7-9700K            3.6          1            8            8              10.6


Some Comments on Cint2017 Results

Auto parallel
  High-level Cint code not explicitly threaded for parallel processing
  Auto-parallel compiler creates parallel threads using heuristics
  Provides limited speedup (or even degradation)
Intel Xeon E-2278G with 3.4 GHz clock
  Fastest CPU in Cint2017 tests
  16 threads slightly faster than 8 threads
  Communication between more threads can slow processing
  16 threads from different manufacturer slightly slower
Intel Xeon Gold 6256 with 3.6 GHz clock
  Cint with 48 threads slightly slower than E-2278G
  With 3.6 GHz clock, expect 3.6 GHz / 3.4 GHz = 1.06 faster
  48 threads slower than 16 threads

Benchmarking a Processor Design

Specify Instruction Set Architecture (ISA)
  Specifies machine language for proposed CPU
  Provides human-readable assembly language
Determine CPI_i for each instruction group i
  Count clock cycles required to implement each instruction in ISA
Write compilers for proposed machine language
  C, C++, Fortran
Compile benchmark programs to machine language
  Programs from SPEC CINT and CFP
Analyze compiler output (executable programs)
  Sort machine instructions into groups
  Calculate relative instruction count IC_i / IC for each group
Calculate average CPI and overall run time T
Compare run time with reference machine

CISC Creates Anti-CISC Revolution

1974 – 1977
  Data General introduces Eclipse 32-bit CISC minicomputer
  Digital (DEC) introduces VAX 32-bit CISC minicomputer
  First serious inexpensive competition to mainframe computers
1977 – 1990
  Serious computers became available to small organizations
  UNIX developed as minicomputer operating system
  TCP/IP developed to support networks of minicomputers
  Computer Science emerged as separate academic discipline
  Students needed topics for projects, theses, dissertations
1980 – 1990
  Research results on minicomputer performance
  CISC uses machine resources inefficiently
  Most machine instructions are rarely used in programs
  CISC machines run slowly to support unnecessary features

RISC "Philosophy"

Technological developments from 1975 to 1990
  Price of RAM drops from $5000/MByte (1975) to $5/MByte (1990)
  Compilers become powerful and efficient with extensive optimization
  Portable code made practical by minicomputer, Unix, C, and TCP/IP
Principal research results on CISC performance
  ~90% of run time devoted to ~10% of instruction set
  ~90% of instructions in ISA rarely used
Reduced Instruction Set Computer (RISC)
  Apply Amdahl's "Law" to the CISC ISA
  Speed up operations accounting for most of run time
  Ignore impairments to other instructions
  RISC ISA — only most important CISC instructions
  Other CISC instructions = multiple RISC instructions
  RISC implementation executes its ISA in fast dedicated hardware


Instruction Types

Representative instruction distribution
  Five programs from SPECint92 benchmark suite
  Compiled for x86 instruction set (ISA for Intel 386/486/Pentium)

Instruction          Relative Proportion of Total Run Time
Load                 22%
Conditional branch   20%
Compare              16%
Store                12%
Add                  8%
And                  6%
Sub                  5%
Move reg-reg         4%
Call                 1%
Return               1%
Other                5%
Total                100%

First 10 instructions account for 95% of run time
Amdahl's "Law": fast implementation of the 95% will not seriously degrade performance
Must include unconditional branch for completeness

RISC Microprocessors

Simpler ISA
  Small set of uniform-length machine instructions
Simpler hardware
  No microcode — standard instruction implementation
  No central system bus
  CPU processes several instructions at once
Lower CPI_i + higher clock speed
  Instruction completes on (almost) every clock cycle
All processors today use RISC technology
  Pure RISC (PowerPC, Sparc, MIPS, ARM, ...)
  RISC technology for CISC language (Pentium II – 4, Centrino, Core)
  Explicitly parallel RISC (Intel Itanium, IBM mainframes)

Ref: Hennessy / Patterson, figure 2.11
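A small sketch applying Amdahl's "Law" to the distribution above: suppose the top-10 instructions (95% of run time) are sped up in dedicated hardware while the remaining 5% are not. The 10x local speedup is an assumed value, chosen only to illustrate the calculation.

# Amdahl's law applied to the SPECint92 distribution above.
# The 10x speedup for the top-10 instructions is an assumption.

F_top10, S_top10 = 0.95, 10

overall = 1.0 / ((1.0 - F_top10) + F_top10 / S_top10)
limit = 1.0 / (1.0 - F_top10)          # ceiling if the 95% took zero time
print(round(overall, 2), round(limit, 1))   # 6.9  20.0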

Typical RISC ISA

Data types
  32-bit / 64-bit integer and floating point
  Flat memory model with 32-bit / 64-bit address
  Address mode: disp(Rn) ~ Mem[Regs[Rn] + disp]
Register-register operation model
  32 – 128 integer registers + 32 – 128 FP registers
  OS (kernel mode) registers
  Result flags
  Read-only (value = 0) and write-only (null) registers
Instruction types
  Load, store, move register-register
  Integer add, sub, mult, div, shift, compare
  Boolean and, or, xor
  Floating point add, sub, mult, div, sqrt, compare
  Jump, jump register, jump and link, conditional branch

Typical Instruction Encoding

Instruction types for Alpha 64-bit RISC processor

  Bits:          31–26    25–21  20–16  15–5      4–0
  PALcode type:  Opcode   Number
  Branch type:   Opcode   Ra     Disp
  Memory type:   Opcode   Ra     Rb     Disp
  Operate type:  Opcode   Ra     Rb     Function  Rc

Opcode (6 bits) identifies operation to CPU
Ra, Rb (5 bits) identify register names (R0 to R31)
PALcode (Privileged Architecture Library) — hardware support for OS
Branch — test Ra; if true: Ra ← PC, PC ← PC + Disp
Memory — move between Ra and Mem[Regs[Rb] + Disp]
Operate R/R — Rc ← Ra function Rb (register name)
Operate R/I — Rc ← Ra function Imm (in Rb and 3 bits of function)


Simple RISC Physical Implementation

Stage 1: Instruction Fetch → Stage 2: Instruction Decode → Stage 3: Execute → Stage 4: Data Memory Access / Write Back
[Figure: four-stage datapath; instruction memory feeds the fetch stage, registers are read in decode, the ALU executes in stage 3, and data memory access and register write-back occur in stage 4]

Early PowerPC implementation
  No system bus — instructions proceed from left to right (assembly line)
  Separate cache memory for instructions and data
Simple repetitive operations
  1. Fetch uniform-length instructions
  2. Instruction decode — read source operands from registers
  3. Execute ALU instructions and calculate addresses
  4. Access memory and/or write destination operands (commit to state)
One CC per stage per instruction → 4 clock cycles per instruction

Pipelining — The RISC Advantage

Instruction Level Parallelism (ILP)
  Hardware starts second instruction before first completes
  Typically 4 instructions in various stages of execution at one time

  CC   Stage 1  Stage 2  Stage 3  Stage 4
  1    I1
  2    I2       I1
  3    I3       I2       I1
  4    I4       I3       I2       I1
  5    I5       I4       I3       I2
  6    I6       I5       I4       I3
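A small sketch that reproduces the pipeline-fill table above by computing which instruction occupies each of the four stages on every clock cycle; the stage labels follow the figure, everything else is illustrative.

# Print which instruction sits in each of the 4 pipeline stages per clock
# cycle, reproducing the fill pattern shown above.

STAGES = ["Stage 1", "Stage 2", "Stage 3", "Stage 4"]
instructions = [f"I{i}" for i in range(1, 7)]

for cc in range(1, len(instructions) + 1):
    row = []
    for s in range(len(STAGES)):
        idx = cc - 1 - s                 # instruction issued s cycles earlier
        row.append(instructions[idx] if 0 <= idx < len(instructions) else "--")
    print(f"CC{cc}: " + "  ".join(row))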

Instruction-Oriented View

             Clock cycles
Instruction  1    2    3    4    5    6
I1           IF   ID   EX   W
I2                IF   ID   EX   W
I3                     IF   ID   EX   W
I4                          IF   ID   EX
I5                               IF   ID
I6                                    IF

IF = Instruction Fetch, ID = Instruction Decode, EX = Execute, W = Write
Instruction executes in 4 clock cycles

N_ideal = IC + (pipeline length − 1)

CPI_ideal = N_ideal / IC = 1 + (pipeline length − 1) / IC → 1   (IC large)

T = CPI_ideal × IC × τ ≈ IC × τ = IC / clock rate   (IC large)

Pipeline Imbalance

Clock cycle time determined by LOAD instruction — longest execution time
  Stage times: fetch, decode, execute, memory access + register write-back
  τ = τ_memory access + τ_register write-back > τ_minimum
Most instructions do not access data memory in stage 4
  Only LOAD and STORE access data memory
  Only LOAD performs both memory access and register write-back
  Most operations can complete in time τ_minimum


Superpipelining

Stage 1: IF (Instruction Fetch) | Stage 2: ID (Instruction Decode) | Stage 3: EX (Execute) | Stage 4: MEM (Data Memory Access) | Stage 5: WB (Write Back)
[Figure: five-stage datapath with separate instruction memory and data memory; several instructions in various stages of execution]

Divide stage 4 into two stages
  Only load/store do useful work in the MEM stage
Divide clock cycle time (double clock rate)
  τ′ = τ / 2 ≈ τ_minimum

Five instructions in the superpipeline (cycles 1 – 7):
  I1   F  D  E  M  W
  I2      F  D  E  M  W
  I3         F  D  E  M  W
  I4            F  D  E  M
  I5               F  D  E

  S = (CPI × IC × τ) / (CPI′ × IC × τ′) = (CPI × IC × τ) / (CPI′ × IC × τ/2) ≈ 2   when CPI′ ≈ CPI ≈ CPI_ideal ≈ 1

Programs can run twice as fast

Pipeline Hazards

Instruction dependencies
  Result of one instruction is source for later instruction
Hazard condition
  Processor runs uninterrupted but provides incorrect answers
Pipeline hazard
  Several instructions in various stages of execution
  Pipeline uses a resource value before update by earlier instruction
Example
  ADD R1,R2,R3
  SUB R4,R5,R1   ; hazard if SUB reads R1 before ADD writes R1
Hazard types
  Structural hazard — conflict over access to resource
  Data hazard — instruction result not ready when needed
  Control hazard — branch address and condition not ready when needed
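The ADD/SUB dependence above can be detected mechanically by comparing each instruction's source registers against the destinations of instructions still in flight ahead of it. The 3-instruction in-flight window and the (op, dest, sources) tuple format below are assumptions for illustration, not a description of real hardware.

# Detect read-after-write (RAW) hazards: a source register that matches the
# destination of a still-in-flight earlier instruction.

IN_FLIGHT = 3   # assumed distance within which an earlier result is not yet written

program = [
    ("ADD", "R1", ("R2", "R3")),
    ("SUB", "R4", ("R5", "R1")),   # reads R1 one instruction after ADD writes it
    ("AND", "R6", ("R7", "R8")),
]

for i, (op, dest, srcs) in enumerate(program):
    for j in range(max(0, i - IN_FLIGHT), i):
        prev_op, prev_dest, _ = program[j]
        if prev_dest in srcs:
            print(f"RAW hazard: {op} (#{i}) reads {prev_dest} "
                  f"written by {prev_op} (#{j})")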

Dealing with Hazards

Avoid error
  Pause pipeline and wait for resource to be available
  Called WAIT STATE or PIPELINE STALL
  Degrades processor performance
  Adds stall clock cycles (wasted time) to instruction execution

  CPI = (ideal processing clock cycles + stalled clock cycles) / completed instructions
      = (N_ideal + N_stall) / IC
      = CPI_ideal + CPI_stall → 1 + CPI_stall   (IC large)

  Performance degradation = 1 − 1 / (1 + CPI_stall / CPI_ideal) = CPI_stall / (CPI_ideal + CPI_stall)

Eliminate cause of stall
  Improve implementation based on analysis of stalls
  Main activity of hardware architects

Structural Hazards

Conflict over access to resource
Typical structural hazard — unified cache hazard
  Instructions and data in same memory device
  Cannot access data and fetch instruction on same clock cycle
To prevent hazard
  Stall INSTRUCTION FETCH during data MEMORY ACCESS

[Figure: five-stage pipeline (Instruction Fetch, Instruction Decode, Execute, Data Access, Write Back; CC1 – CC5) in which both the instruction-fetch stage and the data-access stage address a single unified cache holding instructions and data]


Stall Implementation for Cache Hazard

        IF   ID   EX   MEM  WB
CC1     I1
CC2     LW   I1
CC3     I2   LW   I1
CC4     I3   I2   LW   I1
CC5     Φ    I3   I2   LW   I1
CC6     I4   Φ    I3   I2   LW
CC7          I4   Φ    I3   I2
CC8               I4   Φ    I3
CC9                    I4   Φ
CC10                        I4

On CC5 the Load Word (LW) instruction blocks Instruction Fetch (IF)
  No instruction is fetched on CC5
  No instruction (NOP) is forwarded to ID on CC6
  NOP = bubble = Φ forwarded to EX on CC7, etc.

Instruction-oriented view:
        CC1  CC2  CC3  CC4  CC5  CC6  CC7  CC8  CC9  CC10
I1      IF   ID   EX   MEM  WB
LW           IF   ID   EX   MEM  WB
I2                IF   ID   EX   MEM  WB
I3                     IF   ID   EX   MEM  WB
I4                          Φ    IF   ID   EX   MEM  WB

Calculating Effect of Cache Hazard on CPI

Assume: loads ~ 25%, stores ~ 15%, other ~ 60% of instructions

CPI_stall = stall cycles / instruction
          = Σ_i (stall cycles / stall) × (stalls / instruction of type i) × (IC_i / IC)
          = (1 stall cycle / stall) × (1 stall / data memory load) × (IC_load / IC)
            + (1 stall cycle / stall) × (1 stall / data memory store) × (IC_store / IC)
          = (1 stall cycle / stall) × (1 stall / data memory access) × (IC_load + IC_store) / IC
          = 0.25 + 0.15 = 0.40 stall cycles / instruction

CPI = CPI_ideal + CPI_stall = 1 + 0.40 = 1.40   (degradation = 0.40 / 1.40 ≈ 29%)
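A one-screen sketch of the same calculation, using the instruction mix from the slide (25% loads, 15% stores) and one stall cycle per data-memory access:

# CPI with the unified-cache structural hazard: one stall cycle per load or store.
# Instruction mix follows the slide (25% loads, 15% stores, 60% other).

mix = {"load": 0.25, "store": 0.15, "other": 0.60}
stall_cycles_per_access = 1
CPI_ideal = 1.0

CPI_stall = stall_cycles_per_access * (mix["load"] + mix["store"])   # 0.40
CPI = CPI_ideal + CPI_stall                                          # 1.40
degradation = CPI_stall / CPI                                        # ~0.29

print(f"{CPI:.2f}", f"{degradation:.0%}")    # 1.40 29%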

Data Hazards

Instruction result not ready when needed
Classification (named for correct order of operations)
  Read After Write (RAW)
    Correct: I2 reads register after I1 writes to it
    Hazard: I2 reads register before I1 writes to it — I2 uses incorrect value
  Write After Write (WAW)
    Correct: I2 writes to register after I1 writes to it
    Hazard: I2 writes to register before I1 writes to it — incorrect value stays in register
  Write After Read (WAR)
    Correct: I2 writes to register after I1 reads it
    Hazard: I2 writes to register before I1 reads it — I1 uses incorrect value
  Read After Read (RAR)
    No hazard — reads do not affect registers
To prevent hazard — stall pipeline until result is ready

Control Hazards

Branch outcome affects program counter (PC)
  Taken: branch condition is true and PC ← PC + Disp
  Not taken: branch condition is false and PC not changed
  Target: result of calculation PC ← PC + Disp
Branch hazard
  Outcome not known until branch execution finishes
  Pipeline automatically fetches (default) instruction following branch
  Default instruction not correct if branch taken
To prevent hazard
  Flush default instructions
  Stall pipeline until branch condition and branch target are ready
Delay in processing branch instructions is called the branch penalty


Exception Hazards

Exception
  Hardware or software condition requiring special service routine
Interrupt
  Service response to external hardware event
  Usually asynchronous
  Not triggered by program instructions
  Does not affect validity of running instructions
Trap
  Service response to software condition in running program
  Usually synchronous
  Triggered by program instructions
  May stall or affect validity of running instructions
Hazard
  Multiple instructions in various stages of execution in pipeline
  How/where/when to interrupt pipeline
  Where is return-point?

Precise Exception

Return-point
  Follows atomic operation
  Previous operations commit all results to state
  No following operations commit any results to state
[Figure: instruction sequence I1 – I8; I4 commits all state, I5 commits no state; the return-point lies after I4, where the interrupt service routine runs before execution resumes at I5]
Precise exception
  Exception with well-defined return-point
  Service exception following atomic operation
  Restart execution at return point without error

Exception Hazards in 5-Stage Pipeline

Exceptions specific to each stage
  Memory access exception in IF or MEM
  Instruction exception in ID
  Arithmetic exception in EX
5 instructions in various stages of execution
  Where is return-point?
  How to handle subsequent partially executed instructions?

        CC1  CC2  CC3  CC4  CC5  CC6  CC7  CC8  CC9
I1      IF   ID   EX   MEM  WB
I2           IF   ID   EX   MEM: error → return point = I2
I3                IF   ID   EX   Φ
I4                     IF   ID   Φ
I5                          IF   Φ

Berkeley Solution

Attach exception status field and source PC to instruction in IF
Instruction raises exception
  Mark status field with exception
  Continue pipeline until prior instruction completes (reaches WB)
  RETURN-POINT ← PC of instruction that raises exception
  Flush pipeline (mark instructions in IF – MEM as NOP to cancel WB)
  PC ← EXCEPTION SERVICE ROUTINE (ESR)
Return from ESR depends on exception type

[Pipeline diagram, CC1 – CC10: I1 completes atomically (IF ID EX MEM WB); I2 raises the error and becomes the return point; the partially executed instructions behind it are flushed; the ESR then enters the pipeline (IF ID EX MEM WB) and execution restarts at the return point]

Ref: http://www-inst.eecs.berkeley.edu/~cs252/