CSE 586 Computer Architecture
Jean-Loup Baer
http://www.cs.washington.edu/education/courses/586/00sp

Introduction -- CSE 586

• Architecture of modern computer systems
– Processors: pipelined, exhibiting instruction level parallelism, and allowing speculation
– Memory hierarchy: multi-level hierarchy and its management, including hardware and software assists for enhanced performance; interaction of hardware/software for memory systems
– Input/output: buses; disks – performance and reliability (RAIDs)
– Multiprocessors: SMP’s and cache coherence

CSE 586 Spring 00

Course mechanics

• Please see Web page:
http://www.cs.washington.edu/education/courses/586/00sp/
• Instructors:
– Jean-Loup Baer: [email protected]
– Vivek Sahasranaman: [email protected]
• Textbook and reading material:
– John Hennessy and David Patterson: Computer Architecture: A Quantitative Approach, 2nd Edition, Morgan Kaufmann, 1996
– Papers drawn from the literature: the list will be updated every week in the “outline” link of the Web page

Course mechanics (cont’d)

• Prerequisites
– Knowledge of computer organization as taught in a UG class (cf., e.g., CSE 378 at UW)
– Material on pipelining and memory hierarchy as found in Hennessy and Patterson, Computer Organization and Design, 2nd Edition, Morgan Kaufmann, Chapters 6 and 7
– Your first assignment (due next week, 4/6/00; see Web page) will test that knowledge


Course mechanics (cont’d)

• Assignments (almost one a week)
– Paper and pencil (from the book or closely related)
– Programming (simulation). We will provide skeletons in C but you can use any language you want
– Paper review (one at the end of the quarter)
• Exam
– One take-home final
• Grading
– Assignments 75%
– Final 25%

Class list and e-mail

• Please subscribe right away to the e-mail list for this class.
• See the Web page for instructions (use Majordomo)

Course outline (will most certainly be modified; look at the Web page periodically)

• Weeks 1 and 2:
– Performance of computer systems
– ISA. RISC and CISC. Current trends (EPIC) and extensions (MMX)
– Review of pipelining
• Weeks 3 and 4
– Simple branch predictors
– Instruction level parallelism (scoreboard, …)
– Multiple issue: Superscalar and out-of-order execution
– Predication

Course outline (cont’d)

• Weeks 5 and 6
– Caches and performance enhancements
– TLB’s
• Week 7
– Hardware/software interactions at the virtual memory level
• Week 8
– Buses
– Disks
• Weeks 9 and 10
– Multiprocessors. SMP. Cache coherence
– Synchronization


Technological improvements

• CPU:
– Annual rate of speed improvement is 35% before 1985 and 60% since 1985
– Slightly faster than the increase in transistors
• Memory:
– Annual rate of speed improvement is < 10%
– Density quadruples in 3 years
• I/O:
– Access time has improved by 30% in 10 years
– Density improves by 50% every year

CPU-Memory Performance Gap

[Figure: log-scale plot, 1989 to 1999. CPU performance (o) improves 100x over 10 years; memory system speed (x) improves only 10x over 8 years, although memory densities have increased 100x over the same period]


Improvements in Processor Speed

• Technology
– Faster clock (commercially 700 MHz available; prototype 1.5 GHz)
• More transistors = More functionality
– Instruction Level Parallelism (ILP)
– Multiple functional units, superscalar or out-of-order execution
– 10 million transistors, but Moore’s law still applies
• Extensive pipelining
– From a single 5-stage pipe to multiple pipes as deep as 20 stages
• Sophisticated instruction fetch units
– Branch prediction; …; trace caches
• On-chip Memory
– One or two levels of caches. TLB’s for instruction and data

Intel x86 Progression

Chip                Date   Transistors  Initial MIPS
4004                11/71  2,300        0.06
8008                4/72   3,500        0.06
8080                4/74   6,000        0.6
8086                6/78   29,000       0.3
8088                6/79   29,000       0.3
286                 2/82   134,000      0.9
386                 10/85  275,000      5
486                 4/89   1.2 million  20
Pentium             3/93   3.1 million  100
Pentium Pro         3/95   5.5 million  300
Pentium III (Xeon)  2/99   10 million?  500?

Speed improvement: expose the ISA to the compiler/user

• Instruction level
– Scheduling to remove hazards, reduce load and branch delays
• Control flow prediction
– Static prediction and/or predication; code placement in cache
• Loop unrolling
– Reduces branching but increases register pressure
• Memory hierarchy level
– Instructions to manage the data cache (prefetch, purge)
• Etc…

Performance evaluation basics

• Performance inversely proportional to execution time
• Elapsed time includes: user + system; I/O; memory accesses; CPU per se
• CPU execution time (for a given program): 3 factors
– Number of instructions executed
– Clock cycle time (or rate)
– CPI: number of cycles per instruction (or its inverse IPC)

CPU execution time = Instruction count * CPI * clock cycle time


Components of the CPI

• CPI for single instruction issue with ideal pipeline = 1
• Previous formula can be expanded to take into account classes of instructions
– For example in RISC machines: branches, f.p., load-store
– For example in CISC machines: string instructions

CPI = Σ CPIi * fi, where fi is the frequency of instructions in class i

• Will talk about “contributions to the CPI” from, e.g.:
– memory hierarchy
– branch (misprediction)
– hazards etc.

Benchmarking

• Measure a real workload for your installation
• Weight programs according to frequency of execution
• If weights are not available, normalize so each program takes equal time on a given machine


Comparing and summarizing benchmark performance

• For execution times, use (weighted) arithmetic mean:

Weighted Ex. Time = Σ Weighti * Timei

• For rates, use (weighted) harmonic mean:

Weighted Rate = 1 / Σ (Weighti / Ratei)

• See paper by Jim Smith (link in outline):
“Simply put, we consider one computer to be faster than another if it executes the same set of programs in less time”

Available benchmark suites

• SPEC 95 (integer and floating-point), now SPEC CPU2000
– http://www.spec.org/
– Too many SPEC-specific compiler optimizations
• Other “specific” SPEC: SPEC Web, SPEC JVM etc.
• Perfect Club and NASA benchmarks
– Mostly for scientific and parallelizable programs
• TPC-A, TPC-B, TPC-C, TPC-D benchmarks
– Transaction processing (response time); decision support (data mining)
• Desktop applications
– Recent UW paper (http://www.cs.washington.edu/homes/baer/isca98.ps)

Normalized execution times

• Compute an aggregate performance measure before normalizing
• Average normalized (wrt another machine) execution time: either with the arithmetic mean or, as here, with the geometric mean:

Geometric mean = (∏i=1..n execution time ratioi)^(1/n)

Geometric mean(Xi) / Geometric mean(Yi) = Geometric mean(Xi / Yi)

• Geometric mean does not measure execution time

Computer design: Make the common case fast

• Amdahl’s law (speedup)

Speedup = (performance with enhancement) / (performance base case)

or equivalently

Speedup = (exec. time base case) / (exec. time with enhancement)

• Application to parallel processing
– s = fraction of the program that is sequential
– Speedup S is at most 1/s
– That is, if 20% of your program is sequential, the maximum speedup with an infinite number of processors is at most 5


Instruction Set Architecture

• Part of the interface between hardware and software that is visible to the programmer
• Instruction Set
– RISC, CISC, VLIW-EPIC, but also other issues such as how branches are handled, multimedia/graphics extensions etc.
• Addressing modes (including how the PC is used)
• Registers
– Integer, floating-point, but also flat vs. windows, and special-purpose registers, e.g. for multiply/divide or for condition codes or for predication

What is not part of the ISA (but interesting!)

• Caches, TLB’s etc.
• Branch prediction mechanisms
• Register renaming
• Number of instructions issued per cycle
• Number of functional units, pipeline structure
• etc ...


CPU-centric operations (arith-logical)

• Registers-only (load-store architectures)
– Synonym with RISC? In general 3 operands (2 sources, 1 result)
– Fixed-size instructions. Few formats.
• Registers + memory (CISC)
– Vary between 2 and 3 operands (depends on instruction formats). At the extreme can have n operands (Vax)
– Variable-size instructions (expanding opcodes and operand specifiers)
• Stack oriented
– Historical? But what about JVM byte codes
• Memory only (historical?)
– Used for “long-string” instructions

RISC vs. CISC (highly abstracted)

            Pros               Cons
Load-Store  Easy to encode     Low code density
            “Same CPI”
Reg + mem   High code density  Diff. instr. formats
                               Diff. exec. time

Addressing modes (for either load-store or cpu-centric ops)

• Basic:
– Register, Immediate, Indexed or displacement (subsumes register indirect), Absolute
• Between RISC and CISC
– Basic, Base + Index (e.g., in IBM Power PC), Scale-index (index multiplied by the size of the element being addressed)
• Very CISC-like
– Memory indirect, Auto-increment and decrement in conjunction with all others (increased complexity in pipelined units)

Flow of control – Conditional branches

• About 30% of executed instructions
– Generally PC-relative
• Compare and branch
– Only one instr. but a “heavy one”; often limited to equality/inequality or comparisons with 0
• Condition register
– Simple but uses a register and uses two instructions
• Condition codes
– CC is extra state (bad for pipelining) but can be set “for free” and allow for branch executing in 0 time


Unconditional transfers

• Jumps (long branches and indirect jumps)
• Procedure call and return
– Who saves what (return address, registers, parameters etc.), where (stack, register) and when (caller, callee)
– Combination of hardware/software protocols

Registers (visible to the ISA)

• Integer (generally 32 but see x86)
• Floating-point (generally 32 but see x86)
• Some GPR are special
– stack pointer, frame pointer, even the PC (VAX)
• Some registers have special functions
– control registers, segment registers (x86)
• Flat registers vs. windows (Sparc) or hierarchy (Cray)


A sample of less conventional features

• Decimal operations (HP-PA; handled by F-P in x86)
• String operations (x86, IBM mainframes)
• Lack of byte instructions (Alpha)
• Synchronization (to be seen at end of quarter)
– Atomic swap, load linked / store cond., “fence”
• Predicated execution (conditional move)
• Cache hints (prefetch, flush)
• Interaction with the operating system (PALcode)
• TLB instructions (TLB miss handled by software in MIPS)

Extensions to basic ISA

• Multimedia/Graphics extensions
– MMX (Intel) for some SIMD (single-instruction-multiple-data, i.e., vector-like) instructions using f-p registers and “slicing” them in 8-bit units
– Sun SPARC Visual Instruction Set for on-chip graphics functional units
– Full-fledged multimedia processors (e.g., Equator Map 1000 with hundreds of SIMD instructions)
• Vector processors
– The “old” Cray-like supercomputers
– Might make a come-back for …

From 32b to 64b ISA’s

• The most costly design error: addressing space too small
– 16-bit PDP-11 led to the 32-bit Vax
– 32 bits can only address 4 Gbytes, so move to 64 bits
• Start from scratch: DEC Alpha
– Still more or less tailored after the MIPS ISA
• Start from scratch: Intel IA-64 (Merced)
• With backward compatibility
– MIPS III (implemented as MIPS R4000) and MIPS IV (R10000)
– HP PA-8000
– Sun Sparc V8 and V9 (UltraSparc)
– IBM/Motorola Power PC 620

Instruction Execution Cycle

• Loop forever:
– Fetch next instruction and increment PC
– Decode
– Read operands
– Execute or compute memory address or compute branch address
– Store result or access memory or modify PC
• But this logical decomposition does not correspond well to a break-up in steps of the same complexity


Multiple cycle implementation (not pipelined)

[Figure: multiple-cycle datapath; PC and instruction memory feeding IR and NPC, register file outputs A and B, sign-extend unit, ALU with ALUout, target, and zero outputs, data memory with MD register; stages IF, ID/RR, EXE, Mem, WB]

• Instruction fetch and increment PC
• Decode and read registers; branch address computation
• ALU use for either arithmetic or memory address computation; branch condition and next PC determined
• Memory access
• Store result in register
• With this decomposition, instructions will take either
– 3 cycles (branch)
– 4 cycles (everything else except branch and load)
– 5 cycles (load)


Multiple cycle implementation (RTL level)

• For example (using the MIPS R3000 ISA):

1. Instruction fetch and increment PC
   IR <- Mem[PC]
   NPC <- PC + 4
2. Instruction decode and register read
   A <- Reg[IR[25:21]] (1st input to ALU)
   B <- Reg[IR[20:16]] (2nd input to ALU)
   target <- NPC + sign-extend(IR[15:0])*4 (in case the instruction is a branch)

Multiple cycle impl. (cont’d)

3. ALU execution (Part I)
   ALUoutput <- A op B (if arith-logic)
   ALUoutput <- A + sign-extend(IR[15:0]) for immediate or memory address computation
   If (A = B) then PC <- target else PC <- NPC (for branches; of course can be other conditions)
4. Memory access or ALU execution (Part II)
   Memory-data <- Memory[ALUoutput] (Load)
   Memory[ALUoutput] <- B (Store)
   Reg[IR[15:11]] <- ALUoutput (if arith-logic)
5. Write-back stage
   Reg[IR[20:16]] <- Memory-data

Tracing an arit. Instr.

[Figure: the multiple-cycle datapath with the path used by an arithmetic instruction highlighted (IF, ID/RR, EXE, WB)]

Tracing a Load Instr.

[Figure: the multiple-cycle datapath with the path used by a load instruction highlighted (IF, ID/RR, EXE, Mem, WB)]

Tracing a branch Instr.

[Figure: the multiple-cycle datapath with the path used by a branch instruction highlighted (IF, ID/RR, EXE)]

Pipelining

• One instruction/result every cycle (ideal)
– Not in practice because of hazards
• Increase throughput
– Throughput = number of results/second
• Improve speed-up
– In the ideal case, if n stages, the speed-up will be close to n. Can’t make n too large: load balancing between stages & hazards
• Might slightly increase the latency of individual instructions (pipeline overhead)


Basic pipeline implementation

[Figure: pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB separating the IF, ID/RR, EXE, Mem, and WB stages; the PC and Rd fields, data, and control travel through these registers]

• Five stages: IF, ID, EXE, MEM, WB
• What are the resources needed and where
– ALU’s, Registers, etc.
• What info. is to be passed between stages
– Requires pipeline registers between stages: IF/ID, ID/EXE, EXE/MEM and MEM/WB
– What is stored in these pipeline registers?
• Design of the control unit

[Figure: the pipelined datapath with five instructions in progress, one in each stage]

Hazards

• Structural hazards
– Resource conflict (mostly in multiple issue machines; also for resources which are used for more than one cycle, see later)
• Data dependencies
– Most common RAW, but also WAR and WAW in OOO execution
• Control hazards
– Branches and other flow of control disruptions
• Consequence: stalls in the pipeline
– Equivalently: insertion of bubbles or of no-ops


Pipeline speed-up

Speedup_ideal = pipeline depth / 1 = pipeline depth

Speedup_hazards = pipeline depth / (1 + CPI contributed by hazards)

Example of structural hazard

• For a single issue machine: common data and instruction memory (unified cache)
– Pipeline stall every load-store instruction (control easy to implement)
• Better solutions
– Separate I-cache and D-cache
– Instruction buffers
– Both + sophisticated instruction fetch unit!
• Will see more cases in multiple issue machines


Data hazards

[Figure: pipeline diagram (IF ID EXE MEM WB) for Add R1,R2,R3 followed by Sub R4,R1,R2; Add R3,R5,R1; Or R6,R1,R2; Add R5,R2,R1. R1 is available at the end of WB but is needed earlier by the next three instructions; the fourth is OK if the register file is written in the 1st part of the cycle and read in the 2nd part]

• Data dependencies between instructions that are in the pipe at the same time.
• For a single pipeline with in-order issue: Read After Write hazard (RAW)

Add R1, R2, R3  #R1 is result register
Sub R4, R1, R2  #conflict with R1
Add R3, R5, R1  #conflict with R1
Or  R6, R1, R2  #conflict with R1
Add R5, R2, R1  #R1 OK now (5 stage pipe)

CSE 586 Spring 00 47 CSE 586 Spring 00 48 IF ID EXE MEM WB R1 available here Add R1, R2, R3 Forwarding ||| | || R 1 needed here

Sub R4,R1,R2 | | | ||| • Result of ALU operation is known at end of EXE stage • Forwarding between: ADD R3,R5,R1 | | |||| – EXE/MEM pipeline register to ALUinput for instructions i and i+1 – MEM/WB pipeline register to ALUinput for instructions i and i+2 • Note that if the same register has to be forwarded, forward the last OR R6,R1,R2 | || ||| one to be written – Forwarding through (write 1st half of cycle, read 2nd OK w/o forwarding half of cycle) Add R5,R1,R2 | | | ||| • Forwarding between load and store (memory copy)


Other data hazards

• Write After Write (WAW). Can happen in
– Pipelines with more than one write stage
– More than one functional unit with different latencies (see later)
• Write After Read (WAR). Very rare
– With VAX-like autoincrement addressing modes

Forwarding cannot solve all conflicts

• At least in our simple MIPS-like pipeline:

Lw  R1, 0(R2)   #Result at end of MEM stage
Sub R4, R1, R2  #conflict with R1
Add R3, R5, R1  #OK with forwarding
Or  R6, R1, R2  #OK with forwarding


[Figure: two pipeline diagrams for LW R1,0(R2) followed by Sub R4,R1,R2; Add R3,R5,R1; Or R6,R1,R2. On the left, R1 is needed by the Sub in EXE before it is available at the end of MEM (“No way!”). On the right, inserting a bubble delays the Sub one cycle so the load result can be forwarded, and the following Add and Or are OK]

Compiler solution: pipeline scheduling

• Try to fill the “load delay” slot – More difficult for multiple issue machines • Increases register pressure • Increases sequence of loads (might be harder on the cache)
