CS61C : Machine Structures
Total Page:16
File Type:pdf, Size:1020Kb
inst.eecs.berkeley.edu/~cs61c/su05 Review CS61C : Machine Structures • Benchmarks Lecture #29: Intel & Summary • Attempt to predict performance • Updated every few years • Measure everything from simulation of desktop graphics programs to battery life • Megahertz Myth • MHz ≠ performance, it’s just one factor 2005-08-10 Andy Carle CS 61C L29 Intel & Review (1) A Carle, Summer 2005 © UCB CS 61C L29 Intel & Review (2) A Carle, Summer 2005 © UCB MIPS is example of RISC MIPS vs. 80386 • RISC = Reduced Instruction Set Computer • Address: 32-bit • 32-bit • Term coined at Berkeley, ideas pioneered • Page size: 4KB • 4KB by IBM, Berkeley, Stanford • Data aligned • Data unaligned • RISC characteristics: • Destination reg: Left • Right • Load-store architecture •add $rd,$rs1,$rs2 •add %rs1,%rs2,%rd • Fixed-length instructions (typically 32 bits) • Regs: $0, $1, ..., $31 • %r0, %r1, ..., %r7 • Three-address architecture • Reg = 0: $0 • (n.a.) • RISC examples: MIPS, SPARC, • Return address: $31 • (n.a.) IBM/Motorola PowerPC, Compaq Alpha, ARM, SH4, HP-PA, ... CS 61C L29 Intel & Review (3) A Carle, Summer 2005 © UCB CS 61C L29 Intel & Review (4) A Carle, Summer 2005 © UCB MIPS vs. Intel 80x86 MIPS vs. Intel 80x86 • MIPS: “load-store architecture” • MIPS: “Three-address architecture” • Only Load/Store access memory; rest • Arithmetic-logic specify all 3 operands operations register-register; e.g., add $s0,$s1,$s2 # s0=s1+s2 lw $t0, 12($gp) add $s0,$s0,$t0 # s0=s0+Mem[12+gp] • Benefit: fewer instructions ⇒ performance • Benefit: simpler hardware ⇒ easier to pipeline, • x86: “Two-address architecture” higher performance • Only 2 operands, x86: “register-memory architecture” so the destination is also one of the sources • • All operations can have an operand in memory; add $s1,$s0 # s0=s0+s1 other operand is a register; e.g., • Often true in C statements: c += b; add 12(%gp),%s0 # s0=s0+Mem[12+gp] • Benefit: smaller instructions ⇒ smaller code • Benefit: fewer instructions ⇒ smaller code CS 61C L29 Intel & Review (5) A Carle, Summer 2005 © UCB CS 61C L29 Intel & Review (6) A Carle, Summer 2005 © UCB MIPS vs. Intel 80x86 Unusual features of 80x86 • MIPS: “fixed-length instructions” • 8 32-bit Registers have names; 16-bit 8086 names with “e” prefix: • All instructions same size, e.g., 4 bytes •eax, ecx, edx, ebx, esp, ebp, esi, edi • simple hardware ⇒ performance • 80x86 word is 16 bits, double word is 32 bits • branches can be multiples of 4 bytes • PC is called eip (instruction pointer) • x86: “variable-length instructions” • (load effective address) • Instructions are multiple of bytes: 1 to 17; leal • Calculate address like a load, but load address ⇒ small code size (30% smaller?) into register, not data • More Recent Performance Benefit: better instruction cache hit rates • Load 32-bit address: • Instructions can include 8- or 32-bit immediates leal -4000000(%ebp),%esi # esi = ebp - 4000000 CS 61C L29 Intel & Review (7) A Carle, Summer 2005 © UCB CS 61C L29 Intel & Review (8) A Carle, Summer 2005 © UCB Instructions:MIPS vs. 80x86 80386 addressing (ALU instructions too) • base reg + offset (like MIPS) • addu, addiu • addl •movl -8000044(%ebp), %eax • subu • subl • base reg + index reg (2 regs form addr.) • and,or, xor • andl, orl, xorl •movl (%eax,%ebx),%edi • sll, srl, sra • sall, shrl, sarl # edi = Mem[ebx + eax] • lw • movl mem, reg • scaled reg + index (shift one reg by 1,2) • sw • movl reg, mem •movl(%eax,%edx,4),%ebx • mov • movl reg, reg # ebx = Mem[edx*4 + eax] • li • movl imm, reg • scaled reg + index + offset • lui • n.a. •movl 12(%eax,%edx,4),%ebx # ebx = Mem[edx*4 + eax + 12] CS 61C L29 Intel & Review (9) A Carle, Summer 2005 © UCB CS 61C L29 Intel & Review (10) A Carle, Summer 2005 © UCB Branches in 80x86 Branch: MIPS vs. 80x86 • Rather than compare registers, x86 • beq • (cmpl;) je uses special 1-bit registers called if previous operation “condition codes” that are set as a set condition code, then side-effect of ALU operations cmpl unnecessary • S - Sign Bit • bne • (cmpl;) jne • Z - Zero (result is all 0) • slt; beq • (cmpl;) jlt • C - Carry Out • slt; bne • (cmpl;) jge • P - Parity: set to 1 if even number of ones in rightmost 8 bits of operation • jal • call • • • Conditional Branch instructions then jr $31 ret use condition flags for all comparisons: <, <=, >, >=, ==, != CS 61C L29 Intel & Review (11) A Carle, Summer 2005 © UCB CS 61C L29 Intel & Review (12) A Carle, Summer 2005 © UCB While in C/Assembly: 80x86 Unusual features of 80x86 C while (save[i]==k) • Memory Stack is part of instruction set i = i + j; •call places return address onto stack, (i,j,k: %edx,%esi,%ebx) increments esp (Mem[esp]=eip+6; esp+=4) leal -400(%ebp),%eax •push places value onto stack, increments esp .Loop: cmpl %ebx,(%eax,%edx,4) •pop gets value from stack, decrements esp x jne .Exit • incl, decl (increment, decrement) 8 addl %esi,%edx 6 incl %edx # edx = edx + 1 j .Loop • Benefit: smaller instructions ⇒ smaller code .Exit: Note: cmpl replaces sll, add, lw in loop CS 61C L29 Intel & Review (13) A Carle, Summer 2005 © UCB CS 61C L29 Intel & Review (14) A Carle, Summer 2005 © UCB Outline Intel Internals • Intro to x86 • Hardware below instruction set called • Microarchitecture "microarchitecture" • Pentium Pro, Pentium II, Pentium III all based on same microarchitecture (1994) • Improved clock rate, increased cache size • Pentium 4 has new microarchitecture CS 61C L29 Intel & Review (15) A Carle, Summer 2005 © UCB CS 61C L29 Intel & Review (16) A Carle, Summer 2005 © UCB Pentium, Pentium Pro, Pentium 4 Pipeline Dynamic Scheduling in Pentium Pro, II, III • PPro doesn’t pipeline 80x86 instructions • PPro decode unit translates the Intel instructions into 72-bit "micro-operations" (~ MIPS instructions) • Takes 1 clock cycle to determine length of 80x86 instructions + 2 more to create the micro-operations • Most instructions translate to 1 to 4 micro-operations • Pentium (P5) = 5 stages Pentium Pro, II, III (P6) = 10 stages •10 stage pipeline for micro-operations “PentiumCS 61C L29 4 Intel (Partiall & Review (17)y) Previewed,” Microprocessor Report, 8/28/00A Carle, Summer 2005 © UCB CS 61C L29 Intel & Review (18) A Carle, Summer 2005 © UCB Dynamic Scheduling Hardware support for reordering Consider: • Out-of-Order execution (OOO): allow an instruction to execute before prior lw $t0 0($t0) # might miss in mem instructions have executed. add $s1 $s1 $s1 # will be stalled in • Speculation across branches add $s2 $s1 $s1 # pipe waiting for lw • When instruction no longer speculative, write results (instruction commit) Solutions: • Fetch/issue in-order, execute OOO, * Compiler (STATIC) reordering (loops?) commit in order * Hardware (DYNAMIC) reordering • Watch out for hazards! CS 61C L29 Intel & Review (19) A Carle, Summer 2005 © UCB CS 61C L29 Intel & Review (20) A Carle, Summer 2005 © UCB Hardware for OOO execution Dynamic Scheduling in Pentium Pro Max. instructions issued/clock 3 • Need HW buffer for results of uncommitted Max. instr. complete exec./clock 5 instructions: reorder buffer Reorder Max. instr. commited/clock 3 Buffer • Reorder buffer can be IF Instructions in reorder buffer 40 operand source Issue Regs • Once operand commits, 2 integer functional units (FU), 1 floating result is found in point FU, 1 branch FU, 1 Load FU, 1 Store register Res Stations Res Stations FU Adder Adder • Discard results on mispredicted branches or on exceptions CS 61C L29 Intel & Review (21) A Carle, Summer 2005 © UCB CS 61C L29 Intel & Review (22) A Carle, Summer 2005 © UCB Pentium, Pentium Pro, Pentium 4 Pipeline Pentium 4 • Still translate from 80x86 to micro-ops • P4 has better branch predictor, more FUs • Clock rates: • Pentium III 1 GHz v. Pentium IV 1.5 GHz • 10 stage pipeline vs. 20 stage pipeline • Faster memory bus: 400 MHz v. 133 MHz • Pentium (P5) = 5 stages Pentium Pro, II, III (P6) = 10 stages Pentium 4 (NetBurst) = 20 stages “PentiumCS 61C L29 4 Intel (Partiall & Review (23)y) Previewed,” Microprocessor Report, 8/28/00A Carle, Summer 2005 © UCB CS 61C L29 Intel & Review (24) A Carle, Summer 2005 © UCB Pentium 4 features Block Diagram of Pentium 4 Microarchitecture • Multimedia instructions 128 bits wide vs. 64 bits wide => 144 new instructions • When used by programs?? • Instruction Cache holds micro- operations vs. 80x86 instructions • no decode stages of 80x86 on cache hit • called “trace cache” (TC) • BTB = Branch Target Buffer (branch predictor) • I-TLB = Instruction TLB, Trace Cache = Instruction cache • RF = Register File; AGU = Address Generation Unit • "Double pumped ALU" means ALU clock rate 2X => 2X ALU F.U.s CS 61C L29 Intel & Review (25) A Carle, Summer 2005 © UCB CS 61C L29 Intel & Review (26) A Carle, Summer 2005 © UCB Pentium, Pentium Pro, Pentium 4 Pipeline CS61C: So what's in it for me? (1st lecture) Learn some of the big ideas in CS & engineering: • 5 Classic components of a Computer • Principle of abstraction, systems built as layers • Data can be anything (integers, floating point, characters): a program determines what it is • Stored program concept: instructions just data • Compilation v. interpretation thru system layers • Principle of Locality, exploited via a memory hierarchy (cache) • Pentium (P5) = 5 stages Pentium Pro, II, III (P6) = 10 stages • Greater performance by exploiting parallelism Pentium 4 (NetBurst) = 20 stages (pipelining) • Principles/Pitfalls of Performance Measurement “Pentium 4 (Partially) Previewed,” Microprocessor Report, 8/28/00 CS 61C L29 Intel & Review (27) A Carle, Summer 2005 © UCB CS 61C L29