CS61C : Machine Structures

inst.eecs.berkeley.edu/~cs61c/su05 Review CS61C : Machine Structures • Benchmarks Lecture #29: Intel & Summary • Attempt to predict performance • Updated every few years • Measure everything from simulation of desktop graphics programs to battery life • Megahertz Myth • MHz ≠ performance, it’s just one factor 2005-08-10 Andy Carle CS 61C L29 Intel & Review (1) A Carle, Summer 2005 © UCB CS 61C L29 Intel & Review (2) A Carle, Summer 2005 © UCB MIPS is example of RISC MIPS vs. 80386 • RISC = Reduced Instruction Set Computer • Address: 32-bit • 32-bit • Term coined at Berkeley, ideas pioneered • Page size: 4KB • 4KB by IBM, Berkeley, Stanford • Data aligned • Data unaligned • RISC characteristics: • Destination reg: Left • Right • Load-store architecture •add $rd,$rs1,$rs2 •add %rs1,%rs2,%rd • Fixed-length instructions (typically 32 bits) • Regs: $0, $1, ..., $31 • %r0, %r1, ..., %r7 • Three-address architecture • Reg = 0: $0 • (n.a.) • RISC examples: MIPS, SPARC, • Return address: $31 • (n.a.) IBM/Motorola PowerPC, Compaq Alpha, ARM, SH4, HP-PA, ... CS 61C L29 Intel & Review (3) A Carle, Summer 2005 © UCB CS 61C L29 Intel & Review (4) A Carle, Summer 2005 © UCB MIPS vs. Intel 80x86 MIPS vs. Intel 80x86 • MIPS: “load-store architecture” • MIPS: “Three-address architecture” • Only Load/Store access memory; rest • Arithmetic-logic specify all 3 operands operations register-register; e.g., add $s0,$s1,$s2 # s0=s1+s2 lw $t0, 12($gp) add $s0,$s0,$t0 # s0=s0+Mem[12+gp] • Benefit: fewer instructions ⇒ performance • Benefit: simpler hardware ⇒ easier to pipeline, • x86: “Two-address architecture” higher performance • Only 2 operands, x86: “register-memory architecture” so the destination is also one of the sources • • All operations can have an operand in memory; add $s1,$s0 # s0=s0+s1 other operand is a register; e.g., • Often true in C statements: c += b; add 12(%gp),%s0 # s0=s0+Mem[12+gp] • Benefit: smaller instructions ⇒ smaller code • Benefit: fewer instructions ⇒ smaller code CS 61C L29 Intel & Review (5) A Carle, Summer 2005 © UCB CS 61C L29 Intel & Review (6) A Carle, Summer 2005 © UCB MIPS vs. Intel 80x86 Unusual features of 80x86 • MIPS: “fixed-length instructions” • 8 32-bit Registers have names; 16-bit 8086 names with “e” prefix: • All instructions same size, e.g., 4 bytes •eax, ecx, edx, ebx, esp, ebp, esi, edi • simple hardware ⇒ performance • 80x86 word is 16 bits, double word is 32 bits • branches can be multiples of 4 bytes • PC is called eip (instruction pointer) • x86: “variable-length instructions” • (load effective address) • Instructions are multiple of bytes: 1 to 17; leal • Calculate address like a load, but load address ⇒ small code size (30% smaller?) into register, not data • More Recent Performance Benefit: better instruction cache hit rates • Load 32-bit address: • Instructions can include 8- or 32-bit immediates leal -4000000(%ebp),%esi # esi = ebp - 4000000 CS 61C L29 Intel & Review (7) A Carle, Summer 2005 © UCB CS 61C L29 Intel & Review (8) A Carle, Summer 2005 © UCB Instructions:MIPS vs. 80x86 80386 addressing (ALU instructions too) • base reg + offset (like MIPS) • addu, addiu • addl •movl -8000044(%ebp), %eax • subu • subl • base reg + index reg (2 regs form addr.) • and,or, xor • andl, orl, xorl •movl (%eax,%ebx),%edi • sll, srl, sra • sall, shrl, sarl # edi = Mem[ebx + eax] • lw • movl mem, reg • scaled reg + index (shift one reg by 1,2) • sw • movl reg, mem •movl(%eax,%edx,4),%ebx • mov • movl reg, reg # ebx = Mem[edx*4 + eax] • li • movl imm, reg • scaled reg + index + offset • lui • n.a. •movl 12(%eax,%edx,4),%ebx # ebx = Mem[edx*4 + eax + 12] CS 61C L29 Intel & Review (9) A Carle, Summer 2005 © UCB CS 61C L29 Intel & Review (10) A Carle, Summer 2005 © UCB Branches in 80x86 Branch: MIPS vs. 80x86 • Rather than compare registers, x86 • beq • (cmpl;) je uses special 1-bit registers called if previous operation “condition codes” that are set as a set condition code, then side-effect of ALU operations cmpl unnecessary • S - Sign Bit • bne • (cmpl;) jne • Z - Zero (result is all 0) • slt; beq • (cmpl;) jlt • C - Carry Out • slt; bne • (cmpl;) jge • P - Parity: set to 1 if even number of ones in rightmost 8 bits of operation • jal • call • • • Conditional Branch instructions then jr $31 ret use condition flags for all comparisons: <, <=, >, >=, ==, != CS 61C L29 Intel & Review (11) A Carle, Summer 2005 © UCB CS 61C L29 Intel & Review (12) A Carle, Summer 2005 © UCB While in C/Assembly: 80x86 Unusual features of 80x86 C while (save[i]==k) • Memory Stack is part of instruction set i = i + j; •call places return address onto stack, (i,j,k: %edx,%esi,%ebx) increments esp (Mem[esp]=eip+6; esp+=4) leal -400(%ebp),%eax •push places value onto stack, increments esp .Loop: cmpl %ebx,(%eax,%edx,4) •pop gets value from stack, decrements esp x jne .Exit • incl, decl (increment, decrement) 8 addl %esi,%edx 6 incl %edx # edx = edx + 1 j .Loop • Benefit: smaller instructions ⇒ smaller code .Exit: Note: cmpl replaces sll, add, lw in loop CS 61C L29 Intel & Review (13) A Carle, Summer 2005 © UCB CS 61C L29 Intel & Review (14) A Carle, Summer 2005 © UCB Outline Intel Internals • Intro to x86 • Hardware below instruction set called • Microarchitecture "microarchitecture" • Pentium Pro, Pentium II, Pentium III all based on same microarchitecture (1994) • Improved clock rate, increased cache size • Pentium 4 has new microarchitecture CS 61C L29 Intel & Review (15) A Carle, Summer 2005 © UCB CS 61C L29 Intel & Review (16) A Carle, Summer 2005 © UCB Pentium, Pentium Pro, Pentium 4 Pipeline Dynamic Scheduling in Pentium Pro, II, III • PPro doesn’t pipeline 80x86 instructions • PPro decode unit translates the Intel instructions into 72-bit "micro-operations" (~ MIPS instructions) • Takes 1 clock cycle to determine length of 80x86 instructions + 2 more to create the micro-operations • Most instructions translate to 1 to 4 micro-operations • Pentium (P5) = 5 stages Pentium Pro, II, III (P6) = 10 stages •10 stage pipeline for micro-operations “PentiumCS 61C L29 4 Intel (Partiall & Review (17)y) Previewed,” Microprocessor Report, 8/28/00A Carle, Summer 2005 © UCB CS 61C L29 Intel & Review (18) A Carle, Summer 2005 © UCB Dynamic Scheduling Hardware support for reordering Consider: • Out-of-Order execution (OOO): allow an instruction to execute before prior lw $t0 0($t0) # might miss in mem instructions have executed. add $s1 $s1 $s1 # will be stalled in • Speculation across branches add $s2 $s1 $s1 # pipe waiting for lw • When instruction no longer speculative, write results (instruction commit) Solutions: • Fetch/issue in-order, execute OOO, * Compiler (STATIC) reordering (loops?) commit in order * Hardware (DYNAMIC) reordering • Watch out for hazards! CS 61C L29 Intel & Review (19) A Carle, Summer 2005 © UCB CS 61C L29 Intel & Review (20) A Carle, Summer 2005 © UCB Hardware for OOO execution Dynamic Scheduling in Pentium Pro Max. instructions issued/clock 3 • Need HW buffer for results of uncommitted Max. instr. complete exec./clock 5 instructions: reorder buffer Reorder Max. instr. commited/clock 3 Buffer • Reorder buffer can be IF Instructions in reorder buffer 40 operand source Issue Regs • Once operand commits, 2 integer functional units (FU), 1 floating result is found in point FU, 1 branch FU, 1 Load FU, 1 Store register Res Stations Res Stations FU Adder Adder • Discard results on mispredicted branches or on exceptions CS 61C L29 Intel & Review (21) A Carle, Summer 2005 © UCB CS 61C L29 Intel & Review (22) A Carle, Summer 2005 © UCB Pentium, Pentium Pro, Pentium 4 Pipeline Pentium 4 • Still translate from 80x86 to micro-ops • P4 has better branch predictor, more FUs • Clock rates: • Pentium III 1 GHz v. Pentium IV 1.5 GHz • 10 stage pipeline vs. 20 stage pipeline • Faster memory bus: 400 MHz v. 133 MHz • Pentium (P5) = 5 stages Pentium Pro, II, III (P6) = 10 stages Pentium 4 (NetBurst) = 20 stages “PentiumCS 61C L29 4 Intel (Partiall & Review (23)y) Previewed,” Microprocessor Report, 8/28/00A Carle, Summer 2005 © UCB CS 61C L29 Intel & Review (24) A Carle, Summer 2005 © UCB Pentium 4 features Block Diagram of Pentium 4 Microarchitecture • Multimedia instructions 128 bits wide vs. 64 bits wide => 144 new instructions • When used by programs?? • Instruction Cache holds micro- operations vs. 80x86 instructions • no decode stages of 80x86 on cache hit • called “trace cache” (TC) • BTB = Branch Target Buffer (branch predictor) • I-TLB = Instruction TLB, Trace Cache = Instruction cache • RF = Register File; AGU = Address Generation Unit • "Double pumped ALU" means ALU clock rate 2X => 2X ALU F.U.s CS 61C L29 Intel & Review (25) A Carle, Summer 2005 © UCB CS 61C L29 Intel & Review (26) A Carle, Summer 2005 © UCB Pentium, Pentium Pro, Pentium 4 Pipeline CS61C: So what's in it for me? (1st lecture) Learn some of the big ideas in CS & engineering: • 5 Classic components of a Computer • Principle of abstraction, systems built as layers • Data can be anything (integers, floating point, characters): a program determines what it is • Stored program concept: instructions just data • Compilation v. interpretation thru system layers • Principle of Locality, exploited via a memory hierarchy (cache) • Pentium (P5) = 5 stages Pentium Pro, II, III (P6) = 10 stages • Greater performance by exploiting parallelism Pentium 4 (NetBurst) = 20 stages (pipelining) • Principles/Pitfalls of Performance Measurement “Pentium 4 (Partially) Previewed,” Microprocessor Report, 8/28/00 CS 61C L29 Intel & Review (27) A Carle, Summer 2005 © UCB CS 61C L29

CS61C : Machine Structures

Details

Download

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

Support