
Instruction Set Architecture and Principles
Chapter 2

Instruction Sets

• What’s an instruction set?
  – Set of all instructions understood by the CPU
  – Each instruction directly executed in hardware
• Instruction Set Representation
  – Sequence of bits, typically 1-4 words
  – May be variable length or fixed length
  – Some bits represent the op code, others represent the operand

Classes of Instruction Set Architectures

• Stack Based
  – Implicitly use the top of a stack
  – PUSH X, PUSH Y, ADD, POP Z  →  Z = X + Y
• Accumulator Based
  – Implicitly use an accumulator
  – LOAD X, ADD Y, STORE Z
• GPR – General Purpose Registers
  – Operands are explicit and may be memory or registers
  – Register-register form:  LOAD R1, X / LOAD R2, Y / ADD R1, R2, R3 / STORE R1, Z
  – Register-memory form:    LOAD R1, X / ADD R1, Y / STORE R1, Z

Instruction Set Affects CPU Performance

• Recall
  – ExecTime = Instruction_Count * CPI * Cycle_Time  (worked example below)
  – The Instruction Set is at the heart of the matter!
• [Figure: Source Code → Compiler → Object Code on the software side, Instr Fetch → Instr Decode → Instr Execute on the hardware side; the instruction set ties the two together and determines Instruction_Count, CPI, and Cycle Time]
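The numbers below are invented purely to illustrate the ExecTime equation above; they are not measurements from the slides. A minimal C sketch comparing two hypothetical ISA styles for the same program:

    #include <stdio.h>

    int main(void)
    {
        /* Hypothetical figures for the same program on two ISA styles. */
        double ic_load_store = 1.30e9, cpi_load_store = 1.2, cycle_load_store = 1.0e-9;
        double ic_reg_mem    = 1.00e9, cpi_reg_mem    = 1.8, cycle_reg_mem    = 1.1e-9;

        /* ExecTime = Instruction_Count * CPI * Cycle_Time */
        double t1 = ic_load_store * cpi_load_store * cycle_load_store;
        double t2 = ic_reg_mem    * cpi_reg_mem    * cycle_reg_mem;

        printf("load-store ISA:      %.3f s\n", t1);   /* 1.30e9 * 1.2 * 1.0 ns = 1.560 s */
        printf("register-memory ISA: %.3f s\n", t2);   /* 1.00e9 * 1.8 * 1.1 ns = 1.980 s */
        return 0;
    }

The point is only that the ISA shifts the terms: fewer, more complex instructions trade instruction count against CPI and cycle time.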

Comments on Classifications of ISA

• Stack-based generates short instructions (one operand) but programming may be complex, with lots of stack overhead
• Accumulator-based also has complexities for juggling different data values
• GPR – most common today
  – Allows a large number of registers to exist to hold variables
  – The compiler gets the job today of allocating variables to registers

Comparison Details

• STACK
  – Pro: simple address format, effective decode, short instructions → high code density
  – Con: stack bottleneck, lack of random access, many instructions needed for some code
• ACC
  – Pro: short instructions → high code density
  – Con: lots of memory traffic
• GPR
  – Pro: lots of code generation options, efficiencies possible
  – Con: larger code size, possibly complex effective address calculations

How Many Operands?

• Two or Three?
  – Two: Source and Result
  – Three: Source 1, Source 2, and Result
• Tradeoffs
  – A two-operand ISA requires more temporary instructions (e.g. Z = X + Y can’t be done in one instruction)
  – A three-operand ISA needs fewer instructions but increases instruction complexity and size
• Also must consider the types of operands allowed
  – Register to Register, Register to Memory, Memory to Memory

Register-Register (0,3)

• (m, n) means m memory operands and n total operands in an ALU instruction
  – Pure RISC, register-to-register operations
• Advantages
  – Simple, fixed-length instruction encodings
  – Decode is simple
  – Uniform CPI
• Disadvantages
  – Higher instruction count
  – Some instructions are short and bit encodings may be wasteful
• Instruction density, memory bottlenecks, CPI variations!

Register-Memory (1,2)

• Register – Memory ALU architecture
• In later evolutions of RISC and CISC
• Advantages
  – Data can be accessed without loading first
  – Instruction format easy to encode
  – Good instruction density
• Disadvantages
  – Memory accesses create a bottleneck
  – Source operand is also the destination, so data is overwritten
  – Need for a memory address field may limit the number of registers
  – CPI varies by operand location

Memory-Memory (3,3)

• True memory-memory ALU model, e.g. a fully orthogonal CISC architecture
• Advantages
  – Most compact instruction density, no temporary registers needed
• Disadvantages
  – Variable CPI
  – Large variation in instruction size
  – Expensive to implement
• Not used in today’s architectures

Memory Addressing

• What is accessed – byte, word, multiple words?
  – today’s machines are byte addressable, due to legacy issues
• But main memory is organized in 32 – 64 byte lines
  – matches the cache model
  – Retrieve data in, say, 4 byte chunks
• Alignment Problem
  – accessing data that is not aligned on one of these boundaries will require multiple references
  – E.g. fetching a 16 bit integer at byte offset 3 requires two four byte chunks to be read in (line 0, line 1)
  – Can make it tricky to accurately predict execution time with mis-aligned data
  – Compiler should try to align! Some instructions auto-align too

Big-Endian vs. Little-Endian

• How is data stored?
  – E.g. given 0xABCD
• Big-Endian
  – Store MSByte first (AB CD)
  – Hex dumps a little easier to read
  – Motorola
• Little-Endian
  – Store LSByte first (CD AB)
  – May still get the right value when reading different word sizes
  – Intel
• Internally we know how data is stored, so does it matter?
  – Yes, for networks and sending data byte by byte
  – May need functions like htons (sketched below)
  – Some systems have an Endian control bit to select
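A minimal C sketch of the points above: it checks the byte order of whatever host it runs on and shows the htons conversion used before sending data over a network. htons comes from the POSIX header <arpa/inet.h>; this is an illustrative example, not code from the slides.

    #include <stdio.h>
    #include <stdint.h>
    #include <arpa/inet.h>   /* htons: host-to-network (big-endian) byte order */

    int main(void)
    {
        uint16_t value = 0xABCD;
        uint8_t *bytes = (uint8_t *)&value;

        if (bytes[0] == 0xAB)
            printf("big-endian host: stored as AB CD\n");
        else
            printf("little-endian host: stored as CD AB\n");

        /* Network byte order is big-endian, so htons swaps bytes on little-endian hosts. */
        uint16_t wire = htons(value);
        printf("host value 0x%04X, after htons 0x%04X\n",
               (unsigned)value, (unsigned)wire);
        return 0;
    }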

Addressing Modes

• The addressing mode specifies the address of an operand we want to access
  – Register or Location in Memory
  – The actual memory address we access is called the effective address
• Effective address may go to memory or a register array
  – typically dependent on location in the instruction field
  – multiple fields may combine to form a memory address
  – register addresses are usually simple – needs to be fast
• Effective address generation is important and should be fast!
  – Falls into the common case of frequently executed instructions

Memory Addressing

Mode          Example           Meaning                                         When used
Register      Add R4, R3        Regs[R4] ← Regs[R4] + Regs[R3]                  Value is in a register
Immediate     Add R4, #3        Regs[R4] ← Regs[R4] + 3                         For constants
Displacement  Add R4, 100(R1)   Regs[R4] ← Regs[R4] + Mem[100 + Regs[R1]]       Access local variables
Indirect      Add R4, (R1)      Regs[R4] ← Regs[R4] + Mem[Regs[R1]]             Pointers
Indexed       Add R3, (R1+R2)   Regs[R3] ← Regs[R3] + Mem[Regs[R1] + Regs[R2]]  Traverse an array
Direct        Add R1, $1001     Regs[R1] ← Regs[R1] + Mem[1001]                 Static data; address constant may be large

(C code patterns that typically compile to these modes are sketched below.)
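The C fragment below marks which addressing mode a compiler would typically use for each access. This mapping is a rough assumption for illustration; the mode actually chosen depends on the ISA and the compiler, and is not stated on the slide.

    #include <stdio.h>

    static int table[8];               /* static data: direct/absolute addressing */

    int main(void)
    {
        int local = 5;                 /* local variable: displacement off the frame/stack pointer */
        int *p = &local;               /* pointer held in a register */

        *p += 3;                       /* register indirect for *p, immediate for the constant 3 */
        table[2] = local;              /* direct (or base + small displacement) for the array element */

        printf("%d %d\n", local, table[2]);
        return 0;
    }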

Memory Addressing (cont.)

Mode             Example              Meaning                                                    When used
Memory Indirect  Add R1, @(R3)        Regs[R1] ← Regs[R1] + Mem[Mem[Regs[R3]]]                   *p if R3 = p
Autoincrement    Add R1, (R2)+        Regs[R1] ← Regs[R1] + Mem[Regs[R2]];                       Stepping through arrays in a loop
                                      Regs[R2] ← Regs[R2] + 1
Autodecrement    Add R1, (R2)-        Regs[R1] ← Regs[R1] + Mem[Regs[R2]];                       Same as above; can push/pop for a stack
                                      Regs[R2] ← Regs[R2] - 1
Scaled           Add R1, 100(R2)[R3]  Regs[R1] ← Regs[R1] + Mem[100 + Regs[R2] + Regs[R3] * d]   Index arrays by a scaling factor, e.g. word offsets

(The C idiom behind the pointer-stepping modes is sketched below.)

Which modes are used?

• VAX supported all modes!
• Dynamic traces, frequency of modes collected
  – Memory Indirect:    TeX 1%,  SPICE 6%,  gcc 1%
  – Scaled:             TeX 0%,  SPICE 16%, gcc 6%
  – Register Deferred:  TeX 24%, SPICE 3%,  gcc 11%
  – Immediate:          TeX 43%, SPICE 17%, gcc 39%
  – Displacement:       TeX 32%, SPICE 55%, gcc 40%
• Results say: support displacement and immediate addressing, and make them fast!
• WARNING! Role of the compiler here?
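The loop below is the classic C idiom behind the autoincrement row: stepping a pointer through an array. Whether the compiler really emits an autoincrement or scaled mode depends on the target ISA; the code itself is just an assumed illustration.

    #include <stdio.h>

    int main(void)
    {
        int a[4] = { 1, 2, 3, 4 };
        int *p = a;
        int sum = 0;

        for (int k = 0; k < 4; k++)
            sum += *p++;       /* register deferred access plus post-increment: the autoincrement pattern */

        sum += a[2];           /* base + index scaled by sizeof(int): the scaled/indexed pattern */

        printf("sum = %d\n", sum);
        return 0;
    }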

Displacement Addressing Mode

• According to the data, this is a common case – so optimize for it
• Major question – size of the displacement field?
  – If small, may fit into the word size, giving better instruction density
  – If large, allows a larger range of accessed data
  – To resolve, use dynamic traces once again to see the size of displacement actually used (bucketing sketched below)

Displacement Traces

• [Figure: histogram of the % of displacements versus the number of bits needed for the displacement (lg d), roughly 0–14 bits, with separate Int and FP curves]
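A small sketch of how such a trace could be bucketed into the histogram’s x-axis: count the magnitude bits each displacement needs. The displacement values in the array are invented for illustration.

    #include <stdio.h>
    #include <stdlib.h>

    /* Bits needed for the magnitude of a displacement (sign bit not counted). */
    static int bits_needed(long disp)
    {
        unsigned long mag = (unsigned long)labs(disp);
        int bits = 0;
        while (mag) {          /* position of the highest set bit */
            bits++;
            mag >>= 1;
        }
        return bits;
    }

    int main(void)
    {
        /* Hypothetical displacements as they might appear in a dynamic trace. */
        long trace[] = { 0, 4, -8, 100, -1024, 32760 };
        int histogram[16] = { 0 };

        for (size_t i = 0; i < sizeof trace / sizeof trace[0]; i++)
            histogram[bits_needed(trace[i])]++;

        for (int b = 0; b <= 15; b++)
            printf("%2d bits: %d displacements\n", b, histogram[b]);
        return 0;
    }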

Immediate Addressing Mode

• Similar issues as with displacement; how big should the operands be?
• Tends to be used with constants
  – Constants tend to be small?
• What instructions use immediate addressing?
  – Loads: 10% Int, 45% FP
  – Compares: 87% Int, 77% FP
  – ALU: 58% Int, 78% FP
  – All Instructions: 35% Int, 10% FP

Immediate Addressing Mode (cont.)

• What size data do we use in practice?
• [Figure: % of immediates versus the # of bits needed for an immediate value (0–32), with GCC and TeX curves]

Control Flow

• Transfers or instructions that change the flow of control
• Jump – unconditional branch
  – How is the target specified? How far away from the PC?
• Branch
  – when a condition is used
  – How is the condition set?
• Calls
  – Where is the return address stored?
  – How are parameters passed?
• Returns
  – How is the result returned?
• What work does the linker have to do?

Instruction Set Optimization

• See what instructions are executed most frequently, make sure they are fast!
• Intel:
  – Load                22%
  – Conditional Branch  20%
  – Compare             16%
  – Store               12%
  – Add                  8%
  – AND                  6%
  – SUB                  5%
  – Move Reg To Reg      4%
  – Call                 1%
  – Ret                  1%

Biggest Deal is Conditional Branch

• [Figure: breakdown of control flow instructions – conditional branch, jump, and call/return – as a percentage of all control flow instructions (0–100%) for Integer and Floating Point programs; conditional branches dominate]

Address Specification

• The effective address of the branch target is known at compile time for both conditional and unconditional branches
  – as a register containing the target address
  – as a PC-relative offset
• Consider word length addresses, registers, and instructions
  – full address desired? Then pick the register option.
  – BUT – setup and effective address generation will take longer.
  – if you can deal with a smaller offset then PC-relative works (sketched below)
  – PC-relative is also position independent – so simple linker duty; preferred when possible
• Do more measurements to see what’s possible!
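A minimal sketch of the two target-address options, assuming a hypothetical MIPS-like encoding in which a signed 16-bit word offset is added to the address of the next instruction; the constants are made up.

    #include <stdint.h>
    #include <stdio.h>

    /* PC-relative: a signed 16-bit word offset is scaled and added to the address
     * of the next instruction, so the code is position independent. */
    static uint32_t branch_target_pc_relative(uint32_t pc, int16_t offset_words)
    {
        return pc + 4 + (int32_t)offset_words * 4;
    }

    /* Register form: the register simply holds the full 32-bit target, which can
     * reach anywhere but needs an earlier instruction to set it up. */
    static uint32_t branch_target_register(uint32_t reg_value)
    {
        return reg_value;
    }

    int main(void)
    {
        uint32_t pc = 0x00400100;                       /* hypothetical branch address */
        printf("PC-relative: 0x%08x\n", (unsigned)branch_target_pc_relative(pc, -8));
        printf("Register:    0x%08x\n", (unsigned)branch_target_register(0x10008000));
        return 0;
    }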

Branch Distances

• [Figure: % of branches versus bits of branch displacement (0–12), with separate Int and Float curves]

Condition Testing Options

• Condition Code
  – Test: special PSW bits set by the ALU
  – Pro: conditions may be set “for free”
  – Con: extra state to maintain; constrains ordering of instructions
• Condition Register
  – Test: comparison result put in a register; test the register
  – Pro: simple, fewer ordering constraints
  – Con: uses up a register
• Compare and Branch
  – Test: compare is part of the branch
  – Pro: one instruction instead of two for a branch
  – Con: may be too much work per instruction

Branch Direction

• GCC
  – Backward branches: 24%
  – Branches taken: 54%
• Spice
  – Backward branches: 31%
  – Branches taken: 51%
• TeX
  – Backward branches: 17%
  – Branches taken: 54%
• Most backward branches are loops
  – taken about 90%
• Branch statistics are both compiler and application dependent
• Loop optimizations may have a large effect
• We’ll see the role of this later with branch prediction and pipelining

What is compared?

• <, >=
  – Int: 7%   Float: 40%
• >, <=
  – Int: 7%   Float: 23%
• ==, !=
  – Int: 86%  Float: 37%
• Over 50% of integer compares were tests for equality with 0

Most Frequently Used Operand Types

• Double word
  – TeX 0%, Spice 66%, GCC 0%
• Word
  – TeX 89%, Spice 34%, GCC 91%
• Halfword
  – TeX 0%, Spice 0%, GCC 4%
• Byte
  – TeX 11%, Spice 0%, GCC 5%
• Move underway now to 64 bit machines
  – BCD likely to go away
  – larger offsets and immediates are likely
  – usage of 64 and 128 bit values will increase

Operand Type and Size

• Operands may have many types; how do we distinguish which?
• Annotate with a tag interpreted by hardware
  – Not used anymore today
• The opcode also encodes the type of the operand
  – Amounts to different instructions per type
• Typical types (sizes printed by the sketch below)
  – character – byte (UNICODE?)
  – short integer – two bytes, 2’s complement
  – integer – one word, 2’s complement
  – float – one word – IEEE 754
  – double – two words – IEEE 754
  – BCD or packed decimal
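The sketch below just prints what a C compiler on the host machine uses for the typical types listed above; on a machine with a 32-bit word the int/float/double sizes match the one-word / two-word description. BCD has no standard C type, so it is omitted.

    #include <stdio.h>

    int main(void)
    {
        printf("char   (character):     %zu byte(s)\n", sizeof(char));
        printf("short  (short integer): %zu byte(s)\n", sizeof(short));
        printf("int    (integer):       %zu byte(s)\n", sizeof(int));
        printf("float  (IEEE 754):      %zu byte(s)\n", sizeof(float));
        printf("double (IEEE 754):      %zu byte(s)\n", sizeof(double));
        return 0;
    }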

Encoding the Instruction Set

• How to actually store, in binary, the instructions we want
  – Depends on the previous discussion: operands, addressing modes, number of registers, etc.
  – Will affect code size and CPI
• Tradeoffs:
  – Desire to have many registers, many addressing modes
  – Desire to have the average instruction size small
  – Desire to encode into lengths the hardware can easily and efficiently handle
• Fixed or variable length?
• Remember data is delivered in blocks of cache line sizes

Instruction Set Encoding Options

• Variable (e.g. VAX)
  – OpCode and # of ops | Operand 1 | Operand 2 | … | Operand N
• Fixed (e.g. DLX, SPARC, PowerPC) – decode sketched below
  – OpCode | Operand 1 | Operand 2 | Operand 3
• Hybrid (e.g. x86, IBM 360)
  – OpCode | Operand 1 | Operand 2 | Operand 3
  – OpCode | Operand 1 | Operand 2
  – OpCode
• Instruction Size? Complexity?
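To make the fixed-format point concrete, here is a decode sketch for a 32-bit register-register format with MIPS/DLX-style field positions; the field layout is an assumption for illustration, not taken from the slides. With fixed fields, decode is a handful of constant shifts and masks.

    #include <stdint.h>
    #include <stdio.h>

    typedef struct {
        unsigned opcode;   /* bits 31..26 */
        unsigned rs;       /* bits 25..21 : source register 1 */
        unsigned rt;       /* bits 20..16 : source register 2 */
        unsigned rd;       /* bits 15..11 : destination register */
        unsigned funct;    /* bits  5..0  : ALU function */
    } rtype_t;

    static rtype_t decode_rtype(uint32_t instr)
    {
        rtype_t d;
        d.opcode = (instr >> 26) & 0x3F;
        d.rs     = (instr >> 21) & 0x1F;
        d.rt     = (instr >> 16) & 0x1F;
        d.rd     = (instr >> 11) & 0x1F;
        d.funct  =  instr        & 0x3F;
        return d;
    }

    int main(void)
    {
        uint32_t instr = 0x012A4020;   /* MIPS bit pattern for "add $8, $9, $10" */
        rtype_t d = decode_rtype(instr);
        printf("opcode=%u rs=%u rt=%u rd=%u funct=%u\n",
               d.opcode, d.rs, d.rt, d.rd, d.funct);
        return 0;
    }

A variable-length format, by contrast, cannot locate later fields until earlier ones have been decoded, which is part of the density-versus-decode-complexity tradeoff above.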

Role of the Compiler

• The role of the compiler is critical
  – Difficult to program in assembly, so nobody does it
  – Certain ISAs make assembly even more difficult to optimize
  – Leave it to the compiler
• Compiler writer’s primary goal:
  – correctness
• Secondary goal:
  – speed of the object code
• More minor goals:
  – speed of the compilation
  – debug support
  – language interoperability

Compiler Optimizations

• High-Level
  – Done on the source with output fed to later passes
  – E.g. procedure call changed to inline code
• Local
  – Optimize code only within a basic block (sequential fragment of code)
  – E.g. common subexpressions – remember the value, replace with a single copy. Replace variables with constants where possible, minimize boolean expressions
• Global
  – Extend local optimizations across branches, optimize loops
  – E.g. remove code from loops that computes the same value on each pass and put it before the loop. Simplify array address calculations. (Both the local and global cases are sketched below.)
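A hand-written before/after sketch of the optimizations just described – a common subexpression (the repeated a[i] access) and a loop-invariant computation hoisted out of the loop. The example program is invented; a real compiler performs these transformations on its intermediate code.

    #include <stdio.h>

    /* Before: scale * offset is recomputed every iteration even though it never
     * changes, and a[i] is evaluated twice per iteration. */
    void before(int *a, int n, int scale, int offset)
    {
        for (int i = 0; i < n; i++) {
            a[i] = a[i] + scale * offset;   /* loop-invariant multiply */
            a[i] = a[i] * 2;                /* common subexpression a[i] */
        }
    }

    /* After: the invariant product is hoisted above the loop and a[i] is read
     * once into a temporary. */
    void after(int *a, int n, int scale, int offset)
    {
        int k = scale * offset;             /* hoisted out of the loop */
        for (int i = 0; i < n; i++) {
            int t = a[i] + k;               /* single evaluation of a[i] */
            a[i] = t * 2;
        }
    }

    int main(void)
    {
        int x[4] = { 1, 2, 3, 4 }, y[4] = { 1, 2, 3, 4 };
        before(x, 4, 3, 5);
        after(y, 4, 3, 5);
        for (int i = 0; i < 4; i++)
            printf("%d %d\n", x[i], y[i]);   /* the two versions agree */
        return 0;
    }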

Compiler Optimizations (cont)

• Register Allocation
  – What registers should be allocated to what variables?
  – NP-complete problem using graph coloring. Must use an approximation algorithm
• Machine-dependent Optimization
  – Use SIMD instructions if available
  – Replace multiply with a shift and add sequence (sketched below)
  – Reorder instructions to minimize pipeline stalls

Example of Register Allocation

  Source                          Intermediate code
  c = ‘S’;                        [100] c = ‘S’
  sum = 0;                        [101] sum = 0
  i = 1;                          [102] i = 1
  while (i <= 100) {              [103] label L1:
      sum = sum + i;              [104] if i > 100 goto L2   (true → exit loop, false → fall through)
      i = i + 1;                  [105] sum = sum + i
  }                               [106] i = i + 1
  square = sum * sum;             [107] goto L1
  print c, sum, square;           [108] label L2:
                                  [109] square = sum * sum
                                  [110] print c, sum, square
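A sketch of the multiply-to-shift-and-add replacement mentioned above. The constant 10 and the test program are invented; real compilers pick such sequences per target, and the check covers non-negative inputs only because left-shifting negative values is undefined in C.

    #include <assert.h>
    #include <stdio.h>

    int times10_mul(int x)   { return x * 10; }
    int times10_shift(int x) { return (x << 3) + (x << 1); }   /* 8*x + 2*x */

    int main(void)
    {
        for (int x = 0; x <= 1000; x++)
            assert(times10_mul(x) == times10_shift(x));
        printf("shift-and-add matches multiply for 0..1000\n");
        return 0;
    }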

Example : Register Allocation

• Assume only two registers are available, R1 and R2. What variables should be assigned, if any?

  c = ‘S’;
  sum = 0;
  i = 1;
  while (i <= 100) {
      sum = sum + i;
      i = i + 1;
  }
  square = sum * sum;
  print c, sum, square;

  Variable   Register
  c          ?
  sum        ?
  i          ?
  square     ?

Example : Register Allocation (cont.)

• sum and i should get priority over variable c
• Reuse R2 for variables i and square, since there is no point in the program where both variables are simultaneously live

  Variable   #Uses   Register
  c          1       none
  sum        103     R1
  i          301     R2
  square     1       R2

Register Allocation: Constructing a Graph

• A node is a variable (may be a temporary) that is a candidate for register allocation
• An edge connects two nodes, v1 and v2, if there is some statement in the program where variables v1 and v2 are simultaneously live, meaning they would interfere with one another
• Once this graph is constructed, we try to color it with k colors, where k = the number of free registers. Coloring means no connected nodes may be the same color. The coloring property ensures that no two variables that interfere with each other are assigned the same register. (A greedy coloring pass is sketched below.)

Register Allocation Example

  s1 = ld(x)
  s2 = s1 + 4
  s3 = s1 * 8
  s4 = s1 - 4
  s5 = s1 / 2
  s6 = s2 * s3
  s7 = s4 - s5

• [Figure: interference graph over the temporaries s1–s7]
• What is a valid coloring?
• Can we use the same register for s4 that we use for s1?
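A sketch of one simple heuristic for the coloring step: visit variables in priority order (most uses first, as on the earlier slide) and give each the lowest register number not used by an already-colored neighbor in the interference graph. The interference matrix below is an assumption derived from the while-loop example (c is taken to stay live until the final print); it is not the graph in the figure, and real allocators use more sophisticated algorithms.

    #include <stdio.h>
    #include <stdbool.h>

    #define N 4   /* variables, in priority order: sum, i, square, c */
    #define K 2   /* free registers: R1, R2 */

    static const char *name[N] = { "sum", "i", "square", "c" };

    /* interfere[a][b]: a and b are live at the same time (assumed liveness). */
    static const bool interfere[N][N] = {
        /*           sum  i  square  c */
        /* sum    */ { 0,  1,  1,    1 },
        /* i      */ { 1,  0,  0,    1 },
        /* square */ { 1,  0,  0,    1 },
        /* c      */ { 1,  1,  1,    0 },
    };

    int main(void)
    {
        int color[N];
        for (int v = 0; v < N; v++) {
            bool used[K] = { false };
            for (int u = 0; u < v; u++)            /* colors already taken by neighbors */
                if (interfere[v][u] && color[u] >= 0)
                    used[color[u]] = true;
            color[v] = -1;                         /* -1: no register, spill to memory */
            for (int r = 0; r < K; r++)
                if (!used[r]) { color[v] = r; break; }
            if (color[v] >= 0)
                printf("%-6s -> R%d\n", name[v], color[v] + 1);
            else
                printf("%-6s -> no register (spill)\n", name[v]);
        }
        return 0;
    }

With the assumed graph this reproduces the earlier slide’s assignment: sum → R1, i → R2, square → R2, and c left in memory.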

Impact of Compiler Technology can be Large

  Optimization                           % Faster
  Procedure Integration                  10%
  Local Optimizations Only                5%
  Local + Register Allocation            26%
  Local + Global + Register Allocation   63%
  Everything                             81%

  (Stanford UCode compiler optimization on Fortran/Pascal programs)

• Clear benefit to compiler technology and optimizations!

Compiler Take-Aways

• ISA should have at least 16 general-purpose registers
  – Use for register allocation, simplifies graph coloring
• Orthogonality (all addressing modes for all operations) simplifies code generation
• Provide primitives, not solutions
  – E.g., a solution to match a language construct may only work with one language (see 2.9)
  – Primitives can be optimized to create a solution
• Bind as many values as possible at compile time, not run time
