
Advanced Architecture

Basic architecture

Computer Organization and Assembly Languages
Yung-Yu Chuang

with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, Robert Sedgewick and Kevin Wayne

Basic microcomputer design
• The clock synchronizes CPU operations
• The control unit (CU) coordinates the sequence of execution steps
• The ALU performs arithmetic and logic operations
• The memory storage unit holds instructions and data for a running program
• A bus is a group of wires that transfers data from one part to another (data, address, control)

[Block diagram: I/O devices, the central processor unit (registers, ALU, CU, clock), and the memory storage unit, connected by the data bus, control bus, and address bus]

Clock
• synchronizes all CPU and bus operations
• machine (clock) cycle measures the time of a single operation
• clock is used to trigger events
• Basic unit of time, 1 GHz → clock cycle = 1 ns
• An instruction could take multiple cycles to complete, e.g., multiply in the 8088 takes 50 cycles

Instruction execution cycle
• Fetch
• Decode
• Fetch operands
• Execute
• Store output

[Diagram: program counter and instruction queue (I-1 to I-4); instruction fetch from memory, decode, operand read from registers, execute in the ALU, write of registers and flags, output store]
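As a back-of-the-envelope illustration of the clock figures above, the sketch below converts a clock frequency into a cycle time and an instruction latency. Pairing a 1 GHz clock with the 8088's 50-cycle multiply is purely illustrative; the 8088 ran far slower.

    #include <stdio.h>

    int main(void)
    {
        double freq_hz  = 1e9;              /* 1 GHz clock, as in the example above  */
        double cycle_ns = 1e9 / freq_hz;    /* one cycle = 1 ns at 1 GHz             */
        int mul_cycles  = 50;               /* multiply on the 8088: about 50 cycles */

        printf("cycle time           = %.1f ns\n", cycle_ns);
        printf("50-cycle instruction = %.1f ns at this clock\n", mul_cycles * cycle_ns);
        return 0;
    }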

Multi-stage pipeline
• Pipelining makes it possible for the processor to execute instructions in parallel
• Instruction execution is divided into discrete stages

[Pipeline diagram: a non-pipelined processor, for example the 80386, runs I-1 through stages S1-S6 during cycles 1-6 and only then starts I-2 (cycles 7-12); many cycles are wasted]

Pipelined execution
• More efficient use of cycles, greater throughput (the 80486 started to use pipelining)
• For k stages and n instructions, the number of required cycles is k + (n - 1), compared to k * n without pipelining

[Pipeline diagram: I-1 and I-2 overlap in stages S1-S6; with 6 stages, the two instructions finish in 7 cycles]

Pipelined execution
• Pipelining requires buffers between stages:
– Each buffer holds a single value
– Ideal scenario: equal work for each stage
• Sometimes it is not possible
• The slowest stage determines the flow rate of the entire pipeline
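A minimal sketch of the cycle-count comparison above; the stage and instruction counts are arbitrary examples, and the pipelined formula assumes one instruction issued per cycle with no stalls.

    #include <stdio.h>

    /* Cycles for a k-stage pipeline running n instructions (ideal case). */
    static long pipelined_cycles(long k, long n)   { return k + (n - 1); }
    static long unpipelined_cycles(long k, long n) { return k * n; }

    int main(void)
    {
        long k = 6, n = 1000;   /* example: 6 stages, 1000 instructions */
        printf("pipelined:     %ld cycles\n", pipelined_cycles(k, n));
        printf("non-pipelined: %ld cycles\n", unpipelined_cycles(k, n));
        printf("speedup:       %.2fx\n",
               (double)unpipelined_cycles(k, n) / pipelined_cycles(k, n));
        return 0;
    }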

Pipelined execution
• Some reasons for unequal work among stages
– A complex step cannot be subdivided conveniently
– An operation takes a variable amount of time to execute, e.g., operand fetch time depends on where the operands are located
  • Registers
  • Memory
– Complexity of operation depends on the type of operation
  • Add: may take one cycle
  • Multiply: may take several cycles

Pipelined execution
• Operand fetch of I2 takes three cycles
– Pipeline stalls for two cycles
• Caused by hazards
– Pipeline stalls reduce overall throughput

Wasted cycles (pipelined)
• When one of the stages requires two or more clock cycles, clock cycles are again wasted.
• For k stages and n instructions, the number of required cycles is: k + (2n - 1)

[Pipeline diagram: stage S4 takes two cycles per instruction, so I-2 and I-3 back up behind it; three instructions need 11 cycles]

Superscalar
• A superscalar processor has multiple execution pipelines. In the following, note that stage S4 has left and right pipelines (u and v).
• For k stages and n instructions, the number of required cycles is: k + n
• Pentium: 2 pipelines
• Pentium Pro: 3

[Pipeline diagram: stages S1, S2, S3, a duplicated S4 (u and v), S5, S6; four instructions need 10 cycles]

Pipeline stages
• Pentium 3: 10
• Pentium 4: 20~31
• Next-generation micro-architecture: 14
• ARM7: 3

Hazards
• Three types of hazards
– Resource hazards
  • Occur when two or more instructions use the same resource; also called structural hazards
– Data hazards
  • Caused by data dependencies between instructions, e.g., a result produced by I1 is read by I2
– Control hazards
  • Default sequential execution suits pipelining
  • Altering control flow (e.g., branching) causes problems, introducing control dependencies

Data hazards

    add r1, r2, #10    ; write r1
    sub r3, r1, #20    ; read r1

[Pipeline diagram: fetch, decode, reg, ALU, wb for each instruction; the dependent sub stalls until the add has produced r1]

Data hazards
• Forwarding: provides the output result as soon as possible

    add r1, r2, #10    ; write r1
    sub r3, r1, #20    ; read r1

[Pipeline diagram: with forwarding, the sub stalls for fewer cycles before its reg and ALU stages]

Control hazards

    bz r1, target
    add r2, r4, 0
    ...
target:
    add r2, r3, 0

[Pipeline diagram: instructions fetched after the branch are discarded when the branch is taken]

Control hazards
• Branches alter control flow
– Require special attention in pipelining
– Need to throw away some instructions in the pipeline
  • Depends on when we know the branch is taken
  • The pipeline wastes three clock cycles
    – Called the branch penalty
– Reducing the branch penalty
  • Determine the branch decision early

Control hazards
• Delayed branch execution
– Effectively reduces the branch penalty
– We always fetch the instruction following the branch
  • Why throw it away?
  • Place a useful instruction there to execute
  • This is called the delay slot

Delay slot

  Original code:            With the delay slot filled:
    add R2,R3,R4              branch target
    branch target             add R2,R3,R4
    sub R5,R6,R7              sub R5,R6,R7
    ...                       ...

Branch prediction
• Three prediction strategies
– Fixed
  • Prediction is fixed
  • Example: branch-never-taken
    » Not proper for loop structures
– Static
  • Strategy depends on the branch type
    – Conditional branch: always not taken
    – Loop: always taken
– Dynamic
  • Takes run-time history to make more accurate predictions

Branch prediction
• Static prediction
– Improves prediction accuracy over Fixed

  Instruction type      Instruction       Prediction:    Correct
                        distribution (%)  branch taken?  prediction (%)
  Unconditional branch  70*0.4 = 28       Yes            28
  Conditional branch    70*0.6 = 42       No             42*0.6 = 25.2
  Loop                  10                Yes            10*0.9 = 9
  Call/return           20                Yes            20

  Overall prediction accuracy = 82.2%
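A small sketch reproducing the arithmetic of the table above; the instruction mix and taken rates are the figures from the slide.

    #include <stdio.h>

    int main(void)
    {
        /* Mix: 70% branches (40% unconditional, 60% conditional), 10% loops,
         * 20% calls/returns, with the per-type prediction accuracies above. */
        double uncond = 70.0 * 0.4;   /* predicted taken, always correct  */
        double cond   = 70.0 * 0.6;   /* predicted not taken, correct 60% */
        double loop   = 10.0;         /* predicted taken, correct 90%     */
        double call   = 20.0;         /* predicted taken, always correct  */

        double correct = uncond * 1.0 + cond * 0.6 + loop * 0.9 + call * 1.0;
        printf("overall static-prediction accuracy = %.1f%%\n", correct);  /* 82.2 */
        return 0;
    }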

Branch prediction
• Dynamic branch prediction
– Uses run-time history
  • Takes the past n branch executions of the branch type and makes the prediction
– Simple strategy
  • Prediction of the next branch is the majority of the previous n branch executions
  • Example: n = 3
    – If two or more of the last three branches were taken, the prediction is "branch taken"
  • Depending on the type of mix, we get more than 90% prediction accuracy

Branch prediction
• Impact of the past n branches on prediction accuracy

        Type of mix
  n     Compiler   Business   Scientific
  0     64.1       64.4       70.4
  1     91.9       95.2       86.6
  2     93.3       96.5       90.8
  3     93.7       96.6       91.0
  4     94.5       96.8       91.8
  5     94.7       97.0       92.0
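A minimal sketch of a dynamic predictor built from a two-bit saturating counter, the four-state scheme drawn in the state diagram below; the state encoding and function names are my own.

    #include <stdbool.h>
    #include <stdio.h>

    /* States 0 and 1 predict "not taken"; states 2 and 3 predict "taken". */
    typedef struct { unsigned state; } predictor_t;

    static bool predict(const predictor_t *p) { return p->state >= 2; }

    static void update(predictor_t *p, bool taken)
    {
        if (taken  && p->state < 3) p->state++;   /* move toward strongly taken     */
        if (!taken && p->state > 0) p->state--;   /* move toward strongly not taken */
    }

    int main(void)
    {
        predictor_t p = { 0 };
        bool outcomes[] = { true, true, false, true, true, true };  /* made-up history */
        int n = sizeof outcomes / sizeof outcomes[0], correct = 0;

        for (int i = 0; i < n; i++) {
            if (predict(&p) == outcomes[i]) correct++;
            update(&p, outcomes[i]);
        }
        printf("%d/%d predictions correct\n", correct, n);
        return 0;
    }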

Branch prediction
• A 2-bit predictor keeps one of four states per branch:

[State diagram: states 00 and 01 predict "no branch", states 10 and 11 predict "branch"; each taken branch moves the state toward 11, each not-taken branch moves it toward 00]

Multitasking
• The OS can run multiple programs at the same time.
• Multiple threads of execution within the same program.
• A scheduler utility assigns a given amount of CPU time to each running program.
• Rapid switching of tasks
– gives the illusion that all programs are running at once
– the processor must support task switching
– scheduling policy, round-robin, priority

SRAM vs DRAM

[Block diagram: the same CPU, memory, and I/O organization as before, with a cache added between the central processor unit and the memory storage unit]

          Transistors   Access   Needs
          per bit       time     refresh?   Cost   Applications
  SRAM    4 or 6        1X       No         100X   cache memories
  DRAM    1             10X      Yes        1X     main memories, frame buffers

The CPU-Memory gap
• The gap widens between DRAM, disk, and CPU speeds.

[Plot: access time in ns, log scale from 1 to 100,000,000, versus year (1980-2000), for disk seek time, DRAM access time, SRAM access time, and CPU cycle time]

                        register   cache   memory   disk
  Access time (cycles)  1          1-10    50-100   20,000,000

Memory hierarchies
• Some fundamental and enduring properties of hardware and software:
– Fast storage technologies cost more per byte, have less capacity, and require more power (heat!).
– The gap between CPU and main memory speed is widening.
– Well-written programs tend to exhibit good locality.
• They suggest an approach for organizing memory and storage systems known as a memory hierarchy.

Memory system in practice

[Pyramid diagram: L0 registers; L1 on-chip L1 cache (SRAM); L2 off-chip L2 cache (SRAM); L3 main memory (DRAM); L4 local secondary storage (local disks); L5 remote secondary storage (tapes, distributed file systems, Web servers). Levels near the top are smaller, faster, and more expensive per byte; levels near the bottom are larger, slower, and cheaper per byte.]

Reading from memory
• Multiple machine cycles are required when reading from memory, because it responds much more slowly than the CPU (e.g., 33 MHz). The wasted clock cycles are called wait states.

[Diagram: Pentium III processor chip. L1 instruction cache: 16 KB, 4-way, 32B lines. L1 data cache: 16 KB, 4-way, write-through, 32B lines, 1-cycle latency. L2 unified cache: 128KB-2MB, 4-way, write-back, write-allocate, 32B lines. Main memory: up to 4GB.]

Cache memory
• High-speed, expensive static RAM both inside and outside the CPU.
– Level-1 cache: inside the CPU
– Level-2 cache: outside the CPU
• Cache hit: when data to be read is already in cache memory
• Cache miss: when data to be read is not in cache memory. When? Compulsory, capacity and conflict misses.
• Cache design: cache size, n-way associativity, block size, replacement policy

Caching in a memory hierarchy

[Diagram: the smaller, faster, more expensive device at level k caches a subset of the blocks from level k+1 (e.g., blocks 4, 9, 10, 14, 3); data is copied between levels in block-sized transfer units; the larger, slower, cheaper storage device at level k+1 is partitioned into blocks 0-15]
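A small sketch of how the design parameters above (cache size, associativity, block size) split an address into offset, set index, and tag; the parameter values are arbitrary examples, not taken from the slides.

    #include <stdint.h>
    #include <stdio.h>

    /* Example: 32 KB cache, 4-way set associative, 32-byte blocks. */
    #define CACHE_BYTES  (32 * 1024)
    #define WAYS         4
    #define BLOCK_BYTES  32
    #define NUM_SETS     (CACHE_BYTES / (WAYS * BLOCK_BYTES))   /* 256 sets */

    int main(void)
    {
        uint32_t addr = 0x12345678;

        uint32_t offset = addr % BLOCK_BYTES;               /* byte within the block */
        uint32_t set    = (addr / BLOCK_BYTES) % NUM_SETS;  /* set the block maps to */
        uint32_t tag    = addr / (BLOCK_BYTES * NUM_SETS);  /* identifies the block  */

        printf("addr 0x%08x -> tag 0x%x, set %u, offset %u\n", addr, tag, set, offset);
        return 0;
    }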

General caching concepts
• Program needs object d, which is stored in some block b.
• Cache hit
– Program finds b in the cache at level k. E.g., block 14.
• Cache miss
– b is not at level k, so the level k cache must fetch it from level k+1. E.g., block 12.
– If the level k cache is full, then some current block must be replaced (evicted). Which one is the "victim"?
  • Placement policy: where can the new block go? E.g., b mod 4
  • Replacement policy: which block should be evicted? E.g., LRU

Locality
• Principle of Locality: programs tend to reuse data and instructions near those they have used recently, or that were recently referenced themselves.
– Temporal locality: recently referenced items are likely to be referenced in the near future.
– Spatial locality: items with nearby addresses tend to be referenced close together in time.
• In general, programs with good locality run faster than programs with poor locality.
• Locality is the reason why caches and virtual memory are designed into architectures and operating systems. Another example is the web browser's cache of recently visited webpages.

Locality example

    sum = 0;
    for (i = 0; i < n; i++)
        sum += a[i];
    return sum;

• Data
– Reference array elements in succession (stride-1 reference pattern): spatial locality
– Reference sum each iteration: temporal locality
• Instructions
– Reference instructions in sequence: spatial locality
– Cycle through the loop repeatedly: temporal locality

Locality example
• Being able to look at code and get a qualitative sense of its locality is important. Does this function have good locality?

    int sum_array_rows(int a[M][N])
    {
        int i, j, sum = 0;

        for (i = 0; i < M; i++)
            for (j = 0; j < N; j++)
                sum += a[i][j];
        return sum;
    }

  stride-1 reference pattern

Locality example
• Does this function have good locality?

    int sum_array_cols(int a[M][N])
    {
        int i, j, sum = 0;

        for (j = 0; j < N; j++)
            for (i = 0; i < M; i++)
                sum += a[i][j];
        return sum;
    }

  stride-N reference pattern

Blocked matrix multiply performance
• Blocking (bijk and bikj) improves performance by a factor of two over the unblocked versions (ijk and jik)
– relatively insensitive to array size

[Plot: cycles/iteration versus array size n (50-400) for the kji, jki, kij, ikj, jik, ijk, bijk (bsize = 25), and bikj (bsize = 25) versions]
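For reference, a hedged sketch of what a bijk-style blocked multiply can look like; the dimensions and block size are illustrative, and this is not necessarily the exact code measured in the plot above.

    #define N     400
    #define BSIZE 25    /* illustrative block size, matching the bsize = 25 curves */

    /* c += a * b, blocked so a BSIZE-wide slice of b is reused while it is cached. */
    void bijk(double a[N][N], double b[N][N], double c[N][N])
    {
        for (int kk = 0; kk < N; kk += BSIZE)
            for (int jj = 0; jj < N; jj += BSIZE)
                for (int i = 0; i < N; i++)
                    for (int j = jj; j < jj + BSIZE && j < N; j++) {
                        double sum = c[i][j];
                        for (int k = kk; k < kk + BSIZE && k < N; k++)
                            sum += a[i][k] * b[k][j];
                        c[i][j] = sum;
                    }
    }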

Cache-conscious programming
• Make sure that memory is cache-aligned
• Split data into hot and cold parts (list example; see the sketch below)
• Use unions and bitfields to reduce size and increase locality
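A hedged sketch of the hot/cold splitting idea for a linked list; the structure layout and field names are invented for illustration.

    /* Before: every traversal drags rarely used "cold" fields through the cache. */
    struct node_unsplit {
        struct node_unsplit *next;
        int  key;                  /* hot: examined on every traversal      */
        char description[128];     /* cold: only touched when a key matches */
    };

    /* After: the hot part stays small and dense; cold data lives elsewhere. */
    struct node_cold {
        char description[128];
    };

    struct node {
        struct node *next;
        int  key;
        struct node_cold *cold;    /* pointer keeps the frequently walked node small */
    };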

RISC vs. CISC

Trade-offs of instruction sets

[Diagram: the compiler bridges the semantic gap between high-level languages (C, C++, Lisp, Prolog, Haskell, ...) and the machine's instruction set]

• Before 1980, the trend was to increase instruction complexity (one-to-one mapping with high-level constructs if possible) to bridge the gap and reduce fetches from memory. Selling points: number of instructions, addressing modes. (CISC)
• 1980, RISC. Simplify and regularize instructions so that advanced architecture techniques, such as pipelining, caches, and superscalar execution, can be introduced for better performance.

RISC
• 1980, Patterson and Ditzel (Berkeley), RISC
• Features
– Fixed-length instructions
– Load-store architecture
• Organization
– Hard-wired logic
– Single-cycle instructions
– Pipelining
• Pros: small die size, short development time, high performance
• Cons: low code density, not compatible

RISC Design Principles
• Simple operations
– Simple instructions that can execute in one cycle
• Register-to-register operations
– Only load and store operations access memory
– The rest of the operations work on a register-to-register basis
• Simple addressing modes
– A few addressing modes (1 or 2)
• Large number of registers
– Needed to support register-to-register operations
– Minimize the procedure call and return overhead

RISC Design Principles
• Fixed-length instructions
– Facilitates efficient instruction execution
• Simple instruction format
– Fixed boundaries for the various fields
  • opcode, source operands, ...

CISC and RISC
• CISC – complex instruction set
– large instruction set
– high-level operations (simpler for the compiler?)
– requires an interpreter (could take a long time)
– examples: Intel 80x86 family
• RISC – reduced instruction set
– small instruction set
– simple, atomic instructions
– directly executed by hardware very quickly
– easier to incorporate advanced architecture design
– examples: ARM (Advanced RISC Machines), DEC Alpha (now Compaq), PowerPC, MIPS

CISC and RISC

                        CISC          RISC
                        (Intel 486)   (MIPS R4000)
  #instructions         235           94
  Addr. modes           11            1
  Inst. size (bytes)    1-12          4
  GP registers          8             32

Why RISC?
• Simple instructions are preferred
– Complex instructions are mostly ignored by compilers
  • Due to the semantic gap
• Simple data structures
– Complex data structures are used relatively infrequently
– Better to support a few simple data types efficiently
  • Synthesize complex ones
• Simple addressing modes
– Complex addressing modes lead to variable-length instructions
  • Lead to inefficient instruction decoding and scheduling

Why RISC? (cont'd)
• Large register set
– Efficient support for procedure calls and returns
  • Patterson and Sequin's study
    – Procedure call/return: 12-15% of HLL statements
      » Constitute 31-33% of machine language instructions
      » Generate nearly half (45%) of memory references
– Small activation record
  • Tanenbaum's study
    – Only 1.25% of the calls have more than 6 arguments
    – More than 93% have less than 6 local scalar variables
– Large register set can avoid memory references

Instruction set design
• Issues when determining an ISA
– Instruction types
– Number of addresses
– Addressing modes

ISA design issues

Instruction types
• Arithmetic and logic
• Data movement
• I/O (memory-mapped, isolated I/O)
• Flow control
– Branches (unconditional, conditional)
  • set-then-jump (cmp AX, BX; je target)
  • test-and-jump (beq r1, r2, target)
– Procedure calls (register-based, stack-based)
  • Pentium: ret; MIPS: jr
  • Register: faster but limited number of parameters
  • Stack: slower but more general

Operand types
• Instructions support basic data types
– Characters
– Integers
– Floating-point
• Instruction overload
– Same instruction for different data types
– Example: Pentium

    mov AL,address   ;loads an 8-bit value
    mov AX,address   ;loads a 16-bit value
    mov EAX,address  ;loads a 32-bit value

Operand types
• Separate instructions
– Instructions specify the operand size
– Example: MIPS

    lb Rdest,address  ;loads a byte
    lh Rdest,address  ;loads a halfword (16 bits)
    lw Rdest,address  ;loads a word (32 bits)
    ld Rdest,address  ;loads a doubleword (64 bits)

Number of addresses

Number of addresses
• Four categories
– 3-address machines
  • two for the source operands and one for the result
– 2-address machines
  • One address doubles as source and result
– 1-address machines
  • Accumulator machines
  • Accumulator is used for one source and for the result
– 0-address machines
  • Stack machines
  • Operands are taken from the stack
  • Result goes onto the stack

Number of addresses

  Number of addresses   Instruction    Operation
  3                     OP A, B, C     A ← B OP C
  2                     OP A, B        A ← A OP B
  1                     OP A           AC ← AC OP A
  0                     OP             T ← (T-1) OP T

  A, B, C: memory or register locations
  AC: accumulator
  T: top of stack
  T-1: second element of the stack

3-address
• Example: RISC machines, TOY
• Instruction format: opcode A B C
• Computing Y = (A - B) / (C + D × E):

    SUB Y, A, B    ; Y = A - B
    MUL T, D, E    ; T = D × E
    ADD T, T, C    ; T = T + C
    DIV Y, Y, T    ; Y = Y / T

2-address
• Example: IA32
• Instruction format: opcode A B
• Computing Y = (A - B) / (C + D × E):

    MOV Y, A       ; Y = A
    SUB Y, B       ; Y = Y - B
    MOV T, D       ; T = D
    MUL T, E       ; T = T × E
    ADD T, C       ; T = T + C
    DIV Y, T       ; Y = Y / T

1-address
• Example: IA32's MUL (EAX)
• Instruction format: opcode A
• Computing Y = (A - B) / (C + D × E):

    LD  D    ; AC = D
    MUL E    ; AC = AC × E
    ADD C    ; AC = AC + C
    ST  Y    ; Y = AC
    LD  A    ; AC = A
    SUB B    ; AC = AC - B
    DIV Y    ; AC = AC / Y
    ST  Y    ; Y = AC

0-address
• Example: IA32's FPU, HP3000
• Instruction format: opcode
• Computing Y = (A - B) / (C + D × E):

    PUSH A   ; A
    PUSH B   ; A, B
    SUB      ; A-B
    PUSH C   ; A-B, C
    PUSH D   ; A-B, C, D
    PUSH E   ; A-B, C, D, E
    MUL      ; A-B, C, D×E
    ADD      ; A-B, C+(D×E)
    DIV      ; (A-B) / (C+(D×E))
    POP Y
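The 0-address sequence can be mimicked directly with an explicit stack; a small sketch with arbitrary sample values:

    #include <stdio.h>

    static double stk[16];
    static int top = -1;

    static void   push(double v) { stk[++top] = v; }
    static double pop(void)      { return stk[top--]; }

    /* Evaluate Y = (A - B) / (C + D * E) exactly as the 0-address code does. */
    int main(void)
    {
        double A = 9, B = 1, C = 2, D = 3, E = 2, t;   /* arbitrary values */

        push(A); push(B);
        t = pop(); push(pop() - t);    /* SUB: A - B           */
        push(C); push(D); push(E);
        t = pop(); push(pop() * t);    /* MUL: D * E           */
        t = pop(); push(pop() + t);    /* ADD: C + D*E         */
        t = pop(); push(pop() / t);    /* DIV: (A-B) / (C+D*E) */

        printf("Y = %g\n", pop());     /* (9-1)/(2+3*2) = 1    */
        return 0;
    }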

Addressing modes
• How to specify the location of operands? Trade-off among address range, address flexibility, number of memory references, and calculation of addresses
• Operands can be in three places
– Registers
  • Register addressing mode
– Part of the instruction
  • Constant
  • Immediate addressing mode
  • All processors support these two addressing modes
– Memory
  • Difference between RISC and CISC
    – CISC supports a large variety of addressing modes
    – RISC follows a load/store architecture

Addressing modes
• Common addressing modes
– Implied
– Immediate (lda R1, 1)
– Direct (st R1, A)
– Indirect
– Register (add R1, R2, R3)
– Register indirect (sti R1, R2)
– Displacement
– Stack

Implied addressing
• No address field; the operand is implied by the instruction

    CLC    ; clear carry

• A fixed and unvarying address

Immediate addressing
• Address field contains the operand value

    ADD 5  ; AC = AC + 5

• Pros: no extra memory reference; faster
• Cons: limited range

Direct addressing
• Address field contains the effective address of the operand

    ADD A  ; AC = AC + [A]

• A single memory reference
• Pros: no additional address calculation
• Cons: limited address space

Indirect addressing
• Address field contains the address of a pointer to the operand

    ADD [A]  ; AC = AC + [[A]]

• Multiple memory references
• Pros: large address space
• Cons: slower

Register addressing
• Address field contains the address of a register

    ADD R  ; AC = AC + R

• Pros: only a small address field is needed; shorter instruction and faster fetch; no memory reference
• Cons: limited address space

Register indirect addressing
• Address field contains the address of a register that holds a pointer to the operand

    ADD [R]  ; AC = AC + [R]

• Pros: large address space
• Cons: extra memory reference

Displacement addressing
• Address field could contain a register address and an address
• EA = A + [R × S], or vice versa
• Several variants
– Base-offset: [EBP+8]
– Base-index: [EBX+ESI]
– Scaled: [T+ESI*4]
• Pros: flexible
• Cons: complex

Displacement addressing

    MOV EAX, [A+ESI*4]

• Often a register, called the indexing register, is used for the displacement.
• Usually, a mechanism is provided to efficiently increase the indexing register.
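As a rough illustration, an access like a[i] on 4-byte elements is the kind of thing a compiler may turn into a scaled-index operand such as [A+ESI*4]; the array name and size here are invented.

    #include <stdint.h>

    int a[256];    /* hypothetical array; its base address plays the role of A */

    int load_element(int i)
    {
        /* The address arithmetic spelled out: base + index * 4.
         * On IA32 this may be encoded as  mov eax, [a + esi*4]. */
        return *(int *)((uintptr_t)a + (uintptr_t)i * sizeof a[0]);
    }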

Stack addressing
• Operand is on top of the stack; the address is implicit
• Pros: large address space
• Pros: short and fast instruction fetch
• Cons: limited by the FILO order

Addressing modes

  Mode           Meaning          Pros                  Cons
  Implied                         Fast fetch            Limited instructions
  Immediate      Operand = A      No memory ref         Limited operand
  Direct         EA = A           Simple                Limited address space
  Indirect       EA = [A]         Large address space   Multiple memory refs
  Register       EA = R           No memory ref         Limited address space
  Reg. indirect  EA = [R]         Large address space   Extra memory ref
  Displacement   EA = A + [R]     Flexibility           Complexity
  Stack          EA = stack top   No memory ref         Limited applicability

IA32 addressing modes

Effective address calculation (IA32)

[Diagram: a dummy format for one operand: an 8-bit opcode, a 3-bit base field, a 3-bit index field, a 2-bit scale field s, and an 8- or 32-bit displacement. The base and index fields select registers from the register file, the index value passes through a shifter controlled by s, and an adder combines base, scaled index, and displacement to form the memory address.]

Based addressing
• Effective address is computed as base + signed displacement
– Displacement:
  • 16-bit addresses: 8- or 16-bit number
  • 32-bit addresses: 8- or 32-bit number
• Useful to access fields of a structure or record
– Base register points to the base address of the structure
– Displacement gives the relative offset within the structure
• Useful to access arrays whose element size is not 2, 4, or 8 bytes
– Displacement points to the beginning of the array
– Base register gives the relative offset of an element within the array


Indexed addressing
• Effective address is computed as (index * scale factor) + signed displacement
– 16-bit addresses:
  • displacement: 8- or 16-bit number
  • scale factor: none (i.e., 1)
– 32-bit addresses:
  • displacement: 8- or 32-bit number
  • scale factor: 2, 4, or 8
• Useful to access elements of an array (particularly if the element size is 2, 4, or 8 bytes)
– Displacement points to the beginning of the array
– Index register selects an element of the array (array index)
– Scaling factor gives the size of the array element

Indexed addressing
• Examples

    add AX,[DI+20]
  – We have seen similar usage to access parameters off the stack

    add AX,marks_table[ESI*4]
  – The assembler replaces marks_table by a constant (i.e., supplies the displacement)
  – Each element of marks_table takes 4 bytes (the scale factor value)
  – ESI needs to hold the element subscript value

    add AX,table1[SI]
  – SI needs to hold the element offset in bytes
  – When we use the scale factor we avoid such byte counting

Based-indexed addressing
Based-indexed addressing with no scale factor
• Effective address is computed as base + index + signed displacement
• Useful in accessing arrays passed on to a procedure
– Base register points to the beginning of the array
– Index register represents the offset of an element relative to the base of the array
• Useful in accessing two-dimensional arrays
– Displacement points to the beginning of the array
– Base and index registers point to a row and an element within that row
• Useful in accessing arrays of records
– Displacement represents the offset of a field in a record
– Base and index registers hold a pointer to the base of the array and the offset of an element relative to the base of the array
• Example: assuming BX points to table1

    mov AX,[BX+SI]
    cmp AX,[BX+SI+2]

  compares two successive elements of table1

Based-indexed addressing
Based-indexed addressing with scale factor
• Effective address is computed as base + (index * scale factor) + signed displacement
• Useful in accessing two-dimensional arrays when the element size is 2, 4, or 8 bytes
– Displacement points to the beginning of the array
– Base register holds the offset to a row (relative to the start of the array)
– Index register selects an element of the row
– Scaling factor gives the size of the array element
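A hedged sketch of the address arithmetic this mode supports; the matrix name and dimensions are invented, and the IA32 operand in the comment is only one plausible encoding.

    #include <stdint.h>

    #define ROWS 10
    #define COLS 8

    int32_t m[ROWS][COLS];    /* hypothetical 2-D array of 4-byte elements */

    int32_t get(int row, int col)
    {
        /* Based-indexed with scale factor:
         *   displacement  -> start of m
         *   base register -> row * COLS * 4 (byte offset of the row)
         *   index * 4     -> byte offset of the column within the row
         * e.g. something like  mov eax, [m + ebx + esi*4]  on IA32. */
        uintptr_t row_off = (uintptr_t)row * COLS * sizeof(int32_t);
        uintptr_t col_off = (uintptr_t)col * sizeof(int32_t);
        return *(int32_t *)((uintptr_t)m + row_off + col_off);
    }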