
Advanced Architecture

Computer Organization and Assembly Languages
Yung-Yu Chuang
with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, Robert Sedgewick and Kevin Wayne

Basic microcomputer design
• The clock synchronizes CPU operations.
• The control unit (CU) coordinates the sequence of execution steps.
• The ALU performs arithmetic and logic operations.
• The memory storage unit holds instructions and data for a running program.
• A bus is a group of wires that transfers data from one part to another (data, address, control).
[Figure: CPU (registers, ALU, CU, clock), memory storage unit, and I/O devices connected by the data, control, and address buses.]

Clock
• The clock synchronizes all CPU and bus operations.
• A machine (clock) cycle measures the time of a single operation.
• The clock is used to trigger events.
• It is the basic unit of time: 1 GHz → clock cycle = 1 ns.
• An instruction can take multiple cycles to complete; e.g., multiply on the 8088 takes 50 cycles.

Instruction execution cycle
• Fetch
• Decode
• Fetch operands
• Execute
• Store output
[Figure: program counter, instruction queue, memory, registers, flags, and ALU cooperating across the fetch, decode, and execute steps.]

Multi-stage pipeline
• Pipelining makes it possible for the processor to execute instructions in parallel.
• Instruction execution is divided into discrete stages.
[Figure: a non-pipelined processor, e.g. the 80386. Instruction I-1 passes through stages S1-S6 in cycles 1-6, then I-2 in cycles 7-12. Many wasted cycles.]
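The clock-cycle arithmetic above (1 GHz → 1 ns cycle time; a 50-cycle multiply on the 8088) can be checked with a few lines of Python. A minimal sketch; the 5 MHz clock rate used for the multiply example is an assumption for illustration, not from the slides:

```python
# Clock period is the reciprocal of clock frequency.
def clock_period_ns(freq_hz):
    return 1e9 / freq_hz

# A 1 GHz clock has a 1 ns cycle time.
print(clock_period_ns(1e9))  # 1.0

# A multi-cycle instruction takes cycles * period.
# Example: a 50-cycle multiply (as on the 8088), assuming a 5 MHz clock.
print(50 * clock_period_ns(5e6))  # 10000.0 ns = 10 microseconds
```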
Pipelined execution
• More efficient use of cycles, greater throughput (the 80486 started to use pipelining).
• Pipelining requires buffers between stages:
  – Each buffer holds a single value.
  – Ideal scenario: equal work for each stage. Sometimes this is not possible.
• The slowest stage determines the flow rate of the entire pipeline.
• For k stages and n instructions, the number of required cycles is k + (n − 1), compared to k × n without pipelining.
[Figure: instructions I-1 and I-2 flowing through stages S1-S6 one cycle apart.]

Pipelined execution (cont.)
• Some reasons for unequal work among stages:
  – A complex step cannot be subdivided conveniently.
  – An operation takes a variable amount of time to execute; e.g., operand fetch time depends on where the operands are located: registers, cache, or memory.
  – The complexity of an operation depends on its type: an add may take one cycle, a multiply several cycles.
• Example: operand fetch of I2 takes three cycles, so the pipeline stalls for two cycles.
• Stalls are caused by hazards; they reduce overall throughput.

Wasted cycles (pipelined)
• When one of the stages requires two or more clock cycles, clock cycles are again wasted.

Superscalar
• A superscalar processor has multiple execution pipelines. In the following, note that stage S4 has left and right pipelines (u and v).
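The cycle counts above can be verified with a small sketch; the six stages match the slides' diagrams, and the instruction count is an example value:

```python
def nonpipelined_cycles(k, n):
    # Each instruction occupies all k stages before the next one starts.
    return k * n

def pipelined_cycles(k, n):
    # The first instruction takes k cycles; each later one finishes
    # one cycle after its predecessor: k + (n - 1) total.
    return k + (n - 1)

# Six stages (as in the slides), 100 instructions:
print(nonpipelined_cycles(6, 100))  # 600
print(pipelined_cycles(6, 100))     # 105
```

For large n the pipelined count approaches one instruction per cycle, which is why the slowest stage sets the flow rate.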
[Figure: pipeline diagrams. Left: stage S4 takes two cycles, so instructions back up behind it. Right: S4 is split into parallel pipelines u and v.]
• For k stages and n instructions, when one stage takes two cycles the number of required cycles is k + (2n − 1); with a second pipeline at that stage (superscalar), it is k + n.
• Pentium: 2 pipelines. Pentium Pro: 3.

Pipeline stages
• Pentium 3: 10
• Pentium 4: 20~31
• Next-generation micro-architecture: 14
• ARM7: 3

Hazards
• Three types of hazards:
  – Resource hazards: occur when two or more instructions use the same resource; also called structural hazards.
  – Data hazards: caused by data dependencies between instructions, e.g., the result produced by I1 is read by I2.
  – Control hazards: sequential execution suits pipelining; altering control flow (e.g., branching) causes problems by introducing control dependencies.

Data hazards

  add r1, r2, #10 ; write r1
  sub r3, r1, #20 ; read r1

[Figure: the sub stalls in decode until the add writes r1 back.]
• Forwarding: provides the output result to a waiting instruction as soon as possible, shortening or removing the stall.
[Figure: with forwarding, the ALU result of the add feeds the sub directly.]

Control hazards

  bz r1, target
  add r2, r4, 0
  ...
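A read-after-write dependence like the add/sub pair above can be spotted mechanically. A minimal sketch in Python, using a simplified (hypothetical) instruction representation of destination register plus source registers:

```python
# Each instruction: (dest_register, [source_registers]).
def find_raw_hazards(instrs):
    """Return pairs (i, i+1) where instruction i+1 reads a register
    that the immediately preceding instruction i writes: a
    read-after-write hazard that forces a stall or forwarding."""
    hazards = []
    for i in range(len(instrs) - 1):
        dest, _ = instrs[i]
        _, srcs = instrs[i + 1]
        if dest in srcs:
            hazards.append((i, i + 1))
    return hazards

# add r1, r2, #10 ; sub r3, r1, #20 -- sub reads r1 right after add writes it
prog = [("r1", ["r2"]), ("r3", ["r1"])]
print(find_raw_hazards(prog))  # [(0, 1)]
```

A real pipeline checks more than adjacent pairs, but the adjacent case is the one that stalls longest without forwarding.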
  target: add r2, r3, 0

[Figure: instructions after the branch are fetched and then discarded once the branch resolves.]

Control hazards
• Branches alter control flow:
  – They require special attention in pipelining.
  – Some instructions in the pipeline must be thrown away; how many depends on when we know the branch is taken.
  – The pipeline wastes three clock cycles: the branch penalty.
  – Reducing the branch penalty: determine the branch decision early.

Delayed branch execution
• Effectively reduces the branch penalty.
• We always fetch the instruction following the branch:
  – Why throw it away?
  – Place a useful instruction there to execute: this is called the delay slot.

  add R2,R3,R4           branch target
  branch target    ==>   add R2,R3,R4    ; moved into the delay slot
  sub R5,R6,R7           sub R5,R6,R7

Branch prediction
• Three prediction strategies:
  – Fixed: the prediction is fixed; example: branch-never-taken (not proper for loop structures).
  – Static: the strategy depends on the branch type (conditional branch: always not taken).
  – Dynamic: takes run-time history to make more accurate predictions.

Static prediction
• Improves prediction accuracy over Fixed:

  Instruction type      Distribution (%)   Prediction: taken?   Correct (%)
  Unconditional branch  70 × 0.4 = 28      Yes                  28
  Conditional branch    70 × 0.6 = 42      No                   42 × 0.6 = 25.2
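The three-cycle branch penalty above can be folded into an effective cycles-per-instruction estimate with a standard back-of-the-envelope formula; the branch frequency and misprediction rate below are made-up example numbers, not from the slides:

```python
def effective_cpi(base_cpi, branch_frac, mispredict_rate, penalty):
    # Every mispredicted branch adds `penalty` wasted cycles
    # on top of the base cycles-per-instruction.
    return base_cpi + branch_frac * mispredict_rate * penalty

# Assumed mix: 20% branches, 10% of them mispredicted,
# 3-cycle penalty (the penalty figure is from the slide):
print(effective_cpi(1.0, 0.20, 0.10, 3))  # about 1.06
```

This is why both reducing the penalty (delayed branches, early branch resolution) and reducing the misprediction rate (branch prediction) pay off.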
  – (Static, continued) Loop: always taken.
  – Dynamic: takes run-time history to make more accurate predictions.

  Loop                  10                 Yes                  10 × 0.9 = 9
  Call/return           20                 Yes                  20

  Overall prediction accuracy = 82.2%

Dynamic branch prediction
• Uses run-time history:
  – Takes the past n executions of the branch and makes the prediction.
  – Simple strategy: predict the next branch outcome as the majority of the previous n executions.
  – Example: n = 3 — if two or more of the last three branches were taken, the prediction is "branch taken".
• Depending on the type of mix, we get more than 90% prediction accuracy.

Impact of past n branches on prediction accuracy

  n   Compiler   Business   Scientific
  0   64.1       64.4       70.4
  1   91.9       95.2       86.6
  2   93.3       96.5       90.8
  3   93.7       96.6       91.0
  4   94.5       96.8       91.8
  5   94.7       97.0       92.0

[Figure: two-bit predictor state diagram. States 00 and 01 predict "no branch"; states 10 and 11 predict "branch". A taken branch moves the state toward 11, a not-taken branch toward 00.]

Multitasking
• The OS can run multiple programs at the same time.
• Multiple threads of execution can exist within the same program.
• A scheduler utility assigns a given amount of CPU time to each running program.
• Rapid switching of tasks:
  – gives the illusion that all programs are running at once;
  – the processor must support task switching;
  – scheduling policy: round-robin, priority.

SRAM vs DRAM
[Figure: the basic microcomputer diagram again, with a cache between the CPU and the memory storage unit.]

  Type   Transistors/bit   Access time   Needs refresh?   Cost   Applications
  SRAM   4 or 6            1X            No               100X   cache memories
  DRAM   1                 10X           Yes              1X     main memories, frame buffers

The CPU-Memory gap
• The gap widens between DRAM, disk, and CPU speeds.
[Figure: access time in ns (log scale), 1980-2000, for disk seek time, DRAM access time, SRAM access time, and CPU cycle time.]

Memory hierarchies
• Some fundamental and enduring properties of hardware and software:
  – Fast storage technologies cost more per byte, have less capacity, and require more power (heat!).
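The two-bit scheme in the state diagram above (00/01 predict no branch, 10/11 predict branch) can be sketched as a saturating counter:

```python
class TwoBitPredictor:
    """Two-bit saturating counter: states 0..3. States 0 and 1
    predict not-taken; states 2 and 3 predict taken, matching
    the 00/01/10/11 state diagram."""
    def __init__(self, state=0):
        self.state = state

    def predict(self):
        return self.state >= 2  # True = predict branch taken

    def update(self, taken):
        # Move one step toward 3 on a taken branch,
        # one step toward 0 on a not-taken branch.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

# A loop branch taken many times, then exiting once: the single
# not-taken outcome moves the counter only one step, so the
# predictor still says "taken" the next time the loop is entered.
p = TwoBitPredictor()
for _ in range(4):
    p.update(True)
p.update(False)
print(p.predict())  # True
```

This hysteresis is why two-bit predictors handle loop exits better than a single-bit last-outcome predictor, which would mispredict twice per loop.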
  – The gap between CPU and main memory speed is widening.
  – Well-written programs tend to exhibit good locality.
• These properties suggest an approach for organizing memory and storage systems known as a memory hierarchy.

                        register   cache   memory   disk
  Access time (cycles)  1          1-10    50-100   20,000,000

Memory system in practice
• L0: registers
• L1: on-chip L1 cache (SRAM)
• L2: off-chip L2 cache (SRAM)
• L3: main memory
• Smaller, faster, and more expensive (per byte) storage devices sit nearer the top of the hierarchy.

Reading from memory
• Multiple machine cycles are required when reading from memory, because memory responds much more slowly than the CPU (e.g. 33 MHz). The wasted clock cycles are called wait states.
• L1 data cache: 1-cycle latency.
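The access-time figures above feed a standard average-memory-access-time estimate, which shows why a cache hides most of the wait states; the 10% miss rate below is an example number, not from the slides:

```python
def avg_access_time(hit_time, miss_rate, miss_penalty):
    # Standard average memory access time (AMAT) formula:
    # every access pays hit_time; misses also pay miss_penalty.
    return hit_time + miss_rate * miss_penalty

# Cache hit in 1 cycle; assume 10% of accesses miss and pay a
# 50-cycle main-memory access (hit and penalty figures taken
# from the access-time table above):
print(avg_access_time(1, 0.1, 50))  # 6.0
```

Even with a 50-cycle memory, a cache that hits 90% of the time keeps the average access near a handful of cycles, which is the point of the hierarchy.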