Advanced Architecture Intel Microprocessor History
Total Page:16
File Type:pdf, Size:1020Kb
Advanced Architecture Intel microprocessor history Computer Organization and Assembly Languages Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, Robert Sedgwick and Kevin Wayne Early Intel microprocessors The IBM-AT • Intel 8080 (1972) • Intel 80286 (1982) – 64K addressable RAM – 16 MB addressable RAM – 8-bit registers – Protected memory – CP/M operating system – several times faster than 8086 – 5,6,8,10 MHz – introduced IDE bus architecture – 29K transistors – 80287 floating point unit • Intel 8086/8088 (1978) my first computer (1986) – Up to 20MHz – IBM-PC used 8088 – 134K transistors – 1 MB addressable RAM –16-bit registers – 16-bit data bus (8-bit for 8088) – separate floating-point unit (8087) – used in low-cost microcontrollers now 3 4 Intel IA-32 Family Intel P6 Family • Intel386 (1985) • Pentium Pro (1995) – 4 GB addressable RAM – advanced optimization techniques in microcode –32-bit registers – More pipeline stages – On-board L2 cache – paging (virtual memory) • Pentium II (1997) – Up to 33MHz – MMX (multimedia) instruction set • Intel486 (1989) – Up to 450MHz – instruction pipelining • Pentium III (1999) – Integrated FPU – SIMD (streaming extensions) instructions (SSE) – 8K cache – Up to 1+GHz • Pentium (1993) • Pentium 4 (2000) – Superscalar (two parallel pipelines) – NetBurst micro-architecture, tuned for multimedia – 3.8+GHz • Pentium D (2005, Dual core) 5 6 IA32 Processors ARM history • Totally Dominate Computer Market • 1983 developed by Acorn computers • Evolutionary Design – To replace 6502 in BBC computers – Starting in 1978 with 8086 – 4-man VLSI design team – Added more features as time goes on – Its simplicity comes from the inexperience team – Still support old features, although obsolete – Match the needs for generalized SoC for reasonable power, performance and die size • Complex Instruction Set Computer (CISC) – The first commercial RISC implemenation – Many different instructions with many different formats • 1990 ARM (Advanced RISC Machine), owned by • But, only small subset encountered with Linux programs Acorn, Apple and VLSI – Hard to match performance of Reduced Instruction Set Computers (RISC) – But, Intel has done just that! ARM Ltd Why ARM? Design and license ARM core design but not fabricate • One of the most licensed and thus widespread processor cores in the world – Used in PDA, cell phones, multimedia players, handheld game console, digital TV and cameras – ARM7: GBA, iPod – ARM9: NDS, PSP, Sony Ericsson, BenQ – ARM11: Apple iPhone, Nokia N93, N800 – 90% of 32-bit embedded RISC processors till 2009 • Used especially in portable devices due to its low power consumption and reasonable performance ARM powered products Performance boost • Increasing clock rate is insufficient. Architecture (pipeline/cache/SIMD) becomes more significant. In his 1965 paper, Intel co-founder Gordon Moore observed that “the number of transistors per square inch had doubled every 18 months. 12 Basic microcomputer design • clock synchronizes CPU operations • control unit (CU) coordinates sequence of execution steps Basic architecture • ALU performs arithmetic and logic operations data bus registers I/O I/O Central Processor Unit Memory Storage Device Device (CPU) Unit #1 #2 ALU CU clock control bus address bus Basic microcomputer design Clock • The memory storage unit holds instructions and • synchronizes all CPU and BUS operations data for a running program • machine (clock) cycle measures time of a single • A bus is a group of wires that transfer data from operation one part to another (data, address, control) • clock is used to trigger events data bus one cycle registers 1 I/O I/O Central Processor Unit Memory Storage Device Device (CPU) Unit 0 #1 #2 ALU CU clock • Basic unit of time, 1GHz→clock cycle=1ns control bus • An instruction could take multiple cycles to address bus complete, e.g. multiply in 8088 takes 50 cycles Instruction execution cycle program counter instruction queue PC program •Fetch I-1 I-2 I-3 I-4 Pipeline memory fetch •Decode op1 read op2 •Fetch registers registers operands instruction I-1 register •Execute decode • Store output write write flags ALU execute (output) Multi-stage pipeline Pipelined execution • Pipelining makes it possible for processor to • More efficient use of cycles, greater throughput execute instructions in parallel of instructions: (80486 started to use pipelining) • Instruction execution divided into discrete stages Stages Stages For k stages and S1 S2 S3 S4 S5 S6 S1 S2 S3 S4 S5 S6 Example of a non- 1 I-1 1 I-1 n instructions, the pipelined processor. 2 I-1 number of 3 I-1 2 I-2 I-1 For example, 80386. 4 I-1 3 I-2 I-1 required cycles is: Many wasted cycles. 5 I-1 4 I-2 I-1 6 I-1 k + (n –1) Cycles 5 I-2 I-1 7 I-2 Cycles 8 I-2 6 I-2 I-1 compared to k*n 9 I-2 7 I-2 10 I-2 11 I-2 12 I-2 Pipelined execution Pipelined execution • Pipelining requires buffers • Some reasons for unequal work stages – Each buffer holds a single value – A complex step cannot be subdivided conveniently – Ideal scenario: equal work for each stage – An operation takes variable amount of time to execute, e.g. operand fetch time depends on where • Sometimes it is not possible the operands are located • Slowest stage determines the flow rate in the •Registers entire pipeline • Cache •Memory – Complexity of operation depends on the type of operation • Add: may take one cycle • Multiply: may take several cycles Pipelined execution Wasted cycles (pipelined) • Operand fetch of I2 takes three cycles • When one of the stages requires two or more – Pipeline stalls for two cycles clock cycles, clock cycles are again wasted. • Caused by hazards Stages – Pipeline stalls reduce overall throughput exe S1 S2 S3 S4 S5 S6 1 I-1 For k stages and n 2 I-2 I-1 instructions, the 3 I-3 I-2 I-1 number of required 4 I-3 I-2 I-1 5 I-3 I-1 cycles is: Cycles 6 I-2 I-1 k + (2n –1) 7 I-2 I-1 8 I-3 I-2 9 I-3 I-2 10 I-3 11 I-3 Superscalar Pipeline stages A superscalar processor has multiple execution • Pentium 3: 10 pipelines. In the following, note that Stage S4 • Pentium 4: 20~31 has left and right pipelines (u and v). • Next-generation micro-architecture: 14 Stages For k states and n •ARM7: 3 S4 instructions, the S1 S2 S3 u v S5 S6 1 I-1 number of required 2 I-2 I-1 cycles is: 3 I-3 I-2 I-1 4 I-4 I-3 I-2 I-1 k + n 5 I-4 I-3 I-1 I-2 Cycles 6 I-4 I-3 I-2 I-1 7 I-3 I-4 I-2 I-1 8 I-4 I-3 I-2 9 I-4 I-3 Pentium: 2 pipelines 10 I-4 Pentium Pro: 3 Hazards Data hazards • Three types of hazards add r1, r2, #10 ; write r1 – Resource hazards sub r3, r1, #20 ; read r1 • Occurs when two or more instructions use the same resource, also called structural hazards – Data hazards • Caused by data dependencies between instructions, fetch decode reg ALU wb e.g. result produced by I1 is read by I2 – Control hazards fetch decode stall reg ALU wb • Default: sequential execution suits pipelining • Altering control flow (e.g., branching) causes problems, introducing control dependencies Data hazards Data hazards • Forwarding: provides output result as soon as • Forwarding: provides output result as soon as possible possible add r1, r2, #10 ; write r1 add r1, r2, #10 ; write r1 sub r3, r1, #20 ; read r1 sub r3, r1, #20 ; read r1 fetch decode reg ALU wb fetch decode reg ALU wb fetch decode stall reg ALU wb fetch decode stall reg ALU wb fetch decode stall reg ALU wb Control hazards Control hazards bz r1, target • Braches alter control flow add r2, r4, 0 – Require special attention in pipelining ... – Need to throw away some instructions in the target: add r2, r3, 0 pipeline fetch decode reg ALU wb • Depends on when we know the branch is taken • Pipeline wastes three clock cycles fetch decode reg ALU wb – Called branch penalty – Reducing branch penalty fetch decode reg ALU wb • Determine branch decision early fetch decode reg ALU wb fetch decode reg ALU Control hazards Branch prediction • Delayed branch execution • Three prediction strategies – Effectively reduces the branch penalty –Fixed – We always fetch the instruction following the branch • Prediction is fixed • Why throw it away? –Example: branch-never-taken » Not proper for loop structures • Place a useful instruction to execute – Static • This is called delay slot Delay slot • Strategy depends on the branch type – Conditional branch: always not taken add R2,R3,R4 branch target – Loop: always taken branch target add R2,R3,R4 –Dynamic • Takes run-time history to make more accurate predictions sub R5,R6,R7 sub R5,R6,R7 . Branch prediction Branch prediction • Static prediction • Dynamic branch prediction – Improves prediction accuracy over Fixed – Uses runtime history • Takes the past n branch executions of the branch type and makes the prediction Instruction type Instruction Prediction: Correct Distribution Branch prediction –Simple strategy (%) taken? (%) • Prediction of the next branch is the majority of the Unconditional 70*0.4 = 28 Yes 28 previous n branch executions branch •Example: n = 3 Conditional 70*0.6 = 42 No 42*0.6 = 25.2 – If two or more of the last three branches were taken, the branch prediction is “branch taken” Loop 10 Yes 10*0.9 = 9 • Depending on the type of mix, we get more than 90% prediction accuracy Call/return 20 Yes 20 82.2% Overall prediction accuracy = Branch prediction Branch prediction • Impact of past n branches on prediction 00 01 accuracy no branch branch Predict Predict Type of mix no branch no branch n Compiler Business Scientific 0 64.1 64.4 70.4 no branch 1 91.9 95.2 86.6 no branch branch 2 93.3 96.5 90.8 branch 3 93.7 96.6 91.0 4 94.5 96.8 91.8 10 no 11 branch 5 94.7 97.0 92.0 Predict Predict branch branch branch Multitasking • OS can run multiple programs at the same time.