Computer Organization & Assembly Language Programming
CSE 2312
Lecture 4 Processor
1 Metric Units
The principal metric prefixes.
2 Measuring Computer and CPU Performance
• Elapsed time
  – Total response time, including all aspects, such as processing, I/O, OS overhead, and idle time
  – Determines system performance
• CPU time
  – Time spent executing this program's instructions
  – CPU time spent on processing a given job; discounts I/O time and other jobs' shares
  – User CPU time + system CPU time
  – Different programs are affected differently by CPU and system performance
3 CPU Clock
• Every action is driven by a clock in the CPU
• Clock time = 1/Frequency
  – 1 MHz clock → clock time = 10⁻⁶ seconds
  – 1 GHz clock → clock time = 10⁻⁹ seconds
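The reciprocal relationship above can be sketched in a few lines of Python (a minimal illustration; the function name is ours, not from the slides):

```python
def clock_period_seconds(frequency_hz):
    """Clock time (period) is the reciprocal of clock frequency."""
    return 1.0 / frequency_hz

print(clock_period_seconds(1e6))  # 1 MHz -> 1e-06 s
print(clock_period_seconds(1e9))  # 1 GHz -> 1e-09 s
```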
4 How Long Does an Instruction Take?
• Digital logic is controlled by a clock
[Timing diagram: successive clock cycles along a time axis; data transfer and computation occur within each clock period, with state updated at the cycle boundary]
• Clock period: duration of a clock cycle
  – e.g., 250ps = 0.25ns = 250×10⁻¹² s
• Clock frequency (rate): cycles per second
  – e.g., 4.0GHz = 4000MHz = 4.0×10⁹ Hz
5 Predicting CPU Time
• Ideal: only need to know the number of instructions
  CPU Time = Instructions × Clock Cycle Time = Instructions / Clock Rate
• Reality: some instructions take longer than others
  CPU Time = (Instructions / Program) × (Clock cycles / Instruction) × (Seconds / Clock cycle)
6 Instruction Count and Cycles Per Instruction
• IC is determined by program, ISA, and compiler
• CPI is determined by CPU and other factors – Different instructions have different CPI – Average CPI is affected by instruction mix
Clock Cycles = Instruction Count (IC) × Cycles per Instruction (CPI)
CPU Time = IC × CPI × Clock Cycle Time = IC × CPI / Clock Rate
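The CPU time equation above is easy to check numerically (a minimal sketch; the function name and sample numbers are ours):

```python
def cpu_time(instruction_count, avg_cpi, clock_rate_hz):
    """CPU Time = IC x CPI / Clock Rate, per the performance equation."""
    return instruction_count * avg_cpi / clock_rate_hz

# e.g., 10^9 instructions at an average CPI of 2.0 on a 2 GHz clock:
print(cpu_time(1e9, 2.0, 2e9))  # 1.0 second
```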
7 Improving CPU Time
• Usually a tradeoff:
  CPU Time = CPU Clock Cycles × Clock Cycle Time = CPU Clock Cycles / Clock Rate
  Clock Cycles = Σᵢ₌₁ⁿ (CPIᵢ × Instruction Countᵢ)
  CPI = Clock Cycles / Instruction Count = Σᵢ₌₁ⁿ CPIᵢ × (Instruction Countᵢ / Instruction Count)
  – The term (Instruction Countᵢ / Instruction Count) is the relative frequency of instruction class i
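The weighted-sum formula above can be sketched as follows (an illustrative helper, not from the slides; the example mix anticipates the next slide):

```python
def average_cpi(class_cpis, class_counts):
    """Average CPI = sum over classes of CPI_i x relative frequency_i."""
    total = sum(class_counts)
    return sum(cpi * n / total for cpi, n in zip(class_cpis, class_counts))

# Classes A, B, C with CPIs 1, 2, 3 and an instruction mix of 2, 1, 2:
print(average_cpi([1, 2, 3], [2, 1, 2]))  # 2.0
```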
8 Compiler Matters!
• Suppose the compiler has two choices: it can use 5 or 6 instructions, as described below:

  Class             A   B   C
  CPI for class     1   2   3
  IC in sequence 1  2   1   2
  IC in sequence 2  4   1   1

• Sequence 1: IC = 5
  – Clock Cycles = 2×1 + 1×2 + 2×3 = 10
  – Avg. CPI = 10/5 = 2.0
• Sequence 2: IC = 6
  – Clock Cycles = 4×1 + 1×2 + 1×3 = 9
  – Avg. CPI = 9/6 = 1.5
• Which is better? Sequence 2 takes fewer clock cycles (9 vs. 10) and has a lower average CPI, so it is better, despite executing more instructions.

9 Comparing Performance
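The two compiler choices can be compared with a short script (a minimal sketch; function and variable names are ours):

```python
def clock_cycles(class_cpis, class_counts):
    """Total clock cycles = sum over classes of CPI_i x IC_i."""
    return sum(cpi * n for cpi, n in zip(class_cpis, class_counts))

cpis = [1, 2, 3]                   # CPI for classes A, B, C
seq1, seq2 = [2, 1, 2], [4, 1, 1]  # instruction counts per class
print(clock_cycles(cpis, seq1))    # 10 cycles (avg CPI 10/5 = 2.0)
print(clock_cycles(cpis, seq2))    # 9 cycles (avg CPI 9/6 = 1.5)
```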
• Performance = 1 / Execution Time
• “X is n times faster than Y”
n = Performance_X / Performance_Y = Execution Time_Y / Execution Time_X
• Example: time taken to run a program – 10s on A, 15s on B… how much faster is A?
– Execution Time_B / Execution Time_A = 15s / 10s = 1.5
– So A is 1.5 times faster than B
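The ratio definition above is a one-liner (an illustrative helper, named by us):

```python
def times_faster(exec_time_y, exec_time_x):
    """'X is n times faster than Y' means n = Execution Time_Y / Execution Time_X."""
    return exec_time_y / exec_time_x

print(times_faster(15.0, 10.0))  # 1.5 -> A is 1.5 times faster than B
```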
10 CPI Example
• Computer A: Cycle Time = 250ps, CPI = 2.0
• Computer B: Cycle Time = 500ps, CPI = 1.2
• Same ISA
• Which is faster, and by how much?
CPU Time_A = Instruction Count × CPI_A × Cycle Time_A = I × 2.0 × 250ps = I × 500ps
CPU Time_B = Instruction Count × CPI_B × Cycle Time_B = I × 1.2 × 500ps = I × 600ps
CPU Time_B / CPU Time_A = (I × 600ps) / (I × 500ps) = 1.2
A is faster, by a factor of 1.2.
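The comparison can be verified in code; note the instruction count cancels out of the ratio (a minimal sketch with names of our choosing):

```python
def cpu_time_ps(ic, cpi, cycle_time_ps):
    """CPU Time = IC x CPI x Cycle Time (here in picoseconds)."""
    return ic * cpi * cycle_time_ps

ic = 1_000_000                 # any count works; the ratio is independent of IC
a = cpu_time_ps(ic, 2.0, 250)  # computer A
b = cpu_time_ps(ic, 1.2, 500)  # computer B
print(b / a)  # 1.2 -> computer A is 1.2x faster
```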
11 CPU Example
• Computer A:
  – 2GHz clock, 10s CPU time
• Let's design Computer B:
  – Aim for 6s CPU time
  – Can use a faster clock, but this causes 1.2× as many clock cycles
• How fast must the new clock be?
Clock Rate_B = Clock Cycles_B / CPU Time_B = (1.2 × Clock Cycles_A) / 6s
Clock Cycles_A = CPU Time_A × Clock Rate_A = 10s × 2GHz = 20×10⁹
Clock Rate_B = (1.2 × 20×10⁹) / 6s = 24×10⁹ / 6s = 4GHz
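The same derivation in code form (an illustrative sketch; the function name and parameter names are ours):

```python
def required_clock_rate_hz(old_time_s, old_rate_hz, cycle_growth, new_time_s):
    """Clock Rate_B = (cycle growth x Clock Cycles_A) / CPU Time_B."""
    old_cycles = old_time_s * old_rate_hz  # Clock Cycles_A = time x rate
    return cycle_growth * old_cycles / new_time_s

print(required_clock_rate_hz(10, 2e9, 1.2, 6))  # 4000000000.0 Hz = 4 GHz
```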
12 Time for a Program
• CPU executes various instructions
• A program has a number of instructions; how many?
  – Depends on the program and compiler
• Each instruction takes a number of CPU cycles; how many?
  – Depends on the Instruction Set Architecture (ISA)
  – ISA: learned in this course
• Each cycle has a fixed time based on CPU and bus speed
  – Depends on the hardware organization (computer architecture), also learned in this course
13 CPU Performance Equation
14 Performance Summary
• Performance depends on
  – Algorithm: affects IC, possibly CPI
  – Programming language: affects IC, CPI
  – Compiler: affects IC, CPI
– Instruction set architecture: affects IC, CPI, Tc
CPU Time = (Instructions / Program) × (Clock cycles / Instruction) × (Seconds / Clock cycle)
15 How to Improve Performance?
We must lower execution time!
• Algorithm
  – Determines number of operations executed
• Programming language, compiler, architecture
  – Determine number of machine instructions executed per operation (IC)
• Processor and memory system
  – Determine how fast instructions are executed (CPI)
• I/O system (including OS)
  – Determines how fast I/O operations are executed
16 Amdahl’s Law
• Improving one aspect of a computer won’t give a proportional improvement in overall performance
T_improved = T_affected / (improvement factor) + T_unaffected
• Especially true of multicore computers
• So make the common case fast!
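Amdahl's Law above can be made concrete with a small example (a minimal sketch; the function name and sample times are ours):

```python
def amdahl_time(t_affected, t_unaffected, improvement_factor):
    """T_improved = T_affected / improvement factor + T_unaffected."""
    return t_affected / improvement_factor + t_unaffected

# Speed up 80 s of a 100 s program by 4x; the other 20 s is untouched:
t_new = amdahl_time(80, 20, 4)
print(t_new)        # 40.0
print(100 / t_new)  # 2.5 -> overall speedup is only 2.5x, not 4x
```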
17 Exercise 1
• Problem
  – There are 3 classes of instructions: A, B, C. Suppose the compiler has two choices, Sequence 1 and Sequence 2, as described below:

  Class             A   B   C
  CPI for class     1   2   3
  IC in sequence 1  2   1   2
  IC in sequence 2  3   1   1

• Which one is better? Why?
• Sequence 1: IC = 5
  – Clock Cycles = 2×1 + 1×2 + 2×3 = 10
  – Avg. CPI = 10/5 = 2.0
• Sequence 2: IC = 5
  – Clock Cycles = 3×1 + 1×2 + 1×3 = 8
  – Avg. CPI = 8/5 = 1.6
• Sequence 2 has a lower average CPI (and fewer clock cycles), so it is better.

18 Exercise 2
• Problem:
  – There are two computers: A and B.
  – Computer A: Cycle Time = 250ps, CPI = 2.0
  – Computer B: Cycle Time = 400ps, CPI = 1.5
  – If they have the same ISA, which computer is faster?
  – How many times faster is it than the other?
• Answer:
  – We know that CPU Time = IC × CPI × Cycle Time
  – Therefore, CPU Time(A) = IC × 2.0 × 250ps = 500ps × IC
  – CPU Time(B) = IC × 1.5 × 400ps = 600ps × IC
  – So, A is (600/500) = 1.2 times faster.
19 Exercise 3
• Problem:
  – Computer A has a 2GHz clock. It takes 10s of CPU time to finish a given task.
  – We want to design Computer B to finish the same task within 5s of CPU time.
  – The clock cycle count for Computer B is 2 times that of Computer A.
  – What clock rate should be designed for Computer B?
• Answer:
  Clock Rate_B = Clock Cycles_B / CPU Time_B = (2 × Clock Cycles_A) / 5s
  Clock Cycles_A = CPU Time_A × Clock Rate_A = 10s × 2GHz = 20×10⁹
  Clock Rate_B = (2 × 20×10⁹) / 5s = 40×10⁹ / 5s = 8GHz

20 Central Processing Unit (CPU)
The organization of a simple computer with one CPU and two I/O devices
21 Basic Elements
Other devices: Cache; Virtual Memory Support (MMU); ….
22 Processor
• CPU
  – Brain of the computer; executes programs stored in main memory by fetching instructions, examining them, and executing them one after another
• Bus
  – Connects different components
  – Parallel wires for transmitting address, data, and control signals
  – Can be external to the CPU (connecting to memory, I/O) or internal
• Control Unit
  – Fetches instructions from main memory and determines their types
• Arithmetic Logic Unit (ALU)
  – Performs arithmetic operations (such as addition) and Boolean operations
• Registers
  – High-speed memory used to store temporary results and control information
  – Program Counter (PC): points to the next instruction to be fetched
  – Instruction Register (IR): holds the instruction currently being executed
23 CPU Organization
• Instructions:
  – Register-Memory: memory words are fetched into registers
  – Register-Register: both operands come from registers
• Data Path Cycle
  – The process of running two operands through the ALU and storing the result
  – Defines what the machine can do
  – The faster the data path cycle is, the faster the machine runs

[Figure: the data path of a typical von Neumann machine]

24 Arithmetic Logic Unit (ALU)
• Conducts different calculations
  – +, −, ×, /
  – and, or, xor, not
  – shift, …
• Variants
  – Integer, floating point, double precision
  – A high-performance CPU has multiple ALUs!
• Input
  – Operands: A, B
  – Operation code (op): obtained from the encoded instruction
• Output
  – Result: C
  – Status codes, usually:
    c - carry out from +, −, ×, shift
    n - result is negative
    z - result is zero
    v - result overflowed

25 Instruction Execution Steps
• Fetch-decode-execute: central to the operation of all computers
  1. Fetch the next instruction from memory into the instruction register
  2. Change the program counter to point to the following instruction
  3. Determine the type of instruction just fetched
  4. If the instruction uses a word in memory, determine where it is
  5. Fetch the word, if needed, into a CPU register
  6. Execute the instruction
  7. Go to step 1 to begin executing the following instruction
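The loop above can be sketched as a toy interpreter. This is a minimal illustration for a hypothetical two-operand accumulator machine of our own invention, not the textbook's Java interpreter:

```python
# Toy fetch-decode-execute loop. Instructions are (opcode, operand) pairs
# in a hypothetical accumulator-based ISA invented for this sketch.
def run(memory):
    pc, acc = 0, 0                    # program counter, accumulator
    while True:
        opcode, operand = memory[pc]  # fetch into the "instruction register"
        pc += 1                       # point PC at the following instruction
        if opcode == "LOAD":          # decode, then execute
            acc = operand
        elif opcode == "ADD":
            acc += operand
        elif opcode == "HALT":
            return acc                # loop otherwise continues at step 1

print(run([("LOAD", 5), ("ADD", 7), ("HALT", 0)]))  # 12
```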
26 An interpreter for a simple computer (written in Java)
• Figure 2-3. An interpreter for a simple computer (written in Java).
27 An interpreter for a simple computer (cont'd)
28 Interpreting Instructions
• Interpreter
  – A program that fetches, examines, and executes the instructions of another program
  – Can write a program to imitate the function of a CPU
  – Main advantage: the ability to design a simple processor that supports a rich set of instructions
• Benefits (simple computer with interpreted instructions)
  – The ability to fix incorrectly implemented instructions, or make up for design deficiencies in the basic hardware
  – The opportunity to add new instructions at minimal cost, even after delivery of the machine
  – Structured design that permits efficient development, testing, and documentation of complex instructions
29 RISC vs. CISC
• Semantic gap between
  – What the machine can do
  – What high-level programming languages require
• Reduced Instruction Set Computer (RISC) (Lego building example)
  – Did not use interpretation
  – Did not have to be backward compatible with existing products
  – Small number of instructions, about 50
• Key to designing RISC instructions
  – Instructions should be able to be issued quickly
  – How long an instruction actually takes matters less than how many can be started per second
• Complex Instruction Set Computer (CISC)
  – Around 200-300 instructions; e.g., DEC VAX and IBM mainframes
• Intel (486 and up)
  – A RISC core executes the simplest (most common) instructions
  – More complicated instructions are interpreted in the usual CISC way
30 RISC Design Principles for Modern Computers
• Instructions directly executed by hardware
  – Eliminating a level of interpretation provides high speed for most instructions
  – Using CISC instructions with interpretation for less frequently occurring instructions is acceptable
• Maximize the rate at which instructions are issued
  – Parallelism can play a major role in improving performance
• Instructions should be easy to decode
  – A critical limit on the issue rate is decoding individual instructions to determine what resources they need
  – The fewer different instruction formats, the better
• Only loads and stores should reference memory
  – Accessing memory can take a long time
  – All other instructions should operate only on registers
• Provide plenty of registers
  – Running out of registers forces flushing them back to memory
  – Memory access is slow
31 Instruction-Level Parallelism
• A five-stage pipeline – The state of each stage as a function of time. – Nine clock cycles are illustrated
32 Pipelining
• A five-stage pipeline
  – Suppose the cycle time is 2ns
  – It takes 10ns for an instruction to progress all the way through the five-stage pipeline
  – So, does the machine run at 100 MIPS? No: a new instruction completes every cycle, so the actual rate is 500 MIPS
• Pipelining
  – Allows a tradeoff between latency and processor bandwidth
  – Latency: how long it takes to execute one instruction
  – Processor bandwidth: how many MIPS the CPU delivers
• Example
  – Suppose a complex instruction takes 10ns. Under perfect conditions, how many pipeline stages should we design to guarantee 500 MIPS?
  – Each stage: 1 / 500 MIPS = 2ns; 10ns / 2ns = 5 stages
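The MIPS arithmetic above can be checked in code (a minimal sketch; function names are ours, and times are in nanoseconds, so 1 completed instruction per ns equals 1000 MIPS):

```python
def mips(completion_interval_ns):
    """MIPS = 1 / (time between completed instructions); 1 ns -> 1000 MIPS."""
    return 1000.0 / completion_interval_ns

def stages_needed(instruction_time_ns, target_mips):
    """Stage time must equal 1/target rate; divide total latency by it."""
    stage_time_ns = 1000.0 / target_mips
    return instruction_time_ns / stage_time_ns

print(mips(2))                 # 500.0 -> pipelined, one result every 2 ns
print(mips(5 * 2))             # 100.0 -> unpipelined, 10 ns per instruction
print(stages_needed(10, 500))  # 5.0 stages
```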
33 Superscalar Architectures (1)
• Dual five-stage pipelines with a common instruction fetch unit
  – Fetches pairs of instructions together and puts each one into its own pipeline
  – The two instructions must not conflict over resource usage
  – Neither may depend on the result of the other

34 Superscalar Architectures (2)
• Implicit idea – S3 stage can issue instructions considerably faster than the S4 stage is able to execute them
A superscalar processor with five functional units.
35 Processor-Level Parallelism (1)
• An array processor (SIMD)
  – A large number of identical processors that perform the same sequence of instructions on different sets of data
  – Different from a standard von Neumann machine

36 Processor-Level Parallelism (2)
• A single-bus multiprocessor
  – Example: locate the white ball in a picture
• A multicomputer with local memories

37 Processor-Level Parallelism (3)
• Multicomputers (loosely coupled)
  – Easier to build
• Multiprocessors (tightly coupled)
  – Easier to program

38 Exercise
• Ex 1: TRUE or FALSE? Why?
  – The data path cycle defines what the computer can do. The longer the data path cycle is, the faster the computer runs.
  – Answer: FALSE
  – Reason: The data path cycle does define what the computer can do, but the shorter/faster the data path cycle is, the faster the computer runs.
39 Exercise
• Ex 2: What are the design principles for modern computers?
  – (a) Instructions directly executed by hardware
  – (b) Minimize rate at which instructions are issued
  – (c) Instructions should be easy to decode
  – (d) Only loads, stores should reference memory
  – (e) Provide plenty of registers
– Answer: [a, c, d, e]
40 Exercise
• Ex 3: The following diagram gives the organization of a simple computer with one CPU and two I/O devices. – Is it correct? If not, please correct it in the diagram.
• Solution
  – Incorrect.
  – In place of Disk, it should be Register.
  – In place of Register, it should be Disk.
41 Exercise
• Ex 4: Consider a computer using pipelining in which each instruction has 5 stages, and each stage takes 2ns.
  – What is the maximum number of MIPS this machine is capable of with the 5-stage pipeline?
  – What is the maximum number of MIPS this machine is capable of in the absence of pipelining?
• Solution
  – With pipelining: 1 / 2ns = 500 MIPS
  – Without pipelining: 500/5 = 100 MIPS, or 1 / (5 × 2ns) = 100 MIPS