Computer Organization & Assembly Language Programming

CSE 2312

Lecture 4 Processor

1 Metric Units

The principal metric prefixes.

2 Measuring Computer and CPU Performance

• Elapsed time
  – Total response time, including all aspects, such as processing, I/O, OS overhead, and idle time
  – Determines system performance
• CPU time
  – Time spent executing this program’s instructions
  – CPU time spent on processing a given job; discounts I/O time and other jobs’ shares
  – User CPU time + system CPU time
  – Different programs are affected differently by CPU and system performance

3 CPU Clock

• Every action is driven by a clock in the CPU

• Clock time = 1/Frequency
  – 1 MHz clock → clock time = 10⁻⁶ seconds
  – 1 GHz clock → clock time = 10⁻⁹ seconds
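This reciprocal relationship can be checked with a minimal Python sketch (the function name is ours, for illustration):

```python
# Clock period is the reciprocal of clock frequency.
def clock_period_seconds(frequency_hz: float) -> float:
    """Duration of one clock cycle, in seconds."""
    return 1.0 / frequency_hz

print(clock_period_seconds(1e6))  # 1 MHz clock -> 1e-06 s
print(clock_period_seconds(1e9))  # 1 GHz clock -> 1e-09 s
```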

4 How Long Does an Instruction Take?

• Digital logic is controlled by a clock

[Figure: clock signal with the clock period labeled; data transfer and computation occur within a cycle, followed by a state update]

• Clock period: duration of a clock cycle
  – e.g., 250 ps = 0.25 ns = 250×10⁻¹² s

• Clock frequency (rate): cycles per second
  – e.g., 4.0 GHz = 4000 MHz = 4.0×10⁹ Hz

5 Predicting CPU Time

• Ideal: Only need to know number of instructions

CPU Time  InstructionsClock Cycle Time Instructions  Clock Rate

• Reality: Some instructions take longer than others

CPU Time = Instruction Count × Cycles per Instruction × Clock Cycle Time
         = (Instructions/Program) × (Clock cycles/Instruction) × (Seconds/Clock cycle)

6 Instruction Count and Cycles Per Instruction

• IC is determined by program, ISA, and compiler

• CPI is determined by CPU and other factors – Different instructions have different CPI – Average CPI is affected by instruction mix

Clock Cycles  Instruction Count (IC)  Cycles per Instruction (CPI) CPU Time  ICCPI Clock Cycle Time ICCPI  Clock Rate

7 Improving CPU Time Usually a tradeoff

CPU Time  CPU Clock CyclesClock Cycle Time CPU Clock Cycles  Clock Rate

Clock Cycles = Σᵢ₌₁ⁿ (CPIᵢ × Instruction Countᵢ)

CPI = Clock Cycles / Instruction Count = Σᵢ₌₁ⁿ CPIᵢ × (Instruction Countᵢ / Instruction Count)

The last factor, Instruction Countᵢ / Instruction Count, is the relative frequency of instruction class i.

8 Compiler Matters!

• Suppose the compiler has two choices: it can use 5 or 6 instructions, as described below:

Class              A   B   C
CPI for class      1   2   3
IC in sequence 1   2   1   2
IC in sequence 2   4   1   1
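The cycle counts and average CPI of the two sequences can be checked with a short sketch (names are ours, not from the text):

```python
# CPI for instruction classes A, B, C, as in the table above.
CPI_BY_CLASS = {"A": 1, "B": 2, "C": 3}

def cycles_and_avg_cpi(instruction_counts):
    """Total clock cycles and average CPI for a mix of instruction classes."""
    cycles = sum(CPI_BY_CLASS[c] * n for c, n in instruction_counts.items())
    ic = sum(instruction_counts.values())
    return cycles, cycles / ic

print(cycles_and_avg_cpi({"A": 2, "B": 1, "C": 2}))  # sequence 1: (10, 2.0)
print(cycles_and_avg_cpi({"A": 4, "B": 1, "C": 1}))  # sequence 2: (9, 1.5)
```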

• Sequence 1: IC = 5
  – Clock Cycles = 2×1 + 1×2 + 2×3 = 10
  – Avg. CPI = 10/5 = 2.0
• Sequence 2: IC = 6
  – Clock Cycles = 4×1 + 1×2 + 1×3 = 9
  – Avg. CPI = 9/6 = 1.5
• Which is better? Sequence 2 has lower average CPI (and fewer total clock cycles), so it is better.

9 Comparing Performance

• Performance = 1 / Execution Time

• “X is n times faster than Y”

n = Performance_X / Performance_Y = Execution Time_Y / Execution Time_X

• Example: time taken to run a program – 10s on A, 15s on B… how much faster is A?

– Execution Time_B / Execution Time_A = 15 s / 10 s = 1.5
– So A is 1.5 times faster than B
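The same ratio can be expressed as a one-line helper (the function name is ours):

```python
def times_faster(exec_time_y: float, exec_time_x: float) -> float:
    """'X is n times faster than Y': n = Execution Time_Y / Execution Time_X."""
    return exec_time_y / exec_time_x

# The program takes 10 s on A and 15 s on B, so A is 1.5 times faster than B:
print(times_faster(15.0, 10.0))  # 1.5
```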

10 CPI Example

• Computer A: Cycle Time = 250 ps, CPI = 2.0
• Computer B: Cycle Time = 500 ps, CPI = 1.2
• Same ISA
• Which is faster, and by how much?

CPU Time_A = Instruction Count × CPI_A × Cycle Time_A = I × 2.0 × 250 ps = I × 500 ps
CPU Time_B = Instruction Count × CPI_B × Cycle Time_B = I × 1.2 × 500 ps = I × 600 ps

A is faster… by this much:
CPU Time_B / CPU Time_A = (I × 600 ps) / (I × 500 ps) = 1.2
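Since the instruction count I cancels, the comparison reduces to time per instruction; a quick Python check (function name is ours):

```python
def time_per_instruction_ps(cpi: float, cycle_time_ps: float) -> float:
    """Average time per instruction = CPI * clock cycle time."""
    return cpi * cycle_time_ps

a = time_per_instruction_ps(2.0, 250)  # Computer A: 500 ps per instruction
b = time_per_instruction_ps(1.2, 500)  # Computer B: 600 ps per instruction
print(b / a)  # 1.2 -> A is 1.2 times faster
```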

11 CPU Example

• Computer A:
  – 2 GHz clock, 10 s CPU time
• Let’s design Computer B
  – Aim for 6 s CPU time
  – Can use a faster clock, but that causes 1.2× as many clock cycles
• How fast must the new clock be?

Clock Rate_B = Clock Cycles_B / CPU Time_B = (1.2 × Clock Cycles_A) / 6 s

Clock Cycles_A = CPU Time_A × Clock Rate_A = 10 s × 2 GHz = 20×10⁹
Clock Rate_B = (1.2 × 20×10⁹) / 6 s = 24×10⁹ / 6 s = 4 GHz
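The derivation above can be folded into one small function (a sketch; the name and parameters are ours):

```python
def required_clock_rate_hz(time_a_s, rate_a_hz, cycle_growth, time_b_s):
    """Clock Rate_B = (growth factor * Clock Cycles_A) / CPU Time_B."""
    cycles_a = time_a_s * rate_a_hz           # clock cycles on machine A
    return cycle_growth * cycles_a / time_b_s

print(required_clock_rate_hz(10, 2e9, 1.2, 6))  # 4e9 Hz, i.e., 4 GHz
```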

12 Time for a Program

• The CPU executes various instructions
• A program has a number of instructions; how many?
  – Depends on the program and compiler
• Each instruction takes a number of CPU cycles; how many?
  – Depends on the Instruction Set Architecture (ISA)
  – ISA: learned in this course
• Each cycle has a fixed time based on CPU and bus speed
  – Depends on the hardware and organization (Computer Architecture), learned in this course

13 CPU Performance Equation

14 Performance Summary

• Performance depends on
  – Algorithm: affects IC, possibly CPI
  – Programming language: affects IC, CPI
  – Compiler: affects IC, CPI

– Instruction set architecture: affects IC, CPI, Tc

CPU Time = (Instructions/Program) × (Clock cycles/Instruction) × (Seconds/Clock cycle)

15 How to Improve Performance?

We must lower execution time!

• Algorithm
  – Determines the number of operations executed
• Programming language, compiler, architecture
  – Determine the number of machine instructions executed per operation (IC)
• Processor and memory system
  – Determine how fast instructions are executed (CPI)
• I/O system (including OS)
  – Determines how fast I/O operations are executed

16 Amdahl’s Law

• Improving one aspect of a computer won’t give a proportional improvement in overall performance

T_improved = T_affected / improvement factor + T_unaffected

• Especially true of multicore computers
• So make the common case fast!
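Amdahl's Law as stated above translates directly into code (a sketch with made-up example numbers):

```python
def amdahl_time(t_affected, t_unaffected, improvement_factor):
    """T_improved = T_affected / improvement factor + T_unaffected."""
    return t_affected / improvement_factor + t_unaffected

# Speeding up an 80 s portion of a 100 s run by 4x still leaves 40 s overall,
# only a 2.5x speedup: the untouched 20 s limits the gain.
print(amdahl_time(80, 20, 4))  # 40.0
```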

17 Exercise 1

• Problem
  – There are 3 classes of instructions: A, B, C. Suppose the compiler has two choices, Sequence 1 and Sequence 2, as described below:

Class              A   B   C
CPI for class      1   2   3
IC in sequence 1   2   1   2
IC in sequence 2   3   1   1

• Which one is better? Why?
• Sequence 1: IC = 5
  – Clock Cycles = 2×1 + 1×2 + 2×3 = 10
  – Avg. CPI = 10/5 = 2.0
• Sequence 2: IC = 5
  – Clock Cycles = 3×1 + 1×2 + 1×3 = 8
  – Avg. CPI = 8/5 = 1.6
• Sequence 2 has lower average CPI, so it is better.

18 Exercise 2

• Problem:
  – There are two computers: A and B.
  – Computer A: Cycle Time = 250 ps, CPI = 2.0
  – Computer B: Cycle Time = 400 ps, CPI = 1.5
  – If they have the same ISA, which computer is faster?
  – How many times faster is it than the other?

• Answer:
  – We know that CPU Time = IC × CPI × Cycle Time
  – Therefore, CPU Time(A) = IC × 2.0 × 250 ps = 500 × IC
  – CPU Time(B) = IC × 1.5 × 400 ps = 600 × IC
  – So, A is (600/500) = 1.2 times faster.

19 Exercise 3

• Problem:
  – Computer A has a 2 GHz clock. It takes 10 s of CPU time to finish a given task.
  – We want to design Computer B to finish the same task within 5 s of CPU time.
  – The clock cycle count for Computer B is 2 times that of Computer A.
  – What clock rate should be designed for Computer B?

• Answer:
Clock Rate_B = Clock Cycles_B / CPU Time_B = (2 × Clock Cycles_A) / 5 s

Clock Cycles_A = CPU Time_A × Clock Rate_A = 10 s × 2 GHz = 20×10⁹
Clock Rate_B = (2 × 20×10⁹) / 5 s = 40×10⁹ / 5 s = 8 GHz

20 CPU

The organization of a simple computer with one CPU and two I/O devices

21 Basic Elements

Other devices: Cache; Virtual Memory Support (MMU); ….

22 Processor

• CPU
  – The brain of the computer; it executes programs stored in main memory by fetching instructions, examining them, and executing them one after another
• Bus
  – Connects different components
  – Parallel wires for transmitting address, data, and control signals
  – Can be external to the CPU (connecting memory, I/O) or internal
• Control Unit
  – Fetches instructions from main memory and determines their types
• Arithmetic Logic Unit (ALU)
  – Performs arithmetic operations, such as addition, and Boolean operations
• Registers
  – High-speed memory used to store temporary results and control information
  – Program Counter (PC): points to the next instruction to be fetched
  – Instruction Register (IR): holds the instruction currently being executed

23 CPU Organization

• Instructions:
  – Register-Memory: memory words are fetched into registers
  – Register-Register

• Data Path Cycle
  – The process of running two operands through the ALU and storing the results
  – Defines what the machine can do
  – The faster the data path cycle is, the faster the computer runs

The data path of a typical Von Neumann machine

24 Arithmetic Logic Unit (ALU)

• Conducts different calculations
  – +, −, ×, /
  – and, or, xor, not
  – shift, …
• Variants
  – Integer, Floating Point, Double Precision
  – A high-performance CPU has multiple ALUs!
• Input
  – Operands: A, B
  – Operation code: obtained from the encoded instruction
• Output
  – Result: C
  – Status codes:
    c – carry out (from +, −, ×, shift)
    n – result is negative
    z – result is zero
    v – result overflowed

[Figure: an ALU with operand inputs A and B, an op-code input, result output C, and status-code outputs c, n, z, v]

25 Instruction Execution Steps

• Fetch-decode-execute: central to the operation of all computers
  – Fetch the next instruction from memory into the instruction register
  – Change the program counter to point to the following instruction
  – Determine the type of instruction just fetched
  – If the instruction uses a word in memory, determine where it is
  – Fetch the word, if needed, into a CPU register
  – Execute the instruction
  – Go to step 1 to begin executing the following instruction
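The steps above can be sketched as a toy interpreter loop (in Python rather than the book's Java, with a made-up accumulator ISA; the opcodes and names are ours):

```python
def run(program, data):
    """Fetch-decode-execute loop for a toy accumulator machine."""
    pc, acc = 0, 0
    while True:
        opcode, operand = program[pc]   # fetch into the "instruction register"
        pc += 1                         # point PC at the following instruction
        if opcode == "LOAD":            # decode, then fetch a memory word
            acc = data[operand]
        elif opcode == "ADD":
            acc += data[operand]
        elif opcode == "STORE":
            data[operand] = acc
        elif opcode == "HALT":
            return acc
        # otherwise loop back and fetch the next instruction

prog = [("LOAD", 0), ("ADD", 1), ("STORE", 2), ("HALT", None)]
print(run(prog, {0: 3, 1: 4, 2: 0}))  # 3 + 4 = 7
```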

26 An interpreter for a simple computer (written in Java).

• Figure 2-3. An interpreter for a simple computer (written in Java).

27 An interpreter for a simple computer (cont’d)

28 Interpreting Instructions

• Interpreter
  – A program that fetches, examines, and executes the instructions of another program
  – One can write a program to imitate the function of a CPU
  – Main advantage: the ability to design a simple processor that supports a rich set of instructions

• Benefits (simple computer with interpreted instructions)
  – The ability to fix incorrectly implemented instructions or make up for design deficiencies in the basic hardware
  – The opportunity to add new instructions at minimal cost, even after delivery of the machine
  – Structured design that permits efficient development, testing, and documenting of complex instructions

29 RISC vs. CISC

• Semantic gap between
  – What the machine can do
  – What high-level programming languages require
• Reduced Instruction Set Computer (RISC) (Lego building example)
  – Did not use interpretation
  – Did not have to be backward compatible with existing products
  – Small number of instructions (around 50)
• Key to designing RISC instructions
  – Instructions should be able to be issued quickly
  – How long an instruction actually takes matters less than how many can be started per second
• Complex Instruction Set Computer (CISC)
  – Around 200-300 instructions; DEC VAX and IBM mainframes
• Intel (486 and up)
  – A RISC core executes the simplest (most common) instructions
  – More complicated instructions are interpreted in the usual CISC way

30 RISC Design Principles for Modern Computers

• Instructions directly executed by hardware
  – Eliminating a level of interpretation provides high speed for most instructions
  – Using CISC-style interpretation for less frequently occurring instructions is acceptable
• Maximize the rate at which instructions are issued
  – Parallelism can play a major role in improving performance
• Instructions should be easy to decode
  – A critical limit on the rate of issue is decoding individual instructions to determine what resources they need
  – The fewer different instruction formats, the better
• Only loads and stores should reference memory
  – Accessing memory can take a long time
  – All other instructions should operate only on registers
• Provide plenty of registers
  – Running out of registers forces flushing them back to memory
  – Memory accesses are slow

31 Instruction-Level Parallelism

• A five-stage pipeline – The state of each stage as a function of time. – Nine clock cycles are illustrated

32 Pipelining

• A five-stage pipeline
  – Suppose a 2 ns cycle time.
  – It takes 10 ns for an instruction to progress all the way through the five-stage pipeline.
  – So, does the machine run at 100 MIPS? No: the actual rate is 500 MIPS.
• Pipelining
  – Allows a tradeoff between latency and processor bandwidth
  – Latency: how long it takes to execute one instruction
  – Processor bandwidth: how many MIPS the CPU delivers
• Example
  – Suppose a complex instruction takes 10 ns. Under perfect conditions, how many pipeline stages should we design to guarantee 500 MIPS?

Each pipeline stage: 1 / (500 MIPS) = 2 ns
10 ns / 2 ns = 5 stages
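Under the ideal assumptions above (every stage busy, no stalls), the latency/bandwidth tradeoff can be computed directly (function name is ours):

```python
def pipeline_stats(stage_time_ns: float, stages: int):
    """Latency and peak throughput of an ideal pipeline."""
    latency_ns = stage_time_ns * stages   # one instruction, end to end
    mips = 1000.0 / stage_time_ns         # one instruction completes per stage time
    return latency_ns, mips

print(pipeline_stats(2, 5))  # (10, 500.0): 10 ns latency, 500 MIPS bandwidth
```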

33 Superscalar Architectures (1)

• Dual five-stage pipelines with a common instruction fetch unit
  – Fetches pairs of instructions together and puts each one into its own pipeline
  – The two instructions must not conflict over resource usage

– Neither may depend on the result of the other

34 Superscalar Architectures (2)

• Implicit idea – S3 stage can issue instructions considerably faster than the S4 stage is able to execute them

A superscalar processor with five functional units.

35 Processor-Level Parallelism (1)

• An array processor (SIMD)
  – A large number of identical processors that perform the same sequence of instructions on different sets of data
  – Different from a standard Von Neumann machine

36 Processor-Level Parallelism (2)

• A single-bus multiprocessor
  – Example: locating the white ball in a picture
• A multicomputer with local memories

37 Processor-Level Parallelism (3)

• Multiple computers (loosely coupled)
  – Easier to build
• Multiple processors (tightly coupled)
  – Easier to program

38 Exercise

• Ex 1: TRUE OR FALSE? Why?
  – The Data Path Cycle defines what the computer can do. The longer the data path cycle is, the faster the computer runs.

– Answer: F

– Reason: The Data Path Cycle defines what the computer can do. The shorter/faster the data path cycle is, the faster the computer runs.

39 Exercise

• Ex 2: What are the design principles for modern computers? – (a) Instructions directly executed by hardware – (b) Minimize rate at which instructions are issued – (c) Instructions should be easy to decode – (d) Only loads, stores should reference memory – (e) Provide plenty of registers

– Answer: [a, c, d, e]

40 Exercise

• Ex 3: The following diagram gives the organization of a simple computer with one CPU and two I/O devices. – Is it correct? If not, please correct it in the diagram.

• Solution
  – Incorrect.
  – In place of Disk, it should be Register.
  – In place of Register, it should be Disk.

41 Exercise

• Ex 4: Consider a computer with a pipelining technique in which each instruction has 5 stages. Each stage takes 2 ns.
  – What is the maximum number of MIPS this machine is capable of with the 5-stage pipeline?
  – What is the maximum number of MIPS this machine is capable of in the absence of pipelining?

• Solution
  – With pipelining: 1 / (2 ns) = 500 MIPS
  – Without pipelining: 500/5 = 100 MIPS, or 1 / (5 × 2 ns) = 100 MIPS
