3.3.3 Architecture

Created by E Barwell Sept 2009, rev Sep 2011

CONTENTS

VON NEUMANN ARCHITECTURE (SISD)

FLYNN'S TAXONOMY

THE COMPONENTS OF THE CPU
    CONTROL UNIT - CU; ARITHMETIC AND LOGIC UNIT - ALU; INCREMENTER

THE REGISTERS OF THE CPU
    MEMORY ADDRESS REGISTER - MAR; MEMORY DATA REGISTER - MDR; PROGRAM COUNTER - PC; CURRENT INSTRUCTION REGISTER - CIR; STATUS REGISTER - SR; INTERRUPT LINES; GENERAL PURPOSE REGISTERS

THE FETCH EXECUTE CYCLE

FETCH EXECUTE AND INTERRUPTS

SPEEDING UP CPU OPERATION
    CACHE MEMORY; PIPELINES/PIPELINING; CO-PROCESSORS

OTHER ARCHITECTURES
    MULTIPLE PROCESSORS (PARALLEL PROCESSING - MIMD); ARRAY PROCESSORS (VECTOR PROCESSORS - SIMD); ALGORITHMS TO TAKE ADVANTAGE OF NON-VON NEUMANN ARCHITECTURES

COMPUTER ARCHITECTURES - REMINDER
    SISD; SIMD; MISD; MIMD

CISC AND RISC PROCESSORS
    CLOCK SPEEDS; CISC - COMPLEX INSTRUCTION SET COMPUTER; RISC - REDUCED INSTRUCTION SET COMPUTER

Von Neumann Architecture (SISD)

Von Neumann architecture is the term used to describe the classical stored-program CPU structure, which has the following components. (Alan Turing proposed a similar structure earlier, in the 1930s, but it was overlooked during the build-up to the war.)

 A single memory, separate from the CPU, containing both program code and data

 Program code and data stored in the same format

 A single control unit to manage execution of instructions

 Instructions executed in sequential manner (one after another)

There is a bottleneck caused by not being able to fetch instructions and perform data operations at the same time. This is known as a SISD architecture, as we have single instructions working on single pieces of data, one after the other.

Flynn’s Taxonomy

SISD is one of four basic architecture designs known collectively as "Flynn's Taxonomy", proposed by Michael Flynn in 1966.

SISD (Single Instruction stream / Single Data stream)
Classic Von Neumann: executes one instruction at a time, manipulating a single piece of data at a time. A single-core processor system.

SIMD (Single Instruction stream / Multiple Data streams)
The instruction stream is applied to each of the data streams. Instances of the same instruction can run in parallel on multiple cores, servicing different data streams. Array or vector processing: multiple items of data processed with a single instruction (arrays and matrices).

MISD (Multiple Instruction streams / Single Data stream)
Multiple instruction streams can be applied in parallel to a single data source. Could be used for running decryption routines on a single data source.

MIMD (Multiple Instruction streams / Multiple Data streams)
On a multicore computer, each instruction stream runs on a separate processor with independent data. This is the current model for multicore personal computers and has been used for years on supercomputer systems.

The Components of the CPU

The CPU (Central Processing Unit) or processor is the heart of any computer system. Its only task is to repeatedly fetch machine instructions (machine code) from memory and execute them. External events can alter the order in which instructions are executed (interrupts, which we will look at later). We program the CPU using machine code.

CPUs are categorised by the width (in bits) of their Data Bus, known as the word length.

 CPU with 8 Bit Data Bus, called an 8 Bit CPU (e.g. Z80, Amstrad, 6502, C64, BBC)
 CPU with 16 Bit Data Bus, called a 16 Bit CPU (e.g. 68000 – Atari ST, Amiga, MegaDrive)
 CPU with 32 Bit Data Bus, called a 32 Bit CPU (e.g. 68040, Pentium)
 CPU with 64 Bit Data Bus, called a 64 Bit CPU (e.g. AMD 64)

Below is a schematic of a standard CPU (shown inside the dotted line).

[Schematic: the CPU contains the Control Unit (CU), the Arithmetic & Logic Unit (ALU) with its Accumulators, the Incrementer, the Status Register (SR), the general purpose registers (r0, r1, r2 etc.), the Current Instruction Register (CIR), the Program Counter (PC), the Memory Data Register (MDR) and the Memory Address Register (MAR). Interrupt Lines (1 way) enter the CPU. The CPU is connected to Random Access Memory [Main Store] by the Address Bus (1 way), the Data Bus (2 way) and the Control Bus.]

Control Unit - CU

Often referred to as “The Brain of the CPU”, it controls the operation of the CPU.

 Receives timing information from an external clock (quoted in MHz/GHz)

 In general the faster the clock the faster the CPU can execute instructions

 Controls the Fetch – Execute cycle.

 Generates signals along the Control Bus to order other parts of the CPU to do things.

These include signals to:

 Initiate Memory Transfers (move data to and from the CPU)

 Control other components of CPU such as ALU and Incrementer

 Transfer data between internal components/registers of the CPU

 Perform decoding of current instruction

 Check Status Register for Interrupts, take action if present

Arithmetic and Logic Unit - ALU

 Responsible for performing calculations (adding, subtracting, multiplication), logical operations (And, Or, Not) and comparisons (comparing one value with another)

 When an arithmetic instruction is encountered, control is transferred to the ALU for execution

 Uses its own registers known as Accumulators, which hold the results of calculations

 The basic outcomes of calculations (is the result zero or negative?) are stored in the CCR (or status register)

Incrementer

 Sole job is to increase the value of the Program Counter during the Fetch – Execute cycle.

 Basically a separate (simple) ALU just to add to or subtract values from the PC

The Registers of the CPU

A register is a storage area inside the CPU. They are in general the same width as the Data Bus. They are dedicated storage areas that work at the same speed as the CPU (unlike RAM) and help the CU keep track of important values.

On the next few pages are descriptions of the role and functions of the standard ones.

Memory Address Register – MAR

 Placing an address in the MAR selects that memory location
 It is the same width (same number of bits) as the CPU's Address Bus

Memory Data Register – MDR

 Also known as the Memory Buffer Register – MBR – You must remember this
 Is connected directly to the Data Bus
 It is the same width (same number of bits) as the Data Bus

When reading data from memory the following happens inside the CPU

 The memory address of the data is placed in the MAR
 The CU initiates a memory transfer from RAM to CPU (read cycle) using the Data Bus
 The data arrives at the CPU and is stored in the MDR.

When writing data to memory the following happens inside the CPU

 The memory address where the data is to be stored is placed in the MAR
 The data to be written to memory is placed in the MDR by the CU.
 The CU initiates a memory transfer from CPU to RAM (write cycle) using the Data Bus

Program Counter – PC

Also known as the Sequence Control Register – SCR – You must remember this.

 Holds the memory address of the next instruction to execute
 The PC is the same size as the MAR (and hence the Address Bus)

Current Instruction Register – CIR

 Holds the machine code instruction the CPU is currently processing
 When a new instruction is fetched from RAM it is transferred from the MDR to the CIR
 Once the instruction is in the CIR the CU can decode and execute it
 Holding the instruction in the CIR allows more data (to be used by the instruction) to arrive at the MDR without overwriting the current instruction


Status Register – SR

 Keeps track of the basic results of operations performed by the ALU

 Results are stored as a series of flags (bits), which can be interrogated individually

 Bits are set (1) or cleared (0) depending upon the ALU results

Flags can be either true or false and thus be represented by a single Bit.

 If bit is clear/zero then the flag is false  If bit is set/one then the flag is true

There are four common Flags/Bits inside the SR

Flag    Name      Purpose
Z bit   Zero      Set if the result of an instruction was zero
N bit   Negative  Set if the result of an instruction was negative
C bit   Carry     Set if the result of an instruction generated a carry
V bit   Overflow  Set if the result of an instruction was too large to hold in the Accumulator (the number needed more bits than the accumulator has)

 The flags are used to make decisions

 When comparing 2 values the ALU effectively subtracts them.

 If they are equal the result would be Zero and the Z flag would be set.

 If one was bigger than the other, the N flag would be set or cleared (depending on which was bigger)
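The comparison described above can be sketched in a few lines of Python. The flag semantics follow the table, but the function itself is just an illustration of the idea, not real ALU hardware.

```python
# Illustrative model of how a compare sets the Z and N flags.
# The ALU "compares" two values by subtracting them; flag names follow the table above.

def compare(a, b):
    """Return the Z and N flags a subtraction a - b would produce."""
    result = a - b
    Z = 1 if result == 0 else 0    # Zero flag: set when the result is zero
    N = 1 if result < 0 else 0     # Negative flag: set when the result is negative
    return {"Z": Z, "N": N}

print(compare(5, 5))   # equal values -> {'Z': 1, 'N': 0}
print(compare(3, 7))   # a < b       -> {'Z': 0, 'N': 1}
```

A conditional branch instruction then simply inspects the relevant flag to decide whether to jump.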

Interrupt Lines

 Signals from other devices (chips) or software, that are requesting CPU resource (time)
 Set various bits (depending on the priority level of the particular interrupt) in the Status Register
 At the end of each Fetch-Execute cycle, the Status Register is checked to determine if any interrupts occurred
 The CPU will clear the relevant bit in the Status Register, save its state (to resume later) and look up the relevant interrupt vector (effectively the memory address of a program that knows what to do with a particular interrupt) to deal with this interrupt

General Purpose Registers

 Can be used to store intermediate values without the need to access RAM
 Old 8-bit CPUs only had one ACC available, but modern CPUs use lots (16+) of general-purpose registers that can be used for the same purpose
 Helps speed up program execution as values can be held in registers rather than written to memory. Ideal for loop variables
 Faster because they can be read/written at the same speed the CPU operates, rather than the CPU having to wait for memory transfers to complete

The Fetch Execute cycle

This is the name given to the process of fetching machine instructions from memory, decoding them and executing them. The task of the CU is to generate all the necessary signals in order to perform this cycle, using among other things the Control Bus to co-ordinate the process.

There are 3 main steps in the Cycle.

 Fetch, Decode, Execute

Here is an overview of the cycle briefly introducing the major registers of the CPU

1. The Program Counter (PC) holds the address of the next instruction to execute; this value is passed to the Memory Address Register (MAR), which makes the Address Bus select this location
2. Data is transferred from Main Store (RAM) to the Memory Data Register (MDR)
3. The value in the MDR is passed to the Current Instruction Register (CIR)
4. Once the instruction is held in the CIR, its operation code is decoded so the control unit can determine how to execute it
5. Before the instruction is executed, the Incrementer is used to make the value of the PC point to the next instruction in memory
6. Data may be fetched from RAM prior to executing an instruction; this data will arrive at the MDR, overwriting the current instruction, which is why the instruction is copied to the CIR before attempting to execute it
7. The instruction is executed (once all data is available); this may involve several steps
8. The Status Register is checked to see if any interrupts (requests for attention) have occurred

The Fetch-Execute cycle is a repetitive task: after an instruction has been fetched, decoded and executed, we start the cycle again. Here it is again; you must remember this.

Step   Data From     CU Action                               Data To
FETCH INSTRUCTION:
1      PC            CPU transfer                            MAR
2      RAM (@ MAR)   Memory transfer (read)                  MDR
3      MDR           CPU transfer                            CIR
4      CIR           DECODE INSTRUCTION                      CU
5      PC            CPU transfer                            INCREMENTER
6      CIR.OPERAND   Possible memory transfers to get data   MDR
7      EXECUTE INSTRUCTION (actions depend on instruction)
8      SR            Check for interrupts
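The steps above can be sketched as a toy simulation. This is a minimal illustrative Python model, not a real instruction set: the LOAD/ADD/HALT opcodes, the memory layout and the register variables are all invented for the example.

```python
# Toy fetch-decode-execute loop mirroring steps 1-8 above.
# The "machine code" (LOAD/ADD/HALT opcodes) is invented for the example.

RAM = {0: ("LOAD", 10), 1: ("ADD", 11), 2: ("HALT", None),
       10: 6, 11: 7}                       # program at 0-2, data at 10-11

PC, ACC = 0, 0
running = True
while running:
    MAR = PC                               # step 1: PC -> MAR
    MDR = RAM[MAR]                         # step 2: memory transfer (read)
    CIR = MDR                              # step 3: MDR -> CIR
    opcode, operand = CIR                  # step 4: decode in the CU
    PC += 1                                # step 5: Incrementer advances the PC
    if opcode == "LOAD":                   # steps 6-7: fetch data, then execute
        ACC = RAM[operand]
    elif opcode == "ADD":
        ACC += RAM[operand]
    elif opcode == "HALT":
        running = False
    # step 8 (check the SR for interrupts) is omitted in this sketch

print(ACC)   # -> 13
```

Because the PC is incremented before execution (step 5), the loop naturally moves on to the next instruction unless the executed instruction changes the PC itself.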

Fetch Execute and interrupts

A low level interrupt is a signal asking the CPU for attention. Interrupts are also known as exceptions; as a general rule of thumb, interrupts are hardware generated and exceptions are software generated.

Interrupts can be generated by,

 Hardware inside the CPU when errors occur (Such as illegal addresses or divide by zero, accessing protected memory locations)

 Timers such as the clock that trigger the CU

 Hardware connected to the computer (such as disk drives, I/O channels, keyboards, mice, printers etc…)

 An interrupt is signalled through hardware, which sets a bit in the SR. It is the job of the CU to check the SR before starting a new Fetch execute cycle, to see if an interrupt has occurred.

 Interrupts alter the normal flow of the Fetch-Execute cycle. When one occurs the CU must act immediately, changing the PC to point at the interrupt handler for the interrupt that has occurred.

The CPU does not know how to deal with any particular interrupt; this is the job of a program called an Interrupt Handler.

 Each Interrupt that can occur needs its own Interrupt handler.

 The addresses of the Interrupt handlers are stored in a section of memory known as Interrupt Vectors.

 These are allocated by the CPU manufacturer and often reside at the very beginning of memory.

There are 2 categories of interrupt

Maskable Interrupts - MI

Maskable interrupts can be disabled in hardware. The CU can be made to ignore them completely. MI are usually given priority levels. Each priority level can be disabled by clearing the relevant bit in the SR.

Non-Maskable Interrupts - NMI

Non-maskable interrupts are of the highest priority and cannot be disabled. NMIs are critical to the function of the CPU. The clock controlling the CU is an example: if it were disabled, the CPU would cease to function.

WHEN AN INTERRUPT OCCURS THE CU MUST:

 Place the contents of the PC onto the stack

 Increase the Stack Pointer

 Identify the Interrupt # that has occurred

 Load the addr of the interrupt handler stored at the appropriate Interrupt vector

 Place this addr in the PC

 Start a new Fetch Execute cycle

Example of code to handle an interrupt:

    PUSH PC
    JMP (0)

AN INTERRUPT HANDLER MUST:

1. Preserve (push onto the stack) the contents of any general purpose registers it wishes to use

2. Perform the actions required to handle the interrupt (if it was a mouse interrupt the handler would inform windows of the change in position and the status of the mouse buttons)

3. Restore (pop from the stack) general registers previously stored on stack

4. Issue a return from Exception instruction (RTE) to inform the CU that the interrupt handler has finished

Example interrupt handler structure:

    PUSH GEN_REGISTERS
    ... perform instructions to deal with interrupt ...
    POP GEN_REGISTERS
    RTE

WHEN AN INTERRUPT HAS FINISHED THE CU MUST:

1. Decrease the contents of the SP

2. Load the address stored on top of the stack (POP PC) and place it in the PC

3. Start a new Fetch Execute cycle

The control unit effectively executes this instruction:

    POP PC
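The whole interrupt sequence described above can be sketched as a toy model. This is an illustrative Python sketch only: the vector table, stack, handler address and the `handled` list are all hypothetical stand-ins, not real hardware or any particular CPU's behaviour.

```python
# Sketch of the CU's interrupt entry/exit sequence described above.
# The vector table, stack and handler address are invented stand-ins.

STACK = []
PC = 0x200                       # address of the interrupted program
VECTORS = {0: 0x40}              # interrupt number -> handler address (hypothetical)
handled = []

def run_handler(number):
    """Interrupt handler: preserve registers, do the work, restore, RTE."""
    saved_registers = ["r0", "r1"]       # 1. preserve general purpose registers
    handled.append(number)               # 2. perform the actions required
    saved_registers.clear()              # 3. restore registers; 4. RTE

def interrupt_occurs(number):
    """CU actions when an interrupt occurs, then on its completion."""
    global PC
    STACK.append(PC)             # place the contents of the PC onto the stack
    PC = VECTORS[number]         # load the handler address from the interrupt vector
    run_handler(number)          # new Fetch-Execute cycle runs the handler
    PC = STACK.pop()             # POP PC: resume the interrupted program

interrupt_occurs(0)
print(hex(PC))   # -> 0x200: execution resumes where it left off
```

The key point the sketch illustrates is that the stack preserves the return address, so the interrupted program continues exactly where it stopped.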

Speeding up CPU operation

Von Neumann architecture has a fatal flaw: the Von Neumann bottleneck.

This is caused by the movement of data and instructions to and from memory. Memory is not as fast as the CPU, so the CPU ends up waiting for data to arrive or be stored before it can carry on executing.

Speed of conventional processing can be increased by

 Increasing clock speed in MHz/GHz (the speed at which the Fetch-Execute cycle operates). The limit on this increase will be reached in the near future

 Increasing Data Bus width, fetching more bytes of memory in a single fetch (typical sizes 128 bits or 256 bits wide). This causes massive physical design problems with circuit boards

These techniques alone are not going to be practical for much longer.

The following three methods are ways to alleviate the problem. Some argue that the Von Neumann approach is the problem and we need to look at radical changes in architecture to remove the bottleneck.

Cache memory

Do not confuse this with Disk Cache (i.e. the area of RAM used to temporarily store data that is being read from and written to the hard drives).

 Is used to speed up general processing of CPU by storing copies (mirrors) of most frequently accessed memory locations

 Sits between RAM and the CPU and runs very fast, much faster than normal RAM

 The amount is limited (from 256KB up to 2MB) because it is expensive in terms of silicon area and heat generated

When reading a value from RAM

[Diagram: CPU - CACHE - RAM]

1. Look in the Cache to see if the memory location is mirrored there
2. If it is, read quickly from the Cache
3. If it isn't, read from slower RAM

When writing a value to RAM

1. Look in the Cache to see if the memory location is mirrored there
2. If it is, write the value to the Cache AND write the value to RAM
3. If it isn't, write the value to RAM

You should be able to see from this that an extra step is performed when writing to cache memory; as the main memory must also be updated. This is a drawback of cache memory.
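The read and write rules above amount to a write-through cache, which can be sketched in a few lines. This is an illustrative Python model with invented names and sizes, not how cache hardware is actually built (real caches work on lines and use tag matching).

```python
# Toy write-through cache illustrating the read and write rules above.
# CACHE mirrors individual RAM locations; names and sizes are invented.

RAM = {addr: 0 for addr in range(16)}      # slow main store
CACHE = {}                                 # address -> mirrored value (fast)

def read(addr):
    if addr in CACHE:                      # cache hit: fast read
        return CACHE[addr]
    CACHE[addr] = RAM[addr]                # cache miss: slow read from RAM, then mirror
    return CACHE[addr]

def write(addr, value):
    if addr in CACHE:
        CACHE[addr] = value                # update the mirror...
    RAM[addr] = value                      # ...AND always update RAM (the extra step)

write(3, 7)
print(read(3), RAM[3])   # -> 7 7
```

The `write` function shows the drawback directly: every write touches slow RAM even on a cache hit, because the mirror and the main store must stay consistent.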

Pipelines/Pipelining

 Attempt to make Fetch-Decode-Execute cycle more efficient by splitting the process into 3 separate parts that can operate concurrently (at the same time).

 Used in conjunction with (as a bolt-on to) traditional sequential processing

In the standard fetch-execute cycle only 1/3 of processor time is actually spent executing instructions; the rest is spent fetching and decoding instructions.

Standard processor activity:

Step     1      2       3        4      5       6        7      8
         Fetch  Decode  Execute  Fetch  Decode  Execute  Fetch  Decode

Using a pipeline, we perform these actions in parallel (overlap), so that the processor is always (as much as possible) executing instructions.

Pipelined processor activity:

Step      1      2       3        4        5        6        7        8
Instr 1   Fetch  Decode  Execute
Instr 2          Fetch   Decode   Execute
Instr 3                  Fetch    Decode   Execute
Instr 4                           Fetch    Decode   Execute
Instr 5                                    Fetch    Decode   Execute
Instr 6                                             Fetch    Decode   Execute

 From step 3 onwards the processor is always executing an instruction

 This method gives us a theoretical speed increase of 300% over the conventional Fetch-Execute cycle

 Conditional branch instructions cause problems.

In the table above, at any one time we have the next two instructions to be processed. If we hit a branch instruction, the instructions in the pipeline may not be executed; we may move the program counter to another part of memory and execute instructions there.
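The pipelined schedule above can be computed rather than drawn. The following minimal Python sketch (all names invented for the example) works out which instruction occupies each stage at each clock step, confirming that from step 3 onward the processor is always executing something.

```python
# Schedule for a 3-stage pipeline, reproducing the pipelined table above:
# instruction i enters Fetch at step i, so from step 3 the Execute stage is never idle.

STAGES = ["Fetch", "Decode", "Execute"]

def pipeline_schedule(n_instructions, n_steps):
    """Return, for each clock step, which instruction (1-based) occupies each stage."""
    steps = []
    for step in range(1, n_steps + 1):
        active = {}
        for instr in range(1, n_instructions + 1):
            stage_index = step - instr         # instruction i starts fetching at step i
            if 0 <= stage_index < len(STAGES):
                active[STAGES[stage_index]] = instr
        steps.append(active)
    return steps

for step, active in enumerate(pipeline_schedule(6, 8), start=1):
    print(step, active)
# From step 3 onward every step includes an "Execute" entry.
```

A branch simply invalidates this neat schedule: the instructions already in Fetch and Decode may be the wrong ones and have to be thrown away.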

The problem can be solved using multiple pipelines. Load one pipeline with the instructions to be executed if the condition is true and another with the instructions to be executed if it is false. As soon as the condition is known, the processor can switch to the pipeline containing the relevant instructions.

[Diagram: the Control Unit feeds a conditional instruction into Pipeline 1 (containing the instructions if the condition is true) and Pipeline 2 (containing the instructions if the condition is false); execution continues from whichever pipeline matches the outcome.]

We can combine multiple pipelines and multiprocessors (I won't explain how this may work, as it starts getting very complicated).

Co-processors

Coprocessors are additional processors that can be added to a computer in order to help with running the applications the computer has to deal with. They may provide access to specialist instructions which are additional to the instruction set offered by the CPU. They often take advantage of different architectures (see Other Processor architectures) and are used to take execution load from the main CPU so it can concentrate on logic and control of flow.

Floating point co-processors: These used to be common when CPUs only had the ability to do integer arithmetic. This form is no longer common, as most processors have built-in support for floating point operations.

Digital Signal Processors (DSPs): Provide very specialist support for the processing of signal data; the most common uses are DSPs for sound and image manipulation. They have very small instruction sets and are in no way suitable for general programming; most people find them very difficult to program because of the small instruction set.

Audio cards are an example of DSP support used in modern computers often installed by means of an expansion card (or daughter board)

Graphics processors: Provide very specialist graphics primitive rendering. They are often based around parallel architectures, making use of vector processing techniques as well. GPUs (Graphics Processing Units) are much more powerful in raw terms than the general CPU but have limited instruction sets, making it difficult to use them in more general ways [2].

Physics processors Do a similar job to graphics processors in that they are set up to handle real world collision and deformation jobs in real time.

Balance execution load: Helps improve execution time by balancing execution of instructions between the main CPU and a second CPU provided as a co-processor. This is no longer a common method, due to the advent of multicore architectures, but it was common on 80s-based systems and a lot of arcade hardware.

[2] Although NVidia do have a special programming language that allows developers to make use of the GPU power for different jobs such as real-time physics.

Other Processor architectures

The traditional Von Neumann single-instruction sequential execution architecture is not the only way to construct CPUs. Here are the two other main ways; as with Von Neumann, they are not perfect.

Multiple Processors (Parallel processing – MIMD)

Parallel processing is an approach used in computer design that moves away from the limitations of the traditional sequential processing technique (fetch-execute) and introduces the concept of concurrently executing tasks (running at the same time).

 A computer may contain 2, 8 or as many as 65,000 CPUs

 More than one task can be executed at the same time. Each task executes on an independent processor.

 To achieve parallel processing, processors need to be able to communicate. This is achieved by communication channels, which are hard wired, dedicated connections between processors. Communication allows tasks to synchronise with each other so they can keep in step.

 To be used effectively programs need to be broken into tasks that can be run concurrently (at the same time)

 Access to shared memory can be a problem: only one processor can read/write RAM at any time. On-chip cache (containing a copy of the most accessed memory locations) can alleviate this problem, but it can lead to integrity problems (similar to multi-user databases). Ideally, tasks should be organised so they are not processing the same data.
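The points above, independent tasks plus synchronised access to shared memory, can be sketched with threads. This is an illustrative Python model only: the threads stand in for independent processors, and the lock models the "only one processor can read/write RAM at any time" constraint; all names are invented for the example.

```python
# Sketch of MIMD-style concurrent tasks with synchronised access to shared memory.
# Python threads stand in for independent processors.

import threading

shared_total = 0
ram_lock = threading.Lock()            # serialises access to the shared value

def task(values):
    """One independent instruction stream working on its own data."""
    global shared_total
    partial = sum(values)              # private work needs no synchronisation
    with ram_lock:                     # synchronise only for the shared write
        shared_total += partial

chunks = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]     # each task gets independent data
threads = [threading.Thread(target=task, args=(c,)) for c in chunks]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shared_total)   # -> 45
```

Note how most of each task's work touches only its own chunk; the lock is needed only for the brief moment the shared value is updated, which is exactly the organisation the bullet points recommend.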

Suitable uses

 Mathematically intensive analysis

 Weather simulation

 Fluid dynamics

 Astronomical analysis –(spectral, image processing)

Array Processors (Vector processors - SIMD)

 Specialist processors

 Contain few instructions. More silicon available for each instruction which allows brute force methods to be implemented (rather than algorithmic ones)

 Apply single operations (Add, multiply etc…) to multiple pieces of data at the same time.

 Often used with matrices or lists, which involve lots of mathematical actions

 (Multiplying two 3 by 3 matrices involves 27 multiplies and 18 additions, this can be done in one operation on an array processor, but would require 45 instructions on a conventional processor)

Used for

 3D graphics

 Statistical analysis

 Image processing
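The idea of applying a single operation to whole data streams at once can be sketched as follows. Plain Python lists stand in for the vector registers of an array processor; this is a conceptual illustration, since a real SIMD unit performs these element operations simultaneously in hardware rather than in a loop.

```python
# SIMD-flavoured sketch: one operation applied across whole data streams at once.
# Each function represents a single conceptual vector instruction.

def vector_add(a, b):
    """A single ADD applied across every element pair."""
    return [x + y for x, y in zip(a, b)]

def vector_scale(a, k):
    """A single MULTIPLY applied across a whole data stream."""
    return [x * k for x in a]

row = [1.0, 2.0, 3.0]
print(vector_add(row, [10.0, 20.0, 30.0]))   # -> [11.0, 22.0, 33.0]
print(vector_scale(row, 2))                  # -> [2.0, 4.0, 6.0]
```

On a conventional processor each element would cost a separate instruction; on an array processor each call above corresponds to one operation regardless of the number of elements.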

Algorithms to take advantage of non-Von Neumann architectures

 Need to split a program or task into smaller tasks and determine which can run concurrently and which need to be sequential

 Need to determine what synchronisation is necessary between these tasks and when it needs to happen

Any standard algorithms that use a divide and conquer approach are automatically suitable for parallel processing.

 Quick sort, MPEG & JPEG compression and fractal calculations are all suited to parallel processing
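The divide-and-conquer point above can be sketched with a sort: split the data, hand each half to a separate worker, then combine the results. This is an illustrative Python sketch in which a thread pool stands in for extra processors; the function names are invented for the example.

```python
# Divide-and-conquer sketch: each half of the problem could run on a
# separate processor. A thread pool stands in for the extra processors.

from concurrent.futures import ThreadPoolExecutor

def merge(left, right):
    """Combine two sorted halves into one sorted list."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]

def parallel_sort(data):
    """Sort the two halves concurrently, then merge them sequentially."""
    if len(data) <= 1:
        return list(data)
    mid = len(data) // 2
    with ThreadPoolExecutor(max_workers=2) as pool:
        left = pool.submit(sorted, data[:mid])      # task 1: runs concurrently
        right = pool.submit(sorted, data[mid:])     # task 2: independent data
        return merge(left.result(), right.result()) # synchronise, then combine

print(parallel_sort([5, 2, 9, 1, 7]))   # -> [1, 2, 5, 7, 9]
```

The structure shows both requirements from the bullet points: the halves run concurrently on independent data, and the only synchronisation needed is waiting for both results before the merge.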

Computer architectures - reminder

SISD Single Instruction Single Data stream, Serial processing

SIMD Single Instruction Multiple Data stream – Array/Vector processors

MISD Multiple Instruction Single Data stream – theoretical as data will change as each instruction uses it

MIMD Multiple Instruction Multiple Data stream – Parallel processes involved

CISC and RISC processors

Clock speeds

 In simple terms the clock speed of a processor (given in GHz or MHz) determines the speed at which the control unit operates. It doesn’t directly relate to the speed at which processors can execute instructions

 Each action the control unit performs takes up one clock cycle (or tick of the clock)

 Different instructions take different numbers of clock cycles to execute.

 Here’s an example to illustrate this based on the MC68000 processor found in the Sega MegaDrive (as well as other machines)

The ADD instruction takes 4 clock cycles to execute (it adds two integers)

The DIV instruction takes 70 clock cycles to execute (divides one number by another)

If the processor had a clock speed of 8MHz, it could execute 2 million ADD instructions every second and ~114,000 DIV instructions every second
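The worked example above is just a division: instructions per second is the clock rate divided by the cycles each instruction takes. A quick sketch:

```python
# Instruction throughput at a given clock speed:
# instructions per second = clock speed (Hz) / clock cycles per instruction.

def instructions_per_second(clock_hz, cycles):
    return clock_hz // cycles

CLOCK = 8_000_000                            # 8 MHz, as in the MC68000 example
print(instructions_per_second(CLOCK, 4))     # ADD: 4 cycles  -> 2,000,000 per second
print(instructions_per_second(CLOCK, 70))    # DIV: 70 cycles -> 114,285 per second
```

The same arithmetic explains the AMD vs Intel point below: fewer cycles per instruction can matter more than a higher clock rate.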

 In general, complex instructions take longer to execute, and how the manufacturer implements those instructions is a large factor in processor speed. This is why AMD can create processors that run with a slower clock speed, say 2GHz, but still run applications as fast as an Intel Pentium processor running with a clock speed of 3.4GHz

Instruction complexity has led to 2 different general approaches to processor design:

CISC - Complex Instruction Set Computer

 Can execute many different instructions (used for general programs – large instruction set)

 Instructions can take several clock cycles to execute

 Uses very high clock speeds (GHz) to increase a processor design’s speed

 Uses large area of silicon to implement the large number of instructions

 Uses a large amount of energy

 Operates at high temperatures and hence needs lots of cooling, because of large surface area of silicon, clock speeds and voltages required

 Because of the large instruction set, most instructions are seldom used

 Uses advanced architectures (pipelines, lots of cache) to keep processing high

 Suitable for desktops and other machines with fixed power supplies

RISC - Reduced Instruction Set Computer

 Has a small set of instructions (specialist programming jobs)

 Not easy to program (at machine level) for general purpose applications.

 General applications often require a larger number of instructions than their CISC counterparts

 Can execute instructions in one clock cycle (i.e. very quickly)

 Uses lower clock speeds (100MHz upwards)

 Very low power, smaller silicon areas needed

 Very small heat loss, due to small surface areas, lower clock speeds and voltages

 Ideal for portable devices (phones, PDA’s etc...)