3.3.3 Computer Architecture
VON NEUMANN ARCHITECTURE (SISD) 1
FLYNN’S TAXONOMY 1
THE COMPONENTS OF THE CPU 2
CONTROL UNIT - CU 3
ARITHMETIC AND LOGIC UNIT - ALU 3
INCREMENTER 3
THE REGISTERS OF THE CPU 4
MEMORY ADDRESS REGISTER – MAR 4
MEMORY DATA REGISTER – MDR 4
PROGRAM COUNTER – PC 4
CURRENT INSTRUCTION REGISTER – CIR 4
STATUS REGISTER – SR 5
INTERRUPT LINES 5
GENERAL PURPOSE REGISTERS 5
THE FETCH EXECUTE CYCLE 6
FETCH EXECUTE AND INTERRUPTS 7
SPEEDING UP CPU OPERATION 9
CACHE MEMORY 9
PIPELINES/PIPELINING 10
CO-PROCESSORS 11
OTHER PROCESSOR ARCHITECTURES 12
MULTIPLE PROCESSORS (PARALLEL PROCESSING – MIMD) 12
ARRAY PROCESSORS (VECTOR PROCESSORS - SIMD) 13
ALGORITHMS TO TAKE ADVANTAGE OF NON-VON NEUMANN ARCHITECTURES 13
COMPUTER ARCHITECTURES - REMINDER 13
SISD 13
SIMD 13
MISD 13
MIMD 13
CISC AND RISC PROCESSORS 14
CLOCK SPEEDS 14
CISC - COMPLEX INSTRUCTION SET COMPUTER 15
RISC - REDUCED INSTRUCTION SET COMPUTER 15

Created by E Barwell Sept 2009, rev Sep 2011
Von Neumann architecture (SISD)
This is the term used to describe the classical stored-program CPU structure, which has the following components. (Alan Turing proposed a similar structure earlier, in the 1930s, but it was overlooked during the build-up to the war.)
A single memory, separate from the CPU, containing both program code and data
Program code and data stored in the same format
Single control unit to manage execution of instructions
Instructions executed in sequential manner (one after another)
There is a bottleneck caused by not being able to fetch instructions and perform data operations at the same time. This is known as a SISD architecture, as we have single instructions working on single pieces of data, one after the other.
Flynn’s Taxonomy
This is one of 4 basic architecture designs known collectively as “Flynn’s Taxonomy” – proposed by Michael Flynn in 1966.
SISD (Single Instruction stream / Single Data stream)
Classic Von Neumann: executes one instruction at a time, manipulating a single piece of data at a time. A single-core processor system.

SIMD (Single Instruction stream / Multiple Data streams)
The instruction stream is applied to each of the data streams. Instances of the same instruction can run in parallel on multiple cores, servicing different data streams. An array or vector processor processes multiple items of data with a single instruction (arrays and matrices).

MISD (Multiple Instruction streams / Single Data stream)
Multiple instruction streams can be applied in parallel to a single data source. Could be used for running decryption routines on a single data source.

MIMD (Multiple Instruction streams / Multiple Data streams)
On a multicore computer, each instruction stream runs on a separate processor with independent data. This is the current model for multicore personal computers and has been used for years on supercomputer systems.
The Components of the CPU
The CPU (Central Processing Unit) or processor is the heart of any computer system. Its only task is to repeatedly fetch machine instructions (machine code) from memory and execute them. External events (interrupts, which we will look at later) can alter the order in which instructions are executed. We program the CPU using machine code.
CPUs are categorised by the width (in bits) of their Data Bus, known as the word length.
CPU with 8 Bit Data Bus, called an 8 Bit CPU (e.g. Z80, Amstrad, 6502, C64, BBC)
CPU with 16 Bit Data Bus, called a 16 Bit CPU (e.g. 68000 – Atari ST, Amiga, MegaDrive)
CPU with 32 Bit Data Bus, called a 32 Bit CPU (e.g. 68040, Pentium)
CPU with 64 Bit Data Bus, called a 64 Bit CPU (e.g. AMD 64)
Below is a schematic of a standard CPU (shown inside the dotted line).
[Schematic: the CPU (dotted box) containing the Control Unit (CU), the ALU with its Accumulators, the Incrementer, and the registers – MAR, MDR, PC, CIR, SR and the general purpose registers r0, r1, r2 etc. The CPU is connected to RAM (Random Access Memory, the main store) by the one-way Address Bus, the two-way Data Bus and the Control Bus; the System Clock and the one-way Interrupt Lines also feed into the CPU.]
Control Unit - CU
Often referred to as “The Brain of the CPU”, it controls the operation of the CPU.
Receives timing information from an external clock (quoted in MHz/GHz)
In general the faster the clock the faster the CPU can execute instructions
Controls the Fetch – Execute cycle.
Generates signals along the Control Bus to order other parts of the CPU to do things.
These include signals to:
Initiate Memory Transfers (move data to and from the CPU)
Control other components of CPU such as ALU and Incrementer
Transfer data between internal components/registers of the CPU
Perform decoding of current instruction
Check Status Register for Interrupts, take action if present
Arithmetic and Logic Unit - ALU
Responsible for performing calculations (addition, subtraction, multiplication), logical operations (AND, OR, NOT) and comparisons (comparing one value with another)
When an arithmetic instruction is encountered, control is transferred to the ALU for execution
Uses its own registers known as Accumulators, which hold the results of calculations
The basic outcomes of calculations (is the result zero or negative?) are stored in the CCR (or Status Register)
Incrementer
Sole job is to increase the value of the Program Counter during the Fetch – Execute cycle.
Basically a separate (simple) ALU used just to add values to, or subtract values from, the PC
The Registers of the CPU
A register is a storage area inside the CPU. They are in general the same width as the Data Bus. They are dedicated storage areas that work at the same speed as the CPU (unlike RAM) and help the CU keep track of important values.
On the next few pages are descriptions of the role and functions of the standard ones.
Memory Address Register – MAR
Placing an address in the MAR selects that memory location
It is the same width (same number of bits) as the CPU’s Address Bus
Memory Data Register – MDR
Also known as Memory Buffer Register – MBR – you must remember this
Is connected directly to the Data Bus
It is the same width (same number of bits) as the Data Bus
When reading data from memory the following happens inside the CPU
1. The memory address of the data is placed in the MAR
2. The CU initiates a memory transfer from RAM to CPU (read cycle) using the Data Bus
3. The data arrives at the CPU and is stored in the MDR
When writing data to memory the following happens inside the CPU
1. The memory address where the data is to be stored is placed in the MAR
2. The data to be written to memory is placed in the MDR by the CU
3. The CU initiates a memory transfer from CPU to RAM (write cycle) using the Data Bus
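The read and write cycles above can be sketched in code. This is a minimal illustrative model, not a real CPU: the `Bus` and `CPU` classes and their method names are invented for the example.

```python
# A minimal sketch of the MAR/MDR read and write cycles described above.
# All names here are illustrative, not taken from a real CPU.

class Bus:
    """Stands in for RAM plus the Address and Data Buses."""
    def __init__(self, size=16):
        self.ram = [0] * size

class CPU:
    def __init__(self, bus):
        self.bus = bus
        self.mar = 0   # Memory Address Register: selects a RAM location
        self.mdr = 0   # Memory Data Register: holds data moving in or out

    def read(self, address):
        self.mar = address                 # 1. address placed in the MAR
        self.mdr = self.bus.ram[self.mar]  # 2-3. CU starts a read cycle; data lands in the MDR
        return self.mdr

    def write(self, address, value):
        self.mar = address                 # 1. address placed in the MAR
        self.mdr = value                   # 2. data placed in the MDR by the CU
        self.bus.ram[self.mar] = self.mdr  # 3. CU starts a write cycle

bus = Bus()
cpu = CPU(bus)
cpu.write(5, 42)
print(cpu.read(5))  # 42
```

Note that all traffic between CPU and RAM passes through the MAR/MDR pair; no other register touches memory directly.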
Program Counter – PC
Also known as Sequence Control Register – SCR – you must remember this.
Holds the memory address of the next instruction to execute
The PC is the same size as the MAR (and hence the Address Bus)
Current Instruction Register – CIR
Holds the machine code instruction the CPU is currently processing
When a new instruction is fetched from RAM it is transferred from the MDR to the CIR
Once the instruction is in the CIR the CU can decode and execute it
Holding the instruction in the CIR allows more data (to be used by the instruction) to arrive at the MDR without overwriting the current instruction
Status Register – SR
Keeps track of the basic results of operations performed by the ALU
Results are stored as a series of flags (bits), which can be interrogated individually
Bits are set (1) or cleared (0) depending upon the ALU results
Flags can be either true or false and thus be represented by a single Bit.
If the bit is clear/zero then the flag is false
If the bit is set/one then the flag is true
There are four common Flags/Bits inside the SR
Flag   Name      Purpose
Z bit  Zero      Set if the result of an instruction was zero
N bit  Negative  Set if the result of an instruction was negative
C bit  Carry     Set if the result of an instruction generated a carry
V bit  Overflow  Set if the result of an instruction was too large to hold in the Accumulator (the number needed more bits than the accumulator has)
The flags are used to make decisions
When comparing 2 values the ALU effectively subtracts them.
If they are equal the result would be Zero and the Z flag would be set.
If one was bigger than the other then the N flag would be set or cleared (depending on which was the bigger)
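The compare-by-subtraction idea above can be shown directly. This is a sketch of the flag logic only; the function name and return format are invented for the example.

```python
# Sketch of how the ALU sets the Z and N flags when comparing two
# values by subtracting them (illustrative, not a real instruction set).

def compare(a, b):
    result = a - b
    z_flag = 1 if result == 0 else 0   # Z set when the values are equal
    n_flag = 1 if result < 0 else 0    # N set when a is smaller than b
    return z_flag, n_flag

print(compare(7, 7))  # (1, 0) : equal, so Z is set
print(compare(3, 9))  # (0, 1) : 3 - 9 is negative, so N is set
print(compare(9, 3))  # (0, 0) : positive result, both clear
```

A conditional branch instruction then simply tests the relevant flag rather than the original values.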
Interrupt Lines
Signals from other devices (chips) or software that are requesting CPU resource (time)
Set various bits (depending on the priority level of the particular interrupt) in the Status Register
At the end of each Fetch-Execute cycle the Status Register is checked to determine if any interrupts occurred
The CPU will clear the relevant bit in the Status Register, save its state (to resume later) and look up the relevant interrupt vector (effectively the memory address of a program that knows what to do with a particular interrupt) to deal with the interrupt
General Purpose Registers
Can be used to store intermediate values without the need to access RAM
Old 8-bit CPUs only had one accumulator available, but modern CPUs provide lots (16+) of general-purpose registers that can be used for the same purpose
Helps speed up program execution as values can be held in registers rather than in memory. Ideal for loop variables
Faster because registers can be read/written at the same speed the CPU operates at, rather than the CPU having to wait for memory transfers to complete
The Fetch Execute cycle
This is the name given to the process of fetching machine instructions from memory, decoding them and executing them. The task of the CU is to generate all the signals necessary to perform this cycle, using among other things the Control Bus to co-ordinate the process.
There are 3 main steps in the Cycle.
Fetch, Decode, Execute
Here is an overview of the cycle briefly introducing the major registers of the CPU
1. The Program Counter (PC) holds the address of the next instruction to execute; this value is passed to the Memory Address Register (MAR), which makes the Address Bus select this location
2. Data is transferred from Main Store (RAM) to the Memory Data Register (MDR)
3. The value in the MDR is passed to the Current Instruction Register (CIR)
4. Once the instruction is held in the CIR, the operation code of the CIR is decoded so the control unit can determine how to execute it
5. Before the instruction is executed, the Incrementer is used to make the value of the PC point to the next instruction in memory
6. Data may be fetched from RAM prior to executing an instruction; this data will arrive at the MDR, overwriting the current instruction, which is why the instruction is copied to the CIR before attempting to execute it
7. The instruction is executed (once all data is available); this may involve several steps
8. The Status Register is checked to see if any interrupts (requests for attention) have occurred
The Fetch-Execute cycle is a repetitive task: after an instruction has been fetched, decoded and executed, we start the cycle again. Here it is again; you must remember this.

Step  Data from       CU action                                Data to
FETCH INSTRUCTION
1     PC              CPU transfer                             MAR
2     RAM (@ MAR)     Memory transfer (read)                   MDR
3     MDR             CPU transfer                             CIR
DECODE INSTRUCTION
4     CIR (opcode)    Decode                                   CU
5     PC              CPU transfer                             Incrementer
6     CIR (operand)   Possible memory transfers to get data    MDR
EXECUTE INSTRUCTION
7     (actions depend on instruction)
8     SR              Check for interrupts
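The eight steps can be followed in a toy simulator. This is a sketch only: the three-instruction machine code (LOAD/ADD/HALT) and the memory layout are invented for illustration.

```python
# A toy fetch-decode-execute loop following the eight steps above.
# The instruction set and memory contents are made up for the example.

ram = {
    0: ("LOAD", 10),   # load the value at address 10 into the accumulator
    1: ("ADD", 11),    # add the value at address 11 to the accumulator
    2: ("HALT", None),
    10: 5,
    11: 7,
}

pc, acc = 0, 0
running = True
while running:
    mar = pc                 # Step 1: PC -> MAR
    mdr = ram[mar]           # Step 2: memory read into the MDR
    cir = mdr                # Step 3: MDR -> CIR
    opcode, operand = cir    # Step 4: decode the opcode
    pc += 1                  # Step 5: the Incrementer advances the PC
    if opcode == "LOAD":     # Steps 6-7: fetch any data needed, then execute
        acc = ram[operand]
    elif opcode == "ADD":
        acc += ram[operand]
    elif opcode == "HALT":
        running = False
    # Step 8: a real CU would now check the SR for pending interrupts

print(acc)  # 12
```

Because the PC is incremented before execution (step 5), a HALT or jump at execution time always sees the address of the following instruction.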
Fetch Execute and interrupts
A low-level interrupt is a signal asking the CPU for attention. Interrupts are also known as exceptions; as a general rule of thumb, interrupts are hardware generated and exceptions are software generated.
Interrupts can be generated by,
Hardware inside the CPU when errors occur (such as illegal addresses, divide by zero, or accessing protected memory locations)
Timers such as the clock that trigger the CU
Hardware connected to the computer (such as disk drives, I/O channels, keyboards, mice, printers etc…)
An interrupt is signalled through hardware, which sets a bit in the SR. It is the job of the CU to check the SR before starting a new Fetch execute cycle, to see if an interrupt has occurred.
Interrupts alter the normal flow of the Fetch execute cycle. When one occurs the CU must act immediately changing the PC to point at the Interrupt handler for the interrupt that has occurred.
The CPU does not know how to deal with any particular interrupt; this is the job of a program called an Interrupt Handler.
Each Interrupt that can occur needs its own Interrupt handler.
The addresses of the Interrupt handlers are stored in a section of memory known as Interrupt Vectors.
These are allocated by the CPU manufacturer and often reside at the very beginning of memory.
There are 2 categories of interrupt
MI
Maskable interrupts can be disabled in hardware. The CU can be made to ignore them completely. MI are usually given priority levels. Each priority level can be disabled by clearing the relevant bit in the SR.
NMI
Non-maskable interrupts are of the highest priority and cannot be disabled. NMIs are critical to the function of the CPU. The clock controlling the CU is an example: if it were disabled, the CPU would cease to function.
WHEN AN INTERRUPT OCCURS THE CU MUST:
Place the contents of the PC onto the stack
Increase the Stack Pointer
Identify the Interrupt # that has occurred
Load the address of the interrupt handler stored at the appropriate interrupt vector
Place this address in the PC
Start a new Fetch Execute cycle
Example of code to handle an interrupt:
PUSH PC
JMP (0)
AN INTERRUPT HANDLER MUST:
1. Preserve (push onto the stack) the contents of any general purpose registers it wishes to use
2. Perform the actions required to handle the interrupt (if it was a mouse interrupt the handler would inform windows of the change in position and the status of the mouse buttons)
3. Restore (pop from the stack) general registers previously stored on stack
4. Issue a return from Exception instruction (RTE) to inform the CU that the interrupt handler has finished
Example interrupt handler structure:
PUSH GEN_REGISTERS
... perform instructions to deal with the interrupt ...
POP GEN_REGISTERS
RTE
WHEN AN INTERRUPT HAS FINISHED THE CU MUST:
1. Decrease the contents of the SP
2. Load the address stored on top of the stack (POP PC) and place it in the PC
3. Start a new Fetch Execute cycle
The Control Unit effectively executes this instruction:
POP PC
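The whole save/dispatch/restore sequence can be sketched in a few lines. This is a conceptual model only: the vector numbers, handler name and use of a Python function in place of a handler address are all invented for the example.

```python
# Sketch of the CU's interrupt dispatch: save the PC on the stack,
# jump via the vector table, run the handler, then restore the PC.
# Vector numbers and handler names are made up for illustration.

stack = []
pc = 100                 # address the main program was about to execute
handled = []

def keyboard_handler():
    # a real handler would also push/pop any registers it uses
    handled.append("keyboard")

vectors = {1: keyboard_handler}   # interrupt number -> handler "address"

def interrupt(number):
    global pc
    stack.append(pc)      # push the PC so execution can resume later
    pc = vectors[number]  # load the handler address from the vector
    pc()                  # run the handler (stands in for fetch-execute)
    pc = stack.pop()      # on RTE, pop the saved PC and carry on

interrupt(1)
print(handled, pc)  # ['keyboard'] 100
```

The key point the sketch shows is that the interrupted program resumes exactly where it left off, because its PC was preserved on the stack.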
Speeding up CPU operation
Von Neumann architecture has a fatal flaw: the Von Neumann bottleneck.
This is caused by the movement of data and instructions to and from memory. Memory is not as fast as the CPU, so the CPU ends up waiting for data to arrive or be stored before it can carry on executing.
Speed of conventional processing can be increased by
Increasing clock speed in MHz/GHz (the speed at which the Fetch-Execute cycle operates). The limit on this increase will be reached in the near future
Increasing Data Bus width, fetching more and more bytes of memory in a single fetch (typical sizes 128 Bit and 256 Bit wide). This causes massive physical design problems with circuit boards
These techniques alone are not going to be practical for much longer.
The following three methods are ways to alleviate the problem. Some argue that the Von Neumann approach is the problem and we need to look at radical changes in architecture to remove the bottleneck.
Cache memory
Do not confuse this with disk cache (i.e. the area of RAM used to temporarily store data that is being read from and written to the hard drives).
Is used to speed up general processing of CPU by storing copies (mirrors) of most frequently accessed memory locations
Sits between RAM and the CPU and runs very fast, much faster than normal RAM
Amount is limited from 256KB up to 2MB because it’s expensive in terms of Silicon area and heat generated
When reading a value from RAM
1. Look in cache to see if the memory location is mirrored there
2. If it is, read quickly from cache
3. If it isn't, read from slower RAM

[Diagram: CPU – CACHE – RAM]
When writing a value to RAM
1. Look in cache to see if the memory location is mirrored there
2. If it is, write the value to the cache AND write the value to RAM
3. If it isn't, write the value to RAM
You should be able to see from this that an extra step is performed when writing to cache memory; as the main memory must also be updated. This is a drawback of cache memory.
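The read and write rules above can be sketched as a tiny write-through cache. This is illustrative only: real caches work on whole blocks of memory, not single addresses, and the dictionaries here stand in for hardware.

```python
# Sketch of the cache rules above (a write-through cache):
# reads are served from the mirror when possible; writes always
# update RAM as well, which is the extra step the text mentions.

ram = {0: 10, 1: 20, 2: 30}
cache = {}                       # address -> mirrored value

def read(address):
    if address in cache:         # cache hit: fast
        return cache[address]
    value = ram[address]         # cache miss: slow read from RAM
    cache[address] = value       # keep a mirror for next time
    return value

def write(address, value):
    if address in cache:
        cache[address] = value   # update the mirror...
    ram[address] = value         # ...AND always update RAM

print(read(1))   # 20  (miss, fetched from RAM)
print(read(1))   # 20  (hit, served from cache)
write(1, 99)
print(read(1))   # 99  (cache and RAM stay consistent)
```

The design choice shown here (write-through) trades write speed for simplicity; it guarantees RAM is never stale.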
Pipelines/Pipelining
Attempt to make Fetch-Decode-Execute cycle more efficient by splitting the process into 3 separate parts that can operate concurrently (at the same time).
Used in conjunction with (as a bolt-on to) traditional sequential processing
In the standard fetch-execute cycle only 1/3 of processor time is actually spent executing instructions; the rest of the time is spent fetching and decoding instructions
Standard processor activity:
Step:  1      2       3        4      5       6        7      8
       Fetch  Decode  Execute  Fetch  Decode  Execute  Fetch  Decode
Using a pipeline, we perform these actions in parallel (overlap), so that the processor is always (as much as possible) executing instructions.
Pipelined processor activity:
Step:        1      2       3        4        5        6        7        8
Pipeline 1:  Fetch  Decode  Execute  Fetch    Decode   Execute  Fetch    Decode
Pipeline 2:         Fetch   Decode   Execute  Fetch    Decode   Execute  Fetch
Pipeline 3:                 Fetch    Decode   Execute  Fetch    Decode   Execute
From step 3 onwards the processor is always executing an instruction
This method gives us a theoretical speed increase of 300% over the conventional Fetch-Execute cycle
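The speed-up claim can be checked with simple step counting. This assumes an idealised three-stage pipeline with no branches; the function names are invented for the example.

```python
# Rough check of the pipeline speed-up claim: with a 3-stage pipeline,
# n instructions take n + 2 steps instead of 3n (assuming no branches).

def sequential_steps(n):
    return 3 * n      # fetch, decode, execute, one instruction at a time

def pipelined_steps(n):
    return n + 2      # 2 steps to fill the pipeline, then 1 per instruction

n = 100
print(sequential_steps(n))   # 300
print(pipelined_steps(n))    # 102
print(sequential_steps(n) / pipelined_steps(n))  # ~2.94, approaching 3x
```

The ratio only approaches three as n grows, which is why the threefold figure is described as theoretical.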
Conditional branch instructions cause problems.
In the above table, at any one time we have the next two instructions partly processed. If we hit a branch instruction, the instructions in the pipeline may not be executed; we may move the program counter to another part of memory and execute instructions there.
The problem can be solved using multiple pipelines. Load one pipeline with the instructions to be executed if the condition is true, and another with the instructions to be executed if the condition is false. As soon as the condition is known, the processor can switch to the pipeline containing the relevant instructions.

[Diagram: the Control Unit executing a conditional instruction, with Pipeline 1 containing the instructions if the condition is true and Pipeline 2 containing the instructions if the condition is false.]
Multiple pipelines and multiprocessors can be combined (I won’t explain how this may work, as it starts getting very complicated)
Co-processors
Co-processors are additional processors that can be added to a computer to help run the applications the computer has to deal with. They may provide access to specialist instructions additional to the instruction set offered by the CPU. They often take advantage of different architectures (see Other Processor architectures) and are used to take execution load from the main CPU so it can concentrate on logic and control of flow.
Floating point co-processors: used to be common when CPUs only had the ability to do integer arithmetic. This form is no longer common, as most processors have built-in support for floating point operations.
Digital Signal Processors (DSPs): provide very specialist support for the processing of signal data; the most common uses are sound and image manipulation. They have very small instruction sets and are in no way suitable for general programming – most people find them very difficult to program because of the small instruction set.
Audio cards are an example of DSP support used in modern computers often installed by means of an expansion card (or daughter board)
Graphics processors: provide very specialist graphics primitive rendering. They are often based around parallel architectures, making use of vector processing techniques as well. GPUs (Graphics Processing Units) are much more powerful in raw terms than the general CPU, but have limited instruction sets, making it difficult to use them in more general ways2.
Physics processors Do a similar job to graphics processors in that they are set up to handle real world collision and deformation jobs in real time.
Balance execution load Help improve execution time by balancing execution of instructions between the main CPU by providing a second CPU as a co-processor. This is not a common method anymore due to the advent of parallel computing architectures but was common on 80’s based systems and a lot of arcade hardware.
2 Although NVidia do have a special programming language that allows developers to make use of the GPU power for different jobs such as real-time physics.
Other Processor architectures
The traditional Von Neumann single-instruction sequential execution architecture is not the only way to construct CPUs. Here are the two other main approaches; as with Von Neumann, they are not perfect.
Multiple Processors (Parallel processing – MIMD)
Parallel processing is an approach used in computer design that moves away from the limitations of the traditional sequential processing technique (fetch-execute) and introduces the concept of concurrently executing tasks (running at the same time).
A computer may contain 2, 8 or as many as 65,000 CPUs
More than one task can be executed at the same time. Each task executes on an independent processor.
To achieve parallel processing, processors need to be able to communicate. This is achieved by communication channels, which are hard wired, dedicated connections between processors. Communication allows tasks to synchronise with each other so they can keep in step.
To be used effectively programs need to be broken into tasks that can be run concurrently (at the same time)
Access to shared memory can be a problem. Only one processor can read/write RAM at any time. On chip, cache (contains a copy of most accessed memory locations) can alleviate this problem; this can lead to integrity problems (similar to multi-user databases). Ideally, tasks should be organised so they are not processing the same data.
Suitable uses
Mathematically intensive analysis
Weather simulation
Fluid dynamics
Astronomical analysis –(spectral, image processing)
Array Processors (Vector processors - SIMD)
Specialist processors
Contain few instructions, so more silicon is available for each instruction, which allows brute force methods to be implemented (rather than algorithmic ones)
Apply single operations (Add, multiply etc…) to multiple pieces of data at the same time.
Often used with matrices or lists, which involve lots of mathematical actions
(Multiplying two 3 by 3 matrices involves 27 multiplies and 18 additions, this can be done in one operation on an array processor, but would require 45 instructions on a conventional processor)
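The SIMD idea of one operation touching many data items at once can be sketched in software. Pure Python stands in for the hardware lanes here; the function name is invented for the example.

```python
# Sketch of the SIMD idea: one "instruction" (vector ADD) applied to
# every element of the data streams at once. In hardware all lanes
# would be updated in a single operation.

def simd_add(xs, ys):
    # conceptually a single vector ADD across all lanes
    return [x + y for x, y in zip(xs, ys)]

a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
print(simd_add(a, b))  # [11, 22, 33, 44]
```

A conventional SISD processor would need a separate ADD instruction (plus loop overhead) for each of the four element pairs.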
Used for
3D graphics
Statistical analysis
Image processing
Algorithms to take advantage of non- Von Neumann architectures
Need to split a program or task into smaller tasks and determine which can run concurrently and which need to be sequential
Need to determine what synchronisation is necessary between these tasks and when it needs to happen
Any standard algorithms that use a divide and conquer approach are automatically suitable for parallel processing.
Quick sort, MPEG and JPEG compression, and fractal calculations are all suited to parallel processing
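The divide-and-conquer pattern can be sketched with Python's standard library, as a software stand-in for true multi-processor hardware. The task (summing a list) and the chunk sizes are chosen purely for illustration.

```python
# Sketch of splitting a task into sub-tasks that run concurrently,
# then combining the partial results (the divide-and-conquer pattern
# the text describes). Threads stand in for separate processors.

from concurrent.futures import ThreadPoolExecutor

def task_sum(chunk):
    return sum(chunk)

data = list(range(1, 101))
chunks = [data[i:i + 25] for i in range(0, 100, 25)]   # divide: 4 sub-tasks

with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(task_sum, chunks))        # run sub-tasks concurrently

print(sum(partials))  # 5050 : the combine (conquer) step
```

Note that no synchronisation is needed beyond waiting for all sub-tasks to finish, because each sub-task works on its own slice of the data, exactly the organisation the text recommends.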
Computer architectures - reminder
SISD Single Instruction Single Data stream, Serial processing
SIMD Single Instruction Multiple Data stream – Array/Vector processors
MISD Multiple Instruction Single Data stream – theoretical as data will change as each instruction uses it
MIMD Multiple Instruction Multiple Data stream – Parallel processes involved
CISC and RISC processors
Clock speeds
In simple terms, the clock speed of a processor (given in GHz or MHz) determines the speed at which the control unit operates. It doesn’t directly relate to the speed at which processors can execute instructions.
Each action the control unit performs takes up one clock cycle (or tick of the clock)
Different instructions take different numbers of clock cycles to execute.
Here’s an example to illustrate this based on the MC68000 processor found in the Sega MegaDrive (as well as other machines)
The ADD instruction takes 4 clock cycles to execute (it adds two integers)
The DIV instruction takes 70 clock cycles to execute (divides one number by another)
If the processor had a clock speed of 8MHz, it could execute 2 million ADD instructions every second but only ~114,000 DIV instructions every second
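The figures above follow from a single division: instructions per second is clock speed divided by cycles per instruction. A quick check:

```python
# Checking the MC68000 figures above: instructions per second is
# clock speed (cycles per second) divided by cycles per instruction.

clock_hz = 8_000_000    # 8 MHz
add_cycles = 4          # ADD takes 4 clock cycles
div_cycles = 70         # DIV takes 70 clock cycles

print(clock_hz // add_cycles)  # 2000000 ADDs per second
print(clock_hz // div_cycles)  # 114285 DIVs per second
```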
In general, complex instructions take longer to execute, and how the manufacturer implements those instructions is a large factor in processor speed. This is why AMD can create processors that run at a slower clock speed, say 2GHz, but still run applications as fast as an Intel Pentium processor running at 3.4GHz
Instruction complexity has led to 2 different general approaches to processor design
CISC - Complex Instruction Set Computer
Can execute many different instructions (used for general programs – large instruction set)
Instructions can take several clock cycles to execute
Uses very high clock speeds (GHz) to increase a processor design’s speed
Uses large area of silicon to implement the large number of instructions
Uses a large amount of energy
Operates at high temperatures and hence needs lots of cooling, because of large surface area of silicon, clock speeds and voltages required
Because of the large instruction set, most instructions are seldom used
Uses advanced architectures (pipelines, lots of cache) to keep processing high
Suitable for desktops and other machines with fixed power supplies
RISC - Reduced Instruction Set Computer
Has a small set of instructions (specialist programming jobs)
Not easy to program (at machine level) for general purpose applications.
General applications often require a larger number of instructions than their CISC counterparts
Can execute instructions in one clock cycle (i.e. very quickly)
Uses lower clock speeds (100MHz upwards)
Very low power, smaller silicon areas needed
Very small heat loss, due to small surface areas, lower clock speeds and voltages
Ideal for portable devices (phones, PDAs etc.)